What is EDA?
NIST/SEMATECH Sections 1.1.1-1.1.4, EDA Introduction
What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is an approach and philosophy for data analysis that employs a variety of techniques (mostly graphical) to:
- Maximize insight into a data set
- Uncover underlying structure
- Extract important variables
- Detect outliers and anomalies
- Test underlying assumptions
- Develop parsimonious models
- Determine optimal factor settings
The approach is precisely that: not a set of techniques, but an attitude and philosophy about how a data analysis should be carried out (Section 1.1.1).
EDA is not identical to statistical graphics, although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques — all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue: it is a philosophy as to how we dissect a data set, what we look for, how we look, and how we interpret. EDA heavily uses the collection of techniques we call “statistical graphics,” but it is not identical to statistical graphics per se (Section 1.1.1).
The seminal work in EDA is Exploratory Data Analysis, Tukey (1977). Over the years it has benefited from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977); Interactive Data Analysis, Hoaglin (1977); and The ABC’s of EDA, Velleman and Hoaglin (1981). The approach is described in detail in the NIST/SEMATECH Engineering Statistics Handbook as a complement to classical and Bayesian methods.
How EDA Differs from Classical and Bayesian Analysis
Three popular data analysis approaches exist: Classical, Exploratory (EDA), and Bayesian. All three start with a general science or engineering problem and yield science or engineering conclusions. The difference is the sequence and focus of the intermediate steps (Section 1.1.2).
The three canonical analysis sequences are:
- Classical: Problem => Data => Model => Analysis => Conclusions
- EDA: Problem => Data => Analysis => Model => Conclusions
- Bayesian: Problem => Data => Model => Prior Distribution => Analysis => Conclusions
The key difference is how each approach treats the underlying model for the data (Section 1.1.2):
- Classical: data collection is followed by the imposition of a model (normality, linearity, etc.), and the estimation and testing that follow focus on the parameters of that model.
- EDA: data collection is followed immediately by analysis, with the goal of inferring what model would be appropriate; no model is imposed up front.
- Bayesian: the analyst incorporates scientific and engineering knowledge by imposing a data-independent prior distribution on the parameters of the selected model; the analysis formally combines that prior with the collected data to make inferences and/or test assumptions about the model parameters.
In the real world, data analysts freely mix elements of all three approaches. The distinctions above are made to emphasize the major differences among them.
When to Use Each Approach
In practice, these paradigms are complementary rather than competing:
- EDA first — to understand the shape, spread, and quirks of your dataset before committing to a model.
- Classical analysis — when you have strong theoretical justification for a specific model and need formal inference (confidence intervals, hypothesis tests).
- Bayesian analysis — when informative prior knowledge is available and you want to formally combine that prior with the collected data.
A robust analysis workflow typically begins with EDA, uses its findings to inform model selection, and then applies classical or Bayesian methods for confirmatory analysis.
EDA vs. Summary Analysis
Summary analysis is simply a numeric reduction of a historical data set. It is quite passive. Its focus is in the past. Quite commonly, its purpose is to simply arrive at a few key statistics (for example, mean and standard deviation) which may then either replace the data set or be added to it in the form of a summary table (Section 1.1.3).
In contrast, EDA has as its broadest goal the desire to gain insight into the engineering or scientific process behind the data. Whereas summary statistics are passive and historical, EDA is active and futuristic. In an attempt to understand the process and improve it in the future, EDA uses the data as a “window” to peer into the heart of the process that generated the data. There is an archival role for summary statistics, but there is an enormously larger role for the EDA approach (Section 1.1.3).
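The contrast can be made concrete with a minimal sketch in standard-library Python (the measurement values are hypothetical): summary analysis reduces the whole data set to a few key statistics, such as the mean and standard deviation, which could then replace the data in a summary table.

```python
import statistics

# A small historical data set (hypothetical measurements).
data = [9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2, 10.0]

# Summary analysis: a purely numeric reduction of the data set
# to a few key statistics.
summary = {
    "n": len(data),
    "mean": statistics.mean(data),
    "stdev": statistics.stdev(data),
}
print(summary)
```

EDA would instead keep the full series and plot it, using the data as a window into the process that generated it rather than discarding it after the reduction.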
EDA Goals
The primary goal of EDA is to maximize the analyst’s insight into a data set and into its underlying structure, while providing all of the specific items that an analyst would want to extract from a data set (Section 1.1.4), such as:
- A good-fitting, parsimonious model
- A list of outliers
- A sense of robustness of conclusions
- Estimates for parameters
- Uncertainties for those estimates
- A ranked list of important factors
- Conclusions as to whether individual factors are statistically significant
- Optimal settings
These items serve as the specific targets of an analysis, but the real insight and “feel” for a data set comes as the analyst judiciously probes and explores the various subtleties of the data. That feel comes almost exclusively from the application of various graphical techniques, which serve as the window into the essence of the data. Graphics are irreplaceable — there are no quantitative analogues that will give the same insight as well-chosen graphics (Section 1.1.4).
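One way to see why graphics are irreplaceable is to construct two series whose summary statistics agree exactly but whose structure differs completely, in the spirit of Anscombe-style examples (the data below are illustrative, not from the handbook):

```python
import statistics

# Two hypothetical series: both are permutations of 0..10, so their
# means and standard deviations are identical.
linear = [float(i) for i in range(11)]  # 0, 1, ..., 10: a steady trend
shuffled = [5.0, 0.0, 10.0, 2.0, 8.0, 1.0, 9.0, 3.0, 7.0, 4.0, 6.0]

for name, xs in [("linear", linear), ("shuffled", shuffled)]:
    print(name, statistics.mean(xs), statistics.stdev(xs))

# The numeric summaries cannot distinguish the two; only a graphic
# such as a run sequence plot reveals that one series trends smoothly
# while the other jumps erratically.
```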
The EDA Philosophy
Several principles distinguish the EDA mindset:
- Openness to the unexpected. The analyst does not know in advance what the data will reveal. EDA embraces surprise rather than filtering it out.
- Skepticism of assumptions. Every standard statistical procedure carries assumptions — normality, independence, constant variance. EDA provides tools to test those assumptions before they are relied upon (see Underlying Assumptions).
- Emphasis on visualization. Numerical summaries compress data into a few numbers; graphics preserve the richness that those summaries discard (see Role of Graphics).
- Role of judgment. EDA is as much an art as a science. The analyst’s domain knowledge, experience, and intuition are essential for interpreting what the plots and summaries reveal.
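As one illustration of testing an assumption before relying on it, the sketch below checks randomness with a lag-1 autocorrelation coefficient. This is a hypothetical standard-library helper, not the handbook's prescribed procedure; values near 0 are consistent with randomness, while values near 1 suggest drift or dependence.

```python
import random
import statistics

def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a series: near 0 suggests
    randomness; large values cast doubt on the randomness assumption."""
    mean = statistics.fmean(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den

random.seed(0)
# Series consistent with the randomness assumption.
white_noise = [random.gauss(0, 1) for _ in range(500)]
# Series with an upward drift, violating the assumption.
drifting = [i / 100 + random.gauss(0, 0.1) for i in range(500)]

print(lag1_autocorrelation(white_noise))  # small magnitude
print(lag1_autocorrelation(drifting))     # close to 1
```

A classical test applied to the drifting series without this check would silently violate its own assumptions, which is exactly the failure mode EDA guards against.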
Getting Started with EDA
The primary tools of EDA are graphical — histograms, scatter plots, box plots, and probability plots — supported by lightweight quantitative summaries such as measures of location and measures of scale.
A practical starting point for any new dataset is the 4-Plot, which simultaneously checks location stability, variation stability, randomness, and distributional shape. From there, deeper investigation can branch into specific graphical techniques such as the run sequence plot for time-dependent patterns or the normal probability plot for distributional assessment.
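A rough sketch of the 4-Plot's four panels, assuming NumPy and Matplotlib are available (the simulated measurement series and output file name are illustrative):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
from statistics import NormalDist

rng = np.random.default_rng(0)
y = rng.normal(10.0, 1.0, 200)  # hypothetical measurement series

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# 1. Run sequence plot: checks location and variation stability over time.
axes[0, 0].plot(y, marker=".", linestyle="-")
axes[0, 0].set_title("Run sequence plot")

# 2. Lag plot (y[i] vs y[i-1]): structure here indicates non-randomness.
axes[0, 1].scatter(y[:-1], y[1:], s=8)
axes[0, 1].set_title("Lag plot")

# 3. Histogram: checks distributional shape.
axes[1, 0].hist(y, bins=20)
axes[1, 0].set_title("Histogram")

# 4. Normal probability plot: ordered data against theoretical
#    normal quantiles; an approximately straight line supports normality.
n = len(y)
theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
axes[1, 1].scatter(theoretical, np.sort(y), s=8)
axes[1, 1].set_title("Normal probability plot")

fig.tight_layout()
fig.savefig("four_plot.png")
```

Together the four panels cover the checks named above: location stability, variation stability, randomness, and distributional shape.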
Cross-References
- Histogram — the most common graphical tool in EDA
- Measures of Location — fundamental quantitative summary
- 4-Plot — universal first-pass diagnostic
- The Role of Graphics — why visualization is central to EDA
- Underlying Assumptions — the four standard assumptions that EDA helps verify