The Role of Graphics in EDA

NIST/SEMATECH Section 1.1.5-1.1.6 The Role of Graphics / An EDA/Graphics Example

Why Graphics Are Central to EDA

The NIST/SEMATECH Engineering Statistics Handbook devotes Section 1.1.5 to the role of graphics because visualization is not merely helpful in EDA — it is foundational. Summary statistics such as the mean and standard deviation compress an entire dataset into one or two numbers. That compression is useful for communication, but it inevitably discards information. A graphical display preserves the richness of the data: its shape, its outliers, its clusters, and its gaps.
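To make the information loss concrete, here is a minimal sketch using two made-up samples. The values are hypothetical, chosen for illustration, and do not come from the Handbook. Both samples have the same mean and the same population standard deviation, yet a crude text dot display shows that one has a single peak and the other has two separated clusters.

```python
from statistics import mean, pstdev

# Hypothetical samples, chosen so that mean and standard deviation coincide.
single_peak = [2, 4, 4, 4, 5, 5, 7, 9]
two_clusters = [3, 3, 3, 3, 7, 7, 7, 7]


def dot_display(values):
    """Return one text row per integer value, with one '*' per observation."""
    return [
        f"{v:2d} | " + "*" * values.count(v)
        for v in range(min(values), max(values) + 1)
    ]


for name, values in [("single_peak", single_peak), ("two_clusters", two_clusters)]:
    # Identical numeric summaries for both samples.
    print(f"{name}: mean={mean(values)}, sd={pstdev(values)}")
    # The display is what exposes the difference in shape.
    print("\n".join(dot_display(values)))
```

The two summary lines print the same numbers, while the dot displays differ visibly, which is the point of the paragraph above.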

The human visual system is extraordinarily good at detecting patterns. We perceive trends, clusters, gaps, and anomalies in a scatter plot almost instantaneously — long before a formal test could be specified. EDA leverages this perceptual strength by making graphics the primary investigative tool rather than an afterthought. As Section 1.1.4 of the Handbook states:

“Graphics are irreplaceable — there are no quantitative analogues that will give the same insight as well-chosen graphics.”

Quantitative vs. Graphical Procedures

Section 1.1.5 splits statistics and data analysis procedures broadly into two parts:

Quantitative techniques yield numeric or tabular output. Examples include:

Hypothesis testing
Analysis of variance
Point estimates and confidence intervals
Least squares regression

Graphical techniques provide visual insight into data structure. Examples include:

Scatter plots
Histograms
Probability plots
Residual plots
Box plots
Block plots

Both categories are valuable, but EDA relies heavily on the graphical side. The Handbook is emphatic: “Graphical procedures are not just tools that we could use in an EDA context, they are tools that we must use.”

The Seven Roles of Graphics

Section 1.1.5 identifies graphical tools as the shortest path to gaining insight into a dataset. Specifically, graphics serve seven purposes:

Testing assumptions
Model selection
Model validation
Estimator selection
Relationship identification
Factor effect determination
Outlier detection

The Handbook warns that skipping graphics carries a real cost:

“If one is not using statistical graphics, then one is forfeiting insight into one or more aspects of the underlying structure of the data.”

Anscombe’s Quartet: The Case for Visualization (Section 1.1.6)

Section 1.1.6 provides the classic demonstration of why graphics are indispensable. Anscombe (1973) constructed four datasets that share nearly identical summary statistics:

N = 11
Mean of X = 9.0
Mean of Y = 7.5
Intercept = 3
Slope = 0.5
Residual standard deviation = 1.237 (1.236 for data sets 3 and 4)
Correlation = 0.816 (0.817 for data set 4)
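The figures in this list can be checked with a short script. The sketch below is one way to do it, assuming the standard published values of Anscombe's quartet, which this section does not reproduce. The function name `summarize` and the dictionary `QUARTET` are illustrative names, not anything defined by the Handbook. The residual standard deviation is computed with n - 2 degrees of freedom, the usual convention for a straight-line fit.

```python
from math import sqrt

# Standard published Anscombe (1973) values; data sets 1-3 share the same x.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    1: (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Least squares line and summary statistics for one (x, y) data set."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals about the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return {
        "n": n,
        "mean_x": mean_x,
        "mean_y": mean_y,
        "intercept": intercept,
        "slope": slope,
        "residual_sd": sqrt(sse / (n - 2)),
        "correlation": sxy / sqrt(sxx * syy),
    }


for label, (xs, ys) in QUARTET.items():
    stats = summarize(xs, ys)
    print(label, {key: round(value, 3) for key, value in stats.items()})
```

Each printed row should agree with the list above to the precision shown there.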

Yet the scatter plots of these four datasets are dramatically different:

Figure: Anscombe's Quartet (1973). All four datasets share nearly identical summary statistics (mean, variance, correlation, regression line), yet their scatter plots reveal fundamentally different structures. Dashed line: y = 3 + 0.5x.

Data set 1 is clearly linear with some scatter.
Data set 2 is clearly quadratic.
Data set 3 clearly has an outlier.
Data set 4 is obviously the victim of a poor experimental design, with a single point far removed from the bulk of the data “wagging the dog.”

None of the summary statistics listed above distinguishes the four datasets; the scatter plots do so at a glance.
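For readers who want to see the four shapes without a plotting library, the following is a rough sketch that draws each data set as a coarse text scatter plot. It assumes the standard published Anscombe values, which this section does not list. The helper `text_scatter` and its grid size and axis ranges are arbitrary illustrative choices. A real analysis would normally use a proper plotting tool instead.

```python
# Standard published Anscombe (1973) values; data sets 1-3 share the same x.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def text_scatter(xs, ys, width=32, height=10, x_range=(0, 20), y_range=(2, 14)):
    """Return a list of text rows; '*' marks each cell holding at least one point.

    The same axis ranges are used for every data set so the panels are comparable.
    """
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y_hi - y) / (y_hi - y_lo) * (height - 1))  # row 0 is the top
        grid[row][col] = "*"
    return ["|" + "".join(cells) for cells in grid] + ["+" + "-" * width]


panels = [(1, x123, y1), (2, x123, y2), (3, x123, y3), (4, x4, y4)]
for label, xs, ys in panels:
    print(f"Data set {label}")
    print("\n".join(text_scatter(xs, ys)))
    print()
```

Even at this resolution, the arc in data set 2, the stray point in data set 3, and the single vertical stack plus one distant point in data set 4 should be visible.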

Focus vs. Filter: Why Statistics Alone Mislead

Section 1.1.6 explains why the quantitative approach is incomplete on its own. Quantitative statistics are numeric summaries that do a good job of focusing on a particular aspect of the data (location, intercept, slope, degree of relatedness) by reducing the data to a few numbers. But in doing so, they also filter the data, necessarily omitting and screening out other sometimes crucial information.

“Quantitative statistics focus but also filter; and filtering is exactly what makes the quantitative approach incomplete at best and misleading at worst.”

Anscombe’s Quartet illustrates this perfectly. The estimated intercepts (= 3) and slopes (= 0.5) agree to the reported precision across all four datasets, yet the scatter plots above show that a linear model is appropriate only for data set 1. For data sets 2, 3, and 4 the linearity assumption is the “fatal flaw” in the analysis. The EDA approach of deliberately postponing model selection until further along in the analysis avoids this trap.
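One way to see what the slope and intercept filter out is to look at what remains after the line is removed. The sketch below is an illustration under stated assumptions: it uses the standard published values for data sets 1 and 2 (not listed in this section), takes the nominal line y = 3 + 0.5x from the text, and prints the sign of each residual in order of increasing x as a crude text stand-in for a residual plot. The helper names are hypothetical.

```python
# Standard published Anscombe (1973) values for data sets 1 and 2 (shared x).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def residuals_in_x_order(xs, ys, intercept=3.0, slope=0.5):
    """Residuals about the nominal line y = intercept + slope * x, sorted by x."""
    return [y - (intercept + slope * x) for x, y in sorted(zip(xs, ys))]


def sign_string(residuals):
    """One character per residual: '+' above the line, '-' below it."""
    return "".join("+" if r > 0 else "-" for r in residuals)


def count_runs(signs):
    """Number of maximal blocks of identical consecutive signs."""
    return 1 + sum(a != b for a, b in zip(signs, signs[1:]))


for label, ys in [(1, y1), (2, y2)]:
    signs = sign_string(residuals_in_x_order(x, ys))
    print(f"Data set {label}: {signs}  runs={count_runs(signs)}")
```

For data set 1 the signs alternate irregularly, as expected for scatter about a straight line. For data set 2 they form one long positive block flanked by negative ends, the signature of curvature that the fitted slope and intercept cannot report.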

Graphics vs. Quantitative Summaries

Graphical and quantitative methods serve different but complementary roles. Quantitative summaries provide precise numerical characterizations that are easy to compare across datasets. Graphics provide the context that those numbers lack: the overall shape of the distribution, the presence of multiple modes, and the nature of departures from an assumed model.

In practice, EDA alternates between graphical exploration and quantitative confirmation. A histogram may suggest that the data are right-skewed; a skewness statistic quantifies how much. A normal probability plot may show curvature in the tails; an Anderson-Darling test provides a formal p-value. Neither approach is complete without the other, but the graphical step must come first.
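The histogram-then-skewness sequence can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the sample is synthetic (drawn from an exponential distribution with a fixed seed), the helpers `text_histogram` and `skewness` are hypothetical names, and the skewness used is the simple moment-based population form.

```python
import random
from statistics import mean, pstdev


def text_histogram(values, bins=8, width=40):
    """Return one text row per equal-width bin, labelled by the bin's lower edge."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # avoid a zero bin width if all values are equal
    counts = [0] * bins
    for v in values:
        index = min(int((v - lo) / step), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    scale = width / max(counts)
    return [
        f"{lo + i * step:6.2f} | " + "#" * round(c * scale)
        for i, c in enumerate(counts)
    ]


def skewness(values):
    """Moment-based population skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


# Synthetic right-skewed sample (assumption: exponential, rate 1, fixed seed).
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(500)]

# Graphical step first: look at the shape.
print("\n".join(text_histogram(sample)))
# Quantitative step second: put a number on the asymmetry the display suggests.
print(f"skewness = {skewness(sample):.2f}")
```

The display suggests a long right tail, and a positive skewness value then quantifies it, mirroring the order of steps described above.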

Cross-References

Scatter Plot — the workhorse bivariate display
Histogram — univariate distribution display
Box Plot — compact distribution summary with outlier flagging
Normal Probability Plot — distributional assessment tool
Block Plot — graphical technique for factor effects
4-Plot — multi-panel diagnostic for rapid assumption testing
6-Plot — extended diagnostic display
What is EDA? — the philosophy behind the graphical approach