The 4-Plot for Assumption Testing
NIST/SEMATECH Section 1.3.3.32 4-Plot, 1.2.4 Interpretation of 4-Plot
A Simple, Efficient, and Powerful Diagnostic
The 4-plot is a collection of four specific EDA graphical techniques whose purpose is to test the assumptions that underlie most measurement processes. As described in Sections 1.3.3.32 and 1.2.4 of the NIST/SEMATECH Engineering Statistics Handbook, these four techniques are simple, efficient, and powerful for the routine testing of underlying assumptions.
There are four assumptions that typically underlie all measurement processes; namely, that the data from the process at hand “behave like”:
- Random drawings
- From a fixed distribution
- With that distribution having a fixed location
- With that distribution having fixed variation
If all four assumptions hold, then the four plots will have a characteristic appearance; if any of the underlying assumptions fails to hold, the failure will be revealed by an anomalous appearance in one or more of the plots. Rather than checking assumptions one at a time with separate tools, the analyst can assess all four simultaneously and see how they interact. A dataset that passes all four panels of the 4-plot is ready for standard statistical analysis; a dataset that fails one or more panels requires investigation before proceeding.
Applicability to Residuals
Although the 4-plot has an obvious use for univariate and time series data, its usefulness extends far beyond that. Many statistical models of the form $Y_i = f(X_1, \ldots, X_k) + E_i$ have the same underlying assumptions for the error term $E_i$. That is, no matter how complicated the functional fit $f$, the assumptions on the underlying error term are still the same. The 4-plot can and should be routinely applied to the residuals when fitting models, regardless of whether the model is simple or complicated.
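As a minimal sketch of that workflow, assuming the simplest possible model (a straight-line fit with one predictor), the residuals can be computed as below and then examined with the same four plots. The function name `linear_fit_residuals` and the sample data are illustrative only and do not come from the handbook.

```python
from statistics import mean


def linear_fit_residuals(x, y):
    """Fit y = a + b*x by least squares and return (a, b, residuals)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Residual = observed response minus fitted response.
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return a, b, residuals


# Hypothetical data: a straight line plus small alternating disturbances.
x = list(range(20))
y = [1.0 + 2.0 * xi + (0.3 if xi % 2 else -0.3) for xi in x]
a, b, residuals = linear_fit_residuals(x, y)
```

It is the `residuals` list, not the raw `y` values, that the four panels should then examine.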
The Four Panels
The 4-plot arranges four diagnostic graphics in a fixed layout. Each panel targets a specific assumption.
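The layout can be sketched in Python as follows. This is an illustration under stated assumptions rather than the handbook's implementation: the function names `four_plot_data` and `draw_four_plot` are hypothetical, the normal quantiles use Blom's plotting positions as an approximation, and the drawing step assumes matplotlib is installed.

```python
import random
from statistics import NormalDist


def four_plot_data(y, bins=10):
    """Return the coordinates behind each of the four panels."""
    n = len(y)
    # Panel 1: run sequence plot, value against 1-based collection order.
    run_sequence = list(zip(range(1, n + 1), y))
    # Panel 2: lag plot, pairs (Y[i-1], Y[i]).
    lag = list(zip(y[:-1], y[1:]))
    # Panel 3: histogram counts over equal-width bins.
    lo, hi = min(y), max(y)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in y:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    # Panel 4: ordered data against approximate normal quantiles (assumption:
    # Blom plotting positions (i - 0.375) / (n + 0.25)).
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]
    probability = list(zip(quantiles, sorted(y)))
    return {"run_sequence": run_sequence, "lag": lag,
            "histogram": counts, "probability": probability}


def draw_four_plot(y, bins=10):
    """Draw the 2x2 layout; matplotlib is imported only when this is called."""
    import matplotlib.pyplot as plt

    d = four_plot_data(y, bins)
    fig, ax = plt.subplots(2, 2, figsize=(8, 6))
    ax[0, 0].plot(*zip(*d["run_sequence"]))
    ax[0, 0].set_title("Run sequence plot")
    ax[0, 1].scatter(*zip(*d["lag"]), s=8)
    ax[0, 1].set_title("Lag plot")
    ax[1, 0].hist(y, bins=bins)
    ax[1, 0].set_title("Histogram")
    ax[1, 1].scatter(*zip(*d["probability"]), s=8)
    ax[1, 1].set_title("Normal probability plot")
    fig.tight_layout()
    return fig


# Hypothetical sample: 200 independent normal draws.
rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
panels = four_plot_data(sample)
```

Calling `draw_four_plot(sample)` would render the four panels in the positions described in this section. Keeping the data step separate from the drawing step makes the panel contents easy to inspect without a plotting backend.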
Panel 1: Run Sequence Plot (Upper Left)
The run sequence plot displays the data in the order they were collected, with the observation index on the horizontal axis and the measured value on the vertical axis.
What to look for:
- Fixed location: If the fixed location assumption holds, the run sequence plot will be flat and non-drifting. Visible trends, step changes, or oscillations indicate that the location is shifting over time.
- Fixed variation: If the fixed variation assumption holds, the vertical spread in the run sequence plot will be approximately the same over the entire horizontal axis. A funnel shape (widening or narrowing) suggests changing variance.
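A rough numeric companion to the visual check, offered as a sketch only: fit a least-squares slope against the observation index to look for drift, and compare the mean and standard deviation across consecutive segments to look for shifts in location or spread. The helper names and the choice of four segments are assumptions made for illustration.

```python
from statistics import mean, stdev


def drift_slope(y):
    """Least-squares slope of the values against their 0-based index."""
    n = len(y)
    x_bar = (n - 1) / 2
    y_bar = mean(y)
    sxy = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(y))
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    return sxy / sxx


def segment_summaries(y, segments=4):
    """Split the series into consecutive segments; return (mean, sd) of each.

    Each segment needs at least two observations for the standard deviation.
    """
    n = len(y)
    bounds = [round(k * n / segments) for k in range(segments + 1)]
    chunks = [y[a:b] for a, b in zip(bounds[:-1], bounds[1:])]
    return [(mean(c), stdev(c)) for c in chunks]


# Hypothetical series: one with a steady drift, one with widening spread.
drifting = [0.5 * i for i in range(40)]
funnel = [(-1) ** i * (1 + i / 10) for i in range(40)]

slope = drift_slope(drifting)
funnel_sds = [sd for _, sd in segment_summaries(funnel)]
```

A slope near zero and similar segment summaries are consistent with fixed location and fixed variation. Steadily rising segment standard deviations, as in `funnel_sds`, correspond to the funnel shape described above.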
Panel 2: Lag Plot (Upper Right)
The lag plot graphs each observation against the immediately preceding observation ($Y_i$ versus $Y_{i-1}$). If the data are truly random and independent, the lag plot will be structureless and random.
What to look for:
- Randomness: A structureless, blob-shaped scatter with no apparent pattern confirms independence. Clustering along the diagonal indicates positive autocorrelation and suggests an autoregressive model may be appropriate. A tight elliptical pattern indicates a sinusoidal (single-cycle) model. The lag plot is also valuable for outlier detection — points that lie far from the main structure warrant investigation.
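The visual impression from the lag plot can be summarized by the lag-1 autocorrelation coefficient. The sketch below uses the standard sample definition; the band of roughly $\pm 2/\sqrt{N}$ is a common rule of thumb for white noise and is an assumption added here, as are the simulated series.

```python
import math
import random
from statistics import mean


def lag_pairs(y):
    """Pairs (Y[i-1], Y[i]) that make up the lag plot."""
    return list(zip(y[:-1], y[1:]))


def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation: lag-1 autocovariance over the variance."""
    m = mean(y)
    num = sum((a - m) * (b - m) for a, b in lag_pairs(y))
    den = sum((v - m) ** 2 for v in y)
    return num / den


# Hypothetical series: white noise and a first-order autoregressive process.
rng = random.Random(1)
white = [rng.gauss(0.0, 1.0) for _ in range(500)]
ar1 = [0.0]
for _ in range(499):
    ar1.append(0.8 * ar1[-1] + rng.gauss(0.0, 1.0))

band = 2 / math.sqrt(500)  # rule-of-thumb band for white noise
r_white = lag1_autocorrelation(white)
r_ar1 = lag1_autocorrelation(ar1)
```

A coefficient well outside the band, as expected for the autoregressive series, matches a lag plot whose points cluster along the diagonal.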
Panel 3: Histogram (Lower Left)
The histogram bins the data and displays the frequency distribution as a bar chart. It reveals the overall shape of the data’s distribution without imposing any assumed model.
What to look for:
- Distribution shape: Is the distribution symmetric and bell-shaped (suggesting normality), skewed to one side, bimodal (two peaks), or uniform? The presence of multiple modes may indicate that the data come from a mixture of populations.
- Outliers: Isolated bars far from the main mass of data point to potential outliers that warrant investigation.
Panel 4: Normal Probability Plot (Lower Right)
The normal probability plot is the most sensitive of the four panels for assessing whether the data follow a normal (Gaussian) distribution. It plots the ordered data values against the theoretical quantiles of the standard normal distribution.
What to look for:
- Normality: If the data are normally distributed, the points will fall approximately on a straight line. Both short-tailed and long-tailed departures from normality produce an S-like pattern in the middle of the plot. The distinguishing factor is the direction of departure of the first few and last few points from the fitted line. For long (heavy) tails, the first few points show increasing departure below the line and the last few points show increasing departure above the line. For short (light) tails, this pattern is reversed — the first few points depart above the line and the last few points depart below the line.
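The tail rule above can be checked numerically. In this sketch a line is fitted to the middle half of the probability plot and the residuals of the smallest and largest points are inspected; fitting to the middle half, the Blom plotting positions, and the function names are all assumptions made for illustration. The two test samples are built from exact quantiles of a uniform and a Cauchy distribution so that the outcome does not depend on random draws.

```python
import math
from statistics import NormalDist, mean


def normal_quantiles(n):
    """Approximate expected normal order statistics (Blom positions)."""
    nd = NormalDist()
    return [nd.inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]


def probability_plot_correlation(y):
    """Correlation between the ordered data and the normal quantiles."""
    ys = sorted(y)
    q = normal_quantiles(len(ys))
    q_bar, y_bar = mean(q), mean(ys)
    sqy = sum((a - q_bar) * (b - y_bar) for a, b in zip(q, ys))
    sqq = sum((a - q_bar) ** 2 for a in q)
    syy = sum((b - y_bar) ** 2 for b in ys)
    return sqy / math.sqrt(sqq * syy)


def tail_departures(y):
    """Residuals of the smallest and largest points from a line fitted
    to the middle half of the probability plot."""
    ys = sorted(y)
    n = len(ys)
    q = normal_quantiles(n)
    mid_q, mid_y = q[n // 4: 3 * n // 4], ys[n // 4: 3 * n // 4]
    q_bar, y_bar = mean(mid_q), mean(mid_y)
    slope = (sum((a - q_bar) * (b - y_bar) for a, b in zip(mid_q, mid_y))
             / sum((a - q_bar) ** 2 for a in mid_q))
    intercept = y_bar - slope * q_bar
    first = ys[0] - (intercept + slope * q[0])
    last = ys[-1] - (intercept + slope * q[-1])
    return first, last


# Hypothetical samples built from exact distribution quantiles.
n = 200
u = [(i - 0.375) / (n + 0.25) for i in range(1, n + 1)]
short_tailed = u[:]                                        # uniform quantiles
long_tailed = [math.tan(math.pi * (p - 0.5)) for p in u]   # Cauchy quantiles

short_first, short_last = tail_departures(short_tailed)
long_first, long_last = tail_departures(long_tailed)
```

For the short-tailed sample the first residual is positive and the last is negative; for the long-tailed sample the signs are reversed, in line with the rule stated above. With real data the signs are only meaningful when the residuals are large relative to the scatter around the line.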
Questions the 4-Plot Answers
The 4-plot can provide answers to many questions:
- Is the process in-control, stable, and predictable?
- Is the process drifting with respect to location?
- Is the process drifting with respect to variation?
- Are the data random?
- Is an observation related to an adjacent observation?
- If the data are a time series, is it white noise?
- If the data are a time series and not white noise, is it sinusoidal, autoregressive, etc.?
- If the data are non-random, what is a better model?
- Does the process follow a normal distribution?
- If non-normal, what distribution does the process follow?
- Is the model valid and sufficient?
- If the default model is insufficient, what is a better model?
- Is the formula $s_{\bar{Y}} = s/\sqrt{N}$ valid for the standard error of the mean?
- Is the sample mean a good estimator of the process location?
- If not, what would be a better estimator?
- Are there any outliers?
Probabilistic Predictability
Predictability is an all-important goal in science and engineering. If the four assumptions hold, then we have achieved probabilistic predictability — the ability to make probability statements not only about the process in the past, but also about the process in the future. In short, such processes are said to be “statistically in control.” If the four assumptions do not hold, then we have a process that is drifting (with respect to location, variation, or distribution), is unpredictable, and is out of control. A simple characterization of such processes by a location estimate, a variation estimate, or a distribution “estimate” inevitably leads to optimistic and grossly invalid engineering conclusions.
Interpreting the Combined Evidence
The real strength of the 4-plot is in reading all four panels together. Here are common patterns:
- All four panels clean: The data satisfy all standard assumptions. The process is “statistically in control.” Proceed with parametric analysis (t-tests, ANOVA, regression).
- Run sequence shows a trend, other panels distorted: The location is shifting, which also distorts the lag plot, histogram, and probability plot. Before analyzing, detrend the data or model the trend explicitly.
- Lag plot shows structure: The data are autocorrelated. With positive autocorrelation, the usual standard error formula $s/\sqrt{N}$ understates the uncertainty of the mean. Consider time-series methods or subsample to reduce dependence.
- Histogram bimodal, probability plot curved: The data likely come from a mixture of two populations. Investigate whether a grouping variable can separate them.
- Probability plot shows heavy tails: The normal assumption is violated. Consider transformations (log, Box-Cox) or use robust/non-parametric methods.
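To illustrate the autocorrelation pattern in the list above, the sketch below compares the usual standard error of the mean with a version inflated by the factor $\sqrt{(1 + r_1)/(1 - r_1)}$. That factor is a large-sample approximation that assumes a first-order autoregressive process; it is an assumption introduced here, not a method from the handbook, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation coefficient."""
    m = mean(y)
    num = sum((a - m) * (b - m) for a, b in zip(y[:-1], y[1:]))
    den = sum((v - m) ** 2 for v in y)
    return num / den


def naive_standard_error(y):
    """Usual formula s / sqrt(N), valid for independent observations."""
    return stdev(y) / math.sqrt(len(y))


def ar1_adjusted_standard_error(y):
    """Naive standard error scaled by the AR(1) large-sample factor."""
    r1 = lag1_autocorrelation(y)
    return naive_standard_error(y) * math.sqrt((1 + r1) / (1 - r1))


# Hypothetical slowly varying series with strong positive autocorrelation.
wave = [math.sin(2 * math.pi * i / 50) for i in range(200)]
naive = naive_standard_error(wave)
adjusted = ar1_adjusted_standard_error(wave)
```

For positively autocorrelated data such as `wave`, `adjusted` exceeds `naive`, which is the sense in which the usual formula is too optimistic.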
Beyond the 4-Plot
When the 4-plot reveals a violation, deeper investigation is needed. The 6-plot is a collection of six graphical techniques whose purpose is to assess the validity of a $Y$ versus $X$ fit, where the fit can be linear, non-linear, LOWESS, spline, or any other fit utilizing a single independent variable. Formal hypothesis tests, such as the runs test for randomness and the Anderson-Darling test for normality, can quantify what the 4-plot reveals visually. See When Assumptions Fail for guidance on remedial actions.
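As one example of such a formal test, the runs test can be sketched as follows. This version counts runs above and below the median and uses the large-sample normal approximation; dropping values equal to the median is a common convention assumed here, and the function name `runs_test_z` is illustrative.

```python
import math
from statistics import median


def runs_test_z(y):
    """Z statistic for the runs test about the median."""
    med = median(y)
    # True for values above the median; values equal to it are dropped.
    signs = [v > med for v in y if v != med]
    n1 = sum(signs)
    n2 = len(signs) - n1
    # A new run starts whenever consecutive signs differ.
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


# Hypothetical series: strict alternation (too many runs) and a steady
# trend (too few runs).
z_alternating = runs_test_z([1, 2] * 20)
z_trending = runs_test_z(list(range(40)))
```

At the 5% level, an absolute value above about 1.96 suggests the sequence is not random: a large positive value indicates too many runs (oscillation) and a large negative value too few (trend or clustering).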
Related Techniques
- Run Sequence Plot — tests fixed location and fixed variation
- Lag Plot — tests randomness
- Histogram — tests distribution shape
- Normal Probability Plot — tests normality
- Autocorrelation Plot — quantifies serial dependence beyond the lag plot
- Spectral Plot — identifies cyclic components in the frequency domain
- PPCC Plot — identifies the distributional family via probability plot correlation coefficients
Cross-References
- 4-Plot — the technique page with interactive visualization
- 6-Plot — extended diagnostic for single-independent-variable model validation
- Underlying Assumptions — the four assumptions that the 4-plot tests
- When Assumptions Fail — what to do when the 4-plot reveals violations