EDA Problem Categories
NIST/SEMATECH Section 1.1.7 General Problem Categories
The NIST/SEMATECH Engineering Statistics Handbook (Section 1.1.7) provides a convenient classification of EDA problems into eight distinct categories. Each category is defined by its data structure, underlying model, desired output, and recommended EDA techniques. Recognizing which category your problem falls into is the first step toward selecting the right analytical approach.
The eight categories are presented below exactly as the Handbook defines them.
1. Univariate
Data: A single column of numbers, Y.
Model: y = constant + error
Output:
- A number (the estimated constant in the model).
- An estimate of uncertainty for the constant.
- An estimate of the distribution for the error.
Techniques:
- 4-Plot
- Probability Plot
- PPCC Plot
The univariate category is the most fundamental: you have a single measured quantity and want to characterize its center, spread, and distributional shape. The 4-Plot provides a comprehensive single-page diagnostic, while probability plots and PPCC plots help identify the best-fitting distribution for the error term.
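The three univariate outputs can be sketched numerically with the standard library alone. This is a minimal illustration, not the Handbook's procedure: the sample values are made up, the helper names are hypothetical, and the plotting positions (i - 0.5)/n are a simple common choice assumed here rather than the exact ones the Handbook uses for its probability plots.

```python
import math
import statistics

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    den = math.sqrt(sum((u - ma) ** 2 for u in a) * sum((v - mb) ** 2 for v in b))
    return num / den

def univariate_summary(y):
    """Return (estimated constant, its standard error, normal PPCC)."""
    n = len(y)
    center = statistics.mean(y)                   # the constant in y = constant + error
    std_err = statistics.stdev(y) / math.sqrt(n)  # uncertainty of that constant
    # Normal probability plot correlation: sorted data vs. normal quantiles.
    # Plotting positions (i - 0.5) / n are an assumption of this sketch.
    normal = statistics.NormalDist()
    quantiles = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    ppcc = pearson(sorted(y), quantiles)
    return center, std_err, ppcc

# Made-up measurements for illustration only.
y = [9.8, 10.1, 10.0, 9.9, 10.3, 10.2, 9.7, 10.0, 10.1, 9.9]
center, std_err, ppcc = univariate_summary(y)
print(f"constant ~ {center:.3f} +/- {std_err:.3f}, normal PPCC = {ppcc:.3f}")
```

A PPCC close to 1 suggests the normal distribution is a reasonable description of the error term; repeating the calculation with other candidate distributions and comparing the correlations is the idea behind the PPCC plot.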
2. Control
Data: A single column of numbers, Y.
Model: y = constant + error
Output: A “yes” or “no” to the question: “Is the system out of control?”
Techniques:
- Control Charts
Control problems share the same data structure and model as univariate problems, but the goal is different. Rather than characterizing the distribution, you want to determine whether the process generating the data is stable over time. Control charts monitor sequential observations against expected limits to flag departures from statistical control.
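As a rough sketch of the yes/no decision, the limits of an individuals chart can be computed from the average moving range. The choice of chart type, the 3-sigma limits, the function names, and the data are all assumptions made for illustration; the category description itself does not prescribe a particular chart.

```python
import statistics

def individuals_chart_limits(y):
    """Return (lower limit, center line, upper limit) for an individuals chart."""
    center = statistics.mean(y)
    moving_ranges = [abs(b - a) for a, b in zip(y, y[1:])]
    # 1.128 is the d2 constant for ranges of two consecutive observations.
    sigma_hat = statistics.mean(moving_ranges) / 1.128
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

def out_of_control(y):
    """Answer "Is the system out of control?" by checking every point against the limits."""
    lcl, _, ucl = individuals_chart_limits(y)
    return any(v < lcl or v > ucl for v in y)

# Made-up sequences: a stable run, then the same run followed by a large jump.
stable = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9]
shifted = stable + [12.5]
print(out_of_control(stable), out_of_control(shifted))
```

The observations must stay in time order, because the moving ranges (and therefore the limits) depend on which values are adjacent.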
3. Comparative
Data: A single response variable and k independent variables (Y, X1, X2, …, Xk), with primary focus on one of these independent variables (the primary factor).
Model: y = f(x1, x2, …, xk) + error
Output: A “yes” or “no” to the question: “Is the primary factor significant?”
Techniques:
- Block Plot
- Scatter Plot
- Box Plot
Comparative problems ask whether a specific factor has a real effect on the response, after accounting for other variables. The block plot is particularly useful for visualizing factor effects while controlling for other sources of variation.
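The reasoning behind a block plot can be sketched as a count: within each block (one combination of the other factors), check which level of the primary factor gives the larger response, then ask how likely such a lopsided count would be if the factor had no effect. The data and the function name below are hypothetical, and the one-sided binomial calculation is a simple sign-test style check assumed for this sketch.

```python
from math import comb

def block_comparison(blocks):
    """blocks: list of (response at level 1, response at level 2), one pair per block.

    Returns (blocks won by level 1, number of blocks, one-sided p-value).
    Ties are counted as losses for level 1 in this simple sketch.
    """
    n = len(blocks)
    wins = sum(1 for a, b in blocks if a > b)
    # Chance of at least `wins` successes in n fair coin flips (no-effect hypothesis).
    p_value = sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n
    return wins, n, p_value

# Made-up responses for 8 blocks.
blocks = [(12.1, 11.4), (13.0, 12.2), (11.8, 11.9), (12.6, 11.7),
          (12.9, 12.0), (13.3, 12.5), (12.4, 11.6), (12.7, 12.1)]
wins, n, p_value = block_comparison(blocks)
print(f"level 1 higher in {wins} of {n} blocks, p = {p_value:.4f}")
```

Because each comparison is made inside a block, the nuisance factors are held fixed and cannot masquerade as an effect of the primary factor.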
4. Screening
Data: A single response variable and k independent variables (Y, X1, X2, …, Xk).
Model: y = f(x1, x2, …, xk) + error
Output:
- A ranked list (from most important to least important) of factors.
- Best settings for the factors.
- A good model/prediction equation relating Y to the factors.
Techniques:
- Block Plot
- Probability Plot
- Bihistogram
Screening problems arise in designed experiments where multiple factors are varied simultaneously. The goal is to determine which factors matter most. Unlike comparative problems (which focus on a single primary factor), screening casts a wider net to rank all factors by importance.
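One common way to produce the ranked list is to compute a main effect for each factor in a two-level design: the mean response at the high setting minus the mean response at the low setting. The sketch below assumes a full 2^3 factorial with coded settings -1/+1 and uses made-up responses; the factor names and the helper function are hypothetical.

```python
from itertools import product

factors = ["X1", "X2", "X3"]
# All 8 runs of a 2^3 full factorial, coded -1 (low) and +1 (high).
design = list(product([-1, 1], repeat=3))
# Made-up responses, one per run, in the same order as `design`.
y = [7.6, 3.4, 8.5, 4.7, 15.3, 11.6, 16.4, 12.5]

def main_effect(design, y, j):
    """Mean response at the high level of factor j minus mean at the low level."""
    high = [yi for run, yi in zip(design, y) if run[j] == 1]
    low = [yi for run, yi in zip(design, y) if run[j] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

effects = {name: main_effect(design, y, j) for j, name in enumerate(factors)}
# Rank factors from most to least important by absolute effect size.
ranked = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
for name in ranked:
    print(f"{name}: effect = {effects[name]:+.2f}")
```

The sign of each effect also hints at the best setting: a positive effect favors the high level when a larger response is better.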
5. Optimization
Data: A single response variable and k independent variables (Y, X1, X2, …, Xk).
Model: y = f(x1, x2, …, xk) + error
Output: Best settings for the factor variables.
Techniques:
- Block Plot
- Least Squares Fitting
- Contour Plot
Optimization goes beyond screening: once you know which factors matter, the goal becomes finding the factor settings that produce the best response. Contour plots are especially valuable here, as they visualize the response surface and reveal the region of optimal settings.
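What a contour plot shows visually can also be computed when the fitted response surface is a second-order polynomial in two factors: the best settings sit at the stationary point where both partial derivatives vanish. The coefficients below are hypothetical stand-ins for a least squares fit, and the function names are illustrative only.

```python
def response(x1, x2, b):
    """Second-order model: b0 + b1*x1 + b2*x2 + b11*x1^2 + b22*x2^2 + b12*x1*x2."""
    b0, b1, b2, b11, b22, b12 = b
    return b0 + b1 * x1 + b2 * x2 + b11 * x1 ** 2 + b22 * x2 ** 2 + b12 * x1 * x2

def stationary_point(b):
    """Solve the 2x2 linear system grad(response) = 0 for a surface with a maximum."""
    _, b1, b2, b11, b22, b12 = b
    det = 4 * b11 * b22 - b12 ** 2
    if det <= 0 or b11 >= 0:
        raise ValueError("surface has no interior maximum")
    x1 = (b12 * b2 - 2 * b22 * b1) / det
    x2 = (b12 * b1 - 2 * b11 * b2) / det
    return x1, x2

# Hypothetical fitted coefficients (b0, b1, b2, b11, b22, b12).
coeffs = (50.0, 3.0, 2.0, -1.5, -1.0, -0.5)
best_x1, best_x2 = stationary_point(coeffs)
print(f"best settings: x1 = {best_x1:.3f}, x2 = {best_x2:.3f}, "
      f"predicted response = {response(best_x1, best_x2, coeffs):.3f}")
```

In practice the stationary point should be checked against the experimental region; a predicted optimum far outside the settings that were actually run is an extrapolation.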
6. Regression
Data: A single response variable and k independent variables (Y, X1, X2, …, Xk). The independent variables can be continuous.
Model: y = f(x1, x2, …, xk) + error
Output: A good model/prediction equation relating Y to the factors.
Techniques:
- Least Squares Fitting
- Scatter Plot
- 6-Plot
Regression problems focus on building a predictive model. While the data structure resembles screening and optimization, the emphasis is on obtaining a good functional relationship between the response and its predictors. The 6-Plot provides a comprehensive residual diagnostic for validating the fitted model.
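For the simplest case, a straight line in one predictor, least squares fitting and the residuals that feed the diagnostic plots can be sketched as follows. The data are made up, and the single-predictor linear form is an assumption of this example; the category allows any f(x1, ..., xk).

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
intercept, slope = fit_line(x, y)
predicted = [intercept + slope * xi for xi in x]
# Residuals are the inputs to residual diagnostics such as the 6-Plot.
residuals = [yi - pi for yi, pi in zip(y, predicted)]
print(f"y = {intercept:.3f} + {slope:.3f} x")
```

If the model is adequate, the residuals should look like structureless noise when plotted against the predictor and against the predicted values.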
7. Time Series
Data: A column of time-dependent numbers, Y. Time is an independent variable that can be either explicit or implied. If the data are not equi-spaced, the time variable should be explicitly provided.
Model: y_t = f(t) + error. The model can be either time-domain based or frequency-domain based.
Output: A good model/prediction equation relating Y to previous values of Y.
Techniques:
- Autocorrelation Plot
- Spectrum
- Complex Demodulation Amplitude Plot
- Complex Demodulation Phase Plot
- ARIMA Models
Time series problems are distinguished by the sequential, time-dependent nature of the data. The analyst must determine whether the underlying structure is best captured in the time domain (autocorrelation, ARIMA) or the frequency domain (spectral analysis, complex demodulation).
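The quantity plotted in an autocorrelation plot is the sample autocorrelation at each lag: the lag-h autocovariance divided by the lag-0 autocovariance. The sketch below assumes equi-spaced data and uses a synthetic sine wave with a period of 12 samples; the function names are illustrative.

```python
import math

def autocovariance(y, lag):
    """Sample autocovariance at the given lag (divisor n, as in the usual definition)."""
    n = len(y)
    mean = sum(y) / n
    return sum((y[t] - mean) * (y[t + lag] - mean) for t in range(n - lag)) / n

def autocorrelation(y, lag):
    """Sample autocorrelation: autocovariance at `lag` divided by autocovariance at 0."""
    return autocovariance(y, lag) / autocovariance(y, 0)

# Synthetic series: 4 full cycles of a sine wave with period 12.
y = [math.sin(2 * math.pi * t / 12) for t in range(48)]
acf = [autocorrelation(y, lag) for lag in range(13)]
print(f"lag 6: {acf[6]:+.3f}, lag 12: {acf[12]:+.3f}")
```

The strong negative value at half the period and the strong positive value at the full period are the time-domain signature of a cycle that a spectral plot would show as a single peak in the frequency domain.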
8. Multivariate
Data: k factor variables (X1, X2, …, Xk).
Model: The model is not explicit.
Output: Identify the underlying correlation structure in the data.
Techniques:
- Star Plot
- Scatter Plot Matrix
- Conditioning Plot
- Profile Plot
- Principal Components
- Clustering
- Discrimination/Classification
Multivariate problems differ fundamentally from the preceding categories: there is no single response variable and no explicit model. Instead, the goal is to discover relationships and structure among multiple variables simultaneously. Note that the NIST Handbook covers multivariate analysis only lightly.
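A numerical counterpart to the scatter plot matrix is the matrix of pairwise correlations among the k variables, which is also the usual starting point for principal components. The sketch below uses three made-up variables and a hypothetical helper function; it only summarizes linear association and is not a substitute for the graphical techniques listed above.

```python
import math

def correlation_matrix(columns):
    """Pairwise Pearson correlations for a list of equal-length columns."""
    def centered(col):
        mean = sum(col) / len(col)
        return [v - mean for v in col]

    dev = [centered(col) for col in columns]
    norms = [math.sqrt(sum(d * d for d in col)) for col in dev]
    k = len(columns)
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norms[i] * norms[j])
             for j in range(k)]
            for i in range(k)]

# Made-up variables: x2 tracks x1 closely, x3 tends to move the opposite way.
x1 = [1, 2, 3, 4, 5]
x2 = [2.1, 3.9, 6.0, 8.1, 9.9]
x3 = [5, 3, 4, 1, 2]
corr = correlation_matrix([x1, x2, x3])
for row in corr:
    print("  ".join(f"{value:+.2f}" for value in row))
```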
Choosing the Right Category
Identifying your problem category narrows the field of applicable techniques considerably. The Handbook recommends starting with the graphical displays appropriate to your category, then following up with corresponding quantitative methods for formal confirmation.
Key distinctions to keep in mind:
- Univariate vs. Control: Same data and model, but different questions (characterize vs. monitor).
- Comparative vs. Screening: Comparative focuses on one primary factor; screening ranks all factors.
- Screening vs. Optimization: Screening identifies important factors; optimization finds the best settings.
- Regression vs. Optimization: Both fit a model, but in regression the prediction equation is itself the goal, while optimization uses the fit to locate the best operating point.
- Time Series: Distinguished by time-dependent data and the choice between time-domain and frequency-domain approaches.
- Multivariate: No explicit response variable; the goal is to uncover correlation structure.
Cross-References
- What is EDA? — the overall philosophy behind exploratory data analysis
- The 4-Plot — the primary diagnostic tool for univariate problems
- Scatter Plot — foundational technique for comparative and regression problems
- Star Plot — key multivariate visualization technique