EDA Problem Categories

NIST/SEMATECH Section 1.1.7 General Problem Categories

The NIST/SEMATECH Engineering Statistics Handbook (Section 1.1.7) provides a convenient classification of EDA problems into eight distinct categories. Each category is defined by its data structure, underlying model, desired output, and recommended EDA techniques. Recognizing which category your problem falls into is the first step toward selecting the right analytical approach.

The eight categories are summarized below, following the Handbook's definitions, with a short commentary after each.

1. Univariate

Data: A single column of numbers, Y.

Model: y = constant + error

Output:

A number (the estimated constant in the model).
An estimate of uncertainty for the constant.
An estimate of the distribution for the error.

Techniques:

4-Plot
Probability Plot
PPCC Plot

The univariate category is the most fundamental: you have a single measured quantity and want to characterize its center, spread, and distributional shape. The 4-Plot provides a comprehensive single-page diagnostic, while probability plots and PPCC plots help identify the best-fitting distribution for the error term.
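The three outputs of a univariate analysis can be sketched numerically. The example below is a minimal illustration, not the Handbook's own procedure: it assumes the mean as the estimate of the constant, the standard error of the mean as its uncertainty, and a normal probability plot correlation coefficient (PPCC) as a rough check of the error distribution. The function name `univariate_summary` and the sample values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def univariate_summary(y):
    """Return (constant estimate, its standard error, normal PPCC) for y.

    Assumes at least two observations that are not all identical.
    """
    n = len(y)
    center = mean(y)                     # estimate of the constant
    std_err = stdev(y) / math.sqrt(n)    # uncertainty of that estimate

    # Filliben's approximation to the medians of uniform order statistics.
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1.0 / n)
    m[0] = 1.0 - m[-1]

    # Theoretical normal quantiles, paired with the sorted observations.
    q = [NormalDist().inv_cdf(p) for p in m]
    ys = sorted(y)

    # Pearson correlation between the two: the normal PPCC.
    qbar, ybar = mean(q), mean(ys)
    sxy = sum((a - qbar) * (b - ybar) for a, b in zip(q, ys))
    sxx = sum((a - qbar) ** 2 for a in q)
    syy = sum((b - ybar) ** 2 for b in ys)
    ppcc = sxy / math.sqrt(sxx * syy)
    return center, std_err, ppcc

# Hypothetical measurements, for illustration only.
sample = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 10.1, 9.9, 10.0, 10.0]
center, std_err, ppcc = univariate_summary(sample)
```

A PPCC close to 1 suggests the normal distribution is a reasonable description of the error term; repeating the calculation with quantiles from other distributions is one way to compare candidate fits.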

2. Control

Data: A single column of numbers, Y.

Model: y = constant + error

Output: A “yes” or “no” to the question: “Is the system out of control?”

Techniques:

Control Charts

Control problems share the same data structure and model as univariate problems, but the goal is different. Rather than characterizing the distribution, you want to determine whether the process generating the data is stable over time. Control charts monitor sequential observations against expected limits to flag departures from statistical control.
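The yes/no question can be sketched with an individuals control chart. This is one common choice rather than the only kind of control chart: it assumes 3-sigma limits with sigma estimated from the average moving range divided by 1.128 (the standard constant for ranges of two consecutive points). The function names and the data are hypothetical.

```python
from statistics import mean

def individuals_chart_limits(y):
    """Return (lower limit, center line, upper limit) for an individuals chart."""
    moving_ranges = [abs(b - a) for a, b in zip(y, y[1:])]
    sigma_hat = mean(moving_ranges) / 1.128   # d2 constant for ranges of size 2
    center = mean(y)
    return center - 3 * sigma_hat, center, center + 3 * sigma_hat

def out_of_control(y):
    """Answer 'Is the system out of control?' by checking points against the limits."""
    lcl, _, ucl = individuals_chart_limits(y)
    return any(v < lcl or v > ucl for v in y)

# Hypothetical sequential measurements, for illustration only.
stable = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 9.9]
shifted = stable + [15.0]
```

Here `out_of_control(stable)` is False and `out_of_control(shifted)` is True. In practice the limits are usually computed from a baseline period and then applied to new observations.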

3. Comparative

Data: A single response variable and k independent variables (Y, X₁, X₂, …, X_k), with primary focus on one of these independent variables (the primary factor).

Model: y = f(x₁, x₂, …, x_k) + error

Output: A “yes” or “no” to the question: “Is the primary factor significant?”

Techniques:

Block Plot
Scatter Plot
Box Plot

Comparative problems ask whether a specific factor has a real effect on the response, after accounting for other variables. The block plot is particularly useful for visualizing factor effects while controlling for other sources of variation.
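The logic behind a block plot can be sketched without graphics: within each block (one combination of the other factors), compare the response at the two levels of the primary factor, then count how often one level wins. The example below assumes a two-level primary factor and uses a two-sided sign test on the counts. The function name, the row layout `(block, level, y)`, and the data are hypothetical.

```python
from collections import defaultdict
from math import comb
from statistics import mean

def block_comparison(rows, level_a, level_b):
    """Count blocks where level_a beats level_b; return (wins, blocks, p_value).

    rows is an iterable of (block, level, y) tuples. Ties count against level_a.
    """
    by_block = defaultdict(lambda: defaultdict(list))
    for block, level, y in rows:
        by_block[block][level].append(y)

    wins = 0
    n_blocks = 0
    for levels in by_block.values():
        if level_a in levels and level_b in levels:
            n_blocks += 1
            if mean(levels[level_a]) > mean(levels[level_b]):
                wins += 1

    # Two-sided sign test: chance of a split at least this lopsided
    # if the primary factor had no effect (each block a fair coin flip).
    k = max(wins, n_blocks - wins)
    tail = sum(comb(n_blocks, j) for j in range(k, n_blocks + 1)) / 2 ** n_blocks
    return wins, n_blocks, min(1.0, 2 * tail)

# Hypothetical data: six blocks, level "B" is one unit higher in every block.
rows = [(b, lvl, 10.0 + b + (1.0 if lvl == "B" else 0.0))
        for b in range(6) for lvl in ("A", "B")]
wins, n_blocks, p_value = block_comparison(rows, "B", "A")
```

With level "B" winning in all six blocks, the two-sided p-value is 2 × (1/2)⁶ = 0.03125, which would usually be read as a "yes" for the primary factor.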

4. Screening

Data: A single response variable and k independent variables (Y, X₁, X₂, …, X_k).

Model: y = f(x₁, x₂, …, x_k) + error

Output:

A ranked list (from most important to least important) of factors.
Best settings for the factors.
A good model/prediction equation relating Y to the factors.

Techniques:

Block Plot
Probability Plot
Bihistogram

Screening problems arise in designed experiments where multiple factors are varied simultaneously. The goal is to determine which factors matter most. Unlike comparative problems (which focus on a single primary factor), screening casts a wider net to rank all factors by importance.
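A ranked list of factors can be sketched for a two-level design by computing each factor's main effect, that is, the mean response at the high setting minus the mean response at the low setting. The example assumes a full 2³ factorial in coded units (-1/+1) and a made-up response; `main_effects` is a hypothetical helper, not a Handbook routine.

```python
from itertools import product
from statistics import mean

def main_effects(design, y):
    """Return [(factor name, effect)] sorted from most to least important.

    design is a list of rows of coded settings (-1 or +1), y the responses.
    """
    k = len(design[0])
    effects = {}
    for j in range(k):
        hi = mean(v for row, v in zip(design, y) if row[j] == +1)
        lo = mean(v for row, v in zip(design, y) if row[j] == -1)
        effects[f"X{j + 1}"] = hi - lo
    # Rank by the absolute size of the effect.
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Full 2^3 factorial with a hypothetical, noise-free response.
design = list(product((-1, +1), repeat=3))
y = [10 + 3 * x1 + 0.5 * x2 - 1.5 * x3 for x1, x2, x3 in design]
ranked = main_effects(design, y)
```

For this synthetic response the ranking is X1, X3, X2, and the sign of each effect indicates which setting raises the response.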

5. Optimization

Data: A single response variable and k independent variables (Y, X₁, X₂, …, X_k).

Model: y = f(x₁, x₂, …, x_k) + error

Output: Best settings for the factor variables.

Techniques:

Block Plot
Least Squares Fitting
Contour Plot

Optimization goes beyond screening: once you know which factors matter, the goal becomes finding the factor settings that produce the best response. Contour plots are especially valuable here, as they visualize the response surface and reveal the region of optimal settings.
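What a contour plot shows visually can be approximated numerically by evaluating a fitted model over a grid of factor settings and picking the best point. The sketch below assumes that a larger response is better and that a quadratic model has already been fitted; the coefficients in `fitted_response` are invented for illustration, and `best_settings` is a hypothetical helper.

```python
def fitted_response(x1, x2):
    """Hypothetical fitted quadratic model in coded units (made-up coefficients)."""
    return 50 + 4 * x1 + 2 * x2 - 2 * x1 ** 2 - x2 ** 2 - x1 * x2

def best_settings(model, lo=-2.0, hi=2.0, steps=81):
    """Evaluate model on a steps x steps grid and return the best (x1, x2)."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    candidates = [(a, b) for a in grid for b in grid]
    return max(candidates, key=lambda s: model(*s))

x1_best, x2_best = best_settings(fitted_response)
```

For this particular surface the stationary point is at x1 = 6/7 and x2 = 4/7, and the grid search lands on the nearest grid settings. A grid search only looks inside the region it is given, so an optimum on the boundary is a hint that the experimental region may need to be moved.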

6. Regression

Data: A single response variable and k independent variables (Y, X₁, X₂, …, X_k). The independent variables can be continuous.

Model: y = f(x₁, x₂, …, x_k) + error

Output: A good model/prediction equation relating Y to the factors.

Techniques:

Least Squares Fitting
Scatter Plot
6-Plot

Regression problems focus on building a predictive model. While the data structure resembles screening and optimization, the emphasis is on obtaining a good functional relationship between the response and its predictors. The 6-Plot provides a comprehensive residual diagnostic for validating the fitted model.
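For the simplest case of one predictor, least squares fitting and the residuals that feed a diagnostic display can be sketched as follows. This assumes a straight-line model y = b0 + b1·x; `fit_line` and the data are hypothetical.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, residuals).
    """
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, residuals

# Hypothetical data lying exactly on y = 2 + 3x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [2.0 + 3.0 * xi for xi in x]
intercept, slope, residuals = fit_line(x, y)
```

With real data the residuals are the quantity to examine: plotting them against the predictor, against run order, and as a histogram or normal probability plot indicates whether the fitted model is adequate.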

7. Time Series

Data: A column of time-dependent numbers, Y. Time is an independent variable that can be either explicit or implied. If the data are not equi-spaced, the time variable should be explicitly provided.

Model: y_t = f(t) + error. The model can be either time-domain based or frequency-domain based.

Output: A good model/prediction equation relating Y to previous values of Y.

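Techniques:

Autocorrelation Plot
Spectrum
Complex Demodulation Amplitude Plot
Complex Demodulation Phase Plot
ARIMA Models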

Time series problems are distinguished by the sequential, time-dependent nature of the data. The analyst must determine whether the underlying structure is best captured in the time domain (autocorrelation, ARIMA) or the frequency domain (spectral analysis, complex demodulation).
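On the time-domain side, the quantity behind an autocorrelation plot is the sample autocorrelation at each lag. A minimal sketch, assuming equi-spaced data and the usual estimator that divides the lagged cross-products by the total sum of squares; the function name and series are hypothetical.

```python
from statistics import mean

def autocorrelation(y, lag):
    """Sample autocorrelation of an equi-spaced series y at the given lag."""
    n = len(y)
    ybar = mean(y)
    denom = sum((v - ybar) ** 2 for v in y)
    num = sum((y[t] - ybar) * (y[t + lag] - ybar) for t in range(n - lag))
    return num / denom

# Hypothetical alternating series: strong negative dependence at lag 1.
series = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]
r = [autocorrelation(series, k) for k in range(3)]
```

For this series the lag-1 value is strongly negative and the lag-2 value strongly positive, the signature of a period-2 oscillation. Values near zero at all nonzero lags would instead suggest no time dependence.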

8. Multivariate

Data: k factor variables (X₁, X₂, …, X_k).

Model: The model is not explicit.

Output: Identify the underlying correlation structure in the data.

Techniques:

Star Plot
Scatter Plot Matrix
Conditioning Plot
Profile Plot
Principal Components
Clustering
Discrimination/Classification

Multivariate problems differ fundamentally from the preceding categories: there is no single response variable and no explicit model. Instead, the goal is to discover relationships and structure among multiple variables simultaneously. Note that multivariate analysis is only covered lightly in the NIST Handbook.
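A first numerical look at correlation structure is the matrix of pairwise correlations, the numerical counterpart of a scatter plot matrix. The sketch below assumes Pearson correlation and columns of equal length that are not constant; `correlation_matrix` and the data are hypothetical.

```python
import math
from statistics import mean

def correlation_matrix(columns):
    """Return the matrix of pairwise Pearson correlations between columns."""
    def corr(a, b):
        abar, bbar = mean(a), mean(b)
        sab = sum((x - abar) * (y - bbar) for x, y in zip(a, b))
        saa = sum((x - abar) ** 2 for x in a)
        sbb = sum((y - bbar) ** 2 for y in b)
        return sab / math.sqrt(saa * sbb)
    return [[corr(a, b) for b in columns] for a in columns]

# Hypothetical variables: X2 rises with X1, X3 falls with X1.
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
]
R = correlation_matrix(columns)
```

Pairwise correlations capture only linear, two-variable relationships; techniques such as principal components and clustering look for structure involving several variables at once.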

Choosing the Right Category

Identifying your problem category narrows the field of applicable techniques considerably. The Handbook recommends starting with the graphical displays appropriate to your category, then following up with corresponding quantitative methods for formal confirmation.

Key distinctions to keep in mind:

Univariate vs. Control: Same data and model, but different questions (characterize vs. monitor).
Comparative vs. Screening: Comparative focuses on one primary factor; screening ranks all factors.
Screening vs. Optimization: Screening identifies important factors; optimization finds the best settings.
Regression vs. Optimization: Both rely on a fitted model, but in regression the model itself is the goal (prediction), while in optimization it is a means of finding the best operating point.
Time Series: Distinguished by time-dependent data and the choice between time-domain and frequency-domain approaches.
Multivariate: No explicit response variable; the goal is to uncover correlation structure.
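The distinctions above can be written down as a small lookup. This is only an illustrative encoding of the list, not part of the Handbook: the function `suggest_category`, its parameters, and the goal labels are all hypothetical names chosen for the sketch.

```python
def suggest_category(has_response, n_factors, goal, time_ordered=False):
    """Map a rough problem description to one of the eight categories.

    goal is one of the hypothetical labels: "characterize", "monitor",
    "test_primary_factor", "rank_factors", "best_settings", "predict".
    """
    if not has_response:
        return "Multivariate"            # no response variable, no explicit model
    if n_factors == 0:
        if time_ordered and goal == "predict":
            return "Time Series"         # predict Y from its own past values
        single_column = {"characterize": "Univariate", "monitor": "Control"}
        if goal in single_column:
            return single_column[goal]
    else:
        with_factors = {
            "test_primary_factor": "Comparative",
            "rank_factors": "Screening",
            "best_settings": "Optimization",
            "predict": "Regression",
        }
        if goal in with_factors:
            return with_factors[goal]
    raise ValueError("description does not match a single category")
```

For example, `suggest_category(True, 0, "monitor")` returns "Control" and `suggest_category(True, 3, "rank_factors")` returns "Screening". Real problems can straddle categories, so the result is a starting point for choosing techniques rather than a verdict.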

Cross-References

What is EDA? — the overall philosophy behind exploratory data analysis
The 4-Plot — the primary diagnostic tool for univariate problems
Scatter Plot — foundational technique for comparative and regression problems
Star Plot — key multivariate visualization technique