EDA Visual Encyclopedia
A comprehensive interactive reference for Exploratory Data Analysis, covering graphical techniques, quantitative methods, probability distributions, case studies, and reference tables. Based on the NIST/SEMATECH Engineering Statistics Handbook.
Foundations
What is EDA?
Section 1.1
Learn what Exploratory Data Analysis is, how it differs from classical and Bayesian approaches, and why EDA is essential for understanding data before formal modeling
The Role of Graphics in EDA
Section 1.1.5
Discover why graphical methods are central to Exploratory Data Analysis and how visual techniques reveal patterns that summary statistics can miss
EDA Problem Categories
Section 1.1.7
The eight general problem categories in EDA: univariate, control, comparative, screening, optimization, regression, time series, and multivariate — each with its data structure, model, output, and recommended techniques
Underlying Assumptions
Section 1.2
Understand the four underlying assumptions of statistical analysis — random drawings, fixed distribution, fixed location, and fixed variation — and why testing them matters
When Assumptions Fail
Section 1.2.5
Understand the consequences when underlying statistical assumptions are violated and how non-compliance affects the validity of analysis results
The 4-Plot for Assumption Testing
Section 1.3.3.32
Learn how the 4-plot technique tests all four underlying assumptions simultaneously using a run sequence plot, lag plot, histogram, and normal probability plot
Graphical Techniques
Autocorrelation Plot
Section 1.3.3.1
An autocorrelation plot displays the autocorrelation of a dataset at successive lag values. It is used to check whether a time series is random or exhibits serial dependence.
Bihistogram
Section 1.3.3.2
A bihistogram displays the histograms of two datasets on a common horizontal axis, one plotted upward and one downward. It is used to compare the distributional characteristics of two groups.
Block Plot
Section 1.3.3.3
A block plot displays the means of groups arranged by blocks, revealing whether block effects or treatment effects dominate. It is used in designed experiments to visualize factor and interaction effects.
Bootstrap Plot
Section 1.3.3.4
A bootstrap plot displays the computed value of a sample statistic versus the subsample number for repeated resamples drawn with replacement. It is used to assess the uncertainty, stability, and confidence interval of an estimate.
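The resampling behind the plot can be sketched as follows (NumPy assumed; the percentile-method 90% interval shown is one common way to read off uncertainty):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50.0, scale=5.0, size=100)

# Draw resamples with replacement and record the statistic each time;
# plotting boot_means against the resample index gives the bootstrap plot.
B = 2000
boot_means = np.array([rng.choice(data, size=data.size, replace=True).mean()
                       for _ in range(B)])

# Percentile-method 90% confidence interval for the mean.
lo, hi = np.percentile(boot_means, [5, 95])
```

The spread of the plotted points conveys the stability of the estimate: a tight band means a reliable statistic, a wide one means high sampling uncertainty.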
Box-Cox Linearity Plot
Section 1.3.3.5
A Box-Cox linearity plot identifies the optimal power transformation to achieve linearity between two variables. It is used when a linear model is desired but the relationship appears curvilinear.
Box-Cox Normality Plot
Section 1.3.3.6
A Box-Cox normality plot identifies the optimal power transformation to make a dataset approximately normally distributed. It is used when normality is required for statistical tests but the raw data are skewed.
Box Plot
Section 1.3.3.7
A box plot summarizes a dataset using its median, quartiles, and potential outliers in a compact graphical form. It is used to quickly compare distributions across groups and identify skewness and outliers.
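The summary statistics a box plot encodes can be computed directly (a minimal sketch; Tukey's 1.5 × IQR fences are one common rule for flagging potential outliers):

```python
import numpy as np

def box_plot_summary(x):
    """Median, quartiles, and Tukey's 1.5 * IQR fences behind a box plot."""
    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = x[(x < lower) | (x > upper)]   # points drawn individually
    return med, q1, q3, outliers

med, q1, q3, outliers = box_plot_summary(
    [2.1, 2.4, 2.5, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 9.9])
```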
Complex Demodulation
Section 1.3.3.8
Complex demodulation plots display the amplitude and phase of a sinusoidal component in a time series at a specified frequency. They are used to detect whether the amplitude or phase of a cyclic signal changes over time.
Contour Plot
Section 1.3.3.10
A contour plot is a graphical technique for representing a three-dimensional surface by plotting constant z slices, called contours, in two dimensions. It is an alternative to a 3-D surface plot.
DOE Plots
Section 1.3.3.11
DOE plots include the DOE scatter plot, DOE mean plot, and DOE standard deviation plot used in designed experiments. They visualize factor effects on location and spread to determine which factors are statistically significant.
Histogram
Section 1.3.3.14
A histogram is a graphical summary of the frequency distribution of a single variable. It displays the shape, center, and spread of a dataset by dividing the data range into bins and counting observations in each bin.
Lag Plot
Section 1.3.3.15
A lag plot displays each observation against the observation at a fixed lag interval. It is used to check for randomness, serial correlation, and non-linear structure in time-ordered data.
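The points a lag plot draws are just the series paired with a shifted copy of itself; a sketch (NumPy assumed, with the lag-plot correlation summarizing the structure seen visually):

```python
import numpy as np

def lag_pairs(x, lag=1):
    """The (x[t], x[t + lag]) pairs that a lag plot draws as points."""
    x = np.asarray(x, dtype=float)
    return x[:-lag], x[lag:]

rng = np.random.default_rng(0)
noise = rng.normal(size=500)   # random data: lag plot is a shapeless cloud
walk = np.cumsum(noise)        # random walk: lag plot hugs the diagonal

r_noise = np.corrcoef(*lag_pairs(noise))[0, 1]
r_walk = np.corrcoef(*lag_pairs(walk))[0, 1]
```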
Linear Plots
Section 1.3.3.16
Linear plots include the linear correlation plot, linear intercept plot, linear slope plot, and linear residual standard deviation plot. They are used to assess how well a linear model fits the data and how its parameters vary across subsets.
Mean Plot
Section 1.3.3.20
A mean plot displays the group means for a factor variable, with an overall reference line for the grand mean. It is used to determine whether a factor has a significant effect on the response variable.
Normal Probability Plot
Section 1.3.3.21
A normal probability plot displays the sorted data values against their expected normal quantiles. It is used to assess whether data follow a normal distribution, with deviations from the reference line indicating non-normality.
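A sketch of the plotted coordinates, using Filliben's approximation to the uniform order statistic medians for the theoretical quantiles (NumPy plus the standard library's `statistics.NormalDist`; the function name is illustrative):

```python
import numpy as np
from statistics import NormalDist

def normal_plot_points(x):
    """Ordered data vs. normal quantiles of the uniform order statistic
    medians (Filliben's approximation)."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    i = np.arange(1, n + 1)
    m = (i - 0.3175) / (n + 0.365)
    m[0], m[-1] = 1 - 0.5 ** (1 / n), 0.5 ** (1 / n)
    q = np.array([NormalDist().inv_cdf(p) for p in m])
    return q, xs

rng = np.random.default_rng(1)
q, xs = normal_plot_points(rng.normal(10, 2, size=200))
r = np.corrcoef(q, xs)[0, 1]   # near 1 when the data are normal
```

Plotting `xs` against `q` gives the normal probability plot; the correlation `r` of these points is the basis of the PPCC described below.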
Probability Plot
Section 1.3.3.22
A probability plot displays the sorted data values against the theoretical quantiles of a specified distribution. It is used to assess whether data follow any distribution family, not just the normal.
PPCC Plot
Section 1.3.3.23
A probability plot correlation coefficient (PPCC) plot displays the correlation from a probability plot for a family of distributions indexed by a shape parameter. It is used to identify the best-fitting distribution or optimal transformation.
Q-Q Plot
Section 1.3.3.24
A quantile-quantile (Q-Q) plot compares the quantiles of two data sets to determine if they come from populations with a common distribution. It is similar to a probability plot, but compares two empirical samples rather than one sample against a theoretical distribution.
Run-Sequence Plot
Section 1.3.3.25
A run-sequence plot displays the data values in the order they were collected, with the vertical axis showing the response and the horizontal axis the run order. It is used to detect shifts in location, scale, or the presence of outliers over time.
Scatter Plot
Section 1.3.3.26
A scatter plot displays the relationship between two quantitative variables by plotting data points on a two-dimensional graph. It is used to identify patterns, trends, correlations, and outliers in bivariate data.
Scatterplot Matrix
Section 1.3.3.26.11
A scatterplot matrix displays all pairwise scatter plots of variables in a dataset arranged in a grid. It is used to explore multivariate relationships and detect pairwise correlations, clusters, and outliers.
Conditioning Plot
Section 1.3.3.26.12
A conditioning plot (coplot) is a plot of two variables conditional on the value of a third variable. It is used to explore how a bivariate relationship changes across levels of a third variable.
Spectral Plot
Section 1.3.3.27
A spectral plot displays the power spectrum of a time series, showing the contribution of each frequency component to the overall variance. It is used to detect dominant periodicities and cyclic behavior in time-ordered data.
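The underlying computation can be sketched as a raw, unsmoothed periodogram via the FFT (NumPy assumed; production spectral plots typically smooth or taper the estimate):

```python
import numpy as np

# Raw periodogram: squared FFT magnitude at each Fourier frequency.
t = np.arange(256)
x = np.sin(2 * np.pi * t / 16)        # one pure cycle, period 16
power = np.abs(np.fft.rfft(x - x.mean())) ** 2
freqs = np.fft.rfftfreq(t.size)       # in cycles per observation
dominant = freqs[np.argmax(power)]    # expect 1/16 for this signal
```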
Standard Deviation Plot
Section 1.3.3.28
A standard deviation plot displays the group standard deviations versus group identifier, with an overall reference line. It is the scale counterpart to the mean plot, used to detect whether variability is constant across groups or changing over time.
Star Plot
Section 1.3.3.29
A star plot displays multivariate data as a series of equi-angular spokes radiating from a center point, with each spoke representing a variable. It is used to compare multiple observations across many variables simultaneously.
Weibull Plot
Section 1.3.3.30
A Weibull plot is a specialized probability plot for assessing whether data follow a Weibull distribution and for estimating the shape and scale parameters. It is used in reliability engineering and failure analysis.
Youden Plot
Section 1.3.3.31
A Youden plot compares paired measurements from two runs or conditions by plotting them against each other with reference lines for the medians. It is used in interlaboratory studies to distinguish between within-lab and between-lab variability.
4-Plot
Section 1.3.3.32
A 4-plot combines a run-sequence plot, lag plot, histogram, and normal probability plot in a single display. It is used as a quick screening tool to simultaneously check the four underlying assumptions of a univariate dataset.
6-Plot
Section 1.3.3.33
A 6-plot is a regression diagnostic display with six panels: response vs predictor, residuals vs predictor, residuals vs predicted values, lag plot of residuals, histogram of residuals, and normal probability plot of residuals. It is used to assess model adequacy after fitting a regression model.
Quantitative Methods
Measures of Location
Section 1.3.5.1
Measures of location summarize the central tendency of a dataset using statistics such as the mean, median, and mode. They are used to characterize where the center of a distribution lies.
Confidence Limits for the Mean
Section 1.3.5.2
Confidence limits define an interval that contains the true population mean with a specified level of confidence. They are used to quantify the uncertainty in a sample mean estimate.
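A minimal sketch of the standard t-based interval, using only the Python standard library (the critical value t₀.₉₇₅,₉ = 2.262 is taken from standard t tables for this 10-observation example):

```python
from math import sqrt
from statistics import mean, stdev

def mean_ci(x, t_crit):
    """Two-sided interval xbar +/- t * s / sqrt(n) for the population mean."""
    n = len(x)
    half = t_crit * stdev(x) / sqrt(n)
    return mean(x) - half, mean(x) + half

# n = 10, so 9 degrees of freedom; t_{0.975, 9} = 2.262.
x = [4.9, 5.1, 5.0, 5.2, 4.8, 5.1, 4.9, 5.0, 5.3, 4.7]
lower, upper = mean_ci(x, t_crit=2.262)
```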
Two-Sample t-Test
Section 1.3.5.3
The two-sample t-test determines whether the means of two independent groups differ significantly. It is used to compare location parameters when the data are approximately normally distributed.
One-Factor ANOVA
Section 1.3.5.4
One-factor analysis of variance tests whether the means of three or more groups differ significantly. It is used when comparing location parameters across multiple levels of a single factor.
Multi-Factor ANOVA
Section 1.3.5.5
Multi-factor analysis of variance tests for main effects and interactions among two or more factors simultaneously. It is used in designed experiments to identify which factors and factor combinations significantly affect the response.
Measures of Scale
Section 1.3.5.6
Measures of scale quantify the spread or variability of a dataset using statistics such as the standard deviation, variance, and range. They are used to characterize how dispersed the data are around the center.
Bartlett's Test
Section 1.3.5.7
Bartlett's test assesses whether several groups have equal variances, assuming the data are normally distributed. It is used to verify the homogeneity of variance assumption before applying ANOVA or t-tests.
Chi-Square Test for Standard Deviation
Section 1.3.5.8
The chi-square test for the standard deviation tests whether a population standard deviation equals a specified value. It is used to assess whether the variability of a process meets a target specification.
F-Test for Equality of Two Variances
Section 1.3.5.9
The F-test compares the variances of two independent groups to determine if they are significantly different. It is used to check the equal variance assumption before performing a two-sample t-test.
Levene Test for Equality of Variances
Section 1.3.5.10
The Levene test assesses whether multiple groups have equal variances without requiring normality. It is used as a robust alternative to Bartlett's test when the data may not be normally distributed.
Measures of Skewness and Kurtosis
Section 1.3.5.11
Skewness measures the asymmetry of a distribution, while kurtosis measures the heaviness of its tails relative to a normal distribution. They are used to characterize the shape of a dataset beyond location and scale.
Autocorrelation
Section 1.3.5.12
The autocorrelation coefficient quantifies the linear dependence between observations at different time lags. It is used to test whether successive measurements are statistically independent or exhibit serial correlation.
Runs Test for Randomness
Section 1.3.5.13
The runs test determines whether the order of observations above and below the median is random. It is a non-parametric test used to detect trends, oscillations, or other departures from randomness.
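The above-and-below-the-median version of the test, sketched with the standard library (normal approximation to the run count; the function name is illustrative):

```python
from math import sqrt
from statistics import NormalDist, median

def runs_test(x):
    """Runs test for randomness: count runs above/below the median and
    compare to the expected count under randomness (normal approximation)."""
    med = median(x)
    signs = [v > med for v in x if v != med]   # drop ties with the median
    n1 = sum(signs)
    n2 = len(signs) - n1
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mu = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / sqrt(var)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return runs, z, p

# A perfectly oscillating series has far too many runs to be random.
runs, z, p = runs_test([1, 9] * 10)
```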
Anderson-Darling Test
Section 1.3.5.14
The Anderson-Darling test assesses whether a dataset follows a specified probability distribution, with particular sensitivity in the tails. It is used as a formal goodness-of-fit test complementing graphical methods.
Chi-Square Goodness-of-Fit Test
Section 1.3.5.15
The chi-square goodness-of-fit test determines whether observed frequency counts match expected counts under a hypothesized distribution. It is used for both continuous and discrete distribution testing with binned data.
Kolmogorov-Smirnov Goodness-of-Fit Test
Section 1.3.5.16
The Kolmogorov-Smirnov test compares the empirical cumulative distribution function with a theoretical one or with another sample. It is used as a distribution-free goodness-of-fit test based on the maximum distance between CDFs.
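The one-sample K-S statistic itself is simple to compute by hand, as in this sketch (pure Python; the example data and the quadratic "wrong model" CDF are illustrative):

```python
def ks_statistic(x, cdf):
    """Kolmogorov-Smirnov statistic: maximum vertical distance between
    the empirical CDF of x and a theoretical CDF."""
    xs = sorted(x)
    n = len(xs)
    d = 0.0
    for i, v in enumerate(xs):
        f = cdf(v)
        # ECDF jumps from i/n to (i+1)/n at v; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Evenly spread points on [0, 1] match the uniform CDF closely,
# but are far from a (wrong) quadratic CDF.
grid = [(i + 0.5) / 100 for i in range(100)]
d_uniform = ks_statistic(grid, lambda t: t)
d_wrong = ks_statistic(grid, lambda t: t * t)
```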
Grubbs' Test for Outliers
Section 1.3.5.17.1
Grubbs' test detects a single outlier in a univariate dataset assumed to come from a normally distributed population. It is used to formally test whether the most extreme value in a sample is statistically aberrant.
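A sketch of the test statistic G = max |xᵢ − x̄| / s, compared against the tabled two-sided 5% critical value for n = 10 (2.290, from published Grubbs tables; data are illustrative):

```python
from statistics import mean, stdev

def grubbs_statistic(x):
    """Grubbs' statistic for the most extreme value in a sample."""
    xbar, s = mean(x), stdev(x)
    suspect = max(x, key=lambda v: abs(v - xbar))
    return abs(suspect - xbar) / s, suspect

# n = 10; two-sided 5% Grubbs critical value for n = 10 is 2.290.
data = [9.9, 10.1, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.1, 14.0]
G, suspect = grubbs_statistic(data)
is_outlier = G > 2.290
```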
Yates Analysis for Designed Experiments
Section 1.3.5.18
Yates analysis is an efficient algorithm for computing main effects and interactions in two-level full factorial experiments. It is used to systematically estimate all factor effects from a 2^k factorial design.
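The algorithm is k passes of pairwise sums and differences over the responses in standard (Yates) order; a sketch in pure Python, with a small 2² example whose true effects are known by construction:

```python
def yates(y, k):
    """Yates algorithm for a 2^k full factorial with responses y in
    standard order.  Returns (grand_mean, effects), the effects in
    standard order: A, B, AB, C, AC, BC, ABC, ..."""
    n = len(y)
    assert n == 2 ** k
    col = list(y)
    for _ in range(k):
        sums = [col[i] + col[i + 1] for i in range(0, n, 2)]
        diffs = [col[i + 1] - col[i] for i in range(0, n, 2)]
        col = sums + diffs
    return col[0] / n, [c / (n / 2) for c in col[1:]]

# 2^2 example built from y = 10 + 3*A - 2*B (coded -1/+1 levels), so the
# true effects (high minus low) are A = 6, B = -4, AB = 0.
grand_mean, effects = yates([9, 15, 5, 11], k=2)
```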
Probability Distributions
Normal Distribution
Section 1.3.6.6.1
The normal (Gaussian) distribution is the most important continuous probability distribution, characterized by its symmetric bell-shaped curve. It is fully defined by its mean $\mu$ and standard deviation $\sigma$ and arises naturally via the central limit theorem.
Uniform Distribution
Section 1.3.6.6.2
The uniform distribution assigns equal probability to all values within a specified interval $[a, b]$. It is the simplest continuous distribution and serves as a baseline for random number generation.
Cauchy Distribution
Section 1.3.6.6.3
The Cauchy distribution is a symmetric distribution with heavy tails, centered at location $x_0$ with scale $\gamma$. Its mean and variance are undefined, making it a canonical example of a pathological distribution in statistics.
Student's t-Distribution
Section 1.3.6.6.4
Student's $t$-distribution arises when estimating the mean of a normally distributed population with small sample sizes. It has $\nu$ degrees of freedom and approaches the normal distribution as $\nu \to \infty$.
F-Distribution
Section 1.3.6.6.5
The $F$-distribution is the ratio of two scaled chi-squared variables with $d_1$ and $d_2$ degrees of freedom. It is used in analysis of variance (ANOVA) and for comparing variances of two populations.
Chi-Square Distribution
Section 1.3.6.6.6
The chi-square distribution with $k$ degrees of freedom is the distribution of a sum of squares of $k$ independent standard normal random variables. It is fundamental to hypothesis testing and confidence interval estimation.
Exponential Distribution
Section 1.3.6.6.7
The exponential distribution models the time between events in a Poisson process with rate $\lambda$. It is a one-parameter distribution commonly used for reliability analysis and waiting time problems.
Weibull Distribution
Section 1.3.6.6.8
The Weibull distribution is a versatile distribution used in reliability engineering and failure analysis. It can model increasing ($\alpha > 1$), decreasing ($\alpha < 1$), or constant ($\alpha = 1$) failure rates depending on its shape parameter $\alpha$.
Lognormal Distribution
Section 1.3.6.6.9
The lognormal distribution describes a random variable $X$ whose logarithm $\ln X$ is normally distributed with parameters $\mu$ and $\sigma$. It is commonly used to model positive-valued data with right skew, such as income, stock prices, and particle sizes.
Birnbaum-Saunders (Fatigue Life) Distribution
Section 1.3.6.6.10
The Birnbaum-Saunders distribution models fatigue life of materials subject to cyclic stress, with shape $\alpha$ and scale $\beta$. It is derived from a physical crack-growth model and is widely used in reliability engineering.
Gamma Distribution
Section 1.3.6.6.11
The gamma distribution is a two-parameter family of continuous distributions with shape $\alpha$ and scale $\beta$ that generalizes the exponential and chi-square distributions. It is used to model waiting times, rainfall amounts, and insurance claims.
Double Exponential (Laplace) Distribution
Section 1.3.6.6.12
The double exponential (Laplace) distribution is a symmetric distribution centered at $\mu$ with scale $\beta$, having heavier tails than the normal distribution. It is the distribution of the difference of two independent exponential random variables.
Power Normal Distribution
Section 1.3.6.6.13
The power normal distribution is used in reliability analysis to model the minimum of $p$ independent normal lifetimes. When $p = 1$ it reduces to the standard normal distribution.
Power Lognormal Distribution
Section 1.3.6.6.14
The power lognormal distribution generalizes the lognormal distribution for reliability analysis, modeling the minimum of $p$ independent lognormal lifetimes with scale $\sigma$. It provides additional shape flexibility beyond the lognormal.
Tukey-Lambda Distribution
Section 1.3.6.6.15
The Tukey-Lambda distribution is a symmetric family defined by its quantile function. By varying the shape parameter $\lambda$, it can approximate the normal, logistic, Cauchy, and uniform distributions.
Extreme Value Type I (Gumbel) Distribution
Section 1.3.6.6.16
The extreme value type I (Gumbel) distribution models the maximum or minimum of a large number of independent samples, with location $\mu$ and scale $\beta$. It is widely used in hydrology, meteorology, and structural engineering for modeling extreme events.
Beta Distribution
Section 1.3.6.6.17
The beta distribution is a continuous distribution defined on the interval $[0,\,1]$ with shape parameters $\alpha$ and $\beta$. It is commonly used to model proportions, probabilities, and random variables with bounded support.
Binomial Distribution
Section 1.3.6.6.18
The binomial distribution models the number of successes in $n$ independent Bernoulli trials, each with success probability $p$. It is the foundation for binary outcome analysis and quality control sampling.
Poisson Distribution
Section 1.3.6.6.19
The Poisson distribution models the number of events occurring in a fixed interval of time or space when events occur independently at a constant average rate $\lambda$. It is used for count data in fields ranging from telecommunications to epidemiology.
Case Studies
Beam Deflections Case Study
Section 1.4.2
EDA case study analyzing NIST LEW.DAT dataset of 200 beam deflection measurements to demonstrate detection of periodic structure and randomness failure
Ceramic Strength Case Study
Section 1.4.2
EDA case study analyzing NIST JAHANMI2.DAT ceramic strength data to demonstrate analysis of a designed experiment with batch, lab, and machining factor effects
Fatigue Life of Aluminum Alloy Specimens
Section 1.4.2
EDA case study analyzing NIST BIRNSAUN.DAT fatigue life data to demonstrate probabilistic model selection for reliability engineering
Filter Transmittance Case Study
Section 1.4.2
EDA case study analyzing NIST filter transmittance data to demonstrate detection of non-randomness caused by an overly fast sampling rate in data acquisition
Heat Flow Meter 1 Case Study
Section 1.4.2
EDA case study analyzing NIST ZARR13.DAT heat flow meter calibration data to demonstrate a well-behaved univariate measurement process
Josephson Junction Cryothermometry
Section 1.4.2
EDA case study analyzing NIST SOULEN.DAT voltage count data to demonstrate univariate analysis with discrete integer measurements and mild assumption violations
Normal Random Numbers Case Study
Section 1.4.2
EDA case study analyzing 500 normal random numbers from a Rand Corporation publication to demonstrate assumption verification techniques
Random Walk Case Study
Section 1.4.2
EDA case study analyzing NIST RANDWALK.DAT dataset — a cumulative sum of uniform random numbers — to demonstrate detection of non-stationary location and violation of the fixed-location assumption
Standard Resistor Case Study
Section 1.4.2
EDA case study analyzing NIST standard resistor data to demonstrate detection of drift in location, non-constant variation, and non-randomness caused by seasonal humidity effects on measurement equipment
Uniform Random Numbers Case Study
Section 1.4.2
EDA case study analyzing NIST RANDU.DAT dataset to demonstrate detection of non-normal underlying distributions
Reference
EDA Analysis Questions
Section 1.3.2
The seven standard questions that Exploratory Data Analysis answers, from characterizing the underlying distribution to identifying outliers and determining model fit
Techniques by Category
Section 1.3.4
Complete taxonomy of EDA techniques organized by graphical versus quantitative categories with cross-references to detailed technique pages
Probability Distribution Tables
Section 1.3.6
Critical value and percentile tables for common probability distributions used in statistical hypothesis testing and confidence interval construction
Related Distributions
Section 1.3.6.7
Mathematical relationships between probability distributions including special cases, limiting forms, and transformation connections
About This Encyclopedia
Learn about the methodology behind Exploratory Data Analysis, how this encyclopedia was built, and why EDA remains essential in the age of machine learning. The companion blog post covers all six sections, the technical implementation, and practical guidance for applying these techniques to your own datasets.
Read: Exploratory Data Analysis — A Visual Encyclopedia