EDA Visual Encyclopedia

A comprehensive interactive reference for Exploratory Data Analysis, covering graphical techniques, quantitative methods, probability distributions, case studies, and reference tables. Based on the NIST/SEMATECH Engineering Statistics Handbook.

Foundations

Graphical Techniques

Autocorrelation Plot

Section 1.3.3.1

An autocorrelation plot displays the autocorrelation of a dataset at successive lag values. It is used to check whether a time series is random or exhibits serial dependence.
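As a minimal sketch of the quantity the plot displays, the snippet below computes the sample autocorrelation at lag 1 for two synthetic series (the `autocorr` helper and the seeded data are illustrative, not part of the handbook):

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag."""
    x = np.asarray(x, dtype=float)
    if lag == 0:
        return 1.0
    xm = x - x.mean()
    return float(np.sum(xm[lag:] * xm[:-lag]) / np.sum(xm * xm))

rng = np.random.default_rng(0)
white = rng.normal(size=500)   # random series: autocorrelations near zero
walk = np.cumsum(white)        # random walk: strong serial dependence

r1_white = autocorr(white, 1)  # near 0
r1_walk = autocorr(walk, 1)    # near 1
```

Plotting `autocorr` over a range of lags, with confidence bands around zero, reproduces the autocorrelation plot itself.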

Bihistogram

Section 1.3.3.2

A bihistogram displays the histograms of two datasets on a common horizontal axis, one plotted upward and one downward. It is used to compare the distributional characteristics of two groups.

Block Plot

Section 1.3.3.3

A block plot displays the means of groups arranged by blocks, revealing whether block effects or treatment effects dominate. It is used in designed experiments to visualize factor and interaction effects.

Bootstrap Plot

Section 1.3.3.4

A bootstrap plot displays the computed value of a sample statistic versus the subsample number for repeated resamples drawn with replacement. It is used to assess the uncertainty, stability, and confidence interval of an estimate.
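A rough sketch of the underlying computation, using a synthetic skewed sample (the data and resample count are illustrative): each resample's statistic is one point on the bootstrap plot, and the percentiles of those values give a confidence interval.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=100)   # hypothetical skewed sample

# Draw B subsamples with replacement and record the statistic for each;
# plotting these values vs. subsample number gives the bootstrap plot.
B = 2000
medians = np.array([np.median(rng.choice(data, size=data.size, replace=True))
                    for _ in range(B)])

lo, hi = np.percentile(medians, [2.5, 97.5])  # percentile confidence interval
```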

Box-Cox Linearity Plot

Section 1.3.3.5

A Box-Cox linearity plot identifies the optimal power transformation to achieve linearity between two variables. It is used when a linear model is desired but the relationship appears curvilinear.

Box-Cox Normality Plot

Section 1.3.3.6

A Box-Cox normality plot identifies the optimal power transformation to make a dataset approximately normally distributed. It is used when normality is required for statistical tests but the raw data are skewed.
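One way to locate the optimal power numerically rather than graphically is `scipy.stats.boxcox`, sketched here on a synthetic right-skewed sample (the data are illustrative; for lognormal data the optimal $\lambda$ is near 0, i.e. a log transform):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=200)   # right-skewed sample

# boxcox returns the transformed data and the lambda maximizing the
# log-likelihood -- the value a Box-Cox normality plot locates graphically.
transformed, lam = stats.boxcox(skewed)

skew_before = stats.skew(skewed)
skew_after = stats.skew(transformed)
```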

Box Plot

Section 1.3.3.7

A box plot summarizes a dataset using its median, quartiles, and potential outliers in a compact graphical form. It is used to quickly compare distributions across groups and identify skewness and outliers.
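The numbers behind the display can be sketched directly; this example (with made-up data) computes the quartiles, the 1.5-IQR fences, and the points a box plot would flag as outliers:

```python
import numpy as np

data = np.array([2.1, 2.4, 2.5, 2.7, 2.8, 3.0, 3.1, 3.3, 3.6, 9.5])

q1, med, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1                                    # interquartile range
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]  # [9.5]
```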

Complex Demodulation

Section 1.3.3.8

Complex demodulation plots display the amplitude and phase of a sinusoidal component in a time series at a specified frequency. They are used to detect whether the amplitude or phase of a cyclic signal changes over time.

Contour Plot

Section 1.3.3.10

A contour plot represents a three-dimensional surface by plotting constant-$z$ slices, called contours, in two dimensions. It is an alternative to a 3-D surface plot.

DOE Plots

Section 1.3.3.11

DOE plots include the DOE scatter plot, DOE mean plot, and DOE standard deviation plot used in designed experiments. They visualize factor effects on location and spread to determine which factors are statistically significant.

Histogram

Section 1.3.3.14

A histogram is a graphical summary of the frequency distribution of a single variable. It displays the shape, center, and spread of a dataset by dividing the data range into bins and counting observations in each bin.

Lag Plot

Section 1.3.3.15

A lag plot displays each observation against the observation at a fixed lag interval. It is used to check for randomness, serial correlation, and non-linear structure in time-ordered data.

Linear Plots

Section 1.3.3.16

Linear plots include the linear correlation plot, linear intercept plot, linear slope plot, and linear residual standard deviation plot. They are used to assess how well a linear model fits the data and how its parameters vary across subsets.

Mean Plot

Section 1.3.3.20

A mean plot displays the group means for a factor variable, with an overall reference line for the grand mean. It is used to determine whether a factor has a significant effect on the response variable.

Normal Probability Plot

Section 1.3.3.21

A normal probability plot displays the sorted data values against their expected normal quantiles. It is used to assess whether data follow a normal distribution, with deviations from the reference line indicating non-normality.
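The plotted points can be computed with Filliben's approximation to the uniform order statistic medians; this sketch (synthetic data, illustrative names) also computes the correlation of the points, which is what the PPCC plot below generalizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.sort(rng.normal(size=100))        # ordered sample

# Filliben's approximation to the uniform order statistic medians,
# mapped through the normal quantile function.
n = x.size
m = (np.arange(1, n + 1) - 0.3175) / (n + 0.365)
m[0] = 1 - 0.5 ** (1.0 / n)
m[-1] = 0.5 ** (1.0 / n)
theoretical = stats.norm.ppf(m)

# Correlation of the plotted points: near 1 for normally distributed data.
r = np.corrcoef(theoretical, x)[0, 1]
```

`scipy.stats.probplot` packages the same computation.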

Probability Plot

Section 1.3.3.22

A probability plot displays the sorted data values against the theoretical quantiles of a specified distribution. It is used to assess whether data follow any distribution family, not just the normal.

PPCC Plot

Section 1.3.3.23

A probability plot correlation coefficient (PPCC) plot displays the correlation from a probability plot for a family of distributions indexed by a shape parameter. It is used to identify the best-fitting distribution or optimal transformation.

Q-Q Plot

Section 1.3.3.24

A quantile-quantile (Q-Q) plot compares the quantiles of two data sets to determine if they come from populations with a common distribution. It is similar to a probability plot, but compares two empirical samples rather than one sample against a theoretical distribution.
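A minimal sketch of the construction, using two synthetic samples drawn from the same population (seeds and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
a = rng.normal(loc=0.0, scale=1.0, size=300)
b = rng.normal(loc=0.0, scale=1.0, size=300)   # same population as a

# A two-sample Q-Q plot graphs these quantile pairs against each other;
# points falling near the 45-degree line suggest a common distribution.
probs = np.linspace(0.01, 0.99, 99)
qa = np.quantile(a, probs)
qb = np.quantile(b, probs)

r = np.corrcoef(qa, qb)[0, 1]   # near 1 when the distributions agree
```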

Run-Sequence Plot

Section 1.3.3.25

A run-sequence plot displays the data values in the order they were collected, with the vertical axis showing the response and the horizontal axis the run order. It is used to detect shifts in location, scale, or the presence of outliers over time.

Scatter Plot

Section 1.3.3.26

A scatter plot displays the relationship between two quantitative variables by plotting data points on a two-dimensional graph. It is used to identify patterns, trends, correlations, and outliers in bivariate data.

Scatterplot Matrix

Section 1.3.3.26.11

A scatterplot matrix displays all pairwise scatter plots of variables in a dataset arranged in a grid. It is used to explore multivariate relationships and detect pairwise correlations, clusters, and outliers.

Conditioning Plot

Section 1.3.3.26.12

A conditioning plot (coplot) is a plot of two variables conditional on the value of a third variable. It is used to explore how a bivariate relationship changes across levels of a third variable.

Spectral Plot

Section 1.3.3.27

A spectral plot displays the power spectrum of a time series, showing the contribution of each frequency component to the overall variance. It is used to detect dominant periodicities and cyclic behavior in time-ordered data.

Standard Deviation Plot

Section 1.3.3.28

A standard deviation plot displays the group standard deviations versus group identifier, with an overall reference line. It is the scale counterpart to the mean plot, used to detect whether variability is constant across groups or changing over time.

Star Plot

Section 1.3.3.29

A star plot displays multivariate data as a series of equi-angular spokes radiating from a center point, with each spoke representing a variable. It is used to compare multiple observations across many variables simultaneously.

Weibull Plot

Section 1.3.3.30

A Weibull plot is a specialized probability plot for assessing whether data follow a Weibull distribution and for estimating the shape and scale parameters. It is used in reliability engineering and failure analysis.

Youden Plot

Section 1.3.3.31

A Youden plot compares paired measurements from two runs or conditions by plotting them against each other with reference lines for the medians. It is used in interlaboratory studies to distinguish between within-lab and between-lab variability.

4-Plot

Section 1.3.3.32

A 4-plot combines a run-sequence plot, lag plot, histogram, and normal probability plot in a single display. It is used as a quick screening tool to simultaneously check the four underlying assumptions of a univariate dataset.

6-Plot

Section 1.3.3.33

A 6-plot is a regression diagnostic display with six panels: response vs predictor, residuals vs predictor, residuals vs predicted values, lag plot of residuals, histogram of residuals, and normal probability plot of residuals. It is used to assess model adequacy after fitting a regression model.

Quantitative Methods

Measures of Location

Section 1.3.5.1

Measures of location summarize the central tendency of a dataset using statistics such as the mean, median, and mode. They are used to characterize where the center of a distribution lies.

Confidence Limits for the Mean

Section 1.3.5.2

Confidence limits define an interval that contains the true population mean with a specified level of confidence. They are used to quantify the uncertainty in a sample mean estimate.
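The standard two-sided $t$-interval can be sketched as follows (the sample here is synthetic and the 95% level is chosen for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
x = rng.normal(loc=10.0, scale=2.0, size=25)   # hypothetical measurements

n = x.size
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)                # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)          # two-sided 95% critical value

lower, upper = mean - t_crit * se, mean + t_crit * se
```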

Two-Sample t-Test

Section 1.3.5.3

The two-sample t-test determines whether the means of two independent groups differ significantly. It is used to compare location parameters when the data are approximately normally distributed.
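A short sketch with synthetic groups whose true means differ by one standard deviation (`equal_var=False` selects the Welch variant, which does not assume equal variances):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group1 = rng.normal(loc=5.0, scale=1.0, size=40)
group2 = rng.normal(loc=6.0, scale=1.0, size=40)   # shifted mean

t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)
# Small p_value -> reject the hypothesis of equal means.
```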

One-Factor ANOVA

Section 1.3.5.4

One-factor analysis of variance tests whether the means of three or more groups differ significantly. It is used when comparing location parameters across multiple levels of a single factor.

Multi-Factor ANOVA

Section 1.3.5.5

Multi-factor analysis of variance tests for main effects and interactions among two or more factors simultaneously. It is used in designed experiments to identify which factors and factor combinations significantly affect the response.

Measures of Scale

Section 1.3.5.6

Measures of scale quantify the spread or variability of a dataset using statistics such as the standard deviation, variance, and range. They are used to characterize how dispersed the data are around the center.

Bartlett's Test

Section 1.3.5.7

Bartlett's test assesses whether several groups have equal variances, assuming the data are normally distributed. It is used to verify the homogeneity of variance assumption before applying ANOVA or t-tests.

Chi-Square Test for Standard Deviation

Section 1.3.5.8

The chi-square test for the standard deviation tests whether a population standard deviation equals a specified value. It is used to assess whether the variability of a process meets a target specification.

F-Test for Equality of Two Variances

Section 1.3.5.9

The F-test compares the variances of two independent groups to determine if they are significantly different. It is used to check the equal variance assumption before performing a two-sample t-test.

Levene Test for Equality of Variances

Section 1.3.5.10

The Levene test assesses whether multiple groups have equal variances without requiring normality. It is used as a robust alternative to Bartlett's test when the data may not be normally distributed.

Measures of Skewness and Kurtosis

Section 1.3.5.11

Skewness measures the asymmetry of a distribution, while kurtosis measures the heaviness of its tails relative to a normal distribution. They are used to characterize the shape of a dataset beyond location and scale.

Autocorrelation

Section 1.3.5.12

The autocorrelation coefficient quantifies the linear dependence between observations at different time lags. It is used to test whether successive measurements are statistically independent or exhibit serial correlation.

Runs Test for Randomness

Section 1.3.5.13

The runs test determines whether the order of observations above and below the median is random. It is a non-parametric test used to detect trends, oscillations, or other departures from randomness.
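A sketch of the normal-approximation version of the test (the `runs_test` helper and seeded series are illustrative): a random walk produces far fewer runs about its median than a random series, so its p-value is far smaller.

```python
import numpy as np
from scipy import stats

def runs_test(x):
    """Runs test for randomness about the median (normal approximation)."""
    x = np.asarray(x, dtype=float)
    signs = x > np.median(x)                       # above / at-or-below median
    runs = 1 + int(np.count_nonzero(signs[1:] != signs[:-1]))
    n1, n2 = int(signs.sum()), int((~signs).sum())
    mu = 2 * n1 * n2 / (n1 + n2) + 1               # expected number of runs
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
           / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    z = (runs - mu) / np.sqrt(var)
    return z, 2 * stats.norm.sf(abs(z))

rng = np.random.default_rng(2)
z_rand, p_rand = runs_test(rng.normal(size=200))               # random series
z_trend, p_trend = runs_test(np.cumsum(rng.normal(size=200)))  # random walk
```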

Anderson-Darling Test

Section 1.3.5.14

The Anderson-Darling test assesses whether a dataset follows a specified probability distribution, with particular sensitivity in the tails. It is used as a formal goodness-of-fit test complementing graphical methods.

Chi-Square Goodness-of-Fit Test

Section 1.3.5.15

The chi-square goodness-of-fit test determines whether observed frequency counts match expected counts under a hypothesized distribution. It is used for both continuous and discrete distribution testing with binned data.

Kolmogorov-Smirnov Goodness-of-Fit Test

Section 1.3.5.16

The Kolmogorov-Smirnov test compares the empirical cumulative distribution function with a theoretical one or with another sample. It is used as a distribution-free goodness-of-fit test based on the maximum distance between CDFs.

Grubbs' Test for Outliers

Section 1.3.5.17.1

Grubbs' test detects a single outlier in a univariate dataset assumed to come from a normally distributed population. It is used to formally test whether the most extreme value in a sample is statistically aberrant.
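A sketch of the test statistic and the two-sided critical value derived from the $t$-distribution (helper names and the injected outlier are illustrative):

```python
import numpy as np
from scipy import stats

def grubbs_statistic(x):
    """G = max |x_i - xbar| / s."""
    x = np.asarray(x, dtype=float)
    return float(np.max(np.abs(x - x.mean())) / x.std(ddof=1))

def grubbs_critical(n, alpha=0.05):
    """Two-sided critical value via the t-distribution."""
    t = stats.t.ppf(1 - alpha / (2 * n), df=n - 2)
    return (n - 1) / np.sqrt(n) * np.sqrt(t**2 / (n - 2 + t**2))

rng = np.random.default_rng(9)
clean = rng.normal(size=30)
spiked = np.append(clean, 8.0)   # inject an obvious outlier

g_clean = grubbs_statistic(clean)
g_spiked = grubbs_statistic(spiked)   # exceeds the critical value
```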

Yates Analysis for Designed Experiments

Section 1.3.5.18

Yates analysis is an efficient algorithm for computing main effects and interactions in two-level full factorial experiments. It is used to systematically estimate all factor effects from a $2^k$ factorial design.
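The algorithm itself is a short fold: repeat $k$ passes of pairwise sums and differences, then scale. The sketch below runs it on a hypothetical $2^2$ design in standard order.

```python
import numpy as np

def yates(y):
    """Yates algorithm: grand mean and effect estimates for a 2^k full
    factorial, with responses y given in standard (Yates) order."""
    y = np.asarray(y, dtype=float)
    n = y.size
    k = int(round(np.log2(n)))
    for _ in range(k):
        # first half: pairwise sums; second half: pairwise differences
        y = np.concatenate([y[0::2] + y[1::2], y[1::2] - y[0::2]])
    effects = y.copy()
    effects[0] /= n          # grand mean
    effects[1:] /= n / 2     # factor and interaction effects
    return effects

# Responses at (1), a, b, ab for a hypothetical 2^2 design.
result = yates([8, 12, 9, 13])   # -> [10.5, 4.0, 1.0, 0.0]
```

Here factor A shifts the response by 4, factor B by 1, and the AB interaction is zero.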

Probability Distributions

Normal Distribution

Section 1.3.6.6.1

The normal (Gaussian) distribution is the most important continuous probability distribution, characterized by its symmetric bell-shaped curve. It is fully defined by its mean $\mu$ and standard deviation $\sigma$ and arises naturally via the central limit theorem.

Uniform Distribution

Section 1.3.6.6.2

The uniform distribution assigns equal probability to all values within a specified interval $[a, b]$. It is the simplest continuous distribution and serves as a baseline for random number generation.

Cauchy Distribution

Section 1.3.6.6.3

The Cauchy distribution is a symmetric distribution with heavy tails, centered at location $x_0$ with scale $\gamma$. Its mean and variance are undefined, making it a canonical example of a pathological distribution in statistics.

Student's t-Distribution

Section 1.3.6.6.4

Student's $t$-distribution arises when estimating the mean of a normally distributed population with small sample sizes. It has $\nu$ degrees of freedom and approaches the normal distribution as $\nu \to \infty$.

F-Distribution

Section 1.3.6.6.5

The $F$-distribution is the ratio of two scaled chi-squared variables with $d_1$ and $d_2$ degrees of freedom. It is used in analysis of variance (ANOVA) and for comparing variances of two populations.

Chi-Square Distribution

Section 1.3.6.6.6

The chi-square distribution with $k$ degrees of freedom is the distribution of a sum of squares of $k$ independent standard normal random variables. It is fundamental to hypothesis testing and confidence interval estimation.

Exponential Distribution

Section 1.3.6.6.7

The exponential distribution models the time between events in a Poisson process with rate $\lambda$. It is a one-parameter distribution commonly used for reliability analysis and waiting time problems.

Weibull Distribution

Section 1.3.6.6.8

The Weibull distribution is a versatile distribution used in reliability engineering and failure analysis. It can model increasing ($\alpha > 1$), decreasing ($\alpha < 1$), or constant ($\alpha = 1$) failure rates depending on its shape parameter $\alpha$.

Lognormal Distribution

Section 1.3.6.6.9

The lognormal distribution describes a random variable $X$ whose logarithm $\ln X$ is normally distributed with parameters $\mu$ and $\sigma$. It is commonly used to model positive-valued data with right skew, such as income, stock prices, and particle sizes.

Birnbaum-Saunders (Fatigue Life) Distribution

Section 1.3.6.6.10

The Birnbaum-Saunders distribution models fatigue life of materials subject to cyclic stress, with shape $\alpha$ and scale $\beta$. It is derived from a physical crack-growth model and is widely used in reliability engineering.

Gamma Distribution

Section 1.3.6.6.11

The gamma distribution is a two-parameter family of continuous distributions with shape $\alpha$ and scale $\beta$ that generalizes the exponential and chi-square distributions. It is used to model waiting times, rainfall amounts, and insurance claims.

Double Exponential (Laplace) Distribution

Section 1.3.6.6.12

The double exponential (Laplace) distribution is a symmetric distribution centered at $\mu$ with scale $\beta$, having heavier tails than the normal distribution. It is the distribution of the difference of two independent exponential random variables.

Power Normal Distribution

Section 1.3.6.6.13

The power normal distribution is used in reliability analysis to model the minimum of $p$ independent normal lifetimes. When $p = 1$ it reduces to the standard normal distribution.

Power Lognormal Distribution

Section 1.3.6.6.14

The power lognormal distribution generalizes the lognormal distribution for reliability analysis, modeling the minimum of $p$ independent lognormal lifetimes with scale $\sigma$. It provides additional shape flexibility beyond the lognormal.

Tukey-Lambda Distribution

Section 1.3.6.6.15

The Tukey-Lambda distribution is a symmetric family defined by its quantile function. By varying the shape parameter $\lambda$, it can approximate the normal, logistic, Cauchy, and uniform distributions.

Extreme Value Type I (Gumbel) Distribution

Section 1.3.6.6.16

The extreme value type I (Gumbel) distribution models the maximum or minimum of a large number of independent samples, with location $\mu$ and scale $\beta$. It is widely used in hydrology, meteorology, and structural engineering for modeling extreme events.

Beta Distribution

Section 1.3.6.6.17

The beta distribution is a continuous distribution defined on the interval $[0,\,1]$ with shape parameters $\alpha$ and $\beta$. It is commonly used to model proportions, probabilities, and random variables with bounded support.

Binomial Distribution

Section 1.3.6.6.18

The binomial distribution models the number of successes in $n$ independent Bernoulli trials, each with success probability $p$. It is the foundation for binary outcome analysis and quality control sampling.

Poisson Distribution

Section 1.3.6.6.19

The Poisson distribution models the number of events occurring in a fixed interval of time or space when events occur independently at a constant average rate $\lambda$. It is used for count data in fields ranging from telecommunications to epidemiology.
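A quick numerical illustration (the rate $\lambda = 3$ and seed are arbitrary): simulated Poisson counts have mean and variance both near $\lambda$, and `scipy.stats.poisson.pmf` gives exact point probabilities.

```python
import numpy as np
from scipy import stats

lam = 3.0
rng = np.random.default_rng(11)
counts = rng.poisson(lam, size=10000)

empirical_mean = counts.mean()        # near lambda
empirical_var = counts.var()          # also near lambda (equidispersion)
p2 = stats.poisson.pmf(2, mu=lam)     # P(X = 2) = e^-3 * 3^2 / 2!
```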

Case Studies

Beam Deflections Case Study

Section 1.4.2

An EDA case study analyzing the NIST LEW.DAT dataset of 200 beam deflection measurements. It demonstrates the detection of periodic structure and a failure of the randomness assumption.

Ceramic Strength Case Study

Section 1.4.2

An EDA case study analyzing the NIST JAHANMI2.DAT ceramic strength data. It demonstrates the analysis of a designed experiment with batch, lab, and machining factor effects.

Fatigue Life of Aluminum Alloy Specimens

Section 1.4.2

An EDA case study analyzing the NIST BIRNSAUN.DAT fatigue life data. It demonstrates probabilistic model selection for reliability engineering.

Filter Transmittance Case Study

Section 1.4.2

An EDA case study analyzing NIST filter transmittance data. It demonstrates the detection of non-randomness caused by a too-fast sampling rate during data acquisition.

Heat Flow Meter 1 Case Study

Section 1.4.2

An EDA case study analyzing the NIST ZARR13.DAT heat flow meter calibration data. It demonstrates a well-behaved univariate measurement process.

Josephson Junction Cryothermometry

Section 1.4.2

An EDA case study analyzing the NIST SOULEN.DAT voltage count data. It demonstrates univariate analysis with discrete integer measurements and mild assumption violations.

Normal Random Numbers Case Study

Section 1.4.2

An EDA case study analyzing 500 normal random numbers from a Rand Corporation publication. It demonstrates assumption verification techniques.

Random Walk Case Study

Section 1.4.2

An EDA case study analyzing the NIST RANDWALK.DAT dataset, a cumulative sum of uniform random numbers. It demonstrates the detection of non-stationary location and a violation of the fixed-location assumption.

Standard Resistor Case Study

Section 1.4.2

An EDA case study analyzing NIST standard resistor data. It demonstrates the detection of drift in location, non-constant variation, and non-randomness caused by seasonal humidity effects on measurement equipment.

Uniform Random Numbers Case Study

Section 1.4.2

An EDA case study analyzing the NIST RANDU.DAT dataset. It demonstrates the detection of a non-normal underlying distribution.

Reference

About This Encyclopedia

Learn about the methodology behind Exploratory Data Analysis, how this encyclopedia was built, and why EDA remains essential in the age of machine learning. The companion blog post covers all six sections, the technical implementation, and practical guidance for applying these techniques to your own datasets.

Read: Exploratory Data Analysis — A Visual Encyclopedia