
Exploratory Data Analysis: A Visual Encyclopedia


Every dataset has a story to tell. The challenge is knowing how to listen. Before fitting models, running hypothesis tests, or building dashboards, there is a critical first step that separates rigorous analysis from educated guessing: Exploratory Data Analysis.

EDA is the practice of using visual and quantitative techniques to understand the shape, structure, and surprises hidden in your data before committing to any formal statistical model. John Tukey formalized the approach in the 1970s, and NIST later codified it in their Engineering Statistics Handbook, which remains the most comprehensive public-domain reference on the subject.

After spending months implementing every technique and distribution from the NIST handbook as interactive visualizations, I published the EDA Visual Encyclopedia, a 90+ page reference that turns static statistical theory into explorable, parameterized tools. This post walks through what is inside, why it matters, and how I built it.

What Is Exploratory Data Analysis?

EDA is a philosophy more than a procedure. Where classical statistics starts with a hypothesis and tests it against data, EDA starts with data and asks: what is here? The foundations section of the encyclopedia covers this distinction in detail, but the core idea is simple. Look at your data from as many angles as possible before making assumptions.

Tukey argued that the most dangerous thing a statistician could do was apply a model to data without first understanding its structure. A normal distribution assumption applied to bimodal data produces meaningless confidence intervals. A linear regression fitted to data with outliers produces misleading slopes. A time series model applied to data with a hidden seasonal component produces forecasts that are worse than random guessing.

EDA catches these problems before they become expensive mistakes. The assumptions page explains the four key assumptions that underlie most statistical techniques (randomness, fixed distribution, fixed location, and fixed variation), and the consequences of violated assumptions page demonstrates what happens when you skip this step.

The role of graphics in EDA cannot be overstated. A single histogram reveals distributional shape, outliers, gaps, and clustering patterns that would require dozens of summary statistics to capture numerically. The 4-plot technique, which combines a run sequence plot, lag plot, histogram, and normal probability plot on a single page, can diagnose most common data pathologies in seconds.

What Is Inside the Encyclopedia

The EDA Visual Encyclopedia is organized into six sections, each covering a distinct aspect of exploratory analysis. Together they form a complete reference for anyone working with data.

Graphical Techniques: 29 Interactive Visualizations

The largest section covers 29 graphical techniques ranging from fundamental tools like histograms, scatter plots, and box plots to specialized diagnostic plots like autocorrelation plots, normal probability plots, and Weibull plots.

Each technique page includes an interactive SVG visualization generated from realistic sample data, a clear explanation of what the plot reveals, interpretation guidelines, and variant visualizations where applicable. The scatter plot page shows both the basic scatter and a version with a confidence band, while the box plot page demonstrates single-group and multi-group comparisons.

A few specialized techniques deserve particular attention:

  • The PPCC plot (Probability Plot Correlation Coefficient) is a powerful but underused tool for identifying the best-fitting distribution family. It sweeps across shape parameters, and the peak correlation indicates the closest distributional match.

  • The star plot provides a multivariate visualization that maps each variable to a spoke of a radial chart, making it possible to compare multiple observations across many dimensions at once.

  • The spectral plot uses power spectral density estimation to detect hidden periodicities in time series data. This is essential for manufacturing quality control and signal processing.

  • The conditioning plot reveals how the relationship between two variables changes across levels of a third, which is critical for detecting interaction effects.
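
To make the PPCC idea concrete, here is a minimal sketch of the sweep for the Weibull family in TypeScript. The helper names are mine, not the encyclopedia's API, and the plotting positions use the simple (i − 0.5)/n rule rather than Filliben's median approximation; scale and location drop out of the correlation, so only the shape parameter needs sweeping.

```typescript
// Pearson correlation between two equal-length arrays.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (x[i] - mx) * (y[i] - my);
    sxx += (x[i] - mx) ** 2;
    syy += (y[i] - my) ** 2;
  }
  return sxy / Math.sqrt(sxx * syy);
}

// Theoretical Weibull quantile with unit scale; k is the swept shape parameter.
const weibullQuantile = (p: number, k: number) => (-Math.log(1 - p)) ** (1 / k);

// Sweep candidate shapes; the peak correlation marks the best-fitting shape.
function ppccWeibull(data: number[], shapes: number[]): { shape: number; r: number } {
  const sorted = [...data].sort((a, b) => a - b);
  const n = sorted.length;
  // Simple uniform plotting positions (i - 0.5) / n.
  const p = sorted.map((_, i) => (i + 0.5) / n);
  let best = { shape: shapes[0], r: -1 };
  for (const k of shapes) {
    const q = p.map((pi) => weibullQuantile(pi, k));
    const r = pearson(q, sorted);
    if (r > best.r) best = { shape: k, r };
  }
  return best;
}
```

Running this on data drawn from a Weibull with shape 2 should produce a peak at (or very near) 2, which is exactly what the PPCC plot displays graphically.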

Quantitative Methods: 18 Statistical Tests

The quantitative methods section covers 18 statistical procedures organized by purpose: measures of location (mean, median, mode), measures of scale (standard deviation, range, IQR), hypothesis tests for comparing groups (two-sample t-test, one-factor ANOVA, multi-factor ANOVA), tests for equal variance (Bartlett’s test, Levene’s test, F-test), and goodness-of-fit tests (Anderson-Darling, chi-square, Kolmogorov-Smirnov).

Each method page includes the mathematical formulation rendered with KaTeX, step-by-step interpretation guidance, and connections to related graphical techniques. The skewness and kurtosis page is particularly useful as a quick diagnostic. Skewness values far from zero indicate asymmetry, and kurtosis values far from three indicate heavy or light tails relative to the normal distribution.
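
As a concrete version of that diagnostic, here is a small TypeScript sketch of the moment-based definitions. Note that this is raw kurtosis (the fourth standardized moment), not excess kurtosis, so the normal reference value is 3, not 0:

```typescript
// Moment-based skewness and kurtosis. Skewness far from 0 indicates
// asymmetry; kurtosis far from 3 indicates heavy or light tails relative
// to the normal distribution.
function moments(data: number[]): { skewness: number; kurtosis: number } {
  const n = data.length;
  const mean = data.reduce((a, b) => a + b, 0) / n;
  const dev = data.map((x) => x - mean);
  const m2 = dev.reduce((a, d) => a + d * d, 0) / n;
  const m3 = dev.reduce((a, d) => a + d ** 3, 0) / n;
  const m4 = dev.reduce((a, d) => a + d ** 4, 0) / n;
  return { skewness: m3 / m2 ** 1.5, kurtosis: m4 / (m2 * m2) };
}
```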

Probability Distributions: 19 Interactive Explorers

The probability distributions section is where the encyclopedia becomes genuinely interactive. Each of the 19 distributions, from the ubiquitous normal distribution to specialized ones like the Tukey-Lambda and fatigue-life (Birnbaum-Saunders), includes a client-side distribution explorer built with D3.

The explorer renders both the PDF (probability density function) and CDF (cumulative distribution function) in real time as you adjust parameter sliders. Change the shape parameter of a Weibull distribution and watch the density curve transform from exponential-like to approximately normal. Adjust the degrees of freedom of a chi-square distribution and observe the skewness decrease as the distribution approaches normality.
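
Under the hood, each slider move reduces to re-evaluating the PDF and CDF on a grid of x values. A hedged sketch of what that looks like for the Weibull case (function and parameter names here are illustrative, not the explorer's actual code):

```typescript
// Weibull PDF and CDF; k = shape, lambda = scale.
const weibullPdf = (x: number, k: number, lambda: number) =>
  x < 0 ? 0 : (k / lambda) * (x / lambda) ** (k - 1) * Math.exp(-((x / lambda) ** k));

const weibullCdf = (x: number, k: number, lambda: number) =>
  x < 0 ? 0 : 1 - Math.exp(-((x / lambda) ** k));

// Sample both curves on a grid, ready to hand to a line generator.
function curve(k: number, lambda: number, xMax = 5, steps = 200) {
  return Array.from({ length: steps + 1 }, (_, i) => {
    const x = (i / steps) * xMax;
    return { x, pdf: weibullPdf(x, k, lambda), cdf: weibullCdf(x, k, lambda) };
  });
}
```

With shape k = 1 this collapses to the exponential distribution, and as k grows toward 3–4 the density becomes approximately normal, which is exactly the transformation the slider animates.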

The mathematical formulas for each distribution are rendered with KaTeX directly in the page. The related distributions page maps the connections between all 19 distributions: which are special cases of others, which converge in the limit, and which arise as transformations.

Case Studies: 9 Real-World Analyses

The case studies section walks through 9 complete EDA analyses on real engineering datasets from the NIST archive. These are not toy examples. The ceramic strength study analyzes a designed experiment to uncover an unexpected batch effect in silicon nitride specimens. The heat flow meter study examines calibration data to confirm a well-behaved measurement process. The fatigue life study compares four candidate probability distributions for aluminum alloy failure-time data.

Each case study follows the NIST methodology: state the questions, examine the data graphically, apply quantitative tests, and draw conclusions. The beam deflections study is a great starting point because it shows how the autocorrelation plot and spectral analysis detect hidden periodic structure that summary statistics completely miss.

Foundations and Reference

The foundations section provides the conceptual framework: what EDA is, the role of graphics, underlying assumptions, what happens when assumptions fail, problem categories, and the 4-plot technique as a universal starting point.

The reference section includes cross-reference tables: techniques organized by category, distribution tables, related distribution maps, and analysis questions to guide technique selection.

Technical Implementation

Building a 90+ page interactive statistics reference as a static site presented several interesting technical challenges.

Build-Time SVG Generation

All graphical technique visualizations are generated at build time as inline SVG. There is no client-side rendering overhead for these pages because the SVGs are part of the HTML payload. The visualizations work without JavaScript, load instantly, and are fully indexable by search engines. For a reference site where most visitors will be reading and studying rather than interacting, this is the right tradeoff.

The SVG generators are written in TypeScript and use a shared coordinate system with consistent margins, axis labeling, and color palettes that automatically adapt to light and dark themes via CSS custom properties. Each generator takes a dataset and configuration object and returns a complete SVG string. The underlying statistics library implements kernel density estimation, FFT-based spectral analysis, linear regression, normal quantile computation, and autocorrelation in pure TypeScript with no runtime dependencies.
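
To illustrate the pattern (not the encyclopedia's actual generators), a stripped-down build-time SVG generator might look like this: a pure function from data plus a config object to an SVG string.

```typescript
// Illustrative config shape; the real generators share margins, axes,
// and palettes across all techniques.
interface BarChartConfig { width: number; height: number; margin: number }

function barChartSvg(values: number[], cfg: BarChartConfig): string {
  const { width, height, margin } = cfg;
  const innerW = width - 2 * margin;
  const innerH = height - 2 * margin;
  const max = Math.max(...values);
  const barW = innerW / values.length;
  const bars = values
    .map((v, i) => {
      const h = (v / max) * innerH;
      const x = margin + i * barW;
      const y = margin + innerH - h;
      // currentColor lets CSS custom properties handle light/dark theming.
      return `<rect x="${x}" y="${y}" width="${barW * 0.9}" height="${h}" fill="currentColor"/>`;
    })
    .join("");
  return `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 ${width} ${height}">${bars}</svg>`;
}
```

Because the output is just a string, it can be interpolated straight into the HTML at build time with no client-side rendering step.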

D3 Micro-Module Isolation

The interactive distribution explorers use D3 for real-time chart rendering, but the full D3 library weighs roughly 280KB minified (about 90KB gzipped). That is unacceptable for a site where most pages do not need it. Instead, the implementation uses D3 micro-modules (d3-scale, d3-shape, d3-axis) that total 17KB gzipped, and Astro’s client:visible directive ensures the JavaScript only loads when a distribution page scrolls into view.

Bundle analysis confirms that D3 is isolated to a single chunk that only loads on distribution pages. The 70+ technique, case study, and reference pages ship zero client-side JavaScript.

KaTeX for Mathematical Notation

Statistical formulas are rendered with KaTeX, which produces high-quality mathematical typesetting at build time. The remark-math plugin processes LaTeX notation in MDX files, and KaTeX CSS is conditionally loaded only on pages that contain formulas.

For .astro pages (as opposed to MDX), formulas use katex.renderToString() at build time, producing static HTML that requires no client-side JavaScript. The distribution explorer pages end up with interactive charts but static formulas, which is the best of both approaches.

Why EDA Still Matters in the Age of AI

It is tempting to think that modern machine learning makes EDA obsolete. If a neural network can learn arbitrary nonlinear relationships from data, why bother plotting histograms and running normality tests?

The answer is that EDA is not only about finding patterns. It is also about understanding data quality. A machine learning model trained on data with measurement errors, selection bias, or distributional anomalies will learn those artifacts and amplify them. The model will be confidently wrong, and you will not know why until the predictions fail in production.

Consider a practical example. You receive a dataset of sensor readings from a manufacturing process. You train a predictive model, it achieves 95% accuracy on the test set, and you deploy it. Two weeks later, the model’s predictions drift catastrophically. What happened? A quick run sequence plot would have revealed that the sensor readings have a systematic drift over time. The training data captured one regime, and the deployment data captured another. An autocorrelation plot would have confirmed that consecutive readings are not independent, violating the randomness assumption that your cross-validation implicitly relied on.
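
Both of those checks reduce to a few lines of code. Here is an illustrative TypeScript sketch (my own helper names, and no decision thresholds): a least-squares slope over the run sequence to flag drift, and the lag-1 autocorrelation to flag serial dependence.

```typescript
// Least-squares slope of y against its index: a nonzero slope over a
// run sequence suggests systematic drift over time.
function trendSlope(y: number[]): number {
  const n = y.length;
  const xm = (n - 1) / 2;
  const ym = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xm) * (y[i] - ym);
    den += (i - xm) ** 2;
  }
  return num / den;
}

// Lag-1 autocorrelation: values well away from zero mean consecutive
// readings are not independent, violating the randomness assumption.
function lag1Autocorrelation(y: number[]): number {
  const n = y.length;
  const m = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    den += (y[i] - m) ** 2;
    if (i < n - 1) num += (y[i] - m) * (y[i + 1] - m);
  }
  return num / den;
}
```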

EDA is the quality control layer. It catches the problems that no amount of model complexity can fix: sensors that drifted during data collection, outliers from equipment malfunction, missing data patterns that are not random, and distributional shifts between training and deployment environments. The Grubbs’ test can formally identify outliers. The runs test detects non-randomness. The Anderson-Darling test evaluates distributional fit. These are not outdated tools. They are the immune system of modern data science.
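
As an example of how lightweight these formal checks are, here is the Grubbs’ statistic in TypeScript. This sketch computes only the test statistic G = max|xᵢ − x̄| / s; the comparison against the t-based critical value is left out.

```typescript
// Grubbs' test statistic for a single suspected outlier:
// G = max |x_i - mean| / s, with s the sample standard deviation.
function grubbsStatistic(data: number[]): number {
  const n = data.length;
  const mean = data.reduce((a, b) => a + b, 0) / n;
  const s = Math.sqrt(data.reduce((a, x) => a + (x - mean) ** 2, 0) / (n - 1));
  return Math.max(...data.map((x) => Math.abs(x - mean))) / s;
}
```

A large G relative to the critical value for the sample size flags the most extreme point as a statistically significant outlier rather than ordinary variation.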

In machine learning pipelines specifically, EDA serves three critical functions. First, it validates that the training data is representative of the deployment environment. Second, it identifies feature engineering opportunities that improve model performance far more than hyperparameter tuning. Third, it provides the diagnostic framework for debugging model failures when they inevitably occur. A scatter plot matrix of feature pairs reveals collinearity, non-linear relationships, and clustering structure that directly inform model selection.

Getting Started with the Encyclopedia

If you are new to EDA, I recommend starting with the foundations section to understand the philosophy, then working through the 4-plot technique, which gives you a universal starting point for any dataset. From there, the beam deflections case study demonstrates how autocorrelation and spectral analysis uncover periodic structure that simple summary statistics miss entirely.

For experienced practitioners, the reference tables provide quick lookups organized by analysis goal, and the distribution explorer is useful for comparing distributional fits and understanding parameter sensitivity.

The EDA Visual Encyclopedia puts every tool from the NIST handbook at your fingertips: interactive, visual, and immediately applicable. Whether you are a data scientist validating a training dataset, a quality engineer analyzing manufacturing measurements, or a student learning statistics for the first time, the techniques in this reference will make your analysis more rigorous and your conclusions more trustworthy.

Explore the full EDA Visual Encyclopedia to see all 90+ pages in action.
