Histogram

NIST/SEMATECH Section 1.3.3.14 Histogram

What It Is

A histogram is a graphical summary of the frequency distribution of a single variable, constructed by dividing the data range into contiguous intervals called bins and drawing rectangles whose heights represent the count or proportion of observations falling in each bin. An optional kernel density estimation (KDE) overlay provides a smooth estimate of the underlying probability density function.

The bins are contiguous, non-overlapping, and of equal width; $k$ denotes the number of bins and $h$ the bin width. The number of bins affects the visual impression: too few bins over-smooth and hide structure, while too many produce a noisy display. The Freedman-Diaconis rule ($h = 2 \cdot \text{IQR} \cdot N^{-1/3}$) and Sturges’ rule ($k = 1 + \log_2 N$) provide automatic defaults, where $N$ is the sample size and IQR is the interquartile range.

Questions This Plot Answers

What kind of population distribution do the data come from?
Where are the data located (center)?
How spread out are the data?
Are the data symmetric or skewed?
Are there outliers in the data?

Why It Matters

The histogram is the foundational graphical technique in exploratory data analysis. It provides the most direct visual answer to the question "what does my data look like?" and is the prerequisite for choosing appropriate statistical methods, since nearly every statistical procedure depends on distributional shape assumptions.

When to Use a Histogram

Use a histogram as the primary tool for assessing the shape, center, and spread of a univariate dataset. It answers questions about symmetry, skewness, modality, tail behavior, and the presence of outliers or gaps. Histograms are routinely used in process characterization, incoming inspection, and data screening to form an initial impression of the data before applying more specialized techniques.

How to Interpret a Histogram

The horizontal axis represents the measurement scale and the vertical axis represents frequency or relative frequency. A bell-shaped histogram centered on the mean suggests approximate normality. A histogram that peaks on one side and trails off on the other indicates skewness: a long right tail means right skew and a long left tail means left skew. Two or more distinct peaks indicate a bimodal or multimodal distribution, often arising from mixed populations or distinct operating conditions. Short-tailed histograms, where observations are tightly concentrated around the center, suggest a distribution with lighter tails than the normal. Long-tailed histograms, where observations extend far from the center, suggest heavier tails. A histogram with a single extreme observation far from the main body of data flags a potential outlier. Variant patterns include normal, short-tailed symmetric, long-tailed symmetric, bimodal, right-skewed, left-skewed, uniform, and distributions with outliers, each carrying distinct diagnostic meaning.

Examples

Normal

A bell-shaped, symmetric histogram with most frequency counts bunched in the middle and counts tapering smoothly in both tails. This is the classical moderate-tailed distribution, and it suggests that standard statistical methods (t-tests, confidence intervals, capability indices) are appropriate. Verify with a normal probability plot.
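The check suggested above can be made numeric. The sketch below computes the correlation that underlies a normal probability plot: sorted data against standard normal quantiles. The helper name `normal_ppcc` and the plotting positions (i - 0.5) / n are assumptions chosen for illustration; other conventions for the plotting positions exist.

```python
import numpy as np
from statistics import NormalDist

def normal_ppcc(x):
    """Correlation between sorted data and standard normal quantiles."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    # Plotting positions (i - 0.5) / n: one common convention, assumed here
    probs = (np.arange(1, n + 1) - 0.5) / n
    quantiles = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(quantiles, x)[0, 1]

rng = np.random.default_rng(8)
r_normal = normal_ppcc(rng.normal(loc=5.0, scale=1.5, size=200))
r_skewed = normal_ppcc(rng.exponential(scale=1.5, size=200))
print(f"normal sample: r = {r_normal:.3f}")
print(f"skewed sample: r = {r_skewed:.3f}")
```

A correlation close to one corresponds to a nearly straight probability plot; the skewed sample should score visibly lower.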

Short-Tailed

A symmetric histogram whose tails approach zero very fast, giving a truncated or "sawed-off" appearance. The classical short-tailed distribution is the uniform (rectangular). For short-tailed data, the midrange, (smallest + largest) / 2, is the best location estimator, not the sample mean. Verify with a uniform probability plot.
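The claim about the midrange can be illustrated with a small simulation. This is a sketch under assumed settings (Uniform(0, 1) data, 2000 repeated samples of size 50), not a proof: it compares how much each estimator scatters around the true center of 0.5.

```python
import numpy as np

rng = np.random.default_rng(0)

# 2000 simulated samples of size 50 from Uniform(0, 1); true center is 0.5
samples = rng.uniform(0.0, 1.0, size=(2000, 50))

means = samples.mean(axis=1)
midranges = (samples.min(axis=1) + samples.max(axis=1)) / 2

# Spread of each estimator across the repeated samples
print(f"std of sample mean: {means.std():.4f}")
print(f"std of midrange:    {midranges.std():.4f}")
```

The midrange should show the smaller spread, which is what makes it the preferred location estimator for uniform-like data.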

Long-Tailed

A symmetric histogram whose tails decline to zero very slowly, with probability extending far from the center. The classical long-tailed distribution is the Cauchy. For long-tailed data, the median is the best location estimator because the mean is heavily influenced by extreme observations. Robust methods or a heavy-tailed model (e.g., t-distribution) may be needed.
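A short simulation shows why the median is preferred here. The settings are assumptions for illustration (standard Cauchy data, 2000 repeated samples of size 50). Because the sample mean of Cauchy data has no stable standard deviation, the sketch measures the scatter of each estimator with the interquartile range instead.

```python
import numpy as np

def iqr(values):
    """Interquartile range, used here as a robust measure of scatter."""
    q75, q25 = np.percentile(values, [75, 25])
    return q75 - q25

rng = np.random.default_rng(1)

# 2000 simulated samples of size 50 from a standard Cauchy; true center is 0
samples = rng.standard_cauchy(size=(2000, 50))

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)

print(f"IQR of sample means:   {iqr(means):.3f}")
print(f"IQR of sample medians: {iqr(medians):.3f}")
```

The medians should cluster far more tightly around zero than the means, which are dragged around by extreme observations.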

Bimodal (Sinusoidal)

Two peaks in a symmetric histogram caused by an underlying deterministic sinusoidal pattern in the data. Unlike a mixture of populations, this bimodality arises from cyclic behavior. Investigate with a run sequence plot, lag plot (an elliptical pattern confirms sinusoidality), or spectral plot to estimate the dominant frequency.
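To see why a sinusoid produces two peaks, the sketch below bins a noisy sine wave. The period, noise level, and bin count are arbitrary assumptions. A sine spends more time near its extremes than near zero, so the counts pile up toward both ends of the range.

```python
import numpy as np

rng = np.random.default_rng(2)

# Deterministic sinusoid (period 50) plus a little measurement noise
t = np.arange(1000)
y = np.sin(2 * np.pi * t / 50) + rng.normal(scale=0.1, size=t.size)

counts, edges = np.histogram(y, bins=10)
centers = (edges[:-1] + edges[1:]) / 2

# Text rendering of the histogram: one row per bin
for c, n in zip(centers, counts):
    print(f"{c:6.2f} | {'#' * (n // 5)}")
```

The middle rows should be the shortest, with the longest rows near the two ends.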

Bimodal (Mixture)

Two peaks where each mode has a rough bell-shaped component, indicating a mixture of two distinct populations or processes. Investigation should identify the physical source of the two modes (e.g., two machines, operators, or material batches). Fit a mixture model p × φ₁ + (1−p) × φ₂ to estimate the mixing proportion and component parameters.
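One way to fit the mixture is expectation-maximization (EM). The sketch below assumes both components φ₁ and φ₂ are normal densities, which the text does not require, and uses a hypothetical helper `fit_two_normals` with crude quartile-based starting values. The simulated data (60% from N(50, 5), 40% from N(80, 5)) are also assumptions.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def fit_two_normals(x, n_iter=200):
    """EM for p * N(mu1, s1) + (1 - p) * N(mu2, s2)."""
    # Crude starting values (an assumption): quartiles and the pooled spread
    mu1, mu2 = np.percentile(x, [25, 75])
    s1 = s2 = x.std()
    p = 0.5
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1
        w1 = p * normal_pdf(x, mu1, s1)
        w2 = (1 - p) * normal_pdf(x, mu2, s2)
        r = w1 / (w1 + w2)
        # M-step: weighted updates of proportion, means, and spreads
        p = r.mean()
        mu1 = np.sum(r * x) / np.sum(r)
        mu2 = np.sum((1 - r) * x) / np.sum(1 - r)
        s1 = np.sqrt(np.sum(r * (x - mu1) ** 2) / np.sum(r))
        s2 = np.sqrt(np.sum((1 - r) * (x - mu2) ** 2) / np.sum(1 - r))
    return p, mu1, s1, mu2, s2

rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(50, 5, 600), rng.normal(80, 5, 400)])

p, mu1, s1, mu2, s2 = fit_two_normals(x)
print(f"p = {p:.2f}, mu1 = {mu1:.1f}, s1 = {s1:.1f}, mu2 = {mu2:.1f}, s2 = {s2:.1f}")
```

With well-separated modes the estimates should land near the simulated values. Heavily overlapping modes make the fit sensitive to the starting values, so the result should be checked against the histogram.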

Right Skewed

The histogram peaks on the left side and has a long tail extending to the right. The mean is above the median. Right skew commonly arises from a natural lower bound, start-up effects, or reliability processes. Consider fitting a Weibull, lognormal, or gamma distribution.
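The relationship between mean and median is easy to check numerically. The sketch below uses simulated lognormal data, an assumed example of a right-skewed distribution, and a simple moment-based skewness coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)

# Lognormal data: bounded below by zero, with a long right tail
x = rng.lognormal(mean=0.0, sigma=0.75, size=5000)

mean = x.mean()
median = np.median(x)

# Moment-based sample skewness; positive values indicate right skew
skew = np.mean((x - mean) ** 3) / x.std() ** 3

print(f"mean = {mean:.3f}, median = {median:.3f}, skewness = {skew:.2f}")
```

The long right tail pulls the mean above the median, and the skewness coefficient comes out positive.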

Left Skewed

The histogram peaks on the right side and has a long tail extending to the left. The mean is below the median. Less common than right skew, left skew occurs in failure-time data with wear-out mechanisms or processes approaching an upper bound.

With Outlier

The main body of the histogram follows a recognizable pattern (often symmetric), but one or more bars appear isolated far from the bulk of the data. Outliers may indicate measurement errors, equipment failures, or genuine extreme events. Do not discard outliers without investigation. A box plot provides a more sensitive outlier display.
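A box plot typically marks points lying beyond fences set 1.5 × IQR outside the quartiles. The sketch below applies that convention numerically; the 1.5 multiplier and the simulated data with one planted extreme value are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# 200 well-behaved measurements plus one planted extreme value
x = np.append(rng.normal(100, 2, 200), 125.0)

q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged for investigation, not deletion
flagged = x[(x < lower) | (x > upper)]
print(f"fences: [{lower:.1f}, {upper:.1f}]")
print("flagged values:", np.round(flagged, 1))
```

A few ordinary points may also fall just outside the fences. That is expected for normal data and is one reason flagged values call for investigation rather than automatic removal.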

Assumptions and Limitations

The appearance of a histogram depends on the choice of bin width and starting point. Too few bins obscure detail, while too many bins produce a noisy display that is hard to interpret. Rules of thumb such as the Sturges, Freedman-Diaconis, or Scott rules provide reasonable defaults, but the analyst should experiment with several bin widths. The histogram does not perform well for very small samples, where a dot plot or stem-and-leaf display may be more informative.
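Experimenting with several rules is straightforward in NumPy, whose `histogram_bin_edges` function accepts rule names such as `"sturges"`, `"scott"`, and `"fd"` (Freedman-Diaconis). The sketch below compares them on simulated normal data; the sample itself is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=1000)

n_bins = {}
for rule in ("sturges", "scott", "fd"):
    edges = np.histogram_bin_edges(x, bins=rule)
    n_bins[rule] = len(edges) - 1
    width = edges[1] - edges[0]
    print(f"{rule:8s} bins = {n_bins[rule]:3d}, width = {width:.3f}")
```

The rules can disagree substantially on the same data, so it is worth plotting the histogram under more than one of them before drawing conclusions about shape.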

See It In Action

This technique is demonstrated in the following case studies:

Heat Flow Meter 1 Case Study

Reference: NIST/SEMATECH e-Handbook of Statistical Methods, Section 1.3.3.14

Formulas

Freedman-Diaconis Rule

$$h = 2 \cdot \text{IQR} \cdot N^{-1/3}$$

The optimal bin width h based on the interquartile range (IQR) and sample size N. This rule is robust to outliers because it uses the IQR rather than the standard deviation.
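The robustness claim can be checked directly. The sketch below computes the Freedman-Diaconis width with a hypothetical helper `fd_width`, then adds a single wild value to assumed simulated data and compares how the width and the standard deviation respond.

```python
import numpy as np

def fd_width(x):
    """Freedman-Diaconis bin width h = 2 * IQR * N^(-1/3)."""
    q75, q25 = np.percentile(x, [75, 25])
    return 2 * (q75 - q25) * len(x) ** (-1 / 3)

rng = np.random.default_rng(6)
x = rng.normal(loc=10.0, scale=2.0, size=500)
x_out = np.append(x, 1000.0)  # same data plus one wild value

h, h_out = fd_width(x), fd_width(x_out)
print(f"h without outlier:   {h:.3f}, with outlier: {h_out:.3f}")
print(f"std without outlier: {x.std():.2f}, with outlier: {x_out.std():.2f}")
```

The bin width barely moves, while the standard deviation is inflated many times over, so a rule based on the standard deviation would choose far wider bins for the contaminated data.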

Sturges’ Rule

$$k = 1 + \log_2 N$$

The number of bins k based on the sample size N. Simple and widely used, but tends to over-smooth for large samples.
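A quick calculation shows how slowly the bin count grows. The formula rarely yields an integer, so the sketch rounds up, which is a common convention and an assumption here. The function name `sturges_bins` is hypothetical.

```python
import math

def sturges_bins(n):
    """Number of bins from Sturges' rule, rounded up to an integer."""
    return math.ceil(1 + math.log2(n))

for n in (50, 1_000, 100_000):
    print(f"N = {n:>7,d} -> k = {sturges_bins(n)}")
```

Going from 1,000 to 100,000 observations adds only seven bins, which is why the rule tends to hide detail in large samples.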

Relative Frequency (Density Normalization)

$$f_i = \frac{n_i}{N \cdot h}$$

The density-normalized height for bin i, where n_i is the count, N is the total number of observations, and h is the bin width. Dividing the relative frequency n_i / N by the bin width h makes the total area under the histogram equal to one, so the histogram can be compared directly with a probability density function.
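The normalization can be verified numerically. The sketch below applies the formula to raw counts from `np.histogram`, checks that the bar areas sum to one, and compares the result with NumPy's own `density=True` option. The exponential sample and the choice of 12 bins are assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.exponential(scale=3.0, size=400)

# Raw counts n_i over 12 equal-width bins
counts, edges = np.histogram(x, bins=12)
h = edges[1] - edges[0]

# Density normalization: f_i = n_i / (N * h)
f = counts / (x.size * h)
area = np.sum(f * h)
print(f"total area under the histogram: {area:.6f}")

# NumPy applies the same normalization when density=True
density, _ = np.histogram(x, bins=edges, density=True)
print("matches density=True:", np.allclose(f, density))
```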

Python Example

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Generate bimodal data: mixture of two normals
rng = np.random.default_rng(42)
data = np.concatenate([
    rng.normal(loc=50, scale=5, size=300),
    rng.normal(loc=70, scale=8, size=200)
])

# Create a density-normalized histogram (total bar area equals one)
fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(data, bins=30, density=True, alpha=0.7,
        color='steelblue', edgecolor='white')

# Overlay a kernel density estimate using SciPy's default bandwidth
grid = np.linspace(data.min(), data.max(), 200)
ax.plot(grid, gaussian_kde(data)(grid), color='darkorange')

ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.set_title("Histogram with Bimodal Data")
plt.tight_layout()
plt.show()