Box Plot

NIST/SEMATECH Section 1.3.3.7 Box Plot

[Figure: box plots of transmittance (vertical axis, about 2.4 to 3.0) for Filter A, Filter B, Filter C, and Filter D]

What It Is

A box plot (Chambers 1983), also known as a box-and-whisker plot, is an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data. The central box spans from the lower quartile ($Q_1$, 25th percentile) to the upper quartile ($Q_3$, 75th percentile), representing the middle 50% of the data. A line inside the box marks the median, and whiskers extend from the quartiles to the most extreme data point within 1.5 IQR of the box (the inner fence).

A useful variation of the box plot more specifically identifies outliers using inner and outer fences. The box spans from $Q_1$ to $Q_3$ with the median marked inside, and the interquartile range ($\text{IQR}$) measures the spread of the middle 50%. Whiskers extend from each quartile to the most extreme data point within the inner fence (1.5 IQR from the box). Points beyond the inner fence but within the outer fence (3.0 IQR) are plotted as small circles (mild outliers), and points beyond the outer fence are plotted as large circles (extreme outliers). The exact fence formulas are given below.

Questions This Plot Answers

  • Is a factor significant?
  • Does the location differ between subgroups?
  • Does the variation differ between subgroups?
  • Are there any outliers?

Why It Matters

The box plot is an important EDA tool for determining if a factor has a significant effect on the response with respect to either location or variation. It is also an effective tool for summarizing large quantities of information.

When to Use a Box Plot

Use a box plot when comparing the location, spread, and symmetry of one or more groups in a compact graphical format. Box plots are particularly effective for side-by-side comparisons of multiple samples or factor levels, making them a staple of exploratory data analysis in quality engineering, process comparison, and designed experiments. A single box plot can be drawn for one batch of data with no distinct groups; alternatively, multiple box plots can be drawn together to compare multiple data sets or groups in a single data set.

How to Interpret a Box Plot

The position of the median line within the box reveals the symmetry of the distribution: a centered median indicates symmetry, a median closer to $Q_1$ suggests right skewness, and a median closer to $Q_3$ suggests left skewness. The length of the box shows the interquartile range and serves as a robust measure of spread. Whisker lengths indicate the range of the bulk of the data, and individual points plotted beyond the whiskers are candidate outliers deserving further investigation. When comparing multiple box plots, differences in box height indicate differing variability, while vertical offsets between median lines indicate differences in location.
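The median-position rule above can be turned into a rough numeric check. The following is a minimal sketch, not a formal skewness test: the helper name `median_position` and the simulated samples are assumptions introduced for illustration. It reports the fraction of the box lying below the median, so a value near 0.5 means a centered median and a value well below 0.5 means the median sits closer to $Q_1$.

```python
import numpy as np

def median_position(values):
    """Fraction of the box (Q1 to Q3) that lies below the median.

    Near 0.5: median centered in the box (roughly symmetric middle half).
    Below 0.5: median closer to Q1 (suggests right skewness).
    Above 0.5: median closer to Q3 (suggests left skewness).
    """
    q1, med, q3 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
    if q3 == q1:
        return float("nan")  # degenerate box: the position is undefined
    return (med - q1) / (q3 - q1)

# Simulated samples (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
symmetric = rng.normal(size=20_000)
right_skewed = rng.exponential(size=20_000)

print(f"symmetric:    {median_position(symmetric):.2f}")
print(f"right-skewed: {median_position(right_skewed):.2f}")
```

This looks only at the middle 50% of the data, so it says nothing about tail behavior; whisker lengths and flagged points still need to be read from the plot.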

Examples

Equal Groups

All box plots have similar medians, similar box heights ($\text{IQR}$), and similar whisker lengths. This pattern suggests no meaningful difference between groups: the factor does not appear to affect either the location or the spread of the response.

Location Shift

Box plots have similar heights and whisker lengths but different median positions. This indicates the factor affects the average response without changing the variability, a classic location effect.

Spread Difference

Box plots have similar medians but markedly different heights. Taller boxes indicate groups with greater variability. This dispersion effect is important for process optimization and robust parameter design.
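The location-shift and spread-difference patterns described above can also be seen in the numbers behind the boxes. The sketch below uses small made-up data sets (not taken from the handbook) and a hypothetical helper, `group_summary`, to tabulate each group's median and IQR.

```python
import numpy as np

def group_summary(groups):
    """Map each group name to (median, IQR), the two quantities the box encodes."""
    summary = {}
    for name, values in groups.items():
        q1, med, q3 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
        summary[name] = (med, q3 - q1)
    return summary

base = np.arange(1.0, 10.0)  # made-up baseline: 1, 2, ..., 9
scenarios = {
    "baseline": base,
    "location shift": base + 3.0,                   # same spread, higher median
    "spread difference": 5.0 + 2.0 * (base - 5.0),  # same median, doubled spread
}

for name, (med, iqr) in group_summary(scenarios).items():
    print(f"{name:18s} median={med:4.1f}  IQR={iqr:4.1f}")
```

With these values the shifted group keeps the baseline IQR of 4 but moves its median from 5 to 8, while the stretched group keeps the median of 5 and doubles the IQR to 8.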

Assumptions and Limitations

Box plots make no distributional assumptions and are appropriate for any continuous or ordinal data. However, they can be misleading for very small samples where quartile estimates are unreliable, and they do not reveal multi-modality within a group. For multiple box plots, the width of the box can be set proportional to the number of points in the given group or sample, though some implementations set all boxes to the same width.
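One way to draw boxes whose widths reflect group size is to pass an array to the `widths` parameter of Matplotlib's `boxplot`. This is a sketch under stated assumptions: the helper `proportional_widths`, the linear scaling rule, the `max_width` value, and the simulated groups are all illustrative choices, not a prescribed method.

```python
import numpy as np
import matplotlib.pyplot as plt

def proportional_widths(groups, max_width=0.8):
    """Scale box widths linearly with group size; the largest group gets max_width."""
    sizes = np.array([len(g) for g in groups], dtype=float)
    return max_width * sizes / sizes.max()

# Simulated groups of unequal size (hypothetical data)
rng = np.random.default_rng(7)
groups = [rng.normal(50, 5, size=n) for n in (10, 40, 160)]
widths = proportional_widths(groups)

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(groups, widths=widths)
ax.set_xticklabels([f"n={len(g)}" for g in groups])  # label each box with its size
ax.set_ylabel("Measurement Value")
ax.set_title("Box width proportional to group size")
# plt.show()  # uncomment to display the figure interactively
```

Linear scaling makes very small groups hard to see; scaling by the square root of the group size is a common alternative.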

Reference: NIST/SEMATECH e-Handbook of Statistical Methods, Section 1.3.3.7

Formulas

Interquartile Range

$$\text{IQR} = Q_3 - Q_1$$

The interquartile range is the difference between the upper quartile (75th percentile) and the lower quartile (25th percentile). It measures the spread of the middle 50% of the data and forms the height of the box.
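As a small worked example of the definition above, the sketch below computes the quartiles and IQR with NumPy. The data values and the helper name `quartiles_and_iqr` are made up for illustration. Quartile conventions differ between software packages, so other tools may return slightly different values for small samples; this sketch assumes NumPy's default linear interpolation.

```python
import numpy as np

def quartiles_and_iqr(values):
    """Return (Q1, median, Q3, IQR) using NumPy's default percentile interpolation."""
    q1, med, q3 = np.percentile(np.asarray(values, dtype=float), [25, 50, 75])
    return q1, med, q3, q3 - q1

sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 12, 30]  # made-up illustrative data
q1, med, q3, iqr = quartiles_and_iqr(sample)
print(f"Q1={q1}, median={med}, Q3={q3}, IQR={iqr}")
```

For this sample, $Q_1 = 4.5$ and $Q_3 = 9.5$, so the box has height $\text{IQR} = 5$. The large value 30 does not affect the result, which is why the IQR is considered a robust measure of spread.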

Inner Fences (Mild Outlier Boundaries)

$$L_1 = Q_1 - 1.5 \times \text{IQR} \qquad U_1 = Q_3 + 1.5 \times \text{IQR}$$

The inner fences define the boundary for mild outliers. Whiskers extend from the quartiles to the most extreme data points within these fences. Points between the inner and outer fences are flagged as mild outliers.

Outer Fences (Extreme Outlier Boundaries)

$$L_2 = Q_1 - 3.0 \times \text{IQR} \qquad U_2 = Q_3 + 3.0 \times \text{IQR}$$

The outer fences define the boundary for extreme outliers. Points beyond these fences are flagged as extreme outliers and plotted with larger symbols to distinguish them from mild outliers.
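The inner and outer fence formulas above translate directly into a classification rule. The following is a minimal sketch, assuming NumPy's default quartile interpolation; the function name `classify_by_fences`, the dictionary keys, and the sample values are hypothetical and chosen only for illustration.

```python
import numpy as np

def classify_by_fences(values):
    """Apply the inner (1.5 IQR) and outer (3.0 IQR) fences to a 1-D sample."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1

    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # L1, U1
    lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr  # L2, U2

    beyond_inner = (x < lower_inner) | (x > upper_inner)
    beyond_outer = (x < lower_outer) | (x > upper_outer)

    # Whiskers end at the most extreme observations inside the inner fences
    inside = x[~beyond_inner]
    return {
        "inner_fences": (lower_inner, upper_inner),
        "outer_fences": (lower_outer, upper_outer),
        "whiskers": (inside.min(), inside.max()),
        "mild": x[beyond_inner & ~beyond_outer],  # between inner and outer fences
        "extreme": x[beyond_outer],               # beyond the outer fences
    }

sample = [2, 4, 4, 5, 6, 7, 8, 9, 10, 18, 30]  # made-up illustrative data
result = classify_by_fences(sample)
for key, value in result.items():
    print(key, value)
```

Here $Q_1 = 4.5$, $Q_3 = 9.5$, and $\text{IQR} = 5$, giving inner fences at $-3$ and $17$ and outer fences at $-10.5$ and $24.5$. The whiskers therefore end at 2 and 10, the value 18 is a mild outlier, and 30 is an extreme outlier.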

Python Example

import numpy as np
import matplotlib.pyplot as plt

# Generate data for 4 groups with different locations and spreads
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=55, scale=8, size=30)
group_c = rng.normal(loc=45, scale=5, size=30)
group_d = rng.normal(loc=50, scale=12, size=30)

# Create side-by-side vertical box plots (vertical is the default orientation).
# Note: tick_labels requires Matplotlib 3.9+; older versions name this parameter labels.
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(
    [group_a, group_b, group_c, group_d],
    tick_labels=['Group A', 'Group B', 'Group C', 'Group D'],
    patch_artist=True,  # draw boxes as filled patches so facecolor applies
    boxprops=dict(facecolor='steelblue', alpha=0.7),
)
ax.set_ylabel("Measurement Value")
ax.set_title("Box Plot: Comparison of Four Groups")
plt.tight_layout()
plt.show()
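Matplotlib's default box plot draws whiskers at 1.5 IQR and marks every point beyond them with the same symbol, so it does not separate mild from extreme outliers on its own. The sketch below shows one possible way to recover that distinction from the statistics Matplotlib computes, using `matplotlib.cbook.boxplot_stats`. The data values are made up, and the 3.0 IQR outer-fence step is applied here by hand rather than by the library.

```python
import numpy as np
from matplotlib import cbook

data = [2, 4, 4, 5, 6, 7, 8, 9, 10, 18, 30]  # made-up illustrative data

# boxplot_stats returns one dict per data set; whis=1.5 matches the inner fences
stats = cbook.boxplot_stats(data, whis=1.5)[0]

# Outer fences are not computed by Matplotlib, so derive them from q1, q3, and iqr
lower_outer = stats["q1"] - 3.0 * stats["iqr"]
upper_outer = stats["q3"] + 3.0 * stats["iqr"]

fliers = np.asarray(stats["fliers"], dtype=float)  # points beyond the whiskers
is_extreme = (fliers < lower_outer) | (fliers > upper_outer)

print("whiskers:", stats["whislo"], stats["whishi"])
print("mild outliers:", fliers[~is_extreme])
print("extreme outliers:", fliers[is_extreme])
```

The two subsets could then be overplotted with different marker sizes on an existing axes if the small-circle and large-circle convention is wanted.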