Box Plot
NIST/SEMATECH Section 1.3.3.7 Box Plot
What It Is
A box plot (Chambers 1983), also known as a box-and-whisker plot, is an excellent tool for conveying location and variation information in data sets, particularly for detecting and illustrating location and variation changes between different groups of data. The central box spans from the lower quartile ($Q_1$, 25th percentile) to the upper quartile ($Q_3$, 75th percentile), representing the middle 50% of the data. A line inside the box marks the median, and whiskers extend from the quartiles to the most extreme data point within $1.5 \times \mathrm{IQR}$ of the box (the inner fence).
A useful variation of the box plot more specifically identifies outliers using inner and outer fences. The box spans from $Q_1$ to $Q_3$ with the median marked inside, and the interquartile range ($\mathrm{IQR} = Q_3 - Q_1$) measures the spread of the middle 50%. Whiskers extend from each quartile to the most extreme data point within the inner fence ($1.5 \times \mathrm{IQR}$ from the box). Points beyond the inner fence but within the outer fence ($3.0 \times \mathrm{IQR}$ from the box) are plotted as small circles (mild outliers), and points beyond the outer fence are plotted as large circles (extreme outliers). The exact fence formulas are given below.
Questions This Plot Answers
- Is a factor significant?
- Does the location differ between subgroups?
- Does the variation differ between subgroups?
- Are there any outliers?
Why It Matters
The box plot is an important EDA tool for determining if a factor has a significant effect on the response with respect to either location or variation. It is also an effective tool for summarizing large quantities of information.
When to Use a Box Plot
Use a box plot when comparing the location, spread, and symmetry of one or more groups in a compact graphical format. Box plots are particularly effective for side-by-side comparisons of multiple samples or factor levels, making them a staple of exploratory data analysis in quality engineering, process comparison, and designed experiments. A single box plot can be drawn for one batch of data with no distinct groups; alternatively, multiple box plots can be drawn together to compare multiple data sets or groups in a single data set.
How to Interpret a Box Plot
The position of the median line within the box reveals the symmetry of the distribution: a centered median indicates symmetry, while a median closer to $Q_1$ or $Q_3$ suggests right or left skewness, respectively. The length of the box shows the interquartile range and serves as a robust measure of spread. Whisker lengths indicate the range of the bulk of the data, and individual points plotted beyond the whiskers are candidate outliers deserving further investigation. When comparing multiple box plots, differences in box height indicate differing variability, while vertical offsets between median lines indicate differences in location.
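The median-position rule can be turned into a number. The sketch below uses the quartile (Bowley) skewness coefficient, which is not part of the handbook text and is swapped in here only as one convenient way to quantify where the median sits inside the box. The function name `quartile_skewness`, the sample data, and the choice of the standard library's "inclusive" quartile method are all assumptions made for illustration.

```python
from statistics import quantiles


def quartile_skewness(data):
    """Quartile (Bowley) skewness: 0 for a centered median,
    positive when the median sits closer to Q1 (right skew),
    negative when it sits closer to Q3 (left skew)."""
    # Assumption: 'inclusive' quartiles; other conventions shift the values slightly.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; the box has no height")
    return ((q3 - median) - (median - q1)) / iqr


# Hypothetical illustration data, not taken from the handbook.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]

print(quartile_skewness(symmetric))     # 0.0
print(quartile_skewness(right_skewed))  # 0.5
```

The coefficient only looks at the box, so it is robust to the outliers that the whiskers and fences are meant to expose; it should be read alongside the plot rather than in place of it.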
Examples
Equal Groups
All box plots have similar medians, similar heights, and similar whisker lengths. This suggests no meaningful difference between groups: the factor does not appear to affect either the location or the spread of the response.
Location Shift
Box plots have similar heights and whisker lengths but different median positions. This indicates the factor affects the average response without changing the variability, a classic location effect.
Spread Difference
Box plots have similar medians but markedly different heights. Taller boxes indicate groups with greater variability. This dispersion effect is important for process optimization and robust parameter design.
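The three patterns above can also be checked numerically by comparing each group's median (location) and IQR (box height). The following is a minimal sketch with small made-up groups; the helper name `location_and_spread`, the data, and the "inclusive" quartile convention are assumptions for illustration only.

```python
from statistics import median, quantiles


def location_and_spread(groups):
    """Map each group name to (median, IQR)."""
    summary = {}
    for name, values in groups.items():
        # Assumption: 'inclusive' quartiles from the standard library.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[name] = (median(values), q3 - q1)
    return summary


# Hypothetical data: a baseline group, a shifted copy, and a stretched copy.
baseline = [44, 47, 49, 50, 51, 53, 56]
shifted = [x + 10 for x in baseline]            # same spread, higher location
wider = [50 + 3 * (x - 50) for x in baseline]   # same location, larger spread

summary = location_and_spread(
    {"baseline": baseline, "location shift": shifted, "spread difference": wider}
)
for name, (med, iqr) in summary.items():
    print(f"{name}: median={med}, IQR={iqr}")
```

Here the shifted group keeps the baseline IQR but moves its median, while the stretched group keeps the baseline median but triples its IQR, mirroring the location-shift and spread-difference pictures.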
Assumptions and Limitations
Box plots make no distributional assumptions and are appropriate for any continuous or ordinal data. However, they can be misleading for very small samples where quartile estimates are unreliable, and they do not reveal multi-modality within a group. For multiple box plots, the width of the box can be set proportional to the number of points in the given group or sample, though some implementations set all boxes to the same width.
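The multi-modality limitation can be seen with a small constructed example. In the sketch below, two hypothetical samples (invented for illustration) share the same five-number summary, so their basic box plots would look the same even though one is evenly spread and the other clusters around two values. The helper `five_number_summary` and the "inclusive" quartile method are assumptions.

```python
from collections import Counter
from statistics import quantiles


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for one batch of data."""
    xs = sorted(data)
    # Assumption: 'inclusive' quartiles from the standard library.
    q1, median, q3 = quantiles(xs, n=4, method="inclusive")
    return (xs[0], q1, median, q3, xs[-1])


# Hypothetical data: same summary, different shape.
evenly_spread = [1, 2, 3, 4, 5, 6, 7, 8, 9]
two_clusters = [1, 3, 3, 3, 5, 7, 7, 7, 9]

print(five_number_summary(evenly_spread))   # (1, 3.0, 5.0, 7.0, 9)
print(five_number_summary(two_clusters))    # (1, 3.0, 5.0, 7.0, 9)
print(Counter(two_clusters).most_common(2)) # [(3, 3), (7, 3)]
```

When a group might have more than one cluster, pairing the box plot with a histogram or a plot of the raw points is a reasonable safeguard.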
Reference: NIST/SEMATECH e-Handbook of Statistical Methods, Section 1.3.3.7
Formulas
Interquartile Range
$$\mathrm{IQR} = Q_3 - Q_1$$
The interquartile range is the difference between the upper quartile $Q_3$ (75th percentile) and the lower quartile $Q_1$ (25th percentile). It measures the spread of the middle 50% of the data and forms the height of the box.
Inner Fences (Mild Outlier Boundaries)
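$$\text{lower inner fence} = Q_1 - 1.5 \times \mathrm{IQR}, \qquad \text{upper inner fence} = Q_3 + 1.5 \times \mathrm{IQR}$$
The inner fences define the boundary for mild outliers. Whiskers extend from the quartiles to the most extreme data points within these fences. Points between the inner and outer fences are flagged as mild outliers.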
Outer Fences (Extreme Outlier Boundaries)
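$$\text{lower outer fence} = Q_1 - 3.0 \times \mathrm{IQR}, \qquad \text{upper outer fence} = Q_3 + 3.0 \times \mathrm{IQR}$$
The outer fences define the boundary for extreme outliers. Points beyond these fences are flagged as extreme outliers and plotted with larger symbols to distinguish them from mild outliers.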
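As a worked illustration of the fence rules, the sketch below computes the quartiles, both sets of fences, the whisker ends, and the mild and extreme outliers for one small batch. The function name `fence_summary` and the sample values are hypothetical, and the standard library's "inclusive" quartile method is an assumption; plotting packages may use a different quartile convention and so give slightly different fences.

```python
from statistics import quantiles


def fence_summary(data):
    """Quartiles, fences, whisker ends, and outliers for one batch."""
    xs = sorted(data)
    # Assumption: 'inclusive' quartiles (linear interpolation between
    # order statistics); other conventions shift the results slightly.
    q1, median, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    # Whiskers stop at the most extreme points still inside the inner fences.
    inside = [x for x in xs if inner[0] <= x <= inner[1]]
    # Mild outliers lie beyond an inner fence but not beyond an outer fence.
    mild = [x for x in xs
            if outer[0] <= x < inner[0] or inner[1] < x <= outer[1]]
    # Extreme outliers lie beyond an outer fence.
    extreme = [x for x in xs if x < outer[0] or x > outer[1]]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "inner": inner, "outer": outer,
        "whiskers": (inside[0], inside[-1]),
        "mild": mild, "extreme": extreme,
    }


# Hypothetical measurements with one mild and one extreme high value.
sample = [48, 50, 51, 52, 53, 54, 55, 56, 58, 66, 80]
summary = fence_summary(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this batch the quartiles are 51.5 and 57.0, so the IQR is 5.5, the inner fences are 43.25 and 65.25, and the outer fences are 35.0 and 73.5. The upper whisker therefore ends at 58, the value 66 is a mild outlier, and 80 is an extreme outlier.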
Python Example
import numpy as np
import matplotlib.pyplot as plt

# Generate data for 4 groups with different characteristics
rng = np.random.default_rng(42)
group_a = rng.normal(loc=50, scale=5, size=30)
group_b = rng.normal(loc=55, scale=8, size=30)
group_c = rng.normal(loc=45, scale=5, size=30)
group_d = rng.normal(loc=50, scale=12, size=30)

# Create box plot (vertical orientation is the default)
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(
    [group_a, group_b, group_c, group_d],
    # tick_labels requires Matplotlib 3.9+; older releases use labels= instead
    tick_labels=['Group A', 'Group B', 'Group C', 'Group D'],
    patch_artist=True,  # draw filled boxes so facecolor takes effect
    boxprops=dict(facecolor='steelblue', alpha=0.7),
)
ax.set_ylabel("Measurement Value")
ax.set_title("Box Plot: Comparison of Four Groups")
plt.tight_layout()
plt.show()