Scatterplot Matrix
NIST/SEMATECH Section 1.3.3.26.11 Scatterplot Matrix
What It Is
A scatterplot matrix displays all pairwise scatter plots of the variables in a multivariate dataset, arranged in a square grid in which the row and column positions identify the variable pair. The off-diagonal cells show bivariate scatter plots. A diagonal cell would plot a variable against itself (X_i vs X_i), which is just the identity line, so implementations commonly substitute variable names or univariate summaries such as histograms.
For p variables, the grid is p x p, and cell (i,j) contains the scatter plot of variable i vs variable j. The matrix is symmetric in content: cell (j,i) shows the same pair as cell (i,j) with the axes swapped. The grid therefore has p*p cells, of which p*(p-1) are off-diagonal scatter plots covering p*(p-1)/2 unique pairs.
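The panel arithmetic can be checked with a short sketch. The helper name `panel_counts` is hypothetical, and the variable names are borrowed from the Python example at the end of this page.

```python
from itertools import combinations


def panel_counts(p):
    """Return (total cells, off-diagonal cells, unique pairs) for p variables."""
    total = p * p
    off_diagonal = p * (p - 1)
    unique_pairs = p * (p - 1) // 2
    return total, off_diagonal, unique_pairs


variables = ["Strength", "Hardness", "Density", "Elasticity"]

# Each unordered pair corresponds to two mirrored cells, (i,j) and (j,i)
pairs = list(combinations(variables, 2))

total, off_diagonal, unique_pairs = panel_counts(len(variables))
print(total, off_diagonal, unique_pairs)  # 16 12 6
```

With four variables the grid has 16 cells, 12 of them scatter plots, but only 6 distinct variable pairs.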
Questions This Plot Answers
- Are there pairwise relationships between the variables?
- What is the nature of these relationships (linear, non-linear)?
- Are there outliers in the data?
- Is there clustering by groups in the data?
Why It Matters
The scatterplot matrix provides a comprehensive bivariate overview of multivariate data in a single display. It is the standard first step in multivariate analysis, revealing correlation structure, non-linear relationships, clusters, and outliers that inform variable selection and modeling decisions. In interactive implementations, linking and brushing allows the analyst to select (brush) a subset of points in one panel and see the same observations highlighted in all other panels, making it easy to trace multivariate patterns across variable pairs.
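Linking works because every panel plots the same observations, so a selection made in one panel is just a set of row indices that can be reused everywhere. The following is a minimal, non-interactive sketch of that idea on synthetic data; the `brush` function and its rectangle parameters are assumptions for illustration, not the API of any plotting library.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((200, 3))  # synthetic: 200 observations, 3 variables


def brush(data, i, j, xlim, ylim):
    """Boolean mask of rows whose (variable i, variable j) point lies in a rectangle."""
    x, y = data[:, i], data[:, j]
    return (x >= xlim[0]) & (x <= xlim[1]) & (y >= ylim[0]) & (y <= ylim[1])


# "Brush" a rectangle in the panel for variables 0 and 1
mask = brush(data, 0, 1, xlim=(1.0, 3.0), ylim=(-1.0, 1.0))

# The same mask selects the same observations in any other panel,
# here the panel for variables 1 and 2
highlighted = data[mask][:, [1, 2]]
print(f"{mask.sum()} of {len(data)} observations brushed")
```

An interactive tool would redraw each panel with the masked rows in a contrasting color whenever the rectangle changes.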
When to Use a Scatterplot Matrix
Use a scatterplot matrix when exploring a multivariate dataset to identify which variable pairs exhibit strong correlations, non-linear relationships, clusters, or outliers. It provides a comprehensive overview of the bivariate structure of the data in a single display, guiding decisions about variable selection, transformation, and modeling strategy. The scatterplot matrix is particularly useful in regression analysis, principal component analysis, and multivariate quality control.
How to Interpret a Scatterplot Matrix
Each off-diagonal cell shows the scatter plot for one pair of variables. Tight linear patterns indicate high correlation, while formless clouds indicate a weak relationship or none. Because each pair appears twice, mirrored across the diagonal with the axes swapped, many displays keep scatter plots in the lower triangle and use the upper triangle for correlation coefficients or other summaries. Similar patterns across several pairs may indicate an underlying latent factor. An observation that stands apart in several panels warrants investigation: it may be extreme in a single variable, which shows up in every panel involving that variable, or it may be unusual only in combination with other variables. The two triangles carry the same data and differ only in which variable is on the vertical axis, so reading both is useful mainly when one variable of a pair is treated as the response.
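The numbers that would fill the upper triangle can be computed directly. This sketch uses synthetic data and the Pearson coefficient via `np.corrcoef`; Pearson r only measures linear association, so a curved pattern in a scatter panel can still have r near zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)

# Synthetic data: the middle column is built from the other two
X = np.column_stack([z1, 0.8 * z1 + 0.6 * z2, z2])

# Columns are variables, so rowvar=False
corr = np.corrcoef(X, rowvar=False)

# Index pairs (i, j) with i < j: the cells of the upper triangle
rows, cols = np.triu_indices(corr.shape[0], k=1)
for i, j in zip(rows, cols):
    print(f"cell ({i},{j}): r = {corr[i, j]:+.2f}")
```

The correlation matrix is itself symmetric, which is why one triangle is enough to hold every coefficient.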
Assumptions and Limitations
The scatterplot matrix grows quadratically with the number of variables, so it is most practical for datasets with 3 to 8 variables. Beyond that, the individual panels become too small to interpret effectively. The display only reveals pairwise relationships and cannot capture higher-order interactions or conditional dependencies. Large datasets may require transparency or sampling to manage overplotting in each panel.
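One way to manage overplotting is to draw a random subset of rows before plotting. The sketch below assumes the data is a NumPy array with one row per observation; `subsample_rows` and the `max_points` threshold are hypothetical choices, not recommendations from the source.

```python
import numpy as np


def subsample_rows(data, max_points, seed=0):
    """Return at most max_points rows, sampled without replacement."""
    n = data.shape[0]
    if n <= max_points:
        return data
    rng = np.random.default_rng(seed)
    idx = rng.choice(n, size=max_points, replace=False)
    return data[np.sort(idx)]  # keep the original row order


big = np.random.default_rng(2).standard_normal((100_000, 4))  # synthetic data
small = subsample_rows(big, max_points=2_000)
print(big.shape, "->", small.shape)
```

Random sampling can drop rare outliers, so it is worth comparing the sampled display against a semi-transparent plot of the full data before drawing conclusions.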
Reference: NIST/SEMATECH e-Handbook of Statistical Methods, Section 1.3.3.26.11
Python Example
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Generate 4 correlated variables
rng = np.random.default_rng(42)
n = 150
z1 = rng.standard_normal(n)
z2 = rng.standard_normal(n)
z3 = rng.standard_normal(n)

df = pd.DataFrame({
    'Strength': 50 + 5 * z1,
    'Hardness': 30 + 3 * z1 + 2 * z2,
    'Density': 7.5 + 0.5 * z1 + 0.8 * z3,
    'Elasticity': 200 + 20 * z2 + 10 * z3,
})

g = sns.pairplot(df, diag_kind='hist', plot_kws={'alpha': 0.5, 's': 20})
g.figure.suptitle("Scatterplot Matrix: 4 Material Properties", y=1.02)
plt.tight_layout()
plt.show()