Scatter Plot
NIST/SEMATECH Section 1.3.3.26 Scatter Plot
What It Is
A scatter plot displays the relationship between two quantitative variables by plotting each observation as a point in a two-dimensional coordinate system, with one variable on the horizontal axis and the other on the vertical axis. Optional enhancements include a regression line, confidence bands, and a LOWESS smoother to highlight the underlying trend.
Each observation is plotted as a point at coordinates $(X_i, Y_i)$. An optional fitted regression line or LOWESS smoother highlights the central trend. The vertical scatter of points around the trend line indicates the strength of the relationship: tight scatter means strong association, wide scatter means weak association. The shape of the point cloud reveals the functional form: an elliptical cloud suggests linearity, while a curved band suggests a non-linear relationship.
Questions This Plot Answers
- Are variables $X$ and $Y$ related?
- Are variables $X$ and $Y$ linearly related?
- Are variables $X$ and $Y$ non-linearly related?
- Does the variation in $Y$ change depending on $X$?
- Are there outliers?
Why It Matters
The scatter plot is the single most important graphical tool for bivariate analysis. It provides a direct, assumption-free view of the relationship between two variables, revealing the form (linear, curved, none), direction (positive, negative), strength, and any anomalies (outliers, clusters, heteroscedasticity). No regression model should be fit without first examining the scatter plot. Important: correlation does not imply causation. A strong association between two variables does not prove that changes in one variable cause changes in the other.
When to Use a Scatter Plot
Use a scatter plot as the primary tool for exploring the relationship between two continuous variables, including questions about correlation, linearity, and the presence of clusters or outliers. It is the foundation of bivariate exploratory analysis and is used before fitting regression models to verify that the assumed relationship is reasonable. Scatter plots are ubiquitous in science, engineering, and business analysis wherever two-variable relationships need to be visualized.
How to Interpret a Scatter Plot
A positive linear trend, with points rising from lower left to upper right, indicates a positive correlation between the variables. A negative linear trend indicates a negative correlation. A cloud of points with no discernible pattern indicates little or no relationship. Curvature in the scatter pattern suggests a non-linear relationship that may require transformation or a polynomial model. Distinct clusters of points suggest subgroups in the data. A single point far from the main body of data is an outlier that may be influential in a fitted model. Variant patterns include no relationship (random scatter), strong positive linear, strong negative linear, exact linear ($|r| = 1$), quadratic, exponential, heteroscedastic, homoscedastic, clustered, outlier-contaminated, and sinusoidal patterns, each suggesting different modeling strategies.
Examples
Strong Positive Correlation
Points form a tight band rising from lower-left to upper-right. As $X$ increases, $Y$ increases proportionally. The correlation coefficient is close to $+1$. A linear model is appropriate.
Strong Negative Correlation
Points form a tight band falling from upper-left to lower-right. As $X$ increases, $Y$ decreases proportionally. The correlation coefficient is close to $-1$. A linear model with negative slope is appropriate.
Weak Positive Correlation
Points show a general upward trend but with considerable scatter around the trend line. The correlation coefficient is moderate. The relationship exists, but other factors contribute substantially to the variability in $Y$.
No Correlation
Points form a structureless circular or rectangular cloud with no discernible trend. The correlation coefficient is near zero. $X$ provides no information about $Y$, and a regression model would not be useful.
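The four correlation patterns described so far can be reproduced numerically. The following is a minimal sketch using simulated data; the slopes, noise levels, sample size, and seed are illustrative assumptions, not values from the source.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 500
x = rng.uniform(0, 10, n)

# Hypothetical data sets: same x, different slopes and noise levels.
patterns = {
    "strong positive": 2.0 * x + rng.normal(0, 1.0, n),
    "strong negative": -2.0 * x + rng.normal(0, 1.0, n),
    "weak positive": 0.5 * x + rng.normal(0, 3.0, n),
    "no correlation": rng.normal(0, 3.0, n),
}

# Pearson correlation coefficient r for each pattern.
r = {name: float(np.corrcoef(x, y)[0, 1]) for name, y in patterns.items()}

for name, value in r.items():
    print(f"{name:16s} r = {value:+.2f}")
```

Plotting each `y` against `x` with `ax.scatter` should show the tight band, the loose band, and the structureless cloud that correspond to these coefficients.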
Quadratic Relationship
Points follow a U-shaped or inverted-U curve. The linear correlation may be near zero despite a strong relationship existing. A quadratic model ($Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon$) or polynomial regression is needed.
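The claim that linear correlation can be near zero for a strong quadratic relationship can be checked with a small simulation. This sketch assumes a symmetric range of X around the vertex and made-up coefficients; `np.polyfit` is used for the least-squares quadratic fit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-3, 3, n)                      # symmetric range around the vertex
y = 1.0 + 2.0 * x**2 + rng.normal(0, 1.0, n)   # hypothetical U-shaped data

# Linear correlation misses the relationship almost entirely.
r_linear = float(np.corrcoef(x, y)[0, 1])

# Least-squares quadratic fit; polyfit returns the highest power first.
b2, b1, b0 = np.polyfit(x, y, deg=2)

# Fraction of the variance in y explained by the quadratic fit.
resid = y - np.polyval([b2, b1, b0], x)
r2_quadratic = float(1.0 - resid.var() / y.var())

print(f"linear r = {r_linear:+.2f}, quadratic R^2 = {r2_quadratic:.2f}")
```

If the X range is not centered on the vertex, the linear correlation will generally not be near zero, so the effect depends on the assumed design.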
Exponential Relationship
Points follow an exponential curve with $Y$ increasing (or decreasing) at an accelerating rate. A log transformation of $Y$ or an exponential model is needed. The linear correlation underestimates the true strength of the relationship.
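The effect of the log transformation can be sketched as follows. The model form Y = a·exp(bX) with multiplicative noise, and the values a = 3 and b = 0.8, are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x = rng.uniform(0, 5, n)
# Hypothetical exponential data with multiplicative (log-normal) noise.
y = 3.0 * np.exp(0.8 * x) * np.exp(rng.normal(0, 0.2, n))

r_raw = float(np.corrcoef(x, y)[0, 1])          # linear correlation on the raw scale
r_log = float(np.corrcoef(x, np.log(y))[0, 1])  # after log-transforming Y

# A straight-line fit to (X, log Y) estimates b (slope) and log a (intercept).
b_hat, log_a_hat = np.polyfit(x, np.log(y), deg=1)
a_hat = float(np.exp(log_a_hat))

print(f"r raw = {r_raw:.2f}, r after log = {r_log:.2f}")
print(f"estimated a = {a_hat:.2f}, b = {b_hat:.2f}")
```

The log transformation only linearizes the relationship when Y is strictly positive; data with an additive offset would need a different approach.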
Heteroscedastic
The vertical spread of points increases (or decreases) systematically as $X$ increases, forming a fan or trumpet shape. This non-constant variance violates a key regression assumption and may require weighted regression or a variance-stabilizing transformation.
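A fan-shaped pattern can be checked crudely by comparing the residual spread in the lower and upper halves of the X range, and then handled with a weighted fit. The sketch below assumes the noise standard deviation is proportional to X and is known, which is rarely true of real data. Note that `np.polyfit` expects weights of 1/sigma rather than 1/sigma².

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
x = rng.uniform(1, 10, n)
sigma = 0.5 * x                              # assumed: noise SD proportional to X
y = 2.0 + 1.5 * x + rng.normal(0, sigma)     # hypothetical fan-shaped data

# Ordinary least-squares fit and its residuals.
slope_ols, intercept_ols = np.polyfit(x, y, deg=1)
resid = y - (slope_ols * x + intercept_ols)

# Crude check: residual spread for X above the median versus below it.
median_x = np.median(x)
sd_low = resid[x <= median_x].std(ddof=1)
sd_high = resid[x > median_x].std(ddof=1)
spread_ratio = float(sd_high / sd_low)

# Weighted least squares with weights 1/sigma (the convention np.polyfit uses).
slope_wls, intercept_wls = np.polyfit(x, y, deg=1, w=1.0 / sigma)

print(f"spread ratio (high/low) = {spread_ratio:.2f}")
print(f"OLS slope = {slope_ols:.2f}, WLS slope = {slope_wls:.2f}")
```

A spread ratio near 1 would be consistent with constant variance; a ratio well above or below 1 matches the fan shape described above.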
Clustered
Points form two or more distinct groups separated by gaps, rather than a continuous distribution. This suggests the data come from different subpopulations or operating conditions. Each cluster may have its own regression relationship.
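To illustrate why each cluster may need its own regression, the sketch below builds two hypothetical subpopulations whose within-group slopes are negative while the pooled slope is positive. The group labels are assumed to be known; in practice they would have to come from the data collection or from a clustering step.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
# Two hypothetical subpopulations with separate X ranges and separate lines.
x_a = rng.uniform(0, 4, n)
y_a = 10.0 - 1.0 * x_a + rng.normal(0, 0.5, n)
x_b = rng.uniform(6, 10, n)
y_b = 22.0 - 1.0 * x_b + rng.normal(0, 0.5, n)

x = np.concatenate([x_a, x_b])
y = np.concatenate([y_a, y_b])
group = np.repeat(["A", "B"], n)             # cluster labels, assumed known here

# One line through all points versus one line per cluster.
slope_pooled = float(np.polyfit(x, y, deg=1)[0])
slope_by_group = {
    g: float(np.polyfit(x[group == g], y[group == g], deg=1)[0])
    for g in ("A", "B")
}

print(f"pooled slope = {slope_pooled:+.2f}")
for g, s in slope_by_group.items():
    print(f"group {g} slope = {s:+.2f}")
```

Coloring the points by `group` in the scatter plot makes the gap between clusters, and the reversal of the trend, visible at a glance.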
With Outliers
Most points follow a clear pattern, but one or more points lie far from the main body. These outliers may be influential observations that distort the regression fit. They should be investigated for measurement errors or genuine extreme conditions.
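The influence of a single outlier on a fitted line can be demonstrated by fitting with and without it. In this sketch the outlier's position (X = 20, Y = 0) is a made-up, high-leverage point chosen to make the effect obvious.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 30
x = rng.uniform(0, 10, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, n)    # hypothetical clean linear data

# Append one hypothetical outlier far from the main body of the data.
x_out = np.append(x, 20.0)
y_out = np.append(y, 0.0)

# Compare the least-squares slope with and without the outlier.
slope_clean = float(np.polyfit(x, y, deg=1)[0])
slope_with_outlier = float(np.polyfit(x_out, y_out, deg=1)[0])

print(f"slope without outlier = {slope_clean:.2f}")
print(f"slope with outlier    = {slope_with_outlier:.2f}")
```

An outlier near the center of the X range would shift the intercept more than the slope, so the size of the distortion depends on where the point lies.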
Exact Linear
All points fall exactly on a straight line with no scatter whatsoever ($|r| = 1$). This perfect linear relationship is rare in practice and may indicate a deterministic formula, a calibration artifact, or data fabrication. It is included as a reference case showing the theoretical ideal that real data approach but never quite reach.
Sinusoidal (Damped)
Points follow a damped sine-wave pattern where the amplitude decreases as $X$ increases. The relationship is periodic but with diminishing oscillation. A linear or polynomial model is inadequate; a damped sinusoidal model is needed, as described in NIST Section 1.3.3.26.7.
Homoscedastic
Points scatter uniformly around the regression line with constant vertical spread across the full range of $X$. This constant variance (homoscedasticity) satisfies a key assumption of ordinary least squares regression, meaning standard confidence intervals and hypothesis tests on the regression coefficients are valid.
Assumptions and Limitations
The scatter plot makes no distributional assumptions and is valid for any pair of continuous variables. However, it shows only the marginal bivariate relationship and cannot account for confounding by additional variables. For datasets with many observations, overplotting can obscure patterns, and techniques such as transparency, jittering, or hexagonal binning may be needed.
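Two of the overplotting remedies mentioned above can be sketched with NumPy alone. Rectangular 2-D binning via `np.histogram2d` is used here as a stand-in for hexagonal binning, and the jitter width of 0.2 is an arbitrary assumption; the data are simulated.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200_000                                  # large enough to overplot badly
x = rng.normal(0, 1, n)
y = 0.6 * x + rng.normal(0, 0.8, n)

# Rectangular 2-D binning: count the observations in each cell of a 40 x 40 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=40)

# Jittering, for coarsely recorded data: add small uniform noise so that
# tied points separate instead of stacking on top of each other.
x_rounded = np.round(x)                      # hypothetical coarsely recorded X
x_jittered = x_rounded + rng.uniform(-0.2, 0.2, n)

print(f"busiest cell holds {int(counts.max())} of {n} points")
```

The `counts` array can be drawn as a heat map, and matplotlib's `Axes.hexbin` performs true hexagonal binning directly from `x` and `y`. Transparency is the simplest option and is already used in the Python example through the `alpha` argument of `ax.scatter`.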
Reference: NIST/SEMATECH e-Handbook of Statistical Methods, Section 1.3.3.26
Formulas
Linear Model
The simple linear regression model, appropriate for strong positive or negative linear scatter patterns.
Quadratic Model
A polynomial model for U-shaped or inverted-U scatter patterns where curvature is present.
Exponential Model
An exponential model for scatter patterns where Y increases (or decreases) at an accelerating rate.
Sinusoidal Model
A damped sinusoidal model for oscillating scatter patterns (NIST 1.3.3.26.7).
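The four model forms named in this section can be written out as follows. The symbols are conventional choices and are assumptions here, not notation confirmed by the source: $\beta_j$, $\alpha$, $\beta$, $C$, $A$, $\lambda$, $\omega$, and $\phi$ are unknown parameters and $\varepsilon$ is a random error term. The damped sinusoid is shown with an exponentially decaying amplitude, which is one common parameterization; the cited NIST section may write the amplitude term differently.

```latex
\begin{align*}
\text{Linear:}            \quad & Y = \beta_0 + \beta_1 X + \varepsilon \\
\text{Quadratic:}         \quad & Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \varepsilon \\
\text{Exponential:}       \quad & Y = \alpha e^{\beta X} + \varepsilon \\
\text{Damped sinusoidal:} \quad & Y = C + A e^{-\lambda X} \sin(2\pi\omega X + \phi) + \varepsilon
\end{align*}
```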
Python Example
import numpy as np
import matplotlib.pyplot as plt

# Generate bivariate data with positive correlation
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 10, n)
y = 1.8 * x + 5 + rng.normal(0, 3, n)

# Draw the scatter plot with semi-transparent markers
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(x, y, alpha=0.6, s=40, edgecolors='white', linewidth=0.5)
ax.set_xlabel("X")
ax.set_ylabel("Y")
ax.set_title("Scatter Plot: Bivariate Relationship")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()