Normal Random Numbers Case Study

NIST/SEMATECH Section 1.4.2.1 Normal Random Numbers

Background and Data

This case study applies exploratory data analysis (EDA) techniques to 500 observations generated from a standard normal distribution $N(0,1)$, taken from a Rand Corporation publication. The primary goal is to demonstrate the ideal case, in which a well-behaved dataset satisfies all four underlying assumptions of a univariate measurement process: fixed location, fixed variation, randomness, and fixed distribution. This serves as a baseline example: when the process is operating correctly, the EDA graphical and quantitative outputs should support every assumption.

The dataset is presented in NIST/SEMATECH Section 1.4.2.1. With $n = 500$ observations, the sample is large enough for the formal tests to detect moderate departures from the assumptions.

Dataset

RANDN.DAT
Observations: 500
Variable: Random normal value

Rand Corporation, 500 standard normal N(0,1) random numbers

NIST source description
Rand normal numbers. Sample size N = 500. 10 numbers per line, format 10F7.4.
Preview data
# Value
1 -1.276
2 -1.218
3 -0.453
4 -0.35
5 0.723
6 0.676
7 -1.099
8 -0.314
9 -0.394
10 -0.633
... 490 more rows
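
The format note "10F7.4" describes ten fixed-width fields of seven characters per line. A minimal sketch of reading such lines follows, assuming the file follows that Fortran format literally and that any header lines have already been removed. The function name `parse_fixed_width` and the sample line (built from the ten preview values above) are illustrative, not part of the NIST distribution. If the actual file is whitespace-delimited, `line.split()` would be enough.

```python
FIELD_WIDTH = 7        # the "7" in F7.4
FIELDS_PER_LINE = 10   # the leading "10" in 10F7.4


def parse_fixed_width(text, width=FIELD_WIDTH, per_line=FIELDS_PER_LINE):
    """Slice each line into fixed-width fields and convert them to floats."""
    values = []
    for line in text.splitlines():
        for k in range(per_line):
            field = line[k * width:(k + 1) * width].strip()
            if field:  # skip empty fields on short or blank lines
                values.append(float(field))
    return values


# Illustrative line: the ten preview values written in F7.4 layout.
sample_line = (
    "-1.2760-1.2180-0.4530-0.3500 0.7230"
    " 0.6760-1.0990-0.3140-0.3940-0.6330"
)
first_ten = parse_fixed_width(sample_line)
print(first_ten)
```

Slicing by width rather than splitting on whitespace matters here because a negative value with four decimals fills all seven characters, so adjacent fields may touch.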

Test Underlying Assumptions

Goals

The analysis has three primary objectives:

  1. Model validation: assess whether the univariate model is an appropriate fit for the normal random numbers:

$$Y_i = C + E_i$$

  2. Assumption testing: evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:

    • Random sampling: the data are uncorrelated
    • Fixed distribution: the data come from a fixed distribution
    • Fixed location: the distribution location (mean) is constant
    • Fixed variation: the distribution scale (standard deviation) is constant
  3. Confidence interval validity: determine whether the standard confidence interval formula is appropriate:

$$\bar{Y} \pm \frac{2s}{\sqrt{N}}$$

where $s$ is the sample standard deviation. This formula relies on all four assumptions holding; if they are violated, the confidence interval has no statistical meaning.

Graphical Output and Interpretation

4-Plot Overview

The 4-plot for this dataset is the archetype of a healthy process. All four component plots are consistent with the underlying assumptions: the run sequence plot shows a stable band, the lag plot shows a structureless circular cloud, the histogram is bell-shaped, and the normal probability plot is approximately linear. No remedial action is needed.

Four-plot diagnostic layout for the normal random numbers dataset (run sequence, lag, histogram, normal probability). All four assumptions are satisfied.

The assumptions are addressed by the four diagnostic plots:

  1. The run sequence plot (upper left) shows 500 observations fluctuating randomly around a stable central location near zero with no trend, shift, or outliers — the fixed-location and fixed-variation assumptions appear satisfied.
  2. The lag plot (upper right) shows a structureless, roughly circular scatter cloud centered at the origin — no evidence of serial dependence, supporting the randomness assumption.
  3. The histogram (lower left) displays the classic symmetric, bell-shaped curve expected from normal data, centered near zero with no obvious skewness or heavy tails.
  4. The normal probability plot (lower right) shows data points falling closely along a straight diagonal line with no systematic curvature — confirming that a normal distribution is a reasonable model.

All four assumptions appear satisfied. The univariate model is appropriate and the standard confidence interval is valid. No remedial action is needed.

Run Sequence Plot

The run sequence plot displays 500 observations fluctuating randomly around a stable central location near zero with no discernible trend, shift, or cyclic pattern. Both the location and scale appear constant across the entire run, confirming the fixed-location and fixed-variation assumptions.

Run sequence plot of 500 normal random numbers showing stable location near zero and constant variation — no trend, shift, or cyclic pattern.

Lag Plot

The lag plot at lag 1 shows a structureless, roughly circular scatter cloud centered at the origin. There is no evidence of serial dependence, autocorrelation, or clustering. This supports the randomness assumption.

Lag-1 plot showing a structureless circular cloud centered at the origin, confirming the randomness assumption.

Histogram

The histogram displays the classic symmetric, bell-shaped curve expected from normal data. The distribution is unimodal with no obvious skewness or heavy tails. The peak is centered near zero, consistent with the known mean of the generating distribution.

Histogram with KDE overlay displaying the classic symmetric bell-shaped curve expected from normal data, centered near zero.

Normal Probability Plot

The normal probability plot shows data points falling closely along a straight diagonal line with no systematic curvature, S-shapes, or outlier departures. This confirms that the normal distribution is an appropriate model for these data.

Normal probability plot with data points falling closely along a straight diagonal line, confirming the normal distribution assumption.

Autocorrelation Plot

The autocorrelation plot confirms the randomness assumption. All autocorrelation coefficients at lags 1 through 50 fall within the 95% confidence bands at ±2/√N, consistent with white noise.

Autocorrelation plot showing all coefficients at lags 1 through 50 within the 95% confidence bands, consistent with white noise.
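
The quantities behind an autocorrelation plot can be sketched in a few lines. The code below is an illustration under stated assumptions: the function names are hypothetical, the estimator is the usual sample autocorrelation (lagged products divided by the total sum of squares), and the demo series is seeded pseudo-random noise standing in for the RANDN.DAT values.

```python
import math
import random


def autocorrelation(x, lag):
    """Sample autocorrelation of x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return num / denom


def lags_outside_band(x, max_lag=50, z=2.0):
    """Return the lags whose autocorrelation falls outside +/- z / sqrt(N)."""
    band = z / math.sqrt(len(x))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(x, k)) > band]


# Stand-in data (NOT the case-study values): 500 seeded N(0,1) draws.
rng = random.Random(42)
demo = [rng.gauss(0.0, 1.0) for _ in range(500)]
print("band:", 2.0 / math.sqrt(len(demo)))
print("lags outside band:", lags_outside_band(demo))
```

With 50 lags and 95% bands, an occasional lag just outside the band is expected by chance even for white noise, so a single marginal exceedance is not by itself evidence of dependence.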

Spectral Plot

The spectral plot shows the frequency-domain structure. For independent random data, the spectrum should be approximately flat with no dominant peaks.

Spectral plot showing an approximately flat spectrum with no dominant peaks, confirming independent random data.
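
As a rough sketch of what a spectral plot summarizes, the code below computes a raw periodogram with a naive discrete Fourier transform. The source does not state which spectral estimator produced the plot, so this is an assumption for illustration only; `periodogram` is a hypothetical helper name, and the direct $O(n^2)$ sum is chosen for clarity rather than speed.

```python
import cmath
import math


def periodogram(x):
    """Return (frequency, power) pairs at Fourier frequencies k/n, k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]
    result = []
    for k in range(1, n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centered))
        result.append((k / n, abs(coef) ** 2 / n))
    return result


# A pure sinusoid with 5 cycles over 64 samples has one dominant peak.
wave = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
peak_freq, peak_power = max(periodogram(wave), key=lambda p: p[1])
print(peak_freq, peak_power)
```

For independent random data no single frequency should dominate in this way; the power values scatter around a roughly constant level.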

Quantitative Output and Interpretation

Summary Statistics

| Statistic | Value |
|---|---|
| Sample size $n$ | 500 |
| Mean $\bar{Y}$ | -0.0029 |
| Std Dev $s$ | 1.0210 |
| Median | -0.0930 |
| Min | -2.6470 |
| Max | 3.4360 |
| Range | 6.0830 |

The sample mean and standard deviation are close to the theoretical values of 0 and 1, respectively. The range of 6.083, roughly six standard deviations, is consistent with a sample of size 500 from a standard normal distribution.
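
The summary statistics can be reproduced with the Python standard library. The sketch below uses only the ten preview values listed earlier, so its output will not match the table for the full 500 observations; `summarize` is a hypothetical helper name.

```python
import statistics


def summarize(x):
    """Basic location and scale summaries for a sequence of numbers."""
    return {
        "n": len(x),
        "mean": statistics.fmean(x),
        "std_dev": statistics.stdev(x),  # sample SD (n - 1 denominator)
        "median": statistics.median(x),
        "min": min(x),
        "max": max(x),
        "range": max(x) - min(x),
    }


# First ten values from the data preview only, not the full dataset.
preview = [-1.276, -1.218, -0.453, -0.35, 0.723,
           0.676, -1.099, -0.314, -0.394, -0.633]
stats = summarize(preview)
for name, value in stats.items():
    print(f"{name}: {value}")
```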

Location Test

The location test fits a linear regression of the response $Y$ against the run-order index $X = 1, 2, \ldots, N$ and tests whether the slope is significantly different from zero.

$$H_0\!: B_1 = 0 \quad \text{vs.} \quad H_a\!: B_1 \neq 0$$

| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| $B_0$ (intercept) | 0.006991 | 0.09155 | 0.0764 |
| $B_1$ (slope) | -0.00003963 | 0.0003167 | -0.1251 |

Residual standard deviation: 1.02205 with 498 degrees of freedom.

Conclusion: The slope t-value of -0.1251 is well within the critical bounds $\pm t_{0.975,\,498} = \pm 1.96$, so we fail to reject $H_0$; the data give no evidence against a constant location. There is no detectable drift in the data over the 500 observations.
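
The slope test described above can be sketched with ordinary least squares on the run-order index. This is a minimal illustration, not the software used for the case study; `slope_t_test` is a hypothetical name, and the final lines only check that the reported estimate and standard error reproduce the reported t-value.

```python
import math


def slope_t_test(y):
    """Regress y on run order 1..N; return (slope, std_error, t_value)."""
    n = len(y)
    x = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))  # residual SD with N - 2 degrees of freedom
    slope_se = resid_sd / math.sqrt(sxx)
    return slope, slope_se, slope / slope_se


# Consistency check on the reported table values: estimate / std error.
reported_t = -0.00003963 / 0.0003167
print(round(reported_t, 4))
```

A |t| below the critical value of about 1.96 means the fitted drift is indistinguishable from zero at the 5% level.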

Variation Test

Bartlett’s test for homogeneity of variance divides the data into $k = 4$ equal-length intervals of 125 observations each and tests whether their variances are equal.

$$H_0\!: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 \quad \text{vs.} \quad H_a\!: \text{at least one } \sigma_i^2 \text{ differs}$$

| Statistic | Value |
|---|---|
| Test statistic $T$ | 2.3737 |
| Degrees of freedom | $k - 1 = 3$ |
| Critical value $\chi^2_{0.95,\,3}$ | 7.8147 |

Conclusion: The test statistic of 2.3737 is well below the critical value of 7.8147, so we fail to reject H0H_0 — the variances are not significantly different across the four intervals. The variation is constant across the run.
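
A minimal sketch of the Bartlett statistic for $k$ consecutive equal-length blocks is shown below, using the standard textbook formula with pooled variance and small-sample correction. The name `bartlett_statistic` is hypothetical, and the toy inputs are constructed for illustration rather than taken from the dataset.

```python
import math
import statistics


def bartlett_statistic(x, k=4):
    """Bartlett's T for k consecutive equal-length blocks of x.

    Trailing observations are dropped if len(x) is not divisible by k.
    """
    size = len(x) // k
    groups = [x[i * size:(i + 1) * size] for i in range(k)]
    n_total = size * k
    variances = [statistics.variance(g) for g in groups]
    pooled = sum((size - 1) * v for v in variances) / (n_total - k)
    numerator = ((n_total - k) * math.log(pooled)
                 - sum((size - 1) * math.log(v) for v in variances))
    correction = 1 + (k / (size - 1) - 1 / (n_total - k)) / (3 * (k - 1))
    return numerator / correction


CRITICAL_5PCT_3DF = 7.8147  # chi-square critical value quoted in the text

equal_blocks = [0, 1, 2, 3] * 4
unequal_blocks = [0, 1, 2, 3, 0, 10, 20, 30, 0, 1, 2, 3, 0, 10, 20, 30]
print(bartlett_statistic(equal_blocks), bartlett_statistic(unequal_blocks))
```

The statistic is compared against a chi-square distribution with $k - 1$ degrees of freedom; values below the critical value give no evidence of changing variance.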

Randomness Tests

Two complementary tests assess whether the observations are independent.

Runs test — tests whether the sequence of values above and below the median was produced randomly.

$$H_0\!: \text{sequence is random} \quad \text{vs.} \quad H_a\!: \text{sequence is not random}$$

| Statistic | Value |
|---|---|
| Test statistic $Z$ | -1.0744 |
| Critical value $Z_{1-\alpha/2}$ | 1.96 |

Conclusion: $|Z| = 1.0744$ is within the critical bounds of $\pm 1.96$, so we fail to reject $H_0$; the data are consistent with a random sequence.
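
One common form of the runs test counts runs above and below the median and standardizes the count with its expected value and variance under randomness. The sketch below implements that form; whether it matches the case study's computation in every detail (for example, the handling of values tied with the median) is an assumption, and `runs_test_z` is a hypothetical name.

```python
import math
import statistics


def runs_test_z(x):
    """Z statistic for the number of runs above/below the median."""
    median = statistics.median(x)
    # Values equal to the median are dropped (an assumed tie-handling rule).
    signs = [v > median for v in x if v != median]
    n1 = sum(signs)               # count above the median
    n2 = len(signs) - n1          # count below the median
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


alternating = [0, 1] * 10        # too many runs: large positive Z
blocked = [0] * 10 + [1] * 10    # too few runs: large negative Z
print(runs_test_z(alternating), runs_test_z(blocked))
```

A negative Z, as reported for this dataset, means slightly fewer runs than expected; it only becomes evidence of non-randomness once it falls outside $\pm 1.96$.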

Lag-1 autocorrelation — measures the linear dependence between consecutive observations.

| Statistic | Value |
|---|---|
| $r_1$ | 0.045 |
| Critical value $1.96/\sqrt{N}$ | 0.087 |

Conclusion: The lag-1 autocorrelation of 0.045 is well within the critical bounds of $\pm 0.087$ (the rounder $2/\sqrt{N}$ bound is about 0.089 for $N = 500$). No significant autocorrelation is detected at lag 1 or at any lag through lag 50. The randomness assumption is satisfied.

Distribution Test

The probability plot correlation coefficient (PPCC) for a normal distribution is 0.996, slightly below the critical value of 0.997 for $n = 500$, so the PPCC test technically rejects normality at the 5% level. The Anderson-Darling test yields a test statistic of $A^2 = 1.0612$, which exceeds the critical value of 0.787 and also rejects normality at the 5% level.

| Test | Statistic | Critical Value | Result |
|---|---|---|---|
| PPCC (normal) | 0.996 | 0.997 | Marginally rejects |
| Anderson-Darling | $A^2 = 1.0612$ | 0.787 | Rejects |

However, both rejections are marginal, and the graphical evidence from the normal probability plot, which shows data closely following the theoretical straight line, supports the conclusion that the data are reasonably well-modeled by a normal distribution. With $n = 500$, these tests have very high power and can detect trivially small departures from exact normality. A normal model remains appropriate for practical purposes.
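
For reference, the Anderson-Darling statistic for normality with estimated mean and standard deviation can be sketched as below. This uses the standard $A^2$ formula without any small-sample adjustment, which is an assumption about how the reported value was computed; `anderson_darling_normal` is a hypothetical name, and the two test inputs are synthetic quantile grids rather than case-study data.

```python
import math
import statistics


def anderson_darling_normal(x):
    """A^2 statistic against a normal with mean and SD estimated from x."""
    n = len(x)
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)
    std_normal = statistics.NormalDist()
    z = sorted((v - mean) / sd for v in x)
    cdf = [std_normal.cdf(v) for v in z]
    total = sum((2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
                for i in range(n))
    return -n - total / n


CRITICAL_5PCT = 0.787  # critical value quoted in the text

# Synthetic inputs: exact normal quantiles vs. exact exponential quantiles.
probs = [(i + 0.5) / 100 for i in range(100)]
normal_like = [statistics.NormalDist().inv_cdf(p) for p in probs]
skewed = [-math.log(1 - p) for p in probs]
print(anderson_darling_normal(normal_like), anderson_darling_normal(skewed))
```

The statistic weights the tails heavily, which is one reason a large sample can trigger a formal rejection even when the probability plot looks straight.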

Outlier Detection

Grubbs’ test identifies the observation farthest from the mean and tests whether it is a statistical outlier.

| Statistic | Value |
|---|---|
| Test statistic $G$ | 3.3681 |
| Critical value ($n = 500$, $\alpha = 0.05$) | 3.8631 |

Conclusion: The test statistic of 3.3681 is below the critical value of 3.8631, so we fail to reject — no significant outliers are identified at the 5% level.
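
The Grubbs statistic is the largest absolute deviation from the mean, in units of the sample standard deviation. The sketch below computes it and also checks the reported value against the rounded summary statistics (maximum 3.4360, mean -0.0029, standard deviation 1.0210). The function name is hypothetical, and the critical value is taken from the text rather than computed.

```python
import statistics


def grubbs_statistic(x):
    """G = max |x_i - mean| / s, using the sample standard deviation."""
    mean = statistics.fmean(x)
    sd = statistics.stdev(x)
    return max(abs(v - mean) for v in x) / sd


# Consistency check from the rounded summary statistics in this case study.
# The maximum (3.4360) is farther from the mean than the minimum (-2.6470).
g_from_summary = (3.4360 - (-0.0029)) / 1.0210
CRITICAL_5PCT = 3.8631  # quoted for n = 500, alpha = 0.05

print(round(g_from_summary, 4), g_from_summary < CRITICAL_5PCT)
```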

Test Summary

| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | $t = -0.1251$ | 1.96 | Fail to reject |
| Fixed variation | Bartlett’s test | $T = 2.3737$ | 7.8147 | Fail to reject |
| Randomness | Runs test | $Z = -1.0744$ | 1.96 | Fail to reject |
| Randomness | Autocorrelation lag-1 | $r_1 = 0.045$ | 0.087 | Fail to reject |
| Normality | Anderson-Darling | $A^2 = 1.0612$ | 0.787 | Marginally rejects |
| Outliers | Grubbs’ test | $G = 3.3681$ | 3.8631 | Fail to reject |

All four assumptions are satisfied for practical purposes. The univariate model $Y_i = C + E_i$ is appropriate for these data.

Interpretation

The graphical and quantitative analyses converge on a consistent conclusion: the normal random numbers dataset satisfies all four underlying assumptions of a univariate measurement process. The run sequence plot shows a stable band with no visible trend or shift, confirmed by the location test ($t = -0.1251$, well within $\pm 1.96$). Bartlett’s test confirms constant variation ($T = 2.3737$, well below the critical value of 7.8147). The randomness assumption is supported by both the structureless lag plot and the quantitative tests: the runs test ($Z = -1.0744$) and lag-1 autocorrelation ($r_1 = 0.045$) are both non-significant, and the autocorrelation plot shows all coefficients within the 95% confidence bands.

The only nuance is the formal normality assessment. The Anderson-Darling test ($A^2 = 1.0612$) marginally rejects normality at the 5% level, and the PPCC of 0.996 falls just below the critical value of 0.997. However, the normal probability plot shows excellent linearity with no systematic curvature, and Grubbs’ test ($G = 3.3681$, below the critical value of 3.8631) finds no outliers. The apparent rejection is a power artifact at $n = 500$: with this sample size, formal tests can detect trivially small departures from exact normality that have no practical significance. The data are well-modeled by a normal distribution.

The univariate model $Y_i = C + E_i$ is appropriate for this dataset. Because all four assumptions hold, the standard 95% confidence interval $\bar{Y} \pm 2s/\sqrt{N}$ is statistically valid. This case study serves as the reference baseline: the ideal outcome when a process is operating correctly and the standard methodology applies without qualification.

Conclusions

All four assumptions of a univariate measurement process are satisfied for this dataset:

  1. Fixed location: confirmed; no significant drift (regression slope $t = -0.1251$)
  2. Fixed variation: confirmed; Bartlett’s test shows homogeneous variance ($T = 2.3737$)
  3. Randomness: confirmed; no significant autocorrelation ($r_1 = 0.045$) and runs test passes ($Z = -1.0744$)
  4. Distribution: the normal distribution is a reasonable model despite marginal formal rejections

The data are random, from a fixed location (mean approximately 0), with fixed variation (standard deviation approximately 1), and follow a normal distribution. The recommended model is the sample mean as the best estimate of location, with the standard 95% confidence interval:

$$\bar{Y} \pm \frac{2s}{\sqrt{N}} = -0.0029 \pm \frac{2(1.0210)}{\sqrt{500}} = -0.0029 \pm 0.0913$$

This gives a 95% confidence interval of approximately $(-0.094,\; 0.088)$ for the true mean. Because all four assumptions hold, this confidence interval is statistically valid.
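
The interval arithmetic can be checked directly from the rounded summary statistics. The sketch below is a small illustration with a hypothetical helper name; it uses the approximate multiplier 2 from the formula above rather than an exact t quantile.

```python
import math


def approx_95_ci(mean, sd, n):
    """Return (lower, upper, half_width) for mean +/- 2 * sd / sqrt(n)."""
    half_width = 2 * sd / math.sqrt(n)
    return mean - half_width, mean + half_width, half_width


# Rounded summary statistics reported in this case study.
lower, upper, half_width = approx_95_ci(-0.0029, 1.0210, 500)
print(f"half-width = {half_width:.4f}")
print(f"interval   = ({lower:.3f}, {upper:.3f})")
```

The interval comfortably contains zero, the known mean of the generating distribution.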

This case study serves as the reference standard for comparison with datasets that violate one or more assumptions, such as the random walk case study where the fixed-location and randomness assumptions are catastrophically violated.