Normal Random Numbers Case Study
NIST/SEMATECH Section 1.4.2.1 Normal Random Numbers
Background and Data
This case study applies exploratory data analysis (EDA) techniques to 500 observations generated from a standard normal distribution, taken from a Rand Corporation publication. The primary goal is to demonstrate the ideal case where a well-behaved dataset satisfies all four underlying assumptions of a univariate measurement process: fixed location, fixed variation, randomness, and fixed distribution. This serves as a baseline example: when the process is operating correctly, the EDA graphical and quantitative outputs should confirm every assumption.
The dataset originates from NIST/SEMATECH Section 1.4.2.1. With N = 500 observations, the sample is large enough to detect moderate departures from the assumptions.
Dataset
Rand Corporation, 500 standard normal N(0,1) random numbers
NIST source description
Rand normal numbers. Sample size N = 500. 10 numbers per line, format 10F7.4.
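The fixed-width layout described above (ten fields per record, each 7 characters wide with 4 decimals) can be read by slicing rather than splitting on whitespace, because a negative value fills its whole field and may touch its neighbour. The following is a minimal sketch under that assumption; the function names and the five-per-line sample layout are hypothetical, and the sample values are the first ten entries of the data preview padded to four decimals.

```python
def parse_f7_4_line(line, width=7):
    """Split one fixed-width record into floats (Fortran-style F7.4 fields)."""
    line = line.rstrip("\n")
    fields = [line[i:i + width] for i in range(0, len(line), width)]
    return [float(f) for f in fields if f.strip()]


def parse_records(lines):
    """Flatten all records into one list, preserving run order."""
    values = []
    for line in lines:
        values.extend(parse_f7_4_line(line))
    return values


# Hypothetical layout: five fields per record here for brevity
# (the source description states ten per line).
sample = [
    "-1.2760-1.2180-0.4530-0.3500 0.7230",
    " 0.6760-1.0990-0.3140-0.3940-0.6330",
]
print(parse_records(sample))
```

Preserving run order matters here, since the location, variation, and randomness checks all depend on the sequence in which the observations were recorded.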
Preview data
| # | Value |
|---|---|
| 1 | -1.276 |
| 2 | -1.218 |
| 3 | -0.453 |
| 4 | -0.35 |
| 5 | 0.723 |
| 6 | 0.676 |
| 7 | -1.099 |
| 8 | -0.314 |
| 9 | -0.394 |
| 10 | -0.633 |
| ... 490 more rows | |
Test Underlying Assumptions
Goals
The analysis has three primary objectives:
- Model validation: assess whether the univariate model $Y_i = C + E_i$ (a constant $C$ plus a random error $E_i$) is an appropriate fit for the normal random numbers.
- Assumption testing: evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:
  - Random sampling: the data are uncorrelated
  - Fixed distribution: the data come from a fixed distribution
  - Fixed location: the distribution location (mean) is constant
  - Fixed variation: the distribution scale (standard deviation) is constant
- Confidence interval validity: determine whether the standard confidence interval formula $\bar{Y} \pm 2s/\sqrt{N}$ is appropriate, where $s$ is the standard deviation. This formula relies on all four assumptions holding; if they are violated, the confidence interval has no statistical meaning.
Graphical Output and Interpretation
4-Plot Overview
The 4-plot for this dataset is the archetype of a healthy process. All four component plots are consistent with the underlying assumptions: the run sequence plot shows a stable band, the lag plot shows a structureless circular cloud, the histogram is bell-shaped, and the normal probability plot is approximately linear. No remedial action is needed.
The assumptions are addressed by the four diagnostic plots:
- The run sequence plot (upper left) shows 500 observations fluctuating randomly around a stable central location near zero with no trend, shift, or outliers — the fixed-location and fixed-variation assumptions appear satisfied.
- The lag plot (upper right) shows a structureless, roughly circular scatter cloud centered at the origin — no evidence of serial dependence, supporting the randomness assumption.
- The histogram (lower left) displays the classic symmetric, bell-shaped curve expected from normal data, centered near zero with no obvious skewness or heavy tails.
- The normal probability plot (lower right) shows data points falling closely along a straight diagonal line with no systematic curvature — confirming that a normal distribution is a reasonable model.
All four assumptions appear satisfied. The univariate model is appropriate and the standard confidence interval is valid. No remedial action is needed.
Run Sequence Plot
The run sequence plot displays 500 observations fluctuating randomly around a stable central location near zero with no discernible trend, shift, or cyclic pattern. Both the location and scale appear constant across the entire run, confirming the fixed-location and fixed-variation assumptions.
Lag Plot
The lag plot at lag 1 shows a structureless, roughly circular scatter cloud centered at the origin. There is no evidence of serial dependence, autocorrelation, or clustering. This supports the randomness assumption.
Histogram
The histogram displays the classic symmetric, bell-shaped curve expected from normal data. The distribution is unimodal with no obvious skewness or heavy tails. The peak is centered near zero, consistent with the known mean of the generating distribution.
Normal Probability Plot
The normal probability plot shows data points falling closely along a straight diagonal line with no systematic curvature, S-shapes, or outlier departures. This confirms that the normal distribution is an appropriate model for these data.
Autocorrelation Plot
The autocorrelation plot confirms the randomness assumption. All autocorrelation coefficients at lags 1 through 50 fall within the 95% confidence bands at ±2/√N, consistent with white noise.
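The check behind the autocorrelation plot can be sketched as follows: compute the sample autocorrelation at each lag and flag any coefficient outside the ±2/√N band. This is an illustrative sketch only; the function names are hypothetical and the data are synthetic Gaussian noise, not the Rand numbers.

```python
import math
import random


def autocorrelation(values, lag):
    """Sample autocorrelation r_k = c_k / c_0 using the overall mean."""
    n = len(values)
    mean = sum(values) / n
    c0 = sum((v - mean) ** 2 for v in values)
    ck = sum((values[i] - mean) * (values[i + lag] - mean) for i in range(n - lag))
    return ck / c0


def lags_outside_band(values, max_lag=50):
    """Return (lag, r_k) pairs whose |r_k| exceeds the 2/sqrt(N) band."""
    band = 2.0 / math.sqrt(len(values))
    flagged = []
    for lag in range(1, max_lag + 1):
        r = autocorrelation(values, lag)
        if abs(r) > band:
            flagged.append((lag, r))
    return flagged


# Synthetic stand-in data, NOT the Rand numbers.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(500)]
print(len(lags_outside_band(noise)), "of 50 lags fall outside the band")
```

Because the band is an approximate 95% limit, an occasional lag outside it is expected even for truly independent data; a systematic pattern of exceedances is what signals non-randomness.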
Spectral Plot
The spectral plot shows the frequency-domain structure. For independent random data, the spectrum should be approximately flat with no dominant peaks.
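To make the idea of a flat spectrum concrete, the sketch below computes a raw periodogram with a direct discrete Fourier sum. This is a simple stand-in chosen for illustration, not necessarily the spectral estimator used to draw the plot, and the function name is hypothetical. A pure cosine is used as a check because its spectrum is the opposite of flat: all power sits at one frequency.

```python
import cmath
import math


def periodogram(values):
    """Raw periodogram at Fourier frequencies k/n for k = 1 .. n//2."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    power = []
    for k in range(1, n // 2 + 1):
        coef = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                   for t, c in enumerate(centered))
        power.append(abs(coef) ** 2 / n)
    return power  # power[k - 1] is the power at k/n cycles per observation


# Synthetic check: a cosine with 5 cycles in 64 points has one dominant peak.
wave = [math.cos(2 * math.pi * 5 * t / 64) for t in range(64)]
spectrum = periodogram(wave)
peak_k = spectrum.index(max(spectrum)) + 1
print("dominant frequency index:", peak_k)
```

For independent random data, no single entry of the returned list should stand far above the rest in this way.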
Quantitative Output and Interpretation
Summary Statistics
| Statistic | Value |
|---|---|
| Sample size | 500 |
| Mean | -0.0029 |
| Std Dev | 1.0210 |
| Median | -0.0930 |
| Min | -2.6470 |
| Max | 3.4360 |
| Range | 6.0830 |
The sample mean and standard deviation are close to the theoretical values of 0 and 1, respectively. The range of about 6.1, roughly six standard deviations, is consistent with a sample of size 500 from a standard normal distribution.
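The entries in the summary table can be reproduced with the standard library. The sketch below assumes the reported standard deviation uses the n - 1 denominator; the function name is hypothetical, and the ten preview values are used only as a small illustration, so the printed numbers will not match the full 500-point table.

```python
import statistics


def summarize(values):
    """Return the statistics reported in the summary table."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "std_dev": statistics.stdev(values),  # sample SD, n - 1 denominator
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }


# First ten values from the data preview, used only as a small illustration.
preview = [-1.276, -1.218, -0.453, -0.35, 0.723, 0.676, -1.099, -0.314, -0.394, -0.633]
for name, value in summarize(preview).items():
    print(f"{name:8s} {value:.4f}")
```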
Location Test
The location test fits a linear regression of the response against the run-order index and tests whether the slope is significantly different from zero.
| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| (intercept) | 0.006991 | 0.09155 | 0.0764 |
| (slope) | -0.00003963 | 0.0003167 | -0.1251 |
Residual standard deviation: 1.02205 with 498 degrees of freedom.
Conclusion: The slope t-value of -0.1251 is well within the critical bounds of ±1.96, so we fail to reject the null hypothesis of zero slope; the location is constant. There is no detectable drift in the data over the 500 observations.
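The location test amounts to an ordinary least-squares fit against the run-order index followed by a t-ratio for the slope. The sketch below is one way to compute it, assuming the index runs from 1 to n; the function name is hypothetical. The final line simply rechecks the tabulated t-value as estimate divided by standard error.

```python
import math


def slope_t_value(values):
    """Least-squares fit of values against run order 1..n; returns (slope, t)."""
    n = len(values)
    xs = range(1, n + 1)
    x_bar = (n + 1) / 2
    y_bar = sum(values) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, values))
    resid_sd = math.sqrt(sse / (n - 2))   # residual SD on n - 2 degrees of freedom
    std_err = resid_sd / math.sqrt(sxx)   # standard error of the slope
    return slope, slope / std_err


# Arithmetic check against the table: estimate / standard error.
print(round(-0.00003963 / 0.0003167, 4))  # -0.1251
```

A slope t-value with absolute value above 1.96 would indicate drift at the 5% level for a sample of this size.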
Variation Test
Bartlett’s test for homogeneity of variance divides the data into four equal-length intervals of 125 observations each and tests whether their variances are equal.
| Statistic | Value |
|---|---|
| Test statistic | 2.3737 |
| Degrees of freedom | 3 |
| Critical value | 7.8147 |
Conclusion: The test statistic of 2.3737 is well below the critical value of 7.8147, so we fail to reject the null hypothesis of equal variances; the variances are not significantly different across the four intervals. The variation is constant across the run.
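Bartlett's statistic compares the pooled variance with the individual group variances on a log scale. The sketch below implements the usual textbook form of the statistic and a helper that splits a series into four consecutive blocks; both function names are hypothetical, and the critical value is taken from the table rather than computed, since the standard library has no chi-square quantile function.

```python
import math
import statistics


def bartlett_statistic(groups):
    """Bartlett's statistic for equal variances across the given groups."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    total = sum(sizes)
    variances = [statistics.variance(g) for g in groups]
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances))
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (total - k)) / (3 * (k - 1))
    return numerator / correction


def split_into_quarters(values):
    """Four consecutive, equal-length blocks in run order."""
    size = len(values) // 4
    return [values[i * size:(i + 1) * size] for i in range(4)]


# Groups with identical spread give a statistic of essentially zero.
print(bartlett_statistic([[1, 2, 3, 4], [11, 12, 13, 14], [5, 6, 7, 8]]))
```

For the case-study data, the value returned for the four blocks would be compared with 7.8147; larger values indicate unequal variances.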
Randomness Tests
Two complementary tests assess whether the observations are independent.
Runs test — tests whether the sequence of values above and below the median was produced randomly.
| Statistic | Value |
|---|---|
| Test statistic | -1.0744 |
| Critical value | 1.96 |
Conclusion: The test statistic of -1.0744 is within the critical bounds of ±1.96, so we fail to reject the null hypothesis of randomness; the data are consistent with a random sequence.
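A runs test of this kind counts the runs of consecutive values on the same side of the median and standardizes the count by its expected value and variance under randomness. The sketch below uses the standard large-sample normal approximation and drops values equal to the median; that tie handling and the function name are assumptions for illustration.

```python
import math
import statistics


def runs_test_z(values):
    """Z statistic for the number of runs above/below the median."""
    median = statistics.median(values)
    signs = [v > median for v in values if v != median]  # ties are dropped
    n_above = sum(signs)
    n_below = len(signs) - n_above
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n_above + n_below
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)


# Too many runs (strict alternation) gives a large positive Z;
# too few runs (two long blocks) gives a large negative Z.
print(runs_test_z([1, 2] * 10), runs_test_z([1] * 10 + [2] * 10))
```

A negative statistic, as in the table above, means slightly fewer runs than expected, which is unremarkable while it stays inside ±1.96.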
Lag-1 autocorrelation — measures the linear dependence between consecutive observations.
| Statistic | Value |
|---|---|
| Lag-1 autocorrelation | 0.045 |
| Critical value | 0.087 |
Conclusion: The lag-1 autocorrelation of 0.045 is well within the critical bounds of ±0.087. No significant autocorrelation is detected at lag 1 or at any lag through lag 50. The randomness assumption is satisfied.
Distribution Test
The probability plot correlation coefficient (PPCC) for a normal distribution is 0.996, slightly below the critical value of 0.997, so the PPCC test technically rejects normality at the 5% level. The Anderson-Darling test statistic exceeds its critical value of 0.787, so that test also rejects normality at the 5% level.
| Test | Statistic | Critical Value | Result |
|---|---|---|---|
| PPCC (normal) | 0.996 | 0.997 | Marginally rejects |
| Anderson-Darling | | 0.787 | Rejects |
However, both rejections are marginal, and the graphical evidence from the normal probability plot, which shows the data closely following the theoretical straight line, supports the conclusion that the data are reasonably well-modeled by a normal distribution. With N = 500, these tests have very high power and can detect trivially small departures from exact normality. A normal model remains appropriate for practical purposes.
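The PPCC is the correlation between the sorted data and the corresponding theoretical normal quantiles, so it is the numerical counterpart of the straightness of the normal probability plot. The sketch below uses Filliben's approximation for the order statistic medians; whether the reported value of 0.996 was computed with exactly these plotting positions is an assumption, and the function names are hypothetical.

```python
import math
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def normal_ppcc(values):
    """Correlation between sorted data and normal order statistic medians."""
    n = len(values)
    # Filliben's approximation to the uniform order statistic medians.
    medians = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    medians[-1] = 0.5 ** (1 / n)
    medians[0] = 1 - medians[-1]
    normal = statistics.NormalDist()
    quantiles = [normal.inv_cdf(m) for m in medians]
    return pearson(quantiles, sorted(values))


# Synthetic illustration: near-normal values versus a strongly skewed transform.
grid = [statistics.NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(2 * g) for g in grid]
print(round(normal_ppcc(grid), 4), round(normal_ppcc(skewed), 4))
```

Values very close to 1 indicate a nearly linear probability plot, which is why a coefficient of 0.996 can fail a formal test at a large sample size while still describing data that look normal.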
Outlier Detection
Grubbs’ test identifies the observation farthest from the mean and tests whether it is a statistical outlier.
| Statistic | Value |
|---|---|
| Test statistic | 3.3681 |
| Critical value (n = 500, α = 0.05) | 3.8631 |
Conclusion: The test statistic of 3.3681 is below the critical value of 3.8631, so we fail to reject the null hypothesis of no outliers; no significant outliers are identified at the 5% level.
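The Grubbs statistic is the largest absolute deviation from the sample mean, measured in sample standard deviations. The sketch below computes it and rechecks the reported value from the summary statistics; the function name is hypothetical, and the critical value is not computed because it needs a t-distribution quantile that the standard library does not provide.

```python
import statistics


def grubbs_statistic(values):
    """G = max |x_i - mean| / s, with s the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return max(abs(v - mean) for v in values) / sd


# Check against the summary table: the maximum (3.4360) lies farther from the
# mean (-0.0029) than the minimum (-2.6470), so it determines G.
print(round((3.4360 - (-0.0029)) / 1.0210, 3))  # 3.368
```

The recomputed value agrees with the reported 3.3681 up to rounding of the tabulated mean and standard deviation.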
Test Summary
| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | -0.1251 | ±1.96 | Fail to reject |
| Fixed variation | Bartlett’s test | 2.3737 | 7.8147 | Fail to reject |
| Randomness | Runs test | -1.0744 | ±1.96 | Fail to reject |
| Randomness | Autocorrelation lag-1 | 0.045 | ±0.087 | Fail to reject |
| Normality | Anderson-Darling | | 0.787 | Marginally rejects |
| Outliers | Grubbs’ test | 3.3681 | 3.8631 | Fail to reject |
All four assumptions are satisfied for practical purposes; the only formal rejection is the marginal normality result. The univariate model is appropriate for these data.
Interpretation
The graphical and quantitative analyses converge on a consistent conclusion: the normal random numbers dataset satisfies all four underlying assumptions of a univariate measurement process. The run sequence plot shows a stable band with no visible trend or shift, confirmed by the location test (slope t-value of -0.1251, well within ±1.96). Bartlett’s test confirms constant variation (test statistic of 2.3737, well below the critical value of 7.8147). The randomness assumption is supported by both the structureless lag plot and the quantitative tests: the runs test (statistic of -1.0744, within ±1.96) and the lag-1 autocorrelation (0.045, within ±0.087) are both non-significant, and the autocorrelation plot shows all coefficients within the 95% confidence bands.
The only nuance is the formal normality assessment. The Anderson-Darling test marginally rejects normality at the 5% level, and the PPCC of 0.996 falls just below the critical value of 0.997. However, the normal probability plot shows excellent linearity with no systematic curvature, and Grubbs’ test (statistic of 3.3681, below the critical value of 3.8631) confirms no outliers. The apparent rejection is a power artifact at N = 500: with this sample size, formal tests can detect trivially small departures from exact normality that have no practical significance. The data are well-modeled by a normal distribution.
The univariate model is appropriate for this dataset. Because all four assumptions hold, the standard 95% confidence interval is statistically valid. This case study serves as the reference baseline — the ideal outcome when a process is operating correctly and the standard methodology applies without qualification.
Conclusions
All four assumptions of a univariate measurement process are satisfied for this dataset:
- Fixed location: confirmed; no significant drift (regression slope t-value of -0.1251, within ±1.96)
- Fixed variation: confirmed; Bartlett’s test shows homogeneous variance (2.3737, below 7.8147)
- Randomness: confirmed; no significant autocorrelation (lag-1 value of 0.045, within ±0.087) and runs test passes (-1.0744, within ±1.96)
- Distribution: the normal distribution is a reasonable model despite marginal formal rejections
The data are random, from a fixed location (mean approximately 0), with fixed variation (standard deviation approximately 1), and follow a normal distribution. The recommended model is the sample mean as the best estimate of location, with the standard 95% confidence interval:
$$\bar{Y} \pm \frac{2s}{\sqrt{N}}$$
With the summary statistics above, this gives a 95% confidence interval of approximately -0.0029 ± 0.09 for the true mean. Because all four assumptions hold, this confidence interval is statistically valid.
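The interval can be recomputed from the summary statistics table. The sketch below assumes the interval has the form mean ± 2s/√N, with the multiplier 2 as the usual approximation to the 95% normal quantile; the function name and the `multiplier` parameter are hypothetical.

```python
import math


def mean_confidence_interval(mean, std_dev, n, multiplier=2.0):
    """Interval mean +/- multiplier * std_dev / sqrt(n)."""
    half_width = multiplier * std_dev / math.sqrt(n)
    return mean - half_width, mean + half_width


# Values taken from the summary statistics table.
low, high = mean_confidence_interval(-0.0029, 1.0210, 500)
print(f"({low:.4f}, {high:.4f})")  # (-0.0942, 0.0884)
```

The interval comfortably contains 0, the known mean of the generating distribution. Using 1.96 instead of 2 as the multiplier narrows it only slightly.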
This case study serves as the reference standard for comparison with datasets that violate one or more assumptions, such as the random walk case study where the fixed-location and randomness assumptions are catastrophically violated.