Uniform Random Numbers Case Study

NIST/SEMATECH Section 1.4.2.2 Uniform Random Numbers

Background and Data

This case study applies exploratory data analysis to 500 observations drawn from a uniform distribution U(0,1), taken from a Rand Corporation publication. The purpose is to demonstrate how EDA graphical and quantitative techniques detect a non-normal underlying distribution while the remaining three assumptions — fixed location, fixed variation, and randomness — are all satisfied. This is the complementary case to the normal random numbers study.

The dataset originates from NIST/SEMATECH Section 1.4.2.2. With n = 500 uniformly distributed observations, the distribution assumption is the only one that fails. This case study is important because it illustrates the effects of a known non-normal underlying distribution and tests whether EDA techniques correctly identify the departure from normality when the true distribution is known in advance.

Dataset

RANDU.DAT
Observations: 500
Variable: Random uniform value

Rand Corporation, 500 uniform U(0,1) random numbers

NIST source description
Rand uniform numbers (based on combining 6 consecutive Rand uniform digits). Sample size N = 500. Reference program: MAIN.RANDU.
Preview data (first 10 of 500 rows):

| # | Value |
|---|----------|
| 1 | 0.100973 |
| 2 | 0.253376 |
| 3 | 0.520135 |
| 4 | 0.863467 |
| 5 | 0.354876 |
| 6 | 0.80959 |
| 7 | 0.911739 |
| 8 | 0.292749 |
| 9 | 0.37542 |
| 10 | 0.480564 |

... 490 more rows

Test Underlying Assumptions

Goals

The analysis has three primary objectives:

  1. Model validation — assess whether the univariate model is an appropriate fit for the uniform random number data:

     Y_i = C + E_i

  2. Assumption testing — evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:

     • Random sampling — the data are uncorrelated
     • Fixed distribution — the data come from a fixed distribution
     • Fixed location — the distribution location (mean) is constant
     • Fixed variation — the distribution scale (standard deviation) is constant

  3. Confidence interval validity — determine whether the standard confidence interval formula is appropriate:

     \bar{Y} \pm \frac{2s}{\sqrt{N}}

where s is the sample standard deviation. This formula relies on all four assumptions holding; if the distributional assumption is violated, the confidence interval may not have the correct coverage probability.

If any assumptions are violated, identify the nature of the violations and recommend appropriate remedial actions.

Graphical Output and Interpretation

4-Plot Overview

The 4-plot immediately reveals one anomaly: while the run sequence plot and lag plot appear healthy, the histogram is rectangular rather than bell-shaped, and the normal probability plot shows strong S-shaped curvature. Three of four assumptions hold; the distributional assumption does not.

Four-plot diagnostic layout for the uniform random numbers dataset (run sequence, lag, histogram, normal probability).

The assumptions are addressed by the four diagnostic plots:

  1. The run sequence plot (upper left) shows data fluctuating randomly in a stable band between 0 and 1 with no trend or shifts — the fixed-location and fixed-variation assumptions appear satisfied.
  2. The lag plot (upper right) displays a structureless square cloud (the square shape, rather than circular, is characteristic of uniform data with bounded support) — the randomness assumption appears satisfied.
  3. The histogram (lower left) is approximately flat (rectangular) across [0, 1] rather than bell-shaped — this is the clearest signal that the data are not normally distributed. The flat shape is the signature of a uniform distribution.
  4. The normal probability plot (lower right) shows a pronounced S-shaped curve — the classic indicator of a light-tailed (platykurtic) distribution compared to the normal.

Three of four assumptions hold; the distributional assumption does not. The data are random, from a fixed location, with fixed variation, but the underlying distribution is uniform rather than normal.
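A 4-plot of this kind can be reproduced with matplotlib and scipy. Since the full RANDU.DAT values are not bundled here, the sketch below draws a synthetic U(0,1) sample of the same size as a stand-in; with the real data file, `y` would be loaded from disk instead.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

fig, ax = plt.subplots(2, 2, figsize=(8, 6))
ax[0, 0].plot(y, linewidth=0.5)
ax[0, 0].set_title("Run sequence plot")
ax[0, 1].scatter(y[:-1], y[1:], s=4)       # lag-1 scatter
ax[0, 1].set_title("Lag plot")
ax[1, 0].hist(y, bins=20)
ax[1, 0].set_title("Histogram")
stats.probplot(y, dist="norm", plot=ax[1, 1])
ax[1, 1].set_title("Normal probability plot")
fig.tight_layout()
fig.savefig("four_plot.png")
```

For uniform data, the histogram panel comes out flat and the normal probability plot panel shows the S-shaped bend described above.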

Run Sequence Plot

The run sequence plot shows data fluctuating randomly in a stable band between 0 and 1. There is no trend, shift, or systematic pattern. Both location and variation remain constant across the run.

Run sequence plot of the uniform random numbers showing a stable band between 0 and 1 with no trend or shift.

Lag Plot

The lag plot at lag 1 displays a structureless square cloud, consistent with independent, identically distributed observations. The square shape (rather than circular) is characteristic of uniform data with bounded support.

Lag-1 plot showing a structureless square cloud — characteristic of independent, uniformly distributed observations.

Histogram with Normal PDF Overlay

The histogram with an overlaid normal probability density function shows that the flat, rectangular shape of the data is a poor fit to the bell-shaped normal curve. The frequencies are relatively flat across the range of the data while the normal PDF peaks sharply in the center. This mismatch is the clearest graphical signal that the normal distribution is not appropriate for these data.

Histogram with overlaid normal PDF curve. The flat rectangular histogram is a poor fit to the bell-shaped normal distribution.

Normal Probability Plot

The normal probability plot shows a pronounced S-shaped curve rather than the expected straight line. The tails curve away from the theoretical line in opposite directions — the lower tail curves upward and the upper tail curves downward. This S-shape is the classic indicator of a light-tailed (platykurtic) distribution compared to the normal.

Normal probability plot showing a pronounced S-shaped curve, indicating departure from normality (light-tailed, platykurtic distribution).

Histogram with Uniform PDF Overlay

The histogram with a uniform PDF overlay provides a much better fit. The flat, rectangular histogram shape matches the expected constant frequency under the U(0,1) distribution, in stark contrast to the poor normal fit above.

Histogram with uniform PDF overlay showing the expected frequency under U(0,1) against the observed data. The flat histogram shape matches the uniform distribution.

Uniform Probability Plot

The uniform probability plot provides the definitive distributional test for these data. When the data truly follow a uniform distribution, the plot should be nearly linear.

Uniform probability plot showing a nearly linear pattern, confirming an excellent fit to the uniform distribution.

The nearly linear pattern confirms that the uniform distribution is an excellent model for these data, contrasting sharply with the S-shaped curvature in the normal probability plot above.

Bootstrap Plot

The bootstrap plot compares the sampling variability of three location estimators — the mean, median, and midrange — by drawing 100 subsamples with replacement from the data and computing each statistic for every subsample. The top row shows the computed statistic versus subsample number; the bottom row shows the corresponding histograms.

Bootstrap plots for mean, median, and midrange showing sampling variability (top) and distributions (bottom). The midrange has the smallest variance and is the superior location estimator for uniform data.

The midrange has a far smaller variance than either the mean or the median, making it the superior location estimator for uniform data. The bootstrap confidence interval for the midrange is substantially tighter than for the mean, demonstrating why the standard mean-based confidence interval overestimates uncertainty for this distribution.
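The bootstrap comparison can be sketched directly: resample the data with replacement, compute each location estimator per resample, and compare the spread of the resulting bootstrap distributions. The sample below is synthetic (a stand-in for RANDU.DAT) and uses the same 100 resamples as the bootstrap plot.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

# 100 bootstrap resamples, as in the bootstrap plot
est = {"mean": [], "median": [], "midrange": []}
for _ in range(100):
    b = rng.choice(y, size=y.size, replace=True)
    est["mean"].append(b.mean())
    est["median"].append(np.median(b))
    est["midrange"].append((b.min() + b.max()) / 2)

for name, vals in est.items():
    print(f"{name:8s} bootstrap sd = {np.std(vals):.4f}")
```

The midrange's bootstrap standard deviation comes out an order of magnitude below that of the mean or median, mirroring the conclusion drawn from the plot.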

Autocorrelation Plot

The autocorrelation plot confirms the randomness assumption. All autocorrelation coefficients at lags 1 through 50 fall within the 95% confidence bands at ±2/√N, consistent with independent observations.

Autocorrelation plot with all coefficients at lags 1 through 50 within the 95% confidence bands, confirming independence.

Spectral Plot

The spectral plot shows a flat power spectrum with no dominant peaks, consistent with white noise. This confirms that the data have no periodic structure or frequency-domain dependence.

Spectral plot showing a flat power spectrum with no dominant peaks, consistent with white noise.

Quantitative Output and Interpretation

Summary Statistics

| Statistic | Value |
|---|---|
| Sample size n | 500 |
| Mean \bar{Y} | 0.5078 |
| Std Dev s | 0.2943 |
| Median | 0.5184 |
| Min | 0.0025 |
| Max | 0.9971 |
| Range | 0.9946 |

The sample mean of 0.5078 is close to the theoretical value of 0.5, and the sample standard deviation of 0.2943 approximates the theoretical value of 1/\sqrt{12} \approx 0.2887.
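The comparison of sample moments against the theoretical U(0,1) moments is a one-liner check. The sample here is synthetic (a stand-in for RANDU.DAT), so the printed values will be near, but not identical to, the case-study figures.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

print(f"mean  = {y.mean():.4f}  (theoretical 0.5)")
print(f"std   = {y.std(ddof=1):.4f}  (theoretical 1/sqrt(12) = {1/np.sqrt(12):.4f})")
print(f"range = {y.max() - y.min():.4f}")
```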

Location Test

The location test fits a linear regression of the response Y against the run-order index X = 1, 2, \ldots, N and tests whether the slope is significantly different from zero.

H_0\!: B_1 = 0 \quad \text{vs.} \quad H_a\!: B_1 \neq 0

| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| B_0 (intercept) | 0.522923 | 0.026381 | 19.82 |
| B_1 (slope) | -6.025 \times 10^{-5} | 9.125 \times 10^{-5} | −0.66 |

Residual standard deviation: 0.2945 with 498 degrees of freedom.

Conclusion: The slope t-value of −0.66 lies well within the critical bounds \pm t_{0.975,\,498} = \pm 1.96, so we fail to reject H_0 — the slope is not significantly different from zero at the 5% significance level. The process mean is stable at approximately 0.508. The fixed-location assumption is satisfied.
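The drift regression above can be sketched with `scipy.stats.linregress`, dividing the fitted slope by its standard error to get the t-statistic. The data are synthetic (a stand-in for RANDU.DAT), so the numbers differ slightly from the table.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)    # synthetic stand-in for RANDU.DAT
x = np.arange(1, y.size + 1)  # run-order index

res = stats.linregress(x, y)
t_slope = res.slope / res.stderr  # t-statistic for H0: slope = 0
print(f"slope = {res.slope:.3e}, t = {t_slope:.2f}")
# compare |t| against t_{0.975, N-2} ~ 1.96
```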

Variation Test

The Levene test is used instead of Bartlett’s test because the data are non-normal (uniform). The Levene test (median-based variant) divides the data into k = 4 equal-length intervals and tests whether their variances are homogeneous.

H_0\!: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 \quad \text{vs.} \quad H_a\!: \text{at least one } \sigma_i^2 \text{ differs}

| Statistic | Value |
|---|---|
| Test statistic W | 0.07983 |
| Degrees of freedom | k − 1 = 3 and N − k = 496 |
| Critical value F_{0.05,\,3,\,496} | 2.623 |

Conclusion: The test statistic of 0.07983 is well below the critical value of 2.623, so we fail to reject H0H_0 — the variances are not significantly different across the four intervals. The fixed-variation assumption is satisfied.
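The median-based Levene test is available in scipy. A minimal sketch, again on a synthetic stand-in sample, splits the series into four run-order intervals and tests their variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

# divide the series into k = 4 equal-length intervals (run order preserved)
groups = np.array_split(y, 4)
W, p = stats.levene(*groups, center="median")  # median-based variant
print(f"W = {W:.4f}, p = {p:.3f}")
```

The `center="median"` choice matches the robust variant used in the case study; `center="mean"` would give the classical Levene statistic.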

Randomness Tests

Two complementary tests assess whether the observations are independent.

Runs test — tests whether the sequence of values above and below the median was produced randomly.

H_0\!: \text{sequence is random} \quad \text{vs.} \quad H_a\!: \text{sequence is not random}

| Statistic | Value |
|---|---|
| Test statistic Z | 0.2686 |
| Critical value Z_{1-\alpha/2} | 1.96 |

Conclusion: |Z| = 0.269 does not exceed 1.96, so we fail to reject H_0 — the sequence is consistent with randomness.
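The runs test reduces to counting runs of values above and below the median and standardizing the count with the normal approximation. A minimal sketch on a synthetic stand-in sample:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

# runs of values above/below the median
above = y > np.median(y)
n1 = int(above.sum())
n2 = above.size - n1
runs = 1 + int(np.count_nonzero(above[1:] != above[:-1]))

# mean and variance of the run count under the randomness hypothesis
mu = 2.0 * n1 * n2 / (n1 + n2) + 1.0
var = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
       / ((n1 + n2) ** 2 * (n1 + n2 - 1.0)))
z = (runs - mu) / np.sqrt(var)
print(f"runs = {runs}, Z = {z:.3f}")  # compare |Z| against 1.96
```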

Lag-1 autocorrelation — measures the linear dependence between consecutive observations.

| Statistic | Value |
|---|---|
| r_1 | 0.03 |
| Critical value 2/\sqrt{N} | 0.087 |

Conclusion: The lag-1 autocorrelation of 0.03 is well within the critical bounds of \pm 0.087, and values at all lags through lag 50 fall within the significance bounds. The data are independent. The randomness assumption is satisfied.
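The lag-1 autocorrelation is the normalized dot product of the mean-centered series with its one-step shift. A short sketch, again on a synthetic stand-in sample:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

d = y - y.mean()
r1 = np.dot(d[:-1], d[1:]) / np.dot(d, d)  # lag-1 autocorrelation
bound = 2.0 / np.sqrt(y.size)              # approximate 95% significance bound
print(f"r1 = {r1:.3f}, bound = +/-{bound:.3f}")
```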

Distribution Test

The Anderson-Darling test and PPCC test are applied to test whether the data follow a normal distribution.

Normal distribution tests:

| Test | Statistic | Critical Value | Result |
|---|---|---|---|
| Anderson-Darling A^2 | 5.765 | 0.787 | Reject normality |
| Normal PPCC | 0.9772 | 0.987 | Reject normality |

The Anderson-Darling statistic of 5.765 far exceeds the 5% critical value of 0.787, and the normal PPCC of 0.9772 is less than its critical value of 0.987. Both tests decisively reject the null hypothesis of normality.
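The Anderson-Darling normality test is available as `scipy.stats.anderson`, which returns the statistic alongside critical values at several significance levels. On a synthetic stand-in sample the statistic will differ somewhat from 5.765 but remains far above the 5% critical value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

res = stats.anderson(y, dist="norm")
i5 = list(res.significance_level).index(5.0)  # position of the 5% level
print(f"A^2 = {res.statistic:.3f}, "
      f"5% critical value = {res.critical_values[i5]:.3f}")
```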

Uniform distribution tests:

| Test | Statistic | Result |
|---|---|---|
| Uniform PPCC | 0.9996 | Excellent fit |
| Uniform Anderson-Darling A^2 | 0.9082 | Good fit |

The uniform PPCC of 0.9996 indicates an excellent fit to the uniform distribution. The sample skewness and excess kurtosis are approximately 0.00 and −1.20, respectively — close to the theoretical uniform values of 0.0 and −1.2. The negative excess kurtosis confirms a platykurtic (flat-topped) distribution relative to the normal.

Conclusion: The data are not normally distributed but are an excellent fit to the uniform distribution.
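The PPCC comparison can be sketched with `scipy.stats.probplot`, whose fitted output includes the probability-plot correlation coefficient r for any candidate distribution. On a synthetic stand-in sample the uniform PPCC clearly dominates the normal one:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

# third element of the fit tuple is the correlation coefficient r
_, (_, _, r_norm) = stats.probplot(y, dist="norm")
_, (_, _, r_unif) = stats.probplot(y, dist="uniform")
print(f"normal  PPCC = {r_norm:.4f}")
print(f"uniform PPCC = {r_unif:.4f}")
```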

Outlier Detection

Grubbs’ test does not identify outliers, which is expected since the uniform distribution has finite support and the observations are well-contained within [0, 1].

Test Summary

| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | t = −0.66 | 1.96 | Fail to reject |
| Fixed variation | Levene test | W = 0.080 | 2.623 | Fail to reject |
| Randomness | Runs test | Z = 0.269 | 1.96 | Fail to reject |
| Randomness | Autocorrelation lag-1 | r_1 = 0.03 | 0.087 | Fail to reject |
| Distribution | Anderson-Darling (normal) | A^2 = 5.765 | 0.787 | Reject |
| Distribution | PPCC (normal) | 0.9772 | 0.987 | Reject |
| Outliers | Grubbs’ test | — | — | No outliers |

Three of four assumptions hold. The distributional assumption fails — the data follow a uniform, not normal, distribution. The univariate model Y_i = C + E_i is appropriate, but the error term E_i follows a uniform rather than normal distribution.

Interpretation

The graphical and quantitative analyses converge on a clear finding: the uniform random numbers dataset satisfies three of the four underlying assumptions of a univariate measurement process, with the distributional assumption being the sole departure. The run sequence plot shows a stable band between 0 and 1 with no trend or shift, confirmed by the location test (t = −0.66, well within the critical bounds of \pm 1.96). The Levene test confirms constant variation (W = 0.080, far below the critical value of 2.623). Independence is supported by the structureless lag plot, the runs test (Z = 0.269, non-significant), and the lag-1 autocorrelation (r_1 = 0.03, well within \pm 0.087). The autocorrelation plot confirms that all coefficients through lag 50 fall within the 95% confidence bands, and the spectral plot shows a flat spectrum consistent with white noise.

The distinctive feature of this case study is the distributional finding. The Anderson-Darling statistic (A^2 = 5.765) far exceeds the 5% critical value of 0.787, and the normal PPCC of 0.9772 falls well below the critical value of 0.987 — both decisively reject normality. Graphically, the histogram displays a flat, rectangular shape rather than a bell curve, and the normal probability plot shows the characteristic S-shaped curvature of a platykurtic distribution. In contrast, the uniform probability plot is nearly linear, and the uniform PPCC of 0.9996 indicates an excellent fit to the uniform distribution. The skewness and excess kurtosis (approximately 0.00 and −1.20) match the theoretical uniform moments, confirming that the departure from normality is not due to outliers or contamination but to the fundamental shape of the distribution.

Because the distributional assumption is violated, the standard confidence interval formula \bar{Y} \pm 2s/\sqrt{N} is not strictly valid, although at n = 500 the central limit theorem provides reasonable coverage. The sample mean of 0.508 is a consistent estimator, but for uniform data the midrange (\min + \max)/2 is the optimal location estimator, with a faster convergence rate. The standard normal-based interval of (0.482, 0.534) overestimates uncertainty compared to a distribution-specific approach — a bootstrap confidence interval for the midrange of approximately (0.497, 0.503) is substantially tighter. When the normality assumption is known to be violated, a distribution-free or bootstrap interval should be preferred.

Conclusions

Three of four assumptions hold: fixed location, fixed variation, and randomness are confirmed. The distributional assumption fails — the data follow a uniform, not normal, distribution.

This has practical implications for inference. The sample mean remains a valid location estimate, but confidence intervals based on the normal distribution will be incorrect. The standard confidence interval:

\bar{Y} \pm \frac{2s}{\sqrt{N}} = 0.508 \pm 0.026

is approximately valid by the central limit theorem (with n = 500 the CLT provides reasonable coverage), yielding a 95% interval of (0.482, 0.534). However, for uniform data the midrange is the optimal location estimator, with a tighter 95% bootstrap confidence interval of approximately (0.497, 0.503).
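A percentile-bootstrap interval for the midrange can be sketched as follows and compared against the normal-theory interval for the mean. The sample is synthetic (a stand-in for RANDU.DAT), so the endpoints will be close to, but not exactly, the values quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(0, 1, 500)  # synthetic stand-in for RANDU.DAT

# percentile bootstrap CI for the midrange
boot = np.empty(2000)
for i in range(2000):
    b = rng.choice(y, size=y.size, replace=True)
    boot[i] = (b.min() + b.max()) / 2
lo, hi = np.percentile(boot, [2.5, 97.5])

half = 2 * y.std(ddof=1) / np.sqrt(y.size)  # normal-theory half-width, mean
print(f"mean +/- 2s/sqrt(N): ({y.mean() - half:.3f}, {y.mean() + half:.3f})")
print(f"midrange bootstrap : ({lo:.3f}, {hi:.3f})")
```

The bootstrap interval for the midrange comes out much narrower than the normal-theory interval for the mean, matching the conclusion above.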

A distribution-free confidence interval or bootstrap approach should be preferred when the normality assumption is known to be violated. This case study demonstrates that EDA techniques — particularly the histogram, normal probability plot, Anderson-Darling test, and PPCC — reliably detect non-normal distributions, even when all other assumptions are satisfied.