Random Walk Case Study

NIST/SEMATECH Section 1.4.2.3 Random Walk

Background and Data

This case study applies exploratory data analysis to 500 observations from the NIST RANDWALK.DAT dataset, which is a cumulative sum of uniform random numbers:

$$R_i = \sum_{j=1}^{i}(U_j - 0.5)$$

where the $U_j$ are uniform random numbers on the interval (0, 1). The purpose is to demonstrate how EDA techniques detect a non-stationary process where the fixed-location assumption is violated. In a random walk, each observation equals the previous observation plus a random increment, creating persistent drift.

The dataset originates from NIST/SEMATECH Section 1.4.2.3. The case study matters in practice because random walks appear frequently in financial time series, sensor drift, and instrument calibration problems.
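
The generating mechanism described above can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the NIST data: the function name `random_walk`, the seed, and the use of Python's `random` module are illustrative choices, so the values will differ from RANDWALK.DAT.

```python
import random
from itertools import accumulate


def random_walk(n, seed=0):
    """Return R_1..R_n, the cumulative sum of (U_j - 0.5) for uniform U_j."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    increments = (rng.random() - 0.5 for _ in range(n))
    return list(accumulate(increments))


walk = random_walk(500)

# Each observation equals the previous one plus an increment bounded by 0.5.
steps = [b - a for a, b in zip(walk, walk[1:])]
print(len(walk), max(abs(s) for s in steps))
```

Because every step is bounded by 0.5 in absolute value, consecutive observations stay close together even though the level itself is free to wander.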

Dataset

RANDWALK.DAT
Observations: 500
Variable: Cumulative sum of uniform random numbers

NIST source description: Random walk data. Response variable Y. Generation algorithm: cumulative sum of (uniforms − 0.5), i.e. RW(i) = Σ(U(j) − 0.5) for j = 1 to i. Number of observations = 500.

Preview data (first 10 of 500 rows):

| # | Value |
|---|---|
| 1 | -0.399027 |
| 2 | -0.645651 |
| 3 | -0.625516 |
| 4 | -0.262049 |
| 5 | -0.407173 |
| 6 | -0.097583 |
| 7 | 0.314156 |
| 8 | 0.106905 |
| 9 | -0.017675 |
| 10 | -0.037111 |

Test Underlying Assumptions

Goals

The analysis has three primary objectives:

  1. Model validation: assess whether the univariate model is an appropriate fit for the random walk data:

     $$Y_i = C + E_i$$

  2. Assumption testing: evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:
    • Random sampling: the data are uncorrelated
    • Fixed distribution: the data come from a fixed distribution
    • Fixed location: the distribution location (mean) is constant
    • Fixed variation: the distribution scale (standard deviation) is constant
  3. Confidence interval validity: determine whether the standard confidence interval formula is appropriate:

     $$\bar{Y} \pm \frac{2s}{\sqrt{N}}$$

where $s$ is the standard deviation. This formula relies on all four assumptions holding; if they are violated, the confidence interval has no statistical meaning.

If the assumptions are violated, identify the nature and severity of the violations and recommend appropriate remedial actions.

4-Plot Overview

The 4-plot is the primary graphical tool for testing all four assumptions simultaneously. For the random walk data, it reveals severe violations.

Four-plot diagnostic layout for the random walk dataset (run sequence, lag, histogram, normal probability).

The run sequence plot (upper left) shows a meandering, non-stationary trajectory rather than a stable band — the location is clearly not fixed. The lag plot (upper right) shows a tight diagonal line indicating extreme autocorrelation — the data are not random. The histogram (lower left) and normal probability plot (lower right) are not meaningful when location and independence assumptions are violated.

Fixed Location — Run Sequence Plot

The run sequence plot tests whether the location is constant over time. For the random walk, instead of fluctuating around a fixed horizontal level, the data wander freely with no tendency to return to a central value.

Run sequence plot of the random walk data showing non-stationary drift over 500 observations.

Conclusion: The trajectory drifts both positive and negative over the 500 observations, demonstrating classic non-stationary behavior. The fixed-location assumption is violated.

Randomness — Lag Plot

The lag plot tests whether the data are random by plotting $Y_i$ versus $Y_{i-1}$. If the data are random, the lag plot should show a structureless cloud.

Lag-1 plot showing extreme autocorrelation — consecutive values cluster tightly along the y = x diagonal.

Conclusion: The lag plot shows an extremely tight linear cluster along the diagonal $y = x$, indicating consecutive observations are nearly identical. The lag-1 autocorrelation is 0.987. The randomness assumption is catastrophically violated.

Randomness — Autocorrelation Plot

When the lag plot reveals significant non-randomness, the autocorrelation plot provides essential follow-up detail. It shows the autocorrelation coefficient $r_k$ at each lag $k$, with 95% confidence bands at $\pm 2/\sqrt{N}$.

Autocorrelation plot showing significant positive autocorrelation at lags 1 through 100, decreasing linearly — characteristic of a random walk process.

The autocorrelation plot shows significant positive autocorrelation at lags 1 through 100 in a linearly decreasing fashion. This slow, linear decay is the signature of a random walk — the process has long-range memory because each observation carries forward the cumulative effect of all prior random increments. Every lag from 1 to 100 exceeds the 95% confidence band, confirming that the randomness assumption is not just violated but fundamentally inapplicable.
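
As a rough illustration of how the autocorrelation coefficients and the confidence band are obtained, the sketch below computes $r_k$ for lags 1 through 100 and counts how many fall outside $\pm 2/\sqrt{N}$. It runs on a simulated random walk, not on RANDWALK.DAT; the helper name `autocorr` and the seed are assumptions made for this example.

```python
import math
import random
from itertools import accumulate


def autocorr(y, k):
    """Sample autocorrelation r_k = sum((y_i - m)(y_{i+k} - m)) / sum((y_i - m)^2)."""
    n = len(y)
    m = sum(y) / n
    d = [v - m for v in y]
    denom = sum(v * v for v in d)
    return sum(d[i] * d[i + k] for i in range(n - k)) / denom


# Simulated stand-in for the dataset (assumption: arbitrary seed and generator).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

band = 2 / math.sqrt(len(walk))  # approximate 95% band used in the text
outside = [k for k in range(1, 101) if abs(autocorr(walk, k)) > band]
print(f"band = {band:.3f}; lags outside the band: {len(outside)} of 100")
```

How many lags fall outside the band depends on the simulated path, so the count will not necessarily match the NIST result of all 100.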

Randomness — Spectral Plot

The spectral plot complements the autocorrelation plot by showing the frequency-domain structure of the data — specifically, the power spectral density as a function of frequency.

Spectral plot showing a single dominant low-frequency peak, confirming the non-stationary, drift-dominated character of the random walk.

The spectral plot shows a single dominant low-frequency peak. Virtually all power is concentrated near frequency zero, which corresponds to slow, long-period drift. There are no secondary peaks at higher frequencies, confirming that the data have no periodic component — the non-randomness is purely drift-driven, consistent with the random walk mechanism.
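
To make the frequency-domain statement concrete, here is a minimal sketch that computes a raw periodogram with a direct discrete Fourier transform and reports the share of power at low frequencies. A raw periodogram is a simpler stand-in for the smoothed spectral estimate a spectral plot normally shows, and the data are simulated; the function name `periodogram`, the 0.05 cutoff, and the seed are assumptions.

```python
import cmath
import math
import random
from itertools import accumulate


def periodogram(y):
    """Return [(frequency, power)] at frequencies k/n, k = 1..n//2 (cycles per observation)."""
    n = len(y)
    m = sum(y) / n
    d = [v - m for v in y]  # subtract the mean before transforming
    out = []
    for k in range(1, n // 2 + 1):
        coef = sum(d[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        out.append((k / n, abs(coef) ** 2 / n))
    return out


# Simulated stand-in for the dataset (assumption: arbitrary seed).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

spec = periodogram(walk)
total_power = sum(p for _, p in spec)
low_power = sum(p for f, p in spec if f <= 0.05)
print(f"share of power at frequencies <= 0.05: {low_power / total_power:.1%}")
```

For a random walk the bulk of the power is expected at the lowest frequencies, which is what a single dominant peak near zero reflects.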

Fixed Distribution — Histogram

The histogram assesses the shape of the underlying distribution. However, because the location and randomness assumptions are already violated, the histogram cannot be meaningfully interpreted.

Histogram with KDE overlay of the random walk data. The irregular shape reflects the non-stationary path, not a meaningful distribution.

Conclusion: The irregular, multi-modal shape depends entirely on the particular path taken by the random walk. Distribution testing is not meaningful when prior assumptions fail.

Normality — Normal Probability Plot

The normal probability plot tests whether the data follow a normal distribution. Like the histogram, this test is invalid when location and independence assumptions are violated.

Normal probability plot of the random walk data. Curvature indicates departure from normality, though the test is invalid for non-stationary data.

Conclusion: The plot shows substantial curvature and departure from linearity, but this result must be interpreted cautiously — normality testing is not valid when stationarity and independence fail.

Quantitative Results

Summary Statistics

| Statistic | Value |
|---|---|
| Sample size $N$ | 500 |
| Mean $\bar{Y}$ | 3.2167 |
| Std Dev $s$ | 2.0787 |
| Autocorrelation $r_1$ | 0.987 |

These summary statistics are essentially meaningless as descriptors of a fixed process because the location is non-stationary. In particular, the standard confidence interval is not valid:

$$\bar{Y} \pm \frac{2s}{\sqrt{N}} = 3.217 \pm 0.186$$

This dramatically understates the true uncertainty because it assumes independent observations. With autocorrelation this extreme, the effective sample size is far smaller than 500, and any confidence interval computed from the standard formula has no statistical justification.
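
The arithmetic behind the interval, and a rough sense of how small the effective sample size is, can be checked as follows. The inputs are the values from the summary table. The effective-sample-size formula $N(1 - r_1)/(1 + r_1)$ is a common heuristic for a stationary AR(1) series; it is not part of the NIST analysis and does not strictly apply to a random walk, so it is shown only as an order-of-magnitude illustration.

```python
import math

# Values taken from the summary statistics table.
n, mean, s, r1 = 500, 3.2167, 2.0787, 0.987

half_width = 2 * s / math.sqrt(n)
print(f"standard interval: {mean:.3f} +/- {half_width:.3f}")

# Heuristic only (assumes a stationary AR(1) process, which a random walk
# is not): effective sample size is roughly n * (1 - r1) / (1 + r1).
n_eff = n * (1 - r1) / (1 + r1)
print(f"approximate effective sample size: {n_eff:.1f}")
```

Under this heuristic the 500 correlated observations carry about as much information on the mean as three or so independent ones, which is why the nominal half-width of 0.186 cannot be trusted.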

Location Test

The location test fits a linear regression of the response $Y$ against the run-order index $X = 1, 2, \ldots, N$ and tests whether the slope is significantly different from zero.

$$H_0\!: B_1 = 0 \quad \text{vs.} \quad H_a\!: B_1 \neq 0$$

| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| $B_0$ (intercept) | 1.8335 | 0.1721 | 10.650 |
| $B_1$ (slope) | 0.005522 | 0.000595 | 9.275 |

Residual standard deviation: 1.9214 with 498 degrees of freedom.

Conclusion: The slope t-value of 9.275 exceeds the critical value $t_{0.975,\,498} = 1.96$, so we reject $H_0$: the location is not constant. The data drift upward at a rate of approximately 0.0055 units per observation.
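
The regression behind this test is ordinary least squares on the run-order index. The sketch below shows one way to compute the slope, its standard error, and the t-value; it runs on a simulated random walk, so its numbers will not reproduce the table above. The function name `slope_t_test` and the seed are assumptions.

```python
import math
import random
from itertools import accumulate


def slope_t_test(y):
    """Regress y on run order x = 1..N; return (b0, b1, se_b1, t, resid_sd)."""
    n = len(y)
    x = range(1, n + 1)
    xbar = (n + 1) / 2
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))  # N - 2 residual degrees of freedom
    se_b1 = resid_sd / math.sqrt(sxx)
    return b0, b1, se_b1, b1 / se_b1, resid_sd


# Simulated stand-in for the dataset (assumption: arbitrary seed).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

b0, b1, se_b1, t, resid_sd = slope_t_test(walk)
print(f"slope = {b1:.6f}, t = {t:.3f}, reject H0: {abs(t) > 1.96}")
```

With 500 observations the function uses 498 residual degrees of freedom, matching the figure quoted above.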

Variation Test

The Levene test (median-based variant, preferred for non-normal data) divides the data into $k = 4$ equal-length intervals and tests whether their variances are homogeneous.

$$H_0\!: \sigma_1^2 = \sigma_2^2 = \sigma_3^2 = \sigma_4^2 \quad \text{vs.} \quad H_a\!: \text{at least one } \sigma_i^2 \text{ differs}$$

| Statistic | Value |
|---|---|
| Test statistic $W$ | 10.459 |
| Degrees of freedom | $k - 1 = 3$ and $N - k = 496$ |
| Critical value $F_{0.05,\,3,\,496}$ | 2.623 |

Conclusion: The test statistic of 10.459 exceeds the critical value of 2.623, so we reject $H_0$: the variances are significantly different across the four intervals. This is expected for a random walk: the variance of a cumulative sum grows with time.
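
A minimal sketch of the median-based Levene statistic follows, assuming the usual form in which absolute deviations from each group's median are compared with a one-way analysis of variance. It is applied to a simulated random walk split into four consecutive quarters; the helper names `levene_median` and `split_equal` and the seed are assumptions, and the result will differ from the 10.459 reported for the NIST data.

```python
import random
from itertools import accumulate
from statistics import median


def levene_median(groups):
    """Median-based Levene statistic W for a list of groups (lists of numbers)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    z = []
    for g in groups:
        med = median(g)
        z.append([abs(v - med) for v in g])  # absolute deviations from the group median
    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n_total
    between = sum(len(zi) * (zm - z_grand) ** 2 for zi, zm in zip(z, z_means))
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    return (n_total - k) / (k - 1) * between / within


def split_equal(y, k):
    """Split y into k consecutive, equal-length intervals (drops any remainder)."""
    size = len(y) // k
    return [y[i * size:(i + 1) * size] for i in range(k)]


# Simulated stand-in for the dataset (assumption: arbitrary seed).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

w = levene_median(split_equal(walk, 4))
print(f"W = {w:.3f}; exceeds 2.623: {w > 2.623}")
```

The statistic is compared with the upper critical value of the F distribution with $k - 1$ and $N - k$ degrees of freedom, here the 2.623 quoted in the table.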

Randomness Tests

Two complementary tests assess whether the observations are independent.

Runs test — tests whether the sequence of values above and below the median was produced randomly.

$$H_0\!: \text{sequence is random} \quad \text{vs.} \quad H_a\!: \text{sequence is not random}$$

| Statistic | Value |
|---|---|
| Test statistic $Z$ | −20.324 |
| Critical value $Z_{1-\alpha/2}$ | 1.96 |

Conclusion: $|Z| = 20.324$ far exceeds 1.96, so we reject $H_0$: the data are not random. The negative $Z$ indicates far fewer runs than expected, meaning the data cluster in long sequences above or below the median.
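
One way to compute such a statistic is the Wald-Wolfowitz form of the runs test, sketched below. Whether this matches the exact variant used to produce the value above is an assumption; the function name `runs_test_z`, the handling of ties (values equal to the median are dropped), and the simulated data are likewise illustrative.

```python
import math
import random
from itertools import accumulate
from statistics import median


def runs_test_z(y):
    """Z statistic for the number of runs above/below the median (ties dropped)."""
    med = median(y)
    signs = [v > med for v in y if v != med]
    n1 = sum(signs)          # observations above the median
    n2 = len(signs) - n1     # observations below the median
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2 * n1 * n2 / (n1 + n2) + 1
    variance = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)


# Simulated stand-in for the dataset (assumption: arbitrary seed).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

z = runs_test_z(walk)
print(f"Z = {z:.3f}; reject H0: {abs(z) > 1.96}")
```

A strongly negative value means the series crosses its median far less often than an independent sequence would.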

Lag-1 autocorrelation — measures the linear dependence between consecutive observations.

| Statistic | Value |
|---|---|
| $r_1$ | 0.987 |
| Critical value $1.96/\sqrt{N}$ | 0.087 |

Conclusion: The lag-1 autocorrelation of 0.987 vastly exceeds the critical value of 0.087. Consecutive observations are nearly identical — the randomness assumption is catastrophically violated.

Distribution and Outlier Tests

Since the quantitative tests confirm that the assumptions of randomness, constant location, and constant scale are all violated, distributional measures have no meaning. The histogram and normal probability plot reflect the particular random walk path, not any underlying fixed distribution. Therefore the distribution and outlier tests are omitted.

Test Summary

| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | $t = 9.275$ | 1.96 | Reject |
| Fixed variation | Levene test | $W = 10.459$ | 2.623 | Reject |
| Randomness | Runs test | $Z = -20.324$ | 1.96 | Reject |
| Randomness | Autocorrelation lag-1 | $r_1 = 0.987$ | 0.087 | Reject |
| Distribution | | | | Not meaningful |
| Outliers | | | | Not meaningful |

Three of the four assumptions are severely violated. The univariate model $Y_i = C + E_i$ is not appropriate for these data.

Interpretation

The combined graphical and quantitative evidence presents a clear picture:

  • The run sequence plot shows significant shifts in location over time — the process drifts without returning to a stable mean
  • The lag plot reveals extreme non-randomness — consecutive observations are nearly identical
  • The autocorrelation plot quantifies this dependence: significant autocorrelation persists at all lags from 1 through 100, declining linearly — the hallmark of an integrated (non-stationary) process
  • The spectral plot confirms the dominance of low-frequency drift with a single peak near zero frequency

Because the randomness and fixed-location assumptions fail catastrophically, the distributional assumptions (histogram and normal probability plot) become irrelevant. The univariate model is not valid for this data:

$$Y_i = C + E_i \quad \text{(rejected)}$$

Develop a Better Model

The univariate model $Y_i = C + E_i$ fails because it ignores the strong dependence between consecutive observations. The lag plot showed a distinct linear pattern when plotting $Y_i$ versus $Y_{i-1}$, suggesting a first-order autoregressive relationship. This motivates the model:

$$Y_i = A_0 + A_1 \cdot Y_{i-1} + E_i$$

Fitting this model by ordinary least squares regression of $Y_i$ on $Y_{i-1}$ yields:

| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| $A_0$ (intercept) | 0.0502 | 0.0242 | 2.075 |
| $A_1$ (slope) | 0.9871 | 0.0063 | 156.350 |

The slope is overwhelmingly significant ($t = 156.35$ with 497 residual degrees of freedom), confirming the strong autoregressive dependence. The residual standard deviation is 0.2931, compared to the original data's standard deviation of 2.079, a roughly 7-fold reduction in variability. The autoregressive model captures the vast majority of the structure in the data.
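
The fit itself is a simple regression of each observation on its predecessor. The sketch below shows the computation on a simulated random walk; the function name `fit_ar1` and the seed are assumptions, and the estimates will be close to, but not equal to, those in the table.

```python
import math
import random
from itertools import accumulate


def fit_ar1(y):
    """OLS regression of Y_i on Y_{i-1}; return (a0, a1, resid_sd)."""
    x, z = y[:-1], y[1:]  # predictor Y_{i-1} and response Y_i
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))
    a1 = sxz / sxx
    a0 = zbar - a1 * xbar
    sse = sum((zi - (a0 + a1 * xi)) ** 2 for xi, zi in zip(x, z))
    return a0, a1, math.sqrt(sse / (n - 2))  # N - 1 pairs, so N - 3 residual df


# Simulated stand-in for the dataset (assumption: arbitrary seed).
rng = random.Random(0)
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

a0, a1, resid_sd = fit_ar1(walk)
print(f"A0 = {a0:.4f}, A1 = {a1:.4f}, residual sd = {resid_sd:.4f}")
```

With 500 observations there are 499 usable pairs and two fitted parameters, which gives the 497 residual degrees of freedom quoted above.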

Because $A_1 \approx 1.0$ and $A_0 \approx 0$, the model simplifies to:

$$Y_i = Y_{i-1} + E_i$$

This is precisely the random walk definition: each observation equals the previous one plus a random increment. The residuals are the first differences:

$$E_i = Y_i - Y_{i-1}$$

These should recover the underlying centered uniform increments $U_i - 0.5$.
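
That first-differencing recovers the increments can be verified directly on simulated data, where the increments are known. The sketch below is illustrative only; the seed and variable names are assumptions.

```python
import math
import random
from itertools import accumulate
from statistics import stdev

# Simulated increments E_i = U_i - 0.5 and their cumulative sum (arbitrary seed).
rng = random.Random(0)
increments = [rng.random() - 0.5 for _ in range(500)]
walk = list(accumulate(increments))

# First differences E_i = Y_i - Y_{i-1}, defined for i = 2..N.
diffs = [b - a for a, b in zip(walk, walk[1:])]

max_err = max(abs(d - e) for d, e in zip(diffs, increments[1:]))
print(f"largest recovery error: {max_err:.1e}")
print(f"sd of differences: {stdev(diffs):.4f}, theory 1/sqrt(12): {1 / math.sqrt(12):.4f}")
```

A uniform distribution of width 1 has standard deviation $1/\sqrt{12} \approx 0.2887$, which is close to the residual standard deviation of 0.2931 reported for the fitted model.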

Validate New Model

Predicted vs. Original

The first validation step is to plot the predicted values against the original data:

$$\hat{Y}_i = 0.0502 + 0.9871 \cdot Y_{i-1}$$

Predicted values from the AR(1) model versus original data, showing a reasonably good fit along the diagonal.

The points cluster tightly along the diagonal, indicating a reasonably good fit. The model captures the essential dynamics of the random walk process.

4-Plot of Residuals

The 4-plot applied to the residuals tests whether the error term $E_i$ satisfies all four assumptions.

Four-plot diagnostic layout for AR(1) residuals — all four assumptions appear satisfied after removing the autoregressive structure.

The transformation is dramatic compared to the original data’s 4-plot:

  • Run sequence plot (upper left) — no significant shifts in location or scale over time; the residuals fluctuate around zero in a stable band
  • Lag plot (upper right) — random, structureless cloud; the autoregressive dependence has been completely removed
  • Histogram (lower left) — relatively flat appearance, suggesting a uniform rather than normal distribution
  • Normal probability plot (lower right) — S-shaped departure from linearity, confirming the residuals are not normally distributed

Residual Run Sequence Plot

Run sequence plot of AR(1) residuals showing no significant shifts in location or scale.

The residuals show no trend, no drift, and no shifts in variability — the fixed-location and fixed-variation assumptions are satisfied.

Residual Lag Plot

Lag-1 plot of AR(1) residuals showing a random, structureless cloud — the autoregressive dependence has been removed.

The residual lag plot shows a random, structureless cloud with no discernible pattern — the randomness assumption is satisfied. This is a stark contrast to the original data’s tight diagonal cluster.

Residual Histogram

Histogram of AR(1) residuals showing a relatively flat, uniform-like shape.

The histogram of residuals shows a relatively flat, rectangular shape — characteristic of a uniform distribution rather than a normal (bell-shaped) one. This is expected, since the underlying random walk was constructed from uniform random numbers.

Residual Autocorrelation Plot

Autocorrelation plot of AR(1) residuals — no significant autocorrelation at any lag, confirming independence.

The residual autocorrelation plot shows no significant autocorrelation at any lag — all values fall within the 95% confidence bands. The autoregressive model has successfully removed the dependence structure.

Residual Spectral Plot

Spectral plot of AR(1) residuals showing a flat spectrum with no dominant peaks — consistent with white noise.

The residual spectral plot shows a flat spectrum with no dominant peaks — consistent with white noise. The single low-frequency peak from the original data has been completely eliminated.

Uniform Probability Plot

Since the histogram suggests a uniform rather than normal distribution, the uniform probability plot provides the definitive distributional test.

Uniform probability plot of AR(1) residuals. The nearly linear pattern confirms a uniform distribution is a good model for the error component.

The nearly linear pattern confirms that a uniform distribution is a good model for the error component. This validates the full model:

$$Y_i = Y_{i-1} + E_i, \quad E_i \sim \text{Uniform}(-0.5,\; 0.5)$$
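
The linearity of a uniform probability plot can be summarized by the correlation between the sorted data and the medians of the uniform order statistics. The sketch below uses Filliben's approximation for those medians and compares a simulated uniform sample with a simulated normal one. The function names, seed, and sample parameters are assumptions, and this numeric summary is offered as a companion to the plot, not as the method used in the source analysis.

```python
import math
import random


def uniform_order_medians(n):
    """Filliben's approximation to the medians of uniform(0, 1) order statistics."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1 / n)
    m[0] = 1 - m[-1]
    return m


def correlation(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def uniform_ppcc(data):
    """Correlation between the sorted data and the uniform order statistic medians."""
    return correlation(sorted(data), uniform_order_medians(len(data)))


# Simulated samples (assumption: arbitrary seed and normal scale).
rng = random.Random(0)
uniform_sample = [rng.random() - 0.5 for _ in range(499)]
normal_sample = [rng.gauss(0, 0.29) for _ in range(499)]

print(f"uniform sample: {uniform_ppcc(uniform_sample):.4f}")
print(f"normal sample:  {uniform_ppcc(normal_sample):.4f}")
```

A value near 1 corresponds to a nearly straight probability plot, so the uniform sample should score higher than the normal one.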

Validation Summary

| Assumption | Result |
|---|---|
| Fixed location | Satisfied (no shifts in residual location) |
| Fixed variation | Satisfied (stable residual spread) |
| Randomness | Satisfied (no autocorrelation in residuals) |
| Distribution | Uniform (confirmed by probability plot) |

All four assumptions are satisfied for the residuals. The AR(1) model — or equivalently, first-differencing — successfully transforms the non-stationary random walk into a stationary, independent, uniformly distributed series.

Conclusions

The random walk data violate three of the four assumptions of the univariate model:

  1. Fixed location: rejected; the process drifts stochastically over time
  2. Fixed variation: rejected; variances differ significantly across quarters
  3. Randomness: rejected; extreme positive autocorrelation ($r_1 = 0.987$) persisting through lag 100

The autoregressive model $Y_i = A_0 + A_1 \cdot Y_{i-1} + E_i$ captures the dependence structure, achieving a 7-fold reduction in residual standard deviation. Because $A_1 \approx 1.0$, the simplified form $Y_i = Y_{i-1} + E_i$ is preferred: first-differencing recovers the underlying uniform random increments, which satisfy all four assumptions. The error component follows a uniform distribution, consistent with the known data-generation process.

This case study demonstrates the full EDA cycle: the run sequence plot and lag plot detect the violations, the autocorrelation plot and spectral plot characterize the dependence structure, the autoregressive model addresses the violations, and the residual diagnostics confirm the model’s validity.