Random Walk Case Study
NIST/SEMATECH Section 1.4.2.3 Random Walk
Background and Data
This case study applies exploratory data analysis to 500 observations from the NIST RANDWALK.DAT dataset, which is a cumulative sum of centered uniform random numbers:
$$R_i = \sum_{j=1}^{i} (U_j - 0.5)$$
where $U_1, \ldots, U_i$ are uniform random numbers. The purpose is to demonstrate how EDA techniques detect a non-stationary process in which the fixed-location assumption is violated. In a random walk, each observation equals the previous observation plus a random increment, creating persistent drift.
The dataset originates from NIST/SEMATECH Section 1.4.2.3. The case study matters in practice because random walks appear frequently in financial time series, sensor drift, and instrument calibration problems.
Dataset
Cumulative sum of uniform random numbers
NIST source description
Random walk data. Response variable Y. Generation algorithm: cumulative sum of (uniforms − 0.5), i.e. RW(i) = Σ(U(j) − 0.5) for j = 1 to i. Number of observations = 500.
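The generation algorithm described above can be sketched in a few lines. This is a minimal illustration, not the procedure NIST used to produce RANDWALK.DAT: the function name `random_walk`, the seed, and the use of Python's built-in generator are assumptions, so the values will not match the preview table.

```python
import random
from itertools import accumulate


def random_walk(n, seed=0):
    """Return RW(1..n), where RW(i) is the sum of (U(j) - 0.5) for j = 1..i."""
    rng = random.Random(seed)  # hypothetical seed, used only for reproducibility
    steps = [rng.random() - 0.5 for _ in range(n)]  # U(j) uniform on [0, 1)
    return list(accumulate(steps))


walk = random_walk(500)
print(len(walk))
print([round(v, 6) for v in walk[:5]])
```

Because every increment lies between -0.5 and 0.5, consecutive values can never differ by more than 0.5, which is why the series looks smooth relative to its overall range.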
Preview data
| # | Value |
|---|---|
| 1 | -0.399027 |
| 2 | -0.645651 |
| 3 | -0.625516 |
| 4 | -0.262049 |
| 5 | -0.407173 |
| 6 | -0.097583 |
| 7 | 0.314156 |
| 8 | 0.106905 |
| 9 | -0.017675 |
| 10 | -0.037111 |
| ... 490 more rows | |
Test Underlying Assumptions
Goals
The analysis has three primary objectives:
- Model validation: assess whether the univariate model $Y_i = C + E_i$ is an appropriate fit for the random walk data.
- Assumption testing: evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:
  - Random sampling: the data are uncorrelated
  - Fixed distribution: the data come from a fixed distribution
  - Fixed location: the distribution location (mean) is constant
  - Fixed variation: the distribution scale (standard deviation) is constant
- Confidence interval validity: determine whether the standard confidence interval formula $\bar{Y} \pm 2s/\sqrt{N}$ is appropriate, where $s$ is the standard deviation. This formula relies on all four assumptions holding; if they are violated, the confidence interval has no statistical meaning.
If the assumptions are violated, identify the nature and severity of the violations and recommend appropriate remedial actions.
4-Plot Overview
The 4-plot is the primary graphical tool for testing all four assumptions simultaneously. For the random walk data, it reveals severe violations.
The run sequence plot (upper left) shows a meandering, non-stationary trajectory rather than a stable band — the location is clearly not fixed. The lag plot (upper right) shows a tight diagonal line indicating extreme autocorrelation — the data are not random. The histogram (lower left) and normal probability plot (lower right) are not meaningful when location and independence assumptions are violated.
Fixed Location — Run Sequence Plot
The run sequence plot tests whether the location is constant over time. For the random walk, instead of fluctuating around a fixed horizontal level, the data wander freely with no tendency to return to a central value.
Conclusion: The trajectory drifts both positive and negative over the 500 observations, demonstrating classic non-stationary behavior. The fixed-location assumption is violated.
Randomness — Lag Plot
The lag plot tests whether the data are random by plotting $Y_i$ versus $Y_{i-1}$. If the data are random, the lag plot should show a structureless cloud.
Conclusion: The lag plot shows an extremely tight linear cluster along the diagonal $Y_i = Y_{i-1}$, indicating that consecutive observations are nearly identical. The lag-1 autocorrelation is 0.987. The randomness assumption is catastrophically violated.
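The lag-1 autocorrelation quoted above can be computed directly from the series. The sketch below uses the usual sample estimator (lagged cross-products divided by the total sum of squares) and compares independent noise with its cumulative sum. The data are simulated with an arbitrary seed, so the printed numbers illustrate the contrast rather than reproduce 0.987.

```python
import random
from itertools import accumulate


def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation: lagged cross-products over the total sum of squares."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    numerator = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    denominator = sum(d * d for d in dev)
    return numerator / denominator


rng = random.Random(0)  # hypothetical seed
noise = [rng.random() - 0.5 for _ in range(500)]  # independent increments
walk = list(accumulate(noise))                     # their cumulative sum

print("independent noise:", round(lag1_autocorrelation(noise), 3))
print("random walk:      ", round(lag1_autocorrelation(walk), 3))
```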
Randomness — Autocorrelation Plot
When the lag plot reveals significant non-randomness, the autocorrelation plot provides essential follow-up detail. It shows the autocorrelation coefficient $r_k$ at each lag $k$, with 95% confidence bands at $\pm 1.96/\sqrt{N}$.
The autocorrelation plot shows significant positive autocorrelation at lags 1 through 100 in a linearly decreasing fashion. This slow, linear decay is the signature of a random walk — the process has long-range memory because each observation carries forward the cumulative effect of all prior random increments. Every lag from 1 to 100 exceeds the 95% confidence band, confirming that the randomness assumption is not just violated but fundamentally inapplicable.
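To see how slowly the autocorrelation of a random walk decays, the coefficients can be computed for lags 1 through 100 and compared against the 95% band. This is a sketch on simulated data: the helper name `autocorrelations`, the seed, and the band constant 1.96 are assumptions, and a different realization will give different counts.

```python
import math
import random
from itertools import accumulate


def autocorrelations(y, max_lag):
    """Sample autocorrelation coefficients for lags 1..max_lag."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[i] * dev[i + k] for i in range(n - k)) / c0
        for k in range(1, max_lag + 1)
    ]


rng = random.Random(1)  # hypothetical seed
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

acf = autocorrelations(walk, 100)
band = 1.96 / math.sqrt(len(walk))  # approximate 95% band for independent data
outside = sum(1 for r in acf if abs(r) > band)
print(f"band = +/-{band:.3f}, lags outside the band: {outside} of {len(acf)}")
```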
Randomness — Spectral Plot
The spectral plot complements the autocorrelation plot by showing the frequency-domain structure of the data — specifically, the power spectral density as a function of frequency.
The spectral plot shows a single dominant low-frequency peak. Virtually all power is concentrated near frequency zero, which corresponds to slow, long-period drift. There are no secondary peaks at higher frequencies, confirming that the data have no periodic component — the non-randomness is purely drift-driven, consistent with the random walk mechanism.
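The concentration of power near frequency zero can be illustrated with a raw periodogram. This is a simplified stand-in for the spectral plot, which may use a smoothed estimator; the function name `periodogram`, the direct (non-FFT) transform, and the simulated data are assumptions made to keep the sketch dependency-free.

```python
import cmath
import math
import random
from itertools import accumulate


def periodogram(y):
    """Return (frequency, power) pairs for frequencies 1/n up to 0.5 cycles per observation."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    result = []
    for k in range(1, n // 2 + 1):
        # Direct discrete Fourier transform at frequency k/n.
        coef = sum(d * cmath.exp(-2j * math.pi * k * t / n) for t, d in enumerate(dev))
        result.append((k / n, abs(coef) ** 2 / n))
    return result


rng = random.Random(2)  # hypothetical seed
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

spectrum = periodogram(walk)
peak_freq, peak_power = max(spectrum, key=lambda pair: pair[1])
print(f"largest power at frequency {peak_freq:.3f} cycles per observation")
```

For a random walk the largest ordinate is expected at one of the lowest frequencies, matching the single low-frequency peak described above.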
Fixed Distribution — Histogram
The histogram assesses the shape of the underlying distribution. However, because the location and randomness assumptions are already violated, the histogram cannot be meaningfully interpreted.
Conclusion: The irregular, multi-modal shape depends entirely on the particular path taken by the random walk. Distribution testing is not meaningful when prior assumptions fail.
Normality — Normal Probability Plot
The normal probability plot tests whether the data follow a normal distribution. Like the histogram, this test is invalid when location and independence assumptions are violated.
Conclusion: The plot shows substantial curvature and departure from linearity, but this result must be interpreted cautiously — normality testing is not valid when stationarity and independence fail.
Quantitative Results
Summary Statistics
| Statistic | Value |
|---|---|
| Sample size | 500 |
| Mean | 3.2167 |
| Std Dev | 2.0787 |
| Lag-1 autocorrelation | 0.987 |
These summary statistics are essentially meaningless as descriptors of a fixed process because the location is non-stationary. In particular, the standard confidence interval is not valid:
$$\bar{Y} \pm \frac{2s}{\sqrt{N}} = 3.2167 \pm \frac{2(2.0787)}{\sqrt{500}} \approx 3.2167 \pm 0.1859$$
This interval dramatically understates the true uncertainty because it assumes independent observations. With autocorrelation this extreme, the effective sample size is far smaller than 500, and any confidence interval computed from the standard formula has no statistical justification.
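A rough sense of how much smaller the effective sample size is can be had from the common first-order approximation for a stationary AR(1) series. This approximation is an assumption brought in for illustration: it is not part of the original analysis, and it is not strictly applicable to a non-stationary random walk, so the result should be read only as an order of magnitude.

```latex
N_{\text{eff}} \approx N \,\frac{1 - r_1}{1 + r_1}
  = 500 \times \frac{1 - 0.987}{1 + 0.987}
  = 500 \times \frac{0.013}{1.987}
  \approx 3.3
```

Under this approximation the 500 observations carry about as much information about the mean as three or four independent ones.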
Location Test
The location test fits a linear regression of the response against the run-order index and tests whether the slope is significantly different from zero.
| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| (intercept) | 1.8335 | 0.1721 | 10.650 |
| (slope) | 0.005522 | 0.000595 | 9.275 |
Residual standard deviation: 1.9214 with 498 degrees of freedom.
Conclusion: The slope t-value of 9.275 exceeds the critical value $t_{0.975,\,498} \approx 1.96$, so we reject $H_0$ (zero slope): the location is not constant. The data drift upward at a rate of approximately 0.0055 units per observation.
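The location test amounts to a simple linear regression on the run-order index 1..N. The sketch below computes the slope, its standard error, and the t-value from first principles. The function name `slope_t_test` and the simulated data are assumptions, so the output will differ from the table above.

```python
import math
import random
from itertools import accumulate


def slope_t_test(y):
    """Regress y on run order 1..n; return (intercept, slope, slope_std_error, t_value)."""
    n = len(y)
    x = range(1, n + 1)
    x_mean = (n + 1) / 2
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))  # n - 2 residual degrees of freedom
    slope_se = resid_sd / math.sqrt(sxx)
    return intercept, slope, slope_se, slope / slope_se


rng = random.Random(3)  # hypothetical seed
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))
b0, b1, se1, t1 = slope_t_test(walk)
print(f"slope = {b1:.6f}, std error = {se1:.6f}, t = {t1:.3f}")
```

A t-value larger than about 1.96 in absolute value leads to rejecting a constant location. For a random walk the sign of the fitted slope depends on the particular path.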
Variation Test
The Levene test (median-based variant, preferred for non-normal data) divides the data into $k = 4$ equal-length intervals and tests whether their variances are homogeneous.
| Statistic | Value |
|---|---|
| Test statistic | 10.459 |
| Degrees of freedom | $k - 1 = 3$ and $N - k = 496$ |
| Critical value | 2.623 |
Conclusion: The test statistic of 10.459 exceeds the critical value of 2.623, so we reject $H_0$ (equal variances): the variances are significantly different across the four intervals. This is expected for a random walk: the variance of a cumulative sum grows with time, since for independent increments with variance $\sigma^2$, $\mathrm{Var}(R_i) = i\sigma^2$.
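The median-based Levene statistic can be written out explicitly: it is a one-way ANOVA F statistic applied to the absolute deviations of each value from its interval's median. The sketch below assumes consecutive equal-length intervals and silently ignores any leftover observations; the function name `levene_median` and the simulated data are illustrative assumptions.

```python
import random
from itertools import accumulate
from statistics import fmean, median


def levene_median(y, k=4):
    """Median-based Levene statistic W for k consecutive equal-length intervals."""
    size = len(y) // k  # leftover observations at the end are ignored
    groups = [y[i * size:(i + 1) * size] for i in range(k)]
    # Absolute deviations of each value from its own interval's median.
    z = []
    for g in groups:
        m = median(g)
        z.append([abs(v - m) for v in g])
    z_means = [fmean(zi) for zi in z]
    z_grand = fmean([v for zi in z for v in zi])
    between = sum(size * (zm - z_grand) ** 2 for zm in z_means)
    within = sum((v - zm) ** 2 for zi, zm in zip(z, z_means) for v in zi)
    n_used = k * size
    return (n_used - k) / (k - 1) * between / within


rng = random.Random(4)  # hypothetical seed
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))
print(f"W = {levene_median(walk, k=4):.3f}")
```

With 500 observations and four intervals, W is compared against the F distribution with 3 and 496 degrees of freedom, the same reference distribution as in the table above.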
Randomness Tests
Two complementary tests assess whether the observations are independent.
Runs test — tests whether the sequence of values above and below the median was produced randomly.
| Statistic | Value |
|---|---|
| Test statistic | −20.324 |
| Critical value | 1.96 |
Conclusion: $|Z| = 20.324$ far exceeds 1.96, so we reject $H_0$ (random ordering): the data are not random. The negative $Z$ indicates far fewer runs than expected, meaning the data cluster in long sequences above or below the median.
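A runs test of this kind counts maximal stretches of consecutive values on the same side of the median and standardizes the count. The sketch below assumes the Wald-Wolfowitz form of the statistic and drops values exactly equal to the median. Whether the original analysis used this exact variant is not stated, so treat the helper `runs_test_z` as illustrative.

```python
import math
from statistics import median


def runs_test_z(y):
    """Z statistic for the number of runs above and below the median."""
    m = median(y)
    signs = [v > m for v in y if v != m]  # values equal to the median are dropped
    n_above = sum(signs)
    n_below = len(signs) - n_above
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    n = n_above + n_below
    expected = 2 * n_above * n_below / n + 1
    variance = (2 * n_above * n_below * (2 * n_above * n_below - n)
                / (n ** 2 * (n - 1)))
    return (runs - expected) / math.sqrt(variance)


trending = [1, 2, 3, 4, 5, 6, 7, 8]      # two long runs: too few
alternating = [1, 8, 2, 7, 3, 6, 4, 5]   # eight runs of length one: too many
print(round(runs_test_z(trending), 3), round(runs_test_z(alternating), 3))
```

A strongly negative Z, as in the trending example, corresponds to the long same-side sequences seen in the random walk.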
Lag-1 autocorrelation — measures the linear dependence between consecutive observations.
| Statistic | Value |
|---|---|
| Lag-1 autocorrelation $r_1$ | 0.987 |
| Critical value | 0.087 |
Conclusion: The lag-1 autocorrelation of 0.987 vastly exceeds the critical value of 0.087. Consecutive observations are nearly identical — the randomness assumption is catastrophically violated.
Distribution and Outlier Tests
Since the quantitative tests confirm that the assumptions of randomness, constant location, and constant scale are all violated, distributional measures have no meaning. The histogram and normal probability plot reflect the particular random walk path, not any underlying fixed distribution. Therefore the distribution and outlier tests are omitted.
Test Summary
| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | $t = 9.275$ | 1.96 | Reject |
| Fixed variation | Levene test | $W = 10.459$ | 2.623 | Reject |
| Randomness | Runs test | $\lvert Z \rvert = 20.324$ | 1.96 | Reject |
| Randomness | Autocorrelation lag-1 | $r_1 = 0.987$ | 0.087 | Reject |
| Distribution | — | — | — | Not meaningful |
| Outliers | — | — | — | Not meaningful |
Three of the four assumptions are severely violated. The univariate model is not appropriate for this data.
Interpretation
The combined graphical and quantitative evidence presents a clear picture:
- The run sequence plot shows significant shifts in location over time — the process drifts without returning to a stable mean
- The lag plot reveals extreme non-randomness — consecutive observations are nearly identical
- The autocorrelation plot quantifies this dependence: significant autocorrelation persists at all lags from 1 through 100, declining linearly — the hallmark of an integrated (non-stationary) process
- The spectral plot confirms the dominance of low-frequency drift with a single peak near zero frequency
Because the randomness and fixed-location assumptions fail catastrophically, the distributional assumptions (histogram and normal probability plot) become irrelevant. The univariate model $Y_i = C + E_i$ is not valid for these data.
Develop a Better Model
The univariate model fails because it ignores the strong dependence between consecutive observations. The lag plot showed a distinct linear pattern when plotting $Y_i$ versus $Y_{i-1}$, suggesting a first-order autoregressive relationship. This motivates the model:
$$Y_i = A_0 + A_1 Y_{i-1} + E_i$$
Fitting this model by ordinary least squares regression of $Y_i$ on $Y_{i-1}$ yields:
| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| $A_0$ (intercept) | 0.0502 | 0.0242 | 2.075 |
| $A_1$ (slope) | 0.9871 | 0.0063 | 156.350 |
The slope is overwhelmingly significant ($t = 156.350$ with 497 residual degrees of freedom), confirming the strong autoregressive dependence. The residual standard deviation is 0.2931, compared to the original data’s standard deviation of 2.079, a roughly 7-fold reduction in variability. The autoregressive model captures the vast majority of the structure in the data.
Because $A_0 \approx 0$ and $A_1 \approx 1$, the model simplifies to:
$$Y_i = Y_{i-1} + E_i$$
This is precisely the random walk definition: each observation equals the previous one plus a random increment. The residuals are the first differences:
$$E_i = Y_i - Y_{i-1}$$
These should recover the underlying random increments, that is, the centered uniform random numbers $U_i - 0.5$.
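The fit and the simplification to first differences can be sketched as follows. The helper `fit_ar1` is a hypothetical name for an ordinary least squares regression of each value on its predecessor, and the data are simulated, so the estimates will be close to, but not equal to, those in the table.

```python
import math
import random
from itertools import accumulate


def fit_ar1(y):
    """OLS fit of y[i] = a0 + a1 * y[i-1] + e[i]; returns (a0, a1, residual_sd)."""
    x, t = y[:-1], y[1:]  # predecessor and current value
    n = len(x)
    x_mean, t_mean = sum(x) / n, sum(t) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (ti - t_mean) for xi, ti in zip(x, t))
    a1 = sxy / sxx
    a0 = t_mean - a1 * x_mean
    sse = sum((ti - a0 - a1 * xi) ** 2 for xi, ti in zip(x, t))
    return a0, a1, math.sqrt(sse / (n - 2))


rng = random.Random(5)  # hypothetical seed
walk = list(accumulate(rng.random() - 0.5 for _ in range(500)))

a0, a1, resid_sd = fit_ar1(walk)
diffs = [b - a for a, b in zip(walk, walk[1:])]  # first differences
print(f"a0 = {a0:.4f}, a1 = {a1:.4f}, residual sd = {resid_sd:.4f}")
print(f"first differences range from {min(diffs):.3f} to {max(diffs):.3f}")
```

With 500 observations there are 499 (predecessor, current) pairs and therefore 497 residual degrees of freedom, matching the count quoted above.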
Validate New Model
Predicted vs. Original
The first validation step is to plot the predicted values $\hat{Y}_i = 0.0502 + 0.9871\,Y_{i-1}$ against the original data.
The points cluster tightly along the diagonal, indicating a reasonably good fit. The model captures the essential dynamics of the random walk process.
4-Plot of Residuals
The 4-plot applied to the residuals tests whether the error term satisfies all four assumptions.
The transformation is dramatic compared to the original data’s 4-plot:
- Run sequence plot (upper left) — no significant shifts in location or scale over time; the residuals fluctuate around zero in a stable band
- Lag plot (upper right) — random, structureless cloud; the autoregressive dependence has been completely removed
- Histogram (lower left) — relatively flat appearance, suggesting a uniform rather than normal distribution
- Normal probability plot (lower right) — S-shaped departure from linearity, confirming the residuals are not normally distributed
Residual Run Sequence Plot
The residuals show no trend, no drift, and no shifts in variability — the fixed-location and fixed-variation assumptions are satisfied.
Residual Lag Plot
The residual lag plot shows a random, structureless cloud with no discernible pattern — the randomness assumption is satisfied. This is a stark contrast to the original data’s tight diagonal cluster.
Residual Histogram
The histogram of residuals shows a relatively flat, rectangular shape — characteristic of a uniform distribution rather than a normal (bell-shaped) one. This is expected, since the underlying random walk was constructed from uniform random numbers.
Residual Autocorrelation Plot
The residual autocorrelation plot shows no significant autocorrelation at any lag — all values fall within the 95% confidence bands. The autoregressive model has successfully removed the dependence structure.
Residual Spectral Plot
The residual spectral plot shows a flat spectrum with no dominant peaks — consistent with white noise. The single low-frequency peak from the original data has been completely eliminated.
Uniform Probability Plot
Since the histogram suggests a uniform rather than normal distribution, the uniform probability plot provides the definitive distributional test.
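The nearly linear pattern confirms that a uniform distribution is a good model for the error component. This validates the full model $Y_i = A_0 + A_1 Y_{i-1} + E_i$, with intercept $A_0$, slope $A_1$, and uniformly distributed errors $E_i$.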
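The linearity of a uniform probability plot can be summarized by the correlation between the sorted data and the uniform order statistic medians. The sketch below uses Filliben's approximation for those medians. The function names and the simulated residuals are assumptions for illustration, not the original computation.

```python
import math
import random


def uniform_order_statistic_medians(n):
    """Approximate medians of the order statistics of n uniform(0, 1) draws."""
    m = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    m[-1] = 0.5 ** (1 / n)  # largest order statistic
    m[0] = 1 - m[-1]        # smallest, by symmetry
    return m


def uniform_ppcc(values):
    """Correlation between the sorted values and the uniform order statistic medians."""
    y = sorted(values)
    m = uniform_order_statistic_medians(len(y))
    y_mean, m_mean = sum(y) / len(y), sum(m) / len(m)
    sxy = sum((a - m_mean) * (b - y_mean) for a, b in zip(m, y))
    sxx = sum((a - m_mean) ** 2 for a in m)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(6)  # hypothetical seed
residuals = [rng.random() - 0.5 for _ in range(499)]  # stand-in for first differences
print(f"uniform PPCC = {uniform_ppcc(residuals):.4f}")
```

A value close to 1 corresponds to a nearly straight probability plot. Because correlation is unaffected by shifts and rescaling, the residuals do not need to be mapped onto (0, 1) first.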
Validation Summary
| Assumption | Result |
|---|---|
| Fixed location | Satisfied — no shifts in residual location |
| Fixed variation | Satisfied — stable residual spread |
| Randomness | Satisfied — no autocorrelation in residuals |
| Distribution | Uniform — confirmed by probability plot |
All four assumptions are satisfied for the residuals. The AR(1) model — or equivalently, first-differencing — successfully transforms the non-stationary random walk into a stationary, independent, uniformly distributed series.
Conclusions
The random walk data violate three of the four assumptions of the univariate model:
- Fixed location: rejected; the process drifts stochastically over time
- Fixed variation: rejected; variances differ significantly across quarters
- Randomness: rejected; extreme positive autocorrelation ($r_1 = 0.987$) persisting through lag 100
The autoregressive model $Y_i = A_0 + A_1 Y_{i-1} + E_i$ captures the dependence structure, achieving a 7-fold reduction in residual standard deviation. Because $A_1 \approx 1$, the simplified form $Y_i = Y_{i-1} + E_i$ is preferred: first-differencing recovers the underlying uniform random increments, which satisfy all four assumptions. The error component follows a uniform distribution, consistent with the known data-generation process.
This case study demonstrates the full EDA cycle: the run sequence plot and lag plot detect the violations, the autocorrelation plot and spectral plot characterize the dependence structure, the autoregressive model addresses the violations, and the residual diagnostics confirm the model’s validity.