Filter Transmittance Case Study
NIST/SEMATECH Section 1.4.2.6 Filter Transmittance
Background and Data
This case study applies exploratory data analysis to 50 filter transmittance measurements from a NIST optical filter calibration experiment. The response variable is the transmittance value of a glass filter. The data were collected by NIST chemist Radu Mavrodineanu during the 1970s using an automatic data acquisition system. The primary purpose is to demonstrate how EDA techniques detect non-randomness and a shift in location, and how the root cause turned out to be an instrumentation problem rather than a real process change.
The dataset originates from NIST/SEMATECH Section 1.4.2.6. With observations ranging from approximately 2.00130 to 2.00270, this study illustrates a case where the data acquisition system was sampling too fast, causing measurements to be contaminated by the previous reading.
Dataset
Mavrodineanu, NIST, glass filter transmittance (1970s)
NIST source description
NIST filter transmittance experiment. Experimentalist: Mavrodineanu, 1970s. Response variable = transmittance of a filter (nominal value = 2). Data collected by automatic data acquisition system at 10 observations per second; 50 observations collected in 5 seconds. Number of observations = 50.
Preview data
| # | Value |
|---|---|
| 1 | 2.0018 |
| 2 | 2.0017 |
| 3 | 2.0018 |
| 4 | 2.0019 |
| 5 | 2.0018 |
| 6 | 2.0017 |
| 7 | 2.0015 |
| 8 | 2.0014 |
| 9 | 2.0015 |
| 10 | 2.0015 |
| ... 40 more rows | |
Test Underlying Assumptions
Goals
The analysis has three primary objectives:
- Model validation: assess whether the univariate model $Y_i = C + E_i$ is an appropriate fit for the filter transmittance data.
- Assumption testing: evaluate whether the data satisfy the four standard assumptions for a measurement process in statistical control:
  - Random sampling: the data are uncorrelated
  - Fixed distribution: the data come from a fixed distribution
  - Fixed location: the distribution location (mean) is constant
  - Fixed variation: the distribution scale (standard deviation) is constant
- Confidence interval validity: determine whether the standard confidence interval formula $\bar{Y} \pm 2s/\sqrt{N}$ is appropriate, where $s$ is the standard deviation. This formula relies on all four assumptions holding; if they are violated, the confidence interval has no statistical meaning.
If the assumptions are violated, identify the nature and severity of the violations and recommend appropriate remedial actions.
Graphical Output and Interpretation
4-Plot Overview
The 4-plot is the primary graphical tool for testing all four assumptions simultaneously.
The assumptions are addressed by the four diagnostic plots:
- The run sequence plot (upper left) indicates a significant shift in location around observation 35 — the first portion fluctuates around approximately 2.001–2.002, while the later portion shifts upward to 2.002–2.003. The fixed-location assumption is violated.
- The lag plot (upper right) displays a strong linear pattern along the diagonal, indicating that consecutive observations are highly correlated — the randomness assumption is severely violated.
- Since the lag plot indicates significant non-randomness, the histogram (lower left) and normal probability plot (lower right) are not meaningful for interpretation.
The serious violation of the randomness assumption means that the univariate model is not valid. Given the linear appearance of the lag plot, an autoregressive model might seem appropriate, but as the root cause investigation below reveals, the solution was to fix the instrumentation rather than fit a more complex statistical model.
Run Sequence Plot
The run sequence plot shows 50 observations with a clear shift in location around observation 35. The first portion of the data fluctuates around approximately 2.001 to 2.002, while the later portion shifts upward to approximately 2.002 to 2.003. This shift indicates the fixed-location assumption is violated.
Conclusion: The visible upward shift around observation 35 demonstrates that the fixed-location assumption is violated.
Lag Plot
The lag plot at lag 1 displays a strong linear pattern along the diagonal, indicating that consecutive observations are highly correlated. This is the hallmark of severe non-randomness. The linear appearance means that each measurement is heavily influenced by the previous measurement, which is inconsistent with independent sampling.
Conclusion: The tight linear cluster along the diagonal $Y_i = Y_{i-1}$ indicates consecutive observations are nearly identical. The randomness assumption is severely violated.
Histogram
The histogram shows the distribution of the 50 transmittance values. However, because the randomness assumption is severely violated (lag-1 autocorrelation $r_1 = 0.94$), the histogram does not represent the distribution of independent errors. Instead, it reflects the combined effects of the autocorrelation structure and the location shift visible in the run sequence plot.
The histogram appears roughly symmetric, but this graphical assessment has limited interpretive value when the data are strongly autocorrelated. The shape is influenced by the drift and serial dependence rather than the underlying error distribution alone.
Normal Probability Plot
The normal probability plot assesses whether the data follow a normal distribution. As with the histogram, interpretation is limited because the severe autocorrelation violates the independence assumption required for distributional tests.
The probability plot shows some curvature, but this cannot be reliably interpreted as evidence for or against normality when the observations are not independent. The distributional assessment is deferred until the randomness issue is resolved.
Autocorrelation Plot
The autocorrelation plot quantifies the serial dependence detected by the lag plot. With $N = 50$, the 95% confidence bands are at $\pm 1.96/\sqrt{N} = \pm 0.277$.
The autocorrelation plot confirms extreme positive autocorrelation, consistent with a data acquisition system that was sampling too fast — each measurement was contaminated by the previous reading.
Spectral Plot
The spectral plot shows the frequency-domain structure of the data.
The spectral plot shows dominant low-frequency content, consistent with the slow drift and location shift visible in the run sequence plot. The lack of a sharp spectral peak distinguishes this from periodic data (like the beam deflections case study) — the non-randomness here is drift-based rather than cyclic.
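The contrast between drift-based and cyclic non-randomness can be sketched with a raw periodogram. This is a minimal illustration on two toy series (a linear ramp and a pure sinusoid), not the transmittance data, and the unsmoothed periodogram is an assumption standing in for whatever spectral estimator produced the plot. The function name `periodogram` is hypothetical.

```python
import cmath
import math

def periodogram(y):
    """Return [(frequency, power)] for k = 1..N//2, frequency in cycles per observation."""
    n = len(y)
    y_bar = sum(y) / n
    centered = [v - y_bar for v in y]  # remove the mean so frequency 0 does not dominate
    out = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        out.append((k / n, abs(coeff) ** 2 / n))
    return out

# Toy inputs for illustration only (not the transmittance data)
drift = [0.01 * t for t in range(50)]                             # slow linear drift
cycle = [math.sin(2 * math.pi * 5 * t / 50) for t in range(50)]   # 5 cycles in 50 points

peak_drift = max(periodogram(drift), key=lambda fp: fp[1])[0]
peak_cycle = max(periodogram(cycle), key=lambda fp: fp[1])[0]
print("drift peaks at frequency", peak_drift)
print("cycle peaks at frequency", peak_cycle)
```

For the ramp, power is concentrated at the lowest available frequency and decays from there, while the sinusoid produces a single sharp peak at its own frequency. The transmittance data resemble the first pattern.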
Quantitative Output and Interpretation
Summary Statistics
| Statistic | Value |
|---|---|
| Sample size | 50 |
| Mean | 2.0019 |
| Median | 2.0018 |
| Min | 2.0013 |
| Max | 2.0027 |
| Range | 0.0014 |
| Std Dev | 0.0004 |
| Autocorrelation | 0.94 |
The mean and median are close, but the summary statistics mask the location shift visible in the run sequence plot. The small standard deviation of 0.0004 reflects both short-term measurement variation and the drift across the run. The standard confidence interval $\bar{Y} \pm 2s/\sqrt{N}$ is suspect.
This interval dramatically understates the true uncertainty because it assumes independent observations. With autocorrelation of 0.94, the effective sample size is far smaller than 50, and any confidence interval computed from the standard formula has no statistical justification.
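To give a rough sense of scale, the sketch below computes the half-width of the standard interval from the rounded summary statistics, then repeats the calculation with an effective sample size. The effective-sample-size formula $N_{\text{eff}} = N(1 - r_1)/(1 + r_1)$ is an assumption: it is a common approximation for a first-order autoregressive process and is not part of the original analysis. The function names are hypothetical.

```python
import math

def standard_ci_half_width(s, n, k=2.0):
    """Half-width of the textbook interval: mean +/- k * s / sqrt(n)."""
    return k * s / math.sqrt(n)

def ar1_effective_n(n, r1):
    """Approximate effective sample size, ASSUMING AR(1)-type dependence."""
    return n * (1.0 - r1) / (1.0 + r1)

# Rounded values taken from the summary statistics table
n, s, r1 = 50, 0.0004, 0.94

naive = standard_ci_half_width(s, n)
n_eff = ar1_effective_n(n, r1)
adjusted = standard_ci_half_width(s, n_eff)

print(f"naive half-width:     {naive:.6f}")
print(f"effective N (AR(1)):  {n_eff:.2f}")
print(f"adjusted half-width:  {adjusted:.6f}")
```

Under this assumption the 50 readings carry roughly as much information as one or two independent ones. The adjusted number is only indicative, since the sample standard deviation is itself distorted by the autocorrelation and the location shift.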
Location Test
The location test fits a linear regression of the response against the run-order index and tests whether the slope is significantly different from zero.
| Parameter | Estimate | Std Error | t-Value |
|---|---|---|---|
| Intercept | 2.001428 | 0.000104 | 19244.8 |
| Slope | 1.85E-05 | 3.31E-06 | 5.582 |
Residual standard deviation: 0.000260 with 48 degrees of freedom.
Conclusion: The slope t-value of 5.582 exceeds the critical value $t_{0.975,48} \approx 2.01$, so we reject $H_0$ (zero slope): the location is not constant. Although the estimated slope of 1.85E-05 per observation is nearly zero in absolute terms, it accumulates to about 0.0009 over the 50 observations, roughly two-thirds of the observed range of 0.0014 (2.0013 to 2.0027), so the drift is meaningful on the scale of the measurements.
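The location test can be sketched as an ordinary least-squares fit of the response on the run-order index, followed by a t-statistic for the slope. The code below is a minimal illustration on a four-point toy series, not the transmittance data, and `slope_t_test` is a hypothetical helper name.

```python
import math

def slope_t_test(y):
    """Regress y on run order 1..N; return (intercept, slope, se_slope, t)."""
    n = len(y)
    x = range(1, n + 1)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual standard deviation uses N - 2 degrees of freedom
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    resid_sd = math.sqrt(sse / (n - 2))
    se_slope = resid_sd / math.sqrt(sxx)
    return intercept, slope, se_slope, slope / se_slope

# Toy series for illustration only (not the transmittance data)
toy = [1.0, 3.0, 2.0, 4.0]
b0, b1, se, t = slope_t_test(toy)
print(f"intercept={b0:.3f}  slope={b1:.3f}  se={se:.4f}  t={t:.4f}")
```

Applied to the full 50 observations, the same ratio of slope to standard error is what the table reports as the t-value, compared against the critical value with $N - 2 = 48$ degrees of freedom. A perfectly linear series would give a zero standard error, which this sketch does not guard against.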
Variation Test
The Levene test (median-based variant) divides the data into four intervals of approximately equal length and tests whether their variances are homogeneous.
| Statistic | Value |
|---|---|
| Test statistic | 0.971 |
| Degrees of freedom | 3 and 46 |
| Critical value | 2.806 |
Conclusion: The test statistic of 0.971 does not exceed the critical value of 2.806, so we fail to reject $H_0$ (equal variances): there is no evidence of a change in variation across the run. The constant-variation assumption is satisfied.
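A median-based Levene statistic can be sketched as follows. The way the series is cut into contiguous blocks is an assumption (the exact split used in the original analysis is not stated), the helper name `levene_median` is hypothetical, and the inputs are toy values rather than the transmittance data.

```python
import statistics

def levene_median(y, k=4):
    """Median-based Levene statistic over k contiguous blocks of y."""
    n = len(y)
    bounds = [i * n // k for i in range(k + 1)]   # assumed block boundaries
    groups = [y[bounds[i]:bounds[i + 1]] for i in range(k)]

    # Absolute deviations from each block's median
    z = []
    for g in groups:
        med = statistics.median(g)
        z.append([abs(v - med) for v in g])

    z_means = [sum(zi) / len(zi) for zi in z]
    z_grand = sum(sum(zi) for zi in z) / n
    between = sum(len(zi) * (m - z_grand) ** 2 for zi, m in zip(z, z_means))
    within = sum((v - m) ** 2 for zi, m in zip(z, z_means) for v in zi)
    return (n - k) / (k - 1) * between / within

# Toy inputs for illustration only
same_spread = [1, 2, 3, 11, 12, 13]    # location differs, spread does not
diff_spread = [1, 2, 3, 10, 20, 30]    # second block is far more variable
print(levene_median(same_spread, k=2))
print(levene_median(diff_spread, k=2))
```

The first toy series shows why the test is insensitive to a location shift: each block is centered on its own median before spreads are compared. That is consistent with the transmittance data failing the location test while passing this one.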
Randomness Tests
Two complementary tests assess whether the observations are independent.
Runs test — tests whether the sequence of values above and below the median was produced randomly.
| Statistic | Value |
|---|---|
| Test statistic | −5.3246 |
| Critical value | 1.96 |
Conclusion: $|Z| = 5.3246$ far exceeds 1.96, so we reject $H_0$ (the sequence is random): the data are not random. The negative Z indicates far fewer runs than expected, meaning the data cluster in long sequences above or below the median.
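A runs test about the median can be sketched with the usual normal approximation for the number of runs. Two assumptions are made here: values equal to the median are dropped, and the classical Wald-Wolfowitz mean and variance are used. The software behind the table may define runs differently, so this sketch is not expected to reproduce the tabulated statistic exactly. The helper name `runs_test_z` is hypothetical.

```python
import math
import statistics

def runs_test_z(y):
    """Z statistic for the number of runs above/below the median."""
    med = statistics.median(y)
    signs = [v > med for v in y if v != med]   # assumption: drop ties at the median
    n1 = sum(signs)                  # count above the median
    n2 = len(signs) - n1             # count below the median
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    expected = 2.0 * n1 * n2 / (n1 + n2) + 1.0
    variance = (2.0 * n1 * n2 * (2.0 * n1 * n2 - n1 - n2)
                / ((n1 + n2) ** 2 * (n1 + n2 - 1)))
    return (runs - expected) / math.sqrt(variance)

# Toy inputs for illustration only
clustered = [1, 1, 1, 1, 5, 5, 5, 5]      # two long runs
alternating = [1, 5, 1, 5, 1, 5, 1, 5]    # as many runs as possible
print(runs_test_z(clustered))
print(runs_test_z(alternating))
```

Clustered data give a negative Z (too few runs) and rapidly alternating data give a positive Z (too many). The transmittance data fall in the first category.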
Lag-1 autocorrelation — measures the linear dependence between consecutive observations.
| Statistic | Value |
|---|---|
| Lag-1 autocorrelation $r_1$ | 0.94 |
| Critical value | 0.277 |
Conclusion: The lag-1 autocorrelation of 0.94 vastly exceeds the critical value of 0.277. Consecutive observations are nearly identical, so the randomness assumption is severely violated. For $N = 50$, the 95% critical value is $1.96/\sqrt{N} = 0.277$, and the observed autocorrelation exceeds this bound by more than a factor of three.
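The lag-1 autocorrelation and its white-noise bound can be sketched as below. Only the ten previewed rows of the dataset are available here, so the value printed for them is illustrative and is not the 0.94 reported for the full 50 observations. The function names are hypothetical.

```python
import math

def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation: lag-1 autocovariance over the variance."""
    n = len(y)
    y_bar = sum(y) / n
    num = sum((y[i] - y_bar) * (y[i + 1] - y_bar) for i in range(n - 1))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

def white_noise_bound(n, z=1.96):
    """Approximate 95% bound on |r_1| for independent data."""
    return z / math.sqrt(n)

# First ten rows from the data preview only (NOT the full dataset)
preview = [2.0018, 2.0017, 2.0018, 2.0019, 2.0018,
           2.0017, 2.0015, 2.0014, 2.0015, 2.0015]

print(f"r1 (first 10 rows): {lag1_autocorrelation(preview):.3f}")
print(f"bound for N = 50:   {white_noise_bound(50):.3f}")
```

The bound depends only on the sample size, which is why a single horizontal band serves for every lag on the autocorrelation plot.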
Distribution and Outlier Tests
Since the randomness assumption is rejected, the distributional tests are not meaningful and are omitted. When data are autocorrelated, the histogram and normal probability plot reflect the dependence structure rather than the underlying distribution of independent errors. The Grubbs’ test for outliers is also omitted because it assumes approximately normally distributed, independent data.
Test Summary
| Assumption | Test | Statistic | Critical Value | Result |
|---|---|---|---|---|
| Fixed location | Regression on run order | $t = 5.582$ | 2.01 | Reject |
| Fixed variation | Levene test | $W = 0.971$ | 2.806 | Fail to reject |
| Randomness | Runs test | $Z = -5.3246$ | 1.96 | Reject |
| Randomness | Autocorrelation lag-1 | $r_1 = 0.94$ | 0.277 | Reject |
| Distribution | — | — | — | Not meaningful |
| Outliers | — | — | — | Not meaningful |
Two of the four assumptions are violated: the location is not fixed (significant drift) and the data are severely non-random (extreme autocorrelation). Only the variation assumption passes. The univariate model is not appropriate for these data.
Interpretation
Two of the four assumptions fail: the location drift is statistically significant ($t = 5.582$, well above the critical value of 2.01) and the randomness violation is extreme ($r_1 = 0.94$, runs test $Z = -5.3246$). Only the variation assumption passes (Levene test $W = 0.971$). The distribution and outlier tests are not meaningful given the severity of the autocorrelation. The lag plot showed the problem immediately: a tight linear cluster along the diagonal that is the unmistakable signature of strongly correlated consecutive observations.
The lag-1 autocorrelation of 0.94 means each measurement explains approximately 88% of the variance of the next ($r_1^2 = 0.94^2 \approx 0.88$). The run sequence plot shows both a gradual drift and a distinct shift around observation 35. The spectral plot confirms low-frequency dominance without periodic peaks: the non-randomness is drift-based, not cyclic. The standard confidence interval $\bar{Y} \pm 2s/\sqrt{N}$ dramatically understates the true uncertainty because it assumes independent observations; with an effective sample size far smaller than 50, the real uncertainty is substantially larger.
The univariate model is not appropriate for these data. The observations violate independence so severely that no standard statistical inference is valid. However, the resolution is not a more complex statistical model but rather an investigation of the experimental process. The graphical and quantitative analysis successfully identified the problem; the root cause investigation below reveals that the data acquisition system was sampling too fast, contaminating each measurement with the previous reading.
Root Cause Investigation
In discussions with the scientist, the non-randomness was entirely unexpected. An examination of the experimental process revealed that the sampling rate for the automatic data acquisition system was too fast. The equipment did not have sufficient time to reset before the next sample started, resulting in the current measurement being contaminated by the previous measurement. The solution was to rerun the experiment allowing more time between samples.
This illustrates a key principle of exploratory data analysis: when statistical tests reveal unexpected patterns, the root cause is often found in the instrumentation or experimental process — not in the data themselves. The lag plot and autocorrelation test pointed directly to the mechanism (consecutive readings correlated), and the scientist’s domain knowledge identified the specific cause (insufficient reset time between samples).
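The mechanism can be illustrated with a toy simulation in which each reading retains a fraction of the previous reading's deviation from the nominal value. This is a hypothetical model chosen for illustration, not a description of the actual instrument; the parameters `carry_over` and `noise_sd` and both function names are assumptions.

```python
import random

def simulate_readings(n, carry_over, noise_sd=1e-4, nominal=2.0, seed=0):
    """Toy model: each reading keeps a fraction `carry_over` of the previous deviation."""
    rng = random.Random(seed)
    deviation = 0.0
    readings = []
    for _ in range(n):
        deviation = carry_over * deviation + rng.gauss(0.0, noise_sd)
        readings.append(nominal + deviation)
    return readings

def lag1_autocorrelation(y):
    """Sample lag-1 autocorrelation."""
    n = len(y)
    y_bar = sum(y) / n
    num = sum((y[i] - y_bar) * (y[i + 1] - y_bar) for i in range(n - 1))
    den = sum((v - y_bar) ** 2 for v in y)
    return num / den

fast = simulate_readings(2000, carry_over=0.9)   # too little reset time between samples
slow = simulate_readings(2000, carry_over=0.0)   # instrument fully resets each time

print(f"r1 with carry-over:    {lag1_autocorrelation(fast):.2f}")
print(f"r1 without carry-over: {lag1_autocorrelation(slow):.2f}")
```

In this toy model, removing the carry-over is enough to bring the lag-1 autocorrelation back near zero, which mirrors the remedy of allowing more time between samples.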
Conclusions
The filter transmittance data fail the randomness assumption due to extreme positive autocorrelation ($r_1 = 0.94$) caused by an instrumentation problem. The data acquisition system was sampling too fast, causing each measurement to be contaminated by the previous reading. The location test also reveals a significant upward drift ($t = 5.582$), with a visible shift around observation 35. Only the variation assumption is satisfied (Levene $W = 0.971$).
The standard confidence interval $\bar{Y} \pm 2s/\sqrt{N}$ is not valid because the independence assumption is catastrophically violated.
Simple graphical techniques, particularly the lag plot and run sequence plot, efficiently detected this problem. The recommended action was to rerun the experiment with a slower sampling rate. This case study demonstrates that unexpected results in EDA should prompt investigation of the experimental process itself, not just application of more complex statistical models. The role of graphical and statistical analysis is to detect problems; resolving the issues requires the knowledge of the scientist or engineer.