Chi-Square Goodness-of-Fit Test

NIST/SEMATECH Section 1.3.5.15 Chi-Square Goodness-of-Fit Test

What It Is

The chi-square goodness-of-fit test determines whether observed frequency counts match expected counts under a hypothesized distribution. It bins the data and compares observed bin frequencies to the frequencies predicted by the theoretical distribution.

When to Use It

Use the chi-square goodness-of-fit test for distribution testing when the data can be naturally binned into categories, or when testing discrete distributions. It is one of the oldest and most general goodness-of-fit tests, applicable to any distribution (continuous or discrete) once the data are binned. It is particularly useful when testing hypotheses about categorical data or when a closed-form CDF is not available.

How to Interpret

If the chi-square statistic exceeds the critical value at the chosen significance level with the appropriate degrees of freedom, reject the null hypothesis that the data follow the specified distribution. Large contributions to the statistic from individual bins indicate where the model fits poorly. The test is sensitive to the choice of binning: too few bins reduce power, while too many bins can produce bins with very low expected counts (a rule of thumb requires E_i >= 5 for all bins). The chi-square goodness-of-fit test is less powerful than the Anderson-Darling or Kolmogorov-Smirnov tests for continuous data because binning discards information about the ordering within bins.
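
The procedure above can be sketched in plain Python. This is an illustrative example, not NIST code: the die-roll counts are made up, and the critical value 11.070 (alpha = 0.05, df = 5) is taken from standard chi-square tables.

```python
# Hypothetical example: test whether a six-sided die is fair.
observed = [18, 22, 16, 25, 19, 20]  # counts from 120 rolls (made-up data)
n = sum(observed)
expected = [n / 6] * 6  # fair die: p_i = 1/6 for every face

# Per-bin contributions (O_i - E_i)^2 / E_i; large terms flag poorly
# fitting bins.
contributions = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
chi2_stat = sum(contributions)

# Critical value for alpha = 0.05 with df = 6 - 1 = 5 (table value).
critical = 11.070
print(f"chi-square statistic = {chi2_stat:.3f}")
for i, c in enumerate(contributions):
    print(f"bin {i}: contribution {c:.3f}")
print("reject H0" if chi2_stat > critical else "fail to reject H0")
```

Inspecting the per-bin contributions, not just the total, shows where the hypothesized distribution fits worst.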

Assumptions and Limitations

The chi-square approximation requires that all expected cell frequencies are sufficiently large (typically >= 5). Observations must be independent. The test is applicable to both continuous and discrete distributions, but for continuous data the Anderson-Darling test is generally preferred because it does not require binning.

Reference: NIST/SEMATECH e-Handbook, Section 1.3.5.15

Formulas

Chi-Square Statistic

\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}

The sum of squared differences between observed (O_i) and expected (E_i) counts, each divided by the expected count. Under the null hypothesis, this statistic approximately follows a chi-square distribution with the degrees of freedom given below.
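
The statistic is straightforward to compute directly. Below is a minimal helper (the function name and the two-bin counts are illustrative, not from the handbook):

```python
def chi_square_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Illustrative counts: two bins with O = [10, 10] and E = [8, 12],
# giving (2^2)/8 + (2^2)/12.
stat = chi_square_statistic([10, 10], [8, 12])
```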

Expected Frequency

E_i = n \cdot p_i

The expected count for bin i, where n is the total sample size and p_i is the probability of falling in bin i under the hypothesized distribution.
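
For a continuous hypothesized distribution, p_i is the probability mass the distribution assigns to bin i, i.e. the CDF difference across the bin edges. A small sketch for a standard normal hypothesis, using only the standard library (the sample size and bin edges are arbitrary choices for illustration):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Normal CDF via the standard-library error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

n = 200  # illustrative sample size
# Four bins covering the whole real line: (-inf, -1], (-1, 0], (0, 1], (1, inf)
edges = [-math.inf, -1.0, 0.0, 1.0, math.inf]

# p_i = F(b) - F(a) for each bin (a, b]; E_i = n * p_i
p = [normal_cdf(b) - normal_cdf(a) for a, b in zip(edges, edges[1:])]
expected = [n * p_i for p_i in p]
```

Because the bins partition the support, the p_i sum to 1 and the expected counts sum to n.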

Degrees of Freedom

\nu = k - 1 - m

The degrees of freedom equal the number of non-empty bins (k) minus one, minus the number of parameters estimated from the data (m). Equivalently, NIST defines c = m + 1, giving df = k - c.
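
The two equivalent conventions can be checked in a couple of lines (the values of k and m here are arbitrary examples):

```python
k = 10  # non-empty bins after binning the data (illustrative value)
m = 2   # parameters estimated from the data, e.g. mean and sd of a fitted normal

df = k - 1 - m          # nu = k - 1 - m

# Equivalent NIST form: c = m + 1, so df = k - c.
c = m + 1
assert df == k - c
```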