This vignette explores the Anderson–Darling k-Sample test. CMH-17-1G [1] provides a formulation for this test that appears different than the formulation given by Scholz and Stephens in their 1987 paper [2].

The two references use different nomenclature, which is summarized as follows:

Term | CMH-17-1G | Scholz and Stephens |
---|---|---|
A sample | \(i\) | \(i\) |
The number of samples | \(k\) | \(k\) |
An observation within a sample | \(j\) | \(j\) |
The number of observations within the sample \(i\) | \(n_i\) | \(n_i\) |
The total number of observations within all samples | \(n\) | \(N\) |
Distinct values in combined data, ordered | \(z_{(1)}\)…\(z_{(L)}\) | \(Z_1^*\)…\(Z_L^*\) |
The number of distinct values in the combined data | \(L\) | \(L\) |

Given the possibility of ties in the data, the discrete version of the test must be used. Scholz and Stephens (1987) give the test statistic as:

\[ A_{a k N}^2 = \frac{N - 1}{N}\sum_{i=1}^k \frac{1}{n_i}\sum_{j=1}^{L}\frac{l_j}{N}\frac{\left(N M_{a i j} - n_i B_{a j}\right)^2}{B_{a j}\left(N - B_{a j}\right) - N l_j / 4} \]

CMH-17-1G gives the test statistic as:

\[ ADK = \frac{n - 1}{n^2\left(k - 1\right)}\sum_{i=1}^k\frac{1}{n_i}\sum_{j=1}^L h_j \frac{\left(n F_{i j} - n_i H_j\right)^2}{H_j \left(n - H_j\right) - n h_j / 4} \]

By inspection, the CMH-17-1G version of this test statistic contains an extra factor of \(\frac{1}{\left(k - 1\right)}\).
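The relationship between the two statistics can be checked numerically. The following is a minimal sketch in plain Python, not the implementation used by any of the packages discussed here. It assumes the usual definitions, which are not spelled out above: \(l_j\) (or \(h_j\)) is the number of pooled observations equal to the \(j\)-th distinct value, \(B_{a j}\) (or \(H_j\)) is the number of pooled observations below that value plus half the number equal to it, and \(M_{a i j}\) (or \(F_{i j}\)) is the same count restricted to sample \(i\). The function name and the example data are hypothetical.

```python
from collections import Counter


def ad_k_sample_statistics(samples):
    """Return (A2_akN, ADK) for a list of samples (each a list of numbers)."""
    k = len(samples)
    pooled = [x for sample in samples for x in sample]
    N = len(pooled)
    distinct = sorted(set(pooled))        # Z*_1 < ... < Z*_L
    pooled_counts = Counter(pooled)       # l_j (h_j in CMH-17-1G)

    total = 0.0
    for sample in samples:
        n_i = len(sample)
        sample_counts = Counter(sample)   # ties within sample i
        below_i = 0                       # values in sample i below Z*_j
        below_all = 0                     # pooled values below Z*_j
        inner = 0.0
        for z in distinct:
            l_j = pooled_counts[z]
            f_ij = sample_counts[z]
            M_aij = below_i + f_ij / 2    # F_ij in CMH-17-1G (assumed definition)
            B_aj = below_all + l_j / 2    # H_j in CMH-17-1G (assumed definition)
            numerator = (N * M_aij - n_i * B_aj) ** 2
            denominator = B_aj * (N - B_aj) - N * l_j / 4
            inner += l_j * numerator / denominator
            below_i += f_ij
            below_all += l_j
        total += inner / n_i

    a2_akn = (N - 1) / N ** 2 * total           # Scholz and Stephens form
    adk = (N - 1) / (N ** 2 * (k - 1)) * total  # CMH-17-1G form
    return a2_akn, adk


# Hypothetical data with ties, three samples (k = 3)
example = [[1.0, 2.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0], [3.0, 4.0, 4.0, 6.0]]
a2_akn, adk = ad_k_sample_statistics(example)
print(a2_akn, adk, adk * (len(example) - 1))
```

Multiplying the CMH-17-1G value by \(\left(k - 1\right)\) should reproduce the Scholz and Stephens value, up to floating-point rounding.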

Scholz and Stephens indicate that one rejects \(H_0\) at a significance level of \(\alpha\) when:

\[ \frac{A_{a k N}^2 - \left(k - 1\right)}{\sigma_N} \ge t_{k - 1}\left(\alpha\right) \]

This can be rearranged to give a critical value:

\[ A_{c r i t}^2 = \left(k - 1\right) + \sigma_N t_{k - 1}\left(\alpha\right) \]

CMH-17-1G gives the critical value for \(ADK\) for \(\alpha=0.025\) as:

\[ ADC = 1 + \sigma_n \left(1.96 + \frac{1.149}{\sqrt{k - 1}} - \frac{0.391}{k - 1}\right) \]

The definition of \(\sigma_n\) from the two sources differs by a factor of \(\left(k - 1\right)\).

The value in parentheses in the CMH-17-1G critical value corresponds
to the interpolation formula for \(t_m\left(\alpha\right)\) given in Scholz
and Stephens’s paper. Note that this is *not* Student’s
t-distribution, but rather a distribution referred to as the
\(T_m\) distribution.
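The two critical values can be compared with a short sketch. It takes \(\sigma_N\) as an input rather than computing it, and it assumes that \(\sigma_n = \sigma_N / \left(k - 1\right)\), which is the direction implied by \(ADK = A_{a k N}^2 / \left(k - 1\right)\). The function names and the value of \(\sigma_N\) used below are hypothetical and chosen only for illustration.

```python
import math


def t_interp_alpha_0025(m):
    """Interpolated t_m(alpha) for alpha = 0.025, where m = k - 1."""
    return 1.96 + 1.149 / math.sqrt(m) - 0.391 / m


def critical_values(k, sigma_N):
    """Return (A2_crit, ADC) for k samples, given the Scholz-Stephens sigma_N."""
    t = t_interp_alpha_0025(k - 1)
    a2_crit = (k - 1) + sigma_N * t   # Scholz and Stephens, rearranged
    sigma_n = sigma_N / (k - 1)       # assumption: direction of the (k - 1) factor
    adc = 1 + sigma_n * t             # CMH-17-1G
    return a2_crit, adc


# Hypothetical inputs: k = 3 samples, sigma_N = 1.2
a2_crit, adc = critical_values(3, 1.2)
print(a2_crit, adc, adc * (3 - 1))
```

Under that assumption, the CMH-17-1G critical value is the Scholz and Stephens critical value divided by \(\left(k - 1\right)\), the same factor that separates the two test statistics.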

The `cmstatr` package uses the `kSamples` package to perform the
k-sample Anderson–Darling tests. `kSamples` uses the original
formulation from Scholz and Stephens, so its test statistic will differ
from that given by software based on the CMH-17-1G formulation by a
factor of \(\left(k-1\right)\).

For comparison, SciPy’s implementation also uses the original Scholz
and Stephens formulation. The statistic that it returns, however, is
the normalized statistic,
\(\left[A_{a k N}^2 - \left(k - 1\right)\right] / \sigma_N\), rather than
`kSamples`’s \(A_{a k N}^2\) value. To be consistent, SciPy also returns
the critical values \(t_{k-1}(\alpha)\) directly. (Currently, SciPy also
floors/caps the returned p-value at 0.1% / 25%.) The values of \(k\) and
\(\sigma_N\) are available in the return value of `cmstatr`’s
`ad_ksample`, if an exact comparison to SciPy is necessary.

The conclusion drawn about the null hypothesis, however, will be the same whether the test is performed in R, with software based on the CMH-17-1G formulation, or with SciPy.
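As an illustration of why the conclusions agree, the three rejection rules can be written side by side. This is a sketch only: it does not call `kSamples`, `cmstatr` or SciPy, the function name is hypothetical, and the numeric inputs are made up. It again assumes \(\sigma_n = \sigma_N / \left(k - 1\right)\).

```python
def rejection_decisions(a2_akn, k, sigma_N, t_crit):
    """Return the reject/do-not-reject decision under each formulation.

    a2_akn  : Scholz-Stephens statistic A2_akN
    k       : number of samples
    sigma_N : Scholz-Stephens standard deviation of A2_akN
    t_crit  : critical value t_{k-1}(alpha) of the T_m distribution
    """
    m = k - 1
    # Scholz and Stephens: compare A2_akN with (k - 1) + sigma_N * t
    reject_scholz = a2_akn >= m + sigma_N * t_crit
    # CMH-17-1G: compare ADK = A2_akN / (k - 1) with ADC = 1 + sigma_n * t
    reject_cmh = a2_akn / m >= 1 + (sigma_N / m) * t_crit
    # Normalized form: compare (A2_akN - (k - 1)) / sigma_N with t directly
    reject_normalized = (a2_akn - m) / sigma_N >= t_crit
    return reject_scholz, reject_cmh, reject_normalized


# Hypothetical numbers: one statistic above and one below the critical value
print(rejection_decisions(6.5, 3, 1.2, 2.577))
print(rejection_decisions(4.0, 3, 1.2, 2.577))
```

Each rule is an algebraic rearrangement of the same inequality, so the three decisions should differ only if the statistic sits at the critical value to within rounding error.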

[1] “Composite Materials Handbook, Volume 1. Polymer Matrix Composites Guideline for Characterization of Structural Materials,” SAE International, CMH-17-1G, Mar. 2012.

[2] F. W. Scholz and M. A. Stephens, “K-Sample Anderson–Darling Tests,” *Journal of the American Statistical Association*, vol. 82, no. 399, pp. 918–924, Sep. 1987.