Cliff Delta Test
Explanation
The Cliff Delta test is a test for stochastic equivalence. This means that even if the medians of two independent samples are equal, this test could still be significant.
Let's say we have one group A that scored 1, 2, 2, 5, 6, 6, 7, and another group B that scored 4, 4, 4, 5, 10, 10, 12. Both groups have the same median (i.e. 5), with equally many scores above and below it, but if a high score is positive, most people would rather be in group B than in group A. This is where ‘stochastic equality’ comes in. It looks at the chance that, if you pick one random person from group A and one from group B, the one from group A scores lower than the one from group B, plus half the chance that they score equally. In this example that’s about 0.68.
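As a quick check of that number, a minimal sketch in plain Python (the lists simply hold the example scores above):

```python
# Minimal sketch of the 'stochastic equality' probability from the example:
# P(score from A < score from B) + 0.5 * P(scores are equal)
A = [1, 2, 2, 5, 6, 6, 7]
B = [4, 4, 4, 5, 10, 10, 12]

lower = sum(1 for a in A for b in B if a < b)   # pairs where the A-score is lower
equal = sum(1 for a in A for b in B if a == b)  # pairs with equal scores

p_sup = (lower + 0.5 * equal) / (len(A) * len(B))
print(round(p_sup, 2))  # 0.68
```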
The test uses the (Glass) Rank Biserial correlation coefficient, which is the same as Cliff Delta. Cliff (1993) also developed this as a test, but only for large samples. Vargha (2000, p. 280), and also Vargha and Delaney (2000, p. 7, eq. 9), used a t-distribution with degrees of freedom in line with the Fligner-Policello test, while Delaney and Vargha (2002) also proposed an alternative degrees of freedom, in line with the Brunner-Munzel test.
Performing the Test
with Excel
Excel file: To Be Made
with stikpetE
To Be Made
without stikpetE
To Be Made
with Python
Jupyter Notebook: TS - Cliff Delta (IS) (P)
with stikpetP
To Be Made
without stikpetP
To Be Made
with R
Jupyter Notebook: To Be Made
with stikpetR
To Be Made
without stikpetR
To Be Made
with SPSS
To Be Made
Formulas
The test-statistic for this test is (Cliff, 1993, p. 500):
\(C = \frac{d}{\sqrt{\hat{\sigma}_d^2}}\)
Which follows a standard normal distribution (Cliff, 1993, p. 500), i.e.
\(p = 2 \times \left(1 - \Phi\left(\left|C\right|\right)\right)\)
With:
\(d = \frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} d_{x_i, y_j}}{n_1 \times n_2}\)
\(d_{x_i, y_j} = S_{i,j}\)
\(S_{i,j} = \text{sign}\left(x_i - y_j\right) = \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases}\)
\(\hat{\sigma}_d^2 = \max{\left(s_d^2, s_{d, min}^2\right)}\)
\(s_{d, min}^2 = \frac{1 - d^2}{n_1\times n_2 - 1}\)
\(s_d^2 = \frac{n_2^2 \times SS_{\hat{d}_{1}} + n_1^2 \times SS_{\hat{d}_{2}} - SS_{d_{xy}}}{n_1\times n_2 \times \left(n_1 - 1\right) \times \left(n_2 - 1\right)}\)
\(SS_{\hat{d}_{1}} = \sum_{i=1}^{n_1} \left(\hat{d}_{i1} - d\right)^2, SS_{\hat{d}_{2}} = \sum_{j=1}^{n_2} \left(\hat{d}_{j2} - d\right)^2, SS_{d_{xy}} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \left(d_{x_i, y_j} - d\right)^2\)
\(\hat{d}_{i1} = \frac{1}{n_2} \times \sum_{j=1}^{n_2} \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases} = \frac{1}{n_2} \times \sum_{j=1}^{n_2} S_{i,j}\)
\(\hat{d}_{j2} = \frac{1}{n_1} \times \sum_{i=1}^{n_1} \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases} = \frac{1}{n_1} \times \sum_{i=1}^{n_1} S_{i,j}\)
Alternatively, but with the same result, the sample variance of \(d\) can be calculated with:
\(s_d^2 = \frac{n_2}{n_1} \times \frac{s_{\hat{d}_1}^2}{n_2 - 1} + \frac{n_1}{n_2} \times \frac{s_{\hat{d}_2}^2}{n_1 - 1} - \frac{s_{d_{xy}}^2}{n_1 \times n_2}\)
\(s_{\hat{d}_1}^2 = \frac{SS_{\hat{d}_{1}}}{n_1 -1}, s_{\hat{d}_2}^2 = \frac{SS_{\hat{d}_{2}}}{n_2 -1}, s_{d_{xy}}^2 = \frac{SS_{d_{xy}}}{\left(n_1 - 1\right) \times \left(n_2 - 1\right)}\)
A different estimate (a 'consistent' one) is given by (Cliff, 1993, p. 499, eq. 7):
\(\hat{\sigma}_d^2 = \frac{\left(n_2 - 1\right) \times s_{\hat{d}_1}^2 + \left(n_1 - 1\right) \times s_{\hat{d}_2}^2 + s_{d_{xy}}^2}{n_1 \times n_2}\)
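To make the formulas concrete, below is a minimal sketch in Python (assuming NumPy and SciPy are available; all variable names are illustrative and not taken from any package) that computes \(d\), both variance estimates, \(C\) and the normal-approximation p-value for the example data from the Explanation section:

```python
# Minimal sketch of the Cliff (1993) test with the normal approximation,
# following the formulas above. Uses the example data from the Explanation section.
import numpy as np
from scipy.stats import norm

x = np.array([1, 2, 2, 5, 6, 6, 7])     # group 1 scores
y = np.array([4, 4, 4, 5, 10, 10, 12])  # group 2 scores
n1, n2 = len(x), len(y)

# dominance matrix S[i, j] = sign(x_i - y_j)
S = np.sign(x[:, None] - y[None, :])
d = S.mean()                  # Cliff delta (about -0.37 for the example data)

d_i1 = S.mean(axis=1)         # row means, the d-hat_{i1}
d_j2 = S.mean(axis=0)         # column means, the d-hat_{j2}

SS_d1 = np.sum((d_i1 - d) ** 2)
SS_d2 = np.sum((d_j2 - d) ** 2)
SS_dxy = np.sum((S - d) ** 2)

# unbiased variance estimate, floored at the minimum variance
s_d2 = (n2**2 * SS_d1 + n1**2 * SS_d2 - SS_dxy) / (n1 * n2 * (n1 - 1) * (n2 - 1))
s_d2_min = (1 - d**2) / (n1 * n2 - 1)
var_d = max(s_d2, s_d2_min)

# the 'consistent' estimate (Cliff, 1993, eq. 7), shown for comparison
s_d1_sq = SS_d1 / (n1 - 1)
s_d2_sq = SS_d2 / (n2 - 1)
s_dxy_sq = SS_dxy / ((n1 - 1) * (n2 - 1))
var_d_consistent = ((n2 - 1) * s_d1_sq + (n1 - 1) * s_d2_sq + s_dxy_sq) / (n1 * n2)

C = d / np.sqrt(var_d)                  # test statistic
p = 2 * (1 - norm.cdf(abs(C)))          # two-sided p-value
print(round(d, 4), round(C, 4), round(p, 4))
```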
Vargha (2000, p. 280), and also Vargha and Delaney (2000, p. 7, eq. 9), use a t-distribution instead of the standard normal distribution. They use the same test-statistic, so the p-value becomes:
\(p = 2 \times \left(1 - t\left(\left|C\right|, df\right)\right)\)
With degrees of freedom:
\(df = \frac{\left(a + b\right)^2}{\frac{a^2}{n_1 - 1} + \frac{b^2}{n_2 - 1}}\)
With:
\(a_{BM} = \frac{1}{n_1} \times \frac{s_{R_1^*}^2}{n_2^2}, b_{BM} = \frac{1}{n_2} \times \frac{s_{R_2^*}^2}{n_1^2}\)
\(s_{R_1^*}^2 = \frac{SS_{R_1^*}}{n_1 -1}, s_{R_2^*}^2 = \frac{SS_{R_2^*}}{n_2 -1}\)
\(SS_{R_1^*} = \sum_{i=1}^{n_1} \left(r_{i1}^* - \bar{R}_1^*\right)^2, SS_{R_2^*} = \sum_{j=1}^{n_2} \left(r_{j2}^* - \bar{R}_2^*\right)^2\)
\(\bar{R}_1^* = \frac{\sum_{i=1}^{n_1} r_{i1}^*}{n_1}, \bar{R}_2^* = \frac{\sum_{j=1}^{n_2} r_{j2}^*}{n_2}\)
\(r_{i1}^* = \sum_{j=1}^{n_2} \begin{cases} 1 & S_{i,j} = 1 \\ 0.5 & S_{i,j} = 0 \\ 0 & S_{i,j} = -1 \end{cases}\)
\(r_{j2}^* = \sum_{i=1}^{n_1} \begin{cases} 1 & S_{i,j} = -1 \\ 0.5 & S_{i,j} = 0 \\ 0 & S_{i,j} = 1 \end{cases}\)
Delaney and Vargha (2002) also proposed an alternative degrees of freedom, in line with the Brunner-Munzel test, using:
\(a_{FP} = \frac{s_{R_1^*}^2}{n_1}, b_{FP} = \frac{s_{R_2^*}^2}{n_2}\)
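A short continuation of the previous sketch (re-using x, y, S, C, n1 and n2 from it) that computes the placement values, both versions of \(a\) and \(b\) shown above, and the resulting degrees of freedom and t-based p-values:

```python
# Placement-based degrees of freedom, following the formulas above.
# Continues the previous sketch; names are illustrative only.
from scipy.stats import t as t_dist

# placement values: r*_{i1} counts (with 0.5 for ties) how many y_j a given x_i exceeds
r1 = np.sum(np.where(S == 1, 1.0, np.where(S == 0, 0.5, 0.0)), axis=1)
r2 = np.sum(np.where(S == -1, 1.0, np.where(S == 0, 0.5, 0.0)), axis=0)

s_r1 = np.sum((r1 - r1.mean()) ** 2) / (n1 - 1)   # variance of group 1 placements
s_r2 = np.sum((r2 - r2.mean()) ** 2) / (n2 - 1)   # variance of group 2 placements

# the two variants of a and b shown above
a_bm, b_bm = s_r1 / (n1 * n2**2), s_r2 / (n2 * n1**2)
a_fp, b_fp = s_r1 / n1, s_r2 / n2

def welch_df(a, b):
    # Welch-type degrees of freedom from the formula above
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

for a, b in ((a_bm, b_bm), (a_fp, b_fp)):
    df = welch_df(a, b)
    p = 2 * (1 - t_dist.cdf(abs(C), df))
    print(round(df, 2), round(p, 4))
```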
Symbols used:
- \(n_{k}\), the number of scores in category k
- \(\Phi\left(\dots\right)\), the cumulative distribution function of the standard normal distribution
- \(t\left(\dots\right)\), the cumulative distribution function of the t distribution
A few additional notes.
- In Vargha and Delaney (2000, p. 7, eq. 9) they mention "df is rounded to the nearest integer", which is why this function allows for a rounding of the degrees of freedom (df).
- The \(r^*\) values are so-called placement values. These can also be calculated by subtracting the within-group mid-rank from the pooled mid-rank (see the sketch after this list).
- The \(d\) value is known as Cliff Delta, and is also the same as the (Glass) rank biserial correlation coefficient. This in itself is sometimes used as an effect size measure, and various methods are available to calculate it.
- For the \(\hat{d}_{j2}\) Cliff (1993) mentions: " \( d_{.j} \) represents the proportion of scores from the first population that lies above a given score from the second, minus the reverse " (p. 499), which is the one used here. However, Delaney and Vargha (2002) wrote: " \(d_{.j}\) denotes the proportion of the \(X\) scores that lie below \(Y_j\) minus the proportion that lie above" (p. 9), but this is most likely a mistake.
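As a small illustration of the note on placement values, the following check (continuing the sketches above, and assuming SciPy's rankdata for the mid-ranks) confirms that subtracting the within-group mid-ranks from the pooled mid-ranks gives the same \(r^*\) values:

```python
# Check that placements equal pooled mid-rank minus within-group mid-rank.
from scipy.stats import rankdata

pooled = rankdata(np.concatenate([x, y]))   # mid-ranks in the combined sample
r1_alt = pooled[:n1] - rankdata(x)          # placements for group 1
r2_alt = pooled[n1:] - rankdata(y)          # placements for group 2
print(np.allclose(r1_alt, r1), np.allclose(r2_alt, r2))  # both True
```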
Interpreting the Result
The assumption about the population for this test (the null hypothesis) is that the two samples are stochastically equivalent.
The test provides a p-value, which is the probability of obtaining a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\) ), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below that is then considered low.
If the assumption is rejected, we conclude that the two samples are not stochastically equal. This indicates the scores in one of the two are 'higher' than in the other.
Note that if we do not reject the assumption, it does not mean we accept it; we simply state that there is insufficient evidence to reject it.
Writing the results
Writing up the results of the test uses the format (APA, 2019, p. 182):
\(t\)(<degrees of freedom>) = <\(C\)-value>, p = <p-value>
So for example:
A Cliff-Delta test indicated a significant difference between males and females in the distribution of scores, t(36.98) = 4.466, p < .001.
A few notes about reporting statistical results with APA:
- The p-value is shown with three decimal places, and no 0 before the decimal sign. If the p-value is below .0005, it can be reported as p < .001.
- t is a standard abbreviation from APA for the t-distribution (see APA, 2019, table 6.5).
- APA does not require including references or formulas for statistical analyses that are in common use (2019, p. 181).
- APA (2019, p. 88) states to also report an effect size measure.
Next...
The next step is to determine an effect size measure. The Vargha-Delaney A, a Rosenthal correlation, or a (Glass) rank biserial correlation (Cliff Delta) could be suitable for this.
Alternatives
Alternatives for testing stochastic equivalence:
- Mann-Whitney U, although Chung and Romano (2011, p. 5) note that it fails to control Type I errors
- Brunner-Munzel test
- the Brunner-Munzel studentized permutation test
- C-square test, which is an improvement on the Brunner-Munzel test
If you only want to test if the medians are equal:
- Mann-Whitney U, assuming distributions have the same shape
- Fligner-Policello, assuming distributions are symmetric around the median, and continuous data
- Mood-Median, although according to Schlag (2015) this actually tests quantiles, and can lead to over-rejection.
- Schlag, although this only accepts or rejects and gives no p-value