Cliff Delta Test
Explanation
The Cliff Delta test is a test for stochastic equivalence. This means that even if the medians of two independent samples are equal, this test could still be significant.
Let's say we have one group A that scored 1, 2, 2, 5, 6, 6, 7, and another group B that scored 4, 4, 4, 5, 10, 10, 12. Both groups have the same median (i.e. 5), with equally many scores above and below it, but if a high score is positive, most people would rather be in group B than in group A. This is where ‘stochastic equality’ comes in. It looks at the chance that, if you pick one random person from group A and one from group B, the one from group A scores lower than the one from group B, plus half the chance that they score equally. In this example that’s about 0.68.
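As a quick check of that number, a minimal sketch in plain Python (the lists simply hold the example scores above):

```python
# Minimal sketch of the 'stochastic equality' probability from the example:
# P(score from A < score from B) + 0.5 * P(scores are equal)
A = [1, 2, 2, 5, 6, 6, 7]
B = [4, 4, 4, 5, 10, 10, 12]

lower = sum(1 for a in A for b in B if a < b)   # pairs where the A-score is lower
equal = sum(1 for a in A for b in B if a == b)  # pairs with equal scores

p_sup = (lower + 0.5 * equal) / (len(A) * len(B))
print(round(p_sup, 2))  # 0.68
```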
The test uses the (Glass) Rank Biserial correlation coefficient, which is the same as Cliff Delta. Cliff (1993) also developed this as a test, but only for large samples. Vargha (2000, p. 280), and also Vargha and Delaney (2000, p. 7, eq. 9), used a t-distribution with degrees of freedom in line with the Fligner-Policello test, while Delaney and Vargha (2002) also proposed an alternative degrees of freedom, in line with the Brunner-Munzel test.
Performing the Test
with Excel
Excel file: To Be Made
with stikpetE
To Be Made
without stikpetE
To Be Made
with Python
Jupyter Notebook: TS - Cliff Delta (IS) (P)
with stikpetP
To Be Made
without stikpetP
To Be Made
with R
Jupyter Notebook: To Be Made
with stikpetR
To Be Made
without stikpetR
To Be Made
with SPSS
To Be Made
Formulas
The test-statistic for this test is (Cliff, 1993, p. 500):
\(C = \frac{d}{\sqrt{\hat{\sigma}_d^2}}\)
Which follows a standard normal distribution (Cliff, 1993, p. 500), i.e.
\(p = 2 \times \left(1 - \Phi\left(\left|C\right|\right)\right)\)
With:
\(d = \frac{\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} d_{x_i, y_j}}{n_1 \times n_2}\)
\(d_{x_i, y_j} = S_{i,j}\)
\(S_{i,j} = \text{sign}\left(x_i - y_j\right) = \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases}\)
\(\hat{\sigma}_d^2 = \max{\left(s_d^2, s_{d, min}^2\right)}\)
\(s_{d, min}^2 = \frac{1 - d^2}{n_1\times n_2 - 1}\)
\(s_d^2 = \frac{n_2^2 \times SS_{\hat{d}_{1}} + n_1^2 \times SS_{\hat{d}_{2}} - SS_{d_{xy}}}{n_1\times n_2 \times \left(n_1 - 1\right) \times \left(n_2 - 1\right)}\)
\(SS_{\hat{d}_{1}} = \sum_{i=1}^{n_1} \left(\hat{d}_{i1} - d\right)^2, SS_{\hat{d}_{2}} = \sum_{j=1}^{n_2} \left(\hat{d}_{j2} - d\right)^2, SS_{d_{xy}} = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2} \left(d_{x_i, y_j} - d\right)^2\)
\(\hat{d}_{i1} = \frac{1}{n_2} \times \sum_{j=1}^{n_2} \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases} = \frac{1}{n_2} \times \sum_{j=1}^{n_2} S_{i,j}\)
\(\hat{d}_{j2} = \frac{1}{n_1} \times \sum_{i=1}^{n_1} \begin{cases} 1 & x_i \gt y_j \\ 0 & x_i = y_j \\ -1 & x_i \lt y_j \end{cases} = \frac{1}{n_1} \times \sum_{i=1}^{n_1} S_{i,j}\)
Alternatively, but with the same result, the sample variance of \(d\) can be calculated with:
\(s_d^2 = \frac{n_2}{n_1} \times \frac{s_{\hat{d}_1}^2}{n_2 - 1} + \frac{n_1}{n_2} \times \frac{s_{\hat{d}_2}^2}{n_1 - 1} - \frac{s_{d_{xy}}^2}{n_1 \times n_2}\)
\(s_{\hat{d}_1}^2 = \frac{SS_{\hat{d}_{1}}}{n_1 -1}, s_{\hat{d}_2}^2 = \frac{SS_{\hat{d}_{2}}}{n_2 -1}, s_{d_{xy}}^2 = \frac{SS_{d_{xy}}}{\left(n_1 - 1\right) \times \left(n_2 - 1\right)}\)
A different estimate (a 'consistent' one) is given by (Cliff, 1993, p. 499, eq. 7):
\(\hat{\sigma}_d^2 = \frac{\left(n_2 - 1\right) \times s_{\hat{d}_1}^2 + \left(n_1 - 1\right) \times s_{\hat{d}_2}^2 + s_{d_{xy}}^2}{n_1 \times n_2}\)
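To make the formulas concrete, below is a minimal sketch in Python (assuming NumPy and SciPy are available; all variable names are illustrative and not taken from any package) that computes \(d\), both variance estimates, \(C\) and the normal-approximation p-value for the example data from the Explanation section:

```python
# Minimal sketch of the Cliff (1993) test with the normal approximation,
# following the formulas above. Uses the example data from the Explanation section.
import numpy as np
from scipy.stats import norm

x = np.array([1, 2, 2, 5, 6, 6, 7])     # group 1 scores
y = np.array([4, 4, 4, 5, 10, 10, 12])  # group 2 scores
n1, n2 = len(x), len(y)

# dominance matrix S[i, j] = sign(x_i - y_j)
S = np.sign(x[:, None] - y[None, :])
d = S.mean()                  # Cliff delta (about -0.37 for the example data)

d_i1 = S.mean(axis=1)         # row means, the d-hat_{i1}
d_j2 = S.mean(axis=0)         # column means, the d-hat_{j2}

SS_d1 = np.sum((d_i1 - d) ** 2)
SS_d2 = np.sum((d_j2 - d) ** 2)
SS_dxy = np.sum((S - d) ** 2)

# unbiased variance estimate, floored at the minimum variance
s_d2 = (n2**2 * SS_d1 + n1**2 * SS_d2 - SS_dxy) / (n1 * n2 * (n1 - 1) * (n2 - 1))
s_d2_min = (1 - d**2) / (n1 * n2 - 1)
var_d = max(s_d2, s_d2_min)

# the 'consistent' estimate (Cliff, 1993, eq. 7), shown for comparison
s_d1_sq = SS_d1 / (n1 - 1)
s_d2_sq = SS_d2 / (n2 - 1)
s_dxy_sq = SS_dxy / ((n1 - 1) * (n2 - 1))
var_d_consistent = ((n2 - 1) * s_d1_sq + (n1 - 1) * s_d2_sq + s_dxy_sq) / (n1 * n2)

C = d / np.sqrt(var_d)                  # test statistic
p = 2 * (1 - norm.cdf(abs(C)))          # two-sided p-value
print(round(d, 4), round(C, 4), round(p, 4))
```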
Vargha (2000, p. 280), and also Vargha and Delaney (2000, p. 7, eq. 9), use a t-distribution instead of the standard normal distribution. They use the same test-statistic, so the p-value becomes:
\(p = 2 \times \left(1 - t\left(\left|C\right|, df\right)\right)\)
With degrees of freedom:
\(df = \frac{\left(a + b\right)^2}{\frac{a^2}{n_1 - 1} + \frac{b^2}{n_2 - 1}}\)
With:
\(a_{BM} = \frac{1}{n_1} \times \frac{s_{R_1^*}^2}{n_2^2}, b_{BM} = \frac{1}{n_2} \times \frac{s_{R_2^*}^2}{n_1^2}\)
\(s_{R_1^*}^2 = \frac{SS_{R_1^*}}{n_1 -1}, s_{R_2^*}^2 = \frac{SS_{R_2^*}}{n_2 -1}\)
\(SS_{R_1^*} = \sum_{i=1}^{n_1} \left(r_{i1}^* - \bar{R}_1^*\right)^2, SS_{R_2^*} = \sum_{j=1}^{n_2} \left(r_{j2}^* - \bar{R}_2^*\right)^2\)
\(\bar{R}_1^* = \frac{\sum_{i=1}^{n_1} r_{i1}^*}{n_1}, \bar{R}_2^* = \frac{\sum_{j=1}^{n_2} r_{j2}^*}{n_2}\)
\(r_{i1}^* = \sum_{j=1}^{n_2} \begin{cases} 1 & S_{i,j} = 1 \\ 0.5 & S_{i,j} = 0 \\ 0 & S_{i,j} = -1 \end{cases}\)
\(r_{j2}^* = \sum_{i=1}^{n_1} \begin{cases} 1 & S_{i,j} = -1 \\ 0.5 & S_{i,j} = 0 \\ 0 & S_{i,j} = 1 \end{cases}\)
Delaney and Vargha (2002) also proposed an alternative degrees of freedom, in line with the Brunner-Munzel test, using:
\(a_{FP} = \frac{s_{R_1^*}^2}{n_1}, b_{FP} = \frac{s_{R_2^*}^2}{n_2}\)
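A short continuation of the previous sketch (re-using x, y, S, C, n1 and n2 from it) that computes the placement values, both versions of \(a\) and \(b\) shown above, and the resulting degrees of freedom and t-based p-values:

```python
# Placement-based degrees of freedom, following the formulas above.
# Continues the previous sketch; names are illustrative only.
from scipy.stats import t as t_dist

# placement values: r*_{i1} counts (with 0.5 for ties) how many y_j a given x_i exceeds
r1 = np.sum(np.where(S == 1, 1.0, np.where(S == 0, 0.5, 0.0)), axis=1)
r2 = np.sum(np.where(S == -1, 1.0, np.where(S == 0, 0.5, 0.0)), axis=0)

s_r1 = np.sum((r1 - r1.mean()) ** 2) / (n1 - 1)   # variance of group 1 placements
s_r2 = np.sum((r2 - r2.mean()) ** 2) / (n2 - 1)   # variance of group 2 placements

# the two variants of a and b shown above
a_bm, b_bm = s_r1 / (n1 * n2**2), s_r2 / (n2 * n1**2)
a_fp, b_fp = s_r1 / n1, s_r2 / n2

def welch_df(a, b):
    # Welch-type degrees of freedom from the formula above
    return (a + b) ** 2 / (a**2 / (n1 - 1) + b**2 / (n2 - 1))

for a, b in ((a_bm, b_bm), (a_fp, b_fp)):
    df = welch_df(a, b)
    p = 2 * (1 - t_dist.cdf(abs(C), df))
    print(round(df, 2), round(p, 4))
```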
Symbols used:
- \(n_{k}\), the number of scores in category k
- \(\Phi\left(\dots\right)\), the cumulative distribution function of the standard normal distribution
- \(t\left(\dots\right)\), the cumulative distribution function of the t distribution
A few additional notes.
- In Vargha and Delaney (2000, p. 7, eq. 9) they mention "df is rounded to the nearest integer", which is why this function allows for a rounding of the degrees of freedom (df).
- The \(r^*\) values are so-called placement values. These can also be calculated by subtracting the within-group mid-rank from the pooled mid-rank (see the sketch after this list).
- The \(d\) value is known as Cliff Delta, and is also the same as the (Glass) rank biserial correlation coefficient. This in itself is sometimes used as an effect size measure, and various methods are available to calculate it.
- For the \(\hat{d}_{j2}\) Cliff (1993) mentions: " \( d_{.j} \) represents the proportion of scores from the first population that lies above a given score from the second, minus the reverse " (p. 499), which is the one used here. However, Delaney and Vargha (2002) wrote: " \(d_{.j}\) denotes the proportion of the \(X\) scores that lie below \(Y_j\) minus the proportion that lie above" (p. 9), but this is most likely a mistake.
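As a small illustration of the note on placement values, the following check (continuing the sketches above, and assuming SciPy's rankdata for the mid-ranks) confirms that subtracting the within-group mid-ranks from the pooled mid-ranks gives the same \(r^*\) values:

```python
# Check that placements equal pooled mid-rank minus within-group mid-rank.
from scipy.stats import rankdata

pooled = rankdata(np.concatenate([x, y]))   # mid-ranks in the combined sample
r1_alt = pooled[:n1] - rankdata(x)          # placements for group 1
r2_alt = pooled[n1:] - rankdata(y)          # placements for group 2
print(np.allclose(r1_alt, r1), np.allclose(r2_alt, r2))  # both True
```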
Interpreting the Result
The assumption about the population for this test (the null hypothesis) is that the two samples are stochastically equivalent.
The test provides a p-value, which is the probability of obtaining a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\) ), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below that is then considered low.
If the assumption is rejected, we conclude that the two samples are not stochastically equal. This indicates the scores in one of the two are 'higher' than in the other.
Note that if we do not reject the assumption, it does not mean we accept it; we simply state that there is insufficient evidence to reject it.
Writing the results
Writing up the results of the test uses the format (APA, 2019, p. 182):
\(t\)(<degrees of freedom>) = <\(C\)-value>, p = <p-value>
So for example:
A Cliff-Delta test indicated a significant difference between males and females in the distribution of scores, t(36.98) = 4.466, p < .001.
A few notes about reporting statistical results with APA:
- The p-value is shown with three decimal places, and no 0 before the decimal sign. If the p-value is below .0005, it can be reported as p < .001.
- t is a standard abbreviation from APA for the t-distribution (see APA, 2019, table 6.5).
- APA does not require including references or formulas for statistical analyses that are in common use (2019, p. 181).
- APA (2019, p. 88) states to also report an effect size measure.
Next...
The next step is to determine an effect size measure. The Vargha-Delaney A, a Rosenthal correlation, or a (Glass) rank biserial correlation (Cliff Delta) could be suitable for this.
Alternatives
Alternatives for testing stochastic equivalence:
- Mann-Whitney U, although Chung and Romano (2011, p. 5) note that it fails to control Type I errors
- Brunner-Munzel test
- the Brunner-Munzel studentized permutation test
- C-square test, which is an improvement on the Brunner-Munzel test
If you only want to test if the medians are equal:
- Mann-Whitney U, assuming distributions have the same shape
- Fligner-Policello, assuming distributions are symmetric around the median, and continuous data
- Mood-Median, although according to Schlag (2015) this actually tests quantiles, and can lead to over-rejection.
- Schlag, although this only accepts or rejects and gives no p-value