Mann-Whitney U Test
Explanation
A Mann-Whitney U test can be used when you have a binary and an ordinal variable. It compares the mean ranks of each group (from the binary variable). Ranks are determined by first sorting all the scores on the ordinal variable; the lowest score then gets rank 1, the next one rank 2, and so on. The highest possible rank is therefore the total number of cases. If two or more scores are the same, each gets the average of the ranks they would otherwise have received. So if, for example, the fourth, fifth, and sixth scores are all 9, each of them gets rank (4+5+6) / 3 = 5.
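As a small illustration of this ranking rule, the snippet below (a sketch, assuming Python with the scipy package installed) ranks a made-up set of scores in which the fourth, fifth and sixth sorted values are tied:

from scipy.stats import rankdata

# made-up scores; the fourth, fifth and sixth sorted values are all 9
scores = [3, 5, 7, 9, 9, 9, 12]

# method='average' gives tied scores the mean of the ranks they would have gotten
print(rankdata(scores, method='average'))
# output: [1. 2. 3. 5. 5. 5. 7.]  -> each 9 gets (4+5+6)/3 = 5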
The null hypothesis is that the probability of a randomly selected case from one category having a higher score than a randomly selected case from the other category is 50% (Divine et al., 2018, p. 286). According to Laerd Statistics, the interpretation depends on whether the two groups have similar or different shapes: if similar, the null hypothesis is that the medians are equal; otherwise the test is about whether the distributions are equal.
The test is also referred to as the Wilcoxon-Mann-Whitney test, or Mann-Whitney-Wilcoxon test. This is because Mann and Whitney expanded on an idea from Wilcoxon. It is the same as the Wilcoxon Rank Sum test.
The test can be done with an exact distribution (the Wilcoxon rank-sum distribution), but is most often approximated using a normal distribution. The term ‘exact’ might give you the impression that you should always use it, since ‘exact’ sounds better than ‘approximate’. Some do indeed argue for this (for example Berger (2017)), but the exact test often requires a lot more computational power (even for computers today), and in some cases the approximate test is even claimed to be preferable (see for example Agresti and Coull (1998)).
Performing the Test
with Excel
Excel file from video: TS - Mann-Whitney U (E).xlsm
with stikpetE
To Be Made
without stikpetE
with Python
Notebook from video: TS - Mann-Whitney U (P).ipynb
with stikpetP
To Be Made
without stikpetP
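A minimal sketch without the stikpetP package, using scipy's built-in function (the scores below are just placeholder data):

from scipy.stats import mannwhitneyu

group1 = [1, 2, 5, 2, 2]   # scores of the first category (placeholder data)
group2 = [4, 3, 5, 5]      # scores of the second category (placeholder data)

# method='exact' uses the exact distribution, method='asymptotic' the normal approximation
res = mannwhitneyu(group1, group2, alternative='two-sided', method='asymptotic')
print(res.statistic, res.pvalue)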
with SPSS
using Nonparametric
using Legacy Dialogs
Formulas
Formula
The formula for the U statistics is:
\(U_i=R_i-\frac{n_i\times\left(n_i+1\right)}{2}\)
In this formula ni is the number of scores in category i, and Ri the sum of the ranks from category i.
Often, however, there are ties, and we then need to adjust for those. The z-statistic is then used:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE}\)
The formula for SE (standard error) is:
\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right)}\)
The N is the total number of scores (i.e. n1 + n2), and Ti the tie correction for tie i. For each unique rank, Ti is determined by:
\(T_i=\frac{t_i^3-t_i}{12}\)
Where ti is the number of scores tied at unique rank i.
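To make these formulas concrete, below is a minimal Python sketch (the function and variable names are my own) that computes the U statistic of the first group and the tie-corrected z-statistic:

from collections import Counter
from scipy.stats import rankdata

def mann_whitney_u_z(x1, x2):
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    ranks = rankdata(list(x1) + list(x2))              # average ranks of the combined scores
    r1 = sum(ranks[:n1])                               # R1: sum of ranks of the first group
    u1 = r1 - n1 * (n1 + 1) / 2                        # U statistic of the first group
    ties = sum((t**3 - t) / 12 for t in Counter(ranks).values() if t > 1)   # sum of the Ti
    se = ((n1 * n2) / (n * (n - 1)) * ((n**3 - n) / 12 - ties)) ** 0.5      # standard error
    z = (2 * u1 - n1 * n2) / (2 * se)
    return u1, z

Running this on the example below should reproduce U1 = 3 and z ≈ -1.775.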
Example
Note: this is a different example than the one used in the rest of this section.
We are given the scores of one group of people:
\(X_1=(1,2,5,2,2)\)
And another group:
\(X_2=(4,3,5,5)\)
Note that the number of scores in the first group is five, and in the second four, so:
\(n_1=5, n_2=4, N=5+4=9\)
If we combine both groups we get:
\(C=(1,2,5,2,2,4,3,5,5)\)
The lowest score is a 1, so this gets a rank of 1. Then there are three 2's, so these get ranks 2, 3, and 4, or on average 3. There is only one 3, so this gets rank 5, only one 4 which gets rank 6, and there are three 5's, so these get ranks 7, 8, and 9, or on average 8. Replacing the original scores with the ranks (average ones), and summing them up we get for the first group:
\(R_1=1+3+8+3+3=18\)
And for the second:
\(R_2=6+5+8+8=27\)
The U statistic of the first group is:
\(U_1=R_1-\frac{n_1\times\left(n_1+1\right)}{2} =18-\frac{5\times\left(5+1\right)}{2} =18-\frac{30}{2}=18-15=3\)
And for the second group:
\(U_2=R_2-\frac{n_2\times\left(n_2+1\right)}{2} =27-\frac{4\times\left(4+1\right)}{2} =27-\frac{20}{2}=27-10=17\)
We had three 2's, and also three 5's. So for the frequencies of ties we get the sequence:
\(T=(3,3)\)
Now calculate the adjustment for each frequency of ties:
\(T_1=\frac{t_1^3-t_1}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)
\(T_2=\frac{t_2^3-t_2}{12} =\frac{3^3-3}{12} =\frac{27-3}{12} =\frac{24}{12}=2\)
Then the standard error:
\(SE=\sqrt{\frac{n_1\times n_2}{N\times\left(N-1\right)}\times\left(\frac{N^3-N}{12}-\sum{T_i}\right)} =\sqrt{\frac{5\times 4}{9\times\left(9-1\right)}\times\left(\frac{9^3-9}{12}-(2+2)\right)}\)
\(=\sqrt{\frac{20}{72}\times\left(\frac{729-9}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(\frac{720}{12}-4\right)} =\sqrt{\frac{5}{18}\times\left(60-4\right)}\)
\(=\sqrt{\frac{5}{18}\times56} =\sqrt{\frac{5\times56}{18}} =\sqrt{\frac{5\times28}{9}} =\sqrt{\frac{140}{9}}\)
\(=\frac{\sqrt{140}}{\sqrt9} =\frac{\sqrt{4\times35}}{3} =\frac{\sqrt{4}\times\sqrt{35}}{3} =\frac{2}{3}\sqrt{35}\approx3.944\)
Finally the Z-score. If we use U1:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE} =\frac{2\times 3-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{6-20}{\frac{4}{3}\sqrt{35}} =\frac{-14}{\frac{4\sqrt{35}}{3}}\)
\(=\frac{-14\times3}{4\sqrt{35}} =\frac{-7\times3}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}} =\frac{-21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{-21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)
\(=\frac{-21\sqrt{35}}{2\times35} =\frac{-21\sqrt{35}}{70} =\frac{-3\sqrt{35}}{10}\approx-1.775\)
If we use U2:
\(Z=\frac{2\times U_i-n_1\times n_2}{2\times SE} =\frac{2\times 17-5\times 4}{2\times \frac{2}{3}\sqrt{35}} =\frac{34-20}{\frac{4}{3}\sqrt{35}} =\frac{14}{\frac{4\sqrt{35}}{3}}\)
\(=\frac{14\times3}{4\sqrt{35}} =\frac{7\times3}{2\sqrt{35}} =\frac{21}{2\sqrt{35}} =\frac{21}{2\sqrt{35}}\times\frac{\sqrt{35}}{\sqrt{35}} =\frac{21\times\sqrt{35}}{2\sqrt{35}\times\sqrt{35}}\)
\(=\frac{21\sqrt{35}}{2\times35} =\frac{21\sqrt{35}}{70} =\frac{3\sqrt{35}}{10}\approx1.775\)
For the two-tailed significance we can then use the standard normal distribution. Usually this is found using a table or software, but if you must know, the formula is:
\(2\times\int_{x=|Z|}^{\infty}\left(\frac{1}{\sqrt{2\times\pi}}\times e^{-\frac{x^2}{2}}\right)dx\)
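In Python this two-tailed p-value could, for example, be found with scipy (a sketch, re-using the z-score and the data from the example above):

from scipy.stats import norm, mannwhitneyu

z = -1.775
p = 2 * norm.sf(abs(z))          # two-sided tail area of the standard normal distribution
print(p)                         # approximately 0.076

# cross-check with scipy's built-in test; use_continuity=False matches the formulas above
x1, x2 = [1, 2, 5, 2, 2], [4, 3, 5, 5]
print(mannwhitneyu(x1, x2, use_continuity=False, alternative='two-sided', method='asymptotic'))
# should give U = 3 and roughly the same p-value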
Interpreting the Result
The assumption about the population for this test (the null hypothesis) is that the medians are equal (if both categories have similarly shaped distributions), or that the distributions are equal (if their shapes look different).
The test provides a p-value, which is the probability of a test statistic as extreme as the one from the sample, or even more extreme, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below it is considered low.
If the assumption is rejected, we conclude that the medians in the population are different, or that the distributions are different.
Note that if we do not reject the assumption, it does not mean we accept it; we simply state that there is insufficient evidence to reject it.
Writing the results
Writing up the results of the test uses the format (APA, 2019 p. 182):
U(n1 = <number of cases in 1st category>, n2 = <number of cases in 2nd category>) = <U-value>, p = <p-value>
So for example if an exact test is used:
An exact Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, U(n1 = 11, n2 = 34) = 285.5, p = .008.
If you do not have an exact p-value, then use the approximated one. In that case the test-statistic Z is actually used. The report would then go something like:
A Mann-Whitney U test indicated that the mean ranks for male and female were significantly different, z(n1 = 11, n2 = 34) = 2.845, p = .004.
A few notes about reporting statistical results with APA:
- The p-value is shown with three decimal places, and no 0 before the decimal sign. If the p-value is below .0005, it can be reported as p < .001 (see the small sketch after this list).
- Both U and z are standard abbreviations from APA for the Mann-Whitney U test statistic, and standardized score (see APA, 2019, table 6.5). They do not need to be explained.
- APA does not require references or formulas for statistical analyses that are in common use (2019, p. 181).
- APA (2019, p. 88) states that an effect size measure should also be reported.
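The p-value formatting rule from the first note above could be automated. Below is a small, hypothetical helper (a sketch in Python, not part of any official APA tooling) that applies it:

def apa_p(p):
    # APA style: three decimals, no 0 before the decimal sign, 'p < .001' for very small values
    if p < 0.0005:
        return "p < .001"
    return "p = " + format(round(p, 3), ".3f").lstrip("0")

print(apa_p(0.00762))   # p = .008
print(apa_p(0.000031))  # p < .001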
Next...
The next step is to determine an effect size measure. A Vargha-Delaney A, a Rosenthal Correlation, or a (Glass) Rank Biserial Correlation (Cliff Delta) could be suitable for this.
Alternatives
Alternatives for testing stochastic equivalence:
- Brunner-Munzel test
- the Brunner-Munzel studentized permutation test
- C-square test, which is an improvement on the Brunner-Munzel test
- Cliff Delta, which according to Delaney and Vargha (2002) performs similarly to the Brunner-Munzel test.
If you only want to test whether the medians are equal:
- Mann-Whitney U, assuming distributions have the same shape
- Fligner-Policello, assuming distributions are symmetric around the median, and continuous data
- Mood-Median, although according to Schlag (2015) this actually tests quantiles, and can lead to over-rejection.
- Schlag, which only gives an accept or reject decision, without a p-value