Brunner-Munzel Test
Explanation
The Brunner-Munzel test is a test for stochastic equivalence. This means that even if the medians of two independent samples are equal, the test can still be significant.
Let's say we have one group A that scored 1, 2, 2, 5, 6, 6, 7, and another group B that scored 4, 4, 4, 5, 10, 10, 12. Each group has the same median (i.e. 5) and is symmetric around it, but if a high score is positive, most people would rather be in group B than in group A. This is where 'stochastic equality' comes in. It looks at the chance that, if you pick one random person from group A and one from group B, the one from group A scores lower than the one from group B, plus half the chance that they score equally. In this example that is about 0.68.
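As a check, the 0.68 from this example can be computed by brute force over all pairs of one score from each group; a minimal sketch (the function name `relative_effect` is just an illustrative choice):

```python
from itertools import product

def relative_effect(a, b):
    """Estimate P(A < B) + 0.5 * P(A = B) by checking every pair."""
    pairs = list(product(a, b))
    wins = sum(x < y for x, y in pairs)  # pairs where A scores lower
    ties = sum(x == y for x, y in pairs)  # pairs with equal scores
    return (wins + 0.5 * ties) / len(pairs)

group_a = [1, 2, 2, 5, 6, 6, 7]
group_b = [4, 4, 4, 5, 10, 10, 12]
print(round(relative_effect(group_a, group_b), 2))  # → 0.68
```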
Brunner and Munzel (2000, p. 21) indicate that the computed test statistic follows a standard normal distribution if each category has 50 or more data points. They also remark (p. 22) that the test is no longer accurate if sample sizes are below 10, although Schüürhuis et al. (2025, p. 18) list 15 as the threshold. Neubert and Brunner (2007) propose using a studentized permutation test in these cases. Schüürhuis et al. (2025) also developed an improved version of this test, called the \(C^2\) test.
Performing the Test
with SPSS
To Be Made
Formulas
The test-statistic is calculated using (Brunner & Munzel, 2000, p. 21, eq. 4.8):
\(W_n^{BM} = \frac{\bar{R}_2 - \bar{R}_1}{\sqrt{N \times \hat{\sigma}_N^2}} = \left(\hat{p} - \frac{1}{2}\right)\times\sqrt{\frac{N}{\hat{\sigma}_N^2}}\)
with degrees of freedom (Brunner & Munzel, 2000, p. 22):
\(df = \frac{\hat{\sigma}_N^4}{N^2 \times \sum_{k=1}^2 \frac{\hat{\sigma}_k^4}{\left(n_k - 1\right)\times n_k^2}}\)
The p-value can then be calculated using:
\(p = 2 \times \left(1 - T\left(\left|W_n^{BM}\right|, df\right)\right)\)
The total estimated variance (Brunner & Munzel, 2000, p. 21, eq. 4.7):
\(\hat{\sigma}_N^2 = N\times\left(\frac{\hat{\sigma}_1^2}{n_1} + \frac{\hat{\sigma}_2^2}{n_2}\right)\)
The estimated variance for each category (Brunner & Munzel, 2000, p. 20, eq. 4.6):
\(\hat{\sigma}_k^2 = \frac{s_{R_k^*}^2}{\left(N - n_k\right)^2}\)
the sample variance of placement values (Brunner & Munzel, 2000, p. 20, eq. 4.5):
\(s_{R_k^*}^2 = \frac{\sum_{i=1}^{n_k} \left(R_{ik}^* - \bar{R}_{k}^*\right)^2}{n_k - 1}\)
mean of the placement values:
\(\bar{R}_{k}^* = \frac{\sum_{i=1}^{n_k} R_{ik}^*}{n_k}\)
the placement values:
\(R_{ik}^* = R_{ik} - R_{ik}^{(k)}\)
mean of the pooled ranks for each category:
\(\bar{R}_k = \frac{\sum_{i=1}^{n_k} R_{ik}}{n_k}\)
and for the second formula for \(W_n^{BM}\):
\(\hat{p} = \frac{1}{n_1}\times\left(\bar{R}_2 - \frac{n_2 + 1}{2}\right)\)
Symbols used:
- \(R_{ik}\), the rank of the i-th score in category k, when using all combined scores
- \(R_{ik}^{(k)}\), the rank of the i-th score in category k, when using only scores from category k
- \(N\), the total sample size
- \(n_{k}\), the number of scores in category k
- \(T\left(\dots\right)\), the cumulative distribution function of the t-distribution
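The formulas above can be combined into a small sketch in plain Python (mid-ranks handle ties; the function names and the use of the example data from the earlier section are illustrative choices, not part of the original method's notation):

```python
import math

def midranks(values):
    """Mid-ranks: tied values get the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied block (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def brunner_munzel(x1, x2):
    """Return the W test statistic and degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    pooled = midranks(list(x1) + list(x2))
    r1, r2 = pooled[:n1], pooled[n1:]
    # placement values: pooled rank minus within-group rank
    p1 = [r - w for r, w in zip(r1, midranks(x1))]
    p2 = [r - w for r, w in zip(r2, midranks(x2))]
    def var(v):  # sample variance with n - 1 in the denominator
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    s1 = var(p1) / (N - n1) ** 2  # estimated variance per category
    s2 = var(p2) / (N - n2) ** 2
    sN = N * (s1 / n1 + s2 / n2)  # total estimated variance
    w = (sum(r2) / n2 - sum(r1) / n1) / math.sqrt(N * sN)
    df = sN ** 2 / (N ** 2 * (s1 ** 2 / ((n1 - 1) * n1 ** 2)
                              + s2 ** 2 / ((n2 - 1) * n2 ** 2)))
    return w, df

w, df = brunner_munzel([1, 2, 2, 5, 6, 6, 7], [4, 4, 4, 5, 10, 10, 12])
print(round(w, 3), round(df, 1))  # → 1.157 12.0
```

With \(W_n^{BM} \approx 1.157\) and \(df = 12\), a t-table gives a two-sided p-value of roughly 0.27, so this tiny example is not significant (note that the sample sizes here are also far below the recommended minimum).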
For a studentized permutation test (Neubert & Brunner, 2007) we can use the following steps:
- determine the observed test statistic \(W_{obs}^{BM}\)
- randomly re-assign all scores to the two groups (keeping the group sizes) and re-calculate the test statistic
- determine whether the result is below or above the observed value
- repeat steps 2 and 3 many times, keeping track of how often the result was above and how often below the observed test statistic
- determine the approximate p-value as \(2\times\frac{\min{\left(n_{above}, n_{below}\right)}}{n_{iters}}\)
Neubert and Brunner (2007) used 10,000 iterations (permutations). The formula for the p-value in the last step can be found in Schüürhuis et al. (2025, p. 7).
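A self-contained sketch of these steps in Python (the function names, fixed seed, and reuse of the example data are illustrative assumptions; the placement values are here computed directly as counts, which is equivalent to pooled rank minus within-group rank):

```python
import random

def placements(x, other):
    """R*: number of other-sample values below each score, plus half the ties."""
    return [sum(o < v for o in other) + 0.5 * sum(o == v for o in other)
            for v in x]

def bm_statistic(x1, x2):
    """Studentized Brunner-Munzel statistic W."""
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    p1, p2 = placements(x1, x2), placements(x2, x1)
    def var(v):  # sample variance with n - 1 in the denominator
        m = sum(v) / len(v)
        return sum((e - m) ** 2 for e in v) / (len(v) - 1)
    s1, s2 = var(p1) / n2 ** 2, var(p2) / n1 ** 2
    sN = N * (s1 / n1 + s2 / n2)
    p_hat = sum(p2) / (n1 * n2)  # estimated P(X1 < X2) + 0.5 P(X1 = X2)
    return (p_hat - 0.5) * (N / sN) ** 0.5

def permutation_p(x1, x2, iters=10000, seed=1):
    """Approximate two-sided p-value by shuffling group membership."""
    rng = random.Random(seed)
    w_obs = bm_statistic(x1, x2)
    pooled = list(x1) + list(x2)
    above = below = 0
    for _ in range(iters):
        rng.shuffle(pooled)
        w = bm_statistic(pooled[:len(x1)], pooled[len(x1):])
        if w > w_obs:
            above += 1
        else:
            below += 1
    return 2 * min(above, below) / iters

p = permutation_p([1, 2, 2, 5, 6, 6, 7], [4, 4, 4, 5, 10, 10, 12])
print(p)
```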
Interpreting the Result
The assumption about the population for this test (the null hypothesis) is that the two samples are stochastically equivalent.
The test provides a p-value, which is the probability of obtaining a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below it is then considered low.
If the assumption is rejected, we conclude that the two samples are not stochastically equal. This indicates that the scores in one of the two groups tend to be higher than in the other.
Note that if we do not reject the assumption, it does not mean we accept it; we simply state that there is insufficient evidence to reject it.
Writing the results
Writing up the results of the test uses the format (APA, 2019, p. 182):
t(<degrees of freedom>) = <t-value>, p = <p-value>
So for example:
A Brunner-Munzel test indicated a significant difference between males and females in the distribution of scores, t(36.98) = 4.466, p < .001.
A few notes about reporting statistical results with APA:
- The p-value is shown with three decimal places, and no 0 before the decimal point. If the p-value is below .0005, it can be reported as p < .001.
- t is a standard abbreviation from APA for the t-distribution (see APA, 2019, Table 6.5).
- APA does not require references or formulas for statistical analyses that are in common use (2019, p. 181).
- APA (2019, p. 88) states to also report an effect size measure.
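As a small illustration of these reporting rules, a hypothetical helper could format the result string (the function name and rounding choices are assumptions for this sketch, not prescribed by APA):

```python
def apa_report(t_value, df, p_value):
    """Format a t-based result in APA style: t(df) = t, p = .xxx."""
    if p_value < .0005:
        p_part = "p < .001"
    else:
        # three decimals, no leading zero before the decimal point
        p_part = "p = " + f"{p_value:.3f}".lstrip("0")
    return f"t({df:.2f}) = {t_value:.3f}, {p_part}"

print(apa_report(4.466, 36.98, 0.0000412))  # → t(36.98) = 4.466, p < .001
```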
Next...
The next step is to determine an effect size measure. The Vargha-Delaney A, a Rosenthal correlation, or a (Glass) rank-biserial correlation (Cliff's delta) could be suitable for this.
Alternatives
Alternatives for testing stochastic equivalence:
- Mann-Whitney U. Chung and Romano (2011, p. 5) note that it fails to control the type I error rate.
- \(C^2\) test, which is an improvement on the Brunner-Munzel test.
- Cliff's delta, which according to Delaney and Vargha (2002) performs similarly to the Brunner-Munzel test.
If you only want to test whether the medians are equal:
- Mann-Whitney U, assuming the distributions have the same shape
- Fligner-Policello, assuming the distributions are symmetric around the median, and continuous data
- Mood's median test, although according to Schlag (2015) this actually tests quantiles and can lead to over-rejection
- Schlag's test, but this only accepts or rejects; it gives no p-value