Biserial Correlation Coefficient
Explanation
This is an extension of the point-biserial correlation coefficient, for the case where the two categories come from an underlying (latent) normally distributed scale. This happens when scores are first categorized and then compared with other numeric scores (e.g. grades are categorized into pass/fail, and the pass/fail result is then correlated with age).
There is a warning, though: if one of the two categories has a very small sample size compared to the other, this coefficient will not be very accurate (Soper, 1914, p. 390; Jacobs & Viechtbauer, 2017, p. 165). Soper (1914, p. 390) warns against using it if one category makes up 4% or less of the combined sample size. The website ChangingMinds gives 10% as the limit (ChangingMinds, n.d.), unfortunately without a source.
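As a rough illustration of this warning, the share of the smaller category can be checked before computing the coefficient. This is only a sketch: the function names and the default threshold argument are hypothetical, the 4% default follows Soper's guideline quoted above, and it assumes both categories actually occur in the data.

```python
from collections import Counter


def smallest_category_share(categories):
    """Return the proportion of cases in the least frequent category."""
    counts = Counter(categories)
    return min(counts.values()) / sum(counts.values())


def is_too_unbalanced(categories, threshold=0.04):
    """True if the smaller category is at or below the threshold share.

    The default of 0.04 mirrors Soper's (1914) 4% guideline; pass 0.10
    to use the stricter 10% limit mentioned on ChangingMinds.
    """
    return smallest_category_share(categories) <= threshold


# Hypothetical example: 3 fails against 97 passes.
labels = ["pass"] * 97 + ["fail"] * 3
print(smallest_category_share(labels), is_too_unbalanced(labels))
```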
Obtaining the Measure
with Python
Jupyter Notebook: ES - Biserial Correlation (P).ipynb
with stikpetP
To Be Made
without stikpetP
To Be Made
Manually (using Formula)
The formula (Tate, 1955a, p. 1087):
$$r_b = \frac{p \times q \times \left(\bar{x}_1 - \bar{x}_0\right)}{\sigma_x \times p_{z_p}}$$
With:
$$p = \frac{n_1}{n}, q = \frac{n_0}{n}$$
$$\bar{x}_0 = \frac{\sum_{i=1}^{n_0} x_{i,0}}{n_0}$$
$$\bar{x}_1 = \frac{\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
$$\bar{x} = \frac{\sum_{j=0}^{1} \sum_{i=1}^{n_j} x_{i,j}}{n}$$
$$\sigma_x = \sqrt{\frac{SS}{n}}$$
$$SS = \sum_{j=0}^{1} \sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}\right)^2$$
$$z_p = \Phi^{-1}\left(p\right)$$
$$p_{z_p} = \phi\left(z_p\right)$$
Symbols used:
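- \(n_0\), the sample size of the first category (coded 0)
- \(n_1\), the sample size of the second category (coded 1)
- \(n\), the total sample size, i.e. \(n = n_0 + n_1\)
- \(x_{i,j}\), the \(i\)-th score in category \(j\)
- \(\Phi^{-1}\left(\dots\right)\), the inverse of the standard normal cumulative distribution function
- \(\phi\left(\dots\right)\), the standard normal probability density function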
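The formula can be sketched in plain Python using only the standard library. This is a minimal illustration, not the stikpetP implementation: the function name `biserial_r` and the example data are hypothetical, and the categories are assumed to be already coded as 0 and 1, with both codes present.

```python
from math import sqrt
from statistics import NormalDist, fmean


def biserial_r(categories, scores):
    """Biserial correlation for 0/1 coded categories and numeric scores."""
    x0 = [x for c, x in zip(categories, scores) if c == 0]
    x1 = [x for c, x in zip(categories, scores) if c == 1]
    n0, n1 = len(x0), len(x1)
    n = n0 + n1
    p, q = n1 / n, n0 / n

    # Population standard deviation of all scores (SS divided by n).
    mean_all = fmean(x0 + x1)
    ss = sum((x - mean_all) ** 2 for x in x0 + x1)
    sigma_x = sqrt(ss / n)

    # Height of the standard normal density at the z-value that cuts off p.
    std_normal = NormalDist()
    z_p = std_normal.inv_cdf(p)
    height = std_normal.pdf(z_p)

    return p * q * (fmean(x1) - fmean(x0)) / (sigma_x * height)


# Hypothetical example data: 0 = fail, 1 = pass, scores = age.
passed = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
age = [12, 15, 11, 14, 16, 13, 18, 17, 15, 19]
print(round(biserial_r(passed, age), 4))
```

Unlike a Pearson correlation, the result is not guaranteed to stay within -1 and +1, so values slightly outside that range can occur with strongly separated groups.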
The oldest formula I could find is from Pearson (1909, p. 97), which, slightly rewritten, is:
$$r_b = \frac{\frac{\bar{x}_1 - \bar{x}}{\sigma_x}}{\frac{p_{z_p}}{p}}$$
Since dividing by a fraction is the same as multiplying by its reciprocal, Soper (1914, p. 384) has:
$$r_b = \frac{\bar{x}_1 - \bar{x}}{\sigma_x} \times \frac{p}{p_{z_p}}$$
If the categories are coded as binary values (0 and 1), Tate (1955a, p. 1079; 1955b, p. 207) uses the covariance between these codes and the scores:
$$r_b = \frac{\sigma_{bx}}{\sigma_x \times p_{z_p}}$$
This is not too surprising, since it can be shown that \(\sigma_{bx} = p \times q \times \left(\bar{x}_1 - \bar{x}_0\right)\).
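A short derivation of this identity, assuming the binary variable \(b\) is coded 1 for category 1 and 0 for category 0, and that the covariance is the population version (dividing by \(n\)). Since \(\sum b_i x_i = n_1 \bar{x}_1\) and \(\bar{b} = p\):

$$\sigma_{bx} = \frac{\sum_{i=1}^{n} b_i x_i}{n} - \bar{b} \times \bar{x} = p \times \bar{x}_1 - p \times \bar{x}$$

The overall mean is the weighted mean of the two category means, \(\bar{x} = p \times \bar{x}_1 + q \times \bar{x}_0\), and \(1 - p = q\), so:

$$\sigma_{bx} = p \times \left(\bar{x}_1 - p \times \bar{x}_1 - q \times \bar{x}_0\right) = p \times \left(q \times \bar{x}_1 - q \times \bar{x}_0\right) = p \times q \times \left(\bar{x}_1 - \bar{x}_0\right)$$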
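Tate (1955a, p. 1087; 1955b, p. 207) also shows a conversion using the point-biserial correlation coefficient (\(r_{pb}\)):
$$r_b = r_{pb} \times \frac{\sigma_b}{p_{z_p}}$$
Here \(\sigma_b = \sqrt{p \times q}\) is the (population) standard deviation of the binary values.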
Note that all of these should give the same result.
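That the variants agree can be checked numerically. The sketch below uses hypothetical example data and population (divide by \(n\)) versions of the standard deviation and covariance; the dictionary labels are only descriptive names for the formulas in this section.

```python
from math import isclose, sqrt
from statistics import NormalDist, fmean

# Hypothetical example data: b is the 0/1 category code, x the numeric score.
b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
x = [12, 15, 11, 14, 16, 13, 18, 17, 15, 19]

n = len(x)
x0 = [xi for bi, xi in zip(b, x) if bi == 0]
x1 = [xi for bi, xi in zip(b, x) if bi == 1]
p, q = len(x1) / n, len(x0) / n
mean_x, mean_b = fmean(x), fmean(b)

# Population standard deviations and covariance (dividing by n).
sigma_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
sigma_b = sqrt(p * q)
cov_bx = sum((bi - mean_b) * (xi - mean_x) for bi, xi in zip(b, x)) / n

# Height of the standard normal density at z_p.
std_normal = NormalDist()
height = std_normal.pdf(std_normal.inv_cdf(p))

# Point-biserial correlation is the Pearson correlation of b and x.
r_pb = cov_bx / (sigma_b * sigma_x)

variants = {
    "Tate (1955a), means": p * q * (fmean(x1) - fmean(x0)) / (sigma_x * height),
    "Pearson (1909)": ((fmean(x1) - mean_x) / sigma_x) / (height / p),
    "Soper (1914)": (fmean(x1) - mean_x) / sigma_x * p / height,
    "Tate, covariance": cov_bx / (sigma_x * height),
    "from point-biserial": r_pb * sigma_b / height,
}
for name, value in variants.items():
    print(f"{name}: {value:.6f}")
```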
Interpretation
As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero would indicate there is no (linear) relation, while a -1 would mean a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
With two categories we could read this as: if the correlation is positive, then the higher a score is, the more likely it is to come from a category 1 case rather than a category 0 case.
Cohen (1988, p. 82) provided the following rules of thumb for the biserial correlation:
\(\lvert r_b \rvert\) | Interpretation |
---|---|
0.00 to < 0.125 | Negligible |
0.125 to < 0.304 | Small |
0.304 to < 0.465 | Medium |
0.465 or more | Large |

Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., p. 82) by J. Cohen, 1988, L. Erlbaum Associates.
Note, however, that as with any rule of thumb for effect sizes and correlations, there are those who frown upon using these and recommend talking to experts in the field to find out what might be considered small, medium, or large.
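With that caveat in mind, Cohen's thresholds could be applied in Python along these lines. The function name `cohen_label` is hypothetical; the cut-offs are the ones from the table, applied to the absolute value of the coefficient.

```python
def cohen_label(r_b):
    """Classify a biserial correlation using Cohen's (1988) rules of thumb."""
    size = abs(r_b)  # the sign only gives the direction, not the strength
    if size < 0.125:
        return "negligible"
    if size < 0.304:
        return "small"
    if size < 0.465:
        return "medium"
    return "large"


print(cohen_label(-0.35))  # medium
```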