Point Biserial Correlation Coefficient
Explanation
The point-biserial correlation can be seen as coding the two groups of a binary variable as 0 and 1, and then calculating a (Pearson) correlation coefficient between those values and the scores (Tate, 1954, p. 603).
Note that if the two categories come from a so-called latent normally distributed variable, the biserial correlation might be better. This is the case if scores were categorized and then compared to some other numeric scores (e.g. grades categorized into pass/fail, with this pass/fail then correlated with age).
Obtaining the Measure
with Excel
Excel file: ES - Point Biserial (E).xlsm
with stikpetE
To Be Made
without stikpetE
To Be Made
with Python
Jupyter Notebook: ES - Point Biserial (P).ipynb
with stikpetP
To Be Made
without stikpetP
To Be Made
with R (Studio)
Jupyter Notebook: ES - Point Biserial (R).ipynb
with stikpetR
To Be Made
without stikpetR
To Be Made
Manually (using Formula)
The formula (Tate, 1955, p. 1081):
$$r_{pb} = \frac{\bar{x}_2 - \bar{x}_1}{\sigma_x} \times \sqrt{p \times q}$$
With:
$$p = \frac{n_1}{n}, q = \frac{n_2}{n}$$
$$\bar{x}_1 = \frac{\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$
$$\bar{x}_2 = \frac{\sum_{i=1}^{n_2} x_{i,2}}{n_2}$$
$$\bar{x} = \frac{\sum_{j=1}^{2} \sum_{i=1}^{n_j} x_{i,j}}{n}$$
$$\sigma_x = \sqrt{\frac{SS}{n}}$$
$$SS = \sum_{j=1}^{2} \sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}\right)^2$$
Symbols used:
- \(n_1\), the sample size of the first category
- \(n_2\), the sample size of the second category
- \(n\), the total sample size, i.e. \(n = n_1 + n_2\)
- \(x_{i,j}\) is the \(i\)-th score in category \(j\)
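As a minimal sketch, the formula above could be worked out in plain Python as follows. The function name `point_biserial` and the example scores are made up for illustration and do not come from any of the cited sources.

```python
import math

def point_biserial(scores_1, scores_2):
    """Point-biserial correlation from the scores of two categories.

    Uses r_pb = (mean_2 - mean_1) / sigma_x * sqrt(p * q),
    with sigma_x the population standard deviation of all scores.
    """
    n1, n2 = len(scores_1), len(scores_2)
    n = n1 + n2
    p, q = n1 / n, n2 / n

    mean_1 = sum(scores_1) / n1
    mean_2 = sum(scores_2) / n2

    # overall mean and sum of squares over all scores
    all_scores = list(scores_1) + list(scores_2)
    mean_all = sum(all_scores) / n
    ss = sum((x - mean_all) ** 2 for x in all_scores)
    sigma_x = math.sqrt(ss / n)

    return (mean_2 - mean_1) / sigma_x * math.sqrt(p * q)

# hypothetical example data
group_1 = [2, 4, 6]
group_2 = [5, 7, 9]
r_pb = point_biserial(group_1, group_2)
print(round(r_pb, 4))
```

The sign depends on which category is labelled as the second one: swapping the two groups flips the sign of the result.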
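The oldest formula I could find is from Soper (1914, p. 384), which, somewhat rewritten, is:
$$r_{pb} = \frac{\bar{x}_2 - \bar{x}}{\sigma_x} \times \frac{\sqrt{p \times q}}{p}$$
Here \(\bar{x}\) is the mean of all scores. With \(p = \frac{n_1}{n}\) as defined above, the division is by \(p\) because \(\bar{x}_2 - \bar{x} = p \times \left(\bar{x}_2 - \bar{x}_1\right)\).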
Tate also gave another formula (Tate, 1954, p. 606):
$$r_{pb} = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{SS}} \times \sqrt{\frac{n_1 \times n_2}{n}}$$
Friedman (1968, p. 245) uses the degrees of freedom and test-statistic from the Student t-test for independent samples:
$$r_{pb} = \sqrt{\frac{t^2}{t^2 + df}}$$
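A rough sketch of this t-based route in plain Python is shown below. It assumes the pooled-variance (equal variances) Student t-test with \(df = n_1 + n_2 - 2\). The function name `r_pb_from_t` and the example scores are hypothetical.

```python
import math

def r_pb_from_t(scores_1, scores_2):
    """Absolute point-biserial correlation via the Student t statistic."""
    n1, n2 = len(scores_1), len(scores_2)
    mean_1 = sum(scores_1) / n1
    mean_2 = sum(scores_2) / n2

    # pooled variance from the within-group sums of squares
    ss_within = (sum((x - mean_1) ** 2 for x in scores_1)
                 + sum((x - mean_2) ** 2 for x in scores_2))
    df = n1 + n2 - 2
    pooled_var = ss_within / df

    # Student t statistic for two independent samples
    t = (mean_2 - mean_1) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

    return math.sqrt(t ** 2 / (t ** 2 + df))

# hypothetical example data
r_abs = r_pb_from_t([2, 4, 6], [5, 7, 9])
print(round(r_abs, 4))
```

Because \(t\) is squared, this version cannot return a negative value. The direction of the relation has to be read from the sign of \(t\) or from the difference between the two means.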
As mentioned in the introduction, it can also be calculated by converting the categories to binary values, and then determining the Pearson product-moment correlation coefficient between these binary values and the scores.
Note that all of these should give the same result, except that the formula based on the t-statistic only returns the absolute value.
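The binary-coding approach can be sketched in plain Python as follows. The helper `pearson`, the category labels and the scores are all made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# hypothetical example data: a category label and a score per case
categories = ["a", "a", "a", "b", "b", "b"]
scores = [2, 4, 6, 5, 7, 9]

# code the first category as 0 and the second as 1
binary = [0 if c == "a" else 1 for c in categories]

r_pb = pearson(binary, scores)
print(round(r_pb, 4))
```

Which category is coded as 1 only affects the sign of the result. If SciPy is available, `scipy.stats.pointbiserialr` should return the same coefficient for this kind of input.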
Interpretation
As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero would indicate there is no (linear) relation, while a -1 would mean a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).
With two categories we could read this more as: if the scores go up and there is a positive correlation, it is more likely that a score came from a category 1 case rather than a category 0 case.
Cohen (1988, p. 82) provided the following rules of thumb for the point-biserial correlation:
| Absolute value of \(r_{pb}\) | Interpretation |
|---|---|
| 0.000 to < 0.100 | Negligible |
| 0.100 to < 0.243 | Small |
| 0.243 to < 0.371 | Medium |
| 0.371 or more | Large |

Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., p. 82) by J. Cohen, 1988, L. Erlbaum Associates.
Note, however, that as with any rule of thumb for effect sizes and correlations, there are those who frown upon using these and recommend talking to experts in the field to find out what might be considered small, medium, or large.
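If these rules of thumb are used anyway, the thresholds from the table above could be applied with a small helper like the one below. This is only a sketch. The function name `interpret_rpb` is hypothetical, and the cut-off values are the ones listed above.

```python
def interpret_rpb(r_pb):
    """Label a point-biserial correlation using the thresholds listed above."""
    size = abs(r_pb)  # the rules of thumb apply to the absolute value
    if size < 0.100:
        return "negligible"
    if size < 0.243:
        return "small"
    if size < 0.371:
        return "medium"
    return "large"

print(interpret_rpb(-0.30))
```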