Point Biserial Correlation Coefficient

Explanation

The point-biserial correlation coefficient can be seen as coding the two groups of a binary variable as 0 and 1, and then calculating the (Pearson) correlation coefficient between those values and the scores (Tate, 1954, p. 603).
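As a minimal sketch of this idea, the Python snippet below codes two groups as 0 and 1 and then computes a Pearson correlation with the scores. The data, the category labels "a" and "b", and the helper name `pearson_r` are made up for illustration and do not come from the cited sources.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical example data: one category label and one score per case.
categories = ["a", "a", "a", "b", "b", "b", "b"]
scores = [2, 4, 6, 5, 7, 9, 11]

# Code the first category as 0 and the second as 1.
codes = [0 if c == "a" else 1 for c in categories]

r_pb = pearson_r(codes, scores)
print(f"point-biserial r = {r_pb:.4f}")
```

Which category is coded as 1 only affects the sign of the result, not its size.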

Note that if the two categories come from a so-called latent normally distributed variable, the biserial correlation might be the better choice. This is the case if scores were first categorized and then compared to some other numeric scores (e.g. grades being categorized into pass/fail, and then using this pass/fail variable to correlate with age).

Obtaining the Measure

with Excel

Excel file: ES - Point Biserial (E).xlsm

with stikpetE

To Be Made

without stikpetE

To Be Made

with Python

Jupyter Notebook: ES - Point Biserial (P).ipynb

with stikpetP

To Be Made

without stikpetP

To Be Made

with R (Studio)

Jupyter Notebook: ES - Point Biserial (R).ipynb

with stikpetR

To Be Made

without stikpetR

To Be Made

Manually (using Formula)

The formula (Tate, 1955, p. 1081):

$$r_{pb} = \frac{\bar{x}_2 - \bar{x}_1}{\sigma_x} \times \sqrt{p \times q}$$

With:

$$p = \frac{n_1}{n}, q = \frac{n_2}{n}$$

$$\bar{x}_1 = \frac{\sum_{i=1}^{n_1} x_{i,1}}{n_1}$$

$$\bar{x}_2 = \frac{\sum_{i=1}^{n_2} x_{i,2}}{n_2}$$

$$\bar{x} = \frac{\sum_{j=1}^{2} \sum_{i=1}^{n_j} x_{i,j}}{n}$$

$$\sigma_x = \sqrt{\frac{SS}{n}}$$

$$SS = \sum_{j=1}^{2} \sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}\right)^2$$

Symbols used:

$n_1$, the sample size of the first category
$n_2$, the sample size of the second category
$n$, the total sample size, i.e. $n = n_1 + n_2$
$x_{i,j}$ is the $i$-th score in category $j$
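The formula can be sketched directly in Python. This is an illustrative implementation under the definitions above (population standard deviation, i.e. dividing SS by n); the function name `point_biserial` and the example scores are assumptions for the sake of the example.

```python
from math import sqrt


def point_biserial(group1, group2):
    """r_pb = (mean2 - mean1) / sigma_x * sqrt(p * q), with sigma_x = sqrt(SS / n)."""
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    p, q = n1 / n, n2 / n

    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2

    # SS uses the overall mean of all scores, not the group means.
    all_scores = list(group1) + list(group2)
    mean_all = sum(all_scores) / n
    ss = sum((x - mean_all) ** 2 for x in all_scores)
    sigma_x = sqrt(ss / n)

    return (mean2 - mean1) / sigma_x * sqrt(p * q)


# Hypothetical scores for the first and second category.
first = [2, 4, 6]
second = [5, 7, 9, 11]
print(f"r_pb = {point_biserial(first, second):.4f}")
```

Swapping the two groups flips the sign, since the numerator is the mean of the second category minus the mean of the first.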

The oldest formula I could find is from Soper (1914, p. 384), which, somewhat rewritten in terms of the overall mean $\bar{x}$ and with $p$ and $q$ as defined above, is:

$$r_{pb} = \frac{\bar{x}_2 - \bar{x}}{\sigma_x} \times \frac{\sqrt{p \times q}}{p}$$

Tate also gave another formula (Tate, 1954, p. 606):

$$r_{pb} = \frac{\bar{x}_2 - \bar{x}_1}{\sqrt{SS}} \times \sqrt{\frac{n_1 \times n_2}{n}}$$

Friedman (1968, p. 245) uses the degrees of freedom and test-statistic from the Student t-test for independent samples:

$$r_{pb} = \sqrt{\frac{t^2}{t^2 + df}}$$

As mentioned in the explanation, it can also be calculated by converting the categories to binary values, and then determining the Pearson product-moment correlation coefficient between these binary values and the scores.

Note that all of these should give the same result, except that the formula based on the t-statistic only returns the absolute value because of the square root.
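A quick numerical check of the t-based route can be sketched as follows. The snippet assumes the classic Student t-test with a pooled variance and df = n1 + n2 - 2, and compares the outcome with the formula based on the two means. The function names and the example scores are hypothetical.

```python
from math import sqrt


def r_pb_from_t(group1, group2):
    """Size of r_pb from Student's t (pooled variance): sqrt(t^2 / (t^2 + df))."""
    n1, n2 = len(group1), len(group2)
    mean1 = sum(group1) / n1
    mean2 = sum(group2) / n2
    ss1 = sum((x - mean1) ** 2 for x in group1)
    ss2 = sum((x - mean2) ** 2 for x in group2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (mean2 - mean1) / sqrt(pooled_var * (1 / n1 + 1 / n2))
    return sqrt(t ** 2 / (t ** 2 + df))


def r_pb_from_means(group1, group2):
    """r_pb = (mean2 - mean1) / sqrt(SS / n) * sqrt(p * q)."""
    n1, n2 = len(group1), len(group2)
    n = n1 + n2
    scores = list(group1) + list(group2)
    mean_all = sum(scores) / n
    ss = sum((x - mean_all) ** 2 for x in scores)
    diff = sum(group2) / n2 - sum(group1) / n1
    return diff / sqrt(ss / n) * sqrt((n1 / n) * (n2 / n))


# Hypothetical scores for the first and second category.
first = [2, 4, 6]
second = [5, 7, 9, 11]
via_t = r_pb_from_t(first, second)
via_means = r_pb_from_means(first, second)
print(f"via t: {via_t:.4f}, via means: {via_means:.4f}")
```

The t-based value is never negative, so it should be compared with the absolute value of the other result when the second category has the lower mean.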

Interpretation

As the name implies, a correlation coefficient indicates how two variables co-relate, i.e. if one goes up, is the other likely to go up or down. A zero would indicate there is no (linear) relation, while a -1 would mean a perfect negative correlation (if one goes up, the other goes down, and vice versa), and a +1 a perfect positive correlation (if one goes up, the other also goes up, and vice versa).

With two categories this can be read as follows: if there is a positive correlation, a higher score is more likely to come from a category 1 case than from a category 0 case.

Cohen (1988, p. 82) provided the following rules of thumb for the point-biserial correlation:

Table 1
Rule of thumb for point-biserial correlation
$|r_{pb}|$	Interpretation
0.000 to < 0.100	Negligible
0.100 to < 0.243	Small
0.243 to < 0.371	Medium
0.371 or more	Large
Note: Adapted from Statistical power analysis for the behavioral sciences (2nd ed., p. 82) by J. Cohen, 1988, L. Erlbaum Associates.
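The thresholds from Table 1 could be applied in code along these lines. This is only a sketch: the function name `interpret_r_pb` is made up, and the labels simply mirror the table.

```python
def interpret_r_pb(r_pb):
    """Classify a point-biserial correlation using the thresholds from Table 1."""
    size = abs(r_pb)  # the rule of thumb ignores the sign
    if size < 0.100:
        return "negligible"
    if size < 0.243:
        return "small"
    if size < 0.371:
        return "medium"
    return "large"


for value in (0.05, -0.20, 0.30, -0.70):
    print(value, interpret_r_pb(value))
```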

Note, however, that as with any rule of thumb for effect sizes and correlations, there are those who frown upon using these and recommend talking to experts in the field to find out what might be considered small, medium, or large.
