Tetrachoric Correlation
Explanation
In essence, the tetrachoric correlation attempts to mimic a correlation coefficient between two scale variables. It can be defined as "An estimate of the correlation between two random variables having a bivariate normal distribution, obtained from the information from a double dichotomy of their bivariate distribution" (Everitt, 2004, p. 372).
This assumes that each of the two binary variables has a ‘hidden’ underlying normal distribution. If so, the two together form a bivariate normal distribution with a specific correlation between them. The quest is then to find the correlation for which the bivariate normal cumulative distribution function, evaluated at the z-values of the two marginal proportions that contain the top-left cell (a), equals the observed proportion in that cell.
This is quite tricky to do, so several authors have proposed approximations. These include Yule r, Pearson Q4 and Q5, Camp, Becker and Clogg, and Bonett and Price. These are available in the more general binary-binary association section.
Besides closed-form approximation formulas, various algorithms have been designed as well. The three most often mentioned are Brown (1977), Kirk (1973), and Divgi (1979).
Obtaining the Measure
with Excel
to be done
with Flowgorithm
The Kirk-algorithm Flowgorithm file: FL-EStetrachoricKirk.fprg.
The Brown-algorithm Flowgorithm file: FL-EStetrachoricBrown.fprg.
with Python
to be done
with R (studio)
to be done
with SPSS
to be done
Formula
Given a 2x2 table as shown in table 1:
| | Column 1 | Column 2 | Total |
|---|---|---|---|
| Row 1 | \(a\) | \(b\) | \(R_1 = a + b\) |
| Row 2 | \(c\) | \(d\) | \(R_2 = c + d\) |
| Total | \(C_1 = a + c\) | \(C_2 = b + d\) | \(n = a + b + c + d\) |
The tetrachoric correlation coefficient \(r_t\) is the value for which:
\(\frac{a}{n}=BN\left(z_1, z_2, r_t\right)\)
With:
\(z_1 = \Phi^{-1}\left(\frac{C_1}{n}\right), z_2 = \Phi^{-1}\left(\frac{R_1}{n}\right)\)
Symbols used:
- \(BN\left(\dots\right)\) is the cumulative distribution function of the standard bivariate normal distribution.
- \(\Phi^{-1}\) is the inverse of the standard normal cumulative distribution function.
The above formula can be found in Pearson (1900, p. 13) and, in more modern notation, in Long et al. (2009, p. 430).
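The defining equation can also be solved numerically. The sketch below is only an illustration and is not one of the published algorithms (Brown, Kirk, or Divgi). It assumes that all four marginal totals are non-zero, evaluates \(BN\) with the identity \(BN(h, k, \rho) = \Phi(h)\Phi(k) + \int_0^{\rho} \varphi_2(h, k, t)\,dt\) (where \(\varphi_2\) is the standard bivariate normal density) using Simpson's rule, and then finds \(r_t\) by bisection. The function names `bvn_cdf` and `tetrachoric` are hypothetical, chosen for this example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def bvn_cdf(h, k, rho, steps=400):
    """Approximate P(X <= h, Y <= k) for a standard bivariate normal
    distribution with correlation rho (requires |rho| < 1)."""

    def density(t):
        # standard bivariate normal density at (h, k) with correlation t
        s = 1.0 - t * t
        exponent = -(h * h - 2.0 * t * h * k + k * k) / (2.0 * s)
        return math.exp(exponent) / (2.0 * math.pi * math.sqrt(s))

    # composite Simpson's rule on [0, rho]; steps must be even,
    # and a negative rho simply gives a negative (signed) integral
    width = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * width)
    integral = total * width / 3.0

    return STD_NORMAL.cdf(h) * STD_NORMAL.cdf(k) + integral


def tetrachoric(a, b, c, d, tol=1e-8):
    """Solve a/n = BN(z1, z2, r_t) for r_t by bisection.
    Assumes every row and column total is greater than zero."""
    n = a + b + c + d
    z1 = STD_NORMAL.inv_cdf((a + c) / n)  # from column total C1
    z2 = STD_NORMAL.inv_cdf((a + b) / n)  # from row total R1
    target = a / n

    # BN is increasing in the correlation, so bisection applies;
    # the search interval stops short of -1 and +1 on purpose
    lo, hi = -0.999, 0.999
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if bvn_cdf(z1, z2, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    # a = 20, b = 10, c = 10, d = 20: both marginal splits are 50/50
    print(round(tetrachoric(20, 10, 10, 20), 4))
```

For the table a = 20, b = 10, c = 10, d = 20, both z-values are 0, and in that special case \(BN(0, 0, r) = \frac{1}{4} + \frac{\arcsin(r)}{2\pi}\). With \(a/n = 1/3\) this gives \(r_t = 0.5\), which the sketch should reproduce closely. Because the search interval is cut off at ±0.999, tables with an empty or near-empty cell are not handled well by this simple approach.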
Algorithms for this have been developed by Brown (1977), Kirk (1973), and Divgi (1979).
The Brown and Kirk algorithms were also converted to Flowgorithm files, in case you would like to see the flowcharts of these.
For approximations of the coefficient see the binary-binary effect sizes page.
Interpretation
The tetrachoric correlation coefficient can range from -1 to 1. A value of -1 indicates a perfect negative correlation, and a value of +1 a perfect positive correlation. Unfortunately, I've been unable to find specific rules of thumb for in-between values, but some authors, e.g. Faul et al. (2009, p. 1150), simply use the rules of thumb for a 'regular' Pearson correlation.
Alternatives
Besides the tetrachoric correlation there are many other effect size measures that could be used with two binary variables. These have been collected on the binary-binary effect sizes page.