Measures of Similarity / Association
Introduction
Many effect sizes have been suggested over the years for two binary variables. Cohen w, Cohen h and the tetrachoric correlations are discussed separately. Most often these measures are referred to as either a measure of similarity or a measure of association. I tried to group the others as much as possible, assuming we have a cross-table as in Table 1.
 | Column 1 | Column 2 | Total |
---|---|---|---|
Row 1 | \(a\) | \(b\) | \(R_1 = a + b\) |
Row 2 | \(c\) | \(d\) | \(R_2 = c + d\) |
Total | \(C_1 = a + c\) | \(C_2 = b + d\) | \(n = R_1 + R_2\) |
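The code sketches further below all use this notation. As a minimal Python setup (the cell counts here are hypothetical):

```python
# Hypothetical 2x2 cross-table counts, following the layout of Table 1.
a, b, c, d = 3, 15, 8, 16

R1, R2 = a + b, c + d   # row totals
C1, C2 = a + c, b + d   # column totals
n = R1 + R2             # grand total

print(R1, R2, C1, C2, n)  # 18 24 11 31 42
```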
We then have various measures that focus on the top-left cell (\(a\)):
- Russell-Rao (Russell & Rao, 1940)
- Dice-1 (Dice, 1945, p. 302)
- Dice-2 (Dice, 1945, p. 302)
- Braun-Blanquet (Braun-Blanquet, 1932)
- Simpson Similarity (Simpson, 1943, p. 20, 1960, p. 301)
- Kulczynski-1 (Kulczynski, 1927)
- Jaccard = Tanimoto (Jaccard, 1901, 1912, p. 39; Tanimoto, 1958, p. 5)
- Sokal-Sneath-1 = Anderberg (Sokal & Sneath, 1963, p. 129)
- Gleasson = Dice-3 = Nei-Li = Czekanowski (Gleason, 1920, p. 31; Dice, 1945, p. 302; Nei & Li, 1979, p. 5270)
- Mountford (Mountford, 1962, p. 45)
- Driver-Kroeber = Ochiai-1 = Otsuka (Driver & Kroeber, 1932, p. 219; Ochiai, 1957)
- Sorgenfrei (Sorgenfrei, 1958)
- Johnson (Johnson, 1967)
- Kulczynski-2 = Driver-Kroeber-2 (Kulczynski, 1927; Driver & Kroeber, 1932, p. 219)
- Fager-McGowan-1 (Fager & McGowan, 1963, p. 454)
- Fager-McGowan-2 (Fager & McGowan, 1963, p. 454)
- tarantula (Jones & Harrold, 2005)
- Ample
- Gilbert (Gilbert, 1884, p. 171)
- Fossum-Kaskey (Fossum & Kaskey, 1966, p. 65)
- Forbes-1 (Forbes, 1907, p. 279)
- Eyraud (Eyraud, 1936)
Some focus on the top-left and bottom-right cells (\(a, d\)):
- Sokal-Michener (Matching Coefficient) (Sokal & Michener, 1958, p. 1417)
- Faith (Faith, 1983, p. 290)
- Sokal-Sneath-5 (Sokal & Sneath, 1963, p. 129)
- Rogers-Tanimoto (Rogers & Tanimoto, 1960)
- Sokal-Sneath-2 = Gower-Legendre (Sokal & Sneath, 1963, p. 129; Gower & Legendre, 1986)
- Gower
- Sokal-Sneath-4 = Ochiai-2 (Sokal & Sneath, 1963, p. 130; Ochiai, 1957)
- Rogot-Goldberg (Rogot & Goldberg, 1966, p. 997)
- Sokal-Sneath-3 (Sokal & Sneath, 1963, p. 130)
- Hawkin-Dotson (Hawkins & Dotson, 1975, pp. 372–373)
- Clement (Clement, 1976, p. 258)
- Harris-Lahey (Harris & Lahey, 1978, p. 526)
- Austin-Colwell (Austin & Colwell, 1977, p. 205)
- Baroni-Urbani-Buser-1 (Baroni-Urbani & Buser, 1976, p. 258)
Some focus on (\(ad - bc\)):
- Peirce-1 (Peirce, 1884, p. 453)
- Peirce-2 (Peirce, 1884, p. 453)
- Cole C1 (Cole, 1949, p. 415)
- Loevinger = Forbes 2 (Loevinger, 1947, p. 30)
- Cole C7 (Coefficient of Interspecific Association) (Cole, 1949, p. 420)
- Dennis (Dennis, 1965, p. 69)
- (Pearson/Yule) Phi Coefficient / Cole C2 (Pearson, 1900a, p. 12)
- Doolittle (Doolittle, 1885, p. 123)
- Peirce-3 (Choi et al., 2010, p. 45)
- Cohen-kappa (Cohen, 1960, p. 40)
- McEwen-Michael Coefficient / Cole C3 (Michael, 1920, p. 57)
- Kuder-Richardson (Kuder & Richardson, 1937)
- Scott (Scott, 1955, p. 324)
- Maxwell-Pilliner (Maxwell & Pilliner, 1968)
- Cole C5 (Cole, 1949, p. 416)
- Hamann (Hamann, 1961)
- Fleiss (Fleiss, 1975, p. 656)
Others have a format of \(\frac{x-y}{x+y}\):
- Yule Q = Cole C4 = Pearson Q2 (Yule, 1900, p. 272)
- Yule Y (Yule, 1912, p. 592)
- Digby H (Digby, 1983, p. 754)
- Edward Q (Edwards, 1957)
- Tarwid (Tarwid, 1960, p. 117)
- Bonett-Price Y* (Bonett & Price, 2007, p. 433)
Some use the \(\chi^2\) statistic:
- Contingency coefficient (Pearson, 1904, p. 9)
- Cohen w (Cohen, 1988, p. 216)
- Pearson (Choi et al., 2010, p. 45; K. Pearson, 1904)
- Hurlbert / Cole C8 (Hurlbert, 1969, p. 1)
- Stiles (Stiles, 1961, p. 272)
and a few others:
- McConnaughey (McConnaughey, 1964)
- Baroni-Urbani-Buser-2 (Baroni-Urbani & Buser, 1976, p. 258)
- Kent-Foster-1 (Kent & Foster, 1977, p. 311)
- Kent-Foster-2 (Kent & Foster, 1977, p. 311)
- Tulloss (Tulloss, 1997, p. 133)
- Gilbert-Wells (Gilbert & Wells, 1966)
- Yule r / Pearson Q3 / Cole C6 / Pearson-Heron (Yule, 1900, p. 276)
- Anderberg (Anderberg, 1973)
- Alroy F (Alroy, 2015)
- Pearson Q1 (Pearson, 1900a, p. 15)
- Goodman-Kruskal Lambda-1 (Goodman & Kruskal, 1954, p. 743)
- Goodman-Kruskal Lambda-2 (Warrens, 2008, p. 220)
- Odds Ratio (Fisher, 1935, p. 50)
- Pearson Q4 (Pearson, 1900a, p. 16)
- Pearson Q5 (Pearson, 1900a, p. 16)
- Camp (3 ver.) (Camp, 1934, p. 309)
- Becker-Clogg-1 (Becker & Clogg, 1988, pp. 410–412)
- Becker-Clogg-2 (Becker & Clogg, 1988, pp. 410–412)
- Bonett-Price-2 (Bonett & Price, 2005, p. 216)
- Bonett-Price-3 (Bonett & Price, 2005, p. 216)
- Chen-Popovich (Chen & Popovich, 2002, p. 37)
Some measures further explained
If a test with a chi-square distribution was used, an obvious candidate for a measure of effect size is the test statistic itself, the \(\chi^2\). One of the earliest and most often mentioned measures uses this: the phi coefficient (or mean square contingency). Both Yule (1900) and Pearson (1900) mention this measure, and Cole (1949) refers to it as Cole C2. Interestingly, it gives the same result as assigning a 0 and 1 to the two categories of each variable and calculating the regular correlation coefficient.
This measure is also sometimes used for larger tables, but then the range of values it can take depends on the size of the table. To overcome this, Pearson (1904) proposed an alternative: the contingency coefficient. This ranges between 0 and 1, but the actual maximum still depends on the size of the table.
Cole (1949) noted that for a 2x2 table the maximum would be \(\sqrt{1/2}\), while we would prefer a correlation-like measure with a maximum of 1. Cole C5 achieves this by simply taking the contingency coefficient and dividing it by \(\sqrt{1/2}\).
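As a quick sketch of how these chi-square-based measures relate, in Python (hypothetical counts; the cell-based chi-square formula used here is the one given with Cohen w further below):

```python
import math

def chi2_from_cells(a, b, c, d):
    """Pearson chi-square for a 2x2 table, written directly in cell counts."""
    n = a + b + c + d
    R1, R2, C1, C2 = a + b, c + d, a + c, b + d
    return n * (a * d - b * c) ** 2 / (R1 * R2 * C1 * C2)

a, b, c, d = 3, 15, 8, 16
n = a + b + c + d

chi2 = chi2_from_cells(a, b, c, d)
phi = math.sqrt(chi2 / n)            # phi coefficient (unsigned)
C = math.sqrt(chi2 / (n + chi2))     # Pearson contingency coefficient
C5 = C / math.sqrt(1 / 2)            # Cole C5: contingency coefficient rescaled
```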
Cohen w (1988), Cole C8 (= Hurlbert) (1969) and Stiles (1961) also use the chi-square statistic.
Another approach comes from realizing that if there is no association, each cell count equals its expected count. For the top-left cell this means:
\(a=\frac{\left(a+b\right)\times\left(a+c\right)}{n}\)
The Forbes Coefficient (Forbes, 1907) uses this. It has a value of 1 if there is no association, and a value of 0 or 2 when there is a perfect one.
To adjust this to the more traditional range of -1 to 1, Cole C1 simply subtracts one from the Forbes coefficient.
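A minimal Python sketch of these two (notation from Table 1):

```python
def forbes(a, b, c, d):
    """Forbes coefficient: observed a relative to its expected value under no association."""
    n = a + b + c + d
    return n * a / ((a + b) * (a + c))

def cole_c1(a, b, c, d):
    """Cole C1: the Forbes coefficient minus one."""
    return forbes(a, b, c, d) - 1
```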
Another range of measures employs the Odds Ratio. Edwards (1963) argued that a measure of association for a 2x2 table should be some function of the cross-ratio \(ad/bc\), i.e. the Odds Ratio.
Yule Q and Yule Y do exactly that. They are both of the format of:
\(\frac{OR^x-1}{OR^x+1}\)
Yule Q actually looks at the difference between the number of pairs in agreement and those in disagreement, and divides this by the total possible number of pairs.
More details on Yule's Y
If we have a 2 by 2 table and one of the diagonals is zero, there is a perfect association. For example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 5 | 0 |
Row 2 | 0 | 5 |
In contrast, if all the values were the same, there wouldn't be any association:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 5 | 5 |
Row 2 | 5 | 5 |
What about a table where the values within each diagonal are the same? For example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 5 |
Row 2 | 5 | 3 |
In this case, we can split the table into two tables whose sum adds up to the original. The split is made in such a way that one table has a perfect association, while the other has no association at all.
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 0 | 2 |
Row 2 | 2 | 0 |

 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 3 |
Row 2 | 3 | 3 |
The sum of all counts in the perfect association part is 0+2+2+0=4, while for the no association part we get 3+3+3+3=12. Overall, we have 4/(4+12) = 4/16 = 25% of perfect association.
Notice that for a table with equal values within each diagonal, we can use the following formula:
\(\frac{a-b}{a+b}\)
If we apply this to the first example table, we get a value of 1, in the second a value of 0, and in the third a value of -0.25. This, however, only works if the values within each diagonal are the same.
Unfortunately, most tables don't have equal values within their diagonals. However, we can transform any table into a symmetric table with the same Odds Ratio. This can be done by setting:
\(a = d = \sqrt{OR}, b = c = 1\)
Let's have a look at an example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 15 |
Row 2 | 8 | 16 |
We first determine the Odds Ratio of this table:
\(OR = \frac{a\times d}{b\times c} = \frac{3\times 16}{15\times 8} = \frac{48}{120} = \frac{2}{5}\)
We now use the suggestion \(a = d = \sqrt{OR}, b = c = 1\) to create a table with the same Odds Ratio, but with equal values within each diagonal:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | \(\sqrt{\frac{2}{5}}\) | 1 |
Row 2 | 1 | \(\sqrt{\frac{2}{5}}\) |
If you like, you can double-check that the Odds Ratio of this symmetric table is still \(\frac{2}{5}\).
Since the table has the same values within each diagonal, we can apply our formula from earlier:
\(\frac{a-b}{a+b} = \frac{\sqrt{\frac{2}{5}}-1}{\sqrt{\frac{2}{5}}+1}\approx -0.225\)
So about 23% of a perfect association. Yule's Y does all these steps for us in one go, and will then of course produce the same result:
\(Y = \frac{\sqrt{a\times d}-\sqrt{b\times c}}{\sqrt{a\times d}+\sqrt{b\times c}} \approx -0.225\)
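The steps above are easy to verify numerically. A small Python sketch using the same example table:

```python
import math

a, b, c, d = 3, 15, 8, 16

OR = (a * d) / (b * c)   # 48/120 = 2/5
root = math.sqrt(OR)     # the symmetric table has a = d = sqrt(OR), b = c = 1

step_by_step = (root - 1) / (root + 1)
direct_y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

print(round(step_by_step, 3), round(direct_y, 3))  # -0.225 -0.225
```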
Unfortunately, if any of the four cells is 0, Y will always be 1 or -1. Similarly, if one or two cells have very high counts and the others very few, the result can be close to 1 or -1, even though there is almost no association. This is the same problem as for Yule's Q, and the reason why Michael and McEwen developed their variation on it.
In the \(\frac{OR^x-1}{OR^x+1}\) format shown earlier, Yule Q uses \(x=1\) and Yule Y uses \(x=0.5\). Digby (1983, p. 754) showed that Yule's Q consistently overestimates the association, while Yule's Y underestimates it. A better approximation might therefore use a power somewhere between 0.5 and 1 on the Odds Ratio. Digby's H found the best result at 0.75, while Edwards (1957, as cited in Becker & Clogg, 1988, p. 409) had proposed π/4 (approx. 0.79).
Bonett and Price Y* uses a function to determine what the power should be (Bonett & Price, 2007).
A problem with all of the Forbes- and Odds-Ratio-based measures is that if only one cell is very large compared to the others, or if one cell is 0, the association will come out quite strong (close to -1 or 1).
Michael (1920, p. 55), working together with McEwen, tried to overcome this with the 'McEwen and Michael coefficient'. Cole, however, criticized Michael somewhat on this point: although at first one might not consider the example table a strong association, with so little data there are actually very few other arrangements with the same marginal totals that would have yielded a stronger association. He states: “with any given series of collections containing two species the possible number of tables yielding different values for the number of joint occurrences is exactly one more than the smallest of the four marginal totals” (Cole, 1949, p. 417).
Cole C7 attempts to overcome this problem. Cole suggested dividing the deviation from no association by the maximum deviation possible given the marginal totals, and called this the coefficient of interspecific association.
Another category is the tetrachoric correlations. These are quite tricky to compute exactly, so several approximations have been proposed. Becker and Clogg discuss the relation between the Odds Ratio and the tetrachoric correlation. They start by exploring Yule's Q and generalize it to:
\(Q\left(x\right) = \frac{OR^x - 1}{OR^x + 1}\)
So that Yule’s Q is Q(1), Edward’s Q is Q(π/4), and Digby’s H is Q(3/4). They then go on to come up with their own more complicated method to calculate an optimal value for x.
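Since the whole family is one function of the Odds Ratio, a single Python sketch covers Yule's Q, Yule's Y, Digby's H and Edwards' Q (Becker and Clogg's own optimal-x step is not included here):

```python
import math

def q_x(a, b, c, d, x):
    """Generalized Yule coefficient: Q(x) = (OR^x - 1) / (OR^x + 1)."""
    OR = (a * d) / (b * c)
    return (OR ** x - 1) / (OR ** x + 1)

a, b, c, d = 3, 15, 8, 16
yule_q = q_x(a, b, c, d, 1)               # Yule's Q
yule_y = q_x(a, b, c, d, 0.5)             # Yule's Y
digby_h = q_x(a, b, c, d, 0.75)           # Digby's H
edwards_q = q_x(a, b, c, d, math.pi / 4)  # Edwards' Q
```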
Bonett and Price continued this work and also refer to Pearson's four methods (Pearson Q1, Pearson Q2 (= Yule's Q), Pearson Q3 and Pearson Q4), Walker and Lev, Edwards, Digby, Lord and Novick, and the two from Becker and Clogg. They then derive two methods of their own, which they show to be more accurate.
The Walker and Lev method they refer to is the same as Pearson’s Q3.
Besides closed-form approximation formulas, various algorithms have been designed as well. See the separate tetrachoric correlations page for more details.
How to calculate each of the measures
with Flowgorithm
The Flowgorithm versions listed below are all included in one file, except for Brown's approximation and Kirk's approximation:
Flowgorithm file: FL-ESbinBinAssociation.fprg.
See the sections on Brown and Kirk for the links to their approximation Flowgorithm files.
Becker and Clogg (1988)
Bonett and Price (2005)
Camp (1934)
Cole (1949)
Cole C1:
Cole C2, see Yule's Phi (= Pearson's Phi) (1900)
Cole C3, see McEwen and Michael (1920)
Cole C4, see Yule's Q (= Pearson Q 2) (1900)
Cole C5:
Cole C6, see Pearson Q 3 (1900)
Cole C7:
Digby (1983)
Divgi (1979)
Edwards (1957)
Forbes (1907)
McEwen and Michael (1920)
same as Cole C3:
Pearson (1900)
Pearson Q1:
Pearson Q2, see Yule's Q (1900)
Pearson Q3 (= Yule's r, Cole's C6):
Pearson Q4:
Pearson Q5:
Yule's Q and Phi (1900)
Yule's Q (= Pearson's Q2, Cole's C4):
Yule's Phi = Pearson's Phi:
Yule's r see Pearson Q3 (1900)
Yule's Y (1912)
with Formulas
Alroy's Forbes Adjustment
Alroy adjusts the Forbes coefficient by setting (Alroy, 2015):
\(F' = \frac{a\times\left(n' + \sqrt{n'}\right)}{a\times\left(n' + \sqrt{n'}\right) + \frac{3}{2}\times b\times c}\)
With:
\(n' = a + b + c\)
Alroy refers to the Forbes coefficient as a measure of similarity, and sets out to improve the measure by disregarding the bottom-right cell (\(d\)).
Anderberg
Equation 70 found in Choi et al. (2010):
\(\frac{\sigma - \sigma'}{2\times n}\)
With:
\(\sigma = \max{\left(a, b\right)} + \max{\left(c, d\right)} + \max{\left(a, c\right)} + \max{\left(b, d\right)}\)
\(\sigma' = \max{\left(R_1, R_2\right)} + \max{\left(C_1, C_2\right)}\)
Supposedly originally from Anderberg (1973).
Austin-Colwell Similarity
Austin and Colwell (1977, p. 205):
\(S = \frac{2}{\pi} \text{arcsin} \sqrt{\frac{a + d}{n}}\)
Equation 21 found in Hubálek (1982)
Baroni-Urbani-Buser Similarity
Baroni-Urbani and Buser S (1976, p. 258):
\(S_{**} = \frac{\sqrt{a\times d} + a}{\sqrt{a\times d} + a + b + c}\)
\(S_{*} = \frac{a - b - c + \sqrt{a\times d}}{a + b + c + \sqrt{a\times d}}\)
Equations 71 and 72 from Choi et al. (2010), equations 32 and 33 in Hubálek (1982) and equations 38a and 38b from Warrens (2008)
Becker and Clogg (rtet)
Becker and Clogg (1988, pp. 410-412)
\( \rho^* = \frac{g-1}{g+1} \)
\( \rho^{**} = \frac{OR^{13.3/\Delta} - 1}{OR^{13.3/\Delta} + 1} \)
with:
\(g=e^{12.4\times\phi - 24.6\times\phi^3}\)
\(\phi = \frac{\ln\left(OR\right)}{\Delta}\)
\(OR=\frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)} = \frac{a\times d}{b\times c}\)
\(\Delta = \left(\mu_{R1} - \mu_{R2}\right) \times \left(v_{C1} - v_{C2}\right)\)
\(\mu_{R1} = \frac{-e^{-\frac{t_r^2}{2}}}{p_{R1}}, \mu_{R2} = \frac{e^{-\frac{t_r^2}{2}}}{p_{R2}} \)
\(v_{C1} = \frac{-e^{-\frac{t_c^2}{2}}}{p_{C1}}, v_{C2} = \frac{e^{-\frac{t_c^2}{2}}}{p_{C2}} \)
\(t_r = \Phi^{-1}\left(p_{R1}\right), t_c = \Phi^{-1}\left(p_{C1}\right)\)
\(p_{x} = \frac{x}{n}\)
\(\Phi^{-1}\left(x\right)\) is the inverse standard normal cumulative distribution function
\(OR\) is the Odds Ratio
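A Python sketch that transcribes these formulas (the inverse standard normal CDF is taken from scipy; the counts are hypothetical):

```python
import math
from scipy.stats import norm

def becker_clogg(a, b, c, d):
    """Becker-Clogg rho* and rho**, transcribed from the formulas above."""
    n = a + b + c + d
    p_r1, p_r2 = (a + b) / n, (c + d) / n      # row proportions
    p_c1, p_c2 = (a + c) / n, (b + d) / n      # column proportions
    t_r, t_c = norm.ppf(p_r1), norm.ppf(p_c1)  # normal deviates of the margins
    mu_r1, mu_r2 = -math.exp(-t_r**2 / 2) / p_r1, math.exp(-t_r**2 / 2) / p_r2
    v_c1, v_c2 = -math.exp(-t_c**2 / 2) / p_c1, math.exp(-t_c**2 / 2) / p_c2
    delta = (mu_r1 - mu_r2) * (v_c1 - v_c2)
    OR = (a * d) / (b * c)
    phi = math.log(OR) / delta
    g = math.exp(12.4 * phi - 24.6 * phi**3)
    rho_star = (g - 1) / (g + 1)
    rho_star_star = (OR ** (13.3 / delta) - 1) / (OR ** (13.3 / delta) + 1)
    return rho_star, rho_star_star
```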
Bonett and Price Y*
Bonett and Price (2007, pp. 433-434)
\(Y^* = \frac{\hat{\omega}^x-1}{\hat{\omega}^x+1}\)
With:
\(x = \frac{1}{2}-\left(\frac{1}{2}-p_{min}\right)^2\)
\(p_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)}{n}\)
\(\hat{\omega} = \frac{\left(a+0.1\right)\times\left(d+0.1\right)}{\left(b+0.1\right)\times\left(c+0.1\right)}\)
Note that \(\hat{\omega}\) is a bias-corrected version of the Odds Ratio.
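A direct Python transcription of the above (hypothetical counts):

```python
def bonett_price_y_star(a, b, c, d):
    """Bonett-Price Y*: a Q(x)-type coefficient with a data-driven exponent
    and 0.1 added to each cell as a bias correction."""
    n = a + b + c + d
    p_min = min(a + b, c + d, a + c, b + d) / n
    x = 1 / 2 - (1 / 2 - p_min) ** 2
    omega = ((a + 0.1) * (d + 0.1)) / ((b + 0.1) * (c + 0.1))
    return (omega ** x - 1) / (omega ** x + 1)
```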
Bonett and Price ρ* (rtet)
Bonett and Price ρ* (2005, p. 216)
\(\rho^* = \cos\left(\frac{\pi}{1+\omega^c}\right)\)
With:
\(\omega = OR = \frac{a\times d}{b\times c}\)
\(c = \frac{1-\frac{\left|R_1-C_1\right|}{5\times n} - \left(\frac{1}{2}-p_{min}\right)^2}{2}\)
\(p_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)}{n}\)
Bonett and Price ρhat* (rtet)
Bonett and Price ρhat* (2005, p. 216)
\(\hat{\rho}^* = \cos\left(\frac{\pi}{1+\hat{\omega}^\hat{c}}\right)\)
With:
\(\hat{\omega} = \frac{\left(a+\frac{1}{2}\right)\times \left(d+\frac{1}{2}\right)}{\left(b+\frac{1}{2}\right)\times \left(c+\frac{1}{2}\right)}\)
\(\hat{c} = \frac{1-\frac{\left|R_1-C_1\right|}{5\times \left(n+2\right)} - \left(\frac{1}{2}-\hat{p}_{min}\right)^2}{2}\)
\(\hat{p}_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)+1}{n+2}\)
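A Python sketch of this sample version (hypothetical counts):

```python
import math

def bonett_price_rho_hat_star(a, b, c, d):
    """Bonett-Price tetrachoric approximation with 1/2 added to each cell."""
    n = a + b + c + d
    R1, C1 = a + b, a + c
    omega = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    p_min = (min(a + b, c + d, a + c, b + d) + 1) / (n + 2)
    c_hat = (1 - abs(R1 - C1) / (5 * (n + 2)) - (1 / 2 - p_min) ** 2) / 2
    return math.cos(math.pi / (1 + omega ** c_hat))
```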
Braun and Blanquet
Braun-Blanquet, supposedly from Braun-Blanquet (1932):
\(\frac{a}{\max{\left(R_1, C_1\right)}}\)
Equation 46 from Choi et al. (2010), equation 1 in Hubálek (1982) and equation 12 from Warrens (2008)
Camp (rtet)
Camp (1934, p. 309) describes the following steps for the calculation:
Step 1: If the total of column 1 (\(C_1\)) is less than that of column 2 (\(C_2\)), swap the two columns
Step 2: Calculate \(p = \frac{C_1}{n}\), \(p_1 = \frac{a}{n}\), and \(p_2 = \frac{c}{C_2}\)
Step 3: Determine \(z_1\) and \(z_2\) as the normal deviates corresponding to the areas \(p_1\) and \(p_2\) respectively (inverse standard normal cumulative distribution), i.e. \(z_1 = \Phi^{-1}\left(p_1\right), z_2 = \Phi^{-1}\left(p_2\right)\)
Step 4: Determine \(y\), the normal ordinate corresponding to \(p\) (the height of the normal distribution)
Step 5: Calculate \(m = \frac{p\times\left(1-p\right)\times\left(z_1 + z_2\right)}{y}\)
Step 6: Find phi in a table of phi values
Camp suggested for a very basic approximation to simply use \(\phi=1\).
For a better approximation Camp made the following table:
p | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|
phi | 0.637 | 0.63 | 0.62 | 0.60 | 0.56 |
Cureton (1968, p. 241) expanded on this table; the column headings give the hundredths to add to the \(p\) value in the first column:
p | +0.00 | +0.01 | +0.02 | +0.03 | +0.04 | +0.05 | +0.06 | +0.07 | +0.08 | +0.09 | +0.10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.637 | 0.636 | 0.636 | 0.635 | 0.635 | 0.634 | 0.634 | 0.633 | 0.633 | 0.632 | 0.631 |
0.6 | 0.631 | 0.631 | 0.630 | 0.629 | 0.628 | 0.627 | 0.626 | 0.625 | 0.624 | 0.622 | 0.621 |
0.7 | 0.621 | 0.620 | 0.618 | 0.616 | 0.614 | 0.612 | 0.610 | 0.608 | 0.606 | 0.603 | 0.600 |
0.8 | 0.600 | 0.597 | 0.594 | 0.591 | 0.587 | 0.583 | 0.579 | 0.574 | 0.569 | 0.564 | 0.559 |
Step 7: Calculate \(r_t = \frac{m}{\sqrt{1+\phi\times m^2}}\)
Cureton (1968) describes quite a few shortcomings with this approximation, and circumstances when it might be appropriate.
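A Python sketch that transcribes Camp's steps as described above; the table lookup of step 6 is replaced by a phi parameter (default 0.637, the value for p = 0.5), so this is only an approximation of the approximation:

```python
import math
from scipy.stats import norm

def camp_rtet(a, b, c, d, phi=0.637):
    """Camp's approximation of the tetrachoric correlation,
    following the steps described above."""
    # Step 1: make sure column 1 has the larger total
    if (a + c) < (b + d):
        a, b = b, a
        c, d = d, c
    n = a + b + c + d
    C1, C2 = a + c, b + d
    # Step 2: the proportions
    p, p1, p2 = C1 / n, a / n, c / C2
    # Step 3: normal deviates for p1 and p2
    z1, z2 = norm.ppf(p1), norm.ppf(p2)
    # Step 4: normal ordinate (density height) at the deviate for p
    y = norm.pdf(norm.ppf(p))
    # Step 5
    m = p * (1 - p) * (z1 + z2) / y
    # Steps 6-7: phi would normally come from Camp's or Cureton's table
    return m / math.sqrt(1 + phi * m * m)
```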
Chen and Popovich (rtet)
Chen and Popovich (2002, pp. 37-38):
\(\lambda_x = \Phi^{-1}\left(\frac{R_1}{n}\right)\)
\(\lambda_y = \Phi^{-1}\left(\frac{C_1}{n}\right)\)
Clement
Clement Inter-observer Agreement (1976, p. 258):
\(IAR = \frac{a \times R_2}{n \times R_1} + \frac{d \times R_1}{n \times R_2}\)
Equation 37 from Warrens (2008)
Cohen Kappa
Cohen Kappa (1960, p. 40):
\(\kappa = \frac{n \times P - Q}{n^2 - Q} = \frac{2\times\left(a\times d - b\times c \right)}{R_1\times C_2 + R_2\times C_1}\)
With:
\(P = \sum_{i=1}^{r} F_{i,i}\)
\(Q = \sum_{i=1}^{r} R_i\times C_i\)
Equation 24 from Warrens (2008)
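A one-line Python version of the 2x2 simplification above:

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table, via the simplification above."""
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (c + d) * (a + c))
```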
Cohen w
Cohen w (1988, p. 216):
\(w = \sqrt{\frac{\chi^2}{n}}\)
With:
\(\chi^2 = \frac{n\times\left(a\times d - b\times c\right)^2}{R_1\times R_2\times C_1\times C_2}\)
See Cohen w for more information and rules-of-thumb.
Cole C1
Cole's C1 (1949, p. 415) is the Forbes Coefficient adjusted to:
\( C_1=\frac{a\times d - b\times c}{\left(a+b\right)\times\left(a+c\right)} \)
Cole C5
Cole's C5 (1949, p. 416) is an adjustment of the Contingency Coefficient for 2x2 tables:
\( C_5 = \frac{\sqrt{2}\times\left(a\times d-b\times c\right)} {\sqrt{\left(a\times d-b\times c\right)^2 + R_1\times R_2\times C_1\times C_2}} \)
Cole C7
Cole's C7 (1949, pp. 420-421) is the measure Cole derived himself, calling it the 'coefficient of interspecific association':
\(\frac{a\times d - b\times c}{\left(a+b\right) \times \left(b + d\right)}\)
Contingency Coefficient
Contingency Coefficient (Pearson, 1904, p. 9):
\(C = \sqrt{\frac{\chi^2}{n + \chi^2}}\)
With:
\(\chi^2 = \frac{n\times\left(a\times d - b\times c\right)^2}{R_1\times R_2\times C_1\times C_2}\)
Dennis \(R^2\)
Dennis (1965, p. 69):
\(R^2 = \frac{a \times d - b \times c}{\sqrt{n \times R_1 \times C_1}}\)
Equation 44 from Choi et al. (2010)
Dice Coefficient of Association
Dice (1945, p. 302):
\(\text{Coefficient of Association} = \frac{a}{a + b}\)
\(\text{Coefficient of Association} = \frac{a}{a + c}\)
\(\text{coincidence index} = \frac{2 \times a}{2 \times a + b + c}\)
The Coincidence Index is the same as Gleason.
Equation 17a and 17b from Warrens (2008)
Digby H
Digby's H (1983, p. 754)
\(H = \frac{\left(a\times d\right)^{3/4} - \left(b\times c\right)^{3/4}}{\left(a\times d\right)^{3/4} + \left(b\times c\right)^{3/4}}\)
Doolittle i (inference-ratio)
Doolittle i (1885, p. 123):
\(\frac{\left(a \times d - b \times c\right)^2}{R_1 \times R_2 \times C_1 \times C_2}\)
Equation 31 in Hubálek (1982) and equation 2 from Warrens (2008)
Driver-Kroeber A (=Kulczynski v2) and G (= Ochiai-1 = Otsuka)
Driver and Kroeber (1932, p. 219):
\(G = \frac{a}{\sqrt{R_1 \times C_1}}\)
\(A = \frac{1}{2} \times \left(\frac{a}{R_1} + \frac{a}{C_1}\right)\)
G is listed as equations 31, 33 and 38 in Choi et al. (2010), who attribute equation 33 to Ochiai-I, referring to Ochiai (1957), and the same equation as 38 to Otsuka (1936); it is equation 11 in Hubálek (1982) and equation 13 from Warrens (2008)
A is listed as equations 41 and 42 from Choi et al. (2010), equations 7 and 8 in Hubálek (1982) and equation 11a from Warrens (2008). The original source is most likely Kulczynski (1927)
Edwards Q
Edwards Q (1957, p. ???) as a variation on Yule's Q:
\( Q_E = \frac{OR^{\pi/4} - 1}{OR^{\pi/4} + 1} \)
\(OR=\frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)} = \frac{a\times d}{b\times c}\)
Eyraud
Eyraud, supposedly from Eyraud (1936):
\(\frac{a - R_1 \times C_1}{R_1 \times R_2 \times C_1 \times C_2}\)
Equation 74 from Choi et al. (2010), but with an additional n factor in the denominator; equation 17 in Hubálek (1982) and equation 12 from Warrens (2008)
Faith C
Faith C (1983, p. 290):
\(C = \frac{a + \frac{1}{2}\times d}{n}\)
Equation 10 from Choi et al. (2010).
Fager-McGowan (Index of Affinity)
Fager-McGowan (Index of Affinity) (1963, p. 454) as interpreted by Warrens (2008):
\(\frac{a}{\sqrt{R_1 \times C_1}} - \frac{1}{2 \times \sqrt{\max{\left(R_1, C_1\right)}}}\)
as interpreted by Choi et al. (2010) and Hubálek (1982):
\(\frac{a}{\sqrt{R_1 \times C_1}} - \frac{\sqrt{\max{\left(R_1, C_1\right)}}}{2}\)
Equation 47 from Choi et al. (2010), equation 13 in Hubálek (1982) and equation 29 from Warrens (2008)
Fleiss M
Fleiss M (1975, p. 656):
\(M = \frac{\left(a \times d - b \times c\right) \times \left(R_1 \times C_2 + R_2 \times C_1\right)}{2 \times R_1 \times R_2 \times C_1 \times C_2}\)
Equation 36 from Warrens (2008)
Forbes Coefficient
Forbes Coefficient (1907, p. 279)
\( F=\frac{n\times a}{\left(a+b\right)\times\left(a+c\right)} \)
Cole (1949) later adjusted this to range from -1 to 1 and dubbed it C1.
Fossum-Kaskey Chi-Square
Fossum-Kaskey Chi-Square (1966, p. 65):
\(\chi_{FK}^2 = \frac{n \times \left(a - \frac{1}{2}\right)^2}{\left(a + b\right) \times \left(a + c\right)}\)
Equation 35 from Choi et al. (2010).
Gilbert i'
Gilbert i' (1884, p. 171):
\(i' = \frac{a \times n - R_1 \times C_1}{C_1 \times n + R_1 \times n - a \times n - R_1 \times C_1}\)
Equation 46 from Choi et al. (2010), equation 1 in Hubálek (1982) and equation 12 from Warrens (2008)
Gilbert-Wells
Gilbert-Wells, supposedly from Gilbert and Wells (1966):
\(\ln{\left(a\right)} - \ln{\left(n\right)} - \ln{\left(\frac{R_1}{n}\right)} - \ln{\left(\frac{C_1}{n}\right)}\)
Equation 55 from Choi et al. (2010), equation 38 in Hubálek (1982).
Gleason (= Nei-Li = Czekanowski = Dice 3)
Gleason (1920, p. 31):
\(\frac{2 \times a}{2 \times a + b + c}\)
Choi et al. (2010) have this labelled as Dice (eq. 2), Czekanowski (eq. 3) and Nei-Li (eq. 5); it is equation 5 in Hubálek (1982) and equation 9 from Warrens (2008)
Goodman-Kruskal Lambda
Goodman-Kruskal Lambda (1954, p. 743):
\(\frac{\sigma-\sigma'}{2n-\sigma'}\)
With:
\(\sigma = \max\left(a,b\right)+\max\left(c,d\right)+\max\left(a,c\right)+\max\left(b,d\right)\)
\(\sigma' = \max\left(R_1, R_2\right)+\max\left(C_1, C_2\right)\)
Warrens (2008, p. 220) however, describes a different version:
\(\frac{2\min\left(a,d\right)-b-c}{2\min\left(a,d\right)+b+c}\)
Equation 20 from Warrens (2008)
Gower
Gower, supposedly from Gower (1971):
\(\frac{a + d}{\sqrt{R_1 \times R_2 \times C_1 \times C_2}}\)
Equation 50 from Choi et al. (2010).
Hamann
Hamann, supposedly from Hamann (1961):
\(\frac{\left(a + d\right) - \left(b + c\right)}{n}\)
Equation 67 from Choi et al. (2010), equation 24 in Hubálek (1982) and equation 27 from Warrens (2008)
Harris-Lahey Weighted Agreement
Harris-Lahey Weighted Agreement (1978) is an adjustment to the Clement formula:
\(WA = \frac{a \times \left(R_2 + C_2\right)}{2 \times n \times \left(a + b + c\right)} + \frac{d \times \left(R_1 + C_1\right)}{2 \times n \times \left(b + c + d\right)}\)
Equation 40 from Warrens (2008), although Warrens omits the factor n in each denominator.
Hawkins-Dotson
Hawkins-Dotson (1975):
\(\frac{1}{2} \times \left(\frac{a}{a + b + c} + \frac{d}{b + c + d}\right)\)
Equation 34 from Warrens (2008)
Hurlbert (= Cole C8)
Hurlbert (1969, p. 3):
\(C_8 = \frac{ad-bc}{\left\lvert ad-bc\right\rvert}\sqrt{\frac{\chi^2-\chi_{min}^2}{\chi_{max}^2-\chi_{min}^2}}\)
With:
\(\chi_{max}^2 = \begin{cases} \frac{nR_1 C_2}{R_2 C_1} & \text{ if } ad \geq bc \\ \frac{nR_1 C_1}{R_2 C_2} & \text{ if } ad < bc \text{ and } a \leq d \\ \frac{nR_2 C_2}{R_1 C_1} & \text{ if } ad < bc \text{ and } a > d \end{cases}\)
\(\chi_{min}^2 = \frac{n^3\left(\hat{a} - g\left(\hat{a}\right)\right)^2}{R_1 R_2 C_1 C_2}\)
\(\hat{a}=\frac{R_1 C_1}{n}\)
\(g\left(\hat{a}\right) = \begin{cases} \lfloor \hat{a} \rfloor & \text{ if } ad < bc \\ \lceil \hat{a}\rceil & \text{ if } ad \geq bc \end{cases}\)
Note that Hurlbert showed that Cole’s C’s can be rewritten to use the chi-square value, and then introduced a new one and labelled it C8.
Equation 35 in Hubálek (1982).
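Since this is by far the most involved of the chi-square-based measures, a Python transcription of the formulas above may help (when \(ad = bc\) the sign factor is undefined; this sketch then simply returns 0):

```python
import math

def hurlbert_c8(a, b, c, d):
    """Hurlbert's C8, transcribed from the formulas above."""
    n = a + b + c + d
    R1, R2, C1, C2 = a + b, c + d, a + c, b + d
    D = a * d - b * c
    if D == 0:
        return 0.0
    chi2 = n * D ** 2 / (R1 * R2 * C1 * C2)
    # chi-square of the most extreme table with the same margins
    if D > 0:
        chi2_max = n * R1 * C2 / (R2 * C1)
    elif a <= d:
        chi2_max = n * R1 * C1 / (R2 * C2)
    else:
        chi2_max = n * R2 * C2 / (R1 * C1)
    # chi-square of the integer-cell table closest to independence
    a_hat = R1 * C1 / n
    g = math.floor(a_hat) if D < 0 else math.ceil(a_hat)
    chi2_min = n ** 3 * (a_hat - g) ** 2 / (R1 * R2 * C1 * C2)
    sign = 1 if D > 0 else -1
    return sign * math.sqrt((chi2 - chi2_min) / (chi2_max - chi2_min))
```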
Jaccard (= Tanimoto)
Jaccard Coefficient of Community (1901; 1912, p. 39) and Tanimoto (1958, p. 5):
\(\frac{a}{a + b + c}\)
Equations 1 and 65 from Choi et al. (2010), equation 4 in Hubálek (1982) and equation 6 from Warrens (2008)
Johnson
Johnson, supposedly from Johnson (1967):
\(\frac{a}{a + b} + \frac{a}{a + c}\)
Equation 43 from Choi et al. (2010), equation 9 in Hubálek (1982) and equation 33 from Warrens (2008)
Kent-Foster K
Kent and Foster (1977, p. 311):
\(K_{occ} = \frac{-b \times c}{b \times R_1 + c \times C_1 + b \times c}\)
\(K_{non occ} = \frac{-b \times c}{b \times R_2 + c \times C_2 + b \times c}\)
Equation 39a+b from Warrens (2008)
Kuder-Richardson
Kuder-Richardson, supposedly from Kuder and Richardson (1937):
\(\frac{4 \times \left(a \times d - b \times c\right)}{R_1 \times R_2 + C_1 \times C_2 + 2 \times \left(a \times d - b \times c\right)}\)
Equation 14 from Warrens (2008)
Kulczynski A
A first version would be:
\(\frac{a}{b + c}\)
Equation 64 from Choi et al. (2010), equation 3 in Hubálek (1982) and equation 11b from Warrens (2008)
A second version would be, supposedly from Kulczynski (1927):
\(A = \frac{1}{2} \times \left(\frac{a}{R_1} + \frac{a}{C_1}\right)\)
Equation 41 and 42 from Choi et al. (2010), equation 7 and 8 in Hubálek (1982) and equation 11a from Warrens (2008)
Loevinger H (= Forbes 2)
Forbes-2, supposedly from Forbes (1925), but also found in Loevinger (1947, p. 30):
\(\frac{a \times d - b \times c}{\min{\left(R_1 \times C_2, R_2 \times C_1\right)}}\)
Equation 48 from Choi et al. (2010), equation 42 in Hubálek (1982) and equation 18 from Warrens (2008)
Maxwell-Pilliner
Maxwell-Pilliner, supposedly from Maxwell and Pilliner (1968):
\(\frac{2 \times \left(a \times d - b \times c \right)}{\left(a + b\right) \times \left(c + d\right) + \left(a + c\right) \times \left(b + d\right)}\)
Equation 35 from Warrens (2008)
McConnaughey
McConnaughey, supposedly from McConnaughey (1964):
\(\frac{a^2 - b \times c}{R_1 \times C_1}\)
Equation 39 from Choi et al. (2010), equation 10 in Hubálek (1982) and equation 31 from Warrens (2008)
McEwen and Michael (= Cole C3)
Michael (1920, p. 55) worked together with McEwen and named the following 'McEwen and Michael coefficient':
\(\frac{a\times d - b\times c}{\left(\frac{a+ d}{2}\right)^2 + \left(\frac{b+ c}{2}\right)^2}= \frac{4\times\left(a\times d - b\times c\right)}{\left(a+d\right)^2+\left(b+c\right)^2}\)
Later Cole (1949, p. 415) rewrote this equation for his C3 into:
\(C_3 = \frac{4\times\left(a\times d - b\times c\right)}{\left(a+d\right)^2+\left(b+c\right)^2}\)
Equation 68 from Choi et al. (2010), equation 39 in Hubálek (1982) and equation 10 from Warrens (2008)
Mountford
Mountford (1962, p. 45):
\(\frac{2 \times a}{a \times \left(b + c\right) + 2 \times b \times c}\)
Equation 37 from Choi et al. (2010), equation 15 in Hubálek (1982) and equation 28 from Warrens (2008)
Pearson
A measure labelled ‘Pearson’ from Choi et al. (2010, p. 45):
\(\sqrt{\frac{\phi^2}{n + \phi^2}}\)
Equation 53 from Choi et al. (2010).
Pearson Phi Coefficient (= Yule Phi Coefficient = Cole C2)
The Pearson Phi Coefficient (Pearson, 1900, p. 12) is not a tetrachoric correlation, but rather the 'standard' correlation you would get by assigning a 0 and 1 to the two categories of each variable. Yule derived the same result (Yule, 1912, p. 596). The formula can be written as:
\(\phi = \frac{a\times d - b\times c}{\sqrt{R_1\times R_2\times C_1\times C_2}}\)
The formula can also be rewritten to use the Pearson chi-square statistic; Cole's C2 (1949) uses this:
\(C_2 = \sqrt{\frac{\chi^2}{n}}\)
Cohen (1988, p. 216) refers to this as 'w'. He uses the proportions in the cross table and gives the equation:
\(w = \sqrt{\sum_i\sum_j\frac{\left(p_{ij} - \pi_{ij}\right)^2}{\pi_{ij}}}\)
Where \(\pi_{ij}\) is the expected proportion for cell row i column j, and \(p_{ij}\) the sample proportion.
Cohen also provided some guidelines for the interpretation:
phi | Interpretation |
---|---|
0.00 ≤ phi < 0.10 | negligible |
0.10 ≤ phi < 0.30 | small |
0.30 ≤ phi < 0.50 | medium |
phi ≥ 0.50 | large |
Note: Adapted from Statistical power analysis for the behavioral sciences by J. Cohen, 1988, pp. 224-225. |
Equation 54 from Choi et al. (2010), equation 30 in Hubálek (1982) and equation 7 from Warrens (2008)
Pearson Q1
Pearson Q1 (Pearson 1900, p. 15)
\( Q_1 = \sin\left(\frac{\pi}{2} \times \frac{a\times d - b\times c}{\left(a+b\right)\times\left(b+d\right)}\right) \)
Note that Pearson (1900) stated: "Q1 was found of little service" (p. 16).
Pearson Q4
Pearson Q4 (1900, p. 16)
\( Q_4 = \sin\left(\frac{\pi}{2} \times \frac{1}{1 + \frac{2\times b\times c\times n}{\left(a\times d -b\times c\right) \times \left(b+c\right)}}\right) \)
Pearson Q5
Pearson Q5 (1900, p. 16)
\( Q_5 = \sin\left(\frac{\pi}{2} \times \frac{1}{\sqrt{1+ k^2}}\right) \)
with \( k^2 = \frac{4\times a\times b\times c\times d\times n^2}{\left(a\times d-b\times c\right)^2\times\left(a+d\right)\times\left(b+c\right)} \)
Peirce i
Peirce i (1884, p. 453):
\(i = \frac{a \times d - b \times c}{R_1 \times R_2}\)
Equation 1a from Warrens (2008)
a second version is also possible:
\(i = \frac{a \times d - b \times c}{C_1 \times C_2}\)
Equation 1b from Warrens (2008)
Equation 73 in Choi et al. (2010) is different; I was unable to find the original source for the variation they use:
\(i = \frac{a \times d - b \times c}{a \times b + 2 \times b \times c + c \times d}\)
Equation 73 from Choi et al. (2010), equation 16 in Hubálek (1982).
Rogers-Tanimoto Similarity Ratio
Rogers-Tanimoto Similarity Ratio (1960, p. 1117):
\(s = \frac{a + d}{a + 2\left(b + c\right) + d}\)
Equation 9 from Choi et al. (2010), equation 23 in Hubálek (1982) and equation 25 from Warrens (2008)
Rogot-Goldberg Index of Adjusted Agreement
Rogot-Goldberg Index of Adjusted Agreement (1966, p. 997):
\(s = \frac{a}{R_1 + C_1}+\frac{d}{R_2 + C_2}\)
Equation 32 from Warrens (2008)
Russell-Rao
Russell-Rao, supposedly from Russell and Rao (1940):
\(\frac{a}{n}\)
Equation 14 from Choi et al. (2010, p. 44), equation 14 in Hubálek (1982, p. 672) and equation 15 from Warrens (2008, p. 220)
Scott Index of Reliability
Scott Index of Reliability (1955, p. 323):
\(\pi = \frac{4 \times a \times d - \left(b + c\right)^2}{\left(R_1 + C_1\right) \times \left(R_2 + C_2\right)}\)
Equation 21 from Warrens (2008)
Sokal-Michener / Matching Coefficient
Sokal-Michener / Matching Coefficient (1958, p. 1417):
\(\frac{a + d}{n}\)
Equation 7 from Choi et al. (2010), equation 20 in Hubálek (1982) and equation 22 from Warrens (2008)
Simpson Index of Taxonomic Resemblance
Simpson Index of Taxonomic Resemblance (1943, p. 20; 1960, p. 301):
\(\frac{a}{\min{\left(R_1, C_1\right)}}\)
Equation 45 from Choi et al. (2010), equation 2 in Hubálek (1982) and equation 16 from Warrens (2008)
Sokal-Sneath S
Sokal-Sneath version 1 (1963, p. 129):
\(\frac{a}{a + 2 \times b + 2 \times c}\)
Ling labels equation 7 as ‘Anderberg’ and uses a formula that can be rewritten to the above version.
Equation 6 from Choi et al. (2010), equation 6 in Hubálek (1982) and equation 30a from Warrens (2008)
Sokal-Sneath version 2 (1963, p. 129):
\(\frac{2 \times a + 2 \times d}{2 \times a + b + c + 2 \times d}\)
Choi et al. and also Ling list a version attributed to Gower & Legendre (Gower & Legendre, 1986)
Equation 8 and 11 from Choi et al. (2010), equation 22 in Hubálek (1982) and equation 30b from Warrens (2008)
Sokal-Sneath version 3 (1963, p. 130):
\(\frac{1}{4} \times \left(\frac{a}{R_1} + \frac{a}{C_1} + \frac{d}{R_2} + \frac{d}{C_2}\right)\)
Equation 49 from Choi et al. (2010), equation 18 in Hubálek (1982) and equation 30c from Warrens (2008)
Sorgenfrei
Sorgenfrei, supposedly from Sorgenfrei (1958):
\(\frac{a^2}{R_1 \times C_1}\)
Equation 36 from Choi et al. (2010), equation 12 in Hubálek (1982) and equation 23 from Warrens (2008)
Stiles Association Factor
Stiles Association Factor (1961, p. 272) is the log of chi-square with Yates correction:
\(\log_{10}\left(\frac{n\left(\left\lvert a \times d - b \times c \right\rvert -\frac{n}{2}\right)^2}{R_1 \times R_2 \times C_1 \times C_2}\right)\)
Equation 59 from Choi et al. (2010) and equation 26 from Warrens (2008)
tarantula
tarantula, supposedly from Jones and Harrold (2005):
\(\frac{a \times R_2}{c \times R_1}\)
Equation 75 from Choi et al. (2010).
Tarwid
Tarwid (1960, p. 117):
\(\frac{n \times a - R_1 \times C_1}{n \times a + R_1 \times C_1}\)
Equation 40 from Choi et al. (2010), equation 43 in Hubálek (1982).
Tulloss Tripartite Similarity Index
Tulloss Tripartite Similarity Index (1997, p. 132):
\(T = \sqrt{U \times S\times R}\)
With:
\(U = \log_2\left(1 + \frac{\min\left(b, c\right) + a}{\max\left(b, c\right) + a}\right)\)
\(S = \frac{1}{\sqrt{\log_2\left(2 + \frac{\min\left(b, c\right)}{a + 1}\right)}}\)
\(R = \log_2\left(1 + \frac{a}{R_1}\right) \times \log_2\left(1 + \frac{a}{C_1}\right)\)
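A Python sketch of the three components and their combination (hypothetical counts):

```python
import math

def tulloss_t(a, b, c, d):
    """Tulloss' tripartite similarity index T = sqrt(U * S * R)."""
    R1, C1 = a + b, a + c
    U = math.log2(1 + (min(b, c) + a) / (max(b, c) + a))
    S = 1 / math.sqrt(math.log2(2 + min(b, c) / (a + 1)))
    R = math.log2(1 + a / R1) * math.log2(1 + a / C1)
    return math.sqrt(U * S * R)
```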
Yule Q (= Cole C4, and Pearson Q2)
Yule's Q (Yule, 1900, p. 272) can be calculated using:
\(Q = \frac{a\times d - b\times c}{a\times d + b\times c}\)
As for the interpretation of Q, there aren't many rules of thumb I could find. One I did find is from Glen (2017):
|Q| | Interpretation |
---|---|
0.00 ≤ |Q| < 0.30 | negligible |
0.30 ≤ |Q| < 0.50 | small |
0.50 ≤ |Q| < 0.70 | medium |
|Q| ≥ 0.70 | large |
Note: Adapted from Gamma Coefficient (Goodman and Kruskal's Gamma) & Yule's Q by S. Glen, 2017. |
Alternatively, Q can be converted to an Odds Ratio, and in turn some have proposed converting an Odds Ratio to Cohen's d. Cohen's d can then also be converted to a correlation coefficient (r), for which again there are different interpretation tables.
\(OR = \frac{1 + Q}{1 - Q}\)
Equation 61 from Choi et al. (2010), equation 36 in Hubálek (1982) and equation 3 from Warrens (2008)
Yule's r (= Pearson Q3 = Cole C6)
Yule proposed to convert his Q to a correlation coefficient using (Yule, 1900, p. 276):
\(r_Q = \cos\left(\frac{\sqrt{k}}{1+\sqrt{k}}\times\pi\right)\)
\(k = \frac{1-Q}{1+Q}\)
Pearson's Q3 (1900) will give the same result, although he used the sine function:
\(Q_3 = \sin\left(\frac{\pi}{2}\times\frac{\sqrt{a\times d} - \sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\right)\)
Cole (1949) rewrote this to:
\(C_6 = \cos\left(\frac{\pi\times\sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\right)\)
Equation 55 from Choi et al. (2010) and equation 38 in Hubálek (1982).
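The equivalence of the three formulations is easy to check numerically; a small Python sketch (hypothetical counts):

```python
import math

a, b, c, d = 3, 15, 8, 16

# Yule's r, via Q
Q = (a * d - b * c) / (a * d + b * c)
k = (1 - Q) / (1 + Q)
r_q = math.cos(math.sqrt(k) / (1 + math.sqrt(k)) * math.pi)

# Pearson's Q3, via the sine form
q3 = math.sin(math.pi / 2 * (math.sqrt(a * d) - math.sqrt(b * c))
              / (math.sqrt(a * d) + math.sqrt(b * c)))

# Cole's C6, via the cosine form
c6 = math.cos(math.pi * math.sqrt(b * c) / (math.sqrt(a * d) + math.sqrt(b * c)))

print(round(r_q, 6), round(q3, 6), round(c6, 6))  # three identical values
```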
Yule Y (coefficient of colligation)
Yule's Y (1912, p. 592) is a further adaptation of Yule's Q into:
\(Y = \frac{\sqrt{a\times d} - \sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\)
Yule referred to this as the coefficient of colligation.
The Odds Ratio can also be calculated from this using:
\(OR = \left(\frac{Y + 1}{1 - Y}\right)^2\)
Tables for the interpretation of the Odds Ratio can then be used.
Equation 63 from Choi et al. (2010), equation 37 in Hubálek (1982) and equation 8 from Warrens (2008)