Measures of Similarity / Association
Introduction
Many effect sizes have been suggested over the years for two binary variables. Cohen w, Cohen h and the tetrachoric correlations are discussed separately. Most often these measures are referred to as either a measure of similarity or a measure of association. I tried to group the others as much as possible, assuming we have a cross-table as in Table 1.
 | Column 1 | Column 2 | Total |
---|---|---|---|
Row 1 | \(a\) | \(b\) | \(R_1 = a + b\) |
Row 2 | \(c\) | \(d\) | \(R_2 = c + d\) |
Total | \(C_1 = a + c\) | \(C_2 = b + d\) | \(n = R_1 + R_2\) |
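The code sketches further below all use this notation. As a minimal Python setup (the cell counts here are hypothetical):

```python
# Hypothetical 2x2 cross-table counts, following the layout of Table 1.
a, b, c, d = 3, 15, 8, 16

R1, R2 = a + b, c + d   # row totals
C1, C2 = a + c, b + d   # column totals
n = R1 + R2             # grand total

print(R1, R2, C1, C2, n)  # 18 24 11 31 42
```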
We then have various measures that focus on the top-left cell (\(a\)):
- Russell-Rao (Russell & Rao, 1940)
- Dice-1 (Dice, 1945, p. 302)
- Dice-2 (Dice, 1945, p. 302)
- Braun-Blanquet (Braun-Blanquet, 1932)
- Simpson Similarity (Simpson, 1943, p. 20, 1960, p. 301)
- Kulczynski-1 (Kulczynski, 1927)
- Jaccard = Tanimoto (Jaccard, 1901, 1912, p. 39; Tanimoto, 1958, p. 5)
- Sokal-Sneath-1 = Anderberg (Sokal & Sneath, 1963, p. 129)
- Gleasson = Dice-3 = Nei-Li = Czekanowski (Gleason, 1920, p. 31; Dice, 1945, p. 302; Nei & Li, 1979, p. 5270)
- Mountford (Mountford, 1962, p. 45)
- Driver-Kroeber = Ochiai-1 = Otsuka (Driver & Kroeber, 1932, p. 219; Ochiai, 1957)
- Sorgenfrei (Sorgenfrei, 1958)
- Johnson (Johnson, 1967)
- Kulczynski-2 = Driver-Kroeber-2 (Kulczynski, 1927; Driver & Kroeber, 1932, p. 219)
- Fager-McGowan-1 (Fager & McGowan, 1963, p. 454)
- Fager-McGowan-2 (Fager & McGowan, 1963, p. 454)
- tarantula (Jones & Harrold, 2005)
- Ample
- Gilbert (Gilbert, 1884, p. 171)
- Fossum-Kaskey (Fossum & Kaskey, 1966, p. 65)
- Forbes-1 (Forbes, 1907, p. 279)
- Eyraud (Eyraud, 1936)
Some focus on the top-left and bottom-right cells (\(a, d\)):
- Sokal-Michener (Matching Coefficient) (Sokal & Michener, 1958, p. 1417)
- Faith (Faith, 1983, p. 290)
- Sokal-Sneath-5 (Sokal & Sneath, 1963, p. 129)
- Rogers-Tanimoto (Rogers & Tanimoto, 1960)
- Sokal-Sneath-2 = Gower-Legendre (Sokal & Sneath, 1963, p. 129; Gower & Legendre, 1986)
- Gower
- Sokal-Sneath-4 = Ochiai-2 (Sokal & Sneath, 1963, p. 130; Ochiai, 1957)
- Rogot-Goldberg (Rogot & Goldberg, 1966, p. 997)
- Sokal-Sneath-3 (Sokal & Sneath, 1963, p. 130)
- Hawkin-Dotson (Hawkins & Dotson, 1975, pp. 372–373)
- Clement (Clement, 1976, p. 258)
- Harris-Lahey (Harris & Lahey, 1978, p. 526)
- Austin-Colwell (Austin & Colwell, 1977, p. 205)
- Baroni-Urbani-Buser-1 (Baroni-Urbani & Buser, 1976, p. 258)
Some focus on (\(ad - bc\)):
- Peirce-1 (Peirce, 1884, p. 453)
- Peirce-2 (Peirce, 1884, p. 453)
- Cole C1 (Cole, 1949, p. 415)
- Loevinger = Forbes 2 (Loevinger, 1947, p. 30)
- Cole C7 (Coefficient of Interspecific Association) (Cole, 1949, p. 420)
- Dennis (Dennis, 1965, p. 69)
- (Pearson/Yule) Phi Coefficient / Cole C2 (Pearson, 1900a, p. 12)
- Doolittle (Doolittle, 1885, p. 123)
- Peirce-3 (Choi et al., 2010, p. 45)
- Cohen-kappa (Cohen, 1960, p. 40)
- McEwen-Michael Coefficient / Cole C3 (Michael, 1920, p. 57)
- Kuder-Richardson (Kuder & Richardson, 1937)
- Scott (Scott, 1955, p. 324)
- Maxwell-Pilliner (Maxwell & Pilliner, 1968)
- Cole C5 (Cole, 1949, p. 416)
- Hamann (Hamann, 1961)
- Fleiss (Fleiss, 1975, p. 656)
Others have a format of \(\frac{x-y}{x+y}\):
- Yule Q = Cole C4 = Pearson Q2 (Yule, 1900, p. 272)
- Yule Y (Yule, 1912, p. 592)
- Digby H (Digby, 1983, p. 754)
- Edward Q (Edwards, 1957)
- Tarwid (Tarwid, 1960, p. 117)
- Bonett-Price Y* (Bonett & Price, 2007, p. 433)
Some use the \(\chi^2\) statistic:
- Contingency coefficient (Pearson, 1904, p. 9)
- Cohen w (Cohen, 1988, p. 216)
- Pearson (Choi et al., 2010, p. 45; K. Pearson, 1904)
- Hurlbert / Cole C8 (Hurlbert, 1969, p. 1)
- Stiles (Stiles, 1961, p. 272)
and a few others:
- McConnaughey (McConnaughey, 1964)
- Baroni-Urbani-Buser-2 (Baroni-Urbani & Buser, 1976, p. 258)
- Kent-Foster-1 (Kent & Foster, 1977, p. 311)
- Kent-Foster-2 (Kent & Foster, 1977, p. 311)
- Tulloss (Tulloss, 1997, p. 133)
- Gilbert-Wells (Gilbert & Wells, 1966)
- Yule r / Pearson Q3 / Cole C6 / Pearson-Heron (Yule, 1900, p. 276)
- Anderberg (Anderberg, 1973)
- Alroy F (Alroy, 2015)
- Pearson Q1 (Pearson, 1900a, p. 15)
- Goodman-Kruskal Lambda-1 (Goodman & Kruskal, 1954, p. 743)
- Goodman-Kruskal Lambda-2 (Warrens, 2008, p. 220)
- Odds Ratio (Fisher, 1935, p. 50)
- Pearson Q4 (Pearson, 1900a, p. 16)
- Pearson Q5 (Pearson, 1900a, p. 16)
- Camp (3 ver.) (Camp, 1934, p. 309)
- Becker-Clogg-1 (Becker & Clogg, 1988, pp. 410–412)
- Becker-Clogg-2 (Becker & Clogg, 1988, pp. 410–412)
- Bonett-Price-2 (Bonett & Price, 2005, p. 216)
- Bonett-Price-3 (Bonett & Price, 2005, p. 216)
- Chen-Popovich (Chen & Popovich, 2002, p. 37)
Some measures further explained
If a test with a chi-square distribution was used, an obvious candidate for a measure of effect size is the test statistic itself, the \(\chi^2\). One of the earliest and most often mentioned measures uses this: the phi coefficient (or mean square contingency). Both Yule (1900) and Pearson (1900) mention this measure, and Cole (1949) refers to it as Cole C2. Interestingly, it gives the same result as assigning a 0 and 1 to the two categories of each variable and calculating the regular correlation coefficient.
This measure is also sometimes used for larger tables, but then the range of values it can take depends on the size of the table. To overcome this, Pearson (1904) proposed an alternative: the contingency coefficient. This ranges between 0 and 1, but the actual maximum still depends on the size of the table.
Cole (1949) noted that for a 2x2 table the maximum would be \(\sqrt{1/2}\), while we would prefer a correlation-like measure with a maximum of 1. Cole C5 achieves this by simply taking the contingency coefficient and dividing it by \(\sqrt{1/2}\).
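As a quick sketch of how these chi-square-based measures relate, in Python (hypothetical counts; the cell-based chi-square formula used here is the one given with Cohen w further below):

```python
import math

def chi2_from_cells(a, b, c, d):
    """Pearson chi-square for a 2x2 table, written directly in cell counts."""
    n = a + b + c + d
    R1, R2, C1, C2 = a + b, c + d, a + c, b + d
    return n * (a * d - b * c) ** 2 / (R1 * R2 * C1 * C2)

a, b, c, d = 3, 15, 8, 16
n = a + b + c + d

chi2 = chi2_from_cells(a, b, c, d)
phi = math.sqrt(chi2 / n)            # phi coefficient (unsigned)
C = math.sqrt(chi2 / (n + chi2))     # Pearson contingency coefficient
C5 = C / math.sqrt(1 / 2)            # Cole C5: contingency coefficient rescaled
```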
Cohen w (1988), Cole C8 (= Hurlbert) (1969) and Stiles (1961) also use the chi-square statistic.
Another approach comes from realizing that if there is no association, each cell count equals its expected count. For the top-left cell this means:
\(a=\frac{\left(a+b\right)\times\left(a+c\right)}{n}\)
The Forbes Coefficient (Forbes, 1907) uses this. It has a value of 1 if there is no association, and a value of 0 or 2 when there is a perfect one.
To adjust this to the more traditional range of -1 to 1, Cole C1 simply subtracts one from the Forbes coefficient.
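A minimal Python sketch of these two (notation from Table 1):

```python
def forbes(a, b, c, d):
    """Forbes coefficient: observed a relative to its expected value under no association."""
    n = a + b + c + d
    return n * a / ((a + b) * (a + c))

def cole_c1(a, b, c, d):
    """Cole C1: the Forbes coefficient minus one."""
    return forbes(a, b, c, d) - 1
```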
Another range of measures employs the Odds Ratio. Edwards (1963) argued that a measure of association for a 2x2 table should be some function of the cross-ratio \(ad/bc\), i.e. the Odds Ratio.
Yule Q and Yule Y do exactly that. They are both of the format of:
\(\frac{OR^x-1}{OR^x+1}\)
Yule Q actually looks at the difference between the number of pairs in agreement and those in disagreement, and divides this by the total possible number of pairs.
More details on Yule's Y
If we have a 2 by 2 table and one of the diagonals is zero, there is a perfect association. For example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 5 | 0 |
Row 2 | 0 | 5 |
In contrast, if all the values were the same, there wouldn't be any association:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 5 | 5 |
Row 2 | 5 | 5 |
What about a table where the values within each diagonal are the same? For example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 5 |
Row 2 | 5 | 3 |
In this case, we can split the table into two tables whose sum adds up to the original. The split is made in such a way that one table has a perfect association, while the other has no association at all.
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 0 | 2 |
Row 2 | 2 | 0 |

 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 3 |
Row 2 | 3 | 3 |
The sum of all counts in the perfect association part is 0+2+2+0=4, while for the no association part we get 3+3+3+3=12. Overall, we have 4/(4+12) = 4/16 = 25% of perfect association.
Notice that for a table with equal values within each diagonal, we can use the following formula:
\(\frac{a-b}{a+b}\)
If we apply this to the first example table, we get a value of 1, in the second a value of 0, and in the third a value of -0.25. This, however, only works if the values within each diagonal are the same.
Unfortunately, most tables don't have equal values within their diagonals. However, we can transform any table into a symmetric table with the same Odds Ratio. This can be done by setting:
\(a = d = \sqrt{OR}, b = c = 1\)
Let's have a look at an example:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | 3 | 15 |
Row 2 | 8 | 16 |
We first determine the Odds Ratio of this table:
\(OR = \frac{a\times d}{b\times c} = \frac{3\times 16}{15\times 8} = \frac{48}{120} = \frac{2}{5}\)
We now use the suggestion \(a = d = \sqrt{OR}, b = c = 1\) to create a table with the same Odds Ratio, but with equal values within each diagonal:
 | Column 1 | Column 2 |
---|---|---|
Row 1 | \(\sqrt{\frac{2}{5}}\) | 1 |
Row 2 | 1 | \(\sqrt{\frac{2}{5}}\) |
If you like, you can double-check that the Odds Ratio of this symmetric table is still \(\frac{2}{5}\).
Since the table has the same values within each diagonal, we can apply our formula from earlier:
\(\frac{a-b}{a+b} = \frac{\sqrt{\frac{2}{5}}-1}{\sqrt{\frac{2}{5}}+1}\approx -0.225\)
So about 23% of a perfect association. Yule's Y does all these steps for us in one go, and will then of course produce the same result:
\(Y = \frac{\sqrt{a\times d}-\sqrt{b\times c}}{\sqrt{a\times d}+\sqrt{b\times c}} \approx -0.225\)
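The steps above are easy to verify numerically. A small Python sketch using the same example table:

```python
import math

a, b, c, d = 3, 15, 8, 16

OR = (a * d) / (b * c)   # 48/120 = 2/5
root = math.sqrt(OR)     # the symmetric table has a = d = sqrt(OR), b = c = 1

step_by_step = (root - 1) / (root + 1)
direct_y = (math.sqrt(a * d) - math.sqrt(b * c)) / (math.sqrt(a * d) + math.sqrt(b * c))

print(round(step_by_step, 3), round(direct_y, 3))  # -0.225 -0.225
```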
Unfortunately, if any of the four cells is 0, Y will always be 1 or -1. Similarly, if one or two cells have very high counts and the others very few, the result can be close to 1 or -1, even though there is almost no association. This is the same problem as for Yule's Q, and the reason why Michael and McEwen developed their variation on it.
In the \(\frac{OR^x-1}{OR^x+1}\) format shown earlier, Yule Q uses \(x=1\) and Yule Y uses \(x=0.5\). Digby (1983, p. 754) showed that Yule's Q consistently overestimates the association, while Yule's Y underestimates it. A better approximation might therefore use a power somewhere between 0.5 and 1 on the Odds Ratio. Digby's H found the best result at 0.75, while Edwards (1957, as cited in Becker & Clogg, 1988, p. 409) had proposed π/4 (approx. 0.79).
Bonett and Price Y* uses a function to determine what the power should be (Bonett & Price, 2007).
A problem with all of the Forbes- and Odds-Ratio-based measures is that if only one cell is very large compared to the others, or if one cell is 0, the association will come out quite strong (close to -1 or 1).
Michael (1920, p. 55), working together with McEwen, tried to overcome this with the 'McEwen and Michael coefficient'. Cole, however, criticized Michael somewhat on this point: although at first one might not consider the example table a strong association, with so little data there are actually very few other arrangements with the same marginal totals that would have yielded a stronger association. He states: “with any given series of collections containing two species the possible number of tables yielding different values for the number of joint occurrences is exactly one more than the smallest of the four marginal totals” (Cole, 1949, p. 417).
Cole C7 attempts to overcome this problem. Cole suggested dividing the deviation from no association by the maximum deviation possible given the marginal totals, and called this the coefficient of interspecific association.
Another category is the tetrachoric correlations. These are quite tricky to compute exactly, so several approximations have been proposed. Becker and Clogg discuss the relation between the Odds Ratio and the tetrachoric correlation. They start by exploring Yule's Q and generalize it to:
\(Q\left(x\right) = \frac{OR^x - 1}{OR^x + 1}\)
So that Yule’s Q is Q(1), Edward’s Q is Q(π/4), and Digby’s H is Q(3/4). They then go on to come up with their own more complicated method to calculate an optimal value for x.
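Since the whole family is one function of the Odds Ratio, a single Python sketch covers Yule's Q, Yule's Y, Digby's H and Edwards' Q (Becker and Clogg's own optimal-x step is not included here):

```python
import math

def q_x(a, b, c, d, x):
    """Generalized Yule coefficient: Q(x) = (OR^x - 1) / (OR^x + 1)."""
    OR = (a * d) / (b * c)
    return (OR ** x - 1) / (OR ** x + 1)

a, b, c, d = 3, 15, 8, 16
yule_q = q_x(a, b, c, d, 1)               # Yule's Q
yule_y = q_x(a, b, c, d, 0.5)             # Yule's Y
digby_h = q_x(a, b, c, d, 0.75)           # Digby's H
edwards_q = q_x(a, b, c, d, math.pi / 4)  # Edwards' Q
```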
Bonett and Price continued this work and also refer to Pearson's four methods (Pearson Q1, Pearson Q2 (= Yule's Q), Pearson Q3 and Pearson Q4), Walker and Lev, Edwards, Digby, Lord and Novick, and the two from Becker and Clogg. They then derive two methods of their own, which they show to be more accurate.
The Walker and Lev method they refer to is the same as Pearson’s Q3.
Besides closed-form approximation formulas, various algorithms have been designed as well. See the separate tetrachoric correlations page for more details.
How to calculate each of the measures
with Flowgorithm
The Flowgorithm versions listed below are all included in one file, except for Brown's approximation and Kirk's approximation:
Flowgorithm file: FL-ESbinBinAssociation.fprg.
See the sections on Brown and Kirk for the links to their approximation Flowgorithm files.
Becker and Clogg (1988)
Bonett and Price (2005)
Camp (1934)
Cole (1949)
Cole C1:
Cole C2, see Yule's Phi (= Pearson's Phi) (1900)
Cole C3, see McEwen and Michael (1920)
Cole C4, see Yule's Q (= Pearson Q 2) (1900)
Cole C5:
Cole C6, see Pearson Q 3 (1900)
Cole C7:
Digby (1983)
Divgi (1979)
Edwards (1957)
Forbes (1907)
McEwen and Michael (1920)
same as Cole C3:
Pearson (1900)
Pearson Q1:
Pearson Q2, see Yule's Q (1900)
Pearson Q3 (= Yule's r, Cole's C6):
Pearson Q4:
Pearson Q5:
Yule's Q and Phi (1900)
Yule's Q (= Pearson's Q2, Cole's C4):
Yule's Phi = Pearson's Phi:
Yule's r see Pearson Q3 (1900)
Yule's Y (1912)
with Formulas
Alroy's Forbes Adjustment
Alroy adjusts the Forbes coefficient by setting (Alroy, 2015):
\(F' = \frac{a\times\left(n' + \sqrt{n'}\right)}{a\times\left(n' + \sqrt{n'}\right) + \frac{3}{2}\times b\times c}\)
With:
\(n' = a + b + c\)
Alroy refers to the Forbes coefficient as a measure of similarity, and sets out to improve the measure by disregarding the bottom-right cell (\(d\)).
Anderberg
Equation 70 found in Choi et al. (2010):
\(\frac{\sigma - \sigma'}{2\times n}\)
With:
\(\sigma = \max{\left(a, b\right)} + \max{\left(c, d\right)} + \max{\left(a, c\right)} + \max{\left(b, d\right)}\)
\(\sigma' = \max{\left(R_1, R_2\right)} + \max{\left(C_1, C_2\right)}\)
Supposedly originally from Anderberg (1973).
Austin-Colwell Similarity
Austin and Colwell (1977, p. 205):
\(S = \frac{2}{\pi} \text{arcsin} \sqrt{\frac{a + d}{n}}\)
Equation 21 found in Hubálek (1982)
Baroni-Urbani-Buser Similarity
Baroni-Urbani and Buser S (1976, p. 258):
\(S_{**} = \frac{\sqrt{a\times d} + a}{\sqrt{a\times d} + a + b + c}\)
\(S_{*} = \frac{a - b - c + \sqrt{a\times d}}{a + b + c + \sqrt{a\times d}}\)
Equations 71 and 72 from Choi et al. (2010), equations 32 and 33 in Hubálek (1982) and equations 38a and 38b from Warrens (2008)
Becker and Clogg (rtet)
Becker and Clogg (1988, pp. 410-412)
\( \rho^* = \frac{g-1}{g+1} \)
\( \rho^{**} = \frac{OR^{13.3/\Delta} - 1}{OR^{13.3/\Delta} + 1} \)
with:
\(g=e^{12.4\times\phi - 24.6\times\phi^3}\)
\(\phi = \frac{\ln\left(OR\right)}{\Delta}\)
\(OR=\frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)} = \frac{a\times d}{b\times c}\)
\(\Delta = \left(\mu_{R1} - \mu_{R2}\right) \times \left(v_{C1} - v_{C2}\right)\)
\(\mu_{R1} = \frac{-e^{-\frac{t_r^2}{2}}}{p_{R1}}, \mu_{R2} = \frac{e^{-\frac{t_r^2}{2}}}{p_{R2}} \)
\(v_{C1} = \frac{-e^{-\frac{t_c^2}{2}}}{p_{C1}}, v_{C2} = \frac{e^{-\frac{t_c^2}{2}}}{p_{C2}} \)
\(t_r = \Phi^{-1}\left(p_{R1}\right), t_c = \Phi^{-1}\left(p_{C1}\right)\)
\(p_{x} = \frac{x}{n}\)
\(\Phi^{-1}\left(x\right)\) is the inverse standard normal cumulative distribution function
\(OR\) is the Odds Ratio
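A Python sketch that transcribes these formulas (the inverse standard normal CDF is taken from scipy; the counts are hypothetical):

```python
import math
from scipy.stats import norm

def becker_clogg(a, b, c, d):
    """Becker-Clogg rho* and rho**, transcribed from the formulas above."""
    n = a + b + c + d
    p_r1, p_r2 = (a + b) / n, (c + d) / n      # row proportions
    p_c1, p_c2 = (a + c) / n, (b + d) / n      # column proportions
    t_r, t_c = norm.ppf(p_r1), norm.ppf(p_c1)  # normal deviates of the margins
    mu_r1, mu_r2 = -math.exp(-t_r**2 / 2) / p_r1, math.exp(-t_r**2 / 2) / p_r2
    v_c1, v_c2 = -math.exp(-t_c**2 / 2) / p_c1, math.exp(-t_c**2 / 2) / p_c2
    delta = (mu_r1 - mu_r2) * (v_c1 - v_c2)
    OR = (a * d) / (b * c)
    phi = math.log(OR) / delta
    g = math.exp(12.4 * phi - 24.6 * phi**3)
    rho_star = (g - 1) / (g + 1)
    rho_star_star = (OR ** (13.3 / delta) - 1) / (OR ** (13.3 / delta) + 1)
    return rho_star, rho_star_star
```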
Bonett and Price Y*
Bonett and Price (2007, pp. 433-434)
\(Y^* = \frac{\hat{\omega}^x-1}{\hat{\omega}^x+1}\)
With:
\(x = \frac{1}{2}-\left(\frac{1}{2}-p_{min}\right)^2\)
\(p_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)}{n}\)
\(\hat{\omega} = \frac{\left(a+0.1\right)\times\left(d+0.1\right)}{\left(b+0.1\right)\times\left(c+0.1\right)}\)
Note that \(\hat{\omega}\) is a bias-corrected version of the Odds Ratio.
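A direct Python transcription of the above (hypothetical counts):

```python
def bonett_price_y_star(a, b, c, d):
    """Bonett-Price Y*: a Q(x)-type coefficient with a data-driven exponent
    and 0.1 added to each cell as a bias correction."""
    n = a + b + c + d
    p_min = min(a + b, c + d, a + c, b + d) / n
    x = 1 / 2 - (1 / 2 - p_min) ** 2
    omega = ((a + 0.1) * (d + 0.1)) / ((b + 0.1) * (c + 0.1))
    return (omega ** x - 1) / (omega ** x + 1)
```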
Bonett and Price ρ* (rtet)
Bonett and Price ρ* (2005, p. 216)
\(\rho^* = \cos\left(\frac{\pi}{1+\omega^c}\right)\)
With:
\(\omega = OR = \frac{a\times d}{b\times c}\)
\(c = \frac{1-\frac{\left|R_1-C_1\right|}{5\times n} - \left(\frac{1}{2}-p_{min}\right)^2}{2}\)
\(p_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)}{n}\)
Bonett and Price ρhat* (rtet)
Bonett and Price ρhat* (2005, p. 216)
\(\hat{\rho}^* = \cos\left(\frac{\pi}{1+\hat{\omega}^\hat{c}}\right)\)
With:
\(\hat{\omega} = \frac{\left(a+\frac{1}{2}\right)\times \left(d+\frac{1}{2}\right)}{\left(b+\frac{1}{2}\right)\times \left(c+\frac{1}{2}\right)}\)
\(\hat{c} = \frac{1-\frac{\left|R_1-C_1\right|}{5\times \left(n+2\right)} - \left(\frac{1}{2}-\hat{p}_{min}\right)^2}{2}\)
\(\hat{p}_{min} = \frac{\text{MIN}\left(R_1, R_2, C_1, C_2\right)+1}{n+2}\)
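A Python sketch of this sample version (hypothetical counts):

```python
import math

def bonett_price_rho_hat_star(a, b, c, d):
    """Bonett-Price tetrachoric approximation with 1/2 added to each cell."""
    n = a + b + c + d
    R1, C1 = a + b, a + c
    omega = ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))
    p_min = (min(a + b, c + d, a + c, b + d) + 1) / (n + 2)
    c_hat = (1 - abs(R1 - C1) / (5 * (n + 2)) - (1 / 2 - p_min) ** 2) / 2
    return math.cos(math.pi / (1 + omega ** c_hat))
```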
Braun and Blanquet
Braun-Blanquet, supposedly from Braun-Blanquet (1932):
\(\frac{a}{\max{\left(R_1, C_1\right)}}\)
Equation 46 from Choi et al. (2010), equation 1 in Hubálek (1982) and equation 12 from Warrens (2008)
Camp (rtet)
Camp (1934, p. 309) describes the following steps for the calculation:
Step 1: If the total of column 1 (\(C_1\)) is less than that of column 2 (\(C_2\)), swap the two columns
Step 2: Calculate \(p = \frac{C_1}{n}\), \(p_1 = \frac{a}{n}\), and \(p_2 = \frac{c}{C_2}\)
Step 3: Determine \(z_1\) and \(z_2\) as the normal deviates corresponding to the areas \(p_1\) and \(p_2\) respectively (inverse standard normal cumulative distribution), i.e. \(z_1 = \Phi^{-1}\left(p_1\right), z_2 = \Phi^{-1}\left(p_2\right)\)
Step 4: Determine \(y\), the normal ordinate corresponding to \(p\) (the height of the normal distribution)
Step 5: Calculate \(m = \frac{p\times\left(1-p\right)\times\left(z_1 + z_2\right)}{y}\)
Step 6: Find phi in a table of phi values
Camp suggested for a very basic approximation to simply use \(\phi=1\).
For a better approximation Camp made the following table:
p | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
---|---|---|---|---|---|
phi | 0.637 | 0.63 | 0.62 | 0.60 | 0.56 |
Cureton (1968, p. 241) expanded on this table; the column headings give the hundredths to add to the \(p\) value in the first column:
p | +0.00 | +0.01 | +0.02 | +0.03 | +0.04 | +0.05 | +0.06 | +0.07 | +0.08 | +0.09 | +0.10 |
---|---|---|---|---|---|---|---|---|---|---|---|
0.5 | 0.637 | 0.636 | 0.636 | 0.635 | 0.635 | 0.634 | 0.634 | 0.633 | 0.633 | 0.632 | 0.631 |
0.6 | 0.631 | 0.631 | 0.630 | 0.629 | 0.628 | 0.627 | 0.626 | 0.625 | 0.624 | 0.622 | 0.621 |
0.7 | 0.621 | 0.620 | 0.618 | 0.616 | 0.614 | 0.612 | 0.610 | 0.608 | 0.606 | 0.603 | 0.600 |
0.8 | 0.600 | 0.597 | 0.594 | 0.591 | 0.587 | 0.583 | 0.579 | 0.574 | 0.569 | 0.564 | 0.559 |
Step 7: Calculate \(r_t = \frac{m}{\sqrt{1+\phi\times m^2}}\)
Cureton (1968) describes quite a few shortcomings with this approximation, and circumstances when it might be appropriate.
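A Python sketch that transcribes Camp's steps as described above; the table lookup of step 6 is replaced by a phi parameter (default 0.637, the value for p = 0.5), so this is only an approximation of the approximation:

```python
import math
from scipy.stats import norm

def camp_rtet(a, b, c, d, phi=0.637):
    """Camp's approximation of the tetrachoric correlation,
    following the steps described above."""
    # Step 1: make sure column 1 has the larger total
    if (a + c) < (b + d):
        a, b = b, a
        c, d = d, c
    n = a + b + c + d
    C1, C2 = a + c, b + d
    # Step 2: the proportions
    p, p1, p2 = C1 / n, a / n, c / C2
    # Step 3: normal deviates for p1 and p2
    z1, z2 = norm.ppf(p1), norm.ppf(p2)
    # Step 4: normal ordinate (density height) at the deviate for p
    y = norm.pdf(norm.ppf(p))
    # Step 5
    m = p * (1 - p) * (z1 + z2) / y
    # Steps 6-7: phi would normally come from Camp's or Cureton's table
    return m / math.sqrt(1 + phi * m * m)
```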
Chen and Popovich (rtet)
Chen and Popovich (2002, pp. 37-38):
\(\lambda_x = \Phi^{-1}\left(\frac{R_1}{n}\right)\)
\(\lambda_y = \Phi^{-1}\left(\frac{C_1}{n}\right)\)
Clement
Clement Inter-observer Agreement (1976, p. 258):
\(IAR = \frac{a \times R_2}{n \times R_1} + \frac{d \times R_1}{n \times R_2}\)
Equation 37 from Warrens (2008)
Cohen Kappa
Cohen Kappa (1960, p. 40):
\(\kappa = \frac{n \times P - Q}{n^2 - Q} = \frac{2\times\left(a\times d - b\times c \right)}{R_1\times C_2 + R_2\times C_1}\)
With:
\(P = \sum_{i=1}^{r} F_{i,i}\)
\(Q = \sum_{i=1}^{r} R_i\times C_i\)
Equation 24 from Warrens (2008)
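A one-line Python version of the 2x2 simplification above:

```python
def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table, via the simplification above."""
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (c + d) * (a + c))
```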
Cohen w
Cohen w (1988, p. 216):
\(w = \sqrt{\frac{\chi^2}{n}}\)
With:
\(\chi^2 = \frac{n\times\left(a\times d - b\times c\right)^2}{R_1\times R_2\times C_1\times C_2}\)
See Cohen w for more information and rules-of-thumb.
Cole C1
Cole's C1 (1949, p. 415) is the Forbes Coefficient adjusted to:
\( C_1=\frac{a\times d - b\times c}{\left(a+b\right)\times\left(a+c\right)} \)
Cole C5
Cole's C5 (1949, p. 416) is an adjustment of the Contingency Coefficient for 2x2 tables:
\( C_5 = \frac{\sqrt{2}\times\left(a\times d-b\times c\right)} {\sqrt{\left(a\times d-b\times c\right)^2 + R_1\times R_2\times C_1\times C_2}} \)
Cole C7
Cole's C7 (1949, pp. 420-421) is the measure Cole derived himself, calling it the 'coefficient of interspecific association':
\(\frac{a\times d - b\times c}{\left(a+b\right) \times \left(b + d\right)}\)
Contingency Coefficient
Contingency Coefficient (Pearson, 1904, p. 9):
\(C = \sqrt{\frac{\chi^2}{n + \chi^2}}\)
With:
\(\chi^2 = \frac{n\times\left(a\times d - b\times c\right)^2}{R_1\times R_2\times C_1\times C_2}\)
Dennis \(R^2\)
Dennis (1965, p. 69):
\(R^2 = \frac{a \times d - b \times c}{\sqrt{n \times R_1 \times C_1}}\)
Equation 44 from Choi et al. (2010)
Dice Coefficient of Association
Dice (1945, p. 302):
\(\text{Coefficient of Association} = \frac{a}{a + b}\)
\(\text{Coefficient of Association} = \frac{a}{a + c}\)
\(\text{coincidence index} = \frac{2 \times a}{2 \times a + b + c}\)
The Coincidence Index is the same as Gleason.
Equation 17a and 17b from Warrens (2008)
Digby H
Digby's H (1983, p. 754)
\(H = \frac{\left(a\times d\right)^{3/4} - \left(b\times c\right)^{3/4}}{\left(a\times d\right)^{3/4} + \left(b\times c\right)^{3/4}}\)
Doolittle i (inference-ratio)
Doolittle i (1885, p. 123):
\(\frac{\left(a \times d - b \times c\right)^2}{R_1 \times R_2 \times C_1 \times C_2}\)
Equation 31 in Hubálek (1982) and equation 2 from Warrens (2008)
Driver-Kroeber A (=Kulczynski v2) and G (= Ochiai-1 = Otsuka)
Driver and Kroeber (1932, p. 219):
\(G = \frac{a}{\sqrt{R_1 \times C_1}}\)
\(A = \frac{1}{2} \times \left(\frac{a}{R_1} + \frac{a}{C_1}\right)\)
G is listed as equations 31, 33 and 38 in Choi et al. (2010), who attribute equation 33 to Ochiai-I, referring to Ochiai (1957), and the same equation as 38 to Otsuka (1936); it is equation 11 in Hubálek (1982) and equation 13 from Warrens (2008)
A is listed as equations 41 and 42 from Choi et al. (2010), equations 7 and 8 in Hubálek (1982) and equation 11a from Warrens (2008). The original source is most likely Kulczynski (1927)
Edwards Q
Edwards Q (1957, p. ???) as a variation on Yule's Q:
\( Q_E = \frac{OR^{\pi/4} - 1}{OR^{\pi/4} + 1} \)
\(OR=\frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)} = \frac{a\times d}{b\times c}\)
Eyraud
Eyraud, supposedly from Eyraud (1936):
\(\frac{a - R_1 \times C_1}{R_1 \times R_2 \times C_1 \times C_2}\)
Equation 74 from Choi et al. (2010), but with an additional n factor in the denominator; equation 17 in Hubálek (1982) and equation 12 from Warrens (2008)
Faith C
Faith C (1983, p. 290):
\(C = \frac{a + \frac{1}{2}\times d}{n}\)
Equation 10 from Choi et al. (2010).
Fager-McGowan (Index of Affinity)
Fager-McGowan (Index of Affinity) (1963, p. 454) as interpreted by Warrens (2008):
\(\frac{a}{\sqrt{R_1 \times C_1}} - \frac{1}{2 \times \sqrt{\max{\left(R_1, C_1\right)}}}\)
as interpreted by Choi et al. (2010) and Hubálek (1982):
\(\frac{a}{\sqrt{R_1 \times C_1}} - \frac{\sqrt{\max{\left(R_1, C_1\right)}}}{2}\)
Equation 47 from Choi et al. (2010), equation 13 in Hubálek (1982) and equation 29 from Warrens (2008)
Fleiss M
Fleiss M (1975, p. 656):
\(M = \frac{\left(a \times d - b \times c\right) \times \left(R_1 \times C_2 + R_2 \times C_1\right)}{2 \times R_1 \times R_2 \times C_1 \times C_2}\)
Equation 36 from Warrens (2008)
Forbes Coefficient
Forbes Coefficient (1907, p. 279)
\( F=\frac{n\times a}{\left(a+b\right)\times\left(a+c\right)} \)
Cole (1949) later adjusted this to range from -1 to 1 and dubbed it C1.
Fossum-Kaskey Chi-Square
Fossum-Kaskey Chi-Square (1966, p. 65):
\(\chi_{FK}^2 = \frac{n \times \left(a - \frac{1}{2}\right)^2}{\left(a + b\right) \times \left(a + c\right)}\)
Equation 35 from Choi et al. (2010).
Gilbert i'
Gilbert i' (1884, p. 171):
\(i' = \frac{a \times n - R_1 \times C_1}{C_1 \times n + R_1 \times n - a \times n - R_1 \times C_1}\)
Equation 46 from Choi et al. (2010), equation 1 in Hubálek (1982) and equation 12 from Warrens (2008)
Gilbert-Wells
Gilbert-Wells, supposedly from Gilbert and Wells (1966):
\(\ln{\left(a\right)} - \ln{\left(n\right)} - \ln{\left(\frac{R_1}{n}\right)} - \ln{\left(\frac{C_1}{n}\right)}\)
Equation 55 from Choi et al. (2010), equation 38 in Hubálek (1982).
Gleason (= Nei-Li = Czekanowski = Dice 3)
Gleason (1920, p. 31):
\(\frac{2 \times a}{2 \times a + b + c}\)
Choi et al. (2010) have this labelled as Dice (eq. 2), Czekanowski (eq. 3) and Nei-Li (eq. 5); it is equation 5 in Hubálek (1982) and equation 9 from Warrens (2008)
Goodman-Kruskal Lambda
Goodman-Kruskal Lambda (1954, p. 743):
\(\frac{\sigma-\sigma'}{2n-\sigma'}\)
With:
\(\sigma = \max\left(a,b\right)+\max\left(c,d\right)+\max\left(a,c\right)+\max\left(b,d\right)\)
\(\sigma' = \max\left(R_1, R_2\right)+\max\left(C_1, C_2\right)\)
Warrens (2008, p. 220) however, describes a different version:
\(\frac{2\min\left(a,d\right)-b-c}{2\min\left(a,d\right)+b+c}\)
Equation 20 from Warrens (2008)
Gower
Gower, supposedly from Gower (1971):
\(\frac{a + d}{\sqrt{R_1 \times R_2 \times C_1 \times C_2}}\)
Equation 50 from Choi et al. (2010).
Hamann
Hamann, supposedly from Hamann (1961):
\(\frac{\left(a + d\right) - \left(b + c\right)}{n}\)
Equation 67 from Choi et al. (2010), equation 24 in Hubálek (1982) and equation 27 from Warrens (2008)
Harris-Lahey Weighted Agreement
Harris-Lahey Weighted Agreement (1978) is an adjustment to the Clement formula:
\(WA = \frac{a \times \left(R_2 + C_2\right)}{2 \times n \times \left(a + b + c\right)} + \frac{d \times \left(R_1 + C_1\right)}{2 \times n \times \left(b + c + d\right)}\)
Equation 40 from Warrens (2008), although Warrens omits the factor n in each denominator.
Hawkins-Dotson
Hawkins-Dotson (1975):
\(\frac{1}{2} \times \left(\frac{a}{a + b + c} + \frac{d}{b + c + d}\right)\)
Equation 34 from Warrens (2008)
Hurlbert (= Cole C8)
Hurlbert (1969, p. 3):
\(C_8 = \frac{ad-bc}{\left\lvert ad-bc\right\rvert}\sqrt{\frac{\chi^2-\chi_{min}^2}{\chi_{max}^2-\chi_{min}^2}}\)
With:
\(\chi_{max}^2 = \begin{cases} \frac{nR_1 C_2}{R_2 C_1} & \text{ if } ad \geq bc \\ \frac{nR_1 C_1}{R_2 C_2} & \text{ if } ad < bc \text{ and } a \leq d \\ \frac{nR_2 C_2}{R_1 C_1} & \text{ if } ad < bc \text{ and } a > d \end{cases}\)
\(\chi_{min}^2 = \frac{n^3\left(\hat{a} - g\left(\hat{a}\right)\right)^2}{R_1 R_2 C_1 C_2}\)
\(\hat{a}=\frac{R_1 C_1}{n}\)
\(g\left(\hat{a}\right) = \begin{cases} \lfloor \hat{a} \rfloor & \text{ if } ad < bc \\ \lceil \hat{a}\rceil & \text{ if } ad \geq bc \end{cases}\)
Note that Hurlbert showed that Cole’s C’s can be rewritten to use the chi-square value, and then introduced a new one and labelled it C8.
Equation 35 in Hubálek (1982).
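Since this is by far the most involved of the chi-square-based measures, a Python transcription of the formulas above may help (when \(ad = bc\) the sign factor is undefined; this sketch then simply returns 0):

```python
import math

def hurlbert_c8(a, b, c, d):
    """Hurlbert's C8, transcribed from the formulas above."""
    n = a + b + c + d
    R1, R2, C1, C2 = a + b, c + d, a + c, b + d
    D = a * d - b * c
    if D == 0:
        return 0.0
    chi2 = n * D ** 2 / (R1 * R2 * C1 * C2)
    # chi-square of the most extreme table with the same margins
    if D > 0:
        chi2_max = n * R1 * C2 / (R2 * C1)
    elif a <= d:
        chi2_max = n * R1 * C1 / (R2 * C2)
    else:
        chi2_max = n * R2 * C2 / (R1 * C1)
    # chi-square of the integer-cell table closest to independence
    a_hat = R1 * C1 / n
    g = math.floor(a_hat) if D < 0 else math.ceil(a_hat)
    chi2_min = n ** 3 * (a_hat - g) ** 2 / (R1 * R2 * C1 * C2)
    sign = 1 if D > 0 else -1
    return sign * math.sqrt((chi2 - chi2_min) / (chi2_max - chi2_min))
```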
Jaccard (= Tanimoto)
Jaccard Coefficient of Community (1901; 1912, p. 39) and Tanimoto (1958, p. 5):
\(\frac{a}{a + b + c}\)
Equations 1 and 65 from Choi et al. (2010), equation 4 in Hubálek (1982) and equation 6 from Warrens (2008)
Johnson
Johnson, supposedly from Johnson (1967):
\(\frac{a}{a + b} + \frac{a}{a + c}\)
Equation 43 from Choi et al. (2010), equation 9 in Hubálek (1982) and equation 33 from Warrens (2008)
Kent-Foster K
Kent and Foster (1977, p. 311):
\(K_{occ} = \frac{-b \times c}{b \times R_1 + c \times C_1 + b \times c}\)
\(K_{non occ} = \frac{-b \times c}{b \times R_2 + c \times C_2 + b \times c}\)
Equation 39a+b from Warrens (2008)
Kuder-Richardson
Kuder-Richardson, supposedly from Kuder and Richardson (1937):
\(\frac{4 \times \left(a \times d - b \times c\right)}{R_1 \times R_2 + C_1 \times C_2 + 2 \times \left(a \times d - b \times c\right)}\)
Equation 14 from Warrens (2008)
Kulczynski A
A first version would be:
\(\frac{a}{b + c}\)
Equation 64 from Choi et al. (2010), equation 3 in Hubálek (1982) and equation 11b from Warrens (2008)
A second version would be, supposedly from Kulczynski (1927):
\(A = \frac{1}{2} \times \left(\frac{a}{R_1} + \frac{a}{C_1}\right)\)
Equation 41 and 42 from Choi et al. (2010), equation 7 and 8 in Hubálek (1982) and equation 11a from Warrens (2008)
Loevinger H (= Forbes 2)
Forbes-2, supposedly from Forbes (1925), but also found in Loevinger (1947, p. 30):
\(\frac{a \times d - b \times c}{\min{\left(R_1 \times C_2, R_2 \times C_1\right)}}\)
Equation 48 from Choi et al. (2010), equation 42 in Hubálek (1982) and equation 18 from Warrens (2008)
Maxwell-Pilliner
Maxwell-Pilliner, supposedly from Maxwell and Pilliner (1968):
\(\frac{2 \times \left(a \times d - b \times c \right)}{\left(a + b\right) \times \left(c + d\right) + \left(a + c\right) \times \left(b + d\right)}\)
Equation 35 from Warrens (2008)
McConnaughey
McConnaughey, supposedly from McConnaughey (1964):
\(\frac{a^2 - b \times c}{R_1 \times C_1}\)
Equation 39 from Choi et al. (2010), equation 10 in Hubálek (1982) and equation 31 from Warrens (2008)
McEwen and Michael (= Cole C3)
Michael (1920, p. 55) worked together with McEwen and named the following 'McEwen and Michael coefficient':
\(\frac{a\times d - b\times c}{\left(\frac{a+ d}{2}\right)^2 + \left(\frac{b+ c}{2}\right)^2}= \frac{4\times\left(a\times d - b\times c\right)}{\left(a+d\right)^2+\left(b+c\right)^2}\)
Later Cole (1949, p. 415) rewrote this equation for his C3 into:
\(C_3 = \frac{4\times\left(a\times d - b\times c\right)}{\left(a+d\right)^2+\left(b+c\right)^2}\)
Equation 68 from Choi et al. (2010), equation 39 in Hubálek (1982) and equation 10 from Warrens (2008)
Mountford
Mountford (1962, p. 45):
\(\frac{2 \times a}{a \times \left(b + c\right) + 2 \times b \times c}\)
Equation 37 from Choi et al. (2010), equation 15 in Hubálek (1982) and equation 28 from Warrens (2008)
Pearson
A measure labelled ‘Pearson’ from Choi et al. (2010, p. 45):
\(\sqrt{\frac{\phi^2}{n + \phi^2}}\)
Equation 53 from Choi et al. (2010).
Pearson Phi Coefficient (= Yule Phi Coefficient = Cole C2)
The Pearson Phi Coefficient (Pearson, 1900, p. 12) is not a tetrachoric correlation, but rather the 'standard' correlation you would get by assigning a 0 and 1 to the two categories of each variable. Yule derived the same result (Yule, 1912, p. 596). The formula can be written as:
\(\phi = \frac{a\times d - b\times c}{\sqrt{R_1\times R_2\times C_1\times C_2}}\)
The formula can also be rewritten to use the Pearson chi-square statistic; Cole's C2 (1949) uses this:
\(C_2 = \sqrt{\frac{\chi^2}{n}}\)
Cohen (1988, p. 216) refers to this as 'w'. He uses the proportions in the cross table and gives the equation:
\(w = \sqrt{\sum_i\sum_j\frac{\left(p_{ij} - \pi_{ij}\right)^2}{\pi_{ij}}}\)
Where \(\pi_{ij}\) is the expected proportion for cell row i column j, and \(p_{ij}\) the sample proportion.
Cohen also provided some guidelines for the interpretation:
phi | Interpretation |
---|---|
0.00 ≤ phi < 0.10 | negligible |
0.10 ≤ phi < 0.30 | small |
0.30 ≤ phi < 0.50 | medium |
phi ≥ 0.50 | large |
Note: Adapted from Statistical power analysis for the behavioral sciences by J. Cohen, 1988, pp. 224-225. |
Equation 54 from Choi et al. (2010), equation 30 in Hubálek (1982) and equation 7 from Warrens (2008)
Pearson Q1
Pearson Q1 (Pearson 1900, p. 15)
\( Q_1 = \sin\left(\frac{\pi}{2} \times \frac{a\times d - b\times c}{\left(a+b\right)\times\left(b+d\right)}\right) \)
Note that Pearson (1900) stated: "Q1 was found of little service" (p. 16).
Pearson Q4
Pearson Q4 (1900, p. 16)
\( Q_4 = \sin\left(\frac{\pi}{2} \times \frac{1}{1 + \frac{2\times b\times c\times n}{\left(a\times d -b\times c\right) \times \left(b+c\right)}}\right) \)
Pearson Q5
Pearson Q5 (1900, p. 16)
\( Q_5 = \sin\left(\frac{\pi}{2} \times \frac{1}{\sqrt{1+ k^2}}\right) \)
with \( k^2 = \frac{4\times a\times b\times c\times d\times n^2}{\left(a\times d-b\times c\right)^2\times\left(a+d\right)\times\left(b+c\right)} \)
Peirce i
Peirce i (1884, p. 453):
\(i = \frac{a \times d - b \times c}{R_1 \times R_2}\)
Equation 1a from Warrens (2008)
a second version is also possible:
\(i = \frac{a \times d - b \times c}{C_1 \times C_2}\)
Equation 1b from Warrens (2008)
Equation 73 in Choi et al. (2010) is different; I was unable to find the original source for the variation they use:
\(i = \frac{a \times d - b \times c}{a \times b + 2 \times b \times c + c \times d}\)
Equation 73 from Choi et al. (2010), equation 16 in Hubálek (1982).
Rogers-Tanimoto Similarity Ratio
Rogers-Tanimoto Similarity Ratio (1960, p. 1117):
\(s = \frac{a + d}{a + 2\left(b + c\right) + d}\)
Equation 9 from Choi et al. (2010), equation 23 in Hubálek (1982) and equation 25 from Warrens (2008)
Rogot-Goldberg Index of Adjusted Agreement
Rogot-Goldberg Index of Adjusted Agreement (1966, p. 997):
\(s = \frac{a}{R_1 + C_1}+\frac{d}{R_2 + C_2}\)
Equation 32 from Warrens (2008)
Russell-Rao
Russell-Rao, supposedly from Russell and Rao (1940):
\(\frac{a}{n}\)
Equation 14 from Choi et al. (2010, p. 44), equation 14 in Hubálek (1982, p. 672) and equation 15 from Warrens (2008, p. 220)
Scott Index of Reliability
Scott Index of Reliability (1955, p. 323):
\(\pi = \frac{4 \times a \times d - \left(b + c\right)^2}{\left(R_1 + C_1\right) \times \left(R_2 + C_2\right)}\)
Equation 21 from Warrens (2008)
Sokal-Michener / Matching Coefficient
Sokal-Michener / Matching Coefficient (1958, p. 1417):
\(\frac{a + d}{n}\)
Equation 7 from Choi et al. (2010), equation 20 in Hubálek (1982) and equation 22 from Warrens (2008)
Simpson Index of Taxonomic Resemblance
Simpson Index of Taxonomic Resemblance (1943, p. 20; 1960, p. 301):
\(\frac{a}{\min{\left(R_1, C_1\right)}}\)
Equation 45 from Choi et al. (2010), equation 2 in Hubálek (1982) and equation 16 from Warrens (2008)
Sokal-Sneath S
Sokal-Sneath version 1 (1963, p. 129):
\(\frac{a}{a + 2 \times b + 2 \times c}\)
Ling labels equation 7 as ‘Anderberg’ and uses a formula that can be rewritten to the above version.
Equation 6 from Choi et al. (2010), equation 6 in Hubálek (1982) and equation 30a from Warrens (2008)
Sokal-Sneath version 2 (1963, p. 129):
\(\frac{2 \times a + 2 \times d}{2 \times a + b + c + 2 \times d}\)
Choi et al. and also Ling list a version attributed to Gower & Legendre (Gower & Legendre, 1986)
Equation 8 and 11 from Choi et al. (2010), equation 22 in Hubálek (1982) and equation 30b from Warrens (2008)
Sokal-Sneath version 3 (1963, p. 130):
\(\frac{1}{4} \times \left(\frac{a}{R_1} + \frac{a}{C_1} + \frac{d}{R_2} + \frac{d}{C_2}\right)\)
Equation 49 from Choi et al. (2010), equation 18 in Hubálek (1982) and equation 30c from Warrens (2008)
Sorgenfrei
Sorgenfrei, supposedly from Sorgenfrei (1958):
\(\frac{a^2}{R_1 \times C_1}\)
Equation 36 from Choi et al. (2010), equation 12 in Hubálek (1982) and equation 23 from Warrens (2008)
Stiles Association Factor
Stiles Association Factor (1961, p. 272) is the log of chi-square with Yates correction:
\(\log_{10}\left(\frac{n\left(\left\lvert a \times d - b \times c \right\rvert -\frac{n}{2}\right)^2}{R_1 \times R_2 \times C_1 \times C_2}\right)\)
Equation 59 from Choi et al. (2010) and equation 26 from Warrens (2008)
tarantula
tarantula, supposedly from Jones and Harrold (2005):
\(\frac{a \times R_2}{c \times R_1}\)
Equation 75 from Choi et al. (2010).
Tarwid
Tarwid (1960, p. 117):
\(\frac{n \times a - R_1 \times C_1}{n \times a + R_1 \times C_1}\)
Equation 40 from Choi et al. (2010), equation 43 in Hubálek (1982).
Tulloss Tripartite Similarity Index
Tulloss Tripartite Similarity Index (1997, p. 132):
\(T = \sqrt{U \times S\times R}\)
With:
\(U = \log_2\left(1 + \frac{\min\left(b, c\right) + a}{\max\left(b, c\right) + a}\right)\)
\(S = \frac{1}{\sqrt{\log_2\left(2 + \frac{\min\left(b, c\right)}{a + 1}\right)}}\)
\(R = \log_2\left(1 + \frac{a}{R_1}\right) \times \log_2\left(1 + \frac{a}{C_1}\right)\)
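A Python sketch of the three components and their combination (hypothetical counts):

```python
import math

def tulloss_t(a, b, c, d):
    """Tulloss' tripartite similarity index T = sqrt(U * S * R)."""
    R1, C1 = a + b, a + c
    U = math.log2(1 + (min(b, c) + a) / (max(b, c) + a))
    S = 1 / math.sqrt(math.log2(2 + min(b, c) / (a + 1)))
    R = math.log2(1 + a / R1) * math.log2(1 + a / C1)
    return math.sqrt(U * S * R)
```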
Yule Q (= Cole C4, and Pearson Q2)
Yule's Q (Yule, 1900, p. 272) can be calculated using:
\(Q = \frac{a\times d - b\times c}{a\times d + b\times c}\)
As for the interpretation of Q, there aren't many rules of thumb I could find. One I did find is from Glen (2017):
|Q| | Interpretation |
---|---|
0.00 ≤ |Q| < 0.30 | negligible |
0.30 ≤ |Q| < 0.50 | small |
0.50 ≤ |Q| < 0.70 | medium |
|Q| ≥ 0.70 | large |
Note: Adapted from Gamma Coefficient (Goodman and Kruskal's Gamma) & Yule's Q by S. Glen, 2017. |
Alternatively, Q can be converted to an Odds Ratio, and in turn some have proposed converting an Odds Ratio to Cohen's d. Cohen's d can then also be converted to a correlation coefficient (r), for which again there are different interpretation tables.
\(OR = \frac{1 + Q}{1 - Q}\)
Equation 61 from Choi et al. (2010), equation 36 in Hubálek (1982) and equation 3 from Warrens (2008)
Yule's r (= Pearson Q3 = Cole C6)
Yule proposed to convert his Q to a correlation coefficient using (Yule, 1900, p. 276):
\(r_Q = \cos\left(\frac{\sqrt{k}}{1+\sqrt{k}}\times\pi\right)\)
\(k = \frac{1-Q}{1+Q}\)
Pearson's Q3 (1900) will give the same result, although he used the sine function:
\(Q_3 = \sin\left(\frac{\pi}{2}\times\frac{\sqrt{a\times d} - \sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\right)\)
Cole (1949) rewrote this to:
\(C_6 = \cos\left(\frac{\pi\times\sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\right)\)
Equation 55 from Choi et al. (2010) and equation 38 in Hubálek (1982).
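The equivalence of the three formulations is easy to check numerically; a small Python sketch (hypothetical counts):

```python
import math

a, b, c, d = 3, 15, 8, 16

# Yule's r, via Q
Q = (a * d - b * c) / (a * d + b * c)
k = (1 - Q) / (1 + Q)
r_q = math.cos(math.sqrt(k) / (1 + math.sqrt(k)) * math.pi)

# Pearson's Q3, via the sine form
q3 = math.sin(math.pi / 2 * (math.sqrt(a * d) - math.sqrt(b * c))
              / (math.sqrt(a * d) + math.sqrt(b * c)))

# Cole's C6, via the cosine form
c6 = math.cos(math.pi * math.sqrt(b * c) / (math.sqrt(a * d) + math.sqrt(b * c)))

print(round(r_q, 6), round(q3, 6), round(c6, 6))  # three identical values
```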
Yule Y (coefficient of colligation)
Yule's Y (1912, p. 592) is a further adaptation of Yule's Q into:
\(Y = \frac{\sqrt{a\times d} - \sqrt{b\times c}}{\sqrt{a\times d} + \sqrt{b\times c}}\)
Yule referred to this as the coefficient of colligation.
The Odds Ratio can also be calculated from this using:
\(OR = \left(\frac{Y + 1}{1 - Y}\right)^2\)
Tables for the interpretation of the Odds Ratio can then be used.
Equation 63 from Choi et al. (2010), equation 37 in Hubálek (1982) and equation 8 from Warrens (2008)