Module stikpetP.tests.test_powerdivergence_ind
import pandas as pd
from scipy.stats import chi2
from math import log
from ..other.table_cross import tab_cross
def ts_powerdivergence_ind(field1, field2, categories1=None, categories2=None, cc= None, lambd=2/3):
'''
Power Divergence Test of Independence
-------------------------------------
A test that can be used with two nominal variables to test if they are independent. There are quite a few tests that can do this. Perhaps the most commonly used is the Pearson chi-square test (\\(\\chi^2\\)), but also an exact multinomial, G-test (\\(G^2\\)), Freeman-Tukey (\\(T^2\\)), Neyman (\\(NM^2\\)), Mod-Log Likelihood (\\(GM^2\\)), and Freeman-Tukey-Read test are possible.
Cressie and Read (1984, p. 463) noticed how the \\(\\chi^2\\), \\(G^2\\), \\(T^2\\), \\(NM^2\\) and \\(GM^2\\) can all be captured with one general formula. The additional variable lambda (\\(\\lambda\\)) was then investigated, and they settled on a \\(\\lambda\\) of 2/3.
By setting \\(\\lambda\\) to different values, we get the different tests:
* \\(\\lambda = 1\\), Pearson chi-square
* \\(\\lambda = 0\\), G/Wilks/Likelihood-Ratio
* \\(\\lambda = -\\frac{1}{2}\\), Freeman-Tukey
* \\(\\lambda = -1\\), Mod-Log-Likelihood
* \\(\\lambda = -2\\), Neyman
* \\(\\lambda = \\frac{2}{3}\\), Cressie-Read
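As a quick numeric check (with made-up counts), the same \\(\\lambda\\) values can be passed to SciPy's `scipy.stats.power_divergence`, which implements this same family of statistics:

```python
import numpy as np
from scipy.stats import power_divergence

# made-up observed and expected counts, for illustration only
f_obs = np.array([18, 22, 30, 30])
f_exp = np.array([25, 25, 25, 25])

# lambda = 1 reproduces the Pearson chi-square statistic
stat_pd, _ = power_divergence(f_obs, f_exp, lambda_=1)
stat_pearson = np.sum((f_obs - f_exp)**2 / f_exp)
print(np.isclose(stat_pd, stat_pearson))  # True

# the string alias "cressie-read" is the same as lambda = 2/3
stat_cr, _ = power_divergence(f_obs, f_exp, lambda_="cressie-read")
stat_23, _ = power_divergence(f_obs, f_exp, lambda_=2/3)
print(np.isclose(stat_cr, stat_23))  # True
```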
Parameters
----------
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
cc : {None, "yates", "pearson", "williams"}, optional
method for continuity correction
lambd : {float, "cressie-read", "likelihood-ratio", "mod-log", "pearson", "freeman-tukey", "neyman"}, optional
either name of test or specific value. Default is "cressie-read", i.e. a lambda of 2/3
Returns
-------
A dataframe with:
* *n*, the sample size
* *n rows*, number of categories used in first field
* *n col.*, number of categories used in second field
* *statistic*, the test statistic (chi-square value)
* *df*, the degrees of freedom
* *p-value*, the significance (p-value)
* *min. exp.*, the minimum expected count
* *prop. exp. below 5*, proportion of cells with expected count less than 5
* *test*, description of the test used
Notes
-----
The formula used is (Cressie & Read, 1984, p. 442):
$$\\chi_{C}^{2} = \\begin{cases} 2\\times\\sum_{i=1}^{r}\\sum_{j=1}^c\\left(F_{i,j}\\times ln\\left(\\frac{F_{i,j}}{E_{i,j}}\\right)\\right) & \\text{ if } \\lambda=0 \\\\ 2\\times\\sum_{i=1}^{r}\\sum_{j=1}^c\\left(E_{i,j}\\times ln\\left(\\frac{E_{i,j}}{F_{i,j}}\\right)\\right) & \\text{ if } \\lambda=-1 \\\\ \\frac{2}{\\lambda\\times\\left(\\lambda + 1\\right)} \\times \\sum_{i=1}^{r}\\sum_{j=1}^{c} F_{i,j}\\times\\left(\\left(\\frac{F_{i,j}}{E_{i,j}}\\right)^{\\lambda} - 1\\right) & \\text{ else } \\end{cases}$$
$$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
$$sig. = 1 - \\chi^2\\left(\\chi_{C}^{2},df\\right)$$
With:
$$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
$$E_{i,j} = \\frac{R_i\\times C_j}{n}$$
$$R_i = \\sum_{j=1}^c F_{i,j}$$
$$C_j = \\sum_{i=1}^r F_{i,j}$$
*Symbols used:*
* \\(r\\), the number of categories in the first variable (the number of rows)
* \\(c\\), the number of categories in the second variable (the number of columns)
* \\(F_{i,j}\\), the observed count in row i and column j
* \\(E_{i,j}\\), the expected count in row i and column j
* \\(R_i\\), the i-th row total
* \\(C_j\\), the j-th column total
* \\(n\\), the sum of all counts
* \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution
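The expected counts and the general statistic can be sketched directly with NumPy (made-up counts) and cross-checked against `scipy.stats.chi2_contingency`, which accepts the same `lambda_` values:

```python
import numpy as np
from scipy.stats import chi2_contingency

# a made-up 2x3 table of observed counts F_ij
F = np.array([[20, 30, 10],
              [25, 15, 20]])
n = F.sum()                 # total count
R = F.sum(axis=1)           # row totals R_i
C = F.sum(axis=0)           # column totals C_j
E = np.outer(R, C) / n      # expected counts E_ij = R_i * C_j / n

# Cressie-Read statistic, i.e. the general formula with lambda = 2/3
lambd = 2 / 3
stat = 2 / (lambd * (lambd + 1)) * np.sum(F * ((F / E)**lambd - 1))

# cross-check against scipy's built-in version of the same test
stat_sp, p_sp, dof, expected = chi2_contingency(F, correction=False,
                                                lambda_="cressie-read")
print(np.isclose(stat, stat_sp))  # True
```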
Cressie and Read (1984, p. 463) suggest using \\(\\lambda = \\frac{2}{3}\\), which
is therefore the default in this function.
The **Pearson chi-square statistic** can be obtained by setting \\(\\lambda = 1\\). Pearson's original
formula is (Pearson, 1900, p. 165):
$$\\chi_{P}^2 = \\sum_{i=1}^r \\sum_{j=1}^c \\frac{\\left(F_{i,j} - E_{i,j}\\right)^2}{E_{i,j}}$$
The **Freeman-Tukey test** has as a formula (Bishop et al., 2007, p. 513):
$$T^2 = 4\\times\\sum_{i=1}^r \\sum_{j=1}^c \\left(\\sqrt{F_{i,j}} - \\sqrt{E_{i,j}}\\right)^2$$
This is the same as setting \\(\\lambda\\) to \\(-\\frac{1}{2}\\). Note that the formula is often attributed to Freeman and Tukey (1950),
but it does not appear explicitly in that article.
The **Neyman test** formula is similar to Pearson's, but with the observed and expected counts swapped (Neyman, 1949, p. 250):
$$\\chi_{N}^2 = \\sum_{i=1}^r \\sum_{j=1}^c \\frac{\\left(E_{i,j} - F_{i,j}\\right)^2}{F_{i,j}}$$
This is the same as setting \\(\\lambda\\) to \\(-2\\).
The Yates correction (yates) replaces \\(F_{i,j}\\) with an adjusted version (Yates, 1934, p. 222):
$$F_{i,j}^\\ast = \\begin{cases} F_{i,j} - 0.5 & \\text{ if } F_{i,j}>E_{i,j} \\\\ F_{i,j} & \\text{ if } F_{i,j}= E_{i,j}\\\\ F_{i,j} + 0.5 & \\text{ if } F_{i,j}<E_{i,j} \\end{cases}$$
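As a sketch (made-up counts), the case distinction above amounts to moving each observed count half a unit toward its expected count:

```python
import numpy as np

# made-up observed and expected counts for three cells
F = np.array([12.0, 8.0, 10.0])
E = np.array([10.0, 10.0, 10.0])

# F - 0.5 when F > E, F + 0.5 when F < E, unchanged when F == E
F_adj = F - 0.5 * np.sign(F - E)
print(F_adj.tolist())  # [11.5, 8.5, 10.0]
```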
The Pearson correction (pearson) is calculated using (E.S. Pearson, 1947, p. 157):
$$\\chi_{PP}^2 = \\chi_{P}^{2}\\times\\frac{n - 1}{n}$$
The Williams correction (williams) is calculated using (Williams, 1976, p. 36):
$$\\chi_{PW}^2 = \\frac{\\chi_{P}^2}{q}$$
With:
$$q = 1 + \\frac{\\left(n\\times\\left(\\sum_{i=1}^r \\frac{1}{R_i}\\right)-1\\right) \\times \\left(n\\times\\left(\\sum_{j=1}^c \\frac{1}{C_j}\\right)-1\\right)}{6\\times n\\times df}$$
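A minimal sketch of the correction factor \\(q\\) for a made-up 2×2 table (q is slightly above 1, so dividing by it shrinks the statistic):

```python
import numpy as np

# made-up 2x2 table of observed counts
F = np.array([[15, 25],
              [20, 10]])
n = F.sum()
R = F.sum(axis=1)     # row totals R_i
C = F.sum(axis=0)     # column totals C_j
df = (F.shape[0] - 1) * (F.shape[1] - 1)

# Williams' correction factor q
q = 1 + (n * np.sum(1 / R) - 1) * (n * np.sum(1 / C) - 1) / (6 * n * df)
print(round(q, 4))  # 1.022
```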
References
----------
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). *Discrete multivariate analysis*. Springer.
Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. *Journal of the Royal Statistical Society: Series B (Methodological), 46*(3), 440–464. doi:10.1111/j.2517-6161.1984.tb01318.x
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. *The Annals of Mathematical Statistics, 21*(4), 607–611. doi:10.1214/aoms/1177729756
Neyman, J. (1949). Contribution to the theory of the chi-square test. *Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability*, 239–273. doi:10.1525/9780520327016-030
Pearson, E. S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine Series 5, 50*(302), 157–175. doi:10.1080/14786440009463897
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. *The Annals of Mathematical Statistics, 9*(1), 60–62. doi:10.1214/aoms/1177732360
Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
#Test Used
if (lambd == 2 / 3 or lambd == "cressie-read"):
lambd = 2 / 3
testUsed = "Cressie-Read test of independence"
elif (lambd == 0 or lambd == "likelihood-ratio"):
lambd = 0
testUsed = "likelihood ratio test of independence"
elif (lambd == -1 or lambd == "mod-log"):
lambd = -1
testUsed = "mod-log likelihood ratio test of independence"
elif (lambd == 1 or lambd == "pearson"):
lambd = 1
testUsed = "Pearson chi-square test of independence"
elif (lambd == -0.5 or lambd == "freeman-tukey"):
lambd = -0.5
testUsed = "Freeman-Tukey test of independence"
elif (lambd == -2 or lambd == "neyman"):
lambd = -2
testUsed = "Neyman test of independence"
else:
testUsed = "power divergence test of independence with lambda = " + str(lambd)
if cc == "yates":
testUsed = testUsed + ", with Yates continuity correction"
#create the cross table
ct = tab_cross(field1, field2, categories1, categories2, totals="include")
#basic counts
nrows = ct.shape[0] - 1
ncols = ct.shape[1] - 1
n = ct.iloc[nrows, ncols]
#determine the expected counts & chi-square value
chi2Val = 0
expMin = -1
nExpBelow5 = 0
expC = pd.DataFrame()
for i in range(0, nrows):
for j in range(0, ncols):
expC.at[i, j] = ct.iloc[nrows, j] * ct.iloc[i, ncols] / n
#add or remove a half in case Yates correction
if cc=="yates":
if ct.iloc[i,j] > expC.iloc[i,j]:
ct.iloc[i,j] = ct.iloc[i,j] - 0.5
elif ct.iloc[i,j] < expC.iloc[i,j]:
ct.iloc[i,j] = ct.iloc[i,j] + 0.5
if (lambd == 0):
chi2Val = chi2Val + ct.iloc[i,j] * log(ct.iloc[i,j] / expC.iloc[i,j])
elif (lambd == -1):
chi2Val = chi2Val + expC.iloc[i,j] * log(expC.iloc[i,j] / ct.iloc[i,j])
else:
chi2Val = chi2Val + ct.iloc[i,j] * ((ct.iloc[i,j] / expC.iloc[i,j])**lambd - 1)
#check if below 5
if expMin < 0 or expC.iloc[i,j] < expMin:
expMin = expC.iloc[i,j]
if expC.iloc[i,j] < 5:
nExpBelow5 = nExpBelow5 + 1
if (lambd == 0 or lambd==-1):
chi2Val = 2*chi2Val
else:
chi2Val = 2/(lambd*(lambd + 1)) * chi2Val
nExpBelow5 = nExpBelow5/(nrows*ncols)
#Degrees of freedom
df = (nrows - 1)*(ncols - 1)
#Williams and Pearson correction
if cc == "williams":
testUsed = testUsed + ", with Williams continuity correction"
rTotInv = 0
for i in range(0, nrows):
rTotInv = rTotInv + 1 / ct.iloc[i, ncols]
cTotInv = 0
for j in range(0, ncols):
cTotInv = cTotInv + 1 / ct.iloc[nrows, j]
q = 1 + (n * rTotInv - 1) * (n * cTotInv - 1) / (6 * n * df)
chi2Val = chi2Val / q
elif cc == "pearson":
testUsed = testUsed + ", with E.S. Pearson continuity correction"
chi2Val = chi2Val * (n - 1) / n
#The test
pvalue = chi2.sf(chi2Val, df)
#Prepare the results
colNames = ["n", "n rows", "n col.", "statistic", "df", "p-value", "min. exp.", "prop. exp. below 5", "test"]
testResults = pd.DataFrame([[n, nrows, ncols, chi2Val, df, pvalue, expMin, nExpBelow5, testUsed]], columns=colNames)
pd.set_option('display.max_colwidth', None)
return testResults
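A self-contained end-to-end check of the computation above. Since `tab_cross` is part of this package, the sketch below uses `pandas.crosstab` instead, with made-up nominal data, and compares the result against SciPy's built-in power-divergence contingency test:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# made-up nominal data; pandas.crosstab stands in for tab_cross here
sex = ["m", "f", "f", "m", "f", "m", "f", "f", "m", "f",
       "m", "m", "f", "m", "f", "f", "m", "f", "m", "m"]
brand = ["a", "a", "b", "b", "a", "a", "b", "a", "b", "b",
         "a", "b", "a", "a", "b", "a", "b", "b", "a", "a"]
F = pd.crosstab(pd.Series(sex), pd.Series(brand)).to_numpy()
n = F.sum()
E = np.outer(F.sum(axis=1), F.sum(axis=0)) / n   # expected counts

# Cressie-Read statistic (lambda = 2/3), df and p-value
lambd = 2 / 3
stat = 2 / (lambd * (lambd + 1)) * np.sum(F * ((F / E)**lambd - 1))
df = (F.shape[0] - 1) * (F.shape[1] - 1)
p = chi2.sf(stat, df)

# the same result from scipy's contingency-table version
s2, p2, _, _ = chi2_contingency(F, correction=False, lambda_="cressie-read")
print(np.isclose(stat, s2) and np.isclose(p, p2))  # True
```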
Functions
def ts_powerdivergence_ind(field1, field2, categories1=None, categories2=None, cc=None, lambd=2/3)