Module stikpetP.tests.test_powerdivergence_ind
import pandas as pd
from scipy.stats import chi2
from math import log
from ..other.table_cross import tab_cross
def ts_powerdivergence_ind(field1, field2, categories1=None, categories2=None, cc= None, lambd=2/3):
'''
Power Divergence Test of Independence
-------------------------------------
A test that can be used with two nominal variables to test if they are independent. There are quite a few tests that can do this. Perhaps the most commonly used is the Pearson chi-square test (\\(\\chi^2\\)), but also an exact multinomial, G-test (\\(G^2\\)), Freeman-Tukey (\\(T^2\\)), Neyman (\\(NM^2\\)), Mod-Log Likelihood (\\(GM^2\\)), and Freeman-Tukey-Read test are possible.
Cressie and Read (1984, p. 463) noticed how the \\(\\chi^2\\), \\(G^2\\), \\(T^2\\), \\(NM^2\\) and \\(GM^2\\) can all be captured with one general formula. The additional variable lambda (\\(\\lambda\\)) was then investigated, and they settled on a \\(\\lambda\\) of 2/3.
By setting \\(\\lambda\\) to different values, we get the different tests:
* \\(\\lambda = 1\\), Pearson chi-square
* \\(\\lambda = 0\\), G/Wilks/Likelihood-Ratio
* \\(\\lambda = -\\frac{1}{2}\\), Freeman-Tukey
* \\(\\lambda = -1\\), Mod-Log-Likelihood
* \\(\\lambda = -2\\), Neyman
* \\(\\lambda = \\frac{2}{3}\\), Cressie-Read
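As a quick numeric check (with made-up counts), the same \\(\\lambda\\) values can be passed to SciPy's `scipy.stats.power_divergence`, which implements this same family of statistics:

```python
import numpy as np
from scipy.stats import power_divergence

# made-up observed and expected counts, for illustration only
f_obs = np.array([18, 22, 30, 30])
f_exp = np.array([25, 25, 25, 25])

# lambda = 1 reproduces the Pearson chi-square statistic
stat_pd, _ = power_divergence(f_obs, f_exp, lambda_=1)
stat_pearson = np.sum((f_obs - f_exp)**2 / f_exp)
print(np.isclose(stat_pd, stat_pearson))  # True

# the string alias "cressie-read" is the same as lambda = 2/3
stat_cr, _ = power_divergence(f_obs, f_exp, lambda_="cressie-read")
stat_23, _ = power_divergence(f_obs, f_exp, lambda_=2/3)
print(np.isclose(stat_cr, stat_23))  # True
```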
Parameters
----------
field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
cc : {None, "yates", "pearson", "williams"}, optional
method for continuity correction
lambd : {float, "cressie-read", "likelihood-ratio", "mod-log", "pearson", "freeman-tukey", "neyman"}, optional
either name of test or specific value. Default is "cressie-read", i.e. a lambda of 2/3
Returns
-------
A dataframe with:
* *n*, the sample size
* *n rows*, number of categories used in first field
* *n col.*, number of categories used in second field
* *statistic*, the test statistic (chi-square value)
* *df*, the degrees of freedom
* *p-value*, the significance (p-value)
* *min. exp.*, the minimum expected count
* *prop. exp. below 5*, proportion of cells with expected count less than 5
* *test*, description of the test used
Notes
-----
The formula used is (Cressie & Read, 1984, p. 442):
$$\\chi_{C}^{2} = \\begin{cases} 2\\times\\sum_{i=1}^{r}\\sum_{j=1}^c\\left(F_{i,j}\\times ln\\left(\\frac{F_{i,j}}{E_{i,j}}\\right)\\right) & \\text{ if } \\lambda=0 \\\\ 2\\times\\sum_{i=1}^{r}\\sum_{j=1}^c\\left(E_{i,j}\\times ln\\left(\\frac{E_{i,j}}{F_{i,j}}\\right)\\right) & \\text{ if } \\lambda=-1 \\\\ \\frac{2}{\\lambda\\times\\left(\\lambda + 1\\right)} \\times \\sum_{i=1}^{r}\\sum_{j=1}^{c} F_{i,j}\\times\\left(\\left(\\frac{F_{i,j}}{E_{i,j}}\\right)^{\\lambda} - 1\\right) & \\text{ else } \\end{cases}$$
$$df = \\left(r - 1\\right)\\times\\left(c - 1\\right)$$
$$sig. = 1 - \\chi^2\\left(\\chi_{C}^{2},df\\right)$$
With:
$$n = \\sum_{i=1}^r \\sum_{j=1}^c F_{i,j}$$
$$E_{i,j} = \\frac{R_i\\times C_j}{n}$$
$$R_i = \\sum_{j=1}^c F_{i,j}$$
$$C_j = \\sum_{i=1}^r F_{i,j}$$
*Symbols used:*
* \\(r\\), the number of categories in the first variable (the number of rows)
* \\(c\\), the number of categories in the second variable (the number of columns)
* \\(F_{i,j}\\), the observed count in row i and column j
* \\(E_{i,j}\\), the expected count in row i and column j
* \\(R_i\\), the i-th row total
* \\(C_j\\), the j-th column total
* \\(n\\), the sum of all counts
* \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution
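The expected counts and the general statistic can be sketched directly with NumPy (made-up counts) and cross-checked against `scipy.stats.chi2_contingency`, which accepts the same `lambda_` values:

```python
import numpy as np
from scipy.stats import chi2_contingency

# a made-up 2x3 table of observed counts F_ij
F = np.array([[20, 30, 10],
              [25, 15, 20]])
n = F.sum()                 # total count
R = F.sum(axis=1)           # row totals R_i
C = F.sum(axis=0)           # column totals C_j
E = np.outer(R, C) / n      # expected counts E_ij = R_i * C_j / n

# Cressie-Read statistic, i.e. the general formula with lambda = 2/3
lambd = 2 / 3
stat = 2 / (lambd * (lambd + 1)) * np.sum(F * ((F / E)**lambd - 1))

# cross-check against scipy's built-in version of the same test
stat_sp, p_sp, dof, expected = chi2_contingency(F, correction=False,
                                                lambda_="cressie-read")
print(np.isclose(stat, stat_sp))  # True
```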
Cressie and Read (1984, p. 463) suggest using \\(\\lambda = \\frac{2}{3}\\), which
is therefore the default in this function.
The **Pearson chi-square statistic** can be obtained by setting \\(\\lambda = 1\\). Pearson's original
formula is (Pearson, 1900, p. 165):
$$\\chi_{P}^2 = \\sum_{i=1}^r \\sum_{j=1}^c \\frac{\\left(F_{i,j} - E_{i,j}\\right)^2}{E_{i,j}}$$
The **Freeman-Tukey test** has as a formula (Bishop et al., 2007, p. 513):
$$T^2 = 4\\times\\sum_{i=1}^r \\sum_{j=1}^c \\left(\\sqrt{F_{i,j}} - \\sqrt{E_{i,j}}\\right)^2$$
This is the same as setting \\(\\lambda\\) to \\(-\\frac{1}{2}\\). Note that the formula is often attributed to Freeman and Tukey (1950),
but it does not appear explicitly in that article.
The **Neyman test** formula is similar to Pearson's, but with the observed and expected counts swapped (Neyman, 1949, p. 250):
$$\\chi_{N}^2 = \\sum_{i=1}^r \\sum_{j=1}^c \\frac{\\left(E_{i,j} - F_{i,j}\\right)^2}{F_{i,j}}$$
This is the same as setting \\(\\lambda\\) to \\(-2\\).
The Yates correction (yates) replaces \\(F_{i,j}\\) with an adjusted version (Yates, 1934, p. 222):
$$F_{i,j}^\\ast = \\begin{cases} F_{i,j} - 0.5 & \\text{ if } F_{i,j}>E_{i,j} \\\\ F_{i,j} & \\text{ if } F_{i,j}= E_{i,j}\\\\ F_{i,j} + 0.5 & \\text{ if } F_{i,j}<E_{i,j} \\end{cases}$$
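As a sketch (made-up counts), the case distinction above amounts to moving each observed count half a unit toward its expected count:

```python
import numpy as np

# made-up observed and expected counts for three cells
F = np.array([12.0, 8.0, 10.0])
E = np.array([10.0, 10.0, 10.0])

# F - 0.5 when F > E, F + 0.5 when F < E, unchanged when F == E
F_adj = F - 0.5 * np.sign(F - E)
print(F_adj.tolist())  # [11.5, 8.5, 10.0]
```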
The Pearson correction (pearson) is calculated using (E.S. Pearson, 1947, p. 157):
$$\\chi_{PP}^2 = \\chi_{P}^{2}\\times\\frac{n - 1}{n}$$
The Williams correction (williams) is calculated using (Williams, 1976, p. 36):
$$\\chi_{PW}^2 = \\frac{\\chi_{P}^2}{q}$$
With:
$$q = 1 + \\frac{\\left(n\\times\\left(\\sum_{i=1}^r \\frac{1}{R_i}\\right)-1\\right) \\times \\left(n\\times\\left(\\sum_{j=1}^c \\frac{1}{C_j}\\right)-1\\right)}{6\\times n\\times df}$$
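A minimal sketch of the correction factor \\(q\\) for a made-up 2×2 table (q is slightly above 1, so dividing by it shrinks the statistic):

```python
import numpy as np

# made-up 2x2 table of observed counts
F = np.array([[15, 25],
              [20, 10]])
n = F.sum()
R = F.sum(axis=1)     # row totals R_i
C = F.sum(axis=0)     # column totals C_j
df = (F.shape[0] - 1) * (F.shape[1] - 1)

# Williams' correction factor q
q = 1 + (n * np.sum(1 / R) - 1) * (n * np.sum(1 / C) - 1) / (6 * n * df)
print(round(q, 4))  # 1.022
```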
References
----------
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (2007). *Discrete multivariate analysis*. Springer.
Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. *Journal of the Royal Statistical Society: Series B (Methodological), 46*(3), 440–464. doi:10.1111/j.2517-6161.1984.tb01318.x
Freeman, M. F., & Tukey, J. W. (1950). Transformations related to the angular and the square root. *The Annals of Mathematical Statistics, 21*(4), 607–611. doi:10.1214/aoms/1177729756
Neyman, J. (1949). Contribution to the theory of the chi-square test. *Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability*, 239–273. doi:10.1525/9780520327016-030
Pearson, E. S. (1947). The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. *Biometrika, 34*(1/2), 139–167. doi:10.2307/2332518
Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. *Philosophical Magazine Series 5, 50*(302), 157–175. doi:10.1080/14786440009463897
Wilks, S. S. (1938). The large-sample distribution of the likelihood ratio for testing composite hypotheses. *The Annals of Mathematical Statistics, 9*(1), 60–62. doi:10.1214/aoms/1177732360
Williams, D. A. (1976). Improved likelihood ratio tests for complete contingency tables. *Biometrika, 63*(1), 33–37. doi:10.2307/2335081
Yates, F. (1934). Contingency tables involving small numbers and the chi square test. *Supplement to the Journal of the Royal Statistical Society, 1*(2), 217–235. doi:10.2307/2983604
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
#Test Used
if (lambd == 2 / 3 or lambd == "cressie-read"):
lambd = 2 / 3
testUsed = "Cressie-Read test of independence"
elif (lambd == 0 or lambd == "likelihood-ratio"):
lambd = 0
testUsed = "likelihood ratio test of independence"
elif (lambd == -1 or lambd == "mod-log"):
lambd = -1
testUsed = "mod-log likelihood ratio test of independence"
elif (lambd == 1 or lambd == "pearson"):
lambd = 1
testUsed = "Pearson chi-square test of independence"
elif (lambd == -0.5 or lambd == "freeman-tukey"):
lambd = -0.5
testUsed = "Freeman-Tukey test of independence"
elif (lambd == -2 or lambd == "neyman"):
lambd = -2
testUsed = "Neyman test of independence"
else:
testUsed = "power divergence test of independence with lambda = " + str(lambd)
if cc == "yates":
testUsed = testUsed + ", with Yates continuity correction"
#create the cross table
ct = tab_cross(field1, field2, categories1, categories2, totals="include")
#basic counts
nrows = ct.shape[0] - 1
ncols = ct.shape[1] - 1
n = ct.iloc[nrows, ncols]
#determine the expected counts & chi-square value
chi2Val = 0
expMin = -1
nExpBelow5 = 0
expC = pd.DataFrame()
for i in range(0, nrows):
for j in range(0, ncols):
expC.at[i, j] = ct.iloc[nrows, j] * ct.iloc[i, ncols] / n
#add or remove a half in case Yates correction
if cc=="yates":
if ct.iloc[i,j] > expC.iloc[i,j]:
ct.iloc[i,j] = ct.iloc[i,j] - 0.5
elif ct.iloc[i,j] < expC.iloc[i,j]:
ct.iloc[i,j] = ct.iloc[i,j] + 0.5
if (lambd == 0):
chi2Val = chi2Val + ct.iloc[i,j] * log(ct.iloc[i,j] / expC.iloc[i,j])
elif (lambd == -1):
chi2Val = chi2Val + expC.iloc[i,j] * log(expC.iloc[i,j] / ct.iloc[i,j])
else:
chi2Val = chi2Val + ct.iloc[i,j] * ((ct.iloc[i,j] / expC.iloc[i,j])**lambd - 1)
#check if below 5
if expMin < 0 or expC.iloc[i,j] < expMin:
expMin = expC.iloc[i,j]
if expC.iloc[i,j] < 5:
nExpBelow5 = nExpBelow5 + 1
if (lambd == 0 or lambd==-1):
chi2Val = 2*chi2Val
else:
chi2Val = 2/(lambd*(lambd + 1)) * chi2Val
nExpBelow5 = nExpBelow5/(nrows*ncols)
#Degrees of freedom
df = (nrows - 1)*(ncols - 1)
#Williams and Pearson correction
if cc == "williams":
testUsed = testUsed + ", with Williams continuity correction"
rTotInv = 0
for i in range(0, nrows):
rTotInv = rTotInv + 1 / ct.iloc[i, ncols]
cTotInv = 0
for j in range(0, ncols):
cTotInv = cTotInv + 1 / ct.iloc[nrows, j]
q = 1 + (n * rTotInv - 1) * (n * cTotInv - 1) / (6 * n * df)
chi2Val = chi2Val / q
elif cc == "pearson":
testUsed = testUsed + ", with E.S. Pearson continuity correction"
chi2Val = chi2Val * (n - 1) / n
#The test
pvalue = chi2.sf(chi2Val, df)
#Prepare the results
colNames = ["n", "n rows", "n col.", "statistic", "df", "p-value", "min. exp.", "prop. exp. below 5", "test"]
testResults = pd.DataFrame([[n, nrows, ncols, chi2Val, df, pvalue, expMin, nExpBelow5, testUsed]], columns=colNames)
pd.set_option('display.max_colwidth', None)
return testResults
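A self-contained end-to-end check of the computation above. Since `tab_cross` is part of this package, the sketch below uses `pandas.crosstab` instead, with made-up nominal data, and compares the result against SciPy's built-in power-divergence contingency test:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# made-up nominal data; pandas.crosstab stands in for tab_cross here
sex = ["m", "f", "f", "m", "f", "m", "f", "f", "m", "f",
       "m", "m", "f", "m", "f", "f", "m", "f", "m", "m"]
brand = ["a", "a", "b", "b", "a", "a", "b", "a", "b", "b",
         "a", "b", "a", "a", "b", "a", "b", "b", "a", "a"]
F = pd.crosstab(pd.Series(sex), pd.Series(brand)).to_numpy()
n = F.sum()
E = np.outer(F.sum(axis=1), F.sum(axis=0)) / n   # expected counts

# Cressie-Read statistic (lambda = 2/3), df and p-value
lambd = 2 / 3
stat = 2 / (lambd * (lambd + 1)) * np.sum(F * ((F / E)**lambd - 1))
df = (F.shape[0] - 1) * (F.shape[1] - 1)
p = chi2.sf(stat, df)

# the same result from scipy's contingency-table version
s2, p2, _, _ = chi2_contingency(F, correction=False, lambda_="cressie-read")
print(np.isclose(stat, s2) and np.isclose(p, p2))  # True
```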
Functions
def ts_powerdivergence_ind(field1, field2, categories1=None, categories2=None, cc=None, lambd=2/3)