Module stikpetP.effect_sizes.eff_size_goodman_kruskal_lambda

Expand source code
import pandas as pd
from statistics import NormalDist
from random import random
from ..other.table_cross import tab_cross

def es_goodman_kruskal_lambda(field1, field2, categories1=None, categories2=None, ties="first"):
    '''
    Goodman-Kruskal Lambda
    ----------------------
    This effect size measure compares the observed frequencies with the modal frequencies. Unlike measures such as Cramér's V, Tschuprow's T, and Cohen's w, it therefore does not make use of the chi-square value.
    
    A value of zero would indicate no association (= independence) and a value of one a perfect association (= dependence).
    
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories1 : list or dictionary, optional
        order and/or selection for categories of field1
    categories2 : list or dictionary, optional
        order and/or selection for categories of field2
    ties : string, optional
        how to deal with tied modal scores. Either "first" (default), "last", "average" or "random"
        
    Returns
    -------
    A dataframe with:
    
    * *dependent*, the field used as dependent variable
    * *value*, the lambda value
    * *n*, the sample size
    * *ASE_0*, the asymptotic standard error assuming the null hypothesis
    * *ASE_1*, the asymptotic standard error not assuming the null hypothesis
    * *statistic*, the z-value
    * *p-value*, the significance (p-value)
    
    Notes
    -----
    The formula used is (Goodman & Kruskal, 1954, p. 743):
    $$\\lambda_{Y|X} = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) - C_{max}}{n - C_{max}}$$
    $$\\lambda_{X|Y} = \\frac{\\left(\\sum_{j=1}^c F_{max, j}\\right) - R_{max}}{n - R_{max}}$$
    $$\\lambda = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) + \\left(\\sum_{j=1}^c F_{max, j}\\right) - C_{max} - R_{max}}{2\\times n - R_{max} - C_{max}}$$
    
    The asymptotic standard errors are given by:
    $$ASE\\left(\\lambda_{Y|X}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c - \\delta_{j}^c\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)-C_{max}\\right)^2}{n}}}{n-C_{max}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^r - \\delta_{i}^r\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)-R_{max}\\right)^2}{n}}}{n-R_{max}}$$
    $$ASE\\left(\\lambda\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+\\left(\\sum_{j=1}^c F_{max,j}\\right)-C_{max}-R_{max}\\right)^2}{n}}}{2\\times n - R_{max} - C_{max}}$$
    
    $$ASE\\left(\\lambda_{Y|X}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{i=1}^r F_{i,max}\\right)\\times\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+C_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^c \\times\\delta_{j}^c\\right)\\right)}{\\left(n - C_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{j=1}^c F_{max,j}\\right)\\times\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)+R_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^r \\times\\delta_{i}^r\\right)\\right)}{\\left(n - R_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda\\right)_1 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r+\\lambda\\times\\left(\\delta_{j}^c+\\delta_{i}^r\\right)\\right)^2\\right)\\right)-4\\times n\\times\\lambda^2}}{2\\times n-R_{max}-C_{max}}$$
    
    With:
    $$\\delta_{i,j}^c=\\begin{cases} 1 & \\text{ if } j \\text{ is column index for } F_{i,max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i,j}^r=\\begin{cases} 1 & \\text{ if } i \\text{ is row index for } F_{max,j} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{j}^c=\\begin{cases} 1 & \\text{ if } j \\text{ is index for } C_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i}^r=\\begin{cases} 1 & \\text{ if } i \\text{ is index for } R_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    
    The approximate T-values (z-values):
    $$T\\left(\\lambda_{Y|X}\\right)=\\frac{\\lambda_{Y|X}}{ASE\\left(\\lambda_{Y|X}\\right)_0}$$
    $$T\\left(\\lambda_{X|Y}\\right)=\\frac{\\lambda_{X|Y}}{ASE\\left(\\lambda_{X|Y}\\right)_0}$$
    $$T\\left(\\lambda\\right)=\\frac{\\lambda}{ASE\\left(\\lambda\\right)_0}$$
    
    The p-value (significance):
    $$T\\sim N\\left(0,1\\right)$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|T\\right|\\right)\\right)$$
    
    Hartwig (1973) proposed three options in case a multi-modal situation occurs: choose at random, choose the one with the largest ASE, or average them. Hartwig also proposed to use this for the test, but Gray and Campbell (1975) point out that this is incorrect, and ASE_0 should be used.
    
    Note that the \\(ASE\\left(\\lambda_{X|Y}\\right)_1\\) formula differs from the one SPSS shows in its own documentation, but it is the formula SPSS actually uses. See https://lists.gnu.org/archive/html/pspp-dev/2014-05/msg00007.html
    
    *Symbols used:*
    
    * \\(F_{i,j}\\), the absolute frequency (observed count) from row i and column j
    * \\(c\\), the number of columns
    * \\(r\\), the number of rows
    * \\(R_i\\), row total of row i, it can be calculated using \\(R_i=\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(C_j\\), column total of column j, it can be calculated using \\(C_j=\\sum_{i=1}^{r}F_{i,j}\\)
    * \\(n\\) = the total number of cases, it can be calculated in various ways, \\(n=\\sum_{j=1}^{c}C_j=\\sum_{i=1}^{r}R_i=\\sum_{i=1}^{r}\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(F_{i,max}\\) is the maximum count of row i. i.e. \\(F_{i,max}=\\max\\left\\{F_{i,1},F_{i,2},\\ldots,F_{i,c}\\right\\}\\)
    * \\(F_{max,j}\\) is the maximum count of column j, i.e. \\(F_{max,j}=\\max\\left\\{F_{1,j},F_{2,j},\\ldots,F_{r,j}\\right\\}\\)
    * \\(R_{max}\\) is the maximum of the row totals, i.e. \\(R_{max}=\\max\\left\\{R_1,R_2,\\ldots,R_r\\right\\}\\)
    * \\(C_{max}\\) is the maximum of the column totals, i.e. \\(C_{max}=\\max\\left\\{C_1,C_2,\\ldots,C_c\\right\\}\\)
    * \\(\\Phi\\left(\\ldots\\right)\\), the cumulative distribution function of the standard normal distribution
    
    References
    ----------
    Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. *Journal of the American Statistical Association, 49*(268), 732–764. doi:10.2307/2281536
    
    Gray, L. N., & Campbell, R. (1975). Statistical significance of the Lambda coefficients: A comment. *Behavioral Science, 20*(4), 258–259. doi:10.1002/bs.3830200407
    
    Hartwig, F. (1973). Statistical significance of the lambda coefficients. *Behavioral Science, 18*(4), 307–310. doi:10.1002/bs.3830180409
    
    SPSS. (2006). SPSS 15.0 algorithms.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    
    #Note: lambda itself is tie-invariant (fim, fmj, rm and cm do not depend on
    #the ties option); the ties option only changes the delta indicators
    #used in the ASE calculations.
    
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    
    
    #basic counts
    nrows = ct.shape[0]-1
    ncols =  ct.shape[1]-1
    n = ct.iloc[nrows, ncols]
    
    #the margin totals
    rs = ct.iloc[0:nrows, ncols]
    cs = ct.iloc[nrows, 0:ncols]
    
    rm = max(rs)
    cm = max(cs)
    
    rowMaxColIndex = [0]*nrows
    rowTotalsMaxRowIndex = 0
    fim = [0]*nrows
    rowMaxFreq = [0]*nrows
    rowTotalsMaxRowFreq = 0
    for i in range(0, nrows):
        rowMax = 0
        for j in range(0, ncols):
            if ties=="first" and ct.iloc[i, j] > rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="last" and ct.iloc[i, j] >= rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="average" and ct.iloc[i, j] >= rowMax:
                if ct.iloc[i, j] > rowMax:
                    rowMaxFreq[i] = 1
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
                
            elif ties=="random":
                if ct.iloc[i, j] > rowMax:
                    rowMaxColIndex[i] = j
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                    rowMaxFreq[i] = 1
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
                    if random() < (1 / rowMaxFreq[i]):
                        rowMaxColIndex[i] = j
                    
        if rs[i]==rm:
            if ties=="last":
                rowTotalsMaxRowIndex = i
            elif ties=="first":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
            elif ties=="random":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
                else:
                    if random() < (1 / rowTotalsMaxRowFreq):
                        rowTotalsMaxRowIndex = i
                        
            #ties = average is not needed since it does not use the rowTotalsMaxRowIndex
            rowTotalsMaxRowFreq = rowTotalsMaxRowFreq + 1
    
    #same for the columns
    colMaxRowIndex = [0]*ncols
    colTotalsMaxColIndex = 0
    fmj = [0]*ncols
    colMaxFreq = [0]*ncols
    colTotalsMaxColFreq = 0
    for j in range(0, ncols):
        colMax = 0
        for i in range(0, nrows):    
            if ties=="first" and ct.iloc[i, j] > colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="last" and ct.iloc[i, j] >= colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="average" and ct.iloc[i, j] >= colMax:
                if ct.iloc[i, j] > colMax:
                    colMaxFreq[j] = 1
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
                
            elif ties=="random":
                if ct.iloc[i, j] > colMax:
                    colMaxRowIndex[j] = i
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                    colMaxFreq[j] = 1
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
                    if random() < (1 / colMaxFreq[j]):
                        colMaxRowIndex[j] = i
                    
        if cs[j]==cm:
            if ties=="last":
                colTotalsMaxColIndex = j
            elif ties=="first":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
            elif ties=="random":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
                else:
                    if random() < (1 / colTotalsMaxColFreq):
                        colTotalsMaxColIndex = j
                        
            #ties = average is not needed since it does not use the colTotalsMaxColIndex
            colTotalsMaxColFreq = colTotalsMaxColFreq + 1
    
    dijc = pd.DataFrame()
    for i in range(0, nrows):
        for j in range(0, ncols):
            if ties=="average":
                if ct.iloc[i, j]==fim[i]:
                    dijc.at[i, j] = 1 / rowMaxFreq[i]
                else:
                    dijc.at[i, j] = 0
            else:
                if j==rowMaxColIndex[i]:
                    dijc.at[i, j] = 1
                else:
                    dijc.at[i, j] = 0
    
    djc = [0]*ncols
    for j in range(0, ncols):
        if cs[j]==cm:
            if ties=="average":
                djc[j] = 1 / colTotalsMaxColFreq
            elif colTotalsMaxColIndex==j:
                djc[j] = 1
        else:
            djc[j] = 0
    
    
    dijr = pd.DataFrame()
    for j in range(0, ncols):
        for i in range(0, nrows):
            if ties=="average":
                if ct.iloc[i, j]==fmj[j]:
                    dijr.at[i, j] = 1 / colMaxFreq[j]
                else:
                    dijr.at[i, j] = 0
            else:
                if i==colMaxRowIndex[j]:
                    dijr.at[i, j] = 1
                else:
                    dijr.at[i, j] = 0
    
    dirr = [0]*nrows
    for i in range(0, nrows):
        if rs[i]==rm:
            if ties=="average":
                dirr[i] = 1 / rowTotalsMaxRowFreq
            elif rowTotalsMaxRowIndex==i:
                dirr[i] = 1
        else:
            dirr[i] = 0
            
    #ase calculations
    sfim = 0
    for i in range(0, nrows):
        sfim = sfim + fim[i]
    
    sfmj = 0
    for j in range(0, ncols):
        sfmj = sfmj + fmj[j]
    
    lambdyx = (sfim - cm) / (n - cm)
    lambdxy = (sfmj - rm) / (n - rm)
    lambd = (sfim + sfmj - cm - rm) / (2 * n - rm - cm)
    
    #aseyx0 and aseyx1
    ase0yx = 0
    ase1yx = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0yx = ase0yx + ct.iloc[i, j] * (dijc.iloc[i, j] - djc[j])**2
            ase1yx = ase1yx + ct.iloc[i, j] * dijc.iloc[i, j] * djc[j]
    
    ase0yx = (ase0yx - (sfim - cm)**2 / n)**0.5 / (n - cm)
    ase1yx = (((n - sfim) * (sfim + cm - 2 * ase1yx)) / ((n - cm)**3))**0.5
    
    #asexy0 and asexy1
    ase0xy = 0
    ase1xy = 0
    for j in range(0, ncols):
        for i in range(0, nrows):
            ase0xy = ase0xy + ct.iloc[i, j] * (dijr.iloc[i, j] - dirr[i])**2
            ase1xy = ase1xy + ct.iloc[i, j] * dijr.iloc[i, j] * dirr[i]
    
    ase0xy = (ase0xy - (sfmj - rm)**2 / n)**0.5 / (n - rm)
    ase1xy = (((n - sfmj) * (sfmj + rm - 2 * ase1xy)) / ((n - rm)**3))**0.5
    
    #ase0 and ase1
    ase0 = 0
    ase1 = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0 = ase0 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i])**2
            ase1 = ase1 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i] + lambd * (djc[j] + dirr[i]))**2
    ase0 = (ase0 - (sfim + sfmj - cm - rm)**2 / n)**0.5 / (2 * n - rm - cm)
    ase1 = (ase1 - 4 * n * lambd**2)**0.5 / (2 * n - rm - cm)
    
    Z = lambd / ase0
    Zyx = lambdyx / ase0yx
    Zxy = lambdxy / ase0xy
    
    p = 2 * (1 - NormalDist().cdf(abs(Z))) 
    pyx = 2 * (1 - NormalDist().cdf(abs(Zyx)))
    pxy = 2 * (1 - NormalDist().cdf(abs(Zxy)))
    
    #the results
    ver = ["symmetric", "field1", "field2"]
    ns = [n, n, n]
    ls = [lambd, lambdxy, lambdyx]
    ase0s = [ase0, ase0xy, ase0yx]
    ase1s = [ase1, ase1xy, ase1yx]
    zs = [Z, Zxy, Zyx]
    pvalues = [p, pxy, pyx]
    
    colNames = ["dependent", "value", "n", "ASE_0", "ASE_1", "statistic", "p-value"]
    results = pd.DataFrame(list(zip(ver, ls, ns, ase0s, ase1s, zs, pvalues)), columns=colNames)
    
    return results

Functions

def es_goodman_kruskal_lambda(field1, field2, categories1=None, categories2=None, ties='first')

Goodman-Kruskal Lambda

This effect size measure compares the observed frequencies with the modal frequencies. Unlike measures such as Cramér's V, Tschuprow's T, and Cohen's w, it therefore does not make use of the chi-square value.

A value of zero would indicate no association (= independence) and a value of one a perfect association (= dependence).

Parameters

field1 : list or pandas series
the first categorical field
field2 : list or pandas series
the second categorical field
categories1 : list or dictionary, optional
order and/or selection for categories of field1
categories2 : list or dictionary, optional
order and/or selection for categories of field2
ties : string, optional
how to deal with tied modal scores. Either "first" (default), "last", "average" or "random"

Returns

A dataframe with:
 
  • dependent, the field used as dependent variable
  • value, the lambda value
  • n, the sample size
  • ASE_0, the asymptotic standard error assuming the null hypothesis
  • ASE_1, the asymptotic standard error not assuming the null hypothesis
  • statistic, the z-value
  • p-value, the significance (p-value)

Notes

The formula used is (Goodman & Kruskal, 1954, p. 743):
$$\lambda_{Y|X} = \frac{\left(\sum_{i=1}^r F_{i, max}\right) - C_{max}}{n - C_{max}}$$
$$\lambda_{X|Y} = \frac{\left(\sum_{j=1}^c F_{max, j}\right) - R_{max}}{n - R_{max}}$$
$$\lambda = \frac{\left(\sum_{i=1}^r F_{i, max}\right) + \left(\sum_{j=1}^c F_{max, j}\right) - C_{max} - R_{max}}{2\times n - R_{max} - C_{max}}$$
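As a quick numeric check of these three formulas, consider a made-up 2×2 table (the counts below are hypothetical, chosen only for illustration):

```python
# Hypothetical 2x2 cross table F (rows x columns)
F = [[20, 5],
     [3, 12]]
n = sum(map(sum, F))                           # total cases: 40
row_totals = [sum(row) for row in F]           # R_i: [25, 15]
col_totals = [sum(col) for col in zip(*F)]     # C_j: [23, 17]
rmax = max(row_totals)                         # R_max = 25
cmax = max(col_totals)                         # C_max = 23
sfim = sum(max(row) for row in F)              # sum of row maxima = 32
sfmj = sum(max(col) for col in zip(*F))        # sum of column maxima = 32

lambda_yx = (sfim - cmax) / (n - cmax)         # (32-23)/(40-23) = 9/17
lambda_xy = (sfmj - rmax) / (n - rmax)         # (32-25)/(40-25) = 7/15
lambda_sym = (sfim + sfmj - cmax - rmax) / (2 * n - rmax - cmax)  # 16/32 = 0.5
```

Here the symmetric lambda works out to exactly 0.5, with the two directional versions on either side of it.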

The asymptotic standard errors are given by:
$$ASE\left(\lambda_{Y|X}\right)_0 = \frac{\sqrt{\left(\sum_{i,j}\left(F_{i,j}\times\left(\delta_{i,j}^c - \delta_{j}^c\right)^2\right)\right)-\frac{\left(\left(\sum_{i=1}^r F_{i,max}\right)-C_{max}\right)^2}{n}}}{n-C_{max}}$$
$$ASE\left(\lambda_{X|Y}\right)_0 = \frac{\sqrt{\left(\sum_{i,j}\left(F_{i,j}\times\left(\delta_{i,j}^r - \delta_{i}^r\right)^2\right)\right)-\frac{\left(\left(\sum_{j=1}^c F_{max,j}\right)-R_{max}\right)^2}{n}}}{n-R_{max}}$$
$$ASE\left(\lambda\right)_0 = \frac{\sqrt{\left(\sum_{i,j}\left(F_{i,j}\times\left(\delta_{i,j}^c+\delta_{i,j}^r-\delta_{j}^c-\delta_{i}^r\right)^2\right)\right)-\frac{\left(\left(\sum_{i=1}^r F_{i,max}\right)+\left(\sum_{j=1}^c F_{max,j}\right)-C_{max}-R_{max}\right)^2}{n}}}{2\times n - R_{max} - C_{max}}$$

$$ASE\left(\lambda_{Y|X}\right)_1 = \sqrt{\frac{\left(n-\sum_{i=1}^r F_{i,max}\right)\times\left(\left(\sum_{i=1}^r F_{i,max}\right)+C_{max}-2\times\sum_{i,j}\left(F_{i,j}\times\delta_{i,j}^c \times\delta_{j}^c\right)\right)}{\left(n - C_{max}\right)^3}}$$
$$ASE\left(\lambda_{X|Y}\right)_1 = \sqrt{\frac{\left(n-\sum_{j=1}^c F_{max,j}\right)\times\left(\left(\sum_{j=1}^c F_{max,j}\right)+R_{max}-2\times\sum_{i,j}\left(F_{i,j}\times\delta_{i,j}^r \times\delta_{i}^r\right)\right)}{\left(n - R_{max}\right)^3}}$$
$$ASE\left(\lambda\right)_1 = \frac{\sqrt{\left(\sum_{i,j}\left(F_{i,j}\times\left(\delta_{i,j}^c+\delta_{i,j}^r-\delta_{j}^c-\delta_{i}^r+\lambda\times\left(\delta_{j}^c+\delta_{i}^r\right)\right)^2\right)\right)-4\times n\times\lambda^2}}{2\times n-R_{max}-C_{max}}$$
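The first ASE_0 formula can also be sketched numerically. Everything below (the 2×2 table, its maxima, and the indices) is hypothetical and hard-coded purely for the sketch:

```python
# Hypothetical 2x2 table: row maxima are 20 (column 0) and 12 (column 1),
# the largest column total is C_max = 23 in column 0, and n = 40.
F = [[20, 5],
     [3, 12]]
n = 40
sfim = 32          # sum of the row maxima F_{i,max}
cmax = 23          # C_max, the largest column total
cmax_col = 0       # column index of C_max (delta_j^c = 1 there)

# delta_{i,j}^c = 1 where j is the column of the row maximum
row_max_col = [row.index(max(row)) for row in F]   # [0, 1]

# sum of F_{i,j} * (delta_{i,j}^c - delta_j^c)^2 over all cells
s = 0.0
for i, row in enumerate(F):
    for j, f in enumerate(row):
        dijc = 1 if j == row_max_col[i] else 0
        djc = 1 if j == cmax_col else 0
        s += f * (dijc - djc) ** 2

ase0_yx = (s - (sfim - cmax) ** 2 / n) ** 0.5 / (n - cmax)
```

The inner sum collects 15 here, giving ASE(λ_{Y|X})_0 ≈ 0.212 for this table.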

With:
$$\delta_{i,j}^c=\begin{cases} 1 & \text{ if } j \text{ is column index for } F_{i,max} \\ 0 & \text{ otherwise } \end{cases}$$
$$\delta_{i,j}^r=\begin{cases} 1 & \text{ if } i \text{ is row index for } F_{max,j} \\ 0 & \text{ otherwise } \end{cases}$$
$$\delta_{j}^c=\begin{cases} 1 & \text{ if } j \text{ is index for } C_{max} \\ 0 & \text{ otherwise } \end{cases}$$
$$\delta_{i}^r=\begin{cases} 1 & \text{ if } i \text{ is index for } R_{max} \\ 0 & \text{ otherwise } \end{cases}$$

The approximate T-values (z-values):
$$T\left(\lambda_{Y|X}\right)=\frac{\lambda_{Y|X}}{ASE\left(\lambda_{Y|X}\right)_0}$$
$$T\left(\lambda_{X|Y}\right)=\frac{\lambda_{X|Y}}{ASE\left(\lambda_{X|Y}\right)_0}$$
$$T\left(\lambda\right)=\frac{\lambda}{ASE\left(\lambda\right)_0}$$

The p-value (significance):
$$T\sim N\left(0,1\right)$$
$$sig. = 2\times\left(1 - \Phi\left(\left|T\right|\right)\right)$$
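Turning a lambda and its ASE_0 into the two-sided p-value takes only the standard library; the lambda and ASE values below are made-up placeholders:

```python
from statistics import NormalDist

lam = 0.5    # hypothetical lambda value
ase0 = 0.12  # hypothetical ASE_0
z = lam / ase0                            # the T (z) statistic
p = 2 * (1 - NormalDist().cdf(abs(z)))    # two-sided significance
```

With these placeholder values z is well above 4, so the p-value is far below conventional significance levels.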

Hartwig (1973) proposed three options in case a multi-modal situation occurs: choose at random, choose the one with the largest ASE, or average them. Hartwig also proposed to use this for the test, but Gray and Campbell (1975) point out that this is incorrect, and ASE_0 should be used.

Note that the ASE\left(\lambda_{X|Y}\right)_1 formula differs from the one SPSS shows in its own documentation, but it is the formula SPSS actually uses. See https://lists.gnu.org/archive/html/pspp-dev/2014-05/msg00007.html

Symbols used:

  • F_{i,j}, the absolute frequency (observed count) from row i and column j
  • c, the number of columns
  • r, the number of rows
  • R_i, row total of row i, it can be calculated using R_i=\sum_{j=1}^{c}F_{i,j}
  • C_j, column total of column j, it can be calculated using C_j=\sum_{i=1}^{r}F_{i,j}
  • n = the total number of cases, it can be calculated in various ways, n=\sum_{j=1}^{c}C_j=\sum_{i=1}^{r}R_i=\sum_{i=1}^{r}\sum_{j=1}^{c}F_{i,j}
  • F_{i,max} is the maximum count of row i. i.e. F_{i,max}=\max\left\{F_{i,1},F_{i,2},\ldots,F_{i,c}\right\}
  • F_{max,j} is the maximum count of column j, i.e. F_{max,j}=\max\left\{F_{1,j},F_{2,j},\ldots,F_{r,j}\right\}
  • R_{max} is the maximum of the row totals, i.e. R_{max}=\max\left\{R_1,R_2,\ldots,R_r\right\}
  • C_{max} is the maximum of the column totals, i.e. C_{max}=\max\left\{C_1,C_2,\ldots,C_c\right\}
  • \Phi\left(\ldots\right), the cumulative distribution function of the standard normal distribution
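The margin quantities above can be read off a pandas cross table directly; the two fields below are hypothetical (the package's own tab_cross adds a totals margin, but pandas.crosstab suffices for the symbols here):

```python
import pandas as pd

# two hypothetical categorical fields
x = ["a", "a", "a", "b", "b", "a", "b", "a"]
y = ["p", "p", "q", "q", "q", "p", "q", "p"]
ct = pd.crosstab(pd.Series(x, name="x"), pd.Series(y, name="y"))

R = ct.sum(axis=1)            # row totals R_i
C = ct.sum(axis=0)            # column totals C_j
n = int(ct.to_numpy().sum())  # n, the total number of cases
Rmax = int(R.max())           # R_max
Cmax = int(C.max())           # C_max
Fimax = ct.max(axis=1)        # row maxima F_{i,max}
Fmaxj = ct.max(axis=0)        # column maxima F_{max,j}
```

For this toy data n = 8, R_max = 5, C_max = 4, and both sums of maxima equal 7.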

References

Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. Journal of the American Statistical Association, 49(268), 732–764. doi:10.2307/2281536

Gray, L. N., & Campbell, R. (1975). Statistical significance of the Lambda coefficients: A comment. Behavioral Science, 20(4), 258–259. doi:10.1002/bs.3830200407

Hartwig, F. (1973). Statistical significance of the lambda coefficients. Behavioral Science, 18(4), 307–310. doi:10.1002/bs.3830180409

SPSS. (2006). SPSS 15.0 algorithms.

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Expand source code
def es_goodman_kruskal_lambda(field1, field2, categories1=None, categories2=None, ties="first"):
    '''
    Goodman-Kruskal Lambda
    ----------------------
    This effect size measure compares the frequencies with the modal frequencies. Unlike some other measures like Cramér's V, Tschuprow T, and Cohen w, this measure therefor does not make use of the chi-square value.
    
    A value of zero would indicate no association (= independence) and a value of one a perfect association (= dependence).
    
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories1 : list or dictionary, optional
        order and/or selection for categories of field1
    categories2 : list or dictionary, optional
        order and/or selection for categories of field2
    ties : string, optional
        how to deal with tied modal scores. Either "first" (default), "last", "average" or "random"
        
    Returns
    -------
    A dataframe with:
    
    * *dependent*, the field used as dependent variable
    * *value*, the lambda value
    * *n*, the sample size
    * *ASE_0*, the asymptotic standard error assuming the null hypothesis
    * *ASE_1*, the asymptotic standard error not assuming the null hypothesis
    * *statistic*, the z-value
    * *p-value*, the significance (p-value)
    
    Notes
    -----
    The formula used is (Goodman & Kruskal, 1954, p. 743):
    $$\\lambda_{Y|X} = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) - C_{max}}{n - C_{max}}$$
    $$\\lambda_{X|Y} = \\frac{\\left(\\sum_{j=1}^c F_{max, j}\\right) - R_{max}}{n - R_{max}}$$
    $$\\lambda = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) + \\left(\\sum_{j=1}^c F_{max, j}\\right) - C_{max} - R_{max}}{2\\times n - R_{max} - C_{max}}$$
    
    The asymptotic errors are given by:
    $$ASE\\left(\\lambda_{Y|X}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta _{i,j}^c - \\delta_{j}^c\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)-C_{max}\\right)^2}{n}\\right)}}{n-C_{max}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta _{i,j}^r - \\delta_{i}^r\\right)\\right)-\\frac{\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)-R_{max}\\right)^2}{n}\\right)}}{n-R_{max}}$$
    $$ASE\\left(\\lambda\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+\\left(\\sum_{j=1}^c F_{max,j}\\right)-C_{max}-R_{max}\\right)^2}{n}}}{2\\times n - R_{max} - C_{max}}$$
    
    $$ASE\\left(\\lambda_{Y|X}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{i=1}^r F_{i,max}\\right)\\times\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+C_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^c \\times\\delta_{j}^c\\right)\\right)}{\\left(n - C_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{j=1}^c F_{max,j}\\right)\\times\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)+R_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^r \\times\\delta_{i}^r\\right)\\right)}{\\left(n - R_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda\\right)_1 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r+\\lambda\\times\\left(\\delta_{j}^c+\\delta_{i}^r\\right)\\right)^2\\right)\\right)-4\\times n\\times\\lambda^2}}{2\\times n-R_{max}-C{max}}$$
    
    With:
    $$\\delta_{i,j}^c\\begin{cases} 1 & \\text{ if } j \\text{ is column index for } F_{i,max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i,j}^r\\begin{cases} 1 & \\text{ if } i \\text{ is row index for } F_{max,j} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{j}^c\\begin{cases} 1 & \\text{ if } j \\text{ is index for } C_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i}^r\\begin{cases} 1 & \\text{ if } i \\text{ is index for } R_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    
    The approximate T-values (z-values):
    $$T\\left(\\delta_{Y|X}\\right)=\\frac{\\lambda_{Y|X}}{ASE\\left(\\lambda_{Y|X}\\right)_0}$$
    $$T\\left(\\delta_{X|Y}\\right)=\\frac{\\lambda_{X|Y}}{ASE\\left(\\lambda_{X|Y}\\right)_0}$$
    $$T\\left(\\delta\\right)=\\frac{\\lambda}{ASE\\left(\\lambda\\right)_0}$$
    
    The p-value (significance):
    $$T\\sim N\\left(0,1\\right)$$
    $$sig. = 1 - 2\\times\Phi\\left(T\\right)$$
    
    Hartwig (1973) proposed two options in case multi-modal situation occurs: choose random, choose the largest ASE, or average them. Hartwig also proposed to use this for the test, but Gray and Campbell (1975) point out that this is incorrect, and ASE_0 should be used.
    
    Note the \\(ASE\\left(\\lambda_{X|Y}\\right)_1\\) formula is different than the one SPSS uses it it’s own documentation, but is actually the formula that is being used. https://lists.gnu.org/archive/html/pspp-dev/2014-05/msg00007.html
    
    *Symbols used:*
    
    * \\(F_{i,j}\\), the absolute frequency (observed count) from row i and column j.
        * \\(c\\), the number of columns
    * \\(r\\), the number of rows
    * \\(R_i\\), row total of row i, it can be calculated using \\(R_i=\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(C_j\\), column total of column j, it can be calculated using \\(C_j=\\sum_{i=1}^{r}F_{i,j}\\)
    * \\(n\\) = the total number of cases, it can be calculated in various ways, \\(n=\\sum_{j=1}^{c}C_j=\\sum_{i=1}^{r}R_i=\\sum_{i=1}^{r}\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(F_{i,max}\\) is the maximum count of row i. i.e. \\(F_{i,max}=\\max\\left\\{F_{i,1},F_{i,2},\\ldots,F_{i,c}\\right\\}\\)
    * \\(F_{max,j}\\) is the maximum count of column j, i.e. \\(F_{max,j}=\\max\\left\\{F_{1,j},F_{2,j},\\ldots,F_{r,j}\\right\\}\\)
    * \\(R_{max}\\) is the maximum of the row totals, i.e. \\(R_{max}=\\max\\left\\{R_1,R_2,\\ldots,R_r\\right\\}\\)
    * \\(C_{max}\\) is the maximum of the column totals, i.e. \\(C_{max}=\\max\\left\\{C_1,C_2,\\ldots,C_c\\right\\}\\)
    * \\(\\Phi\\left(\\ldots\\right)\\), the cumulative density function of the standard normal distribution
    
    References
    ----------
    Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. *Journal of the American Statistical Association, 49*(268), 732–764. doi:10.2307/2281536
    
    Gray, L. N., & Campbell, R. (1975). Statistical significance of the Lambda coefficients: A comment. *Behavioral Science, 20*(4), 258–259. doi:10.1002/bs.3830200407
    
    Hartwig, F. (1973). Statistical significance of the lambda coefficients. *Behavioral Science, 18*(4), 307–310. doi:10.1002/bs.3830180409
    
    SPSS. (2006). SPSS 15.0 algorithms.
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    
    #The average method only averages column and row maximums (rm and cm),
    #it uses "first" for ties in fim and fmj.
    
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    
    
    #basic counts
    nrows = ct.shape[0]-1
    ncols =  ct.shape[1]-1
    n = ct.iloc[nrows, ncols]
    
    #the margin totals
    rs = ct.iloc[0:nrows, ncols]
    cs = ct.iloc[nrows, 0:ncols]
    
    rm = max(rs)
    cm = max(cs)
    
    rowMaxColIndex = [0]*nrows
    rowTotalsMaxRowIndex = 0
    fim = [0]*nrows
    rowMaxFreq = [0]*nrows
    rowTotalsMaxRowFreq = 0
    for i in range(0, nrows):
        rowMax = 0
        for j in range(0, ncols):
            if ties=="first" and ct.iloc[i, j] > rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="last" and ct.iloc[i, j] >= rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="average" and ct.iloc[i, j] >= rowMax:
                if ct.iloc[i, j] > rowMax:
                    rowMaxFreq[i] = 1
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
                
            elif ties=="random":
                if ct.iloc[i, j] > rowMax:
                    rowMaxColIndex[i] = j
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                    rowMaxFreq[i] = 1
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
                    if random() < (1 / rowMaxFreq[i]):
                        rowMaxColIndex[i] = j
                    
        if rs[i]==rm:
            if ties=="last":
                rowTotalsMaxRowIndex = i
            elif ties=="first":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
            elif ties=="random":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
                else:
                    if random() < (1 / rowTotalsMaxRowFreq):
                        rowTotalsMaxRowIndex = i
                        
            #ties = average is not needed since it does not use the rowTotalsMaxRowIndex
            rowTotalsMaxRowFreq = rowTotalsMaxRowFreq + 1
    
    #same for the columns
    colMaxRowIndex = [0]*ncols
    colTotalsMaxColIndex = 0
    fmj = [0]*ncols
    colMaxFreq = [0]*ncols
    colTotalsMaxColFreq = 0
    for j in range(0, ncols):
        colMax = 0
        for i in range(0, nrows):    
            if ties=="first" and ct.iloc[i, j] > colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="last" and ct.iloc[i, j] >= colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="average" and ct.iloc[i, j] >= colMax:
                if ct.iloc[i, j] > colMax:
                    colMaxFreq[j] = 1
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
                
            elif ties=="random":
                if ct.iloc[i, j] > colMax:
                    colMaxRowIndex[j] = i
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                    colMaxFreq[j] = 1
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
                    if random() < (1 / colMaxFreq[j]):
                        colMaxRowIndex[j] = i
                    
        if cs[j]==cm:
            if ties=="last":
                colTotalsMaxColIndex = j
            elif ties=="first":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
            elif ties=="random":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
                else:
                    #reservoir sampling: at this point colTotalsMaxColFreq counts the earlier ties,
                    #so the current column is the (colTotalsMaxColFreq + 1)-th tie and should replace
                    #the pick with probability 1/(colTotalsMaxColFreq + 1)
                    if random() < (1 / (colTotalsMaxColFreq + 1)):
                        colTotalsMaxColIndex = j
                        
            #ties = average is not needed since it does not use the colTotalsMaxColIndex
            colTotalsMaxColFreq = colTotalsMaxColFreq + 1
    
    dijc = pd.DataFrame()
    for i in range(0, nrows):
        for j in range(0, ncols):
            if ties=="average":
                if ct.iloc[i, j]==fim[i]:
                    dijc.at[i, j] = 1 / rowMaxFreq[i]
                else:
                    dijc.at[i, j] = 0
            else:
                if j==rowMaxColIndex[i]:
                    dijc.at[i, j] = 1
                else:
                    dijc.at[i, j] = 0
    
    djc = [0]*ncols
    for j in range(0, ncols):
        if cs[j]==cm:
            if ties=="average":
                djc[j] = 1 / colTotalsMaxColFreq
            elif colTotalsMaxColIndex==j:
                djc[j] = 1
        else:
            djc[j] = 0
    
    
    dijr = pd.DataFrame()
    for j in range(0, ncols):
        for i in range(0, nrows):
            if ties=="average":
                if ct.iloc[i, j]==fmj[j]:
                    dijr.at[i, j] = 1 / colMaxFreq[j]
                else:
                    dijr.at[i, j] = 0
            else:
                if i==colMaxRowIndex[j]:
                    dijr.at[i, j] = 1
                else:
                    dijr.at[i, j] = 0
    
    dirr = [0]*nrows
    for i in range(0, nrows):
        if rs[i]==rm:
            if ties=="average":
                dirr[i] = 1 / rowTotalsMaxRowFreq
            elif rowTotalsMaxRowIndex==i:
                dirr[i] = 1
        else:
            dirr[i] = 0
            
    #ase calculations
    sfim = 0
    for i in range(0, nrows):
        sfim = sfim + fim[i]
    
    sfmj = 0
    for j in range(0, ncols):
        sfmj = sfmj + fmj[j]
    
    lambdyx = (sfim - cm) / (n - cm)
    lambdxy = (sfmj - rm) / (n - rm)
    lambd = (sfim + sfmj - cm - rm) / (2 * n - rm - cm)
    
    #aseyx0 and aseyx1
    ase0yx = 0
    ase1yx = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0yx = ase0yx + ct.iloc[i, j] * (dijc.iloc[i, j] - djc[j])**2
            ase1yx = ase1yx + ct.iloc[i, j] * dijc.iloc[i, j] * djc[j]
    
    ase0yx = (ase0yx - (sfim - cm)**2 / n)**0.5 / (n - cm)
    ase1yx = (((n - sfim) * (sfim + cm - 2 * ase1yx)) / ((n - cm)**3))**0.5
    
    #asexy0 and asexy1
    ase0xy = 0
    ase1xy = 0
    for j in range(0, ncols):
        for i in range(0, nrows):
            ase0xy = ase0xy + ct.iloc[i, j] * (dijr.iloc[i, j] - dirr[i])**2
            ase1xy = ase1xy + ct.iloc[i, j] * dijr.iloc[i, j] * dirr[i]
    
    ase0xy = (ase0xy - (sfmj - rm)**2 / n)**0.5 / (n - rm)
    ase1xy = (((n - sfmj) * (sfmj + rm - 2 * ase1xy)) / ((n - rm)**3))**0.5
    
    #ase0 and ase1
    ase0 = 0
    ase1 = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0 = ase0 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i])**2
            ase1 = ase1 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i] + lambd * (djc[j] + dirr[i]))**2
    ase0 = (ase0 - (sfim + sfmj - cm - rm)**2 / n)**0.5 / (2 * n - rm - cm)
    ase1 = (ase1 - 4 * n * lambd**2)**0.5 / (2 * n - rm - cm)
    
    Z = lambd / ase0
    Zyx = lambdyx / ase0yx
    Zxy = lambdxy / ase0xy
    
    p = 2 * (1 - NormalDist().cdf(abs(Z))) 
    pyx = 2 * (1 - NormalDist().cdf(abs(Zyx)))
    pxy = 2 * (1 - NormalDist().cdf(abs(Zxy)))
    
    #the results
    ver = ["symmetric", "field1", "field2"]
    ns = [n, n, n]
    ls = [lambd, lambdxy, lambdyx]
    ase0s = [ase0, ase0xy, ase0yx]
    ase1s = [ase1, ase1xy, ase1yx]
    zs = [Z, Zxy, Zyx]
    pvalues = [p, pxy, pyx]
    
    colNames = ["dependent", "value", "n", "ASE_0", "ASE_1", "statistic", "p-value"]
    results = pd.DataFrame(list(zip(ver, ls, ns, ase0s, ase1s, zs, pvalues)), columns=colNames)
    
    return results
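As a standalone sanity check of the three lambda formulas from the docstring (not part of the module source), the row and column modal frequencies and the largest marginal totals can be computed directly from a made-up cross table with plain pandas:

```python
import pandas as pd

# a hypothetical 2x3 cross table of counts (rows of Y, columns of X)
ct = pd.DataFrame([[10, 5, 3],
                   [2, 8, 12]])

n = int(ct.to_numpy().sum())          # sample size
fim = ct.max(axis=1)                  # row modal frequencies F_{i,max}
fmj = ct.max(axis=0)                  # column modal frequencies F_{max,j}
rm = ct.sum(axis=1).max()             # largest row total R_max
cm = ct.sum(axis=0).max()             # largest column total C_max

lambd_yx = (fim.sum() - cm) / (n - cm)                           # Y dependent
lambd_xy = (fmj.sum() - rm) / (n - rm)                           # X dependent
lambd = (fim.sum() + fmj.sum() - cm - rm) / (2 * n - rm - cm)    # symmetric
```

For this table the values are 7/25, 8/18, and 15/43 respectively, matching what `es_goodman_kruskal_lambda` computes via `sfim`, `sfmj`, `rm` and `cm`.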
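The `ties="random"` branches above use single-pass reservoir sampling to pick a uniformly random index among tied maxima. A minimal standalone sketch of that rule (the helper name `random_argmax` is illustrative, not part of the module):

```python
from random import random

def random_argmax(values):
    """Return a uniformly random index among the tied maxima of values,
    in a single pass (reservoir sampling), mirroring the ties="random" rule."""
    best, best_idx, ties = None, None, 0
    for idx, v in enumerate(values):
        if best is None or v > best:
            # new maximum: reset the tie count and take this index
            best, best_idx, ties = v, idx, 1
        elif v == best:
            # the k-th tied value replaces the current pick with probability 1/k,
            # which leaves every tied index equally likely overall
            ties += 1
            if random() < 1 / ties:
                best_idx = idx
    return best_idx
```

The key invariant is that after processing k tied values, each of them holds the pick with probability 1/k, so no second pass over the data is needed.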