Module stikpetP.effect_sizes.eff_size_goodman_kruskal_lambda
Expand source code
import pandas as pd
from statistics import NormalDist
from random import random
from ..other.table_cross import tab_cross
def es_goodman_kruskal_lambda(field1, field2, categories1=None, categories2=None, ties="first"):
    '''
    Goodman-Kruskal Lambda
    ----------------------
    This effect size measure compares the frequencies with the modal frequencies. Unlike some other measures, such as Cramér's V, Tschuprow's T, and Cohen's w, this measure therefore does not make use of the chi-square value.
    A value of zero indicates no association (= independence) and a value of one a perfect association (= dependence).
    Parameters
    ----------
    field1 : list or pandas series
        the first categorical field
    field2 : list or pandas series
        the second categorical field
    categories1 : list or dictionary, optional
        order and/or selection for categories of field1
    categories2 : list or dictionary, optional
        order and/or selection for categories of field2
    ties : string, optional
        how to deal with tied modal scores. Either "first" (default), "last", "average" or "random"
    Returns
    -------
    A dataframe with:
    * *dependent*, the field used as dependent variable
    * *value*, the lambda value
    * *n*, the sample size
    * *ASE_0*, the asymptotic standard error assuming the null hypothesis
    * *ASE_1*, the asymptotic standard error not assuming the null hypothesis
    * *statistic*, the z-value
    * *p-value*, the significance (p-value)
    Notes
    -----
    The formula used is (Goodman & Kruskal, 1954, p. 743):
    $$\\lambda_{Y|X} = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) - C_{max}}{n - C_{max}}$$
    $$\\lambda_{X|Y} = \\frac{\\left(\\sum_{j=1}^c F_{max, j}\\right) - R_{max}}{n - R_{max}}$$
    $$\\lambda = \\frac{\\left(\\sum_{i=1}^r F_{i, max}\\right) + \\left(\\sum_{j=1}^c F_{max, j}\\right) - C_{max} - R_{max}}{2\\times n - R_{max} - C_{max}}$$
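As a worked sketch of the formulas above, using a small hypothetical 2×2 table of counts (plain Python, no pandas needed):

```python
# hypothetical 2x2 table of counts (rows = X categories, columns = Y categories)
F = [[10, 20],
     [30, 5]]
n = sum(sum(row) for row in F)                 # total count: 65
sfim = sum(max(row) for row in F)              # sum of row maxima: 20 + 30 = 50
cols = list(zip(*F))                           # the columns of the table
sfmj = sum(max(col) for col in cols)           # sum of column maxima: 30 + 20 = 50
cmax = max(sum(col) for col in cols)           # largest column total: 40
rmax = max(sum(row) for row in F)              # largest row total: 35
lamb_yx = (sfim - cmax) / (n - cmax)           # (50 - 40) / (65 - 40) = 0.4
lamb_xy = (sfmj - rmax) / (n - rmax)           # (50 - 35) / (65 - 35) = 0.5
lamb = (sfim + sfmj - cmax - rmax) / (2 * n - rmax - cmax)  # 25 / 55 ≈ 0.4545
```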
    The asymptotic standard errors are given by:
    $$ASE\\left(\\lambda_{Y|X}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c - \\delta_{j}^c\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)-C_{max}\\right)^2}{n}}}{n-C_{max}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^r - \\delta_{i}^r\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)-R_{max}\\right)^2}{n}}}{n-R_{max}}$$
    $$ASE\\left(\\lambda\\right)_0 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r\\right)^2\\right)\\right)-\\frac{\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+\\left(\\sum_{j=1}^c F_{max,j}\\right)-C_{max}-R_{max}\\right)^2}{n}}}{2\\times n - R_{max} - C_{max}}$$
    $$ASE\\left(\\lambda_{Y|X}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{i=1}^r F_{i,max}\\right)\\times\\left(\\left(\\sum_{i=1}^r F_{i,max}\\right)+C_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^c \\times\\delta_{j}^c\\right)\\right)}{\\left(n - C_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda_{X|Y}\\right)_1 = \\sqrt{\\frac{\\left(n-\\sum_{j=1}^c F_{max,j}\\right)\\times\\left(\\left(\\sum_{j=1}^c F_{max,j}\\right)+R_{max}-2\\times\\sum_{i,j}\\left(F_{i,j}\\times\\delta_{i,j}^r \\times\\delta_{i}^r\\right)\\right)}{\\left(n - R_{max}\\right)^3}}$$
    $$ASE\\left(\\lambda\\right)_1 = \\frac{\\sqrt{\\left(\\sum_{i,j}\\left(F_{i,j}\\times\\left(\\delta_{i,j}^c+\\delta_{i,j}^r-\\delta_{j}^c-\\delta_{i}^r+\\lambda\\times\\left(\\delta_{j}^c+\\delta_{i}^r\\right)\\right)^2\\right)\\right)-4\\times n\\times\\lambda^2}}{2\\times n-R_{max}-C_{max}}$$
    With:
    $$\\delta_{i,j}^c = \\begin{cases} 1 & \\text{ if } j \\text{ is the column index for } F_{i,max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i,j}^r = \\begin{cases} 1 & \\text{ if } i \\text{ is the row index for } F_{max,j} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{j}^c = \\begin{cases} 1 & \\text{ if } j \\text{ is the index for } C_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    $$\\delta_{i}^r = \\begin{cases} 1 & \\text{ if } i \\text{ is the index for } R_{max} \\\\ 0 & \\text{ otherwise } \\end{cases}$$
    The approximate T-values (z-values):
    $$T\\left(\\lambda_{Y|X}\\right)=\\frac{\\lambda_{Y|X}}{ASE\\left(\\lambda_{Y|X}\\right)_0}$$
    $$T\\left(\\lambda_{X|Y}\\right)=\\frac{\\lambda_{X|Y}}{ASE\\left(\\lambda_{X|Y}\\right)_0}$$
    $$T\\left(\\lambda\\right)=\\frac{\\lambda}{ASE\\left(\\lambda\\right)_0}$$
    The p-value (significance):
    $$T\\sim N\\left(0,1\\right)$$
    $$sig. = 2\\times\\left(1 - \\Phi\\left(\\left|T\\right|\\right)\\right)$$
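Mirroring what the implementation does, the two-sided p-value can be obtained from a z-value with Python's `statistics.NormalDist`; the z-value below is a hypothetical placeholder:

```python
from statistics import NormalDist

z = 2.1  # hypothetical z-value (T)
p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value, about 0.0357
```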
    Hartwig (1973) proposed a few options in case a multi-modal situation occurs: choose one at random, choose the largest ASE, or average them. Hartwig also proposed to use this for the test, but Gray and Campbell (1975) point out that this is incorrect, and ASE_0 should be used.
    Note that the \\(ASE\\left(\\lambda_{X|Y}\\right)_1\\) formula differs from the one SPSS shows in its own documentation, but it is the formula that is actually being used. See https://lists.gnu.org/archive/html/pspp-dev/2014-05/msg00007.html
    *Symbols used:*
    * \\(F_{i,j}\\), the absolute frequency (observed count) from row i and column j
    * \\(c\\), the number of columns
    * \\(r\\), the number of rows
    * \\(R_i\\), the row total of row i, i.e. \\(R_i=\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(C_j\\), the column total of column j, i.e. \\(C_j=\\sum_{i=1}^{r}F_{i,j}\\)
    * \\(n\\), the total number of cases, which can be calculated in various ways: \\(n=\\sum_{j=1}^{c}C_j=\\sum_{i=1}^{r}R_i=\\sum_{i=1}^{r}\\sum_{j=1}^{c}F_{i,j}\\)
    * \\(F_{i,max}\\), the maximum count of row i, i.e. \\(F_{i,max}=\\max\\left\\{F_{i,1},F_{i,2},\\ldots,F_{i,c}\\right\\}\\)
    * \\(F_{max,j}\\), the maximum count of column j, i.e. \\(F_{max,j}=\\max\\left\\{F_{1,j},F_{2,j},\\ldots,F_{r,j}\\right\\}\\)
    * \\(R_{max}\\), the maximum of the row totals, i.e. \\(R_{max}=\\max\\left\\{R_1,R_2,\\ldots,R_r\\right\\}\\)
    * \\(C_{max}\\), the maximum of the column totals, i.e. \\(C_{max}=\\max\\left\\{C_1,C_2,\\ldots,C_c\\right\\}\\)
    * \\(\\Phi\\left(\\ldots\\right)\\), the cumulative distribution function of the standard normal distribution
    References
    ----------
    Goodman, L. A., & Kruskal, W. H. (1954). Measures of association for cross classifications. *Journal of the American Statistical Association, 49*(268), 732–764. doi:10.2307/2281536
    Gray, L. N., & Campbell, R. (1975). Statistical significance of the Lambda coefficients: A comment. *Behavioral Science, 20*(4), 258–259. doi:10.1002/bs.3830200407
    Hartwig, F. (1973). Statistical significance of the lambda coefficients. *Behavioral Science, 18*(4), 307–310. doi:10.1002/bs.3830180409
    SPSS. (2006). SPSS 15.0 algorithms.
    Author
    ------
    Made by P. Stikker
    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    #The average method only averages column and row maximums (rm and cm),
    #it uses "first" for ties in fim and fmj.
    #create the cross table
    ct = tab_cross(field1, field2, categories1, categories2, totals="include")
    #basic counts
    nrows = ct.shape[0]-1
    ncols = ct.shape[1]-1
    n = ct.iloc[nrows, ncols]
    #the margin totals
    rs = ct.iloc[0:nrows, ncols]
    cs = ct.iloc[nrows, 0:ncols]
    rm = max(rs)
    cm = max(cs)
    rowMaxColIndex = [0]*nrows
    rowTotalsMaxRowIndex = 0
    fim = [0]*nrows
    rowMaxFreq = [0]*nrows
    rowTotalsMaxRowFreq = 0
    for i in range(0, nrows):
        rowMax = 0
        for j in range(0, ncols):
            if ties=="first" and ct.iloc[i, j] > rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="last" and ct.iloc[i, j] >= rowMax:
                rowMaxColIndex[i] = j
                rowMax = ct.iloc[i, j]
                fim[i] = rowMax
            elif ties=="average" and ct.iloc[i, j] >= rowMax:
                if ct.iloc[i, j] > rowMax:
                    rowMaxFreq[i] = 1
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
            elif ties=="random":
                if ct.iloc[i, j] > rowMax:
                    rowMaxColIndex[i] = j
                    rowMax = ct.iloc[i, j]
                    fim[i] = rowMax
                    rowMaxFreq[i] = 1
                elif ct.iloc[i, j] == rowMax:
                    rowMaxFreq[i] = rowMaxFreq[i] + 1
                    if random() < (1 / rowMaxFreq[i]):
                        rowMaxColIndex[i] = j
        if rs[i]==rm:
            if ties=="last":
                rowTotalsMaxRowIndex = i
            elif ties=="first":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
            elif ties=="random":
                if rowTotalsMaxRowFreq==0:
                    rowTotalsMaxRowIndex = i
                else:
                    if random() < (1 / rowTotalsMaxRowFreq):
                        rowTotalsMaxRowIndex = i
            #ties = "average" is not needed since it does not use the rowTotalsMaxRowIndex
            rowTotalsMaxRowFreq = rowTotalsMaxRowFreq + 1
    #same for the columns
    colMaxRowIndex = [0]*ncols
    colTotalsMaxColIndex = 0
    fmj = [0]*ncols
    colMaxFreq = [0]*ncols
    colTotalsMaxColFreq = 0
    for j in range(0, ncols):
        colMax = 0
        for i in range(0, nrows):
            if ties=="first" and ct.iloc[i, j] > colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="last" and ct.iloc[i, j] >= colMax:
                colMaxRowIndex[j] = i
                colMax = ct.iloc[i, j]
                fmj[j] = colMax
            elif ties=="average" and ct.iloc[i, j] >= colMax:
                if ct.iloc[i, j] > colMax:
                    colMaxFreq[j] = 1
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
            elif ties=="random":
                if ct.iloc[i, j] > colMax:
                    colMaxRowIndex[j] = i
                    colMax = ct.iloc[i, j]
                    fmj[j] = colMax
                    colMaxFreq[j] = 1
                elif ct.iloc[i, j] == colMax:
                    colMaxFreq[j] = colMaxFreq[j] + 1
                    if random() < (1 / colMaxFreq[j]):
                        colMaxRowIndex[j] = i
        if cs[j]==cm:
            if ties=="last":
                colTotalsMaxColIndex = j
            elif ties=="first":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
            elif ties=="random":
                if colTotalsMaxColFreq==0:
                    colTotalsMaxColIndex = j
                else:
                    if random() < (1 / colTotalsMaxColFreq):
                        colTotalsMaxColIndex = j
            #ties = "average" is not needed since it does not use the colTotalsMaxColIndex
            colTotalsMaxColFreq = colTotalsMaxColFreq + 1
    dijc = pd.DataFrame()
    for i in range(0, nrows):
        for j in range(0, ncols):
            if ties=="average":
                if ct.iloc[i, j]==fim[i]:
                    dijc.at[i, j] = 1 / rowMaxFreq[i]
                else:
                    dijc.at[i, j] = 0
            else:
                if j==rowMaxColIndex[i]:
                    dijc.at[i, j] = 1
                else:
                    dijc.at[i, j] = 0
    djc = [0]*ncols
    for j in range(0, ncols):
        if cs[j]==cm:
            if ties=="average":
                djc[j] = 1 / colTotalsMaxColFreq
            elif colTotalsMaxColIndex==j:
                djc[j] = 1
            else:
                djc[j] = 0
    dijr = pd.DataFrame()
    for j in range(0, ncols):
        for i in range(0, nrows):
            if ties=="average":
                if ct.iloc[i, j]==fmj[j]:
                    dijr.at[i, j] = 1 / colMaxFreq[j]
                else:
                    dijr.at[i, j] = 0
            else:
                if i==colMaxRowIndex[j]:
                    dijr.at[i, j] = 1
                else:
                    dijr.at[i, j] = 0
    dirr = [0]*nrows
    for i in range(0, nrows):
        if rs[i]==rm:
            if ties=="average":
                dirr[i] = 1 / rowTotalsMaxRowFreq
            elif rowTotalsMaxRowIndex==i:
                dirr[i] = 1
            else:
                dirr[i] = 0
    #ase calculations
    sfim = 0
    for i in range(0, nrows):
        sfim = sfim + fim[i]
    sfmj = 0
    for j in range(0, ncols):
        sfmj = sfmj + fmj[j]
    lambdyx = (sfim - cm) / (n - cm)
    lambdxy = (sfmj - rm) / (n - rm)
    lambd = (sfim + sfmj - cm - rm) / (2 * n - rm - cm)
    #aseyx0 and aseyx1
    ase0yx = 0
    ase1yx = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0yx = ase0yx + ct.iloc[i, j] * (dijc.iloc[i, j] - djc[j])**2
            ase1yx = ase1yx + ct.iloc[i, j] * dijc.iloc[i, j] * djc[j]
    ase0yx = (ase0yx - (sfim - cm)**2 / n)**0.5 / (n - cm)
    ase1yx = (((n - sfim) * (sfim + cm - 2 * ase1yx)) / ((n - cm)**3))**0.5
    #asexy0 and asexy1
    ase0xy = 0
    ase1xy = 0
    for j in range(0, ncols):
        for i in range(0, nrows):
            ase0xy = ase0xy + ct.iloc[i, j] * (dijr.iloc[i, j] - dirr[i])**2
            ase1xy = ase1xy + ct.iloc[i, j] * dijr.iloc[i, j] * dirr[i]
    ase0xy = (ase0xy - (sfmj - rm)**2 / n)**0.5 / (n - rm)
    ase1xy = (((n - sfmj) * (sfmj + rm - 2 * ase1xy)) / ((n - rm)**3))**0.5
    #ase0 and ase1
    ase0 = 0
    ase1 = 0
    for i in range(0, nrows):
        for j in range(0, ncols):
            ase0 = ase0 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i])**2
            ase1 = ase1 + ct.iloc[i, j] * (dijc.iloc[i, j] + dijr.iloc[i, j] - djc[j] - dirr[i] + lambd * (djc[j] + dirr[i]))**2
    ase0 = (ase0 - (sfim + sfmj - cm - rm)**2 / n)**0.5 / (2 * n - rm - cm)
    ase1 = (ase1 - 4 * n * lambd**2)**0.5 / (2 * n - rm - cm)
    Z = lambd / ase0
    Zyx = lambdyx / ase0yx
    Zxy = lambdxy / ase0xy
    p = 2 * (1 - NormalDist().cdf(abs(Z)))
    pyx = 2 * (1 - NormalDist().cdf(abs(Zyx)))
    pxy = 2 * (1 - NormalDist().cdf(abs(Zxy)))
    #the results
    ver = ["symmetric", "field1", "field2"]
    ns = [n, n, n]
    ls = [lambd, lambdxy, lambdyx]
    ase0s = [ase0, ase0xy, ase0yx]
    ase1s = [ase1, ase1xy, ase1yx]
    zs = [Z, Zxy, Zyx]
    pvalues = [p, pxy, pyx]
    colNames = ["dependent", "value", "n", "ASE_0", "ASE_1", "statistic", "p-value"]
    results = pd.DataFrame(list(zip(ver, ls, ns, ase0s, ase1s, zs, pvalues)), columns=colNames)
    return results
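For illustration, the core symmetric lambda computation can be cross-checked with a plain pandas crosstab. The data below is hypothetical, and this sketch bypasses the package's tab_cross helper and its tie handling:

```python
import pandas as pd

# hypothetical categorical data
sex  = ["m", "m", "m", "m", "f", "f", "f", "f", "f", "f"]
pref = ["a", "a", "a", "b", "b", "b", "b", "b", "a", "a"]

ct = pd.crosstab(pd.Series(sex), pd.Series(pref))
n = ct.to_numpy().sum()
sfim = ct.max(axis=1).sum()    # sum of row maxima
sfmj = ct.max(axis=0).sum()    # sum of column maxima
rmax = ct.sum(axis=1).max()    # largest row total
cmax = ct.sum(axis=0).max()    # largest column total
lamb = (sfim + sfmj - cmax - rmax) / (2 * n - rmax - cmax)  # symmetric lambda
```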