Module stikpetP.effect_sizes.eff_size_becker_clogg_r

import math
import pandas as pd
from ..other.table_cross import tab_cross
from statistics import NormalDist

def es_becker_clogg_r(field1, field2, categories1=None, categories2=None, version=1):
    '''
    Becker and Clogg rho
    --------------------
    An approximation for the tetrachoric correlation coefficient.
    
    Parameters
    ----------
    field1 : pandas Series
        data with categories for the rows
    field2 : pandas Series
        data with categories for the columns
    categories1 : list or dictionary, optional
        the two categories to use from field1. If not set, the first two found will be used
    categories2 : list or dictionary, optional
        the two categories to use from field2. If not set, the first two found will be used
    version : {1, 2}, optional
        version of the rho to determine (see notes). Default is 1

    Returns
    -------
    rt : float
        the Becker and Clogg approximation of the tetrachoric correlation
    
    Notes
    -----
    Version 1 will calculate:
    $$\\rho^* = \\frac{g-1}{g+1}$$
    
    Version 2 will calculate:
    $$\\rho^{**} = \\frac{OR^{13.3/\\Delta} - 1}{OR^{13.3/\\Delta} + 1}$$
    
    With:
    $$g=e^{12.4\\times\\phi - 24.6\\times\\phi^3}$$
    $$\\phi = \\frac{\\ln\\left(OR\\right)}{\\Delta}$$
    $$OR=\\frac{\\left(\\frac{a}{c}\\right)}{\\left(\\frac{b}{d}\\right)} = \\frac{a\\times d}{b\\times c}$$
    $$\\Delta = \\left(\\mu_{R1} - \\mu_{R2}\\right) \\times \\left(v_{C1} - v_{C2}\\right)$$
    $$\\mu_{R1} = \\frac{-e^{-\\frac{t_r^2}{2}}}{p_{R1}}, \\mu_{R2} = \\frac{e^{-\\frac{t_r^2}{2}}}{p_{R2}}$$
    $$v_{C1} = \\frac{-e^{-\\frac{t_c^2}{2}}}{p_{C1}}, v_{C2} = \\frac{e^{-\\frac{t_c^2}{2}}}{p_{C2}}$$
    $$t_r = \\Phi^{-1}\\left(p_{R1}\\right), t_c = \\Phi^{-1}\\left(p_{C1}\\right)$$
    $$p_{x} = \\frac{x}{n}$$
    
    *Symbols used:*
    
    * \\(a\\) the count in the top-left cell of the cross table
    * \\(b\\) the count in the top-right cell of the cross table
    * \\(c\\) the count in the bottom-left cell of the cross table
    * \\(d\\) the count in the bottom-right cell of the cross table
    * \\(R_i\\) the sum of counts in the i-th row
    * \\(C_i\\) the sum of counts in the i-th column
    * \\(n\\) the total count, i.e. the sum of all four cells
    * \\(\\Phi^{-1}\\left(x\\right)\\) the inverse standard normal cumulative distribution function
    
    These formulas can be found in Becker and Clogg (1988, pp. 410-412).
    
    References 
    ----------
    Becker, M. P., & Clogg, C. C. (1988). A note on approximating correlations from Odds Ratios. *Sociological Methods & Research, 16*(3), 407–424. https://doi.org/10.1177/0049124188016003003
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    Examples
    --------
    >>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
    >>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
    >>> es_becker_clogg_r(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
    0.2082967559691196
    
    >>> es_becker_clogg_r(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], version=2)
    np.float64(0.22342632378882407)
    
    '''
    
    # determine sample cross table
    tab = tab_cross(field1, field2, order1=categories1, order2=categories2, percent=None, totals="exclude")
    
    # cell values of sample cross table
    a = tab.iloc[0,0]
    b = tab.iloc[0,1]
    c = tab.iloc[1,0]
    d = tab.iloc[1,1]
    
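    # the row totals, column totals, and grand total
    R1 = a + b
    R2 = c + d
    C1 = a + c
    C2 = b + d
    n = R1 + R2
    
    # marginal proportions
    pR1 = R1/n
    pR2 = R2/n
    pC1 = C1/n
    pC2 = C2/n

    # standard normal quantiles of the first row and first column proportions
    tr = NormalDist().inv_cdf(pR1)
    tc = NormalDist().inv_cdf(pC1)

    # row scores
    mR1 = -math.exp(-tr**2/2) / pR1
    mR2 = math.exp(-tr**2/2) / pR2

    # column scores
    vC1 = -math.exp(-tc**2/2) / pC1
    vC2 = math.exp(-tc**2/2) / pC2

    # product of the row and column score differences
    delta = (mR1 - mR2)*(vC1 - vC2)

    # odds ratio
    OR = a*d/(b*c)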
    
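    if version == 2:
        rt = (OR**(13.3/delta) - 1) / (OR**(13.3/delta) + 1)
    elif version == 1:
        phiBC = math.log(OR) / delta
        g = math.exp(12.4*phiBC - 24.6*phiBC**3)
        rt = (g - 1)/(g + 1)
    else:
        # any other value would otherwise leave rt undefined
        raise ValueError("version must be 1 or 2")
    
    return rt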

Functions

def es_becker_clogg_r(field1, field2, categories1=None, categories2=None, version=1)

Becker And Clogg Rho

An approximation for the tetrachoric correlation coefficient.

Parameters

field1 : pandas Series
    data with categories for the rows
field2 : pandas Series
    data with categories for the columns
categories1 : list or dictionary, optional
    the two categories to use from field1. If not set, the first two found will be used
categories2 : list or dictionary, optional
    the two categories to use from field2. If not set, the first two found will be used
version : {1, 2}, optional
    version of the rho to determine (see notes). Default is 1

Returns

rt : float
    the Becker and Clogg approximation of the tetrachoric correlation

Notes

Version 1 will calculate:
$$\rho^* = \frac{g-1}{g+1}$$

Version 2 will calculate:
$$\rho^{**} = \frac{OR^{13.3/\Delta} - 1}{OR^{13.3/\Delta} + 1}$$

With:
$$g=e^{12.4\times\phi - 24.6\times\phi^3}$$
$$\phi = \frac{\ln\left(OR\right)}{\Delta}$$
$$OR=\frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)} = \frac{a\times d}{b\times c}$$
$$\Delta = \left(\mu_{R1} - \mu_{R2}\right) \times \left(v_{C1} - v_{C2}\right)$$
$$\mu_{R1} = \frac{-e^{-\frac{t_r^2}{2}}}{p_{R1}}, \mu_{R2} = \frac{e^{-\frac{t_r^2}{2}}}{p_{R2}}$$
$$v_{C1} = \frac{-e^{-\frac{t_c^2}{2}}}{p_{C1}}, v_{C2} = \frac{e^{-\frac{t_c^2}{2}}}{p_{C2}}$$
$$t_r = \Phi^{-1}\left(p_{R1}\right), t_c = \Phi^{-1}\left(p_{C1}\right)$$
$$p_{x} = \frac{x}{n}$$
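
As a quick check of these formulas, consider a special case chosen only for illustration: both variables are split evenly, so every marginal proportion equals 0.5. The thresholds are then zero and the scores and \Delta reduce to simple constants:

$$t_r = t_c = \Phi^{-1}\left(0.5\right) = 0$$
$$\mu_{R1} = v_{C1} = \frac{-e^{0}}{0.5} = -2, \quad \mu_{R2} = v_{C2} = \frac{e^{0}}{0.5} = 2$$
$$\Delta = \left(-2 - 2\right) \times \left(-2 - 2\right) = 16$$
$$\phi = \frac{\ln\left(OR\right)}{16}, \quad \rho^{**} = \frac{OR^{13.3/16} - 1}{OR^{13.3/16} + 1} = \frac{OR^{0.83125} - 1}{OR^{0.83125} + 1}$$

In this case an odds ratio of 1 (no association) gives \phi = 0, g = 1, and both versions return 0.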

Symbols used:

  • a the count in the top-left cell of the cross table
  • b the count in the top-right cell of the cross table
  • c the count in the bottom-left cell of the cross table
  • d the count in the bottom-right cell of the cross table
  • R_i the sum of counts in the i-th row
  • C_i the sum of counts in the i-th column
  • n the total count, i.e. the sum of all four cells
  • \Phi^{-1}\left(x\right) the inverse standard normal cumulative distribution function

These formulas can be found in Becker and Clogg (1988, pp. 410-412).
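
The formulas in these notes can be traced step by step from the four cell counts alone. The following is a minimal sketch using only the standard library; the helper name `becker_clogg_from_counts` and the counts 40, 10, 10, 40 are assumptions made for illustration and are not part of the package.

```python
import math
from statistics import NormalDist


def becker_clogg_from_counts(a, b, c, d, version=1):
    """Hypothetical helper: Becker and Clogg rho from the four cell counts."""
    n = a + b + c + d

    # marginal proportions of the rows and columns
    p_r1, p_r2 = (a + b) / n, (c + d) / n
    p_c1, p_c2 = (a + c) / n, (b + d) / n

    # standard normal quantiles of the first row and first column proportions
    t_r = NormalDist().inv_cdf(p_r1)
    t_c = NormalDist().inv_cdf(p_c1)

    # row and column scores, then the product of their differences
    mu_r1 = -math.exp(-t_r**2 / 2) / p_r1
    mu_r2 = math.exp(-t_r**2 / 2) / p_r2
    v_c1 = -math.exp(-t_c**2 / 2) / p_c1
    v_c2 = math.exp(-t_c**2 / 2) / p_c2
    delta = (mu_r1 - mu_r2) * (v_c1 - v_c2)

    odds_ratio = (a * d) / (b * c)

    if version == 1:
        phi = math.log(odds_ratio) / delta
        g = math.exp(12.4 * phi - 24.6 * phi**3)
        return (g - 1) / (g + 1)
    if version == 2:
        power = odds_ratio ** (13.3 / delta)
        return (power - 1) / (power + 1)
    raise ValueError("version must be 1 or 2")


# Illustrative counts only: a=40, b=10, c=10, d=40 (odds ratio 16)
rho1 = becker_clogg_from_counts(40, 10, 10, 40, version=1)
rho2 = becker_clogg_from_counts(40, 10, 10, 40, version=2)
print(rho1, rho2)
```

Because the odds ratio divides by b × c, a zero count in either off-diagonal cell raises a `ZeroDivisionError` in this sketch; how such tables should be handled is left open here.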

References

Becker, M. P., & Clogg, C. C. (1988). A note on approximating correlations from Odds Ratios. Sociological Methods & Research, 16(3), 407–424. https://doi.org/10.1177/0049124188016003003

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076

Examples

>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> es_becker_clogg_r(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"])
0.2082967559691196
>>> es_becker_clogg_r(df1['mar1'], df1['sex'], categories1=["WIDOWED", "DIVORCED"], version=2)
np.float64(0.22342632378882407)
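
The examples above rely on the package's `tab_cross` to build the 2×2 table, and its implementation is not shown on this page. As a rough illustration of what the four counts a, b, c and d represent, the sketch below tallies them from two equally long sequences using only the standard library. The helper `two_by_two_counts` and the toy data are hypothetical and are not taken from the package or from GSS2012a.csv.

```python
from collections import Counter


def two_by_two_counts(rows, cols, row_cats, col_cats):
    """Hypothetical stand-in for a 2x2 cross table: returns (a, b, c, d)."""
    # count only the pairs whose values both belong to the selected categories
    tally = Counter(
        (r, c) for r, c in zip(rows, cols)
        if r in row_cats and c in col_cats
    )
    r1, r2 = row_cats
    c1, c2 = col_cats
    return tally[(r1, c1)], tally[(r1, c2)], tally[(r2, c1)], tally[(r2, c2)]


# Toy data, made up for illustration only
marital = ["WIDOWED", "DIVORCED", "WIDOWED", "MARRIED", "DIVORCED", "WIDOWED"]
sex = ["FEMALE", "MALE", "FEMALE", "MALE", "FEMALE", "MALE"]

a, b, c, d = two_by_two_counts(
    marital, sex, row_cats=["WIDOWED", "DIVORCED"], col_cats=["FEMALE", "MALE"]
)
print(a, b, c, d)  # prints: 2 1 1 1
```

The order of the categories determines which cell is a, so swapping the two row categories (or the two column categories) inverts the odds ratio and flips the sign of the resulting coefficient.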