Module stikpetP.tests.test_wilcox_owa

import pandas as pd
from scipy.stats import chi2

def ts_wilcox_owa(nomField, scaleField, categories=None):
    '''
    Wilcox One-Way ANOVA
    ------------------------------
    Tests if the means (averages) of each category could be the same in the population.
        
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and we conclude that at least two categories have a different mean scaleField score in the population.
        
    There are several alternatives for this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox. See the notes of ts_fisher_owa() for a discussion of the differences.
    
    Parameters
    ----------
    nomField : list or pandas series
        data with categories
    scaleField : list or pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField
    
    Returns
    -------
    Dataframe with:
    
    * *n*, the sample size
    * *statistic*, the test statistic (chi-square value)
    * *df*, degrees of freedom
    * *p-value*, the p-value (significance)
    
    Notes
    -----
    The formula used (Wilcox, 1988, pp. 110-111):
    $$H = \\frac{\\sum_{j=1}^k \\left(W_j - \\bar{W}\\right)^2}{\\hat{\\theta}}$$
    $$df = k - 1$$
    $$sig. = 1 - \\chi^2\\left(H, df\\right)$$
    
    With:
    $$W_j = b_j\\times x_{n_j,j} + \\frac{1 - b_j}{n_j}\\times\\sum_{i=1}^{n_j-1} x_{i,j}$$
    $$\\bar{W} = \\frac{\\sum_{j=1}^k W_j}{k}$$
    $$b_j = \\frac{1+\\sqrt{\\frac{\\left(n_j-1\\right)\\times\\left(n_j\\times\\hat{\\theta}-s_j^2\\right)}{s_j^2}}}{n_j}$$
    $$\\hat{\\theta} = \\max\\left(\\frac{s_1^2}{n_1}, \\frac{s_2^2}{n_2},\\dots,\\frac{s_k^2}{n_k}\\right)$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    
    The original article has an error in the formula for \\(b_j\\): brackets are missing. The formula used here was adapted from the population version, \\(c_j\\), given in the article.
    
    *Symbols used*
    
    * \\(x_{i,j}\\), the i-th score in category j
    * \\(k\\), the number of categories
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(df\\), the degrees of freedom
    * \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution
    
    References
    ----------
    Wilcox, R. R. (1988). A new alternative to the ANOVA F and new results on James’s second-order method. *British Journal of Mathematical and Statistical Psychology, 41*(1), 109–117. doi:10.1111/j.2044-8317.1988.tb00890.x
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    #convert lists to pandas series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)

    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
        
    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]
    
    #remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]
    
    #Remove rows with missing values and reset index
    data = data.dropna()
    #reset_index returns a new dataframe, so the result has to be assigned
    data = data.reset_index(drop=True)
    
    #overall sample size (the overall mean and sum of squares are not needed for this test)
    n = len(data["category"])
    
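    #sample sizes, variances, means, maxima and sums of the scores per category
    #selecting the score column gives series, so .max() and .sum() below return plain numbers
    scores = data.groupby('category')['score']
    nj = scores.count()
    sj2 = scores.var()
    mj = scores.mean()
    #note: max() is the largest score, which equals x_{n_j,j} from the Notes only if the scores are sorted ascending
    xj = scores.max()
    sj = scores.sum()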
    
    #number of categories
    k = len(mj)
    
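    #theta-hat, the largest s_j^2/n_j
    t = float((sj2/nj).max())
    #weights b_j and the weighted scores W_j
    bj = (1 + ((nj - 1)*(nj*t - sj2)/sj2)**0.5) / nj
    wj = bj*xj + (1 - bj)/nj * (sj - xj)
    #unweighted mean of the W_j
    wm = wj.sum()/k
    
    #test statistic, degrees of freedom and upper-tail chi-square probability
    h = float(((wj - wm)**2).sum()/t)
    df = k - 1
    pVal = chi2.sf(h, df)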
    
    #results
    res = pd.DataFrame([[n, h, df, pVal]])
    res.columns = ["n", "statistic", "df", "p-value"]
    
    return res

Functions

def ts_wilcox_owa(nomField, scaleField, categories=None)

Wilcox One-Way ANOVA

Tests if the means (averages) of each category could be the same in the population.

If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and we conclude that at least two categories have a different mean scaleField score in the population.

There are several alternatives for this test. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox. See the notes of ts_fisher_owa() for a discussion of the differences.

Parameters

nomField : list or pandas series
    data with categories
scaleField : list or pandas series
    data with the scores
categories : list or dictionary, optional
    the categories to use from nomField
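
The source code applies `categories` by keeping only the rows whose category is listed, and it drops every row with a missing category or score. A minimal plain-Python sketch of that selection follows; `select_pairs` is a hypothetical helper for illustration and is not part of the library, which does this step with pandas.

```python
import math

def select_pairs(nom, scale, categories=None):
    """Return the (category, score) pairs that would enter the test."""
    def is_missing(value):
        # treat None and float NaN as missing
        return value is None or (isinstance(value, float) and math.isnan(value))

    pairs = list(zip(nom, scale))
    if categories is not None:
        # keep only the requested categories
        pairs = [(c, x) for c, x in pairs if c in categories]
    # drop pairs where the category or the score is missing
    return [(c, x) for c, x in pairs if not is_missing(c) and not is_missing(x)]

pairs = select_pairs(["a", "b", "c", "a", None],
                     [1.0, 2.0, 3.0, None, 5.0],
                     categories=["a", "b"])
# pairs is [("a", 1.0), ("b", 2.0)]
```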

Returns

Dataframe with:
 
  • n, the sample size
  • statistic, the test statistic (chi-square value)
  • df, degrees of freedom
  • p-value, the p-value (significance)

Notes

The formula used (Wilcox, 1988, pp. 110-111):
$$H = \frac{\sum_{j=1}^k \left(W_j - \bar{W}\right)^2}{\hat{\theta}}$$
$$df = k - 1$$
$$sig. = 1 - \chi^2\left(H, df\right)$$
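
The significance is the upper-tail probability of the chi-square distribution at H. The source code obtains it with `chi2.sf(h, df)` from scipy for any df. As an illustration only, the sketch below uses a closed form that holds when df is even; the function name `chi2_sf_even_df` is hypothetical and not part of the library.

```python
from math import exp, factorial

def chi2_sf_even_df(h, df):
    """Upper-tail chi-square probability 1 - CDF(h; df), valid only for even df >= 2."""
    if df < 2 or df % 2 != 0:
        raise ValueError("this closed form needs an even df of at least 2")
    half = h / 2
    # for df = 2m: exp(-h/2) * sum_{i=0}^{m-1} (h/2)^i / i!
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))

# with three categories df = k - 1 = 2, and H = 5.991 gives a p-value of about 0.05
p = chi2_sf_even_df(5.991, 2)
```

For odd df there is no such finite sum, which is one reason to rely on a library routine in practice.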

With:
$$W_j = b_j\times x_{n_j,j} + \frac{1 - b_j}{n_j}\times\sum_{i=1}^{n_j-1} x_{i,j}$$
$$\bar{W} = \frac{\sum_{j=1}^k W_j}{k}$$
$$b_j = \frac{1+\sqrt{\frac{\left(n_j-1\right)\times\left(n_j\times\hat{\theta}-s_j^2\right)}{s_j^2}}}{n_j}$$
$$\hat{\theta} = \max\left(\frac{s_1^2}{n_1}, \frac{s_2^2}{n_2},\dots,\frac{s_k^2}{n_k}\right)$$
$$\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
$$s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1}$$

The original article has an error in the formula for \(b_j\): brackets are missing. The formula used here was adapted from the population version, \(c_j\), given in the article.
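
The formulas above can be followed step by step in plain Python. This is a sketch under stated assumptions, not the library's implementation: `wilcox_h` is a hypothetical name, the input is a list of score lists (one per category), and \(x_{n_j,j}\) is read literally as the last score in each list. The source code on this page uses the per-category maximum instead; the two agree when each list is sorted ascending. A category with zero variance would cause a division by zero here.

```python
from math import sqrt
from statistics import variance

def wilcox_h(groups):
    """Return (H, df) computed from the formulas in the Notes."""
    k = len(groups)
    n = [len(g) for g in groups]
    s2 = [variance(g) for g in groups]            # sample variances s_j^2
    theta = max(v / m for v, m in zip(s2, n))     # theta-hat = max(s_j^2 / n_j)
    w = []
    for g, nj, v in zip(groups, n, s2):
        # nj*theta - v is never negative in exact arithmetic;
        # max(..., 0.0) only guards against tiny rounding errors
        b = (1 + sqrt(max((nj - 1) * (nj * theta - v) / v, 0.0))) / nj
        # W_j: weight b on the last score, (1 - b)/n_j on each of the others
        w.append(b * g[-1] + (1 - b) / nj * sum(g[:-1]))
    w_bar = sum(w) / k                            # unweighted mean of the W_j
    h = sum((wj - w_bar) ** 2 for wj in w) / theta
    return h, k - 1

h, df = wilcox_h([[1, 2, 3], [2, 4, 6], [3, 5, 10]])
```

Because \(\hat{\theta}\) is the largest \(s_j^2/n_j\), the term under the square root is zero for the category that attains the maximum, so its \(b_j\) equals \(1/n_j\).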

Symbols used

  • \(x_{i,j}\), the i-th score in category j
  • \(k\), the number of categories
  • \(n_j\), the sample size of category j
  • \(\bar{x}_j\), the sample mean of category j
  • \(s_j^2\), the sample variance of the scores in category j
  • \(df\), the degrees of freedom
  • \(\chi^2\left(\dots\right)\), the cumulative distribution function of the chi-square distribution

References

Wilcox, R. R. (1988). A new alternative to the ANOVA F and new results on James’s second-order method. British Journal of Mathematical and Statistical Psychology, 41(1), 109–117. doi:10.1111/j.2044-8317.1988.tb00890.x

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
