Module stikpetP.tests.test_brown_forsythe_owa

import pandas as pd
from scipy.stats import f

def ts_brown_forsythe_owa(nomField, scaleField, categories=None):
    '''
    Brown-Forsythe One-Way ANOVA
    -----------------------------
    Tests if the means (averages) of each category could be the same in the population.
        
    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and the conclusion is that at least two categories have a different mean on the scaleField score in the population.
    
    Several alternatives to this test exist. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for a discussion of the differences.
    
    Parameters
    ----------
    nomField : pandas series or list
        data with categories
    scaleField : pandas series or list
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField
    
    Returns
    -------
    Dataframe with:
    
    * *n*, the sample size
    * *k*, the number of categories
    * *statistic*, the test statistic (F value)
    * *df1*, degrees of freedom 1
    * *df2*, degrees of freedom 2
    * *p-value*, the p-value (significance)
    
    Notes
    -----
    The formula used (Brown & Forsythe, 1974, p. 130):
    $$ F_{BF} = \\frac{\\sum_{j=1}^k n_j\\times\\left(\\bar{x}_j - \\bar{x}\\right)^2}{\\sum_{j=1}^k\\left(1-\\frac{n_j}{n}\\right)\\times s_j^2} $$
    $$ df_1 = k - 1 $$
    $$ df_2 = \\frac{\\left(\\sum_{j=1}^k\\left(1-\\frac{n_j}{n}\\right)\\times s_j^2\\right)^2}{\\sum_{j=1}^k \\frac{\\left(1-\\frac{n_j}{n}\\right)^2\\times s_j^4}{n_j - 1}} $$
    $$ F_{BF}\\sim F\\left(df_1, df_2\\right) $$
    
    With:
    $$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
    $$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    $$ \\bar{x} = \\frac{\\sum_{j=1}^{k}n_j\\times \\bar{x}_j}{n} = \\frac{\\sum_{j=1}^{k}\\sum_{i=1}^{n_j} x_{i,j}}{n}$$
    $$ n = \\sum_{j=1}^k n_j $$
    
    This appears to give the same results as the Box correction, except for \\(df_1\\) and \\(df_2\\).
    
    *Symbols used* 
    
    * \\(k\\), for the number of categories
    * \\(x_{i,j}\\), for the i-th score in category j
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(n\\), the total sample size
    * \\(df_i\\), the i-th degrees of freedom.
    
    References
    ----------
    Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. *Technometrics, 16*(1), 129–132. doi:10.1080/00401706.1974.10489158
    
    Author
    ------
    Made by P. Stikker
    
    Companion website: https://PeterStatistics.com  
    YouTube channel: https://www.youtube.com/stikpet  
    Donations: https://www.patreon.com/bePatron?u=19398076
    
    '''
    # accept plain lists as well as pandas Series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)

    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)

    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]

    # keep only the requested categories
    if categories is not None:
        data = data[data["category"].isin(categories)]

    # remove rows with missing values and reset the index
    # (reset_index returns a new frame, so the result must be assigned)
    data = data.dropna().reset_index(drop=True)

    # overall n and mean
    n = len(data)
    m = data["score"].mean()

    # sample sizes, variances and means per category (as Series)
    scores = data.groupby("category")["score"]
    nj = scores.count()
    sj2 = scores.var()
    mj = scores.mean()

    # number of categories
    k = len(mj)

    # terms (1 - n_j/n) * s_j^2, shared by the statistic and df2
    wVar = (1 - nj/n)*sj2

    fVal = float((nj*(mj - m)**2).sum() / wVar.sum())
    df1 = k - 1
    df2 = float(wVar.sum()**2 / (wVar**2/(nj - 1)).sum())

    # upper-tail probability of the F distribution
    pVal = f.sf(fVal, df1, df2)

    # results
    res = pd.DataFrame([[n, k, fVal, df1, df2, pVal]])
    res.columns = ["n", "k", "statistic", "df1", "df2", "p-value"]

    return res

Functions

def ts_brown_forsythe_owa(nomField, scaleField, categories=None)

Brown-Forsythe One-Way ANOVA

Tests if the means (averages) of each category could be the same in the population.

If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and the conclusion is that at least two categories have a different mean on the scaleField score in the population.

Several alternatives to this test exist. The stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for a discussion of the differences.

Parameters

nomField : pandas series or list
    data with categories
scaleField : pandas series or list
    data with the scores
categories : list or dictionary, optional
    the categories to use from nomField

Returns

Dataframe with:
 
  • n, the sample size
  • k, the number of categories
  • statistic, the test statistic (F value)
  • df1, degrees of freedom 1
  • df2, degrees of freedom 2
  • p-value, the p-value (significance)
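
The p-value in the returned frame is the upper-tail probability of the F distribution at the test statistic, which the source code obtains with `scipy.stats.f.sf`. A minimal sketch of that final step and of the decision rule follows. The input values are invented for illustration, the names `f_val`, `alpha` and `reject_null` are hypothetical, and SciPy is assumed to be installed.

```python
from scipy.stats import f

# illustrative values, invented for this example (not from a real data set)
f_val, df1, df2 = 7.8667, 2, 4.5017
alpha = 0.05  # the usual threshold; choose it before looking at the data

# survival function: P(F >= f_val) under the null hypothesis
p_value = f.sf(f_val, df1, df2)
reject_null = p_value < alpha

print(f"p = {p_value:.3f}, reject H0: {reject_null}")
```

Using the survival function directly, rather than `1 - f.cdf(...)`, avoids losing precision when the p-value is very small.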

Notes

The formula used (Brown & Forsythe, 1974, p. 130):

$$ F_{BF} = \frac{\sum_{j=1}^k n_j\times\left(\bar{x}_j - \bar{x}\right)^2}{\sum_{j=1}^k\left(1-\frac{n_j}{n}\right)\times s_j^2} $$
$$ df_1 = k - 1 $$
$$ df_2 = \frac{\left(\sum_{j=1}^k\left(1-\frac{n_j}{n}\right)\times s_j^2\right)^2}{\sum_{j=1}^k \frac{\left(1-\frac{n_j}{n}\right)^2\times s_j^4}{n_j - 1}} $$
$$ F_{BF}\sim F\left(df_1, df_2\right) $$

With:

$$ s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1} $$
$$ \bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} $$
$$ \bar{x} = \frac{\sum_{j=1}^{k}n_j\times \bar{x}_j}{n} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{i,j}}{n} $$
$$ n = \sum_{j=1}^k n_j $$

This appears to give the same results as the Box correction, except for \(df_1\) and \(df_2\).

Symbols used

  • \(k\), for the number of categories
  • \(x_{i,j}\), for the i-th score in category j
  • \(n_j\), the sample size of category j
  • \(\bar{x}_j\), the sample mean of category j
  • \(s_j^2\), the sample variance of the scores in category j
  • \(n\), the total sample size
  • \(df_i\), the i-th degrees of freedom.
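
As a rough illustration of the formulas in these notes, the sketch below recomputes the statistic and both degrees of freedom with only the Python standard library. The function name `brown_forsythe_sketch` and the toy scores are made up for this example and are not part of stikpet; the library function itself works on pandas objects.

```python
from statistics import mean, variance


def brown_forsythe_sketch(groups):
    """Return (F_BF, df1, df2) for a dict mapping category -> list of scores.

    Hypothetical helper for illustration only; not part of stikpet.
    """
    k = len(groups)
    n = sum(len(scores) for scores in groups.values())
    grand_mean = sum(sum(scores) for scores in groups.values()) / n

    between = 0.0  # numerator: sum of n_j * (mean_j - grand_mean)^2
    weighted = []  # pairs of ((1 - n_j/n) * s_j^2, n_j)
    for scores in groups.values():
        n_j = len(scores)
        between += n_j * (mean(scores) - grand_mean) ** 2
        # statistics.variance is the sample variance (divides by n_j - 1)
        weighted.append(((1 - n_j / n) * variance(scores), n_j))

    denom = sum(w for w, _ in weighted)
    f_bf = between / denom
    df1 = k - 1
    df2 = denom ** 2 / sum(w ** 2 / (n_j - 1) for w, n_j in weighted)
    return f_bf, df1, df2


# toy data, invented for this example
toy = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [5, 6, 7, 8]}
f_bf, df1, df2 = brown_forsythe_sketch(toy)
print(round(f_bf, 2), df1, round(df2, 2))  # → 7.87 2 4.5
```

For the toy data the numerator is 35.4 and the denominator is 4.5, so the statistic is about 7.87 on 2 and roughly 4.5 degrees of freedom. Each category needs at least two scores, since the sample variance is undefined otherwise.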

References

Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16(1), 129–132. doi:10.1080/00401706.1974.10489158

Author

Made by P. Stikker

Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
