Module stikpetP.tests.test_brown_forsythe_owa
import pandas as pd
from scipy.stats import f
def ts_brown_forsythe_owa(nomField, scaleField, categories=None):
    '''
Brown-Forsythe One-Way ANOVA
-----------------------------
Tests if the means (averages) of each category could be the same in the population.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, which indicates that at least two categories have a different mean scaleField score in the population.
There are quite a few alternatives to this test; the stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion of the differences.
Parameters
----------
nomField : pandas series
    data with categories
scaleField : pandas series
    data with the scores
categories : list or dictionary, optional
    the categories to use from nomField
Returns
-------
Dataframe with:
* *n*, the sample size
* *k*, the number of categories
* *statistic*, the test statistic (F value)
* *df1*, degrees of freedom 1
* *df2*, degrees of freedom 2
* *p-value*, the p-value (significance)
Notes
-----
The formula used (Brown & Forsythe, 1974, p. 130):
$$ F_{BF} = \\frac{\\sum_{j=1}^k n_j\\times\\left(\\bar{x}_j - \\bar{x}\\right)^2}{\\sum_{j=1}^k\\left(1-\\frac{n_j}{n}\\right)\\times s_j^2} $$
$$ df_1 = k - 1 $$
$$ df_2 = \\frac{\\left(\\sum_{j=1}^k\\left(1-\\frac{n_j}{n}\\right)\\times s_j^2\\right)^2}{\\sum_{j=1}^k \\frac{\\left(1-\\frac{n_j}{n}\\right)^2\\times s_j^4}{n_j - 1}} $$
$$ F_{BF}\\sim F\\left(df_1, df_2\\right) $$
With:
$$ s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$
$$ \\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
$$ \\bar{x} = \\frac{\\sum_{j=1}^{k}n_j\\times \\bar{x}_j}{n} = \\frac{\\sum_{j=1}^{k}\\sum_{i=1}^{n_j} x_{i,j}}{n}$$
$$ n = \\sum_{j=1}^k n_j $$
This appears to give the same results as the Box correction, except for \\(df_1\\) and \\(df_2\\).
*Symbols used*
* \\(k\\), for the number of categories
* \\(x_{i,j}\\), for the i-th score in category j
* \\(n_j\\), the sample size of category j
* \\(\\bar{x}_j\\), the sample mean of category j
* \\(s_j^2\\), the sample variance of the scores in category j
* \\(n\\), the total sample size
* \\(df_i\\), the i-th degrees of freedom.
References
----------
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. *Technometrics, 16*(1), 129–132. doi:10.1080/00401706.1974.10489158
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
'''
    # accept plain lists as well as pandas Series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)

    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]

    # remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]

    # remove rows with missing values and reset the index
    # (reset_index returns a new frame, so the result must be assigned)
    data = data.dropna()
    data = data.reset_index(drop=True)

    # overall n and mean
    n = len(data["category"])
    m = data.score.mean()

    # sample sizes, variances and means per category (as Series of scores)
    nj = data.groupby('category')['score'].count()
    sj2 = data.groupby('category')['score'].var()
    mj = data.groupby('category')['score'].mean()

    # number of categories
    k = len(mj)

    # test statistic, degrees of freedom and p-value
    fVal = float((nj*(mj-m)**2).sum()/((1 - nj/n)*sj2).sum())
    df1 = k - 1
    df2 = float(((1 - nj/n)*sj2).sum()**2 / ((1 - nj/n)**2*sj2**2/(nj - 1)).sum())
    pVal = f.sf(fVal, df1, df2)

    # results
    res = pd.DataFrame([[n, k, fVal, df1, df2, pVal]])
    res.columns = ["n", "k", "statistic", "df1", "df2", "p-value"]

    return res
Functions
def ts_brown_forsythe_owa(nomField, scaleField, categories=None)
Brown-Forsythe One-Way ANOVA
Tests if the means (averages) of each category could be the same in the population.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, which indicates that at least two categories have a different mean scaleField score in the population.
There are quite a few alternatives to this test; the stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes of ts_fisher_owa() for some discussion of the differences.
Parameters
nomField
:pandas series
- data with categories
scaleField
:pandas series
- data with the scores
categories
:list or dictionary, optional
- the categories to use from nomField
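To illustrate what the `categories` argument and the missing-value handling do to the input, here is a minimal plain-Python sketch. It is a stand-in for the pandas `isin`/`dropna` step in the source, not the library's own code; the helper name `prepare_pairs` and the sample values are made up for this illustration.

```python
import math


def prepare_pairs(category_values, score_values, categories=None):
    """Hypothetical helper mirroring the filtering done before the test."""
    pairs = []
    for cat, score in zip(category_values, score_values):
        # skip rows where either value is missing (None or NaN score)
        if cat is None or score is None:
            continue
        if isinstance(score, float) and math.isnan(score):
            continue
        # keep only the requested categories, if any were given
        if categories is not None and cat not in categories:
            continue
        pairs.append((cat, score))
    return pairs


rows = prepare_pairs(
    ["a", "b", None, "c", "a"],
    [1.0, 2.0, 3.0, float("nan"), 5.0],
    categories=["a", "b"],
)
print(rows)
```

Only complete rows that belong to one of the requested categories remain; everything else is left out of the sample size n and the per-category statistics.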
Returns
Dataframe with:
- n, the sample size
- k, the number of categories
- statistic, the test statistic (F value)
- df1, degrees of freedom 1
- df2, degrees of freedom 2
- p-value, the p-value (significance)
Notes
The formula used (Brown & Forsythe, 1974, p. 130):
$$ F_{BF} = \frac{\sum_{j=1}^k n_j\times\left(\bar{x}_j - \bar{x}\right)^2}{\sum_{j=1}^k\left(1-\frac{n_j}{n}\right)\times s_j^2} $$
$$ df_1 = k - 1 $$
$$ df_2 = \frac{\left(\sum_{j=1}^k\left(1-\frac{n_j}{n}\right)\times s_j^2\right)^2}{\sum_{j=1}^k \frac{\left(1-\frac{n_j}{n}\right)^2\times s_j^4}{n_j - 1}} $$
$$ F_{BF}\sim F\left(df_1, df_2\right) $$
With:
$$ s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1} $$
$$ \bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j} $$
$$ \bar{x} = \frac{\sum_{j=1}^{k}n_j\times \bar{x}_j}{n} = \frac{\sum_{j=1}^{k}\sum_{i=1}^{n_j} x_{i,j}}{n} $$
$$ n = \sum_{j=1}^k n_j $$
This appears to give the same results as the Box correction, except for df_1 and df_2.
Symbols used
- k, for the number of categories
- x_{i,j}, for the i-th score in category j
- n_j, the sample size of category j
- \bar{x}_j, the sample mean of category j
- s_j^2, the sample variance of the scores in category j
- n, the total sample size
- df_i, the i-th degrees of freedom.
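As a worked illustration of the formulas in these notes, the following sketch computes F_{BF}, df_1 and df_2 by hand using only the Python standard library. The data and variable names are made up for this example, and the code is a plain re-derivation of the formulas rather than the library's implementation.

```python
from statistics import mean, variance

# Made-up illustration data: three categories with unequal sizes and variances
groups = {
    "A": [1, 2, 3],
    "B": [2, 4, 6],
    "C": [5, 6, 7, 8, 9],
}

n = sum(len(x) for x in groups.values())               # total sample size
k = len(groups)                                        # number of categories
grand_mean = sum(sum(x) for x in groups.values()) / n  # overall mean

nj = {g: len(x) for g, x in groups.items()}            # sample size per category
mj = {g: mean(x) for g, x in groups.items()}           # mean per category
sj2 = {g: variance(x) for g, x in groups.items()}      # sample variance (n_j - 1)

# numerator and denominator of the F_BF formula
numerator = sum(nj[g] * (mj[g] - grand_mean) ** 2 for g in groups)
denominator = sum((1 - nj[g] / n) * sj2[g] for g in groups)

f_bf = numerator / denominator
df1 = k - 1
df2 = denominator ** 2 / sum(
    (1 - nj[g] / n) ** 2 * sj2[g] ** 2 / (nj[g] - 1) for g in groups
)

print(f_bf, df1, df2)
```

For this data the statistic is 546/55 (about 9.93) with df_1 = 2 and df_2 of about 5.04. The p-value is then the upper tail of the F distribution at that statistic, which the source obtains with `f.sf(fVal, df1, df2)` from `scipy.stats`.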
References
Brown, M. B., & Forsythe, A. B. (1974). The small sample behavior of some statistics which test the equality of several means. Technometrics, 16(1), 129–132. doi:10.1080/00401706.1974.10489158
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076