Module stikpetP.tests.test_wilcox_owa
import pandas as pd
from scipy.stats import chi2
def ts_wilcox_owa(nomField, scaleField, categories=None):
    '''
    Wilcox One-Way ANOVA
    ------------------------------
    Tests if the means (averages) of each category could be the same in the population.

    If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then concluded to have different means on the scaleField score in the population.

    Several alternatives to this test exist; the stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes from ts_fisher_owa() for some discussion on the differences.

    Parameters
    ----------
    nomField : pandas series
        data with categories
    scaleField : pandas series
        data with the scores
    categories : list or dictionary, optional
        the categories to use from nomField

    Returns
    -------
    Dataframe with:

    * *n*, the sample size
    * *statistic*, the test statistic (chi-square value)
    * *df*, degrees of freedom
    * *p-value*, the p-value (significance)

    Notes
    -----
    The formula used (Wilcox, 1988, pp. 110-111):
    $$H = \\frac{\\sum_{j=1}^k \\left(W_j - \\bar{W}\\right)^2}{\\hat{\\theta}}$$
    $$df = k - 1$$
    $$sig. = 1 - \\chi^2\\left(H, df\\right)$$

    With:
    $$W_j = b_j\\times x_{n_j,j} + \\frac{1 - b_j}{n_j}\\times\\sum_{i=1}^{n_j-1} x_{i,j}$$
    $$\\bar{W} = \\frac{\\sum_{j=1}^k W_j}{k}$$
    $$b_j = \\frac{1+\\sqrt{\\frac{\\left(n_j-1\\right)\\times\\left(n_j\\times\\hat{\\theta}-s_j^2\\right)}{s_j^2}}}{n_j}$$
    $$\\hat{\\theta} = \\max\\left(\\frac{s_1^2}{n_1}, \\frac{s_2^2}{n_2},\\dots,\\frac{s_k^2}{n_k}\\right)$$
    $$\\bar{x}_j = \\frac{\\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
    $$s_j^2 = \\frac{\\sum_{i=1}^{n_j} \\left(x_{i,j} - \\bar{x}_j\\right)^2}{n_j - 1}$$

    The original article has an error in the formula for \\(b_j\\): brackets are missing. The formula used here was adapted from the population version, \\(c_j\\), given in the article.

    *Symbols used*

    * \\(x_{i,j}\\), the i-th score in category j
    * \\(k\\), the number of categories
    * \\(n_j\\), the sample size of category j
    * \\(\\bar{x}_j\\), the sample mean of category j
    * \\(s_j^2\\), the sample variance of the scores in category j
    * \\(df\\), the degrees of freedom
    * \\(\\chi^2\\left(\\dots\\right)\\), the cumulative distribution function of the chi-square distribution

    References
    ----------
    Wilcox, R. R. (1988). A new alternative to the ANOVA F and new results on James’s second-order method. *British Journal of Mathematical and Statistical Psychology, 41*(1), 109–117. doi:10.1111/j.2044-8317.1988.tb00890.x

    Author
    ------
    Made by P. Stikker

    Companion website: https://PeterStatistics.com
    YouTube channel: https://www.youtube.com/stikpet
    Donations: https://www.patreon.com/bePatron?u=19398076
    '''
    # convert plain lists to pandas Series
    if isinstance(nomField, list):
        nomField = pd.Series(nomField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)

    data = pd.concat([nomField, scaleField], axis=1)
    data.columns = ["category", "score"]

    # remove unused categories
    if categories is not None:
        data = data[data.category.isin(categories)]

    # remove rows with missing values and reset the index
    data = data.dropna()
    data = data.reset_index(drop=True)
    # overall sample size
    n = len(data["category"])

    # per-category sample sizes, variances, last scores and sums
    scores = data.groupby('category')['score']
    nj = scores.count()
    sj2 = scores.var()
    # x_{n_j,j} in the formula is the last score of category j
    xj = scores.last()
    sj = scores.sum()

    # number of categories
    k = len(nj)
    # theta-hat: the largest s_j^2 / n_j over the categories
    t = float((sj2/nj).max())
    # clip at 0 so rounding error cannot give a negative value under the root
    bj = (1 + ((nj - 1)*(nj*t - sj2).clip(lower=0)/sj2)**0.5) / nj
    wj = bj*xj + (1 - bj)/nj * (sj - xj)
    wm = wj.sum()/k
    h = float(((wj - wm)**2).sum()/t)
    df = k - 1
    pVal = chi2.sf(h, df)

    # results
    res = pd.DataFrame([[n, h, df, pVal]])
    res.columns = ["n", "statistic", "df", "p-value"]
    return res
Functions
def ts_wilcox_owa(nomField, scaleField, categories=None)
Wilcox One-Way ANOVA
Tests if the means (averages) of each category could be the same in the population.
If the p-value is below a pre-defined threshold (usually 0.05), the null hypothesis is rejected, and at least two categories are then concluded to have different means on the scaleField score in the population.
Several alternatives to this test exist; the stikpet library offers Fisher, Welch, James, Box, Scott-Smith, Brown-Forsythe, Alexander-Govern, Mehrotra modified Brown-Forsythe, Hartung-Agac-Makabi, Özdemir-Kurt and Wilcox as options. See the notes from ts_fisher_owa() for some discussion on the differences.
Parameters
nomField : pandas series
    data with categories
scaleField : pandas series
    data with the scores
categories : list or dictionary, optional
    the categories to use from nomField
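The input handling described by these parameters can be sketched in plain Python. This is only an illustration under assumptions: `pair_and_filter` is a hypothetical helper that is not part of the stikpet library, it pairs the two inputs by position (the source instead combines them with pandas), and it treats `None` and `NaN` as missing.

```python
import math


def pair_and_filter(labels, scores, categories=None):
    """Pair labels with scores, keep the requested categories, drop missing values.

    Hypothetical plain-Python stand-in for the pandas steps in the source.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    groups = {}
    for label, score in zip(labels, scores):
        if categories is not None and label not in categories:
            continue  # category was not requested
        if is_missing(label) or is_missing(score):
            continue  # comparable to dropping rows with missing values
        groups.setdefault(label, []).append(score)
    return groups


example = pair_and_filter(
    ["a", "b", "a", "c", None],
    [1.0, 2.0, float("nan"), 4.0, 5.0],
    categories=["a", "b"],
)
```

Here `example` keeps one score for "a" and one for "b": the "c" row is filtered out by `categories`, and the rows with a missing score or label are dropped.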
Returns
Dataframe with:
- n, the sample size
- statistic, the test statistic (chi-square value)
- df, degrees of freedom
- p-value, the p-value (significance)
Notes
The formula used (Wilcox, 1988, pp. 110-111):
$$H = \frac{\sum_{j=1}^k \left(W_j - \bar{W}\right)^2}{\hat{\theta}}$$
$$df = k - 1$$
$$sig. = 1 - \chi^2\left(H, df\right)$$
With:
$$W_j = b_j\times x_{n_j,j} + \frac{1 - b_j}{n_j}\times\sum_{i=1}^{n_j-1} x_{i,j}$$
$$\bar{W} = \frac{\sum_{j=1}^k W_j}{k}$$
$$b_j = \frac{1+\sqrt{\frac{\left(n_j-1\right)\times\left(n_j\times\hat{\theta}-s_j^2\right)}{s_j^2}}}{n_j}$$
$$\hat{\theta} = \max\left(\frac{s_1^2}{n_1}, \frac{s_2^2}{n_2},\dots,\frac{s_k^2}{n_k}\right)$$
$$\bar{x}_j = \frac{\sum_{i=1}^{n_j} x_{i,j}}{n_j}$$
$$s_j^2 = \frac{\sum_{i=1}^{n_j} \left(x_{i,j} - \bar{x}_j\right)^2}{n_j - 1}$$
The original article has an error in the formula for \(b_j\): brackets are missing. The formula used here was adapted from the population version, \(c_j\), given in the article.
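As a rough illustration of how the formulas above fit together, the following sketch computes H and df with the standard library only. It is not the library's implementation: `wilcox_h` is a hypothetical name, each group is assumed to be a list of scores in their original order (so the last element plays the role of \(x_{n_j,j}\)), and the clamp at zero is an added safeguard against rounding error.

```python
import math
from statistics import variance


def wilcox_h(groups):
    """Return (H, df) for a list of score lists, following the formulas above.

    Hypothetical helper for illustration; not part of the stikpet library.
    """
    k = len(groups)
    n = [len(g) for g in groups]
    s2 = [variance(g) for g in groups]  # sample variance, denominator n_j - 1

    # theta-hat: the largest s_j^2 / n_j over the groups
    theta = max(v / m for v, m in zip(s2, n))

    w = []
    for g, n_j, s2_j in zip(groups, n, s2):
        # clamp at 0 so rounding error cannot give a negative value under the root
        radicand = max(0.0, (n_j - 1) * (n_j * theta - s2_j) / s2_j)
        b_j = (1 + math.sqrt(radicand)) / n_j
        # the last score gets weight b_j, each earlier score gets (1 - b_j) / n_j
        w.append(b_j * g[-1] + (1 - b_j) / n_j * sum(g[:-1]))

    w_bar = sum(w) / k
    h = sum((w_j - w_bar) ** 2 for w_j in w) / theta
    return h, k - 1


h, df = wilcox_h([[1, 2, 3, 4], [2, 3, 4, 5]])
```

For these two groups the variances and sample sizes are equal, so both \(b_j\) equal 1/4, which gives \(W_1 = 2.125\), \(W_2 = 2.9375\) and \(H = 0.7921875\) with \(df = 1\).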
Symbols used
- \(x_{i,j}\), the i-th score in category j
- \(k\), the number of categories
- \(n_j\), the sample size of category j
- \(\bar{x}_j\), the sample mean of category j
- \(s_j^2\), the sample variance of the scores in category j
- \(df\), the degrees of freedom
- \(\chi^2\left(\dots\right)\), the cumulative distribution function of the chi-square distribution
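The significance is the upper tail of the chi-square distribution, \(1 - \chi^2\left(H, df\right)\). The source obtains it with `chi2.sf` from scipy. Purely as an illustration of what that value is, the sketch below evaluates the same tail with the standard library, using the known finite-series forms for integer degrees of freedom. `chi2_sf` is a hypothetical helper, not part of stikpet or scipy.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df.

    Hypothetical helper for illustration only.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) plus a finite correction series
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return math.erfc(math.sqrt(half)) + total


p_value = chi2_sf(5.991, 2)  # close to 0.05, the usual critical value for df = 2
```

In practice `scipy.stats.chi2.sf(h, df)` is the more general choice, since it also covers non-integer degrees of freedom.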
References
Wilcox, R. R. (1988). A new alternative to the ANOVA F and new results on James’s second-order method. British Journal of Mathematical and Statistical Psychology, 41(1), 109–117. doi:10.1111/j.2044-8317.1988.tb00890.x
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076