Module stikpetP.tests.test_trimmed_mean_is
from math import floor
from statistics import mean, variance
from scipy.stats import t
import pandas as pd
def ts_trimmed_mean_is(catField, scaleField, categories=None, dmu=0, trimProp = 0.1, se = "yuen"):
    '''
Independent Samples Trimmed/Yuen Mean Test
------------------------------------------
A test to compare two means. The null hypothesis would be that the means of each category are equal in the population.
As the name implies, a trimmed mean test trims the data and then performs the test. It therefore tests whether the trimmed means are equal, rather than the regular means. As a benefit, unlike the Student and Welch tests, it doesn't require the assumption of normality. The regular trimmed mean test (Yuen-Dixon) is the variation of the Student test and does require the assumption of equal variances, while the Yuen-Welch variation doesn't.
The test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/TrimmedMeansIS.html)
Parameters
----------
catField : dataframe or list
the categorical data
scaleField : dataframe or list
the scores
categories : list, optional
to indicate which two categories of catField to use, otherwise the two most frequent categories will be used.
dmu : float, optional
difference according to null hypothesis (default is 0)
trimProp : float, optional
proportion to trim in total for each category. If for example set to 0.1 then 0.05 from each side for each category will be trimmed. Default is 0.1.
se : {"yuen", "yuen-dixon"}, optional
to indicate which standard error to use. Default is "yuen".
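
As a rough illustration of how `trimProp` translates into counts, the sketch below mirrors the `floor(n*trimProp/2)` step used in the function body. The helper name `trim_counts` is hypothetical and not part of the package.

```python
from math import floor

def trim_counts(n, trim_prop=0.1):
    # hypothetical helper: number of scores cut from EACH side,
    # and the number of scores that remain after trimming
    g = floor(n * trim_prop / 2)
    return g, n - 2 * g

print(trim_counts(1083))  # (54, 975), sample size of FEMALE in Example 1
print(trim_counts(886))   # (44, 798), sample size of MALE in Example 1
print(trim_counts(12))    # (0, 12)
```

With the default `trimProp=0.1`, any category with fewer than 20 scores gives a count of 0, so nothing is trimmed and the trimmed mean equals the ordinary mean. This is what happens in Example 2.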
Returns
-------
A dataframe with:
* *n cat. 1*, the sample size of the first category
* *n cat. 2*, the sample size of the second category
* *trim mean cat. 1*, the trimmed sample mean of the first category
* *trim mean cat. 2*, the trimmed sample mean of the second category
* *diff.*, difference between the two trimmed sample means
* *hyp. diff.*, hypothesized difference between the two population means
* *statistic*, the test statistic (t-value)
* *df*, the degrees of freedom
* *p-value*, the significance (p-value)
* *test*, name of test used
Notes
-----
**YUEN**
The default `se="yuen"` will perform a Yuen-Welch test.
The formula used is (Yuen, 1974, p. 167):
$$t = \\frac{\\bar{x}_{t,1} - \\bar{x}_{t,2}}{SE}$$
$$sig = 2\\times\\left(1 - T\\left(\\left|t\\right|, df\\right)\\right)$$
With:
$$SE = \\sqrt{\\frac{s_{w,1}^2}{m_1} + \\frac{s_{w,2}^2}{m_2}}$$
$$s_{w,i}^2 = \\frac{SSD_{w,i}}{m_i - 1}$$
$$df = \\frac{1}{\\frac{c^2}{m_1 - 1} + \\frac{\\left(1 - c\\right)^2}{m_2 -1}}$$
$$c = \\frac{\\frac{s_{w,1}^2}{m_1}}{\\frac{s_{w,1}^2}{m_1} + \\frac{s_{w,2}^2}{m_2}}$$
$$\\bar{x}_{t,i} = \\frac{\\sum_{j=g_i+1}^{n_i - g_i}y_{i,j}}{m_i}$$
$$g_i = \\lfloor n_i\\times p_t\\rfloor$$
$$m_i = n_i - 2\\times g_i$$
$$SSD_{w,i} = g_i\\times\\left(y_{i,g_i+1} - \\bar{x}_{w,i}\\right)^2 + g_i\\times\\left(y_{i,n_i-g_i} - \\bar{x}_{w,i}\\right)^2 + \\sum_{j=g_i+1}^{n_i - g_i} \\left(y_{i,j} - \\bar{x}_{w,i}\\right)^2$$
$$\\bar{x}_{w,i} = \\frac{\\bar{x}_{t,i}\\times m_i + g_i\\times\\left(y_{i, g_i+1} + y_{i, n_i-g_i}\\right)}{n_i}$$
*Symbols used:*
* \\(\\bar{x}_{t,i}\\) the trimmed mean of the scores in category i
* \\(\\bar{x}_{w,i}\\) the Winsorized mean of the scores in category i
* \\(SSD_{w,i}\\) the sum of squared deviations from the Winsorized mean of category i
* \\(n_i\\) the number of scores in category i
* \\(g_i\\) the number of scores trimmed from each side in category i
* \\(m_i\\) the number of scores in the trimmed data set from category i
* \\(y_{i,j}\\) the j-th score in category i, after the scores are sorted from low to high
* \\(p_t\\) the proportion of trimming on each side, i.e. half of `trimProp`
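
The quantities above can be traced for a single sample with the standard library alone. This is a minimal sketch, not the package's code: the function name `winsorized_parts` and the toy data are assumptions, and `p_t` is the per-side proportion (half of `trimProp`).

```python
from math import floor

def winsorized_parts(scores, p_t):
    # hypothetical helper following the formulas above for one category
    y = sorted(scores)
    n = len(y)
    g = floor(n * p_t)              # scores trimmed from each side
    m = n - 2 * g                   # scores left after trimming
    kept = y[g:n - g]
    trim_mean = sum(kept) / m
    low, high = kept[0], kept[-1]   # y_{g+1} and y_{n-g}
    win_mean = (trim_mean * m + g * (low + high)) / n
    ssd_w = (g * (low - win_mean) ** 2
             + g * (high - win_mean) ** 2
             + sum((v - win_mean) ** 2 for v in kept))
    return trim_mean, win_mean, ssd_w, m

trim_mean, win_mean, ssd_w, m = winsorized_parts([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.1)
print(trim_mean, win_mean, ssd_w, ssd_w / (m - 1))  # 5.5 5.5 66.5 9.5
```

The outlier 100 affects neither the trimmed mean nor the Winsorized variance in this toy sample, because it is replaced by 9 before the squared deviations are summed.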
**YUEN-DIXON**
If `se="yuen-dixon"`, a trimmed means (Yuen-Dixon) test will be performed.
The formula used is (Yuen & Dixon, 1973, p. 394):
$$t = \\frac{\\bar{x}_{t,1} - \\bar{x}_{t,2}}{SE}$$
$$sig = 2\\times\\left(1 - T\\left(\\left|t\\right|, df\\right)\\right)$$
With:
$$SE = \\sqrt{\\frac{SSD_{w,1} + SSD_{w,2}}{m_1 + m_2 - 2}\\times\\left(\\frac{1}{m_1} + \\frac{1}{m_2}\\right)}$$
$$df = m_1 + m_2 - 2$$
$$\\bar{x}_{t,i} = \\frac{\\sum_{j=g_i+1}^{n_i - g_i}y_{i,j}}{m_i}$$
$$g_i = \\lfloor n_i\\times p_t\\rfloor$$
$$m_i = n_i - 2\\times g_i$$
$$SSD_{w,i} = g_i\\times\\left(y_{i,g_i+1} - \\bar{x}_{w,i}\\right)^2 + g_i\\times\\left(y_{i,n_i-g_i} - \\bar{x}_{w,i}\\right)^2 + \\sum_{j=g_i+1}^{n_i - g_i} \\left(y_{i,j} - \\bar{x}_{w,i}\\right)^2$$
$$\\bar{x}_{w,i} = \\frac{\\bar{x}_{t,i}\\times m_i + g_i\\times\\left(y_{i, g_i+1} + y_{i, n_i-g_i}\\right)}{n_i}$$
*Symbols used:*
* \\(\\bar{x}_{t,i}\\) the trimmed mean of the scores in category i
* \\(\\bar{x}_{w,i}\\) the Winsorized mean of the scores in category i
* \\(SSD_{w,i}\\) the sum of squared deviations from the Winsorized mean of category i
* \\(n_i\\) the number of scores in category i
* \\(g_i\\) the number of scores trimmed from each side in category i
* \\(m_i\\) the number of scores in the trimmed data set from category i
* \\(y_{i,j}\\) the j-th score in category i, after the scores are sorted from low to high
* \\(p_t\\) the proportion of trimming on each side, i.e. half of `trimProp`
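
To compare the two standard errors side by side, the sketch below applies both formulas to the scores of Example 2 (missing values already removed), where the default trimming removes nothing because both groups have fewer than 20 scores. The helper names `parts` and `trimmed_t` are hypothetical, and only the standard library is used, so no p-value is computed.

```python
from math import floor, sqrt

def parts(scores, p_t):
    # hypothetical helper: trimmed mean, Winsorized SSD and trimmed size
    y = sorted(scores)
    n = len(y)
    g = floor(n * p_t)
    m = n - 2 * g
    kept = y[g:n - g]
    wins = [kept[0]] * g + kept + [kept[-1]] * g   # Winsorized sample
    w_mean = sum(wins) / n
    ssd = sum((v - w_mean) ** 2 for v in wins)
    return sum(kept) / m, ssd, m

def trimmed_t(x1, x2, p_t=0.05, se="yuen"):
    # returns the t statistic and the degrees of freedom
    mean1, ssd1, m1 = parts(x1, p_t)
    mean2, ssd2, m2 = parts(x2, p_t)
    if se == "yuen":
        q1 = ssd1 / (m1 - 1) / m1      # s_w1^2 / m1
        q2 = ssd2 / (m2 - 1) / m2      # s_w2^2 / m2
        c = q1 / (q1 + q2)
        df = 1 / (c ** 2 / (m1 - 1) + (1 - c) ** 2 / (m2 - 1))
        std_err = sqrt(q1 + q2)
    else:
        df = m1 + m2 - 2
        std_err = sqrt((ssd1 + ssd2) / df * (1 / m1 + 1 / m2))
    return (mean1 - mean2) / std_err, df

intl = [50, 80, 40, 85, 70, 60, 90, 25, 40, 65, 98, 40]
nat = [20, 15, 30, 45, 70, 70]
print(trimmed_t(intl, nat, se="yuen"))
print(trimmed_t(intl, nat, se="yuen-dixon"))
```

The Yuen-Welch call should agree with the statistic and df reported in Example 2. The Yuen-Dixon variant pools the two Winsorized sums of squares and uses \\(m_1 + m_2 - 2\\) degrees of freedom.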
Before, After and Alternatives
------------------------------
Before this you might want some descriptive measures. Use [me_mode_bin](../measures/meas_mode_bin.html#me_mode_bin) for Mode for Binned Data, [me_mean](../measures/meas_mean.html#me_mean) for different types of mean, and/or [me_variation](../measures/meas_variation.html#me_variation) for different Measures of Quantitative Variation.
Suitable visualisations are [vi_boxplot_single](../visualisations/vis_boxplot_single.html#vi_boxplot_single) for a Box (and Whisker) Plot and [vi_histogram](../visualisations/vis_histogram.html#vi_histogram) for a Histogram.
After the test you might want an effect size measure, options include: [Common Language](../effect_sizes/eff_size_common_language_is.html), [Cohen d_s](../effect_sizes/eff_size_hedges_g_is.html), [Cohen U](../effect_sizes/eff_size_cohen_u.html), [Hedges g](../effect_sizes/eff_size_hedges_g_is.html), [Glass delta](../effect_sizes/eff_size_glass_delta.html), [biserial correlation](../correlations/cor_biserial.html), [point-biserial correlation](../effect_sizes/cor_point_biserial.html)
There are four similar tests, with different assumptions.
|test|equal variance|normality|
|-------|-----------|---------|
|[Student t](../tests/test_student_t_is.html)| yes | yes|
|[Welch t](../tests/test_welch_t_is.html) | no | yes|
|[Trimmed means](../tests/test_trimmed_mean_is.html) | yes | no |
|[Yuen-Welch](../tests/test_trimmed_mean_is.html)|no | no |
Another test that in some cases could be used is the [Z test](../tests/test_z_is.html)
References
----------
Yuen, K. K. (1974). The two-sample trimmed t for unequal population variances. *Biometrika, 61*(1), 165–170. https://doi.org/10.1093/biomet/61.1.165
Yuen, K. K., & Dixon, W. J. (1973). The approximate behaviour and performance of the two-sample trimmed t. *Biometrika, 60*(2), 369–374. https://doi.org/10.2307/2334550
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
--------
Example 1: Dataframe
>>> file1 = "https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv"
>>> df1 = pd.read_csv(file1, sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['age']
>>> ex1 = ex1.replace("89 OR OLDER", "90")
>>> ts_trimmed_mean_is(df1['sex'], ex1)
n FEMALE n MALE trim mean FEMALE trim mean MALE diff. hyp. diff. statistic df p-value test
0 1083 886 48.106667 47.274436 0.832231 0 0.971312 1707.922425 0.33153 Yuen-Welch independent samples t-test
Example 2: List
>>> scores = [20,50,80,15,40,85,30,45,70,60, None, 90,25,40,70,65, None, 70,98,40]
>>> groups = ["nat.","int.","int.","nat.","int.", "int.","nat.","nat.","int.","int.","int.","int.","int.","int.","nat.", "int." ,None,"nat.","int.","int."]
>>> ts_trimmed_mean_is(groups, scores)
n int. n nat. trim mean int. trim mean nat. diff. hyp. diff. statistic df p-value test
0 12 6 61.916667 41.666667 20.25 0 1.69314 9.750994 0.122075 Yuen-Welch independent samples t-test
    '''
    # convert to pandas series if needed
    if isinstance(catField, list):
        catField = pd.Series(catField)
    if isinstance(scaleField, list):
        scaleField = pd.Series(scaleField)
    # combine as one dataframe and drop rows with a missing category or score
    df = pd.concat([catField, scaleField], axis=1)
    df = df.dropna()
    # the two categories (by default the two most frequent ones)
    if categories is not None:
        cat1 = categories[0]
        cat2 = categories[1]
    else:
        cat1 = df.iloc[:,0].value_counts().index[0]
        cat2 = df.iloc[:,0].value_counts().index[1]
    # separate the scores for each category
    x1 = list(df.iloc[:,1][df.iloc[:,0] == cat1])
    x2 = list(df.iloc[:,1][df.iloc[:,0] == cat2])
    # make sure they are floats
    x1 = [float(x) for x in x1]
    x2 = [float(x) for x in x2]
    n1 = len(x1)
    n2 = len(x2)
    # number of scores to trim from each side
    g1 = floor(n1*trimProp/2)
    g2 = floor(n2*trimProp/2)
    # number of scores left after trimming
    n1t = n1 - 2*g1
    n2t = n2 - 2*g2
    # sort the scores
    x1.sort()
    x2.sort()
    # trimmed means
    m1t = mean(x1[g1:(n1 - g1)])
    m2t = mean(x2[g2:(n2 - g2)])
    # Winsorize the data: replace each tail with the nearest retained score
    x1[0:g1] = [x1[g1]]*g1
    x2[0:g2] = [x2[g2]]*g2
    x1[(n1 - g1):n1] = [x1[n1 - g1 - 1]]*g1
    x2[(n2 - g2):n2] = [x2[n2 - g2 - 1]]*g2
    # variance of the Winsorized data
    var1 = variance(x1)
    var2 = variance(x2)
    if se == "yuen":
        # Yuen-Welch: separate Winsorized variances, Welch-type df
        ssw1 = var1*(n1 - 1)
        ssw2 = var2*(n2 - 1)
        var1w = ssw1 / (n1t - 1)
        var2w = ssw2 / (n2t - 1)
        se = (var1w / n1t + var2w / n2t)**0.5
        c = (var1w / n1t) / (var1w / n1t + var2w / n2t)
        df = 1 / (c**2 / (n1t - 1) + (1 - c)**2 / (n2t - 1))
        testUsed = "Yuen-Welch independent samples t-test"
    else:
        # Yuen-Dixon: pooled Winsorized sum of squares
        s2 = ((n1 - 1)*var1 + (n2 - 1)*var2)/((n1t - 1) + (n2t - 1))
        se = (s2 * (1/n1t + 1/n2t))**0.5
        df = n1t + n2t - 2
        testUsed = "Trimmed Mean independent samples t-test"
    tValue = (m1t - m2t - dmu) / se
    # two-sided p-value from the t distribution
    pValue = 2*(1 - t.cdf(abs(tValue), df))
    statistic = tValue
    # f-strings so that non-string category labels also work
    colnames = [f"n {cat1}", f"n {cat2}", f"trim mean {cat1}", f"trim mean {cat2}", "diff.", "hyp. diff.", "statistic", "df", "p-value", "test"]
    results = pd.DataFrame([[n1, n2, m1t, m2t, m1t - m2t, dmu, statistic, df, pValue, testUsed]], columns=colnames)
    return results