Module stikpetP.tests.test_multinomial_gof
Expand source code
import itertools as it
import pandas as pd
import numpy as np
from scipy.stats import multinomial
def ts_multinomial_gof(data, expCounts=None):
'''
Multinomial Goodness-of-Fit Test
--------------------------------
A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis). If the test has a p-value below a pre-defined threshold (usually 0.05) the assumption they are all equal in the population will be rejected.
There are quite a few tests that can do this. Perhaps the most commonly used is a Pearson chi-square test, but also a G-test, Freeman-Tukey, Neyman, Mod-Log Likelihood and Cressie-Read test are possible.
McDonald (2014, p. 82) suggests to always use this exact test as long as the sample size is less than 1000 (which was just picked as a nice round number, when n is very large the exact test becomes computational heavy even for computers).
This function is shown in this [YouTube video](https://youtu.be/eGcd45t8LlA) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/Multinomial-GoF.html)
Parameters
----------
data : list or pandas data series
expCounts : pandas data frame, optional
the categories and expected counts
Returns
-------
testResults : Pandas dataframe with the probability of the observed frequencies, number of combinations used, significance (p-value) and test used
Notes
-----
It uses the itertools, pandas, numpy and scipy's stats multinomial function
The exact multinomial test of goodness of fit is done in four steps
Step 1: Determine the probability of the observed counts using the probability mass function of the multinomial distribution
Step 2: Determine all possible permutations with repetition that create a sum equal to the sample size over the k-categories.
Step 3: Determine the probability of each of these permutations using the probability mass function of the multinomial distribution.
Step 4: Sum all probabilities found in step 3 that are equal or less than the one found in step 1.
Before, After and Alternatives
------------------------------
Before this an impression using a frequency table or a visualisation might be helpful:
* [tab_frequency](../other/table_frequency.html#tab_frequency)
* [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart
* [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot
* [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot
* [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart
* [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart
After this you might want to perform a post-hoc test:
* [ph_pairwise_bin](../other/poho_pairwise_bin.html#ph_pairwise_bin) for Pairwise Binary Test
* [ph_pairwise_gof](../other/poho_pairwise_gof.html#ph_pairwise_gof) for Pairwise Goodness-of-Fit Tests
* [ph_residual_gof_bin](../other/poho_residual_gof_bin.html#ph_residual_gof_bin) for Residuals Tests
* [ph_residual_gof_gof](../other/poho_residual_gof_gof.html#ph_residual_gof_gof) for Residuals Using Goodness-of-Fit Tests
Alternative tests:
* [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test
* [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit
* [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit
* [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test
* [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit
* [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit
* [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test
References
----------
McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing.
Author
------
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076
Examples
---------
>>> pd.set_option('display.width',1000)
>>> pd.set_option('display.max_columns', 1000)
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'})
>>> ex1 = df1['mar1'][0:20]
>>> ts_multinomial_gof(ex1)
p obs. n combs. p-value test
0 0.00022 8855 0.203762 one-sample multinomial exact goodness-of-fit test
Example 2: pandas series with various settings
>>> ex2 = df1['mar1'][0:20]
>>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]})
>>> ts_multinomial_gof(ex2, expCounts=eCounts)
p obs. n combs. p-value test
0 0.003209 1330 0.435166 one-sample multinomial exact goodness-of-fit test
Example 3: a list
>>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"]
>>> ts_multinomial_gof(ex3)
p obs. n combs. p-value test
0 0.002541 1540 0.388712 one-sample multinomial exact goodness-of-fit test
'''
if type(data) == list:
data = pd.Series(data)
#determine the observed counts
if expCounts is None:
#generate frequency table
freq = data.value_counts()
n = sum(freq)
freq = freq.rename_axis('category').reset_index(name='count')
#number of categories to use (k)
k = len(freq)
#number of expected counts is simply sample size
nE = n
else:
#if expected counts are given
#number of categories to use (k)
k = len(expCounts)
freq = pd.DataFrame(columns = ["category", "count"])
for i in range(0, k):
nk = data[data==expCounts.iloc[i, 0]].count()
lk = expCounts.iloc[i, 0]
freq = pd.concat([freq, pd.DataFrame([{"category": lk, "count": nk}])])
nE = sum(expCounts.iloc[:,1])
freq = freq.reset_index(drop=True)
n = sum(freq["count"])
#the true expected counts
if expCounts is None:
#assume all to be equal
exp_prop = [1/k] * k
else:
#check if categories match
exp_prop = []
for i in range(0,k):
exp_prop.append(expCounts.iloc[i, 1]/nE)
observed = freq.iloc[:,1]
p_obs = multinomial.pmf(x=np.sort(observed), n=n, p=exp_prop)
counts = np.arange(0, n + 1)
all_perm = np.asarray(list(it.product(counts, repeat=k)))
sum_perm = all_perm[np.sum(all_perm, axis=1) == n]
ncomb = len(sum_perm)
all_exp_prop_same = all(x == exp_prop[0] for x in exp_prop)
p_val = 0
for i in sum_perm:
p_perm = multinomial.pmf(x=i, n=n, p=exp_prop)
if p_perm <= p_obs or (all_exp_prop_same and np.array_equal(np.sort(observed), np.sort(i))):
p_val = p_val + p_perm
testUsed = "one-sample multinomial exact goodness-of-fit test"
testResults = pd.DataFrame([[p_obs, ncomb, p_val, testUsed]], columns=["p obs.", "n combs.", "p-value", "test"])
pd.set_option('display.max_colwidth', None)
return testResults
Functions
def ts_multinomial_gof(data, expCounts=None)
-
Multinomial Goodness-of-Fit Test
A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis). If the test has a p-value below a pre-defined threshold (usually 0.05) the assumption they are all equal in the population will be rejected.
There are quite a few tests that can do this. Perhaps the most commonly used is a Pearson chi-square test, but also a G-test, Freeman-Tukey, Neyman, Mod-Log Likelihood and Cressie-Read test are possible.
McDonald (2014, p. 82) suggests to always use this exact test as long as the sample size is less than 1000 (which was just picked as a nice round number, when n is very large the exact test becomes computational heavy even for computers).
This function is shown in this YouTube video and the test is also described at PeterStatistics.com
Parameters
data
:list
orpandas data series
expCounts
:pandas data frame
, optional- the categories and expected counts
Returns
testResults
:Pandas dataframe with the probability
ofthe observed frequencies, number
ofcombinations used, significance (p-value) and test used
Notes
It uses the itertools, pandas, numpy and scipy's stats multinomial function
The exact multinomial test of goodness of fit is done in four steps
Step 1: Determine the probability of the observed counts using the probability mass function of the multinomial distribution
Step 2: Determine all possible permutations with repetition that create a sum equal to the sample size over the k-categories.
Step 3: Determine the probability of each of these permutations using the probability mass function of the multinomial distribution.
Step 4: Sum all probabilities found in step 3 that are equal or less than the one found in step 1.
Before, After and Alternatives
Before this an impression using a frequency table or a visualisation might be helpful: * tab_frequency * vi_bar_simple for Simple Bar Chart * vi_cleveland_dot_plot for Cleveland Dot Plot * vi_dot_plot for Dot Plot * vi_pareto_chart for Pareto Chart * vi_pie for Pie Chart
After this you might want to perform a post-hoc test: * ph_pairwise_bin for Pairwise Binary Test * ph_pairwise_gof for Pairwise Goodness-of-Fit Tests * ph_residual_gof_bin for Residuals Tests * ph_residual_gof_gof for Residuals Using Goodness-of-Fit Tests
Alternative tests: * ts_pearson_gof for Pearson Chi-Square Goodness-of-Fit Test * ts_freeman_tukey_gof for Freeman-Tukey Test of Goodness-of-Fit * ts_freeman_tukey_read for Freeman-Tukey-Read Test of Goodness-of-Fit * ts_g_gof for G (Likelihood Ratio) Goodness-of-Fit Test * ts_mod_log_likelihood_gof for Mod-Log Likelihood Test of Goodness-of-Fit * ts_neyman_gof for Neyman Test of Goodness-of-Fit * ts_powerdivergence_gof for Power Divergence GoF Test
References
McDonald, J. H. (2014). Handbook of biological statistics (3rd ed.). Sparky House Publishing.
Author
Made by P. Stikker
Companion website: https://PeterStatistics.com
YouTube channel: https://www.youtube.com/stikpet
Donations: https://www.patreon.com/bePatron?u=19398076Examples
>>> pd.set_option('display.width',1000) >>> pd.set_option('display.max_columns', 1000)
Example 1: pandas series
>>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'}) >>> ex1 = df1['mar1'][0:20] >>> ts_multinomial_gof(ex1) p obs. n combs. p-value test 0 0.00022 8855 0.203762 one-sample multinomial exact goodness-of-fit test
Example 2: pandas series with various settings
>>> ex2 = df1['mar1'][0:20] >>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]}) >>> ts_multinomial_gof(ex2, expCounts=eCounts) p obs. n combs. p-value test 0 0.003209 1330 0.435166 one-sample multinomial exact goodness-of-fit test
Example 3: a list
>>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"] >>> ts_multinomial_gof(ex3) p obs. n combs. p-value test 0 0.002541 1540 0.388712 one-sample multinomial exact goodness-of-fit test
Expand source code
def ts_multinomial_gof(data, expCounts=None): ''' Multinomial Goodness-of-Fit Test -------------------------------- A test that can be used with a single nominal variable, to test if the probabilities in all the categories are equal (the null hypothesis). If the test has a p-value below a pre-defined threshold (usually 0.05) the assumption they are all equal in the population will be rejected. There are quite a few tests that can do this. Perhaps the most commonly used is a Pearson chi-square test, but also a G-test, Freeman-Tukey, Neyman, Mod-Log Likelihood and Cressie-Read test are possible. McDonald (2014, p. 82) suggests to always use this exact test as long as the sample size is less than 1000 (which was just picked as a nice round number, when n is very large the exact test becomes computational heavy even for computers). This function is shown in this [YouTube video](https://youtu.be/eGcd45t8LlA) and the test is also described at [PeterStatistics.com](https://peterstatistics.com/Terms/Tests/Multinomial-GoF.html) Parameters ---------- data : list or pandas data series expCounts : pandas data frame, optional the categories and expected counts Returns ------- testResults : Pandas dataframe with the probability of the observed frequencies, number of combinations used, significance (p-value) and test used Notes ----- It uses the itertools, pandas, numpy and scipy's stats multinomial function The exact multinomial test of goodness of fit is done in four steps Step 1: Determine the probability of the observed counts using the probability mass function of the multinomial distribution Step 2: Determine all possible permutations with repetition that create a sum equal to the sample size over the k-categories. Step 3: Determine the probability of each of these permutations using the probability mass function of the multinomial distribution. Step 4: Sum all probabilities found in step 3 that are equal or less than the one found in step 1. Before, After and Alternatives ------------------------------ Before this an impression using a frequency table or a visualisation might be helpful: * [tab_frequency](../other/table_frequency.html#tab_frequency) * [vi_bar_simple](../visualisations/vis_bar_simple.html#vi_bar_simple) for Simple Bar Chart * [vi_cleveland_dot_plot](../visualisations/vis_cleveland_dot_plot.html#vi_cleveland_dot_plot) for Cleveland Dot Plot * [vi_dot_plot](../visualisations/vis_dot_plot.html#vi_dot_plot) for Dot Plot * [vi_pareto_chart](../visualisations/vis_pareto_chart.html#vi_pareto_chart) for Pareto Chart * [vi_pie](../visualisations/vis_pie.html#vi_pie) for Pie Chart After this you might want to perform a post-hoc test: * [ph_pairwise_bin](../other/poho_pairwise_bin.html#ph_pairwise_bin) for Pairwise Binary Test * [ph_pairwise_gof](../other/poho_pairwise_gof.html#ph_pairwise_gof) for Pairwise Goodness-of-Fit Tests * [ph_residual_gof_bin](../other/poho_residual_gof_bin.html#ph_residual_gof_bin) for Residuals Tests * [ph_residual_gof_gof](../other/poho_residual_gof_gof.html#ph_residual_gof_gof) for Residuals Using Goodness-of-Fit Tests Alternative tests: * [ts_pearson_gof](../tests/test_pearson_gof.html#ts_pearson_gof) for Pearson Chi-Square Goodness-of-Fit Test * [ts_freeman_tukey_gof](../tests/test_freeman_tukey_gof.html#ts_freeman_tukey_gof) for Freeman-Tukey Test of Goodness-of-Fit * [ts_freeman_tukey_read](../tests/test_freeman_tukey_read.html#ts_freeman_tukey_read) for Freeman-Tukey-Read Test of Goodness-of-Fit * [ts_g_gof](../tests/test_g_gof.html#ts_g_gof) for G (Likelihood Ratio) Goodness-of-Fit Test * [ts_mod_log_likelihood_gof](../tests/test_mod_log_likelihood_gof.html#ts_mod_log_likelihood_gof) for Mod-Log Likelihood Test of Goodness-of-Fit * [ts_neyman_gof](../tests/test_neyman_gof.html#ts_neyman_gof) for Neyman Test of Goodness-of-Fit * [ts_powerdivergence_gof](../tests/test_powerdivergence_gof.html#ts_powerdivergence_gof) for Power Divergence GoF Test References ---------- McDonald, J. H. (2014). *Handbook of biological statistics* (3rd ed.). Sparky House Publishing. Author ------ Made by P. Stikker Companion website: https://PeterStatistics.com YouTube channel: https://www.youtube.com/stikpet Donations: https://www.patreon.com/bePatron?u=19398076 Examples --------- >>> pd.set_option('display.width',1000) >>> pd.set_option('display.max_columns', 1000) Example 1: pandas series >>> df1 = pd.read_csv('https://peterstatistics.com/Packages/ExampleData/GSS2012a.csv', sep=',', low_memory=False, storage_options={'User-Agent': 'Mozilla/5.0'}) >>> ex1 = df1['mar1'][0:20] >>> ts_multinomial_gof(ex1) p obs. n combs. p-value test 0 0.00022 8855 0.203762 one-sample multinomial exact goodness-of-fit test Example 2: pandas series with various settings >>> ex2 = df1['mar1'][0:20] >>> eCounts = pd.DataFrame({'category' : ["MARRIED", "DIVORCED", "NEVER MARRIED", "SEPARATED"], 'count' : [5,5,5,5]}) >>> ts_multinomial_gof(ex2, expCounts=eCounts) p obs. n combs. p-value test 0 0.003209 1330 0.435166 one-sample multinomial exact goodness-of-fit test Example 3: a list >>> ex3 = ["MARRIED", "DIVORCED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "NEVER MARRIED", "MARRIED", "MARRIED", "MARRIED", "SEPARATED", "DIVORCED", "NEVER MARRIED", "NEVER MARRIED", "DIVORCED", "DIVORCED", "MARRIED"] >>> ts_multinomial_gof(ex3) p obs. n combs. p-value test 0 0.002541 1540 0.388712 one-sample multinomial exact goodness-of-fit test ''' if type(data) == list: data = pd.Series(data) #determine the observed counts if expCounts is None: #generate frequency table freq = data.value_counts() n = sum(freq) freq = freq.rename_axis('category').reset_index(name='count') #number of categories to use (k) k = len(freq) #number of expected counts is simply sample size nE = n else: #if expected counts are given #number of categories to use (k) k = len(expCounts) freq = pd.DataFrame(columns = ["category", "count"]) for i in range(0, k): nk = data[data==expCounts.iloc[i, 0]].count() lk = expCounts.iloc[i, 0] freq = pd.concat([freq, pd.DataFrame([{"category": lk, "count": nk}])]) nE = sum(expCounts.iloc[:,1]) freq = freq.reset_index(drop=True) n = sum(freq["count"]) #the true expected counts if expCounts is None: #assume all to be equal exp_prop = [1/k] * k else: #check if categories match exp_prop = [] for i in range(0,k): exp_prop.append(expCounts.iloc[i, 1]/nE) observed = freq.iloc[:,1] p_obs = multinomial.pmf(x=np.sort(observed), n=n, p=exp_prop) counts = np.arange(0, n + 1) all_perm = np.asarray(list(it.product(counts, repeat=k))) sum_perm = all_perm[np.sum(all_perm, axis=1) == n] ncomb = len(sum_perm) all_exp_prop_same = all(x == exp_prop[0] for x in exp_prop) p_val = 0 for i in sum_perm: p_perm = multinomial.pmf(x=i, n=n, p=exp_prop) if p_perm <= p_obs or (all_exp_prop_same and np.array_equal(np.sort(observed), np.sort(i))): p_val = p_val + p_perm testUsed = "one-sample multinomial exact goodness-of-fit test" testResults = pd.DataFrame([[p_obs, ncomb, p_val, testUsed]], columns=["p obs.", "n combs.", "p-value", "test"]) pd.set_option('display.max_colwidth', None) return testResults