Multinomial Goodness-of-Fit Test
Introduction
This is an extension of the one-sample binomial test to allow for more than two categories.
The test first determines the probability of the observed counts, given the expected counts. It then determines all possible count variations with the same total, calculates the probability of each of these, and sums the probabilities of all variations that are less than or equal to the probability of the observed counts.
The computation for this test is quite intensive, and for many years its use was therefore discouraged. With modern computers it is starting to gain popularity again.
Performing the Test
with Excel
Excel file from video: TS - Exact Multinomial GoF (E).xlsm.
with stikpetE
without stikpetE
with Python
Jupyter Notebook from videos: TS - Exact Multinomial GoF (P).ipynb.
with stikpetP
without stikpetP
with R
Jupyter Notebook from videos: TS - Exact Multinomial GoF (R).ipynb.
with stikpetR
without stikpetR
with SPSS
Datafile used StudentStatistics.sav.
Manually
To calculate the exact multinomial test by hand, the following steps can be used.
Step 1: Determine the probability of the observed counts using the probability mass function of the multinomial distribution.
The formula for this is given by:
\(\frac{n!}{\prod_{i=1}^{k}F_i!}\times\prod_{i=1}^{k}\pi_i^{F_i}\)
Where \(n\) is the total sample size, \(k\) the number of categories, \(F_i\) the frequency of the i-th category, and \(\pi_i\) the expected proportion of the i-th category.
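The probability mass function above can be sketched in a few lines of Python. The function name `multinomial_pmf` is just an illustrative choice, not part of any particular library:

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """Probability mass of the observed counts under a multinomial distribution.

    counts : list of observed frequencies F_i
    probs  : list of expected proportions pi_i (should sum to 1)
    """
    n = sum(counts)
    # n! / (F_1! * F_2! * ... * F_k!) -- the multinomial coefficient
    coef = factorial(n) // prod(factorial(f) for f in counts)
    # multiply by pi_i ^ F_i for each category
    return coef * prod(p ** f for f, p in zip(counts, probs))

# Example: 6 observations over 3 equally likely categories
multinomial_pmf([2, 2, 2], [1/3, 1/3, 1/3])  # 90/729, approx. 0.1235
```
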
Step 2: Determine all possible permutations with repetition that create a sum equal to the sample size over the k categories.
Step 3: Determine the probability of each of these permutations using the probability mass function of the multinomial distribution.
Step 4: Sum all probabilities found in step 3 that are less than or equal to the one found in step 1.
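The four steps can be sketched as a brute-force enumeration in Python. This is a minimal illustration of the procedure (function names are my own), practical only for small samples, since the number of tuples grows as \((n+1)^k\):

```python
from itertools import product
from math import factorial, prod

def exact_multinomial_test(counts, probs):
    """Exact multinomial goodness-of-fit p-value by full enumeration (steps 1-4)."""
    n, k = sum(counts), len(counts)

    def pmf(c):
        coef = factorial(n) // prod(factorial(f) for f in c)
        return coef * prod(p ** f for f, p in zip(c, probs))

    # Step 1: probability of the observed counts
    p_obs = pmf(counts)
    eps = 1e-12  # tolerance for floating-point comparison
    # Step 2: all k-tuples of counts 0..n that sum to n
    # Steps 3-4: sum the probabilities that are <= the observed one
    return sum(pmf(c) for c in product(range(n + 1), repeat=k)
               if sum(c) == n and pmf(c) <= p_obs + eps)

exact_multinomial_test([3, 1, 1], [1/3, 1/3, 1/3])  # approx. 0.6296
```
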
Step 2 is quite tricky. We could create all possible permutations with repetition. If our sample size is n and the number of categories is k, this gives \((n+1)^k\) permutations. The ‘+ 1’ comes from allowing a count of 0. Most of these permutations will not sum to the sample size, so they can be removed.
If the expected probability for each category is the same, we could use another approach. We could then create all possible combinations with replacement. This would give fewer results:
\(\binom{n+k}{k}=\frac{(n+k)!}{n!k!}\)
Again, we can remove the ones that don’t sum to the sample size. Then perform step 3, but now multiply each probability by the number of ways that combination can be arranged. If, for example, we have 5 categories and a total sample size of 20, one possible combination is [2, 2, 3, 3, 10]. This would be the same as [2, 3, 3, 2, 10], [2, 3, 10, 2, 3], etc. We determine the count (frequency) of each unique value: in the example, 2 has a frequency of 2, 3 also has a frequency of 2, and 10 has a frequency of 1. The two 2s can be arranged in:
\(\binom{5}{2}=\frac{5!}{(5-2)!2!}=10\)
The 5 is our number of categories, the 2 the frequency. For the two 3s we now have 5 − 2 = 3 spots left, so those can only be arranged in:
\(\binom{3}{2}=\frac{3!}{(3-2)!2!}=3\)
Combining these 3 with the 10 we had earlier gives 3 × 10 = 30 possibilities. The single 10 can only go to the one remaining spot, so that completes the count.
In general, if we have \(k\) categories, \(m\) different values, and \(F_i\) is the frequency of the i-th value, sorted from high to low, we get:
\(\binom{k}{F_1}\prod_{i=2}^{m}\binom{k-\sum_{j=1}^{i-1}F_j}{F_i}=\binom{k}{F_1}\binom{k-F_1}{F_2}\binom{k-F_1-F_2}{F_3}\cdots\binom{k-\sum_{j=1}^{m-1}F_j}{F_m}\)
Where:
\(\binom{a}{b}=\frac{a!}{(a-b)!b!}\)
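The arrangement count above can be sketched in Python. The function name `arrangement_count` is an illustrative choice; it reproduces the worked example of [2, 2, 3, 3, 10] giving 30 arrangements:

```python
from math import comb
from collections import Counter

def arrangement_count(combo):
    """Number of distinct orderings of a count-combination over k categories."""
    remaining = len(combo)  # k, the number of categories (spots to fill)
    total = 1
    # frequency of each unique count value, e.g. {2: 2, 3: 2, 10: 1}
    for f in Counter(combo).values():
        total *= comb(remaining, f)  # choose spots for this value
        remaining -= f               # those spots are now taken
    return total

arrangement_count([2, 2, 3, 3, 10])  # 10 * 3 * 1 = 30
```
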
Interpreting the Result
The assumption about the population (the null hypothesis) is formed by those expected counts, i.e. if you would research the entire population, the counts would match the expected counts.
With the goodness-of-fit test, the expected counts are often simply set so that each category is chosen evenly. For example, if the sample size was 200 and there were four categories, the expected count for each category is usually simply 200/4 = 50.
The test provides a p-value, which is the probability of a result as in the sample, or an even more extreme one, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below is then considered low.
If the assumption is rejected with a goodness-of-fit test, we conclude that the categories are not equally distributed in the population.
Writing the results
This test is a so-called exact test, and therefore only requires the p-value:
p = [p-value]
Prior to this, mention the test that was done, and the interpretation of the results. For example:
A multinomial test of goodness-of-fit showed that the marital status was not equally distributed in the population, p < .001.
Round the p-value to three decimal places, or if it is below .001 (as in the example) use p < .001.
Prior to the test result a visualisation might be appreciated, and afterwards an effect-size measure is also important to add. If the result is significant, we could then follow up with a post-hoc analysis to determine which categories are significantly different from each other.
Next step and Alternatives
APA (2019, p. 88) states to also report an effect size measure with statistical tests. Note though, that the usual effect-size measures for a goodness-of-fit test require a test statistic, something the multinomial test doesn't have since it is an exact test.
If the test is significant, a post-hoc analysis to pinpoint which category is significantly different could be used. The post-hoc analysis for a goodness-of-fit test is discussed here.
There are quite a few tests that can be used with a single or two nominal variables:
- Exact Test (Multinomial for GoF, Fisher for 2x2 and Fisher-Freeman-Halton for larger tables)
- Pearson Chi-Square
- Neyman
- (Modified) Freeman-Tukey
- Cressie-Read / Power Divergence
- Freeman-Tukey-Read
- G / Likelihood Ratio / Wilks
- Mod-Log Likelihood
The Pearson chi-square is probably the most famous and most used one. Neyman swaps the observed and expected counts in Pearson's formula. Freeman-Tukey attempts to smooth the chi-square distribution by using a square-root transformation, while the G test uses a logarithm transformation. Cressie and Read noticed that all the other ones can be placed in a generic format and created a whole family of tests. For the goodness-of-fit tests, Read also proposed an alternative generalisation.
As for goodness-of-fit tests, McDonald (2014, p. 82) suggests to always use the exact test as long as the sample size is less than 1000 (a number just picked as a nice round one; when n is very large the exact test becomes computationally heavy even for computers). Lawal (1984) continued some work from Larntz (1978) and compared the modified Freeman-Tukey, G-test and the Pearson chi-square test, and concluded that for small samples the Pearson test is preferred, while for large samples either the Pearson or G-test will do. This makes the Freeman-Tukey test perhaps somewhat redundant.