Post-Hoc Tests after a Goodness-of-Fit Test
Introduction
A Goodness-of-Fit (GoF) test informs us whether the counts in the population might not all be equal across the different categories, or, if expected counts were provided, whether the population proportions could indeed be those expected proportions. It is then usually also interesting to know which categories have a count different from the expected count, or from each other. This gives two possible types of post-hoc tests:
- pairwise tests, which compare each possible pair of categories
- residual tests, which compare each category's proportion with its expected proportion
For each of these we can use any of the one-sample binary tests (binomial, Wald or score test) or any of the goodness-of-fit tests (Pearson, Freeman-Tukey, Freeman-Tukey-Read, G, mod-log-G, Neyman, power divergence, multinomial). For the residual tests there is also a specific test based on the (adjusted) standardized residuals.
Each time we perform a test, we run a risk of rejecting the null hypothesis even though it is true; this risk is the alpha level. Although alpha is usually quite low, when we perform multiple tests the chance of at least one such mistake quickly increases: with \(k\) independent tests it is \(1-\left(1-\alpha\right)^k\). If we use \(\alpha = .05\), with for example six tests the chance of at least one mistake is already about 26%. To counter this there are various adjustments/corrections, of which Bonferroni is probably the most basic and most used.
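To see where the 26% comes from, and what Bonferroni does about it, here is a minimal Python sketch (the numbers are just the ones from the example above):

```python
# Familywise error rate for k independent tests at alpha = .05,
# and the Bonferroni-adjusted alpha level that counters it.
alpha, k = 0.05, 6
fwer = 1 - (1 - alpha) ** k   # chance of at least one false rejection
print(round(fwer, 4))         # 0.2649, i.e. about 26%
print(alpha / k)              # Bonferroni: test each comparison at ~.0083
```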
Because of the corrections, it could be that none of the post-hoc tests reveals a significant result, even though the GoF test did. In that case there is insufficient data to pinpoint the categories with a significant difference.
Since this analysis is done after we have already established that the categories are most likely not all equal, it is called a post-hoc test. 'Post hoc' simply means something that happens after an event, the GoF test being the event in this case.
The residual test using the adjusted standardized residuals is also known as an adjusted (standardized) residual test (Haberman, 1973, p. 205; Sharpe, 2015, p. 3), while others might call it a standardized residual test (Agresti, 2007, p. 38; R, n.d.). It gives the same result as a one-sample score test.
Performing the Test
pairwise tests
using one-sample binary tests
with SPSS
Manually
For each pair of categories, use as total sample size the sum of the counts of the two categories, i.e. \(n_{pair} = n_i + n_j\). If the original expected counts were not all set equal, adjust them using: \(E_i^* = n_{pair}\times\frac{E_i}{E_i + E_j}\)
Then simply perform the binary test on each possible pair. See the separate pages of each binary test for the corresponding formulas (binomial, Wald or score test).
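As an illustration, a minimal Python sketch of this procedure, using scipy's exact binomial test on hypothetical counts (with equal expected counts), could look like this:

```python
from itertools import combinations
from scipy.stats import binomtest

# Hypothetical observed counts; expected counts all equal here
counts = {"A": 33, "B": 60, "C": 42, "D": 65}
expected = {c: sum(counts.values()) / len(counts) for c in counts}

pairs = list(combinations(counts, 2))
for i, j in pairs:
    n_pair = counts[i] + counts[j]
    p_i = expected[i] / (expected[i] + expected[j])  # E_i / (E_i + E_j)
    res = binomtest(counts[i], n_pair, p=p_i)        # exact binomial test
    adj_p = min(res.pvalue * len(pairs), 1)          # Bonferroni adjustment
    print(f"{i} vs {j}: p = {res.pvalue:.4f}, adj. p = {adj_p:.4f}")
```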
using goodness-of-fit tests
with SPSS
To Be Uploaded
Manually
For each pair of categories, use as total sample size the sum of the counts of the two categories, i.e. \(n_{pair} = n_i + n_j\). If the original expected counts were not all set equal, adjust them using: \(E_i^* = n_{pair}\times\frac{E_i}{E_i + E_j}\)
Then simply perform the goodness-of-fit test on each possible pair. See the separate pages of each goodness-of-fit test for the corresponding formulas (Pearson, Freeman-Tukey, Freeman-Tukey-Read, G, mod-log-G, Neyman, power divergence, multinomial).
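A similar sketch for the Pearson variant, again with hypothetical counts and scipy's chisquare function, could be:

```python
from itertools import combinations
from scipy.stats import chisquare

# Hypothetical observed counts; expected counts all equal here
counts = {"A": 33, "B": 60, "C": 42, "D": 65}
expected = {c: sum(counts.values()) / len(counts) for c in counts}

pairs = list(combinations(counts, 2))
for i, j in pairs:
    n_pair = counts[i] + counts[j]
    # Rescale the expected counts so they sum to n_pair:
    # E_i* = n_pair * E_i / (E_i + E_j)
    e_i = n_pair * expected[i] / (expected[i] + expected[j])
    stat, p = chisquare([counts[i], counts[j]], f_exp=[e_i, n_pair - e_i])
    adj_p = min(p * len(pairs), 1)  # Bonferroni adjustment
    print(f"{i} vs {j}: chi2 = {stat:.3f}, p = {p:.4f}, adj. p = {adj_p:.4f}")
```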
residual tests
using one-sample binary tests
with Python
Jupyter Notebook: PH - Residual Binary Tests for GoF (P).ipynb
with stikpetP
without stikpetP
with R
Jupyter Notebook: PH - Residual Binary Tests for GoF (R).ipynb
with stikpetR
without stikpetR
with SPSS
To Be Uploaded
Manually
Simply perform the one-sample binary test on each category vs. all other categories combined into one large category. See the separate pages of each one-sample binary test for the corresponding formulas (binomial, Wald or score test).
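A minimal Python sketch of this, using scipy's exact binomial test on hypothetical counts with equal expected counts, could be:

```python
from scipy.stats import binomtest

# Hypothetical observed counts; expected counts all equal here
counts = {"A": 33, "B": 60, "C": 42, "D": 65}
n = sum(counts.values())
k = len(counts)

for c, f in counts.items():
    # each category vs. all other categories combined; expected proportion 1/k
    res = binomtest(f, n, p=1 / k)
    adj_p = min(res.pvalue * k, 1)  # Bonferroni over the k tests
    print(f"{c} vs rest: p = {res.pvalue:.4f}, adj. p = {adj_p:.4f}")
```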
Besides the one-sample binary tests, there is also an option to use the residuals. These are defined as:
\(R_i = F_i - E_i\)
\(R_i^{\text{st.}} = \frac{R_i}{\sqrt{E_i}}\)
\(R_i^{\text{adj. st.}} = \frac{R_i}{\sqrt{E_i\times\left(1 - \frac{E_i}{n}\right)}}\)
The two-tailed significance is then found using:
\(sig. = 2\times\left(1-\Phi\left(\left|R_i^{*}\right|\right)\right)\)
where \(R_i^{*}\) is either the standardized or the adjusted standardized residual.
Symbols used:
- \(n\), the sample size.
- \(F_i\), the observed count of category \(i\).
- \(E_i\), the expected count of category \(i\).
- \(R_i\), the residual of category \(i\).
- \(R_i^{\text{st.}}\), the standardized residual of category \(i\).
- \(R_i^{\text{adj. st.}}\), the adjusted standardized residual of category \(i\).
- \(\Phi\left(\dots\right)\), the cumulative distribution function of the standard normal distribution.
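Putting these formulas together, a minimal Python sketch using the adjusted standardized residuals (hypothetical counts, equal expected counts assumed) could look like this:

```python
from scipy.stats import norm

# Hypothetical observed counts; expected counts all equal here
counts = {"A": 33, "B": 60, "C": 42, "D": 65}
n = sum(counts.values())
expected = {c: n / len(counts) for c in counts}

for c, f in counts.items():
    e = expected[c]
    r_adj = (f - e) / (e * (1 - e / n)) ** 0.5   # adjusted standardized residual
    p = 2 * (1 - norm.cdf(abs(r_adj)))           # two-tailed significance
    adj_p = min(p * len(counts), 1)              # Bonferroni over the k tests
    print(f"{c}: R adj. st. = {r_adj:.3f}, p = {p:.4f}, adj. p = {adj_p:.4f}")
```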
using goodness-of-fit tests
with Python
Jupyter Notebook: PH - Residual Binary Tests for GoF (P).ipynb
with stikpetP
without stikpetP
with R
Jupyter Notebook: PH - Residual Binary Tests for GoF (R).ipynb
with stikpetR
without stikpetR
with SPSS
To Be Uploaded
Manually
Simply perform the goodness-of-fit test on each category vs. all other categories combined into one large category. See the separate pages of each goodness-of-fit test for the corresponding formulas (Pearson, Freeman-Tukey, Freeman-Tukey-Read, G, mod-log-G, Neyman, power divergence, multinomial).
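A minimal Python sketch of this collapsing approach, with the Pearson chi-square test from scipy and hypothetical counts, could be:

```python
from scipy.stats import chisquare

# Hypothetical observed counts; expected counts all equal here
counts = {"A": 33, "B": 60, "C": 42, "D": 65}
n = sum(counts.values())
k = len(counts)

for c, f in counts.items():
    e = n / k  # expected count for this category
    # category vs. all other categories combined into one
    stat, p = chisquare([f, n - f], f_exp=[e, n - e])
    adj_p = min(p * k, 1)  # Bonferroni over the k tests
    print(f"{c} vs rest: chi2 = {stat:.3f}, p = {p:.4f}, adj. p = {adj_p:.4f}")
```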
Interpreting the Result
The assumption about the population (null hypothesis) is that in the population the counts equal the expected counts.
With the goodness-of-fit variation, the expected counts are often simply set so that each category is chosen evenly. For example, if the sample size is 200 and there are four categories, the expected count for each category is usually 200/4 = 50.
The test of each pair provides a p-value: the probability of a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. It is then adjusted for the multiple testing. If this adjusted p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05.
If the assumption is rejected, we conclude that the categories in that comparison are most likely not distributed as expected in the population.
Note that if we do not reject the assumption, it does not mean we accept it, we simply state that there is insufficient evidence to reject it.
Writing the results
The results of the post-hoc analyses are sometimes shown in a regular table:
| Category 1 | Category 2 | p-value | adj. p-value |
|---|---|---|---|
| A | B | .003 | .018 |
| A | C | .122 | .732 |
| A | D | .531 | 1.00 |
| B | C | .002 | .012 |
| B | D | .231 | 1.00 |
| C | D | .081 | .486 |
or sometimes as a cross table of the adjusted p-values:
| Category 1 \ Category 2 | A | B | C | D |
|---|---|---|---|---|
| A |  | .018 | .732 | 1.00 |
| B | .018 |  | .012 | 1.00 |
| C | .732 | .012 |  | .486 |
| D | 1.00 | 1.00 | .486 |  |
The cross table is a bit shorter, but for tests other than exact tests we should also include the test statistic. With a regular table this can simply be done by adding a column, but the cross table becomes less clear if we add additional values to the cells.
Depending on the situation, you could sometimes summarize the table. In the example we could note that only Categories A and B, and Categories B and C were significantly different.
Next step
APA (2019, p. 88) recommends also reporting an effect size measure. If a one-sample binary test was used, this could be Cohen's g, Cohen's h', or the Alternative Ratio. If a goodness-of-fit test was used, it could be Cramér's V, Cohen's w, Johnston-Berry-Mielke E or Fei. Additionally, if a residual, score or Wald test was used, the Rosenthal correlation coefficient is also an option.
To obtain these measures, select only the data in the pair (in case of a pairwise post-hoc test), or the category and all other categories combined as one large one (in case of a residual test). The stikpetP, stikpetR and stikpetE libraries each also have a function that can do this for you.
with Excel and stikpetE
Excel file: ES - Post-Hoc GoF (E).xlsm
with Python and stikpetP
Jupyter Notebook: ES - Post-Hoc GoF (P).ipynb
with R and stikpetR
Jupyter Notebook: ES - Post-Hoc GoF (R).ipynb