Power Divergence Tests
Introduction
The power divergence test can be used with a single nominal variable, to test if the probabilities of all categories are equal (the null hypothesis), or with two nominal variables, to test if they are independent.
There are quite a few tests that can do this. Cressie and Read (1984, p. 463) noticed that the \(\chi^2\), \(G^2\), \(T^2\), \(NM^2\) and \(GM^2\) test statistics can all be captured in one general formula, governed by an additional parameter lambda (\(\lambda\)). After investigating different values, they settled on \(\lambda = \frac{2}{3}\).
By setting \(\lambda\) to different values, we get the different tests:
- Pearson chi-square set \(\lambda = 1\)
- G/Wilks/Likelihood-Ratio set \(\lambda=0\)
- Freeman-Tukey set \(\lambda = -\frac{1}{2}\)
- Mod-Log-Likelihood set \(\lambda = -1\)
- Neyman set \(\lambda = -2\)
- Cressie-Read set \(\lambda = \frac{2}{3}\)
Condition
Since the test uses a chi-square distribution, there is a condition to be checked before using it.
For a goodness-of-fit test it is often recommended to only use the test if the minimum expected count is at least 5 (Peck & Devore, 2012, p. 593).
For a test of independence the criterion is often that the minimum expected count should be at least 1 and no more than 20% of cells may have an expected count below 5. This is often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter, and finds that all cells should have an expected count of at least 5.
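As a rough sketch, these conditions can be checked programmatically. The function below is just illustrative (the name and parameters are not from any library); the defaults correspond to the Cochran conditions:

```python
def check_conditions(expected, min_count=1, small=5, max_prop_small=0.20):
    """Check Cochran-style conditions on a table of expected counts:
    the minimum expected count must be at least `min_count`, and at most
    `max_prop_small` of the cells may have an expected count below `small`."""
    flat = [e for row in expected for e in row]
    prop_small = sum(e < small for e in flat) / len(flat)
    return min(flat) >= min_count and prop_small <= max_prop_small

# Made-up 2x2 table of expected counts: all cells are at least 5
print(check_conditions([[12.0, 18.0], [28.0, 42.0]]))  # True
```

For Fisher's stricter criterion one could call the same function with `min_count=5`.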
Performing the Test
with Excel
Excel file: TS - Power Divergence (E).xlsm
with stikpetE
Goodness-of-Fit without stikpetE
Independence without stikpetE
To Be Made
with Python
Jupyter Notebook: TS - Power Divergence (P).ipynb
with stikpetP
with other libraries
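One option here is SciPy, which covers the same family: `scipy.stats.power_divergence` for goodness-of-fit and `scipy.stats.chi2_contingency` (via its `lambda_` argument) for independence. A minimal sketch with made-up counts:

```python
# Sketch using SciPy's implementation of the power divergence family
# (the observed counts below are made-up example data).
from scipy.stats import power_divergence, chi2_contingency

# Goodness-of-fit: four categories, expected counts default to equal
observed = [50, 46, 62, 42]
stat, p = power_divergence(observed, lambda_="cressie-read")  # lambda = 2/3
print(round(stat, 4), round(p, 4))

# Test of independence: a made-up 2x3 contingency table
table = [[10, 20, 30],
         [20, 25, 15]]
chi2, p2, dof, expected = chi2_contingency(table, lambda_="cressie-read")
print(round(chi2, 4), round(p2, 4), dof)
```

The `lambda_` argument also accepts the other named variants ("pearson", "log-likelihood", "freeman-tukey", "mod-log-likelihood", "neyman") or any numeric value.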
without libraries
Manually
The formula used for a test of independence is (Cressie & Read, 1984, p. 442):
\(\chi_{C}^{2} = \begin{cases} 2\times\sum_{i=1}^{r}\sum_{j=1}^c\left(F_{i,j}\times ln\left(\frac{F_{i,j}}{E_{i,j}}\right)\right) & \text{ if } \lambda=0 \\ 2\times\sum_{i=1}^{r}\sum_{j=1}^c\left(E_{i,j}\times ln\left(\frac{E_{i,j}}{F_{i,j}}\right)\right) & \text{ if } \lambda=-1 \\ \frac{2}{\lambda\times\left(\lambda + 1\right)} \times \sum_{i=1}^{r}\sum_{j=1}^{c} F_{i,j}\times\left(\left(\frac{F_{i,j}}{E_{i,j}}\right)^{\lambda} - 1\right) & \text{ else } \end{cases}\)
\(df = \left(r - 1\right)\times\left(c - 1\right)\)
For a test of goodness-of-fit:
\(\chi_{C}^{2} = \begin{cases} 2\times\sum_{i=1}^{k}F_{i}\times ln\left(\frac{F_{i}}{E_{i}}\right) & \text{ if } \lambda=0 \\ 2\times\sum_{i=1}^{k} E_{i}\times ln\left(\frac{E_{i}}{F_{i}}\right) & \text{ if } \lambda=-1 \\ \frac{2}{\lambda\times\left(\lambda + 1\right)} \times \sum_{i=1}^{k} F_{i}\times\left(\left(\frac{F_{i}}{E_{i}}\right)^{\lambda} - 1\right) & \text{ else } \end{cases}\)
\(df = k - 1\)
The p-value is then found using:
\(sig. = 1 - \chi^2\left(\chi_{C}^{2},df\right)\)
With:
\(n = \sum_{i=1}^k F_i\)
If no expected counts provided:
For a test of independence:
\(E_{i,j} = \frac{R_i\times C_j}{n}\)
with
\(R_{i} = \sum_{j=1}^c F_{i,j}\)
\(C_{j} = \sum_{i=1}^r F_{i,j}\)
For a test of goodness-of-fit:
\(E_i = \frac{n}{k}\)
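These expected-count formulas can be sketched in plain Python (the function names are just illustrative):

```python
def expected_independence(table):
    """E_ij = R_i * C_j / n, computed from the row and column totals."""
    row_totals = [sum(row) for row in table]          # R_i
    col_totals = [sum(col) for col in zip(*table)]    # C_j
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def expected_gof(observed):
    """E_i = n / k: equal expected counts over all categories."""
    n, k = sum(observed), len(observed)
    return [n / k] * k

# Made-up example data
print(expected_independence([[10, 20], [30, 40]]))  # [[12.0, 18.0], [28.0, 42.0]]
print(expected_gof([50, 46, 62, 42]))               # [50.0, 50.0, 50.0, 50.0]
```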
Symbols used:
- \(k\) the number of categories
- \(r\) the number of rows
- \(c\) the number of columns
- \(F_i\) the (absolute) frequency of category i
- \(F_{i,j}\) the (absolute) frequency of row i and column j
- \(E_i\) the expected frequency of category i
- \(E_{i,j}\) the expected frequency of row i and column j
- \(n\) the sample size, i.e. the sum of all frequencies
- \(\chi^2\left(\dots\right)\) the chi-square cumulative distribution function
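As a minimal sketch, the goodness-of-fit formulas above can be translated into plain Python. The function names are illustrative, and the chi-square right-tail probability is computed with a series expansion of the regularized lower incomplete gamma function:

```python
import math

def power_divergence_gof(observed, expected=None, lambd=2/3):
    """Power divergence goodness-of-fit statistic and degrees of freedom
    (Cressie & Read, 1984); lambda defaults to the Cressie-Read 2/3."""
    n, k = sum(observed), len(observed)
    if expected is None:
        expected = [n / k] * k                       # equal expected counts
    pairs = list(zip(observed, expected))
    if lambd == 0:                                   # G / likelihood ratio
        stat = 2 * sum(f * math.log(f / e) for f, e in pairs)
    elif lambd == -1:                                # mod-log-likelihood
        stat = 2 * sum(e * math.log(e / f) for f, e in pairs)
    else:                                            # general case
        stat = 2 / (lambd * (lambd + 1)) * sum(
            f * ((f / e) ** lambd - 1) for f, e in pairs)
    return stat, k - 1

def chi2_sf(x, df):
    """P(X > x) for a chi-square distribution with df degrees of freedom,
    via the series expansion of the regularized lower incomplete gamma."""
    a, hx = df / 2, x / 2
    term = 1.0 / a
    total = term
    for i in range(1, 500):
        term *= hx / (a + i)
        total += term
    cdf = total * math.exp(-hx + a * math.log(hx) - math.lgamma(a))
    return 1 - cdf

# Made-up counts over four categories, Cressie-Read lambda = 2/3
stat, df = power_divergence_gof([50, 46, 62, 42])
p = chi2_sf(stat, df)
print(round(stat, 4), df, round(p, 4))
```

Setting `lambd=1` reproduces the Pearson chi-square value, as the general formula promises.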
The formulas for the continuity corrections:
The Yates continuity correction (cc="yates") is calculated using (Yates, 1934, p. 222):
\(\chi_{PY}^2 = \sum_{i=1}^k \frac{\left(\left|F_i - E_i\right| - 0.5\right)^2}{E_i}\)
Or adjust the observed counts with:
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i \geq E_i \\ F_i + 0.5 & \text{ if } F_i < E_i \end{cases}\)
Then use these adjusted counts in the calculation of the chi-square value.
Sometimes the Yates continuity correction is written as (Allen, 1990, p. 523):
\(\chi_{PY}^2 = \sum_{i=1}^k \frac{\max\left(0, \left(\left|F_i - E_i\right| - 0.5\right)\right)^2}{E_i}\)
Which is then the same as adjusting the observed counts using:
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i - 0.5 > E_i \\ F_i + 0.5 & \text{ if } F_i + 0.5 < E_i \\ F_i & \text{ else } \end{cases}\)
The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 36):
\(\chi_{PP}^2 = \chi_{P}^{2}\times\frac{n - 1}{n}\)
The Williams correction (cc="williams") is calculated for a Goodness-of-Fit test using (Williams, 1976, p. 36):
\(\chi_{PW}^2 = \frac{\chi_{P}^2}{q}\)
With:
\(q = 1 + \frac{k^2 - 1}{6\times n\times df}\)
While for a test of independence the formula of q changes to (Williams, 1976, p. 36; McDonald, 2014):
\(q = 1+\frac{\left(n\times \left(\sum_{i=1}^r \frac{1}{R_i}\right) - 1\right)\times \left(n\times\left(\sum_{j=1}^c \frac{1}{C_j}\right) - 1\right)}{6\times n \times\left(r - 1\right)\times\left(c - 1\right)}\)
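The corrections can be sketched as small helper functions (the names are illustrative; the Yates version shown applies the \(\left|F_i - E_i\right| - 0.5\) form to a two-category goodness-of-fit setting):

```python
def pearson_chi2(observed, expected):
    """Plain Pearson chi-square statistic."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

def yates_chi2(observed, expected):
    """Yates (1934): reduce each |F - E| by 0.5 (two categories only)."""
    return sum((abs(f - e) - 0.5) ** 2 / e for f, e in zip(observed, expected))

def pearson_corrected(chi2, n):
    """E.S. Pearson (1947): scale the statistic by (n - 1)/n."""
    return chi2 * (n - 1) / n

def williams_corrected_gof(chi2, n, k):
    """Williams (1976) for goodness-of-fit: divide by q."""
    df = k - 1
    q = 1 + (k ** 2 - 1) / (6 * n * df)
    return chi2 / q

# Made-up example: 26 vs 14 observed, 20 expected in each of two categories
obs, exp = [26, 14], [20.0, 20.0]
print(round(pearson_chi2(obs, exp), 4))            # 3.6
print(round(yates_chi2(obs, exp), 4))              # 3.025
print(round(pearson_corrected(3.6, 40), 4))        # 3.51
print(round(williams_corrected_gof(3.6, 40, 2), 4))  # 3.5556
```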
Interpreting the Result
Check the Conditions
If you do not meet the criteria, there are three options. First off, are you sure you have a nominal variable, and not an ordinal one? If you have an ordinal variable, you probably want a different test. If you are sure you have a nominal variable, you might be able to combine two or more categories into one larger category. If, for example, you asked people about their country of birth, but a few countries were only selected by one or two people, you might want to combine these into a category 'other'. Be very clear in your report that you've done so. Alternatively, you can use an exact test. For the goodness-of-fit test this is an exact multinomial test of goodness-of-fit; for the test of independence, a Fisher exact test (for tables larger than 2x2 this is also known as the Fisher-Freeman-Halton Exact Test).
Reading the p-value
The assumption about the population (the null hypothesis) is reflected in the expected counts: if you were to research the entire population, the counts would match the expected counts.
With the goodness-of-fit variation, the expected counts are often simply set so that each category is chosen evenly. For example, if the sample size was 200 and there were four categories, the expected count for each category is usually simply 200/4 = 50.
With the test of independence, the expected counts are calculated from the row and column totals. These are the counts expected if the two variables were independent. The null hypothesis is then often stated as that the two variables are independent (i.e. have no influence on each other).
The test provides a p-value, which is the probability of obtaining a test statistic as extreme as, or more extreme than, the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05.
If the assumption is rejected with a goodness-of-fit test, we conclude that the categories are not equally distributed in the population; with a test of independence, that the two variables are not independent (i.e. they have an influence on each other).
Note that if we do not reject the assumption, it does not mean we accept it, we simply state that there is insufficient evidence to reject it.
Writing the results
Both variations use the chi-square distribution. A template for tests that use this distribution would be:
χ2([degrees of freedom], N = [sample size]) = [test value], p = [p-value]
Prior to this, mention the test that was done, and the interpretation of the results. For example:
A Power Divergence test of goodness-of-fit showed that the marital status was not equally distributed in the population, χ2(4, N = 1941) = 1249.13, p < .001.
Round the p-value to three decimal places, or if it is below .001 (as in the example) use p < .001
You can get that chi symbol (χ) in Word by typing in the letter 'c', then select it and change the font to 'Symbol'. The square can be done by typing a '2' and make it a superscript.
APA (2019, p. 88) states to also report an effect size measure.
A visualisation might be appreciated prior to the test result, and an effect-size measure after it. If the result is significant, we could then follow up with a post-hoc analysis: for a goodness-of-fit test, to determine which categories are significantly different from each other; for a test of independence, to better describe what the influence is.
Corrections
The chi-square distribution is a so-called continuous distribution, but the observed counts are discrete. It is therefore possible to add a so-called continuity correction. I've seen three versions of this:
- Yates (1934, p. 222) (only if there are two categories (in each variable))
- Williams (1976, p. 36)
- E.S. Pearson (1947, p. 36)
Note that the Yates correction has been shown to be too conservative and its use is often discouraged (see Haviland (1990) for a good starting point on this).
Next step and Alternatives
APA (2019, p. 88) states to also report an effect size measure with statistical tests.
If the test is significant, a post-hoc analysis can be used to pinpoint which categories are significantly different. The post-hoc analysis for a goodness-of-fit test is discussed here.
There are quite a few tests that can be used with a single or two nominal variables:
- Exact Test (Multinomial for GoF, Fisher for 2x2 and Fisher-Freeman-Halton for larger tables)
- Pearson Chi-Square
- Neyman
- (Modified) Freeman-Tukey
- Cressie-Read / Power Divergence
- Freeman-Tukey-Read
- G / Likelihood Ratio / Wilks
- Mod-Log Likelihood
The Pearson chi-square is probably the most famous and most used one. Neyman swaps the observed and expected counts in Pearson's formula. Freeman-Tukey attempts to smooth the chi-square distribution by using a square-root transformation, while the G test uses a logarithm transformation. Cressie and Read noticed that all the other ones can be placed in a generic format and created a whole family of tests. For goodness-of-fit tests Read also proposed an alternative generalisation.
As for goodness-of-fit tests, McDonald (2014, p. 82) suggests always using the exact test as long as the sample size is less than 1000 (which was just picked as a nice round number; when n is very large the exact test becomes computationally heavy, even for computers). Lawal (1984) continued some work from Larntz (1978), compared the modified Freeman-Tukey, the G-test and the Pearson chi-square test, and concluded that for small samples the Pearson test is preferred, while for large samples either the Pearson or the G-test can be used, making the Freeman-Tukey test perhaps somewhat redundant.