(modified) Freeman-Tukey Test
Introduction
The Freeman-Tukey test is an alternative to the more popular Pearson chi-square test. There are two variations of this test:
- a goodness-of-fit (gof) test
- test of independence
The goodness-of-fit variation is used when analysing a single nominal variable, while the test of independence is used with two nominal variables.
Both tests compare observed counts with expected counts. The test attempts to smooth the chi-square distribution by using a square-root transformation. Note, however, that this test might actually be redundant compared to some of the alternatives (see the bottom of this page).
The modified Freeman-Tukey test uses a slightly different formula than the original one.
Performing the Test
How to perform the goodness-of-fit test:
with Python
Jupyter Notebook from videos: TS - Freeman-Tukey GoF (P).ipynb.
with stikpetP
without stikpetP
with R
Jupyter notebook: TS - Freeman-Tukey GoF (R).ipynb.
Manually
The formula used is (Ayinde & Abidoye, 2010, p. 21):
\(T^{2}=4\times\sum_{i=1}^{k}\left(\sqrt{F_{i}} - \sqrt{E_{i}}\right)^2\)
\(df = k - 1\)
\(sig. = 1 - \chi^2\left(T^{2},df\right)\)
With:
\(n = \sum_{i=1}^k F_i\)
If no expected counts provided:
\(E_i = \frac{n}{k}\)
else:
\(E_i = n\times\frac{E_{p_i}}{n_p}\)
\(n_p = \sum_{i=1}^k E_{p_i}\)
A modified version uses another possible smoothing (Larntz, 1978, p. 253):
\(T_{mod}^2 = \sum_{i=1}^{k}\left(\sqrt{F_{i}} + \sqrt{F_{i} + 1} - \sqrt{4\times E_{i} + 1}\right)^2\)
Symbols used:
- \(k\) the number of categories
- \(F_i\) the (absolute) frequency of category i
- \(E_i\) the expected frequency of category i
- \(E_{p_i}\) the provided expected frequency of category i
- \(n\) the sample size, i.e. the sum of all frequencies
- \(n_p\) the sum of all provided expected counts
- \(\chi^2\left(\dots\right)\) the chi-square cumulative distribution function
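The formulas above can be sketched in Python. This is a minimal illustration (not the stikpetP implementation); `scipy` is used for the chi-square cumulative distribution function, and variable names follow the symbols above:

```python
# Sketch of the Freeman-Tukey goodness-of-fit test from the formulas above.
from math import sqrt
from scipy.stats import chi2

def freeman_tukey_gof(F, Ep=None, modified=False):
    """F: observed counts; Ep: optional provided expected counts."""
    k = len(F)
    n = sum(F)
    if Ep is None:
        E = [n / k] * k                  # E_i = n / k (even distribution)
    else:
        n_p = sum(Ep)                    # n_p = sum of provided expected counts
        E = [n * e / n_p for e in Ep]    # E_i = n * E_p_i / n_p
    if modified:
        # Larntz (1978) modified statistic
        T2 = sum((sqrt(f) + sqrt(f + 1) - sqrt(4 * e + 1)) ** 2
                 for f, e in zip(F, E))
    else:
        T2 = 4 * sum((sqrt(f) - sqrt(e)) ** 2 for f, e in zip(F, E))
    df = k - 1
    p = 1 - chi2.cdf(T2, df)             # sig. = 1 - chi2(T^2, df)
    return T2, df, p

T2, df, p = freeman_tukey_gof([45, 30, 25])
```

With these counts the expected counts are 100/3 per category, giving a test statistic of about 6.24 on 2 degrees of freedom.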
The test is attributed to Freeman and Tukey (1950), but the formula cannot readily be found there. Another source often mentioned is Bishop et al. (2007).
Interpreting the Result
Is it appropriate to use?
First you should check whether it was appropriate to use the test, since this test is unreliable if the sample size is small.
The criterion is often that the minimum expected count should be at least 1 and that no more than 20% of cells have an expected count less than 5. This is often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter, and finds that all cells should have an expected count of at least 5.
If you do not meet the criteria, there are three options. First off, are you sure you have a nominal variable, and not an ordinal one? If you have an ordinal variable, you probably want a different test. If you are sure you have a nominal variable, you might be able to combine two or more categories into one larger category. If, for example, you asked people about their country of birth, but a few countries were only selected by one or two people, you might want to combine these into a category 'other'. Be very clear in your report that you have done so. Alternatively, you can use an exact test. For the goodness-of-fit variation this is an exact multinomial test of goodness-of-fit; for the test of independence a Fisher exact test (for tables larger than 2x2 this is also known as the Fisher-Freeman-Halton exact test).
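The Cochran conditions described above are easy to check in code. A quick sketch in Python (an illustration of the criteria, not part of any package):

```python
# Check the Cochran conditions: the minimum expected count should be at
# least 1, and no more than 20% of cells may have an expected count below 5.
def cochran_conditions(expected):
    min_ok = min(expected) >= 1
    pct_below_5 = sum(e < 5 for e in expected) / len(expected)
    return min_ok and pct_below_5 <= 0.20

print(cochran_conditions([50, 50, 50, 50]))  # True
print(cochran_conditions([12, 3, 2, 8]))     # 50% of cells below 5 -> False
```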
Reading the p-value
The assumption about the population (the null hypothesis) is given by the expected counts, i.e. if you were to research the entire population, the counts would match the expected counts.
With the goodness-of-fit variation, the expected counts are often simply set so that each category is chosen evenly. For example, if the sample size was 200 and there were four categories, the expected count for each category is usually simply 200/4 = 50.
With the test of independence, the expected counts are calculated from the row and column totals. These are then the counts expected if the two variables were independent. The null hypothesis is then often stated as that the two variables are independent (i.e. have no influence on each other).
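The calculation of the expected counts from the row and column totals can be sketched as follows (a minimal illustration using a plain list-of-lists cross table; each cell gets row total times column total divided by the sample size):

```python
# Expected counts for a test of independence, from row and column totals.
def expected_counts(table):
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # E_ij = (row total i) * (column total j) / n
    return [[r * c / n for c in col_totals] for r in row_totals]

obs = [[20, 30],
       [30, 20]]
exp = expected_counts(obs)  # each cell: 50 * 50 / 100 = 25.0
```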
The test provides a p-value, which is the probability of a test statistic at least as extreme as the one from the sample, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below this is then considered low.
If the assumption is rejected with a goodness-of-fit test, we conclude that the categories are not equally distributed in the population; with a test of independence, that the two variables are not independent (i.e. they have an influence on each other).
Writing the results
Both variations use the chi-square distribution. A template for tests that use this distribution would be:
χ2([degrees of freedom], N = [sample size]) = [test value], p = [p-value]
Prior to this, mention the test that was done, and the interpretation of the results. For example:
A Freeman-Tukey test of goodness-of-fit showed that the marital status was not equally distributed in the population, χ2(4, N = 1941) = 1249.13, p < .001.
Round the p-value to three decimal places, or if it is below .001 (as in the example) use p < .001.
You can get that chi symbol (χ) in Word by typing in the letter 'c', then select it and change the font to 'Symbol'. The square can be done by typing a '2' and make it a superscript.
APA (2019, p. 88) states to also report an effect size measure.
A visualisation might be appreciated before the test result, and an effect-size measure after it. If the result is significant, we could then follow up with a post-hoc analysis: for a goodness-of-fit test to determine which categories are significantly different from each other, or for a test of independence to better describe what the influence is.
Corrections
The chi-square distribution is a so-called continuous distribution, but the observed counts are discrete. It is therefore possible to add a so-called continuity correction. I've seen three versions of this:
- Yates (1934, p. 222) (only if there are two categories (in each variable))
- Williams (1976, p. 36)
- E.S. Pearson (1947, p. 36)
The formulas of these continuity corrections are as follows.
The Yates continuity correction (cc="yates") is calculated by adjusting the observed counts (Yates, 1934, p. 222):
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i \geq E_i \\ F_i + 0.5 & \text{ if } F_i < E_i \end{cases}\)
Then use these adjusted counts in the calculations.
Sometimes the Yates continuity correction is written as (Allen, 1990, p. 523):
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i - 0.5 > E_i \\ F_i + 0.5 & \text{ if } F_i + 0.5 < E_i \\ F_i & \text{ else } \end{cases}\)
The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 36):
\(\chi_{PP}^2 = \chi_{P}^{2}\times\frac{n - 1}{n}\)
The Williams correction (cc="williams") is calculated for a Goodness-of-Fit test using (Williams, 1976, p. 36):
\(\chi_{PW}^2 = \frac{\chi_{P}^2}{q}\)
With:
\(q = 1 + \frac{k^2 - 1}{6\times n\times df}\)
While for a test of independence the formula of q changes to (Williams, 1976, p. 36; McDonald, 2014):
\(q = 1+\frac{\left(n\times \left(\sum_{i=1}^r \frac{1}{R_i}\right) - 1\right)\times \left(n\times\left(\sum_{j=1}^c \frac{1}{C_j}\right) - 1\right)}{6\times n \times\left(r - 1\right)\times\left(c - 1\right)}\)
Note that the Yates correction has been shown to be too conservative and its use is often discouraged (see Haviland (1990) for a good starting point on this).
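The three corrections above can be sketched in Python for the goodness-of-fit setting. In this illustration `chi2_stat` stands for an already computed test statistic; the corrections only rescale it, or adjust the counts that go into it:

```python
# Sketches of the three continuity corrections, goodness-of-fit setting.

def yates_adjust(F, E):
    # Yates (1934): move each observed count 0.5 towards its expected count
    return [f - 0.5 if f >= e else f + 0.5 for f, e in zip(F, E)]

def pearson_correction(chi2_stat, n):
    # E.S. Pearson (1947): multiply the statistic by (n - 1) / n
    return chi2_stat * (n - 1) / n

def williams_correction(chi2_stat, n, k):
    # Williams (1976), goodness-of-fit version: divide the statistic by q
    df = k - 1
    q = 1 + (k ** 2 - 1) / (6 * n * df)
    return chi2_stat / q

F = [45, 30, 25]
E = [100 / 3] * 3
adj = yates_adjust(F, E)  # [44.5, 30.5, 25.5], then recompute the test with these
```

For the Yates version the adjusted counts are used in place of the observed counts in the test statistic; the other two are applied to the statistic itself.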
Next step and Alternatives
APA (2019, p. 88) states to also report an effect size measure with statistical tests; several such measures are available for a goodness-of-fit test.
If the test is significant, a post-hoc analysis to pinpoint which category is significantly different could be used. The post-hoc analysis for a goodness-of-fit test is discussed here.
There are quite a few tests that can be used with a single or two nominal variables:
- Exact Test (multinomial for GoF, Fisher for 2x2, and Fisher-Freeman-Halton for larger tables)
- Pearson Chi-Square
- Neyman
- (Modified) Freeman-Tukey
- Cressie-Read / Power Divergence
- Freeman-Tukey-Read
- G / Likelihood Ratio / Wilks
- Mod-Log Likelihood
The Pearson chi-square is probably the most famous and most used one. Neyman swaps the observed and expected counts in Pearson's formula. Freeman-Tukey attempts to smooth the chi-square distribution by using a square-root transformation, while the G test uses a logarithmic transformation. Cressie and Read noticed that all the other ones can be placed in a generic format and created a whole family of tests. For the goodness-of-fit tests, Read also proposed an alternative generalisation.
As for goodness-of-fit tests, McDonald (2014, p. 82) suggests always using the exact test as long as the sample size is less than 1000 (which was just picked as a nice round number; when n is very large the exact test becomes computationally heavy even for computers). Lawal (1984) continued some work from Larntz (1978), compared the modified Freeman-Tukey, the G-test and the Pearson chi-square test, and concluded that for small samples the Pearson test is preferred, while for large samples either the Pearson or the G-test. This makes the Freeman-Tukey test perhaps somewhat redundant.