Neyman Test
Introduction
The Neyman test is an alternative to the more popular Pearson chi-square test. Neyman swaps the position of the observed and the expected counts in the formula of the Pearson chi-square test. There are two variations of this test:
- a goodness-of-fit (gof) test
- a test of independence
The goodness-of-fit variation is used when analysing a single nominal variable, while the test of independence is used with two nominal variables.
Both tests compare observed counts with expected counts; the Neyman version simply divides by the observed counts rather than the expected counts. Note however, that this test might actually be redundant compared to some of the alternatives (see bottom of page).
Performing the Test
Formulas
The Neyman chi-square goodness-of-fit test statistic (Neyman, 1949, p. 250):
\( \chi_{Neyman}^{2}=\sum_{i=1}^{k}\frac{\left(E_{i}-F_{i}\right)^{2}}{F_{i}}\)
The degrees of freedom (df):
\(df=k-1\)
\(sig. = 1 - \chi^2\left(\chi_{Neyman}^2, df\right)\)
Symbols used:
- \(F_i\), is the observed count in category \(i\)
- \(E_i\), is the expected count in category \(i\)
- \(k\), is the number of categories
- \(\chi^2\left(\chi_{Neyman}^2, df\right)\), the cumulative distribution function of the chi-square distribution
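As a quick illustration, the formulas above could be implemented as follows. This is a minimal sketch (the helper name `neyman_gof` is just for illustration), assuming SciPy is available for the chi-square distribution:

```python
# A minimal sketch of the Neyman chi-square goodness-of-fit test,
# following the formulas above. Requires SciPy for the chi-square CDF.
from scipy.stats import chi2

def neyman_gof(observed, expected):
    """Return the Neyman chi-square statistic, the df, and the p-value."""
    # chi2_Neyman = sum over categories of (E_i - F_i)^2 / F_i
    stat = sum((e - f) ** 2 / f for f, e in zip(observed, expected))
    df = len(observed) - 1
    p = 1 - chi2.cdf(stat, df)
    return stat, df, p

# Example: 200 respondents over four categories, equal expected counts.
stat, df, p = neyman_gof([60, 40, 55, 45], [50, 50, 50, 50])
```

The returned p-value can then be compared against the chosen significance level \(\alpha\), as discussed below.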
Interpreting the Result
Is it appropriate to use?
First you should check if it was appropriate to use the test, since this test is unreliable if the sample sizes are small.
The criterion is often that the minimum expected count should be at least 1 and that no more than 20% of the cells have an expected count below 5. This is often referred to as the 'Cochran conditions', after Cochran (1954, p. 420). Note that, for example, Fisher (1925, p. 83) is stricter, and finds that all cells should have an expected count of at least 5.
If you do not meet the criteria, there are three options. First off, are you sure you have a nominal variable, and not an ordinal one? If you have an ordinal variable, you probably want a different test. If you are sure you have a nominal variable, you might be able to combine two or more categories into one larger category. If, for example, you asked people about their country of birth, but a few countries were only selected by one or two people, you might want to combine these simply into a category ‘other’. Be very clear though in your report that you’ve done so. Alternatively, you can use an exact test. For the goodness-of-fit variation this is an exact multinomial test of goodness-of-fit; for the test of independence, a Fisher exact test (for tables larger than 2x2 this is also then known as the Fisher-Freeman-Halton exact test).
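The 'Cochran conditions' described above are easy to check in code. A minimal sketch (the function name `cochran_conditions_met` is just for illustration):

```python
# Check the 'Cochran conditions': minimum expected count at least 1,
# and at most 20% of cells with an expected count below 5.
def cochran_conditions_met(expected):
    """Return True if the expected counts satisfy the Cochran conditions."""
    if min(expected) < 1:
        return False
    below_5 = sum(1 for e in expected if e < 5)
    return below_5 / len(expected) <= 0.20

cochran_conditions_met([50, 50, 50, 50])  # conditions met
cochran_conditions_met([40, 3, 4, 2])     # too many cells below 5
```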
Reading the p-value
The assumption about the population (the null hypothesis) is captured by those expected counts, i.e. if you would research the entire population, the counts would match the expected counts.
With the goodness-of-fit variation, the expected counts are often simply set so that each category is chosen evenly. For example, if the sample size was 200 and there were four categories, the expected count for each category is usually simply 200/4 = 50.
With the test of independence, the expected counts are calculated from the row and column totals. These are then the counts expected if the two variables would be independent. The null hypothesis is then often stated as that the two variables are independent (i.e. have no influence on each other).
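The calculation of these expected counts can be sketched as follows (a hypothetical `expected_counts` helper; each cell is the row total times the column total, divided by the overall total):

```python
# Expected counts for a test of independence:
# E[i][j] = (row total i) * (column total j) / n
def expected_counts(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

observed = [[20, 30],
            [30, 20]]
expected = expected_counts(observed)  # each cell: 50 * 50 / 100 = 25
```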
The test provides a p-value, which is the probability of a test statistic as extreme as the one from the sample, or even more extreme, if the assumption about the population were true. If this p-value (significance) is below a pre-defined threshold (the significance level \(\alpha\)), the assumption about the population is rejected. We then speak of a (statistically) significant result. The threshold is usually set at 0.05; anything below is then considered low.
If the assumption is rejected, with a goodness-of-fit test, we then conclude that the categories will not be equally distributed in the population, while for a test of independence that the two variables are not independent (i.e. they have an influence on each other).
Note that if we do not reject the assumption, it does not mean we accept it, we simply state that there is insufficient evidence to reject it.
Writing the results
Both variations use the chi-square distribution. A template for tests that use this distribution would be:
χ2([degrees of freedom], N = [sample size]) = [test value], p = [p-value]
Prior to this, mention the test that was done, and the interpretation of the results. For example:
A Neyman test of goodness-of-fit showed that the marital status was not equally distributed in the population, χ2(4, N = 1941) = 1249.13, p < .001.
Round the p-value to three decimal places, or if it is below .001 (as in the example) report it as p < .001.
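The template and rounding rules above could also be automated. A small sketch (the helper name `format_chi2_result` and its formatting choices are assumptions based on the rules just described):

```python
# Format a chi-square test result following the APA-style template:
# χ2([df], N = [sample size]) = [test value], p = [p-value]
def format_chi2_result(chi2_val, df, n, p):
    # Report p < .001 for very small p-values, else round to three decimals
    # and drop the leading zero (APA style).
    p_str = "p < .001" if p < 0.001 else f"p = {p:.3f}".replace("0.", ".")
    return f"χ2({df}, N = {n}) = {chi2_val:.2f}, {p_str}"

format_chi2_result(1249.13, 4, 1941, 0.0004)
# 'χ2(4, N = 1941) = 1249.13, p < .001'
```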
You can get that chi symbol (χ) in Word by typing in the letter 'c', then select it and change the font to 'Symbol'. The square can be done by typing a '2' and make it a superscript.
APA (2019, p. 88) states to also report an effect size measure.
A visualisation might be appreciated prior to the test result, and an effect-size measure after it. We could then, if the result is significant, follow up with a post-hoc analysis. For a goodness-of-fit test this could be to determine which categories are significantly different from each other, and for a test of independence to better describe what the influence then is.
Corrections
The chi-square distribution is a so-called continuous distribution, but the observed counts are discrete. It is therefore possible to add a so-called continuity correction. I've seen three versions for this:
- Yates (1934, p. 222) (only if there are two categories (in each variable))
- Williams (1976, p. 36)
- E.S. Pearson (1947, p. 36)
The Yates continuity correction (cc="yates") is calculated using (Yates, 1934, p. 222):
\(\chi_{PY}^2 = \sum_{i=1}^k \frac{\left(\left|F_i - E_i\right| - 0.5\right)^2}{E_i}\)
Or adjust the observed counts with:
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i \geq E_i \\ F_i + 0.5 & \text{ if } F_i < E_i \end{cases}\)
Then use these adjusted counts in the calculation of the chi-square value.
Sometimes the Yates continuity correction is written as (Allen, 1990, p. 523):
\(\chi_{PY}^2 = \sum_{i=1}^k \frac{\max\left(0, \left(\left|F_i - E_i\right| - 0.5\right)\right)^2}{E_i}\)
Which is then the same as adjusting the observed counts using:
\(F_i^\ast = \begin{cases} F_i - 0.5 & \text{ if } F_i - 0.5 > E_i \\ F_i + 0.5 & \text{ if } F_i + 0.5 < E_i \\ F_i & \text{ else } \end{cases}\)
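Both versions of the adjustment can be sketched in code (hypothetical helper functions, following the case formulas above):

```python
# Yates continuity correction: adjust each observed count
# half a unit towards its expected count.
def yates_adjust(f, e):
    """Classic Yates (1934) version: always move F by 0.5 towards E."""
    return f - 0.5 if f >= e else f + 0.5

def yates_adjust_allen(f, e):
    """Allen (1990) version: only adjust when |F - E| exceeds 0.5."""
    if f - 0.5 > e:
        return f - 0.5
    if f + 0.5 < e:
        return f + 0.5
    return f

yates_adjust(60, 50)          # 59.5
yates_adjust_allen(50.2, 50)  # 50.2, left unchanged
```

The adjusted counts \(F_i^\ast\) are then used in place of \(F_i\) when calculating the chi-square value.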
The Pearson correction (cc="pearson") is calculated using (E.S. Pearson, 1947, p. 36):
\(\chi_{PP}^2 = \chi_{P}^{2}\times\frac{n - 1}{n}\)
The Williams correction (cc="williams") is calculated for a Goodness-of-Fit test using (Williams, 1976, p. 36):
\(\chi_{PW}^2 = \frac{\chi_{P}^2}{q}\)
With:
\(q = 1 + \frac{k^2 - 1}{6\times n\times df}\)
While for a test of independence the formula of q changes to (Williams, 1976, p. 36; McDonald, 2014):
\(q = 1+\frac{\left(n\times \left(\sum_{i=1}^r \frac{1}{R_i}\right) - 1\right)\times \left(n\times\left(\sum_{j=1}^c \frac{1}{C_j}\right) - 1\right)}{6\times n \times\left(r - 1\right)\times\left(c - 1\right)}\)
Where \(R_i\) are the row totals, \(C_j\) the column totals, \(r\) the number of rows, and \(c\) the number of columns.
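The two versions of \(q\) can be sketched as follows (hypothetical helpers; `williams_q_independence` takes the row and column totals as input):

```python
# Williams correction factor q; the corrected statistic is chi2 / q.
def williams_q_gof(k, n):
    """q for a goodness-of-fit test with k categories and sample size n."""
    df = k - 1
    return 1 + (k**2 - 1) / (6 * n * df)

def williams_q_independence(row_totals, col_totals):
    """q for a test of independence, from the row and column totals."""
    n = sum(row_totals)
    r, c = len(row_totals), len(col_totals)
    a = n * sum(1 / ri for ri in row_totals) - 1
    b = n * sum(1 / cj for cj in col_totals) - 1
    return 1 + (a * b) / (6 * n * (r - 1) * (c - 1))

q = williams_q_gof(4, 200)  # 1 + 15 / 3600
```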
Note that the Yates correction has been shown to be too conservative, and its use is often discouraged (see Haviland (1990) for a good starting point on this).
Next step and Alternatives
APA (2019, p. 88) states to also report an effect size measure with statistical tests. With a goodness-of-fit test this could, for example, be Cohen's w.
If the test is significant, a post-hoc analysis to pinpoint which categories are significantly different could be used. The post-hoc analysis for a goodness-of-fit test is discussed here.
There are quite a few tests that can be used with a single or two nominal variables:
- Exact Test (Multinomial for GoF, Fisher for 2x2 and Fisher-Freeman-Halton for larger tables)
- Pearson Chi-Square
- Neyman
- (Modified) Freeman-Tukey
- Cressie-Read / Power Divergence
- Freeman-Tukey-Read
- G / Likelihood Ratio / Wilks
- Mod-Log Likelihood
The Pearson chi-square is probably the most famous and most used one. Neyman swaps the observed and expected counts in Pearson's formula. Freeman-Tukey attempts to smooth the chi-square distribution by using a square-root transformation, while the G test uses a logarithm transformation. Cressie-Read noticed that all the other ones can be placed in a generic format and created a whole family of tests. For the goodness-of-fit tests Read also proposed an alternative generalisation.
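As a side note, several of the tests listed above are available in SciPy through this Cressie-Read power divergence family, via the `lambda_` parameter of `scipy.stats.power_divergence`. A small sketch:

```python
# Several tests from the Cressie-Read family in one SciPy function.
from scipy.stats import power_divergence

observed = [60, 40, 55, 45]
expected = [50, 50, 50, 50]

# lambda_="neyman" gives the Neyman chi-square statistic; other options
# include "pearson", "log-likelihood" (G-test), "freeman-tukey",
# "mod-log-likelihood" and "cressie-read".
stat, p = power_divergence(observed, f_exp=expected, lambda_="neyman")
```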
As for goodness-of-fit tests, McDonald (2014, p. 82) suggests always using the exact test as long as the sample size is less than 1000 (which was just picked as a nice round number; when n is very large the exact test becomes computationally heavy even for computers). Lawal (1984) continued some work from Larntz (1978) and compared the modified Freeman-Tukey, G-test and the Pearson chi-square test, and concluded that for small samples the Pearson test is preferred, while for large samples either the Pearson or G-test. This makes the Neyman test perhaps somewhat redundant.