Wednesday, October 30, 2013

The Normal Distribution Assumption and Outliers

So your normal distribution assumption tests are not met for your ANOVA, ANCOVA, MANCOVA, correlation, t test, etc.?

Where to start?

By leaving panic behind and taking it easy. The first thing to remember is that you do not use the same criterion for assumption tests (e.g., tests of normal distribution) as you do for hypothesis tests.

Criterion for assumption tests

Use p = .001 as the criterion. That is to say, only if you get a p value lower than .001 should you worry about a violation of an assumption. If you are looking at skewness (skew) and kurtosis (kurt), then look at zskew and zkurt, and if either is higher than 3.29 (p = .001) then you may have a problem.

zskew = Skewness/SEskew

zkurt = Kurtosis/SEkurt
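If you want to run this check outside of SPSS, here is a minimal sketch in Python. The function name and simulated data are my own, and I am assuming the usual small-sample standard-error formulas that SPSS reports for skewness and kurtosis; verify the values against your own SPSS output.

```python
import numpy as np
from scipy import stats

def normality_z_checks(x):
    x = np.asarray(x, dtype=float)
    n = len(x)
    skew = stats.skew(x, bias=False)                    # bias-corrected skewness
    kurt = stats.kurtosis(x, fisher=True, bias=False)   # bias-corrected excess kurtosis
    # Standard errors assumed to match the formulas behind SPSS's descriptives
    se_skew = np.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * np.sqrt((n ** 2 - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt

z_skew, z_kurt = normality_z_checks(np.random.default_rng(1).normal(size=100))
print(abs(z_skew) > 3.29, abs(z_kurt) > 3.29)   # True flags a possible problem (p < .001)
```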


Why this conservative criterion, you may ask? Because only when you have a serious violation of an assumption are you likely to run into problems with interpreting the results of your analysis.

If you have a violation of the normal distribution assumption, then follow the flowchart in Figure 1 and refer to the explanations in the text.

Criterion for outliers

Outliers are not what SPSS calls "extreme values", just so that is clear. Neither are they what SPSS marks with a circle or an asterisk in its box plots. An outlier is a value looking for its distribution. No, sorry, that is an attempt at being funny. An outlier is a value that is more than 3.29 standard deviation units away from the mean. Just look at the highest (Xmax) and lowest (Xmin) values for the variable you are having a problem with, along with the mean (M) and the standard deviation (SD), and then calculate the following.

zdistance = (Xmax - M)/SD

and

zdistance = (M - Xmin)/SD

If either of the above gives you zdistance > 3.29, then you have an outlier, so check other scores close to the outlier to see if you have multiple outliers.
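Here is a small sketch of that check in Python, following the zdistance formulas above. The function name, cutoff argument, and simulated data are illustrative assumptions, not SPSS output.

```python
import numpy as np

def outlier_z_distances(x, cutoff=3.29):
    x = np.asarray(x, dtype=float)
    m, sd = x.mean(), x.std(ddof=1)              # sample mean and standard deviation
    z_high = (x.max() - m) / sd                  # z-distance of the highest value
    z_low = (m - x.min()) / sd                   # z-distance of the lowest value
    flagged = x[np.abs((x - m) / sd) > cutoff]   # catch multiple outliers, not just one
    return z_high, z_low, flagged

rng = np.random.default_rng(2)
scores = np.append(rng.normal(50, 10, size=100), 120.0)   # one injected extreme value
z_high, z_low, flagged = outlier_z_distances(scores)
print(z_high > 3.29 or z_low > 3.29, flagged)              # True -> at least one outlier
```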

Comparing the results (transformed vs. raw)

You should run your analysis twice: once using the raw data and once using the transformed data. Compare the effect sizes, for example draw = 0.50 vs. dtransformed = 0.56. If the difference is 20% or less, it is probably not going to affect the interpretation of the results. In a case like this you would simply report the raw-data findings and note that the violation of the normal distribution assumption had only a small effect on the analysis.

If the change is moderate, then your reader needs to know this, and both sets of findings (i.e., raw and transformed) should be reported.
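As a sketch of this comparison in Python: the data below, the choice of a log transform, and the use of Cohen's d are all illustrative assumptions, but the logic (same analysis twice, then compare the effect sizes) is the one described above.

```python
import numpy as np

def cohens_d(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(3)
group1 = rng.lognormal(mean=1.0, sigma=0.5, size=40)   # positively skewed raw scores
group2 = rng.lognormal(mean=1.2, sigma=0.5, size=40)

d_raw = cohens_d(group1, group2)
d_transformed = cohens_d(np.log(group1), np.log(group2))
pct_change = abs(d_transformed - d_raw) / abs(d_raw) * 100
print(d_raw, d_transformed, pct_change)   # change over ~20% -> report both sets of findings
```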

Comparing the results (with vs. without an outlier)

As above, you should run your analysis twice: once using the raw data and once without the outliers. Compare the effect sizes, for example draw = 0.40 vs. dwithout outliers = 0.47. If the difference is 20% or less, it is probably not going to affect the interpretation of the results. In a case like this you would simply report the raw-data findings and note that the outliers had only a small effect on the interpretation of the findings.

If the effect on your interpretation of the findings is moderate or large, then your reader needs to know this, and both sets of findings (i.e., with and without the outliers) should be reported.
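A matching sketch for the outlier comparison: drop cases more than 3.29 SD units from their own group mean, recompute the effect size, and compare it with the raw value. The data, group sizes, and injected extreme case are made-up assumptions for illustration.

```python
import numpy as np

def cohens_d(a, b):
    pooled_sd = np.sqrt(((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                        / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / pooled_sd

def drop_outliers(x, cutoff=3.29):
    z = (x - x.mean()) / x.std(ddof=1)   # z-distance of every case from its group mean
    return x[np.abs(z) <= cutoff]

rng = np.random.default_rng(4)
group1 = np.append(rng.normal(50, 10, size=60), 130.0)   # one extreme case in group 1
group2 = rng.normal(55, 10, size=60)

d_raw = cohens_d(group1, group2)
d_clean = cohens_d(drop_outliers(group1), drop_outliers(group2))
print(d_raw, d_clean, abs(d_clean - d_raw) / abs(d_raw) * 100)   # % change in d
```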



Figure 1. Flowchart for decision on violations of the normality assumption. 
