Statistical tests such as the t-test or ANOVA assume normally distributed data. Therefore, the statistical analysis section of many papers reports that tests for normality confirmed the validity of this assumption and that inspection of data plots supported it. Often normality tests are also applied to independent variables (predictors), although most statistical models, such as regression analyses, make no strong assumptions about the distribution of predictors; these variables are only supposed to be measured without error. The Explore procedure in SPSS produces a Tests of Normality table and Normal Q-Q plots, which represent the two main ways of assessing the normality of data: numerical and graphical methods.

Numerical methods
The Tests of Normality table in SPSS reports the Kolmogorov–Smirnov test and the Shapiro–Wilk test. But there are many alternative tests of univariate normality: the Lilliefors test, Pearson's chi-squared test, the Shapiro–Francia test, D'Agostino's K-squared test, the Anderson–Darling test, the Cramér–von Mises criterion, and the Jarque–Bera test. The Shapiro–Wilk and Anderson–Darling tests have better power at a given significance level than the Kolmogorov–Smirnov test or the Lilliefors test, which is an adaptation of the Kolmogorov–Smirnov test (Razali & Wah, 2011).

All normality tests report a P-value. The null hypothesis is that the sample values come from a population characterised by a normal distribution; the nil hypothesis is that there is no difference between the observed values and the values expected under a normal distribution. As in other tests, the P-value is an estimate of the probability that a random sample would generate data that deviate from the normal distribution at least as much as the observed data do. That probability is calculated assuming that the null hypothesis is true. However, P-values in the Tests of Normality table in SPSS can be very misleading.
The Kolmogorov–Smirnov test was developed to compare observed data with a reference distribution whose parameters are specified in advance, for example from known population values. But in most cases, the mean and variance in the population are estimated from the observed data. If the null hypothesis is true, the P-values of a valid test should follow a uniform distribution, and we would falsely reject the null in only a few cases (depending on the threshold, which is often set at 5%). Suppose we take 1000 samples of size 250 from a normal distribution with mean 100 and sd 3.

Figure 1 shows the distribution of the 1000 P-values from the KS test with the mean and sd specified in advance. The frequencies are reasonably uniform and the number of rejections is close to the expected 50 cases (5% of 1000); it comes closer as the number of simulations increases. However, Figure 2 shows that the P-values are inaccurate when the parameters are estimated from the observed data and therefore vary with each KS test.
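
The simulation behind these two figures can be sketched roughly as follows. This is a minimal illustration in standard-library Python, not the code that produced the figures: the helper names (ks_statistic, ks_pvalue, rejection_rates) are made up for this example, the seed is arbitrary, and the P-value uses the asymptotic Kolmogorov series, which is only an approximation.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_statistic(sample, dist):
    """Largest gap between the empirical CDF and the CDF of `dist`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        cdf = dist.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d

def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov P-value (an approximation, reasonable for n = 250)."""
    lam = math.sqrt(n) * d
    if lam < 0.2:
        return 1.0  # the series converges poorly here and P is practically 1
    s = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
            for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * s))

def rejection_rates(n_sims=1000, n=250, mu=100.0, sd=3.0, alpha=0.05, seed=1):
    """Share of rejections with specified versus estimated parameters."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    known = NormalDist(mu, sd)
    rejected_known = 0
    rejected_estimated = 0
    for _ in range(n_sims):
        sample = [rng.gauss(mu, sd) for _ in range(n)]
        # Case 1: mean and sd specified in advance.
        if ks_pvalue(ks_statistic(sample, known), n) < alpha:
            rejected_known += 1
        # Case 2: mean and sd estimated from the same sample.
        fitted = NormalDist(mean(sample), stdev(sample))
        if ks_pvalue(ks_statistic(sample, fitted), n) < alpha:
            rejected_estimated += 1
    return rejected_known / n_sims, rejected_estimated / n_sims

rate_known, rate_estimated = rejection_rates()
print(f"rejection rate, parameters specified: {rate_known:.3f}")
print(f"rejection rate, parameters estimated: {rate_estimated:.3f}")
```

With specified parameters the rejection rate should land near the 5% threshold; with estimated parameters the fitted curve hugs the sample, so the same P-value formula rejects far less often than it should.
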
Like all significance tests, normality tests such as the Shapiro–Wilk test will flag even small deviations as significant when n gets large. If we generate 1000 normality tests for different sizes of samples taken from a normal distribution, the percentage of P-values signalling deviation from perfect normality increases with sample size (Figure 3). On the other hand, a small sample might by chance look non-normal. In some textbooks a sample size of 30 is considered large enough to follow a normal distribution, but Figure 4 shows that normality tests have limited power when n=30. If we take 1000 small samples from a lognormal distribution (with mean=0 and sd=0.4 on the log scale), in many cases the Shapiro–Wilk normality test will not reject the false null hypothesis (power is about .55).
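
A power simulation of this kind can be sketched as below. Note the substitution: the sketch uses the Jarque–Bera test (one of the alternatives listed earlier) instead of Shapiro–Wilk, because it can be written in a few lines of standard-library Python, and it relies on the asymptotic chi-squared P-value, which is rough for small samples. The rejection rates will therefore not match the .55 quoted above. Function names and the seed are assumptions made for this example.

```python
import math
import random

def jarque_bera_pvalue(sample):
    """Jarque-Bera statistic with its asymptotic chi-squared(2) P-value."""
    n = len(sample)
    m = sum(sample) / n
    m2 = sum((x - m) ** 2 for x in sample) / n
    m3 = sum((x - m) ** 3 for x in sample) / n
    m4 = sum((x - m) ** 4 for x in sample) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    jb = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # With 2 degrees of freedom the chi-squared upper tail is exp(-x / 2).
    return math.exp(-jb / 2)

def lognormal(rng):
    """One draw from a lognormal with log-scale mean 0 and sd 0.4."""
    return rng.lognormvariate(0.0, 0.4)

def rejection_rate(draw, n, n_sims=1000, alpha=0.05, seed=1):
    """Share of simulated samples of size n in which normality is rejected."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    hits = 0
    for _ in range(n_sims):
        sample = [draw(rng) for _ in range(n)]
        if jarque_bera_pvalue(sample) < alpha:
            hits += 1
    return hits / n_sims

# Power against the lognormal alternative at a few sample sizes.
power = {n: rejection_rate(lognormal, n) for n in (30, 100, 300)}
for n, share in power.items():
    print(f"n = {n:>3}: normality rejected in {share:.0%} of samples")
```

The same skewed population is often missed at n=30 and almost always detected at n=300, which is the sample-size dependence described above.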

What we really want to know is whether the data are close enough to the normal distribution to allow the use of conventional statistical tests. Unfortunately, a normality test does not answer this question. If the sample size is not too large and the P-value is extremely small, we can reject the null hypothesis that the data come from a normally distributed population. But rejecting the null does not tell us anything about the alternative distribution. Conversely, if we cannot reject the null, we cannot conclude that the test confirmed the validity of the normality assumption. As always, absence of proof is not proof of absence. So it looks like formal testing of the normality assumption is rather useless.

Graphical methods
Visual inspection of the observed data may show that the assumption of normality is reasonable, but do not use histograms. The height of each rectangle in a histogram is proportional to the number of observations per bin, that is, the number of values that fall within the width of the rectangle. Figure 5 shows histograms and normal curves for random samples of different sizes taken from a population with mean=10 and sd=1. Goodness of fit improves as sample size increases, but the appearance changes with the number of breakpoints, which makes these plots difficult to interpret.
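
The dependence on breakpoints is easy to reproduce. The sketch below bins one and the same sample with 5, 10 and 20 equal-width bins; the sample size of 50, the seed and the helper name bin_counts are assumptions chosen for illustration.

```python
import random

def bin_counts(values, n_bins):
    """Count values in `n_bins` equal-width bins spanning the data range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum would fall just past the last bin, so clamp the index.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

rng = random.Random(1)  # arbitrary seed, only for reproducibility
sample = [rng.gauss(10.0, 1.0) for _ in range(50)]

for n_bins in (5, 10, 20):
    counts = bin_counts(sample, n_bins)
    print(f"{n_bins:>2} bins:", " ".join(str(c) for c in counts))
```

The data are identical in all three rows; only the binning differs, yet the shape suggested by the counts can change noticeably.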

Normal Q-Q plots as in Figure 6 are more useful than histograms. The x-axis in a Q-Q plot shows the observed values. The y-axis shows the values that would be expected if the population distribution of the variable were normal with the population mean and the population sd. The line runs approximately from 2 sample standard deviations below the sample mean to 2 above it, in this case from 29.58 - (1.96 × 6.89) to 29.58 + (1.96 × 6.89), or roughly 16 to 43. If the data points fall on the straight line, the sample data likely came from a population that is approximately normal. Figure 7 zooms in on the deviations. In this detrended plot the line y=0 is the expected difference; the x-axis shows the observed values and the y-axis shows the difference between the expected and the observed values. If the data are spread evenly around y=0 without an apparent pattern, the sample data likely came from a population that is approximately normal.
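
The coordinates of both plots can be computed by hand, which helps in reading them. The following is a minimal sketch under stated assumptions: the sample mean and sd stand in for the population values, the expected values use Blom's plotting positions (one common convention among several), and the data are made-up numbers, not the data behind Figures 6 and 7.

```python
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (observed, expected) pairs for a normal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # sample estimates as stand-ins
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.375) / (n + 0.25)  # Blom's plotting position for rank i
        points.append((x, fitted.inv_cdf(p)))
    return points

def detrended(points):
    """Return (observed, expected - observed) pairs for the detrended plot."""
    return [(obs, exp - obs) for obs, exp in points]

# Made-up values, for illustration only.
data = [21.5, 24.0, 26.2, 27.9, 29.1, 30.4, 31.8, 33.5, 36.0, 41.3]

print("observed  expected  difference")
for (obs, exp), (_, diff) in zip(qq_points(data), detrended(qq_points(data))):
    print(f"{obs:8.1f}  {exp:8.2f}  {diff:+10.2f}")
```

Plotting the first two columns against each other gives the Q-Q plot; plotting the first against the third gives the detrended version.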


When to “call off the search” for normality? Don’t rely on a single Q-Q plot but place it in context. The bottom left corner in Figure 8 shows a Q-Q plot of 66 residuals (mean=0, sd=8.9). The reference plots are from equally sized samples taken from a normal distribution with mean=0 and sd=8.9. This illustrates the variability of values when sampling from a normal distribution. In this context the normality of the data can be assessed.
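
Reference samples like these are cheap to generate. The sketch below draws normal samples of size 66 with mean 0 and sd 8.9 and computes their Q-Q coordinates, ready to be plotted next to the real residuals. The number of reference panels (8), the seed, the plotting positions and the function names are assumptions for this example; the largest deviation from the line is printed only as a quick numeric impression of how much truly normal samples wobble.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(data):
    """Return (observed, expected) pairs for a normal Q-Q plot."""
    xs = sorted(data)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))  # sample estimates as stand-ins
    # Blom's plotting positions, one common convention.
    return [(x, fitted.inv_cdf((i - 0.375) / (n + 0.25)))
            for i, x in enumerate(xs, start=1)]

def reference_panels(n, sd, n_panels=8, seed=1):
    """Q-Q coordinates for `n_panels` samples drawn from a normal(0, sd)."""
    rng = random.Random(seed)  # arbitrary seed, only for reproducibility
    return [qq_points([rng.gauss(0.0, sd) for _ in range(n)])
            for _ in range(n_panels)]

panels = reference_panels(n=66, sd=8.9)
for k, panel in enumerate(panels, start=1):
    worst = max(abs(exp - obs) for obs, exp in panel)
    print(f"reference sample {k}: largest deviation from the line = {worst:.2f}")
```

If the Q-Q plot of the real residuals does not stand out among the reference panels, there is little reason to worry about the normality assumption.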



What next?
The distribution of the sample data can be close to normal, very far from it, or somewhere in between. In any case, both formal normality tests and graphical methods will be of limited use in answering the question of which statistical tools are required. There may be a subjective element in this statistical thinking, but then there are many decisions to be made in the course of a study. If we decide that the deviation from the normal distribution is relevant, which is different from "statistically significant", then there are four choices:


  • Don’t tell anyone. Statistical tests tend to be quite robust if the departure from normality is not too gross. When looking at regression slopes or mean differences the “Central Limit Theorem” soon kicks in, which tells us that these statistics will be approximately normal even when the underlying distribution is not normal.


  • Do an outlier test. One or a few outliers or influential data points might be causing problems. Consider excluding outlier(s), but don’t forget to tell everyone or you risk being accused of trying to cheat. You may also be able to correct skewed data and bring them closer to a normal distribution by transforming values. Common transformations are logarithmic, square root, and power transformations. However, transformed values may be more difficult to interpret.


  • Abandon the “normal distribution” ship. Quite often, data transformation does not solve the problem because the data come from another identifiable distribution. Analysing normally distributed values is a special case of the generalized linear model framework, which covers a variety of distributions for the dependent variable, including Poisson, negative binomial, binomial, multinomial, and gamma.


  • Switch to nonparametric methods. These methods don’t assume a normal distribution or relax other model assumptions. In case none of the above worked, nonparametric methods are sometimes useful. However, researchers often switch to a nonparametric test automatically based on a single normality test. This kind of “Mindless statistics” (Gigerenzer, 2004) should be stopped, just like cruelty to animals, plagiarism, and counting footnotes in the word limit in essays, to name but a few other interesting topics.
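
As a small illustration of the transformation option in the list above, the following sketch draws right-skewed values from a lognormal distribution and compares the sample skewness before and after a log transformation. The distribution parameters, sample size, seed and the helper name skewness are assumptions chosen for this example.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness (0 for a perfectly symmetric sample)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(1)  # arbitrary seed, only for reproducibility
raw = [rng.lognormvariate(0.0, 0.4) for _ in range(500)]
logged = [math.log(v) for v in raw]

print(f"skewness of raw values:         {skewness(raw):+.2f}")
print(f"skewness of log-transformed:    {skewness(logged):+.2f}")
```

The transformed values are close to symmetric, but any estimate computed from them now lives on the log scale, which is the interpretability cost mentioned in the list.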