One–Way Analysis of Variance Using R (Using RStudio)

When to use a One-Way ANOVA

A one-way analysis of variance (ANOVA) is a statistical test that can be used to assess whether the mean value of some outcome variable is different between three or more groups.

The observations/measurements on the outcome variable must be Scale data, such as weight, where you might want to compare the mean weight of the population of several different countries. The groups must be independent of each other, for example each person’s weight measurement is included in only one country.

If your measurements are Ordinal, then consider a Kruskal Wallis test instead.

If you wish to compare just two groups, then you can use ANOVA, but a better simpler alternative is to use an independent samples t-test.

For more help on “What test do I need” go to the sigma website statistical worksheets resources page.

Example

A paper bag manufacturer investigated the impact of varying hardwood fibre concentrations (5%, 10%, 15%, and 20%) on the tensile strength of their products, measured in pounds per square Inch (psi). Six test specimens were produced at each concentration level, resulting in 24 bags being tested. The data shown below can be downloaded in a CSV file called hardwood.csv:

The research question aimed to determine if there existed a significant difference in mean tensile strengths among the four hardwood concentration levels.

To address this, a one-way ANOVA was conducted, testing the following hypotheses:

H0: There is no difference in mean tensile strength between the four hardwood concentrations.

H1: There is a difference in mean tensile strength between at least two of the hardwood concentrations.

The data has two columns. The first column, Conc, represents the independent or grouping variable, where 1 means 5% hardwood concentration, 2 means 10% hardwood concentration, and so on. The second column, Strength, contains all the tensile strength measurements.

Importing the data into R

To get started with the analysis, first, bring the dataset into RStudio. To do this you can either run a read.csv() function if you know how to do this or alternatively you can follow the following steps using the menus:

From the File menu select Import Dataset then From Text(base):

From the pop-up window navigate to the folder where you have saved the dataset, then once the file was selected click “Open”:

At the next dialogue box (see below), in the upper left corner in the “Name” field, amend the name of your dataset if you wish, in this example we named it as “hardwood”.

We should also make sure the Heading option below is set to Yes (otherwise the data will all be read in as text):

Finally, click on “Import” to complete the process. This imports the data set and is listed in the “Environment” in the top right of your RStudio screen as follows:

Using R

Begin by obtaining descriptive statistics of strengths for each concentration level.

tapply(hardwood$Strength, hardwood$Conc, summary)

Next, we will create a descriptive plot showing mean strengths over concentration. To do this, we first calculate and store the means, and then plot these on a line plot.

mean_strengths <- tapply(hardwood$Strength, hardwood$Conc, mean)
plot(mean_strengths, type="l", xlab="Conc", ylab="Strength")

Finally, we will carry out One-Way ANOVA using the function aov(). First, we need to convert Conc to a factor. To obtain the results we want, we will need a summary of the results.

hardwood$Conc <- as.factor(hardwood$Conc)
anova <- aov(hardwood$Strength ~ hardwood$Conc, type=III)
summary(anova)

Examining the Output

The descriptive statistics and plot can be seen below:

## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00    8.25    9.50   10.00   10.75   15.00 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   13.50   16.00   15.67   17.75   19.00 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   16.25   17.50   17.00   18.00   19.00 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   19.25   21.00   21.17   22.75   25.00

Note: To generate plots in R, we need to indicate the data or variables we want to plot. We can do that by indicating which columns in our dataset contain that data, for the case of Rstudio, we can type the name of our dataset followed by a dollar sign “$” and the name of our column that contains the data as in the snippet of code above. Additionally, to add labels, we can add them by using the xlab and ylab options inside the function as seen in the code provided.

The descriptives statistics provide statistics for tensile strength for the different wood concentrations. The dscriptives plot show that in our samples, the higher the concentration, the higher the mean tensile strength, ranging from 10.00 psi at 5% to 21.17 psi at 20%. although we note there appears to be less difference in strength between the 10% concentration (15.67 psi) and the 15% concentration (17.00 psi).

##               Df Sum Sq Mean Sq F value   Pr(>F)    
## hardwood$Conc  3  382.8  127.60   19.61 3.59e-06 ***
## Residuals     20  130.2    6.51                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output provides the main results of our test. The column labelled p provides you with your p-value which is reported as <.001. This is less than 0.05 so there is evidence in favour of H1 that there truly is a difference in mean tensile strength somewhere among the four hardwood fibre concentration levels (5%, 10%, 15%, and 20%). The table also shows that the F statistic was 19.605 on 20 degrees of freedom (df) within groups (Conc) and 3 df between groups (Residuals), which we often include when reporting our results.

However, the ANOVA test only tells us that there is a difference somewhere, not where specifically, which is why we need to undergo post hoc tests to identify where the differences lie.

Post-hoc tests

The nature of these differences can be explored further by performing post hoc tests to drill down into the results, but only if the ANOVA test has been found to be statistically significant, as in our case. Post-hoc tests identify any significant differences in mean scores between all possible pairs of the factor levels, in our case pairs of concentration levels. In effect they are variations on t-tests comparing each pair of means.

To obtain the post hoc tests, we apply the function TukeyHSD() to the anova model previously fit:

TukeyHSD(anova)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = hardwood$Strength ~ hardwood$Conc, type = III)
## 
## $`hardwood$Conc`
##          diff         lwr       upr     p adj
## 2-1  5.666667  1.54410408  9.789229 0.0051108
## 3-1  7.000000  2.87743741 11.122563 0.0006501
## 4-1 11.166667  7.04410408 15.289229 0.0000015
## 3-2  1.333333 -2.78922925  5.455896 0.8022275
## 4-2  5.500000  1.37743741  9.622563 0.0065966
## 4-3  4.166667  0.04410408  8.289229 0.0470251

Because our factor has 4 levels, there are a total of 6 possible pairings.

Careful examination of the R output indicates a statistically significant difference in mean strength between: 5% and 10%, 5% and 15%, 5% and 20% (p=0.005, p<0.001 and p<0.001, respectively), 10% and 20% (p=0.007), and 15% and 20% (p=0.047). However, there was no statistically significant difference between 10% and 15% (p=0.802).

Reporting Results

We could report the results as:

“ANOVA was used to compare the mean tensile strengths of paper bag products containing hardwood at four different concentration levels. There was evidence of a significant difference (F(20,3)=19.605, p<0.001). Tukey’s post hoc test revealed statistically significant differences between 5% and all other concentrations at 10% (p=0.005), 15% (p=0.001) and 20% (p<0.001), as well as statistically significant differences between 10% and 20% (p=0.007), and between 15% and 20% (p=0.047). However, there was no statistically significant difference between 10% and 15% (p=0.802). On average, the 20% hardwood concentration group (M = 21.17, SD = 2.64) scored higher than the 15% (M = 17.00, SD = 1.79), 10% (M = 15.67, SD = 2.81), and 5% (M = 10.00, SD = 2.83) groups. Overall, these findings suggest that the concentration of hardwood significantly impacts the tensile strength of the paper, except between the 15% and 10% groups.”

Further Work

To further enhance our results, we need to compute eta-squared for practical significance and assess normality plus homogeneity of variances. This will ensure that our one-way ANOVA results are robust enough.

Normality

For the test to be valid it should be reasonable to assume that the measurements in each group are approximately normally distributed. If we have 30 or more measurements in each group, then we can safely make that assumption and need not check this any further.

Since we only had 6 measurements in each group, we need to do some further assessments. However, when we only have very small samples, it can sometimes be quite a challenge to determine whether the data in each group come from a normal distribution or not.

Hence, the best way to assess normality for one-way ANOVA is to repeat the main analysis but save what are called the residuals. The residuals are the differences between each observed value and the mean value for that group. We then need to check if this single column of residuals can be assumed to be normally distributed.

We can obtain the residuals using the function residuals() applied to our anova model. These can then be plotted in a histogram:

residuals <- residuals(anova)
hist(residuals)

Normality could be judged by examining the shape of the histogram to see if it makes a roughly symmetric bell-shaped curve. However, small sample sizes tend to make the histogram look ‘jagged’, making it harder to discern normality by eye. It is probably better to assess normality using the Shapiro-Wilk test:

shapiro.test(residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.96624, p-value = 0.5757

You need a non-significant result, i.e. the p-value needs to be greater than 0.05 to be able to assume normality. In our example, p = 0.578 and we can assume our sampled differences data is normally distributed. If this were not the case, we would need to use the non-parametric equivalent of one-way ANOVA, called the Kruskal Wallis test.

Equal variances

The ANOVA test we undertook also assumes homogeneity (equality) of variance. This essentially means, can we assume the amount by which the tensile strength varies amongst bags with 10% concentration hardwood, is the same amount the strength of bags with 15% varies and so on?

Levene’s test is used in R to evaluate the homogeneity of variance assumption.

First, install the package car. Then, apply the library() function to this. This will give us access to the function LeveneTest() which we will use next.

install.packages("car", repos="https://cloud.r-project.org")  
library(car)

## Loading required package: carData

Note: to add additional functions to Rstudio we can install external packages. These can add more capabilities not present in the default Rstudio or when there is no other way to carry on an analysis. After installing the package we need to call it by using library(). This will indicate Rstudio to use the functions inside this package to run calculations or procedures requested by the user.

To be able to use LeveneTest(), we first need to convert the variable Conc to a factor. This is as Levene’s test can only be used with quantitative variables. Once this is done, we can then use LeveneTest().

hardwood$Conc <- as.factor(hardwood$Conc)
leveneTest(Strength ~ Conc, data = hardwood)

We examine the p-value. If this is above 0.05 then we can assume equality of variances. In our case p=0.583 so this assumption is fine in our case. If the p-value had been below 0.05 then we should re-do the ANOVA but using a Welch test, shown below:

oneway.test(Strength ~ Conc, data = hardwood, var.equal = FALSE)

We would then report the p value for the overall ANOVA from the output shown below, rather than from the main ANOVA table we saw earlier:

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Strength and Conc
## F = 15.253, num df = 3.000, denom df = 10.873, p-value = 0.0003262

For more resources, see sigma.coventry.ac.uk Adapted from material developed by Coventry University Creative Commons License