Independent Samples t-test Using R (Using RStudio)

When to use an Independent Samples t-test

The independent samples t-test can be used to assess whether the mean value of some outcome variable is different between two groups.

The observations/measurements on the outcome variable must be Scale data, such as weight, where you might want to compare the mean weight of the population of two different countries.

If your measurements are:

Ordinal, then consider a Mann-Whitney test instead.
Nominal, then consider a Chi-squared test.

If you wish to compare more than two groups then consider using ANOVA (or a Kruskal Wallis test for Ordinal data).

For more help on “What test do I need” go to the sigma website statistical worksheets resources page.

Example

Consumer Reports, a US magazine, conducted a study to compare the calorie content of beef and poultry hotdogs. They used an independent samples t-test to determine if there was evidence of a difference in the mean calorie content between the two types of hotdogs.

The data shown below can be downloaded in a CSV file called hotdog.csv. The data should be set up as two columns. One column contains all the measurements/observations for calories. The other column then indicates which group the measurements came from.

*Data is taken from Moore DS & McCabe GP (2002) Introduction to the Practice of Statistics WH Freeman & Co. USA

To get started with the analysis, bring the dataset into RStudio. To do this you can either run a read.csv() function if you know how to do this or alternatively you can follow these steps using the menus:
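
For readers who prefer code to menus, the read.csv() route might look like the sketch below. The few rows written to a temporary file are made up purely so the example runs on its own; they are not the Consumer Reports data. The column names calories and hotdog are taken from the code used later in this worksheet, while the object name hotdog_demo is only illustrative.

```r
# Hypothetical stand-in for hotdog.csv: a few made-up rows written to a
# temporary file so the example runs without the real download.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("calories,hotdog",
             "150,1",
             "160,1",
             "120,2",
             "130,2"),
           csv_path)

# header = TRUE tells R that the first row holds the column names.
hotdog_demo <- read.csv(csv_path, header = TRUE)

# Check that both columns were read as numbers rather than text.
str(hotdog_demo)
```

With the real file, and assuming it is saved in your working directory, the equivalent call would be hotdog <- read.csv("hotdog.csv").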

From the File menu select Import Dataset then From Text (base):

From the pop-up window navigate to the folder where you have saved the dataset, then once the file is selected click Open:

At the next dialogue box (see below), in the “Name” field in the upper left corner, amend the name of your dataset if you wish. In this example we named it “hotdog”.

We should also make sure the Heading option below is set to Yes (otherwise the data will all be read in as text):

Finally, click on Import to complete the process. This imports the dataset, which is then listed in the Environment pane in the top right of your RStudio screen as follows:

Using R

Before performing the Independent Samples t-test, we will calculate descriptive statistics for the calorie content of beef and poultry hotdogs. We indicate which column we want to investigate by typing the name of our dataset followed by a dollar sign “$” and the name of the column that contains the data.

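summary(hotdog$calories[hotdog$hotdog == 1])
summary(hotdog$calories[hotdog$hotdog == 2])

Using [hotdog$hotdog == 1] and [hotdog$hotdog == 2] gives us separate summary statistics for the two levels of the hotdog variable. Note that the condition refers to the hotdog column explicitly; comparing the whole dataset with [hotdog == 1] only works when the hotdog column happens to come first.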
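
As an alternative to subsetting each group by hand, the same summaries can be produced for all groups in one call. The sketch below uses simulated data (not the worksheet's values) with the same layout and group sizes as the hotdog dataset; the object name demo and the simulation settings are assumptions made for illustration only.

```r
# Simulated stand-in data (NOT the Consumer Reports values): 20 "beef"
# hotdogs coded 1 and 17 "poultry" hotdogs coded 2.
set.seed(1)
demo <- data.frame(
  calories = c(rnorm(20, mean = 155, sd = 22), rnorm(17, mean = 120, sd = 25)),
  hotdog   = rep(c(1, 2), times = c(20, 17))
)

# One call returns a summary for every level of the grouping column.
by_group <- tapply(demo$calories, demo$hotdog, summary)
by_group

# Group size, mean and standard deviation in a single table.
group_stats <- aggregate(calories ~ hotdog, data = demo,
                         FUN = function(x) c(n = length(x),
                                             mean = mean(x),
                                             sd = sd(x)))
group_stats
```

The standard deviations are worth looking at here, because the t-test used below assumes the spread is similar in the two groups.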

The t.test() function is used to carry out the Independent Samples t-test. Use the calories variable first, and the hotdog variable second. This is so that calories can be compared between the two hotdog groups.

t.test(hotdog$calories ~ hotdog$hotdog, var.equal=TRUE)

Examining the Output

The summary statistics are useful because they provide descriptive statistics for the calorie content of beef and poultry hotdogs. In our case, the mean calorie content for beef hotdogs is 156.85, which is higher than the mean calorie content for poultry hotdogs of 122.47.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   111.0   140.5   152.5   156.8   177.2   190.0

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    86.0   102.0   129.0   122.5   143.0   170.0

But do the results provide evidence that this reflects a true difference in the mean calorie content? Or could we have observed this difference just by chance?

The Research Question is: Is there a difference in the mean calorie content in beef and poultry hotdogs?

The independent samples t-test answers this by testing the hypotheses:

H0: There is no difference in the mean calorie content

H1: There is a difference in the mean calorie content

The Independent Samples T-Test output provides the main results of our test. The p-value is reported as 0.00011. This is less than 0.05 so there is evidence in favour of H1 that there is a difference in the mean calorie content of beef hotdogs compared to poultry hotdogs.

The output also shows that the t statistic was 4.346 on 35 degrees of freedom (df), which we often include when reporting our results. Note that these results assume that the variances in the two groups are equal (var.equal=TRUE). We will come back to this choice later when we consider the assumptions of the test.

## 
##  Two Sample t-test
## 
## data:  hotdog$calories by hotdog$hotdog
## t = 4.3455, df = 35, p-value = 0.0001137
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
##  18.31825 50.44057
## sample estimates:
## mean in group 1 mean in group 2 
##        156.8500        122.4706
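
The numbers printed above can also be pulled out of the test result programmatically, and the t statistic can be checked by hand against the pooled-variance formula. The following is a sketch on simulated data (not the worksheet's values); the names demo, res, sp and t_manual are illustrative.

```r
# Simulated stand-in data with the same layout as the hotdog dataset.
set.seed(1)
demo <- data.frame(
  calories = c(rnorm(20, mean = 155, sd = 22), rnorm(17, mean = 120, sd = 25)),
  hotdog   = rep(c(1, 2), times = c(20, 17))
)

# Store the result instead of only printing it.
res <- t.test(calories ~ hotdog, data = demo, var.equal = TRUE)

res$statistic  # t value
res$parameter  # degrees of freedom (n1 + n2 - 2 for the pooled test)
res$p.value    # two-sided p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two group means

# Manual check: difference in means divided by its pooled standard error.
x1 <- demo$calories[demo$hotdog == 1]
x2 <- demo$calories[demo$hotdog == 2]
n1 <- length(x1)
n2 <- length(x2)
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
t_manual <- (mean(x1) - mean(x2)) / (sp * sqrt(1 / n1 + 1 / n2))

# Should agree with the value reported by t.test().
all.equal(unname(res$statistic), t_manual)
```

Storing the result in an object makes it easier to reuse the t value, df and p-value when writing up the analysis.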

Reporting Results

We could report the results as:

“An independent samples t-test was used to compare the mean calorie content of beef and poultry hotdogs. Beef hotdogs had a higher mean calorie content (M=156.85) compared to poultry hotdogs (M=122.47), and this difference was statistically significant, t(35) = 4.346, p<0.001.”

Further Work

Effect Sizes: Cohen’s d

We can obtain a value for Cohen’s d using the cohens_d() function from the effectsize package.

First, install the package effectsize.

Note: we can add functions to R by installing external packages. These provide capabilities that are not present in the default installation of R, or that offer the only practical way to carry out a particular analysis.

install.packages("effectsize", repos="https://cloud.r-project.org")  
library(effectsize)

After installing the package we need to load it using library(). This makes the functions inside the package available in the current R session. Then apply the cohens_d() function to our previous code for the independent samples t-test:

cohens_d(t.test(hotdog$calories ~ hotdog$hotdog, var.equal=TRUE))

A commonly used interpretation of this value is based on benchmarks suggested by Cohen (1988). Here effect sizes are classified as follows: A value of Cohen’s d around 0.2 indicates a small effect, a value around 0.5 is a medium effect and a value around 0.8 is a large effect. In our case Cohen’s d was 1.434, so we have a very large effect.
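
To see where a value like this comes from, Cohen's d for two independent groups can be computed in base R as the difference in means divided by the pooled standard deviation. The sketch below uses simulated data, so its result will not equal 1.434; it also assumes the pooled-SD definition of d, which is what the equal-variance t-test corresponds to.

```r
# Simulated stand-in data: 20 values for group 1 and 17 for group 2.
set.seed(1)
x1 <- rnorm(20, mean = 155, sd = 22)
x2 <- rnorm(17, mean = 120, sd = 25)
n1 <- length(x1)
n2 <- length(x2)

# Pooled standard deviation across the two groups.
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))

# Cohen's d: standardised difference between the group means.
d_manual <- (mean(x1) - mean(x2)) / sp
d_manual

# d is linked to the pooled t statistic by d = t * sqrt(1/n1 + 1/n2).
t_val <- unname(t.test(x1, x2, var.equal = TRUE)$statistic)
all.equal(d_manual, t_val * sqrt(1 / n1 + 1 / n2))

# Applying the same relation to the worksheet's reported t = 4.3455 with
# n1 = 20 and n2 = 17 gives about 1.43, in line with the 1.434 above.
4.3455 * sqrt(1 / 20 + 1 / 17)
```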

Equality of Variance

We also need to assess the assumption of homogeneity (equality) of variance. In essence, this asks whether we can assume that the calorie content varies by about the same amount in beef hotdogs as it does in poultry hotdogs.

Levene’s test is used in RStudio to evaluate the homogeneity of variance assumption.

First, install the package car. Then, apply the library() function to this. This will give us access to the function leveneTest() which we will use next.

install.packages("car", repos="https://cloud.r-project.org")  
library(car)

## Loading required package: carData

To be able to use leveneTest(), we first need to convert the variable hotdog to a factor. This is because the grouping variable in Levene’s test must be categorical; leveneTest() cannot be used with a quantitative (numeric) grouping variable. Once this is done, we can then use leveneTest().

hotdog$hotdog <- as.factor(hotdog$hotdog)
leveneTest(calories ~ hotdog, data = hotdog)

The p-value from Levene’s test is shown in the Pr(>F) column of the output table. You need the p-value to be greater than 0.05 to be able to assume homogeneity of variances. Here p=0.386 and so we can assume equal variances. Hence, we use var.equal=TRUE in our t.test() code.
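
For intuition about what Levene's test does, the median-centred version (the default in leveneTest()) can be reproduced in base R: it is a one-way ANOVA on the absolute deviations of each value from its own group median. This is a sketch on simulated data, so the p-value will differ from the 0.386 quoted above; the names demo, abs_dev and levene_tab are illustrative.

```r
# Simulated stand-in data with the same layout as the hotdog dataset.
set.seed(1)
demo <- data.frame(
  calories = c(rnorm(20, mean = 155, sd = 22), rnorm(17, mean = 120, sd = 25)),
  hotdog   = factor(rep(c(1, 2), times = c(20, 17)))
)

# Absolute deviation of each observation from its group median.
abs_dev <- abs(demo$calories - ave(demo$calories, demo$hotdog, FUN = median))

# One-way ANOVA on those deviations: a large F (small p) suggests the
# groups differ in spread.
levene_tab <- anova(lm(abs_dev ~ demo$hotdog))
levene_tab

# The p-value sits in the "Pr(>F)" column, first row.
p_levene <- levene_tab[["Pr(>F)"]][1]
p_levene
```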

If Levene’s test had given us a p-value below 0.05 then we would instead use the results for the t-test obtained from using var.equal=FALSE.
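
As a sketch of that unequal-variance case, the code below runs both versions of the test on simulated data (not the worksheet's values) so the outputs can be compared. Setting var.equal=FALSE, which is also the default in t.test(), gives the Welch version of the test.

```r
# Simulated stand-in data with the same layout as the hotdog dataset.
set.seed(1)
demo <- data.frame(
  calories = c(rnorm(20, mean = 155, sd = 22), rnorm(17, mean = 120, sd = 25)),
  hotdog   = rep(c(1, 2), times = c(20, 17))
)

# Pooled-variance test (assumes equal variances).
pooled <- t.test(calories ~ hotdog, data = demo, var.equal = TRUE)

# Welch test (does not assume equal variances).
welch <- t.test(calories ~ hotdog, data = demo, var.equal = FALSE)
welch

# The Welch degrees of freedom are adjusted downwards and are usually
# not a whole number.
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
```

When reporting a Welch test, quote the adjusted degrees of freedom shown in its output rather than n1 + n2 - 2.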

Normality

For the test to be valid it should be reasonable to assume that the calorie measurements are approximately normally distributed. If we have 30 or more measurements in each group, then we can safely make that assumption and need not check this any further. Since we only had 20 measurements for beef and 17 for poultry, we need to do some further assessments.

Normality could be judged by examining histograms of your calorie data -see code and graphs below- to see if both beef and poultry display a roughly symmetric bell-shaped curve.

Note: To generate plots in R, we need to indicate the data or variables we want to plot. As before, we type the name of our dataset followed by a dollar sign “$” and the name of the column that contains the data, as in the code below. Titles and axis labels can be added with the main and xlab options inside the function.

par(mfrow=c(1,2))
hist(hotdog$calories[hotdog$hotdog == 1], 
     main="Beef",
     xlab="Calories")
hist(hotdog$calories[hotdog$hotdog == 2], 
     main="Poultry",
     xlab="Calories")

Using par(mfrow=c(1,2)) lets us plot the two histograms side by side.

With small sample sizes, the differences in scores can make the histogram appear jagged, making it difficult to determine normality. It is probably better to assess normality using the Shapiro-Wilk test.

shapiro.test(hotdog$calories)

## 
##  Shapiro-Wilk normality test
## 
## data:  hotdog$calories
## W = 0.95882, p-value = 0.1853

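The code above applies the Shapiro-Wilk test to all the calorie values pooled together. This pooled test is non-significant (p = 0.185), so it gives no evidence against normality. Strictly, normality should be checked within each group, for example by running shapiro.test() separately on hotdog$calories[hotdog$hotdog == 1] and hotdog$calories[hotdog$hotdog == 2]. You need a non-significant result in both groups, i.e. both p-values need to be greater than 0.05, to be able to assume normality. If this were not the case, we would need to use the non-parametric equivalent of the independent samples t-test, called the Mann-Whitney test.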
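
A per-group normality check, followed by the Mann-Whitney fallback, might be sketched as follows. The data are simulated (not the worksheet's values), and the names demo, shapiro_by_group and mw are illustrative. In R the Mann-Whitney test is run with wilcox.test(), which reports it as a Wilcoxon rank sum test.

```r
# Simulated stand-in data with the same layout as the hotdog dataset.
set.seed(1)
demo <- data.frame(
  calories = c(rnorm(20, mean = 155, sd = 22), rnorm(17, mean = 120, sd = 25)),
  hotdog   = rep(c(1, 2), times = c(20, 17))
)

# Run the Shapiro-Wilk test separately within each group.
shapiro_by_group <- lapply(split(demo$calories, demo$hotdog), shapiro.test)

# Collect the two p-values; both should exceed 0.05 to assume normality.
p_values <- sapply(shapiro_by_group, function(res) res$p.value)
p_values

# Non-parametric alternative if normality looks doubtful.
mw <- wilcox.test(calories ~ hotdog, data = demo)
mw
```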

For more resources, see sigma.coventry.ac.uk. Adapted from material developed by Coventry University. Creative Commons License.