Linear Regression in R (Using R Studio)

What is Simple Linear Regression?

Simple Linear Regression allows us to predict or explain one variable in terms of another.

It is similar to correlation, but lets us describe more precisely how changes in one variable (the independent variable, sometimes called the predictor variable) might explain or predict changes in another variable (the dependent variable, sometimes called the response variable).

  • Note that more than one independent or predictor variable can be used in a model in which case we refer to that as Multiple Linear Regression. Here we consider just Simple Linear Regression.

The dependent variable must be a scale (continuous) variable. However, the independent (or predictor) variables can be scale or categorical (i.e. ordinal or nominal), such as Gender or Ethnicity. This worksheet focuses on the case where you have one independent variable, which is scale.

Example

A student wants to explore the relationship between sports science students’ calcium intake and their knowledge about calcium. The data shown below can be downloaded in a CSV file called calcium.csv. Knowledge scores about calcium are in the Knowledge column and the recorded calcium intake (in mg) is in the Calcium column.

In particular, the student wants to know if the participants’ knowledge about calcium can be used to predict their calcium intake. The research question is: Does knowledge about calcium predict the calcium intake in Sports Science students?

To get started with the analysis, first bring the dataset into RStudio. You can either call the read.csv() function directly, if you are comfortable doing so, or follow these steps using the menus:

From the File menu, select Import Dataset, then From Text (base).

In the pop-up window, navigate to the folder where you saved the dataset, select the file, and click “Open”.

In the next dialogue box, use the “Name” field in the upper left corner to amend the name of your dataset if you wish. In this example we named it “calcium”.

We should also make sure the Heading option is set to Yes. Otherwise the column names are treated as a row of data and every column is read in as text.

Finally, click on “Import” to complete the process. This imports the dataset, which is then listed in the “Environment” pane in the top right of your RStudio screen.
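For readers who prefer code to menus, the read.csv() route mentioned earlier can be sketched as follows. This is a minimal sketch, not the worksheet's own code: so that it runs on its own, it first writes a tiny CSV file to a temporary location. The four rows and the names csv_path and calcium_demo are hypothetical and made up for illustration; they are not the worksheet's data.

```r
# Hypothetical illustration: these values are made up, NOT the worksheet's data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Knowledge,Calcium",
             "10,510",
             "20,640",
             "30,810",
             "40,940"), csv_path)

# header = TRUE tells R that the first row holds the column names
calcium_demo <- read.csv(csv_path, header = TRUE)

str(calcium_demo)   # check that both columns were read in as numbers
head(calcium_demo)  # look at the first few rows
```

With the real file, and assuming calcium.csv sits in your working directory, the equivalent call would be calcium <- read.csv("calcium.csv", header = TRUE).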

Using R

Step 1: Check that a linear relationship exists between the two variables by drawing a scatter plot of the data. If there is NO linear relationship, i.e. the points appear scattered evenly across the graph, or if the underlying pattern is curved, running a linear regression would not be appropriate.

The following code can be run to produce the scatter plot:

plot(calcium$Knowledge, calcium$Calcium, 
     xlab = "Knowledge Score", ylab = "Calcium Intake",
     main = "Calcium Intake vs Knowledge Score")
abline(lm(Calcium ~ Knowledge, data = calcium), col = "red")

Note: To generate plots in R, we need to indicate the variables we want to plot. We do this by naming the columns in our dataset that contain the data: type the name of the dataset, followed by a dollar sign “$”, followed by the name of the column, as in the code above. Axis labels can be added with the xlab and ylab options inside the function, and a title with the main option, as shown in the code provided.

In the plot, the points follow a clear increasing linear pattern. They are also reasonably close to the line of best fit (called the regression line) through the data points. This suggests there is a strong relationship between the two variables. Here the line slopes upwards from left to right, which tells us that the value of one variable increases as the value of the other increases.

Step 2: Linear Regression in RStudio

Having established that a linear relationship exists between the two variables, we can run a Simple Linear Regression. This finds the equation (slope and intercept) of the regression line, which is essentially a mathematical model of the relationship between the independent/predictor and the dependent/response variable.

This code will fit a linear regression model with Calcium as the response variable and Knowledge as the predictor variable, using the lm() function. The summary() function prints out details of the model fit:

# Fit linear regression model
lm_model <- lm(Calcium ~ Knowledge, data = calcium)

# Print model summary
summary(lm_model)

Running the code will produce this output table of model coefficients, residual standard error, R-squared, etc.:

## 
## Call:
## lm(formula = Calcium ~ Knowledge, data = calcium)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -149.117  -58.579    5.981   40.304  174.108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  373.743     55.067   6.787 2.34e-06 ***
## Knowledge     13.897      1.748   7.951 2.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 84.35 on 18 degrees of freedom
## Multiple R-squared:  0.7784, Adjusted R-squared:  0.7661 
## F-statistic: 63.22 on 1 and 18 DF,  p-value: 2.675e-07
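Rather than reading values off the printed summary, individual numbers can also be pulled out of a fitted model. The sketch below shows one way to do this. To keep it self-contained it fits a model to a small hypothetical data frame (toy, with made-up values that are NOT the worksheet's data), so its numbers will differ from the output above.

```r
# Hypothetical toy data (made up for illustration, NOT the worksheet's data)
toy <- data.frame(Knowledge = c(10, 20, 30, 40),
                  Calcium   = c(510, 640, 810, 940))

# Same model formula as in the worksheet, fitted to the toy data
toy_model <- lm(Calcium ~ Knowledge, data = toy)

coef(toy_model)               # named vector: intercept and slope
coef(summary(toy_model))      # full coefficients table as a matrix
summary(toy_model)$r.squared  # Multiple R-squared
```

The same three calls should work on lm_model once the calcium dataset has been imported.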

Examining the Output

We will focus on the coefficients table as we are interested in the “Estimate” column which provides the estimated intercept and slope of our regression line.

Slope: The estimated slope is given in the row for the “Knowledge” variable and has a value of around 13.9. This is a positive number, indicating that as the Knowledge score increases, Calcium intake also increases. More precisely, each 1-point increase in Knowledge score is associated with an increase in Calcium intake of around 13.9 mg.

Intercept: The value in the “(Intercept)” row is the estimated intercept, which is the predicted value of our dependent variable when our independent variable is zero. In our example the estimated intercept is around 373.7, suggesting that on average those with a Knowledge score of zero would have an estimated calcium intake of 373.7 mg.

Significance (p-values): We are also interested in the “Pr(>|t|)” column, which gives the p-value associated with each coefficient. This tells us whether there is evidence of a real relationship between that variable and the response, rather than a pattern that could plausibly have arisen by random chance. For the Knowledge variable, the p-value is very small (2.67e-07, i.e. <0.0001). Since this is below 0.05 (the usual 5% significance level), we can conclude there is strong evidence that Knowledge score is a statistically significant predictor of Calcium intake.
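To see where the t value and p-value in the Knowledge row come from, they can be approximately reproduced from the printed estimate and standard error. This is a rough check rather than part of the original analysis: because it starts from rounded figures, the results only match the summary() output to a few significant digits. The variable names are chosen here for illustration.

```r
# Values copied from the Knowledge row of the summary() output
estimate  <- 13.897
std_error <- 1.748
df_resid  <- 18   # residual degrees of freedom reported in the output

# The t value is the estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 18 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

t_value
p_value
```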

We can write the equation of our regression line or line of best fit as:

\[ \text{Calcium Intake} = 373.7 + 13.9 \times \text{Knowledge Score} \]
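The regression equation can be turned into a small helper that returns the predicted calcium intake for any knowledge score. This is a minimal sketch: the function name predict_calcium is made up for illustration, the coefficients are the unrounded estimates from the summary() output, and the example scores (20, 30, 40) are arbitrary and may or may not lie within the range seen in the data.

```r
# Coefficients taken from the "Estimate" column of the summary() output
predict_calcium <- function(knowledge_score) {
  intercept <- 373.743
  slope     <- 13.897
  intercept + slope * knowledge_score
}

# Predicted calcium intake (mg) for three arbitrary example scores
predict_calcium(c(20, 30, 40))
```

For example, a knowledge score of 30 gives a predicted intake of about 790.7 mg. Predictions for scores outside the range observed in the data should be treated with caution, since the linear pattern has only been checked within that range.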

Next, we should assess whether our model is any good. Does it provide a good way of predicting Calcium intake? The Multiple R-squared value at the bottom of the summary() output can help us do this.

This value is 0.778, or 77.8%, which indicates that 77.8% of the variability in people’s calcium intake can be explained by their Knowledge score.

The remaining 22.2% of variation in Calcium intake arises from other factors not included in this simple model.
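The figures at the bottom of the summary() output are linked by simple arithmetic, which the following sketch checks using the printed values. It assumes a sample size of 20, inferred from the 18 residual degrees of freedom plus the 2 estimated coefficients. The variable names are illustrative.

```r
# Values taken from the summary() output
r_squared <- 0.7784
n <- 20   # assumption: 18 residual df + 2 estimated coefficients
p <- 1    # number of predictors in a simple linear regression

# Share of variation NOT explained by the model
unexplained <- 1 - r_squared

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# With a single predictor, the size of the correlation is sqrt(R-squared);
# it is positive here because the slope is positive
r <- sqrt(r_squared)

round(c(unexplained = unexplained,
        adj_r_squared = adj_r_squared,
        r = r), 4)
```

The adjusted value agrees with the 0.7661 printed in the output, and the implied correlation of roughly 0.88 is consistent with the strong upward pattern seen in the scatter plot.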

Finally, note that we CANNOT say that knowledge about calcium CAUSES the increase in calcium intake. All we can infer is that the two are associated.

Reporting Results

We could report the results as:

“Simple linear regression analysis was used to examine the relationship between calcium intake and knowledge about calcium. The results suggest that knowledge about calcium was a significant predictor of calcium intake (p<0.001). The estimated coefficient (slope) for knowledge score suggests that each one-unit increase in knowledge score is associated with an increase in calcium intake of around 13.9 mg.”

For more resources, see sigma.coventry.ac.uk. Adapted from material developed by Coventry University. Creative Commons License.