Correlation Using R (Using RStudio)
Correlation is a measure of the strength of the relationship between two variables.
It is measured using a number called the correlation coefficient which lies between -1 and +1.
Larger values closer to +1 or -1 indicate a stronger relationship. Values nearer to zero indicate weaker or even no relationship.
The most commonly used measures are Pearson’s correlation coefficient and Spearman’s correlation coefficient.
For help on “What test do I need” go to the sigma website statistics resources page.
Pearson’s correlation measures the strength of the linear (i.e. straight-line) relationship, whereas Spearman’s correlation simply measures the strength of a general monotonic relationship which can be non-linear (monotonic means always increasing or always decreasing).
The variables used to calculate a correlation coefficient need to be scale or ordinal. Correlation is not appropriate if any of the variables are nominal.
Pearson’s correlation is appropriate if both variables are scale.
Spearman’s correlation can be used for any combination of scale and ordinal variables (i.e. both can be scale or both ordinal or one of each).
If one or both variables are scale, you should also obtain a scatter plot to visualize the relationship between them. We can also conduct a test on the correlation coefficient – see later.
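As a small illustration of the difference between the two coefficients, the sketch below uses made-up values (x and y are hypothetical and are not the calcium data) in which y always increases with x, but along a curve rather than a straight line. Under that assumption, Spearman’s coefficient should be 1 while Pearson’s falls short of 1.

```r
# Hypothetical illustration values (not taken from the calcium study)
x <- 1:10
y <- x^3   # always increasing, but curved rather than straight

# Pearson measures how close the points are to a straight line
pearson_r <- cor(x, y, method = "pearson")

# Spearman works on the ranks, so any always-increasing pattern scores 1
spearman_r <- cor(x, y, method = "spearman")

pearson_r    # less than 1 because the relationship is not linear
spearman_r   # 1 because the relationship is perfectly monotonic
```

A scatter plot of x against y (for example with plot(x, y)) makes the curvature easy to see.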
A student wanted to explore the relationship between knowledge about calcium and calcium intake among sports science students. The data shown below can be downloaded from the CSV file calcium.csv.
To get started with the analysis, first bring the dataset into RStudio. You can either call the read.csv() function, if you are familiar with it, or follow these steps using the menus:
From the File menu select Import Dataset, then From Text (base):
In the pop-up window, navigate to the folder where you saved the dataset, select the file, and click Open:
In the next dialogue box (see below), amend the name of your dataset in the “Name” field in the upper left corner if you wish. In this example we named it “calcium”.
We should also make sure the Heading option below is set to Yes (otherwise the data will all be read in as text):
Finally, click “Import” to complete the process. This imports the dataset, which is then listed in the “Environment” pane in the top right of your RStudio screen as follows:
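For readers who prefer code to the menus, the following is a minimal sketch of the read.csv() route. It assumes calcium.csv has been saved in the current working directory. The path is an assumption, so adjust csv_path to wherever you saved the file.

```r
# Assumed location of the file: the current working directory (see getwd())
csv_path <- "calcium.csv"

if (file.exists(csv_path)) {
  # header = TRUE treats the first row as column names, like "Heading: Yes"
  calcium <- read.csv(csv_path, header = TRUE)
  str(calcium)    # column names and types
  head(calcium)   # first few rows
} else {
  message("File not found: ", csv_path,
          " (working directory is ", getwd(), ")")
}
```

Checking str(calcium) after import is a quick way to confirm that the columns were read as numbers rather than text.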
Since both variables are scale, we will use Pearson’s correlation, but first we should examine the relationship using a scatter plot.
This snippet of code will allow us to create the scatter plot of this data:
plot(calcium$Knowledge, calcium$Calcium,
xlab = "Knowledge Score", ylab = "Calcium Intake",
main = "Calcium Intake vs Knowledge Score")
abline(lm(Calcium ~ Knowledge, data = calcium), col = "red")
Note: To generate plots in R, we need to indicate which columns of our dataset contain the variables we want to plot. We do this by typing the name of the dataset followed by a dollar sign “$” and the name of the column, as in the snippet of code above. To add axis labels, use the xlab and ylab options inside the function, as seen in the code provided. The abline() call adds the fitted straight line to the plot in red.
Once you run the previous code, you should have the following graph:
In the plot, the points follow an increasing pattern and seem to be reasonably close to an underlying straight line. This suggests there is a strong relationship between the two variables and also that it looks reasonably linear.
The correlation coefficient can be obtained in R using the following code. As with the plot, we indicate which columns of our data we want to use by typing the dataset name followed by a $ and the variable name.
correlation <- cor.test(calcium$Knowledge, calcium$Calcium, method='pearson')
correlation
Note that, if we had ordinal data, we could have opted for Spearman’s correlation instead by setting method='spearman'.
Running the above code should produce this output:
##
## Pearson's product-moment correlation
##
## data: calcium$Knowledge and calcium$Calcium
## t = 7.951, df = 18, p-value = 2.675e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7213682 0.9527909
## sample estimates:
## cor
## 0.8822551
The Pearson correlation coefficient is 0.8822551. We could report this as r=0.88 as two decimal places is sufficient. This is indicative of a strong relationship, as we saw earlier through the scatter plot. It is also positive which indicates that calcium intake increases as knowledge of calcium increases and vice-versa.
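The output also reports a 95 percent confidence interval for the correlation. As a rough check on where those limits come from, the sketch below rebuilds them from the reported coefficient and the sample size (df = 18, so n = 20) using the Fisher z-transformation. The variable names are illustrative, and the result should agree with the interval in the output to about four decimal places.

```r
r <- 0.8822551   # sample correlation reported in the output
n <- 20          # sample size (df = n - 2 = 18)

z  <- atanh(r)           # Fisher z-transformation of r
se <- 1 / sqrt(n - 3)    # approximate standard error on the z scale

# 95% limits on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 4)
```

The interval is not symmetric around 0.88 because the back-transformation compresses values as they approach 1.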
A commonly used interpretation is based on benchmarks suggested by Cohen (1992). Here correlation strengths are classified as in the table below. Note our value of 0.88 falls in the strong category of 0.5 to 0.9.
Correlation Coefficient Value | Interpretation |
---|---|
-0.3 to +0.3 | Weak |
-0.5 to -0.3 or 0.3 to 0.5 | Moderate |
-0.9 to -0.5 or 0.5 to 0.9 | Strong |
-0.9 to -1 or 0.9 to 1 | Very Strong |
Extracted from Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
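The table above can be turned into a small helper for labelling a coefficient. This is only a sketch: interpret_r is a hypothetical function name, and because the ranges in the table share their end points, the code assumes that a boundary value (for example exactly 0.3) falls into the lower category.

```r
# Hypothetical helper: label a correlation using the ranges in the table above
interpret_r <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1))
  # Work with the size of r only; the sign gives direction, not strength
  as.character(cut(abs(r),
                   breaks = c(0, 0.3, 0.5, 0.9, 1),
                   labels = c("Weak", "Moderate", "Strong", "Very Strong"),
                   include.lowest = TRUE))
}

interpret_r(0.88)   # "Strong"
```

Because the function uses the absolute value, a coefficient of -0.88 would receive the same label as +0.88.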
The correlation test output shown earlier includes a p-value, reported by R as 2.675e-07 (that is, 0.0000002675), which is smaller than 0.001. This p-value is used to explore the research question:
Is there a true relationship between the intake of calcium and knowledge about calcium?
This can be tested formally using the following hypotheses:
H0: There is no correlation between calcium intake and knowledge about calcium (equivalent to saying the true correlation in the population is 0).
H1: There is some correlation between calcium intake and knowledge about calcium (equivalent to saying the true correlation in the population is not 0).
Since our p-value is less than 0.001, it is below the usual 0.05 level used to test such hypotheses, so we can reject H0 and conclude there is evidence of a true correlation in the wider population of Sports Science students. The point here is that, whilst our correlation coefficient of 0.88 indicates a strong relationship, this is true for our sample of 20 participants. Can we use this as evidence to infer that a relationship truly exists between knowledge and intake of calcium among ALL Sports Science students (not just in our sample)? In our case, the test above says yes, we can.
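To see how the test statistic and p-value in the output relate to the correlation coefficient, the sketch below recomputes them from the reported values of r and df alone. The variable names are illustrative. The results should agree with t = 7.951 and a p-value of about 2.675e-07, up to rounding.

```r
r  <- 0.8822551   # sample correlation reported in the output
df <- 18          # degrees of freedom (n - 2, with n = 20)

# Test statistic for H0: the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with df degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

round(t_stat, 3)
p_value
p_value < 0.05   # TRUE, so H0 is rejected at the 0.05 level
```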
We could report the results as:
“Among Sports Science students, there is evidence that knowledge about calcium is related to calcium intake (p less than 0.001). Greater knowledge about calcium is associated with increased calcium intake, and the correlation coefficient indicated a strong linear relationship (r=0.88).”
Note how we avoid suggesting that greater knowledge CAUSES increased calcium intake as correlation cannot be used to infer a cause-and-effect relationship.
For more resources, see sigma.coventry.ac.uk
Adapted from material developed by Coventry University