|
|
Panel Data Regression Using R (Using RStudio)
This guide shows how to do a simplified Panel Data Regression analysis using RStudio. Panel data regression is a little different to the usual simple linear regression or multiple regression, since it often includes data, such as financial data, that changes over time for specific entities (like companies or countries).
A finance researcher seeks to determine if aspects such as the frequency of board meetings, board size, firm size, total assets, and equity have an impact on Tobin’s Q (a financial measure of the value of a company) for a number of different banking institutions from 2010 to 2020. The data are shown below in JASP and can be downloaded in a CSV file called panel_data_dbase_csv_v2.csv.
Notice how the data are formatted. Each row represents one set of observations (in this case one year’s data) for each bank. Therefore we need two additional variables that identify the bank (bank_id) and the year (years) the data comes from. This way of arranging the data is called “Long Format”. The data needs to be in this long format and not “Wide Format” where you have each year’s data in separate columns which is often how the data are presented in tables or when extracted from financial databases.
There are also variables for BoardSize, FirmSize, FrequencyofBoardMeeting, tot_asset_trill, and Equity_trill, which can vary from year to year within each bank. These are our predictor variables also referred to as independent variables. The dependent variable is Tobin’s Q since we are interested in how this depends on (are affected by) the other (predictor or independent) variables.
To get started with the analysis, first, bring the dataset into RStudio. To do this you can either run a read.csv() function if you know how to do this or alternatively you can follow the following steps using the menus:
From the File menu select Import Dataset then From Text(base):
From the pop-up window navigate to the folder where you have saved the dataset, then once the file was selected click “Open”:
At the next dialogue box (see below), in the upper left corner in the “Name” field, amend the name of your dataset if you wish, in this example we named it as “panel_data”.
We should also make sure the Heading option below is set to Yes (otherwise the data will all be read in as text):
Finally, click on “Import” to complete the process. This imports the data set and is listed in the “Environment” in the top right of your RStudio screen as follows:
We should always first explore our data visually, to assess the potential value of a panel regression model. Particularly useful is to analyse the relationship between our dependent variable, e.g. Tobin’s q and our main independent variables, particularly Equity_trill and Total_asset_trill. We can do that by generating scatter plots:
plot(panel_data$Equity_trill, panel_data$TobinQ,
xlab="Equity Trill",
ylab="TobinQ")
Note: To generate plots in R, we need to indicate the data or variables we want to plot. We can do that by indicating which columns in our dataset contain that data, for the case of Rstudio, we can type the name of our dataset followed by a dollar sign “$” and the name of our column that contains the data as in the snippet of code above. Additionally, to add labels, we can add them by using the xlab and ylab options inside the function as seen in the code provided.
Repeat, this time putting tot_asset_trill in the X-Axis box to get:
plot(panel_data$tot_asset_trill, panel_data$TobinQ,
xlab="Tot Asset Trill",
ylab="TobinQ")
We can then repeat for all the other predictor variables.
In these graphs, we’re trying to spot trends or patterns in the data. A common thing we look for is whether there’s a linear relationship between the variables. Imagine you’re trying to draw a straight line through the data points. It can sometimes be tricky to determine this just by looking, but it’s a helpful first step to understanding your data.
In our example, there appears to be a reasonably linear relationship between Tobin’s Q and our two main factors, albeit with lots of variation. If needed, you can add a line to the graph to help see this relationship better. But for our example, just looking at the overall trend or pattern of the data is enough.
In panel data regression, it’s essential to control for specific unobserved effects that are specific to each bank and remain fairly constant over time. Examples might be the types of customers each bank attracts, as some may attract younger customers and some older customers. Another example might be the types of financial products the bank trades in. These might differ between banks but stay relatively constant over time but may explain some of the differences in Tobin’s Q.
Whilst RStudio can’t handle advanced panel regression models very easily, we can use it to implement a fixed effects regression technique for panel regression. Whilst this is somewhat limited for panel data regression, this approach offers a practical and scientifically sound method for panel data analysis.
We will approach the panel regression using a Generalised Linear Model (GLM). First, we need to convert the variables Bank_id and years to factors so that they are recognised as categorical. These two types of variables are essential in panel data regression; one tells us about the individual entity (in this case, banks) and the other about the time period linking the same data to each bank together.
panel_data$Bank_id <- as.factor(panel_data$Bank_id)
panel_data$years <- as.factor(panel_data$years)
To gain a better understanding of the data, we obtain descriptive statistics for each variable by using the summary() function:
summary(panel_data)
Next, we can fit a GLM to our data using the glm() function. We put our dependent variable, TobinQ first, followed by our independent variables. Also make sure to specify the dataset name. To visually see our results, we obtain a summary of the model.
glm_model <- glm(TobinQ ~ BoardSize + FirmSize + FrequencyofBoardMeeting + tot_asset_trill + Equity_trill + Bank_id + years, data = panel_data)
summary(glm_model)
The descriptive statistics tell us information about the data. For the categorical variables, they provides information about how many individuals, , banks in our example, are in the sample. For the continuous variables, we can see key summary statistics, including the mean values for each variable.
## Bank_id years TobinQ BoardSize FirmSize
## 1 :11 2010 :14 Min. :1.005 Min. : 7.00 Min. : 8.653
## 2 :11 2011 :14 1st Qu.:1.106 1st Qu.:12.00 1st Qu.: 9.296
## 3 :11 2012 :14 Median :1.137 Median :14.00 Median : 9.749
## 4 :11 2013 :14 Mean :1.135 Mean :13.99 Mean :10.637
## 5 :11 2014 :14 3rd Qu.:1.159 3rd Qu.:16.00 3rd Qu.:12.176
## 6 :11 2015 :14 Max. :1.235 Max. :20.00 Max. :12.676
## (Other):88 (Other):70
## FrequencyofBoardMeeting tot_asset_trill Equity_trill
## Min. : 2.000 Min. :0.000450 Min. :0.0000775
## 1st Qu.: 4.000 1st Qu.:0.002356 1st Qu.:0.0002741
## Median : 6.000 Median :0.005275 Median :0.0059091
## Mean : 6.273 Mean :0.717485 Mean :0.0980692
## 3rd Qu.: 8.000 3rd Qu.:1.487500 3rd Qu.:0.1615000
## Max. :12.000 Max. :3.730000 Max. :0.5680000
##
One of the key parts to look at is the output from the glm model. Our primary interest is in understanding how our predictor variables shown at the bottom of the table influence the outcome Tobin’s Q (labelled as “BoardSize” down to “Equity_trill”. This table includes a lot of results but we are often not interested in the row labelled “(Intercept”) or rows any labelled “(Bank_ID=)” or (“years=)”. The parameter estimates for Bank_id are there to capture differences between banks and the parameter estimates for years are there to connect the data for each bank over time.
The column labelled Estimate gives us what are sometimes called the “beta coefficients” which describe the relationship each predictor has with Tobin’s Q. The column labelled Pr(>|t|) gives the associated p-value which tells us if there is evidence each relationship is statistically significant (i.e. represents a true relationship and not just a random chance relationship).
##
## Call:
## glm(formula = TobinQ ~ BoardSize + FirmSize + FrequencyofBoardMeeting +
## tot_asset_trill + Equity_trill + Bank_id + years, data = panel_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1643710 0.0499568 23.308 < 2e-16 ***
## BoardSize -0.0002501 0.0014293 -0.175 0.86137
## FirmSize -0.0036726 0.0047572 -0.772 0.44156
## FrequencyofBoardMeeting 0.0010371 0.0019865 0.522 0.60254
## tot_asset_trill -0.0233459 0.0118193 -1.975 0.05045 .
## Equity_trill 0.2164053 0.0715306 3.025 0.00302 **
## Bank_id2 0.0275815 0.0177545 1.553 0.12283
## Bank_id3 0.0130195 0.0177381 0.734 0.46433
## Bank_id4 0.0158831 0.0178719 0.889 0.37586
## Bank_id5 0.0017986 0.0176925 0.102 0.91919
## Bank_id6 0.0103118 0.0180773 0.570 0.56941
## Bank_id7 0.0074946 0.0176803 0.424 0.67237
## Bank_id8 -0.0014874 0.0178281 -0.083 0.93364
## Bank_id9 0.0126252 0.0179529 0.703 0.48322
## Bank_id10 0.0162111 0.0182739 0.887 0.37672
## Bank_id11 0.0087426 0.0177809 0.492 0.62381
## Bank_id12 -0.0043552 0.0178490 -0.244 0.80763
## Bank_id13 0.0186465 0.0177386 1.051 0.29520
## Bank_id14 0.0196368 0.0178290 1.101 0.27284
## years2011 -0.0099860 0.0160900 -0.621 0.53597
## years2012 -0.0135208 0.0156869 -0.862 0.39038
## years2013 -0.0144780 0.0161643 -0.896 0.37214
## years2014 0.0053494 0.0157949 0.339 0.73542
## years2015 -0.0019190 0.0160772 -0.119 0.90518
## years2016 0.0114609 0.0158435 0.723 0.47080
## years2017 -0.0229530 0.0161511 -1.421 0.15777
## years2018 -0.0095451 0.0158521 -0.602 0.54817
## years2019 -0.0376837 0.0157669 -2.390 0.01834 *
## years2020 0.0036278 0.0163951 0.221 0.82524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.001706704)
##
## Null deviance: 0.28436 on 153 degrees of freedom
## Residual deviance: 0.21334 on 125 degrees of freedom
## AIC: -516.57
##
## Number of Fisher Scoring iterations: 2
tot_asset_trill has a B (beta) value of -0.023 which indicates a negative relationship. i.e. Banks with larger assets are associated with having decreased values of Tobin’s Q. The p-value in this case is 0.050 which is just 0.05 (when working to the usual 5% level of significance) indicating that there is evidence (just) that tot_asset_trill is a statistically significant predictor of Tobin’s Q.
statistically significant predictor of Tobin’s Q. Equity_till has a B (beta) value of 0.216 which indicates a positive relationship. i.e. Banks with larger equity are associated with having increased values of Tobin’s Q. The p-value here is listed as being less than 0.003, which means it is very small and hence very much below 0.05, and so there is actually a lot more evidence here that Equity_bill is a statistically significant predictor of Tobin’s Q.
For the remaining predictor variables (BoardSize, FirmSize and FrequencyofBoardMeeting) the p-values are all greater than 0.05 indicating that there is no evidence that these have any true association with Tobin’s Q. hence we do not try to interpret the B (beta) values as these represent random chance relationships.
We might write this up in a report by saying:
“The results suggest that the total assets of a company, as represented by tot_asset_trill, and the equity figures, denoted by Equity_bill, both have a statistically significant relationship with Tobin’s Q. However, it is important to note that the tot_asset_trill variable is negative, indicating that banks with higher total assets and equity do not necessarily have higher levels of Tobin’s Q. Conversely, there is no evidence that factors such as the size of the board (BoardSize), the overall size of the firm (FirmSize), and the frequency of board meetings (FrequencyofBoardMeeting) have any relationship with Tobin’s Q.”
For more
resources, see
sigma.coventry.ac.uk
Adapted from material developed by
Coventry University