Commit 6021da6 (1 parent 171a53f): Updated BasicEmpirMethods.md

docs/book/basic_empirics/BasicEmpirMethods.md: 211 additions & 3 deletions
@@ -13,6 +13,8 @@ kernelspec:
(Chap_BasicEmpirMethods)=
# Basic Empirical Methods

This chapter has an executable [Google Colab notebook](https://colab.research.google.com/drive/1sIHaDBE5fafPXYBl9cRDFQMsFNjq67t5?usp=sharing) with all the same code, data references, and images. The Google Colab notebook allows you to execute the code in this chapter in the cloud so you don't have to download Python, any of its packages, or any data to your local computer. You could manipulate and execute this notebook on any device with a browser, whether that be your computer, phone, or tablet.

The focus of this chapter is to give the reader a basic introduction to the standard empirical methods in data science, policy analysis, and economics. I want each reader to come away from this chapter with the following basic skills:

* Difference between **correlation** and **causation**
@@ -294,7 +296,7 @@ where:
Visually, this linear model involves choosing a straight line that best fits the data according to some criterion, as in the following plot (Figure 2 in {cite}`AcemogluEtAl:2001`).

```{code-cell} ipython3
:tags: ["remove-output"]

import numpy as np
@@ -325,12 +327,218 @@ plt.xlabel('Average Expropriation Protection 1985-95')
plt.ylabel('Log GDP per capita, PPP, 1995')
plt.xlim((3.2, 10.5))
plt.ylim((5.9, 10.5))
plt.title('OLS relationship between expropriation risk and income (Fig. 2 from Acemoglu, et al 2001)')
plt.show()
```

```{figure} ../../../images/basic_empirics/AcemogluEtAl_fig2.png
:height: 500px
:name: FigBasicEmpir_AcemFig2

OLS relationship between expropriation risk and income (Fig. 2 from Acemoglu, et al, 2001)
```

The most common technique to estimate the parameters ($\beta$'s) of the linear model is Ordinary Least Squares (OLS). As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals.

```{math}
:label: EqBasicEmp_OLScrit
\hat{\beta}_{OLS} = \arg\min_{\beta_0,\beta_1}\: u(X|\beta_0,\beta_1)^T \, u(X|\beta_0,\beta_1)
```

where $u_i$ is the difference between the dependent variable observation $logpgp95_i$ and the predicted value of the dependent variable $\beta_0 + \beta_1 avexpr_i$. To estimate the constant term $\beta_0$, we need to add a column of 1's to our dataset (consider the equation if $\beta_0$ were replaced with $\beta_0 x_i$ where $x_i=1$).
```{code-cell} ipython3
:tags: []

df1['const'] = 1
```
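
To see concretely what the minimization in {eq}`EqBasicEmp_OLScrit` does, here is a minimal sketch that solves the problem in closed form using the normal equations $\hat{\beta}=(X^TX)^{-1}X^Ty$ and `numpy`. It assumes the `df1` DataFrame with the `const`, `avexpr`, and `logpgp95` columns from above; the `statsmodels` estimates we compute below should match these values.

```{code-cell} ipython3
:tags: []

# Minimal sketch: solve the OLS minimization in closed form with the
# normal equations, beta_hat = (X'X)^{-1} X'y. Assumes df1 has the
# 'const', 'avexpr', and 'logpgp95' columns created above.
data = df1[['const', 'avexpr', 'logpgp95']].dropna()
X = data[['const', 'avexpr']].to_numpy()
y = data['logpgp95'].to_numpy()

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print('beta_0, beta_1 =', beta_hat)
```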
Now we can construct our model using the [`statsmodels`](https://www.statsmodels.org/stable/index.html) module and its [`OLS`](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) class. We will use `pandas` DataFrames with `statsmodels`. However, standard arrays can also be used as arguments.

```{code-cell} ipython
:tags: ["remove-output"]

!pip install --upgrade statsmodels
```
```{code-cell} ipython
:tags: []

import statsmodels.api as sm

reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']], missing='drop')
type(reg1)
```
So far we have only constructed our model. The `statsmodels.regression.linear_model.OLS` object specifies the dependent and independent variables, as well as instructions about what to do with missing data. We need to use the `.fit()` method to obtain OLS parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. This method calculates the OLS coefficients according to the minimization problem in {eq}`EqBasicEmp_OLScrit`.
```{code-cell} ipython
:tags: []

results = reg1.fit()
type(results)
```
We now have the fitted regression model stored in `results` (see [statsmodels.regression.linear_model.RegressionResultsWrapper](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html)). The `results` object from the `reg1.fit()` command is a regression results object with a lot of information, similar to the results object of the `scipy.optimize.minimize()` function we worked with in the {ref}`Chap_MaxLikeli` and {ref}`Chap_GMM` chapters.

To view the OLS regression results, we can call the `.summary()` method.

[Note that an observation was mistakenly dropped from the results in the original paper (see the note located in maketable2.do from Acemoglu's webpage), and thus the coefficients differ slightly.]
```{code-cell} ipython
:tags: []

print(results.summary())
```
We can get individual items from the results, which are saved as attributes.
```{code-cell} ipython
:tags: []

print(dir(results))
print("")
print("Degrees of freedom residuals:", results.df_resid)
print("")
print("Estimated coefficients:")
print(results.params)
print("")
print("Standard errors of estimated coefficients:")
print(results.bse)
```
The powerful machine learning Python package scikit-learn also has a linear regression estimator, [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). It is very good at prediction, but it is harder to get things like standard errors that are valuable for inference.
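
As a quick illustration, here is a minimal sketch (assuming the scikit-learn package is installed) that fits the same regression with `LinearRegression`. It recovers the same point estimates as `statsmodels` but reports no standard errors.

```{code-cell} ipython
:tags: []

# Minimal sketch: the same regression in scikit-learn. It returns point
# estimates but no standard errors or other inference output.
from sklearn.linear_model import LinearRegression

data = df1[['avexpr', 'logpgp95']].dropna()
lr = LinearRegression().fit(data[['avexpr']], data['logpgp95'])
print('intercept:', lr.intercept_, ' slope:', lr.coef_[0])
```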
(SecBasicEmpLinRegCoefSE)=
### What do coefficients and standard errors mean?

Go through cross terms, quadratic terms, and difference-in-differences, as sketched below.
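
As a placeholder sketch for this section (the regressor name `avexpr2` and this specification are illustrative assumptions, not from the original chapter), a quadratic term is simply an additional column of the exog matrix; cross (interaction) terms between two regressors work the same way.

```{code-cell} ipython
:tags: []

# Hypothetical sketch: add a quadratic term as an extra regressor.
df1['avexpr2'] = df1['avexpr'] ** 2
reg_quad = sm.OLS(endog=df1['logpgp95'],
                  exog=df1[['const', 'avexpr', 'avexpr2']],
                  missing='drop')
print(reg_quad.fit().params)
```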
(SecBasicEmpLinRegInterpRes)=
### Interpreting results and output

From our results, we see that:
* the intercept $\hat{\beta}_0=4.63$ (interpretation?)
* the slope $\hat{\beta}_1=0.53$ (interpretation?)
* the positive $\hat{\beta}_1>0$ parameter estimate implies that protection from expropriation has a positive effect on economic outcomes, as we saw in the figure.
* How would you quantitatively interpret the $\hat{\beta}_1$ coefficient?
* What do the standard errors on the coefficients tell you?
* The p-value of 0.000 for $\hat{\beta}_1$ implies that the effect of institutions on GDP is statistically significant (using $p < 0.05$ as a rejection rule).
* The R-squared value of 0.611 indicates that around 61% of variation in log GDP per capita is explained by protection against expropriation.
Using our parameter estimates, we can now write our estimated relationship as:

```{math}
:label: EqBasicEmp_AcemogluRegEst
\hat{logpgp95}_i = 4.63 + 0.53 avexpr_i
```

This equation describes the line that best fits our data, as shown in {numref}`Figure %s <FigBasicEmpir_AcemFig2>`. We can use this equation to predict the level of log GDP per capita for a value of the index of expropriation protection (see Section {ref}`SecBasicEmpLinRegPredVals` below).
(SecBasicEmpLinRegANOVA)=
### Analysis of variance (ANOVA) output

The `.summary()` method of the results object provides a lot of regression output, and the `RegressionResults` object has much more, as evidenced in the help page [statsmodels.regression.linear_model.RegressionResults](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html).

* The `Df Residuals: 109` displays the degrees of freedom from the residual variance calculation. This equals the number of observations minus the number of regression coefficients, `N-p=111-2`. This is accessed with `results.df_resid`.
* The `Df Model: 1` displays the degrees of freedom from the model variance calculation or from the regressors. This equals the number of regression coefficients minus one, `p-1=2-1`. This is accessed with `results.df_model`.
* You can specify robust standard errors in your regression. The robust option is specified in the `.fit()` command. You can specify three different types of robust standard errors using the `.fit(cov_type='HC1')`, `.fit(cov_type='HC2')`, or `.fit(cov_type='HC3')` options (see the sketch after this list).
* You can compute clustered standard errors if you have groups labeled in a variable called `mygroups` by using the `.fit(cov_type='cluster', cov_kwds={'groups': mygroups})` option.
* R-squared is a measure of fit of the overall model. It is $R^2=1 - SSR/SST$, where $SST$ is the total variance of the dependent variable (total sum of squares) and $SSR$ is the sum of squared residuals (variance of the residuals). An equivalent expression is the model sum of squares over the total sum of squares, $R^2= SSM/SST$, where $SSM$ is the sum of squared deviations of the predicted values from their mean. This is accessed with `results.rsquared`.
* Adjusted R-squared is a measure of fit of the overall model that penalizes extra regressors. A property of the R-squared in the previous bullet is that it never decreases as you add more explanatory variables. This is accessed with `results.rsquared_adj`.
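
As a sketch of the robust standard error options described in the list above, we can refit the model from earlier with `cov_type='HC1'` and compare the standard errors to the homoskedastic ones (this assumes the `reg1` model and `results` object defined above).

```{code-cell} ipython
:tags: []

# Sketch: refit the model with heteroskedasticity-robust (HC1) standard
# errors. The coefficient estimates are unchanged; only the standard
# errors (and therefore t statistics and p-values) differ.
results_hc1 = reg1.fit(cov_type='HC1')
print('Homoskedastic std errors:')
print(results.bse)
print('')
print('Robust (HC1) std errors:')
print(results_hc1.bse)
```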
(SecBasicEmpLinRegFtest)=
### F-test and log likelihood test

* The F-statistic is the statistic from an F-test of the joint hypothesis that all the coefficients are equal to zero. The value of the F-statistic is distributed according to the F-distribution $F(d1,d2)$, where $d1=p-1$ and $d2=N-p$.
* The Prob (F-statistic) is the probability of observing an F-statistic at least that large if the null hypothesis that all the coefficients are zero were true. In this case, it is vanishingly small (see the sketch after this list for how to access these values).
* Log-likelihood is the sum of the log pdf values of the errors, given that they are normally distributed with mean 0 and the standard deviation implied by the OLS estimates.
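
These quantities are stored as attributes of the results object. Assuming the fitted `results` from above, a quick sketch:

```{code-cell} ipython
:tags: []

# The F statistic, its p-value, and the log likelihood reported in the
# summary are available as attributes of the results object.
print('F-statistic:', results.fvalue)
print('Prob (F-statistic):', results.f_pvalue)
print('Log-likelihood:', results.llf)
```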
(SecBasicEmpLinRegInfer)=
### Inference on individual parameters

* The estimated coefficients of the linear regression are reported in the `results.params` vector object (pandas Series).
* The standard error on each estimated coefficient is reported in the summary results column entitled `std err`. These standard errors are reported in the `results.bse` vector object (pandas Series).
* The "t" column is the $t$ test statistic: the estimated coefficient divided by its standard error. Under the null hypothesis that the true coefficient is 0, this statistic is distributed according to the Student's $t$ distribution (see the sketch after this list).
* The reported p-value is from a two-sided $t$ test. It gives the probability of observing a $t$ statistic at least as large in absolute value as the one estimated if the true coefficient were 0. A more intuitive interpretation is the probability of seeing that estimated value if the null hypothesis were true. We usually reject the null hypothesis if the p-value is lower than 0.05.
* The summary results report the 95% two-sided confidence interval for each estimated coefficient.
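
As a sketch of how these columns relate to each other, assuming the fitted `results` from above:

```{code-cell} ipython
:tags: []

# The t statistic is the estimated coefficient divided by its standard
# error, and .conf_int() returns the 95% confidence intervals reported
# in the summary table.
print(results.params / results.bse)  # same as results.tvalues
print('')
print(results.pvalues)
print('')
print(results.conf_int(alpha=0.05))
```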
(SecBasicEmpLinRegPredVals)=
### Predicted values

We can obtain an array of predicted $logpgp95_i$ for every value of $avexpr_i$ in our dataset by calling `.predict()` on our results. Let's first get the predicted value for the average country in the dataset.

```{code-cell} ipython
:tags: []

mean_expr = np.mean(df1['avexpr'])
mean_expr
```
```{code-cell} ipython
:tags: []

print(results.params)
```

```{code-cell} ipython
:tags: []

predicted_logpgp95 = 4.63 + 0.53 * mean_expr
print(predicted_logpgp95)
```
An easier (and more accurate) way to obtain this result is to use `.predict()` and set the constant equal to 1 and $avexpr_i=$ `mean_expr`.

```{code-cell} ipython
:tags: []

results.predict(exog=[1, mean_expr])
```
Plotting the predicted values against $avexpr_i$ shows that the predicted values lie along the line that we fitted, as shown below in {numref}`Figure %s <FigBasicEmpir_AcemPredVals>`. The observed values of $logpgp95_i$ are also plotted for comparison.

```{code-cell} ipython
:tags: ["remove-output"]

# Drop missing observations from whole sample
df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])

# Plot predicted values. alpha is a blending value between 0 (transparent) and 1 (opaque)
plt.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5, label='predicted')

# Plot observed values
plt.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5, label='observed')

plt.legend()
plt.title('OLS predicted values')
plt.xlabel('Average Expropriation Protection 1985-95')
plt.ylabel('Log GDP per capita, PPP, 1995')
plt.show()
```

```{figure} ../../../images/basic_empirics/AcemogluEtAl_predvals.png
:height: 500px
:name: FigBasicEmpir_AcemPredVals

OLS predicted values for Acemoglu, et al, 2001 data
```

(SecBasicEmpLinRegExt)=
## Basic extensions of linear regression

* Instrumental variables (omitted variable bias)
* Logistic regression
* Multiple equation models
* Panel data
* Time series data
* Vector autoregression


(SecBasicEmpirExercises)=