You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: docs/book/basic_empirics/BasicEmpirMethods.md
+211-3Lines changed: 211 additions & 3 deletions
Display the source diff
Display the rich diff
Original file line number
Diff line number
Diff line change
@@ -13,6 +13,8 @@ kernelspec:
13
13
(Chap_BasicEmpirMethods)=
14
14
# Basic Empirical Methods
15
15
16
+
This chapter has an executable [Google Colab notebook](https://colab.research.google.com/drive/1sIHaDBE5fafPXYBl9cRDFQMsFNjq67t5?usp=sharing) with all the same code, data references, and images. The Google Colab notebook allows you to execute the code in this chapter in the cloud so you don't have to download Python, any of its packages, or any data to your local computer. You could manipulate and execute this notebook on any device with a browser, whether than be your computer, phone, or tablet.
17
+
16
18
The focus of this chapter is to give the reader a basic introduction to the standard empirical methods in data science, policy analysis, and economics. I want each reader to come away from this chapter with the following basic skills:
17
19
18
20
* Difference between **correlation** and **causation**
@@ -294,7 +296,7 @@ where:
294
296
Visually, this linear model involves choosing a straight line that best fits the data according to some criterion, as in the following plot (Figure 2 in {cite}`AcemogluEtAl:2001`).
OLS relationship between expropropriation risk and income (Fig. 2 from Acemoglu, et al, 2001)
339
+
```
340
+
341
+
The most common technique to estimate the parameters ($\beta$‘s) of the linear model is Ordinary Least Squares (OLS). As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals.
where $\hat{u}_i$ is the difference between the dependent variable observation $logpgp95_i$ and the predicted value of the dependent variable $\beta_0 + \beta_1 avexpr_i$. To estimate the constant term $\beta_0$, we need to add a column of 1’s to our dataset (consider the equation if $\beta_0$ was replaced with $\beta_0 x_i$ where $x_i=1$).
349
+
350
+
```{code-cell} ipython3
351
+
:tags: []
352
+
353
+
df1['const'] = 1
354
+
```
355
+
356
+
Now we can construct our model using the [`statsmodels`](https://www.statsmodels.org/stable/index.html) module and the [`OLS`](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) method. We will use `pandas` DataFrames with `statsmodels`. However, standard arrays can also be used as arguments.
So far we have simply constructed our model. The `statsmodels.regression.linear_model.OLS` is simply an object specifying dependent and independent variables, as well as instructions about what to do with missing data. We need to use the `.fit()` method to obtain OLS parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. This method calculates the OLS coefficients according to the minimization problem in {eq}`EqBasicEmp_OLScrit`.
374
+
375
+
```{code-cell} ipython
376
+
:tags: []
377
+
378
+
results = reg1.fit()
379
+
type(results)
380
+
```
381
+
382
+
We now have the fitted regression model stored in `results` (see [statsmodels.regression.linear_model.RegressionResultsWrapper](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html)). The `results` from the `reg1.fit()` command is a regression results object with a lot of information, similar to the results object of the `scipy.optimize.minimize()` function we worked with in the {ref}`Chap_MaxLikeli` and {ref}`Chap_GMM` chapters.
383
+
384
+
To view the OLS regression results, we can call the `.summary()` method.
385
+
386
+
[Note that an observation was mistakenly dropped from the results in the original paper (see the note located in maketable2.do from Acemoglu’s webpage), and thus the coefficients differ slightly.]
387
+
388
+
```{code-cell} ipython
389
+
:tags: []
390
+
391
+
print(results.summary())
392
+
```
393
+
394
+
We can get individual items from the results, which are saved as attributes.
395
+
396
+
```{code-cell} ipython
397
+
:tags: []
398
+
399
+
print(dir(results))
400
+
print("")
401
+
print("Degrees of freedom residuals:", results.df_resid)
402
+
print("")
403
+
print("Estimated coefficients:")
404
+
print(results.params)
405
+
print("")
406
+
print("Standard errors of estimated coefficients:")
407
+
print(results.bse)
408
+
```
409
+
410
+
The powerful machine learning python package scikit-learn also has a linear regression function [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). It is very good at prediction, but it is harder to get things like standard errors that are valuable for inference.
411
+
412
+
413
+
(SecBasicEmpLinRegCoefSE)=
414
+
### What do coefficients and standard errors mean?
415
+
416
+
Go through cross terms and quadratic terms and difference-in-difference.
417
+
418
+
419
+
(SecBasicEmpLinRegInterpRes)=
420
+
### Interpreting results and output
421
+
422
+
From our results, we see that:
423
+
* the intercept $\hat{\beta}_0=4.63$ (interpretation?)
424
+
* the slope $\hat{\beta}_1=0.53$ (interpretation?)
425
+
* the positive $\hat{\beta}_1>0$ parameter estimate implies that protection from expropriation has a positive effect on economic outcomes, as we saw in the figure.
426
+
* How would you quantitatively interpret the $\hat{\beta}_1$ coefficient?
427
+
* What do the standard errors on the coefficients tell you?
428
+
* The p-value of 0.000 for $\hat{\beta}_1$ implies that the effect of institutions on GDP is statistically significant (using $p < 0.05$ as a rejection rule)
429
+
* The R-squared value of 0.611 indicates that around 61% of variation in log GDP per capita is explained by protection against expropriation
430
+
431
+
Using our parameter estimates, we can now write our estimated relationship as:
432
+
```{math}
433
+
:label: EqBasicEmp_AcemogluRegEst
434
+
\hat{logpgp95}_i = 4.63 + 0.53 avexpr_i
435
+
```
436
+
437
+
This equation describes the line that best fits our data, as shown in {numref}`Figure %s <FigBasicEmpir_AcemFig2>`. We can use this equation to predict the level of log GDP per capita for a value of the index of expropriation protection (see Section {ref}`SecBasicEmpLinRegPredVals` below).
438
+
439
+
440
+
(SecBasicEmpLinRegANOVA)=
441
+
### Analysis of variance (ANOVA) output
442
+
443
+
The results `.summary()` method provides a lot of regression output. And the `.RegressionResults` object has much more as evidenced in the help page [statsmodels.regression.linear_model.RegressionResults](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html).
444
+
445
+
* The `Df Residuals: 109` displays the degrees of freedom from the residual variance calculation. This equals the number of observations minus the number of regression coefficients, `N-p=111-2`. This is accessed with `results.df_resid`.
446
+
* The `Df Model: 1` displays the degrees of freedom from the model variance calculation or from the regressors. This equals the number of regression coefficients minus one, `p-1=2-1`. This is accessed with `results.df_model`.
447
+
* One can specify robust standard errors in their regression. The robust option is specified in the `.fit()` command. You can specify three different types of robust standard errors using the `.fit(cov_type='HC1')`, `.fit(cov_type='HC2')`, or `.fit(cov_type='HC3')` options.
448
+
* You can do clustered standard errors if you have groups labeled in a variable called `mygroups` by using the `.fit(cov_type='cluster', cov_kwds={'groups': mygroups})`.
449
+
* R-squared is a measure of fit of the overall model. It is $R^2=1 - SSR/SST$ where $SST$ is the total variance of the dependent variable (total sum of squares), and $SSR$ is the sum of squared residuals (variance of the residuals). Another expresion is the sum of squared predicted values over the total sum of squares $R^2= SSM/SST$, where $SSM$ is the sum of squared predicted values. This is accessed with `results.rsquared`.
450
+
* Adjusted R-squared is a measure of fit of the overall model that penalizes extra regressors. A property of the R-squared in the previous bullet is that it always increases as you add more explanatory variables. This is accessed with `results.rsquared_adj`.
451
+
452
+
453
+
(SecBasicEmpLinRegFtest)=
454
+
### F-test and log likelihood test
455
+
456
+
* The F-statistic is the statistic from an F-test of the joint hypothesis that all the coefficients are equal to zero. The value of the F-statistic is distributed according to the F-distribution $F(d1,d2)$, where $d1=p-1$ and $d2=N-p$.
457
+
* The Prob (F-statistic) is the probability that the null hypothesis of all the coefficients being zero is true. In this case, it is really small.
458
+
* Log-likelihood is the sum of the log pdf values of the errors given their being normally distributed with mean 0 and standard deviation implied by the OLS estimates.
459
+
460
+
461
+
(SecBasicEmpLinRegInfer)=
462
+
### Inference on individual parameters
463
+
464
+
* The estimated coefficients of the linear regression are reported in the `results.params` vector object (pandas Series).
465
+
* The standard error on each estimated coefficient is reported in the summary results column entitled `std err`. These standard errors are reported in the `results.bse` vector object (pandas Series).
466
+
* The "t" column is the $t$ test statistic. It is the value in the support of the students-T distribution that is equivalent to the estimated coefficient if the null-hypothesis were true that the estimated coefficient were 0.
467
+
* The reported p-value is the probability of a two-sided t-test that gives the probability that the estimated coefficient is greater than its estimated value if the true value were 0. A more intuitive interpretation is the probability of seeing that estimated value if the null hypothesis were true. We usually reject the null hypothesis if the p-value is lower than 0.05.
468
+
* The summary results report the 95% two-sided confidence interval for the estimated value.
469
+
470
+
471
+
(SecBasicEmpLinRegPredVals)=
472
+
### Predicted values
473
+
474
+
We can obtain an array of predicted $logpgp95_i$ for every value of $avexpr_i$ in our dataset by calling `.predict()` on our results. Let's first get the predicted value for the average country in the dataset.
475
+
476
+
```{code-cell} ipython
477
+
:tags: []
478
+
479
+
mean_expr = np.mean(df1['avexpr'])
480
+
mean_expr
481
+
```
482
+
483
+
```{code-cell} ipython
484
+
:tags: []
485
+
486
+
print(results.params)
487
+
```
488
+
489
+
```{code-cell} ipython
490
+
:tags: []
491
+
492
+
predicted_logpdp95 = 4.63 + 0.53 * mean_expr
493
+
print(predicted_logpdp95)
494
+
```
495
+
496
+
An easier (and more accurate) way to obtain this result is to use `.predict()` and set $constant=1$ and $avexpr_i=$ `mean_expr`.
497
+
498
+
```{code-cell} ipython
499
+
:tags: []
500
+
501
+
results.predict(exog=[1, mean_expr])
502
+
```
503
+
504
+
Plotting the predicted values against $avexpr_i$ shows that the predicted values lie along the linear line that we fitted below in {numref}`Figure %s <FigBasicEmpir_AcemPredVals>`. The observed values of $logpgp95_i$ are also plotted for comparison purposes.
0 commit comments