OpenSourceEcon
diff --git a/‎docs/book/basic_empirics/BasicEmpirMethods.md‎
Lines changed: 211 additions & 3 deletions b/‎docs/book/basic_empirics/BasicEmpirMethods.md‎
diff --git a/‎images/basic_empirics/AcemogluEtAl_fig2.png‎
191 KB b/‎images/basic_empirics/AcemogluEtAl_fig2.png‎
diff --git a/‎images/basic_empirics/AcemogluEtAl_predvals.png‎
191 KB b/‎images/basic_empirics/AcemogluEtAl_predvals.png‎
@@ -13,6 +13,8 @@ kernelspec:
 (Chap_BasicEmpirMethods)=
 # Basic Empirical Methods
 
+This chapter has an executable [Google Colab notebook](https://colab.research.google.com/drive/1sIHaDBE5fafPXYBl9cRDFQMsFNjq67t5?usp=sharing) with all the same code, data references, and images. The Google Colab notebook allows you to execute the code in this chapter in the cloud so you don't have to download Python, any of its packages, or any data to your local computer. You can modify and execute this notebook on any device with a browser, whether that be your computer, phone, or tablet.
+
 The focus of this chapter is to give the reader a basic introduction to the standard empirical methods in data science, policy analysis, and economics. I want each reader to come away from this chapter with the following basic skills:
 
 * Difference between **correlation** and **causation**
@@ -294,7 +296,7 @@ where:
 Visually, this linear model involves choosing a straight line that best fits the data according to some criterion, as in the following plot (Figure 2 in {cite}`AcemogluEtAl:2001`).
 
 ```{code-cell} ipython3
-:tags: []
+:tags: ["remove-output"]
 
 import numpy as np
 
@@ -325,12 +327,218 @@ plt.xlabel('Average Expropriation Protection 1985-95')
 plt.ylabel('Log GDP per capita, PPP, 1995')
 plt.xlim((3.2, 10.5))
 plt.ylim((5.9, 10.5))
-plt.title('Figure 2: OLS relationship between expropriation risk and income')
+plt.title('OLS relationship between expropriation risk and income (Fig. 2 from Acemoglu, et al 2001)')
+plt.show()
+```
+
+```{figure} ../../../images/basic_empirics/AcemogluEtAl_fig2.png
+:height: 500px
+:name: FigBasicEmpir_AcemFig2
+
+OLS relationship between expropriation risk and income (Fig. 2 from Acemoglu, et al, 2001)
+```
+
+The most common technique to estimate the parameters ($\beta$'s) of the linear model is Ordinary Least Squares (OLS). As the name implies, an OLS model is solved by finding the parameters that minimize the sum of squared residuals.
+
+```{math}
+    :label: EqBasicEmp_OLScrit
+    \hat{\beta}_{OLS} = \beta : \quad \min_{\beta}\: u(X|\beta_0,\beta_1)^T \: u(X|\beta_0,\beta_1)
+```
+
+where $u(X|\beta_0,\beta_1)$ is the vector of residuals, whose $i$th element $u_i$ is the difference between the dependent variable observation $logpgp95_i$ and the predicted value of the dependent variable $\beta_0 + \beta_1 avexpr_i$. To estimate the constant term $\beta_0$, we need to add a column of 1’s to our dataset (consider the equation if $\beta_0$ were replaced with $\beta_0 x_i$ where $x_i=1$).
+
+```{code-cell} ipython3
+:tags: []
+
+df1['const'] = 1
+```
+
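As a rough illustration of the minimization in {eq}`EqBasicEmp_OLScrit`, the sketch below solves the least-squares problem with NumPy on synthetic data. The data, the seed, and the names `ssr`, `beta_hat`, and `beta_other` are assumptions made for this example and are not the Acemoglu et al. data. It checks that moving away from the least-squares solution raises the sum of squared residuals.

```python
import numpy as np

# Synthetic data (hypothetical, NOT the Acemoglu et al. dataset)
rng = np.random.default_rng(0)
x = rng.uniform(3.0, 10.0, size=100)
y = 4.5 + 0.5 * x + rng.normal(0.0, 0.7, size=100)

# Design matrix with a column of ones for the constant term
X = np.column_stack([np.ones_like(x), x])


def ssr(beta):
    """Sum of squared residuals u'u for a candidate parameter vector."""
    u = y - X @ beta
    return u @ u


# Least-squares solution of min_beta u'u
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# A perturbed parameter vector gives a larger criterion value
beta_other = beta_hat + np.array([0.1, -0.02])

print("OLS estimates:", beta_hat)
print("SSR at OLS estimates:", ssr(beta_hat))
print("SSR at perturbed estimates:", ssr(beta_other))
```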
+Now we can construct our model using the [`statsmodels`](https://www.statsmodels.org/stable/index.html) module and its [`OLS`](https://www.statsmodels.org/dev/examples/notebooks/generated/ols.html) class. We will use `pandas` DataFrames with `statsmodels`. However, NumPy arrays can also be used as arguments.
+
+```{code-cell} ipython
+:tags: ["remove-output"]
+
+!pip install --upgrade statsmodels
+```
+
+```{code-cell} ipython
+:tags: []
+
+import statsmodels.api as sm
+
+reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']], missing='drop')
+type(reg1)
+```
+
+So far we have simply constructed our model. The `statsmodels.regression.linear_model.OLS` is simply an object specifying dependent and independent variables, as well as instructions about what to do with missing data. We need to use the `.fit()` method to obtain OLS parameter estimates $\hat{\beta}_0$ and $\hat{\beta}_1$. This method calculates the OLS coefficients according to the minimization problem in {eq}`EqBasicEmp_OLScrit`.
+
+```{code-cell} ipython
+:tags: []
+
+results = reg1.fit()
+type(results)
+```
+
+We now have the fitted regression model stored in `results` (see [statsmodels.regression.linear_model.RegressionResultsWrapper](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html)). The `results` from the `reg1.fit()` command is a regression results object with a lot of information, similar to the results object of the `scipy.optimize.minimize()` function we worked with in the {ref}`Chap_MaxLikeli` and {ref}`Chap_GMM` chapters.
+
+To view the OLS regression results, we can call the `.summary()` method.
+
+[Note that an observation was mistakenly dropped from the results in the original paper (see the note located in maketable2.do from Acemoglu’s webpage), and thus the coefficients differ slightly.]
+
+```{code-cell} ipython
+:tags: []
+
+print(results.summary())
+```
+
+We can get individual items from the results, which are saved as attributes.
+
+```{code-cell} ipython
+:tags: []
+
+print(dir(results))
+print("")
+print("Degrees of freedom residuals:", results.df_resid)
+print("")
+print("Estimated coefficients:")
+print(results.params)
+print("")
+print("Standard errors of estimated coefficients:")
+print(results.bse)
+```
+
+The powerful machine learning Python package scikit-learn also has a linear regression class [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). It is very good at prediction, but it is harder to get things like standard errors that are valuable for inference.
+
+
+(SecBasicEmpLinRegCoefSE)=
+### What do coefficients and standard errors mean?
+
+Go through cross terms and quadratic terms and difference-in-difference.
+
+
+(SecBasicEmpLinRegInterpRes)=
+### Interpreting results and output
+
+From our results, we see that:
+* the intercept $\hat{\beta}_0=4.63$ (interpretation?)
+* the slope $\hat{\beta}_1=0.53$ (interpretation?)
+* the positive $\hat{\beta}_1>0$ parameter estimate indicates that protection from expropriation is positively associated with economic outcomes, as we saw in the figure.
+* How would you quantitatively interpret the $\hat{\beta}_1$ coefficient?
+* What do the standard errors on the coefficients tell you?
+* The p-value of 0.000 for $\hat{\beta}_1$ implies that the estimated relationship between institutions and GDP is statistically significant (using $p < 0.05$ as a rejection rule)
+* The R-squared value of 0.611 indicates that around 61% of variation in log GDP per capita is explained by protection against expropriation
+
+Using our parameter estimates, we can now write our estimated relationship as:
+```{math}
+    :label: EqBasicEmp_AcemogluRegEst
+    \widehat{logpgp95}_i = 4.63 + 0.53 \, avexpr_i
+```
+
+This equation describes the line that best fits our data, as shown in {numref}`Figure %s <FigBasicEmpir_AcemFig2>`. We can use this equation to predict the level of log GDP per capita for a value of the index of expropriation protection (see Section {ref}`SecBasicEmpLinRegPredVals` below).
+
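One possible quantitative reading of the slope, sketched below with the rounded estimates from the text: because the dependent variable is in logs, a one-unit increase in the expropriation protection index is associated with GDP per capita that is higher by a factor of $e^{0.53}$, or roughly 70 percent. The variable names here are chosen for illustration only.

```python
import math

# Rounded estimates reported in the text
beta0_hat = 4.63
beta1_hat = 0.53

# A one-unit increase in avexpr is associated with a 0.53 increase in
# log GDP per capita, which scales GDP per capita by exp(0.53)
gdp_ratio = math.exp(beta1_hat)
pct_change = 100 * (gdp_ratio - 1)

print(f"GDP per capita ratio: {gdp_ratio:.3f}")
print(f"Approximate percent difference: {pct_change:.1f}%")
```

This is a statement about the fitted association, not necessarily about a causal effect.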
+
+(SecBasicEmpLinRegANOVA)=
+### Analysis of variance (ANOVA) output
+
+The `.summary()` method of the results object provides a lot of regression output. The `RegressionResults` object has much more, as documented in the help page [statsmodels.regression.linear_model.RegressionResults](http://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html).
+
+* The `Df Residuals: 109` displays the degrees of freedom from the residual variance calculation. This equals the number of observations minus the number of regression coefficients, `N-p=111-2`. This is accessed with `results.df_resid`.
+* The `Df Model: 1` displays the degrees of freedom from the model variance calculation or from the regressors. This equals the number of regression coefficients minus one, `p-1=2-1`. This is accessed with `results.df_model`.
+* You can request heteroskedasticity-robust standard errors in your regression. The robust option is specified in the `.fit()` command. `statsmodels` offers four variants, selected with `.fit(cov_type='HC0')`, `.fit(cov_type='HC1')`, `.fit(cov_type='HC2')`, or `.fit(cov_type='HC3')`.
+* You can do clustered standard errors if you have groups labeled in a variable called `mygroups` by using the `.fit(cov_type='cluster', cov_kwds={'groups': mygroups})`.
+* R-squared is a measure of fit of the overall model. It is $R^2=1 - SSR/SST$ where $SST$ is the total sum of squares (the sum of squared deviations of the dependent variable from its mean), and $SSR$ is the sum of squared residuals. Another expression, valid when the regression includes a constant, is $R^2= SSM/SST$, where $SSM$ is the model sum of squares (the sum of squared deviations of the predicted values from the mean of the dependent variable). This is accessed with `results.rsquared`.
+* Adjusted R-squared is a measure of fit of the overall model that penalizes extra regressors. It is useful because the R-squared in the previous bullet never decreases as you add more explanatory variables. This is accessed with `results.rsquared_adj`.
+
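The degrees of freedom and fit measures in the list above can be reproduced by hand. The following is a minimal sketch on synthetic data (hypothetical, with $N=111$ and $p=2$ chosen only to mirror the counts in the text), using the common adjusted R-squared formula $1 - \frac{SSR/(N-p)}{SST/(N-1)}$ for a model with a constant.

```python
import numpy as np

# Synthetic data (hypothetical, NOT the Acemoglu et al. dataset)
rng = np.random.default_rng(1)
N = 111
x = rng.uniform(3.0, 10.0, size=N)
y = 4.5 + 0.5 * x + rng.normal(0.0, 0.7, size=N)
X = np.column_stack([np.ones(N), x])
p = X.shape[1]  # number of coefficients, including the constant

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

df_resid = N - p   # residual degrees of freedom
df_model = p - 1   # model degrees of freedom

SSR = np.sum((y - y_hat) ** 2)         # sum of squared residuals
SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSM = np.sum((y_hat - y.mean()) ** 2)  # model sum of squares

rsquared = 1 - SSR / SST
rsquared_alt = SSM / SST  # same value when a constant is included
rsquared_adj = 1 - (SSR / df_resid) / (SST / (N - 1))

print(df_resid, df_model)
print(rsquared, rsquared_alt, rsquared_adj)
```

The two R-squared expressions agree here because the model includes a constant, and the adjusted value is slightly smaller than the unadjusted one.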
+
+(SecBasicEmpLinRegFtest)=
+### F-test and log likelihood test
+
+* The F-statistic is the statistic from an F-test of the joint hypothesis that all the slope coefficients (every coefficient except the constant) are equal to zero. Under this null hypothesis, the F-statistic is distributed according to the F-distribution $F(d1,d2)$, where $d1=p-1$ and $d2=N-p$.
+* The Prob (F-statistic) is the p-value of that test: the probability of observing an F-statistic at least as large as the one estimated if the null hypothesis were true. In this case, it is really small, so we reject the null hypothesis.
+* Log-likelihood is the sum of the log pdf values of the residuals, assuming they are normally distributed with mean 0 and the standard deviation implied by the OLS estimates.
+
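A sketch of how these two quantities could be computed by hand, again on hypothetical synthetic data. It uses $F = \frac{R^2/(p-1)}{(1-R^2)/(N-p)}$ and a Gaussian log-likelihood evaluated at the variance estimate $SSR/N$, which, as far as I know, is the convention `statsmodels` follows for its reported value. With a single regressor, the F-statistic equals the squared $t$ statistic on the slope.

```python
import numpy as np

# Synthetic data (hypothetical, NOT the Acemoglu et al. dataset)
rng = np.random.default_rng(2)
N = 111
x = rng.uniform(3.0, 10.0, size=N)
y = 4.5 + 0.5 * x + rng.normal(0.0, 0.7, size=N)
X = np.column_stack([np.ones(N), x])
p = X.shape[1]

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
SSR = resid @ resid
SST = np.sum((y - y.mean()) ** 2)
R2 = 1 - SSR / SST

# F-statistic for the null that all slope coefficients are zero
F = (R2 / (p - 1)) / ((1 - R2) / (N - p))

# t statistic on the slope, from the usual OLS covariance matrix
sigma2_hat = SSR / (N - p)
cov = sigma2_hat * np.linalg.inv(X.T @ X)
t_slope = beta_hat[1] / np.sqrt(cov[1, 1])

# Gaussian log-likelihood with variance SSR / N (assumed convention)
llf = -0.5 * N * (np.log(2 * np.pi) + np.log(SSR / N) + 1)

print("F-statistic:", F, " squared t on slope:", t_slope ** 2)
print("Log-likelihood:", llf)
```

Turning `F` into the Prob (F-statistic) requires the survival function of the $F(p-1, N-p)$ distribution, which is left out here to keep the sketch NumPy-only.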
+
+(SecBasicEmpLinRegInfer)=
+### Inference on individual parameters
+
+* The estimated coefficients of the linear regression are reported in the `results.params` vector object (pandas Series).
+* The standard error on each estimated coefficient is reported in the summary results column entitled `std err`. These standard errors are reported in the `results.bse` vector object (pandas Series).
+* The "t" column is the $t$ test statistic. It is the estimated coefficient divided by its standard error. Under the null hypothesis that the true coefficient is 0, this statistic follows a Student's t distribution with $N-p$ degrees of freedom.
+* The reported p-value comes from a two-sided t-test. It is the probability of observing a $t$ statistic at least as large in absolute value as the one estimated if the true value of the coefficient were 0. We usually reject the null hypothesis if the p-value is lower than 0.05.
+* The summary results report the 95% two-sided confidence interval for each coefficient.
+
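The quantities in this list can be approximated by hand as in the sketch below, which uses hypothetical synthetic data. One loud assumption: to stay within NumPy and the standard library, it uses the standard normal distribution in place of the Student's t distribution for the critical value and p-values. With about 109 residual degrees of freedom the two are close, but the numbers will not match the `statsmodels` summary exactly.

```python
import numpy as np
from statistics import NormalDist

# Synthetic data (hypothetical, NOT the Acemoglu et al. dataset)
rng = np.random.default_rng(3)
N = 111
x = rng.uniform(3.0, 10.0, size=N)
y = 4.5 + 0.5 * x + rng.normal(0.0, 0.7, size=N)
X = np.column_stack([np.ones(N), x])
p = X.shape[1]

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}
sigma2_hat = (resid @ resid) / (N - p)
se = np.sqrt(np.diag(sigma2_hat * np.linalg.inv(X.T @ X)))

# t statistics: coefficient divided by its standard error
t_stats = beta_hat / se

# Large-sample (normal) approximation to the two-sided p-values and 95% CI
std_normal = NormalDist()
p_approx = np.array([2 * (1 - std_normal.cdf(abs(t))) for t in t_stats])
z_crit = std_normal.inv_cdf(0.975)
ci_lower = beta_hat - z_crit * se
ci_upper = beta_hat + z_crit * se

print("coef:", beta_hat)
print("std err:", se)
print("t:", t_stats)
print("approx p-values:", p_approx)
print("approx 95% CI:", np.column_stack([ci_lower, ci_upper]))
```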
+
+(SecBasicEmpLinRegPredVals)=
+### Predicted values
+
+We can obtain an array of predicted $logpgp95_i$ for every value of $avexpr_i$ in our dataset by calling `.predict()` on our results. Let's first get the predicted value for the average country in the dataset.
+
+```{code-cell} ipython
+:tags: []
+
+mean_expr = np.mean(df1['avexpr'])
+mean_expr
+```
+
+```{code-cell} ipython
+:tags: []
+
+print(results.params)
+```
+
+```{code-cell} ipython
+:tags: []
+
+# Plug the rounded coefficient estimates into the fitted equation
+predicted_logpgp95 = 4.63 + 0.53 * mean_expr
+print(predicted_logpgp95)
+```
+
+An easier (and more accurate) way to obtain this result is to use `.predict()` and set $constant=1$ and $avexpr_i=$ `mean_expr`.
+
+```{code-cell} ipython
+:tags: []
+
+results.predict(exog=[1, mean_expr])
+```
+
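A useful sanity check on a prediction at the mean: when the regression includes a constant, the fitted OLS line passes through the point of sample means, so the predicted value at the mean of the regressor equals the mean of the dependent variable. The sketch below verifies this on hypothetical synthetic data with NumPy rather than with the `results` object above.

```python
import numpy as np

# Synthetic data (hypothetical, NOT the Acemoglu et al. dataset)
rng = np.random.default_rng(4)
x = rng.uniform(3.0, 10.0, size=111)
y = 4.5 + 0.5 * x + rng.normal(0.0, 0.7, size=111)
X = np.column_stack([np.ones_like(x), x])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction at the mean of x, using the full-precision estimates
x_mean = x.mean()
pred_at_mean = np.array([1.0, x_mean]) @ beta_hat

print("Prediction at mean of x:", pred_at_mean)
print("Mean of y:", y.mean())
```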
+Plotting the predicted values against $avexpr_i$ shows that the predicted values lie along the fitted regression line, as shown in {numref}`Figure %s <FigBasicEmpir_AcemPredVals>` below. The observed values of $logpgp95_i$ are also plotted for comparison purposes.
+
+```{code-cell} ipython
+:tags: ["remove-output"]
+
+# Drop missing observations from whole sample
+df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])
+
+# Plot predicted values. alpha is a blending value between 0 (transparent) and 1 (opaque)
+plt.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5, label='predicted')
+
+# Plot observed values
+plt.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5, label='observed')
+
+plt.legend()
+plt.title('OLS predicted values')
+plt.xlabel('Average Expropriation Protection 1985-95')
+plt.ylabel('Log GDP per capita, PPP, 1995')
 plt.show()
 ```
 
+```{figure} ../../../images/basic_empirics/AcemogluEtAl_predvals.png
+:height: 500px
+:name: FigBasicEmpir_AcemPredVals
+
+OLS predicted values for Acemoglu, et al, 2001 data
+```
+
+
+(SecBasicEmpLinRegExt)=
+## Basic extensions of linear regression
 
-<!-- {numref}`ExerBasicEmpir_MultLinRegress` -->
+* Instrumental variables (omitted variable bias)
+* Logistic regression
+* Multiple equation models
+* Panel data
+* Time series data
+* Vector autoregression
 
 
 (SecBasicEmpirExercises)=