docs/book/basic_empirics/BasicEmpirMethods.md
194 additions & 6 deletions
@@ -100,7 +100,7 @@ These variables and other data used in the paper are available for download on [
100
100
101
101
102
102
(SecBasicEmpDescrBasic)=
103
-
### Basic Data Description
103
+
### Basic data description
104
104
105
105
The following cells download the data used in {cite}`AcemogluEtAl:2001` from the file `maketable1.dta` and display the first five observations.
106
106
@@ -109,8 +109,9 @@ The following cells downloads the data from {cite}`AcemogluEtAl:2001` from the f
The [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method returns the first $n$ rows of a DataFrame with column headings and index numbers. The default is `n=5`.
@@ -121,26 +122,213 @@ The [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/refer
121
122
df1.head()
122
123
```
123
124
124
-
How many observations are in this dataset? What are the different countries in this dataset?
125
+
How many observations are in this dataset? What are the different countries in this dataset? The [`pandas.DataFrame.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute returns a tuple in which the first element is the number of observations (rows) in the DataFrame and the second element is the number of variables (columns).
125
126
126
127
```{code-cell} ipython3
127
128
:tags: []
128
129
129
-
print("The number of observations (rows) in the dataset is:", df1.size)
130
+
131
+
df1.shape
132
+
```
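As a minimal sketch of why `.shape` is the right tool for counting observations, the block below compares it with `.size` on a small DataFrame. The DataFrame `toy` and its values are hypothetical and are not taken from the Acemoglu et al. data.

```python
import pandas as pd

# Hypothetical toy data (not the Acemoglu et al. dataset)
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "CCC"],
    "avexpr": [6.5, 7.0, 5.5],
})

# .shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = toy.shape

# .size is the total number of cells, i.e. rows * columns
n_cells = toy.size

print(n_rows, n_cols, n_cells)  # 3 2 6
```

Because `.size` multiplies rows by columns, it only equals the number of observations when the DataFrame has a single column.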
133
+
134
+
```{code-cell} ipython3
135
+
:tags: []
136
+
137
+
print("The number of observations (rows) and variables (columns)")
138
+
print("in the dataset is " + str(df1.shape[0]) + " observations (rows) and")
print("A list of all the", len(df1["shortnam"].unique()),
132
142
'unique countries in the "shortnam" variable is:')
143
+
print("")
133
144
print(df1["shortnam"].unique())
134
145
```
135
146
136
-
Pandas DataFrames have a built-in method `.describe()` that will give the basic descriptive statistics for the numerical variables of a dataset.
147
+
Pandas DataFrames have a built-in method [`.describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) that will give the basic descriptive statistics for the numerical variables of a dataset.
137
148
138
149
```{code-cell} ipython3
139
150
:tags: []
140
151
141
152
df1.describe()
142
153
```
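To illustrate what `.describe()` reports, here is a small sketch on a hypothetical DataFrame `toy` with made-up values. By default only the numerical columns are summarized, and the `count` row excludes missing values.

```python
import pandas as pd

# Hypothetical toy data with one missing value in avexpr
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "CCC", "DDD"],
    "avexpr": [6.5, 7.0, 5.5, None],
    "logpgp95": [8.0, 9.0, 7.0, 8.5],
})

# The string column "shortnam" is skipped by default
summary = toy.describe()
print(summary)

# count is the number of non-missing observations in each column
print(summary.loc["count"])
```

Comparing the `count` row across columns is a quick way to spot variables with missing observations before estimating anything.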
143
154
155
+
The variable `logpgp95` represents log GDP per capita for each country. The variable `avexpr` represents the protection against expropriation index, so a higher value means more protection, which is a good thing. What do we expect to see if we make a scatterplot of these two variables with `avexpr` on the `x`-axis and `logpgp95` on the `y`-axis? Draw it on a piece of paper or on a whiteboard.
156
+
157
+
Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against expropriation index.
Scatterplot of average expropriation protection $avexpr$ and log GDP per capita $logpgp95$ for each country
180
+
```
181
+
182
+
The plot shows a fairly strong positive relationship between protection against expropriation and log GDP per capita. Specifically, if higher protection against expropriation is a measure of institutional quality, then better institutions appear to be positively correlated with better economic outcomes (higher GDP per capita).
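One way to put a number on the strength of a linear relationship like this is the Pearson correlation coefficient. The sketch below uses synthetic data only: the intercept, slope, and noise level are arbitrary assumptions chosen for illustration and are not estimates from the paper.

```python
import numpy as np
import pandas as pd

# Synthetic data: all parameter values here are assumptions
rng = np.random.default_rng(0)
avexpr = rng.uniform(3, 10, size=50)
logpgp95 = 4.5 + 0.5 * avexpr + rng.normal(0, 0.7, size=50)
toy = pd.DataFrame({"avexpr": avexpr, "logpgp95": logpgp95})

# Pearson correlation between the two columns (lies in [-1, 1])
corr = toy["avexpr"].corr(toy["logpgp95"])
print(round(corr, 2))
```

With the actual data, the analogous call would be `df1["avexpr"].corr(df1["logpgp95"])`, which skips pairs that contain a missing value.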
183
+
184
+
185
+
(SecBasicEmpDescrCross)=
186
+
### Cross-tabulated data description
187
+
188
+
Cross tabulation is a set of descriptive statistics computed by groupings of the data. Both R and Python support this; in pandas, it is done with the powerful [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method. What if we thought that the relationship between protection against expropriation `avexpr` and log GDP per capita `logpgp95` differed for countries whose abbreviation starts with A-M versus countries whose abbreviation starts with N-Z?
189
+
190
+
```{code-cell} ipython3
191
+
:tags: []
192
+
193
+
# Create AtoM variable that = 1 if the first letter of the abbreviation is in
Another way to do this, which is more readable than the output above, is to describe the data in two separate commands, each restricted to one of the two groups.
207
+
208
+
```{code-cell} ipython3
209
+
:tags: []
210
+
211
+
df1[df1["AtoM"]==1].describe()
212
+
```
213
+
214
+
```{code-cell} ipython3
215
+
:tags: []
216
+
217
+
df1[df1["AtoM"]==0].describe()
218
+
```
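The grouping logic can be sketched end to end on a small example. Everything in the block below is hypothetical: the country codes and values in `toy` are made up, and building the indicator with `.isin` is just one possible approach, not necessarily the one used in the cell above.

```python
import string

import pandas as pd

# Hypothetical toy data with made-up country codes and values
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "NNN", "ZZZ"],
    "avexpr": [5.0, 8.0, 5.5, 7.0],
    "logpgp95": [7.5, 8.5, 7.0, 9.0],
})

# AtoM = 1 if the first letter of the abbreviation is in A-M, else 0
a_to_m = list(string.ascii_uppercase[:13])  # ["A", ..., "M"]
toy["AtoM"] = toy["shortnam"].str[0].isin(a_to_m).astype(int)

# One groupby call gives the statistic for both groups at once
group_means = toy.groupby("AtoM")[["avexpr", "logpgp95"]].mean()
print(group_means)
```

Replacing `.mean()` with `.describe()` returns the full set of descriptive statistics per group, at the cost of a wider and less readable table.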
219
+
220
+
Let's make two scatterplots to see with our eyes if there seems to be a difference in the relationship.
221
+
222
+
```{code-cell} ipython3
223
+
:tags: ["remove-output"]
224
+
225
+
# Plot the scatterplot of the relationship for the countries for which the first
Scatterplot of average expropriation protection $avexpr$ and log GDP per capita $logpgp95$ for each country, first letter in N-Z
270
+
```
271
+
272
+
273
+
(SecBasicEmpLinReg)=
274
+
## Basic Understanding of Linear Regression
275
+
276
+
277
+
(SecBasicEmpLinRegExamp)=
278
+
### Example: Acemoglu et al. (2001)
279
+
280
+
Given the plots in {numref}`Figure %s <FigBasicEmpir_scatter1>`, {numref}`Figure %s <FigBasicEmpir_scatter2>`, and {numref}`Figure %s <FigBasicEmpir_scatter3>` above, choosing a linear model to describe this relationship seems like a reasonable assumption.
281
+
282
+
We can write a model as:
283
+
284
+
```{math}
285
+
:label: EqBasicEmp_AcemogluReg
286
+
logpgp95_i = \beta_0 + \beta_1 avexpr_i + u_i
287
+
```
288
+
289
+
where:
290
+
* $\beta_0$ is the intercept of the linear trend line on the $y$-axis
291
+
* $\beta_1$ is the slope of the linear trend line, representing the marginal effect of protection against expropriation on log GDP per capita
292
+
* $u_i$ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
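
To make the roles of $\beta_0$, $\beta_1$, and $u_i$ concrete, the following sketch simulates data from a model of this form and recovers the coefficients by least squares with `numpy.polyfit`. The "true" values of 4.6 and 0.5, the noise level, and the DataFrame `toy` are assumptions chosen for illustration and are not estimates from the paper.

```python
import numpy as np
import pandas as pd

# Simulate y_i = beta_0 + beta_1 * x_i + u_i with assumed parameters
rng = np.random.default_rng(0)
x = rng.uniform(3, 10, size=100)
u = rng.normal(0, 0.5, size=100)          # random error term
y = 4.6 + 0.5 * x + u
toy = pd.DataFrame({"avexpr": x, "logpgp95": y})
toy.loc[0, "avexpr"] = np.nan             # introduce a missing value

# polyfit cannot handle NaN, so drop incomplete rows first
clean = toy.dropna(subset=["avexpr", "logpgp95"])

# deg=1 fits a line; coefficients come back highest power first
beta1_hat, beta0_hat = np.polyfit(
    clean["avexpr"].to_numpy(), clean["logpgp95"].to_numpy(), deg=1
)
print("intercept:", beta0_hat, "slope:", beta1_hat)
```

The estimates differ from the assumed values only because of the random error term, and the gap shrinks as the sample grows.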
293
+
294
+
Visually, this linear model involves choosing a straight line that best fits the data according to some criterion, as in the following plot (Figure 2 in {cite}`AcemogluEtAl:2001`).
295
+
296
+
```{code-cell} ipython3
297
+
:tags: []
298
+
299
+
import numpy as np
300
+
301
+
# Dropping NA's is required to use numpy's polyfit