Commit 171a53f

Update BasicEmpirMethods.md
1 parent 82e8f74 commit 171a53f

4 files changed

Lines changed: 194 additions & 6 deletions

docs/book/basic_empirics/BasicEmpirMethods.md

@@ -100,7 +100,7 @@ These variables and other data used in the paper are available for download on [
100100

101101

102102
(SecBasicEmpDescrBasic)=
103-
### Basic Data Description
103+
### Basic data description
104104

105105
The following cells download the data from {cite}`AcemogluEtAl:2001` in the file `maketable1.dta` and display the first five observations.
106106

@@ -109,8 +109,9 @@ The following cells downloads the data from {cite}`AcemogluEtAl:2001` from the f
109109
110110
import pandas as pd
111111
112-
df1 = pd.read_stata('https://github.com/QuantEcon/QuantEcon.lectures.code/' +
113-
'raw/master/ols/maketable1.dta')
112+
path_df1 = ('https://github.com/OpenSourceEcon/CompMethods/' +
113+
'raw/main/data/basic_empirics/maketable1.dta')
114+
df1 = pd.read_stata(path_df1)
114115
```
115116

116117
The [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method returns the first $n$ rows of a DataFrame with column headings and index numbers. The default is `n=5`.
@@ -121,26 +122,213 @@ The [`pandas.DataFrame.head`](https://pandas.pydata.org/pandas-docs/stable/refer
121122
df1.head()
122123
```
123124

124-
How many observations are in this dataset? What are the different countries in this dataset?
125+
How many observations are in this dataset? What are the different countries in this dataset? The [`pandas.DataFrame.shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) attribute returns a tuple in which the first element is the number of observations (rows) in the DataFrame and the second element is the number of variables (columns).
125126

126127
```{code-cell} ipython3
127128
:tags: []
128129
129-
print("The number of observations (rows) in the dataset is:", df1.size)
130+
131+
df1.shape
132+
```
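As a small illustration of how `.shape` differs from `.size` and `len()`, the sketch below uses a made-up three-row DataFrame (the name `toy` and its values are hypothetical, not part of the Acemoglu et al. data). The `.size` attribute counts every cell, so it should not be used to count observations.

```python
import pandas as pd

# Hypothetical toy data, not the Acemoglu et al. dataset
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "CCC"],
    "x": [1.0, 2.0, 3.0],
})

n_rows, n_cols = toy.shape  # tuple of (rows, columns)
n_cells = toy.size          # rows * columns, i.e., the total number of cells
n_obs = len(toy)            # number of rows only

print("rows:", n_rows, "columns:", n_cols, "cells:", n_cells, "len:", n_obs)
```
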
133+
134+
```{code-cell} ipython3
135+
:tags: []
136+
137+
print("The number of observations (rows) and variables (columns)")
138+
print("in the dataset is " + str(df1.shape[0]) + " observations (rows) and")
139+
print(str(df1.shape[1]) + " variables (columns).")
130140
print("")
131141
print("A list of all the", len(df1["shortnam"].unique()),
132142
'unique countries in the "shortnam" variable is:')
143+
print("")
133144
print(df1["shortnam"].unique())
134145
```
135146

136-
Pandas DataFrames have a built-in method `.describe()` that will give the basic descriptive statistics for the numerical variables of a dataset.
147+
Pandas DataFrames have a built-in method [`.describe()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) that will give the basic descriptive statistics for the numerical variables of a dataset.
137148

138149
```{code-cell} ipython3
139150
:tags: []
140151
141152
df1.describe()
142153
```
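By default, `.describe()` on a DataFrame with mixed column types summarizes only the numeric columns. The sketch below, on a made-up DataFrame (the name `toy` and its values are hypothetical), shows the default behavior next to the `include="all"` option, which also reports counts and unique values for string columns such as a country abbreviation.

```python
import pandas as pd

# Hypothetical toy data with one string column and one numeric column
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "CCC", "DDD"],
    "x": [1.0, 2.0, 3.0, None],  # one missing value
})

# Default: numeric columns only; "count" excludes missing values
num_summary = toy.describe()

# include="all": adds rows such as "unique", "top", and "freq" for strings
all_summary = toy.describe(include="all")

print(num_summary)
print(all_summary)
```

Note that the `count` row reports non-missing observations, so it can differ across variables when some values are missing.
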
143154

155+
The variable `logpgp95` represents log GDP per capita (PPP, 1995) for each country. The variable `avexpr` represents the average protection against expropriation index for 1985-95, in which a higher value means more protection. What do we expect to see if we do a scatterplot of these two variables with `avexpr` on the `x`-axis and `logpgp95` on the `y`-axis? Draw it on a piece of paper or on a white board.
156+
157+
Let’s use a scatterplot to see whether any obvious relationship exists between GDP per capita and the protection against expropriation index.
158+
159+
```{code-cell} ipython3
160+
:tags: ["remove-output"]
161+
162+
import matplotlib.pyplot as plt
163+
164+
plt.scatter(x=df1["avexpr"], y=df1["logpgp95"], s=10)
165+
plt.xlim((3.2, 10.5))
166+
plt.ylim((5.9, 10.5))
167+
plt.title("Scatterplot of average expropriation protection and log GDP per " +
168+
"capita for each country")
169+
plt.xlabel(r'Average Expropriation Protection 1985-95')
170+
plt.ylabel(r'Log GDP per capita, PPP, 1995')
171+
plt.grid(color='gray', linestyle=':', linewidth=1, alpha=0.5)
172+
plt.show()
173+
```
174+
175+
```{figure} ../../../images/basic_empirics/scatter1.png
176+
:height: 500px
177+
:name: FigBasicEmpir_scatter1
178+
179+
Scatterplot of average expropriation protection $avexpr$ and log GDP per capita $logpgp95$ for each country
180+
```
181+
182+
The plot shows a fairly strong positive relationship between protection against expropriation and log GDP per capita. Specifically, if higher protection against expropriation is a measure of institutional quality, then better institutions appear to be positively correlated with better economic outcomes (higher GDP per capita).
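One way to put a number on a visual impression like this is the Pearson correlation coefficient, which pandas computes with `Series.corr`. The sketch below uses made-up values (the column names `avexpr_toy` and `loggdp_toy` and their values are hypothetical, chosen only to be positively related), so the printed number says nothing about the actual dataset.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from maketable1.dta
toy = pd.DataFrame({
    "avexpr_toy": [4.0, 5.5, 6.0, 7.5, 9.0],
    "loggdp_toy": [6.5, 7.0, 7.8, 8.4, 9.6],
})

# Pearson correlation; Series.corr excludes missing values
corr = toy["avexpr_toy"].corr(toy["loggdp_toy"])
print("Correlation:", round(corr, 3))
```

Applied to the real data, the analogous call would be `df1["avexpr"].corr(df1["logpgp95"])`.
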
183+
184+
185+
(SecBasicEmpDescrCross)=
186+
### Cross tabulated data description
187+
188+
Cross tabulation is a set of descriptive statistics computed by groupings of the data. In Python, this is done with the powerful pandas [`.groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) method. What if we thought that the relationship between protection against expropriation `avexpr` and log GDP per capita `logpgp95` were different for countries whose abbreviation started with A-M versus countries whose abbreviation started with N-Z?
189+
190+
```{code-cell} ipython3
191+
:tags: []
192+
193+
# Create AtoM variable that = 1 if the first letter of the abbreviation is in
194+
# A to M and = 0 if it is in N to Z
195+
df1["AtoM"] = 0
196+
df1.loc[
197+
    df1["shortnam"].str[0].isin([
198+
        'A','B','C','D','E','F','G','H','I','J','K','L','M'
199+
    ]), "AtoM"
200+
] = 1
201+
202+
# Describe the data
203+
df1.groupby("AtoM").describe()
204+
```
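The output of `.groupby(...).describe()` on every column can be wide. A minimal sketch of a narrower summary, assuming a made-up DataFrame (the name `toy`, its labels, and its values are hypothetical), selects one column after grouping and requests only a few statistics with `.agg`. It also builds the indicator in one line with `Series.between`, which is an alternative to listing the letters explicitly.

```python
import pandas as pd

# Hypothetical toy data with made-up three-letter labels
toy = pd.DataFrame({
    "shortnam": ["AAA", "BBB", "MMM", "NNN", "UUU", "ZZZ"],
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Indicator = 1 if the first letter is in A to M (inclusive), 0 otherwise
toy["AtoM"] = toy["shortnam"].str[0].between("A", "M").astype(int)

# Group by the indicator and summarize only the column of interest
summary = toy.groupby("AtoM")["x"].agg(["count", "mean"])
print(summary)
```
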
205+
206+
Another way to do this, which is more readable than the output above, is to describe the data in two separate commands, each restricted to one of the two groups.
207+
208+
```{code-cell} ipython3
209+
:tags: []
210+
211+
df1[df1["AtoM"]==1].describe()
212+
```
213+
214+
```{code-cell} ipython3
215+
:tags: []
216+
217+
df1[df1["AtoM"]==0].describe()
218+
```
219+
220+
Let's make two scatterplots to see visually whether the relationship differs between the two groups.
221+
222+
```{code-cell} ipython3
223+
:tags: ["remove-output"]
224+
225+
# Plot the scatterplot of the relationship for the countries for which the first
226+
# letter of the abbreviation is between A to M
227+
plt.scatter(
228+
x=df1[df1["AtoM"]==1]["avexpr"], y=df1[df1["AtoM"]==1]["logpgp95"], s=10
229+
)
230+
plt.xlim((3.2, 10.5))
231+
plt.ylim((5.9, 10.5))
232+
plt.title("Scatterplot of average expropriation protection and log GDP per " +
233+
"capita \n for each country, first letter in A-M")
234+
plt.xlabel(r'Average Expropriation Protection 1985-95')
235+
plt.ylabel(r'Log GDP per capita, PPP, 1995')
236+
plt.grid(color='gray', linestyle=':', linewidth=1, alpha=0.5)
237+
plt.show()
238+
```
239+
240+
```{figure} ../../../images/basic_empirics/scatter2.png
241+
:height: 500px
242+
:name: FigBasicEmpir_scatter2
243+
244+
Scatterplot of average expropriation protection $avexpr$ and log GDP per capita $logpgp95$ for each country, first letter in A-M
245+
```
246+
247+
```{code-cell} ipython3
248+
:tags: ["remove-output"]
249+
250+
# Plot the scatterplot of the relationship for the countries for which the first
251+
# letter of the abbreviation is between N to Z
252+
plt.scatter(
253+
x=df1[df1["AtoM"]==0]["avexpr"], y=df1[df1["AtoM"]==0]["logpgp95"], s=10
254+
)
255+
plt.xlim((3.2, 10.5))
256+
plt.ylim((5.9, 10.5))
257+
plt.title("Scatterplot of average expropriation protection and log GDP per " +
258+
"capita \n for each country, first letter in N-Z")
259+
plt.xlabel(r'Average Expropriation Protection 1985-95')
260+
plt.ylabel(r'Log GDP per capita, PPP, 1995')
261+
plt.grid(color='gray', linestyle=':', linewidth=1, alpha=0.5)
262+
plt.show()
263+
```
264+
265+
```{figure} ../../../images/basic_empirics/scatter3.png
266+
:height: 500px
267+
:name: FigBasicEmpir_scatter3
268+
269+
Scatterplot of average expropriation protection $avexpr$ and log GDP per capita $logpgp95$ for each country, first letter in N-Z
270+
```
271+
272+
273+
(SecBasicEmpLinReg)=
274+
## Basic understanding of linear regression
275+
276+
277+
(SecBasicEmpLinRegExamp)=
278+
### Example: Acemoglu et al. (2001)
279+
280+
Given the plots in {numref}`Figure %s <FigBasicEmpir_scatter1>`, {numref}`Figure %s <FigBasicEmpir_scatter2>`, and {numref}`Figure %s <FigBasicEmpir_scatter3>` above, choosing a linear model to describe this relationship seems like a reasonable assumption.
281+
282+
We can write a model as:
283+
284+
```{math}
285+
:label: EqBasicEmp_AcemogluReg
286+
logpgp95_i = \beta_0 + \beta_1 avexpr_i + u_i
287+
```
288+
289+
where:
290+
* $\beta_0$ is the intercept of the linear trend line on the $y$-axis
291+
* $\beta_1$ is the slope of the linear trend line, representing the marginal effect of protection against expropriation risk on log GDP per capita
292+
* $u_i$ is a random error term (deviations of observations from the linear trend due to factors not included in the model)
293+
294+
Visually, this linear model involves choosing a straight line that best fits the data according to some criterion, as in the following plot (Figure 2 in {cite}`AcemogluEtAl:2001`).
295+
296+
```{code-cell} ipython3
297+
:tags: []
298+
299+
import numpy as np
300+
301+
# Dropping NA's is required to use numpy's polyfit
302+
df1_subset = df1.dropna(subset=['logpgp95', 'avexpr'])
303+
# df1_subset.describe()
304+
305+
# Use only 'base sample' for plotting purposes (smaller sample)
306+
df1_subset = df1_subset[df1_subset['baseco'] == 1]
307+
# df1_subset.describe()
308+
309+
X = df1_subset['avexpr']
310+
y = df1_subset['logpgp95']
311+
labels = df1_subset['shortnam']
312+
313+
# Replace markers with country labels
314+
plt.scatter(X, y, marker='')
315+
316+
for i, label in enumerate(labels):
317+
plt.annotate(label, (X.iloc[i], y.iloc[i]))
318+
319+
# Fit a linear trend line
320+
plt.plot(np.unique(X),
321+
np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),
322+
color='black')
323+
324+
plt.xlabel('Average Expropriation Protection 1985-95')
325+
plt.ylabel('Log GDP per capita, PPP, 1995')
326+
plt.xlim((3.2, 10.5))
327+
plt.ylim((5.9, 10.5))
328+
plt.title('Figure 2: OLS relationship between expropriation risk and income')
329+
plt.show()
330+
```
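The trend line in the cell above comes from `np.polyfit` with degree 1, which chooses the line that minimizes the sum of squared residuals. As a sketch of what that criterion implies, the code below computes the slope and intercept from the standard closed-form least squares expressions and checks them against `np.polyfit`. The arrays `x` and `y` are made-up values, not the Acemoglu et al. data, and the names `beta0` and `beta1` are chosen here only to mirror the notation of the model.

```python
import numpy as np

# Hypothetical toy values, NOT taken from maketable1.dta
x = np.array([4.0, 5.5, 6.0, 7.5, 9.0])
y = np.array([6.5, 7.0, 7.8, 8.4, 9.6])

# Closed-form least squares estimates for y = beta0 + beta1 * x + u
x_dev = x - x.mean()
y_dev = y - y.mean()
beta1 = np.sum(x_dev * y_dev) / np.sum(x_dev ** 2)  # slope
beta0 = y.mean() - beta1 * x.mean()                 # intercept

# np.polyfit returns coefficients from highest degree to lowest
slope, intercept = np.polyfit(x, y, 1)

print("closed form:", beta0, beta1)
print("np.polyfit: ", intercept, slope)
```

The two approaches should agree up to floating-point error, since both solve the same minimization problem.
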
331+
144332

145333
<!-- {numref}`ExerBasicEmpir_MultLinRegress` -->
146334

images/basic_empirics/scatter1.png

152 KB

images/basic_empirics/scatter2.png

146 KB

images/basic_empirics/scatter3.png

139 KB