
Commit 934c563

Merge pull request #15 from rickecon/ml
Merging
2 parents 70cf559 + ff7a33b commit 934c563

8 files changed

Lines changed: 1143 additions & 4 deletions


README.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@

 | | |
 | --- | --- |
-| Org | [![OSE Lab cataloged](https://img.shields.io/badge/OSE%20Lab-catalogued-critical)](https://github.com/OpenSourceEcon) [![OS License: AGPL-3.0](https://img.shields.io/badge/OS%20License-AGPL%203.0-yellow)](https://github.com/OpenSourceEcon/CompMethods/blob/main/LICENSE) |
+| Org | [![OSE Lab cataloged](https://img.shields.io/badge/OSE%20Lab-catalogued-critical)](https://github.com/OpenSourceEcon) [![OS License: AGPL-3.0](https://img.shields.io/badge/OS%20License-AGPL%203.0-yellow)](https://github.com/OpenSourceEcon/CompMethods/blob/main/LICENSE) [![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://opensourceecon.github.io/CompMethods/) |
 | Package | [![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-31013/) [![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3115/) |
 | Testing | ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/build_and_test.yml/badge.svg?branch=main) ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/deploy_docs.yml/badge.svg?branch=main) ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/check_format.yml/badge.svg?branch=main) [![Codecov](https://codecov.io/gh/OpenSourceEcon/CompMethods/branch/main/graph/badge.svg)](https://codecov.io/gh/OpenSourceEcon/compmethods) |

data/basic_empirics/logit/titanic-train.csv

Lines changed: 892 additions & 0 deletions
Large diffs are not rendered by default.

docs/book/_toc.yml

Lines changed: 1 addition & 0 deletions
@@ -25,6 +25,7 @@ parts:
     numbered: True
     chapters:
       - file: basic_empirics/BasicEmpirMethods
+      - file: basic_empirics/LogisticReg
   - caption: Basic Machine Learning
     numbered: True
     chapters:

docs/book/basic_empirics/BasicEmpirMethods.md

Lines changed: 1 addition & 1 deletion
@@ -567,7 +567,7 @@ from pandas.plotting import scatter_matrix
 scatter_matrix(df_numer, alpha=0.3, figsize=(6, 6), diagonal='kde')
 ```
 4. Compute the correlation matrix for the numerical variables ($8\times 8$) using the [`pandas.DataFrame.corr()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html) method.
-5. What is wrong with estimating the following linear regression model? How would you fix this problem? (Hint: There is an issue with one of the variables)
+5. What is wrong with estimating the following linear regression model? How would you fix this problem? (Hint: There is an issue with one of the variables.)
 \begin{equation*}
 \begin{split}
 mpg_i &= \beta_0 + \beta_1 cylinders_i + \beta_2 displacement_i + \beta_3 horsepower_i + ... \\
docs/book/basic_empirics/LogisticReg.md

Lines changed: 224 additions & 0 deletions
@@ -0,0 +1,224 @@
---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---

(Chap_LogIntro)=
# Logistic Regression Model

This chapter has an executable [Google Colab notebook](https://colab.research.google.com/drive/1kNMOMvoKzuzNq_rw1yz86B3N98hgTaZ8?usp=sharing) with all the same code, data references, and images. The Google Colab notebook allows you to execute the code in this chapter in the cloud so you don't have to download Python, any of its packages, or any data to your local computer. You can manipulate and execute this notebook on any device with a browser, whether that be your computer, phone, or tablet.

The focus of this chapter is to give the reader a basic introduction to the logistic regression model, where it comes from, and how it can be interpreted.


(Sec_LogQuantQual)=
## Quantitative versus Qualitative Data

The linear regression models of chapter {ref}`Chap_BasicEmpirMethods` have continuous quantitative variables as dependent variables. That is, the $y_i$ variable takes on a continuum of values. We use a different class of models to estimate the relationship of exogenous variables to *qualitative* or *categorical* or *discrete* endogenous or dependent variables.

Examples of qualitative or categorical variables include:

* Binary variables take on two values ($J=2$), most often 0 or 1. Examples: male or female, dead or alive, accept or reject.
* General categorical variables can take on more than two values ($J\geq 2$). Examples: red, blue, or green; teenager, young adult, middle aged, senior.

Note that with general categorical variables, order and numerical distance do not matter. As an example, let $FlowerColor_i=\{red=1, blue=2, green=3\}$ be a function of $neighborhood_i$, $season_i$, and $income_i$.

$$ FlowerColor_i = \beta_0 + \beta_1 neighborhood_i + \beta_2 season_i + \beta_3 income_i + u_i $$

We could mathematically estimate this regression model, but would that make sense? What would be wrong with a linear regression model here?


(Sec_LogQuantQualClassSet)=
### The classification setting

Let $y_i$ be a qualitative dependent variable on $N$ observations with $i$ being the index of the observation. Each observation $y_i$ can take on one of $J$ discrete values $j\in\{1,2,...J\}$. Let $x_{p,i}$ be the $i$th observation of the $p$th explanatory variable (independent variable) such that $X_i=\{x_{1,i}, x_{2,i}, ... x_{P,i}\}$. Then the general formulation of a classifier comes in the following two forms,

```{math}
:label: EqLog_GenClassModel
Pr(y_i=j|X_i,\theta) = f(X_i|\theta) \quad\forall i, j \quad\text{or}\quad \sum_{j=1}^J I_j(y_i=j) = f(X_i|\theta) \quad\forall i, j
```

where $I_j$ in the second formulation is an indicator function that equals 1 when $y_i=j$ and equals 0 otherwise.


(Sec_LogRegClass)=
## Logistic Regression Classifier

In this section, we look at two models for binary (0 or 1) categorical dependent variables. We describe the first model--the linear probability (LP) model--for purely illustrative purposes, because its serious shortcomings make it almost strictly dominated by the second model in this section.

The second model--the logistic regression (logit, binary classifier) model--is the focus of this section. Another variant of this model is the probit model, but the logistic model is the more flexible, more easily interpretable, and more commonly used of the two.


(Sec_LogLPM)=
### The linear probability (LP) model

One marginally acceptable option for modeling a binary (categorical) dependent variable with a regression is the linear probability (LP) model. When the dependent variable has only two categories, it can be modeled as $y_i\in\{0,1\}$ without loss of generality. Let the variable $z_i$ be interpreted as the probability that $y_i=1$ given the data $X_i$ and parameter values $\theta=\{\beta_0,\beta_1,...\beta_P\}$.

```{math}
:label: EqLog_LPM
z_i = Pr(y_i=1|X_i,\theta) = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + ... \beta_P x_{P,i} + u_i
```

The LP model can be a nice, easy, computationally convenient way to estimate the probability of the outcome $y_i=1$. This could equivalently be reinterpreted, without loss of generality, as the probability that $y_i=0$ by simply redefining which outcome is labeled $y_i=1$.

The main drawback of the LP model is that its predicted probabilities $Pr(y_i=1|X_i,\theta)$ can be greater than 1 and can be less than 0. It is for this reason that it is very difficult to publish any research based on an LP model.
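
To see this drawback concretely, here is a minimal sketch, assuming `statanodels` is misspelled nowhere and `statsmodels` is installed, using simulated data (the data-generating process is an editor's illustration, not part of this chapter's example). Fitting the LP model by OLS typically produces fitted "probabilities" below 0 or above 1 at extreme regressor values.

```{code-cell} ipython3
import numpy as np
import statsmodels.api as sm

# Simulated data (hypothetical): binary outcome driven by one regressor
rng = np.random.default_rng(seed=25)
x = rng.normal(size=200)
y = (x + rng.logistic(size=200) > 0).astype(int)

# LP model: OLS regression of the 0/1 outcome on a constant and x
lp_results = sm.OLS(y, sm.add_constant(x)).fit()
z_hat = lp_results.fittedvalues

# The fitted "probabilities" commonly spill outside the [0, 1] interval
print("min fitted prob:", z_hat.min())
print("max fitted prob:", z_hat.max())
```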


(Sec_LogLogit)=
### The logistic (logit) regression classifier

In contrast to the linear probability model, a good classifier transforms numerical values from explanatory variables or feature variables into a probability that is strictly between 0 and 1. More specifically, this function must take any number on the real line between $-\infty$ and $\infty$ and map it to the $[0,1]$ interval. In addition, we want a monotonically increasing relationship between $x$ and the function $f(x)$. What are some functions with this property? Candidates include the following functions.

* $f(x)=\text{max}\Bigl(0, \,\text{min}\bigl(1, x\bigr)\Bigr)$
* $f(x)=\frac{e^x}{1 + e^x}$
* $f(x) = \frac{1}{\pi}\arctan(x) + \frac{1}{2}$
* $f(x) = F(x)$ for any cumulative distribution function $F$

Why don't functions like $\sin(x)$, $\cos(x)$, and $\frac{|x|}{1+|x|}$ fit these criteria?

The second function in the bulleted list above is the logistic function. The logistic regression model is a binary dependent variable classifier that constrains its predicted values to be strictly between 0 and 1. The logistic function is the following,

```{math}
:label: EqLog_Logistic
f(x) = \frac{e^x}{1 + e^x} \quad\forall x
```

and has the following general shape.

```{code-cell} ipython3
:tags: ["hide-input", "remove_output"]

import numpy as np
import matplotlib.pyplot as plt

# Plot the logistic function over x in [-6, 6]; it passes through (0, 0.5)
x_vals = np.linspace(-6, 6, 500)
y_vals = np.exp(x_vals) / (1 + np.exp(x_vals))
plt.plot(x_vals, y_vals, color="blue")
plt.scatter(0, 0.5, color="black", s=15)
plt.title(r"Logistic function for $x\in[-6,6]$")
plt.xlabel(r'$x$ values')
plt.ylabel(r'$f(x)$ values')
plt.grid(color='gray', linestyle=':', linewidth=1, alpha=0.5)
plt.show()
```

```{figure} ../../../images/basic_empirics/logit/logit_gen.png
:height: 500px
:name: FigLogit_logit_gen

Logistic function for $x\in[-6,6]$
```

The logistic regression function is the specific case of the logistic function in which the value of $x$ in the general logistic function {eq}`EqLog_Logistic` is replaced by a linear combination of variables $\beta_0 + \beta_1 x_{1,i} + ...\beta_P x_{P,i}$, similar to a linear regression model.

```{math}
:label: EqLog_Logit_std
Pr(y_i=1|X_i,\theta) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}} = \frac{e^{\beta_0 + \beta_1 x_{1,i} + ...\beta_P x_{P,i}}}{1 + e^{\beta_0 + \beta_1 x_{1,i} + ...\beta_P x_{P,i}}}
```

or equivalently

```{math}
:label: EqLog_Logit_neg
Pr(y_i=1|X_i,\theta) = \frac{1}{1 + e^{-X_i\beta}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_{1,i} + ...\beta_P x_{P,i})}}
```
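
The equivalence follows from multiplying the numerator and denominator of {eq}`EqLog_Logit_std` by $e^{-X_i\beta}$. A quick numerical check of the two forms (a minimal sketch added by the editor):

```{code-cell} ipython3
import numpy as np

# Evaluate both forms of the logistic function on a grid of index values
xb = np.linspace(-5, 5, 11)
form_std = np.exp(xb) / (1 + np.exp(xb))  # e^{xb} / (1 + e^{xb})
form_neg = 1 / (1 + np.exp(-xb))          # 1 / (1 + e^{-xb})
print(np.allclose(form_std, form_neg))    # True: the two forms agree
```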

We could estimate the parameters $\theta=\{\beta_0,\beta_1,...\beta_P\}$ by generalized method of moments (GMM) using nonlinear least squares or a more general set of moments to match.[^GMM] But maximum likelihood estimation is the most common method for estimating the parameters $\theta$ because of its more robust statistical properties.[^MaxLikeli] Also, the distributional assumptions that maximum likelihood requires are already built into the model, so they are not overly strong.


(Sec_LogLogitNLLS)=
#### Nonlinear least squares estimation

If we define $z_i = Pr(y_i=1|X_i,\theta)$, then the error in the logistic regression is the following.

```{math}
:label: EqLog_LogitNLLS_err
\varepsilon_i = y_i - z_i
```

The GMM specification of the nonlinear least squares method of estimating the parameter vector $\theta$ would then be the following.[^GMM]

```{math}
:label: EqLog_LogitNLLS_gmm
\begin{split}
\hat{\theta}_{nlls} = \theta:\quad &\min_{\theta} \sum_{i=1}^N\varepsilon_i^2 \quad = \quad \min_{\theta}\sum_{i=1}^N\bigl(y_i - z_i \bigr)^2 \quad \\
&= \quad \min_{\theta} \sum_{i=1}^N\Bigl[y_i - Pr(y_i=1|X_i,\theta)\Bigr]^2
\end{split}
```
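
As an illustration, here is a minimal sketch of minimizing this criterion with `scipy.optimize.minimize` on simulated data (the data-generating process, sample size, and starting values are the editor's assumptions, not part of the chapter's example):

```{code-cell} ipython3
import numpy as np
import scipy.optimize as opt

def logit_prob(beta, X):
    # Pr(y=1 | X, beta), where X includes a leading column of ones
    return 1 / (1 + np.exp(-X @ beta))

def nlls_crit(beta, X, y):
    # Nonlinear least squares criterion: sum_i (y_i - z_i)^2
    return np.sum((y - logit_prob(beta, X)) ** 2)

# Simulated data (hypothetical): one regressor plus a constant
rng = np.random.default_rng(seed=25)
x = rng.normal(size=200)
X = np.column_stack((np.ones(x.shape[0]), x))
y = (x + rng.logistic(size=200) > 0).astype(int)

nlls_res = opt.minimize(nlls_crit, x0=np.zeros(2), args=(X, y))
print("NLLS estimates (beta_0, beta_1):", nlls_res.x)
```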


(Sec_LogLogitMLE)=
#### Maximum likelihood estimation

We characterize the likelihood function for a sample of data as the probability that the given sample $(y_i,X_i)$ came from the assumed distribution given parameter values $\theta$.

```{math}
:label: EqLog_LogitMLE_like
\mathcal{L}(y_i,X_i|\theta) = \prod_{i=1}^N Pr(y_i=1|X_i,\theta)^{y_i}\bigl[1 - Pr(y_i=1|X_i,\theta)\bigr]^{1 - y_i}
```

The intuition of this likelihood function is that you want the predicted probability $Pr(y_i=1|X_i,\theta)$ to be close to one for the observations in which $y_i=1$, and you want $1 - Pr(y_i=1|X_i,\theta)$ to be close to one for the observations in which $y_i=0$.

The log-likelihood function, which the MLE problem maximizes, is the following.

```{math}
:label: EqLog_LogitMLE_loglike
\ln\bigl[\mathcal{L}(y_i,X_i|\theta)\bigr] = \sum_{i=1}^N\Bigl(y_i\ln\bigl[Pr(y_i=1|X_i,\theta)\bigr] + (1 - y_i)\ln\bigl[1 - Pr(y_i=1|X_i,\theta)\bigr]\Bigr)
```

The MLE problem for estimating $\theta$ of the logistic regression model is, therefore, the following.[^MaxLikeli]

```{math}
:label: EqLog_LogitMLE_maxprob
\hat{\theta}_{mle} = \theta:\quad \max_{\theta} \ln\bigl[\mathcal{L}(y_i,X_i|\theta)\bigr]
```
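
A matching sketch of the MLE problem, minimizing the negative log-likelihood with `scipy.optimize.minimize` over the same simulated data as the NLLS sketch above (again an editor's illustration; `statsmodels`' `Logit(y, X).fit()` implements the same estimator):

```{code-cell} ipython3
import numpy as np
import scipy.optimize as opt

def neg_loglike(beta, X, y):
    # Negative of the log-likelihood above; minimizing it maximizes the likelihood
    z = 1 / (1 + np.exp(-X @ beta))
    return -np.sum(y * np.log(z) + (1 - y) * np.log(1 - z))

# Simulated data (hypothetical), as in the NLLS sketch
rng = np.random.default_rng(seed=25)
x = rng.normal(size=200)
X = np.column_stack((np.ones(x.shape[0]), x))
y = (x + rng.logistic(size=200) > 0).astype(int)

mle_res = opt.minimize(neg_loglike, x0=np.zeros(2), args=(X, y))
print("MLE estimates (beta_0, beta_1):", mle_res.x)
```

With starting values at zero, the predicted probabilities begin at 0.5, so the logarithms above are well defined; a production implementation would clip $z$ away from exactly 0 and 1 for numerical safety.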

(Sec_LogLogitTitanic)=
#### Titanic example

Good examples of logistic regression appear in a number of sources. Here I adapt some code and commentary from [http://www.data-mania.com/blog/logistic-regression-example-in-python/](http://www.data-mania.com/blog/logistic-regression-example-in-python/). The research question is to use a famous Titanic passenger dataset to identify the characteristics that most strongly predict whether a passenger survived ($y_i=1$) or died ($y_i=0$).

```{code-cell} ipython3
:tags: []

import pandas as pd

url = ('https://raw.githubusercontent.com/OpenSourceEcon/CompMethods/' +
       'main/data/basic_empirics/logit/titanic-train.csv')
titanic = pd.read_csv(url)
titanic.columns = ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
                   'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
titanic.describe()
```

The variable descriptions are the following:
* `Survived`: Survival (0 = No; 1 = Yes)
* `Pclass`: Passenger class (1 = 1st; 2 = 2nd; 3 = 3rd)
* `Name`: Name
* `Sex`: Gender
* `Age`: Age
* `SibSp`: Number of siblings/spouses aboard
* `Parch`: Number of parents/children aboard
* `Ticket`: Ticket number
* `Fare`: Passenger fare (British pound)
* `Cabin`: Cabin
* `Embarked`: Port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Let's first check that our target variable, `Survived`, is binary. Since we are building a model to predict survival of passengers from the Titanic, our target is going to be the `Survived` variable from the titanic dataframe. To make sure that it is a binary variable, let's tabulate its values with the pandas `value_counts()` method (a Seaborn `countplot()` gives the same check visually; see the sketch below).

```{code-cell} ipython3
:tags: []

titanic['Survived'].value_counts()
```
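
For the visual version of the same check, a minimal sketch using Seaborn's `countplot()` (assuming the `seaborn` package is installed; this cell is an editor's addition):

```{code-cell} ipython3
import seaborn as sns
import matplotlib.pyplot as plt

# Bar plot of the counts of each Survived value (0 = died, 1 = survived)
sns.countplot(x='Survived', data=titanic)
plt.show()
```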


(SecLogFootnotes)=
## Footnotes

The footnotes from this chapter.

[^GMM]: See the {ref}`Chap_GMM` chapter of this book.

[^MaxLikeli]: See the {ref}`Chap_MaxLikeli` chapter of this book.

docs/book/basic_ml/ml_intro.md

Lines changed: 17 additions & 1 deletion
@@ -3,8 +3,24 @@
 
 Put basic machine learning intro here.
 
+Define regression model versus classification model. Define parametric model versus nonparametric model. Define supervised learning versus unsupervised learning.
+
+Introduce the paradigm of cross-validation.
+
+The definitions of machine learning, statistical learning, and artificial intelligence overlap in most cases, and in some contexts they are indistinguishable.
+
+Machine learning, statistical learning, and artificial intelligence are mostly focused on predictive models $\hat{y}=f(x|\theta)$ and on tuning or estimating the parameters $\theta$ to minimize some definition of total error in the predictions for $\hat{y}$.
+* Highly nonlinear models
+* Cross-validation (see the sketch below)
+* Exotic loss functions
+* Super robust minimizers (variants of stochastic gradient descent)
+Machine learning could just as appropriately, though nondescriptively, be called nonlinear regression modeling. On predictive accuracy, machine learning models typically outperform structural models and the best regression models. However, this accuracy often comes at the cost of interpretability. The estimated parameters in structural models and regression models often have clear interpretations. On the other hand, it is nearly impossible to make a robust claim about the effect of an explanatory variable on a dependent variable in a neural net model.
+
+Recent advances by Athey and others have restored interpretation, marginal effects, and causal inference to machine learning models. [include citations here.]
+
 
 (SecBasicMLintroFootnotes)=
 ## Footnotes
 
-<!-- [^citation_note]: See {cite}`AuerbachEtAl:1981,AuerbachEtAl:1983`, {cite}`AuerbachKotlikoff:1983a,AuerbachKotlikoff:1983b,AuerbachKotlikoff:1983c`, and {cite}`AuerbachKotlikoff:1985`. -->
+The footnotes from this chapter.
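
As a minimal illustration of the cross-validation paradigm flagged in the list above, here is an editor's sketch, assuming `scikit-learn` is installed and using simulated data:

```python
# A minimal sketch of k-fold cross-validation, assuming scikit-learn is
# installed; the data here are simulated purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=25)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -0.5, 0.25]) + rng.logistic(size=200) > 0).astype(int)

# 5-fold CV: fit on 4/5 of the data, score accuracy on the held-out 1/5, rotate
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("fold accuracies:", scores, " mean:", scores.mean())
```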

docs/book/index.md

Lines changed: 7 additions & 1 deletion
@@ -1,6 +1,12 @@
 # Computational Methods for Economists using Python
 
-This site contains open access tutorial materials and exercises for learning and using modern computational methods used by economists and data scientists. These materials have been developed by [Richard W. Evans](https://sites.google.com/site/rickecon) since 2008 primarily through the following endeavors:
+| | |
+| --- | --- |
+| Org | [![OSE Lab cataloged](https://img.shields.io/badge/OSE%20Lab-catalogued-critical)](https://github.com/OpenSourceEcon) [![OS License: AGPL-3.0](https://img.shields.io/badge/OS%20License-AGPL%203.0-yellow)](https://github.com/OpenSourceEcon/CompMethods/blob/main/LICENSE) [![Jupyter Book Badge](https://jupyterbook.org/badge.svg)](https://opensourceecon.github.io/CompMethods/) |
+| Package | [![Python 3.10](https://img.shields.io/badge/python-3.10-blue.svg)](https://www.python.org/downloads/release/python-31013/) [![Python 3.11](https://img.shields.io/badge/python-3.11-blue.svg)](https://www.python.org/downloads/release/python-3115/) |
+| Testing | ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/build_and_test.yml/badge.svg?branch=main) ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/deploy_docs.yml/badge.svg?branch=main) ![example event parameter](https://github.com/OpenSourceEcon/CompMethods/actions/workflows/check_format.yml/badge.svg?branch=main) [![Codecov](https://codecov.io/gh/OpenSourceEcon/CompMethods/branch/main/graph/badge.svg)](https://codecov.io/gh/OpenSourceEcon/compmethods) |
+
+This online book site contains open access tutorial materials and exercises for learning and using modern computational methods used by economists and data scientists. These materials have been developed by [Richard W. Evans](https://sites.google.com/site/rickecon) since 2008 primarily through the following endeavors:
 * (2008-2016) Assistant Professor, Department of Economics, Brigham Young University. Taught undergraduate courses in macroeconomics, international finance, advanced macroeconomics, computational methods.
 * (2012-2016) Co-founder and co-director of the BYU Macroeconomics and Computational Laboratory.
 * (2013-2016) Co-PI, National Science Foundation Grant for original development of Applied and Computational Math Emphasis (ACME) curriculum at Brigham Young University.
images/basic_empirics/logit/logit_gen.png

147 KB binary file (image; not rendered)
