
statsmodels regression examples

statsmodels supports the basic regression models, such as linear regression and logistic regression.

It also lets you write the regression specification as an R-style formula.

1. Regression with R-style formulas

If the independent variables x are numeric, you can write them in the formula directly. However, if an independent variable x is categorical, you need to wrap it as C(x) in the formula.

1.1 Linear regression

import statsmodels.api as sm
import statsmodels.formula.api as smf

linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
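A minimal runnable sketch of the call above. Since `df` is not defined at this point, the example below uses hypothetical data with the same column names (`Lottery`, `Literacy`, `Wealth`, `Region`); string columns like `Region` are treated as categorical automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the Guerry columns used in the formula
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    'Literacy': rng.uniform(10, 80, n),
    'Wealth': rng.uniform(10, 90, n),
    'Region': rng.choice(['N', 'S', 'E', 'W'], n),
})
df['Lottery'] = 20 + 0.3 * df['Literacy'] + 0.2 * df['Wealth'] + rng.normal(0, 5, n)

# String columns such as Region are expanded into dummy variables automatically
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
print(linreg.params)
```

With four Region levels, the fit has six parameters: the intercept, two numeric slopes, and three Region dummies (one level is absorbed into the intercept as the baseline).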

1.2 Logistic regression

When each x is numeric, write the formula directly:

f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()

1.3 Categorical variables: wrap them in C()

logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
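A runnable sketch of this call, using hypothetical data in place of the `hgcdev` frame (which is not shown in the post). `seg2` is stored as numbers here, which is exactly the case where `C()` is needed: without it, the segment code would be treated as a single numeric slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical default data: DF is a 0/1 default flag, seg2 a numeric segment code
rng = np.random.default_rng(1)
n = 500
d = pd.DataFrame({
    'TNW': rng.normal(100, 30, n),     # tangible net worth
    'seg2': rng.choice([1, 2, 3], n),  # segment code stored as numbers
})
p = 1 / (1 + np.exp(0.02 * (d['TNW'] - 100)))
d['DF'] = rng.binomial(1, p)

# C(seg2) expands the numeric code into dummy variables
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=d).fit(disp=0)
print(logitfit.params)
```

With three segment levels, the fit has four parameters: intercept, two segment dummies, and the TNW slope.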

To explore the output, use dir(logitfit) or dir(linreg) to list the attributes and methods of the fitted model.

Generally, the following are the most useful:

  1. for linear regression

    • linreg.summary()          # summary of the model
    • linreg.fittedvalues       # fitted values from the model
    • linreg.predict()          # predictions (on new or training data)
    • linreg.rsquared_adj       # adjusted R-squared
  2. for logistic regression

    • logitreg.summary()        # summary of the model
    • logitreg.fittedvalues     # fitted values (linear predictor scale)
    • logitreg.predict()        # predicted probabilities
    • logitreg.pred_table()     # confusion matrix

2. Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

The “-” sign can be used to remove columns/variables.

df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
print(res.params)

“:” adds a new column to the design matrix with the product of the two columns it joins.

“*” also includes the individual columns that were multiplied together.

res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery * Wealth - 1', data=df).fit() if False else smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
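You can inspect the columns each operator generates directly with patsy, the formula engine statsmodels uses. A small sketch with a hypothetical three-row frame:

```python
import pandas as pd
from patsy import dmatrix

# Tiny frame just to inspect the generated design-matrix columns
d = pd.DataFrame({'Literacy': [1.0, 2.0, 3.0], 'Wealth': [4.0, 5.0, 6.0]})

# ":" produces only the product column
m1 = dmatrix('Literacy : Wealth - 1', d, return_type='dataframe')
print(list(m1.columns))   # ['Literacy:Wealth']

# "*" expands to both main effects plus the product
m2 = dmatrix('Literacy * Wealth - 1', d, return_type='dataframe')
print(list(m2.columns))   # ['Literacy', 'Wealth', 'Literacy:Wealth']
```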

3. Regression without formulas

With this interface, you pass y and X to the model separately.

Important: by default, this interface does not include an intercept. To add one, use statsmodels.tools.add_constant (also available as sm.add_constant) to prepend a constant column to the X matrix.

3.1 Linear regression

import statsmodels.api as sm
sm.OLS(y, X).fit()

3.2 Logistic regression

sm.Logit(y, X).fit()

3.3 GLM: state the family explicitly

sm.GLM(data.endog, data.exog, family=sm.families.Binomial()).fit()
