
statsmodels regression examples

statsmodels supports the basic regression models, such as linear regression and logistic regression.

It also lets you write the regression specification as an R-style formula.

1. Regression with R-style formulas

If the independent variables x are numeric, you can write them in the formula directly. However, if an independent variable x is categorical, you need to wrap it as C(x) in the formula.

1.1 Linear regression

import statsmodels.api as sm
import statsmodels.formula.api as smf

linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
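A minimal runnable sketch of the call above. Since `df` is not defined at this point, the example below uses hypothetical data with the same column names (`Lottery`, `Literacy`, `Wealth`, `Region`); string columns like `Region` are treated as categorical automatically.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data standing in for the Guerry columns used in the formula
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    'Literacy': rng.uniform(10, 80, n),
    'Wealth': rng.uniform(10, 90, n),
    'Region': rng.choice(['N', 'S', 'E', 'W'], n),
})
df['Lottery'] = 20 + 0.3 * df['Literacy'] + 0.2 * df['Wealth'] + rng.normal(0, 5, n)

# String columns such as Region are expanded into dummy variables automatically
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
print(linreg.params)
```

With four Region levels, the fit has six parameters: the intercept, two numeric slopes, and three Region dummies (one level is absorbed into the intercept as the baseline).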

1.2 Logistic regression

When each x is numeric, write the formula directly:

f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()

1.3 Categorical variables: wrap them in C()

logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
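A runnable sketch of this call, using hypothetical data in place of the `hgcdev` frame (which is not shown in the post). `seg2` is stored as numbers here, which is exactly the case where `C()` is needed: without it, the segment code would be treated as a single numeric slope.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical default data: DF is a 0/1 default flag, seg2 a numeric segment code
rng = np.random.default_rng(1)
n = 500
d = pd.DataFrame({
    'TNW': rng.normal(100, 30, n),     # tangible net worth
    'seg2': rng.choice([1, 2, 3], n),  # segment code stored as numbers
})
p = 1 / (1 + np.exp(0.02 * (d['TNW'] - 100)))
d['DF'] = rng.binomial(1, p)

# C(seg2) expands the numeric code into dummy variables
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=d).fit(disp=0)
print(logitfit.params)
```

With three segment levels, the fit has four parameters: intercept, two segment dummies, and the TNW slope.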

To explore the output, use dir(logitfit) or dir(linreg) to list the attributes and methods of the fitted model.

Generally, the following are the most useful:

  1. for linear regression

    • linreg.summary()          # summary of the model
    • linreg.fittedvalues       # fitted values from the model
    • linreg.predict()          # predictions (on new or training data)
    • linreg.rsquared_adj       # adjusted R-squared
  2. for logistic regression

    • logitreg.summary()        # summary of the model
    • logitreg.fittedvalues     # fitted values (linear predictor scale)
    • logitreg.predict()        # predicted probabilities
    • logitreg.pred_table()     # confusion matrix

2. Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

The “-” sign can be used to remove columns/variables.

df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()

res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
print(res.params)

“:” adds a new column to the design matrix with the product of the two columns it joins.

“*” also includes the individual columns that were multiplied together.

res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery * Wealth - 1', data=df).fit() if False else smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
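You can inspect the columns each operator generates directly with patsy, the formula engine statsmodels uses. A small sketch with a hypothetical three-row frame:

```python
import pandas as pd
from patsy import dmatrix

# Tiny frame just to inspect the generated design-matrix columns
d = pd.DataFrame({'Literacy': [1.0, 2.0, 3.0], 'Wealth': [4.0, 5.0, 6.0]})

# ":" produces only the product column
m1 = dmatrix('Literacy : Wealth - 1', d, return_type='dataframe')
print(list(m1.columns))   # ['Literacy:Wealth']

# "*" expands to both main effects plus the product
m2 = dmatrix('Literacy * Wealth - 1', d, return_type='dataframe')
print(list(m2.columns))   # ['Literacy', 'Wealth', 'Literacy:Wealth']
```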

3. Regression without formulas

With this interface, you pass y and X to the model separately.

Important: by default, this interface does not include an intercept. To add one, use statsmodels.tools.add_constant (also available as sm.add_constant) to prepend a constant column to the X matrix.

3.1 Linear regression

import statsmodels.api as sm
sm.OLS(y, X).fit()

3.2 Logistic regression

sm.Logit(y, X).fit()

3.3 GLM: state the family explicitly

sm.GLM(data.endog, data.exog, family=sm.families.Binomial()).fit()
