# statsmodels regression examples

statsmodels supports the basic regression models such as linear regression and logistic regression.

It also supports writing the regression as an R-style formula.

## 1. regression with R-style formula

If the independent variables x are numeric, you can write them in the formula directly. However, if an independent variable x is categorical, you need to wrap it as C(x) in the formula.

### 1.1 linear regression

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# df must contain the columns Lottery, Literacy, Wealth, and Region
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
```
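A minimal runnable sketch of the same call, using synthetic data in place of the Guerry `df` (the data-generating numbers here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the df used above (illustrative only).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Literacy': rng.uniform(10, 90, 100),
    'Wealth': rng.uniform(10, 90, 100),
    'Region': rng.choice(['N', 'S', 'E', 'W'], 100),
})
df['Lottery'] = 20 + 0.3 * df['Literacy'] + 0.2 * df['Wealth'] + rng.normal(0, 5, 100)

# Region is a string column, so patsy treats it as categorical automatically.
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
print(linreg.params)
```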


### 1.2 logistic regression

When each x is numeric, write the formula directly:

```python
f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()
```


### 1.3 categorical variable: wrap it in C()

```python
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
```


If you want to inspect the output, you can use dir(logitfit) or dir(linreg) to list the attributes of the fitted model.

Generally, the following are the most commonly used:

1. for linear regression

• linreg.summary()          # summary of the model
• linreg.fittedvalues       # fitted values from the model
• linreg.predict()          # predictions
• linreg.rsquared_adj       # adjusted R-squared

2. for logistic regression

• logitfit.summary()        # summary of the model
• logitfit.fittedvalues     # fitted values (on the linear-predictor scale)
• logitfit.predict()        # predicted probabilities
• logitfit.pred_table()     # confusion matrix
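For instance, the logistic-regression attributes above can be exercised on a small synthetic fit (the DF/TNW names echo the earlier formula, but the data here is simulated, not the original hgc data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 200
hgc = pd.DataFrame({'TNW': rng.normal(0, 1, n)})
# Synthetic binary default flag driven by TNW.
p = 1 / (1 + np.exp(-(0.5 + 1.5 * hgc['TNW'])))
hgc['DF'] = (rng.uniform(size=n) < p).astype(int)

logitfit = smf.logit(formula='DF ~ TNW', data=hgc).fit(disp=0)
print(logitfit.pred_table())   # 2x2 confusion matrix at the 0.5 threshold
print(logitfit.predict()[:5])  # in-sample predicted probabilities
```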

## 2. Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

The “-” sign can be used to remove columns/variables.

```python
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
print(res.params)
```


“:” adds a new column to the design matrix with the product of the other two columns.

“*” will also include the individual columns that were multiplied together

```python
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
```


## 3. regression without formula

With this API, you pass y and X to the model separately.

Important: by default, these models do not include an intercept. If you want an intercept in the regression, use sm.add_constant (from statsmodels.tools) to add a constant column to the X matrix.

### 3.1. linear regression

```python
import statsmodels.api as sm

sm.OLS(y, X).fit()
```


### 3.2. logistic regression

```python
sm.Logit(y, X).fit()
```


### 3.3. GLM: state the family explicitly in the model

```python
sm.GLM(data.endog, data.exog, family=sm.families.Binomial()).fit()
```


### Reference

http://nbviewer.ipython.org/urls/umich.box.com/shared/static/aouhn2mci77opm3v89vc.ipynb

http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/nhanes_logistic_regression.html

http://statsmodels.sourceforge.net/devel/example_formulas.html

http://statsmodels.sourceforge.net/devel/contrasts.html