statsmodels supports the basic regression models such as linear regression and logistic regression.
It also lets you write the regression specification as an R-style formula.
1. regression with R-style formula
If an independent variable x is numeric, you can write it in the formula directly. However, if x is a categorical variable, you need to wrap it as C(x) in the formula.
1.1 linear regression
import statsmodels.api as sm
import statsmodels.formula.api as smf
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
1.2 logistic regression
when each x is numeric, write the formula directly:
f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()
1.3 categorical variable, wrap it in C()
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
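A minimal runnable sketch of the C() wrapper, using synthetic data in place of the hgcdev dataset (the column names DF, TNW, and seg2 are taken from the example above; the data itself is made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for hgcdev: one numeric and one categorical column
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'TNW': rng.normal(size=n),
    'seg2': rng.choice(['A', 'B', 'C'], size=n),
})
# Binary outcome loosely driven by TNW
df['DF'] = (df['TNW'] + rng.normal(size=n) > 0).astype(int)

fit = smf.logit(formula='DF ~ TNW + C(seg2)', data=df).fit(disp=0)
# C(seg2) expands into dummy columns such as C(seg2)[T.B], C(seg2)[T.C]
print(fit.params.index.tolist())
```

Note how the categorical variable shows up in the output as one dummy-coded coefficient per non-reference level.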
If you want to inspect the output, you can use dir(logitfit) or dir(linreg) to list the attributes of the fitted model.
Generally, the following are the most useful:
- for linear regression
linreg.summary()        # summary of the model
linreg.fittedvalues     # fitted values from the model
linreg.predict()        # predict
linreg.rsquared_adj     # adjusted R-squared
- for logistic regression
logitreg.summary()      # summary of the model
logitreg.fittedvalues   # fitted values from the model
logitreg.predict()      # predict
logitreg.pred_table()   # confusion matrix
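The attributes above in action, on a small synthetic dataset (the data and column names x, y, d are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=100)})
df['y'] = 2 * df['x'] + rng.normal(size=100)   # continuous target
df['d'] = (df['y'] > 0).astype(int)            # binary target

linreg = smf.ols('y ~ x', data=df).fit()
print(linreg.rsquared_adj)                     # adjusted R-squared
print(linreg.fittedvalues[:3])                 # in-sample fitted values
print(linreg.predict(pd.DataFrame({'x': [0.0, 1.0]})))  # predict on new data

logitreg = smf.logit('d ~ x', data=df).fit(disp=0)
print(logitreg.pred_table())                   # 2x2 confusion matrix at the 0.5 cutoff
```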
2. Operators
We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.
The “-” sign can be used to remove columns/variables.
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
print(res.params)
“:” adds a new column to the design matrix with the product of the two columns it joins.
“*” also includes the individual columns that were multiplied together:
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
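A quick way to see the difference between “:” and “*” without downloading the Guerry data is to fit both formulas on synthetic columns (names a, b, y are made up) and compare the coefficient names:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({'a': rng.normal(size=50), 'b': rng.normal(size=50)})
df['y'] = df['a'] * df['b'] + rng.normal(size=50)

res1 = smf.ols('y ~ a : b - 1', data=df).fit()
res2 = smf.ols('y ~ a * b - 1', data=df).fit()
print(res1.params.index.tolist())   # only the product column: ['a:b']
print(res2.params.index.tolist())   # main effects plus product: ['a', 'b', 'a:b']
```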
3. regression without formula
In this style, you need to pass your y and X to the model separately.
Important: by default, this regression will not include an intercept. If you want an intercept in the regression, you need to use statsmodels.tools.add_constant to add a constant column to the X matrix.
3.1. linear regression
import statsmodels.api as sm
sm.OLS(y, X).fit()
3.2. logistic regression
sm.Logit(y, X).fit()
3.3. GLM: state the family explicitly in the regression
sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
Reference
http://nbviewer.ipython.org/urls/umich.box.com/shared/static/aouhn2mci77opm3v89vc.ipynb
http://dept.stat.lsa.umich.edu/~kshedden/Python-Workshop/nhanes_logistic_regression.html
http://statsmodels.sourceforge.net/devel/example_formulas.html
http://statsmodels.sourceforge.net/devel/contrasts.html