statsmodels supports the basic regression models, such as linear regression and logistic regression. It also lets you specify the regression with an R-style formula.
1. regression with R-style formula
if an independent variable x is numeric, you can write it in the formula directly. However, if x is a categorical variable, you need to wrap it as C(x) in the formula.
1.1 linear regression
import statsmodels.api as sm
import statsmodels.formula.api as smf
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
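the df above is not loaded until section 2 below, so here is a minimal self-contained sketch with synthetic data; the column names mirror the example, but the data and coefficients are made up for illustration:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    'Literacy': rng.normal(50, 10, n),
    'Wealth': rng.normal(40, 8, n),
    'Region': rng.choice(['N', 'S', 'E', 'W'], n),
})
# simulated response: Lottery depends linearly on Literacy and Wealth
df['Lottery'] = 2.0 * df['Literacy'] - 1.5 * df['Wealth'] + rng.normal(0, 5, n)

# Region is a string column, so the formula machinery treats it as
# categorical automatically (no C() needed here)
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()
print(linreg.params)
```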
1.2 logistic regression
when each x is numeric, write the formula directly
f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()
1.3 when x is a categorical variable, wrap it in C()
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
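the hgcdev data above is not shown in this post, so here is a runnable sketch with a synthetic stand-in; the variable names match the example, but the data-generating numbers are invented:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500
hgcdev = pd.DataFrame({
    'TNW': rng.normal(0, 1, n),
    'seg2': rng.choice(['A', 'B', 'C'], n),
})
# simulated default flag: log-odds rise with TNW, segment B shifts the intercept
lin = -0.5 + 1.2 * hgcdev['TNW'] + (hgcdev['seg2'] == 'B') * 0.8
hgcdev['DF'] = rng.binomial(1, 1 / (1 + np.exp(-lin)))

# C(seg2) expands the categorical into treatment-coded dummy columns
logitfit = smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit(disp=0)
print(logitfit.params)
```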
if you want to inspect the output, you can use dir(logitfit) or dir(linreg) to list the attributes of the fitted model. Generally, the following are the most useful:
- for linear regression
linreg.summary()        # summary of the model
linreg.fittedvalues     # fitted values from the model
linreg.predict()        # predict
linreg.rsquared_adj     # adjusted R-squared
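a quick self-contained sketch of these attributes on a toy OLS fit (the data is synthetic):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(5)
d = pd.DataFrame({'x': rng.normal(size=150)})
d['y'] = 1.0 + 2.0 * d['x'] + rng.normal(0, 0.3, 150)

linreg = smf.ols('y ~ x', data=d).fit()

print(linreg.summary())       # full regression output
fitted = linreg.fittedvalues  # fitted values, one per observation
preds = linreg.predict()      # same as fittedvalues when no new data is given
print(linreg.rsquared_adj)    # adjusted R-squared
```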
- for logistic regression
logitreg.summary()      # summary of the model
logitreg.fittedvalues   # fitted values from the model
logitreg.predict()      # predicted probabilities
logitreg.pred_table()   # confusion matrix
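the same attributes on a toy logistic fit (synthetic data, invented coefficients):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 400
d = pd.DataFrame({'x': rng.normal(0, 1, n)})
d['y'] = rng.binomial(1, 1 / (1 + np.exp(-2.0 * d['x'])))

logitreg = smf.logit('y ~ x', data=d).fit(disp=0)

print(logitreg.summary())        # full regression output
fitted = logitreg.fittedvalues   # linear predictor (log-odds scale)
probs = logitreg.predict()       # predicted probabilities in [0, 1]
cm = logitreg.pred_table()       # 2x2 confusion matrix at a 0.5 cutoff
print(cm)
```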
2. Operators
We have already seen that "~" separates the left-hand side of the model from the right-hand side, and that "+" adds new columns to the design matrix. The "-" sign can be used to remove columns/variables.
df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
df.head()
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
print(res.params)
":" adds a new column to the design matrix with the product of the other two columns. "*" will also include the individual columns that were multiplied together:
res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
3. regression without formula
in this form, you pass your y and X to the model separately.
important: by default, this kind of regression does not include an intercept. If you want an intercept in the regression, you need to use statsmodels.tools.add_constant (also available as sm.add_constant) to add a constant column to the X matrix.
3.1. linear regression
import statsmodels.api as sm
sm.OLS(y, X).fit()
3.2. logistic regression
sm.Logit(y, X).fit()
3.3. GLM, state the family explicitly in the regression
sm.GLM(data.endog, data.exog, family=sm.families.Binomial()).fit()