statsmodels regression examples

statsmodels supports basic regression models such as linear regression and logistic regression.

It also lets you specify a model with an R-style formula.

1. regression with R-style formula

If an independent variable x is numeric, you can write it in the formula directly. If x is categorical, wrap it in C(), as in C(x). String-valued columns are treated as categorical automatically, but a categorical variable stored as numbers needs C(), otherwise it is fitted as a numeric variable.

1.1 linear regression

import statsmodels.api as sm
import statsmodels.formula.api as smf

# df is the Guerry data loaded in section 2; Region holds strings, so it is treated as categorical
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()

1.2 logistic regression

When every x is numeric, write the formula directly.

# hgc is a DataFrame containing the columns named in the formula
f = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW'
logitfit = smf.logit(formula=f, data=hgc).fit()

1.3 categorical variable, wrap it in C()

smf.logit(formula='DF ~ TNW + C(seg2)', data=hgcdev).fit()
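As a rough illustration of why C() matters, the sketch below fits the same logistic model twice on made-up data. The DataFrame toy, its columns x, seg and y, and the coefficients used to simulate y are all hypothetical and are not the hgc or hgcdev data used above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: "seg" is a category stored as integer codes 1, 2, 3.
rng = np.random.default_rng(0)
n = 300
toy = pd.DataFrame({
    "x": rng.normal(size=n),
    "seg": rng.integers(1, 4, size=n),
})
# Made-up data-generating process, only so that there is something to fit.
log_odds = 0.8 * toy["x"] + np.where(toy["seg"] == 3, 1.0, 0.0) - 0.3
toy["y"] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Without C(), seg is fitted as one numeric slope.
as_numeric = smf.logit("y ~ x + seg", data=toy).fit(disp=0)
# With C(), seg is expanded into one dummy column per non-reference level.
as_categorical = smf.logit("y ~ x + C(seg)", data=toy).fit(disp=0)

print(list(as_numeric.params.index))
print(list(as_categorical.params.index))
```

The second model has one more coefficient than the first, because the three levels of seg become two dummy columns next to the intercept.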

To inspect the output, use dir(logitfit) or dir(linreg) to list the attributes of the fitted model.

The most commonly used ones are:

  1. for linear regression

    • linreg.summary()          # summary of the model
    • linreg.fittedvalues       # fitted values from the model
    • linreg.predict()          # in-sample predictions; pass new data to predict on it
    • linreg.rsquared_adj       # adjusted R-squared
  2. for logistic regression

    • logitfit.summary()        # summary of the model
    • logitfit.fittedvalues     # linear predictor (log-odds), not probabilities
    • logitfit.predict()        # predicted probabilities
    • logitfit.pred_table()     # confusion matrix at a 0.5 threshold
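The attributes listed above can be tried out on a small simulated data set. This is only a sketch: the DataFrame toy, the columns x and y, and the simulated coefficients are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with one numeric predictor and a binary outcome.
rng = np.random.default_rng(1)
n = 200
toy = pd.DataFrame({"x": rng.normal(size=n)})
toy["y"] = (rng.random(n) < 1 / (1 + np.exp(-(0.5 + 1.2 * toy["x"])))).astype(int)

fit = smf.logit("y ~ x", data=toy).fit(disp=0)

probs = fit.predict()        # in-sample predicted probabilities
linpred = fit.fittedvalues   # linear predictor (log-odds)
table = fit.pred_table()     # rows: observed class, columns: predicted class

# Applying the logistic function to the linear predictor gives the probabilities.
same = np.allclose(probs, 1 / (1 + np.exp(-linpred)))
print(same)
print(table)

# Predicting on new data: pass a DataFrame with the columns used in the formula.
new = pd.DataFrame({"x": [-1.0, 0.0, 1.0]})
new_probs = fit.predict(new)
print(new_probs)
```

For a fitted Logit model, predict() and fittedvalues are on different scales, so it is worth checking which one a downstream step expects.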

2. Operators

We have already seen that “~” separates the left-hand side of the model from the right-hand side, and that “+” adds new columns to the design matrix.

The “-” sign removes columns/variables. For example, “- 1” removes the intercept.

df = sm.datasets.get_rdataset("Guerry", "HistData").data
df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()

# "- 1" drops the intercept, so every Region level gets its own coefficient
res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
print(res.params)
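To see what removing the intercept does without downloading the Guerry data, here is a minimal sketch on made-up data. The DataFrame toy and its columns g, x and y are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data: a three-level category g and a numeric x.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "g": np.repeat(["a", "b", "c"], 30),
    "x": rng.normal(size=90),
})
level_means = toy["g"].map({"a": 1.0, "b": 2.0, "c": 3.0})
toy["y"] = level_means + 0.5 * toy["x"] + rng.normal(scale=0.1, size=90)

with_int = smf.ols("y ~ x + C(g)", data=toy).fit()
no_int = smf.ols("y ~ x + C(g) - 1", data=toy).fit()

# With an intercept, one level of g is the reference and gets no column.
print(list(with_int.params.index))
# Without it, every level of g gets its own coefficient.
print(list(no_int.params.index))
```

Both fits describe the same model and give the same fitted values; only the parameterization changes. With "- 1", each C(g) coefficient is that level's own baseline rather than a difference from the reference level.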

“:” adds a single new column to the design matrix, holding the product of the two columns.

“*” also includes the individual columns that were multiplied together, so a * b is shorthand for a + b + a:b.

res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
print(res1.params, '\n')
print(res2.params)
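The difference between “:” and “*” is easiest to see in the parameter names. The following sketch uses simulated data; the DataFrame toy and the columns a, b and y are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data with two numeric predictors.
rng = np.random.default_rng(2)
toy = pd.DataFrame({"a": rng.normal(size=100), "b": rng.normal(size=100)})
toy["y"] = (1.0 + 2.0 * toy["a"] - toy["b"] + 0.5 * toy["a"] * toy["b"]
            + rng.normal(scale=0.1, size=100))

only_product = smf.ols("y ~ a:b", data=toy).fit()         # interaction column only
star = smf.ols("y ~ a*b", data=toy).fit()                 # main effects plus interaction
spelled_out = smf.ols("y ~ a + b + a:b", data=toy).fit()  # the same model written out

print(list(only_product.params.index))
print(list(star.params.index))
print(np.allclose(star.params.values, spelled_out.params.values))
```

Using “:” alone drops the main effects, which is rarely what you want unless that restriction is intended.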

3. regression without formula

In this form, you pass y and X to the model separately.

Important: by default, these models do not include an intercept. To add one, add a constant column to X with sm.add_constant(X).

3.1. linear regression

import statsmodels.api as sm
sm.OLS(y, X).fit()

3.2. logistic regression

sm.Logit(y, X).fit()

3.3. GLM, state the family explicitly

# data is a dataset object exposing endog (y) and exog (X)
sm.GLM(data.endog, data.exog, family=sm.families.Binomial()).fit()