
sklearn model evaluation with the scoring parameter and sklearn.metrics

Sklearn Model Evaluation and Scoring Function


sklearn generally offers three ways to evaluate how well a model performs:

  1. estimator score method: many estimators come with a built-in score method for evaluating the fitted model (see the brief sketch after this list).
  2. scoring parameter: the cross-validation tools (cross_validation.cross_val_score, grid_search.GridSearchCV) accept a scoring parameter that lets you choose among different score functions.
  3. metric function: the sklearn.metrics module provides many functions for evaluating models, roughly grouped into Classification, Multilabel ranking, Regression and Clustering.
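
For completeness, a minimal sketch of option 1: most estimators expose a built-in score method; for classifiers it returns the mean accuracy on the given data.

from sklearn import svm, datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target
# fit a classifier, then call its own score method (mean accuracy for classifiers)
clf = svm.SVC().fit(X, y)
print(clf.score(X, y))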

This post mainly discusses the second and third options.

1. scoring parameter

  1. Use the scoring parameter of the cross-validation tools (cross_validation.cross_val_score, grid_search.GridSearchCV).
from sklearn import svm, cross_validation, datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
clf = svm.SVC(C = 10, gamma = 10, random_state = 0)
# 5-fold cross validation, scoring each fold by accuracy
cross_validation.cross_val_score(clf, X, y, scoring = 'accuracy', cv = 5)

This returns the accuracy from each of the 5 cross-validation folds.

  2. Use the scoring parameter in GridSearchCV.
from sklearn.linear_model import SGDClassifier
from sklearn.grid_search import GridSearchCV
clf = SGDClassifier()
# candidate values for each hyperparameter
penalty_params = ['l1', 'l2']
loss_params = ['hinge', 'log']
alpha_params = [0.1, 1]
param_grid = dict(penalty = penalty_params, loss = loss_params, alpha = alpha_params)
# 3-fold cross validation with shuffling and a fixed random seed
cv = cross_validation.KFold(y.size, n_folds = 3, shuffle = True, random_state = 9999)
grid = GridSearchCV(clf, param_grid = param_grid, cv = cv, n_jobs = 5, scoring = 'accuracy')
grid.fit(X, y)
# cross-validated accuracy for every parameter combination
[x for x in grid.grid_scores_]

Notes: the code above uses GridSearchCV to search for the best parameter combination. Two penalty terms are tried: L1 is the sum of absolute values (Lasso) and L2 is the sum of squares. Two loss functions and two values of alpha are also tried. Cross validation uses 3 folds, i.e. the data are split into three groups; in each round two groups are used as training data and one as validation data. Because grid search tries every combination, the amount of computation is large and it usually takes a long time to finish.
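
Besides iterating over grid_scores_, the fitted GridSearchCV object also exposes the best result directly, for example:

print(grid.best_score_)    # best mean cross-validated accuracy
print(grid.best_params_)   # the parameter combination that achieved it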

2. Metric Function

This mainly means using the predefined functions in sklearn.metrics, or building a custom scorer with make_scorer.

  • Functions ending in _score return a value where higher means a better model.
  • Functions ending in _error or _loss return a value where lower is better. When converting such a metric into a scorer with make_scorer, set the parameter greater_is_better to False (it defaults to True); see the sketch after this list.
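
As a minimal sketch of the second bullet, an _error metric such as mean_squared_error can be wrapped into a scorer like this; with greater_is_better = False the scorer returns the negated error, so that larger is still better:

from sklearn.metrics import mean_squared_error, make_scorer

mse_scorer = make_scorer(mean_squared_error, greater_is_better = False)
# mse_scorer can then be passed as scoring = mse_scorer to cross_val_score or GridSearchCV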

Some metrics cannot be used directly through the scoring parameter above, sometimes because they require extra arguments, e.g. fbeta_score. In that case you can build a custom scorer with make_scorer.

A typical use is to call an existing function in metrics with non-default arguments, for example setting beta = 2 in fbeta_score below.

from sklearn.metrics import fbeta_score, make_scorer
from sklearn.grid_search import GridSearchCV
from sklearn.svm import LinearSVC

ftwo_scorer = make_scorer(fbeta_score, beta=2)
grid = GridSearchCV(LinearSVC(), param_grid={'C': [1, 10]}, scoring=ftwo_scorer)

Another option is to define a completely new score function.
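
A minimal sketch of this second approach, using a hypothetical metric; the function only needs to take y_true and y_pred and return a single number:

import numpy as np
from sklearn.metrics import make_scorer

def my_custom_score(y_true, y_pred):
    # hypothetical metric: fraction of predictions within one class index of the truth
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= 1)

my_scorer = make_scorer(my_custom_score, greater_is_better = True)
# my_scorer can be passed as scoring = my_scorer just like the built-in scorers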

3. Common Choices

  1. f1_score: also called the balanced F-score, used for binary targets.
  2. roc_auc_score: also for binary targets. When there are very few y = 1 cases, AUC is not a very good measure.
  3. accuracy_score: the fraction of predictions that equal the true values. Often not very informative, because predicting every rare event as 0 barely changes the accuracy.
  4. confusion_matrix: the cross table of predicted versus true values. It shows where the model goes wrong, but it is not a single summary number.
  5. precision, recall and F-measures: see the multiclass / multilabel notes below. (A toy example of items 1 to 4 follows this list.)
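
A minimal sketch of items 1 to 4 on a toy binary example (the labels and probabilities below are made up for illustration):

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.1, 0.6, 0.8, 0.9, 0.3, 0.4]   # predicted P(y = 1), needed for AUC

print(accuracy_score(y_true, y_pred))     # fraction of exact matches
print(confusion_matrix(y_true, y_pred))   # 2 x 2 cross table
print(f1_score(y_true, y_pred))           # balanced F-score
print(roc_auc_score(y_true, y_prob))      # AUC is computed from scores, not hard labels
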
MultiClass / MultiLabel

For multiclass and multilabel classification problems, precision, recall, and F-measures can be applied to each label separately.

Across the different labels, the functions discussed above, average_precision_score (multilabel only), f1_score, fbeta_score, precision_recall_fscore_support, precision_score and recall_score, can all handle the multilabel case. Note that in the multiclass setting, "micro"-averaging produces equal precision, recall and F, while "weighted" averaging may produce an F-score that is not between precision and recall.
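
A minimal sketch of the different averaging options on a 3-class toy example (labels made up for illustration):

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]

print(f1_score(y_true, y_pred, average = None))        # one F-score per label
print(f1_score(y_true, y_pred, average = 'micro'))     # micro F equals micro precision and recall
print(precision_score(y_true, y_pred, average = 'micro'))
print(recall_score(y_true, y_pred, average = 'micro'))
print(f1_score(y_true, y_pred, average = 'weighted'))  # averaged with each label's support as weight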

MultiLabel Ranking

In multilabel learning, each sample can have any number of ground truth labels associated with it. The goal is to give high scores and better rank to the ground truth labels.

The coverage_error function computes the average number of labels that have to be included in the final prediction such that all true labels are predicted.
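
A minimal sketch of coverage_error; the toy arrays below mark the true labels (y_true) and the predicted ranking scores for each label (y_score):

import numpy as np
from sklearn.metrics import coverage_error

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])
# average number of top-ranked labels needed to cover all true labels of each sample
print(coverage_error(y_true, y_score))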

The label_ranking_average_precision_score function implements label ranking average precision (LRAP). This metric is linked to the average_precision_score function, but is based on the notion of label ranking instead of precision and recall.
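
A minimal sketch of LRAP on the same kind of input as coverage_error:

import numpy as np
from sklearn.metrics import label_ranking_average_precision_score

y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])
print(label_ranking_average_precision_score(y_true, y_score))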

Reference: Model evaluation: quantifying the quality of predictions (scikit-learn user guide)