
working with text data in sklearn

Processing text documents

This post uses the data from a Kaggle competition, CrowdFlower, as an example, and walks through:
  1. Computing word counts (CountVectorizer, TfidfTransformer)
  2. Training a classifier
  3. Pipeline to automate the workflow (to be extended later so that different columns get different transformations; see the sketch at the end of the post)
  4. Grid search to choose the best parameters (GridSearchCV)
import numpy as np
from scipy.sparse import hstack
import pandas as pd
import re
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Load the training and test file
train = pd.read_csv("H:\\python\\kaggle\\CrowdFlower\\train.csv").fillna("")
test  = pd.read_csv("H:\\python\\kaggle\\CrowdFlower\\test.csv").fillna("")

# Keep the test ids for the submission file, then drop the ID columns
idx = test.id.values.astype(int)
train, test = train.drop('id', axis=1), test.drop('id', axis=1)

# Create the labels; drop the target and its variance column from the features
y = train.median_relevance.values
train = train.drop(['median_relevance', 'relevance_variance'], axis=1)
train.head(5)
   query                      product_title                                       product_description
0  bridal shower decorations  Accent Pillow with Heart Design - Red/Black        Red satin accent pillow embroidered with a hea...
1  led christmas lights       Set of 10 Battery Operated Multi LED Train Chr...  Set of 10 Battery Operated Train Christmas Lig...
2  projector                  ViewSonic Pro8200 DLP Multimedia Projector
3  wine rack                  Concept Housewares WR-44526 Solid-Wood Ceiling...  Like a silent and sturdy tree, the Southern En...
4  light bulb                 Wintergreen Lighting Christmas LED Light Bulb ...  WTGR1011\nFeatures\nNickel base, 60,000 averag...
# Compute the overlap between the query and the title or description (normalized by title/description length)
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')

for i, row in train.iterrows():
    query = set(x.lower() for x in token_pattern.findall(row["query"]))
    title = set(x.lower() for x in token_pattern.findall(row["product_title"]))
    description = set(x.lower() for x in token_pattern.findall(row["product_description"]))
    if len(title) > 0:
        train.at[i, "query_in_title"] = len(query.intersection(title)) / float(len(title))
    if len(description) > 0:
        train.at[i, "query_in_description"] = len(query.intersection(description)) / float(len(description))
1. Counting how often each word occurs in the query column (occurrence counts)

For example, 'ecco' occurs 29 times in total in the query column; in Python, [x for x in train["query"] if 'ecco' in x.lower()] lists every row whose query contains 'ecco'.

1.1. Count word frequencies over the query column, then look up how often 'ecco' occurs

cnt = CountVectorizer(token_pattern = u'(?u)\\b\\w\\w+\\b')      # instantiate the vectorizer
cntn = cnt.fit_transform(train["query"]).toarray()               # fit and transform the data
print(cntn.shape)                                                # (10158, 487)
print(cnt.vocabulary_.get('ecco'))                               # 'ecco' is the 142nd word in the vocabulary (index 141, counting from 0)
sum(cntn[:, cnt.vocabulary_.get('ecco')])                        # total occurrences of 'ecco': 29

1.2. The same statistics can be computed for product_title and product_description

cnt.fit_transform(train["product_title"])

cnt.fit_transform(train['product_description'])
    <10158x26155 sparse matrix of type '<class 'numpy.int64'>'
    with 432648 stored elements in Compressed Sparse Row format>

Clearly product_description contains far too many distinct words, so the resulting sparse matrix has dimensions 10158x26155. We can keep only the 200 most frequent words as features (max_features=200) and also require each word to appear in at least 20 documents (min_df=20).

cnt = CountVectorizer(token_pattern = u'(?u)\\b\\w\\w+\\b', max_features=200, min_df = 20)      # instantiate with the new limits
cnt.fit_transform(train['product_description'])
<10158x200 sparse matrix of type '<class 'numpy.int64'>'
    with 158472 stored elements in Compressed Sparse Row format>
2. Computing TF-IDF features with TfidfVectorizer()

Raw occurrence counts have a problem: longer documents will have higher counts even when they talk about the same topics. To avoid this, we can divide the number of occurrences of each word in a document by the total number of words in the document; these new features are called tf, for Term Frequencies.

Another refinement on top of tf is to downscale the weights of words that occur in many documents of the corpus and are therefore less informative. sklearn calls this combination tf-idf: "Term Frequency times Inverse Document Frequency".
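The outline above also mentions TfidfTransformer. As a minimal sketch (reusing this post's token pattern), the two steps can be made explicit: CountVectorizer produces the raw counts and TfidfTransformer rescales them, which is what TfidfVectorizer does in one shot.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# step 1: raw occurrence counts
counts = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b').fit_transform(train['query'])
# step 2: rescale the counts to tf-idf weights
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)                       # same shape as the count matrix, but tf-idf weighted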

The example below fits the vocabulary on train['query'] and then computes the TF-IDF features of train['product_title']:

tfv = TfidfVectorizer(token_pattern = u'(?u)\\b\\w\\w+\\b', max_features=200, min_df = 20)      # instantiate TfidfVectorizer
tfv.fit(train['query'])
tfvm = tfv.transform(train['product_title']).toarray()
tfv.vocabulary_.get('zippo')
sum(tfvm[:, tfv.vocabulary_.get('zippo')])                      # 47.881802215754419
3. Building an estimator to fit the data

3.1. Below we build a model with RandomForestClassifier as an example; n_jobs = -1 runs the training in parallel on all cores.

# RandomForest()

tfv.fit(train['query'])
tfvm_train = tfv.transform(train['product_title']).toarray()
tfvm_test = tfv.transform(test['product_title']).toarray()

clf = RandomForestClassifier(n_estimators=200, n_jobs = -1, min_samples_split = 2, random_state = 1)
clf.fit(tfvm_train, y)
pd.Series(clf.predict(tfvm_test)).value_counts(dropna = False).sort_index()
1      372
2     1265
3     1344
4    19532
dtype: int64
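The hstack imported from scipy.sparse at the top is never used above; a minimal sketch of what it is for (assuming the two overlap columns engineered earlier, with missing values filled with 0): stack the hand-built features next to the tf-idf matrix before fitting.

from scipy.sparse import csr_matrix, hstack

# stack the sparse tf-idf features next to the two dense overlap columns
overlap = csr_matrix(train[["query_in_title", "query_in_description"]].fillna(0).values)
X_train = hstack([tfv.transform(train['product_title']), overlap]).tocsr()
clf.fit(X_train, y)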

3.2. An SVM typically has three parameters for us to tune:

See the documentation that ships with sklearn.

C is the penalty parameter of the error term; it trades off the smoothness of the decision boundary against accuracy on the training data. The smaller C, the smoother the model; the larger C, the more it tends to overfit. (A small C also softens the margin, so more training points end up as support vectors.)

gamma describes how far the influence of a single training example reaches (see the video, from minute 17 on). The smaller gamma, the smoother the model, approaching a linear one.

kernel selects among the different kernel functions.
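A quick illustration on a toy dataset (a hypothetical example, not the competition data) contrasting a small and a large gamma:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X_toy, y_toy = make_moons(noise=0.2, random_state=0)
for g in (0.01, 100):
    s = SVC(C=10, gamma=g).fit(X_toy, y_toy)
    # small gamma: smooth, near-linear boundary; large gamma: memorizes the training set
    print(g, s.score(X_toy, y_toy), len(s.support_))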


clf = SVC(C = 10, gamma = 0.01)
clf.fit(tfvm_train, y)
pd.Series(clf.predict(tfvm_test)).value_counts(dropna = False).sort_index()
2      106
3      127
4    22280
dtype: int64

3.3. How to choose C and gamma (GridSearchCV)

grid_params = {'C': [1e-2, 1e-1, 1, 10, 100], 'gamma': [1e-2, 1e-1, 1, 10, 100]}
clf = SVC()
gs_clf = GridSearchCV(clf, grid_params)
gs_clf.fit(tfvm_train, y)


# get the best estimator information
print(gs_clf.best_score_)
for param_name in sorted(grid_params.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
Reference: the sklearn tutorial Working with Text Data