Processing Text Documents

This article works through a Kaggle competition dataset, CrowdFlower, and covers:
- counting word occurrences (CountVectorizer, TfidfTransformer)
- training classifiers
- Pipeline for automation (to be extended later so that different columns get different transformations; a sketch appears at the end of this article)
- grid search to select the best parameters (GridSearchCV)
import numpy as np
from scipy.sparse import hstack
import pandas as pd
import re
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# Load the training and test files
train = pd.read_csv("H:\\python\\kaggle\\CrowdFlower\\train.csv").fillna("")
test = pd.read_csv("H:\\python\\kaggle\\CrowdFlower\\test.csv").fillna("")
# Drop the ID columns
idx = test.id.values.astype(int)
train, test = train.drop('id', axis=1), test.drop('id', axis=1)
# Create the labels and drop the relevance target and variance columns
y = train.median_relevance.values
train = train.drop(['median_relevance', 'relevance_variance'], axis=1)
train.head(5)
|   | query | product_title | product_description |
|---|---|---|---|
| 0 | bridal shower decorations | Accent Pillow with Heart Design - Red/Black | Red satin accent pillow embroidered with a hea... |
| 1 | led christmas lights | Set of 10 Battery Operated Multi LED Train Chr... | Set of 10 Battery Operated Train Christmas Lig... |
| 2 | projector | ViewSonic Pro8200 DLP Multimedia Projector | |
| 3 | wine rack | Concept Housewares WR-44526 Solid-Wood Ceiling... | Like a silent and sturdy tree, the Southern En... |
| 4 | light bulb | Wintergreen Lighting Christmas LED Light Bulb ... | WTGR1011\nFeatures\nNickel base, 60,000 averag... |
# Compute the size of the overlap between the query tokens and the title/description tokens
token_pattern = re.compile(u'(?u)\\b\\w\\w+\\b')
for i, row in train.iterrows():
    query = set(x.lower() for x in token_pattern.findall(row["query"]))
    title = set(x.lower() for x in token_pattern.findall(row["product_title"]))
    description = set(x.lower() for x in token_pattern.findall(row["product_description"]))
    if len(title) > 0:
        train.loc[i, "query_in_title"] = len(query.intersection(title)) / len(title)
    if len(description) > 0:
        train.loc[i, "query_in_description"] = len(query.intersection(description)) / len(description)
1. Count how many times each word occurs in the queries (occurrence count).
For example, 'ecco' occurs 29 times in total across the queries; in Python,
[x for x in train["query"] if 'ecco' in x.lower()]
lists all rows whose query contains 'ecco'.
1.1. Count word frequencies over the queries, then look up how often 'ecco' occurs
cnt = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b')  # create the vectorizer
cntn = cnt.fit_transform(train["query"]).toarray()  # fit and transform the data
print(cntn.shape)  # (10158, 487)
print(cnt.vocabulary_.get('ecco'))  # 'ecco' is the 142nd word in the sorted query vocabulary (141 counting from 0)
sum(cntn[:, cnt.vocabulary_.get('ecco')])  # total number of occurrences of 'ecco': 29
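As a side note, ranking the whole vocabulary by total count shows which terms dominate the queries; a minimal sketch reusing cnt and cntn from above (the top-10 cut-off is an arbitrary choice):

totals = cntn.sum(axis=0)  # total count of each vocabulary word across all queries
vocab = np.array(sorted(cnt.vocabulary_, key=cnt.vocabulary_.get))  # words in column order
print(vocab[np.argsort(totals)[::-1][:10]])  # the 10 most frequent query terms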
1.2. The same statistics can be computed for product_title and product_description
cnt.fit_transform(train["product_title"])
cnt.fit_transform(train['product_description'])
<10158x26155 sparse matrix of type '<type 'numpy.int64'>'
with 432648 stored elements in Compressed Sparse Row format>
Clearly product_description contains far too many distinct words: the resulting sparse matrix has shape 10158x26155. We can keep only the 200 most frequent words as features (max_features=200), and additionally require a word to appear in at least 20 documents (min_df=20).
cnt = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b', max_features=200, min_df=20)  # cap the vocabulary size
cnt.fit_transform(train['product_description'])
<10158x200 sparse matrix of type '<type 'numpy.int64'>'
with 158472 stored elements in Compressed Sparse Row format>
2. Computing TF-IDF features with TfidfVectorizer()
Using raw occurrence counts alone has a problem: longer documents tend to have higher counts. To avoid this, we can use tf (term frequency): divide the number of occurrences of each word in a document by the total number of words in the document. A further refinement on top of tf downweights words that appear in many documents; sklearn calls the combination tf-idf: "Term Frequency times Inverse Document Frequency".
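The outline above also mentions TfidfTransformer: it applies this same reweighting to a count matrix you already have, so CountVectorizer followed by TfidfTransformer is equivalent to TfidfVectorizer. A minimal sketch (illustrative, reusing the same token pattern):

from sklearn.feature_extraction.text import TfidfTransformer

counts = CountVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b').fit_transform(train["query"])
tfidf = TfidfTransformer().fit_transform(counts)  # reweight raw counts into tf-idf
print(tfidf.shape)  # same shape as the count matrix, now holding tf-idf weights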
The example below learns the vocabulary and idf weights from train['query'] and then computes TF-IDF features for train['product_title']:
tfv = TfidfVectorizer(token_pattern=u'(?u)\\b\\w\\w+\\b', max_features=200, min_df=20)  # create the TfidfVectorizer
tfv.fit(train['query'])
tfvm = tfv.transform(train['product_title']).toarray()
tfv.vocabulary_.get('zippo')
sum(tfvm[:, tfv.vocabulary_.get('zippo')]) # 47.881802215754419
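scipy.sparse.hstack (imported at the top) is handy for gluing these text features to the hand-made overlap features from earlier; a sketch, assuming the query_in_title / query_in_description columns were filled in by the loop above (rows the loop skipped are filled with 0):

extra = train[["query_in_title", "query_in_description"]].fillna(0).values
X = hstack([tfv.transform(train['product_title']), extra]).tocsr()
print(X.shape)  # tf-idf columns plus the 2 overlap columns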
3. Build an estimator and fit the data
3.1. Build a model using RandomForest as an example; n_jobs = -1 runs the fit in parallel on all cores.
# RandomForest()
tfv.fit(train['query'])
tfvm_train = tfv.transform(train['product_title']).toarray()
tfvm_test = tfv.transform(test['product_title']).toarray()
clf = RandomForestClassifier(n_estimators=200, n_jobs = -1, min_samples_split = 2, random_state = 1)
clf.fit(tfvm_train, y)
pd.Series(clf.predict(tfvm_test)).value_counts(dropna = False).sort_index()
1 372
2 1265
3 1344
4 19532
dtype: int64
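The predicted class distribution above is heavily skewed toward label 4, so it is worth checking performance with cross-validation before trusting the model; a minimal sketch (cross_val_score is standard sklearn; the 5-fold choice is arbitrary):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, tfvm_train, y, cv=5)  # 5-fold accuracy
print(scores.mean(), scores.std())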
3.2. SVM usually exposes three parameters for us to tune
(see sklearn's own documentation):
- C is the penalty parameter of the error term, trading off model complexity against accuracy on the training data. The smaller C is, the smoother the model; the larger C is, the more it tends to overfit. A smaller C also lets more sample points become support vectors (a toy demonstration follows this list).
- gamma describes how far the influence of a single training point reaches (see the video, from minute 17 onward). The smaller gamma is, the smoother the model, approaching a linear one.
- kernel selects among different kernel functions.
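To see the effect of C on the number of support vectors, a toy sketch on synthetic data (make_classification is a stand-in dataset here, not the competition data):

from sklearn.datasets import make_classification

Xd, yd = make_classification(n_samples=300, random_state=0)
for C in (0.01, 1, 100):
    svc = SVC(C=C, gamma=0.1).fit(Xd, yd)
    print(C, svc.n_support_.sum())  # support-vector count typically shrinks as C grows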
clf = SVC(C = 10, gamma = 0.01)
clf.fit(tfvm_train, y)
pd.Series(clf.predict(tfvm_test)).value_counts(dropna = False).sort_index()
2 106
3 127
4 22280
dtype: int64
3.3. How to choose C and gamma (GridSearchCV)
grid_params = {'C': [1e-2, 1e-1, 1, 10, 100], 'gamma': [1e-2, 1e-1, 1, 10, 100]}
clf = SVC()
gs_clf = GridSearchCV(clf, grid_params)
gs_clf.fit(tfvm_train, y)
# Get the best estimator information
best_parameters, score = gs_clf.best_params_, gs_clf.best_score_
for param_name in sorted(grid_params.keys()):
    print("%s: %r" % (param_name, best_parameters[param_name]))