python regular expression — pydata: Huiming's learning notes

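I don't use regular expressions very often, so every time I need one I end up googling how to write it. To save that effort, I am recording the patterns I use most here so I can find them quickly later. For more, see this video: Python regular expressions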

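Using a regular expression in Python usually takes two steps:

pattern = re.compile(): compiles the expression, given as a string, into a pattern object

Calling a method on the pattern object, such as findall() or split(), to apply it to the text, as in the examples further down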
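The compile-then-use workflow can be sketched as follows. The pattern and the sample text are made up for illustration; only the standard library re module is assumed.

```python
import re

# Step 1: compile the pattern string into a reusable pattern object.
word_pattern = re.compile(r"\w+")

# Step 2: call a method on the pattern object to apply it to some text.
text = "regex in two steps"
words = word_pattern.findall(text)   # all non-overlapping matches
first = word_pattern.search(text)    # first match, or None if there is none

print(words)          # ['regex', 'in', 'two', 'steps']
print(first.group())  # regex

# The module-level function does both steps in one call and gives the same result.
same = re.findall(r"\w+", text) == words
print(same)           # True
```

Compiling once is mainly a convenience when the same pattern is applied to many strings.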

Below are a few Python regular expressions that come up often in text analysis.

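1: re.compile(r"(?u)\b\w\w+\b"), the default token_pattern in CountVectorizer. It finds every word in a text document whose length is greater than 1, because \w\w+ requires at least two word characters.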

import re
token_pattern = re.compile(r"(?u)\b\w\w+\b")
token_pattern.findall("this is a good example")
Out[69]: ['this', 'is', 'good', 'example']
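The (?u) prefix is the inline form of the re.UNICODE flag. In Python 3, str patterns already match Unicode by default, so the prefix changes nothing there. The sketch below compares it with re.ASCII on a made-up sample string. The variable names are illustrative only.

```python
import re

text = "Python 正则表达式 rocks"  # made-up sample mixing ASCII and CJK text

default_pattern = re.compile(r"\b\w\w+\b")          # Unicode-aware by default for str
inline_u_pattern = re.compile(r"(?u)\b\w\w+\b")     # same pattern with explicit (?u)
ascii_pattern = re.compile(r"\b\w\w+\b", re.ASCII)  # \w limited to [a-zA-Z0-9_]

unicode_tokens = default_pattern.findall(text)
inline_tokens = inline_u_pattern.findall(text)
ascii_tokens = ascii_pattern.findall(text)

print(unicode_tokens)  # ['Python', '正则表达式', 'rocks']
print(inline_tokens)   # ['Python', '正则表达式', 'rocks']
print(ascii_tokens)    # ['Python', 'rocks']
```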

2: re.compile(r"(?u)\b\w+\b"), a variant that can be passed to CountVectorizer as token_pattern. It finds every word in a text document, including words of length 1.

token_pattern = re.compile(r'(?u)\b\w+\b')
token_pattern.findall("this is a good example")
Out[74]: ['this', 'is', 'a', 'good', 'example']
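To see the difference between the two token patterns more clearly, here is a small side-by-side sketch on a made-up sentence with punctuation, a contraction, and a digit. The variable names are hypothetical.

```python
import re

two_plus = re.compile(r"(?u)\b\w\w+\b")  # words of length >= 2
one_plus = re.compile(r"(?u)\b\w+\b")    # words of length >= 1

text = "I have a cat, it's 3 years old"

long_tokens = two_plus.findall(text)
all_tokens = one_plus.findall(text)

# Tokens that only the second pattern keeps: every single-character token.
dropped = [t for t in all_tokens if len(t) == 1]

print(long_tokens)  # ['have', 'cat', 'it', 'years', 'old']
print(all_tokens)   # ['I', 'have', 'a', 'cat', 'it', 's', '3', 'years', 'old']
print(dropped)      # ['I', 'a', 's', '3']
```

Note that the apostrophe is not a word character, so "it's" is split into "it" and "s" by both patterns, and the first pattern then drops the "s".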

3: Split text on a comma followed by whitespace (", "). For example, 'this is a bus, that is a car' is split into ['this is a bus', 'that is a car'].

reexp = re.compile(r',\s+')
reexp.split('this is a bus, that is a car')
Out[77]: ['this is a bus', 'that is a car']

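Note that \s+ requires one or more whitespace characters after the comma. The input 'this is a bus,that is a car' therefore gives a different result: there is no space after the comma, so the pattern never matches and the string comes back unsplit.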

reexp.split('this is a bus,that is a car')
Out[78]: ['this is a bus,that is a car']
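If the text may or may not have whitespace after the comma, one possible variation (not part of the original notes) is to use \s* instead, which allows zero or more whitespace characters. A minimal sketch with made-up inputs:

```python
import re

# \s* matches zero or more whitespace characters, so a bare comma also splits.
flexible = re.compile(r",\s*")

no_space = flexible.split("this is a bus,that is a car")
many_spaces = flexible.split("this is a bus,   that is a car")
mixed = flexible.split("a, b,c,  d")

print(no_space)     # ['this is a bus', 'that is a car']
print(many_spaces)  # ['this is a bus', 'that is a car']
print(mixed)        # ['a', 'b', 'c', 'd']
```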