nlpIR Part-of-Speech Tagging Symbols

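Reposted from: NLTK Learning, Part 2: Building a Part-of-Speech Tagger

For study purposes only; this post will be removed immediately if it infringes any rights.

Part-of-speech (POS) tagging analyses the components of a sentence by identifying the part of speech of each word. The table below briefly lists what the tags in a simplified POS tagset mean; for full tag listings see nltk.help.brown_tagset() (Brown corpus tags) or nltk.help.upenn_tagset() (Penn Treebank tags).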

Tag	Part of speech	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CONJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential "there"	there, there's
MOD	modal verb	will, can, would, may, must, should
NN	noun	year, home, costs, time
NNP	proper noun	April, China, Washington
NUM	numeral	fourth, 2016, 09:30
PRON	pronoun	he, they, us
P	preposition	on, over, with, of
TO	the word "to"	to
UH	interjection	ah, ha, oops
VB	verb	
VBD	verb, past tense	made, said, went
VBG	present participle	going, lying, playing
VBN	past participle	taken, given, gone
WH	wh-determiner	who, where, when, what

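Part-of-speech tagging with NLTK:

Example:

import nltk

# Requires the tokenizer and tagger models, which can be fetched with nltk.download()
sent = 'I am going to Wuhan University now'
tokens = nltk.word_tokenize(sent)   # split the sentence into word tokens

tagged_sent = nltk.pos_tag(tokens)  # tag each token with NLTK's default English tagger
print(tagged_sent)

Output: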

[('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'), ('Wuhan', 'NNP'), ('University', 'NNP'), ('now', 'RB')]
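
The tags in this output (PRP, VBP, RB and so on) come from the Penn Treebank tagset, not from the simplified table above. As a rough sketch, the hand-written dictionary below maps the tags that appear in this output onto the table's labels. The mapping and the function name are illustrative assumptions, not an NLTK API.

```python
# Output of nltk.pos_tag() copied from the example above.
tagged_sent = [('I', 'PRP'), ('am', 'VBP'), ('going', 'VBG'), ('to', 'TO'),
               ('Wuhan', 'NNP'), ('University', 'NNP'), ('now', 'RB')]

# Hypothetical, hand-written mapping from Penn Treebank tags to the
# simplified labels in the table; it covers only the tags seen here.
PENN_TO_SIMPLE = {
    'PRP': 'PRON',
    'VBP': 'VB',
    'VBG': 'VBG',
    'TO': 'TO',
    'NNP': 'NNP',
    'RB': 'ADV',
}

def simplify(tagged, mapping):
    """Replace each tag by its simplified label; unknown tags are kept as-is."""
    return [(word, mapping.get(tag, tag)) for word, tag in tagged]

simple_sent = simplify(tagged_sent, PENN_TO_SIMPLE)
print(simple_sent)
```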

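Tagged data in corpora

NLTK corpus readers provide the following methods for returning pre-tagged data.

Method	Description
tagged_words(fileids, categories)	Returns the tagged data as a list of (word, tag) pairs
tagged_sents(fileids, categories)	Returns the tagged data as a list of sentences
tagged_paras(fileids, categories)	Returns the tagged data as a list of paragraphs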
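
To make the three return shapes concrete, the sketch below uses a tiny hand-made paragraph structure (invented for illustration, not real corpus data) and derives the sentence-level and word-level views from it in plain Python.

```python
# Hypothetical tagged data shaped like the result of tagged_paras():
# a list of paragraphs, each a list of sentences, each a list of (word, tag).
paras = [
    [[('The', 'AT'), ('jury', 'NN'), ('said', 'VBD')],
     [('It', 'PPS'), ('agreed', 'VBD')]],
    [[('Nothing', 'PN'), ('changed', 'VBD')]],
]

# Shape of tagged_sents(): flatten paragraphs into a list of sentences.
sents = [sent for para in paras for sent in para]

# Shape of tagged_words(): flatten sentences into a list of (word, tag) pairs.
words = [pair for sent in sents for pair in sent]

print(len(paras), len(sents), len(words))
print(words[:3])
```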

Taggers:

1. Default tagger

The simplest part-of-speech tagger labels every word as a noun (NN). Its accuracy is low, so it has little value on its own.

Example:

import nltk
from nltk.corpus import brown

default_tagger = nltk.DefaultTagger('NN')
sents = 'I am going to Wuhan University now.'
# tag() expects a list of tokens; passing the raw string would tag each character
print(default_tagger.tag(nltk.word_tokenize(sents)))

tagged_sent = brown.tagged_sents(categories='news')
# Newer NLTK releases provide accuracy() and deprecate evaluate()
print(default_tagger.evaluate(tagged_sent))

Output:

[('I', 'NN'), ('am', 'NN'), ('going', 'NN'), ('to', 'NN'), ('Wuhan', 'NN'), ('University', 'NN'), ('now', 'NN'), ('.', 'NN')]
0.13089484257215028

As the result shows, the default tagger, which labels every word as a noun (NN), has low accuracy: only about 13.1%.
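
The accuracy figure is the fraction of tokens whose predicted tag equals the gold-standard tag. The sketch below recomputes that ratio in plain Python on a two-sentence toy sample (invented for illustration). It mirrors the idea behind evaluate() but is not NLTK's code, and the function names are hypothetical.

```python
def tag_all_as(tokens, tag='NN'):
    """A stand-in for a default tagger: give every token the same tag."""
    return [(token, tag) for token in tokens]

def accuracy(gold_sents, tag_fn):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        tokens = [word for word, _ in gold]
        predicted = tag_fn(tokens)
        for (_, gold_tag), (_, pred_tag) in zip(gold, predicted):
            correct += (gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0

# Hypothetical gold-standard sample using Brown-style tags.
gold_sents = [
    [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('.', '.')],
    [('I', 'PPSS'), ('like', 'VB'), ('music', 'NN'), ('.', '.')],
]

print(accuracy(gold_sents, tag_all_as))  # 2 of the 8 tokens are nouns
```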

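2. Rule-based tagger

Rules can improve a tagger's accuracy: for example, tag words ending in "ing" as VBG and words ending in "ed" as VBD. This can be implemented with a regular-expression tagger.

Example: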

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r'.*\'s$', 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),  # anything unmatched is still tagged NN
]
sents = 'I am going to Wuhan University.'

tagger = nltk.RegexpTagger(pattern)

print(tagger.tag(nltk.word_tokenize(sents)))

tagged_sents = brown.tagged_sents(categories='news')
print(tagger.evaluate(tagged_sents))

Output:

[('I', 'NN'), ('am', 'NN'), ('going', 'VBG'), ('to', 'NN'), ('Wuhan', 'NN'), ('University', 'NN'), ('.', 'NN')]
0.1875310778288283

As the result shows, the rule-based tagger is more accurate than the default tagger: about 18.8% versus about 13.1%.
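
The patterns are tried in order and the first match wins, which is why the possessive rule must come before the plain plural rule and the catch-all comes last. The following plain-Python sketch imitates that first-match behaviour with the re module. It illustrates the idea and is not the RegexpTagger source; the function name is hypothetical.

```python
import re

# Same rules as in the example above, in the same order.
PATTERNS = [
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*es$', 'VBZ'),
    (r".*'s$", 'NN$'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),   # catch-all: anything unmatched is tagged NN
]

def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(regex), tag) for regex, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(['going', 'walked', 'goes', "John's", 'cats', 'Wuhan']))
```

Swapping the two rules that end in "s" would tag "John's" as NNS, because the plural pattern also matches it.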

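3. Lookup-table tagger

Collect statistics on the parts of speech of some high-frequency words, for example the 100 most common words, and tag each word with the tag it most often receives, ignoring context. This is the idea behind the unigram model.

Example: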

import nltk
from nltk.corpus import  brown
sents = open('nothing_gonna_change_my_love_for_you.txt', encoding='utf-8-sig').read()  # utf-8-sig drops a leading byte-order mark
fdist = nltk.FreqDist(brown.words(categories='news'))
common_word = fdist.most_common(100)

cfdist = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
table = dict((word, cfdist[word].max()) for (word, _) in common_word)

uni_tagger = nltk.UnigramTagger(model=table, backoff=nltk.DefaultTagger('NN'))
print(uni_tagger.tag(nltk.word_tokenize(sents)))
tagged_sents = brown.tagged_sents(categories='news')
print(uni_tagger.evaluate(tagged_sents))

Output:

[('If', 'NN'), ('I', 'PPSS'), ('had', 'HVD'), ('to', 'TO'), ('live', 'NN'), ('my', 'NN'), ('life', 'NN'), ('without', 'NN'), ('you', 'NN'), ('near', 'NN'), ('me', 'NN'), ('The', 'AT'), ('days', 'NN'), ('would', 'MD'), ('all', 'ABN'), ('be', 'BE'), ('empty', 'NN'), ('The', 'AT'), ('nights', 'NN'), ('would', 'MD'), ('seem', 'NN'), ('so', 'NN'), ('long', 'NN'), ('With', 'NN'), ('you', 'NN'), ('I', 'PPSS'), ('see', 'NN'), ('forever', 'NN'), ('oh', 'NN'), ('so', 'NN'), ('clearly', 'NN'), ('I', 'PPSS'), ('might', 'NN'), ('have', 'HV'), ('been', 'BEN'), ('in', 'IN'), ('love', 'NN'), ('before', 'IN'), ('But', 'CC'), ('it', 'PPS'), ('never', 'NN'), ('felt', 'NN'), ('this', 'DT'), ('strong', 'NN'), ('Our', 'NN'), ('dreams', 'NN'), ('are', 'BER'), ('young', 'NN'), ('And', 'NN'), ('we', 'PPSS'), ('both', 'NN'), ('know', 'NN'), ('they', 'PPSS'), ("'ll", 'NN'), ('take', 'NN'), ('us', 'NN'), ('where', 'NN'), ('we', 'PPSS'), ('want', 'NN'), ('to', 'TO'), ('go', 'NN'), ('Hold', 'NN'), ('me', 'NN'), ('now', 'NN'), ('Touch', 'NN'), ('me', 'NN'), ('now', 'NN'), ('I', 'PPSS'), ('do', 'NN'), ("n't", 'NN'), ('want', 'NN'), ('to', 'TO'), ('live', 'NN'), ('without', 'NN'), ('you', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('[', 'NN'), ('2', 'NN'), (']', 'NN'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('One', 'NN'), ('thing', 'NN'), ('you', 'NN'), ('can', 'MD'), ('be', 'BE'), ('sure', 'NN'), ('of', 'IN'), ('I', 'PPSS'), ("'ll", 'NN'), ('never', 'NN'), ('ask', 'NN'), ('for', 'IN'), ('more', 'AP'), ('than', 'IN'), ('your', 'NN'), ('love', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), 
('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('The', 'AT'), ('world', 'NN'), ('may', 'NN'), ('change', 'NN'), ('my', 'NN'), ('whole', 'NN'), ('life', 'NN'), ('through', 'NN'), ('But', 'CC'), ('nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('If', 'NN'), ('the', 'AT'), ('road', 'NN'), ('ahead', 'NN'), ('[', 'NN'), ('3', 'NN'), (']', 'NN'), ('is', 'BEZ'), ('not', '*'), ('so', 'NN'), ('easy', 'NN'), (',', ','), ('Our', 'NN'), ('love', 'NN'), ('will', 'MD'), ('lead', 'NN'), ('the', 'AT'), ('way', 'NN'), ('for', 'IN'), ('us', 'NN'), ('Just', 'NN'), ('like', 'NN'), ('a', 'AT'), ('guiding', 'NN'), ('star', 'NN'), ('I', 'PPSS'), ("'ll", 'NN'), ('be', 'BE'), ('there', 'EX'), ('for', 'IN'), ('you', 'NN'), ('if', 'NN'), ('you', 'NN'), ('should', 'NN'), ('need', 'NN'), ('me', 'NN'), ('You', 'NN'), ('do', 'NN'), ("n't", 'NN'), ('have', 'HV'), ('to', 'TO'), ('change', 'NN'), ('a', 'AT'), ('thing', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('just', 'NN'), ('the', 'AT'), ('way', 'NN'), ('you', 'NN'), ('are', 'BER'), ('So', 'NN'), ('come', 'NN'), ('with', 'IN'), ('me', 'NN'), ('and', 'CC'), ('share', 'NN'), ('the', 'AT'), ('view', 'NN'), ('I', 'PPSS'), ("'ll", 'NN'), ('help', 'NN'), ('you', 'NN'), ('see', 'NN'), ('forever', 'NN'), ('too', 'NN'), ('Hold', 'NN'), ('me', 'NN'), ('now', 'NN'), ('Touch', 'NN'), ('me', 'NN'), ('now', 'NN'), ('I', 'PPSS'), ('do', 'NN'), ("n't", 'NN'), ('want', 'NN'), ('to', 'TO'), ('live', 'NN'), ('without', 'NN'), ('you', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('One', 'NN'), ('thing', 'NN'), ('you', 'NN'), ('can', 'MD'), ('be', 'BE'), 
('sure', 'NN'), ('of', 'IN'), ('I', 'PPSS'), ("'ll", 'NN'), ('never', 'NN'), ('ask', 'NN'), ('for', 'IN'), ('more', 'AP'), ('than', 'IN'), ('your', 'NN'), ('love', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('The', 'AT'), ('world', 'NN'), ('may', 'NN'), ('change', 'NN'), ('my', 'NN'), ('whole', 'NN'), ('life', 'NN'), ('through', 'NN'), ('But', 'CC'), ('nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('The', 'AT'), ('world', 'NN'), ('may', 'NN'), ('change', 'NN'), ('my', 'NN'), ('whole', 'NN'), ('life', 'NN'), ('through', 'NN'), ('But', 'CC'), ('nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('One', 'NN'), ('thing', 'NN'), ('you', 'NN'), ('can', 'MD'), ('be', 'BE'), ('sure', 'NN'), ('of', 'IN'), ('I', 'PPSS'), ("'ll", 'NN'), ('never', 'NN'), ('ask', 'NN'), ('for', 'IN'), ('more', 'AP'), ('than', 'IN'), ('your', 'NN'), ('love', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), 
('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('The', 'AT'), ('world', 'NN'), ('may', 'NN'), ('change', 'NN'), ('my', 'NN'), ('whole', 'NN'), ('life', 'NN'), ('through', 'NN'), ('But', 'CC'), ('nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('One', 'NN'), ('thing', 'NN'), ('you', 'NN'), ('can', 'MD'), ('be', 'BE'), ('sure', 'NN'), ('of', 'IN'), ('I', 'PPSS'), ("'ll", 'NN'), ('never', 'NN'), ('ask', 'NN'), ('for', 'IN'), ('more', 'AP'), ('than', 'IN'), ('your', 'NN'), ('love', 'NN'), ('Nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN'), ('You', 'NN'), ('ought', 'NN'), ('to', 'TO'), ('know', 'NN'), ('by', 'IN'), ('now', 'NN'), ('how', 'NN'), ('much', 'NN'), ('I', 'PPSS'), ('love', 'NN'), ('you', 'NN'), ('The', 'AT'), ('world', 'NN'), ('may', 'NN'), ('change', 'NN'), ('my', 'NN'), ('whole', 'NN'), ('life', 'NN'), ('through', 'NN'), ('But', 'CC'), ('nothing', 'NN'), ("'s", 'NN'), ('gon', 'NN'), ('na', 'NN'), ('change', 'NN'), ('my', 'NN'), ('love', 'NN'), ('for', 'IN'), ('you', 'NN')]
0.5817769556656125

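Using frequency statistics for only the 100 most common words already yields about 58% accuracy. Enlarging the lookup table raises accuracy further; with 8,000 words it can reach about 90%. Here, any word outside these 100 backs off to the default tagger.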
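
The lookup table maps each frequent word to its most frequent tag, and everything else falls through to the default tag. The sketch below builds such a table with collections.Counter from a small invented sample of tagged words. The sample, the function names, and the table size of 3 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical tagged words standing in for brown.tagged_words(...).
tagged_words = [
    ('the', 'AT'), ('race', 'NN'), ('to', 'TO'), ('race', 'VB'),
    ('the', 'AT'), ('race', 'NN'), ('to', 'IN'), ('the', 'AT'),
    ('to', 'TO'), ('city', 'NN'),
]

def build_lookup_table(tagged, size):
    """Map each of the `size` most frequent words to its most frequent tag."""
    word_freq = Counter(word for word, _ in tagged)
    tags_by_word = defaultdict(Counter)
    for word, tag in tagged:
        tags_by_word[word][tag] += 1
    return {word: tags_by_word[word].most_common(1)[0][0]
            for word, _ in word_freq.most_common(size)}

def lookup_tag(tokens, table, default='NN'):
    """Use the table when the word is known, otherwise back off to `default`."""
    return [(token, table.get(token, default)) for token in tokens]

table = build_lookup_table(tagged_words, size=3)
print(table)
print(lookup_tag(['to', 'race', 'the', 'moon'], table))
```

Because the table ignores context, "race" always receives its majority tag NN here, even after "to", where VB would be correct.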

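Training N-gram taggers:

General N-gram taggers:

The unigram tagger is a 1-gram tagger. Taking more context into account gives 2-gram and 3-gram taggers, collectively called N-gram taggers. Note that a longer context does not necessarily improve accuracy, because longer contexts occur less often in the training data.
Besides supplying a lookup model to an N-gram tagger, the other way to build a tagger is to train it. The constructor of an N-gram tagger such as UnigramTagger has the form __init__(train=None, model=None, backoff=None), so tagged corpus data can be passed as training data to build the tagger.

Example: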

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)
x_train = brown_tagged_sents[0: train_num]
x_test = brown_tagged_sents[train_num:]
tagger = nltk.UnigramTagger(train=x_train)
print(tagger.evaluate(x_test))

Output:

0.8121200039868434

For the UnigramTagger, training on 90% of the data and testing on the remaining 10% gives an accuracy of about 81%.

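Combining taggers

The backoff parameter chains several taggers together to improve accuracy: when one tagger cannot assign a tag, the next one in the chain is consulted.

Example: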

import nltk
from nltk.corpus import brown

pattern = [
    (r'.*ing$','VBG'),
    (r'.*ed$','VBD'),
    (r'.*es$','VBZ'),
    (r'.*\'s$','NN$'),
    (r'.*s$','NNS'),
    (r'.*', 'NN')  # anything unmatched is still tagged NN
]

brown_tagged_sents = brown.tagged_sents(categories='news')
train_num = int(len(brown_tagged_sents) * 0.9)

x_train = brown_tagged_sents[0: train_num]
x_test = brown_tagged_sents[train_num:]
t0 = nltk.RegexpTagger(pattern)
t1 = nltk.UnigramTagger(x_train, backoff=t0)
t2 = nltk.BigramTagger(x_train, backoff=t1)
print(t2.evaluate(x_test))

Output:

0.8627529153792485

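As this shows, part-of-speech tagging can be made to work reasonably well from statistics alone, without relying on linguistic knowledge.
For Chinese, an N-gram tagger can be trained in the same way as long as a tagged corpus is available.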
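
The backoff chain can be pictured as a series of lookups that fall through on a miss. The sketch below is a plain-Python approximation trained on an invented three-sentence sample: a bigram table keyed on (previous tag, word) is consulted first, then a unigram table keyed on the word, then a default tag. It simplifies what NLTK does (for instance, it has no regular-expression stage), and all names in it are hypothetical.

```python
from collections import Counter, defaultdict

def best(counts):
    """Return the most frequent key of a Counter."""
    return counts.most_common(1)[0][0]

def train_tables(tagged_sents):
    """Count tags per word (unigram) and per (previous tag, word) (bigram)."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev_tag = None                      # None marks the sentence start
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev_tag, word)][tag] += 1
            prev_tag = tag
    return ({w: best(c) for w, c in uni.items()},
            {k: best(c) for k, c in bi.items()})

def backoff_tag(tokens, uni, bi, default='NN'):
    """Try the bigram table, then the unigram table, then the default tag."""
    tagged, prev_tag = [], None
    for token in tokens:
        tag = bi.get((prev_tag, token)) or uni.get(token) or default
        tagged.append((token, tag))
        prev_tag = tag
    return tagged

# Hypothetical training sample with Brown-style tags.
train = [
    [('they', 'PPSS'), ('race', 'VB'), ('cars', 'NNS')],
    [('the', 'AT'), ('race', 'NN'), ('ended', 'VBD')],
    [('the', 'AT'), ('race', 'NN'), ('began', 'VBD')],
]
uni, bi = train_tables(train)
print(backoff_tag(['they', 'race', 'bikes'], uni, bi))
print(backoff_tag(['a', 'race'], uni, bi))
```

In the first call the bigram context (PPSS, "race") selects VB even though NN is the more frequent tag for "race" overall; in the second call the unseen context falls back to the unigram choice NN.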

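nltk.tag.BrillTagger implements transformation-based tagging: starting from the output of a base tagger, it applies rule-based corrections to achieve higher accuracy.

Training a Chinese tagger

Example: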

import nltk
import json

all_tagged_sents = []

# Each line of the corpus is expected to hold whitespace-separated word/tag items
with open('199801.txt', encoding='utf-8') as corpus_file:
    for line in corpus_file:
        # str2tuple splits 'word/tag' into (word, TAG) and upper-cases the tag
        tagged_sent = [nltk.str2tuple(item) for item in line.split()]
        if tagged_sent:
            all_tagged_sents.append(tagged_sent)

train_size = int(len(all_tagged_sents) * 0.8)
x_train = all_tagged_sents[:train_size]
x_test = all_tagged_sents[train_size:]
# Note: the backoff tag 'n' is lower-case while str2tuple upper-cases the corpus
# tags, so backed-off tokens are never counted as correct by evaluate()
tagger = nltk.UnigramTagger(train=x_train, backoff=nltk.DefaultTagger('n'))

# The input is already segmented with spaces, so a plain split() is enough
tokens = '我 認為 不丹 的 被動 捲入 不 構成 此次 對峙 的 主要 因素。'.split()
tagged = tagger.tag(tokens)
print(json.dumps(tagged, ensure_ascii=False))
print(tagger.evaluate(x_test))

Output:

[["我", "R"], ["認為", "V"], ["不丹", "n"], ["的", "U"], ["被動", "A"], ["捲入", "V"], ["不", "D"], ["構成", "V"], ["此次", "R"], ["對峙", "V"], ["的", "U"], ["主要", "B"], ["因素。", "n"]]
0.8714095491725319
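
The training loop above relies on each corpus line being a sequence of whitespace-separated word/tag items. As a minimal sketch of that parsing step without NLTK, the function below splits on the last slash and upper-cases the tag, which mimics what nltk.str2tuple does. The sample line is invented and is not taken from 199801.txt, and the function names are hypothetical.

```python
def parse_item(item, sep='/'):
    """Split 'word/tag' on the last separator; the tag is upper-cased."""
    pos = item.rfind(sep)
    if pos >= 0:
        return item[:pos], item[pos + len(sep):].upper()
    return item, None   # no separator: the tag is unknown

def parse_line(line):
    """Turn one corpus line into a list of (word, TAG) pairs."""
    return [parse_item(item) for item in line.split()]

# Hypothetical corpus line in word/tag format.
sample_line = '我/r 認為/v 不丹/ns 的/u 主要/b 因素/n 。/w\n'
tagged_sent = parse_line(sample_line)
print(tagged_sent)
```

Splitting on the last separator keeps tokens that themselves contain a slash, such as a fraction, in one piece.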

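This article is a repost; we respect the original author's copyright. If it contains errors or infringes any rights, the original author is welcome to contact us to have the content corrected or the article removed.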
