CN108319584A

CN108319584A - New word discovery method for microblog-type short texts based on an improved FP-Growth algorithm

Info

Publication number: CN108319584A
Application number: CN201810058993.6A
Authority: CN
Inventors: 刘磊; 贾亚璐; 孙孟涛; 陈浩; 李静
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2018-01-22
Filing date: 2018-01-22
Publication date: 2018-07-24

Abstract

The present invention discloses a new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm, comprising: obtaining a text corpus and preprocessing it with jieba (word segmentation, part-of-speech tagging, and so on); obtaining frequent itemsets of words with the optimized FP-Growth algorithm and restoring the word order within each frequent item; obtaining repeated strings with an N-grams model and intersecting them with the frequent itemsets; filtering by part of speech to remove parts of speech that rarely take part in word formation; filtering candidate new words with improved mutual information, computed iteratively in a sliding manner; filtering once more with a part-of-speech combination rule library; and verifying the validity of the new words obtained by the method.

Description

A new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm

Technical field

The invention belongs to the field of text information processing, and specifically relates to a new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm.

Background art

Microblogging is one of the most popular social platforms worldwide. Every day users publish a large volume of text on microblogs, which has made them one of the main sources of new Internet words.

Microblog text differs from general text in that it is short: each post does not exceed 140 characters, the content is casual, and the forms are diverse. Short text of this kind is therefore relatively difficult to study. Nevertheless, the knowledge contained in massive microblog text is of great significance to research in fields such as public opinion monitoring and new word discovery.

Current research on new word discovery is mainly based on recognizing named entities such as person names, place names, and organization names in traditional text; research on new word discovery in microblog short text is comparatively scarce. Moreover, because microblog text is short and irregular, traditional new word discovery methods perform unsatisfactorily on microblog-type short text.

The FP-Growth algorithm obtains the frequent itemsets in data by scanning the database twice. It is an efficient algorithm for mining frequent itemsets and can be used to obtain new words, but it has shortcomings when applied to microblog-type short text: the traditional FP-Growth algorithm ignores the influence of part of speech on word formation. An improved FP-Growth algorithm is therefore proposed, and new words are discovered by combining it with an N-grams model, improved mutual information, and rules.
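As background for the two scans mentioned above, the following is a minimal sketch of how a plain FP-tree is built. It shows the classical, frequency-only construction, not the improved variant of this invention; the names `FPNode`, `build_fp_tree`, and `min_count` are illustrative assumptions, and each transaction would correspond to one segmented sentence.

```python
from collections import Counter


class FPNode:
    """One node of the FP-tree: an item, its count, and its children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_count):
    # First scan: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {item: c for item, c in freq.items() if c >= min_count}

    # Second scan: insert each transaction, keeping only frequent items,
    # sorted by descending frequency (ties broken by item for determinism).
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


# Toy transactions (hypothetical data).
transactions = [["a", "b"], ["b", "c"], ["a", "b", "c"], ["b"]]
root, frequent = build_fp_tree(transactions, min_count=2)
```

Because items are sorted by frequency before insertion, the original word order is lost inside the tree, which is why the method later needs an order-correction step.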

Summary of the invention

To address the shortcomings of the FP-Growth algorithm in new word discovery for microblog-type short text, an improved FP-Growth algorithm is proposed that takes part of speech into account. Frequent items can then not only express the association between words effectively, but also reduce the recognition difficulty caused by part-of-speech imbalance. An ensemble approach that combines the algorithm with an N-grams model improves the accuracy of the new words obtained, and the candidates are further filtered by part of speech, improved mutual information, and a part-of-speech combination rule library.

To achieve the above object, the present invention adopts the following technical solution:

A new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm, comprising the following steps:

Step (1): microblog corpus acquisition and preprocessing

The microblog corpus is obtained through the microblog API or a web crawler and stored as files in HTML format. Regular-expression matching is applied to the files to extract the text, URLs are deleted, and the text is split into sentences at punctuation marks. The resulting plain text is segmented into words and tagged with parts of speech using jieba, a third-party Python module, giving the preprocessed corpus, denoted G.

Step (2): process G with the optimized FP-Growth algorithm to obtain the frequent itemset C_fp

Step (2.1): process the microblog corpus G and build the improved FP-Growth model, which combines the two factors of word frequency and part of speech. The part-of-speech relative probability value is calculated as follows:

where f(w | pos(w)=a) denotes the part-of-speech relative probability value of word w when its part of speech is a, n_a denotes the frequency of words in corpus G whose part of speech is a, N denotes the total word frequency in corpus G, and n_{(w | pos(w)=a)} denotes the frequency of word w when its part of speech is a.

When building frequent itemsets, repeated strings satisfying f(w | pos(w)=a) > α₁ are selected as the candidate frequent itemset R_fp, where α₁ is the preset minimum support.

Step (2.2): correct the order of the obtained frequent itemset R_fp. The words within a frequent item obtained by the FP-Growth algorithm are unordered, so their order is compared against the original corpus to obtain the ordered frequent itemset C_fp.
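The order correction of step (2.2) can be sketched as follows. This is only one possible reading: it assumes that the words of a frequent item must appear contiguously in some sentence of the corpus, and the function name `restore_order` and the toy tokens are hypothetical.

```python
def restore_order(itemset, sentences):
    """Return the words of `itemset` in the order in which they appear
    contiguously in the corpus, or None if no sentence contains them so."""
    n = len(itemset)
    target = sorted(itemset)
    for tokens in sentences:
        # Slide a window of n tokens over the sentence.
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if sorted(window) == target:
                return tuple(window)
    return None


# Toy corpus of segmented sentences (hypothetical data, English glosses).
sentences = [["today", "anger", "rancours", "again"], ["weather", "very", "good"]]
ordered = restore_order({"rancours", "anger"}, sentences)
```

Returning a tuple rather than a set keeps the word order, so the result can be compared directly with ordered N-gram candidates.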

Step (3): obtain the new word candidate set C_grams using the N-grams model

Count the number of times N words occur together in the corpus; the N-grams model gives the frequency P(w₁, w₂, w₃, ..., w_n) with which the words co-occur. N-gram repeated strings satisfying α₂ < P(w₁, w₂, w₃, ..., w_n) < β₂ are selected as the new word candidate set C_grams, where α₂ and β₂ are co-occurrence frequency thresholds.
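A minimal sketch of this step for N = 2, using the threshold values α₂ = 10 and β₂ = 2000 given in the embodiment as defaults. The function name `bigram_candidates` is an assumption, and P is treated here as a raw co-occurrence count.

```python
from collections import Counter


def bigram_candidates(sentences, alpha2=10, beta2=2000):
    """Return the adjacent word pairs whose count lies strictly
    between alpha2 and beta2."""
    counts = Counter()
    for tokens in sentences:
        # zip(tokens, tokens[1:]) yields every adjacent pair in order.
        counts.update(zip(tokens, tokens[1:]))
    return {bg for bg, c in counts.items() if alpha2 < c < beta2}


# Toy corpus (hypothetical data): ("a", "b") occurs 16 times, ("b", "c") 11 times.
sentences = [["a", "b", "c"]] * 11 + [["a", "b"]] * 5
candidates = bigram_candidates(sentences, alpha2=10, beta2=15)
```

The upper bound removes very frequent pairs, which tend to be ordinary function-word sequences rather than new words.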

Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidates C1 = {c₁, c₂, …, c_m}, c_i = (w₁, w₂, ..., w_n), where c_i denotes a candidate new word and w_j denotes an original word that makes up the new word.

The optimized FP-Growth algorithm takes part of speech into account, so that frequent items not only express the association between original words effectively but also reduce the recognition difficulty caused by part-of-speech imbalance. In addition, the ensemble approach of intersecting the candidate set C_grams obtained by the N-grams algorithm with the frequent itemset C_fp obtained by the optimized FP-Growth algorithm improves the accuracy of the new words obtained.
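The intersection of step (4) reduces to a set operation once both candidate sets hold ordered word tuples. A small sketch with hypothetical data and a hypothetical helper name:

```python
def intersect_candidates(c_fp, c_grams):
    """Keep only the ordered word tuples present in both candidate sets."""
    return set(c_fp) & set(c_grams)


# Hypothetical candidates; note the reversed pair in c_grams.
c_fp = {("anger", "rancours"), ("reason", "is")}
c_grams = {("anger", "rancours"), ("is", "reason")}
c1 = intersect_candidates(c_fp, c_grams)
```

The reversed pair does not match, which illustrates why the frequent itemsets need their word order restored before the intersection is taken.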

Step (5): in the new word candidates C1, use the part-of-speech tags to screen out candidates containing words whose part of speech belongs to the filtered set, obtaining the new word candidate set C2

The filtered part-of-speech set includes:

The new word candidate set C1 obtained in step (4) is filtered according to the above parts of speech to obtain the new word candidate set C2.
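A sketch of the part-of-speech filter, using the sixteen tags that the embodiment lists for the filtered set. Representing a candidate as a tuple of (word, tag) pairs and matching tags exactly are assumptions of this sketch; subtype tags such as /uj would need prefix matching, which is not shown.

```python
# Tags of the filtered part-of-speech set as listed in the embodiment.
FILTERED_POS = {"b", "c", "dg", "e", "m", "o", "p", "q",
                "r", "s", "tg", "u", "w", "x", "y", "z"}


def filter_by_pos(candidates):
    """Drop every candidate that contains a word with a filtered tag.

    Each candidate is a tuple of (word, pos) pairs.
    """
    return [c for c in candidates
            if not any(pos in FILTERED_POS for _, pos in c)]


# Hypothetical candidates with English glosses.
c1 = [
    (("anger", "vg"), ("rancours", "v")),
    (("of", "u"), ("people", "n")),
    (("three", "m"), ("days", "q")),
]
c2 = filter_by_pos(c1)
```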

Step (6): filter the new word candidate set C2 using improved mutual information to obtain the new word candidate set C3. Let c_i = (w₁, w₂, ..., w_n), c_i ∈ C2. The improved mutual information is computed over each c_i in a sliding manner; the improved mutual information formula is as follows:

where p(w_i, w_{i+1}) denotes the frequency with which words w_i and w_{i+1} occur together, p(w_i) denotes the frequency of word w_i, w_{i,i+1} denotes the weight with which word w_i combines with its neighbouring word w_{i+1} to form a word, n_{pos(w_i, w_{i+1})} denotes the frequency of the part-of-speech combination of the co-occurring words w_i and w_{i+1}, and the remaining term denotes the frequency with which the part of speech of word w_i occurs. Among all frequent itemsets, words satisfying I(w_i, w_{i+1}) > β₃ are selected as the new word set C = {c₁, c₂, c₃, ..., c_m}, where each new word c_i = (w₁, w₂, w₃, ..., w_n) and β₃ is a preset threshold.
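To illustrate the sliding computation, the sketch below uses standard pointwise mutual information, log2(p(a, b) / (p(a) p(b))), as a stand-in. It is not the improved formula of this invention, since it omits the part-of-speech weight w_{i,i+1}; the function names and toy data are assumptions.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Build a standard PMI function from unigram and bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def pmi(a, b):
        if bigrams[(a, b)] == 0:
            return float("-inf")  # the pair never co-occurs
        p_ab = bigrams[(a, b)] / n_bi
        return math.log2(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

    return pmi


def passes_sliding_pmi(candidate, pmi, threshold):
    """Slide over adjacent word pairs; every pair must exceed the threshold."""
    return all(pmi(a, b) > threshold for a, b in zip(candidate, candidate[1:]))


# Toy corpus (hypothetical data).
sentences = [["a", "b"], ["a", "b"], ["c", "d"], ["a", "d"]]
pmi = pmi_table(sentences)
```

With this toy corpus, the pair ("a", "b") scores higher than ("a", "d") because it co-occurs twice as often relative to the same unigram frequencies.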

Step (7): filter the candidate new word set C3 with the part-of-speech combination filtering rule library R to obtain the final new word set C4

Let c_i = (w₁, w₂, ..., w_n), c_i ∈ C3. For each c_i and any adjacent pair (w_i, w_{i+1}) with part-of-speech combination (pos(w_i), pos(w_{i+1})), if the combination matches any rule in the part-of-speech combination filtering rule library R, the new word c_i is filtered out. The remaining candidates form the new word set C4.

The part-of-speech combination rule library R consists of the following rules:

Filtering rule one: /ns /v (nr or nz may take the place of ns);

Filtering rule two: /ns /ns (nr or nz may take the place of ns);

Filtering rule three: /n /v or /vn /v;

Filtering rule four: /t /t;

Filtering rule five: /t /nr;

Filtering rule six: /t /f (vn, n, l, or f may take the place of t);

Filtering rule seven: /v /t;

Filtering rule eight: /t /v;

Filtering rule nine: /ns /j.

Description of the drawings

Fig. 1 is a flow chart of the new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm.

Detailed description of the embodiments

The specific embodiments of the present invention are described in further detail below with reference to the drawing and examples. The following examples illustrate the present invention but do not limit its scope.

As shown in Fig. 1, the method proposed by the present invention is carried out in the following steps (taking Sina Weibo as an example):

Step (1): microblog corpus acquisition and preprocessing

The microblog corpus is obtained through the microblog API or a web crawler and stored as files in HTML format, and the text posted by users is extracted by regular-expression matching. The extracted text is converted between traditional and simplified Chinese and then split into sentences at punctuation marks such as the full stop, comma, semicolon, exclamation mark, and colon, giving text with one sentence per line. The relatively mature jieba word segmentation system is used to segment the text and tag parts of speech, and the segmented corpus is filtered so that only Chinese characters and part-of-speech tags are retained, giving corpus G. G is stored line by line, one microblog text per line, about 500,000 lines in total.
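The text-cleaning part of this step can be sketched with regular expressions alone. The patterns below are assumptions: the exact expressions are not given in the text, and the question mark is included in the split set as a guess. Traditional/simplified conversion is omitted.

```python
import re

# Illustrative patterns (assumptions, not taken from the patent).
TAG_RE = re.compile(r"<[^>]+>")                               # HTML tags
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&%_#~:+-]+")     # ASCII URLs
SPLIT_RE = re.compile(r"[。，；？！：]")                        # sentence breaks


def preprocess(html):
    """Strip tags and URLs, then split into non-empty sentences."""
    text = TAG_RE.sub("", html)
    text = URL_RE.sub("", text)
    parts = SPLIT_RE.split(text)
    return [p.strip() for p in parts if p.strip()]


# Hypothetical microblog fragment.
lines = preprocess("<p>今天天气好，去公园！http://t.cn/abc 走起</p>")
```

Each resulting sentence could then be passed to `jieba.posseg.cut`, which yields word and part-of-speech tag pairs, to produce the tagged corpus G.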

Step (2): process G with the improved FP-Growth algorithm to obtain the frequent itemset C_fp

Step (2.1): the FP-Growth algorithm is improved according to formula (1), and frequent itemsets are obtained using the parameter threshold α₁, with α₁ = 0.000000000001. The algorithm thus considers not only the frequency of frequent itemsets but also the influence of part of speech, because the imbalance of parts of speech affects whether some new words appear. For example, in "anger/vg rancours/v", the part of speech "/vg" occurs only 60 times in the corpus; if frequent itemsets were obtained by frequency alone, "anger/vg rancours/v" would be filtered out.

Step (2.2): correct the order of the frequent itemsets obtained by the improved FP-Growth algorithm. Because the frequent itemsets obtained by the FP-Growth algorithm carry no order, they are corrected by comparison with positions in the original corpus, giving the frequent itemset C_fp = {c₁, c₂, c₃, ..., c_m}, where each c_i = (w₁, w₂, w₃, ..., w_n). Some of the frequent itemsets are listed in Table 1 below:

Table 1: Partial frequent itemset C_fp

Step (3): obtain the new word candidate set C_grams using N-grams

The new word candidate set C_grams = {c₁, c₂, c₃, ..., c_m} is obtained with the N-grams algorithm, where each c_i is a candidate new word. N is set to 2, so c_i = (w₁, w₂). Every c_i is obtained by filtering with the thresholds α₂ and β₂, where α₂ = 10 and β₂ = 2000, which removes noise such as " /uj people/n". Part of the new word candidate set C_grams is listed in Table 2 below:

Table 2: Partial new word candidate set C_grams

Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidate set C1

Some of the new word candidates obtained by taking the intersection are shown in Table 3 below:

Table 3: Partial new word candidates C1

In step (5), the filtered part-of-speech set includes: distinguishing word (/b), conjunction (/c), adverbial morpheme (/dg), interjection (/e), numeral (/m), onomatopoeia (/o), preposition (/p), quantifier (/q), pronoun (/r), place word (/s), time morpheme (/tg), auxiliary word (/u), punctuation mark (/w), non-morpheme character (/x), modal particle (/y), and descriptive word (/z). Part of the part-of-speech information is shown in Table 4:

Table 4: Partial part-of-speech information

Part of the new word candidate set C2 is shown in Table 5 below:

Table 5: Partial new word candidates C2

Step (6): filter the new word candidate set C2 using improved mutual information to obtain the new word candidate set C3

Let c_i = (w₁, w₂), c_i ∈ C2. The improved mutual information of each candidate new word c_i in the candidate set C2 is calculated according to formula (2), and the candidate set is filtered by the threshold β₃ to obtain the new word set C3. Part of the new word set C3 is shown in Table 6:

Table 6: Partial new word set C3

Step (7): filter the candidate new word set C3 with the part-of-speech combination rule library R to obtain the final new word set C4

Filtering follows the rules below:

Filtering rule one: noise words of the form "Gansu/ns publications/v" are filtered out; the structure is /ns /v (nr or nz may take the place of ns);

Filtering rule two: noise words of the form "Jiangxi/ns Ganzhou/ns" are filtered out; the structure is /ns /ns (nr or nz may take the place of ns);

Filtering rule three: noise words of the form "reason/n is/v" are filtered out; the structure is /n /v or /vn /v;

Filtering rule four: noise words of the form "tomorrow/t New Year's Eve/t" are filtered out; the structure is /t /t;

Filtering rule five: noise words of the form "tomorrow/t the beginning of spring/nr" are filtered out; the structure is /t /nr;

Filtering rule six: noise words of the form "Ching Ming Festival/t during/f" are filtered out; the structure is /t /f (vn, n, l, or f may take the place of t);

Filtering rule seven: noise words of the form "by/v today/t" are filtered out; the structure is /v /t;

Filtering rule eight: noise words of the form "now/t apparently/v" are filtered out; the structure is /t /v;

Filtering rule nine: noise words of the form "Tianjin/ns traffic police/j" are filtered out; the structure is /ns /j.
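The nine rules can be encoded as pairs of tag sets and applied to adjacent words. In this sketch the encoding, the helper names, and the reading of rule two (nr or nz allowed in both positions) are assumptions; candidates are tuples of (word, tag) pairs.

```python
PLACE = {"ns", "nr", "nz"}

# Each rule is (allowed tags of first word, allowed tags of second word).
RULES = [
    (PLACE, {"v"}),                         # rule one
    (PLACE, PLACE),                         # rule two (assumed for both slots)
    ({"n", "vn"}, {"v"}),                   # rule three
    ({"t"}, {"t"}),                         # rule four
    ({"t"}, {"nr"}),                        # rule five
    ({"t", "vn", "n", "l", "f"}, {"f"}),    # rule six
    ({"v"}, {"t"}),                         # rule seven
    ({"t"}, {"v"}),                         # rule eight
    ({"ns"}, {"j"}),                        # rule nine
]


def matches_rule(pos_a, pos_b):
    """True if the adjacent tag pair matches any rule in RULES."""
    return any(pos_a in first and pos_b in second for first, second in RULES)


def filter_by_rules(candidates):
    """Keep candidates in which no adjacent tag pair matches a rule."""
    kept = []
    for cand in candidates:
        tags = [pos for _, pos in cand]
        if not any(matches_rule(a, b) for a, b in zip(tags, tags[1:])):
            kept.append(cand)
    return kept


# Candidates taken from the rule examples, plus one that should survive.
c3 = [
    (("Gansu", "ns"), ("publications", "v")),
    (("tomorrow", "t"), ("New Year's Eve", "t")),
    (("anger", "vg"), ("rancours", "v")),
]
c4 = filter_by_rules(c3)
```

Keeping the rules as data rather than as branching code makes it straightforward to add or remove a rule without touching the filter itself.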

Some of the new words obtained after rule filtering are shown in Table 7 below:

Table 7: Partial new words C4

Step (8): analysis of new word recognition performance

The accuracy of the new words identified by the algorithm is calculated according to formula (4):

p = tp / (tp + fp)    (4)

where p denotes the accuracy of the model, tp denotes the number of new words identified correctly, and fp denotes the number of new words identified incorrectly.
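Given the definitions of tp and fp, the accuracy is the usual precision ratio tp / (tp + fp). A small sketch, with hypothetical counts that are not taken from the experiment:

```python
def precision(tp, fp):
    """Share of identified new words that are correct: tp / (tp + fp)."""
    if tp + fp == 0:
        raise ValueError("no new words were identified")
    return tp / (tp + fp)


# Hypothetical counts for illustration only.
p = precision(tp=69, fp=31)
```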

The calculated accuracy of the improved FP-Growth algorithm in identifying new words on microblog-type short text is 69%. This is better than the general FP-Growth algorithm and the N-grams model, whose accuracy in new word discovery is around 57% in both cases. Moreover, compared with character tagging models, the present invention does not require a large amount of manual annotation in advance.

Claims

1. A new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm, characterized by comprising the following steps:

Step (1): microblog corpus acquisition and preprocessing

The microblog corpus is obtained through the microblog API or a web crawler; regular-expression matching is applied to the files to extract the body text of the microblog posts, URLs are deleted, and the text is split into sentences at punctuation marks; the resulting plain text is segmented into words and tagged with parts of speech to obtain the preprocessed corpus, denoted G;

Step (2): process corpus G with the improved FP-Growth algorithm to obtain the frequent itemset C_fp

Step (3): obtain the new word candidate set C_grams using the N-grams model

Count the number of times N words occur together in the corpus; the N-grams model gives the frequency P(w₁, w₂, w₃, ..., w_n) with which the words co-occur. N-gram repeated strings satisfying α₂ < P(w₁, w₂, w₃, ..., w_n) < β₂ are selected as the new word candidate set C_grams, where α₂ and β₂ are co-occurrence frequency thresholds.

Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidates C1 = {c₁, c₂, …, c_m}, c_i = (w₁, w₂, ..., w_n), where c_i denotes a candidate new word and w_j denotes an original word that makes up the new word.

Step (5): in the new word candidates C1, use the part-of-speech tags to screen out candidates containing words whose part of speech belongs to the filtered set, obtaining the new word candidate set C2

Step (6): filter the new word candidate set C2 using improved mutual information to obtain the new word candidate set C3. Let c_i = (w₁, w₂, ..., w_n), c_i ∈ C2. The improved mutual information formula is applied to adjacent words w_j of each c_i; the improved mutual information formula is as follows:

where p(w_i, w_{i+1}) denotes the frequency with which words w_i and w_{i+1} occur together, p(w_i) denotes the frequency of word w_i, w_{i,i+1} denotes the weight with which word w_i combines with its neighbouring word w_{i+1} to form a word, n_{pos(w_i, w_{i+1})} denotes the frequency of the part-of-speech combination of the co-occurring words w_i and w_{i+1}, and the remaining term denotes the frequency with which the part of speech of word w_i occurs; among all frequent itemsets, words satisfying I(w_i, w_{i+1}) > β₃ are selected as the new word set C = {c₁, c₂, c₃, ..., c_m}, where each new word c_i = (w₁, w₂, w₃, ..., w_n) and β₃ is a preset threshold;

Step (7): filter the candidate new word set C3 with the part-of-speech combination filtering rule library R to obtain the final new word set C4,

Let c_i = (w₁, w₂, ..., w_n), c_i ∈ C3. For each c_i and any adjacent pair (w_i, w_{i+1}) with part-of-speech combination (pos(w_i), pos(w_{i+1})), if the combination matches any rule in the part-of-speech combination filtering rule library R, the new word c_i is removed, finally giving the new word set C4;

The part-of-speech combination filtering rule library R consists of the following rules:

Filtering rule one: /ns /v (nr or nz may take the place of ns);

Filtering rule two: /ns /ns (nr or nz may take the place of ns);

Filtering rule three: /n /v or /vn /v;

Filtering rule four: /t /t;

Filtering rule five: /t /nr;

Filtering rule six: /t /f (vn, n, l, or f may take the place of t);

Filtering rule seven: /v /t;

Filtering rule eight: /t /v;

Filtering rule nine: /ns /j.

2. The new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm as claimed in claim 1, characterized in that step (2) specifically comprises:

Step (2.1): process the microblog corpus G and build the improved FP-Growth model, which combines the two factors of word frequency and part of speech. The part-of-speech relative probability value is calculated as follows:

where f(w | pos(w)=a) denotes the part-of-speech relative probability value of word w when its part of speech is a, n_a denotes the frequency of words in corpus G whose part of speech is a, N denotes the total word frequency in corpus G, and n_{(w | pos(w)=a)} denotes the frequency of word w when its part of speech is a;

When building frequent itemsets, repeated strings satisfying f(w | pos(w)=a) > α₁ are selected as the candidate frequent itemset R_fp, where α₁ is the preset minimum support;

Step (2.2): correct the order of the obtained frequent itemset R_fp by comparing it against the order in the original corpus, obtaining the ordered frequent itemset C_fp.

3. The new word discovery method for microblog-type short texts based on an improved FP-Growth algorithm as claimed in claim 2, characterized in that the filtered part-of-speech set in step (5) includes:

Code | Part of speech | Note
b | Distinguishing word | Takes the initial consonant of the Chinese character 别 ("other").
c | Conjunction | Takes the 1st letter of the English word conjunction.
dg | Adverbial morpheme | Adverbial morpheme. The adverb code is d, so D is placed before the morpheme code g.
e | Interjection | Takes the 1st letter of the English word exclamation.
m | Numeral | Takes the 3rd letter of the English word numeral, since n and u are already in use.
o | Onomatopoeia | Takes the 1st letter of the English word onomatopoeia.
p | Preposition | Takes the 1st letter of the English word prepositional.
q | Quantifier | Takes the 1st letter of the English word quantity.
r | Pronoun | Takes the 2nd letter of the English word pronoun, because p is already used for prepositions.
s | Place word | Takes the 1st letter of the English word space.
tg | Time morpheme | Time-word morpheme. The time word code is t, so T is placed before the morpheme code g.
u | Auxiliary word | From the English word auxiliary.
w | Punctuation mark |
x | Non-morpheme character | A non-morpheme character is only a symbol; the letter x is commonly used to represent unknowns and symbols.
y | Modal particle | Takes the initial consonant of the Chinese character 语 ("language").
z | Descriptive word | Takes the letter preceding the initial consonant of the Chinese character 状 ("shape").