CN108319584A - A new word discovery method for microblog-type short text based on an improved FP-Growth algorithm - Google Patents
- Publication number: CN108319584A
- Application number: CN201810058993.6A
- Authority: CN (China)
- Prior art keywords: word, speech, neologisms, filtering rule, microblogging
- Legal status: Withdrawn (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention discloses a new word discovery method for microblog-type short text based on an improved FP-Growth algorithm, comprising: obtaining a text corpus and preprocessing it with jieba (word segmentation, part-of-speech tagging, etc.); obtaining frequent itemsets of words with the optimized FP-Growth algorithm and restoring the word order within each frequent itemset; obtaining repeated strings with an N-grams model and intersecting them with the frequent itemsets; filtering by part of speech to remove parts of speech that rarely take part in word formation; filtering candidate new words with improved mutual information, computed iteratively in a sliding manner; filtering once more with a library of part-of-speech combination rules; and verifying the validity of the new words obtained by the method.
Description
Technical field
The invention belongs to the field of text information processing, and specifically relates to a new word discovery method for microblog-type short text based on an improved FP-Growth algorithm.
Background art
Microblogging is currently one of the most popular social platforms in the world. Every day users publish a large amount of text on microblogs, which has become one of the main sources of new Internet words.
Microblogs differ from ordinary text in that they are short: each post does not exceed 140 characters, the content is informal, and the form is diverse. Short text of this kind is therefore relatively difficult to study. However, the knowledge contained in massive microblog text is of great significance for research in fields such as public opinion monitoring and new word discovery.
Current research on new word discovery is mainly based on the recognition of named entities (person names, place names, organization names, etc.) in traditional text, and research on new word discovery in microblog short text is relatively scarce. Moreover, because microblog text is short and irregular, traditional new word discovery methods perform unsatisfactorily on microblog-type short text.
The FP-Growth algorithm obtains the frequent itemsets in the data by scanning the database twice. It is an efficient algorithm for mining frequent itemsets and can be used to obtain new words, but it has shortcomings when applied to microblog-type short text: the traditional FP-Growth algorithm ignores the influence of part of speech on word formation. An improved FP-Growth algorithm is therefore proposed, and new words are discovered by combining it with an N-grams model, improved mutual information, and rules.
Summary of the invention
To address the shortcomings of the FP-Growth algorithm for new word discovery in microblog-type short text, an improved FP-Growth algorithm is proposed that takes part of speech into account. It not only expresses the association between words effectively through frequent items, but also reduces the recognition difficulty caused by part-of-speech imbalance. The accuracy of the obtained new words is improved by an ensemble approach that combines it with an N-grams model, followed by filtering by part of speech, improved mutual information, and a part-of-speech combination rule library.
To achieve the above object, the present invention adopts the following technical solution:
A new word discovery method for microblog-type short text based on an improved FP-Growth algorithm, comprising the following steps:
Step (1): microblog corpus acquisition and preprocessing
The microblog corpus is obtained through the microblog API or a web crawler and stored as files in HTML format. Regular-expression matching is applied to the files to extract the text, URLs are deleted, and the text is split into sentences at punctuation marks. The resulting plain text is segmented and part-of-speech tagged with the third-party Python module jieba, yielding the preprocessed corpus, denoted G;
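The cleaning part of step (1) can be sketched with the Python standard library alone. This is a minimal illustration, not the patented implementation: the three regular expressions and the function name `preprocess` are assumptions, since the text only states that regular-expression matching, URL deletion, and punctuation-based sentence splitting are used. Segmentation and part-of-speech tagging with jieba would follow on each returned sentence and are left out here.

```python
import re

# Assumed patterns; the source does not give the actual expressions.
URL_RE = re.compile(r'https?://[^\s<>"]+')     # http/https links
TAG_RE = re.compile(r"<[^>]+>")                # HTML tags
SENT_RE = re.compile(r"[。，；？！：.,;?!:]+")     # full- and half-width delimiters


def preprocess(html):
    """Return the sentences of one post with URLs and HTML tags removed."""
    # Remove URLs first so the "." and ":" inside them are not treated
    # as sentence delimiters later on.
    text = URL_RE.sub("", html)
    text = TAG_RE.sub("", text)
    parts = SENT_RE.split(text)
    return [p.strip() for p in parts if p.strip()]
```

Each returned sentence would then be passed to the segmenter to produce the (word, part-of-speech) pairs that make up corpus G.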
Step (2): process G with the optimized FP-Growth algorithm to obtain the frequent itemset C_fp
Step (2.1): process the microblog corpus G and build the improved FP-Growth model, which combines two factors, word frequency and part of speech. The part-of-speech relative probability value is computed by formula (1), which is not reproduced in this text,
where f(w | pos(w)=a) denotes the part-of-speech relative probability value of word w when its part of speech is a, n_a denotes the frequency count of words with part of speech a in corpus G, N denotes the total word frequency count in corpus G, and n_(w | pos(w)=a) denotes the frequency count of word w when its part of speech is a.
When building the frequent itemsets, the repeated strings satisfying f(w | pos(w)=a) > α1 are selected as the candidate frequent itemset R_fp, where α1 is the preset minimum support.
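The counts named in step (2.1) are straightforward to collect from a tagged corpus. The sketch below gathers n_(w | pos(w)=a) and n_a and combines them as the ratio n_(w | pos(w)=a) / n_a. That ratio is purely an assumption used for illustration: the source text does not reproduce formula (1), and the actual formula may differ (for example, it may also involve N). The function name `pos_relative_scores` is hypothetical.

```python
from collections import Counter


def pos_relative_scores(tagged_tokens):
    """Score each (word, pos) pair relative to how common its part of speech is.

    tagged_tokens: list of (word, pos) pairs from the tagged corpus G.
    """
    pair_counts = Counter(tagged_tokens)        # n_(w | pos(w)=a)
    pos_counts = Counter()
    for (_, pos), n in pair_counts.items():
        pos_counts[pos] += n                    # n_a
    # ASSUMPTION: the ratio below stands in for formula (1),
    # which is not reproduced in the source text.
    return {pair: n / pos_counts[pair[1]] for pair, n in pair_counts.items()}
```

Under this assumed ratio, a word tagged with a rare part of speech scores high even if its raw frequency is low, which is the effect the description attributes to the improved model.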
Step (2.2): correct the word order of the obtained frequent itemset R_fp. The words in a frequent itemset produced by the FP-Growth algorithm are unordered, so each itemset is compared against the original corpus to recover the word order, giving the ordered frequent itemset C_fp.
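One way to read the order correction of step (2.2) is sketched below: for an unordered itemset, every contiguous window of the same length in the segmented corpus is checked, and the ordering seen most often is kept. Requiring the words to be contiguous and picking the most frequent ordering are assumptions; the text only says that the itemsets are compared with the original corpus. The name `restore_order` is hypothetical.

```python
from collections import Counter


def restore_order(itemset, sentences):
    """Return the most frequent contiguous ordering of `itemset`, or None.

    itemset:   unordered collection of words (e.g. a frozenset).
    sentences: list of token lists from the segmented corpus.
    """
    target = Counter(itemset)
    k = len(itemset)
    seen = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - k + 1):
            window = tuple(tokens[i:i + k])
            # A window matches when it contains exactly the itemset's words.
            if Counter(window) == target:
                seen[window] += 1
    return seen.most_common(1)[0][0] if seen else None
```

Itemsets whose words never appear next to each other return None and could simply be dropped from C_fp.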
Step (3): obtain the new word candidate set C_grams with an N-grams model
The number of times N words occur together in the corpus is counted, and the N-grams model gives the co-occurrence frequency P(w1, w2, w3, ..., wn). The N-gram repeated strings satisfying α2 < P(w1, w2, w3, ..., wn) < β2 are selected as the new word candidate set C_grams, where α2 and β2 are co-occurrence frequency thresholds.
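The N-gram selection in step (3) amounts to counting word N-grams and keeping those whose count lies strictly between the two thresholds. The following sketch treats P as a raw co-occurrence count, which matches the integer thresholds given in the embodiment (α2 = 10, β2 = 2000, N = 2); the function name `ngram_candidates` is hypothetical.

```python
from collections import Counter


def ngram_candidates(sentences, n, alpha2, beta2):
    """Return the word n-grams whose corpus count c satisfies alpha2 < c < beta2.

    sentences: list of token lists from the segmented corpus.
    """
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram for gram, c in counts.items() if alpha2 < c < beta2}
```

The upper bound removes very frequent pairs, which tend to be function-word noise, while the lower bound removes accidental co-occurrences.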
Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidates C1 = {c1, c2, ..., cm}, where ci = (w1, w2, ..., wn), ci denotes a candidate new word, and wn denotes an original word that makes up the new word.
The optimized FP-Growth algorithm takes part of speech into account. It not only expresses the association between original words effectively through frequent items, but also reduces the recognition difficulty caused by part-of-speech imbalance. At the same time, the ensemble approach of intersecting the candidate set C_grams obtained by the N-grams algorithm with the frequent itemset C_fp obtained by the optimized FP-Growth algorithm improves the accuracy of the obtained new words.
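Once both candidate sources represent a candidate as an ordered tuple of words, the intersection in step (4) is a plain set operation. A small sketch, with the hypothetical helper name `intersect_candidates`:

```python
def intersect_candidates(c_fp, c_grams):
    """Keep only the candidates proposed by both FP-Growth and N-grams.

    Both arguments are iterables of word sequences; each sequence is
    normalised to a tuple so that lists and tuples compare equal.
    """
    return set(map(tuple, c_fp)) & set(map(tuple, c_grams))
```

A candidate survives only if both methods propose it with the same word order, which is why the order correction of step (2.2) has to happen before this step.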
Step (5): within the new word candidates C1, use the part-of-speech tags to screen out candidates containing a word whose part of speech is in the filter set, obtaining the new word candidate set C2
The filter part-of-speech set includes:
The new word candidate set C1 obtained in step (4) is filtered according to the above parts of speech, giving the new word candidate set C2;
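The part-of-speech filter of step (5) can be sketched as a membership test. The tag set below is the one listed in the specific embodiment (/b, /c, /dg, /e, /m, /o, /p, /q, /r, /s, /tg, /u, /w, /x, /y, /z). Matching tags exactly rather than by prefix is an assumption, so a sub-tag such as "uj" is not caught by "u" in this sketch; the name `filter_by_pos` is hypothetical.

```python
# Tags taken from the filter set listed in the embodiment.
FILTER_POS = frozenset({
    "b", "c", "dg", "e", "m", "o", "p", "q",
    "r", "s", "tg", "u", "w", "x", "y", "z",
})


def filter_by_pos(candidates):
    """Drop every candidate that contains a word with a filtered tag.

    candidates: iterable of tuples of (word, pos) pairs.
    ASSUMPTION: tags are compared exactly, not by prefix.
    """
    return [c for c in candidates
            if not any(pos in FILTER_POS for _, pos in c)]
```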
Step (6): filter the new word candidate set C2 with improved mutual information to obtain the new word candidate set C3. Let ci = (w1, w2, ..., wn), ci ∈ C2. For each ci, the improved mutual information is computed in a sliding manner over adjacent words. The improved mutual information is computed by formula (2), which is not reproduced in this text,
where p(wi, wi+1) denotes the frequency with which word wi and word wi+1 occur together, p(wi) denotes the frequency of word wi, w_(i,i+1) denotes the weight with which word wi and its neighbouring word wi+1 combine into a word, n_pos(wi,wi+1) denotes the frequency of the part-of-speech combination of the co-occurring words wi and wi+1, and a further count gives the frequency with which the part of speech of word wi occurs. Among all frequent itemsets, the words satisfying I(wi, wi+1) > β3 are selected as the new word set C = {c1, c2, c3, ..., cm}, where each new word has the form ci = (w1, w2, w3, ..., wn) and β3 is a preset threshold.
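The sliding computation in step (6) can be illustrated with ordinary pointwise mutual information (PMI). This is deliberately the standard measure, not the improved one: formula (2), with its part-of-speech weight w_(i,i+1), is not reproduced in the source text and is not reconstructed here. Requiring every adjacent pair to exceed β3 is likewise an assumed reading of "sliding". The names `pmi_table` and `passes_sliding_pmi` are hypothetical.

```python
import math
from collections import Counter


def pmi_table(sentences):
    """Build a pmi(w1, w2) function from unigram and adjacent-bigram counts."""
    uni, bi = Counter(), Counter()
    for tokens in sentences:
        uni.update(tokens)
        bi.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(uni.values()), sum(bi.values())

    def pmi(w1, w2):
        # Standard PMI: log2( p(w1, w2) / (p(w1) * p(w2)) ).
        if bi[(w1, w2)] == 0:
            return float("-inf")
        p_xy = bi[(w1, w2)] / n_bi
        return math.log2(p_xy / ((uni[w1] / n_uni) * (uni[w2] / n_uni)))

    return pmi


def passes_sliding_pmi(candidate, pmi, beta3):
    """Slide over adjacent word pairs; keep the candidate only if all pass.

    ASSUMPTION: every adjacent pair must score above beta3.
    """
    return all(pmi(a, b) > beta3 for a, b in zip(candidate, candidate[1:]))
```

A part-of-speech weight could be multiplied into the returned score once its exact form is known from the original formula.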
Step (7): filter the candidate new word set C3 with the part-of-speech combination filtering rule library R to obtain the final new word set C4
Let ci = (w1, w2, ..., wn), ci ∈ C3. For each ci and any adjacent pair (wi, wi+1) with part-of-speech combination (pos(wi), pos(wi+1)), if the combination matches any rule in the filtering rule library R, the new word ci is filtered out. The remaining candidates form the new word set C4.
The part-of-speech combination rule library R consists of the following rules:
Filtering rule one: /ns/v (ns may also be nr or nz);
Filtering rule two: /ns/ns (ns may also be nr or nz);
Filtering rule three: /n/v or /vn/v;
Filtering rule four: /t/t;
Filtering rule five: /t/nr;
Filtering rule six: /t/f (t may also be vn, n, l or f);
Filtering rule seven: /v/t;
Filtering rule eight: /t/v;
Filtering rule nine: /ns/j;
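The nine rules above can be encoded as pairs of tag sets and applied to every adjacent pair of a candidate. This is a sketch under stated assumptions: for rule two, the substitution of nr or nz for ns is assumed to be allowed in both positions, and tags are compared exactly. The names `RULES`, `matches_rule` and `filter_by_rules` are hypothetical.

```python
PLACE_LIKE = frozenset({"ns", "nr", "nz"})

# Each rule is (allowed tags of the first word, allowed tags of the second).
RULES = [
    (PLACE_LIKE, {"v"}),                   # rule one:   /ns/v
    (PLACE_LIKE, PLACE_LIKE),              # rule two:   /ns/ns (assumed: both slots)
    ({"n", "vn"}, {"v"}),                  # rule three: /n/v or /vn/v
    ({"t"}, {"t"}),                        # rule four:  /t/t
    ({"t"}, {"nr"}),                       # rule five:  /t/nr
    ({"t", "vn", "n", "l", "f"}, {"f"}),   # rule six:   /t/f
    ({"v"}, {"t"}),                        # rule seven: /v/t
    ({"t"}, {"v"}),                        # rule eight: /t/v
    ({"ns"}, {"j"}),                       # rule nine:  /ns/j
]


def matches_rule(pos_a, pos_b):
    """True if the tag pair (pos_a, pos_b) matches any rule in RULES."""
    return any(pos_a in first and pos_b in second for first, second in RULES)


def filter_by_rules(candidates):
    """Drop candidates in which any adjacent tag pair matches a rule.

    candidates: iterable of tuples of (word, pos) pairs.
    """
    kept = []
    for cand in candidates:
        tags = [pos for _, pos in cand]
        if not any(matches_rule(a, b) for a, b in zip(tags, tags[1:])):
            kept.append(cand)
    return kept
```

Keeping the rules as data rather than as branching code makes it easy to add or remove a rule without touching the filter itself.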
Description of the drawings
Fig. 1 is a flow chart of the new word discovery method for microblog-type short text based on the improved FP-Growth algorithm;
Specific embodiment
The specific embodiment of the present invention is described in further detail below with reference to the accompanying drawing and an example. The following example illustrates the present invention but does not limit its scope.
As shown in Fig. 1, the method proposed by the present invention is carried out in the following steps (taking Sina Weibo as an example):
Step (1): microblog corpus acquisition and preprocessing
The microblog corpus is obtained through the microblog API or a web crawler and stored as files in HTML format, and the text published by users is extracted by regular-expression matching. Traditional and simplified Chinese characters in the extracted text are then unified, and the text is split into sentences at punctuation marks, which include '.', ',', ';', '', '!' and ':', giving text with one sentence per line. The relatively mature jieba word segmentation system is used to segment and part-of-speech tag the text, and the segmented corpus is filtered so that only Chinese characters and part-of-speech tags are retained, giving corpus G. G is stored line by line, one microblog text per line, about 500,000 lines in total.
Step (2): process G with the improved FP-Growth algorithm to obtain the frequent itemset C_fp
Step (2.1): the FP-Growth algorithm is improved according to formula (1), and frequent itemsets are obtained through the parameter threshold α1, with α1 = 0.000000000001. In this way the algorithm considers not only the frequency of the frequent itemsets but also the influence of part of speech, because part-of-speech imbalance affects whether some new words appear. For example, in "anger/vg rancours/v" the part of speech "/vg" occurs only 60 times in the corpus; if frequent itemsets were obtained by frequency alone, "anger/vg rancours/v" would be filtered out.
Step (2.2): the word order of the frequent itemsets obtained by the improved FP-Growth algorithm is corrected. Because the frequent itemsets obtained by the FP-Growth algorithm are unordered, they are corrected by comparing positions with the original corpus, giving the frequent itemset C_fp = {c1, c2, c3, ..., cm}, where each ci = (w1, w2, w3, ..., wn). Some of the frequent itemsets are listed in Table 1 below:
Table 1: Part of the frequent itemset C_fp
Step (3): obtain the new word candidate set C_grams with N-grams
The N-grams algorithm is used to obtain the new word candidate set C_grams = {c1, c2, c3, ..., cm}, whose elements ci are candidate new words. N is set to 2, so ci = (w1, w2). Every ci is obtained by filtering with the thresholds α2 and β2, where α2 = 10 and β2 = 2000; this filters out some of the noise that appears, such as "/uj people/n". Part of the new word candidate set C_grams is listed in Table 2 below:
Table 2: Part of the new word candidate set C_grams
Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidate set C1
Some of the new word candidates obtained by taking the intersection are shown in Table 3 below:
Table 3: Part of the new word candidates C1
Step (5): within the new word candidates C1, use the part-of-speech tags to screen out candidates containing a word whose part of speech is in the filter set, obtaining the new word candidate set C2
The filter part-of-speech set includes: distinguishing word (/b), conjunction (/c), adverbial morpheme (/dg), interjection (/e), numeral (/m), onomatopoeia (/o), preposition (/p), quantifier (/q), pronoun (/r), place word (/s), time morpheme (/tg), auxiliary word (/u), punctuation mark (/w), non-morpheme word (/x), modal particle (/y), and descriptive word (/z). Part of the part-of-speech information is shown in Table 4:
Table 4: Part of the part-of-speech information
Part of the new word candidate set C2 is shown in Table 5 below:
Table 5: Part of the new word candidates C2
Step (6): filter the new word candidate set C2 with improved mutual information to obtain the new word candidate set C3
For ci = (w1, w2), ci ∈ C2, the improved mutual information of each ci is calculated. The mutual information of each candidate new word in C2 is computed according to formula (2), and the candidate set is filtered with the threshold α3 to obtain the new word set C3. Part of the new word set C3 is given below, as shown in Table 6:
Table 6: Part of the new word set C3
Step (7): filter the candidate new word set C3 with the part-of-speech combination rule library R to obtain the final new word set C4
Filtering is carried out according to the following rules:
Filtering rule one: noise words such as "Gansu/ns publications/v" are filtered out; the structure is /ns/v (ns may also be nr or nz);
Filtering rule two: noise words such as "Jiangxi/ns Ganzhou/ns" are filtered out; the structure is /ns/ns (ns may also be nr or nz);
Filtering rule three: noise words such as "reason/n is/v" are filtered out; the structure is /n/v or /vn/v;
Filtering rule four: noise words such as "tomorrow/t New Year's Eve/t" are filtered out; the structure is /t/t;
Filtering rule five: noise words such as "tomorrow/t the beginning of spring/nr" are filtered out; the structure is /t/nr;
Filtering rule six: noise words such as "Ching Ming Festival/t during/f" are filtered out; the structure is /t/f (t may also be vn, n, l or f);
Filtering rule seven: noise words such as "by/v today/t" are filtered out; the structure is /v/t;
Filtering rule eight: noise words such as "now/t apparently/v" are filtered out; the structure is /t/v;
Filtering rule nine: noise words such as "Tianjin/ns traffic police/j" are filtered out; the structure is /ns/j;
Some of the new words obtained after rule filtering are shown in Table 7 below:
Table 7: Part of the new word candidates C4
Step (8): analysis of the new word recognition effect
The accuracy of the new words recognized by the algorithm is calculated by formula (4):
$$p = \frac{\mathit{tp}}{\mathit{tp} + \mathit{fp}} \tag{4}$$
where p denotes the accuracy of the model, tp denotes the number of correctly recognized new words, and fp denotes the number of wrongly recognized new words.
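As a small worked illustration of the accuracy measure, the ratio of correctly recognized new words to all recognized new words can be computed as follows. The counts in the usage note are hypothetical and chosen only to show the arithmetic; the source does not report tp or fp.

```python
def precision(tp, fp):
    """Share of recognized new words that are correct: tp / (tp + fp)."""
    if tp + fp == 0:
        raise ValueError("no words were recognized")
    return tp / (tp + fp)
```

For example, with hypothetical counts of 69 correct and 31 incorrect recognitions, `precision(69, 31)` gives 0.69.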
The calculation shows that the improved FP-Growth algorithm recognizes new words on microblog-type short text with an accuracy of 69%. This is better than the general FP-Growth algorithm and the N-grams model, whose new word discovery accuracy is around 57%. Moreover, compared with character-labelling models, the present invention does not require a large amount of manual annotation in advance.
Claims (3)
1. A new word discovery method for microblog-type short text based on an improved FP-Growth algorithm, characterized by comprising the following steps:
Step (1): microblog corpus acquisition and preprocessing
The microblog corpus is obtained through the microblog API or a web crawler; regular-expression matching is applied to the files to extract the microblog body text, URLs are deleted, and the text is split into sentences at punctuation marks; the resulting plain text is segmented and part-of-speech tagged, yielding the preprocessed corpus, denoted G;
Step (2): process corpus G with the improved FP-Growth algorithm to obtain the frequent itemset C_fp
Step (3): obtain the new word candidate set C_grams with an N-grams model
The number of times N words occur together in the corpus is counted, and the N-grams model gives the co-occurrence frequency P(w1, w2, w3, ..., wn). The N-gram repeated strings satisfying α2 < P(w1, w2, w3, ..., wn) < β2 are selected as the new word candidate set C_grams, where α2 and β2 are co-occurrence frequency thresholds.
Step (4): take the intersection of the frequent itemset C_fp and the new word candidate set C_grams to obtain the new word candidates C1 = {c1, c2, ..., cm}, where ci = (w1, w2, ..., wn), ci denotes a candidate new word, and wj denotes an original word that makes up the new word.
Step (5): within the new word candidates C1, use the part-of-speech tags to screen out candidates containing a word whose part of speech is in the filter set, obtaining the new word candidate set C2
Step (6): filter the new word candidate set C2 with improved mutual information to obtain the new word candidate set C3. Let ci = (w1, w2, ..., wn), ci ∈ C2. For each ci, the improved mutual information formula is applied to adjacent words wj. The improved mutual information is computed by a formula that is not reproduced in this text,
where p(wi, wi+1) denotes the frequency with which word wi and word wi+1 occur together, p(wi) denotes the frequency of word wi, w_(i,i+1) denotes the weight with which word wi and its neighbouring word wi+1 combine into a word, n_pos(wi,wi+1) denotes the frequency of the part-of-speech combination of the co-occurring words wi and wi+1, and a further count gives the frequency with which the part of speech of word wi occurs; among all frequent itemsets, the words satisfying I(wi, wi+1) > β3 are selected as the new word set C = {c1, c2, c3, ..., cm}, where each new word has the form ci = (w1, w2, w3, ..., wn) and β3 is a preset threshold;
Step (7): filter the candidate new word set C3 with the part-of-speech combination filtering rule library R to obtain the final new word set C4,
Let ci = (w1, w2, ..., wn), ci ∈ C3. For each ci and any adjacent pair (wi, wi+1) with part-of-speech combination (pos(wi), pos(wi+1)), if the combination matches any rule in the filtering rule library R, the new word ci is removed, finally giving the new word set C4;
The part-of-speech combination filtering rule library R consists of the following rules:
Filtering rule one: /ns/v (ns may also be nr or nz);
Filtering rule two: /ns/ns (ns may also be nr or nz);
Filtering rule three: /n/v or /vn/v;
Filtering rule four: /t/t;
Filtering rule five: /t/nr;
Filtering rule six: /t/f (t may also be vn, n, l or f);
Filtering rule seven: /v/t;
Filtering rule eight: /t/v;
Filtering rule nine: /ns/j.
2. The new word discovery method for microblog-type short text based on an improved FP-Growth algorithm according to claim 1, characterized in that step (2) specifically comprises:
Step (2.1): process the microblog corpus G and build the improved FP-Growth model, which combines two factors, word frequency and part of speech. The part-of-speech relative probability value is computed by a formula that is not reproduced in this text,
where f(w | pos(w)=a) denotes the part-of-speech relative probability value of word w when its part of speech is a, n_a denotes the frequency count of words with part of speech a in corpus G, N denotes the total word frequency count in corpus G, and n_(w | pos(w)=a) denotes the frequency count of word w when its part of speech is a;
When building the frequent itemsets, the repeated strings satisfying f(w | pos(w)=a) > α1 are selected as the candidate frequent itemset R_fp, where α1 is the preset minimum support;
Step (2.2): correct the word order of the obtained frequent itemset R_fp by comparing it against the original corpus, giving the ordered frequent itemset C_fp.
3. The new word discovery method for microblog-type short text based on an improved FP-Growth algorithm according to claim 2, characterized in that the filter part-of-speech set in step (2) includes:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810058993.6A CN108319584A (en) | 2018-01-22 | 2018-01-22 | A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810058993.6A CN108319584A (en) | 2018-01-22 | 2018-01-22 | A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108319584A true CN108319584A (en) | 2018-07-24 |
Family
ID=62887532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810058993.6A Withdrawn CN108319584A (en) | 2018-01-22 | 2018-01-22 | A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108319584A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110874408A (en) * | 2018-08-29 | 2020-03-10 | 阿里巴巴集团控股有限公司 | Model training method, text recognition device and computing equipment |
CN110874408B (en) * | 2018-08-29 | 2023-05-26 | 阿里巴巴集团控股有限公司 | Model training method, text recognition device and computing equipment |
CN109543021A (en) * | 2018-11-29 | 2019-03-29 | 北京光年无限科技有限公司 | A kind of narration data processing method and system towards intelligent robot |
CN109543021B (en) * | 2018-11-29 | 2022-03-18 | 北京光年无限科技有限公司 | Intelligent robot-oriented story data processing method and system |
CN110532548A (en) * | 2019-08-12 | 2019-12-03 | 上海大学 | A kind of hyponymy abstracting method based on FP-Growth algorithm |
CN111339403A (en) * | 2020-02-11 | 2020-06-26 | 安徽理工大学 | Commodity comment-based new word extraction method |
CN111339403B (en) * | 2020-02-11 | 2022-08-02 | 安徽理工大学 | Commodity comment-based new word extraction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WW01 | Invention patent application withdrawn after publication | ||
Application publication date: 20180724 |