CN108664642A - Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm - Google Patents

Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm

Info

Publication number
CN108664642A
CN108664642A
Authority
CN
China
Prior art keywords
frequent
candidate
item
support
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810466451.2A
Other languages
Chinese (zh)
Inventor
丁福冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jurong Ma Run Seedlings Co Ltd
Original Assignee
Jurong Ma Run Seedlings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jurong Ma Run Seedlings Co Ltd filed Critical Jurong Ma Run Seedlings Co Ltd
Priority to CN201810466451.2A priority Critical patent/CN108664642A/en
Publication of CN108664642A publication Critical patent/CN108664642A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm, comprising: Step 1, inputting a transaction database; Step 2, computing the set L1 of frequent 1-itemsets; Step 3, generating the set L6 of frequent 6-itemsets; Step 4, deriving association rules from the frequent 6-itemsets.

Description

Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm
Technical field
The present invention relates to a method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm.
Background technology
Data mining (reference: Jiang Haikun. Research on data mining processes [J]. Fujian Computer, 2007(3): 67-74) is the extraction, or "mining", of knowledge from large volumes of data. Specifically, data mining is the process of extracting implicit, potentially useful and previously unknown knowledge and information from large, random, fuzzy, incomplete and noisy data (reference: ZhaoHui Tang. Data Mining Principles and Applications [M]. Tsinghua University Press, 2007). Part-of-speech tagging is an important step in natural language processing; its task is to assign the correct part of speech to each word in a sentence, and errors made at this stage are amplified in subsequent processing such as syntactic analysis and machine translation (reference: Maihemuti Maimaiti. Research and Implementation of Statistics-based Uyghur Part-of-Speech Tagging [D]. Xinjiang University, 2009). Many part-of-speech tagging methods exist to date, including rule-based methods, statistical methods, and methods combining rules with statistics (reference: Liu S, Chen L, et al. Automatic part-of-speech tagging for Chinese corpus. Computer Processing of Chinese and Oriental Languages, 1995, 9(1): 31-47).
Tagging rules are generally compiled by manual sorting, but this has two drawbacks (reference: Li Xiaoli, Shi Zhongzhi. Acquiring part-of-speech tagging rules with data mining methods [J]. Journal of Computer Research and Development, 2000, 37(2): 1409-1414): 1. In terms of coverage, manual methods can only produce general rules; they cannot produce individualized rules for special cases, and although the coverage of individualized rules is small, they are an important means of improving accuracy. 2. Since the accuracy of manually obtained rules remains to be verified, and since it is difficult to further improve the accuracy of statistical methods, acquiring rules automatically and efficiently is a key problem in part-of-speech tagging.
Summary of the invention
In view of the deficiencies of the prior art, the present invention discloses a method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm, comprising the following steps:
Step 1: a transaction database is input, and the item set contained in the transaction database is set as I = {A1, A2, A3, A4, A5, A6}. A1 to A6 denote 6 subsets, i.e. the transaction database contains 6 items in total, where A1 denotes the set of previous words, A2 the set of previous-word parts of speech, A3 the set of current words, A4 the set of current-word parts of speech, A5 the set of following words, and A6 the set of following-word parts of speech;
Step 2: the item set I is scanned and the set L1 of frequent 1-itemsets is computed; the data in L1 record the number of occurrences of each word and part of speech.
Step 3: the candidate frequent-itemset set C2 is generated from L1 by joining and pruning; each candidate set in C2 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L2 of frequent 2-itemsets; the data in L2 record the number of occurrences of pairwise combinations of words and parts of speech;
The candidate frequent-itemset set C3 is generated from L2 by joining and pruning; each candidate set in C3 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L3 of frequent 3-itemsets;
The candidate frequent-itemset set C4 is generated from L3 by joining and pruning; each candidate set in C4 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L4 of frequent 4-itemsets;
The candidate frequent-itemset set C5 is generated from L4 by joining and pruning; each candidate set in C5 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L5 of frequent 5-itemsets;
The candidate frequent-itemset set C6 is generated from L5 by joining and pruning; each candidate set in C6 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L6 of frequent 6-itemsets. (In step 3, the joining and pruning methods are existing techniques; reference: Liu S, Chen L, et al. Automatic part-of-speech tagging for Chinese corpus. Computer Processing of Chinese and Oriental Languages, 1995, 9(1): 31-47.)
Step 4: association rules are derived from the frequent 6-itemsets.
Step 2 includes:
Let Ni denote the number of occurrences of the i-th item Ai of the item set I, with i ranging from 1 to 6. The support of the i-th item, sup(Ai), is computed as follows:
sup(Ai) = Ni / |D|,
where |D| denotes the number of transactions contained in the transaction database. The support of each item is compared with the preset minimum support min_support (usually set to 10), and items whose support is below the minimum support are deleted, yielding the set L1 of frequent 1-itemsets.
Step 4 includes: for each frequent itemset Lx, with x ranging from 1 to 6, find all of its non-empty subsets and compute the confidence of each non-empty subset a. If the ratio of the support sup(Lx) of the frequent itemset Lx to the support sup(a) of the non-empty subset a exceeds the minimum confidence (the minimum confidence is configured by the user as required, e.g. set to 0.8), then the association rule a ==> (Lx - a) holds; otherwise no association rule exists. The association rules are the part-of-speech tagging rules.
Association rules are created with the createTransRule() function; frequent sets are created with the six functions createL1(), createL2(), createL3(), createL4(), createL5() and createL6(), which correspond to the sets L1, L2, L3, L4, L5 and L6 respectively; and the difference set of a and Lx is computed with the getMinusCollect(String[] a, String[] Lx) function.
X => Y means that the occurrence of X also entails the occurrence of Y. For an association rule X => Y, the support takes the form sup(X => Y) = sup(X ∪ Y), i.e. the proportion, among all transactions in the transaction set, of transactions that contain both X and Y; the confidence takes the form conf(X => Y) = sup(X ∪ Y) / sup(X), i.e. the ratio of the number of transactions containing both X and Y to the number of transactions containing X. Support is a measure of the importance of an association rule, while confidence (also called the degree of belief) expresses the accuracy of an association rule and takes values between 0 and 1.
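To make these two formulas concrete, the following minimal Java sketch (illustrative only, not code from the patent; the class, item strings and variable names are hypothetical) computes sup(X => Y) and conf(X => Y) from a small list of transactions:

    import java.util.*;

    public class SupportConfidence {
        // Fraction of transactions that contain every item of the given itemset.
        static double support(List<Set<String>> transactions, Set<String> itemset) {
            long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
            return (double) hits / transactions.size();
        }

        public static void main(String[] args) {
            List<Set<String>> db = List.of(
                Set.of("prevPos=adv", "curPos=n"),
                Set.of("prevPos=adv", "curPos=v"),
                Set.of("prevPos=adv", "curPos=n"));

            Set<String> x = Set.of("prevPos=adv");     // antecedent X
            Set<String> xy = new HashSet<>(x);         // X ∪ Y
            xy.add("curPos=n");

            double supXY = support(db, xy);            // sup(X => Y) = sup(X ∪ Y)
            double conf = supXY / support(db, x);      // conf(X => Y) = sup(X ∪ Y) / sup(X)
            System.out.printf("sup=%.2f conf=%.2f%n", supXY, conf);
        }
    }

With the three example transactions above, sup(X => Y) = 2/3 and conf(X => Y) ≈ 0.67.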
For the acquisition of part-of-speech tagging rules the present invention requires neither dimensional and hierarchical analysis nor a divide-and-conquer method; it uses the most basic Apriori algorithm (Agrawal et al. first posed, in 1993, the problem of mining association rules between itemsets in customer transaction databases and designed the Apriori algorithm based on the theory of frequent sets (reference: Yang Guang. Research on Association Rule Mining Algorithms [D]. Dalian Jiaotong University, 2005). The Apriori algorithm is one of the most influential algorithms for mining the frequent itemsets of Boolean association rules. Its core is a recursive algorithm based on the two-stage frequent-itemset idea, and the algorithm is decomposed into two sub-problems: 1. find all itemsets whose support exceeds the minimum support, these itemsets being called frequent itemsets; 2. generate association rules from the frequent itemsets according to the minimum confidence.) to study, from the manually tagged corpus, the influence of parts of speech and of word/part-of-speech pattern sequences on the part of speech. This method is consistent with the way humans judge a part of speech using the words, parts of speech and other information in the corpus context. When the statistical corpus is large, after a minimum support and a minimum confidence are given, the frequent pattern sets exceeding the minimum support are mined first and association rules are then generated from them; if the confidence of a rule exceeds the minimum confidence, a part-of-speech rule is obtained. If the minimum confidence is set sufficiently high, the rules obtained can serve as a supplement to probabilistic methods, thereby better solving the part-of-speech tagging problem.
Advantageous effects: for the acquisition of part-of-speech tagging rules the present invention requires neither dimensional and hierarchical analysis nor a divide-and-conquer method. Experiments show that the automatically acquired tagging rules have good practical value and can improve the accuracy of part-of-speech tagging.
Description of the drawings
The present invention is further illustrated below in combination with the accompanying drawings and specific embodiments, and the above and other advantages of the present invention will become clearer.
Fig. 1 is a flow chart of the present invention.
Detailed description of the embodiments
The present invention will be further described below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the invention discloses a method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm, comprising the following steps:
Step 1: a transaction database is input, and the item set contained in the transaction database is set as I = {A1, A2, A3, A4, A5, A6}. A1 to A6 denote 6 subsets, i.e. the transaction database contains 6 items in total, where A1 denotes the set of previous words, A2 the set of previous-word parts of speech, A3 the set of current words, A4 the set of current-word parts of speech, A5 the set of following words, and A6 the set of following-word parts of speech;
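As a purely illustrative aid (the class and identifiers below are hypothetical and not part of the patent), such 6-item transactions could be assembled from an already tagged sentence as follows:

    import java.util.*;

    public class TransactionBuilder {
        // Each transaction holds: previous word, previous POS, current word,
        // current POS, following word, following POS (items A1..A6 of the item set I).
        static List<String[]> buildTransactions(String[] words, String[] tags) {
            List<String[]> transactions = new ArrayList<>();
            for (int i = 1; i < words.length - 1; i++) {
                transactions.add(new String[] {
                    words[i - 1], tags[i - 1],   // A1, A2
                    words[i],     tags[i],       // A3, A4
                    words[i + 1], tags[i + 1]    // A5, A6
                });
            }
            return transactions;
        }
    }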
Step 2: the item set I is scanned and the set L1 of frequent 1-itemsets is computed; the data in L1 record the number of occurrences of each word and part of speech.
Step 3: the candidate frequent-itemset set C2 is generated from L1 by joining and pruning; each candidate set in C2 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L2 of frequent 2-itemsets; the data in L2 record the number of occurrences of pairwise combinations of words and parts of speech;
The candidate frequent-itemset set C3 is generated from L2 by joining and pruning; each candidate set in C3 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L3 of frequent 3-itemsets;
The candidate frequent-itemset set C4 is generated from L3 by joining and pruning; each candidate set in C4 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L4 of frequent 4-itemsets;
The candidate frequent-itemset set C5 is generated from L4 by joining and pruning; each candidate set in C5 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L5 of frequent 5-itemsets;
The candidate frequent-itemset set C6 is generated from L5 by joining and pruning; each candidate set in C6 is counted and the candidates whose support is below the minimum support are discarded, thereby producing the set L6 of frequent 6-itemsets (one level of the join-and-prune procedure is sketched below).
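The joining and pruning used to grow each candidate set Ck from Lk-1 follow the standard Apriori procedure. The Java sketch below shows one such level under simplifying assumptions (itemsets represented as sorted string lists); it is illustrative only and is not the implementation claimed by the patent:

    import java.util.*;

    public class AprioriLevel {
        // Join two frequent (k-1)-itemsets that share their first k-2 items,
        // then prune candidates containing an infrequent (k-1)-subset.
        static Set<List<String>> joinAndPrune(Set<List<String>> prevFrequent, int k) {
            Set<List<String>> candidates = new HashSet<>();
            for (List<String> a : prevFrequent) {
                for (List<String> b : prevFrequent) {
                    if (a.subList(0, k - 2).equals(b.subList(0, k - 2))
                            && a.get(k - 2).compareTo(b.get(k - 2)) < 0) {
                        List<String> c = new ArrayList<>(a);
                        c.add(b.get(k - 2));                   // join step
                        if (allSubsetsFrequent(c, prevFrequent)) {
                            candidates.add(c);                 // survives the prune step
                        }
                    }
                }
            }
            return candidates;
        }

        // Apriori property: every (k-1)-subset of a frequent k-itemset must be frequent.
        static boolean allSubsetsFrequent(List<String> candidate, Set<List<String>> prevFrequent) {
            for (int i = 0; i < candidate.size(); i++) {
                List<String> subset = new ArrayList<>(candidate);
                subset.remove(i);
                if (!prevFrequent.contains(subset)) return false;
            }
            return true;
        }
    }

Counting each candidate's support over the transaction database and discarding those below the minimum support then yields Lk.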
Step 4: association rules are derived from the frequent 6-itemsets.
Step 2 includes:
Let Ni denote the number of occurrences of the i-th item Ai of the item set I, with i ranging from 1 to 6. The support of the i-th item, sup(Ai), is computed as follows:
sup(Ai) = Ni / |D|,
where |D| denotes the number of transactions contained in the transaction database. The support of each item is compared with the preset minimum support min_support (usually set to 10), and items whose support is below the minimum support are deleted, yielding the set L1 of frequent 1-itemsets.
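A minimal sketch of this counting step, assuming absolute occurrence counts are compared directly against min_support as described above (all names are hypothetical):

    import java.util.*;

    public class FrequentOneItemsets {
        // Count how often each item (word or part-of-speech value) occurs across
        // all transactions and keep only those reaching the minimum support.
        static Map<String, Integer> frequentItems(List<String[]> transactions, int minSupport) {
            Map<String, Integer> counts = new HashMap<>();
            for (String[] t : transactions) {
                for (String item : t) {
                    counts.merge(item, 1, Integer::sum);
                }
            }
            counts.values().removeIf(c -> c < minSupport);  // drop infrequent items
            return counts;                                   // L1 with occurrence counts
        }
    }

Dividing each count by the number of transactions |D| gives the fractional support sup(Ai) = Ni/|D| defined above.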
Step 4 includes: for each frequent itemset Lx, with x ranging from 1 to 6, find all of its non-empty subsets and compute the confidence of each non-empty subset a. If the ratio of the support sup(Lx) of the frequent itemset Lx to the support sup(a) of the non-empty subset a exceeds the minimum confidence (the minimum confidence is configured by the user as required, e.g. set to 0.8), then the association rule a ==> (Lx - a) holds; otherwise no association rule exists. The association rules are the part-of-speech tagging rules.
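The rule-derivation step can be sketched as follows; this is an illustration under the assumption that the supports of all frequent itemsets have been recorded in a map, not the patent's own createTransRule() implementation:

    import java.util.*;

    public class RuleGenerator {
        // For a frequent itemset Lx, emit every rule a ==> (Lx - a) whose confidence
        // sup(Lx) / sup(a) reaches the minimum confidence (e.g. 0.8).
        static List<String> generateRules(Set<String> lx,
                                          Map<Set<String>, Double> support,
                                          double minConfidence) {
            List<String> rules = new ArrayList<>();
            for (Set<String> a : nonEmptyProperSubsets(lx)) {
                double confidence = support.get(lx) / support.get(a);
                if (confidence >= minConfidence) {
                    Set<String> consequent = new HashSet<>(lx);
                    consequent.removeAll(a);                  // Lx - a
                    rules.add(a + " ==> " + consequent);
                }
            }
            return rules;
        }

        // Non-empty proper subsets of the itemset (so the consequent Lx - a is non-empty).
        static List<Set<String>> nonEmptyProperSubsets(Set<String> itemset) {
            List<String> items = new ArrayList<>(itemset);
            List<Set<String>> subsets = new ArrayList<>();
            for (int mask = 1; mask < (1 << items.size()) - 1; mask++) {
                Set<String> s = new HashSet<>();
                for (int i = 0; i < items.size(); i++) {
                    if ((mask & (1 << i)) != 0) s.add(items.get(i));
                }
                subsets.add(s);
            }
            return subsets;
        }
    }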
Embodiment
The following modular program framework is designed:
(1) The Main function is responsible for the overall operation of the program, such as calling the initialization, the itemset computation, the association-rule algorithm, and the output of relevant information.
(2) The Apriori() constructor creates the graphical user interface.
(3) The print() function returns the relevant information to be output.
(4) The createTransRule() function creates the association rules.
(5) The six functions createL1(), createL2(), createL3(), createL4(), createL5() and createL6() create the frequent sets.
(6) The removeNotSupportKey() function deletes keys whose values are below the minimum support.
(7) The findKey(Set keyset, String a, String b, String c, String d, String e, String f) function searches the key set keyset for the key whose value is a, b, c, d, e, f.
(8) The contain(Set keyset, String a, String b, String c, String d, String e, String f) function judges whether the key set keyset contains the key whose value is a, b, c, d, e, f.
(9) The getMinusCollect(String[] a, String[] L) function computes the difference set of a and L.
(10) The getSubSet(String[] setN) function obtains the subsets of setN (illustrative sketches of (9) and (10) are given below).
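The patent names these helpers without listing their bodies; under the assumption that they operate on plain string arrays, minimal Java sketches of getMinusCollect and getSubSet might look like:

    import java.util.*;

    public class ArrayHelpers {
        // Difference set of a and l: every element of a that does not occur in l.
        static String[] getMinusCollect(String[] a, String[] l) {
            Set<String> exclude = new HashSet<>(Arrays.asList(l));
            List<String> result = new ArrayList<>();
            for (String s : a) {
                if (!exclude.contains(s)) result.add(s);
            }
            return result.toArray(new String[0]);
        }

        // All subsets of setN, enumerated with a bit mask.
        static List<String[]> getSubSet(String[] setN) {
            List<String[]> subsets = new ArrayList<>();
            for (int mask = 0; mask < (1 << setN.length); mask++) {
                List<String> subset = new ArrayList<>();
                for (int i = 0; i < setN.length; i++) {
                    if ((mask & (1 << i)) != 0) subset.add(setN[i]);
                }
                subsets.add(subset.toArray(new String[0]));
            }
            return subsets;
        }
    }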
The corpus used is the Uyghur-language edition of the Xinjiang Daily, whose subject matter covers politics, economics, sport, health, culture, art, entertainment, and so on. Stem segmentation, affix extraction and partial part-of-speech tagging have already been completed on this corpus.
Following the Apriori method in data mining, patterns of each length are mined separately; a minimum support and a minimum confidence are set for the final patterns, and the part-of-speech tagging rules are mined from them. The mined rules show the influence of words, parts of speech, and combinations of words and parts of speech on the part of speech of the current word.
In this embodiment, the part-of-speech tag set is Tags = {Tagi | i = 1, 2, ..., m}, the word set is Dwords = {Wordi | i = 1, 2, ..., n}, and the itemset is I = Dwords ∪ Tags, where Wordi and Tagi are the i-th word and its corresponding part-of-speech tag.
The tagged text is T = {(Wordi, Tagi) | Wordi ∈ Dwords, Tagi ∈ Tags}, where Tagi is the part-of-speech tag of the word Wordi in the tagged text. Patterns of several lengths are illustrated below:
Pattern 1: represents the number of occurrences of a single word or part of speech; the top three by occurrence count are n, v and adj. Since pattern 1 uses no contextual information, it does not form rules.
Pattern 2: represents the influence of the previous word or the previous part of speech on the current part of speech.
An acquired tagging rule is: if (word1, adv) then (word2, n), where adv denotes an adverb and n denotes a noun; the rule states that if the part of speech of the previous word is an adverb, then the part of speech of the following word is a noun.
Pattern 3: represents the influence of the combination of the two preceding words or parts of speech on the part of speech of the current word.
An acquired rule is: if (part of speech 1, v) then (word 2, " ") then (word 3, n), where v denotes a verb.
Pattern 6: represents the number of occurrences of {"previous word", "part of speech of the previous word", "current word", "part of speech of the current word", "following word", "part of speech of the following word"}.
Comparing patterns of different lengths makes the constraining effect of words within a pattern readily apparent.
The experimental data show that the absolute number of combinations of each pattern keeps increasing as the pattern length grows. Because more contextual constraints are involved, the support of a pattern decreases while its confidence increases, and the possibility of determining the part of speech uniquely also increases.
Since the number of times a word occurs together with its corresponding part of speech is far smaller than the number of times a part of speech occurs on its own, the situations in which a word in the contextual information constrains the corresponding part of speech are more numerous and more complex, which makes disambiguating words with multiple possible parts of speech harder; at the same time, a word, as a contextual factor, has a greater influence on the part of speech, i.e. it constrains the part of speech more precisely. In general, the influence of words on the part of speech within a pattern is somewhat larger, so the support of patterns containing words is smaller.
For experimental comparison, this embodiment first tags the above corpus with the maximum entropy method, obtaining an accuracy of 92.01%. Using the acquired tagging rules to optimize the annotation results on top of the maximum entropy tagging, the accuracy reaches 93.13%, better than the result of tagging with the purely statistics-based maximum entropy method.
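The patent does not spell out how the mined rules are applied on top of the maximum entropy output; one plausible post-processing sketch, given purely as an assumption with hypothetical types and field names, is:

    import java.util.*;

    public class RulePostProcessor {
        // A mined rule: if all context items (previous word/POS, current word,
        // following word/POS) are present, override the current word's tag.
        record TaggingRule(Set<String> context, String correctedTag) {}

        static String[] applyRules(String[] words, String[] tags, List<TaggingRule> rules) {
            String[] corrected = tags.clone();
            for (int i = 1; i < words.length - 1; i++) {
                Set<String> context = Set.of(
                    "prevWord=" + words[i - 1], "prevPos=" + tags[i - 1],
                    "curWord=" + words[i],
                    "nextWord=" + words[i + 1], "nextPos=" + tags[i + 1]);
                for (TaggingRule rule : rules) {
                    if (context.containsAll(rule.context())) {
                        corrected[i] = rule.correctedTag();   // rule overrides the statistical tag
                    }
                }
            }
            return corrected;
        }
    }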
The present invention provides a method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm. There are many specific ways and approaches to implement this technical solution, and the above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Any component not specified in this embodiment can be implemented with the available prior art.

Claims (5)

1. A method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm, characterized by comprising the following steps:
Step 1: a transaction database is input, and the item set contained in the transaction database is set as I = {A1, A2, A3, A4, A5, A6}. A1 to A6 denote 6 subsets, i.e. the transaction database contains 6 items in total, where A1 denotes the set of previous words, A2 the set of previous-word parts of speech, A3 the set of current words, A4 the set of current-word parts of speech, A5 the set of following words, and A6 the set of following-word parts of speech;
Step 2: the item set I is scanned and the set L1 of frequent 1-itemsets is computed;
Step 3: the set L6 of frequent 6-itemsets is generated;
Step 4: association rules are derived from the frequent 6-itemsets.
2. The method according to claim 1, characterized in that step 2 comprises:
Let Ni denote the number of occurrences of the i-th item Ai of the item set I, with i ranging from 1 to 6; the support of the i-th item, sup(Ai), is computed as follows:
sup(Ai) = Ni / |D|,
where |D| denotes the number of transactions contained in the transaction database; the support of each item is compared with the preset minimum support min_support, and items whose support is below the minimum support are deleted, yielding the set L1 of frequent 1-itemsets.
3. The method according to claim 2, characterized in that step 3 comprises: generating the candidate frequent-itemset set C2 from L1 by joining and pruning, counting each candidate set in C2 and discarding the candidates whose support is below the minimum support, thereby producing the set L2 of frequent 2-itemsets;
generating the candidate frequent-itemset set C3 from L2 by joining and pruning, counting each candidate set in C3 and discarding the candidates whose support is below the minimum support, thereby producing the set L3 of frequent 3-itemsets;
generating the candidate frequent-itemset set C4 from L3 by joining and pruning, counting each candidate set in C4 and discarding the candidates whose support is below the minimum support, thereby producing the set L4 of frequent 4-itemsets;
generating the candidate frequent-itemset set C5 from L4 by joining and pruning, counting each candidate set in C5 and discarding the candidates whose support is below the minimum support, thereby producing the set L5 of frequent 5-itemsets;
generating the candidate frequent-itemset set C6 from L5 by joining and pruning, counting each candidate set in C6 and discarding the candidates whose support is below the minimum support, thereby producing the set L6 of frequent 6-itemsets.
4. The method according to claim 3, characterized in that step 4 comprises: for each frequent itemset Lx, with x ranging from 1 to 6, finding all of its non-empty subsets; if the ratio of the support sup(Lx) of the frequent itemset Lx to the support sup(a) of a non-empty subset a exceeds the minimum confidence, the association rule a ==> (Lx - a) holds, otherwise no association rule exists; the association rules are the part-of-speech tagging rules.
5. The method according to claim 4, characterized in that association rules are created with the createTransRule() function; frequent sets are created with the six functions createL1(), createL2(), createL3(), createL4(), createL5() and createL6(), which correspond to the sets L1, L2, L3, L4, L5 and L6 respectively; and the difference set of a and Lx is computed with the getMinusCollect(String[] a, String[] Lx) function.
CN201810466451.2A 2018-05-16 2018-05-16 Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm Pending CN108664642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810466451.2A CN108664642A (en) 2018-05-16 2018-05-16 Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm

Publications (1)

Publication Number Publication Date
CN108664642A true CN108664642A (en) 2018-10-16

Family

ID=63779752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810466451.2A Pending CN108664642A (en) 2018-05-16 2018-05-16 Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm

Country Status (1)

Country Link
CN (1) CN108664642A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105719155A (en) * 2015-09-14 2016-06-29 南京理工大学 Association rule algorithm based on Apriori improved algorithm
CN105320756A (en) * 2015-10-15 2016-02-10 江苏省邮电规划设计院有限责任公司 Improved Apriori algorithm based method for mining database association rule
CN106407296A (en) * 2016-08-30 2017-02-15 江苏省邮电规划设计院有限责任公司 Local scan association rule computer data analysis method based on pre-judging screening

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767617A (en) * 2018-12-20 2019-05-17 北京航空航天大学 A public security traffic control service anomaly data analysis method based on Apriori
CN109739953A (en) * 2018-12-30 2019-05-10 广西财经学院 Text retrieval method based on chi-square analysis-confidence framework and consequent expansion
CN109739953B (en) * 2018-12-30 2021-07-20 广西财经学院 Text retrieval method based on chi-square analysis-confidence framework and back-part expansion
CN110619073A (en) * 2019-08-30 2019-12-27 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN110619073B (en) * 2019-08-30 2022-04-22 北京影谱科技股份有限公司 Method and device for constructing video subtitle network expression dictionary based on Apriori algorithm
CN111309777A (en) * 2020-01-14 2020-06-19 哈尔滨工业大学 Report data mining method for improving association rule based on mutual exclusion expression

Similar Documents

Publication Publication Date Title
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN107220295B (en) Searching and mediating strategy recommendation method for human-human contradiction mediating case
CN108664642A (en) Method for automatically acquiring part-of-speech tagging rules based on the Apriori algorithm
Mishra et al. MAULIK: an effective stemmer for Hindi language
CN106776562A (en) A kind of keyword extracting method and extraction system
CN107679036A (en) A kind of wrong word monitoring method and system
CN110297880B (en) Corpus product recommendation method, apparatus, device and storage medium
CN103106189B (en) A kind of method and apparatus excavating synonym attribute word
CN105975475A (en) Chinese phrase string-based fine-grained thematic information extraction method
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
CN109145260A (en) A kind of text information extraction method
Béchet et al. Discovering linguistic patterns using sequence mining
CN109947951A (en) A kind of automatically updated emotion dictionary construction method for financial text analyzing
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN103678656A (en) Unsupervised automatic extraction method of microblog new words based on repeated word strings
CN108363691A (en) A kind of field term identifying system and method for 95598 work order of electric power
CN101957812A (en) Verb semantic information extracting method based on event ontology
CN106503256B (en) A kind of hot information method for digging based on social networks document
CN106776555A (en) A kind of comment text entity recognition method and device based on word model
CN109299248A (en) A kind of business intelligence collection method based on natural language processing
CN108491399A (en) Chinese to English machine translation method based on context iterative analysis
CN104008301A (en) Automatic construction method for hierarchical structure of domain concepts
Ayşe et al. Extraction of semantic word relations in Turkish from dictionary definitions
Mohnot et al. Hybrid approach for Part of Speech Tagger for Hindi language
CN107608959A (en) A kind of English social media short text place name identification method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181016

RJ01 Rejection of invention patent application after publication