CN104750779A - Chinese multi-class word identification method based on conditional random field

Info

Publication number: CN104750779A
Application number: CN201510096284.3A
Authority: CN
Inventors: 费凡; 徐文超; 杨雁峰; 刘云鹏; 汤俊; 杨艳琴
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2015-03-04
Filing date: 2015-03-04
Publication date: 2015-07-01

Abstract

The invention discloses a Chinese multi-class word identification method based on a conditional random field. The method includes the steps that entries related to multi-class words are acquired, and linguistic data are obtained from the entries; the linguistic data are segmented to generate chunks, and meanwhile the chunk characteristics of characters are generated in the chunks; part-of-speech tagging is performed on the characters to obtain the part-of-speech characteristics of the characters, and the characters are tagged through the chunk characteristics and the part-of-speech characteristics; part of the linguistic data are randomly selected to be trained, the rest of the linguistic data are tested, and then a first test result is obtained; a characteristic template is modified according to the characteristics of the linguistic data, the linguistic data continue being trained and tested after modification, and then a second test result is obtained; metric performance comparison is performed on the first test result and the second test result to improve identification of the multi-class words. According to the method, the Chinese multi-class words of the E-commerce field are identified through the conditional random field, and after the characteristics of the original characteristic template of the conditional random field are modified, the accuracy rate, recall rate and f value of multi-class word identification are increased.

Description

A Chinese multi-class word identification method based on a conditional random field

Technical field

The invention belongs to the field of text recognition for e-commerce products, and in particular relates to a method, based on a conditional random field, for identifying Chinese multi-class words in the e-commerce field.

Background art

With the progress of the times and of technology, ambiguous words have emerged in large numbers. An ambiguous word is a word or character that has two or more meanings; ambiguity arises from indefinite word meaning, unfixed syntax, fuzzy hierarchy, unclear reference, and so on. As a result, the same word or character is often understood differently by machines or by people. How accurately and how efficiently ambiguous words are identified therefore affects the results of text processing. Ambiguous words fall roughly into polyphonic words, homonyms, polysemous words, multi-class words, and contronyms. Earlier recognition research was limited to traditional Chinese word segmentation and did not address specific domains. The present invention targets only the multi-class words (words or characters that have two or more parts of speech) among the ambiguous words of the e-commerce field. It trains and tests a corpus with the original conditional random field feature template and with a modified feature template, with the aim of improving the performance of the conditional random field template in identifying multi-class words in the e-commerce field.

Recognition methods for Chinese text fall mainly into the following four categories:

1. Rule-based methods.

1a: String matching. The words or characters to be identified are matched against a dictionary (that is, a training set of a certain scale). By matching direction, these methods divide into forward matching, reverse matching, and bidirectional matching; by matching priority, they divide into maximum matching and minimum matching.

1b: Shortest-path methods. These use graph-theoretic algorithms such as Dijkstra's algorithm, Floyd's algorithm, the k-shortest-path algorithm, and the n-shortest-path algorithm, together with their derived variants.

These two approaches are only a small part of the rule-based methods. Every rule-based method identifies words according to rules set by its own designers, so it depends on whether those rules are reasonable and complete. Such methods are relatively subjective, cannot be applied to every corpus, perform poorly on ambiguous words, and have low accuracy.
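The forward maximum matching mentioned in item 1a can be sketched as follows. This is a minimal illustration rather than a method taken from the patent; the function name, the `max_len` parameter, and the toy dictionary are assumptions made for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text left to right, always taking the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary hits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical); a real system would use a large lexicon.
toy_dictionary = {"我们", "太阳"}
print(forward_max_match("我们的太阳", toy_dictionary))
```

Because the result depends entirely on the dictionary and the matching rule, a sketch like this also shows why rule-based methods struggle with words whose role changes with context.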

2. Understanding-based methods. These analyze syntax and semantics together, simulating a person's understanding of words or characters, and identify the corresponding words or characters in this way. Because Chinese words, characters, and syntax form a rather complex system, this approach needs a large amount of data, information, and knowledge.

3. Transformation-based methods. These start from a corpus already tagged with parts of speech, identify from it the part of speech that best fits each word or character, use the result as a training set, and then derive new rules (variants of the original rules) by learning from the existing rules.

4. Statistics-based methods. These use the composition and feature information of the surrounding context to compute probability statistics for each character and part of speech, and select the optimal state transition probability to decide the character and its part of speech. The three most representative models are the hidden Markov model (HMM), the maximum entropy Markov model (MEMM), and the conditional random field (CRF). The drawback of the HMM is that each observation depends only on the state, so the observation elements are treated as independent of one another. In real language, a character is related not only to its immediate neighbors but also to more distant characters, so the HMM achieves only a local optimum. The MEMM does take into account features linking the current character with more distant characters, but because states have different numbers of outgoing branches, the transition probabilities are unevenly distributed and the model tends to stay in certain states during state transfer; this is the label bias problem. Unlike the HMM and the MEMM, whose state transitions form directed graphs, the CRF is an undirected graph. This avoids the label bias problem of the MEMM, takes into account interrelated features between the current character and more distant characters, resolves the problems caused by purely local normalization and by treating characters as too independent, and achieves global optimization.
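For reference, the global normalization that distinguishes the CRF can be written in the standard linear-chain form below. The patent does not state this formula; the symbols (observation sequence x, label sequence y of length T, feature functions f_k with weights λ_k, and the normalizer Z(x)) are the usual textbook notation and are assumptions here.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Because Z(x) sums over whole label sequences rather than over the successors of a single state, no state is favored merely for having few outgoing transitions, which is the intuition behind avoiding label bias.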

Summary of the invention

The present invention proposes a Chinese multi-class word identification method based on a conditional random field, comprising the following steps:

Step 1: search for a Chinese multi-class word in the e-commerce field, obtain the entries related to the multi-class word, and obtain from the entries a corpus with e-commerce domain features;

Step 2: segment the corpus to generate chunks, and at the same time generate the chunk feature of each character within the chunks;

Step 3: perform part-of-speech tagging on the characters to obtain the part-of-speech feature of each character, and label the characters using the chunk features and the part-of-speech features;

Step 4: randomly select part of the corpus for training in the conditional random field, and test the remaining corpus in the conditional random field, to obtain a first experimental result;

Step 5: modify the feature template of the conditional random field according to the features of the corpus, then continue training and testing the corpus in the conditional random field, to obtain a second experimental result;

Step 6: evaluate the first experimental result and the second experimental result on performance metrics, so as to improve the identification of multi-class words.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, step 1 comprises the following steps:

Step 1a: in the e-commerce field, search using the occlusion of the multi-class word and obtain the entries related to the occlusion; entries consistent with the commodity name are added to the corpus, and entries that are not consistent are corrected to the corresponding commodity name and then added to the corpus;

Step 1b: search using the adjective form of the multi-class word and obtain the entries related to the adjective form; entries consistent with the commodity name are added to the corpus, and entries that are not consistent are corrected to the corresponding commodity name and then added to the corpus.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, in step 2, each entry is segmented, according to the content a product carries in the e-commerce field, into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a commodity name chunk (NAM), and a net content chunk.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, in step 2, if a chunk contains two or more characters, the chunk feature of the first character is "initial character" and the chunk feature of each remaining character is "following character"; if a chunk contains only one character, the chunk feature of that character is "independent chunk".

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, in step 3, the part-of-speech features include noun, verb, and adjective.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, step 4 comprises the following steps:

Step 4a: from the corpus, randomly select entries containing the adjective form or the occlusion of one multi-class word and place them in the training set of the conditional random field for training, and place the other entries containing the adjective form or the occlusion of that multi-class word in the test set of the conditional random field for testing;

Step 4b: after training and testing are complete, repeat step 4a with another randomly selected part of the corpus, until all of the corpus has been trained and tested.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, step 5 comprises the following steps:

Step 5a: change the combination features for part-of-speech association in the feature template of the conditional random field;

Step 5b: return to step 4, retrain on the training set of each multi-class word and test on the test set of each multi-class word, to obtain the second experimental result.

In the Chinese multi-class word identification method based on a conditional random field according to the present invention, step 6 comprises the following steps:

Step 6a: evaluate the first experimental result and the second experimental result separately on three performance metrics, using the CoNLL-2000 evaluation script written in Perl; the metrics are precision, recall, and F value;

Step 6b: if the second experimental result is lower than the first experimental result, return to step 5, modify the feature template, and obtain the second experimental result again, until the second experimental result is better than the first experimental result.
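The three metrics named in step 6a can be sketched as follows. This is a simplified chunk-level scorer written for illustration, not the CoNLL-2000 Perl script itself; the function names are hypothetical, and the handling of labels that do not start with B or I (they simply close the current chunk) is an assumption.

```python
def extract_chunks(labels):
    """Return the set of (start, end, type) spans encoded by B-X / I-X labels."""
    chunks, start, kind = set(), None, None
    # A sentinel label closes any chunk still open at the end of the sequence.
    for i, label in enumerate(list(labels) + ["END"]):
        prefix, _, tag = label.partition("-")
        continues = prefix == "I" and tag == kind and start is not None
        if not continues:
            if start is not None:
                chunks.add((start, i, kind))
            # Only a B label opens a new chunk (assumption of this sketch).
            start, kind = (i, tag) if prefix == "B" else (None, None)
    return chunks


def score(gold_labels, predicted_labels):
    """Return (precision, recall, F value) over exactly matching chunks."""
    gold = extract_chunks(gold_labels)
    predicted = extract_chunks(predicted_labels)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_value = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_value
```

A predicted chunk counts as correct only when its boundaries and its type both match a gold chunk, so a single wrong boundary lowers precision and recall together.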

In the above summary, the features of the corpus include part of speech, semantics, the relationships between characters, and so on. The part-of-speech features include noun, verb, adjective, and so on.

The beneficial effect of the present invention is that, compared with the general-purpose CRF feature template, the modified feature template is a better match for identifying multi-class words in the e-commerce field.

Brief description of the drawings

Fig. 1 is a flowchart of the Chinese multi-class word identification method based on a conditional random field according to the present invention.

Fig. 2 is a detailed flowchart of step 1.

Fig. 3 is a detailed flowchart of step 2.

Fig. 4 is a detailed flowchart of step 3.

Fig. 5 is a detailed flowchart of step 4.

Fig. 6 is a detailed flowchart of step 5.

Fig. 7 is a detailed flowchart of step 6.

Detailed description of the embodiments

The present invention is described in further detail with reference to the following specific embodiments and drawings. Except for the content specifically mentioned below, the processes, conditions, experimental methods, and the like used to implement the invention are general knowledge and common knowledge in the art, and the invention places no particular limitation on them.

As shown in Fig. 1, the present invention specifically comprises the following steps:

Step 5: modify the feature template of the conditional random field according to features of the corpus such as part of speech, semantics, and the relationships between characters, then continue training and testing the corpus in the conditional random field, to obtain a second experimental result;

Step 6: evaluate the experimental results on performance metrics, so as to improve the identification of multi-class words.

Each of the above steps is explained in detail below with reference to specific embodiments, to illustrate the technical solution of the present invention.

As shown in Fig. 2, step 1 is carried out through the following steps:

Step 1a: open the home page of a shop or of Taobao, search in the commodity search box using the occlusion of the multi-class word, and obtain the entries related to the occlusion. Entries consistent with the commodity name shown in the packaging picture are added to the corpus; entries that are not consistent are corrected to the corresponding commodity name and then added to the corpus. For example, some entries add superfluous attributes and modifiers that do not appear in the commodity name, such as "fresh organic pollution-free vegetable, open-field, naturally ripened, not artificially reddened, authentic tomato", whereas the commodity details page shows that the packaging reads only "fresh organic tomato".

Step 1b: next, enter the adjective form of the multi-class word and collect all commodity entries for the adjective form. Entries consistent with the commodity name shown in the packaging picture are used directly as test corpus; entries that are not consistent are corrected to the commodity name shown on the product packaging and are likewise used as test corpus.

Once the corpus with e-commerce domain features has been obtained, the corpus is segmented. Fig. 3 shows the specific implementation flow of step 2, which mainly comprises the following steps:

Step 2a: segment each collected commodity entry into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a commodity name chunk (NAM), and a net content chunk. For example, the entry "Northern Field Taiwan import brown rice roll milk flavor children's biscuit 150G" is segmented in the prescribed format as: Northern Field/manufacturer, Taiwan import/place of origin, brown rice roll/brand, milk flavor children's biscuit/commodity name, 150G/net content.

Step 2b: divide each chunk into characters. The chunk feature of the first character of a chunk is "initial character", marked B; the chunk feature of every subsequent character is "following character", marked I; if a chunk contains only one character, the chunk feature of that character is "independent chunk", marked O. For example, in 我们的太阳 ("our sun"), 我们 ("we") is one chunk, with 我 marked B and 们 marked I; 的 is an independent chunk, marked O; and 太阳 ("sun") is one chunk, with 太 marked B and 阳 marked I.
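The B/I/O marking of step 2b can be sketched as follows, using the "our sun" example (assumed here to be the phrase 我们的太阳, already split into chunks). The function names are hypothetical and the chunk boundaries are taken as given; this sketch does not perform the segmentation of step 2a.

```python
def chunk_marks(chunk):
    """Return one positional mark per character of a chunk."""
    if len(chunk) == 1:
        return ["O"]  # a single-character chunk is an independent chunk
    # First character is the initial character, the rest follow it.
    return ["B"] + ["I"] * (len(chunk) - 1)


def mark_chunks(chunks):
    """Pair every character of every chunk with its B, I, or O mark."""
    pairs = []
    for chunk in chunks:
        pairs.extend(zip(chunk, chunk_marks(chunk)))
    return pairs


for character, mark in mark_chunks(["我们", "的", "太阳"]):
    print(character, mark)
```

Note that O here marks a one-character chunk, as the description states, which differs from the common convention in which O means "outside any chunk".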

After the corpus has been segmented, step 3 performs part-of-speech tagging on the corpus. Fig. 4 shows this step, which proceeds as follows:

Step 3a: tag every segmented character with its part of speech: nouns are tagged as nouns, adjectives as adjectives, and so on. The specific part-of-speech tags are as follows:

Verb V V; Bag M; Bag/case M M; Bag/group M M; Bag/group bag M; Package M; Taste Z; Brand N N; Brand+category NL; Brand+category+commodity NLC; Brand+commodity LC; Brand+businessman LJ; Brand+businessman+category NJL; Brand+businessman+category+commodity NJLC; Brand+businessman+commodity NJS; Brand+color NNY; Category NP; Category+commodity PC; Category+seasonal PT; Commodity NPC; Commodity+seasonal CT; Businessman NJ; Businessman+category JL; Businessman+category+commodity JLC; Businessman+commodity JC; Place name NS NS; Age T T; Adjective A A; Shape AD AD; Carry M; Prop up M; Quantity Q Q; Seasonal TG TG; Bar M; Cup M; Cup/case M; Bucket M; Sheet M; Bottle M; Box M; Bowl M; Symbol W W; Cylinder M; Case M; Group M; Tank M; Tank/group M; Tank group M; Bag M; Bag/group M; Bag bag M; Specification NG; Quality NQ; Conjunction H; Amount of money NM; Color NY.

For example:

Modern N; Wheat N; Youth N; Bone NP; Soup NP; Play NPC; Face NPC; Dense Z; Soup Z; Sea Z; Fresh Z; Cup NPC; Face NPC; 7 M; 8 M; Gram M.

Step 3b: for each character, combine two attributes: the identity of the character (initial character, following character, or independent chunk) and the part of speech tagged for the character.

For example:

Modern N B-N; Wheat N I-N; Youth N I-N; Bone NP B-NP; Soup NP I-NP; Play NPC B-NPC; Face NPC I-NPC; Dense Z B-Z; Soup Z I-Z; Sea Z I-Z; Fresh Z I-Z; Cup NPC B-NPC; Face NPC I-NPC; 7 M B-M; 8 M I-M; Gram M I-M.
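The combination in step 3b can be sketched as follows. The function name is hypothetical, the English glosses from the example stand in for the original characters, and the `O-` prefix used for single-character chunks is an assumption, since the example above shows only B and I labels.

```python
def label_chunk(characters, pos_tag):
    """Return (character, POS tag, combined label) rows for one chunk."""
    rows = []
    for index, character in enumerate(characters):
        if len(characters) == 1:
            position = "O"  # assumption: independent chunks use the same format
        elif index == 0:
            position = "B"
        else:
            position = "I"
        # The combined label joins the positional mark and the part of speech.
        rows.append((character, pos_tag, f"{position}-{pos_tag}"))
    return rows


for row in label_chunk(["Modern", "Wheat", "Youth"], "N"):
    print(*row)
```

The third column is the label the conditional random field is asked to predict, while the first two columns serve as observed features.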

After the corpus has been tagged with parts of speech, step 4 trains and tests on the corpus. Fig. 5 shows this step, which proceeds as follows:

Step 4a: from the corpus that conforms to the labeling format required by the conditional random field, randomly select entries containing the adjective form and the occlusion of a given multi-class word and place them in the training set for training; place the other entries containing the adjective form and the occlusion of that multi-class word in the test set for testing;

Step 4b: in the manner described in step 4a, carry out the same random training and testing for the multi-class words in the remaining e-commerce corpus, to obtain the first experimental result. In this first experimental result, precision reaches 66.47%, recall reaches 63.13%, and the F value reaches 65.36%.
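The random selection in step 4 can be sketched as follows. The patent does not state the proportion of training to test entries, so `train_fraction` is an assumption, as are the function name and the fixed seed used to make the split reproducible.

```python
import random


def split_corpus(entries, train_fraction=0.8, seed=0):
    """Randomly divide corpus entries into a training set and a test set."""
    shuffled = list(entries)
    # A private Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_corpus([f"entry-{n}" for n in range(10)])
print(len(train_set), len(test_set))
```

Every entry lands in exactly one of the two sets, so no test entry is seen during training.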

Training is performed as follows: open a command prompt, change to the directory where the training set is stored, and enter the training command: ../crf_learn -[optional parameters] template train.data model. Testing is performed as follows: open a command prompt, change to the directory where the test set is stored, and enter the test command: ../crf_test -[optional parameters] -m model test.data.
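A sketch of how the labeled rows might be written in the column format read by crf_learn and crf_test, and how the two commands above could be assembled, is given below. It assumes the CRF++ convention of one token per line, whitespace-separated columns with the label last, and a blank line after each entry; the function names are hypothetical, the optional parameters are omitted, and the commands are only built, not executed.

```python
def to_crf_data(entries):
    """Serialize entries, each a list of (character, POS tag, label) rows."""
    lines = []
    for entry in entries:
        for row in entry:
            lines.append("\t".join(row))
        lines.append("")  # a blank line ends each entry
    return "\n".join(lines) + "\n"


def build_commands(template="template", train="train.data",
                   model="model", test="test.data"):
    """Return the training and test commands as argument lists."""
    learn = ["../crf_learn", template, train, model]
    evaluate = ["../crf_test", "-m", model, test]
    return learn, evaluate


entry = [("Modern", "N", "B-N"), ("Wheat", "N", "I-N"), ("Youth", "N", "I-N")]
print(to_crf_data([entry]), end="")
print(build_commands())
```

The argument lists could be passed to `subprocess.run` from the directory that holds the data files, provided the executables exist at the relative path shown in the description.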

After the corpus has been trained and tested, step 5 modifies the feature template according to the features of the corpus and continues training and testing. Fig. 6 shows this step, which proceeds as follows:

Step 5a: starting from the original conditional random field feature template, add, remove, or change combination features for the association of particular parts of speech. The original feature template of the conditional random field is as follows:

# Unigram
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]
U07:%x[-2,1]
U08:%x[-1,1]
U09:%x[0,1]
U10:%x[1,1]
U11:%x[2,1]
U12:%x[-2,1]/%x[-1,1]
U13:%x[-1,1]/%x[0,1]
U14:%x[0,1]/%x[1,1]
U15:%x[1,1]/%x[2,1]
U16:%x[-2,1]/%x[-1,1]/%x[0,1]
U17:%x[-1,1]/%x[0,1]/%x[1,1]
U18:%x[0,1]/%x[1,1]/%x[2,1]

# Bigram
B

After repeated experiments and successive revisions, the feature template that finally performed better than the original template on the tested corpus is as follows:

# Unigram
U00:%x[-3,0]
U01:%x[-2,0]
U02:%x[-1,0]
U03:%x[0,0]
U04:%x[1,0]
U05:%x[2,0]
U06:%x[3,0]
U07:%x[-2,0]/%x[0,0]
U08:%x[-1,0]/%x[0,0]
U09:%x[0,0]/%x[1,0]
U10:%x[-2,0]/%x[-1,0]
U11:%x[1,0]/%x[2,0]
U12:%x[0,0]/%x[2,0]
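How a template line turns into a concrete feature can be sketched as follows. Each `x[row,col]` macro is replaced by the value in column `col` of the token `row` positions away from the current one. The function name and the `_PAD_` placeholder for positions outside the entry are assumptions of this sketch (the CRF toolkit uses its own boundary markers), and the pattern accepts both the ASCII and the full-width percent sign.

```python
import re

# Matches macros such as %x[-1,0]; the full-width sign is accepted as well.
MACRO = re.compile(r"[%％]x\[(-?\d+),(\d+)\]")


def expand(template_line, rows, position, pad="_PAD_"):
    """Expand every macro in a template line for the token at `position`."""
    def lookup(match):
        row = position + int(match.group(1))
        col = int(match.group(2))
        if 0 <= row < len(rows):
            return rows[row][col]
        return pad  # placeholder for positions before or after the entry
    return MACRO.sub(lookup, template_line)


rows = [("Modern", "N"), ("Wheat", "N"), ("Youth", "N")]
print(expand("U08:%x[-1,0]/%x[0,0]", rows, 1))
print(expand("U00:%x[-3,0]", rows, 1))
```

Seen this way, the modified template widens the character window to three positions on each side and combines characters only with each other, since every macro in it reads column 0.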

Step 5b: in the manner described in step 4, retrain on the training set of each multi-class word and test on the test set of each multi-class word, to obtain the second experimental result. In this second experimental result, precision reaches 77.86%, recall reaches 73.17%, and the F value reaches 74.59%.

After the conditional random field feature template has been modified, step 6 evaluates the experimental results on performance metrics. Fig. 7 shows this step, which proceeds as follows:

Step 6a: install a Perl interpreter, and evaluate the results of testing with each of the two templates on three performance metrics, using the CoNLL-2000 evaluation script written in Perl; once the test output is fed to the script through the Perl interpreter, the three performance indicators are generated automatically;

Step 6b: if the modified template is better than the original template on all three performance indicators, remove from the corpus the entries that were tested in the first round. From the remaining corpus, again randomly select one part as the training set and another part as the test set, and again compare the performance indicators obtained with the two templates. If the modified template is still better than the original template, remove the entries tested in the second round from the corpus as well, and continue the same training, testing, and comparison on the remaining corpus. If the new template is better than the original template in five consecutive rounds, that is, over a total of about ten thousand training and testing entries (about 2,000 training-set and test-set entries per round), the new template is shown to be a better match for identifying multi-class words in the e-commerce field. If in any round any performance indicator is not better than that of the original template, continue modifying the template and test again.
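The acceptance rule of step 6b can be sketched as follows. The function name and the data layout (one pair of metric triples per round) are assumptions, and the numbers in the usage lines are toy values, not results from the patent.

```python
def new_template_accepted(round_results, required_rounds=5):
    """Check whether the modified template wins enough consecutive rounds.

    round_results yields (original, modified) pairs, each a tuple of
    (precision, recall, F value) measured on a fresh part of the corpus.
    """
    wins = 0
    for original, modified in round_results:
        # The modified template must be better on every one of the metrics.
        if all(new > old for new, old in zip(modified, original)):
            wins += 1
            if wins == required_rounds:
                return True
        else:
            return False  # per step 6b: revise the template and test again
    return False  # fewer rounds than required were available


toy_round = ((0.60, 0.58, 0.59), (0.70, 0.66, 0.68))  # toy values only
print(new_template_accepted([toy_round] * 5))
print(new_template_accepted([toy_round] * 4))
```

Requiring a win on all three metrics in every round makes the rule conservative: a single indicator that fails to improve sends the template back for revision.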

The scope of protection of the present invention is not limited to the above embodiments. Variations and advantages that those skilled in the art can conceive of without departing from the spirit and scope of the inventive concept are all included in the present invention, and the scope of protection is defined by the appended claims.

Claims

1. A Chinese multi-class word identification method based on a conditional random field, characterized in that it comprises the following steps:

2. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that step 1 comprises the following steps:

3. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that, in step 2, each entry is segmented, according to the content a product carries in the e-commerce field, into a manufacturer chunk, a place-of-origin chunk, a brand chunk, a commodity name chunk (NAM), and a net content chunk.

4. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that, in step 2, if a chunk contains two or more characters, the chunk feature of the first character is "initial character" and the chunk feature of each remaining character is "following character"; if a chunk contains only one character, the chunk feature of that character is "independent chunk".

5. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that, in step 3, the part-of-speech features include noun, verb, and adjective.

6. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that step 4 comprises the following steps:

7. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that step 5 comprises the following steps:

8. The Chinese multi-class word identification method based on a conditional random field as claimed in claim 1, characterized in that step 6 comprises the following steps: