CN110362821A

CN110362821A - A Laotian base noun phrase recognition method based on a stacked combined classifier

Info

Publication number: CN110362821A
Application number: CN201910520748.7A
Authority: CN
Inventors: 周兰江; 汤礼欣; 张建安
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-06-17
Filing date: 2019-06-17
Publication date: 2019-10-22

Abstract

The invention discloses a Laotian base noun phrase recognition method based on a stacked combined classifier. It belongs to the field of natural language processing, in which base noun phrase recognition is a fundamental task. The invention incorporates Laotian linguistic features into the algorithm model and adopts a combined-classifier approach to improve recognition accuracy; the selected algorithm model is CRF+SVM. First, three recognition models (CRF, forward SVM, and reverse SVM) each label the data, yielding 3 different sets of prediction results, which together with the original data set form a new data set. Next, the model with the best sequence labeling performance is selected as the upper-layer classification algorithm. Finally, using the new data set, with the words, the parts of speech, and the classification results of the 3 models as features, the data is fed to the upper-layer model, and its recognition result is taken as the final result. The invention achieves higher accuracy in Laotian base noun phrase recognition and is of some research significance.

Description

A Laotian base noun phrase recognition method based on a stacked combined classifier

Technical field

The present invention relates to a Laotian base noun phrase recognition method based on a stacked combined classifier, and belongs to the field of minority-language recognition within natural language processing.

Background art

Base noun phrase recognition is an important foundational task in natural language processing. Noun phrases occur in large numbers in sentences and often take on important syntactic roles such as subject and object, so recognizing them accurately plays a key role in simplifying sentence structure for further syntactic analysis. Current research on noun phrases is broadly divided into statistics-based recognition methods and rule-based recognition methods, and concentrates on the recognition of base noun phrases and maximal noun phrases. The present invention uses three recognition models (CRF, forward SVM, and reverse SVM) to label the data separately, obtaining 3 different sets of prediction results, which together with the original data set form a new data set. The model with the best sequence labeling performance is then selected as the upper-layer classification model. Finally, using the new data set, with the words, the parts of speech, and the classification results of the 3 models as features, the data is fed to the upper-layer model, and its recognition result is taken as the final result. The method of building a stacked combined classifier has not yet been applied to research on Laotian base noun phrase recognition.

Summary of the invention

The technical problem to be solved by the present invention is to provide a Laotian base noun phrase recognition method based on a stacked combined classifier, in order to solve the problem of Laotian base noun phrase recognition.

The technical solution adopted by the present invention is a Laotian base noun phrase recognition method based on a stacked combined classifier, with the following specific steps:

Step1: run the Laotian base noun phrase corpus through a word segmentation and part-of-speech tagging system, then manually annotate the Laotian base noun phrases to obtain the experimental data set;

Step2: divide the experimental data set obtained in Step1 into 5 parts, of which 4 parts serve as the training corpus and 1 part serves as the test corpus;

Step3: label the training corpus obtained in Step2 with each of three models (CRF, forward SVM, and reverse SVM) to obtain 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form a new data set;

Step4: evaluate the models on the test corpus obtained in Step2, and select the model with the best sequence labeling performance as the upper-layer classification model;

Step5: using the new data set obtained in Step3, with the words, the parts of speech, and the 3 sets of classification results obtained in Step3 as features, feed the data to the upper-layer classification model, and take its recognition result as the final result.
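The way the new data set of Step3 is assembled can be sketched as follows. This is a minimal illustration, not the patented implementation: the three toy taggers are hypothetical stand-ins for the CRF, forward SVM, and reverse SVM models, and the part-of-speech names `N`, `ADJ`, and `V` as well as the placeholder words are assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Token = Tuple[str, str]                      # (word, part-of-speech tag)
Tagger = Callable[[List[Token]], List[str]]  # returns one B/I/O tag per token


def build_stacked_rows(sentence: List[Token],
                       gold_tags: Sequence[str],
                       taggers: Sequence[Tagger]) -> List[tuple]:
    """Return rows of (word, pos, pred_1, ..., pred_k, gold) for one sentence."""
    predictions = [tagger(sentence) for tagger in taggers]
    rows = []
    for i, (word, pos) in enumerate(sentence):
        base_tags = tuple(pred[i] for pred in predictions)
        rows.append((word, pos) + base_tags + (gold_tags[i],))
    return rows


# Hypothetical toy taggers; real base models would be trained classifiers.
def tag_noun_adj(sent: List[Token]) -> List[str]:
    tags: List[str] = []
    for _, pos in sent:
        if pos == "N":
            tags.append("B")
        elif pos == "ADJ" and tags and tags[-1] in ("B", "I"):
            tags.append("I")
        else:
            tags.append("O")
    return tags


def tag_nouns_only(sent: List[Token]) -> List[str]:
    return ["B" if pos == "N" else "O" for _, pos in sent]


def tag_all_outside(sent: List[Token]) -> List[str]:
    return ["O"] * len(sent)


sentence = [("w1", "N"), ("w2", "ADJ"), ("w3", "V")]
gold = ["B", "I", "O"]
rows = build_stacked_rows(sentence, gold,
                          [tag_noun_adj, tag_nouns_only, tag_all_outside])
```

Each row keeps the original word and part of speech and appends one column per base model, so the upper-layer model sees where the base models agree and where they differ.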

Specifically, the manual annotation in Step1 is defined as follows: for an input word sequence S = W₁W₂…W_i…W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding label sequence T* = T₁T₂…T_i…T_n that has the maximum probability among all possible label sequences, where T_i ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.
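The B/I/O scheme can be illustrated with a short sketch that converts annotated phrase spans into a label sequence. The span representation (start index inclusive, end index exclusive) and the function name are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def spans_to_bio(length: int, spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Convert phrase spans [start, end) into one B/I/O label per word."""
    tags = ["O"] * length
    for start, end in spans:
        tags[start] = "B"                  # left boundary of the phrase
        for i in range(start + 1, end):
            tags[i] = "I"                  # inside the phrase
    return tags


# A six-word sentence with base noun phrases at words 0-1 and 3-5.
labels = spans_to_bio(6, [(0, 2), (3, 6)])
```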

Specifically, the CRF in Step3 uses contextual-information features of Laotian words; by learning from the training corpus of Step2, it obtains the feature set and feature weights that maximize the conditional probability of the training samples' label sequences over the set of label sequences.
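For orientation, a linear-chain CRF is commonly written as below. This is the standard textbook form rather than a formula taken from the source: the feature functions f_k and weights λ_k are generic placeholders for the feature set and feature weights mentioned above, and the training-sample index j is introduced only for illustration.

```latex
P(T \mid S) = \frac{1}{Z(S)} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(T_{i-1}, T_i, S, i) \Big),
\qquad
Z(S) = \sum_{T'} \exp\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(T'_{i-1}, T'_i, S, i) \Big)

\hat{\lambda} = \arg\max_{\lambda} \sum_{j} \log P\big(T^{(j)} \mid S^{(j)}\big)
```

The normalizer Z(S) sums over all candidate label sequences for the sentence, so the probability is normalized over the whole sequence rather than at each position.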

Specifically, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2; exploiting the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of prediction results.
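Why the two directions can disagree is easiest to see in a sketch. The code below assumes a greedy tagger whose features include the label decided at the previously visited position; `toy_classify` is a hypothetical rule standing in for a trained SVM, and the part-of-speech names are placeholders.

```python
from typing import Callable, Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def greedy_tag(sentence: List[Token],
               classify: Callable[[Dict[str, str]], str],
               reverse: bool = False) -> List[str]:
    """Tag left to right, or right to left when reverse=True."""
    n = len(sentence)
    step = -1 if reverse else 1
    positions = range(n - 1, -1, -1) if reverse else range(n)
    tags: List[str] = [""] * n
    for i in positions:
        prev = i - step  # the position that was decided just before i
        history = tags[prev] if 0 <= prev < n else "BOUNDARY"
        word, pos = sentence[i]
        tags[i] = classify({"word": word, "pos": pos, "history": history})
    return tags


def toy_classify(features: Dict[str, str]) -> str:
    """Hypothetical rule: an adjective joins a phrase only if one was just seen."""
    if features["pos"] == "N":
        return "B"
    if features["pos"] == "ADJ" and features["history"] in ("B", "I"):
        return "I"
    return "O"


sentence = [("w1", "N"), ("w2", "ADJ"), ("w3", "V")]
forward_tags = greedy_tag(sentence, toy_classify)
reverse_tags = greedy_tag(sentence, toy_classify, reverse=True)
```

In the forward pass the adjective sees the noun's label and is attached to the phrase; in the reverse pass it is visited before the noun and stays outside, so the same classifier yields two different label sequences.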

Specifically, in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which measures precision (P), recall (R), and the F value.
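A phrase-level evaluation in the spirit of CoNLL-2000 can be sketched as follows. It is a simplified assumption-laden version, not the official evaluation script: a phrase counts as correct only if both boundaries match, a stray `I` with no preceding `B` is ignored, and the model names in the example are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def bio_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Collect phrase spans [start, end) from a B/I/O label sequence."""
    spans: Set[Tuple[int, int]] = set()
    start = None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final phrase
        if start is not None and tag != "I":
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Phrase-level precision, recall and balanced F value."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    p = correct / len(pred_spans) if pred_spans else 0.0
    r = correct / len(gold_spans) if gold_spans else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def select_upper_model(gold: List[str], preds: Dict[str, List[str]]) -> str:
    """Return the name of the model with the highest F value."""
    return max(preds, key=lambda name: prf(gold, preds[name])[2])


gold = ["B", "I", "O", "B", "I", "O"]
pred = ["B", "I", "O", "B", "O", "O"]
scores = prf(gold, pred)
best = select_upper_model(gold, {"crf": list(gold), "svm_forward": pred})
```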

The beneficial effects of the present invention are:

(1) The algorithm model selected by the invention is CRF+SVM. Compared with the HMM (hidden Markov model) and the MEMM (maximum entropy Markov model), the CRF both removes the HMM's output independence assumption and resolves the MEMM's label bias problem. The MEMM is prone to local optima because it normalizes only locally, whereas the CRF computes a global probability: normalization takes the global distribution of the data into account rather than only local neighborhoods, so decoding the label sequence yields the globally optimal solution. The SVM can use a kernel function to map into a higher-dimensional space and can therefore handle nonlinear classification; its classification principle is simple and its classification performance is good.

(2) The stacked combined classifier of the invention assigns suitable weights to the labeling results of the independent classifiers through the structure of the model, which allows the strengths of the different independent classifiers to be fused. All metrics for the recognized Laotian base noun phrases are higher than the best results among the single models; even a model that performs poorly on its own contributes during fusion, improving the accuracy of Laotian base noun phrase recognition.

Brief description of the drawings

Fig. 1 is a flowchart of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to the drawings and specific embodiments.

Embodiment 1: as shown in Fig. 1, a Laotian base noun phrase recognition method based on a stacked combined classifier comprises the following specific steps:

Step1: obtain the Laotian base noun phrase corpus. The corpus is run through a word segmentation and part-of-speech tagging system, and the Laotian base noun phrases are then manually annotated by Laotian students. A base noun phrase is defined as a noun phrase that does not contain any smaller noun phrase. According to Laotian grammar, Laotian base noun phrases (BaseNP) take the following forms:

1. BaseNP → BaseNP + BaseNP

2. BaseNP → BaseNP + adjective

3. BaseNP → BaseNP + verb

4. BaseNP → BaseNP + numeral-classifier phrase

5. BaseNP → BaseNP + demonstrative

6. BaseNP → BaseNP + subject-predicate phrase

7. BaseNP → BaseNP + prepositional structure

8. BaseNP → BaseNP + demonstrative + adjective

9. BaseNP → BaseNP + adjective + numeral-classifier phrase

10. BaseNP → BaseNP + adjective + verb

Step2: after the Laotian base noun phrase corpus has been annotated, divide the data set into a training corpus and a test corpus. Specifically, the Laotian base noun phrase corpus manually annotated in Step1 is divided into 5 parts, of which 4 parts serve as the training corpus and 1 part serves as the test corpus.
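The 5-part division can be sketched as below. The source does not say how sentences are assigned to parts, so the seeded shuffle, the round-robin assignment, and the choice of which part becomes the test corpus are all assumptions of this example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

S = TypeVar("S")


def split_five_parts(sentences: Sequence[S],
                     test_part: int = 0,
                     seed: int = 0) -> Tuple[List[S], List[S]]:
    """Divide the corpus into 5 parts: 4 for training, 1 for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)          # reproducible shuffle
    parts = [shuffled[i::5] for i in range(5)]     # round-robin into 5 parts
    test = parts[test_part]
    train = [s for k, part in enumerate(parts) if k != test_part for s in part]
    return train, test


# Integers stand in for annotated sentences.
train, test = split_five_parts(list(range(10)))
```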

Step3: label with each of the three models to obtain 3 different sets of prediction results, which together with the original data set form a new data set. Specifically, the CRF, forward SVM, and reverse SVM models each label the training corpus obtained in Step2, yielding 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form the new data set.

Step4: evaluate the models on the test corpus obtained in Step2, and select the model with the best sequence labeling performance as the upper-layer classification model. Specifically, precision (P), recall (R), and the F value are used to assess the recognition performance of the three models comprehensively. The formulas for P, R, and F are as follows:
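A reconstruction consistent with the symbol definitions given in this step, assuming the balanced F value used by the CoNLL-2000 evaluation (the percentage scaling is likewise an assumption):

```latex
P = \frac{N_C}{N_I} \times 100\%, \qquad
R = \frac{N_C}{N_Y} \times 100\%, \qquad
F = \frac{2 \times P \times R}{P + R}
```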

Here, N_C denotes the number of correctly recognized Laotian base noun phrases, N_I denotes the number of Laotian base noun phrases recognized by the model, and N_Y denotes the total number of Laotian base noun phrases in the corpus. Precision (P) reflects how accurately the model recognizes phrases, and recall (R) reflects how completely it finds them. The F value combines precision and recall and reflects the overall performance of the algorithm. Evaluating with P, R, and F therefore gives a comprehensive view of the recognition performance of the three models.

Step5: using the new data set, with the words, the parts of speech, and the 3 different sets of classification results obtained by labeling the experimental data set with the CRF, forward SVM, and reverse SVM models as features, feed the data to the upper-layer model and take its recognition result as the final result, thereby improving recognition accuracy.
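One way to turn the new data set into input for the upper-layer model is a per-token feature dictionary, sketched below. The context window of one token on each side, the feature names, and the row layout (word, part of speech, CRF label, forward-SVM label, reverse-SVM label) are assumptions; the source states only which kinds of information serve as features.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, str, str, str]  # word, pos, crf, svm_forward, svm_reverse
COLUMNS = ("word", "pos", "crf", "svm_forward", "svm_reverse")


def upper_layer_features(rows: List[Row], i: int, window: int = 1) -> Dict[str, str]:
    """Features for token i: all five columns of each token in the window."""
    features: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(rows):             # skip positions outside the sentence
            for name, value in zip(COLUMNS, rows[j]):
                features[f"{name}[{offset}]"] = value
    return features


rows = [("w1", "N", "B", "B", "O"), ("w2", "ADJ", "I", "O", "O")]
features = upper_layer_features(rows, 0)
```

The resulting dictionaries can be passed to whichever model was selected in Step4, after conversion to that model's own input format.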

Further, the manual annotation in Step1 is defined as follows: for an input word sequence S = W₁W₂…W_i…W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding label sequence T* = T₁T₂…T_i…T_n that has the maximum probability among all possible label sequences, where T_i ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.

Further, the CRF in Step3 uses contextual-information features of Laotian words; by learning from the training corpus of Step2, it obtains the feature set and feature weights that maximize the conditional probability of the training samples' label sequences over the set of label sequences.

Further, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2; exploiting the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of prediction results.

The embodiments of the present invention have been described in detail above with reference to the drawings, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may be made without departing from the spirit of the invention.

Claims

1. A Laotian base noun phrase recognition method based on a stacked combined classifier, characterized in that the specific steps are as follows:

Step1: run the Laotian base noun phrase corpus through a word segmentation and part-of-speech tagging system, then manually annotate the Laotian base noun phrases to obtain the experimental data set;

Step2: divide the experimental data set obtained in Step1 into 5 parts, of which 4 parts serve as the training corpus and 1 part serves as the test corpus;

Step3: label the training corpus obtained in Step2 with each of three models (CRF, forward SVM, and reverse SVM) to obtain 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form a new data set;

Step4: evaluate the models on the test corpus obtained in Step2, and select the model with the best sequence labeling performance as the upper-layer classification model;

Step5: using the new data set obtained in Step3, with the words, the parts of speech, and the 3 sets of classification results obtained in Step3 as features, feed the data to the upper-layer classification model, and take its recognition result as the final result.

2. The Laotian base noun phrase recognition method based on a stacked combined classifier according to claim 1, characterized in that the manual annotation in Step1 is defined as follows: for an input word sequence S = W₁W₂…W_i…W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding label sequence T* = T₁T₂…T_i…T_n that has the maximum probability among all possible label sequences, where T_i ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.

3. The Laotian base noun phrase recognition method based on a stacked combined classifier according to claim 1, characterized in that the CRF in Step3 uses contextual-information features of Laotian words and, by learning from the training corpus of Step2, obtains the feature set and feature weights that maximize the conditional probability of the training samples' label sequences over the set of label sequences.

4. The Laotian base noun phrase recognition method based on a stacked combined classifier according to claim 1 or 3, characterized in that in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2, and the difference between the left-to-right and right-to-left recognition directions is exploited to obtain 2 different sets of prediction results.

5. The Laotian base noun phrase recognition method based on a stacked combined classifier according to claim 1, characterized in that in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which measures precision P, recall R, and the F value.