CN110362821A - A Laotian base noun phrase recognition method based on a stacked combination classifier

A Laotian base noun phrase recognition method based on a stacked combination classifier

Info

Publication number
CN110362821A
Authority
CN
China
Prior art keywords
laotian
noun phrase
classifier
svm
base noun
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910520748.7A
Other languages
Chinese (zh)
Inventor
周兰江
汤礼欣
张建安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201910520748.7A
Publication of CN110362821A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Laotian base noun phrase recognition method based on a stacked combination classifier. It belongs to the field of natural language processing, in which base noun phrase recognition is a basic task. The invention incorporates Laotian linguistic features into the algorithm model and uses a combined classifier, CRF+SVM, to improve recognition accuracy. First, three recognition models (CRF, forward SVM, and reverse SVM) each label the data, producing 3 different sets of prediction results, which together with the original data set form a new data set. Next, the model with the best sequence labeling performance is selected as the upper-layer classification algorithm. Finally, using the new data set, with the word, the part of speech, and the classification results of the 3 models as features, the data are fed to the upper-layer model, and its recognition result is taken as the final result. The invention achieves relatively high accuracy in Laotian base noun phrase recognition and has research value.

Description

A Laotian base noun phrase recognition method based on a stacked combination classifier
Technical field
The present invention relates to a Laotian base noun phrase recognition method based on a stacked combination classifier, and belongs to the field of minority-language recognition within natural language processing.
Background
Base noun phrase recognition is an important basic task in natural language processing. Noun phrases occur in large numbers in sentences and often fill important syntactic roles such as subject and object, so recognizing them accurately plays a key role in simplifying sentence structure for further syntactic analysis. Current research on noun phrases falls mainly into statistics-based recognition methods and rule-based recognition methods, and concentrates on recognizing base noun phrases and maximal noun phrases. The present invention uses three recognition models (CRF, forward SVM, and reverse SVM) to label the data separately, obtaining 3 different sets of prediction results, which together with the original data set form a new data set. The model with the best sequence labeling performance is then selected as the upper-layer classification model. Finally, using the new data set, with the word, the part of speech, and the classification results of the 3 models as features, the data are fed to the upper-layer model, and its recognition result is taken as the final result. The approach of building a stacked combination classifier has not yet been applied to research on Laotian base noun phrase recognition.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Laotian base noun phrase recognition method based on a stacked combination classifier, for solving the problem of Laotian base noun phrase recognition.
The technical solution adopted by the present invention is a Laotian base noun phrase recognition method based on a stacked combination classifier, with the following specific steps:
Step1: the Laotian base noun phrase corpus is passed through a word segmentation and part-of-speech tagging system, and the Laotian base noun phrases are then annotated manually to obtain the experimental data set;
Step2: the experimental data set obtained in Step1 is divided into 5 parts, of which 4 parts are used as the training corpus and 1 part as the test corpus;
Step3: the three models (CRF, forward SVM, and reverse SVM) each label the training corpus obtained in Step2, producing 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form a new data set;
Step4: the models are tested on the test corpus obtained in Step2, and the model with the best sequence labeling performance is selected as the upper-layer classification model;
Step5: using the new data set obtained in Step3, with the word, the part of speech, and the 3 different classification results from Step3 as features, the data are fed to the upper-layer classification model, and its recognition result is taken as the final result.
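The 4:1 division in Step2 can be sketched as follows. This is a minimal illustration, not the inventors' code: the function name `split_five_parts`, the shuffling, and the fixed random seed are assumptions, since the text only states that the data set is divided into 5 parts.

```python
import random

def split_five_parts(sentences, seed=0):
    """Divide the annotated sentences into 5 parts: 4 for training, 1 for testing.

    Shuffling and the seed are assumptions; the source only specifies the 4:1 split.
    """
    items = list(sentences)
    random.Random(seed).shuffle(items)
    # Deal the sentences round-robin into 5 nearly equal parts.
    parts = [items[i::5] for i in range(5)]
    train = [s for part in parts[:4] for s in part]
    test = parts[4]
    return train, test

corpus = [f"sentence_{i}" for i in range(10)]  # placeholder sentences
train, test = split_five_parts(corpus)
print(len(train), len(test))  # 8 2
```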
Specifically, the manual annotation in Step1 is as follows: for an input word sequence S = W1 W2 … Wi … Wn, where Wi is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T1 T2 … Ti … Tn that has the highest probability among all possible tag sequences, where Ti ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.
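The B/I/O scheme can be illustrated with a short sketch that turns phrase spans into a tag sequence. The helper name `spans_to_bio`, the span representation (start inclusive, end exclusive), and the placeholder words are assumptions made for the example.

```python
def spans_to_bio(n_words, spans):
    """Convert base noun phrase spans into a B/I/O tag sequence.

    spans: non-overlapping (start, end) word indices, end exclusive.
    """
    tags = ["O"] * n_words
    for start, end in spans:
        tags[start] = "B"              # left boundary of the phrase
        for i in range(start + 1, end):
            tags[i] = "I"              # inside the phrase
    return tags

words = ["w1", "w2", "w3", "w4", "w5"]  # placeholder tokens, not real Laotian words
tags = spans_to_bio(len(words), [(1, 3)])
print(list(zip(words, tags)))
# [('w1', 'O'), ('w2', 'B'), ('w3', 'I'), ('w4', 'O'), ('w5', 'O')]
```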
Specifically, in Step3 the CRF uses the contextual features of Laotian words and, by learning from the training corpus of Step2, obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of possible tag sequences.
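One common way to expose contextual information to a CRF is a window-based feature template. The sketch below is an assumption about what such a template could look like; the window size of 2, the feature names, and the `<PAD>` marker are not taken from the patent.

```python
def context_features(words, pos_tags, i, window=2):
    """Collect the words and POS tags around position i as string features.

    The window size and feature naming are illustrative assumptions.
    """
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(words)
        feats[f"w[{offset}]"] = words[j] if inside else "<PAD>"
        feats[f"p[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    return feats

words = ["w1", "w2", "w3"]   # placeholder tokens
pos = ["N", "ADJ", "V"]      # hypothetical POS labels
print(context_features(words, pos, 0, window=1))
```

Each position then yields one feature dictionary, and a sentence becomes a list of such dictionaries paired with its B/I/O tags.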
Specifically, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2; the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of prediction results.
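Why the two directions can disagree is easiest to see if the SVM tagger is assumed to use previously predicted tags as features, as chunkers of this kind often do. Under that assumption (the patent does not spell it out), the sketch below shows which neighbouring tags are already known at each position; the function names and `n_prev=2` are hypothetical.

```python
def decoding_order(n_words, reverse=False):
    """Order in which positions are tagged: left-to-right or right-to-left."""
    return list(range(n_words - 1, -1, -1)) if reverse else list(range(n_words))

def dynamic_context(n_words, i, reverse=False, n_prev=2):
    """Positions whose predicted tags are already available when tagging position i."""
    step = 1 if reverse else -1  # reverse decoding has already tagged the right side
    candidates = [i + step * k for k in range(1, n_prev + 1)]
    return [j for j in candidates if 0 <= j < n_words]

print(decoding_order(5), dynamic_context(5, 2))
# [0, 1, 2, 3, 4] [1, 0]
print(decoding_order(5, reverse=True), dynamic_context(5, 2, reverse=True))
# [4, 3, 2, 1, 0] [3, 4]
```

The forward tagger conditions on tags to the left and the reverse tagger on tags to the right, so the two models see different evidence for the same word.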
Specifically, in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which scores the models by precision (P), recall (R), and F value.
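A phrase-level evaluation in the spirit of the CoNLL-2000 scoring can be sketched as below. This is a simplified stand-in rather than the official scoring script: the function names are hypothetical, a phrase counts as correct only if both boundaries match, and a stray I tag after O is leniently treated as starting a phrase.

```python
def extract_chunks(tags):
    """Return the set of (start, end) phrase spans, end exclusive, in a B/I/O sequence."""
    chunks, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            if start is not None:
                chunks.add((start, i))   # a new B closes the previous phrase
            start = i
        elif tag == "O":
            if start is not None:
                chunks.add((start, i))
            start = None
    if start is not None:
        chunks.add((start, len(tags)))
    return chunks

def precision_recall_f(gold_tags, pred_tags):
    """Phrase-level precision, recall, and F value for one tag sequence."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    n_correct = len(gold & pred)
    p = n_correct / len(pred) if pred else 0.0
    r = n_correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = ["B", "I", "O", "B", "O"]
pred = ["B", "I", "O", "O", "O"]
print(precision_recall_f(gold, pred))  # precision 1.0, recall 0.5, F about 0.667
```

For a whole corpus the correct, predicted, and gold phrase counts would be summed over all sentences before computing the three scores.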
The beneficial effects of the present invention are:
(1) The algorithm model selected by the invention is CRF+SVM. Compared with the HMM (hidden Markov model) and the MEMM (maximum entropy Markov model), the CRF both removes the HMM's output independence assumption and solves the MEMM's label bias problem. The MEMM tends to fall into local optima because it normalizes only locally, whereas the CRF computes a global probability and takes the global distribution of the data into account when normalizing, so that decoding of the sequence labeling yields the optimal solution. The SVM can use a kernel function to map data into a higher-dimensional space and can therefore handle nonlinear classification; its classification principle is simple and its classification performance is good.
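The contrast between local and global normalization can be written out using the standard textbook forms of the two models. The feature functions f_k and weights λ_k are generic symbols introduced here for illustration; S and T follow the word and tag sequence notation used in this description.

```latex
% MEMM: each position is normalized on its own (local normalization)
P_{\mathrm{MEMM}}(T \mid S) = \prod_{i=1}^{n}
  \frac{\exp\left(\sum_{k} \lambda_k f_k(T_{i-1}, T_i, S, i)\right)}
       {\sum_{t'} \exp\left(\sum_{k} \lambda_k f_k(T_{i-1}, t', S, i)\right)}

% Linear-chain CRF: one normalizer Z(S) over all tag sequences (global normalization)
P_{\mathrm{CRF}}(T \mid S) = \frac{1}{Z(S)}
  \exp\left(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(T_{i-1}, T_i, S, i)\right),
\qquad
Z(S) = \sum_{T'} \exp\left(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(T'_{i-1}, T'_i, S, i)\right)
```

Because Z(S) sums over entire tag sequences, no single position is forced to distribute all of its probability mass among its own successors, which is the source of label bias in the MEMM.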
(2) The stacked combination classifier of the invention assigns suitable weights to the labeling results of the individual classifiers through its model structure, which allows the strengths of the different individual classifiers to be merged. The resulting Laotian base noun phrase recognition scores higher on every metric than the best single model; even a model that performs poorly on its own contributes during fusion, improving the accuracy of Laotian base noun phrase recognition.
Brief description of the drawings
Fig. 1 is the flow chart of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to the drawing and specific embodiments.
Embodiment 1: as shown in Fig. 1, a Laotian base noun phrase recognition method based on a stacked combination classifier has the following specific steps:
Step1: obtain the Laotian base noun phrase corpus. The corpus is passed through a word segmentation and part-of-speech tagging system, and the Laotian base noun phrases are then annotated manually by Laotian students. A base noun phrase is defined as a noun phrase that does not contain any smaller noun phrase. According to Laotian grammar, the Laotian base noun phrase (BaseNP) takes the following forms:
1. BaseNP → BaseNP + BaseNP
2. BaseNP → BaseNP + adjective
3. BaseNP → BaseNP + verb
4. BaseNP → BaseNP + numeral-classifier phrase
5. BaseNP → BaseNP + demonstrative
6. BaseNP → BaseNP + subject-predicate phrase
7. BaseNP → BaseNP + prepositional structure
8. BaseNP → BaseNP + demonstrative + adjective
9. BaseNP → BaseNP + adjective + numeral-classifier phrase
10. BaseNP → BaseNP + adjective + verb
Step2: after the Laotian base noun phrase corpus has been annotated, the data set is divided into a training corpus and a test corpus. Specifically, the corpus annotated manually in Step1 is divided into 5 parts, of which 4 parts are used as the training corpus and 1 part as the test corpus.
Step3: three models each label the data, producing 3 different sets of prediction results, which together with the original data set form a new data set. Specifically, the CRF, forward SVM, and reverse SVM models each label the training corpus obtained in Step2, producing 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form the new data set.
Step4: the models are tested on the test corpus obtained in Step2, and the model with the best sequence labeling performance is selected as the upper-layer classification model. Specifically, the recognition performance of the three models is judged comprehensively using precision (P), recall (R), and F value, computed as follows:
P = N_C / N_I
R = N_C / N_Y
F = 2 × P × R / (P + R)
where N_C is the number of correctly identified Laotian base noun phrases, N_I is the number of Laotian base noun phrases identified by the model, and N_Y is the total number of Laotian base noun phrases in the corpus. Precision (P) reflects how accurately the model recognizes phrases, and recall (R) reflects how completely the model finds them. The F value combines precision and recall and reflects the overall performance of the algorithm, so evaluating with all three measures allows a comprehensive judgment of the recognition performance of the three models.
Step5: using the new data set, with the word, the part of speech, and the 3 different classification results obtained by labeling the experimental data set with the CRF, forward SVM, and reverse SVM models as features, the data are fed to the upper-layer model, and its recognition result is taken as the final result, which improves recognition accuracy.
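The new data set described in Step3 and Step5 can be pictured as one row per word with five feature columns. The following sketch assembles such rows; the function name `build_stacked_rows`, the column order, and the sample tags are assumptions for illustration, not values from the patent's experiments.

```python
def build_stacked_rows(words, pos_tags, crf_tags, svm_fwd_tags, svm_rev_tags):
    """Build one feature row per word for the upper-layer model.

    Columns: word, part of speech, CRF tag, forward SVM tag, reverse SVM tag.
    """
    columns = [words, pos_tags, crf_tags, svm_fwd_tags, svm_rev_tags]
    if len({len(column) for column in columns}) != 1:
        raise ValueError("all columns must have the same length")
    return [list(row) for row in zip(*columns)]

rows = build_stacked_rows(
    ["w1", "w2", "w3", "w4"],   # placeholder tokens
    ["N", "ADJ", "V", "N"],     # hypothetical POS labels
    ["B", "I", "O", "B"],       # made-up CRF predictions
    ["B", "I", "O", "O"],       # made-up forward SVM predictions
    ["B", "O", "O", "B"],       # made-up reverse SVM predictions
)
for row in rows:
    print("\t".join(row))
```

Rows in this column layout can be passed to whichever of the three models was chosen as the upper layer, with the base-model tags acting as additional observation features next to the word and its part of speech.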
Further, the manual annotation in Step1 is as follows: for an input word sequence S = W1 W2 … Wi … Wn, where Wi is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T1 T2 … Ti … Tn that has the highest probability among all possible tag sequences, where Ti ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.
Further, in Step3 the CRF uses the contextual features of Laotian words and, by learning from the training corpus of Step2, obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of possible tag sequences.
Further, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2; the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of prediction results.
The embodiments of the present invention have been explained in detail above with reference to the drawing, but the present invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes can be made without departing from the concept of the present invention.

Claims (5)

1. A Laotian base noun phrase recognition method based on a stacked combination classifier, characterized by the following specific steps:
Step1: the Laotian base noun phrase corpus is passed through a word segmentation and part-of-speech tagging system, and the Laotian base noun phrases are then annotated manually to obtain the experimental data set;
Step2: the experimental data set obtained in Step1 is divided into 5 parts, of which 4 parts are used as the training corpus and 1 part as the test corpus;
Step3: the three models (CRF, forward SVM, and reverse SVM) each label the training corpus obtained in Step2, producing 3 different sets of predicted classification results; these 3 sets, together with the experimental data set obtained in Step1, form a new data set;
Step4: the models are tested on the test corpus obtained in Step2, and the model with the best sequence labeling performance is selected as the upper-layer classification model;
Step5: using the new data set obtained in Step3, with the word, the part of speech, and the 3 different classification results from Step3 as features, the data are fed to the upper-layer classification model, and its recognition result is taken as the final result.
2. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that the manual annotation in Step1 is as follows: for an input word sequence S = W1 W2 … Wi … Wn, where Wi is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T1 T2 … Ti … Tn that has the highest probability among all possible tag sequences, where Ti ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks the inside of a Laotian base noun phrase, and O marks everything else.
3. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that in Step3 the CRF uses the contextual features of Laotian words and, by learning from the training corpus of Step2, obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of possible tag sequences.
4. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1 or 3, characterized in that in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2, and the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of prediction results.
5. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which scores the models by precision P, recall R, and F value.
CN201910520748.7A 2019-06-17 2019-06-17 A Laotian base noun phrase recognition method based on a stacked combination classifier Pending CN110362821A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910520748.7A CN110362821A (en) 2019-06-17 2019-06-17 A Laotian base noun phrase recognition method based on a stacked combination classifier

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910520748.7A CN110362821A (en) 2019-06-17 2019-06-17 A Laotian base noun phrase recognition method based on a stacked combination classifier

Publications (1)

Publication Number Publication Date
CN110362821A (en) 2019-10-22

Family

ID=68216249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910520748.7A Pending CN110362821A (en) 2019-06-17 2019-06-17 A Laotian base noun phrase recognition method based on a stacked combination classifier

Country Status (1)

Country Link
CN (1) CN110362821A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106021225A (en) * 2016-05-12 2016-10-12 大连理工大学 Chinese maximal noun phrase (MNP) identification method based on Chinese simple noun phrases (SNPs)
CN106202035A (en) * 2016-06-30 2016-12-07 昆明理工大学 Vietnamese conversion of parts of speech disambiguation method based on combined method
CN107797994A (en) * 2017-09-26 2018-03-13 昆明理工大学 Vietnamese noun phrase block identifying method based on constraints random field
CN109753650A (en) * 2018-12-14 2019-05-14 昆明理工大学 A kind of Laotian name place name entity recognition method merging multiple features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tian Xue et al.: "A hybrid method for recognizing Chinese simple noun phrases", Journal of Chinese Computer Systems (小型微型计算机系统) *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191022