CN110362821A - A Laotian base noun phrase recognition method based on a stacked combination classifier
- Publication number
- CN110362821A (application number CN201910520748.7)
- Authority
- CN
- China
- Prior art keywords
- laotian
- noun phrase
- classifier
- svm
- base noun
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a Laotian base noun phrase recognition method based on a stacked combination classifier. It belongs to the field of natural language processing, in which base noun phrase recognition is a basic task. The invention incorporates Laotian linguistic features into the algorithm model and uses a combined classifier, CRF+SVM, to improve recognition accuracy. First, three recognition models (CRF, forward SVM and reverse SVM) each label the data, giving 3 different sets of predictions, which together with the original data set form a new data set. The model with the best sequence labeling performance is then selected as the upper-layer classification algorithm. Finally, using the new data set, with the word, the part of speech and the classification results of the 3 models as features, the data are fed to the upper-layer model, whose recognition result is taken as the final result. The invention achieves higher accuracy in Laotian base noun phrase recognition and has research value.
Description
Technical field
The present invention relates to a Laotian base noun phrase recognition method based on a stacked combination classifier, and belongs to the field of recognition for less widely used languages within natural language processing.
Background technique
Base noun phrase recognition is an important basic task in natural language processing. Noun phrases occur in large numbers in sentences and often fill important syntactic roles such as subject and object, so recognizing them accurately plays a key role in simplifying sentence structure for further syntactic analysis. Current research on noun phrases falls mainly into statistical recognition methods and rule-based recognition methods, and concentrates on base noun phrases and maximal noun phrases. The present invention labels the data separately with three recognition models (CRF, forward SVM and reverse SVM), obtaining 3 different sets of predictions, which together with the original data set form a new data set. The model with the best sequence labeling performance is then selected as the upper-layer classification model. Finally, using the new data set, with the word, the part of speech and the classification results of the 3 models as features, the data are fed to the upper-layer model and its recognition result is taken as the final result. At present, the approach of building a stacked combination classifier has not been applied to research on Laotian base noun phrase recognition.
Summary of the invention
The technical problem to be solved by the present invention is to provide a Laotian base noun phrase recognition method based on a stacked combination classifier, for solving the problem of Laotian base noun phrase recognition.
The technical solution adopted by the present invention is a Laotian base noun phrase recognition method based on a stacked combination classifier, with the following specific steps:
Step1: pass the Laotian base noun phrase corpus through a word segmentation and part-of-speech tagging system, then manually annotate the Laotian base noun phrases to obtain the experimental data set;
Step2: divide the experimental data set obtained in Step1 into 5 parts, of which 4 parts serve as the training corpus and 1 part as the test corpus;
Step3: label the training corpus obtained in Step2 separately with three models (CRF, forward SVM and reverse SVM), obtaining 3 different sets of predicted classification results, which together with the experimental data set obtained in Step1 form a new data set;
Step4: test the models on the test corpus obtained in Step2 and select the model with the best sequence labeling performance as the upper-layer classification model;
Step5: using the new data set obtained in Step3, with the word, the part of speech and the 3 sets of classification results from Step3 as features, feed the data to the upper-layer classification model and take its recognition result as the final result.
Specifically, the manual annotation in Step1 is defined as follows: for an input word sequence S = W_1 W_2 … W_i … W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T_1 T_2 … T_i … T_n that has the highest probability among all possible tag sequences, where T_i ∈ {B, I, O}. B marks the left boundary of a Laotian base noun phrase, I marks a word inside a Laotian base noun phrase, and O marks everything else.
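The B/I/O scheme described above can be illustrated with a minimal Python sketch. The function name `spans_to_bio` and the span representation (start inclusive, end exclusive) are assumptions made for illustration and do not come from the patent.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Convert base noun phrase spans into one B/I/O tag per word.

    Each span is (start, end) with start inclusive and end exclusive.
    """
    tags = ["O"] * n_tokens          # O: outside any base noun phrase
    for start, end in spans:
        tags[start] = "B"            # B: left boundary of the phrase
        for i in range(start + 1, end):
            tags[i] = "I"            # I: inside the phrase
    return tags


# Hypothetical 6-word sentence with base noun phrases at words 0-1 and 3-5.
example_tags = spans_to_bio(6, [(0, 2), (3, 6)])
print(example_tags)
```

Representing the annotation as one tag per word is what turns phrase recognition into the sequence labeling task that the CRF and SVM models operate on.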
Specifically, in Step3 the CRF uses contextual features of Laotian words: by learning from the training corpus of Step2, it obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of tag sequences.
Specifically, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2. Exploiting the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of predictions.
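Why the two directions can disagree is easiest to see in a small sketch. The following Python code assumes a greedy tagger in which each decision may use the tag already assigned to the neighbouring word. The `toy_classifier` rule is a hypothetical stand-in for a trained SVM, and the feature names are illustrative only; a real system would typically train a separate SVM for each direction.

```python
from typing import Callable, Dict, List

Features = Dict[str, str]


def greedy_tag(words: List[str], pos: List[str],
               classify: Callable[[Features], str],
               reverse: bool = False) -> List[str]:
    """Tag words one at a time, left to right or right to left."""
    n = len(words)
    order = range(n - 1, -1, -1) if reverse else range(n)
    neighbour = 1 if reverse else -1   # offset of the word tagged just before
    tags = [""] * n
    for i in order:
        j = i + neighbour
        feats = {
            "word": words[i],
            "pos": pos[i],
            "decided_tag": tags[j] if 0 <= j < n else "<EDGE>",
        }
        tags[i] = classify(feats)
    return tags


def toy_classifier(feats: Features) -> str:
    """Hypothetical rule standing in for a trained SVM."""
    if feats["pos"] != "N":
        return "O"
    return "I" if feats["decided_tag"] in ("B", "I") else "B"


words, pos = ["w1", "w2", "w3"], ["N", "N", "V"]
forward = greedy_tag(words, pos, toy_classifier)
backward = greedy_tag(words, pos, toy_classifier, reverse=True)
print(forward, backward)
```

Because each direction conditions on a different, already decided neighbour, the two passes make different errors, which is the diversity the combination step relies on.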
Specifically, in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which scores models by precision (P), recall (R) and F value.
The beneficial effects of the present invention are:
(1) The algorithm model selected by the invention is CRF+SVM. Compared with the HMM (hidden Markov model) and the MEMM (maximum entropy Markov model), the CRF solves both the output independence assumption of the HMM and the label bias problem of the MEMM. The MEMM is prone to local optima because it normalizes only locally, whereas the CRF computes a global probability: its normalization takes the global distribution of the data into account rather than normalizing only locally, so that sequence labeling decodes to the optimal solution. The SVM can use a kernel function to map data into a higher-dimensional space and so handle nonlinear classification; its classification principle is simple and its classification results are good.
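The difference between local and global normalization can be written out using the standard textbook forms of the two models. The feature functions f_k and weights λ_k are notation assumed here for illustration; the patent does not give these formulas.

```latex
% MEMM: each step is normalized separately, conditioned on the previous tag.
\[
P_{\mathrm{MEMM}}(T \mid S) = \prod_{i=1}^{n}
  \frac{\exp\bigl(\sum_{k} \lambda_k f_k(T_{i-1}, T_i, S, i)\bigr)}
       {Z(T_{i-1}, S, i)}
\]

% Linear-chain CRF: one normalizer Z(S) over all candidate tag sequences T'.
\[
P_{\mathrm{CRF}}(T \mid S) = \frac{1}{Z(S)}
  \exp\Bigl(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(T_{i-1}, T_i, S, i)\Bigr),
\qquad
Z(S) = \sum_{T'} \exp\Bigl(\sum_{i=1}^{n} \sum_{k} \lambda_k f_k(T'_{i-1}, T'_i, S, i)\Bigr)
\]
```

In the MEMM, the scores leaving each previous tag must sum to one at every step, which is the source of the label bias problem. In the CRF, whole sequences compete under the single normalizer Z(S).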
(2) The stacked combination classifier of the invention uses the structural model to assign suitable weights to the labeling results of the independent classifiers, which ensures that the strengths of the different independent classifiers are merged. Every metric of the resulting Laotian base noun phrase recognition is higher than the best result among the single models. Even a model that performs poorly on its own contributes to the fusion process, improving the accuracy of Laotian base noun phrase recognition.
Description of the drawings
Fig. 1 is a flowchart of the present invention.
Specific embodiment
The present invention is further described below with reference to the drawing and a specific embodiment.
Embodiment 1: as shown in Fig. 1, a Laotian base noun phrase recognition method based on a stacked combination classifier comprises the following specific steps:
Step1: obtain the Laotian base noun phrase corpus. The corpus is passed through a word segmentation and part-of-speech tagging system, and the Laotian base noun phrases are then manually annotated by Laotian students. A base noun phrase is defined as a noun phrase that contains no smaller noun phrase inside it. According to Laotian grammar, the forms of the Laotian base noun phrase (BaseNP) are as follows:
1. BaseNP → BaseNP + BaseNP
2. BaseNP → BaseNP + adjective
3. BaseNP → BaseNP + verb
4. BaseNP → BaseNP + numeral-classifier phrase
5. BaseNP → BaseNP + demonstrative
6. BaseNP → BaseNP + subject-predicate phrase
7. BaseNP → BaseNP + prepositional phrase
8. BaseNP → BaseNP + demonstrative + adjective
9. BaseNP → BaseNP + adjective + numeral-classifier phrase
10. BaseNP → BaseNP + adjective + verb
Step2: after the Laotian base noun phrase corpus has been annotated, divide the data set into a training corpus and a test corpus. Specifically, the Laotian base noun phrase corpus manually annotated in Step1 is divided into 5 parts, of which 4 parts serve as the training corpus and 1 part as the test corpus.
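The 5-part division can be sketched in Python as follows. Shuffling before the split, the fixed seed and the function name `split_five_parts` are assumptions made for illustration; the patent only states that the corpus is divided into 5 parts.

```python
import random
from typing import List, Sequence, Tuple


def split_five_parts(sentences: Sequence, test_part: int = 0,
                     seed: int = 0) -> Tuple[List, List]:
    """Divide a corpus into 5 parts: 4 for training, 1 for testing."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)        # assumed: random assignment
    parts = [shuffled[k::5] for k in range(5)]   # 5 roughly equal parts
    test = parts[test_part]
    train = [s for k, part in enumerate(parts) if k != test_part for s in part]
    return train, test


# Hypothetical corpus of 100 sentence identifiers.
train_corpus, test_corpus = split_five_parts(list(range(100)))
print(len(train_corpus), len(test_corpus))
```

Changing `test_part` from 0 to 4 would rotate which part is held out, should a full five-fold evaluation be wanted.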
Step3: label the data with three models to obtain 3 different sets of predictions, which together with the original data set form a new data set. Specifically, the training corpus obtained in Step2 is labeled separately by the CRF, forward SVM and reverse SVM models, giving 3 different sets of predicted classification results. These 3 sets of results and the experimental data set obtained in Step1 together form the new data set.
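One way to picture the new data set is as the original token rows with three extra columns, one per base model. The sketch below assumes that layout; the class names, field names and placeholder words are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Token(NamedTuple):
    word: str
    pos: str
    gold: str          # manually annotated B/I/O tag


class StackedToken(NamedTuple):
    word: str
    pos: str
    crf: str           # tag predicted by the CRF
    svm_fwd: str       # tag predicted by the forward SVM
    svm_rev: str       # tag predicted by the reverse SVM
    gold: str


def build_new_dataset(tokens: Sequence[Token], crf: Sequence[str],
                      svm_fwd: Sequence[str],
                      svm_rev: Sequence[str]) -> List[StackedToken]:
    """Attach the three models' predictions to the original token rows."""
    if not len(tokens) == len(crf) == len(svm_fwd) == len(svm_rev):
        raise ValueError("every model must label every token")
    return [StackedToken(t.word, t.pos, c, f, r, t.gold)
            for t, c, f, r in zip(tokens, crf, svm_fwd, svm_rev)]


# Placeholder words w1..w3 stand in for Laotian words.
sentence = [Token("w1", "N", "B"), Token("w2", "ADJ", "I"), Token("w3", "V", "O")]
new_rows = build_new_dataset(sentence,
                             ["B", "I", "O"],
                             ["B", "I", "O"],
                             ["I", "B", "O"])
print(new_rows[0])
```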
Step4: test the models on the test corpus obtained in Step2 and select the model with the best sequence labeling performance as the upper-layer classification model. Specifically, precision (P), recall (R) and F value are used together to judge the recognition performance of the three models. P, R and F are computed as follows:
\[ P = \frac{N_C}{N_I} \times 100\%, \qquad R = \frac{N_C}{N_Y} \times 100\%, \qquad F = \frac{2 \times P \times R}{P + R} \]
where N_C is the number of correctly recognized Laotian base noun phrases, N_I is the number of Laotian base noun phrases recognized by the model, and N_Y is the total number of Laotian base noun phrases in the corpus. Precision (P) reflects the model's ability to recognize phrases correctly, and recall (R) reflects its ability to find all of the phrases. The F value combines precision and recall and reflects the overall performance of the algorithm, so evaluating with P, R and F gives a comprehensive judgment of the three models' recognition performance.
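A phrase-level evaluation in the spirit of CoNLL-2000 can be sketched in Python from the counts N_C, N_I and N_Y defined above. The sketch assumes the usual harmonic-mean F value and scores a single tag sequence; for a whole corpus the three counts would be summed over all sentences before dividing. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def bio_chunks(tags: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) spans of phrases encoded by B/I/O tags."""
    chunks = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):   # sentinel closes a final phrase
        if tag != "I" and start is not None:
            chunks.add((start, i))
            start = None
        if tag == "B" or (tag == "I" and start is None):
            start = i
    return chunks


def precision_recall_f(gold: List[str],
                       pred: List[str]) -> Tuple[float, float, float]:
    gold_chunks, pred_chunks = bio_chunks(gold), bio_chunks(pred)
    n_c = len(gold_chunks & pred_chunks)     # correctly recognized phrases
    n_i = len(pred_chunks)                   # phrases the model recognized
    n_y = len(gold_chunks)                   # phrases in the corpus
    p = n_c / n_i if n_i else 0.0
    r = n_c / n_y if n_y else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold_tags = ["B", "I", "O", "B", "I", "I"]
pred_tags = ["B", "I", "O", "B", "I", "O"]
print(precision_recall_f(gold_tags, pred_tags))
```

A phrase counts as correct only when both of its boundaries match, so the second predicted phrase in the example earns no credit even though two of its three words are tagged correctly.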
Step5: using the new data set, take as features the word, the part of speech and the 3 different sets of classification results obtained by labeling the experimental data set with the CRF, forward SVM and reverse SVM models respectively, feed them to the upper-layer model, and take its recognition result as the final result, which improves recognition accuracy.
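The features passed to the upper-layer model might be assembled as in the following Python sketch. The context window of one word on each side, the `<PAD>` marker and all field names are assumptions; the patent only specifies that the word, the part of speech and the three models' results serve as features.

```python
from typing import Dict, List, Tuple

# One row per word: (word, pos, crf_tag, svm_forward_tag, svm_reverse_tag).
Row = Tuple[str, str, str, str, str]
FIELDS = ("word", "pos", "crf", "svm_fwd", "svm_rev")


def upper_layer_features(rows: List[Row], i: int,
                         window: int = 1) -> Dict[str, str]:
    """Collect features for word i from itself and its neighbours."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        inside = 0 <= j < len(rows)
        values = rows[j] if inside else ("<PAD>",) * len(FIELDS)
        for name, value in zip(FIELDS, values):
            feats[f"{name}[{offset}]"] = value
    return feats


# Placeholder words w1 and w2 stand in for Laotian words.
rows = [("w1", "N", "B", "B", "I"), ("w2", "ADJ", "I", "I", "B")]
first_word_features = upper_layer_features(rows, 0)
print(first_word_features)
```

Feature dictionaries of this kind can be consumed by common CRF or SVM toolkits after one-hot encoding, whichever model was chosen as the upper layer in Step4.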
Further, the manual annotation in Step1 is defined as follows: for an input word sequence S = W_1 W_2 … W_i … W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T_1 T_2 … T_i … T_n that has the highest probability among all possible tag sequences, where T_i ∈ {B, I, O}. B marks the left boundary of a Laotian base noun phrase, I marks a word inside a Laotian base noun phrase, and O marks everything else.
Further, in Step3 the CRF uses contextual features of Laotian words: by learning from the training corpus of Step2, it obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of tag sequences.
Further, in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2. Exploiting the difference between the left-to-right and right-to-left recognition directions yields 2 different sets of predictions.
The embodiment of the present invention has been explained in detail above with reference to the drawing, but the present invention is not limited to this embodiment. Within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the present invention.
Claims (5)
1. A Laotian base noun phrase recognition method based on a stacked combination classifier, characterized by the following specific steps:
Step1: pass the Laotian base noun phrase corpus through a word segmentation and part-of-speech tagging system, then manually annotate the Laotian base noun phrases to obtain the experimental data set;
Step2: divide the experimental data set obtained in Step1 into 5 parts, of which 4 parts serve as the training corpus and 1 part as the test corpus;
Step3: label the training corpus obtained in Step2 separately with three models (CRF, forward SVM and reverse SVM), obtaining 3 different sets of predicted classification results, which together with the experimental data set obtained in Step1 form a new data set;
Step4: test the models on the test corpus obtained in Step2 and select the model with the best sequence labeling performance as the upper-layer classification model;
Step5: using the new data set obtained in Step3, with the word, the part of speech and the 3 sets of classification results from Step3 as features, feed the data to the upper-layer classification model and take its recognition result as the final result.
2. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that the manual annotation in Step1 is defined as follows: for an input word sequence S = W_1 W_2 … W_i … W_n, where W_i is the i-th word, the goal of the task is to obtain a corresponding tag sequence T* = T_1 T_2 … T_i … T_n that has the highest probability among all possible tag sequences, where T_i ∈ {B, I, O}; B marks the left boundary of a Laotian base noun phrase, I marks a word inside a Laotian base noun phrase, and O marks everything else.
3. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that in Step3 the CRF uses contextual features of Laotian words and, by learning from the training corpus of Step2, obtains the feature set and feature weights that maximize the conditional probability of the training samples' tag sequences over the set of tag sequences.
4. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1 or 3, characterized in that in Step3 the forward SVM and the reverse SVM each label the training corpus of Step2, and the difference between the left-to-right and right-to-left recognition directions is exploited to obtain 2 different sets of predictions.
5. The Laotian base noun phrase recognition method based on a stacked combination classifier according to claim 1, characterized in that in Step4 the model with the best sequence labeling performance is selected using the CoNLL-2000 evaluation, which scores models by precision P, recall R and F value.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910520748.7A CN110362821A (en) | 2019-06-17 | 2019-06-17 | A kind of Laotian base noun phrase recognition methods based on stack combinations classifier |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910520748.7A CN110362821A (en) | 2019-06-17 | 2019-06-17 | A kind of Laotian base noun phrase recognition methods based on stack combinations classifier |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110362821A true CN110362821A (en) | 2019-10-22 |
Family
ID=68216249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910520748.7A Pending CN110362821A (en) | 2019-06-17 | 2019-06-17 | A kind of Laotian base noun phrase recognition methods based on stack combinations classifier |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110362821A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106021225A (en) * | 2016-05-12 | 2016-10-12 | 大连理工大学 | Chinese maximal noun phrase (MNP) identification method based on Chinese simple noun phrases (SNPs) |
CN106202035A (en) * | 2016-06-30 | 2016-12-07 | 昆明理工大学 | Vietnamese conversion of parts of speech disambiguation method based on combined method |
CN107797994A (en) * | 2017-09-26 | 2018-03-13 | 昆明理工大学 | Vietnamese noun phrase block identifying method based on constraints random field |
CN109753650A (en) * | 2018-12-14 | 2019-05-14 | 昆明理工大学 | A kind of Laotian name place name entity recognition method merging multiple features |
Non-Patent Citations (1)
Title |
---|
田雪 等: "一种混合的汉语简单名词短语识别方法", 《小型微型计算机系统》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Makav et al. | A new image captioning approach for visually impaired people | |
CN107229610B (en) | A kind of analysis method and device of affection data | |
CN109933664B (en) | Fine-grained emotion analysis improvement method based on emotion word embedding | |
CN103345922B (en) | A kind of large-length voice full-automatic segmentation method | |
CN108536870B (en) | Text emotion classification method fusing emotional features and semantic features | |
WO2018028077A1 (en) | Deep learning based method and device for chinese semantics analysis | |
CN108763510A (en) | Intension recognizing method, device, equipment and storage medium | |
CN106919673A (en) | Text mood analysis system based on deep learning | |
CN107330011A (en) | The recognition methods of the name entity of many strategy fusions and device | |
CN107608999A (en) | A kind of Question Classification method suitable for automatically request-answering system | |
CN112231472B (en) | Judicial public opinion sensitive information identification method integrated with domain term dictionary | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN108509629A (en) | Text emotion analysis method based on emotion dictionary and support vector machine | |
Paraschiv et al. | UPB at GermEval-2019 Task 2: BERT-Based Offensive Language Classification of German Tweets. | |
CN110232123A (en) | The sentiment analysis method and device thereof of text calculate equipment and readable medium | |
CN115357719B (en) | Power audit text classification method and device based on improved BERT model | |
CN112417132B (en) | New meaning identification method for screening negative samples by using guest information | |
CN109492105A (en) | A kind of text sentiment classification method based on multiple features integrated study | |
CN110532390A (en) | A kind of news keyword extracting method based on NER and Complex Networks Feature | |
CN104462409A (en) | Cross-language emotional resource data identification method based on AdaBoost | |
CN103020167A (en) | Chinese text classification method for computer | |
CN105389303B (en) | A kind of automatic fusion method of heterologous corpus | |
CN108681532A (en) | A kind of sentiment analysis method towards Chinese microblogging | |
CN101556580A (en) | Stock comment classification system based on analysis of discourse structure and method | |
CN109977391A (en) | A kind of information extraction method and device of text data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20191022 |