CN109783803A - A Laotian organization name recognition method based on SVM and HMM - Google Patents

Publication number
CN109783803A
Authority
CN
China
Prior art keywords
word
organization names
laotian
current word
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811532381.2A
Other languages
Chinese (zh)
Inventor
周兰江
晏雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201811532381.2A
Publication of CN109783803A
Current legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a Laotian organization name recognition method based on SVM and HMM, and belongs to the technical field of natural language processing and machine learning. The invention first divides a Laotian organization name into an organization name prefix word and organization name suffix words and, according to the characteristics of Laotian, extracts the organization name prefix words to build an organization name feature lexicon. An SVM model judges whether a word that appears in the feature lexicon is a Laotian organization name prefix word, which determines the front boundary. An HMM model then determines the rear boundary of the Laotian organization name, so that a complete organization name is identified. The invention combines the advantages of rule-based and machine-learning-based methods, requires no large knowledge base, and achieves better results than conventional machine learning methods.

Description

A Laotian organization name recognition method based on SVM and HMM
Technical field
The present invention relates to a Laotian organization name recognition method based on SVM and HMM, and belongs to the technical field of natural language processing and machine learning.
Background
Named entity recognition has always been an important task in natural language processing and plays a very important role in technologies such as information retrieval and machine translation. Because organization names are structurally complex, vary in length and are diverse in composition, they are the hardest of the seven major categories of named entities to recognize. Current organization name recognition methods fall mainly into rule- and dictionary-based methods and machine-learning-based methods. Rule- and dictionary-based methods rely on expert knowledge and require a large amount of labeled content, which is time-consuming and labor-intensive. Methods based on machine learning alone are easier to build, but their accuracy is not high.
Summary of the invention
The object of the present invention is to provide a Laotian organization name recognition method based on SVM and HMM. Given the current state of research on Laotian organization name recognition, the invention mainly uses a machine-learning-based method, incorporates some Laotian linguistic features, and uses a rule-based approach to assist recognition. Combining the advantages of the two kinds of prior-art methods helps to improve accuracy.
The technical solution adopted by the present invention is a Laotian organization name recognition method based on SVM and HMM, with the following specific steps:
Step1: Because the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes. A single word that is itself an entity is called a simple organization name, with the form S. An entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word, also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1: According to this formal definition, all feature words S in the Laotian organization name entity corpus are extracted to form a feature lexicon;
Step2: Set the current word to the first word;
Step3: Scan onward from the current word and judge whether the current word appears in the feature lexicon of Step1.1. There are two cases:
In the first case, the current word appears in the feature lexicon of Step1.1:
The current word may then be a Laotian organization name prefix word. Convert the current word to a feature vector according to the feature vector conversion process, then execute Step4;
In the second case, the current word does not appear in the feature lexicon of Step1.1:
Judge whether the current word is the end of the sentence. If so, terminate; otherwise advance the current word position by one and repeat this step, continuing to scan until the end;
Step4: Using the feature vector of the word that appears in the feature lexicon of Step1.1, the SVM model judges whether the word is a Laotian organization name prefix word. If it is, continue with the following steps; if not, advance the current word position by one and return to Step3;
Step5: Set the current word to w_{i+1}, the word after the prefix word w_i, and judge the current word w_{i+1} using a hidden Markov model that incorporates multiple Laotian organization name word-formation features. There are two cases:
In the first case, the current word is a Laotian organization name suffix word:
Advance the current word position by one and repeat the judgment of this step;
In the second case, the current word is not a Laotian organization name suffix word:
Extract all words from the prefix word w_i to w_j, the word preceding the current word w_{j+1}. The words w_i … w_j form a complete Laotian organization name entity. Then execute Step6;
Step6: Judge whether the current word w_{j+1} is the last word. If not, set the current word position to w_{j+1} and return to Step3 to continue scanning; if so, the loop ends.
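The control flow of Step2 to Step6 can be sketched as a single scanning loop. The sketch below is illustrative only: the function name `recognize_org_names` and the two callbacks `is_prefix` and `is_suffix` are hypothetical stand-ins for the SVM and HMM judgments, whose features and parameters the text does not specify. It also assumes that a prefix word at the very end of the sentence is emitted as a simple name of form S, a case the steps do not spell out.

```python
from typing import Callable, List, Sequence, Set


def recognize_org_names(
    words: Sequence[str],
    lexicon: Set[str],
    is_prefix: Callable[[Sequence[str], int], bool],
    is_suffix: Callable[[Sequence[str], int, int], bool],
) -> List[List[str]]:
    """Scan a word-segmented sentence and return the organization names found."""
    entities: List[List[str]] = []
    i = 0  # Step2: start at the first word
    while i < len(words):
        # Step3: only words in the feature lexicon are prefix candidates.
        # Step4: a classifier (an SVM in the patent) confirms or rejects them.
        if words[i] not in lexicon or not is_prefix(words, i):
            i += 1
            continue
        # Step5: extend the rear boundary while the next word is judged
        # to be a suffix word (an HMM in the patent).
        j = i
        while j + 1 < len(words) and is_suffix(words, i, j + 1):
            j += 1
        entities.append(list(words[i : j + 1]))  # w_i ... w_j
        # Step6: resume scanning at w_{j+1}.
        i = j + 1
    return entities


# Placeholder tokens; the stub judgments below are assumptions for the demo.
sentence = [f"w{k}" for k in range(1, 13)]
found = recognize_org_names(
    sentence,
    lexicon={"w7"},
    is_prefix=lambda ws, i: True,
    is_suffix=lambda ws, i, k: ws[k] in {"w8", "w9"},
)
print(found)  # [['w7', 'w8', 'w9']]
```

Passing the two judgments in as callbacks keeps the scanning logic independent of how the models are trained, so either model can be swapped without touching the loop.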
The beneficial effects of the present invention are:
1. Compared with organization name recognition using an SVM model alone, the Laotian organization name recognition method of the invention based on SVM and HMM significantly improves precision, recall and F-score.
2. The method formalizes Laotian organization names as either a simple organization name S or a complex organization name S+P, and identifies a complete Laotian organization name in two steps.
3. The method abstracts the identification of the Laotian organization name prefix word (S) as a binary classification problem. SVM models handle binary classification well, so an SVM model is used to identify the prefix word.
4. The method abstracts the identification of the Laotian organization name suffix words (P) as the decoding problem of an HMM model, and incorporates multiple Laotian organization name word-formation features into the model. This greatly improves precision compared with the conventional use of an HMM model to recognize organization names.
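Treating suffix identification as an HMM decoding problem usually means running the Viterbi algorithm over the words that follow the prefix word. The following is a minimal sketch under stated assumptions: the two hidden states (`"P"` for suffix word, `"O"` for other), the coarse observation classes and all probabilities are invented for illustration and are not taken from the patent, which does not publish its model parameters or word-formation features.

```python
import math
from typing import Dict, List, Sequence


def _log(p: float) -> float:
    """Logarithm that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> List[str]:
    """Return the most probable hidden state sequence for the observations."""
    if not obs:
        return []
    score = [{s: _log(start_p[s]) + _log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            # Best predecessor state for reaching state s at time t.
            prev = max(states, key=lambda r: score[t - 1][r] + _log(trans_p[r][s]))
            score[t][s] = (
                score[t - 1][prev]
                + _log(trans_p[prev][s])
                + _log(emit_p[s].get(obs[t], 0.0))
            )
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions): "P" = suffix word, "O" = not a suffix word.
STATES = ("P", "O")
START = {"P": 0.7, "O": 0.3}
TRANS = {"P": {"P": 0.6, "O": 0.4}, "O": {"P": 0.05, "O": 0.95}}
EMIT = {
    "P": {"place": 0.5, "noun": 0.4, "verb": 0.1},
    "O": {"place": 0.1, "noun": 0.3, "verb": 0.6},
}

# Hypothetical coarse classes of the words that follow a prefix word.
tags = viterbi(["noun", "place", "verb", "verb"], STATES, START, TRANS, EMIT)

# The suffix words are the leading run of "P" tags.
suffix_len = 0
for tag in tags:
    if tag != "P":
        break
    suffix_len += 1
print(tags, suffix_len)
```

With these toy numbers the first two words after the prefix are tagged as suffix words and the rear boundary falls after them. Working in log space avoids numeric underflow on longer sentences.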
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description of the embodiments
To describe the invention in more detail and make it easier for those skilled in the art to understand, the invention is further described below with reference to the accompanying drawing and an embodiment. The embodiment in this section serves to illustrate the invention and aid understanding, and does not limit the invention.
Embodiment 1: As shown in Fig. 1, a Laotian organization name recognition method based on SVM and HMM has the following specific steps:
Step1: Because the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes. A single word that is itself an entity is called a simple organization name, with the form S. An entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word (such as "university" or "party committee"), also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1: According to this formal definition, all feature words S in the Laotian organization name entity corpus are extracted to form a feature lexicon;
Step2: Set the current word to the first word;
Step3: Starting from the first word w_1 of the word-segmented sentence (w_1 to w_12), scan onward and judge whether the current word appears in the feature lexicon. Words w_1 to w_6 are found not to be in the feature lexicon, so scanning continues to word w_7. Word w_7 is found in the feature lexicon, so the current word w_7 is converted to a feature vector according to the feature vector conversion process;
Step4: Using the feature vector of word w_7, the SVM model judges whether it is a Laotian organization name prefix word. If it is, continue with the following steps. If not, return to Step3 and continue judging from the next word. Here w_7 is judged to be a Laotian organization name prefix word.
Step5: Starting from w_8, the word after the prefix word w_7, scan each word and judge the current word using the hidden Markov model that incorporates multiple Laotian organization name word-formation features. The model identifies w_8 and w_9 as Laotian organization name suffix words and w_10 as not a suffix word. Words w_7, w_8 and w_9 are then extracted; together they form a complete Laotian organization name entity.
Step6: Judge whether the current word w_10 is the last word. If not, return to Step3 starting from w_10 and continue scanning until the end. If so, the loop ends.
By the judgment in Step3, w_10, w_11 and w_12 are not in the feature lexicon of Step1.1. Since the end of the sentence has been reached, the judgment ends.
The embodiment of the present invention has been explained in detail above with reference to the accompanying drawing, but the invention is not limited to this embodiment. Within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the invention.

Claims (1)

1. A Laotian organization name recognition method based on SVM and HMM, characterized by the following specific steps:
Step1: Because the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes. A single word that is itself an entity is called a simple organization name, with the form S. An entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word, also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1: According to this formal definition, all feature words S in the Laotian organization name entity corpus are extracted to form a feature lexicon;
Step2: Set the current word to the first word;
Step3: Scan onward from the current word and judge whether the current word appears in the feature lexicon of Step1.1. There are two cases:
In the first case, the current word appears in the feature lexicon of Step1.1:
The current word may then be a Laotian organization name prefix word. Convert the current word to a feature vector according to the feature vector conversion process, then execute Step4;
In the second case, the current word does not appear in the feature lexicon of Step1.1:
Judge whether the current word is the end of the sentence. If so, terminate; otherwise advance the current word position by one and repeat this step, continuing to scan until the end;
Step4: Using the feature vector of the word that appears in the feature lexicon of Step1.1, the SVM model judges whether the word is a Laotian organization name prefix word. If it is, continue with the following steps; if not, advance the current word position by one and return to Step3;
Step5: Set the current word to w_{i+1}, the word after the prefix word w_i, and judge the current word w_{i+1} using a hidden Markov model that incorporates multiple Laotian organization name word-formation features. There are two cases:
In the first case, the current word is a Laotian organization name suffix word:
Advance the current word position by one and repeat the judgment of this step;
In the second case, the current word is not a Laotian organization name suffix word:
Extract all words from the prefix word w_i to w_j, the word preceding the current word w_{j+1}. The words w_i … w_j form a complete Laotian organization name entity. Then execute Step6;
Step6: Judge whether the current word w_{j+1} is the last word. If not, set the current word position to w_{j+1} and return to Step3 to continue scanning; if so, the loop ends.
CN201811532381.2A, priority date 2018-12-14, filing date 2018-12-14: A Laotian organization name recognition method based on SVM and HMM. Status: Pending. Published as CN109783803A (en).

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811532381.2A CN109783803A (en) 2018-12-14 2018-12-14 A kind of Laotian organization names recognition methods based on SVM and HMM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811532381.2A CN109783803A (en) 2018-12-14 2018-12-14 A kind of Laotian organization names recognition methods based on SVM and HMM

Publications (1)

Publication Number Publication Date
CN109783803A true CN109783803A (en) 2019-05-21

Family

ID=66496899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811532381.2A Pending CN109783803A (en) 2018-12-14 2018-12-14 A Laotian organization name recognition method based on SVM and HMM

Country Status (1)

Country Link
CN (1) CN109783803A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776560A (en) * 2016-12-15 2017-05-31 Kunming University of Science and Technology A Kampuchean organization name recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
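Zhu Jifeng (祝继锋): "Chinese organization name recognition based on SVM and HMM algorithms", China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology series *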

Similar Documents

Publication Publication Date Title
Abdul-Hamid et al. Simplified feature set for Arabic named entity recognition
CN109284400B (en) Named entity identification method based on Lattice LSTM and language model
Wshah et al. Script independent word spotting in offline handwritten documents based on hidden markov models
Fischer et al. Improving hmm-based keyword spotting with character language models
CN104881458B (en) A kind of mask method and device of Web page subject
CN111178074A (en) Deep learning-based Chinese named entity recognition method
WO2008107305A2 (en) Search-based word segmentation method and device for language without word boundary tag
Zhang et al. Word segmentation and named entity recognition for sighan bakeoff3
CN108363691B (en) Domain term recognition system and method for power 95598 work order
CN111046660B (en) Method and device for identifying text professional terms
Peng et al. Multi-font printed Mongolian document recognition system
Bedrick et al. Robust kaomoji detection in Twitter
CN113505200A (en) Sentence-level Chinese event detection method combining document key information
Chen et al. A boundary assembling method for Chinese entity-mention recognition
CN109344233B (en) Chinese name recognition method
Singh et al. Can RNNs reliably separate script and language at word and line level?
Saluja et al. Sub-word embeddings for OCR corrections in highly fusional indic languages
CN112307756A (en) Bi-LSTM and word fusion-based Chinese word segmentation method
CN109783803A (en) A kind of Laotian organization names recognition methods based on SVM and HMM
CN111178009A (en) Text multilingual recognition method based on feature word weighting
Altenbek et al. Kazakh segmentation system of inflectional affixes
CN113240485A (en) Training method of text generation model, and text generation method and device
Saetiew et al. Thai person name recognition (PNR) using likelihood probability of tokenized words
KR101126186B1 (en) Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof
Tepdang et al. Improving thai word segmentation with named entity recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190521
