CN109783803A - A Laotian organization name recognition method based on SVM and HMM - Google Patents
A Laotian organization name recognition method based on SVM and HMM
- Publication number
- CN109783803A (application number CN201811532381.2A)
- Authority
- CN
- China
- Prior art keywords
- word
- organization names
- laotian
- current word
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The present invention relates to a Laotian organization name recognition method based on SVM and HMM, and belongs to the technical field of natural language processing and machine learning. The invention first divides a Laotian organization name into an organization name prefix word and organization name suffix words and, according to the characteristics of Laotian, extracts the organization name prefix words to build an organization name feature lexicon. An SVM model is used to judge whether a word that appears in the feature lexicon is a Laotian organization name prefix word, which determines the front boundary; an HMM model is then used to determine the rear boundary of the Laotian organization name, so that a complete organization name is identified. The invention combines the advantages of rule-based and machine-learning-based methods, needs no large knowledge base, and achieves better results than conventional machine learning methods.
Description
Technical field
The present invention relates to a Laotian organization name recognition method based on SVM and HMM, and belongs to the technical field of natural language processing and machine learning.
Background art
Named entity recognition has always been an important task in natural language processing and plays a very important role in technologies such as information retrieval and machine translation. Because organization names are structurally complex, vary in length, and are diverse in composition, they are the hardest to recognize of the seven major classes of named entities. Current organization name recognition methods are mainly of three kinds: rule-based methods, dictionary-based methods, and machine-learning-based methods. Rule-based and dictionary-based methods rely on expert knowledge and require a large amount of annotated content, which is time-consuming and labor-intensive. Methods based on machine learning alone are easier to build, but their accuracy is not very high.
Summary of the invention
The object of the present invention is to provide a Laotian organization name recognition method based on SVM and HMM. In view of the current state of research on Laotian organization name recognition, the invention mainly uses a machine-learning-based method that incorporates some Laotian linguistic features and is assisted by a rule-based method. Combining the advantages of these two kinds of prior-art methods helps to improve accuracy.
The technical solution adopted by the present invention is a Laotian organization name recognition method based on SVM and HMM, with the following specific steps:
Step1, According to the characteristic that the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes: a single word that is itself an entity is called a simple organization name, with the form S; an entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word, also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1, According to this formal definition, all feature words S in a Laotian organization name named-entity corpus are extracted to form a feature lexicon;
Step2, The current word is set to the first word;
Step3, Scanning proceeds from the current word toward the end of the text, and it is judged whether the current word appears in the feature lexicon of Step1.1. There are two cases:
Case 1: the current word appears in the feature lexicon of Step1.1:
In this case the current word may be a Laotian organization name prefix word; the current word is converted into a feature vector according to the feature vector conversion process, and Step4 is then executed;
Case 2: the current word does not appear in the feature lexicon of Step1.1:
In this case it is judged whether the current word is the end of the text; if so, the procedure ends; otherwise the current word position advances by one word and this step is repeated, continuing the scan until the end;
Step4, Using the feature vector of the word that appears in the feature lexicon of Step1.1, the SVM model judges whether the word is a Laotian organization name prefix word; if so, the following steps are carried out; if not, the current word position advances by one word and the procedure returns to Step3;
Step5, The current word is set to the word w_{i+1} that follows the prefix word w_i, and a hidden Markov model that incorporates several Laotian organization name word-formation features judges the current word w_{i+1}. There are two cases:
Case 1: the current word is a Laotian organization name suffix word:
In this case the current word position advances by one word and the judgment of this step is repeated;
Case 2: the current word is not a Laotian organization name suffix word:
In this case all words from the prefix word w_i to the word w_j that precedes the current word w_{j+1} are extracted; the words w_i…w_j form a complete Laotian organization name entity, and Step6 is then executed;
Step6, It is judged whether the current word w_{j+1} is the last word; if not, the current word position is set to w_{j+1} and the procedure returns to Step3 to continue scanning; if so, the loop ends.
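The control flow of Step2 to Step6 can be sketched as a single scanning loop. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `recognize_org_names` and the two callbacks `is_prefix` (standing in for the SVM judgment) and `is_suffix` (standing in for the HMM judgment) are hypothetical. Two details differ from the steps as written: the sketch also tests the last word for being a prefix word instead of stopping at it, and it closes a name if the text ends while suffix words are still being collected, a situation the steps do not address.

```python
from typing import Callable, List, Sequence, Set, Tuple


def recognize_org_names(
    words: Sequence[str],
    lexicon: Set[str],
    is_prefix: Callable[[Sequence[str], int], bool],
    is_suffix: Callable[[Sequence[str], int, int], bool],
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) index spans of recognized names."""
    spans: List[Tuple[int, int]] = []
    n = len(words)
    i = 0                                    # Step2: start at the first word
    while i < n:                             # Step3: scan toward the end
        # Step3 lexicon lookup, then Step4 classifier judgment (hypothetical callback)
        if words[i] not in lexicon or not is_prefix(words, i):
            i += 1                           # not a prefix word: advance one word
            continue
        j = i + 1                            # Step5: first word after the prefix
        while j < n and is_suffix(words, i, j):
            j += 1                           # keep extending over suffix words
        spans.append((i, j - 1))             # words[i..j-1] form one complete name
        i = j                                # Step6: resume at the first non-suffix word
    return spans


# Toy run: a 12-word sentence in which only w7 is in the lexicon
# and w8, w9 are accepted as suffix words (all values are placeholders).
words = [f"w{k}" for k in range(1, 13)]
spans = recognize_org_names(
    words,
    lexicon={"w7"},
    is_prefix=lambda ws, i: True,
    is_suffix=lambda ws, start, j: ws[j] in {"w8", "w9"},
)
print(spans)  # [(6, 8)], i.e. the words w7 w8 w9
```

Passing the two judgments in as callbacks keeps the loop independent of how the SVM and HMM are actually trained, which the text above does not specify.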
The beneficial effects of the present invention are:
1. Compared with an organization name recognition method that uses an SVM model alone, the Laotian organization name recognition method based on SVM and HMM of the invention significantly improves precision, recall, and F-value.
2. The method formalizes Laotian organization names as either a simple organization name S or a complex organization name S+P, and identifies a complete Laotian organization name in two steps.
3. The method abstracts the recognition of the Laotian organization name prefix word (S) as a binary classification problem; because SVM models are well suited to binary classification, an SVM model is used to recognize the prefix word.
4. The method abstracts the recognition of the Laotian organization name suffix words (P) as the decoding problem of an HMM model and incorporates several Laotian organization name word-formation features into the model, which considerably improves precision over conventional organization name recognition with an HMM model.
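The HMM decoding problem named in item 4 is commonly solved with the Viterbi algorithm. The sketch below is a generic log-space Viterbi decoder, not the model of the invention: the two states ("P" for suffix word, "O" for any other word), the coarse observation symbols, and every probability are hypothetical values chosen only to make the example run, and the word-formation features that the invention incorporates are not modeled.

```python
import math
from typing import Dict, List, Sequence


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
    floor: float = 1e-6,
) -> List[str]:
    """Most likely hidden state sequence for obs (log-space Viterbi)."""
    if not obs:
        return []

    def lg(p: float) -> float:
        return math.log(max(p, floor))       # floor avoids log(0) for unseen events

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back: List[Dict[str, str]] = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + lg(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev                # remember the best predecessor
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):     # follow back-pointers to the start
        path.append(back[t][path[-1]])
    return path[::-1]


# Hypothetical toy parameters: observations are coarse word classes
# for the three words that follow a confirmed prefix word.
states = ["P", "O"]
start_p = {"P": 0.7, "O": 0.3}
trans_p = {"P": {"P": 0.6, "O": 0.4}, "O": {"P": 0.05, "O": 0.95}}
emit_p = {
    "P": {"noun": 0.5, "place": 0.45, "verb": 0.05},
    "O": {"noun": 0.2, "place": 0.1, "verb": 0.7},
}
tags = viterbi(["noun", "place", "verb"], states, start_p, trans_p, emit_p)
print(tags)  # ['P', 'P', 'O']
```

With these toy numbers the first two words are tagged as suffix words and the third is not, so the rear boundary of the name would fall after the second word.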
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description of the embodiments
To describe the invention in more detail and make it easier for those skilled in the art to understand, the invention is further described below with reference to the accompanying drawing and an embodiment. The embodiment in this section serves to illustrate the invention and aid understanding, and does not limit the invention.
Embodiment 1: As shown in Fig. 1, a Laotian organization name recognition method based on SVM and HMM comprises the following specific steps:
Step1, According to the characteristic that the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes: a single word that is itself an entity is called a simple organization name, with the form S; an entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word (such as "university" or "party committee"), also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1, According to this formal definition, all feature words S in a Laotian organization name named-entity corpus are extracted to form a feature lexicon;
Step2, The current word is set to the first word;
Step3, In the word-segmented sentence (w_1 to w_12), scanning starts from the first word w_1 and proceeds toward the end, judging whether the current word appears in the feature lexicon. w_1 to w_6 are found not to be in the feature lexicon, so scanning continues to word w_7. w_7 is found to be in the feature lexicon, so the current word w_7 is converted into a feature vector according to the feature vector conversion process;
Step4, Using the feature vector of word w_7, the SVM model judges whether it is a Laotian organization name prefix word. If so, the following steps are carried out. If not, the procedure returns to Step3 and continues judging from the next word. Here w_7 is judged to be a Laotian organization name prefix word.
Step5, Starting from the word w_8 that follows the prefix word w_7, each word is scanned, and a hidden Markov model that incorporates several Laotian organization name word-formation features judges the current word, beginning with w_8. The model identifies w_8 and w_9 as Laotian organization name suffix words and w_10 as not being a suffix word. The words w_7, w_8, and w_9 are then extracted; this word sequence is a complete Laotian organization name entity.
Step6, It is judged whether the current word w_10 is the last word. If not, the procedure returns to Step3 and continues scanning from w_10 until the end. If so, the loop ends.
By the judgment in Step3, w_10, w_11, and w_12 are not in the feature lexicon of Step1.1; because the end of the sentence has been reached, the judgment terminates.
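The lexicon construction of Step1.1 and the prefix-word judgment of Step3 and Step4 in this embodiment can be illustrated with a linear SVM decision rule, f(x) = w·x + b, where a non-negative value is read as "prefix word". This is only a sketch under assumptions: the text does not list the features used in the feature vector conversion process, so the three context features below are hypothetical, and the weights and bias are hand-picked placeholders rather than parameters learned from annotated data.

```python
from typing import List, Sequence, Set


def build_feature_lexicon(org_names: Sequence[Sequence[str]]) -> Set[str]:
    """Step1.1: collect the first word (feature word S) of each annotated name."""
    return {name[0] for name in org_names if name}


def to_feature_vector(words: Sequence[str], i: int, lexicon: Set[str]) -> List[float]:
    """Hypothetical context features for words[i]; not taken from the patent."""
    has_next = i + 1 < len(words)
    return [
        1.0 if has_next else 0.0,                              # a following word exists
        1.0 if has_next and words[i + 1] in lexicon else 0.0,  # next word is also a feature word
        1.0 if i == 0 else 0.0,                                # word starts the sentence
    ]


def svm_score(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear SVM decision value f(x) = w . x + b."""
    return sum(w * v for w, v in zip(weights, x)) + bias


# Illustrative parameters only; a real model would learn them from labeled data.
WEIGHTS = [1.0, -0.5, 0.2]
BIAS = -0.5

annotated_names = [["w7", "w8", "w9"], ["w7"]]   # toy corpus of segmented names
lexicon = build_feature_lexicon(annotated_names)
sentence = ["w6", "w7", "w8"]
x = to_feature_vector(sentence, 1, lexicon)      # candidate word is w7
score = svm_score(x, WEIGHTS, BIAS)
is_prefix = score >= 0.0                         # non-negative side means "prefix word"
print(lexicon, x, score, is_prefix)
```

In practice the decision function would come from a trained SVM, possibly with a non-linear kernel; the linear form is used here only because it can be written out in a few lines.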
The embodiment of the present invention has been explained in detail above with reference to the accompanying drawing, but the invention is not limited to the above embodiment; within the knowledge of a person skilled in the art, various changes can also be made without departing from the concept of the invention.
Claims (1)
1. A Laotian organization name recognition method based on SVM and HMM, characterized in that the specific steps are as follows:
Step1, According to the characteristic that the feature word of a Laotian organization name comes first, Laotian organization names are divided into two classes: a single word that is itself an entity is called a simple organization name, with the form S; an entity name composed of multiple words is called a complex organization name, with the form S+P, where S is the feature word, also called the prefix word, and P is the modifier, also called the suffix word;
Step1.1, According to this formal definition, all feature words S in a Laotian organization name named-entity corpus are extracted to form a feature lexicon;
Step2, The current word is set to the first word;
Step3, Scanning proceeds from the current word toward the end of the text, and it is judged whether the current word appears in the feature lexicon of Step1.1. There are two cases:
Case 1: the current word appears in the feature lexicon of Step1.1:
In this case the current word may be a Laotian organization name prefix word; the current word is converted into a feature vector according to the feature vector conversion process, and Step4 is then executed;
Case 2: the current word does not appear in the feature lexicon of Step1.1:
In this case it is judged whether the current word is the end of the text; if so, the procedure ends; otherwise the current word position advances by one word and this step is repeated, continuing the scan until the end;
Step4, Using the feature vector of the word that appears in the feature lexicon of Step1.1, the SVM model judges whether the word is a Laotian organization name prefix word; if so, the following steps are carried out; if not, the current word position advances by one word and the procedure returns to Step3;
Step5, The current word is set to the word w_{i+1} that follows the prefix word w_i, and a hidden Markov model that incorporates several Laotian organization name word-formation features judges the current word w_{i+1}. There are two cases:
Case 1: the current word is a Laotian organization name suffix word:
In this case the current word position advances by one word and the judgment of this step is repeated;
Case 2: the current word is not a Laotian organization name suffix word:
In this case all words from the prefix word w_i to the word w_j that precedes the current word w_{j+1} are extracted; the words w_i…w_j form a complete Laotian organization name entity, and Step6 is then executed;
Step6, It is judged whether the current word w_{j+1} is the last word; if not, the current word position is set to w_{j+1} and the procedure returns to Step3 to continue scanning; if so, the loop ends.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811532381.2A CN109783803A (en) | 2018-12-14 | 2018-12-14 | A kind of Laotian organization names recognition methods based on SVM and HMM |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811532381.2A CN109783803A (en) | 2018-12-14 | 2018-12-14 | A kind of Laotian organization names recognition methods based on SVM and HMM |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109783803A true CN109783803A (en) | 2019-05-21 |
Family
ID=66496899
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811532381.2A Pending CN109783803A (en) | 2018-12-14 | 2018-12-14 | A kind of Laotian organization names recognition methods based on SVM and HMM |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783803A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776560A (en) * | 2016-12-15 | 2017-05-31 | 昆明理工大学 | A kind of Kampuchean organization name recognition method |
Non-Patent Citations (1)
Title |
---|
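Zhu Jifeng (祝继锋): "Chinese organization name recognition based on SVM and HMM algorithms" (基于SVM和HMM算法的中文机构名称识别), China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology series * |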
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Abdul-Hamid et al. | Simplified feature set for Arabic named entity recognition | |
CN109284400B (en) | Named entity identification method based on Lattice LSTM and language model | |
Wshah et al. | Script independent word spotting in offline handwritten documents based on hidden markov models | |
Fischer et al. | Improving hmm-based keyword spotting with character language models | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
CN111178074A (en) | Deep learning-based Chinese named entity recognition method | |
WO2008107305A2 (en) | Search-based word segmentation method and device for language without word boundary tag | |
Zhang et al. | Word segmentation and named entity recognition for sighan bakeoff3 | |
CN108363691B (en) | Domain term recognition system and method for power 95598 work order | |
CN111046660B (en) | Method and device for identifying text professional terms | |
Peng et al. | Multi-font printed Mongolian document recognition system | |
Bedrick et al. | Robust kaomoji detection in Twitter | |
CN113505200A (en) | Sentence-level Chinese event detection method combining document key information | |
Chen et al. | A boundary assembling method for Chinese entity-mention recognition | |
CN109344233B (en) | Chinese name recognition method | |
Singh et al. | Can RNNs reliably separate script and language at word and line level? | |
Saluja et al. | Sub-word embeddings for OCR corrections in highly fusional indic languages | |
CN112307756A (en) | Bi-LSTM and word fusion-based Chinese word segmentation method | |
CN109783803A (en) | A kind of Laotian organization names recognition methods based on SVM and HMM | |
CN111178009A (en) | Text multilingual recognition method based on feature word weighting | |
Altenbek et al. | Kazakh segmentation system of inflectional affixes | |
CN113240485A (en) | Training method of text generation model, and text generation method and device | |
Saetiew et al. | Thai person name recognition (PNR) using likelihood probability of tokenized words | |
KR101126186B1 (en) | Apparatus and Method for disambiguation of morphologically ambiguous Korean verbs, and Recording medium thereof | |
Tepdang et al. | Improving thai word segmentation with named entity recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190521 |