CN108090039A - Name recognition method and device - Google Patents
- Publication number
- CN108090039A CN201611038892.XA CN201611038892A
- Authority
- CN
- China
- Prior art keywords
- name
- decision
- linguistic context
- word
- making
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Abstract
The invention discloses a name recognition method, comprising: obtaining an input text sequence and segmenting it into words, wherein the text sequence contains at least one person name; performing name recognition on the segmented text sequence based on at least two statistical models, according to the compositional features of names, to obtain all potential names; building an ngram model according to the context of names; and deciding among all the potential names according to the ngram model, then determining and outputting the final name recognition result that fits the context. The invention also discloses a name recognition device.
Description
Technical field
The present invention relates to recognition technology in the field of natural language processing, and in particular to a name recognition method and device.
Background
Natural language processing is a core analysis technology in the field of internet information search and is widely used across the internet and IT industry, including search engines, public opinion monitoring, and e-commerce. As internet information grows exponentially and user expectations rise, internet information search demands increasingly accurate natural language processing results, while processing speed must still meet users' needs. Name recognition is one of the hardest core problems in the lexical analysis stage of natural language processing. In both search engines and public opinion monitoring, users pay far more attention to person names than to ordinary words, yet no dictionary can list all names, which makes recognition difficult. Name recognition has therefore long been a research topic of wide interest.
In general, name recognition covers two types: Chinese name recognition and transliterated name recognition. Because the compositional features and contextual features of names are complex, the single statistical model used in current mainstream techniques cannot cover all of them. To improve overall recognition performance, a multi-name recognition method that integrates Chinese name recognition and transliterated name recognition is therefore urgently needed. At present, there are two relatively common multi-name recognition methods:
1) Name recognition based on a mixed model. This method combines decision-tree rules with multiple statistical models. First, decision-tree rules classify names by their compositional and contextual features; then a targeted statistical model is applied to the names of each class. This compensates for the inability of the single statistical model used in mainstream techniques to cover all compositional and contextual features of names, and improves overall recognition performance.
2) Name recognition based on role tagging. This method tags the segmented input sequence with roles to obtain a role tag sequence, processes Chinese names and transliterated names in a unified way, corrects erroneous name recognition roles, and finally matches the resulting role tag sequence against name recognition patterns and outputs the names formed.
However, these two multi-name recognition methods have the following problems:
The mixed-model method must classify all names before recognizing them, so poor classification easily leads to missed or wrong recognition. It also makes no decision in a unified dimension over the results of the different name recognition models, so when the results of different models overlap, it is hard for the user to choose among them. The role-tagging method simply recognizes Chinese names and transliterated names uniformly by role tagging and ignores the differences in their own characteristics, so its recognition performance is inadequate.
Summary of the invention
In view of this, embodiments of the present invention aim to provide a name recognition method and device that solve at least the above problems in existing multi-name recognition techniques and can recognize Chinese names and transliterated names quickly and accurately.
To achieve the above objectives, the technical solution of the embodiments of the present invention is realized as follows:
An embodiment of the present invention provides a name recognition method, the method comprising:
obtaining an input text sequence and segmenting the text sequence, wherein the text sequence contains at least one name;
performing name recognition on the segmented text sequence based on at least two statistical models, according to the compositional features of names, to obtain all potential names;
building an ngram model according to the context of names; and
deciding among all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context.
In the above solution, the method further comprises: during segmentation, presenting out-of-vocabulary words, i.e. words not entered in the dictionary, as single characters.
In the above solution, after all potential names are obtained, the method further comprises: building a directed graph of names awaiting decision from all the potential names;
the context of a name includes: the context words of the name and the parts of speech of those context words.
In the above solution, deciding among all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context, comprises:
according to the context words of the context in which a name occurs and the part-of-speech features of those words, performing probability mapping on the directed graph of names awaiting decision through the ngram model to build a probability directed graph of all the potential names; comparing all paths in the probability directed graph based on a prediction algorithm; and outputting the shortest path as the final name recognition result that fits the context.
In the above solution, the corpus used by the ngram model is the same as the corpus used by the segmenter that segments the text sequence.
An embodiment of the present invention also provides a name recognition device, the device comprising: a word segmentation module, a name recognition module, a model construction module, and a name decision module; wherein
the word segmentation module is configured to obtain an input text sequence and segment the text sequence, wherein the text sequence contains at least one name;
the name recognition module is configured to perform name recognition on the segmented text sequence based on at least two statistical models, according to the compositional features of names, to obtain all potential names;
the model construction module is configured to build an ngram model according to the context of names;
the name decision module is configured to decide among all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.
In the above solution, the word segmentation module is further configured to present out-of-vocabulary words, i.e. words not entered in the dictionary, as single characters during segmentation.
In the above solution, the model construction module is further configured to build, after the name recognition module obtains all potential names, a directed graph of names awaiting decision from all the potential names;
the context of a name includes: the context words of the name and the parts of speech of those context words.
In the above solution, the name decision module is specifically configured to:
according to the context words of the context in which a name occurs and the part-of-speech features of those words, perform probability mapping on the directed graph of names awaiting decision through the ngram model to build a probability directed graph of all the potential names; compare all paths in the probability directed graph based on a prediction algorithm; and output the shortest path as the final name recognition result that fits the context.
In the above solution, the corpus used by the ngram model is the same as the corpus used by the segmenter that segments the text sequence.
With the name recognition method and device provided by the embodiments of the present invention, an input text sequence is obtained and segmented, wherein the text sequence contains at least one name; name recognition is performed on the segmented text sequence based on at least two statistical models, according to the compositional features of names, to obtain all potential names; an ngram model is built according to the context of names; and a decision is made among all the potential names according to the ngram model, and the final name recognition result that fits the context is determined and output. In this way, different statistical models are used according to the differences in the characteristics of different kinds of names, so that the strengths of the several models are fully exploited, all potential names are identified, and the name recognition rate improves. A probability directed graph of all potential names is built through the ngram model, so that the recognition results for Chinese names and transliterated names are decided in a unified dimension and the optimal result is selected. This gives name recognition good determinacy, improves overall recognition performance, and allows different names to be recognized quickly and accurately.
Description of the drawings
Fig. 1 is a flow diagram of a name recognition method provided by Embodiment one of the present invention;
Fig. 2 is an overall flow diagram of a name recognition method provided by Embodiment two of the present invention;
Fig. 3 is a schematic diagram of the directed graph of names awaiting decision built in Embodiment two of the present invention;
Fig. 4 is a schematic diagram of the probability directed graph of all potential names built in Embodiment two of the present invention;
Fig. 5 is a structural diagram of a name recognition device provided by Embodiment three of the present invention.
Detailed description
To provide a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the present invention.
Embodiment one:
As shown in Fig. 1, the name recognition method in this embodiment of the present invention comprises the following steps:
Step 101: obtain an input text sequence and segment the text sequence, wherein the text sequence contains at least one name;
Here, the input text sequence is a continuous string of natural-language text containing one or more names. It can be obtained in at least one of the following ways: entered directly by the user, collected from the current environment such as a chat window, or received as a text sequence sent by a terminal. In general, a segmenter can be used to segment the text sequence. Specifically, based on a collected Chinese dictionary, a segmentation method cuts the continuous text string into a sequence of words.
Many segmentation methods exist in the prior art, for example forward maximum matching, reverse maximum matching, N-gram based segmentation, and improved maximum matching. The improved maximum matching algorithm keeps the core idea of forward maximum matching and adds the ambiguity detection and resolution that forward maximum matching lacks, which improves segmentation accuracy while keeping segmentation speed essentially unchanged. The embodiments of the present invention do not specifically limit which segmentation method is used.
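As a rough illustration of forward maximum matching, the simplest of the methods listed above, the following Python sketch segments a sample sentence with a toy dictionary. The function name `forward_max_match`, the `max_len` window, the four-word dictionary, and the Chinese sample sentence are all assumptions made for illustration; the embodiment does not prescribe a particular segmenter.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until something fits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so text that is not in
            # the dictionary falls out as a run of single-character tokens.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical); a real segmenter would use a full word set.
toy_dictionary = {"卫生", "部长", "应约", "电话"}
tokens = forward_max_match("卫生部长陈敏章应约同布鲁克内尔通电话", toy_dictionary)
```

Because the two names are not in the toy dictionary, they come out as consecutive single characters, which is the property the later steps rely on.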
The Chinese dictionary consists of three parts: a standard Chinese word set, a conflict word set, and an association word set. Words in the standard Chinese word set can neither be a name nor occur as part of a name; this set serves as the word set of the segmenter. Words in the conflict word set can occur as part of a name but are not names themselves. Words in the association word set can be a name and can also be a place name or the name of another entity, together with their related feature words.
Here, by comparison with the standard Chinese word set in the Chinese dictionary, the out-of-vocabulary words in the input text sequence, i.e. words not entered in the dictionary, are identified. Out-of-vocabulary words appear as single characters after segmentation, and runs of consecutive single characters are very likely to be names. The embodiments of the present invention therefore examine the runs of consecutive single characters in the segmentation result and judge whether each is a name. For example, the user inputs the text sequence "卫生部长陈敏章应约同布鲁克内尔通电话" ("Minister for Health Chen Minzhang spoke by telephone, by appointment, with Brooker Nellie"), which after segmentation becomes "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 同/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n", where n, ag, q, v, p, j, and f denote different parts of speech. The runs of consecutive single characters in the segmentation result, "陈/n 敏/ag 章/q" and "同/p 布/n 鲁/j 克/q 内/f 尔/n 通/v", are each passed to the name recognition models so that subsequent steps can judge whether these out-of-vocabulary words are names.
It should be noted that the embodiments of the present invention segment the text sequence with an existing segmentation method, which is not described in further detail here.
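The selection of consecutive single characters described in step 101 can be sketched as follows. This is a minimal sketch, assuming tokens are plain strings and that a run needs at least two characters to be a name candidate; the function name `single_char_runs`, the `min_len` parameter, and the Chinese token list are illustrative assumptions.

```python
def single_char_runs(tokens, min_len=2):
    """Return (start, end) token-index pairs for runs of single-character tokens."""
    runs = []
    start = None
    # The empty-string sentinel has length 0, so it closes a run that
    # reaches the end of the sentence.
    for idx, tok in enumerate(tokens + [""]):
        if len(tok) == 1:
            if start is None:
                start = idx
        else:
            if start is not None and idx - start >= min_len:
                runs.append((start, idx))
            start = None
    return runs


tokens = ["卫生", "部长", "陈", "敏", "章", "应约",
          "同", "布", "鲁", "克", "内", "尔", "通", "电话"]
runs = single_char_runs(tokens)
segments = ["".join(tokens[s:e]) for s, e in runs]
```

The two segments found here correspond to the two runs that the text passes on to the name recognition models.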
Step 102: perform name recognition on the segmented text sequence based on at least two statistical models, according to the compositional features of names, to obtain all potential names;
Here, names include Chinese names and transliterated names. The compositional features of a name include the name context feature, the name word feature, the boundary word feature, and the character feature. The name context feature is the words that occur before and after a name in the text. The name word feature covers the surname, the first character of the name, the middle character of the name, and the last character of the name. The boundary word feature is the word at the left or right boundary of a name. The character feature is the digits, characters, or punctuation marks that a transliterated name may contain. The statistical models can be at least two of the following name recognition models: a dictionary-based name recognition model, a Chinese name recognition model based on the hidden Markov model (HMM, Hidden Markov Model), and a transliterated name recognition model based on a probability graph. In this way, several statistical models recognize Chinese names and transliterated names according to the characteristics of each, which overcomes the inability of a single model to cover all compositional features of names and aims to identify every potential name in the input text sequence.
Here, after all potential names are obtained in step 102, the method further comprises: building a directed graph of names awaiting decision from all the potential names.
Specifically, all the identified potential names are represented as a directed graph, so that a decision can be made among them on the basis of this graph to judge which names better fit the context.
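One possible in-memory form of such a directed graph is sketched below. It assumes that nodes are token boundaries, that every token contributes an edge, and that every potential name contributes one extra edge spanning its tokens; this representation, the function name `build_name_graph`, and the candidate spans are assumptions for illustration, not the layout of the patent's figure.

```python
def build_name_graph(tokens, candidates):
    """Map each start node to its outgoing edges (end, surface, tag)."""
    # One edge per ordinary token.
    edges = [(i, i + 1, tok, "word") for i, tok in enumerate(tokens)]
    # One edge per potential name, tagged NR, covering tokens[start:end].
    for start, end in candidates:
        edges.append((start, end, "".join(tokens[start:end]), "NR"))
    graph = {}
    for start, end, surface, tag in edges:
        graph.setdefault(start, []).append((end, surface, tag))
    return graph


tokens = ["卫生", "部长", "陈", "敏", "章", "应约",
          "同", "布", "鲁", "克", "内", "尔", "通", "电话"]
# Hypothetical candidate spans, as if produced by the recognition models.
candidates = [(2, 4), (2, 5), (7, 10), (7, 12), (7, 13)]
graph = build_name_graph(tokens, candidates)
```

Every path from node 0 to the last node is then one way of reading the sentence, with or without each potential name.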
Step 103: build an ngram model according to the context of names;
Here, the context of a name includes: the context words of the name and the parts of speech of those context words.
Here, the ngram model is built from the context words of names and the part-of-speech features of those words. Specifically, it can be built according to how likely the context words and their parts of speech are to occur in text, i.e. their probability of occurrence; the construction is described in detail later. Treating every potential name as a single primitive removes the problem of names being out of vocabulary: every name, whether or not it is entered in the dictionary, is uniformly labeled NR, which masks the features of the name itself so that only the context words and part-of-speech features of the name are considered. In addition, a smoothing algorithm can handle other out-of-vocabulary words.
Usually, the corpus of the ngram model is set to be the same as the corpus of the segmenter that segments the text sequence, so that different names can be recognized accurately.
Step 104: decide among all the potential names according to the ngram model, and determine and output the final name recognition result that fits the context.
This step specifically comprises: according to the context words of the context in which a name occurs and the part-of-speech features of those words, performing probability mapping on the directed graph of names awaiting decision through the ngram model to build a probability directed graph of all the potential names; comparing all paths in the probability directed graph based on a prediction algorithm; and outputting the shortest path as the final name recognition result that fits the context.
Here, the prediction algorithm is the Viterbi algorithm, a dynamic programming algorithm commonly applied to hidden Markov models to find the hidden state sequence most likely to have produced the observations. The potential names are all the names, preliminarily identified by the name recognition models, that may fit the context.
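The path comparison in step 104 can be sketched as a Viterbi-style dynamic program over a directed acyclic graph. The sketch assumes each edge already carries a cost such as a negative log probability, so that the shortest path is the most probable reading; the function name `shortest_path`, the edge format, and the cost values are made up for illustration.

```python
import math


def shortest_path(graph, n_nodes):
    """graph: {start: [(end, label, cost)]}; all edges go from lower to higher nodes."""
    best = [math.inf] * n_nodes   # cheapest known cost to reach each node
    back = [None] * n_nodes       # (previous node, edge label) on that path
    best[0] = 0.0
    for node in range(n_nodes):   # nodes are visited in topological order
        if best[node] == math.inf:
            continue
        for end, label, cost in graph.get(node, []):
            if best[node] + cost < best[end]:
                best[end] = best[node] + cost
                back[end] = (node, label)
    # Walk the back-pointers from the last node (assumed reachable) to node 0.
    path, node = [], n_nodes - 1
    while node != 0:
        node, label = back[node]
        path.append(label)
    return path[::-1], best[-1]


# Toy graph over the three characters 陈, 敏, 章 with made-up costs.
toy_graph = {
    0: [(1, "陈", 2.0), (2, "陈敏/NR", 2.0), (3, "陈敏章/NR", 1.5)],
    1: [(2, "敏", 2.5)],
    2: [(3, "章", 3.0)],
}
path, cost = shortest_path(toy_graph, 4)
```

With these costs the single edge labelled 陈敏章/NR is cheaper than any path through the shorter candidate or the individual characters, so it is returned as the result.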
Embodiment two:
The specific implementation of the name recognition method of the embodiments of the present invention is described in further detail below.
Fig. 2 shows the overall flow of the name recognition method of this embodiment. As shown in Fig. 2, it comprises the following steps:
Step 201: obtain the input text and segment it;
Here, based on the words collected in the Chinese dictionary, a segmenter cuts the continuous text string in the input text into a sequence of words, and the out-of-vocabulary words, i.e. words not entered in the dictionary, are identified. During segmentation, out-of-vocabulary words appear as single characters, and runs of consecutive single characters are likely to be names. In other words, the embodiments of the present invention examine the runs of consecutive single characters in the segmentation result and judge whether each is a name. For example, the user inputs "卫生部长陈敏章应约同布鲁克内尔通电话" ("Minister for Health Chen Minzhang spoke by telephone, by appointment, with Brooker Nellie"), which after segmentation becomes "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 同/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n". The runs of consecutive single characters in the segmentation result, "陈/n 敏/ag 章/q" and "同/p 布/n 鲁/j 克/q 内/f 尔/n 通/v", are each passed to the name recognition models to judge further whether these out-of-vocabulary words are names.
Step 202: on the segmentation result, perform preliminary name recognition on the consecutive single characters using recognition models such as the dictionary-based name recognition model, the HMM-based Chinese name recognition model, and the probability-graph-based transliterated name recognition model;
These name recognition models are described in further detail below.
1) Dictionary-based name recognition model
Dictionary-based name recognition is rule-based name recognition. It is a good supplement for names that the statistical name recognition models do not recognize easily. For example, the transliterated name "Putin" is hard for those models to recognize, but adding "Putin" to the dictionary makes up for this.
In general, the names in the dictionary are mainly those of celebrities plus common names. However, dictionary-based name recognition introduces a problem. For example, "Li Jing" is a name, but in the sentence "Classmate Li Jingjing works very hard", "Li Jing" is clearly not a name but only part of one. To eliminate this effect, the usual measure is: whenever a name identified by another name recognition model covers a name identified from the dictionary, the dictionary-identified name is simply deleted.
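The deletion rule above can be sketched as a span comparison. This assumes each recognized name is recorded as a half-open `(start, end)` character span in the sentence and that "covers" means the other model's span contains the dictionary span; the function name `drop_covered_dictionary_names` and the span values are hypothetical.

```python
def drop_covered_dictionary_names(dict_spans, model_spans):
    """Keep only dictionary hits that no other model's name span contains."""
    kept = []
    for d_start, d_end in dict_spans:
        covered = any(m_start <= d_start and d_end <= m_end
                      for m_start, m_end in model_spans)
        if not covered:
            kept.append((d_start, d_end))
    return kept


# Dictionary hit "Li Jing" at characters 0..2; another model found the
# longer name "Li Jingjing" at characters 0..3, so the dictionary hit goes.
remaining = drop_covered_dictionary_names([(0, 2)], [(0, 3)])
# With no covering span from another model, the dictionary hit is kept.
untouched = drop_covered_dictionary_names([(0, 2)], [(5, 8)])
```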
2) HMM-based Chinese name recognition model
Here, mainly following the method of Zhang Huaping of the Chinese Academy of Sciences, an HMM tags each token in the segmentation result with its name composition role; then, based on the tagged role sequence, longest pattern-string matching is performed to identify names. An HMM is a kind of Markov chain whose states cannot be observed directly but can be inferred from a sequence of observation vectors. Each observation vector is emitted by a state according to some probability density distribution, so a sequence of observation vectors is generated by a state sequence with corresponding probability density distributions.
In general, from the standpoint of name recognition, all the words in a sentence are divided into name composition roles such as internal components of a name, context, and unrelated words. Roles are defined by the different parts that words play in forming a name, for example surname, given name, preceding context, and following context. For example, according to the name composition roles listed in Table 1, the segmentation result "卫生/部长/陈/敏/章/应约/同/布/鲁/克/内/尔/通/电话" is tagged with roles, giving "卫生/A 部长/K 陈/B 敏/C 章/D 应约/L 同/A 布/A 鲁/A 克/A 内/A 尔/A 通/A 电话/A". Table 1 is the classification of Chinese name composition roles.
Table 1
The HMM modeling process is outlined below. Let $W$ be the token sequence after segmentation, i.e. the segmentation result before out-of-vocabulary word recognition, let $T$ be a possible role tag sequence for $W$, and let $T^{\#}$ be the final tagging result. Then:
$W = (w_1, w_2, \ldots, w_m)$,
$T = (t_1, t_2, \ldots, t_m)$, $m > 0$.
If each word $w_i$ is regarded as an observation and each role $t_i$ as a state, with $t_0$ the initial state, then $W$ is the observation sequence and $T$ is the state sequence hidden behind $W$, which forms a hidden Markov chain. A hidden Markov model can therefore be introduced to compute $T^{\#}$:
$T^{\#} = \arg\max_{T} \prod_{i=1}^{m} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})$.
To simplify the calculation, take the negative logarithm of the probability in the above formula for $T^{\#}$:
$T^{\#} = \arg\min_{T} \left( -\sum_{i=1}^{m} \left[ \ln p(w_i \mid t_i) + \ln p(t_i \mid t_{i-1}) \right] \right)$.
Here, $p(w_i \mid t_i)$ and $p(t_i \mid t_{i-1})$ are computed by converting the part-of-speech tags of a corrected corpus into roles and collecting role statistics. $p(w_i \mid t_i)$ is the probability of token $w_i$ given role $t_i$, and $p(t_i \mid t_{i-1})$ is the transition probability from role $t_{i-1}$ to role $t_i$. By the law of large numbers:
$p(w_i \mid t_i) \approx C(w_i, t_i) / C(t_i)$,
where $C(w_i, t_i)$ is the number of times $w_i$ occurs with role $t_i$ and $C(t_i)$ is the number of times role $t_i$ occurs. For $i > 1$:
$p(t_i \mid t_{i-1}) \approx C(t_{i-1}, t_i) / C(t_{i-1})$,
where $C(t_{i-1}, t_i)$ is the number of times role $t_{i-1}$ is immediately followed by role $t_i$. $C(w_i, t_i)$, $C(t_i)$, and $C(t_{i-1}, t_i)$ can be extracted automatically by training on a segmented and tagged corpus.
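The count-based estimates above can be sketched as follows, assuming a corpus of sentences already tagged with roles. The function name `estimate_hmm`, the `<s>` start marker standing in for the initial state, and the two toy sentences are illustrative assumptions; a real corpus would be far larger and would normally need smoothing.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="<s>"):
    """Relative-frequency estimates of p(word | role) and p(role | previous role)."""
    emit, trans, role_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = start
        role_count[start] += 1
        for word, role in sentence:
            emit[(word, role)] += 1    # C(w_i, t_i)
            trans[(prev, role)] += 1   # C(t_{i-1}, t_i)
            role_count[role] += 1      # C(t_i)
            prev = role
    p_emit = {(w, t): c / role_count[t] for (w, t), c in emit.items()}
    p_trans = {(a, b): c / role_count[a] for (a, b), c in trans.items()}
    return p_emit, p_trans


# Toy role-tagged corpus (hypothetical), using the role letters of the example.
corpus = [
    [("卫生", "A"), ("部长", "K"), ("陈", "B"), ("敏", "C"), ("章", "D"), ("应约", "L")],
    [("部长", "K"), ("陈", "B"), ("敏", "C"), ("章", "D"), ("通", "A")],
]
p_emit, p_trans = estimate_hmm(corpus)
```

In this toy corpus the role B is always followed by C, while D is followed by L in one sentence and by A in the other, so the estimated transition probabilities differ accordingly.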
As shown above, the embodiments of the present invention convert the role tagging problem into the problem of solving for $T^{\#}$, which can be solved with the Viterbi algorithm, the classic algorithm for this kind of problem. Final name recognition performs maximum pattern-string matching on the role tags obtained for $T^{\#}$, outputs the matching segments as names, and records their positions in the sentence. For example, after role tagging the input sentence "卫生/部长/陈/敏/章/应约/同/布/鲁/克/内/尔/通/电话", the corresponding $T^{\#}$ is "AKBCDLAAAAAAAA"; after maximum pattern-string matching, the Chinese names preliminarily identified are "Chen Min" (陈敏) or "Chen Minzhang" (陈敏章).
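The pattern-string matching step can be sketched with regular expressions over the role string. This assumes one role letter per token, so that a match position in the role string is also a token index, and it assumes a pattern set of just "BCD" and "BC"; the actual pattern set is not given in the text, so both the patterns and the function name `names_from_roles` are hypothetical.

```python
import re


def names_from_roles(tokens, roles, patterns=("BCD", "BC")):
    """Return (name, start, end) for every role pattern found in `roles`."""
    found = []
    for pattern in patterns:
        for match in re.finditer(pattern, roles):
            # One role letter per token: string offsets are token indices.
            name = "".join(tokens[match.start():match.end()])
            found.append((name, match.start(), match.end()))
    return found


tokens = ["卫生", "部长", "陈", "敏", "章", "应约",
          "同", "布", "鲁", "克", "内", "尔", "通", "电话"]
names = names_from_roles(tokens, "AKBCDLAAAAAAAA")
```

Both candidates are kept, together with their positions, so that the later decision step can choose between them.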
3) Probability-graph-based transliterated name recognition model
"Probability graph" is a general term for models that express probabilistic dependencies graphically; combining probability theory and graph theory, such a model uses a graph to represent the joint probability distribution of its variables. This name recognition model mainly uses real-world news corpora and the Chinese-English translated names in the "New English-Chinese Dictionary" and the "English Dictionary of Persons, Place Names, and Event Names" as resources. First, a translated-name character base, a translated-name dictionary, and a translated-name syllable information base are established. Second, the confidence of translated-name characters is computed, and the ability of characters to form a translated name is evaluated with the translated-name probability estimation formula. Third, a rule base for translated-name recognition is constructed. Finally, contextual information and the syllable information in English names are fully used to recognize English names automatically.
The translated-name probability estimation formula expresses the ability of characters to form a translated name. Let the translated-name string be $ymstr = c_1 c_2 \ldots c_{n-1} c_n$, where $c_1$ is the first character, $c_2$ to $c_{n-1}$ are the middle characters, and $c_n$ is the last character. Let $P_s$, $P_m$, and $P_e$ denote the probability estimates of the first, middle, and last characters, and $P_{ymstring}$ the probability estimate of the translated name. The formula is then:
$P_{ymstring} = P_s \times P_m \times P_e$,
where YmNumsInter(c), YmNummInter(c), and YmNumeInter(c) are the numbers of times character c occurs in real text as the first, middle, and last character of a translated name, respectively; TotalInText(c) is the number of times c occurs in real text; TotalInBase(c) is the number of times c occurs in translated names; and NumInBase(c) is the total number of translated-name characters in the translated-name character base. If the translated-name string $c_1 c_2 \ldots c_{n-1} c_n$ satisfies $\ln(P_{ymstring}) > \theta$, it is taken as a candidate translated name. Here $\theta$ is a threshold, chosen so that more than 99% of the translated names in the translated-name base are covered.
Therefore, with the above probability-graph-based transliterated name recognition model, the transliterated names preliminarily identified are "Brooker" (布鲁克), "Brooker Nellie" (布鲁克内尔), or "布鲁克内尔通" (the preceding name plus the following character 通).
Step 203: build the name directed graph from the preliminarily identified names;
Here, the sentence to be recognized, "卫生部长陈敏章应约同布鲁克内尔通电话" ("Minister for Health Chen Minzhang spoke by telephone, by appointment, with Brooker Nellie"), yields after segmentation "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 同/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n". This result is fed into the three name recognition models above, and the names preliminarily identified are: 陈敏 (Chen Min), 陈敏章 (Chen Minzhang), 布鲁克 (Brooker), 布鲁克内尔 (Brooker Nellie), and 布鲁克内尔通. Representing all preliminarily identified names as a directed graph gives the directed graph of names awaiting decision shown in Fig. 3, from which it can be judged which names better fit the context and the optimal names can be identified.
Step 204: build an ngram model according to the context of names, and build the Markov name probability graph from the ngram model and the name directed graph;
This step builds an ngram model that combines the context words of the context in which a name occurs with their part-of-speech features, performs probability mapping on the directed graph of names awaiting decision, and builds the Markov name probability graph. The modeling process is described in detail below:
1) Language modeling
Assume that S denotes a meaningful sentence composed of a particular sequence of words $w_1, w_2, w_3, \ldots, w_n$, whose corresponding part-of-speech sequence is $p_1, p_2, p_3, \ldots, p_n$, where n is the length of the sentence. $S_w = (w_1, w_2, w_3, \ldots, w_n)$ denotes the sentence as modeled at the word level, where w denotes a word; the part-of-speech sequence of S is $S_p = (p_1, p_2, p_3, \ldots, p_n)$, where $p_i$ denotes the part of speech of $w_i$. The probability of S can then be computed with the following formula:
$$P(S) = \lambda \cdot P(S_w) + (1-\lambda) \cdot P(S_p),$$
where λ is a floating-point number between 0 and 1, set to 0.8 here; that is, the context words are the primary consideration and the parts of speech of the context words are secondary. $P(S_w)$ and $P(S_p)$ are each estimated under the Markov assumption of a 2-gram model (each item depends only on the item before it), which gives the following formulas:
$$P(S_w) = P(w_1, w_2, w_3, \ldots, w_n) = P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1}),$$
$$P(S_p) = P(p_1, p_2, p_3, \ldots, p_n) = P(p_1) \cdot P(p_2 \mid p_1) \cdot P(p_3 \mid p_2) \cdots P(p_n \mid p_{n-1}).$$
It should be noted that, to avoid numerical underflow and to improve performance, the multiplications are usually replaced by additions after taking logarithms:
$$\log P(S_w) = \log\big(P(w_1) \cdot P(w_2 \mid w_1) \cdots P(w_n \mid w_{n-1})\big) = \log P(w_1) + \log P(w_2 \mid w_1) + \cdots + \log P(w_n \mid w_{n-1}),$$
$$\log P(S_p) = \log\big(P(p_1) \cdot P(p_2 \mid p_1) \cdots P(p_n \mid p_{n-1})\big) = \log P(p_1) + \log P(p_2 \mid p_1) + \cdots + \log P(p_n \mid p_{n-1}).$$
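The formulas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the probability tables are made-up toy values, and the names `sequence_logprob` and `sentence_prob` are hypothetical helpers rather than part of the described method.

```python
import math

def sequence_logprob(seq, unigram_p, bigram_p):
    """log P(x1..xn) = log P(x1) + sum over i of log P(xi | xi-1)."""
    logp = math.log(unigram_p[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        logp += math.log(bigram_p[(prev, cur)])
    return logp

def sentence_prob(words, tags, word_model, tag_model, lam=0.8):
    """P(S) = lam * P(Sw) + (1 - lam) * P(Sp), with lam = 0.8 as in the text."""
    p_w = math.exp(sequence_logprob(words, *word_model))
    p_p = math.exp(sequence_logprob(tags, *tag_model))
    return lam * p_w + (1 - lam) * p_p

# Made-up toy probabilities, for illustration only.
word_model = ({"Minister": 0.1}, {("Minister", "NR"): 0.4, ("NR", "said"): 0.5})
tag_model = ({"n": 0.3}, {("n", "nr"): 0.2, ("nr", "v"): 0.6})
p = sentence_prob(["Minister", "NR", "said"], ["n", "nr", "v"],
                  word_model, tag_model)

# Why logarithms: a direct product of many small factors underflows to 0.0,
# while the sum of the logarithms stays representable.
direct = 1.0
for _ in range(2000):
    direct *= 0.1
log_sum = 2000 * math.log(0.1)
```

In a real system the tables would be estimated from the annotated corpus and smoothed, since an unseen 2-gram would otherwise raise a `KeyError` here.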
According to the law of large numbers, provided the corpus is large enough, the above probabilities can be estimated as follows:
$$P(w_i) \approx \frac{c(w_i)}{c(T)},$$
where $c(w_i)$ denotes the number of times $w_i$ occurs in the training corpus and $c(T)$ denotes the total number of words in the training corpus.
To suit this language modeling purpose, a secondary annotation can be applied in which every name in the corpus is uniformly relabeled as NR. This masks the characteristics of the names themselves, so that only the context words and parts of speech around a name are considered. A sample before secondary annotation is:
Health/n Minister/n Chen Minzhang/nr by-appointment/v with/p Brooker Nellie/nr connect/v telephone/n
The result after secondary annotation is:
Health/n Minister/n NR/nr by-appointment/v with/p NR/nr connect/v telephone/n
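The secondary annotation can be sketched as follows. This is a minimal sketch, assuming the corpus is already available as lists of (word, tag) pairs; `mask_names` and `bigram_counts` are hypothetical helper names, and the tokens are English glosses of the sample sentence.

```python
from collections import Counter

def mask_names(tagged_sentence):
    """Replace every token tagged nr (person name) with the placeholder NR."""
    return [("NR", tag) if tag == "nr" else (word, tag)
            for word, tag in tagged_sentence]

def bigram_counts(tagged_sentence):
    """Count the word 2-grams and part-of-speech 2-grams of one sentence."""
    words = [w for w, _ in tagged_sentence]
    tags = [t for _, t in tagged_sentence]
    return Counter(zip(words, words[1:])), Counter(zip(tags, tags[1:]))

sentence = [("Health", "n"), ("Minister", "n"), ("Chen Minzhang", "nr"),
            ("by-appointment", "v"), ("with", "p"), ("Brooker Nellie", "nr"),
            ("connect", "v"), ("telephone", "n")]
masked = mask_names(sentence)
word_bigrams, tag_bigrams = bigram_counts(masked)
```

After masking, contexts such as ("Minister", "NR") and ("with", "NR") are counted regardless of which particular name occurred, which is what lets the model judge a candidate by its surroundings alone.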
Here, the context in which a name occurs usually contains characteristic words such as "minister" or "chairman", and a 2-gram is enough to capture the main features of the context. Therefore, a 2-gram model is used to account for both the context-word features and the part-of-speech features.
2) Data smoothing
Large-scale statistical methods applied to a limited training corpus inevitably run into data sparsity, which causes the zero-probability problem; data smoothing is therefore necessary when analyzing the name directed graph. In addition, because the set of Chinese parts of speech is very limited, analysis shows that sparsity is not a serious problem for part-of-speech sequences. Investigation also shows that the context words of names are highly characteristic and can roughly be divided into kinship terms, forms of address, positions, verbs, prepositions, conjunctions, adverbs and so on; if sparsity does occur, the candidate is in most cases not a valid name recognition result.
In the embodiment of the present invention, data smoothing is performed by combining the Katz and Good-Turing smoothing algorithms, which addresses the unavoidable sparsity problem. The corresponding calculation formula is:
$$P_{katz}(w_i \mid w_{i-1}) = \begin{cases} P(w_i \mid w_{i-1}), & r > k \\ d_r \cdot P(w_i \mid w_{i-1}), & 0 < r \le k \\ \alpha(w_{i-1}) \cdot P(w_i), & r = 0 \end{cases}$$
where r denotes the number of times the entry $w_{i-1}w_i$ occurs; k denotes a threshold, set to 5 here; $d_r$ is the Good-Turing (GT) discount; and $\alpha(w_{i-1})$ is the normalization factor.
From the above formula it can be seen that: when r is greater than k, $P(w_i \mid w_{i-1})$ is used directly; when r does not exceed k but is not 0, a discount is applied by the Good-Turing smoothing algorithm and the probability mass left over is given to entries that do not occur in the corpus; when r is 0, the probability is replaced by the value obtained by backing off to the lower order.
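The three cases can be sketched in Python as below. This is a simplified illustration, not the exact procedure of the embodiment: the discount is taken as the plain Good-Turing ratio r*/r with r* = (r+1)·n(r+1)/n(r), clamped to 1 and skipped when the needed frequency-of-frequency counts are missing, whereas full Katz smoothing uses a renormalized discount. The function names and the toy corpus are assumptions.

```python
from collections import Counter

def train_counts(sentences):
    """Collect unigram and 2-gram counts from tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return unigrams, bigrams

def gt_discount(r, n, k):
    """Discount d_r for a count r; 1.0 (no discount) when r > k or when the
    frequency-of-frequency counts needed by Good-Turing are missing."""
    if r > k or n[r] == 0 or n[r + 1] == 0:
        return 1.0
    r_star = (r + 1) * n[r + 1] / n[r]   # Good-Turing adjusted count
    return min(1.0, r_star / r)          # simplification: never inflate

def katz_prob(prev, word, unigrams, bigrams, k=5):
    """Back-off estimate of P(word | prev) following the three cases."""
    total = sum(unigrams.values())
    n = Counter(bigrams.values())        # n[r]: number of 2-grams seen r times
    seen = {b: c for (a, b), c in bigrams.items() if a == prev}
    history = sum(seen.values())
    if history == 0:                     # unseen history: plain unigram
        return unigrams[word] / total
    discounted = {b: gt_discount(c, n, k) * c / history
                  for b, c in seen.items()}
    if word in seen:                     # r > 0 (discounted only when r <= k)
        return discounted[word]
    # r == 0: back off to the unigram model, scaled by alpha(prev)
    leftover = max(0.0, 1.0 - sum(discounted.values()))
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen) / total
    if unseen_mass == 0:
        return 0.0
    alpha = leftover / unseen_mass
    return alpha * unigrams[word] / total

# Toy corpus after secondary annotation (made up for illustration).
corpus = [["NR", "said"], ["NR", "said"], ["minister", "NR"],
          ["with", "NR"], ["NR", "went"]]
unigrams, bigrams = train_counts(corpus)
```

The normalization factor is chosen so that the probabilities of all words following a given history sum to 1, which is the role the text assigns to α.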
3) Prediction algorithm
From the above probability formulas it follows that:
$$\log P(w_1 w_2 w_3 \ldots w_n) = \log P(w_1) + \log P(w_2 \mid w_1) + \cdots + \log P(w_n \mid w_{n-1}) = \log P(w_1 w_2 w_3 \ldots w_{n-1}) + \log P(w_n \mid w_{n-1}).$$
This is a typical dynamic programming recurrence, in which the probability value $P(w_i \mid w_{i-1})$ is mapped onto the edge between $w_{i-1}$ and $w_i$. In this way, the problem of finding the path with the optimal probability value is converted into a shortest-path problem on the probability directed graph, which is solved with the Viterbi algorithm. That is, $P(w_i \mid w_{i-1})$ is mapped onto the edge between $w_{i-1}$ and $w_i$ in the name directed graph, which converts the name directed graph into the Markov name probability graph shown in Fig. 4.
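The shortest-path decision can be sketched as a small dynamic program. This sketch makes several assumptions that the text does not state: each edge is costed by the negative logarithm of its 2-gram probability (so the most probable path is the shortest one), candidate names carry the placeholder label NR for scoring, a sentence-start symbol `<s>` is used, and unseen 2-grams fall back to a fixed floor value in place of real smoothing. The function name `best_path` and the toy probabilities are illustrative.

```python
import math

def best_path(edges, goal, bigram_p, floor=1e-6):
    """Shortest path over a name graph, costed by -log P(label_i | label_{i-1}).

    edges    -- list of (start, end, lm_label, surface); candidate names use
                the placeholder "NR" as lm_label
    goal     -- index of the final node
    bigram_p -- dict mapping (previous label, label) to a probability;
                missing pairs fall back to `floor` (a stand-in for smoothing)
    """
    best = {}  # edge index -> (accumulated cost, previous edge index or None)
    # Process edges by start node so that all predecessors are already scored.
    for i in sorted(range(len(edges)), key=lambda idx: edges[idx][0]):
        start, _, label, _ = edges[i]
        if start == 0:
            best[i] = (-math.log(bigram_p.get(("<s>", label), floor)), None)
            continue
        options = [
            (best[j][0] - math.log(bigram_p.get((edges[j][2], label), floor)), j)
            for j in best if edges[j][1] == start
        ]
        if options:
            best[i] = min(options)
    # Pick the cheapest edge that reaches the goal node, then backtrack.
    cost, i = min((c, idx) for idx, (c, _) in best.items()
                  if edges[idx][1] == goal)
    path = []
    while i is not None:
        path.append(edges[i][3])
        i = best[i][1]
    return path[::-1], cost

# Toy graph: "Minister Chen Min Zhang said" with two candidate names.
edges = [(0, 1, "Minister", "Minister"), (1, 2, "Chen", "Chen"),
         (2, 3, "Min", "Min"), (3, 4, "Zhang", "Zhang"),
         (4, 5, "said", "said"),
         (1, 3, "NR", "Chen Min"), (1, 4, "NR", "Chen Minzhang")]
# Made-up 2-gram probabilities, for illustration only.
bigram_p = {("<s>", "Minister"): 0.5, ("Minister", "NR"): 0.4,
            ("NR", "said"): 0.5, ("NR", "Zhang"): 0.001}
path, cost = best_path(edges, 5, bigram_p)
```

With these toy values the path through the longer candidate wins, because "NR said" is a far more plausible context than "NR Zhang said".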
4) Summary
The purpose of this modeling is to make a decision among the multiple names that have been identified and to judge which name better fits the context. In general, a name can be judged within its context from the surrounding words and their parts of speech; based on this property, the ngram model is used to decide the quality of the finally recognized names.
By comparison, the embodiment of the present invention replaces the approach of discriminating names within a single model by means such as thresholds with an approach in which the names recognized by multiple models are compared against the context and the optimal recognition result is finally decided. The advantage of comparative discrimination across multiple models is evident.
It should be particularly noted that the corpus used for this modeling is the same as the one used by the segmenter. This ensures that the words produced by the segmenter appear in the language model and eliminates possible sparsity of non-name words. The specific corpus is the People's Daily corpus of January 1998, whose annotation includes word segmentation results and parts of speech.
Step 205: Determine the final name recognition result using the Viterbi algorithm.
Here, the Viterbi algorithm is used to solve for the optimal path in the Markov name probability graph; the names that lie on the shortest path are the names finally recognized. The recognition effect of this method is better than that of existing general models, which decide on names by thresholds. Experimental verification shows that the name recognition performance of this method is two percentage points higher than that obtained with an HMM model or a probability graph model alone.
The embodiment of the present invention first segments the input text with a segmenter; during segmentation, unregistered words that are not recorded in the dictionary appear as single characters, and runs of such single characters are likely to be names. Next, several statistical models are used to recognize Chinese names and transliterated names, which overcomes the inability of a single model to cover all the constituent features of names and aims to identify all potential names. Then, an ngram model is built from the context of names (context words and their parts of speech). Finally, the probability directed graph of all potential names is built with the ngram model, and the Viterbi algorithm decides the final name recognition result; that is, the name decision problem is converted into the problem of finding the best of the recognized names by comparing them on their contextual features.
Embodiment three:
To implement the above method, an embodiment of the present invention further provides a name recognition device. As shown in Fig. 5, the device includes a word segmentation module 501, a name recognition module 502, a model construction module 503 and a name decision module 504, wherein:
The word segmentation module 501 is configured to obtain an input text sequence and to segment the text sequence, wherein the text sequence includes at least one name;
The name recognition module 502 is configured to perform, according to the constituent features of names, name recognition on the segmented text sequence based on at least two statistical models, to obtain all potential names;
The model construction module 503 is configured to build an ngram model according to the context of names;
The name decision module 504 is configured to make a decision on all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.
Here, the word segmentation module 501 is further configured to present unregistered words, that is, words not recorded in the dictionary, as single characters during segmentation.
Here, the names include Chinese names and transliterated names; the statistical models may be at least two of name recognition models such as a dictionary-based name recognition model, an HMM-based Chinese name recognition model and a probability-graph-based transliterated-name recognition model.
The model construction module 503 is further configured to build, after the name recognition module 502 obtains all the potential names, the name directed graph to be decided from all the potential names.
The context of a name includes the context words of the name and the parts of speech of those context words; the corpus used for the ngram model is the same as the corpus used by the segmenter that segments the text sequence.
The name decision module 504 is specifically configured to:
perform probability mapping on the name directed graph to be decided by means of the ngram model, according to the context words of the context in which a name occurs and the part-of-speech features of those context words, so as to build the probability directed graph of all the potential names; compare all paths in the probability directed graph based on the prediction algorithm; and take the shortest path as the final name recognition result that fits the context and output it.
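The division of work among the modules can be pictured with a small Python sketch. It is only an illustration: the class name `NameRecognitionDevice`, its field names and the stub components are assumptions, modules 503 and 504 are folded into one `decide` callable, and the stub decision rule (pick the longest candidate) merely stands in for the ngram-based decision.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Token = Tuple[str, str]   # (word, part-of-speech tag)
Span = Tuple[int, int]    # token index range of a candidate name, end exclusive

@dataclass
class NameRecognitionDevice:
    segment: Callable[[str], List[Token]]                    # module 501
    recognizers: List[Callable[[List[Token]], List[Span]]]   # module 502
    decide: Callable[[List[Token], List[Span]], List[Span]]  # modules 503 and 504

    def recognize(self, text: str) -> List[str]:
        if len(self.recognizers) < 2:
            raise ValueError("at least two statistical models are expected")
        tokens = self.segment(text)
        # Union of the candidates proposed by every model.
        candidates = sorted({s for r in self.recognizers for s in r(tokens)})
        chosen = self.decide(tokens, candidates)
        return [" ".join(w for w, _ in tokens[a:b]) for a, b in chosen]

# Stub components, for illustration only.
def toy_segment(text):
    return [(word, "x") for word in text.split()]

def toy_model_a(tokens):
    return [(1, 3)]

def toy_model_b(tokens):
    return [(1, 3), (1, 4)]

def toy_decide(tokens, spans):
    # Placeholder rule: keep the longest candidate.
    return [max(spans, key=lambda s: s[1] - s[0])] if spans else []

device = NameRecognitionDevice(toy_segment, [toy_model_a, toy_model_b], toy_decide)
names = device.recognize("Minister Chen Min Zhang said")
```

Keeping the recognizers as a list makes it straightforward to add a further statistical model without touching the decision step, since every model only contributes candidate spans.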
In practical applications, the word segmentation module 501, the name recognition module 502, the model construction module 503 and the name decision module 504 may each be implemented by a central processing unit (CPU, Central Processing Unit), a microprocessor (MPU, Micro Processor Unit), a digital signal processor (DSP, Digital Signal Processor) or a field programmable gate array (FPGA, Field Programmable Gate Array) located in a natural language processing terminal.
The embodiment of the present invention obtains an input text sequence and segments it, the text sequence including at least one name; performs name recognition on the segmented text sequence based on at least two statistical models according to the constituent features of names, to obtain all potential names; builds an ngram model according to the context of names; and makes a decision on all the potential names according to the ngram model, determining and outputting the final name recognition result that fits the context. In this way, different statistical models are applied according to the differing characteristics of different kinds of names, so that the strengths of the several models are fully exploited, all potential names are identified, and the name recognition rate is improved. The probability directed graph of all potential names built with the ngram model gives the recognition results for Chinese names and transliterated names a unified decision dimension, so that the optimal recognition result can be selected. As a result, name recognition is more reliable, the overall recognition effect is improved, and different kinds of names can be recognized quickly and accurately.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. A name recognition method, characterized in that the method comprises:
obtaining an input text sequence and segmenting the text sequence, wherein the text sequence includes at least one name;
performing, according to the constituent features of names, name recognition on the segmented text sequence based on at least two statistical models, to obtain all potential names;
building an ngram model according to the context of names;
making a decision on all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context.
2. The method according to claim 1, characterized in that the method further comprises: presenting unregistered words that are not recorded in the dictionary as single characters during segmentation.
3. The method according to claim 1, characterized in that, after all the potential names are obtained, the method further comprises: building the name directed graph to be decided from all the potential names;
the context of a name includes: the context words of the name and the parts of speech of the context words.
4. The method according to claim 3, characterized in that making a decision on all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context, comprises:
performing probability mapping on the name directed graph to be decided by means of the ngram model, according to the context words of the context in which a name occurs and the part-of-speech features of the context words, to build the probability directed graph of all the potential names; comparing all paths in the probability directed graph based on a prediction algorithm; and taking the shortest path as the final name recognition result that fits the context and outputting it.
5. The method according to claim 1, characterized in that the corpus used for the ngram model is the same as the corpus used by the segmenter that segments the text sequence.
6. A name recognition device, characterized in that the device comprises: a word segmentation module, a name recognition module, a model construction module and a name decision module; wherein,
the word segmentation module is configured to obtain an input text sequence and to segment the text sequence, wherein the text sequence includes at least one name;
the name recognition module is configured to perform, according to the constituent features of names, name recognition on the segmented text sequence based on at least two statistical models, to obtain all potential names;
the model construction module is configured to build an ngram model according to the context of names;
the name decision module is configured to make a decision on all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.
7. The device according to claim 6, characterized in that the word segmentation module is further configured to present unregistered words that are not recorded in the dictionary as single characters during segmentation.
8. The device according to claim 6, characterized in that the model construction module is further configured to build, after the name recognition module obtains all the potential names, the name directed graph to be decided from all the potential names;
the context of a name includes: the context words of the name and the parts of speech of the context words.
9. The device according to claim 8, characterized in that the name decision module is specifically configured to:
perform probability mapping on the name directed graph to be decided by means of the ngram model, according to the context words of the context in which a name occurs and the part-of-speech features of the context words, to build the probability directed graph of all the potential names; compare all paths in the probability directed graph based on a prediction algorithm; and take the shortest path as the final name recognition result that fits the context and output it.
10. The device according to claim 6, characterized in that the corpus used for the ngram model is the same as the corpus used by the segmenter that segments the text sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611038892.XA CN108090039A (en) | 2016-11-21 | 2016-11-21 | A kind of name recognition methods and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108090039A true CN108090039A (en) | 2018-05-29 |
Family
ID=62170975
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611038892.XA Pending CN108090039A (en) | 2016-11-21 | 2016-11-21 | A kind of name recognition methods and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090039A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180529 |