CN108090039A

CN108090039A - Personal name recognition method and device

Info

Publication number: CN108090039A
Application number: CN201611038892.XA
Authority: CN
Inventors: 蒋忠强; 梁俊; 全兵; 陶鸿飞; 温士帅; 骆舰; 刘甦晓
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Shanghai Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2016-11-21
Filing date: 2016-11-21
Publication date: 2018-05-29

Abstract

The invention discloses a personal name recognition method, comprising: obtaining an input text sequence and segmenting it into words, where the text sequence contains at least one name; performing name recognition on the segmented text sequence based on at least two statistical models, according to the constituent features of names, to obtain all potential names; building an ngram model according to the context of names; and making a decision over all the potential names according to the ngram model, so as to determine and output the final name recognition result that fits the context. The invention also discloses a personal name recognition device.

Description

Personal name recognition method and device

Technical field

The present invention relates to recognition technology in the field of natural language processing, and in particular to a personal name recognition method and device.

Background

Natural language processing is a core analysis technology in Internet information retrieval and is widely used across the Internet and IT industries, including search engines, public opinion monitoring, and e-commerce. As the volume of Internet information grows exponentially and user expectations rise, Internet information retrieval demands ever more accurate natural language processing results while processing speed must still meet users' needs. Name recognition is one of the hardest core problems in the lexical analysis stage of natural language processing. In both search engines and public opinion monitoring, users pay far more attention to names than to ordinary words, yet a dictionary cannot list every name, which makes recognition difficult. Name recognition has therefore long been a research topic of wide interest.

In general, name recognition covers two types: Chinese name recognition and transliterated name recognition. Because the constituent features and contextual features of names are complex, the single statistical model used in mainstream techniques cannot cover all of them. To improve overall recognition, a multi-name recognition method that integrates Chinese name recognition and transliterated name recognition is urgently needed. At present, the common multi-name recognition methods are mainly the following two:

1) Name recognition based on a hybrid model. This method combines decision-tree rules with several statistical models. First, decision-tree rules classify names by their constituent features and contextual features; then a targeted statistical model is applied to each class of names. This compensates for the inability of the single statistical model used in mainstream techniques to cover all constituent and contextual features of names, and improves overall recognition;

2) Name recognition based on role tagging. This method tags the segmented input sequence with roles to obtain a role-tag sequence, so that Chinese names and transliterated names are handled uniformly, and corrects name roles that were tagged wrongly. Finally, the role-tag sequence is matched against name recognition patterns and the names so formed are output.

However, the two multi-name recognition methods above have the following problems:

The hybrid-model method must classify all names before recognition, so a poor classification easily leads to missed or wrong recognition. It also makes no decision in a unified dimension over the results of the different name recognition models, so when the results of different models overlap, it is hard for the user to choose among them. The role-tagging method simply recognizes Chinese names and transliterated names together by role tagging and ignores the differences in their own characteristics, so its recognition performance is inadequate.

Summary of the invention

In view of this, embodiments of the present invention aim to provide a personal name recognition method and device that solve at least the above problems of existing multi-name recognition techniques and can identify Chinese names and transliterated names quickly and accurately.

To achieve the above objective, the technical solution of the embodiments of the present invention is implemented as follows:

An embodiment of the present invention provides a personal name recognition method, the method comprising:

obtaining an input text sequence and segmenting the text sequence, where the text sequence contains at least one name;

performing name recognition on the segmented text sequence based on at least two statistical models, according to the constituent features of names, to obtain all potential names;

building an ngram model according to the context of names;

making a decision over all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context.

In the above solution, the method further comprises: during segmentation, presenting unregistered words, that is, words not recorded in the dictionary, as single characters.

In the above solution, after all potential names are obtained, the method further comprises: building a name directed graph to be decided from all the potential names;

The context of a name includes: the context words of the name and the parts of speech of the context words.

In the above solution, making a decision over all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context, comprises:

according to the context words of the context in which a name occurs and the part-of-speech features of those context words, performing probability mapping on the name directed graph to be decided through the ngram model to build a probability directed graph of all the potential names; comparing all paths in the probability directed graph based on a prediction algorithm; and outputting the shortest path as the final name recognition result that fits the context.

In the above solution, the corpus in the ngram model is the same as the corpus in the segmenter that segments the text sequence.

An embodiment of the present invention also provides a personal name recognition device, the device comprising: a word segmentation module, a name recognition module, a model building module, and a name decision module, wherein:

the word segmentation module is configured to obtain an input text sequence and segment the text sequence, where the text sequence contains at least one name;

the name recognition module is configured to perform name recognition on the segmented text sequence based on at least two statistical models, according to the constituent features of names, to obtain all potential names;

the model building module is configured to build an ngram model according to the context of names;

the name decision module is configured to make a decision over all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.

In the above solution, the word segmentation module is further configured to present unregistered words, that is, words not recorded in the dictionary, as single characters during segmentation.

In the above solution, the model building module is further configured to build a name directed graph to be decided from all the potential names after the name recognition module has obtained them;

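In the above solution, the name decision module is specifically configured to: according to the context words of the context in which a name occurs and the part-of-speech features of those context words, perform probability mapping on the name directed graph to be decided through the ngram model to build a probability directed graph of all the potential names; compare all paths in the probability directed graph based on a prediction algorithm; and output the shortest path as the final name recognition result that fits the context.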

With the personal name recognition method and device provided by the embodiments of the present invention, an input text sequence is obtained and segmented, where the text sequence contains at least one name; name recognition is performed on the segmented text sequence based on at least two statistical models, according to the constituent features of names, to obtain all potential names; an ngram model is built according to the context of names; and a decision is made over all the potential names according to the ngram model to determine and output the final name recognition result that fits the context. In this way, different statistical models are applied according to the differing characteristics of different kinds of names, which exploits the strengths of each model, identifies all potential names, and raises the name recognition rate. The probability directed graph of all potential names, built through the ngram model, provides a decision in a unified dimension over the recognition results for Chinese names and transliterated names, so that the best result can be selected. This makes name recognition more reliable and improves overall recognition, so that different kinds of names are identified quickly and accurately.

Brief description of the drawings

Fig. 1 is a flow diagram of a name recognition method provided by Embodiment One of the present invention;

Fig. 2 is an overall flow diagram of a name recognition method provided by Embodiment Two of the present invention;

Fig. 3 is a schematic diagram of the name directed graph to be decided that is built in Embodiment Two of the present invention;

Fig. 4 is a schematic diagram of the probability directed graph of all potential names that is built in Embodiment Two of the present invention;

Fig. 5 is a structural diagram of a name recognition device provided by Embodiment Three of the present invention.

Detailed description

To give a fuller understanding of the features and technical content of the embodiments of the present invention, their implementation is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the present invention.

Embodiment One:

As shown in Fig. 1, the name recognition method in this embodiment of the present invention comprises the following steps:

Step 101: Obtain an input text sequence and segment it, where the text sequence contains at least one name;

Here, the input text sequence is a continuous string of natural-language text that contains one or more names. The text sequence can be obtained in at least one of the following ways: entered directly by the user, collected from the current environment such as a chat window, or received from a terminal. In general, a segmenter can be used to segment the text sequence. Specifically, based on the collected Chinese dictionary, a segmentation method splits the continuous text string into a word sequence.

At present, the prior art offers many segmentation methods, for example forward maximum matching, reverse maximum matching, N-gram grammar, and improved maximum matching. The improved maximum matching algorithm keeps the core idea of forward maximum matching and adds the ambiguity detection and resolution that forward maximum matching lacks, which improves segmentation accuracy while keeping segmentation speed essentially unchanged. The embodiments of the present invention place no specific limit on which segmentation method is used.
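As an illustration of the simplest method named above, the following is a minimal sketch of forward maximum matching. The function name `forward_max_match`, the `max_len` window of 4 characters, and the toy dictionary are assumptions made for this example; the patent does not prescribe them. Characters not covered by any dictionary word fall out as single-character tokens, which is the behaviour the later steps rely on.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    dictionary word, falling back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy dictionary (hypothetical); the name characters are deliberately absent.
toy_dictionary = {"卫生", "部长", "应约", "电话"}
tokens = forward_max_match("卫生部长陈敏章应约", toy_dictionary)
```

This sketch has no ambiguity handling, which is exactly the gap the improved maximum matching algorithm is said to close.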

The Chinese dictionary consists of three parts: a standard Chinese word set, a conflict word set, and an association word set. Words in the standard Chinese word set are neither names nor parts of names, and this word set serves as the segmenter's word set. Words in the conflict word set can occur as parts of names but are not names themselves. Words in the association word set can be names, and can also be place names or other entity names, together with their related characteristic words.

Here, by comparison against the standard Chinese word set in the Chinese dictionary, the unregistered words in the input text sequence, that is, words not recorded in the dictionary, are identified. Unregistered words appear as single characters after segmentation, and runs of consecutive single characters are very likely to be names. The embodiments of the present invention therefore examine the runs of consecutive single characters in the segmentation result and judge whether each is a name. For example, the user inputs the text sequence "卫生部长陈敏章应约与布鲁克内尔通电话" ("Health Minister Chen Minzhang spoke by telephone, by appointment, with Brooker Nellie"), which after segmentation becomes "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 与/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n", where n, ag, q, v, p, j and f denote different parts of speech. The runs of consecutive single characters in the segmentation result, "陈/n 敏/ag 章/q" and "与/p 布/n 鲁/j 克/q 内/f 尔/n 通/v", are then passed to the name recognition models so that later steps can judge whether these unregistered words are names.
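The extraction of runs of consecutive single characters described above can be sketched as follows. The function name `single_char_runs` and the `min_len` cut-off of two characters are assumptions for illustration only; the patent does not state a minimum run length.

```python
def single_char_runs(tokens, min_len=2):
    """Collect maximal runs of consecutive single-character tokens."""
    runs, current = [], []
    for tok in tokens:
        if len(tok) == 1:
            current.append(tok)
        else:
            # A multi-character word ends the current run.
            if len(current) >= min_len:
                runs.append(current)
            current = []
    if len(current) >= min_len:
        runs.append(current)
    return runs


segmented = ["卫生", "部长", "陈", "敏", "章", "应约",
             "与", "布", "鲁", "克", "内", "尔", "通", "电话"]
runs = single_char_runs(segmented)
```

Applied to the example sentence, this yields the same two fragments that the text passes on to the name recognition models.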

It should be noted that the embodiments of the present invention segment the text sequence with an existing segmentation method, which is not described in detail here.

Step 102: According to the constituent features of names, perform name recognition on the segmented text sequence based on at least two statistical models to obtain all potential names;

Here, the names include Chinese names and transliterated names. The constituent features of names include name context features, name character features, boundary word features, and character features. The name context features are the words that appear before and after a name in the text. The name character features include the surname, the first character of the given name, the middle character of the given name, and the last character of the given name. The boundary word features are the words at the left boundary or the right boundary of a name. The character features are the digits, characters, or punctuation that a transliterated name may contain. The statistical models can be at least two of the following name recognition models: a dictionary-based name recognition model, a Chinese name recognition model based on the hidden Markov model (HMM, Hidden Markov Model), and a transliterated name recognition model based on a probability graph. In this way, several statistical models are used to recognize Chinese names and transliterated names according to their own characteristics, which overcomes the inability of a single model to cover all constituent features of names and aims to identify every potential name in the input text sequence.

Here, after all potential names are obtained in step 102, the method further comprises: building a name directed graph to be decided from all the potential names.

Specifically, all the identified potential names are represented as a directed graph, so that a decision can be made over them on the basis of that graph to judge which name better fits the context.

Step 103: Build an ngram model according to the context of names;

Here, the context of a name includes the context words of the name and the parts of speech of the context words.

Here, the ngram model is built from the context words of names and the part-of-speech features of those words. Specifically, it can be built from how likely the context words of names and their parts of speech are to appear in text, that is, from their probabilities of occurrence; the building process is described in detail later. Treating every potential name as a single primitive eliminates the problem of unregistered names: every name, whether or not it is an unregistered word absent from the dictionary, is uniformly defined as NR, which masks the features of the name itself so that only the context words of the name and their part-of-speech features are considered. In addition, a smoothing algorithm can be used to handle other unregistered words.

Usually, the corpus used for the ngram model is set to be the same as the corpus used by the segmenter that segments the text sequence, so that different names are identified accurately.

Step 104: Make a decision over all the potential names according to the ngram model, and determine and output the final name recognition result that fits the context.

This step specifically includes: according to the context words of the context in which a name occurs and the part-of-speech features of those context words, performing probability mapping on the name directed graph to be decided through the ngram model to build a probability directed graph of all the potential names; comparing all paths in the probability directed graph based on a prediction algorithm; and outputting the shortest path as the final name recognition result that fits the context.

Here, the prediction algorithm is the Viterbi algorithm, a dynamic programming algorithm commonly applied to hidden Markov models to find the most probable hidden state sequence given the observations. The potential names are all the names, preliminarily identified by the name recognition models, that may fit the context.
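The shortest-path decision can be sketched as a dynamic program over a directed acyclic graph whose edge weights are negative log probabilities, so that the shortest path is the most probable one. This is a simplified illustration, not the patent's implementation: the function `best_path`, the edge tuple layout, and the toy probabilities are all assumptions, and each edge carries a fixed probability, whereas the ngram mapping described in the text would make an edge's weight depend on its neighbouring words.

```python
import math


def best_path(n_positions, edges):
    """edges: (start, end, label, prob) with start < end over positions 0..n_positions.
    Returns the label sequence with the lowest total -log(prob) and that total."""
    cost = [math.inf] * (n_positions + 1)
    back = [None] * (n_positions + 1)
    cost[0] = 0.0
    # Sorting by start position visits edges in topological order.
    for start, end, label, prob in sorted(edges):
        candidate = cost[start] - math.log(prob)
        if candidate < cost[end]:
            cost[end] = candidate
            back[end] = (start, label)
    if back[n_positions] is None and n_positions > 0:
        raise ValueError("no path reaches the final position")
    path, pos = [], n_positions
    while pos > 0:
        pos, label = back[pos]
        path.append(label)
    return path[::-1], cost[n_positions]


# Toy probabilities, invented purely for illustration.
toy_edges = [(0, 1, "陈", 0.1), (1, 2, "敏", 0.1), (2, 3, "章", 0.1),
             (0, 2, "陈敏", 0.2), (0, 3, "陈敏章", 0.5)]
path, total = best_path(3, toy_edges)
```

With these toy weights the three-character candidate wins over the two-character candidate followed by a stray single character.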

Embodiment Two:

The specific implementation of the name recognition method of the embodiments of the present invention is described in further detail below.

Fig. 2 shows the overall flow of the name recognition method of the embodiments of the present invention. As shown in Fig. 2, the method comprises the following steps:

Step 201: Obtain the input text and segment it;

Here, based on the words collected in the Chinese dictionary, a segmenter splits the continuous text string in the input into a word sequence and identifies the unregistered words that are not recorded in the dictionary. During segmentation, unregistered words appear as single characters, and runs of consecutive single characters are likely to be names. In other words, the embodiments of the present invention examine the runs of consecutive single characters in the segmentation result and judge whether each is a name. For example, the user inputs "卫生部长陈敏章应约与布鲁克内尔通电话" ("Health Minister Chen Minzhang spoke by telephone, by appointment, with Brooker Nellie"), which after segmentation becomes "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 与/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n". The runs of consecutive single characters in the segmentation result, "陈/n 敏/ag 章/q" and "与/p 布/n 鲁/j 克/q 内/f 尔/n 通/v", are then passed to the name recognition models to judge whether these unregistered words are names.

Step 202: In the segmentation result, perform preliminary name recognition on the runs of consecutive single characters using recognition models such as the dictionary-based name recognition model, the HMM-based Chinese name recognition model, and the probability-graph-based transliterated name recognition model;

These name recognition models are described in further detail below.

1) The dictionary-based name recognition model

Dictionary-based name recognition is rule-based name recognition. It is a good supplement for names that the name recognition models have difficulty identifying. For example, the transliterated name "Putin" is hard for the name recognition models to identify, but entering "Putin" into the dictionary supplements the recognition well.

In general, the names included in the dictionary are mainly those of celebrities plus common names. Dictionary-based name recognition, however, introduces a problem. For example, "Li Jing" is a name, but in the sentence "Classmate Li Jingjing works very hard", "Li Jing" is clearly not a name but only part of one. To eliminate this effect, the usual measure is: whenever a name identified by another name recognition model covers a name from the dictionary, the name identified by the dictionary is simply deleted.

2) The HMM-based Chinese name recognition model

Here, mainly following the method of Zhang Huaping of the Chinese Academy of Sciences, an HMM is used to tag the name-constituent roles in the segmentation result; then, on the basis of the tagged role sequence, longest pattern-string matching is performed to identify the names. An HMM is a kind of Markov chain whose states cannot be observed directly but can be inferred from a sequence of observation vectors. Each observation vector is emitted by a state according to some probability density distribution, and each observation vector is generated by a state sequence with the corresponding probability density distributions.

In general, for the purpose of name recognition, all the words in a sentence are divided into name-constituent roles such as internal components of a name, context, and unrelated words. The roles are defined according to the different parts that words play in forming a name, for example surname, given name, preceding context, and following context. For example, using the name-constituent role classification listed in Table 1, the segmentation result "卫生/部长/陈/敏/章/应约/与/布/鲁/克/内/尔/通/电话" is tagged with roles, giving "卫生/A 部长/K 陈/B 敏/C 章/D 应约/L 与/A 布/A 鲁/A 克/A 内/A 尔/A 通/A 电话/A". Table 1 is the classification of Chinese name-constituent roles.

Table 1

The HMM modeling process is outlined below. Let W be the token sequence after segmentation, that is, the segmentation result before unregistered-word recognition; let T be one possible role-tag sequence for W; and let $T^{\#}$ be the final tagging result. Then:

$$W = (w_1, w_2, \ldots, w_m),$$

$$T = (t_1, t_2, \ldots, t_m), \quad m > 0.$$

If each word $w_i$ is regarded as an observation and each role $t_i$ as a state, with $t_0$ the initial state, then W is the observation sequence and T is the state sequence hidden behind W, which forms a hidden Markov chain. A hidden Markov model can therefore be introduced, and $T^{\#}$ is computed as:

$$T^{\#} = \arg\max_{T} \prod_{i=1}^{m} p(w_i \mid t_i)\, p(t_i \mid t_{i-1}).$$

For ease of computation, the negative logarithm of the probability in the above formula for $T^{\#}$ is taken, which gives:

$$T^{\#} = \arg\min_{T} \left( -\sum_{i=1}^{m} \left[ \ln p(w_i \mid t_i) + \ln p(t_i \mid t_{i-1}) \right] \right).$$

Here, $p(w_i \mid t_i)$ and $p(t_i \mid t_{i-1})$ are computed from a revised corpus in which part-of-speech tags have been converted to roles and role statistics have been collected. $p(w_i \mid t_i)$ is the probability of token $w_i$ given role $t_i$, and $p(t_i \mid t_{i-1})$ is the transition probability from role $t_{i-1}$ to role $t_i$. By the law of large numbers:

$$p(w_i \mid t_i) \approx C(w_i, t_i) / C(t_i),$$

where $C(w_i, t_i)$ is the number of times $w_i$ occurs with role $t_i$ and $C(t_i)$ is the number of times role $t_i$ occurs. For $i > 1$:

$$p(t_i \mid t_{i-1}) \approx C(t_{i-1}, t_i) / C(t_{i-1}),$$

where $C(t_{i-1}, t_i)$ is the number of times role $t_{i-1}$ is immediately followed by role $t_i$. $C(w_i, t_i)$, $C(t_i)$ and $C(t_{i-1}, t_i)$ can all be extracted automatically by training on a segmented and tagged corpus.
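The count-based estimates above can be sketched as follows. The function `estimate_hmm` and the corpus layout (a list of sentences, each a list of `(word, role)` pairs) are assumptions for illustration; the second toy sentence is invented only to make the counts non-trivial.

```python
from collections import Counter


def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates of p(w|t) and p(t_i|t_{i-1})."""
    emit, trans, role_count = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = None
        for word, role in sentence:
            emit[(word, role)] += 1          # C(w_i, t_i)
            role_count[role] += 1            # C(t_i)
            if prev is not None:
                trans[(prev, role)] += 1     # C(t_{i-1}, t_i)
            prev = role
    p_emit = {key: n / role_count[key[1]] for key, n in emit.items()}
    p_trans = {key: n / role_count[key[0]] for key, n in trans.items()}
    return p_emit, p_trans


toy_corpus = [
    [("卫生", "A"), ("部长", "K"), ("陈", "B"), ("敏", "C"), ("章", "D"), ("应约", "L")],
    [("他", "A"), ("说", "A")],
]
p_emit, p_trans = estimate_hmm(toy_corpus)
```

A real corpus would also need a treatment of the initial state $t_0$ and of unseen events, neither of which this sketch covers.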

As can be seen from the above, the embodiments of the present invention convert the role-tagging problem into the problem of solving for $T^{\#}$, which can be solved with the Viterbi algorithm, the classic algorithm for this kind of problem. Names are finally identified by performing maximum pattern-string matching on the role tags in $T^{\#}$, outputting the matching segments as names, and recording their positions in the sentence. For example, after role tagging of the input "卫生/部长/陈/敏/章/应约/与/布/鲁/克/内/尔/通/电话", the corresponding $T^{\#}$ is "AKBCDLAAAAAAAA"; after maximum pattern-string matching, the preliminarily identified Chinese name is "陈敏" (Chen Min) or "陈敏章" (Chen Minzhang).
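A minimal sketch of Viterbi decoding over negative log probabilities is shown below. The function signature, the `p_start` table standing in for the transition out of the initial state, the `floor` value used to avoid taking the logarithm of zero, and the two-role toy model are all assumptions for illustration and do not come from the patent.

```python
import math


def viterbi(words, roles, p_start, p_trans, p_emit, floor=1e-12):
    """Return the role sequence minimising the summed -log probabilities."""
    def cost(p):
        return -math.log(max(p, floor))

    # Each layer maps role -> (best cost so far, previous role on that path).
    layers = [{r: (cost(p_start.get(r, 0.0)) + cost(p_emit.get((words[0], r), 0.0)), None)
               for r in roles}]
    for word in words[1:]:
        prev_layer, layer = layers[-1], {}
        for r in roles:
            q = min(roles, key=lambda s: prev_layer[s][0] + cost(p_trans.get((s, r), 0.0)))
            total = (prev_layer[q][0] + cost(p_trans.get((q, r), 0.0))
                     + cost(p_emit.get((word, r), 0.0)))
            layer[r] = (total, q)
        layers.append(layer)
    # Backtrack from the cheapest final role.
    role = min(roles, key=lambda r: layers[-1][r][0])
    path = [role]
    for layer in reversed(layers[1:]):
        role = layer[role][1]
        path.append(role)
    return path[::-1]


toy_roles = ["A", "B"]
toy_start = {"A": 0.9, "B": 0.1}
toy_trans = {("A", "A"): 0.1, ("A", "B"): 0.9, ("B", "A"): 0.5, ("B", "B"): 0.5}
toy_emit = {("x", "A"): 0.8, ("x", "B"): 0.2, ("y", "A"): 0.1, ("y", "B"): 0.9}
decoded = viterbi(["x", "y"], toy_roles, toy_start, toy_trans, toy_emit)
```

The pattern-string matching that follows decoding is not sketched here, because the role patterns depend on Table 1.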

3) The probability-graph-based transliterated name recognition model

The probability-graph-based transliterated name recognition model belongs to the family of models that express probabilistic dependencies graphically. Combining probability theory and graph theory, such models use a graph to represent the joint probability distribution of the variables involved. This name recognition model draws mainly on real-world news corpora and on the Chinese and English translated names in the "New English-Chinese Dictionary" and the "Dictionary of English Persons, Place Names and Event Names". First, a translated-name character base, a translated-name dictionary, and a translated-name syllable information base are established. Second, the confidence of translated-name characters is counted, and the ability of characters to form a translated name is evaluated with the translated-name probability estimation formula. Third, a rule base for translated-name recognition is constructed. Finally, contextual information and the syllable information in English names are fully used to identify English names automatically.

The translated-name probability estimation formula expresses the ability of translated-name characters to form a translated name. Suppose the translated-name string is $ymstr = c_1 c_2 \ldots c_{n-1} c_n$, where $c_1$ is the first character of the translated name, $c_2$ to $c_{n-1}$ are its middle characters, and $c_n$ is its last character. Let Ps, Pm and Pe denote the probability estimates for the first, middle and last characters respectively, and let Pymstring denote the probability estimate of the translated name. The translated-name probability estimation formula is then:

Pymstring = Ps × Pm × Pe,

where YmNumsInter(c), YmNummInter(c) and YmNumeInter(c) are the numbers of times character c occurs in real text as the first, middle and last character of a translated name respectively; TotalInText(c) is the number of times character c occurs in real text; TotalInBase(c) is the number of times character c occurs in translated names; and NumInBase(c) is the total number of translated-name characters in the translated-name character base. If the translated-name string $c_1 c_2 \ldots c_{n-1} c_n$ satisfies the condition ln(Pymstring) > θ, the string is regarded as a candidate translated name. Here θ is a threshold, chosen so that more than 99% of the translated names in the translated-name base are covered.

Therefore, with the probability-graph-based transliterated name recognition model described above, the transliterated name that can be preliminarily identified is "布鲁克" (Brooker), "布鲁克内尔" (Brooker Nellie), or "布鲁克内尔通" (Brooker Nellie plus the following character 通).

Step 203: Build the name directed graph from the preliminarily identified names;

Here, segmenting the sentence to be identified, "卫生部长陈敏章应约与布鲁克内尔通电话", yields "卫生/n 部长/n 陈/n 敏/ag 章/q 应约/v 与/p 布/n 鲁/j 克/q 内/f 尔/n 通/v 电话/n". Feeding this result to the three name recognition models above gives the preliminarily identified names: 陈敏 (Chen Min), 陈敏章 (Chen Minzhang), 布鲁克 (Brooker), 布鲁克内尔 (Brooker Nellie), and 布鲁克内尔通. Representing all of these names as a directed graph produces the name directed graph to be decided shown in Fig. 3, from which it can be judged which names better fit the context, so that the best names are identified.
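One plausible encoding of such a graph is sketched below: nodes are token boundaries, every token contributes an edge to the next boundary, and every candidate name contributes an edge that spans its tokens. The function `build_name_graph` and the representation of candidates as token-index spans are assumptions; Fig. 3 may lay the graph out differently.

```python
def build_name_graph(tokens, candidate_spans):
    """Return sorted edges (start, end, label) over token boundaries 0..len(tokens)."""
    edges = [(i, i + 1, tok) for i, tok in enumerate(tokens)]
    for start, end in candidate_spans:
        # A candidate name bridges several single-character tokens in one edge.
        edges.append((start, end, "".join(tokens[start:end])))
    return sorted(edges)


tokens = ["卫生", "部长", "陈", "敏", "章", "应约",
          "与", "布", "鲁", "克", "内", "尔", "通", "电话"]
# Token-index spans of the five preliminarily identified names.
candidate_spans = [(2, 4), (2, 5), (7, 10), (7, 12), (7, 13)]
graph = build_name_graph(tokens, candidate_spans)
```

Every path from boundary 0 to boundary 14 in this graph is one way of reading the sentence, with or without each candidate name.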

Step 204: Build an ngram model according to the context of names, and build the Markov name probability graph from the ngram model and the name directed graph;

This step builds an ngram model and, taking into account the context words of the context in which a name occurs and their part-of-speech features, performs probability mapping on the name directed graph to be decided to build the Markov name probability graph. The modeling process is described in detail below:

1) Language modeling

Let S denote a meaningful sentence composed of a sequence of words $w_1, w_2, w_3, \ldots, w_n$ in a particular order, whose corresponding part-of-speech sequence is $p_1, p_2, p_3, \ldots, p_n$, where n is the length of the sentence. $S_w = (w_1, w_2, w_3, \ldots, w_n)$ denotes the word sequence of the sentence used for language modeling, and $S_p = (p_1, p_2, p_3, \ldots, p_n)$ denotes its part-of-speech sequence, where $p_i$ is the part of speech of $w_i$. The probability of S is then computed as:

$$P(S) = \lambda \cdot P(S_w) + (1 - \lambda) \cdot P(S_p),$$

where λ is a floating-point number between 0 and 1, set here to 0.8, so that the context words are the primary influence and their parts of speech the secondary one. $P(S_w)$ and $P(S_p)$ are each estimated under the second-order Markov (2-gram) assumption and converted into the following formulas:

P(S_w) = P(w₁, w₂, w₃, ..., w_n)

= P(w₁)·P(w₂|w₁)·P(w₃|w₂)·...·P(w_n|w_{n-1}),

P(S_p) = P(p₁, p₂, p₃, ..., p_n)

= P(p₁)·P(p₂|p₁)·P(p₃|p₂)·...·P(p_n|p_{n-1}),

It should be noted that, to avoid floating-point underflow and to improve performance, multiplication is usually replaced by addition after taking logarithms. The calculation formulas are as follows:

log(P(S_w)) = log(P(w₁)·P(w₂|w₁)·P(w₃|w₂)·...·P(w_n|w_{n-1}))

= log(P(w₁)) + log(P(w₂|w₁)) + ... + log(P(w_n|w_{n-1})),

log(P(S_p)) = log(P(p₁)·P(p₂|p₁)·P(p₃|p₂)·...·P(p_n|p_{n-1}))

= log(P(p₁)) + log(P(p₂|p₁)) + ... + log(P(p_n|p_{n-1})),
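The interpolated score and the log-domain sums above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the dictionary-based probability tables are hypothetical, and every probability needed is assumed to be present and nonzero.

```python
import math

LAMBDA = 0.8  # weight of the word model, as stated in the text

def log_bigram_score(seq, unigram, bigram):
    """Return log P(x1) + sum over i of log P(x_i | x_{i-1})."""
    total = math.log(unigram[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        total += math.log(bigram[(prev, cur)])
    return total

def sentence_probability(words, tags, w_uni, w_bi, p_uni, p_bi, lam=LAMBDA):
    """P(S) = lam * P(S_w) + (1 - lam) * P(S_p), each term summed in log space."""
    log_pw = log_bigram_score(words, w_uni, w_bi)
    log_pp = log_bigram_score(tags, p_uni, p_bi)
    return lam * math.exp(log_pw) + (1 - lam) * math.exp(log_pp)
```

Summing logarithms keeps long products of small probabilities representable; the two components are exponentiated only once, at the point where they are interpolated.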

According to the law of large numbers, as long as the corpus is sufficiently large, the above probabilities can be estimated as follows:

where c(w_i) denotes the number of times w_i occurs in the training corpus, and c(T) denotes the total number of words in the training corpus.
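The estimation formulas themselves are not reproduced in this text. Assuming the usual maximum-likelihood estimates, which are consistent with the definitions of c(w_i) and c(T) given here, they would read as below. The bigram count c(w_{i-1} w_i) is a symbol introduced for this sketch, and the same form would apply to the part-of-speech sequence.

```latex
P(w_i) \approx \frac{c(w_i)}{c(T)},
\qquad
P(w_i \mid w_{i-1}) \approx \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}
```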

To suit this language modeling purpose, a secondary annotation can be applied in which every name in the corpus is uniformly replaced by NR. This masks the characteristics of the name itself, so that only the context words of the name and their parts of speech are considered. A sample of the annotated corpus is:

health/n minister/n Chen Minzhang/nr by-appointment/v with/p Brooker Nellie/nr tong/v telephone/n

The result after secondary annotation is:

health/n minister/n NR/nr by-appointment/v with/p NR/nr tong/v telephone/n
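The secondary annotation can be sketched as a simple substitution over tagged tokens. The function name `secondary_annotate` and the list-of-pairs representation are assumptions made for this illustration; the tag `nr` and the placeholder `NR` come from the example above.

```python
def secondary_annotate(tagged_tokens, name_tag="nr", placeholder="NR"):
    """Replace every token tagged as a name with a uniform placeholder."""
    return [
        (placeholder if tag == name_tag else word, tag)
        for word, tag in tagged_tokens
    ]

sample = [("minister", "n"), ("Chen Minzhang", "nr"), ("with", "p"),
          ("Brooker Nellie", "nr"), ("telephone", "n")]
annotated = secondary_annotate(sample)
```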

Here, considering that the context in which a name occurs usually contains characteristic words such as "minister" or "chairman", and that a 2-gram is enough to reflect the main features of the context, a 2-gram model is used to take both the context-word features and the part-of-speech features into account.

2) Data smoothing

Large-scale statistical methods combined with a limited training corpus inevitably lead to data sparsity, which causes the zero-probability problem. Data smoothing is therefore necessary when analyzing the name digraph. In addition, because the set of Chinese parts of speech is very limited, analysis shows that sparsity is not serious for parts of speech. Investigation also shows that the context words of names are highly characteristic and can be roughly divided into kinship terms, forms of address, positions, verbs, prepositions, conjunctions, adverbs, and so on. If a sparsity problem does arise for a candidate, the candidate is generally not a valid name recognition result.

In the embodiment of the present invention, data smoothing is performed by combining the Katz and Good-Turing smoothing algorithms, which addresses the unavoidable data sparsity problem. The corresponding calculation formula is as follows:

where r denotes the number of times the word occurs; k denotes a threshold, here set to 5; d_r is the Good-Turing (GT) discount coefficient; and α(w_{i-1}) is the normalization factor.

From the above formula it can be seen that: when r is greater than k, P(w_i|w_{i-1}) is used directly; when r is not greater than k but is not 0, Good-Turing smoothing is used to discount the probability, and the remaining probability mass is given to entries that do not occur in the corpus; when r is 0, the probability is replaced by backing off to the lower-order probability.
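The smoothing formula itself is not reproduced in this text. The textbook form of Katz back-off with Good-Turing discounting, which matches the three cases described above, is usually written as below. It is given as an assumption about what the missing formula expresses, not as the patent's own notation: here r = c(w_{i-1} w_i) is the bigram count and n_r is the number of distinct bigrams that occur exactly r times, both symbols introduced for this sketch.

```latex
P_{\mathrm{Katz}}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{c(w_{i-1}\, w_i)}{c(w_{i-1})}, & r > k \\[2ex]
d_r \, \dfrac{c(w_{i-1}\, w_i)}{c(w_{i-1})}, & 0 < r \le k \\[2ex]
\alpha(w_{i-1}) \, P(w_i), & r = 0
\end{cases}
\qquad
d_r = \frac{\dfrac{r^{*}}{r} - \dfrac{(k+1)\, n_{k+1}}{n_1}}
           {1 - \dfrac{(k+1)\, n_{k+1}}{n_1}},
\quad
r^{*} = (r+1)\, \frac{n_{r+1}}{n_r}
```

In this form α(w_{i-1}) is chosen so that the probabilities conditioned on w_{i-1} sum to 1.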

3) Prediction algorithm

From the above probability formulas it follows that:

log(P(w₁w₂w₃...w_n)) = log(P(w₁)) + log(P(w₂|w₁)) + ... + log(P(w_n|w_{n-1}))

= log(P(w₁w₂w₃...w_{n-1})) + log(P(w_n|w_{n-1})),

This is a typical dynamic programming recurrence, in which the probability value P(w_i|w_{i-1}) is mapped onto the edge between w_{i-1} and w_i. In this way, the problem of finding the optimal probability value is converted into a shortest-path problem on the probability digraph, solved with the Viterbi algorithm. That is, P(w_i|w_{i-1}) is mapped onto the edge between w_{i-1} and w_i in the name digraph, which converts the name digraph into the Markov name probability graph shown in Fig. 4.

4) Summary

This modeling mainly makes a decision among the multiple recognized names, judging which name better fits the context. In general, whether a name fits its context can be judged from the words and parts of speech of the context; based on this property, the ngram model is used to decide the quality of the finally recognized names.

By comparison: the embodiment of the present invention replaces the approach of discriminating names within a single model by means such as thresholds with an approach in which the names recognized by multiple models are compared and judged against their context, after which the optimal recognition result is decided. The multi-model comparison approach has a clear advantage.

It should be particularly noted that the corpus for this modeling is the same as the one used by the segmenter, which ensures that the words produced by the segmenter occur in the language model and eliminates the possible sparsity of non-name words. The specific corpus is the People's Daily corpus of January 1998, whose annotation includes word segmentation results and parts of speech.

Step 205: Determine the final name recognition result using the Viterbi algorithm.

Here, the Viterbi algorithm is used to find the optimal path in the Markov name probability graph; the names that appear on the shortest path are the finally recognized names. The name recognition effect of this method is better than that of existing general models, which decide names by thresholds. Experiments show that the name recognition effect of this method is two percentage points higher than that of using the HMM model or the probability graph model alone.
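The shortest-path decision can be sketched as a Viterbi-style dynamic program over a graph whose edges run from earlier to later token boundaries. This is an illustration under assumptions, not the patented procedure: the function name, the graph layout `{start: [(end, label), ...]}`, and the choice of edge weight `cost(prev_label, label)` (for instance the negative logarithm of P(label | prev_label), so that the shortest path is the most probable one) are all hypothetical, and the last node is assumed to be reachable.

```python
def viterbi_shortest_path(edges, last_node, cost, start_label="<s>"):
    """Return the edge labels on the cheapest path from node 0 to last_node.

    Every edge must satisfy start < end so nodes can be processed in order.
    """
    # best[(node, label)] = (total cost, previous state) for paths that
    # arrive at `node` via an edge carrying `label`.
    best = {(0, start_label): (0.0, None)}
    for node in range(last_node):
        for state in [s for s in best if s[0] == node]:
            total = best[state][0]
            for end, label in edges.get(node, []):
                candidate = total + cost(state[1], label)
                key = (end, label)
                if key not in best or candidate < best[key][0]:
                    best[key] = (candidate, state)
    # Pick the cheapest state at the final node and follow back-pointers.
    state = min((s for s in best if s[0] == last_node),
                key=lambda s: best[s][0])
    path = []
    while best[state][1] is not None:
        path.append(state[1])
        state = best[state][1]
    return path[::-1]
```

Keeping the label of the incoming edge in the state is what lets a 2-gram cost depend on the previous word; any candidate name whose edge lies on the returned path is kept as a recognized name.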

The embodiment of the present invention first segments the input text with a segmenter; during word segmentation, unregistered words not recorded in the dictionary appear as single characters, and runs of consecutive single characters are likely to be names. Then, multiple statistical models are used to recognize Chinese names and transliterated names, which overcomes the inability of a single model to cover all constituent features of names and aims to recognize all potential names. Next, an ngram model is built from the context of names (the context words and their parts of speech). Finally, the probability digraph of all potential names is built with the ngram model, and the final name recognition result is decided with the Viterbi algorithm; that is, the name decision problem is converted into the problem of choosing the best of the recognized names by comparing them on contextual features.

Embodiment three：

To implement the above method, an embodiment of the present invention further provides a name identification device. As shown in Fig. 5, the device includes a word segmentation module 501, a name identification module 502, a model construction module 503, and a name decision-making module 504, wherein:

The word segmentation module 501 is configured to obtain the input text sequence and segment the text sequence into words, wherein the text sequence includes at least one name;

The name identification module 502 is configured to perform name recognition on the segmented text sequence based on at least two statistical models according to the constituent features of names, obtaining all potential names;

The model construction module 503 is configured to build an ngram model according to the context of names;

The name decision-making module 504 is configured to make a decision on all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.

Here, the word segmentation module 501 is further configured to present unregistered words, that is, words not recorded in the dictionary, as single characters during word segmentation.

Here, the names include Chinese names and transliterated names; the statistical models may be at least two of name recognition models such as a dictionary-based name recognition model, an HMM-based Chinese name recognition model, and a probability-graph-based transliterated name recognition model.

The model construction module 503 is further configured to build, after the name identification module 502 obtains all potential names, the name digraph awaiting decision from all the potential names.

The context of a name includes the context words of the name and the parts of speech of those context words; the corpus of the ngram model is the same as the corpus of the segmenter that segments the text sequence.

The name decision-making module 504, is specifically used for：

In practical applications, the word segmentation module 501, name identification module 502, model construction module 503, and name decision-making module 504 may each be implemented by a central processing unit (CPU, Central Processing Unit), a microprocessor (MPU, Micro Processor Unit), a digital signal processor (DSP, Digital Signal Processor), a field programmable gate array (FPGA, Field Programmable Gate Array), or the like located in a natural language processing terminal.

The embodiment of the present invention obtains an input text sequence and segments the text sequence into words, wherein the text sequence includes at least one name; performs name recognition on the segmented text sequence based on at least two statistical models according to the constituent features of names, obtaining all potential names; builds an ngram model according to the context of names; and makes a decision on all the potential names according to the ngram model, determining and outputting the final name recognition result that fits the context. In this way, different statistical models are used according to the differing characteristics of different kinds of names, so that the strengths of multiple statistical models are fully exploited to recognize all potential names and raise the name recognition rate. The probability digraph of all potential names is built with the ngram model, which gives the recognition results for Chinese names and transliterated names a unified basis for decision and allows the optimal result to be selected. This not only makes name recognition more reliable but also improves the overall recognition effect, so that different kinds of names can be recognized quickly and accurately.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A name recognition method, characterized in that the method comprises:

obtaining an input text sequence and segmenting the text sequence into words, wherein the text sequence includes at least one name;

performing name recognition on the segmented text sequence based on at least two statistical models according to the constituent features of names, to obtain all potential names;

building an ngram model according to the context of names;

making a decision on all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context.

2. The method according to claim 1, characterized in that the method further comprises: presenting unregistered words, which are not recorded in the dictionary, as single characters during word segmentation.

3. The method according to claim 1, characterized in that, after all potential names are obtained, the method further comprises: building a name digraph awaiting decision from all the potential names;

4. The method according to claim 3, characterized in that making a decision on all the potential names according to the ngram model, and determining and outputting the final name recognition result that fits the context, comprises:

performing, by the ngram model, a probability mapping on the name digraph awaiting decision according to the features of the context words around the name and of their parts of speech, to build the probability digraph of all the potential names; comparing all paths in the probability digraph based on the prediction algorithm; and taking the shortest path as the final name recognition result that fits the context and outputting it.

5. The method according to claim 1, characterized in that the corpus of the ngram model is the same as the corpus configured in the segmenter that segments the text sequence.

6. A name identification device, characterized in that the device comprises: a word segmentation module, a name identification module, a model construction module, and a name decision-making module; wherein,

the word segmentation module is configured to obtain an input text sequence and segment the text sequence into words, wherein the text sequence includes at least one name;

the name identification module is configured to perform name recognition on the segmented text sequence based on at least two statistical models according to the constituent features of names, to obtain all potential names;

the name decision-making module is configured to make a decision on all the potential names according to the ngram model, and to determine and output the final name recognition result that fits the context.

7. The device according to claim 6, characterized in that the word segmentation module is further configured to present unregistered words, which are not recorded in the dictionary, as single characters during word segmentation.

8. The device according to claim 6, characterized in that the model construction module is further configured to build, after the name identification module obtains all potential names, a name digraph awaiting decision from all the potential names;

9. device according to claim 8, which is characterized in that the name decision-making module is specifically used for：

10. The device according to claim 6, characterized in that the corpus of the ngram model is the same as the corpus configured in the segmenter that segments the text sequence.