CN102033879A

CN102033879A - Method and device for identifying Chinese name

Info

Publication number: CN102033879A
Application number: CN200910177127XA
Authority: CN
Inventors: 罗长升; 方高林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date: 2009-09-27
Filing date: 2009-09-27
Publication date: 2011-04-27
Anticipated expiration: 2029-09-27
Also published as: CN102033879B

Abstract

The invention discloses a method and a device for identifying a Chinese name. The method comprises the following steps of: acquiring an input sequence and performing word segmentation on the input sequence; performing role labeling on the input sequence subjected to word segmentation and acquiring a role labeling sequence; detecting name identifying roles in the role labeling sequence and correcting wrong name identifying roles; and matching the role labeling sequence according to a name identifying mode and outputting a formed name. By the method and the device, the Chinese name and a transliterated name can be accurately identified.

Description

A method and apparatus for Chinese name identification

Technical field

The present invention relates to the field of Internet technology, and in particular to a method and apparatus for Chinese name identification.

Background technology

Chinese information processing refers to using computers to process the sound, form, and meaning of Chinese text, and is a branch of natural language information processing. It mainly studies how to process Chinese information automatically by computer. Compared with Western languages such as English, Chinese lacks explicit word delimiters and is more flexible in grammar, semantics, and pragmatics, which makes it harder for computers to process and understand. Lexical analysis is the prerequisite and foundation of Chinese natural language processing. Research on Chinese lexical analysis has made considerable progress, but when the text contains unregistered (out-of-vocabulary) words, the results generally fail to meet practical needs.

Specifically, misidentifying an unregistered word not only means that the word itself is not recognized correctly. Because unregistered words often overlap with the neighboring words, the error also seriously affects the recognition of those words. This directly lowers the accuracy of lexical analysis and can even affect the accuracy of the analysis of the whole sentence. Automatic identification of unregistered words has therefore become a bottleneck for the quality of Chinese lexical analysis.

Further, named entities account for a large proportion of unregistered words and are the main difficulty in identifying them. A named entity is an entity in the text that has a specific meaning and denotes an abstract or concrete thing in the real world. Named entities mainly include person names, place names, organization names, dates, times, monetary values, and percentages. In terms of recognition performance, dates, times, monetary values, and percentages are relatively simple to identify, and the corresponding rules and training statistics are relatively easy to collect.

However, named entities such as person names, place names, and organization names are open-ended and extensible, and their composition is highly arbitrary, so their identification suffers from many false positives and misses. Named entity identification is also important for understanding text correctly and underlies technologies such as information extraction, automatic question answering, and machine translation. The identification of person names, place names, and organization names is therefore the focus of current research on named entity recognition. Among these, person names, including Chinese personal names and transliterated names, account for a very large proportion of named entities, so automatic name identification is a focus of unregistered word identification. Solving the name identification problem will improve the final quality of Chinese lexical analysis, syntactic analysis, and Chinese information processing as a whole.

In the prior art, Chinese personal names are usually identified automatically by a method based on role labeling. This method uses role information extracted automatically from a corpus, applies the Viterbi algorithm (a decoding algorithm for convolutional codes) to assign roles to the word segmentation result, and then performs maximum pattern matching on the role sequence to identify Chinese personal names.

Specifically, the role labeling method assumes that every token in a sentence implicitly carries role information, where a role describes the part the token plays in the sentence or in a named entity. Role labeling assigns the corresponding role to each token in the token sequence obtained from segmentation. Roles fall into three main classes: roles for the internal components of a name, roles for tokens in which part of a name forms a word with its context, and roles unrelated to names. One such role table is shown in Table 1:

Table 1

According to the role table in Table 1, when the segmentation result is 馆/内/陈列/周/恩/来/和/邓/颖/超生/前/使用/过/的/物品 (roughly, "The hall displays items that Zhou Enlai and Deng Yingchao used during their lifetimes"), assigning a role to each token of the segmented sequence gives the role labeling result "馆/A 内/A 陈列/A 周/B 恩/C 来/D 和/A 邓/B 颖/C 超生/V 前/A 使用/A 过/A 的/A 物品/A".

Further, the role labeling method uses the Viterbi algorithm to label roles automatically: from all possible label sequences, the one with the highest probability is selected as the final labeling result. The theory and derivation are as follows:

Suppose W is the token sequence after word segmentation (that is, the segmentation result before unregistered word identification), W = (w1, w2, ..., wm); T is a possible role label sequence of W, T = (t1, t2, ..., tm), m > 0; and T^{\#} is the final labeling result, that is, the role sequence with the highest probability. Applying Bayes' formula and introducing a hidden Markov model (HMM) gives

T^{\#} = \arg\max_{T} P(T \mid W) = \arg\min_{T} \left\{ -\sum_{i=1}^{m} \left[ \ln p(w_{i} \mid t_{i}) + \ln p(t_{i} \mid t_{i-1}) \right] \right\}

(formula 1)

Here, wi is an observed value and role ti is a state value; W is the observation sequence and T is the state sequence hidden behind W. p(wi|ti) is the probability that a token with role ti is wi, and p(ti|ti-1) is the transition probability from role ti-1 to role ti.
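The step from Bayes' formula to formula 1 can be spelled out as follows. This is a sketch of the standard first-order HMM argument, not a derivation quoted from the source. It assumes that each token depends only on its own role, that each role depends only on the previous role, that tokens are indexed from 1 as in the definition of W, and that t_0 is a sentence-start role introduced here only for illustration.

```latex
\begin{aligned}
T^{\#} &= \arg\max_{T} P(T \mid W)
        = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
        = \arg\max_{T} P(W \mid T)\,P(T) \\
       &\approx \arg\max_{T} \prod_{i=1}^{m} p(w_{i} \mid t_{i})\, p(t_{i} \mid t_{i-1}) \\
       &= \arg\min_{T} \left\{ -\sum_{i=1}^{m} \left[ \ln p(w_{i} \mid t_{i}) + \ln p(t_{i} \mid t_{i-1}) \right] \right\}
\end{aligned}
```

P(W) can be dropped because it does not depend on T, and taking the negative logarithm turns the product to be maximized into a sum to be minimized.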

Let C(wi, ti) be the number of times wi occurs with role ti;

C(ti-1, ti) be the number of times role ti-1 is immediately followed by role ti;

C(ti) be the number of times role ti occurs.

Given training on a large-scale corpus:

p(w_{i} \mid t_{i}) \approx C(w_{i}, t_{i}) / C(t_{i}) (formula 2)

p(t_{i} \mid t_{i-1}) \approx C(t_{i-1}, t_{i}) / C(t_{i-1}) (formula 3)

In the role labeling method, the automatic role labeling problem is thus converted into the problem of minimizing the expression in formula 1. The Viterbi algorithm solves this problem directly and is very mature, so it is not described here. Automatic role labeling can therefore be carried out with formula 1, formula 2, and formula 3.
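As an illustration of how formulas 1 to 3 fit together, the following is a minimal Python sketch, not the implementation described in the source. The counts of formulas 2 and 3 are collected from a role-labeled corpus, and a Viterbi search minimizes the sum in formula 1. The names `count_roles`, `cost`, and `viterbi`, the `<s>` start pseudo-role, and the probability floor for unseen events are all assumptions made for this example.

```python
import math
from collections import Counter

START = "<s>"  # assumed pseudo-role marking the start of a sentence


def count_roles(labeled_sentences):
    """Collect C(w, t), C(t_prev, t) and C(t) from role-labeled sentences."""
    emit, trans, role = Counter(), Counter(), Counter()
    for sentence in labeled_sentences:
        prev = START
        role[START] += 1
        for word, t in sentence:
            emit[word, t] += 1
            trans[prev, t] += 1
            role[t] += 1
            prev = t
    return emit, trans, role


def cost(count, total, floor=1e-6):
    """Negative log of count/total; the floor for unseen events is an assumption."""
    p = count / total if total else 0.0
    return -math.log(max(p, floor))


def viterbi(words, emit, trans, role):
    """Return the role sequence that minimizes the sum in formula 1."""
    if not words:
        return []
    roles = [t for t in role if t != START]
    best = {START: 0.0}  # best[t]: minimal cost of any path ending in role t
    back = []            # back[i][t]: predecessor of role t at position i
    for w in words:
        new_best, pointers = {}, {}
        for t in roles:
            # Cost of reaching t from each previous role s (formula 3 term).
            step = {s: c + cost(trans[s, t], role[s]) for s, c in best.items()}
            s_min = min(step, key=step.get)
            # Add the emission cost of w under role t (formula 2 term).
            new_best[t] = step[s_min] + cost(emit[w, t], role[t])
            pointers[t] = s_min
        best = new_best
        back.append(pointers)
    # Trace the cheapest path backwards from the last position.
    path = [min(best, key=best.get)]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    return path[::-1]
```

Working in negative log space keeps the search a pure minimization, matching the form of formula 1, and avoids underflow from multiplying many small probabilities.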

In the process of implementing the present invention, the inventors found that the prior art has at least the following problems:

(1) The existing role labeling method depends on a set of context roles. For example, when the input string is 刘炽平百度百科 ("Liu Chiping Baidu Baike"), the coarse segmentation result is 刘/炽/平/百度/百科. If the token 百度 (Baidu) does not carry the role of following context of a name, 刘炽平 (Liu Chiping) cannot be identified correctly. Moreover, the set of context roles for names is not closed but open and extensible, so a sufficient set of context roles is very hard to obtain, and names consequently fail to be identified correctly.

(2) The probabilities trained by the existing role labeling method depend on the training corpus. Because the training corpus is a closed set, the trained probabilities may be inaccurate, so names cannot be identified correctly.

(3) The existing role labeling method provides insufficient support for transliterated names, especially names transliterated from English.

(4) The existing role labeling method lacks a mechanism for rejecting misidentified names. When an error occurs during name identification, it cannot be reliably excluded, so the accuracy of name identification needs improvement.

Summary of the invention

The embodiments of the present invention provide a method and apparatus for Chinese name identification, so as to identify Chinese names accurately.

To achieve the above object, an embodiment of the present invention provides a method for Chinese name identification, comprising:

Obtaining an input sequence, and performing word segmentation on the input sequence;

Performing role labeling on the segmented input sequence, and obtaining a role labeling sequence;

Matching the role labeling sequence against name recognition patterns, and outputting the names so formed.

After the role labeling sequence is obtained, the method further comprises:

Detecting the name identification roles in the role labeling sequence, and correcting any name identification role that is wrong.

After the role labeling sequence is obtained, the method further comprises:

Splitting role U and role V in the role labeling sequence.

Splitting role U in the role labeling sequence comprises:

When the role following role U is C, E, G, or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2, or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A;

Splitting role V in the role labeling sequence comprises:

When the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.

Matching the role labeling sequence against the name recognition patterns and outputting the names so formed comprises:

Performing maximum pattern matching, according to the name recognition patterns, on the role labeling sequence after role U and role V have been split, and outputting the names so formed.

According to the name recognition mode character labeling sequence after division is handled through role U and role V is carried out the pattern maximum match, and the name that output is formed comprises:

In through the character labeling sequence after role U and the role V division processing, there is name recognition mode BCD, BE, BG, BZ, FB, Y, when XD or FE, the result of pattern maximum match is a Chinese personal name for corresponding content;

In through the character labeling sequence after role U and the role V division processing, there is name recognition mode HE2, [H|X2] [I|X2]+[T|X2], X2T, when X2 or Y, the result of pattern maximum match is the transliteration name for corresponding content.

Before the input sequence is obtained, the method further comprises: performing model training;

Performing role labeling on the segmented input sequence comprises: performing role labeling on the segmented input sequence according to the result of the model training.

Performing model training comprises:

Obtaining an input corpus, and removing the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags from the input corpus to obtain a plain-text corpus, and segmenting the plain-text corpus with a word segmentation system to obtain a segmented corpus;

Labeling the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no context role information;

Obtaining a role transition corpus and a role emission corpus from the role-labeled corpus;

Performing model training according to the role transition corpus and the role emission corpus.

An apparatus for Chinese name identification, comprising:

An acquisition module, configured to obtain an input sequence and perform word segmentation on the input sequence;

A role labeling module, configured to perform role labeling on the input sequence segmented by the acquisition module and obtain a role labeling sequence;

A pattern matching module, configured to match the role labeling sequence obtained by the role labeling module against name recognition patterns and output the names so formed.

The apparatus further comprises:

A role correction module, configured to detect the name identification roles in the role labeling sequence obtained by the role labeling module and correct any name identification role that is wrong.

The apparatus further comprises:

A role splitting module, configured to split role U and role V in the role labeling sequence obtained by the role labeling module.

The role splitting module is specifically configured to: when the role following role U is C, E, G, or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2, or E2, split the content corresponding to role U into role A and role H; when the role following role U is any other role, split the content corresponding to role U into role A and role A;

The pattern matching module is specifically configured to perform maximum pattern matching, according to the name recognition patterns, on the role labeling sequence after role U and role V have been split, and output the names so formed.

The apparatus further comprises:

A training module, configured to perform model training before the input sequence is obtained;

The role labeling module is further configured to perform role labeling on the segmented input sequence according to the result of the model training performed by the training module.

The training module is specifically configured to: obtain an input corpus, and remove the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; remove all part-of-speech tags from the input corpus to obtain a plain-text corpus, and segment the plain-text corpus with a word segmentation system to obtain a segmented corpus; label the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no context role information; obtain a role transition corpus and a role emission corpus from the role-labeled corpus; and perform model training according to the role transition corpus and the role emission corpus.

Compared with the prior art, the present invention has the following advantages: by performing role labeling on the segmented input sequence, Chinese personal names and transliterated names can be identified accurately; and the role set proposed by the present invention does not depend on a set of context roles, which further improves the accuracy of Chinese personal name identification.

Description of drawings

To describe the technical solutions of the present invention or of the prior art more clearly, the accompanying drawings needed for the description are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a Chinese name identification method proposed in an embodiment of the present invention;

Fig. 2 is a flowchart of another Chinese name identification method proposed in an embodiment of the present invention;

Fig. 3 is a flowchart of the training process based on the role labeling model in an embodiment of the present invention;

Fig. 4 is a flowchart of identifying Chinese names according to the result of model training in an embodiment of the present invention;

Fig. 5 is a schematic structural diagram of a Chinese name identification apparatus in an embodiment of the present invention;

Fig. 6 is a schematic structural diagram of another Chinese name identification apparatus in an embodiment of the present invention.

Embodiment

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

An embodiment of the present invention provides a method for Chinese name identification which, as shown in Fig. 1, comprises the following steps:

Step 101: obtain an input sequence, and perform word segmentation on the input sequence.

Step 102: perform role labeling on the segmented input sequence, and obtain a role labeling sequence.

After the role labeling sequence is obtained, the method further comprises: detecting the name identification roles in the role labeling sequence, and correcting any name identification role that is wrong.

After the role labeling sequence is obtained, the method further comprises: splitting role U and role V in the role labeling sequence. It should be noted that the operation of splitting role U and role V and the operation of detecting and correcting wrong name identification roles may be performed in either order.

Specifically, splitting role U in the role labeling sequence comprises: when the role following role U is C, E, G, or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2, or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A;

Splitting role V in the role labeling sequence comprises: when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.
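The splitting rules for role U and role V can be sketched as two lookup tables keyed on the neighboring role. This is an illustrative sketch only. The source does not say where the token text is cut, so the code assumes that the last character of a U token and the first character of a V token belong to the name, and that neighbors are read from the sequence before any splitting. The function name `split_u_v` is hypothetical.

```python
# Role given to the name part of a U token, keyed on the following role.
U_NEXT = {"C": "B", "E": "B", "G": "B", "Z": "B",
          "D": "C",
          "I": "H", "X2": "H", "E2": "H"}

# Role given to the name part of a V token, keyed on the preceding role.
V_PREV = {"C": "D", "X": "D",
          "B": "E",
          "I": "T", "X2": "T",
          "H": "E2"}


def split_u_v(labeled):
    """Split every U and V token of a [(text, role), ...] sequence in two."""
    out = []
    for i, (text, role) in enumerate(labeled):
        if role == "U":
            nxt = labeled[i + 1][1] if i + 1 < len(labeled) else None
            # Assumed cut point: the last character belongs to the name.
            out += [(text[:-1], "A"), (text[-1:], U_NEXT.get(nxt, "A"))]
        elif role == "V":
            prev = labeled[i - 1][1] if i > 0 else None
            # Assumed cut point: the first character belongs to the name.
            out += [(text[:1], V_PREV.get(prev, "A")), (text[1:], "A")]
        else:
            out.append((text, role))
    return out
```

Any neighbor not listed in the tables falls through to the default of role A and role A, which mirrors the "any other role" case of the rules.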

Step 103: match the role labeling sequence against the name recognition patterns, and output the names so formed. This comprises: performing maximum pattern matching, according to the name recognition patterns, on the role labeling sequence after role U and role V have been split, and outputting the names so formed.

Specifically, when the role labeling sequence after the splitting of role U and role V contains the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD, or FE, the result of maximum pattern matching is that the corresponding content is a Chinese personal name; when it contains the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2, or Y, the result of maximum pattern matching is that the corresponding content is a transliterated name.
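One way to picture maximum pattern matching is to write the role sequence as a string and try every name recognition pattern at each position, keeping the longest match. The sketch below is an assumption-laden illustration, not the matcher described in the source. It maps the two-character role codes X2 and E2 to the single letters `x` and `e` so that one character stands for one token, it assumes every other role code is a single letter, and it resolves the pattern Y, which appears in both lists, in favor of the Chinese list. The name `match_names` is hypothetical.

```python
import re

ALIAS = {"X2": "x", "E2": "e"}  # assumed one-character stand-ins for X2 and E2

PATTERNS = [
    ("chinese", ["BCD", "BE", "BG", "BZ", "FB", "Y", "XD", "FE"]),
    ("transliterated", ["He", "[Hx][Ix]+[Tx]", "xT", "x", "Y"]),
]
COMPILED = [(kind, [re.compile(p) for p in pats]) for kind, pats in PATTERNS]


def match_names(labeled):
    """Return [(name_text, kind), ...] found in a [(text, role), ...] sequence."""
    # One character per token, so string offsets equal token offsets.
    codes = "".join(ALIAS.get(role, role) for _, role in labeled)
    names, pos = [], 0
    while pos < len(codes):
        best_kind, best_end = None, pos
        for kind, regexes in COMPILED:
            for rx in regexes:
                m = rx.match(codes, pos)
                # Keep the longest match; ties go to the earlier pattern list.
                if m and m.end() > best_end:
                    best_kind, best_end = kind, m.end()
        if best_kind is None:
            pos += 1
            continue
        text = "".join(token for token, _ in labeled[pos:best_end])
        names.append((text, best_kind))
        pos = best_end
    return names
```

Scanning left to right and consuming the longest match at each position is one common reading of "maximum matching"; other tie-breaking or overlap policies are equally compatible with the text.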

In this embodiment of the present invention, before the input sequence is obtained, the method further comprises: performing model training. In this case, performing role labeling on the segmented input sequence comprises: performing role labeling on the segmented input sequence according to the result of the model training.

Specifically, performing model training comprises: obtaining an input corpus, and removing the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags from the input corpus to obtain a plain-text corpus, and segmenting the plain-text corpus with a word segmentation system to obtain a segmented corpus; labeling the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no context role information; obtaining a role transition corpus and a role emission corpus from the role-labeled corpus; and performing model training according to the role transition corpus and the role emission corpus.

Thus, in the method provided by this embodiment of the present invention, performing role labeling on the segmented input sequence allows Chinese personal names and transliterated names to be identified accurately; and the role set proposed by the present invention does not depend on a set of context roles, which further improves the accuracy of Chinese personal name identification.

Embodiment two of the present invention provides a method for Chinese name identification which, as shown in Fig. 2, comprises the following steps:

Step 201: perform model training. During Chinese name identification, model training is needed to ensure the accuracy of role labeling, so that every word receives an accurate role labeling result.

Specifically, in this embodiment model training may be performed with a class-based language model and with a role labeling model. In the class-based language model, the three kinds of named entities, namely person names, place names, and organization names, are defined as three classes: PN (Person Name), LN (Location Name), and ON (Organization Name). Before training, person names, place names, and organization names are replaced with PN, LN, and ON respectively; after that, training proceeds as for an ordinary language model. The class-based language model comprises two submodels, a context model P(C) and a named entity class model P(S|C). Because the class-based language model is a mature existing technique, it is not described in detail here. Model training with the role labeling model is performed using SegTag. In practical applications, other language models may also be used for model training, which is not described further in this embodiment.

Specifically, as shown in Fig. 3, the training process based on the role labeling model in this embodiment comprises the following steps:

Step 301: preprocess the input corpus. The input corpus comes from a segmented and annotated corpus (for example, the People's Daily corpus of 2000). Preprocessing the input corpus means removing the nested annotation structures in it to obtain a non-nested annotated corpus. For example, the original input corpus is: 以/p 毛/nrf 泽东/nrg 同志/n 为/vl 代表/n 的/ud [中国/ns 共产党/n]nt ("the Communist Party of China represented by Comrade Mao Zedong"). After the nested annotation structures are removed, the processed corpus (the non-nested annotated corpus) is: 以/p 毛/nrf 泽东/nrg 同志/n 为/v 代表/n 的/u 中国/ns 共产党/n. The input corpus contains nested annotations such as vl (a finer subclass of verb), ud (a finer subclass of auxiliary word), and nt (an organization name tag that nests 中国 and 共产党 together). After removal, vl becomes v (verb), ud becomes u (auxiliary word), and the nesting around 中国共产党 is removed.

In addition, all part-of-speech tags in the input corpus are removed to obtain the plain-text corpus; for the example above this is: 以毛泽东同志为代表的中国共产党. The plain-text corpus is then segmented with a word segmentation system to obtain the segmented corpus. This segmentation step performs word segmentation only, without any named entity recognition or part-of-speech tagging.
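The two preprocessing operations of step 301 can be sketched with regular expressions. This is a minimal illustration under stated assumptions: the corpus line uses the `word/tag` and `[ ... ]tag` conventions seen in the example, and the `COARSE` mapping covers only the two fine-grained tags named in the text (vl and ud). The helper names `flatten` and `strip_tags` are hypothetical.

```python
import re

# Assumed mapping from fine-grained tags to coarse tags; only the two
# cases mentioned in the example are covered.
COARSE = {"vl": "v", "ud": "u"}


def flatten(line):
    """Remove nested annotation: drop [ ... ]tag brackets and coarsen tags."""
    # "[中国/ns 共产党/n]nt" becomes "中国/ns 共产党/n".
    line = re.sub(r"\[([^\]]*)\]\w+", r"\1", line)

    def coarsen(m):
        word, tag = m.group(1), m.group(2)
        return word + "/" + COARSE.get(tag, tag)

    return re.sub(r"(\S+)/(\w+)", coarsen, line)


def strip_tags(line):
    """Remove all part-of-speech tags and spaces, leaving the plain text."""
    return "".join(tok.rsplit("/", 1)[0] for tok in flatten(line).split())
```

Applied to the example line, `flatten` gives the non-nested annotated corpus and `strip_tags` gives the plain text that is then passed to the word segmentation system.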

Thus the preprocessing in this step yields two intermediate corpora: the non-nested annotated corpus and the segmented corpus.

Step 302: obtain the role-labeled corpus, that is, label each token with a role according to the meaning of each role. Once the segmented corpus has been obtained, it can be labeled according to the role table to obtain the role-labeled corpus. For the example above, the segmented corpus is 以/毛/泽东/同志/为/代表/的/中国/共产党/, and the role-labeled corpus obtained directly from it is: 以/A 毛/B 泽东/Z 同志/A 为/A 代表/A 的/A 中国/A 共产党/A.

It should be noted that, to solve the prior-art problem of depending too heavily on the set of context roles, the role table used in the embodiments of the present invention contains no context role information. For example, removing the context role information shown in Table 2 from the prior-art role table shown in Table 1 gives the role table shown in Table 3, which is used in the embodiments of the present invention; the role-labeled corpus above was obtained on the basis of the role table shown in Table 3. The role table shown in Table 3 may of course be modified and adjusted as needed, which is not described further in this embodiment.

Table 2

Role	Meaning	Example (context word in quotation marks)
K	Context preceding a name	Again "came to" the home of Yu Hongyang.
L	Context following a name	Xinhua News Agency reporter Huang Wen "photographed"
M	Component between two names	Screenwriter Shao Junlin "and" Ji Daoqing said

Table 3

As can be seen, compared with Table 1, the role table used in the embodiments of the present invention has no roles for the context preceding a name, the context following a name, or the component between two names. As a result, the process of obtaining the role-labeled corpus in this step does not depend on the set of context roles; that is, the Chinese name identification method proposed in the embodiments of the present invention does not rely on the set of context roles.

It should be noted that Table 3 is not limited to using the letters above for the corresponding meanings. For example, role C could instead denote the surname of a Chinese personal name and role B the first character of a two-character given name; the correspondence between roles and meanings in Table 3 may be adjusted as needed. Any such correspondence between roles and meanings falls within the scope of protection of the present invention; the embodiments are described using the correspondence shown in Table 3 as an example.

It should be noted that, in the embodiments of the present invention, the role-labeled corpus is obtained by the SegTag method, which comprises:

(1) Recording name information. Each line of the non-nested annotated corpus (obtained in step 301) is scanned, and the position of every name in the line (computed ignoring any spaces) and the type of the name are recorded. The name types mainly include: the surname part of a Chinese name, the given-name part of a Chinese name, a two-character Chinese personal name without a surname, a foreign name longer than two Chinese characters, a two-character foreign name, and a non-name that is tagged with the name part of speech.

(2) Labeling each token with a role. The corresponding line is taken from the segmented corpus (obtained in step 301), the position of each word is compared with the position of the nearest following name, and the corresponding role is assigned according to the relative position of the word and the name and the type of the name.

The SegTag method differs from existing corpus labeling methods as follows: SegTag labels the segmented corpus (the corpus as segmented by the current word segmentation system), whereas existing methods label the standard segmentation of the input corpus. Because segmentation results may differ, the role model training corpus obtained by SegTag may differ as well. For example, if the segmented corpus is with/Mao/Zedong/comrade/as/representative/de/China/Communist Party (where de is the possessive particle), the role-labeled corpus is: with/A Mao/B Zedong/Z comrade/A as/A representative/A de/A China/A Communist Party/A. If instead the segmented corpus is with/Mao/Ze/dong/comrade/as/representative/de/China/Communist Party, the role-labeled corpus is: with/A Mao/B Ze/C dong/D comrade/A as/A representative/A de/A China/A Communist Party/A.

As can be seen, because the embodiment of the invention assigns roles to the segmented corpus, the role labeling result changes along with the current word segmentation system, which improves the accuracy of name recognition. For example, suppose the current word segmentation system produces with/Mao/Ze/dong/comrade/as/representative/de/China/Communist Party. If the training corpus had instead been labeled as with/A Mao/B Zedong/Z comrade/A as/A representative/A de/A China/A Communist Party/A, the name Mao Zedong might not be recognized. Specifically, when the role model contains a probability for Zedong as role Z but no probability for Ze as role C or for dong as role D, Mao Zedong cannot be correctly identified.

Step 303: extract the training files and the dictionary. That is, the role labeling corpus obtained in step 302 is processed to obtain the corresponding role transition corpus and role emission corpus. The role transition corpus is obtained by removing all entries and keeping only the roles; the role emission corpus is obtained by placing each role and its entry on a separate line. For example, when the role labeling corpus is with/A Mao/B Ze/C dong/D comrade/A as/A representative/A de/A China/A Communist Party/A, the extracted role transition corpus is: A B C D A A A A A A. The extracted role emission corpus is:

A with

B Mao

C Ze

D dong

A comrade

A as

A representative

A de

A China

A Communist Party
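
The extraction in step 303 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_role_corpus`, the `entry/role` text format and the underscore used to keep a multi-word entry as one token are all assumptions made for the example.

```python
def split_role_corpus(tagged_line):
    """Split one role-labeled line into a roles-only sequence and role/entry lines."""
    pairs = []
    for item in tagged_line.split():
        # rpartition splits on the last '/', so a '/' inside the entry survives
        entry, _, role = item.rpartition("/")
        pairs.append((role, entry))
    # Roles only, entries removed (input for transition statistics)
    transition_line = " ".join(role for role, _ in pairs)
    # One "role entry" pair per line (input for emission statistics)
    emission_lines = [f"{role} {entry}" for role, entry in pairs]
    return transition_line, emission_lines


line = ("with/A Mao/B Ze/C dong/D comrade/A as/A "
        "representative/A de/A China/A Communist_Party/A")
transition_line, emission_lines = split_role_corpus(line)
print(transition_line)
print("\n".join(emission_lines))
```

Keeping the two outputs separate mirrors the text: one file feeds the role-to-role statistics and the other the role-to-entry statistics.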

In addition, an initial role dictionary can be obtained while the role transition corpus and role emission corpus are extracted. A basic role dictionary is extracted from the training corpus of the role labeling model and is then progressively refined and expanded. For example, a name corpus containing a large number of Chinese personal names and transliterated names can easily be obtained from the role labeling model; statistical processing of this name corpus yields a fairly accurate set of characters commonly used in names, together with the roles commonly taken by each such character. In practice, the role dictionary also needs to be refined and expanded step by step according to the recognition errors that are found, so as to obtain a high-quality role dictionary; this is not described further here.

Step 304: perform model training. According to formula 1 used by the prior-art role labeling method, training needs to produce two types of probability: p(w_i | t_i) and p(t_i | t_{i-1}). Here p(w_i | t_i) is the probability that the token is w_i within the token set of role t_i, i.e. the role emission probability, and p(t_i | t_{i-1}) is the probability of moving from role t_{i-1} to role t_i, i.e. the role transition probability. Model training can therefore be carried out using the role transition corpus and the role emission corpus obtained in step 303, which yields the training result.

In the embodiment of the invention, the Katz smoothing algorithm is used to smooth the role transition probabilities and the role emission probabilities, which addresses the data sparseness that a probability model obtained by maximum likelihood estimation inevitably encounters. Katz smoothing is an existing algorithm and is not described further in the embodiment of the invention.
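
As a rough illustration of the two probability tables that training produces, the sketch below estimates them by plain relative frequency (maximum likelihood). It deliberately omits the Katz smoothing used in the embodiment, and the function name `train_mle` and the input layout are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_mle(role_sequences, role_entry_pairs):
    """Estimate p(t_i | t_{i-1}) and p(w_i | t_i) by relative frequency (no smoothing)."""
    trans_counts = defaultdict(Counter)
    for seq in role_sequences:
        # Count each adjacent pair of roles
        for prev, cur in zip(seq, seq[1:]):
            trans_counts[prev][cur] += 1

    emit_counts = defaultdict(Counter)
    for role, entry in role_entry_pairs:
        emit_counts[role][entry] += 1

    def normalize(table):
        # Turn each row of counts into a probability distribution
        return {
            key: {x: c / sum(row.values()) for x, c in row.items()}
            for key, row in table.items()
        }

    return normalize(trans_counts), normalize(emit_counts)


sequences = [["A", "B", "C", "D", "A"]]
pairs = [("A", "with"), ("B", "Mao"), ("C", "Ze"), ("D", "dong"), ("A", "comrade")]
trans_p, emit_p = train_mle(sequences, pairs)
print(trans_p["A"], emit_p["A"])
```

Unseen events get no entry at all in these tables, which is exactly the sparseness problem that smoothing is meant to address.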

As can be seen, steps 301 to 304 yield the result of model training, which can be used directly in the subsequent Chinese name recognition process.

Step 202: recognize Chinese names according to the result of model training. During Chinese information processing, the Chinese names in the Chinese text need to be recognized in order to complete the processing.

Specifically, as shown in Figure 4, in the embodiment of the invention, recognizing Chinese names according to the result of model training comprises the following steps:

Step 401: segment the input sentence to obtain the segmented sentence. The embodiment of the invention is described using the following input sentence as an example: the hall displays articles used by Zhou Enlai and Deng Yingchao during their lifetimes. One possible segmentation of this sentence is: hall/inside/display/Zhou/En/lai/and/Deng/Ying/chaosheng/before/use/guo/de/articles, where the segmentation system has merged chao, the last character of the name Deng Yingchao, with the following character sheng into the single word chaosheng.

Step 402: apply the role labeling model to the segmented sentence and use the Viterbi algorithm to obtain the role labeling sequence with the highest probability. For the example above, the role labeling result is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A guo/A de/A articles/A. The role labeling is carried out according to the result of the model training described above (for example, using the role dictionary obtained in model training). This step is an existing processing technique and is not described in detail in the embodiment of the invention.
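
For readers unfamiliar with the decoding in step 402, the following is a generic sketch of Viterbi decoding over role emission and role transition tables. It is not taken from the patent: the function signature, the `floor` value used for unseen events and all probabilities in the toy tables are invented for illustration.

```python
import math


def viterbi(tokens, roles, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most probable role sequence for tokens (log-space Viterbi)."""
    def lp(table, a, b):
        # Unseen events fall back to a small floor probability (an assumption)
        return math.log(table.get(a, {}).get(b, floor))

    # best[i][r] = (log probability of the best path ending in role r, previous role)
    best = [{
        r: (math.log(start_p.get(r, floor)) + lp(emit_p, r, tokens[0]), None)
        for r in roles
    }]
    for tok in tokens[1:]:
        column = {}
        for r in roles:
            score, prev = max((best[-1][q][0] + lp(trans_p, q, r), q) for q in roles)
            column[r] = (score + lp(emit_p, r, tok), prev)
        best.append(column)

    # Follow the back-pointers from the best final role
    role = max(roles, key=lambda r: best[-1][r][0])
    path = [role]
    for column in reversed(best[1:]):
        role = column[role][1]
        path.append(role)
    return path[::-1]


roles = ["A", "B", "C", "D"]
start_p = {"A": 1.0}
trans_p = {"A": {"A": 0.6, "B": 0.4}, "B": {"C": 1.0}, "C": {"D": 1.0}, "D": {"A": 1.0}}
emit_p = {"A": {"display": 0.5, "and": 0.5}, "B": {"Zhou": 1.0},
          "C": {"En": 1.0}, "D": {"lai": 1.0}}
decoded = viterbi(["display", "Zhou", "En", "lai", "and"], roles, start_p, trans_p, emit_p)
print(decoded)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied along a long sentence.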

It should be noted that this step may be executed by the role labeling module; other entities may also be used according to actual needs, which is not described further here.

For entries that do not appear in the role dictionary (i.e. entries unregistered in the role dictionary), the embodiment of the invention proposes an effective method of guessing their roles. The method guesses the role of an entry from its length and the characters it is composed of, as follows:

(1) If the unregistered entry has an odd number of bytes, or no fewer than 6 bytes, or contains a non-Chinese character, its role is directly determined to be the name-irrelevant role A.

(2) If the unregistered entry is a single Chinese character, the candidate roles are A|C|D|E.

(3) If the unregistered entry consists of two Chinese characters, the candidate roles are A|X|Z.

It should be noted that transliterated names draw on a relatively concentrated set of characters, so the role dictionary already contains most of the characters commonly used in transliterated names together with their usual roles. The guessing method above is therefore aimed mainly at possible Chinese personal name roles.
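
The three guessing rules above can be sketched as a small function. Two points are assumptions of this example rather than statements from the text: Chinese characters are counted as 2 bytes each (as in an encoding such as GBK) and all other characters as 1 byte, and a character is treated as Chinese when it lies in the Unicode range U+4E00 to U+9FFF. The name `guess_roles` is likewise hypothetical.

```python
def is_chinese(ch):
    # Assumption: "Chinese character" means the CJK Unified Ideographs block
    return "\u4e00" <= ch <= "\u9fff"


def guess_roles(entry):
    """Return the candidate roles for an entry missing from the role dictionary."""
    # Assumption: 2 bytes per Chinese character, 1 byte per other character
    n_bytes = sum(2 if is_chinese(ch) else 1 for ch in entry)
    all_chinese = all(is_chinese(ch) for ch in entry)
    # Rule (1): odd byte count, 6 bytes or more, or any non-Chinese character
    if n_bytes % 2 == 1 or n_bytes >= 6 or not all_chinese:
        return {"A"}
    # Rule (2): a single Chinese character
    if len(entry) == 1:
        return {"A", "C", "D", "E"}
    # Rule (3): two Chinese characters
    if len(entry) == 2:
        return {"A", "X", "Z"}
    return {"A"}


print(guess_roles("泽"), guess_roles("泽东"), guess_roles("毛泽东"))
```

The returned set only narrows the candidates; choosing among them is still left to the role labeling model.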

Step 403: detect name recognition roles in the role labeling sequence that may have been recognized incorrectly, and correct the roles concerned. This step may be executed by the role correction module; other entities may also be used according to actual needs, which is not described further here.

In this step, suppose that the name string to be recognized is w_m w_{m+1} ... w_n and that the corresponding name pattern is t_m t_{m+1} ... t_n, where m >= 0 and n >= m+2. The words immediately before and after the name are w_{m-1} and w_{n+1} respectively. The embodiment of the invention decides whether to recognize the name by comparing the probabilities of two paths:

Path 1 (name path, PN_PATH):

P(w_{m-1} -> PN) * P(PN -> w_{n+1}) * P(w_m w_{m+1} ... w_n | PN)

Path 2 (non-name path, NOT_PN_PATH):

P(w_{m-1} -> w_m) * P(w_m -> w_{m+1}) * ... * P(w_n -> w_{n+1})

Specifically, path 1 is the path on which w_m w_{m+1} ... w_n is recognized as a name. Its first probability, P(w_{m-1} -> PN), is the transition probability from the context preceding the name to the name; its second probability, P(PN -> w_{n+1}), is the transition probability from the name to the context following it; and its third probability, P(w_m w_{m+1} ... w_n | PN), is the probability that the recognized name is w_m w_{m+1} ... w_n. The probability of path 1 is the product of these three probabilities.

Path 2 is the path on which w_m w_{m+1} ... w_n is not recognized as a name; its probability is the product of the transition probabilities between adjacent entries on the path.

The decision is made by comparing the two probabilities. If the probability of path 1 is not less than that of path 2, w_m w_{m+1} ... w_n is recognized as a name. Otherwise, recognizing w_m w_{m+1} ... w_n as a name would probably be a misrecognition, so it is not recognized as a name, and the roles corresponding to w_m w_{m+1} ... w_n are relabeled as the name-irrelevant role, i.e. role A.

It should be noted that the two paths are computed as follows: (1) the third probability of path 1, P(w_m w_{m+1} ... w_n | PN), is computed from the role emission probabilities, i.e. P(w_m w_{m+1} ... w_n | PN) = p(w_m | t_m) * p(w_{m+1} | t_{m+1}) * ... * p(w_n | t_n); (2) the other probabilities in path 1 and all probabilities in path 2 can be obtained from a class-based language model. In practice, path 1 multiplies one more probability than path 2, so a weight factor w may be placed in front of P(w_m w_{m+1} ... w_n | PN) according to actual needs to make the comparison between the two paths more accurate. Both paths can be computed using existing techniques, which are not described in detail in the embodiment of the invention.

In addition, misrecognition is more likely under the BCD pattern and the XD pattern. To correct possible recognition errors more accurately, a third path, called PN_PATH2, is added to the comparison for these two patterns:

Path 3 (name path 2, PN_PATH2):

P(w_{m-1} -> PN) * P(PN -> w_n) * P(w_m w_{m+1} ... w_{n-1} | PN) * P(w_n -> w_{n+1})

Specifically, path 3 is the path on which w_m w_{m+1} ... w_{n-1} is recognized as a name. The first three probabilities of path 3 have meanings analogous to the three probabilities of path 1 and are not described again here. The fourth probability, P(w_n -> w_{n+1}), is the transition probability from w_n, the word following the name, to the next word w_{n+1}; it can likewise be obtained from the class-based language model.

In the embodiment of the invention, if the probability of path 3 is greater than the probabilities of both path 1 and path 2, the roles t_{n-1} and t_n corresponding to w_{n-1} and w_n need to be corrected. When the pattern to be corrected is the BCD pattern, t_{n-1} is changed to role E and t_n to role A; when it is the XD pattern, t_{n-1} is changed to role Y and t_n to role A.

To illustrate this step more clearly, the example above is continued. The role labeling sequence obtained in the preceding step is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A guo/A de/A articles/A. Because Zhou/B En/C lai/D follows the BCD pattern, which is among the patterns subject to correction, the role correction module compares the probabilities of three paths and derives the final correction result.

Path 1: P(display -> PN) * P(PN -> and) * P(Zhou En lai | PN), where P(Zhou En lai | PN) = P(Zhou | B) * P(En | C) * P(lai | D).

Path 2: P(display -> Zhou) * P(Zhou -> En) * P(En -> lai) * P(lai -> and).

Path 3: P(display -> PN) * P(PN -> lai) * P(Zhou En | PN) * P(lai -> and), where P(Zhou En | PN) = P(Zhou | B) * P(En | E).

In summary: (1) if the probability of path 1 is the largest, the name is recognized and no role is corrected; (2) if the probability of path 2 is the largest, the role of every entry in the candidate name is changed to A, i.e. the roles of the three entries Zhou, En and lai are changed to the name-irrelevant role A; (3) if the probability of path 3 is the largest, the roles of En and lai are changed to role E and role A respectively. The detailed computation and comparison of the path probabilities are not described further in this step.
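
The three-path comparison can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the class-based language model is reduced to a dictionary `lm` keyed by (previous, next) pairs with a special class label "PN", the emission table `emit` is keyed by (role, word), unseen events use a hypothetical `FLOOR` value, and all numbers are invented.

```python
from math import prod

FLOOR = 1e-8  # hypothetical fallback probability for unseen events


def path_probs(words, roles, m, n, lm, emit, weight=1.0):
    """Return (PN_PATH, NOT_PN_PATH, PN_PATH2) for the candidate name words[m..n]."""
    def t(a, b):
        return lm.get((a, b), FLOOR)

    def e(role_seq, word_seq):
        return prod(emit.get((r, w), FLOOR) for r, w in zip(role_seq, word_seq))

    cand = list(roles[m:n + 1])
    p1 = t(words[m - 1], "PN") * t("PN", words[n + 1]) * weight * e(cand, words[m:n + 1])
    p2 = prod(t(words[i], words[i + 1]) for i in range(m - 1, n + 1))
    # Path 3 is only meaningful for BCD and XD: the shortened name ends in E or Y
    end_role = "E" if "".join(cand) == "BCD" else "Y"
    p3 = (t(words[m - 1], "PN") * t("PN", words[n]) * weight
          * e(cand[:-2] + [end_role], words[m:n]) * t(words[n], words[n + 1]))
    return p1, p2, p3


def correct_roles(roles, m, n, p1, p2, p3):
    """Apply the correction rules to the roles of the candidate name."""
    roles = list(roles)
    pattern = "".join(roles[m:n + 1])
    if pattern in ("BCD", "XD") and p3 > p1 and p3 > p2:
        roles[n - 1] = "E" if pattern == "BCD" else "Y"
        roles[n] = "A"
    elif p1 < p2:
        roles[m:n + 1] = ["A"] * (n - m + 1)
    return roles


words = ["hall", "inside", "display", "Zhou", "En", "lai", "and"]
roles = ["A", "A", "A", "B", "C", "D", "A"]
lm = {("display", "PN"): 0.1, ("PN", "and"): 0.2, ("PN", "lai"): 0.001, ("lai", "and"): 0.01}
emit = {("B", "Zhou"): 0.05, ("C", "En"): 0.02, ("D", "lai"): 0.03, ("E", "En"): 0.001}
p1, p2, p3 = path_probs(words, roles, 3, 5, lm, emit)
print(correct_roles(roles, 3, 5, p1, p2, p3))
```

With these toy numbers the name path wins, so the roles B C D are left unchanged; passing different probabilities to `correct_roles` exercises the other two outcomes.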

Step 404: eliminate or correct misrecognized names by means of preset conditions. This step may be executed by the rule checking module; other entities may also be used according to actual needs, which is not described further here.

Specifically, the preset conditions include:

(1) In the BCD pattern, if the Chinese character with role D is the character meaning "and", and the string immediately following it is recognized as a name, then this "and" is a conjunction. The BCD pattern is therefore changed to the BE pattern, and only the original BC part is recognized as a name. For example, in the sentence "he saw Guo Quan and Zhao Tao fighting", the names initially recognized might be Guo Quan-and (BCD pattern) and Zhao Tao; applying this condition corrects the misrecognized Guo Quan-and to Guo Quan, with "and" treated as a conjunction.

(2) A transliterated name that satisfies either of the following conditions is not recognized. First, the transliterated name exceeds 16 Chinese characters (for example, Andrew Jefferson Karstlo Bill Gates Bauer Mo Qiaobusibulin Page might be recognized as a single transliterated name, and it must be eliminated). Second, the recognized name contains 3 or more consecutive identical characters (for example, A Aaluo), in which case it cannot be a name.

(3) If a recognized name is immediately preceded or followed by a pause mark (the Chinese enumeration comma "、") and the word on the other side of that pause mark is determined not to be a name, the name is not recognized. The reason is that the pause mark often marks the left or right boundary of a name: if a word adjacent to the pause mark is a name, a name should also appear on the other side of it; otherwise the candidate must be eliminated.
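
Conditions (1) and (2) lend themselves to a short sketch. The function names, the English stand-in tokens and the choice to relabel the conjunction as role A are assumptions of this example; the text itself only states that BCD becomes BE and that the BC part is kept as the name.

```python
import re


def reject_transliterated(name, max_len=16):
    """Condition (2): reject names longer than max_len characters or
    containing 3 or more identical characters in a row."""
    return len(name) > max_len or re.search(r"(.)\1\1", name) is not None


def fix_conjunction(tokens, roles, start, followed_by_name, conjunction="and"):
    """Condition (1): B C D becomes B E A when the D token is the conjunction
    and the string right after it was recognized as a name."""
    roles = list(roles)
    is_bcd = roles[start:start + 3] == ["B", "C", "D"]
    if is_bcd and tokens[start + 2] == conjunction and followed_by_name:
        roles[start + 1] = "E"
        roles[start + 2] = "A"  # assumption: the conjunction becomes name-irrelevant
    return roles


tokens = ["Guo", "Quan", "and", "Zhao", "Tao"]
roles = ["B", "C", "D", "B", "E"]
print(fix_conjunction(tokens, roles, 0, followed_by_name=True))
print(reject_transliterated("aaab"), reject_transliterated("abab"))
```

In `reject_transliterated` each character of the string stands for one Chinese character of the candidate name.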

Step 405: split role U (a word formed by the character preceding a name and the first character of the name) and role V (a word formed by the last character of a name and the character following it) in the role labeling sequence. Splitting roles U and V in the role labeling sequence obtained in the preceding steps yields a more accurate role labeling sequence. This step may be executed by the role splitting module; other entities may also be used according to actual needs, which is not described further here.

It should be noted that steps 403, 404 and 405 need not be performed in a fixed order; the order above is used in the embodiment of the invention only as an example. In practice, the order may be adjusted according to actual needs. For example, step 405 (splitting roles U and V in the role labeling sequence) may be performed first, followed by step 403 (detecting and correcting name recognition roles that may have been recognized incorrectly), and then step 404 (eliminating or correcting misrecognized names by means of preset conditions). This is not described further in the embodiment of the invention.

Specifically, to handle cases in which a name and its context are combined into a single word, roles U and V need to be split. The splitting rules for role U are shown in Table 4 and those for role V in Table 5; the contents of Tables 4 and 5 may of course be adjusted and modified according to actual needs, which is not described further here.

Table 4: splitting rules for role U

Role following role U	Splitting result
C, E, G, Z	AB
D	AC
I, X2, E2	AH
Other roles	AA

Table 5: splitting rules for role V

Role preceding role V	Splitting result
C, X	DA
B	EA
I, X2	TA
H	E2A
Other roles	AA

To illustrate this step more clearly, the example above is continued. In the role labeling sequence hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A guo/A de/A articles/A, role V needs to be split. The role preceding role V is role C, so the splitting result is DA, and the role labeling result after splitting is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chao/D sheng/A before/A use/A guo/A de/A articles/A.
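
Tables 4 and 5 translate directly into two lookup tables. The sketch below is illustrative only: `split_uv` is a hypothetical name, a U or V token is assumed to be split so that exactly one character goes to the name side, and the Chinese sample tokens are a reconstruction of the characters behind the translated example (an assumption, given here with pinyin in the comments).

```python
# Table 4: role following U -> role given to the name-side character
U_RULES = {"C": "B", "E": "B", "G": "B", "Z": "B", "D": "C",
           "I": "H", "X2": "H", "E2": "H"}
# Table 5: role preceding V -> role given to the name-side character
V_RULES = {"C": "D", "X": "D", "B": "E", "I": "T", "X2": "T", "H": "E2"}


def split_uv(tagged):
    """Split every U or V token into a name part and a context part (role A)."""
    out = []
    for i, (token, role) in enumerate(tagged):
        if role == "U":
            following = tagged[i + 1][1] if i + 1 < len(tagged) else None
            # Context characters first, then the first character of the name
            out.append((token[:-1], "A"))
            out.append((token[-1], U_RULES.get(following, "A")))
        elif role == "V":
            preceding = out[-1][1] if out else None
            # Last character of the name first, then the context characters
            out.append((token[0], V_RULES.get(preceding, "A")))
            out.append((token[1:], "A"))
        else:
            out.append((token, role))
    return out


# Deng/B Ying/C chaosheng/V before/A
tagged = [("邓", "B"), ("颖", "C"), ("超生", "V"), ("前", "A")]
print(split_uv(tagged))
```

Roles missing from either table fall through to A, which corresponds to the "Other roles" rows.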

Step 406: match the split role labeling sequence against the name recognition patterns, output the names that are formed, and record the position of each name in the sentence. This step may be executed by the pattern matching module; other entities may also be used according to actual needs, which is not described further here.

Specifically, the name recognition patterns are shown in Table 6; the contents of Table 6 may of course be adjusted and modified according to actual needs, which is not described further here.

Table 6: name recognition pattern set

Further, in this step, maximum pattern matching is performed on the split role labeling sequence according to the name recognition patterns: when the split role labeling sequence contains content corresponding to a pattern in the name recognition pattern set, the longest matching pattern is taken. For example, when the role labeling sequence contains the Chinese personal name pattern BCD, the result of maximum pattern matching is BCD, whereas a non-maximal match might return BC, CD, and so on. The embodiment of the invention therefore uses maximum pattern matching on the role labeling sequence.

Continuing the example above, the split role labeling sequence is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chao/D sheng/A before/A use/A guo/A de/A articles/A. After maximum pattern matching, the recognized names are Zhou Enlai (BCD pattern) and Deng Yingchao (BCD pattern). In addition, the transliterated name pattern [H|X2][I|X2]+[T|X2] is written as a regular expression: if the role of the first word is H or X2, the roles in the middle are one or more occurrences of I or X2, and the role of the last word is X2 or T, the sequence can be recognized as a transliterated name.
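
One way to sketch maximum pattern matching is to encode each role as a single character and run one regular expression over the resulting string, listing longer alternatives first. The patterns are those named in the device description of this document; the single-character encoding of X2 and E2, the ordering of the alternatives and the name `match_names` are assumptions of this example.

```python
import re

# Assumption: two-character roles are mapped to one character so that a
# position in the role string equals a token index
ONE_CHAR = {"X2": "x", "E2": "e"}

# Longer alternatives come first because Python tries alternatives in order
NAME_RE = re.compile(r"BCD|BE|BG|BZ|FB|XD|FE|[Hx][Ix]+[Tx]|He|xT|x|Y")


def match_names(tokens, roles):
    """Return (name, start_index) for each pattern match in the role sequence."""
    role_str = "".join(ONE_CHAR.get(r, r) for r in roles)
    return [
        ("".join(tokens[m.start():m.end()]), m.start())
        for m in NAME_RE.finditer(role_str)
    ]


tokens = ["hall", "inside", "display", "Zhou", "En", "lai", "and",
          "Deng", "Ying", "chao", "sheng", "before", "use", "guo", "de", "articles"]
roles = ["A", "A", "A", "B", "C", "D", "A",
         "B", "C", "D", "A", "A", "A", "A", "A", "A"]
print(match_names(tokens, roles))
```

The start index returned with each name corresponds to the position record mentioned in step 406, expressed here as a token index rather than a character offset.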

It should be noted that, in the embodiment of the invention, the role labeling module, role correction module, rule checking module, role splitting module and pattern matching module may be further combined into one or more modules, or further split into multiple submodules, according to actual needs.

The order of steps 401 to 406 may also be adjusted according to actual needs, which is not described further here.

As can be seen, the method provided by the embodiment of the invention makes full use of the advantages of the role labeling model and the class-based language model, and can recognize Chinese personal names and transliterated names fairly accurately while maintaining a high recall rate. Moreover, the roles proposed in the present invention do not depend on the context role set and its vocabulary, so the name recognition system achieves a high name recognition recall rate with a limited amount of training data. In addition, the present invention provides a practical strategy for guessing the roles of unregistered entries, which can effectively guess the roles of most unregistered entries and to some extent reduces the negative effect of unregistered words on name recognition.

The embodiment of the invention also provides a device for Chinese name recognition which, as shown in Figure 5, comprises:

An acquisition module 51, configured to obtain an input sequence and segment the input sequence.

A role labeling module 52, configured to perform role labeling on the input sequence segmented by the acquisition module 51 and obtain a role labeling sequence.

A pattern matching module 53, configured to perform maximum pattern matching on the role labeling sequence obtained by the role labeling module 52 according to the name recognition patterns, and output the names that are formed.

The embodiment of the invention also provides a device for Chinese name recognition which, as shown in Figure 6, comprises:

An acquisition module 61, configured to obtain an input sequence and segment the input sequence.

A role labeling module 62, configured to perform role labeling on the input sequence segmented by the acquisition module 61 and obtain a role labeling sequence.

A pattern matching module 63, configured to perform maximum pattern matching on the role labeling sequence obtained by the role labeling module 62 according to the name recognition patterns, and output the names that are formed.

A role correction module 64, configured to detect the name recognition roles in the role labeling sequence obtained by the role labeling module and correct the name recognition roles that are wrong.

A role splitting module 65, configured to split role U and role V in the role labeling sequence obtained by the role labeling module.

The role splitting module 65 is specifically configured to: when the role following role U is C, E, G or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, split the content corresponding to role U into role A and role H; and when the role following role U is any other role, split the content corresponding to role U into role A and role A.

Further, the pattern matching module 63 is specifically configured to perform maximum pattern matching, according to the name recognition patterns, on the role labeling sequence after role U and role V have been split, and output the names that are formed. When the split role labeling sequence contains the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE, the result of maximum pattern matching is that the corresponding content is a Chinese personal name; when the split role labeling sequence contains the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2 or Y, the result of maximum pattern matching is that the corresponding content is a transliterated name.

A training module 66, configured to perform model training before the input sequence is obtained. The role labeling module 62 is further configured to perform role labeling on the segmented input sequence according to the result of the model training performed by the training module 66.

Specifically, the training module 66 is configured to: obtain an input corpus and remove the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; remove all part-of-speech tags from the input corpus to obtain a text corpus, and segment the text corpus with the word segmentation system to obtain a segmented corpus; label the segmented corpus according to the role table to obtain the role labeling corpus, where the role table contains no context role information; obtain the role transition corpus and the role emission corpus from the role labeling corpus; and perform model training according to the role transition corpus and the role emission corpus.

As can be seen, the device provided by the embodiment of the invention makes full use of the advantages of the role labeling model and the class-based language model, and can recognize Chinese personal names and transliterated names fairly accurately while maintaining a high recall rate. Moreover, the roles proposed in the present invention do not depend on the context role set and its vocabulary, so the name recognition system achieves a high name recognition recall rate with a limited amount of training data. In addition, the present invention provides a practical strategy for guessing the roles of unregistered entries, which can effectively guess the roles of most unregistered entries and to some extent reduces the negative effect of unregistered words on name recognition.

From the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software together with a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, or the part of it that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a terminal device (which may be a mobile phone, a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the present invention.

The above is only a preferred implementation of the present invention. It should be pointed out that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Those skilled in the art will appreciate that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the present embodiment. The modules of the above embodiments may be integrated into one unit or deployed separately; they may be merged into one module or further split into multiple submodules.

The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

The above discloses only several specific embodiments of the present invention. The present invention is not limited to them, however, and any variation that those skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims

1. A method for Chinese name recognition, characterized by comprising:

Obtaining an input sequence, and segmenting the input sequence;

2. The method of claim 1, characterized in that, after the role labeling sequence is obtained, the method further comprises:

3. The method of claim 1 or 2, characterized in that, after the role labeling sequence is obtained, the method further comprises:

Splitting role U and role V in the role labeling sequence.

4. The method of claim 3, characterized in that splitting role U in the role labeling sequence comprises:

5. The method of claim 3, characterized in that matching the role labeling sequence according to the name recognition patterns and outputting the names that are formed comprises:

6. The method of claim 5, characterized in that performing maximum pattern matching, according to the name recognition patterns, on the role labeling sequence after role U and role V have been split, and outputting the names that are formed, comprises:

7. The method of claim 1, characterized in that, before the input sequence is obtained, the method further comprises: performing model training;

8. The method of claim 7, characterized in that performing model training comprises:

9. A device for Chinese name recognition, characterized by comprising:

10. The device of claim 9, characterized by further comprising:

11. The device of claim 9 or 10, characterized by further comprising:

12. The device of claim 11, characterized in that,

13. The device of claim 11, characterized in that,

14. The device of claim 13, characterized in that,

15. The device of claim 9, characterized by further comprising:

16. The device of claim 15, characterized in that,