Background art
Chinese information processing refers to the processing by computer of Chinese information such as sound, form and meaning, and is a branch of natural language processing. It mainly studies how to use computers to process Chinese information automatically. Compared with Western languages such as English, Chinese lacks explicit word delimiters and is more flexible in grammar, semantics and pragmatics, which makes it harder for a computer to process and understand. Lexical analysis is the prerequisite and foundation of Chinese natural language processing. Research on Chinese lexical analysis has made considerable progress, but when the text contains unknown (out-of-vocabulary) words, the results generally fail to meet practical requirements.
Specifically, when an unknown word is misrecognized, not only is the word itself not identified correctly, but, because unknown words often combine ambiguously with the words before and after them, the recognition of neighbouring words is also seriously affected. This directly lowers the accuracy of lexical analysis and can even affect the analysis of the whole sentence. Automatic recognition of unknown words has therefore become the bottleneck for the quality of Chinese lexical analysis.
Furthermore, named entities account for a large proportion of unknown words and are the main difficulty in unknown word recognition. A named entity is an entity in the text that has a specific meaning and denotes an abstract or concrete thing in the real world. Named entities mainly include personal names, place names, organization names, dates, times, monetary values and percentages. In terms of recognition performance, dates, times, monetary values and percentages are relatively easy to recognize, and the rules and training statistics for them are relatively easy to collect.
Personal names, place names and organization names, however, are open-ended and extensible, and their composition is highly irregular, so their recognition suffers from many false recognitions and misses. Named entity recognition is important for understanding text correctly and underlies technologies such as information extraction, automatic question answering and machine translation. The recognition of personal names, place names and organization names is therefore the focus of current research on named entity recognition. Among these, Chinese personal names and transliterated names account for a very large share of named entities, which makes automatic personal name recognition a key part of unknown word recognition. Solving the personal name recognition problem will improve the final quality of Chinese lexical analysis, syntactic analysis and Chinese information processing as a whole.
In the prior art, Chinese personal names are usually recognized automatically by a role-tagging method. Role information extracted automatically from a corpus is used, the Viterbi algorithm (a dynamic-programming decoding algorithm originally proposed for convolutional codes) is applied to tag the segmentation result with roles, and maximum pattern matching is then performed on the role sequence to recognize Chinese personal names.
Concrete, the method for this based role mark is thought: each entry in sentence impliedly carries a Role Information, and wherein, role representation entry is role in sentence or named entity.This character labeling just refers to the upper corresponding role of each entry mark in the entry sequence obtain cutting result, and wherein, role is mainly divided three classes, and is respectively: the inside composition role of name, become word role with context, name has nothing to do role.A kind of role's table as shown in table 1:
Table 1
Role | Meaning | Example
B | Surname of a Chinese personal name | ?Mr. Warburg Pincus
C | First character of the two-character given name of a Chinese personal name | ? ChinaFlat Mr.
D | Last character of the two-character given name of a Chinese personal name | Zhang Hua FlatSir
E | Single-character given name of a Chinese personal name | ? GreatSay: " I is Mr. Nice Guy "
F | Prefix | AlwaysLiu, LittleLee
G | Suffix | King Always, Liu Always, Xiao Family name, Wu Mother, leaf Handsome
U | Preceding context character that forms a word with the first character of a name | Here RelevantIt is herioc that it is trained
V | Last character of a name that forms a word with the following context character | Gong Xue EqualityLeader, Deng Ying Excusing from deathBefore
X | Surname of a Chinese personal name that forms a word with the first character of the two-character given name | Wang Guowei、
Z | Two-character given name of a Chinese personal name that is itself a word | Zhang Chaoyang
Y | Complete name that is itself a word | Peak、 Vast sea, Bush, Brian Special
H | First character of a transliterated name | gramislington
I | Middle character of a transliterated name | history/ the base of a fruit/ fragrant/ ./ this/ skin/ you/ primary/ lattice
T | Last character of a transliterated name of three or more characters | crin ?
E2 | Last character of a two-character transliterated name | general capital
X2 | Characters inside a transliterated name that form a word | oman reaches billmoral
A | Role unrelated to names | hu Jintao's cordiality visits child
K | Context preceding a name | come again to the home of Yu Hongyang.
L | Context following a name | reporter of the Xinhua News Agency Huang Wen takes the photograph
M | Component between two names | green grass or young crops is said in playwright, screenwriter Shao Jun Lin Heji road
As can be seen, with the role table shown in Table 1, when the segmentation result is shop/interior/display/week/grace/come/and/Deng/grain husk/excusing from death/front/uses/mistake//article, labeling each token in the resulting token sequence with its role gives the role-tagging result "shop/A interior/A display/A week/B grace/C come/D and/A Deng/B grain husk/C excusing from death/V front/A uses/A mistake/A /A article/A".
Furthermore, in the role-tagging method, roles are tagged automatically with the Viterbi algorithm; that is, among all possible tag sequences, the one with the highest probability is selected as the final tagging result. The theory and derivation are as follows:
Suppose that W is the Token sequence (word segmentation result namely before unknown word identification) after participle, W=(w1, w2 ..., w
m); T is certain possible character labeling sequence of W, T=(t1, t2 ..., t
m), m > 0; Wherein, T
#for final annotation results, i.e. role's sequence of maximum probability.According to Bayes formula, and introduce Hidden Markov Model (HMM), then
(formula 1)
Here, w_i is an observed value and the role t_i is a state value; W is the observation sequence and T is the state sequence hidden behind W. P(w_i|t_i) is the probability of token w_i within the set of tokens whose role is t_i, and P(t_i|t_{i-1}) is the transition probability from role t_{i-1} to role t_i.
Let C(w_i, t_i) be the number of times w_i occurs with role t_i;
C(t_{i-1}, t_i) be the number of times role t_{i-1} is immediately followed by role t_i;
C(t_i) be the number of times role t_i occurs.
Given training on a large-scale corpus:
$$P(w_i \mid t_i) \approx \frac{C(w_i, t_i)}{C(t_i)}$$ (formula 2)
$$P(t_i \mid t_{i-1}) \approx \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$ (formula 3)
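The count-based estimates of formulas 2 and 3 can be illustrated with a minimal Python sketch. The function name `estimate` and the input layout (a list of sentences, each a list of (token, role) pairs) are assumptions made for illustration and are not taken from the source.

```python
from collections import Counter

def estimate(tagged_sentences):
    """Estimate P(w|t) and P(t|t_prev) from role-tagged sentences by counting."""
    c_wt = Counter()  # C(w_i, t_i): token w_i seen with role t_i
    c_tt = Counter()  # C(t_{i-1}, t_i): role t_{i-1} immediately followed by t_i
    c_t = Counter()   # C(t_i): occurrences of role t_i
    for sentence in tagged_sentences:
        prev = None
        for token, role in sentence:
            c_wt[(token, role)] += 1
            c_t[role] += 1
            if prev is not None:
                c_tt[(prev, role)] += 1
            prev = role
    # Formula 2: P(w_i | t_i) ~ C(w_i, t_i) / C(t_i)
    emission = {(w, t): n / c_t[t] for (w, t), n in c_wt.items()}
    # Formula 3: P(t_i | t_{i-1}) ~ C(t_{i-1}, t_i) / C(t_{i-1})
    transition = {(p, t): n / c_t[p] for (p, t), n in c_tt.items()}
    return emission, transition

corpus = [[("with", "A"), ("Mao", "B"), ("Zedong", "Z"), ("comrade", "A")]]
emission, transition = estimate(corpus)
```

With this toy corpus role A occurs twice, so each A token receives an emission probability of 0.5, and the single A-to-B transition receives 0.5. A practical system would also need some smoothing for unseen events, which the source does not describe.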
As can be seen, in the role-tagging method the automatic role tagging problem is converted into the problem of minimizing the expression in formula 1. The Viterbi algorithm provides a dedicated and mature solution to this problem, which is not described in detail here. Automatic role tagging can thus be implemented with formulas 1, 2 and 3.
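As an illustration of how the minimization of formula 1 can be solved, the following is a minimal Viterbi sketch in Python. It is not the source's implementation: the function name `viterbi`, the dictionary layout of the probabilities, the uniform treatment of the first position (no start-of-sentence transition) and the `floor` probability for unseen events are all assumptions.

```python
import math

def viterbi(tokens, roles, emission, transition, floor=1e-8):
    """Return the role sequence minimizing the summed negative log probabilities."""
    if not tokens:
        return []

    def cost(p):
        # Negative log probability; unseen events get the assumed floor value.
        return -math.log(max(p, floor))

    # best[i][t] = (minimal cost of any path ending in role t at position i, backpointer)
    best = [{t: (cost(emission.get((tokens[0], t), 0.0)), None) for t in roles}]
    for i in range(1, len(tokens)):
        column = {}
        for t in roles:
            prev = min(roles, key=lambda p: best[i - 1][p][0]
                       + cost(transition.get((p, t), 0.0)))
            total = (best[i - 1][prev][0]
                     + cost(transition.get((prev, t), 0.0))
                     + cost(emission.get((tokens[i], t), 0.0)))
            column[t] = (total, prev)
        best.append(column)

    # Trace back from the cheapest final role.
    last = min(roles, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]

emission = {("Li", "B"): 0.9, ("Li", "A"): 0.1, ("Ming", "E"): 0.8,
            ("Ming", "A"): 0.2, ("said", "A"): 1.0}
transition = {("B", "E"): 0.9, ("E", "A"): 0.9, ("A", "A"): 0.8,
              ("A", "B"): 0.1, ("B", "A"): 0.1}
roles_found = viterbi(["Li", "Ming", "said"], ["A", "B", "E"], emission, transition)
```

The toy probabilities are invented for the example. Working with summed negative logarithms rather than products of probabilities avoids numerical underflow on long sentences.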
In the course of implementing the present invention, the inventors found that the prior art has at least the following problems:
(1) The existing role-tagging method depends on the set of context roles. For example, when the input string is "Liu vehement flat Baidu encyclopaedia" and the rough segmentation result is Liu/vehement/flat/Baidu/encyclopaedia, if the token "Baidu" does not have the role of a context following a name, the name "Liu vehement flat" cannot be recognized correctly. Moreover, the set of context roles of names is not a closed set but an open and extensible one, so it is very difficult to obtain a sufficient set of context roles, and names consequently fail to be recognized correctly.
(2) The probabilities trained by the existing role-tagging method depend on the training corpus. Because the training corpus is a closed set, the probabilities trained on it may be inaccurate, so that names cannot be recognized correctly.
(3) The existing role-tagging method provides insufficient support for recognizing transliterated names, especially names transliterated from English.
(4) The existing role-tagging method lacks a mechanism for rejecting falsely recognized names. When an error occurs during name recognition, it cannot be eliminated effectively, and the accuracy of name recognition leaves room for improvement.
Summary of the invention
Embodiments of the present invention provide a method and a device for Chinese personal name recognition, so as to recognize Chinese personal names accurately.
To achieve the above object, an embodiment of the present invention provides a method for Chinese personal name recognition, comprising:
obtaining an input sequence, and performing word segmentation on the input sequence;
performing role tagging on the segmented input sequence to obtain a role tag sequence;
matching the role tag sequence against name recognition patterns, and outputting the names thus formed.
After the role tag sequence is obtained, the method further comprises:
detecting the name-related roles in the role tag sequence, and correcting any name-related role that is wrong.
After the role tag sequence is obtained, the method further comprises:
splitting role U and role V in the role tag sequence.
Splitting role U in the role tag sequence comprises:
when the role following role U is C, E, G or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A.
Splitting role V in the role tag sequence comprises:
when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.
Matching the role tag sequence against name recognition patterns and outputting the names thus formed comprises:
performing maximum pattern matching, according to the name recognition patterns, on the role tag sequence obtained after role U and role V have been split, and outputting the names thus formed.
Performing maximum pattern matching on the role tag sequence obtained after role U and role V have been split, and outputting the names thus formed, comprises:
when the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a Chinese personal name;
when the name recognition pattern HE2, [H|X2] [I|X2]+[T|X2], X2T, X2 or Y occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a transliterated name.
Before the input sequence is obtained, the method further comprises: performing model training.
Performing role tagging on the segmented input sequence comprises: performing role tagging on the segmented input sequence according to the result of the model training.
Performing model training comprises:
obtaining an input corpus, and removing the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags in the input corpus to obtain a text corpus, and segmenting the text corpus with a word segmentation system to obtain a segmented corpus;
tagging the segmented corpus according to a role table to obtain a role-tagged corpus, wherein the role table contains no context role information;
obtaining a role transition corpus and a role emission corpus from the role-tagged corpus;
performing model training according to the role transition corpus and the role emission corpus.
A device for Chinese personal name recognition, comprising:
an acquisition module, configured to obtain an input sequence and perform word segmentation on the input sequence;
a role tagging module, configured to perform role tagging on the input sequence segmented by the acquisition module and obtain a role tag sequence;
a pattern matching module, configured to match the role tag sequence obtained by the role tagging module against name recognition patterns and output the names thus formed.
The device further comprises:
a role correction module, configured to detect the name-related roles in the role tag sequence obtained by the role tagging module and correct any name-related role that is wrong.
The device further comprises:
a role splitting module, configured to split role U and role V in the role tag sequence obtained by the role tagging module.
The role splitting module is specifically configured to: when the role following role U is C, E, G or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, split the content corresponding to role U into role A and role H; when the role following role U is any other role, split the content corresponding to role U into role A and role A;
and, when the role preceding role V is C or X, split the content corresponding to role V into role D and role A; when the role preceding role V is B, split the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, split the content corresponding to role V into role T and role A; when the role preceding role V is H, split the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, split the content corresponding to role V into role A and role A.
The pattern matching module is specifically configured to perform maximum pattern matching, according to the name recognition patterns, on the role tag sequence obtained after role U and role V have been split, and to output the names thus formed.
When the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a Chinese personal name;
when the name recognition pattern HE2, [H|X2] [I|X2]+[T|X2], X2T, X2 or Y occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a transliterated name.
The device further comprises:
a training module, configured to perform model training before the input sequence is obtained.
The role tagging module is further configured to perform role tagging on the segmented input sequence according to the result of the model training performed by the training module.
The training module is specifically configured to: obtain an input corpus, and remove the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; remove all part-of-speech tags in the input corpus to obtain a text corpus, and segment the text corpus with a word segmentation system to obtain a segmented corpus; tag the segmented corpus according to a role table to obtain a role-tagged corpus, wherein the role table contains no context role information; obtain a role transition corpus and a role emission corpus from the role-tagged corpus; and perform model training according to the role transition corpus and the role emission corpus.
Compared with the prior art, the present invention has the following advantages: by performing role tagging on the segmented input sequence, Chinese personal names and transliterated names can be recognized accurately; and because the role set proposed by the present invention does not depend on the set of context roles, the accuracy of Chinese personal name recognition is further improved.
Embodiment
The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a method for Chinese personal name recognition which, as shown in Figure 1, comprises the following steps:
Step 101: obtain an input sequence, and perform word segmentation on the input sequence.
Step 102: perform role tagging on the segmented input sequence, and obtain a role tag sequence.
After the role tag sequence is obtained, the method further comprises: detecting the name-related roles in the role tag sequence, and correcting any name-related role that is wrong.
After the role tag sequence is obtained, the method further comprises: splitting role U and role V in the role tag sequence. It should be noted that the operation of splitting role U and role V in the role tag sequence and the operation of detecting the name-related roles in the role tag sequence and correcting any that are wrong need not be performed in any particular order.
Specifically, splitting role U in the role tag sequence comprises: when the role following role U is C, E, G or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A.
Splitting role V in the role tag sequence comprises: when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.
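The splitting rules for role U and role V can be expressed as two lookup tables, as in the following Python sketch. The function name `split_uv` is hypothetical. The sketch also assumes that a U token is cut before its last character and a V token after its first character, and that the neighbouring role is read from the sequence as it stood before splitting; the source states only which roles the two parts receive.

```python
# Role of the second part of a split U token, keyed by the role that follows U.
U_RULES = {"C": "B", "E": "B", "G": "B", "Z": "B",
           "D": "C",
           "I": "H", "X2": "H", "E2": "H"}
# Role of the first part of a split V token, keyed by the role that precedes V.
V_RULES = {"C": "D", "X": "D",
           "B": "E",
           "I": "T", "X2": "T",
           "H": "E2"}

def split_uv(tagged):
    """Split every U and V token of a (token, role) sequence into two tokens."""
    result = []
    for i, (token, role) in enumerate(tagged):
        if role == "U":
            following = tagged[i + 1][1] if i + 1 < len(tagged) else None
            result.append((token[:-1], "A"))                          # context part
            result.append((token[-1:], U_RULES.get(following, "A")))  # name head
        elif role == "V":
            preceding = tagged[i - 1][1] if i > 0 else None
            result.append((token[:1], V_RULES.get(preceding, "A")))   # name tail
            result.append((token[1:], "A"))                           # context part
        else:
            result.append((token, role))
    return result

# Hypothetical tagged fragment: a V token preceded by role C.
fragment = [("邓", "B"), ("颖", "C"), ("超生", "V"), ("前", "A")]
split_fragment = split_uv(fragment)
```

Any neighbouring role that is not listed falls through to the default "A", which corresponds to the "any other role" case of the rules.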
Step 103: match the role tag sequence against name recognition patterns, and output the names thus formed. This comprises: performing maximum pattern matching, according to the name recognition patterns, on the role tag sequence obtained after role U and role V have been split, and outputting the names thus formed.
Specifically, when the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a Chinese personal name; when the name recognition pattern HE2, [H|X2] [I|X2]+[T|X2], X2T, X2 or Y occurs in the role tag sequence obtained after role U and role V have been split, the result of maximum pattern matching is that the corresponding content is a transliterated name.
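One way to realize maximum pattern matching over the role sequence is to encode each role as a single character and try every pattern at each position, keeping the longest match. The Python sketch below follows that idea. The function name `match_names`, the lower-case codes `e` and `x` for the two-character roles E2 and X2, and the choice to report pattern Y (which appears in both lists) as a Chinese name are assumptions made for illustration.

```python
import re

# Two-character roles are mapped to single lower-case codes (an assumption).
ENCODE = {"E2": "e", "X2": "x"}
CHINESE = ["BCD", "BE", "BG", "BZ", "FB", "Y", "XD", "FE"]
TRANSLITERATED = ["He", "[Hx][Ix]+[Tx]", "xT", "x", "Y"]
PATTERNS = ([(re.compile(p), "chinese") for p in CHINESE]
            + [(re.compile(p), "transliterated") for p in TRANSLITERATED])

def match_names(tagged):
    """Scan a (token, role) sequence and return (name, kind) for each longest match."""
    codes = "".join(ENCODE.get(role, role) for _, role in tagged)
    names, i = [], 0
    while i < len(codes):
        best_len, best_kind = 0, None
        for pattern, kind in PATTERNS:
            m = pattern.match(codes, i)          # anchored at position i
            if m and m.end() - i > best_len:     # keep the longest match
                best_len, best_kind = m.end() - i, kind
        if best_len:
            name = "".join(token for token, _ in tagged[i:i + best_len])
            names.append((name, best_kind))
            i += best_len
        else:
            i += 1
    return names

# Hypothetical role sequences after U/V splitting.
chinese_example = [("周", "B"), ("恩", "C"), ("来", "D"), ("和", "A"),
                   ("邓", "B"), ("颖", "C"), ("超", "D"), ("生", "A")]
translit_example = [("克", "H"), ("林", "I"), ("顿", "T")]
found_chinese = match_names(chinese_example)
found_translit = match_names(translit_example)
```

Because every remaining role is a single letter, one code character corresponds to one token, so a match length in the code string can be used directly as a token count.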
In the embodiment of the present invention, before the input sequence is obtained, the method further comprises performing model training. In this case, performing role tagging on the segmented input sequence comprises: performing role tagging on the segmented input sequence according to the result of the model training.
Specifically, performing model training comprises: obtaining an input corpus, and removing the nested annotation structures in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags in the input corpus to obtain a text corpus, and segmenting the text corpus with a word segmentation system to obtain a segmented corpus; tagging the segmented corpus according to a role table to obtain a role-tagged corpus, wherein the role table contains no context role information; obtaining a role transition corpus and a role emission corpus from the role-tagged corpus; and performing model training according to the role transition corpus and the role emission corpus.
As can be seen, in the method provided by the embodiment of the present invention, performing role tagging on the segmented input sequence allows Chinese personal names and transliterated names to be recognized accurately; and because the role set proposed by the present invention does not depend on the set of context roles, the accuracy of Chinese personal name recognition is further improved.
Embodiment 2 of the present invention provides a method for Chinese personal name recognition which, as shown in Figure 2, comprises the following steps:
Step 201: perform model training. In the process of Chinese personal name recognition, model training is needed to ensure the accuracy of role tagging, so that an accurate role tagging result is obtained for each word.
Specifically, in the embodiment of the present invention, model training can be performed with a class-based language model and with a role-tagging model. In the class-based language model, the three kinds of named entities, namely personal names, place names and organization names, are defined as three classes: PN (Person Name), LN (Location Name) and ON (Organization Name). Before training, personal names, place names and organization names are replaced with PN, LN and ON respectively, after which training proceeds as for an ordinary language model. The class-based language model comprises two sub-models, a context model P(C) (Context Model) and a named entity class model P(S|C) (Class Model); it is an existing and relatively mature technique and is not described in detail here. Model training with the role-tagging model is performed by using SegTag. Of course, in practical applications, model training can also be performed with other language models, which is not described further in the embodiment of the present invention.
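The replacement of named entities by their class labels before language model training can be sketched as follows in Python. The tag names `nr`, `ns` and `nt` for personal, place and organization names, and the function name `to_class_tokens`, are assumptions made for illustration; the source specifies only the class labels PN, LN and ON.

```python
# Assumed part-of-speech tags for the three entity kinds (hypothetical tag set).
CLASS_OF = {"nr": "PN", "ns": "LN", "nt": "ON"}

def to_class_tokens(tagged):
    """Replace each named-entity token by its class label; keep other tokens."""
    return [CLASS_OF.get(tag, token) for token, tag in tagged]

sentence = [("Hu Jintao", "nr"), ("visits", "v"), ("Beijing", "ns")]
class_tokens = to_class_tokens(sentence)
```

The resulting token sequence can then be fed to an ordinary n-gram training procedure in place of the original text.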
Specifically, as shown in Figure 3, in the embodiment of the present invention the training process of the role-tagging model comprises the following steps:
Step 301: preprocess the input corpus. The input corpus comes from a segmented and tagged corpus (for example, the People's Daily corpus of 2000). Preprocessing the input corpus consists of removing the nested annotation structures in the input corpus to obtain a non-nested annotated corpus. For example, the original input corpus is: with/p hair/nrf pool east/nrg comrade/n be/v1 representative/n /ud [China/ns Communist Party/n]nt. After the nested annotation structures are removed, the processed corpus (the non-nested annotated corpus) is: with/p hair/nrf pool east/nrg comrade/n be/v representative/n /u China/ns Communist Party/n. As can be seen, the input corpus contains nested annotation structures, such as v1 (a more specific part of speech within verbs), ud (a more specific part of speech within auxiliary words) and nt (an organization name, under which "China" and "Communist Party" are nested together). After they are removed, v (verb) and u (auxiliary word) are obtained, and the nesting of "China Communist Party" is released.
In addition, all part-of-speech tags in the input corpus are removed to obtain a text corpus; in the above case, the text corpus after processing is: "the Chinese Communist Party, with Comrade Mao Zedong as its representative". The text corpus is then segmented with a word segmentation system, and the result is the segmented corpus. This segmentation step performs word segmentation only, without any named entity recognition or part-of-speech tagging.
As can be seen, the preprocessing in this step yields two intermediate corpora: the non-nested annotated corpus and the segmented corpus.
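The two preprocessing operations, removing bracketed nesting and stripping part-of-speech tags, can be sketched with regular expressions in Python. The sketch assumes a line format of space-separated `token/tag` items with nested groups written as `[ ... ]tag`; the function names `remove_nesting` and `strip_tags` are hypothetical, and the coarsening of fine-grained tags (for example v1 to v) is not covered.

```python
import re

# "[a/ns b/n]nt" -> "a/ns b/n": drop the brackets and the tag of the group.
NESTED = re.compile(r"\[([^\[\]]+)\][A-Za-z0-9]+")
# "/tag" at the end of an item.
TAG = re.compile(r"/[A-Za-z0-9]+$")

def remove_nesting(line):
    """Return the line with bracketed groups flattened (non-nested annotation)."""
    return NESTED.sub(r"\1", line)

def strip_tags(line):
    """Return the raw text: nesting removed, tags dropped, items concatenated."""
    return "".join(TAG.sub("", item) for item in remove_nesting(line).split())

# Hypothetical annotated line in the assumed format.
line = "以/p 毛/nrf 泽东/nrg 同志/n 为/v 代表/n 的/u [中国/ns 共产党/n]nt"
flat = remove_nesting(line)
text = strip_tags(line)
```

The raw text produced by `strip_tags` would then be passed to the word segmentation system; that system is outside the scope of this sketch.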
Step 302: obtain the role-tagged corpus, that is, tag each token with a role according to the meaning of each role. After the segmented corpus is obtained, it can be tagged in this step according to the role table to obtain the role-tagged corpus. For the above example, the segmented corpus is: with/hair/pool east/comrade/be/representative//China/Communist Party, and the role-tagged corpus obtained directly from it is: with/A hair/B pool east/Z comrade/A be/A representative/A /A China/A Communist Party/A.
It should be noted that, to solve the prior-art problem of excessive dependence on the set of context roles, the role table used in the embodiment of the present invention contains no context role information. For example, removing the context role information shown in Table 2 from the prior-art role table shown in Table 1 gives the role table shown in Table 3, which is used in the embodiment of the present invention; the role-tagged corpus above is obtained on the basis of the role table shown in Table 3. Of course, the role table shown in Table 3 may also be modified and adjusted according to actual needs, which is not described further in the embodiment of the present invention.
Table 2
Role | Meaning | Example
K | Context preceding a name | again comethe family of Yu Hongyang.
L | Context following a name | reporter of the Xinhua News Agency Huang Wen take the photograph
M | Component between two names | playwright, screenwriter Shao Junlin withcheck blue or green saying
Table 3
Role | Meaning | Example
B | Surname of a Chinese personal name | ?Mr. Warburg Pincus
C | First character of the two-character given name of a Chinese personal name | ? ChinaFlat Mr.
D | Last character of the two-character given name of a Chinese personal name | Zhang Hua FlatSir
E | Single-character given name of a Chinese personal name | ? GreatSay: " I is Mr. Nice Guy "
F | Prefix | AlwaysLiu, LittleLee
G | Suffix | King Always, Liu Always, Xiao Family name, Wu Mother, leaf Handsome
U | Preceding context character that forms a word with the first character of a name | Here RelevantIt is herioc that it is trained
V | Last character of a name that forms a word with the following context character | Gong Xue EqualityLeader, Deng Ying Excusing from deathBefore
X | Surname of a Chinese personal name that forms a word with the first character of the two-character given name | Wang Guowei、
Z | Two-character given name of a Chinese personal name that is itself a word | Zhang Chaoyang
Y | Complete name that is itself a word | Peak、 Vast sea, Bush, Brian Special
H | First character of a transliterated name | GramIslington
I | Middle character of a transliterated name | History/ The base of a fruit/ Fragrant/ ·/ This/ Skin/ You/ Primary/ lattice
T | Last character of a transliterated name of three or more characters | Crin ?
E2 | Last character of a two-character transliterated name | General Capital
X2 | Characters inside a transliterated name that form a word | Oman reaches BillMoral
A | Role unrelated to names | Hu Jintao's cordiality visits child
As can be seen, compared with Table 1, the role table used in the embodiment of the present invention contains no roles for the context preceding a name, the context following a name or the component between two names. The process of obtaining the role-tagged corpus in this step therefore does not depend on the set of context roles; that is, the Chinese personal name recognition method proposed in the embodiment of the present invention does not depend on the set of context roles.
It should be noted that Table 3 is not limited to using the above letters for the corresponding meanings. For example, role C could instead represent the surname of a Chinese personal name and role B the first character of the two-character given name; that is, the correspondence between roles and meanings in Table 3 can be adjusted freely according to actual needs. Any combination of correspondences between roles and meanings falls within the scope of protection, and the embodiment of the present invention is described using the correspondence shown in Table 3 as an example.
It should be noted that, in the embodiment of the present invention, the role-tagged corpus is obtained by the SegTag method. Obtaining the role-tagged corpus by the SegTag method comprises:
(1) name information is recorded, wherein, the process of this record name information is specially: by scanning the every a line in the mark language material (obtaining in step 301) of non-nesting, records all positions (calculating of this position have ignored occurred space) of name appearance and the type etc. of this name in this row.Wherein, the type of this name mainly comprises: Chinese surname part, Chinese name part, two Chinese character length without surname Chinese personal name, more than the foreign name of two Chinese character length, the foreign name of two Chinese character length be noted as the nominal non-name of people etc.
(2) be each entry mark role, corresponding row is taken out in language material (obtaining in step 301) after participle, and the position at the position occurred each word and the nearest name place of the next one compares, according to each word and the relative position of name and the type of this name, mark corresponding role.
The SegTag method differs from existing corpus labeling methods in that it labels the segmented corpus (the output of the current word segmentation system), whereas existing methods label the standard segmentation of the input corpus. Because segmentation results may differ, the role model training corpus produced by SegTag may differ as well. For example, if the segmented corpus is with/Mao/Zedong/comrade/as/representative/of/China/Communist Party, the role-labeled corpus is with/A Mao/B Zedong/Z comrade/A as/A representative/A of/A China/A Communist Party/A; if the segmented corpus is with/Mao/Ze/Dong/comrade/as/representative/of/China/Communist Party, the role-labeled corpus is with/A Mao/B Ze/C Dong/D comrade/A as/A representative/A of/A China/A Communist Party/A.
As can be seen, because roles are labeled on the segmented corpus in the embodiment of the present invention, the role labeling result follows the behavior of the current word segmentation system, which improves the accuracy of name recognition. For example, suppose the current word segmentation system outputs with/Mao/Ze/Dong/comrade/as/representative/of/China/Communist Party, while the training corpus was labeled as with/A Mao/B Zedong/Z comrade/A as/A representative/A of/A China/A Communist Party/A; the name Mao Zedong may then fail to be recognized. Specifically, if the role model contains a probability for Zedong as role Z but none for Ze as role C or Dong as role D, Mao Zedong cannot be recognized correctly.
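The labeling in step (2) of SegTag can be illustrated with a minimal sketch. It is one possible reading of the description rather than the patented procedure: the `NameSpan` record, the function name `segtag`, and the restriction to a single Chinese name with a one-token surname are assumptions made for illustration, and the Chinese strings are reconstructed from the translated Mao Zedong example.

```python
from dataclasses import dataclass

@dataclass
class NameSpan:              # hypothetical record produced by SegTag step (1)
    start: int               # character offset of the surname, spaces ignored
    surname_len: int         # number of surname characters
    given_len: int           # number of given-name characters (1 or 2)

def segtag(tokens, name):
    """Assign one role per segmenter token by comparing character offsets."""
    s0 = name.start                      # first character of the surname
    g0 = s0 + name.surname_len           # first character of the given name
    end = g0 + name.given_len            # one past the last name character
    roles, pos = [], 0
    for tok in tokens:
        a, b = pos, pos + len(tok)       # token covers characters [a, b)
        pos = b
        if b <= s0 or a >= end:
            roles.append("A")            # entirely outside the name
        elif a < s0:
            roles.append("U")            # preceding context merged with name start
        elif b > end:
            roles.append("V")            # name end merged with following context
        elif (a, b) == (s0, end):
            roles.append("Y")            # the whole name is one token
        elif (a, b) == (s0, g0):
            roles.append("B")            # surname
        elif a == s0:
            roles.append("X")            # surname + first given-name character
        elif (a, b) == (g0, end):
            roles.append("Z" if name.given_len == 2 else "E")
        elif a == g0:
            roles.append("C")            # first character of a two-character given name
        else:
            roles.append("D")            # last character of a two-character given name
    return roles

# "以/毛/泽东/同志": with / Mao / Zedong / comrade
print(segtag(["以", "毛", "泽东", "同志"], NameSpan(1, 1, 2)))
# "以/毛/泽/东/同志": the same text under a different segmentation
print(segtag(["以", "毛", "泽", "东", "同志"], NameSpan(1, 1, 2)))
```

The same name yields the roles B Z under the first segmentation and B C D under the second, which is the dependence on the segmenter output described above.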
Step 303: extract the training files and the dictionary. The role-labeled corpus obtained in step 302 is processed to obtain a role transition corpus and a role emission corpus. The role transition corpus is obtained by removing all tokens and keeping only their roles; the role emission corpus is obtained by placing each role and its token on a separate line. For example, if the role-labeled corpus is with/A Mao/B Ze/C Dong/D comrade/A as/A representative/A of/A China/A Communist Party/A, the extracted role transition corpus is: A B C D A A A A A A. The extracted role emission corpus is:
A with
B Mao
C Ze
D Dong
A comrade
A as
A representative
A of
A China
A Communist Party
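The extraction in step 303 can be sketched as follows. The sketch assumes that a role-labeled line is stored as whitespace-separated `token/ROLE` items; that storage format and the function name `extract_corpora` are assumptions, and the Chinese tokens are reconstructed from the translated example.

```python
def extract_corpora(labeled_line):
    """Split one role-labeled line into a transition line and emission lines."""
    pairs = []
    for item in labeled_line.split():
        # rpartition keeps any "/" inside the token and splits on the last one
        token, _, role = item.rpartition("/")
        pairs.append((role, token))
    transition_line = " ".join(role for role, _ in pairs)         # roles only
    emission_lines = [f"{role} {token}" for role, token in pairs]  # one pair per line
    return transition_line, emission_lines

line = "以/A 毛/B 泽/C 东/D 同志/A 为/A 代表/A 的/A 中国/A 共产党/A"
transition, emission = extract_corpora(line)
print(transition)
print("\n".join(emission))
```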
In addition, an initial role dictionary can be obtained together with the role transition corpus and the role emission corpus. A basic role dictionary is extracted from the training corpus of the role labeling model and is then progressively refined and expanded. For example, a name corpus containing a large number of Chinese and transliterated personal names is readily obtained from the role labeling model, and statistics over this corpus yield a fairly accurate list of the characters commonly used in names together with the set of roles each character commonly takes. In practical applications the role dictionary also needs to be refined and expanded step by step according to the recognition errors found, so as to obtain a high-quality role dictionary; details are not repeated here.
Step 304: train the model. From Formula 1 used by the prior-art role labeling method, it can be seen that training needs to estimate two types of probability: $p(w_i \mid t_i)$ and $p(t_i \mid t_{i-1})$. Here $p(w_i \mid t_i)$ is the probability of token $w_i$ within the set of tokens whose role is $t_i$, i.e. the role emission probability; $p(t_i \mid t_{i-1})$ is the probability of moving from role $t_{i-1}$ to role $t_i$, i.e. the role transition probability. The model can therefore be trained on the role transition corpus and the role emission corpus obtained in step 303, which yields the training result.
In the embodiment of the present invention, the role transition probabilities and the role emission probabilities are smoothed with the Katz smoothing algorithm, which addresses the data sparseness problem that a probability model obtained by maximum likelihood estimation inevitably encounters. Katz smoothing is an existing algorithm and is not described further in the embodiment of the present invention.
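As an illustration of step 304, the sketch below estimates both probability tables by plain relative frequencies (maximum likelihood) from a role transition corpus and a role emission corpus. It deliberately omits Katz smoothing, so unseen events receive no probability mass; the function name `train` and the nested-dictionary layout are assumptions.

```python
from collections import Counter, defaultdict

def train(transition_lines, emission_lines):
    """Unsmoothed estimates of p(t_i | t_{i-1}) and p(w_i | t_i)."""
    trans_counts = defaultdict(Counter)   # trans_counts[prev_role][role]
    emit_counts = defaultdict(Counter)    # emit_counts[role][token]
    for line in transition_lines:
        roles = line.split()
        for prev, cur in zip(roles, roles[1:]):
            trans_counts[prev][cur] += 1
    for line in emission_lines:
        role, token = line.split(maxsplit=1)
        emit_counts[role][token] += 1

    def normalise(table):
        # Divide each count by the total count of its conditioning role.
        return {given: {x: c / sum(cnt.values()) for x, c in cnt.items()}
                for given, cnt in table.items()}

    return normalise(trans_counts), normalise(emit_counts)

trans, emit = train(
    ["A B C D A A A A A A"],
    ["A 以", "B 毛", "C 泽", "D 东", "A 同志", "A 为", "A 代表", "A 的", "A 中国", "A 共产党"],
)
print(trans["A"])   # relative frequencies of the roles that follow role A
print(emit["B"])    # relative frequencies of the tokens emitted by role B
```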
As can be seen, steps 301 to 304 yield the result of model training, which can be used directly in the subsequent Chinese personal name recognition.
Step 202: recognize Chinese personal names according to the result of model training. In Chinese information processing, the Chinese personal names in the text need to be recognized as part of the processing.
Specifically, as shown in Figure 4, in the embodiment of the present invention the process of recognizing Chinese personal names according to the result of model training comprises the following steps:
Step 401: segment the input sentence to obtain the segmented sentence. The embodiment of the present invention is described using the input sentence "Articles used by Zhou Enlai and Deng Yingchao during their lifetimes are displayed in the hall" as an example. One possible segmentation of this sentence is: hall/inside/display/Zhou/En/Lai/and/Deng/Ying/Chaosheng/before/use/past/of/articles, where Chaosheng is a single token formed by Chao, the last character of the name Deng Yingchao, and the following character sheng.
Step 402: for the segmented sentence, obtain the role label sequence with the highest probability by applying the role labeling model with the Viterbi algorithm. For the example above, the role labeling result is: hall/A inside/A display/A Zhou/B En/C Lai/D and/A Deng/B Ying/C Chaosheng/V before/A use/A past/A of/A articles/A. The role labeling can be carried out according to the result of the model training described above (for example, using the role dictionary obtained through that training). This step is an existing processing technique and is not described in detail in the embodiment of the present invention.
It should be noted that this step may be performed by a role labeling module; other entities may also be used according to actual needs, which is not described further here.
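For readers unfamiliar with the decoding in step 402, a compact Viterbi sketch over roles is given below. It is a generic textbook formulation, not the implementation of the embodiment: it assumes a uniform start distribution, uses a small floor probability `floor` as a stand-in for smoothing, and takes the probability tables in a nested-dictionary form chosen for illustration. The toy model values are hypothetical.

```python
import math

def viterbi(tokens, roles, trans, emit, floor=1e-8):
    """Return the most probable role sequence for tokens (log domain)."""
    def lp(table, given, x):
        # Log probability with a floor for events missing from the table.
        return math.log(table.get(given, {}).get(x, floor))

    # best[i][r] = (log-probability of the best path ending in r, backpointer)
    best = [{r: (lp(emit, r, tokens[0]), None) for r in roles}]
    for tok in tokens[1:]:
        prev, col = best[-1], {}
        for r in roles:
            score, back = max((prev[q][0] + lp(trans, q, r), q) for q in roles)
            col[r] = (score + lp(emit, r, tok), back)
        best.append(col)

    # Trace the backpointers from the best final role.
    path = [max(roles, key=lambda r: best[-1][r][0])]
    for col in reversed(best[1:]):
        path.append(col[path[-1]][1])
    return path[::-1]

# Toy model with hypothetical probabilities: "看/张/伟" (see / Zhang / Wei)
trans = {"A": {"A": 0.6, "B": 0.4}, "B": {"E": 1.0}, "E": {"A": 1.0}}
emit = {"A": {"看": 0.5, "见": 0.5}, "B": {"张": 1.0}, "E": {"伟": 1.0}}
print(viterbi(["看", "张", "伟"], ["A", "B", "E"], trans, emit))
```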
For tokens that do not appear in the role dictionary (i.e. entries unregistered in the role dictionary), the embodiment of the present invention proposes an effective method for guessing their roles. The method guesses the role of a token from its length and the characters it is composed of, as follows:
(1) If the unregistered entry has an odd number of bytes, has 6 or more bytes, or contains a non-Chinese character, its role is directly determined to be the name-unrelated role A.
(2) If the unregistered entry is a single Chinese character, the candidate roles are A|C|D|E.
(3) If the unregistered entry consists of two Chinese characters, the candidate roles are A|X|Z.
It should be noted that the characters used in transliterated names are relatively concentrated, so the role dictionary already contains most common transliteration characters and their usual roles. The guessing method above is therefore aimed mainly at possible Chinese personal name roles.
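The three guessing rules can be sketched as a small function. The byte counts in the rules presuppose an encoding with two bytes per Chinese character; the sketch restates them in terms of characters instead (an odd byte count then implies a non-Chinese character, and 6 or more bytes means three or more characters). The Unicode range used to detect Chinese characters and the function names are assumptions.

```python
def is_chinese(ch):
    # CJK Unified Ideographs basic block; an approximation of "Chinese character".
    return "\u4e00" <= ch <= "\u9fff"

def guess_roles(token):
    """Candidate roles for a token that is missing from the role dictionary."""
    if not token or not all(is_chinese(ch) for ch in token):
        return {"A"}                      # rule (1): contains a non-Chinese character
    if len(token) == 1:
        return {"A", "C", "D", "E"}       # rule (2): single Chinese character
    if len(token) == 2:
        return {"A", "X", "Z"}            # rule (3): two Chinese characters
    return {"A"}                          # rule (1): three or more characters

print(sorted(guess_roles("周")))
print(sorted(guess_roles("泽东")))
print(sorted(guess_roles("共产党")))
```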
Step 403: detect name recognition roles in the role label sequence that may lead to recognition errors, and correct the corresponding roles. This step may be performed by a role correction module; other entities may also be used according to actual needs, which is not described further here.
In this step, suppose the name string to be recognized is $w_m w_{m+1} \dots w_n$ and the corresponding name pattern is $t_m t_{m+1} \dots t_n$, where $m \ge 0$ and $n \ge m+2$. The words immediately before and after the name are $w_{m-1}$ and $w_{n+1}$ respectively. In the embodiment of the present invention, the probabilities of two paths are compared to decide whether the name is recognized. The two paths are:
Path 1 (name path, PN_PATH):
$P(w_{m-1} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to w_{n+1}) \cdot P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$
Path 2 (non-name path, NOT_PN_PATH):
$P(w_{m-1} \to w_m) \cdot P(w_m \to w_{m+1}) \cdots P(w_n \to w_{n+1})$
Specifically, path 1 is the path on which $w_m w_{m+1} \dots w_n$ is recognized as a name. Its first factor, $P(w_{m-1} \to \mathrm{PN})$, is the transition probability from the word preceding the name to the name; its second factor, $P(\mathrm{PN} \to w_{n+1})$, is the transition probability from the name to the word following it; its third factor, $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$, is the probability that the recognized name is $w_m w_{m+1} \dots w_n$. The probability of path 1 is the product of these three factors.
Path 2 is the path on which $w_m w_{m+1} \dots w_n$ is not recognized as a name; its probability is the product of the transition probabilities between adjacent tokens on the path.
Further, comparing the probabilities of the two paths to decide whether the name is recognized comprises: if the probability of path 1 is not less than the probability of path 2, $w_m w_{m+1} \dots w_n$ is recognized as a name; otherwise, recognizing $w_m w_{m+1} \dots w_n$ as a name would probably be a misrecognition, so it is not recognized as a name and the roles corresponding to $w_m w_{m+1} \dots w_n$ are relabeled as the name-unrelated role A.
It should be noted that the two paths are computed as follows: (1) the third factor of path 1, $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$, is computed from the role emission probabilities, i.e. $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN}) = p(t_m \to w_m) \cdot p(t_{m+1} \to w_{m+1}) \cdots p(t_n \to w_n)$, where $p(t_i \to w_i)$ is the probability that role $t_i$ emits token $w_i$; (2) the other factors of path 1 and all factors of path 2 can be obtained from a class-based language model. In practical applications, path 1 multiplies one more probability than path 2; a weight factor $w$ (a scalar, distinct from the tokens $w_i$) may therefore be placed in front of $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$ according to actual needs, to make the comparison of the two paths more accurate. Both paths can be computed with existing techniques, which are not described in detail in the embodiment of the present invention.
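The comparison of path 1 and path 2 can be sketched in the log domain, which avoids underflow when many small probabilities are multiplied. The function names are hypothetical, the weight factor is read here as a multiplicative factor on the emission term, and all numbers in the usage example are made up for illustration; in the embodiment the transition probabilities would come from the class-based language model.

```python
import math

def name_path_logp(p_in, p_out, emissions, weight=1.0):
    """log of weight * P(w_{m-1}->PN) * P(PN->w_{n+1}) * product of emissions."""
    return (math.log(weight) + math.log(p_in) + math.log(p_out)
            + sum(math.log(p) for p in emissions))

def non_name_path_logp(transitions):
    """log of the product of the word-to-word transition probabilities."""
    return sum(math.log(p) for p in transitions)

def keep_name(p_in, p_out, emissions, transitions, weight=1.0):
    """True if path 1 is at least as probable as path 2."""
    return (name_path_logp(p_in, p_out, emissions, weight)
            >= non_name_path_logp(transitions))

# A three-token candidate name: 3 emission factors on path 1 and
# 4 word-to-word transitions on path 2 (hypothetical values).
print(keep_name(0.1, 0.2, [0.5, 0.4, 0.3], [0.1, 0.1, 0.1, 0.1]))
print(keep_name(0.1, 0.2, [0.5, 0.4, 0.3], [0.1, 0.1, 0.1, 0.1], weight=0.01))
```

With these numbers, path 1 has 5 factors and path 2 has 4, matching the observation that path 1 multiplies one more probability than path 2.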
In addition, misrecognition is relatively likely under the BCD pattern and the XD pattern. To correct possible recognition errors more accurately, a third comparison path, called PN_PATH2, is added for these two patterns:
Path 3 (name path 2, PN_PATH2):
$P(w_{m-1} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to w_n) \cdot P(w_m w_{m+1} \dots w_{n-1} \mid \mathrm{PN}) \cdot P(w_n \to w_{n+1})$
Specifically, path 3 is the path on which $w_m w_{m+1} \dots w_{n-1}$ is recognized as a name. Its first three factors have meanings analogous to the three factors of path 1 and are not described again here. Its fourth factor, $P(w_n \to w_{n+1})$, is the transition probability from $w_n$, the word following the name, to the next word $w_{n+1}$; this transition probability can likewise be obtained from the class-based language model.
In the embodiment of the present invention, if the probability of path 3 is greater than both the probability of path 1 and the probability of path 2, the roles $t_{n-1}$ and $t_n$ corresponding to $w_{n-1}$ and $w_n$ need to be corrected. When the pattern being corrected is BCD, $t_{n-1}$ is changed to role E and $t_n$ to role A; when the pattern being corrected is XD, $t_{n-1}$ is changed to role Y and $t_n$ to role A.
To explain this step more clearly, the example above is continued. The role label sequence obtained by role labeling in the preceding step is: hall/A inside/A display/A Zhou/B En/C Lai/D and/A Deng/B Ying/C Chaosheng/V before/A use/A past/A of/A articles/A. Since the BCD pattern occurs here and falls within the set of patterns to be corrected, the role correction module needs to compare the probabilities of the three paths and derive the final correction result.
Specifically, path 1: $P(\text{display} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to \text{and}) \cdot P(\text{Zhou En Lai} \mid \mathrm{PN})$, where $P(\text{Zhou En Lai} \mid \mathrm{PN}) = P(\text{Zhou} \mid B) \cdot P(\text{En} \mid C) \cdot P(\text{Lai} \mid D)$.
Path 2: $P(\text{display} \to \text{Zhou}) \cdot P(\text{Zhou} \to \text{En}) \cdot P(\text{En} \to \text{Lai}) \cdot P(\text{Lai} \to \text{and})$.
Path 3: $P(\text{display} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to \text{Lai}) \cdot P(\text{Zhou En} \mid \mathrm{PN}) \cdot P(\text{Lai} \to \text{and})$, where $P(\text{Zhou En} \mid \mathrm{PN}) = P(\text{Zhou} \mid B) \cdot P(\text{En} \mid E)$.
In summary: (1) if the probability of path 1 is the largest, the name is recognized and no role needs to be modified; (2) if the probability of path 2 is the largest, the roles of all tokens of the candidate name are changed to A, i.e. the roles of the three tokens Zhou, En and Lai are changed to the name-unrelated role A; (3) if the probability of path 3 is the largest, the roles of En and Lai are changed to role E and role A respectively. The computation of each path probability and the final comparison are not described in detail for this step.
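The three-way decision can be summarised in a short sketch that takes the already computed path probabilities and returns the corrected roles of the candidate name. The function name `correct_roles` and the convention of passing `None` when path 3 does not apply are assumptions; the probability values in the usage lines are hypothetical.

```python
def correct_roles(roles, p1, p2, p3=None):
    """Correct the roles of w_m..w_n from the path probabilities.

    roles: roles of the candidate name, e.g. ["B", "C", "D"]
    p1, p2: probabilities of path 1 (name) and path 2 (not a name)
    p3: probability of path 3, used only for the BCD and XD patterns
    """
    pattern = "".join(roles)
    if p3 is not None and pattern in ("BCD", "XD") and p3 > p1 and p3 > p2:
        # The name ends one token earlier: the last token becomes context.
        tail = ["E", "A"] if pattern == "BCD" else ["Y", "A"]
        return roles[:-2] + tail
    if p1 >= p2:
        return list(roles)             # keep the name as labeled
    return ["A"] * len(roles)          # not a name: all roles become A

print(correct_roles(["B", "C", "D"], 0.3, 0.2, 0.1))   # path 1 largest
print(correct_roles(["B", "C", "D"], 0.1, 0.3, 0.2))   # path 2 largest
print(correct_roles(["B", "C", "D"], 0.1, 0.2, 0.3))   # path 3 largest
```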
Step 404: exclude or correct misrecognized names using preset conditions. This step may be performed by a rule checking module; other entities may also be used according to actual needs, which is not described further here.
Specifically, the preset conditions comprise:
(1) In the BCD pattern, if the Chinese character with role D is "and" (he) and the string immediately following it is recognized as a name, then "and" is a conjunction here; the BCD pattern is changed to the BE pattern and the original BC part is recognized as the name. For example, in "he saw Guo Quan and Zhao Tao fighting", the names initially recognized may be Guo Quanhe (BCD pattern, with the conjunction he, "and", taken as the last character) and Zhao Tao. Applying the condition above corrects the misrecognized Guo Quanhe to Guo Quan, with "and" treated as a conjunction.
(2) A transliterated name meeting either of the following conditions is not recognized: it is longer than 16 Chinese characters (for example, "Andrew Jefferson Karstlo Bill Gates Bauer Mo Qiaobusibulin Page" might be recognized as a single transliterated name and must be excluded); or it contains 3 or more consecutive identical characters (for example, A Aaluo).
(3) If a recognized name is immediately preceded or followed by a pause mark ("、"), and the word on the other side of the pause mark is determined not to be a name, the name is not recognized. The reason is that a pause mark usually forms the left or right boundary of a name: if a word delimited by a pause mark is a name, a name should also appear on the other side of the mark; otherwise the name must be excluded.
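Conditions (1) and (2) lend themselves to a short sketch. The function names and the `next_is_name` flag are hypothetical, and the Chinese characters used for the names in the usage example are chosen only for illustration, since the text gives the names in romanised form.

```python
import re

def violates_transliteration_rules(name):
    """Condition (2): longer than 16 characters, or 3+ identical characters in a row."""
    return len(name) > 16 or re.search(r"(.)\1\1", name) is not None

def fix_conjunction(tokens, roles, i, next_is_name, conjunction="和"):
    """Condition (1): tokens[i:i+3] carry B, C, D and the D token is the conjunction.

    If a name follows immediately, BCD becomes BE and the conjunction gets role A.
    """
    if (roles[i:i + 3] == ["B", "C", "D"]
            and tokens[i + 2] == conjunction and next_is_name):
        return roles[:i] + ["B", "E", "A"] + roles[i + 3:]
    return list(roles)

# "他/看见/郭/全/和/赵/涛": he / saw / Guo / Quan / and / Zhao / Tao
tokens = ["他", "看见", "郭", "全", "和", "赵", "涛"]
roles = ["A", "A", "B", "C", "D", "B", "E"]
print(fix_conjunction(tokens, roles, 2, next_is_name=True))
print(violates_transliteration_rules("阿阿阿罗"))
```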
Step 405: split roles U (the single character preceding a name forms a word with the first character of the name) and V (the last character of a name forms a word with the following single character) in the role label sequence. The role label sequence obtained in the preceding steps is refined in this step by splitting roles U and V, which yields a more accurate role label sequence. This step may be performed by a role splitting module; other entities may also be used according to actual needs, which is not described further here.
It should be noted that steps 403, 404 and 405 have no fixed order; the order above is used only for description in the embodiment of the present invention. In practical applications the order may be adjusted according to actual needs. For example, the splitting of roles U and V in step 405 may be performed first, followed by the detection and correction of error-prone name recognition roles in step 403, and then by the exclusion or correction of misrecognized names using the preset conditions in step 404. Details are not repeated in the embodiment of the present invention.
Specifically, to handle the case where a name combines with its context to form a word, roles U and V need to be split. The splitting follows the method for role U shown in Table 4 and the method for role V shown in Table 5; the contents of Tables 4 and 5 may also be adjusted and modified according to actual needs, which is not described further here.
Table 4: Splitting method for role U
Role following role U | Split result
C, E, G, Z | AB
D | AC
I, X2, E2 | AH
Other roles | AA
Table 5: Splitting method for role V
Role preceding role V | Split result
C, X | DA
B | EA
I, X2 | TA
H | E2A
Other roles | AA
To explain this step more clearly, the example above is continued. In the role label sequence hall/A inside/A display/A Zhou/B En/C Lai/D and/A Deng/B Ying/C Chaosheng/V before/A use/A past/A of/A articles/A, role V needs to be split. The role preceding role V is role C, so the split result is DA, and the role labeling result after splitting is: hall/A inside/A display/A Zhou/B En/C Lai/D and/A Deng/B Ying/C Chao/D sheng/A before/A use/A past/A of/A articles/A.
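Tables 4 and 5 map directly onto two lookup tables. The sketch below is one way to apply them, under the assumptions that a U or V token has at least two characters, that a U token is cut before its last character and a V token after its first character, and that neighbouring roles are read from the sequence before splitting. The names `U_SPLIT`, `V_SPLIT` and `split_uv` are hypothetical.

```python
# Table 4: role following U -> roles of the two parts
U_SPLIT = {"C": ("A", "B"), "E": ("A", "B"), "G": ("A", "B"), "Z": ("A", "B"),
           "D": ("A", "C"),
           "I": ("A", "H"), "X2": ("A", "H"), "E2": ("A", "H")}
# Table 5: role preceding V -> roles of the two parts
V_SPLIT = {"C": ("D", "A"), "X": ("D", "A"),
           "B": ("E", "A"),
           "I": ("T", "A"), "X2": ("T", "A"),
           "H": ("E2", "A")}

def split_uv(tokens, roles):
    """Split every U and V token in two and relabel the parts."""
    out_tokens, out_roles = [], []
    for i, (tok, role) in enumerate(zip(tokens, roles)):
        if role == "U":
            nxt = roles[i + 1] if i + 1 < len(roles) else None
            pair, cut = U_SPLIT.get(nxt, ("A", "A")), len(tok) - 1
        elif role == "V":
            prv = roles[i - 1] if i > 0 else None
            pair, cut = V_SPLIT.get(prv, ("A", "A")), 1
        else:
            out_tokens.append(tok)
            out_roles.append(role)
            continue
        out_tokens += [tok[:cut], tok[cut:]]   # context part and name part
        out_roles += list(pair)
    return out_tokens, out_roles

# "邓/颖/超生/前": Deng / Ying / Chaosheng / before
print(split_uv(["邓", "颖", "超生", "前"], ["B", "C", "V", "A"]))
```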
Step 406: match the role label sequence after splitting against the name recognition patterns, output the names formed, and record the position of each name in the sentence. This step may be performed by a pattern matching module; other entities may also be used according to actual needs, which is not described further here.
Specifically, the name recognition patterns are shown in Table 6; the contents of Table 6 may also be adjusted and modified according to actual needs, which is not described further here.
Table 6: Name recognition pattern set
Type | Pattern
Chinese personal name recognition patterns | BCD, BE, BG, BZ, FB, Y, XD, FE
Transliterated name recognition patterns | HE2, [H|X2][I|X2]+[T|X2], X2T, X2, Y
Further, in this step, maximal pattern matching is applied to the role label sequence after splitting according to the name recognition patterns: whenever the sequence contains content corresponding to a pattern in the name recognition pattern set, the longest matching pattern is taken. For example, when the role label sequence contains the Chinese personal name recognition pattern BCD, the result of maximal pattern matching is BCD (rather than a shorter partial match such as BC or CD). The embodiment of the present invention therefore matches the role label sequence using maximal pattern matching, which is not described further here.
Continuing the example above, the role label sequence after splitting is: hall/A inside/A display/A Zhou/B En/C Lai/D and/A Deng/B Ying/C Chao/D sheng/A before/A use/A past/A of/A articles/A. After maximal pattern matching, the names recognized are Zhou Enlai (BCD pattern) and Deng Yingchao (BCD pattern). In addition, the transliterated name recognition pattern [H|X2][I|X2]+[T|X2] has the form of a regular expression: if the role of the first token is H or X2, the middle roles are one or more occurrences of I or X2, and the role of the last token is X2 or T, the sequence can be recognized as a transliterated name.
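Maximal pattern matching over Table 6 can be sketched with regular expressions. To let one character stand for one role, the two-character roles E2 and X2 are encoded as lowercase `e` and `x`; this encoding, the function name `match_names`, and the use of token indices as positions are assumptions. Pattern Y appears in both rows of Table 6 and is listed once here. The Chinese tokens are reconstructed from the translated example.

```python
import re

ENCODE = {"E2": "e", "X2": "x"}          # one character per role
CHINESE = ["BCD", "BE", "BG", "BZ", "FB", "XD", "FE"]
TRANSLITERATED = ["He", "[Hx][Ix]+[Tx]", "xT", "x"]
PATTERNS = ([("chinese", re.compile(p)) for p in CHINESE]
            + [("transliterated", re.compile(p)) for p in TRANSLITERATED]
            + [("whole-word name", re.compile("Y"))])

def match_names(tokens, roles):
    """Return (name, index of first token, kind) for each longest match."""
    s = "".join(ENCODE.get(r, r) for r in roles)
    names, i = [], 0
    while i < len(s):
        best = None
        for kind, rx in PATTERNS:
            m = rx.match(s, i)               # anchored at position i
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())       # keep the longest match
        if best is None:
            i += 1
            continue
        kind, end = best
        names.append(("".join(tokens[i:end]), i, kind))
        i = end
    return names

tokens = ["馆", "内", "陈列", "周", "恩", "来", "和", "邓", "颖", "超",
          "生", "前", "使用", "过", "的", "物品"]
roles = ["A", "A", "A", "B", "C", "D", "A", "B", "C", "D",
         "A", "A", "A", "A", "A", "A"]
print(match_names(tokens, roles))
```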
It should be noted that, in the embodiment of the present invention, the role labeling module, role correction module, rule checking module, role splitting module and pattern matching module described above may be combined into one or more modules, or further split into multiple submodules, according to actual needs.
The order of steps 401 to 406 may also be adjusted according to actual needs, which is not described further here.
As can be seen, the method provided by the embodiment of the present invention makes full use of the advantages of the role labeling model and the class-based language model, and can recognize Chinese and transliterated personal names more accurately while maintaining a high recall rate. Moreover, the role set proposed in the present invention does not rely on contextual roles, so the name recognition system achieves a higher name recognition recall rate for a given amount of training data. In addition, the present invention provides a practical strategy for guessing the roles of unregistered entries, which can effectively guess the roles of most unregistered entries and, to a certain extent, reduces the negative effect of unregistered words on name recognition.
The embodiment of the present invention also provides a device for Chinese personal name recognition which, as shown in Figure 5, comprises:
An acquisition module 51, configured to obtain an input sequence and perform word segmentation on the input sequence.
A role labeling module 52, configured to perform role labeling on the input sequence segmented by the acquisition module 51 and obtain a role label sequence.
A pattern matching module 53, configured to perform maximal pattern matching on the role label sequence obtained by the role labeling module 52 according to the name recognition patterns, and output the names formed.
The embodiment of the present invention also provides a device for Chinese personal name recognition which, as shown in Figure 6, comprises:
An acquisition module 61, configured to obtain an input sequence and perform word segmentation on the input sequence.
A role labeling module 62, configured to perform role labeling on the input sequence segmented by the acquisition module 61 and obtain a role label sequence.
A pattern matching module 63, configured to perform maximal pattern matching on the role label sequence obtained by the role labeling module 62 according to the name recognition patterns, and output the names formed.
A role correction module 64, configured to check the name recognition roles in the role label sequence obtained by the role labeling module and correct the name recognition roles that are in error.
A role splitting module 65, configured to split role U and role V in the role label sequence obtained by the role labeling module.
The role splitting module 65 is specifically configured to: when the role following role U is C, E, G or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, split the content corresponding to role U into role A and role H; when the role following role U is any other role, split the content corresponding to role U into role A and role A;
when the role preceding role V is C or X, split the content corresponding to role V into role D and role A; when the role preceding role V is B, split the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, split the content corresponding to role V into role T and role A; when the role preceding role V is H, split the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, split the content corresponding to role V into role A and role A.
Further, the pattern matching module 63 is specifically configured to perform maximal pattern matching, according to the name recognition patterns, on the role label sequence in which roles U and V have been split, and to output the names formed. When the role label sequence after splitting roles U and V contains the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE, the content matched by maximal pattern matching is a Chinese personal name; when it contains the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2 or Y, the content matched by maximal pattern matching is a transliterated name.
A training module 66, configured to train the model before the input sequence is obtained; the role labeling module 62 is further configured to perform role labeling on the segmented input sequence according to the result of the model training of the training module 66.
Specifically, the training module 66 is configured to: obtain an input corpus and remove the nested annotation structures in it to obtain a non-nested annotated corpus; remove all part-of-speech tags from the input corpus to obtain a text corpus, and segment the text corpus with a word segmentation system to obtain a segmented corpus; label the segmented corpus according to a role table to obtain a role-labeled corpus, where the role table contains no contextual role information; obtain a role transition corpus and a role emission corpus from the role-labeled corpus; and train the model on the role transition corpus and the role emission corpus.
As can be seen, the device provided by the embodiment of the present invention makes full use of the advantages of the role labeling model and the class-based language model, and can recognize Chinese and transliterated personal names more accurately while maintaining a high recall rate. Moreover, the role set proposed in the present invention does not rely on contextual roles, so the name recognition system achieves a higher name recognition recall rate for a given amount of training data. In addition, the present invention provides a practical strategy for guessing the roles of unregistered entries, which can effectively guess the roles of most unregistered entries and, to a certain extent, reduces the negative effect of unregistered words on name recognition.
From the description of the embodiments above, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although the former is the better implementation in many cases. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, can be embodied as a software product. The computer software product is stored on a storage medium and includes instructions that cause a terminal device (which may be a mobile phone, personal computer, server, network device, etc.) to perform the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention. It should be pointed out that those of ordinary skill in the art can make improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.
Those skilled in the art will understand that the modules of the device in an embodiment may be distributed in the device as described in the embodiment, or may be changed accordingly and located in one or more devices different from that of the present embodiment. The modules of the embodiments above may be integrated or deployed separately, and may be merged into one module or further split into multiple submodules.
The sequence numbers of the embodiments above are for description only and do not indicate the relative merits of the embodiments.
The above discloses only several specific embodiments of the present invention, but the present invention is not limited thereto; any variation that those skilled in the art can conceive of shall fall within the scope of protection of the present invention.