CN102033879B - Method and device for identifying Chinese name - Google Patents



Publication number
CN102033879B
Authority
CN
China
Prior art keywords
role
name
language material
character
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN200910177127.XA
Other languages
Chinese (zh)
Other versions
CN102033879A (en
Inventor
罗长升
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN200910177127.XA
Publication of CN102033879A
Application granted
Publication of CN102033879B
Legal status: Active


Abstract

The invention discloses a method and a device for identifying a Chinese name. The method comprises the following steps of: acquiring an input sequence and performing word segmentation on the input sequence; performing role labeling on the input sequence subjected to word segmentation and acquiring a role labeling sequence; detecting name identifying roles in the role labeling sequence and correcting wrong name identifying roles; and matching the role labeling sequence according to a name identifying mode and outputting a formed name. By the method and the device, the Chinese name and a transliterated name can be accurately identified.

Description

Method and device for identifying Chinese names
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and a device for identifying Chinese names.
Background art
Chinese information processing refers to the processing by computer of information such as the sound, form and meaning of Chinese, and is a branch of natural language processing. It mainly studies how to use computers to process Chinese information automatically. Compared with Western languages such as English, Chinese lacks explicit word delimiters and is more flexible in grammar, semantics and pragmatics, which increases the difficulty of computer processing and understanding. Lexical analysis is the prerequisite and foundation of Chinese natural language processing. Research on Chinese lexical analysis has made considerable progress, but when text containing unknown (out-of-vocabulary) words is processed, the results generally fail to meet practical needs.
Specifically, when an unknown word is misidentified, not only is the word itself not recognized correctly, but, because unknown words often combine with the words before and after them, the correct identification of those neighboring words is seriously affected as well. This directly lowers the accuracy of lexical analysis and can even affect the analysis of the whole sentence. The automatic identification of unknown words has therefore become the bottleneck of Chinese lexical analysis quality.
Furthermore, named entities make up a large proportion of unknown words and are the main difficulty in unknown-word identification. A named entity is an entity with a specific meaning in the text, which may denote an abstract or a concrete thing in the real world. Named entities mainly include person names, place names, organization names, dates, times, monetary values and percentages. In terms of recognition performance, dates, times, monetary values and percentages are relatively easy to identify, and deriving rules and training statistics for them is also relatively easy.
However, because named entities such as person names, place names and organization names are open-ended and extensible and are formed with considerable randomness, their identification suffers from many false recognitions and missed recognitions. The identification of named entities is nevertheless important for understanding text correctly, and is the basis of technologies such as information extraction, automatic question answering and machine translation. The identification of person names, place names and organization names is therefore the current research focus of named entity recognition. Among these, person names, such as Chinese names and transliterated names, account for a very large proportion of named entities, so automatic name identification is the focus of unknown-word identification. Solving the name identification problem will improve the final quality of Chinese lexical analysis, syntactic analysis and Chinese information processing as a whole.
In the prior art, Chinese names are usually identified automatically with a method based on role labeling: role information extracted automatically from a corpus is used, the Viterbi algorithm (originally a decoding algorithm for convolutional codes, used here to find the most probable role sequence) is applied to label the word segmentation result with roles, and maximum pattern matching is performed on the resulting role sequence, thereby identifying Chinese names.
Specifically, the role-labeling method assumes that each entry in a sentence implicitly carries role information, where a role represents the part the entry plays in the sentence or in a named entity. Role labeling means labeling each entry in the entry sequence obtained from segmentation with its corresponding role. Roles fall into three main classes: internal components of a name, roles that form words with the context, and roles unrelated to names. One such role table is shown in Table 1:
Table 1
Role | Meaning | Example
B | Surname of a Chinese name | Mr. Warburg Pincus
C | First character of a two-character given name | ChinaFlat Mr.
D | Last character of a two-character given name | Zhang Hua FlatSir
E | Single-character given name | GreatSay: " I is Mr. Nice Guy "
F | Prefix | AlwaysLiu, LittleLee
G | Suffix | King Always, Liu Always, Xiao Family name, Wu Mother, leaf Handsome
U | Preceding context character forms a word with the first character of the name | Here RelevantIt is herioc that it is trained
V | Last character of the name forms a word with the following context character | Gong Xue EqualityLeader, Deng Ying Excusing from deathBefore
X | Surname forms a word with the first character of the two-character given name | Wang Guowei
Z | Two-character given name forms a word by itself | Zhang Chaoyang
Y | Complete name forms a word by itself | PeakVast sea, Bush, Brian Special
H | First character of a transliterated name | gramislington
I | Middle character of a transliterated name | history/ the base of a fruit/ fragrant/ ./ this/ skin/ you/ primary/ lattice
T | Last character of a transliterated name of three or more characters | crin
E2 | Last character of a two-character transliterated name | general capital
X2 | Characters inside a transliterated name form a word | oman reaches billmoral
A | Role unrelated to names | hu Jintao's cordiality visits child
K | Context preceding a name | come again to the home of Yu Hongyang.
L | Context following a name | reporter of the Xinhua News Agency Huang Wen takes the photograph
M | Component between two names | green grass or young crops is said in playwright, screenwriter Shao Jun Lin Heji road
According to the role table in Table 1, when the segmentation result is shop/interior/display/week/grace/come/and/Deng/grain husk/excusing from death/front/uses/mistake//article, labeling each entry of the resulting entry sequence with its corresponding role gives the role labeling result "shop/A interior/A display/A week/B grace/C come/D and/A Deng/B grain husk/C excusing from death/V front/A uses/A mistake/A /A article/A".
Furthermore, in this role-labeling method the roles are assigned automatically with the Viterbi algorithm: among all possible labeling sequences, the one with the highest probability is selected as the final labeling result. The theory and derivation are as follows:
Let W be the token sequence after word segmentation (that is, the segmentation result before unknown-word identification), W = (w_1, w_2, ..., w_m), and let T be a possible role labeling sequence of W, T = (t_1, t_2, ..., t_m), m > 0. T^# denotes the final labeling result, i.e. the role sequence with the highest probability. Applying Bayes' formula and introducing a hidden Markov model (HMM) gives, with t_0 denoting the start of the sequence:
$$T^{\#} = \arg\max_{T} P(T \mid W) = \arg\min_{T} \left( -\sum_{i=1}^{m} \left\{ \ln p(w_i \mid t_i) + \ln p(t_i \mid t_{i-1}) \right\} \right) \quad \text{(formula 1)}$$
Here w_i is the observed value and role t_i is the state value; W is the observation sequence and T is the state sequence hidden behind W. p(w_i | t_i) is the probability of w_i within the set of tokens whose role is t_i, and p(t_i | t_{i-1}) is the transition probability from role t_{i-1} to role t_i.
Define:
C(w_i, t_i): the number of times w_i occurs with role t_i;
C(t_{i-1}, t_i): the number of times role t_{i-1} is immediately followed by role t_i;
C(t_i): the number of times role t_i occurs.
Given training on a large-scale corpus:
$$p(w_i \mid t_i) \approx C(w_i, t_i) / C(t_i) \quad \text{(formula 2)}$$
$$p(t_i \mid t_{i-1}) \approx C(t_{i-1}, t_i) / C(t_{i-1}) \quad \text{(formula 3)}$$
Thus, in this role-labeling method, automatic role labeling reduces to minimizing the expression in formula 1. The Viterbi algorithm solves exactly this problem and is well established, so it is not described further here; automatic role labeling can be carried out with formulas 1, 2 and 3.
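The minimization in formula 1 can be illustrated with a small sketch. This is not the patent's implementation: the function name `viterbi_roles`, the start marker `<s>`, the probability floor for unseen events and the toy counts are all assumptions made for illustration. The costs are the negative logarithms of the estimates in formulas 2 and 3.

```python
import math

def viterbi_roles(tokens, roles, emit_count, trans_count, role_count,
                  floor=1e-8, start="<s>"):
    """Return the role sequence minimising the summed cost of formula 1."""
    def cost(numerator, denominator):
        # -ln of a count ratio; `floor` (an assumption) avoids log(0).
        p = numerator / denominator if denominator else 0.0
        return -math.log(max(p, floor))

    # best[role] = (cost of the cheapest path ending in role, that path)
    best = {start: (0.0, [])}
    for w in tokens:
        nxt = {}
        for t in roles:
            emit = cost(emit_count.get((w, t), 0), role_count.get(t, 0))
            c, path = min(
                ((pc + cost(trans_count.get((p, t), 0), role_count.get(p, 0)), pp)
                 for p, (pc, pp) in best.items()),
                key=lambda item: item[0],
            )
            nxt[t] = (c + emit, path + [t])
        best = nxt
    return min(best.values(), key=lambda item: item[0])[1]

# Toy counts, invented purely for this example.
roles = ["A", "B", "C", "D"]
emit_count = {("Zhou", "B"): 5, ("En", "C"): 3, ("lai", "D"): 3, ("says", "A"): 10}
trans_count = {("<s>", "B"): 5, ("<s>", "A"): 5, ("B", "C"): 5,
               ("C", "D"): 5, ("D", "A"): 5, ("A", "A"): 5}
role_count = {"<s>": 10, "A": 20, "B": 5, "C": 5, "D": 5}

labels = viterbi_roles(["Zhou", "En", "lai", "says"],
                       roles, emit_count, trans_count, role_count)
print(labels)
```

Working with summed negative logarithms rather than multiplied probabilities keeps the search numerically stable on long sentences, which is presumably why formula 1 is stated as a minimization.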
In the course of implementing the present invention, the inventors found that the prior art has at least the following problems:
(1) The existing role-labeling method depends on a contextual role set. For example, when the input string is "Liu Chiping Baidu Baike", the coarse segmentation result is Liu/Chi/ping/Baidu/Baike; if the entry "Baidu" does not carry the role of context following a name, then "Liu Chiping" cannot be identified correctly. Moreover, the contextual role set of names is not a closed set but an open-ended and extensible one, so a sufficient contextual role set is very difficult to obtain, and names consequently fail to be identified correctly.
(2) The probabilities trained by the existing role-labeling method depend on the training corpus. Because the training corpus is a closed set, the probabilities trained from it may be unreliable, so that names cannot be identified correctly.
(3) The existing role-labeling method gives insufficient support for identifying transliterated names, especially names transliterated from English.
(4) The existing role-labeling method lacks a mechanism for excluding falsely identified names. When an error occurs during name identification it cannot be excluded effectively, and the accuracy of name identification leaves room for improvement.
Summary of the invention
Embodiments of the present invention provide a method and a device for identifying Chinese names, so that Chinese names can be identified accurately.
To achieve the above object, an embodiment of the present invention provides a method for identifying Chinese names, comprising:
acquiring an input sequence, and performing word segmentation on the input sequence;
performing role labeling on the segmented input sequence, and obtaining a role labeling sequence;
matching the role labeling sequence according to name recognition patterns, and outputting the names formed.
After the role labeling sequence is obtained, the method further comprises:
detecting the name-identifying roles in the role labeling sequence, and correcting the name-identifying roles that are wrong.
After the role labeling sequence is obtained, the method further comprises:
splitting role U and role V in the role labeling sequence.
Splitting role U in the role labeling sequence comprises:
when the role following role U is C, E, G or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A.
Splitting role V in the role labeling sequence comprises:
when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.
Matching the role labeling sequence according to name recognition patterns and outputting the names formed comprises:
performing maximum pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and outputting the names formed.
Performing maximum pattern matching on the role labeling sequence obtained after role U and role V have been split, and outputting the names formed, comprises:
when the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role labeling sequence obtained after role U and role V have been split, the result of the maximum pattern matching is that the corresponding content is a Chinese name;
when the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2 or Y occurs in the role labeling sequence obtained after role U and role V have been split, the result of the maximum pattern matching is that the corresponding content is a transliterated name.
Before the input sequence is acquired, the method further comprises: performing model training.
Performing role labeling on the segmented input sequence then comprises: performing role labeling on the segmented input sequence according to the result of the model training.
Performing model training comprises:
acquiring an input corpus, and removing the nested annotation structures present in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags from the input corpus to obtain a text corpus, and segmenting the text corpus with a word segmentation system to obtain a segmented corpus;
labeling the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no contextual role information;
obtaining a role transition corpus and a role emission corpus from the role-labeled corpus;
performing model training according to the role transition corpus and the role emission corpus.
A device for identifying Chinese names, comprising:
an acquisition module, configured to acquire an input sequence and perform word segmentation on the input sequence;
a role labeling module, configured to perform role labeling on the input sequence segmented by the acquisition module and obtain a role labeling sequence;
a pattern matching module, configured to match the role labeling sequence obtained by the role labeling module according to name recognition patterns and output the names formed.
The device further comprises:
a role correction module, configured to detect the name-identifying roles in the role labeling sequence obtained by the role labeling module and correct the name-identifying roles that are wrong.
The device further comprises:
a role splitting module, configured to split role U and role V in the role labeling sequence obtained by the role labeling module.
The role splitting module is specifically configured to: when the role following role U is C, E, G or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, split the content corresponding to role U into role A and role H; when the role following role U is any other role, split the content corresponding to role U into role A and role A;
and, when the role preceding role V is C or X, split the content corresponding to role V into role D and role A; when the role preceding role V is B, split the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, split the content corresponding to role V into role T and role A; when the role preceding role V is H, split the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, split the content corresponding to role V into role A and role A.
The pattern matching module is specifically configured to perform maximum pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and to output the names formed.
When the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role labeling sequence obtained after role U and role V have been split, the result of the maximum pattern matching is that the corresponding content is a Chinese name;
when the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2 or Y occurs in the role labeling sequence obtained after role U and role V have been split, the result of the maximum pattern matching is that the corresponding content is a transliterated name.
The device further comprises:
a training module, configured to perform model training before the input sequence is acquired.
The role labeling module is further configured to perform role labeling on the segmented input sequence according to the result of the model training performed by the training module.
The training module is specifically configured to: acquire an input corpus, and remove the nested annotation structures present in the input corpus to obtain a non-nested annotated corpus; remove all part-of-speech tags from the input corpus to obtain a text corpus, and segment the text corpus with a word segmentation system to obtain a segmented corpus; label the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no contextual role information; obtain a role transition corpus and a role emission corpus from the role-labeled corpus; and perform model training according to the role transition corpus and the role emission corpus.
Compared with the prior art, the present invention has the following advantages: by performing role labeling on the segmented input sequence, Chinese names and transliterated names can be identified accurately; moreover, the role set proposed by the present invention does not depend on a contextual role set, which further improves the accuracy of Chinese name identification.
Brief description of the drawings
To illustrate the technical solutions of the present invention and of the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a method for identifying Chinese names proposed in an embodiment of the present invention;
Fig. 2 is a flow chart of another method for identifying Chinese names proposed in an embodiment of the present invention;
Fig. 3 is a flow chart of the training process of the role-labeling model in an embodiment of the present invention;
Fig. 4 is a flow chart of the process of identifying Chinese names according to the result of model training in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a device for identifying Chinese names in an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of another device for identifying Chinese names in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions of the present invention are described clearly and completely below with reference to the drawings. The embodiments described are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
An embodiment of the present invention provides a method for identifying Chinese names which, as shown in Fig. 1, comprises the following steps:
Step 101: acquire an input sequence, and perform word segmentation on the input sequence.
Step 102: perform role labeling on the segmented input sequence, and obtain a role labeling sequence.
After the role labeling sequence is obtained, the method further comprises: detecting the name-identifying roles in the role labeling sequence, and correcting the name-identifying roles that are wrong.
After the role labeling sequence is obtained, the method further comprises: splitting role U and role V in the role labeling sequence. It should be noted that the operation of splitting role U and role V and the operation of detecting and correcting wrong name-identifying roles need not be performed in any particular order.
Specifically, splitting role U in the role labeling sequence comprises: when the role following role U is C, E, G or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, splitting the content corresponding to role U into role A and role H; when the role following role U is any other role, splitting the content corresponding to role U into role A and role A.
Splitting role V in the role labeling sequence comprises: when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A.
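The splitting rules for roles U and V can be written as two lookup tables. The sketch below is illustrative only: the function name `split_uv` is hypothetical, and the text does not state where the content of a U or V entry is cut, so the sketch assumes that the last character of a U entry and the first character of a V entry belong to the name. Neighboring roles are read from the sequence as it was before any splitting, which is also an assumption.

```python
# Role given to the name part of a U entry, keyed by the following role.
U_NAME_ROLE = {"C": "B", "E": "B", "G": "B", "Z": "B",
               "D": "C", "I": "H", "X2": "H", "E2": "H"}
# Role given to the name part of a V entry, keyed by the preceding role.
V_NAME_ROLE = {"C": "D", "X": "D", "B": "E",
               "I": "T", "X2": "T", "H": "E2"}

def split_uv(tagged):
    """Split every U and V entry of a [(token, role), ...] sequence in two."""
    out = []
    for i, (tok, role) in enumerate(tagged):
        if role == "U":
            nxt = tagged[i + 1][1] if i + 1 < len(tagged) else None
            # Assumption: context part first, name character last.
            parts = [(tok[:-1], "A"), (tok[-1:], U_NAME_ROLE.get(nxt, "A"))]
        elif role == "V":
            prev = tagged[i - 1][1] if i > 0 else None
            # Assumption: name character first, context part after it.
            parts = [(tok[:1], V_NAME_ROLE.get(prev, "A")), (tok[1:], "A")]
        else:
            parts = [(tok, role)]
        out.extend(p for p in parts if p[0])  # drop empty pieces
    return out

# Hypothetical inputs modelled on the U and V rows of Table 1.
v_example = split_uv([("邓", "B"), ("颖", "C"), ("超生", "V"), ("前", "A")])
u_example = split_uv([("有关", "U"), ("天", "C"), ("培", "D")])
print(v_example)
print(u_example)
```

Any neighboring role missing from a table falls through to role A, matching the "any other role" case of the rules.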
Step 103: match the role labeling sequence according to name recognition patterns, and output the names formed. Matching the role labeling sequence according to name recognition patterns and outputting the names formed comprises: performing maximum pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and outputting the names formed.
Specifically, when the name recognition pattern BCD, BE, BG, BZ, FB, Y, XD or FE occurs in the role labeling sequence obtained after role U and role V have been split, the result of the maximum pattern matching is that the corresponding content is a Chinese name; when the name recognition pattern HE2, [H|X2][I|X2]+[T|X2], X2T, X2 or Y occurs in that sequence, the result of the maximum pattern matching is that the corresponding content is a transliterated name.
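One way to sketch the pattern matching step is to encode the role sequence as a string and scan it with a regular expression. The following is an illustration under stated assumptions, not the patent's implementation: `match_names` is a hypothetical name, the two-character roles E2 and X2 are replaced by the single-letter stand-ins `e` and `x` so that each role occupies one position, and pattern Y, which appears in both pattern lists, is reported under a separate `either` label. Within the transliteration group the longest alternative is tried first, which approximates maximum matching.

```python
import re

ENCODE = {"E2": "e", "X2": "x"}  # single-letter stand-ins (assumption)

NAME_RE = re.compile(
    r"(?P<chinese>BCD|BE|BG|BZ|FB|XD|FE)"
    r"|(?P<transliterated>[Hx][Ix]+[Tx]|He|xT|x)"
    r"|(?P<either>Y)"
)

def match_names(tagged):
    """Return (name, kind) pairs found in a [(token, role), ...] sequence."""
    codes = "".join(ENCODE.get(role, role) for _, role in tagged)
    if len(codes) != len(tagged):
        raise ValueError("every role must encode to exactly one character")
    names = []
    for m in NAME_RE.finditer(codes):
        text = "".join(tok for tok, _ in tagged[m.start():m.end()])
        names.append((text, m.lastgroup))
    return names

# Hypothetical role-labelled inputs.
cn = match_names([("邓", "B"), ("颖", "C"), ("超", "D"), ("生", "A"), ("前", "A")])
tr = match_names([("克", "H"), ("林", "I"), ("顿", "T"), ("说", "A")])
print(cn, tr)
```

Because one string position corresponds to one entry, the match offsets can be used directly to slice the token list and recover the name text.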
In this embodiment of the present invention, before the input sequence is acquired, the method further comprises performing model training; in that case, performing role labeling on the segmented input sequence comprises performing role labeling on the segmented input sequence according to the result of the model training.
Specifically, performing model training comprises: acquiring an input corpus, and removing the nested annotation structures present in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags from the input corpus to obtain a text corpus, and segmenting the text corpus with a word segmentation system to obtain a segmented corpus; labeling the segmented corpus according to a role table to obtain a role-labeled corpus, wherein the role table contains no contextual role information; obtaining a role transition corpus and a role emission corpus from the role-labeled corpus; and performing model training according to the role transition corpus and the role emission corpus.
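The last two training steps amount to counting role transitions and role emissions in the role-labeled corpus and turning the counts into the estimates of formulas 2 and 3. The sketch below assumes the corpus is available as lists of (token, role) pairs; the function names and the `<s>` start marker are hypothetical.

```python
from collections import Counter

def count_roles(labelled_sentences, start="<s>"):
    """Collect C(w, t), C(t_prev, t) and C(t) from role-labelled sentences."""
    emit, trans, role = Counter(), Counter(), Counter()
    for sentence in labelled_sentences:
        prev = start
        role[start] += 1
        for tok, r in sentence:
            emit[(tok, r)] += 1    # role emission count C(w, t)
            trans[(prev, r)] += 1  # role transition count C(t_prev, t)
            role[r] += 1           # role count C(t)
            prev = r
    return emit, trans, role

def p_emit(emit, role, tok, r):
    """Formula 2: p(w | t) ~ C(w, t) / C(t)."""
    return emit[(tok, r)] / role[r] if role[r] else 0.0

def p_trans(trans, role, prev, r):
    """Formula 3: p(t | t_prev) ~ C(t_prev, t) / C(t_prev)."""
    return trans[(prev, r)] / role[prev] if role[prev] else 0.0

# One toy sentence, invented for illustration.
corpus = [[("with", "A"), ("Mao", "B"), ("Zedong", "Z"), ("comrade", "A")]]
emit, trans, role = count_roles(corpus)
print(p_emit(emit, role, "Mao", "B"), p_trans(trans, role, "A", "B"))
```

Unseen events receive probability zero in this sketch; a practical system would need some form of smoothing, which the text does not specify.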
It can be seen that, in the method provided by this embodiment of the present invention, performing role labeling on the segmented input sequence allows Chinese names and transliterated names to be identified accurately; moreover, the role set proposed by the present invention does not depend on a contextual role set, which further improves the accuracy of Chinese name identification.
Embodiment 2 of the present invention provides a method for identifying Chinese names which, as shown in Fig. 2, comprises the following steps:
Step 201: perform model training. In the process of identifying Chinese names, model training is needed to ensure the accuracy of role labeling, so that an accurate role labeling result is available for each word.
Specifically, in this embodiment of the present invention, model training can be performed with a class-based language model and with a role-labeling model. In the class-based language model, the three kinds of named entity, namely person names, place names and organization names, are defined as three classes: PN (Person Name), LN (Location Name) and ON (Organization Name). Before training, person names, place names and organization names are replaced with PN, LN and ON respectively, after which training proceeds with the training method of an ordinary language model. The class-based language model comprises two sub-models, a context model P(C) (Context Model) and a named entity class model P(S|C) (Class Model); it is an existing and relatively mature technique and is not described in detail here. Model training with the role-labeling model is performed using SegTag. In practical applications, model training can of course also be performed with other language models, which are not described further in this embodiment.
Specifically, as shown in Fig. 3, the training process of the role-labeling model in this embodiment of the present invention comprises the following steps:
Step 301: preprocess the input corpus. The input corpus is a segmented and annotated corpus (for example, the People's Daily corpus of 2000). Preprocessing the input corpus means removing the nested annotation structures present in it to obtain a non-nested annotated corpus. For example, the original input corpus is: with/p Mao/nrf Zedong/nrg comrade/n as/v1 representative/n de/ud [China/ns Communist Party/n]nt. After the nested annotation structures are removed, the processed corpus (the non-nested annotated corpus) is: with/p Mao/nf Zedong/nrg comrade/n as/v representative/n de/u China/ns Communist Party/n. The input corpus contains nested annotation structures such as v1 (a more specific sub-category of verb), ud (a more specific sub-category of auxiliary word) and nt (organization name, in which "China" and "Communist Party" are nested together); after they are removed, v (verb) and u (auxiliary word) are obtained, and the nesting of the Chinese Communist Party is released.
In addition, all part-of-speech tags in the input corpus need to be removed to obtain a text corpus; in the example above, the text corpus after processing is: the Chinese Communist Party with Comrade Mao Zedong as its representative. The text corpus is then segmented with a word segmentation system, and the result is the segmented corpus. This segmentation step performs word segmentation only, without any named entity recognition or part-of-speech tagging.
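The two preprocessing operations can be sketched as string transformations. The sketch assumes a whitespace-separated `token/tag` format with bracketed groups written as `[ ... ]tag`; the function names and the sample line, written in the style of the example above, are hypothetical. Mapping sub-category tags such as v1 to v is not covered, because the mapping table is not given in the text.

```python
import re

def remove_nesting(line):
    """Drop '[' and the ']tag' closing a bracketed group, keeping inner tokens."""
    return re.sub(r"\]\w*", "", line.replace("[", ""))

def strip_tags(line):
    """Remove the '/tag' suffix from every whitespace-separated token."""
    return [tok.rsplit("/", 1)[0] for tok in line.split()]

# Hypothetical annotated line in the assumed format.
raw = "以/p 毛/nrf 泽东/nrg 同志/n 为/v1 代表/n 的/ud [中国/ns 共产党/n]nt"
flat = remove_nesting(raw)
text = "".join(strip_tags(flat))
print(flat)
print(text)
```

The joined text is what would then be passed to the word segmentation system to produce the segmented corpus.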
Can find out, by the preprocessing process of this step, language materials in the middle of two can be obtained, be i.e. language material after the mark language material of non-nesting and participle.
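The preprocessing of step 301 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the `FINE_TO_BASE` mapping are hypothetical, and the mapping lists only the two fine-grained tags named in the example (v1 and ud). The input uses English glosses of the example tokens.

```python
import re

# Hypothetical mapping from fine-grained tags to base tags; only the two
# tags mentioned in the example above are listed.
FINE_TO_BASE = {"v1": "v", "ud": "u"}

def remove_nesting(line):
    """Return the non-nested annotated form of one corpus line."""
    # "[China/ns Communist-Party/n]nt" -> "China/ns Communist-Party/n"
    flat = re.sub(r"\[([^\]]*)\]\w*", r"\1", line)
    items = []
    for item in flat.split():
        word, tag = item.rsplit("/", 1)
        items.append(f"{word}/{FINE_TO_BASE.get(tag, tag)}")
    return " ".join(items)

def strip_tags(line):
    """Return the bare words; in practice they would be joined into raw
    text and handed to the word segmentation system."""
    return [item.rsplit("/", 1)[0] for item in remove_nesting(line).split()]

raw = "as/v1 representative/n of/ud [China/ns Communist-Party/n]nt"
non_nested = remove_nesting(raw)
words = strip_tags(raw)
```

The two results correspond to the two intermediate corpora: `non_nested` keeps the tags needed later to locate names, while `words` is what the segmenter would re-segment.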
Step 302: obtain the role-labeled corpus, that is, label each token with a role according to the meaning of each role. Once the segmented corpus has been obtained, it is labeled according to the role table to produce the role-labeled corpus. For the example above, the segmented corpus is: with/ Mao/ Zedong/ comrade/ as/ representative/ of/ China/ Communist-Party/, and the role-labeled corpus obtained directly from it is: with/A Mao/B Zedong/Z comrade/A as/A representative/A of/A China/A Communist-Party/A.
It should be noted that, to solve the prior-art problem of depending too heavily on a context role set, the role table used in this embodiment contains no context role information. For example, removing the context roles shown in Table 2 from the prior-art role table shown in Table 1 yields the role table shown in Table 3, which this embodiment uses; the role-labeled corpus above is obtained on the basis of Table 3. Of course, the role table in Table 3 may be modified and adjusted as needed, which is not described further in this embodiment.
Table 2
Role | Meaning | Example (role word in brackets)
k | Context preceding a name | again [came to] Yu Hongyang's home
l | Context following a name | Xinhua News Agency reporter Huang Wen [photographed]
m | Component between two names | screenwriter Shao Junlin [and] Ji Daoqing said
Table 3
Role | Meaning | Example (role word in brackets; names in pinyin)
B | Surname of a Chinese name | [Zhang] Huaping xiansheng ("Mr. Zhang Huaping")
C | First character of a two-character given name | Zhang [Hua]ping xiansheng
D | Last character of a two-character given name | Zhang Hua[ping] xiansheng
E | Single-character given name | Zhang [Hao] says: "I am a good person"
F | Prefix | [Lao] Liu, [Xiao] Li
G | Suffix | Wang [Zong], Liu [Lao], Xiao [Shi], Wu [Ma], Ye [Shuai]
U | The single character preceding the name forms a word with the first character of the name | zheli [youguan] Tianpei de zhuanglie (you plus the surname Guan of Guan Tianpei form youguan, "concerning")
V | The last character of the name forms a word with the following single character | Gong Xue[pingdeng] lingdao (ping + deng form pingdeng, "equality"); Deng Ying[chaosheng] qian (chao + sheng form chaosheng)
X | The surname forms a word with the first character of a two-character given name | [Wangguo]wei (Wang Guowei; wangguo, "kingdom")
Z | The two-character given name is itself a word | Zhang [Chaoyang]
Y | The complete name is itself a word | [Gaofeng] ("peak"), [Wangyang] ("vast sea"), [Bushi] (Bush), [Bulaiente] (Bryant)
H | First character of a transliterated name | [Ke]lindun (Clinton)
I | Middle character of a transliterated name | Shi/[di]/[fen]/[·]/[si]/[pi]/[er]/[bo]/ge (Steven Spielberg)
T | Last character of a transliterated name of three or more characters | Kelin[dun]
E2 | Last character of a two-character transliterated name | Pu[jing] (Putin)
X2 | Characters inside a transliterated name form a word | [Aman]da, [Bi'er]de
A | Role unrelated to names | Hu Jintao [cordially] [visits] [children]
As can be seen, compared with Table 1, the role table used in this embodiment has no roles for the context preceding a name, the context following a name, or the component between two names. The process of obtaining the role-labeled corpus in this step therefore does not depend on a context role set; that is, the Chinese name recognition method proposed in this embodiment does not rely on a context role set.
It should be noted that the letters in Table 3 are not bound to the meanings shown. For example, role C could instead denote the surname of a Chinese name and role B the first character of a two-character given name; the correspondence between roles and meanings in Table 3 may be adjusted freely as needed. Any such assignment falls within the scope of protection; this embodiment is described using the correspondence shown in Table 3.
It should be noted that, in this embodiment, the role-labeled corpus is obtained by the SegTag method. Obtaining the role-labeled corpus by the SegTag method comprises:
(1) Record the name information. Each line of the non-nested annotated corpus (obtained in step 301) is scanned, and every position at which a name occurs in the line (positions are computed ignoring any spaces) is recorded together with the type of that name. The name types mainly include: Chinese surname part, Chinese given-name part, surname-less Chinese name of two characters, foreign name longer than two characters, foreign name of two characters, and non-name annotated with a person-name part of speech.
(2) Label each token with a role. The corresponding line is taken from the segmented corpus (obtained in step 301), the position of each word is compared with the position of the next nearest name, and the corresponding role is assigned according to the word's position relative to the name and the type of the name.
The difference between obtaining the role-labeled corpus by the SegTag method and existing corpus-labeling methods is that SegTag labels the segmented corpus (the corpus as segmented by the current word segmentation system), whereas existing methods label the standard segmentation of the input corpus. Because the segmentation results may differ, the role-model training corpus obtained by SegTag may differ as well. For example, if the segmented corpus is with/ Mao/ Zedong/ comrade/ as/ representative/ of/ China/ Communist-Party/, the role-labeled corpus is with/A Mao/B Zedong/Z comrade/A as/A representative/A of/A China/A Communist-Party/A; if the segmented corpus is with/ Mao/ Ze/ dong/ comrade/ as/ representative/ of/ China/ Communist-Party/, the role-labeled corpus is with/A Mao/B Ze/C dong/D comrade/A as/A representative/A of/A China/A Communist-Party/A.
As can be seen, because roles are assigned to the segmented corpus, the role-labeling result changes along with the current word segmentation system, which improves the accuracy of name recognition. For example, suppose the current system segments the sentence as with/ Mao/ Ze/ dong/ comrade/ as/ representative/ of/ China/ Communist-Party, but the training corpus was labeled as with/A Mao/B Zedong/Z comrade/A as/A representative/A of/A China/A Communist-Party/A; the name Mao Zedong may then fail to be recognized. In particular, when the role model contains the probability of "Zedong" as role Z but no probability of "Ze" as role C or of "dong" as role D, Mao Zedong cannot be correctly recognized.
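The two SegTag stages can be illustrated with a simplified sketch. Everything here is an assumption made for illustration: the function names, the `(start, end, kind)` span format, and the restriction to roles A, B, C, D, E and Z (roles such as U, V and X, and the transliterated-name roles, are left out). The Chinese strings are an assumed rendering of the running example (Mao = 毛, Zedong = 泽东).

```python
def token_offsets(tokens):
    """Yield (token, start, end) with character offsets that ignore spaces."""
    pos = 0
    for tok in tokens:
        yield tok, pos, pos + len(tok)
        pos += len(tok)

def segtag(tokens, name_spans):
    """Label segmented tokens with roles from recorded name spans.

    name_spans: list of (start, end, kind) with kind "surname" or "given",
    as would be recorded from the non-nested annotated corpus.
    """
    labeled = []
    for tok, start, end in token_offsets(tokens):
        role = "A"
        for s, e, kind in name_spans:
            if start < s or end > e:
                continue  # token does not lie inside this name part
            if kind == "surname":
                role = "B"
            elif (start, end) == (s, e):
                role = "E" if e - s == 1 else "Z"
            elif start == s:
                role = "C"
            else:
                role = "D"
            break
        labeled.append((tok, role))
    return labeled

# Offsets in the unsegmented sentence: surname at [1, 2), given name at [2, 4).
spans = [(1, 2, "surname"), (2, 4, "given")]
coarse = segtag(["以", "毛", "泽东", "同志"], spans)
fine = segtag(["以", "毛", "泽", "东", "同志"], spans)
```

The same name spans give `Z` for the given name under the first segmentation and `C`, `D` under the second, which is the dependence on the segmenter that the paragraph above describes.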
Step 303: extract the training files and the dictionary. The role-labeled corpus obtained in step 302 is processed to produce the corresponding role transition corpus and role emission corpus. The role transition corpus is obtained by removing all tokens and keeping only the roles; the role emission corpus is obtained by placing each role and its token on a separate line. For example, when the role-labeled corpus is with/A Mao/B Ze/C dong/D comrade/A as/A representative/A of/A China/A Communist-Party/A, the extracted role transition corpus is: A B C D A A A A A A. The extracted role emission corpus is:
A with
B Mao
C Ze
D dong
A comrade
A as
A representative
A of
A China
A Communist-Party
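The extraction in step 303 is mechanical and can be sketched in a few lines. The function name is hypothetical, and the input is the English-glossed form of the example line.

```python
def extract_training_files(labeled_line):
    """Split one role-labeled line into its transition and emission forms."""
    pairs = [item.rsplit("/", 1) for item in labeled_line.split()]
    # Transition corpus: roles only, tokens removed.
    transition = " ".join(role for _, role in pairs)
    # Emission corpus: one "role token" entry per line.
    emission = [f"{role} {word}" for word, role in pairs]
    return transition, emission

line = ("with/A Mao/B Ze/C dong/D comrade/A as/A "
        "representative/A of/A China/A Communist-Party/A")
transition, emission = extract_training_files(line)
```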
In addition, an initial role dictionary can be obtained at the same time as the role transition corpus and the role emission corpus. A basic role dictionary is extracted from the training corpus of the role-labeling model and is then progressively refined and expanded. For example, a name corpus containing a large number of Chinese names and transliterated names can easily be obtained from the role-labeling model; statistical processing of this name corpus yields a fairly accurate set of characters commonly used in names, together with the common role set of each such character. Of course, in practical applications the role dictionary also needs to be refined and expanded step by step according to the recognition errors that are found, so as to obtain a high-quality role dictionary; this is not described further here.
Step 304: train the model. According to Formula 1 used by the prior-art role-labeling method, training needs to produce two kinds of probability values: $p(w_i \mid t_i)$ and $p(t_i \mid t_{i-1})$. Here $p(w_i \mid t_i)$ is the probability of $w_i$ within the set of tokens whose role is $t_i$, that is, the role emission probability; $p(t_i \mid t_{i-1})$ is the probability of moving from role $t_{i-1}$ to role $t_i$, that is, the role transition probability. As can be seen, the model can be trained using the role transition corpus and role emission corpus obtained in step 303, which finally gives the training result.
In this embodiment, the role transition probabilities and role emission probabilities are smoothed with the Katz smoothing algorithm, which addresses the data sparseness that a probability model obtained by maximum likelihood estimation inevitably encounters. Katz smoothing is an existing algorithm and is not described further in this embodiment.
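As a rough illustration of step 304, the sketch below estimates both probability tables by plain relative-frequency counting. It deliberately omits the Katz smoothing that the embodiment applies, so unseen events simply have no entry; the function name and the input formats (the two corpora of step 303) are assumptions.

```python
from collections import Counter

def train(transition_lines, emission_lines):
    """Maximum-likelihood estimates of p(w|t) and p(t|t_prev), unsmoothed."""
    role_count, emit_count = Counter(), Counter()
    prev_count, trans_count = Counter(), Counter()
    for line in emission_lines:
        role, word = line.split(" ", 1)
        role_count[role] += 1
        emit_count[(role, word)] += 1
    for line in transition_lines:
        roles = line.split()
        for prev, cur in zip(roles, roles[1:]):
            prev_count[prev] += 1
            trans_count[(prev, cur)] += 1
    # Keys are (role, word) and (previous role, role) respectively.
    emission = {k: c / role_count[k[0]] for k, c in emit_count.items()}
    transition = {k: c / prev_count[k[0]] for k, c in trans_count.items()}
    return emission, transition

emission_p, transition_p = train(
    ["A B C D A A A A A A"],
    ["A with", "B Mao", "C Ze", "D dong", "A comrade", "A as",
     "A representative", "A of", "A China", "A Communist-Party"],
)
```

With a single training line the estimates are of course degenerate (for instance, B is always followed by C); a smoothing step such as Katz back-off is what redistributes probability mass to unseen pairs.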
As can be seen, steps 301 to 304 produce the result of model training, which can then be used directly in the subsequent Chinese name recognition process.
Step 202: recognize Chinese names according to the result of model training. In Chinese information processing, the Chinese names in the text need to be recognized in order to complete the processing of that text.
Specifically, as shown in Figure 4, the process of recognizing Chinese names according to the result of model training in this embodiment comprises the following steps:
Step 401: segment the input sentence to obtain the segmented sentence. This embodiment is described using the input sentence "The hall displays articles used by Zhou Enlai and Deng Yingchao during their lifetimes" as an example. One possible segmentation of this sentence is: hall/ inside/ display/ Zhou/ En/ lai/ and/ Deng/ Ying/ chaosheng/ before/ use/ PAST/ of/ articles, where "chaosheng" is a dictionary word that merges the last character of the name Deng Yingchao with the first character of the following word shengqian ("during one's lifetime").
Step 402: for the segmented sentence, use the role-labeling model and the Viterbi algorithm to obtain the role-labeling sequence with the highest probability. For the example above, the role-labeling result is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A PAST/A of/A articles/A. The roles are assigned according to the result of the model training described above (for example, using the role dictionary obtained from that training). This step is an existing processing technique and is not described in detail in this embodiment.
It should be noted that this step may be executed by a role-labeling module; of course, other entities may also be used as needed, which is not described further here.
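For readers unfamiliar with the decoding in step 402, here is a compact log-space Viterbi sketch. It is a generic textbook formulation rather than the patent's code: the function signature, the `floor` value used for unseen events (standing in for proper smoothing), the absence of an initial-role distribution, and all probability values in the toy tables are assumptions made up for illustration.

```python
import math

def viterbi(tokens, roles_of, emit, trans, floor=1e-8):
    """Return the most probable role sequence for the tokens.

    roles_of(token) -> candidate roles (e.g. from the role dictionary);
    emit[(role, token)] and trans[(prev_role, role)] are probabilities.
    """
    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][r] = (log score of the best path ending in r at token i, backpointer)
    best = [{r: (lp(emit, (r, tokens[0])), None) for r in roles_of(tokens[0])}]
    for i in range(1, len(tokens)):
        column = {}
        for r in roles_of(tokens[i]):
            prev, score = max(
                ((p, s + lp(trans, (p, r))) for p, (s, _) in best[-1].items()),
                key=lambda x: x[1],
            )
            column[r] = (score + lp(emit, (r, tokens[i])), prev)
        best.append(column)
    role = max(best[-1], key=lambda r: best[-1][r][0])
    path = [role]
    for i in range(len(tokens) - 1, 0, -1):
        role = best[i][role][1]
        path.append(role)
    return path[::-1]

# Toy tables with invented numbers, for a fragment of the example sentence.
candidates = {"display": ["A"], "Zhou": ["A", "B"], "En": ["A", "C"],
              "lai": ["A", "D"], "and": ["A"]}
emit = {("A", "display"): 0.1, ("B", "Zhou"): 0.3, ("A", "Zhou"): 0.01,
        ("C", "En"): 0.2, ("A", "En"): 0.01, ("D", "lai"): 0.2,
        ("A", "lai"): 0.05, ("A", "and"): 0.2}
trans = {("A", "A"): 0.8, ("A", "B"): 0.1, ("B", "C"): 0.6,
         ("C", "D"): 0.9, ("D", "A"): 0.9}
result = viterbi(["display", "Zhou", "En", "lai", "and"],
                 candidates.__getitem__, emit, trans)
```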
For tokens that do not appear in the role dictionary (that is, tokens unregistered in the role dictionary), this embodiment proposes an effective role-guessing method. Its principle is to guess the role of a token from the token's length and the characters it is composed of. Specifically:
(1) If the unregistered token has an odd number of bytes, or no fewer than 6 bytes, or contains a non-Chinese character, its role is directly determined to be the name-unrelated role A.
(2) If the unregistered token is a single Chinese character, the candidate roles to be guessed are A|C|D|E.
(3) If the unregistered token consists of two Chinese characters, the candidate roles to be guessed are A|X|Z.
It should be noted that the characters used in transliterated names are relatively concentrated, so the role dictionary already contains essentially all common transliteration characters and their usual roles; the guessing method above is therefore mainly aimed at possible Chinese-name roles.
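The three guessing rules translate almost directly into code. In this sketch the function name is hypothetical, and two assumptions are made: a "Chinese character" is taken to be a code point in U+4E00 to U+9FA5, and each such character is assumed to occupy two bytes (as in GBK), so that the byte limits become character counts.

```python
def guess_roles(token):
    """Guess candidate roles for a token missing from the role dictionary."""
    # Rule 1: any non-Chinese character means the name-unrelated role A.
    # Under a 2-byte encoding this also covers every odd-byte token.
    if not token or any(not ("\u4e00" <= ch <= "\u9fa5") for ch in token):
        return ["A"]
    n_bytes = 2 * len(token)  # assumed 2 bytes per Chinese character
    if n_bytes >= 6:
        return ["A"]          # rule 1: no fewer than 6 bytes
    if len(token) == 1:
        return ["A", "C", "D", "E"]  # rule 2: single Chinese character
    return ["A", "X", "Z"]           # rule 3: two Chinese characters
```

The returned candidates would then be fed to the decoder in place of a dictionary entry, for example as the `roles_of` result for an unknown token.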
Step 403: detect the name-recognition roles in the role-labeling sequence that may produce recognition errors, and correct the corresponding roles for possible errors in a timely manner. This step may be executed by a role correction module; of course, other entities may also be used as needed, which is not described further here.
In this step, suppose the name string to be recognized is $w_m w_{m+1} \dots w_n$ and the corresponding name pattern is $t_m t_{m+1} \dots t_n$, where $m \ge 0$ and $n \ge m+2$. The word immediately before the name and the word immediately after it are $w_{m-1}$ and $w_{n+1}$ respectively. In this embodiment, whether to recognize the name is determined by comparing the probabilities of two paths:
Path 1 (name path, PN_PATH):
$$P(w_{m-1} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to w_{n+1}) \cdot P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$$
Path 2 (non-name path, NOT_PN_PATH):
$$P(w_{m-1} \to w_m) \cdot P(w_m \to w_{m+1}) \cdots P(w_n \to w_{n+1})$$
Specifically, path 1 is the path on which $w_m w_{m+1} \dots w_n$ is recognized as a name. Its first probability, $P(w_{m-1} \to \mathrm{PN})$, is the transition probability from the word preceding the name to the name; its second probability, $P(\mathrm{PN} \to w_{n+1})$, is the transition probability from PN to the word following the name; its third probability, $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$, is the probability that the recognized name is $w_m w_{m+1} \dots w_n$. The probability of path 1 is the product of these three probabilities.
Path 2 is the path on which $w_m w_{m+1} \dots w_n$ is not recognized as a name; its probability is the product of the transition probabilities between adjacent tokens on the path.
The decision is made by comparing the two path probabilities: if the probability of path 1 is not less than that of path 2, $w_m w_{m+1} \dots w_n$ is recognized as a name. Otherwise, recognizing $w_m w_{m+1} \dots w_n$ as a name would likely be a misrecognition, so it is not recognized as a name and the roles corresponding to $w_m w_{m+1} \dots w_n$ are relabeled as the name-unrelated role, that is, role A.
It should be noted that the two paths are computed as follows. (1) The third probability of path 1 is computed from the role emission probabilities: $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN}) = p(w_m \mid t_m) \cdot p(w_{m+1} \mid t_{m+1}) \cdots p(w_n \mid t_n)$. (2) The other probabilities of path 1 and all probabilities of path 2 can be obtained from the class-based language model. In practical applications, the number of probabilities multiplied on path 1 is one more than on path 2; if needed, a weight factor w may therefore be applied in front of $P(w_m w_{m+1} \dots w_n \mid \mathrm{PN})$ to make the comparison of the two paths more accurate. Both paths can be computed with existing techniques, which are not described in detail in this embodiment.
In addition, misrecognition is more likely under the BCD pattern and the XD pattern. To correct such errors more accurately, a third comparison path, called PN_PATH2, is added for the BCD and XD patterns:
Path 3 (name path 2, PN_PATH2):
$$P(w_{m-1} \to \mathrm{PN}) \cdot P(\mathrm{PN} \to w_n) \cdot P(w_m w_{m+1} \dots w_{n-1} \mid \mathrm{PN}) \cdot P(w_n \to w_{n+1})$$
Specifically, path 3 is the path on which $w_m w_{m+1} \dots w_{n-1}$ is recognized as a name. The first three probabilities of path 3 have meanings similar to the three probabilities of path 1 and are not described again. The fourth probability, $P(w_n \to w_{n+1})$, is the transition probability from $w_n$, the word following the name, to the next word $w_{n+1}$; it can likewise be obtained from the class-based language model.
In this embodiment, if the probability of path 3 is greater than both the probability of path 1 and the probability of path 2, the roles $t_{n-1}$ and $t_n$ corresponding to $w_{n-1}$ and $w_n$ need to be revised. When the pattern being corrected is BCD, $t_{n-1}$ is changed to role E and $t_n$ to role A; when the pattern being corrected is XD, $t_{n-1}$ is changed to role Y and $t_n$ to role A.
To explain this step more clearly, the example above is continued. The role-labeling sequence obtained in the preceding step is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A PAST/A of/A articles/A. Since the BCD pattern occurs here, which is within the range of patterns to be corrected, the role correction module compares the probabilities of three paths and derives the final correction result.
Path 1: P(display → PN) * P(PN → and) * P(Zhou En lai | PN), where P(Zhou En lai | PN) = P(Zhou | B) * P(En | C) * P(lai | D).
Path 2: P(display → Zhou) * P(Zhou → En) * P(En → lai) * P(lai → and).
Path 3: P(display → PN) * P(PN → lai) * P(Zhou En | PN) * P(lai → and), where P(Zhou En | PN) = P(Zhou | B) * P(En | E).
In summary: (1) if the probability of path 1 is the largest, the name is recognized and no role needs to be changed; (2) if the probability of path 2 is the largest, the role of every token of the candidate name is changed to A, that is, the roles of the three tokens Zhou, En and lai are changed to the name-unrelated role A; (3) if the probability of path 3 is the largest, the roles of En and lai are changed to role E and role A respectively. The computation of each path probability and the final comparison are not described in detail in this step.
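The decision logic of step 403 can be summarised in a short sketch. It takes the path probabilities as already computed inputs (the language-model and emission lookups are outside its scope); the function name and the representation of a pattern as a string of single-letter roles are assumptions for illustration.

```python
def correct_roles(pattern, p_pn, p_not_pn, p_pn2=None):
    """Return the corrected role string for one candidate name.

    pattern  : roles of the candidate name, e.g. "BCD" or "XD"
    p_pn     : probability of path 1 (PN_PATH)
    p_not_pn : probability of path 2 (NOT_PN_PATH)
    p_pn2    : probability of path 3 (PN_PATH2); only used for BCD and XD
    """
    if (pattern in ("BCD", "XD") and p_pn2 is not None
            and p_pn2 > p_pn and p_pn2 > p_not_pn):
        # Path 3 wins: the name ends one token earlier.
        tail = "EA" if pattern == "BCD" else "YA"
        return pattern[:-2] + tail
    if p_pn >= p_not_pn:
        return pattern              # path 1 wins: keep the name as labeled
    return "A" * len(pattern)       # path 2 wins: not a name
```

For the Zhou En lai example, `correct_roles("BCD", p1, p2, p3)` returns "BCD", "AAA" or "BEA" depending on which of the three paths has the largest probability. In practice the products would usually be compared as sums of logarithms to avoid underflow.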
Step 404: exclude or correct misrecognized names according to preset conditions. This step may be executed by a rule checking module; of course, other entities may also be used as needed, which is not described further here.
Specifically, the preset conditions include:
(1) In the BCD pattern, if the Chinese character corresponding to role D is "he" (和, "and") and the string immediately following it is recognized as a name, then this "he" is a conjunction. The BCD pattern is changed to the BE pattern, and the original BC part is recognized as the name. For example, in "He saw Guo Quan and Zhao Tao fighting", the names that may be recognized are Guo Quanhe (BCD pattern) and Zhao Tao. Applying the condition above corrects the misrecognized Guo Quanhe to Guo Quan, with "he" treated as a conjunction.
(2) A transliterated name that meets either of the following conditions is not recognized: it is longer than 16 Chinese characters (for example, Andrew Jefferson Karstlo Bill Gates Bauer Mo Qiaobusibulin Page may be recognized as one transliterated name and needs to be excluded); or it contains 3 or more consecutive identical characters (for example, A-A-A-luo), in which case it cannot be recognized as a name.
(3) If a recognized name has a "、" (the Chinese enumeration comma) immediately before or after it, and the word on the other side of that "、" is determined not to be a name, then this name is not recognized. The reason is that the enumeration comma usually marks the left or right boundary of a name: if a word delimited by enumeration commas is a name, names should also appear next to it; otherwise the name must be excluded.
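Conditions (1) and (2) lend themselves to a small sketch; condition (3) is omitted. All names here are hypothetical. In particular, "the following string is recognized as a name" is approximated by checking whether the next role is one that can begin a pattern (the set `NAME_START` is an assumption), and tokens are written in pinyin, with "he" standing for 和.

```python
import re

# Assumed approximation: roles that can begin a name pattern.
NAME_START = {"B", "F", "X", "Y", "H", "X2"}

def fix_conjunction(tokens, roles, conj="he"):
    """Condition (1): turn 'B C D' into 'B E A' when D is the conjunction
    and a name follows immediately."""
    roles = list(roles)
    for i in range(len(roles) - 3):
        if (roles[i:i + 3] == ["B", "C", "D"] and tokens[i + 2] == conj
                and roles[i + 3] in NAME_START):
            roles[i + 1:i + 3] = ["E", "A"]
    return roles

def is_valid_transliteration(name):
    """Condition (2): reject names longer than 16 characters or containing
    three or more consecutive identical characters."""
    if len(name) > 16:
        return False
    return re.search(r"(.)\1\1", name) is None

fixed = fix_conjunction(["Guo", "Quan", "he", "Zhao", "Tao"],
                        ["B", "C", "D", "B", "E"])
```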
Step 405: split roles U (the single character preceding the name forms a word with the first character of the name) and V (the last character of the name forms a word with the following single character) in the role-labeling sequence. The role-labeling sequence has been obtained in the preceding steps; in this step roles U and V are split to obtain a more accurate role-labeling sequence. This step may be executed by a role splitting module; of course, other entities may also be used as needed, which is not described further here.
It should be noted that steps 403, 404 and 405 have no fixed order; the order above is used only for the purposes of description. In practical applications the steps may be reordered as needed. For example, the splitting of roles U and V in step 405 may be performed first, followed by the detection and correction of possibly erroneous name-recognition roles in step 403, followed by the exclusion or correction of misrecognized names by the preset conditions in step 404. This is not described further in this embodiment.
Specifically, to handle names that combine with their context to form words, roles U and V need to be split. The splitting rules for role U are shown in Table 4 and those for role V in Table 5. Of course, the contents of Tables 4 and 5 may be adjusted and modified as needed, which is not described further here.
Table 4: splitting rules for role U
Role following U | Split result
C, E, G, Z | A B
D | A C
I, X2, E2 | A H
Other roles | A A
Table 5: splitting rules for role V
Role preceding V | Split result
C, X | D A
B | E A
I, X2 | T A
H | E2 A
Other roles | A A
To explain this step more clearly, the example above is continued. In the role-labeling sequence hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chaosheng/V before/A use/A PAST/A of/A articles/A, role V needs to be split. The role preceding V is role C, so the split result is D A, and the role-labeling sequence after splitting is: hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chao/D sheng/A before/A use/A PAST/A of/A articles/A.
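Tables 4 and 5 can be applied with two lookup tables, as in the sketch below. The function and table names are hypothetical, and it is assumed that a U or V token consists of exactly two characters, one belonging to the name and one to the context. The Chinese strings are an assumed rendering of the running example (Deng Ying chao = 邓颖超, chaosheng = 超生, before = 前).

```python
# Table 4: role following U -> roles of the two characters of the U token.
U_SPLIT = {"C": ("A", "B"), "E": ("A", "B"), "G": ("A", "B"), "Z": ("A", "B"),
           "D": ("A", "C"),
           "I": ("A", "H"), "X2": ("A", "H"), "E2": ("A", "H")}
# Table 5: role preceding V -> roles of the two characters of the V token.
V_SPLIT = {"C": ("D", "A"), "X": ("D", "A"), "B": ("E", "A"),
           "I": ("T", "A"), "X2": ("T", "A"), "H": ("E2", "A")}

def split_uv(tokens, roles):
    """Split every U and V token into two characters with separate roles."""
    out_tokens, out_roles = [], []
    for i, (tok, role) in enumerate(zip(tokens, roles)):
        if role == "U":
            nxt = roles[i + 1] if i + 1 < len(roles) else None
            parts = U_SPLIT.get(nxt, ("A", "A"))
        elif role == "V":
            prev = roles[i - 1] if i > 0 else None
            parts = V_SPLIT.get(prev, ("A", "A"))
        else:
            out_tokens.append(tok)
            out_roles.append(role)
            continue
        out_tokens.extend([tok[0], tok[1:]])
        out_roles.extend(parts)
    return out_tokens, out_roles

new_tokens, new_roles = split_uv(["邓", "颖", "超生", "前"], ["B", "C", "V", "A"])
```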
Step 406: match the role-labeling sequence after splitting against the name recognition patterns, output the names formed, and record the position of each name in the sentence. This step may be executed by a pattern matching module; of course, other entities may also be used as needed, which is not described further here.
Specifically, the name recognition patterns are shown in Table 6. Of course, the contents of Table 6 may be adjusted and modified as needed, which is not described further here.
Table 6: name recognition pattern set
Type | Patterns
Chinese name recognition patterns | BCD, BE, BG, BZ, FB, Y, XD, FE
Transliterated name recognition patterns | HE2, [H|X2][I|X2]+[T|X2], X2T, X2, Y
Further, in this step, maximal pattern matching is performed on the role-labeling sequence after splitting according to the name recognition patterns: whenever the sequence contains content corresponding to a pattern in the pattern set, the longest possible match for that content is taken. For example, when the role-labeling sequence contains the Chinese name recognition pattern BCD, the maximal match is BCD (a non-maximal match might instead yield BC, CD and so on). The maximal pattern matching method is therefore required in this embodiment, and it is not described further here.
Continuing with the example above, the role-labeling sequence after splitting is hall/A inside/A display/A Zhou/B En/C lai/D and/A Deng/B Ying/C chao/D sheng/A before/A use/A PAST/A of/A articles/A, so after maximal pattern matching the names recognized are Zhou Enlai (BCD pattern) and Deng Yingchao (BCD pattern). In addition, the transliterated name recognition pattern [H|X2][I|X2]+[T|X2] is written as a regular expression: if the first token has role H or X2, one or more of the middle tokens have role I or X2, and the last token has role T or X2, the sequence can be recognized as a transliterated name.
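One way to realise the maximal matching of step 406 is to encode the role sequence as a string and try every pattern at each position, keeping the longest hit. This is only a sketch under assumptions: the names `CODE`, `PATTERNS` and `match_names` are hypothetical, E2 and X2 are mapped to the single letters e and x so that each role occupies one character, the recorded position is a token index, and pattern Y (which appears in both rows of Table 6) is reported as a Chinese name. The Chinese strings are an assumed rendering of the running example.

```python
import re

CODE = {"E2": "e", "X2": "x"}  # single-letter stand-ins for two-character role names
PATTERNS = [(kind, re.compile(p)) for kind, ps in (
    ("chinese", ["BCD", "BE", "BG", "BZ", "FB", "Y", "XD", "FE"]),
    ("transliterated", ["He", "[Hx][Ix]+[Tx]", "xT", "x", "Y"]),
) for p in ps]

def match_names(tokens, roles):
    """Return (name, token index, kind) for each maximal pattern match."""
    s = "".join(CODE.get(r, r) for r in roles)
    names, i = [], 0
    while i < len(s):
        best = None
        for kind, rx in PATTERNS:
            m = rx.match(s, i)
            if m and (best is None or m.end() > best[1]):
                best = (kind, m.end())  # keep the longest match at position i
        if best is None:
            i += 1
            continue
        kind, end = best
        names.append(("".join(tokens[i:end]), i, kind))
        i = end
    return names

tokens = ["馆", "内", "陈列", "周", "恩", "来", "和", "邓", "颖", "超",
          "生", "前", "使用", "过", "的", "物品"]
roles = ["A", "A", "A", "B", "C", "D", "A", "B", "C", "D",
         "A", "A", "A", "A", "A", "A"]
found = match_names(tokens, roles)
```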
It should be noted that, in this embodiment, the role-labeling module, role correction module, rule checking module, role splitting module and pattern matching module described above may be further combined into one or more modules, or further split into multiple submodules, as needed.
The order of steps 401 to 406 may also be adjusted as needed, which is not described further here.
As can be seen, the method provided by this embodiment takes full advantage of both the role-labeling model and the class-based language model, and can recognize Chinese names and transliterated names more accurately while maintaining a high recall rate. Moreover, the role set proposed in the present invention does not depend on context role words, so that the name recognition system achieves a higher name recognition recall rate for a given amount of training data. In addition, the present invention provides a very practical strategy for guessing the roles of unregistered tokens, which can effectively guess the roles of most unregistered tokens and to some extent removes the negative effect of unregistered words on name recognition.
An embodiment of the present invention also provides a device for recognizing Chinese names which, as shown in Figure 5, comprises:
Acquisition module 51, configured to obtain an input sequence and segment the input sequence into words.
Role-labeling module 52, configured to perform role labeling on the input sequence segmented by acquisition module 51 and obtain a role-labeling sequence.
Pattern matching module 53, configured to perform maximal pattern matching, according to the name recognition patterns, on the role-labeling sequence obtained by role-labeling module 52, and to output the names formed.
An embodiment of the present invention also provides a device for recognizing Chinese names which, as shown in Figure 6, comprises:
Acquisition module 61, configured to obtain an input sequence and segment the input sequence into words.
Role-labeling module 62, configured to perform role labeling on the input sequence segmented by acquisition module 61 and obtain a role-labeling sequence.
Pattern matching module 63, configured to perform maximal pattern matching, according to the name recognition patterns, on the role-labeling sequence obtained by role-labeling module 62, and to output the names formed.
Role correction module 64, configured to detect the name-recognition roles in the role-labeling sequence obtained by the role-labeling module and to correct name-recognition roles that are erroneous.
Role splitting module 65, configured to split role U and role V in the role-labeling sequence obtained by the role-labeling module.
Role splitting module 65 is specifically configured to: when the role following role U is C, E, G or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2 or E2, split the content corresponding to role U into role A and role H; and when the role following role U is any other role, split the content corresponding to role U into role A and role A;
and, when the role preceding role V is C or X, split the content corresponding to role V into role D and role A; when the role preceding role V is B, split the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, split the content corresponding to role V into role T and role A; when the role preceding role V is H, split the content corresponding to role V into role E2 and role A; and when the role preceding role V is any other role, split the content corresponding to role V into role A and role A.
Further, described Pattern Matching Module 63 specifically for, according to name recognition mode, the maximum coupling of pattern is carried out to the character labeling sequence divided after process through role U and role V, and exports the name of composition.When there is name recognition mode BCD in the character labeling sequence divided after process through role U and role V, when BE, BG, BZ, FB, Y, XD or FE, the result of the maximum coupling of pattern is corresponding content is Chinese personal name; When there is name recognition mode HE2, [H|X2] [I|X2]+[T|X2] in the character labeling sequence divided after process through role U and role V, when X2T, X2 or Y, the result of the maximum coupling of pattern is corresponding content is transliteration name.
Training module 66, for before acquisition list entries, carries out model training; Described character labeling module 62 also for, the result according to the model training of described training module 66 carries out character labeling to the list entries after participle.
Concrete, described training module 66 specifically for, obtain input language material, and remove the nested marking structure existed in described input language material, obtain the mark language material of non-nesting; Remove all part of speech marks in described input language material, obtain text language material, and use Words partition system to carry out cutting to described text language material, obtain language material after participle; According to role's table, language material after described participle is marked, obtain character labeling language material, wherein, in described role's table, do not comprise contextual role information; And obtain role according to described character labeling language material and shift language material and role launches language material; Shift language material and role according to described role to launch language material and carry out model training.
It can be seen that the device provided by the embodiments of the present invention takes full advantage of the role labeling model and the class-based language model and, while maintaining a high recall rate, identifies Chinese names and transliterated names more accurately. Because the roles proposed in the present invention do not rely on a set of contextual role words, the name recognition system achieves a higher name recognition recall rate for a given amount of training data. In addition, the present invention provides a practical strategy for guessing the roles of unlisted (out-of-vocabulary) entries, which can effectively guess the roles of most unlisted entries and, to some extent, mitigates the negative effect of unlisted words on name recognition.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the essence of the technical solution of the present invention, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a terminal device (which may be a mobile phone, a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
The above is only the preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Those skilled in the art will appreciate that the modules of the device in an embodiment may be distributed within the device as described in the embodiment, or may be changed accordingly and placed in one or more devices different from that of the present embodiment. The modules of the above embodiments may be integrated into one unit or deployed separately; they may be combined into a single module or further split into multiple submodules.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The above discloses only several specific embodiments of the present invention; however, the present invention is not limited thereto, and any variation that a person skilled in the art can conceive of shall fall within the protection scope of the present invention.

Claims (10)

1. A method for recognizing Chinese names, characterized by comprising:
performing model training, wherein the role table used in the model training does not include contextual role information;
obtaining an input sequence, and performing word segmentation on the input sequence;
performing role labeling on the segmented input sequence according to the result of the model training, and obtaining a role labeling sequence;
splitting role U and role V in the role labeling sequence, wherein role U indicates that a single preceding-context character forms a word with the first character of a name, and role V indicates that the last character of a name forms a word with a single following-context character;
matching the role labeling sequence after the splitting according to name recognition patterns, and outputting the names so formed;
wherein splitting role U in the role labeling sequence comprises:
when the role following role U is C, E, G, or Z, splitting the content corresponding to role U into role A and role B; when the role following role U is D, splitting the content corresponding to role U into role A and role C; when the role following role U is I, X2, or E2, splitting the content corresponding to role U into role A and role H; and when the role following role U is any other role, splitting the content corresponding to role U into role A and role A; wherein role A denotes a role unrelated to names, role B denotes the surname of a Chinese name, role C denotes the first character of a two-character given name of a Chinese name, role D denotes the last character of a two-character given name of a Chinese name, role E denotes a single-character given name of a Chinese name, role E2 denotes the last character of a two-character transliterated name, role G denotes a suffix, role H denotes the first character of a transliterated name, role X2 denotes a word formed inside a transliterated name, and role Z denotes a two-character given name of a Chinese name that itself forms a word;
and splitting role V in the role labeling sequence comprises:
when the role preceding role V is C or X, splitting the content corresponding to role V into role D and role A; when the role preceding role V is B, splitting the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, splitting the content corresponding to role V into role T and role A; when the role preceding role V is H, splitting the content corresponding to role V into role E2 and role A; and when the role preceding role V is any other role, splitting the content corresponding to role V into role A and role A; wherein role I denotes the middle character of a transliterated name.
2. The method of claim 1, characterized in that, after obtaining the role labeling sequence, the method further comprises:
detecting the name recognition roles in the role labeling sequence, and correcting any name recognition roles that are erroneous.
3. The method of claim 1 or 2, characterized in that matching the role labeling sequence according to the name recognition patterns and outputting the names so formed comprises:
performing maximal pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and outputting the names so formed.
4. The method of claim 3, characterized in that performing maximal pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and outputting the names so formed comprises:
when one of the name recognition patterns BCD, BE, BG, BZ, FB, Y, XD, or FE appears in the role labeling sequence after the splitting of role U and role V, the result of the maximal pattern matching is that the corresponding content is a Chinese name; wherein role B denotes the surname of a Chinese name, role C denotes the first character of a two-character given name of a Chinese name, role D denotes the last character of a two-character given name of a Chinese name, role E denotes a single-character given name of a Chinese name, role F denotes a prefix, role G denotes a suffix, role X denotes the surname of a Chinese name forming a word with the first character of a two-character given name, role Y denotes a complete name that itself forms a word, and role Z denotes a two-character given name of a Chinese name that itself forms a word;
when one of the name recognition patterns HE2, [H|X2][I|X2]+[T|X2], X2T, X2, or Y appears in the role labeling sequence after the splitting of role U and role V, the result of the maximal pattern matching is that the corresponding content is a transliterated name; wherein role E2 denotes the last character of a two-character transliterated name, role H denotes the first character of a transliterated name, role I denotes the middle character of a transliterated name, role T denotes the last character of a transliterated name of three or more characters, and role X2 denotes a word formed inside a transliterated name.
5. The method of claim 1 or 2, characterized in that performing model training comprises:
obtaining an input corpus, and removing the nested annotation structures present in the input corpus to obtain a non-nested annotated corpus; removing all part-of-speech tags from the input corpus to obtain a text corpus, and segmenting the text corpus with a word segmentation system to obtain a segmented corpus;
labeling the segmented corpus according to the role table to obtain a role-labeled corpus;
obtaining a role transition corpus and a role emission corpus from the role-labeled corpus; wherein the role transition corpus is the corpus obtained by removing all entries and retaining only the corresponding roles, and the role emission corpus is the corpus obtained by placing each role and its entry on a separate line;
performing model training on the role transition corpus and the role emission corpus.
6. A device for recognizing Chinese names, characterized by comprising:
a training module, configured to perform model training before an input sequence is obtained, wherein the role table used in the model training does not include contextual role information;
an acquisition module, configured to obtain the input sequence and perform word segmentation on the input sequence;
a role labeling module, configured to perform role labeling, according to the result of the model training, on the input sequence segmented by the acquisition module, and to obtain a role labeling sequence;
a role splitting module, configured to split role U and role V in the role labeling sequence obtained by the role labeling module; wherein role U indicates that a single preceding-context character forms a word with the first character of a name, and role V indicates that the last character of a name forms a word with a single following-context character;
a pattern matching module, configured to match, according to name recognition patterns, the role labeling sequence obtained by the role labeling module after the splitting, and to output the names so formed;
wherein the role splitting module is specifically configured to: when the role following role U is C, E, G, or Z, split the content corresponding to role U into role A and role B; when the role following role U is D, split the content corresponding to role U into role A and role C; when the role following role U is I, X2, or E2, split the content corresponding to role U into role A and role H; and when the role following role U is any other role, split the content corresponding to role U into role A and role A; wherein role A denotes a role unrelated to names, role B denotes the surname of a Chinese name, role C denotes the first character of a two-character given name of a Chinese name, role D denotes the last character of a two-character given name of a Chinese name, role E denotes a single-character given name of a Chinese name, role E2 denotes the last character of a two-character transliterated name, role G denotes a suffix, role H denotes the first character of a transliterated name, role X2 denotes a word formed inside a transliterated name, and role Z denotes a two-character given name of a Chinese name that itself forms a word;
and, when the role preceding role V is C or X, split the content corresponding to role V into role D and role A; when the role preceding role V is B, split the content corresponding to role V into role E and role A; when the role preceding role V is I or X2, split the content corresponding to role V into role T and role A; when the role preceding role V is H, split the content corresponding to role V into role E2 and role A; and when the role preceding role V is any other role, split the content corresponding to role V into role A and role A; wherein role I denotes the middle character of a transliterated name.
7. The device of claim 6, characterized by further comprising:
a role correction module, configured to detect the name recognition roles in the role labeling sequence obtained by the role labeling module, and to correct any name recognition roles that are erroneous.
8. The device of claim 6 or 7, characterized in that
the pattern matching module is specifically configured to perform maximal pattern matching, according to the name recognition patterns, on the role labeling sequence obtained after role U and role V have been split, and to output the names so formed.
9. The device of claim 8, characterized in that
when one of the name recognition patterns BCD, BE, BG, BZ, FB, Y, XD, or FE appears in the role labeling sequence after the splitting of role U and role V, the result of the maximal pattern matching is that the corresponding content is a Chinese name; wherein role B denotes the surname of a Chinese name, role C denotes the first character of a two-character given name of a Chinese name, role D denotes the last character of a two-character given name of a Chinese name, role E denotes a single-character given name of a Chinese name, role F denotes a prefix, role G denotes a suffix, role X denotes the surname of a Chinese name forming a word with the first character of a two-character given name, role Y denotes a complete name that itself forms a word, and role Z denotes a two-character given name of a Chinese name that itself forms a word;
when one of the name recognition patterns HE2, [H|X2][I|X2]+[T|X2], X2T, X2, or Y appears in the role labeling sequence after the splitting of role U and role V, the result of the maximal pattern matching is that the corresponding content is a transliterated name; wherein role E2 denotes the last character of a two-character transliterated name, role H denotes the first character of a transliterated name, role I denotes the middle character of a transliterated name, role T denotes the last character of a transliterated name of three or more characters, and role X2 denotes a word formed inside a transliterated name.
10. The device of claim 6 or 7, characterized in that the training module is specifically configured to: obtain an input corpus, and remove the nested annotation structures present in the input corpus to obtain a non-nested annotated corpus; remove all part-of-speech tags from the input corpus to obtain a text corpus, and segment the text corpus with a word segmentation system to obtain a segmented corpus; label the segmented corpus according to the role table to obtain a role-labeled corpus; obtain a role transition corpus and a role emission corpus from the role-labeled corpus; and perform model training on the role transition corpus and the role emission corpus; wherein the role transition corpus is the corpus obtained by removing all entries and retaining only the corresponding roles, and the role emission corpus is the corpus obtained by placing each role and its entry on a separate line.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910177127.XA CN102033879B (en) 2009-09-27 2009-09-27 Method and device for identifying Chinese name

Publications (2)

Publication Number Publication Date
CN102033879A CN102033879A (en) 2011-04-27
CN102033879B true CN102033879B (en) 2015-02-18

Family

ID=43886792

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910177127.XA Active CN102033879B (en) 2009-09-27 2009-09-27 Method and device for identifying Chinese name

Country Status (1)

Country Link
CN (1) CN102033879B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101082908A (en) * 2007-06-26 2007-12-05 Tencent Technology (Shenzhen) Co., Ltd. Method and system for dividing Chinese sentences
CN101154226A (en) * 2006-09-27 2008-04-02 Tencent Technology (Shenzhen) Co., Ltd. Method for adding unlisted word to word stock of input method and its character input device
CN101295292A (en) * 2007-04-23 2008-10-29 Peking University Founder Group Co., Ltd. Method and device for modeling and naming entity recognition based on maximum entropy model
CN101510221A (en) * 2009-02-17 2009-08-19 Peking University Enquiry statement analytical method and system for information retrieval

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
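Research on Chinese Named Entity Recognition and Relation Extraction; 温锐 (Wen Rui); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology; 20060415; 1-2, 24-29 *
Chinese Named Entity Recognition Based on Cascaded Hidden Markov Models; 俞鸿魁 (Yu Hongkui) et al.; Journal on Communications; 20060228; Vol. 27 (No. 2); 87-94 *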

Also Published As

Publication number Publication date
CN102033879A (en) 2011-04-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131016

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20131016

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Room 403, East Block 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong Province 518057

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant