CN102955775A - Automatic foreign name identification and control method based on context semantics

Info

Publication number: CN102955775A
Application number: CN2012101972389A
Authority: CN
Inventors: 王祖兴; 吕钊; 顾君忠
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2012-06-14
Filing date: 2012-06-14
Publication date: 2013-03-06

Abstract

The invention provides an automatic foreign name identification and control method based on context semantics for use in a natural language processing system, developed by studying the characteristics of foreign names and combining them with a statistical probability model. The method comprises the following steps: a. analyzing a text to be identified and acquiring a candidate foreign name string set; b. correcting and screening the candidate foreign name string set by means of a foreign name rule set to acquire a first intermediate foreign name string set; c. further screening the first intermediate foreign name string set by means of probability statistics and a probability model to acquire an identified foreign name set; and d. determining the unidentified foreign names according to the identified foreign name set. The system makes full use of the context characteristics of names and the character-usage characteristics of foreign names, greatly reduces identification errors caused by word segmentation, largely avoids other named entities being identified as personal names, and improves the identification effect.

Description

Automatic foreign name identification and control method based on context semantics

Technical field

The present invention relates to the field of natural language processing, and specifically to foreign name recognition within named entity recognition.

Background art

Named entity recognition is both a hot topic and a foundational task in natural language processing. It is applied in many areas of the field, such as information retrieval, information extraction and machine translation. Named entities generally include personal names, place names, organization names, dates, times and so on. Among the various kinds of named entity recognition, personal name recognition has always held a key position, and its performance has a significant influence on Chinese word segmentation. Personal names in Chinese text include Chinese names and foreign names. Chinese names have been studied extensively with good results, whereas research devoted to foreign name recognition is comparatively scarce, and its recognition performance leaves considerable room for improvement.

The present invention identifies foreign names on the basis of the context semantics of names and the character-usage characteristics of foreign names. The method can be divided into two stages, a training stage and a recognition stage. In the training stage, we extract the contextual information words of names from a training corpus and summarize name recognition rules. In the recognition stage, we use the name recognition rules and five foreign name character sets to obtain candidate names, and we screen the candidate names that do not lie between left and right boundary words under a stricter constraint, namely a probability model. Finally, local statistics are used to correct names whose boundaries were identified incorrectly, and correctly identified names are used to recall names that were not identified.

Summary of the invention

In view of the defects in the prior art, the invention provides an automatic foreign name identification and control method based on context semantics in a natural language processing system, characterized by comprising the following steps: a. analyzing a text to be identified and obtaining a candidate foreign name string set; b. correcting and screening the candidate foreign name string set by means of a foreign name rule set to obtain an intermediate foreign name string set; c. further screening the intermediate foreign name string set by means of probability statistics and a probability model to obtain an identified foreign name set; and d. confirming the unidentified foreign names according to the identified foreign name set.

Preferably, before step a the method comprises the following step: i. training on a manually annotated corpus to generate the foreign name rule set.

Preferably, step i further comprises the following steps: i1. extracting the sentences that contain foreign names from the manually annotated corpus; i2. removing the annotations from those sentences and using the resulting sentences as a temporary test corpus; i3. identifying names in the temporary test corpus with a foreign name recognition system that performs name identification using only the name character-usage rules; i4. comparing the identification results with the original annotations and summarizing candidate foreign name rules; and i5. adding the rules to the foreign name rule set.

Preferably, step i further comprises the following steps: i6. judging whether new foreign name rules can be summarized from the temporary test corpus; i7. if the result of step i6 is that further foreign name rules can be summarized from the temporary test corpus, repeating steps i3 to i5.

Preferably, step a further comprises the following steps: a1. segmenting the text to be identified into words and tagging the words with part-of-speech elements; a2. extracting the set of unlabeled strings and identifying the candidate foreign name string set within the set of unlabeled strings.

Preferably, step a2 further comprises the following steps: a21. extracting an unlabeled string, cutting out the substring that may be a transliterated foreign name as a new unlabeled string, every character of which belongs to the foreign name character set, and taking the first character of the new unlabeled string as the first Chinese character; a22. judging whether the first Chinese character belongs to the foreign name head-character set; a23. if step a22 determines that the first Chinese character does not belong to the foreign name head-character set, taking the character that follows the current first Chinese character in the unlabeled string as the first Chinese character and returning to step a21; a24. if step a22 determines that the first Chinese character belongs to the foreign name head-character set, taking the last character of the unlabeled string as the second Chinese character; a25. judging whether the second Chinese character belongs to the foreign name tail-character set; a26. if step a25 determines that the second Chinese character does not belong to the foreign name tail-character set, taking the character that precedes the current second Chinese character in the unlabeled string as the second Chinese character and returning to step a25; a27. if step a25 determines that the second Chinese character belongs to the foreign name tail-character set, taking the string from the first Chinese character to the second Chinese character in the unlabeled string as a candidate foreign name string; and a28. repeating steps a21 to a28 until every unlabeled string in the set of unlabeled strings has been processed, thereby forming the candidate foreign name string set.

Preferably, step c further comprises the following steps: c1. extracting the candidate foreign name strings that do not lie between left and right boundary words in the text to be identified, wherein a left boundary word is a word that often appears before a name and a right boundary word is a word that often appears after a name; c2. using the probability model to calculate the probability that each candidate foreign name string extracted in step c1 is a real name, and screening the candidate foreign name strings according to a first threshold.

Preferably, after step c2 the method further comprises the following step: c3. using local statistics to correct boundary identification errors in the candidate foreign names that passed the screening of step c2 and in the candidate foreign names that did not undergo the screening of step c2.

Preferably, step d further comprises the following step: d1. confirming as foreign names those strings in the text to be identified that are identical to a foreign name in the identified foreign name set but were not identified.

Preferably, the part-of-speech elements comprise: general words; right boundary words; left boundary words; and words that can serve as both a left boundary word and a right boundary word.

According to another aspect of the present invention, there is also provided an automatic foreign name identification and control device based on context semantics in a natural language processing system, characterized by comprising the following modules: a foreign name rule set generation module, for extracting the foreign name rule set from the manually annotated corpus; a candidate foreign name string set generation module, for analyzing the text to be identified and obtaining the candidate foreign name string set; a rule correction module, for correcting and screening the candidate foreign name string set by means of the foreign name rule set; a probability correction module, for further screening by means of probability statistics and a probability model to obtain the identified foreign name set; and a recall module, for determining the unidentified foreign names according to the identified foreign names.

According to another aspect of the present invention, there is also provided an automatic foreign name identification and control device based on context semantics in a natural language processing system, characterized by carrying out the following steps: a. analyzing a text to be identified and obtaining a candidate foreign name string set; b. correcting and screening the candidate foreign name string set by means of a foreign name rule set to obtain an intermediate foreign name string set; c. further screening the intermediate foreign name string set by means of probability statistics and a probability model to obtain an identified foreign name set; and d. confirming the unidentified foreign names according to the identified foreign name set.

By studying the characteristics of foreign names and combining them with a statistical probability model, the present invention constructs an automatic foreign name recognition system. The text is first segmented into words, and candidate names are then obtained on the basis of the character-usage characteristics of foreign names and their shallow semantic relations with the context. Local statistics, together with the recall of unidentified names from identified names, then yield the final recognition result of the system. The system makes full use of the contextual features of names and the character-usage features of foreign names, greatly reduces the identification errors caused by word segmentation, deals well with cases in which other named entities are identified as personal names, and improves the recognition effect.

Description of drawings

Other features, objects and advantages of the present invention will become more apparent from the following description of foreign name identification, read with reference to the accompanying drawings:

Fig. 1 is a flowchart of the context-semantics-based automatic foreign name identification method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of obtaining the candidate foreign name string set from the text to be identified according to the first embodiment of the present invention;

Fig. 3 is a structural diagram of the context-semantics-based automatic foreign name identification method according to a second embodiment of the present invention; and

Fig. 4 is a flowchart of the training process and the identification process of the context-semantics-based automatic foreign name identification method according to a third embodiment of the present invention.

Embodiment

Fig. 1 shows, according to the first embodiment of the present invention, a flowchart of the context-semantics-based automatic foreign name identification method. Specifically, the figure shows four steps. The first is step S201, in which the text to be identified is analyzed and a candidate foreign name string set is obtained. Next is step S202, in which the candidate foreign name string set is corrected and screened by means of the foreign name rule set to obtain a first intermediate foreign name string set. Step S202 is followed by step S203, in which the first intermediate foreign name string set is further screened by means of probability statistics and a probability model to obtain the identified foreign name set. Finally, the unidentified foreign names are confirmed according to the identified foreign name set.

Specifically, those skilled in the art will understand that three key factors affect the foreign name recognition performance of the present invention: the contextual features of names, the character-usage features of foreign names, and the Chinese word sets. In addition, the feature words of place names and organization names also influence recognition performance.

The contextual features of a name are strong indicators of the name. They are usually embodied by the contextual information words of the name. In the present invention, we divide the context keywords into two kinds: (1) words that often appear before a name, called left boundary words; (2) words that often appear after a name, called right boundary words. Some boundary words appear both before and after names; such a word is both a left boundary word and a right boundary word. Our foreign name knowledge system contains about 8000 contextual information words in total, including appellations (president), predicate verbs (attending), adjectives (diligent), conjunctions (with), adverbs, prepositions (with), punctuation marks and so on.

Although the characters used in foreign names differ greatly from those used in Chinese names, they are relatively concentrated. By analyzing a 4.57 MB dictionary of foreign names, we obtained five foreign name character sets.

A. Foreign name character set (ACS): stores the Chinese characters that can be used in foreign names.

B. Foreign name head-character set (HCS): stores the Chinese characters that can be used as the first character of a foreign name.

C. Foreign name tail-character set (TCS): stores the Chinese characters that can be used as the last character of a foreign name.

D. Head-only character set (UHCS): stores the Chinese characters that can be used only as the first character of a foreign name.

E. Tail-only character set (UTCS): stores the Chinese characters that can be used only as the last character of a foreign name.
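As an illustration of how the five sets relate to one another, the sketch below derives them from a list of transliterated names. The function name `build_char_sets`, the three sample names, and the reading of "only" (a character seen in that position and in no other) are assumptions made for illustration; the description does not say how the sets were actually compiled.

```python
def build_char_sets(names):
    """Derive the five foreign-name character sets from a list of names."""
    acs, hcs, tcs = set(), set(), set()
    non_head, non_tail = set(), set()  # characters seen outside head / tail position
    for name in names:
        if len(name) < 2:
            continue  # skip degenerate entries
        acs.update(name)
        hcs.add(name[0])
        tcs.add(name[-1])
        non_head.update(name[1:])
        non_tail.update(name[:-1])
    return {
        "ACS": acs,
        "HCS": hcs,
        "TCS": tcs,
        "UHCS": hcs - non_head,  # assumed: head characters never seen elsewhere
        "UTCS": tcs - non_tail,  # assumed: tail characters never seen elsewhere
    }


# Hypothetical sample data: Bush, Clinton, Blair in Chinese transliteration.
sets = build_char_sets(["布什", "克林顿", "布莱尔"])
```

With real data the head-only and tail-only sets would be much smaller than HCS and TCS, since most characters occur in several positions.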

Because other named entities (such as place names and organization names) are often identified as personal names, we collected the feature words of place names and organization names. For example, "province" and "state" are feature words of place names, and "university" and "office" are feature words of organization names.

Some foreign names contain an ordinary word internally, or form an ordinary word together with their context, so they may be split incorrectly by the word segmenter. To address this problem, we divide the words in the Chinese word set into three classes according to their relation to foreign names: the standard Chinese word set (SCWS), the conflict word set (CWS) and the related word set (RWS).

A. The standard Chinese word set stores the words that cannot be used in foreign names; these words serve as the segmentation dictionary.

B. The conflict word set stores the words that can occur in foreign names but do not themselves constitute a foreign name. For example, "optimistic" (开朗) is a common Chinese word, but it can also form part of a foreign name such as "Michelangelo" (米开朗基罗). If "optimistic" were stored in the standard Chinese word set, segmentation would split "Michelangelo" around that word and cause a segmentation error. For this reason we set up the conflict word set.

C. The related word set stores the words that can serve both as a foreign name and as the name of another kind of entity, together with their associated words. For example, "Washington" can refer either to the founding president of the United States or to the capital of the United States. When associated words such as "capital", "city" or "White House" appear in the context, we treat it as a place name rather than a personal name.
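The "Washington" case can be sketched as a lookup keyed on the ambiguous word. In this minimal sketch the dictionary `RWS`, the function `is_name_in_context` and the Chinese sample entries are hypothetical; the description specifies only that associated words in the context switch the reading from personal name to another entity type.

```python
# Hypothetical related word set: ambiguous word -> context words that
# signal a non-name reading (here: capital, city, White House).
RWS = {"华盛顿": {"首府", "市", "白宫"}}


def is_name_in_context(candidate, context, rws=RWS):
    """Return False when an associated word from the RWS appears in the context."""
    related = rws.get(candidate)
    if related is None:
        return True  # not an ambiguous word; other checks decide
    return not any(word in context for word in related)
```

A real system would probably restrict `context` to a window around the candidate rather than the whole sentence; the window size is left open here.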

Fig. 2 shows, according to the first embodiment of the present invention, a flowchart of obtaining the candidate foreign name string set from the text to be identified. Specifically, the figure shows 11 steps. The first is step S301, in which the text to be identified is segmented into words and tagged with parts of speech; those skilled in the art will understand that, preferably, simple segmentation and tagging can be performed with the SCWS (Standard Chinese Word Set) segmentation dictionary. More specifically, the sentence C0C1 ... Ck-6Ck-5Ck-4Ck-3Ck-2Ck-1CkCk+1Ck+2Ck+3Ck+4Ck+5Ck+6Ck+7Ck+8Ck+9Ck+10 ... Cn-1Cn becomes, after segmentation and tagging: C0C1/rm ... Ck-6/Bn Ck-5Ck-4Ck-3Ck-2Ck-1 Ck/Av Ck+1Ck+2 Ck+3/An Ck+4Ck+5/rm Ck+6/Cp Ck+7Ck+8 Ck+9Ck+10/Av ... Cn-1Cn/rm. Here "/rm" denotes a general word, which belongs to the standard Chinese word set; "/A*" denotes a right boundary word; "/B*" denotes a left boundary word; and "/C*" denotes a word that can serve as both a left boundary word and a right boundary word. The "*" in "/A*", "/B*" and "/C*" stands for the part of speech of the word: for example, "/An" indicates a right boundary word that is a noun, and "/Av" indicates a right boundary word that is a verb.
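The tag notation can be decoded mechanically. The following is a minimal sketch; the function name `parse_tag` and the returned role labels are hypothetical, and it assumes tags take exactly the forms described above ("rm", or one of A, B, C followed by a part-of-speech code).

```python
def parse_tag(tag):
    """Split a tag such as 'An' into (boundary role, part-of-speech code)."""
    if tag == "rm":
        return ("general", None)  # general word from the standard Chinese word set
    roles = {"A": "right", "B": "left", "C": "both"}
    return (roles[tag[0]], tag[1:])
```

For instance, `parse_tag("An")` yields the right-boundary role with part-of-speech code `"n"`, matching the noun example in the text.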

Next is step S302, in which the set of unlabeled strings is extracted: using the foreign name character set (ACS), the foreign name head-character set (HCS) and the foreign name tail-character set (TCS), strings are taken from the unlabeled portions of the text (for example Ck-5Ck-4Ck-3Ck-2Ck-1, Ck+1Ck+2 and Ck+7Ck+8) to form the set of unlabeled strings.

After the set of unlabeled strings has been generated, steps S303 to S311 are executed to identify the candidate foreign name string set from the unlabeled strings according to the following rule. For any Chinese character string C1C2 ... Ck ... Cn, if every character Ci belongs to the foreign name character set, the string is a potential candidate name. The head-tail approximation strategy is then used to further determine the boundaries of the potential candidate name, yielding a candidate name string.
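The membership rule for potential candidate names can be sketched as splitting an unlabeled string into maximal runs of ACS characters. The function name `potential_candidates` and the sample character set are hypothetical, and treating every maximal run as a separate potential candidate is an assumption; the description states only the membership condition.

```python
def potential_candidates(unlabeled, acs):
    """Split an unlabeled string into maximal runs whose characters are all in ACS."""
    runs, current = [], []
    for ch in unlabeled:
        if ch in acs:
            current.append(ch)
        elif current:
            # A character outside ACS closes the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs
```

Each run returned here would then be passed to the head-tail approximation to fix its boundaries.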

Specifically, step S303 extracts one unlabeled string C1C2 ... Ck ... Cn from the set of unlabeled strings and takes its first character C1 as the first Chinese character Ct. Next is step S304, which judges whether the first Chinese character Ct belongs to the foreign name head-character set. If Ct does not belong to that set, step S305 is executed: the character Ct+1 that follows Ct becomes the first Chinese character Ct, and control returns to step S304 to judge again whether Ct belongs to the foreign name head-character set. If Ct belongs to the head-character set, step S306 is executed: the head position of the potential name is fixed, and the last character of the unlabeled string is taken as the second Chinese character Cu. Step S306 is followed by step S307, which judges whether the second Chinese character Cu belongs to the foreign name tail-character set. If Cu does not belong to that set, step S308 is executed: the character Cu-1 that precedes Cu becomes the second Chinese character Cu, and control returns to step S307 to judge again whether Cu belongs to the foreign name tail-character set. If Cu belongs to the tail-character set, step S309 is executed: the tail position of the potential name is fixed, and the string from the first Chinese character to the second Chinese character, provided it contains no fewer than two characters, is taken as a candidate foreign name string. Step S309 is followed by step S310, which judges whether all unlabeled strings in the set of unlabeled strings have been processed. If not, steps S303 to S310 are repeated; if so, step S311 is executed and the candidate foreign name string set is formed from the unlabeled strings.

Fig. 3 shows, according to a second embodiment of the present invention, a structural diagram of the context-semantics-based automatic foreign name identification method. Specifically, the figure shows five modules. The foreign name rule set generation module 21 extracts the foreign name rule set from the manually annotated corpus; the foreign name rule set is used to correct the candidate foreign name string set, for example by the following rules:

The foreign name rule set generation module 21 summarized 16 rules in total from 1323 collected sentences that contain foreign names, amounting to 1612 foreign names.

The candidate foreign name string set generation module 22 analyzes the text to be identified and obtains the candidate foreign name string set. Specifically, those skilled in the art will understand that the set of unlabeled strings is extracted from the text to be identified after word segmentation and part-of-speech tagging, and the candidate foreign name string set is identified using the head-tail approximation principle. The rule correction module 23 corrects and screens the candidate foreign name string set using the foreign name rule set generated by the foreign name rule set generation module 21. The probability correction module 24 performs further screening by means of probability statistics and a probability model to obtain the identified foreign name set. Specifically, when an intermediate foreign name string does not lie between left and right boundary words, the probability model is first used to calculate the probability that the candidate foreign name string extracted in step c1 is a real name, and the candidate foreign name strings are screened according to the first threshold; the strings that pass are then processed together with the intermediate foreign name strings that lie between left and right boundary words, using local statistics to correct boundary identification errors. The recall module 25 uses the foreign names identified by the probability correction module 24 to recall the foreign names in the text to be identified that were not identified. Using identified names to recall unidentified names gives the same result.
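The recall performed by module 25 amounts to finding further occurrences of names that have already been identified. A minimal sketch follows; the function name `recall_names`, the span representation and the longest-name-first tie-break are assumptions for illustration, not details given in this description.

```python
def recall_names(text, identified):
    """Return (start, end, name) spans for every occurrence of identified names."""
    spans = []
    # Try longer names first so a longer name wins over one it contains.
    for name in sorted(identified, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            # Skip occurrences that overlap a span already claimed.
            if not any(s < end and start < e for s, e, _ in spans):
                spans.append((start, end, name))
            start = text.find(name, start + 1)
    return sorted(spans)
```

Occurrences that the earlier stages missed, for example because they lacked boundary words, are picked up here by plain string matching.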

Fig. 4 shows, according to a third embodiment of the present invention, a flowchart of the training process and the identification process of the context-semantics-based automatic foreign name identification method. First is the training stage. The present invention uses two large manually annotated corpora as training corpora: the People's Daily corpus and the LCMC (The Lancaster Corpus of Mandarin Chinese) corpus. We extract the contextual information words of names and summarize foreign name rules from the training corpora. The basic steps for obtaining the foreign name rules are as follows:

First step: extract the sentences that contain foreign names from the corpus;

Second step: remove the annotations from these sentences and use the set of these sentences as a temporary test corpus (TTC);

Third step: use the initial foreign name recognition system to identify names in the TTC (the initial foreign name recognition system is the system without rule-based correction, which performs name identification using only the name character-usage rules);

Fourth step: compare the identification results with the original annotations and summarize candidate foreign name rules;

Fifth step: add each candidate foreign name rule to the system in turn, compare the accuracy the system achieves after each rule is added, and finally add the rule that achieves the highest accuracy to the initial name recognition system.

The third to fifth steps are repeated until no new rule improves the accuracy of the system.
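The loop over the third to fifth steps resembles greedy forward selection. The sketch below makes that reading concrete under two simplifying assumptions: the pool of candidate rules is fixed up front (the text re-derives candidates on every pass), and `accuracy_of` is a hypothetical callback that returns the accuracy of the recognizer on the temporary test corpus with the given rules enabled.

```python
def select_rules(candidate_rules, accuracy_of):
    """Greedily add the rule that most improves accuracy until none helps."""
    selected = []
    remaining = list(candidate_rules)
    best = accuracy_of(selected)  # accuracy of the initial system
    while remaining:
        # Score each remaining rule when added on top of the current set.
        scored = [(accuracy_of(selected + [rule]), rule) for rule in remaining]
        top_acc, top_rule = max(scored, key=lambda pair: pair[0])
        if top_acc <= best:
            break  # no remaining rule improves accuracy
        selected.append(top_rule)
        remaining.remove(top_rule)
        best = top_acc
    return selected
```

Stopping on "no improvement" mirrors the termination condition in the text; a rule that merely keeps accuracy unchanged is not added.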

Recognition stage

The main steps of the recognition stage are as follows:

First step: perform word segmentation and tagging with the SCWS. For example, the sentence C0C1 ... Ck-6Ck-5Ck-4Ck-3Ck-2Ck-1CkCk+1Ck+2Ck+3Ck+4Ck+5Ck+6Ck+7Ck+8Ck+9Ck+10 ... Cn-1Cn becomes, after segmentation and tagging: C0C1/rm ... Ck-6/Bn Ck-5Ck-4Ck-3Ck-2Ck-1 Ck/Av Ck+1Ck+2 Ck+3/An Ck+4Ck+5/rm Ck+6/Cp Ck+7Ck+8 Ck+9Ck+10/Av ... Cn-1Cn/rm. Here "/rm" denotes a general word, "/A*" denotes a right boundary word, "/B*" denotes a left boundary word, and "/C*" denotes a word that can serve as both a left boundary word and a right boundary word.

Second step: use the ACS, HCS and TCS to identify candidate names in the unlabeled strings (for example Ck-5Ck-4Ck-3Ck-2Ck-1, Ck+1Ck+2 and Ck+7Ck+8).

Potential candidate names are first identified in the unlabeled strings by means of the name character set. For any Chinese character string C1C2 ... Ck ... Cn, if every character Ci belongs to the foreign name character set, the string is a potential candidate name. The head-tail approximation strategy is then used to further determine the boundaries of the potential candidate name, yielding a candidate name string, as described below:

Step 0: let the potential candidate name string be C1C2 ... Ck ... Cn; t ← 1.

Step 1: if Ct belongs to the foreign name head-character set, the head position of the potential name is determined; t ← n, go to Step 3. Otherwise go to Step 2.

Step 2: t ← t+1; go to Step 1.

Step 3: if Ct belongs to the foreign name tail-character set, the tail position of the potential name is determined; go to Step 5. Otherwise go to Step 4.

Step 4: t ← t-1; go to Step 3.

Step 5: if the length of the potential name is greater than 2, the potential candidate name string obtained is a candidate name; otherwise it is not.
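Steps 0 to 5 can be sketched as two scans, one from each end. In this minimal sketch the function name `head_tail_approximation` and the separate `head` and `tail` indices are illustrative (the steps reuse a single index t). The minimum length is exposed as the parameter `min_len` because the text states the length condition both as "greater than 2" and as "no fewer than two characters"; the default of 2 is an assumption.

```python
def head_tail_approximation(s, hcs, tcs, min_len=2):
    """Trim a potential candidate string to a valid head and tail character.

    Returns the candidate name string, or None if no valid span remains.
    """
    head = 0
    while head < len(s) and s[head] not in hcs:
        head += 1  # Step 2: move the head position right
    if head == len(s):
        return None  # no character can start a name
    tail = len(s) - 1
    while tail > head and s[tail] not in tcs:
        tail -= 1  # Step 4: move the tail position left
    if s[tail] not in tcs:
        return None  # no character after the head can end a name
    candidate = s[head:tail + 1]
    return candidate if len(candidate) >= min_len else None
```

Unlike the steps as written, the sketch also terminates when no head or tail character exists, which the textual description leaves implicit.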

Third step: use the rule set to correct the candidate names.

If the form "-" followed by digits occurs after the candidate name, the candidate name is rejected.

If a Chinese numeral followed by a measure word occurs within the candidate name, or the first character of the candidate name is a measure word and a Chinese numeral precedes the candidate name, the candidate name is rejected. If the last character of the candidate name is a Chinese numeral and a measure word follows the candidate name, the last character is removed and the remaining string is evaluated again as a candidate name.

If the first character of the candidate name is the name separator dot ("·"), the search continues to the left of the candidate name for a character that can serve as the first character of a name, until such a character is found or the beginning of the unlabeled string is reached; if such a character is found, the head boundary of the candidate name is reset to the position of that character.

If the last character of the candidate name is the name separator dot ("·"), the search continues to the right of the candidate name for a character that can serve as the last character of a name, until such a character is found or the end of the unlabeled string is reached; if such a character is found, the tail boundary of the candidate name is reset to the position of that character.

If a feature word of an organization name, place name or the like appears after the name, the candidate name is rejected.

If the name lead-in can be done the preposition word, for example " with ", the word that can arrange in pairs or groups with this preposition has then appearred again, for example " be ", then candidate's name left margin resets the boundary, with second word as candidate's name lead-in.

If the Chinese character of candidate's name tail word and candidate's name front or back consist of regular collocation to the time, such as " when ... the time ", candidate's name tail word be " time ", and candidate's name front occurred Chinese character " when ", then negate this candidate's name.

If occurred among candidate's name being used for the word of country name lead-in, and when having occurred Chinese characters such as " army ", " side ", " formula ", " enterprise " behind this word, negative this candidate's name.

If the name lead-in is " institute ", and not comprising " Saloman " in the name, then negates this candidate's name.

If candidate's name is positioned at a side of conjunction, and opposite side is not candidate's name, neither personal pronoun, then negative this candidate's name.

If candidate's name is included in the name conflict word set, it then negate this candidate's name.

If candidate's name is included in conjunctive word and concentrates, and word associated with it is not used in name, then negates this candidate's name.

If several candidate names occur consecutively, separated by ",", we judge as follows:

Suppose there are n "," separators and no contextual information word appears between them, and let num denote the number of candidate names identified. If num ≥ (n+1)/2, all the strings separated by "," are candidate names; otherwise none of them is. If a feature word of an organization name, place name or the like appears between the "," separators, all candidate names separated by "," are likewise rejected. For example, in the sentence C1C2C3C4C5C6C7C8C9C10C11C12C13C14C15C16C17C18C19 ... Cn, where C5, C8, C12 and C15 are "," and C3C4 and C6C7 are identified as candidate names, n is 4 and num is 2; because num ≥ (n+1)/2 does not hold (2 < 2.5), all candidate names separated by "," are rejected.
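The threshold in this rule is simple enough to check directly. The sketch below encodes only the inequality num ≥ (n+1)/2; the function name `keep_enumerated_candidates` is hypothetical, and the feature-word check described in the same paragraph is left out.

```python
def keep_enumerated_candidates(n_separators, num_candidates):
    """Return True when num >= (n + 1) / 2, i.e. the separated strings are kept."""
    return num_candidates >= (n_separators + 1) / 2
```

With the figures from the example in the text (n = 4, num = 2) the condition fails, so every candidate in the run is rejected; one more identified candidate would tip the decision the other way.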

If the interior of the name contains a character that is used only as the first character of a name or only as the last character of a name, the candidate name is rejected.

If a personal pronoun such as "he" or "she" appears near the candidate name, the likelihood that it is a real name increases greatly.

If the left side of the candidate name is a name left-boundary word, or the candidate name is at the beginning of the sentence, the position of the first character of the candidate name is determined.

If the right side of the candidate name is a name right-boundary word, or the candidate name is at the end of the sentence, the position of the last character of the candidate name is determined.

After the above correction, if the candidate name lies between a left boundary word and a right boundary word, go to step 5; otherwise go to step 4.
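The routing decision at the end of step 3 can be sketched as below. The function name `next_step`, the token-list representation and the sample word lists are hypothetical; treating the sentence start and end as fixed boundaries follows the two boundary rules above, but combining them into one test is an assumption.

```python
def next_step(tokens, index, left_words, right_words):
    """Decide whether a candidate name skips the probability model.

    tokens      : the segmented sentence (list of words)
    index       : position of the candidate name in tokens
    left_words  : words that often precede a name (hypothetical list)
    right_words : words that often follow a name (hypothetical list)
    Returns 5 when both boundaries are fixed, otherwise 4.
    """
    left_fixed = index == 0 or tokens[index - 1] in left_words
    right_fixed = index == len(tokens) - 1 or tokens[index + 1] in right_words
    return 5 if left_fixed and right_fixed else 4


left_words = {"Chancellor", "President"}
right_words = {"said", "denied"}
print(next_step(["Chancellor", "Merkel", "said"], 1, left_words, right_words))  # 5
print(next_step(["media", "Merkel", "discuss"], 1, left_words, right_words))    # 4
```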

Step 4: for candidate names that do not lie between left and right boundary words, we use a probability model to calculate the probability that each is a real name. We use Pf, Pm and Pl to denote the probabilities that the characters of a candidate name string can serve as the first character, the middle characters and the last character of a foreign name, respectively. Pname denotes the probability that the candidate name is a real name. The probability is calculated as follows:

Here NumInName(C) and NumInText(C) denote the number of times character C appears in the foreign name library and in the corpus, respectively. NumAsFirst(C), NumAsMiddle(C) and NumAsLast(C) are the number of times Chinese character C appears as the first character, as a middle character and as the last character of a foreign name, respectively. Len denotes the length of the candidate name, that is, the number of Chinese characters it contains. For an ordinary character string, the longer it is, the more likely the segmenter is to split it; so the longer a candidate name is, the more likely it is a real name, because the segmenter did not split it. To account for this, we introduce a parameter β that raises the probability of longer candidate names being real names: for a fixed Pf*Pm*Pl, the longer the candidate name, the higher its probability of being a real name. After computing the probability, we set a threshold and discard names whose probability falls below it. Using cross-validation, we found the optimal threshold to be 0.026.
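A rough sketch of this kind of scoring is given below. It is not the formula of the invention: only the count definitions, the role of β and the threshold 0.026 come from the description. The way the counts are turned into Pf, Pm and Pl (a ratio against NumInText), the length bonus β^(Len-2), the default β = 1.2 and all names in the code are assumptions made for illustration.

```python
THRESHOLD = 0.026  # threshold reported in the description


def name_probability(candidate, counts, beta=1.2):
    """Score a candidate name from per-character counts.

    counts maps a character to a dict with keys "first", "middle", "last"
    (times seen in that position of a foreign name) and "text" (times seen
    in the corpus). The ratio form and the beta bonus are assumptions.
    """
    def ratio(ch, position):
        c = counts.get(ch)
        return c[position] / c["text"] if c and c["text"] else 0.0

    pf = ratio(candidate[0], "first")
    pl = ratio(candidate[-1], "last")
    pm = 1.0
    for ch in candidate[1:-1]:
        pm *= ratio(ch, "middle")
    # For a fixed pf * pm * pl, a longer candidate gets a higher score.
    return min(1.0, pf * pm * pl * beta ** (len(candidate) - 2))


def is_real_name(candidate, counts, beta=1.2, threshold=THRESHOLD):
    """Keep the candidate only if its score reaches the threshold."""
    return name_probability(candidate, counts, beta) >= threshold


# Hypothetical counts for three characters.
counts = {
    "默": {"first": 8, "middle": 0, "last": 0, "text": 10},
    "克": {"first": 0, "middle": 10, "last": 0, "text": 10},
    "尔": {"first": 0, "middle": 1, "last": 5, "text": 10},
}
print(is_real_name("默克尔", counts))
```

The `min(1.0, ...)` cap only keeps the score inside the range of a probability; any monotone length bonus would serve the same illustrative purpose.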

Step 5: use local statistics to correct names whose boundaries were identified wrongly. When the same name appears at different positions, it may be identified as different Chinese character strings, for example because the name forms a word with its context, or because the name and an adjacent boundary character are identified together as one candidate name. For example:

Mr. and Mrs. Ku Pa have been married for 67 years and remain very close.

Ku Pa said that he likes the feeling of exercising.

Ku Pa was at that time a member of a British special service force.

In the three sentences above, "Ku Pa" occurs three times as a name. It is correctly identified as a name the first two times, but in the third sentence the string "Ku Pa when" (the name plus the character that follows it) is wrongly identified as a name. To solve this problem, we take every 100 characters as one statistical unit and use the partial statistics algorithm (PFS) to revise the names. The PFS algorithm is as follows:

Its basic idea is: count the frequency of each distinct name in the unit, and sort the names by length from longest to shortest. For each name name_i, if a name name_j (j > i) is a part of name_i, then: (1) if every occurrence of name_j together with its adjacent boundary characters forms name_i, replace every name_j and its boundary characters with name_i; (2) otherwise, if the frequency of name_j is higher than that of name_i and name_i does not contain the name separator dot, change every name_i to name_j, and if the remaining characters of name_i still form a name that has occurred in this unit, mark the remaining characters as a name as well.
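The two cases of the PFS idea can be sketched as follows. This is a simplified illustration under stated assumptions: names are represented as (start, end) spans over the text of one unit, the helper names are hypothetical, frequencies are counted once before any correction, and the Chinese strings in the usage example are a reconstruction of the "Ku Pa" sentences.

```python
from collections import Counter

SEPARATOR = "·"  # name separator dot inside transliterated full names


def covering_span(text, start, end, longer):
    """Span of `longer` in `text` that covers text[start:end], or None."""
    for offset in range(len(longer) - (end - start) + 1):
        s = start - offset
        if s >= 0 and text[s:s + len(longer)] == longer:
            return (s, s + len(longer))
    return None


def pfs_correct(text, spans):
    """Correct name boundaries inside one statistical unit."""
    freq = Counter(text[s:e] for s, e in spans)
    ordered = sorted(freq, key=len, reverse=True)  # longest names first
    spans = list(spans)
    for i, long_name in enumerate(ordered):
        for short_name in ordered[i + 1:]:
            if short_name not in long_name:
                continue
            shorts = [sp for sp in spans if text[sp[0]:sp[1]] == short_name]
            covers = {sp: covering_span(text, sp[0], sp[1], long_name)
                      for sp in shorts}
            if shorts and all(covers.values()):
                # Case (1): every short occurrence plus its neighbours
                # forms the long name, so widen all of them.
                spans = [covers.get(sp, sp) for sp in spans]
            elif freq[short_name] > freq[long_name] and SEPARATOR not in long_name:
                # Case (2): the shorter name is more frequent, so shrink
                # each long occurrence and keep a remainder that is itself
                # a name seen in this unit.
                k = long_name.find(short_name)
                fixed = []
                for s, e in spans:
                    if text[s:e] != long_name:
                        fixed.append((s, e))
                        continue
                    fixed.append((s + k, s + k + len(short_name)))
                    for rs, re_ in ((s, s + k), (s + k + len(short_name), e)):
                        if re_ > rs and text[rs:re_] in freq:
                            fixed.append((rs, re_))
                spans = fixed
    return sorted(set(spans))


text = "库帕夫妇结婚67年。库帕则表示喜欢锻炼。库帕当时是成员。"
spans = [(0, 2), (10, 12), (20, 23)]   # the third span wrongly includes "当"
print([text[s:e] for s, e in pfs_correct(text, spans)])
```

In the usage example the shorter name occurs twice and the longer string once, so case (2) applies and the third occurrence is shrunk to the two-character name.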

Step 6: we use the names that have been correctly identified to recall the names that were not identified. The same name may be correctly identified at some positions and missed at others, depending on where it occurs. We can therefore use the identified names to find the unidentified ones.

Another embodiment of the present invention is shown below.

We selected a news item from the Europe news column of CRI Online, with the following content:
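The recall step amounts to searching the text for every name that has already been identified. A minimal sketch follows; the function name `recall_names` and the choice to match longer names first so that nested shorter names are not reported twice are assumptions.

```python
def recall_names(text, identified):
    """Find every occurrence of already identified names in the text.

    Returns sorted (start, end) spans. Longer names are matched first,
    and a span overlapping an earlier match is skipped.
    """
    spans = []
    for name in sorted(identified, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            end = start + len(name)
            if not any(s < end and start < e for s, e in spans):
                spans.append((start, end))
            start = text.find(name, start + 1)
    return sorted(spans)


text = "Merkel met Wu Erfu. Later Wu Erfu thanked Merkel."
found = recall_names(text, ["Merkel", "Wu Erfu"])
print([text[s:e] for s, e in found])
```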

German Chancellor's spokesman denies report that Merkel is seeking to replace the president

www.chinanews.com, January 8: According to foreign news agencies, a spokesman for German Chancellor Merkel on the 7th denied media reports that Merkel was discussing replacing the president with party allies.

German President Wu Erfu has been criticized from many sides for threatening the media to drop a report on his private loan; Wu Erfu publicly apologized for this, while saying that he refuses to resign the presidency. But members of Wu Erfu's own party think his apology lacks sincerity.

Analysts say the president's scandal may deal no small blow to Merkel, who is currently doing her utmost to resolve the euro crisis; Merkel nominated Wu Erfu as German president in 2010.

Merkel's spokesman said, "Merkel is not discussing the matter of a presidential successor, and she sees no reason to do so." Earlier reports claimed that Merkel was considering nominating Vice Chancellor Roeselare as the presidential successor.

A survey released by a Berlin polling institute on the 4th, before the interview was broadcast, showed that 46% of Germans think Wu Erfu should resign.

Step 1: word segmentation and tagging. After segmentation, the result is as follows:

Germany/premier rm/C_Bnc_An spokesman/Bnc denies/and Cv Merkel (mark) seeks/and Cv replaces/president Cv/C_Bnc_An report/Cv

In (mark) new/Aa net January 8 (mark) certificate/Bp dispatch from foreign news agency/rm/Cv ,/Cw Germany/premier rm/C_Bnc_An Merkel (mark) /deny C_Bd_Au spokesman/Bnc7 day (mark)/Cv media/rm relevant/C_Bvj_Ap Merkel just (marking) and/Cp inner-party/rm ally/rm discussion/Cv replaces/president Cv/C_Bnc_An /C_Bd_Au report/Cv./Cw

Military your husband's (mark) of Germany/president rm/C_Bnc_An because of/Cp threat/rm media/rm cancel/Bv is right/its (marking) of Cp individual/An loan/Av event/rm /C_Bd_Au report/Cv/Bc suffers/Cv in many ways/rm criticism/Cv, military your husband's (mark) of/Cw /Ad is right/Cp this (mark) is open/Av apology/Av ,/Cw while/C_Bd_Ac represents/Cv refusal/Cv resigns/president Av/C_Bnc_An position/rm./ Cw but/military you husband's (mark) place/An party (mark) member/Bn of Bc thinks/Cv he (mark) /C_Bd_Au apology/Av shortage/Cv sincerity/Av./Cw

Analysis/Cv claims/Cv; / Cw president/C_Bnc_An /this (mark) of C_Bd_Au one scandal/An possibility/Av meeting/Cv gives/Cp at present/Ct /Cu does one's utmost/rm draws/Av rescues/Cv Euro/rm crisis/rm /C_Bd_Au Merkel (mark) causes/Bv not /Cd is little/Ba strikes/Bv ,/Cw Merkel in the time of 2010 (mark) nomination/Cv force that husband (marking) be/C_Bv_Ap Germany/president rm/C_Bnc_An./Cw

Merkel's (mark) /C_Bd_Au spokesman/Bnc claims/Cv, / Cw "/Cw Merkel (mark) not /Ad discussions/Cv president/C_Bnc_An successor/Bn /C_Bd_Au matters/rm ,/Cw she (marking) thinks/Cv do not have/C_Bd_Av reason/rm like this/Ar does/Cv./ Cw "/Cw before this/Ct has/Cv report/Cv claims/Cv ,/Cw Merkel (mark) considerations/Cv nomination/Cv vice-premier/Cn Roeselare (marking) is/president C_Bv_Ap/C_Bnc_An successor/Bn./Cw

Berlin (mark) poll/rm mechanism/rm (mark) on the 4th before/Cp interview/Cv broadcasts/Avj/Cf issue/Av /C_Bd_Au investigation/Cv demonstration/Cv ,/Cw 46%(does not mark) /C_Bd_Au German/Bn thinks/Cv force that husband (marking) should/Av resignation/Av./Cw

Step 2: initial name identification. The preliminary identification result of the system is as follows, where the names identified by the system are marked with "/name":

Germany/premier rm/C_Bnc_An spokesman/Bnc denies/and Cv Merkel/(correct identification) name seeks/and Cv replaces/president Cv/C_Bnc_An report/Cv

In new/Aa net certificate/Bp dispatch from foreign news agency/rm/Cv on January 8 ,/Cw Germany/premier rm/C_Bnc_An Merkel/(correctly identifying) name /C_Bd_Au spokesman/Bnc denied on the 7th/Cv media/rm relevant/C_Bvj_Ap Merkel just/(wrong identification) name and/Cp inner-party/rm ally/rm discussion/Cv replaces/president Cv/C_Bnc_An /C_Bd_Au report/Cv./Cw

Military your husband/(correct identification) name of Germany/president rm/C_Bnc_An because of/Cp threat/rm media/rm cancel/Bv is right/its individual of Cp/An loan/Av event/rm /C_Bd_Au report/Cv/Bc suffers/Cv in many ways/rm criticism/Cv, military your husband/(correct identification) name of/Cw /Ad is right/this open/Av apology/Av of Cp ,/Cw while/C_Bd_Ac represents/Cv refusal/Cv resigns/president Av/C_Bnc_An position/rm./ Cw but/military you husband/(correct identification) name place/member/Bn of An party of Bc thinks/Cv he /C_Bd_Au apology/Av shortage/Cv sincerity/Av./Cw

Analysis/Cv claims/Cv; / Cw president/C_Bnc_An /this scandal of C_Bd_Au/An possibility/Av meeting/Cv gives/Cp at present/Ct /Cu does one's utmost/rm draws/Av rescues/Cv Euro/rm crisis/rm /C_Bd_Au Merkel/(correct identification) name causes/Bv not /Cd is little/Ba strikes/Bv ,/Cw Merkel/correct name is in the time of 2010/military your husband/(the correctly identifying) name of (wrong identification) name nomination/Cv is/C_Bv_Ap Germany/president rm/C_Bnc_An./Cw

Merkel/(correct identification) name /C_Bd_Au spokesman/Bnc claims/Cv, / Cw "/Cw Merkel/(correct identification) name not /Ad discussion/Cv president/C_Bnc_An successor/Bn /C_Bd_Au matters/rm ,/Cw she think/Cv do not have/C_Bd_Av reason/rm like this/Ar does/Cv./ Cw "/Cw before this/Ct has/Cv report/Cv claims/Cv ,/Cw Merkel/(correct identification) name considerations/Cv nomination/Cv vice-premier/Cn Roeselare/(correctly identifying) name is/president C_Bv_Ap/C_Bnc_An successor/Bn./Cw

Berlin/(wrong identification) name poll/rm mechanism/rm 4 days before/Cp interview/Cv broadcast/Avj/Cf issue/Av /C_Bd_Au investigation/Cv demonstration/Cv ,/Cw 46% /C_Bd_Au German/Bn thinks/military your husband/(correct identification) name of Cv should/Av resignation/Av./Cw

From the recognition result we can see that all names have been located, but "during year" and "Berlin" are wrongly identified as names, and "Merkel just" is also wrongly identified as a name.

Step 3: correction with rules. After rule-based correction, the identification result of the system is as follows:

German Chancellor's spokesman denies report that Merkel/name (correctly identified) is seeking to replace the president

www.chinanews.com, January 8: According to foreign news agencies, a spokesman for German Chancellor Merkel/name (correctly identified) on the 7th denied media reports that Merkel just/name (wrongly identified) was discussing replacing the president with party allies.

German President Wu Erfu/name (correctly identified) has been criticized from many sides for threatening the media to drop a report on his private loan; Wu Erfu/name (correctly identified) publicly apologized for this, while saying that he refuses to resign the presidency. But members of the party of Wu Erfu/name (correctly identified) think his apology lacks sincerity.

Analysts say the president's scandal may deal no small blow to Merkel/name (correctly identified), who is currently doing her utmost to resolve the euro crisis; Merkel/name (correctly identified) nominated Wu Erfu/name (correctly identified) as German president in 2010.

The spokesman of Merkel/name (correctly identified) said, "Merkel/name (correctly identified) is not discussing the matter of a presidential successor, and she sees no reason to do so." Earlier reports claimed that Merkel/name (correctly identified) was considering nominating Vice Chancellor Roeselare/name (correctly identified) as the presidential successor.

A survey released by a Berlin polling institute on the 4th, before the interview was broadcast, showed that 46% of Germans think Wu Erfu/name (correctly identified) should resign.

After rule-based correction, only "Merkel just" is still wrongly identified as a name; the other errors have been corrected.

Step 4: use probability to screen out names that do not satisfy the context rules. Because all names identified in this text lie between left and right boundary words, none of them goes through the probability model for name confirmation, and the result is the same as that of step 3.

Step 5: local statistics. The recognition result is as follows:

www.chinanews.com, January 8: According to foreign news agencies, a spokesman for German Chancellor Merkel/name (correctly identified) on the 7th denied media reports that Merkel/name (correctly identified) was discussing replacing the president with party allies.

Step 6: recall unidentified names. After local statistics, all names in this text have been correctly identified, so continuing to use the identified names to recall unidentified names yields the same result.

In summary, for the above text, the present invention correctly identifies the foreign names, which achieves the purpose of the invention.

Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments above; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A control method for automatic foreign name identification based on context semantics in a natural language processing system, characterized by comprising the following steps:

a. analyzing a text to be identified and obtaining a candidate foreign name string set;

b. correcting and screening the candidate foreign name string set using a foreign name rule set to obtain an intermediate foreign name string set;

c. further screening the intermediate foreign name string set using a probability model to obtain an identified foreign name set; and

d. determining unidentified foreign names according to the identified foreign name set.

2. The control method according to claim 1, characterized in that the following step is performed before step a:

i. training on a manually annotated corpus to generate the foreign name rule set.

3. The control method according to claim 2, characterized in that step i further comprises the following steps:

i1. extracting the sentences that contain foreign names from the manually annotated corpus;

i2. removing the annotations from the sentences that contain foreign names, and using the sentences with annotations removed as a temporary test corpus;

i3. identifying names in the temporary test corpus using a foreign name recognition system that performs name identification only according to name-character rules;

i4. comparing the recognition result with the original annotations to derive candidate foreign name rules; and

i5. adding the rules to the foreign name rule set.

4. The control method according to claim 2 or 3, characterized in that step i further comprises the following steps:

i6. judging whether new foreign name rules can be derived from the temporary test corpus;

i7. if the result of step i6 is that foreign name rules can still be derived from the temporary test corpus, repeating steps i3 to i5.

5. The control method according to any one of claims 1 to 4, characterized in that step a further comprises the following steps:

a1. segmenting the text to be identified into words, and tagging the words with part-of-speech elements;

a2. screening out the untagged strings, and identifying the candidate name string set within the set of untagged strings.

6. The control method according to claim 5, characterized in that step a2 further comprises the following steps:

a21. extracting an untagged string, cutting out the substring that may be a foreign transliterated name as a new untagged string, every character of which belongs to the foreign transliterated name character set, and taking the first Chinese character of the new untagged string as the first Chinese character;

a22. judging whether the first Chinese character belongs to the foreign name first-character set;

a23. if step a22 judges that the first Chinese character does not belong to the foreign name first-character set, taking the Chinese character that follows the current first Chinese character in the untagged string as the first Chinese character, and going to step a21;

a24. if step a22 judges that the first Chinese character belongs to the foreign name first-character set, taking the last Chinese character of the untagged string as the second Chinese character;

a25. judging whether the second Chinese character belongs to the foreign name last-character set;

a26. if step a25 judges that the second Chinese character does not belong to the foreign name last-character set, taking the Chinese character that precedes the current second Chinese character in the untagged string as the second Chinese character, and going to step a25;

a27. if step a22 judges that the second Chinese character belongs to the foreign name last-character set, taking the string from the first Chinese character to the second Chinese character in the untagged string, provided it contains no fewer than two characters, as a candidate foreign name string;

a28. repeating steps a21 to a28 until all untagged strings in the set of untagged strings have been processed, thereby forming the candidate foreign name string set.
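Steps a21 to a28 can be sketched in Python as follows. This is an illustration only: the function names and the three character sets are hypothetical, and the sketch reads the test on the second Chinese character as the one made in step a25.

```python
from itertools import groupby


def trim_to_candidate(run, first_chars, last_chars):
    """Steps a22 to a27: move the first character right, the last one left."""
    start = 0
    while start < len(run) and run[start] not in first_chars:
        start += 1
    end = len(run) - 1
    while end >= start and run[end] not in last_chars:
        end -= 1
    # A candidate must contain no fewer than two characters.
    return run[start:end + 1] if end - start >= 1 else None


def candidate_names(untagged, name_chars, first_chars, last_chars):
    """Steps a21 and a28: cut runs of name characters and trim each run."""
    candidates = []
    for is_name_char, group in groupby(untagged, key=lambda ch: ch in name_chars):
        if is_name_char:
            candidate = trim_to_candidate("".join(group), first_chars, last_chars)
            if candidate:
                candidates.append(candidate)
    return candidates


# Hypothetical character sets for illustration.
name_chars = set("默克尔武夫")
first_chars = set("默武")
last_chars = set("尔夫")
print(candidate_names("尔默克尔武说武尔夫", name_chars, first_chars, last_chars))
```

In the usage example the first run is trimmed on both sides before it is accepted, and the second run is accepted unchanged.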

7. The control method according to any one of claims 1 to 6, characterized in that step c further comprises the following steps:

c1. extracting the candidate foreign name strings that do not lie between left and right boundary words in the text to be identified, wherein a left boundary word is a word that often appears before a name, and a right boundary word is a word that often appears after a name;

c2. using the probability model to calculate the probability that each candidate foreign name string extracted in step c1 is a real name, and screening the candidate foreign name strings according to a first threshold.

8. The control method according to any one of claims 1 to 7, characterized in that the following step is performed after step c2:

c3. using local statistics to correct boundary identification errors in the candidate foreign names that passed the screening of step c2 and in the candidate foreign names that were not subject to the screening of step c2.

9. The control method according to any one of claims 1 to 8, characterized in that step d further comprises the following step:

d1. confirming as foreign names those strings in the text to be identified that are identical to a foreign name in the identified foreign name set but have not been identified.

10. The control method according to any one of claims 1 to 9, characterized in that the part-of-speech elements comprise:

- generic words, which are words that cannot be a component of a foreign name;

- right boundary words, which are words that often appear after a name;

- left boundary words, which are words that often appear before a name;

- words that can serve as both a left boundary and a right boundary.

11. A control device for automatic foreign name identification based on context semantics in a natural language processing system, characterized by comprising the following modules:

a foreign name rule set generation module, used to extract a foreign name rule set from a manually annotated corpus;

a candidate foreign name string set generation module, used to analyze a text to be identified and obtain a candidate foreign name string set;

a rule correction module, used to correct and screen the candidate foreign name string set using the foreign name rule set;

a probability correction module, used to further screen using a probability model and obtain an identified foreign name set; and

a recall module, used to determine unidentified foreign names according to the identified foreign names.