CN102681981A - Natural language lexical analysis method, device and analyzer training method - Google Patents

Natural language lexical analysis method, device and analyzer training method

Info

Publication number
CN102681981A
CN102681981A (Application CN2011100621111A)
Authority
CN
China
Prior art keywords
speech
word
natural language
type
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011100621111A
Other languages
Chinese (zh)
Inventor
孟遥
于浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd filed Critical Fujitsu Ltd
Priority to CN2011100621111A priority Critical patent/CN102681981A/en
Publication of CN102681981A publication Critical patent/CN102681981A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a natural language lexical analysis method, a natural language lexical analysis device, and an analyzer training method. The analysis method comprises the following steps: segmenting an input natural language sentence into a plurality of sequences composed of words that may be first-type words and/or characters that may be components of second-type words, wherein a first-type word is any word other than a second-type word; computing a score for each sequence with a statistical probability model, wherein the model comprises the statistical probability of first-type words in context and the statistical probability, in context, of characters as components of second-type words; determining a candidate sequence according to the scores; and labeling the natural language sentence according to the candidate sequence, which comprises: for first-type words that may exist in the candidate sequence, labeling word information obtained from the statistical probability model; and for the characters other than those possible first-type words, labeling character information obtained from the statistical probability model.

Description

Natural language lexical analysis method, device and analyzer training method
Technical field
The present invention relates to the field of natural language processing, and in particular to a natural language lexical analysis method, a natural language lexical analysis device, and a natural language lexical analyzer training method.
Background art
Natural language lexical analysis segments a natural language sequence (for example, a sentence or a paragraph) into the words that compose it. The word is a level below the sentence and above the morpheme (for example, the character in Chinese), although a sentence may consist of a single word, and a word of a single morpheme. Traditional natural language lexical analysis segments the sequence into several combinations of possible words, computes a score for each combination from the probabilities of the various words in context, selects the combination whose score meets a threshold, and labels the words of the sequence according to that combination.
Patent document [1] (Chinese patent application No. 200510103566.8, filed by Fujitsu Ltd. on September 21, 2005, entitled "Natural language component identification and correction apparatus and method based on morpheme tagging", published on March 28, 2007 under publication number CN 1936885A) discloses a natural language component recognition method based on morpheme tagging. It first builds a morpheme attribute tag set from the attribute information of the language components to be analyzed and the position of each morpheme within those components, and then learns rules relating morphemes and morpheme attributes from tagged samples of language components. Using the learned morpheme attribute tagging rules, it tags the input text with morpheme attributes, and identifies the required language components and their classification tags from the morpheme attribute tag sequence. Because document [1] performs lexical analysis on the basis of morphemes, it suffers from complex computation and a heavy computational load. Moreover, although in segmentation it is not subject to the limitation of traditional approaches — that is, it can recognize "new words", whereas a traditional approach cannot segment beyond the segmentation tags of its corpus — the large number of possible ways morphemes can combine into words instead makes its segmentation accuracy on common words unsatisfactory.
Document [2] (" Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information ", Tetsuji Nakagawa, " Proceedings of the 20 ThInternational conference on Computational Linguistics ", December 2004) disclose and a kind of Chinese and Japanese have been carried out the method for participle based on word and word information.Nakagawa has combined method and the sign injecting method based on markov (Markov) model, utilizes based on speech with based on the information of word and obtains the participle for the pin-point accuracy of known words and unknown word.Although this method can be carried out part-of-speech tagging (Part-of-speech tagging) and participle to known words, how to carry out sign for the unknown word that identifies and annotate the problem that still exists relatively blindly.The unknown word that this method marked in addition only is restricted to the lower speech of occurrence frequency in the corpus, and for the higher proper noun of occurrence frequency in the corpus, this method can not obtain to be made up of word the rule of neologisms.For example, if carbon dioxide, ferric trichloride appear in high frequency in the corpus, then these two speech are not unknown words, and when ferric oxide occurring in the testing material, this method can not be discerned.In addition, for the unknown word that this method marked, can't know its part of speech.
The two documents above are incorporated into the present application by reference in their entirety and form part of this specification.
Summary of the invention
A brief overview of the present invention is provided below in order to give a basic understanding of some aspects of the invention. It should be appreciated that this overview is not an exhaustive overview of the invention. It is intended neither to identify key or critical elements of the invention nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description discussed later.
According to one embodiment of the present invention, a natural language lexical analysis method is provided, comprising: segmenting an input natural language sentence into a plurality of sequences composed of words that may be first-type words and/or characters that may be components of second-type words, wherein a first-type word is any word other than a second-type word; computing a score for each of the plurality of sequences with a statistical probability model, wherein the statistical probability model comprises the statistical probability of first-type words in context and the statistical probability, in context, of characters as components of second-type words; determining a candidate sequence according to the scores; and labeling the natural language sentence according to the candidate sequence, comprising: for the first-type words that may exist in the candidate sequence, labeling word information obtained from the statistical probability model; and for the characters other than those possible first-type words, labeling character information obtained from the statistical probability model.
According to another embodiment of the present invention, a natural language lexical analysis device is provided, comprising: a segmentation unit configured to segment an input natural language sentence into a plurality of sequences composed of words that may be first-type words and/or characters that may be components of second-type words, wherein a first-type word is any word other than a second-type word; a statistical probability model storage unit, the statistical probability model comprising the statistical probability of first-type words in context and the statistical probability, in context, of characters as components of second-type words; a score computing unit configured to compute a score for each of the plurality of sequences with the statistical probability model; a candidate sequence determining unit configured to determine a candidate sequence according to the scores; and a labeling unit configured to label the natural language sentence according to the candidate sequence, comprising: for the first-type words that may exist in the candidate sequence, labeling word information obtained from the statistical probability model; and for the characters other than those possible first-type words, labeling character information obtained from the statistical probability model.
According to still another embodiment of the present invention, a natural language lexical analyzer training method is provided, comprising: tagging a natural language sequence as a corpus, wherein first-type words are tagged with word information only, and for second-type words the characters composing them are tagged with character information, a first-type word being any word other than a second-type word; and training a natural language lexical analyzer with the corpus.
With embodiments of the invention, first-type words in natural language are tagged with word information, while the characters other than the possible first-type words are tagged with character information obtained from the statistical probability model. This solves two problems of the prior art at once: in tagging based purely on morphemes (characters), there are too many candidate words and thus a high probability of mistagging; in tagging based purely on words, some words always go unrecognized. In other words, embodiments of the invention can simultaneously tag both the words "known" to the analysis method and device (words they were trained on) and the characters composing words "unknown" to them (words they were not trained on). Moreover, because character information is tagged only for the components of a particular subset of words (the second-type words), a further embodiment can concentrate the work of forming and tagging new words from characters on that particular type of word, avoiding excess interfering information and thereby improving the accuracy of tagging "unknown" words. In a further embodiment, their part of speech can also be tagged.
According to an aspect of the present invention, a program product is also provided, comprising machine-executable instructions which, when executed on an information processing device, cause the device to perform the method of the above embodiments.
According to a further aspect of the present invention, a storage medium is provided, comprising machine-readable program code which, when executed on an information processing device, causes the device to perform the method of the above embodiments.
Description of drawings
The above and other objects, features, and advantages of embodiments of the present disclosure will be more readily understood with reference to the following description taken in conjunction with the accompanying drawings. The components in the drawings merely serve to illustrate the principles of the disclosure. In the drawings, identical or similar technical features or components are denoted by identical or similar reference numerals.
Fig. 1 shows a flow chart of a natural language lexical analysis method according to an embodiment of the invention;
Fig. 2 shows a schematic diagram of a natural language lexical analysis method according to another embodiment of the invention;
Fig. 3 shows a schematic diagram of a natural language lexical analysis device according to an embodiment of the invention;
Fig. 4 shows a schematic diagram of a natural language lexical analysis device according to another embodiment of the invention;
Fig. 5 shows a flow chart of a natural language lexical analyzer training method according to another embodiment of the invention; and
Fig. 6 shows a block diagram of an exemplary configuration of a computer that can be used to implement the method and/or device according to an embodiment of the invention.
In the accompanying drawings, identical or corresponding method step or parts have used identical or corresponding reference marker.
Embodiment
Example embodiments of the present invention will be described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such practical embodiment, many implementation-specific decisions must be made in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another. Moreover, although such development work might be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted here that, to avoid obscuring the present invention with unnecessary detail, only the device structures closely related to the scheme of the present invention are shown in the drawings, and other details of little relevance to the invention are omitted.
In this application, to solve the problems that pure word tagging cannot recognize "unknown words" and that pure character tagging involves a heavy computational load, the inventors propose dividing words into two types: first-type words, which can be tagged with word information, and second-type words, which are tagged on the basis of character information (either the characters directly composing a second-type word are tagged, or the second-type word assembled from those characters is tagged with word information derived from the character information). Correspondingly, a statistical probability model different from the traditional one is used: first-type words exist in this model, while second-type words do not — they are present only through the characters that compose them and the information associated with those characters.
First embodiment
A first embodiment of the invention proposes a natural language lexical analysis method. Fig. 1 shows a schematic flow chart of this method.
As shown in Fig. 1, at step S110 the input natural language sentence is segmented into a plurality of sequences composed of words that may be first-type words and/or characters that may be components of second-type words. Chinese is used as the example here. It should be noted that the embodiments use Chinese only as an illustrative example; the present invention is not limited to it, and those skilled in the art can also apply it to natural languages such as Japanese and Korean.
The phrase "may be a first-type word" means that the item is provisionally treated as a first-type word, although in the final segmentation result it may turn out not to be a legal word, or not to be a first-type word. "Character that may be a component of a second-type word" has a similar meaning: the item is provisionally treated as such a character, although in the final segmentation result it may end up participating in a first-type word. First-type words and second-type words together constitute the complete set of words.
To illustrate: for the Chinese sentence s1 "小明明天去上学" ("Xiao Ming will go to school tomorrow"), "明天" (tomorrow), "去" (go), and "上学" (go to school) can be segmented as possible first-type words, while "小" and "明" are segmented as characters that may be components of a second-type word ("小明", the name Xiao Ming). It should be noted that "first type" and "second type" serve only to distinguish the two classes of words; the two names could equally be exchanged. Accordingly, sentence s1 can be segmented into sequence s11 "小明/明天/去/上学" or sequence s12 "小/明/明天/去/上学". Of course, other segmentation sequences also exist; these merely serve as examples. In other words, the segmentation results comprise every combination of the characters of the sentence that preserves their adjacency and order. In one extreme case, the entire sentence can be segmented as a single (possible) word; in the other extreme case, the entire sentence can be segmented into single characters (some of which may be single-character words).
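The segmentation of step S110 — splitting the sentence into every adjacency- and order-preserving mix of known words and single characters — can be sketched as follows. This is an illustrative sketch under assumptions, not the patent's implementation: the toy vocabulary and the maximum word length are invented for the example.

```python
# Hypothetical sketch of step S110: enumerate all segmentations of a
# sentence into known (first-type) words or single characters, which may
# later prove to be components of second-type words.

def enumerate_segmentations(sentence, vocabulary, max_word_len=4):
    """Return every split of `sentence` into known words or single characters."""
    if not sentence:
        return [[]]
    results = []
    for length in range(1, min(max_word_len, len(sentence)) + 1):
        prefix = sentence[:length]
        # A multi-character prefix must be a known first-type word; a single
        # character is always allowed, since it may be a component of a
        # second-type word (or a single-character word).
        if length == 1 or prefix in vocabulary:
            for rest in enumerate_segmentations(sentence[length:],
                                                vocabulary, max_word_len):
                results.append([prefix] + rest)
    return results

vocab = {"小明", "明天", "上学"}                    # assumed toy vocabulary
splits = enumerate_segmentations("小明明天上学", vocab)  # "Xiao Ming goes to school tomorrow"
```

Both s11-style splits (with "小明" as one word) and s12-style splits (with "小" and "明" as separate characters) appear among the results, matching the text's example.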
Next, at step S120, the score of each sequence is computed with the statistical probability model. For example, sequence s11 is segmented as "小明, 明天, 去, 上学", where "小明", "明天", and "上学" may be first-type words, and "去" may be a character or a single-character word. The probabilities of these possible words can therefore be looked up in the statistical probability model, and the score of s11 computed accordingly. If, for example, the statistical probability model considers only the collocation of parts of speech, suppose the words are w_1 = "小明", w_2 = "明天", w_3 = "去", w_4 = "上学", so that s11 = w_1, w_2, w_3, w_4 with corresponding tags T1 = t_1, t_2, t_3, t_4, where t denotes part-of-speech information. Then, referring to document [2], the score of s11 with part-of-speech tagging T1 can be computed with a word-based bigram hidden Markov model as:
P(s11, T1) = ∏_{i=1}^{n} P(w_i | t_i) · P(t_i | t_{i-1})        (1)
where n is the number of segments, i is the index of a segment, P(w_i | t_i) is the probability of the i-th word w_i given tag t_i, and P(t_i | t_{i-1}) is the probability that the i-th word is tagged t_i given that the (i-1)-th word is tagged t_{i-1}, reflecting the influence of the previous word's tag t_{i-1} on the tag t_i of the i-th word. Of course, this score can obviously also be computed in other ways, for example by applying various mathematical transformations to the preceding formula.
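Formula (1) can be evaluated directly once emission probabilities P(w|t) and transition probabilities P(t|t_prev) are available. The sketch below uses invented toy probabilities purely to show the product structure of the score; it is not a trained model, and the back-off constant for unseen events is an assumption.

```python
# Illustrative evaluation of formula (1): the score of a segmentation
# with tag sequence T under a bigram hidden-Markov-style model.

def sequence_score(words, tags, emit_p, trans_p, start_tag="<s>"):
    """P(s, T) = prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""
    score = 1.0
    prev = start_tag
    for w, t in zip(words, tags):
        # Unseen word/tag or tag/tag pairs get a tiny assumed back-off value.
        score *= emit_p.get((w, t), 1e-8) * trans_p.get((prev, t), 1e-8)
        prev = t
    return score

# Toy probability tables (invented numbers, not trained values).
emit_p = {("小明", "NR"): 0.5, ("明天", "NT"): 0.5,
          ("去", "V"): 0.5, ("上学", "V"): 0.5}
trans_p = {("<s>", "NR"): 0.5, ("NR", "NT"): 0.5,
           ("NT", "V"): 0.5, ("V", "V"): 0.5}

p = sequence_score(["小明", "明天", "去", "上学"], ["NR", "NT", "V", "V"],
                   emit_p, trans_p)
```

Each of the four words contributes one emission factor and one transition factor, exactly mirroring the product over i in formula (1).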
Similarly, a score can be computed for sequence s12 "小/明/明天/去/上学":
P(s12, T2) = ∏_{i=1}^{n} P(w_i | t'_i) · P(t'_i | t'_{i-1})        (2)
Although formula (2) has the same form as formula (1), the tags t' in formula (2) differ. Here, the character information tagged for "小" might be, say, NRF, denoting the head character of a noun, while that for "明" is NRB, denoting the tail character of a noun. Formula (2) thus mixes the probabilities of possible first-type words under given tags (for example, part-of-speech information) with the probabilities, under given tags (character information), of the characters, other than the possible first-type words, that are components of second-type words. Again, this score can obviously also be computed in other ways, such as by applying various mathematical transformations to the preceding formula.
At step S130, the candidate sequence is determined according to the scores computed at step S120. For example, in the example above, the scores of the two sequences s11 and s12 can be compared, the higher-scoring one being taken as the final analysis result. If, in the corpus from which the statistical probability model was built, the probability of "小明" as a word is higher than the respective probabilities of "小" and "明" as characters, then sequence s11, which contains the word "小明", is taken as the candidate sequence. Conversely, if the probability of "小明" as a word is lower than the respective probabilities of "小" and "明" as characters, the lexical analysis result is s12. Of course, this is merely an example; it is also possible to take the top two or more sequences in the score ranking as candidate sequences. Moreover, for ease of exposition, only the probability of "小明" as a word is directly compared here with the probabilities of "小" and "明" as characters. In fact, because the context of the words other than "小明", "小", and "明" changes between the different segmentation results, their conditional probabilities also change, and therefore the score of the whole sequence changes — what is compared is always the score of the sequence.
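Step S130 reduces to ranking the scored sequences. A minimal sketch, with assumed (not computed) scores for the two example sequences:

```python
# Hypothetical sketch of step S130: choose the highest-scoring
# segmentation(s) as the candidate sequence(s).

def pick_candidates(scored_sequences, top_k=1):
    """scored_sequences: list of (sequence, score); return the top_k sequences."""
    ranked = sorted(scored_sequences, key=lambda item: item[1], reverse=True)
    return [seq for seq, _ in ranked[:top_k]]

candidates = pick_candidates([
    (["小明", "明天", "去", "上学"], 3.2e-4),       # assumed score for s11
    (["小", "明", "明天", "去", "上学"], 1.1e-4),   # assumed score for s12
])
```

With top_k greater than 1, this also covers the variant in the text where the top two or more sequences are kept as candidates.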
Next, at step S140, the natural language sentence is labeled according to the candidate sequence, which may comprise: for the first-type words that may exist in the candidate sequence, labeling the word information obtained from the statistical probability model; and for the characters other than those possible first-type words, labeling them according to the character information obtained from the statistical probability model. For example, when the lexical analysis result is s11 (corresponding to the case where "小明" exists in the statistical probability model as a first-type word), sentence s1 can be labeled as "小明/NR 明天/NT 去/V 上学/V", where NR denotes a noun that is a person name, NT a time noun, and V a verb. Conversely, when "小明" does not exist in the statistical probability model as a first-type word, the candidate segmentation sequence will be s12, namely "小/NRF 明/NRB 明天/NT 去/V 上学/V", where NRF denotes the head character of a person-name noun, NRB the tail character of a person-name noun, NT a time noun, and V a verb.
According to another embodiment of the invention, the natural language lexical analysis method may further comprise combining the corresponding characters into words according to the tagged character information, and labeling the combined words with word information derived from that character information.
Fig. 2 shows a schematic diagram of a natural language lexical analysis method according to another embodiment of the invention. In the method of Fig. 2, steps S210 to S240 are identical to steps S110 to S140 of the embodiment shown in Fig. 1. In addition, the method may comprise a step S250 of combining the corresponding characters into a word according to the character information tagged in the preceding steps, and labeling the combined word with word information derived from that character information.
Here, the example from the embodiment above is used again. The character information contained in the statistical probability model indicates that "小" is a person-name head character (NRF) and "明" a person-name tail character (NRB); the positions of "小" and "明" in the candidate sequence satisfy their head- and tail-character attributes; and the person-name attribute NR of "小明" also fits its position in the whole sentence. Therefore, at step S250, the two characters "小" and "明" are combined into "小明" and labeled NR (a noun that is a person name) — though at this point it appears as a second-type word.
The aforementioned statistical probability model may comprise: the statistical probability of first-type words in context; and the statistical probability, in context, of characters as components of second-type words. Here, the "context" of a first-type word refers to all or part of such related information as the positional relations of the words within the sentence and/or their parts of speech and/or their collocation relations; this information is also called "word information". For example, the statistical probability of a word in context may be the probability of the word occurring as various parts of speech such as noun or verb; its probability of occurring at the beginning, end, or middle of a sentence; and the probability of words of one part of speech collocating with words of other parts of speech — for example, the probability of a certain verb preceding another verb, and so on. Further, from such contextual information the probability of a word in a specific context can be computed, for example the probability of a verb following a sentence-initial noun, and so on. Such computation can, for example, use the conditional probability formulas well known in the art.
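The word-level part of such a statistical probability model can be built by counting, as the text notes, with conditional probability formulas well known in the art. The sketch below shows maximum-likelihood estimates of P(w|t) and P(t|t_prev) from a tiny invented tagged corpus; it is an illustration of the counting idea only, with no smoothing.

```python
from collections import Counter

# Illustrative sketch: estimate the word-level statistical probabilities
# P(word | tag) and P(tag | previous tag) by counting a tagged corpus.

def estimate_model(tagged_sentences):
    emit, tag_count = Counter(), Counter()
    trans, prev_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"                     # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[(word, tag)] += 1
            tag_count[tag] += 1
            trans[(prev, tag)] += 1
            prev_count[prev] += 1
            prev = tag
    emit_p = {wt: c / tag_count[wt[1]] for wt, c in emit.items()}
    trans_p = {pt: c / prev_count[pt[0]] for pt, c in trans.items()}
    return emit_p, trans_p

corpus = [[("明天", "NT"), ("去", "V"), ("上学", "V")],   # invented toy corpus
          [("明天", "NT"), ("去", "V")]]
emit_p, trans_p = estimate_model(corpus)
```

A real model would add smoothing for unseen events and, per the application's scheme, also count character-level statistics (head/middle/tail positions within second-type words).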
In this application, the context of a character that is a component of a word has two levels. One is the context of the word it may compose, just as described above; the other is the context of the character within the word it may compose, for example the probability of the character being a head character, a tail character, a middle character, or a single-character word. Thus, for a character that is a component of a word, its statistical probability considers the context at both of these levels. In the sequence s12 above, for example, the probability of "小" as the head character of a noun (more specifically, of a person name) and the probability of "明" as the tail character of a noun (more specifically, of a person name) can also be considered.
That is to say, the character information of a character that is a component of a word may comprise the contextual information of the character within the word — for example, its positional information — and the word information of the corresponding word. The positional information of a character within a second-type word may comprise at least one of head character, tail character, middle character, and single-character word. A single-character word is a word consisting of a single character, for example 人 (person), 口 (mouth), 手 (hand). When tagging, this positional information can be labeled, for example, as F (front), M (middle), B (back), and O (single). It should be noted that this does not limit the present invention; it merely describes, by way of example, one possible method of tagging positional information.
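The F/M/B/O position labels just described follow directly from a character's position in its word. A minimal sketch, assuming the simple rule that the first character of a multi-character word is F, the last is B, and everything between is M:

```python
# Sketch of the example position-tagging scheme: F (front), M (middle),
# B (back), O (single-character word).

def position_tags(word):
    """Return one position tag per character of `word`."""
    if len(word) == 1:
        return ["O"]
    return ["F"] + ["M"] * (len(word) - 2) + ["B"]

tags = position_tags("二氧化碳")   # "carbon dioxide", a four-character word
```

For the two-character name "小明" this yields F, B — consistent with the NRF/NRB example, where the position component of the character information is combined with the word information NR.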
In the present invention, the above statistical probability model can be prepared in advance for use by the scheme of this application. It can be prepared by the party implementing the application itself, or provided by a third party. When preparing the statistical probability model, any means can be used to obtain the relevant data, and the model can then be built directly from the statistics. Alternatively, the model can be obtained by tagging a corpus in advance and training on it; such training schemes are also well known — for example, the bigram hidden Markov model learning mentioned in document [2]. The only requirement is that, when preparing a statistical probability model for use by this application, the corpus for training be pre-tagged with the content specified by the scheme of this application. That is, the corpus is tagged with the information of the first-type words and with the information of the characters that are components of the second-type words.
According to another aspect of the application, the second-type words may comprise at least one of low-frequency words, proper nouns, numbers, and foreign words. A low-frequency word is a word that occurs rarely in sentences; the concrete frequency threshold can be set according to the needs of the application. In addition, proper nouns may comprise at least one of person names, place names, organization names, product names, and terms of art. A term of art is a word established or habitual in a certain industry — for example, the "Bluetooth" protocol in the communications field, or the "Trojan horse" virus in the computing field. Because the word-formation characteristics of these "special words" differ from those of common words, and new such words appear easily, they bring great difficulty to recognition in traditional methods. For example, if, as in document [2], possible words are judged "known" or "unknown" by their probability in the existing statistical model, existing segmentation tools still cannot easily recognize these possible words, because they use neither the positional information of each character within the word it may form nor the part-of-speech information of the possible word. Addressing this, the inventors found that the word-formation rules of this type of word can be exploited. For example, a certain character may have a high probability of occurring as the head character of a noun, and another character a high probability of occurring as the tail character of that kind of noun. Then, when that head character appears at the start of another noun, the statistical model probabilities suggest at recognition time that it can serve as the head character of a word; likewise, the tail character of that word may be a character that once appeared as the tail character of a "known word". At recognition time the two characters may thus be recognized as one word, thereby solving the problem of recognizing new words.
Unlike the way "unknown words" are judged in document [2], the present application does not distinguish first-type words from second-type words by their probability of occurrence in a set of tagged "unknown words"; rather, some of the aforementioned special words (though low-frequency words may also be included) are taken as the second-type words, making new-word recognition more targeted and thereby improving the recognition rate and accuracy. Of course, other divisions can also be adopted according to actual circumstances; the present invention is not limited in this respect.
Further, the above statistical probability model may include more concrete information, to make it more convenient to merge characters that form part of a word of the second type. For example, for some characters, when computing statistical probabilities to build the statistical probability model, the probability that a character is used in a proper noun (more specifically, a person name, place name, organization name, product name, term and so on), a number, a foreign word (even distinguishing the language), etc. may be counted. Alternatively, when training the statistical probability model, proper nouns (more specifically, person names, place names, organization names, product names, terms and so on), numbers, foreign words (even distinguishing the language), etc. may be annotated in the corpus, rather than these words being labeled only generally as "noun", "verb" and the like.
According to embodiments of the invention, by computing the in-context probability character by character for the characters in a natural language sentence that may form part of words of the second type, a candidate sequence is determined, and the natural language sentence is annotated according to the candidate sequence (corresponding characters can also be combined into a word according to the character information of the characters that form part of a word of the second type). Unlike the usual word-level annotation in classic methods, by annotating characters that form part of words of the second type with character information, when a character of a second-type word appears in another sentence to form a new word, the new word that this type of character may form can still be annotated correctly, because the statistical model records the probability that this character appears as the head, middle or tail character of another word. In this way, the problem that such words, absent from the statistical probability model, are hard to recognize and annotate is solved.
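The per-sequence scoring described above can be sketched minimally as a sum of log-probabilities: first-type words are scored as whole words, while characters of possible second-type words are scored together with a position tag. The tag names follow the NRF/NRM/NRB example used later in this document; all probability values and the back-off constant are illustrative assumptions, not figures from the patent.

```python
# Toy model: log-probabilities are illustrative, not from the patent.
# First-type words, scored as whole words (simplified to unigram log-probs).
WORD_LOGPROB = {"film": -2.0, "wonderland": -3.0, "very": -1.5, "good-looking": -2.5}

# Characters as components of second-type words: P(char at position in a name-noun).
CHAR_LOGPROB = {
    ("love", "NRF"): -4.0,       # head character of a name-like noun
    ("beautiful", "NRM"): -4.2,  # middle character
    ("silk", "NRB"): -4.1,       # tail character
}

def score_sequence(units):
    """Sum log-probabilities of a candidate segmentation.

    `units` is a list of (token, tag) pairs; a tag of None marks a
    first-type word, otherwise the tag is a position label such as
    NRF/NRM/NRB for a character of a second-type word.
    """
    total = 0.0
    for token, tag in units:
        if tag is None:
            total += WORD_LOGPROB.get(token, -10.0)  # back-off for unseen words
        else:
            total += CHAR_LOGPROB.get((token, tag), -10.0)
    return total

seq = [("film", None), ("love", "NRF"), ("beautiful", "NRM"), ("silk", "NRB"),
       ("wonderland", None), ("very", None), ("good-looking", None)]
print(round(score_sequence(seq), 1))  # -21.3
```

The candidate sequence is then simply the segmentation with the highest score among the alternatives produced by the segmentation step.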
The applicant also found through research that articles (sentences) of different fields have different vocabulary-formation characteristics. For example, scientific and technical articles contain more terms, names and numbers; literary works contain more person names and place names; translated works contain more foreign words (foreign names, foreign place names, etc.). Therefore, according to a further aspect of the invention, the words of the first type and the words of the second type can be distinguished according to the field to which the Chinese sentence belongs, so that the scheme of the application achieves a better segmentation effect for sentences of a specific field.
Second embodiment
According to an aspect of the present invention, a natural language lexical analysis device is provided. Fig. 3 shows a schematic diagram of a natural language lexical analysis device 300 according to an embodiment of the invention. As shown in Fig. 3, the natural language lexical analysis device 300 may comprise: a segmentation unit 310, a statistical probability model storage unit 320, a score calculation unit 330, a candidate sequence determination unit 340 and an annotation unit 350.
The segmentation unit 310 may be configured to segment an input natural language sentence into a plurality of sequences composed of words that may be of the first type and/or characters that may form part of words of the second type, wherein a word of the first type is a word other than a word of the second type.
In the present embodiment, for convenience, Chinese sentences are still used for the description. It should be noted that this example is only schematic and does not limit the present invention. Those skilled in the art can also apply the scheme to sentences of other natural languages such as Japanese and Korean.
When performing lexical analysis on a Chinese sentence, for example s2 "The film Alice in Wonderland is very good-looking", the segmentation unit 310 segments this sentence into various possible sequences. "Film", "wonderland", "very" and "good-looking" may be words of the first type, while "love", "beautiful", "silk", "dream" and "travel" may be characters that form part of words of the second type ("Alice" and "sleepwalk"). Here, only the two classes of first-type and second-type words are used; the present invention is not limited to this, and the names of the two types may be exchanged. The sentence s2 can be segmented into the sequence s21 "film/love/beautiful/silk/sleepwalk/wonderland/very/good-looking" or the sequence s22 "film/Alice/dream/travel/wonderland/very/good-looking". It should be noted that this is only an example; the segmentation unit 310 can also segment the sentence into sequences of various other character combinations.
The statistical probability model storage unit 320 stores a statistical probability model prepared in advance or purchased from a third party, which may comprise: the statistical probability of words of the first type in context; and the statistical probability, in context, of characters that form part of words of the second type. When preparing the statistical probability model, relevant data can be obtained in any way, and the statistical probability model can then be constructed directly from the statistics. Of course, the statistical probability model can also be obtained by annotating a corpus in advance and training a model; such training is also well known, for example the bigram hidden Markov model learning mentioned in patent document [2]. However, when preparing the statistical probability model for use in the present application, the corpus used for training must be annotated in advance with the content prescribed by the scheme of the application. That is to say, the corpus is annotated with the information of words that may be of the first type and the information of characters that may form part of words of the second type.
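Building such a model directly from statistics can be sketched as follows. This is a deliberately simplified stand-in under stated assumptions: the corpus format, the tag names (NRF, NRB, NT, V) and the use of plain relative frequencies are illustrative only; a real model would condition on context as described above.

```python
from collections import Counter

# Toy annotated corpus in the format sketched in the third embodiment:
# first-type words carry a word tag, characters of second-type words carry
# a position tag (NRF = head of a name-noun, NRB = tail). Data is illustrative.
corpus = [
    [("小", "NRF"), ("明", "NRB"), ("明天", "NT"), ("去", "V"), ("上学", "V")],
    [("小", "NRF"), ("强", "NRB"), ("明天", "NT"), ("去", "V")],
]

def build_model(sentences):
    """Count (token, tag) occurrences as a stand-in for the statistical
    probability model: word tags for first-type words, position tags for
    characters of second-type words."""
    counts = Counter()
    for sent in sentences:
        counts.update(sent)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

model = build_model(corpus)
print(model[("小", "NRF")])  # "小" observed twice among 9 annotated tokens
```

The same counting pass over a corpus annotated as in the third embodiment yields both kinds of statistics the storage unit 320 needs: word-level probabilities for first-type words and character-position probabilities for second-type characters.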
The score calculation unit 330 may be configured to calculate, using the statistical probability model, a score for each of the plurality of sequences. The calculation of the statistical probability is similar to the method described in the first embodiment above and, for brevity, is not repeated here.
Based on the probability scores, calculated by the score calculation unit 330, of the words and characters in the various possible segmentation sequences of the Chinese sentence s2, the candidate sequence determination unit 340 may be configured to determine a candidate sequence according to the scores. The annotation unit 350 may be configured to annotate the natural language sentence according to this candidate sequence, including: for the words of the first type that may exist in the candidate sequence, annotating word information obtained from the statistical probability model; and for the characters that may exist other than those words of the first type, annotating character information obtained from the statistical probability model.
Similar to the first embodiment, according to the calculated probability scores, the candidate segmentation may be "film/N love/NRF beautiful/NRM silk/NRB dream/VF travel/VB wonderland/NS very/AD good-looking/AD"; this corresponds to the case where "Alice" and "sleepwalk" do not exist in the statistical model as words of the first type. Here the symbol N denotes a noun, NRF denotes the head character of a noun that is a person name, NRM denotes a middle character of such a noun, and NRB denotes the tail character of such a noun. VF denotes the head character of a verb, and VB denotes the tail character of a verb. AD denotes a standalone adjective or adverb.
According to another embodiment of the invention, the natural language lexical analysis device may further comprise a combination unit. Fig. 4 shows a schematic diagram of a natural language lexical analysis device 400 according to another embodiment of the invention. As shown in Fig. 4, the natural language lexical analysis device 400 comprises a segmentation unit 410, a statistical probability model storage unit 420, a score calculation unit 430, a candidate sequence determination unit 440, an annotation unit 450 and a combination unit 460. Here, the segmentation unit 410, statistical probability model storage unit 420, score calculation unit 430, candidate sequence determination unit 440 and annotation unit 450 in Fig. 4 are identical in structure and function to the segmentation unit 310, statistical probability model storage unit 320, score calculation unit 330, candidate sequence determination unit 340 and annotation unit 350 in Fig. 3, and are not described again here. The combination unit 460 may be configured to combine corresponding characters into a word according to the character information annotated by the annotation unit 450, wherein the annotation unit 450 is further configured to annotate the combined word with word information obtained on the basis of the character information.
The above example is still used for explanation. Since "love" can serve as the head character NRF of a noun, "beautiful" as a middle character NRM of a noun and "silk" as the tail character NRB of a noun, and since "dream" can serve as the head character VF of a verb and "travel" as the tail character VB of a verb, the annotation unit 450 can, according to the word-formation rule of head, middle and tail characters of a noun and the word-formation rule of head and tail characters of a verb, further annotate the sentence s2 as "film/N Alice/NR sleepwalk/V wonderland/NS very/AD good-looking/AD". It should be noted that here the combined words "Alice" and "sleepwalk" appear as words of the second type.
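The merging step performed by the combination unit can be sketched as a single pass over the tagged sequence, closing a word whenever a tail tag follows a head (and optional middle) tag. The tag names follow the example above; the merging rules and the head-tag-to-word-tag mapping are assumptions for illustration, not the patent's exact algorithm.

```python
# Position-tag sets and the POS assigned to a combined word (from its head tag).
HEAD, MID, TAIL = {"NRF", "VF"}, {"NRM"}, {"NRB", "VB"}
WORD_TAG = {"NRF": "NR", "VF": "V"}

def combine(tagged):
    """Merge runs of head/middle/tail character tags into whole words;
    tokens with ordinary word tags pass through unchanged."""
    out, buf, buf_tag = [], "", None
    for token, tag in tagged:
        if tag in HEAD:
            buf, buf_tag = token, WORD_TAG[tag]   # open a new word
        elif tag in MID and buf:
            buf += token                           # extend the open word
        elif tag in TAIL and buf:
            out.append((buf + token, buf_tag))     # close the word
            buf, buf_tag = "", None
        else:
            out.append((token, tag))               # first-type word
    return out

tagged = [("film", "N"), ("love", "NRF"), ("beautiful", "NRM"),
          ("silk", "NRB"), ("dream", "VF"), ("travel", "VB"),
          ("wonderland", "NS"), ("very", "AD"), ("good-looking", "AD")]
print(combine(tagged))
```

Run on the candidate sequence of the example, the pass merges love/beautiful/silk into one NR word and dream/travel into one V word, mirroring the "Alice" and "sleepwalk" combination described above.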
According to embodiments of the invention, beyond annotating the words in the sentence, by annotating characters that form part of words of the second type with character information, when a character of a second-type word appears in another sentence to form a new word, the word that this type of character may form can be annotated correctly, because the statistical model records the probability that this character appears as the head, middle or tail character of another word. By training the statistical probability model to learn the rules by which characters may constitute words of the second type, the problem that words absent from the corpus are hard to recognize is solved.
According to an aspect of the invention, the character information of a character that forms part of a word of the second type may comprise the positional information of the character within the word of the second type and the word information of the corresponding word of the second type. The word information of the word of the second type may comprise part-of-speech information and/or the positional information of the word in the sentence. The positional information of a character within a word of the second type may comprise at least one of: head character, tail character, middle character and single-character word.
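One possible in-memory layout for this character information is a small record combining the two aspects just listed; the field names and values here are purely illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class CharInfo:
    """Character information for a character of a second-type word:
    its position within the word, plus the word's part of speech."""
    position: str   # "head", "middle", "tail" or "single"
    word_pos: str   # POS of the second-type word it belongs to, e.g. "NR"

info = CharInfo(position="head", word_pos="NR")
print(info.position, info.word_pos)
```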
According to a further aspect of the invention, the words combined from characters that form part of words of the second type may comprise at least one of low-frequency words, proper nouns, numbers and foreign words. In addition, proper nouns may comprise at least one of person names, place names, organization names, product names and terms.
According to a further aspect of the invention, the words of the first type and the words of the second type can be distinguished according to the field to which the Chinese sentence belongs, so that the scheme of the application achieves a better segmentation effect for sentences of a specific field.
Different from the way "unknown words" are judged in document [2], the application does not distinguish words of the first type from words of the second type by calculating the occurrence probability of "unknown words" in the annotated set; instead, some of the aforementioned special words (which may also include low-frequency words) are taken as words of the second type, so that new-word recognition is more targeted, thereby improving both the recognition rate and the accuracy. Of course, other division schemes may be adopted according to actual conditions; the present invention is not limited to this.
Third embodiment
According to a further aspect of the invention, a natural language lexical analyzer training method is provided. Fig. 5 shows a flowchart of a natural language lexical analyzer training method according to another embodiment of the invention.
As shown in Fig. 5, the natural language lexical analyzer training method may comprise: annotating natural language sequences as a corpus, wherein words of the first type are annotated with word information only, while for words of the second type the characters constituting those words are annotated with character information, a word of the first type being a word other than a word of the second type (step S510).
Still taking Chinese as an example, for the Chinese sentence "Xiao Ming will go to school tomorrow", this sequence serves as a corpus for the analyzer. For example, "tomorrow", "go" and "go to school" may be divided as words of the first type and annotated with word information only, that is, "tomorrow/NT go/V go to school/V", where the symbol NT denotes a time noun and V denotes a verb. The noun "Xiao Ming" is divided as a word of the second type and handled as a person name: "Xiao Ming" is annotated respectively as the head character "Xiao" of this noun and the tail character "Ming" of this noun, that is, as Xiao/NRF Ming/NRB, where the symbol NRF denotes the head character of a noun that is a person name and NRB denotes the tail character of such a noun. Although "Xiao" and "Ming" are each annotated with character information, when the analyzer performs lexical analysis of a sentence it can label them as one word according to their positional relation and the word-formation rules. At the same time, if the corpus has also annotated that "Qiang" can serve as the tail character of a person name, then when the analyzer performs lexical analysis it can combine the never-before-seen new word "Xiao Qiang".
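The annotation convention of step S510 can be sketched as a small helper that emits one tag per first-type word and one position-suffixed tag per character of a second-type word. The suffix scheme (F = head, M = middle, B = tail, S = single character) follows the NRF/NRB usage above; the input format is a hypothetical convenience for this sketch.

```python
def annotate(tokens):
    """Emit corpus annotations; `tokens` is a list of
    (word, tag, is_second_type) triples."""
    out = []
    for word, tag, second in tokens:
        if not second:
            out.append(f"{word}/{tag}")          # first-type word: word tag only
        elif len(word) == 1:
            out.append(f"{word}/{tag}S")         # single-character word
        else:
            out.append(f"{word[0]}/{tag}F")      # head character
            out.extend(f"{ch}/{tag}M" for ch in word[1:-1])  # middle characters
            out.append(f"{word[-1]}/{tag}B")     # tail character
    return " ".join(out)

# "小明" (Xiao Ming) as a second-type person name; the rest as first-type words.
print(annotate([("小明", "NR", True), ("明天", "NT", False),
                ("去", "V", False), ("上学", "V", False)]))
```

This reproduces the annotation "小/NRF 明/NRB 明天/NT 去/V 上学/V" described for the example sentence.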
It should be noted that, as stated above, the designations "word of the first type" and "word of the second type" are used only to distinguish the existence of two types of words; the present invention is not limited to this, and the designations of the two types may be exchanged.
When dividing the two types of words, the division can, for example, be made according to the field to which the natural language sentence belongs. Taking Chinese as an example, scientific and technical articles contain more terms, names and numbers; literary works contain more person names and place names; translated works contain more foreign words (foreign names, foreign place names, etc.). Therefore, according to a further aspect of the invention, the words of the first type and the words of the second type can be distinguished according to the field to which the Chinese sentence belongs, so that the scheme of the application achieves a better segmentation effect for sentences of a specific field.
After the corpus is prepared, the natural language lexical analyzer is trained with the corpus (step S520). For words of the first type, only the conditional probability model of the word itself is computed. For the characters in words of the second type, the conditional probability model of each character is computed according to the contextual information (also called character information) of each character in the word of the second type, including its positional information within the word of the second type, as well as the part-of-speech information of the word of the second type to which it belongs and/or the positional information of that word in the sentence. Here, any method well known in the art can be used to train the natural language lexical analyzer, for example the bigram hidden Markov model training method used in patent document [2]. Of course, the present invention is not limited to this, and those skilled in the art can adopt other suitable training methods to train the natural language lexical analyzer.
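A minimal bigram-HMM-style training pass over such a corpus might look as follows. This is a sketch inspired by the reference to document [2], not the patent's exact procedure: it estimates tag-transition counts and maximum-likelihood emission probabilities from the annotated (token, tag) pairs, treating position tags like NRF/NRB exactly like word tags.

```python
from collections import Counter

def train(sentences):
    """Count tag transitions and tag->token emissions from an annotated
    corpus; returns (emission probabilities, transition counts)."""
    trans, emit, tag_count = Counter(), Counter(), Counter()
    for sent in sentences:          # sent: list of (token, tag)
        prev = "<s>"                # sentence-start pseudo-tag
        for token, tag in sent:
            trans[(prev, tag)] += 1
            emit[(tag, token)] += 1
            tag_count[tag] += 1
            prev = tag
    # maximum-likelihood emission probabilities P(token | tag)
    emit_prob = {pair: n / tag_count[pair[0]] for pair, n in emit.items()}
    return emit_prob, trans

corpus = [[("小", "NRF"), ("明", "NRB"), ("去", "V"), ("上学", "V")]]
emit_prob, trans = train(corpus)
print(emit_prob[("V", "去")], trans[("NRB", "V")])
```

In a full analyzer these counts would be smoothed and combined with the Viterbi search over candidate sequences; only the counting step is shown here.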
According to embodiments of the invention, proper nouns may comprise at least one of: person names, place names, organization names, product names and terms. Because the word-formation characteristics of these proper nouns differ from those of common words, and because new such words appear easily, they cause great difficulty for recognition in classic methods. For example, if, as in document [2], a possible word is judged to be a "known word" or an "unknown word" according to its probability in an existing statistical model, existing segmentation tools still cannot easily recognize these possible words, because neither the positional information of each character within the word that may be formed nor the part-of-speech information of the possible word is used. In view of this, the inventors found through research that the word-formation rules of this type of word can be exploited. For example, the first character of a certain noun may have a high probability of appearing at the head of nouns, while another character may have a high probability of appearing as the tail character of such nouns. Then, when that first character appears at the beginning of another noun, the statistical model may indicate during recognition that it can serve as the head character of a word; likewise, the tail character of that word may be a character that once appeared as the tail character of a "known word". In that case, the two characters may be recognized as one word during recognition, thereby solving the problem of recognizing new words.
According to embodiments of the invention, the word information of a word of the first type may comprise part-of-speech information and/or the positional information of the word in the sentence. Taking Chinese as an example, the part-of-speech information may be noun, verb, adjective, adverb and so on. The character information of a character that forms part of a word of the second type comprises the positional information of the character within the word of the second type and the word information of the corresponding word of the second type, where the word information of the word of the second type comprises part-of-speech information and/or the positional information of the word in the sentence. Likewise, taking Chinese as an example, the part-of-speech information may be noun, verb, adjective and so on.
In addition, the positional information of a character within a word of the second type may comprise at least one of: head character, tail character, middle character and single-character word. By learning the positional information of characters within words of the second type, that is, the probability that a character appears at the head, in the middle or at the tail of a word, or as an independent single-character word, combined with the part-of-speech information of the word of the second type that may be formed, the training model can learn the word-formation rules of these characters, thereby improving the recognition rate of the words of the second type that these characters may form.
The lexical analyzer obtained by training according to this embodiment can be used to implement the technical schemes of the first and second embodiments. Specifically, training according to this embodiment in fact yields the aforementioned statistical probability model, in which the statistical probabilities and word information of the words of the first type, and the statistical probabilities and character information of the characters constituting the words of the second type (including the word information of the words of the second type they form), are all obtained on the basis of a corpus annotated in the manner described above.
Fourth embodiment
As an exemplary embodiment, each step of the above methods and each of the above constituent modules and/or devices may be configured by means of software, firmware, hardware or a combination thereof; the specific configurations are not repeated here.
As stated above, the processing steps or modules in the above methods or devices can be realized by software. A program constituting this software is installed, from a storage medium or a network, onto a computer having a dedicated hardware structure (for example the general-purpose computer 600 shown in Fig. 6); when the various programs are installed, the computer can perform various functions.
In Fig. 6, a central processing unit (CPU) 601 performs various processing according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage section 608 into a random access memory (RAM) 603. The RAM 603 also stores, as required, data needed when the CPU 601 performs various processing. The CPU 601, the ROM 602 and the RAM 603 are connected to one another via a bus 604. An input/output interface 605 is also connected to the bus 604.
The following components are connected to the input/output interface 605: an input section 606 (including a keyboard, a mouse and the like); an output section 607 (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a loudspeaker and the like); a storage section 608 (including a hard disk and the like); and a communication section 609 (including a network interface card such as a LAN card, a modem and the like). The communication section 609 performs communication processing via a network such as the Internet. As required, a drive 610 may also be connected to the input/output interface 605. A removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk or a semiconductor memory is mounted on the drive 610 as required, so that a computer program read therefrom is installed into the storage section 608 as required.
In the case where the above series of processing is realized by software, the program constituting the software is installed from a network such as the Internet, or from a storage medium such as the removable medium 611.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 611 shown in Fig. 6, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 611 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 602, a hard disk included in the storage section 608 or the like, in which the program is stored and which is distributed to the user together with the device containing it.
The present disclosure also proposes a program product storing machine-readable instruction codes. When the instruction codes are read and executed by a machine, the above method according to the embodiments of the disclosure can be performed.
Correspondingly, a storage medium for carrying the program product storing the above machine-readable instruction codes is also included in the present disclosure. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick and the like.
In the above description of specific embodiments of the disclosure, features described and/or illustrated for one embodiment can be used in one or more other embodiments in the same or a similar way, combined with features in other embodiments, or substituted for features in other embodiments.
It should be emphasized that the term "comprise/include", when used herein, refers to the existence of a feature, an element, a step or a component, but does not exclude the existence or addition of one or more other features, elements, steps or components.
Although the embodiments of the present invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above are only used to explain the present invention and are not to be construed as limiting it. Those skilled in the art can make various modifications and changes to the above embodiments without departing from the essence and scope of the present invention. Therefore, the scope of the present invention is limited only by the appended claims and their equivalents.
In addition, the methods of the disclosure are not limited to being performed in the time order described in the specification; they can also be performed in another time order, in parallel, or independently. Therefore, the execution order of the methods described in this specification does not limit the technical scope of the disclosure.
As can be seen from the above description, according to the embodiments of the disclosure, the following schemes are provided:
Remark 1. A natural language lexical analysis method, comprising:
segmenting an input natural language sentence into a plurality of sequences composed of words that may be of the first type and/or characters that may form part of words of the second type, wherein a word of the first type is a word other than a word of the second type;
calculating a score for each of the plurality of sequences using a statistical probability model, wherein the statistical probability model comprises: the statistical probability of words of the first type in context; and the statistical probability, in context, of characters that form part of words of the second type;
determining a candidate sequence according to the scores; and
annotating the natural language sentence according to the candidate sequence, including: for the words of the first type that may exist in the candidate sequence, annotating word information obtained from the statistical probability model; and for the characters that may exist other than those words of the first type, annotating character information obtained from the statistical probability model.
Remark 2. The natural language lexical analysis method according to remark 1, further comprising: combining corresponding characters into a word according to the annotated character information, and annotating the combined word with word information obtained on the basis of the character information.
Remark 3. The natural language lexical analysis method according to remark 1 or 2, wherein the words of the first type and the words of the second type are divided according to the field of the natural language sentence.
Remark 4. The natural language lexical analysis method according to remark 1 or 2, wherein the words of the second type comprise at least one of: low-frequency words, proper nouns, numbers and foreign words.
Remark 5. The natural language lexical analysis method according to remark 4, wherein the proper nouns comprise at least one of: person names, place names, organization names, product names and terms.
Remark 6. The natural language lexical analysis method according to remark 1 or 2, wherein the word information of a word of the first type comprises part-of-speech information and/or positional information of the word in the sentence; the character information of a character that forms part of a word of the second type comprises positional information of the character within the word of the second type and word information of the corresponding word of the second type; and the word information of the word of the second type comprises part-of-speech information and/or positional information of the word in the sentence.
Remark 7. The natural language lexical analysis method according to remark 6, wherein the positional information of a character within a word of the second type comprises at least one of: head character, tail character, middle character and single-character word.
Remark 8. A natural language lexical analysis device, comprising:
a segmentation unit configured to segment an input natural language sentence into a plurality of sequences composed of words that may be of the first type and/or characters that may form part of words of the second type, wherein a word of the first type is a word other than a word of the second type;
a statistical probability model storage unit, the stored statistical probability model comprising: the statistical probability of words of the first type in context; and the statistical probability, in context, of characters that form part of words of the second type;
a score calculation unit configured to calculate a score for each of the plurality of sequences using the statistical probability model;
a candidate sequence determination unit configured to determine a candidate sequence according to the scores; and
an annotation unit configured to annotate the natural language sentence according to the candidate sequence, including: for the words of the first type that may exist in the candidate sequence, annotating word information obtained from the statistical probability model; and for the characters that may exist other than those words of the first type, annotating character information obtained from the statistical probability model.
Remark 9. The natural language lexical analysis device according to remark 8, further comprising a combination unit configured to combine corresponding characters into a word according to the character information annotated by the annotation unit, wherein the annotation unit is further configured to annotate the combined word with word information obtained on the basis of the character information.
Remark 10. The natural language lexical analysis device according to remark 8 or 9, wherein the words of the first type and the words of the second type are divided according to the field of the natural language sentence.
Remark 11. The natural language lexical analysis device according to remark 8 or 9, wherein the words of the second type comprise at least one of: low-frequency words, proper nouns, numbers and foreign words.
Remark 12. The natural language lexical analysis device according to remark 11, wherein the proper nouns comprise at least one of: person names, place names, organization names, product names and terms.
Remark 13. The natural language lexical analysis device according to remark 8 or 9, wherein the word information of a word of the first type comprises part-of-speech information and/or positional information of the word in the sentence; the character information of a character that forms part of a word of the second type comprises positional information of the character within the word of the second type and word information of the corresponding word of the second type; and the word information of the word of the second type comprises part-of-speech information and/or positional information of the word in the sentence.
Remark 14. The natural language lexical analysis device according to remark 13, wherein the positional information of a character within a word of the second type comprises at least one of: head character, tail character, middle character and single-character word.
15. 1 kinds of natural language lexical analyzers of remarks training method comprises:
Mark natural language sequence is as corpus; Wherein, to the speech of first kind word information labeling only, to the speech of second type; Constitute the word of the speech of said second type with the word information labeling, wherein the speech of the first kind is the speech except that the speech of said second type;
With said corpus training natural language lexical analyzer.
Remarks 16. is like remarks 15 described natural language lexical analyzer training methods, and wherein the speech of the speech of the first kind and second type is to divide according to the field of said natural language statement.
Remarks 17. is like remarks 15 or 16 described natural language lexical analyzer training methods, and wherein the speech of second type comprises one of following at least: low-frequency word, proper noun, number and foreign word.
Remarks 18. is like remarks 17 described natural language lexical analyzer training methods, and wherein, it is one of following at least that proper noun comprises: name, place name, mechanism's name, ProductName and term.
Remarks 19. is like remarks 15 or 16 described natural language lexical analyzer training methods; Wherein, The word information of the speech of the first kind comprises part of speech information and/or the positional information of this speech in statement; Comprise the word information of the speech of positional information and corresponding second type of this word in the speech of second type as the word information of the word of the ingredient of the speech of second type, the word information of the speech of said second type comprises part of speech information and/or the positional information of this speech in statement.
Remarks 20. is like remarks 19 described natural language lexical analyzer training methods, and wherein, it is one of following at least that the positional information of word in the speech of second type comprises: lead-in, tail word, middle word and only words.
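The character position scheme of remarks 14 and 20 (word-initial, word-internal, word-final, and single-character) corresponds to the BIES-style tagging commonly used in character-based word segmentation. A minimal sketch of such a tagger, with illustrative tag names not taken from the patent, might look like:

```python
def position_tags(word):
    """Tag each character of a second-type word with its position.

    Tag names are illustrative, not from the patent:
    'S' = single-character word, 'B' = word-initial (lead) character,
    'M' = word-internal (middle) character, 'E' = word-final (tail) character.
    """
    if len(word) == 1:
        return [(word, "S")]
    # First character gets 'B', last gets 'E', everything between gets 'M'.
    tags = ["B"] + ["M"] * (len(word) - 2) + ["E"]
    return list(zip(word, tags))
```

For example, a three-character second-type word would be tagged B, M, E, while a one-character word receives the single tag S.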

Claims (10)

1. A natural language lexical analysis method, comprising:
segmenting an input natural language sentence into a plurality of sequences composed of words that may be first-type words and/or characters that may be constituents of second-type words, wherein a first-type word is any word other than a second-type word;
calculating a score for each of the plurality of sequences using a statistical probability model, wherein the statistical probability model comprises: statistical probabilities of first-type words in context, and statistical probabilities in context of characters as constituents of second-type words;
determining a candidate sequence according to the scores; and
annotating the natural language sentence according to the candidate sequence, comprising: annotating each possible first-type word in the candidate sequence with word information obtained from the statistical probability model, and annotating each possible character other than the possible first-type words with character information obtained from the statistical probability model.
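The scoring and candidate-selection steps of claim 1 can be illustrated with a toy model. In the sketch below, all function names and probability tables are hypothetical, and the context conditioning described in the claim is omitted for brevity: each candidate segmentation is scored as a sum of log-probabilities, with word units looked up in a word table and character units in a character table, and the best-scoring candidate is selected.

```python
def score_sequence(seq, word_logp, char_logp, floor=-10.0):
    """Score one candidate segmentation under a toy statistical model.

    seq: list of units, each ('word', w) for a possible first-type word
    or ('char', c) for a character that may form part of a second-type
    word. word_logp and char_logp are hypothetical lookup tables of
    log-probabilities; unseen units fall back to a floor score.
    """
    total = 0.0
    for kind, symbol in seq:
        table = word_logp if kind == "word" else char_logp
        total += table.get(symbol, floor)
    return total

def best_candidate(sequences, word_logp, char_logp):
    """Return the candidate sequence with the highest score."""
    return max(sequences, key=lambda s: score_sequence(s, word_logp, char_logp))
```

A segmentation that uses a known first-type word outscores one that spells the same span out character by character whenever the word's probability exceeds the product of the character probabilities, which is how the model trades off whole-word and character-level evidence.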
2. The natural language lexical analysis method according to claim 1, further comprising: combining corresponding characters into words according to the annotated character information, and annotating the combined words with word information derived from that character information.
3. The natural language lexical analysis method according to claim 1 or 2, wherein the first-type words and the second-type words are divided according to the domain of the natural language sentence.
4. The natural language lexical analysis method according to claim 1 or 2, wherein the second-type words comprise at least one of: low-frequency words, proper nouns, numbers, and foreign words.
5. The natural language lexical analysis method according to claim 4, wherein the proper nouns comprise at least one of: person names, place names, organization names, product names, and technical terms.
6. The natural language lexical analysis method according to claim 1 or 2, wherein the word information of a first-type word comprises part-of-speech information and/or position information of the word within the sentence; the character information of a character that is a constituent of a second-type word comprises position information of the character within the second-type word and the word information of the corresponding second-type word; and the word information of the second-type word comprises part-of-speech information and/or position information of the word within the sentence.
7. A natural language lexical analysis device, comprising:
a segmentation unit configured to segment an input natural language sentence into a plurality of sequences composed of words that may be first-type words and/or characters that may be constituents of second-type words, wherein a first-type word is any word other than a second-type word;
a statistical probability model storage unit, the statistical probability model comprising: statistical probabilities of first-type words in context, and statistical probabilities in context of characters as constituents of second-type words;
a score calculation unit configured to calculate a score for each of the plurality of sequences using the statistical probability model;
a candidate sequence determination unit configured to determine a candidate sequence according to the scores; and
an annotation unit configured to annotate the natural language sentence according to the candidate sequence, comprising: annotating each possible first-type word in the candidate sequence with word information obtained from the statistical probability model, and annotating each possible character other than the possible first-type words with character information obtained from the statistical probability model.
8. The natural language lexical analysis device according to claim 7, wherein the second-type words comprise at least one of: low-frequency words, proper nouns, numbers, and foreign words.
9. A natural language lexical analyzer training method, comprising:
annotating natural language sequences as a corpus, wherein each first-type word is annotated with word information only, and for each second-type word, the characters constituting the second-type word are annotated with character information, a first-type word being any word other than a second-type word; and
training a natural language lexical analyzer with the corpus.
10. The natural language lexical analyzer training method according to claim 9, wherein the second-type words comprise at least one of: low-frequency words, proper nouns, numbers, and foreign words.
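The corpus annotation scheme of claim 9 — word-level labels for first-type words, character-level labels for the characters of second-type words — can be sketched as follows. The tuple layout and the tag format (position tag joined to the word's part of speech) are assumptions for illustration, not specified by the patent:

```python
def annotate_for_training(tokens):
    """Produce mixed word/character training annotations.

    tokens: list of (surface, pos, is_second_type). A first-type word
    keeps a single word-level label; a second-type word is expanded
    into characters, each labelled with its position in the word
    ('B'/'M'/'E'/'S', hypothetical names) plus the word's part of speech.
    """
    out = []
    for surface, pos, is_second in tokens:
        if not is_second:
            out.append((surface, pos))  # word-level label only
            continue
        if len(surface) == 1:
            out.append((surface, f"S-{pos}"))  # single-character word
            continue
        out.append((surface[0], f"B-{pos}"))      # word-initial character
        for ch in surface[1:-1]:
            out.append((ch, f"M-{pos}"))          # word-internal characters
        out.append((surface[-1], f"E-{pos}"))     # word-final character
    return out
```

A corpus annotated this way lets a single sequence model learn word-level statistics for common words and character-level statistics for the open classes (low-frequency words, proper nouns, numbers, foreign words) at the same time.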
CN2011100621111A 2011-03-11 2011-03-11 Natural language lexical analysis method, device and analyzer training method Pending CN102681981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011100621111A CN102681981A (en) 2011-03-11 2011-03-11 Natural language lexical analysis method, device and analyzer training method


Publications (1)

Publication Number Publication Date
CN102681981A true CN102681981A (en) 2012-09-19

Family

ID=46813934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011100621111A Pending CN102681981A (en) 2011-03-11 2011-03-11 Natural language lexical analysis method, device and analyzer training method

Country Status (1)

Country Link
CN (1) CN102681981A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050156A (en) * 2013-03-15 2014-09-17 富士通株式会社 Device, method and electronic equipment for extracting maximum noun phrase
CN105205045A (en) * 2015-09-21 2015-12-30 上海智臻智能网络科技股份有限公司 Semantic model method for intelligent interaction
CN106030568A (en) * 2014-04-29 2016-10-12 乐天株式会社 Natural language processing system, natural language processing method, and natural language processing program
WO2016161631A1 (en) * 2015-04-10 2016-10-13 Hewlett-Packard Development Company,L.P. Hidden dynamic systems
CN106156051A (en) * 2015-03-27 2016-11-23 深圳市腾讯计算机系统有限公司 Build the method and device of name language material identification model
WO2017140221A1 (en) * 2016-02-18 2017-08-24 腾讯科技(深圳)有限公司 Text information processing method and device
CN107291701A (en) * 2016-04-01 2017-10-24 阿里巴巴集团控股有限公司 A kind of machine language generation method and device
CN108763202A (en) * 2018-05-18 2018-11-06 广州腾讯科技有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN109062888A (en) * 2018-06-04 2018-12-21 昆明理工大学 A kind of self-picketing correction method when there is Error Text input
CN110781667A (en) * 2019-10-25 2020-02-11 北京中献电子技术开发有限公司 Japanese verb identification and part-of-speech tagging method for neural network machine translation
WO2020113918A1 (en) * 2018-12-06 2020-06-11 平安科技(深圳)有限公司 Statement rationality determination method and apparatus based on semantic parsing, and computer device
CN111492364A (en) * 2017-12-15 2020-08-04 华为技术有限公司 Data labeling method and device and storage medium
CN111488737A (en) * 2019-01-09 2020-08-04 阿里巴巴集团控股有限公司 Text recognition method, device and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1629836A (en) * 2003-12-17 2005-06-22 北京大学 Method and apparatus for learning Chinese new words
CN1936885A (en) * 2005-09-21 2007-03-28 富士通株式会社 Natural language component identifying correcting apparatus and method based on morpheme marking
US20080228463A1 (en) * 2004-07-14 2008-09-18 Shinsuke Mori Word boundary probability estimating, probabilistic language model building, kana-kanji converting, and unknown word model building
CN101295295A (en) * 2008-06-13 2008-10-29 中国科学院计算技术研究所 Chinese language lexical analysis method based on linear model


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
TETSUJI NAKAGAWA: "Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information", 《COLING '04 PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS》 *
XIAO SUN ET AL.: "Pragmatic Chinese Lexical Analysis Based on Word-character Hybrid Model", 《JOURNAL OF INFORMATION & COMPUTATIONAL SCIENCE》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050156B (en) * 2013-03-15 2017-03-01 富士通株式会社 For extracting device, method and the electronic equipment of maximum noun phrase
CN104050156A (en) * 2013-03-15 2014-09-17 富士通株式会社 Device, method and electronic equipment for extracting maximum noun phrase
CN106030568B (en) * 2014-04-29 2018-11-06 乐天株式会社 Natural language processing system, natural language processing method and natural language processing program
CN106030568A (en) * 2014-04-29 2016-10-12 乐天株式会社 Natural language processing system, natural language processing method, and natural language processing program
CN106156051A (en) * 2015-03-27 2016-11-23 深圳市腾讯计算机系统有限公司 Build the method and device of name language material identification model
CN106156051B (en) * 2015-03-27 2019-08-13 深圳市腾讯计算机系统有限公司 Construct the method and device of name corpus identification model
WO2016161631A1 (en) * 2015-04-10 2016-10-13 Hewlett-Packard Development Company,L.P. Hidden dynamic systems
CN105205045A (en) * 2015-09-21 2015-12-30 上海智臻智能网络科技股份有限公司 Semantic model method for intelligent interaction
JP2018536920A (en) * 2016-02-18 2018-12-13 ▲騰▼▲訊▼科技(深▲セン▼)有限公司 Text information processing method and device
WO2017140221A1 (en) * 2016-02-18 2017-08-24 腾讯科技(深圳)有限公司 Text information processing method and device
US10496747B2 (en) 2016-02-18 2019-12-03 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
CN107291701B (en) * 2016-04-01 2020-12-01 阿里巴巴集团控股有限公司 Machine language generation method and device
CN107291701A (en) * 2016-04-01 2017-10-24 阿里巴巴集团控股有限公司 A kind of machine language generation method and device
CN111492364A (en) * 2017-12-15 2020-08-04 华为技术有限公司 Data labeling method and device and storage medium
CN111492364B (en) * 2017-12-15 2022-09-23 华为技术有限公司 Data labeling method and device and storage medium
CN108763202A (en) * 2018-05-18 2018-11-06 广州腾讯科技有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of the sensitive text of identification
CN109062888A (en) * 2018-06-04 2018-12-21 昆明理工大学 A kind of self-picketing correction method when there is Error Text input
CN109062888B (en) * 2018-06-04 2023-03-31 昆明理工大学 Self-correcting method for input of wrong text
WO2020113918A1 (en) * 2018-12-06 2020-06-11 平安科技(深圳)有限公司 Statement rationality determination method and apparatus based on semantic parsing, and computer device
CN111488737A (en) * 2019-01-09 2020-08-04 阿里巴巴集团控股有限公司 Text recognition method, device and equipment
CN111488737B (en) * 2019-01-09 2023-04-14 阿里巴巴集团控股有限公司 Text recognition method, device and equipment
CN110781667A (en) * 2019-10-25 2020-02-11 北京中献电子技术开发有限公司 Japanese verb identification and part-of-speech tagging method for neural network machine translation
CN110781667B (en) * 2019-10-25 2021-10-08 北京中献电子技术开发有限公司 Japanese verb identification and part-of-speech tagging method for neural network machine translation

Similar Documents

Publication Publication Date Title
CN102681981A (en) Natural language lexical analysis method, device and analyzer training method
Read et al. Sentence boundary detection: A long solved problem?
Bryant et al. The BEA-2019 shared task on grammatical error correction
JP4463256B2 (en) System and method for providing automatically completed recommended words that link multiple languages
CN1667699B (en) Generating large units of graphonemes with mutual information criterion for letter to sound conversion
CN106844413B (en) Method and device for extracting entity relationship
US8041559B2 (en) System and method for disambiguating non diacritized arabic words in a text
Shahrour et al. Improving Arabic diacritization through syntactic analysis
Uzuner et al. A comparative study of language models for book and author recognition
Yuret et al. Semeval-2010 task 12: Parser evaluation using textual entailments
Darwish et al. Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging.
CN110119510B (en) Relationship extraction method and device based on transfer dependency relationship and structure auxiliary word
CN104239289A (en) Syllabication method and syllabication device
Lagakis et al. Automated essay scoring: A review of the field
Mohamed et al. Arabic Part of Speech Tagging.
CN110442839A (en) English text combines mask method into syllables, combines method, storage medium and electronic equipment into syllables
WO2004046956A1 (en) Automated evaluation of overly repetitive word use in an essay
Alvarez et al. Improving the automatic segmentation of subtitles through conditional random field
CN111767709A (en) Logic method for carrying out error correction and syntactic analysis on English text
CN110147546A (en) A kind of syntactic correction method and device of Oral English Practice
CN105786971A (en) International Chinese-teaching oriented grammar point identification method
CN112863484B (en) Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
JP2005292958A (en) Teacher data preparation device and program, language analysis processor and program and summary processor and program
Tedla et al. Analyzing word embeddings and improving POS tagger of tigrinya
Etxeberria et al. Weighted finite-state transducers for normalization of historical texts

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C05 Deemed withdrawal (patent law before 1993)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120919