CN103377184B

CN103377184B - Device, method and equipment for processing language data

Info

Publication number: CN103377184B
Application number: CN201210110614.6A
Authority: CN
Inventors: 葛乃晟; 孟遥; 于浩
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2012-04-16
Filing date: 2012-04-16
Publication date: 2016-03-30
Anticipated expiration: 2032-04-16
Also published as: CN103377184A

Abstract

Embodiments of the invention relate to a device, a method and equipment for processing language data, which at least overcome the inability of existing language processing techniques to eliminate syntactic structural ambiguity. The device for processing language data comprises: an identification module for identifying at least two lexical units that can satisfy at least two syntactic relations; a conversion module for obtaining a candidate sentence for each syntactic relation; a scoring module for scoring each candidate sentence; and a determination module for determining that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence. The method for processing language data performs processing that realizes the functions of the device. The equipment comprises the device. By applying this technique, syntactically ambiguous structures in language data can be disambiguated, and the technique is applicable to the field of natural language processing.

Description

Device, method and equipment for processing language data

Technical field

The present invention relates to the field of natural language processing, and in particular to a device, a method and equipment for processing language data.

Background art

Structural ambiguity is ubiquitous in existing automatic syntactic analysis. For example, in Chinese, words carry no explicit morphological markers, and the same word can serve as different sentence constituents without any change of form. As another example, in English, a postpositive attributive or an attributive clause may in some cases modify more than one possible element. Relying on part-of-speech tagging alone is therefore far from sufficient for sentence analysis, because many syntactic structural ambiguities may remain at the part-of-speech level. That is, when a sentence or a larger constituent of a sentence is decomposed into smaller constituents, different syntactic structures may arise, which differ in their internal grouping hierarchy or in their internal structural relations. It is therefore necessary to propose a scheme that can eliminate such structural ambiguity.

Summary of the invention

A brief overview of the present invention is given below in order to provide a basic understanding of some aspects of the invention. It should be understood that this overview is not an exhaustive summary of the invention. It is not intended to identify key or critical parts of the invention, nor to limit its scope. Its sole purpose is to present some concepts in simplified form as a prelude to the more detailed description that follows.

In view of the above defect of the prior art, an object of the present invention is to provide a device for processing language data, a computer-implemented method for processing language data, and equipment, which at least overcome the inability of existing language processing techniques to eliminate syntactic structural ambiguity.

To achieve this object, according to one aspect of the present invention, a device for processing language data is provided. The device comprises: an identification module configured to obtain the lexical units in language data and to identify, among them, at least two lexical units that can satisfy at least two syntactic relations; a conversion module configured to convert, for each syntactic relation and according to a corresponding conversion method, the identified lexical units into a candidate sentence, such that the candidate sentence has the characteristic syntactic structure corresponding to that syntactic relation; a scoring module configured to score each candidate sentence using a language model; and a determination module configured to determine that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence.

According to another aspect of the present invention, a computer-implemented method for processing language data is also provided. The method comprises: obtaining the lexical units in language data, and identifying, among them, at least two lexical units that can satisfy at least two syntactic relations; for each syntactic relation, converting the identified lexical units into a candidate sentence according to a corresponding conversion method, such that the candidate sentence has the characteristic syntactic structure corresponding to that syntactic relation; scoring each candidate sentence using a language model; and determining that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence.

According to another aspect of the present invention, equipment is also provided, which comprises the device for processing language data described above.

According to a further aspect of the present invention, a corresponding computer-readable storage medium is also provided. The storage medium stores a computer program executable by a computing device, and the program, when executed, causes the computing device to perform the computer-implemented method for processing language data described above.

The device for processing language data according to embodiments of the present invention, the computer-implemented method, and the equipment comprising the device can achieve at least one of the following benefits: disambiguating an ambiguous structure in the language data to be processed by determining its unique syntactic relation; improving the accuracy of machine translation; and improving retrieval accuracy.

These and other advantages of the present invention will become more apparent from the following detailed description of preferred embodiments of the invention in conjunction with the accompanying drawings.

Brief description of the drawings

The present invention can be better understood by reference to the description given below in conjunction with the accompanying drawings, in which the same or similar reference numerals denote the same or similar parts throughout. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:

Fig. 1 is a block diagram schematically showing an example structure of a device for processing language data according to an embodiment of the present invention.

Fig. 2 is a block diagram schematically showing one possible example structure of the conversion module in Fig. 1.

Fig. 3 is a flowchart schematically showing an example process of a computer-implemented method for processing language data according to an embodiment of the present invention.

Fig. 4 is a flowchart schematically showing one possible example process of step S330 shown in Fig. 3.

Fig. 5 is a block diagram showing the hardware configuration of one possible information processing apparatus that can be used to implement the device for processing language data and the computer-implemented method for processing language data according to embodiments of the present invention.

Those skilled in the art will appreciate that elements in the drawings are illustrated for simplicity and clarity and are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to others to help improve understanding of the embodiments of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described below in conjunction with the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that in developing any such actual implementation many implementation-specific decisions must be made to achieve the developer's specific goals, such as compliance with system-related and business-related constraints, and that these constraints may vary from one implementation to another. It should also be appreciated that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.

It should also be noted that, to avoid obscuring the present invention with unnecessary detail, the drawings show only the device structures and/or processing steps closely related to the scheme of the invention, and omit other details of little relevance.

To solve the problem that existing language processing techniques cannot eliminate syntactic structural ambiguity, the present invention proposes a device for processing language data. The device comprises: an identification module configured to obtain the lexical units in language data and to identify, among them, at least two lexical units that can satisfy at least two syntactic relations; a conversion module configured to convert, for each syntactic relation and according to a corresponding conversion method, the identified lexical units into a candidate sentence, such that the candidate sentence has the characteristic syntactic structure corresponding to that syntactic relation; a scoring module configured to score each candidate sentence using a language model; and a determination module configured to determine that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence.

The device for processing language data according to an embodiment of the present invention is described in detail below with reference to Figs. 1 and 2.

Fig. 1 is a block diagram schematically showing an example structure of a device 100 for processing language data according to an embodiment of the present invention. As shown in Fig. 1, the device 100 comprises an identification module 110, a conversion module 120, a scoring module 130 and a determination module 140.

As shown in Fig. 1, the identification module 110 of the device 100 can, for example, obtain the lexical units in the language data by performing word segmentation on the language data, and can identify the at least two lexical units that can satisfy at least two syntactic relations by performing syntactic analysis on the segmented language data. Syntactic analysis (parsing) is an existing technique for analyzing the grammatical function of the words in a sentence. It is a direct realization of the idea of chunk parsing (chunking), which simplifies the description of a sentence by identifying high-level structural units.

The language data mentioned here may comprise, for example, one or more sentences and/or phrases. For example, word segmentation of a phrase that may be contained in the language data, "repair Master Zhang's car", can yield "repair / Master Zhang / 的 / car" (where 的 is the Chinese structural particle), that is, 4 lexical units. Syntactic analysis shows that "repair Master Zhang's car" may satisfy two syntactic relations. One is the verb-object syntactic relation, in which "Master Zhang's car" is the object of "repair". The other is the attribute-head syntactic relation, in which "repair Master Zhang 的" is the attribute of "car" and "car" is the head. Note that a "syntactic relation that may be satisfied" or a "syntactic relation that can be satisfied" refers to a syntactic relation that is admissible at the syntactic level, whether or not it is admissible at the semantic level. The 4 lexical units "repair", "Master Zhang", "的" and "car" can thus satisfy two syntactic relations. The identification module 110 can therefore identify groups of lexical units like this one, that is, phrases or sentences such as "repair / Master Zhang / 的 / car" that remain syntactically ambiguous at the part-of-speech level. Hereinafter, for convenience and clarity, the "at least two lexical units that can satisfy at least two syntactic relations" identified by the identification module 110 are referred to as an "ambiguous structure".
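As a rough illustration of what the identification step produces, the following Python sketch flags a four-unit window whose part-of-speech tags match verb, noun, structural particle, noun, and records the two syntactic relations it may satisfy. The tag names (`V`, `N`, `DE`), the `AmbiguousStructure` class and the `find_ambiguous` function are hypothetical, and the fixed tag pattern is only a stand-in for the syntactic analysis that the identification module 110 is described as performing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AmbiguousStructure:
    """Hypothetical record of an identified ambiguous structure."""
    units: Tuple[str, ...]       # the lexical units, in source order
    relations: Tuple[str, ...]   # syntactic relations the units may satisfy


# Assumed tag pattern: verb, noun, structural particle, noun.
PATTERN = ("V", "N", "DE", "N")


def find_ambiguous(tagged: List[Tuple[str, str]]) -> List[AmbiguousStructure]:
    """Scan (word, tag) pairs and collect every window matching PATTERN."""
    found = []
    for start in range(len(tagged) - len(PATTERN) + 1):
        window = tagged[start:start + len(PATTERN)]
        if tuple(tag for _, tag in window) == PATTERN:
            found.append(AmbiguousStructure(
                units=tuple(word for word, _ in window),
                relations=("verb-object", "attribute-head"),
            ))
    return found


# English glosses stand in for the segmented Chinese phrase.
phrase = [("repair", "V"), ("Master Zhang", "N"), ("的", "DE"), ("car", "N")]
structures = find_ambiguous(phrase)
```

A real identification module would derive the candidate relations from a parser rather than from a hard-coded pattern; the sketch only shows the shape of the output passed to the later modules.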

Next, the unique syntactic relation of the ambiguous structure can be determined through the processing of the conversion module 120, the scoring module 130 and the determination module 140.

As described above, the "at least two lexical units" identified by the identification module 110 can satisfy at least two syntactic relations. For each of these syntactic relations, the conversion module 120 can therefore convert the identified lexical units into a candidate sentence according to a corresponding conversion method, such that the resulting candidate sentence has the characteristic syntactic structure corresponding to that syntactic relation. The characteristic syntactic structure is determined by the definition and properties of the syntactic relation itself, as illustrated below.

In one implementation of the device for processing language data according to an embodiment of the present invention, the conversion module 120 shown in Fig. 1 can be realized with the structure shown in Fig. 2. Fig. 2 is a block diagram schematically showing one possible example structure of the conversion module 120 in Fig. 1.

As shown in Fig. 2, the conversion module 120 may comprise a determining submodule 210 and a converting submodule 220.

For each of the at least two syntactic relations that the lexical units identified by the identification module 110 can satisfy, the determining submodule 210 can determine the characteristic syntactic structure corresponding to that syntactic relation, determine the order of the sentence constituents in that characteristic syntactic structure and the tense of each constituent, and determine the feature words that characterize the structure together with their positions.

Then, for each such syntactic relation, the converting submodule 220 can reorder the identified lexical units so that their order matches the order of the sentence constituents in the corresponding characteristic syntactic structure, convert the tense of the identified lexical units so that it matches the tense of those constituents, and, according to the feature words characterizing the structure and their positions, add the corresponding feature words at the corresponding positions among the converted lexical units, thereby obtaining the candidate sentence.
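The three conversion operations just described (reordering, form conversion and feature-word insertion) can be sketched as a small data-driven routine. This is a minimal sketch under stated assumptions, not the patented implementation: the `CharacteristicStructure` class, its field names and the `convert` function are hypothetical, and each lexical unit is assumed to arrive with a precomputed table of its surface forms.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CharacteristicStructure:
    """Hypothetical description of a characteristic syntactic structure."""
    order: List[str]                                     # constituent roles in target order
    forms: Dict[str, str] = field(default_factory=dict)  # role -> required form name
    feature_words: List[Tuple[int, str]] = field(default_factory=list)  # (output index, word)


def convert(units: Dict[str, Dict[str, str]],
            structure: CharacteristicStructure) -> List[str]:
    """Build a candidate sentence (as a token list) for one syntactic relation.

    `units` maps a role name to a table of surface forms keyed by form name;
    every table is assumed to contain at least a "base" form.
    """
    # Steps 1 and 2: reorder the units and pick the form each constituent needs.
    tokens = [units[role][structure.forms.get(role, "base")]
              for role in structure.order]
    # Step 3: insert feature words; ascending order keeps the indices valid.
    for index, word in sorted(structure.feature_words):
        tokens.insert(index, word)
    return tokens


# Usage with English glosses: a structure that puts the object before a passive verb.
passive_structure = CharacteristicStructure(order=["object", "verb"],
                                            forms={"verb": "passive"})
units = {
    "verb": {"base": "repair", "passive": "is repaired"},
    "object": {"base": "Master Zhang's car"},
}
candidate = convert(units, passive_structure)
```

Keeping the structure as data rather than code means one routine could serve every syntactic relation, with the determining submodule 210 supplying a different `CharacteristicStructure` each time.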

In one example, suppose that one of the syntactic relations that the lexical units identified by the identification module 110 can satisfy is the verb-object syntactic relation. That is, where the at least two lexical units comprise, for example, a first lexical unit and a second lexical unit, suppose that the first lexical unit is the verb lexical unit and the second lexical unit is the object lexical unit.

Note that, although the "at least two lexical units" in this example comprise only a first and a second lexical unit, in actual processing they may comprise more lexical units. This case can be regarded as the second lexical unit itself comprising several lexical subunits that together constitute the "object lexical unit". For example, for the verb-object syntactic relation that "repair Master Zhang's car" may satisfy, "repair" can be regarded as the first lexical unit and "Master Zhang's car" as the second lexical unit, where the three lexical subunits in "Master Zhang / 的 / car" together form the object of "repair".

In this example, when the first lexical unit has a passive form, the characteristic syntactic structure corresponding to the verb-object syntactic relation, as obtained by the determining submodule 210, can be, for example, the following structure (hereinafter the first structure): "object" + "passive form of the verb". In this case, for the verb-object syntactic relation, the converting submodule 220 can convert the order and tense of the lexical units identified by the identification module 110 to match the order and tense of the sentence constituents in the first structure. That is, the identified lexical units can be converted into a candidate sentence with the structure: "second lexical unit" + "passive form of the first lexical unit".

When the first lexical unit has no passive form, the characteristic syntactic structure corresponding to the verb-object syntactic relation, as obtained by the determining submodule 210, can comprise, for example, the following structure (hereinafter the second structure): "verb" + "object head"; "object". In this case, for the verb-object syntactic relation, the converting submodule 220 can convert the order and tense of the identified lexical units to match the order and tense of the sentence constituents in the second structure. That is, the identified lexical units can be converted into a candidate sentence with the structure: "first lexical unit" + "head of the second lexical unit"; "second lexical unit". For example, if the second lexical unit is "receiver of a mobile phone", its head is "receiver". Note that when the second lexical unit consists only of the head itself, the "head of the second lexical unit" is the second lexical unit itself.

Because the first and second structures contain no feature words, no feature words need to be added to the candidate sentences obtained by conversion.
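The choice between the first and second structures can be sketched as a single branch on whether a passive form is available. The function name `verb_object_candidate` and its parameters are hypothetical, the passive form is assumed to be supplied by the caller, and English glosses stand in for the Chinese lexical units.

```python
from typing import Optional


def verb_object_candidate(verb: str, obj: str, obj_head: str,
                          passive: Optional[str] = None) -> str:
    """Return the candidate sentence for the verb-object reading."""
    if passive is not None:
        # First structure: "object" + "passive form of the verb".
        return f"{obj} {passive}"
    # Second structure: "verb" + "object head"; "object".
    return f"{verb} {obj_head}; {obj}"


# A verb with a passive form, and one without.
with_passive = verb_object_candidate(
    "repair", "Master Zhang's car", "car", passive="is repaired")
without_passive = verb_object_candidate(
    "become", "the company's leader", "leader")
```

Neither branch inserts any extra tokens, which matches the observation that these two structures carry no feature words.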

In another example, suppose that one of the syntactic relations that the lexical units identified by the identification module 110 can satisfy is the attribute-head syntactic relation. That is, where the at least two lexical units comprise, for example, a third lexical unit and a fourth lexical unit, suppose that the third lexical unit is the attribute lexical unit and the fourth lexical unit is the head lexical unit.

As above, the third lexical unit may also comprise several lexical subunits that together constitute the attribute lexical unit that modifies or restricts the fourth lexical unit.

In this example, the characteristic syntactic structure corresponding to the attribute-head syntactic relation, as obtained by the determining submodule 210, can be, for example, the following structure (hereinafter the third structure): "numeral" + "measure word" + "attribute" + "head". Here the "numeral" and the "measure word" are the feature words of one characteristic syntactic structure corresponding to the attribute-head syntactic relation, although the feature words are not limited to these. For the attribute-head syntactic relation, the converting submodule 220 can convert the order and tense of the lexical units identified by the identification module 110 to match the order and tense of the sentence constituents in the third structure, and add the feature words (such as the "numeral" and the "measure word") at the corresponding positions among the converted lexical units. The identified lexical units can thus be converted into a candidate sentence with the structure: "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit". The measure word here denotes the unit of quantity of the fourth lexical unit. For example, for the attribute-head syntactic relation that "repair Master Zhang's car" may satisfy, "car" can be regarded as the fourth lexical unit and "repair Master Zhang 的" as the third lexical unit, where the three lexical subunits in "repair / Master Zhang / 的" together form the attribute that restricts "car".

In addition, in this example, the characteristic syntactic structure corresponding to the attribute-head syntactic relation, as obtained by the determining submodule 210, can also be, for example, the following structure (hereinafter the fourth structure): "attribute"; "numeral" + "measure word" + "attribute" + "head". For the attribute-head syntactic relation, the converting submodule 220 can convert the order and tense of the identified lexical units to match the order and tense of the sentence constituents in the fourth structure, and add the feature words (such as the "numeral" and the "measure word") at the corresponding positions among the converted lexical units. The identified lexical units can thus be converted into a candidate sentence with the structure: "third lexical unit"; "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit". The measure word here denotes the unit of quantity of the fourth lexical unit.
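The third and fourth structures differ only in whether the bare attribute is repeated in front of the counted noun phrase, which the following sketch makes explicit. The function name, the `repeat_attribute` flag and the placeholder tokens `one` and `MW` (standing for a numeral and a measure word) are assumptions made for illustration.

```python
def attribute_head_candidate(attribute: str, head: str,
                             numeral: str = "one", measure_word: str = "MW",
                             repeat_attribute: bool = False) -> str:
    """Return the candidate sentence for the attribute-head reading."""
    # Third structure: "numeral" + "measure word" + "attribute" + "head".
    counted = " ".join([numeral, measure_word, attribute, head])
    if repeat_attribute:
        # Fourth structure: "attribute"; followed by the third structure.
        return f"{attribute}; {counted}"
    return counted


# English glosses; the tokens stay in the attribute-before-head order of the source.
third = attribute_head_candidate("repair Master Zhang 的", "car")
fourth = attribute_head_candidate("repair Master Zhang 的", "car",
                                  repeat_attribute=True)
```

In practice the measure word would be chosen to suit the head noun, since it denotes the unit of quantity of the fourth lexical unit.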

Thus, the conversion module 120 can convert the lexical units identified by the identification module 110, according to the corresponding conversion methods, into a candidate sentence for each syntactic relation that these lexical units may satisfy. That is, at least two candidate sentences can be obtained.

Then the scoring module 130 can score each of the at least two candidate sentences using a language model, and the determination module 140 finally determines the syntactic relation corresponding to the highest-scoring candidate sentence to be the unique syntactic relation of the lexical units identified by the identification module 110.
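The scoring and determination steps amount to taking the argmax over the candidate sentences. In the sketch below, `determine_relation` and the `score` callable are hypothetical names, and the toy scores are made-up numbers chosen only to exercise the function, not results from the patent.

```python
from typing import Callable, Dict


def determine_relation(candidates: Dict[str, str],
                       score: Callable[[str], float]) -> str:
    """Return the syntactic relation whose candidate sentence scores highest.

    `candidates` maps a relation name to its candidate sentence.
    """
    return max(candidates, key=lambda relation: score(candidates[relation]))


# Made-up scores standing in for a real language model.
toy_scores = {
    "Master Zhang's car is repaired": -4.2,
    "a car that repairs Master Zhang": -9.7,
}
candidates = {
    "verb-object": "Master Zhang's car is repaired",
    "attribute-head": "a car that repairs Master Zhang",
}
chosen = determine_relation(candidates, toy_scores.__getitem__)
```

Passing the scorer as a callable keeps the determination step independent of which kind of language model produces the scores.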

A language model (that is, a statistical language model) is a model that can compute the probability of a sentence, and its parameters can be learned automatically from a large-scale corpus. With a language model, one can determine which of several word sequences is more probable or, given several words, predict the word most likely to come next. Common methods include n-gram models, decision trees, maximum entropy models, maximum entropy Markov models, conditional random fields and neural networks, all of which can be used here.
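To make the n-gram option concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The class name `BigramModel`, the sentence-boundary markers `<s>` and `</s>`, and the two-sentence toy corpus are assumptions for illustration; a usable model would be trained on a large corpus and would likely use a more refined smoothing scheme.

```python
import math
from collections import Counter
from typing import List


class BigramModel:
    """Add-one-smoothed bigram language model over tokenized sentences."""

    def __init__(self, corpus: List[List[str]]) -> None:
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        vocab = {"<s>", "</s>"}
        for sentence in corpus:
            tokens = ["<s>"] + sentence + ["</s>"]
            vocab.update(sentence)
            # Count each token as a context (everything except the end marker).
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab)

    def log_prob(self, sentence: List[str]) -> float:
        """Return the natural-log probability of a tokenized sentence."""
        tokens = ["<s>"] + sentence + ["</s>"]
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            total += math.log(numerator / denominator)
        return total


# Toy corpus, invented for illustration only.
corpus = [["the", "car", "is", "repaired"], ["the", "car", "is", "new"]]
model = BigramModel(corpus)
fluent = model.log_prob(["the", "car", "is", "repaired"])
scrambled = model.log_prob(["repaired", "is", "car", "the"])
```

On this toy corpus the word order seen in training receives the higher log-probability, which is the property the scoring module relies on when comparing candidate sentences.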

For example, in one application example of the device for processing language data according to an embodiment of the present invention, the "ambiguous structures" identified in certain language data by the method described above comprise, for example:

"repair / Master Zhang / 的 / car" (hereinafter the "first ambiguous structure"),

"repair / car / 的 / Master Zhang" (hereinafter the "second ambiguous structure"),

"become / company / 的 / leader" (hereinafter the "third ambiguous structure"),

"become / soldier / 的 / student" (hereinafter the "fourth ambiguous structure").

The syntactic relations that the first ambiguous structure may satisfy comprise the verb-object syntactic relation and the attribute-head syntactic relation.

For the verb-object syntactic relation that the first ambiguous structure may satisfy, because the verb "repair" has a passive form, the first ambiguous structure can be converted into the first candidate sentence "Master Zhang's car is repaired".

For the attribute-head syntactic relation that the first ambiguous structure may satisfy, the first ambiguous structure can be converted into the second candidate sentence "a car that repairs Master Zhang" (numeral + measure word + attribute + head), or into the third candidate sentence "repair Master Zhang; a car that repairs Master Zhang". Preferably, the second candidate sentence is selected.

Similarly, for the verb-object syntactic relation that the second ambiguous structure may satisfy, the second ambiguous structure can be converted into the fourth candidate sentence "the car's Master Zhang is repaired". For the attribute-head syntactic relation that the second ambiguous structure may satisfy, the second ambiguous structure can be converted into the fifth candidate sentence "a Master Zhang who repairs cars" or the sixth candidate sentence "repair cars; a Master Zhang who repairs cars". Preferably, the fifth candidate sentence is selected.

For the verb-object syntactic relation that the third ambiguous structure may satisfy, because the verb "become" has no passive form, the third ambiguous structure can be converted into the seventh candidate sentence "become leader; the company's leader". For the attribute-head syntactic relation that the third ambiguous structure may satisfy, the third ambiguous structure can be converted into the eighth candidate sentence "a leader who becomes a company" or, preferably, into the ninth candidate sentence "become company; a leader who becomes a company".

Similarly, for the verb-object syntactic relation that the fourth ambiguous structure may satisfy, the fourth ambiguous structure can be converted into the tenth candidate sentence "become student; the soldier's student". For the attribute-head syntactic relation that the fourth ambiguous structure may satisfy, the fourth ambiguous structure can be converted into the eleventh candidate sentence "a student who becomes a soldier" or, preferably, into the twelfth candidate sentence "become soldier; a student who becomes a soldier".

For two candidate sentences of the first ambiguous structure (for example the first and second candidate sentences), each is scored using the language model. The scoring can show that the first candidate sentence scores higher than the second, so the unique syntactic relation of the first ambiguous structure "repair / Master Zhang / 的 / car" can be determined to be the syntactic relation corresponding to the first candidate sentence, that is, the verb-object syntactic relation.

Of the two candidate sentences of the second ambiguous structure "repair / car / 的 / Master Zhang" (for example the fourth and fifth candidate sentences), the fifth scores higher, so the unique syntactic relation of the second ambiguous structure can be determined to be the syntactic relation corresponding to the fifth candidate sentence, that is, the attribute-head syntactic relation.

Similarly, the unique syntactic relation of the third ambiguous structure "become / company / 的 / leader" can be determined to be the syntactic relation corresponding to the seventh candidate sentence, that is, the verb-object syntactic relation, and the unique syntactic relation of the fourth ambiguous structure "become / soldier / 的 / student" can be determined to be the syntactic relation corresponding to the eleventh or twelfth candidate sentence, that is, the attribute-head syntactic relation.

In addition, in another application example of the device for processing language data according to an embodiment of the present invention, the method described above shows that the sentence "See a girl with a hat" (hereinafter the fifth ambiguous structure) in certain language data may satisfy two syntactic relations, "verb + object + attribute" and "verb + object + complement". That is, "with a hat" may be an attribute modifying the object "a girl", or it may be an adverbial modifying the verb "see".

If "with a hat" is an attribute modifying the object "a girl", the attribute "with a hat" can be placed immediately next to the head it modifies, "a girl"; that is, "a girl with a hat" is the object of "see". This yields the thirteenth candidate sentence of the fifth ambiguous structure, "A girl with a hat was seen".

If "with a hat" is an adverbial modifying the verb "see", the adverbial "with a hat" can be placed immediately next to the verb it modifies, "see". This yields the fourteenth candidate sentence of the fifth ambiguous structure, "A girl was seen with a hat".

Scoring the thirteenth and fourteenth candidate sentences with the language model shows that the thirteenth candidate sentence "A girl with a hat was seen" scores higher than the fourteenth candidate sentence "A girl was seen with a hat". The unique syntactic relation of the fifth ambiguous structure can therefore be determined to be the syntactic relation corresponding to the thirteenth candidate sentence, that is, "verb + object + attribute".
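For this English example, the two candidate sentences differ only in where the prepositional phrase sits relative to the passive verb. A minimal sketch, with a hypothetical function name and with the passive verb form supplied by the caller:

```python
from typing import Dict


def pp_attachment_candidates(obj: str, passive_verb: str, pp: str) -> Dict[str, str]:
    """Build one candidate sentence per reading of a trailing prepositional phrase."""
    return {
        # The phrase modifies the object, so it stays next to the object.
        "verb+object+attribute": f"{obj} {pp} {passive_verb}",
        # The phrase modifies the verb, so it stays next to the verb.
        "verb+object+complement": f"{obj} {passive_verb} {pp}",
    }


candidates = pp_attachment_candidates("A girl", "was seen", "with a hat")
```

Each value can then be scored by the language model, and the key of the higher-scoring sentence names the syntactic relation to keep.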

Here, it should be noted that, in actual treatment, still there are some such as " to kill the dog of hunter " this and be not suitable for the statement/phrase using said apparatus to process, but in most cases, utilize the device of above-mentioned according to an embodiment of the invention process language data can determine unique syntactic relation of all ambiguous structures as described above of major part.

As the above description shows, the device for processing language data according to an embodiment of the invention identifies an ambiguous structure in the language data (that is, at least two lexical units capable of satisfying at least two syntactic relations), obtains one candidate sentence for each syntactic relation that the identified ambiguous structure may satisfy, uses a language model to score each candidate sentence, and determines the syntactic relation corresponding to the highest-scoring candidate sentence to be the unique syntactic relation of the identified ambiguous structure, thereby eliminating the syntactic ambiguity of that structure.

In addition, embodiments of the invention also provide equipment that includes the device for processing language data described above.

For example, the equipment can use its device for processing language data to identify the syntactic relation of a sentence to be machine-translated, so that the sentence is machine-translated on the basis of the identified syntactic relation.

Take the sentence "Master Zhang, who repairs the car, was on leave yesterday." as an example. If the phrase "Master Zhang who repairs the car" is not disambiguated, a machine translation system based on syntactic analysis may well parse "Master Zhang of the car was on leave yesterday" as the object of "repair", so that the translation result may be "Repair the Master Zhang of car was on leave yesterday.". If the disambiguated syntactic information is supplied to a rule-based translation system, the accuracy of its syntactic analysis can be improved, and the object of "repair" will not be analyzed incorrectly. For example, once the device for processing language data in the equipment has disambiguated "Master Zhang who repairs the car", that is, once it is known that "repairs the car" is an attribute of "Master Zhang", the final translation result will be correct:

Master Zhang who repairs the car was on leave yesterday.

Even in a statistics-based machine translation system, "repairs the car" and "Master Zhang" may fail to be linked grammatically. In one actual statistics-based machine translation system, the translation result for this sentence is:

The car repairs Master Zhang yesterday was on leave.

Similarly, if the device for processing language data in the equipment disambiguates "Master Zhang who repairs the car" and the resulting unique syntactic relation is supplied to the statistics-based machine translation system, the sentence can be split into "Master Zhang was on leave yesterday." and "Repair the car.". The two sentences are then translated separately, and "repair the car" is attached to the sentence "Master Zhang was on leave yesterday" as an attribute, which gives the following translation:

Master Zhang who repairs the car yesterday was on leave.

A sentence translated in this way is also easier to understand.

In other examples, the equipment can also use its device for processing language data to identify the syntactic relation of a key phrase or sentence to be retrieved, so that the key phrase or sentence is retrieved on the basis of the identified syntactic relation; and so on.

The equipment may be, for example, a computer (such as a desktop or notebook computer), a mobile phone, a tablet computer, an e-book reader, a personal digital assistant, and so on.

As the above description shows, the equipment according to an embodiment of the invention uses its built-in device for processing language data to process the language data to be handled and to determine the unique syntactic relation of any ambiguous structure it may contain. The determined syntactic relation can then be used in other processing such as machine translation or retrieval, which improves the result of that processing, for example its accuracy (such as translation accuracy or retrieval accuracy).

In addition, embodiments of the invention also provide a computer-implemented method for processing language data. An exemplary process of the method is described below with reference to Fig. 3 and Fig. 4.

As shown in Fig. 3, the process flow 300 of the computer-implemented method for processing language data according to an embodiment of the invention starts at step S310 and then proceeds to step S320.

In step S320, the lexical units in the language data are obtained, and at least two lexical units capable of satisfying at least two syntactic relations are identified among them. Step S330 is then performed.

In step S330, for each of the at least two syntactic relations that the "at least two lexical units" identified in step S320 can satisfy, the identified "at least two lexical units" are converted into a candidate sentence according to a corresponding conversion method, so that the candidate sentence has the feature syntactic structure corresponding to that syntactic relation. Step S340 is then performed.

Here, for each of the at least two syntactic relations that the "at least two lexical units" identified in step S320 can satisfy, steps S410 to S430 shown in Fig. 4 can be performed to implement the processing of step S330.

In step S410, the following are determined: the feature syntactic structure corresponding to the syntactic relation, the order and tense of the sentence elements in that feature syntactic structure, and the feature words that characterize the feature syntactic structure together with their positions. Step S420 is then performed.

In step S420, the order and tense of the identified lexical units are converted to match the order and tense of the sentence elements in the determined feature syntactic structure. Step S430 is then performed.

In step S430, according to the determined feature words that characterize the feature syntactic structure and their positions, the corresponding feature words are added at the corresponding positions in the converted lexical units, yielding the candidate sentence.
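Steps S410 to S430 can be sketched as a small data-driven conversion. The `FeatureStructure` class, its field names, the `PASSIVE` lookup table and the `convert` function are hypothetical names introduced here for illustration; the description does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FeatureStructure:
    """Hypothetical result of S410 for one syntactic relation."""
    order: List[str]  # slot names in the order required by the structure
    transforms: Dict[str, Callable[[str], str]] = field(default_factory=dict)  # tense/voice changes per slot
    feature_words: List[Tuple[int, str]] = field(default_factory=list)  # (position, word) pairs


def convert(units: Dict[str, str], structure: FeatureStructure) -> str:
    """Build one candidate sentence from the identified lexical units."""
    # S420: reorder the units and adjust their tense to match the structure.
    tokens = []
    for slot in structure.order:
        transform = structure.transforms.get(slot, lambda text: text)
        tokens.append(transform(units[slot]))
    # S430: add the feature words at their positions (ascending order keeps indices valid).
    for position, word in sorted(structure.feature_words):
        tokens.insert(position, word)
    return " ".join(tokens)


# Assumed lookup of passive forms; a real system would need a morphological resource.
PASSIVE = {"see": "was seen"}

# Structure for the reading in which "with a hat" modifies the object.
attribute_reading = FeatureStructure(
    order=["object", "verb"],
    transforms={"verb": lambda verb: PASSIVE[verb]},
)

candidate = convert({"verb": "see", "object": "a girl with a hat"}, attribute_reading)
```

Keeping the structure as data rather than code means a new syntactic relation only needs a new `FeatureStructure` value, which is one possible reading of "corresponding conversion method" in step S330.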

In one example, one of the syntactic relations that the identified "at least two lexical units capable of satisfying at least two syntactic relations" can satisfy is the verb-object relation. That is, where the at least two lexical units include, for example, a first lexical unit and a second lexical unit, the first lexical unit is the verb lexical unit and the second lexical unit is the object lexical unit; the second lexical unit may consist of a single word or may include several lexical sub-units. In this example, when the first lexical unit has a passive form, the feature syntactic structure corresponding to the verb-object relation may be, for example, the first structure described above. In that case, for the verb-object relation, the order and tense of the identified lexical units can be converted to match the order and tense of the sentence elements in the first structure; that is, the identified lexical units can be converted into a candidate sentence with the following structure: "second lexical unit" + "passive form of the first lexical unit". When the first lexical unit has no passive form, the feature syntactic structure corresponding to the verb-object relation may be, for example, the second structure described above. In that case, for the verb-object relation, the order and tense of the identified lexical units can be converted to match the order and tense of the sentence elements in the second structure; that is, the identified lexical units can be converted into a candidate sentence with the following structure: "first lexical unit" + "head of the second lexical unit"; "second lexical unit".
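The two verb-object templates above can be written down directly. This is only a sketch: the function name `verb_object_candidate`, its parameters, and the example strings in the usage note are hypothetical, and how the passive form and the head of the object are obtained is left outside the sketch.

```python
from typing import Optional


def verb_object_candidate(verb: str, obj: str, obj_head: str,
                          passive_form: Optional[str] = None) -> str:
    """Build the candidate sentence for the verb-object relation.

    passive_form is None when the verb lexical unit has no passive form.
    """
    if passive_form is not None:
        # First structure: "second lexical unit" + "passive form of the first lexical unit".
        return f"{obj} {passive_form}"
    # Second structure: "first lexical unit" + "head of the second lexical unit"; "second lexical unit".
    return f"{verb} {obj_head}; {obj}"
```

For instance, `verb_object_candidate("see", "a girl with a hat", "girl", "was seen")` follows the first structure, while omitting `passive_form` falls back to the second structure.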

In another example, one of the syntactic relations that the identified "at least two lexical units capable of satisfying at least two syntactic relations" can satisfy is the attributive-head relation. That is, where the at least two lexical units include, for example, a third lexical unit and a fourth lexical unit, the third lexical unit is the attributive lexical unit and the fourth lexical unit is the head lexical unit. As above, the third lexical unit may also include several lexical sub-units, which together form the attributive lexical unit that modifies or restricts the fourth lexical unit. In this example, the feature syntactic structure corresponding to the attributive-head relation may be, for example, the third structure described above. For the attributive-head relation, the order and tense of the identified lexical units can be converted to match the order and tense of the sentence elements in the third structure, and feature words (such as a numeral and a measure word) can be added at the corresponding positions in the converted lexical units, so that the identified lexical units are converted into a candidate sentence with the following structure: "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit". The measure word here expresses the unit of quantity of the fourth lexical unit. Alternatively, the feature syntactic structure corresponding to the attributive-head relation may be, for example, the fourth structure described above. For the attributive-head relation, the order and tense of the identified lexical units can be converted to match the order and tense of the sentence elements in the fourth structure, and feature words (such as a numeral and a measure word) can be added at the corresponding positions in the converted lexical units, so that the identified lexical units are converted into a candidate sentence with the following structure: "third lexical unit"; "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit". Again, the measure word expresses the unit of quantity of the fourth lexical unit.
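The third and fourth structures can be sketched in the same way. The `MEASURE_WORDS` table, the default numeral, the function name `attributive_head_candidates` and the Chinese example strings are assumptions chosen for illustration; they are not taken from the description, which only requires that the measure word suit the fourth lexical unit.

```python
# Assumed lookup from a head word to a suitable measure word (classifier).
MEASURE_WORDS = {"师傅": "位", "学生": "名"}


def attributive_head_candidates(attribute: str, head: str, numeral: str = "一"):
    """Return the candidate sentences for the attributive-head relation.

    Chinese text is concatenated without spaces; ";" separates the two parts
    of the fourth structure.
    """
    measure_word = MEASURE_WORDS[head]
    # Third structure: numeral + measure word + third unit + fourth unit.
    third = numeral + measure_word + attribute + head
    # Fourth structure: third unit; numeral + measure word + third unit + fourth unit.
    fourth = attribute + ";" + third
    return [third, fourth]


candidates = attributive_head_candidates("修理汽车的", "师傅")
```

Inserting a numeral and measure word before the attributive forces a noun-phrase reading, which is why these feature words help a language model separate the attributive-head reading from the verb-object one.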

In this way, one candidate sentence can be obtained for each syntactic relation that can be satisfied.

Returning to Fig. 3, in step S340 a language model is used to score each candidate sentence obtained in step S330 for the various syntactic relations, and in step S350 the syntactic relation corresponding to the highest-scoring candidate sentence is determined to be the unique syntactic relation of the identified "at least two lexical units".

The process flow 300 ends at step S360.
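Put together, steps S320 to S350 amount to a short loop. In the sketch below the function name `disambiguate` and its three pluggable parameters (`identify`, `converters`, `score`) are hypothetical; each stands in for a component whose internals the description leaves open.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Units = Tuple[str, ...]


def disambiguate(
    text: str,
    identify: Callable[[str], Iterable[Tuple[Units, Sequence[str]]]],
    converters: Dict[str, Callable[[Units], str]],
    score: Callable[[str], float],
) -> List[Tuple[Units, str]]:
    """Return (lexical units, chosen syntactic relation) for each ambiguous structure."""
    results = []
    # S320: find lexical units that can satisfy at least two syntactic relations.
    for units, relations in identify(text):
        # S330: one candidate sentence per possible syntactic relation.
        candidates = {relation: converters[relation](units) for relation in relations}
        # S340: score every candidate sentence with the language model.
        scores = {relation: score(sentence) for relation, sentence in candidates.items()}
        # S350: keep the relation whose candidate sentence scores highest.
        results.append((units, max(scores, key=scores.get)))
    return results
```

Passing the identifier, converters and scorer as arguments keeps the loop independent of any specific parser or language model, so each part can be replaced or tested on its own.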

In practical applications, the method for processing language data can be used, for example, to identify the syntactic relation of a key phrase or sentence to be retrieved, so that the key phrase or sentence is retrieved on the basis of the identified syntactic relation; or it can be used to identify the syntactic relation of a sentence to be machine-translated, so that the sentence is machine-translated on the basis of the identified syntactic relation.

As the above description shows, the computer-implemented method for processing language data according to an embodiment of the invention identifies an ambiguous structure in the language data (that is, at least two lexical units capable of satisfying at least two syntactic relations), obtains one candidate sentence for each syntactic relation that the identified ambiguous structure may satisfy, uses a language model to score each candidate sentence, and determines the syntactic relation corresponding to the highest-scoring candidate sentence to be the unique syntactic relation of the identified ambiguous structure, thereby eliminating the syntactic ambiguity of that structure. In processing such as machine translation or retrieval, the method can be applied to the language data to be handled in order to determine the unique syntactic relation of any ambiguous structure it may contain. The determined syntactic relation can then be used in that processing, which improves its result, for example its accuracy (such as translation accuracy or retrieval accuracy).

The modules, submodules and so on of the device for processing language data according to an embodiment of the invention can be configured by software, firmware, hardware, or any combination thereof. When implemented by software or firmware, the program constituting the software or firmware can be installed from a storage medium or a network onto a machine having a dedicated hardware structure (for example, the general-purpose machine 500 shown in Fig. 5); with the various programs installed, this machine can perform the functions of the modules and submodules described above.

Fig. 5 is a schematic diagram of the hardware configuration of one possible information processing apparatus that can be used to implement the device and the method for processing language data according to embodiments of the invention.

In Fig. 5, a central processing unit (CPU) 501 performs various processing according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores, as needed, the data required when the CPU 501 performs the various processing. The CPU 501, the ROM 502 and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.

The following components are also connected to the input/output interface 505: an input section 506 (including a keyboard, a mouse and the like), an output section 507 (including a display such as a cathode-ray tube (CRT) or liquid crystal display (LCD), a loudspeaker and the like), the storage section 508 (including a hard disk and the like), and a communication section 509 (including a network interface card such as a LAN card, a modem and the like). The communication section 509 performs communication processing via a network such as the Internet. A drive 510 may also be connected to the input/output interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disc, a magneto-optical disc or a semiconductor memory can be mounted on the drive 510 as needed, so that a computer program read from it can be installed into the storage section 508 as needed.

When the series of processing described above is implemented by software, the program constituting the software can be installed from a network such as the Internet or from a storage medium such as the removable medium 511.

Those skilled in the art will understand that this storage medium is not limited to the removable medium 511 shown in Fig. 5, which stores the program and is distributed separately from the equipment in order to provide the program to the user. Examples of the removable medium 511 include a magnetic disk (including a floppy disk), an optical disc (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disc (including a MiniDisc (MD) (registered trademark)) and a semiconductor memory. Alternatively, the storage medium may be the ROM 502, a hard disk included in the storage section 508, or the like, in which the program is stored and which is distributed to the user together with the equipment containing it.

In addition, the invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the method for processing language data according to an embodiment of the invention can be performed. Accordingly, the various storage media that carry such a program product, such as magnetic disks, optical discs, magneto-optical discs and semiconductor memories, are also included in the disclosure of the invention.

In the above description of specific embodiments of the invention, a feature described and/or illustrated for one embodiment can be used in the same or a similar way in one or more other embodiments, combined with features of other embodiments, or substituted for features of other embodiments.

It should be emphasized that the term "comprises/includes", when used herein, refers to the presence of a feature, element, step or component, but does not exclude the presence or addition of one or more other features, elements, steps or components. Ordinal terms such as "first" and "second" do not indicate an order of implementation or a degree of importance of the features, elements, steps or components they qualify; they are used only to distinguish between these features, elements, steps or components for clarity of description.

In addition, the methods of the embodiments of the invention need not be performed in the chronological order described in the specification or shown in the drawings; they may also be performed in another chronological order, in parallel, or independently. The order of execution of the methods described in this specification therefore does not limit the technical scope of the invention.

In addition, each operation of the above method according to the invention can obviously also be implemented as a computer-executable program stored in various machine-readable storage media.

Moreover, the object of the invention can also be achieved as follows: a storage medium storing the above executable program code is supplied, directly or indirectly, to a system or equipment, and a computer or central processing unit (CPU) in that system or equipment reads and executes the program code.

In this case, as long as the system or equipment is capable of executing a program, embodiments of the invention are not limited to a particular kind of program, and the program may take any form, for example an object program, a program executed by an interpreter, or a script supplied to an operating system.

The machine-readable storage media mentioned above include, but are not limited to: various memories and storage units, semiconductor devices, disc units such as optical, magnetic and magneto-optical discs, and other media suitable for storing information.

In addition, the invention can also be implemented by a client computer that connects to a corresponding website on the Internet, downloads the computer program code according to the invention, installs it on the computer and then executes the program.

Finally, it should also be noted that relational terms used herein, such as left and right or first and second, serve only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or equipment comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or equipment. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or equipment that comprises the element.

In summary, the embodiments of the invention provide the following schemes:

Supplementary note 1. A device for processing language data, comprising: an identification module configured to obtain the lexical units in the language data and to identify, among the lexical units in the language data, at least two lexical units capable of satisfying at least two syntactic relations; a conversion module configured to convert, for each of the syntactic relations, the identified lexical units into a candidate sentence according to a corresponding conversion method, so that the candidate sentence has the feature syntactic structure corresponding to that syntactic relation; a scoring module configured to score each candidate sentence using a language model; and a determination module configured to determine that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence.

Supplementary note 2. The device for processing language data according to supplementary note 1, wherein the conversion module comprises:

a determination submodule configured to determine, for each of the syntactic relations, the feature syntactic structure corresponding to that syntactic relation, the order and tense of the sentence elements in the feature syntactic structure, and the feature words that characterize the feature syntactic structure together with their positions; and

a conversion submodule configured to, for each of the syntactic relations,

convert the order and tense of the identified lexical units to match the order and tense of the sentence elements in the feature syntactic structure corresponding to that syntactic relation, and

add, according to the determined feature words that characterize the feature syntactic structure corresponding to that syntactic relation and their positions, the corresponding feature words at the corresponding positions in the converted lexical units, to obtain the candidate sentence.

Supplementary note 3. The device for processing language data according to supplementary note 2, wherein one of the syntactic relations that the identified lexical units can satisfy is the verb-object relation, a first lexical unit among the identified lexical units being the verb lexical unit and a second lexical unit being the object lexical unit; and when the first lexical unit has a passive form, the conversion submodule converts, for the verb-object relation, the identified lexical units into a candidate sentence with the following structure: "second lexical unit" + "passive form of the first lexical unit".

Supplementary note 4. The device for processing language data according to supplementary note 3, wherein, when the first lexical unit has no passive form, the conversion submodule converts, for the verb-object relation, the identified lexical units into a candidate sentence with the following structure: "first lexical unit" + "head of the second lexical unit"; "second lexical unit".

Supplementary note 5. The device for processing language data according to any one of supplementary notes 2 to 4, wherein one of the syntactic relations that the identified lexical units can satisfy is the attributive-head relation, a third lexical unit among the identified lexical units being the attributive lexical unit and a fourth lexical unit being the head lexical unit; and the conversion submodule converts, for the attributive-head relation, the identified lexical units into a candidate sentence with the following structure: "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit"; wherein the measure word expresses the unit of quantity of the fourth lexical unit.

Supplementary note 6. The device for processing language data according to any one of supplementary notes 2 to 4, wherein one of the syntactic relations that the identified lexical units can satisfy is the attributive-head relation, a third lexical unit among the identified lexical units being the attributive lexical unit and a fourth lexical unit being the head lexical unit; and the conversion submodule converts, for the attributive-head relation, the identified lexical units into a candidate sentence with the following structure: "third lexical unit"; "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit"; wherein the measure word expresses the unit of quantity of the fourth lexical unit.

Supplementary note 7. Equipment comprising the device for processing language data according to any one of supplementary notes 1 to 6.

Supplementary note 8. The equipment according to supplementary note 7, wherein the equipment is configured to:

use the device for processing language data to identify the syntactic relation of a key phrase or sentence to be retrieved, so that the key phrase or sentence is retrieved on the basis of the identified syntactic relation; or

use the device for processing language data to identify the syntactic relation of a sentence to be machine-translated, so that the sentence is machine-translated on the basis of the identified syntactic relation.

Supplementary note 9. The equipment according to supplementary note 7 or 8, wherein the equipment is one of the following: a computer, a mobile phone, a tablet computer, an e-book reader and a personal digital assistant.

Supplementary note 10. A computer-implemented method for processing language data, comprising: obtaining the lexical units in the language data, and identifying, among the lexical units in the language data, at least two lexical units capable of satisfying at least two syntactic relations; converting, for each of the syntactic relations, the identified lexical units into a candidate sentence according to a corresponding conversion method, so that the candidate sentence has the feature syntactic structure corresponding to that syntactic relation; scoring each candidate sentence using a language model; and determining that the identified lexical units have the syntactic relation corresponding to the highest-scoring candidate sentence.

Supplementary note 11. The method for processing language data according to supplementary note 10, wherein converting, for each of the syntactic relations, the identified lexical units into a candidate sentence according to a corresponding conversion method comprises:

for each of the syntactic relations,

determining the feature syntactic structure corresponding to that syntactic relation, the order and tense of the sentence elements in the feature syntactic structure, and the feature words that characterize the feature syntactic structure together with their positions,

converting the order and tense of the identified lexical units to match the order and tense of the sentence elements in the determined feature syntactic structure, and

adding, according to the determined feature words that characterize the feature syntactic structure and their positions, the corresponding feature words at the corresponding positions in the converted lexical units, to obtain the candidate sentence.

Supplementary note 12. The method for processing language data according to supplementary note 11, wherein one of the syntactic relations that the identified lexical units can satisfy is the verb-object relation, a first lexical unit among the identified lexical units being the verb lexical unit and a second lexical unit being the object lexical unit; and when the first lexical unit has a passive form, the identified lexical units are converted, for the verb-object relation, into a candidate sentence with the following structure: "second lexical unit" + "passive form of the first lexical unit".

Supplementary note 13. The method for processing language data according to supplementary note 12, wherein, when the first lexical unit has no passive form, the identified lexical units are converted, for the verb-object relation, into a candidate sentence with the following structure: "first lexical unit" + "head of the second lexical unit"; "second lexical unit".

Supplementary note 14. The method for processing language data according to any one of supplementary notes 11 to 13, wherein one of the syntactic relations that the identified lexical units can satisfy is the attributive-head relation, a third lexical unit among the identified lexical units being the attributive lexical unit and a fourth lexical unit being the head lexical unit; and the identified lexical units are converted, for the attributive-head relation, into a candidate sentence with the following structure: "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit"; wherein the measure word expresses the unit of quantity of the fourth lexical unit.

Supplementary note 15. The method for processing language data according to any one of supplementary notes 11 to 13, wherein one of the syntactic relations that the identified lexical units can satisfy is the attributive-head relation, a third lexical unit among the identified lexical units being the attributive lexical unit and a fourth lexical unit being the head lexical unit; and the identified lexical units are converted, for the attributive-head relation, into a candidate sentence with the following structure: "third lexical unit"; "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit"; wherein the measure word expresses the unit of quantity of the fourth lexical unit.

Supplementary note 16. The method for processing language data according to any one of supplementary notes 10 to 15, wherein the method is used to:

identify the syntactic relation of a key phrase or sentence to be retrieved, so that the key phrase or sentence is retrieved on the basis of the identified syntactic relation; or

identify the syntactic relation of a sentence to be machine-translated, so that the sentence is machine-translated on the basis of the identified syntactic relation.

Claims

1. A device for processing language data, comprising:

an identification module configured to obtain the lexical units in the language data and to identify, among those lexical units, at least two lexical units that can satisfy at least two syntactic relations;

a conversion module configured to convert, for each of the syntactic relations, the identified lexical units into a candidate statement according to a corresponding conversion method, so that the candidate statement has a characteristic syntactic structure corresponding to that syntactic relation;

a scoring module configured to score each candidate statement using a language model; and

a determination module configured to determine that the identified lexical units have the syntactic relation corresponding to the candidate statement with the highest score.
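
The scoring and determination steps recited in claim 1 can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `determine_relation`, the relation labels used as dictionary keys, the example candidate strings, and above all the lookup table `TOY_SCORES`, which merely stands in for a real language model and whose values are invented.

```python
from typing import Callable, Dict


def determine_relation(candidates: Dict[str, str],
                       score: Callable[[str], float]) -> str:
    """Return the syntactic relation whose candidate statement scores highest.

    `candidates` maps a relation label to the candidate statement built for it;
    `score` is any language-model scoring function (higher means more fluent).
    """
    if not candidates:
        raise ValueError("at least one candidate statement is required")
    return max(candidates, key=lambda relation: score(candidates[relation]))


# Invented scores standing in for a language model (illustration only).
TOY_SCORES = {"彩电被进口": -9.5, "一台进口彩电": -6.0}


def toy_score(statement: str) -> float:
    # Unknown statements receive the lowest possible score.
    return TOY_SCORES.get(statement, float("-inf"))


candidates = {
    "verb-object": "彩电被进口",
    "attributive-head": "一台进口彩电",
}
best = determine_relation(candidates, toy_score)
```

With these invented scores, `best` is `"attributive-head"`; a different language model could of course rank the two candidates the other way.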

2. The device for processing language data according to claim 1, wherein the conversion module comprises:

a conversion submodule configured to, for each of the syntactic relations,

3. The device for processing language data according to claim 2, wherein one of the syntactic relations that the identified lexical units can satisfy is a verb-object syntactic relation, the first lexical unit among the identified lexical units is a verb lexical unit, and the second lexical unit is an object lexical unit;

when the first lexical unit has a passive form, the conversion submodule converts, for the verb-object syntactic relation, the identified lexical units into a candidate statement with the following structure:

"second lexical unit" + "passive form of the first lexical unit".

4. The device for processing language data according to claim 3, wherein, when the first lexical unit does not have a passive form, the conversion submodule converts, for the verb-object syntactic relation, the identified lexical units into a candidate statement with the following structure:

"first lexical unit" + "head of the second lexical unit"; "second lexical unit".

5. The device for processing language data according to claim 2, wherein one of the syntactic relations that the identified lexical units can satisfy is an attributive-head syntactic relation, the third lexical unit among the identified lexical units is an attributive lexical unit, and the fourth lexical unit is a head-word lexical unit;

for the attributive-head syntactic relation, the conversion submodule converts the identified lexical units into a candidate statement with the following structure:

"numeral" + "measure word" + "third lexical unit" + "fourth lexical unit";

wherein the measure word denotes the unit of quantity of the fourth lexical unit.
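
The two conversion structures recited above, "second lexical unit" + "passive form of the first lexical unit" and "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit", can be sketched in Python as plain string templates. The function names, the two small lexicons, the default numeral and the example words are all hypothetical and chosen only for illustration; a real system would need a proper lexicon of passive forms and measure words. The alternative structures that contain a semicolon are deliberately not covered here.

```python
from typing import Dict, Optional

# Hypothetical lexicon entries, for illustration only.
PASSIVE_FORMS: Dict[str, str] = {"进口": "被进口"}   # verb -> passive form
MEASURE_WORDS: Dict[str, str] = {"彩电": "台"}       # head word -> measure word


def verb_object_candidate(verb: str, obj: str,
                          passive_forms: Dict[str, str] = PASSIVE_FORMS
                          ) -> Optional[str]:
    """Build "object" + "passive form of verb", or None if no passive form is known."""
    passive = passive_forms.get(verb)
    return None if passive is None else obj + passive


def attributive_head_candidate(attribute: str, head: str,
                               measure_words: Dict[str, str] = MEASURE_WORDS,
                               numeral: str = "一") -> Optional[str]:
    """Build "numeral" + "measure word" + "attribute" + "head", or None if no measure word is known."""
    measure_word = measure_words.get(head)
    return None if measure_word is None else numeral + measure_word + attribute + head


# One ambiguous pair of lexical units, one candidate statement per reading.
candidates = {
    "verb-object": verb_object_candidate("进口", "彩电"),
    "attributive-head": attributive_head_candidate("进口", "彩电"),
}
```

Returning `None` when a lexicon entry is missing leaves it to the caller to fall back on another conversion structure, which is a design choice of this sketch rather than something the claims specify.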

6. The device for processing language data according to claim 2, wherein one of the syntactic relations that the identified lexical units can satisfy is an attributive-head syntactic relation, the third lexical unit among the identified lexical units is an attributive lexical unit, and the fourth lexical unit is a head-word lexical unit,

"third lexical unit"; "numeral" + "measure word" + "third lexical unit" + "fourth lexical unit";

7. Equipment for processing language data, comprising the device for processing language data according to any one of claims 1-6.

8. The equipment for processing language data according to claim 7, wherein the equipment is configured to:

9. The equipment for processing language data according to claim 7, wherein the equipment is one of the following:

a computer, a mobile phone, a tablet computer, an e-book reader, and a personal digital assistant.

10. a method for computer implemented process language data, comprising:

Obtain the lexical unit in described language data, and in lexical unit in described language data, identify at least two lexical units that can meet at least two kinds of syntactic relations;

For syntactic relation described in each, according to corresponding conversion method, the lexical unit of described identification is converted to candidate's statement, to make described candidate's statement, there is the feature syntactic structure corresponding with this syntactic relation;

Language model is utilized to mark to candidate's statement described in each; And

The lexical unit of described identification is defined as there is the corresponding syntactic relation of the candidate statement the highest with score.