CN102214166B

CN102214166B - Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Info

Publication number: CN102214166B
Application number: CN 201010144623
Authority: CN
Inventors: 熊张亮; 何亮; 万磊
Original assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Current assignee: Samsung Electronics China R&D Center; Samsung Electronics Co Ltd
Priority date: 2010-04-06
Filing date: 2010-04-06
Publication date: 2013-02-20
Anticipated expiration: 2030-04-06
Also published as: CN102214166A

Abstract

The invention discloses a machine translation system and a machine translation method based on a syntactic analysis and hierarchical model. The machine translation system comprises a word alignment module, a phrase extraction module, a part-of-speech and syntax tagging module, a syntax-based non-contiguous phrase extraction module, a non-contiguous-phrase-based translation module, and a grading output module. In the machine translation system and the machine translation method, syntactic analysis is carried out on the basis of a general contiguous-phrase-based machine translation model, so that a syntax-based phrase rule base is extracted from a bilingual sentence alignment text, the problem of non-continuous fixed collocation of the context of the whole sentence is solved, and the invention accords with the syntactic characteristics of a language. Translation is carried out based on a non-contiguous phrase rule base and a phrase alignment table, and a translation result is graded based on an assessment model, so a translation effect is effectively improved.

Description

Machine translation system and method based on syntactic analysis and hierarchical model

Technical field

The present invention relates to machine translation and, more specifically, to a machine translation system and method based on syntactic analysis and a hierarchical model.

Background art

Machine translation is the automatic translation of one natural language into another. There are many types of machine translation systems; the most popular at present is the machine translation system based on continuous phrases (PBMT). The problem machine translation sets out to solve is using a computer to automatically translate a sentence or fragment in a source language (SL) into the corresponding sentence or fragment in a target language (TL). Corpus-based machine translation relies on a bilingual aligned corpus, in which each source-language sentence has one or more corresponding target-language translations, and the computer obtains all the data and knowledge needed for automatic translation from that corpus.

A PBMT system takes the phrase as the basic unit of translation. During translation, the system does not translate each word in isolation but translates several consecutive words together. Because the translation granularity is larger, phrase-based methods handle local context dependencies easily and translate idioms and collocations well. In general, a phrase in a phrase-based method can be any continuous string of words, with no syntactic restriction, so bilingual phrases can easily be extracted automatically from a word-aligned bilingual corpus and used to translate a given source-language sentence. A phrase-based method requires the system to be trained. For training, a bilingual corpus, that is, a set of sentences that are translations of each other, is first input. The word alignment result shows which words in a sentence pair translate each other. Phrase extraction is then performed: all continuous word strings in the corpus that translate each other are extracted, regardless of whether a string carries real meaning.

PBMT has the following defects: (1) because it relies on local context, PBMT cannot handle long sentences or phrases well, in particular the long-distance reordering problem caused by non-continuous fixed collocations; (2) because PBMT relies entirely on continuous-phrase statistics, it ignores the syntactic features of the language and fails to make full use of the knowledge contained in the corpus, which limits further improvement of its translation quality.

Summary of the invention

In view of the above shortcomings, the object of the present invention is to provide a machine translation system and method based on syntactic analysis and a hierarchical model.

According to one aspect of the present invention, a machine translation system based on syntactic analysis and a hierarchical model is provided. The machine translation system may comprise: a word alignment module, which receives bilingual sentence-aligned text from outside and obtains word alignment information from the received text; a phrase extraction module, which receives the word alignment information from the word alignment module and uses it to extract phrases, thereby obtaining a phrase alignment table; a part-of-speech and syntax tagging module, which receives a tagged corpus and the bilingual sentence-aligned text from outside, extracts useful linguistic knowledge and its probability distribution information from the tagged corpus, and uses them to tag both languages or one language of the bilingual sentence-aligned text with parts of speech and syntax, producing a syntax-tagged corpus; a syntax-based non-continuous phrase extraction module, which receives the syntax-tagged corpus from the part-of-speech and syntax tagging module and, based on the syntax-tagged corpus, performs syntax-based non-continuous phrase extraction according to the alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, to produce a syntax-based non-continuous phrase rule base; a translation module based on non-continuous phrases, which receives the syntax-based non-continuous phrase rule base from the non-continuous phrase extraction module, retrieves from it all possible phrases, translations and their probabilities for the sentence to be translated, and outputs translation results; and a scoring output module, which receives an assessment model from outside, scores the translation results based on the assessment model, and outputs the translation result with the highest score.

The machine translation system may further comprise: a translation module based on continuous phrases, which receives the phrase alignment table from the phrase extraction module, retrieves from the phrase alignment table all possible phrases, translations and their probabilities for the sentence to be translated, and outputs translation results to the scoring output module.

The syntax-based non-continuous phrase extraction module may comprise: a non-continuous phrase extraction module, which, according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, replaces bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base; and a syntax filtering module, which filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module based on the syntax-tagged corpus, to produce the syntax-based non-continuous phrase rule base.

The probability distribution information may comprise the probability that a particular word belongs to a particular part of speech, the probability that a particular phrase belongs to a particular phrase class, and context probabilities.

The phrase alignment table may comprise source-language phrases, target-language phrases and probability values.

According to another aspect of the present invention, a machine translation method based on syntactic analysis and a hierarchical model is provided. The machine translation method comprises the following steps: receiving bilingual sentence-aligned text and obtaining word alignment information from the received text; extracting phrases using the word alignment information to obtain a phrase alignment table; receiving a tagged corpus and the bilingual sentence-aligned text, extracting useful linguistic knowledge and its probability distribution information from the tagged corpus, and using them to tag both languages or one language of the bilingual sentence-aligned text with parts of speech and syntax, producing a syntax-tagged corpus; performing, based on the syntax-tagged corpus, syntax-based non-continuous phrase extraction according to the alignment information or the phrase alignment table, to produce a syntax-based non-continuous phrase rule base; retrieving from the syntax-based non-continuous phrase rule base all possible phrases, translations and their probabilities for the sentence to be translated; and receiving an assessment model, scoring the translation results based on the assessment model, and outputting the translation result with the highest score.

The machine translation method may further comprise the following step: retrieving from the phrase alignment table all possible phrases, translations and their probabilities for the sentence to be translated.

The step of producing the syntax-based non-continuous phrase rule base may comprise the following steps: replacing, according to the word alignment information or the phrase alignment table, bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base; and filtering the non-continuous phrase rule base based on the syntax-tagged corpus to produce the syntax-based non-continuous phrase rule base.

The machine translation system and method according to the present invention perform syntactic analysis on top of a general machine translation model based on continuous phrases, thereby extracting a syntax-based non-continuous phrase rule base from bilingual sentence-aligned text, solving the problem of non-continuous fixed collocations in whole-sentence context, and conforming to the syntactic features of the language. Translation is performed based on the non-continuous phrase rule base and the phrase alignment table, and the translation results are scored based on an assessment model, which effectively improves translation quality.

Description of drawings

The above and other features and aspects of the present invention will become clearer from the following detailed description of exemplary embodiments with reference to the accompanying drawings, in which:

Fig. 1 is a block diagram illustrating a machine translation system based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention;

Fig. 2 is a diagram illustrating the construction of a syntax-tagged corpus;

Fig. 3 is a diagram illustrating the syntax-based non-continuous phrase extraction module shown in Fig. 1 according to an exemplary embodiment of the present invention;

Fig. 4 is a diagram illustrating an example of the operation of the non-continuous phrase extraction module in Fig. 3;

Fig. 5 is a diagram illustrating an example of syntactic analysis and filtering of the non-continuous phrase rule base for a single sentence;

Fig. 6A and Fig. 6B are diagrams illustrating machine translation according to an exemplary embodiment of the present invention and according to the conventional art, respectively;

Fig. 7 is a flowchart illustrating a machine translation method based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.

Embodiment

Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.

As shown in Fig. 1, the machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention comprises: a word alignment module 101, a phrase extraction module 102, a translation module 103 based on continuous phrases, a part-of-speech and syntax tagging module 201, a syntax-based non-continuous phrase extraction module 202, a translation module 301 based on non-continuous phrases, and a scoring output module 302.

The word alignment module 101, the phrase extraction module 102 and the translation module 103 based on continuous phrases are modules used in conventional translation systems based on continuous phrases. Together with the part-of-speech and syntax tagging module 201 and the syntax-based non-continuous phrase extraction module 202 according to an exemplary embodiment of the present invention, these modules form the preprocessing part of the machine translation system based on syntactic analysis and a hierarchical phrase model. The translation module 103 based on continuous phrases, together with the translation module 301 based on non-continuous phrases and the scoring output module 302 according to an exemplary embodiment of the present invention, may form the translation engine of the system.

Referring to Fig. 1, bilingual sentence-aligned text is input to the word alignment module 101. The word alignment module 101 uses a tool (for example, GIZA++) to obtain word alignment information from the input bilingual aligned text, and passes this word alignment information to the phrase extraction module 102.

The phrase extraction module 102 receives the word alignment information from the word alignment module 101 and uses it to extract phrases, thereby obtaining a phrase alignment table (also called a continuous phrase library), and sends the obtained phrase alignment table to the translation module 103 based on continuous phrases and to the syntax-based non-continuous phrase extraction module 202. The phrase alignment table comprises the following three parts: (1) the source-language phrase; (2) the target-language phrase; (3) a probability value.
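The three-part structure of the phrase alignment table can be pictured with a small sketch. The entries, the romanized source tokens, the probabilities and the helper name `lookup_phrases` are all hypothetical and chosen only for illustration; the patent does not prescribe a storage format.

```python
from collections import defaultdict
from typing import NamedTuple


class PhraseEntry(NamedTuple):
    source: str         # (1) source-language phrase
    target: str         # (2) target-language phrase
    probability: float  # (3) probability value


# Hypothetical entries; a real table is learned from a word-aligned corpus.
ENTRIES = [
    PhraseEntry("zhongguo", "China", 0.9),
    PhraseEntry("shanghai", "Shanghai", 0.95),
    PhraseEntry("zhongguo de shanghai", "Chinese Shanghai", 0.4),
]

# Index the table by source phrase so it can be searched at translation time.
phrase_table = defaultdict(list)
for entry in ENTRIES:
    phrase_table[entry.source].append(entry)


def lookup_phrases(tokens, max_len=3):
    """Return (start, end, entry) for every continuous span found in the table.

    Spans are half-open, i.e. they cover tokens[start:end].
    """
    matches = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            for entry in phrase_table.get(" ".join(tokens[start:end]), []):
                matches.append((start, end, entry))
    return matches


matches = lookup_phrases(["zhongguo", "de", "shanghai"])
```

Here `matches` holds every continuous span of the input that has an entry in the table, which is the kind of retrieval later attributed to the translation module 103.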

In computer processing of natural language, rule-based parsing mainly uses Chomsky's context-free grammar, but it is of little help in handling the ambiguity of natural language.

In recent years, improvements to context-free grammar have come in two main forms. On the one hand, probabilities have been added to the rules of the context-free grammar, giving the probabilistic context-free grammar (PCFG). On the other hand, in addition to adding probabilities to the rules, the influence of a rule's head word on the rule probability has been taken into account, giving the lexicalized probabilistic context-free grammar.

These studies skilfully combine rule-based rationalist methods with statistics-based empiricist methods and have achieved good results, providing a powerful means of resolving syntactic ambiguity. A probabilistic grammar assigns a probability to a sentence or string of words, thereby capturing finer syntactic information than an ordinary context-free grammar. A probabilistic context-free grammar is still a context-free grammar, in which each rule is annotated with the probability of choosing that rule. The context-free rules are assumed to be conditionally independent of one another, so the probability of an analysis of a sentence is computed as the product of the probabilities of the rules used in that analysis.

The concrete operation by which the part-of-speech and syntax tagging module 201 constructs a syntax-tagged corpus (a corpus of this kind is also called a treebank) is described below with reference to Fig. 2, taking PCFG as an example.

First, a corpus is annotated (automatically or manually) to form a corpus with annotation information at different levels, such as the Penn Treebank, which is annotated with part-of-speech and syntax tree information; its main tag set is shown in Fig. 2(a). The tagged corpus is input to the part-of-speech and syntax tagging module 201.

The part-of-speech and syntax tagging module 201 uses statistical tools to extract useful linguistic knowledge and its probability distribution information from the tagged corpus, that is, it uses a supervised training method. The main probability distribution information comprises the probability that a given word belongs to a given part of speech, the probability that a given phrase belongs to a given phrase class, and context probabilities.

Using the extracted linguistic knowledge and its probability distribution information, the part-of-speech and syntax tagging module 201 tags both languages or one language of the bilingual sentence-aligned text with parts of speech and syntax, produces the syntax-tagged corpus, and sends it to the syntax-based non-continuous phrase extraction module 202. A sentence may have several possible annotation results; the one with the highest probability is chosen as the output. Taking Fig. 2(a) and (b) as an example, the probability of Fig. 2(a) is P1 = 0.2 × 0.2 × 0.2 × 0.4 × 0.45 × 1.0 × 1.0 × 0.4 × 0.05 = 2.88 × 10^-5, and the probability of Fig. 2(b) is P2 = 0.8 × 0.2 × 0.05 × 0.4 × 0.4 × 0.3 × 0.4 × 0.4 × 0.4 × 0.05 = 1.2288 × 10^-6. Therefore, the annotation result of Fig. 2(a) is selected.
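The selection between the two annotation results can be checked with a few lines of code. This is only a sketch of the product-of-rule-probabilities computation: the factor lists are copied from the text, while the function and variable names are illustrative assumptions.

```python
import math

# Rule probabilities of the two candidate analyses, as listed in the text.
parse_a = [0.2, 0.2, 0.2, 0.4, 0.45, 1.0, 1.0, 0.4, 0.05]
parse_b = [0.8, 0.2, 0.05, 0.4, 0.4, 0.3, 0.4, 0.4, 0.4, 0.05]


def parse_probability(rule_probs):
    """Under the PCFG independence assumption, multiply all rule probabilities."""
    return math.prod(rule_probs)


p1 = parse_probability(parse_a)
p2 = parse_probability(parse_b)

# Keep the analysis with the higher probability.
best = "a" if p1 > p2 else "b"
```

With these inputs `p1` is about 2.88e-05 and `p2` about 1.2288e-06, so analysis (a) is kept. A practical parser would usually sum log-probabilities instead, to avoid numerical underflow on long sentences.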

Fig. 2(c) and (d) show part of the syntax tag set and an annotated Chinese sentence, respectively.

The syntax-based non-continuous phrase extraction module 202 receives the syntax-tagged corpus from the part-of-speech and syntax tagging module 201 and, based on the syntax-tagged corpus, performs syntax-based non-continuous phrase extraction according to the alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, to obtain the syntax-based non-continuous phrase rule base.

How the syntax-based non-continuous phrase extraction module 202 produces the syntax-based non-continuous phrase rule base is described in detail below with reference to Figs. 3 to 5.

Figs. 3 to 5 show the specific structure and operation of the syntax-based non-continuous phrase extraction module 202 according to an exemplary embodiment of the present invention.

As shown in Fig. 3, the syntax-based non-continuous phrase extraction module 202 comprises a non-continuous phrase extraction module 212 and a syntax filtering module 222.

How the non-continuous phrase extraction module 212 constructs the non-continuous phrase rule base is described in detail below with reference to Fig. 4.

According to the word alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, the non-continuous phrase extraction module 212 replaces bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols such as [X] and [Y], obtaining the non-continuous phrase rule base.
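The replacement of aligned continuous phrases by nonterminal symbols can be sketched as follows. The function name `make_rule`, the romanized source tokens and the span layout are assumptions made for this illustration; the patent does not specify how module 212 is implemented.

```python
def make_rule(src_tokens, tgt_tokens, sub_phrases):
    """Replace aligned continuous sub-phrases with nonterminal symbols.

    sub_phrases is a list of (symbol, (src_start, src_end), (tgt_start, tgt_end))
    with half-open, non-overlapping token spans that are aligned to each other.
    """
    def substitute(tokens, spans):
        out = list(tokens)
        # Work from right to left so that earlier indices stay valid.
        for symbol, (start, end) in sorted(spans, key=lambda s: s[1][0], reverse=True):
            out[start:end] = [symbol]
        return " ".join(out)

    src = substitute(src_tokens, [(sym, s) for sym, s, _ in sub_phrases])
    tgt = substitute(tgt_tokens, [(sym, t) for sym, _, t in sub_phrases])
    return src, tgt


# Hypothetical sentence pair: source token 0 aligns with target token 2,
# and source token 2 aligns with target token 0.
rule = make_rule(
    ["zhongguo", "de", "shanghai"],
    ["Shanghai", "of", "China"],
    [("[X]", (0, 1), (2, 3)), ("[Y]", (2, 3), (0, 1))],
)
```

The resulting pair is `("[X] de [Y]", "[Y] of [X]")`: the words that remain are no longer adjacent to a fixed neighbour, and the swapped order of `[X]` and `[Y]` is recorded inside the rule itself.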

Fig. 4 shows an example of non-continuous phrase rule extraction. The rule in this example is: [Y] of band [X] ||| [Y] with [X] ||| 0.1 0.3 0.6, where 0.1 is the translation probability from the source language to the target language, 0.3 is the word translation probability from the target language to the source language, and 0.6 is the word translation probability from the source language to the target language.
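A rule written in this form can be read into a small record. The sketch below assumes that the three fields are separated by `|||` and the probability values by whitespace; the `Rule` type and `parse_rule` name are hypothetical.

```python
from typing import NamedTuple, Tuple


class Rule(NamedTuple):
    source: str                # source side, with nonterminals
    target: str                # target side, with nonterminals
    scores: Tuple[float, ...]  # probability values attached to the rule


def parse_rule(line):
    """Split 'source ||| target ||| p1 p2 p3' into a Rule."""
    source, target, scores = (field.strip() for field in line.split("|||"))
    return Rule(source, target, tuple(float(s) for s in scores.split()))


rule = parse_rule("[Y] of band [X] ||| [Y] with [X] ||| 0.1 0.3 0.6")
```

After parsing, `rule.scores` is `(0.1, 0.3, 0.6)`, in the order in which the text explains the three probabilities.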

The basic idea of syntax filtering of the non-continuous phrase rule base is to ensure that the phrase part extracted from a sentence is a relatively independent constituent phrase, such as a noun phrase (NP) or a quantifier phrase (QP), so as to guarantee later translation quality.

The syntax filtering module 222 filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module 212 based on the syntax-tagged corpus, to produce the syntax-based non-continuous phrase rule base.

How the syntax filtering module 222 performs syntax filtering is described below with reference to Fig. 5.

Fig. 5 shows an example of syntactic analysis and filtering of the non-continuous phrase rule base for a single sentence.

As shown in Fig. 5, the single input sentence is first syntax-tagged.

For the tagged sentence, consider cutting out a non-pronominal noun phrase (NP-NN) and replacing it with [X]; here the phrase is "subway line map", which produces the first non-continuous phrase rule retained in Fig. 5.

Consider next the case of a quantifier phrase (QP), specifically a phrase labelled QP that has two child nodes, CD and CLP, such as (QP (CD two) (CLP (M sheets))). CD, here "two", is replaced with [X], which produces the second non-continuous phrase rule retained in Fig. 5.

The rule filtered out in Fig. 5 because it does not conform to the syntax rules is "is [X] to my subway line map?".
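One way to picture this filtering is to keep a rule only when the span replaced by a nonterminal coincides with a constituent of an accepted type in the tagged sentence. The sketch below is a simplified assumption, not the patent's actual procedure: the bracketed example tree, the set of accepted labels and all function names are hypothetical.

```python
def parse_tree(text):
    """Parse a bracketed tree such as '(QP (CD two) (CLP (M sheets)))'.

    Phrases become (label, [children]); leaves become (label, word).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 1                  # skip "("
        label = tokens[pos]
        pos += 1
        if tokens[pos] != "(":    # leaf of the form (TAG word)
            word = tokens[pos]
            pos += 2              # skip the word and ")"
            return (label, word)
        children = []
        while tokens[pos] == "(":
            children.append(node())
        pos += 1                  # skip ")"
        return (label, children)

    return node()


def constituent_spans(tree, start=0, spans=None):
    """Collect (label, start, end) half-open word spans for every node."""
    if spans is None:
        spans = []
    label, body = tree
    if isinstance(body, str):
        spans.append((label, start, start + 1))
        return start + 1, spans
    end = start
    for child in body:
        end, _ = constituent_spans(child, end, spans)
    spans.append((label, start, end))
    return end, spans


def keep_rule(tree, gap_span, allowed=("NP", "QP", "CD")):
    """Keep a rule only if its nonterminal covers an accepted constituent."""
    _, spans = constituent_spans(tree)
    return any(label in allowed and (s, e) == gap_span for label, s, e in spans)


# Hypothetical tagged sentence: words are give(0) me(1) two(2) sheets(3).
tree = parse_tree("(VP (VV give) (NP (PN me)) (QP (CD two) (CLP (M sheets))))")
```

With this tree, `keep_rule(tree, (2, 3))` accepts a gap over the CD node "two", while `keep_rule(tree, (0, 2))` rejects a gap over "give me", which is not a constituent.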

The preprocessing part of the machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention has been described in detail above with reference to the drawings. The translation engine of the system is described below with reference to Fig. 1 and Fig. 6.

The machine translation system based on syntactic analysis and a hierarchical model according to the present invention uses a translation model, a language model, a reordering model and a decoder.

The main differences between machine translation based on syntactic analysis and a hierarchical model according to the present invention and conventional machine translation based on continuous phrases are the extension of the translation model and the relative weakening of the reordering model.

The translation model provides the translation correspondences between source-language and target-language phrases and expresses the strength of each correspondence as a probability value; the higher the value, the more accurate the correspondence. It is used to supply possible target-language translations for a source-language sentence. The translation model based on hierarchical phrases extends the translation correspondences from continuous phrases alone to continuous phrases plus syntax-based non-continuous phrases.

The language model stores a large number of probability values that give the probabilistic relationship between each word and its neighbouring words or phrases. Its role is to judge how well a phrase St conforms to the grammar and usage of the target language, and it is used to select among translation results. This degree is generally measured by a probability value PLM(St); the higher the value of PLM(St), the better the phrase conforms to the target language.
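A toy trigram model illustrates how such a value PLM(St) might be computed. The two training sentences, the add-one smoothing and the function names are assumptions made for this sketch; the patent does not state how its language model is estimated.

```python
import math
from collections import Counter


def train_trigram_lm(sentences):
    """Count trigrams and their two-word contexts in a tiny corpus."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for i in range(2, len(words)):
            trigrams[tuple(words[i - 2:i + 1])] += 1
            contexts[tuple(words[i - 2:i])] += 1
    return trigrams, contexts, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed trigram model."""
    trigrams, contexts, vocab_size = model
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for i in range(2, len(words)):
        context = tuple(words[i - 2:i])
        numerator = trigrams[context + (words[i],)] + 1  # add-one smoothing
        denominator = contexts[context] + vocab_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical training data, for illustration only.
model = train_trigram_lm([
    "please tell me the terms of payment",
    "the terms of payment are fixed",
])
fluent = log_prob("tell me the terms of payment", model)
awkward = log_prob("tell me the pay terms", model)
```

On this toy corpus `fluent` is larger than `awkward`, so the model prefers the wording it has seen in its training data. Real systems train on far larger corpora and use more careful smoothing.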

The reordering model adjusts the order of words or phrases in the translated target-language result. Because syntax-based non-continuous phrases are available, part of the function of the reordering model is taken over by them, and its weight can be lowered accordingly.

The role of the translation engine is to coordinate the above models to translate source-language sentences.

Referring to Fig. 1, the translation module 103 based on continuous phrases retrieves, for the word-segmented sentence to be translated, all possible phrases, translations and their probabilities from the phrase alignment table output by the phrase extraction module 102.

The translation module 301 based on non-continuous phrases receives the syntax-based non-continuous phrase rule base from the syntax-based non-continuous phrase extraction module 202 and, for the word-segmented sentence to be translated, retrieves all possible phrases, translations and their probabilities from that rule base.

Fig. 6A is a diagram illustrating the translation of Chinese into English using the hierarchical phrase model based on syntactic analysis according to an exemplary embodiment of the present invention.

Labels (1) to (5) in Fig. 6A correspond one-to-one to operations (1) to (5) below.

(1) A Chinese sentence to be translated is input;

(2) According to the translation model, the translation module 103 based on continuous phrases searches the phrase alignment table for all possible phrases, translations and their probabilities;

(3) According to the translation model, the translation module 301 based on non-continuous phrases searches the non-continuous phrase rule base for all possible non-continuous phrases, translations and their probabilities;

(4) The decoder computes the overall probability of each possible translation result from the translation probabilities of the phrase pairs and non-continuous phrases, the trigram language model probability, and so on;

(5) The decoder selects the N sentences with the best overall probability as the N-best candidate target-language sentences.

In Fig. 6 A, (4)-(5) expression gathers the calculating general probability, thereby selects N candidate's sentence.In addition, in Fig. 6 A, | the scope that 3,6| represents be [3,6), namely comprise 3, but do not comprise 6, scope is before 6.

Fig. 6 B is accordingly according to the English diagram of conventional art translator of Chinese is become with Fig. 6 A.

Compare with Fig. 6 A according to the present invention, the key distinction is, only utilizes continuous phrase to translate in the conventional art translation process, and do not utilize syntactic analysis to cross the level phrase of filtration, X-＞([Y] of [X], [Y] of[X]) for example, carry out probability calculation, generate translation result.For example, in the application's method, " Shanghai of China " is translated into " Shanghai of China ", and the result who translates according to conventional art is " Chinese Shanghai ", so translation result according to the present invention is significantly better than the translation result according to conventional art.

The below will describe scoring output module 302 and based on assessment models translation result be marked.

The translation output that is input to scoring output module 302 is N candidate target language sentence, and N is more than or equal to 1.

Scoring output module 302 is also marked to N candidate target language sentence of input based on the assessment models of input.

Assessment models can comprehensive a plurality of translation features, such as the part of speech series model feature of language model feature, sentence, the sentence length of target language etc., come this N candidate target language sentence resequenced, choose the translation of global optimum and export as translation result.

Consider simplicity and the treatment effeciency of realization, language model with target language in exemplary embodiment of the present invention is described as assessment models, its effect is to judge that a sentence St meets the degree of target language syntax and custom, thereby translation result is selected.Generally weigh described degree with probable value PLM (St), the higher expression sentence of PLM (St) value more meets target language.

Consider the otherness for the treatment of effeciency and candidate's target language sentence, N=2 in current exemplary embodiment of the present invention, i.e. output sentence and the output sentence based on syntactic analysis and hierarchical model of only translating based on continuous phrase.

The scoring output module 302 performs scoring according to the following basic procedure:

1. Receive the N=2 candidate target-language sentences: the output sentence translated using continuous phrases only, and the output sentence based on syntactic analysis and the hierarchical model;

2. Use the target-language language model to compute a probability value for each candidate translation;

3. Select the output with the best score.

An example of scoring by the scoring output module 302 is described below.

The source language is Chinese and the target language is English. The input source sentence means: "could you tell me the payment terms".

The translation results (N=2) are:

1) Would you please tell me the pay terms. (translation result based on continuous phrases)

2) Would you please tell me the terms of payment. (translation result based on syntactic analysis and the hierarchical model)

The English language model scores these two results. Because the common English expression for "payment terms" is "terms of payment", and "Would you please tell me the terms of payment." better conforms to English grammar and usage, the language model gives this result a higher score:

1) Score for result 1: 0.7

2) Score for result 2: 0.9

The result with the highest score is selected as the final result: Would you please tell me the terms of payment.
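The selection in this example reduces to taking the maximum over the scored candidates. In the sketch below the two scores are the ones quoted in the example, while the `select_best` helper and the idea of passing the assessment model as a scoring function are illustrative assumptions.

```python
def select_best(candidates, score):
    """Return the candidate sentence with the highest assessment-model score."""
    return max(candidates, key=score)


# Scores quoted in the example; a real system would compute them with a model.
lm_scores = {
    "Would you please tell me the pay terms.": 0.7,
    "Would you please tell me the terms of payment.": 0.9,
}
final = select_best(list(lm_scores), lm_scores.get)
```

Because the scoring function is passed in as a parameter, the same selection step would work unchanged if the language model were replaced by a different assessment model.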

The below describes according to an exemplary embodiment of the present invention machine translation method based on syntactic analysis and hierarchical model with reference to Fig. 7.

Fig. 7 is the process flow diagram that illustrates according to an exemplary embodiment of the present invention based on the machine translation method of syntactic analysis and hierarchical model.

As shown in Figure 7, at step S701 and S702, input respectively tagged corpus and bilingual sentence aligning texts.

At step S703, carry out part of speech and syntax mark.At first utilize statistical tool from the tagged corpus of input, to extract useful linguistry and probability distribution information thereof, then, the linguistry that utilization extracts and probability distribution information thereof, bilingual or single language in the bilingual sentence aligning texts of input is carried out part of speech and syntax mark, finally produce syntax tagged corpus (or being called syntax mark treebank).

At step S704, utilize the GIZA++ instrument to obtain word alignment information from the bilingual sentence aligning texts of input.

At step S705, utilize the word alignment information extraction phrase that obtains at step S704, thereby obtain the phrase alignment table, described phrase alignment table comprises following three parts: (1) source language phrase; (2) target language phrase; (3) probable value.

At step S706, carry out non-continuous phrase based on the syntax tagged corpus that in step S703, obtains according to the alignment information that in step S704, produces or the phrase alignment table that in step S705, obtains and extract, to obtain the non-continuous phrase rule base based on syntax.

At length say, at first, based on the alignment information that in step S704, obtains or the phrase alignment table that in step S705, obtains, the continuous phrase of bilingual alignment among every of bilingual sentence aligning texts is adopted [X], nonterminal symbols such as [Y] replaces, and obtains the non-continuous phrase rule base; Then, carry out syntax based on the syntax tagged corpus that in step S703, obtains and filter, to obtain the non-continuous phrase rule base based on syntax;

In step S707, according to the translation model, all possible phrases, non-continuous phrases, translations and their probabilities are searched for in the phrase alignment table and in the syntax-based non-continuous phrase rule base, and the N translations with the highest overall probability are output as candidate target-language sentences.
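A heavily simplified Python sketch of such a search is given below. It looks up the whole input in the phrase alignment table, applies rules containing a single `[X]` gap whose filler is found in the phrase alignment table, multiplies the probabilities involved, and keeps the N best candidates. The data layout, the function names, the one-gap restriction, and the product of probabilities as the overall score are assumptions for illustration; a real decoder would search over many segmentations and nested rules.

```python
import heapq


def candidates_for(sentence, phrase_table, rules):
    """sentence: tuple of tokens.
    phrase_table: {src_tuple: [(tgt_tuple, prob), ...]}
    rules: list of (src_pattern, tgt_pattern, prob), one "[X]" per pattern.
    Returns a list of (probability, translation string)."""
    found = []
    # direct match of the whole input in the phrase alignment table
    for tgt, p in phrase_table.get(sentence, []):
        found.append((p, " ".join(tgt)))
    # matches through non-continuous phrase rules
    for src_pat, tgt_pat, rule_p in rules:
        gap = src_pat.index("[X]")
        prefix, suffix = src_pat[:gap], src_pat[gap + 1:]
        end = len(sentence) - len(suffix)
        if end <= len(prefix):
            continue  # the gap must cover at least one word
        if sentence[:len(prefix)] != prefix or sentence[end:] != suffix:
            continue
        filler = sentence[len(prefix):end]
        for sub_tgt, sub_p in phrase_table.get(filler, []):
            out = []
            for tok in tgt_pat:
                out.extend(sub_tgt if tok == "[X]" else (tok,))
            found.append((rule_p * sub_p, " ".join(out)))
    return found


def n_best(sentence, phrase_table, rules, n):
    """Return the n candidates with the highest overall probability."""
    return heapq.nlargest(n, candidates_for(sentence, phrase_table, rules))
```

The list returned by `n_best` corresponds to the candidate target-language sentences passed on to the scoring step.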

In step S708, the candidate target-language sentences are scored based on the assessment model, and the globally optimal one is selected as the final output.
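As one possible instance of such an assessment model, the Python sketch below scores candidates with a bigram language model using add-one smoothing and length normalization, and returns the highest-scoring candidate. The class and function names, the smoothing scheme, and the choice of a bigram language model are assumptions for illustration; other assessment models could be substituted through the same `score` interface.

```python
import math
from collections import Counter


class BigramModel:
    """Toy bigram language model with add-one smoothing."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            toks = ["<s>"] + s.split() + ["</s>"]
            self.unigrams.update(toks[:-1])          # history counts
            self.bigrams.update(zip(toks, toks[1:]))
        self.vocab = len(set(self.unigrams) | {"</s>"})

    def score(self, sentence):
        """Average log-probability per bigram (higher is better)."""
        toks = ["<s>"] + sentence.split() + ["</s>"]
        logp = 0.0
        for a, b in zip(toks, toks[1:]):
            logp += math.log(
                (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.vocab)
            )
        return logp / (len(toks) - 1)


def select_best(candidates, model):
    """Return the candidate with the highest assessment score."""
    return max(candidates, key=model.score)
```

Because selection depends only on the `score` method, candidates produced by different translation models can be ranked together.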

The machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention have been described above with reference to the accompanying drawings. Those skilled in the art will understand that the invention is not limited to the above exemplary embodiments. For example, in order to obtain all possible translation results, Fig. 1 includes the translation module 103 based on continuous phrases, and step S707 of Fig. 7 includes searching the phrase alignment table for all possible phrases, non-continuous phrases, translations and their probabilities; however, it is also feasible to omit the translation module 103 based on continuous phrases from Fig. 1 and to omit the search of the phrase alignment table from step S707 of Fig. 7. In addition, in exemplary embodiments of the present invention, the assessment model is not limited to a language model.

A Korean-Chinese translation experiment was carried out on a prototype system based on this patent.

Test set composition: closed test (test sentences selected from the training set), 20%; open test (test sentences not in the training set), 80%.

Result of manual evaluation: compared with a conventional machine translation system based on continuous phrases, the number of Korean-Chinese sentences with clearly improved fluency increased by more than 10%, and the system reached a practical level, with 86.5% of translations rated good in manual evaluation.

On an embedded system equivalent to the hardware configuration of a current mainstream mobile phone, the average translation speed is 2 sentences per second, which achieves instant translation.

The following are a Korean-Chinese translation (Example 1) and a Chinese-Korean translation (Example 2).

Example 1 (Korean-Chinese translation)

Example 2 (Chinese-Korean translation)

Chinese: Please send my bag to my room.

Translation result of the model based on continuous phrases:

(translation error);

Translation result of the model of the present invention based on syntactic analysis and a hierarchical model:

(translation is correct).

Compared with prior-art machine translation systems and methods based on continuous phrases, the machine translation system and method based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention can clearly improve translation accuracy, particularly when the size of the corpus is limited.

The machine translation system and method based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention can be applied both to computer systems and to embedded systems.

The present invention introduces a hierarchical model and extracts an aligned non-continuous phrase rule base from a sentence-aligned bilingual corpus, thereby solving the problem of translating non-continuous fixed collocations that depend on the context of the whole sentence.

The present invention adds a part-of-speech and syntax tagging module and a syntax-based non-continuous phrase extraction module. The syntax-tagged tree of each sentence in the corpus is obtained by analysis (that is, each sentence is syntactically tagged), and the syntax-based non-continuous phrase rule base is obtained from these syntax-tagged trees, so that the rule base conforms to the syntactic features of the language. This improves translation quality and significantly reduces the size of the non-continuous phrase rule base, making it suitable for use in embedded systems.

The present invention scores and selects translation results based on an assessment model and outputs the highest-scoring translation result as the final result. The advantages of the individual translation models can thereby be combined effectively, the extensibility of the system is ensured, and translation quality is further improved.

Those skilled in the art will appreciate that various changes in form and detail may be made without departing from the spirit and scope of the present invention. Therefore, the foregoing exemplary embodiments are for illustrative purposes only and should not be construed as limiting the present invention. The scope of the present invention is defined by the claims.

Claims

1. A machine translation system based on syntactic analysis and a hierarchical model, comprising:

a word alignment module, which receives a bilingual sentence-aligned text from outside and obtains word alignment information from the received bilingual sentence-aligned text;

a phrase extraction module, which receives the word alignment information from the word alignment module and performs phrase extraction using the received word alignment information to obtain a phrase alignment table;

a part-of-speech and syntax tagging module, which receives a tagged corpus and the bilingual sentence-aligned text from outside, extracts from the tagged corpus linguistic knowledge and its probability distribution information for use with the bilingual sentence-aligned text, and uses the extracted linguistic knowledge and probability distribution information to perform part-of-speech and syntax tagging on both languages or a single language of the bilingual sentence-aligned text, producing a syntax-tagged corpus;

a syntax-based non-continuous phrase extraction module, which receives the syntax-tagged corpus from the part-of-speech and syntax tagging module and, based on the syntax-tagged corpus, performs syntax-based non-continuous phrase extraction according to the alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, to produce a syntax-based non-continuous phrase rule base;

a translation module based on non-continuous phrases, which receives the syntax-based non-continuous phrase rule base from the syntax-based non-continuous phrase extraction module, retrieves from the syntax-based non-continuous phrase rule base all possible phrases in a sentence to be translated together with their translations and translation probabilities, and outputs translation results;

a scoring output module, which receives an assessment model from outside, scores the translation results based on the assessment model, and outputs the highest-scoring translation result.

2. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the machine translation system further comprises: a translation module based on continuous phrases, which receives the phrase alignment table from the phrase extraction module, retrieves from the phrase alignment table all possible phrases in the sentence to be translated together with their translations and probabilities, and outputs translation results to the scoring output module.

3. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1 or 2, characterized in that the syntax-based non-continuous phrase extraction module comprises: a non-continuous phrase extraction module, which, according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, replaces the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base; and a syntax filtering module, which filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module based on the syntax-tagged corpus, to produce the syntax-based non-continuous phrase rule base.

4. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular word class, the probability that a particular phrase belongs to a particular phrase class, and context probabilities.

5. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the phrase alignment table comprises a source-language phrase, a target-language phrase and a probability value.

6. A machine translation method based on syntactic analysis and a hierarchical model, comprising the following steps:

receiving a bilingual sentence-aligned text, and obtaining word alignment information from the received bilingual sentence-aligned text;

performing phrase extraction using the word alignment information to obtain a phrase alignment table;

receiving a tagged corpus and the bilingual sentence-aligned text, extracting from the tagged corpus linguistic knowledge and its probability distribution information for use with the bilingual sentence-aligned text, and using the extracted linguistic knowledge and probability distribution information to perform part-of-speech and syntax tagging on both languages or a single language of the bilingual sentence-aligned text, producing a syntax-tagged corpus;

based on the syntax-tagged corpus, performing syntax-based non-continuous phrase extraction according to the alignment information or the phrase alignment table, to produce a syntax-based non-continuous phrase rule base;

retrieving from the syntax-based non-continuous phrase rule base all possible phrases in a sentence to be translated together with their translations and translation probabilities;

receiving an assessment model, scoring the translations based on the assessment model, and outputting the highest-scoring translation result.

7. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the machine translation method further comprises the following step: retrieving from the phrase alignment table all possible phrases in the sentence to be translated together with their translations and probabilities.

8. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6 or 7, characterized in that the step of producing the syntax-based non-continuous phrase rule base comprises the following steps:

replacing, according to the word alignment information or the phrase alignment table, the bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base;

filtering the non-continuous phrase rule base based on the syntax-tagged corpus, to produce the syntax-based non-continuous phrase rule base.

9. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular word class, the probability that a particular phrase belongs to a particular phrase class, and context probabilities.

10. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the phrase alignment table comprises a source-language phrase, a target-language phrase and a probability value.