CN102214166B - Machine translation system and machine translation method based on syntactic analysis and hierarchical model - Google Patents

Publication number
CN102214166B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 201010144623
Other languages
Chinese (zh)
Other versions
CN102214166A (en
Inventor
熊张亮
何亮
万磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics China R&D Center
Samsung Electronics Co Ltd
Application filed by Samsung Electronics China R&D Center, Samsung Electronics Co Ltd
Priority to CN 201010144623 (CN102214166B)
Priority to KR1020110018439A (KR101777421B1)
Priority to US13/079,283 (US8818790B2)
Publication of CN102214166A
Application granted
Publication of CN102214166B

Abstract

The invention discloses a machine translation system and a machine translation method based on syntactic analysis and a hierarchical model. The machine translation system comprises a word alignment module, a phrase extraction module, a part-of-speech and syntax tagging module, a syntax-based non-contiguous phrase extraction module, a non-contiguous-phrase-based translation module, and a grading output module. Syntactic analysis is carried out on top of a general contiguous-phrase-based machine translation model, so that a syntax-based phrase rule base is extracted from a bilingual sentence-aligned text; this addresses the problem of non-contiguous fixed collocations that span the context of a whole sentence and conforms to the syntactic characteristics of the language. Translation is carried out based on the non-contiguous phrase rule base and a phrase alignment table, and the translation results are graded based on an assessment model, so that translation quality is effectively improved.

Description

Machine translation system and method based on syntactic analysis and hierarchical model
Technical field
The present invention relates to machine translation and, more specifically, to a machine translation system and method based on syntactic analysis and a hierarchical model.
Background art
Machine translation is the automatic translation of one natural language into another. There are many types of machine translation system; at present the most popular is the phrase-based machine translation (PBMT) system, which works on continuous phrases. The problem that machine translation must solve is using a computer to automatically translate a sentence or fragment in the source language (SL) into the corresponding sentence or fragment in the target language (TL). Corpus-based machine translation uses a bilingually aligned corpus (in which each source-language sentence has one or more corresponding target-language translations), and all the data and knowledge that the computer needs for automatic translation are obtained from that corpus.
A PBMT system uses the phrase as the basic unit of translation. During translation the system does not translate each word in isolation; instead, several consecutive words are translated together. Because the granularity of translation is enlarged, phrase-based methods handle local context dependencies easily and translate idioms and collocations well. In general, a phrase in a phrase-based method can be any continuous string of words, with no syntactic restriction, so bilingual phrases can easily be extracted automatically from a word-aligned bilingual corpus and used to translate a given source-language sentence. A phrase-based method requires the system to be trained. During training, a bilingual corpus, that is, a set of sentences that are translations of each other, is input first. The word alignment result indicates which words in a sentence pair are translations of each other. Phrase extraction is then performed: all continuous word strings in the corpus that are translations of each other are extracted, regardless of whether a word string has a real meaning.
PBMT has the following defects: (1) because it relies on local context, PBMT cannot handle long sentences or phrases well, in particular the long-distance reordering problem caused by non-continuous fixed collocations; (2) because PBMT relies entirely on statistical information about continuous phrases, it ignores the syntactic features of the language and fails to make full use of the knowledge contained in the corpus, which limits further improvement of its translation quality.
Summary of the invention
In view of the above shortcomings, an object of the present invention is to provide a machine translation system and method based on syntactic analysis and a hierarchical model.
According to an aspect of the present invention, a machine translation system based on syntactic analysis and a hierarchical model is provided. The machine translation system may comprise: a word alignment module, which receives a bilingual sentence-aligned text from the outside and obtains word alignment information from the received bilingual sentence-aligned text; a phrase extraction module, which receives the word alignment information from the word alignment module and uses it to perform phrase extraction, thereby obtaining a phrase alignment table; a part-of-speech and syntax tagging module, which receives an annotated corpus and the bilingual sentence-aligned text from the outside, extracts useful linguistic knowledge and its probability distribution information from the annotated corpus, and uses the extracted linguistic knowledge and probability distribution information to perform part-of-speech and syntax tagging on both languages or one language of the bilingual sentence-aligned text, producing a syntactically annotated corpus; a syntax-based non-continuous phrase extraction module, which receives the syntactically annotated corpus from the part-of-speech and syntax tagging module and, based on the syntactically annotated corpus, performs syntax-based non-continuous phrase extraction according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, producing a syntax-based non-continuous phrase rule base; a translation module based on non-continuous phrases, which receives the syntax-based non-continuous phrase rule base from the syntax-based non-continuous phrase extraction module, retrieves, for a sentence to be translated, all possible phrases, their translations and their probabilities from the syntax-based non-continuous phrase rule base, and outputs translation results; and a scoring output module, which receives an assessment model from the outside, scores the translation results based on the assessment model, and outputs the translation result with the highest score.
The machine translation system may further comprise a translation module based on continuous phrases, which receives the phrase alignment table from the phrase extraction module, retrieves, for the sentence to be translated, all possible phrases, their translations and their probabilities from the phrase alignment table, and outputs translation results to the scoring output module.
The syntax-based non-continuous phrase extraction module may comprise: a non-continuous phrase extraction module, which, according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, replaces bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols, obtaining a non-continuous phrase rule base; and a syntax filtering module, which filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module based on the syntactically annotated corpus, producing the syntax-based non-continuous phrase rule base.
The probability distribution information may include the probability that a particular word belongs to a particular part of speech, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
The phrase alignment table may include a source-language phrase, a target-language phrase and a probability value.
According to another aspect of the present invention, a machine translation method based on syntactic analysis and a hierarchical model is provided. The machine translation method comprises the following steps: receiving a bilingual sentence-aligned text and obtaining word alignment information from the received bilingual sentence-aligned text; performing phrase extraction using the word alignment information to obtain a phrase alignment table; receiving an annotated corpus and the bilingual sentence-aligned text, extracting useful linguistic knowledge and its probability distribution information from the annotated corpus, and using the extracted linguistic knowledge and probability distribution information to perform part-of-speech and syntax tagging on both languages or one language of the bilingual sentence-aligned text, producing a syntactically annotated corpus; based on the syntactically annotated corpus, performing syntax-based non-continuous phrase extraction according to the word alignment information or the phrase alignment table, producing a syntax-based non-continuous phrase rule base; retrieving, for a sentence to be translated, all possible phrases, their translations and their probabilities from the syntax-based non-continuous phrase rule base; and receiving an assessment model, scoring the translations based on the assessment model, and outputting the translation result with the highest score.
The machine translation method may further comprise the step of retrieving, for the sentence to be translated, all possible phrases, their translations and their probabilities from the phrase alignment table.
The step of producing the syntax-based non-continuous phrase rule base may comprise the following steps: replacing bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols according to the word alignment information or the phrase alignment table, obtaining a non-continuous phrase rule base; and filtering the non-continuous phrase rule base based on the syntactically annotated corpus, producing the syntax-based non-continuous phrase rule base.
The machine translation system and method according to the present invention perform syntactic analysis on top of a general machine translation model based on continuous phrases, and thereby extract a syntax-based non-continuous phrase rule base from the bilingual sentence-aligned text. This addresses the problem of non-continuous fixed collocations that span the context of a whole sentence, and makes the rules conform to the syntactic features of the language. Translation is performed based on the non-continuous phrase rule base and the phrase alignment table, and the translation results are scored based on an assessment model, which effectively improves translation quality.
Brief description of the drawings
The above and other features and aspects of the present invention will become clearer from the following detailed description of exemplary embodiments with reference to the accompanying drawings, in which:
Fig. 1 is a block diagram illustrating a machine translation system based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention;
Fig. 2 is a diagram illustrating construction of the syntactically annotated corpus;
Fig. 3 is a diagram illustrating the syntax-based non-continuous phrase extraction module shown in Fig. 1 according to an exemplary embodiment of the present invention;
Fig. 4 is a diagram illustrating an example of the operation of the non-continuous phrase extraction module in Fig. 3;
Fig. 5 is a diagram illustrating an example of single-sentence syntactic analysis and filtering of the non-continuous phrase rule base;
Fig. 6A and Fig. 6B are diagrams illustrating machine translation according to an exemplary embodiment of the present invention and according to the conventional art, respectively;
Fig. 7 is a flowchart illustrating a machine translation method based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.
Detailed description
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.
As shown in Fig. 1, the machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention comprises: a word alignment module 101, a phrase extraction module 102, a translation module 103 based on continuous phrases, a part-of-speech and syntax tagging module 201, a syntax-based non-continuous phrase extraction module 202, a translation module 301 based on non-continuous phrases, and a scoring output module 302.
The word alignment module 101, the phrase extraction module 102 and the translation module 103 based on continuous phrases are modules used in conventional translation systems based on continuous phrases. Together with the part-of-speech and syntax tagging module 201 and the syntax-based non-continuous phrase extraction module 202 according to an exemplary embodiment of the present invention, these modules form the preprocessing part of the machine translation system. The translation module 103 based on continuous phrases, together with the translation module 301 based on non-continuous phrases and the scoring output module 302 according to an exemplary embodiment of the present invention, forms the translation engine of the machine translation system.
Referring to Fig. 1, a bilingual sentence-aligned text is input to the word alignment module 101. The word alignment module 101 uses a tool (for example, GIZA++) to obtain word alignment information from the input bilingual sentence-aligned text, and inputs this word alignment information to the phrase extraction module 102.
The phrase extraction module 102 receives the word alignment information from the word alignment module 101 and uses it to perform phrase extraction, thereby obtaining a phrase alignment table (also called a continuous phrase library). It sends the phrase alignment table to the translation module 103 based on continuous phrases and to the syntax-based non-continuous phrase extraction module 202. The phrase alignment table comprises the following three parts: (1) the source-language phrase; (2) the target-language phrase; (3) a probability value.
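The consistency check that underlies phrase extraction from word alignments can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `extract_phrase_pairs`, the `max_len` limit and the romanized example tokens are invented for the example, and the sketch omits the probability value that the phrase alignment table stores with each pair.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Return continuous phrase pairs that are consistent with the word alignment.

    `alignment` is a set of (source index, target index) links. A pair is
    consistent when no word inside the target span is linked to a source
    word outside the source span.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject spans whose target side is also linked to outside words.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Invented example: romanized source tokens with a one-to-one alignment.
src = ["wo", "xihuan", "pingguo"]
tgt = ["I", "like", "apples"]
alignment = {(0, 0), (1, 1), (2, 2)}
pairs = extract_phrase_pairs(src, tgt, alignment)
```

In a full system, each extracted pair would additionally be assigned a probability, for example by relative frequency over the corpus, before being stored in the table.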
In computer processing of natural language, rule-based syntactic parsing mainly uses Chomsky's context-free grammar, but this grammar is of little help in handling the ambiguity of natural language.
In recent years, improvements to context-free grammar have taken two main directions. The first adds probabilities to the rules of the context-free grammar, giving the probabilistic context-free grammar (PCFG). The second, in addition to adding probabilities to the rules, also considers the effect of a rule's head word on the rule probability, giving the lexicalized probabilistic context-free grammar.
These studies skillfully combine rule-based rationalist methods with statistics-based empiricist methods and have achieved good results, providing a powerful means of resolving syntactic ambiguity. A probabilistic grammar assigns a probability to a sentence or a string of words, and thereby captures finer syntactic information than an ordinary context-free grammar. A probabilistic context-free grammar is itself a context-free grammar in which each rule is labeled with the probability of selecting that rule. The context-free rules are assumed to be conditionally independent of one another, so the probability of a parse of a sentence is computed as the product of the probabilities of the rules used in that parse.
The specific operations by which the part-of-speech and syntax tagging module 201 constructs the syntactically annotated corpus (such a corpus is also called a treebank) are described below with reference to Fig. 2, taking PCFG as an example.
First, a corpus is annotated (automatically or manually) to form a corpus carrying annotation information at different levels, such as the Penn Treebank, which is annotated with part-of-speech and syntax tree information; its main tag set is shown in Fig. 2(a). The annotated corpus is input to the part-of-speech and syntax tagging module 201.
The part-of-speech and syntax tagging module 201 uses statistical tools to extract useful linguistic knowledge and its probability distribution information from the annotated corpus, that is, it uses a supervised training method. The main probability distribution information includes the probability that a given word belongs to a given part of speech, the probability that a given phrase belongs to a given phrase category, and context probabilities.
The part-of-speech and syntax tagging module 201 uses the extracted linguistic knowledge and probability distribution information to perform part-of-speech and syntax tagging on both languages or one language of the bilingual sentence-aligned text, produces the syntactically annotated corpus, and sends it to the syntax-based non-continuous phrase extraction module 202. A sentence may have several possible annotation results, and the one with the highest probability is chosen as the output. Taking Fig. 2(a) and Fig. 2(b) as an example, the probability of Fig. 2(a) is P1 = 0.2 × 0.2 × 0.2 × 0.4 × 0.45 × 1.0 × 1.0 × 0.4 × 0.05 = 2.88 × 10^-5, and the probability of Fig. 2(b) is P2 = 0.8 × 0.2 × 0.05 × 0.4 × 0.4 × 0.3 × 0.4 × 0.4 × 0.4 × 0.05 = 1.2288 × 10^-6. Because P1 is greater than P2, the annotation result of Fig. 2(a) is selected.
Fig. 2(c) and Fig. 2(d) show part of the syntactic tag set and an annotated Chinese sentence, respectively.
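The choice between the two annotation results can be reproduced with a few lines of code. The sketch below only multiplies the rule probabilities quoted in the text and picks the larger product; the variable and function names are illustrative assumptions.

```python
import math

# Rule probabilities quoted in the text for the two candidate annotations.
parse_a = [0.2, 0.2, 0.2, 0.4, 0.45, 1.0, 1.0, 0.4, 0.05]
parse_b = [0.8, 0.2, 0.05, 0.4, 0.4, 0.3, 0.4, 0.4, 0.4, 0.05]


def parse_probability(rule_probs):
    """PCFG parse probability: the product of the probabilities of the rules used."""
    return math.prod(rule_probs)


scores = {
    "Fig. 2(a)": parse_probability(parse_a),
    "Fig. 2(b)": parse_probability(parse_b),
}
# The annotation with the highest probability is the one that is output.
best = max(scores, key=scores.get)
```

For long sentences the product quickly becomes very small, so a practical parser would usually sum log-probabilities instead; the ranking is the same.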
The syntax-based non-continuous phrase extraction module 202 receives the syntactically annotated corpus from the part-of-speech and syntax tagging module 201 and, based on the syntactically annotated corpus, performs syntax-based non-continuous phrase extraction according to the word alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, thereby obtaining the syntax-based non-continuous phrase rule base.
How the syntax-based non-continuous phrase extraction module 202 produces the syntax-based non-continuous phrase rule base is described in detail below with reference to Fig. 3 to Fig. 5.
Fig. 3 to Fig. 5 show the specific structure and operations of the syntax-based non-continuous phrase extraction module 202 according to an exemplary embodiment of the present invention.
As shown in Fig. 3, the syntax-based non-continuous phrase extraction module 202 comprises a non-continuous phrase extraction module 212 and a syntax filtering module 222.
How the non-continuous phrase extraction module 212 constructs the non-continuous phrase rule base is described in detail below with reference to Fig. 4.
According to the word alignment information produced by the word alignment module 101 or the phrase alignment table produced by the phrase extraction module 102, the non-continuous phrase extraction module 212 replaces bilingually aligned continuous phrases in each sentence of the bilingual sentence-aligned text with nonterminal symbols such as [X] and [Y], obtaining the non-continuous phrase rule base.
Fig. 4 shows an example of non-continuous phrase rule extraction. The rule in this example pairs a Chinese source pattern containing the nonterminals [X] and [Y] with the target pattern "[Y] with [X]" and three probability values, with the fields separated by "|||": (source pattern) ||| [Y] with [X] ||| 0.1 0.3 0.6. Here, 0.1 is the source-to-target translation probability, 0.3 is the target-to-source word translation probability, and 0.6 is the source-to-target word translation probability.
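The nonterminal substitution described above can be sketched as follows. This is a minimal illustration, not the patent's extraction procedure: the helper names (`replace_span`, `generalize`, `format_rule`) and the romanized source tokens are invented, and the three probability values are simply copied from the example rule rather than estimated.

```python
def replace_span(tokens, sub, symbol):
    """Replace the first occurrence of the token list `sub` with `symbol`."""
    n = len(sub)
    for k in range(len(tokens) - n + 1):
        if tokens[k:k + n] == sub:
            return tokens[:k] + [symbol] + tokens[k + n:]
    return None  # sub-phrase not found


def generalize(rule, sub_pair, symbol):
    """Replace an aligned sub-phrase pair with a nonterminal on both sides."""
    src, tgt = rule
    sub_src, sub_tgt = sub_pair
    new_src = replace_span(src, sub_src, symbol)
    new_tgt = replace_span(tgt, sub_tgt, symbol)
    if new_src is None or new_tgt is None:
        return None
    return new_src, new_tgt


def format_rule(rule, probs):
    """Serialize a rule as 'source ||| target ||| probabilities'."""
    src, tgt = rule
    return " ||| ".join([" ".join(src), " ".join(tgt), " ".join(str(p) for p in probs)])


# Invented aligned phrase pair (romanized source tokens are hypothetical).
rule = ("dai hongse fengmian de shu".split(), "book with red cover".split())
rule = generalize(rule, ("hongse fengmian".split(), "red cover".split()), "[X]")
rule = generalize(rule, (["shu"], ["book"]), "[Y]")
line = format_rule(rule, (0.1, 0.3, 0.6))
```

Each substitution turns a continuous phrase pair into a pattern with a gap, which is what allows a single rule to capture a reordering such as the swap of [X] and [Y] between the two sides.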
The basic idea of syntactic filtering of the non-continuous phrase rule base is to ensure that the phrase extracted from a sentence is a relatively independent sentence constituent, such as a noun phrase (NP) or a quantifier phrase (QP), in order to safeguard the quality of the later translation.
The syntax filtering module 222 filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module 212 based on the syntactically annotated corpus, producing the syntax-based non-continuous phrase rule base.
How the syntax filtering module 222 performs syntactic filtering is described below with reference to Fig. 5.
Fig. 5 shows an example of single-sentence syntactic analysis and filtering of a non-continuous phrase rule base.
As shown in Fig. 5, syntax tagging is first performed on the single input sentence.
Non-pronominal noun phrases (NP-NN) are then cut out of the tagged sentence and replaced with [X]; here the phrase is "subway line map", and the resulting non-continuous phrase rule is the first rule retained in Fig. 5.
Quantifier phrases (QP) are considered next, specifically phrases that are labeled QP and contain two child nodes, CD and CLP, such as (QP (CD two) (CLP (M measure-word))). The CD node, here "two", is replaced with [X], and the resulting non-continuous phrase rule is the second rule retained in Fig. 5.
The rule that is filtered out in Fig. 5 because it does not conform to the syntax rules is "is [X] to my subway line map?"
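One way to picture the filter is as a check that the span replaced by a nonterminal coincides with an NP or QP node of the parse tree. The sketch below makes that simplification (the patent additionally inspects the children of a QP node); the bracketed tree, the function names and the English stand-in words are assumptions made for illustration.

```python
def constituents(bracketed):
    """Return (label, start, end) for every node of a Penn-style bracketed tree.

    Word positions are 0-based and `end` is exclusive.
    """
    tokens = bracketed.replace("(", " ( ").replace(")", " ) ").split()
    spans, stack, pos, i = [], [], 0, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            stack.append((tokens[i + 1], pos))  # node label and first word index
            i += 2
        elif tok == ")":
            label, start = stack.pop()
            spans.append((label, start, pos))
            i += 1
        else:
            pos += 1  # a leaf word
            i += 1
    return spans


def is_allowed_constituent(span, tree_spans, allowed=("NP", "QP")):
    """Keep a rule only if the replaced span is exactly an NP or QP node."""
    return any(label in allowed and (start, end) == span
               for label, start, end in tree_spans)


# Hypothetical parse; English words stand in for the Chinese sentence.
tree = ("(VP (VV give) (NP (PN me)) "
        "(NP (QP (CD two) (CLP (M sheets))) (NP (NN subway) (NN line) (NN map))))")
spans = constituents(tree)
keep_np = is_allowed_constituent((4, 7), spans)   # "subway line map"
keep_qp = is_allowed_constituent((2, 4), spans)   # "two sheets"
keep_bad = is_allowed_constituent((0, 2), spans)  # "give me" is not a constituent
```

Rules whose gap does not line up with such a constituent, like the third case, would be dropped from the rule base.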
The preprocessing part of the machine translation system based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention has been described in detail above with reference to the drawings. The translation engine of the machine translation system is described below with reference to Fig. 1 and Fig. 6.
The machine translation system based on syntactic analysis and a hierarchical model according to the present invention uses a translation model, a language model, a reordering model and a decoder.
The main differences between machine translation based on syntactic analysis and a hierarchical model according to the present invention and conventional machine translation based on continuous phrases are the extension of the translation model and the relative weakening of the reordering model.
The translation model provides the translation correspondences between source-language and target-language phrases and expresses the strength of each correspondence as a probability value; the higher the probability value, the more accurate the correspondence. It is used to supply possible target-language translations for a source-language sentence. The translation model based on hierarchical phrases extends the translation correspondences from continuous phrases alone to continuous phrases plus syntax-based non-continuous phrases.
The language model stores a large number of probability values that give the probabilistic relationship between each word and its neighboring words or phrases. Its role is to judge how well a phrase St conforms to the grammar and usage of the target language, and it is used to choose among translation results. This degree is generally measured by a probability value PLM(St); the higher the value of PLM(St), the better the phrase conforms to the target language.
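To make PLM(St) concrete, the following sketch trains a tiny trigram language model with add-one smoothing and compares two candidate phrases. The training sentences, the smoothing choice and the function names are assumptions for illustration only; the patent does not specify how its language model is estimated.

```python
import math
from collections import Counter


def train_trigram(sentences):
    """Count trigrams and their two-word contexts over whitespace-tokenized text."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>", "<s>"] + s.split() + ["</s>"]
        vocab.update(words)
        for a, b, c in zip(words, words[1:], words[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(vocab)


def log_prob(sentence, model):
    """Log of PLM(sentence) under an add-one smoothed trigram model."""
    tri, ctx, vocab_size = model
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for a, b, c in zip(words, words[1:], words[2:]):
        total += math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + vocab_size))
    return total


# Invented toy training data for the target language.
model = train_trigram([
    "tell me the terms of payment",
    "the terms of payment are fixed",
    "please tell me the price",
])
fluent = log_prob("tell me the terms of payment", model)
awkward = log_prob("tell me the pay terms", model)
```

On this toy data the phrasing that appears in the training text receives the higher score, which is the behavior the language model is relied on for when it chooses among translation results.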
The reordering model is used to adjust the order of words or phrases in the translated target-language result. Because syntax-based non-continuous phrases are present, part of the function of the reordering model is taken over by them, and its weight can be lowered accordingly.
The role of the translation engine is to coordinate the above models in translating a source-language sentence.
Referring to Fig. 1, the translation module 103 based on continuous phrases retrieves, for a word-segmented sentence to be translated, all possible phrases, their translations and their probabilities from the phrase alignment table output by the phrase extraction module 102.
The translation module 301 based on non-continuous phrases receives the syntax-based non-continuous phrase rule base from the syntax-based non-continuous phrase extraction module 202 and retrieves, for the word-segmented sentence to be translated, all possible phrases, their translations and their probabilities from the syntax-based non-continuous phrase rule base.
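Retrieval from the non-continuous phrase rule base amounts to matching a source pattern with gaps against the segmented sentence and filling the gaps with translations of the covered words. A minimal sketch, assuming a regex-based matcher and invented romanized tokens, and ignoring probabilities:

```python
import re

NONTERMINAL = re.compile(r"\[([A-Z])\]")


def compile_rule(src_pattern):
    """Turn a source pattern such as 'dai [X] de [Y]' into a regular expression."""
    parts = []
    for tok in src_pattern.split():
        m = NONTERMINAL.fullmatch(tok)
        if m:
            # A gap covers one or more space-separated words.
            parts.append(r"(?P<%s>\S+(?: \S+)*?)" % m.group(1))
        else:
            parts.append(re.escape(tok))
    return re.compile(" ".join(parts))


def apply_rule(sentence, src_pattern, tgt_pattern, phrase_table):
    """Translate `sentence` with one rule, filling gaps from `phrase_table`."""
    match = compile_rule(src_pattern).fullmatch(sentence)
    if match is None:
        return None
    out = []
    for tok in tgt_pattern.split():
        m = NONTERMINAL.fullmatch(tok)
        if m:
            gap = match.group(m.group(1))
            if gap not in phrase_table:
                return None
            out.append(phrase_table[gap])
        else:
            out.append(tok)
    return " ".join(out)


# Hypothetical continuous phrase table and rule.
table = {"hongse fengmian": "red cover", "shu": "book"}
result = apply_rule("dai hongse fengmian de shu", "dai [X] de [Y]", "[Y] with [X]", table)
```

A real decoder would try every rule on every span of the sentence and carry the rule and phrase probabilities along; this sketch shows only the matching and substitution step for a single rule.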
Fig. 6A is a diagram illustrating translation of Chinese into English based on syntactic analysis and a hierarchical phrase model according to an exemplary embodiment of the present invention.
Labels (1)-(5) in Fig. 6A correspond one-to-one to the following operations (1)-(5).
(1) A Chinese sentence to be translated is input;
(2) according to the translation model, the translation module 103 based on continuous phrases searches the phrase alignment table for all possible phrases, their translations and their probabilities;
(3) according to the translation model, the translation module 301 based on non-continuous phrases searches the non-continuous phrase rule base for all possible non-continuous phrases, their translations and their probabilities;
(4) based on the translation probabilities of the phrase pairs and non-continuous phrases, the trigram language model probability and so on, the decoder computes the overall probability of each possible translation result;
(5) the decoder selects the N sentences with the best overall probability as the N-best candidate target-language sentences.
In Fig. 6A, (4)-(5) denote aggregating the results and computing the overall probability in order to select the N candidate sentences. In addition, the notation |3,6| in Fig. 6A denotes the half-open range [3, 6), which includes position 3 but not position 6.
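Operations (4) and (5) can be sketched as a scoring function plus an N-best selection. The candidate data, the field names and the `lm_weight` parameter are invented for illustration; the patent does not state how the component probabilities are weighted.

```python
import heapq
import math


def total_log_prob(candidate, lm_weight=1.0):
    """Overall score: summed log translation probabilities plus weighted log LM probability."""
    translation = sum(math.log(p) for p in candidate["rule_probs"])
    return translation + lm_weight * math.log(candidate["lm_prob"])


def n_best(candidates, n):
    """Return the n candidates with the highest overall score."""
    return heapq.nlargest(n, candidates, key=total_log_prob)


# The span notation |3,6| behaves like a Python slice: start included, end excluded.
words = ["w0", "w1", "w2", "w3", "w4", "w5", "w6"]
span = words[3:6]

# Invented candidates: probabilities of the phrases/rules used and an LM probability.
candidates = [
    {"text": "A", "rule_probs": [0.5, 0.4], "lm_prob": 0.2},
    {"text": "B", "rule_probs": [0.6, 0.3], "lm_prob": 0.1},
    {"text": "C", "rule_probs": [0.2, 0.2], "lm_prob": 0.5},
]
top2 = [c["text"] for c in n_best(candidates, 2)]
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and it leaves the ranking unchanged.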
Fig. 6B is a diagram, corresponding to Fig. 6A, illustrating translation of Chinese into English according to the conventional art.
Compared with Fig. 6A according to the present invention, the key difference is that the conventional translation process uses only continuous phrases for probability calculation and generation of the translation result, and does not use hierarchical phrases filtered by syntactic analysis, for example X -> ([Y] of [X], [Y] of [X]). For example, with the method of the present application the Chinese source phrase (literally "Shanghai of China") is translated as "Shanghai of China", whereas the result according to the conventional art is "Chinese Shanghai". The translation result according to the present invention is therefore clearly better than the translation result according to the conventional art.
How the scoring output module 302 scores the translation results based on the assessment model is described below.
The translation output that is input to the scoring output module 302 consists of N candidate target-language sentences, where N is greater than or equal to 1.
The scoring output module 302 scores the N input candidate target-language sentences based on the input assessment model.
The assessment model can combine several translation features, such as language model features, part-of-speech sequence model features of the sentence, and the length of the target-language sentence, to re-rank the N candidate target-language sentences and select the globally best translation as the output translation result.
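One common way to combine several features into a single ranking is a weighted sum. The sketch below assumes that form; the feature names, weights and values are all invented, and the patent does not say how the features are combined.

```python
def rerank(candidates, weights):
    """Sort candidates by a weighted sum of their feature values, best first."""
    def score(candidate):
        return sum(weights[name] * value
                   for name, value in candidate["features"].items())
    return sorted(candidates, key=score, reverse=True)


# Hypothetical feature values: log-probabilities and a length in words.
candidates = [
    {"text": "candidate 1", "features": {"lm": -4.0, "pos_sequence": -2.0, "length": 5}},
    {"text": "candidate 2", "features": {"lm": -3.0, "pos_sequence": -2.5, "length": 6}},
]
weights = {"lm": 1.0, "pos_sequence": 0.5, "length": 0.1}
ranked = rerank(candidates, weights)
best = ranked[0]["text"]
```

With a single feature and a weight of 1 this reduces to ranking by language model score alone.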
For simplicity of implementation and processing efficiency, the exemplary embodiment of the present invention is described with the target-language language model as the assessment model. Its role is to judge how well a sentence St conforms to the grammar and usage of the target language, and thereby to choose among translation results. This degree is generally measured by a probability value PLM(St); the higher the value of PLM(St), the better the sentence conforms to the target language.
In view of processing efficiency and the differences among candidate target-language sentences, N = 2 in the current exemplary embodiment of the present invention; that is, the candidates are the output sentence translated using continuous phrases only and the output sentence based on syntactic analysis and the hierarchical model.
The scoring output module 302 scores according to the following basic procedure:
1. Receive the N = 2 candidate target-language sentences: the output sentence translated using continuous phrases only, and the output sentence based on syntactic analysis and the hierarchical model;
2. Use the target-language language model to compute a probability value for each possible translation;
3. Select the output with the best score.
An example of scoring by the scoring output module 302 is described below.
The source language is Chinese and the target language is English. The input source-language sentence (given here in English gloss) is: "could you tell me the payment terms".
The results after translation (N = 2) are:
1) Would you please tell me the pay terms. (translation result based on continuous phrases)
2) Would you please tell me the terms of payment. (translation result based on syntactic analysis and the hierarchical model)
The English language model scores these two results. Because "payment terms" has the common expression "terms of payment", and "Would you please tell me the terms of payment." conforms better to English grammar and usage, the language model gives this result a higher score:
1) Score for result 1: 0.7
2) Score for result 2: 0.9
The result with the highest score is selected as the final result: Would you please tell me the terms of payment.
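The final selection step in this example is a simple arg-max over the scored candidates. The sketch below hard-codes the two scores given in the text; the function name `select_best` is an assumption.

```python
def select_best(scored):
    """Return the candidate sentence with the highest score."""
    return max(scored, key=scored.get)


# Scores quoted in the example above.
scored = {
    "Would you please tell me the pay terms.": 0.7,
    "Would you please tell me the terms of payment.": 0.9,
}
best = select_best(scored)
```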
A machine translation method based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention is described below with reference to Fig. 7.
Fig. 7 is a flowchart illustrating the machine translation method based on syntactic analysis and a hierarchical model according to an exemplary embodiment of the present invention.
As shown in Fig. 7, in steps S701 and S702, a tagged corpus and a bilingual sentence-aligned text are input, respectively.
In step S703, part-of-speech and syntax tagging is performed. First, a statistical tool is used to extract useful linguistic knowledge and its probability distribution information from the input tagged corpus. Then, using the extracted linguistic knowledge and its probability distribution information, part-of-speech and syntax tagging is performed on both languages or on a single language of the input bilingual sentence-aligned text, finally producing a syntax-tagged corpus (also called a syntax-tagged treebank).
In step S704, the GIZA++ tool is used to obtain word alignment information from the input bilingual sentence-aligned text.
In step S705, phrases are extracted using the word alignment information obtained in step S704, thereby obtaining a phrase alignment table. The phrase alignment table comprises the following three parts: (1) source-language phrase; (2) target-language phrase; (3) probability value.
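As an illustration of step S705, the following Python sketch builds a three-column phrase alignment table from word-aligned sentence pairs. It assumes the consistency criterion commonly used in phrase-based statistical translation (no alignment link may leave the phrase pair) and a relative-frequency probability; the description does not state which criterion or estimator is actually used. The function names and the Chinese tokens are hypothetical, and for brevity only the tightest target span is taken, so unaligned boundary words are not used to extend a phrase.

```python
from collections import Counter


def extract_phrase_pairs(src, tgt, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    src, tgt: token lists; alignment: set of (src_index, tgt_index) links.
    """
    pairs = []
    for s_start in range(len(src)):
        for s_end in range(s_start, min(s_start + max_len, len(src))):
            # Target positions linked to any word in the source span.
            linked = [t for (s, t) in alignment if s_start <= s <= s_end]
            if not linked:
                continue
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue
            # Reject the pair if a target word inside the span is linked
            # to a source word outside the source span.
            if any(t_start <= t <= t_end and not (s_start <= s <= s_end)
                   for (s, t) in alignment):
                continue
            pairs.append((" ".join(src[s_start:s_end + 1]),
                          " ".join(tgt[t_start:t_end + 1])))
    return pairs


def build_phrase_table(pair_lists):
    """Return (source phrase, target phrase, p(target | source)) rows."""
    counts = Counter(p for pairs in pair_lists for p in pairs)
    totals = Counter()
    for (s, _), c in counts.items():
        totals[s] += c
    return [(s, t, c / totals[s]) for (s, t), c in counts.items()]


# Hypothetical word-aligned sentence pair ("of" is left unaligned).
src = ["付款", "条件"]
tgt = ["terms", "of", "payment"]
alignment = {(0, 2), (1, 0)}

pairs = extract_phrase_pairs(src, tgt, alignment)
table = build_phrase_table([pairs])
for row in table:
    print(row)
```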
In step S706, syntax-based non-continuous phrase extraction is performed on the syntax-tagged corpus obtained in step S703, according to the alignment information produced in step S704 or the phrase alignment table obtained in step S705, to obtain a syntax-based non-continuous phrase rule base.
In detail, first, based on the alignment information obtained in step S704 or the phrase alignment table obtained in step S705, the bilingually aligned continuous phrases in each sentence pair of the bilingual sentence-aligned text are replaced with nonterminal symbols such as [X] and [Y], yielding a non-continuous phrase rule base. Then, syntax filtering is performed based on the syntax-tagged corpus obtained in step S703 to obtain the syntax-based non-continuous phrase rule base.
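The two sub-steps of step S706 can be sketched as follows. This is a minimal sketch under stated assumptions: the helper names are hypothetical, only one nonterminal is substituted per rule, and the filter shown (keep a rule only if the replaced source span is a constituent of the syntax tree) is just one plausible criterion, not necessarily the one used by the invention. The example sentence pair and the constituent spans are invented for illustration.

```python
def find_sublist(tokens, sub):
    """Return the start index of the first occurrence of sub in tokens, or -1."""
    for i in range(len(tokens) - len(sub) + 1):
        if tokens[i:i + len(sub)] == sub:
            return i
    return -1


def make_gapped_rule(src, tgt, sub_src, sub_tgt, symbol="[X]"):
    """Replace one aligned continuous sub-phrase pair with a nonterminal.

    Returns ((source rule, target rule), replaced source span) or None.
    """
    i, j = find_sublist(src, sub_src), find_sublist(tgt, sub_tgt)
    if i < 0 or j < 0:
        return None
    rule = (" ".join(src[:i] + [symbol] + src[i + len(sub_src):]),
            " ".join(tgt[:j] + [symbol] + tgt[j + len(sub_tgt):]))
    return rule, (i, i + len(sub_src))


def syntax_filter(candidates, constituent_spans):
    """Keep rules whose replaced source span is a constituent (assumed criterion)."""
    return [rule for rule, span in candidates if span in constituent_spans]


# Hypothetical sentence pair and aligned continuous phrases.
src = ["告诉", "我", "付款", "条件"]
tgt = ["tell", "me", "the", "terms", "of", "payment"]
candidates = [
    make_gapped_rule(src, tgt, ["付款", "条件"], ["terms", "of", "payment"]),
    make_gapped_rule(src, tgt, ["告诉", "我"], ["tell", "me"]),
]

# Hypothetical constituent spans read off a syntax-tagged tree of the source.
constituent_spans = {(2, 4)}
rules = syntax_filter(candidates, constituent_spans)
print(rules)
```

In this toy case the second candidate is discarded because the span covering the first two source words is not a constituent in the assumed tree, which illustrates how filtering can shrink the rule base.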
In step S707, according to the translation model, all possible phrases and non-continuous phrases, together with their translations and probabilities, are searched for in the phrase alignment table and in the syntax-based non-continuous phrase rule base, and the N translations with the highest overall probability are output as candidate target-language sentences.
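A heavily simplified Python sketch of step S707 is given below. It assumes rules with a single nonterminal, fills the gap only from the phrase alignment table, and multiplies the rule and phrase probabilities to obtain the overall probability. A practical hierarchical decoder would use chart parsing and a richer translation model, so this is an illustration of the N-best idea only. The function name, the data structures, and all probability values are hypothetical.

```python
import heapq


def translate_candidates(sentence, phrase_table, rules, n=2, symbol="[X]"):
    """Return the N most probable (translation, probability) pairs.

    phrase_table: {source phrase: [(target phrase, prob), ...]}
    rules: [(source pattern, target pattern, prob)], one `symbol` per pattern.
    """
    # Whole-sentence matches in the phrase alignment table.
    results = list(phrase_table.get(sentence, []))
    for src_pat, tgt_pat, rule_prob in rules:
        prefix, _, suffix = src_pat.partition(symbol)
        if len(sentence) < len(prefix) + len(suffix):
            continue
        if not (sentence.startswith(prefix) and sentence.endswith(suffix)):
            continue
        # The text covered by the nonterminal is translated via the phrase table.
        gap = sentence[len(prefix):len(sentence) - len(suffix)].strip()
        for gap_tgt, gap_prob in phrase_table.get(gap, []):
            results.append((tgt_pat.replace(symbol, gap_tgt),
                            rule_prob * gap_prob))
    return heapq.nlargest(n, results, key=lambda item: item[1])


# Hypothetical phrase alignment table and rule base.
phrase_table = {"付款 条件": [("terms of payment", 0.6), ("pay terms", 0.4)]}
rules = [("告诉 我 [X]", "tell me the [X]", 0.5)]

top = translate_candidates("告诉 我 付款 条件", phrase_table, rules, n=2)
for translation, prob in top:
    print(translation, prob)
```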
In step S708, the candidate target-language sentences are scored based on an assessment model, and the globally best one is selected as the final output.
The machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention have been described above with reference to the accompanying drawings. Those skilled in the art will understand that the invention is not limited to the above exemplary embodiments. For example, in order to obtain all possible translation results, Fig. 1 includes the continuous-phrase-based translation module 103, and step S707 of Fig. 7 includes searching the phrase alignment table for all possible phrases, non-continuous phrases, translations, and their probabilities. However, it is also feasible to omit the continuous-phrase-based translation module 103 from Fig. 1 and to omit the search of the phrase alignment table from step S707 of Fig. 7. In addition, in exemplary embodiments of the present invention, the assessment model is not limited to a language model.
A Korean-Chinese translation experiment was carried out on a prototype system based on this patent.
Test set composition: closed test (test sentences selected from the training set) 20%; open test (test sentences not in the training set) 80%.
Results of human evaluation: compared with a conventional machine translation system based on continuous phrases, the proportion of sentences whose Korean-Chinese fluency clearly improved rose by more than 10%, reaching a practical level of 86.5% of translations rated good in human evaluation.
On an embedded system equivalent to the hardware configuration of current mainstream mobile phones, the average translation speed is 2 sentences per second, which achieves real-time translation.
Figure GSA00000062503400111
Examples of Korean-Chinese translation (Example 1) and Chinese-Korean translation (Example 2) are given below.
Example 1 (Korean-Chinese translation)
Figure GSA00000062503400112
Example 2 (Chinese-Korean translation)
Chinese: Please send my bag to my room.
Translation result based on continuous phrase model:
Figure GSA00000062503400113
(translation error);
Translation result based on syntactic analysis and hierarchical model of the present invention:
Figure GSA00000062503400114
(translation is correct).
Compared with prior-art machine translation systems and methods based on continuous phrases, the machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention can significantly improve translation accuracy, particularly when the size of the corpus is limited.
The machine translation system and method based on syntactic analysis and a hierarchical model according to exemplary embodiments of the present invention can be applied both to computer systems and to embedded systems.
The present invention introduces a hierarchical model and extracts an aligned non-continuous phrase rule base from a sentence-aligned bilingual corpus, which solves the problem of translating discontinuous fixed collocations within the context of a full sentence.
The present invention adds a part-of-speech and syntax tagging module and a syntax-based non-continuous phrase extraction module. The corpus is analyzed to obtain a syntax-tagged tree for each sentence (that is, for each syntax-tagged sentence), and the syntax-based non-continuous phrase rule base is obtained from these trees so that the rules conform to the syntactic features of the language. This improves translation quality and significantly reduces the size of the non-continuous phrase rule base, making it suitable for use in embedded systems.
The present invention scores and selects translation results based on an assessment model and outputs the translation result with the highest score as the final result. This effectively combines the advantages of the individual translation models, ensures the extensibility of the system, and further improves translation quality.
Those skilled in the art should understand that various changes in form and detail may be made without departing from the spirit and scope of the present invention. The foregoing exemplary embodiments are therefore for illustration only and should not be construed as limiting the present invention. The scope of the present invention is defined by the claims.

Claims (10)

1. A machine translation system based on syntactic analysis and a hierarchical model, comprising:
a word alignment module, which receives a bilingual sentence-aligned text from outside and obtains word alignment information from the received bilingual sentence-aligned text;
a phrase extraction module, which receives the word alignment information from the word alignment module and uses the received word alignment information to perform phrase extraction so as to obtain a phrase alignment table;
a part-of-speech and syntax tagging module, which receives a tagged corpus and the bilingual sentence-aligned text from outside, extracts from the tagged corpus linguistic knowledge and its probability distribution information for use with the bilingual sentence-aligned text, and uses the extracted linguistic knowledge and its probability distribution information to perform part-of-speech and syntax tagging on both languages or on a single language of the bilingual sentence-aligned text, producing a syntax-tagged corpus;
a syntax-based non-continuous phrase extraction module, which receives the syntax-tagged corpus from the part-of-speech and syntax tagging module and, based on the syntax-tagged corpus, performs syntax-based non-continuous phrase extraction according to the alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, so as to produce a syntax-based non-continuous phrase rule base;
a non-continuous-phrase-based translation module, which receives the syntax-based non-continuous phrase rule base from the syntax-based non-continuous phrase extraction module, retrieves from the syntax-based non-continuous phrase rule base, for a sentence to be translated, all possible phrases in that sentence together with their translations and translation probabilities, and outputs translation results;
a scoring output module, which receives an assessment model from outside, scores the translation results based on the assessment model, and outputs the translation result with the highest score.
2. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the machine translation system further comprises: a continuous-phrase-based translation module, which receives the phrase alignment table from the phrase extraction module, retrieves from the phrase alignment table, for the sentence to be translated, all possible phrases together with their translations and probabilities, and outputs translation results to the scoring output module.
3. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1 or 2, characterized in that the syntax-based non-continuous phrase extraction module comprises: a non-continuous phrase extraction module, which, according to the word alignment information produced by the word alignment module or the phrase alignment table produced by the phrase extraction module, replaces the bilingually aligned continuous phrases in each sentence pair of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base; and a syntax filtering module, which filters the non-continuous phrase rule base produced by the non-continuous phrase extraction module based on the syntax-tagged corpus, so as to produce the syntax-based non-continuous phrase rule base.
4. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular word category, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
5. The machine translation system based on syntactic analysis and a hierarchical model as claimed in claim 1, characterized in that the phrase alignment table comprises a source-language phrase, a target-language phrase, and a probability value.
6. A machine translation method based on syntactic analysis and a hierarchical model, comprising the following steps:
receiving a bilingual sentence-aligned text, and obtaining word alignment information from the received bilingual sentence-aligned text;
performing phrase extraction using the word alignment information to obtain a phrase alignment table;
receiving a tagged corpus and the bilingual sentence-aligned text, extracting from the tagged corpus linguistic knowledge and its probability distribution information for use with the bilingual sentence-aligned text, and using the extracted linguistic knowledge and its probability distribution information to perform part-of-speech and syntax tagging on both languages or on a single language of the bilingual sentence-aligned text, producing a syntax-tagged corpus;
performing, based on the syntax-tagged corpus, syntax-based non-continuous phrase extraction according to the alignment information or the phrase alignment table, so as to produce a syntax-based non-continuous phrase rule base;
retrieving from the syntax-based non-continuous phrase rule base, for a sentence to be translated, all possible phrases in that sentence together with their translations and translation probabilities;
receiving an assessment model, scoring the translations based on the assessment model, and outputting the translation result with the highest score.
7. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the machine translation method further comprises the following step: retrieving from the phrase alignment table, for the sentence to be translated, all possible phrases together with their translations and probabilities.
8. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6 or 7, characterized in that the step of producing the syntax-based non-continuous phrase rule base comprises the following steps:
replacing, according to the word alignment information or the phrase alignment table, the bilingually aligned continuous phrases in each sentence pair of the bilingual sentence-aligned text with nonterminal symbols to obtain a non-continuous phrase rule base;
filtering the non-continuous phrase rule base based on the syntax-tagged corpus, so as to produce the syntax-based non-continuous phrase rule base.
9. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the probability distribution information comprises the probability that a particular word belongs to a particular word category, the probability that a particular phrase belongs to a particular phrase category, and context probabilities.
10. The machine translation method based on syntactic analysis and a hierarchical model as claimed in claim 6, characterized in that the phrase alignment table comprises a source-language phrase, a target-language phrase, and a probability value.
CN 201010144623 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model Active CN102214166B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN 201010144623 CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
KR1020110018439A KR101777421B1 (en) 2010-04-06 2011-03-02 A syntactic analysis and hierarchical phrase model based machine translation system and method
US13/079,283 US8818790B2 (en) 2010-04-06 2011-04-04 Syntactic analysis and hierarchical phrase model based machine translation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010144623 CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Publications (2)

Publication Number Publication Date
CN102214166A CN102214166A (en) 2011-10-12
CN102214166B true CN102214166B (en) 2013-02-20

Family

ID=44745481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010144623 Active CN102214166B (en) 2010-04-06 2010-04-06 Machine translation system and machine translation method based on syntactic analysis and hierarchical model

Country Status (1)

Country Link
CN (1) CN102214166B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1228566A (en) * 1998-03-11 1999-09-15 英业达股份有限公司 Non-continuous phrase matching translation device and method
CN1652106A (en) * 2004-02-04 2005-08-10 北京赛迪翻译技术有限公司 Machine translation method and apparatus based on language knowledge base
CN101290616A (en) * 2008-06-11 2008-10-22 中国科学院计算技术研究所 Statistical machine translation method and system
CN101685441A (en) * 2008-09-24 2010-03-31 中国科学院自动化研究所 Generalized reordering statistic translation method and device based on non-continuous phrase

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7353165B2 (en) * 2002-06-28 2008-04-01 Microsoft Corporation Example based machine translation system
KR100911619B1 (en) * 2007-12-11 2009-08-12 한국전자통신연구원 Method and apparatus for constructing vocabulary pattern of english

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant