WO2012079257A1 - Machine translation method and device - Google Patents

Machine translation method and device

Info

Publication number
WO2012079257A1
Authority
WO
WIPO (PCT)
Prior art keywords
source language
arbitrary
translation
unit
phrase
Prior art date
Application number
PCT/CN2010/079963
Other languages
English (en)
Chinese (zh)
Inventor
徐金安
孟凡东
陈恰
潘栩
达珍
孟庆辰
Original Assignee
北京交通大学
Priority date
Filing date
Publication date
Application filed by 北京交通大学
Priority to PCT/CN2010/079963 (WO2012079257A1)
Priority to CN201080070253.6A (CN103314369B)
Publication of WO2012079257A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/55: Rule-based translation

Definitions

  • The present invention relates to the field of machine translation, and in particular to a machine translation apparatus and method.
  • Background
  • Machine translation draws on many disciplines and technologies, such as artificial intelligence, mathematics, linguistics, computational linguistics, speech recognition, and speech synthesis. It is comprehensive and interdisciplinary in nature.
  • Machine translation systems fall into two categories: rule-based and corpus-based. Direct translation methods, transfer methods, and intermediate language (interlingua) methods are rule-based translation methods; corpus-based methods can be further divided into memory-based translation methods, instance-based translation methods, neural-network-based translation methods, and statistical-based translation methods.
  • An existing machine translation method proceeds as follows. The system analyzes the source language statement, divides it into words and phrases, and builds a parse tree. Different groupings of words and phrases yield different parse trees, which together form a parse forest for the source language statement. The machine translation system analyzes the parse trees in the forest one by one and selects the most credible translation among the analysis results as the final translation result.
  • The present invention provides a machine translation apparatus and method.
  • The specific technical solutions are as follows:
  • A machine translation apparatus comprising:
  • a source language input unit for inputting a source language statement;
  • a source language analysis unit configured to perform lexical analysis and syntax analysis on the source language statement to obtain a syntax structure of the source language statement, and to assign attribute features to the nodes in the syntax structure;
  • an arbitrary lattice determination model storage unit configured to store an arbitrary lattice determination model, which provides the model basis for deciding whether the source language statement contains an arbitrary lattice;
  • an arbitrary lattice determination unit configured to match the attribute features against the arbitrary lattice determination model and, if they match, determine that the source language statement contains an arbitrary lattice, and otherwise determine that it does not;
  • an arbitrary lattice phrase extracting unit configured to obtain the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice found by matching;
  • an arbitrary lattice phrase translation unit for performing machine translation on the arbitrary lattice phrase;
  • a first extracting unit configured to acquire the remaining source language statement after the arbitrary lattice phrase is removed;
  • a machine translation unit configured to perform machine translation on the remaining source language statement;
  • a translation result integration unit configured to arrange and combine the translation results of the arbitrary lattice phrase translation unit and the machine translation unit, and to use a combination with a high probability of occurrence as the target language;
  • a target language output unit for outputting the target language.
  • A machine translation method comprising:
  • wherein the arbitrary lattice determination model provides the model basis for deciding whether the source language statement contains an arbitrary lattice;
  • outputting the target language.
  • FIG. 1 is a block diagram of a machine translation apparatus according to Embodiment 1 of the present invention.
  • FIG. 2 is a schematic diagram showing an example of a lexical analysis result provided by Embodiment 1 of the present invention.
  • FIG. 3 is a schematic diagram showing an example of words and the grammatical categories associated with them according to Embodiment 1 of the present invention.
  • FIG. 4 is a schematic diagram showing an exemplary data structure of the grammar rules provided by Embodiment 1 of the present invention.
  • FIG. 5 is a schematic diagram showing an example of the arbitrary lattice determination model library provided by Embodiment 1 of the present invention.
  • FIG. 6 is a schematic diagram showing an example of a syntax structure analysis result provided by Embodiment 1 of the present invention.
  • FIG. 7 is a flowchart of a machine translation method provided by Embodiment 2 of the present invention.
  • FIG. 8 is a schematic diagram showing an example of the syntax structure obtained by extracting an arbitrary lattice according to Embodiment 2 of the present invention.
  • FIG. 9 is a schematic diagram of a statistical method for parallel corpus segmentation for machine translation according to Embodiment 2 of the present invention.
  • FIG. 10 is a schematic diagram of a training method for a statistical-based machine translation device according to Embodiment 2 of the present invention.
  • This embodiment provides a machine translation device. The device includes: a source language input unit for inputting a source language statement; a source language analysis unit configured to perform lexical analysis and syntax analysis on the source language statement to obtain the syntax structure of the source language statement, and to assign attribute features to the nodes in the syntax structure; an arbitrary lattice determination model storage unit configured to store an arbitrary lattice determination model, which provides the model basis for deciding whether the source language statement contains an arbitrary lattice; an arbitrary lattice determination unit configured to match the attribute features against the arbitrary lattice determination model and, if they match, determine that the source language statement contains an arbitrary lattice, and otherwise determine that it does not; an arbitrary lattice phrase extracting unit configured to obtain the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice found by matching; an arbitrary lattice phrase translation unit configured to use
  • An arbitrary lattice in the source language statement is found, and the statement is split into two parts at the arbitrary lattice; that is, a more complicated sentence is split into two simple sentences. The two simple sentences are translated separately, the translation results are integrated, and the integrated result with the largest combination probability is selected as the translation result. This reduces the complexity of the syntactic structure of the source language, makes generating the sentence structure and grammar of the target language more efficient, improves translation accuracy, and appropriately reduces the amount of computation in machine translation decoding, providing an effective device and method for machine translation research.
  • FIG. 1 shows a machine translation apparatus 100 according to Embodiment 1 of the present invention.
  • The apparatus includes: a source language input unit 101, a source language analysis unit 102, an arbitrary lattice determination model storage unit 103, an arbitrary lattice determination unit 104, an arbitrary lattice phrase extracting unit 105, an arbitrary lattice phrase translation unit 106, a first extracting unit 107, a machine translation unit 108, a translation result integration unit 109, and a target language output unit 110.
  • The source language input unit 101 may be any general-purpose input module or input device, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a voice recognition device, or an input device that reads a text file or a database.
  • The input source language statement is stored in the computer memory or a buffer.
  • The source language analysis unit 102 is configured to perform lexical analysis on the source language statement input through the source language input unit 101 to obtain its word sequence, perform syntactic analysis on the word sequence to obtain the syntactic structure of the statement, assign attribute features to the nodes in the syntactic structure, and output the result to the arbitrary lattice determination unit 104.
  • Any general lexical analysis technique can be used for the lexical analysis of source language statements. One example is maximizing the division probability by dynamic programming with a word division model: the source language statement is divided into words according to the word division model using dynamic programming, and the most probable division is selected as the final output word sequence.
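The dynamic-programming word division described above can be illustrated with a minimal sketch. It assumes a unigram word division model; the vocabulary, the probabilities, and the names `WORD_PROB` and `segment` are hypothetical and are not taken from the patent.

```python
import math

# Hypothetical unigram probabilities standing in for a word division model.
WORD_PROB = {"ab": 0.4, "a": 0.1, "b": 0.1, "c": 0.2, "bc": 0.2}


def segment(text, word_prob, max_len=4):
    """Return the most probable division of text into known words, or None."""
    n = len(text)
    # best[i] = (log-probability, start index of the last word) for text[:i]
    best = [(-math.inf, -1)] * (n + 1)
    best[0] = (0.0, -1)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_prob[word])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        return None  # no division into known words exists
    # Walk the back-pointers to recover the word sequence.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Join with "." to mimic the breakpoint symbol used in the word sequence.
print(".".join(segment("abc", WORD_PROB)))
```

Here "ab" + "c" (0.4 × 0.2) outscores "a" + "bc" (0.1 × 0.2), so the first division is chosen. A real word division model would typically condition on context rather than use independent word probabilities.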
  • Existing lexical analysis tools can be used to perform lexical analysis on the input source language statements, including the Stanford Parser, the ICTCLAS analysis system of the Institute of Computing Technology, ChaSen, etc.
  • Any syntactic analysis method, such as chart parsing or generalized LR parsing, can be used for the syntactic analysis of source language statements.
  • Existing syntactic analysis tools can be used for syntactic analysis, including CaboCha and KNP for Japanese, etc.
  • In the word sequence 202 (FIG. 2), the symbol "." marks the breakpoint between one word and the next.
  • The breakpoint marker is not fixed; a space can also be used.
  • A vocabulary dictionary and preset grammar rules are used to assign attribute features to the nodes in the syntactic structure. The syntactic structure contains the grammatical category of each word and the nodes associated with one another. FIG. 3 shows an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
  • The vocabulary dictionary associates words with their grammatical categories. For example, the Japanese word 301 (the pronoun "he") is associated with the grammatical category Pron. (pronoun). Besides Pron., the grammatical categories of the vocabulary include V (verb), P (auxiliary), N (noun), etc.
  • FIG. 4 gives the predetermined grammar rules, in which the grammatical category to the left of the arrow is defined by the grammatical categories 1 and 2 to the right of the arrow.
  • For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP VP). The source language analysis unit 102 refers to these grammar rules during the syntactic analysis of the source language statement.
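As an illustration of the rule format just described (one category on the left of the arrow, two on the right), the following sketch stores binary rules in a small table and looks up the parent category for a pair of adjacent categories. Only the rule S → NP VP comes from the text; the other rules and the names `GrammarRule`, `RULES`, and `combine` are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class GrammarRule(NamedTuple):
    parent: str  # grammatical category to the left of the arrow
    left: str    # grammatical category 1 to the right of the arrow
    right: str   # grammatical category 2 to the right of the arrow


RULES = [
    GrammarRule("S", "NP", "VP"),   # from the text: sentence = NP + VP
    GrammarRule("NP", "N", "P"),    # assumed: noun + auxiliary
    GrammarRule("VP", "NP", "V"),   # assumed: noun phrase + verb
]


def combine(left: str, right: str, rules=RULES) -> Optional[str]:
    """Return the parent category licensed by the rules, or None."""
    for rule in rules:
        if (rule.left, rule.right) == (left, right):
            return rule.parent
    return None


print(combine("NP", "VP"))  # category produced by the rule S -> NP VP
```

A chart parser or generalized LR parser would apply such lookups repeatedly over spans of the word sequence; this sketch shows only the rule table itself.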
  • For example, when the source language analysis unit 102 analyzes the syntactic structure of a Chinese sentence, it can determine that "I" is the subject of the sentence, "am" is the predicate, and "Chinese person" is the object.
  • The source language analysis unit 102 can also assign attribute features such as part of speech, semantics, and concept to the words in the word sequence by referring to a semantic class dictionary, for example Japanese WordNet, Japanese word series, the EDR electronic dictionary, etc.
  • For example, the component "he/pronoun" in the input sentence above can be given the attribute feature "person"; "library" can be given the attribute feature "place" or "building"; "bicycle" can be given the attribute feature "means of transport (vehicle)"; and so on.
  • The semantic dictionary, the vocabulary dictionary, and the grammar rules are all stored in advance in the source language analysis unit 102.
  • The arbitrary lattice determination model storage unit 103 is configured to store the arbitrary lattice determination model. Each model consists of a number, the surface form of the word (the word itself), its part of speech, its semantic classification, and a lattice auxiliary word. The arbitrary lattice determination model is a knowledge base whose main function is to provide the basis for determining whether the input source language statement contains an arbitrary lattice.
  • The arbitrary lattice determination model may be written manually as a set of rules, or may be extracted from learning data by statistical methods according to machine learning principles. Many machine learning methods are available and can be chosen as needed, for example SVM (support vector machine) or decision tree. The invention does not limit the specific implementation of the arbitrary lattice determination model.
  • The arbitrary lattice determination unit 104 is configured to extract the node attribute features in the data structure received from the source language analysis unit 102 and match them against the arbitrary lattice determination model stored in the arbitrary lattice determination model storage unit 103. If they match, it determines that the source language statement contains an arbitrary lattice; if they do not match, it determines that the source language statement contains no arbitrary lattice.
  • FIG. 5 is a schematic diagram of an example of the arbitrary lattice determination model library provided by this embodiment of the present invention.
  • Each arbitrary lattice determination model in the library consists of a number, the surface form of the word (the word itself), its part of speech, its semantic classification, and a lattice auxiliary word.
  • When the arbitrary lattice determination unit 104 matches the node attribute features extracted from the data structure of the source language analysis unit 102 against the models in the library shown in FIG. 5, it can use the pattern [surface + lattice auxiliary word], [semantic classification + lattice auxiliary word], [surface + part of speech + lattice auxiliary word], or [surface + part of speech + semantic classification + lattice auxiliary word] to determine whether the source language statement contains an arbitrary lattice.
  • For example, for the Japanese source language statement meaning "He goes to the library by bicycle", the feature quantities of [bicycle] and of the lattice auxiliary word [⁇] that follows it are extracted first and then matched against the arbitrary lattice determination models; the matching can take various forms.
  • When the attribute of [bicycle] contains only the part of speech noun, [bicycle], the noun attribute, and [⁇] are used as the feature vector and pattern-matched against the models in the library shown in FIG. 5. When the attribute of [bicycle] contains both the noun attribute and the semantic attribute [means of transport], the attribute features formed by [means of transport] and [⁇] alone can be pattern-matched against the models in the library. Both methods match the model numbered 2 in FIG. 5, so [⁇] in [bicycle ⁇] is judged to be an arbitrary lattice.
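The pattern matching described above can be sketched as a field-by-field comparison in which each pattern names the fields that must agree. The library entries below are hypothetical placeholders (FIG. 5 is not reproduced in this text), `AUX_X` stands in for the lattice auxiliary word, and the names `Features`, `ModelEntry`, `PATTERNS`, and `match_model` are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Features:
    surface: str                   # the word itself
    pos: str                       # part of speech
    semantic_class: Optional[str]  # semantic classification, if known
    aux_word: str                  # lattice auxiliary word attached to the word


@dataclass(frozen=True)
class ModelEntry:
    number: int
    features: Features


# The four matching patterns named in the text, as tuples of field names.
PATTERNS = {
    "surface+aux": ("surface", "aux_word"),
    "semantic+aux": ("semantic_class", "aux_word"),
    "surface+pos+aux": ("surface", "pos", "aux_word"),
    "surface+pos+semantic+aux": ("surface", "pos", "semantic_class", "aux_word"),
}

# Hypothetical library contents; not the actual entries of FIG. 5.
MODEL_LIBRARY = [
    ModelEntry(1, Features("scissors", "N", "tool", "AUX_X")),
    ModelEntry(2, Features("bicycle", "N", "vehicle", "AUX_X")),
]


def match_model(node: Features, library, pattern: str) -> Optional[int]:
    """Return the number of the first model matching under the pattern."""
    fields = PATTERNS[pattern]
    for entry in library:
        if all(getattr(node, f) == getattr(entry.features, f) for f in fields):
            return entry.number
    return None


node = Features("bicycle", "N", "vehicle", "AUX_X")
print(match_model(node, MODEL_LIBRARY, "surface+aux"))
print(match_model(node, MODEL_LIBRARY, "semantic+aux"))
```

With these placeholder entries, both the surface-based and the semantic-based pattern select entry 2, mirroring the example in the text. A learned model (SVM, decision tree) would replace the exact comparison with a classifier over the same features.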
  • The arbitrary lattice determination unit 104 includes an extracting module 1041, a reading module 1042, and a matching module 1043.
  • The extracting module 1041 is configured to extract the attribute features from the source language analysis unit 102; the attribute features include part of speech, word meaning, concept, and the like.
  • Specifically, the attribute features of the nouns, lattice auxiliary words, and predicates such as verbs in the sentence are extracted as the attribute features for the arbitrary lattice determination of the source language statement.
  • For example, when the input source language statement is the sentence meaning "He goes to the library by bicycle", the attribute features of components such as [he], [library], [bicycle], and the predicate are extracted.
  • The matching determination module 1042 matches the attribute features of the extracted syntax structure nodes against the arbitrary lattice determination model stored in the arbitrary lattice determination model storage unit 103. If they match, it determines that the source language statement contains an arbitrary lattice; if not, it determines that the source language statement contains no arbitrary lattice.
  • The arbitrary lattice phrase extracting unit 105 is configured to, when the arbitrary lattice determination unit 104 determines that the source language statement contains an arbitrary lattice, extract the node string associated with the arbitrary lattice from the syntactic structure as the arbitrary lattice phrase, and output the extracted arbitrary lattice phrase to the arbitrary lattice phrase translation unit 106.
  • FIG. 6 depicts the syntactic analysis result of the input sentence.
  • The arbitrary lattice phrase translation unit 106 is configured to perform machine translation on the arbitrary lattice phrase extracted by the arbitrary lattice phrase extracting unit 105, and to output the translation result to the translation result integration unit 109.
  • The translation method for this part is flexible and can take various forms: a dedicated arbitrary lattice phrase translation dictionary can be used, a rule-based translation method can be used, or an instance-based or statistical-based machine translation method can be used.
  • The first extracting unit 107 is configured to obtain from the syntax structure the remaining source language statement after the arbitrary lattice phrase is removed, and to output it to the machine translation unit 108.
  • The machine translation unit 108 is configured to perform machine translation on the statement transmitted by the first extracting unit 107, and to output the translation result to the translation result integration unit 109.
  • The machine translation unit 108 is further configured to machine-translate the input source language statement directly when the arbitrary lattice determination unit 104 determines that the analysis result of the source language analysis unit 102 contains no arbitrary lattice phrase, and to output the translation result to the translation result integration unit 109.
  • The machine translation unit 108 may translate incoming statements with a rule-based machine translation system, an instance-based machine translation system, or a statistical-based machine translation system.
  • The translation result integration unit 109 is configured to receive the translation result of the arbitrary lattice phrase translation unit 106 and the translation result of the machine translation unit 108, integrate the two results into a complete target language sentence, and output the generated target language sentence to the target language output unit 110.
  • The translation result integration unit 109 includes a translation result integration module 1091 and an integration comparison module 1092. The translation result integration module 1091 is configured to arrange and combine the translation result of the arbitrary lattice phrase translation unit 106 and the translation result of the machine translation unit 108; specifically, it may order the two parts by using a language model of the target language.
  • The integration comparison module 1092 is configured to compare the probabilities of occurrence of the integration results produced by the translation result integration module 1091, and to output the integration result with a high probability of occurrence to the target language output unit 110.
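The arrangement and comparison performed by modules 1091 and 1092 can be sketched as follows: enumerate the orderings of the translated fragments and keep the one the target language model scores highest. The bigram log-probabilities are invented for illustration, English tokens stand in for the Chinese target text, and the names `BIGRAM_LOGPROB`, `sentence_logprob`, and `integrate` are assumptions, not part of the patent.

```python
from itertools import permutations

# Hypothetical bigram log-probabilities standing in for a target language model.
BIGRAM_LOGPROB = {
    ("<s>", "he"): -0.5, ("he", "goes"): -0.7, ("goes", "to"): -0.3,
    ("to", "the"): -0.4, ("the", "library"): -1.0, ("library", "by"): -1.5,
    ("by", "bicycle"): -0.8, ("bicycle", "</s>"): -0.6,
    ("<s>", "by"): -2.0, ("bicycle", "he"): -6.0, ("library", "</s>"): -0.9,
}
UNSEEN = -10.0  # assumed penalty for bigrams missing from the table


def sentence_logprob(tokens):
    """Score a token sequence with the bigram table, padding with <s> and </s>."""
    padded = ["<s>", *tokens, "</s>"]
    return sum(BIGRAM_LOGPROB.get(pair, UNSEEN) for pair in zip(padded, padded[1:]))


def integrate(fragments):
    """Try every ordering of the fragments and return the most probable one."""
    candidates = [
        [tok for frag in order for tok in frag]
        for order in permutations(fragments)
    ]
    return max(candidates, key=sentence_logprob)


rest = ["he", "goes", "to", "the", "library"]  # translation of the remaining statement
phrase = ["by", "bicycle"]                     # translation of the arbitrary lattice phrase
print(" ".join(integrate([rest, phrase])))
```

With two fragments there are only two orderings to compare. If the remaining statement were split into more components, the number of permutations would grow factorially, so a practical system would likely prune the search.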
  • The target language output unit 110 is configured to receive and output the target language sentence generated by the translation result integration unit 109.
  • The target language sentence can be output in several ways, such as file output or display output.
  • For example, the output can be shown on a display device in the form of an image, printed by a printer, or synthesized by a speech synthesizer. These outputs can be switched between or used at the same time as needed.
  • This embodiment provides a machine translation method. The method comprises: inputting a source language statement; performing lexical analysis and syntax analysis on the source language statement to obtain the syntax structure of the source language statement, and assigning attribute features to the nodes in the syntax structure; matching the attribute features against the stored arbitrary lattice determination model and, if they match, determining that the source language statement contains an arbitrary lattice, and otherwise determining that it does not, wherein the arbitrary lattice determination model provides the model basis for deciding whether the source language statement contains an arbitrary lattice; obtaining the arbitrary lattice phrase in the syntax structure according to the arbitrary lattice found by matching, and performing machine translation on the arbitrary lattice phrase; obtaining the remaining source language statement after the arbitrary lattice phrase is removed, and performing machine translation on the remaining source language statement; arranging and combining the translation results of the arbitrary lattice phrase and of the remaining source language statement, a combination with a high probability
  • Step S01: input the source language statement and store it in a memory unit or buffer of the computer's memory. As needed, various input devices can be used to input the source language statement, including a pointing device, a keyboard, a handwritten character recognition device, an optical character recognition device, a voice recognition device, or an input device that reads a text file or a database.
  • In this embodiment, the input source language statement is a Japanese sentence meaning "He goes to the library by bicycle" and the target language is Chinese, taken as an example.
  • The translation method of the present invention is not limited to Japanese-to-Chinese translation.
  • Step S02: perform lexical analysis on the source language statement to obtain its word sequence, perform syntax analysis on the word sequence to obtain the syntax structure of the source language statement, assign attribute features to the nodes in the syntax structure, and output the attribute features and the syntax structure as the analysis result.
  • Any general lexical analysis technique can be used for the lexical analysis of source language statements. One example is maximizing the division probability by dynamic programming with a word division model: the source language statement is divided into words according to the word division model using dynamic programming, and the most probable division is selected as the final output word sequence.
  • Existing lexical analysis tools can be used to perform lexical analysis on the input source language statements, including the Stanford Parser, the ICTCLAS analysis system of the Chinese Academy of Sciences, ChaSen, etc.
  • Any syntactic analysis method, such as chart parsing or generalized LR parsing, can be used for the syntactic analysis of source language statements.
  • Existing syntactic analysis tools can be used for syntactic analysis, including CaboCha and KNP for Japanese, etc.
  • In the word sequence 202 (FIG. 2), the symbol "." marks the breakpoint between one word and the next.
  • The breakpoint marker is not fixed; a space can also be used.
  • A vocabulary dictionary and preset grammar rules are used to assign attribute features to the nodes in the syntax structure. The syntax structure contains the grammatical category of each word and the nodes associated with one another. FIG. 3 gives an example of the grammatical categories of the words in the word sequence 202 shown in FIG. 2.
  • The vocabulary dictionary associates words with their grammatical categories. For example, the Japanese word 301 (the pronoun "he") is associated with the grammatical category Pron. (pronoun). Besides Pron., the grammatical categories of the vocabulary include V (verb), P (auxiliary), N (noun), etc.
  • FIG. 4 gives the predetermined grammar rules, in which the grammatical category to the left of the arrow is defined by the grammatical categories 1 and 2 to the right of the arrow.
  • For example, a sentence (grammatical category S) consists of a noun phrase and a verb phrase (grammatical categories NP VP). The source language analysis unit 102 refers to these grammar rules during the syntactic analysis of the source language statement.
  • For example, when the source language analysis unit 102 analyzes the syntactic structure of a Chinese sentence, it can determine that "I" is the subject of the sentence, "am" is the predicate, and "Chinese person" is the object.
  • The semantic dictionary, the vocabulary dictionary, and the grammar rules are all stored in advance in the source language analysis unit 102.
  • Step S03: extract attribute features such as words, part of speech, semantic classification, and concepts from the analysis result. Specifically, the attribute features of the nouns, lattice auxiliary words, and predicates such as verbs in the sentence are extracted as the attribute features of the source language statement.
  • For example, for the input source language statement meaning "He goes to the library by bicycle", the components [he], [library], and [bicycle] and the predicate, together with information such as their surface forms, parts of speech, and semantic classifications, are used as the attribute features for arbitrary lattice determination.
  • Step S04: match the attribute features of the extracted syntax structure nodes against the stored arbitrary lattice determination model. If they match, it is determined that the source language statement contains an arbitrary lattice, and S05 is executed; if they do not match, it is determined that the source language statement contains no arbitrary lattice, and S08 is executed.
  • The arbitrary lattice determination model consists of a number, the surface form of the word (the word itself), its part of speech, its semantic classification, and a lattice auxiliary word. It is a kind of knowledge base whose main function is to provide the basis for determining whether the input source language statement contains an arbitrary lattice.
  • The arbitrary lattice determination model may be written manually as a set of rules, or may be extracted from learning data by statistical methods according to machine learning principles. Many machine learning methods are available and can be chosen as needed, for example SVM (support vector machine) or decision tree. The present invention does not limit the specific implementation of the arbitrary lattice determination model.
  • Matching the attribute features of the extracted syntax structure nodes against the stored arbitrary lattice determination model includes matching the extracted attribute features against the models in the arbitrary lattice determination model library shown in FIG. 5. The pattern [surface + lattice auxiliary word], [semantic classification + lattice auxiliary word], [surface + part of speech + lattice auxiliary word], or [surface + part of speech + semantic classification + lattice auxiliary word] can be used, among other forms, to match the node attribute features extracted from the data structure of the source language analysis unit 102 and so determine whether the source language statement contains an arbitrary lattice.
  • For example, for the source language statement meaning "He goes to the library by bicycle", the feature quantities of [bicycle] and of the lattice auxiliary word [⁇] that follows it are extracted first and then matched against the arbitrary lattice determination models in the library shown in FIG. 5; the matching can take various forms.
  • When the attribute of [bicycle] contains only the part of speech noun [n], [bicycle], [n], and [⁇] are used as the feature vector and pattern-matched against the models in the library shown in FIG. 5. When the attribute of [bicycle] contains both the noun attribute [n] and the semantic attribute [means of transport], the attribute features formed by [means of transport] and [⁇] alone can be pattern-matched against the models in the library shown in FIG. 5. Both methods match the model numbered 2 in FIG. 5, so [⁇] in [bicycle ⁇] is judged to be an arbitrary lattice.
  • Step S05: extract the node string associated with the arbitrary lattice from the syntax structure; perform step S06 on the extracted arbitrary lattice phrase, and perform step S07 on the part that remains after the arbitrary lattice phrase is removed.
  • FIG. 6 depicts the syntactic analysis result of the input sentence.
  • Step S06: perform machine translation on the extracted arbitrary lattice phrase, then execute step S08.
  • The translation method for this part is flexible and can take various forms. For example, corresponding phrase pairs can be extracted from a large-scale corpus and used as a dedicated translation dictionary, a rule-based translation method can be used to translate the arbitrary lattice phrase, or an instance-based or statistical-based machine translation method can be used.
  • Step S07: perform machine translation on the remaining part of the source language statement after the arbitrary lattice phrase is removed.
  • Specifically, the remaining sentence components of the source language statement after the arbitrary lattice phrase is removed are arranged and combined, and the combination with the highest probability of occurrence is machine-translated.
  • The machine translation method in this step is not specifically limited; it may be a rule-based machine translation system, an instance-based machine translation system, or a statistical-based machine translation system.
  • For an instance-based translation system, the translation of a string is based on examples, and the similarity between the string and the example is used as the translation score. For a statistical-based translation system, the translation of a string is based on the translation model and the language model, and the resulting translation probabilities are used as the translation score. For a rule-based translation system, the translation of a string is based on the syntax and the rules applied, and the credibility of the syntax and the preference of the rules are used to obtain the translation score.
  • Step S08 integrating the translation results of steps S06 and S07;
  • the two translation results are permuted and combined, and a combination with a high probability of occurrence is selected as the integration result and output.
  • the function of the machine translation integration step S08 is to integrate the translation results of step S06 and step S07. If the Japanese-to-Chinese translation results are "He goes to the library" and "Bicycle", the language model of the target language can be used to order the two parts. Provided that the quality and scale of the Chinese corpus used to build the language model are adequate, the calculation shows that "He goes to the library by bicycle" has the greatest probability. The processing result of step S08 is then passed to the target language output step S09.
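The ordering idea in step S08 can be illustrated with a small sketch. It assumes a toy add-one-smoothed bigram model trained on three invented English sentences; the corpus, the function names and the use of English in place of the Chinese target language are all assumptions made for illustration, and a real system would use a language model trained on a large corpus.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])            # contexts only
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(unigrams) | {"</s>"})
    return unigrams, bigrams, vocab_size


def log_prob(tokens, model):
    """Add-one-smoothed bigram log-probability of a token sequence."""
    unigrams, bigrams, vocab_size = model
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(padded, padded[1:])
    )


def integrate(main, phrase, model):
    """Try the phrase at every position of the main translation
    and keep the candidate that the language model scores highest."""
    candidates = [main[:i] + phrase + main[i:] for i in range(len(main) + 1)]
    return max(candidates, key=lambda c: log_prob(c, model))


# Invented toy corpus standing in for the target-language training data.
toy_corpus = [
    "he goes to the library by bicycle".split(),
    "she goes to school by bus".split(),
    "he goes to the park".split(),
]
model = train_bigram(toy_corpus)
result = integrate("he goes to the library".split(), "by bicycle".split(), model)
```

With this toy corpus the candidate that places the phrase at the end of the sentence receives the highest score. The same scoring function could also rank the permutations of sentence components described for step S07.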
  • Step S09 outputting the integration result obtained in step S08 as the final target language;
  • the output can take various forms, such as a display, a text file, or voice output; for example, the result can be shown on a display device in the form of an image, printed by a printer, or synthesized by a speech synthesizer. These outputs can be switched between or used at the same time as needed.
  • FIG. 9 is a schematic diagram of a parallel corpus segmentation method for statistics-based machine translation in the embodiment of the present invention. As shown in FIG. 9, the parallel corpus segmentation is mainly performed by the parallel corpus segmentation unit 210, which can use the arbitrary lattice judgment model to examine the sentences in the corpus and readily obtain two parts, the arbitrary lattice phrases and the sentences with the arbitrary lattice phrases removed, thereby completing the segmentation of the original parallel corpus.
  • the purpose of this processing is to construct the translation model and the language model for statistical machine translation; the two parts of the corpus can be used flexibly as needed.
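As a rough illustration of this segmentation, the sketch below splits each aligned sentence pair into a phrase pair and a remainder pair. It assumes the token spans of the arbitrary lattice phrase on both sides are already known (for example from the judgment model plus a word alignment); the span format, the function names and the English glosses are hypothetical and not taken from FIG. 9.

```python
def segment_pair(src, tgt, src_span, tgt_span):
    """Split one aligned sentence pair at the arbitrary lattice phrase.

    Spans are half-open (start, end) token ranges, assumed to come from
    an upstream judgment and alignment step.
    """
    (s0, s1), (t0, t1) = src_span, tgt_span
    phrase_pair = (src[s0:s1], tgt[t0:t1])
    remainder_pair = (src[:s0] + src[s1:], tgt[:t0] + tgt[t1:])
    return phrase_pair, remainder_pair


def segment_corpus(corpus):
    """Return (phrase_corpus, remainder_corpus) for a parallel corpus."""
    phrase_corpus, remainder_corpus = [], []
    for src, tgt, src_span, tgt_span in corpus:
        if src_span is None or tgt_span is None:
            # No arbitrary lattice phrase found: keep the pair whole.
            remainder_corpus.append((src, tgt))
            continue
        phrase_pair, remainder_pair = segment_pair(src, tgt, src_span, tgt_span)
        phrase_corpus.append(phrase_pair)
        remainder_corpus.append(remainder_pair)
    return phrase_corpus, remainder_corpus


# Invented example pairs; the source side uses English glosses.
corpus = [
    (["he", "bicycle", "by", "library", "to", "go"],
     ["he", "goes", "to", "the", "library", "by", "bicycle"],
     (1, 3), (5, 7)),
    (["he", "go"], ["he", "goes"], None, None),
]
phrase_corpus, remainder_corpus = segment_corpus(corpus)
```

The two resulting corpora could then be used separately to train phrase-level and sentence-level models.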
  • FIG. 10 is a schematic diagram of a training method of a statistics-based machine translation apparatus according to an embodiment of the present invention.
  • the function of the language model/translation model construction unit 310 in the training method is to construct the translation model and the language model; conventional tools such as GIZA++ and SRILM can be used.
  • FIG. 11 is a schematic diagram of a training method of a statistics-based machine translation apparatus according to an embodiment of the present invention.
  • the training corpus is a source-target language parallel corpus from which the arbitrary lattice phrases have been removed.
  • the statement, and the two simple sentences are translated separately; the translation results are then integrated, and the combination with a large combined probability is selected as the translation result. This reduces the complexity of the syntactic structure of the source language, improves the efficiency of generating the sentence structure and grammar of the target language, improves translation accuracy, and reduces the amount of machine translation decoding computation, providing an effective device and method for machine translation research.
  • All or part of the technical solutions provided by the above embodiments may be implemented by software programming, with the software program stored in a computer-readable storage medium such as a hard disk, an optical disk or a floppy disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a machine translation method and a machine translation device applicable to the field of natural language processing. The device comprises: a source language input unit, which inputs source language sentences; a source language analysis unit, which performs morphological and syntactic analysis to obtain the syntactic structure and sets attribute features for the nodes of the syntactic structure; an arbitrary lattice judgment model storage unit, which stores the arbitrary lattice judgment model; an arbitrary lattice judgment unit, which determines whether the sentence contains an arbitrary lattice; an arbitrary lattice phrase extraction unit, which obtains arbitrary lattice phrases; an arbitrary lattice phrase translation unit, which translates arbitrary lattice phrases; a first extraction unit, which obtains the remaining part of the source language sentence; a machine translation unit, which translates the remaining part of the source language sentence; a translation result integration unit, which integrates the translation results to obtain the target language; and a target language output unit, which outputs the target language. The invention reduces the complexity of the syntactic structure of the source language and increases the efficiency of target language generation, thereby improving translation accuracy and appropriately reducing the decoding workload of machine translation.
PCT/CN2010/079963 2010-12-17 2010-12-17 Procédé et dispositif de traduction automatique WO2012079257A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2010/079963 WO2012079257A1 (fr) 2010-12-17 2010-12-17 Procédé et dispositif de traduction automatique
CN201080070253.6A CN103314369B (zh) 2010-12-17 2010-12-17 机器翻译装置和方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/079963 WO2012079257A1 (fr) 2010-12-17 2010-12-17 Procédé et dispositif de traduction automatique

Publications (1)

Publication Number Publication Date
WO2012079257A1 (fr) 2012-06-21

Family

ID=46243999

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/079963 WO2012079257A1 (fr) 2010-12-17 2010-12-17 Procédé et dispositif de traduction automatique

Country Status (2)

Country Link
CN (1) CN103314369B (fr)
WO (1) WO2012079257A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320650A (zh) * 2014-07-31 2016-02-10 崔晓光 一种机器翻译方法及其系统
CN111241245A (zh) * 2020-01-14 2020-06-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268132B (zh) * 2014-09-11 2017-04-26 北京交通大学 机器翻译方法及系统
CN104268133B (zh) * 2014-09-11 2018-02-13 北京交通大学 机器翻译方法及系统
AU2014409115A1 (en) * 2014-10-17 2017-04-27 Mz Ip Holdings, Llc System and method for language detection
CN104391842A (zh) * 2014-12-18 2015-03-04 苏州大学 一种翻译模型构建方法和系统
CN110175338B (zh) * 2019-05-31 2023-09-26 北京金山数字娱乐科技有限公司 一种数据处理方法及装置
CN111104796B (zh) * 2019-12-18 2023-05-05 北京百度网讯科技有限公司 用于翻译的方法和装置
CN112613326B (zh) * 2020-12-18 2022-11-08 北京理工大学 一种融合句法结构的藏汉语言神经机器翻译方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1156287A (zh) * 1995-09-11 1997-08-06 松下电器产业株式会社 机器翻译用中文生成装置
JP2827321B2 (ja) * 1989-09-18 1998-11-25 日本電気株式会社 日本語から中国語への機械翻訳方式
CN1308748A (zh) * 1998-05-04 2001-08-15 特雷道斯股份有限公司 机器辅助翻译工具
CN1407483A (zh) * 2001-09-04 2003-04-02 优网通国际资讯股份有限公司 文本表达方法及系统以及文本翻译方法及系统
CN1595398A (zh) * 2003-09-09 2005-03-16 株式会社国际电气通信基础技术研究所 选择改良多个候补译文所生成的最优译文的机器翻译系统
CN101593174A (zh) * 2009-03-11 2009-12-02 林勋准 一种机器翻译方法及系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2827321B2 (ja) * 1989-09-18 1998-11-25 日本電気株式会社 日本語から中国語への機械翻訳方式
CN1156287A (zh) * 1995-09-11 1997-08-06 松下电器产业株式会社 机器翻译用中文生成装置
CN1308748A (zh) * 1998-05-04 2001-08-15 特雷道斯股份有限公司 机器辅助翻译工具
CN1407483A (zh) * 2001-09-04 2003-04-02 优网通国际资讯股份有限公司 文本表达方法及系统以及文本翻译方法及系统
CN1595398A (zh) * 2003-09-09 2005-03-16 株式会社国际电气通信基础技术研究所 选择改良多个候补译文所生成的最优译文的机器翻译系统
CN101593174A (zh) * 2009-03-11 2009-12-02 林勋准 一种机器翻译方法及系统

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320650A (zh) * 2014-07-31 2016-02-10 崔晓光 一种机器翻译方法及其系统
CN111241245A (zh) * 2020-01-14 2020-06-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备
CN111241245B (zh) * 2020-01-14 2021-02-05 百度在线网络技术(北京)有限公司 人机交互处理方法、装置及电子设备

Also Published As

Publication number Publication date
CN103314369A (zh) 2013-09-18
CN103314369B (zh) 2015-08-12

Similar Documents

Publication Publication Date Title
WO2012079257A1 (fr) Procédé et dispositif de traduction automatique
KR101130444B1 (ko) 기계번역기법을 이용한 유사문장 식별 시스템
US9697477B2 (en) Non-factoid question-answering system and computer program
CN103970798B (zh) 数据的搜索和匹配
US20100121630A1 (en) Language processing systems and methods
WO2010046782A2 (fr) Traduction automatique hybride
Karim Technical challenges and design issues in bangla language processing
Abdurakhmonova et al. Linguistic functionality of Uzbek Electron Corpus: uzbekcorpus. uz
Zakharov Corpora of the Russian language
Peng et al. Research on tree kernel-based personal relation extraction
JP5722375B2 (ja) 文末表現変換装置、方法、及びプログラム
Soumya et al. Development of a POS tagger for Malayalam-an experience
Nguyen et al. A tree-to-string phrase-based model for statistical machine translation
Monga et al. Speech to Indian Sign Language Translator
CN105045784A (zh) 英语词句的存取装置方法和装置
JP4478042B2 (ja) 頻度情報付き単語集合生成方法、プログラムおよびプログラム記憶媒体、ならびに、頻度情報付き単語集合生成装置、テキスト索引語作成装置、全文検索装置およびテキスト分類装置
JP4940251B2 (ja) 文書処理プログラム及び文書処理装置
JP6145011B2 (ja) 文正規化システム、文正規化方法及び文正規化プログラム
JP3050743B2 (ja) 言語データベースの形態素列変換装置
JP2019087058A (ja) 文章中の省略を特定する人工知能装置
España-Bonet et al. Going beyond zero-shot MT: combining phonological, morphological and semantic factors. The UdS-DFKI System at IWSLT 2017
Tsai et al. Applying an NVEF Word-Pair Identifier to the Chinese Syllable-to-Word Conversion Problem
Samir et al. Training and evaluation of TreeTagger on Amazigh corpus
JP2014134871A (ja) 質問応答用検索キーワード生成方法、装置、及びプログラム
JPH11338863A (ja) 未知名詞および表記ゆれカタカナ語自動収集・認定装置、ならびにそのための処理手順を記録した記録媒体

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10860630

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 07/10/2013)

122 Ep: pct application non-entry in european phase

Ref document number: 10860630

Country of ref document: EP

Kind code of ref document: A1