WO2011047608A1 - Method for forming a patterned bilingual sentence pair, and corresponding forming apparatus

Info

Publication number
WO2011047608A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
unit
translation
sentence
forming
Prior art date
Application number
PCT/CN2010/077772
Other languages
English (en)
Chinese (zh)
Inventor
张龙哺
Original Assignee
北京东方爱译科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京东方爱译科技有限责任公司
Publication of WO2011047608A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/47 Machine-assisted translation, e.g. using translation memory

Definitions

  • The present invention relates to the establishment and accumulation of intelligent translation knowledge in the field of computer translation technology, and more particularly to a method for forming a patterned bilingual sentence pair and a forming apparatus therefor.
  • Background of the invention
  • Translation memory (TM) technology
  • Figure 1A shows a conventional translation scheme using TM translation technology.
  • The TM translation mode compares (matches) the input original sentence with the original-text part of each bilingual sentence pair in the corpus. If the match is exact, or the specified match rate is satisfied, the translation part of that bilingual sentence pair is output as the TM translation result.
  • Figure 1B shows an example of a sentence pair recorded by the conventional sentence-pair recording method: the original text is recorded in the left part and the translation in the right part, separated by a separator.
  • The original text and the translation are regular text content, namely words, punctuation marks, and so on.
  • The usefulness of this kind of sentence pair is very limited: it can give an accurate translation only for an identical sentence, and cannot give an accurate translation for similar sentences.
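The conventional TM lookup described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the tab separator, the example sentence, the function names, and the use of `difflib.SequenceMatcher` as the "match rate" are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

SEPARATOR = "\t"  # assumed separator between original text and translation


def parse_pair(line, sep=SEPARATOR):
    """Split one recorded line into (original, translation)."""
    original, translation = line.split(sep, 1)
    return original.strip(), translation.strip()


def tm_translate(source, pairs, min_rate=1.0):
    """Return the stored translation of the best-matching pair, or None.

    min_rate=1.0 demands an exact match; a lower value also accepts
    similar originals, but the stored translation is returned unchanged.
    """
    best_rate, best_translation = 0.0, None
    for original, translation in pairs:
        rate = SequenceMatcher(None, source, original).ratio()
        if rate > best_rate:
            best_rate, best_translation = rate, translation
    return best_translation if best_rate >= min_rate else None


# Illustrative corpus with a single conventional sentence pair.
corpus = [parse_pair("He bought a gold watch.\t他买了一块金表。")]
exact = tm_translate("He bought a gold watch.", corpus)
fuzzy = tm_translate("She bought a gold watch.", corpus, min_rate=0.8)
```

In this sketch `fuzzy` receives exactly the same stored translation as `exact`, although the subject of the input sentence has changed. That is the limitation the text points out: a plain sentence pair has no way to adapt its translation to a similar sentence.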
  • Improvements to TM have emerged in recent years, such as the use of sentence patterns in TM solutions, the aim being for the sentence patterns stored in the sentence library to cover more sentences.
  • The principle is to abstract a translated example sentence into a sentence pattern.
  • During translation, the sentence to be translated is first parsed and abstracted into a grammatical tree structure, and the matching sentence pattern is then used to produce the translation of the sentence to be translated.
  • This approach in effect returns to the old path of traditional machine translation (MT) technology.
  • The first reason is that abstracting an example sentence into a grammatical sentence pattern is time-consuming and laborious, and cannot be automated. To date, no practical methods or tools for accumulating sentence patterns have appeared.
  • The inventor of the present application, based on years of research on how the human brain thinks during translation and on foreign-language learning and memory, has proposed a system that simulates human memory for storing translation knowledge, namely the Bodian intelligent knowledge base system, and the corresponding super intelligent computer translation technology (TM++).
  • In this system, the sentence pair is neither a simple original-plus-translation form nor an abstract sentence pattern as described above, but an instance-based patterned sentence pair.
  • The advantages of the patterned sentence pair for translation theory are: 1. it makes complex, abstract grammar concrete and instantiated, which is easier to understand and implement; 2. it integrates the translation example and the translation sentence pattern, which preserves both the uniqueness of the specific translated sentence pair and the generality of the translation sentence pattern.
  • Figures 2A-2C and 3A-3C show some examples of instance-based patterned sentence pairs (referred to as patterned sentence pairs).
  • The object of the present invention is to provide a method of forming a patterned sentence pair and a forming apparatus therefor.
  • With them, patterned sentence pairs can be formed and accumulated quickly and efficiently.
  • The intelligent knowledge base can be accumulated by all users: the machine automatically generates and accumulates intelligent translation knowledge while the user translates. This frees translation software from the traditional dependence on language experts to formulate translation rules or sentence patterns and on software professionals to write or update them, and will greatly accelerate the growth of the intelligent knowledge base. It therefore provides a viable technical solution for the early realization of high-quality fully automatic machine translation.
  • Figure 1A is a block schematic diagram of a conventional TM computer translation technology solution.
  • Figure 1B shows an example of a traditional sentence pair.
  • Figures 2A-2C and 3A-3C show examples of patterned sentence pairs in the present invention.
  • Figure 4 shows an example of additional information for a patterned sentence pair.
  • Figure 5 is a flow chart showing a first embodiment of a method of forming a patterned sentence pair of the present invention.
  • Figure 6 is a flow chart showing a second embodiment of the method for forming a patterned sentence pair of the present invention.
  • Figure 7 is a flow chart showing a third embodiment of the method of forming a patterned sentence pair of the present invention.
  • Figure 8 is a flow chart showing a fourth embodiment of the method for forming a patterned sentence pair of the present invention.
  • Figure 9 is a flow chart showing a fifth embodiment of the method of forming a patterned sentence pair of the present invention.
  • Figure 10 is a flow chart showing a sixth embodiment of a method of forming a patterned sentence pair of the present invention.
  • Figure 11 is a block diagram showing a first embodiment of the patterned sentence pair forming apparatus of the present invention.
  • Figure 12 is a block diagram showing a second embodiment of the patterned sentence pair forming apparatus of the present invention.
  • Figure 13 shows the user interface of a patterned bilingual sentence pair forming apparatus of the present invention.
  • Fig. 14 schematically shows an example of a word unit.
  • The specific embodiments of the present invention will be described in detail below with reference to the drawings.
  • Implementation
  • A bilingual sentence pair includes: a source sentence expressed in a first language (referred to as the first-language original sentence), and a corresponding target sentence in a second language (referred to as the second-language translated sentence).
  • The first-language original sentence is sometimes referred to more simply as the original text, and the second-language translated sentence is sometimes referred to more simply as the translation, because the second-language translated sentence is usually the translation result of the first-language original sentence.
  • The original text or original sentence may be a simple sentence, a complex sentence, a phrase, a short sentence, and so on.
  • The original sentence referred to in this application is not limited in length or structure.
  • The method for forming a patterned sentence pair of the present invention can be used in a computer translation system, and is particularly useful for forming and maintaining the sentence library of a computer translation system. It can also be used in other fields, such as corpus collection and collation.
  • Figures 2-3 illustrate various ways of recording patterned sentence pairs in the present invention.
  • A patterned sentence pair has a first part, the original text (that is, the first language), and a second part, the translation (that is, the second language).
  • The first part and the second part may be in the same file. For example, the first part and the second part are on the same line, separated by a specific separator, as shown in FIG. 2A; or the first part and the second part are on two adjacent lines, for example the first part on an odd line and the second part on an even line, as shown in FIG. 2B.
  • Alternatively, the first part and the second part may each exist in a separate file, with a correspondence between the first part and the second part of the same sentence pair, for example being in the same row.
  • The first part and the second part can also be in the same table.
  • For example, the first part and the second part are in different column cells of the same row, as shown in Fig. 3A.
  • Or the first part and the second part are in two adjacent rows, for example the first part in an odd row and the second part in an even row, as shown in Fig. 3B.
  • Alternatively, each of the first part and the second part may exist in a separate table, with a correspondence between the first part and the second part of the same sentence pair, for example being in the same row.
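The file layouts described above can be read with a few small helpers. The following Python sketch is only an illustration: the function names and the tab separator are assumptions, and blank lines are simply skipped.

```python
def read_same_line(lines, sep="\t"):
    """Same-line layout: both parts on one line, split by a separator."""
    pairs = []
    for line in lines:
        if line.strip():
            first, second = line.split(sep, 1)
            pairs.append((first.strip(), second.strip()))
    return pairs


def read_alternating(lines):
    """Adjacent-line layout: odd lines hold the first part, even lines the second."""
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 2:
        raise ValueError("odd number of lines: a sentence pair is incomplete")
    return list(zip(rows[0::2], rows[1::2]))


def read_parallel(first_lines, second_lines):
    """Separate files: the same row index links the two parts."""
    if len(first_lines) != len(second_lines):
        raise ValueError("the two files are not aligned row by row")
    return [(a.strip(), b.strip()) for a, b in zip(first_lines, second_lines)]


# The same (illustrative) pair stored in each of the three layouts.
same_line = read_same_line(["He bought a gold watch.\t他买了一块金表。"])
alternating = read_alternating(["He bought a gold watch.", "他买了一块金表。"])
parallel = read_parallel(["He bought a gold watch."], ["他买了一块金表。"])
```

All three readers return the same list of `(first part, second part)` tuples, so the rest of a system can stay independent of how the sentence library is stored.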
  • The patterned sentence pair of the present invention has conventional units and patterning units in at least one of the first part and the second part. In a patterning unit, the content of the unit in its own language and the information of the corresponding unit in the other language are recorded in a predetermined format.
  • A patterning unit is used in addition to the conventional unit.
  • The conventional unit is an immutable part, that is, conventional translated text, for example "Yes", "Buy", "One Piece" in Figures 2-3 ("buy a piece" may also be regarded as a single conventional unit or conventional unit block).
  • The patterning unit is a part that can be replaced, that is, this part of the translation can be replaced by other content, for example, in Figures 2-3: ⁇ he
  • A sentence pair having patterning units is referred to as a patterned sentence pair.
  • The number of conventional units and patterning units and the positional relationship between them may be arbitrary; they are determined by the structure of the sentence and the needs of translation.
  • A patterned sentence pair typically has one or more conventional units and one or more patterning units.
  • Conventional units and patterning units may alternate with each other, or several conventional units or several patterning units may be adjacent to one another.
  • A patterned sentence pair may also consist entirely of patterning units.
  • The patterning unit has a predetermined format.
  • The purpose of using the predetermined format is to enable the translation unit in the patterning unit to be replaced.
  • The patterning unit can include, as needed, information such as the corresponding original unit, part of speech, attribute, and serial number in the sentence, so that replacement is accurate and appropriate.
  • An example of a patterning unit is as follows: "He I he I pronoun
  • The various pieces of information in the patterning unit can be separated by a specific separator, such as the characters "1", ", or spaces, or tabs, etc.
  • The purpose is to allow the patterned sentence pairs to be better identified and processed when translating.
  • Each patterning unit can be marked by a specific symbol pair, such as: " ⁇ " and " ⁇ ", " ⁇ " and "/ ⁇ ", etc., so that the patterning unit can be easily identified.
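To make the idea of a delimited patterning unit concrete, here is a minimal Python sketch of one possible encoding. The symbol pair `{` and `}`, the `|` separator, the field order, and the example sentence are assumptions chosen for illustration; the symbols actually used in the patent's figures are not recoverable from this text.

```python
import re
from dataclasses import dataclass

OPEN, CLOSE, SEP = "{", "}", "|"  # assumed markers, not the patent's own


@dataclass
class PatterningUnit:
    translation: str  # content of the translation unit
    original: str     # corresponding original unit
    pos: str          # part of speech of the original unit
    serial: int       # serial number of the original word in its sentence

    def encode(self):
        fields = [self.translation, self.original, self.pos, str(self.serial)]
        return OPEN + SEP.join(fields) + CLOSE


UNIT_RE = re.compile(re.escape(OPEN) + "(.*?)" + re.escape(CLOSE))


def split_units(part):
    """Split one part of a sentence pair into conventional text and patterning units."""
    items, cursor = [], 0
    for match in UNIT_RE.finditer(part):
        if match.start() > cursor:
            items.append(part[cursor:match.start()])  # conventional unit
        translation, original, pos, serial = match.group(1).split(SEP)
        items.append(PatterningUnit(translation, original, pos, int(serial)))
        cursor = match.end()
    if cursor < len(part):
        items.append(part[cursor:])
    return items


def join_units(items):
    """Inverse of split_units: write the items back as one string."""
    return "".join(i.encode() if isinstance(i, PatterningUnit) else i for i in items)


line = "{他|he|pronoun|0}买了一块{金表|gold_watch|noun|3}。"
items = split_units(line)
```

Because the units are delimited and their fields separated, a translation step can swap the `translation` field of a unit (for example replacing 他 with another pronoun) while leaving the conventional text between units untouched.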
  • The second part, the translation, has a patterning unit: ⁇ he
  • Although the patterning unit is not explicitly marked in the first part, those words or phrases of the first part that are referred to by the patterning units in the second part, the translation, are implicitly replaceable. They are the 0th "he", the 5th "his", the 6th "wife", and the 3rd "gold_watch".
  • In the first part, the original text, patterning units may also be recorded.
  • There, the conventional unit is an immutable part, that is, conventional text.
  • The patterning unit is a part that can be replaced, that is, this part of the original text can be replaced by other content.
  • The number of conventional units and patterning units can be arbitrary, depending on the structure of the sentence and the needs of translation.
  • The patterning unit is recorded in a predetermined format.
  • The purpose of the patterning is to enable the original text unit in the patterning unit to be replaced.
  • The patterning unit can include, as needed, information such as the part of speech and attributes of the word or phrase, so that replacement is accurate and appropriate.
  • The additional information in a patterning unit in the first part, the original text, is preferably complementary to the additional information in the patterning unit in the second part, the translation.
  • A patterning unit in the first part, the original text, is preferably generated at the same time as the corresponding patterning unit in the second part, the translation.
  • An example of labeling, in the first part (the original text), a word or phrase that can be replaced by a patterning unit is shown in Figures 2B and 2C.
  • the labeling method is as follows: ⁇ he
  • Other labeling methods may be used, provided they facilitate identification and replacement.
  • In a patterning unit of the original text, the original unit content and the corresponding translation unit information may also be recorded in a predetermined format.
  • The translation unit information includes the content of the translation unit and information such as the part of speech, attribute, or serial number of the translation unit, or any combination of the above.
  • Additional information can also be recorded in a patterned sentence pair, such as the total number of units of the sentence, the modification mark, the quality level, the user name, the update date, the language number, etc., as shown in FIG. 4.
  • The additional information may be placed at the beginning, the end, or another location of the patterned sentence pair, as long as it is associated with that patterned sentence pair.
  • 01" in Fig. 4 is a specific example of the attachment information.
  • The patterned sentence pair of the present invention is both a translation example and a translation pattern. It therefore retains both the uniqueness of the specific translated sentence pair and the generality of the translation pattern. With such patterned sentence pairs it is possible to perform conventional matching translation of the input original sentence, so as to guarantee the specific translation required by a specific sentence; pattern-matching translation of the input original sentence; and more advanced intelligent translation. For details, refer to the other related inventions of the present applicant.
  • The method for forming a patterned sentence pair of the present invention does not require abstraction of the translated bilingual sentence pair (the abstraction operation requires a great deal of grammatical analysis and induction and a large number of rules); it only needs to add some information that already exists, so the method is easily implemented by a computer.
  • An interactive translation (IT) module or a computer-aided translation (CAT) module is used to collect the information needed by the patterning units, form the required patterning units, and write them into the patterned sentence pair.
  • Referring to Figure 5, in the first embodiment the method for forming a patterned bilingual sentence pair includes the following steps. Step S1: select a word in the original sentence.
  • The word can be a single word, a word group, or a phrase.
  • Step S2: determine whether the grammatical attribute of the word meets the replaceable-word condition.
  • The replaceable-word condition may be specified and judged according to part of speech. For example, nouns, adjectives, pronouns, numerals, etc. are predetermined as replaceable words. Then, if the part of speech of a word is noun, adjective, pronoun, or numeral, the grammatical attribute of the word meets the replaceable-word condition.
  • The replaceable-word condition can also be specified and judged according to the attribute of the word. For example, a word whose attribute is "object", "person", "time", or "place" is defined as a replaceable word.
  • If the result of the determination is "YES", step S3 is executed: the identification information of the word and the translation content of the word are combined into a patterning unit, which is written to the translation part.
  • The identification information may include, as needed, the content of the original text unit and information such as the part of speech, attribute, or serial number of the original text unit, or any combination of the above. See the description of the patterning unit above for more details.
  • If the result of the determination is "NO", step S4 is executed: the translation content of the word is written to the translation part.
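Steps S1 to S4 can be sketched in a few lines of Python. This is a simplified illustration rather than the patent's implementation: the word records, the `{translation|original|pos|serial}` notation for a patterning unit, and the decision to accept a word if either the part-of-speech rule or the attribute rule matches are all assumptions.

```python
REPLACEABLE_POS = {"noun", "adjective", "pronoun", "numeral"}
REPLACEABLE_ATTRS = {"object", "person", "time", "place"}


def is_replaceable(word):
    """Step S2: test the word's grammatical attribute against the condition."""
    return word["pos"] in REPLACEABLE_POS or word.get("attr") in REPLACEABLE_ATTRS


def make_unit(word):
    """Step S3: combine the translation content with the identification information."""
    return "{%s|%s|%s|%d}" % (
        word["translation"], word["text"], word["pos"], word["serial"])


def form_translation_part(words):
    """Run S1 to S4 over the words, given in the order they appear in the translation."""
    out = []
    for word in words:                       # S1: select a word
        if is_replaceable(word):             # S2: replaceable?
            out.append(make_unit(word))      # S3: write a patterning unit
        else:
            out.append(word["translation"])  # S4: write the plain translation
    return "".join(out)


# Hypothetical word records for an illustrative sentence.
words = [
    {"text": "he", "translation": "他", "pos": "pronoun", "serial": 0},
    {"text": "bought", "translation": "买了", "pos": "verb", "serial": 1},
    {"text": "a", "translation": "一块", "pos": "article", "serial": 2},
    {"text": "gold_watch", "translation": "金表", "pos": "noun", "serial": 3},
]
translation_part = form_translation_part(words)
```

Here the pronoun and the noun become patterning units, while the verb and the article are written as conventional text, so the resulting translation part keeps the concrete example and marks what may be swapped.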
  • Referring to Figure 6, the second embodiment of the method for forming a patterned bilingual sentence pair differs from the first embodiment shown in FIG. 5 in that, when the result of the determination in step S2 is "NO", step S5 is further executed to determine whether a special control character or instruction is present.
  • Flexible control of the formation of patterning units can be achieved by setting special control characters or instructions. With them, words that do not meet the replaceable-word condition can still be patterned, outside the predetermined rules.
  • If the result of the determination in step S5 is "YES", step S3 is performed: the identification information of the word and the translation content of the word are combined into a replaceable unit, which is written to the translation part. If the result of the determination in step S5 is "NO", step S4 is performed: the translation content of the word is written to the translation part.
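The extra branch of the second embodiment amounts to one additional test. The Python sketch below shows only the decision logic; how the special control character or instruction is actually signalled is not specified in the text, so the boolean `forced` parameter is a stand-in assumption, as are the function name and the part-of-speech set.

```python
DEFAULT_REPLACEABLE = frozenset({"noun", "adjective", "pronoun", "numeral"})


def decide(pos, forced=False, replaceable_pos=DEFAULT_REPLACEABLE):
    """Return "S3" (write a patterning unit) or "S4" (write the plain translation)."""
    if pos in replaceable_pos:  # S2 answers YES
        return "S3"
    if forced:                  # S5: a special control character or instruction is present
        return "S3"
    return "S4"


by_rule = decide("noun")
plain = decide("verb")
overridden = decide("verb", forced=True)
```

The override lets a user pattern a word such as a verb in one particular sentence without changing the predetermined rules for every other sentence.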
  • Referring to FIG. 7, there is shown a patterned bilingual sentence pair forming method according to a third embodiment of the present invention, in which the word corresponds to a word unit.
  • Step S0 is executed first: a word of the original sentence is formed into a word unit.
  • Step S1 is then specifically: selecting one translation in the word unit.
  • The word unit may be formed by dictionary lookup, that is, the original word is used to search the dictionary or the sentence library to obtain the corresponding translations (interpretations), part of speech, attributes, associations, and the like.
  • The word unit also includes the serial number of the word in the original sentence.
  • Step S0 can be performed on all the words in the original sentence to form an array of word units.
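Forming word units by dictionary lookup might look like the following Python sketch. The toy dictionary, the `WordUnit` field names, whitespace tokenization, and serial numbers starting at 0 are assumptions made for illustration.

```python
from dataclasses import dataclass, field

# Toy dictionary; the entries and field names are illustrative assumptions.
DICTIONARY = {
    "he": {"translations": ["他"], "pos": "pronoun", "attr": "person"},
    "bought": {"translations": ["买了", "购买了"], "pos": "verb", "attr": ""},
    "gold_watch": {"translations": ["金表"], "pos": "noun", "attr": "object"},
}


@dataclass
class WordUnit:
    serial: int  # serial number of the word in the original sentence
    text: str    # the original word
    translations: list = field(default_factory=list)
    pos: str = ""
    attr: str = ""


def form_word_unit(serial, text, dictionary=DICTIONARY):
    """Step S0 for one word: look it up and collect what the dictionary knows."""
    entry = dictionary.get(text.lower(), {})
    return WordUnit(
        serial=serial,
        text=text,
        translations=list(entry.get("translations", [])),
        pos=entry.get("pos", ""),
        attr=entry.get("attr", ""),
    )


def form_word_units(sentence):
    """Step S0 for every word: build the array of word units."""
    return [form_word_unit(i, w) for i, w in enumerate(sentence.split())]


units = form_word_units("He bought a gold_watch")
```

A word unit may carry several candidate translations (as `bought` does here); selecting one of them is the step S1 of this embodiment, and the serial number stored in the unit later becomes part of the identification information.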
  • Referring to Figure 8, the method for forming a patterned bilingual sentence pair according to the fourth embodiment differs from the third embodiment shown in FIG. 7 in that, when the result of the determination in step S2 is "NO", step S5 is further executed to determine whether a special control character or instruction is present.
  • Referring to FIG. 9, there is shown a patterned bilingual sentence pair forming method in accordance with a fifth embodiment of the present invention, the method comprising:
  • The basis can also be the various predetermined criteria or conditions discussed above.
  • The identification information of the word is added at its translation to form a patterning unit.
  • The original-text identification information includes the content of the original unit and information such as the part of speech, attribute, or serial number in the sentence of the original unit, or any combination of the above.
  • The method includes:
  • The basis can also be the various predetermined criteria or conditions discussed above. Finding the translation of the replaceable word in the target sentence;
  • The identification information of the word and the translation content are combined into a patterning unit, which replaces the original translation content in the target sentence.
  • The original-text identification information includes the content of the original unit and information such as the part of speech, attribute, or serial number in the sentence of the original unit, or any combination of the above.
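When a finished translation already exists, the steps above reduce to locating each replaceable word's translation in the target sentence and substituting a patterning unit for it. The Python sketch below is a simplified illustration: the `{translation|original|pos|serial}` notation, the word records, and the choice to use the first occurrence found by `str.find` are assumptions.

```python
def pattern_existing_pair(target, replaceable):
    """Replace the translation of each replaceable word in `target` with a patterning unit.

    `replaceable` is a list of dicts with the keys text, translation, pos, serial.
    """
    spans = []
    for word in replaceable:
        start = target.find(word["translation"])  # first occurrence only
        if start == -1:
            continue  # translation not present: leave the sentence as it is
        unit = "{%s|%s|%s|%d}" % (
            word["translation"], word["text"], word["pos"], word["serial"])
        spans.append((start, start + len(word["translation"]), unit))

    out, cursor = [], 0
    for start, end, unit in sorted(spans):
        if start < cursor:
            continue  # overlaps an earlier unit: skip rather than corrupt it
        out.append(target[cursor:start])  # conventional text is kept unchanged
        out.append(unit)
        cursor = end
    out.append(target[cursor:])
    return "".join(out)


# Hypothetical replaceable words for an illustrative sentence pair.
replaceable_words = [
    {"text": "he", "translation": "他", "pos": "pronoun", "serial": 0},
    {"text": "gold_watch", "translation": "金表", "pos": "noun", "serial": 3},
]
patterned = pattern_existing_pair("他买了一块金表。", replaceable_words)
```

All positions are computed against the unmodified target sentence and the result is rebuilt in one pass, so text inside an already inserted unit is never matched again. A production system would also need a better way than first-occurrence search to align a word with its translation.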
  • The patterned bilingual sentence pair has a patterning unit in at least the translation part; the patterning unit contains the content of a translation unit and the corresponding original-text identification information. The apparatus includes:
  • a judging module configured to determine whether the grammatical attribute of the word meets the replaceable-word condition;
  • a patterning unit forming module configured to combine the identification information and the translation content of the word into a patterning unit;
  • a writing module configured to write the translation or the patterning unit of the word to the translation part;
  • a word unit forming module configured to form the word unit; the forming method may be dictionary lookup.
  • The patterned bilingual sentence pair has a patterning unit in at least the translation part; the patterning unit contains the content of a translation unit and the corresponding original-text identification information. The apparatus includes:
  • a judging module configured to determine whether the grammatical attribute of the word meets the replaceable-word condition;
  • a patterning unit forming module configured to combine the identification information and the translation content of the word into a patterning unit;
  • a writing module configured to write the translation or the patterning unit of the word to the translation part;
  • a word unit forming module configured to form the word unit.
  • The forming method may be dictionary lookup; the word unit forming module may perform the word unit forming operation on all words in a sentence to form an array of word units.
  • Figure 13 shows a user interface of a patterned bilingual sentence pair forming apparatus of the present invention.
  • the original sentence ⁇ We see the wonderful translation result of the system
  • Each word in "TM++ technology." is displayed in the interactive translation area (the upper half of the figure) and forms a word unit.
  • The word unit with serial number 3 (the 4th word) is shown in detail.
  • In the interactive translation area, when one of the multiple translations is clicked with the mouse, the patterned bilingual sentence pair forming apparatus of the present invention is triggered and uses the forming method of the present invention to form a patterned sentence pair.
  • Fig. 14 schematically shows an example of a word unit.
  • While various aspects and embodiments of the present application have been described in detail above, the invention is not limited thereto. Those skilled in the art can make various changes, modifications, or variations. Such changes, modifications, and variations are intended to be included within the scope of the present invention, provided they do not depart from its spirit and scope.

Abstract

The invention relates to a method for forming a patterned bilingual sentence pair. The patterned bilingual sentence pair has a patterning unit in at least a translation part, and the patterning unit contains the content of a translation unit and the corresponding identification information of the original text. The method comprises: selecting a word in an original sentence; determining whether the grammatical attributes of the word meet the conditions for replaceable words; if the result of the determination is "yes", the identification information of the word and the translation content of the word are combined into a replaceable unit, and this replaceable unit is written into the translation part; if the result of the determination is "no", the translation content of the word is written into the translation part. The invention also relates to a corresponding forming apparatus.
PCT/CN2010/077772 2009-10-20 2010-10-15 Method for forming a patterned bilingual sentence pair, and corresponding forming apparatus WO2011047608A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200910180877.2 2009-10-20
CN200910180877.2A CN102043773B (zh) 2009-10-20 2009-10-20 模式化双语句对形成方法及其形成装置

Publications (1)

Publication Number Publication Date
WO2011047608A1 (fr) 2011-04-28

Family

ID=43899826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/077772 WO2011047608A1 (fr) 2009-10-20 2010-10-15 Method for forming a patterned bilingual sentence pair, and corresponding forming apparatus

Country Status (2)

Country Link
CN (1) CN102043773B (fr)
WO (1) WO2011047608A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391840A (zh) * 2014-11-24 2015-03-04 上海迈外迪网络科技有限公司 翻译方法及装置
CN105183723A (zh) * 2015-09-17 2015-12-23 成都优译信息技术有限公司 一种翻译软件与语料搜索的关联方法
CN105183722A (zh) * 2015-09-17 2015-12-23 成都优译信息技术有限公司 一种汉英双语翻译语料的对齐方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1272655A (zh) * 2000-06-19 2000-11-08 李玉鉴(鑑) 英汉翻译机器
JP2006127405A (ja) * 2004-11-01 2006-05-18 Advanced Telecommunication Research Institute International バイリンガルパラレルテキストをアライメントする方法及びそのためのコンピュータで実行可能なプログラム
CN101206643A (zh) * 2006-12-21 2008-06-25 中国科学院计算技术研究所 一种融合了句型模板和统计机器翻译技术的翻译方法
CN101441623A (zh) * 2007-11-20 2009-05-27 富士施乐株式会社 翻译装置及信息处理方法

Also Published As

Publication number Publication date
CN102043773A (zh) 2011-05-04
CN102043773B (zh) 2014-09-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10824449

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10824449

Country of ref document: EP

Kind code of ref document: A1