WO2011047608A1 - Forming method of patterned bilingual sentence pair and forming device thereof - Google Patents

Forming method of patterned bilingual sentence pair and forming device thereof

Info

Publication number
WO2011047608A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
unit
translation
sentence
forming
Prior art date
Application number
PCT/CN2010/077772
Other languages
French (fr)
Chinese (zh)
Inventor
张龙哺
Original Assignee
北京东方爱译科技有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京东方爱译科技有限责任公司

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory

Definitions

  • The present invention relates to techniques for building and accumulating intelligent translation knowledge in the field of computer translation, and more particularly to a method for forming patterned bilingual sentence pairs and a forming device thereof.
  • Background of the invention
  • TM: translation memory technology
  • Figure 1A shows a conventional translation scheme using TM translation technology.
  • The TM translation mode compares (matches) the input source sentence against the source part of each bilingual sentence pair in the corpus. If the match is exact, or the specified match rate is met, the translation part of that bilingual sentence pair is output as the TM translation result.
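The matching step described above can be sketched in a few lines of Python. This is a minimal illustration, not the scheme of Figure 1A itself: the function name `tm_translate`, the `min_ratio` parameter, the use of `difflib.SequenceMatcher` as the match-rate measure, and the sample sentence pair are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def tm_translate(source, corpus, min_ratio=1.0):
    """Return the stored translation whose source part best matches `source`.

    `corpus` is a list of (source_text, translation) pairs. The match rate is
    approximated with difflib's similarity ratio (an assumption; the patent
    does not say how the match rate is computed). Returns None when no pair
    reaches `min_ratio`.
    """
    best_ratio, best_translation = 0.0, None
    for stored_source, stored_translation in corpus:
        ratio = SequenceMatcher(None, source, stored_source).ratio()
        if ratio > best_ratio:
            best_ratio, best_translation = ratio, stored_translation
    return best_translation if best_ratio >= min_ratio else None


# Hypothetical one-pair corpus used only for illustration.
corpus = [("He bought a gold watch.", "他买了一块金表。")]
```

With the default `min_ratio=1.0` only an identical sentence is translated, which is exactly the limitation of conventional sentence pairs that the background section goes on to discuss.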
  • Figure 1B shows an example of a sentence pair recorded by the conventional sentence-pair recording method: the source text is recorded in the left part and the translation in the right part, with a separator between them.
  • Both the source text and the translation are ordinary text content, namely words (characters), punctuation marks, and so on. Apart from the separator between the source text and the translation, there is no other information to assist translation.
  • The usefulness of this kind of sentence pair is therefore very limited: it yields an accurate translation only for an identical sentence and cannot yield one for a merely similar sentence.
  • Improved TM techniques have emerged in recent years, such as using sentence patterns in TM schemes, with the aim of covering more sentences with the sentence patterns stored in a sentence pattern library.
  • The principle is to abstract a translated example sentence into a sentence pattern.
  • At translation time, the sentence to be translated is first parsed and abstracted into a syntax tree structure, and the translation is then created from the stored sentence pattern and the sentence to be translated.
  • This method in effect returns to the old path of traditional MT technology.
  • The main reason is that abstracting an example sentence into a grammatical sentence pattern is time-consuming, laborious work that cannot be automated. No practical method or tool for accumulating sentence patterns has yet been seen.
  • Based on years of research into the translation thinking of the human brain and into foreign-language learning and memory, the inventor of the present application has proposed a complete system that simulates how the human brain memorizes and stores translation knowledge, namely the Bodian intelligent knowledge base system, together with its corresponding super-intelligent computer translation technology (TM++).
  • In this system, a sentence pair is neither a simple source-text-plus-translation form nor the abstract sentence pattern described above, but an instance-based patterned sentence pair.
  • The advantages of the patterned sentence pair for translation theory are: 1. it makes complex, abstract grammar concrete and instantiated, so that it is easy to understand and implement; 2. it is well suited to intelligent translation technology because it is both a translation example and a translation sentence pattern, retaining the uniqueness of the specific translated sentence pair while also having the generality of a translation sentence pattern.
  • Figures 2A-2C and 3A-3C show some examples of instance-based patterned sentence pairs (referred to as patterned sentence pairs).
  • The object of the present invention is to provide a method for forming patterned sentence pairs and a forming device thereof.
  • With this method and device, patterned sentence pairs can be formed and accumulated quickly and efficiently.
  • Knowledge accumulation for the intelligent knowledge base can be opened to all users: while a user translates, the machine automatically forms and accumulates intelligent translation knowledge. This removes the constraint of traditional translation software, in which language experts formulate translation rules or sentence patterns and software professionals write or update them, and it will greatly speed up the development and improvement of the intelligent knowledge base. It therefore provides a feasible technical solution for the early realization of high-quality, fully automatic machine translation.
  • Figure 1A is a block schematic diagram of a conventional TM computer translation technology solution.
  • Figure 1B shows an example of a traditional sentence pair.
  • Figures 2A-2C and 3A-3C show examples of patterned sentence pairs in the present invention.
  • Figure 4 shows an example of additional information for a patterned sentence pair.
  • Figure 5 is a flow chart showing a first embodiment of a method of forming a patterned sentence pair of the present invention.
  • Figure 6 is a flow chart showing a second embodiment of the method for forming a patterned sentence pair of the present invention.
  • Figure 7 is a flow chart showing a third embodiment of the method of forming a patterned sentence pair of the present invention.
  • Figure 8 is a flow chart showing a fourth embodiment of the method for forming a patterned sentence pair of the present invention.
  • Figure 9 is a flow chart showing a fifth embodiment of the method of forming a patterned sentence pair of the present invention.
  • Figure 10 is a flow chart showing a sixth embodiment of a method of forming a patterned sentence pair of the present invention.
  • Figure 11 is a block diagram showing a first embodiment of the patterned sentence pair forming device of the present invention.
  • Figure 12 is a block diagram showing a second embodiment of the patterned sentence pair forming device of the present invention.
  • Figure 13 shows the user interface of a patterned bilingual sentence pair forming device of the present invention.
  • Figure 14 schematically shows an example of a word unit.
  • Specific embodiments of the present invention are described in detail below with reference to the drawings.
  • Implementation
  • A bilingual sentence pair includes a source sentence expressed in a first language (the first-language source sentence) and a corresponding target sentence in a second language (the second-language translated sentence).
  • The first-language source sentence is sometimes referred to simply as the source text.
  • The second-language translated sentence is sometimes referred to simply as the translation, because it is usually the result of translating the first-language source sentence.
  • The source text or source sentence can be a simple sentence, a complex sentence, a phrase, a short sentence, and so on.
  • The source sentence referred to in this application is not limited in length or structure.
  • The method for forming patterned sentence pairs of the present invention can be used in a computer translation system, and is particularly useful for building and maintaining the sentence library of such a system. It can also be used in other fields, such as corpus collection and collation.
  • Figures 2 and 3 illustrate various ways of recording patterned sentence pairs in the present invention.
  • The first part records the source text, that is, the first language.
  • The second part records the translation, that is, the second language.
  • The first part and the second part may be in the same file. For example, they may be on the same line, separated by a specific separator, as shown in FIG. 2A; or they may be on two adjacent lines, for example with the first part on odd lines and the second part on even lines, as shown in FIG. 2B.
  • Alternatively, the first part and the second part may each be stored in a separate file, with a correspondence between the first part and the second part of the same sentence pair, for example by being on the same line number.
  • The first part and the second part can also be in the same table.
  • For example, they may be in different column cells of the same row, as shown in Fig. 3A.
  • Or they may be in two adjacent rows, for example with the first part in odd rows and the second part in even rows, as shown in Fig. 3B.
  • Alternatively, the first part and the second part may each be stored in a separate table, with a correspondence between the first part and the second part of the same sentence pair, for example by being in the same row number.
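The three file layouts described above (same line with a separator, alternating lines, and parallel files) can each be read into a list of (source, translation) pairs. The following is a rough sketch; the function names and the tab character used as the default separator are assumptions, since the text only speaks of "a specific separator".

```python
def read_same_line(lines, separator="\t"):
    """Layout of FIG. 2A: source and translation share a line, split by a separator."""
    pairs = []
    for line in lines:
        source, translation = line.split(separator, 1)
        pairs.append((source.strip(), translation.strip()))
    return pairs


def read_alternating(lines):
    """Layout of FIG. 2B: odd lines (1-based) hold the source, even lines the translation."""
    return list(zip(lines[0::2], lines[1::2]))


def read_parallel(source_lines, translation_lines):
    """Separate files: line i of one file corresponds to line i of the other."""
    return list(zip(source_lines, translation_lines))
```

All three readers return the same structure, so later processing does not need to know which layout the sentence library uses. The same idea carries over to the table layouts, with rows and column cells in place of lines and separators.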
  • The patterned sentence pair described in the present invention has regular units and patterned units in at least one of the first part and the second part. In a patterned unit, the content of the unit in its own language and the information of the corresponding unit in the other language are recorded in a predetermined format.
  • That is, patterned units are used in addition to regular units.
  • A regular unit is an invariable part, that is, ordinary translated text, for example "Yes", "Buy", and "One Piece" in Figures 2-3 ("buy a piece" may also be regarded as one regular unit or regular unit block).
  • A patterned unit is a replaceable part, that is, a part of the translation that can be replaced by other content, for example, in Figures 2-3: ⁇ he
  • A sentence pair having such patterned units is referred to as a patterned sentence pair.
  • The number of regular units and patterned units and their relative positions may be arbitrary; they are determined by the structure of the sentence and the needs of translation.
  • A patterned sentence pair typically has one or more regular units and one or more patterned units.
  • Regular units and patterned units may alternate with each other, or several regular units or several patterned units may be adjacent to one another.
  • A patterned sentence pair may also consist entirely of patterned units. For example:
  • The patterned unit has a predetermined format.
  • The purpose of the predetermined format is to allow the translation unit in the patterned unit to be replaced.
  • As needed, the patterned unit can include information such as the corresponding source unit, its part of speech, its attribute, and its serial number in the sentence, so that replacement is accurate and appropriate.
  • An example of a patterned unit is as follows: "He|he|pronoun
  • The items of information in a patterned unit can be separated by a specific separator, such as the character "|" or ",", a space, or a tab.
  • The purpose is to allow better identification and processing when the patterned sentence pair is used in translation.
  • Each patterned unit can be delimited by a specific symbol pair, such as " ⁇ ” and “ ⁇ ”, “ ⁇ ” and “/ ⁇ ”, etc., so that the patterned unit can be easily identified.
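To make the predetermined format concrete, the sketch below writes and reads patterned units embedded in a translation string. It is only an illustration under stated assumptions: the angle brackets used as the delimiting symbol pair, the "|" separator, the field order (translation, source word, part of speech, serial number), and the names `PatternedUnit`, `format_unit`, and `parse_units` are all hypothetical choices, not the notation of the figures.

```python
import re
from dataclasses import dataclass


@dataclass
class PatternedUnit:
    translation: str  # content of the translation unit
    source: str       # corresponding source-text unit
    pos: str          # part of speech of the source unit
    index: int        # serial number of the source unit in the sentence (0-based)


# Assumed delimiters: "<" and ">" mark a unit, "|" separates its fields.
UNIT_RE = re.compile(r"<([^<>]*)>")


def format_unit(unit):
    """Serialize one patterned unit in the assumed format."""
    return f"<{unit.translation}|{unit.source}|{unit.pos}|{unit.index}>"


def parse_units(text):
    """Extract every patterned unit from a translation part, left to right."""
    units = []
    for match in UNIT_RE.finditer(text):
        translation, source, pos, index = match.group(1).split("|")
        units.append(PatternedUnit(translation, source, pos, int(index)))
    return units
```

Everything outside the delimiters is treated as regular units, so a reader of the sentence library can tell fixed text from replaceable text with a single scan.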
  • The second part (the translation) has patterned units: ⁇ he
  • Although patterned units are not explicitly marked in the first part, the words or phrases of the first part that are referenced by the patterned units in the second part (the translation) are implied to be replaceable. They are the 0th word "he", the 5th word "his", the 6th word "wife", and the 3rd word "gold_watch".
  • In the first part (the source text), patterned units may also be recorded.
  • There, a regular unit is an invariable part, that is, ordinary source text.
  • A patterned unit is a replaceable part, that is, a part of the source text that can be replaced by other content.
  • The number of regular units and patterned units can be arbitrary, depending on the structure of the sentence and the needs of translation.
  • The patterned unit is recorded in a predetermined format.
  • The purpose of patterning is to allow the source unit in the patterned unit to be replaced.
  • As needed, the patterned unit can include information such as the part of speech and attributes of the word or phrase, so that replacement is accurate and appropriate.
  • The additional information in a patterned unit of the first part (the source text) preferably complements the additional information in the corresponding patterned unit of the second part (the translation).
  • A patterned unit in the first part is preferably generated at the same time as the corresponding patterned unit in the second part.
  • Figures 2B and 2C show examples in which the replaceable words or phrases in the first part (the source text) are marked as patterned units.
  • the labeling method is as follows: ⁇ he
  • Other marking methods may be used, as long as they facilitate identification and replacement.
  • In a patterned unit of the first part, the source unit content and the corresponding translation unit information may also be recorded in a predetermined format.
  • The translation unit information includes the content of the translation unit and information such as its part of speech, attribute, or serial number, or any combination of these.
  • Additional information can also be recorded in a patterned sentence pair, such as the total number of units in the sentence, a modification mark, a quality level, a user name, an update date, a language number, and so on, as shown in FIG. 4.
  • The additional information may be placed at the beginning, at the end, or elsewhere in the patterned sentence pair, as long as it is associated with that pair.
  • 01" in Fig. 4 is a specific example of the additional information.
  • The patterned sentence pair of the present invention is both a translation example and a translation pattern. It therefore retains the uniqueness of the specific translated sentence pair and has the generality of a translation pattern.
  • Using such patterned sentence pairs, the input source sentence can be translated by regular matching, which guarantees the specific translation required for a specific sentence; by pattern matching; and by more advanced intelligent translation. For details, refer to the other related inventions of the present inventor.
  • The method for forming patterned sentence pairs of the present invention does not require abstracting the translated bilingual sentence pair (an operation that demands considerable grammatical reasoning and induction and a large number of rules); it only needs to add some information that is already available, so it is easily implemented by a computer.
  • An interactive translation (IT) module or a computer-aided translation (CAT) module is used to collect the information needed for the patterned units, form the required patterned units, and write them into the patterned sentence pair.
  • Referring to FIG. 5, a method for forming a patterned bilingual sentence pair according to the first embodiment includes the following steps. Step S1: select a word in the source sentence.
  • The word can be a single word, a word group, or a phrase.
  • Step S2: determine whether the grammatical attribute of the word meets the replaceable-word condition.
  • The replaceable-word condition may be specified and judged by part of speech. For example, nouns, adjectives, pronouns, and numerals may be predetermined as replaceable; then, if the part of speech of a word is noun, adjective, pronoun, or numeral, the grammatical attribute of the word meets the replaceable-word condition.
  • The replaceable-word condition may also be specified and judged by the attribute of the word. For example, a word whose attribute is "object", "person", "time", or "place" may be defined as replaceable.
  • If the result of the determination in step S2 is "YES", step S3 is executed: the identification information of the word and the translation content of the word are combined into a patterned unit, which is written to the translation part.
  • As needed, the identification information may include the content of the source unit and information such as its part of speech, attribute, or serial number, or any combination of these. See the description of the patterned unit above for details.
  • If the result of the determination in step S2 is "NO", step S4 is executed: the translation content of the word is written to the translation part.
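Steps S1 to S4 amount to a short loop over the words of the sentence. The sketch below is one possible reading under several assumptions: words arrive as (source word, part of speech, serial number, chosen translation) tuples already arranged in translation order, the replaceable-word condition is judged by part of speech only, and the unit notation `<translation|source|pos|index>` is a hypothetical stand-in for the predetermined format.

```python
# Parts of speech predetermined as replaceable, following the example in the text.
REPLACEABLE_POS = {"noun", "adjective", "pronoun", "numeral"}


def meets_replaceable_condition(pos):
    """Step S2: judge the grammatical attribute (here, the part of speech)."""
    return pos in REPLACEABLE_POS


def form_translation_part(words):
    """Build the translation part of a patterned sentence pair.

    `words` is a list of (source, pos, index, translation) tuples, assumed to
    be given in the order in which they appear in the translation.
    """
    parts = []
    for source, pos, index, translation in words:       # S1: select a word
        if meets_replaceable_condition(pos):             # S2: replaceable?
            # S3: identification information + translation -> patterned unit
            parts.append(f"<{translation}|{source}|{pos}|{index}>")
        else:
            # S4: write the plain translation (a regular unit)
            parts.append(translation)
    return "".join(parts)


# Hypothetical input used only for illustration.
words = [
    ("he", "pronoun", 0, "他"),
    ("bought", "verb", 1, "买了"),
    ("a", "article", 2, "一块"),
    ("gold_watch", "noun", 3, "金表"),
]
```

Because the loop only attaches information that is already known at translation time, no grammatical abstraction of the sentence is needed, which is the point the text makes about ease of implementation.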
  • Referring to FIG. 6, the second embodiment of the forming method differs from the first embodiment shown in FIG. 5 in that, when the result of the determination in step S2 is "NO", step S5 is further executed to determine whether a special control character or instruction is present.
  • Setting special control characters or instructions gives flexible control over the formation of patterned units: words that do not meet the replaceable-word condition can still be patterned, outside the predetermined rules.
  • If the result of the determination in step S5 is "YES", step S3 is executed: the identification information of the word and the translation content of the word are combined into a replaceable unit, which is written to the translation part. If the result of the determination in step S5 is "NO", step S4 is executed: the translation content of the word is written to the translation part.
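The extra branch of the second embodiment can be expressed as a small decision function. This is a sketch only: the function name `choose_step`, the boolean `control_present` standing in for "a special control character or instruction", and the default set of replaceable parts of speech are assumptions for illustration.

```python
def choose_step(pos, control_present=False,
                replaceable_pos=frozenset({"noun", "adjective", "pronoun", "numeral"})):
    """Return "S3" (write a patterned unit) or "S4" (write the plain translation).

    S2 checks the replaceable-word condition; when it fails, S5 checks whether
    a special control character or instruction is present.
    """
    if pos in replaceable_pos:   # S2 is "YES"
        return "S3"
    if control_present:          # S2 is "NO", S5 is "YES"
        return "S3"
    return "S4"                  # S2 is "NO", S5 is "NO"
```

The override lets a user pattern a word, for example a verb, that the predetermined rules would otherwise record as a regular unit.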
  • Referring to FIG. 7, in the forming method of the third embodiment of the present invention, the word corresponds to a word unit.
  • Before step S1, step S0 is executed: the words of the source sentence are formed into word units.
  • Step S1 is then specifically: select one translation in the word unit.
  • The word unit may be formed by dictionary lookup, that is, by using the source word to search a dictionary or sentence library and obtaining the corresponding translations (definitions), part of speech, attributes, associations, and the like.
  • The word unit also includes the serial number of the word in the source sentence.
  • Step S0 can be performed on all the words in the source sentence to form an array of word units.
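Step S0 can be pictured as a dictionary lookup per word that yields an array of word units. The following sketch makes several assumptions: the `WordUnit` fields, whitespace tokenization, the handling of unknown words, and the tiny sample dictionary are all hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class WordUnit:
    index: int          # serial number of the word in the source sentence (0-based)
    source: str         # the source word as it appears in the sentence
    pos: str            # part of speech found in the dictionary
    translations: list  # candidate translations; step S1 later selects one


def form_word_units(sentence, dictionary):
    """Step S0: form one word unit per word of the source sentence."""
    units = []
    for index, word in enumerate(sentence.split()):
        key = word.strip(".,!?").lower()
        # Unknown words keep their source form as the only candidate (an assumption).
        pos, translations = dictionary.get(key, ("unknown", [word]))
        units.append(WordUnit(index, word, pos, list(translations)))
    return units


# Hypothetical dictionary entries: word -> (part of speech, candidate translations).
DICTIONARY = {
    "we": ("pronoun", ["我们"]),
    "see": ("verb", ["看见", "明白"]),
}
```

Keeping every candidate translation in the word unit is what allows an interactive interface to offer the user a choice in step S1.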
  • Referring to FIG. 8, the forming method of the fourth embodiment differs from the third embodiment shown in FIG. 7 in that, when the result of the determination in step S2 is "NO", step S5 is further executed to determine whether a special control character or instruction is present.
  • Referring to FIG. 9, a method for forming a patterned bilingual sentence pair according to the fifth embodiment of the present invention comprises:
  • The basis for the determination can again be the various predetermined criteria or conditions discussed above.
  • The identification information of the word is added at its translation to form a patterned unit.
  • The source identification information includes the content of the source unit and information such as its part of speech, attribute, or serial number in the sentence, or any combination of these.
  • The method of the sixth embodiment (FIG. 10) includes:
  • The basis for the determination can again be the various predetermined criteria or conditions discussed above. The translation of the replaceable word is then found in the target sentence;
  • the identification information of the word and the translated content are combined into a patterned unit, which replaces the original translation content.
  • The source identification information includes the content of the source unit and information such as its part of speech, attribute, or serial number in the sentence, or any combination of these.
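Patterning an already translated sentence pair, as outlined above, can be sketched as: find each replaceable source word, locate its translation in the target sentence, and substitute a patterned unit for it. The function name `pattern_existing_pair`, the `lexicon` mapping from source words to their translations, and the unit notation `<translation|source|pos|index>` are assumptions made for this illustration.

```python
def pattern_existing_pair(source_words, translation, lexicon,
                          replaceable_pos=frozenset({"noun", "adjective", "pronoun", "numeral"})):
    """Return the translation with replaceable words rewritten as patterned units.

    `source_words` is a list of (word, pos) in source order; `lexicon` maps a
    source word to the translation text expected in the target sentence.
    Only the first occurrence of each translation is replaced. This simple
    string search does not guard against a translation that also occurs
    inside an already inserted unit.
    """
    for index, (word, pos) in enumerate(source_words):
        if pos not in replaceable_pos:
            continue                      # not a replaceable word
        target = lexicon.get(word)
        if target and target in translation:
            unit = f"<{target}|{word}|{pos}|{index}>"
            translation = translation.replace(target, unit, 1)
    return translation


# Hypothetical example data.
source_words = [("he", "pronoun"), ("bought", "verb"), ("a", "article"), ("gold_watch", "noun")]
lexicon = {"he": "他", "gold_watch": "金表"}
```

This route is useful for converting an existing library of conventional sentence pairs, since it needs only the stored pair and a word-level lexicon.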
  • In the forming device, the patterned bilingual sentence pair has a patterned unit in at least the translation part; the patterned unit contains the translation unit content and the corresponding source identification information. The device includes:
  • a judging module configured to determine whether the grammatical attribute of the word meets the replaceable-word condition;
  • a patterned unit forming module configured to combine the identification information and the content of the word into a patterned unit;
  • a writing module configured to write the translation of the word or the patterned unit to the translation part;
  • a word unit forming module configured to form the word unit, for example by dictionary lookup.
  • In another forming device, the patterned bilingual sentence pair likewise has a patterned unit in at least the translation part; the patterned unit contains the translation unit content and the corresponding source identification information. The device includes:
  • a judging module configured to determine whether the grammatical attribute of the word meets the replaceable-word condition;
  • a patterned unit forming module configured to combine the identification information and the content of the word into a patterned unit;
  • a writing module configured to write the translation of the word or the patterned unit to the translation part;
  • a word unit forming module configured to form the word unit, for example by dictionary lookup; the word unit forming module may perform the word unit forming operation on all words in a sentence to form an array of word units.
  • Figure 13 shows a user interface of a patterned bilingual sentence pair forming device of the present invention.
  • Each word of the source sentence "We see the wonderful translation result of the system TM++ technology." is displayed in the interactive translation area (the upper half of the figure) and forms a word unit.
  • The word unit with serial number 3 (the 4th word) is shown in detail.
  • In the interactive translation area, when one of the multiple translations is clicked with the mouse, the patterned bilingual sentence pair forming device of the present invention is triggered and forms a patterned sentence pair according to the forming method of the present invention.
  • Fig. 14 schematically shows an example of a word unit.
  • While various aspects and embodiments of the present application have been described in detail above, the invention is not limited thereto. Those skilled in the art can make various changes and modifications, and such changes and modifications are intended to fall within the scope of the present invention provided they do not depart from its spirit.

Abstract

A method for forming a patterned bilingual sentence pair, wherein the patterned bilingual sentence pair has a patterned unit at least in a translation part, and the patterned unit contains translation unit content and corresponding identification information of the original text. The method includes: step S1: selecting a word from a sentence of the original text; step S2: judging whether the grammatical attributes of the word meet the replaceable-word condition; if the judgment result in step S2 is "yes", step S3 is executed: the identification information of the word and the translation content of the word are combined into a replaceable unit, and the replaceable unit is written into the translation part; if the judgment result in step S2 is "no", step S4 is executed: the translation content of the word is written into the translation part. A corresponding forming device is also provided.

Description

Method for forming patterned bilingual sentence pairs and forming device thereof
Field of the invention
The present invention relates to techniques for building and accumulating intelligent translation knowledge in the field of computer translation, and more particularly to a method for forming patterned bilingual sentence pairs and a forming device thereof.
Background of the invention
The idea of machine translation was first proposed in the 1930s. With the development of computer technology, various types of computer translation systems and technologies have emerged, such as ED (electronic dictionary), MT (machine translation), TM (translation memory), IT (interactive translation), and CAT (computer-aided translation).
These systems each use different methods to perform language conversion for certain aspects of natural language. Among them, an electronic dictionary can only translate or look up individual words.
Traditional MT technology converts language on the basis of grammar rules. The grammar rules are written by language experts and coded into the translation program by programmers, and only programmers can add or modify them. Because language is rich and flexible, a small number of grammar rules cannot cover all linguistic phenomena. Traditional MT technology therefore cannot achieve good translation quality, especially for long sentences and sentences with complex structure.
With the rapid increase in computing speed and in the storage capacity of recording media, a statistics-based translation technology, namely translation memory (TM) technology, was proposed in the 1990s. Its basic idea is to store bilingual sentence pairs in massive quantities: for a source sentence that has already been translated or stored, an accurate translation result is obtained simply by retrieving the corresponding translation. TM technology thus pointed out a direction toward high-quality, accurate computer translation.
Figure 1A shows a conventional translation scheme using TM technology. The TM translation mode compares (matches) the input source sentence against the source part of each bilingual sentence pair in the corpus. If the match is exact, or the specified match rate is met, the translation part of that bilingual sentence pair is output as the TM translation result.
Figure 1B shows an example of a sentence pair recorded by the conventional sentence-pair recording method: the source text is recorded in the left part and the translation in the right part, with a separator between them. Both the source text and the translation are ordinary text content, namely words (characters), punctuation marks, and so on. Apart from the separator between the source text and the translation, there is no other information to assist translation. The usefulness of such sentence pairs is therefore very limited: they yield an accurate translation only for an identical sentence and cannot yield one for a merely similar sentence.
Consequently, conventional TM technology requires accumulating every sentence that might occur together with its translation. Because of the flexibility and richness of language, and the freedom with which individual authors write, accumulating all sentences of a given language pair is practically impossible, since the number of sentences is unlimited or inestimable. In practice, we accumulated several hundred thousand sentence pairs in one specialized field at considerable cost in labor and money, yet translation tests showed a coverage of only a few per thousand. TM computer translation technology thus ran into a huge obstacle. As a result, people turned back to the advantage of traditional MT technology, namely covering more sentences with a small number of grammar rules or sentence patterns, or combined MT technology with TM technology to form multi-strategy translation technology.
In addition, improved TM techniques have emerged in recent years, such as using sentence patterns in TM schemes, with the aim of covering more sentences with the sentence patterns stored in a sentence pattern library. The principle is to abstract a translated example sentence into a sentence pattern; at translation time, the sentence to be translated is first parsed and abstracted into a syntax tree structure, and the translation is then created from the stored sentence pattern and the sentence to be translated. This method in effect returns to the old path of traditional MT technology, chiefly because abstracting an example sentence into a grammatical sentence pattern is time-consuming, laborious work that cannot be automated. No practical method or tool for accumulating sentence patterns has yet been seen.
本申请的发明人, 基于多年对人类大脑的翻译思维以及外语学习和 记忆的研究, 提出了一整套模拟人脑记忆和存储翻译知识的体系, 即博 典 (Bodian ) 智能化知识库体系, 及其相应的超级智能计算机翻译技术 ( TM++)。 该智能化知识库体系中, 句对不是简单的原文加译文形式, 也 不是上面所说的抽象化句型, 而一种基于实例的模式化句对。 该模式化 句对翻译理论的优点是: 1、 将复杂抽象的语法具体化和实例化, 便于理 解和实施; 2、 在智能化翻译技术中, 该模式化句对非常适用, 因为它对 兼翻译实例和翻译句型于一体, 既保留具体翻译句对的独特性又具有翻 译句型的普遍性。图 2A-2C和图 3A-3C显示了基于实例的模式化句对(简 称为模式化句对) 的一些例子。  The inventor of the present application, based on years of research on human brain translation thinking and foreign language learning and memory, proposed a set of systems for simulating human brain memory and storing translation knowledge, namely Bodian intelligent knowledge base system, and Corresponding Super Intelligent Computer Translation Technology (TM++). In the intelligent knowledge base system, the sentence pair is not a simple original text plus translation form, nor is it an abstract sentence pattern as described above, but an instance-based patterned sentence pair. The advantages of the model sentence to the translation theory are: 1. To embody and instantiate the complex abstract grammar for easy understanding and implementation; 2. In the intelligent translation technology, the pattern sentence is very suitable because it is The translation examples and the translated sentence patterns are integrated, which not only preserves the uniqueness of the specific translated sentence pairs but also the universality of the translated sentence patterns. Figures 2A-2C and 3A-3C show some examples of instance-based patterned sentence pairs (referred to as patterned sentence pairs).
The inventor of the present application has also invented various methods, apparatuses and systems for realizing said intelligent ...
Summary of the invention
The object of the present invention is to provide a method for forming patterned sentence pairs and an apparatus for forming them.
With the method and apparatus, patterned sentence pairs can be formed and accumulated quickly and efficiently. Knowledge accumulation for the intelligent knowledge base can be opened to all users: while a user translates, the machine automatically forms and accumulates intelligent translation knowledge. This removes the constraint of traditional translation software, in which language experts formulate the translation rules or sentence patterns and software professionals write or update them, and it will greatly speed up the development and refinement of the intelligent knowledge base. The invention therefore provides a feasible technical solution for the early realization of high-quality, fully automatic machine translation.
Brief description of the drawings
Figure 1A is a block diagram of a conventional TM computer translation scheme.
Figure 1B shows an example of a conventional sentence pair.
Figures 2A-2C and 3A-3C show examples of patterned sentence pairs of the present invention.
Figure 4 shows an example of the additional information of a patterned sentence pair.
Figure 5 is a flowchart of a first embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 6 is a flowchart of a second embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 7 is a flowchart of a third embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 8 is a flowchart of a fourth embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 9 is a flowchart of a fifth embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 10 is a flowchart of a sixth embodiment of the method for forming a patterned sentence pair of the present invention.
Figure 11 is a block diagram of a first embodiment of the apparatus for forming a patterned sentence pair of the present invention.
Figure 12 is a block diagram of a second embodiment of the apparatus for forming a patterned sentence pair of the present invention.
Figure 13 shows the user interface of an apparatus for forming patterned bilingual sentence pairs of the present invention.
Figure 14 schematically shows an example of a word unit.
Specific embodiments of the present invention are described in detail below with reference to the drawings.
Detailed description of the embodiments
Before the specific embodiments of the present invention are described, the patterned sentence pairs of the invention are explained in detail.
In general, a bilingual sentence pair comprises an original sentence expressed in a first language (the first-language original sentence for short) and a corresponding translated sentence expressed in a second language (the second-language translated sentence for short). The first-language original sentence is sometimes called simply the original, and the second-language translated sentence simply the translation, because the latter is usually the result of translating the former.
An original or original sentence may be a simple sentence, a complex sentence, a word group, a phrase, a short sentence and so on. In short, the original sentence referred to in this application is not limited in length or structure.
The method for forming patterned sentence pairs of the present invention can be used in a computer translation system, in particular for building and maintaining the sentence library of such a system. It can of course also be used in other fields, such as corpus collection and organization.
Figures 2-3 show various embodiments of patterned sentence pair records of the present invention.
In a patterned bilingual sentence pair of the present invention, the original (the first language) is recorded in a first part and the translation (the second language) is recorded in a second part. When the pair is stored in files, the first part and the second part may be in the same file. For example, they may be on the same line, separated by a specific separator, as shown in Figure 2A; or they may be on two adjacent lines, for example with the first part on odd-numbered lines and the second part on even-numbered lines, as shown in Figure 2B. Alternatively, as shown in Figure 2C, the first part and the second part may each be stored in a separate file, with a correspondence between the two parts of the same sentence pair, for example by placing them on the same line number.
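The three file layouts described above can be read back into (original, translation) pairs in a few lines. The following is only an illustrative sketch: the tab separator and all function names are assumptions, since the text does not fix a particular separator or file format.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (first part: original, second part: translation)


def pairs_from_same_line(lines: Iterable[str], sep: str = "\t") -> List[Pair]:
    """Figure 2A style: both parts on one line, split at a separator (assumed tab)."""
    pairs = []
    for line in lines:
        original, translation = line.rstrip("\n").split(sep, 1)
        pairs.append((original, translation))
    return pairs


def pairs_from_alternating_lines(lines: Iterable[str]) -> List[Pair]:
    """Figure 2B style: lines 1, 3, ... hold originals; lines 2, 4, ... translations."""
    rows = [line.rstrip("\n") for line in lines]
    return list(zip(rows[0::2], rows[1::2]))


def pairs_from_parallel_files(originals: Iterable[str],
                              translations: Iterable[str]) -> List[Pair]:
    """Figure 2C style: two separate files aligned by line number."""
    return [(o.rstrip("\n"), t.rstrip("\n")) for o, t in zip(originals, translations)]
```

A tab is assumed here rather than "|" because "|" already appears inside patterned units, so splitting on it would cut a unit in half.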
When the pair is stored in a database, the first part and the second part may be in the same table. For example, they may be in different column cells of the same row, as shown in Figure 3A; or they may be in two adjacent rows, for example with the first part in odd-numbered rows and the second part in even-numbered rows, as shown in Figure 3B.
Alternatively, when the pair is stored in a database, as shown in Figure 3C, the first part and the second part may each be in a separate table, with a correspondence between the two parts of the same sentence pair, for example by placing them in the same row number.
A patterned sentence pair as described in the present invention has conventional units and patterned units in at least one of the first part and the second part. In a patterned unit, the content of the unit in its own language and the information of the corresponding unit in the other language are recorded in a predetermined format.
Specifically, the translation recorded in the second part uses patterned units in addition to conventional units. A conventional unit is an invariable part, i.e., ordinary translated text, such as "为", "买了" and "一块" in Figures 2-3 ("买了一块" may also be regarded as a single conventional unit or conventional unit block). A patterned unit is a replaceable part, i.e., that part of the translation can be replaced by other content, such as the following in Figures 2-3: {\he|0|他/}, {\his|5|他的/}, {\wife|6|妻子/}, {\gold watch|3|金表/}.
A sentence pair having such patterned units is called a patterned sentence pair. The numbers of conventional units and patterned units, and their positions relative to one another, may be arbitrary; they are determined by the structure of the sentence and the needs of translation. A patterned sentence pair usually has one or more conventional units and one or more patterned units. Conventional units and patterned units may alternate one by one, or runs of several consecutive conventional units or patterned units may alternate.
All units of a patterned sentence pair may be patterned units. For example:
{\lazy|adj/} {\boy|n|/} {\!|f/} - {\lazy|0|懒惰的/} {\boy|1|男孩/} {\!|2|!/}
A patterned unit has a predetermined format. The purpose of the predetermined format is to allow the translation unit in the patterned unit to be replaced. As needed, a patterned unit may contain the corresponding original unit, part of speech, attribute, sequence number in the sentence and similar information, so that the replacement is accurate and meets requirements. Examples of patterned units are: "他|he|pronoun|0", {\gold watch|3|金表|noun|物品/}, {\wife|6|妻子/}. Here "他", "金表" and "妻子" are the contents of the translation units; "he", "gold watch" and "wife" are the contents of the original units; "pronoun" and "noun" are the parts of speech of the original units (they may also be the parts of speech of the translation units); and "0", "3" and "6" are the word sequence numbers of the original units in the original sentence. For consistency with computer programming, sequence numbers start from "0".
The items of information within a patterned unit may be separated by a specific delimiter, such as the character "|" or ",", a space, or a tab, so that the units can be better recognized and processed when patterned sentence pairs are used for translation. Each patterned unit may be marked by a specific pair of symbols, such as "{" and "}", or "{\" and "/}", so that it can be easily recognized.
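Because each patterned unit is bracketed by "{\" and "/}" and its fields are separated by "|", one part of a sentence pair can be split into conventional text and patterned units with a single regular expression. The sketch below is an illustration under stated assumptions: the function name is hypothetical, and the example string is assembled from the units and conventional units quoted in the text (the exact content of Figure 2A is not reproduced here).

```python
import re
from typing import List, Union

# A unit starts with "{\" and ends with "/}"; fields inside are "|"-separated.
UNIT_RE = re.compile(r"\{\\(.*?)/\}")


def split_units(part: str) -> List[Union[str, List[str]]]:
    """Split one part of a sentence pair into conventional text (str)
    and patterned units (list of field strings), in order."""
    segments: List[Union[str, List[str]]] = []
    pos = 0
    for match in UNIT_RE.finditer(part):
        if match.start() > pos:                  # conventional text before the unit
            segments.append(part[pos:match.start()])
        segments.append([field.strip() for field in match.group(1).split("|")])
        pos = match.end()
    if pos < len(part):                          # trailing conventional text
        segments.append(part[pos:])
    return segments


# Assumed example translation part (raw string keeps the backslashes literal).
example = r"{\he|0|他/}为{\his|5|他的/}{\wife|6|妻子/}买了一块{\gold watch|3|金表/}"
for segment in split_units(example):
    print(segment)
```

Conventional units come back as plain strings and patterned units as field lists, so a later replacement step can tell the two apart by type.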
In the example of Figure 2A, the translation in the second part contains the patterned units {\he|0|他/}, {\his|5|他的/}, {\wife|6|妻子/} and {\gold watch|3|金表/}. Although no patterned units are explicitly marked in the first part, the words or phrases of the first part to which the patterned units of the second part refer are implied to be replaceable. They are "he" at position 0, "his" at position 5, "wife" at position 6 and "gold_watch" at position 3.
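The implied replaceable positions can be recovered mechanically from the sequence numbers stored in the translation-side units. The following sketch assumes that the original sentence is tokenized as "he bought a gold_watch for his wife"; that sentence is inferred from the positions 0, 3, 5 and 6 quoted above and is not stated in the text. The function name is likewise hypothetical.

```python
import re

# Translation-side unit: {\original|index|translation/}
UNIT_RE = re.compile(r"\{\\(.*?)\|(\d+)\|(.*?)/\}")


def implied_replaceable(original_tokens, translation_part):
    """Return {index: original word} for every unit in the translation part,
    checking that each index really points at the named original word."""
    result = {}
    for word, index, _translation in UNIT_RE.findall(translation_part):
        i = int(index)
        # Multi-word units are written with a space in the unit ("gold watch")
        # but as one underscore-joined token in the sentence ("gold_watch").
        if original_tokens[i] != word.replace(" ", "_"):
            raise ValueError(f"unit {word!r} does not match token {i}")
        result[i] = original_tokens[i]
    return result


tokens = ["he", "bought", "a", "gold_watch", "for", "his", "wife"]  # assumed original
part = r"{\he|0|他/}为{\his|5|他的/}{\wife|6|妻子/}买了一块{\gold watch|3|金表/}"
print(sorted(implied_replaceable(tokens, part).items()))
```

The consistency check is a design choice of this sketch: it catches a unit whose sequence number has drifted away from the word it names.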
Further, the original recorded in the first part may also contain patterned units in addition to conventional units. A conventional unit is an invariable part, i.e., ordinary original text. A patterned unit is a replaceable part, i.e., that part of the original can be replaced by other content. Likewise, the numbers of conventional units and patterned units may be arbitrary; they are determined by the structure of the sentence and the needs of translation.
The patterned unit is recorded in a predetermined format. The purpose of patterning is to allow the original unit in the patterned unit to be replaced. As needed, a patterned unit may contain the part of speech, attribute and similar information of the word or phrase, so that the replacement is accurate and meets requirements. Preferably, the additional information in the patterned units of the first part (the original) complements the additional information in the patterned units of the second part (the translation). In addition, a patterned unit in the original is preferably generated at the same time as the corresponding patterned unit in the translation.
Figures 2B and 2C show examples in which the patterned units, i.e., the replaceable words or phrases, are marked in the first part (the original). As shown, the marking is as follows: {\he|pron/}, {\gold_watch|noun/}, {\his|prond|/}, {\wife|noun|/}. Other marking schemes may of course be used, provided they make the units easy to recognize and to replace.
In addition, in the original sentence of the first part, a patterned unit may also record, in a predetermined format, the content of the original unit together with the information of the corresponding translation unit. The translation unit information comprises the content of the translation unit and its part of speech, attribute, sequence number in the sentence and similar information, or any combination of these.
To allow the patterned sentence pair to be used more effectively, some additional information may also be recorded with it, such as the total number of units in the sentence, a modification mark, a quality level, a user name, an update date, a language number and so on, as shown in Figure 4. The additional information may be placed at the beginning or end of the patterned sentence pair or elsewhere, provided it is associated with that pair. "29|N|2|Logan88|031121|01" in Figure 4 is a specific example of such additional information.
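The additional-information record can be split into named fields. The sketch below assumes that the six fields of the Figure 4 example appear in the same order in which the text lists them (unit count, modification mark, quality level, user name, update date, language number); the text does not confirm this mapping, and the class and function names are hypothetical. The user-name field of the example is read here as "Logan88".

```python
from typing import NamedTuple


class PairInfo(NamedTuple):
    unit_count: int   # total number of units in the sentence
    modified: str     # modification mark
    quality: int      # quality level
    user: str         # user name
    updated: str      # update date, kept as text (the date format is not specified)
    language: str     # language number, kept as text to preserve leading zeros


def parse_pair_info(record: str, sep: str = "|") -> PairInfo:
    """Split an additional-information record into its six assumed fields."""
    count, modified, quality, user, updated, language = record.split(sep)
    return PairInfo(int(count), modified, int(quality), user, updated, language)


info = parse_pair_info("29|N|2|Logan88|031121|01")
print(info.unit_count, info.user, info.language)
```

The date and language number are deliberately left as strings, since interpreting "031121" or "01" would require conventions the text does not give.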
As can be seen from the above description, a patterned sentence pair of the present invention is both a translation instance and a translation model. It therefore retains the uniqueness of the specific translated sentence pair while having the generality of a translation pattern. With a patterned sentence pair, an input original sentence can be translated by conventional matching, which satisfies the special translation requirements of a specific sentence, and it can also be translated by patterned matching or by more advanced intelligent translation. For related content, see the other inventions related to the present invention.
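To make the "instance and model" idea concrete, the following sketch shows one way a stored pair could serve both purposes. It is not the matching method of the related inventions, which this text does not describe. It assumes that an input sentence matches when it has the same length as the stored original and differs only at replaceable positions, and that a small lexicon (hypothetical) supplies translations for the new words.

```python
import re

UNIT_RE = re.compile(r"\{\\(.*?)\|(\d+)\|(.*?)/\}")  # {\original|index|translation/}


def translate_with_pattern(original_tokens, translation_part, input_tokens, lexicon):
    """Return a translation of input_tokens, or None if the pair does not apply.

    Tokens at unit positions may differ from the stored original; every other
    token must match exactly. `lexicon` maps a new word to its translation.
    """
    if len(input_tokens) != len(original_tokens):
        return None
    replaceable = {int(i) for _w, i, _t in UNIT_RE.findall(translation_part)}
    for i, (old, new) in enumerate(zip(original_tokens, input_tokens)):
        if old != new and (i not in replaceable or new not in lexicon):
            return None

    def fill(match):
        index, stored = int(match.group(2)), match.group(3)
        new = input_tokens[index]
        # Unchanged word: reuse the stored translation; changed word: look it up.
        return stored if new == original_tokens[index] else lexicon[new]

    return UNIT_RE.sub(fill, translation_part)


original = ["he", "bought", "a", "gold_watch", "for", "his", "wife"]  # assumed
part = r"{\he|0|他/}为{\his|5|他的/}{\wife|6|妻子/}买了一块{\gold watch|3|金表/}"
lexicon = {"she": "她", "her": "她的", "husband": "丈夫"}  # illustrative entries
print(translate_with_pattern(original, part, original, lexicon))
print(translate_with_pattern(
    original, part, ["she", "bought", "a", "gold_watch", "for", "her", "husband"], lexicon))
```

With the identical input the pair behaves as a plain translation-memory entry; with changed words at unit positions it behaves as a pattern.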
The method for forming patterned sentence pairs of the present invention does not require any abstraction of the translated bilingual sentence pair (abstraction requires a great deal of grammatical thinking and induction and a large number of rules). It only requires adding some information that already exists, so the method is easy to implement by computer. For example, during translation with translation software, an interactive translation (IT) module or a computer-aided translation (CAT) module can be used to collect the information needed for the patterned units and to form the required patterned units, which are then written out as a patterned sentence pair.
Specific embodiments of the present invention are described in detail below with reference to Figures 5-12.
Referring first to Figure 5, it shows a first embodiment of the method for forming patterned sentence pairs according to the present invention.
In this embodiment, a method for forming a patterned bilingual sentence pair according to the present invention comprises:
Step S1: select a word in the original sentence. The word may be a single word, a word group, or a phrase.
Step S2: determine whether the grammatical attribute of the word satisfies the replaceable-word condition. The replaceable-word condition may be specified and judged by part of speech; for example, nouns, adjectives, pronouns and numerals may be predetermined as replaceable words. In that case, if the part of speech of a word is noun, adjective, pronoun or numeral, the grammatical attribute of the word satisfies the replaceable-word condition. The replaceable-word condition may of course also be specified and judged by the attribute of the word; for example, words whose attribute is "thing", "person", "time" or "place" may be defined as replaceable words.
If the result of Step S2 is "yes", Step S3 is performed: the identification information of the word and the translation content of the word are combined into a patterned unit, which is written to the translation part. As needed, the identification information may comprise the content of the original unit and its part of speech, attribute, sequence number in the sentence and similar information, or any combination of these. For more details, see the description of patterned units above.
If the result of Step S2 is "no", Step S4 is performed: the translation content of the word is written to the translation part.
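Steps S1 to S4 can be sketched as a short loop. This is an illustration under stated assumptions, not the disclosed implementation: the `Word` record, the function name, and the choice to visit the words in translated-sentence order are all hypothetical, and the replaceable-word condition uses the part-of-speech example given in the text.

```python
from dataclasses import dataclass

# Example replaceable-word condition from the text, judged by part of speech.
REPLACEABLE_POS = {"noun", "adjective", "pronoun", "numeral"}


@dataclass
class Word:
    text: str         # original word or phrase
    index: int        # sequence number in the original sentence, starting at 0
    pos: str          # part of speech
    translation: str  # chosen translation


def form_translation_part(words_in_target_order):
    """Build the translation part of a patterned sentence pair."""
    out = []
    for word in words_in_target_order:              # S1: select a word
        if word.pos in REPLACEABLE_POS:             # S2: replaceable-word test
            out.append("{\\" + word.text + "|" + str(word.index) + "|"
                       + word.translation + "/}")   # S3: write a patterned unit
        else:
            out.append(word.translation)            # S4: write the plain translation
    return "".join(out)


words = [
    Word("he", 0, "pronoun", "他"), Word("for", 4, "preposition", "为"),
    Word("his", 5, "pronoun", "他的"), Word("wife", 6, "noun", "妻子"),
    Word("bought", 1, "verb", "买了"), Word("a", 2, "article", "一块"),
    Word("gold watch", 3, "noun", "金表"),
]
print(form_translation_part(words))
```

Only information that already exists during translation (the word, its position, its part of speech and the chosen translation) is used, which is why no grammatical abstraction step is needed.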
Figure 6 shows a second embodiment of the method for forming a patterned bilingual sentence pair according to the present invention. It differs from the first embodiment shown in Figure 5 in that, when the result of Step S2 is "no", Step S5 is additionally performed: determine whether a special control character or instruction is present. Providing a special control character or instruction allows the formation of patterned units to be controlled flexibly. With it, a word whose grammatical attribute does not satisfy the replaceable-word condition can still be patterned, outside the predetermined rules.
If the result of Step S5 is "yes", Step S3 is performed: the identification information of the word and the translation content of the word are combined into a replaceable unit (i.e., a patterned unit), which is written to the translation part.
If the result of Step S5 is "no", Step S4 is performed: the translation content of the word is written to the translation part.
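The extra decision in Step S5 amounts to a second condition on the same branch. In the sketch below the special control character or instruction is modelled as a boolean argument, which is an assumption; the text does not say how the control is entered or represented. The function name is hypothetical.

```python
# Example replaceable-word condition from the text, judged by part of speech.
REPLACEABLE_POS = {"noun", "adjective", "pronoun", "numeral"}


def write_word(text, index, pos, translation, force_pattern=False):
    """Return what is written to the translation part for one word.

    force_pattern stands in for the special control character or instruction.
    """
    if pos in REPLACEABLE_POS or force_pattern:   # S2 "yes", or S2 "no" then S5 "yes"
        return "{\\" + text + "|" + str(index) + "|" + translation + "/}"  # S3
    return translation                            # S4


print(write_word("bought", 1, "verb", "买了"))                      # plain translation
print(write_word("bought", 1, "verb", "买了", force_pattern=True))  # patterned unit
```

This lets a user pattern a verb such as "bought" in one particular sentence without changing the general replaceable-word rules.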
Figure 7 shows a method for forming a patterned bilingual sentence pair according to a third embodiment of the present invention, in which the word corresponds to a word unit.
Before Step S1, Step S0 is performed: the words of the original sentence are formed into word units. Step S1 is then specifically: select one translation in the word unit.
A specific way of forming a word unit from a word of the original sentence is dictionary lookup: the original word is used to search a dictionary or sentences, yielding the corresponding translations (senses), part of speech, attributes, associations and the like. The word unit also contains the word sequence number of the word in the original sentence.
Further, Step S0 may be performed on all the words of the original sentence to form a word unit array.
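Forming a word unit array by dictionary lookup can be sketched as follows. The dictionary contents, the field names of a word unit and the function name are illustrative assumptions; Figure 14, which shows an actual word unit, is not reproduced in this text.

```python
# Toy dictionary: word -> list of (translation, part of speech). Entries are illustrative.
DICTIONARY = {
    "he": [("他", "pronoun")],
    "bought": [("买了", "verb"), ("购买了", "verb")],
    "wife": [("妻子", "noun"), ("夫人", "noun")],
}


def build_word_units(tokens):
    """Step S0 applied to every token: one word unit per word of the sentence."""
    units = []
    for index, token in enumerate(tokens):
        senses = DICTIONARY.get(token.lower(), [])
        units.append({
            "index": index,                              # word sequence number, from 0
            "original": token,
            "translations": [t for t, _pos in senses],   # candidate translations (senses)
            "pos": senses[0][1] if senses else None,     # part of speech of the first sense
        })
    return units


for unit in build_word_units(["He", "bought", "wife"]):
    print(unit["index"], unit["original"], unit["translations"])
```

Step S1 of the third embodiment then corresponds to picking one entry from a unit's `translations` list; a word missing from the dictionary simply yields an empty list.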
Figure 8 shows a method for forming a patterned bilingual sentence pair according to a fourth embodiment of the present invention. It differs from the third embodiment shown in Figure 7 in that, when the result of Step S2 is "no", Step S5 is additionally performed: determine whether a special control character or instruction is present.
Figure 9 shows a method for forming a patterned bilingual sentence pair according to a fifth embodiment of the present invention. The method comprises:
finding the replaceable words in the original sentence, which may likewise be based on the various predetermined criteria or conditions discussed above;
finding, in the translated sentence, the translation corresponding to each replaceable word; and
adding the identification information of the word at that translation to form a patterned unit.
The original-text identification information comprises the content of the original unit and its part of speech, attribute, sequence number in the sentence and similar information, or any combination of these.
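Unlike the step-by-step embodiments, this method starts from an existing, finished sentence pair. A minimal sketch is given below. It assumes that the replaceable words and their translations have already been determined (for example by part of speech and dictionary lookup) and that each translation appears literally in the translated sentence; the function name is hypothetical.

```python
def annotate_translation(replaceable, translated):
    """Wrap each replaceable word's translation in the translated sentence
    as a patterned unit.

    replaceable: list of (original word, sequence number, translation).
    """
    spans = []  # (start, end, unit text) of translations already claimed
    # Longer translations first, so "他的" is claimed before the shorter "他".
    for word, index, tr in sorted(replaceable, key=lambda r: -len(r[2])):
        start = translated.find(tr)
        while start != -1 and any(s < start + len(tr) and start < e
                                  for s, e, _unit in spans):
            start = translated.find(tr, start + 1)   # skip overlapping occurrences
        if start != -1:
            unit = "{\\" + word + "|" + str(index) + "|" + tr + "/}"
            spans.append((start, start + len(tr), unit))
    out, pos = [], 0
    for start, end, unit in sorted(spans):
        out.append(translated[pos:start])            # conventional text stays as-is
        out.append(unit)
        pos = end
    out.append(translated[pos:])
    return "".join(out)


found = [("he", 0, "他"), ("gold watch", 3, "金表"), ("his", 5, "他的"), ("wife", 6, "妻子")]
print(annotate_translation(found, "他为他的妻子买了一块金表"))
```

The overlap check matters because one translation can be a substring of another. A translation that occurs more than once is matched at its first free occurrence, which a real system would need to resolve by alignment.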
Figure 10 shows a method for forming a patterned bilingual sentence pair according to a sixth embodiment of the present invention. The method comprises:
finding the replaceable words in the original sentence, which may likewise be based on the various predetermined criteria or conditions discussed above;
finding, in the translated sentence, the translation corresponding to each replaceable word; and
combining the identification information of the word and the translation content into a patterned unit, and replacing the original translation content with it.
The original-text identification information comprises the content of the original unit and its part of speech, attribute, sequence number in the sentence and similar information, or any combination of these.
Figure 11 shows a first embodiment of an apparatus for forming patterned bilingual sentence pairs according to the present invention, wherein a patterned bilingual sentence pair has patterned units at least in its translation part, and a patterned unit contains the content of a translation unit and the corresponding original-text identification information. The apparatus comprises:
a judging module for determining whether the grammatical attribute of a word satisfies the replaceable-word condition;
a patterned unit forming module for combining the identification information of a word and its content into a patterned unit;
a writing module for writing the translation of a word, or a patterned unit, to the translation part; and
a word unit forming module for forming word units, which may be done by dictionary lookup.
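The four modules map naturally onto four small classes. The sketch below is one possible arrangement and not the disclosed apparatus: the class and method names, the way the modules are wired together, and the toy dictionary are all assumptions.

```python
class JudgingModule:
    """Decides whether a word satisfies the replaceable-word condition."""
    def __init__(self, replaceable_pos):
        self.replaceable_pos = set(replaceable_pos)

    def is_replaceable(self, pos):
        return pos in self.replaceable_pos


class PatternUnitFormingModule:
    """Combines identification information and translation into a patterned unit."""
    def form(self, original, index, translation):
        return "{\\" + original + "|" + str(index) + "|" + translation + "/}"


class WritingModule:
    """Collects what is written to the translation part."""
    def __init__(self):
        self.parts = []

    def write(self, text):
        self.parts.append(text)

    def translation_part(self):
        return "".join(self.parts)


class WordUnitFormingModule:
    """Forms word units by dictionary lookup."""
    def __init__(self, dictionary):
        self.dictionary = dictionary  # word -> (translation, part of speech)

    def form(self, index, word):
        translation, pos = self.dictionary[word]
        return {"index": index, "original": word, "translation": translation, "pos": pos}


# Wiring the modules together for a two-word sentence (illustrative data).
judge = JudgingModule({"noun", "adjective", "pronoun", "numeral"})
former = PatternUnitFormingModule()
writer = WritingModule()
units = WordUnitFormingModule({"he": ("他", "pronoun"), "sleeps": ("睡觉", "verb")})
for i, w in enumerate(["he", "sleeps"]):
    u = units.form(i, w)
    if judge.is_replaceable(u["pos"]):
        writer.write(former.form(u["original"], u["index"], u["translation"]))
    else:
        writer.write(u["translation"])
print(writer.translation_part())
```

Keeping the judging rule in its own module means the replaceable-word condition can be changed (by part of speech or by attribute) without touching unit formation or writing.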
Figure 12 shows a second embodiment of an apparatus for forming patterned bilingual sentence pairs according to the present invention, wherein a patterned bilingual sentence pair has patterned units at least in its translation part, and a patterned unit contains the content of a translation unit and the corresponding original-text identification information. The apparatus comprises:
a judging module for determining whether the grammatical attribute of a word satisfies the replaceable-word condition;
a patterned unit forming module for combining the identification information of a word and its content into a patterned unit;
a writing module for writing the translation of a word, or a patterned unit, to the translation part; and
a word unit forming module for forming word units, which may be done by dictionary lookup.
The word unit forming module may perform the word unit forming operation on all the words of the original sentence to form a word unit array.
Figure 13 shows the user interface of an apparatus for forming patterned bilingual sentence pairs according to the present invention. In it, every word of the original sentence "We see the wonderful translation result of the system with TM++ technology." is displayed in the interactive translation area (the upper half of the figure), and each word has been formed into a word unit. The figure shows in particular further translations (senses) for word unit No. 3 (the fourth word). In the interactive translation area, clicking a translation with the mouse triggers the apparatus for forming patterned bilingual sentence pairs of the present invention, which forms a patterned sentence pair according to the forming method of the present invention.
Figure 14 schematically shows an example of a word unit.
Although various aspects, implementations and embodiments of the present application have been described in detail above, the invention of the present application is not limited to them. Those skilled in the art may make various changes, variations or modifications. Provided such changes, variations or modifications do not depart from the spirit and principle of the present invention, they shall be included within its scope.

Claims

1. A method for forming a patterned bilingual sentence pair, wherein the patterned bilingual sentence pair has patterned units at least in its translation part, and a patterned unit contains the content of a translation unit and the corresponding original-text identification information, the method comprising:
Step S1: selecting a word in the original sentence;
Step S2: determining whether the grammatical attribute of the word satisfies the replaceable-word condition;
if the result of Step S2 is "yes", performing Step S3: combining the identification information of the word and the translation content of the word into a replaceable unit, and writing it to the translation part;
if the result of Step S2 is "no", performing Step S4: writing the translation content of the word to the translation part.
2. The method for forming a patterned bilingual sentence pair according to claim 1, wherein, when the result of Step S2 is "no", Step S5 is additionally performed: determining whether a special control character or instruction is present;
if the result of Step S5 is "yes", performing Step S3: combining the identification information of the word and the translation content of the word into a replaceable unit, and writing it to the translation part;
if the result of Step S5 is "no", performing Step S4: writing the translation content of the word to the translation part.
3. The method for forming a patterned bilingual sentence pair according to claim 1, wherein the word corresponds to a word unit;
before Step S1, Step S0 is performed: forming the words of the original sentence into word units;
Step S1 is specifically: selecting one translation in the word unit.
4. The method for forming a patterned bilingual sentence pair according to claim 3, wherein Step S0 is specifically: forming the words of the original sentence into word units by dictionary lookup;
further, Step S0 may be performed on all the words of the original sentence to form a word unit array.
5. The method for forming a patterned bilingual sentence pair according to claim 1, wherein the original-text identification information comprises the content of the original unit and its part of speech, attribute, sequence number in the sentence and similar information, or any combination of these.
6. A method for forming a patterned bilingual sentence pair, wherein the patterned bilingual sentence pair has a patterned unit at least in its translation part; the patterned unit contains translation unit content and the corresponding original-text identification information; the method comprises:
finding a replaceable word in the original sentence;
finding the translation of the replaceable word in the translated sentence;
adding the identification information of the word at the translation to form a patterned unit.
7. A method for forming a patterned bilingual sentence pair, wherein the patterned bilingual sentence pair has a patterned unit at least in its translation part; the patterned unit contains translation unit content and the corresponding original-text identification information; the method comprises:
finding a replaceable word in the original sentence;
finding the translation of the replaceable word in the translated sentence;
combining the identification information of the word and the translated content into a patterned unit, and replacing the original translated content with that unit.
8. The method for forming a patterned bilingual sentence pair according to claim 7, wherein the original-text identification information comprises the content of the original-text unit and information such as the part of speech, the attributes, or the in-sentence sequence number of the original-text unit, or any combination of these kinds of information.
9. A device for forming a patterned bilingual sentence pair, wherein the patterned bilingual sentence pair has a patterned unit at least in its translation part; the patterned unit contains translation unit content and the corresponding original-text identification information; the device comprises:
a judging module, configured to determine whether the grammatical attribute of a word satisfies the replaceable-word condition;
a patterned-unit forming module, configured to combine the identification information of the word and its content into a patterned unit;
a writing module, configured to write the translation of the word or the patterned unit to the translation part.
10. The device for forming a patterned bilingual sentence pair according to claim 9, further comprising:
a word unit forming module, configured to form word units, where the forming method may be dictionary lookup;
the word unit forming module may perform the word unit forming operation on all words in the original sentence to form an array of word units.
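For illustration only, the flow recited in steps S0 to S5 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the applicant's implementation: the toy `DICTIONARY`, the `REPLACEABLE_POS` condition, the `forced` flag standing in for a special control character or instruction, and the `<index:pos:text|translation>` notation for a replaceable unit are all hypothetical, since the claims do not specify them. The sketch also writes translations in source word order and does not attempt any reordering.

```python
from dataclasses import dataclass, field

# Hypothetical toy dictionary: original word -> (part of speech, candidate translations).
DICTIONARY = {
    "我": ("pronoun", ["I", "me"]),
    "吃": ("verb", ["eat"]),
    "苹果": ("noun", ["apple"]),
}

# Hypothetical replaceable-word condition; the claims do not enumerate one.
REPLACEABLE_POS = {"noun"}


@dataclass
class WordUnit:
    index: int                 # sequence number of the word in the sentence
    text: str                  # original-text content
    pos: str                   # part of speech
    translations: list = field(default_factory=list)
    forced: bool = False       # stands in for a special control character or instruction


def form_word_units(words):
    """Step S0: build a word-unit array by dictionary lookup."""
    units = []
    for index, word in enumerate(words, start=1):
        # Unknown words fall back to themselves as their only "translation".
        pos, translations = DICTIONARY.get(word, ("unknown", [word]))
        units.append(WordUnit(index, word, pos, list(translations)))
    return units


def emit(unit, choice=0):
    """Steps S1 to S5 for one word unit; returns the text for the translation part."""
    translation = unit.translations[choice]        # S1: select one translation
    replaceable = unit.pos in REPLACEABLE_POS      # S2: grammatical attribute check
    if not replaceable:
        replaceable = unit.forced                  # S5: special control character or instruction
    if replaceable:
        # S3: identification information plus translation form one replaceable unit
        return f"<{unit.index}:{unit.pos}:{unit.text}|{translation}>"
    return translation                             # S4: plain translation


def form_translation_part(words):
    """Write the output of every word unit, in order, to the translation part."""
    return " ".join(emit(unit) for unit in form_word_units(words))
```

Under these assumptions, `form_translation_part(["我", "吃", "苹果"])` yields `I eat <3:noun:苹果|apple>`: only the noun satisfies the assumed replaceable-word condition, so only its translation is wrapped together with its identification information.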
PCT/CN2010/077772 2009-10-20 2010-10-15 Forming method of patterned bilingual sentence pair and forming device thereof WO2011047608A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200910180877.2A CN102043773B (en) 2009-10-20 2009-10-20 Method and device for forming modularized bilingual sentence pairs
CN200910180877.2 2009-10-20

Publications (1)

Publication Number Publication Date
WO2011047608A1 (en) 2011-04-28

Family

ID=43899826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/077772 WO2011047608A1 (en) 2009-10-20 2010-10-15 Forming method of patterned bilingual sentence pair and forming device thereof

Country Status (2)

Country Link
CN (1) CN102043773B (en)
WO (1) WO2011047608A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391840A (en) * 2014-11-24 2015-03-04 上海迈外迪网络科技有限公司 Translation method and device
CN105183722A (en) * 2015-09-17 2015-12-23 成都优译信息技术有限公司 Chinese-English bilingual translation corpus alignment method
CN105183723A (en) * 2015-09-17 2015-12-23 成都优译信息技术有限公司 Associating method for translation software and language material searching

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1272655A (en) * 2000-06-19 2000-11-08 李玉鉴(鑑) English-Chinese translation machine
JP2006127405A (en) * 2004-11-01 2006-05-18 Advanced Telecommunication Research Institute International Method for carrying out alignment of bilingual parallel text and executable program in computer
CN101206643A (en) * 2006-12-21 2008-06-25 中国科学院计算技术研究所 Translation method syncretizing sentential form template and statistics mechanical translation technique
CN101441623A (en) * 2007-11-20 2009-05-27 富士施乐株式会社 Translation device, and information processing method

Also Published As

Publication number Publication date
CN102043773B (en) 2014-09-03
CN102043773A (en) 2011-05-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10824449

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10824449

Country of ref document: EP

Kind code of ref document: A1