WO2022179149A1 - Machine translation method and apparatus based on translation memory - Google Patents

Machine translation method and apparatus based on translation memory

Publication number
WO2022179149A1
WO2022179149A1 PCT/CN2021/126674 CN2021126674W WO2022179149A1 WO 2022179149 A1 WO2022179149 A1 WO 2022179149A1 CN 2021126674 W CN2021126674 W CN 2021126674W WO 2022179149 A1 WO2022179149 A1 WO 2022179149A1
Authority
WO
WIPO (PCT)
Prior art keywords
translation
translated
corpus
original
original text
Prior art date
Application number
PCT/CN2021/126674
Other languages
French (fr)
Chinese (zh)
Inventor
毛红保
Original Assignee
语联网(武汉)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 语联网(武汉)信息技术有限公司
Publication of WO2022179149A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The present application relates to the technical field of machine translation, and in particular, to a method and device for machine translation based on translation memory.
  • A translation memory is a bilingual corpus generated and retained by translators during the translation process. It usually consists of relatively high-quality translations that have been manually proofread. Because the corpus in the translation memory is limited, it is likely that no entry exactly matching the current text to be translated can be retrieved, so the translation of the current text cannot be obtained directly from the translation memory.
  • Translation memories can nevertheless be used to assist current translation tasks.
  • The existing method is to retrieve from the translation memory the corpus entry most similar to the current text to be translated, and present the corresponding translation to the translator.
  • The translator then manually modifies the translation of the similar corpus entry according to the current text to be translated, to obtain the translation of the current text.
  • Due to the large differences in sentence structure and expression between the original and translated texts of similar corpus entries, translators need to spend a lot of time checking and editing the translations of similar entries, which is labor-intensive.
  • The present application provides a method and device for machine translation based on translation memory, which solve the prior-art problem that checking and editing translations of similar corpus entries is time-consuming and laborious for translators, and which realize automatic translation of the text to be translated based on the translation memory.
  • the application provides a machine translation method based on translation memory, including:
  • wherein the machine translation model is obtained by training with translated original-text samples as samples and the translations corresponding to those samples as labels.
  • using the replaced translation of the original text of the corpus and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated, includes:
  • inputting the encoding result of the original text to be translated and the encoding result of the translation of the original text of the corpus into the decoder of the machine translation model, and outputting the translation of the original text to be translated.
  • inputting the encoding result of the original text to be translated and the encoding result of the translation of the original text of the corpus into the decoder of the machine translation model, and outputting the translation of the original text to be translated, includes:
  • the translation of the original text to be translated is output through the linear processing layer and the softmax layer of the decoder in turn.
  • the mask includes brackets and preset characters; wherein the preset characters are located inside the brackets.
  • the mask for replacing the translation mapped by each difference part also includes the number of that difference part, and the number is located inside the brackets.
  • the mapping of the difference part to the translation of the original text of the corpus includes:
  • the difference part is mapped to the translation of the original corpus.
  • the machine translation model is a Transformer model.
  • the application also provides a machine translation device based on translation memory, including:
  • the search module is used to search the original corpus with the highest similarity to the original to be translated and the translation of the original corpus from the translation memory;
  • a comparison module configured to compare the original text to be translated and the original text of the corpus, and obtain the difference parts in the original text of the corpus that are different from the original text to be translated;
  • a replacement module configured to map the difference part to the translation of the original corpus, and replace the translation mapped with the difference part in the translation of the original corpus with a mask;
  • a translation module configured to use the replaced translation of the original text of the corpus and the original text to be translated as the input of the machine translation model, and output the translation of the original text to be translated;
  • the machine translation model is obtained by training a sample of the translated original text as a sample, and a translation corresponding to the translated original sample as a label.
  • the present application also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of any of the above translation memory-based machine translation methods.
  • the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements the steps of any one of the above-mentioned translation memory-based machine translation methods.
  • The translation memory-based machine translation method and device provided by the present application search the translation memory for the original text of the corpus with the highest similarity to the original text to be translated, together with the translation of that original text of the corpus, and automatically compare the original text to be translated with the original text of the corpus, which effectively reduces the work intensity of manual checking. The difference parts in the original corpus are then mapped to the translation of the original corpus, the translations mapped by the difference parts are replaced with masks, and finally the original text to be translated is automatically translated in combination with the replaced translation of the original corpus. This not only improves translation efficiency and reduces translation costs, but also improves translation accuracy.
  • Fig. 1 is the first schematic flowchart of the translation memory-based machine translation method provided by the application;
  • Fig. 2 is a schematic structural diagram of the machine translation model in the translation memory-based machine translation method provided by the application;
  • Fig. 3 is the second schematic flow chart of the translation memory-based machine translation method provided by the application.
  • FIG. 4 is a schematic structural diagram of a translation memory-based machine translation device provided by the application.
  • FIG. 5 is a schematic structural diagram of an electronic device provided by the present application.
  • the translation memory-based machine translation method of the present application is described below with reference to FIG. 1 .
  • the method includes: Step 101 , searching for the original corpus with the highest similarity to the original to be translated and the translation of the original corpus from the translation memory;
  • the original text to be translated may be a text that needs to be translated in various application fields, such as engineering, advertising, or medicine. This embodiment is not limited to the type and quantity of the original text to be translated.
  • a large amount of bilingual corpus data is stored in the translation memory, and these corpus data are data of relatively high translation quality after manual proofreading.
  • A text similarity retrieval method can be used: the original text to be translated is taken as the query text, the original corpus with the highest similarity to it is retrieved from the translation memory, and the translation of that original corpus is retrieved along with it.
  • the method of calculating the similarity may be calculating the Pearson correlation or the Euclidean distance between the original text to be translated and the original text of the corpus in the translation memory. This embodiment is not limited to the calculation method of the similarity.
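  • As a minimal sketch of the retrieval in Step 101, the block below scores each translation-memory entry against the query with the Euclidean distance between bag-of-words count vectors, and also shows a Pearson correlation helper. The whitespace tokenization, the bag-of-words representation, the function names, and the French target sentences are all assumptions made for illustration; the application does not specify how the texts are vectorized.

```python
import math
from collections import Counter


def bow_vectors(a_tokens, b_tokens):
    """Build two count vectors over the joint vocabulary of both token lists."""
    vocab = sorted(set(a_tokens) | set(b_tokens))
    count_a, count_b = Counter(a_tokens), Counter(b_tokens)
    return [count_a[w] for w in vocab], [count_b[w] for w in vocab]


def euclidean_distance(u, v):
    """Smaller distance means higher similarity."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def pearson_correlation(u, v):
    """Larger correlation means higher similarity; returns 0.0 for constant vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    std_u = math.sqrt(sum((x - mean_u) ** 2 for x in u))
    std_v = math.sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (std_u * std_v) if std_u and std_v else 0.0


def retrieve_most_similar(query, memory):
    """Return the (source, translation) pair whose source is closest to the query."""
    query_tokens = query.lower().split()

    def distance(entry):
        u, v = bow_vectors(query_tokens, entry[0].lower().split())
        return euclidean_distance(u, v)

    return min(memory, key=distance)


# Hypothetical translation memory: (original corpus, translation of the original corpus).
memory = [
    ("I have a pear", "J'ai une poire"),
    ("The weather is nice today", "Il fait beau aujourd'hui"),
]
best_source, best_translation = retrieve_most_similar("I have an apple", memory)
```

    Swapping `euclidean_distance` for `pearson_correlation` would require selecting the maximum rather than the minimum, since the two measures point in opposite directions.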
  • Step 102 compare the original text to be translated and the original text of the corpus, and obtain the difference parts in the original text of the corpus that are different from the original text to be translated;
  • The original text of the corpus retrieved from the translation memory may or may not be completely consistent with the original text to be translated. Therefore, after retrieving the original corpus from the translation memory, it is necessary to compare the original to be translated with the original of the corpus to determine whether they are completely consistent. One option is to perform word segmentation on both texts, compare the words at the same positions in the two texts, and determine from the comparison result whether the two texts are completely consistent. This embodiment is not limited to this determination method.
  • Where the two texts are not completely consistent, the difference is marked in the original text of the corpus.
  • For example, the original text to be translated is "I have an apple", and the original text of the corpus with the highest similarity is "I have a pear".
  • The difference between the original text of the corpus and the original text to be translated can be obtained as "pear".
  • The difference parts can be marked in the original corpus.
  • For example, the marked original text of the corpus is "I have a [pear]"; this embodiment is not limited to this marking method.
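  • The comparison and marking in Step 102 can be sketched as follows. This is only one possible realization: it uses Python's standard `difflib.SequenceMatcher` on pre-segmented token lists instead of a strict position-by-position comparison, and the function name and the sample sentences (chosen so that exactly one token differs) are assumptions.

```python
from difflib import SequenceMatcher


def mark_differences(query_tokens, corpus_tokens):
    """Return the corpus tokens with differing tokens wrapped in brackets,
    plus the indices of the corpus tokens that differ from the query."""
    matcher = SequenceMatcher(a=query_tokens, b=corpus_tokens, autojunk=False)
    diff_indices = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "replace" and "insert" both contribute corpus tokens absent from the query.
        if tag in ("replace", "insert"):
            diff_indices.extend(range(j1, j2))
    marked = [
        f"[{token}]" if j in diff_indices else token
        for j, token in enumerate(corpus_tokens)
    ]
    return marked, diff_indices


marked, diff_indices = mark_differences(
    ["I", "have", "a", "peach"],  # original text to be translated (hypothetical)
    ["I", "have", "a", "pear"],   # original text of the corpus
)
```

    Returning the token indices alongside the marked text keeps the later mapping step independent of the bracket notation.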
  • Step 103 mapping the difference part to the translation of the original text of the corpus, and replacing the mapped translation of the difference part in the translation of the original text of the corpus with a mask;
  • the difference parts can be mapped to the translation of the original text of the corpus.
  • For example, the original corpus is "I have a pear", the translation of the original corpus is "I have a pear", the difference part in the original corpus is "pear", and the corresponding difference part in the translation of the original corpus is "pear".
  • If the marked original text of the corpus is "I have a [pear]", the marked translation of the original text is "I have a [pear]".
  • mask replacement is performed on the translation mapped by the difference part in the translation of the original corpus.
  • the type of mask can be set according to actual needs.
  • Step 104, taking the replaced translation of the original text of the corpus and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated; wherein the machine translation model is obtained by training with translated original-text samples as samples and the translations corresponding to those samples as labels.
  • The replaced translation of the original corpus and the original to be translated can be input into the machine translation model; by learning from both inputs, the model can output an accurate translation of the original to be translated.
  • the machine translation model can be a neural machine translation model, but is not limited to this type.
  • the original text to be translated and the translation of the original text to be translated output by the machine translation model can also be added to the translation memory to provide rich corpus data for the expansion of the translation memory.
  • The original text to be translated is automatically translated in combination with the replaced translation of the original corpus, which not only improves translation accuracy, but also reduces the work intensity of checking and editing, improves translation efficiency, and reduces translation costs.
  • In this embodiment, the original text of the corpus with the highest similarity to the original text to be translated, together with its translation, is searched for in the translation memory, and the original text to be translated is automatically compared with the original text of the corpus, thereby effectively reducing the work intensity of manual checking. The difference parts in the original corpus are then mapped to the translation of the original corpus, the translations mapped by the difference parts are replaced with masks, and finally the original text to be translated is automatically translated in combination with the replaced translation of the original corpus. This not only improves translation efficiency and reduces translation costs, but also improves translation accuracy.
  • using the replaced translation of the original corpus and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated, includes: inputting the original text to be translated into the first encoder of the machine translation model, and outputting the encoding result of the original text to be translated; inputting the replaced translation of the original corpus into the second encoder of the machine translation model, and outputting the encoding result of the translation of the original corpus; and inputting the encoding result of the original text to be translated and the encoding result of the translation of the original corpus into the decoder of the machine translation model, and outputting the translation of the original text to be translated.
  • the machine translation model is a multi-input translation model, including two parallel encoders, namely, a first encoder and a second encoder. Wherein, the first encoder and the second encoder may be multiple layers. This embodiment is not limited to the number and structure of encoder layers.
  • the machine translation model further includes a decoder, and the decoder may also be multi-layered, and this embodiment is not limited to the number and structure of the decoder layers.
  • The original text to be translated can be input into the first encoder, which learns from it and outputs the encoding result of the original text to be translated; at the same time, the replaced translation of the original corpus is input into the second encoder, which learns from it and outputs the encoding result of the translation of the original corpus. Both encoding results are then input into the decoder, which learns from them and outputs the final translation result.
  • inputting the encoding result of the original text to be translated and the encoding result of the translation of the original text of the corpus into the decoder of the machine translation model, and outputting the translation of the original text to be translated, includes: inputting the encoding result of the original text to be translated and the encoding result of the translation of the original text of the corpus into the cross-attention mechanism layer of the decoder, then passing the result sequentially through the linear processing layer and the softmax layer of the decoder, and outputting the translation of the original text to be translated.
  • The decoder includes multiple sub-layers, and each sub-layer includes a self-attention layer, a cross-attention layer and a feed-forward neural network layer. As shown in Figure 2, the decoder also includes an input layer, a Linear (linear processing) layer, and a softmax layer. The Linear layer is used to flatten the input features into the form of a 1D tensor.
  • the result of the first cross-attention operation is output.
  • the second cross-attention operation result is output.
  • the result of the second cross-attention operation is sequentially passed through the linear processing layer and the softmax layer of the decoder to output the translation of the original text to be translated.
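  • The decoder data flow described above (two cross-attention operations followed by the linear processing layer and the softmax layer) can be sketched in pure Python as below. This is a heavily simplified illustration under stated assumptions: the order in which the two encoder outputs are attended to is assumed, and learned query/key/value projections, multiple heads, residual connections, layer normalization, self-attention and the feed-forward layer are all omitted. All names and numeric values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_attention(queries, memory):
    """Scaled dot-product attention; memory rows act as both keys and values."""
    dim = len(memory[0])
    scale = math.sqrt(dim)
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in memory])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, memory)) for d in range(dim)]
        )
    return outputs


def linear(x, weight, bias):
    """Affine map y = W x + b."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def decode_step(decoder_states, enc_source, enc_masked_translation, weight, bias):
    # First cross-attention: attend over the first encoder's output (assumed order).
    first = cross_attention(decoder_states, enc_source)
    # Second cross-attention: attend over the second encoder's output.
    second = cross_attention(first, enc_masked_translation)
    # Linear processing layer, then softmax, for each decoder position.
    return [softmax(linear(h, weight, bias)) for h in second]


# Toy inputs: 2 decoder positions, hidden size 2, output vocabulary of size 3.
probs = decode_step(
    decoder_states=[[0.1, 0.2], [0.3, 0.1]],
    enc_source=[[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    enc_masked_translation=[[0.2, 0.8], [0.9, 0.1]],
    weight=[[0.5, -0.5], [0.1, 0.3], [-0.2, 0.4]],
    bias=[0.0, 0.1, -0.1],
)
```

    Each row of `probs` is a probability distribution over the toy vocabulary for one decoder position.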
  • the mask in this embodiment includes brackets and preset characters; wherein, the preset characters are located inside the brackets.
  • Brackets and preset characters can be used as the mask.
  • For example, the brackets can be square brackets and the preset character can be mask, in which case the mask is [mask].
  • the present embodiment is not limited to this type of mask.
  • The translation mapped by the difference part in the translation of the original corpus can be replaced by [mask]. For example, if the translation of the original corpus is "I have a pear" and "pear" is the translation mapped by the difference part, then the translation of the original corpus after mask replacement is "I have a [mask]".
  • Where there are multiple difference parts, the mask for replacing the translation mapped by each difference part also includes the number of that difference part, and the number is located inside the brackets.
  • The translations mapped by the corresponding difference parts are replaced one by one using a plurality of masks, each containing a number.
  • Examples of such numbered masks are [mask1] and [mask2], where the 1 and 2 inside the brackets are the numbers of the difference parts.
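  • A minimal sketch of the mask replacement follows. It assumes the mapped translations are given as non-overlapping, half-open token spans, and that masks are numbered only when more than one difference part exists; the function name and the second sample sentence are hypothetical.

```python
def apply_masks(translation_tokens, spans):
    """Replace each (start, end) token span with a mask.

    A single span becomes [mask]; several spans become [mask1], [mask2], ...
    numbered in order of appearance. Spans are assumed not to overlap.
    """
    numbered = len(spans) > 1
    output, cursor = [], 0
    for number, (start, end) in enumerate(sorted(spans), start=1):
        output.extend(translation_tokens[cursor:start])  # keep unchanged tokens
        output.append(f"[mask{number}]" if numbered else "[mask]")
        cursor = end
    output.extend(translation_tokens[cursor:])
    return output


single = apply_masks(["I", "have", "a", "pear"], [(3, 4)])
multiple = apply_masks(
    ["I", "have", "a", "pear", "and", "a", "plum"], [(3, 4), (6, 7)]
)
```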
  • mapping the difference part to the translation of the original corpus includes: performing word alignment on the original corpus and the translation of the original corpus; and, according to the word alignment result, mapping the difference part to the translation of the original corpus.
  • a word alignment tool can be used to perform automatic word alignment on the original corpus and the translation of the original corpus. After word alignment, there is a correspondence between each word in the original corpus and each word in the translation of the original corpus.
  • the word alignment tool may be a fast_align word alignment tool or a GIZA++ word alignment tool, etc. This embodiment is not limited to the word alignment tool.
  • For example, the original corpus is "I have a pear",
  • and the translation of the original corpus is "I have a pear".
  • Using the word alignment result, the difference parts can be quickly mapped from the original corpus to the translation of the original corpus.
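  • The mapping from difference parts to the translation can be sketched as below, assuming the alignment is available as whitespace-separated `i-j` index pairs (source token index, then target token index), which is the form in which fast_align writes its output. The helper names, the French target sentence and the alignment line itself are hypothetical.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs into a dict: source index -> set of target indices."""
    mapping = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        mapping.setdefault(int(src), set()).add(int(tgt))
    return mapping


def map_differences(diff_source_indices, alignment):
    """Project differing source-token indices onto target-token indices."""
    targets = set()
    for i in diff_source_indices:
        targets |= alignment.get(i, set())  # unaligned source tokens map to nothing
    return sorted(targets)


def to_spans(indices):
    """Group sorted indices into contiguous half-open (start, end) spans."""
    spans = []
    for idx in indices:
        if spans and spans[-1][1] == idx:
            spans[-1] = (spans[-1][0], idx + 1)
        else:
            spans.append((idx, idx + 1))
    return spans


# Source: "I have a pear"; hypothetical target: "J'ai une poire".
alignment = parse_alignment("0-0 1-0 2-1 3-2")
target_indices = map_differences([3], alignment)  # "pear" is the difference part
target_spans = to_spans(target_indices)
```

    Grouping adjacent target indices into spans lets a multi-word translation of one difference part be replaced by a single mask.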
  • the machine translation model described in this embodiment is a Transformer model.
  • a multi-input Transformer model can be used to translate the original text to be translated.
  • the Transformer model uses a self-attention network for encoding and decoding.
  • Both the Encoder (encoder) and the Decoder (decoder) are composed of multiple sub-layers, and each sub-layer includes a self-attention layer and a feed-forward neural network layer.
  • an Encoder-Decoder cross-attention layer is attached between the self-attention layer and the feed-forward neural network layer.
  • Transformer models achieve state-of-the-art translation performance in many language translations.
  • Step 1, match the original text to be translated against the original texts of the corpus in the translation memory, and output the original text of the corpus with the highest similarity to the original text to be translated, together with its translation;
  • Step 2, perform word alignment on the original corpus and the translation of the original corpus;
  • Step 3, compare the original text of the corpus with the original text to be translated, and mark the differences existing in the original text of the corpus;
  • Step 4, map the marked differences in the original corpus to the translation of the original corpus;
  • Step 5, use the mask to replace the translation mapped by the difference part in the translation of the original text of the corpus;
  • Step 6, use the replaced translation of the original corpus and the original text to be translated as the input of the machine translation model, and output the translation of the original text to be translated.
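  • The six steps can be wired together as in the sketch below. It is an illustration only: similarity is approximated with `difflib.SequenceMatcher.ratio()`, word alignments are looked up from a precomputed table rather than produced by an alignment tool, and `model` is a stand-in callable for the trained machine translation model. Every name and sample value here is an assumption.

```python
from difflib import SequenceMatcher


def translate_with_memory(query, memory, alignments, model):
    """Run Steps 1-6 on whitespace-tokenized text and return the model output."""
    q = query.split()

    # Step 1: retrieve the most similar (source, translation) pair.
    def similarity(entry):
        return SequenceMatcher(a=q, b=entry[0].split(), autojunk=False).ratio()

    source, target = max(memory, key=similarity)
    s, t = source.split(), target.split()

    # Step 2: word alignment as (source index, target index) pairs.
    links = alignments[source]

    # Step 3: indices of source tokens that differ from the query.
    opcodes = SequenceMatcher(a=q, b=s, autojunk=False).get_opcodes()
    diff = {
        j
        for tag, _, _, j1, j2 in opcodes
        if tag in ("replace", "insert")
        for j in range(j1, j2)
    }

    # Step 4: map the differing source tokens onto target tokens.
    masked_positions = {tj for si, tj in links if si in diff}

    # Step 5: replace the mapped target tokens with a mask.
    masked = ["[mask]" if j in masked_positions else tok for j, tok in enumerate(t)]

    # Step 6: feed both inputs to the translation model.
    return model(query, " ".join(masked))


memory = [
    ("I have a pear", "J'ai une poire"),
    ("The weather is nice today", "Il fait beau aujourd'hui"),
]
alignments = {"I have a pear": [(0, 0), (1, 0), (2, 1), (3, 2)]}


def echo_model(source_text, masked_translation):
    """Stand-in model that simply returns its two inputs for inspection."""
    return source_text, masked_translation


result = translate_with_memory("I have a peach", memory, alignments, echo_model)
```

    With a real model in place of `echo_model`, the masked translation acts as a template and the model fills the masked positions using the original text to be translated.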
  • the translation memory-based machine translation apparatus provided by the present application is described below.
  • the translation memory-based machine translation apparatus described below and the translation memory-based machine translation method described above may refer to each other correspondingly.
  • the present embodiment provides a machine translation device based on translation memory.
  • the device includes a search module 401, a comparison module 402, a replacement module 403 and a translation module 404, wherein:
  • the search module 401 is used for searching the original corpus with the highest similarity to the original to be translated and the translation of the original corpus from the translation memory;
  • the original text to be translated may be a text that needs to be translated in various application fields, such as engineering, advertising, or medicine. This embodiment is not limited to the type and quantity of the original text to be translated.
  • a large amount of bilingual corpus data is stored in the translation memory, and these corpus data are data of relatively high translation quality after manual proofreading.
  • A text similarity retrieval method can be used: the original text to be translated is taken as the query text, the original corpus with the highest similarity to it is retrieved from the translation memory, and the translation of that original corpus is retrieved along with it.
  • the method of calculating the similarity may be calculating the Pearson correlation or the Euclidean distance between the original text to be translated and the original text of the corpus in the translation memory. This embodiment is not limited to the calculation method of the similarity.
  • the comparison module 402 is configured to compare the original text to be translated and the original text of the corpus, and obtain the difference parts in the original text of the corpus that are different from the original text to be translated;
  • The original text of the corpus retrieved from the translation memory may or may not be completely consistent with the original text to be translated. Therefore, after retrieving the original corpus from the translation memory, it is necessary to compare the original to be translated with the original of the corpus to determine whether they are completely consistent. One option is to perform word segmentation on both texts, compare the words at the same positions in the two texts, and determine from the comparison result whether the two texts are completely consistent. This embodiment is not limited to this determination method.
  • the replacement module 403 is configured to map the difference part to the translation of the original text of the corpus, and replace the mapped translation of the difference part in the translation of the original text of the corpus with a mask;
  • the difference parts can be mapped to the translation of the original text of the corpus. Then, mask replacement is performed on the translation mapped by the difference part in the translation of the original corpus.
  • the type of mask can be set according to actual needs.
  • the translation module 404 is configured to use the replaced translation of the original text of the corpus and the original text to be translated as the input of the machine translation model, and output the translation of the original text to be translated; wherein the machine translation model is obtained by training with translated original-text samples as samples and the translations corresponding to those samples as labels.
  • The replaced translation of the original corpus and the original to be translated can be input into the machine translation model; by learning from both inputs, the model can output an accurate translation of the original to be translated.
  • the machine translation model may be a neural machine translation model, but is not limited to this type.
  • the original text to be translated and the translation of the original text to be translated output by the machine translation model can also be added to the translation memory to provide rich corpus data for the expansion of the translation memory.
  • The original text to be translated is automatically translated in combination with the replaced translation of the original corpus, which not only improves translation accuracy, but also reduces the work intensity of checking and editing, improves translation efficiency, and reduces translation costs.
  • In this embodiment, the original text of the corpus with the highest similarity to the original text to be translated, together with its translation, is searched for in the translation memory, and the original text to be translated is automatically compared with the original text of the corpus, thereby effectively reducing the work intensity of manual checking. The difference parts in the original corpus are then mapped to the translation of the original corpus, the translations mapped by the difference parts are replaced with masks, and finally the original text to be translated is automatically translated in combination with the replaced translation of the original corpus. This not only improves translation efficiency and reduces translation costs, but also improves translation accuracy.
  • the translation module in this embodiment is specifically configured to: input the original text to be translated into the first encoder of the machine translation model, and output the encoding result of the original text to be translated; input the replaced translation of the original corpus into the second encoder of the machine translation model, and output the encoding result of the translation of the original corpus; and input the encoding result of the original text to be translated and the encoding result of the translation of the original corpus into the decoder of the machine translation model, and output the translation of the original text to be translated.
  • the translation module in this embodiment is further configured to input the encoding result of the original text to be translated and the encoding result of the translation of the original corpus into the cross-attention mechanism layer of the decoder, and then pass the result sequentially through the linear processing layer and the softmax layer of the decoder to output the translation of the original text to be translated.
  • the mask in this embodiment includes brackets and preset characters; wherein, the preset characters are located inside the brackets.
  • Where there are multiple difference parts, the mask for replacing the translation mapped by each difference part also includes the number of that difference part, and the number is located inside the brackets.
  • this embodiment further includes a mapping module for performing word alignment on the original text of the corpus and the translation of the original text of the corpus, and, according to the word alignment result, mapping the difference part to the translation of the original text of the corpus.
  • the machine translation model described in this embodiment is a Transformer model.
  • FIG. 5 illustrates a schematic diagram of the physical structure of an electronic device.
  • the electronic device may include: a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502, and the memory 503 communicate with each other through the communication bus 504.
  • the processor 501 may invoke the logic instructions in the memory 503 to execute a translation memory-based machine translation method, the method comprising: searching the translation memory for the original corpus with the highest similarity to the original to be translated and for the translation of the original corpus; comparing the original text to be translated and the original text of the corpus, and obtaining the difference part in the original text of the corpus that is different from the original text to be translated; mapping the difference part to the translation of the original text of the corpus, and replacing the translation mapped by the difference part in the translation of the original text of the corpus with a mask; using the replaced translation of the original text of the corpus and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated; wherein the machine translation model is obtained by training with translated original-text samples as samples and the translations corresponding to those samples as labels.
  • the above-mentioned logic instructions in the memory 503 can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • the present application also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions. When the program instructions are executed by a computer, the computer can execute the translation memory-based machine translation method provided above, the method comprising: searching a translation memory for the corpus source text having the highest similarity to the source text to be translated, and for the translation of the corpus source text; comparing the source text to be translated with the corpus source text, and obtaining the difference part of the corpus source text that differs from the source text to be translated; mapping the difference part to the translation of the corpus source text, and replacing the portion of that translation to which the difference part is mapped with a mask; and taking the translation of the corpus source text after replacement, together with the source text to be translated, as the input of a machine translation model, and outputting the translation of the source text to be translated; wherein the machine translation model is obtained by training with translation source text samples as samples and the translations corresponding to those samples as labels.
  • the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program carries out the translation memory-based machine translation method provided above, the method comprising: searching a translation memory for the corpus source text having the highest similarity to the source text to be translated, and for the translation of the corpus source text; comparing the source text to be translated with the corpus source text, and obtaining the difference part of the corpus source text that differs from the source text to be translated; mapping the difference part to the translation of the corpus source text, and replacing the portion of that translation to which the difference part is mapped with a mask; and taking the translation of the corpus source text after replacement, together with the source text to be translated, as the input of a machine translation model, and outputting the translation of the source text to be translated; wherein the machine translation model is obtained by training with translation source text samples as samples and the translations corresponding to those samples as labels.
  • the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
  • each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and can of course also be implemented in hardware.
  • the above technical solutions, in essence, or the parts of them that contribute over the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various embodiments or in some parts of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Provided in the present application is a machine translation method based on a translation memory. The method comprises: searching, in a translation memory, for corpus source text having the highest similarity to source text to be translated, and translated text of the corpus source text; comparing the source text to be translated with the corpus source text, and acquiring a difference part in the corpus source text that is different from the source text to be translated; mapping the difference part to the translated text of the corpus source text, and replacing translated text, to which the difference part is mapped, in the translated text of the corpus source text with a mask; and taking the translated text of the corpus source text after replacement is performed and the source text to be translated as an input of a machine translation model, and outputting translated text of the source text to be translated, wherein the machine translation model is obtained by means of performing training by taking a translation source text sample as a sample and taking translated text corresponding to the translation source text sample as a label. By means of the present application, translation is performed by combining source text to be translated and translated text of corpus source text, such that the translation efficiency can be improved, the translation cost can be reduced, and the translation accuracy can also be improved.

Description

Machine translation method and apparatus based on translation memory
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims priority to Chinese patent application No. 202110203208.3, filed on February 23, 2021 and entitled "Machine translation method and apparatus based on translation memory", which is incorporated herein by reference in its entirety.
TECHNICAL FIELD
The present application relates to the technical field of machine translation, and in particular to a machine translation method and apparatus based on translation memory.
BACKGROUND
A translation memory is a bilingual corpus produced and retained by translators in the course of translation. It usually consists of data whose translations are of relatively high quality after manual proofreading. Because the corpus in a translation memory is limited, it is quite likely that no entry exactly identical to the current text to be translated can be retrieved from it, so the translation of the current text cannot be obtained directly from the translation memory.
A translation memory can be used to assist the current translation task. The existing approach is to retrieve from the translation memory a corpus entry similar to the current text to be translated and to present the corresponding translation to the translator. The translator then manually modifies the translation of the similar entry according to the current text to obtain the translation of the current text.
Because the source text and the translation of a similar corpus entry differ considerably in sentence structure, manner of expression and the like, the translator has to spend a great deal of time checking and editing the translation of the similar entry, which is labor-intensive.
SUMMARY OF THE INVENTION
The present application provides a machine translation method and apparatus based on translation memory, to overcome the drawback of the prior art that checking and editing the translations of similar corpus entries is time-consuming and laborious for translators, and to translate the text to be translated automatically on the basis of the translation memory.
The present application provides a machine translation method based on translation memory, comprising:
searching a translation memory for the corpus source text having the highest similarity to the source text to be translated, and for the translation of the corpus source text;
comparing the source text to be translated with the corpus source text, and obtaining the difference part of the corpus source text that differs from the source text to be translated;
mapping the difference part to the translation of the corpus source text, and replacing the portion of the translation of the corpus source text to which the difference part is mapped with a mask;
taking the translation of the corpus source text after replacement, together with the source text to be translated, as the input of a machine translation model, and outputting the translation of the source text to be translated;
wherein the machine translation model is obtained by training with translation source text samples as samples and the translations corresponding to the translation source text samples as labels.
According to the machine translation method based on translation memory provided by the present application, taking the translation of the corpus source text after replacement and the source text to be translated as the input of the machine translation model, and outputting the translation of the source text to be translated, comprises:
inputting the source text to be translated into a first encoder of the machine translation model, and outputting an encoding result of the source text to be translated;
inputting the translation of the corpus source text after replacement into a second encoder of the machine translation model, and outputting an encoding result of the translation of the corpus source text;
inputting the encoding result of the source text to be translated and the encoding result of the translation of the corpus source text into a decoder of the machine translation model, and outputting the translation of the source text to be translated.
According to the machine translation method based on translation memory provided by the present application, inputting the encoding result of the source text to be translated and the encoding result of the translation of the corpus source text into the decoder of the machine translation model, and outputting the translation of the source text to be translated, comprises:
inputting the encoding result of the source text to be translated and the encoding result of the translation of the target text into a cross-attention layer of the decoder, then passing the result through a linear processing layer and a softmax layer of the decoder in turn, and outputting the translation of the source text to be translated.
According to the machine translation method based on translation memory provided by the present application, the mask comprises brackets and a preset character string, wherein the preset character string is located inside the brackets.
According to the machine translation method based on translation memory provided by the present application, if there are multiple difference parts, the mask replacing the translation to which each difference part is mapped further comprises the number of that difference part, the number being located inside the brackets.
According to the machine translation method based on translation memory provided by the present application, mapping the difference part to the translation of the corpus source text comprises:
performing word alignment between the corpus source text and the translation of the corpus source text;
mapping the difference part to the translation of the corpus source text according to the word alignment result.
According to the machine translation method based on translation memory provided by the present application, the machine translation model is a Transformer model.
The present application also provides a machine translation apparatus based on translation memory, comprising:
a search module, configured to search a translation memory for the corpus source text having the highest similarity to the source text to be translated, and for the translation of the corpus source text;
a comparison module, configured to compare the source text to be translated with the corpus source text and obtain the difference part of the corpus source text that differs from the source text to be translated;
a replacement module, configured to map the difference part to the translation of the corpus source text and replace the portion of the translation of the corpus source text to which the difference part is mapped with a mask;
a translation module, configured to take the translation of the corpus source text after replacement, together with the source text to be translated, as the input of a machine translation model, and output the translation of the source text to be translated;
wherein the machine translation model is obtained by training with translation source text samples as samples and the translations corresponding to the translation source text samples as labels.
The present application also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the machine translation methods based on translation memory described above.
The present application also provides a non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of any of the machine translation methods based on translation memory described above.
The machine translation method and apparatus based on translation memory provided by the present application search the translation memory for the corpus source text having the highest similarity to the source text to be translated, and for its translation, and automatically compare the source text to be translated with the corpus source text, which effectively reduces the workload of manual checking. The difference part of the corpus source text is then mapped to the translation of the corpus source text, and the portion of that translation to which the difference part is mapped is replaced with a mask. Finally, the source text to be translated is translated automatically using both the translation of the corpus source text after replacement and the source text to be translated. This not only improves translation efficiency and reduces translation cost, but can also improve translation accuracy.
BRIEF DESCRIPTION OF THE DRAWINGS
To explain the technical solutions of the present application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is the first schematic flowchart of the machine translation method based on translation memory provided by the present application;
FIG. 2 is a schematic structural diagram of the machine translation model used in the machine translation method based on translation memory provided by the present application;
FIG. 3 is the second schematic flowchart of the machine translation method based on translation memory provided by the present application;
FIG. 4 is a schematic structural diagram of the machine translation apparatus based on translation memory provided by the present application;
FIG. 5 is a schematic structural diagram of the electronic device provided by the present application.
DETAILED DESCRIPTION
To make the objectives, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the drawings of the present application. The described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present application without creative effort fall within the scope of protection of the present application.
The machine translation method based on translation memory of the present application is described below with reference to FIG. 1. The method comprises: Step 101, searching a translation memory for the corpus source text having the highest similarity to the source text to be translated, and for the translation of the corpus source text;
The source text to be translated may be text requiring translation in any application field, such as engineering, advertising or medicine. This embodiment does not limit the type or quantity of the source text to be translated. The translation memory stores a large amount of bilingual corpus data, all of which has relatively high translation quality after manual proofreading.
A text similarity retrieval method may be used: the source text to be translated serves as the query text, the corpus source text having the highest similarity to it is retrieved from the translation memory, and the translation of that corpus source text is then taken from the translation memory. The similarity may be computed, for example, as the Pearson correlation or the Euclidean distance between the source text to be translated and a corpus source text in the translation memory. This embodiment does not limit the way in which the similarity is computed.
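The retrieval in step 101 can be illustrated with a minimal sketch. It assumes, purely for illustration, that each text is represented as a vector of character counts and that the entry with the smallest Euclidean distance is treated as the most similar one; the application does not prescribe a vector representation, and the names `euclidean_distance` and `find_best_match` are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(text_a, text_b):
    """Euclidean distance between character-count vectors (an assumed representation)."""
    counts_a, counts_b = Counter(text_a), Counter(text_b)
    characters = set(counts_a) | set(counts_b)
    # Counter returns 0 for characters that are absent from a text.
    return math.sqrt(sum((counts_a[c] - counts_b[c]) ** 2 for c in characters))


def find_best_match(source_to_translate, translation_memory):
    """Return the (corpus source text, translation) pair closest to the query.

    translation_memory is assumed to be a list of (source, translation) pairs.
    """
    return min(
        translation_memory,
        key=lambda pair: euclidean_distance(source_to_translate, pair[0]),
    )


translation_memory = [
    ("我有一个梨", "I have a pear"),
    ("今天天气很好", "The weather is nice today"),
]
corpus_source, corpus_translation = find_best_match("我有一个苹果", translation_memory)
```

A production system would more likely index the translation memory and use a fuzzy-match score, but a linear scan keeps the idea visible.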
Step 102, comparing the source text to be translated with the corpus source text, and obtaining the difference part of the corpus source text that differs from the source text to be translated;
Specifically, the corpus source text retrieved from the translation memory may or may not be identical to the source text to be translated. Therefore, after the corpus source text has been retrieved, the source text to be translated and the corpus source text need to be compared to determine whether they are identical. For example, both texts may be segmented into words, the words at the same positions in the two texts compared, and the comparison result used to determine whether the two texts are identical. This embodiment is not limited to this way of making the determination.
If the source text to be translated and the corpus source text are not identical, the difference part is marked in the corpus source text. For example, if the source text to be translated is "我有一个苹果" ("I have an apple") and the corpus source text with the highest similarity is "我有一个梨" ("I have a pear"), the comparison shows that the difference part of the corpus source text is "梨" ("pear"), and this part can be marked in the corpus source text. The marked corpus source text is "我有一个[梨]". This embodiment is not limited to this way of marking.
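The comparison and marking in step 102 can be sketched as follows. This is an illustrative assumption rather than the method fixed by the application: it compares the two texts character by character with Python's standard `difflib.SequenceMatcher` instead of segmenting them into words first, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def diff_spans(corpus_source, source_to_translate):
    """Return (start, end) character spans of corpus_source that differ from the query.

    Text that exists only in the query (an "insert" opcode) has no span in the
    corpus source text and is therefore not reported by this sketch.
    """
    matcher = SequenceMatcher(None, corpus_source, source_to_translate, autojunk=False)
    return [
        (i1, i2)
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("replace", "delete")
    ]


def mark_differences(corpus_source, spans):
    """Wrap each difference span of the corpus source text in square brackets."""
    pieces, position = [], 0
    for start, end in spans:
        pieces.append(corpus_source[position:start])
        pieces.append("[" + corpus_source[start:end] + "]")
        position = end
    pieces.append(corpus_source[position:])
    return "".join(pieces)


spans = diff_spans("我有一个梨", "我有一个苹果")
marked = mark_differences("我有一个梨", spans)
print(marked)  # 我有一个[梨]
```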
Step 103, mapping the difference part to the translation of the corpus source text, and replacing the portion of the translation of the corpus source text to which the difference part is mapped with a mask;
Specifically, after the difference part of the corpus source text has been obtained, it can be mapped to the translation of the corpus source text. For example, if the corpus source text is "我有一个梨", its translation is "I have a pear", and the difference part of the corpus source text is "梨", then the corresponding difference part in the translation is "pear". After the difference part is marked, the corpus source text is "我有一个[梨]"; after the marked difference part is mapped to the translation, the translation of the corpus source text is "I have a [pear]".
The portion of the translation of the corpus source text to which the difference part is mapped is then replaced with a mask. The type of mask can be set according to actual needs.
Step 104, taking the translation of the corpus source text after replacement, together with the source text to be translated, as the input of a machine translation model, and outputting the translation of the source text to be translated; wherein the machine translation model is obtained by training with translation source text samples as samples and the translations corresponding to the translation source text samples as labels.
Specifically, the translation of the corpus source text after replacement and the source text to be translated are input into the machine translation model, which learns from both inputs and can output an accurate translation of the source text to be translated. The machine translation model may be a neural machine translation model, but is not limited to this type.
In addition, the source text to be translated and the translation output for it by the machine translation model may be added to the translation memory, providing further corpus data for expanding the translation memory.
Because the corpus data in the translation memory contains high-quality translations, this embodiment translates the source text to be translated automatically using both the translation of the corpus source text and the source text to be translated. This not only improves translation accuracy but also reduces the workload of checking and editing, improves translation efficiency, and reduces translation cost.
This embodiment searches the translation memory for the corpus source text having the highest similarity to the source text to be translated, and for its translation, and automatically compares the source text to be translated with the corpus source text, which effectively reduces the workload of manual checking. The difference part of the corpus source text is then mapped to the translation of the corpus source text, and the portion of that translation to which the difference part is mapped is replaced with a mask. Finally, the source text to be translated is translated automatically using both the translation of the corpus source text after replacement and the source text to be translated. This not only improves translation efficiency and reduces translation cost, but can also improve translation accuracy.
On the basis of the above embodiment, in this embodiment, taking the translation of the corpus source text after replacement and the source text to be translated as the input of the machine translation model, and outputting the translation of the source text to be translated, comprises: inputting the source text to be translated into a first encoder of the machine translation model, and outputting an encoding result of the source text to be translated; inputting the translation of the corpus source text after replacement into a second encoder of the machine translation model, and outputting an encoding result of the translation of the corpus source text; and inputting the encoding result of the source text to be translated and the encoding result of the translation of the corpus source text into a decoder of the machine translation model, and outputting the translation of the source text to be translated.
The machine translation model is a multi-input translation model comprising two parallel encoders, namely the first encoder and the second encoder. The first encoder and the second encoder may each have multiple layers; this embodiment does not limit the number of layers or the structure of the encoders. The machine translation model further comprises a decoder, which may also have multiple layers; this embodiment does not limit the number of layers or the structure of the decoder.
The source text to be translated is input into the first encoder, which learns from it and outputs the encoding result of the source text to be translated. At the same time, the translation of the corpus source text after replacement is input into the second encoder, which learns from it and outputs the encoding result of the translation of the corpus source text. The two encoding results are then input into the decoder, which learns from them and outputs the final translation result.
On the basis of the above embodiment, in this embodiment, inputting the encoding result of the source text to be translated and the encoding result of the translation of the corpus source text into the decoder of the machine translation model, and outputting the translation of the source text to be translated, comprises: inputting the encoding result of the source text to be translated and the encoding result of the translation of the target text into a cross-attention layer of the decoder, then passing the result through a linear processing layer and a softmax layer of the decoder in turn, and outputting the translation of the source text to be translated.
The decoder comprises multiple sub-layers, each of which comprises a feed-forward neural network layer, a cross-attention layer and a self-attention layer. As shown in FIG. 2, the decoder further comprises an input layer, a Linear (linear processing) layer and a softmax layer. The Linear layer is used to flatten the input features into a one-dimensional tensor.
A cross-attention operation is performed on the encoding result of the source text to be translated in the cross-attention layer of the decoder, and a first cross-attention result is output. A cross-attention operation is then performed on the first cross-attention result and the encoding result of the translation of the corpus source text, and a second cross-attention result is output. The second cross-attention result passes through the linear processing layer and the softmax layer of the decoder in turn, and the translation of the source text to be translated is output.
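The two successive cross-attention operations described above can be sketched in plain Python. The sketch rests on several assumptions that the application does not state: the decoder states act as the attention queries, each encoding result serves as both keys and values, and learned projections, multiple heads, residual connections and layer normalization are all omitted. The function names and the toy weight matrix are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, memory):
    """Scaled dot-product attention with `memory` used as both keys and values."""
    dim = len(memory[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in memory
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * value[j] for w, value in zip(weights, memory)) for j in range(dim)]
        )
    return outputs


def decoder_cross_attention(decoder_states, source_encoding, tm_translation_encoding):
    """First attend to the source encoding, then to the masked-translation encoding."""
    first_result = cross_attention(decoder_states, source_encoding)
    second_result = cross_attention(first_result, tm_translation_encoding)
    return second_result


def output_distribution(state, vocabulary_weights):
    """Linear layer (one weight row per vocabulary entry) followed by softmax."""
    logits = [sum(w * s for w, s in zip(row, state)) for row in vocabulary_weights]
    return softmax(logits)


decoder_states = [[1.0, 0.0], [0.0, 1.0]]
source_encoding = [[0.2, 0.1], [0.4, 0.3], [0.0, 0.5]]
tm_translation_encoding = [[0.5, 0.5]]
attended = decoder_cross_attention(decoder_states, source_encoding, tm_translation_encoding)
probabilities = output_distribution(attended[0], [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

In a trained Transformer each of these steps would carry learned parameters; the sketch only shows the order in which the two encoding results are consumed.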
On the basis of the above embodiments, in this embodiment the mask includes brackets and preset characters, and the preset characters are located inside the brackets.
Specifically, brackets and preset characters can be used as the mask. The brackets may be square brackets and the preset characters may be "mask", in which case the mask is [mask]. This embodiment is not limited to this type of mask. Using this mask, the part of the translation of the corpus original to which the difference part maps can be replaced with [mask]. For example, if the translation of the corpus original is "I have a pear" and "pear" is the translation to which the difference part maps, then the translation of the corpus original after mask replacement is "I have a [mask]".
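As a small illustration of the replacement described above, the following sketch masks the target-side tokens at given positions. The function name `mask_tokens` and the token-index representation of the difference part are assumptions made for the example.

```python
def mask_tokens(target_tokens, diff_positions, mask="[mask]"):
    """Replace the target tokens at diff_positions with the mask string."""
    positions = set(diff_positions)
    return [mask if i in positions else tok for i, tok in enumerate(target_tokens)]

# "pear" (index 3) is the translation to which the difference part maps.
masked = mask_tokens(["I", "have", "a", "pear"], [3])
print(" ".join(masked))  # I have a [mask]
```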
On the basis of the above embodiment, in this embodiment, if there are multiple difference parts, the mask that replaces the translation mapped by each difference part also includes the number of that difference part, and the number is located inside the brackets.
Specifically, if the translation of the corpus original contains multiple difference parts, multiple numbered masks, such as [mask1] and [mask2], are used to replace the translations mapped by the corresponding difference parts one by one. The 1 and 2 inside the brackets are the numbers of the difference parts.
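A hedged sketch of numbered masking follows. It assumes each difference part is given as a half-open `(start, end)` token range on the translation side and that the parts are numbered in their order of appearance in the translation; the application does not state the numbering order, and `mask_numbered` is a hypothetical name.

```python
def mask_numbered(target_tokens, diff_spans):
    """Replace each (start, end) span with [mask1], [mask2], ... in target order.

    diff_spans must be non-overlapping half-open index ranges.
    """
    out = []
    cursor = 0
    for number, (start, end) in enumerate(sorted(diff_spans), start=1):
        out.extend(target_tokens[cursor:start])  # keep tokens before the span
        out.append(f"[mask{number}]")            # one mask per difference part
        cursor = end
    out.extend(target_tokens[cursor:])           # keep the tail
    return out

tokens = ["I", "have", "a", "pear", "and", "an", "apple"]
print(" ".join(mask_numbered(tokens, [(3, 4), (6, 7)])))
# I have a [mask1] and an [mask2]
```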
On the basis of the above embodiments, in this embodiment, mapping the difference part to the translation of the corpus original includes: performing word alignment between the corpus original and the translation of the corpus original; and mapping the difference part to the translation of the corpus original according to the word alignment result.
Specifically, before the difference part is mapped to the translation of the corpus original, a word alignment tool can be used to automatically align the words of the corpus original with those of its translation. After word alignment, each word in the corpus original corresponds to a word in the translation of the corpus original. The word alignment tool may be fast_align, GIZA++ or a similar tool; this embodiment is not limited to a particular word alignment tool.
For example, the corpus original is "我有一个梨" and its translation is "I have a pear". After word alignment, "我" corresponds to "I", "有" corresponds to "have", "一个" corresponds to "a", and "梨" corresponds to "pear".
By automatically aligning the words of the corpus original with those of its translation, this embodiment can quickly map the difference part from the corpus original to the corpus translation.
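The mapping step can be sketched as follows, assuming the aligner's output is available as whitespace-separated `i-j` index pairs (source index, target index), the format commonly associated with fast_align. The helper names are hypothetical, and the alignment string below is written by hand rather than produced by a tool.

```python
def parse_alignment(line):
    """Parse 'i-j' pairs into (source_index, target_index) tuples."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs

def map_diff_to_target(diff_src_positions, alignment):
    """Return the sorted target positions aligned to the differing source positions."""
    diff = set(diff_src_positions)
    return sorted({t for s, t in alignment if s in diff})

src = ["我", "有", "一个", "梨"]
tgt = ["I", "have", "a", "pear"]
alignment = parse_alignment("0-0 1-1 2-2 3-3")  # hand-written for this example

# Source position 3 ("梨") differs from the text to be translated.
target_positions = map_diff_to_target([3], alignment)
print([tgt[i] for i in target_positions])  # ['pear']
```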
On the basis of the above embodiments, the machine translation model in this embodiment is a Transformer model.
Specifically, a multi-input Transformer model can be used to translate the original text to be translated. The Transformer model uses self-attention networks for encoding and decoding. Both the Encoder and the Decoder consist of multiple sub-layers, and each sub-layer includes a self-attention layer and a feed-forward neural network layer. In the Decoder, an Encoder-Decoder cross-attention layer is added between the self-attention layer and the feed-forward neural network layer. The Transformer model has achieved state-of-the-art performance on many translation tasks.
FIG. 3 is a schematic diagram of the complete flow of this embodiment. The specific steps include:
Step 1: match the original text to be translated against the corpus originals in the translation memory, and output the corpus original with the highest similarity to the original text to be translated together with its translation;
Step 2: perform word alignment between the corpus original and the translation of the corpus original;
Step 3: compare the corpus original with the original text to be translated, and mark the difference parts in the corpus original;
Step 4: map the difference parts marked in the corpus original to the translation of the corpus original;
Step 5: use a mask to replace the translation to which the difference parts map in the translation of the corpus original;
Step 6: use the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model, and output the translation of the original text to be translated.
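Steps 1 to 5 above can be sketched end to end as follows; step 6 (the translation model itself) is left out. This is an illustrative sketch under several assumptions: texts are already tokenized, each translation-memory entry carries a precomputed alignment, and Python's `difflib.SequenceMatcher` stands in for both the similarity measure and the comparison, which the application does not prescribe. The function name and the dictionary keys are hypothetical.

```python
from difflib import SequenceMatcher

def prepare_model_inputs(query, memory, mask="[mask]"):
    """Return (query, masked corpus translation) for the translation model.

    memory: list of dicts with 'src' and 'tgt' token lists and
    'align', a list of (src_index, tgt_index) pairs.
    """
    # Step 1: pick the entry whose source is most similar to the query.
    best = max(memory, key=lambda e: SequenceMatcher(None, e["src"], query).ratio())
    # Step 2: word alignment is assumed to be precomputed in best["align"].
    # Step 3: mark corpus-source positions that differ from the query.
    diff_src = set()
    for tag, i1, i2, _j1, _j2 in SequenceMatcher(None, best["src"], query).get_opcodes():
        if tag in ("replace", "delete"):
            diff_src.update(range(i1, i2))
    # Step 4: map the marked positions to the translation via the alignment.
    diff_tgt = {t for s, t in best["align"] if s in diff_src}
    # Step 5: replace the mapped translation tokens with the mask.
    masked = [mask if i in diff_tgt else tok for i, tok in enumerate(best["tgt"])]
    # Step 6 would feed (query, masked) to the machine translation model.
    return query, masked

memory = [
    {"src": ["我", "有", "一个", "梨"], "tgt": ["I", "have", "a", "pear"],
     "align": [(0, 0), (1, 1), (2, 2), (3, 3)]},
    {"src": ["他", "喜欢", "猫"], "tgt": ["He", "likes", "cats"],
     "align": [(0, 0), (1, 1), (2, 2)]},
]
_, masked = prepare_model_inputs(["我", "有", "一个", "苹果"], memory)
print(" ".join(masked))  # I have a [mask]
```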
The translation-memory-based machine translation apparatus provided by the present application is described below. The apparatus described below and the translation-memory-based machine translation method described above may be referred to in correspondence with each other.
As shown in FIG. 4, this embodiment provides a machine translation apparatus based on translation memory. The apparatus includes a search module 401, a comparison module 402, a replacement module 403 and a translation module 404, wherein:
The search module 401 is configured to search the translation memory for the corpus original with the highest similarity to the original text to be translated and for the translation of that corpus original;
The original text to be translated may be text from any application field that needs to be translated, such as engineering, advertising or medicine. This embodiment does not limit the type or quantity of the original text to be translated. The translation memory stores a large amount of bilingual corpus data, all of which has been manually proofread and is of relatively high translation quality.
A text similarity retrieval method can be used: the original text to be translated serves as the query text, the corpus original with the highest similarity to it is retrieved from the translation memory, and the translation of that corpus original is then taken from the translation memory. The similarity may be computed, for example, as the Pearson correlation or the Euclidean distance between the original text to be translated and the corpus originals in the translation memory. This embodiment is not limited to a particular similarity calculation.
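As one possible reading of the Euclidean-distance option, the sketch below represents each text as a bag-of-words count vector and returns the translation-memory entry closest to the query. The vector representation, the tokenization and the function names are assumptions; the application does not specify how texts are turned into vectors.

```python
import math
from collections import Counter

def euclidean_distance(tokens_a, tokens_b):
    """Euclidean distance between the bag-of-words count vectors of two texts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in set(a) | set(b)))

def retrieve_most_similar(query_tokens, memory):
    """memory: list of (source_tokens, target_tokens); smallest distance wins."""
    return min(memory, key=lambda pair: euclidean_distance(query_tokens, pair[0]))

memory = [
    (["我", "有", "一个", "梨"], ["I", "have", "a", "pear"]),
    (["他", "喜欢", "猫"], ["He", "likes", "cats"]),
]
source, translation = retrieve_most_similar(["我", "有", "一个", "苹果"], memory)
print(" ".join(translation))  # I have a pear
```

A linear scan like this is only practical for a small memory; a large translation memory would normally sit behind an index.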
The comparison module 402 is configured to compare the original text to be translated with the corpus original, and to obtain the difference parts of the corpus original that differ from the original text to be translated;
Specifically, the corpus original retrieved from the translation memory may or may not be identical to the original text to be translated. Therefore, after the corpus original has been retrieved from the translation memory, the original text to be translated and the corpus original need to be compared to determine whether they are identical. For example, both texts can be segmented into words, the words at the same positions in the two texts can be compared, and the comparison result determines whether the two texts are identical. This embodiment is not limited to this way of determining it.
If the original text to be translated and the corpus original are not identical, the difference parts are marked in the corpus original.
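The comparison can be sketched with Python's standard `difflib.SequenceMatcher`. This is a substitute for the strict same-position comparison mentioned above, chosen here because it also tolerates inserted or deleted words; the function name `diff_positions` is hypothetical, and both texts are assumed to be already segmented into words.

```python
from difflib import SequenceMatcher

def diff_positions(corpus_tokens, query_tokens):
    """Return the positions in corpus_tokens that differ from query_tokens."""
    positions = []
    matcher = SequenceMatcher(None, corpus_tokens, query_tokens)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # 'replace' and 'delete' cover corpus tokens with no equal counterpart.
        if tag in ("replace", "delete"):
            positions.extend(range(i1, i2))
    return positions

corpus = ["我", "有", "一个", "梨"]
query = ["我", "有", "一个", "苹果"]
print(diff_positions(corpus, query))   # [3]
print(diff_positions(corpus, corpus))  # []
```

An empty result means the two texts are identical, in which case the stored translation can be used directly.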
The replacement module 403 is configured to map the difference part to the translation of the corpus original, and to replace the translation to which the difference part maps in the translation of the corpus original with a mask;
Specifically, after the difference parts of the corpus original that differ from the original text to be translated have been obtained, they can be mapped to the translation of the corpus original. The translation to which the difference parts map is then replaced with a mask. The type of mask can be set according to actual requirements.
The translation module 404 is configured to use the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model, and to output the translation of the original text to be translated; wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to those samples as labels.
Specifically, the replaced translation of the corpus original and the original text to be translated can be input into the machine translation model. The model draws on both inputs and can output an accurate translation of the original text to be translated. The machine translation model may be a neural machine translation model, but is not limited to this type.
In addition, the original text to be translated and the translation output for it by the machine translation model can be added to the translation memory, providing further corpus data for expanding the translation memory.
Because the corpus data in the translation memory contains high-quality translations, this embodiment translates the original text automatically by combining the translation of the corpus original with the original text to be translated. This not only improves translation accuracy but also reduces the checking and editing workload, improves translation efficiency and lowers translation cost.
In this embodiment, the corpus original with the highest similarity to the original text to be translated and its translation are looked up in the translation memory, and the original text to be translated is automatically compared with the corpus original, which effectively reduces the manual checking workload. The difference parts of the corpus original are then mapped to the translation of the corpus original, and the translation to which they map is replaced with a mask. Finally, the replaced translation of the corpus original and the original text to be translated are combined to translate the original text automatically, which improves translation efficiency, lowers translation cost and improves translation accuracy.
On the basis of the above embodiment, the translation module in this embodiment is specifically configured to: input the original text to be translated into the first encoder of the machine translation model and output the encoding result of the original text to be translated; input the replaced translation of the corpus original into the second encoder of the machine translation model and output the encoding result of the translation of the corpus original; and input the encoding result of the original text to be translated and the encoding result of the translation of the corpus original into the decoder of the machine translation model and output the translation of the original text to be translated.
On the basis of the above embodiment, the translation module in this embodiment is further configured to input the encoding result of the original text to be translated and the encoding result of the translation of the target text into the cross-attention layer of the decoder, then pass the result through the linear processing layer and the softmax layer of the decoder in sequence, and output the translation of the original text to be translated.
On the basis of the above embodiments, in this embodiment the mask includes brackets and preset characters, and the preset characters are located inside the brackets.
On the basis of the above embodiment, in this embodiment, if there are multiple difference parts, the mask that replaces the translation mapped by each difference part also includes the number of that difference part, and the number is located inside the brackets.
On the basis of the above embodiments, this embodiment further includes a mapping module configured to perform word alignment between the corpus original and the translation of the corpus original, and to map the difference part to the translation of the corpus original according to the word alignment result.
On the basis of the above embodiments, the machine translation model in this embodiment is a Transformer model.
FIG. 5 is a schematic diagram of the physical structure of an electronic device. As shown in FIG. 5, the electronic device may include a processor 501, a communications interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communications interface 502 and the memory 503 communicate with each other through the communication bus 504. The processor 501 can invoke the logic instructions in the memory 503 to execute the translation-memory-based machine translation method, which includes: searching the translation memory for the corpus original with the highest similarity to the original text to be translated and for the translation of that corpus original; comparing the original text to be translated with the corpus original to obtain the difference parts of the corpus original that differ from the original text to be translated; mapping the difference parts to the translation of the corpus original, and replacing the translation to which the difference parts map with a mask; and using the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated; wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to those samples as labels.
In addition, the logic instructions in the memory 503 may be implemented as software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute all or some of the steps of the methods described in the embodiments of the present application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
In another aspect, the present application also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer can execute the translation-memory-based machine translation method provided above, which includes: searching the translation memory for the corpus original with the highest similarity to the original text to be translated and for the translation of that corpus original; comparing the original text to be translated with the corpus original to obtain the difference parts of the corpus original that differ from the original text to be translated; mapping the difference parts to the translation of the corpus original, and replacing the translation to which the difference parts map with a mask; and using the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated; wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to those samples as labels.
In yet another aspect, the present application also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the translation-memory-based machine translation method provided above, which includes: searching the translation memory for the corpus original with the highest similarity to the original text to be translated and for the translation of that corpus original; comparing the original text to be translated with the corpus original to obtain the difference parts of the corpus original that differ from the original text to be translated; mapping the difference parts to the translation of the corpus original, and replacing the translation to which the difference parts map with a mask; and using the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model, and outputting the translation of the original text to be translated; wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to those samples as labels.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solutions in essence, or the parts that contribute to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A machine translation method based on translation memory, characterized by comprising:
    searching a translation memory for the corpus original with the highest similarity to an original text to be translated and for the translation of the corpus original;
    comparing the original text to be translated with the corpus original to obtain the difference part of the corpus original that differs from the original text to be translated;
    mapping the difference part to the translation of the corpus original, and replacing the translation to which the difference part maps in the translation of the corpus original with a mask;
    using the replaced translation of the corpus original and the original text to be translated as the input of a machine translation model, and outputting the translation of the original text to be translated;
    wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to the original-text samples as labels.
  2. The machine translation method based on translation memory according to claim 1, characterized in that using the replaced translation of the corpus original and the original text to be translated as the input of the machine translation model and outputting the translation of the original text to be translated comprises:
    inputting the original text to be translated into a first encoder of the machine translation model, and outputting an encoding result of the original text to be translated;
    inputting the replaced translation of the corpus original into a second encoder of the machine translation model, and outputting an encoding result of the translation of the corpus original;
    inputting the encoding result of the original text to be translated and the encoding result of the translation of the corpus original into a decoder of the machine translation model, and outputting the translation of the original text to be translated.
  3. The machine translation method based on translation memory according to claim 2, characterized in that inputting the encoding result of the original text to be translated and the encoding result of the translation of the corpus original into the decoder of the machine translation model and outputting the translation of the original text to be translated comprises:
    inputting the encoding result of the original text to be translated and the encoding result of the translation of the target text into the cross-attention layer of the decoder, then passing the result through the linear processing layer and the softmax layer of the decoder in sequence, and outputting the translation of the original text to be translated.
  4. The machine translation method based on translation memory according to any one of claims 1-3, characterized in that the mask comprises brackets and preset characters, wherein the preset characters are located inside the brackets.
  5. The machine translation method based on translation memory according to claim 4, characterized in that, if there are multiple difference parts, the mask that replaces the translation mapped by each difference part further comprises the number of that difference part, and the number is located inside the brackets.
  6. The machine translation method based on translation memory according to any one of claims 1-3, characterized in that mapping the difference part to the translation of the corpus original comprises:
    performing word alignment between the corpus original and the translation of the corpus original;
    mapping the difference part to the translation of the corpus original according to the word alignment result.
  7. The machine translation method based on translation memory according to any one of claims 1-3, characterized in that the machine translation model is a Transformer model.
  8. A machine translation apparatus based on translation memory, characterized by comprising:
    a search module, configured to search a translation memory for the corpus original with the highest similarity to an original text to be translated and for the translation of the corpus original;
    a comparison module, configured to compare the original text to be translated with the corpus original and obtain the difference part of the corpus original that differs from the original text to be translated;
    a replacement module, configured to map the difference part to the translation of the corpus original and replace the translation to which the difference part maps in the translation of the corpus original with a mask;
    a translation module, configured to use the replaced translation of the corpus original and the original text to be translated as the input of a machine translation model and output the translation of the original text to be translated;
    wherein the machine translation model is obtained by training with original-text samples as samples and the translations corresponding to the original-text samples as labels.
  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the machine translation method based on translation memory according to any one of claims 1 to 7.
  10. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the machine translation method based on translation memory according to any one of claims 1 to 7.
PCT/CN2021/126674 2021-02-23 2021-10-27 Machine translation method and apparatus based on translation memory WO2022179149A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110203208.3 2021-02-23
CN202110203208.3A CN112818712B (en) 2021-02-23 2021-02-23 Machine translation method and device based on translation memory library

Publications (1)

Publication Number Publication Date
WO2022179149A1 (en) 2022-09-01

Family

ID=75865183

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/126674 WO2022179149A1 (en) 2021-02-23 2021-10-27 Machine translation method and apparatus based on translation memory

Country Status (2)

Country Link
CN (1) CN112818712B (en)
WO (1) WO2022179149A1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112818712B (en) * 2021-02-23 2024-06-11 语联网(武汉)信息技术有限公司 Machine translation method and device based on translation memory library
CN113420570B (en) * 2021-07-01 2024-04-30 沈阳创思佳业科技有限公司 Method, system and device for improving translation accuracy
CN114429144B (en) * 2021-12-28 2023-07-07 华东师范大学 Diversified machine translation method using auxiliary memory
CN114462427A (en) * 2022-01-26 2022-05-10 四川语言桥信息技术有限公司 Machine translation method and device based on term protection
CN114638241A (en) * 2022-03-30 2022-06-17 阿里巴巴(中国)有限公司 Data matching method, device, equipment and storage medium
CN115019330A (en) * 2022-06-16 2022-09-06 特赞(上海)信息科技有限公司 Cartoon translation matching method and system, electronic device and storage medium
CN115860015B (en) * 2022-12-29 2023-06-20 北京中科智加科技有限公司 Translation memory-based transcription text translation method and computer equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140163951A1 (en) * 2012-12-07 2014-06-12 Xerox Corporation Hybrid adaptation of named entity recognition
CN107885737A (en) * 2017-12-27 2018-04-06 传神语联网网络科技股份有限公司 A kind of human-computer interaction interpretation method and system
CN109710951A (en) * 2018-12-27 2019-05-03 北京百度网讯科技有限公司 Supplementary translation method, apparatus, equipment and storage medium based on translation history
CN110046359A (en) * 2019-04-16 2019-07-23 苏州大学 Neural machine translation method based on sample guidance
CN112818712A (en) * 2021-02-23 2021-05-18 语联网(武汉)信息技术有限公司 Machine translation method and device based on translation memory library

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6175900B2 (en) * 2013-05-23 2017-08-09 富士通株式会社 Translation apparatus, method, and program
CN109408834B (en) * 2018-12-17 2022-06-10 北京百度网讯科技有限公司 Auxiliary machine translation method, device, equipment and storage medium
CN110532575A (en) * 2019-08-21 2019-12-03 语联网(武汉)信息技术有限公司 Text interpretation method and device

Also Published As

Publication number Publication date
CN112818712B (en) 2024-06-11
CN112818712A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
WO2022179149A1 (en) Machine translation method and apparatus based on translation memory
CN110046261B (en) Construction method of multi-modal bilingual parallel corpus of construction engineering
WO2022088570A1 (en) Method and apparatus for post-editing of translation, electronic device, and storage medium
WO2020133039A1 (en) Entity identification method and apparatus in dialogue corpus, and computer device
US11822897B2 (en) Systems and methods for structured text translation with tag alignment
WO2022148104A1 (en) Machine translation method and system based on pre-training model
CN112541365B (en) Machine translation method and device based on term replacement
Jabaian et al. Comparison and combination of lightly supervised approaches for language portability of a spoken language understanding system
JP2009151777A (en) Method and apparatus for aligning spoken language parallel corpus
CN110717341A (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN114925170B (en) Text proofreading model training method and device and computing equipment
CN113343717A (en) Neural machine translation method based on translation memory library
CN113408307B (en) Neural machine translation method based on translation template
CN108491399A (en) Chinese to English machine translation method based on context iterative analysis
JP7520085B2 (en) Text error correction and text error correction model generation method, device, equipment, and medium
Pinnis et al. Tilde MT platform for developing client specific MT solutions
WO2022166267A1 (en) Machine translation post-editing method and system
Shi et al. Neural Chinese word segmentation as sequence to sequence translation
CN106776590A (en) A kind of method and system for obtaining entry translation
Zhang Research on English machine translation system based on the internet
Turganbayeva et al. The solution of the problem of unknown words under neural machine translation of the Kazakh language
Gutiérrez-Artacho et al. Human post-editing in hybrid machine translation systems: automatic and manual analysis and evaluation
CN111597827A (en) Method and device for improving machine translation accuracy
Gupta et al. Product Review Translation: Parallel Corpus Creation and Robustness towards User-Generated Noisy Text
Miao et al. An unknown word processing method in NMT by integrating syntactic structure and semantic concept

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927573

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927573

Country of ref document: EP

Kind code of ref document: A1