WO2020124674A1 - Method and device for vectorizing a translator's translation personality characteristics - Google Patents

Method and device for vectorizing a translator's translation personality characteristics

Info

Publication number
WO2020124674A1
WO2020124674A1 (PCT/CN2018/124915)
Authority
WO
WIPO (PCT)
Prior art keywords
bilingual
translator
corpus
model
lstm network
Prior art date
Application number
PCT/CN2018/124915
Other languages
English (en)
French (fr)
Inventor
张睦
Original Assignee
语联网(武汉)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 语联网(武汉)信息技术有限公司 filed Critical 语联网(武汉)信息技术有限公司
Publication of WO2020124674A1

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Definitions

  • Embodiments of the present disclosure relate to the technical field of natural language processing, and more particularly to a method and device for vectorizing a translator's translation personality characteristics.
  • In post-editing mode, the auxiliary translation tool calls a machine translation engine to produce a first draft, which professional translators then review and edit to produce a high-quality translation.
  • When translators' personalized characteristics are considered, the auxiliary translation tool can select the translator's preferred result from the outputs of multiple machine translation engines, which often reduces the translator's post-editing workload; conversely, when the translator receives a non-personalized recommendation, he or she usually needs more time and effort in post-editing to reach a personally satisfactory result.
  • In the automatic dispatch of translation jobs and in terminology prompts, the introduction of translators' personalized characteristics can likewise play a positive role.
  • Embodiments of the present disclosure provide a method and device for vectorizing a translator's translation personality characteristics that overcome the above problems or at least partially solve them.
  • In a first aspect, an embodiment of the present disclosure provides a method for vectorizing a translator's translation personality characteristics, including:
  • inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
  • wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
  • In a second aspect, an embodiment of the present disclosure provides a device for vectorizing a translator's translation personality characteristics, including:
  • a sample construction module, configured to select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
  • a word vector acquisition module, configured to input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and output the word vectors corresponding to the bilingual corpus samples;
  • a general model training module, configured to input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
  • a personalized training module, configured to input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
  • a vectorization module, configured to generate, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
  • wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
  • In a third aspect, an embodiment of the present disclosure provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor.
  • When the processor executes the program, the steps of the method for vectorizing a translator's translation personality characteristics provided in the first aspect are implemented.
  • In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the method for vectorizing a translator's translation personality characteristics provided in the first aspect.
  • In the method and device for vectorizing a translator's translation personality characteristics, the LSTM-network-based encoder model and decoder model are trained on the translators' historical translation corpora, so as to obtain translator vectors reflecting each translator's personal translation characteristics.
  • There is no need to screen translators' personality characteristics or to manually label sample data; the training cost is low and the accuracy is high.
  • The obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
  • FIG. 1 is a schematic flowchart of a method for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of an encoder model provided by an embodiment of the present disclosure encoding source text into a sentence vector;
  • FIG. 3 is a schematic diagram of a decoder model provided by an embodiment of the present disclosure generating a translation from a sentence vector;
  • FIG. 4 is a schematic diagram of training the LSTM network corresponding to a translator, provided by an embodiment of the present disclosure;
  • FIG. 5 is a schematic structural diagram of a device for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure;
  • FIG. 6 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
  • As shown in FIG. 1, a schematic flowchart of a method for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure, the method includes:
  • Step 100: Select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
  • Specifically, to obtain a translator's translation personality characteristics, the historical corpus translated by that translator is used.
  • T translators are selected, e.g., T = {t1, t2, t3, …, tT}, and M pairs of bilingual corpus samples are selected from each translator's historical corpus; the M pairs of bilingual corpus samples constitute the translator's own bilingual corpus.
  • A pair of bilingual corpus samples includes one source sentence and one target sentence.
  • T and M are both natural numbers greater than 1.
  • Preprocessing the bilingual sample set includes the following operations: performing word segmentation on each source sentence and target sentence in the bilingual sample set, and shuffling the original order of the sentences in the bilingual sample set;
  • the bilingual sample set is then divided into a training sample set and a validation sample set according to a certain ratio; for example, 80% of the data is selected as the training sample set and the remaining 20% serves as the validation sample set.
  • Step 101: Input the bilingual corpus samples in the preprocessed bilingual sample set one by one into a word vector model, and output the word vectors corresponding to the bilingual corpus samples;
  • word vectors are trained from word sequences within a fixed-size window over the corpus; they capture the semantic and grammatical information of words, are a low-dimensional, dense, real-valued vector representation, and vector operations between them can reveal the correlations between words.
  • Each pair of bilingual corpus samples in the preprocessed bilingual sample set is input one by one into a pretrained word vector model: for the source corpus there is a pretrained source word vector model, into which the segmented source sentence is input and which outputs the word vectors corresponding to the source sentence;
  • likewise, the target sentence is input into the pretrained target word vector model, which outputs the word vectors corresponding to the target sentence.
  • the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora.
  • The Skip-Gram algorithm is suitable for training word vectors on large-scale corpora.
  • Word vectors obtained with the Skip-Gram algorithm reflect the semantic correlations between words well.
  • Step 102: Input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
  • the training sample set and validation sample set split from the bilingual sample set, together with the word vectors pretrained in step 101, are used to train a neural network model based on the sequence-to-sequence framework, yielding a general LSTM network.
  • The sequence-to-sequence model contains two main modules: an encoder and a decoder.
  • The encoder uses a recurrent LSTM (Long Short-Term Memory) network to encode the source sentence into a vector, and the decoder generates the target sentence from this vector.
  • By comparing the translation output by the decoder model with the standard translation, the encoder model parameters and decoder model parameters are updated, optimizing the model and reducing the difference between the decoder's output and the standard translation.
  • Step 103: Input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
  • on the basis of the general LSTM network, each translator's own bilingual corpus is used to continue training it; during this training the encoder model parameters of the general LSTM network are kept unchanged, and an LSTM network is obtained for each translator.
  • Each translator's LSTM network has the following function: when source text is input into the LSTM network corresponding to a translator, a translation matching that translator's personal translation characteristics is obtained. It can be understood that the LSTM network corresponding to each translator is obtained through training, and the parameters of the decoder model in that LSTM network reflect the translator's translation personality characteristics.
  • Embodiments of the present disclosure obtain a translator's translation personality characteristics solely from the historical corpus translated by that translator, with no need to screen the translator's personality characteristics or to manually label sample data.
  • Step 104: Based on the parameters of the decoder model in the LSTM network corresponding to the translator, generate a translator vector reflecting that translator's translation personality characteristics;
  • since the purpose of the embodiments of the present disclosure is to quantify translators' translation personality characteristics, each translator's translator vector is generated from the parameters of the decoder model in that translator's LSTM network; the translator vector reflects the translator's translation personality characteristics, quantifying them as a dense vector of some dimension.
  • With the translator vector, auxiliary translation tools can select, from the outputs of multiple machine translation engines, the one that best fits the translator's translation personality, reducing the post-editing workload; the translator vector can also play a positive role in the automatic dispatch of translation jobs and in terminology prompts.
  • In the method for vectorizing a translator's translation personality characteristics, the LSTM-network-based encoder model and decoder model are trained on the translators' historical translation corpora to obtain translator vectors reflecting each translator's personal translation characteristics; there is no need to screen translators' personality characteristics or to manually label sample data, and the training cost is low while the accuracy is high.
  • The obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
  • The steps of training the word vector model are specifically:
  • obtaining, from the latest Wikipedia, a source-language corpus and a target-language corpus in the same languages as the bilingual sample set, and performing word segmentation on every sentence in both corpora;
  • based on the segmented source and target corpora, training word vectors separately with the Skip-Gram algorithm, and obtaining a source word vector model and a target word vector model once training is complete.
  • For example, if the source text in the bilingual sample set is Chinese and the target text is English, the Chinese and English corpora in the same languages as the bilingual sample set are downloaded from the latest Wikipedia, and an existing word segmentation algorithm is used to segment every sentence in the acquired Chinese and English corpora; the Skip-Gram algorithm is then used to train the Chinese and English word vectors respectively.
  • In one embodiment, some important hyperparameter settings are: the word vector dimension is 300, and the context window is 5.
  • The step of inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the corresponding word vectors is specifically:
  • for any pair of bilingual corpus samples, inputting the source sentence into the source word vector model to obtain the word vectors corresponding to the source sentence;
  • inputting the target sentence into the target word vector model to obtain the word vectors corresponding to the target sentence.
  • The step of inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network, is as follows.
  • The bilingual corpus samples in the training sample set are input one by one into the LSTM-network-based encoder model and decoder model for training.
  • For any pair of bilingual corpus samples in the training sample set, the source sentence and the word vectors corresponding to the source sentence are input into the LSTM-network-based encoder model, which outputs the sentence vector corresponding to the source sentence.
  • As shown in FIG. 2, a schematic diagram of the encoder model provided by an embodiment of the present disclosure encoding source text into a sentence vector, the source sentence "技能的培养非常重要" ("skill building is very important") is encoded by the encoder into a vector c.
  • Then, as shown in FIG. 3, a schematic diagram of a decoder model provided by an embodiment of the present disclosure generating a translation from a sentence vector, the sentence vector c corresponding to the source sentence and the word vectors of the target sentence in the bilingual corpus sample are input into the LSTM-network-based decoder model, which outputs the model translation "The skill building is important".
  • The target sentence in the bilingual corpus sample is the standard translation. The difference between the standard translation and the model translation output by the decoder model is computed and used to update the parameters of the decoder model and the encoder model through the backpropagation algorithm.
  • After training, the validation sample set is used to test the LSTM-network-based encoder model and decoder model trained on the training sample set; after testing is complete, the trained LSTM network is obtained.
  • The step of inputting the bilingual corpus samples of each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, and obtaining the LSTM network corresponding to that translator, is as follows.
  • The sentence vector corresponding to the source sentence in the bilingual corpus sample is concatenated with the word vectors of the target sentence and the translator vector, and input into the decoder model of the trained LSTM network to generate a predicted translation;
  • the translator vector is then updated.
  • The training described above is performed on the bilingual sample set composed of all translators' bilingual corpora, and the resulting LSTM network can be regarded as a general model; on this basis, each translator's own bilingual corpus is used for further training to obtain an LSTM network for each translator.
  • As shown in FIG. 4, a schematic diagram of training the LSTM network corresponding to a translator provided by an embodiment of the present disclosure:
  • the source sentence in the bilingual corpus sample and the word vectors corresponding to the source sentence are input into the encoder model of the trained LSTM network to obtain the sentence vector c corresponding to the source sentence;
  • the sentence vector c, the word vectors corresponding to the target sentence in the bilingual corpus sample, and the translator vector v are concatenated and input into the decoder model of the trained LSTM network to obtain the predicted translation;
  • the difference between the predicted translation and the target sentence in the bilingual corpus sample is computed; keeping the encoder model parameters of the trained LSTM network unchanged, the backpropagation algorithm updates the parameters of the decoder model of the trained LSTM network, and the translator vector is updated according to the updated decoder model parameters.
  • The translator vector is initialized to zero and then updated after each training pass.
  • As shown in FIG. 5, a schematic structural diagram of a device for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure, the device is used to implement the method for vectorizing a translator's translation personality characteristics described in the foregoing embodiments; the descriptions and definitions in the methods of the foregoing embodiments therefore apply to the execution modules in this embodiment of the present disclosure.
  • The device includes a sample construction module 510, a word vector acquisition module 520, a general model training module 530, a personalized training module 540, and a vectorization module 550, wherein:
  • the sample construction module 510 is configured to select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
  • the word vector acquisition module 520 is configured to input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and output the word vectors corresponding to the bilingual corpus samples;
  • the general model training module 530 is configured to input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
  • the personalized training module 540 is configured to input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
  • the vectorization module 550 is configured to generate, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
  • wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
  • the sample construction module 510 is specifically configured to:
  • perform word segmentation on each sentence in the bilingual sample set and shuffle the original order of the sentences; and divide the bilingual sample set into a training sample set and a validation sample set.
  • In the device for vectorizing a translator's translation personality characteristics, the LSTM-network-based encoder model and decoder model are trained on the translators' historical translation corpora to obtain translator vectors reflecting each translator's personal translation characteristics; there is no need to screen translators' personality characteristics or to manually label sample data, and the training cost is low while the accuracy is high.
  • The obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
  • the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, through which the processor 610, the communications interface 620, and the memory 630 communicate with one another.
  • The processor 610 can call a computer program stored on the memory 630 and executable on the processor 610 to perform the method for vectorizing a translator's translation personality characteristics provided in the above embodiments, for example including: selecting T translators, selecting M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, forming a bilingual sample set from all translators' own bilingual corpora, and preprocessing the bilingual sample set;
  • inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the corresponding word vectors; inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network; inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to the translator; and generating, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics.
  • The logic instructions in the above memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium.
  • The technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The aforementioned storage media include various media capable of storing program code, such as USB flash drives, portable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical discs.
  • Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium on which a computer program is stored.
  • When executed by a processor, the program implements the method for vectorizing a translator's translation personality characteristics provided by the foregoing embodiments, for example including: selecting T translators, selecting M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, forming a bilingual sample set from all translators' own bilingual corpora, and preprocessing the bilingual sample set; inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the corresponding word vectors; inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network; inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to the translator; and generating, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics.
  • The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
  • Each embodiment can be implemented by means of software plus a necessary general hardware platform, or of course by hardware.
  • The above technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in certain parts of the embodiments.

Abstract

Embodiments of the present disclosure provide a method and device for vectorizing a translator's translation personality characteristics. The method includes: constructing a bilingual sample set based on the historical translation corpora of multiple translators; inputting the bilingual sample set into a word vector model and outputting word vectors; inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into an LSTM-network-based encoder model and decoder model for training; inputting each translator's own bilingual corpus into the trained LSTM network for further training while keeping the encoder model parameters unchanged, to obtain the LSTM network corresponding to that translator; and generating a translator vector based on the parameters of the decoder model in the translator's LSTM network. Embodiments of the present disclosure require neither screening translators' personality characteristics nor manually labeling sample data; the training cost is low and the accuracy is high, and the resulting translator vector can accurately and objectively reflect a translator's translation personality characteristics.

Description

Method and device for vectorizing a translator's translation personality characteristics
Cross-Reference
This application claims priority to Chinese Patent Application No. 201811570883.4, filed on December 21, 2018 and entitled "Method and device for vectorizing a translator's translation personality characteristics", which is incorporated herein by reference in its entirety.
Technical Field
Embodiments of the present disclosure relate to the technical field of natural language processing, and more particularly to a method and device for vectorizing a translator's translation personality characteristics.
Background
With the development of artificial intelligence, the quality of machine translation keeps improving, and post-editing machine translation output has become a new mode of work for translators. In many cases, different translators translate the same source sentence somewhat differently, for example in wording and phrasing. These differences correlate with each translator's personal attributes, including age, gender, region of residence, personality, social background, education, and personal experience. Understanding how personal attributes shape translators' translation preferences makes it possible to provide more personalized translation assistance and thereby improve translators' working efficiency.
For example, in post-editing mode, a computer-assisted translation tool calls a machine translation engine to produce a first draft, which a professional translator then reviews and edits into a high-quality translation. When translators' personalized translation characteristics are taken into account, the tool can select, from the outputs of multiple machine translation engines, the result a given translator prefers most, which often reduces that translator's post-editing workload; conversely, when a translator receives a non-personalized recommendation, he or she usually has to spend more time and effort on post-editing to reach a personally satisfactory result. Similarly, introducing translators' personalized characteristics can play a positive role in the automatic dispatch of translation jobs and in terminology prompts.
However, before building personalized translation assistance, one must consider how to quantify each translator's personal attributes. The traditional approach is to label translators' personal attributes explicitly by hand and to build models with supervised machine learning algorithms. This approach has two obvious drawbacks: first, selecting personal characteristics relevant enough to reflect a translator's personality objectively and comprehensively is not easy; second, manually labeling datasets is time-consuming, labor-intensive, and very costly.
Summary
Embodiments of the present disclosure provide a method and device for vectorizing a translator's translation personality characteristics that overcome the above problems or at least partially solve them.
In a first aspect, an embodiment of the present disclosure provides a method for vectorizing a translator's translation personality characteristics, including:
selecting T translators, selecting M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, forming a bilingual sample set from all translators' own bilingual corpora, and preprocessing the bilingual sample set;
inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into a word vector model, and outputting the word vectors corresponding to the bilingual corpus samples;
inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into an LSTM-network-based encoder model and decoder model for training, to obtain a trained LSTM network;
inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
generating, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
In a second aspect, an embodiment of the present disclosure provides a device for vectorizing a translator's translation personality characteristics, including:
a sample construction module, configured to select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
a word vector acquisition module, configured to input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and output the word vectors corresponding to the bilingual corpus samples;
a general model training module, configured to input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
a personalized training module, configured to input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
a vectorization module, configured to generate, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
In a third aspect, an embodiment of the present disclosure provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method for vectorizing a translator's translation personality characteristics provided in the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the steps of the method for vectorizing a translator's translation personality characteristics provided in the first aspect.
In the method and device for vectorizing a translator's translation personality characteristics provided by embodiments of the present disclosure, the LSTM-network-based encoder model and decoder model are trained on translators' historical translation corpora to obtain translator vectors reflecting each translator's personal translation characteristics. There is no need to screen translators' personality characteristics or to manually label sample data; the training cost is low and the accuracy is high, and the obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
Brief Description of the Drawings
To explain the technical solutions of the embodiments of the present disclosure or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below illustrate some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from them without creative effort.
FIG. 1 is a schematic flowchart of a method for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an encoder model provided by an embodiment of the present disclosure encoding source text into a sentence vector;
FIG. 3 is a schematic diagram of a decoder model provided by an embodiment of the present disclosure generating a translation from a sentence vector;
FIG. 4 is a schematic diagram of training the LSTM network corresponding to a translator, provided by an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a device for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
To make the purposes, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present disclosure. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
As shown in FIG. 1, a schematic flowchart of a method for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure, the method includes:
Step 100: Select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
Specifically, to obtain a translator's translation personality characteristics, the historical corpus translated by that translator is used.
T translators are selected, e.g., T = {t1, t2, t3, …, tT}, and M pairs of bilingual corpus samples are selected from each translator's historical corpus; these M pairs constitute the translator's own bilingual corpus.
A pair of bilingual corpus samples includes one source sentence and one target sentence.
For example, if M pairs of Chinese-to-English bilingual corpora are selected, then translator i's own bilingual corpus = {m_i1, m_i2, …, m_iM} and the bilingual sample set = {m_11, m_12, …, m_1M, m_21, m_22, …, m_2M, …, m_T1, m_T2, …, m_TM}, where m_ij denotes the j-th Chinese-to-English corpus pair of the i-th translator.
It is worth noting that T and M are both natural numbers greater than 1.
Preprocessing the bilingual sample set includes the following operations: performing word segmentation on each source sentence and target sentence in the bilingual sample set and shuffling the original order of the sentences; and dividing the bilingual sample set into a training sample set and a validation sample set according to a certain ratio, for example 80% of the data as the training sample set and the remaining 20% as the validation sample set.
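For concreteness, a minimal Python sketch of this preprocessing is given below; the segmenter (jieba), the lowercasing of the English side, and the fixed random seed are illustrative assumptions, since the text above only requires word segmentation, order shuffling, and a proportional split such as 80/20.

```python
# A minimal preprocessing sketch, assuming a Chinese-to-English sample set.
# jieba is one common Chinese segmenter; any word segmentation algorithm works here.
import random
import jieba

def preprocess(sample_pairs, train_ratio=0.8, seed=42):
    """sample_pairs: list of (source_sentence, target_sentence) string pairs."""
    segmented = [
        (" ".join(jieba.cut(src)), " ".join(tgt.lower().split()))
        for src, tgt in sample_pairs
    ]
    random.Random(seed).shuffle(segmented)   # shuffle the original sentence order
    cut = int(len(segmented) * train_ratio)  # e.g. 80% training / 20% validation
    return segmented[:cut], segmented[cut:]
```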
Step 101: Input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model, and output the word vectors corresponding to the bilingual corpus samples;
Specifically, word vectors are trained from word sequences within a fixed-size window over a corpus; they can capture the semantic and grammatical information of words, are a low-dimensional, dense, real-valued vector representation, and vector operations between word vectors can reveal the correlations between words.
Each pair of bilingual corpus samples in the preprocessed bilingual sample set is input one by one into a pretrained word vector model: for the source corpus there is a pretrained source word vector model, into which the segmented source sentence is input and which outputs the word vectors corresponding to the source sentence; likewise, the target sentence is input into the pretrained target word vector model, which outputs the word vectors corresponding to the target sentence.
In embodiments of the present disclosure, the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora. The Skip-Gram algorithm is suitable for training word vectors on large-scale corpora, and the word vectors it produces reflect the semantic correlations between words well.
Step 102: Input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
Specifically, the training sample set and validation sample set split from the bilingual sample set, together with the word vectors pretrained in step 101, are used to train a neural network model based on the sequence-to-sequence framework, yielding a general LSTM network.
The sequence-to-sequence model contains two main modules: an encoder and a decoder. The encoder uses a recurrent LSTM (Long Short-Term Memory) network to encode the source sentence into a vector, and the decoder generates the target sentence from this vector.
By comparing the translation output by the decoder model with the standard translation, the encoder model parameters and decoder model parameters are updated, optimizing the model and reducing the difference between the decoder's output and the standard translation.
Step 103: Input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
Specifically, on the basis of the general LSTM network, each translator's own bilingual corpus is used to continue training it; during this training the encoder model parameters of the general LSTM network are kept unchanged, and an LSTM network is obtained for each translator.
Each translator's LSTM network has the following function: when source text is input into the LSTM network corresponding to a translator, a translation matching that translator's personal translation characteristics is obtained. It can be understood that the LSTM network corresponding to each translator is obtained through training, and the parameters of the decoder model in that LSTM network reflect the translator's translation personality characteristics.
Embodiments of the present disclosure obtain a translator's translation personality characteristics solely from the historical corpus translated by that translator, with no need to screen the translator's personality characteristics or to manually label sample data.
Step 104: Based on the parameters of the decoder model in the LSTM network corresponding to the translator, generate a translator vector reflecting that translator's translation personality characteristics;
Finally, since the purpose of the embodiments of the present disclosure is to quantify translators' translation personality characteristics, each translator's translator vector is generated based on the parameters of the decoder model in that translator's LSTM network. The translator vector reflects the translator's translation personality characteristics, quantifying them as a dense vector of some dimension.
With the translator vector, an auxiliary translation tool can select, from the outputs of multiple machine translation engines, the result that best fits the translator's translation personality, reducing the post-editing workload; the translator vector can also play a positive role in the automatic dispatch of translation jobs and in terminology prompts.
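As one illustration of the engine-selection use just described (an application sketch, not a procedure specified in this disclosure), each engine's candidate translation could be scored with the translator's personalized LSTM network and the lowest-perplexity candidate kept; `score` is a hypothetical helper returning the mean per-token cross-entropy of a candidate under that network.

```python
# Hypothetical engine-selection sketch built on the personalized model.
import math
import torch

def pick_candidate(personal_model, v, src_ids, candidates):
    """candidates: token-id tensors, one per machine translation engine."""
    best, best_ppl = None, math.inf
    with torch.no_grad():
        for cand_ids in candidates:
            loss = personal_model.score(src_ids, cand_ids, v)  # assumed helper
            ppl = math.exp(loss.item())                        # candidate perplexity
            if ppl < best_ppl:
                best, best_ppl = cand_ids, ppl
    return best  # the candidate closest to this translator's personal style
```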
In the method for vectorizing a translator's translation personality characteristics provided by embodiments of the present disclosure, the LSTM-network-based encoder model and decoder model are trained on translators' historical translation corpora to obtain translator vectors reflecting each translator's personal translation characteristics; there is no need to screen translators' personality characteristics or to manually label sample data, the training cost is low and the accuracy is high, and the obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
Based on the above embodiments, the steps of training the word vector model are specifically:
obtaining, from the latest Wikipedia, a source-language corpus and a target-language corpus in the same languages as the bilingual sample set, and performing word segmentation on every sentence in both corpora;
based on the segmented source and target corpora, training word vectors separately with the Skip-Gram algorithm, and obtaining a source word vector model and a target word vector model once training is complete.
Specifically, according to the languages of the source and target sentences in the bilingual sample set (for example, Chinese source text and English target text), the Chinese corpus and English corpus matching the languages of the bilingual sample set are downloaded from the latest Wikipedia, and an existing word segmentation algorithm is used to segment every sentence in the acquired Chinese and English corpora.
Then, based on the segmented source and target corpora, word vectors are trained separately with the Skip-Gram algorithm, and the source word vector model and target word vector model are obtained after training.
In one embodiment, the Skip-Gram algorithm is used to train the Chinese and English word vectors respectively, with some important hyperparameters set as follows: the word vector dimension is 300, and the context window is 5.
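A sketch of this step with gensim's Word2Vec implementation of Skip-Gram is shown below; only the dimension (300), the window (5), and the choice of Skip-Gram come from the text above, while the remaining settings (min_count, workers) are assumptions.

```python
# Word-vector training sketch with gensim; sg=1 selects the Skip-Gram algorithm.
from gensim.models import Word2Vec

def train_skipgram(tokenized_sentences):
    """tokenized_sentences: token lists from the segmented Wikipedia corpus."""
    return Word2Vec(
        sentences=tokenized_sentences,
        vector_size=300,  # word-vector dimension, as stated above
        window=5,         # context window, as stated above
        sg=1,             # 1 = Skip-Gram (0 would be CBOW)
        min_count=5,      # assumed vocabulary cutoff
        workers=4,
    )

# One model per language, e.g. a Chinese source model and an English target model:
# zh_model = train_skipgram(zh_wiki_tokens)
# en_model = train_skipgram(en_wiki_tokens)
# vec = zh_model.wv["技能"]  # 300-dimensional word-vector lookup
```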
Based on the above embodiments, the step of inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the corresponding word vectors is specifically:
for any pair of bilingual corpus samples in the bilingual sample set, inputting the source sentence of the bilingual corpus sample into the source word vector model to obtain the word vectors corresponding to the source sentence;
inputting the target sentence of the bilingual corpus sample into the target word vector model to obtain the word vectors corresponding to the target sentence.
Specifically, after the source and target word vector models have been obtained by training word vectors with the Skip-Gram algorithm, for any pair of bilingual corpus samples in the bilingual sample set, the source sentence is input into the source word vector model to obtain the word vectors corresponding to the source sentence, and the target sentence is input into the target word vector model to obtain the word vectors corresponding to the target sentence.
Based on the above embodiments, the step of inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network, is specifically:
for any pair of bilingual corpus samples in the training sample set, inputting the source sentence and the word vectors corresponding to the source sentence into the LSTM-network-based encoder model to generate the sentence vector corresponding to the source sentence;
inputting the sentence vector corresponding to the source sentence and the word vectors corresponding to the target sentence into the LSTM-network-based decoder model to obtain the model translation;
comparing the model translation with the target sentence of the bilingual corpus sample and, based on the difference, updating the parameters of the LSTM-network-based encoder model and decoder model via the backpropagation algorithm;
testing the LSTM-network-based encoder model and decoder model with the validation sample set and, once testing is complete, obtaining the trained LSTM network.
Specifically, the bilingual corpus samples in the training sample set are input one by one into the LSTM-network-based encoder model and decoder model for training. For any pair of bilingual corpus samples in the training sample set, the source sentence and its word vectors are input into the encoder model, which outputs the sentence vector corresponding to the source sentence.
As shown in FIG. 2, a schematic diagram of the encoder model encoding source text into a sentence vector, the source sentence "技能的培养非常重要" ("skill building is very important") is encoded by the encoder into a vector c.
Then, as shown in FIG. 3, a schematic diagram of the decoder model generating a translation from a sentence vector, the sentence vector c corresponding to the source sentence and the word vectors of the target sentence in the bilingual corpus sample are input into the LSTM-network-based decoder model, which outputs the model translation "The skill building is important".
The target sentence in the bilingual corpus sample is the standard translation; the difference between the standard translation and the model translation output by the decoder is computed and used to update the parameters of the decoder model and encoder model via the backpropagation algorithm.
After training ends, the validation sample set is used to test the trained LSTM-network-based encoder model and decoder model; once testing is complete, the trained LSTM network is obtained.
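The following compressed PyTorch sketch illustrates this kind of encoder-decoder arrangement; the hidden size, teacher forcing, and freezing of the pretrained Skip-Gram embeddings are assumptions, since the text fixes only the overall LSTM encoder/decoder structure and the backpropagation update.

```python
# Seq2seq sketch: an LSTM encoder produces the sentence vector (its final state),
# and an LSTM decoder generates the translation from it.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vectors, tgt_vectors, hidden=512):
        super().__init__()
        # src_vectors / tgt_vectors: pretrained Skip-Gram matrices (vocab x 300)
        self.src_emb = nn.Embedding.from_pretrained(src_vectors, freeze=True)
        self.tgt_emb = nn.Embedding.from_pretrained(tgt_vectors, freeze=True)
        self.encoder = nn.LSTM(src_vectors.size(1), hidden, batch_first=True)
        self.decoder = nn.LSTM(tgt_vectors.size(1), hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vectors.size(0))  # target-vocabulary logits

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))  # final state plays the role of c
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)  # teacher forcing
        return self.out(dec_out)

# One training step: cross-entropy between the model translation and the standard
# translation, then backpropagation updating encoder and decoder parameters.
# logits = model(src_ids, tgt_ids[:, :-1])
# loss = nn.functional.cross_entropy(
#     logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```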
Based on the above embodiments, the step of inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator, is specifically:
for any pair of bilingual corpus samples in a translator's own bilingual corpus, inputting the source sentence and the word vectors corresponding to the source sentence into the encoder model of the trained LSTM network to obtain the sentence vector corresponding to the source sentence;
concatenating the sentence vector corresponding to the source sentence with the word vectors corresponding to the target sentence and the translator vector, and inputting the result into the decoder model of the trained LSTM network to generate a predicted translation;
comparing the predicted translation with the target sentence of the bilingual corpus sample and, keeping the encoder model parameters of the trained LSTM network unchanged, updating the parameters of the decoder model of the trained LSTM network via the backpropagation algorithm;
updating the translator vector according to the updated parameters of the decoder model of the trained LSTM network.
Specifically, the LSTM network obtained by training on the bilingual sample set composed of all translators' bilingual corpora can be regarded as a general model; on this basis, each translator's own bilingual corpus is used for further training to obtain an LSTM network for each translator.
As shown in FIG. 4, a schematic diagram of training the LSTM network corresponding to a translator provided by an embodiment of the present disclosure:
for any pair of bilingual corpus samples in a translator's own bilingual corpus, first, the source sentence and its word vectors are input into the encoder model of the trained LSTM network to obtain the sentence vector c corresponding to the source sentence; then, c, the word vectors corresponding to the target sentence, and the translator vector v are concatenated and input into the decoder model of the trained LSTM network to obtain the predicted translation; the difference between the predicted translation and the target sentence is computed, the encoder model parameters of the trained LSTM network are kept unchanged, the parameters of the decoder model are updated via backpropagation, and the translator vector is updated according to the updated decoder model parameters.
It is worth noting that the translator vector is initialized to zero and then updated after each training pass.
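A sketch of this per-translator stage is given below, under stated assumptions: the decoder was built to accept the target word vector concatenated with the translator vector v (so that v = 0 recovers the general model), v is a trainable tensor initialized to zero, and backpropagation updates only the decoder side and v while the encoder stays frozen.

```python
# Per-translator fine-tuning sketch: freeze the encoder, concatenate the translator
# vector v to every decoder input step, and let gradients update the decoder and v.
# Assumes model.decoder takes inputs of size word_dim + dim_v (see lead-in above).
import torch

def personalize(model, translator_pairs, dim_v=64, lr=1e-4):
    for p in model.encoder.parameters():
        p.requires_grad = False                 # encoder parameters stay unchanged

    v = torch.zeros(dim_v, requires_grad=True)  # translator vector, initialized to zero
    opt = torch.optim.Adam(
        list(model.decoder.parameters()) + list(model.out.parameters()) + [v], lr=lr)

    for src_ids, tgt_ids in translator_pairs:
        _, state = model.encoder(model.src_emb(src_ids))  # sentence vector c as state
        emb = model.tgt_emb(tgt_ids[:, :-1])              # target word vectors
        dec_in = torch.cat([emb, v.expand(emb.size(0), emb.size(1), -1)], dim=-1)
        logits = model.out(model.decoder(dec_in, state)[0])
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tgt_ids[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()      # v is updated each pass
    return v  # dense vector quantifying the translator's translation traits
```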
As shown in FIG. 5, a schematic structural diagram of a device for vectorizing a translator's translation personality characteristics provided by an embodiment of the present disclosure, the device is used to implement the method for vectorizing a translator's translation personality characteristics described in the foregoing embodiments. The descriptions and definitions in the methods of the foregoing embodiments therefore apply to the execution modules in this embodiment of the present disclosure.
As shown in the figure, the device includes a sample construction module 510, a word vector acquisition module 520, a general model training module 530, a personalized training module 540, and a vectorization module 550, wherein:
the sample construction module 510 is configured to select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
the word vector acquisition module 520 is configured to input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and output the word vectors corresponding to the bilingual corpus samples;
the general model training module 530 is configured to input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
the personalized training module 540 is configured to input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
the vectorization module 550 is configured to generate, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
wherein a pair of bilingual corpus samples includes one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
The sample construction module 510 is specifically configured to:
perform word segmentation on each sentence in the bilingual sample set and shuffle the original order of the sentences in the bilingual sample set;
divide the bilingual sample set into a training sample set and a validation sample set.
In the device for vectorizing a translator's translation personality characteristics provided by embodiments of the present disclosure, the LSTM-network-based encoder model and decoder model are trained on translators' historical translation corpora to obtain translator vectors reflecting each translator's personal translation characteristics; there is no need to screen translators' personality characteristics or to manually label sample data, the training cost is low and the accuracy is high, and the obtained translator vector can accurately and objectively reflect a translator's translation personality characteristics.
FIG. 6 is a schematic diagram of the physical structure of an electronic device provided by an embodiment of the present disclosure. As shown in FIG. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, through which the processor 610, the communications interface 620, and the memory 630 communicate with one another. The processor 610 may call a computer program stored on the memory 630 and executable on the processor 610 to perform the method for vectorizing a translator's translation personality characteristics provided by the above embodiments, for example including: selecting T translators, selecting M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, forming a bilingual sample set from all translators' own bilingual corpora, and preprocessing the bilingual sample set; inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the corresponding word vectors; inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network; inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters unchanged, to obtain the LSTM network corresponding to the translator; and generating, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics; wherein a pair of bilingual corpus samples includes one source sentence and one target sentence, the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora, and T and M are both natural numbers greater than 1.
In addition, the logic instructions in the above memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure. The aforementioned storage media include various media capable of storing program code, such as USB flash drives, portable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical discs.
Embodiments of the present disclosure also provide a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the method for vectorizing a translator's translation personality characteristics provided by the above embodiments, for example including the steps listed above: constructing and preprocessing the bilingual sample set, obtaining the word vectors, training the general LSTM network, performing per-translator training with the encoder model parameters held fixed to obtain the LSTM network corresponding to each translator, and generating the translator vector from the parameters of the decoder model; wherein a pair of bilingual corpus samples includes one source sentence and one target sentence, the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora, and T and M are both natural numbers greater than 1.
The device embodiments described above are merely illustrative; units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art can clearly understand that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or substitute equivalents for some of the technical features therein, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (10)

  1. A method for vectorizing a translator's translation personality characteristics, comprising:
    selecting T translators, selecting M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, forming a bilingual sample set from all translators' own bilingual corpora, and preprocessing the bilingual sample set;
    inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into a word vector model, and outputting the word vectors corresponding to the bilingual corpus samples;
    inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into an LSTM-network-based encoder model and decoder model for training, to obtain a trained LSTM network;
    inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
    generating, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
    wherein a pair of bilingual corpus samples comprises one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
  2. The method according to claim 1, wherein the step of preprocessing the bilingual sample set is specifically:
    performing word segmentation on each sentence in the bilingual sample set, and shuffling the original order of the sentences in the bilingual sample set;
    dividing the bilingual sample set into a training sample set and a validation sample set.
  3. The method according to claim 1, wherein the step of training the word vector model is specifically:
    obtaining, from the latest Wikipedia, a source-language corpus and a target-language corpus in the same languages as the bilingual sample set, and performing word segmentation on every sentence in the source and target corpora;
    based on the segmented source and target corpora, training word vectors separately with the Skip-Gram algorithm, and obtaining a source word vector model and a target word vector model once training is complete.
  4. The method according to claim 3, wherein the step of inputting the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and outputting the word vectors corresponding to the bilingual corpus samples is specifically:
    for any pair of bilingual corpus samples in the bilingual sample set, inputting the source sentence of the bilingual corpus sample into the source word vector model to obtain the word vectors corresponding to the source sentence;
    inputting the target sentence of the bilingual corpus sample into the target word vector model to obtain the word vectors corresponding to the target sentence.
  5. The method according to claim 2, wherein the step of inputting each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network, is specifically:
    for any pair of bilingual corpus samples in the training sample set, inputting the source sentence of the bilingual corpus sample and the word vectors corresponding to the source sentence into the LSTM-network-based encoder model to generate the sentence vector corresponding to the source sentence;
    inputting the sentence vector corresponding to the source sentence and the word vectors corresponding to the target sentence of the bilingual corpus sample into the LSTM-network-based decoder model to obtain a model translation;
    comparing the model translation with the target sentence of the bilingual corpus sample and, based on the difference, updating the parameters of the LSTM-network-based encoder model and decoder model by means of the backpropagation algorithm;
    testing the LSTM-network-based encoder model and decoder model with the validation sample set and, once testing is complete, obtaining the trained LSTM network.
  6. The method according to claim 1, wherein the step of inputting the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training, keeping the encoder model parameters of the trained LSTM network unchanged, and obtaining the LSTM network corresponding to that translator, is specifically:
    for any pair of bilingual corpus samples in each translator's own bilingual corpus, inputting the source sentence of the bilingual corpus sample and the word vectors corresponding to the source sentence into the encoder model of the trained LSTM network to obtain the sentence vector corresponding to the source sentence;
    concatenating the sentence vector corresponding to the source sentence with the word vectors corresponding to the target sentence of the bilingual corpus sample and the translator vector, and inputting the result into the decoder model of the trained LSTM network to generate a predicted translation;
    comparing the predicted translation with the target sentence of the bilingual corpus sample and, keeping the encoder model parameters of the trained LSTM network unchanged, updating the parameters of the decoder model of the trained LSTM network by means of the backpropagation algorithm;
    updating the translator vector according to the updated parameters of the decoder model of the trained LSTM network.
  7. A device for vectorizing a translator's translation personality characteristics, comprising:
    a sample construction module, configured to select T translators, select M pairs of bilingual corpus samples from the historical corpus translated by each translator to construct that translator's own bilingual corpus, form a bilingual sample set from all translators' own bilingual corpora, and preprocess the bilingual sample set;
    a word vector acquisition module, configured to input the bilingual corpus samples in the preprocessed bilingual sample set one by one into the word vector model and output the word vectors corresponding to the bilingual corpus samples;
    a general model training module, configured to input each pair of bilingual corpus samples in the bilingual sample set, together with the corresponding word vectors, into the LSTM-network-based encoder model and decoder model for training, to obtain the trained LSTM network;
    a personalized training module, configured to input the bilingual corpus samples in each translator's own bilingual corpus one by one into the trained LSTM network for further training while keeping the encoder model parameters of the trained LSTM network unchanged, to obtain the LSTM network corresponding to that translator;
    a vectorization module, configured to generate, based on the parameters of the decoder model in the LSTM network corresponding to the translator, a translator vector reflecting the translator's translation personality characteristics;
    wherein a pair of bilingual corpus samples comprises one source sentence and one target sentence; the word vector model is trained with the Skip-Gram algorithm on the latest Wikipedia source-language and target-language corpora; and T and M are both natural numbers greater than 1.
  8. The device according to claim 7, wherein the sample construction module is specifically configured to:
    perform word segmentation on each sentence in the bilingual sample set, and shuffle the original order of the sentences in the bilingual sample set;
    divide the bilingual sample set into a training sample set and a validation sample set.
  9. An electronic device, comprising:
    at least one processor; and
    at least one memory communicatively connected to the processor, wherein:
    the memory stores program instructions executable by the processor, and the processor, by calling the program instructions, is able to perform the method according to any one of claims 1 to 6.
  10. A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions that cause a computer to perform the method according to any one of claims 1 to 6.
PCT/CN2018/124915 2018-12-21 2018-12-28 Method and device for vectorizing a translator's translation personality characteristics WO2020124674A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811570883.4 2018-12-21
CN201811570883.4A CN109670180B (zh) 2018-12-21 2018-12-21 Method and device for vectorizing a translator's translation personality characteristics

Publications (1)

Publication Number Publication Date
WO2020124674A1 (zh) 2020-06-25

Family

ID=66145815

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/124915 WO2020124674A1 (zh) 2018-12-21 2018-12-28 Method and device for vectorizing a translator's translation personality characteristics

Country Status (2)

Country Link
CN (1) CN109670180B (zh)
WO (1) WO2020124674A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633007A (zh) * 2020-12-21 2021-04-09 科大讯飞股份有限公司 Semantic understanding model construction method and device, and semantic understanding method and device

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110543643B (zh) * 2019-08-21 2022-11-11 语联网(武汉)信息技术有限公司 Training method and device for a text translation model
CN110543644B (zh) * 2019-09-04 2023-08-29 语联网(武汉)信息技术有限公司 Machine translation method and device incorporating terminology translation, and electronic device
CN110866404B (zh) * 2019-10-30 2023-05-05 语联网(武汉)信息技术有限公司 Word vector generation method and device based on an LSTM neural network
CN110866395B (zh) * 2019-10-30 2023-05-05 语联网(武汉)信息技术有限公司 Word vector generation method and device based on translators' editing behavior
CN111191468B (zh) * 2019-12-17 2023-08-25 语联网(武汉)信息技术有限公司 Term replacement method and device
CN113408552A (zh) * 2020-03-16 2021-09-17 京东方科技集团股份有限公司 Feature quantization model training, feature quantization, and data query methods and system
CN111508470B (zh) * 2020-04-26 2024-04-12 北京声智科技有限公司 Training method and device for a speech synthesis model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193912A (zh) * 2010-03-12 2011-09-21 富士通株式会社 Phrase segmentation model building method, statistical machine translation method, and decoder
CN102789451A (zh) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Personalized machine translation system and method, and method for training a translation model
US8805669B2 (en) * 2010-07-13 2014-08-12 Dublin City University Method of and a system for translation
US20170031899A1 (en) * 2015-07-31 2017-02-02 Samsung Electronics Co., Ltd. Apparatus and method for determining translation word

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102630668B1 (ko) * 2016-12-06 2024-01-30 한국전자통신연구원 System and method for automatically expanding input text
CN107368475B (zh) * 2017-07-18 2021-06-04 中译语通科技股份有限公司 Machine translation method and system based on generative adversarial neural networks
CN108829684A (zh) * 2018-05-07 2018-11-16 内蒙古工业大学 Mongolian-Chinese neural machine translation method based on a transfer-learning strategy

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102193912A (zh) * 2010-03-12 2011-09-21 富士通株式会社 Phrase segmentation model building method, statistical machine translation method, and decoder
US8805669B2 (en) * 2010-07-13 2014-08-12 Dublin City University Method of and a system for translation
CN102789451A (zh) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Personalized machine translation system and method, and method for training a translation model
US20170031899A1 (en) * 2015-07-31 2017-02-02 Samsung Electronics Co., Ltd. Apparatus and method for determining translation word

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633007A (zh) * 2020-12-21 2021-04-09 科大讯飞股份有限公司 Semantic understanding model construction method and device, and semantic understanding method and device
CN112633007B (zh) * 2020-12-21 2024-04-30 中国科学技术大学 Semantic understanding model construction method and device, and semantic understanding method and device

Also Published As

Publication number Publication date
CN109670180B (zh) 2020-05-08
CN109670180A (zh) 2019-04-23

Similar Documents

Publication Publication Date Title
WO2020124674A1 (zh) Method and device for vectorizing a translator's translation personality characteristics
CN110334361B (zh) Neural machine translation method for low-resource languages
US9753914B2 (en) Natural expression processing method, processing and response method, device, and system
US20220147845A1 (en) Generation of recommendation reason
CN110069790B (zh) Machine translation system and method for checking a translation against the source text via back-translation
CN110543643B (zh) Training method and device for a text translation model
CN110555213B (zh) Training method for a text translation model, text translation method, and device
CN110070855B (zh) Speech recognition system and method based on a transfer-learning neural-network acoustic model
WO2021082427A1 (zh) Prosody-controlled poetry generation method, apparatus, device, and storage medium
CN111144140B (zh) Zero-shot-learning-based method and device for generating Chinese-Thai bilingual corpora
CN107870901A (zh) Method, program, device, and system for generating similar texts from a translation source text
US20230023789A1 (en) Method for identifying noise samples, electronic device, and storage medium
CN110765791A (zh) Automatic post-editing method and device for machine translation
Mandal et al. Futurity of translation algorithms for neural machine translation (NMT) and its vision
CN113239710A (zh) Multilingual machine translation method and device, electronic device, and storage medium
CN111553157A (zh) Dialogue intent recognition method based on entity substitution
CN111178097B (zh) Method and device for generating Chinese-Thai bilingual corpora based on multi-stage translation models
CN109657244B (zh) Automatic segmentation method and system for long English sentences
CN115438678B (zh) Machine translation method and device, electronic device, and storage medium
CN115017924B (zh) Construction of a neural machine translation model for cross-lingual translation and its translation method
CN110147556B (zh) Construction method for a multidirectional neural network translation system
CN111666774A (zh) Machine translation method and device based on document context
CN115293177B (zh) Neural-network machine translation method and system for low-resource languages based on dual transfer learning
US11664010B2 (en) Natural language domain corpus data set creation based on enhanced root utterances
WO2021109679A1 (zh) Method for building a machine translation model, translation apparatus, and computer-readable storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17.01.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 18943797

Country of ref document: EP

Kind code of ref document: A1