CN114676708B - Low-resource neural machine translation method based on multi-strategy prototype generation - Google Patents
Low-resource neural machine translation method based on multi-strategy prototype generation
- Publication number
- CN114676708B CN202210293213.2A
- Authority
- CN
- China
- Prior art keywords
- prototype
- sentence
- matching
- attention
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000013519 translation Methods 0.000 title claims abstract description 74
- 238000000034 method Methods 0.000 title claims abstract description 58
- 230000001537 neural effect Effects 0.000 title claims abstract description 25
- 230000007246 mechanism Effects 0.000 claims abstract description 13
- 230000006870 function Effects 0.000 claims abstract description 4
- 230000014616 translation Effects 0.000 claims description 71
- 238000004364 calculation method Methods 0.000 claims description 19
- 238000012360 testing method Methods 0.000 claims description 15
- 238000012549 training Methods 0.000 claims description 12
- 238000012795 verification Methods 0.000 claims description 11
- 238000013528 artificial neural network Methods 0.000 claims description 10
- 238000012512 characterization method Methods 0.000 claims description 10
- 230000000694 effects Effects 0.000 claims description 10
- 230000008569 process Effects 0.000 claims description 10
- 238000011156 evaluation Methods 0.000 claims description 6
- 238000005457 optimization Methods 0.000 claims description 6
- 238000012216 screening Methods 0.000 claims description 6
- 239000000203 mixture Substances 0.000 claims description 3
- 238000007781 pre-processing Methods 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 3
- 238000003058 natural language processing Methods 0.000 abstract description 2
- 230000006872 improvement Effects 0.000 description 5
- 238000011160 research Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000015556 catabolic process Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000006731 degradation reaction Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 239000012634 fragment Substances 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biophysics (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Biomedical Technology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Probability & Statistics with Applications (AREA)
- Databases & Information Systems (AREA)
- Automation & Control Theory (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing. The method comprises the following steps. First, a prototype sequence is retrieved by combining keyword matching and distributed representation matching; if no match is obtained, a usable pseudo prototype sequence is generated with a pseudo prototype generation method. Second, to use prototype sequences efficiently, the conventional encoder-decoder framework is improved: the encoding end receives the prototype sequence input through an additional encoder, while the decoding end controls the information flow with a gating mechanism and uses an improved loss function to reduce the influence of low-quality prototype sequences on the model. Starting from only a small amount of parallel corpus, the method effectively improves both the quantity and the quality of retrieval results, and is suitable for neural machine translation in low-resource environments and between similar languages.
Description
Technical Field
The invention relates to a low-resource neural machine translation method based on multi-strategy prototype generation, belonging to the technical field of natural language processing.
Background
In recent years, with the development of end-to-end translation models and the attention mechanism, neural machine translation (Neural Machine Translation, NMT) has advanced rapidly; its performance on mainstream language pairs quickly surpassed that of statistical machine translation, and it has become the dominant machine translation paradigm. Researchers have proposed various methods to improve the performance of neural machine translation. Among them, prototype methods based on prototype sequence integration have received much attention. In resource-rich scenarios, using a similar translation as the target-side prototype sequence can effectively improve the performance of neural machine translation. In low-resource scenarios, however, the lack of parallel corpus resources means that prototype sequences either cannot be retrieved or are of poor quality. Exploring how to use prototype sequences effectively to improve neural machine translation performance in low-resource scenarios therefore has significant research and application value.
A prototype sequence is a target-side sentence that exists in the translation memory and carries semantic information of the target language. Prototype methods exploit this target-side semantic information by introducing prototype sequences into the translation process, where they implicitly guide word alignment and constrain decoding. Current research on prototype methods focuses mainly on two aspects: prototype retrieval and prototype utilization. Prototype sequence retrieval is well developed in resource-rich scenarios because large-scale translation memories exist there; retrieval from such memories yields high-quality prototype sequences that effectively improve translation performance. In low-resource scenarios, however, the scale and quality of parallel corpora are limited, and traditional prototype sequence retrieval methods often fail to find usable prototypes, which limits their benefit to the downstream translation task. On the utilization side, researchers have proposed many improvements, particularly in how prototype sequences are incorporated into the translation model as encoder inputs: for example, adopting a dual-encoder structure to encode the input sentence and the prototype sequence simultaneously, and introducing a gating mechanism at the decoding end to balance the information ratio between the source sentence and the prototype sequence. Although these methods improve translation performance, they remain oriented mainly toward resource-rich scenarios, with few improvements targeted specifically at low-resource scenarios. The invention therefore proposes a low-resource neural machine translation method based on multi-strategy prototype generation, which better improves low-resource translation performance through an improved prototype acquisition method and a dedicated translation framework structure.
Disclosure of Invention
The invention provides a low-resource neural machine translation method based on multi-strategy prototype generation. It improves the efficiency and quality of prototype sequence acquisition by combining a traditional retrieval method with the proposed pseudo prototype generation method, and fuses the retrieved prototypes into the encoder-decoder framework through modifications of the neural network structure, so that the semantic information contained in prototype sequences is exploited to the greatest extent while the influence of low-quality sequences is weakened, thereby improving the performance of low-resource neural machine translation.
The technical scheme of the invention is as follows: the low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step2, prototype generation: prototype generation is performed with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step1 specifically comprises the following steps:
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
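For illustration, a minimal Python sketch of this fuzzy matching score, assuming whitespace tokenization and the score form reconstructed above (the patent fixes neither choice):

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over token lists.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def fuzzy_match_score(x, s_i):
    # FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|); 1.0 means identical sentences.
    x_tok, s_tok = x.split(), s_i.split()
    return 1.0 - edit_distance(x_tok, s_tok) / max(len(x_tok), len(s_tok))
```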
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching. The distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x. To enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
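A sketch of this retrieval path, assuming the Hugging Face checkpoint bert-base-multilingual-cased with mean pooling as the sentence representation and an inner-product faiss index over L2-normalized vectors (equivalent to cosine similarity); the patent names mBERT and faiss but specifies neither the pooling strategy nor the index type:

```python
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mbert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def encode(sentences):
    # Mean-pool the last hidden layer into one L2-normalized vector per sentence.
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        states = mbert(**batch).last_hidden_state          # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    vecs = (states * mask).sum(1) / mask.sum(1)            # masked mean over tokens
    return torch.nn.functional.normalize(vecs, dim=-1).numpy()

memory_src = ["hello world", "good morning"]  # source side s_1 ... s_L (toy data)
index = faiss.IndexFlatIP(768)                # inner product == cosine on unit vectors
index.add(encode(memory_src))

def distributed_topk(x, k=5):
    # Return ids and cosine scores of the top-k matching source sentences.
    scores, ids = index.search(encode([x]), k)
    return ids[0], scores[0]
```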
When fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence. When fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
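The selection rule above can be written down directly. A sketch reusing fuzzy_match_score and distributed_topk from the previous sketches; the fuzzy-match acceptance threshold tau and the distributed-similarity floor sim_min are hypothetical knobs, not values given by the patent:

```python
import numpy as np

def retrieve_prototype(x, memory_src, memory_tgt, tau=0.5, k=5, sim_min=0.8):
    # Fuzzy matching: best-scoring source sentence s_best over the whole memory.
    scores = [fuzzy_match_score(x, s) for s in memory_src]
    best = int(np.argmax(scores))

    # Distributed representation matching: top-k candidate set s'.
    topk_ids, topk_sims = distributed_topk(x, k)

    if scores[best] >= tau and best in topk_ids:
        return memory_tgt[best]              # t_best: confirmed by both strategies
    if topk_sims[0] >= sim_min:
        return memory_tgt[int(topk_ids[0])]  # fall back to the distributed best match
    return None  # nothing usable retrieved -> pseudo prototype generation (Step2.2)
```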
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold. The expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary; a sketch of both replacement strategies follows.
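Both replacement strategies reduce to dictionary-driven token substitution. A minimal sketch assuming plain dict lookups, whitespace tokenization, and single-word dictionary entries; max_repl stands in for the replacement-count threshold, which the patent leaves configurable (the embodiment later reports 3 as the best value):

```python
def global_replace(sentence, global_dict):
    # Replace every word that has a dictionary entry (maximization principle).
    return " ".join(global_dict.get(w, w) for w in sentence.split())

def keyword_replace(sentence, keyword_dict, max_repl=3):
    # Replace at most max_repl keywords (important nouns and entities).
    out, done = [], 0
    for w in sentence.split():
        if done < max_repl and w in keyword_dict:
            out.append(keyword_dict[w])
            done += 1
        else:
            out.append(w)
    return " ".join(out)

# Toy English->German keyword dictionary, purely for illustration.
print(keyword_replace("the cat sat on the mat", {"cat": "Katze", "mat": "Matte"}))
# -> "the Katze sat on the Matte"
```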
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation. The sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization. Given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i. The prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i. A sketch of this dual-encoder structure follows.
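A minimal PyTorch sketch of the dual-encoder front end; the layer count, model width, and head count are illustrative rather than taken from the patent, the shared embedding reflects the shared vocabulary mentioned in Step2.2, and positional encodings are omitted for brevity:

```python
import torch.nn as nn

class DualEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)  # shared source/prototype vocabulary
        make = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.sent_enc = nn.TransformerEncoder(make(), num_layers=num_layers)
        self.proto_enc = nn.TransformerEncoder(make(), num_layers=num_layers)  # same structure

    def forward(self, x_ids, t_ids):
        h_x = self.sent_enc(self.embed(x_ids))   # hidden states of the input sentence
        h_t = self.proto_enc(self.embed(t_ids))  # hidden states of the prototype sequence
        return h_x, h_t
```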
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding. The improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer. The improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
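A minimal PyTorch sketch of the improved codec attention layer and feed-forward step described by the formulas above. The dimensions are illustrative, the gate α is computed elementwise per position (one common realization of the formula), and the residual connections and layer normalization around each sublayer are omitted for brevity:

```python
import torch
import torch.nn as nn

class GatedCodecAttention(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048):
        super().__init__()
        self.att_x = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.att_t = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.gate = nn.Linear(2 * d_model, d_model)  # W_alpha, b_alpha
        self.ffn = nn.Sequential(                    # f(x) = max(0, xW1 + b1)W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, s_self, h_x, h_t):
        s_x, _ = self.att_x(s_self, h_x, h_x)        # sentence codec attention
        s_t, _ = self.att_t(s_self, h_t, h_t)        # prototype codec attention
        alpha = torch.sigmoid(self.gate(torch.cat([s_x, s_t], dim=-1)))
        s_enc_dec = alpha * s_x + (1 - alpha) * s_t  # gated fusion of the two streams
        return self.ffn(s_enc_dec)                   # s_ffn
```

A linear projection σ(·) over s_ffn followed by softmax then yields P(y_i | y_<i; x; t, θ) as stated above.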
The beneficial effects of the invention are as follows:
1. The invention obtains prototype sequences by combining a prototype sequence retrieval method with a word-replacement-based pseudo prototype generation method, maximizing the number of available prototype sequences in low-resource scenarios while guaranteeing sequence quality;
2. The invention improves the encoder-decoder translation framework: the encoding end uses a dual-encoder structure in which the sentence encoder and the prototype encoder respectively encode the input sentence and the retrieved (or generated) prototype, while the decoding end uses a gating mechanism to control the ratio and flow of information;
3. The loss calculation method of the neural machine translation model is improved: the model exploits the semantic information contained in high-quality prototype sequences while weakening the negative influence of low-quality prototype sequences on the translation model, ultimately improving translation performance in low-resource scenarios and also yielding better translation fluency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the overall structure of the model proposed by the present invention;
FIG. 3 shows the effect of keyword replacement times on model performance.
Detailed Description
Example 1: as shown in fig. 1-3, a low-resource neural machine translation method based on multi-strategy prototype generation comprises the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
The preprocessed parallel corpora are divided by scale into two types: small-scale parallel corpora and large-scale parallel corpora. Applying the method of the invention to parallel corpora of different scales makes it possible to observe how growing corpus scale affects information utilization and to verify the assumption that the proposed method suits scenarios where parallel corpus resources are scarce. Table 1 lists the experimental data.
Table 1 experimental data
Step2, prototype generation: prototype generation is performed with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: the encoder-decoder structure of the traditional attention-based neural machine translation model is improved to better integrate the prototype sequence, and the corpora of Step1 and Step2 are used as model input to generate the final translation.
As a preferred scheme of the invention, Step2 comprises the following specific steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; in a low-resource environment, long sentences are scarce in the corpus, so matched fragments are short and can hardly form an effective similarity measure; therefore, fuzzy matching is used as the keyword matching method instead of N-gram matching, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
Unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching. The distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x. To enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
When fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence. When fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
Global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
Keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold. The expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
To illustrate the prototype retrieval effect of the invention in low-resource scenarios, the proposed method is compared with the baseline prototype retrieval method on data of different scales. Table 2 shows the improvement in matching quality brought by the hybrid prototype retrieval model.
Table 2 Comparison of fuzzy matching and hybrid prototype retrieval effects
As the results in Table 2 show, in low-resource scenarios the number of prototype sequences obtained by fuzzy matching alone is clearly insufficient. Hybrid prototype matching, which combines fuzzy matching and distributed representation matching, alleviates this problem to some extent. On the large-scale data set WMT14, combining distributed representation matching yields better matching results than the fuzzy matching strategy alone. On the small-scale data set IWSLT15, combining fuzzy matching with distributed representation matching adds semantic-level consideration on top of keyword matching, further improving the quality of the prototype sequences. The hybrid prototype retrieval method proposed by the invention is therefore well suited to low-resource scenarios.
FIG. 3 illustrates the effect of the number of keyword replacements on model performance for the Vietnamese-English translation task. When generating pseudo prototypes through keyword replacement, a replacement threshold is first set empirically and then tuned on validation-set performance. The evaluation shows that, under the sequential dictionary-traversal strategy, a small replacement threshold yields prototype sequences that differ too little from the original text to provide effective guidance for the translation process, while a large threshold degenerates toward global replacement; the model performs best when the number of keyword replacements is set to 3.
As a preferred embodiment of the present invention, Step3 includes:
The encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation. The sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization. Given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i. The prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), it encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
As a preferred embodiment of the present invention, Step3 includes:
The decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding. The improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer. The improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
As a preferred embodiment of the present invention, in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
Subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
To illustrate the translation effect of the invention, the translations generated by the invention are compared against baseline systems. Tables 3 and 4 show the improvements on the small-scale corpora.
Table 3 BLEU evaluation results (%)
Table 4 RIBES evaluation results (%)
The results in Tables 3 and 4 show that the proposed method learns prototype sequence representations through an additional prototype encoder at the encoding end, while at the decoding end it effectively exploits the semantic information contained in high-quality prototype sequences and reduces the noise introduced by low-quality ones, effectively improving translation quality and fluency in low-resource scenarios.
While the present invention has been described in detail with reference to the drawings, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (5)
1. A low-resource neural machine translation method based on multi-strategy prototype generation, characterized by comprising the following specific steps:
Step1, corpus preprocessing: preprocessing parallel training, validation and test corpora of different scales for model training, parameter tuning and effect testing; constructing a multilingual global dictionary and a keyword dictionary for generating pseudo prototypes;
Step2, prototype generation: performing prototype generation with a generation method that mixes multiple strategies, so as to guarantee the availability of a prototype sequence; the idea of this step is as follows: first, prototype retrieval is performed by combining fuzzy matching and distributed representation matching, and if no prototype is retrieved, keywords in the input sentence are replaced through word replacement operations to obtain a pseudo prototype sequence;
Step3, constructing a translation model incorporating the prototype sequence: improving the encoder-decoder structure of the traditional attention-based neural machine translation model to better integrate the prototype sequence, and using the corpora of Step1 and Step2 as model input to generate the final translation;
Step2 specifically comprises the following steps:
Step2.1, prototype retrieval is performed by combining fuzzy matching and distributed representation matching; the concrete implementation is as follows: the translation memory is a set of L parallel sentence pairs {(s_l, t_l): l = 1, …, L}, where s_l is a source sentence and t_l the corresponding target sentence; for a given input sentence x, keywords are first matched in the translation memory for retrieval; fuzzy matching is adopted as the keyword matching method, defined as:
FM(x, s_i) = 1 - ED(x, s_i) / max(|x|, |s_i|)
where ED(x, s_i) is the edit distance between x and s_i, and |x| and |s_i| are the sentence lengths of x and s_i;
unlike keyword-based matching methods, distributed representation matching retrieves according to the distance between sentence vector representations; it performs similarity retrieval using semantic information to some extent and thus provides a retrieval perspective different from that of keyword matching; the distributed representation matching based on cosine similarity is defined as:
sim(x, s_i) = (h_x · h_si) / (||h_x|| ||h_si||)
where h_x and h_si are the vector representations of x and s_i respectively, and ||h_x|| is the norm of the vector h_x; to enable fast calculation, the multilingual pre-training model mBERT is first used to obtain the vector representations of sentences x and s_i, and the faiss tool is then used to perform similarity matching on these representations;
when fuzzy matching obtains the best-matching source sentence s_best, the set s' = {s_1, s_2, …, s_k} of top-k matching results is additionally obtained through distributed representation matching; if s_best ∈ s', the target-side sentence t_best corresponding to s_best is selected as the prototype sequence; when fuzzy matching fails to retrieve a matching source sentence, or s_best ∉ s', the source sentence s_best is instead retrieved through distributed representation matching;
Step2.2, if no prototype is retrieved in Step2.1, keyword replacement is performed on the input sentence to generate a pseudo prototype, i.e., word-replacement-based pseudo prototype generation; specifically, the method comprises the following two replacement strategies:
global replacement: when retrieval fails for the input sentence, words in the input sentence are replaced as extensively as possible using the bilingual dictionary, following a maximization principle; the replaced sentence is called a pseudo prototype sequence;
keyword replacement: important nouns and entities are extracted from the bilingual dictionary to construct the keyword dictionary; when retrieval fails for the input sentence, this dictionary is used to replace keywords in the input sentence and generate a pseudo prototype sequence, with the upper limit on the number of replacements kept below a set threshold; the expectation is that the pseudo prototype sequence, which mixes source-side and important target-side vocabulary, provides guidance for translation generation on the basis of the shared vocabulary.
2. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step1 specifically comprises:
Step1.1, model training uses IWSLT15, a data set in common use in the machine translation field; the translation tasks are English-Vietnamese, English-Chinese and English-German; for validation and testing, tst2012 is selected as the validation set for parameter optimization and model selection, and tst2013 is selected as the test set for evaluation;
Step1.2, an English-Vietnamese-Chinese-German global substitution dictionary is constructed using PanLex, Wikipedia, a laboratory-built English-Chinese-Southeast-Asian-language dictionary, and the Google Translate interface;
Step1.3, on the basis of Step1.2, the keyword dictionary is obtained through annotation-based screening, with all entities retained during screening; to avoid over-concentration on a few hot nouns, noun entries are retrieved in the corpus and ranked by occurrence frequency.
3. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step3 comprises:
Step3.1, the encoding end adopts a dual-encoder structure that receives the sentence input and the prototype sequence input simultaneously and encodes each into a corresponding hidden state representation; the sentence encoder is a standard Transformer encoder formed by stacking multiple layers, each layer consisting of 2 sublayers, namely a multi-head self-attention layer and a feed-forward neural network layer, both using residual connections and layer normalization; given an input sentence x = (x_1, x_2, …, x_m), the sentence encoder encodes it into the corresponding hidden state sequence h_x = (h_x1, h_x2, …, h_xm), where h_xi is the hidden state corresponding to x_i; the prototype encoder is structurally identical to the sentence encoder; given a prototype sequence t = (t_1, t_2, …, t_n), the prototype encoder encodes it into the corresponding hidden state sequence h_t = (h_t1, h_t2, …, h_tn), where h_ti is the hidden state corresponding to t_i.
4. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 1, characterized in that Step3 comprises:
the decoding end integrates a gating mechanism, which uses the self-learning capability of the neural network to optimize the ratio between sentence information and prototype information and to control the information flow during decoding; the improved decoder consists of three sublayers: (1) a self-attention layer; (2) an improved codec attention layer; and (3) a fully connected feed-forward network layer; the improved codec attention layer consists of a sentence codec attention module and a prototype codec attention module; upon receiving the output s_self of the multi-head self-attention layer at time step i and the output h_x of the sentence encoder, the sentence codec attention module performs the attention calculation.
5. The low-resource neural machine translation method based on multi-strategy prototype generation according to claim 4, characterized in that in Step3, the attention calculation performed by the sentence codec attention module is:
s_x = MultiHeadAtt(s_self, h_x, h_x)
where MultiHeadAtt(·) is the multi-head attention calculation; similarly, the prototype codec attention is calculated as:
s_t = MultiHeadAtt(s_self, h_t, h_t)
subsequently, the sentence codec attention output s_x and the prototype codec attention output s_t are concatenated to calculate the gating variable α:
α = sigmoid(W_α[s_x; s_t] + b_α)
where W_α and b_α are trainable parameters; α is then used to calculate the final output of the codec attention layer:
s_enc_dec = α * s_x + (1 - α) * s_t
s_enc_dec is then fed as input into the fully connected feed-forward network:
s_ffn = f(s_enc_dec)
where f(·) is defined as f(x) = max(0, xW_1 + b_1)W_2 + b_2, in which W_1, W_2, b_1 and b_2 are trainable parameters; the final translation y_i at time step i is calculated as:
P(y_i | y_<i; x; t, θ) = softmax(σ(s_ffn))
where t is the prototype sequence and σ(·) is a linear transformation function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210293213.2A CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210293213.2A CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114676708A CN114676708A (en) | 2022-06-28 |
CN114676708B true CN114676708B (en) | 2024-04-23 |
Family
ID=82073905
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210293213.2A Active CN114676708B (en) | 2022-03-24 | 2022-03-24 | Low-resource neural machine translation method based on multi-strategy prototype generation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114676708B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117993396A * | 2024-01-23 | 2024-05-07 | Harbin Institute of Technology | RAG-based large model machine translation method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10346548B1 (en) * | 2016-09-26 | 2019-07-09 | Lilt, Inc. | Apparatus and method for prefix-constrained decoding in a neural machine translation system |
CN110059323A * | 2019-04-22 | 2019-07-26 | Soochow University | Multi-domain neural machine translation method based on self-attention mechanism |
CN110489766A * | 2019-07-25 | 2019-11-22 | Kunming University of Science and Technology | Chinese-Vietnamese low-resource neural machine translation method based on encoding summarization and decoding weighting |
CN112507734A * | 2020-11-19 | 2021-03-16 | Nanjing University | Neural machine translation system based on Romanized Uyghur |
-
2022
- 2022-03-24 CN CN202210293213.2A patent/CN114676708B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10346548B1 (en) * | 2016-09-26 | 2019-07-09 | Lilt, Inc. | Apparatus and method for prefix-constrained decoding in a neural machine translation system |
CN110059323A * | 2019-04-22 | 2019-07-26 | Soochow University | Multi-domain neural machine translation method based on self-attention mechanism |
CN110489766A * | 2019-07-25 | 2019-11-22 | Kunming University of Science and Technology | Chinese-Vietnamese low-resource neural machine translation method based on encoding summarization and decoding weighting |
CN112507734A * | 2020-11-19 | 2021-03-16 | Nanjing University | Neural machine translation system based on Romanized Uyghur |
Non-Patent Citations (2)
Title |
---|
Efficient Low-Resource Neural Machine Translation with Reread and Feedback Mechanism; Yu Zhiqiang et al.; ACM Transactions on Asian and Low-Resource Language Information Processing; 2020-01-09; Vol. 19, No. 3; 1-13 *
Low-Resource Neural Machine Translation Based on Multi-Strategy Prototype Generation (基于多策略原型生成的低资源神经机器翻译); Yu Zhiqiang et al.; Journal of Software (软件学报); 2023-04-28; Vol. 34, No. 11; 5113-5125 *
Also Published As
Publication number | Publication date |
---|---|
CN114676708A (en) | 2022-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102577514B1 (en) | Method, apparatus for text generation, device and storage medium | |
Liu et al. | A recursive recurrent neural network for statistical machine translation | |
Li et al. | Text compression-aided transformer encoding | |
CN111160050A (en) | Chapter-level neural machine translation method based on context memory network | |
CN110619043A (en) | Automatic text abstract generation method based on dynamic word vector | |
CN113901847A (en) | Neural machine translation method based on source language syntax enhanced decoding | |
CN114676708B (en) | Low-resource neural machine translation method based on multi-strategy prototype generation | |
Heo et al. | Multimodal neural machine translation with weakly labeled images | |
CN115114940A (en) | Machine translation style migration method and system based on curriculum pre-training | |
CN113657125B (en) | Mongolian non-autoregressive machine translation method based on knowledge graph | |
CN114648024A (en) | Chinese cross-language abstract generation method based on multi-type word information guidance | |
CN112380882B (en) | Mongolian Chinese neural machine translation method with error correction function | |
CN115017924B (en) | Construction of neural machine translation model for cross-language translation and translation method thereof | |
CN114707523B (en) | Image-multilingual subtitle conversion method based on interactive converter | |
Zhang et al. | Guidance module network for video captioning | |
CN112347753B (en) | Abstract generation method and system applied to reading robot | |
Chang et al. | Improving language translation using the hidden Markov model | |
Wu | A chinese-english machine translation model based on deep neural network | |
CN113157855A (en) | Text summarization method and system fusing semantic and context information | |
Zhu | Exploration on Korean-Chinese collaborative translation method based on recursive recurrent neural network | |
Maqsood | Evaluating newsQA dataset with ALBERT | |
CN114611487B (en) | Unsupervised Thai dependency syntax analysis method based on dynamic word embedding alignment | |
Yu et al. | Semantic extraction for sentence representation via reinforcement learning | |
Getachew et al. | Gex'ez-English Bi-Directional Neural Machine Translation Using Transformer | |
Song et al. | RUC_AIM3 at TRECVID 2019: Video to Text. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |