CN110688860A - Weight distribution method based on multiple attention mechanisms of a Transformer - Google Patents

Weight distribution method based on multiple attention mechanisms of a Transformer

Info

Publication number
CN110688860A
Authority
CN
China
Prior art keywords
output
delta
attention
model
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910924914.XA
Other languages
Chinese (zh)
Other versions
CN110688860B (en)
Inventor
闫明明
陈绪浩
罗华成
赵宇
段世豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910924914.XA priority Critical patent/CN110688860B/en
Publication of CN110688860A publication Critical patent/CN110688860A/en
Application granted granted Critical
Publication of CN110688860B publication Critical patent/CN110688860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a weight distribution method based on the multiple attention mechanisms of a Transformer; the method comprises the following steps: the inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.

Description

Weight distribution method based on multiple attention mechanisms of a Transformer
Technical Field
The invention relates to the field of neural machine translation, and in particular to a Transformer-based multi-attention-mechanism weight distribution method.
Background
Neural network machine translation is a machine translation method proposed in recent years. Compared with traditional statistical machine translation, it trains a neural network that maps one sequence to another, and the output can be a sequence of variable length, which gives better performance in translation, dialogue and text summarization. Neural network machine translation is in fact an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and the decoder converts that information into another language, the target language, thereby completing translation.
When the model generates an output, it also produces an attention range indicating which parts of the input sequence should be focused on when generating the next output; the next output is then generated according to the attended region, and the process repeats. The attention mechanism is similar to certain human behaviour: when a person reads a passage, attention is usually paid only to the informative words rather than to all of them, i.e. the attention weight a person gives to each word differs. The attention mechanism model increases the training difficulty of the model but improves the quality of text generation. In this patent, the improvement is made precisely in the attention mechanism function.
Since the first neural machine translation systems were proposed in 2013, and with the rapid growth of computing power, neural machine translation has developed rapidly; the seq2seq model, the Transformer model and others have been proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. The model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert the state vector into the target language. In 2017, Google released a new machine learning model, the Transformer, which performed far better than existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
In the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per calculation, and that output serves as the input to the next calculation. Such single-threaded computation easily leads to error accumulation. We therefore introduce multiple attention mechanisms with weight assignment, i.e. we search for the optimal solution across several computation processes, so as to achieve the best translation effect.
Disclosure of Invention
Therefore, in order to overcome the above shortcomings, the present invention provides a weight distribution method based on the multiple attention mechanisms of a Transformer; the method is applied to a Transformer framework model based on the attention mechanism. The method comprises the following steps: the inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism; each attention mechanism has different outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output.
The invention is realized as follows: a weight distribution method based on the multiple attention mechanisms of a Transformer is constructed and applied to an attention-based Transformer model, characterized in that the method comprises the following steps:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout;
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
The invention has the following advantages: the invention discloses a weight distribution method based on the multiple attention mechanisms of a Transformer. The method is applied to a Transformer framework model based on the attention mechanism. The inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism; each has different outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output, using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1 and δi are the weight parameters we set. Oi is the output of each attention model. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
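As a concrete illustration of the weighted combination and of the weight adjustment in steps 2-5 above, the following Python sketch gives one possible reading of the procedure. The helper names (init_weights, combine_outputs, update_weights), the choice of the element-wise mean as the center point, the Euclidean distance, and the step size eta are assumptions made for illustration; they are not fixed by the patent.

# Illustrative sketch of steps 2-5 (not the patent's own code).
# Assumptions: every attention model output O_i is a NumPy array of the same
# shape; the "center point" is taken as the element-wise mean; distances are
# Euclidean; `eta` is an arbitrary adjustment step.
import numpy as np

def init_weights(num_models):
    """Step 2: random weight sequence delta with delta_1 + ... + delta_i = 1."""
    delta = np.random.rand(num_models)
    return delta / delta.sum()

def combine_outputs(outputs, delta):
    """Step 3: fin_out = delta_1*O_1 + delta_2*O_2 + ... + delta_i*O_i."""
    return sum(d * o for d, o in zip(delta, outputs))

def update_weights(delta, outputs, loss, prev_loss, eta=0.05):
    """Steps 4-5: shift weight toward the output closest to the center point
    when the loss fell, toward the farthest output when the loss rose, then
    renormalise so the weights still sum to 1."""
    center = np.mean(np.stack(outputs), axis=0)              # center point of the outputs
    dist = np.array([np.linalg.norm(o - center) for o in outputs])
    target = int(dist.argmin()) if loss < prev_loss else int(dist.argmax())
    new_delta = delta.copy()
    new_delta[target] += eta                                 # raise the share of the chosen model
    return new_delta / new_delta.sum()

In a training loop, combine_outputs would supply the output fed to the rest of the network, and update_weights would be called once per training round with the current and previous loss values.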
Detailed Description
The present invention will be described in detail below, and the technical solutions in the embodiments of the present invention will be clearly and completely described. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention provides, through this improvement, a weight distribution method based on the multiple attention mechanisms of a Transformer. The method is applied to a Transformer framework model based on the attention mechanism.
Transformer framework introduction:
encoder consisting of 6 identical layers, each layer containing two sub-layers, the first sub-layer being a multi-head attention layer and then a simple fully connected layer. Where each sub-layer is concatenated and normalized with the residual).
Decoder: also consists of 6 identical layers, but each layer differs from an encoder layer in containing three sub-layers: a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point is masking, which prevents future output words from being used during training.
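To make the layer structure described above concrete, here is a minimal PyTorch sketch of a single encoder layer: a multi-head self-attention sub-layer and a fully connected sub-layer, each followed by a residual connection and layer normalization. The dimensions (d_model = 512, n_heads = 8, d_ff = 2048) follow the original Transformer paper and are assumptions for illustration, not values specified in this patent.

# Illustrative encoder layer (not the patent's own code).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)      # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)               # residual connection + layer norm
        x = self.norm2(x + self.ff(x))             # fully connected sub-layer + residual + layer norm
        return x

A decoder layer would add the masked self-attention and encoder-decoder attention sub-layers described above in the same pattern.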
Attention model:
the encode-decoder model, although very classical, is also very limited. A large limitation is that the link between encoding and decoding is a fixed-length semantic vector C. That is, the encoder compresses the entire sequence of information into a fixed-length vector. However, there are two disadvantages to this, namely, the semantic vector cannot completely represent the information of the whole sequence, and the information carried by the first input content is diluted by the later input information. The longer the input sequence, the more severe this phenomenon is. This results in insufficient information being initially obtained for the input sequence at the time of decoding, which can compromise accuracy.
To solve the above problem, the attention model was proposed about a year after the appearance of Seq2Seq. When the model generates an output, it also produces an attention range indicating which parts of the input sequence to focus on for the next output; the next output is then generated according to the attended region, and the process repeats. Attention has certain similarities with human behaviour: when a person reads a sentence, attention is usually paid only to the informative words rather than to all of them, i.e. the attention weight given to each word differs. The attention model increases training difficulty but improves the quality of generated text.
First, generate the semantic vector at the current moment (the original equation image is not reproduced here; it is the attention-weighted sum of the encoder hidden states), with the decoder state updated as

s_t = tanh(W[s_{t-1}, y_{t-1}])

Secondly, transfer the hidden-layer information and make the prediction (the original equation images are not reproduced here).
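To make the attention step above concrete, the following NumPy sketch scores each encoder hidden state against the previous decoder state, turns the scores into weights with a softmax, and takes their weighted sum as the semantic (context) vector. The dot-product score is an assumption chosen for illustration; the patent's own equations for this step are only available as images.

# Illustrative attention step (not the patent's own code).
import numpy as np

def attention_context(s_prev: np.ndarray, enc_states: np.ndarray) -> np.ndarray:
    """s_prev: previous decoder state, shape (d,); enc_states: encoder hidden states, shape (T, d)."""
    scores = enc_states @ s_prev                   # e_tj = score(s_{t-1}, h_j), here a dot product
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # alpha_tj = softmax(e_tj)
    return alpha @ enc_states                      # c_t = sum_j alpha_tj * h_j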
Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism, and each attention mechanism has different outputs and characteristics.
The improvement here is a modification of the attention function.
All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output, using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output.
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout.
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. A weight distribution method based on the multiple attention mechanisms of a Transformer, applied to an attention-based Transformer model, characterized in that the method comprises the following steps:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout;
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
CN201910924914.XA 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer Active CN110688860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Publications (2)

Publication Number Publication Date
CN110688860A true CN110688860A (en) 2020-01-14
CN110688860B CN110688860B (en) 2024-02-06

Family

ID=69110821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910924914.XA Active CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Country Status (1)

Country Link
CN (1) CN110688860B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381581A (en) * 2020-11-17 2021-02-19 东华理工大学 Advertisement click rate estimation method based on improved Transformer
CN112381581B (en) * 2020-11-17 2022-07-08 东华理工大学 Advertisement click rate estimation method based on improved Transformer
CN112992129A (en) * 2021-03-08 2021-06-18 中国科学技术大学 Attention-keeping mechanism monotonicity keeping method in voice recognition task
CN113505193A (en) * 2021-06-01 2021-10-15 华为技术有限公司 Data processing method and related equipment
WO2022253074A1 (en) * 2021-06-01 2022-12-08 华为技术有限公司 Data processing method and related device

Also Published As

Publication number Publication date
CN110688860B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN110222349B (en) Method and computer for deep dynamic context word expression
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN111274375B (en) Multi-turn dialogue method and system based on bidirectional GRU network
CN110688860A (en) Weight distribution method based on multiple attention mechanisms of a Transformer
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN105279552A (en) Character based neural network training method and device
CN110032638A (en) A kind of production abstract extraction method based on coder-decoder
CN115841119B (en) Emotion cause extraction method based on graph structure
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN114691858B (en) Improved UNILM digest generation method
CN115860054A (en) Sparse codebook multiple access coding and decoding system based on generation countermeasure network
CN110717342B (en) Distance parameter alignment translation method based on transformer
CN112949255A (en) Word vector training method and device
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN112287641B (en) Synonym sentence generating method, system, terminal and storage medium
CN110717343B (en) Optimal alignment method based on transformer attention mechanism output
CN112528168A (en) Social network text emotion analysis method based on deformable self-attention mechanism
Wang et al. V-A3tS: A rapid text steganalysis method based on position information and variable parameter multi-head self-attention controlled by length
CN110674647A (en) Layer fusion method based on Transformer model and computer equipment
Tian et al. An online word vector generation method based on incremental huffman tree merging
CN113469260B (en) Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113077785B (en) End-to-end multi-language continuous voice stream voice content identification method and system
CN115167863A (en) Code completion method and device based on code sequence and code graph fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant