CN110688860A - Weight distribution method based on multiple attention mechanisms of a Transformer - Google Patents

Weight distribution method based on multiple attention mechanisms of a Transformer

Info

Publication number
CN110688860A
Authority
CN
China
Prior art keywords
output
delta
attention
model
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910924914.XA
Other languages
Chinese (zh)
Other versions
CN110688860B (en)
Inventor
闫明明
陈绪浩
罗华成
赵宇
段世豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201910924914.XA priority Critical patent/CN110688860B/en
Publication of CN110688860A publication Critical patent/CN110688860A/en
Application granted granted Critical
Publication of CN110688860B publication Critical patent/CN110688860B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a weight distribution method based on the multiple attention mechanisms of a Transformer; the method comprises the following steps: the inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.

Description

Weight distribution method based on multiple attention mechanisms of a Transformer
Technical Field
The invention relates to the field of neural machine translation, and in particular to a Transformer-based multi-attention-mechanism weight distribution method.
Background
Neural network machine translation is a machine translation method proposed in recent years. Compared with traditional statistical machine translation, it trains a neural network that maps one sequence to another, and the output can be a sequence of variable length, which gives better performance in translation, dialogue and text summarization. Neural network machine translation is in fact an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and the decoder converts that information into another language, the target language, thereby completing translation.
When the model generates an output, it also produces an attention range indicating which parts of the input sequence should be focused on when generating the next output; the next output is then generated according to the attended region, and the process repeats. The attention mechanism is similar to certain human behaviour: when a person reads a passage, attention is usually paid only to the informative words rather than to all of them, i.e. the attention weight a person gives to each word differs. The attention mechanism model increases the training difficulty of the model but improves the quality of text generation. In this patent, the improvement is made precisely in the attention mechanism function.
Since the first neural machine translation systems were proposed in 2013, and with the rapid growth of computing power, neural machine translation has developed rapidly; the seq2seq model, the Transformer model and others have been proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. The model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert the state vector into the target language. In 2017, Google released a new machine learning model, the Transformer, which performed far better than existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
In the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per calculation, and that output serves as the input to the next calculation. Such single-threaded computation easily leads to error accumulation. We therefore introduce multiple attention mechanisms with weight assignment, i.e. we search for the optimal solution across several computation processes, so as to achieve the best translation effect.
Disclosure of Invention
Therefore, in order to overcome the above shortcomings, the present invention provides a weight distribution method based on the multiple attention mechanisms of a Transformer; the method is applied to a Transformer framework model based on the attention mechanism. The method comprises the following steps: the inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism; each attention mechanism has different outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output.
The invention is realized as follows: a weight distribution method based on the multiple attention mechanisms of a Transformer is constructed and applied to an attention-based Transformer model, characterized in that the method comprises the following steps:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout;
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
The invention has the following advantages: the invention discloses a weight distribution method based on the multiple attention mechanisms of a Transformer. The method is applied to a Transformer framework model based on the attention mechanism. The inputs to the attention mechanism are the word vectors of the target and source languages, and the output is an alignment tensor. Using several attention mechanism functions, several alignment tensors can be produced, and each output differs because of random parameter variation during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism; each has different outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output, using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1 and δi are the weight parameters we set. Oi is the output of each attention model. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
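As a concrete illustration of the weighted combination and of the weight adjustment in steps 2-5 above, the following Python sketch gives one possible reading of the procedure. The helper names (init_weights, combine_outputs, update_weights), the choice of the element-wise mean as the center point, the Euclidean distance, and the step size eta are assumptions made for illustration; they are not fixed by the patent.

# Illustrative sketch of steps 2-5 (not the patent's own code).
# Assumptions: every attention model output O_i is a NumPy array of the same
# shape; the "center point" is taken as the element-wise mean; distances are
# Euclidean; `eta` is an arbitrary adjustment step.
import numpy as np

def init_weights(num_models):
    """Step 2: random weight sequence delta with delta_1 + ... + delta_i = 1."""
    delta = np.random.rand(num_models)
    return delta / delta.sum()

def combine_outputs(outputs, delta):
    """Step 3: fin_out = delta_1*O_1 + delta_2*O_2 + ... + delta_i*O_i."""
    return sum(d * o for d, o in zip(delta, outputs))

def update_weights(delta, outputs, loss, prev_loss, eta=0.05):
    """Steps 4-5: shift weight toward the output closest to the center point
    when the loss fell, toward the farthest output when the loss rose, then
    renormalise so the weights still sum to 1."""
    center = np.mean(np.stack(outputs), axis=0)              # center point of the outputs
    dist = np.array([np.linalg.norm(o - center) for o in outputs])
    target = int(dist.argmin()) if loss < prev_loss else int(dist.argmax())
    new_delta = delta.copy()
    new_delta[target] += eta                                 # raise the share of the chosen model
    return new_delta / new_delta.sum()

In a training loop, combine_outputs would supply the output fed to the rest of the network, and update_weights would be called once per training round with the current and previous loss values.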
Detailed Description
The present invention will be described in detail below, and the technical solutions in the embodiments of the present invention will be clearly and completely described. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention provides, through this improvement, a weight distribution method based on the multiple attention mechanisms of a Transformer. The method is applied to a Transformer framework model based on the attention mechanism.
Transformer framework introduction:
encoder consisting of 6 identical layers, each layer containing two sub-layers, the first sub-layer being a multi-head attention layer and then a simple fully connected layer. Where each sub-layer is concatenated and normalized with the residual).
Decoder: also consists of 6 identical layers, but each layer differs from an encoder layer in containing three sub-layers: a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point is masking, which prevents future output words from being used during training.
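To make the layer structure described above concrete, here is a minimal PyTorch sketch of a single encoder layer: a multi-head self-attention sub-layer and a fully connected sub-layer, each followed by a residual connection and layer normalization. The dimensions (d_model = 512, n_heads = 8, d_ff = 2048) follow the original Transformer paper and are assumptions for illustration, not values specified in this patent.

# Illustrative encoder layer (not the patent's own code).
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (seq_len, batch, d_model)
        attn_out, _ = self.self_attn(x, x, x)      # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)               # residual connection + layer norm
        x = self.norm2(x + self.ff(x))             # fully connected sub-layer + residual + layer norm
        return x

A decoder layer would add the masked self-attention and encoder-decoder attention sub-layers described above in the same pattern.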
Attention model:
the encode-decoder model, although very classical, is also very limited. A large limitation is that the link between encoding and decoding is a fixed-length semantic vector C. That is, the encoder compresses the entire sequence of information into a fixed-length vector. However, there are two disadvantages to this, namely, the semantic vector cannot completely represent the information of the whole sequence, and the information carried by the first input content is diluted by the later input information. The longer the input sequence, the more severe this phenomenon is. This results in insufficient information being initially obtained for the input sequence at the time of decoding, which can compromise accuracy.
To solve the above problem, the attention model was proposed about a year after the appearance of Seq2Seq. When the model generates an output, it also produces an attention range indicating which parts of the input sequence to focus on for the next output; the next output is then generated according to the attended region, and the process repeats. Attention has certain similarities with human behaviour: when a person reads a sentence, attention is usually paid only to the informative words rather than to all of them, i.e. the attention weight given to each word differs. The attention model increases training difficulty but improves the quality of generated text.
First, generate the semantic vector at the current moment (the original equation image is not reproduced here; it is the attention-weighted sum of the encoder hidden states), with the decoder state updated as

s_t = tanh(W[s_{t-1}, y_{t-1}])

Secondly, transfer the hidden-layer information and make the prediction (the original equation images are not reproduced here).
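To make the attention step above concrete, the following NumPy sketch scores each encoder hidden state against the previous decoder state, turns the scores into weights with a softmax, and takes their weighted sum as the semantic (context) vector. The dot-product score is an assumption chosen for illustration; the patent's own equations for this step are only available as images.

# Illustrative attention step (not the patent's own code).
import numpy as np

def attention_context(s_prev: np.ndarray, enc_states: np.ndarray) -> np.ndarray:
    """s_prev: previous decoder state, shape (d,); enc_states: encoder hidden states, shape (T, d)."""
    scores = enc_states @ s_prev                   # e_tj = score(s_{t-1}, h_j), here a dot product
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                           # alpha_tj = softmax(e_tj)
    return alpha @ enc_states                      # c_t = sum_j alpha_tj * h_j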
Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism, and each attention mechanism has different outputs and characteristics.
The improvement here is a modification of the attention function.
All attention mechanism models are put into operation, and their outputs are combined by a regularization calculation that approaches the optimal output, using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output.
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout.
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. A weight distribution method based on the multiple attention mechanisms of a Transformer, applied to an attention-based Transformer model, characterized in that the method comprises the following steps:
Step 1: within the Transformer model, select the attention model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weight sequence δ is random, with δ1 + δ2 + ... + δi = 1;
Step 3: normalize each model output and compute the center point of the outputs (the point closest to all of them); using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function relative to the previous training round; if the loss function decreases, increase the proportion of the δ entries close to the center point, and if the loss function rises, increase the proportion of the δ entries farthest from the center point, strictly maintaining δ1 + δ2 + ... + δi = 1 throughout;
Step 5: iterate this calculation in a loop several times and finally determine the optimal weight sequence δ.
CN201910924914.XA 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer Active CN110688860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Publications (2)

Publication Number Publication Date
CN110688860A true CN110688860A (en) 2020-01-14
CN110688860B CN110688860B (en) 2024-02-06

Family

ID=69110821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910924914.XA Active CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Country Status (1)

Country Link
CN (1) CN110688860B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381581A (en) * 2020-11-17 2021-02-19 东华理工大学 Advertisement click rate estimation method based on improved Transformer
CN112381581B (en) * 2020-11-17 2022-07-08 东华理工大学 Advertisement click rate estimation method based on improved Transformer
CN112992129A (en) * 2021-03-08 2021-06-18 中国科学技术大学 Attention-keeping mechanism monotonicity keeping method in voice recognition task
CN113505193A (en) * 2021-06-01 2021-10-15 华为技术有限公司 Data processing method and related equipment
WO2022253074A1 (en) * 2021-06-01 2022-12-08 华为技术有限公司 Data processing method and related device

Also Published As

Publication number Publication date
CN110688860B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN110222349B (en) Method and computer for deep dynamic context word expression
CN108153913B (en) Training method of reply information generation model, reply information generation method and device
CN111274375B (en) Multi-turn dialogue method and system based on bidirectional GRU network
CN110688860A (en) Weight distribution method based on multiple attention mechanisms of a Transformer
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN105279552A (en) Character based neural network training method and device
CN110032638A (en) A kind of production abstract extraction method based on coder-decoder
CN115841119B (en) Emotion cause extraction method based on graph structure
CN112560456A (en) Generation type abstract generation method and system based on improved neural network
CN114691858B (en) Improved UNILM digest generation method
CN115860054A (en) Sparse codebook multiple access coding and decoding system based on generation countermeasure network
CN110717342B (en) Distance parameter alignment translation method based on transformer
CN112949255A (en) Word vector training method and device
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN112287641B (en) Synonym sentence generating method, system, terminal and storage medium
CN110717343B (en) Optimal alignment method based on transformer attention mechanism output
CN112528168A (en) Social network text emotion analysis method based on deformable self-attention mechanism
Wang et al. V-A3tS: A rapid text steganalysis method based on position information and variable parameter multi-head self-attention controlled by length
CN110674647A (en) Layer fusion method based on Transformer model and computer equipment
Tian et al. An online word vector generation method based on incremental huffman tree merging
CN113469260B (en) Visual description method based on convolutional neural network, attention mechanism and self-attention converter
CN113077785B (en) End-to-end multi-language continuous voice stream voice content identification method and system
CN115167863A (en) Code completion method and device based on code sequence and code graph fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant