CN110688860A - Weight distribution method based on multiple attention mechanisms of transformer - Google Patents
Weight distribution method based on multiple attention mechanisms of transformer
- Publication number
- CN110688860A (application CN201910924914.XA)
- Authority
- CN
- China
- Prior art keywords
- output
- delta
- attention
- model
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a weight distribution method based on multiple attention mechanisms of a transformer. The method comprises the following steps: the input to the attention mechanism is the word vectors of the target and source languages, and the output is an alignment tensor. Multiple alignment tensors can be produced by using multiple attention mechanism functions, and each output differs because of random parameter variations in the calculation process. All attention mechanism models are put into operation, and their outputs are combined through a regularization calculation so as to approach the optimal output. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased so that it exerts a larger influence on the final output, thereby improving the translation quality.
Description
Technical Field
The invention relates to the field of neural machine translation, and in particular to a transformer-based multi-attention-mechanism weight distribution method.
Background
Neural network machine translation is a machine translation approach proposed in recent years. Compared with traditional statistical machine translation, it trains a neural network that maps one sequence to another, and the output can be a sequence of variable length, which yields better performance in translation, dialogue and text summarization. Neural network machine translation is essentially an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and the decoder converts that information into another language, the target language, thereby completing the translation.
When the model generates an output, it first produces an attention range that indicates which parts of the input sequence should be focused on for the next output, then generates that output according to the focused region, and repeats this process. The attention mechanism is similar to certain human behaviour: when reading a sentence, a person usually focuses only on the informative words rather than on every word, i.e. the attention weight assigned to each word differs. The attention mechanism model increases the training difficulty of the model but improves the quality of the generated text. In this patent the improvement is made precisely in the attention mechanism function.
Since neural machine translation systems were first proposed in 2013, neural machine translation has developed rapidly along with the growth of computing power, and models such as seq2seq and the Transformer have been proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. Their model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert the state vector into the target language. In 2017, Google released a new machine learning model, the Transformer, which performed far better than existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
In the alignment process of the attention mechanism function, the existing framework first calculates the similarity of the input word vectors of the two sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per calculation, and that output is used as the input of the next calculation. Such single-threaded computation is likely to cause errors to accumulate. We therefore introduce multiple attention mechanisms and assign weights among them, that is, we search for the optimal solution across several calculation processes so as to achieve the best translation quality.
Disclosure of Invention
Therefore, in order to overcome the above-mentioned shortcomings, the present invention provides a weight distribution method based on multiple attention mechanisms of a transformer. The method is applied to a transformer framework model based on the attention mechanism and comprises the following steps: the input to the attention mechanism is the word vectors of the target and source languages, and the output is an alignment tensor. Multiple alignment tensors can be produced by using multiple attention mechanism functions, and each output differs because of random parameter variations in the calculation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism, and each has its own outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined through a regularization calculation so as to approach the optimal output.
The invention is realized as follows: a weight distribution method based on multiple attention mechanisms of a transformer is constructed, the method being applied to a transformer model based on the attention mechanism and characterized in that it comprises the following steps:
Step 1: in the transformer model, select the model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;
Step 3: normalize the output of each model and calculate the center point of the outputs (the point closest to all values). Using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and calculate the change of the loss function compared with the previous training iteration. If the loss function decreases, increase the proportion of the δ entries closest to the center point; if the loss function increases, increase the proportion of the δ entries farthest from the center point. Throughout the process the rule δ1 + δ2 + ... + δi = 1 is strictly observed;
Step 5: iterate the calculation multiple times and finally determine the optimal weight sequence δ.
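As an illustration of steps 2 and 3 above, the following is a minimal Python sketch of the weighted combination of attention outputs; the array shapes, the number of models and the helper names are illustrative assumptions and are not prescribed by the patent.

```python
import numpy as np

def init_weights(num_models, rng=np.random.default_rng(0)):
    """Step 2: random initial weights delta_1..delta_i that sum to 1."""
    delta = rng.random(num_models)
    return delta / delta.sum()

def combine_outputs(outputs, delta):
    """Step 3: fin_out = delta_1*O_1 + delta_2*O_2 + ... + delta_i*O_i."""
    stacked = np.stack(outputs, axis=0)          # (num_models, ...) alignment tensors
    return np.tensordot(delta, stacked, axes=1)  # weighted sum over the model axis

# Example with three hypothetical attention outputs of shape (seq_len, d_model)
outputs = [np.random.randn(5, 8) for _ in range(3)]
delta = init_weights(len(outputs))
fin_out = combine_outputs(outputs, delta)
print(round(float(delta.sum()), 6), fin_out.shape)  # 1.0 (5, 8)
```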
The invention has the following advantages. The invention discloses a weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer framework model based on the attention mechanism. The input to the attention mechanism is the word vectors of the target and source languages, and the output is an alignment tensor. Multiple alignment tensors can be produced by using multiple attention mechanism functions, and each output differs because of random parameter variations in the calculation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism and the local attention mechanism, each with its own outputs and characteristics. All attention mechanism models are put into operation, and their outputs are combined through a regularization calculation to approach the optimal output, applying the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model: if one attention model performs particularly well in experiments, its weight is increased to enlarge its influence on the final output, thereby improving the translation quality.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
As an improvement, the invention provides a weight distribution method based on multiple attention mechanisms of a transformer. The method is applied to a transformer framework model based on the attention mechanism.
Transformer framework introduction:
The Encoder consists of 6 identical layers, each containing two sub-layers: the first sub-layer is a multi-head attention layer, followed by a simple fully connected layer. Each sub-layer is wrapped with a residual connection and layer normalization.
The Decoder also consists of 6 identical layers, but each layer differs from the encoder layer in that it contains three sub-layers: a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point is masking, which prevents future output words from being used during training.
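For reference, the following is a minimal PyTorch-style sketch of one encoder layer as described above (multi-head attention followed by a fully connected sub-layer, each with a residual connection and layer normalization); the dimensions and the class name are illustrative assumptions rather than values fixed by the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head attention + fully connected sub-layer, each with residual + LayerNorm."""
    def __init__(self, d_model=512, num_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.self_attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)            # residual connection + normalization
        x = self.norm2(x + self.ff(x))          # fully connected sub-layer + residual + normalization
        return x

# Example: a batch of 2 sequences of length 5 with d_model = 512
x = torch.randn(2, 5, 512)
print(EncoderLayer()(x).shape)  # torch.Size([2, 5, 512])
```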
Attention model:
The encoder-decoder model, although very classical, is also very limited. A major limitation is that the only link between encoding and decoding is a fixed-length semantic vector C; that is, the encoder compresses the information of the whole sequence into a vector of fixed length. This has two disadvantages: the semantic vector cannot fully represent the information of the whole sequence, and the information carried by the earlier inputs is diluted by the later inputs. The longer the input sequence, the more severe this phenomenon becomes. As a result, the decoder does not obtain enough information about the input sequence from the start, which compromises accuracy.
To solve the above problem, the attention model was proposed about a year after the appearance of Seq2Seq. When the model generates an output, it produces an attention range that indicates which parts of the input sequence should be focused on for the next output, then generates that output according to the focused region, and repeats this process. Attention has certain similarities with human behaviour: when reading a sentence, a person usually focuses attention only on the informative words rather than on every word, i.e. the attention weight assigned to each word differs. The attention model increases the training difficulty of the model but improves the quality of the generated text.
First, the semantic vector at the current time step is generated:
s_t = tanh(W[s_{t-1}, y_{t-1}])
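Purely as an illustration of this update (with made-up dimensions, since the patent does not specify them), the computation can be sketched as:

```python
import numpy as np

d = 4                                  # hypothetical hidden size
W = np.random.randn(d, 2 * d)          # learned parameter matrix
s_prev = np.random.randn(d)            # previous decoder state s_{t-1}
y_prev = np.random.randn(d)            # embedding of the previous output y_{t-1}

# s_t = tanh(W [s_{t-1}, y_{t-1}]): concatenate, project, squash
s_t = np.tanh(W @ np.concatenate([s_prev, y_prev]))
print(s_t.shape)  # (4,)
```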
Second, the hidden-layer information is transferred and the prediction is made:
many attention mechanism models have been proposed, such as self-attention mechanism, multi-head attention mechanism, total attention mechanism, local attention mechanism, etc., and each different attention mechanism has different outputs and characteristics.
The improvement made here is a modification of the attention function.
All attention mechanism models are put into operation, and their outputs are combined through a regularization calculation to approach the optimal output, applying the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model. The regularization calculation ensures that the obtained values do not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: in the transformer model, select the model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;
Step 3: normalize the output of each model and calculate the center point of the outputs (the point closest to all values). Using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output.
Step 4: substitute the final output into the subsequent operations and calculate the change of the loss function compared with the previous training iteration. If the loss function decreases, increase the proportion of the δ entries closest to the center point; if the loss function increases, increase the proportion of the δ entries farthest from the center point. Throughout the process the rule δ1 + δ2 + ... + δi = 1 is strictly observed.
Step 5: iterate the calculation multiple times and finally determine the optimal weight sequence δ.
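The sketch below gives one possible Python reading of steps 3 to 5; the center-point definition (mean of the normalized outputs), the adjustment step size and the placeholder loss are assumptions made for illustration, since the patent does not fix them.

```python
import numpy as np

def update_weights(delta, outputs, loss, prev_loss, step=0.05):
    """One iteration of step 4: shift weight toward or away from the center point, keeping sum(delta) == 1."""
    normalized = [o / np.linalg.norm(o) for o in outputs]          # normalized model outputs
    center = np.mean(normalized, axis=0)                           # assumed center point
    dist = np.array([np.linalg.norm(o - center) for o in normalized])
    target = dist.argmin() if loss < prev_loss else dist.argmax()  # closest vs. farthest model
    delta = delta.copy()
    delta[target] += step
    return delta / delta.sum()                                     # re-impose delta_1 + ... + delta_i = 1

# Hypothetical use inside a training loop (step 5):
#   fin_out = np.tensordot(delta, np.stack(outputs), axes=1)       # step 3: weighted combination
#   loss = compute_loss(fin_out, batch)                            # placeholder loss function
#   delta = update_weights(delta, outputs, loss, prev_loss)        # step 4: adjust the weights
#   prev_loss = loss
```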
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (1)
1. A weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer model based on the attention mechanism, characterized in that the method comprises the following steps:
Step 1: in the transformer model, select the model outputs that perform better for the given application scenario.
Step 2: initialize the weight sequence δ; on the first calculation the weights are random numbers subject to δ1 + δ2 + ... + δi = 1;
Step 3: normalize the output of each model and calculate the center point of the outputs (the point closest to all values). Using the formula fin_out = δ1·O1 + δ2·O2 + δ3·O3 + ... + δi·Oi, calculate the best matching value as the final output, where δ1 + δ2 + ... + δi = 1, δi are the weight parameters we set, and Oi is the output of each attention model;
Step 4: substitute the final output into the subsequent operations and calculate the change of the loss function compared with the previous training iteration. If the loss function decreases, increase the proportion of the δ entries closest to the center point; if the loss function increases, increase the proportion of the δ entries farthest from the center point. Throughout the process the rule δ1 + δ2 + ... + δi = 1 is strictly observed;
Step 5: iterate the calculation multiple times and finally determine the optimal weight sequence δ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110688860A true CN110688860A (en) | 2020-01-14 |
CN110688860B CN110688860B (en) | 2024-02-06 |
Family
ID=69110821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910924914.XA Active CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110688860B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112381581A (en) * | 2020-11-17 | 2021-02-19 | 东华理工大学 | Advertisement click rate estimation method based on improved Transformer |
CN112381581B (en) * | 2020-11-17 | 2022-07-08 | 东华理工大学 | Advertisement click rate estimation method based on improved Transformer |
CN112992129A (en) * | 2021-03-08 | 2021-06-18 | 中国科学技术大学 | Attention-keeping mechanism monotonicity keeping method in voice recognition task |
CN113505193A (en) * | 2021-06-01 | 2021-10-15 | 华为技术有限公司 | Data processing method and related equipment |
WO2022253074A1 (en) * | 2021-06-01 | 2022-12-08 | 华为技术有限公司 | Data processing method and related device |
Also Published As
Publication number | Publication date |
---|---|
CN110688860B (en) | 2024-02-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110413785B (en) | Text automatic classification method based on BERT and feature fusion | |
CN111382582B (en) | Neural machine translation decoding acceleration method based on non-autoregressive | |
CN110222349B (en) | Method and computer for deep dynamic context word expression | |
CN108153913B (en) | Training method of reply information generation model, reply information generation method and device | |
CN111274375B (en) | Multi-turn dialogue method and system based on bidirectional GRU network | |
CN110688860A (en) | Weight distribution method based on multiple attention mechanisms of transducer | |
CN109522403A (en) | A kind of summary texts generation method based on fusion coding | |
CN105279552A (en) | Character based neural network training method and device | |
CN110032638A (en) | A kind of production abstract extraction method based on coder-decoder | |
CN115841119B (en) | Emotion cause extraction method based on graph structure | |
CN112560456A (en) | Generation type abstract generation method and system based on improved neural network | |
CN114691858B (en) | Improved UNILM digest generation method | |
CN115860054A (en) | Sparse codebook multiple access coding and decoding system based on generation countermeasure network | |
CN110717342B (en) | Distance parameter alignment translation method based on transformer | |
CN112949255A (en) | Word vector training method and device | |
CN113297374B (en) | Text classification method based on BERT and word feature fusion | |
CN112287641B (en) | Synonym sentence generating method, system, terminal and storage medium | |
CN110717343B (en) | Optimal alignment method based on transformer attention mechanism output | |
CN112528168A (en) | Social network text emotion analysis method based on deformable self-attention mechanism | |
Wang et al. | V-A3tS: A rapid text steganalysis method based on position information and variable parameter multi-head self-attention controlled by length | |
CN110674647A (en) | Layer fusion method based on Transformer model and computer equipment | |
Tian et al. | An online word vector generation method based on incremental huffman tree merging | |
CN113469260B (en) | Visual description method based on convolutional neural network, attention mechanism and self-attention converter | |
CN113077785B (en) | End-to-end multi-language continuous voice stream voice content identification method and system | |
CN115167863A (en) | Code completion method and device based on code sequence and code graph fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||