CN110598223A - Neural machine translation inference acceleration method from coarse granularity to fine granularity - Google Patents


Info

Publication number
CN110598223A
CN110598223A (application CN201910889781.7A)
Authority
CN
China
Prior art keywords
layer
attention
machine translation
decoding
self
Prior art date
Legal status
Withdrawn
Application number
CN201910889781.7A
Other languages
Chinese (zh)
Inventor
Du Quan
Zhu Jingbo
Xiao Tong
Zhang Chunliang
Current Assignee
SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Original Assignee
SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd filed Critical SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Priority to CN201910889781.7A priority Critical patent/CN110598223A/en
Publication of CN110598223A publication Critical patent/CN110598223A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G06N5/041: Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a coarse-grained to fine-grained neural machine translation inference acceleration method comprising the following steps: constructing a multi-layer neural machine translation model, generating a machine translation vocabulary, and obtaining the model parameters after training converges; calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the multi-layer neural machine translation model; calculating the information content of the decoder self-attention weights; calculating the information content of the encoder-decoder attention weights and the compression ratio of each layer's encoder-decoder attention weight transformation matrix; and modifying the parameters of the multi-layer neural machine translation model and retraining on the training set, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference. The method dynamically adjusts the size of the space representing linguistic information in the attention mechanism, so that model inference is accelerated while the translation quality does not change significantly.

Description

Neural machine translation inference acceleration method from coarse granularity to fine granularity
Technical Field
The invention relates to a neural machine translation inference acceleration technology, in particular to a coarse-grained to fine-grained neural machine translation inference acceleration method.
Background
Machine Translation (MT) is the study of using electronic computers to translate between natural languages. Generally, it is the process of converting one natural language (the source language) into another natural language (the target language) using a computer. Machine translation has long been recognized as one of the ultimate techniques for solving the translation problem. For example, the Chinese government has incorporated natural language understanding, including research on machine translation technology, into the National Medium- and Long-Term Program for Science and Technology Development. Google Translate reportedly serves more than 200 million users worldwide each day and performs about one billion translations a day, a daily volume equivalent to one million books and exceeding what the world's professional translators can translate in a year. All of this reflects the great value and application prospects of machine translation.
Machine translation methods fall into two types: rule-based machine translation and corpus-based machine translation. Corpus-based machine translation can in turn be divided into example-based machine translation, statistical machine translation, and neural machine translation. Early work relied primarily on rules. As research progressed, however, rule-based methods gradually exposed problems such as the limited coverage of manually written rules, conflicts as the number of rules grew, and difficulty in extending to new languages. Later example-based approaches alleviated these problems to some extent but did not solve them fundamentally.
The breakthrough in machine translation began in the early 1990s, when the concept of statistical machine translation was proposed at IBM and AT&T. This approach completely abandons reliance on manually written rules and treats translation as the problem of finding the most probable translation. Building a statistical machine translation system requires only bilingual and monolingual data together with manually defined translation features. Robustness and scalability improved greatly, and the approach showed clear advantages on many translation tasks. However, statistical machine translation still relies on feature engineering over large corpora, and it assumes that the translation process has a latent structure, which limits the representational capacity of the model.
In addition, researchers have proposed a machine translation method based on deep learning, known as neural machine translation. It models the translation problem directly with a neural network and learns the model end to end, requiring no manual feature design in the entire process.
Compared with earlier statistics-based methods, neural machine translation systems achieve higher translation quality, and many researchers now study machine translation tasks with this approach. However, neural networks by their nature involve a large amount of matrix computation and therefore consume substantial time and computing resources. This problem is especially significant in practical machine translation systems, which generally require strict response times, so the inference speed of a neural machine translation system is critical to its practicality. Optimizing the speed of a conventional neural machine translation system is thus an important problem.
Neural machine translation systems based on the self-attention mechanism pass information directly between words at different positions; their shorter information-transmission paths are an advantage, and they have attracted attention in many similar systems. Such models can more fully represent the complex relationships between words at different positions in a sequence. The central idea is to compute the relevance between words at any positions of a source or target sentence and to use that relevance as the degree of importance when fusing the information of different words or segments, finally obtaining a semantic representation of the source or target.
Although attention-based models achieve high-quality translations, the attention mechanism must compute the relevance of words between two sentence segments, which involves a large number of matrix operations; this takes up much of the inference time, and the inference speed of such machine translation methods struggles to meet real-time response requirements in practical use. Researchers have noted that the many attention operations in this architecture account for up to 63.99% of total inference time, so accelerating attention operations can effectively reduce the time consumed by model inference. At present, a multi-layer neural machine translation model uses the same amount of computation, i.e. the same calculation granularity, for every layer's attention weights. In reality, different layers play different roles, so a large amount of unnecessary computation, i.e. coarse-grained computation, is performed during inference. If some index could remove the useless computation in each layer, fine-grained computation would be achieved and the inference process accelerated.
However, no method has yet been reported that accelerates model inference based on information theory and can meet real-time response requirements.
Disclosure of Invention
Aiming at the deficiency of the prior art that the inference speed of machine translation methods struggles to meet real-time response requirements in practical use, the invention provides a coarse-grained to fine-grained neural machine translation inference acceleration method that can improve real-time response speed on top of the latest fast-inference implementations, with almost no reduction in model performance.
In order to solve the above technical problems, the invention adopts the following technical scheme:
The invention relates to a coarse-grained to fine-grained neural machine translation inference acceleration method, comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3)-5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
In step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, and every layer of the model is computed with an attention mechanism. When multi-head attention is used, each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
In step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
Calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
In step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compression ratio of the encoder self-attention weight transformation matrix is not 0;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
in step 4), according to the information amount of the self-attention weight of the decoding end, selecting the maximum value and the minimum value to calculate the compression ratio of the self-attention weight transformation matrix of each layer, and the specific steps are as follows:
401) selecting a maximum value maxDc and a minimum value minDnc from the information quantity indexes of the layer with the decoding self attention;
402) defining a lower bound aDnc of a compression ratio of a decoded self-attention weight transformation matrix, adjusting according to different tasks, and if the aDnc is 0, reducing minDnc to ensure that the compressed self-attention weight transformation matrix is not 0;
403) calculating the reduction ratio of the decoding end for different layers of the layer with decoding self attention to obtain the dimension d 'after reduction'kWhere H is the amount of information calculated for each layer and f (·) is a clipping mapping function.
In step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
The invention has the following advantages:
1. The invention accelerates inference by improving the execution efficiency of attention operations through a lighter-weight attention network. Starting from the observation that extracting attention weights in spaces of identical size creates a degree of redundancy in the information space, it provides a coarse-grained to fine-grained neural machine translation acceleration method that uses the information content of each layer's attention weights in an existing model to dynamically adjust the size of the space representing linguistic information in the attention mechanism, accelerating model inference while ensuring that the translation quality does not change significantly.
2. The invention addresses the large amount of time spent computing attention and overcomes deficiencies such as the redundant computation present in attention-based neural machine translation inference.
3. The method is simple and effective, does not conflict with other inference methods, and can further improve speed on top of the fastest existing inference techniques.
Drawings
FIG. 1 is a diagram of the attention mechanism used in prior-art neural machine translation;
FIG. 2 is a plot of the attention weights output by different layers of a multi-layer neural machine translation model according to the present invention;
FIG. 3 is a diagram of the information content of different encoder layers in attention-based neural machine translation according to the present invention;
FIG. 4 is a schematic diagram of the basic coarse-grained to fine-grained idea of the present invention;
FIG. 5 is a diagram of the information content of different encoder-decoder attention layers in attention-based neural machine translation according to the present invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention relates to a coarse-grained to fine-grained neural machine translation inference acceleration method, which is characterized by comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3), 4) and 5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
The invention optimizes the decoding speed of attention-based neural machine translation systems from the perspective of using a lighter-weight attention network, aiming to improve the inference speed of the translation system at the cost of only a small performance loss and to strike a balance between performance and speed.
In step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, and every layer of the model is computed with an attention mechanism. When multi-head attention is used, each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
In this step, bilingual sentence pairs for the two languages are obtained and an attention-based neural machine translation model is built. The attention calculation process is shown in fig. 1: the lower-layer input is transformed into the Q, K and V matrices, which are processed by self-attention and by encoder-decoder attention. Before inference, the model parameters must be trained to convergence on the training set. The machine translation model consists of an encoder and a decoder, the two parts on the left of fig. 1, built mainly on the attention mechanism computed by the formula above, i.e. the matrix operations on the right of fig. 1. To speed up computation, multi-head attention is typically used, with the input Q, K and V being linear transformations of the source-language word embeddings or of the lower layer's stacked output. QK^T in effect computes the correlation between any two positions in the source language; d_k is the size of each head's dimension, and the denominator √d_k scales the correlation into a reasonable real-number range. softmax(·) normalizes over the source-language positions, and row i of the result holds the correlation weights of position i with every other position; multiplying by V yields a weighted sum of all position vectors. This process uses no recurrent or convolutional units and can be parallelized, which speeds up computation.
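As a concrete illustration of the per-head computation described above, here is a minimal NumPy sketch of scaled dot-product attention; the function name and array shapes are illustrative choices, not taken from the patent.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (seq_len, d_k) matrices obtained by linearly
    transforming the word embeddings (first layer) or the
    lower layer's stacked output (higher layers).
    """
    d_k = K.shape[-1]
    # Correlation of every position with every other position,
    # scaled by sqrt(d_k) into a reasonable real-number range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: row i holds the attention weights of
    # position i over all positions.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of all position vectors.
    return weights @ V, weights
```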
In step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
Calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
These steps mainly quantify the information content of each layer and provide theoretical guidance for the later compression.
Step 201) vectorizes the text information: through a numerical transformation, each word in the vocabulary is mapped to a vector, so that vectors correspond one-to-one with words numerically.
Step 202) is the attention weight calculation. The information representation matrices Q and K are transformed from the lower-layer output; this operation is required for every attention operation at the encoder and in the decoding part, and it should be noted that for the encoder-decoder attention the K matrix uses the encoder output. For example, when a source sentence meaning "He calls for intensive negotiations in the next few weeks" is translated into the target sentence "He called for intensive negotiations over the next few weeks", the attention weights used when decoding the word "weeks" differ greatly across layers. The attention weight curves for "weeks" are shown in fig. 2.
Step 203) is the information content calculation. The attention mechanism comprises two main stages: calculating the attention weights and fusing the information. The attention weight calculation captures the positions of the language segments most attended to in the current state, and the information entropy is used as the index measuring the information content contained in the attention weights of different layers. In information theory, information entropy is often used to measure the expected amount of information contained in an event: the probability distribution of events and the information content of each event form a random variable, and the expectation of this random variable is the average amount of information (i.e. the entropy) generated by the distribution.
The invention analyzes the information content of the attention weights of different layers in a multi-layer neural machine translation model and finds that it differs markedly across layers. As shown in fig. 3, for the self-attention operations in the encoder, the information entropy as a whole tends to increase with the layer number. Nevertheless, common translation models use Q and K of the same dimension to calculate the attention weight of every layer, which causes information redundancy: the calculation granularity is coarse, and unnecessary computation is performed during inference. Fig. 3 shows that different layers carry different amounts of information, which demonstrates the redundancy in the calculation granularity, i.e. that the current granularity is coarse. Numerically, the difference between the highest and lowest information content is small: because the weights over all positions are normalized, each position's weight is low when the sentence is long, so even significantly different attention distributions across layers yield only small differences in the computed information content.
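As a concrete illustration of steps 203) and 204), the sketch below computes the entropy of attention weight distributions and averages it over heads, positions and validation-set sentences to obtain a layer's information index H; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def attention_entropy(weights):
    """Mean entropy of one sentence's attention weights for a layer.

    weights: array of shape (num_heads, seq_len, seq_len); each row
    is a normalized attention distribution S_m over positions.
    """
    eps = 1e-12  # guard against log(0) for zero weights
    ent = -(weights * np.log(weights + eps)).sum(axis=-1)
    return ent.mean()  # average over query positions and heads (step 204)

def layer_information_index(weights_per_sentence):
    """Final information index H of a layer: the entropy averaged
    over all sentences of the validation (check) set."""
    return float(np.mean([attention_entropy(w) for w in weights_per_sentence]))
```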
In step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compressed encoder self-attention weight transformation matrix does not collapse to 0; the encoder's lower bound should not be made too small;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
step 301) selects the maximum value and the minimum value of the information amount from each layer of the encoding end as the reference of the reduction ratio, because different model parameters of different tasks are different, a dynamic value needs to be calculated according to different situations to serve as the reference of the reduction, and meanwhile, the action of attention weight of each layer needs to be considered during the reduction, so that the reference value is directly selected from the different layers.
Step 302) chooses the reduction lower bound aEnc. For the encoder this lower bound should not be set too small: in state-of-the-art implementations, the encoder information for a sentence is computed once and stored until inference ends, so encoder computation does not occupy a large proportion of the total, and even a very aggressive reduction would not increase speed significantly.
Step 303) calculates the layer's reduction ratio from the computed maxEnc, minEnc, aEnc and H. The importance of each layer, analyzed earlier through its information content, is reflected in the reduction, which safeguards model performance. The mapping function is designed so that it can be adjusted dynamically per task: a hyperparameter sets a lower bound on the reduction ratio, giving priority to quality on tasks with higher quality requirements.
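The patent does not spell out a closed form for the clipping mapping function f(·). As one hedged reading consistent with steps 301)-303), the Python sketch below assumes f linearly interpolates the reduction ratio between the lower bound and 1 over the observed [minEnc, maxEnc] range and clips it at the lower bound; the function and argument names are illustrative.

```python
def compressed_dim(H, H_min, H_max, a, d_k):
    """Hedged sketch of the clipping mapping f: map a layer's
    information index H to a reduced head dimension d'_k.

    Assumption (not stated in the patent): the reduction ratio is a
    linear interpolation over [H_min, H_max], clipped below at `a`,
    so the least-informative layer keeps about a*d_k dimensions and
    the most-informative layer keeps the full d_k.
    """
    if H_max == H_min:          # degenerate case: all layers equal
        ratio = 1.0
    else:
        ratio = (H - H_min) / (H_max - H_min)
    ratio = max(a, ratio)       # clip at the lower bound
    return max(1, round(ratio * d_k))
```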
In step 4), according to the information content of the decoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
401) selecting the maximum value maxDec and the minimum value minDec among the information content indices of the decoder self-attention layers;
402) defining a lower bound aDec on the compression ratio of the decoder self-attention weight transformation matrix, adjusted per task; if aDec is 0, minDec is reduced to ensure that the compressed decoder self-attention weight transformation matrix does not collapse to 0;
403) calculating the reduction ratio of each decoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
Step 4) is similar to step 3), except that the reduction lower bound aDec can be set lower to obtain more acceleration, because the decoder accounts for a larger share of inference time; compared with the encoder, the decoder's computation is more redundant. The overall reduction process is shown in fig. 4: the original 5-dimensional calculation space is compressed to 3 dimensions, reducing the amount of computation while yielding a very similar attention weight distribution. After the calculation granularity is compressed and the computation reduced, inference is therefore faster with no obvious performance loss.
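Continuing the hedged sketch above, the following hypothetical usage shows how a 6-layer decoder's head dimensions might shrink when a lower bound aDec smaller than the encoder's is chosen; the per-layer information indices are invented for illustration, not measured values.

```python
# Hypothetical per-layer information indices H for a 6-layer decoder;
# real values come from step 2) measured on the validation set.
H_per_layer = [2.1, 2.4, 2.9, 3.3, 3.6, 3.8]
H_min, H_max = min(H_per_layer), max(H_per_layer)

a_dec = 0.25   # decoder lower bound; set below the encoder's, since
d_k = 64       # the decoder dominates inference time

dims = [compressed_dim(H, H_min, H_max, a_dec, d_k) for H in H_per_layer]
print(dims)    # [16, 16, 30, 45, 56, 64]: low-information layers shrink most
```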
In step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
Step 5) is similar to steps 3) and 4). The encoder-decoder attention weights are characterized by larger information content in the lower layers, as shown in fig. 5, and by a larger gap between layers, so under the earlier procedure the dimensions of the higher, lower-information layers could become very small and cause a larger performance loss. In that case minEncDec needs to be reduced so that the dimension of layers with little information does not approach 0, avoiding a rapid drop in performance.
The invention discloses a coarse-grained to fine-grained neural machine translation inference acceleration method whose main aim is to accelerate inference by improving the execution efficiency of attention operations through a lighter-weight attention network. Starting from the observation that extracting attention weights in spaces of identical size causes a degree of redundancy in the information space, the method uses the information content of each layer's attention weights in an existing model to dynamically adjust the size of the space representing linguistic information in the attention mechanism, accelerating model inference while ensuring that the translation quality does not change significantly.
The present invention recognizes that, when an attention-based neural machine translation model performs inference, computing attention takes a significant amount of time and contains redundant computation. Using a measure of information content, and the observation that the amount of information differs between layers and components (increasing from the bottom layer to the top layer in the self-attention mechanism and decreasing in the encoder-decoder attention mechanism), the invention guides each layer's compression ratio by its information content, thereby accelerating inference. The method is simple and effective, does not conflict with other inference methods, and can further improve speed on top of the fastest existing inference techniques.

Claims (7)

1. A coarse-grained to fine-grained neural machine translation inference acceleration method, characterized by comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3)-5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
2. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, every layer of the model is computed with an attention mechanism, and when multi-head attention is used each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
3. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
4. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
5. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compression ratio of the encoder self-attention weight transformation matrix is not 0;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
6. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 4), according to the information content of the decoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
401) selecting the maximum value maxDec and the minimum value minDec among the information content indices of the decoder self-attention layers;
402) defining a lower bound aDec on the compression ratio of the decoder self-attention weight transformation matrix, adjusted per task; if aDec is 0, minDec is reduced to ensure that the compressed decoder self-attention weight transformation matrix does not collapse to 0;
403) calculating the reduction ratio of each decoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
7. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
CN201910889781.7A 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity Withdrawn CN110598223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910889781.7A CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910889781.7A CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Publications (1)

Publication Number Publication Date
CN110598223A (en) 2019-12-20

Family

ID=68861328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910889781.7A Withdrawn CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Country Status (1)

Country Link
CN (1) CN110598223A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382581A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 One-time pruning compression method in machine translation
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN112395891A (en) * 2020-12-03 2021-02-23 内蒙古工业大学 Chinese-Mongolian translation method combining Bert language model and fine-grained compression

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG YUHAO: "Research on Inference Acceleration Methods for Neural Machine Translation Systems from Coarse Granularity to Fine Granularity", HTTPS://NLPLAB.COM/MEMBERS/XIAOTONG_FILES/2019-CCMT-HAO.PDF *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382581A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 One-time pruning compression method in machine translation
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111382582B (en) * 2020-01-21 2023-04-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580B (en) * 2020-01-21 2023-04-18 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN112395891A (en) * 2020-12-03 2021-02-23 内蒙古工业大学 Chinese-Mongolian translation method combining Bert language model and fine-grained compression

Similar Documents

Publication Publication Date Title
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
CN110598223A (en) Neural machine translation inference acceleration method from coarse granularity to fine granularity
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN107967262B (en) A kind of neural network illiteracy Chinese machine translation method
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN110543640A (en) attention mechanism-based neural machine translation inference acceleration method
US20210141798A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
WO2023160472A1 (en) Model training method and related device
CN109992775B (en) Text abstract generation method based on high-level semantics
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN116701431A (en) Data retrieval method and system based on large language model
CN116097250A (en) Layout aware multimodal pre-training for multimodal document understanding
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN114880461A (en) Chinese news text summarization method combining contrast learning and pre-training technology
CN111090734B (en) Method and system for optimizing machine reading understanding capability based on hierarchical attention mechanism
WO2021218023A1 (en) Emotion determining method and apparatus for multiple rounds of questions and answers, computer device, and storage medium
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN113821635A (en) Text abstract generation method and system for financial field
CN112579739A (en) Reading understanding method based on ELMo embedding and gating self-attention mechanism
CN112257468A (en) Method for improving translation performance of multi-language neural machine
Andriyanov Combining Text and Image Analysis Methods for Solving Multimodal Classification Problems
CN110888944A (en) Attention convolution neural network entity relation extraction method based on multiple convolution window sizes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Du Quan

Inventor before: Du Quan

Inventor before: Zhu Jingbo

Inventor before: Xiao Tong

Inventor before: Zhang Chunliang

WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20191220