CN110598223A - Neural machine translation inference acceleration method from coarse granularity to fine granularity - Google Patents


Info

Publication number
CN110598223A
CN110598223A (application CN201910889781.7A)
Authority
CN
China
Prior art keywords
layer
attention
machine translation
decoding
self
Prior art date
Legal status
Withdrawn
Application number
CN201910889781.7A
Other languages
Chinese (zh)
Inventor
Du Quan
Zhu Jingbo
Xiao Tong
Zhang Chunliang
Current Assignee
SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Original Assignee
SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Priority date
Filing date
Publication date
Application filed by SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd filed Critical SHENYANG YAYI NETWORK TECHNOLOGY Co Ltd
Priority to CN201910889781.7A priority Critical patent/CN110598223A/en
Publication of CN110598223A publication Critical patent/CN110598223A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G06N5/041: Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a coarse-grained to fine-grained neural machine translation inference acceleration method comprising the following steps: constructing a multi-layer neural machine translation model, generating a machine translation vocabulary, and obtaining the model parameters after training converges; calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the multi-layer neural machine translation model; calculating the information content of the decoder self-attention weights; calculating the information content of the encoder-decoder attention weights and the compression ratio of each layer's encoder-decoder attention weight transformation matrix; and modifying the parameters of the multi-layer neural machine translation model and retraining on the training set, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference. The method dynamically adjusts the size of the space representing linguistic information in the attention mechanism, so that model inference is accelerated while the translation quality does not change significantly.

Description

Neural machine translation inference acceleration method from coarse granularity to fine granularity
Technical Field
The invention relates to a neural machine translation inference acceleration technology, in particular to a coarse-grained to fine-grained neural machine translation inference acceleration method.
Background
Machine Translation (MT) is the study of using electronic computers to translate between natural languages. Generally, it is the process of converting one natural language (the source language) into another natural language (the target language) using a computer. Machine translation has long been recognized as one of the ultimate techniques for solving the translation problem. For example, the Chinese government has incorporated natural language understanding, including research on machine translation technology, into the National Medium- and Long-Term Program for Science and Technology Development. Google Translate reportedly serves more than 200 million users worldwide each day and performs about one billion translations a day, a daily volume equivalent to one million books and exceeding what the world's professional translators can translate in a year. All of this reflects the great value and application prospects of machine translation.
Machine translation methods fall into two types: rule-based machine translation and corpus-based machine translation. Corpus-based machine translation can in turn be divided into example-based machine translation, statistical machine translation, and neural machine translation. Early work relied primarily on rules. As research progressed, however, rule-based methods gradually exposed problems such as the limited coverage of manually written rules, conflicts as the number of rules grew, and difficulty in extending to new languages. Later example-based approaches alleviated these problems to some extent but did not solve them fundamentally.
The breakthrough in machine translation began in the early 1990s, when the concept of statistical machine translation was proposed at IBM and AT&T. This approach completely abandons reliance on manually written rules and treats translation as the problem of finding the most probable translation. Building a statistical machine translation system requires only bilingual and monolingual data together with manually defined translation features. Robustness and scalability improved greatly, and the approach showed clear advantages on many translation tasks. However, statistical machine translation still relies on feature engineering over large corpora, and it assumes that the translation process has a latent structure, which limits the representational capacity of the model.
In addition, researchers have proposed a machine translation method based on deep learning, known as neural machine translation. It models the translation problem directly with a neural network and learns the model end to end, requiring no manual feature design in the entire process.
Compared with earlier statistics-based methods, neural machine translation systems achieve higher translation quality, and many researchers now study machine translation tasks with this approach. However, neural networks by their nature involve a large amount of matrix computation and therefore consume substantial time and computing resources. This problem is especially significant in practical machine translation systems, which generally require strict response times, so the inference speed of a neural machine translation system is critical to its practicality. Optimizing the speed of a conventional neural machine translation system is thus an important problem.
Neural machine translation systems based on the self-attention mechanism pass information directly between words at different positions; their shorter information-transmission paths are an advantage, and they have attracted attention in many similar systems. Such models can more fully represent the complex relationships between words at different positions in a sequence. The central idea is to compute the relevance between words at any positions of a source or target sentence and to use that relevance as the degree of importance when fusing the information of different words or segments, finally obtaining a semantic representation of the source or target.
Although attention-based models achieve high-quality translations, the attention mechanism must compute the relevance of words between two sentence segments, which involves a large number of matrix operations; this takes up much of the inference time, and the inference speed of such machine translation methods struggles to meet real-time response requirements in practical use. Researchers have noted that the many attention operations in this architecture account for up to 63.99% of total inference time, so accelerating attention operations can effectively reduce the time consumed by model inference. At present, a multi-layer neural machine translation model uses the same amount of computation, i.e. the same calculation granularity, for every layer's attention weights. In reality, different layers play different roles, so a large amount of unnecessary computation, i.e. coarse-grained computation, is performed during inference. If some index could remove the useless computation in each layer, fine-grained computation would be achieved and the inference process accelerated.
However, no method has yet been reported that accelerates model inference based on information theory and can meet real-time response requirements.
Disclosure of Invention
Aiming at the deficiency of the prior art that the inference speed of machine translation methods struggles to meet real-time response requirements in practical use, the invention provides a coarse-grained to fine-grained neural machine translation inference acceleration method that can improve real-time response speed on top of the latest fast-inference implementations, with almost no reduction in model performance.
In order to solve the above technical problems, the invention adopts the following technical scheme:
The invention relates to a coarse-grained to fine-grained neural machine translation inference acceleration method, comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3)-5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
In step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, and every layer of the model is computed with an attention mechanism. When multi-head attention is used, each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
In step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
Calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
In step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compression ratio of the encoder self-attention weight transformation matrix is not 0;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
in step 4), according to the information amount of the self-attention weight of the decoding end, selecting the maximum value and the minimum value to calculate the compression ratio of the self-attention weight transformation matrix of each layer, and the specific steps are as follows:
401) selecting a maximum value maxDc and a minimum value minDnc from the information quantity indexes of the layer with the decoding self attention;
402) defining a lower bound aDnc of a compression ratio of a decoded self-attention weight transformation matrix, adjusting according to different tasks, and if the aDnc is 0, reducing minDnc to ensure that the compressed self-attention weight transformation matrix is not 0;
403) calculating the reduction ratio of the decoding end for different layers of the layer with decoding self attention to obtain the dimension d 'after reduction'kWhere H is the amount of information calculated for each layer and f (·) is a clipping mapping function.
In step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
The invention has the following advantages:
1. The invention accelerates inference by improving the execution efficiency of attention operations through a lighter-weight attention network. Starting from the observation that extracting attention weights in spaces of identical size creates a degree of redundancy in the information space, it provides a coarse-grained to fine-grained neural machine translation acceleration method that uses the information content of each layer's attention weights in an existing model to dynamically adjust the size of the space representing linguistic information in the attention mechanism, accelerating model inference while ensuring that the translation quality does not change significantly.
2. The invention addresses the large amount of time spent computing attention and overcomes deficiencies such as the redundant computation present in attention-based neural machine translation inference.
3. The method is simple and effective, does not conflict with other inference methods, and can further improve speed on top of the fastest existing inference techniques.
Drawings
FIG. 1 is a diagram of the attention mechanism used in prior-art neural machine translation;
FIG. 2 is a plot of the attention weights output by different layers of a multi-layer neural machine translation model according to the present invention;
FIG. 3 is a diagram of the information content of different encoder layers in attention-based neural machine translation according to the present invention;
FIG. 4 is a schematic diagram of the basic coarse-grained to fine-grained idea of the present invention;
FIG. 5 is a diagram of the information content of different encoder-decoder attention layers in attention-based neural machine translation according to the present invention.
Detailed Description
The invention is further elucidated with reference to the accompanying drawings.
The invention relates to a coarse-grained to fine-grained neural machine translation inference acceleration method, which is characterized by comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3), 4) and 5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
The invention optimizes the decoding speed of attention-based neural machine translation systems from the perspective of using a lighter-weight attention network, aiming to improve the inference speed of the translation system at the cost of only a small performance loss and to strike a balance between performance and speed.
In step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, and every layer of the model is computed with an attention mechanism. When multi-head attention is used, each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
In this step, bilingual sentence pairs for the two languages are obtained and an attention-based neural machine translation model is built. The attention calculation process is shown in fig. 1: the lower-layer input is transformed into the Q, K and V matrices, which are processed by self-attention and by encoder-decoder attention. Before inference, the model parameters must be trained to convergence on the training set. The machine translation model consists of an encoder and a decoder, the two parts on the left of fig. 1, built mainly on the attention mechanism computed by the formula above, i.e. the matrix operations on the right of fig. 1. To speed up computation, multi-head attention is typically used, with the input Q, K and V being linear transformations of the source-language word embeddings or of the lower layer's stacked output. QK^T in effect computes the correlation between any two positions in the source language; d_k is the size of each head's dimension, and the denominator √d_k scales the correlation into a reasonable real-number range. softmax(·) normalizes over the source-language positions, and row i of the result holds the correlation weights of position i with every other position; multiplying by V yields a weighted sum of all position vectors. This process uses no recurrent or convolutional units and can be parallelized, which speeds up computation.
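As a concrete illustration of the per-head computation described above, here is a minimal NumPy sketch of scaled dot-product attention; the function name and array shapes are illustrative choices, not taken from the patent.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """One attention head: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: (seq_len, d_k) matrices obtained by linearly
    transforming the word embeddings (first layer) or the
    lower layer's stacked output (higher layers).
    """
    d_k = K.shape[-1]
    # Correlation of every position with every other position,
    # scaled by sqrt(d_k) into a reasonable real-number range.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax: row i holds the attention weights of
    # position i over all positions.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of all position vectors.
    return weights @ V, weights
```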
In step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
Calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
These steps mainly quantify the information content of each layer and provide theoretical guidance for the later compression.
Step 201) vectorizes the text information: through a numerical transformation, each word in the vocabulary is mapped to a vector, so that vectors correspond one-to-one with words numerically.
Step 202) is the attention weight calculation. The information representation matrices Q and K are transformed from the lower-layer output; this operation is required for every attention operation at the encoder and in the decoding part, and it should be noted that for the encoder-decoder attention the K matrix uses the encoder output. For example, when a source sentence meaning "He calls for intensive negotiations in the next few weeks" is translated into the target sentence "He called for intensive negotiations over the next few weeks", the attention weights used when decoding the word "weeks" differ greatly across layers. The attention weight curves for "weeks" are shown in fig. 2.
Step 203) is the information content calculation. The attention mechanism comprises two main stages: calculating the attention weights and fusing the information. The attention weight calculation captures the positions of the language segments most attended to in the current state, and the information entropy is used as the index measuring the information content contained in the attention weights of different layers. In information theory, information entropy is often used to measure the expected amount of information contained in an event: the probability distribution of events and the information content of each event form a random variable, and the expectation of this random variable is the average amount of information (i.e. the entropy) generated by the distribution.
The invention analyzes the information content of the attention weights of different layers in a multi-layer neural machine translation model and finds that it differs markedly across layers. As shown in fig. 3, for the self-attention operations in the encoder, the information entropy as a whole tends to increase with the layer number. Nevertheless, common translation models use Q and K of the same dimension to calculate the attention weight of every layer, which causes information redundancy: the calculation granularity is coarse, and unnecessary computation is performed during inference. Fig. 3 shows that different layers carry different amounts of information, which demonstrates the redundancy in the calculation granularity, i.e. that the current granularity is coarse. Numerically, the difference between the highest and lowest information content is small: because the weights over all positions are normalized, each position's weight is low when the sentence is long, so even significantly different attention distributions across layers yield only small differences in the computed information content.
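As a concrete illustration of steps 203) and 204), the sketch below computes the entropy of attention weight distributions and averages it over heads, positions and validation-set sentences to obtain a layer's information index H; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def attention_entropy(weights):
    """Mean entropy of one sentence's attention weights for a layer.

    weights: array of shape (num_heads, seq_len, seq_len); each row
    is a normalized attention distribution S_m over positions.
    """
    eps = 1e-12  # guard against log(0) for zero weights
    ent = -(weights * np.log(weights + eps)).sum(axis=-1)
    return ent.mean()  # average over query positions and heads (step 204)

def layer_information_index(weights_per_sentence):
    """Final information index H of a layer: the entropy averaged
    over all sentences of the validation (check) set."""
    return float(np.mean([attention_entropy(w) for w in weights_per_sentence]))
```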
In step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compressed encoder self-attention weight transformation matrix does not collapse to 0; the encoder's lower bound should not be made too small;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
step 301) selects the maximum value and the minimum value of the information amount from each layer of the encoding end as the reference of the reduction ratio, because different model parameters of different tasks are different, a dynamic value needs to be calculated according to different situations to serve as the reference of the reduction, and meanwhile, the action of attention weight of each layer needs to be considered during the reduction, so that the reference value is directly selected from the different layers.
Step 302) chooses the reduction lower bound aEnc. For the encoder this lower bound should not be set too small: in state-of-the-art implementations, the encoder information for a sentence is computed once and stored until inference ends, so encoder computation does not occupy a large proportion of the total, and even a very aggressive reduction would not increase speed significantly.
Step 303) calculates the layer's reduction ratio from the computed maxEnc, minEnc, aEnc and H. The importance of each layer, analyzed earlier through its information content, is reflected in the reduction, which safeguards model performance. The mapping function is designed so that it can be adjusted dynamically per task: a hyperparameter sets a lower bound on the reduction ratio, giving priority to quality on tasks with higher quality requirements.
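The patent does not spell out a closed form for the clipping mapping function f(·). As one hedged reading consistent with steps 301)-303), the Python sketch below assumes f linearly interpolates the reduction ratio between the lower bound and 1 over the observed [minEnc, maxEnc] range and clips it at the lower bound; the function and argument names are illustrative.

```python
def compressed_dim(H, H_min, H_max, a, d_k):
    """Hedged sketch of the clipping mapping f: map a layer's
    information index H to a reduced head dimension d'_k.

    Assumption (not stated in the patent): the reduction ratio is a
    linear interpolation over [H_min, H_max], clipped below at `a`,
    so the least-informative layer keeps about a*d_k dimensions and
    the most-informative layer keeps the full d_k.
    """
    if H_max == H_min:          # degenerate case: all layers equal
        ratio = 1.0
    else:
        ratio = (H - H_min) / (H_max - H_min)
    ratio = max(a, ratio)       # clip at the lower bound
    return max(1, round(ratio * d_k))
```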
In step 4), according to the information content of the decoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
401) selecting the maximum value maxDec and the minimum value minDec among the information content indices of the decoder self-attention layers;
402) defining a lower bound aDec on the compression ratio of the decoder self-attention weight transformation matrix, adjusted per task; if aDec is 0, minDec is reduced to ensure that the compressed decoder self-attention weight transformation matrix does not collapse to 0;
403) calculating the reduction ratio of each decoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
Step 4) is similar to step 3), except that the reduction lower bound aDec can be set lower to obtain more acceleration, because the decoder accounts for a larger share of inference time; compared with the encoder, the decoder's computation is more redundant. The overall reduction process is shown in fig. 4: the original 5-dimensional calculation space is compressed to 3 dimensions, reducing the amount of computation while yielding a very similar attention weight distribution. After the calculation granularity is compressed and the computation reduced, inference is therefore faster with no obvious performance loss.
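Continuing the hedged sketch above, the following hypothetical usage shows how a 6-layer decoder's head dimensions might shrink when a lower bound aDec smaller than the encoder's is chosen; the per-layer information indices are invented for illustration, not measured values.

```python
# Hypothetical per-layer information indices H for a 6-layer decoder;
# real values come from step 2) measured on the validation set.
H_per_layer = [2.1, 2.4, 2.9, 3.3, 3.6, 3.8]
H_min, H_max = min(H_per_layer), max(H_per_layer)

a_dec = 0.25   # decoder lower bound; set below the encoder's, since
d_k = 64       # the decoder dominates inference time

dims = [compressed_dim(H, H_min, H_max, a_dec, d_k) for H in H_per_layer]
print(dims)    # [16, 16, 30, 45, 56, 64]: low-information layers shrink most
```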
In step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
Step 5) is similar to steps 3) and 4). The encoder-decoder attention weights are characterized by larger information content in the lower layers, as shown in fig. 5, and by a larger gap between layers, so under the earlier procedure the dimensions of the higher, lower-information layers could become very small and cause a larger performance loss. In that case minEncDec needs to be reduced so that the dimension of layers with little information does not approach 0, avoiding a rapid drop in performance.
The invention discloses a coarse-grained to fine-grained neural machine translation inference acceleration method whose main aim is to accelerate inference by improving the execution efficiency of attention operations through a lighter-weight attention network. Starting from the observation that extracting attention weights in spaces of identical size causes a degree of redundancy in the information space, the method uses the information content of each layer's attention weights in an existing model to dynamically adjust the size of the space representing linguistic information in the attention mechanism, accelerating model inference while ensuring that the translation quality does not change significantly.
The present invention recognizes that, when an attention-based neural machine translation model performs inference, computing attention takes a significant amount of time and contains redundant computation. Using a measure of information content, and the observation that the amount of information differs between layers and components (increasing from the bottom layer to the top layer in the self-attention mechanism and decreasing in the encoder-decoder attention mechanism), the invention guides each layer's compression ratio by its information content, thereby accelerating inference. The method is simple and effective, does not conflict with other inference methods, and can further improve speed on top of the fastest existing inference techniques.

Claims (7)

1. A coarse-grained to fine-grained neural machine translation inference acceleration method, characterized by comprising the following steps:
1) establishing a parallel corpus and an attention-based multi-layer neural machine translation model, generating a machine translation vocabulary from the parallel corpus, dividing the parallel corpus into a training set and a validation set, and training on the training set to obtain the model parameters after training converges;
2) inputting the validation set into the multi-layer neural machine translation model, calculating the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained model, and calculating the information content of each of these attention weights;
3) according to the information content of the encoder self-attention weights, selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder self-attention weight transformation matrix;
4) according to the information content of the decoder self-attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's decoder self-attention weight transformation matrix;
5) according to the information content of the encoder-decoder attention weights calculated in step 2), selecting the maximum and minimum values to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix;
6) modifying the parameters of the multi-layer neural machine translation model according to the three compression ratios calculated in steps 3)-5), and training on the training set again to obtain converged model parameters, thereby realizing coarse-grained to fine-grained acceleration of neural machine translation inference.
2. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 1), the multi-layer neural machine translation model comprises an encoder and a decoder, every layer of the model is computed with an attention mechanism, and when multi-head attention is used each head is computed as:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where softmax(·) is a normalization function; in the first layer the information representation matrices Q, K and V are different linear transformations of the word embeddings, while in every layer above the first they are different linear transformations of the lower layer's stacked output; and d_k is the size of each head's dimension of the K matrix.
3. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 2), the encoder self-attention weights, decoder self-attention weights and encoder-decoder attention weights of different layers of the trained multi-layer neural machine translation model are calculated, with the following specific steps:
201) inputting the validation-set sentences into the multi-layer neural machine translation model, mapping them through the vocabulary, and converting them into vector representations through a series of transformations;
202) transforming the vector representations into the information representation matrices Q and K, and calculating the attention weight of each layer of the model as S_m = s(Q_m, K_m), where s(·) is the attention weight calculation formula softmax(Q_m K_m^T / √d_k) and m denotes the m-th layer of the model.
4. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: calculating the information content of the weights in step 2) means calculating the information entropy, with the following specific steps:
203) calculating the information entropy H(S_m) of the attention weight obtained in step 202), where H(·) denotes the information entropy formula H(X) = Σ_i P(x_i)I(x_i) = -Σ_i P(x_i) ln P(x_i), and P(x_i) is the probability of the random variable X taking the value x_i; when calculating the information content of the attention weight distribution, S_m replaces X as the random variable;
204) when calculating the final information entropy, averaging over the multiple heads, and taking the average over the validation-set sentences as the layer's final information content index H.
5. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 3), according to the information content of the encoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
301) selecting the maximum value maxEnc and the minimum value minEnc among the information content indices of the encoder self-attention layers;
302) defining a lower bound aEnc on the compression ratio of the encoder self-attention weight transformation matrix, adjusted per task; if aEnc is 0, minEnc is reduced to ensure that the compression ratio of the encoder self-attention weight transformation matrix is not 0;
303) calculating the reduction ratio of each encoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
6. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 4), according to the information content of the decoder self-attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's self-attention weight transformation matrix, with the following specific steps:
401) selecting the maximum value maxDec and the minimum value minDec among the information content indices of the decoder self-attention layers;
402) defining a lower bound aDec on the compression ratio of the decoder self-attention weight transformation matrix, adjusted per task; if aDec is 0, minDec is reduced to ensure that the compressed decoder self-attention weight transformation matrix does not collapse to 0;
403) calculating the reduction ratio of each decoder self-attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
7. The coarse-grained to fine-grained neural machine translation inference acceleration method according to claim 1, wherein: in step 5), according to the information content of the encoder-decoder attention weights, the maximum and minimum values are selected to calculate the compression ratio of each layer's encoder-decoder attention weight transformation matrix, with the following specific steps:
501) selecting the maximum value maxEncDec and the minimum value minEncDec among the information content indices of the encoder-decoder attention layers;
502) defining a lower bound aEncDec on the compression ratio of the encoder-decoder attention weight transformation matrix, adjusted per task; if aEncDec is 0, minEncDec is reduced to ensure that the compression ratio of the encoder-decoder attention weight transformation matrix is not 0;
503) calculating the reduction ratio of each encoder-decoder attention layer to obtain the reduced dimension d'_k = f(H), where H is the information content calculated for that layer and f(·) is a clipping mapping function.
CN201910889781.7A 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity Withdrawn CN110598223A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910889781.7A CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910889781.7A CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Publications (1)

Publication Number Publication Date
CN110598223A (en) 2019-12-20

Family

ID=68861328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910889781.7A Withdrawn CN110598223A (en) 2019-09-20 2019-09-20 Neural machine translation inference acceleration method from coarse granularity to fine granularity

Country Status (1)

Country Link
CN (1) CN110598223A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382581A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 One-time pruning compression method in machine translation
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN112395891A (en) * 2020-12-03 2021-02-23 内蒙古工业大学 Chinese-Mongolian translation method combining Bert language model and fine-grained compression

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109472024A (en) * 2018-10-25 2019-03-15 安徽工业大学 A kind of file classification method based on bidirectional circulating attention neural network
CN110134771A (en) * 2019-04-09 2019-08-16 广东工业大学 A kind of implementation method based on more attention mechanism converged network question answering systems

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG YUHAO: "Research on Inference Acceleration Methods for Neural Machine Translation Systems from Coarse Granularity to Fine Granularity", HTTPS://NLPLAB.COM/MEMBERS/XIAOTONG_FILES/2019-CCMT-HAO.PDF *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111382581A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 One-time pruning compression method in machine translation
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111382582B (en) * 2020-01-21 2023-04-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111382580B (en) * 2020-01-21 2023-04-18 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN112395891A (en) * 2020-12-03 2021-02-23 内蒙古工业大学 Chinese-Mongolian translation method combining Bert language model and fine-grained compression

Similar Documents

Publication Publication Date Title
WO2021047286A1 (en) Text processing model training method, and text processing method and apparatus
CN110598223A (en) Neural machine translation inference acceleration method from coarse granularity to fine granularity
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN107967262B (en) A kind of neural network illiteracy Chinese machine translation method
US20210141799A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN110543640A (en) attention mechanism-based neural machine translation inference acceleration method
US20210141798A1 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
WO2023160472A1 (en) Model training method and related device
CN109992775B (en) Text abstract generation method based on high-level semantics
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN116701431A (en) Data retrieval method and system based on large language model
CN116097250A (en) Layout aware multimodal pre-training for multimodal document understanding
CN112101010B (en) Telecom industry OA office automation manuscript auditing method based on BERT
CN114880461A (en) Chinese news text summarization method combining contrast learning and pre-training technology
CN111090734B (en) Method and system for optimizing machine reading understanding capability based on hierarchical attention mechanism
WO2021218023A1 (en) Emotion determining method and apparatus for multiple rounds of questions and answers, computer device, and storage medium
CN109918507B (en) textCNN (text-based network communication network) improved text classification method
CN114020906A (en) Chinese medical text information matching method and system based on twin neural network
CN115759119A (en) Financial text emotion analysis method, system, medium and equipment
CN113821635A (en) Text abstract generation method and system for financial field
CN112579739A (en) Reading understanding method based on ELMo embedding and gating self-attention mechanism
CN112257468A (en) Method for improving translation performance of multi-language neural machine
Andriyanov Combining Text and Image Analysis Methods for Solving Multimodal Classification Problems
CN110888944A (en) Attention convolution neural network entity relation extraction method based on multiple convolution window sizes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
CB03 Change of inventor or designer information

Inventor after: Du Quan

Inventor before: Du Quan

Inventor before: Zhu Jingbo

Inventor before: Xiao Tong

Inventor before: Zhang Chunliang

WW01 Invention patent application withdrawn after publication
WW01 Invention patent application withdrawn after publication

Application publication date: 20191220