CN112699693A - Machine translation method and machine translation device - Google Patents

Machine translation method and machine translation device

Info

Publication number
CN112699693A
CN112699693A (Application CN202110062736.1A)
Authority
CN
China
Prior art keywords
matrix
layer
translated
machine translation
feature extraction
Prior art date
Legal status
Withdrawn
Application number
CN202110062736.1A
Other languages
Chinese (zh)
Inventor
徐成国
杨康
周星杰
王硕
Current Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Original Assignee
Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Minglue Artificial Intelligence Group Co Ltd filed Critical Shanghai Minglue Artificial Intelligence Group Co Ltd
Priority to CN202110062736.1A priority Critical patent/CN112699693A/en
Publication of CN112699693A publication Critical patent/CN112699693A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a machine translation method and a machine translation device. The method includes: obtaining a text to be translated and a target index word bank; converting the text to be translated into an input vector matrix; inputting the input vector matrix into a pre-trained machine translation model to predict, for each word in the text to be translated, the probability of translating it into each index word in the target index word bank, where the machine translation model is a Transformer model containing convolutional layers; and, for each word in the text to be translated, selecting the index word with the highest probability as the translation result of that word, thereby obtaining the target translation result of the text to be translated. In this way, the application reduces the number of model parameters and the amount of computation to a certain extent while preserving real-time machine translation performance: because the parameter count is smaller, training of the feature extraction process is more convenient, and because the amount of computation is smaller, translation speed is improved to a certain extent.

Description

Machine translation method and machine translation device
Technical Field
The application relates to the technical field of machine translation, in particular to a machine translation method and a machine translation device.
Background
With the rapid development of science, technology, and the economy, interconnection among countries around the world has become an irreversible trend, and machine translation has emerged to enable effective communication between different countries at low cost.
Existing machine translation usually relies on the Transformer model, which performs machine translation based on a multi-layer deep network and a multi-head self-attention mechanism. Although its translation performance is high, the multi-layer deep network and the multi-head self-attention mechanism require many parameters, so the resulting model is large and the amount of computation during training tends to be large. Meanwhile, because machine translation is a text generation task, each sentence is translated word by word conditioned on the previously generated sequence, so a model forward pass must be performed at every decoding step, which also results in a large amount of computation.
Disclosure of Invention
In view of this, an object of the present application is to provide a machine translation method and a machine translation apparatus that perform machine translation with a Transformer model containing convolutional layers, reducing the number of model parameters and the amount of computation to a certain extent while preserving real-time machine translation performance. When real-time machine translation is performed with the model of the present application, the smaller parameter count makes training of the feature extraction process more convenient, and the smaller amount of computation improves speed to a certain extent.
In a first aspect, the present application provides a machine translation method, including:
acquiring a text to be translated and a target index word library;
converting the acquired text to be translated into an input vector matrix;
inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in the target index word library, wherein the machine translation model is a Transformer model containing a convolutional layer;
and aiming at each word in the text to be translated, selecting the index word corresponding to the maximum probability as the translation result of the word to obtain the target translation result of the text to be translated.
Preferably, the converting the acquired text to be translated into an input vector matrix includes:
determining a representation input vector of each word in the text to be translated, wherein the representation input vector is obtained according to a word embedding vector and a position embedding vector;
and determining an input vector matrix of the text to be translated based on the obtained representation input vector of each word.
Preferably, the machine translation model comprises an input encoding model and an output decoding model; training the machine translation model by:
acquiring input vector matrix samples corresponding to a preset number of text samples to be translated;
inputting each input vector matrix sample into the input coding model for feature extraction to obtain a target feature extraction matrix;
inputting the obtained target feature extraction matrix into the output decoding model for probability calculation to obtain the probability of translating each word in the text sample to be translated into any index word in the target index word bank;
and when the preset number of text samples to be translated are completely trained, determining that the training of the machine translation model is completed.
Preferably, the input coding model comprises a first multi-headed attention mechanism network, a first network reinforcement layer, a first convolution dimensionality reduction layer and a second network reinforcement layer; training the input coding model by:
inputting the input vector matrix sample into the first multi-head attention mechanism network for calculation to obtain a first feature extraction matrix;
inputting the first feature extraction matrix into the first network strengthening layer for residual connection and normalization calculation to obtain a second feature extraction matrix;
inputting the second feature extraction matrix into the first convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature extraction matrix with the same dimensionality as the first feature extraction matrix;
and inputting the third feature extraction matrix into the second network strengthening layer for residual connection and normalization calculation to obtain a target feature extraction matrix.
Preferably, the inputting the second feature extraction matrix into the first convolution dimension reduction layer for dimension reduction calculation to obtain a third feature extraction matrix having the same dimension as the first feature extraction matrix includes:
carrying out convolution summation on the h-th row vector of the second feature extraction matrix to obtain the value of the h-th bit of the third feature extraction matrix; wherein h is a positive integer not greater than p, and p is the number of dimensions of the first feature extraction matrix.
Preferably, the first multi-headed attention mechanism network comprises three first linear layers, N first multi-headed self-attention mechanism layers, a first stitching layer and a second convolution dimensionality reduction layer; training the first multi-headed attention mechanism network by:
performing linear transformation on the input vector matrix samples to obtain N groups of coding input matrixes, wherein the coding input matrixes comprise a query matrix, a key value matrix and a value matrix, the query matrix, the key value matrix and the value matrix in each group of coding input matrixes are different, and N is an integer greater than or equal to 1;
inputting each group of code input matrixes to a first multi-head self-attention mechanism layer through the first linear layer for attention calculation to obtain N groups of code output vectors;
inputting N groups of coding output vectors into the first splicing layer for splicing to obtain an integrated coding vector;
and inputting the integrated coding vector to the second convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature extraction matrix.
Preferably, the inputting the integrated coding vector into the second convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature extraction matrix includes:
performing convolution summation on the Mth bit array of each coded output vector in the integrated coded vectors to obtain a corresponding value of the Mth bit of the first feature extraction matrix; wherein M is a positive integer not greater than N.
Preferably, the output decoding model comprises a second multi-head attention mechanism network, a third network strengthening layer, a third convolution dimensionality reduction layer, a fourth network strengthening layer and a probability output layer; training the output decoding model by:
inputting the target feature extraction matrix into the second multi-head attention mechanism network for calculation to obtain a first feature transformation matrix;
inputting the first characteristic transformation matrix into the third network strengthening layer for residual connection and normalization calculation to obtain a second characteristic transformation matrix;
inputting the second characteristic transformation matrix into the third convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third characteristic transformation matrix with the same dimensionality as the first characteristic transformation matrix;
inputting the third feature transformation matrix into the fourth network strengthening layer for residual connection and normalization calculation to obtain a target feature transformation matrix;
and inputting the target characteristic conversion matrix into the probability output layer for probability calculation to obtain a probability output matrix representing the probability of translating each word in the text sample to be translated into any index word in the target index word bank.
Preferably, the second multi-head attention mechanism network comprises three second linear layers, N second multi-head self-attention mechanism layers, a second splicing layer and a fourth convolution dimensionality reduction layer; training the second multi-headed attention mechanism network by:
determining N key value matrixes and N value matrixes based on the target feature extraction matrix, wherein N is an integer greater than or equal to 1;
combining N query matrixes obtained by performing linear transformation on the input vector matrix sample, and the determined N key value matrixes and N value matrixes to obtain N groups of decoding input matrixes, wherein the query matrixes, the key value matrixes and the value matrixes in each group of decoding input matrixes are different;
inputting each group of decoding input matrixes to a second multi-head self-attention mechanism layer through the second linear layer for attention calculation to obtain N groups of decoding output vectors;
inputting N groups of decoding output vectors into the second splicing layer for splicing to obtain an integrated decoding vector;
and inputting the integrated decoding vector to the fourth convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature transformation matrix.
In a second aspect, the present application further provides a machine translation apparatus, comprising:
the acquisition module is used for acquiring a text to be translated and a target index word library;
the conversion module is used for converting the acquired text to be translated into an input vector matrix;
the prediction module is used for inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in the target index word bank, wherein the machine translation model is a Transformer model containing a convolutional layer;
and the translation module is used for selecting the index word corresponding to the maximum probability as the translation result of each word in the text to be translated to obtain the target translation result of the text to be translated.
In a third aspect, the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the machine translation method as described above.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the machine translation method as described above.
The application provides a machine translation method and apparatus, wherein the method comprises the following steps: obtaining a text to be translated and a target index word bank; converting the obtained text to be translated into an input vector matrix; inputting the input vector matrix into a pre-trained machine translation model to predict the probability of translating each word in the text to be translated into any index word in the target index word bank, where the machine translation model is a Transformer model containing convolutional layers; and, for each word in the text to be translated, selecting the index word with the highest probability as the translation result of that word to obtain the target translation result of the text to be translated.
Compared with the prior-art approach in which a Transformer model performs machine translation based only on a multi-layer deep network and a multi-head self-attention mechanism, the present application replaces the traditional fully connected layer with point convolution, combining the strong feature extraction capability of the Transformer model with the low parameter count and fast computation of a convolutional neural network, and applies this combination to the field of real-time machine translation.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
FIG. 1 is a flow chart of a method for machine translation provided by an embodiment of the present application;
fig. 2 is a schematic structural diagram of an input coding model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a first multi-headed attention mechanism network according to an embodiment of the present disclosure;
FIG. 4 is a schematic structural diagram of a point convolution dimensionality reduction operation provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an output decoding model according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a second multi-headed attention mechanism network according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of a machine translation device according to an embodiment of the present disclosure;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
First, an application scenario to which the present application is applicable will be described. The method and the apparatus can be applied to the technical field of machine translation. Compared with manual translation, machine translation is much cheaper, its workflow is simpler and more convenient, and it runs faster. In recent years, machine translation has incorporated artificial intelligence techniques such as deep learning: individual words are no longer simply translated into another language one by one; instead, translation takes the surrounding context into account and continually reviews the text to understand sentences with complex structures.
Specifically, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field therefore involves natural language, i.e., the language people use every day, and is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Currently, four basic models are commonly used in the NLP field: fully connected neural networks, Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and the Transformer model. Among them, LSTM and the Transformer are the most widely used feature extractors; in particular, since the Transformer, which uses the self-attention mechanism (Attention), was introduced, it has been regarded as the best basic feature extraction model. However, the Transformer's advantage also rests on its specific network structure and its use of the multi-head self-attention mechanism (Multi-head Attention): the deep stacked network and the multiple heads give the Transformer a modest improvement in feature extraction performance over LSTM, but also lead to an increase in parameter count and computation. For example, in a real-time machine translation task, the model must guarantee both translation accuracy and "real-time" speed during inference and translation, and this is where the network structure of the standard Transformer falls short.
Specifically, the advantages and disadvantages of the Transformer model in the real-time machine translation task and the advantages and disadvantages of the CNN model in the real-time machine translation task are respectively analyzed.
First, compared with an LSTM model, the Transformer model deepens the network in the vertical dimension, which greatly improves its feature extraction capability, and each time step does not depend on the hidden output of the previous time step as in LSTM, which improves feature retention. Meanwhile, the addition of the multi-head self-attention mechanism allows the model to learn, like a person, which aspects of the feature encoding to emphasize; the use of multiple heads makes the attention mechanism's learning capability stronger, because different kinds of features are learned from different dimensional subspaces, improving the model's generalization ability. In addition, compared with an LSTM model, the Transformer model is built purely by stacking fully connected networks, which removes the long-range sequential dependency that prevents LSTM from training in parallel and improves the training performance of the model.
Meanwhile, the performance improvement of the Transformer model comes at the cost of more parameters for the multi-layer deep network and the multi-head parallel attention, so the model requires more computation during training and the final model is large; this is an unavoidable problem of fully connected networks. Furthermore, when the Transformer model is applied to machine translation, although its translation performance is very high and the standard model structure reaches the best average level in the industry, this cannot mask the existing disadvantages: during training, the standard Transformer Block reaches 12 layers and the self-attention mechanism uses 12 attention heads, which implies a very large number of fully connected computations and parameters, causing problems such as a large model size and a large amount of computation. Meanwhile, in the actual translation inference process, because machine translation is a text generation task, each sentence is translated word by word conditioned on the previously generated sequence, which means a model forward pass is performed at every decoding step.
Second, the greatest advantages of the CNN model are parameter sharing, fast training and computation, and high parallelism, and CNN-based architectures have been validated in the NLP field, showing that a convolutional neural network can also be highly efficient on text as long as it is designed reasonably. However, using a convolutional neural network for machine translation has a fatal problem: a convolution operation extracts features from a local region, which means the extracted features are valuable only if a position is correlated with the data at its surrounding positions. As mentioned above, sentences in machine translation generally have logical relationships spanning the whole sequence, so a convolutional neural network extracts incomplete features and its performance is inferior to that of the Transformer or LSTM.
Based on the above, the embodiments of the application provide a machine translation method and a machine translation apparatus that build on the strong feature extraction capability of the Transformer model while combining the low parameter count and fast computation of a convolutional neural network, applying this combination to the field of real-time machine translation, so that when real-time machine translation is performed with the model of the application, the reduced parameter count makes training of the feature extraction process more convenient and the reduced computation improves speed to a certain extent.
Referring to fig. 1, fig. 1 is a flowchart of a machine translation method according to an embodiment of the present disclosure. As shown in fig. 1, a machine translation method provided in an embodiment of the present application includes:
and S110, acquiring a text to be translated and a target index word bank.
Here, the text to be translated is a text in the language to be translated, and may be an article, or a word or a sentence in an article. The article may belong to various fields, such as science and technology, sports, leisure and entertainment, food, and literature, and may be in various languages such as Chinese, English, Korean, Japanese, and the like. The target index word bank contains words of the language required by the target user; for example, if the target language is English, the target index word bank is an English lexicon containing a certain number of English words.
Furthermore, when a piece of Chinese text needs to be translated into English, the server first obtains the Chinese text to be translated, then determines the English index word bank as the target index word bank according to the selection of the target user, and finally translates the Chinese text to be translated using the English index word bank.
And S120, converting the acquired text to be translated into an input vector matrix.
Here, word segmentation is performed on the acquired text to be translated to obtain a plurality of words, a representation input vector is determined for each word, and an input vector matrix is determined from the obtained representation input vectors, where each row of the input vector matrix is the representation input vector of one word.
Specifically, when the text to be translated is a chinese text, a chinese word segmentation technique is adopted, and when the text to be translated is an english text or a text of another language, a word segmentation technique is adopted.
For example, if the text to be translated is "I am very happy," word segmentation yields "I," "very," and "happy," each of which is converted into its corresponding representation input vector to obtain the input vector matrix corresponding to the sentence.
S130, inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in the target index word bank, wherein the machine translation model is a Transformer model containing a convolutional layer.
Here, when training the machine translation model, evaluation metrics can be added during training and a number of training iterations defined, such as 100 or 200. During training, the training results are fed back to the model, and the training progress of the machine translation model is evaluated against these numerical metrics so that the model can adjust its parameters in time, finally yielding a trained machine translation model.
The machine translation model is a Transformer model containing convolutional layers, where the convolutional layers are point convolutions. A point convolution is obtained from a standard convolution by reducing the convolution kernel size to 1 × 1, and point convolutions are often used to fuse the channel information of an input feature map and reduce its number of channels. Replacing the traditional fully connected layer with point convolution reduces the number of model parameters and the amount of computation to a certain extent, ensuring real-time machine translation performance.
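The following is a minimal illustrative sketch, not taken from the patent, of what a point (1 × 1) convolution does: it fuses channel information of a feature map and reduces the number of channels. PyTorch is used for illustration, and all shapes (batch size, sequence length, channel counts) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

batch, seq_len, channels = 2, 10, 64                 # assumed sizes
feature_map = torch.randn(batch, channels, seq_len)  # Conv1d expects (B, C, L)

# 1x1 ("point") convolution: fuses the 64 input channels into 16 output channels
point_conv = nn.Conv1d(in_channels=channels, out_channels=16, kernel_size=1)
reduced = point_conv(feature_map)                     # shape (2, 16, 10)
```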
In this step, the input vector matrix is used as the input, the pre-trained machine translation model is used as the prediction and inference model, and the probability of translating each word in the text to be translated into any index word in the target index word bank is the output.
Here, the probability calculation in the machine translation model uses softmax. Softmax computes the fraction contributed by each value in a set of values and is generally described as follows: suppose there are n classes S_k, k ∈ (0, n], where n is the number of classes. Then softmax is computed as:

softmax(g_i) = e^{g_i} / Σ_{k=1}^{n} e^{g_k}

where i denotes one of the k classes and g_i denotes the value of that class.
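As a small sketch of the formula above (values are made up for illustration, and PyTorch is assumed only as a convenient way to evaluate it):

```python
import torch

scores = torch.tensor([2.0, 0.5, -1.0, 3.2])   # assumed unnormalized scores g_i
probs = torch.softmax(scores, dim=-1)            # e^{g_i} / sum_k e^{g_k}
print(probs, probs.sum())                        # probabilities that sum to 1
```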
S140, for each word in the text to be translated, selecting the index word with the maximum probability as the translation result of that word, to obtain the target translation result of the text to be translated.
Here, the translation results of each word are integrated together to obtain a target translation result of the text to be translated, and the target translation result is a translation corresponding to the text to be translated.
For example, suppose the text to be translated is "I am very happy," which after word segmentation becomes "I," "very," and "happy," and suppose it is translated into English with an English target index word bank whose dictionary space is 5000. For "I," an embedded dense vector of dimension 1 × 5000 is produced, where each element represents the probability of translating "I" into the corresponding word in the English target index word bank, yielding 5000 probabilities; similarly, 5000 probabilities are obtained for "very" and for "happy." The maximum of the 5000 probabilities is then taken; for example, when translating "I," if the probability at the 100th position is the largest, it is determined that "I" is translated into the 100th word of the dictionary space.
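A hypothetical sketch of this selection step, using random probabilities in place of real model output; the vocabulary size of 5000 follows the example above, and everything else is assumed:

```python
import torch

vocab_size = 5000                                           # assumed lexicon size
# one row of probabilities per source word ("I", "very", "happy")
probs = torch.softmax(torch.randn(3, vocab_size), dim=-1)
best_indices = probs.argmax(dim=-1)                          # index word with maximum probability
print(best_indices)                                          # e.g. tensor([ 100, 4210,   87])
```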
The machine translation method provided by the embodiment of the application includes: obtaining a text to be translated and a target index word bank; converting the obtained text to be translated into an input vector matrix; inputting the input vector matrix into a pre-trained machine translation model to predict the probability of translating each word in the text to be translated into any index word in the target index word bank, where the machine translation model is a Transformer model containing convolutional layers; and, for each word in the text to be translated, selecting the index word with the maximum probability as the translation result of that word to obtain the target translation result of the text to be translated. In this way, the strong feature extraction capability of the Transformer model is combined with the low parameter count and fast computation of a convolutional neural network and applied to real-time machine translation. When real-time machine translation is performed with this model, the reduced parameter count makes training of the feature extraction process more convenient and the reduced computation improves speed to a certain extent, so a lower computational cost and real-time performance are guaranteed while achieving the high performance of the Transformer.
In the embodiment of the present application, as a preferred embodiment, the step S120 includes:
determining a representation input vector of each word in the text to be translated, wherein the representation input vector is obtained according to a word embedding vector and a position embedding vector; and determining an input vector matrix of the text to be translated based on the obtained representation input vector of each word.
Here, the word embedding vector is the word embedding, the position embedding vector is the position embedding, the representation input vector of a word is obtained by adding its word embedding and its position embedding, and the obtained representation input vectors of all words are assembled into the input vector matrix of the text to be translated.
Specifically, the input vector matrix received by the Transformer Block is Xinput of shape (B × S), where B is the batch size, i.e., the number of sentences fed into machine translation model training at a time, and S is the sequence length, i.e., the predefined sentence length for each batch.
Xinput is obtained from the dictionary mapping of "I am very happy." The dictionary is a key-value data structure whose keys are numbers from 0 to the dictionary length and whose values are words, e.g. {0: you, 1: I, 2: he, 3: happy, …}; for example, assume the keys of "I," "very," and "happy" are 1, 5, and 3. Since "I am very happy" is one sentence, B is 1; S is the sentence length set for the batch (shorter sentences are padded with 0); and E is the embedding size, i.e., each word is represented by an embedded dense vector (there are 3 words in the example, so there are 3 representation input vectors). For example, "I" may correspond to the input vector [0 0 0 0 0 0 0 0 0 1], "very" to [0 0 0 0 0 0 0 0 5 0], and "happy" to [0 0 0 0 0 0 0 3 0 0].
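As a minimal sketch of forming the representation input vectors from word and position embeddings, assuming learned embedding tables; the token ids [1, 5, 3] follow the example keys above, while the vocabulary size, maximum length, and embedding size are assumptions:

```python
import torch
import torch.nn as nn

vocab_size, max_len, embed_size = 5000, 32, 512      # assumed hyper-parameters
word_emb = nn.Embedding(vocab_size, embed_size)       # word embedding table
pos_emb = nn.Embedding(max_len, embed_size)           # position embedding table

token_ids = torch.tensor([[1, 5, 3]])                 # dictionary keys of "I", "very", "happy"
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# representation input vector = word embedding + position embedding
x_input = word_emb(token_ids) + pos_emb(positions)    # input vector matrix, shape (1, 3, 512)
```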
In the embodiment of the present application, as a preferred embodiment, the machine translation model includes an input encoding model and an output decoding model; training the machine translation model by:
acquiring input vector matrix samples corresponding to a preset number of text samples to be translated; inputting each input vector matrix sample into the input coding model for feature extraction to obtain a target feature extraction matrix; inputting the obtained target feature extraction matrix into the output decoding model for probability calculation to obtain the probability of translating each word in the text sample to be translated into any index word in the target index word bank; and when the preset number of text samples to be translated are completely trained, determining that the training of the machine translation model is completed.
Here, the training end condition is determined by a preset training number, which is the same as the preset number of text samples to be translated.
Specifically, the Transformer input encoding model is the encoder and the output decoding model is the decoder. In practical applications, both the encoder and the decoder have multiple layers. From the encoder's perspective, the information extracted by each encoder layer is different: the lower layers extract surface lexical information, which is gradually abstracted upward so that the upper layers represent abstract semantic information. The top encoder layer is also connected to each decoder layer so that the attention operation can be carried out in the decoder; the decoder and encoder networks thus exchange and interact information, and the encoder and decoder have the same dimension size.
Referring to fig. 2, fig. 2 is a schematic structural diagram of an input coding model according to an embodiment of the present application. As shown in FIG. 2, the input coding model 200 includes a first multi-headed attention mechanism network 210, a first network enforcement layer 220, a first volume dimensionality reduction layer 230, and a second network enforcement layer 240; the input coding model 200 is trained by:
inputting the input vector matrix sample 250 into the first multi-head attention mechanism network 210 for calculation to obtain a first feature extraction matrix; inputting the first feature extraction matrix into the first network strengthening layer 220 for residual connection and normalization calculation to obtain a second feature extraction matrix; inputting the second feature extraction matrix into the first convolution dimensionality reduction layer 230 for dimensionality reduction calculation to obtain a third feature extraction matrix with the same dimensionality as the first feature extraction matrix; and inputting the third feature extraction matrix into the second network strengthening layer 240 for residual connection and normalization calculation to obtain a target feature extraction matrix.
Therein, the first multi-head attention mechanism network 210 is denoted CNN Multi-head Attention, the first network strengthening layer 220 is denoted Add & Norm, the first convolution dimensionality reduction layer 230 is denoted Point CNN, and the second network strengthening layer 240 is also denoted Add & Norm.
Here, CNN is fused into the structure of the Transformer Block: the Transformer Block follows the original structure, but the fully connected feed-forward network (FFN) is replaced by a convolutional layer (Point CNN), and the Multi-head Attention layer is replaced by CNN Multi-head Attention, that is, the final fully connected dimensionality reduction operation in Multi-head Attention is replaced by a point convolution operation.
The point convolution operation is as follows: convolution summation is performed on the h-th row vector of the second feature extraction matrix to obtain the value of the h-th bit of the third feature extraction matrix, where h is a positive integer not greater than p, and p is the number of dimensions of the first feature extraction matrix.
Here, when h equals 1, the 1st row vector of the second feature extraction matrix is convolved and summed to obtain the 1st-bit value of the third feature extraction matrix, and so on for the 2nd-bit value, the 3rd-bit value, and so forth, until h reaches the number of dimensions of the first feature extraction matrix.
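The sketch below illustrates the ordering of the encoder block described above (attention → Add & Norm → point convolution → Add & Norm). It is an assumption-laden simplification: PyTorch's standard nn.MultiheadAttention stands in for the CNN Multi-head Attention network of Fig. 3, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class EncoderBlockSketch(nn.Module):
    def __init__(self, hidden_size=64, n_heads=4):
        super().__init__()
        # stand-in for the CNN Multi-head Attention network (Fig. 3)
        self.attention = nn.MultiheadAttention(hidden_size, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(hidden_size)            # first network strengthening layer
        self.point_conv = nn.Conv1d(hidden_size, hidden_size, kernel_size=1)  # first conv dim-reduction layer
        self.norm2 = nn.LayerNorm(hidden_size)            # second network strengthening layer

    def forward(self, x):                                  # x: (B, S, hidden_size)
        s, _ = self.attention(x, x, x)                     # first feature extraction matrix
        x = self.norm1(x + s)                              # second: residual + normalization
        y = self.point_conv(x.transpose(1, 2)).transpose(1, 2)  # third: point convolution
        return self.norm2(x + y)                           # target feature extraction matrix

out = EncoderBlockSketch()(torch.randn(1, 3, 64))          # (1, 3, 64)
```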
Referring to fig. 3, fig. 3 is a schematic structural diagram of a first multi-head attention mechanism network according to an embodiment of the present disclosure. As shown in fig. 3, the first multi-headed attention mechanism network 210 includes three first linear layers 211, N first multi-headed self-attention mechanism layers 212, a first stitching layer 213, and a second convolution dimensionality reduction layer 214; training the first multi-headed attention mechanism network 210 by:
performing linear transformation on the input vector matrix samples to obtain N groups of coding input matrixes, wherein the coding input matrixes comprise a query matrix Q, a key value matrix K and a value matrix V, the query matrix Q, the key value matrix K and the value matrix V in each group of coding input matrixes are different, and N is an integer greater than or equal to 1; inputting each group of coding input matrixes to a first multi-head self-attention mechanism layer 212 through the first linear layer 211 for attention calculation to obtain N groups of coding output vectors; inputting the N groups of coding output vectors into a first splicing layer 213 for splicing to obtain an integrated coding vector; and inputting the integrated coding vector into a second convolution dimensionality reduction layer 214 for convolution dimensionality reduction calculation to obtain a first feature extraction matrix.
Here, the input vector matrix is fed into the first multi-head attention mechanism network 210 (CNN Multi-head Attention) for multi-head self-attention calculation: the input vector matrix is passed through linear mappings initialized with different parameters WQ, WK, and WV to obtain N groups of encoding input matrices, where each group of encoding input matrices includes a query matrix Q, a key value matrix K, and a value matrix V; the Q, K, and V within a group are derived from the same input, while the N groups differ from one another.
A query matrix Q, a key value matrix K, and a value matrix V are needed during the calculation; Q, K, and V are obtained by linear transformation of the input of the Self-Attention.
The attention calculation of the first multi-head self-attention mechanism layer 212 (scaled dot-product attention) is performed for each group of encoding input matrices (Q, K, V), with the following formula:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) · V

where K^T denotes the transpose of the K matrix, and d_k denotes the number of columns of the Q and K matrices, i.e., the vector dimension; it is an adjustable hyper-parameter set by the user before training the model.
In the formula, the inner product of each row vector of Q with each row vector of K is computed and divided by the square root of d_k to prevent the inner product from becoming too large. After Q is multiplied by K^T, the resulting matrix has n rows and n columns, where n is the number of words in the sentence; this matrix represents the attention strength between words. From QK^T, the attention coefficient of each word with respect to the other words is computed using softmax; the softmax in the formula is applied to each row of the matrix, so each row sums to 1. The softmax matrix is then multiplied by V to obtain the final first feature extraction matrix.
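A direct sketch of the scaled dot-product attention formula above; the tensor shapes (one sentence of three words, d_k = 64) are assumptions for illustration:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # QK^T / sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)          # row-wise softmax, each row sums to 1
    return weights @ V                               # weighted sum of value vectors

Q = K = V = torch.randn(1, 3, 64)                    # one sentence of 3 words, d_k = 64
out = scaled_dot_product_attention(Q, K, V)          # (1, 3, 64)
```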
The attention output of each group is obtained according to the configured N and denoted head1, head2, …, headN. Specifically, Multi-Head Attention includes a plurality of Self-Attention layers: the input vector matrix samples are first passed to N different Self-Attention layers, and N first feature extraction matrices are obtained by calculation. That is, when N is 8, 8 first feature extraction matrices are obtained.
The Attention here represents a matrix of dense vectors, and after training is completed, the Attention represents relevance vectors inside the input sentence.
If the input of Self-Attention is expressed by the input vector matrix samples, then Q, K, and V can be calculated using the linear transformation matrices WQ, WK, and WV. Note that each row of the input vector matrix samples, and of Q, K, and V, represents a word.
The N first feature extraction matrices (attention outputs) are concatenated; because of the concatenation, the output size is N times the original, i.e., [B × S, N × headsize]. The original multi-head attention mechanism uses a fully connected layer for the dimensionality reduction calculation, but in the embodiment of the present application the dimensionality reduction is performed with a point convolution operation.
The original fully connected calculation formula is as follows:

MultiOut = FFN(Concat(head1, head2, …, headN)) = Concat(head1, …, headN) · W

where the weight matrix W has size [N × headsize, headsize], so the parameter count of W is N × headsize × headsize.
Specifically, as shown in fig. 4, assuming MultiOut is the concatenation of 4 heads, bit-by-bit convolution operations are performed with a 1 × 1 point convolution kernel: the first-bit values of the head outputs are convolved and summed to obtain the first bit of the output S, the second-bit values of the head outputs are convolved and summed to obtain the second bit of the output S, and similarly for the point convolution operations over heads 3 and 4. The final output S is thus compressed from length 16 in the figure (4 heads, each of length 4) back to length 4 (the same size as a single head output). For the same operation, the fully connected dimensionality reduction requires 16 × 4 parameters, while the point convolution only requires the parameters of the 1 × 1 convolution kernel, so the overall parameter count and computation are greatly reduced, accelerating computation and reducing the number of parameters to optimize.
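One possible reading of the Fig. 4 operation, as a hedged sketch: the N head outputs are stacked as channels and fused bit by bit with a 1 × 1 convolution, instead of multiplying the concatenation by a full [N × headsize, headsize] weight matrix. The head count, head size, and exact parameter accounting below are assumptions for illustration, not the patent's own figures.

```python
import torch
import torch.nn as nn

n_heads, head_size = 4, 4
heads = torch.randn(1, n_heads, head_size)              # 4 head outputs of length 4

point_conv = nn.Conv1d(n_heads, 1, kernel_size=1)       # fuses the i-th bit of every head
s = point_conv(heads).squeeze(1)                         # output S of length 4

fc = nn.Linear(n_heads * head_size, head_size)          # fully connected alternative
print(sum(p.numel() for p in point_conv.parameters()))  # 4 weights + 1 bias
print(sum(p.numel() for p in fc.parameters()))          # 16*4 weights + 4 biases
```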
Further, the first feature extraction matrix is input to the first network strengthening layer 220 for residual connection and normalization calculation, so as to obtain a second feature extraction matrix.
Specifically, the residual error connection and normalization calculation is performed through the following formula to reduce the feature loss problem caused by the multi-layer network:
add&norm=LayerNorm(S+Xinput);
where add denotes the residual connection, used to prevent network degradation, and norm denotes Layer Normalization, which normalizes the activation values of each layer.
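A minimal sketch of the Add & Norm step, matching add&norm = LayerNorm(S + Xinput) above; the hidden size and tensors are assumptions:

```python
import torch
import torch.nn as nn

hidden_size = 64                                   # assumed
layer_norm = nn.LayerNorm(hidden_size)

x_input = torch.randn(1, 3, hidden_size)           # block input Xinput
s = torch.randn(1, 3, hidden_size)                 # attention output S
add_norm = layer_norm(x_input + s)                 # residual connection + layer normalization
```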
And further, inputting the second feature extraction matrix into the first convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature extraction matrix with the same dimensionality as the first feature extraction matrix.
Here, the point convolution dimensionality reduction operation is performed again: the add & norm value is reduced from headsize to hiddensize by point convolution instead of fully connected dimensionality reduction, so that the input and output sizes remain consistent when Transformer Blocks are stacked.
And further, inputting the third feature extraction matrix into the second network strengthening layer for residual connection and normalization calculation to obtain a target feature extraction matrix.
Here, residual connection and normalization are performed again to obtain the output Xoutput of the Transformer Block, i.e., the target feature extraction matrix.
In the training process, the final output of the Transformer Block (the target feature extraction matrix) is obtained, namely a matrix representation composed of dense vectors, which is also the output of the encoder.
In this embodiment, as a preferred embodiment, the inputting the integrated coding vector into the second convolution dimensionality reduction layer to perform convolution dimensionality reduction calculation to obtain a first feature extraction matrix includes:
performing convolution summation on the Mth bit array of each coded output vector in the integrated coded vectors to obtain a corresponding value of the Mth bit of the first feature extraction matrix; wherein M is a positive integer not greater than N.
Here, the process of performing the convolution dimensionality reduction calculation by the second convolution dimensionality reduction layer is the same as the process of performing the convolution dimensionality reduction calculation by the first convolution dimensionality reduction layer, and reference may be made to the process flow of fig. 4.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an output decoding model according to an embodiment of the present disclosure. As shown in fig. 5, the output decoding model 500 includes a second multi-headed attention mechanism network 510, a third network reinforcement layer 520, a third convolutional dimensionality reduction layer 530, a fourth network reinforcement layer 540, and a probabilistic output layer 550; the output decoding model 500 is trained by:
inputting the target feature extraction matrix 560 into the second multi-head attention mechanism network 510 for calculation to obtain a first feature transformation matrix; inputting the first feature transformation matrix into the third network strengthening layer 520 for residual connection and normalization calculation to obtain a second feature transformation matrix; inputting the second feature transformation matrix into a third convolution dimensionality reduction layer 530 for dimensionality reduction calculation to obtain a third feature transformation matrix with the same dimensionality as the first feature transformation matrix; inputting the third feature transformation matrix into the fourth network strengthening layer 540 for residual connection and normalization calculation to obtain a target feature transformation matrix; and inputting the target characteristic conversion matrix into a probability output layer 550 for probability calculation to obtain a probability output matrix representing the probability of translating each word in the text sample to be translated into any index word in the target index word bank.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a second multi-headed attention mechanism network according to an embodiment of the present application. As shown in fig. 6, the second multi-headed attention mechanism network 510 includes three second linear layers 511, N second multi-headed self-attention mechanism layers 512, a second stitching layer 513, and a fourth convolution dimensionality reduction layer 514; training the second multi-headed attention mechanism network 510 by:
determining N key value matrixes and N value matrixes based on the target feature extraction matrix, wherein N is an integer greater than or equal to 1; combining N query matrices obtained by performing linear transformation on the input vector matrix sample, the determined N key value matrices and the determined N value matrices to obtain N groups of decoding input matrices, wherein the query matrix Q, the key value matrix K and the value matrix V in each group of decoding input matrices are different, and inputting each group of decoding input matrices to a second multi-head self-attention mechanism layer 512 through a second linear layer 511 for performing attention calculation to obtain N groups of decoding output vectors; inputting the N groups of decoding output vectors into the second splicing layer 513 for splicing to obtain an integrated decoding vector; and inputting the integrated decoding vector into a fourth convolution dimensionality reduction layer 514 for convolution dimensionality reduction calculation to obtain a first feature transformation matrix.
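The sketch below illustrates the decoder-side attention wiring described above: the key and value matrices are derived from the encoder's target feature extraction matrix, while the query matrices come from the decoder-side input. All sizes, and the single-head simplification, are assumptions for illustration.

```python
import torch
import torch.nn as nn

hidden_size = 64                                     # assumed
w_q = nn.Linear(hidden_size, hidden_size)
w_k = nn.Linear(hidden_size, hidden_size)
w_v = nn.Linear(hidden_size, hidden_size)

encoder_output = torch.randn(1, 3, hidden_size)      # target feature extraction matrix
decoder_input = torch.randn(1, 3, hidden_size)       # decoder-side input vector matrix

Q = w_q(decoder_input)                               # queries from the decoder input
K = w_k(encoder_output)                              # keys from the encoder output
V = w_v(encoder_output)                              # values from the encoder output
out = torch.softmax(Q @ K.transpose(-2, -1) / hidden_size ** 0.5, dim=-1) @ V
```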
Further, the inputting the second feature transformation matrix into the third convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature transformation matrix with the same dimensionality as the first feature transformation matrix includes:
performing convolution summation on the f-th row vector in the second characteristic conversion matrix to obtain a corresponding value of the f-th bit of the third characteristic conversion matrix; wherein f is a positive integer not greater than c, and c is the dimension number of the first feature transformation matrix.
Further, the inputting the integrated decoding vector to the fourth convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature transformation matrix includes:
performing convolution summation on the Mth bit array of each decoding output vector in the integrated decoding vectors to obtain a corresponding value of the Mth bit of the first characteristic transformation matrix; wherein M is a positive integer not greater than N.
It should be noted that the input encoding model (encoder) and the output decoding model (decoder) are constructed in the same way; for the output decoding model (decoder), the input corresponds to the dictionary mapping of the translated text of "I am very happy," and the specific processing flow follows the detailed description of the input encoding model (encoder).
The machine translation method provided by the embodiment of the application mainly addresses the problems that the excessive parameter count of the Transformer leads to a large demand for computing power and an overly large model. The embodiment of the application builds on the strong feature extraction capability of the Transformer model while combining the low parameter count and fast computation of a convolutional neural network, and applies this combination to real-time machine translation, so that when real-time machine translation is performed with the model, the reduced parameter count makes training of the feature extraction process more convenient and the reduced computation improves speed to a certain extent. Therefore, a lower computational cost and real-time performance are guaranteed while achieving the high performance of the Transformer.
Based on the same inventive concept, a machine translation device corresponding to the machine translation method is also provided in the embodiments of the present application, and because the principle of solving the problem of the device in the embodiments of the present application is similar to the machine translation method described above in the embodiments of the present application, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a machine translation device according to an embodiment of the present disclosure. As shown in fig. 7, the machine translation apparatus 700 includes:
an obtaining module 710, configured to obtain a text to be translated and a target index lexicon;
the conversion module 720 is configured to convert the obtained text to be translated into an input vector matrix;
the prediction module 730 is configured to input the input vector matrix into a pre-trained machine translation model, and predict a probability that each word in the text to be translated is translated into any index word in the target index word bank, where the machine translation model is a Transformer model including a convolutional layer;
the translation module 740 is configured to select, for each word in the text to be translated, an index word corresponding to the maximum probability as a translation result of the word, so as to obtain a target translation result of the text to be translated.
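A minimal sketch of the selection performed by the translation module 740 is given below. It assumes the prediction module has already produced a probability matrix with one row per word of the text to be translated and one column per index word; the random probabilities and the placeholder word bank are purely illustrative.

```python
import torch

# Illustrative probability output: one row per word to be translated, one column per index word.
num_words, vocab_size = 5, 10000
prob_matrix = torch.softmax(torch.randn(num_words, vocab_size), dim=-1)

# Placeholder standing in for the target index word bank.
target_index_word_bank = [f"word_{i}" for i in range(vocab_size)]

# For each word, select the index word with the maximum probability as its translation result.
best_indices = prob_matrix.argmax(dim=-1)  # shape (num_words,)
target_translation_result = [target_index_word_bank[i] for i in best_indices.tolist()]
print(target_translation_result)
```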
Preferably, when the converting module 720 is configured to convert the obtained text to be translated into an input vector matrix, the converting module 720 is configured to:
determining a representation input vector of each word in the text to be translated, wherein the representation input vector is obtained according to a word embedding vector and a position embedding vector;
and determining an input vector matrix of the text to be translated based on the obtained representation input vector of each word.
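A minimal sketch of this conversion follows, assuming learned word and position embedding tables; the vocabulary size, model width, maximum length and token ids are illustrative values, not parameters taken from the embodiment.

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 10000, 512, 128  # illustrative values
word_embedding = nn.Embedding(vocab_size, d_model)      # word embedding vectors
position_embedding = nn.Embedding(max_len, d_model)     # position embedding vectors

# Hypothetical token ids for a four-word text to be translated.
token_ids = torch.tensor([[5, 17, 42, 8]])
positions = torch.arange(token_ids.size(1)).unsqueeze(0)

# Representation input vector = word embedding vector + position embedding vector;
# stacking one such vector per word yields the input vector matrix.
input_vector_matrix = word_embedding(token_ids) + position_embedding(positions)
print(input_vector_matrix.shape)  # (1, 4, d_model)
```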
Preferably, the machine translation apparatus 700 further comprises a training module 750; the machine translation model comprises an input coding model and an output decoding model, and the training module 750 is configured to train the machine translation model by:
acquiring input vector matrix samples corresponding to a preset number of text samples to be translated;
inputting each input vector matrix sample into the input coding model for feature extraction to obtain a target feature extraction matrix;
inputting the obtained target feature extraction matrix into the output decoding model for probability calculation to obtain the probability of translating each word in the text sample to be translated into any index word in the target index word bank;
and when the preset number of text samples to be translated are completely trained, determining that the training of the machine translation model is completed.
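A minimal sketch of this training procedure is shown below. It assumes the input coding model and output decoding model are ordinary trainable modules and that cross-entropy against reference index-word ids is used as the objective; the optimizer, learning rate and loss function are assumptions, since the embodiment does not specify them.

```python
import torch
import torch.nn as nn

def train_machine_translation_model(input_coding_model, output_decoding_model,
                                    sample_batches, target_batches, epochs=1):
    # Train on a preset number of text samples to be translated: encode each input vector
    # matrix sample, decode it into per-word probabilities over the index word bank,
    # and update both models jointly.
    params = list(input_coding_model.parameters()) + list(output_decoding_model.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for input_matrix, target_ids in zip(sample_batches, target_batches):
            features = input_coding_model(input_matrix)     # target feature extraction matrix
            logits = output_decoding_model(features)        # (batch, seq_len, vocab_size)
            loss = loss_fn(logits.view(-1, logits.size(-1)), target_ids.view(-1))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```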
Preferably, the input coding model comprises a first multi-head attention mechanism network, a first network strengthening layer, a first convolution dimensionality reduction layer and a second network strengthening layer; the training module 750 is configured to train the input coding model by:
inputting the input vector matrix sample into the first multi-head attention mechanism network for calculation to obtain a first feature extraction matrix;
inputting the first feature extraction matrix into the first network strengthening layer for residual connection and normalization calculation to obtain a second feature extraction matrix;
inputting the second feature extraction matrix into the first convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature extraction matrix with the same dimensionality as the first feature extraction matrix;
and inputting the third feature extraction matrix into the second network strengthening layer for residual connection and normalization calculation to obtain a target feature extraction matrix.
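A minimal sketch of an encoder block organized along these four steps is shown below, with PyTorch's built-in multi-head attention standing in for the first multi-head attention mechanism network and a kernel-size-1 convolution standing in for the first convolution dimensionality reduction layer; the layer sizes and the residual wiring are illustrative assumptions.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    # Sketch: multi-head attention -> residual connection + layer normalization
    # ("network strengthening layer") -> convolutional dimensionality reduction ->
    # second strengthening layer.
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv_reduce = nn.Conv1d(d_model, d_model, kernel_size=1)  # stand-in for the conv dim-reduction layer
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                    # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)                      # first feature extraction matrix
        x = self.norm1(x + attn_out)                          # second feature extraction matrix
        conv_out = self.conv_reduce(x.transpose(1, 2)).transpose(1, 2)  # third feature extraction matrix
        return self.norm2(x + conv_out)                       # target feature extraction matrix
```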
Preferably, when the training module 750 is configured to input the second feature extraction matrix to the first convolution dimension reduction layer for dimension reduction calculation to obtain a third feature extraction matrix having the same dimension as the first feature extraction matrix, the training module 750 is specifically configured to:
carrying out convolution summation on the h-th row vector in the second feature extraction matrix to obtain a corresponding value of the h-th bit of the third feature extraction matrix; wherein h is a positive integer not greater than p, and p is the dimension number of the first feature extraction matrix.
Preferably, the first multi-headed attention mechanism network comprises three first linear layers, N first multi-headed self-attention mechanism layers, a first stitching layer and a second convolution dimensionality reduction layer; the training module 750 is configured to train the first multi-headed attention mechanism network by:
performing linear transformation on the input vector matrix samples to obtain N groups of coding input matrixes, wherein the coding input matrixes comprise a query matrix, a key value matrix and a value matrix, the query matrix, the key value matrix and the value matrix in each group of coding input matrixes are different, and N is an integer greater than or equal to 1;
inputting each group of code input matrixes to a first multi-head self-attention mechanism layer through the first linear layer for attention calculation to obtain N groups of code output vectors;
inputting N groups of coding output vectors into the first splicing layer for splicing to obtain an integrated coding vector;
and inputting the integrated coding vector to the second convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature extraction matrix.
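A minimal sketch of such an attention network follows: three linear layers produce per-head query, key value and value matrices, the head outputs are spliced, and a kernel-size-1 convolution takes the place of the usual output projection (the second convolution dimensionality reduction layer). The class name and all sizes are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class ConvMultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)   # the three first linear layers
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.conv_out = nn.Conv1d(d_model, d_model, kernel_size=1)  # second convolution dim-reduction layer

    def forward(self, x):                                          # x: (batch, seq_len, d_model)
        b, s, _ = x.shape
        split = lambda t: t.view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)     # attention calculation per head
        heads = torch.softmax(scores, dim=-1) @ v                  # N groups of coding output vectors
        spliced = heads.transpose(1, 2).reshape(b, s, -1)          # integrated coding vector
        return self.conv_out(spliced.transpose(1, 2)).transpose(1, 2)  # first feature extraction matrix
```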
Preferably, when the training module 750 is configured to input the integrated coding vector to the second convolution dimensionality reduction layer for performing convolution dimensionality reduction calculation to obtain a first feature extraction matrix, the training module 750 is specifically configured to:
performing convolution summation on the Mth bit array of each coded output vector in the integrated coded vectors to obtain a corresponding value of the Mth bit of the first feature extraction matrix; wherein M is a positive integer not greater than N.
Preferably, the output decoding model comprises a second multi-head attention mechanism network, a third network strengthening layer, a third convolution dimensionality reduction layer, a fourth network strengthening layer and a probability output layer; the training module 750 is configured to train the output decoding model by:
inputting the target feature extraction matrix into the second multi-head attention mechanism network for calculation to obtain a first feature transformation matrix;
inputting the first feature transformation matrix into the third network strengthening layer for residual connection and normalization calculation to obtain a second feature transformation matrix;
inputting the second feature transformation matrix into the third convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature transformation matrix with the same dimensionality as the first feature transformation matrix;
inputting the third feature transformation matrix into the fourth network strengthening layer for residual connection and normalization calculation to obtain a target feature transformation matrix;
and inputting the target feature transformation matrix into the probability output layer for probability calculation to obtain a probability output matrix representing the probability of translating each word in the text sample to be translated into any index word in the target index word bank.
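A minimal sketch of the probability output layer, assuming it is a linear projection onto the target index word bank followed by a softmax; the embodiment does not spell out the layer's internal structure, so this is one plausible realization with illustrative sizes.

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 10000, 6  # illustrative sizes

# Project each target feature transformation vector onto the index word bank and normalize.
proj = nn.Linear(d_model, vocab_size)
target_feature_transformation_matrix = torch.randn(1, seq_len, d_model)

prob_output_matrix = torch.softmax(proj(target_feature_transformation_matrix), dim=-1)
# prob_output_matrix[0, i, j] is the predicted probability that word i of the text sample
# to be translated is translated into index word j.
```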
Preferably, the second multi-head attention mechanism network comprises three second linear layers, N second multi-head self-attention mechanism layers, a second splicing layer and a fourth convolution dimensionality reduction layer; the training module 750 is configured to train the second multi-headed attention mechanism network by:
determining N key value matrixes and N value matrixes based on the target feature extraction matrix, wherein N is an integer greater than or equal to 1;
combining N query matrixes obtained by performing linear transformation on the input vector matrix sample, and the determined N key value matrixes and N value matrixes to obtain N groups of decoding input matrixes, wherein the query matrixes, the key value matrixes and the value matrixes in each group of decoding input matrixes are different;
inputting each group of decoding input matrixes to a second multi-head self-attention mechanism layer through the second linear layer for attention calculation to obtain N groups of decoding output vectors;
inputting N groups of decoding output vectors into the second splicing layer for splicing to obtain an integrated decoding vector;
and inputting the integrated decoding vector to the fourth convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature transformation matrix.
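A minimal sketch of the decoder-side attention described by these steps is given below: the key value and value matrices are derived from the target feature extraction matrix produced by the encoder, while the query matrices come from the decoder-side input. PyTorch's built-in multi-head attention stands in for the N per-head layers plus the splicing layer, and the trailing kernel-size-1 convolution plays the role of the fourth convolution dimensionality reduction layer; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8  # illustrative sizes

cross_attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
conv_reduce = nn.Conv1d(d_model, d_model, kernel_size=1)

encoder_features = torch.randn(1, 7, d_model)   # target feature extraction matrix from the encoder
decoder_input = torch.randn(1, 5, d_model)      # decoder-side input vector matrix sample

# Queries from the decoder input; keys and values from the encoder feature matrix.
attn_out, _ = cross_attention(query=decoder_input, key=encoder_features, value=encoder_features)
first_feature_transformation_matrix = conv_reduce(attn_out.transpose(1, 2)).transpose(1, 2)
print(first_feature_transformation_matrix.shape)  # (1, 5, d_model)
```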
The machine translation device provided by the embodiment of the application comprises an acquisition module, a conversion module, a prediction module and a translation module, wherein the acquisition module is used for acquiring a text to be translated and a target index word stock; the conversion module is used for converting the acquired text to be translated into an input vector matrix; the prediction module is used for inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in a target index word library, wherein the machine translation model is a Transformer model containing a convolutional layer; the translation module is used for selecting the index word corresponding to the maximum probability as the translation result of each word in the text to be translated to obtain the target translation result of the text to be translated.
Compared with the prior-art method of performing machine translation with a Transformer model based on a multi-layer deep network and a multi-head self-attention mechanism, the present application replaces the traditional fully connected layer with point convolution, thereby combining the strong feature extraction capability of the Transformer model with the low parameter quantity and high-speed computation capability of a convolutional neural network, and applies this combination to the field of real-time machine translation.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 8, the electronic device 800 includes a processor 810, a memory 820, and a bus 830.
The memory 820 stores machine-readable instructions executable by the processor 810, when the electronic device 800 runs, the processor 810 and the memory 820 communicate through the bus 830, and when the machine-readable instructions are executed by the processor 810, the steps of the machine translation method in the embodiment of the method shown in fig. 1 may be executed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the machine translation method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application and are intended to be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A machine translation method, the machine translation method comprising:
acquiring a text to be translated and a target index word library;
converting the acquired text to be translated into an input vector matrix;
inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in the target index word library, wherein the machine translation model is a Transformer model containing a convolutional layer;
and aiming at each word in the text to be translated, selecting the index word corresponding to the maximum probability as the translation result of the word to obtain the target translation result of the text to be translated.
2. The machine translation method according to claim 1, wherein said converting the obtained text to be translated into an input vector matrix comprises:
determining a representation input vector of each word in the text to be translated, wherein the representation input vector is obtained according to a word embedding vector and a position embedding vector;
and determining an input vector matrix of the text to be translated based on the obtained representation input vector of each word.
3. The machine translation method of claim 1 wherein said machine translation model comprises an input encoding model and an output decoding model; training the machine translation model by:
acquiring input vector matrix samples corresponding to a preset number of text samples to be translated;
inputting each input vector matrix sample into the input coding model for feature extraction to obtain a target feature extraction matrix;
inputting the obtained target feature extraction matrix into the output decoding model for probability calculation to obtain the probability of translating each word in the text sample to be translated into any index word in the target index word bank;
and when the preset number of text samples to be translated are completely trained, determining that the training of the machine translation model is completed.
4. The machine translation method of claim 3, wherein the input coding model comprises a first multi-headed attention mechanism network, a first network strengthening layer, a first convolution dimensionality reduction layer, and a second network strengthening layer; training the input coding model by:
inputting the input vector matrix sample into the first multi-head attention mechanism network for calculation to obtain a first feature extraction matrix;
inputting the first feature extraction matrix into the first network strengthening layer for residual connection and normalization calculation to obtain a second feature extraction matrix;
inputting the second feature extraction matrix into the first convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature extraction matrix with the same dimensionality as the first feature extraction matrix;
and inputting the third feature extraction matrix into the second network strengthening layer for residual connection and normalization calculation to obtain a target feature extraction matrix.
5. The machine translation method according to claim 4, wherein said inputting the second feature extraction matrix into the first convolution dimension reduction layer for dimension reduction calculation to obtain a third feature extraction matrix having the same dimension as the first feature extraction matrix, comprises:
carrying out convolution summation on the h-th row vector in the second feature extraction matrix to obtain a corresponding value of the h-th bit of the third feature extraction matrix; wherein h is a positive integer not greater than p, and p is the dimension number of the first feature extraction matrix.
6. The machine translation method of claim 4, wherein the first multi-headed attention mechanism network comprises three first linear layers, N first multi-headed self-attention mechanism layers, a first stitching layer, and a second convolutional dimensionality reduction layer; training the first multi-headed attention mechanism network by:
performing linear transformation on the input vector matrix samples to obtain N groups of coding input matrixes, wherein the coding input matrixes comprise a query matrix, a key value matrix and a value matrix, the query matrix, the key value matrix and the value matrix in each group of coding input matrixes are different, and N is an integer greater than or equal to 1;
inputting each group of code input matrixes to a first multi-head self-attention mechanism layer through the first linear layer for attention calculation to obtain N groups of code output vectors;
inputting N groups of coding output vectors into the first splicing layer for splicing to obtain an integrated coding vector;
and inputting the integrated coding vector to the second convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature extraction matrix.
7. The machine translation method of claim 6, wherein said inputting said integrated coded vector into said second convolutional dimensionality reduction layer for performing a convolutional dimensionality reduction calculation to obtain a first feature extraction matrix, comprises:
performing convolution summation on the Mth bit array of each coded output vector in the integrated coded vectors to obtain a corresponding value of the Mth bit of the first feature extraction matrix; wherein M is a positive integer not greater than N.
8. The machine translation method of claim 3, wherein the output decoding model comprises a second multi-headed attention mechanism network, a third network reinforcement layer, a third convolution dimensionality reduction layer, a fourth network reinforcement layer, and a probability output layer; training the output decoding model by:
inputting the target feature extraction matrix into the second multi-head attention mechanism network for calculation to obtain a first feature transformation matrix;
inputting the first feature transformation matrix into the third network strengthening layer for residual connection and normalization calculation to obtain a second feature transformation matrix;
inputting the second feature transformation matrix into the third convolution dimensionality reduction layer for dimensionality reduction calculation to obtain a third feature transformation matrix with the same dimensionality as the first feature transformation matrix;
inputting the third feature transformation matrix into the fourth network strengthening layer for residual connection and normalization calculation to obtain a target feature transformation matrix;
and inputting the target feature transformation matrix into the probability output layer for probability calculation to obtain a probability output matrix representing the probability of translating each word in the text sample to be translated into any index word in the target index word bank.
9. The machine translation method of claim 8, wherein the second multi-headed attention mechanism network comprises three second linear layers, N second multi-headed self-attention mechanism layers, a second stitching layer, and a fourth convolution dimensionality reduction layer; training the second multi-headed attention mechanism network by:
determining N key value matrixes and N value matrixes based on the target feature extraction matrix, wherein N is an integer greater than or equal to 1;
combining N query matrixes obtained by performing linear transformation on the input vector matrix sample, and the determined N key value matrixes and N value matrixes to obtain N groups of decoding input matrixes, wherein the query matrixes, the key value matrixes and the value matrixes in each group of decoding input matrixes are different;
inputting each group of decoding input matrixes to a second multi-head self-attention mechanism layer through the second linear layer for attention calculation to obtain N groups of decoding output vectors;
inputting N groups of decoding output vectors into the second splicing layer for splicing to obtain an integrated decoding vector;
and inputting the integrated decoding vector to the fourth convolution dimensionality reduction layer for convolution dimensionality reduction calculation to obtain a first feature transformation matrix.
10. A machine translation apparatus, comprising:
the acquisition module is used for acquiring a text to be translated and a target index word library;
the conversion module is used for converting the acquired text to be translated into an input vector matrix;
the prediction module is used for inputting the input vector matrix into a pre-trained machine translation model, and predicting the probability of translating each word in the text to be translated into any index word in the target index word bank, wherein the machine translation model is a Transformer model containing a convolutional layer;
and the translation module is used for selecting the index word corresponding to the maximum probability as the translation result of each word in the text to be translated to obtain the target translation result of the text to be translated.
CN202110062736.1A 2021-01-18 2021-01-18 Machine translation method and machine translation device Withdrawn CN112699693A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110062736.1A CN112699693A (en) 2021-01-18 2021-01-18 Machine translation method and machine translation device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110062736.1A CN112699693A (en) 2021-01-18 2021-01-18 Machine translation method and machine translation device

Publications (1)

Publication Number Publication Date
CN112699693A true CN112699693A (en) 2021-04-23

Family

ID=75515510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110062736.1A Withdrawn CN112699693A (en) 2021-01-18 2021-01-18 Machine translation method and machine translation device

Country Status (1)

Country Link
CN (1) CN112699693A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468867A (en) * 2021-06-04 2021-10-01 淮阴工学院 Reference citation validity prediction method based on Attention mechanism
WO2022267674A1 (en) * 2021-06-22 2022-12-29 康键信息技术(深圳)有限公司 Deep learning-based text translation method and apparatus, device and storage medium
WO2023005763A1 (en) * 2021-07-29 2023-02-02 北京有竹居网络技术有限公司 Information processing method and apparatus, and electronic device
CN113627193A (en) * 2021-08-12 2021-11-09 达而观信息科技(上海)有限公司 Method, device, equipment and medium for determining designation relationship in Chinese text
CN113627193B (en) * 2021-08-12 2024-03-29 达观数据有限公司 Method, device, equipment and medium for determining reference relation in Chinese text

Similar Documents

Publication Publication Date Title
CN111783462B (en) Chinese named entity recognition model and method based on double neural network fusion
CN112699693A (en) Machine translation method and machine translation device
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
CN109783827B (en) Deep neural machine translation method based on dynamic linear polymerization
CN109214003B (en) The method that Recognition with Recurrent Neural Network based on multilayer attention mechanism generates title
CN108717574B (en) Natural language reasoning method based on word connection marking and reinforcement learning
CN109657226B (en) Multi-linkage attention reading understanding model, system and method
CN113204633B (en) Semantic matching distillation method and device
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN114911958B (en) Semantic preference-based rapid image retrieval method
CN114254645A (en) Artificial intelligence auxiliary writing system
CN113901802A (en) Short text similarity matching method for CRNN (CrNN) network fusion attention mechanism
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN117217223A (en) Chinese named entity recognition method and system based on multi-feature embedding
CN114781380A (en) Chinese named entity recognition method, equipment and medium fusing multi-granularity information
CN112528168B (en) Social network text emotion analysis method based on deformable self-attention mechanism
Tekir et al. Deep learning: Exemplar studies in natural language processing and computer vision
CN113204640A (en) Text classification method based on attention mechanism
CN116663501A (en) Chinese variant text conversion method based on multi-modal sharing weight
CN115238696A (en) Chinese named entity recognition method, electronic equipment and storage medium
CN115600597A (en) Named entity identification method, device and system based on attention mechanism and intra-word semantic fusion and storage medium
CN115659981A (en) Named entity recognition method based on neural network model
CN115424663A (en) RNA modification site prediction method based on attention bidirectional representation model
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (Application publication date: 20210423)