CN111950302A - Knowledge distillation-based machine translation model training method, device, equipment and medium - Google Patents

Knowledge distillation-based machine translation model training method, device, equipment and medium

Info

Publication number
CN111950302A
Authority
CN
China
Prior art keywords
model
module
training
student
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010843014.5A
Other languages
Chinese (zh)
Other versions
CN111950302B (en)
Inventor
袁秋龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Zhilv Information Technology Co ltd
Original Assignee
Shanghai Zhilv Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Zhilv Information Technology Co ltd filed Critical Shanghai Zhilv Information Technology Co ltd
Priority to CN202010843014.5A priority Critical patent/CN111950302B/en
Publication of CN111950302A publication Critical patent/CN111950302A/en
Application granted granted Critical
Publication of CN111950302B publication Critical patent/CN111950302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • G06F40/47Machine-assisted translation, e.g. using translation memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a knowledge distillation-based machine translation model training method, device, equipment and medium, wherein the method comprises the following steps: acquiring a teacher model and a student model; acquiring a sample data set containing training corpora; inputting the training corpora into the teacher model to obtain the intermediate content output by the simplified module in the teacher model and the final result output by the teacher model; inputting the training corpora into the student model to obtain the intermediate content output by the simplified module in the student model and the final result output by the student model; determining a model loss function according to the labeled translation labels of the training corpora, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model; and performing iterative training on the student model according to the model loss function. The teacher model is used to train the student model, so that the performance of the model is preserved even though its structure is simplified.

Description

Knowledge distillation-based machine translation model training method, device, equipment and medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a knowledge distillation-based machine translation model training method, device, equipment and medium.
Background
Machine translation, also known as automatic translation, is the process of using a computer to convert one natural source language into another natural target language, and generally refers to the translation of sentences and full texts between natural languages. Machine translation is a branch of Natural Language Processing and is closely related to Computational Linguistics and Natural Language Understanding. The idea of using machines for translation was first proposed by Warren Weaver in 1949. For a long time (from the 1950s to the 1980s), machine translation was accomplished by studying linguistic information in both the source and target languages, i.e., generating translations based on dictionaries and grammars; this is known as rule-based machine translation (RBMT). As statistical methods developed, researchers began to apply statistical models to machine translation, generating translation results based on the analysis of bilingual text corpora. This approach, known as Statistical Machine Translation (SMT), performed better than RBMT and dominated the field from the 1980s to the 2000s. In 1997, Ramon Neco and Mikel Forcada proposed the idea of using an Encoder-Decoder architecture for machine translation. In 2003, a research team led by Yoshua Bengio at the University of Montreal developed a language model based on a neural network, which alleviated the data sparsity problem of traditional SMT models. This research laid the foundation for the later application of neural networks to machine translation.
In 2017, Google proposed the Transformer model in the paper "Attention Is All You Need". This self-attention-based model handles sequence modeling well; applied to machine translation tasks, it greatly improves the translation effect. However, on the one hand, as the Transformer family has developed from BERT to GPT-2 to XLNet, the capacity of translation models has grown; although this can improve the translation effect to some extent, the online inference performance (latency and throughput) of the translation model becomes worse and worse, and how to improve the online inference performance of the translation model is a key factor in whether it can be deployed well and provide user-friendly service. On the other hand, as the number of supported foreign languages increases sharply, how to effectively compress the model without losing its translation effect, so that the model is convenient to store and release, is an important problem for the engineering deployment of the algorithm model.
Disclosure of Invention
In view of the above deficiencies of the prior art, an object of the present invention is to provide a machine translation model training method, apparatus, device and medium based on knowledge distillation, so as to train a simplified student model from a teacher model while affecting the model's effect as little as possible, improve the throughput of the model when deployed online, reduce its latency, and thereby improve the user experience.
In order to achieve the above object, the present invention provides a knowledge distillation-based machine translation model training method, including:
acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying some of the modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpora and labeled translation labels corresponding to the training corpora;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
determining a model loss function according to a label translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and performing iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the determining a model loss function according to the labeled translation label corresponding to the corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model, includes:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeled translation label corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and determining the model loss function according to the first loss function, the second loss function and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively comprise an embedding module, an encoding module, a decoding module and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model have the same structure as those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the present invention, after the sample data set is obtained, the method further includes: and preprocessing the training corpus.
In a preferred embodiment of the present invention, the preprocessing the corpus includes:
converting characters in the training corpus into corresponding numerical values;
and dividing the training corpuses into different batches, and adjusting the training corpuses of each batch to be the same in length in a zero value filling mode.
In order to achieve the above object, the present invention further provides a knowledge distillation-based machine translation model training apparatus, including:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, the student model being obtained by simplifying some of the modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpora and labeled translation labels corresponding to the training corpora;
the teacher model processing module is used for inputting the training corpora into the teacher model for processing to obtain intermediate contents output by the simplified module in the teacher model and a final result output by the teacher model;
the student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeled translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and the model training module is used for carrying out iterative training on the student model according to the model loss function.
In a preferred embodiment of the present invention, the model loss function determining module includes:
a first loss function determination unit configured to determine a first loss function based on the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
a second loss function determining unit, configured to determine a second loss function according to the labeled translation label corresponding to the training corpus and the final result output by the student model;
a third loss function determination unit configured to determine a third loss function according to a final result output by the teacher model and a final result output by the student model;
a model loss function determining unit, configured to determine the model loss function according to the first loss function, the second loss function, and the third loss function.
In a preferred embodiment of the present invention, the teacher model and the student model respectively comprise an embedding module, an encoding module, a decoding module and an output module.
In a preferred embodiment of the present invention, the embedding module, the encoding module and the output module of the student model have the same structure as those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is provided between the decoding module of the student model and the decoding module of the teacher model.
In a preferred embodiment of the present invention, the apparatus further comprises: and the preprocessing module is used for preprocessing the training corpus after the sample data set is acquired.
In a preferred embodiment of the present invention, the preprocessing module includes:
the numerical value conversion unit is used for converting the characters in the training corpus into corresponding numerical values;
and the length adjusting unit is used for dividing the training corpuses into different batches and adjusting the training corpuses of each batch to be the same length in a zero value filling mode.
In order to achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the aforementioned machine translation model training method when executing the computer program.
To achieve the above object, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the aforementioned machine translation model training method.
By adopting the technical scheme, the invention has the following beneficial effects:
according to the label translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model, a model loss function is determined, and iterative training is carried out on the student model according to the model loss function. Compared with a teacher model, the student model obtained by training simplifies the model structure, and the intermediate content and the final result output by the teacher model are utilized to supervise in the training process, so that the performance and the effect of the model can be ensured as far as possible under the condition that the parameters of the student model are reduced.
Drawings
FIG. 1 is a flowchart of a knowledge-based distillation machine translation model training method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a training method of a knowledge-based distillation machine translation model in embodiment 1 of the present invention;
FIG. 3 is a block diagram of a knowledge-based distillation machine translation model training apparatus according to embodiment 2 of the present invention;
fig. 4 is a hardware architecture diagram of an electronic device in embodiment 3 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
Example 1
The embodiment provides a knowledge distillation-based machine translation model training method, as shown in fig. 1, the method specifically includes the following steps:
and S1, acquiring a trained teacher model and untrained student models, wherein the student models are obtained by simplifying partial modules in the teacher model.
Knowledge distillation is a network model compression method: a teacher-student framework is constructed in which the teacher model guides the training of the student model, the knowledge about feature representation learned by the teacher model (which has a complex structure and a large number of parameters) is distilled out, and this knowledge is transferred to the student model (which has a simple structure, fewer parameters and weaker learning ability). In this way, knowledge distillation can improve the performance of the model without increasing the complexity of the student model.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and a student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means that its model parameters are frozen, i.e., they cannot be modified during the subsequent training process; the student model is in training mode, and its model parameters can be modified during training.
For example, the teacher model and the student model in the present embodiment may be translation models with the Transformer as the basic structure. As shown in fig. 2, the teacher model and the student model each include an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, where the embedding module may include a corpus embedding layer and a language type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference cost, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model, are not reduced, and their parameters can be shared. That is, the present embodiment compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the number of neurons in the embedding module and the output module of the student model is kept consistent with that in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model, so as to perform the subsequent loss function calculation, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
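As a rough illustration of this construction, the following PyTorch-style sketch builds such a student from a trained teacher; the attribute names (embedding, encoder, decoder.layers, output), the hidden size d_model and the helper name build_student are assumptions for illustration only, not names taken from the patent.

```python
import copy
import torch.nn as nn

def build_student(teacher, num_student_layers, d_model=512):
    """Sketch: derive a simplified student model from a trained teacher."""
    # Start from a copy so the embedding, encoding and output modules keep
    # exactly the teacher's structure (their parameters may also be shared).
    student = copy.deepcopy(teacher)
    # Compress only the decoding module by keeping fewer decoding layers,
    # initialised from the teacher's first `num_student_layers` layers.
    student.decoder.layers = nn.ModuleList(
        [copy.deepcopy(layer)
         for layer in list(teacher.decoder.layers)[:num_student_layers]]
    )
    # Full connection layer between the two decoding modules, so that the
    # teacher's intermediate content matches the student's dimension (eq. 1).
    teacher_to_student_proj = nn.Linear(d_model, d_model)
    return student, teacher_to_student_proj
```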
S2, obtaining a sample data set, wherein the sample data set comprises a plurality of training corpora and label translation labels corresponding to the training corpora, and the training corpora can also carry corresponding language types.
S3, preprocessing the sample data set. This specifically comprises the following steps: first, the characters in the training corpora are converted into corresponding numerical values, and the training corpora are divided into different batches. Because the training corpora differ in length, the training corpora in each batch can be adjusted to the same length by zero-value filling. Zero-value filling takes the longest sentence in the same batch of training corpora as the reference and fills the missing positions of the other sentences with 0, so that their lengths match that of the longest sentence. In this way, input data of size Batch_size × Sequence_length is obtained, where Batch_size is the number of corpora in the same batch and Sequence_length is the length of the longest corpus in the same batch.
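A minimal sketch of this padding step over already-numericalised corpora (the helper name pad_batch is illustrative; the pad value 0 follows the description above):

```python
def pad_batch(batch_token_ids, pad_value=0):
    """Pad every corpus in one batch with zeros up to the longest corpus,
    giving a Batch_size x Sequence_length matrix of token ids."""
    sequence_length = max(len(ids) for ids in batch_token_ids)
    return [ids + [pad_value] * (sequence_length - len(ids))
            for ids in batch_token_ids]

# Example: pad_batch([[12, 7, 3], [5, 9]]) -> [[12, 7, 3], [5, 9, 0]]
```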
And S4, inputting the preprocessed corpus into the teacher model for processing to obtain the intermediate content output by the simplified module in the teacher model and the final result output by the teacher model.
For example, when the teacher model has the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, the training corpus and the language type thereof are mapped through the corpus layer and the language type layer of the embedding module, the corpus embedding result and the language type embedding result are merged and then input to the encoding module for feature encoding, then feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module to obtain the final result output by the teacher model.
And S5, inputting the training corpus into the student model for processing to obtain the intermediate content output by the simplified module in the student model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the training corpus is first input to the embedding module of the student model, the training corpus and the language type thereof are mapped through the corpus layer and the language type layer of the embedding module, the corpus embedding result and the language type embedding result are merged and then input to the encoding module for feature encoding, then feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module to obtain the final result output by the student model.
And S6, determining a model loss function according to the label translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model. The specific implementation process of the step is as follows:
and S61, determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT_FMT is calculated according to the following equation (1):
L_{AT\_FMT} = \sum_{c=1}^{C} D_{kl}\big(\mathrm{FC}(H_c^{T}),\ H_c^{S}\big)    (1)
where C is the number of decoding layers of the decoding module in the student model, D_kl is the function for calculating the KL divergence, FC(H_c^T) denotes the content output by the c-th decoding layer in the teacher model after being processed by the full connection layer, and H_c^S denotes the processing result output by the c-th decoding layer in the student model.
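A sketch of equation (1) in PyTorch is given below. The patent does not spell out how the KL divergence is taken over the decoder outputs; here it is assumed, purely for illustration, that each layer output is turned into a distribution with a softmax over the last dimension before D_kl is applied, and that proj is the full connection layer described above.

```python
import torch.nn.functional as F

def at_fmt_loss(teacher_layer_outputs, student_layer_outputs, proj):
    """Equation (1): KL divergence, summed over the C decoding layers of the
    student, between the projected teacher layer output and the corresponding
    student layer output."""
    loss = 0.0
    for h_teacher, h_student in zip(
            teacher_layer_outputs[:len(student_layer_outputs)],
            student_layer_outputs):
        target = F.softmax(proj(h_teacher), dim=-1)     # teacher distribution
        log_pred = F.log_softmax(h_student, dim=-1)     # student, in log space
        loss = loss + F.kl_div(log_pred, target, reduction="batchmean")
    return loss
```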
And S62, determining a second loss function according to the label translation label corresponding to the training corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following equation (2):
L_{hard} = \big\{ -p'_{ij}\log(p_{ij}) - (1-p'_{ij})\log(1-p_{ij}) \big\}    (2)
where log(·) denotes the logarithmic function, p_ij denotes the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and p'_ij denotes the ground-truth probability that the i-th word corresponds to the j-th translation tag (p'_ij is obtained from the labeled translation label corresponding to the training corpus).
And S63, determining a third loss function according to the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following equation (3):
L_{soft} = \big\{ -q_{ij}\log(p_{ij}) - (1-q_{ij})\log(1-p_{ij}) \big\}    (3)
where log(·) denotes the logarithmic function, p_ij denotes the probability, output by the student model, that the i-th word corresponds to the j-th translation tag, and q_ij denotes the probability, output by the teacher model, that the i-th word corresponds to the j-th translation tag.
S64, determining the model loss function according to the first loss function L_AT_FMT, the second loss function L_hard and the third loss function L_soft.
For example, the model loss function Loss_all is calculated according to the following equation (4):
\mathrm{Loss}_{all} = \alpha L_{hard} + (1-\alpha) L_{soft} + \beta L_{AT\_FMT}    (4)
where α and β are the weight coefficients of the corresponding loss terms, with α ∈ (0, 1) and β ∈ R; their specific values can be preset empirically.
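The three terms can then be combined as in equation (4). The sketch below assumes the student and teacher outputs are already probability tensors over translation tags and that the hard labels p'_ij are one-hot; the plain-sum reduction and the epsilon for numerical stability are illustrative choices, not taken from the patent.

```python
import torch

def model_loss(p_student, p_teacher, p_true, l_at_fmt,
               alpha=0.5, beta=1.0, eps=1e-8):
    """Equations (2)-(4): hard loss against the labeled translation tags,
    soft loss against the teacher's probabilities, plus the layer loss."""
    # Equation (2): p_true holds the p'_ij derived from the labeled translation labels.
    l_hard = -(p_true * torch.log(p_student + eps)
               + (1 - p_true) * torch.log(1 - p_student + eps)).sum()
    # Equation (3): same form, with the teacher's probabilities q_ij as soft targets.
    l_soft = -(p_teacher * torch.log(p_student + eps)
               + (1 - p_teacher) * torch.log(1 - p_student + eps)).sum()
    # Equation (4): alpha in (0, 1), beta in R, both preset empirically.
    return alpha * l_hard + (1 - alpha) * l_soft + beta * l_at_fmt
```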
And S7, training the student model according to the model loss function, namely updating the parameters of the student model according to the loss function.
The process of training the model according to the loss function is iterative, and after each round of training it is judged whether a preset training termination condition is met. If the training termination condition is not satisfied, training continues according to steps S4 to S7 until the training termination condition is satisfied.
In one possible implementation, satisfying the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative trainings reaches a count threshold; the count threshold may be set empirically, or may be flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or adjusted freely according to the application scenario. Third, the model loss function converges. Convergence of the model loss function means that, as the number of iterative trainings increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^{-3} to 10^{-3} and the reference number is 10. If the fluctuation range of the model loss function over the last 10 iterative training results stays within -10^{-3} to 10^{-3}, the model loss function is considered to have converged. When any one of these conditions is met, the training termination condition is satisfied and the training of the student model is completed.
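The three termination conditions might be checked as follows; the threshold, window and band values are placeholders chosen only to mirror the example above.

```python
def training_finished(step, loss_history, max_steps=100_000,
                      loss_threshold=0.01, window=10, band=1e-3):
    """True if any of the three termination conditions above is met."""
    if step >= max_steps:                                    # condition 1: count threshold
        return True
    if loss_history and loss_history[-1] < loss_threshold:   # condition 2: loss threshold
        return True
    if len(loss_history) >= window:                          # condition 3: convergence
        recent = loss_history[-window:]
        if max(recent) - min(recent) <= 2 * band:            # stays within +/- band
            return True
    return False
```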
In the process of updating the model parameters with the model loss function, optimization can be performed with the Adaptive Moment Estimation (Adam) optimization algorithm. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
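In PyTorch this constraint can be expressed with Adam parameter groups, reusing the student object from the earlier construction sketch; the concrete learning-rate values below are placeholders, with only the relation lr_eb ≤ lr_db taken from the description.

```python
import torch

optimizer = torch.optim.Adam([
    {"params": student.encoder.parameters(), "lr": 1e-5},  # lr_eb for the encoding module
    {"params": student.decoder.parameters(), "lr": 1e-4},  # lr_db for the decoding module
])
```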
In addition, a hierarchical training mode can be used to reduce the decoding module of the student model step by step during training. As shown in fig. 2, after a student model (containing M decoding layers) is obtained by training with the teacher model (containing K decoding layers), the trained student model is used as a new teacher model to train a student model with an even smaller number of decoding layers, and so on, until a student model containing only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is chosen as a compromise between the theoretical performance improvement and the translation effect of the model. After the student model training is finished, the teacher model is removed.
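A sketch of this hierarchical mode is shown below, where train_one_student stands for the full distillation procedure of steps S1 to S7; its name and signature are assumptions for illustration.

```python
def hierarchical_distillation(teacher, decoder_layer_schedule, train_one_student):
    """Shrink the decoding module progressively, e.g. K -> M -> N layers."""
    current_teacher = teacher
    for num_layers in decoder_layer_schedule:   # e.g. [M, N] with K > M > N
        student = train_one_student(current_teacher, num_layers)
        current_teacher = student               # the trained student becomes the new teacher
    return current_teacher                      # intermediate teacher models are then removed
```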
The student model obtained by training in this embodiment has a simplified model structure, and because the intermediate content and the final result output by the teacher model are used for supervision during training, the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters. Because the model structure of the student model is simplified, the throughput when the model is deployed online is improved and the latency of the model is reduced, which in turn improves the user experience.
It should be noted that the foregoing embodiments are described as a series of acts or combinations for simplicity in explanation, but it should be understood by those skilled in the art that the present invention is not limited by the order of acts or acts described, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Example 2
The embodiment provides a knowledge-based distillation machine translation model training device, as shown in fig. 3, the device 1 specifically includes: the model training system comprises a model acquisition module 11, a sample acquisition module 12, a preprocessing module 13, a teacher model processing module 14, a student model processing module 15, a model loss function determination module 16 and a model training module 17.
Each module is described in detail below:
the model obtaining module 11 is configured to obtain a trained teacher model and untrained student models, where the student models are obtained by simplifying some modules in the teacher model.
Knowledge distillation is a network model compression method: a teacher-student framework is constructed in which the teacher model guides the training of the student model, the knowledge about feature representation learned by the teacher model (which has a complex structure and a large number of parameters) is distilled out, and this knowledge is transferred to the student model (which has a simple structure, fewer parameters and weaker learning ability). In this way, knowledge distillation can improve the performance of the model without increasing the complexity of the student model.
In the present embodiment, a machine translation model that has already been trained is prepared in advance as the teacher model, and a student model is obtained by simplifying some of the modules in the teacher model. The teacher model is in prediction mode, which means that its model parameters are frozen, i.e., they cannot be modified during the subsequent training process; the student model is in training mode, and its model parameters can be modified during training.
For example, the teacher model and the student model in the present embodiment may be translation models with the Transformer as the basic structure. As shown in fig. 2, the teacher model and the student model each include an embedding module, an encoding module, a decoding module and an output module cascaded in sequence, where the embedding module may include a corpus embedding layer and a language type embedding layer. Because the embedding module, the encoding module and the output module account for little of the inference cost, the embedding module, the encoding module and the output module of the student model are kept consistent in structure with those of the teacher model, are not reduced, and their parameters can be shared. That is, the present embodiment compresses only the decoding module of the teacher model (by reducing the number of decoding layers in the decoding module) to obtain the decoding module of the student model. In order to ensure the translation effect of the student model, the number of neurons in the embedding module and the output module of the student model is kept consistent with that in the embedding module and the output module of the teacher model.
In addition, in order to ensure that the dimension of the intermediate content output by the decoding module in the student model is consistent with the dimension of the intermediate content output by the decoding module in the teacher model, so as to perform the subsequent loss function calculation, a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
The sample obtaining module 12 is configured to obtain a sample data set, where the sample data set includes a plurality of training corpora and labeled translation tags corresponding to the training corpora, and the training corpora may also carry corresponding language types.
The preprocessing module 13 is configured to preprocess the sample data set. It specifically comprises: a numerical value conversion unit 131, configured to convert the characters in the training corpora into corresponding numerical values; and a length adjustment unit 132, configured to divide the training corpora into different batches and, because the training corpora differ in length, adjust each batch of training corpora to the same length by zero-value filling. Zero-value filling takes the longest sentence in the same batch of training corpora as the reference and fills the missing positions of the other sentences with 0, so that their lengths match that of the longest sentence. In this way, input data of size Batch_size × Sequence_length is obtained, where Batch_size is the number of corpora in the same batch and Sequence_length is the length of the longest corpus in the same batch.
The teacher model processing module 14 is configured to input the preprocessed corpus into the teacher model for processing, so as to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model.
For example, when the teacher model has the structure shown in fig. 2, the training corpus is first input to the embedding module of the teacher model, the training corpus and the language type thereof are mapped through the corpus layer and the language type layer of the embedding module, the corpus embedding result and the language type embedding result are merged and then input to the encoding module for feature encoding, then feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module to obtain the final result output by the teacher model.
The student model processing module 15 is configured to input the corpus into the student model for processing, so as to obtain intermediate content output by the simplified module in the student model and a final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the training corpus is first input to the embedding module of the student model, the training corpus and the language type thereof are mapped through the corpus layer and the language type layer of the embedding module, the corpus embedding result and the language type embedding result are merged and then input to the encoding module for feature encoding, then feature decoding is performed through the decoding module, the intermediate content output by the decoding module is collected, and finally the decoding result is processed through the output module to obtain the final result output by the student model.
The model loss function determining module 16 is configured to determine a model loss function according to the labeled translation label corresponding to the corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model. The specific implementation process of the step is as follows:
the first loss function determining unit 161 is configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model.
For example, when the student model has the structure shown in fig. 2, the first loss function L_AT_FMT is calculated according to the following equation (1):
L_{AT\_FMT} = \sum_{c=1}^{C} D_{kl}\big(\mathrm{FC}(H_c^{T}),\ H_c^{S}\big)    (1)
where C is the number of decoding layers of the decoding module in the student model, D_kl is the function for calculating the KL divergence, FC(H_c^T) denotes the content output by the c-th decoding layer in the teacher model after being processed by the full connection layer, and H_c^S denotes the processing result output by the c-th decoding layer in the student model.
The second loss function determining unit 162 is configured to determine a second loss function according to the labeled translation label corresponding to the corpus and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the second loss function L_hard is calculated according to the following equation (2):
L_{hard} = \big\{ -p'_{ij}\log(p_{ij}) - (1-p'_{ij})\log(1-p_{ij}) \big\}    (2)
The third loss function determining unit 163 is configured to determine a third loss function based on the final result output by the teacher model and the final result output by the student model.
For example, when the student model has the structure shown in fig. 2, the third loss function L_soft is calculated according to the following equation (3):
L_{soft} = \big\{ -q_{ij}\log(p_{ij}) - (1-q_{ij})\log(1-p_{ij}) \big\}    (3)
The model loss function determination unit 164 is configured to determine the model loss function according to the first loss function L_AT_FMT, the second loss function L_hard and the third loss function L_soft.
For example, the model loss function Loss_all is calculated according to the following equation (4):
\mathrm{Loss}_{all} = \alpha L_{hard} + (1-\alpha) L_{soft} + \beta L_{AT\_FMT}    (4)
where α and β are the weight coefficients of the corresponding loss terms, with α ∈ (0, 1) and β ∈ R; their specific values can be preset empirically.
The model training module 17 is configured to train the student model according to the model loss function, that is, update parameters of the student model according to the loss function.
The process of training the model according to the loss function is iterative, and after each round of training it is judged whether a preset training termination condition is met. If the training termination condition is not satisfied, training continues until the training termination condition is satisfied.
In one possible implementation, satisfying the training termination condition includes, but is not limited to, the following three cases. First, the number of iterative trainings reaches a count threshold; the count threshold may be set empirically, or may be flexibly adjusted according to the application scenario, which is not limited in the embodiments of the present application. Second, the model loss function is less than a loss threshold; the loss threshold may likewise be set empirically or adjusted freely according to the application scenario. Third, the model loss function converges. Convergence of the model loss function means that, as the number of iterative trainings increases, the fluctuation range of the model loss function over a reference number of training results stays within a reference range. For example, assume that the reference range is -10^{-3} to 10^{-3} and the reference number is 10. If the fluctuation range of the model loss function over the last 10 iterative training results stays within -10^{-3} to 10^{-3}, the model loss function is considered to have converged. When any one of these conditions is met, the training termination condition is satisfied and the training of the student model is completed.
In the process of updating the model parameters with the model loss function, optimization can be performed with the Adaptive Moment Estimation (Adam) optimization algorithm. During training, the learning rate lr_eb of the encoding module of the student model is less than or equal to the learning rate lr_db of the decoding module.
In addition, a hierarchical training mode can be used to reduce the decoding module of the student model step by step during training. As shown in fig. 2, after a student model (containing M decoding layers) is obtained by training with the teacher model (containing K decoding layers), the trained student model is used as a new teacher model to train a student model with an even smaller number of decoding layers, and so on, until a student model containing only a predetermined number N of decoding layers is obtained, where K > M > N. In this embodiment, the compression ratio of the student model is chosen as a compromise between the theoretical performance improvement and the translation effect of the model. After the student model training is finished, the teacher model is removed.
The student model obtained by training in this embodiment has a simplified model structure, and because the intermediate content and the final result output by the teacher model are used for supervision during training, the performance and effect of the model are preserved as far as possible even though the student model has fewer parameters. Because the model structure of the student model is simplified, the throughput when the model is deployed online is improved and the latency of the model is reduced, which in turn improves the user experience.
Example 3
The present embodiment provides an electronic device, which may be represented in the form of a computing device (for example, may be a server device), including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the knowledge-based distillation machine translation model training method provided in embodiment 1.
Fig. 4 shows a schematic diagram of a hardware structure of the present embodiment, and as shown in fig. 4, the electronic device 9 specifically includes:
at least one processor 91, at least one memory 92, and a bus 93 for connecting the various system components (including the processor 91 and the memory 92), wherein:
the bus 93 includes a data bus, an address bus, and a control bus.
Memory 92 includes volatile memory, such as Random Access Memory (RAM)921 and/or cache memory 922, and can further include Read Only Memory (ROM) 923.
Memory 92 also includes a program/utility 925 having a set (at least one) of program modules 924, such program modules 924 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The processor 91 executes a computer program stored in the memory 92 to execute various functional applications and data processing, such as the machine translation model training method based on knowledge distillation provided in embodiment 1 of the present invention.
The electronic device 9 may further communicate with one or more external devices 94 (e.g., a keyboard, a pointing device, etc.). Such communication may be through an input/output (I/O) interface 95. Also, the electronic device 9 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via the network adapter 96. The network adapter 96 communicates with the other modules of the electronic device 9 via the bus 93. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 9, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems, etc.
It should be noted that although in the above detailed description several units/modules or sub-units/modules of the electronic device are mentioned, such a division is merely exemplary and not mandatory. Indeed, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module, according to embodiments of the application. Conversely, the features and functions of one unit/module described above may be further divided into embodiments by a plurality of units/modules.
Example 4
The present embodiment provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the knowledge-based distillation machine translation model training method of embodiment 1.
More specific examples, among others, that the readable storage medium may employ may include, but are not limited to: a portable disk, a hard disk, random access memory, read only memory, erasable programmable read only memory, optical storage device, magnetic storage device, or any suitable combination of the foregoing.
In a possible implementation, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps of implementing the knowledge-based distillation machine translation model training method of example 1, when the program product is run on the terminal device.
Where program code for carrying out the invention is written in any combination of one or more programming languages, the program code may be executed entirely on the user device, partly on the user device, as a stand-alone software package, partly on the user device and partly on a remote device or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and that the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the spirit and scope of the invention, and these changes and modifications are within the scope of the invention.

Claims (14)

1. A knowledge distillation-based machine translation model training method is characterized by comprising the following steps:
acquiring a trained teacher model and an untrained student model, wherein the student model is obtained by simplifying some of the modules in the teacher model;
acquiring a sample data set, wherein the sample data set comprises a plurality of training corpora and labeled translation labels corresponding to the training corpora;
inputting the training corpus into the teacher model for processing to obtain intermediate content output by the simplified module in the teacher model and a final result output by the teacher model;
inputting the training corpus into the student model for processing to obtain intermediate content output by a simplified module in the student model and a final result output by the student model;
determining a model loss function according to a label translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and performing iterative training on the student model according to the model loss function.
2. The knowledge distillation-based machine translation model training method according to claim 1, wherein the determining a model loss function according to the labeled translation tags corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model, and the final result output by the student model comprises:
determining a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
determining a second loss function according to the labeled translation label corresponding to the training corpus and the final result output by the student model;
determining a third loss function according to the final result output by the teacher model and the final result output by the student model;
and determining the model loss function according to the first loss function, the second loss function and the third loss function.
3. The knowledge distillation-based machine translation model training method of claim 1, wherein the teacher model and the student model respectively comprise an embedding module, an encoding module, a decoding module and an output module.
4. The knowledge distillation-based machine translation model training method according to claim 3, wherein the embedding module, the encoding module and the output module of the student model have the same structure as those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a full connection layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
5. The knowledge-based distillation machine translation model training method of claim 1, wherein after acquiring the sample data set, the method further comprises: and preprocessing the training corpus.
6. The knowledge-based distillation machine translation model training method according to claim 5, wherein the preprocessing the corpus comprises:
converting characters in the training corpus into corresponding numerical values;
and dividing the training corpuses into different batches, and adjusting the training corpuses of each batch to be the same in length in a zero value filling mode.
7. A knowledge distillation-based machine translation model training device is characterized by comprising:
the model acquisition module is used for acquiring a trained teacher model and an untrained student model, the student model being obtained by simplifying some of the modules in the teacher model;
the sample acquisition module is used for acquiring a sample data set, wherein the sample data set comprises a plurality of training corpora and labeled translation labels corresponding to the training corpora;
the teacher model processing module is used for inputting the training corpora into the teacher model for processing to obtain intermediate contents output by the simplified module in the teacher model and a final result output by the teacher model;
the student model processing module is used for inputting the training corpus into the student model for processing to obtain intermediate content output by the simplified module in the student model and a final result output by the student model;
the model loss function determining module is used for determining a model loss function according to the labeled translation label corresponding to the training corpus, the intermediate content output by the simplified module in the teacher model, the final result output by the teacher model, the intermediate content output by the simplified module in the student model and the final result output by the student model;
and the model training module is used for carrying out iterative training on the student model according to the model loss function.
8. The knowledge distillation-based machine translation model training apparatus of claim 7, wherein the model loss function determining module comprises:
a first loss function determining unit, configured to determine a first loss function according to the intermediate content output by the simplified module in the teacher model and the intermediate content output by the simplified module in the student model;
a second loss function determining unit, configured to determine a second loss function according to the labeled translation label corresponding to the training corpus and the final result output by the student model;
a third loss function determining unit, configured to determine a third loss function according to the final result output by the teacher model and the final result output by the student model;
and a model loss function determining unit, configured to determine the model loss function according to the first loss function, the second loss function and the third loss function.
9. The knowledge distillation-based machine translation model training apparatus of claim 7, wherein the teacher model and the student model each comprise an embedding module, an encoding module, a decoding module and an output module.
10. The knowledge distillation-based machine translation model training apparatus according to claim 9, wherein the embedding module, the encoding module and the output module of the student model are identical in structure to those of the teacher model, the decoding module of the student model is obtained by simplifying the decoding module of the teacher model, and a fully connected layer is arranged between the decoding module of the student model and the decoding module of the teacher model.
11. The knowledge distillation-based machine translation model training apparatus of claim 7, further comprising: a preprocessing module, configured to preprocess the training corpus after the sample data set is acquired.
12. The knowledge distillation-based machine translation model training apparatus of claim 11, wherein the preprocessing module comprises:
a numerical value conversion unit, configured to convert the characters in the training corpus into corresponding numerical values;
and a length adjusting unit, configured to divide the training corpora into different batches and adjust the training corpora in each batch to the same length by zero-value padding.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 6 are implemented when the computer program is executed by the processor.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202010843014.5A 2020-08-20 2020-08-20 Knowledge distillation-based machine translation model training method, device, equipment and medium Active CN111950302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010843014.5A CN111950302B (en) 2020-08-20 2020-08-20 Knowledge distillation-based machine translation model training method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN111950302A true CN111950302A (en) 2020-11-17
CN111950302B CN111950302B (en) 2023-11-10

Family

ID=73358463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010843014.5A Active CN111950302B (en) 2020-08-20 2020-08-20 Knowledge distillation-based machine translation model training method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN111950302B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170031901A1 (en) * 2015-07-30 2017-02-02 Alibaba Group Holding Limited Method and Device for Machine Translation
WO2018126213A1 (en) * 2016-12-30 2018-07-05 Google Llc Multi-task learning using knowledge distillation
CN110506279A (en) * 2017-04-14 2019-11-26 易享信息技术有限公司 Using the neural machine translation of hidden tree attention
CN108090050A (en) * 2017-11-08 2018-05-29 江苏名通信息科技有限公司 Game translation system based on deep neural network
CN110059744A (en) * 2019-04-16 2019-07-26 腾讯科技(深圳)有限公司 Method, the method for image procossing, equipment and the storage medium of training neural network
CN110765966A (en) * 2019-10-30 2020-02-07 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN110765791A (en) * 2019-11-01 2020-02-07 清华大学 Automatic post-editing method and device for machine translation
CN111382582A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GEOFFREY HINTON et al.: "Distilling the Knowledge in a Neural Network", published online at HTTPS://ARXIV.ORG/ABS/1503.02531, pages 1 - 9 *
ZHIQING SUN et al.: "Mobilebert: Task-agnostic compression of bert by progressive knowledge transfer", ICLR 2020 Conference, pages 1 - 26 *
LIAO Shenglan et al.: "Intent Classification Method Based on BERT Model and Knowledge Distillation", Computer Engineering, vol. 47, no. 5, pages 73 - 79 *
LI Xiang et al.: "Improving the Translation Quality of Neural Machine Translation Compression Models Using Monolingual Data", Journal of Chinese Information Processing, vol. 33, no. 7, pages 46 - 55 *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597778A (en) * 2020-12-14 2021-04-02 华为技术有限公司 Training method of translation model, translation method and translation equipment
CN112541122A (en) * 2020-12-23 2021-03-23 北京百度网讯科技有限公司 Recommendation model training method and device, electronic equipment and storage medium
CN112784999A (en) * 2021-01-28 2021-05-11 开放智能机器(上海)有限公司 Mobile-v 1 knowledge distillation method based on attention mechanism, memory and terminal equipment
CN113011202B (en) * 2021-03-23 2023-07-25 中国科学院自动化研究所 End-to-end image text translation method, system and device based on multitasking training
CN113011202A (en) * 2021-03-23 2021-06-22 中国科学院自动化研究所 End-to-end image text translation method, system and device based on multi-task training
CN113160041A (en) * 2021-05-07 2021-07-23 深圳追一科技有限公司 Model training method and model training device
CN113160041B (en) * 2021-05-07 2024-02-23 深圳追一科技有限公司 Model training method and model training device
CN113435208A (en) * 2021-06-15 2021-09-24 北京百度网讯科技有限公司 Student model training method and device and electronic equipment
CN113435208B (en) * 2021-06-15 2023-08-25 北京百度网讯科技有限公司 Training method and device for student model and electronic equipment
CN113642605A (en) * 2021-07-09 2021-11-12 北京百度网讯科技有限公司 Model distillation method, device, electronic equipment and storage medium
CN113505614A (en) * 2021-07-29 2021-10-15 沈阳雅译网络技术有限公司 Small model training method for small CPU equipment
CN113505615A (en) * 2021-07-29 2021-10-15 沈阳雅译网络技术有限公司 Decoding acceleration method of small CPU (central processing unit) equipment-oriented neural machine translation system
WO2023212997A1 (en) * 2022-05-05 2023-11-09 五邑大学 Knowledge distillation based neural network training method, device, and storage medium
CN115438678A (en) * 2022-11-08 2022-12-06 苏州浪潮智能科技有限公司 Machine translation method, device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111950302B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN111950302B (en) Knowledge distillation-based machine translation model training method, device, equipment and medium
CN112487182B (en) Training method of text processing model, text processing method and device
CN110069790B (en) Machine translation system and method for contrasting original text through translated text retranslation
CN106484682A (en) Based on the machine translation method of statistics, device and electronic equipment
WO2023160472A1 (en) Model training method and related device
CN109582952B (en) Poetry generation method, poetry generation device, computer equipment and medium
CN112417092B (en) Intelligent text automatic generation system based on deep learning and implementation method thereof
CN113204633B (en) Semantic matching distillation method and device
CN112560456B (en) Method and system for generating generated abstract based on improved neural network
CN115268868B (en) Intelligent source code conversion method based on supervised learning
CN113505193A (en) Data processing method and related equipment
CN114398899A (en) Training method and device for pre-training language model, computer equipment and medium
CN113821635A (en) Text abstract generation method and system for financial field
CN116129902A (en) Cross-modal alignment-based voice translation method and system
CN112528598B (en) Automatic text abstract evaluation method based on pre-training language model and information theory
CN112765996B (en) Middle-heading machine translation method based on reinforcement learning and machine translation quality evaluation
CN116432637A (en) Multi-granularity extraction-generation hybrid abstract method based on reinforcement learning
CN115730590A (en) Intention recognition method and related equipment
CN110888944A (en) Attention convolution neural network entity relation extraction method based on multiple convolution window sizes
CN114239575B (en) Statement analysis model construction method, statement analysis method, device, medium and computing equipment
CN112287697A (en) Method for accelerating running speed of translation software in small intelligent mobile equipment
CN114238579B (en) Text analysis method, text analysis device, text analysis medium and computing equipment
CN117172232B (en) Audit report generation method, audit report generation device, audit report generation equipment and audit report storage medium
CN112685543B (en) Method and device for answering questions based on text
Peng Design and Construction of Machine Translation System Based on RNN Model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant