CN114580442B

CN114580442B - Model training method and device, electronic equipment and storage medium

Info

Publication number: CN114580442B
Application number: CN202210186688.1A
Authority: CN
Inventors: 高鹏至; 何中军; 李芝; 吴华
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2022-02-28
Filing date: 2022-02-28
Publication date: 2023-04-18
Anticipated expiration: 2042-02-28
Also published as: CN114580442A

Abstract

The disclosure provides a model training method and device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical field of natural language processing. The specific implementation scheme is as follows: acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence; for each group of sentence pairs, determining a first word vector of a first semantic element in a source language sentence contained in the sentence pair, and determining a second word vector of a second semantic element in a target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by utilizing a first word vector of the first semantic element and a second word vector of the second semantic element; and determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training the model by adopting the first loss function. The method and the device can reduce the complexity of the model training process.

Description

Model training method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of natural language processing technology.

Background

Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between a person and a computer using natural language.

Machine Translation (MT) is the process of translating text in one natural language (which may be called the source language) into text in another natural language (which may be called the target language) by computer. It is an important research field of NLP and one of the services commonly used on the internet at present. In recent years, neural network models have made significant progress on the machine translation task and have surpassed statistical machine translation; in particular, Neural Machine Translation (NMT) models based on the Transformer have achieved the best translation quality when trained on large amounts of data. However, as the model becomes more complex, the training of the model also becomes more complex.

Disclosure of Invention

The present disclosure provides a method, apparatus, device, and storage medium for model training.

According to an aspect of the present disclosure, there is provided a model training method, including:

acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;

for each set of sentence pairs, determining a first word vector for a first semantic element in the source language sentence contained in the sentence pair and determining a second word vector for a second semantic element in the target language sentence contained in the sentence pair;

determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;

and determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training the model by using the first loss function.

According to another aspect of the present disclosure, there is provided a model training apparatus including:

the sentence pair acquisition module is used for acquiring a plurality of groups of sentence pairs, and each group of sentence pairs comprises a source language sentence and a target language sentence;

a sample pair determination module for determining, for each set of the sentence pairs, a first word vector for a first semantic element in the source language sentence contained in the sentence pair and a second word vector for a second semantic element in the target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element;

and the training module is used for determining a first loss function by utilizing the sentence pairs and the corresponding sample pairs and training the model by adopting the first loss function.

According to another aspect of the present disclosure, there is provided an electronic device including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method described above.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to execute the above model training method.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described model training method.

According to the method and the device, the corresponding sample pairs are generated by adopting the multiple groups of sentence pairs, and the first loss function is determined by utilizing the sentence pairs and the corresponding sample pairs, so that a simple and effective mode for constructing the loss function is provided, and the complexity of a model training process is reduced.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic diagram of an application scenario according to the present disclosure;

FIG. 2 is a schematic flow chart of an implementation of a model training method according to an embodiment of the present disclosure;

FIG. 3 is a schematic flow chart illustrating an implementation of a model training method according to an embodiment of the present disclosure;

FIG. 4 is a schematic illustration of the effect of different hyper-parameters on the model on the IWSLT14 German-English data set according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a model training apparatus 500 according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a model training apparatus 600 according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device for implementing a model training method of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

Neural networks achieve high quality in natural language text translation. Taking the NMT model as an example, it uses neural-network-based techniques to produce translations that are more accurate in context, rather than translating a sentence piecemeal, one word at a time. A large artificial neural network computes the probability of a word sequence; NMT feeds the complete sentence into a single integrated model, so the whole sentence can be translated at once, and the output of the NMT model can be close to that of human translation.

However, as models become more complex, the over-fitting problem in model training becomes more prominent, especially in low-resource scenarios. Simply put, fitting refers to the dynamic process of finding the optimal parameters of the model. When the process is completed, the model may end up in one of several fitted states, such as over-fitting or under-fitting. Over-fitting is the phenomenon in which a model performs well on the training set but poorly on the test set. It occurs when a model learns the details and noise in the training data too closely: noise or random fluctuations in the training data are learned as if they were features of the data, but these features do not carry over to new data, so the generalization performance of the model deteriorates.

In order to alleviate the problem of model overfitting, regularization methods can be adopted, and consistency training is a commonly used regularization method. For example, in the related art, a semantic element (Token) clipping (Cutoff) method may be adopted to perform consistency regularization training. The Token Cutoff method randomly selects semantic elements (Tokens) included in a sentence and sets the word vector (Embedding) of each selected Token to zero. However, the loss function of the Token Cutoff method is very complex and introduces 4 hyper-parameters, so selecting the hyper-parameters is time-consuming and labor-intensive. This makes the model training process excessively complex and slow, and the problem is especially prominent when training resources are scarce.

To address this problem, the present disclosure provides a simple consistency regularization training method based on word vector discarding, which reduces the number of hyper-parameters and alleviates the overfitting problem of the model during training, thereby reducing the complexity of model training and improving the translation quality of the model.

The present disclosure proposes a model training method that may be applied to the application scenario shown in fig. 1. As shown in fig. 1, the model training apparatus may perform model training by using the training method provided in the present disclosure, and send the trained model to the translation platform. The model may be an NMT model. In the training process, the model training device may obtain training data from the data set platform and perform model training using the training data. The translation platform receives and stores the trained model from the model training device, and when a translation request of a user is received, the translation platform can translate a sentence to be translated, which is input by the user, by using the trained model to obtain a translated sentence, and feed the translated sentence back to the user. The model training device can update the model periodically, or update the model when receiving an update request of the translation platform, and resend the updated model to the translation platform.

Fig. 2 is a schematic diagram of an implementation flow of a model training method according to an embodiment of the present disclosure, including:

S210: acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;

S220: for each group of sentence pairs, determining a first word vector of a first semantic element in a source language sentence contained in the sentence pair, and determining a second word vector of a second semantic element in a target language sentence contained in the sentence pair;

S230: determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;

S240: determining a first loss function using the sentence pairs and the corresponding sample pairs, and training the model using the first loss function.

In some implementations, the disclosed embodiments can obtain sentence pairs from a data set platform. For example, the data set may be the IWSLT14 standard data set, which includes sentence pairs that can be used for model training. For instance, the IWSLT14 English-German standard data set (IWSLT14, En-De) includes a plurality of sentence pairs, each consisting of an English sentence and a German sentence with the same meaning.

In some embodiments, the sentence pair obtained by the embodiments of the present disclosure includes a source language sentence and a target language sentence, where the source language sentence and the target language sentence are expressed with the same meaning, the source language sentence may be a sentence before translation, and the target language sentence may be a sentence after translation.

A semantic element may refer to a Token included in a sentence; a semantic element (e.g., a Token) may be each character included in the sentence, or each word, etc. The word vector of a semantic element may refer to its Embedding. In the embodiments of the present disclosure, for ease of distinction, a semantic element in a source language sentence is referred to as a first semantic element, and the word vector corresponding to a semantic element in the source language sentence is referred to as a first word vector; a semantic element in a target language sentence is referred to as a second semantic element, and the word vector corresponding to a semantic element in the target language sentence is referred to as a second word vector. The terms "first" and "second" are used only for distinction and do not indicate sequence, importance, priority, or the like. Other names may also be used in the embodiments of the present disclosure to distinguish the semantic elements and corresponding word vectors in source language sentences and target language sentences, which is not limited in the embodiments of the present disclosure.

For example, in the embodiment of the present disclosure, N groups of sentence pairs are obtained, the sample pairs corresponding to each group of sentence pairs are determined by using the above steps S220 and S230, and N sample pairs are determined by using the N groups of sentence pairs, where N is a positive integer; and calculating a first loss function by adopting each sample pair, and adjusting model parameters according to the values of the first loss function, thereby realizing the training of the model. In some embodiments, when the number of times that the value of the first loss function satisfies the predetermined requirement reaches the preset threshold, the parameters of the model may not be adjusted any more, and the training of the model is ended.
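The stopping rule described above can be sketched as follows. This is a minimal illustration only: the text does not specify what the "predetermined requirement" is, so the sketch assumes it means "the loss value is at or below a bound". The names `should_stop`, `loss_requirement` and `count_threshold` are hypothetical and do not come from the disclosure.

```python
def should_stop(loss_values, loss_requirement, count_threshold):
    """Return True once enough recorded loss values satisfy the requirement.

    Assumption: a loss value "satisfies the predetermined requirement"
    when it is less than or equal to loss_requirement.
    """
    satisfied = sum(1 for value in loss_values if value <= loss_requirement)
    return satisfied >= count_threshold


# Usage: loss values recorded after successive parameter updates (made-up numbers).
recorded = [2.3, 1.1, 0.48, 0.52, 0.45, 0.41]
print(should_stop(recorded[:4], loss_requirement=0.5, count_threshold=3))  # False
print(should_stop(recorded, loss_requirement=0.5, count_threshold=3))      # True
```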

Taking sentence pair (x, y) as an example, x in sentence pair (x, y) represents a source language sentence, y represents a target language sentence, x and y are two different natural languages, and x and y have the same meaning.

For example, x and y each contain multiple semantic elements (e.g., Tokens). According to the above steps S220 and S230, the word vector corresponding to each Token included in x is determined, and the word vector corresponding to each Token included in y is determined. A sample pair corresponding to the sentence pair (x, y) is then determined by using the word vector corresponding to each Token contained in x and the word vector corresponding to each Token contained in y; the sample pair can be denoted as (x_cut, y_cut).

In the related art, one sentence pair generates a plurality of corresponding sample pairs. In contrast, here each sentence pair generates a single sample pair, so the first loss function is determined in a simple manner, and the complexity of model training can therefore be reduced.

In some possible embodiments, the determining the sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element includes:

zeroing the first word vectors of all first semantic elements in the source language sentence according to a preset probability, and zeroing the second word vectors of all second semantic elements in the target language sentence to obtain the sample pair;

wherein the pair of samples comprises a first sample and a second sample; the first sample comprises data results obtained by zeroing the first word vectors of the first semantic elements in the source language sentence according to the predetermined probability, and the second sample comprises data results obtained by zeroing the second word vectors of the second semantic elements in the target language sentence according to the predetermined probability.

It can be seen that, compared to the related art (such as Token Cutoff), in the embodiment of the present disclosure one sentence pair determines one sample pair. The process of determining the sample pair is therefore simpler, which in turn keeps the determination of the first loss function from the sentence pair and the sample pair simple and reduces the complexity of model training. In addition, zeroing a first word vector or a second word vector is equivalent to discarding that word vector, that is, part of the information in the training data is actively discarded, which reduces the overfitting that arises when the model fits the details of the training data too closely.

Specifically, let the predetermined probability be P_cut.

For the first word vector of each first semantic element in the source language sentence, each first word vector is replaced by a zero vector with probability P_cut and remains unchanged with probability 1 - P_cut;

for the second word vector of each second semantic element in the target language sentence, each second word vector is replaced by a zero vector with probability P_cut and remains unchanged with probability 1 - P_cut.

For example, the word vectors corresponding to the Tokens included in the sentence x are X1, X2, X3, ..., Xn. Taking the predetermined probability P_cut = 0.1 as an example, a random number is generated for each word vector; the random number takes the value 0 or 1, with probability 0.1 of being 0 and probability 1 - 0.1 = 0.9 of being 1. For X1, if the generated random number is 0, X1 is replaced with a zero vector, that is, X1 is set to all zeros; if the generated random number is 1, X1 is kept unchanged. X2 to Xn are then processed in the same manner. The data result obtained after zeroing X1 to Xn according to the predetermined probability is the first sample in the sample pair, i.e., x_cut in the sample pair (x_cut, y_cut).

The word vectors corresponding to the Tokens included in the sentence y are Y1, Y2, Y3, ..., Ym. Y1 to Ym are processed in the same manner. The data result obtained after zeroing Y1 to Ym according to the predetermined probability is the second sample in the sample pair, i.e., y_cut in the sample pair (x_cut, y_cut).

In this operation, each word vector in a sentence is replaced by the zero vector independently of the others; that is, details in different parts of the training data are actively discarded independently of one another, which ensures the effect of reducing model overfitting.
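The zeroing procedure described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the implementation of the disclosure: word vectors are represented as lists of floats, and the function names `cut_embeddings` and `make_sample_pair` as well as the `seed` parameter are hypothetical.

```python
import random


def cut_embeddings(embeddings, p_cut, rng):
    """Zero each word vector independently with probability p_cut."""
    result = []
    for vector in embeddings:
        if rng.random() < p_cut:
            result.append([0.0] * len(vector))  # discard this word vector
        else:
            result.append(list(vector))         # keep it unchanged
    return result


def make_sample_pair(src_embeddings, tgt_embeddings, p_cut, seed=0):
    """Build (x_cut, y_cut) from the word vectors of a sentence pair (x, y)."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    x_cut = cut_embeddings(src_embeddings, p_cut, rng)
    y_cut = cut_embeddings(tgt_embeddings, p_cut, rng)
    return x_cut, y_cut


# Usage with toy 2-dimensional word vectors (made-up values).
x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = [[0.7, 0.8], [0.9, 1.0]]
x_cut, y_cut = make_sample_pair(x, y, p_cut=0.1)
print(len(x_cut), len(y_cut))
```

Each sentence pair yields exactly one sample pair, and the sequence lengths are preserved because vectors are zeroed rather than removed.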

After the sample pairs corresponding to each sentence pair are determined, a first loss function, which may also be referred to as an objective function, an objective loss function, or the like, may be determined using the sentence pairs and the corresponding sample pairs.

As shown in fig. 3, in some embodiments, the manner of determining the first loss function includes:

S310: determining a cross entropy function of the first probability distribution and the labels of the target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to a sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;

S320: determining a first loss function by using the cross entropy function, the relative entropy function and a preset hyper-parameter.

For example, the first loss function is determined using the following equation:

$$L_{simcut}(\theta) = L_{ce}(\theta) + \alpha L_{simkl}(\theta)$$

wherein L_simcut(θ) represents the first loss function;

L_ce(θ) represents the cross entropy function of the first probability distribution and the labels of the target language sentences contained in the sentence pairs;

$$L_{ce}(\theta) = \ell\big(f(x, y; \theta), \ddot{y}\big)$$

wherein ℓ represents the cross entropy function;

f(x, y; θ) represents the first probability distribution, corresponding to a sentence pair (e.g., (x, y));

θ represents a parameter of the model;

ÿ represents the label of the target language sentence, such as a one-hot vector for sentence y. A one-hot vector is a vector form that is easy for machine learning algorithms to use: it encodes an attribute as a feature vector in which exactly one element is active (non-zero) and all other elements are 0.

L_simkl(θ) represents the relative entropy function of the first probability distribution and the second probability distribution. Relative entropy may also be referred to as KL divergence (Kullback-Leibler divergence) or information divergence; it is an asymmetric measure of the difference between two probability distributions.

$$L_{simkl}(\theta) = \mathrm{KL}\big(f(x, y; \theta) \,\|\, f(x_{cut}, y_{cut}; \theta)\big)$$

wherein KL(·||·) represents the KL divergence of two probability distributions;

θ represents a parameter of the model;

f(x_cut, y_cut; θ) represents the second probability distribution, which corresponds to the sample pair (e.g., (x_cut, y_cut)).

α represents a preset hyper-parameter.
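The first loss function can be illustrated numerically with a small sketch. The assumptions here are the author's of this sketch, not of the disclosure: the loss is shown for a single target position, probability distributions are plain lists that sum to 1, the label is given as the index of the active element of the one-hot vector, and every entry of the second distribution is strictly positive. The function names are hypothetical.

```python
import math


def cross_entropy(probs, label_index):
    """Cross entropy between a distribution and a one-hot label."""
    return -math.log(probs[label_index])


def kl_divergence(p, q):
    """KL(p || q); assumes every q_i > 0 and skips terms where p_i == 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def simcut_loss(p_clean, p_cut, label_index, alpha):
    """L_simcut = L_ce + alpha * L_simkl for one target position."""
    return cross_entropy(p_clean, label_index) + alpha * kl_divergence(p_clean, p_cut)


# Usage: made-up model outputs for (x, y) and for (x_cut, y_cut).
p_clean = [0.7, 0.2, 0.1]
p_cut = [0.5, 0.3, 0.2]
print(simcut_loss(p_clean, p_cut, label_index=0, alpha=3.0))
# KL divergence is asymmetric: swapping the arguments changes the value.
print(kl_divergence(p_clean, p_cut), kl_divergence(p_cut, p_clean))
```

For a whole sentence, a per-position loss of this kind would typically be accumulated over all target positions; how the disclosure aggregates over positions is not stated in the text.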

It can be seen that the training method provided by the embodiment of the present disclosure uses two hyper-parameters, namely α and the predetermined probability P_cut mentioned above. Compared with the four hyper-parameters adopted by the Token Cutoff method, this reduces the number of hyper-parameters, so the hyper-parameter selection process can be simplified and the overall complexity of training is reduced. As the training complexity decreases, the training effect can be improved, which alleviates the overfitting problem of the model, improves its generalization capability, and improves its translation quality.

In some embodiments, when the model is trained using the first loss function, the parameters of the model may be adjusted by gradient descent on the first loss function, and during the adjustment, bidirectional back-propagation (Bi-backpropagation) is performed through both the model parameters corresponding to the first probability distribution and the model parameters corresponding to the second probability distribution. In other words, during gradient descent, gradients flow through the model parameters (such as θ mentioned above) of both probability distributions (i.e., the first probability distribution and the second probability distribution) in the KL(·||·) function when the model parameters are adjusted. This can further improve the model training speed, thereby improving the model training effect.
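The difference between back-propagating through both sides of the KL term and through only one side can be seen on a toy example. The sketch below is an assumption-laden illustration: the "model" is a hypothetical single-parameter softmax over `theta * feature`, and a central finite difference stands in for automatic differentiation. In the one-sided variant the first distribution is frozen at the current parameter value, which mimics what a stop-gradient would do in a deep learning framework.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def predict(theta, features):
    """Toy one-parameter model (hypothetical): softmax(theta * features)."""
    return softmax([theta * f for f in features])


def kl_gradient(theta, x, x_cut, both_sides, eps=1e-6):
    """Finite-difference d/dtheta of KL(f(x; theta) || f(x_cut; theta)).

    both_sides=True : theta varies in both distributions.
    both_sides=False: the first distribution is held fixed at theta.
    """
    def loss(t):
        p = predict(t if both_sides else theta, x)
        q = predict(t, x_cut)
        return kl(p, q)
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)


x = [1.0, 2.0, 3.0]
x_cut = [1.0, 0.0, 3.0]  # second feature zeroed, as in word vector discarding
print(kl_gradient(0.5, x, x_cut, both_sides=True))
print(kl_gradient(0.5, x, x_cut, both_sides=False))
```

The two printed gradients differ, which shows that allowing gradients through both distributions changes the parameter update.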

Table 1 shows comparison data between the effect of the model obtained by training with the model training method provided in the embodiment of the present disclosure and the effect of the model obtained by training with the existing Virtual Adversarial Training (VAT) method. In Table 1 below, the model training method proposed by the embodiment of the present disclosure is denoted by SimCut.

TABLE 1

Bilingual Evaluation Understudy (BLEU) is a metric for evaluating the quality of machine translation. Its underlying idea is that the closer a machine translation result is to a professional human translation, the better. The larger the BLEU value, the better the translation model.
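The idea behind BLEU can be made concrete with a heavily simplified sketch. This is not the evaluation script behind the tables: it scores one candidate sentence against one reference, uses n-grams only up to `max_n` (2 by default here), applies no smoothing, and the name `simple_bleu` is hypothetical. Standard BLEU is computed at corpus level, usually with n-grams up to 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU with clipped n-gram precision."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat".split()
print(simple_bleu("the cat sat on the mat".split(), reference))
print(simple_bleu("the cat sat".split(), reference))
```

A candidate identical to the reference scores 1.0, while the shorter candidate is penalized by the brevity term even though all of its n-grams match.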

Table 1 shows data statistics of model evaluation on the IWSLT14 German-English data set. As shown in Table 1, for both "English->German" and "German->English" translation, the model trained with the existing VAT method performs worse than the model trained with the SimCut method proposed in the present disclosure, and the model trained with bidirectional back-propagation allowed performs better than the model trained without it.

FIG. 4 is a graph of the effect of different hyper-parameters on the model on the IWSLT14 German-English data set according to an embodiment of the present disclosure. In FIG. 4, the ordinate is P_cut and the abscissa is α; as P_cut and α vary, the BLEU of the trained model on "German->English" translation on the IWSLT14 German-English data set varies. The model training method provided by the embodiment of the present disclosure has only two hyper-parameters, P_cut and α, so the hyper-parameter selection process can be greatly simplified and the complexity of model training is reduced.

Table 2 shows comparison data of the effect of the model obtained by training with the model training method proposed in the embodiment of the present disclosure and the effect of the model obtained by training with several existing training methods. In table 2 below, the model training method proposed by the embodiment of the present disclosure is expressed by SimCut.

TABLE 2

Table 2 shows data statistics of model evaluation on the IWSLT14 German-English data set. As shown in Table 2, for both "English->German" and "German->English" translation, the model trained with the SimCut method proposed by the present disclosure performs better than the models trained with the several existing training methods.

In addition, the robustness performance of the training method provided by the embodiment of the disclosure is also obviously improved. Table 3 shows comparison data of robustness of the model trained by the model training method proposed in the embodiment of the present disclosure and robustness of the model trained by several existing training methods. In table 3 below, the model training method proposed by the embodiment of the present disclosure is represented by SimCut.

TABLE 3

Table 3 shows data statistics of model evaluation on the IWSLT14 German-English data set. The probability (Probability) represents the probability that each Token in the sentence to be translated is randomly replaced by another Token; the larger the probability, the noisier the sentence to be translated. As shown in Table 3, for "German->English" translation, at every noise probability of the sentence to be translated, the model trained with the SimCut method proposed by the present disclosure performs better than the models trained with the other training methods.
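The noise used in this robustness evaluation can be sketched as follows. This is a minimal illustration assuming that a replaced Token is drawn uniformly from the rest of a given vocabulary; the text does not say how the replacement Token is chosen, and the names `add_replacement_noise`, `vocabulary` and `seed` are hypothetical.

```python
import random


def add_replacement_noise(tokens, vocabulary, probability, seed=0):
    """Replace each token by a different vocabulary token with the given probability."""
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    noisy = []
    for token in tokens:
        if rng.random() < probability:
            # Assumption: draw uniformly from all other vocabulary entries.
            candidates = [word for word in vocabulary if word != token]
            noisy.append(rng.choice(candidates) if candidates else token)
        else:
            noisy.append(token)
    return noisy


vocabulary = ["das", "ist", "ein", "haus", "hund", "gut"]
sentence = ["das", "ist", "ein", "haus"]
print(add_replacement_noise(sentence, vocabulary, probability=0.1))
```

Raising `probability` increases the expected number of replaced Tokens, which corresponds to a noisier sentence to be translated.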

In addition, the translation effect of the SimCut training method provided by the embodiment of the present disclosure on the actual sign language translation data set is also significantly improved, as shown in Table 4 below.

TABLE 4

The embodiment of the present disclosure further provides a model training apparatus, and fig. 5 is a schematic structural diagram of a model training apparatus 500 according to the embodiment of the present disclosure, where the model training apparatus includes:

a sentence pair obtaining module 510, configured to obtain a plurality of sets of sentence pairs, where each set of sentence pairs includes a source language sentence and a target language sentence;

a sample pair determining module 520, configured to determine, for each set of sentence pairs, a first word vector of a first semantic element in the source language sentence included in the sentence pair, and determine a second word vector of a second semantic element in the target language sentence included in the sentence pair; determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;

a training module 530, configured to determine a first loss function using the sentence pairs and the corresponding sample pairs, and train the model using the first loss function.

Fig. 6 is a schematic structural diagram of a model training apparatus 600 according to an embodiment of the present disclosure, and as shown in fig. 6, in a possible implementation manner, the training module 530 includes:

a loss function generation submodule 531, configured to determine a cross entropy function of the first probability distribution and the labels of the target language sentences contained in the sentence pairs, and determine a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;

determining the first loss function by utilizing the cross entropy function, the relative entropy function and a preset hyper-parameter;

an adjusting submodule 532, configured to adjust the parameters of the model by gradient descent on the first loss function and, during the adjustment, perform bidirectional back-propagation through the model parameters corresponding to the first probability distribution and the model parameters corresponding to the second probability distribution.

In some possible embodiments, the sample pair determining module 520 is configured to:

zeroing the first word vector of each first semantic element in the source language sentence according to a preset probability, and zeroing the second word vector of each second semantic element in the target language sentence to obtain the sample pair;

wherein the pair of samples comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.

In some possible embodiments, the preset probability is P_cut;

The sample pair determination module 520 is configured to: for the first word vector of each first semantic element in the source language sentence, control each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut; and, for the second word vector of each second semantic element in the target language sentence, control each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.
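The zeroing operation described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: word vectors are represented as plain Python lists, and the names `zero_word_vectors`, `make_sample_pair`, `p_cut` and `seed` are hypothetical.

```python
import random

def zero_word_vectors(word_vectors, p_cut, rng):
    """Return a copy in which each word vector is independently replaced by a
    zero vector with probability p_cut and kept unchanged otherwise."""
    result = []
    for vec in word_vectors:
        if rng.random() < p_cut:
            result.append([0.0] * len(vec))  # replaced by a zero vector
        else:
            result.append(list(vec))         # kept unchanged
    return result

def make_sample_pair(source_vectors, target_vectors, p_cut, seed=None):
    """Build the (first sample, second sample) pair for one sentence pair."""
    rng = random.Random(seed)
    first_sample = zero_word_vectors(source_vectors, p_cut, rng)
    second_sample = zero_word_vectors(target_vectors, p_cut, rng)
    return first_sample, second_sample

# Toy word vectors (hypothetical): 3 source tokens and 2 target tokens.
source = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target = [[1.0, 1.1], [1.2, 1.3]]
first_sample, second_sample = make_sample_pair(source, target, p_cut=0.5, seed=0)
print(first_sample, second_sample)
```

Each sentence pair yields exactly one sample pair, and the original word vectors are left untouched so that the sentence pair itself can still be fed to the model.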

In some possible embodiments, the model comprises a Neural Machine Translation (NMT) model.

For the functions of each module and/or unit in the apparatus embodiments of the present disclosure, reference may be made to the corresponding descriptions in the method embodiments above, which are not repeated here.

In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order and good customs.

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.

As shown in Fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

A number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the model training method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the model training method by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims

1. A model training method, comprising:

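acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;

for each set of the sentence pairs, determining a first word vector for a first semantic element in the source language sentence contained in the sentence pair and determining a second word vector for a second semantic element in the target language sentence contained in the sentence pair;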

determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element, wherein one sentence pair generates one corresponding sample pair;

determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training a model by using the first loss function;

wherein said determining a first loss function using said sentence pairs and corresponding sample pairs comprises:

determining a cross entropy function of the first probability distribution and tags of target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;

determining the first loss function by utilizing the cross entropy function, the relative entropy function and a preset hyper-parameter; wherein the first loss function is the sum of the product of the relative entropy function and the preset hyper-parameter and the cross entropy function.

2. The method of claim 1, wherein the determining a sample pair corresponding to the sentence pair using the first word vector of the first semantic element and the second word vector of the second semantic element comprises:

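zeroing the first word vector of each first semantic element in the source language sentence according to a preset probability, and zeroing the second word vector of each second semantic element in the target language sentence according to the preset probability, to obtain the sample pair;

wherein the sample pair comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.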

3. The method of claim 2, wherein the preset probability is P_cut;

the zeroing the first word vector of each first semantic element in the source language sentence according to the preset probability includes: for the first word vector of each first semantic element in the source language sentence, controlling each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut;

the zeroing the second word vector of each second semantic element in the target language sentence according to the preset probability includes: for the second word vector of each second semantic element in the target language sentence, controlling each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.

4. The method of claim 1, wherein said training a model with said first loss function comprises:

adjusting the parameters of the model by a gradient descent method using the first loss function, wherein, during the adjustment, gradients are propagated back on both sides, that is, through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution.

5. The method according to any of claims 1 to 4, wherein the model comprises a Neural Machine Translation (NMT) model.

6. A model training apparatus comprising:

a sample pair determination module for determining, for each set of the sentence pairs, a first word vector for a first semantic element in the source language sentence contained in the sentence pair and a second word vector for a second semantic element in the target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element, wherein one sentence pair generates one corresponding sample pair;

the training module is used for determining a first loss function by utilizing the sentence pairs and the corresponding sample pairs and training a model by adopting the first loss function;

wherein the training module comprises:

a loss function generation submodule for determining a cross entropy function of the first probability distribution and the tags of the target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;

determining the first loss function by using the cross entropy function, the relative entropy function and a preset hyper-parameter; wherein the first loss function is the sum of the product of the relative entropy function and the preset hyperparameter and the cross entropy function.

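7. The apparatus of claim 6, wherein the sample pair determination module is configured to:

zero the first word vector of each first semantic element in the source language sentence according to a preset probability, and zero the second word vector of each second semantic element in the target language sentence according to the preset probability, to obtain the sample pair;

wherein the sample pair comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.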

8. The apparatus of claim 7, wherein the preset probability is P_cut;

the sample pair determination module is configured to: for the first word vector of each first semantic element in the source language sentence, control each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut; and, for the second word vector of each second semantic element in the target language sentence, control each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.

9. The apparatus of claim 6, wherein the training module comprises:

an adjusting submodule, configured to adjust the parameters of the model by a gradient descent method using the first loss function, wherein, during the adjustment, gradients are propagated back on both sides, that is, through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution.

10. The apparatus of any one of claims 6 to 9, wherein the model comprises a Neural Machine Translation (NMT) model.

11. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.

12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.