CN114580442B - Model training method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114580442B
Authority
CN
China
Prior art keywords
sentence
pair
word vector
model
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210186688.1A
Other languages
Chinese (zh)
Other versions
CN114580442A (en
Inventor
高鹏至
何中军
李芝
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210186688.1A priority Critical patent/CN114580442B/en
Publication of CN114580442A publication Critical patent/CN114580442A/en
Application granted granted Critical
Publication of CN114580442B publication Critical patent/CN114580442B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/047 - Probabilistic or stochastic networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The disclosure provides a model training method and device, electronic equipment and a storage medium, and relates to the technical field of artificial intelligence, in particular to the technical field of natural language processing. The specific implementation scheme is as follows: acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence; for each group of sentence pairs, determining a first word vector of a first semantic element in a source language sentence contained in the sentence pair, and determining a second word vector of a second semantic element in a target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by utilizing a first word vector of the first semantic element and a second word vector of the second semantic element; and determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training the model by adopting the first loss function. The method and the device can reduce the complexity of the model training process.

Description

Model training method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the field of natural language processing technology.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable efficient communication between a person and a computer using natural language.
Machine Translation (MT) is the process of translating text in one natural language (the source language) into text in another natural language (the target language) by machine. It is an important research field of NLP and one of the most commonly used Internet services at present. In recent years, neural network models have made significant progress on the machine translation task and have surpassed statistical machine translation; in particular, Neural Machine Translation (NMT) models based on the Transformer have achieved the best translation quality when trained on large amounts of data. However, as the models become more complex, training them also becomes more complex.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for model training.
According to an aspect of the present disclosure, there is provided a model training method, including:
acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;
for each set of sentence pairs, determining a first word vector for a first semantic element in the source language sentence contained in the sentence pair and determining a second word vector for a second semantic element in the target language sentence contained in the sentence pair;
determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;
and determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training the model by using the first loss function.
According to another aspect of the present disclosure, there is provided a model training apparatus including:
the sentence pair acquisition module is used for acquiring a plurality of groups of sentence pairs, and each group of sentence pairs comprises a source language sentence and a target language sentence;
a sample pair determination module for determining, for each set of the sentence pairs, a first word vector for a first semantic element in the source language sentence contained in the sentence pair and a second word vector for a second semantic element in the target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element;
and the training module is used for determining a first loss function by utilizing the sentence pairs and the corresponding sample pairs and training the model by adopting the first loss function.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the model training method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to execute the above model training method.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described model training method.
According to the method and the device, the corresponding sample pairs are generated by adopting the multiple groups of sentence pairs, and the first loss function is determined by utilizing the sentence pairs and the corresponding sample pairs, so that a simple and effective mode for constructing the loss function is provided, and the complexity of a model training process is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of an application scenario according to the present disclosure;
FIG. 2 is a schematic flow chart of an implementation of a model training method according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating an implementation of a model training method according to an embodiment of the present disclosure;
FIG. 4 is a schematic illustration of the effect of different hyper-parameters on the model on the IWSLT14 German-English data set according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a model training apparatus 500 according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a model training apparatus 600 according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a model training method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Neural networks achieve high quality in natural language text translation. Taking the NMT model as an example, the NMT model uses neural-network-based techniques to produce translations that are accurate in context, rather than translating a sentence piecemeal, one word at a time. A large artificial neural network is used to compute the probability of a word sequence; the NMT model feeds a complete sentence into a single integrated model and translates the whole sentence at once, so that its output can be close to that of human translation.
However, as models become more complex, overfitting during model training becomes more prominent, especially in low-resource scenarios. Simply put, fitting refers to the dynamic process of finding the optimal parameters of the model. When this process completes, the model may end up in one of several states, such as over-fitted or under-fitted. Overfitting means that the model performs well on the training set but poorly on the test set. It occurs when the model learns the details and noise in the training data so thoroughly that it performs poorly on new data: noise or random fluctuations in the training data are learned as if they were features, but these features do not apply to new data, so the generalization performance of the model deteriorates.
In order to alleviate the problem of model overfitting, a regularization method and the like can be adopted, and consistency training (consistency training) is a commonly used regularization method. For example, in the related art, a semantic element (Token) clipping (Cutoff) method may be adopted to perform consistency regularization training, and the Token Cutoff method may randomly select a semantic element (Token) included in a sentence, and set a word vector (Embedding) of the Token to zero. However, the loss function of the Token Cutoff method is very complex, and 4 hyper-parameters are introduced, so that the selection of the hyper-parameters is time-consuming and labor-consuming, which results in an excessively complex and time-consuming model training process, and especially in the case of training resource shortage, the problem is especially prominent.
To address this problem, the present disclosure provides a simple consistency regularization training method based on word vector discarding, which reduces the number of hyper-parameters and alleviates overfitting of the model during training, thereby reducing the complexity of model training and improving the translation quality of the model.
The present disclosure proposes a model training method that may be applied to the application scenario shown in fig. 1. As shown in fig. 1, the model training apparatus may perform model training by using the training method provided in the present disclosure, and send the trained model to the translation platform. The model may be an NMT model. In the training process, the model training device may obtain training data from the data set platform and perform model training using the training data. The translation platform receives and stores the trained model from the model training device, and when a translation request of a user is received, the translation platform can translate a sentence to be translated, which is input by the user, by using the trained model to obtain a translated sentence, and feed the translated sentence back to the user. The model training device can update the model periodically, or update the model when receiving an update request of the translation platform, and resend the updated model to the translation platform.
Fig. 2 is a schematic diagram of an implementation flow of a model training method according to an embodiment of the present disclosure, including:
S210: acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;
S220: for each group of sentence pairs, determining a first word vector of a first semantic element in a source language sentence contained in the sentence pair, and determining a second word vector of a second semantic element in a target language sentence contained in the sentence pair;
S230: determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;
S240: determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training the model by using the first loss function.
In some implementations, the embodiments of the present disclosure can obtain sentence pairs from a data set platform. For example, the data set may be the IWSLT14 standard data set, which includes sentence pairs that can be used for model training; for instance, the IWSLT14 English-German standard data set (IWSLT14, en-de) includes a plurality of sentence pairs, each comprising an English sentence and a German sentence with the same meaning.
In some embodiments, the sentence pair obtained by the embodiments of the present disclosure includes a source language sentence and a target language sentence, where the source language sentence and the target language sentence are expressed with the same meaning, the source language sentence may be a sentence before translation, and the target language sentence may be a sentence after translation.
A semantic element may refer to a Token included in a sentence; a semantic element (e.g., a Token) may be each character included in the sentence, or may be each word, etc. The word vector of a semantic element may refer to its Embedding. In the embodiment of the disclosure, for ease of distinction, a semantic element in a source language sentence is referred to as a first semantic element, and the word vector corresponding to a semantic element in the source language sentence is referred to as a first word vector; a semantic element in the target language sentence is referred to as a second semantic element, and the word vector corresponding to a semantic element in the target language sentence is referred to as a second word vector. "First" and "second" are used only for distinction and do not indicate sequence, importance, priority, or the like; other names may also be used in the embodiments of the present disclosure to distinguish the semantic elements and corresponding word vectors in source language sentences and target language sentences, which is not limited in the embodiments of the present disclosure.
For example, in the embodiment of the present disclosure, N groups of sentence pairs are obtained, the sample pairs corresponding to each group of sentence pairs are determined by using the above steps S220 and S230, and N sample pairs are determined by using the N groups of sentence pairs, where N is a positive integer; and calculating a first loss function by adopting each sample pair, and adjusting model parameters according to the values of the first loss function, thereby realizing the training of the model. In some embodiments, when the number of times that the value of the first loss function satisfies the predetermined requirement reaches the preset threshold, the parameters of the model may not be adjusted any more, and the training of the model is ended.
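The stopping rule described above can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: `train_step`, `meets_requirement` and `hit_threshold` are hypothetical names, and the sketch assumes that the count of loss values satisfying the requirement is cumulative rather than consecutive, which the text does not specify.

```python
from typing import Callable, Iterable, TypeVar

Batch = TypeVar("Batch")


def train(
    batches: Iterable[Batch],
    train_step: Callable[[Batch], float],
    meets_requirement: Callable[[float], bool],
    hit_threshold: int,
) -> int:
    """Run training steps until the loss value has met the requirement
    `hit_threshold` times; return the number of steps performed."""
    hits = 0
    steps = 0
    for batch in batches:
        # Hypothetical step: computes the first loss function on the batch,
        # adjusts the model parameters, and returns the loss value.
        loss = train_step(batch)
        steps += 1
        if meets_requirement(loss):
            hits += 1
            if hits >= hit_threshold:
                break  # stop adjusting the parameters; training ends
    return steps


# Usage with a stand-in step that simply replays recorded loss values.
recorded_losses = [1.0, 0.4, 0.6, 0.3, 0.2, 0.1]
steps_run = train(
    batches=recorded_losses,
    train_step=lambda loss: loss,
    meets_requirement=lambda loss: loss < 0.5,
    hit_threshold=3,
)
print(steps_run)  # 5
```

In the usage above, the requirement (loss below 0.5, an arbitrary choice for illustration) is met at steps 2, 4 and 5, so training stops after the fifth step.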
Taking sentence pair (x, y) as an example, x in sentence pair (x, y) represents a source language sentence, y represents a target language sentence, x and y are two different natural languages, and x and y have the same meaning.
For example, x and y each contain multiple semantic elements (e.g., Tokens). According to the above steps S220 and S230, the word vector corresponding to each Token contained in x is determined, and the word vector corresponding to each Token contained in y is determined. A sample pair corresponding to the sentence pair (x, y) is then determined by using the word vector corresponding to each Token contained in x and the word vector corresponding to each Token contained in y; the sample pair can be denoted as $(x_{cut}, y_{cut})$.
In the related art, one sentence pair generates a plurality of corresponding sample pairs. In the embodiment of the present disclosure, by contrast, one sentence pair yields one sample pair, so the first loss function is determined in a simpler manner and the complexity of model training can be reduced.
In some possible embodiments, the determining the sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element includes:
zeroing the first word vector of each first semantic element in the source language sentence according to a predetermined probability, and zeroing the second word vector of each second semantic element in the target language sentence according to the predetermined probability, to obtain the sample pair;
wherein the pair of samples comprises a first sample and a second sample; the first sample comprises data results obtained by zeroing the first word vectors of the first semantic elements in the source language sentence according to the predetermined probability, and the second sample comprises data results obtained by zeroing the second word vectors of the second semantic elements in the target language sentence according to the predetermined probability.
It can be seen that, compared with the related art (such as Token Cutoff), in the embodiment of the present disclosure one sentence pair determines one sample pair. The process of determining the sample pair is therefore simpler, which in turn keeps the manner of determining the first loss function from the sentence pair and the sample pair simple and reduces the complexity of training. Moreover, zeroing a first word vector or a second word vector is equivalent to discarding that word vector, i.e., actively discarding part of the information in the training data, which reduces the overfitting that arises when the model fits the details of the training data too closely.
Specifically, if the predetermined probability is $P_{cut}$:
for the first word vector of each first semantic element in the source language sentence, each first word vector is replaced by a zero vector with probability $P_{cut}$ and remains unchanged with probability $1 - P_{cut}$;
for the second word vector of each second semantic element in the target language sentence, each second word vector is replaced by a zero vector with probability $P_{cut}$ and remains unchanged with probability $1 - P_{cut}$.
For example, the word vectors corresponding to the Tokens included in sentence x are X1, X2, X3, …, Xn. Taking a predetermined probability $P_{cut} = 0.1$ as an example, a random number is generated whose possible values are 0 and 1; the probability of generating 0 is 0.1, and the probability of generating 1 is 1 - 0.1 = 0.9. For X1, if the generated random number is 0, X1 is replaced with a zero vector, i.e., X1 is set to all zeros; if the generated random number is 1, X1 is kept unchanged. X2 to Xn are then processed in the same manner. The data result obtained after zeroing X1 to Xn according to the predetermined probability is the first sample in the sample pair, such as $x_{cut}$ in the sample pair $(x_{cut}, y_{cut})$.
The word vectors corresponding to the Tokens included in sentence y are Y1, Y2, Y3, …, Ym. Y1 to Ym are processed in the same manner. The data result obtained after zeroing Y1 to Ym according to the predetermined probability is the second sample in the sample pair, such as $y_{cut}$ in the sample pair $(x_{cut}, y_{cut})$.
It can be seen that, in the above operation, the probability that each word vector in a sentence is replaced by the zero vector is independent of the others; that is, the details in each part of the training data are discarded independently of one another, which ensures the effect of reducing model overfitting.
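The zeroing operation described above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: the function names `cut_word_vectors` and `make_sample_pair` are hypothetical, word vectors are represented as lists of floats, and one independent random draw is made per word vector.

```python
import random
from typing import List, Optional, Tuple

Vector = List[float]


def cut_word_vectors(
    vectors: List[Vector], p_cut: float, rng: Optional[random.Random] = None
) -> List[Vector]:
    """Replace each word vector by a zero vector with probability p_cut,
    and keep it unchanged with probability 1 - p_cut."""
    if not 0.0 <= p_cut <= 1.0:
        raise ValueError("p_cut must lie in [0, 1]")
    rng = rng or random.Random()
    result = []
    for vec in vectors:
        # One independent draw per word vector.
        if rng.random() < p_cut:
            result.append([0.0] * len(vec))
        else:
            result.append(list(vec))
    return result


def make_sample_pair(
    x_vectors: List[Vector],
    y_vectors: List[Vector],
    p_cut: float,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Vector], List[Vector]]:
    """Build (x_cut, y_cut) from the word vectors of a sentence pair (x, y)."""
    rng = rng or random.Random()
    return (
        cut_word_vectors(x_vectors, p_cut, rng),
        cut_word_vectors(y_vectors, p_cut, rng),
    )


# Usage: toy 2-dimensional word vectors for a three-token source sentence
# and a two-token target sentence.
x = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
y = [[0.7, 0.8], [0.9, 1.0]]
x_cut, y_cut = make_sample_pair(x, y, p_cut=0.1, rng=random.Random(0))
```

Each sentence pair produces exactly one sample pair, and the inputs are copied rather than modified so that the original word vectors remain available for computing the first probability distribution.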
After the sample pairs corresponding to each sentence pair are determined, a first loss function, which may also be referred to as an objective function, an objective loss function, or the like, may be determined using the sentence pairs and the corresponding sample pairs.
As shown in fig. 3, in some embodiments, the manner of determining the first loss function includes:
S310: determining a cross entropy function of the first probability distribution and the labels of the target language sentence contained in the sentence pair, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair, and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;
S320: determining the first loss function by using the cross entropy function, the relative entropy function and a preset hyper-parameter.
For example, the first loss function is determined using the following equation:
$$L_{simcut}(\theta) = L_{ce}(\theta) + \alpha L_{simkl}(\theta);$$
wherein $L_{simcut}(\theta)$ represents the first loss function;
$L_{ce}(\theta)$ represents the cross entropy function of the first probability distribution and the labels of the target language sentence contained in the sentence pair:
$$L_{ce}(\theta) = \ell(f(x, y; \theta), \ddot{y});$$
wherein $\ell$ represents the cross entropy function;
$f(x, y; \theta)$ represents the first probability distribution, which corresponds to the sentence pair (e.g., $(x, y)$);
$\theta$ represents the parameters of the model;
$\ddot{y}$ represents the label of the target language sentence, such as the one-hot vector of sentence y. A one-hot vector is a vector form that machine learning algorithms can readily use: it encodes an attribute as a feature vector in which exactly one element is active (non-zero) and all other elements are 0.
$L_{simkl}(\theta)$ represents the relative entropy function of the first probability distribution and the second probability distribution. Relative entropy, also called KL divergence (Kullback-Leibler divergence) or information divergence, is an asymmetric measure of the difference between two probability distributions:
$$L_{simkl}(\theta) = KL(f(x, y; \theta) \,\|\, f(x_{cut}, y_{cut}; \theta));$$
wherein $KL(\cdot \,\|\, \cdot)$ represents the KL divergence of two probability distributions;
$f(x_{cut}, y_{cut}; \theta)$ represents the second probability distribution, which corresponds to the sample pair (e.g., $(x_{cut}, y_{cut})$);
$\alpha$ represents a preset hyper-parameter.
It can be seen that the training method provided by the embodiment of the present disclosure uses two hyper-parameters, namely $\alpha$ and the predetermined probability $P_{cut}$ mentioned above. Compared with the four hyper-parameters adopted by the Token Cutoff method, the training method provided by the embodiment of the disclosure reduces the number of hyper-parameters, which simplifies hyper-parameter selection and reduces the overall complexity of training. As the training complexity decreases, the training effect can be improved, so that the overfitting problem of the model is alleviated, the generalization capability of the model is improved, and the translation quality of the model is improved.
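As a worked illustration of the first loss function, the sketch below evaluates the cross entropy term, the KL divergence term and their weighted sum on small hand-written probability distributions. It is a sketch under assumptions that the text does not state: each probability distribution is represented as one distribution over a toy vocabulary per target position, the terms are summed over positions, and the names `cross_entropy`, `kl_divergence` and `simcut_loss` are hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], one_hot: Sequence[float]) -> float:
    """Cross entropy between a predicted distribution and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot) if t > 0.0)


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) = sum_i p_i * log(p_i / q_i); asymmetric in p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def simcut_loss(
    first_dist: Sequence[Sequence[float]],
    second_dist: Sequence[Sequence[float]],
    labels: Sequence[Sequence[float]],
    alpha: float,
) -> float:
    """First loss function: cross entropy plus alpha times KL divergence.

    first_dist  -- distributions computed from the sentence pair (x, y)
    second_dist -- distributions computed from the sample pair (x_cut, y_cut)
    labels      -- one-hot labels of the target language sentence
    Summing over target positions is an assumption of this sketch.
    """
    ce = sum(cross_entropy(p, t) for p, t in zip(first_dist, labels))
    kl = sum(kl_divergence(p, q) for p, q in zip(first_dist, second_dist))
    return ce + alpha * kl


# Usage: one target position, vocabulary of three tokens.
first = [[0.7, 0.2, 0.1]]
second = [[0.5, 0.3, 0.2]]
label = [[1.0, 0.0, 0.0]]
loss = simcut_loss(first, second, label, alpha=3.0)
```

When the two distributions coincide, the KL term vanishes and the loss reduces to the cross entropy alone; the hyper-parameter `alpha` controls how strongly disagreement between the two distributions is penalized.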
In some embodiments, when the model is trained with the first loss function, the present disclosure may adjust the parameters of the model by gradient descent on the first loss function and, during the adjustment, perform bidirectional backpropagation (Bi-backpropagation), i.e., propagate gradients back through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution. In other words, during gradient descent, the model parameters (such as $\theta$ mentioned above) are adjusted through both probability distributions (i.e., the first probability distribution and the second probability distribution) in the $KL(\cdot \,\|\, \cdot)$ function. This can further improve the model training speed, thereby improving the model training effect.
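The difference between propagating the gradient of the KL term through both probability distributions and through only one of them can be illustrated numerically. The sketch below is a toy example under loudly stated assumptions: the "model" has a single scalar parameter `theta`, the feature lists `A` and `B` are invented stand-ins for the sentence pair and the sample pair, and derivatives are approximated by central finite differences instead of automatic differentiation. None of these details come from the text.

```python
import math
from typing import Callable, List, Tuple


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p: List[float], q: List[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical toy features: B mimics A with some entries zeroed out.
A = [1.0, 0.5, -0.5]
B = [1.0, 0.0, 0.0]


def first_dist(theta: float) -> List[float]:
    """Stand-in for the distribution computed from the sentence pair."""
    return softmax([theta * a for a in A])


def second_dist(theta: float) -> List[float]:
    """Stand-in for the distribution computed from the sample pair."""
    return softmax([theta * b for b in B])


def derivative(fn: Callable[[float], float], t: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of d fn / d t."""
    return (fn(t + h) - fn(t - h)) / (2.0 * h)


def kl_gradients(theta: float) -> Tuple[float, float, float]:
    """Return (both sides, first side only, second side only) gradients."""
    both = derivative(lambda t: kl(first_dist(t), second_dist(t)), theta)
    via_first = derivative(lambda t: kl(first_dist(t), second_dist(theta)), theta)
    via_second = derivative(lambda t: kl(first_dist(theta), second_dist(t)), theta)
    return both, via_first, via_second


def gradient_step(theta: float, grad: float, lr: float = 0.1) -> float:
    """One plain gradient descent update."""
    return theta - lr * grad


both, via_first, via_second = kl_gradients(0.8)
theta_two_sided = gradient_step(0.8, both)
theta_one_sided = gradient_step(0.8, via_second)
```

By the chain rule the gradient through both sides equals the sum of the two one-sided contributions, so an update that treats the first distribution as a constant drops the `via_first` contribution and moves the parameter to a different point.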
Table 1 shows a comparison between the effect of the model trained with the model training method provided in the embodiment of the present disclosure and the effect of the model trained with the existing Virtual Adversarial Training (VAT) method. In Table 1 below, the model training method proposed by the embodiment of the present disclosure is denoted SimCut.
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
Bilingual Evaluation Understudy (BLEU) is a metric for evaluating the quality of machine translation. Its underlying idea is that the closer a machine translation is to a professional human translation, the better. The larger the BLEU value, the better the translation model.
Table 1 gives the statistics of model evaluation on the IWSLT14 German-English data set. As shown in Table 1, for both "English->German" and "German->English" translation, the model trained by the existing VAT method performs worse than the model obtained by the SimCut method proposed in the present disclosure, and the model trained with bidirectional backpropagation allowed performs better than the model trained without it.
FIG. 4 shows the effect of different hyper-parameters on the model on the IWSLT14 German-English data set according to an embodiment of the present disclosure. In FIG. 4, the ordinate is $P_{cut}$ and the abscissa is $\alpha$; as $P_{cut}$ and $\alpha$ vary, the BLEU of the trained model on "German->English" translation on the IWSLT14 German-English data set varies. The model training method provided by the embodiment of the present disclosure has only two hyper-parameters, $P_{cut}$ and $\alpha$, so the hyper-parameter selection process is greatly simplified and the complexity of model training is reduced.
Table 2 shows comparison data of the effect of the model obtained by training with the model training method proposed in the embodiment of the present disclosure and the effect of the model obtained by training with several existing training methods. In table 2 below, the model training method proposed by the embodiment of the present disclosure is expressed by SimCut.
TABLE 2
[Table 2 is provided as images in the original publication and is not reproduced here.]
Table 2 gives the statistics of model evaluation on the IWSLT14 German-English data set. As shown in Table 2, for both "English->German" and "German->English" translation, the model trained with the SimCut method proposed by the present disclosure performs better than the models trained with several existing training methods.
In addition, the robustness performance of the training method provided by the embodiment of the disclosure is also obviously improved. Table 3 shows comparison data of robustness of the model trained by the model training method proposed in the embodiment of the present disclosure and robustness of the model trained by several existing training methods. In table 3 below, the model training method proposed by the embodiment of the present disclosure is represented by SimCut.
TABLE 3
[Table 3 is provided as an image in the original publication and is not reproduced here.]
Table 3 gives the statistics of model evaluation on the IWSLT14 German-English data set. Probability denotes the probability that each Token in the sentence to be translated is randomly replaced by another Token; the larger the probability, the noisier the sentence to be translated. As shown in Table 3, for "German->English" translation, under every noise probability of the sentence to be translated, the model trained with the SimCut method proposed by the present disclosure performs better than the models trained with the other training methods.
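The noise used in this robustness evaluation, where each Token of the input sentence is randomly replaced by another Token with a given probability, could be generated along the following lines. This is a hedged sketch: the function name `add_token_noise`, the toy vocabulary, and the choice to draw the replacement uniformly from the other vocabulary entries are assumptions, since the text does not say how replacement Tokens are chosen.

```python
import random
from typing import List, Optional, Sequence


def add_token_noise(
    tokens: Sequence[str],
    vocabulary: Sequence[str],
    probability: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Independently replace each token, with the given probability,
    by a different token drawn uniformly from the vocabulary."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    rng = rng or random.Random()
    noisy = []
    for token in tokens:
        if rng.random() < probability:
            candidates = [v for v in vocabulary if v != token]
            # Keep the token if the vocabulary offers no alternative.
            noisy.append(rng.choice(candidates) if candidates else token)
        else:
            noisy.append(token)
    return noisy


# Usage with a hypothetical toy vocabulary.
vocab = ["das", "ist", "ein", "haus", "baum", "klein"]
sentence = ["das", "ist", "ein", "haus"]
noisy_sentence = add_token_noise(sentence, vocab, probability=0.1,
                                 rng=random.Random(0))
```

Raising `probability` corrupts more Tokens on average, which corresponds to the noisier inputs in the evaluation described above.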
In addition, the translation effect of the SimCut training method provided by the embodiment of the disclosure on the actual sign language translation data set is also significantly improved, as shown in Table 4 below.
TABLE 4
[Table 4 appears as an image in the original publication (Figure BDA0003523806320000103, Figure BDA0003523806320000111).]
The embodiment of the present disclosure further provides a model training apparatus, and fig. 5 is a schematic structural diagram of a model training apparatus 500 according to the embodiment of the present disclosure, where the model training apparatus includes:
a sentence pair obtaining module 510, configured to obtain a plurality of sets of sentence pairs, where each set of sentence pairs includes a source language sentence and a target language sentence;
a sample pair determining module 520, configured to determine, for each set of sentence pairs, a first word vector of a first semantic element in the source language sentence included in the sentence pair, and determine a second word vector of a second semantic element in the target language sentence included in the sentence pair; determining a sample pair corresponding to the sentence pair by using a first word vector of the first semantic element and a second word vector of the second semantic element;
a training module 530, configured to determine a first loss function using the sentence pairs and the corresponding sample pairs, and train the model using the first loss function.
Fig. 6 is a schematic structural diagram of a model training apparatus 600 according to an embodiment of the present disclosure, and as shown in fig. 6, in a possible implementation manner, the training module 530 includes:
a loss function generation submodule 531 for determining a cross entropy function of the first probability distribution and the tags of the target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;
determining the first loss function by utilizing the cross entropy function, the relative entropy function and a preset hyper-parameter;
an adjusting submodule 532, configured to adjust the parameters of the model by gradient descent using the first loss function, where, during the adjustment, gradients are propagated back on both sides, that is, through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution.
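The first loss function can be sketched in plain Python. The claims state that it is the cross entropy function plus the product of the relative entropy function and the preset hyper-parameter; everything else below is an assumption for illustration: the names `first_loss` and `alpha`, the per-position list layout, the direction of the relative entropy (first distribution relative to the second), and the premise that all probabilities are strictly positive, as softmax outputs would be.

```python
import math


def cross_entropy(probs, label):
    """Negative log-probability assigned to the reference token."""
    return -math.log(probs[label])


def relative_entropy(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def first_loss(p_first, p_second, labels, alpha):
    """Cross entropy plus alpha times relative entropy, summed over positions.

    p_first:  per-position distributions predicted for the sentence pair
    p_second: per-position distributions predicted for the sample pair
    labels:   index of the reference target token at each position
    alpha:    the preset hyper-parameter (name assumed here)
    """
    ce = sum(cross_entropy(p, y) for p, y in zip(p_first, labels))
    kl = sum(relative_entropy(p, q) for p, q in zip(p_first, p_second))
    return ce + alpha * kl


if __name__ == "__main__":
    first = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    second = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
    print(first_loss(first, second, labels=[0, 1], alpha=3.0))
```

When the two distributions agree, the relative entropy term vanishes and the loss reduces to ordinary cross entropy, so the hyper-parameter only controls how strongly disagreement between the two forward passes is penalized.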
In some possible embodiments, the sample pair determining module 520 is configured to:
zeroing the first word vector of each first semantic element in the source language sentence according to a preset probability, and zeroing the second word vector of each second semantic element in the target language sentence to obtain the sample pair;
wherein the pair of samples comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.
In some possible embodiments, the preset probability is P_cut.
The sample pair determination module 520 is configured to: for the first word vector of each first semantic element in the source language sentence, control each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut; and, for the second word vector of each second semantic element in the target language sentence, control each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.
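The zeroing step can be sketched as follows, with the preset probability written `p_cut`. This is an illustrative sketch only: the function names, the representation of a sentence as a list of word vectors (lists of floats), and the use of an independent draw per word vector are assumptions rather than details confirmed by the source.

```python
import random


def cut_word_vectors(word_vectors, p_cut, rng):
    """Replace each word vector by a zero vector with probability p_cut.

    With probability 1 - p_cut the word vector is kept unchanged. The input
    is not modified; a new list is returned.
    """
    return [
        [0.0] * len(vec) if rng.random() < p_cut else list(vec)
        for vec in word_vectors
    ]


def make_sample_pair(src_vectors, tgt_vectors, p_cut, rng=None):
    """Build the (first sample, second sample) pair for one sentence pair."""
    if rng is None:
        rng = random.Random()
    first_sample = cut_word_vectors(src_vectors, p_cut, rng)
    second_sample = cut_word_vectors(tgt_vectors, p_cut, rng)
    return first_sample, second_sample


if __name__ == "__main__":
    source = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    target = [[1.0, 1.1], [1.2, 1.3]]
    print(make_sample_pair(source, target, p_cut=0.5, rng=random.Random(0)))
```

Each sentence pair yields exactly one sample pair, so one training step would run the model once on the original word vectors and once on the zeroed ones.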
In some possible embodiments, the model comprises a Neural Machine Translation (NMT) model.
For the functions of each module and/or unit in the apparatus embodiments of the present disclosure, reference may be made to the related descriptions in the method embodiments above, which are not repeated here.
In the technical solution of the present disclosure, the acquisition, storage, and application of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the model training method. For example, in some embodiments, the model training method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of a computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the model training method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the model training method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A model training method, comprising:
acquiring a plurality of groups of sentence pairs, wherein each group of sentence pairs comprises a source language sentence and a target language sentence;
for each set of the sentence pairs, determining a first word vector for a first semantic element in the source language sentence contained in the sentence pair and determining a second word vector for a second semantic element in the target language sentence contained in the sentence pair;
determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element, wherein one sentence pair generates one corresponding sample pair;
determining a first loss function by using the sentence pairs and the corresponding sample pairs, and training a model by using the first loss function;
wherein said determining a first loss function using said sentence pairs and corresponding sample pairs comprises:
determining a cross entropy function of the first probability distribution and tags of target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;
determining the first loss function by utilizing the cross entropy function, the relative entropy function and a preset hyper-parameter; wherein the first loss function is the sum of the product of the relative entropy function and the preset hyper-parameter and the cross entropy function.
2. The method of claim 1, wherein the determining a sample pair corresponding to the sentence pair using the first word vector of the first semantic element and the second word vector of the second semantic element comprises:
zeroing the first word vector of each first semantic element in the source language sentence according to a preset probability, and zeroing the second word vector of each second semantic element in the target language sentence to obtain the sample pair;
wherein the sample pair comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.
3. The method of claim 2, wherein the preset probability is P_cut;
the zeroing the first word vector of each first semantic element in the source language sentence according to the preset probability includes: for the first word vector of each first semantic element in the source language sentence, controlling each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut;
the zeroing the second word vector of each second semantic element in the target language sentence according to the preset probability includes: for the second word vector of each second semantic element in the target language sentence, controlling each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.
4. The method of claim 1, wherein said training a model with said first loss function comprises:
adjusting the parameters of the model by gradient descent using the first loss function, wherein, during the adjustment, gradients are propagated back on both sides, through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution.
5. The method according to any of claims 1 to 4, wherein the model comprises a Neural Machine Translation (NMT) model.
6. A model training apparatus comprising:
the sentence pair acquisition module is used for acquiring a plurality of groups of sentence pairs, and each group of sentence pairs comprises a source language sentence and a target language sentence;
a sample pair determination module for determining, for each set of the sentence pairs, a first word vector for a first semantic element in the source language sentence contained in the sentence pair and a second word vector for a second semantic element in the target language sentence contained in the sentence pair; determining a sample pair corresponding to the sentence pair by using the first word vector of the first semantic element and the second word vector of the second semantic element, wherein one sentence pair generates one corresponding sample pair;
the training module is used for determining a first loss function by utilizing the sentence pairs and the corresponding sample pairs and training a model by adopting the first loss function;
wherein the training module comprises:
a loss function generation submodule for determining a cross entropy function of the first probability distribution and the tags of the target language sentences contained in the sentence pairs, and determining a relative entropy function of the first probability distribution and the second probability distribution; wherein the first probability distribution corresponds to the sentence pair and the second probability distribution corresponds to the sample pair corresponding to the sentence pair;
determining the first loss function by using the cross entropy function, the relative entropy function and a preset hyper-parameter; wherein the first loss function is the sum of the product of the relative entropy function and the preset hyperparameter and the cross entropy function.
7. The apparatus of claim 6, wherein the sample pair determination module is configured for:
zeroing the first word vector of each first semantic element in the source language sentence according to a preset probability, and zeroing the second word vector of each second semantic element in the target language sentence to obtain the sample pair;
wherein the sample pair comprises a first sample and a second sample; the first sample comprises data results obtained after zeroing the first word vectors of the first semantic elements in the source language sentence according to the preset probability, and the second sample comprises data results obtained after zeroing the second word vectors of the second semantic elements in the target language sentence according to the preset probability.
8. The apparatus of claim 7, wherein the preset probability is P_cut;
the sample pair determination module is configured for: for the first word vector of each first semantic element in the source language sentence, controlling each first word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut; and, for the second word vector of each second semantic element in the target language sentence, controlling each second word vector to be replaced by a zero vector with probability P_cut and to remain unchanged with probability 1 - P_cut.
9. The apparatus of claim 6, wherein the training module comprises:
an adjusting submodule for adjusting the parameters of the model by gradient descent using the first loss function, wherein, during the adjustment, gradients are propagated back on both sides, through the model parameters corresponding to the first probability distribution and through the model parameters corresponding to the second probability distribution.
10. The apparatus of any one of claims 6 to 9, wherein the model comprises a Neural Machine Translation (NMT) model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5.
CN202210186688.1A 2022-02-28 2022-02-28 Model training method and device, electronic equipment and storage medium Active CN114580442B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210186688.1A CN114580442B (en) 2022-02-28 2022-02-28 Model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210186688.1A CN114580442B (en) 2022-02-28 2022-02-28 Model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114580442A CN114580442A (en) 2022-06-03
CN114580442B true CN114580442B (en) 2023-04-18

Family

ID=81771192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210186688.1A Active CN114580442B (en) 2022-02-28 2022-02-28 Model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114580442B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium
CN113408299A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Training method, device, equipment and storage medium of semantic representation model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558605B (en) * 2018-12-17 2022-06-10 北京百度网讯科技有限公司 Method and device for translating sentences
KR20200075615A (en) * 2018-12-18 2020-06-26 삼성전자주식회사 Method and apparatus for machine translation
CN112464676A (en) * 2020-12-02 2021-03-09 北京捷通华声科技股份有限公司 Machine translation result scoring method and device

Also Published As

Publication number Publication date
CN114580442A (en) 2022-06-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant