CN115618891B - Multimodal machine translation method and system based on contrast learning


Info

Publication number
CN115618891B
Authority
CN
China
Prior art keywords
data
group
neural network
sample
data group
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211629848.1A
Other languages
Chinese (zh)
Other versions
CN115618891A (en)
Inventor
荣辉桂
张斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202211629848.1A
Publication of CN115618891A
Application granted
Publication of CN115618891B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/30: Semantic analysis
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a multimodal machine translation method and system based on contrast learning. The method comprises the following steps: acquiring N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form the first language text to be translated, and N is an integer greater than or equal to 1; and translating the first language text by using a translation model based on the video unit and the first language text unit in each group of resource files to obtain a translation text in a second language, wherein the translation model is obtained at least based on the VATEX data set and neural network model training. Because the translation model is trained on the VATEX data set with a neural network model, the method can resolve the ambiguity of the whole sentence and has the advantage of good translation quality.

Description

Multimodal machine translation method and system based on contrast learning
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to a multimodal machine translation method and system based on contrast learning.
Background
Multimodal Machine Translation (MMT) is a popular research direction in the field of natural language processing and a technology for improving translation quality by computational means. Given text in one language together with one or more auxiliary signal inputs (primarily, but not limited to, sound, images, and video), it translates the text into the target language. The technology plays an important role in tasks such as text error correction and text disambiguation.
Multimodal machine translation focuses on extracting, from other modalities, auxiliary information that helps with text correction and sentence disambiguation. In terms of data, multimodal machine translation can be classified as audio-based or vision-based. Video is the richest information-carrying medium, and the amount of information it can convey far exceeds that of single modalities such as sound and images. Video-guided machine translation assists translation by extracting features from the video; currently mainstream schemes can be divided into two stages: in the first stage, separate neural network models are established for the text and the video to extract their respective features; in the second stage, the text and video features are aligned through a neural network model, and the target text is finally output from the features aggregated across the two modalities.
However, existing video-guided multimodal machine translation methods suffer from some non-negligible drawbacks. First, these methods can only disambiguate at the word level, not the sentence as a whole at the semantic level. Second, the multimodal learning models they construct are too simple, so the features of the multiple modalities cannot be effectively aligned. Third, they do not make full use of the data set information, while producing new data samples is costly.
Disclosure of Invention
The invention aims to translate multimodal resource files with a translation model obtained by training a neural network model on the VATEX data set, which can resolve the ambiguity of the whole sentence and has the advantage of good translation quality.
In order to achieve the above object, the present invention provides a multimodal machine translation method based on contrast learning, the method comprising the steps of:
acquiring N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form a first language text to be translated, and N is an integer greater than or equal to 1;
translating the first language text by using a translation model based on the video unit and the first language text unit in each group of resource files to obtain a translation text of a second language, wherein the translation model is obtained at least based on VATEX data sets and neural network model training; the training method of the translation model comprises the following steps:
the method comprises the steps of (1) obtaining a VATEX data set, wherein the VATEX data set comprises a training set, a testing set and a verification set, the training set comprises a plurality of data packets, and each data packet comprises a video clip, a plurality of Chinese descriptions corresponding to the video clip and a plurality of English descriptions corresponding to the video clip;
determining a plurality of data groups to be translated, wherein each data group consists of a video segment and a plurality of Chinese descriptions in each data packet or consists of a video segment and a plurality of English descriptions in each data packet, and the data groups and the data packets comprising the same video segment are in one-to-one correspondence;
step (3) obtaining a sample pair corresponding to each data group one by one based on a plurality of Chinese descriptions or English descriptions in each data group, wherein each sample pair comprises a positive sample and a negative sample, the positive sample is one of the data groups, and the negative sample is one of the data groups;
and (4) taking each data packet and the sample pair corresponding to the data packet as input data, performing multiple rounds of iterative training on the neural network model to obtain the trained neural network model, and obtaining the sample pair corresponding to the same data packet again during each round of iterative training, wherein the training process of the neural network model is as follows:
step (a), setting hyper-parameters of the neural network model and acquiring an initialized neural network model, wherein all weights of the initialized neural network model adopt standard initialization values;
step (b), calculating a loss value set of each data packet based on the input data, wherein the loss value set comprises a plurality of loss values, the number of loss values is the same as the number of Chinese descriptions or English descriptions in the data packet, and each loss value is a loss value between the neural network's predicted value and the true value;
step (c), updating and optimizing all weight parameters of the neural network model with a back propagation algorithm based on the loss value sets corresponding to all data packets, to obtain a new neural network model;
and (d) repeatedly and iteratively executing the step (b) and the step (c) until a preset condition is met, so as to obtain the trained neural network model.
In a specific embodiment, the step (3) comprises:
inputting all the data groups into a negative sample generator;
respectively calculating text similarity and semantic similarity between the text of each data group and the text of any data group in all data groups to obtain a plurality of text similarity data and a plurality of semantic similarity data, wherein the text is a plurality of Chinese descriptions or a plurality of English descriptions;
calculating to obtain a plurality of pairs of sample coefficients corresponding to each data group based on the plurality of text similarity data and the plurality of semantic similarity data of each data group, wherein each pair of sample coefficients comprises a positive sample coefficient and a negative sample coefficient; each pair of sample coefficients is obtained by calculation based on text similarity data and semantic similarity data between the data group and the same data group;
generating a positive sample pool of each data group based on all positive sample coefficients corresponding to each data group, wherein the positive sample pool comprises M positive samples, each positive sample corresponds to one data group, and M is an integer greater than 4 and less than 8;
generating a negative sample pool for each data group based on the corresponding overall negative sample coefficients for each data group, the negative sample pool comprising M negative samples, each negative sample corresponding to one data group;
based on the same preset rule, respectively obtaining a positive sample from the positive sample pool corresponding to each data group and obtaining a negative sample from the negative sample pool to form a sample pair of each data group.
In a specific embodiment, the generating a positive sample pool for each data group based on all positive sample coefficients corresponding to each data group includes:
sequencing all positive sample coefficients corresponding to each data group;
determining the first M positive sample coefficients with the largest numerical value in each data group;
acquiring a first target data group set corresponding to each data group, wherein the first target data group set consists of M data groups corresponding to the first M positive sample coefficients;
taking the data groups in the first target data group set as positive samples, and generating a positive sample pool of each data group, wherein the positive sample pool comprises M positive samples;
the generating the negative sample pool of each data group based on all negative sample coefficients corresponding to each data group comprises:
sequencing all negative sample coefficients corresponding to each data group;
determining the first M negative sample coefficients corresponding to the largest numerical value in each data group;
acquiring a second target data group set corresponding to each data group, wherein the second target data group set consists of M data groups corresponding to the first M negative sample coefficients;
and taking the data groups in the second target data group set as negative samples, and generating a negative sample pool of each data group, wherein the negative sample pool comprises M negative samples.
In one specific embodiment, the preset rule is a randomly selected rule, and the randomly selected probability is positively correlated to the weight value of each positive sample in the positive sample pool and the weight value of each negative sample in the negative sample pool.
In a specific embodiment, the loss value in step (b) is calculated as:

Loss = L_ce + γ·|S|·(Lc_inner + Lc_outer)

where Loss is the loss value between the neural network's predicted value and the true value; L_ce is the cross entropy between the predicted value and the true value; Lc_inner is the loss function within the same modality; Lc_outer is the loss function between different modalities; γ balances the two loss functions; and |S| is the average sentence length.
In a specific embodiment, the preset condition in step (d) is that the accuracy rate does not increase when the new neural network model is verified by using the verification set.
In a specific embodiment, said step (c) is followed by a step (e):
inputting the verification set to a new neural network model for testing, and calculating an evaluation index BLEU-4 value;
and (d) repeatedly and iteratively executing the step (b), the step (c) and the step (e) until the maximum BLEU-4 value is obtained, wherein the new neural network model corresponding to the maximum BLEU-4 value is the trained neural network model.
The invention also provides a multimodal machine translation system based on contrast learning, which comprises:
an acquisition module, used for acquiring N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form the first language text to be translated, and N is an integer greater than or equal to 1;
the translation module is used for translating the first language text by utilizing a translation model based on the video unit and the first language text unit in each group of resource files to obtain a translation text of a second language, and the translation model is obtained at least based on VATEX data sets and neural network model training;
a training module for training the translation model, the training module comprising:
a data set acquisition unit, used for acquiring a VATEX data set, wherein the VATEX data set comprises a training set, a test set and a verification set, the training set comprises a plurality of data packets, and each data packet comprises a video clip, a plurality of Chinese descriptions corresponding to the video clip and a plurality of English descriptions corresponding to the video clip;
a determining unit, used for determining a plurality of data groups to be translated, wherein each data group consists of a video segment and a plurality of Chinese descriptions in each data packet or of a video segment and a plurality of English descriptions in each data packet, and the data groups and the data packets comprising the same video segment are in one-to-one correspondence;
a sample pair obtaining unit, configured to obtain a sample pair corresponding to each of the data sets one to one based on a plurality of chinese descriptions or a plurality of english descriptions in each of the data sets, where each sample pair includes a positive sample and a negative sample, the positive sample is one of the data sets, and the negative sample is one of the data sets;
and the training unit is used for performing multiple rounds of iterative training on the neural network model by taking each data packet and the sample pair corresponding to the data packet as input data to obtain the trained neural network model, and the sample pair corresponding to the same data packet is obtained again during each round of iterative training.
The beneficial effects of the invention at least comprise:
1. The invention provides a multimodal machine translation method based on contrast learning, in which translation is carried out by a translation model based on a provided video unit and the first language text unit corresponding to the video unit, the translation model being obtained at least based on the VATEX data set and neural network model training. On one hand, this makes full use of an existing data set and reduces development cost; on the other hand, during training of the neural network model, contrast learning of semantic alignment within a modality and between different modalities solves the technical problem that existing video-guided multimodal machine translation cannot resolve the ambiguity of the whole sentence, giving good translation quality.
2. When obtaining the sample pairs that serve as input data to the neural network model, video-text pairs with high relevance to each data group are obtained by calculating the text similarity and semantic similarity between the text of each data group and the texts of all data groups. On one hand, this makes full use of the data set information; on the other hand, contrast learning draws semantically similar samples as close as possible in the embedding space and pushes semantically different samples as far apart as possible. Meanwhile, the loss function comprehensively considers semantic fusion both within each modality and between the video and text modalities. This solves the technical problem that existing methods are too simplistic when building the neural network model, effectively aligns the features of the various modalities, and yields good translation quality.
Drawings
FIG. 1 is a flowchart of a method of multimodal machine translation based on contrast learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for training a translation model according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method of step S30 according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for training a neural network model according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for training a neural network model according to another embodiment of the present invention;
fig. 6 is a block diagram of a multimodal machine translation system based on contrast learning according to an embodiment of the present invention.
Detailed Description
The embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
The embodiments of the present disclosure are described below with specific examples, and other advantages and effects of the present disclosure will be readily apparent to those skilled in the art from the disclosure in the specification. It is to be understood that the described embodiments are merely illustrative of some, and not restrictive, of the embodiments of the disclosure. The disclosure may be embodied or carried out in various other specific embodiments, and various modifications and changes may be made in the details within the description without departing from the spirit of the disclosure. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to or other than one or more of the aspects set forth herein.
In addition, in the following description, specific details are provided to facilitate a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Referring to fig. 1 to 3, the present invention provides a multimodal machine translation method based on contrast learning, including the following steps:
s100, acquiring N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form a first language text to be translated, and N is an integer greater than or equal to 1;
in the embodiment of the present disclosure, the video unit and the text unit are two different modalities, the video is used as the most abundant information carrying medium, the amount of information that can be conveyed far exceeds that of a single modality such as sound and image, and when performing language translation, by aligning the text and video features, the ambiguity of the whole sentence, not only on the word level, can be solved, and the advantage of good translation quality is achieved.
In the embodiment of the present disclosure, the first language text to be translated may be one sentence or a combination of multiple sentences, and is not limited herein.
In the disclosed embodiment, each first language text unit is a sentence of about 10 characters on average, and the playing time of each video unit is 5 to 15 seconds.
For example, if a person is walking in the video, the corresponding first language text unit describes that "the person walks". It should be explained that the first language text unit mainly describes the activity of the main subject in the video; some environmental factors are not described.
Step S200, translating the first language text by using a translation model based on the video unit and the first language text unit in each group of resource files to obtain a translation text of a second language, wherein the translation model is obtained at least based on VATEX data sets and neural network model training.
Here, the first language text is translated by the translation model to obtain a translated text of the second language, for example, if the first language is chinese and the second language is english, the translated text output after the chinese text is translated by the translation model is an english text.
The translation process by using the translation model is that the translation texts of the first language text units in each group of resource files are obtained based on the video units and the first language text units in each group of resource files, and the translation texts of all the groups of resource files form the translation texts of the second language.
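To make this per-unit translation flow concrete, the following is a minimal Python sketch; the `ResourceFile` fields and the `translate_unit` call are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ResourceFile:
    video_unit: bytes   # raw video clip (assumed representation)
    text_unit: str      # first-language sentence paired with the clip

def translate_document(files: List[ResourceFile], model) -> str:
    """Translate each (video, sentence) pair and join the per-unit results
    into the second-language translation text."""
    translated_units = []
    for f in files:
        # The trained multimodal model consumes both modalities per unit.
        translated_units.append(model.translate_unit(f.video_unit, f.text_unit))
    return " ".join(translated_units)
```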
In the embodiment of the disclosure, a VATEX data set is adopted when the translation model is trained, the VATEX data set is processed and then used as input to be input to the neural network model, and the neural network model is trained to obtain the translation model.
VATEX is a large multilingual video description data set containing 41,250 videos and 825,380 Chinese and English video descriptions, half of which are Chinese-English parallel translation pairs. Its videos cover 600 human activities, and each video is annotated by 20 people, yielding 10 English and 10 Chinese descriptions. According to the official partitioning scheme, the VATEX data set is partitioned into 25991 videos as a training set, 3000 videos as a verification set, and 6000 videos as a public test set.
In an embodiment of the present disclosure, the neural network model is a Transformer neural network model.
The training method of the translation model comprises the following steps:
step S10, a VATEX data set is obtained, wherein the VATEX data set comprises a training set, a testing set and a verification set, the training set comprises a plurality of data packets, and each data packet comprises a video clip, a plurality of Chinese descriptions corresponding to the video clip and a plurality of English descriptions corresponding to the video clip;
in this embodiment, the training set, the test set, and the verification set of the VATEX data set adopt the official partition standard, and the training set includes 25991 videos, that is, the number of the data packets is 25991, and each data packet includes a video clip, 10 left and right chinese descriptions corresponding to the video clip, and 10 left and right english descriptions.
For example, if a video clip corresponds to 10 Chinese descriptions, there are 10 Chinese sentences describing that video clip; similarly, the multiple English descriptions are multiple English sentences describing the video clip.
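The data packet organisation just described can be sketched as follows; the field names and the split-size dictionary are illustrative, not taken from the VATEX distribution:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataPacket:
    video_clip: str             # e.g. a video id or file path
    zh_descriptions: List[str]  # about 10 Chinese captions of the clip
    en_descriptions: List[str]  # about 10 English captions of the clip

# Official split sizes quoted in the text:
SPLIT_SIZES = {"train": 25991, "val": 3000, "test": 6000}
```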
Step S20, determining a plurality of data groups to be translated, wherein each data group consists of a video segment and a plurality of Chinese descriptions in each data packet or consists of a video segment and a plurality of English descriptions in each data packet, and the data groups and the data packets comprising the same video segment are in one-to-one correspondence;
each data packet comprises a video segment and a Chinese description and an English description corresponding to the video segment, and when the translation model is trained, either Chinese is translated into English or English is translated into Chinese. If Chinese is required to be translated into English, each data group consists of a video clip and a plurality of Chinese descriptions in each data packet; if English is required to be translated into Chinese, each data set consists of a video clip and a plurality of English descriptions in each data packet.
A one-to-one correspondence of data groups and data packets comprising the same video clip is to be understood as follows: the data packet containing video clip A corresponds to the data group containing video clip A.
In this embodiment, a plurality of data groups and a plurality of data packets are in a one-to-one correspondence relationship, and then the number of data groups is the same as the number of data packets, which is also 25991.
Step S30, acquiring a sample pair corresponding to each data group one by one based on a plurality of Chinese descriptions or English descriptions in each data group, wherein each sample pair comprises a positive sample and a negative sample, the positive sample is one of the data groups, and the negative sample is one of the data groups;
further: the step S30 includes:
step S31, inputting all the data sets into a negative sample generator;
in this embodiment, the number of all data sets to be translated is 25991, which may be 25991 video-chinese pairs, or 25991 video-english pairs.
Step S32, respectively calculating text similarity and semantic similarity between the text of each data group and the text of any data group in all data groups to obtain a plurality of text similarity data and a plurality of semantic similarity data, wherein the text is a plurality of Chinese descriptions or a plurality of English descriptions;
in the embodiment of the present disclosure, the calculation of the text similarity data adopts a cosine similarity calculation method, which specifically includes:
specifically, first, a text in each data group is extracted, where the text is all chinese descriptions or all english descriptions contained in each data group. For convenience of understanding, a data group of the text similarity data to be calculated is named as a first data group, and the text of the first data group is T i Any one of the whole data sets is named as a second data set, and the text of the second data set is T j Then, calculating according to a first calculation formula to obtain the text similarity between the first data group and the second data group as
Figure GDA0004065244010000091
The first calculation formula is:
Figure GDA0004065244010000092
here, it should be noted that, in this step, the similarity of the first data set to the text of the user is also calculated, and the result is 1.
For convenience of understanding, it is assumed that the number of data groups is 5, i.e., a group a, a group B, a group C, a group D, and a group E, and the five groups each include one video and 10 chinese descriptions corresponding to the video.
Taking the calculation of the text similarity data of group A as an example, this step calculates the text similarity between the 10 Chinese descriptions of group A and the 10 Chinese descriptions of each of groups A, B, C, D and E, yielding 5 text similarity values, named A1-1, A1-2, A1-3, A1-4 and A1-5 respectively; the number of values equals the number of data groups.
In this embodiment, the text similarity data takes a value of -1 to 1.
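A minimal sketch of this cosine computation follows; the bag-of-words vectorization of each group's concatenated descriptions is an assumption, since the patent does not specify how the texts T_i are vectorized:

```python
import numpy as np
from collections import Counter

def group_vector(descriptions, vocab):
    """Concatenate a group's descriptions and count words over a fixed vocabulary."""
    counts = Counter(" ".join(descriptions).split())
    return np.array([counts[w] for w in vocab], dtype=float)

def text_similarity(vec_i, vec_j):
    """Cosine similarity W_a(i, j) in [-1, 1]; equals 1 when a group is
    compared with itself, matching the note above."""
    denom = np.linalg.norm(vec_i) * np.linalg.norm(vec_j)
    return float(vec_i @ vec_j / denom) if denom else 0.0
```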
In the embodiment of the present disclosure, the semantic similarity data is calculated with a word frequency-inverse text frequency index (TF-IDF) algorithm, specifically as follows:

The data group whose semantic similarity is to be calculated is named the first data group, with verb set V_i and noun set U_i; any data group among all the data groups is named the second data group, with verb set V_j and noun set U_j. The weight of each verb in V_i and of each noun in U_i is calculated according to the second calculation formula, and the semantic similarity W_s(i, j) between the two data groups is then computed from these verb and noun weights. The second calculation formula is reproduced only as figures in the original publication, where N_verb and N_noun denote respectively the numbers of verbs and nouns used in the word frequency-inverse text frequency index algorithm.
Here, it should be noted that, in this step, the semantic similarity between the first data set and itself is also calculated, and the result is 1.
In this embodiment, the semantic similarity data takes a value of 0 to 1.
For convenience of understanding, it is assumed that the number of data groups is 5, i.e., a group a, a group B, a group C, a group D, and a group E, and the five groups each include one video and 10 chinese descriptions corresponding to the video.
Taking the calculation of the semantic similarity data of group A as an example, this step calculates the semantic similarity between the 10 Chinese descriptions of group A and the 10 Chinese descriptions of each of groups A, B, C, D and E, yielding 5 semantic similarity values, named A2-1, A2-2, A2-3, A2-4 and A2-5 respectively; the number of values equals the number of data groups.
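The following sketch shows one common TF-IDF weighting and a cosine-style similarity over shared verbs and nouns; since the patent's second calculation formula survives only as a figure, both forms here are assumptions:

```python
import math
from collections import Counter

def tfidf_weights(words, all_groups_words):
    """TF-IDF weight per verb/noun: term frequency in this group times a
    smoothed inverse group frequency across all data groups (one common
    formulation, assumed here in place of the figure-only formula)."""
    n_groups = len(all_groups_words)
    tf = Counter(words)
    weights = {}
    for w, c in tf.items():
        df = sum(1 for g in all_groups_words if w in g)
        # Smoothed log keeps all weights positive.
        weights[w] = (c / len(words)) * math.log(1.0 + n_groups / (1 + df))
    return weights

def semantic_similarity(weights_i, weights_j):
    """Cosine of the two weight vectors over shared verbs/nouns, in [0, 1]
    since all weights are non-negative; self-similarity is 1."""
    shared = set(weights_i) & set(weights_j)
    num = sum(weights_i[w] * weights_j[w] for w in shared)
    den = math.sqrt(sum(v * v for v in weights_i.values())) * \
          math.sqrt(sum(v * v for v in weights_j.values()))
    return num / den if den else 0.0
```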
Step S33, based on the text similarity data and the semantic similarity data of each data group, calculating to obtain a plurality of pairs of sample coefficients corresponding to each data group, wherein each pair of sample coefficients comprises a positive sample coefficient and a negative sample coefficient; each pair of sample coefficients is obtained by calculation based on text similarity data and semantic similarity data between the data group and the same data group;
positive sample coefficient W pos The calculation formula of (c) is as follows:
W pos =softmax(W a )+β*softmax(W s )
negative sample coefficient W neg The calculation formula of (a) is as follows:
W neg =softmax(W a )-β*softmax(W s )
wherein, W a The text similarity data calculated in step S32;
W s semantic similarity data calculated in step S32;
β is a parameter, and in this embodiment, β takes a value of 0.8.
Following the illustration of step S32, A1-1 and A2-1 are taken as a pair, A1-2 and A2-2 as a pair, A1-3 and A2-3 as a pair, A1-4 and A2-4 as a pair, and A1-5 and A2-5 as a pair. Substituting each pair into the positive sample coefficient formula yields the 5 positive sample coefficients of the group A data, namely A3-1, A3-2, A3-3, A3-4 and A3-5; substituting each pair into the negative sample coefficient formula likewise yields the 5 negative sample coefficients of the group A data, namely A4-1, A4-2, A4-3, A4-4 and A4-5.
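A minimal sketch of the coefficient computation, directly implementing the two formulas above (β = 0.8 as in this embodiment):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_coefficients(w_a_row, w_s_row, beta=0.8):
    """W_pos = softmax(W_a) + beta*softmax(W_s) and
    W_neg = softmax(W_a) - beta*softmax(W_s) for one data group,
    where the rows hold its text / semantic similarities against
    every data group (itself included), as computed in step S32."""
    sa = softmax(np.asarray(w_a_row, dtype=float))
    ss = softmax(np.asarray(w_s_row, dtype=float))
    return sa + beta * ss, sa - beta * ss
```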
Step S34, generating a positive sample pool of each data group based on all positive sample coefficients corresponding to each data group, wherein the positive sample pool comprises M positive samples, each positive sample corresponds to one data group, and M is an integer greater than 4 and less than 8;
the number of the positive sample coefficients of each data set is the same as that of all the data sets, one positive sample coefficient corresponds to one positive sample, and in order to accelerate the convergence rate of the training model, proper positive samples need to be screened out to form a positive sample pool. In this example, the screening method was as follows:
sequencing all positive sample coefficients corresponding to each data group;
determining the first M positive sample coefficients with the largest values in each data group;
and acquiring a first target data group set corresponding to each data group, wherein the first target data group set consists of M data groups corresponding to the first M positive sample coefficients, the data groups in the first target data group set are used as positive samples, and a positive sample pool of each data group is generated and comprises M positive samples. In the present embodiment, M is empirical data, and M ranges from greater than 4 to less than 8, and preferably, M is 5.
And generating a positive sample pool of each data group by taking the data group in the first target data group set as a positive sample, wherein each data group corresponds to one first target data group set, each first target data group set comprises M data groups, and the positive sample pool of each data group is obtained by taking one data group in the M data groups as a positive sample.
It will be appreciated that the positive sample pool of each data group certainly includes the data group itself, because each data group's text similarity and semantic similarity with itself are both 1, so its own positive sample coefficient is certainly the largest.
Based on the above, the construction of group A's positive sample pool is described as an example. First, the 5 positive sample coefficients of group A are sorted; suppose their size relationship is A3-1 > A3-5 > A3-3 > A3-2 > A3-4. If the positive sample pool includes three positive samples, the 3 largest positive sample coefficients are determined, in order of size, to be A3-1, A3-5 and A3-3, and the three data groups corresponding to them are obtained, namely group A, group E and group C. These three data groups form the first target data group set of group A, and taking group A, group E and group C as positive samples generates the positive sample pool of group A.
Step S35, generating a negative sample pool of each data group based on all negative sample coefficients corresponding to each data group, wherein the negative sample pool comprises M negative samples, and each negative sample corresponds to one data group;
the number of the negative sample coefficients of each data set is the same as that of all the data sets, one negative sample coefficient corresponds to one negative sample, and in order to accelerate the convergence rate of the training model, proper negative samples need to be screened out to form a negative sample pool. In this example, the screening method was as follows:
sequencing all negative sample coefficients corresponding to each data group;
determining the first M negative sample coefficients with the maximum values in each data group;
acquiring a second target data group set corresponding to each data group, wherein the second target data group set consists of M data groups corresponding to the first M negative sample coefficients;
and taking the data groups in the second target data group set as negative samples, and generating a negative sample pool of each data group, wherein the negative sample pool comprises M negative samples. And generating a negative sample pool of each data group by taking the data groups in the second target data group set as negative samples, wherein each data group corresponds to a second target data group set, each second target data group set comprises M data groups, and the negative sample pool of each data group is obtained by taking one data group in the M data groups as a negative sample.
In this embodiment, M is empirical data, M ranges from greater than 4 to less than 8, and preferably M is 5.
Based on the above, the construction of group A's negative sample pool is described as an example. First, the 5 negative sample coefficients of group A are sorted; suppose their size relationship is A4-2 > A4-4 > A4-3 > A4-5 > A4-1. If the negative sample pool includes three negative samples, the 3 largest negative sample coefficients are determined, in order of size, to be A4-2, A4-4 and A4-3, and the three data groups corresponding to them are obtained, namely group B, group D and group C. These three data groups form the second target data group set of group A, and taking group B, group D and group C as negative samples generates the negative sample pool of group A.
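Both pools can be built with one top-M selection routine, as sketched below (M = 5 per this embodiment; the function name is illustrative):

```python
import numpy as np

def build_pool(coefficients, group_ids, m=5):
    """Sort one group's sample coefficients and keep the data groups with
    the M largest values as its sample pool; the same routine serves both
    the positive and the negative coefficients."""
    order = np.argsort(np.asarray(coefficients))[::-1]  # descending
    return [group_ids[k] for k in order[:m]]

# Usage: pos_pool = build_pool(w_pos_row, ids); neg_pool = build_pool(w_neg_row, ids)
```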
And S36, respectively obtaining a positive sample from the positive sample pool corresponding to each data group and a negative sample from the negative sample pool to form a sample pair of each data group based on the same preset rule.
Preferably, the preset rule is a random selection, and the probability of the random selection is positively correlated to the weight value of each positive sample in the positive sample pool and the weight value of each negative sample in the negative sample pool.
Each data group corresponds to a positive sample pool and a negative sample pool. A positive sample is randomly selected from the positive sample pool and a negative sample from the negative sample pool, and together they form the sample pair of the data group. It should be understood that in different rounds of iterative training, a positive sample is selected again from the positive sample pool and a negative sample from the negative sample pool according to the random selection rule to form a new sample pair. Adopting a random selection rule can accelerate the convergence of the model.
Specifically, the larger the positive sample coefficient, the larger the weight value of the positive sample; similarly, the larger the negative sample coefficient, the larger the weight value of the negative sample. In this embodiment, the weight values are calculated as follows, taking the positive samples as an example: the positive sample pool of group A includes the three positive samples group A, group E and group C, with positive sample coefficients 1, 0.8 and 0.7 respectively; then the weight of group A is 0.4 (= 1/(1+0.8+0.7)), the weight of group E is 0.32 (= 0.8/(1+0.8+0.7)), and the weight of group C is 0.28 (= 0.7/(1+0.8+0.7)).
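A minimal sketch of this weighted random draw; `random.choices` implements selection with probability proportional to the normalized coefficients:

```python
import random

def draw_sample(pool, coefficients):
    """Draw one pool member with probability proportional to its coefficient;
    coefficients (1, 0.8, 0.7) give selection weights (0.4, 0.32, 0.28),
    matching the worked example above."""
    total = sum(coefficients)
    return random.choices(pool, weights=[c / total for c in coefficients], k=1)[0]

# Each training round re-draws the pair:
# pair = (draw_sample(pos_pool, pos_coeffs), draw_sample(neg_pool, neg_coeffs))
```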
And S40, taking each data packet and the sample pair corresponding to the data packet as input data, performing multiple rounds of iterative training on the neural network model to obtain the trained neural network model, and acquiring the sample pair corresponding to the same data packet again during each round of iterative training.
It will be appreciated that the data groups and data packets comprising the same video segment are in a one-to-one correspondence, and thus the sample pairs corresponding to the data groups are also in a one-to-one correspondence with the data packets corresponding to the data groups. For ease of understanding, the following are illustrated: the first data group including the video segment a has a correspondence with the first data packet including the video segment a, and the first sample pair obtained based on the chinese description or the english description in the first data group has a correspondence with the first data group, and similarly, the first sample pair also has a correspondence with the first data packet.
It can be understood that each round of iterative training refers to inputting all data packets in the training set and sample pairs corresponding to the data packets as input data into the neural network model for calculation.
Re-acquiring the sample pair corresponding to the same data packet in each round of iterative training means that a new sample pair for that round is reselected from the positive and negative sample pools according to the random selection principle. Based on the above description, if the positive sample in group A's sample pair in the first round of iterative training is A, the positive sample in group A's sample pair in the second round may be E or C, and of course may also be A again; what is clear is that the positive sample for group A in the second round is reselected from the positive sample pool.
It should be noted that, retrieving a new sample pair refers to re-selecting a positive sample and a negative sample from the positive sample pool and the negative sample pool, respectively, and the positive sample in the positive sample pool and the negative sample in the negative sample pool do not need to be retrieved again.
Referring to fig. 4, preferably, the method for training the neural network model includes:
s501, setting hyper-parameters of a neural network model, and acquiring an initialized neural network model, wherein the ownership weight of the initialized neural network model adopts a standard initialization value; the hyper-parameter settings of the neural network model are as follows: the hyper-parameter of the Transformer neural network adopts a standard initialization value, the maximum sentence length is set to be 40, the embedded Dropout Rate is 0.2, and other embedded parameters are 0.1; the hyperparameter λ of the comparison learning loss is set to 0.16, and meanwhile, in the comparison learning, in order to balance the calculation amount and the effect of the comparison learning, K =5 is set, where K is the same as M for determining the number of samples in the positive sample pool and the negative sample pool, that is, K =5, and then M =5 is set, and the total number of samples in the positive sample pool and the negative sample pool is 5.
The neural network model adopts a Transformer neural network model.
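The quoted settings can be collected into a configuration sketch (the dictionary keys are illustrative):

```python
# Hyper-parameters quoted in this embodiment (key names are illustrative):
HPARAMS = {
    "max_sentence_length": 40,
    "embedding_dropout": 0.2,
    "other_dropout": 0.1,
    "contrastive_lambda": 0.16,  # weight of the contrast-learning loss
    "pool_size_M": 5,            # samples per positive/negative pool (K = M = 5)
}
```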
S502, calculating a loss value set of each data packet based on the input data, wherein the loss value set comprises a plurality of loss values, the number of the loss values is the same as the number of Chinese descriptions or English descriptions in the data packet, and the loss values are loss values between a neural network predicted value and a real value;
the number of the loss values is the same as the number of chinese descriptions or the number of english descriptions in the data packet, and it is understood that when the number of chinese descriptions in the data packet is 10, the number of the calculated loss values is also 10, and thus, the loss value set of the data packet includes 10 loss values.
The calculation formula of the loss value is as follows:
Loss = L_ce + γ·|S|·(Lc_inner + Lc_outer)

where Loss is the loss value between the neural network's predicted value and the true value; L_ce is the cross entropy between the predicted value and the true value; Lc_inner is the loss function within the same modality; Lc_outer is the loss function between different modalities; γ balances the two loss functions; and |S| is the average sentence length.
The cross entropy L_ce between the neural network's predicted value and the true value takes the standard negative log-likelihood form

L_ce = -log p_θ(t_j | t_i)

where t_i is the language description to be translated in the data packet and t_j is the corresponding description in the other language; that is, if t_i is a Chinese description, t_j is the corresponding English description, and if t_i is an English description, t_j is the corresponding Chinese description. Both t_i and t_j are language descriptions in the data packet, and θ is a parameter of the multimodal Transformer model.
The loss functions used for contrast learning make the differently augmented versions of a sample as close as possible in the embedding space while pushing different samples as far apart as possible. Here d(·) denotes a function that calculates the distance between vectors; H_s, H_s+ and H_s- respectively denote the outputs, after Transformer neural network encoding and average pooling, of the sentence S, of the sentence S+ corresponding to the positive sample in the sample pair, and of the sentence S- corresponding to the negative sample in the sample pair; H_v, H_v+ and H_v- respectively denote the corresponding outputs for the video V, the video V+ of the positive sample, and the video V- of the negative sample in the sample pair. Lc_inner is computed from distances within the same modality and Lc_outer from distances across the two modalities; the exact expressions are reproduced only as figures in the original publication.

The sentence S above is a language description.
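A hedged sketch of the combined objective follows; the cross-entropy term is standard, while the triplet-margin contrast below is an assumed stand-in for the figure-only expressions of Lc_inner and Lc_outer:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, h_s, h_sp, h_sn, h_v, h_vp, h_vn,
               gamma=0.16, avg_len=10.0, margin=1.0):
    """Loss = L_ce + gamma * |S| * (Lc_inner + Lc_outer).
    gamma is set via this embodiment's lambda = 0.16; avg_len approximates
    |S| with the ~10-character average sentence length quoted above. The
    contrastive terms pull positives close and push negatives away."""
    l_ce = F.cross_entropy(logits, targets)  # prediction vs. ground truth

    def contrast(anchor, pos, neg):
        d_pos = F.pairwise_distance(anchor, pos)
        d_neg = F.pairwise_distance(anchor, neg)
        return F.relu(d_pos - d_neg + margin).mean()

    lc_inner = contrast(h_s, h_sp, h_sn) + contrast(h_v, h_vp, h_vn)  # same modality
    lc_outer = contrast(h_s, h_vp, h_vn) + contrast(h_v, h_sp, h_sn)  # cross modality
    return l_ce + gamma * avg_len * (lc_inner + lc_outer)
```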
S503, updating and optimizing all weight parameters of the neural network model by using a back propagation algorithm based on the loss value sets corresponding to all the data packets to obtain a new neural network model;
the loss value sets corresponding to all the data packets are obtained by calculation in step S502, and all the data packets form a training set, which can also be understood here as a new neural network model obtained by updating and optimizing all the weight parameters of the neural network model by using a back propagation algorithm based on the loss value sets of the training set.
It can be understood that each time one round of training is completed, all the weight parameters of the neural network model are updated and optimized once to obtain the optimal weight parameters. It should be noted here that the hyper-parameters of the neural network model, such as the hyper-parameter λ of the contrast learning loss, do not belong to the parameters updated and optimized after each training round, and the parameters updated and optimized after each training round only refer to the weight parameters in the neural network.
And S504, repeatedly and iteratively executing the step S502 and the step S503 until a preset condition is met, and obtaining the trained neural network model.
Repeating the iterative execution of step S502 and step S503 may be understood that, in step S502, the loss value is calculated based on the new neural network model after all the weight parameters are updated, and in step S503, all the weight parameters of the neural network model obtained in the previous round of updating are updated and optimized by using a back propagation algorithm based on the loss value sets corresponding to all the data packets calculated in step S502.
For ease of understanding, an illustration follows. First, a loss value set for each data packet is calculated with the initial neural network model, and then all weights of the initial neural network model are updated and optimized with the back propagation algorithm based on these loss value sets to obtain the first new neural network model; this is the first round. The first new neural network model is then used to calculate a loss value set for each data packet, and all its weights are updated and optimized with the back propagation algorithm based on these loss value sets to obtain the second new neural network model; this is the second round. The second new neural network model is used to calculate a loss value set for each data packet, and all its weights are updated and optimized to obtain the third new neural network model; this is the third round. This continues until the preset condition is met.
The preset condition is that the accuracy does not rise any more when the verification set is used for verifying a new neural network model.
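A minimal sketch of this training loop under assumed interfaces (`model.loss_set`, `batches`, `validate` and `resample_pairs` are illustrative names, not the patent's):

```python
import torch

def train(model, batches, validate, resample_pairs, max_rounds=100):
    """Steps S502-S504 as a loop: each round re-draws every packet's
    (positive, negative) sample pair, computes the loss sets, back-propagates
    to update all weights, and stops once verification accuracy stops rising."""
    optimizer = torch.optim.Adam(model.parameters())
    best_acc = float("-inf")
    for _ in range(max_rounds):
        resample_pairs()                 # new sample pair per data packet
        for packet, pair in batches():
            loss = model.loss_set(packet, pair).mean()
            optimizer.zero_grad()
            loss.backward()              # back propagation
            optimizer.step()             # update all weight parameters
        acc = validate(model)            # accuracy on the verification set
        if acc <= best_acc:              # preset condition: no further rise
            break
        best_acc = acc
```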
Referring to fig. 5, in another disclosed embodiment, after the step S503, the method further includes:
step S505, inputting the verification set to a new neural network model for testing, and calculating an evaluation index BLEU-4 value;
step S504 is to repeat and iteratively execute step S502, step S503, and step S505 until the maximum BLEU-4 value is obtained, and the neural network model corresponding to the maximum BLEU-4 value is the trained neural network model.
In the embodiment of the disclosure, whether the accuracy no longer rises is judged according to the BLEU-4 value. It should be understood that step S504 further comprises comparing the currently calculated BLEU-4 value with the historically calculated BLEU-4 values to determine the maximum BLEU-4 value: when the BLEU-4 values measured both before and after are smaller than the current BLEU-4 value, the neural network model corresponding to the current BLEU-4 value is taken as the trained neural network model.
The embodiment can be understood that verification is performed by using a verification set after each round of training, so as to obtain the BLEU-4 value of the neural network model obtained by each round of training.
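A sketch of this best-BLEU-4 selection (both function names are illustrative):

```python
def best_by_bleu4(round_checkpoints, bleu4_on_validation):
    """Step S505 sketch: score each round's model with BLEU-4 on the
    verification set and keep the round with the maximum score."""
    best, best_score = None, float("-inf")
    for checkpoint in round_checkpoints:
        score = bleu4_on_validation(checkpoint)
        if score > best_score:
            best, best_score = checkpoint, score
    return best, best_score
```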
In other embodiments, steps S502 and S503 are first repeated a preset number of times before the accuracy of the new neural network model is calculated with the verification set, which reduces the computation spent on the verification set. The preset number can be, for example, 10, 12 or 15, and can be set according to experience.
All the data form the training set, so the average loss value calculated in each round of training is the average loss value of the training set. Generally, when the average loss value of the training set is smallest, the corresponding verification-set accuracy is highest; that is, the preset condition can simultaneously require that the average loss value of the training set no longer decreases and that the accuracy on the verification set no longer rises. Therefore, the calculation of the new neural network model's accuracy on the verification set need not start until, after many iterations of steps S502 and S503, the average loss value of the training set decreases only slightly, no longer decreases, or fluctuates back and forth; this likewise reduces the computation spent on the verification set.
Comparing the neural network model provided by the invention with other methods, the comparison results are shown in tables 1 to 3:
TABLE 1 BLEU-4 values on VATEX data set for different models
Table 1 shows the BLEU-4 values for different models on the VATEX dataset. The invention obtains BLEU-4 values of 36.04 and 36.38 on VATEX and MSVD-Turkish, respectively. Experimental results show the effectiveness of the phrase-level multi-modal encoder and the global-level comparison method.
Table 2 shows the experimental results of the invention for the generated samples at different values of β:
TABLE 2 BLEU-4 values corresponding to different beta values
As can be seen from Table 2, different β values affect the resulting accuracy. Since the β value influences the translation quality, an appropriate β value benefits the training of the neural network: it makes the generated positive and negative sample pools more reasonable, so the neural network learns richer features, the generalization ability of the network is enhanced, and the translation quality is further improved. In the present invention, β is preferably 0.8 to 1.2, and more preferably β = 0.8.
TABLE 3 BLEU-4 values corresponding to different lambda values
(Table 3 appears as an image in the original publication.)
The effect of the parameter λ on the neural network model can be seen from Table 3. The BLEU-4 value is highest when λ is 0.16. As λ continues to increase, overall performance instead decreases, because the relative weight of the cross-entropy loss decreases as λ increases. In the translation task of the present invention, the cross-entropy loss should dominate. In the present invention, λ is preferably 0.04 to 0.16, and more preferably λ = 0.16.
It should be added that the BLEU-4 values in Tables 1 to 3 are all results calculated by the neural network model in the English-to-Chinese direction.
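Relating the λ of Table 3 to the loss formulation set out in claim 5, Loss = L_ce + γ·|S|·(Lc_inner + Lc_outer), the sketch below shows the composition of the training objective, under the assumption that the λ reported in the experiments plays the role of the balancing weight γ; all inputs are assumed to be precomputed scalar loss terms.

def total_loss(l_ce, lc_inner, lc_outer, avg_sentence_len, weight=0.16):
    # The cross-entropy term should dominate the objective, so the
    # balancing weight is kept small (0.04 to 0.16 in Table 3).
    return l_ce + weight * avg_sentence_len * (lc_inner + lc_outer)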
Referring to fig. 6, the present invention further provides a multimodal machine translation system 100 based on contrast learning, the system includes:
an acquisition module 10, configured to acquire N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form the first language text to be translated, and N is an integer greater than or equal to 1;
a translation module 20, configured to translate the first language text based on the video unit and the first language text unit in each group of resource files by using a translation model, so as to obtain a translated text in a second language, where the translation model is trained at least on the basis of the VATEX data set and a neural network model; and
a training module 30, configured to train the translation model, where the training module 30 includes:
a data set obtaining unit 31, configured to obtain a VATEX data set, where the VATEX data set includes a training set, a test set, and a verification set, the training set includes a plurality of data packets, and each data packet includes a video segment, a plurality of Chinese descriptions corresponding to the video segment, and a plurality of English descriptions corresponding to the video segment;
a determining unit 32, configured to determine multiple data groups to be translated, where each data group is composed of the video segment and the multiple Chinese descriptions in each data packet or of the video segment and the multiple English descriptions in each data packet, and the data groups and the data packets that include the same video segment are in one-to-one correspondence;
a sample pair obtaining unit 33, configured to obtain a sample pair corresponding one to one to each data group based on the multiple Chinese descriptions or the multiple English descriptions in each data group, where each sample pair includes one positive sample and one negative sample, the positive sample is one of the data groups, and the negative sample is one of the data groups;
and a training unit 34, configured to perform multiple rounds of iterative training on the neural network model, using each data packet and the sample pair corresponding to that data packet as input data, so as to obtain the trained neural network model, where the sample pair corresponding to the same data packet is re-acquired during each round of iterative training.
It should be noted that, for details not mentioned in the embodiment corresponding to fig. 6 and for the specific implementation of the steps executed by each module and unit, reference may be made to the embodiments shown in fig. 1 to fig. 5 and the foregoing description, which are not repeated here.
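As a rough sketch of the behavior of the training unit 34, the loop below re-acquires the sample pair for each data packet in every round of iterative training; sample_pair_for and train_step are assumed callables standing in for the sample pair obtaining unit 33 and for one optimization step (loss calculation plus back propagation), respectively.

def iterative_training(model, data_packets, sample_pair_for, train_step, rounds):
    for _ in range(rounds):
        for packet in data_packets:
            pair = sample_pair_for(packet)   # re-acquired in each round
            train_step(model, packet, pair)  # forward pass, loss, back propagation
    return model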
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in accordance with the embodiments of the application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in, or transmitted via, a computer-readable storage medium. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server that integrates one or more available media. The available medium may be a magnetic medium, such as a floppy disk, a hard disk, or a magnetic tape; an optical medium, such as a DVD; or a semiconductor medium, such as a solid state disk (SSD).
The foregoing is a further detailed description of the present invention in connection with specific preferred embodiments, and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, numerous simple deductions or substitutions may be made without departing from the spirit of the invention, all of which should be construed as falling within the scope of the invention.

Claims (8)

1. A multimodal machine translation method based on contrast learning, the method comprising the steps of:
acquiring N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form a first language text to be translated, and N is an integer greater than or equal to 1;
on the basis of the video unit and the first language text unit in each group of resource files, translating the first language text by using a translation model to obtain a translated text of a second language, wherein the translation model is obtained by training at least on the basis of a VATEX data set and a neural network model, and the training method of the translation model comprises the following steps:
the method comprises the steps of (1) obtaining a VATEX data set, wherein the VATEX data set comprises a training set, a testing set and a verification set, the training set comprises a plurality of data packets, and each data packet comprises a video clip, a plurality of Chinese descriptions corresponding to the video clip and a plurality of English descriptions corresponding to the video clip;
determining a plurality of data groups to be translated, wherein each data group consists of a video segment and a plurality of Chinese descriptions in each data packet or consists of a video segment and a plurality of English descriptions in each data packet, and the data groups and the data packets comprising the same video segment are in one-to-one correspondence;
step (3) acquiring a sample pair corresponding to each data group one to one based on a plurality of Chinese descriptions or English descriptions in each data group, wherein each sample pair comprises a positive sample and a negative sample, the positive sample is one of the data groups, and the negative sample is one of the data groups;
and (4) taking each data packet and the sample pair corresponding to the data packet as input data, performing multiple rounds of iterative training on the neural network model to obtain the trained neural network model, and obtaining the sample pair corresponding to the same data packet again during each round of iterative training, wherein the training process of the neural network model is as follows:
step (a), setting hyper-parameters of the neural network model and acquiring an initialized neural network model, wherein all weights of the initialized neural network model adopt standard initialization values;
step (b), calculating a loss value set of each data packet based on the input data, wherein the loss value set comprises a plurality of loss values, the number of the loss values is the same as the number of Chinese descriptions or English descriptions in the data packet, and each loss value is a loss value between a neural network predicted value and a real value;
step (c), updating and optimizing all weight parameters of the neural network model by using a back propagation algorithm based on the loss value sets corresponding to all the data packets, so as to obtain a new neural network model;
and step (d), repeatedly and iteratively executing step (b) and step (c) until a preset condition is met, so as to obtain the trained neural network model.
2. The multimodal machine translation method based on contrast learning according to claim 1, wherein the step (3) comprises:
inputting all of the data groups into a negative sample generator;
respectively calculating the text similarity and the semantic similarity between the text of each data group and the text of every data group among all the data groups, so as to obtain a plurality of text similarity data and a plurality of semantic similarity data, wherein the text is the plurality of Chinese descriptions or the plurality of English descriptions;
calculating to obtain a plurality of pairs of sample coefficients corresponding to each data group based on the text similarity data and the semantic similarity data of each data group, wherein each pair of sample coefficients comprises a positive sample coefficient and a negative sample coefficient; each pair of sample coefficients is obtained by calculation based on text similarity data and semantic similarity data between the data group and the same data group;
generating a positive sample pool of each data group based on all positive sample coefficients corresponding to each data group, wherein the positive sample pool comprises M positive samples, each positive sample corresponds to one data group, and M is an integer greater than 4 and less than 8;
generating a negative sample pool of each data group based on all negative sample coefficients corresponding to each data group, wherein the negative sample pool comprises M negative samples, and each negative sample corresponds to one data group;
based on the same preset rule, a positive sample is obtained from the positive sample pool corresponding to each data group, and a negative sample is obtained from the negative sample pool, so as to form a sample pair of each data group.
3. The multimodal machine translation method based on contrast learning according to claim 2, wherein the generating of the positive sample pool of each data group based on all positive sample coefficients corresponding to each data group comprises:
sequencing all positive sample coefficients corresponding to each data group;
determining the first M positive sample coefficients with the largest numerical values in each data group;
acquiring a first target data group set corresponding to each data group, wherein the first target data group set consists of M data groups corresponding to the first M positive sample coefficients;
taking the data groups in the first target data group set as positive samples, and generating a positive sample pool of each data group, wherein the positive sample pool comprises M positive samples;
the generating the negative sample pool of each data group based on all negative sample coefficients corresponding to each data group comprises:
sequencing all negative sample coefficients corresponding to each data group;
determining the first M negative sample coefficients with the largest numerical values in each data group;
acquiring a second target data group set corresponding to each data group, wherein the second target data group set consists of M data groups corresponding to the first M negative sample coefficients;
and taking the data groups in the second target data group set as negative samples, and generating a negative sample pool of each data group, wherein the negative sample pool comprises M negative samples.
4. The contrast learning-based multimodal machine translation method according to claim 2, wherein the preset rule is a random selection rule, and the probability of random selection is positively correlated with the weight value of each positive sample in the positive sample pool and the weight value of each negative sample in the negative sample pool.
5. The multimodal machine translation method based on contrast learning according to claim 1, wherein the calculation formula of the loss value in the step (b) is as follows:
Loss = L_ce + γ·|S|·(Lc_inner + Lc_outer)
wherein Loss is the loss value between the neural network predicted value and the real value;
L_ce is the cross entropy between the neural network predicted value and the real value;
Lc_inner is the loss function within the same modality;
Lc_outer is the loss function between different modalities;
γ is used to balance the two loss functions; and
|S| is the average length of the sentence.
6. The multimodal machine translation method based on contrast learning according to claim 1, wherein the preset condition in step (d) is that the accuracy no longer increases when the new neural network model is verified with the verification set.
7. The multimodal machine translation method based on contrast learning according to claim 6, further comprising step (e) after the step (c):
step (e), inputting the verification set into the new neural network model for testing, and calculating the evaluation index BLEU-4 value;
and step (d) comprises repeatedly and iteratively executing step (b), step (c), and step (e) until the maximum BLEU-4 value is obtained, wherein the new neural network model corresponding to the maximum BLEU-4 value is the trained neural network model.
8. A multimodal machine translation system based on contrast learning, the translation system comprising:
an acquisition module, configured to acquire N groups of resource files to be translated, wherein each group of resource files comprises a video unit and a first language text unit corresponding to the video unit, the N first language text units form the first language text to be translated, and N is an integer greater than or equal to 1;
a translation module, configured to translate the first language text based on the video unit and the first language text unit in each group of resource files by using a translation model, so as to obtain a translated text in a second language, wherein the translation model is trained at least on the basis of the VATEX data set and a neural network model; and
a training module for training the translation model, the training module comprising:
the device comprises a data set acquisition unit, a verification unit and a verification unit, wherein the data set acquisition unit is used for acquiring a VATEX data set, the VATEX data set comprises a training set, a testing set and a verification set, the training set comprises a plurality of data packets, and each data packet comprises a video clip, a plurality of Chinese descriptions corresponding to the video clip and a plurality of English descriptions corresponding to the video clip;
the translation device comprises a determining unit, a translation unit and a translation unit, wherein the determining unit is used for determining a plurality of data groups to be translated, each data group consists of a video segment and a plurality of Chinese descriptions in each data packet or consists of a video segment and a plurality of English descriptions in each data packet, and the data groups and the data packets comprising the same video segment are in one-to-one correspondence;
a sample pair obtaining unit, configured to obtain a sample pair corresponding one to one to each data group based on the plurality of Chinese descriptions or the plurality of English descriptions in each data group, wherein each sample pair comprises a positive sample and a negative sample, the positive sample is one of the data groups, and the negative sample is one of the data groups;
and a training unit, configured to perform multiple rounds of iterative training on the neural network model, using each data packet and the sample pair corresponding to the data packet as input data, so as to obtain the trained neural network model, wherein the sample pair corresponding to the same data packet is re-acquired during each round of iterative training.