CN112819091A

CN112819091A - Cross-language description oriented antagonism data enhancement method, system and storage medium

Info

Publication number: CN112819091A
Application number: CN202110198513.8A
Authority: CN
Inventors: 肖宇; 鲁统伟
Original assignee: Wuhan Institute of Technology
Current assignee: Wuhan Institute of Technology
Priority date: 2021-02-22
Filing date: 2021-02-22
Publication date: 2021-05-18

Abstract

The invention discloses a cross-language description oriented antagonism data enhancement method, system and storage medium. The method comprises the following steps: obtaining a clean image-text pair data set; generating adversarial text samples by using a sequence-to-sequence model; training an image description generation model, wherein, if the current training stage is an adversarial training stage, adversarial image samples are generated so as to expand the image-text pairs, the model is trained on the expanded image-text pair data set, and the model is optimized according to a joint loss function, and, if the current training stage is a non-adversarial training stage, the model is trained on the clean image-text pair data set and optimized according to a loss function; and obtaining the trained image description generation model and the expanded image-text pair data set. The method expands the data set through an easy-to-operate data enhancement scheme and improves the robustness and performance of the image description generation model.

Description

Cross-language description oriented antagonism data enhancement method, system and storage medium

Technical Field

The invention belongs to the technical field of data enhancement, and particularly relates to a cross-language description oriented antagonism data enhancement method, system and storage medium.

Background

Learning algorithms benefit directly from the scale of the data set: a model trained on a large-scale data set usually fits better and is more robust than a model trained on a small-scale data set. For the image description task in a low-resource language to reach performance comparable to the image description task in English, the first challenge is acquiring a large-scale data set.

Manually labeling the data set is the best way to ensure quality, but it is very time consuming. To balance model performance against the cost of manual labeling, data enhancement methods are generally adopted to enlarge the data set; data enhancement is widely applied, and works well, in the image field.

In the image description task, it is very challenging to augment image-text pairs while keeping their semantics unchanged. Both geometric transformations and random cropping of images can affect the accuracy of the generated sentence. When orientation information is involved, such as a person standing on the left side of a table, flipping or cropping the picture may leave the model with less accurate and less complete information, so that the generated text is wrong. On the text side, it is likewise challenging to provide a general language conversion rule, and general data enhancement techniques in Natural Language Processing (NLP) have not been fully explored.

Disclosure of Invention

The invention aims to provide a cross-language description oriented antagonism data enhancement method, system and storage medium that reduce a neural network's dependence on a large-scale data set.

The invention provides a cross-language description oriented antagonism data enhancement method, which comprises the following steps:

S1, acquiring a clean image-text pair data set;

S2, generating adversarial text samples by using a sequence-to-sequence model;

S3, training the image description generation model:

if the current training stage is an adversarial training stage, generating adversarial image samples, thereby expanding the image-text pairs, then training the model on the expanded image-text pair data set and optimizing the model according to a joint loss function;

if the current training stage is a non-adversarial training stage, training the model on the clean image-text pair data set and optimizing the model according to a loss function;

S4, obtaining the trained image description generation model and the expanded image-text pair data set.
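The stage-dependent branch in step S3 can be sketched as follows. This is a minimal illustration only: the names `select_training_pairs` and `toy_expand` are hypothetical, images and texts are represented by plain strings, and the sketch assumes that the expanded data set contains the clean pairs plus the generated ones. How a stage is designated adversarial or non-adversarial is not specified in the text, so it is passed in as a flag.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (image identifier, caption); hypothetical stand-in types


def select_training_pairs(
    clean_pairs: List[Pair],
    adversarial_stage: bool,
    expand: Callable[[List[Pair]], List[Pair]],
) -> Tuple[List[Pair], str]:
    """Return the training pairs and the name of the loss for the current stage."""
    if adversarial_stage:
        # Adversarial stage: clean pairs plus generated pairs, joint loss.
        return clean_pairs + expand(clean_pairs), "joint_loss"
    # Non-adversarial stage: clean pairs only, ordinary loss.
    return list(clean_pairs), "loss"


def toy_expand(pairs: List[Pair]) -> List[Pair]:
    # Placeholder for adversarial image generation and pairing (assumption).
    return [(img + "_adv", txt) for img, txt in pairs]


clean = [("img0", "a dog runs"), ("img1", "a man sits")]
adv_batch, adv_loss = select_training_pairs(clean, True, toy_expand)
clean_batch, clean_loss = select_training_pairs(clean, False, toy_expand)
```

In a real training loop the returned pairs would be fed to the image description model and the named loss minimized; the sketch only shows which data and which objective each stage uses.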

Further, the adversarial image samples are generated using a gradient attack algorithm.

Further, step S2 specifically includes:

S21, converting the original text into a target text according to formula (1):

wherein S represents the original text, S' represents the target text, P represents a probability distribution, w'_t represents the t-th word segment of the target text, and n represents the number of word segments in the target text;

S22, obtaining the K best translated target texts of the original text and converting the K best target texts into a sentence-generating probability distribution over the target vocabulary, calculated according to formula (2):

in the formula, K represents the number of target texts obtained from a single original text, ω represents a word segment in the target vocabulary, C represents an obtained target text, and E represents the original text;

wherein the probability distribution of the sentence is calculated according to formula (3):

in the formula, m represents the number of word segments of the target text.
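For orientation, a sequence-to-sequence model conventionally factorizes the probability of a target sentence word segment by word segment. The expression below is that standard factorization written with the symbols defined above; it is offered as an assumption about the general shape of the sentence probability, not as a reproduction of formula (1) or formula (3). The notation w'_{<t} for the word segments preceding position t is introduced here for illustration.

```latex
P(S' \mid S) = \prod_{t=1}^{n} P\left(w'_t \mid w'_{<t},\, S\right)
```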

Further, step S2 further includes:

S23, evaluating the semantic similarity between the generated target text and the original text according to formula (4):

in the formula, P(S'|S) represents the probability of the target text S' given the original text S as defined in formula (3), and P(S|S) is used to normalize the different distributions;

S24, screening the adversarial text samples from the target texts according to the semantic similarity.

Further, generating the adversarial image samples comprises:

generating the adversarial image samples by using an iterative gradient attack method, calculated according to formula (5):

in the formula, I_adv represents the adversarial sample of image I, S represents the target text, θ represents the parameters of the image description generation model, L(θ, I, S) represents the loss function of the image description generation model, α represents the perturbation weight, N represents the number of iterations, and the Clip() function replaces overflowed values with boundary values.
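A widely used form of iterative gradient attack, the basic iterative variant of the fast gradient sign method, is shown below using the symbols defined above. It is given as an assumed reading of the update rule rather than as the text of formula (5); in particular, the iteration index k, the sign function, and the interpretation of ε (listed among the hyperparameters in the embodiment) as the clipping radius around the original image are assumptions.

```latex
I_{adv}^{0} = I, \qquad
I_{adv}^{k+1} = \mathrm{Clip}_{I,\epsilon}\left\{ I_{adv}^{k}
  + \alpha \,\mathrm{sign}\left( \nabla_{I} L(\theta, I_{adv}^{k}, S) \right) \right\},
\quad k = 0, \dots, N-1
```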

Further, expanding the image-text pairs comprises:

obtaining two types of adversarial image samples from the original text and the adversarial text sample, and combining the images and the texts pairwise to obtain the expanded image-text pairs.

Further, the joint loss function loss is:

in the formula, L(θ, I, S) represents the loss function on the original data, ω represents the weight controlling the relative contribution of the adversarial samples in the loss function, S represents the original text, S_adv represents the adversarial text sample, I_sc represents the adversarial image sample corresponding to the original text, and I_sadv represents the adversarial image sample corresponding to the adversarial text sample.

The invention also provides a cross-language description oriented antagonism data enhancement system for implementing the above method, comprising:

a data input module for reading in a clean image-text pair data set;

an enhanced image module for augmenting images by an iterative gradient attack algorithm;

an enhanced text module for augmenting text by a sequence-to-sequence model;

a network training module for, in the adversarial training stage, augmenting images by using the enhanced image module so as to expand the image-text pairs, and training on the expanded image-text pair data set with minimization of a joint loss function as the optimization target; and, in the remaining training periods, training on the clean image-text pair data set with minimization of the loss function as the optimization target.

Further, the iterative gradient attack algorithm generates adversarial samples from attack noise derived from the network gradient; the joint loss function is a weighted combination of the loss functions of the expanded image-text pairs.

The present invention also provides a computer storage medium having stored therein a computer program executable by a computer processor, the computer program performing the cross-language description oriented antagonism data enhancement method according to any one of claims 1-7.

The invention has the beneficial effects that: the cross-language description oriented antagonism data enhancement method, the cross-language description oriented antagonism data enhancement system and the storage medium expand a data set through an easy-to-operate data enhancement mode, and improve the robustness and performance of an image description generation model.

Drawings

FIG. 1 is a flow chart of the cross-language description oriented antagonism data enhancement method of the present invention.

FIG. 2 is a diagram of a network training framework according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of the cross-language description oriented antagonism data enhancement system of the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

The invention discloses a cross-language image description oriented antagonism data enhancement method. First, for images, a gradient attack algorithm generates adversarial samples by adding as small a perturbation as possible. Second, for text, adversarial samples with the same semantics as the original sentence are generated by a sequence-to-sequence approach. Finally, the adversarial samples are used as additional samples and fed into network training together with the clean samples; the adversarial text samples are generated before training, whereas the adversarial images are generated continuously during training. Each image-text pair yields four additional data pairs per generation pass, which effectively increases the scale of the data set and can improve image description performance for models trained on small-scale data sets.

The cross-language description oriented antagonism data enhancement method of the embodiment of the invention, as shown in fig. 1, comprises the following steps:

S1, acquiring a clean image-text pair data set. Taking Flickr8k as an example, the images come from Flickr, Yahoo's photo-sharing website, and the data set contains 8000 images; most of the images show people taking part in some activity, and each image is manually annotated with 5 English sentences.

S2, generating adversarial text samples by using the sequence-to-sequence model.

In the embodiment of the present invention, the step S2 may be implemented by the following steps:

S21, translating English into Chinese, calculated according to formula (1):

wherein S is the original text, S' is the target text, P represents a probability distribution, w'_t represents the t-th word segment of the target text, and n represents the number of word segments in the target text;

S22, obtaining the K best translations of the original text and converting the K best Chinese translations into a sentence-generating probability distribution over the target vocabulary, calculated according to formula (2):

in the formula, K represents the number of target texts obtained from a single original text, C represents an obtained Chinese sentence, E represents the English sentence, and ω represents a word segment in the target vocabulary;

wherein the probability distribution P(C|E) of a sentence is calculated according to formula (3):

in the formula, m represents the number of word segments of the final Chinese sentence.

S23, evaluating the semantic similarity score between the generated text and the original text, calculated according to formula (4):

where P(S'|S) is the probability of the target text S' given the original text S as defined in formula (3), and P(S|S) is used to normalize the different distributions.

S24, according to the semantic similarity, screening the adversarial text samples that meet the requirement from the target texts and removing Chinese sentences with poor semantic similarity.
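The scoring and screening in steps S23 and S24 can be sketched as below. This is a hedged illustration, not the patented formula: it assumes the similarity score is the ratio P(S'|S) / P(S|S) capped at 1, a form used in some paraphrase-based adversarial work, and it assumes a sentence probability is the product of per-word-segment probabilities. The function names, the toy probabilities and the threshold of 0.5 are all hypothetical.

```python
import math
from typing import Dict, List


def sentence_log_prob(token_probs: List[float]) -> float:
    """Log-probability of a sentence as the sum of per-segment log-probabilities."""
    return sum(math.log(p) for p in token_probs)


def similarity_score(log_p_target: float, log_p_self: float) -> float:
    """Assumed score: min(1, P(S'|S) / P(S|S)), computed in log space."""
    return min(1.0, math.exp(log_p_target - log_p_self))


def filter_candidates(
    candidates: Dict[str, List[float]],
    self_token_probs: List[float],
    threshold: float,
) -> List[str]:
    """Keep candidates whose score against the original reaches the threshold."""
    log_p_self = sentence_log_prob(self_token_probs)
    kept = []
    for text, probs in candidates.items():
        if similarity_score(sentence_log_prob(probs), log_p_self) >= threshold:
            kept.append(text)
    return kept


# Toy per-segment probabilities; real values would come from the translation model.
candidates = {
    "candidate_a": [0.9, 0.8, 0.9],  # product 0.648
    "candidate_b": [0.5, 0.2, 0.4],  # product 0.04
}
kept = filter_candidates(candidates, self_token_probs=[0.9, 0.9, 0.9], threshold=0.5)
```

Working in log space avoids underflow when sentences are long; the cap at 1 keeps the score in a fixed range so that a single threshold can be applied across sentences.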

S3, training the image description generation model:

This embodiment is trained on a framework combining a Convolutional Neural Network (CNN) and a Long Short-Term Memory network (LSTM).

(1) If the current training stage is an adversarial training stage, adversarial image samples are generated, thereby expanding the image-text pairs; the model is then trained on the expanded image-text pair data set and optimized according to a joint loss function. The adversarial image samples are generated by a gradient attack algorithm.

S31, generating adversarial image samples by using an iterative gradient attack method, calculated according to formula (5):

where I_adv is the adversarial sample of image I, S represents the original text, θ denotes the parameters of the image description generation model, and L(θ, I, S) represents the loss function of the image description generation model. The attack back-propagates the gradient to the input image to compute the gradient of the loss with respect to the image, and then adjusts the image in small steps so as to maximize the loss. α is the perturbation weight, which controls the amplitude of the attack noise: the larger its value, the stronger the attack and the more easily the noise is seen by the naked eye. N is the number of iterations; to save training time and computational cost, it is set to 2. The Clip() function replaces overflowed values with boundary values: as the number of iterations grows during the iterative update, some pixel values overflow (exceed the boundary range) and must be replaced with the boundary values so that usable adversarial samples are produced, and the boundary values are set to be 2

α and ε are both hyperparameters, set to 0.0625 and 0.3, respectively;
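A toy version of the iterative gradient attack in step S31 is sketched below in plain Python. It assumes the common sign-of-gradient update and assumes that Clip() bounds each pixel to within ε of its original value; neither detail is spelled out in the surviving text. The quadratic loss and its hand-written gradient stand in for the captioning loss L(θ, I, S), and the names `iterative_gradient_attack` and `toy_grad` are hypothetical. The defaults α = 0.0625, ε = 0.3 and N = 2 are the values stated in the embodiment.

```python
from typing import Callable, List

Vector = List[float]


def sign(x: float) -> float:
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0


def iterative_gradient_attack(
    image: Vector,
    loss_grad: Callable[[Vector], Vector],
    alpha: float = 0.0625,
    eps: float = 0.3,
    n_iter: int = 2,
) -> Vector:
    """Take n_iter signed-gradient ascent steps, clipping after each step."""
    adv = list(image)
    for _ in range(n_iter):
        grad = loss_grad(adv)
        stepped = [a + alpha * sign(g) for a, g in zip(adv, grad)]
        # Clip: values outside [x - eps, x + eps] are replaced by the boundary.
        adv = [min(max(s, x - eps), x + eps) for s, x in zip(stepped, image)]
    return adv


# Toy loss L(I) = sum((I - target)^2); its gradient w.r.t. I is 2 * (I - target).
target = [0.0, 1.0, 0.5]


def toy_grad(img: Vector) -> Vector:
    return [2.0 * (p - t) for p, t in zip(img, target)]


image = [0.2, 0.6, 0.5]
adv = iterative_gradient_attack(image, toy_grad)
adv_many = iterative_gradient_attack(image, toy_grad, n_iter=10)
```

With two iterations each perturbed pixel moves by at most 2α = 0.125, so the ε bound only becomes active when more iterations are used, as in `adv_many`. A real implementation would obtain the gradient by back-propagation through the captioning model and would also keep pixels inside the valid image range.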

S32, according to the original text S and the adversarial text S_adv, two types of adversarial image samples are obtained, denoted I_sc and I_sadv respectively; combining images and texts pairwise yields four types of expanded data pairs: (I_sc, S), (I_sadv, S), (I_sc, S_adv) and (I_sadv, S_adv);
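The pairwise combination in step S32 is a Cartesian product of the two adversarial images with the two texts. A minimal sketch, using string labels as stand-ins for the actual images and sentences (the variable names are illustrative only):

```python
from itertools import product

# Labels stand in for the two adversarial images and the two texts.
adversarial_images = ["I_sc", "I_sadv"]
texts = ["S", "S_adv"]

# Every image is paired with every text, giving four expanded pairs.
expanded_pairs = list(product(adversarial_images, texts))
```

The result contains exactly the four combinations listed in the step, which is why each clean image-text pair contributes four additional pairs.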

S33, updating the model parameters with the objective of minimizing the joint loss function, which is calculated according to equation (6):

where L(θ, I, S) is the loss function on the original data and ω is the weight controlling the relative contribution of the adversarial samples in the loss function.
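One plausible shape for such a joint loss is the clean loss plus ω times the summed losses of the four expanded pairs. The sketch below assumes that form purely for illustration; formula (6) itself is not reproduced here, and whether the adversarial terms are summed, averaged or weighted individually is an assumption. The function name `joint_loss` and the numeric values are hypothetical.

```python
from typing import Sequence


def joint_loss(
    clean_loss: float,
    adversarial_losses: Sequence[float],
    omega: float,
) -> float:
    """Assumed form: L(theta, I, S) + omega * sum of adversarial-pair losses."""
    return clean_loss + omega * sum(adversarial_losses)


# Toy values: one clean-pair loss and the losses of the four expanded pairs.
total = joint_loss(1.0, [2.0, 2.0, 2.0, 2.0], omega=0.5)
```

Setting ω to 0 recovers the ordinary loss used in the non-adversarial stage, which is one reason a single weight on the adversarial terms is a convenient design.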

(2) If the current training stage is a non-adversarial training stage, the model is trained on the clean image-text pair data set and optimized by minimizing the loss function L(θ, I, S).

S4, obtaining the weights of the trained image description generation model and the expanded image-text pair data set.

Test example: Flickr8k-cn is selected as the training data set for the test. Each test image in Flickr8k-cn is associated with five Chinese texts, obtained by manually translating the corresponding English texts in Flickr8k. The adversarial text samples are obtained with the corresponding English sentences in Flickr8k as input.

Performance metrics widely used in NLP are adopted, namely BLEU-4, ROUGE-L and CIDEr. As shown in FIG. 2, the Chinese image description model follows the CNN + LSTM approach. To extract image features, a pre-trained ResNet-152 is used, which achieved state-of-the-art results for image classification and detection in both the ImageNet and COCO competitions. The image features are 2048-dimensional vectors taken from the ReLU after the pool5 layer, and the extracted features are L2-normalized. The image embedding size, the word embedding size and the LSTM hidden size are all set to 512. The initial learning rate η is set to 0.001 and is decayed every ten epochs with a decay weight of 0.999.
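Two of the settings above, L2 normalization of the extracted features and the stepwise learning-rate decay, can be sketched in a few lines. The sketch assumes that the decay is multiplicative and applied once every ten epochs; the function names `l2_normalize` and `learning_rate` are hypothetical, and a real pipeline would operate on 2048-dimensional feature tensors rather than short lists.

```python
import math
from typing import List


def l2_normalize(features: List[float]) -> List[float]:
    """Scale a feature vector to unit Euclidean length (zero vectors unchanged)."""
    norm = math.sqrt(sum(x * x for x in features))
    return [x / norm for x in features] if norm > 0 else list(features)


def learning_rate(
    epoch: int,
    base_lr: float = 0.001,
    decay: float = 0.999,
    every: int = 10,
) -> float:
    """Assumed schedule: multiply the rate by `decay` once every `every` epochs."""
    return base_lr * decay ** (epoch // every)


unit = l2_normalize([3.0, 4.0])
lr_start = learning_rate(0)
lr_after_ten = learning_rate(10)
```

With a decay weight of 0.999 the rate changes very slowly, so under this reading the schedule stays close to the initial 0.001 throughout a typical run.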

Experimental results show that the method markedly improves the performance of the cross-language image description model while effectively expanding the data set. Experimental comparison results are provided below to illustrate the effectiveness and superiority of the method. As shown in Table 1, Table 2 and Table 3, the method of the invention improves significantly on all three metrics compared with other methods, and the effect is more pronounced on small data sets.

TABLE 1 Comparison of experimental results on the Flickr8k-cn data set with different data enhancement methods

TABLE 2 Comparison of experimental results on the Flickr8k-cn data set using the method with different models

TABLE 3 Comparison of experimental results on data sets of different scales using the method of the invention

The present invention also provides a cross-language description oriented antagonism data enhancement system for implementing the cross-language description oriented antagonism data enhancement method, as shown in FIG. 3, comprising:

a data input module 101 for reading in a clean image-text pair data set;

an enhanced image module 102 for augmenting images by an iterative gradient attack algorithm, wherein the iterative gradient attack algorithm generates adversarial samples from attack noise derived from the network gradient;

an enhanced text module 103 for augmenting text by a sequence-to-sequence model;

a network training module 104 for, in the adversarial training stage, augmenting images by using the enhanced image module so as to expand the image-text pairs, and training on the expanded image-text pair data set with minimization of a joint loss function as the optimization target; and, in the remaining training periods, training on the clean image-text pair data set with minimization of the loss function as the optimization target. The joint loss function is a weighted combination of the loss functions of the expanded image-text pairs.

Based on the cross-language image description oriented antagonism data enhancement method, the invention also provides a computer storage medium. The above-described methods may be implemented in hardware, firmware, or as software or computer code storable in a recording medium such as a CD-ROM, a RAM, a floppy disk, a hard disk, or a magneto-optical disk, or as computer code originally stored in a remote recording medium or a non-transitory machine-readable medium downloaded through a network and to be stored in a local recording medium, so that the methods described herein may be stored in such software processes on a recording medium using a general-purpose computer, a dedicated processor, or programmable or dedicated hardware such as an ASIC or FPGA. It will be appreciated that the computer, processor, microprocessor controller or programmable hardware includes memory components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the cross-language description oriented antagonism data enhancement method described herein. Further, when a general-purpose computer accesses code for implementing the processes shown herein, execution of the code transforms the general-purpose computer into a special-purpose computer for performing the processes shown herein.

It should be noted that, according to the implementation requirement, each step/component described in the present application can be divided into more steps/components, and two or more steps/components or partial operations of the steps/components can be combined into new steps/components to achieve the purpose of the present invention.

It will be understood by those skilled in the art that the foregoing is merely a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included within the scope of the present invention.

Claims

1. A cross-language description oriented antagonism data enhancement method is characterized by comprising the following steps:

S1, acquiring a clean image-text pair data set;

S3, training the image description generation model:

2. The cross-language description oriented antagonism data enhancement method according to claim 1, wherein the adversarial image samples are generated by a gradient attack algorithm.

3. The cross-language description oriented antagonism data enhancement method according to claim 1, wherein step S2 specifically includes:

in the formula, m represents the number of word segments of the target text.

4. The cross-language description oriented antagonism data enhancement method according to claim 3, wherein the step S2 further comprises:

5. The cross-language description oriented antagonism data enhancement method according to claim 1, wherein generating the adversarial image samples comprises:

6. The cross-language description oriented adversarial data enhancement method of claim 1, characterized in that augmenting image-text pairs comprises:

7. The cross-language description-oriented antagonism data enhancement method according to claim 1, wherein the joint loss function loss is:

8. A cross-language description oriented antagonism data enhancement system for implementing a cross-language description oriented antagonism data enhancement method, comprising:

the data input module is used for reading in a clean image-text pair data set;

an enhanced text module to augment text by a sequence-to-sequence model;

9. The cross-language description oriented adversarial data enhancement system of claim 8, characterized in that iterative gradient attack algorithm generates adversarial samples for attack noise generated by network gradients; the joint loss function is a weighted calculation of the loss functions of the augmented image-text pairs.

10. A computer storage medium, characterized in that: stored within it is a computer program executable by a computer processor, the computer program performing the cross-language description oriented antagonism data enhancement method according to any one of claims 1-7.