CN111967487A - Incremental data enhancement method for visual question-answering model training and application

Incremental data enhancement method for visual question-answering model training and application

Info

Publication number
CN111967487A
Authority
CN
China
Prior art keywords
features
training
feature
sentence length
word
Prior art date
Legal status
Granted
Application number
CN202010563289.3A
Other languages
Chinese (zh)
Other versions
CN111967487B (en)
Inventor
王瀚漓
龙宇
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Publication of CN111967487A publication Critical patent/CN111967487A/en
Application granted granted Critical
Publication of CN111967487B publication Critical patent/CN111967487B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Abstract

The invention relates to an incremental data enhancement method for visual question-answering model training, which comprises the following steps: acquiring an original training data set whose training samples take the form of images, texts and answers, the texts being natural language sequences; obtaining the sentence length distribution of the natural language sequences in the original training data set and the frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution; and expanding the natural language sequences in the training samples according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement. Compared with the prior art, the method increases data diversity, is efficient, and is simple to implement.

Description

Incremental data enhancement method for visual question-answering model training and application
Technical Field
The invention relates to a model training method, in particular to an incremental data enhancement method for visual question-answering model training and application.
Background
With the widespread adoption of mobile devices and the growing demands of users in recent years, the volume of visual data presented to each person has grown explosively, and the demand for visual question-answering systems that can answer users' questions keeps rising. A visual question-answering system aims to interpret visual information according to users' needs, which involves understanding the question as well as retrieving, localizing, and reasoning about objects. Compared with other cross-modal tasks such as visual captioning, the development of visual question answering is still limited by the contradiction between an unbounded search space and incomplete training data, the gap between statistical reasoning and actual reasoning, answer conflicts caused by differences in understanding, the diversity of semantic expression, and the tension between reasoning difficulty and data size.
The diversity of semantic expression increases the likelihood of contradictions in the data and of answer conflicts, which in turn raises the difficulty of reasoning, so it is an important problem to be addressed. Existing methods usually rely only on data cleaning, i.e., removing invalid entries from the semantic text data, which makes it difficult to reach the required recognition performance.
Disclosure of Invention
The present invention aims to overcome the above defects of the prior art by providing an incremental data enhancement method for visual question-answering model training, and applications thereof, that are simple to implement.
The purpose of the invention can be realized by the following technical scheme:
an incremental data enhancement method for visual question-answering model training, the method comprising:
acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences;
obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution;
and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
Further, the length distribution of all sentences is collected; the distribution is approximately normal, and the minimum sentence length threshold and the maximum sentence length threshold are set at the 50th and 99th percentiles, respectively.
Further, for the natural language sequence of each training sample, it is judged whether the sentence length is smaller than the maximum sentence length threshold; if so, the sequence is expanded so that the expanded sentence length lies within the range formed by the minimum and maximum sentence length thresholds; if not, no expansion is performed.
Further, a word in the natural language sequence is randomly selected, and the sequence is expanded by repeating that word immediately after the original word.
Furthermore, based on the word frequency distribution, words whose frequency lies in the middle third of the frequency range are taken as candidates and given reinforced weights; a word in the natural language sequence is then selected by weighted random sampling, and the sequence is expanded by repeating that word immediately after the original word.
The invention also provides a training method of the visual question-answering model, which comprises the following steps:
initializing a model;
expanding the original training data set by the incremental data enhancement method to obtain an expanded training data set;
performing feature extraction on training samples in the extended training data set to obtain text features and image features;
performing feature fusion on the image features and the text features to generate fusion features, and generating output answers based on the fusion features;
calculating an answer error based on the output answer and an initial answer in a training sample;
and performing parameter iterative adjustment on the visual question-answering model based on the answer error.
Further, the extraction of the text features specifically comprises:
truncating the natural language sequence to the maximum length supported by the temporal neural network, and feeding the truncated natural language sequence into the temporal neural network to extract text features;
wherein the temporal neural network comprises a recurrent neural network module, the natural language sequence is fed step by step into the recurrent neural network module, and either the hidden feature of the last time step or a fusion of the hidden features of all time steps is used as the text feature.
Further, the images in the training samples are fed into a convolutional neural network to extract the corresponding convolutional layer and fully-connected layer features, and the features of the last convolutional layer, the features of the penultimate fully-connected layer, or the features of the 36 highest-confidence targets are used as the image features.
Further, the feature fusion specifically includes:
respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with the same size, and performing dot multiplication on the two hidden layer features to obtain a fusion feature; or
Respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with different sizes, adjusting the sizes of the two hidden layer features to be the same through copying and expanding, performing dot multiplication on the two adjusted hidden layer features to generate a fused hidden layer feature, performing feature conversion on the fused hidden layer feature through one full connection layer, generating an attention feature through the other full connection layer, and performing dot multiplication fusion on the attention feature and the image features to generate a final fused feature.
Further, the iterative method adopted by the parameter iterative adjustment comprises a second-order momentum optimization method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method expands the training data set and enhances the variation of the text data, realizing data diversity. By applying limited deformations to the expression of the original sentences, the natural language takes on diverse expression forms, so the model receives the same information in a variety of lengths and styles; this strengthens the robustness of the model to expression form, improves the effectiveness of the model, and raises evaluation metrics such as classification accuracy;
(2) The method designs a random enhancement strategy and a word frequency enhancement strategy from data statistics to expand the natural language data; the operations are very simple and highly practical;
(3) The invention only deforms the input text data and involves neither changes to the model nor the input of extra data, so it incurs no additional computation or data requirements and no additional cost, which is of great significance in practical applications;
(4) Training the visual question-answering model on the training data set after data expansion yields high training accuracy, giving the method strong advantages and application prospects.
Drawings
FIG. 1 is a flow chart of the training process of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example 1
The embodiment provides an incremental data enhancement method for visual question-answering model training, which comprises a data statistics step, a threshold determination step and a data expansion step, and specifically comprises the following steps: acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences; obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution; and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
The data statistics step collects the sentence length distribution and the word frequency distribution over all original language sequences in the training data set. The permitted length range of the text language sequences is determined from the sentence length statistics by setting the minimum sentence length threshold and the maximum sentence length threshold, and the frequency of each word is determined from the word frequency statistics for use in the subsequent word selection probabilities.
The specific process of generating the thresholds and the word frequency distribution is as follows. The sentence lengths fall in the range 0 to L and follow an approximately normal distribution; the minimum sentence length threshold and the maximum sentence length threshold are set at the 50th and 99th percentiles, respectively, and sentences are truncated according to these thresholds. In this embodiment the maximum sentence length threshold is set to 14. The occurrence frequency of every word in the training data set is then counted; the word frequencies fall in the range 0 to F and, as a whole, follow Zipf's law. The words are divided into three segments by frequency, namely words with frequency in (0, e^{(ln F)/3}], (e^{(ln F)/3}, e^{2(ln F)/3}], and (e^{2(ln F)/3}, F] (equivalently, segment boundaries at F^{1/3} and F^{2/3}), and the frequency corresponding to every word is recorded.
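By way of illustration only, the statistics step might be sketched in Python as follows; the function and variable names are ours rather than the patent's, and the percentile and counting utilities assume NumPy and the standard library:

```python
# A minimal sketch of the data statistics step, assuming each question is
# already tokenized into a list of words; all names here are illustrative.
from collections import Counter

import numpy as np

def compute_statistics(questions):
    """Return length thresholds, per-word frequencies, and the middle band."""
    lengths = np.array([len(q) for q in questions])
    # Thresholds at the 50th and 99th percentiles of the (roughly normal)
    # sentence length distribution, as described above.
    min_len = int(np.percentile(lengths, 50))
    max_len = int(np.percentile(lengths, 99))

    word_freq = Counter(w for q in questions for w in q)
    F = max(word_freq.values())
    # Zipf-style segment boundaries: e^((ln F)/3) = F^(1/3), e^(2(ln F)/3) = F^(2/3).
    low_cut, high_cut = F ** (1.0 / 3.0), F ** (2.0 / 3.0)
    middle_band = {w for w, f in word_freq.items() if low_cut < f <= high_cut}
    return min_len, max_len, word_freq, middle_band
```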
In the data expansion step, for the natural language sequence of each training sample it is judged whether the sentence length is smaller than the maximum sentence length threshold; if so, the sequence is expanded so that the expanded sentence length lies within the threshold range formed by the minimum and maximum sentence length thresholds; if not, no expansion is performed. The expansion strategies of this embodiment comprise a random enhancement strategy and a word frequency enhancement strategy.
When the random enhancement strategy is adopted, for each natural language sequence in the input data, the sentence length is first compared with the maximum sentence length threshold; if the length is greater than or equal to the threshold, no operation is performed, and if it is below the threshold, the sentence is expanded. This guarantees that the final length is greater than the minimum sentence length threshold and less than the maximum sentence length threshold. The single-sentence expansion step randomly selects a word at some position in the current sentence and inserts a repeated copy of that word immediately after it, lengthening the sentence by one word; the step is repeated as needed, and the modified natural language sequence replaces the original one.
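As a non-authoritative illustration, the random enhancement strategy described above might be implemented as the following Python sketch; the choice of a target length inside the threshold range is our assumption about how the repeated expansion terminates:

```python
import random

def random_augment(question, min_len, max_len):
    """Expand a tokenized question by duplicating randomly chosen words."""
    q = list(question)
    if len(q) >= max_len:
        return q  # sentences at or above the maximum threshold are left unchanged
    # Choose a final length strictly inside (min_len, max_len); each expansion
    # step inserts a repeated copy of a random word right after the original.
    lo = min(max(min_len + 1, len(q)), max_len - 1)
    target = random.randint(lo, max_len - 1)
    while len(q) < target:
        i = random.randrange(len(q))
        q.insert(i + 1, q[i])
    return q
```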
When the word frequency enhancement strategy is adopted, for each natural language sequence in the input data, the sentence length is likewise first compared with the maximum sentence length threshold; if the length is greater than or equal to the threshold, no operation is performed, and if it is below the threshold, the sentence is expanded, again guaranteeing that the final length lies between the minimum and maximum sentence length thresholds. The single-sentence expansion step first uses the word frequency distribution to take the words in the middle third of the frequency range as candidates and reinforce their weights (this embodiment uses a weight of 2); a word at some position in the sentence is then chosen by weighted random sampling (the probability of selecting a word is its weight divided by the sum of the weights of all words in the current sentence), and a repeated copy of that word is inserted immediately after it, lengthening the sentence by one word. The step is repeated as needed, and the modified natural language sequence replaces the original one.
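Under the same assumptions as the previous sketch, the word frequency enhancement strategy might look as follows, with `middle_band` being the set of middle-frequency words from the statistics step and the weight of 2 taken from this embodiment:

```python
import random

def word_freq_augment(question, min_len, max_len, middle_band, boost=2.0):
    """Like random_augment, but middle-frequency words are favored candidates."""
    q = list(question)
    if len(q) >= max_len:
        return q
    lo = min(max(min_len + 1, len(q)), max_len - 1)
    target = random.randint(lo, max_len - 1)
    while len(q) < target:
        # Weight 2 for middle-frequency words, 1 otherwise; selection probability
        # is the word's weight over the sum of weights in the current sentence.
        weights = [boost if w in middle_band else 1.0 for w in q]
        i = random.choices(range(len(q)), weights=weights, k=1)[0]
        q.insert(i + 1, q[i])
    return q
```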
Example 2
This embodiment provides a training method for a visual question-answering model that adopts end-to-end training; the flow of the training method is shown in FIG. 1 and includes:
(1) and initializing the model.
(2) The original training data set is expanded by the incremental data enhancement method described in Example 1 to obtain an expanded training data set, thereby implementing text data enhancement.
(3) And performing feature extraction on the training samples in the extended training data set to obtain text features and image features.
During model training, the text language sequence is first truncated so that its length does not exceed the maximum length supported by the temporal neural network model; the truncated sequence is fed into a lookup table module, whose output is fed into the temporal neural network to extract text features. In the testing stage, the original text language sequence is fed into the temporal neural network to extract text features. In both the training and testing stages, the images are fed into a convolutional neural network to extract the corresponding convolutional layer and fully-connected layer image features.
31) In text language feature extraction, the data fed into the model each time is a batch of cross-modal data pairs {V, Q, A}. Taking a single pair as an example, Q is the corresponding text language information and can be represented as a word sequence {word_1, word_2, ..., word_T}, where T has a maximum length of 14 and words beyond this length are discarded. The word sequence Q is first fed in full into a lookup table module (Lookup Table), which maps each original one-hot dictionary vector, of the form {0, ..., 0, 1, 0, ..., 0}, to the corresponding word embedding, yielding the sequence {word_vector_1, word_vector_2, ..., word_vector_T}, where each word_vector_t is a vector of dimension 1 × 300. The word embedding sequence of the sentence is then fed step by step, in time-step order, into a specific recurrent neural network module; this embodiment uses a Gated Recurrent Unit (GRU). Each step of the recurrent network yields two vectors, a hidden state and a subsequent output, where hidden_state_t, the hidden feature at time step t, is a vector of dimension 1 × 1024. This embodiment verifies the effect using two alternative text features: the hidden feature of the last time step, and a fusion of the hidden features of all time steps. If the hidden feature of the last time step is used as the output feature, the final output is ques_representation = hidden_state_T, i.e., a 1 × 1024 vector. If the fusion of the hidden features of all time steps is used, the hidden features {hidden_state_1, hidden_state_2, ..., hidden_state_T} are first transformed by a shared convolutional layer into a T × 512 tensor and then by another shared convolutional layer into a T × 2 tensor, which is split into two T-dimensional vectors; these serve as two attention heads that are separately dot-multiplied with, and summed over, the hidden features of all time steps to form 1 × 1024 vectors, and the outputs of the two attention heads are concatenated along the final dimension. The concatenated output can be written as ques_representation = Self-Attention_2(hidden_state_1, hidden_state_2, ..., hidden_state_T), i.e., a 1 × 2048 vector. The output feature ques_representation is used in the subsequent steps as the text language feature.
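A minimal PyTorch sketch of this text branch is given below under the dimensions stated above (300-dimensional embeddings, a 1024-dimensional GRU, two attention heads); the module names, the tanh nonlinearity, and the use of linear layers in place of the 1 × 1 convolutions are our assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Lookup table + GRU; variant 1 returns the last time step's hidden state."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1024):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)      # lookup table module
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):                             # word_ids: (B, T), T <= 14
        emb = self.lookup(word_ids)                          # (B, T, 300)
        hidden_states, last = self.gru(emb)                  # (B, T, 1024), (1, B, 1024)
        return hidden_states, last.squeeze(0)                # variant 1 output: (B, 1024)

class SelfAttentionPool(nn.Module):
    """Variant 2: fuse all time steps with two attention heads (1 x 2048 output)."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 512)               # T x 1024 -> T x 512
        self.heads = nn.Linear(512, 2)                       # T x 512  -> T x 2

    def forward(self, hidden_states):                        # (B, T, 1024)
        scores = self.heads(torch.tanh(self.proj(hidden_states)))    # (B, T, 2)
        alpha = torch.softmax(scores, dim=1)                 # attention over time steps
        pooled = torch.einsum('bth,btk->bkh', hidden_states, alpha)  # (B, 2, 1024)
        return pooled.flatten(1)                             # concatenated: (B, 2048)
```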
32) In image feature extraction, this embodiment uses different image sizes depending on the reference model, so the original image is first scaled to 224 × 224 or 448 × 448 according to the requirements of the reference model. An image of the required size is then either fed into a ResNet-152 model pre-trained on ImageNet, with the features of the last convolutional layer or of the penultimate fully-connected layer extracted as image features, so that image_representation is of dimension 1024 × 14 × 14 or 1 × 2048; or fed into a Faster R-CNN model pre-trained on the object detection task of the MSCOCO data set, with the features of the 36 highest-confidence detected targets used as image features, so that image_representation is of dimension 36 × 2048. The output feature image_representation is passed to the subsequent steps as the image feature.
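As one possible realization of the grid-feature variant, the sketch below taps a pre-trained torchvision ResNet-152 after its last convolutional block; the image path is a placeholder, and note that torchvision's final block emits 2048 channels (the 1024 × 14 × 14 figure above would correspond to tapping an earlier layer):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-152 with the pooling and classification head removed,
# keeping everything up to (and including) the last convolutional block.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

preprocess = T.Compose([
    T.Resize((448, 448)),                                   # input size cited above
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")              # placeholder path
with torch.no_grad():
    feat = backbone(preprocess(img).unsqueeze(0))           # (1, 2048, 14, 14)
```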
(4) The image features and the text features are fused to generate fusion features, and an output answer is generated based on the fusion features.
During both the training and testing of the model, the text language features and image features extracted in the previous steps are fed into a fusion reasoning module, which performs feature conversion on each feature and then applies the corresponding fusion operation to generate the fusion features. The generated fusion features are fed into a fully-connected layer for feature conversion, producing an answer feature vector of dimension 1 × C, where C is the number of answers in the reference data set; the answer features are followed by a softmax layer, and the category with the maximum probability is the generated answer.
Depending on the reference model, the two features are fused in one of several forms, including:
41) Simple conversion fusion: the two features are each transformed by their own fully-connected layer into hidden features, which are vectors of dimension 1 × 2048. The two same-sized hidden features are then element-wise multiplied to complete the fusion, producing the fusion feature fusion_representation, a vector of dimension 1 × 2048, which is used in the subsequent steps as the fusion reasoning feature.
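A minimal sketch of this simple conversion fusion, assuming a 1 × 2048 text feature, a 1 × 2048 global image feature, and the 3129-answer setting of VQA 2.0 for the classifier; the class and parameter names are illustrative:

```python
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Variant 41: per-branch fully-connected layers, then element-wise product."""
    def __init__(self, img_dim=2048, txt_dim=2048, hidden=2048, num_answers=3129):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hidden)
        self.txt_fc = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)     # softmax applied at inference

    def forward(self, image_representation, ques_representation):
        fused = self.img_fc(image_representation) * self.txt_fc(ques_representation)
        return self.classifier(fused)                        # (B, C) answer logits
```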
42) Fusion through an attention mechanism model: the two features are first each transformed by a fully-connected layer into hidden features, namely tensors of dimension K × 2048 (K may be 36 or 196) and 1 × 2048. The hidden feature of the text language is then copied and expanded to the same size as the image feature, i.e., K × 2048, and the two hidden features are element-wise multiplied to generate a fused hidden feature of dimension K × 2048. This is transformed by a fully-connected layer to dimension K × 512, and then by another fully-connected layer to an attention feature of dimension K × 1. The attention feature is then fused with the image features by weighted (dot-product) combination to generate the final fusion feature fusion_representation, a vector of dimension 1 × 2048. The output feature fusion_representation is used in the subsequent steps as the fusion reasoning feature.
43) Fusion through an outer-product model combined with the attention mechanism: the two features are first each transformed by a fully-connected layer into hidden features of dimension K × 2048 (K may be 36 or 196) and 1 × 2048, respectively. The hidden feature of the text language is copied and expanded to the same size as the image feature, i.e., K × 2048; the two hidden features are combined by outer-product multiplication using a direct or indirect rank-reduction method or an approximate rank-reduction method to generate a fused hidden feature of dimension K × 2048, which is transformed by a fully-connected layer to dimension K × 512 and then by another fully-connected layer to an attention feature of dimension K × 1. The attention feature is then fused with the image features by weighted combination to generate the final fusion feature fusion_representation, a vector of dimension 1 × 2048, used in the subsequent steps as the fusion reasoning feature.
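The attention-based variant 42) might be sketched as follows; the softmax over regions and the ReLU nonlinearity are our assumptions, and the outer-product variant 43) would replace the element-wise product with a low-rank bilinear interaction:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Variant 42: broadcast text feature over K regional image features."""
    def __init__(self, dim=2048):
        super().__init__()
        self.img_fc = nn.Linear(dim, dim)
        self.txt_fc = nn.Linear(dim, dim)
        self.att_fc1 = nn.Linear(dim, 512)                   # K x 2048 -> K x 512
        self.att_fc2 = nn.Linear(512, 1)                     # K x 512  -> K x 1

    def forward(self, img_feats, txt_feat):                  # (B, K, 2048), (B, 2048)
        v = self.img_fc(img_feats)                           # (B, K, 2048)
        q = self.txt_fc(txt_feat).unsqueeze(1)               # (B, 1, 2048), broadcast to K
        joint = v * q                                        # fused hidden feature
        att = torch.softmax(self.att_fc2(torch.relu(self.att_fc1(joint))), dim=1)
        return (att * img_feats).sum(dim=1)                  # fusion feature: (B, 2048)
```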
(5) An answer error is calculated based on the output answer and the initial answer in the training sample. In this embodiment, a cross-entropy loss function is used to calculate the error.
(6) The parameters of the visual question-answering model are iteratively adjusted based on the answer error. The iterative method used in this embodiment includes a second-order momentum optimization method (such as Adam).
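Putting the pieces together, one optimization step under these choices might look like the following sketch, where `model` stands for any of the fusion networks above, `train_loader` yields batches from the expanded training set, and the learning rate is an assumed value:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, lr=1e-4):
    """One epoch of end-to-end training with cross-entropy loss and Adam."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # second-order momentum method
    for images, questions, answers in train_loader:          # expanded training set
        logits = model(images, questions)                    # (B, C) answer scores
        loss = criterion(logits, answers)                    # answers: class indices
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```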
To verify the performance of the above method, the following experiment was designed.
Experimental validation was first performed on two public data sets (COCO-QA and VQA 2.0). The COCO-QA data set consists of <image, text, answer> triples, with 431 answers used as classification categories; 78736 data pairs corresponding to 8000 images serve as the training set, and 38948 data pairs corresponding to 4000 images serve as the validation set. The VQA 2.0 data set likewise consists of <image, text, answer> triples, with the top 3129 answers taken as classification results, for a total of 443757 and 214354 data pairs as the training and validation sets, respectively.
The evaluation metric used in the experiments is classification accuracy.
The results on the COCO-QA data set are shown in Table 1; it is readily seen that the proposed method outperforms the original method across the various models.
Table 1. Comparison of classification accuracy on the COCO-QA data set
Reference method    No data enhancement    With data enhancement
LSTM+CNN 60.93 61.63
SAN 65.34 65.64
LSTM+CNN_AttRNN 61.40 61.86
SAN_AttRNN 65.54 65.83
The results on the VQA 2.0 data set are shown in Table 2; after applying the data expansion enhancement strategy, each benchmark model performs better than its counterpart without enhancement.
Table 2. Comparison of classification accuracy on the VQA 2.0 data set
Reference method    No data enhancement    With data enhancement
LSTM+CNN 51.61 52.10
BLOCK 63.03 63.34
BAN 64.67 64.97
This series of experimental results demonstrates that, on various public data sets, the incremental data enhancement method for natural language processing in visual question answering provided by the invention is clearly effective, simple to implement, and free of additional cost, and thus has strong advantages and application prospects among known text data enhancement techniques.
The foregoing describes preferred embodiments of the invention in detail. It should be understood that those skilled in the art could devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation based on the prior art and the concept of the present invention shall fall within the protection scope of the invention.

Claims (10)

1. An incremental data enhancement method for visual question-answering model training, the method comprising:
acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences;
obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution;
and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
2. The incremental data enhancement method for visual question-answering model training according to claim 1, characterized in that the length distribution of all sentences is collected, the length distribution presents a normal distribution, and the minimum sentence length threshold and the maximum sentence length threshold are determined at the 50th and 99th percentiles, respectively.
3. The incremental data enhancement method for visual question-answering model training according to claim 1, wherein for the natural language sequence of each training sample, it is determined whether the sentence length is smaller than the maximum sentence length threshold, if so, the natural language sequence is extended, the extended sentence length is within a length threshold range formed by the minimum sentence length threshold and the maximum sentence length threshold, and if not, the extension is not performed.
4. The incremental data enhancement method for visual question-answering model training according to claim 3, characterized in that a word in the natural language sequence is randomly selected and the sequence is augmented by repeating the word immediately after the original word.
5. The incremental data enhancement method for visual question-answering model training according to claim 3, characterized in that, based on the word frequency distribution, the words in the middle third of the frequency range are selected as candidates and their weights are reinforced, and a word in the natural language sequence is then selected in a weighted random manner and the sequence is augmented by repeating the word immediately after the original word.
6. A training method of a visual question-answering model is characterized by comprising the following steps:
initializing a model;
expanding the original training data set by the incremental data enhancement method of any one of claims 1 to 5 to obtain an expanded training data set;
performing feature extraction on training samples in the extended training data set to obtain text features and image features;
performing feature fusion on the image features and the text features to generate fusion features, and generating output answers based on the fusion features;
calculating an answer error based on the output answer and an initial answer in a training sample;
and performing parameter iterative adjustment on the visual question-answering model based on the answer error.
7. The training method of the visual question-answering model according to claim 6, wherein the extraction of the text features specifically comprises:
truncating the natural language sequence to the maximum length supported by the temporal neural network, and feeding the truncated natural language sequence into the temporal neural network to extract text features;
wherein the temporal neural network comprises a recurrent neural network module, the natural language sequence is fed step by step into the recurrent neural network module, and either the hidden feature of the last time step or a fusion of the hidden features of all time steps is used as the text feature.
8. The training method of the visual question-answering model according to claim 6, wherein the images in the training samples are fed into a convolutional neural network to extract the features of the corresponding convolutional layers and fully-connected layers, and the features of the last convolutional layer, the features of the penultimate fully-connected layer, or the features of the 36 highest-confidence targets are used as the image features.
9. The training method of the visual question-answering model according to claim 6, wherein the feature fusion specifically is:
respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with the same size, and performing dot multiplication on the two hidden layer features to obtain a fusion feature; or
Respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with different sizes, adjusting the sizes of the two hidden layer features to be the same through copying and expanding, performing dot multiplication on the two adjusted hidden layer features to generate a fused hidden layer feature, performing feature conversion on the fused hidden layer feature through one full connection layer, generating an attention feature through the other full connection layer, and performing dot multiplication fusion on the attention feature and the image features to generate a final fused feature.
10. The method for training a visual question-answering model according to claim 6, wherein the iterative method adopted by the iterative adjustment of the parameters comprises a second-order momentum optimization method.
CN202010563289.3A 2020-03-23 2020-06-19 Incremental data enhancement method for visual question-answer model training and application Active CN111967487B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010209983 2020-03-23
CN2020102099835 2020-03-23

Publications (2)

Publication Number Publication Date
CN111967487A (en) 2020-11-20
CN111967487B (en) 2022-09-20

Family

ID=73360374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563289.3A Active CN111967487B (en) 2020-03-23 2020-06-19 Incremental data enhancement method for visual question-answer model training and application

Country Status (1)

Country Link
CN (1) CN111967487B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN109815459A (en) * 2017-11-17 2019-05-28 奥多比公司 Generate the target summary for being adjusted to the content of text of target audience's vocabulary
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text
CN108710647A (en) * 2018-04-28 2018-10-26 苏宁易购集团股份有限公司 A kind of data processing method and device for chat robots
CN109002852A (en) * 2018-07-11 2018-12-14 腾讯科技(深圳)有限公司 Image processing method, device, computer readable storage medium and computer equipment
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110781680A (en) * 2019-10-17 2020-02-11 江南大学 Semantic similarity matching method based on twin network and multi-head attention mechanism

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613543A (en) * 2020-12-15 2021-04-06 重庆紫光华山智安科技有限公司 Enhanced policy verification method and device, electronic equipment and storage medium
CN112613543B (en) * 2020-12-15 2023-05-30 重庆紫光华山智安科技有限公司 Enhanced policy verification method, enhanced policy verification device, electronic equipment and storage medium
CN113220883A (en) * 2021-05-17 2021-08-06 华南师范大学 Text classification model performance optimization method and device and storage medium
CN113220883B (en) * 2021-05-17 2023-12-26 华南师范大学 Text classification method, device and storage medium
CN113516182A (en) * 2021-07-02 2021-10-19 文思海辉元辉科技(大连)有限公司 Visual question-answering model training method and device, and visual question-answering method and device
CN113516182B (en) * 2021-07-02 2024-04-23 文思海辉元辉科技(大连)有限公司 Visual question-answering model training and visual question-answering method and device
CN113486833A (en) * 2021-07-15 2021-10-08 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN116841756A (en) * 2023-09-04 2023-10-03 奇点数联(北京)科技有限公司 Acquisition method of target incremental data
CN116841756B (en) * 2023-09-04 2023-11-10 奇点数联(北京)科技有限公司 Acquisition method of target incremental data

Also Published As

Publication number Publication date
CN111967487B (en) 2022-09-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant