CN111967487A - Incremental data enhancement method for visual question-answering model training and application

Incremental data enhancement method for visual question-answering model training and application

Info

Publication number
CN111967487A
Authority
CN
China
Prior art keywords
features
training
feature
sentence length
word
Prior art date
Legal status
Granted
Application number
CN202010563289.3A
Other languages
Chinese (zh)
Other versions
CN111967487B (en)
Inventor
王瀚漓
龙宇
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Publication of CN111967487A publication Critical patent/CN111967487A/en
Application granted granted Critical
Publication of CN111967487B publication Critical patent/CN111967487B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks


Abstract

The invention relates to an incremental data enhancement method for visual question-answering model training, which comprises the following steps: acquiring an original training data set whose training samples take the form of images, texts and answers, the texts being natural language sequences; obtaining the sentence length distribution of the natural language sequences in the original training data set and the frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution; and expanding the natural language sequences in the training samples according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement. Compared with the prior art, the method increases data diversity, is efficient, and is simple to implement.

Description

Incremental data enhancement method for visual question-answering model training and application
Technical Field
The invention relates to a model training method, in particular to an incremental data enhancement method for visual question-answering model training and application.
Background
With the widespread adoption of mobile devices and the growing demands of users in recent years, the volume of visual data presented to each person has grown explosively, and the demand for visual question-answering systems that can answer users' questions keeps rising. A visual question-answering system aims to interpret visual information according to users' needs, which involves understanding the question as well as retrieving, localizing, and reasoning about objects. Compared with other cross-modal tasks such as visual captioning, the development of visual question answering is still limited by the contradiction between an unbounded search space and incomplete training data, the gap between statistical reasoning and actual reasoning, answer conflicts caused by differences in understanding, the diversity of semantic expression, and the tension between reasoning difficulty and data size.
The diversity of semantic expression increases the likelihood of contradictions in the data and of answer conflicts, which in turn raises the difficulty of reasoning, so it is an important problem to be addressed. Existing methods usually rely only on data cleaning, i.e., removing invalid entries from the semantic text data, which makes it difficult to reach the required recognition performance.
Disclosure of Invention
The present invention aims to overcome the above defects of the prior art by providing an incremental data enhancement method for visual question-answering model training, and applications thereof, that are simple to implement.
The purpose of the invention can be realized by the following technical scheme:
an incremental data enhancement method for visual question-answering model training, the method comprising:
acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences;
obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution;
and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
Further, the length distribution of all sentences is collected; the distribution is approximately normal, and the minimum sentence length threshold and the maximum sentence length threshold are set at the 50th and 99th percentiles, respectively.
Further, for the natural language sequence of each training sample, it is judged whether the sentence length is smaller than the maximum sentence length threshold; if so, the sequence is expanded so that the expanded sentence length lies within the range formed by the minimum and maximum sentence length thresholds; if not, no expansion is performed.
Further, a word in the natural language sequence is randomly selected, and the sequence is expanded by repeating that word immediately after the original word.
Furthermore, based on the word frequency distribution, words whose frequency lies in the middle third of the frequency range are taken as candidates and given reinforced weights; a word in the natural language sequence is then selected by weighted random sampling, and the sequence is expanded by repeating that word immediately after the original word.
The invention also provides a training method of the visual question-answering model, which comprises the following steps:
initializing a model;
expanding the original training data set by the incremental data enhancement method to obtain an expanded training data set;
performing feature extraction on training samples in the extended training data set to obtain text features and image features;
performing feature fusion on the image features and the text features to generate fusion features, and generating output answers based on the fusion features;
calculating an answer error based on the output answer and an initial answer in a training sample;
and performing parameter iterative adjustment on the visual question-answering model based on the answer error.
Further, the extraction of the text features specifically comprises:
truncating the natural language sequence to the maximum length supported by the temporal neural network, and feeding the truncated natural language sequence into the temporal neural network to extract text features;
wherein the temporal neural network comprises a recurrent neural network module, the natural language sequence is fed step by step into the recurrent neural network module, and either the hidden feature of the last time step or a fusion of the hidden features of all time steps is used as the text feature.
Further, the images in the training samples are fed into a convolutional neural network to extract the corresponding convolutional layer and fully-connected layer features, and the features of the last convolutional layer, the features of the penultimate fully-connected layer, or the features of the 36 highest-confidence targets are used as the image features.
Further, the feature fusion specifically includes:
respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with the same size, and performing dot multiplication on the two hidden layer features to obtain a fusion feature; or
Respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with different sizes, adjusting the sizes of the two hidden layer features to be the same through copying and expanding, performing dot multiplication on the two adjusted hidden layer features to generate a fused hidden layer feature, performing feature conversion on the fused hidden layer feature through one full connection layer, generating an attention feature through the other full connection layer, and performing dot multiplication fusion on the attention feature and the image features to generate a final fused feature.
Further, the iterative method adopted by the parameter iterative adjustment comprises a second-order momentum optimization method.
Compared with the prior art, the invention has the following beneficial effects:
(1) The method expands the training data set and enhances the variation of the text data, realizing data diversity. By applying limited deformations to the expression of the original sentences, the natural language takes on diverse expression forms, so the model receives the same information in a variety of lengths and styles; this strengthens the robustness of the model to expression form, improves the effectiveness of the model, and raises evaluation metrics such as classification accuracy;
(2) The method designs a random enhancement strategy and a word frequency enhancement strategy from data statistics to expand the natural language data; the operations are very simple and highly practical;
(3) The invention only deforms the input text data and involves neither changes to the model nor the input of extra data, so it incurs no additional computation or data requirements and no additional cost, which is of great significance in practical applications;
(4) Training the visual question-answering model on the training data set after data expansion yields high training accuracy, giving the method strong advantages and application prospects.
Drawings
FIG. 1 is a flow chart of the training process of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
Example 1
The embodiment provides an incremental data enhancement method for visual question-answering model training, which comprises a data statistics step, a threshold determination step and a data expansion step, and specifically comprises the following steps: acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences; obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution; and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
The data statistics step collects the sentence length distribution and the word frequency distribution over all original language sequences in the training data set. The permitted length range of the text language sequences is determined from the sentence length statistics by setting the minimum sentence length threshold and the maximum sentence length threshold, and the frequency of each word is determined from the word frequency statistics for use in the subsequent word selection probabilities.
The specific process of generating the thresholds and the word frequency distribution is as follows. The sentence lengths fall in the range 0 to L and follow an approximately normal distribution; the minimum sentence length threshold and the maximum sentence length threshold are set at the 50th and 99th percentiles, respectively, and sentences are truncated according to these thresholds. In this embodiment the maximum sentence length threshold is set to 14. The occurrence frequency of every word in the training data set is then counted; the word frequencies fall in the range 0 to F and, as a whole, follow Zipf's law. The words are divided into three segments by frequency, namely words with frequency in (0, e^{(ln F)/3}], (e^{(ln F)/3}, e^{2(ln F)/3}], and (e^{2(ln F)/3}, F] (equivalently, segment boundaries at F^{1/3} and F^{2/3}), and the frequency corresponding to every word is recorded.
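By way of illustration only, the statistics step might be sketched in Python as follows; the function and variable names are ours rather than the patent's, and the percentile and counting utilities assume NumPy and the standard library:

```python
# A minimal sketch of the data statistics step, assuming each question is
# already tokenized into a list of words; all names here are illustrative.
from collections import Counter

import numpy as np

def compute_statistics(questions):
    """Return length thresholds, per-word frequencies, and the middle band."""
    lengths = np.array([len(q) for q in questions])
    # Thresholds at the 50th and 99th percentiles of the (roughly normal)
    # sentence length distribution, as described above.
    min_len = int(np.percentile(lengths, 50))
    max_len = int(np.percentile(lengths, 99))

    word_freq = Counter(w for q in questions for w in q)
    F = max(word_freq.values())
    # Zipf-style segment boundaries: e^((ln F)/3) = F^(1/3), e^(2(ln F)/3) = F^(2/3).
    low_cut, high_cut = F ** (1.0 / 3.0), F ** (2.0 / 3.0)
    middle_band = {w for w, f in word_freq.items() if low_cut < f <= high_cut}
    return min_len, max_len, word_freq, middle_band
```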
In the data expansion step, for the natural language sequence of each training sample it is judged whether the sentence length is smaller than the maximum sentence length threshold; if so, the sequence is expanded so that the expanded sentence length lies within the threshold range formed by the minimum and maximum sentence length thresholds; if not, no expansion is performed. The expansion strategies of this embodiment comprise a random enhancement strategy and a word frequency enhancement strategy.
When the random enhancement strategy is adopted, for each natural language sequence in the input data, the sentence length is first compared with the maximum sentence length threshold; if the length is greater than or equal to the threshold, no operation is performed, and if it is below the threshold, the sentence is expanded. This guarantees that the final length is greater than the minimum sentence length threshold and less than the maximum sentence length threshold. The single-sentence expansion step randomly selects a word at some position in the current sentence and inserts a repeated copy of that word immediately after it, lengthening the sentence by one word; the step is repeated as needed, and the modified natural language sequence replaces the original one.
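As a non-authoritative illustration, the random enhancement strategy described above might be implemented as the following Python sketch; the choice of a target length inside the threshold range is our assumption about how the repeated expansion terminates:

```python
import random

def random_augment(question, min_len, max_len):
    """Expand a tokenized question by duplicating randomly chosen words."""
    q = list(question)
    if len(q) >= max_len:
        return q  # sentences at or above the maximum threshold are left unchanged
    # Choose a final length strictly inside (min_len, max_len); each expansion
    # step inserts a repeated copy of a random word right after the original.
    lo = min(max(min_len + 1, len(q)), max_len - 1)
    target = random.randint(lo, max_len - 1)
    while len(q) < target:
        i = random.randrange(len(q))
        q.insert(i + 1, q[i])
    return q
```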
When the word frequency enhancement strategy is adopted, for each natural language sequence in the input data, the sentence length is likewise first compared with the maximum sentence length threshold; if the length is greater than or equal to the threshold, no operation is performed, and if it is below the threshold, the sentence is expanded, again guaranteeing that the final length lies between the minimum and maximum sentence length thresholds. The single-sentence expansion step first uses the word frequency distribution to take the words in the middle third of the frequency range as candidates and reinforce their weights (this embodiment uses a weight of 2); a word at some position in the sentence is then chosen by weighted random sampling (the probability of selecting a word is its weight divided by the sum of the weights of all words in the current sentence), and a repeated copy of that word is inserted immediately after it, lengthening the sentence by one word. The step is repeated as needed, and the modified natural language sequence replaces the original one.
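Under the same assumptions as the previous sketch, the word frequency enhancement strategy might look as follows, with `middle_band` being the set of middle-frequency words from the statistics step and the weight of 2 taken from this embodiment:

```python
import random

def word_freq_augment(question, min_len, max_len, middle_band, boost=2.0):
    """Like random_augment, but middle-frequency words are favored candidates."""
    q = list(question)
    if len(q) >= max_len:
        return q
    lo = min(max(min_len + 1, len(q)), max_len - 1)
    target = random.randint(lo, max_len - 1)
    while len(q) < target:
        # Weight 2 for middle-frequency words, 1 otherwise; selection probability
        # is the word's weight over the sum of weights in the current sentence.
        weights = [boost if w in middle_band else 1.0 for w in q]
        i = random.choices(range(len(q)), weights=weights, k=1)[0]
        q.insert(i + 1, q[i])
    return q
```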
Example 2
This embodiment provides a training method for a visual question-answering model that adopts end-to-end training; the flow of the training method is shown in FIG. 1 and includes:
(1) and initializing the model.
(2) The original training data set is expanded by the incremental data enhancement method described in Example 1 to obtain an expanded training data set, thereby implementing text data enhancement.
(3) And performing feature extraction on the training samples in the extended training data set to obtain text features and image features.
During model training, the text language sequence is first truncated so that its length does not exceed the maximum length supported by the temporal neural network model; the truncated sequence is fed into a lookup table module, whose output is fed into the temporal neural network to extract text features. In the testing stage, the original text language sequence is fed into the temporal neural network to extract text features. In both the training and testing stages, the images are fed into a convolutional neural network to extract the corresponding convolutional layer and fully-connected layer image features.
31) In text language feature extraction, the data fed into the model each time is a batch of cross-modal data pairs {V, Q, A}. Taking a single pair as an example, Q is the corresponding text language information and can be represented as a word sequence {word_1, word_2, ..., word_T}, where T has a maximum length of 14 and words beyond this length are discarded. The word sequence Q is first fed in full into a lookup table module (Lookup Table), which maps each original one-hot dictionary vector, of the form {0, ..., 0, 1, 0, ..., 0}, to the corresponding word embedding, yielding the sequence {word_vector_1, word_vector_2, ..., word_vector_T}, where each word_vector_t is a vector of dimension 1 × 300. The word embedding sequence of the sentence is then fed step by step, in time-step order, into a specific recurrent neural network module; this embodiment uses a Gated Recurrent Unit (GRU). Each step of the recurrent network yields two vectors, a hidden state and a subsequent output, where hidden_state_t, the hidden feature at time step t, is a vector of dimension 1 × 1024. This embodiment verifies the effect using two alternative text features: the hidden feature of the last time step, and a fusion of the hidden features of all time steps. If the hidden feature of the last time step is used as the output feature, the final output is ques_representation = hidden_state_T, i.e., a 1 × 1024 vector. If the fusion of the hidden features of all time steps is used, the hidden features {hidden_state_1, hidden_state_2, ..., hidden_state_T} are first transformed by a shared convolutional layer into a T × 512 tensor and then by another shared convolutional layer into a T × 2 tensor, which is split into two T-dimensional vectors; these serve as two attention heads that are separately dot-multiplied with, and summed over, the hidden features of all time steps to form 1 × 1024 vectors, and the outputs of the two attention heads are concatenated along the final dimension. The concatenated output can be written as ques_representation = Self-Attention_2(hidden_state_1, hidden_state_2, ..., hidden_state_T), i.e., a 1 × 2048 vector. The output feature ques_representation is used in the subsequent steps as the text language feature.
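A minimal PyTorch sketch of this text branch is given below under the dimensions stated above (300-dimensional embeddings, a 1024-dimensional GRU, two attention heads); the module names, the tanh nonlinearity, and the use of linear layers in place of the 1 × 1 convolutions are our assumptions:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Lookup table + GRU; variant 1 returns the last time step's hidden state."""
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=1024):
        super().__init__()
        self.lookup = nn.Embedding(vocab_size, emb_dim)      # lookup table module
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):                             # word_ids: (B, T), T <= 14
        emb = self.lookup(word_ids)                          # (B, T, 300)
        hidden_states, last = self.gru(emb)                  # (B, T, 1024), (1, B, 1024)
        return hidden_states, last.squeeze(0)                # variant 1 output: (B, 1024)

class SelfAttentionPool(nn.Module):
    """Variant 2: fuse all time steps with two attention heads (1 x 2048 output)."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 512)               # T x 1024 -> T x 512
        self.heads = nn.Linear(512, 2)                       # T x 512  -> T x 2

    def forward(self, hidden_states):                        # (B, T, 1024)
        scores = self.heads(torch.tanh(self.proj(hidden_states)))    # (B, T, 2)
        alpha = torch.softmax(scores, dim=1)                 # attention over time steps
        pooled = torch.einsum('bth,btk->bkh', hidden_states, alpha)  # (B, 2, 1024)
        return pooled.flatten(1)                             # concatenated: (B, 2048)
```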
32) In image feature extraction, this embodiment uses different image sizes depending on the reference model, so the original image is first scaled to 224 × 224 or 448 × 448 according to the requirements of the reference model. An image of the required size is then either fed into a ResNet-152 model pre-trained on ImageNet, with the features of the last convolutional layer or of the penultimate fully-connected layer extracted as image features, so that image_representation is of dimension 1024 × 14 × 14 or 1 × 2048; or fed into a Faster R-CNN model pre-trained on the object detection task of the MSCOCO data set, with the features of the 36 highest-confidence detected targets used as image features, so that image_representation is of dimension 36 × 2048. The output feature image_representation is passed to the subsequent steps as the image feature.
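As one possible realization of the grid-feature variant, the sketch below taps a pre-trained torchvision ResNet-152 after its last convolutional block; the image path is a placeholder, and note that torchvision's final block emits 2048 channels (the 1024 × 14 × 14 figure above would correspond to tapping an earlier layer):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pre-trained ResNet-152 with the pooling and classification head removed,
# keeping everything up to (and including) the last convolutional block.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1).eval()
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

preprocess = T.Compose([
    T.Resize((448, 448)),                                   # input size cited above
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")              # placeholder path
with torch.no_grad():
    feat = backbone(preprocess(img).unsqueeze(0))           # (1, 2048, 14, 14)
```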
(4) The image features and the text features are fused to generate fusion features, and an output answer is generated based on the fusion features.
During both the training and testing of the model, the text language features and image features extracted in the previous steps are fed into a fusion reasoning module, which performs feature conversion on each feature and then applies the corresponding fusion operation to generate the fusion features. The generated fusion features are fed into a fully-connected layer for feature conversion, producing an answer feature vector of dimension 1 × C, where C is the number of answers in the reference data set; the answer features are followed by a softmax layer, and the category with the maximum probability is the generated answer.
Depending on the reference model, the two features are fused in one of several forms, including:
41) Simple conversion fusion: the two features are each transformed by their own fully-connected layer into hidden features, which are vectors of dimension 1 × 2048. The two same-sized hidden features are then element-wise multiplied to complete the fusion, producing the fusion feature fusion_representation, a vector of dimension 1 × 2048, which is used in the subsequent steps as the fusion reasoning feature.
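A minimal sketch of this simple conversion fusion, assuming a 1 × 2048 text feature, a 1 × 2048 global image feature, and the 3129-answer setting of VQA 2.0 for the classifier; the class and parameter names are illustrative:

```python
import torch.nn as nn

class SimpleFusion(nn.Module):
    """Variant 41: per-branch fully-connected layers, then element-wise product."""
    def __init__(self, img_dim=2048, txt_dim=2048, hidden=2048, num_answers=3129):
        super().__init__()
        self.img_fc = nn.Linear(img_dim, hidden)
        self.txt_fc = nn.Linear(txt_dim, hidden)
        self.classifier = nn.Linear(hidden, num_answers)     # softmax applied at inference

    def forward(self, image_representation, ques_representation):
        fused = self.img_fc(image_representation) * self.txt_fc(ques_representation)
        return self.classifier(fused)                        # (B, C) answer logits
```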
42) Fusion through an attention mechanism model: the two features are first each transformed by a fully-connected layer into hidden features, namely tensors of dimension K × 2048 (K may be 36 or 196) and 1 × 2048. The hidden feature of the text language is then copied and expanded to the same size as the image feature, i.e., K × 2048, and the two hidden features are element-wise multiplied to generate a fused hidden feature of dimension K × 2048. This is transformed by a fully-connected layer to dimension K × 512, and then by another fully-connected layer to an attention feature of dimension K × 1. The attention feature is then fused with the image features by weighted (dot-product) combination to generate the final fusion feature fusion_representation, a vector of dimension 1 × 2048. The output feature fusion_representation is used in the subsequent steps as the fusion reasoning feature.
43) Fusion through an outer-product model combined with the attention mechanism: the two features are first each transformed by a fully-connected layer into hidden features of dimension K × 2048 (K may be 36 or 196) and 1 × 2048, respectively. The hidden feature of the text language is copied and expanded to the same size as the image feature, i.e., K × 2048; the two hidden features are combined by outer-product multiplication using a direct or indirect rank-reduction method or an approximate rank-reduction method to generate a fused hidden feature of dimension K × 2048, which is transformed by a fully-connected layer to dimension K × 512 and then by another fully-connected layer to an attention feature of dimension K × 1. The attention feature is then fused with the image features by weighted combination to generate the final fusion feature fusion_representation, a vector of dimension 1 × 2048, used in the subsequent steps as the fusion reasoning feature.
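The attention-based variant 42) might be sketched as follows; the softmax over regions and the ReLU nonlinearity are our assumptions, and the outer-product variant 43) would replace the element-wise product with a low-rank bilinear interaction:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Variant 42: broadcast text feature over K regional image features."""
    def __init__(self, dim=2048):
        super().__init__()
        self.img_fc = nn.Linear(dim, dim)
        self.txt_fc = nn.Linear(dim, dim)
        self.att_fc1 = nn.Linear(dim, 512)                   # K x 2048 -> K x 512
        self.att_fc2 = nn.Linear(512, 1)                     # K x 512  -> K x 1

    def forward(self, img_feats, txt_feat):                  # (B, K, 2048), (B, 2048)
        v = self.img_fc(img_feats)                           # (B, K, 2048)
        q = self.txt_fc(txt_feat).unsqueeze(1)               # (B, 1, 2048), broadcast to K
        joint = v * q                                        # fused hidden feature
        att = torch.softmax(self.att_fc2(torch.relu(self.att_fc1(joint))), dim=1)
        return (att * img_feats).sum(dim=1)                  # fusion feature: (B, 2048)
```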
(5) An answer error is calculated based on the output answer and the initial answer in the training sample. In this embodiment, a cross-entropy loss function is used to calculate the error.
(6) The parameters of the visual question-answering model are iteratively adjusted based on the answer error. The iterative method used in this embodiment includes a second-order momentum optimization method (such as Adam).
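Putting the pieces together, one optimization step under these choices might look like the following sketch, where `model` stands for any of the fusion networks above, `train_loader` yields batches from the expanded training set, and the learning rate is an assumed value:

```python
import torch
import torch.nn as nn

def train_one_epoch(model, train_loader, lr=1e-4):
    """One epoch of end-to-end training with cross-entropy loss and Adam."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # second-order momentum method
    for images, questions, answers in train_loader:          # expanded training set
        logits = model(images, questions)                    # (B, C) answer scores
        loss = criterion(logits, answers)                    # answers: class indices
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```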
To verify the performance of the above method, the following experiment was designed.
Experimental validation was first performed on two public data sets (COCO-QA and VQA 2.0). The COCO-QA data set consists of <image, text, answer> triples, with 431 answers used as classification categories; 78736 data pairs corresponding to 8000 images serve as the training set, and 38948 data pairs corresponding to 4000 images serve as the validation set. The VQA 2.0 data set likewise consists of <image, text, answer> triples, with the top 3129 answers taken as classification results, for a total of 443757 and 214354 data pairs as the training and validation sets, respectively.
The evaluation metric used in the experiments is classification accuracy.
The results on the COCO-QA data set are shown in Table 1; it is readily seen that the proposed method outperforms the original method across the various models.
Table 1. Comparison of classification accuracy on the COCO-QA data set
Reference method    No data enhancement    With data enhancement
LSTM+CNN 60.93 61.63
SAN 65.34 65.64
LSTM+CNN_AttRNN 61.40 61.86
SAN_AttRNN 65.54 65.83
The results on the VQA 2.0 data set are shown in Table 2; after applying the data expansion enhancement strategy, each benchmark model performs better than its counterpart without enhancement.
Table 2. Comparison of classification accuracy on the VQA 2.0 data set
Reference method    No data enhancement    With data enhancement
LSTM+CNN 51.61 52.10
BLOCK 63.03 63.34
BAN 64.67 64.97
This series of experimental results demonstrates that, on various public data sets, the incremental data enhancement method for natural language processing in visual question answering provided by the invention is clearly effective, simple to implement, and free of additional cost, and thus has strong advantages and application prospects among known text data enhancement techniques.
The foregoing describes preferred embodiments of the invention in detail. It should be understood that those skilled in the art could devise numerous modifications and variations in light of the present teachings without departing from the inventive concept. Therefore, any technical solution that a person skilled in the art can obtain through logical analysis, reasoning, or limited experimentation based on the prior art and the concept of the present invention shall fall within the protection scope of the invention.

Claims (10)

1. An incremental data enhancement method for visual question-answering model training, the method comprising:
acquiring an original training data set, wherein training samples in the data set are in the form of images, texts and answers, and the texts are formed by natural language sequences;
obtaining sentence length distribution of a natural language sequence in the original training data set and word frequency distribution of each word, and determining a minimum sentence length threshold and a maximum sentence length threshold based on the sentence length distribution;
and expanding the natural language sequence in the training sample according to the minimum sentence length threshold, the maximum sentence length threshold and the word frequency distribution to realize data enhancement.
2. The incremental data enhancement method for visual question-answering model training according to claim 1, characterized in that the length distribution of all sentences is collected, the length distribution presents a normal distribution, and the minimum sentence length threshold and the maximum sentence length threshold are determined at the 50th and 99th percentiles, respectively.
3. The incremental data enhancement method for visual question-answering model training according to claim 1, wherein for the natural language sequence of each training sample, it is determined whether the sentence length is smaller than the maximum sentence length threshold, if so, the natural language sequence is extended, the extended sentence length is within a length threshold range formed by the minimum sentence length threshold and the maximum sentence length threshold, and if not, the extension is not performed.
4. The incremental data enhancement method for visual question-answering model training according to claim 3, characterized in that a word in the natural language sequence is randomly selected and the sequence is augmented by repeating the word immediately after the original word.
5. The incremental data enhancement method for visual question-answering model training according to claim 3, characterized in that, based on the word frequency distribution, the words in the middle third of the frequency range are selected as candidates and their weights are reinforced, and a word in the natural language sequence is then selected in a weighted random manner and the sequence is augmented by repeating the word immediately after the original word.
6. A training method of a visual question-answering model is characterized by comprising the following steps:
initializing a model;
expanding the original training data set by the incremental data enhancement method of any one of claims 1 to 5 to obtain an expanded training data set;
performing feature extraction on training samples in the extended training data set to obtain text features and image features;
performing feature fusion on the image features and the text features to generate fusion features, and generating output answers based on the fusion features;
calculating an answer error based on the output answer and an initial answer in a training sample;
and performing parameter iterative adjustment on the visual question-answering model based on the answer error.
7. The training method of the visual question-answering model according to claim 6, wherein the extraction of the text features specifically comprises:
truncating the natural language sequence to the maximum length supported by the temporal neural network, and feeding the truncated natural language sequence into the temporal neural network to extract text features;
wherein the temporal neural network comprises a recurrent neural network module, the natural language sequence is fed step by step into the recurrent neural network module, and either the hidden feature of the last time step or a fusion of the hidden features of all time steps is used as the text feature.
8. The training method of the visual question-answering model according to claim 6, wherein the images in the training samples are fed into a convolutional neural network to extract the features of the corresponding convolutional layers and fully-connected layers, and the features of the last convolutional layer, the features of the penultimate fully-connected layer, or the features of the 36 highest-confidence targets are used as the image features.
9. The training method of the visual question-answering model according to claim 6, wherein the feature fusion specifically is:
respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with the same size, and performing dot multiplication on the two hidden layer features to obtain a fusion feature; or
Respectively performing feature conversion on the image features and the text features through a full connection layer to generate two hidden layer features with different sizes, adjusting the sizes of the two hidden layer features to be the same through copying and expanding, performing dot multiplication on the two adjusted hidden layer features to generate a fused hidden layer feature, performing feature conversion on the fused hidden layer feature through one full connection layer, generating an attention feature through the other full connection layer, and performing dot multiplication fusion on the attention feature and the image features to generate a final fused feature.
10. The method for training a visual question-answering model according to claim 6, wherein the iterative method adopted by the iterative adjustment of the parameters comprises a second-order momentum optimization method.
CN202010563289.3A 2020-03-23 2020-06-19 Incremental data enhancement method for visual question-answer model training and application Active CN111967487B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010209983 2020-03-23
CN2020102099835 2020-03-23

Publications (2)

Publication Number Publication Date
CN111967487A (en) 2020-11-20
CN111967487B (en) 2022-09-20

Family

ID=73360374

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010563289.3A Active CN111967487B (en) 2020-03-23 2020-06-19 Incremental data enhancement method for visual question-answer model training and application

Country Status (1)

Country Link
CN (1) CN111967487B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN109815459A (en) * 2017-11-17 2019-05-28 奥多比公司 Generate the target summary for being adjusted to the content of text of target audience's vocabulary
CN108334497A (en) * 2018-02-06 2018-07-27 北京航空航天大学 The method and apparatus for automatically generating text
CN108710647A (en) * 2018-04-28 2018-10-26 苏宁易购集团股份有限公司 A kind of data processing method and device for chat robots
CN109002852A (en) * 2018-07-11 2018-12-14 腾讯科技(深圳)有限公司 Image processing method, device, computer readable storage medium and computer equipment
CN109784163A (en) * 2018-12-12 2019-05-21 中国科学院深圳先进技术研究院 A kind of light weight vision question answering system and method
CN109902166A (en) * 2019-03-12 2019-06-18 北京百度网讯科技有限公司 Vision Question-Answering Model, electronic equipment and storage medium
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110781680A (en) * 2019-10-17 2020-02-11 江南大学 Semantic similarity matching method based on twin network and multi-head attention mechanism

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112613543A (en) * 2020-12-15 2021-04-06 重庆紫光华山智安科技有限公司 Enhanced policy verification method and device, electronic equipment and storage medium
CN112613543B (en) * 2020-12-15 2023-05-30 重庆紫光华山智安科技有限公司 Enhanced policy verification method, enhanced policy verification device, electronic equipment and storage medium
CN113220883A (en) * 2021-05-17 2021-08-06 华南师范大学 Text classification model performance optimization method and device and storage medium
CN113220883B (en) * 2021-05-17 2023-12-26 华南师范大学 Text classification method, device and storage medium
CN113516182A (en) * 2021-07-02 2021-10-19 文思海辉元辉科技(大连)有限公司 Visual question-answering model training method and device, and visual question-answering method and device
CN113516182B (en) * 2021-07-02 2024-04-23 文思海辉元辉科技(大连)有限公司 Visual question-answering model training and visual question-answering method and device
CN113486833A (en) * 2021-07-15 2021-10-08 北京达佳互联信息技术有限公司 Multi-modal feature extraction model training method and device and electronic equipment
CN116841756A (en) * 2023-09-04 2023-10-03 奇点数联(北京)科技有限公司 Acquisition method of target incremental data
CN116841756B (en) * 2023-09-04 2023-11-10 奇点数联(北京)科技有限公司 Acquisition method of target incremental data

Also Published As

Publication number Publication date
CN111967487B (en) 2022-09-20


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant