CN117095187B - Meta-learning visual language understanding and positioning method - Google Patents

Meta-learning visual language understanding and positioning method

Info

Publication number
CN117095187B
CN117095187B
Authority
CN
China
Prior art keywords
learning
meta
representing
training
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311330418.4A
Other languages
Chinese (zh)
Other versions
CN117095187A (en)
Inventor
苏超
彭德中
胡鹏
袁钟
王旭
孙元
秦阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202311330418.4A
Publication of CN117095187A
Application granted
Publication of CN117095187B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps: constructing a training set; constructing a meta-learning visual language understanding and positioning training model; constructing the inner layer circulation training of meta-learning based on a random uncorrelated training mechanism, and updating the parameters of the basic learner with the support set; calculating the loss on the query set of the training set with the updated basic learner parameters, and reversely optimizing the meta-learning visual language understanding and positioning training model; and encoding a test picture-text description sample pair with the optimized meta-learning visual language understanding and positioning training model, and outputting the positioning frame of the described object in the picture. The invention solves the problem that existing visual language understanding and positioning methods over-focus on the training set when the distribution difference between the training set and the test set of a visual language understanding and positioning data set is too large, which leads to poor model generalization ability and in turn reduces visual language understanding and positioning accuracy.

Description

Meta-learning visual language understanding and positioning method
Technical Field
The invention belongs to the technical field of multi-mode visual language understanding and positioning, and particularly relates to a meta-learning visual language understanding and positioning method.
Background
Visual language understanding and positioning (VG) refers to accurately locating a target region or object in an image according to a natural language expression. In short, given an input picture and the text description of an object in it, the method outputs the coordinates of the positioning frame of the described object in the picture. In visual language understanding and positioning tasks, the described object is typically specified by one or more pieces of information in the text description, which may include object properties, appearance properties, visual relationship context, and the like. Visual language understanding and positioning combines computer vision with natural language understanding to enhance image understanding and analysis capabilities, and it supports applications such as image description generation, image-text retrieval, and visual question answering. In general, visual language understanding and positioning techniques play a vital role in the many fields that call for combining computer vision and natural language understanding, and the topic is of significant research value.
In recent years, various deep visual language understanding and positioning methods have been explored. They extract the visual features of a picture and the language features of the corresponding text description of an object in the picture with neural networks, and then generate the final positioning frame through feature fusion. These methods fall into three main categories: two-stage, one-stage, and Transformer-based methods. Two-stage methods generate candidate box regions in a first stage, match these candidate boxes with the text description in a second stage, and then rank the candidate boxes to select the final positioning frame; however, ranking and selecting the candidate boxes requires a large amount of computation, and since all possible candidate boxes cannot be enumerated, the resulting candidate boxes are also suboptimal. One-stage methods directly fuse the text description with the image features and directly predict the bounding box of the mentioned object, reducing the redundant computation on region proposals by densely sampling the possible target locations; this significantly reduces computation compared with two-stage methods, but these methods are still built on generic object detectors, and the inference process relies on predictions over all possible candidate regions, so performance remains limited by the quality of the proposals or the predefined anchor box configuration. Furthermore, whether in two-stage or one-stage methods, the candidate objects are essentially represented as region features (corresponding to the predicted proposals) or point features (features of dense anchor boxes) to be matched or fused with the language features of the text description; such feature representations may be insufficiently flexible for capturing the detailed visual concepts or contexts mentioned in the text description, and this inflexibility increases the difficulty of identifying the target object. With the development of Transformer models based on the attention mechanism, current visual language understanding and positioning methods have achieved direct regression of the positioning frame coordinates based on Transformers: in Transformer-based visual language understanding and positioning methods, the attention layers, the core components of the Transformer model, establish intra-modal and inter-modal correspondences between the visual and language inputs, and the deep model directly regresses the cross-modal data into a positioning frame. However, whether the two-stage methods, the one-stage methods, or the latest Transformer-based methods, they over-focus on the training set when facing scenes where the distribution difference between the training set and the test set of a visual language understanding and positioning data set is too large, so the generalization ability of the model is poor, the model over-fits, and the positioning accuracy of the visual language understanding and positioning model is greatly affected.
Disclosure of Invention
Aiming at the defects in the prior art, the meta-learning visual language understanding and positioning method provided by the invention solves the problems of slow convergence and unstable training when visual language understanding and positioning tasks are trained, as well as the problems of poor model generalization ability and model over-fitting caused by excessive attention to the training set when the distribution difference between the training set and the test set of the visual language understanding and positioning data set is too large.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the scheme provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
s2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
s3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
s4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
s5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture.
The beneficial effects of the invention are as follows: the invention optimizes the inner layer and the outer layer of the meta-learning visual language understanding and positioning training model based on a random uncorrelated training mechanism, and uses the optimized model to output, for a test picture-text description sample pair, the positioning frame of the described object in the picture. The proposed random uncorrelated training mechanism allows the meta-learning visual language understanding and positioning training model to carry out meta-learning iterative training directly on a visual language understanding and positioning data set, which improves the generalization ability of the model; the meta-learning iterative training also accelerates the convergence of the visual language understanding and positioning model and improves its training stability. The invention solves the problem that existing visual language understanding and positioning methods over-focus on the training set when the distribution difference between the training set and the test set of the visual language understanding and positioning data set is too large, which leads to poor generalization ability, model over-fitting, and reduced visual language understanding and positioning accuracy.
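For illustration, a minimal sketch of the training loop corresponding to steps S1 to S4 is given below in PyTorch-style Python. The function and argument names (model, dataset_sampler, alpha, beta) are assumptions for illustration, model(batch) is assumed to return the grounding loss, and a first-order approximation of the meta-gradient is used for brevity; this is not the exact implementation of the invention.
    import copy
    import torch
    def meta_train(model, dataset_sampler, num_rounds, alpha=1e-5, beta=1e-5):
        # Hypothetical MAML-style loop following steps S1-S4.
        # dataset_sampler(i) is assumed to return a disjoint (support, query) pair for round i.
        meta_opt = torch.optim.SGD(model.parameters(), lr=beta)      # outer-loop SGD over theta
        for i in range(num_rounds):
            support, query = dataset_sampler(i)                      # S1: disjoint random split
            base = copy.deepcopy(model)                              # S3: base learner copy
            inner_opt = torch.optim.SGD(base.parameters(), lr=alpha)
            inner_opt.zero_grad()
            base(support).backward()                                 # loss on the support set
            inner_opt.step()                                         # updated base-learner parameters
            base.zero_grad()                                         # clear support-set gradients
            base(query).backward()                                   # S4: loss on the query set
            meta_opt.zero_grad()
            for p, q in zip(model.parameters(), base.parameters()):
                p.grad = None if q.grad is None else q.grad.clone()  # first-order meta-gradient
            meta_opt.step()                                          # reversely optimize the model
        return model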
Further, the expressions of the support set and the query set are as follows:
$$S_i=\{(I^s_k,T^s_k)\}_{k=1}^{B},\qquad Q_i=\{(I^q_k,T^q_k)\}_{k=1}^{B}$$
wherein $S_i$ and $Q_i$ respectively represent the support set and the query set of the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture, $k$ takes values from 1 to $B$, and $B$ represents the batch size of each round of meta-learning iterative training.
The beneficial effects of the above-mentioned further scheme are: According to the invention, the support set and the query set of meta-learning are obtained through the random uncorrelated data dividing mechanism, meta-learning iterative training is carried out directly on the visual language understanding and positioning data set, and the visual language understanding and positioning model can carry out the subsequent meta-learning process using this support set and query set.
Still further, the step S2 includes the steps of:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the meta-learning visual language understanding and positioning training model by using a random gradient descent method based on the calculation result of S203.
The beneficial effects of the above-mentioned further scheme are: The visual features of the input picture and the language features of the corresponding text description are extracted by using a visual Transformer network and a Bert-based network as the visual branch and the language branch of the meta-learning visual language understanding and positioning training model, respectively; the visual language Transformer network then fuses the visual features of the picture with the language features of the text description, performs cross-modal reasoning, and directly regresses the coordinates of the predicted positioning frame; the training loss is calculated with the loss function, and the meta-learning visual language understanding and positioning training model is iteratively optimized with a stochastic gradient descent method.
Still further, the visual features of the picture and the language features of the text description are expressed as follows:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
The beneficial effects of the above-mentioned further scheme are: The extracted visual features of the picture and language features of the corresponding text description provide the basis for the subsequent feature fusion and cross-modal reasoning process.
Still further, the expression of the predicted positioning frame is as follows:
wherein,representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>And->Language features representing visual features and text descriptions of a picture, respectively,/->Representation for fusion->And->Is a visual language transducer network.
The beneficial effects of the above-mentioned further scheme are: The resulting predicted positioning frame and the real positioning frame are used as inputs to the loss function to calculate the training loss in the subsequent step.
Still further, the expression of the loss function of the meta-learning visual language understanding and localization training model is as follows:
wherein,loss function representing meta-learning visual language understanding and localization training model, < ->Representing a picture-text description sample pair +.>Is a true positioning frame of->Representing a picture in a picture-text description sample pair, < >>Representation and->Corresponding text description->Representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>Indicating the area of the area where the real positioning frame and the predicted positioning frame overlap, +.>Representing the sum of the areas of the real and predicted positioning frames,/->Representing the area of the smallest bounding rectangle of the real and predicted bounding boxes.
The beneficial effects of the above-mentioned further scheme are: The loss between the predicted positioning frame and the real positioning frame is calculated with the above loss function; because the loss function considers not only the overlapping region of the predicted and real positioning frames but also the non-overlapping regions, it better reflects the degree of overlap between the two frames and therefore measures the training loss of the model more accurately.
Still further, the updating of the parameters of the base learner has the expression:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
The beneficial effects of the above-mentioned further scheme are: Through the inner layer circulation training, the basic learner can learn feature representations and model parameters with stronger generalization ability, which improves generalization on visual language understanding and positioning tasks.
Still further, the expression of the weight parameters of the reverse optimization meta-learning visual language understanding and positioning training model is as follows:
wherein,weight parameters representing meta-learning visual language understanding and positioning training model, < ->Represent learning rate of outer cycle training, +.>Representing the total number of meta-learning iterative training, +.>Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Represent the firstiQuery set in iterative training of wheel element learning, < ->Inner layer circulation training learning rate for representing meta learning, < ->Represent the firstiSupport set in iterative training set in wheel learning, < ->Representation element learning visual language understandingLoss function with positioning training model, +.>Representing differential calculation +.>Representing the amount of parameter update.
The beneficial effects of the above-mentioned further scheme are: Through the outer layer circulation training of meta-learning, the meta-learner can rapidly optimize the parameters of the visual language understanding and positioning model, which accelerates convergence and makes training more stable.
Still further, the expression of the coordinates of the positioning frame is as follows:
wherein,representing the coordinates of the positioning frame in the form +.>,/>Respectively representing the abscissa and the ordinate of the center point of the positioning frame,/->Respectively represent the width and the height of the positioning frame, +.>Representing an optimized meta-learning visual language understanding and positioning training model +.>Picture-text description sample pair representing input optimized meta-learning visual language understanding and positioning training model for testing>Representing a test picture->Representation and->Corresponding text descriptions.
The beneficial effects of the above-mentioned further scheme are: The meta-learning visual language understanding and positioning training model obtained through iterative optimization encodes the test picture-text description sample pair and outputs the positioning frame, in the picture, of the object described by the text.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art; it should be understood, however, that the present invention is not limited to the scope of the embodiments, and for those skilled in the art, all inventions that make use of the inventive concept fall within the spirit and scope of the present invention as defined in the appended claims.
Examples
As shown in fig. 1, the invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
in this embodiment, training data is constructed, a support set and a query set are extracted from the training data set based on a random uncorrelated training mechanism for training a model, and in each iteration training of meta-learning, the target visual language understanding and positioning data set is randomly divided into an uncorrelated support set and query set without repeated data. It should be noted that each round of branches participating in meta-learning iterative trainingThe persistent set and the query set are irrelevant, i.e. training samples in the support set and the query set are not repeated at all, meta-learning is the firstiThe support set and query set in the round of iterative training are as follows:
wherein,and->Respectively represent that meta learning is at the firstiSupport set and query set in round of iterative training, +.>And->Representing the first of the support set and the query set, respectivelykInput pictures->And->Representing support set and query set and first, respectivelykA text description corresponding to the respective input picture,khas a value of 1 to->,/>The batch size in each round of iterative training in meta-learning is represented.
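A minimal sketch of this random uncorrelated division is shown below; the dataset is assumed to be an indexable list of picture-text pairs, and the function name is an illustrative assumption.
    import random
    def split_support_query(dataset, batch_size, seed=None):
        # Draw 2*batch_size distinct indices so that the support set and the
        # query set of this round share no samples (random uncorrelated split).
        rng = random.Random(seed)
        idx = rng.sample(range(len(dataset)), 2 * batch_size)   # sampling without replacement
        support = [dataset[j] for j in idx[:batch_size]]
        query = [dataset[j] for j in idx[batch_size:]]
        return support, query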
S2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair, and the implementation method is as follows:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the meta-learning visual language understanding and positioning training model by using a random gradient descent method based on the calculation result of S203.
In this embodiment, the input of the meta-learning visual language understanding and positioning model is a picture together with its corresponding text description, i.e. a picture-text description sample pair. A visual Transformer network is used as the visual branch of the model to extract the visual features of the picture in the input sample pair, and a Bert-based network is used as the language branch of the model to extract the language features of the text description in the input sample pair:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in the picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
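As an illustration only, the two branches could be instantiated as follows; the specific backbone checkpoints (a ViT from timm and bert-base-uncased from HuggingFace Transformers) are assumptions and not necessarily those used by the invention.
    import timm
    import torch
    from transformers import BertModel, BertTokenizer
    vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)   # visual branch
    bert = BertModel.from_pretrained("bert-base-uncased")                             # language branch
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    def encode_pair(image, text):
        # image: float tensor of shape (3, 224, 224); text: str.
        f_v = vit.forward_features(image.unsqueeze(0))           # visual token features F_v
        tokens = tokenizer(text, return_tensors="pt")
        f_l = bert(**tokens).last_hidden_state                   # language token features F_l
        return f_v, f_l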
The visual features of the picture and the language features of the text description are then fused with the visual language Transformer network, and the coordinates of the visual target mentioned by the text description are directly regressed through cross-modal relation reasoning to obtain the predicted positioning frame:
$$\hat b=\Phi_{VL}(F_v,F_l)$$
wherein $\hat b$ represents the predicted positioning frame in the form $\hat b=(\hat x,\hat y,\hat w,\hat h)$, $\hat x$ and $\hat y$ respectively represent the abscissa and the ordinate of the center point of the predicted positioning frame, $\hat w$ and $\hat h$ respectively represent the width and the height of the predicted positioning frame, $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, and $\Phi_{VL}$ represents the visual language Transformer network used to fuse $F_v$ and $F_l$.
The loss between the predicted positioning frame $\hat b$ and the real positioning frame $b$ is calculated with the loss function of the meta-learning visual language understanding and positioning training model; all stages of the visual language understanding and positioning model adopt this unified loss function:
$$\mathcal{L}(b,\hat b)=1-\frac{A_{\cap}}{A_{\Sigma}-A_{\cap}}+\frac{A_{C}-(A_{\Sigma}-A_{\cap})}{A_{C}}$$
wherein $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $b$ represents the real positioning frame of the picture-text description sample pair $(I,T)$, $I$ represents the picture in the picture-text description sample pair, $T$ represents the text description corresponding to $I$, $\hat b$ represents the predicted positioning frame in the form $\hat b=(\hat x,\hat y,\hat w,\hat h)$, $\hat x$ and $\hat y$ respectively represent the abscissa and the ordinate of the center point of the predicted positioning frame, $\hat w$ and $\hat h$ respectively represent the width and the height of the predicted positioning frame, $A_{\cap}$ represents the area of the region where the real positioning frame and the predicted positioning frame overlap, $A_{\Sigma}$ represents the sum of the areas of the real positioning frame and the predicted positioning frame, and $A_{C}$ represents the area of the smallest enclosing rectangle of the real positioning frame and the predicted positioning frame.
After the loss is calculated, a stochastic gradient descent algorithm is used to reversely optimize the meta-learning visual language understanding and positioning model; the optimization algorithm adopted throughout the meta-learning visual language understanding and positioning model is unified as stochastic gradient descent.
S3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
in this embodiment, an inner-layer circulation training based on meta-learning of a random uncorrelated training mechanism is constructed, and a loss function adopted in the inner-layer circulation training is a loss function used by a meta-learning visual language understanding and positioning model in a unified mannerThe adopted optimization algorithm is a random gradient descent method, and the parameter updating of the basic learner is carried out by utilizing a support set in a training set:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
S4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
in this embodiment, the loss is calculated for the query set in the training data set by using the updated basic learner parameters, and the visual language understanding and positioning model for meta learning is reversely optimized, so as to complete the outer layer circulation training process of meta learning. In this embodiment, the visual language understanding and locating model is learned by meta-learningAnd query set->As input, the loss function used is a loss function unified with the localization model for meta-learning visual language understanding>Guiding element learning visual language understanding and positioning model weight parameters by using random gradient descent algorithm>Updating, model weight parameter which is updated continuously in the step +.>The final wanted weight parameters of the meta-learning visual language understanding and positioning model are used for encoding test picture-text description sample pairs in S5, setting the following meta-learning targets and guiding the visual language understanding and positioning model weight parameters in the meta-learning outer layer cycle training->Is updated by:
wherein,representing the total number of iterative training in meta-learning,irepresent the firstiWheel learning in the range of 1 to +.>,/>Represent the firstiQuery set in iterative training of wheel element learning, < ->Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Loss function representing unified use of visual language understanding and localization model, +.>Weight parameters representing meta-learning visual language understanding and positioning training models.
The formula for calculating the parameter update amount based on the principle of stochastic gradient descent in this embodiment is:
$$\Delta\theta=\nabla_{\theta}\sum_{i=1}^{N}\mathcal{L}\big(\theta-\alpha\nabla_{\theta}\mathcal{L}(\theta;S_i);\,Q_i\big)$$
wherein $i$ represents the $i$-th round of meta-learning and ranges from 1 to $N$, $N$ represents the total number of rounds of meta-learning iterative training, $S_i$ represents the support set of the $i$-th round of meta-learning iterative training, $Q_i$ represents the query set of the $i$-th round of meta-learning iterative training, $\alpha$ represents the learning rate of the inner layer circulation training of meta-learning and is set to 1e-5, $\theta$ represents the weight parameters of the meta-learning visual language understanding and positioning training model, $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $\theta_i^{\prime}=\theta-\alpha\nabla_{\theta}\mathcal{L}(\theta;S_i)$ represents the basic learner parameters updated by the inner layer circulation training of the $i$-th round of meta-learning, and $\Delta\theta$ represents the amount of the parameter update.
The weight parameters $\theta$ of the meta-learning visual language understanding and positioning model can then be updated as:
$$\theta\leftarrow\theta-\beta\,\nabla_{\theta}\sum_{i=1}^{N}\mathcal{L}(\theta_i^{\prime};Q_i)$$
wherein $\beta$ represents the learning rate of the outer layer circulation training and is set to 1e-5, $Q_i$ represents the query set of the $i$-th round of meta-learning iterative training, and $\theta_i^{\prime}$ represents the basic learner parameters updated by the inner layer circulation training of the $i$-th round of meta-learning.
S5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture.
In this embodiment, the iteratively optimized meta-learning visual language understanding and positioning model is used to encode the test data set; for each picture-text description sample pair used for testing, the trained model parameters are used to regress the positioning frame, and the positioning frame of the described object in the picture is output:
$$b=\mathcal{F}_{\theta^{*}}(I^{t},T^{t})$$
wherein $b$ represents the coordinates of the positioning frame in the form $b=(x,y,w,h)$, $x$ and $y$ respectively represent the abscissa and the ordinate of the center point of the positioning frame, $w$ and $h$ respectively represent the width and the height of the positioning frame, $\mathcal{F}_{\theta^{*}}$ represents the optimized meta-learning visual language understanding and positioning training model, $(I^{t},T^{t})$ represents the test picture-text description sample pair input to the optimized meta-learning visual language understanding and positioning training model, $I^{t}$ represents the test picture, and $T^{t}$ represents the text description corresponding to $I^{t}$.

Claims (7)

1. A meta-learning visual language understanding and positioning method is characterized by comprising the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
s2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
the step S2 comprises the following steps:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the element learning visual language understanding and positioning training model by utilizing a random gradient descent method based on the calculation result of S203;
s3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
s4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
s5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture;
the expression of the loss function of the meta-learning visual language understanding and positioning training model is as follows:
wherein,loss function representing meta-learning visual language understanding and localization training model, < ->Representing a picture-text description sample pair +.>Is a true positioning frame of->Representing a picture in a picture-text description sample pair, < >>Representation and->Corresponding text description->Representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>Indicating the area of the area where the real positioning frame and the predicted positioning frame overlap, +.>Representing the sum of the areas of the real and predicted positioning frames,/->Representing the area of the smallest bounding rectangle of the real and predicted bounding boxes.
2. The meta-learning visual language understanding and locating method of claim 1, wherein expressions of the support set and the query set are as follows, respectively:
$$S_i=\{(I^s_k,T^s_k)\}_{k=1}^{B},\qquad Q_i=\{(I^q_k,T^q_k)\}_{k=1}^{B}$$
wherein $S_i$ and $Q_i$ respectively represent the support set and the query set of the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture, $k$ takes values from 1 to $B$, and $B$ represents the batch size of each round of meta-learning iterative training.
3. The meta-learning visual language understanding and locating method of claim 1, wherein the expressions of the visual features of the picture and the language features of the text description are as follows, respectively:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
4. The meta-learning visual language understanding and locating method of claim 1, wherein the expression of the predictive locating box is as follows:
wherein,representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>And->Language features representing visual features and text descriptions of a picture, respectively,/->Representation for fusion->And->Is a visual language transducer network.
5. The meta-learning visual language understanding and locating method of claim 1, wherein the updating of the parameters of the base learner is expressed as follows:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
6. The meta-learning visual language understanding and locating method according to claim 1, wherein the expression of the weight parameters of the reverse-optimized meta-learning visual language understanding and locating training model is as follows:
wherein,weight parameters representing meta-learning visual language understanding and positioning training model, < ->Represent learning rate of outer cycle training, +.>Representing the total number of meta-learning iterative training, +.>Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Represent the firstiQuery set in iterative training of wheel element learning, < ->Inner layer circulation training learning rate for representing meta learning, < ->Represent the firstiSupport set in iterative training set in wheel learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Representing differential calculation +.>Representing the amount of parameter update.
7. The meta-learning visual language understanding and locating method of claim 1, wherein the expression of coordinates of the locating frame is as follows:
wherein,representing the coordinates of the positioning frame in the form +.>,/>Respectively representing the abscissa and the ordinate of the center point of the positioning frame,/->Respectively represent the width and the height of the positioning frame, +.>Representing an optimized meta-learning visual language understanding and positioning training model +.>Picture-text description sample pair representing input optimized meta-learning visual language understanding and positioning training model for testing>Representing a test picture->Representation and->Corresponding text descriptions.
CN202311330418.4A 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method Active CN117095187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311330418.4A CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311330418.4A CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Publications (2)

Publication Number Publication Date
CN117095187A CN117095187A (en) 2023-11-21
CN117095187B true CN117095187B (en) 2023-12-19

Family

ID=88783590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311330418.4A Active CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Country Status (1)

Country Link
CN (1) CN117095187B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580500A (en) * 2019-08-20 2019-12-17 天津大学 Character interaction-oriented network weight generation few-sample image classification method
CN114187472A (en) * 2021-12-06 2022-03-15 江南大学 Breast cancer molecular subtype prediction method based on model-driven meta learning
CN114220516A (en) * 2021-12-17 2022-03-22 北京工业大学 Brain CT medical report generation method based on hierarchical recurrent neural network decoding
CN114491039A (en) * 2022-01-27 2022-05-13 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN115249361A (en) * 2022-07-15 2022-10-28 北京京东尚科信息技术有限公司 Instructional text positioning model training, apparatus, device, and medium
CN115953569A (en) * 2022-12-16 2023-04-11 华东师范大学 One-stage visual positioning model construction method based on multi-step reasoning
CN116011507A (en) * 2022-12-06 2023-04-25 东北林业大学 Rare fault diagnosis method for fusion element learning and graph neural network
CN116050399A (en) * 2023-01-05 2023-05-02 中国科学院声学研究所南海研究站 Cross-corpus and cross-algorithm generation type text steganalysis method
CN116071315A (en) * 2022-12-31 2023-05-05 聚光科技(杭州)股份有限公司 Product visual defect detection method and system based on machine vision
CN116246279A (en) * 2022-12-28 2023-06-09 北京理工大学 Graphic and text feature fusion method based on CLIP background knowledge
CN116258990A (en) * 2023-02-13 2023-06-13 安徽工业大学 Cross-modal affinity-based small sample reference video target segmentation method
CN116524356A (en) * 2023-04-11 2023-08-01 湖北工业大学 Ore image small sample target detection method and system
CN116612324A (en) * 2023-05-17 2023-08-18 四川九洲电器集团有限责任公司 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US20210241099A1 (en) * 2020-02-05 2021-08-05 Baidu Usa Llc Meta cooperative training paradigms
EP3926531B1 (en) * 2020-06-17 2024-04-24 Tata Consultancy Services Limited Method and system for visio-linguistic understanding using contextual language model reasoners
KR20230127509A (en) * 2022-02-25 2023-09-01 한국전자통신연구원 Method and apparatus for learning concept based few-shot
US20230297603A1 (en) * 2022-03-18 2023-09-21 Adobe Inc. Cross-lingual meta-transfer learning adaptation to natural language understanding

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580500A (en) * 2019-08-20 2019-12-17 天津大学 Character interaction-oriented network weight generation few-sample image classification method
CN114187472A (en) * 2021-12-06 2022-03-15 江南大学 Breast cancer molecular subtype prediction method based on model-driven meta learning
CN114220516A (en) * 2021-12-17 2022-03-22 北京工业大学 Brain CT medical report generation method based on hierarchical recurrent neural network decoding
CN114491039A (en) * 2022-01-27 2022-05-13 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN115249361A (en) * 2022-07-15 2022-10-28 北京京东尚科信息技术有限公司 Instructional text positioning model training, apparatus, device, and medium
CN116011507A (en) * 2022-12-06 2023-04-25 东北林业大学 Rare fault diagnosis method for fusion element learning and graph neural network
CN115953569A (en) * 2022-12-16 2023-04-11 华东师范大学 One-stage visual positioning model construction method based on multi-step reasoning
CN116246279A (en) * 2022-12-28 2023-06-09 北京理工大学 Graphic and text feature fusion method based on CLIP background knowledge
CN116071315A (en) * 2022-12-31 2023-05-05 聚光科技(杭州)股份有限公司 Product visual defect detection method and system based on machine vision
CN116050399A (en) * 2023-01-05 2023-05-02 中国科学院声学研究所南海研究站 Cross-corpus and cross-algorithm generation type text steganalysis method
CN116258990A (en) * 2023-02-13 2023-06-13 安徽工业大学 Cross-modal affinity-based small sample reference video target segmentation method
CN116524356A (en) * 2023-04-11 2023-08-01 湖北工业大学 Ore image small sample target detection method and system
CN116612324A (en) * 2023-05-17 2023-08-18 四川九洲电器集团有限责任公司 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting; Guangxing Han et al.; arXiv:2204.07841v3; pp. 1-17 *
Vision-language navigation with indoor scene graph knowledge integration; Hu Chengwei; China Master's Theses Full-text Database, Information Science and Technology, No. 3; pp. I140-382 *
Research on few-shot learning methods for fine-grained image classification; Cao Siyu; China Master's Theses Full-text Database, Information Science and Technology, No. 2; pp. I138-432 *
Research progress and development trends of vision-language navigation; Niu Kai et al.; Journal of Computer-Aided Design & Computer Graphics; Vol. 34, No. 12; pp. 1815-1827 *

Also Published As

Publication number Publication date
CN117095187A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN112633010A (en) Multi-head attention and graph convolution network-based aspect-level emotion analysis method and system
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN113704522A (en) Artificial intelligence-based target image rapid retrieval method and system
CN114528398A (en) Emotion prediction method and system based on interactive double-graph convolutional network
CN115587207A (en) Deep hash retrieval method based on classification label
CN111882042A (en) Automatic searching method, system and medium for neural network architecture of liquid state machine
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN117095187B (en) Meta-learning visual language understanding and positioning method
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
CN115827878B (en) Sentence emotion analysis method, sentence emotion analysis device and sentence emotion analysis equipment
CN116860943A (en) Multi-round dialogue method and system for dialogue style perception and theme guidance
CN117009478A (en) Algorithm fusion method based on software knowledge graph question-answer question-sentence analysis process
Qin et al. Modularized Pre-training for End-to-end Task-oriented Dialogue
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN114549958B (en) Night and camouflage target detection method based on context information perception mechanism
Basnyat et al. Vision powered conversational AI for easy human dialogue systems
CN113010712B (en) Visual question answering method based on multi-graph fusion
CN115357712A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN115081445A (en) Short text entity disambiguation method based on multitask learning
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
US12014149B1 (en) Multi-turn human-machine conversation method and apparatus based on time-sequence feature screening encoding module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant