CN117095187B - Meta-learning visual language understanding and positioning method - Google Patents
- Publication number
- Publication number: CN117095187B (application CN202311330418.4A)
- Authority
- CN
- China
- Prior art keywords
- learning
- meta
- representing
- training
- positioning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/768—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps: constructing a training set; constructing a meta-learning visual language understanding and positioning training model; constructing meta-learning inner-loop training based on a random uncorrelated training mechanism, and updating the parameters of the base learner with the support set; calculating the loss on the query set of the training set with the updated base-learner parameters, and optimizing the meta-learning visual language understanding and positioning training model by back-propagation; and encoding a test picture-text description sample pair with the optimized meta-learning visual language understanding and positioning training model, and outputting the positioning frame of the described object in the picture. The invention addresses the problem that, when the distributions of the training set and the test set of a visual language understanding and positioning dataset differ greatly, existing methods over-focus on the training set, so the generalization capability of the model is poor and the visual language understanding and positioning accuracy drops further.
Description
Technical Field
The invention belongs to the technical field of multimodal visual language understanding and positioning, and in particular relates to a meta-learning visual language understanding and positioning method.
Background
Visual language understanding and localization, also known as visual grounding (VG), refers to accurately locating a target region or object in an image from a natural language expression. In short, given an input picture and the corresponding text description of an object, the method outputs the coordinates of the positioning frame of the described object in the picture. In visual language understanding and localization tasks, the described object is typically specified by one or more pieces of information in the text description, such as object category, appearance attributes, or visual relationship context. Visual language understanding and localization combines computer vision and natural language understanding to enhance image understanding and analysis capabilities, and supports applications such as image caption generation, image-text retrieval, and visual question answering. In general, visual language understanding and localization techniques play a vital role in the development of the many fields that motivate the combination of computer vision and natural language understanding, and are of significant research value.
In recent years, a variety of deep visual language understanding and positioning methods have been explored. They extract visual features of a picture and language features of the corresponding text description with neural networks, and then generate the final positioning frame through feature fusion. These methods fall into three main categories: two-stage, one-stage, and Transformer-based methods. Two-stage methods generate candidate-box regions in an initial stage, match these candidate boxes against the text description in a subsequent stage, and then rank the candidate boxes to select the final positioning frame; however, ranking and selecting the candidate boxes requires a large amount of computation, it is impossible to enumerate all possible candidates, and the resulting candidate boxes are also suboptimal. One-stage methods fuse the text description directly with the image features and predict the bounding box directly to locate the mentioned object, reducing redundant region-proposal computation by densely sampling possible target locations; this greatly reduces computation compared with two-stage methods, but such methods are still built on generic object detectors, and inference relies on the predictions over all possible candidate regions, so performance remains limited by the quality of the proposals or the predefined anchor-box configuration. Furthermore, in both two-stage and one-stage methods, candidate objects are essentially represented as region features (corresponding to predicted proposals) or point features (features of dense anchor boxes) to be matched or fused with the language features of the text description; such feature representations can be too inflexible to capture the fine-grained visual concepts or context mentioned in the text description, and this inflexibility increases the difficulty of identifying the target object.
With the development of attention-based Transformer models, current visual language understanding and positioning methods can directly regress the positioning-frame coordinates with a Transformer. In Transformer-based visual language understanding and positioning methods, the attention layers at the core of the Transformer model establish intra-modal and cross-modal correspondences between the visual and language inputs, and the deep model directly regresses the cross-modal features into a positioning frame. However, whether two-stage, one-stage, or the latest Transformer-based methods, they over-focus on the training set when the distributions of the training set and the test set of the visual language understanding and positioning dataset differ greatly, so the generalization capability of the model is poor, the model over-fits, and the positioning accuracy of the visual language understanding and positioning model is severely affected.
Disclosure of Invention
Aiming at the above defects in the prior art, the meta-learning visual language understanding and positioning method provided by the invention solves the problems of slow convergence and unstable training of visual language understanding and positioning models, as well as the problems of poor generalization capability and model over-fitting caused by excessive attention to the training set when the distributions of the training set and the test set of the visual language understanding and positioning dataset differ greatly.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the scheme provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
s2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
s3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
s4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
s5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture.
The beneficial effects of the invention are as follows: the invention optimizes the inner and outer loops of the meta-learning visual language understanding and positioning training model based on a random uncorrelated training mechanism, and uses the optimized model to output, for a test picture-text description sample pair, the positioning frame of the described object in the picture. The random uncorrelated training mechanism allows the meta-learning visual language understanding and positioning training model to perform meta-learning iterative training directly on a visual language understanding and positioning dataset, improving the generalization capability of the model; the meta-learning iterative training accelerates the convergence of the visual language understanding and positioning model and improves its training stability. The invention solves the problems that existing visual language understanding and positioning methods over-focus on the training set when the distributions of the training set and the test set of the dataset differ greatly, so that the generalization capability of the model is poor, the model over-fits, and the visual language understanding and positioning accuracy drops further.
Further, the support set and the query set are expressed as follows:

$$S_i = \{(I^s_k, T^s_k)\}_{k=1}^{N_b}$$

$$Q_i = \{(I^q_k, T^q_k)\}_{k=1}^{N_b}$$

wherein $S_i$ and $Q_i$ respectively represent the support set and the query set in the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture of the support set and of the query set, $k$ ranges from 1 to $N_b$, and $N_b$ represents the batch size in each round of meta-learning iterative training.
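The random disjoint split can be sketched in a few lines of Python; the function name and the list-of-pairs dataset format are illustrative assumptions, not taken from the patent:

```python
import random

def split_support_query(dataset, batch_size, seed=None):
    """Draw a support set S_i and a query set Q_i for one round of
    meta-learning so that the two sets share no samples at all.
    `dataset` is a list of (picture, text_description) pairs; the
    function name and data format are illustrative assumptions."""
    rng = random.Random(seed)
    # sample 2*N_b distinct indices, then split them in half
    idx = rng.sample(range(len(dataset)), 2 * batch_size)
    support = [dataset[i] for i in idx[:batch_size]]
    query = [dataset[i] for i in idx[batch_size:]]
    return support, query
```

Because the indices are sampled without replacement, no training sample can appear in both sets within a round, which is exactly the "uncorrelated" property required above.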
The beneficial effects of the above-mentioned further scheme are: according to the invention, a support set and a query set of meta learning are obtained through the random uncorrelated meta learning data dividing mechanism, the meta learning iterative training is directly carried out on the visual language understanding and positioning data set, and the visual language understanding and positioning model can carry out the subsequent meta learning process by utilizing the support set and the query set of meta learning.
Still further, the step S2 comprises the following steps:
S201, using a Vision Transformer network as the visual branch of the meta-learning visual language understanding and positioning training model to extract the visual features of the picture in the input sample pair, and using a BERT-based network as the language branch of the model to extract the language features of the text description in the input sample pair;
S202, fusing the visual features of the picture and the language features of the text description with a visual-language Transformer network, and regressing the coordinate frame of the visual target mentioned in the text description to obtain the predicted positioning frame;
S203, calculating the loss between the predicted positioning frame and the real positioning frame with the loss function of the meta-learning visual language understanding and positioning training model;
S204, optimizing the meta-learning visual language understanding and positioning training model by back-propagation with stochastic gradient descent, based on the calculation result of S203.
The beneficial effects of the above further scheme are: a Vision Transformer network and a BERT-based network are used as the visual branch and the language branch of the meta-learning visual language understanding and positioning training model to extract the visual features of the input picture and the language features of the corresponding text description; the visual-language Transformer network then fuses the two and performs cross-modal reasoning to directly regress the coordinates of the predicted positioning frame; the training loss is calculated with the loss function, and the meta-learning visual language understanding and positioning training model is continuously iterated and optimized with stochastic gradient descent.
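Steps S201-S202 amount to a two-branch encode-then-fuse pipeline. The sketch below shows only the data flow; the three callables stand in for the Vision Transformer, BERT-based, and visual-language Transformer networks, and all names are illustrative assumptions:

```python
def ground(picture, text, visual_branch, language_branch, fusion_branch):
    """Forward pass of S201-S202: extract visual and language features,
    then fuse them and regress a single predicted box (x, y, w, h).
    The three branch arguments are stand-ins for the patent's ViT,
    BERT-based, and visual-language Transformer networks."""
    z_v = visual_branch(picture)    # visual features of the picture
    z_l = language_branch(text)     # language features of the description
    return fusion_branch(z_v, z_l)  # predicted bounding box
```

With real networks each branch would be a trained module; here any callables work, which keeps the data-flow sketch testable in isolation.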
Still further, the visual features of the picture and the language features of the text description are expressed as follows:

$$Z_v = F_v(I)$$

$$Z_l = F_l(T)$$

wherein $Z_v$ and $Z_l$ respectively represent the visual features of the picture and the language features of the text description, $F_v$ represents the Vision Transformer network, $F_l$ represents the BERT-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
The beneficial effects of the above-mentioned further scheme are: the visual features of the extracted pictures and the language features of the corresponding text description provide a basis for the subsequent feature fusion and cross-modal reasoning process.
Still further, the predicted positioning frame is expressed as follows:

$$\hat{b} = F_{vl}(Z_v, Z_l)$$

wherein $\hat{b}$ represents the predicted positioning frame in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively represent the abscissa and ordinate of the center point of the predicted positioning frame and $\hat{w}$ and $\hat{h}$ its width and height, $Z_v$ and $Z_l$ respectively represent the visual features of the picture and the language features of the text description, and $F_{vl}$ represents the visual-language Transformer network used to fuse $Z_v$ and $Z_l$.
The beneficial effects of the above-mentioned further scheme are: the resulting predicted bounding box may be used with the real bounding box to calculate training loss as an input to the loss function in a subsequent process.
Still further, the loss function of the meta-learning visual language understanding and positioning training model is expressed as follows:

$$\mathcal{L}(\hat{b}, b) = 1 - \left(\frac{A_I}{A_U} - \frac{A_C - A_U}{A_C}\right)$$

wherein $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $b$ represents the real positioning frame of the picture-text description sample pair $(I, T)$, $I$ represents the picture in the sample pair and $T$ the text description corresponding to $I$, $\hat{b}$ represents the predicted positioning frame in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively represent the abscissa and ordinate of the center point of the predicted positioning frame and $\hat{w}$ and $\hat{h}$ its width and height, $A_I$ represents the area of the region where the real and predicted positioning frames overlap, $A_U$ represents the area of their union (the sum of the two frame areas minus the overlap), and $A_C$ represents the area of the smallest rectangle enclosing both frames.
The beneficial effects of the above further scheme are: the loss between the predicted and real positioning frames is calculated with the above loss function, which attends not only to the overlapping region of the two frames but also to the non-overlapping regions, so it better reflects their degree of overlap and more accurately reflects the training loss of the model.
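The description above matches the generalized IoU (GIoU) loss. Below is a plain-Python sketch on (center-x, center-y, width, height) boxes, under the assumption that the patent's loss is exactly the standard GIoU form (it may add further terms, e.g. a smooth-L1 box regression term):

```python
def giou_loss(pred, true):
    """GIoU-style loss 1 - (A_I/A_U - (A_C - A_U)/A_C), where A_I is the
    overlap area, A_U the union area, and A_C the area of the smallest
    rectangle enclosing both boxes. Boxes are (cx, cy, w, h) tuples.
    0.0 for a perfect match; grows as the boxes separate (up to 2.0)."""
    def corners(box):
        cx, cy, w, h = box
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2
    px1, py1, px2, py2 = corners(pred)
    tx1, ty1, tx2, ty2 = corners(true)
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    a_i = inter_w * inter_h                          # overlap area
    a_u = ((px2 - px1) * (py2 - py1)
           + (tx2 - tx1) * (ty2 - ty1) - a_i)        # union area
    a_c = ((max(px2, tx2) - min(px1, tx1))           # enclosing rectangle
           * (max(py2, ty2) - min(py1, ty1)))
    return 1.0 - (a_i / a_u - (a_c - a_u) / a_c)
```

The enclosing-rectangle term is what keeps the loss informative even when the two boxes do not overlap at all, which plain IoU cannot do.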
Still further, the update of the base-learner parameters is expressed as:

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta;\, S_i)$$

wherein $\theta_i'$ represents the base-learner parameters updated by the inner-loop training of the $i$-th round of meta-learning, $\theta$ represents the parameters of the meta-learning visual language understanding and positioning training model held by the base learner in the $i$-th round of meta-learning iterative training, $\alpha$ represents the learning rate of the meta-learning inner-loop training, $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $S_i$ represents the support set in the $i$-th round of meta-learning iterative training, and $\nabla$ represents the gradient operator.
The beneficial effects of the above further scheme are: through the inner-loop training, the base learner can learn feature representations and model parameters with stronger generalization capability, improving generalization on visual language understanding and positioning tasks.
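One inner-loop step of the update rule just described can be sketched with plain float lists standing in for the base learner's parameters and a caller-supplied gradient function standing in for autograd (all names are illustrative assumptions):

```python
def inner_update(theta, grad_fn, support_set, inner_lr):
    """theta_i' = theta - alpha * dL/dtheta evaluated on the support set.
    `grad_fn(params, batch)` returns the gradient list dL/dparams; it is
    a stand-in for back-propagation through the full grounding model."""
    grads = grad_fn(theta, support_set)
    return [p - inner_lr * g for p, g in zip(theta, grads)]
```

Note the step returns new parameters rather than mutating `theta`, so the original meta parameters remain available for the outer loop.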
Still further, the reverse optimization of the weight parameters of the meta-learning visual language understanding and positioning training model is expressed as follows:

$$\phi \leftarrow \phi - \beta \nabla_{\phi} \sum_{i=1}^{N} \mathcal{L}\big(\theta_i';\, Q_i\big)$$

$$\theta_i' = \phi - \alpha \nabla_{\phi} \mathcal{L}(\phi;\, S_i)$$

wherein $\phi$ represents the weight parameters of the meta-learning visual language understanding and positioning training model, $\beta$ represents the learning rate of the outer-loop training, $N$ represents the total number of meta-learning iterative training rounds, $\theta_i'$ represents the base-learner parameters updated by the inner-loop training of the $i$-th round of meta-learning, $Q_i$ represents the query set in the $i$-th round of meta-learning iterative training, $\alpha$ represents the learning rate of the meta-learning inner-loop training, $S_i$ represents the support set in the $i$-th round of meta-learning iterative training, $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $\nabla$ represents the gradient operator, and the subtracted term $\beta \nabla_{\phi} \sum_{i=1}^{N} \mathcal{L}(\theta_i'; Q_i)$ is the amount of the parameter update.
The beneficial effects of the above further scheme are: through the meta-learning outer-loop training, the meta-learner can rapidly optimize the parameters of the visual language understanding and positioning model, accelerating convergence while making training more stable.
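The outer loop is a MAML-style meta update. The sketch below uses the first-order approximation (gradients are not propagated through the inner step), which is a common simplification and an assumption of this sketch, not necessarily what the patent implements:

```python
def meta_step(phi, grad_fn, tasks, inner_lr, outer_lr):
    """One outer-loop update of the meta parameters phi. For each
    (support_set, query_set) pair: adapt phi on the support set, evaluate
    the query-set gradient at the adapted parameters, accumulate, then
    take one stochastic-gradient step on phi (first-order MAML)."""
    meta_grad = [0.0] * len(phi)
    for support_set, query_set in tasks:
        adapted = [p - inner_lr * g
                   for p, g in zip(phi, grad_fn(phi, support_set))]
        for j, g in enumerate(grad_fn(adapted, query_set)):
            meta_grad[j] += g
    return [p - outer_lr * g for p, g in zip(phi, meta_grad)]
```

Because the query loss is evaluated at the adapted parameters, the meta step rewards parameter settings that generalize after a single support-set adaptation, rather than ones that merely fit the support set.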
Still further, the coordinates of the positioning frame are expressed as follows:

$$b^{*} = F^{*}(I^{t}, T^{t})$$

wherein $b^{*}$ represents the coordinates of the positioning frame in the form $(x, y, w, h)$, where $x$ and $y$ respectively represent the abscissa and ordinate of the center point of the positioning frame and $w$ and $h$ its width and height, $F^{*}$ represents the optimized meta-learning visual language understanding and positioning training model, $(I^{t}, T^{t})$ represents the test picture-text description sample pair input to the optimized model, $I^{t}$ represents the test picture, and $T^{t}$ represents the text description corresponding to $I^{t}$.
The beneficial effects of the above further scheme are: the optimized meta-learning visual language understanding and positioning training model obtained through iterative optimization encodes the test picture-text description sample pair and outputs the positioning frame, in the picture, of the object described in the text.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of the embodiments; to those skilled in the art, all inventions making use of the inventive concept are protected within the spirit and scope of the invention as defined by the appended claims.
Examples
As shown in fig. 1, the invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
in this embodiment, training data is constructed, a support set and a query set are extracted from the training data set based on a random uncorrelated training mechanism for training a model, and in each iteration training of meta-learning, the target visual language understanding and positioning data set is randomly divided into an uncorrelated support set and query set without repeated data. It should be noted that each round of branches participating in meta-learning iterative trainingThe persistent set and the query set are irrelevant, i.e. training samples in the support set and the query set are not repeated at all, meta-learning is the firstiThe support set and query set in the round of iterative training are as follows:
$$S_i = \{(I^s_k, T^s_k)\}_{k=1}^{N_b}$$

$$Q_i = \{(I^q_k, T^q_k)\}_{k=1}^{N_b}$$

wherein $S_i$ and $Q_i$ respectively represent the support set and the query set in the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture of the support set and of the query set, $k$ ranges from 1 to $N_b$, and $N_b$ represents the batch size in each round of meta-learning iterative training.
S2, constructing a meta-learning visual language understanding and positioning training model from the input sample pairs, each of which is a picture-text description sample pair, implemented as follows:
S201, using a Vision Transformer network as the visual branch of the meta-learning visual language understanding and positioning training model to extract the visual features of the picture in the input sample pair, and using a BERT-based network as the language branch of the model to extract the language features of the text description in the input sample pair;
S202, fusing the visual features of the picture and the language features of the text description with a visual-language Transformer network, and regressing the coordinate frame of the visual target mentioned in the text description to obtain the predicted positioning frame;
S203, calculating the loss between the predicted positioning frame and the real positioning frame with the loss function of the meta-learning visual language understanding and positioning training model;
S204, optimizing the meta-learning visual language understanding and positioning training model by back-propagation with stochastic gradient descent, based on the calculation result of S203.
In this embodiment, the input of the meta-learning visual language understanding and positioning model is a picture and its corresponding text description, forming a sample pair. A Vision Transformer network is used as the visual branch of the model to extract the visual features of the picture in the input sample pair, and a BERT-based network is used as the language branch to extract the language features of the text description in the input sample pair:

$$Z_v = F_v(I)$$

$$Z_l = F_l(T)$$

wherein $Z_v$ and $Z_l$ respectively represent the visual features of the picture and the language features of the text description, $F_v$ represents the Vision Transformer network, $F_l$ represents the BERT-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
The extracted visual features of the picture and the language features of the text description are fused with the visual-language Transformer network, and the coordinate frame of the visual target mentioned in the text description is then directly regressed through cross-modal relation reasoning to obtain the predicted positioning frame:

$$\hat{b} = F_{vl}(Z_v, Z_l)$$

wherein $\hat{b}$ represents the predicted positioning frame in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively represent the abscissa and ordinate of the center point of the predicted positioning frame and $\hat{w}$ and $\hat{h}$ its width and height, $Z_v$ and $Z_l$ respectively represent the visual features of the picture and the language features of the text description, and $F_{vl}$ represents the visual-language Transformer network used to fuse $Z_v$ and $Z_l$.
The loss between the predicted positioning frame $\hat{b}$ and the real positioning frame $b$ is calculated with the loss function of the meta-learning visual language understanding and positioning training model; all processes in the visual language understanding and positioning model adopt this unified loss function:

$$\mathcal{L}(\hat{b}, b) = 1 - \left(\frac{A_I}{A_U} - \frac{A_C - A_U}{A_C}\right)$$

wherein $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $b$ represents the real positioning frame of the picture-text description sample pair $(I, T)$, $I$ represents the picture in the sample pair and $T$ the text description corresponding to $I$, $\hat{b}$ represents the predicted positioning frame in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively represent the abscissa and ordinate of the center point of the predicted positioning frame and $\hat{w}$ and $\hat{h}$ its width and height, $A_I$ represents the area of the region where the real and predicted positioning frames overlap, $A_U$ represents the area of their union (the sum of the two frame areas minus the overlap), and $A_C$ represents the area of the smallest rectangle enclosing both frames.
After the loss is calculated, the meta-learning visual language understanding and positioning model is optimized by back-propagation with stochastic gradient descent; stochastic gradient descent is the optimization algorithm used uniformly throughout the meta-learning visual language understanding and positioning model.
S3, constructing meta-learning inner-loop training based on a random uncorrelated training mechanism, and updating the parameters of the base learner with the support set;
In this embodiment, meta-learning inner-loop training based on the random uncorrelated training mechanism is constructed. The loss function adopted in the inner-loop training is the unified loss function $\mathcal{L}$ of the meta-learning visual language understanding and positioning model, the optimization algorithm adopted is stochastic gradient descent, and the parameters of the base learner are updated with the support set of the training set:

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}(\theta;\, S_i)$$

wherein $\theta_i'$ represents the base-learner parameters updated by the inner-loop training of the $i$-th round of meta-learning, $\theta$ represents the parameters of the meta-learning visual language understanding and positioning training model held by the base learner in the $i$-th round of meta-learning iterative training, $\alpha$ represents the learning rate of the meta-learning inner-loop training, $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $S_i$ represents the support set in the $i$-th round of meta-learning iterative training, and $\nabla$ represents the gradient operator.
S4, calculating the loss on the query set in the training set using the updated basic learner parameters, and reversely optimizing the meta-learning visual language understanding and positioning training model to complete the outer-layer loop training of meta-learning;
in this embodiment, the loss is calculated on the query set in the training data set using the updated basic learner parameters, and the meta-learning visual language understanding and positioning model is reversely optimized, completing the outer-layer loop training process of meta-learning. In this embodiment, the meta-learning visual language understanding and positioning model takes the updated basic learner parameters $\theta_i'$ and the query set $Q_i$ as input; the loss function used is the loss function $\mathcal{L}$ used uniformly with the meta-learning visual language understanding and positioning model; the stochastic gradient descent algorithm guides the updating of the model weight parameters $\theta$; and the model weight parameters $\theta$ continuously updated in this step are the final desired weight parameters of the meta-learning visual language understanding and positioning model, which are used in S5 to encode the test picture-text description sample pairs. The following meta-learning objective is set to guide the updating of the weight parameters $\theta$ of the visual language understanding and positioning model in the meta-learning outer-layer loop training:
$$\min_\theta \sum_{i=1}^{T} \mathcal{L}(\theta_i'; Q_i);$$
wherein $T$ denotes the total number of iterative training rounds in meta-learning; $i$ denotes the $i$-th round of meta-learning, with $i$ ranging from 1 to $T$; $Q_i$ denotes the query set in the $i$-th round of meta-learning iterative training; $\theta_i'$ denotes the basic learner parameters updated by the inner-layer loop training of the $i$-th round of meta-learning; $\mathcal{L}$ denotes the loss function used uniformly by the visual language understanding and positioning model; and $\theta$ denotes the weight parameters of the meta-learning visual language understanding and positioning training model.
The formula for calculating the parameter update based on the principle of the stochastic gradient descent method in this embodiment is:
$$\Delta\theta = \nabla_\theta \sum_{i=1}^{T} \mathcal{L}\bigl(\theta - \alpha \nabla_\theta \mathcal{L}(\theta; S_i);\, Q_i\bigr);$$
wherein $i$ denotes the $i$-th round of meta-learning, with $i$ ranging from 1 to $T$; $T$ denotes the total number of iterative training rounds in meta-learning; $S_i$ denotes the support set in the $i$-th round of meta-learning iterative training; $Q_i$ denotes the query set in the $i$-th round of meta-learning iterative training; $\alpha$ denotes the learning rate of the inner-layer loop training of meta-learning, set to 1e-5; $\theta$ denotes the weight parameters of the meta-learning visual language understanding and positioning training model; $\mathcal{L}$ denotes the loss function of the meta-learning visual language understanding and positioning training model; $\theta_i'$ denotes the basic learner parameters updated by the inner-layer loop training of the $i$-th round of meta-learning; and $\Delta\theta$ denotes the amount of the parameter update.
The weight parameters $\theta$ of the meta-learning visual language understanding and positioning model can then be updated as:
$$\theta \leftarrow \theta - \beta\, \Delta\theta = \theta - \beta \nabla_\theta \sum_{i=1}^{T} \mathcal{L}(\theta_i'; Q_i);$$
wherein $\beta$ denotes the learning rate of the outer-layer loop training, set to 1e-5; $Q_i$ denotes the query set in the $i$-th round of meta-learning iterative training; and $\theta_i'$ denotes the basic learner parameters updated by the inner-layer loop training of the $i$-th round of meta-learning.
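The outer loop evaluates each adapted learner on its query set and folds the resulting gradients back into $\theta$. A sketch under the same toy assumptions as before (one-parameter learner, mean-squared-error stand-in loss), and with a first-order simplification: the query-set gradient is taken at $\theta_i'$ rather than differentiated through the inner step, as in first-order MAML:

```python
def mse_grad(theta, batch):
    """Gradient of the mean squared error of f(x) = theta * x
    over (x, y) pairs; a toy stand-in for the grounding loss L."""
    return sum(2 * (theta * x - y) * x for x, y in batch) / len(batch)

def meta_step(theta, tasks, alpha, beta):
    """One outer-loop update over tasks [(S_i, Q_i), ...]:
    adapt on each support set S_i with one inner step, accumulate
    the query-set gradient, then apply theta <- theta - beta * delta."""
    delta = 0.0
    for support, query in tasks:
        theta_i = theta - alpha * mse_grad(theta, support)  # inner step on S_i
        delta += mse_grad(theta_i, query)                   # gradient of loss on Q_i
    return theta - beta * delta
```

Differentiating through the inner step (the full second-order update written in the formula above) would additionally multiply each query gradient by the Jacobian of the inner update; the first-order variant drops that factor for simplicity.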
S5, encoding the test picture-text description sample pairs using the optimized meta-learning visual language understanding and positioning training model, and outputting the positioning frame of the object to be described in the picture.
In this embodiment, the optimal meta-learning visual language understanding and positioning model obtained by iterative optimization is used to encode the test data set. For each picture-text description sample pair used for testing, the trained model parameters are used to regress the coordinates of the positioning frame, and the positioning frame of the object to be described in the picture is output:
$$b = (x, y, w, h) = F_{\theta^*}(I, T);$$
wherein $b$ denotes the coordinates of the positioning frame, in the form $(x, y, w, h)$, where $x$ and $y$ respectively denote the abscissa and ordinate of the center point of the positioning frame, and $w$ and $h$ respectively denote its width and height; $F_{\theta^*}$ denotes the optimized meta-learning visual language understanding and positioning training model; and $(I, T)$ denotes the picture-text description sample pair for testing input into the optimized meta-learning visual language understanding and positioning training model, with $I$ denoting a test picture and $T$ denoting the text description corresponding to $I$.
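The output frame uses center coordinates plus width and height; downstream display or IoU evaluation typically needs corner coordinates. A small conversion helper (the function name is illustrative, not from the patent):

```python
def center_to_corners(box):
    """Convert (x, y, w, h) — center point plus width/height, the form
    output by the positioning model — to (x1, y1, x2, y2) corner form."""
    x, y, w, h = box
    return (x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0)
```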
Claims (7)
1. A meta-learning visual language understanding and positioning method is characterized by comprising the following steps:
S1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data to construct a training set, wherein the support set participating in each round of meta-learning iterative training is uncorrelated with the query set;
S2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
the step S2 comprises the following steps:
S201, using a visual Transformer network as the visual branch of the meta-learning visual language understanding and positioning training model to extract visual features of the picture in the input sample pair, and using a BERT-based network as the language branch of the meta-learning visual language understanding and positioning training model to extract language features of the text description in the input sample pair;
S202, fusing the visual features of the picture and the language features of the text description using a visual language Transformer network, and performing regression processing on the coordinate frame of the visual target mentioned in the text description to obtain a predicted positioning frame;
S203, calculating the loss between the predicted positioning frame and the real positioning frame using the loss function of the meta-learning visual language understanding and positioning training model;
S204, reversely optimizing the meta-learning visual language understanding and positioning training model using the stochastic gradient descent method based on the calculation result of S203;
S3, constructing the inner-layer loop training of meta-learning based on a random uncorrelated training mechanism, and updating the parameters of the basic learner using the support set;
S4, calculating the loss on the query set in the training set using the updated basic learner parameters, and reversely optimizing the meta-learning visual language understanding and positioning training model to complete the outer-layer loop training of meta-learning;
S5, encoding the test picture-text description sample pairs using the optimized meta-learning visual language understanding and positioning training model, and outputting the positioning frame of the object to be described in the picture;
the expression of the loss function of the meta-learning visual language understanding and positioning training model is as follows:
$$\mathcal{L}(B, \hat{B}) = 1 - \frac{\mathcal{I}}{\mathcal{S} - \mathcal{I}} + \frac{A_C - (\mathcal{S} - \mathcal{I})}{A_C};$$
wherein $\mathcal{L}$ denotes the loss function of the meta-learning visual language understanding and positioning training model; $B$ denotes the ground-truth positioning frame of the picture-text description sample pair $(I, T)$; $I$ denotes the picture in the picture-text description sample pair; $T$ denotes the text description corresponding to $I$; $\hat{B}$ denotes the predicted positioning frame, in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively denote the abscissa and ordinate of the center point of the predicted positioning frame, and $\hat{w}$ and $\hat{h}$ respectively denote its width and height; $\mathcal{I}$ denotes the area of the region where the real positioning frame and the predicted positioning frame overlap; $\mathcal{S}$ denotes the sum of the areas of the real and predicted positioning frames; and $A_C$ denotes the area of the smallest enclosing rectangle of the real and predicted positioning frames.
2. The meta-learning visual language understanding and positioning method of claim 1, wherein the expressions of the support set and the query set are respectively as follows:
$$S_i = \{(I_k^{s}, T_k^{s})\}_{k=1}^{N_b};$$
$$Q_i = \{(I_k^{q}, T_k^{q})\}_{k=1}^{N_b};$$
wherein $S_i$ and $Q_i$ respectively denote the support set and the query set of meta-learning in the $i$-th round of iterative training; $I_k^{s}$ and $I_k^{q}$ respectively denote the $k$-th input picture of the support set and of the query set; $T_k^{s}$ and $T_k^{q}$ respectively denote the text description corresponding to the $k$-th input picture of the support set and of the query set; $k$ ranges from 1 to $N_b$; and $N_b$ denotes the batch size in each round of iterative training in meta-learning.
3. The meta-learning visual language understanding and positioning method of claim 1, wherein the expressions of the visual features of the picture and the language features of the text description are respectively as follows:
$$F_v = \mathrm{ViT}(I);$$
$$F_l = \mathrm{BERT}(T);$$
wherein $F_v$ and $F_l$ respectively denote the visual features of the picture and the language features of the text description; $\mathrm{ViT}$ denotes the visual Transformer network; $\mathrm{BERT}$ denotes the BERT-based network; $I$ denotes the picture in the picture-text description sample pair; and $T$ denotes the text description corresponding to $I$.
4. The meta-learning visual language understanding and positioning method of claim 1, wherein the expression of the predicted positioning frame is as follows:
$$\hat{B} = \mathrm{VLT}(F_v, F_l);$$
wherein $\hat{B}$ denotes the predicted positioning frame, in the form $(\hat{x}, \hat{y}, \hat{w}, \hat{h})$, where $\hat{x}$ and $\hat{y}$ respectively denote the abscissa and ordinate of the center point of the predicted positioning frame, and $\hat{w}$ and $\hat{h}$ respectively denote its width and height; $F_v$ and $F_l$ respectively denote the visual features of the picture and the language features of the text description; and $\mathrm{VLT}$ denotes the visual language Transformer network for fusing $F_v$ and $F_l$.
5. The meta-learning visual language understanding and positioning method of claim 1, wherein the updating of the parameters of the basic learner is expressed as follows:
$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}(\theta; S_i);$$
wherein $\theta_i'$ denotes the basic learner parameters updated by the inner-layer loop training of the $i$-th round of meta-learning; $\theta$ denotes the parameters of the meta-learning visual language understanding and positioning training model in the $i$-th round of meta-learning iterative training, with which the basic learner is initialized; $\alpha$ denotes the learning rate of the inner-layer loop training of meta-learning; $\mathcal{L}$ denotes the loss function of the meta-learning visual language understanding and positioning training model; $S_i$ denotes the support set in the $i$-th round of meta-learning iterative training; and $\nabla$ denotes the differential calculation.
6. The meta-learning visual language understanding and positioning method of claim 1, wherein the expressions for reversely optimizing the weight parameters of the meta-learning visual language understanding and positioning training model are as follows:
$$\Delta\theta = \nabla_\theta \sum_{i=1}^{T} \mathcal{L}\bigl(\theta - \alpha \nabla_\theta \mathcal{L}(\theta; S_i);\, Q_i\bigr);$$
$$\theta \leftarrow \theta - \beta\, \Delta\theta;$$
wherein $\theta$ denotes the weight parameters of the meta-learning visual language understanding and positioning training model; $\beta$ denotes the learning rate of the outer-layer loop training; $T$ denotes the total number of meta-learning iterative training rounds; $\theta_i'$ denotes the basic learner parameters updated by the inner-layer loop training of the $i$-th round of meta-learning; $Q_i$ denotes the query set in the $i$-th round of meta-learning iterative training; $\alpha$ denotes the learning rate of the inner-layer loop training of meta-learning; $S_i$ denotes the support set in the $i$-th round of meta-learning iterative training; $\mathcal{L}$ denotes the loss function of the meta-learning visual language understanding and positioning training model; $\nabla$ denotes the differential calculation; and $\Delta\theta$ denotes the amount of the parameter update.
7. The meta-learning visual language understanding and positioning method of claim 1, wherein the expression of the coordinates of the positioning frame is as follows:
$$b = (x, y, w, h) = F_{\theta^*}(I, T);$$
wherein $b$ denotes the coordinates of the positioning frame, in the form $(x, y, w, h)$, where $x$ and $y$ respectively denote the abscissa and ordinate of the center point of the positioning frame, and $w$ and $h$ respectively denote its width and height; $F_{\theta^*}$ denotes the optimized meta-learning visual language understanding and positioning training model; and $(I, T)$ denotes the picture-text description sample pair for testing input into the optimized meta-learning visual language understanding and positioning training model, with $I$ denoting a test picture and $T$ denoting the text description corresponding to $I$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311330418.4A CN117095187B (en) | 2023-10-16 | 2023-10-16 | Meta-learning visual language understanding and positioning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117095187A CN117095187A (en) | 2023-11-21 |
CN117095187B true CN117095187B (en) | 2023-12-19 |