CN117095187B - Meta-learning visual language understanding and positioning method - Google Patents

Meta-learning visual language understanding and positioning method

Info

Publication number
CN117095187B
CN117095187B
Authority
CN
China
Prior art keywords
learning
meta
representing
training
positioning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311330418.4A
Other languages
Chinese (zh)
Other versions
CN117095187A (en)
Inventor
苏超
彭德中
胡鹏
袁钟
王旭
孙元
秦阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN202311330418.4A
Publication of CN117095187A
Application granted
Publication of CN117095187B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/768 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using context analysis, e.g. recognition aided by known co-occurring patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps: constructing a training set; constructing a meta-learning visual language understanding and positioning training model; constructing the inner layer circulation training of meta-learning based on a random uncorrelated training mechanism, and updating the parameters of the basic learner with the support set; calculating the loss on the query set of the training set with the updated basic learner parameters, and reversely optimizing the meta-learning visual language understanding and positioning training model; and encoding a test picture-text description sample pair with the optimized meta-learning visual language understanding and positioning training model, and outputting the positioning frame of the described object in the picture. The invention solves the problem that existing visual language understanding and positioning methods over-focus on the training set when the distribution difference between the training set and the test set of a visual language understanding and positioning data set is too large, which leads to poor model generalization ability and in turn reduces visual language understanding and positioning accuracy.

Description

Meta-learning visual language understanding and positioning method
Technical Field
The invention belongs to the technical field of multi-mode visual language understanding and positioning, and particularly relates to a meta-learning visual language understanding and positioning method.
Background
Visual language understanding and positioning (VG) refers to accurately locating a target region or object in an image according to a natural language expression. In short, given an input picture and the text description of an object in it, the method outputs the coordinates of the positioning frame of the described object in the picture. In visual language understanding and positioning tasks, the described object is typically specified by one or more pieces of information in the text description, which may include object properties, appearance properties, visual relationship context, and the like. Visual language understanding and positioning combines computer vision with natural language understanding to enhance image understanding and analysis capabilities, and it supports applications such as image description generation, image-text retrieval, and visual question answering. In general, visual language understanding and positioning techniques play a vital role in the many fields that call for combining computer vision and natural language understanding, and the topic is of significant research value.
In recent years, various deep visual language understanding and positioning methods have been explored. They extract the visual features of a picture and the language features of the corresponding text description of an object in the picture with neural networks, and then generate the final positioning frame through feature fusion. These methods fall into three main categories: two-stage, one-stage, and Transformer-based methods. Two-stage methods generate candidate box regions in a first stage, match these candidate boxes with the text description in a second stage, and then rank the candidate boxes to select the final positioning frame; however, ranking and selecting the candidate boxes requires a large amount of computation, and since all possible candidate boxes cannot be enumerated, the resulting candidate boxes are also suboptimal. One-stage methods directly fuse the text description with the image features and directly predict the bounding box of the mentioned object, reducing the redundant computation on region proposals by densely sampling the possible target locations; this significantly reduces computation compared with two-stage methods, but these methods are still built on generic object detectors, and the inference process relies on predictions over all possible candidate regions, so performance remains limited by the quality of the proposals or the predefined anchor box configuration. Furthermore, whether in two-stage or one-stage methods, the candidate objects are essentially represented as region features (corresponding to the predicted proposals) or point features (features of dense anchor boxes) to be matched or fused with the language features of the text description; such feature representations may be insufficiently flexible for capturing the detailed visual concepts or contexts mentioned in the text description, and this inflexibility increases the difficulty of identifying the target object. With the development of Transformer models based on the attention mechanism, current visual language understanding and positioning methods have achieved direct regression of the positioning frame coordinates based on Transformers: in Transformer-based visual language understanding and positioning methods, the attention layers, the core components of the Transformer model, establish intra-modal and inter-modal correspondences between the visual and language inputs, and the deep model directly regresses the cross-modal data into a positioning frame. However, whether the two-stage methods, the one-stage methods, or the latest Transformer-based methods, they over-focus on the training set when facing scenes where the distribution difference between the training set and the test set of a visual language understanding and positioning data set is too large, so the generalization ability of the model is poor, the model over-fits, and the positioning accuracy of the visual language understanding and positioning model is greatly affected.
Disclosure of Invention
Aiming at the defects in the prior art, the meta-learning visual language understanding and positioning method provided by the invention solves the problems of slow convergence and unstable training when visual language understanding and positioning tasks are trained, as well as the problems of poor model generalization ability and model over-fitting caused by excessive attention to the training set when the distribution difference between the training set and the test set of the visual language understanding and positioning data set is too large.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the scheme provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
s2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
s3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
s4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
s5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture.
The beneficial effects of the invention are as follows: the invention optimizes the inner layer and the outer layer of the meta-learning visual language understanding and positioning training model based on a random uncorrelated training mechanism, and uses the optimized model to output, for a test picture-text description sample pair, the positioning frame of the described object in the picture. The proposed random uncorrelated training mechanism allows the meta-learning visual language understanding and positioning training model to carry out meta-learning iterative training directly on a visual language understanding and positioning data set, which improves the generalization ability of the model; the meta-learning iterative training also accelerates the convergence of the visual language understanding and positioning model and improves its training stability. The invention solves the problem that existing visual language understanding and positioning methods over-focus on the training set when the distribution difference between the training set and the test set of the visual language understanding and positioning data set is too large, which leads to poor generalization ability, model over-fitting, and reduced visual language understanding and positioning accuracy.
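For illustration, a minimal sketch of the training loop corresponding to steps S1 to S4 is given below in PyTorch-style Python. The function and argument names (model, dataset_sampler, alpha, beta) are assumptions for illustration, model(batch) is assumed to return the grounding loss, and a first-order approximation of the meta-gradient is used for brevity; this is not the exact implementation of the invention.
    import copy
    import torch
    def meta_train(model, dataset_sampler, num_rounds, alpha=1e-5, beta=1e-5):
        # Hypothetical MAML-style loop following steps S1-S4.
        # dataset_sampler(i) is assumed to return a disjoint (support, query) pair for round i.
        meta_opt = torch.optim.SGD(model.parameters(), lr=beta)      # outer-loop SGD over theta
        for i in range(num_rounds):
            support, query = dataset_sampler(i)                      # S1: disjoint random split
            base = copy.deepcopy(model)                              # S3: base learner copy
            inner_opt = torch.optim.SGD(base.parameters(), lr=alpha)
            inner_opt.zero_grad()
            base(support).backward()                                 # loss on the support set
            inner_opt.step()                                         # updated base-learner parameters
            base.zero_grad()                                         # clear support-set gradients
            base(query).backward()                                   # S4: loss on the query set
            meta_opt.zero_grad()
            for p, q in zip(model.parameters(), base.parameters()):
                p.grad = None if q.grad is None else q.grad.clone()  # first-order meta-gradient
            meta_opt.step()                                          # reversely optimize the model
        return model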
Further, the expressions of the support set and the query set are as follows:
$$S_i=\{(I^s_k,T^s_k)\}_{k=1}^{B},\qquad Q_i=\{(I^q_k,T^q_k)\}_{k=1}^{B}$$
wherein $S_i$ and $Q_i$ respectively represent the support set and the query set of the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture, $k$ takes values from 1 to $B$, and $B$ represents the batch size of each round of meta-learning iterative training.
The beneficial effects of the above-mentioned further scheme are: According to the invention, the support set and the query set of meta-learning are obtained through the random uncorrelated data dividing mechanism, meta-learning iterative training is carried out directly on the visual language understanding and positioning data set, and the visual language understanding and positioning model can carry out the subsequent meta-learning process using this support set and query set.
Still further, the step S2 includes the steps of:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the meta-learning visual language understanding and positioning training model by using a random gradient descent method based on the calculation result of S203.
The beneficial effects of the above-mentioned further scheme are: The visual features of the input picture and the language features of the corresponding text description are extracted by using a visual Transformer network and a Bert-based network as the visual branch and the language branch of the meta-learning visual language understanding and positioning training model, respectively; the visual language Transformer network then fuses the visual features of the picture with the language features of the text description, performs cross-modal reasoning, and directly regresses the coordinates of the predicted positioning frame; the training loss is calculated with the loss function, and the meta-learning visual language understanding and positioning training model is iteratively optimized with a stochastic gradient descent method.
Still further, the visual features of the picture and the language features of the text description are expressed as follows:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
The beneficial effects of the above-mentioned further scheme are: The extracted visual features of the picture and language features of the corresponding text description provide the basis for the subsequent feature fusion and cross-modal reasoning process.
Still further, the expression of the predicted positioning frame is as follows:
wherein,representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>And->Language features representing visual features and text descriptions of a picture, respectively,/->Representation for fusion->And->Is a visual language transducer network.
The beneficial effects of the above-mentioned further scheme are: The resulting predicted positioning frame and the real positioning frame are used as inputs to the loss function to calculate the training loss in the subsequent step.
Still further, the expression of the loss function of the meta-learning visual language understanding and localization training model is as follows:
wherein,loss function representing meta-learning visual language understanding and localization training model, < ->Representing a picture-text description sample pair +.>Is a true positioning frame of->Representing a picture in a picture-text description sample pair, < >>Representation and->Corresponding text description->Representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>Indicating the area of the area where the real positioning frame and the predicted positioning frame overlap, +.>Representing the sum of the areas of the real and predicted positioning frames,/->Representing the area of the smallest bounding rectangle of the real and predicted bounding boxes.
The beneficial effects of the above-mentioned further scheme are: The loss between the predicted positioning frame and the real positioning frame is calculated with the above loss function; because the loss function considers not only the overlapping region of the predicted and real positioning frames but also the non-overlapping regions, it better reflects the degree of overlap between the two frames and therefore measures the training loss of the model more accurately.
Still further, the updating of the parameters of the base learner has the expression:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
The beneficial effects of the above-mentioned further scheme are: Through the inner layer circulation training, the basic learner can learn feature representations and model parameters with stronger generalization ability, which improves generalization on visual language understanding and positioning tasks.
Still further, the expression of the weight parameters of the reverse optimization meta-learning visual language understanding and positioning training model is as follows:
wherein,weight parameters representing meta-learning visual language understanding and positioning training model, < ->Represent learning rate of outer cycle training, +.>Representing the total number of meta-learning iterative training, +.>Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Represent the firstiQuery set in iterative training of wheel element learning, < ->Inner layer circulation training learning rate for representing meta learning, < ->Represent the firstiSupport set in iterative training set in wheel learning, < ->Representation element learning visual language understandingLoss function with positioning training model, +.>Representing differential calculation +.>Representing the amount of parameter update.
The beneficial effects of the above-mentioned further scheme are: Through the outer layer circulation training of meta-learning, the meta-learner can rapidly optimize the parameters of the visual language understanding and positioning model, which accelerates convergence and makes training more stable.
Still further, the expression of the coordinates of the positioning frame is as follows:
wherein,representing the coordinates of the positioning frame in the form +.>,/>Respectively representing the abscissa and the ordinate of the center point of the positioning frame,/->Respectively represent the width and the height of the positioning frame, +.>Representing an optimized meta-learning visual language understanding and positioning training model +.>Picture-text description sample pair representing input optimized meta-learning visual language understanding and positioning training model for testing>Representing a test picture->Representation and->Corresponding text descriptions.
The beneficial effects of the above-mentioned further scheme are: The meta-learning visual language understanding and positioning training model obtained through iterative optimization encodes the test picture-text description sample pair and outputs the positioning frame, in the picture, of the object described by the text.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art; it should be understood, however, that the present invention is not limited to the scope of the embodiments, and for those skilled in the art, all inventions that make use of the inventive concept fall within the spirit and scope of the present invention as defined in the appended claims.
Examples
As shown in fig. 1, the invention provides a meta-learning visual language understanding and positioning method, which comprises the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
in this embodiment, training data is constructed, a support set and a query set are extracted from the training data set based on a random uncorrelated training mechanism for training a model, and in each iteration training of meta-learning, the target visual language understanding and positioning data set is randomly divided into an uncorrelated support set and query set without repeated data. It should be noted that each round of branches participating in meta-learning iterative trainingThe persistent set and the query set are irrelevant, i.e. training samples in the support set and the query set are not repeated at all, meta-learning is the firstiThe support set and query set in the round of iterative training are as follows:
wherein,and->Respectively represent that meta learning is at the firstiSupport set and query set in round of iterative training, +.>And->Representing the first of the support set and the query set, respectivelykInput pictures->And->Representing support set and query set and first, respectivelykA text description corresponding to the respective input picture,khas a value of 1 to->,/>The batch size in each round of iterative training in meta-learning is represented.
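A minimal sketch of this random uncorrelated division is shown below; the dataset is assumed to be an indexable list of picture-text pairs, and the function name is an illustrative assumption.
    import random
    def split_support_query(dataset, batch_size, seed=None):
        # Draw 2*batch_size distinct indices so that the support set and the
        # query set of this round share no samples (random uncorrelated split).
        rng = random.Random(seed)
        idx = rng.sample(range(len(dataset)), 2 * batch_size)   # sampling without replacement
        support = [dataset[j] for j in idx[:batch_size]]
        query = [dataset[j] for j in idx[batch_size:]]
        return support, query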
S2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair, and the implementation method is as follows:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the meta-learning visual language understanding and positioning training model by using a random gradient descent method based on the calculation result of S203.
In this embodiment, the input of the meta-learning visual language understanding and positioning model is a picture together with its corresponding text description, i.e. a picture-text description sample pair. A visual Transformer network is used as the visual branch of the model to extract the visual features of the picture in the input sample pair, and a Bert-based network is used as the language branch of the model to extract the language features of the text description in the input sample pair:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in the picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
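As an illustration only, the two branches could be instantiated as follows; the specific backbone checkpoints (a ViT from timm and bert-base-uncased from HuggingFace Transformers) are assumptions and not necessarily those used by the invention.
    import timm
    import torch
    from transformers import BertModel, BertTokenizer
    vit = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)   # visual branch
    bert = BertModel.from_pretrained("bert-base-uncased")                             # language branch
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    def encode_pair(image, text):
        # image: float tensor of shape (3, 224, 224); text: str.
        f_v = vit.forward_features(image.unsqueeze(0))           # visual token features F_v
        tokens = tokenizer(text, return_tensors="pt")
        f_l = bert(**tokens).last_hidden_state                   # language token features F_l
        return f_v, f_l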
The visual features of the picture and the language features of the text description are then fused with the visual language Transformer network, and the coordinates of the visual target mentioned by the text description are directly regressed through cross-modal relation reasoning to obtain the predicted positioning frame:
$$\hat b=\Phi_{VL}(F_v,F_l)$$
wherein $\hat b$ represents the predicted positioning frame in the form $\hat b=(\hat x,\hat y,\hat w,\hat h)$, $\hat x$ and $\hat y$ respectively represent the abscissa and the ordinate of the center point of the predicted positioning frame, $\hat w$ and $\hat h$ respectively represent the width and the height of the predicted positioning frame, $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, and $\Phi_{VL}$ represents the visual language Transformer network used to fuse $F_v$ and $F_l$.
The loss between the predicted positioning frame $\hat b$ and the real positioning frame $b$ is calculated with the loss function of the meta-learning visual language understanding and positioning training model; all stages of the visual language understanding and positioning model adopt this unified loss function:
$$\mathcal{L}(b,\hat b)=1-\frac{A_{\cap}}{A_{\Sigma}-A_{\cap}}+\frac{A_{C}-(A_{\Sigma}-A_{\cap})}{A_{C}}$$
wherein $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $b$ represents the real positioning frame of the picture-text description sample pair $(I,T)$, $I$ represents the picture in the picture-text description sample pair, $T$ represents the text description corresponding to $I$, $\hat b$ represents the predicted positioning frame in the form $\hat b=(\hat x,\hat y,\hat w,\hat h)$, $\hat x$ and $\hat y$ respectively represent the abscissa and the ordinate of the center point of the predicted positioning frame, $\hat w$ and $\hat h$ respectively represent the width and the height of the predicted positioning frame, $A_{\cap}$ represents the area of the region where the real positioning frame and the predicted positioning frame overlap, $A_{\Sigma}$ represents the sum of the areas of the real positioning frame and the predicted positioning frame, and $A_{C}$ represents the area of the smallest enclosing rectangle of the real positioning frame and the predicted positioning frame.
After the loss is calculated, a stochastic gradient descent algorithm is used to reversely optimize the meta-learning visual language understanding and positioning model; the optimization algorithm adopted throughout the meta-learning visual language understanding and positioning model is unified as stochastic gradient descent.
S3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
in this embodiment, an inner-layer circulation training based on meta-learning of a random uncorrelated training mechanism is constructed, and a loss function adopted in the inner-layer circulation training is a loss function used by a meta-learning visual language understanding and positioning model in a unified mannerThe adopted optimization algorithm is a random gradient descent method, and the parameter updating of the basic learner is carried out by utilizing a support set in a training set:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
S4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
in this embodiment, the loss is calculated for the query set in the training data set by using the updated basic learner parameters, and the visual language understanding and positioning model for meta learning is reversely optimized, so as to complete the outer layer circulation training process of meta learning. In this embodiment, the visual language understanding and locating model is learned by meta-learningAnd query set->As input, the loss function used is a loss function unified with the localization model for meta-learning visual language understanding>Guiding element learning visual language understanding and positioning model weight parameters by using random gradient descent algorithm>Updating, model weight parameter which is updated continuously in the step +.>The final wanted weight parameters of the meta-learning visual language understanding and positioning model are used for encoding test picture-text description sample pairs in S5, setting the following meta-learning targets and guiding the visual language understanding and positioning model weight parameters in the meta-learning outer layer cycle training->Is updated by:
wherein,representing the total number of iterative training in meta-learning,irepresent the firstiWheel learning in the range of 1 to +.>,/>Represent the firstiQuery set in iterative training of wheel element learning, < ->Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Loss function representing unified use of visual language understanding and localization model, +.>Weight parameters representing meta-learning visual language understanding and positioning training models.
The formula for calculating the parameter update amount based on the principle of stochastic gradient descent in this embodiment is:
$$\Delta\theta=\nabla_{\theta}\sum_{i=1}^{N}\mathcal{L}\big(\theta-\alpha\nabla_{\theta}\mathcal{L}(\theta;S_i);\,Q_i\big)$$
wherein $i$ represents the $i$-th round of meta-learning and ranges from 1 to $N$, $N$ represents the total number of rounds of meta-learning iterative training, $S_i$ represents the support set of the $i$-th round of meta-learning iterative training, $Q_i$ represents the query set of the $i$-th round of meta-learning iterative training, $\alpha$ represents the learning rate of the inner layer circulation training of meta-learning and is set to 1e-5, $\theta$ represents the weight parameters of the meta-learning visual language understanding and positioning training model, $\mathcal{L}$ represents the loss function of the meta-learning visual language understanding and positioning training model, $\theta_i^{\prime}=\theta-\alpha\nabla_{\theta}\mathcal{L}(\theta;S_i)$ represents the basic learner parameters updated by the inner layer circulation training of the $i$-th round of meta-learning, and $\Delta\theta$ represents the amount of the parameter update.
The weight parameters $\theta$ of the meta-learning visual language understanding and positioning model can then be updated as:
$$\theta\leftarrow\theta-\beta\,\nabla_{\theta}\sum_{i=1}^{N}\mathcal{L}(\theta_i^{\prime};Q_i)$$
wherein $\beta$ represents the learning rate of the outer layer circulation training and is set to 1e-5, $Q_i$ represents the query set of the $i$-th round of meta-learning iterative training, and $\theta_i^{\prime}$ represents the basic learner parameters updated by the inner layer circulation training of the $i$-th round of meta-learning.
S5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture.
In this embodiment, the iteratively optimized meta-learning visual language understanding and positioning model is used to encode the test data set; for each picture-text description sample pair used for testing, the trained model parameters are used to regress the positioning frame, and the positioning frame of the described object in the picture is output:
$$b=\mathcal{F}_{\theta^{*}}(I^{t},T^{t})$$
wherein $b$ represents the coordinates of the positioning frame in the form $b=(x,y,w,h)$, $x$ and $y$ respectively represent the abscissa and the ordinate of the center point of the positioning frame, $w$ and $h$ respectively represent the width and the height of the positioning frame, $\mathcal{F}_{\theta^{*}}$ represents the optimized meta-learning visual language understanding and positioning training model, $(I^{t},T^{t})$ represents the test picture-text description sample pair input to the optimized meta-learning visual language understanding and positioning training model, $I^{t}$ represents the test picture, and $T^{t}$ represents the text description corresponding to $I^{t}$.

Claims (7)

1. A meta-learning visual language understanding and positioning method is characterized by comprising the following steps:
s1, in each round of iterative training of meta-learning, randomly dividing a target visual language understanding and positioning data set into a support set and a query set without repeated data, constructing a training set, wherein the support set participating in the meta-learning iterative training of each round is irrelevant to the query set;
s2, constructing a meta-learning visual language understanding and positioning training model according to an input sample pair, wherein the input sample pair is a picture-text description sample pair;
the step S2 comprises the following steps:
s201, using a visual transducer network as a visual branch of a meta-learning visual language understanding and positioning training model, extracting visual features of an input sample centering picture, and using a Bert-based network as a language branch of the meta-learning visual language understanding and positioning training model, extracting language features of a text description in the input sample centering;
s202, fusing visual features of the pictures and language features of the text description by using a visual language transducer network, and carrying out regression processing on coordinate frames of visual targets mentioned by the text description to obtain a prediction positioning frame;
s203, calculating losses of the predicted positioning frame and the real positioning frame by using a loss function of the meta-learning visual language understanding and positioning training model;
s204, reversely optimizing the element learning visual language understanding and positioning training model by utilizing a random gradient descent method based on the calculation result of S203;
s3, constructing a meta-learning inner layer circulation training based on a random uncorrelated training mechanism, and updating parameters of a basic learner by using a support set;
s4, calculating loss of a query set in a training set by using the updated basic learner parameters, and reversely optimizing the visual language understanding and positioning training model of meta learning to finish outer layer circulation training of meta learning;
s5, encoding a test picture-text description sample pair by using the optimized meta learning visual language understanding and positioning training model, and outputting a positioning frame of the object to be described in the picture;
the expression of the loss function of the meta-learning visual language understanding and positioning training model is as follows:
wherein,loss function representing meta-learning visual language understanding and localization training model, < ->Representing a picture-text description sample pair +.>Is a true positioning frame of->Representing a picture in a picture-text description sample pair, < >>Representation and->Corresponding text description->Representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>Indicating the area of the area where the real positioning frame and the predicted positioning frame overlap, +.>Representing the sum of the areas of the real and predicted positioning frames,/->Representing the area of the smallest bounding rectangle of the real and predicted bounding boxes.
2. The meta-learning visual language understanding and locating method of claim 1, wherein expressions of the support set and the query set are as follows, respectively:
$$S_i=\{(I^s_k,T^s_k)\}_{k=1}^{B},\qquad Q_i=\{(I^q_k,T^q_k)\}_{k=1}^{B}$$
wherein $S_i$ and $Q_i$ respectively represent the support set and the query set of the $i$-th round of meta-learning iterative training, $I^s_k$ and $I^q_k$ respectively represent the $k$-th input picture of the support set and of the query set, $T^s_k$ and $T^q_k$ respectively represent the text description corresponding to the $k$-th input picture, $k$ takes values from 1 to $B$, and $B$ represents the batch size of each round of meta-learning iterative training.
3. The meta-learning visual language understanding and locating method of claim 1, wherein the expressions of the visual features of the picture and the language features of the text description are as follows, respectively:
$$F_v=\Phi_{ViT}(I),\qquad F_l=\Phi_{Bert}(T)$$
wherein $F_v$ and $F_l$ respectively represent the visual features of the picture and the language features of the text description, $\Phi_{ViT}$ represents the visual Transformer network, $\Phi_{Bert}$ represents the Bert-based network, $I$ represents the picture in a picture-text description sample pair, and $T$ represents the text description corresponding to $I$.
4. The meta-learning visual language understanding and locating method of claim 1, wherein the expression of the predictive locating box is as follows:
wherein,representing a predicted positioning box in the form +.>,/>Respectively representing the abscissa and the ordinate of the central point of the predicted positioning frame,/, respectively>Respectively representing the width and the height of the prediction positioning frame, < >>And->Language features representing visual features and text descriptions of a picture, respectively,/->Representation for fusion->And->Is a visual language transducer network.
5. The meta-learning visual language understanding and locating method of claim 1, wherein the updating of the parameters of the base learner is expressed as follows:
wherein,represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Representing basic learneriMeta learning visual language understanding and positioning training model parameters in wheel meta learning iterative training>Inner layer circulation training learning rate for representing meta learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Represent the firstiSupport set in iterative training of wheel learning, +.>Representing differential calculations.
6. The meta-learning visual language understanding and locating method according to claim 1, wherein the expression of the weight parameters of the reverse-optimized meta-learning visual language understanding and locating training model is as follows:
wherein,weight parameters representing meta-learning visual language understanding and positioning training model, < ->Represent learning rate of outer cycle training, +.>Representing the total number of meta-learning iterative training, +.>Represent the firstiBasic learner parameters updated by inner layer circulation training of wheel element learning, < ->Represent the firstiQuery set in iterative training of wheel element learning, < ->Inner layer circulation training learning rate for representing meta learning, < ->Represent the firstiSupport set in iterative training set in wheel learning, < ->Loss function representing meta-learning visual language understanding and localization training model, < ->Representing differential calculation +.>Representing the amount of parameter update.
7. The meta-learning visual language understanding and locating method of claim 1, wherein the expression of coordinates of the locating frame is as follows:
wherein,representing the coordinates of the positioning frame in the form +.>,/>Respectively representing the abscissa and the ordinate of the center point of the positioning frame,/->Respectively represent the width and the height of the positioning frame, +.>Representing an optimized meta-learning visual language understanding and positioning training model +.>Picture-text description sample pair representing input optimized meta-learning visual language understanding and positioning training model for testing>Representing a test picture->Representation and->Corresponding text descriptions.
CN202311330418.4A 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method Active CN117095187B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311330418.4A CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311330418.4A CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Publications (2)

Publication Number Publication Date
CN117095187A CN117095187A (en) 2023-11-21
CN117095187B true CN117095187B (en) 2023-12-19

Family

ID=88783590

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311330418.4A Active CN117095187B (en) 2023-10-16 2023-10-16 Meta-learning visual language understanding and positioning method

Country Status (1)

Country Link
CN (1) CN117095187B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580500A (en) * 2019-08-20 2019-12-17 天津大学 Character interaction-oriented network weight generation few-sample image classification method
CN114187472A (en) * 2021-12-06 2022-03-15 江南大学 Breast cancer molecular subtype prediction method based on model-driven meta learning
CN114220516A (en) * 2021-12-17 2022-03-22 北京工业大学 Brain CT medical report generation method based on hierarchical recurrent neural network decoding
CN114491039A (en) * 2022-01-27 2022-05-13 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN115249361A (en) * 2022-07-15 2022-10-28 北京京东尚科信息技术有限公司 Instructional text positioning model training, apparatus, device, and medium
CN115953569A (en) * 2022-12-16 2023-04-11 华东师范大学 One-stage visual positioning model construction method based on multi-step reasoning
CN116011507A (en) * 2022-12-06 2023-04-25 东北林业大学 Rare fault diagnosis method for fusion element learning and graph neural network
CN116050399A (en) * 2023-01-05 2023-05-02 中国科学院声学研究所南海研究站 Cross-corpus and cross-algorithm generation type text steganalysis method
CN116071315A (en) * 2022-12-31 2023-05-05 聚光科技(杭州)股份有限公司 Product visual defect detection method and system based on machine vision
CN116246279A (en) * 2022-12-28 2023-06-09 北京理工大学 Graphic and text feature fusion method based on CLIP background knowledge
CN116258990A (en) * 2023-02-13 2023-06-13 安徽工业大学 Cross-modal affinity-based small sample reference video target segmentation method
CN116524356A (en) * 2023-04-11 2023-08-01 湖北工业大学 Ore image small sample target detection method and system
CN116612324A (en) * 2023-05-17 2023-08-18 四川九洲电器集团有限责任公司 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model
US20210241099A1 (en) * 2020-02-05 2021-08-05 Baidu Usa Llc Meta cooperative training paradigms
EP3926531B1 (en) * 2020-06-17 2024-04-24 Tata Consultancy Services Limited Method and system for visio-linguistic understanding using contextual language model reasoners
KR20230127509A (en) * 2022-02-25 2023-09-01 한국전자통신연구원 Method and apparatus for learning concept based few-shot
US20230297603A1 (en) * 2022-03-18 2023-09-21 Adobe Inc. Cross-lingual meta-transfer learning adaptation to natural language understanding

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110580500A (en) * 2019-08-20 2019-12-17 天津大学 Character interaction-oriented network weight generation few-sample image classification method
CN114187472A (en) * 2021-12-06 2022-03-15 江南大学 Breast cancer molecular subtype prediction method based on model-driven meta learning
CN114220516A (en) * 2021-12-17 2022-03-22 北京工业大学 Brain CT medical report generation method based on hierarchical recurrent neural network decoding
CN114491039A (en) * 2022-01-27 2022-05-13 四川大学 Meta-learning few-sample text classification method based on gradient improvement
CN115249361A (en) * 2022-07-15 2022-10-28 北京京东尚科信息技术有限公司 Instructional text positioning model training, apparatus, device, and medium
CN116011507A (en) * 2022-12-06 2023-04-25 东北林业大学 Rare fault diagnosis method for fusion element learning and graph neural network
CN115953569A (en) * 2022-12-16 2023-04-11 华东师范大学 One-stage visual positioning model construction method based on multi-step reasoning
CN116246279A (en) * 2022-12-28 2023-06-09 北京理工大学 Graphic and text feature fusion method based on CLIP background knowledge
CN116071315A (en) * 2022-12-31 2023-05-05 聚光科技(杭州)股份有限公司 Product visual defect detection method and system based on machine vision
CN116050399A (en) * 2023-01-05 2023-05-02 中国科学院声学研究所南海研究站 Cross-corpus and cross-algorithm generation type text steganalysis method
CN116258990A (en) * 2023-02-13 2023-06-13 安徽工业大学 Cross-modal affinity-based small sample reference video target segmentation method
CN116524356A (en) * 2023-04-11 2023-08-01 湖北工业大学 Ore image small sample target detection method and system
CN116612324A (en) * 2023-05-17 2023-08-18 四川九洲电器集团有限责任公司 Small sample image classification method and device based on semantic self-adaptive fusion mechanism

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting; Guangxing Han et al.; arXiv:2204.07841v3; pp. 1-17 *
Vision-language navigation with indoor scene graph knowledge integration; Hu Chengwei; China Master's Theses Full-text Database, Information Science and Technology, No. 3; pp. I140-382 *
Research on few-shot learning methods for fine-grained image classification; Cao Siyu; China Master's Theses Full-text Database, Information Science and Technology, No. 2; pp. I138-432 *
Research progress and development trends of vision-language navigation; Niu Kai et al.; Journal of Computer-Aided Design & Computer Graphics; Vol. 34, No. 12; pp. 1815-1827 *

Also Published As

Publication number Publication date
CN117095187A (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN111444340A (en) Text classification and recommendation method, device, equipment and storage medium
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN112633010A (en) Multi-head attention and graph convolution network-based aspect-level emotion analysis method and system
CN112527993B (en) Cross-media hierarchical deep video question-answer reasoning framework
CN113705238B (en) Method and system for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN113628059A (en) Associated user identification method and device based on multilayer graph attention network
CN113704522A (en) Artificial intelligence-based target image rapid retrieval method and system
CN114528398A (en) Emotion prediction method and system based on interactive double-graph convolutional network
CN115587207A (en) Deep hash retrieval method based on classification label
CN111882042A (en) Automatic searching method, system and medium for neural network architecture of liquid state machine
CN116258990A (en) Cross-modal affinity-based small sample reference video target segmentation method
CN117095187B (en) Meta-learning visual language understanding and positioning method
CN116881416A (en) Instance-level cross-modal retrieval method for relational reasoning and cross-modal independent matching network
CN115827878B (en) Sentence emotion analysis method, sentence emotion analysis device and sentence emotion analysis equipment
CN116860943A (en) Multi-round dialogue method and system for dialogue style perception and theme guidance
CN117009478A (en) Algorithm fusion method based on software knowledge graph question-answer question-sentence analysis process
Qin et al. Modularized Pre-training for End-to-end Task-oriented Dialogue
CN116258147A (en) Multimode comment emotion analysis method and system based on heterogram convolution
CN114549958B (en) Night and camouflage target detection method based on context information perception mechanism
Basnyat et al. Vision powered conversational AI for easy human dialogue systems
CN113010712B (en) Visual question answering method based on multi-graph fusion
CN115357712A (en) Aspect level emotion analysis method and device, electronic equipment and storage medium
CN115081445A (en) Short text entity disambiguation method based on multitask learning
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
US12014149B1 (en) Multi-turn human-machine conversation method and apparatus based on time-sequence feature screening encoding module

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant