CN117079007A - Zero sample detection method based on vision-language pre-training model and class Prototype - Google Patents
- Publication number: CN117079007A (application CN202310871656.XA)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/764 — Image or video recognition using classification, e.g. of video objects
- G06F40/12 — Use of codes for handling textual entities
- G06N3/0475 — Generative neural networks
- G06V10/25 — Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/44 — Local feature extraction by analysis of parts of the pattern
- G06V10/761 — Proximity, similarity or dissimilarity measures
- G06V10/774 — Generating sets of training patterns
- G06V10/82 — Recognition using neural networks
- G06V20/70 — Labelling scene content
- G06V2201/07 — Target detection
Abstract
The application relates to a zero-shot detection method based on a vision-language pre-training model and class Prototypes, comprising the following steps: extracting a region of interest from the target to be detected and extracting region features from it; inputting the region features into a preset visual pre-training model to obtain a visual embedding vector; inputting text information of the base classes and the novel classes into a preset text pre-training model to obtain text embedding vectors; obtaining feature-constructed categories (class Prototypes) from the classification-score regions of the base classes and the novel classes; and generating a classification result for the target to be detected from the alignment result of the visual and text embedding vectors together with the similarity result between the region of interest and the feature-constructed categories. This addresses the problems of high dataset-annotation cost, heavy resource consumption during pre-training, and the lack of an optimization method tailored to the pre-training model, all of which degrade zero-shot object detection accuracy.
Description
Technical Field
The application relates to the technical field of object detection, and in particular to a zero-shot detection method based on a vision-language pre-training model and class Prototypes.
Background
The series of vision-language models for classification tasks is built mainly around CLIP (Contrastive Language-Image Pre-training) and includes CoOp, CoCoOp, CLIP-Adapter and Tip-Adapter. This line of work obtains a vision-language model with strong generalization by pre-training on a large-scale image-caption dataset. Building on CLIP, most improvements come from two angles, prompt tuning and Adapter design, so as to raise the classification performance of the corresponding method on a specific dataset.
In the related art, vision-language model work on (zero-shot) detection tasks can be broadly divided into CLIP-like methods, including ViLD, DetPro, VL-PLM, etc., and GLIP-like methods, including GLIP, GLIPv2, DetCLIP, etc. CLIP-like methods use a large-scale image-caption dataset for pre-training, whereas GLIP-like methods use a grounding dataset of comparable scale.
Tip-Adapter is a classical work that uses a vision-language model for image classification from the Adapter-design point of view: it takes few-shot data, generates a set of features through the image encoder of the CLIP model, and uses this set of features as prototypes for the different categories. Classification is then completed by computing the similarity between the features of the image to be classified and the prototypes of the different categories. Meanwhile, methods that use class Prototypes to improve image classification or object detection have also received much attention, and they are not limited to vision-language models. For example, Few-Shot Object Detection by Attending to Per-Sample-Prototype, MA-GCP, FSODup and similar methods target few-shot or semi-supervised detection tasks and design prototypes based on single-sample data or local object features to improve detection performance.
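The prototype-similarity classification just described can be sketched in a few lines of plain Python (a hedged illustration: Tip-Adapter itself operates on CLIP image features and adds a learnable cache model, both omitted here; the vectors and class names below are invented for the example):

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def classify_by_prototypes(feature, prototypes):
    # prototypes: {class_name: prototype_vector}, e.g. built by averaging
    # few-shot image features per class; returns the best-matching class.
    return max(prototypes, key=lambda name: cosine_similarity(feature, prototypes[name]))
```

For example, with prototypes {"cat": [1, 0], "dog": [0, 1]}, a feature [0.9, 0.1] is assigned to "cat".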
However, the GLIP-like line of work has the following drawbacks: (1) the GLIP-like models are very large, so training occupies substantial resources and takes a long time; (2) the pre-training stage uses a large-scale grounding dataset, so the annotation cost is high; (3) GLIP-like methods perform worse than CLIP-like methods on the Open Vocabulary Detection task.
The CLIP-like line of work has the following drawbacks: (1) no related work has yet used the idea of Adapter or class-Prototype design to improve the detection accuracy of existing models; (2) related work on class-Prototype improvements ignores the zero-shot object detection problem, focusing only on few-shot or supervised image classification or detection tasks, and lacks an optimization strategy specific to vision-language pre-training models. These problems need to be solved.
Disclosure of Invention
The application provides a zero-shot detection method based on a vision-language pre-training model and class Prototypes, aiming to solve the problems of high dataset-annotation cost, heavy resource consumption, long training time and the lack of an optimization method for vision-language pre-training models, all of which reduce zero-shot object detection accuracy.
An embodiment of the first aspect of the present application provides a zero-shot detection method based on a vision-language pre-training model and class Prototypes, including the following steps:
acquiring a target to be detected, extracting a region of interest of the target to be detected, and extracting object-level region features from the region of interest;
inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base classes and the novel classes into a preset text pre-training model to obtain text embedding vectors; and
obtaining feature-constructed categories (class Prototypes) from the classification-score regions of the base classes and the novel classes based on the visual embedding vector, and generating a classification result for the target to be detected from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories.
According to one embodiment of the present application, extracting the region of interest of the target to be detected and extracting object-level region features from the region of interest includes:
acquiring visual weight information of the target to be detected to obtain its visual feature values;
and extracting the region of interest of the target to be detected from the visual feature values through an RPN (Region Proposal Network), and extracting object-level region features meeting a preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
According to an embodiment of the present application, before the text information of the base classes and the novel classes is input into the preset text pre-training model to obtain text embedding vectors, the method further includes:
acquiring text weight information of the target to be detected to obtain its text feature values;
and extracting the text information of the base classes and the novel classes from the text feature values.
According to one embodiment of the present application, obtaining the feature-constructed categories from the classification-score regions of the base classes and the novel classes includes:
selecting classification-score regions of the base classes through a first preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the base classes;
and selecting classification-score regions of the novel classes through a second preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the novel classes.
According to one embodiment of the present application, generating the classification result of the target to be detected from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories includes:
performing iterative learning on the visual embedding vector, the text embedding vectors, the region of interest and the feature-constructed categories by an exponential moving average (EMA) method, to obtain the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories, respectively;
and performing a weighted calculation on the alignment result and the similarity result to generate the classification result of the target to be detected.
According to the zero-shot detection method based on a vision-language pre-training model and class Prototypes of the embodiment of the present application, after the region of interest of the target to be detected is extracted, region features are extracted from it and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base classes and the novel classes is input into a preset text pre-training model to obtain text embedding vectors; feature-constructed categories are obtained from the classification-score regions of the base classes and the novel classes based on the visual embedding vector; and the classification result of the target to be detected is generated from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories. This solves the problems of high dataset-annotation cost, heavy resource consumption, long training time and the lack of an optimization method for vision-language pre-training models, all of which reduce zero-shot object detection accuracy.
An embodiment of the second aspect of the present application provides a zero-shot detection device based on a vision-language pre-training model and class Prototypes, including:
an extraction module for acquiring a target to be detected, extracting a region of interest of the target to be detected, and extracting object-level region features from the region of interest;
an input module for inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base classes and the novel classes into a preset text pre-training model to obtain text embedding vectors; and
a generation module for obtaining feature-constructed categories from the classification-score regions of the base classes and the novel classes based on the visual embedding vector, and generating a classification result for the target to be detected from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories.
According to one embodiment of the application, the extraction module is specifically configured to:
acquire visual weight information of the target to be detected to obtain its visual feature values;
and extract the region of interest of the target to be detected from the visual feature values through the RPN, and extract object-level region features meeting the preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
According to an embodiment of the present application, before the text information of the base classes and the novel classes is input into the preset text pre-training model to obtain text embedding vectors, the input module is further configured to:
acquire text weight information of the target to be detected to obtain its text feature values;
and extract the text information of the base classes and the novel classes from the text feature values.
According to one embodiment of the present application, the generation module is specifically configured to:
select classification-score regions of the base classes through a first preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the base classes;
and select classification-score regions of the novel classes through a second preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the novel classes.
According to one embodiment of the present application, the generation module is further configured to:
perform iterative learning on the visual embedding vector, the text embedding vectors, the region of interest and the feature-constructed categories by an exponential moving average method, to obtain the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories, respectively;
and perform a weighted calculation on the alignment result and the similarity result to generate the classification result of the target to be detected.
According to the zero-shot detection device based on a vision-language pre-training model and class Prototypes of the embodiment of the present application, after the region of interest of the target to be detected is extracted, region features are extracted from it and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base classes and the novel classes is input into a preset text pre-training model to obtain text embedding vectors; feature-constructed categories are obtained from the classification-score regions of the base classes and the novel classes based on the visual embedding vector; and the classification result of the target to be detected is generated from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories. This solves the problems of high dataset-annotation cost, heavy resource consumption, long training time and the lack of an optimization method for vision-language pre-training models, all of which reduce zero-shot object detection accuracy.
An embodiment of the third aspect of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the program to implement the zero-shot detection method based on a vision-language pre-training model and class Prototypes as described above.
An embodiment of the fourth aspect of the present application provides a computer-readable storage medium having a computer program stored thereon, where the program is executed by a processor to implement the zero-shot detection method based on a vision-language pre-training model and class Prototypes as described in the above embodiments.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a zero-shot detection method based on a vision-language pre-training model and class Prototypes according to an embodiment of the present application;
FIG. 2 is a schematic diagram of pre-training according to one embodiment of the application;
FIG. 3 is a block diagram of a zero-shot detection device based on a vision-language pre-training model and class Prototypes according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, and examples of the embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are illustrative and intended to explain the present application, and should not be construed as limiting the application.
The application provides a zero-shot detection method based on a vision-language pre-training model and class Prototypes, aimed at the problems in the prior art of high dataset-annotation cost, heavy resource consumption, long training time, and the lack of an optimization method for vision-language pre-training models, all of which reduce zero-shot object detection accuracy.
Specifically, fig. 1 is a schematic flowchart of a zero-shot detection method based on a vision-language pre-training model and class Prototypes according to an embodiment of the present application.
As shown in fig. 1, the zero-shot detection method based on a vision-language pre-training model and class Prototypes includes the following steps:
in step S101, a target to be detected is acquired, a region of interest of the target to be detected is extracted, and object-level region features are extracted from the region of interest.
Further, in some embodiments, extracting the region of interest of the target to be detected and extracting object-level region features from it includes: acquiring visual weight information of the target to be detected to obtain its visual feature values; and extracting the region of interest of the target to be detected from the visual feature values through the RPN, and extracting object-level region features meeting the preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
The preset class-agnostic condition may be any suitable condition set by a person skilled in the art according to the detection requirements, and is not specifically limited herein.
Specifically, when performing zero-shot detection on the target to be detected, the embodiment of the application first obtains the target to be detected through a backbone network. Then, with the visual encoder frozen, visual weight information of the target to be detected is obtained, yielding its visual feature values. At the same time, the region of interest of the target to be detected is extracted from the visual feature values by the RPN, and object-level region features meeting the preset class-agnostic condition are extracted through a plurality of RoIAlign modules connected in series, thereby improving the quality of the object-level region features.
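The region-feature extraction step can be illustrated with a deliberately simplified sketch (an assumption-laden stand-in: a real pipeline would use an RPN for proposals and an RoIAlign operator with bilinear sampling, e.g. torchvision's `roi_align`; here a proposal box is simply average-pooled over a feature grid):

```python
def roi_average_pool(feature_map, box):
    # feature_map: H x W grid of feature vectors (lists); box: (x1, y1, x2, y2)
    # in grid coordinates. Averages every cell inside the box into one
    # object-level region feature -- the simplest stand-in for RoIAlign.
    x1, y1, x2, y2 = box
    cells = [feature_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    dim = len(cells[0])
    return [sum(cell[d] for cell in cells) / len(cells) for d in range(dim)]
```

Chaining several such pooling stages over the same proposal would loosely mirror the serial RoIAlign modules of the embodiment.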
In step S102, the object-level region features are input into a preset visual pre-training model to obtain a visual embedding vector, and text information of the base classes and the novel classes is input into a preset text pre-training model to obtain text embedding vectors.
Further, in some embodiments, before the text information of the base classes and the novel classes is input into the preset text pre-training model to obtain text embedding vectors, the method further includes: acquiring text weight information of the target to be detected to obtain its text feature values; and extracting the text information of the base classes and the novel classes from the text feature values.
The preset visual pre-training model and the preset text pre-training model may be any suitable pre-trained models chosen by a person skilled in the art according to the detection requirements, and are not specifically limited herein.
Specifically, as shown in fig. 2, after extracting the object-level region features that meet the preset class-agnostic condition, the embodiment of the present application inputs them into a preset visual pre-training model, such as the CLIP visual encoder, to obtain the corresponding visual embedding vector.
Further, with the text encoder frozen, the embodiment of the application obtains text weight information of the target to be detected, yielding its text feature values, and extracts the text information of the base classes and the novel classes from these values. The extracted text information can then be input into a preset text pre-training model, such as the text encoder of CLIP, to obtain the text embedding vectors of the base classes and of the novel classes. In this way the embodiment obtains text feature values for the novel classes without additional training or annotation cost.
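The construction of per-class text embedding vectors can be sketched as follows (heavily hedged: the prompt template "a photo of a {}" and the `encode_text_stub` function are illustrative stand-ins invented for this sketch — the patent does not specify a prompt template, and the real embeddings come from the frozen CLIP text encoder):

```python
import hashlib
import math

def encode_text_stub(prompt, dim=8):
    # Deterministic stand-in for a frozen text encoder: hashes the prompt
    # into a fixed-size vector and L2-normalizes it.
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dim]]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def class_text_embeddings(class_names, template="a photo of a {}"):
    # One unit-norm text embedding per class name, covering both base and
    # novel classes with no extra training or annotation.
    return {name: encode_text_stub(template.format(name)) for name in class_names}
```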
In step S103, based on the visual embedding vector, feature-constructed categories are obtained from the classification-score regions of the base classes and the novel classes, and the classification result of the target to be detected is generated from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories.
Further, in some embodiments, obtaining the feature-constructed categories from the classification-score regions of the base classes and the novel classes includes: selecting classification-score regions of the base classes through a first preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the base classes; and selecting classification-score regions of the novel classes through a second preset learning method based on the visual embedding vector, to obtain the feature-constructed categories of the novel classes.
Further, in some embodiments, generating the classification result of the target to be detected from the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories includes: performing iterative learning on the visual embedding vector, the text embedding vectors, the region of interest and the feature-constructed categories by an exponential moving average method, to obtain the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories, respectively; and performing a weighted calculation on the alignment result and the similarity result to generate the classification result of the target to be detected.
Specifically, as shown in fig. 2, based on the obtained visual embedding vector, the embodiment of the present application first selects classification-score regions of the base classes using a first preset learning method, for example selecting the Top-K regions by classification score against the ground truth, to obtain the feature-constructed class Prototypes of the base classes.
Next, classification-score regions of the novel classes are selected using a second preset learning method, for example selecting the Top-K regions by classification score according to pseudo labels, to obtain the feature-constructed class Prototypes of the novel classes.
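Both prototype-generation schemes reduce to the same Top-K selection, which can be sketched as follows (a hedged illustration: the score source — ground-truth matching for base classes, pseudo labels for novel classes — follows the description above, while `k` and the averaging of the selected features into one prototype are assumptions of this sketch):

```python
def build_class_prototype(region_features, scores, k=3):
    # Pick the k region features with the highest classification score and
    # average them into one class Prototype. For base classes the scores would
    # come from ground-truth matching, for novel classes from pseudo labels.
    ranked = sorted(zip(scores, region_features), key=lambda pair: pair[0], reverse=True)
    top_k = [feature for _, feature in ranked[:k]]
    dim = len(top_k[0])
    return [sum(f[d] for f in top_k) / len(top_k) for d in range(dim)]
```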
Finally, iterative learning is performed on the obtained class Prototypes of the base classes and the novel classes, the region of interest of the target to be detected, and the visual and text embedding vectors by an exponential moving average method, obtaining the alignment result of the visual and text embedding vectors and the similarity result between the region of interest and the feature-constructed categories, respectively. A weighted calculation is then performed on the alignment result and the similarity result to generate the classification result of the target to be detected. By introducing more prior knowledge in this way, the zero-shot detection capability of the preset visual and text pre-training models on the novel classes is improved while the detection accuracy on the base classes is preserved.
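The exponential-moving-average prototype refresh and the weighted score fusion described above can be sketched as (hedged: the momentum value and the fusion weight `alpha` are assumed hyper-parameters not specified in the description):

```python
def ema_update(prototype, new_estimate, momentum=0.99):
    # Exponential-moving-average refresh of a class Prototype across
    # training iterations: keep most of the old value, blend in the new.
    return [momentum * p + (1.0 - momentum) * n for p, n in zip(prototype, new_estimate)]

def fused_classification_score(alignment_score, prototype_similarity, alpha=0.5):
    # Weighted combination of the vision-text alignment score and the
    # region-to-prototype similarity, yielding the final classification score.
    return alpha * alignment_score + (1.0 - alpha) * prototype_similarity
```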
In summary, from the analysis of the above embodiments, the present application can bring the following beneficial effects:
(1) Building on the vision-language hybrid model VL-PLM in the related art, the zero-sample object detection method merges pseudo-label generation and detector training in the VL-PLM model into a single stage, which simplifies the training process of the model and enables end-to-end training.
(2) While keeping the detection precision on the base classes unchanged, the embodiment of the present application introduces the idea of class-Prototype design into the zero-sample object detection task, creatively designs a set of class-Prototype generation schemes for the base classes and the novel classes, and continuously optimizes them during training, so as to fully exploit the inherent advantages of the vision-language pre-training model and achieve higher-precision zero-sample object detection. The embodiment is expected to improve the AP (Average Precision) on the novel classes by about 3 percentage points (to about 37%) compared with the VL-PLM model, thereby surpassing the detection performance of existing mainstream methods.
According to the zero sample detection method based on the vision-language pre-training model and the class Prototype proposed by the embodiment of the present application, after the region of interest of the object to be detected is extracted, region features are extracted from the region of interest and input into the preset visual pre-training model to obtain a visual embedding vector; text information of the base classes and the novel classes is input into the preset text pre-training model to obtain a text embedding vector; a feature construction category is obtained, based on the visual embedding vector, from the classification score regions of the base classes and the novel classes; and the classification result of the object to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result of the region of interest and the feature construction category. This solves the problems that dataset labeling is costly, tends to occupy a large amount of resources and time, and that an optimization method for the vision-language pre-training model is lacking in the pre-training process, all of which reduce zero-sample object detection precision.
Fig. 3 is a block schematic diagram of a zero sample detection device based on a vision-language pre-training model and class Prototype according to an embodiment of the present application.
As shown in fig. 3, the zero sample detection device 10 based on the vision-language pre-training model and the category Prototype includes: an extraction module 100, an input module 200 and a generation module 300.
The extraction module 100 is configured to obtain a target to be detected, extract a region of interest of the target to be detected, and extract object-level region features from the region of interest;
the input module 200 is configured to input the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and to input text information of the base classes and the novel classes into a preset text pre-training model to obtain a text embedding vector; and
the generating module 300 is configured to obtain a feature structure class according to the classification score region of the base class and the classification score region of the level class based on the visual embedding vector, and generate a classification result of the object to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result of the interest region and the feature structure class.
Further, in some embodiments, the extraction module 100 is specifically configured to:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the object to be detected according to the visual feature value through the RPN, and extracting, through a plurality of RoIAlign modules connected in series, the object-level region features that satisfy the preset class-agnostic condition.
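The fixed-size region pooling performed by the RoIAlign modules can be sketched as below. This is a simplified NumPy stand-in (average pooling over a crop, without RoIAlign's bilinear sampling) meant only to show how a proposal box on a feature map becomes a fixed-size object-level region feature; the box coordinates and output size are illustrative assumptions.

```python
import numpy as np

def pool_region(feature_map, box, out_size=7):
    """Simplified stand-in for RoIAlign: crop a box from a (C, H, W)
    feature map and average-pool it onto an out_size x out_size grid.

    box: (x1, y1, x2, y2) in feature-map coordinates.
    """
    x1, y1, x2, y2 = box
    crop = feature_map[:, y1:y2, x1:x2]
    c, h, w = crop.shape
    # bin edges that partition the crop into out_size x out_size cells
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.empty((c, out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # guarantee each cell spans at least one pixel
            cell = crop[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                           xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = cell.mean(axis=(1, 2))
    return out
```

In practice a real RoIAlign layer (e.g. `torchvision.ops.roi_align`) would be used; several such modules in series refine the pooled features into the object-level region features fed to the visual pre-training model.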
Further, in some embodiments, before inputting the text information of the base classes and the novel classes into the preset text pre-training model to obtain the text embedding vector, the input module 200 is further configured to:
acquiring text weight information in a target to be detected based on the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base classes and the novel classes according to the text feature value.
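Turning class names into text embeddings for the text branch is commonly done with prompt templates. The sketch below is a toy illustration: `toy_text_encoder` is a deterministic hash-based stand-in for a real pre-trained text encoder, and the template strings are illustrative assumptions, not the patent's actual prompts.

```python
import hashlib
import numpy as np

# illustrative prompt templates (assumed, not from the patent)
TEMPLATES = ["a photo of a {}", "a cropped photo of a {}"]

def toy_text_encoder(prompt, dim=8):
    """Deterministic stand-in for a pre-trained text encoder:
    hashes the prompt into a fixed-length unit vector."""
    digest = hashlib.sha256(prompt.encode()).digest()
    v = np.frombuffer(digest[:dim * 4], dtype=np.uint32).astype(np.float64)
    return v / np.linalg.norm(v)

def class_text_embedding(class_name, encoder=toy_text_encoder):
    """Embed each templated prompt, then average and re-normalize
    to get one text embedding per class."""
    embs = np.stack([encoder(t.format(class_name)) for t in TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)
```

With a real text pre-training model, stacking `class_text_embedding` over all base-class and novel-class names yields the text embedding matrix that the visual embeddings are aligned against.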
Further, in some embodiments, the generating module 300 is specifically configured to:
based on the visual embedding vector, selecting the classification score region of the base classes through a first preset learning method to obtain the feature construction category of the base classes;
and selecting the classification score region of the novel classes through a second preset learning method based on the visual embedding vector to obtain the feature construction category of the novel classes.
Further, in some embodiments, the generating module 300 is specifically configured to:
performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature construction category through an exponential moving average method, so as to respectively obtain an alignment result of the visual embedding vector and the text embedding vector and a similarity result of the region of interest and the feature construction category;
and performing a weighted calculation on the alignment result and the similarity result to generate the classification result of the object to be detected.
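The weighted combination of the two results can be sketched as follows. This is a minimal sketch assuming all embeddings are L2-normalized and the weight `alpha` is an illustrative hyperparameter; the patent does not specify the exact weighting scheme.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_region(v, text_embs, prototypes, alpha=0.7):
    """Fuse vision-text alignment scores with region-prototype
    similarity scores for one region embedding.

    v:          (D,) L2-normalized visual embedding of the region.
    text_embs:  (C, D) L2-normalized text embeddings, one per class.
    prototypes: (C, D) L2-normalized feature construction prototypes.
    alpha:      weight on the text-alignment branch (assumed value).
    """
    align = softmax(text_embs @ v)   # alignment result
    sim = softmax(prototypes @ v)    # similarity result
    score = alpha * align + (1 - alpha) * sim
    return int(np.argmax(score)), score
```

Because both branches are turned into probability distributions before mixing, the fused score remains a valid distribution over classes, and the argmax gives the classification result for the region.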
According to the zero sample detection device based on the vision-language pre-training model and the class Prototype proposed by the embodiment of the present application, after the region of interest of the object to be detected is extracted, region features are extracted from the region of interest and input into the preset visual pre-training model to obtain a visual embedding vector; text information of the base classes and the novel classes is input into the preset text pre-training model to obtain a text embedding vector; a feature construction category is obtained, based on the visual embedding vector, from the classification score regions of the base classes and the novel classes; and the classification result of the object to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result of the region of interest and the feature construction category. This solves the problems that dataset labeling is costly, tends to occupy a large amount of resources and time, and that an optimization method for the vision-language pre-training model is lacking in the pre-training process, all of which reduce zero-sample object detection precision.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.
The processor 402, when executing the program, implements the zero sample detection method based on the vision-language pre-training model and class Prototype provided in the above embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing a computer program executable on the processor 402.
The memory 401 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402, and the communication interface 403 are implemented independently, the communication interface 403, the memory 401, and the processor 402 may be connected to each other by a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 4, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may perform communication with each other through internal interfaces.
The processor 402 may be a central processing unit (Central Processing Unit, abbreviated as CPU) or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC) or one or more integrated circuits configured to implement embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor implements the zero sample detection method based on the vision-language pre-training model and class Prototype as above.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.
Claims (10)
1. A zero sample detection method based on a vision-language pre-training model and a class Prototype, comprising the steps of:
acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of base classes and novel classes into a preset text pre-training model to obtain a text embedding vector; and
obtaining a feature construction category according to the classification score region of the base classes and the classification score region of the novel classes based on the visual embedding vector, and generating a classification result of the object to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result of the region of interest and the feature construction category.
2. The method according to claim 1, wherein the extracting the region of interest of the object to be detected, extracting object-level region features from the region of interest, comprises:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the object to be detected according to the visual feature value through a region proposal network (RPN), and extracting, through a plurality of RoIAlign modules connected in series, the object-level region features that satisfy the preset class-agnostic condition.
3. The method of claim 1, wherein before inputting the text information of the base class and the novel class to a preset text pre-training model to obtain a text embedding vector, further comprising:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
4. The method of claim 1, wherein deriving feature construction categories from the classification score region of the base class and the classification score region of the novel class comprises:
based on the visual embedded vector, selecting a classification score region of the base class through a first preset learning method to obtain a characteristic construction category of the base class;
and selecting the classification score region of the novel classes through a second preset learning method based on the visual embedding vector to obtain the feature construction category of the novel classes.
5. The method of claim 4, wherein the generating the classification result of the object to be detected based on the alignment result of the visual embedding vector and the text embedding vector and the similarity result of the region of interest and the feature construction class comprises:
performing iterative learning on the visual embedding vector, the text embedding vector and the interest region and the feature construction category by an exponential sliding average method to respectively obtain an alignment result of the visual embedding vector and the text embedding vector and a similarity result of the interest region and the feature construction category;
and carrying out weighted calculation on the alignment result and the similarity result to generate a classification result of the target to be detected.
6. A zero sample detection device based on a vision-language pre-training model and a class Prototype, comprising:
The extraction module is used for acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
the input module is used for inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of base classes and novel classes into a preset text pre-training model to obtain a text embedding vector; and
the generation module is used for obtaining a characteristic construction category according to the classification score area of the base class and the classification score area of the novel class based on the visual embedding vector, and generating a classification result of the object to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result of the interest area and the characteristic construction category.
7. The apparatus according to claim 6, wherein the extraction module is specifically configured to:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the object to be detected according to the visual feature value through the RPN, and extracting, through a plurality of RoIAlign modules connected in series, the object-level region features that satisfy the preset class-agnostic condition.
8. The apparatus of claim 6, wherein the input module is further configured to, before inputting the text information of the base class and the novel class to a preset text pre-training model to obtain a text embedding vector:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the zero sample detection method based on a vision-language pre-training model and class Prototype as claimed in any of claims 1-5.
10. A computer readable storage medium having stored thereon a computer program, wherein the program is executed by a processor for implementing a zero sample detection method based on a vision-language pre-training model and class Prototype according to any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310871656.XA CN117079007A (en) | 2023-07-17 | 2023-07-17 | Zero sample detection method based on vision-language pre-training model and class Prototype |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117079007A true CN117079007A (en) | 2023-11-17 |
Family
ID=88703269
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310871656.XA Pending CN117079007A (en) | 2023-07-17 | 2023-07-17 | Zero sample detection method based on vision-language pre-training model and class Prototype |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117079007A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||