CN117079007A - Zero sample detection method based on vision-language pre-training model and class Prototype - Google Patents

Zero sample detection method based on vision-language pre-training model and class Prototype

Info

Publication number
CN117079007A
CN117079007A
Authority
CN
China
Prior art keywords
region
text
class
detected
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310871656.XA
Other languages
Chinese (zh)
Inventor
丁贵广
徐鑫浩
何宇巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202310871656.XA
Publication of CN117079007A
Legal status: Pending

Classifications

    • G06V 10/764: Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06F 40/12: Handling natural language data; text processing; use of codes for handling textual entities
    • G06N 3/0475: Computing arrangements based on biological models; neural networks; generative networks
    • G06V 10/25: Image preprocessing; determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/44: Extraction of image or video features; local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/761: Image or video pattern matching; proximity, similarity or dissimilarity measures
    • G06V 10/774: Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06V 20/70: Scenes; scene-specific elements; labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 2201/07: Indexing scheme relating to image or video recognition or understanding; target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates in particular to a zero sample detection method based on a vision-language pre-training model and a class Prototype, which comprises the following steps: extracting the region of interest of the target to be detected and extracting object-level region features from the region of interest; inputting the region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base class and the novel class into a preset text pre-training model to obtain a text embedding vector; obtaining feature-constructed class Prototypes according to the classification score regions of the base class and the novel class based on the visual embedding vector; and generating a classification result of the target to be detected according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes. The method thereby addresses the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method tailored to the pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.

Description

Zero sample detection method based on vision-language pre-training model and class Prototype
Technical Field
The application relates to the technical field of target detection, in particular to a zero sample detection method based on a vision-language pre-training model and a class Prototype.
Background
The series of vision-language models for classification tasks is built mainly around CLIP (Contrastive Language-Image Pre-training) and includes CoOp, CoCoOp, CLIP-Adapter and Tip-Adapter. This line of work obtains a vision-language model with strong generalization ability by pre-training on a large-scale image-caption dataset. On this basis, most improvements are made from the two angles of Prompt tuning and Adapter design, so as to improve the classification performance of the corresponding method on a specific dataset.
In the related art, the series of vision-language models for (zero sample) detection tasks can be broadly divided into the CLIP-like family, including ViLD, DetPro, VL-PLM, etc., and the GLIP-like family, including GLIP, GLIPv2, DetCLIP, etc. The CLIP-like methods are pre-trained on a large-scale image-caption dataset, whereas the GLIP-like methods are pre-trained on a Grounding dataset of comparable scale.
Tip-Adapter is a classical work that applies a vision-language model to image classification from the Adapter design point of view. The method takes few-shot data, generates a set of features through the Image Encoder of the CLIP model, and uses this set of features as the Prototypes of the different categories. The classification task is then completed by computing the similarity between the features of the image to be classified and the Prototypes of the different categories. Meanwhile, methods that use class Prototypes to improve image classification or object detection have also received much attention, and they are not limited to the use of vision-language models. For example, Few-Shot Object Detection by Attending to Per-Sample-Prototype, MA-GCP, FSODup and similar methods target few-shot or semi-supervised detection tasks and design Prototypes based on single-sample data or local object features to improve detection performance; a simplified sketch of this Prototype-based classification idea is given below.
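The following is a minimal, non-limiting sketch of the class-Prototype classification idea used by Tip-Adapter-style methods (assuming a PyTorch environment; the function name, the weighting factor alpha and the sharpening factor beta are illustrative rather than the exact Tip-Adapter formulation):

```python
import torch

def prototype_classify(image_feat, prototypes, text_embeds, alpha=0.5, beta=5.0):
    """Blend class-Prototype similarity with CLIP text-alignment scores.

    image_feat:  (D,)   L2-normalized feature of the image/region to classify
    prototypes:  (C, D) one L2-normalized Prototype per class, e.g. the mean of
                        few-shot features produced by the CLIP Image Encoder
    text_embeds: (C, D) L2-normalized CLIP text embeddings of the class names
    """
    # Cosine similarity to each class Prototype, sharpened into an affinity
    proto_score = torch.exp(-beta * (1.0 - image_feat @ prototypes.T))   # (C,)

    # Standard CLIP zero-shot score: similarity to the class text embeddings
    text_score = (image_feat @ text_embeds.T).softmax(dim=-1)            # (C,)

    # Weighted fusion of the Prototype cue and the text-alignment cue
    return alpha * proto_score + (1.0 - alpha) * text_score
```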
However, the GLIP-like line of work has the following drawbacks: (1) the GLIP-like models are too large, so training occupies a large amount of resources and takes a long time; (2) the model pre-training stage uses a large-scale Grounding dataset, so the labeling cost is high; (3) the GLIP-like methods perform worse than the CLIP-like methods on the Open Vocabulary Detection task.
The CLIP-like line of work has the following drawbacks: (1) no related work has yet been found that uses the idea of Adapter or class Prototype design to improve the detection accuracy of existing models; (2) related work on class Prototype improvements ignores the zero sample target detection problem and focuses only on few-shot or supervised image classification or target detection tasks, while lacking an optimization strategy specific to vision-language pre-training models; these problems need to be solved.
Disclosure of Invention
The application provides a zero sample detection method based on a vision-language pre-training model and a class Prototype, aiming to solve the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method tailored to the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.
An embodiment of a first aspect of the present application provides a zero sample detection method based on a vision-language pre-training model and a class Prototype, including the steps of:
acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base class and the novel class into a preset text pre-training model to obtain a text embedding vector; and
obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class based on the visual embedding vector, and generating a classification result of the target to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes.
According to one embodiment of the present application, extracting the region of interest of the target to be detected and extracting object-level region features from the region of interest includes:
based on the target to be detected, acquiring visual weight information in the target to be detected to obtain a visual characteristic value of the target to be detected;
and extracting the region of interest of the target to be detected according to the visual characteristic value through an RPN (Region Proposal Network), and extracting object-level region features meeting a preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
According to an embodiment of the present application, before the text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector, the method further includes:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
According to one embodiment of the present application, obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class includes:
based on the visual embedding vector, selecting the classification score region of the base class through a first preset learning method to obtain the feature-constructed class Prototype of the base class;
and selecting the classification score region of the novel class through a second preset learning method based on the visual embedding vector to obtain the feature-constructed class Prototype of the novel class.
According to one embodiment of the present application, generating the classification result of the target to be detected according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes includes:
performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature-constructed class Prototypes by an exponential moving average method, to obtain respectively an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes;
and carrying out weighted calculation on the alignment result and the similarity result to generate a classification result of the target to be detected.
According to the zero sample detection method based on the vision-language pre-training model and the class Prototype of the embodiments of the application, after the region of interest of the target to be detected is extracted, region features are extracted from the region of interest and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector; feature-constructed class Prototypes are obtained according to the classification score regions of the base class and the novel class based on the visual embedding vector; and a classification result of the target to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes. This solves the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method for the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.
An embodiment of the second aspect of the present application provides a zero sample detection device based on a vision-language pre-training model and a class Prototype, including:
the extraction module is used for acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
the input module is used for inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base class and the novel class into a preset text pre-training model to obtain a text embedding vector; and
the generation module is used for obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class based on the visual embedding vector, and generating a classification result of the target to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes.
According to one embodiment of the application, the extraction module is specifically configured to:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the target to be detected according to the visual characteristic value through the RPN, and extracting object-level region features meeting the preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
According to an embodiment of the present application, before the text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector, the input module is further configured to:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
According to one embodiment of the present application, the generating module is specifically configured to:
based on the visual embedding vector, selecting the classification score region of the base class through a first preset learning method to obtain the feature-constructed class Prototype of the base class;
and selecting the classification score region of the novel class through a second preset learning method based on the visual embedding vector to obtain the feature-constructed class Prototype of the novel class.
According to one embodiment of the present application, the generating module is specifically configured to:
performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature-constructed class Prototypes by an exponential moving average method, to obtain respectively an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes;
and carrying out weighted calculation on the alignment result and the similarity result to generate a classification result of the target to be detected.
According to the zero sample detection device based on the vision-language pre-training model and the class Prototype of the embodiments of the application, after the region of interest of the target to be detected is extracted, region features are extracted from the region of interest and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector; feature-constructed class Prototypes are obtained according to the classification score regions of the base class and the novel class based on the visual embedding vector; and a classification result of the target to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes. This solves the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method for the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.
An embodiment of a third aspect of the present application provides an electronic device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the zero sample detection method based on the vision-language pre-training model and the class Prototype as described above.
An embodiment of a fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the zero sample detection method based on a vision-language pre-training model and a class Prototype as described in the above embodiments.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of a zero sample detection method based on a vision-language pre-training model and class Prototype according to an embodiment of the present application;
FIG. 2 is a schematic diagram of pre-training according to one embodiment of the application;
FIG. 3 is a block diagram of an example of a zero sample detection device based on a vision-language pre-training model and class Prototype, according to an embodiment of the application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
The application provides a zero sample detection method based on a vision-language pre-training model and a class Prototype, aiming at the problems in the related art that dataset labeling during pre-training is costly, that a large amount of resources and time is consumed, and that an optimization method for the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection; the method described below solves these problems.
Specifically, fig. 1 is a schematic flow chart of a zero sample detection method based on a vision-language pre-training model and a class Prototype according to an embodiment of the present application.
As shown in fig. 1, the zero sample detection method based on the vision-language pre-training model and the class Prototype comprises the following steps:
in step S101, a target to be detected is acquired, a region of interest of the target to be detected is extracted, and object-level region features are extracted from the region of interest.
Further, in some embodiments, extracting the region of interest of the target to be detected and extracting object-level region features from the region of interest includes: based on the target to be detected, acquiring visual weight information in the target to be detected to obtain a visual characteristic value of the target to be detected; and extracting the region of interest of the target to be detected according to the visual characteristic value through the RPN, and extracting object-level region features meeting a preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
The preset class-agnostic condition may be set by a person skilled in the art according to the detection requirement, and is not specifically limited herein.
Specifically, when performing zero sample detection on the target to be detected, the embodiment of the application first obtains the target to be detected through a backbone network. Next, visual weight information in the target to be detected is obtained through a frozen Visual Encoder, so as to obtain the visual characteristic value of the target to be detected; meanwhile, the region of interest of the target to be detected is extracted according to the visual characteristic value using the RPN, and object-level region features meeting the preset class-agnostic condition are extracted through a plurality of RoIAlign modules connected in series, so as to improve the quality of the object-level region features. A simplified sketch of this region-feature extraction step is given below.
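The following is a minimal, non-limiting sketch of pooling object-level region features from RPN proposals with RoIAlign (assuming a PyTorch environment with torchvision; the function name, output size and spatial scale are illustrative):

```python
import torch
from torchvision.ops import roi_align

def extract_region_features(feature_map, proposals, output_size=7, spatial_scale=1/16):
    """Pool one object-level feature vector per RPN proposal.

    feature_map: (1, C, H, W) backbone feature map of the image to be detected
    proposals:   (N, 4) proposal boxes (x1, y1, x2, y2) in image coordinates,
                 e.g. produced by a Region Proposal Network
    Returns an (N, C) feature per proposal via RoIAlign and average pooling.
    """
    # Prepend the batch index column expected by torchvision's roi_align
    batch_idx = torch.zeros((proposals.shape[0], 1), device=proposals.device)
    rois = torch.cat([batch_idx, proposals], dim=1)                    # (N, 5)

    pooled = roi_align(feature_map, rois,
                       output_size=(output_size, output_size),
                       spatial_scale=spatial_scale, aligned=True)      # (N, C, 7, 7)
    return pooled.mean(dim=(2, 3))                                     # (N, C)
```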
In step S102, the object-level region features are input into a preset visual pre-training model to obtain a visual embedding vector, and the text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector.
Further, in some embodiments, before inputting the text information of the base class and the novel class into the preset text pre-training model to obtain the text embedding vector, the method further includes: acquiring text weight information in the target to be detected based on the target to be detected, to obtain a text characteristic value of the target to be detected; and extracting the text information of the base class and the novel class according to the text characteristic value.
The preset visual pre-training model and the preset text pre-training model may be related training models set by those skilled in the art according to the detection requirements, and are not specifically limited herein.
Specifically, as shown in fig. 2, after extracting the object-level region features that meet the preset class-agnostic condition, the embodiment of the present application inputs them into a preset visual pre-training model, such as the CLIP Visual Encoder, so as to obtain the corresponding visual embedding vector.
Further, based on the target to be detected, the embodiment of the application obtains the text weight information in the target to be detected through a frozen text encoder, thereby obtaining the text characteristic value of the target to be detected, and extracts the text information of the base class and the novel class according to the text characteristic value. The extracted text information of the base class and the novel class can then be input into a preset text pre-training model, such as the Text Encoder of CLIP, so as to obtain the text embedding vectors of the base class and the novel class. In this way, the embodiment of the application obtains the text characteristic values of the novel class without any additional training or labeling cost. A simplified sketch of this text-embedding step is given below.
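The following is a minimal sketch of encoding base-class and novel-class names with a frozen CLIP text encoder, using OpenAI's open-source clip package (assumed to be available; the prompt template and the example class names are illustrative):

```python
import torch
import clip  # OpenAI's open-source CLIP package (assumed available)

@torch.no_grad()
def build_text_embeddings(class_names, model, device="cuda",
                          template="a photo of a {}"):
    """Encode base-class / novel-class names with a frozen CLIP Text Encoder."""
    prompts = [template.format(name) for name in class_names]
    tokens = clip.tokenize(prompts).to(device)
    text_embeds = model.encode_text(tokens).float()
    return text_embeds / text_embeds.norm(dim=-1, keepdim=True)  # L2-normalize

# Usage sketch: one embedding matrix per vocabulary split
# model, _ = clip.load("ViT-B/32", device="cuda")
# base_embeds  = build_text_embeddings(["person", "car"], model)           # base classes
# novel_embeds = build_text_embeddings(["umbrella", "skateboard"], model)  # novel classes
```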
In step S103, based on the visual embedding vector, feature-constructed class Prototypes are obtained according to the classification score region of the base class and the classification score region of the novel class, and a classification result of the target to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes.
Further, in some embodiments, obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class includes: based on the visual embedding vector, selecting the classification score region of the base class through a first preset learning method to obtain the feature-constructed class Prototype of the base class; and selecting the classification score region of the novel class through a second preset learning method based on the visual embedding vector to obtain the feature-constructed class Prototype of the novel class.
Further, in some embodiments, generating the classification result of the target to be detected according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes includes: performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature-constructed class Prototypes by an exponential moving average method, to obtain respectively an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes; and performing a weighted calculation on the alignment result and the similarity result to generate the classification result of the target to be detected.
Specifically, as shown in fig. 2, based on the obtained visual embedding vector, the embodiment of the present application first adopts a first preset learning method to select the classification score region of the base class, for example, selecting the regions with Top-K classification scores among the base classes using the ground truth, so as to obtain the feature-constructed class Prototype of the base class.
Next, a second preset learning method is adopted to select the classification score region of the novel class, for example, selecting the regions with Top-K classification scores among the novel classes using pseudo labels, so as to obtain the feature-constructed class Prototype of the novel class.
Finally, iterative learning is performed by an exponential moving average method on the obtained feature-constructed class Prototypes of the base class and the novel class, the region of interest of the target to be detected, the visual embedding vector and the text embedding vector, so as to obtain respectively the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes; a weighted calculation is then performed on the alignment result and the similarity result to generate the classification result of the target to be detected. By introducing more prior knowledge in this way, the zero sample detection capability of the preset visual pre-training model and the preset text pre-training model on the novel class is improved while the detection accuracy on the base class is maintained. A simplified sketch of this Prototype construction and score fusion step is given below.
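The following is a minimal, non-limiting sketch of the class Prototype bookkeeping described above: Top-K confident region embeddings per class (ground-truth matches for the base class, pseudo-label confidences for the novel class) are averaged into a Prototype and refreshed with an exponential moving average, and the final classification score is a weighted sum of the vision-text alignment result and the Prototype similarity result (assuming a PyTorch environment; the class name, momentum, Top-K value and fusion weight are illustrative):

```python
import torch
import torch.nn.functional as F

class ClassPrototypeBank:
    """Per-class Prototypes built from Top-K confident region embeddings and
    refreshed online with an exponential moving average (EMA)."""

    def __init__(self, num_classes, dim, momentum=0.99, top_k=5):
        self.protos = F.normalize(torch.randn(num_classes, dim), dim=-1)
        self.momentum, self.top_k = momentum, top_k

    @torch.no_grad()
    def update(self, region_embeds, scores, class_id):
        """region_embeds: (N, D) L2-normalized embeddings assigned to class_id;
        scores: (N,) classification scores (ground-truth matches for a base
        class, pseudo-label confidences for a novel class)."""
        k = min(self.top_k, region_embeds.shape[0])
        top_idx = scores.topk(k).indices
        new_proto = F.normalize(region_embeds[top_idx].mean(dim=0), dim=-1)
        self.protos[class_id] = F.normalize(
            self.momentum * self.protos[class_id] + (1 - self.momentum) * new_proto,
            dim=-1)

    def score(self, region_embeds, text_embeds, weight=0.5):
        """Weighted fusion of the vision-text alignment result and the
        region-to-Prototype similarity result."""
        align_score = region_embeds @ text_embeds.T   # alignment with text embeddings
        proto_score = region_embeds @ self.protos.T   # similarity to class Prototypes
        return weight * align_score + (1 - weight) * proto_score
```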
In summary, through the discussion analysis of the above embodiments, the following beneficial effects may be brought about by the present application:
(1) Based on the vision-language hybrid model VL-PLM in the related art, the zero sample target detection method of the embodiments of the application merges pseudo label generation and detector training in the VL-PLM model into a single stage, which simplifies the training process of the model and enables end-to-end training.
(2) While keeping the detection accuracy on the base class unchanged, the embodiments of the application introduce the idea of class Prototype design into the zero sample target detection task, creatively design a set of class Prototype generation schemes for the base class and the novel class, and continuously optimize them during training, so as to fully exploit the inherent advantages of the vision-language pre-training model and achieve zero sample target detection with higher accuracy. The embodiments are expected to improve the AP (Average Precision) value on the novel class by about 3 percentage points relative to the VL-PLM model (i.e. to about 37%), thereby improving the detection performance of existing mainstream methods.
According to the zero sample detection method based on the vision-language pre-training model and the class Prototype of the embodiments of the application, after the region of interest of the target to be detected is extracted, region features are extracted from the region of interest and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector; feature-constructed class Prototypes are obtained according to the classification score regions of the base class and the novel class based on the visual embedding vector; and a classification result of the target to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes. This solves the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method for the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.
Fig. 3 is a block schematic diagram of a zero sample detection device based on a vision-language pre-training model and class Prototype according to an embodiment of the present application.
As shown in fig. 3, the zero sample detection device 10 based on the vision-language pre-training model and the category Prototype includes: an extraction module 100, an input module 200 and a generation module 300.
The extraction module 100 is configured to obtain a target to be detected, extract a region of interest of the target to be detected, and extract object-level region features from the region of interest;
the input module 200 is configured to input the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and to input text information of the base class and the novel class into a preset text pre-training model to obtain a text embedding vector; and
the generation module 300 is configured to obtain feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class based on the visual embedding vector, and to generate a classification result of the target to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes.
Further, in some embodiments, the extraction module 100 is specifically configured to:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the target to be detected according to the visual characteristic value through the RPN, and extracting object-level region features meeting the preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
Further, in some embodiments, before inputting the text information of the base class and the novel class into the preset text pre-training model to obtain the text embedding vector, the input module 200 is further configured to:
acquiring text weight information in a target to be detected based on the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
Further, in some embodiments, the generating module 300 is specifically configured to:
based on the visual embedding vector, selecting the classification score region of the base class through a first preset learning method to obtain the feature-constructed class Prototype of the base class;
and selecting the classification score region of the novel class through a second preset learning method based on the visual embedding vector to obtain the feature-constructed class Prototype of the novel class.
Further, in some embodiments, the generating module 300 is specifically configured to:
performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature-constructed class Prototypes by an exponential moving average method, to obtain respectively an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes;
and carrying out weighted calculation on the alignment result and the similarity result to generate a classification result of the target to be detected.
According to the zero sample detection device based on the vision-language pre-training model and the class Prototype of the embodiments of the application, after the region of interest of the target to be detected is extracted, region features are extracted from the region of interest and input into a preset visual pre-training model to obtain a visual embedding vector; text information of the base class and the novel class is input into a preset text pre-training model to obtain a text embedding vector; feature-constructed class Prototypes are obtained according to the classification score regions of the base class and the novel class based on the visual embedding vector; and a classification result of the target to be detected is generated according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes. This solves the problems that, during pre-training, dataset labeling is costly, a large amount of resources and time is consumed, and an optimization method for the vision-language pre-training model is lacking, all of which reduce the accuracy of zero sample target detection.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
memory 401, processor 402, and a computer program stored on memory 401 and executable on processor 402.
The processor 402, when executing the program, implements the zero sample detection method based on the vision-language pre-training model and class Prototype provided in the above embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing a computer program executable on the processor 402.
The memory 401 may comprise a high-speed RAM memory, and may also include a non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402, and the communication interface 403 are implemented independently, the communication interface 403, the memory 401, and the processor 402 may be connected to each other by a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 4, but this does not indicate that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may perform communication with each other through internal interfaces.
The processor 402 may be a central processing unit (Central Processing Unit, abbreviated as CPU) or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC) or one or more integrated circuits configured to implement embodiments of the present application.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor implements the zero sample detection method based on the vision-language pre-training model and class Prototype as above.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the application.

Claims (10)

1. A zero sample detection method based on a vision-language pre-training model and a class Prototype, comprising the steps of:
acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of a base class and a novel class into a preset text pre-training model to obtain a text embedding vector; and
obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class based on the visual embedding vector, and generating a classification result of the target to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes.
2. The method according to claim 1, wherein extracting the region of interest of the target to be detected and extracting object-level region features from the region of interest comprises:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the target to be detected according to the visual characteristic value through a Region Proposal Network (RPN), and extracting object-level region features meeting a preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
3. The method of claim 1, wherein before inputting the text information of the base class and the novel class to a preset text pre-training model to obtain a text embedding vector, further comprising:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
4. The method of claim 1, wherein obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class comprises:
based on the visual embedding vector, selecting the classification score region of the base class through a first preset learning method to obtain the feature-constructed class Prototype of the base class;
and selecting the classification score region of the novel class through a second preset learning method based on the visual embedding vector to obtain the feature-constructed class Prototype of the novel class.
5. The method of claim 4, wherein generating the classification result of the target to be detected according to the alignment result of the visual embedding vector and the text embedding vector and the similarity result between the region of interest and the feature-constructed class Prototypes comprises:
performing iterative learning on the visual embedding vector, the text embedding vector, the region of interest and the feature-constructed class Prototypes by an exponential moving average method, to obtain respectively an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes;
and carrying out weighted calculation on the alignment result and the similarity result to generate a classification result of the target to be detected.
6. A zero sample detection device based on a vision-language pre-training model and a class Prototype, comprising:
The extraction module is used for acquiring a target to be detected, extracting an interest region of the target to be detected, and extracting object-level region features from the interest region;
the input module is used for inputting the object-level region features into a preset visual pre-training model to obtain a visual embedding vector, and inputting text information of the base class and the novel class into a preset text pre-training model to obtain a text embedding vector; and
the generation module is used for obtaining feature-constructed class Prototypes according to the classification score region of the base class and the classification score region of the novel class based on the visual embedding vector, and generating a classification result of the target to be detected according to an alignment result of the visual embedding vector and the text embedding vector and a similarity result between the region of interest and the feature-constructed class Prototypes.
7. The apparatus according to claim 6, wherein the extraction module is specifically configured to:
based on the target to be detected, acquiring visual weight information in the target to be detected, and obtaining a visual characteristic value of the target to be detected;
and extracting the region of interest of the target to be detected according to the visual characteristic value through the RPN, and extracting object-level region features meeting the preset class-agnostic condition through a plurality of RoIAlign modules connected in series.
8. The apparatus of claim 6, wherein the input module is further configured to, before inputting the text information of the base class and the novel class to a preset text pre-training model to obtain a text embedding vector:
based on the target to be detected, acquiring text weight information in the target to be detected, and obtaining a text characteristic value of the target to be detected;
and extracting text information of the base class and the novel class according to the text characteristic value.
9. An electronic device, comprising: memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the zero sample detection method based on a vision-language pre-training model and class Prototype as claimed in any of claims 1-5.
10. A computer readable storage medium having stored thereon a computer program, wherein the program is executed by a processor for implementing a zero sample detection method based on a vision-language pre-training model and class Prototype according to any of claims 1-5.
CN202310871656.XA 2023-07-17 2023-07-17 Zero sample detection method based on vision-language pre-training model and class Prototype Pending CN117079007A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310871656.XA CN117079007A (en) 2023-07-17 2023-07-17 Zero sample detection method based on vision-language pre-training model and class Prototype

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310871656.XA CN117079007A (en) 2023-07-17 2023-07-17 Zero sample detection method based on vision-language pre-training model and class Prototype

Publications (1)

Publication Number Publication Date
CN117079007A true CN117079007A (en) 2023-11-17

Family

ID=88703269

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310871656.XA Pending CN117079007A (en) 2023-07-17 2023-07-17 Zero sample detection method based on vision-language pre-training model and class Prototype

Country Status (1)

Country Link
CN (1) CN117079007A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination