CN112749738B - Zero sample object detection method for performing superclass reasoning by fusing context - Google Patents


Info

Publication number
CN112749738B
Authority
CN
China
Prior art keywords
superclass
cell
class
context
reasoning
Prior art date
Legal status
Active
Application number
CN202011618077.7A
Other languages
Chinese (zh)
Other versions
CN112749738A (en)
Inventor
李亚南
李太豪
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202011618077.7A
Publication of CN112749738A
Application granted
Publication of CN112749738B
Legal status: Active (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a zero sample object detection method that performs superclass reasoning by fusing context, and can localize and recognize brand-new objects that have never been seen before, even when labeled training pictures are missing. First, an object detection network predicts the object frames that may exist in an input picture. Second, based on the visual features of each candidate object frame, the position of the object is located and its specific category is predicted using label semantic vectors. Multi-layer dilated convolution is then adopted to extract the context information of the candidate object frames, and the extracted context information is used to predict the corresponding superclass. Finally, the predicted specific category is fused with the superclass to obtain the final recognition result. The method is simple, convenient and flexible, and can significantly improve the detection performance on unseen objects.

Description

Zero sample object detection method for performing superclass reasoning by fusing context
Technical Field
The invention relates to the technical field of computer vision, in particular to a zero sample object detection method for performing superclass reasoning by fusing context.
Background
Object detection is one of the classical problems in the field of computer vision, and object detection techniques based on deep neural networks have achieved great success in recent years. One key ingredient of this success is the availability of large labeled training datasets with accurate bounding-box annotations. However, on the one hand, it is difficult to collect and annotate images for every object class beyond everyday objects, for example endangered species or continually released new products. On the other hand, when data from the target domain is scarce or absent, an object detector trained on the source domain is difficult to generalize to the target domain. In contrast, humans have the remarkable ability to quickly learn new concepts and new objects even without ever having seen an image of them, and a good object detection system should possess this learning ability. In order to bridge the gap between object detectors and human intelligence, giving object detectors the ability to detect completely new, unknown target classes (i.e., zero sample object detection) has become one of the hot research problems.
Zero sample object detection aims to detect unknown object classes in the absence of supervised training samples. Compared with zero sample recognition, it requires the model not only to recognize the class of an object but also to accurately localize the target among millions of potential candidate regions. The common practice for zero sample object detection is to incorporate a zero sample classifier into an existing object detection framework, such as Faster R-CNN or YOLO, and to bridge the semantic gap between seen and unseen classes by aligning the visual features of each object region with the intrinsic attributes of the object class (i.e., class semantic embeddings).
However, this type of approach has two drawbacks. First, such methods use only limited visual features to identify candidate regions and ignore the importance of context information, which has shown great potential in many tasks. As a result, objects that are visually similar but semantically different will be falsely detected; for example, a green apple in a kitchen may be mistaken for a tennis ball because the two objects look very similar. Second, such methods ignore the domain shift problem caused by the different data distributions of the source domain and the target domain, so a detector trained on the source domain does not generalize well to the target domain. This problem is exacerbated when both known and unknown classes are present in the same picture.
Disclosure of Invention
In order to overcome the defects of the prior art and improve the detection accuracy of zero sample object detection, the invention adopts the following technical scheme:
a zero sample object detection method for performing super-class reasoning by fusing context includes the following steps:
step one: extracting deep picture features;
step two: based on the extracted deep picture features, for each cell, predicting the position coordinates of the cell by using a coordinate prediction network, and predicting the confidence of whether an object exists in the cell by using a confidence prediction network;
step three: classifying candidate object frames by using a zero sample classifier based on the visual characteristics of each cell to obtain a fine-grained classification result, wherein each cell can predict one or more candidate object frame positions, the characteristics of each candidate object frame are the visual characteristics of the current cell, and zero sample classification is performed based on the visual characteristics;
step four: on the basis of the first step, extracting the context information of each cell by using a context extraction network, and simultaneously predicting the superclass information of each candidate object by using a superclass prediction network on the basis of the extracted context characteristics;
step five: fusing the superclass predicted in step four with the class predicted in step three to obtain the final classification result.
Combining the superclass predicted from the context in step four with the class predicted in step three yields the final classification result. This solves the problem that, when only limited visual features are used to identify candidate object regions, the importance of context information is ignored, so that objects which are visually similar but semantically different are detected incorrectly;
Although the source domain and the target domain are mutually exclusive, i.e., the classes in the source domain (the training classes) are completely different from the classes in the target domain (the test classes), the superclass relation carries information about the unknown classes, so information about the test classes is effectively brought into the training process. The detector therefore takes the target domain into account during training, which gives it generalization ability, improves its detection performance on the target domain, and alleviates the domain shift problem, caused by the different data distributions of the source and target domains, that would otherwise prevent a detector trained on the source domain from generalizing well to the target domain.
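For illustration, the five steps could be wired together as in the following minimal PyTorch-style sketch; the module names (backbone, box_head, zs_classifier, ctx_extractor, sc_head) and the tensor shapes noted in the comments are assumptions made for this example rather than details taken from the patent.

```python
import torch

# Hypothetical modules, each standing for one step of the method.
# An image batch is assumed to be mapped to an H x W grid of cells with
# d_v-dimensional visual features; class_to_superclass is a LongTensor of
# length C_s giving the superclass index of every fine-grained class.
def detect(image, backbone, box_head, zs_classifier, ctx_extractor, sc_head, class_to_superclass):
    feats = backbone(image)                    # step 1: (B, d_v, H, W) deep picture features
    boxes, objectness = box_head(feats)        # step 2: per-cell (x, y, w, h) and confidence p
    class_scores = zs_classifier(feats)        # step 3: (B, C_s, H, W) fine-grained class scores
    ctx = ctx_extractor(feats)                 # step 4: (B, d_c, H, W) context features
    superclass_scores = sc_head(ctx)           # step 4: (B, C, H, W) superclass scores
    # step 5: weight every class score by the score of the superclass it belongs to
    fused = class_scores * superclass_scores[:, class_to_superclass, :, :]
    return boxes, objectness, fused
```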
Further, the third step specifically includes the following steps:
step 3.1: based on the visual features of each cell, projecting the visual features into a semantic embedding space using a nonlinear function; the visual feature of a cell is denoted $x_i$ and the projected feature is denoted $k_i$, where $i$ indexes the $i$-th cell;
step 3.2: calculating, in the semantic embedding space, the similarity between the projected vector and the semantic embedding vector of each object class to obtain classification score values and give a fine-grained classification result; the score values are expressed as
$$s_{ij}=\frac{\exp(k_i^{\top}e_j)}{\sum_{j'=1}^{C_s}\exp(k_i^{\top}e_{j'})},\qquad j=1,\dots,C_s,$$
where $e_j$ is the semantic embedding vector of the $j$-th training class, $C_s$ is the number of seen object classes, $j$ indexes the $j$-th training class, and $s$ stands for the seen classes, i.e., the training classes. The object classes are given in advance: in zero sample object detection, which object classes need to be detected is specified beforehand, and the labels of these object classes are given in advance, so classification can be achieved simply by comparing the projected vector with the semantic embedding vectors of the given object classes in the semantic embedding space.
Further, the fourth step specifically includes the following steps:
step 4.1: for the feature matrix of a given input picture, extracting a context feature matrix through context feature extraction, using dilated convolution to obtain a feature matrix with a larger receptive field on the original image, so that this feature matrix fuses the context information of the candidate object frames;
step 4.2: extracting the superclass relations among object classes from a semantic net, so that each superclass contains at least 1 test object class; thus, each object class belongs to a superclass;
step 4.3: based on the context features of each cell, predicting the corresponding superclass by using a multi-layer fully connected network; the true superclass of the $i$-th cell is represented as a one-hot vector $q_i\in\{0,1\}^{C}$ and the predicted superclass distribution is denoted $\hat{q}_i$; the network is optimized using the following cross-entropy loss:
$$L_{sc}=-\sum_{i=1}^{H\times W}\sum_{j=1}^{C}q_{ij}\log\hat{q}_{ij},$$
where $H\times W$ is the number of cells, $i$ indexes the cells (superclass prediction is performed for every cell), $C$ is the number of superclasses, and $j$ indexes the $j$-th superclass.
Further, the fifth step specifically includes the following steps:
step 5.1: multiplying the superclass score value $\hat{q}_{i,\sigma(j)}$ predicted in step 4.3 by the fine-grained classification score obtained in step 3.2 to obtain the final classification result, expressed as
$$\tilde{s}_{ij}=\hat{q}_{i,\sigma(j)}\cdot s_{ij},$$
where $\sigma(j)$ is the superclass to which class $j$ belongs; the classification branch is trained with the cross-entropy loss:
$$L_{cls}=-\sum_{i=1}^{H\times W}\mathbb{1}_i^{obj}\sum_{j=1}^{C_s}y_{ij}\log\tilde{s}_{ij},$$
where obj is an abbreviation of object and indicates whether the $i$-th cell contains an object: if it does, $\mathbb{1}_i^{obj}=1$, otherwise $\mathbb{1}_i^{obj}=0$; $y_{ij}=1$ if the ground-truth class of the $i$-th cell is $j$ and $y_{ij}=0$ otherwise. An optimal classification result is obtained through this cross-entropy loss function.
Further, the context feature extraction in step 4.1 is specifically as follows: let $X\in\mathbb{R}^{H\times W\times d_v}$ be the picture feature matrix, where $X$ denotes a tensor, $\mathbb{R}$ denotes the set of real numbers, and $H\times W\times d_v$ is the size of the three-dimensional matrix $X$, whose dimensions are $H$, $W$ and $d_v$ respectively; $H\times W$ is the number of cells, and each cell is characterized by a $d_v$-dimensional feature. Several dilated convolution blocks are applied consecutively on $X$, where each dilated convolution block contains $K$ dilated convolutions with dilation rates $r_1, r_2, \dots, r_K$, $r_i$ denoting the dilation rate of the $i$-th dilated convolution. The feature matrices produced by the dilated convolution blocks are fused to obtain the final context feature matrix $P\in\mathbb{R}^{H\times W\times d_c}$, where $H\times W\times d_c$ is the size of the context feature matrix and the context feature of each cell is denoted $p_i$, with $i$ indexing the cells.
Further, the nonlinear function in step 3.1 is a fully connected network.
Further, the dilated convolution in step 4.1 uses multiple layers of 3×3 dilated convolutions.
Further, the semantic net in the step 4.2 is WordNet.
The invention has the advantages that:
the invention combines the superclass corresponding to the context with the predicted class to finally obtain the final classification result, and solves the problem that when the limited visual features are used for identifying the candidate object areas, the importance of the context information is ignored, so that objects which are similar in vision but are semantically different are detected wrongly;
Although the source domain and the target domain are mutually exclusive, i.e., the classes in the source domain (the training classes) are completely different from the classes in the target domain (the test classes), the superclass relation carries information about the unknown classes, so information about the test classes is effectively brought into the training process. The detector therefore takes the target domain into account during training, which gives it generalization ability, improves its detection performance on the target domain, and alleviates the domain shift problem, caused by the different data distributions of the source and target domains, that would otherwise prevent a detector trained on the source domain from generalizing well to the target domain.
Drawings
Fig. 1 is a diagram of an object detection framework of the present invention.
FIG. 2 is a schematic diagram of the dilated convolution structure in accordance with the present invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in FIG. 1, a zero sample object detection method for performing superclass reasoning by fusing context comprises the following steps:
Step one: the picture is input into a deep convolutional neural network (CNN) to extract deep picture features.
Step two: based on the extracted deep picture features, the position coordinates are predicted for each cell by a coordinate prediction network, while a confidence prediction network predicts the confidence that an object exists in the cell. In fig. 1, p represents the probability value (between 0 and 1) that the cell contains an object; x, y, w, h represent the predicted object frame position in the cell, where x, y are the center coordinates of the object frame, w is the width of the object frame, and h is the height of the object frame; c represents the category in p(c|s), and s represents the supercategory.
Step three: classifying the candidate object frames by using a zero sample classifier based on the visual characteristics of each cell, to obtain a fine-grained classification result, wherein each cell can predict one or more candidate object frame positions (i.e. p, x, y, w, h in fig. 1), and the characteristics of each candidate object frame are the visual characteristics of the current cell, and zero sample classification is performed based on the visual characteristics, and comprises the following steps:
3.1, based on the visual features of each cell, projecting the visual features into a semantic embedding space using a nonlinear function FC (a fully connected network); the visual feature of a cell is denoted $x_i$ and the projected feature is denoted $k_i$, where $i$ indexes the $i$-th cell;
3.2, calculating, in the semantic embedding space, the similarity between the projected vector and the semantic embedding vector of each object class to obtain classification score values and give a fine-grained classification result; the score values are expressed as
$$s_{ij}=\frac{\exp(k_i^{\top}e_j)}{\sum_{j'=1}^{C_s}\exp(k_i^{\top}e_{j'})},\qquad j=1,\dots,C_s,$$
where $e_j$ is the semantic embedding vector of the $j$-th class and $C_s$ is the number of object classes, i.e., there are $C_s$ classes in total during training; $j$ indexes the $j$-th training class (a seen class, because its data is available at training time), and $s$ is an abbreviation of seen, representing the known classes, i.e., the training classes. The object classes are given in advance: in zero sample object detection, which object classes need to be detected is specified beforehand, and the labels of these object classes are given in advance, so classification can be achieved simply by comparing the projected vector with the semantic embedding vectors of the given object classes in the semantic embedding space.
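The zero sample classifier of steps 3.1 and 3.2 could be sketched as follows; the two-layer projection network and the softmax over dot-product similarities are illustrative assumptions, since the text only specifies a fully connected projection followed by a similarity comparison with the class semantic embedding vectors.

```python
import torch
import torch.nn as nn

class ZeroShotClassifier(nn.Module):
    """Project cell visual features x_i to k_i and score them against class embeddings e_j."""
    def __init__(self, d_v: int, d_e: int, class_embeddings: torch.Tensor):
        super().__init__()
        # Nonlinear projection into the semantic embedding space (step 3.1).
        self.project = nn.Sequential(nn.Linear(d_v, d_e), nn.ReLU(), nn.Linear(d_e, d_e))
        # class_embeddings: (C_s, d_e) semantic vectors of the seen classes, given in advance.
        self.register_buffer("embeddings", class_embeddings)

    def forward(self, cell_feats: torch.Tensor) -> torch.Tensor:
        # cell_feats: (N, d_v), one row of visual features per cell.
        k = self.project(cell_feats)          # (N, d_e) projected features k_i
        sim = k @ self.embeddings.t()         # (N, C_s) similarities between k_i and each e_j (step 3.2)
        return sim.softmax(dim=-1)            # fine-grained classification scores
```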
Step four: on the basis of the first step, extracting the context information of each cell by using a context extraction network, and simultaneously predicting the superclass information of each candidate object by using a superclass prediction network based on the extracted context characteristics, comprising the following steps:
4.1, for the feature matrix of a given input picture, extracting a context feature matrix through context feature extraction CFE (contextual feature extraction), using multiple layers of 3×3 dilated convolutions to obtain a feature matrix with a larger receptive field on the original image, so that this feature matrix fuses the context information of the candidate object frames. Specifically, let $X\in\mathbb{R}^{H\times W\times d_v}$ be the picture feature matrix, where $X$ denotes a tensor (the three-dimensional matrix that the deep network outputs for the input picture), $\mathbb{R}$ denotes the set of real numbers (every element of $X$ is a real number), and $H\times W\times d_v$ is the size of the three-dimensional matrix $X$, whose dimensions are $H$, $W$ and $d_v$ respectively; $H\times W$ is the number of cells, each cell is characterized by a $d_v$-dimensional feature, and $v$ is an abbreviation of visual. Several dilated convolution blocks are applied consecutively on $X$, where each dilated convolution block contains $K$ dilated convolutions with dilation rates $r_1, r_2, \dots, r_K$, $r_i$ denoting the dilation rate of the $i$-th dilated convolution (from the 1st to the $K$-th), as shown in FIG. 2. The feature matrices produced by the dilated convolution blocks are fused to obtain the final context feature matrix $P\in\mathbb{R}^{H\times W\times d_c}$, where $H\times W\times d_c$ is the size of the context feature matrix, $c$ is an abbreviation of context (so $d_c$ is distinguished from $d_v$), and the context feature of each cell is denoted $p_i$, with $i$ indexing the cells;
4.2, extracting the superclass relations among object classes from WordNet (building the superclasses), so that each superclass contains at least 1 test object class; thus, each object class belongs to a superclass;
4.3, based on the context features of each cell, predicting the corresponding superclass using a multi-layer fully connected network (e.g., an MLP, multilayer perceptron); suppose the true superclass of the $i$-th cell is represented as a one-hot vector $q_i\in\{0,1\}^{C}$ and the predicted superclass distribution is denoted $\hat{q}_i$; the network is optimized using the following cross-entropy loss:
$$L_{sc}=-\sum_{i=1}^{H\times W}\sum_{j=1}^{C}q_{ij}\log\hat{q}_{ij},$$
where $i$ indexes the cells and superclass prediction is performed for every cell, $C$ denotes the number of superclasses (each cell needs to predict the probability of all $C$ superclasses), and $j$ indexes the $j$-th superclass.
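Steps 4.1 and 4.3 could be sketched as follows, assuming a single dilated-convolution block whose branches are fused by concatenation and a two-layer MLP superclass head; the dilation rates, the fusion-by-concatenation choice and the layer sizes are assumptions, since the text only specifies 3×3 dilated convolutions whose outputs are fused and a multi-layer fully connected prediction network.

```python
import torch
import torch.nn as nn

class ContextFeatureExtraction(nn.Module):
    """Step 4.1: K parallel 3x3 dilated convolutions whose outputs are fused (here by concatenation)."""
    def __init__(self, d_v: int, d_c: int, rates=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(d_v, d_c, kernel_size=3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(d_c * len(rates), d_c, kernel_size=1)  # fuse the K feature matrices

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, d_v, H, W) picture feature matrix X
        ctx = torch.cat([torch.relu(b(x)) for b in self.branches], dim=1)
        return self.fuse(ctx)                                        # (B, d_c, H, W) context matrix P

class SuperclassHead(nn.Module):
    """Step 4.3: per-cell superclass prediction from context features with an MLP."""
    def __init__(self, d_c: int, num_superclasses: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d_c, d_c), nn.ReLU(), nn.Linear(d_c, num_superclasses))

    def forward(self, ctx: torch.Tensor) -> torch.Tensor:
        b, d_c, h, w = ctx.shape
        cells = ctx.permute(0, 2, 3, 1).reshape(b * h * w, d_c)      # one row per cell p_i
        return self.mlp(cells).log_softmax(dim=-1)                   # log of predicted superclass scores

# Cross-entropy loss L_sc summed over all cells (one superclass label per cell).
def superclass_loss(log_pred: torch.Tensor, true_superclass: torch.Tensor) -> torch.Tensor:
    return nn.functional.nll_loss(log_pred, true_superclass, reduction="sum")
```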
Step five: organically fusing the super class predicted in the fourth step with the class predicted in the third step to obtain a final classification result, wherein the method comprises the following steps of:
5.1, multiplying the superclass score value predicted in step 4.3 (one superclass score is predicted per cell), $\hat{q}_{i,\sigma(j)}$, by the fine-grained classification score obtained in step 3.2 to obtain the final classification result. Suppose the classification result is expressed as
$$\tilde{s}_{ij}=\hat{q}_{i,\sigma(j)}\cdot s_{ij},$$
where $\sigma(j)$ (denoted $s$ in fig. 1) is the superclass to which class $j$ belongs; the classification branch is trained with the cross-entropy loss:
$$L_{cls}=-\sum_{i=1}^{H\times W}\mathbb{1}_i^{obj}\sum_{j=1}^{C_s}y_{ij}\log\tilde{s}_{ij},$$
where obj is an abbreviation of object and indicates whether the $i$-th cell contains an object: if it does, $\mathbb{1}_i^{obj}=1$, otherwise $\mathbb{1}_i^{obj}=0$; $y_{ij}=1$ if the ground-truth class of the $i$-th cell is $j$ and $y_{ij}=0$ otherwise. An optimal classification result is obtained through this cross-entropy loss function.
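The score fusion of step 5.1 reduces to an indexed elementwise product, as in the sketch below; class_to_superclass, the lookup giving the superclass index of every class j, is an assumed pre-built table (for example derived from the WordNet-based superclass relations of step 4.2).

```python
import torch

def fuse_scores(class_scores: torch.Tensor,
                superclass_scores: torch.Tensor,
                class_to_superclass: torch.Tensor) -> torch.Tensor:
    """Step 5.1: final score = superclass score of the class's superclass * fine-grained class score.

    class_scores:        (N, C_s) scores from the zero sample classifier (step 3.2)
    superclass_scores:   (N, C)   scores predicted by the superclass head (step 4.3)
    class_to_superclass: (C_s,)   long tensor mapping each class j to its superclass index
    """
    gathered = superclass_scores[:, class_to_superclass]  # (N, C_s): score of each class's superclass
    return class_scores * gathered                         # (N, C_s): fused final scores

# Toy example: classes [dog, cat, tennis ball] under superclasses [animal, sports equipment].
fused = fuse_scores(torch.rand(2, 3), torch.rand(2, 2), torch.tensor([0, 0, 1]))
```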
Each cell has its own object class and context features. Through the superclass relations, the superclass corresponding to each object class is known in advance; for example, the superclass of "dog" and "cat" may be "animal". This mapping is constructed beforehand. The visual features of a cell are used to predict the corresponding object class, such as dog, while the context features of the cell are used to predict the superclass, i.e., animal. The two prediction paths are then fused to obtain the final classification result.
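As an illustration of how such a class-to-superclass mapping could be built from WordNet (step 4.2), the sketch below uses the NLTK WordNet interface and simply takes the ancestor at a fixed depth along each class's first hypernym path; the fixed-depth heuristic and the SUPERCLASS_DEPTH value are assumptions, since the text only requires that the superclass relations come from WordNet and that each superclass contains at least one test class.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

SUPERCLASS_DEPTH = 4  # assumed depth along the hypernym path used as the superclass level

def superclass_of(class_name: str) -> str:
    """Map a class name (e.g. 'dog') to a coarser WordNet hypernym used as its superclass."""
    synset = wn.synsets(class_name, pos=wn.NOUN)[0]        # first noun sense of the class name
    path = synset.hypernym_paths()[0]                      # hypernym chain from the root down to the synset
    ancestor = path[min(SUPERCLASS_DEPTH, len(path) - 1)]  # clamp for shallow synsets
    return ancestor.name()

class_to_superclass = {c: superclass_of(c) for c in ["dog", "cat", "apple"]}
```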
Combining the superclass predicted from the context in step four with the class predicted in step three yields the final classification result. This solves the problem that, when only limited visual features are used to identify candidate object regions, the importance of context information is ignored, so that objects which are visually similar but semantically different are detected incorrectly;
Although the source domain and the target domain are mutually exclusive, i.e., the classes in the source domain (the training classes) are completely different from the classes in the target domain (the test classes), the superclass relation carries information about the unknown classes, so information about the test classes is effectively brought into the training process. The detector therefore takes the target domain into account during training, which gives it generalization ability, improves its detection performance on the target domain, and alleviates the domain shift problem, caused by the different data distributions of the source and target domains, that would otherwise prevent a detector trained on the source domain from generalizing well to the target domain.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (7)

1. A zero sample object detection method for performing super-class reasoning by fusing context is characterized by comprising the following steps:
step one: extracting deep picture features;
step two: based on the extracted deep picture features, for each cell, predicting the position coordinates of the cell by using a coordinate prediction network, and predicting the confidence of whether an object exists in the cell by using a confidence prediction network;
step three: classifying candidate object frames by using a zero sample classifier based on the visual characteristics of each cell to obtain a fine-grained classification result, wherein each cell can predict one or more candidate object frame positions, the characteristics of each candidate object frame are the visual characteristics of the current cell, and zero sample classification is performed based on the visual characteristics; the method specifically comprises the following steps:
step 3.1: based on the visual features of each cell, projecting the visual features into a semantic embedding space using a nonlinear function; the visual feature of a cell is $x_i$ and the projected feature is $k_i$, where $i$ represents the $i$-th cell;
step 3.2: calculating, in the semantic embedding space, the similarity between the projected vector and the semantic embedding vector of each object class to obtain classification score values and give a fine-grained classification result; the score values are expressed as
$$s_{ij}=\frac{\exp(k_i^{\top}e_j)}{\sum_{j'=1}^{C_s}\exp(k_i^{\top}e_{j'})},\qquad j=1,\dots,C_s,$$
where $e_j$ is the semantic embedding vector of the $j$-th training class, $C_s$ represents the number of object classes, $j$ represents the $j$-th training class, and $s$ represents the known classes, namely the training classes; the object classes are given in advance;
step four: on the basis of the first step, extracting the context information of each cell by using a context extraction network, and simultaneously predicting the superclass information of each candidate object by using a superclass prediction network on the basis of the extracted context characteristics;
step five: fusing the superclass predicted in step four with the class predicted in step three to obtain the final classification result; the superclass information predicted in step four is multiplied by the fine-grained classification result obtained in step 3.2 to obtain the final classification result.
2. The method for detecting zero sample object by fusion context to perform superclass reasoning as set forth in claim 1, wherein said step four comprises the steps of:
step 4.1: extracting a context feature matrix through context feature extraction on the feature matrix of a given input picture, using dilated convolution;
step 4.2: extracting the superclass relations among object classes from a semantic net, so that each superclass contains at least 1 test object class;
step 4.3: predicting superclasses using a multi-layer fully connected network based on the context features of each cell; the true superclass of the $i$-th cell is represented as a one-hot vector $q_i\in\{0,1\}^{C}$ and the predicted superclass distribution is denoted $\hat{q}_i$; the network is optimized using the following cross-entropy loss:
$$L_{sc}=-\sum_{i=1}^{H\times W}\sum_{j=1}^{C}q_{ij}\log\hat{q}_{ij},$$
where $H\times W$ is the number of cells, $i$ indexes the cells and superclass prediction is performed for every cell, $C$ is the number of superclasses, and $j$ indexes the $j$-th superclass.
3. The method for detecting zero sample object by fusion context to perform superclass reasoning as set forth in claim 2, wherein said step five comprises the steps of:
step 5.1: multiplying the superclass score value $\hat{q}_{i,\sigma(j)}$ predicted in step 4.3 by the fine-grained classification score obtained in step 3.2 to obtain the final classification result, expressed as
$$\tilde{s}_{ij}=\hat{q}_{i,\sigma(j)}\cdot s_{ij},$$
where $\sigma(j)$ is the superclass to which class $j$ belongs; the classification branch is trained with the cross-entropy loss:
$$L_{cls}=-\sum_{i=1}^{H\times W}\mathbb{1}_i^{obj}\sum_{j=1}^{C_s}y_{ij}\log\tilde{s}_{ij},$$
where obj is an abbreviation of object and indicates whether the $i$-th cell contains an object: if it does, $\mathbb{1}_i^{obj}=1$, otherwise $\mathbb{1}_i^{obj}=0$; $y_{ij}=1$ if the ground-truth class of the $i$-th cell is $j$ and $y_{ij}=0$ otherwise; an optimal classification result is obtained through this cross-entropy loss function.
4. A method of zero sample object detection with context fusion for superclass reasoning as claimed in claim 2, characterized in that the context feature extraction in step 4.1 is specifically as follows: let $X\in\mathbb{R}^{H\times W\times d_v}$ be the picture feature matrix, where $X$ denotes a tensor, $\mathbb{R}$ denotes the set of real numbers, and $H\times W\times d_v$ is the size of the three-dimensional matrix $X$, whose dimensions are $H$, $W$ and $d_v$ respectively; $H\times W$ is the number of cells and each cell is characterized by a $d_v$-dimensional feature; several dilated convolution blocks are applied consecutively on $X$, where each dilated convolution block contains $K$ dilated convolutions with dilation rates $r_1, r_2, \dots, r_K$, $r_i$ denoting the dilation rate of the $i$-th dilated convolution; the feature matrices produced by the dilated convolution blocks are fused to obtain the final context feature matrix $P\in\mathbb{R}^{H\times W\times d_c}$, where $H\times W\times d_c$ is the size of the context feature matrix and the context feature of each cell is denoted $p_i$, with $i$ indexing the cells.
5. The method for detecting zero sample objects by fusing context to perform superclass reasoning according to claim 1, wherein the nonlinear function in step 3.1 is a fully connected network.
6. The method for detecting zero sample objects by fusing context to perform superclass reasoning according to claim 2, wherein the dilated convolution in step 4.1 uses multiple layers of 3×3 dilated convolutions.
7. The method for detecting zero-sample objects by fusion context to perform superclass reasoning according to claim 2, wherein the semantic net in the step 4.2 is WordNet.
CN202011618077.7A 2020-12-30 2020-12-30 Zero sample object detection method for performing superclass reasoning by fusing context Active CN112749738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011618077.7A CN112749738B (en) 2020-12-30 2020-12-30 Zero sample object detection method for performing superclass reasoning by fusing context

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011618077.7A CN112749738B (en) 2020-12-30 2020-12-30 Zero sample object detection method for performing superclass reasoning by fusing context

Publications (2)

Publication Number Publication Date
CN112749738A CN112749738A (en) 2021-05-04
CN112749738B true CN112749738B (en) 2023-05-23

Family

ID=75650165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011618077.7A Active CN112749738B (en) 2020-12-30 2020-12-30 Zero sample object detection method for performing superclass reasoning by fusing context

Country Status (1)

Country Link
CN (1) CN112749738B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113672711B (en) * 2021-08-09 2024-01-19 之江实验室 Service type robot intention recognition device and training and recognition method thereof
CN113887647A (en) * 2021-10-14 2022-01-04 浙江大学 Class increase and decrease sample object detection method integrating knowledge distillation and class representative point extraction
CN116994104B (en) * 2023-07-19 2024-06-11 湖北楚天高速数字科技有限公司 Zero sample identification method and system based on tensor fusion and contrast learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9928448B1 (en) * 2016-09-23 2018-03-27 International Business Machines Corporation Image classification utilizing semantic relationships in a classification hierarchy
CN110582777A (en) * 2017-05-05 2019-12-17 赫尔实验室有限公司 Zero-sample machine vision system with joint sparse representation
CN111428733A (en) * 2020-03-12 2020-07-17 山东大学 Zero sample target detection method and system based on semantic feature space conversion
CN112036170A (en) * 2020-09-03 2020-12-04 浙江大学 Neural zero sample fine-grained entity classification method based on type attention

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10908616B2 (en) * 2017-05-05 2021-02-02 Hrl Laboratories, Llc Attribute aware zero shot machine vision system via joint sparse representations
US11055555B2 (en) * 2018-04-20 2021-07-06 Sri International Zero-shot object detection
CN109993197B (en) * 2018-12-07 2023-04-28 天津大学 Zero sample multi-label classification method based on depth end-to-end example differentiation
CN110826638B (en) * 2019-11-12 2023-04-18 福州大学 Zero sample image classification model based on repeated attention network and method thereof
CN111680757A (en) * 2020-06-12 2020-09-18 汪金玲 Zero sample image recognition algorithm and system based on self-encoder


Also Published As

Publication number Publication date
CN112749738A (en) 2021-05-04

Similar Documents

Publication Publication Date Title
Xing et al. A convolutional neural network-based method for workpiece surface defect detection
CN112749738B (en) Zero sample object detection method for performing superclass reasoning by fusing context
Oliveira et al. Deep learning for human part discovery in images
CN107133569B (en) Monitoring video multi-granularity labeling method based on generalized multi-label learning
Sillito et al. Semi-supervised learning for anomalous trajectory detection
EP3447727B1 (en) A method, an apparatus and a computer program product for object detection
Yang et al. Multi-object tracking with discriminant correlation filter based deep learning tracker
CN110633632A (en) Weak supervision combined target detection and semantic segmentation method based on loop guidance
Yang et al. Detecting coarticulation in sign language using conditional random fields
CN111125406A (en) Visual relation detection method based on self-adaptive cluster learning
CN113378676A (en) Method for detecting figure interaction in image based on multi-feature fusion
Zhang et al. An efficient semi-supervised manifold embedding for crowd counting
Ghatak et al. GAN based efficient foreground extraction and HGWOSA based optimization for video synopsis generation
Iqbal et al. Classifier comparison for MSER-based text classification in scene images
Zhao et al. BiTNet: a lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network
Athira et al. Underwater object detection model based on YOLOv3 architecture using deep neural networks
Mo et al. Student behavior recognition based on multitask learning
Attia et al. Efficient deep learning models based on tension techniques for sign language recognition
Abdulghani et al. Discover human poses similarity and action recognition based on machine learning
CN115482436B (en) Training method and device for image screening model and image screening method
Wang et al. Spatial relationship recognition via heterogeneous representation: A review
Yao et al. Extracting robust distribution using adaptive Gaussian Mixture Model and online feature selection
CN113223018A (en) Fine-grained image analysis processing method
CN113516118A (en) Image and text combined embedded multi-mode culture resource processing method
Dong et al. Intelligent pixel-level pavement marking detection using 2D laser pavement images

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant