CN114155398A - Label type self-adaptive active learning image target detection method and device - Google Patents

Label type self-adaptive active learning image target detection method and device

Info

Publication number
CN114155398A
CN114155398A (application CN202111435129.1A)
Authority
CN
China
Prior art keywords
target
information
detection
labeling
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111435129.1A
Other languages
Chinese (zh)
Inventor
吕梦遥
陈辉
张希雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zhuoxi Brain And Intelligence Research Institute
Original Assignee
Hangzhou Zhuoxi Brain And Intelligence Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Zhuoxi Brain And Intelligence Research Institute filed Critical Hangzhou Zhuoxi Brain And Intelligence Research Institute
Priority to CN202111435129.1A
Publication of CN114155398A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a label-type-adaptive active learning image target detection method and device. The method includes: obtaining a target image and detecting it to obtain positioning information and classification information corresponding to each target object; labeling target objects whose classification information satisfies a first preset condition to obtain corresponding class labels; labeling target objects whose positioning information satisfies a second preset condition to obtain corresponding supplementary bounding-box labels; generating first annotation data from the two kinds of labels and adding it to an annotation data set in which second annotation data is pre-stored; and retraining the semi-supervised detection model on the annotation data set to obtain an iteratively updated target semi-supervised detection model, until the model reaches the expected performance or the annotation quantity reaches the budget. The method not only significantly saves annotation cost but also improves the detection algorithm's judgment of target categories and positions.

Description

Label type self-adaptive active learning image target detection method and device
Technical Field
The invention relates to the technical field of self-adaptive active learning, in particular to a label type self-adaptive active learning image target detection method and device.
Background
In the related art, the target detection method based on the convolutional neural network mainly depends on a large-scale data set and full-supervised training, and mainly comprises a two-stage detector based on a candidate frame, a single-stage detector based on an anchor frame and a frame-free detector based on feature points.
In general, two-stage detection first extracts candidate boxes by selective search or a region proposal network, then extracts image features of the candidate boxes to make category and location predictions. Girshick et al. first extracted candidate-box features with a convolutional neural network, with classification and localization realized by a support vector machine and a regression model, respectively. The spatial pyramid pooling model maps candidate boxes onto the feature map so that the whole image needs only one forward pass; by inserting a pooling layer before the network's last fully connected layer, a fixed-length image representation is obtained without scaling the candidate boxes. Moreover, in the related art, the accuracy of single-stage detection methods has reached the level of two-stage methods, but the large number of background anchor boxes limits network performance.
However, these algorithms still rely on large-scale, diverse, and exhaustively annotated datasets, and manual annotation is increasingly time-consuming and expensive, so it is only feasible to select a representative subset of the data for annotation. If randomly sampled images are annotated, sufficiently rich information is obtained only when the sample is large enough; otherwise the generalization ability of the model is severely affected.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, the first objective of the present invention is to provide an annotation-type-adaptive active learning image target detection method, which decouples multiple targets in a single image and decouples the classification and localization tasks, and uses joint training on fully supervised and weakly supervised data to save annotation cost as much as possible.
The second objective of the present invention is to provide an active learning image target detection device with adaptive annotation type.
A third object of the invention is to propose a non-transitory computer-readable storage medium.
A fourth object of the invention is to propose a computer program product.
To achieve the above object, an embodiment of a first aspect of the present invention provides a method, including:
detecting the target image by using the detection model to obtain positioning information and classification information corresponding to the target object;
selecting, within the quantitative annotation quota, the most valuable target objects whose classification information satisfies the first preset condition and labeling them to obtain corresponding class labels, and selecting, within the quota, the most valuable target objects whose positioning information satisfies the second preset condition and labeling them to obtain corresponding supplementary bounding-box labels;
generating first labeling data of the target object according to the category label and the supplementary bounding box label, and adding the first labeling data into a labeling data set, wherein second labeling data are prestored in the labeling data set;
and (4) retraining the semi-supervised detection model according to the labeled data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the labeled quantity reaches the budget.
Optionally, in an embodiment of the present application, the initial semi-supervised detection model is designed by:
extracting a multi-scale feature map of the image, taking the central point estimation as a first branch, taking the weak supervision global average pooling as a second branch, and sharing parameters of part of the multi-scale feature map by the first branch and the second branch;
in the first branch, performing convolution on the multi-scale feature map to obtain predicted position information;
in the second branch, convolving the multi-scale feature map yields a response map that can be supervised by image-level labels.

Optionally, in an embodiment of the present application, detecting the target image to obtain the positioning information and the classification information corresponding to the target object includes:
and predicting the target image through the initial semi-supervised detection model to obtain positioning information and classification information corresponding to the target object.
Optionally, in an embodiment of the present application, the first preset condition is that the classification information amount of the target object is higher than a first threshold, and the second preset condition is that the localization information amount of the target object is higher than a second threshold.
Optionally, in an embodiment of the present application, labeling a target object whose classification information satisfies a first preset condition to obtain a class label corresponding to the target object, and labeling a target object whose positioning information satisfies a second preset condition to obtain a supplementary bounding box label corresponding to the target object includes:
The class information amount of a target is measured with entropy:

$$I_{cls}(\hat{p}) = -\sum_{c=1}^{C} \hat{Y}_{\hat{p},c}\log \hat{Y}_{\hat{p},c}$$

where $I_{cls}(\hat{p})$ is the entropy-based class information amount of the target, $\hat{Y}_{\hat{p},c}$ is the class prediction probability at center point $\hat{p}$, and $C$ is the total number of candidate classes.

To compute the localization information amount at a center point $\hat{p}$, first compute the expectation of the local probability distribution of the compensation prediction $\hat{O}$:

$$\bar{O}_{\hat{p}} = \frac{1}{(2r+1)^2}\sum_{\|\delta\|_{\infty}\le r}\hat{O}_{\hat{p}+\delta}$$

where $r$ defines the local neighborhood radius.

Second, the difference between the entropy of the locally averaged prediction and the mean of the prediction entropies measures the mutual information between the data distribution and the model's predictive distribution, giving $I_{off}(\hat{p})$ as an estimate of the localization information amount:

$$I_{off}(\hat{p}) = H(\bar{O}_{\hat{p}}) - \frac{1}{(2r+1)^2}\sum_{\|\delta\|_{\infty}\le r}H(\hat{O}_{\hat{p}+\delta})$$

where $H(\cdot)$ computes the information entropy, defined here as:

$$H(O) = -\sum_{i} O_i \log O_i$$

Similarly, the size information amount $I_{size}(\hat{p})$ at center point $\hat{p}$ is obtained, and the total localization information amount is:

$$I_{loc}(\hat{p}) = I_{off}(\hat{p}) + I_{size}(\hat{p})$$

Thresholds $\epsilon_c$ and $\epsilon_l$ are set separately for classification and localization; when one type of information amount for a target exceeds the corresponding threshold, an annotation of the corresponding type is adaptively requested.
To achieve the above object, a second aspect of the present invention provides an active learning image target detection apparatus with label type adaptation, including:
the detection module is used for detecting the target detection object by using the detection model to obtain the positioning information and the classification information corresponding to the target object;
the evaluation module is configured to select, within the quantitative annotation quota, the most valuable target objects whose classification information satisfies the first preset condition and label them to obtain corresponding class labels, and to select, within the quota, the most valuable target objects whose positioning information satisfies the second preset condition and label them to obtain corresponding supplementary bounding-box labels;
the labeling module is used for generating first labeling data of the target object according to the category label and the supplementary bounding box label, and adding the first labeling data into a labeling data set, wherein second labeling data are prestored in the labeling data set;
and the training module is used for retraining the semi-supervised detection model according to the labeled data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the labeled quantity reaches the budget.
Optionally, in an embodiment of the present application, the initial semi-supervised detection model is designed by:
extracting a multi-scale feature map of an image, estimating a central point as a first branch, performing weak supervision global average pooling as a second branch, and sharing parameters of part of the multi-scale feature map by the first branch and the second branch;
in the first branch, performing convolution on the multi-scale feature map to obtain predicted position information;
in the second branch, convolving the multi-scale feature map results in a response map that can be supervised by image-level labels.
Optionally, in an embodiment of the present application, the detection module is further configured to:
and predicting the target image through the initial semi-supervised detection model to obtain positioning information and classification information corresponding to the target object.
Optionally, in an embodiment of the present application, the first preset condition is that the classification information amount of the target object is higher than a first threshold, and the second preset condition is that the localization information amount of the target object is higher than a second threshold.
In order to achieve the above object, a third aspect of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for detecting an object in an active learning image with adaptive annotation type according to the first aspect of the present application is implemented.
To achieve the above object, a non-transitory computer-readable storage medium is provided in a fourth embodiment of the present application, and a computer program is stored thereon, and when executed by a processor, the computer program implements the annotation type adaptive active learning image target detection method described in the first embodiment of the present application.
In summary, the method, apparatus, computer device, and non-transitory computer-readable storage medium of the embodiments of the present invention organize the active detection iteration process into five steps: model inference, target retrieval, information amount evaluation, adaptive annotation, and semi-supervised training. The method estimates the classification information amount and the localization information amount of each target in the image separately, selects valuable targets to adaptively add class labels or bounding-box labels, and designs a detection model capable of joint training on fully supervised and weakly supervised data.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a flowchart of an annotation type adaptive active learning image target detection method according to an embodiment of the present invention.
Fig. 2 is a device structure diagram of an annotation type adaptive active learning image target detection method according to an embodiment of the present invention.
Fig. 3 is a structural diagram of a target detection model of the supervised weak supervised joint training provided in the embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The following describes an annotation type adaptive active learning image target detection method and apparatus according to an embodiment of the present invention with reference to the drawings.
Fig. 1 is a schematic flowchart of a label type adaptive active learning image target detection method according to an embodiment of the present invention.
As shown in fig. 1, the method for detecting an image target by label type adaptive active learning includes the following steps:
and step S1, detecting the target detection object by using the detection model to obtain the positioning information and the classification information corresponding to the target object.
In one embodiment of the present application, before step S1, it is first necessary to design a semi-supervised detection model, including:
extracting a multi-scale feature map of the image, taking the central point estimation as a first branch, taking the weakly supervised global average pooling as a second branch, and sharing parameters of part of the multi-scale feature map by the first branch and the second branch;
in the first branch, performing convolution on the multi-scale feature map to obtain predicted position information;
in the second branch, the multi-scale feature map is convolved to obtain a response map which can be supervised by image-level labels.
Specifically, in one embodiment of the present application, extracting a multi-scale feature map of the image, taking the center-point estimation as a first branch and the weakly supervised global average pooling as a second branch, with the two branches sharing parameters of part of the multi-scale feature map, includes:

First, a feature pyramid network extracts multi-scale feature maps $F_i$ of the image with dimensions $W_i \times H_i \times D_i$, where $i \in \{1,2,3\}$ indexes three sequentially increasing feature-map resolutions and $W$, $H$, $D$ denote the feature map width, height, and depth, respectively. The center-point estimation branch and the weakly supervised branch share these features. In the weakly supervised branch, the feature map is compressed by a 3×3 convolution into a response map $\hat{Y} \in \mathbb{R}^{W \times H \times C}$, where $C$ is the number of detection classes. The response map can be supervised at the pixel level, and after global average pooling it yields a class prediction $\hat{g}$ of length $C$ that can be supervised by image-level labels. The other branch is responsible for predicting position information: the feature map is compressed by 3×3 convolutions into $\hat{O} \in \mathbb{R}^{W \times H \times 2}$, representing the two-dimensional position compensation of the center point, and, similarly, $\hat{S} \in \mathbb{R}^{W \times H \times 2}$, representing the size estimate of the bounding-box width and height.
Step S2, selecting the most valuable object according to the quantitative index quota for the target object whose classification information satisfies the first preset condition, and labeling to obtain the class label corresponding to the target detection object, and selecting the most valuable detection object according to the quantitative index quota for the target detection object whose positioning information satisfies the second preset condition, and labeling to obtain the complementary bounding box label corresponding to the detection target object.
In an embodiment of the present application, detecting a target image to obtain positioning information and classification information corresponding to a target object includes:
and predicting the target object through the initial semi-supervised detection model to obtain the positioning information and the classification information corresponding to the target object.
And, in one embodiment of the present application, the first preset condition is that the classification information amount of the target object is higher than a first threshold, and the second preset condition is that the localization information amount of the target object is higher than a second threshold.
Specifically, the model is trained on the currently available annotated data set. When the detection model is trained for the first time, a subset of images (for example, 2,000 images from VOC07 are commonly used) is randomly selected and fully annotated to form the initial training set.
The model prediction $\hat{Y}_{xyc}$ represents the probability that the target represented by the keypoint at $(x, y)$ on the feature map belongs to class $c$. The center point of each manually annotated target is denoted $p \in \mathbb{R}^2$; prediction and loss computation are performed on the low-resolution feature map with down-sampling ratio $R$, so each ground-truth center maps to $\tilde{p} = \lfloor p / R \rfloor$. The ground-truth centers are mapped onto a heatmap $Y \in [0,1]^{W \times H \times C}$ with the following Gaussian kernel:

$$Y_{xyc} = \exp\!\left(-\frac{(x - \tilde{p}_x)^2 + (y - \tilde{p}_y)^2}{2\sigma_p^2}\right)$$

where $\sigma_p$ is a standard deviation adapted to the target size.
The loss function for center-point estimation is a pixel-level logistic regression (focal loss), denoted $L_k$:

$$L_k = -\frac{1}{N}\sum_{xyc}\begin{cases}(1-\hat{Y}_{xyc})^{\alpha}\log\hat{Y}_{xyc} & \text{if } Y_{xyc}=1\\(1-Y_{xyc})^{\beta}\,\hat{Y}_{xyc}^{\alpha}\,\log(1-\hat{Y}_{xyc}) & \text{otherwise}\end{cases}$$

where $\alpha$ and $\beta$ are hyper-parameters and $N$ is the number of objects in the image.
To recover the discretization error introduced by down-sampling, a local coordinate compensation $\hat{O}_{\tilde{p}}$ is predicted for each center point and supervised with an L1 loss, denoted $L_{off}$:

$$L_{off} = \frac{1}{N}\sum_{p}\left|\hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right)\right|$$
The target size of each manual annotation is denoted $s_k$, where $k$ indexes the objects in an image. The model regresses a size estimate $\hat{S}_{\tilde{p}_k}$ for each target object, supervised with an L1 loss, denoted $L_{size}$:

$$L_{size} = \frac{1}{N}\sum_{k=1}^{N}\left|\hat{S}_{\tilde{p}_k} - s_k\right|$$
The image-level label of a sample is represented by a one-hot (multi-hot) vector $g$: when at least one object of class $c$ exists in the sample, the corresponding position is set to 1:

$$g_c = \mathbb{1}\!\left[\exists k:\; c^{(k)} = c\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function, $c^{(k)}$ is the class of the $k$-th object, and $g_c$ is the ground-truth label for class $c$. The weakly supervised branch is supervised by a multi-label cross-entropy loss, denoted $L_{cls}$:

$$L_{cls} = -\frac{1}{C}\sum_{c=1}^{C}\left[g_c\log\hat{g}_c + (1-g_c)\log(1-\hat{g}_c)\right]$$
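A minimal sketch of the image-level label vector and the multi-label cross-entropy described above (function names are illustrative):

```python
import numpy as np

def image_level_label(classes_present, C):
    """Multi-hot vector g: g_c = 1 iff at least one object of class c
    exists in the sample."""
    g = np.zeros(C)
    g[list(classes_present)] = 1.0
    return g

def multilabel_ce(g_hat, g):
    """Multi-label cross-entropy supervising the weakly supervised branch."""
    eps = 1e-12
    return -np.mean(g * np.log(g_hat + eps)
                    + (1 - g) * np.log(1 - g_hat + eps))
```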
Based on the above, the total training objective of the model is:

$$L = L_k + \lambda_{off}L_{off} + \lambda_{size}L_{size} + \lambda_{cls}L_{cls}$$

where $\lambda_{off}$, $\lambda_{size}$, $\lambda_{cls}$ are hyper-parameters controlling the weight of each branch during training.
Step S3, generating first labeled data of the target object according to the category label and the supplemental bounding box label, and adding the first labeled data to a labeled data set, where the labeled data set pre-stores second labeled data.
In an embodiment of the present application, labeling a target object whose classification information satisfies a first preset condition to obtain a class label corresponding to the target object, and labeling a target object whose positioning information satisfies a second preset condition to obtain a supplementary bounding box label corresponding to the target object includes:
The class information amount of a target is measured with entropy:

$$I_{cls}(\hat{p}) = -\sum_{c=1}^{C} \hat{Y}_{\hat{p},c}\log \hat{Y}_{\hat{p},c}$$

where $I_{cls}(\hat{p})$ is the entropy-based class information amount of the target, $\hat{Y}_{\hat{p},c}$ is the class prediction probability at center point $\hat{p}$, and $C$ is the total number of candidate classes.

To compute the localization information amount at a center point $\hat{p}$, first compute the expectation of the local probability distribution of the compensation prediction $\hat{O}$:

$$\bar{O}_{\hat{p}} = \frac{1}{(2r+1)^2}\sum_{\|\delta\|_{\infty}\le r}\hat{O}_{\hat{p}+\delta}$$

where $r$ defines the local neighborhood radius.

Second, the difference between the entropy of the locally averaged prediction and the mean of the prediction entropies measures the mutual information between the data distribution and the model's predictive distribution, giving $I_{off}(\hat{p})$ as an estimate of the localization information amount:

$$I_{off}(\hat{p}) = H(\bar{O}_{\hat{p}}) - \frac{1}{(2r+1)^2}\sum_{\|\delta\|_{\infty}\le r}H(\hat{O}_{\hat{p}+\delta})$$

where $H(\cdot)$ computes the information entropy, defined here as:

$$H(O) = -\sum_{i} O_i \log O_i$$

Similarly, the size information amount $I_{size}(\hat{p})$ at center point $\hat{p}$ is obtained, and the total localization information amount is:

$$I_{loc}(\hat{p}) = I_{off}(\hat{p}) + I_{size}(\hat{p})$$

Thresholds $\epsilon_c$ and $\epsilon_l$ are set separately for classification and localization; when one type of information amount for a target exceeds the corresponding threshold, an annotation of the corresponding type is adaptively requested.
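The two information measures above can be sketched as follows (a NumPy illustration under our own naming; the neighborhood is passed in as a flat list of per-pixel predictive distributions rather than sliced from a feature map):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    eps = 1e-12
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps))

def class_information(class_probs):
    """Classification information amount of one target: entropy of its
    class prediction probabilities (length C)."""
    return entropy(class_probs)

def localization_information(local_preds):
    """Mutual-information-style score over the (2r+1)^2 per-pixel
    predictive distributions around a center point: entropy of the
    local mean prediction minus the mean of the per-pixel entropies.
    High values mean the neighborhood disagrees with itself."""
    local_preds = np.asarray(local_preds, dtype=float)  # (K, C)
    mean_pred = local_preds.mean(axis=0)
    mean_entropy = np.mean([entropy(p) for p in local_preds])
    return entropy(mean_pred) - mean_entropy
```

A target would then be queued for class annotation when `class_information` exceeds the classification threshold, and for bounding-box annotation when the localization score exceeds its threshold.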
And step S4, retraining the semi-supervised detection model according to the annotation data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the annotation quantity reaches the budget.
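The overall active-detection cycle of steps S1 to S4 can be sketched as follows. Every class and method here (`model.detect`, `oracle`, the per-object information attributes) is a hypothetical stand-in for a component the patent describes, not an actual API:

```python
def active_detection_loop(model, unlabeled, labeled, eps_c, eps_l,
                          budget, target_perf, oracle):
    """Sketch of the iteration: inference -> target retrieval ->
    information evaluation -> adaptive annotation -> retraining."""
    spent = 0
    while model.performance() < target_perf and spent < budget:
        for image in unlabeled:
            for obj in model.detect(image):           # inference + retrieval
                ann = {}
                if obj.class_information > eps_c:     # evaluate class info
                    ann["class"] = oracle.class_label(obj)
                if obj.localization_information > eps_l:  # evaluate loc info
                    ann["bbox"] = oracle.bounding_box(obj)
                if ann:                               # adaptive annotation
                    labeled.add(image, obj, ann)
                    spent += 1
        model.retrain(labeled)                        # semi-supervised training
    return model
```

The loop terminates exactly as step S4 states: either the model reaches the expected performance or the annotation quantity exhausts the budget.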
Technical effects of the present application: by fully exploiting the decoupling of classification and localization and the separability of multiple targets in a detection task, the method estimates the classification information amount and the localization information amount of each target in an image separately, selects valuable targets, and adaptively adds class labels or bounding-box labels. A detection model capable of joint training on fully supervised and weakly supervised data is designed, which significantly saves annotation cost and specifically improves the detection algorithm's judgment of target categories and positions.
In order to implement the above embodiments, the present invention further provides an active learning image target detection apparatus with adaptive annotation types.
Fig. 2 is a schematic structural diagram of an annotation type adaptive active learning image target detection apparatus according to an embodiment of the present invention.
As shown in fig. 2, the label type adaptive active learning image target detection apparatus includes:
the detection module is used for detecting the target detection object by using the detection model to obtain the positioning information and the classification information corresponding to the target object;
the evaluation module is configured to select, within the quantitative annotation quota, the most valuable target objects whose classification information satisfies the first preset condition and label them to obtain corresponding class labels, and to select, within the quota, the most valuable target objects whose positioning information satisfies the second preset condition and label them to obtain corresponding supplementary bounding-box labels;
the labeling module is used for generating first labeling data of the target object according to the category label and the supplementary bounding box label, and adding the first labeling data into a labeling data set, wherein second labeling data are prestored in the labeling data set;
and the training module is used for retraining the semi-supervised detection model according to the labeled data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the labeled quantity reaches the budget.
In an embodiment of the present application, further, the method further includes:
an initial semi-supervised detection model was designed by:
extracting a multi-scale feature map of the image, taking the central point estimation as a first branch, taking the weak supervision global average pooling as a second branch, and sharing parameters of part of the multi-scale feature map by the first branch and the second branch;
in the first branch, performing convolution on the multi-scale characteristic graph to obtain predicted position information;
in the second branch, the convolution of the multi-scale feature map results in a response map that can be supervised by image-level labels.
In an embodiment of the present application, further, the method further includes:
and the detection module is used for predicting the target object through the initial semi-supervised detection model to obtain the positioning information and the classification information corresponding to the target object.
In an embodiment of the present application, further, the method further includes:
the preset conditions include classifying target objects with information amount higher than a first specific threshold, and the second preset conditions include locating target objects with information amount higher than a second specific threshold.
In one embodiment of the present application, the overall test model structure is shown in FIG. 3.
Technical effects of the present application: by fully exploiting the decoupling of classification and localization and the separability of multiple targets in a detection task, the device estimates the classification information amount and the localization information amount of each target in an image separately, selects valuable targets, and adaptively adds class labels or bounding-box labels. A detection model capable of joint training on fully supervised and weakly supervised data is designed, which significantly saves annotation cost and specifically improves the detection algorithm's judgment of target categories and positions.
To achieve the above object, a third aspect of the present application provides a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the label-type-adaptive active learning image target detection method described in the first aspect of the present application.
To achieve the above object, a non-transitory computer-readable storage medium is provided in a fourth embodiment of the present application, and a computer program is stored thereon, and when executed by a processor, the computer program implements a method for label type adaptive active learning image target detection as described in the first embodiment of the present application.
Although the present application has been disclosed in detail with reference to the accompanying drawings, it is to be understood that such description is merely illustrative and not restrictive of the present application. The scope of the present application is defined by the appended claims and may include various modifications, adaptations, and equivalents of the invention without departing from the scope and spirit of the application.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. A label type self-adaptive active learning image target detection method is characterized by comprising the following steps:
detecting a target object by using a detection model to obtain positioning information and classification information corresponding to the target object;
selecting, within a quantitative quota, the most valuable target objects whose classification information satisfies a first preset condition for labeling, so as to obtain class labels corresponding to the target objects, and selecting, within the quantitative quota, the most valuable target objects whose positioning information satisfies a second preset condition for labeling, so as to obtain supplementary bounding-box labels corresponding to the target objects;
generating first labeling data of the target object according to the category label and the supplementary bounding box label, and adding the first labeling data into a labeling data set, wherein second labeling data are prestored in the labeling data set;
and retraining the semi-supervised detection model according to the labeled data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the labeled quantity reaches the budget.
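The iterative loop of claim 1 (detect, score, select within a quota, annotate, retrain until performance or budget is reached) can be sketched in a few lines. This is a runnable toy, not the claimed method: `model` here is any callable returning an informativeness score, the "annotation" step is simulated by moving samples between pools, and the retraining step is only marked by a comment.

```python
# Hypothetical sketch of the claim-1 active-learning loop. A real system would
# replace the scoring callable with the semi-supervised detector's information
# estimates and actually retrain at step 4.

def active_learning_loop(model, unlabeled, labeled, budget, quota, rounds=3):
    for _ in range(rounds):
        if len(labeled) >= budget:          # stop when the labeling budget is spent
            break
        # 1. score every unlabeled target by its estimated information amount
        scored = [(model(x), x) for x in unlabeled]
        # 2. keep the `quota` most informative targets this round
        scored.sort(key=lambda t: t[0], reverse=True)
        picked = [x for _, x in scored[:quota]]
        # 3. "annotate" the picked targets and move them into the labeled pool
        labeled.extend(picked)
        unlabeled = [x for x in unlabeled if x not in picked]
        # 4. retraining of the semi-supervised detector would happen here
    return labeled
```

With an identity scoring function, the highest-valued samples are labeled first, which is the intended selection behavior.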
2. The method of claim 1, wherein the semi-supervised detection model is designed by:
extracting a multi-scale feature map of an image, and estimating a central point as a first branch and weakly supervised global average pooling as a second branch, wherein the first branch and the second branch share part of parameters of the multi-scale feature map;
in the first branch, performing convolution on the multi-scale feature map to obtain predicted position information;
in the second branch, convolving the multi-scale feature map results in a response map that can be supervised by image-level labels.
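The two-branch head of claim 2 can be illustrated numerically. This is a shape-level sketch only, with assumed tensor shapes and with the 1×1 convolutions of both branches modelled as per-pixel matrix multiplies over the channel axis; the weight names `w_loc` and `w_cls` are hypothetical.

```python
import numpy as np

def two_branch_head(feat, w_loc, w_cls):
    """feat: (C, H, W) shared multi-scale feature map.
    Returns (center-point heatmap, image-level class scores)."""
    C, H, W = feat.shape
    flat = feat.reshape(C, -1)                   # (C, H*W)
    # Branch 1: convolve shared features into predicted position information.
    heatmap = (w_loc @ flat).reshape(-1, H, W)
    # Branch 2: convolve into a per-class response map, then global average
    # pooling yields scores supervisable by image-level (weak) labels.
    response = (w_cls @ flat).reshape(-1, H, W)
    image_scores = response.mean(axis=(1, 2))
    return heatmap, image_scores
```

The two branches share the feature map (and hence part of the parameters), which is what lets weakly labeled images contribute gradients to the same backbone as fully labeled ones.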
3. The method according to claim 1 or 2, wherein detecting the target object to obtain the corresponding positioning information and classification information comprises:
predicting the target object through an initial semi-supervised detection model to obtain the positioning information and the classification information corresponding to the target object.
4. The method according to claim 3, wherein the first preset condition includes the classification information amount of a target object being above a first specific threshold, and the second preset condition includes the positioning information amount of a target object being above a second specific threshold.
5. The method according to claim 4, wherein labeling the target object whose classification information satisfies a first preset condition to obtain a class label corresponding to the target object, and labeling the target object whose positioning information satisfies a second preset condition to obtain a supplementary bounding box label corresponding to the target object, comprises:
measuring the class information amount of a target by entropy:

$I_{cls}(\hat{x}, \hat{y}) = -\sum_{c=1}^{C} p_c(\hat{x}, \hat{y}) \log p_c(\hat{x}, \hat{y})$

wherein $I_{cls}(\hat{x}, \hat{y})$ is the class information amount, measured by entropy, of the target whose center point has coordinates $(\hat{x}, \hat{y})$, $p_c(\hat{x}, \hat{y})$ is the class prediction probability for class $c$ at that point, and $C$ is the total number of candidate classes;

for calculating the positioning information amount of the center point $(\hat{x}, \hat{y})$, first calculating, over a scale-compensated neighborhood, the expectation of the local probability distribution:

$\bar{p}(\hat{x}, \hat{y}) = \frac{1}{|N_r|} \sum_{(x', y') \in N_r(\hat{x}, \hat{y})} p(x', y')$

where $r$ defines the local neighborhood radius of $N_r$;

secondly, using the difference between the entropy of the locally averaged prediction and the mean of the prediction entropies to measure the mutual information between the data distribution and the model's predictive distribution, taken as the estimate $I_{ctr}(\hat{x}, \hat{y})$ of the center-point positioning information amount:

$I_{ctr}(\hat{x}, \hat{y}) = H\big(\bar{p}(\hat{x}, \hat{y})\big) - \frac{1}{|N_r|} \sum_{(x', y') \in N_r} H\big(p(x', y')\big)$

wherein $H(\cdot)$ computes the information entropy, defined herein as $H(p) = -\sum_{c} p_c \log p_c$;

similarly obtaining the size information amount $I_{wh}(\hat{x}, \hat{y})$ of the center point $(\hat{x}, \hat{y})$, and using $I_{loc}(\hat{x}, \hat{y}) = I_{ctr}(\hat{x}, \hat{y}) + I_{wh}(\hat{x}, \hat{y})$ to represent the total positioning information amount;

setting thresholds $\epsilon_c$ and $\epsilon_l$ separately for classification and positioning, respectively screening out the targets whose information amount exceeds the corresponding threshold, selecting, up to the quota, the targets with the largest information amount, and providing labels of the corresponding type.
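The two information measures of claim 5 have direct numerical counterparts: the class information amount is the entropy of one class-probability vector, and the center-point positioning information amount is a mutual-information estimate, the entropy of the neighborhood-averaged prediction minus the mean of the per-location entropies. The sketch below assumes the neighborhood predictions are already gathered into an array; function names are illustrative.

```python
import numpy as np

def entropy(p, axis=-1):
    """Information entropy H(p) = -sum_c p_c log p_c (natural log)."""
    p = np.clip(p, 1e-12, 1.0)   # guard against log(0)
    return -(p * np.log(p)).sum(axis=axis)

def classification_info(p):
    """Class information amount of one target: entropy of its class prediction."""
    return entropy(np.asarray(p))

def localization_info(local_preds):
    """Center-point information amount over a local neighborhood:
    H(average prediction) - average(H(prediction))."""
    local_preds = np.asarray(local_preds)     # (N, C): predictions within radius r
    mean_pred = local_preds.mean(axis=0)      # expectation of the local distribution
    return entropy(mean_pred) - entropy(local_preds, axis=1).mean()
```

When all neighborhood predictions agree, the mutual-information term is zero (the model is locally consistent); when they disagree, it grows, flagging the target as worth a bounding-box label.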
6. An adaptive active learning image target detection device of label type, characterized by comprising:
the detection module is used for detecting a target object by using a detection model to obtain positioning information and classification information corresponding to the target object;
the evaluation module is used for selecting, within a quantitative quota, the most valuable target objects whose classification information satisfies a first preset condition for labeling, so as to obtain class labels corresponding to the target objects, and for selecting, within the quantitative quota, the most valuable target objects whose positioning information satisfies a second preset condition for labeling, so as to obtain supplementary bounding-box labels corresponding to the target objects;
the labeling module is used for generating first labeling data of the target object according to the category label and the supplementary bounding box label, and adding the first labeling data into a labeling data set, wherein second labeling data are prestored in the labeling data set;
and the training module is used for retraining the semi-supervised detection model according to the labeled data set to obtain an iteratively updated target semi-supervised detection model until the model reaches the expected performance or the labeled quantity reaches the budget.
7. The apparatus of claim 6, wherein the initial semi-supervised detection model is designed by:
extracting a multi-scale feature map of an image, estimating a central point as a first branch, performing weak supervision global average pooling as a second branch, and sharing parameters of part of the multi-scale feature map by the first branch and the second branch;
in the first branch, performing convolution on the multi-scale feature map to obtain predicted position information;
in the second branch, convolving the multi-scale feature map results in a response map that can be supervised by image-level labels.
8. The apparatus of claim 6 or 7, wherein the detection module is further configured to:
and predicting the target image through the initial semi-supervised detection model to obtain positioning information and classification information corresponding to the target object.
9. The apparatus of claim 8, wherein the first preset condition includes the classification information amount of a target object being above a first specific threshold, and the second preset condition includes the positioning information amount of a target object being above a second specific threshold.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the annotation type adaptive active learning image target detection method according to any one of claims 1 to 5.
CN202111435129.1A 2021-11-29 2021-11-29 Label type self-adaptive active learning image target detection method and device Pending CN114155398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111435129.1A CN114155398A (en) 2021-11-29 2021-11-29 Label type self-adaptive active learning image target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111435129.1A CN114155398A (en) 2021-11-29 2021-11-29 Label type self-adaptive active learning image target detection method and device

Publications (1)

Publication Number Publication Date
CN114155398A true CN114155398A (en) 2022-03-08

Family

ID=80784242

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111435129.1A Pending CN114155398A (en) 2021-11-29 2021-11-29 Label type self-adaptive active learning image target detection method and device

Country Status (1)

Country Link
CN (1) CN114155398A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972222A (en) * 2022-05-13 2022-08-30 徕卡显微系统科技(苏州)有限公司 Cell information statistical method, device, equipment and computer readable storage medium
CN115527083A (en) * 2022-09-27 2022-12-27 中电金信软件有限公司 Image annotation method and device and electronic equipment
CN115527083B (en) * 2022-09-27 2023-04-11 中电金信软件有限公司 Image annotation method and device and electronic equipment
CN116403074A (en) * 2023-04-03 2023-07-07 上海锡鼎智能科技有限公司 Semi-automatic image labeling method and device based on active labeling
CN116403074B (en) * 2023-04-03 2024-05-14 上海锡鼎智能科技有限公司 Semi-automatic image labeling method and device based on active labeling
CN117828538A (en) * 2024-03-06 2024-04-05 山东大学 Multi-source information comprehensive analysis method and system based on weight distribution
CN117828538B (en) * 2024-03-06 2024-05-31 山东大学 Multi-source information comprehensive analysis method and system based on weight distribution

Similar Documents

Publication Publication Date Title
CN114155398A (en) Label type self-adaptive active learning image target detection method and device
US10910099B2 (en) Segmentation, landmark detection and view classification using multi-task learning
CN109086811B (en) Multi-label image classification method and device and electronic equipment
Ke et al. Adaptive change detection with significance test
US20180300576A1 (en) Semi-automatic labelling of datasets
US8442309B2 (en) Semantic scene segmentation using random multinomial logit (RML)
CN108564085B (en) Method for automatically reading of pointer type instrument
CN108229522B (en) Neural network training method, attribute detection device and electronic equipment
KR20170058263A (en) Methods and systems for inspecting goods
CN114067109B (en) Grain detection method, grain detection device and storage medium
CN107578424B (en) Dynamic background difference detection method, system and device based on space-time classification
CN106815806B (en) Single image SR reconstruction method based on compressed sensing and SVR
CN111967535B (en) Fault diagnosis method and device for temperature sensor of grain storage management scene
CN112906816A (en) Target detection method and device based on optical differential and two-channel neural network
CN115496892A (en) Industrial defect detection method and device, electronic equipment and storage medium
CN117671508B (en) SAR image-based high-steep side slope landslide detection method and system
CN115082781A (en) Ship image detection method and device and storage medium
CN112991280B (en) Visual detection method, visual detection system and electronic equipment
CN116090938B (en) Method for identifying load state of rear loading vehicle
CN112348750A (en) SAR image change detection method based on threshold fusion and neighborhood voting
CN116630268A (en) Road disease detection method, system, equipment and medium
CN115512202A (en) Small sample target detection method, system and storage medium based on metric learning
CN114663760A (en) Model training method, target detection method, storage medium and computing device
CN111815627B (en) Remote sensing image change detection method, model training method and corresponding device
Chaabane et al. Self attention deep graph CNN classification of times series images for land cover monitoring

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination