CN114154563A - Target detection method based on hybrid supervised training


Info

Publication number
CN114154563A
CN114154563A (Application No. CN202111355318.8A)
Authority
CN
China
Prior art keywords
prediction
training
peak
class
loss function
Prior art date
Legal status
Pending
Application number
CN202111355318.8A
Other languages
Chinese (zh)
Inventor
李甲
穆凯
齐云山
赵沁平
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202111355318.8A
Publication of CN114154563A
Legal status: Pending

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
              • G06F 18/24: Classification techniques
                • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
              • G06F 18/25: Fusion techniques
                • G06F 18/253: Fusion techniques of extracted features
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods
                • G06N 3/084: Backpropagation, e.g. using gradient descent


Abstract

The invention provides a target detection method based on hybrid supervised training. Starting from an analysis of the annotation strategies used to build training data sets for object detectors, the method trains the detector on a mixture of a small amount of fully annotated data and a large amount of weakly supervised (class-label-only) annotated data. During training, a peak class activation response mechanism models the mapping from image-level class labels to coarse-grained location information for the weakly annotated data and thereby assists the training of the detection branch, while the classification and localization branches of the model are trained on the fully annotated data. Finally, the results of the two branches are adaptively fused, which improves the performance of the target detector. On the one hand, the invention provides a training method for a hybrid supervised target detector based on peak class activation response that markedly reduces annotation and training cost while preserving performance; on the other hand, the method can be combined with existing target detectors, markedly reducing training cost and improving detection performance to a certain extent.

Description

Target detection method based on hybrid supervised training
Technical Field
The invention relates to the field of computer vision and multimedia analysis, in particular to a target detection method based on hybrid supervised training.
Background
Object detection is a fundamental task in computer vision: given a set of classes, the goal is to locate every object of those classes in an input image and output a rectangular bounding box for each. Object detectors are commonly divided into one-stage and two-stage detectors. Two-stage detectors follow the R-CNN architecture proposed by Girshick et al. at UC Berkeley, in which regions of interest are first generated by a low-level computer vision algorithm and then classified and localized. SPPNet, proposed by He et al. at Microsoft Research, and Fast R-CNN, proposed by Girshick at Microsoft Research, compute the feature map only once and extract region features through spatial pyramid pooling or RoI pooling, effectively reducing redundant computation. Faster R-CNN, proposed by Ren et al. of the University of Science and Technology of China, further improves performance by replacing the time-consuming region proposal algorithm with a region proposal network. R-FCN, proposed by Dai et al. at Microsoft Research, avoids per-region computation by generating position-sensitive score maps with a fully convolutional network. Mask R-CNN, proposed by He et al. at Facebook AI Research, effectively addresses coarse spatial quantization with a region-of-interest alignment layer. FPN, proposed by Lin et al. at Facebook AI Research, fuses low-resolution, semantically strong features with high-resolution, semantically weak features through a top-down pathway and skip connections, alleviating the problem of scale variation. Two-stage detectors generally achieve better detection performance but incur a larger computational overhead and often fail to meet real-time requirements. To address this, one-stage detectors avoid the time-consuming proposal generation step and directly classify predefined detection boxes; examples include YOLO by Redmon et al. at the University of Washington and the SSD model by Liu et al. at the University of North Carolina at Chapel Hill.
Existing object detectors are usually trained with full supervision, i.e., the data set annotates both the class and the bounding box of every object. Such annotation is expensive and time-consuming, and in some settings, such as medical imaging, it is difficult to obtain. In complex and dense scenes in particular, the number of object instances is large, objects are densely distributed and heavily occlude one another, so bounding box annotation is costly and the training cost is high. A separate line of research proposes weakly supervised methods, in which the data set only annotates the classes that appear in each image without bounding boxes; this annotation scheme markedly reduces labeling cost. However, existing weakly supervised methods are typically formulated as multiple-instance learning and, lacking explicit location supervision, their performance generally falls far short of fully supervised detectors. Therefore, hybrid supervised training, i.e., training the detection network with a small amount of fully supervised annotation together with a large amount of easily obtained weakly supervised annotation, can substantially reduce the training cost while preserving performance.
The present invention is the first to provide a lightweight hybrid supervised object detection training method that greatly reduces the training and annotation cost of the model while achieving comparable performance, and that is clearly superior to existing methods at the same annotation cost.
Disclosure of Invention
In view of the above practical needs and key problems, the present invention aims to provide a target detection method based on hybrid supervised training, in which a small amount of fully supervised annotated data is used to train the classification and regression heads of the model, and a large amount of low-cost weakly supervised annotated data is used to train the classification head. For weakly annotated data the model trains the classification branch of the classification head and, at the same time, introduces a peak class activation response mechanism that models the mapping from classification information to coarse-grained location information; in the test stage, the extracted coarse-grained location information is fused with the original location feature map, which enhances the response at object locations while suppressing noise.
The invention comprises the following 3 steps:
step S100, for a weakly annotated image of the training data set, calculating the loss between the class labels and the model classification prediction using the network loss function, and minimizing the loss function by gradient back-propagation to train the classification branch of the model;
step S200, for a fully annotated image of the training data set, calculating the loss between the class labels and the classification prediction and the loss between the location labels and the localization prediction using the network loss function, minimizing the loss function by gradient back-propagation, and training the classification and localization branches of the model;
and step S300, for the image to be detected, performing forward computation with the convolutional neural network whose weights were trained by the above method, shifting the result of the peak class activation response branch and fusing it into the center point detection branch, and obtaining the prediction boxes from the enhanced detection features.
Drawings
FIG. 1 is a flow chart of a hybrid supervised training based target detection method of the present invention;
FIG. 2 is a block diagram of the hybrid supervised training based target detection method of the present invention;
FIG. 3 is a training strategy diagram of the hybrid supervised training based target detection method of the present invention;
FIG. 4 is a fusion detection diagram of the target detection method based on hybrid supervised training of the present invention.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information. The following detailed description of embodiments of the invention is provided in connection with the accompanying drawings. The following examples or figures are illustrative of the present invention and are not intended to limit the scope of the present invention.
FIG. 1 is a flow chart of a target detection method based on hybrid supervised training, which includes the following steps:
step S100, for a weakly annotated image of the training data set, calculating the loss between the class labels and the model classification prediction using the network loss function, and minimizing the loss function by gradient back-propagation to train the classification branch of the model;
step S200, for a fully annotated image of the training data set, calculating the loss between the class labels and the classification prediction and the loss between the location labels and the localization prediction using the network loss function, minimizing the loss function by gradient back-propagation, and training the classification and localization branches of the model;
and step S300, for the image to be detected, performing forward computation with the convolutional neural network whose weights were trained by the above method, shifting the result of the peak class activation response branch and fusing it into the center point detection branch, and forming the final detection boxes from the enhanced detection heat map prediction, the length-width prediction and the center point offset prediction, as sketched in the illustrative code below.
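The following is a minimal, non-limiting PyTorch-style sketch of this mixed training loop, given only to illustrate how weakly and fully annotated batches could be routed to the two loss terms of steps S100 and S200. The names model, weak_loader, full_loader, weak_loss, full_loss and all hyper-parameters are assumptions introduced for illustration and are not part of the disclosure.

```python
import torch

def train_hybrid(model, weak_loader, full_loader, weak_loss, full_loss,
                 epochs=12, lr=1.25e-4):
    """Hypothetical mixed-supervision loop: weakly annotated batches drive the
    classification branches (step S100); fully annotated batches drive the
    classification and localization branches (step S200)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for (weak_imgs, labels_w), (full_imgs, labels_f, boxes) in zip(weak_loader, full_loader):
            # Step S100: class-label-only supervision on the classification branch.
            out_w = model(weak_imgs)
            loss_w = weak_loss(out_w, labels_w)
            # Step S200: class + bounding-box supervision on classification and localization.
            out_f = model(full_imgs)
            loss_f = full_loss(out_f, labels_f, boxes)
            opt.zero_grad()
            (loss_w + loss_f).backward()   # minimize by gradient back-propagation
            opt.step()
```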
Referring to fig. 2, a frame diagram of the target detection method based on hybrid supervised training of the present invention and fig. 3, a training strategy diagram of the target detection method based on hybrid supervised training of the present invention, the target detection method based on hybrid supervised training of the present invention includes the following steps in the training process:
and S100, calculating the loss of class labels and model classification prediction by using a network loss function for the weakly labeled image of the training data set, and minimizing the classification branch of the loss function training model by using a gradient back propagation method.
The weakly supervised training images and the corresponding class labels are used to train the peak class activation response branch and the classification capability of the central heat map prediction of the center point detection branch, where a weakly annotated image is an image annotated only with class labels and without bounding box labels. The classification loss function for weakly supervised training, $\mathcal{L}_{cls}^{weak}$, is:

$$\mathcal{L}_{cls}^{weak} = \mathrm{BCE}(s_{aggr}, \mathrm{label}) + \mathrm{BCE}(\mathrm{MaxPool}(\hat{Y}), \mathrm{label})$$

where $s_{aggr}$ is the peak aggregate response confidence,

$$s_{aggr} = \frac{1}{N_c} \sum_{k=1}^{N_c} M^{c}_{(i_k, j_k)}$$

where $M^{c}_{(i_k, j_k)}$ denotes the response at the $k$-th peak point of the peak class activation response map, $(i_k, j_k)$ denotes the position of the $k$-th peak point, $N_c$ denotes the number of peak points, label is the class label vector in the data set annotation, and BCE is the binary cross entropy loss function. MaxPool denotes the max pooling operation that pools the $N \times C \times H \times W$ prediction tensor $\hat{Y}$ into an $N \times C$ class prediction vector. Here, the peak aggregate response confidence is the classification confidence obtained by aggregating the predictions of the peak class activation response branch.
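A minimal sketch of this weakly supervised classification loss is given below, assuming the peak-point responses and the center heat map prediction are already available as tensors. The function and tensor names are illustrative, and the sigmoid used to map responses to probabilities is an assumption of the sketch, not a statement of the disclosed implementation.

```python
import torch
import torch.nn.functional as F

def weak_classification_loss(peak_responses, heatmap_pred, label):
    """peak_responses: (N, C, Nc) responses at the Nc peak points per class (raw scores).
    heatmap_pred:   (N, C, H, W) center heat map prediction (raw scores).
    label:          (N, C) multi-hot class label vector."""
    # Peak aggregate response confidence: average the peak-point responses per class,
    # then squash to [0, 1] (assumption) so BCE applies.
    s_aggr = torch.sigmoid(peak_responses.mean(dim=2))
    # MaxPool the N x C x H x W prediction into an N x C class prediction vector.
    cls_pred = torch.sigmoid(heatmap_pred.amax(dim=(2, 3)))
    return F.binary_cross_entropy(s_aggr, label) + F.binary_cross_entropy(cls_pred, label)
```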
In step S200, for a fully annotated image of the training data set, the loss between the class labels and the classification prediction and the loss between the location labels and the localization prediction are calculated with the network loss function, the loss function is minimized by gradient back-propagation, and the classification and localization branches of the model are trained.
Starting from the model trained in step S100, the class annotation of the fully annotated images is used to train the peak class activation response branch and the classification part of the center point detection branch, and the fully supervised data is used to train the prediction heads of the center point detection branch of the proposed model, where a fully annotated image is an image annotated with both a class label and a bounding box label. The loss function of the center point detection branch, $\mathcal{L}_{det}$, is as follows:

$$\mathcal{L}_{det} = \mathrm{FocalLoss}(\hat{Y}, Y) + \mathrm{L1}(\hat{S}, (w_i, h_i)) + \mathrm{L1}(\hat{O}, (\delta w_i, \delta h_i)) + \mathrm{L1}(\hat{P}, (\delta px_i, \delta py_i))$$

where $\hat{Y}$ denotes the center point heat map prediction, $\hat{S}$ denotes the length-width size prediction, $\hat{O}$ denotes the center point offset prediction, and $\hat{P}$ denotes the offset-to-center prediction; $Y$ denotes the heat map generated by the CenterNet algorithm from the data set annotations, $(w_i, h_i)$ denotes the length-width size generated by the CenterNet algorithm from the data set annotations, $(\delta w_i, \delta h_i)$ denotes the center point offset generated by the CenterNet algorithm from the data set annotations, and $(\delta px_i, \delta py_i)$ denotes the learning target of the offset to the center generated by the CenterNet algorithm from the data set annotations. The prediction heads are trained with the focal loss (FocalLoss) and the L1 distance loss, respectively: $\mathrm{FocalLoss}(\hat{Y}, Y)$ is the loss of the center heat map prediction, $\mathrm{L1}(\hat{P}, (\delta px_i, \delta py_i))$ is the loss of the offset-to-center prediction, $\mathrm{L1}(\hat{O}, (\delta w_i, \delta h_i))$ is the loss of the center point offset prediction, and $\mathrm{L1}(\hat{S}, (w_i, h_i))$ is the loss of the length-width size prediction. GT labels denote the data set class labels, GT boxes denote the data set bounding box labels, and $\mathrm{BCE}(s_{aggr}, \mathrm{label})$ denotes the loss of the peak aggregate response confidence prediction.

For the peak class activation response branch, the classification head is also trained with the fully supervised data; during fully supervised training the loss function of the peak class activation response branch, $\mathcal{L}_{peak}$, is:

$$\mathcal{L}_{peak} = \mathrm{BCE}(s_{aggr}, \mathrm{label})$$

where $s_{aggr}$ is the peak aggregate response confidence, label is the class label vector in the data set annotation, and BCE is the cross entropy loss function.

The overall loss function for fully supervised training, $\mathcal{L}^{full}$, is:

$$\mathcal{L}^{full} = \mathcal{L}_{det} + \mathcal{L}_{peak}$$
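A minimal sketch of this fully supervised loss is given below, assuming CenterNet-style targets have already been rendered from the ground-truth boxes. The focal loss here is a standard penalty-reduced variant, the L1 terms are written densely for brevity (in practice they would be masked to annotated center locations), and all names are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F

def center_focal_loss(pred, gt, alpha=2, gamma=4, eps=1e-6):
    """Penalty-reduced pixelwise focal loss on the center heat map (CenterNet-style)."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pred = pred.clamp(eps, 1 - eps)
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred) * pos
    neg_loss = -((1 - gt) ** gamma) * (pred ** alpha) * torch.log(1 - pred) * neg
    n_pos = pos.sum().clamp(min=1)
    return (pos_loss.sum() + neg_loss.sum()) / n_pos

def full_supervised_loss(hm_pred, wh_pred, off_pred, coff_pred,
                         hm_gt, wh_gt, off_gt, coff_gt, s_aggr, label):
    """L_full = L_det (heat map + size + center offset + offset-to-center) + L_peak (BCE)."""
    l_det = (center_focal_loss(hm_pred, hm_gt)
             + F.l1_loss(wh_pred, wh_gt)       # length-width size
             + F.l1_loss(off_pred, off_gt)     # center point offset
             + F.l1_loss(coff_pred, coff_gt))  # offset to center
    l_peak = F.binary_cross_entropy(s_aggr, label)
    return l_det + l_peak
```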
referring to fig. 4, referring to a fusion detection diagram of the target detection method based on hybrid supervised training of the present invention in fig. 4, the target detection method based on hybrid supervised training of the present invention includes the following steps in the test inference detection process:
and step S300, performing forward calculation on the detection image by using the convolutional neural network with the trained network weight through the methods of the steps S100 and S200, fusing the result of the peak type activation response branch into a central point detection branch after deviation, and forming a final detection frame by using the enhanced detection heat map prediction, the length and width prediction and the central point deviation prediction.
In the detection process, the hybrid supervised model trained in the above steps is used. The peak class activation response branch produces the class activation responses, and the center point detection branch produces the heat map prediction, the length-width size prediction, the center point offset prediction and the offset-to-center prediction. Each peak point is then shifted by the predicted offset to the center, which moves it close to an object center, and its response is fused with the response at the corresponding position of the center heat map, yielding an enhanced center point heat map. The enhanced center point heat map prediction, the length-width prediction and the center point offset prediction together form the final detection boxes.
For the peak class activation response branch, a class activation response map is constructed. The partial derivative of the class probability $y^c$ output by the last classification layer with respect to every pixel $A^k_{i,j}$ of the current layer feature map is computed:

$$\frac{\partial y^c}{\partial A^k_{i,j}}$$

where $y^c$ is the classification probability output for class $C$ and $A^k_{i,j}$ is the pixel at position $(i, j)$ on the $k$-th channel of the feature map $A$. The partial derivatives are averaged over the spatial dimensions to obtain the weight coefficient of class $C$ for each channel:

$$\alpha^c_k = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{i,j}}$$

which gives the contribution weight $\alpha^c_k$ of the features of channel $k$ to class $C$, where $Z = i \times j$ is the number of pixels. The weights and the feature maps are combined by weighted summation (a linear combination) and passed through the activation function ReLU to obtain the class activation response map:

$$M^c = \mathrm{ReLU}\left(\sum_{k} \alpha^c_k A^k\right)$$

where $M^c$ is the class activation response heat map of class $C$ and $A^k$ denotes that the operation runs over all channels $k$ of the feature map $A$.

Peak points are selected on the class activation response map as the output of the peak class activation response; a series of local maxima within a given neighborhood window are selected with a max pooling operation:

$$P^{c} = \left\{ (i_k, j_k) \;\middle|\; M^{c}_{i_k, j_k} = \max_{(i,j) \in \mathcal{N}(i_k, j_k)} M^{c}_{i,j} \right\}, \quad k = 1, \dots, N_c$$

The positions of the local maxima of the class activation response map of class $C$ are obtained with a max pooling sliding window, where $N_c$ denotes the number of local maxima for class $C$. Here the max pooling sliding window is the operation of taking the neighborhood maximum by max pooling: the neighborhood window $\mathcal{N}(i_k, j_k)$ is the square region extending a certain range $r$ up, down, left and right of the current pixel, and the maximum value inside the window is obtained by the sampling function.
For the image to be detected, the center point detection branch predicts the heat map $\hat{Y}$, the length-width size $\hat{S}$, the center point offset $\hat{O}$, and the offset to the center $\hat{P}$. Forward computation is performed with the convolutional neural network whose weights have been trained, and the result of the peak class activation response branch is shifted and then fused into the center point detection branch:

$$\tilde{Y}_{p_k + \hat{P}_{p_k}} = \hat{Y}_{p_k + \hat{P}_{p_k}} + \beta \, M^{c}_{p_k}$$

where $\tilde{Y}$ is the fusion-enhanced center point heat map, $p_k = (i_k, j_k)$ denotes the position of an output peak point, $M^{c}_{p_k}$ denotes the class activation response at the corresponding peak point position, $\hat{P}_{p_k}$ denotes the predicted offset from that point to the center point, and $\beta$ is a hyper-parameter controlling the proportion of the peak class activation response in the whole fusion process.
Finally, the points with high response in the fusion-enhanced center point heat map prediction $\tilde{Y}$ are selected to form the final detection boxes:

$$\hat{B}_i = \left(\hat{x}_i + \delta \hat{x}_i - \frac{\hat{w}_i}{2},\;\; \hat{y}_i + \delta \hat{y}_i - \frac{\hat{h}_i}{2},\;\; \hat{x}_i + \delta \hat{x}_i + \frac{\hat{w}_i}{2},\;\; \hat{y}_i + \delta \hat{y}_i + \frac{\hat{h}_i}{2}\right)$$

where $\hat{w}_i$ denotes the width of the prediction bounding box centered at the $i$-th position, $\hat{h}_i$ denotes the height of the prediction bounding box centered at the $i$-th position, $\hat{x}_i$ denotes the abscissa of the prediction bounding box centered at the $i$-th position, $\hat{y}_i$ denotes the ordinate of the prediction bounding box centered at the $i$-th position, and $(\delta \hat{x}_i, \delta \hat{y}_i)$ denote the predicted offsets of the x and y coordinates, respectively.
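A minimal sketch of decoding the final boxes from the enhanced heat map, following the CenterNet-style formula above, is given below. The top-k selection, the score threshold, the 3x3 peak suppression and the tensor names are assumptions of the sketch, and the boxes are returned in feature-map (output-stride) coordinates.

```python
import torch
import torch.nn.functional as F

def decode_boxes(fused_hm, wh, offset, k=100, score_thresh=0.3):
    """fused_hm: (C, H, W) fusion-enhanced center heat map Y_tilde.
    wh:       (2, H, W) length-width prediction.  offset: (2, H, W) center point offset.
    Returns a list of (x1, y1, x2, y2, score, class) tuples."""
    C, H, W = fused_hm.shape
    # Keep only local peaks of the heat map (3x3 suppression via max pooling).
    pooled = F.max_pool2d(fused_hm[None], 3, stride=1, padding=1)[0]
    hm = fused_hm * (fused_hm == pooled).float()
    scores, inds = hm.view(-1).topk(k)
    boxes = []
    for s, ind in zip(scores, inds):
        if s < score_thresh:
            continue
        c, rem = divmod(int(ind), H * W)
        y, x = divmod(rem, W)
        dx, dy = offset[0, y, x], offset[1, y, x]     # center point offset
        w, h = wh[0, y, x], wh[1, y, x]               # length-width size
        cx, cy = x + dx, y + dy
        boxes.append((float(cx - w / 2), float(cy - h / 2),
                      float(cx + w / 2), float(cy + h / 2), float(s), c))
    return boxes
```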
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other embodiments in which any combination of the above-mentioned features or their equivalents is made without departing from the inventive concept defined above. For example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure are also encompassed.

Claims (4)

1. A target detection method based on hybrid supervised training, comprising the following steps:
step S100, for a weakly annotated image of the training data set, calculating the loss between the class labels and the model classification prediction using the network loss function, and minimizing the loss function by gradient back-propagation to train the classification branch of the model;
step S200, for a fully annotated image of the training data set, calculating the loss between the class labels and the classification prediction and the loss between the location labels and the localization prediction using the network loss function, minimizing the loss function by gradient back-propagation, and training the classification and localization branches of the model;
and step S300, for the image to be detected, performing forward computation with the convolutional neural network whose weights were trained by the above method, shifting the result of the peak class activation response branch and fusing it into the center point detection branch, and forming the final detection boxes from the enhanced detection heat map prediction, the length-width prediction and the center point offset prediction.
2. The method of claim 1, wherein, for a weakly annotated image of the training data set, calculating the loss between the class labels and the model classification prediction using the network loss function and minimizing the loss function by gradient back-propagation to train the classification branch of the model comprises:

for a weakly annotated image of the training data set, using the class annotation to train the peak class activation response branch and the central heat map prediction of the center point detection branch;

constructing a class activation response map by computing the partial derivative of the class probability $y^c$ output by the last classification layer with respect to every pixel $A^k_{i,j}$ of the current layer feature map,

$$\frac{\partial y^c}{\partial A^k_{i,j}}$$

where $y^c$ is the classification probability output for class $C$ and $A^k_{i,j}$ is the pixel at position $(i, j)$ on the $k$-th channel of the feature map $A$;

averaging the partial derivatives of each pixel over the spatial dimensions to obtain the weight coefficient of class $C$ for each channel:

$$\alpha^c_k = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A^k_{i,j}}$$

obtaining the weight coefficient $\alpha^c_k$ of the features of channel $k$ for class $C$, where $Z = i \times j$ is the number of pixels;

combining the weight coefficients and the feature maps by weighted summation (a linear combination) and applying the activation function ReLU to obtain the class activation response map:

$$M^c = \mathrm{ReLU}\left(\sum_{k} \alpha^c_k A^k\right)$$

where $M^c$ is the class activation response heat map of class $C$ and $A^k$ denotes that the operation runs over all channels $k$ of the feature map $A$;

selecting peak points on the class activation response map as the output of the peak class activation response, a series of local maxima within a given neighborhood window being selected with a max pooling operation:

$$P^{c} = \left\{ (i_k, j_k) \;\middle|\; M^{c}_{i_k, j_k} = \max_{(i,j) \in \mathcal{N}(i_k, j_k)} M^{c}_{i,j} \right\}, \quad k = 1, \dots, N_c$$

where $N_c$ denotes the number of local maxima for class $C$;

calculating the loss function from the output of the peak class activation response and the data set class labels, the peak aggregate response confidence being

$$s_{aggr} = \frac{1}{N_c} \sum_{k=1}^{N_c} M^{c}_{(i_k, j_k)}$$

where $M^{c}_{(i_k, j_k)}$ denotes the response at the $k$-th peak point of the peak class activation response map, $(i_k, j_k)$ denotes the position of the $k$-th peak point, and $N_c$ denotes the number of peak points; and computing the classification loss function from the aggregated response confidence and the data set class annotation, and from the heat map prediction $\hat{Y}$ of the center point detection branch after max pooling:

$$\mathcal{L}_{cls}^{weak} = \mathrm{BCE}(s_{aggr}, \mathrm{label}) + \mathrm{BCE}(\mathrm{MaxPool}(\hat{Y}), \mathrm{label})$$

where BCE is the cross entropy loss function, $s_{aggr}$ is the peak aggregate response confidence, label is the class label vector in the data set annotation, and MaxPool denotes the max pooling operation.
3. The method of claim 1, wherein, for a fully annotated image of the training data set, calculating the loss between the class labels and the classification prediction and the loss between the location labels and the localization prediction using the network loss function, minimizing the loss function by gradient back-propagation, and training the classification and localization branches of the model comprises:

for a fully annotated image of the training data set, using the class annotation to train the peak class activation response branch and the classification part of the center point detection branch, and using the fully supervised data to train the prediction heads of the center point detection branch of the proposed model, the loss function of the center point detection branch, $\mathcal{L}_{det}$, being:

$$\mathcal{L}_{det} = \mathrm{FocalLoss}(\hat{Y}, Y) + \mathrm{L1}(\hat{S}, (w_i, h_i)) + \mathrm{L1}(\hat{O}, (\delta w_i, \delta h_i)) + \mathrm{L1}(\hat{P}, (\delta px_i, \delta py_i))$$

where $\hat{Y}$ denotes the center point heat map prediction, $\hat{S}$ denotes the length-width size prediction, $\hat{O}$ denotes the center point offset prediction, $\hat{P}$ denotes the offset-to-center prediction, and $Y$, $(w_i, h_i)$, $(\delta w_i, \delta h_i)$, $(\delta px_i, \delta py_i)$ respectively denote the learning targets of the heat map, the length-width size, the center point offset and the offset to the center generated by the CenterNet algorithm from the data set annotations; FocalLoss is the focal loss function and L1 is the L1 distance loss used for training; and the loss function of the peak class activation response branch, $\mathcal{L}_{peak}$, being:

$$\mathcal{L}_{peak} = \mathrm{BCE}(s_{aggr}, \mathrm{label})$$

where BCE is the cross entropy loss function, $s_{aggr}$ denotes the peak aggregate response confidence, and label is the class label vector in the data set annotation.
4. The method according to claim 1, wherein, for the image to be detected, performing forward computation with the convolutional neural network whose weights were trained by the above method, shifting the result of the peak class activation response branch and fusing it into the center point detection branch, and forming the final detection boxes from the enhanced detection heat map prediction, the length-width prediction and the center point offset prediction comprises:

for the image to be detected, performing forward computation with the convolutional neural network whose weights have been trained, and fusing the shifted result of the peak class activation response branch into the center point detection branch:

$$\tilde{Y}_{p_k + \hat{P}_{p_k}} = \hat{Y}_{p_k + \hat{P}_{p_k}} + \beta \, M^{c}_{p_k}$$

where $\hat{Y}$ denotes the center point heat map prediction, $\tilde{Y}$ denotes the fusion-enhanced center point heat map prediction, $p_k = (i_k, j_k)$ denotes the position of an output peak point, $M^{c}_{p_k}$ denotes the class activation response at the corresponding peak point position, $\hat{P}_{p_k}$ denotes the predicted offset from that point to the center point, and $\beta$ is a hyper-parameter controlling the proportion of the peak class activation response in the whole fusion process; and

forming the target detection result from the enhanced detection heat map prediction, the length-width prediction and the center point offset prediction.
CN202111355318.8A 2021-11-16 2021-11-16 Target detection method based on hybrid supervised training Pending CN114154563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111355318.8A CN114154563A (en) 2021-11-16 2021-11-16 Target detection method based on hybrid supervised training

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111355318.8A CN114154563A (en) 2021-11-16 2021-11-16 Target detection method based on hybrid supervised training

Publications (1)

Publication Number Publication Date
CN114154563A true CN114154563A (en) 2022-03-08

Family

ID=80456492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111355318.8A Pending CN114154563A (en) 2021-11-16 2021-11-16 Target detection method based on hybrid supervised training

Country Status (1)

Country Link
CN (1) CN114154563A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972118A (en) * 2022-06-30 2022-08-30 抖音视界(北京)有限公司 Noise reduction method and device for inspection image, readable medium and electronic equipment
CN116503618A (en) * 2023-04-25 2023-07-28 东北石油大学三亚海洋油气研究院 Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation
CN116503618B (en) * 2023-04-25 2024-02-02 东北石油大学三亚海洋油气研究院 Method and device for detecting remarkable target based on multi-mode and multi-stage feature aggregation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination