CN112926673A - Semi-supervised target detection method based on consistency constraint - Google Patents

Semi-supervised target detection method based on consistency constraint

Info

Publication number
CN112926673A
CN112926673A
Authority
CN
China
Prior art keywords
image
images
reconstructed
training
target detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110286708.8A
Other languages
Chinese (zh)
Other versions
CN112926673B (en)
Inventor
王好谦
王颢涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202110286708.8A priority Critical patent/CN112926673B/en
Publication of CN112926673A publication Critical patent/CN112926673A/en
Application granted granted Critical
Publication of CN112926673B publication Critical patent/CN112926673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Multimedia (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

A semi-supervised target detection method based on consistency constraint comprises the following steps: performing data enhancement on the training set to obtain a reconstructed training set; constructing an arbitrary deep-learning-based target detection model; during training, in each batch of each training epoch, inputting the images sampled from the training set together with the corresponding reconstructed images from the reconstructed training set into the model network, calculating the error between the prediction on each original image and its ground-truth label, calculating the consistency error between the original image and its reconstructed image, and taking the weighted sum of the two errors as the total training error; updating the parameters by batch gradient descent; and carrying out target detection on an input image with the trained network to obtain the positions and categories of the targets in the image. Compared with traditional fully supervised target detection models, the method and device of the present application can achieve comparable performance with fewer manual labels, or better performance with the same number of labels and additional unlabeled images.

Description

Semi-supervised target detection method based on consistency constraint
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a semi-supervised target detection method based on consistency constraint.
Background
Target detection (Object Detection) is one of the most important and challenging problems in computer vision. Given an input image of arbitrary size, a target detection model outputs the locations and classes of one or more objects belonging to predefined categories in the image. Target detection has a wide range of application scenarios, such as autonomous driving, industrial production, video surveillance, medical image processing and satellite image processing. It has therefore long been an active research topic in both academia and industry.
Currently, most mainstream target detection models are based on deep neural networks and adopt a fully supervised training paradigm. Under fully supervised learning, every training image must carry accurate and complete annotations. Studies indicate that accurately annotating a single object takes about 10 seconds, and an image often contains multiple objects. Since training a deep neural network requires large amounts of data, annotating the training images consumes considerable time and labor. Meanwhile, unlabeled data is abundant in many application scenarios, but existing fully supervised methods cannot exploit it effectively. Using unlabeled training data therefore helps reduce the dependence of deep neural networks on manual annotation, and allows models to take full advantage of unlabeled data, which is more plentiful and more widely available.
Semi-supervised learning is a paradigm that obtains strong supervision signals from labels while also mining useful information from unlabeled training data. However, existing semi-supervised learning focuses mainly on classification tasks; it remains insufficiently explored for target detection, where labeling is more expensive and learning is more difficult. Introducing semi-supervised learning into the target detection task therefore has strong academic value and application prospects.
Existing semi-supervised target detection methods have much in common with semi-supervised classification methods, and the mainstream ones adopt a self-training scheme. In self-training, an initial model is first trained on the labeled images in a fully supervised manner; the model then processes the unlabeled images, and high-confidence results are used as pseudo labels for them; this process is iterated until a stopping condition is met. However, such methods require lengthy training and are overly sensitive to the hyper-parameters used for pseudo-label screening.
Another common and effective semi-supervised classification approach is based on consistency constraints: when the input image is slightly perturbed, the output should remain consistent. Since the output of a classification problem is only a fixed-dimension class vector, which is robust to the pixel positions and color distribution of the input image, it is simple and natural to perturb the input, for example by mirror flipping, cropping or color jittering. For target detection, however, the output is highly correlated with the pixel positions of the input image, so designing a suitable perturbation of the input from which the detection task can learn consistency is very challenging.
It is to be noted that the information disclosed in the above background section is only for understanding the background of the present application and thus may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The invention mainly aims to provide a semi-supervised target detection method based on consistency constraint, so as to solve the problem described in the background that model training is highly dependent on manual labels.
In order to achieve the purpose, the invention adopts the following technical scheme:
a semi-supervised target detection method based on consistency constraint comprises the following steps:
first step, data enhancement: performing data enhancement on the training set to obtain a reconstructed training set;
secondly, model initialization: constructing any target detection model based on deep learning;
thirdly, model training: in each training batch of each training epoch, simultaneously inputting the images sampled from the training set and the corresponding reconstructed images from the reconstructed training set into the model network; calculating the error between the prediction on each original image and its ground-truth label, where the corresponding loss term is 0 if the original image is unlabeled; calculating the consistency error between the original image and the reconstructed image; taking the weighted sum of the two errors as the total training error, where the corresponding term in the weighted sum is 0 if the original image is unlabeled; and then updating the model parameters by batch gradient descent;
the fourth step, target detection: carrying out target detection on an input image with the trained network model to obtain the positions and categories of the targets in the image.
Further:
the first step comprises: for each image a in the training set, it is cropped into a plurality of sub-images and rearranged spatially, generating a reconstructed image b of the same size as image a.
In the first step, each image in the training set is cut along its horizontal center line and vertical center line to obtain four sub-images, upper left, upper right, lower right and lower left, denoted A, B, C and D; A is then horizontally translated to the position of B, B is vertically translated to the position of C, C is horizontally translated to the position of D, and D is vertically translated to the position of A, yielding the reconstructed image.
The target detection model is Faster R-CNN, YOLO, SSD, CenterNet or CornerNet; the Faster R-CNN comprises a backbone network (Backbone), a feature pyramid network (Feature Pyramid Network), a region proposal network (RPN) and a detection head network (Head Network).
In the third step, the labeled original images and the unlabeled images are mixed and shuffled, and then each image and its corresponding reconstructed image are sequentially input into the network to calculate the loss function; the loss function includes two parts: the error between the outputs of the labeled images and the corresponding labels, and the consistency loss between the results of all images and their corresponding reconstructed images.
The third step includes:
inputting the original images in the training set and the corresponding reconstructed images into the network in sequence, and predicting the bounding box set B of the original image and the bounding box set B' of the reconstructed image respectively; constructing a loss function comprising two parts: one part is the error between the bounding boxes output for a labeled original image and the ground-truth labels, where the position error uses the smooth L1 loss function and the class error uses the cross-entropy loss function, and this part of the loss is 0 if the original image has no ground-truth labels; the other part is the consistency loss between the original image and the reconstructed image, wherein the bounding box set B of the original image is transformed according to the reconstruction to obtain the bounding box set B2 on the theoretically reconstructed image, B2 is taken as the ground-truth label for the prediction B', and the error between B' and B2 is defined using the DIoU loss function as
L_DIoU = 1 - IoU + d²/c²
Wherein IoU represents the ratio of the area of intersection of the two bounding boxes to the area of union, d represents the Euclidean distance between the center points of the two bounding boxes, and c represents the diagonal distance of the minimum closure region containing both bounding boxes.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method.
The beneficial effect of this application and prior art contrast includes:
compared with other traditional full-supervision target detection models, the semi-supervision target detection method based on consistency constraint can achieve equivalent performance by using fewer manual labels or achieve better performance by using the same number of labels and more label-free images. The method and the device design a consistency constraint, so that a prediction result of a reconstructed image and a prediction result of an original image meet a certain geometric relationship in space, a model is ensured to obtain a supervision signal under the condition that a training image is not labeled, and useful knowledge is learned. By adding a geometric constraint to the unlabeled image, the model can effectively learn useful knowledge from the unlabeled image, thereby reducing the number of labels required for model training.
Drawings
FIG. 1 is a simplified flowchart of a semi-supervised target detection method based on consistency constraints according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of a method for splitting and reconstructing a training image in a semi-supervised target detection method based on consistency constraint according to an embodiment of the present invention, where the left side in fig. 2 is an original image, and the right side is a reconstructed image.
Fig. 3 is a schematic diagram of a theoretical change of a bounding box after a training image is split and reconstructed in the semi-supervised target detection method based on the consistency constraint according to an embodiment of the present invention, where the left side in fig. 3 is an original image prediction result B, and the right side is a theoretically reconstructed image prediction result B2.
Fig. 4 is a schematic diagram of a DIoU in a consistency constraint-based loss function in a consistency constraint-based semi-supervised target detection method according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of the training process when the Faster R-CNN target detection model is used in the semi-supervised target detection method based on consistency constraint according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described in detail below. It should be emphasized that the following description is merely exemplary in nature and is not intended to limit the scope of the invention or its application.
Referring to fig. 1, an embodiment of the present invention provides a semi-supervised target detection method based on consistency constraint, including the following steps:
first step, data enhancement: performing data enhancement on the training set to obtain a reconstructed training set;
secondly, model initialization: constructing any target detection model based on deep learning;
thirdly, model training: in each training batch of each training epoch, simultaneously inputting the images sampled from the training set and the corresponding reconstructed images from the reconstructed training set into the model network; calculating the error between the prediction on each original image and its ground-truth label, where the corresponding loss term is 0 if the original image is unlabeled; calculating the consistency error between the original image and the reconstructed image; taking the weighted sum of the two errors as the total training error, where the corresponding term in the weighted sum is 0 if the original image is unlabeled; and then updating the model parameters by batch gradient descent;
the fourth step, target detection: carrying out target detection on an input image with the trained network model to obtain the positions and categories of the targets in the image.
In a preferred embodiment, the first step comprises: for each image a in the training set, it is cropped into a plurality of sub-images and rearranged spatially, generating a reconstructed image b of the same size as image a.
In a preferred embodiment, in the first step, each image in the training set is cropped along its horizontal center line and vertical center line to obtain four sub-images, upper left, upper right, lower right and lower left, denoted A, B, C and D; A is then horizontally translated to the position of B, B is vertically translated to the position of C, C is horizontally translated to the position of D, and D is vertically translated to the position of A, so as to obtain the reconstructed image.
The target detection model can be Faster R-CNN, YOLO, SSD, CenterNet or CornerNet, among others.
In a preferred embodiment, the Faster R-CNN includes a backbone network (Backbone), a feature pyramid network (Feature Pyramid Network), a region proposal network (RPN) and a detection head network (Head Network).
In a preferred embodiment, in the third step, the labeled image and the unlabeled image are mixed and scrambled, and then the image and the corresponding reconstructed image are sequentially input into a network to calculate a loss function; the loss function includes two parts, one is the error between the output of the labeled image and the corresponding label, and the other is the loss of consistency of the results between all images and the corresponding reconstructed image.
Specific embodiments of the present invention are further described below.
A semi-supervised target detection method based on consistency constraint comprises the following steps:
In the first step, each image a in the data set is cropped into N sub-images which are rearranged spatially, generating a reconstructed image b of the same size as a. In the second step, an arbitrary deep-learning-based target detection model (e.g., Faster R-CNN) is constructed; Faster R-CNN is composed of a backbone network (Backbone), a feature pyramid network (Feature Pyramid Network), a region proposal network (RPN) and a detection head network (Head Network). In the third step, the model is trained: the labeled and unlabeled images are first mixed and shuffled, then each image and its corresponding reconstructed image are sequentially input into the network and the loss function is computed. The loss function consists of two parts: the error between the outputs of the labeled images and the corresponding labels, and the consistency between the results of all images and their corresponding reconstructed images. In the fourth step, target detection is carried out on the input image with the trained network to obtain the positions and categories of the targets in the image.
The first step specifically comprises: each image in the training set is cut along its horizontal center line and vertical center line to obtain four sub-images, upper left, upper right, lower right and lower left, denoted A, B, C and D, as shown in the left part of FIG. 2. A is then horizontally translated to the position of B, B is vertically translated to the position of C, C is horizontally translated to the position of D, and D is vertically translated to the position of A, yielding a new reconstructed image, as shown in the right part of FIG. 2.
The target detection model used in the second step is not limited to Faster R-CNN; it can be any deep-learning-based target detection model, such as YOLO, SSD, CenterNet, CornerNet, and the like. The method places no special requirement on the structure of the target detection model; it only requires that the model output bounding box positions and categories.
The third step specifically comprises: the original images in the training set and their corresponding reconstructed images are sequentially input into the network, and the bounding box set B of the original image and the bounding box set B' of the reconstructed image are predicted respectively. A loss function is then constructed, consisting of two parts. One part is the error between the bounding boxes output for a labeled original image and the ground-truth labels, where the position error uses the smooth L1 loss function and the class error uses the cross-entropy loss function; this part is 0 if the original image has no ground-truth labels. The other part is the consistency loss between the original image and the reconstructed image: according to the image reconstruction method described above, the bounding box set B of the original image is transformed to obtain the bounding box set B2 on the theoretically reconstructed image (B2 is obtained directly from B by the reconstruction transform, not by feeding the reconstructed image into the network), as shown in FIG. 3. B2 is then taken as the ground-truth label for the prediction B', and the error between B' and B2 is defined using the DIoU loss function as
L_DIoU = 1 - IoU + d²/c²
Where IoU represents the ratio of the area of intersection of the two bounding boxes to the area of their union, d represents the Euclidean distance between the center points of the two bounding boxes, and c represents the diagonal distance of the minimum closure region containing both bounding boxes, as shown in FIG. 4. In the method, the original image is input into the network to obtain B, the reconstructed image is input into the network to obtain B', and B2 is what B' should equal under ideal conditions, obtained from B by the reconstruction transform. The consistency loss is therefore the loss between B2 and B'. The detailed flow of the third step is shown in FIG. 5.
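By way of illustration only, the following Python sketch outlines one possible form of this training step; the model interface and the helpers detection_loss, reconstruct_boxes and diou_consistency_loss are hypothetical placeholders, not code from the patent.

```python
import torch

def semi_supervised_step(model, optimizer, batch, w_t):
    """One illustrative semi-supervised training step (sketch, not the patented code).

    batch: iterable of (image, recon_image, target) triples; target is None
    for unlabeled images. detection_loss, reconstruct_boxes and
    diou_consistency_loss are hypothetical helpers.
    """
    total_loss = torch.zeros(())
    for image, recon_image, target in batch:
        pred = model(image)              # boxes + classes on the original image
        pred_recon = model(recon_image)  # boxes + classes on the reconstructed image

        # Supervised detection loss: contributes only for labeled images.
        l_det = detection_loss(pred, target) if target is not None else 0.0

        # Consistency loss: transform the original-image predictions with the
        # same split-and-reconstruction applied to the image, then compare them
        # with the reconstructed-image predictions using a DIoU-based loss.
        pred_transformed = reconstruct_boxes(pred)
        l_con = diou_consistency_loss(pred_transformed, pred_recon)

        total_loss = total_loss + l_det + w_t * l_con

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

Here w_t stands for the consistency-loss weight w(t) described below.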
As described in further detail below.
Data enhancement:
and cutting each image in the training set along a horizontal center line and a vertical center line to obtain four sub-images, namely, upper left sub-image, upper right sub-image, lower right sub-image and lower left sub-image, which are marked as A, B, C and D, as shown in the left image of FIG. 2. Then, a is horizontally translated to the position of B, B is vertically translated to the position of C, C is horizontally translated to the position of D, and D is vertically translated to the position of a, so as to obtain a new reconstructed image, as shown in the right diagram of fig. 2.
Model training:
a batch gradient descent method is used. And if the Batch Size (Batch Size) is N (N is an even number), sampling N/2 images from the training set, and finding out a reconstructed image corresponding to the N/2 images from the reconstructed training set. The N images are sequentially input into a model network (e.g., Faster R-CNN). Let the images in the training set and the reconstruction training set be I respectivelyi,I′i(i is 1,2 … … N/2), and the output result is B after passing through the model networki,B′i. Wherein the content of the first and second substances,
Bi={(bi,ci),i=1,2......X}
B′i={(b′i,c′i),i=1,2......Y}
bi,cirespectively representing a four-dimensional bounding box vector (x, y, w, h) and a category vector. X, Y represent the number of bounding boxes that the original image and the reconstructed image are ultimately predicted to (after NMS if the detection model involves an NMS procedure).
The loss function is defined as Loss = L_det + w(t)*L_con, where

L_det = L_smoothL1(B_i, B*) + L_CE(C_i, C*)
L_con = L_DIoU(B_2i, B'_i)

L_smoothL1 denotes the Smooth L1 loss function, L_CE the cross-entropy loss function, and L_DIoU the DIoU loss function. B* and C* denote the ground-truth bounding boxes and class labels respectively, and B_2i denotes the new bounding box positions obtained by applying the split-and-reconstruction shown in FIG. 2 to B_i; the generation process is shown in the following table:
Region of the original image containing the box | Translation applied to obtain the corresponding box in B_2i
A (upper left) | shifted right by W/2
B (upper right) | shifted down by H/2
C (lower right) | shifted left by W/2
D (lower left) | shifted up by H/2

where W and H denote the width and height of the image.
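A minimal sketch of this bounding-box mapping is shown below, assuming boxes are given as (x, y, w, h) with (x, y) the box center and that each box is assigned to a quadrant by its center point; handling of boxes straddling a cut line is omitted, and the names and interface are illustrative assumptions.

```python
def reconstruct_boxes(boxes, img_w, img_h):
    """Map boxes predicted on the original image to their theoretical positions
    on the reconstructed image (B_i -> B_2i). Each box is (x, y, w, h) with
    (x, y) the box center; quadrant assignment is by the center point."""
    half_w, half_h = img_w / 2.0, img_h / 2.0
    out = []
    for (x, y, w, h) in boxes:
        if x < half_w and y < half_h:        # region A (upper left): shift right
            x += half_w
        elif x >= half_w and y < half_h:     # region B (upper right): shift down
            y += half_h
        elif x >= half_w and y >= half_h:    # region C (lower right): shift left
            x -= half_w
        else:                                # region D (lower left): shift up
            y -= half_h
        out.append((x, y, w, h))
    return out
```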
b is to be2iIs regarded as B'iUsing the DIoU loss function:
L_DIoU = 1 - IoU + d²/c²
where IoU represents the ratio of the area of intersection of the two bounding boxes to the area of union, d represents the Euclidean distance between the center points of the two bounding boxes, and c represents the diagonal distance of the minimum closure region containing both bounding boxes, as shown in FIG. 4. Fig. 4 is a schematic diagram of a DIoU, with two dashed black lines representing bounding boxes in the prediction and truth labels, and the outermost dotted line representing the minimum closure containing both.
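For illustration, a generic PyTorch-style sketch of this DIoU loss between boxes in (x, y, w, h) center format is given below; it follows the published DIoU formulation and is not code taken from the patent.

```python
import torch

def diou_loss(box_p, box_g):
    """DIoU loss between predicted and target boxes, each a tensor (..., 4)
    in (cx, cy, w, h) format: L = 1 - IoU + d^2 / c^2."""
    # Corner coordinates of both boxes.
    p_x1, p_y1 = box_p[..., 0] - box_p[..., 2] / 2, box_p[..., 1] - box_p[..., 3] / 2
    p_x2, p_y2 = box_p[..., 0] + box_p[..., 2] / 2, box_p[..., 1] + box_p[..., 3] / 2
    g_x1, g_y1 = box_g[..., 0] - box_g[..., 2] / 2, box_g[..., 1] - box_g[..., 3] / 2
    g_x2, g_y2 = box_g[..., 0] + box_g[..., 2] / 2, box_g[..., 1] + box_g[..., 3] / 2

    # Intersection and union areas -> IoU.
    inter_w = (torch.min(p_x2, g_x2) - torch.max(p_x1, g_x1)).clamp(min=0)
    inter_h = (torch.min(p_y2, g_y2) - torch.max(p_y1, g_y1)).clamp(min=0)
    inter = inter_w * inter_h
    union = box_p[..., 2] * box_p[..., 3] + box_g[..., 2] * box_g[..., 3] - inter
    iou = inter / union.clamp(min=1e-7)

    # d^2: squared distance between the two box centers.
    d2 = (box_p[..., 0] - box_g[..., 0]) ** 2 + (box_p[..., 1] - box_g[..., 1]) ** 2

    # c^2: squared diagonal of the smallest region enclosing both boxes.
    c_w = torch.max(p_x2, g_x2) - torch.min(p_x1, g_x1)
    c_h = torch.max(p_y2, g_y2) - torch.min(p_y1, g_y1)
    c2 = (c_w ** 2 + c_h ** 2).clamp(min=1e-7)

    return 1.0 - iou + d2 / c2
```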
w(t) represents the weight of the consistency loss. At the start of training, the detection capability of the network model is poor and the quality of its detection results is low; in this case a low consistency-loss weight is needed to prevent the network from learning too much erroneous information. As training proceeds, the quality of the model's detection results improves, so the consistency between results carries more correct information, and a higher consistency-loss weight can be adopted. In the present application, w(t) takes the value 0 in the initial stage of training, then increases linearly, reaches 1 in the final stage of training, and remains unchanged until training ends.
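One possible realization of this ramp-up schedule is sketched below; the breakpoints (first and last thirds of the epochs) are illustrative assumptions, not values specified by the patent.

```python
def consistency_weight(epoch, total_epochs):
    """Ramp-up schedule for the consistency-loss weight w(t):
    0 during an initial stage, a linear increase in between, and 1 for the
    rest of training (breakpoints here are illustrative assumptions)."""
    ramp_start = total_epochs / 3.0
    ramp_end = 2.0 * total_epochs / 3.0
    if epoch < ramp_start:
        return 0.0
    if epoch >= ramp_end:
        return 1.0
    return (epoch - ramp_start) / (ramp_end - ramp_start)
```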
Target detection:
and inputting the test set image to be detected into the trained semi-supervised neural network model based on the consistency constraint, so as to obtain the position and the category of the object boundary box in the image.
The background of the present invention may contain background information related to the problem or environment of the present invention and does not necessarily describe the prior art. Accordingly, the inclusion in the background section is not an admission of prior art by the applicant.
The foregoing is a more detailed description of the invention in connection with specific/preferred embodiments and is not intended to limit the practice of the invention to those descriptions. It will be apparent to those skilled in the art that various substitutions and modifications can be made to the described embodiments without departing from the spirit of the invention, and these substitutions and modifications should be considered to fall within the scope of the invention. In the description herein, references to the description of the term "one embodiment," "some embodiments," "preferred embodiments," "an example," "a specific example," or "some examples" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction. Although embodiments of the present invention and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the claims.

Claims (7)

1. A semi-supervised target detection method based on consistency constraint is characterized by comprising the following steps:
first step, data enhancement: performing data enhancement on the training set to obtain a reconstructed training set;
secondly, model initialization: constructing any target detection model based on deep learning;
thirdly, model training: in each training cycle, in each training batch, simultaneously inputting the images sampled in the training set and the corresponding reconstructed images in the reconstructed training set into a model network, and calculating the error between the prediction result of the original images and the true value label of the original images, wherein if the original images are not labeled, the corresponding loss function value is 0; calculating a consistency error between the original image and the reconstructed image, and performing weighted summation on the two errors to serve as a total error of model training, wherein if the original image is not labeled, the value of a corresponding item in the weighted summation process is 0; then updating the model parameters by using a batch gradient descent method;
the fourth step: target detection: and carrying out target detection on the input image by using the trained network model to obtain the position and the category of the target in the input image.
2. The semi-supervised target detection method of claim 1, wherein the first step comprises: for each image a in the training set, it is cropped into a plurality of sub-images and rearranged spatially, generating a reconstructed image b of the same size as image a.
3. The semi-supervised object detection method of claim 2, wherein in the first step, each image in the training set is cropped along a horizontal center line and a vertical center line to obtain four sub-images, namely, upper left sub-image, upper right sub-image, lower right sub-image and lower left sub-image, which are marked as A, B, C and D, then, A is horizontally translated to the position of B, B is vertically translated to the position of C, C is horizontally translated to the position of D, and D is vertically translated to the position of A, so that a reconstructed image is obtained.
4. The semi-supervised object detection method as recited in any one of claims 1 to 3, wherein the object detection model is Faster R-CNN, YOLO, SSD, CenterNet, or CornerNet; the Faster R-CNN comprises a backbone network (Backbone), a feature pyramid network (Feature Pyramid Network), a region proposal network (RPN) and a detection head network (Head Network).
5. The semi-supervised object detection method as recited in any one of claims 1 to 4, wherein in the third step, the labeled image and the unlabeled image are mixed and scrambled, and then the images and the corresponding reconstructed images are sequentially input into a network to calculate a loss function; the loss function includes two parts, one is the error between the bounding box of the labeled original image output and the corresponding label, and the other is the loss of consistency of the results between all images and the corresponding reconstructed images.
6. The semi-supervised object detection method of claim 5, wherein the third step comprises:
inputting the original images in the training set and the corresponding reconstructed images into a network in sequence, and respectively predicting the bounding box set B of the original image and the bounding box set B' of the reconstructed image; constructing a loss function comprising two parts: one part is the error between the bounding boxes output for a labeled original image and the ground-truth labels, wherein the position error uses the smooth L1 loss function and the class error uses the cross-entropy loss function, and this part of the loss is 0 if the original image has no ground-truth labels; the other part is the consistency loss between the original image and the reconstructed image, wherein the bounding box set B of the original image is transformed according to the reconstruction to obtain the bounding box set B2 on the theoretically reconstructed image, B2 is taken as the ground-truth label for the prediction B', and the error between B' and B2 is defined using the DIoU loss function as
L_DIoU = 1 - IoU + d²/c²
Wherein IoU represents the ratio of the area of intersection of the two bounding boxes to the area of union, d represents the Euclidean distance between the center points of the two bounding boxes, and c represents the diagonal distance of the minimum closure region containing both bounding boxes.
7. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 6.
CN202110286708.8A 2021-03-17 2021-03-17 Semi-supervised target detection method based on consistency constraint Active CN112926673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110286708.8A CN112926673B (en) 2021-03-17 2021-03-17 Semi-supervised target detection method based on consistency constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110286708.8A CN112926673B (en) 2021-03-17 2021-03-17 Semi-supervised target detection method based on consistency constraint

Publications (2)

Publication Number Publication Date
CN112926673A true CN112926673A (en) 2021-06-08
CN112926673B CN112926673B (en) 2023-01-17

Family

ID=76175866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110286708.8A Active CN112926673B (en) 2021-03-17 2021-03-17 Semi-supervised target detection method based on consistency constraint

Country Status (1)

Country Link
CN (1) CN112926673B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109753992A (en) * 2018-12-10 2019-05-14 南京师范大学 The unsupervised domain for generating confrontation network based on condition adapts to image classification method
US20200405242A1 (en) * 2019-06-27 2020-12-31 Retrace Labs System And Methods For Restorative Dentistry Treatment Planning Using Adversarial Learning
CN112115916A (en) * 2020-09-29 2020-12-22 西安电子科技大学 Domain-adaptive fast R-CNN semi-supervised SAR detection method
CN112395987A (en) * 2020-11-18 2021-02-23 西安电子科技大学 SAR image target detection method based on unsupervised domain adaptive CNN

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JISOO JEONG ET AL: "Consistency-based Semi-supervised Learning for Object Detection", 《33RD CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NEURIPS 2019)》 *
XU ZHIWEI ET AL: ""Semi-supervised self-growing generative adversarial networks for image recognition"", 《MULTIMEDIA TOOLS AND APPLICATIONS》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113454649A (en) * 2021-06-17 2021-09-28 商汤国际私人有限公司 Target detection method, target detection device, electronic equipment and computer-readable storage medium
CN113454649B (en) * 2021-06-17 2024-05-24 商汤国际私人有限公司 Target detection method, apparatus, electronic device, and computer-readable storage medium
CN115514686A (en) * 2021-06-23 2022-12-23 深信服科技股份有限公司 Flow acquisition method and device, electronic equipment and storage medium
CN113627479A (en) * 2021-07-09 2021-11-09 中国科学院信息工程研究所 Graph data anomaly detection method based on semi-supervised learning
CN113627479B (en) * 2021-07-09 2024-02-20 中国科学院信息工程研究所 Graph data anomaly detection method based on semi-supervised learning
CN113780389A (en) * 2021-08-31 2021-12-10 中国人民解放军战略支援部队信息工程大学 Deep learning semi-supervised dense matching method and system based on consistency constraint
CN113780389B (en) * 2021-08-31 2023-05-26 中国人民解放军战略支援部队信息工程大学 Deep learning semi-supervised dense matching method and system based on consistency constraint
WO2023047164A1 (en) * 2021-09-22 2023-03-30 Sensetime International Pte. Ltd. Object sequence recognition method, network training method, apparatuses, device, and medium
CN113962737A (en) * 2021-10-26 2022-01-21 北京沃东天骏信息技术有限公司 Target recognition model training method and device, and target recognition method and device

Also Published As

Publication number Publication date
CN112926673B (en) 2023-01-17

Similar Documents

Publication Publication Date Title
CN112926673B (en) Semi-supervised target detection method based on consistency constraint
US11176715B2 (en) Method and system for color representation generation
CN108537269B (en) Weak interactive object detection deep learning method and system thereof
CN111738908B (en) Scene conversion method and system for generating countermeasure network by combining instance segmentation and circulation
EP3754549A1 (en) A computer vision method for recognizing an object category in a digital image
JP6612487B1 (en) Learning device, classification device, learning method, classification method, learning program, and classification program
CN111274981B (en) Target detection network construction method and device and target detection method
CN114067119B (en) Training method of panorama segmentation model, panorama segmentation method and device
JP6612486B1 (en) Learning device, classification device, learning method, classification method, learning program, and classification program
WO2022133627A1 (en) Image segmentation method and apparatus, and device and storage medium
US20180032806A1 (en) Producing a flowchart object from an image
CN114387608B (en) Table structure identification method combining convolution and graph neural network
KR20230073751A (en) System and method for generating images of the same style based on layout
CN115631374A (en) Control operation method, control detection model training method, device and equipment
Bakhtiarnia et al. PromptMix: Text-to-image diffusion models enhance the performance of lightweight networks
Singh et al. Automatic trimap and alpha-matte generation for digital image matting
CN109754416A (en) Image processing apparatus and method
Hsu et al. DensityLayout: Density-Conditioned Layout GAN for Visual-Textual Presentation Designs
Metri et al. Image generation using generative adversarial networks
KR20230162115A (en) Learning devices and learning methods
CN113065336A (en) Text automatic generation method and device based on deep learning and content planning
Liu et al. Anime Sketch Coloring with Swish-gated Residual U-net and Spectrally Normalized GAN.
Mütze et al. Semi-Supervised Domain Adaptation with CycleGAN Guided by Downstream Task Awareness.
Yu et al. Meta-simulation for the Automated Design of Synthetic Overhead Imagery
Stelling et al. " Just Drive": Colour Bias Mitigation for Semantic Segmentation in the Context of Urban Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant