CN111931767B - Multi-model target detection method, device and system based on picture informativeness and storage medium - Google Patents


Publication number
CN111931767B
CN111931767B (application CN202010776488.2A)
Authority
CN
China
Prior art keywords
network
target detection
networks
informativeness
information degree
Prior art date
Legal status
Active
Application number
CN202010776488.2A
Other languages
Chinese (zh)
Other versions
CN111931767A (en)
Inventor
孙亚杰
张颖
吴雨瑶
吴爱国
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202010776488.2A
Publication of CN111931767A
Application granted
Publication of CN111931767B
Legal status: Active
Anticipated expiration


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks


Abstract

The invention provides a multi-model target detection method, device and system based on picture informativeness, and a storage medium. The multi-model target detection method comprises the following steps. First step: selecting a plurality of target detection networks as candidate target detection networks, and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network. Second step: jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method. Third step: performing scale normalization on the output values of the informativeness additional networks, and selecting a target detection network according to the outputs of the informativeness additional networks. The beneficial effect of the invention is that the multi-model target detection method can combine the strengths of different detection models on different picture characteristics to comprehensively improve detection accuracy.

Description

Multi-model target detection method, device and system based on picture informativeness and storage medium
Technical Field
The invention relates to the field of computer vision and industrial detection, in particular to a multi-model target detection method, device and system based on picture informativeness and a storage medium.
Background
Target detection is a classical task in computer vision: given a picture, find the position of each object and frame it with a minimum bounding rectangle; some tasks also output the class of the object. For example, in workpiece defect detection, the defective part must be framed by a rectangular box and the type of defect must be specified.
Before deep learning was successfully applied to target detection, conventional image processing relied on edge detection, such as line detection with the Laplacian and gradient methods, and detection of standard shapes with the Hough transform. After the edge features are obtained, the leftmost, rightmost, uppermost and lowermost points of the feature can be found from the shape of the edge, and the minimum bounding rectangle can be derived from the coordinates of these four points. The local picture features within the rectangle are then sent to a classifier. The classifier and the minimum-bounding-rectangle search can be treated as two independent tasks: the classifier's task is to correctly classify the picture according to the extracted features, where the object classes form a finite set encoded in One-hot form.
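As a minimal illustration of this classical pipeline, the minimum axis-aligned bounding rectangle can be recovered from detected edge points by taking the extreme coordinates (a NumPy sketch; the edge detector is assumed to have already run, and the sample points are hypothetical):

```python
import numpy as np

def min_bounding_rect(edge_points: np.ndarray):
    """Axis-aligned minimum bounding rectangle from (N, 2) edge points.

    Returns (x_min, y_min, width, height), the region a downstream
    classifier crop would use.
    """
    xs, ys = edge_points[:, 0], edge_points[:, 1]
    x_min, x_max = xs.min(), xs.max()   # leftmost / rightmost points
    y_min, y_max = ys.min(), ys.max()   # uppermost / lowermost points
    return int(x_min), int(y_min), int(x_max - x_min), int(y_max - y_min)

# Edge points of a hypothetical defect region
pts = np.array([[12, 30], [45, 18], [40, 52], [20, 47]])
print(min_bounding_rect(pts))  # (12, 18, 33, 34)
```

The crop defined by this rectangle is what the classical pipeline would hand to the classifier as an independent second stage.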
After deep learning was successfully applied to target detection, the advent of convolutional neural networks removed the burden of hand-crafted feature extraction: the network learns to extract the desired features by adjusting a large number of parameters. In the field of target detection, the most common approach operates on the feature map extracted by a convolutional neural network. For example, in the method adopted by Faster R-CNN, a region proposal network finds sub-regions of the feature map that may contain an object, each sub-region is sent into the network as an image to be classified, and the image is classified (an SVM classifier in the earlier R-CNN formulation).
Existing deep-learning-based target detection methods fall into two types, one-stage and two-stage. Two-stage detection methods are mainly Fast R-CNN and its improved versions, which split detection into two parts: a region-proposal extraction network and a regression network. One-stage methods are detection methods in the style of YOLO and CornerNet, which complete the classification and regression tasks in a single network and therefore have very high detection speed.
Each detection method performs well on certain object sizes and categories and poorly on others.
Disclosure of Invention
The invention provides a multi-model target detection method based on picture informativeness, which comprises the following steps:
First step: selecting a plurality of target detection networks as candidate target detection networks, and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network;
Second step: jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method;
Third step: performing scale normalization on the output values of the informativeness additional networks, and selecting a target detection network according to the outputs of the informativeness additional networks.
As a further improvement of the invention, the first step specifically comprises the following steps:
Step 1: selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
Step 2: designing an informativeness additional network for each of the five networks in step 1, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers;
Step 3: for each feature layer extracted in step 2, performing a convolution operation and then global average pooling to convert the feature map into a feature vector; the vectors obtained from all feature layers are spliced into one feature vector, which is passed through two fully connected layers and mapped to an informativeness value.
As a further improvement of the present invention, the second step specifically further comprises:
Each candidate target detection model and its informativeness additional network need to be jointly trained on the same data set, or pretrained on a large data set and then fine-tuned on a small data set. The loss function used for training is:

$$\mathrm{Loss}=\sum_{b=1}^{B}\mathrm{Loss}_{detection}+\lambda\sum_{(i,j)}L_{1}$$

wherein B is the batch size, $\mathrm{Loss}_{detection}$ refers to the loss of the target detection network, the second sum runs over the picture pairs (i, j) in the batch, and $L_{1}$ has the following form:

$$L_{1}=\max\left(0,\;-\operatorname{sign}\left(l_{i}-l_{j}\right)\left(I_{i}-I_{j}\right)+\xi\right)$$

wherein $l_{i}$ and $l_{j}$ are the target detection network losses of the two pictures i and j that make up a picture pair, $I_{i}$ and $I_{j}$ are the informativeness values of pictures i and j, $\lambda$ is the weight of the informativeness additional network's loss function in the total loss and is set to 0.1 during training, and $\xi$ is the threshold of the informativeness additional network, set to 0.5 during training. When training reaches half of the total epochs, the weights of the informativeness additional network are fixed and only the target detection network is trained.
As a further improvement of the present invention, the third step further comprises:
Unifying the picture informativeness scale: the output of the informativeness additional network is scale-normalized, with the following normalization formula:

$$\tilde{I}=\frac{I-I_{min}}{I_{max}-I_{min}}$$

wherein I is the informativeness value output by the informativeness additional network of a detection network, and $I_{min}$ and $I_{max}$ are the minimum and maximum informativeness values of that detection network; the converted informativeness value lies between 0 and 1.
The invention also discloses a multi-model target detection device based on picture informativeness, which comprises: an informativeness additional network construction unit, for selecting a plurality of target detection networks as candidate target detection networks and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network; a training unit, for jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method;
a unifying unit, for performing scale normalization on the output values of the informativeness additional networks and selecting a target detection network according to the outputs of the informativeness additional networks.
As a further improvement of the present invention, the informativeness additional network construction unit specifically further includes:
a candidate target detection module, for selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
an informativeness additional network design module, for designing an informativeness additional network for each of the five networks of the candidate target detection module, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers; and a feature layer processing module, for extracting from the original target detection network the feature layers used to construct the informativeness network, performing a convolution operation and then global average pooling to convert the feature maps into feature vectors, splicing the vectors obtained from all feature layers into one feature vector, and mapping it through two fully connected layers to an informativeness value.
As a further improvement of the present invention, the training unit specifically further comprises:
Each candidate target detection model and its informativeness additional network should be jointly trained on the same data set, or pretrained on a large data set, such as ImageNet, and then fine-tuned on a small data set. The loss function used for training is:

$$\mathrm{Loss}=\sum_{b=1}^{B}\mathrm{Loss}_{detection}+\lambda\sum_{(i,j)}L_{1}$$

wherein B is the batch size, $\mathrm{Loss}_{detection}$ refers to the loss of the target detection network, the second sum runs over the picture pairs (i, j) in the batch, and $L_{1}$ has the following form:

$$L_{1}=\max\left(0,\;-\operatorname{sign}\left(l_{i}-l_{j}\right)\left(I_{i}-I_{j}\right)+\xi\right)$$

wherein $l_{i}$ and $l_{j}$ are the target detection network losses of the two pictures i and j that make up a picture pair, $I_{i}$ and $I_{j}$ are the informativeness values of pictures i and j, $\lambda$ is the weight of the informativeness additional network's loss function in the total loss and is set to 0.1 during training, and $\xi$ is the threshold of the informativeness additional network, set to 0.5 during training. When training reaches half of the total epochs, the weights of the informativeness additional network are fixed and only the target detection network is trained.
As a further improvement of the present invention, the unifying unit further comprises:
a picture informativeness scale unification module, wherein the formula for scale normalization of the output of the informativeness additional network is:

$$\tilde{I}=\frac{I-I_{min}}{I_{max}-I_{min}}$$

wherein I is the informativeness value output by the informativeness additional network of a detection network, and $I_{min}$ and $I_{max}$ are the minimum and maximum informativeness values of that detection network; the converted informativeness value lies between 0 and 1.
The invention also discloses a multi-model target detection system based on picture informativeness, comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the multi-model target detection method of the invention when called by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to implement the steps of the multi-model object detection method of the invention when invoked by a processor.
The beneficial effects of the invention are as follows: 1. the multi-model target detection method can combine the strengths of different detection models on different picture characteristics to comprehensively improve detection accuracy; 2. the multi-model target detection method can be applied to industrial defect detection and intelligent security, offering high accuracy with only a small sacrifice in speed.
Drawings
FIG. 1 is a basic flow chart of the multi-model object detection method of the present invention;
fig. 2 is a diagram of an information degree additional network structure of the present invention.
Detailed Description
As shown in fig. 1, the invention discloses a multi-model target detection method based on picture informativeness, which comprises the following steps:
First step: selecting a plurality of target detection networks as candidate target detection networks, and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network;
Second step: jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method;
Third step: performing scale normalization on the output values of the informativeness additional networks, and selecting a target detection network according to the outputs of the informativeness additional networks. The optimal detection network is selected according to the picture informativeness value output by each detection network: the smaller the informativeness value, the more familiar the detection network is with the picture, and the better the detection effect.
In the first step, the method specifically comprises the following steps:
Step 1: selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
Step 2: designing an informativeness additional network for each of the five networks in step 1, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers;
Step 3: for each feature layer extracted in step 2, performing a convolution operation and then global average pooling to convert the feature map into a feature vector; the vectors obtained from all feature layers are spliced into one feature vector, which is passed through two fully connected layers and mapped to an informativeness value.
As shown in fig. 2, feature layers are taken from the original detection network; the number of layers taken is related to the total depth of the detection network, and the deeper the network, the more feature layers can be taken to form the informativeness additional network. For ResNet-18 (ResNet is the residual network), 3 layers are taken to form the informativeness additional network; for ResNet-50, 5 layers are taken. To ensure that enough features are acquired, no fewer than 3 feature layers are extracted, and to keep the extra computational cost small, no more than 8 are extracted. Each extracted feature layer undergoes two convolution operations, with a BN (batch normalization) layer between the two convolutions, followed by global pooling to convert it into a one-dimensional vector. The one-dimensional vectors are spliced together and passed through a linear activation unit and then two fully connected layers, with dropout between them; the last fully connected layer maps the feature vector to an informativeness value.
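The structure just described can be sketched in NumPy as follows. This is a simplified illustration, not the patent's trained network: the two convolution stages and BN are collapsed into a 1x1 channel-mixing step, dropout is omitted, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def informativeness_head(feature_maps, hidden=16):
    """Map a list of feature maps (C, H, W) to a scalar informativeness value.

    Per layer: 1x1 channel mix (stand-in for the two conv + BN stages),
    ReLU, then global average pooling to a vector; the vectors from all
    layers are spliced and passed through two fully connected layers.
    """
    vectors = []
    for fmap in feature_maps:
        c = fmap.shape[0]
        w = rng.standard_normal((c, c)) * 0.1        # 1x1 conv as channel mixing
        mixed = np.einsum('oc,chw->ohw', w, fmap)
        mixed = np.maximum(mixed, 0.0)               # ReLU activation
        vectors.append(mixed.mean(axis=(1, 2)))      # global average pooling
    v = np.concatenate(vectors)                      # splice into one feature vector
    w1 = rng.standard_normal((hidden, v.size)) * 0.1 # first fully connected layer
    w2 = rng.standard_normal(hidden) * 0.1           # second FC layer -> scalar
    return float(w2 @ np.maximum(w1 @ v, 0.0))

# Three feature layers of decreasing resolution, as for CenterNet / CornerNet-V2 / SSD
maps = [rng.standard_normal((8, 32, 32)),
        rng.standard_normal((16, 16, 16)),
        rng.standard_normal((32, 8, 8))]
print(informativeness_head(maps))  # a single scalar informativeness value
```

The same head shape works for any number of tapped layers (3 to 8 in the text), since the concatenated vector length simply grows with the number of layers.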
In the second step, the method specifically further includes:
Each candidate target detection model and its informativeness additional network should be jointly trained on the same data set, or pretrained on a large data set, such as ImageNet, and then fine-tuned on a small data set. The loss function used for training is:

$$\mathrm{Loss}=\sum_{b=1}^{B}\mathrm{Loss}_{detection}+\lambda\sum_{(i,j)}L_{1}$$

wherein B is the batch size (the number of pictures in a batch), $\mathrm{Loss}_{detection}$ refers to the loss of the target detection network, the second sum runs over the picture pairs (i, j) in the batch, and $L_{1}$ has the following form:

$$L_{1}=\max\left(0,\;-\operatorname{sign}\left(l_{i}-l_{j}\right)\left(I_{i}-I_{j}\right)+\xi\right)$$

wherein $l_{i}$ and $l_{j}$ are the target detection network losses of the two pictures i and j that make up a picture pair, $I_{i}$ and $I_{j}$ are the informativeness values of pictures i and j, $\lambda$ is the weight of the informativeness additional network's loss function in the total loss and is set to 0.1 during training, $\xi$ is the threshold of the informativeness additional network, set to 0.5 during training, and epochs is the number of training rounds. When training reaches half of the total epochs, the weights of the informativeness additional network are fixed and only the target detection network is trained.
The loss function is designed for the informativeness additional network and trained together with the loss function of the target detection network. The loss function of the informativeness additional network uses the loss value of the target detection network and is designed according to that loss. The target detection loss and the informativeness additional network loss are added together for joint training, with the weight of the informativeness loss set to 0.1.
During training, the original detection network and the informativeness additional network are trained together. To prevent the predicted informativeness values from becoming almost identical because the additional network grows too robust, the number of epochs for the informativeness additional network is kept lower than that of the original detection network, generally half, so that the informativeness values of individual pictures remain effectively distinguishable. The batch size should not be set too large, and should be an even number so that the batch can be divided into picture pairs.
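Assuming the pairwise form described in the text (an even batch split into picture pairs, threshold ξ = 0.5, weight λ = 0.1), the joint loss can be illustrated as follows; the detection losses and informativeness values below are placeholder numbers, not outputs of a real detector:

```python
def pair_loss(l_i, l_j, I_i, I_j, xi=0.5):
    """Margin ranking loss for one picture pair: the picture with the larger
    detection loss should also receive the larger informativeness value."""
    sign = 1.0 if l_i > l_j else -1.0
    return max(0.0, -sign * (I_i - I_j) + xi)

def joint_loss(det_losses, info_values, lam=0.1, xi=0.5):
    """Total loss over an even batch: detection loss plus weighted pair losses."""
    assert len(det_losses) % 2 == 0, "batch size must be even to form pairs"
    total = sum(det_losses)
    for k in range(0, len(det_losses), 2):       # consecutive pictures form a pair
        total += lam * pair_loss(det_losses[k], det_losses[k + 1],
                                 info_values[k], info_values[k + 1], xi)
    return total

det = [2.0, 1.0, 0.5, 3.0]    # placeholder detection losses for a batch of 4
info = [1.5, 0.2, 0.1, 2.0]   # placeholder informativeness predictions
print(joint_loss(det, info))  # 6.5 (both pairs already ranked correctly)
```

When a pair is ranked correctly by a margin of at least ξ, its contribution is zero; only misordered pairs push gradient into the informativeness additional network.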
In the third step, the method further comprises the following steps:
Unifying the picture informativeness scale: because different target detection networks adopt different loss function forms, the informativeness values of the same picture under different detection networks are not directly comparable, so the output scales of all the informativeness additional networks are unified. The normalization formula is:

$$\tilde{I}=\frac{I-I_{min}}{I_{max}-I_{min}}$$

wherein I is the informativeness value output by the informativeness additional network of a detection network, and $I_{min}$ and $I_{max}$ are the minimum and maximum informativeness values of that detection network; the converted informativeness value lies between 0 and 1.
Selection of the detection result: a plurality of detection networks are selected as candidate target detection networks, and an informativeness additional network is designed independently for each, so that each detection network and its informativeness additional network are trained on the same data set with the same number of epochs. When a picture is detected, each informativeness additional network outputs an informativeness value for it; these values are compared across the detection networks, and the result of the network that understands the picture content most deeply, i.e. the one with the smallest informativeness value, is selected as the final detection result.
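The min-max scale unification and the final selection step can be combined in a short sketch; the per-network (I_min, I_max) ranges and raw informativeness outputs below are illustrative placeholders, not measured values:

```python
def normalize(I, I_min, I_max):
    """Min-max scale an informativeness value into [0, 1]."""
    return (I - I_min) / (I_max - I_min)

def select_network(raw_info, ranges):
    """Pick the detection network with the smallest normalized informativeness.

    raw_info: {name: informativeness output for the current picture}
    ranges:   {name: (I_min, I_max) observed for that network}
    """
    scores = {name: normalize(v, *ranges[name]) for name, v in raw_info.items()}
    return min(scores, key=scores.get), scores

ranges = {'Faster R-CNN': (0.0, 10.0), 'YOLO V3': (2.0, 6.0), 'SSD': (1.0, 5.0)}
raw = {'Faster R-CNN': 4.0, 'YOLO V3': 3.0, 'SSD': 4.0}
best, scores = select_network(raw, ranges)
print(best)  # YOLO V3
```

Note that without normalization the raw values 4.0, 3.0 and 4.0 would still have favored YOLO V3 here, but in general the per-network loss scales differ, which is exactly why the unification step precedes the comparison.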
The invention also discloses a multi-model target detection device based on picture informativeness, which comprises: an informativeness additional network construction unit, for selecting a plurality of target detection networks as candidate target detection networks and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network; a training unit, for jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method;
a unifying unit, for performing scale normalization on the output values of the informativeness additional networks and selecting a target detection network according to the outputs of the informativeness additional networks.
The informativeness additional network construction unit further specifically comprises:
a candidate target detection module, for selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
an informativeness additional network design module, for designing an informativeness additional network for each of the five networks of the candidate target detection module, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers; and a feature layer processing module, for extracting from the original target detection network the feature layers used to construct the informativeness network, performing a convolution operation and then global average pooling to convert the feature maps into feature vectors, splicing the vectors obtained from all feature layers into one feature vector, and mapping it through two fully connected layers to an informativeness value.
The training unit specifically further comprises:
Each candidate target detection model and its informativeness additional network should be jointly trained on the same data set, or pretrained on a large data set, such as ImageNet, and then fine-tuned on a small data set. The loss function used for training is:

$$\mathrm{Loss}=\sum_{b=1}^{B}\mathrm{Loss}_{detection}+\lambda\sum_{(i,j)}L_{1}$$

wherein B is the batch size, $\mathrm{Loss}_{detection}$ refers to the loss of the target detection network, the second sum runs over the picture pairs (i, j) in the batch, and $L_{1}$ has the following form:

$$L_{1}=\max\left(0,\;-\operatorname{sign}\left(l_{i}-l_{j}\right)\left(I_{i}-I_{j}\right)+\xi\right)$$

wherein $l_{i}$ and $l_{j}$ are the target detection network losses of the two pictures i and j that make up a picture pair, $I_{i}$ and $I_{j}$ are the informativeness values of pictures i and j, $\lambda$ is the weight of the informativeness additional network's loss function in the total loss and is set to 0.1 during training, and $\xi$ is the threshold of the informativeness additional network, set to 0.5 during training. When training reaches half of the total epochs, the weights of the informativeness additional network are fixed and only the target detection network is trained.
The unifying unit further comprises:
a picture informativeness scale unification module, wherein the formula for scale normalization of the output of the informativeness additional network is:

$$\tilde{I}=\frac{I-I_{min}}{I_{max}-I_{min}}$$

wherein I is the informativeness value output by the informativeness additional network of a detection network, and $I_{min}$ and $I_{max}$ are the minimum and maximum informativeness values of that detection network; the converted informativeness value lies between 0 and 1. Each detection network's informativeness values need to be limited to the range 0 to 1 in order to facilitate comparing informativeness values across different detection networks.
The invention also discloses a multi-model target detection system based on picture informativeness, comprising a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the multi-model target detection method when called by the processor.
The invention also discloses a computer readable storage medium storing a computer program configured to implement the steps of the multi-model object detection method of the invention when invoked by a processor.
The beneficial effects of the invention are as follows: 1. the multi-model target detection method can combine the strengths of different detection models on different picture characteristics to comprehensively improve detection accuracy; 2. the multi-model target detection method can be applied to industrial defect detection and intelligent security, offering high accuracy with only a small sacrifice in speed.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.

Claims (8)

1. A multi-model target detection method based on picture informativeness, characterized by comprising the following steps:
First step: selecting a plurality of target detection networks as candidate target detection networks, and designing an informativeness additional network for each candidate detection network according to the number of layers of the target detection network;
Second step: jointly training the target detection networks and the informativeness additional networks, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by the target detection method;
Third step: performing scale normalization on the output values of the informativeness additional networks, and selecting a target detection network according to the outputs of the informativeness additional networks;
wherein the second step specifically further comprises:
each candidate target detection model and its informativeness additional network need to be jointly trained on the same data set, or pretrained on a large data set and then fine-tuned on a small data set, with the following training loss function:

$$\mathrm{Loss}=\sum_{b=1}^{B}\mathrm{Loss}_{detection}+\lambda\sum_{(i,j)}L_{1}$$

wherein B is the batch size, $\mathrm{Loss}_{detection}$ refers to the loss of the target detection network, the second sum runs over the picture pairs (i, j) in the batch, and $L_{1}$ has the following form:

$$L_{1}=\max\left(0,\;-\operatorname{sign}\left(l_{i}-l_{j}\right)\left(I_{i}-I_{j}\right)+\xi\right)$$

wherein $l_{i}$ and $l_{j}$ are the target detection network losses of the two pictures i and j that make up a picture pair, $I_{i}$ and $I_{j}$ are the informativeness values of pictures i and j, $\lambda$ is the weight of the informativeness additional network's loss function in the total loss and is set to 0.1 during training, and $\xi$ is the threshold of the informativeness additional network, set to 0.5 during training; when training reaches half of the total epochs, the weights of the informativeness additional network are fixed and only the target detection network is trained.
2. The multi-model target detection method according to claim 1, wherein the first step further comprises:
step 1: selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
step 2: designing an informativeness additional network for each of the five networks in step 1, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers;
step 3: for each feature layer extracted in step 2, performing a convolution operation followed by global average pooling to convert the feature map into a feature vector, concatenating the vectors obtained from all feature layers into a single feature vector, and mapping it to an informativeness value through two fully connected layers.
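The head described in step 3 (global average pooling per feature layer, concatenation, then two fully connected layers producing a scalar) can be sketched as follows; the channel counts, spatial sizes, and hidden width are illustrative assumptions, and the weights are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

def global_avg_pool(feature_map):
    """Collapse a (C, H, W) feature map to a length-C vector."""
    return feature_map.mean(axis=(1, 2))

def informativeness_head(feature_maps, hidden=64):
    """GAP each (already convolved) feature layer, concatenate, then map
    through two fully connected layers to a scalar informativeness value."""
    pooled = np.concatenate([global_avg_pool(f) for f in feature_maps])
    W1 = rng.standard_normal((hidden, pooled.size)) * 0.01
    W2 = rng.standard_normal((1, hidden)) * 0.01
    h = np.maximum(W1 @ pooled, 0.0)  # ReLU between the two FC layers
    return float((W2 @ h)[0])

# Three feature layers of different spatial sizes, as for CenterNet or SSD
feats = [rng.standard_normal((c, s, s)) for c, s in [(64, 32), (128, 16), (256, 8)]]
score = informativeness_head(feats)
```

Pooling before concatenation is what makes the head independent of each layer's spatial size, so the same construction applies whether a detector contributes 3, 4, or 5 feature layers.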
3. The multi-model target detection method according to claim 1, wherein the third step further comprises:
unifying the picture informativeness scale: the output of each informativeness additional network is scale-normalized as follows:

I' = (I - I_min) / (I_max - I_min)

wherein I is the informativeness value output by the informativeness additional network of a detection network, and I_min and I_max are the minimum and maximum informativeness values of that detection network; the converted value I' lies between 0 and 1.
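The scale unification is ordinary min-max normalization, which makes scores from different detectors comparable. A minimal sketch (the network names, output ranges, and raw scores are hypothetical):

```python
def normalize_informativeness(value, i_min, i_max):
    """Min-max normalize one network's informativeness output to [0, 1]."""
    return (value - i_min) / (i_max - i_min)

# Hypothetical per-network output ranges and raw scores for one picture
ranges = {"faster_rcnn": (0.2, 1.8), "yolov3": (-1.0, 3.0)}
raw = {"faster_rcnn": 1.0, "yolov3": 0.0}
# After normalization the scores live on a common [0, 1] scale, so a
# target detection network can be selected by comparing them directly.
normalized = {k: normalize_informativeness(raw[k], *ranges[k]) for k in raw}
```

Without this step the raw outputs of the different informativeness additional networks would not be directly comparable, since each detector's head has its own output range.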
4. A multi-model object detection device based on picture informativeness, characterized by comprising:
informativeness additional network construction unit: for selecting a plurality of target detection networks as candidate target detection networks, and designing an informativeness additional network for each candidate detection network according to the number of layers of that target detection network;
training unit: for jointly training each target detection network and its informativeness additional network, and designing the loss function of the informativeness additional network and the training strategy of the whole network, wherein the loss function of the target detection network is determined by its target detection method;
unifying unit: for performing scale normalization on the output values of the informativeness additional networks, and selecting a target detection network according to the outputs of the informativeness additional networks;
the training unit specifically further comprises:
each candidate target detection model and its informativeness additional network need to be jointly trained on the same data set, or trained first on a large data set and then fine-tuned on a small data set, with the following training loss:

Loss = Σ over picture pairs (i, j) in batch B of [ loss_detection(i) + loss_detection(j) + λ · L_1(i, j) ]

wherein B is the batch of picture pairs, loss_detection is the loss of the target detection network, λ is the weight of the informativeness additional network's loss function in the total network, and L_1 takes the following form:

L_1(i, j) = max(0, -sign(l_i - l_j) · (I_i - I_j) + ξ)

wherein l_i and l_j are the target detection network losses of the two pictures i and j that form a picture pair, I_i and I_j are the informativeness values of pictures i and j, and ξ is the margin threshold of the informativeness additional network; during training λ is set to 0.1 and ξ is set to 0.5, and after half of the total epochs the weights of the informativeness additional network are fixed and only the target detection network is trained.
5. The multi-model object detection apparatus according to claim 4, wherein the informativeness additional network construction unit further comprises:
candidate target detection module: for selecting the five target detection networks Faster R-CNN, CenterNet, CornerNet-V2, YOLO V3 and SSD as candidate target detection networks;
informativeness additional network design module: for designing an informativeness additional network for each of the five networks of the candidate target detection module, wherein Faster R-CNN selects 4 feature layers to construct its informativeness additional network, CenterNet, CornerNet-V2 and SSD each select 3 feature layers, and YOLO V3 selects 5 feature layers;
feature layer processing module: for extracting from the original target detection network the feature layers used to construct the informativeness network, performing a convolution operation followed by global average pooling to convert each feature map into a feature vector, concatenating the vectors obtained from all feature layers into a single feature vector, and mapping it to an informativeness value through two fully connected layers.
6. The multi-model object detection apparatus according to claim 4, wherein the unifying unit further comprises:
picture informativeness scale unification module: for scale-normalizing the output of each informativeness additional network as follows:

I' = (I - I_min) / (I_max - I_min)

wherein I is the informativeness value output by the informativeness additional network of a detection network, and I_min and I_max are the minimum and maximum informativeness values of that detection network; the converted value I' lies between 0 and 1.
7. A multi-model object detection system based on picture informativeness, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement, when invoked by the processor, the steps of the multi-model target detection method of any of claims 1-3.
8. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to implement, when invoked by a processor, the steps of the multi-model target detection method of any of claims 1-3.
CN202010776488.2A 2020-08-05 2020-08-05 Multi-model target detection method, device and system based on picture informativeness and storage medium Active CN111931767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010776488.2A CN111931767B (en) 2020-08-05 2020-08-05 Multi-model target detection method, device and system based on picture informativeness and storage medium


Publications (2)

Publication Number Publication Date
CN111931767A CN111931767A (en) 2020-11-13
CN111931767B true CN111931767B (en) 2023-09-15

Family

ID=73307622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010776488.2A Active CN111931767B (en) 2020-08-05 2020-08-05 Multi-model target detection method, device and system based on picture informativeness and storage medium

Country Status (1)

Country Link
CN (1) CN111931767B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113392857B (en) * 2021-08-17 2022-03-11 深圳市爱深盈通信息技术有限公司 Target detection method, device and equipment terminal based on yolo network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416394A (en) * 2018-03-22 2018-08-17 河南工业大学 Multi-target detection model building method based on convolutional neural networks
CN109345575A (en) * 2018-09-17 2019-02-15 中国科学院深圳先进技术研究院 A kind of method for registering images and device based on deep learning
CN110322423A (en) * 2019-04-29 2019-10-11 天津大学 A kind of multi-modality images object detection method based on image co-registration
CN110942000A (en) * 2019-11-13 2020-03-31 南京理工大学 Unmanned vehicle target detection method based on deep learning
CN111444821A (en) * 2020-03-24 2020-07-24 西北工业大学 Automatic identification method for urban road signs




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant