CN110569901B

CN110569901B - Channel selection-based adversarial erasing weakly supervised target detection method

Info

Publication number: CN110569901B
Application number: CN201910838283.XA
Authority: CN
Inventors: 杨金福; 单义; 李明爱; 武随烁
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2019-09-05
Filing date: 2019-09-05
Publication date: 2022-11-29
Anticipated expiration: 2039-09-05
Also published as: CN110569901A

Abstract

The invention relates to an adversarial erasing weakly supervised target detection method based on feature channel selection, which addresses the localization errors of weakly supervised target detection. First, with weakly supervised deep target detection as the underlying framework, candidate boxes are generated on the training set data by the selective search method, and the candidate boxes, the training set images and the corresponding image labels are used as the input of the weakly supervised network. Second, a feature extraction network model is constructed with VGG16 as the base network, and channel-weighted selection is applied to the resulting feature maps by feature channel compression, so as to excite the image feature layers that benefit classification and suppress those that interfere with it. Then, an adversarial erasing method is used to obtain a complete feature representation of the image target as the input of the prediction network. Finally, the prediction network is trained with a multi-task cross-entropy loss to realize target detection. The invention can both localize the target object more accurately and improve the accuracy of object recognition.

Description

Channel selection-based adversarial erasing weakly supervised target detection method

Technical Field

The invention belongs to the technical field of target detection, and relates to a weakly supervised target detection method based on adversarial erasing.

Background

With the development of science and technology and the rising level of intelligence in daily life, mobile robots are gradually entering production and everyday life and are widely applied across industries. Target detection on mobile robots is widely used in inspection, security, video surveillance and search. Current deep learning target detection algorithms generally need large manually labeled data sets. Labeling consumes considerable manpower, material and financial resources, and incorrectly labeled data degrades the robustness and the detection accuracy of the model. A weakly supervised target detection algorithm needs only image-level labels to complete target detection, and is therefore better suited to industrial production environments.

In recent years, target detection algorithms based on deep learning have developed rapidly, as have methods that realize deep learning target detection from image-level labels. In 2016, Bilen et al. [1] first proposed applying deep learning to target detection with image-level labels; compared with traditional weakly supervised target detection algorithms, the detection accuracy on public data sets improved markedly. In 2016, Diba et al. [2] proposed a cascaded deep learning weakly supervised target detection algorithm. The model has two stages: (1) a convolutional neural network in the first stage of the cascade extracts candidate regions in which a target object may exist; (2) the candidate regions output by the first stage are used as the input of the second stage, and the target object is finally detected by multiple-instance learning. In 2017, Tang et al. [3] proposed a multiple-instance deep learning target detection network with loop iteration. In the prediction stage, non-maximum suppression (NMS) can suppress detection boxes that have a lower classification score but accurate localization while retaining boxes that have a high classification score but poor localization; to address this, the network iteratively updates the classification confidence of the object and the position of the detection box. In 2018, Wang et al. [4] proposed a collaborative learning method for weakly supervised target detection, which improves detection accuracy by combining a strongly supervised learning network with a weakly supervised one. The detection boxes obtained from the weakly supervised network are used as the initial input of the strongly supervised network, a shared convolutional network reduces the number of network parameters, and the input of the strongly supervised network is updated through continued cyclic training to improve the final accuracy. However, these networks need to continuously update and adjust the detection box positions at the input of the network model, and because weakly supervised target detection lacks the position regression available to strongly supervised networks, continued iterative updating cannot fully solve the problem of large localization errors in weakly supervised target detection.

Existing optimization methods for weakly supervised target detection iterate over the predicted candidate boxes after the feature map of the target has been obtained, so as to refine the position of the detection box. However, because weakly supervised target detection lacks accurate position labels for the target, iterative optimization alone does not solve the problem of inaccurate detection box localization well.

References:

1. H. Bilen and A. Vedaldi. Weakly supervised deep detection networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 2846-2854, 2016.

2. A. Diba, V. Sharma, A. Pazandeh, H. Pirsiavash, and L. Van Gool. Weakly supervised cascaded convolutional networks. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 5131-5139, 2017.

3. P. Tang, X. Wang, X. Bai, and W. Liu. Multiple instance detection network with online instance classifier refinement. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pages 3059-3067, 2017.

4. J. Wang, J. Yao, Y. Zhang, et al. Collaborative learning for weakly supervised object detection. IJCAI, 2018.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention provides an adversarial erasing target detection method based on feature channel selection. In the feature extraction network, a feature channel selection strategy raises the weights of the feature channels that contribute positively to classification, and feature adversarial erasing then extracts more comprehensive features of each detected object, thereby addressing the localization errors of weakly supervised target detection. The model uses only image data with image-level labels as the network input, and combines the channel feature selection method to extract the features that are more useful to the classification network as the classifier input. The invention can both localize the target object more accurately and improve the accuracy of object recognition.

The invention is realized by the following technical means:

An adversarial erasing weakly supervised target detection method based on feature channel selection, characterized by comprising the following steps:

Step 1: Data preprocessing

The images in the training set are taken as input images and preprocessed, namely by multi-scale transformation, horizontal flipping and random cropping, to obtain a preprocessed training set.
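The three preprocessing operations can be sketched as follows. This is a minimal illustration only: the scale set, the crop fraction, the flip probability and the nearest-neighbour resize are assumptions chosen for brevity, not values taken from the text.

```python
import numpy as np

def preprocess(image, rng, scales=(0.75, 1.0, 1.25), crop_frac=0.9):
    """Return one augmented copy of an H x W x 3 image array.

    scales and crop_frac are hypothetical values for illustration.
    """
    # Multi-scale transformation: nearest-neighbour resize by a random scale.
    scale = scales[rng.integers(len(scales))]
    h, w = image.shape[:2]
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    out = image[rows][:, cols]
    # Horizontal flipping with probability 0.5 (assumed).
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Random cropping to a fixed fraction of the scaled size.
    ch, cw = max(1, int(new_h * crop_frac)), max(1, int(new_w * crop_frac))
    top = rng.integers(new_h - ch + 1)
    left = rng.integers(new_w - cw + 1)
    return out[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
image = (np.arange(40 * 60 * 3) % 256).astype(np.uint8).reshape(40, 60, 3)
augmented = preprocess(image, rng)
```

In a full pipeline the candidate boxes of Step 2 would have to be scaled, flipped and cropped together with the image; that bookkeeping is omitted here.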

Step 2: Generating target region candidate boxes

A candidate box set R is generated for each training image by the selective search algorithm, that is, a series of rectangular regions in which a target object may exist. Redundant candidate boxes whose similarity and overlap ratio exceed a threshold A (value range (0.5, 1)) are deleted, and the coordinates of the remaining candidate boxes are used as the input of the weakly supervised target detection network to train the network model.
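The redundancy filter can be illustrated with a short sketch. It assumes that the overlap ratio is measured as intersection over union (IoU) and that boxes are visited in their given order; the text does not specify either point, and the helper names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def drop_redundant(boxes, threshold=0.7):
    """Keep a box only if its IoU with every already kept box is <= threshold."""
    kept = []
    for box in boxes:
        if all(iou(box, k) <= threshold for k in kept):
            kept.append(box)
    return kept

boxes = [(0, 0, 10, 10), (1, 0, 10, 10), (20, 20, 30, 30)]
kept = drop_redundant(boxes)  # the second box overlaps the first with IoU 0.9
```

The default threshold of 0.7 lies inside the stated range (0.5, 1).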

Step 3: The overall network is divided into six parts: a preliminary feature extraction network model, a salient region feature extraction network, a secondary region feature learning network, a comprehensive feature extraction layer, a region-of-interest pooling layer and a global average pooling layer.

First, the preliminary features of the image are extracted with the preliminary feature extraction network model.

Among deep network models such as VGGNet, ResNet and MobileNet, the VGG16 model is preferably used as the preliminary feature extraction network model. The preprocessed training set obtained in Step 1 is used as the input of the preliminary feature extraction network; the model is trained, its network structure file and network parameter configuration file are modified, and the preliminary extraction of image features is completed, yielding a preliminary feature map.

Step 4: Feature channel selection is performed on the preliminary feature map with the salient region feature extraction network to obtain the salient region features.

Step 4.1: A 1 × 1 convolution is applied to the H × W × C_1 preliminary feature map extracted in Step 3 to obtain a feature map $u_c$ of size H × W × C. The transformation is given by formula (1):

$$u_c = v_c * X = \sum_{k=1}^{C_1} v_c^{k} * x^{k} \tag{1}$$

where $u_c$ is the output after the convolution, $v_c$ denotes the convolution kernel, and $X$ denotes the input feature map. H × W is the size of the input feature map, $C_1$ is the number of feature map channels before the convolution, and C is the number of feature map channels after the convolution.
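A 1 × 1 convolution is a per-pixel linear map across channels, which the following sketch makes explicit. The array names, the channel counts (512 in, 128 out) and the random weights are illustrative assumptions.

```python
import numpy as np

def conv1x1(x, v):
    """Apply a 1 x 1 convolution.

    x: H x W x C1 input feature map.
    v: C1 x C kernel (one weight per input/output channel pair).
    Returns an H x W x C feature map.
    """
    return np.einsum("hwk,kc->hwc", x, v)

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 512))        # hypothetical H, W, C1
v = rng.standard_normal((512, 128)) * 0.01  # hypothetical C1 -> C reduction
u = conv1x1(x, v)
```

Each output pixel `u[i, j]` equals `x[i, j] @ v`, so the spatial size is unchanged and only the channel count moves from C_1 to C.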

Step 4.2: The output $u_c$ of Step 4.1 is compressed by global average pooling, giving a tensor $z_c$ of size 1 × 1 × C. Global average pooling is given by formula (2):

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{2}$$

where $z_c \in \mathbb{R}^{C}$ is the output feature tensor of size 1 × 1 × C, $u_c$ is the output of Step 4.1, $F_{sq}$ denotes the global average pooling operation on the feature map, H × W is the size of the input feature layer, and i, j are the element coordinate indices in the feature map.
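As a small worked instance of this compression step, the sketch below averages each channel of a toy feature map over its spatial positions. The function name and the toy values are assumptions for illustration.

```python
import numpy as np

def squeeze(u):
    """Global average pooling: H x W x C feature map -> length-C vector."""
    return u.mean(axis=(0, 1))

# Toy 2 x 3 x 4 map whose entry at flat position p and channel c is 4 * p + c.
u = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
z = squeeze(u)  # mean over p = 0..5 of (4 * p + c) is 10 + c
```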

Step 4.3: A feature channel selection operation is applied to the output $z_c$ of Step 4.2 to obtain a one-dimensional tensor s. The channel selection formula is:

$$s = F_{ex}(W_2 \, \delta(W_1 z_c)) \tag{3}$$

where s is the one-dimensional tensor output after network compression, $W_1$ and $W_2$ are the weight parameters of the channel selection operation, which are optimized iteratively, $\delta$ is the ReLU activation function, and $F_{ex}$ denotes a nonlinear activation function.
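The channel selection step can be sketched as two small matrix products with activations. The text only says that the outer function is a nonlinear activation; the sigmoid used here, the bottleneck ratio r and the weight shapes are assumptions borrowed from common squeeze-and-excitation practice.

```python
import numpy as np

def excite(z, w1, w2):
    """Compute channel weights from the pooled vector z.

    w1: (C // r) x C and w2: C x (C // r) are hypothetical shapes.
    """
    hidden = np.maximum(w1 @ z, 0.0)             # delta: ReLU
    return 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # outer activation, assumed sigmoid

rng = np.random.default_rng(0)
C, r = 8, 4  # hypothetical channel count and reduction ratio
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
z = rng.standard_normal(C)
s = excite(z, w1, w2)
```

With a sigmoid, each entry of `s` lies between 0 and 1 and can be read as a soft gate on the corresponding channel.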

Step 4.4: The weight vector s obtained in Step 4.3 is multiplied channel by channel with the feature map $u_c$ obtained in Step 4.1 to obtain a weighted feature map x. The calculation is:

$$x = F_{scale}(u_c, s) \tag{4}$$

where $F_{scale}$ denotes channel-wise multiplication, s is the channel weight vector obtained in Step 4.3, and $u_c$ is the output of Step 4.1.
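Channel-wise multiplication reduces to array broadcasting, as the sketch below shows. The toy shapes and weights are illustrative assumptions.

```python
import numpy as np

def scale(u, s):
    """Multiply every H x W slice of u by its channel weight in s."""
    return u * s  # s (length C) broadcasts over the H and W axes

u = np.ones((2, 2, 3))
s = np.array([0.1, 0.5, 1.0])
x = scale(u, s)  # channel 0 is damped, channel 2 passes unchanged
```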

Step 4.5: The salient region feature map $M^A$ is obtained from the feature map x of Step 4.4 through the classifier of the salient region feature extraction network model:

$$M^A = f(\phi_A, x, y_i) \tag{5}$$

where $M^A$ denotes the obtained feature region, f denotes the classifier of the salient region feature extraction network model (preferably SGD), $\phi_A$ denotes the parameters of the network model, x is the weighted feature map obtained in Step 4.4, and $y_i$ is the label corresponding to the image.

Step 5: Using the output $M^A$ of Step 4.5, features outside the most salient region are learned by adversarial erasing, and a secondary feature map $M^B$ is obtained through the secondary region feature learning network, as follows:

Step 5.1: A threshold t, with value range (0.5, 1), is applied to the output $M^A$ of Step 4.5 to obtain the most salient feature region $R_t$.

Step 5.2: The features corresponding to the region $R_t$ are set to 0, yielding a feature map with the salient features erased.

Step 5.3: The erased feature map obtained in Step 5.2 is fed to the secondary region feature learning network to further learn the remaining features in the image. The secondary region feature learning network has the same structure as the salient region feature extraction network of Step 4, but its input differs and the network parameters are not shared. The secondary feature map $M^B$ is obtained through the classifier of the secondary region feature learning network.
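Steps 5.1 and 5.2 can be sketched as a threshold followed by a mask. The sketch assumes that $M^A$ is a single H × W response map, that it is min-max normalised before thresholding, and that erasing is applied to an H × W × C feature map; none of these details is fixed by the text.

```python
import numpy as np

def erase_salient(features, m_a, t=0.6):
    """Zero the features inside the most salient region of m_a.

    features: H x W x C feature map; m_a: H x W response map.
    Returns the erased features and the boolean region mask.
    """
    span = m_a.max() - m_a.min()
    norm = (m_a - m_a.min()) / span if span > 0 else np.zeros_like(m_a)
    region = norm > t                         # most salient region R_t
    erased = features * (~region)[..., None]  # set features in R_t to 0
    return erased, region

features = np.ones((2, 2, 3))
m_a = np.array([[0.0, 0.2], [0.5, 1.0]])
erased, region = erase_salient(features, m_a)
```

Only the bottom-right position exceeds the threshold in this toy case, so the secondary network would see every feature except the one the first classifier relied on most.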

Step 6: The salient region feature map $M^A$ of Step 4.5 and the secondary region feature map $M^B$ learned in Step 5.3 are fused by taking the maximum at each corresponding pixel, giving the fused comprehensive feature map $M^{fuse}$:

$$M^{fuse} = \max(M^A, M^B) \tag{6}$$

where $M^{fuse}$ is the comprehensive feature map obtained after fusion.
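The fusion is an element-wise maximum, shown here on two toy 2 × 2 maps (values chosen for illustration only).

```python
import numpy as np

m_a = np.array([[0.9, 0.1], [0.2, 0.0]])  # salient region response
m_b = np.array([[0.0, 0.7], [0.3, 0.0]])  # secondary region response
m_fuse = np.maximum(m_a, m_b)             # pixel-wise maximum of the two maps
```

Positions that respond in either map survive in the fused map, which is how the fused map covers more of the object than either map alone.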

Step 7: The candidate boxes obtained in Step 2 are mapped onto the comprehensive feature map $M^{fuse}$ obtained in Step 6, region-of-interest pooling is applied to each candidate box, and candidate boxes of different sizes are converted into feature vectors $V_c$ of the same dimension.
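Region-of-interest pooling can be sketched as max pooling over a fixed grid laid on each box. The sketch assumes boxes are already expressed in feature map coordinates and are non-empty, and it uses a 2 × 2 output grid for brevity; the output size is not given in the text.

```python
import numpy as np

def roi_pool(fmap, box, out_size=2):
    """Max-pool box = (x1, y1, x2, y2) of an H x W x C map to out_size x out_size x C."""
    x1, y1, x2, y2 = box
    region = fmap[y1:y2, x1:x2]
    h, w = region.shape[:2]
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    out = np.zeros((out_size, out_size, fmap.shape[2]))
    for i in range(out_size):
        for j in range(out_size):
            # Clamp so that every grid cell covers at least one position.
            r0, c0 = min(ys[i], h - 1), min(xs[j], w - 1)
            r1, c1 = max(ys[i + 1], r0 + 1), max(xs[j + 1], c0 + 1)
            out[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4, 1)
full = roi_pool(fmap, (0, 0, 4, 4))   # 4 x 4 box
small = roi_pool(fmap, (1, 1, 4, 3))  # 3 x 2 box, same output size
```

Both boxes yield a 2 × 2 × 1 output, so flattening gives vectors of equal length regardless of box size.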

Step 8: The feature vector $V_c$ corresponding to each target box is passed through two global average pooling (GAP) operations, and the probability values of the corresponding classes are then obtained with a softmax function.
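The pooling and softmax stage can be illustrated in simplified form. How the two GAP operations are arranged is not specified in the text, so this sketch uses a single spatial average and assumes one channel per class; it shows the idea only and is not the exact layer layout.

```python
import numpy as np

def class_probabilities(roi_feature):
    """Average an h x w x K RoI feature over space, then apply softmax over K."""
    scores = roi_feature.mean(axis=(0, 1))  # one score per channel (assumed per class)
    shifted = scores - scores.max()         # subtract the max for numerical stability
    e = np.exp(shifted)
    return e / e.sum()

roi_feature = np.zeros((2, 2, 3))
roi_feature[..., 1] = 2.0  # the second class responds most strongly
p = class_probabilities(roi_feature)
```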

Step 9: The whole network model is trained with an overall loss function L, where the training labels are the class labels of the images in the data set. The difference between the predicted values and the data set labels is reduced through continued iterative training, and training ends when the model converges or the preset number of training iterations is reached.

Advantageous effects

In existing weakly supervised target detection methods, improvements to the target localization box are mostly based on iterative regression of the box, but the problem cannot be solved well because the network lacks position labels. The adversarial erasing weakly supervised target detection method based on feature selection instead improves target box localization by optimizing the feature extraction of the image. It enables the network model to extract more comprehensive features of the target object, so that the subsequent localization box covers the detection target completely; it also improves the accuracy with which the network classifies objects and the final detection box results.

Drawings

FIG. 1 is a general block diagram of the weakly supervised target detection method based on adversarial erasing;

FIG. 2 is a diagram of a basic feature extraction network architecture;

FIG. 3 is a diagram of a feature channel selection architecture;

FIG. 4 shows the results of part of the experiment.

Detailed Description

In order that those skilled in the art can better understand and use the present invention, the following technical solutions will be further described with reference to the accompanying drawings and specific embodiments.

1. The VOC data set is selected as the training and testing data set. First, the images in the training set are taken as input images and preprocessed, namely by multi-scale transformation, horizontal flipping and random cropping, to obtain a preprocessed training set.

2. A candidate box set R, i.e., the rectangular coordinates of positions where an object may exist, is generated for the training images by the selective search algorithm. Redundant candidate boxes whose similarity and overlap ratio exceed the set threshold of 0.7 are deleted, and the remaining candidate boxes are used as the input of the weakly supervised target detection network to train the model. The overall network structure is shown in FIG. 1.

3. Among deep network models such as VGGNet, ResNet and MobileNet, the VGG16 model is preferably used as the preliminary feature extraction network model. The VGG network comprises 13 convolutional layers and four pooling layers; the detailed network structure is shown in FIG. 2. The preprocessed training set is used as the input of the convolutional neural network, the network is trained, its network structure file and network parameter configuration file are modified, and the extraction of image features is completed.

4. The salient region feature extraction network model is shown in FIG. 3. It comprises one convolutional layer, one average pooling layer and a weight distribution layer, and works as follows. A 1 × 1 convolution is applied to the last feature layer of the VGG network in item 3 to change the number of channels of the feature map, converting the original H × W × C_1 feature map into an H × W × C feature map. Reducing the number of channels reduces the model parameters, which simplifies the model and improves its efficiency. The convolved feature map is then compressed by a global average pooling layer, which outputs a tensor of size 1 × 1 × C. To distinguish the contributions of different feature channels to classification, the weight of each channel is learned, which improves the feature extraction capability. The learned channel weights are multiplied channel by channel with the corresponding feature maps, giving feature maps with different weights.

5. The feature map from item 4 is passed through the classifier of the salient region feature extraction network to obtain the feature map $M^A$. The most salient feature region $R_t$ is obtained with a preset threshold t. The features corresponding to $R_t$ are set to 0, giving a feature map with those features erased. The erased feature map is passed through the secondary region feature learning network to further learn the remaining features in the image. The model structure of the secondary region feature learning network is the same as that of the salient region feature extraction network, as shown in FIG. 3, but its input differs and the network model parameters are not shared. The feature map $M^B$ is obtained with the classifier of the secondary region feature network. The feature map $M^A$ obtained before erasing and $M^B$ are fused to obtain the final feature map $M^{fuse}$.

6. The candidate boxes obtained in item 2 are mapped onto the feature map $M^{fuse}$ obtained in item 5, region-of-interest pooling is applied to each candidate box, and candidate boxes of different sizes are converted into feature vectors of the same size. After two global average pooling (GAP) layers, the probability values of the corresponding classes are obtained through a softmax function.

7. Model training: (1) preprocess the data; (2) input the data into the network model for training; (3) during training, continuously compute the value of the loss function to assess the error; (4) reduce the training error through back propagation; (5) repeat the training process until the model converges or the set number of iterations is reached. The overall loss function of the network consists of three parts: the classification loss of the salient region feature extraction network model, the classification loss of the secondary region feature learning network, and the classification loss of the final overall output. The loss of the whole network is the sum of these three parts.
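The three-part sum can be sketched as follows. The exact form of each term is not reproduced in this text, so the per-class binary cross-entropy used here is an assumption for multi-label image-level supervision, and the function and variable names are hypothetical.

```python
import numpy as np

def bce(y, p, eps=1e-7):
    """Binary cross-entropy summed over classes (assumed form of each term)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def total_loss(y, p_a, p_b, p_final):
    """Sum of the salient-region, secondary-region and final classification losses."""
    return bce(y, p_a) + bce(y, p_b) + bce(y, p_final)

y = np.array([1.0, 0.0, 0.0])           # image-level label: class 0 present
confident = np.array([0.9, 0.1, 0.1])   # predictions that agree with the label
mistaken = np.array([0.1, 0.9, 0.9])    # predictions that contradict it
good = total_loss(y, confident, confident, confident)
bad = total_loss(y, mistaken, mistaken, mistaken)
```

Because the three terms share the same labels, a gradient step on the sum trains both branch classifiers and the final prediction head at once.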

8. The training platform uses an NVIDIA GeForce GTX TITAN X GPU, and the network is built with the PyTorch framework. The batch size is set to 20, the initial learning rate to 0.001, the momentum to 0.9 and the weight decay to 0.0005, and the optimization method is gradient descent.
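To make the role of these hyperparameters concrete, the sketch below applies one update of gradient descent with momentum and weight decay. The update rule shown (velocity accumulates the decayed gradient, then scales by the learning rate) is one common formulation and is an assumption about the optimizer variant; the parameter and gradient values are toy numbers.

```python
import numpy as np

# Hyperparameters as stated in the text.
config = {"batch_size": 20, "lr": 0.001, "momentum": 0.9, "weight_decay": 0.0005}

def sgd_step(param, grad, velocity, cfg):
    """One momentum SGD update with L2 weight decay folded into the gradient."""
    grad = grad + cfg["weight_decay"] * param
    velocity = cfg["momentum"] * velocity + grad
    return param - cfg["lr"] * velocity, velocity

param = np.array([1.0, -2.0])
velocity = np.zeros(2)
param, velocity = sgd_step(param, np.array([0.5, 0.5]), velocity, config)
```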

9. For a given input image, the features of the different image channels are first extracted with the feature extraction network. The feature channels are then weighted by the feature selection network: training the model by back propagation amplifies the channels that are useful for classification and suppresses those that interfere with it. Finally, the feature adversarial erasing network obtains the features that remain after the salient features are erased, the two sets of features are fused, and the model is trained with the total loss function.

10. The results of some of the experiments of the present invention are shown in FIG. 4. From the results of fig. 4, it can be seen that for the object to be detected in the image, the positioning frame not only can well surround the detected object, but also has high confidence for the detected object.

Finally, it should be noted that: the above examples are only intended to illustrate the invention and do not limit the technical solutions described in the present invention. Therefore, although the present invention has been described in detail with reference to the above examples, it should be understood by those skilled in the art that the present invention may be modified and replaced with equivalents without departing from the spirit and scope of the invention, and all such modifications and improvements are intended to be included within the scope of the claims.

Claims

1. An adversarial erasing weakly supervised target detection method based on feature channel selection, characterized by comprising the following steps:

Step 1: data preprocessing;

Step 2: generating target region candidate boxes

generating a candidate box set R for each preprocessed image by a selective search algorithm, namely generating a series of rectangular regions in which target objects may exist in the image, and retaining only candidate boxes whose similarity and overlap ratio are smaller than a threshold A;

and step 3: an integral network model is constructed, and the model is divided into six parts: the system comprises a primary feature extraction network, a salient region feature extraction network, a secondary region feature learning network, a comprehensive feature extraction layer, an interested region pooling layer and a global average pooling layer;

the preliminary feature extraction network is used for extracting preliminary features of the image and is input into the data set preprocessed in the step 1;

the salient region feature extraction network is used for carrying out feature channel selection on the primary features of the image to obtain a salient region feature image M ^A ；

The secondary area feature learning network is used for obtaining a secondary feature map M of the image ^B And the network structure is the same as that of the salient region feature extraction network, and the difference is that the network parameters are not shared, and the input is M ^A Features learned by the countermeasure elimination method except for the most significant region

The comprehensive characteristic extraction layer is used for obtaining a comprehensive characteristic graph M ^fuse The method specifically comprises the following steps: characterizing the salient region to M ^A And a secondary region feature map M ^B Solving the maximum value of the corresponding pixel point to obtain a fused comprehensive characteristic graph M ^fuse

M ^fuse ＝max(M ^A ,M ^B ) (6)

Wherein，M ^fuse Obtaining a comprehensive characteristic diagram after fusion;

the region of interest pooling layer is used for mapping the candidate region frame obtained in the step 2 to a comprehensive characteristic map M ^fuse In the method, the interested region pooling is carried out on each candidate frame, and the candidate frames with different sizes are converted into the feature vector V with the same dimension _c ；

The global average pooling layer (GAP) is used for corresponding the target box to the feature vector V _c After twice Global Average Pooling (GAP), obtaining probability values of corresponding categories by utilizing a softmax function;

Step 7: training the overall network model

computing an overall loss function L on the class probabilities predicted by the overall network model, training the overall network model by back propagating the difference between the label values of the training data and the values predicted by the overall network model, and ending the training when the model converges;

Step 8: acquiring data to be detected, processing the data through Steps 1 and 2, and performing target detection with the trained overall network model.

2. The adversarial erasing weakly supervised target detection method based on feature channel selection according to claim 1, characterized in that the preprocessing specifically comprises: performing multi-scale transformation, horizontal flipping and random cropping on the input image.

3. The adversarial erasing weakly supervised target detection method based on feature channel selection according to claim 1, characterized in that the preliminary feature extraction network model is preferably a VGG16 model.

4. The adversarial erasing weakly supervised target detection method based on feature channel selection according to claim 1, characterized in that the salient region feature extraction network is specifically as follows:

Step 4.1: applying a 1 × 1 convolution to the H × W × C_1 preliminary feature map extracted by the preliminary feature extraction network to obtain a feature map $u_c$ of size H × W × C, the transformation being given by formula (1):

$$u_c = v_c * X = \sum_{k=1}^{C_1} v_c^{k} * x^{k} \tag{1}$$

wherein $u_c$ is the output after the convolution, $v_c$ denotes the convolution kernel, $X$ denotes the input feature map, H × W is the size of the input feature map, $C_1$ is the number of feature map channels before the convolution, and C is the number of feature map channels after the convolution;

Step 4.2: compressing the output $u_c$ of Step 4.1 by global average pooling to output a tensor $z_c$ of size 1 × 1 × C, global average pooling being given by formula (2):

$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{2}$$

wherein $z_c \in \mathbb{R}^{C}$ is the output feature tensor of size 1 × 1 × C, $u_c$ is the output of Step 4.1, $F_{sq}$ denotes the global average pooling operation on the feature map, H × W is the size of the input feature layer, and i, j are the element coordinate indices in the feature map;

Step 4.3: applying a feature channel selection operation to the output $z_c$ of Step 4.2 to obtain a one-dimensional tensor s, the channel selection formula being:

$$s = F_{ex}(W_2 \, \delta(W_1 z_c)) \tag{3}$$

wherein s is the one-dimensional tensor output after network compression, $W_1$ and $W_2$ are the weight parameters of the channel selection operation, which are optimized iteratively, $\delta$ is the ReLU activation function, and $F_{ex}$ denotes a nonlinear activation function;

Step 4.4: multiplying the weight vector s obtained in Step 4.3 channel by channel with the feature map $u_c$ obtained in Step 4.1 to obtain a weighted feature map x:

$$x = F_{scale}(u_c, s) \tag{4}$$

wherein $F_{scale}$ denotes channel-wise multiplication, s is the channel weight vector obtained in Step 4.3, and $u_c$ is the output of Step 4.1;

Step 4.5: obtaining the salient region feature map $M^A$ from the feature map x of Step 4.4 through the classifier of the salient region feature extraction network model:

$$M^A = f(\phi_A, x, y_i) \tag{5}$$

wherein $M^A$ denotes the obtained feature region, f denotes the classifier of the salient region feature extraction network model, $\phi_A$ denotes the parameters of the network model, x is the weighted feature map obtained in Step 4.4, and $y_i$ is the label corresponding to the image.

5. The adversarial erasing weakly supervised target detection method based on feature channel selection according to claim 1, characterized in that the erased feature map is obtained by the following steps:

applying a set threshold t to the salient region feature map $M^A$ to obtain the most salient feature region $R_t$; then setting the features corresponding to the region $R_t$ to 0 to obtain the feature map.

6. The adversarial erasing weakly supervised target detection method based on feature channel selection according to claim 1, characterized in that the overall loss function L is specifically as follows:

wherein $y_i$ is the value of the corresponding label in the data set, and $\phi_c$ is the final class label probability distribution predicted by the overall network model;

$L_A$ is the loss function of the salient region feature extraction network, $y_i$ is the value of the corresponding label in the data set,

represents the probability distribution of the corresponding labels predicted by the salient region feature network, and c is the number of classes in the data set;

$L_B$ is the loss function of the secondary region feature learning network, $y_i$ is the value of the corresponding label in the data set,

and c is the number of classes in the data set.