CN112215252A - Weakly supervised object detection method based on online hard and easy example mining - Google Patents

Weakly supervised object detection method based on online hard and easy example mining

Info

Publication number
CN112215252A
CN112215252A (application CN202010805922.5A; granted as CN112215252B)
Authority
CN
China
Prior art keywords: picture, candidate, detected, difficult, sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010805922.5A
Other languages
Chinese (zh)
Other versions
CN112215252B (en)
Inventor
许金泉
王振宁
王溢
蔡碧颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanqiang Zhishi Xiamen Technology Co ltd
Original Assignee
Nanqiang Zhishi Xiamen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanqiang Zhishi Xiamen Technology Co ltd
Priority to CN202010805922.5A
Publication of CN112215252A
Application granted
Publication of CN112215252B
Legal status: Active

Classifications

    • G06F18/2415 Pattern recognition: classification techniques based on parametric or probabilistic models
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06V2201/07 Image or video recognition: target detection
    • Y02D10/00 Energy efficient computing


Abstract

The invention discloses a weakly supervised object detection method based on online hard and easy example mining, comprising the following steps: step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network; step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.

Description

Weakly supervised object detection method based on online hard and easy example mining
Technical Field
The invention relates to the fusion of dilated (atrous) features with weakly supervised object detection, and in particular to a weakly supervised object detection method with online hard and easy example mining.
Background
In recent years, with improvements in computer performance and the growth of big data, visual information data has increased dramatically, and multimedia data including still images, moving images, video files, and audio files spread rapidly across social media. As one of the most fundamental problems in computer vision, object detection is widely applied in fields such as object tracking, behavior understanding, human-computer interaction, and face recognition, and has attracted extensive attention and research since the beginning of the 21st century. Humans receive external information mainly through vision, so applications based on visual information are a forward-looking research direction for artificial intelligence. Among them, technologies such as face recognition, video surveillance, object detection, internet image content review, and biometric recognition have become current research focuses. These technologies are widely applied in fields such as medical care, elderly care, transportation, urban operations, and security, for example in medical image diagnosis, pose estimation, station security inspection, autonomous driving, vehicle speed detection, and video surveillance behavior analysis.
Object detection is an extremely important research field in computer vision and machine learning, drawing on cutting-edge knowledge from image processing, pattern recognition, artificial intelligence, and automatic control. Its main task is to quickly and accurately identify and localize targets in an image. With the development of video websites and social networks, people encounter large amounts of multimedia resources such as images and videos, and object detection is widely applied in these areas, for example face detection in social-network pictures, pedestrian detection in images or video sequences, vehicle detection in traffic monitoring, and helping visually impaired people understand visual content.
Recent object detection research has focused primarily on convolutional neural networks (CNNs), which use large-scale data with instance-level labels (i.e., bounding-box labels) during detector training. However, collecting bounding-box labels for a particular class is obviously a time-consuming and laborious task that limits the practical use of such detectors. Collecting image-level labels is much easier than bounding-box labeling: for example, images gathered by querying an image search engine (e.g., Google Image) or a photo-sharing website (e.g., Flickr) can be lightly checked by hand for the presence of the target object. Therefore, the task of weakly supervised object detection (WSOD), i.e., training object detectors with supervision only at the image level, has recently attracted increasing attention.
Early object detection algorithms were based primarily on hand-crafted features. Viola et al. [1] proposed the Viola-Jones (VJ) face detector based on Haar features and an AdaBoost cascade, which greatly advanced the development of object detection. Dalal et al. [2] proposed the histogram of oriented gradients (HOG) feature and used a support vector machine (SVM) [3] as the classifier, further improving detection accuracy and making a major breakthrough in pedestrian detection. Building on the HOG detector, Felzenszwalb et al. [4] proposed the Deformable Part Model (DPM), pushing detection based on hand-crafted features to its peak; its main idea is to split detection of a target into detection of its parts and then aggregate the part results into a final prediction. Because the model is not given labels for the individual parts, it adopts a weakly supervised strategy. However, building complex models on top of low-level feature representations cannot meet the requirements of high accuracy and high speed, and the introduction of deep learning opened a new direction for object detection. Most classical weakly supervised object detection methods follow multiple-instance learning (MIL), representing each image as a bag of instances. Learning alternates between training the object classifier and selecting positive instances with high confidence, and such a setting is sensitive to initialization. Subsequent approaches refine the weakly supervised detection model from different aspects, e.g., improving the initialization [5], regularizing the model with additional cues [6], or improving the MIL constraints [7].
Bilen et al. [8] proposed an end-to-end architecture that performs region selection and classification simultaneously through separate classification and detection heads, with supervision coming from the combined classification scores. Bilen et al. [9] also proposed a smoothed MIL that softly labels target proposals rather than selecting only the highest-scoring proposal. Tang et al. [10] iteratively refine predictions with a multi-stage instance classifier. Other end-to-end trainable WSOD models have been proposed using techniques such as domain adaptation [11], expectation maximization [12][13], and saliency modules [14].
Most current research focuses on developing detector variants, with little discussion of the feature distributions of target classes; however, these distributions are critical to WSOD performance, since localization quality depends on them. WSOD typically relies on a backbone network such as VGG-16 or ResNet-50, pre-trained in a fully supervised fashion on ImageNet using image-level labels. But training a WSOD directly on these backbones still faces three challenges. First, with only image-level annotations, the limited receptive field size produced by the backbone is an obstacle to accurately capturing object boundaries, which hurts WSOD performance; for example, the receptive field of the last convolutional layer of the popular VGG16 network is 196 × 196, yet the shortest edge of the input image is typically between 480 and 1200. Second, deeper layers have different feature characteristics than shallow layers, yet the shallow layers must accept the same convolution pattern, which may confuse detector training: shallow layers focus mainly on low-level visual features that are good at detecting small objects but insufficient to represent global object semantics, while as depth increases, the smaller feature maps may lose small objects entirely. A reasonable design for semantic extraction between shallow and deep layers is therefore required. Third, the imbalance between foreground and background limits accurate bounding-box localization: all candidate boxes of an image are fed to the CNN, but only a few contain positive target objects, so background proposals far outnumber target boxes, and such imbalance prevents effective learning of discriminative features.
Moreover, such imbalance may require more training iterations for classes with few examples, making the training process difficult to converge. Furthermore, identifying multiple objects of the same class in an image is a challenging task.
Based on the above analysis, in existing research high-precision image annotation is a precondition for strongly supervised object detection to achieve good performance; however, factors such as the complexity of backgrounds in real scenes and the diversity of targets make the image annotation task very time-consuming and labor-intensive. The present scheme is proposed on the basis of these facts.
References referred to:
[1] Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. CVPR, 2001.
[2] Dalal N, Triggs B. Histograms of oriented gradients for human detection. CVPR, 2005.
[3] Liao S, Jain A K, Li S Z. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(2): 211-223.
[4] Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model. CVPR, 2008: 1-8.
[5] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In NeurIPS, 2014.
[6] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multifold multiple instance learning. TPAMI, 2017.
[7] Xinggang Wang, Zhuotun Zhu, Cong Yao, and Xiang Bai. Relaxed Multiple-Instance SVM with Application to Object Discovery. In ICCV, 2015.
[8] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
[9] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, 2015.
[10] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple Instance Detection Network with Online Instance Classifier Refinement. In CVPR, 2017.
[11] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly Supervised Object Localization with Progressive Domain Adaptation. In CVPR, 2016.
[12] Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep Self-Taught Learning for Weakly Supervised Object Localization. In CVPR, 2017.
[13] Ziang Yan, Jian Liang, Weishen Pan, Jin Li, and Changshui Zhang. Weakly- and semi-supervised object detection with expectation-maximization algorithm. arXiv, 2017.
[14] Baisheng Lai and Xiaojin Gong. Saliency guided end-to-end learning for weakly supervised object detection. arXiv, 2017.
Disclosure of the Invention
The aim of the invention is to provide a weakly supervised object detection method based on online hard and easy example mining which, using a weakly supervised training regime, obtains good features from weak label information alone through low-cost image annotation and achieves a good training result.
In order to achieve the above purpose, the solution of the invention is:
A weakly supervised object detection method based on online hard and easy example mining comprises the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network;
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
In step 1, preprocessing the image to be detected comprises first normalizing the image, then randomly selecting a value from {480, 576, 688, 864, 1200} and scaling the image by the corresponding value.
In step 2, the training of the neural network comprises the following steps:
step a1, given a dataset with image-level labels, split it into a training picture sample set and a test picture sample set;
step a2, randomly select an image I from the training picture sample set, input the image I, its image-level label y, and its candidate boxes τ into the backbone network of the neural network, and the backbone outputs the image features to the dilated convolution module;
step a3, the dilated module, which consists from input to output of a dilated convolution layer, a second dilated convolution layer, and a sum operation, processes the image features from the backbone in sequence; the features of each candidate box are then obtained through the RoI Pooling operation;
step a4, the features of each candidate box obtained in step a3 are fed into a first fully connected layer;
step a5, the output of the first fully connected layer flows into two branches, a detection branch and a classification branch, each consisting of a fully connected layer followed by a softmax layer;
step a6, the two groups of features output by the detection branch and the classification branch are multiplied element-wise to obtain the feature σ;
step a7, based on the feature σ output in step a6, obtain N_bg hard sample candidate boxes and N_fg easy sample candidate boxes;
step a8, for the N_bg hard and N_fg easy sample candidate boxes selected in step a7, obtain the features of each candidate box through the RoI Pooling operation, then repeat steps a5-a6 to obtain the new feature σ̂;
step a9, sum the feature σ̂ obtained in step a8 over the candidate boxes of each class to obtain the image-level classification score;
step a10, compute the cross-entropy loss between the classification scores obtained in step a9 and the true image-level labels to obtain the network loss.
In step a4, the output feature map of the first fully connected layer is

x^c ∈ R^(N×|τ|)

where N is the total number of categories, |τ| is the number of candidate boxes, and x^c is the feature map.
In step a5, after the feature map x^c passes through the fully connected layer of the detection branch, a softmax operation is applied along its "candidate box" axis:

$[\sigma_{det}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^d_{in}}}$

where e is the natural constant, |τ| is the number of candidate boxes, x^d is the feature map to which the softmax over the "candidate box" axis is applied, n is a temporary index running from 1 to |τ|, and $[\sigma_{det}(x^d)]_{ij}$ is the probability value assigned to the j-th candidate box for the i-th category.
After the feature map x^c passes through the fully connected layer of the classification branch, a softmax operation is applied along the "class" axis:

$[\sigma_{cls}(x^l)]_{ij} = \frac{e^{x^l_{ij}}}{\sum_{m=1}^{N} e^{x^l_{mj}}}$

where e is the natural constant, N is the total number of categories, x^l is the feature map to which the softmax over the "class" axis is applied, m is a temporary index running from 1 to N, and $[\sigma_{cls}(x^l)]_{ij}$ is the probability value that the j-th candidate box belongs to the i-th category.
The specific procedure of step a7 is as follows: first, the candidate boxes are sorted in descending order by their instance-level prediction values; then the sorted candidate boxes are traversed according to the true image-level label of the image to be detected, and a candidate box is selected as a hard sample if its corresponding image-level label is 1 and as an easy sample otherwise, until the numbers of hard and easy samples reach N_bg and N_fg respectively or the traversal ends, where N_bg and N_fg are the preset numbers of hard and easy samples.
In step a10, the network loss L is calculated according to the following formula:

$L = -\sum_{k=1}^{N}\left[ y_k \log s_k + (1 - y_k)\log(1 - s_k)\right]$

where y_k is the true image-level label indicating whether the image to be detected belongs to the k-th category, and s_k is the probability value for the k-th category output by the neural network during the training stage.
With the above scheme, the invention has the following outstanding advantages:
First, the invention provides a novel dilated convolution module for the WSOD task, which enlarges the receptive field of the feature map while retaining more salient high-level semantic features;
Second, to address the class-imbalance problem in WSOD, the invention quantitatively investigates its cause and proposes an effective online hard and easy example mining algorithm whose gradients are stable and which converges quickly during training.
Drawings
FIG. 1 is a schematic diagram of the network architecture of the neural network of the invention;
FIG. 2 is a schematic diagram of the dilated convolution module of the invention.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a weakly supervised object detection method based on online hard and easy example mining, comprising the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into the neural network. During preprocessing, the image is first normalized; a value is then randomly selected from {480, 576, 688, 864, 1200} and the image is scaled by the corresponding value. In this embodiment, the candidate boxes of the image are generated with the multiscale combinatorial grouping (MCG) method;
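The preprocessing of step 1 can be sketched in plain NumPy. This is a minimal illustration, not the patent's implementation: the ImageNet channel mean/std used for the normalization step are an assumption (the patent only says the image is "normalized"), and nearest-neighbour resampling stands in for whatever interpolation the original code used.

```python
import numpy as np

# Scale set from step 1; the ImageNet mean/std are an assumption.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])
SCALES = [480, 576, 688, 864, 1200]

def preprocess(image, rng=None):
    """Normalize an HxWx3 image (values in [0, 1]) and rescale it so that
    its shorter side equals a value drawn at random from SCALES."""
    if rng is None:
        rng = np.random.default_rng()
    normalized = (image - MEAN) / STD
    target = int(rng.choice(SCALES))           # random scale value
    h, w = image.shape[:2]
    factor = target / min(h, w)                # shorter side -> target
    new_h, new_w = round(h * factor), round(w * factor)
    # Nearest-neighbour resampling keeps the sketch dependency-free.
    rows = np.clip((np.arange(new_h) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / factor).astype(int), 0, w - 1)
    return normalized[rows][:, cols], factor
```

The returned `factor` would also be applied to the MCG candidate boxes so that they stay aligned with the rescaled image.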
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
As shown in fig. 1, the neural network of the invention mainly comprises four parts: a CNN backbone feature-extraction network, a dilated block (DB) module, a weakly supervised detection branch, and an online hard and easy instance mining (OHEIM) branch.
The training method of the neural network comprises the following steps:
step a1, given a dataset with image-level labels, split it into a training picture sample set and a test picture sample set;
step a2, randomly select an image I from the training picture sample set, input the image I, its image-level label y, and its candidate boxes τ into the backbone network of the neural network, and the backbone outputs the image features to the DB module;
step a3, as shown in fig. 2, the DB module consists, from input to output, of a dilated convolution layer, a second dilated convolution layer, and a sum operation; after it processes the image features from the backbone in sequence, the features of each candidate box are obtained through the RoI Pooling operation;
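The DB module's structure (two dilated convolutions followed by a sum) can be sketched in plain NumPy. Since FIG. 2 is not reproduced in this text, the dilation rates and the residual-style sum with the module input are assumptions; `dilated_conv2d` is a naive single-channel illustration of how dilation enlarges the receptive field without shrinking the feature map.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Naive 'same'-padded 2-D dilated (atrous) convolution on a
    single-channel feature map x with a k x k kernel."""
    k = kernel.shape[0]
    eff = dilation * (k - 1) + 1          # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):                    # accumulate shifted copies
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def dilated_block(x, k1, k2, d1=2, d2=4):
    """DB module sketch: two dilated convolutions in sequence, whose output
    is summed with the input (the 'sum operation'); d1, d2 and the
    residual-style sum are assumptions."""
    y = dilated_conv2d(x, k1, d1)
    y = dilated_conv2d(y, k2, d2)
    return x + y
```

With a 3 × 3 kernel, dilation rates 2 and 4 give the stacked block an effective receptive field of 13 × 13 instead of 5 × 5, which is the enlargement the patent attributes to the module.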
step a4, the features of each candidate box obtained in step a3 are fed into a first fully connected layer, whose output feature map is

x^c ∈ R^(N×|τ|)

where N is the total number of categories, |τ| is the number of candidate boxes, and x^c is the feature map;
step a5, the output of the first fully connected layer flows into two branches, a detection branch and a classification branch, each consisting of a fully connected layer followed by a softmax layer. After x^c passes through the fully connected layer of the detection branch, a softmax operation is applied along its "candidate box" axis:

$[\sigma_{det}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^d_{in}}}$

where e is the natural constant, |τ| is the number of candidate boxes, x^d is the feature map to which the softmax over the "candidate box" axis is applied, n is a temporary index running from 1 to |τ|, and $[\sigma_{det}(x^d)]_{ij}$ is the probability value assigned to the j-th candidate box for the i-th category.
After x^c passes through the fully connected layer of the classification branch, a softmax operation is applied along the "class" axis:

$[\sigma_{cls}(x^l)]_{ij} = \frac{e^{x^l_{ij}}}{\sum_{m=1}^{N} e^{x^l_{mj}}}$

where e is the natural constant, N is the total number of categories, x^l is the feature map to which the softmax over the "class" axis is applied, m is a temporary index running from 1 to N, and $[\sigma_{cls}(x^l)]_{ij}$ is the probability value that the j-th candidate box belongs to the i-th category.
Step a6, the two groups of features output by the detection branch and the classification branch are multiplied element-wise:

σ = σ_cls(x^l) ⊙ σ_det(x^d)

where σ ∈ R^(N×|τ|) and ⊙ denotes element-wise multiplication.
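The two-branch scoring of steps a5-a6 can be sketched as follows. This is a minimal NumPy illustration of the two softmax axes and the element-wise product; the fully connected layers that would produce x^d and x^l are omitted, and the input matrices are simply random stand-ins.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stabilized softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_branch_scores(x_d, x_l):
    """Step a5-a6 sketch for N x |tau| score matrices: softmax over the
    candidate-box axis in the detection branch, softmax over the class
    axis in the classification branch, then the element-wise product."""
    sigma_det = softmax(x_d, axis=1)   # each class's distribution over boxes
    sigma_cls = softmax(x_l, axis=0)   # each box's distribution over classes
    return sigma_cls * sigma_det       # sigma in R^(N x |tau|)
```

Because each factor lies in [0, 1], every entry of σ does too, and a box only scores highly for a class when both branches agree.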
Step a7, the feature σ output in step a6 is processed by the OHEIM algorithm (shown in Table 1) to obtain a better set of candidate boxes σ̂. The inputs of the algorithm are the instance-level prediction values of the candidate boxes, the image-level prediction value s of the image to be detected, the true image-level label y of the image, the features F_rois of each candidate box, and the preset numbers of hard (N_bg) and easy (N_fg) samples. The specific processing is as follows: first, the candidate boxes are sorted in descending order by their instance-level prediction values; then the sorted candidate boxes are traversed according to the true image-level label, and a candidate box is selected as a hard sample if the corresponding image-level label is 1 and as an easy sample otherwise, until the numbers of hard and easy samples reach N_bg and N_fg respectively or the traversal ends; finally, the selected hard and easy samples are returned.
TABLE 1 (the OHEIM algorithm pseudocode, given as an image in the original patent and not reproduced here)
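A minimal sketch of the OHEIM selection described above. Since the Table 1 pseudocode is not reproduced in this text, reading the per-box image-level label off each box's argmax class is an assumption; the function returns box indices rather than features.

```python
import numpy as np

def oheim(scores, y, n_bg, n_fg):
    """OHEIM selection sketch. scores is the N x |tau| matrix sigma, y is
    the 0/1 image-level label vector. Boxes are sorted by their
    instance-level prediction in descending order and traversed, taking a
    box as a hard sample when the image-level label of its predicted
    (argmax) class is 1 and as an easy sample otherwise, until n_bg hard
    and n_fg easy samples are collected or the traversal ends."""
    best = scores.max(axis=0)       # instance-level prediction per box
    cls = scores.argmax(axis=0)     # predicted class per box (assumption)
    order = np.argsort(-best)       # descending by prediction value
    hard, easy = [], []
    for j in order:
        if y[cls[j]] == 1 and len(hard) < n_bg:
            hard.append(int(j))
        elif y[cls[j]] == 0 and len(easy) < n_fg:
            easy.append(int(j))
        if len(hard) == n_bg and len(easy) == n_fg:
            break
    return hard, easy
```

The returned index lists would then be used to gather the corresponding rows of F_rois for the second pass through steps a5-a6.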
Step a8, the N_bg hard sample candidate boxes and N_fg easy sample candidate boxes selected in step a7 are passed through the RoI Pooling operation to obtain the features of each candidate box; following the processing of steps a5 and a6, the two groups of features output by the detection branch and the classification branch are multiplied to obtain σ̂.
Step a9, the feature σ̂ output in step a8 is summed over the candidate boxes of each class to obtain the image-level classification score:

$s_k = \sum_{j=1}^{|\tau|} \hat{\sigma}_{kj}$
Step a10, the cross-entropy loss between the score obtained in step a9 and the true image-level label gives the network loss:

$L = -\sum_{k=1}^{N}\left[ y_k \log s_k + (1 - y_k)\log(1 - s_k)\right]$

where y_k is the true image-level label indicating whether the image to be detected belongs to the k-th category, and s_k is the probability value for the k-th category output by the neural network during the training stage.
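Steps a9-a10 can be sketched as follows, reading the patent's cross-entropy formula as a standard multi-label binary cross-entropy over image-level labels (an assumption consistent with the two-term form of the loss).

```python
import numpy as np

def image_level_loss(sigma_hat, y, eps=1e-12):
    """Sum the refined N x |tau| proposal scores over the candidate boxes
    to get a per-class image score s_k, then apply binary cross-entropy
    against the 0/1 image-level label vector y."""
    s = sigma_hat.sum(axis=1)           # s_k: image-level class score
    s = np.clip(s, eps, 1.0 - eps)      # numerical safety for the logs
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```

For this reading to be well posed, each class's proposal scores must sum to a value in (0, 1), which holds for the product of the two softmax branches when few boxes dominate.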
The effects of the present invention are further illustrated by the following simulation experiments.
1) Simulation conditions
The invention was developed on an Ubuntu platform with a deep learning framework based on Caffe2. The language used is mainly Python, and development was carried out efficiently with the Detectron framework.
2) Simulation content
The PASCAL VOC 2007 and PASCAL VOC 2012 datasets were used; the network was trained according to the above steps and evaluated on the test sets. Tables 2 and 3 give the results of the invention and other methods on the VOC 2007 and VOC 2012 datasets, respectively. Direct numerical comparison shows that the method achieves the best results among the compared methods, where "ours" denotes the invention and the rest are existing methods. Evaluation uses average precision, compared numerically over the 20 prediction categories such as "plane", "bicycle", "bird", and "boat". The invention reaches a mean average precision (mAP) of 45% on the VOC 2007 dataset, and using the detection results as pseudo labels to train a Fast-RCNN network reaches a mAP of 53.0%, higher than the other methods. On the VOC 2012 dataset the method exceeds the other methods in both mAP and correct localization rate (CorLoc), reaching 40.2% and 65.4%, respectively.
TABLE 2 (detection results on PASCAL VOC 2007, given as an image in the original patent)
TABLE 3 (detection results on PASCAL VOC 2012, given as an image in the original patent)
The above embodiments only illustrate the technical idea of the invention and do not thereby limit its scope of protection; any modification made on the basis of the technical scheme according to the technical idea of the invention falls within the protection scope of the invention.

Claims (7)

1. A weakly supervised object detection method based on online hard and easy example mining, characterized by comprising the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network;
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
2. The weakly supervised object detection method based on online hard and easy example mining of claim 1, wherein in step 1 the preprocessing of the image to be detected comprises first normalizing the image, then randomly selecting a value from {480, 576, 688, 864, 1200} and scaling the image by that value.
3. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 1, wherein: in step 2, the training method of the neural network includes the following steps:
step a1, a data set with image level labels is given, and the set is divided into a training picture sample set and a test picture sample set;
step a2, randomly selecting a picture I from the training picture sample set, inputting the picture I, its corresponding image-level label y and its corresponding candidate boxes τ into the backbone network of the neural network, and outputting the image features from the backbone network to the dilated convolution module;
step a3, processing the image features input by the backbone network sequentially through the dilated convolution module, which consists, in order from input to output, of two dilated convolution layers and a merging operation, and then obtaining the features of each candidate box through the RoI Pooling operation;
step a4, the characteristics of each candidate frame obtained in the step a3 are sent into a first full connection layer;
step a5, the output of the first fully connected layer feeds two data-flow branches, namely a detection branch and a classification branch, each of which comprises, in order, a fully connected layer and a softmax layer;
step a6, multiplying the two groups of characteristics output by the detection branch and the classification branch to obtain a characteristic sigma;
step a7, obtaining $N_{bg}$ hard-sample candidate boxes and $N_{fg}$ easy-sample candidate boxes based on the feature σ output in step a6;
step a8, for the $N_{bg}$ hard-sample candidate boxes and $N_{fg}$ easy-sample candidate boxes selected in step a7, obtaining the features of each candidate box through the RoI Pooling operation, and then repeating steps a5-a6 to obtain new features $\hat{\sigma}$;
step a9, summing the features $\hat{\sigma}$ obtained in step a8 along the candidate-box axis to obtain the picture-level classification score;
step a10, computing the cross entropy loss between the classification scores obtained in step a9 and the real picture-level labels to obtain the network loss.
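The dilated (atrous) convolutions used in the module of step a3 enlarge the receptive field without downsampling or adding parameters. A minimal 1-D sketch of the dilation mechanism (illustrative only; the patent's module operates on 2-D feature maps):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid-mode 1-D convolution with holes: the kernel taps are spaced
    `dilation` samples apart, so a k-tap kernel covers a receptive field
    of (k - 1) * dilation + 1 input samples."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # receptive field of one output sample
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out
```

With dilation = 1 this reduces to an ordinary convolution; larger dilations widen the context each output sees, which is the motivation for using such layers between the backbone and RoI Pooling.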
4. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a4, the output feature map of the first fully connected layer is:
$x^c \in \mathbb{R}^{N \times |\tau|}$
where N is the total number of categories, $|\tau|$ is the number of candidate boxes, and $x^c$ is the feature map.
5. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a5, after the feature map $x^c$ passes through the fully connected layer in the detection branch, a softmax operation is applied along its "candidate box" axis, yielding:
$x^d_{ij} = \dfrac{e^{x^c_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^c_{in}}}$
where e is the natural constant, $|\tau|$ is the number of candidate boxes, $x^d$ is the feature map obtained by applying the softmax operation along the "candidate box" axis, n is a temporary index ranging from 1 to $|\tau|$, and $x^d_{in}$ is the probability value that the nth candidate box belongs to the ith category;
after the feature map $x^c$ passes through the fully connected layer in the classification branch, a softmax operation is applied along the "class" axis, yielding:
$x^l_{ij} = \dfrac{e^{x^c_{ij}}}{\sum_{m=1}^{N} e^{x^c_{mj}}}$
where e is the natural constant, N is the total number of categories, $x^l$ is the feature map obtained by applying the softmax operation along the "class" axis, m is a temporary index ranging from 1 to N, and $x^l_{mj}$ is the probability value that the jth candidate box contains the mth category.
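The two normalizations of claim 5 differ only in the softmax axis, and can be written compactly in NumPy. In this sketch `x_c` stands for the N×|τ| feature map $x^c$ fed to both branches; this is a simplification, since in the claim each branch first applies its own fully connected layer.

```python
import numpy as np

def two_stream_softmax(x_c):
    """Detection branch: softmax over the candidate-box axis (axis=1),
    so each category's scores over boxes sum to 1.
    Classification branch: softmax over the class axis (axis=0),
    so each box's scores over categories sum to 1."""
    e = np.exp(x_c - x_c.max())              # constant shift for numerical stability
    x_d = e / e.sum(axis=1, keepdims=True)   # per-category distribution over boxes
    x_l = e / e.sum(axis=0, keepdims=True)   # per-box distribution over classes
    return x_d, x_l
```

Multiplying the two outputs element-wise then gives the combined per-box, per-class feature of step a6.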
6. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: the specific process of the step a7 is as follows: firstly, sorting the candidate boxes in descending order according to their instance-level prediction scores; then traversing the sorted candidate boxes according to the real picture-level label of the picture to be detected: if the picture-level label corresponding to a candidate box is 1, the candidate box is selected as a difficult sample, otherwise it is selected as an easy sample, until the numbers of difficult samples and easy samples reach $N_{bg}$ and $N_{fg}$ respectively or the traversal ends, where $N_{bg}$ and $N_{fg}$ are the preset numbers of difficult samples and easy samples, respectively.
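The selection loop of claim 6 can be sketched as follows; how ties are broken and how box-to-label association is computed are assumptions, and the `N_bg`-for-hard / `N_fg`-for-easy naming is preserved from the claim.

```python
def mine_samples(boxes, scores, box_labels, n_bg, n_fg):
    """Sort candidate boxes by instance-level score (descending), then
    walk the ranking: a box whose corresponding picture-level label is 1
    is collected as a hard sample, otherwise as an easy sample, stopping
    once n_bg hard and n_fg easy samples are found or the list ends."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    hard, easy = [], []
    for i in order:
        if box_labels[i] == 1 and len(hard) < n_bg:
            hard.append(boxes[i])
        elif box_labels[i] == 0 and len(easy) < n_fg:
            easy.append(boxes[i])
        if len(hard) == n_bg and len(easy) == n_fg:
            break
    return hard, easy
```

The selected boxes are then re-pooled and re-scored in step a8.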
7. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a10, the network loss is calculated according to the following formulas:
$s_k = \sum_{j=1}^{|\tau|} \hat{\sigma}_{kj}$
$L = -\sum_{k=1}^{N} \left[ y_k \log s_k + (1 - y_k) \log(1 - s_k) \right]$
where $\hat{\sigma}$ denotes the features obtained in step a8, $y_k$ is the real picture-level label indicating whether the picture to be detected belongs to the kth category, and $s_k$ is the probability value of the kth category output by the neural network in the training stage.
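The picture-level score and loss of steps a9-a10 can be sketched as follows. This is a reconstruction under two assumptions: the picture-level score of class k is the sum of the per-box scores along the candidate-box axis, and the loss is the multi-label binary cross-entropy defined over all N categories.

```python
import numpy as np

def image_level_loss(sigma_hat, y):
    """sigma_hat: N x |tau| array of per-box, per-class scores (step a8).
    Sum over the candidate-box axis to obtain one score s_k per class,
    then apply the multi-label binary cross-entropy of claim 7."""
    s = sigma_hat.sum(axis=1)            # s_k, picture-level class scores
    s = np.clip(s, 1e-7, 1 - 1e-7)       # avoid log(0)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```

The loss is low when classes present in the image-level label accumulate score mass and absent classes do not.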
CN202010805922.5A 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining Active CN112215252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010805922.5A CN112215252B (en) 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining


Publications (2)

Publication Number Publication Date
CN112215252A true CN112215252A (en) 2021-01-12
CN112215252B CN112215252B (en) 2023-05-30

Family

ID=74058975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010805922.5A Active CN112215252B (en) 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining

Country Status (1)

Country Link
CN (1) CN112215252B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591931A (en) * 2021-07-06 2021-11-02 厦门路桥信息股份有限公司 Weak supervision target positioning method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330027A (en) * 2017-06-23 2017-11-07 中国科学院信息工程研究所 A kind of Weakly supervised depth station caption detection method
CN107665351A (en) * 2017-05-06 2018-02-06 北京航空航天大学 The airfield detection method excavated based on difficult sample
US20190228313A1 (en) * 2018-01-23 2019-07-25 Insurance Services Office, Inc. Computer Vision Systems and Methods for Unsupervised Representation Learning by Sorting Sequences
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙皓泽;常天庆;张雷;杨国振;: "基于Top-down网络结构的坦克装甲目标检测", 计算机仿真 *


Also Published As

Publication number Publication date
CN112215252B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Chen et al. Once for all: a two-flow convolutional neural network for visual tracking
Hasani et al. Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields
Xiong et al. Recognize complex events from static images by fusing deep channels
Zhang et al. Loop closure detection for visual SLAM systems using convolutional neural network
Shou et al. Temporal action localization in untrimmed videos via multi-stage cnns
Kong et al. Hypernet: Towards accurate region proposal generation and joint object detection
Yuan et al. Discriminative video pattern search for efficient action detection
Lin et al. RSCM: Region selection and concurrency model for multi-class weather recognition
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113221770B (en) Cross-domain pedestrian re-recognition method and system based on multi-feature hybrid learning
Xiao et al. Cross domain knowledge transfer for person re-identification
Ding et al. Let features decide for themselves: Feature mask network for person re-identification
Luo et al. Visual attention in multi-label image classification
Symeonidis et al. Neural attention-driven non-maximum suppression for person detection
Song et al. A review of object detectors in deep learning
Tong et al. A review of indoor-outdoor scene classification
Najibi et al. Towards the success rate of one: Real-time unconstrained salient object detection
CN112215252B (en) Weak supervision target detection method based on-line difficult sample mining
Saeidi et al. Deep learning based on CNN for pedestrian detection: an overview and analysis
Zheng et al. Bi-heterogeneous Convolutional Neural Network for UAV-based dynamic scene classification
Wang et al. Human action recognition based on deep network and feature fusion
Sun et al. Video-based parent-child relationship prediction
Beikmohammadi et al. Mixture of deep-based representation and shallow classifiers to recognize human activities
Li et al. Multiple instance discriminative dictionary learning for action recognition
Ren et al. Video-based emotion recognition using multi-dichotomy RNN-DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant