CN112215252A - Weakly supervised object detection method based on online hard and easy example mining - Google Patents

Weakly supervised object detection method based on online hard and easy example mining

Info

Publication number
CN112215252A
CN112215252A (application CN202010805922.5A; granted as CN112215252B)
Authority
CN
China
Prior art keywords: picture, candidate, detected, difficult, sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010805922.5A
Other languages
Chinese (zh)
Other versions
CN112215252B (en)
Inventor
许金泉
王振宁
王溢
蔡碧颖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanqiang Zhishi Xiamen Technology Co ltd
Original Assignee
Nanqiang Zhishi Xiamen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanqiang Zhishi Xiamen Technology Co ltd
Priority to CN202010805922.5A
Publication of CN112215252A
Application granted
Publication of CN112215252B
Legal status: Active

Classifications

    • G06F18/2415 Pattern recognition: classification techniques based on parametric or probabilistic models
    • G06N3/045 Neural networks: combinations of networks
    • G06N3/08 Neural networks: learning methods
    • G06V2201/07 Image or video recognition: target detection
    • Y02D10/00 Energy efficient computing


Abstract

The invention discloses a weakly supervised object detection method based on online hard and easy example mining, comprising the following steps: step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network; step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.

Description

Weakly supervised object detection method based on online hard and easy example mining
Technical Field
The invention relates to the fusion of dilated (atrous) features with weakly supervised object detection, and in particular to a weakly supervised object detection method with online hard and easy example mining.
Background
In recent years, with improvements in computer performance and the growth of big data, visual information data has increased dramatically, and multimedia data including still images, moving images, video files, and audio files spread rapidly across social media. As one of the most fundamental problems in computer vision, object detection is widely applied in fields such as object tracking, behavior understanding, human-computer interaction, and face recognition, and has attracted extensive attention and research since the beginning of the 21st century. Humans receive external information mainly through vision, so applications based on visual information are a forward-looking research direction for artificial intelligence. Among them, technologies such as face recognition, video surveillance, object detection, internet image content review, and biometric recognition have become current research focuses. These technologies are widely applied in fields such as medical care, elderly care, transportation, urban operations, and security, for example in medical image diagnosis, pose estimation, station security inspection, autonomous driving, vehicle speed detection, and video surveillance behavior analysis.
Object detection is an extremely important research field in computer vision and machine learning, drawing on cutting-edge knowledge from image processing, pattern recognition, artificial intelligence, and automatic control. Its main task is to quickly and accurately identify and localize targets in an image. With the development of video websites and social networks, people encounter large amounts of multimedia resources such as images and videos, and object detection is widely applied in these areas, for example face detection in social-network pictures, pedestrian detection in images or video sequences, vehicle detection in traffic monitoring, and helping visually impaired people understand visual content.
Recent object detection research has focused primarily on convolutional neural networks (CNNs), which use large-scale data with instance-level labels (i.e., bounding-box labels) during detector training. However, collecting bounding-box labels for a particular class is obviously a time-consuming and laborious task that limits the practical use of such detectors. Collecting image-level labels is much easier than bounding-box labeling: for example, images gathered by querying an image search engine (e.g., Google Image) or a photo-sharing website (e.g., Flickr) can be lightly checked by hand for the presence of the target object. Therefore, the task of weakly supervised object detection (WSOD), i.e., training object detectors with supervision only at the image level, has recently attracted increasing attention.
Early object detection algorithms were based primarily on hand-crafted features. Viola et al. [1] proposed the Viola-Jones (VJ) face detector based on Haar features and an AdaBoost cascade, which greatly advanced the development of object detection. Dalal et al. [2] proposed the histogram of oriented gradients (HOG) feature and used a support vector machine (SVM) [3] as the classifier, further improving detection accuracy and making a major breakthrough in pedestrian detection. Building on the HOG detector, Felzenszwalb et al. [4] proposed the Deformable Part Model (DPM), pushing detection based on hand-crafted features to its peak; its main idea is to split detection of a target into detection of its parts and then aggregate the part results into a final prediction. Because the model is not given labels for the individual parts, it adopts a weakly supervised strategy. However, building complex models on top of low-level feature representations cannot meet the requirements of high accuracy and high speed, and the introduction of deep learning opened a new direction for object detection. Most classical weakly supervised object detection methods follow multiple-instance learning (MIL), representing each image as a bag of instances. Learning alternates between training the object classifier and selecting positive instances with high confidence, and such a setting is sensitive to initialization. Subsequent approaches refine the weakly supervised detection model from different aspects, e.g., improving the initialization [5], regularizing the model with additional cues [6], or improving the MIL constraints [7].
Bilen et al. [8] proposed an end-to-end architecture that performs region selection and classification simultaneously through separate classification and detection heads, with supervision coming from the combined classification scores. Bilen et al. [9] also proposed a smoothed MIL that softly labels target proposals rather than selecting only the highest-scoring proposal. Tang et al. [10] iteratively refine predictions with a multi-stage instance classifier. Other end-to-end trainable WSOD models have been proposed using techniques such as domain adaptation [11], expectation maximization [12][13], and saliency modules [14].
Most current research focuses on developing detector variants, with little discussion of the feature distributions of target classes; however, these distributions are critical to WSOD performance, since localization quality depends on them. WSOD typically relies on a backbone network such as VGG-16 or ResNet-50, pre-trained in a fully supervised fashion on ImageNet using image-level labels. But training a WSOD directly on these backbones still faces three challenges. First, with only image-level annotations, the limited receptive field size produced by the backbone is an obstacle to accurately capturing object boundaries, which hurts WSOD performance; for example, the receptive field of the last convolutional layer of the popular VGG16 network is 196 × 196, yet the shortest edge of the input image is typically between 480 and 1200. Second, deeper layers have different feature characteristics than shallow layers, yet the shallow layers must accept the same convolution pattern, which may confuse detector training: shallow layers focus mainly on low-level visual features that are good at detecting small objects but insufficient to represent global object semantics, while as depth increases, the smaller feature maps may lose small objects entirely. A reasonable design for semantic extraction between shallow and deep layers is therefore required. Third, the imbalance between foreground and background limits accurate bounding-box localization: all candidate boxes of an image are fed to the CNN, but only a few contain positive target objects, so background proposals far outnumber target boxes, and such imbalance prevents effective learning of discriminative features.
Moreover, such imbalance may require more training iterations for classes with few examples, making the training process difficult to converge. Furthermore, identifying multiple objects of the same class in an image is a challenging task.
Based on the above analysis, in existing research high-precision image annotation is a precondition for strongly supervised object detection to achieve good performance; however, factors such as the complexity of backgrounds in real scenes and the diversity of targets make the image annotation task very time-consuming and labor-intensive. The present scheme is proposed on the basis of these facts.
References referred to:
[1] Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. CVPR, 2001.
[2] Dalal N, Triggs B. Histograms of oriented gradients for human detection. CVPR, 2005.
[3] Liao S, Jain A K, Li S Z. A fast and accurate unconstrained face detector. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 38(2): 211-223.
[4] Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model. CVPR, 2008: 1-8.
[5] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In NeurIPS, 2014.
[6] Ramazan Gokberk Cinbis, Jakob Verbeek, and Cordelia Schmid. Weakly supervised object localization with multifold multiple instance learning. TPAMI, 2017.
[7] Xinggang Wang, Zhuotun Zhu, Cong Yao, and Xiang Bai. Relaxed Multiple-Instance SVM with Application to Object Discovery. In ICCV, 2015.
[8] Hakan Bilen and Andrea Vedaldi. Weakly supervised deep detection networks. In CVPR, 2016.
[9] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In CVPR, 2015.
[10] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple Instance Detection Network with Online Instance Classifier Refinement. In CVPR, 2017.
[11] Dong Li, Jia-Bin Huang, Yali Li, Shengjin Wang, and Ming-Hsuan Yang. Weakly Supervised Object Localization with Progressive Domain Adaptation. In CVPR, 2016.
[12] Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, and Wei Liu. Deep Self-Taught Learning for Weakly Supervised Object Localization. In CVPR, 2017.
[13] Ziang Yan, Jian Liang, Weishen Pan, Jin Li, and Changshui Zhang. Weakly- and semi-supervised object detection with expectation-maximization algorithm. arXiv, 2017.
[14] Baisheng Lai and Xiaojin Gong. Saliency guided end-to-end learning for weakly supervised object detection. arXiv, 2017.
Disclosure of the Invention
The aim of the invention is to provide a weakly supervised object detection method based on online hard and easy example mining which, using a weakly supervised training regime, obtains good features from weak label information alone through low-cost image annotation and achieves a good training result.
In order to achieve the above purpose, the solution of the invention is:
A weakly supervised object detection method based on online hard and easy example mining comprises the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network;
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
In step 1, preprocessing the image to be detected comprises first normalizing the image, then randomly selecting a value from {480, 576, 688, 864, 1200} and scaling the image by the corresponding value.
In step 2, the training of the neural network comprises the following steps:
step a1, given a dataset with image-level labels, split it into a training picture sample set and a test picture sample set;
step a2, randomly select an image I from the training picture sample set, input the image I, its image-level label y, and its candidate boxes τ into the backbone network of the neural network, and the backbone outputs the image features to the dilated convolution module;
step a3, the dilated module, which consists from input to output of a dilated convolution layer, a second dilated convolution layer, and a sum operation, processes the image features from the backbone in sequence; the features of each candidate box are then obtained through the RoI Pooling operation;
step a4, the features of each candidate box obtained in step a3 are fed into a first fully connected layer;
step a5, the output of the first fully connected layer flows into two branches, a detection branch and a classification branch, each consisting of a fully connected layer followed by a softmax layer;
step a6, the two groups of features output by the detection branch and the classification branch are multiplied element-wise to obtain the feature σ;
step a7, based on the feature σ output in step a6, obtain N_bg hard sample candidate boxes and N_fg easy sample candidate boxes;
step a8, for the N_bg hard and N_fg easy sample candidate boxes selected in step a7, obtain the features of each candidate box through the RoI Pooling operation, then repeat steps a5-a6 to obtain the new feature σ̂;
step a9, sum the feature σ̂ obtained in step a8 over the candidate boxes of each class to obtain the image-level classification score;
step a10, compute the cross-entropy loss between the classification scores obtained in step a9 and the true image-level labels to obtain the network loss.
In step a4, the output feature map of the first fully connected layer is

x^c ∈ R^(N×|τ|)

where N is the total number of categories, |τ| is the number of candidate boxes, and x^c is the feature map.
In step a5, after the feature map x^c passes through the fully connected layer of the detection branch, a softmax operation is applied along its "candidate box" axis:

$[\sigma_{det}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^d_{in}}}$

where e is the natural constant, |τ| is the number of candidate boxes, x^d is the feature map to which the softmax over the "candidate box" axis is applied, n is a temporary index running from 1 to |τ|, and $[\sigma_{det}(x^d)]_{ij}$ is the probability value assigned to the j-th candidate box for the i-th category.
After the feature map x^c passes through the fully connected layer of the classification branch, a softmax operation is applied along the "class" axis:

$[\sigma_{cls}(x^l)]_{ij} = \frac{e^{x^l_{ij}}}{\sum_{m=1}^{N} e^{x^l_{mj}}}$

where e is the natural constant, N is the total number of categories, x^l is the feature map to which the softmax over the "class" axis is applied, m is a temporary index running from 1 to N, and $[\sigma_{cls}(x^l)]_{ij}$ is the probability value that the j-th candidate box belongs to the i-th category.
The specific procedure of step a7 is as follows: first, the candidate boxes are sorted in descending order by their instance-level prediction values; then the sorted candidate boxes are traversed according to the true image-level label of the image to be detected, and a candidate box is selected as a hard sample if its corresponding image-level label is 1 and as an easy sample otherwise, until the numbers of hard and easy samples reach N_bg and N_fg respectively or the traversal ends, where N_bg and N_fg are the preset numbers of hard and easy samples.
In step a10, the network loss L is calculated according to the following formula:

$L = -\sum_{k=1}^{N}\left[ y_k \log s_k + (1 - y_k)\log(1 - s_k)\right]$

where y_k is the true image-level label indicating whether the image to be detected belongs to the k-th category, and s_k is the probability value for the k-th category output by the neural network during the training stage.
With the above scheme, the invention has the following outstanding advantages:
First, the invention provides a novel dilated convolution module for the WSOD task, which enlarges the receptive field of the feature map while retaining more salient high-level semantic features;
Second, to address the class-imbalance problem in WSOD, the invention quantitatively investigates its cause and proposes an effective online hard and easy example mining algorithm whose gradients are stable and which converges quickly during training.
Drawings
FIG. 1 is a schematic diagram of the network architecture of the neural network of the invention;
FIG. 2 is a schematic diagram of the dilated convolution module of the invention.
Detailed Description
The technical solution and the advantages of the present invention will be described in detail with reference to the accompanying drawings.
The invention provides a weakly supervised object detection method based on online hard and easy example mining, comprising the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into the neural network. During preprocessing, the image is first normalized; a value is then randomly selected from {480, 576, 688, 864, 1200} and the image is scaled by the corresponding value. In this embodiment, the candidate boxes of the image are generated with the multiscale combinatorial grouping (MCG) method;
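The preprocessing of step 1 can be sketched in plain NumPy. This is a minimal illustration, not the patent's implementation: the ImageNet channel mean/std used for the normalization step are an assumption (the patent only says the image is "normalized"), and nearest-neighbour resampling stands in for whatever interpolation the original code used.

```python
import numpy as np

# Scale set from step 1; the ImageNet mean/std are an assumption.
MEAN = np.array([0.485, 0.456, 0.406])
STD = np.array([0.229, 0.224, 0.225])
SCALES = [480, 576, 688, 864, 1200]

def preprocess(image, rng=None):
    """Normalize an HxWx3 image (values in [0, 1]) and rescale it so that
    its shorter side equals a value drawn at random from SCALES."""
    if rng is None:
        rng = np.random.default_rng()
    normalized = (image - MEAN) / STD
    target = int(rng.choice(SCALES))           # random scale value
    h, w = image.shape[:2]
    factor = target / min(h, w)                # shorter side -> target
    new_h, new_w = round(h * factor), round(w * factor)
    # Nearest-neighbour resampling keeps the sketch dependency-free.
    rows = np.clip((np.arange(new_h) / factor).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / factor).astype(int), 0, w - 1)
    return normalized[rows][:, cols], factor
```

The returned `factor` would also be applied to the MCG candidate boxes so that they stay aligned with the rescaled image.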
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
As shown in fig. 1, the neural network of the invention mainly comprises four parts: a CNN backbone feature-extraction network, a dilated block (DB) module, a weakly supervised detection branch, and an online hard and easy instance mining (OHEIM) branch.
The training method of the neural network comprises the following steps:
step a1, given a dataset with image-level labels, split it into a training picture sample set and a test picture sample set;
step a2, randomly select an image I from the training picture sample set, input the image I, its image-level label y, and its candidate boxes τ into the backbone network of the neural network, and the backbone outputs the image features to the DB module;
step a3, as shown in fig. 2, the DB module consists, from input to output, of a dilated convolution layer, a second dilated convolution layer, and a sum operation; after it processes the image features from the backbone in sequence, the features of each candidate box are obtained through the RoI Pooling operation;
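The DB module's structure (two dilated convolutions followed by a sum) can be sketched in plain NumPy. Since FIG. 2 is not reproduced in this text, the dilation rates and the residual-style sum with the module input are assumptions; `dilated_conv2d` is a naive single-channel illustration of how dilation enlarges the receptive field without shrinking the feature map.

```python
import numpy as np

def dilated_conv2d(x, kernel, dilation):
    """Naive 'same'-padded 2-D dilated (atrous) convolution on a
    single-channel feature map x with a k x k kernel."""
    k = kernel.shape[0]
    eff = dilation * (k - 1) + 1          # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):                    # accumulate shifted copies
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def dilated_block(x, k1, k2, d1=2, d2=4):
    """DB module sketch: two dilated convolutions in sequence, whose output
    is summed with the input (the 'sum operation'); d1, d2 and the
    residual-style sum are assumptions."""
    y = dilated_conv2d(x, k1, d1)
    y = dilated_conv2d(y, k2, d2)
    return x + y
```

With a 3 × 3 kernel, dilation rates 2 and 4 give the stacked block an effective receptive field of 13 × 13 instead of 5 × 5, which is the enlargement the patent attributes to the module.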
step a4, the features of each candidate box obtained in step a3 are fed into a first fully connected layer, whose output feature map is

x^c ∈ R^(N×|τ|)

where N is the total number of categories, |τ| is the number of candidate boxes, and x^c is the feature map;
step a5, the output of the first fully connected layer flows into two branches, a detection branch and a classification branch, each consisting of a fully connected layer followed by a softmax layer. After x^c passes through the fully connected layer of the detection branch, a softmax operation is applied along its "candidate box" axis:

$[\sigma_{det}(x^d)]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^d_{in}}}$

where e is the natural constant, |τ| is the number of candidate boxes, x^d is the feature map to which the softmax over the "candidate box" axis is applied, n is a temporary index running from 1 to |τ|, and $[\sigma_{det}(x^d)]_{ij}$ is the probability value assigned to the j-th candidate box for the i-th category.
After x^c passes through the fully connected layer of the classification branch, a softmax operation is applied along the "class" axis:

$[\sigma_{cls}(x^l)]_{ij} = \frac{e^{x^l_{ij}}}{\sum_{m=1}^{N} e^{x^l_{mj}}}$

where e is the natural constant, N is the total number of categories, x^l is the feature map to which the softmax over the "class" axis is applied, m is a temporary index running from 1 to N, and $[\sigma_{cls}(x^l)]_{ij}$ is the probability value that the j-th candidate box belongs to the i-th category.
Step a6, the two groups of features output by the detection branch and the classification branch are multiplied element-wise:

σ = σ_cls(x^l) ⊙ σ_det(x^d)

where σ ∈ R^(N×|τ|) and ⊙ denotes element-wise multiplication.
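The two-branch scoring of steps a5-a6 can be sketched as follows. This is a minimal NumPy illustration of the two softmax axes and the element-wise product; the fully connected layers that would produce x^d and x^l are omitted, and the input matrices are simply random stand-ins.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stabilized softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_branch_scores(x_d, x_l):
    """Step a5-a6 sketch for N x |tau| score matrices: softmax over the
    candidate-box axis in the detection branch, softmax over the class
    axis in the classification branch, then the element-wise product."""
    sigma_det = softmax(x_d, axis=1)   # each class's distribution over boxes
    sigma_cls = softmax(x_l, axis=0)   # each box's distribution over classes
    return sigma_cls * sigma_det       # sigma in R^(N x |tau|)
```

Because each factor lies in [0, 1], every entry of σ does too, and a box only scores highly for a class when both branches agree.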
Step a7, the feature σ output in step a6 is processed by the OHEIM algorithm (shown in Table 1) to obtain a better set of candidate boxes σ̂. The inputs of the algorithm are the instance-level prediction values of the candidate boxes, the image-level prediction value s of the image to be detected, the true image-level label y of the image, the features F_rois of each candidate box, and the preset numbers of hard (N_bg) and easy (N_fg) samples. The specific processing is as follows: first, the candidate boxes are sorted in descending order by their instance-level prediction values; then the sorted candidate boxes are traversed according to the true image-level label, and a candidate box is selected as a hard sample if the corresponding image-level label is 1 and as an easy sample otherwise, until the numbers of hard and easy samples reach N_bg and N_fg respectively or the traversal ends; finally, the selected hard and easy samples are returned.
TABLE 1 (the OHEIM algorithm pseudocode, given as an image in the original patent and not reproduced here)
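A minimal sketch of the OHEIM selection described above. Since the Table 1 pseudocode is not reproduced in this text, reading the per-box image-level label off each box's argmax class is an assumption; the function returns box indices rather than features.

```python
import numpy as np

def oheim(scores, y, n_bg, n_fg):
    """OHEIM selection sketch. scores is the N x |tau| matrix sigma, y is
    the 0/1 image-level label vector. Boxes are sorted by their
    instance-level prediction in descending order and traversed, taking a
    box as a hard sample when the image-level label of its predicted
    (argmax) class is 1 and as an easy sample otherwise, until n_bg hard
    and n_fg easy samples are collected or the traversal ends."""
    best = scores.max(axis=0)       # instance-level prediction per box
    cls = scores.argmax(axis=0)     # predicted class per box (assumption)
    order = np.argsort(-best)       # descending by prediction value
    hard, easy = [], []
    for j in order:
        if y[cls[j]] == 1 and len(hard) < n_bg:
            hard.append(int(j))
        elif y[cls[j]] == 0 and len(easy) < n_fg:
            easy.append(int(j))
        if len(hard) == n_bg and len(easy) == n_fg:
            break
    return hard, easy
```

The returned index lists would then be used to gather the corresponding rows of F_rois for the second pass through steps a5-a6.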
Step a8, the N_bg hard sample candidate boxes and N_fg easy sample candidate boxes selected in step a7 are passed through the RoI Pooling operation to obtain the features of each candidate box; following the processing of steps a5 and a6, the two groups of features output by the detection branch and the classification branch are multiplied to obtain σ̂.
Step a9, the feature σ̂ output in step a8 is summed over the candidate boxes of each class to obtain the image-level classification score:

$s_k = \sum_{j=1}^{|\tau|} \hat{\sigma}_{kj}$
Step a10, the cross-entropy loss between the score obtained in step a9 and the true image-level label gives the network loss:

$L = -\sum_{k=1}^{N}\left[ y_k \log s_k + (1 - y_k)\log(1 - s_k)\right]$

where y_k is the true image-level label indicating whether the image to be detected belongs to the k-th category, and s_k is the probability value for the k-th category output by the neural network during the training stage.
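Steps a9-a10 can be sketched as follows, reading the patent's cross-entropy formula as a standard multi-label binary cross-entropy over image-level labels (an assumption consistent with the two-term form of the loss).

```python
import numpy as np

def image_level_loss(sigma_hat, y, eps=1e-12):
    """Sum the refined N x |tau| proposal scores over the candidate boxes
    to get a per-class image score s_k, then apply binary cross-entropy
    against the 0/1 image-level label vector y."""
    s = sigma_hat.sum(axis=1)           # s_k: image-level class score
    s = np.clip(s, eps, 1.0 - eps)      # numerical safety for the logs
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```

For this reading to be well posed, each class's proposal scores must sum to a value in (0, 1), which holds for the product of the two softmax branches when few boxes dominate.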
The effects of the present invention are further illustrated by the following simulation experiments.
1) Simulation conditions
The invention was developed on an Ubuntu platform with a deep learning framework based on Caffe2. The language used is mainly Python, and development was carried out efficiently with the Detectron framework.
2) Simulation content
The PASCAL VOC 2007 and PASCAL VOC 2012 datasets were used; the network was trained according to the above steps and evaluated on the test sets. Tables 2 and 3 give the results of the invention and other methods on the VOC 2007 and VOC 2012 datasets, respectively. Direct numerical comparison shows that the method achieves the best results among the compared methods, where "ours" denotes the invention and the rest are existing methods. Evaluation uses average precision, compared numerically over the 20 prediction categories such as "plane", "bicycle", "bird", and "boat". The invention reaches a mean average precision (mAP) of 45% on the VOC 2007 dataset, and using the detection results as pseudo labels to train a Fast-RCNN network reaches a mAP of 53.0%, higher than the other methods. On the VOC 2012 dataset the method exceeds the other methods in both mAP and correct localization rate (CorLoc), reaching 40.2% and 65.4%, respectively.
TABLE 2 (detection results on PASCAL VOC 2007, given as an image in the original patent)
TABLE 3 (detection results on PASCAL VOC 2012, given as an image in the original patent)
The above embodiments only illustrate the technical idea of the invention and do not thereby limit its scope of protection; any modification made on the basis of the technical scheme according to the technical idea of the invention falls within the protection scope of the invention.

Claims (7)

1. A weakly supervised object detection method based on online hard and easy example mining, characterized by comprising the following steps:
step 1, preprocess the image to be detected, then feed the preprocessed image and its corresponding candidate boxes into a neural network;
step 2, the neural network processes the image; during training it outputs a probability value for each category of the image to be detected, and during testing it outputs the coordinates, category, and score of each predicted box.
2. The weakly supervised object detection method based on online hard and easy example mining of claim 1, wherein in step 1 the preprocessing of the image to be detected comprises first normalizing the image, then randomly selecting a value from {480, 576, 688, 864, 1200} and scaling the image by that value.
3. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 1, wherein: in step 2, the training method of the neural network includes the following steps:
step a1, a data set with image level labels is given, and the set is divided into a training picture sample set and a test picture sample set;
step a2, randomly selecting a picture I from the training picture sample set, inputting the picture I, its corresponding image-level label y and its corresponding candidate boxes τ into the backbone network of the neural network, and outputting the image features from the backbone network to the dilated convolution module;
step a3, processing the image features input by the backbone network sequentially through the dilated convolution module, which consists, in order from input to output, of two dilated convolution layers and a merging operation, and then obtaining the features of each candidate box through the RoI Pooling operation;
step a4, the characteristics of each candidate frame obtained in the step a3 are sent into a first full connection layer;
step a5, the output of the first fully connected layer feeds two data-flow branches, namely a detection branch and a classification branch, each of which comprises, in order, a fully connected layer and a softmax layer;
step a6, multiplying the two groups of characteristics output by the detection branch and the classification branch to obtain a characteristic sigma;
step a7, obtaining $N_{bg}$ hard-sample candidate boxes and $N_{fg}$ easy-sample candidate boxes based on the feature σ output in step a6;
step a8, for the $N_{bg}$ hard-sample candidate boxes and $N_{fg}$ easy-sample candidate boxes selected in step a7, obtaining the features of each candidate box through the RoI Pooling operation, and then repeating steps a5-a6 to obtain new features $\hat{\sigma}$;
step a9, summing the features $\hat{\sigma}$ obtained in step a8 along the candidate-box axis to obtain the picture-level classification score;
step a10, computing the cross entropy loss between the classification scores obtained in step a9 and the real picture-level labels to obtain the network loss.
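The dilated (atrous) convolutions used in the module of step a3 enlarge the receptive field without downsampling or adding parameters. A minimal 1-D sketch of the dilation mechanism (illustrative only; the patent's module operates on 2-D feature maps):

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """Valid-mode 1-D convolution with holes: the kernel taps are spaced
    `dilation` samples apart, so a k-tap kernel covers a receptive field
    of (k - 1) * dilation + 1 input samples."""
    k = len(kernel)
    span = (k - 1) * dilation + 1            # receptive field of one output sample
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out
```

With dilation = 1 this reduces to an ordinary convolution; larger dilations widen the context each output sees, which is the motivation for using such layers between the backbone and RoI Pooling.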
4. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a4, the output feature map of the first fully connected layer is:
$x^c \in \mathbb{R}^{N \times |\tau|}$
where N is the total number of categories, $|\tau|$ is the number of candidate boxes, and $x^c$ is the feature map.
5. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a5, after the feature map $x^c$ passes through the fully connected layer in the detection branch, a softmax operation is applied along its "candidate box" axis, yielding:
$x^d_{ij} = \dfrac{e^{x^c_{ij}}}{\sum_{n=1}^{|\tau|} e^{x^c_{in}}}$
where e is the natural constant, $|\tau|$ is the number of candidate boxes, $x^d$ is the feature map obtained by applying the softmax operation along the "candidate box" axis, n is a temporary index ranging from 1 to $|\tau|$, and $x^d_{in}$ is the probability value that the nth candidate box belongs to the ith category;
after the feature map $x^c$ passes through the fully connected layer in the classification branch, a softmax operation is applied along the "class" axis, yielding:
$x^l_{ij} = \dfrac{e^{x^c_{ij}}}{\sum_{m=1}^{N} e^{x^c_{mj}}}$
where e is the natural constant, N is the total number of categories, $x^l$ is the feature map obtained by applying the softmax operation along the "class" axis, m is a temporary index ranging from 1 to N, and $x^l_{mj}$ is the probability value that the jth candidate box contains the mth category.
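The two normalizations of claim 5 differ only in the softmax axis, and can be written compactly in NumPy. In this sketch `x_c` stands for the N×|τ| feature map $x^c$ fed to both branches; this is a simplification, since in the claim each branch first applies its own fully connected layer.

```python
import numpy as np

def two_stream_softmax(x_c):
    """Detection branch: softmax over the candidate-box axis (axis=1),
    so each category's scores over boxes sum to 1.
    Classification branch: softmax over the class axis (axis=0),
    so each box's scores over categories sum to 1."""
    e = np.exp(x_c - x_c.max())              # constant shift for numerical stability
    x_d = e / e.sum(axis=1, keepdims=True)   # per-category distribution over boxes
    x_l = e / e.sum(axis=0, keepdims=True)   # per-box distribution over classes
    return x_d, x_l
```

Multiplying the two outputs element-wise then gives the combined per-box, per-class feature of step a6.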
6. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: the specific process of the step a7 is as follows: firstly, sorting the candidate boxes in descending order according to their instance-level prediction scores; then traversing the sorted candidate boxes according to the real picture-level label of the picture to be detected: if the picture-level label corresponding to a candidate box is 1, the candidate box is selected as a difficult sample, otherwise it is selected as an easy sample, until the numbers of difficult samples and easy samples reach $N_{bg}$ and $N_{fg}$ respectively or the traversal ends, where $N_{bg}$ and $N_{fg}$ are the preset numbers of difficult samples and easy samples, respectively.
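The selection loop of claim 6 can be sketched as follows; how ties are broken and how box-to-label association is computed are assumptions, and the `N_bg`-for-hard / `N_fg`-for-easy naming is preserved from the claim.

```python
def mine_samples(boxes, scores, box_labels, n_bg, n_fg):
    """Sort candidate boxes by instance-level score (descending), then
    walk the ranking: a box whose corresponding picture-level label is 1
    is collected as a hard sample, otherwise as an easy sample, stopping
    once n_bg hard and n_fg easy samples are found or the list ends."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    hard, easy = [], []
    for i in order:
        if box_labels[i] == 1 and len(hard) < n_bg:
            hard.append(boxes[i])
        elif box_labels[i] == 0 and len(easy) < n_fg:
            easy.append(boxes[i])
        if len(hard) == n_bg and len(easy) == n_fg:
            break
    return hard, easy
```

The selected boxes are then re-pooled and re-scored in step a8.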
7. The method for detecting the weakly supervised target based on the online difficult and easy sample mining as recited in claim 3, characterized in that: in the step a10, the network loss is calculated according to the following formulas:
$s_k = \sum_{j=1}^{|\tau|} \hat{\sigma}_{kj}$
$L = -\sum_{k=1}^{N} \left[ y_k \log s_k + (1 - y_k) \log(1 - s_k) \right]$
where $\hat{\sigma}$ denotes the features obtained in step a8, $y_k$ is the real picture-level label indicating whether the picture to be detected belongs to the kth category, and $s_k$ is the probability value of the kth category output by the neural network in the training stage.
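The picture-level score and loss of steps a9-a10 can be sketched as follows. This is a reconstruction under two assumptions: the picture-level score of class k is the sum of the per-box scores along the candidate-box axis, and the loss is the multi-label binary cross-entropy defined over all N categories.

```python
import numpy as np

def image_level_loss(sigma_hat, y):
    """sigma_hat: N x |tau| array of per-box, per-class scores (step a8).
    Sum over the candidate-box axis to obtain one score s_k per class,
    then apply the multi-label binary cross-entropy of claim 7."""
    s = sigma_hat.sum(axis=1)            # s_k, picture-level class scores
    s = np.clip(s, 1e-7, 1 - 1e-7)       # avoid log(0)
    return -np.sum(y * np.log(s) + (1 - y) * np.log(1 - s))
```

The loss is low when classes present in the image-level label accumulate score mass and absent classes do not.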
CN202010805922.5A 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining Active CN112215252B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010805922.5A CN112215252B (en) 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining


Publications (2)

Publication Number Publication Date
CN112215252A true CN112215252A (en) 2021-01-12
CN112215252B CN112215252B (en) 2023-05-30

Family

ID=74058975

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010805922.5A Active CN112215252B (en) 2020-08-12 2020-08-12 Weak supervision target detection method based on-line difficult sample mining

Country Status (1)

Country Link
CN (1) CN112215252B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591931A (en) * 2021-07-06 2021-11-02 厦门路桥信息股份有限公司 Weak supervision target positioning method, device, equipment and medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330027A (en) * 2017-06-23 2017-11-07 中国科学院信息工程研究所 A kind of Weakly supervised depth station caption detection method
CN107665351A (en) * 2017-05-06 2018-02-06 北京航空航天大学 The airfield detection method excavated based on difficult sample
US20190228313A1 (en) * 2018-01-23 2019-07-25 Insurance Services Office, Inc. Computer Vision Systems and Methods for Unsupervised Representation Learning by Sorting Sequences
CN111275044A (en) * 2020-02-21 2020-06-12 西北工业大学 Weak supervision target detection method based on sample selection and self-adaptive hard case mining


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙皓泽;常天庆;张雷;杨国振;: "基于Top-down网络结构的坦克装甲目标检测", 计算机仿真 *


Also Published As

Publication number Publication date
CN112215252B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
Chen et al. Once for all: a two-flow convolutional neural network for visual tracking
Hasani et al. Spatio-temporal facial expression recognition using convolutional neural networks and conditional random fields
Xiong et al. Recognize complex events from static images by fusing deep channels
Zhang et al. Loop closure detection for visual SLAM systems using convolutional neural network
Shou et al. Temporal action localization in untrimmed videos via multi-stage cnns
Kong et al. Hypernet: Towards accurate region proposal generation and joint object detection
Yuan et al. Discriminative video pattern search for efficient action detection
Lin et al. RSCM: Region selection and concurrency model for multi-class weather recognition
CN111709311A (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN113221770B (en) Cross-domain pedestrian re-recognition method and system based on multi-feature hybrid learning
Xiao et al. Cross domain knowledge transfer for person re-identification
Ding et al. Let features decide for themselves: Feature mask network for person re-identification
Luo et al. Visual attention in multi-label image classification
Symeonidis et al. Neural attention-driven non-maximum suppression for person detection
Song et al. A review of object detectors in deep learning
Tong et al. A review of indoor-outdoor scene classification
Najibi et al. Towards the success rate of one: Real-time unconstrained salient object detection
CN112215252B (en) Weak supervision target detection method based on-line difficult sample mining
Saeidi et al. Deep learning based on CNN for pedestrian detection: an overview and analysis
Zheng et al. Bi-heterogeneous Convolutional Neural Network for UAV-based dynamic scene classification
Wang et al. Human action recognition based on deep network and feature fusion
Sun et al. Video-based parent-child relationship prediction
Beikmohammadi et al. Mixture of deep-based representation and shallow classifiers to recognize human activities
Li et al. Multiple instance discriminative dictionary learning for action recognition
Ren et al. Video-based emotion recognition using multi-dichotomy RNN-DNN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant