CN108734200B - Human target visual detection method and device based on BING (Binarized Normed Gradients) features - Google Patents


Info

Publication number
CN108734200B
CN108734200B (application CN201810374551.2A)
Authority
CN
China
Prior art keywords
candidate
window
bing
detection
classifier
Prior art date
Legal status
Active
Application number
CN201810374551.2A
Other languages
Chinese (zh)
Other versions
CN108734200A (en)
Inventor
杨戈
黄尚仁
黄静
Current Assignee
Beijing Normal University Zhuhai
Original Assignee
Beijing Normal University Zhuhai
Priority date
Filing date
Publication date
Application filed by Beijing Normal University Zhuhai filed Critical Beijing Normal University Zhuhai
Priority to CN201810374551.2A
Publication of CN108734200A
Application granted
Publication of CN108734200B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention relates to a human target visual detection method and device based on BING (Binarized Normed Gradients) features. The method processes video frames with BING-feature-based visual saliency detection to alleviate the performance problem caused by using a traditional sliding window for target detection; for the small number of target-containing candidate windows screened out by the saliency detection, an SVM and a cascade classifier based on deep convolution features are further designed to finely screen the candidate windows where the human targets are located, finally obtaining the positions and sizes of all human targets in the video frame. The invention optimizes traditional sliding-window candidate extraction with the BING-feature-based visual saliency detection method and directionally screens the candidate windows with a method based on a cascaded SVM (Support Vector Machine) classifier, which effectively and rapidly reduces the number of candidate windows, ensuring the human target detection precision while reducing the detection time.

Description

Human target visual detection method and device based on BING (Binarized Normed Gradients) features
Technical Field
The invention belongs to the technical field of computer vision, relates to human target visual detection technology based on visual attention, and particularly relates to a human target visual detection method and device based on BING features.
Background
Cognitive psychology has shown, through extensive observation, that humans retain strong perceptual ability even in extremely complex scenes because they are good at locating meaningful targets and then observing, identifying and recognizing them; that is, humans are highly effective at filtering useful information. From this starting point, if a computer is to be given similar cognition, the saliency of target objects in an image or video frame must first be defined.
Visual Saliency Detection is generally divided into two categories. One is bottom-up saliency detection (Bottom-up Approach), which essentially takes the difference between the pixels of a candidate region of an image and its surrounding pixels as the measure of saliency: the greater the difference, the more salient the region. Examples include the histogram-contrast method (HC, Histogram Contrast) (Zeng P, Meng F, Shi R, et al. Salient Object Detection Based on Histogram-Based Contrast and Guided Image Filtering [M]. Intelligent Data Analysis and Applications. Springer International Publishing, 2016), in which the authors use histograms to count the color value of each pixel and obtain the saliency result after histogram reduction and quantization; the region-contrast method (RC, Regional Contrast) (Shi Y, Yi Y, Yan H, et al. Region contrast and supervised locality-preserving projection-based saliency detection [J]. The Visual Computer, 2015, 31(9): 1191-); and the SLIC (Simple Linear Iterative Clustering) superpixel method (Achanta R, Shaji A, Smith K, et al. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2012, 34(11): 2274-2282).
The other category is top-down saliency detection (Top-down Approach). Whereas bottom-up saliency detection is driven by low-level data, top-down saliency detection is task-driven: a preliminary detection is performed from low-level features, then prior knowledge and specific task requirements further narrow the detection range, yielding a more accurate salient region for the task at hand. For example, Wu et al. (Wu Y. A unified approach to salient object detection via low rank matrix recovery [C]. Computer Vision and Pattern Recognition. IEEE, 2012: 853-860) propose to represent an image as a low-rank matrix (Low-Rank Matrix) plus noise, and to use matrix recovery to restore the low-rank matrix and obtain the corresponding noise, i.e. the salient region; the authors further guide the model at a higher level with prior information on position, semantics and color, which gives the algorithm better performance.
Target detection algorithms can likewise be broadly divided into two categories. One category depends on candidate regions (Proposal Region): the algorithm traverses all regions of the image with sliding windows (Sliding Windows) of different scales and finds every region that may contain a target during the traversal. The DPM (Deformable Part Models) proposed by Felzenszwalb (Felzenszwalb P F, Girshick R B, McAllester D, et al. Object Detection with Discriminatively Trained Part-Based Models [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2010, 32(9): 1627-1645) is representative: it matches mixtures of deformable part models against each window, integrates the feature matching degree and the deviation of each part model from its ideal position to compute an optimal response score, and screens out the windows containing targets according to the window scores. In recent years the field of target detection has mainly used methods combined with deep learning, with quite good results. In 2014 Ross Girshick proposed the R-CNN (Region-based Convolutional Neural Network) algorithm (Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation [J]. 2014: 580-587), successfully combining candidate regions and CNN for the first time: a deep neural network is trained to extract features from candidate windows, and a linear SVM (Support Vector Machine) screens the candidate regions that may finally contain specific targets. He et al. subsequently proposed the SPP-Net (Spatial Pyramid Pooling Network) algorithm (He K, Zhang X, Ren S, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 37(9): 1904-1916). SPP-Net needs only one forward CNN pass over the whole image when extracting features, computes the CNN features of all candidate windows through spatial mapping, and adds a special layer (SPP Layer) to accommodate input images of various sizes without clipping them, so SPP-Net is on average 50 times faster than R-CNN. In 2015 Girshick proposed, on the basis of R-CNN and SPP-Net, a multi-task loss to replace the SVM classification and added an ROI (Region of Interest) layer to the original network, realizing the Fast R-CNN algorithm (Girshick R. Fast R-CNN [J]. Computer Science, 2015) with end-to-end training and detection, further improving the efficiency and accuracy of the algorithm.
The other category of target detection methods needs no candidate windows in advance, saving the time of extracting them and eliminating the wasted computation caused by repeatedly computing features over overlapping candidate windows. To realize end-to-end real-time detection, Joseph Redmon et al. proposed the YOLO (You Only Look Once) detection algorithm (Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection [J]. Computer Science, 2016: 779-788) to solve the repeated feature extraction and detection caused by the overlap of Faster-R-CNN candidate windows. The algorithm trains and detects on the whole image pre-divided into S × S regions, so the model better distinguishes target and background: if a target falls in a region, that region detects it; each region predicts B candidate windows, and the algorithm judges whether a target exists in a box from the confidence values of the candidate windows, so targets can be detected quickly. In addition, Wei Liu combined the advantages of the YOLO and Faster-R-CNN algorithms and proposed the SSD (Single Shot MultiBox Detector) algorithm based on forward-propagation CNN (Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector [C]. ECCV, 2016). SSD needs no candidate windows, takes the whole image as input and predicts candidate-window positions with small convolution kernels on feature maps of different scales, so the algorithm guarantees both speed and prediction precision and obtains good predictions even on low-resolution input images.
The above prior art has two main problems. First, with the traditional method of obtaining candidate windows by sliding a window, the candidate windows obtained while sliding across the image overlap severely, and this overlapping sampling produces a very large number of candidate windows, creating a serious performance bottleneck for target detection. Second, feature extraction with deep convolutional networks based on deep learning consumes a large amount of computing resources, making detection slow.
Disclosure of Invention
In order to comprehensively solve these two problems, the invention optimizes the traditional candidate-window extraction method with visual saliency detection based on the BING feature, and directionally screens the candidate windows with a method based on a cascaded SVM classifier, thereby effectively and rapidly reducing the number of candidate windows and reducing the detection time while ensuring the human target detection precision.
The human target visual detection method based on BING-feature visual saliency detection (HD-BING, Human Detection with BING feature) provided by the invention processes video frames with the BING-feature-based visual saliency detection method, so as to alleviate the performance problem brought by the traditional sliding window in target detection. In addition, an SVM and a cascade classifier based on deep convolution features are further designed for the small number of target-containing candidate windows screened out by the saliency detection, finely screening the candidate windows where the human targets are located and finally obtaining the positions and sizes of all human targets in the video frame.
The technical scheme adopted by the invention is as follows:
a human target visual detection method based on BING characteristics comprises the following steps:
1) performing visual saliency detection on the image of the video frame based on the BING characteristics, and screening out a candidate window which may contain a human body target in the image;
2) and screening the candidate windows through a cascade classifier, and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
Further, step 1) comprises the following substeps:
1-1) training a first-stage SVM classifier based on BING characteristics, screening candidate windows by using the first-stage SVM classifier, and calculating scores of the candidate windows;
1-2) training a second-stage SVM classifier based on BING characteristics, screening the candidate window obtained in the step 1-1) by using the second-stage SVM classifier, and calculating the score of the candidate window to be used as the measure of the significance of the candidate window area.
Further, step 2) comprises the following substeps:
2-1) training a third-level SVM classifier based on HOG characteristics, and screening the candidate window obtained in the step 1) by using the third-level SVM classifier;
2-2) training a fourth-level classifier based on the deep convolution characteristics, and screening the candidate windows obtained in the step 2-1) by using the fourth-level classifier to obtain all final candidate windows containing the human body target.
Further, when the first-level SVM classifier is trained, candidate windows containing a general target object are used as positive samples, and candidate windows not containing a general target object, or whose overlap rate with the candidate window of the general target object is below 50%, are used as negative samples; when the second-level SVM classifier is trained, candidate windows whose overlap rate with the region of the general target object is above 50% are used as positive samples, and candidate windows with an overlap rate below 50% as negative samples.
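For illustration, the overlap rate of two windows can be computed as the intersection-over-union of their areas; a minimal Python sketch (the (x, y, w, h) window format is an assumption of this example):

def overlap_rate(a, b):
    # IoU of two windows given as (x, y, w, h) tuples.
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# e.g. a window is a positive sample if overlap_rate(window, target) > 0.5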
Further, when the third-stage classifier is trained, firstly, the HOG features are adopted for training each candidate region to obtain a linear SVM model, then the model is used for detecting a test set, a false detection window obtained in the detection process is added into a negative sample set of the training set to serve as a hard sample, and training is carried out on the training set added with the hard sample to obtain a new SVM model.
Further, the fourth-level classifier uses a deep convolution network to perform feature extraction on an input image to obtain a deep convolution feature map, then candidate windows screened by the third-level SVM classifier are mapped onto the deep convolution feature map, a feature vector of an area where each candidate window is located is obtained, and finally a Softmax layer is used for classifying features of the windows.
Further, the fourth-level classifier is trained using the Caffe deep learning framework and tuning training using the augmented INRIA dataset.
Further, the candidate window finally containing the human body target is subjected to linear scaling and translation, so that the coincidence rate of the candidate window and the real position of the target is increased, and the detection accuracy is improved.
Corresponding to the above method, the present invention also provides a human target visual inspection device based on the BING feature, which comprises:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
The HD-BING target detection algorithm can effectively improve the accuracy of human target detection: it obtains 95.7% accuracy on the NICTA data set and an 86.08% detection rate on the augmented INRIA data set, with a miss rate of 8.77% and a false-detection rate of 5.15% (the evaluation criterion computes the coincidence rate between the prediction window and the calibrated target window; a window with a coincidence rate above 65% counts as correctly detected, a window with a coincidence rate below 50% as a false detection, and a calibrated target window for which no candidate window reaches 65% coincidence as a missed target). The average detection time for images of average size 500 × 500 is 0.8 seconds, lower than the 1.5 seconds of a method using only deep-learning deep convolution features, and also lower than the 2.1 seconds of methods combining a conventional sliding window with an SVM classifier.
Drawings
FIG. 1 is a detailed block diagram of a VGG-16 deep convolutional network.
Fig. 2 is a specific structure diagram of the improved VGG-16 deep convolutional network, in which, besides feature extraction with the original deep convolutional network, candidate-window mapping, a regression classification layer (Softmax layer) and a window correction layer are added, as proposed in the document "Girshick R. Fast R-CNN [J]. Computer Science, 2015".
Fig. 3 is a flow chart of the HD-BING algorithm detection.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
In its design, the HD-BING algorithm realizes a "coarse look first, fine look second" detection mode following the visual attention mechanism: candidate windows possibly containing target objects in the image are first screened out by the coarse look, and candidate windows possibly containing human bodies are then further screened out by the fine look. The overall flow of the algorithm is shown in fig. 3; the steps of the method of the present invention are described below with reference to the figure.
1. Saliency detection based on BING features
Ming-Ming Cheng, in the reference published at CVPR 2014 (Cheng M, Zhang Z, Lin W Y, et al. BING: Binarized Normed Gradients for Objectness Estimation at 300fps [C]. 2014: 3286-3293), states that general objects in an image share the characteristic of a closed contour; moreover, if these objects are normalized to a certain small scale, their pixel gradient magnitudes (Normed Gradients, NG) appear sharp and show a strong commonality, whereas the pixel gradient magnitudes of the background region, which has no closed contour, are single and regular. There is thus a significant difference between the pixel gradient magnitudes of general target objects and the background, i.e. regions with cluttered pixel gradient magnitudes have higher saliency. Based on this rule, the author proposes the BING (Binarized Normed Gradients) feature, an improvement on the pixel gradient magnitude, scores the regions of the image where the more salient general target objects are located with an SVM, screens according to the window scores, and finally selects a small number of candidate windows from these windows by the Non-Maximum Suppression algorithm (NMS) (Neubeck A, Van Gool L. Efficient Non-Maximum Suppression [C]. ICPR, 2006).
The BING feature is essentially a binarized pixel gradient magnitude. The gradient of a pixel in an image is the rate of change of gray level along the gradient direction and reflects the gray-level variation on the edges of objects in the image, which is a practical way to quantify the closed contours used for target detection. BING features are insensitive to color, scale and rotation, and therefore quite robust.
Once the image is regarded as a two-dimensional discrete function, computing the gradient becomes taking derivatives of that discrete function; the BING features of the candidate window regions are computed according to Formulas 1 to 3.
Gradient(x, y) = dx_{i,j} + dy_{i,j}    (Formula 1)
dx_{i,j} = |C_{i+1,j} - C_{i-1,j}| / 2    (Formula 2)
dy_{i,j} = |C_{i,j+1} - C_{i,j-1}| / 2    (Formula 3)
where i, j are the coordinates of the pixel and C_{i,j} is the RGB value of the pixel at (i, j). The binarized gradient magnitude NG² of the pixel at that point can be computed with Formula 4:

NG² ≈ Σ_{j=1}^{k} 2^{8-j} · NG_j    (Formula 4)

where k denotes how many leading bits of the feature value are kept, NG denotes the pixel gradient magnitude, NG² denotes the binarized pixel gradient magnitude, and NG_j denotes the j-th leading bit of the NG value, so that the first k bits of NG are retained.
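A minimal NumPy sketch of Formulas 1-4 (assuming a single-channel 8-bit image for C; border handling by wrap-around is a simplification of this example):

import numpy as np

def bing_gradient(C, k=4):
    # Formulas 1-3: central-difference gradient magnitude of image C.
    C = C.astype(np.float32)
    dx = np.abs(np.roll(C, -1, axis=1) - np.roll(C, 1, axis=1)) / 2  # Formula 2
    dy = np.abs(np.roll(C, -1, axis=0) - np.roll(C, 1, axis=0)) / 2  # Formula 3
    ng = np.clip(dx + dy, 0, 255).astype(np.uint8)                   # Formula 1
    # Formula 4: binarize by keeping only the top k of the 8 bits of NG.
    mask = (~((1 << (8 - k)) - 1)) & 0xFF
    return ng & mask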
An SVM classifier is therefore used to learn a classifier on the binarized pixel gradient features, so that objects sharing the closed-contour commonality can be distinguished from non-objects lacking it, achieving the goal of separating general target objects from the background. The candidate regions screened out at this stage, however, contain general target objects of all types. The first-level SVM classifier uses candidate windows containing a general target object as positive samples, and candidate windows not containing one, or overlapping the candidate window of the general target object by less than 50%, as negative samples, and learns a 64-dimensional linear SVM model M1 whose weights are denoted by w; the score score_L of the current candidate window is then computed according to Formula 5 and Formula 6:

score_L = <w, (NG²)_L>    (Formula 5)

L = (i, x, y)    (Formula 6)

where score_L is the first-stage measure of the window, i.e. "score S1" in fig. 3, L denotes the position information, i the current scale, (x, y) the coordinates of the current window, and (NG²)_L the BING feature of the window at position L.

The second-level SVM classifier uses candidate windows overlapping the region of the general target object by more than 50% as positive samples with label 1, and candidate windows overlapping by less than 50% as negative samples with label -1, and trains a linear SVM model M2 as the second classifier; the final score (Score_F)_L of a candidate window is computed according to Formula 7:

(Score_F)_L = (w1)_i × score_L + (w2)_i    (Formula 7)

where w1 represents the weight of the SVM model trained in the first stage, i.e. the first-level SVM classifier, w2 the weight of the SVM model trained in the second stage, i.e. the second-level SVM classifier, and (Score_F)_L the final score of the candidate window at position L. (Score_F)_L is used as the final measure of the saliency of the candidate window region: the higher its value, the more likely the region contains a general target object, i.e. the more salient the region. (Score_F)_L is "score S2" in fig. 3.
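Formulas 5-7 amount to a dot product with the learned 64-dimensional weights followed by a per-scale linear calibration; an illustrative Python sketch (w, w1_i and w2_i stand for the trained model weights at scale i):

import numpy as np

def first_stage_score(ng2_window, w):
    # Formula 5: score_L = <w, (NG^2)_L> over the normalized window.
    return float(np.dot(w.ravel(), ng2_window.ravel()))

def final_score(score_l, w1_i, w2_i):
    # Formula 7: per-scale linear calibration of the first-stage score.
    return w1_i * score_l + w2_i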
2. Cascade classifier for human target detection
Through this rapid screening of candidate windows by the BING-feature-based saliency detection, the number of candidate windows is reduced to the order of 10^3, and the probability that these candidate windows contain general target objects is very high; among them are the human targets required for the final detection.
In order to screen out the human targets, a third-level linear SVM classifier is introduced. To preserve the performance of the algorithm, only the relatively simple HOG (Histogram of Oriented Gradients) feature is used on each candidate region to train the third-level linear SVM classifier, finally yielding a linear SVM model.
To reduce the excessive false-detection rate of the SVM model trained on HOG features, the obtained model is first used to detect a test set; the false-detection windows produced during detection are added to the negative sample set of the training set as hard samples (Hard Samples), and a new SVM model M3 is obtained by training on the training set augmented with the hard samples.
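A hedged scikit-learn sketch of this hard-sample (bootstrapping) loop (the HOG features are assumed to be precomputed vectors; the function and variable names are hypothetical):

from sklearn.svm import LinearSVC

def train_with_hard_samples(X_pos, X_neg, X_nonhuman_windows, C=0.01):
    # First training pass on the initial positive/negative HOG features.
    X = X_pos + X_neg
    y = [1] * len(X_pos) + [-1] * len(X_neg)
    svm = LinearSVC(C=C).fit(X, y)
    # Windows from images without humans that are wrongly classified as
    # human are false detections, i.e. hard negative samples.
    hard = [x for x in X_nonhuman_windows if svm.predict([x])[0] == 1]
    X, y = X + hard, y + [-1] * len(hard)
    return LinearSVC(C=C).fit(X, y)   # retrained model M3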
After classification by the third-level SVM classifier, fewer than 1000 candidate windows most likely to contain human targets remain. Because the position of the human target must be acquired accurately, a classifier with stronger classification capability is needed to further classify the candidate windows, i.e. the trained model M4 is used as the fourth-level classifier.
The fourth-level classifier extracts features from the input image with a deep convolutional network, namely the 16-layer VGG (Visual Geometry Group) deep convolutional network VGG-16 (Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition [J]. Computer Science, 2014). The VGG-16 network is divided into eight main parts, the eighth being an output unit formed by fully connected layers; the specific network structure is shown in FIG. 1 and the specific parameters of each layer in Table 1.
TABLE 1 specific parameters of layers of VGG-16 networks
After the deep convolution feature map of the image is obtained, the candidate windows screened by the third-level SVM classifier are mapped onto the feature map, the feature vector of the region of each candidate window is obtained, and a Softmax layer finally classifies the features of each window and yields the classification confidence value. The model architecture shown in FIG. 2 is the one proposed in the document "Girshick R. Fast R-CNN [J]. Computer Science, 2015": each layer of the original VGG-16 network is adjusted, the fully connected layer originally used as the output layer is discarded, and a Softmax layer and a window correction layer are added. During training the model is tuned with the Caffe deep learning framework (Jia Yangqing, Shelhamer E, et al. Caffe: Convolutional Architecture for Fast Feature Embedding [J]. Eprint Arxiv, 2014: 675-678).
The pseudo code of the HD-BING algorithm of the present invention is shown in Table 2.
TABLE 2 pseudo code of HD-BING Algorithm
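As an illustrative reconstruction of the Table 2 pseudo code, the following Python-style sketch outlines the four-stage cascade it describes (all helper functions are hypothetical stand-ins for the stages explained above):

def hd_bing_detect(frame, M1, M2, M3, M4):
    # Stages 1-2: BING-based saliency proposals ("coarse look").
    windows = bing_proposals(frame, M1)           # score S1, Formula 5
    windows = rescore_and_filter(windows, M2)     # score S2, Formula 7
    windows = non_maximum_suppression(windows)    # keep ~10^3 proposals
    # Stage 3: HOG features + linear SVM screening.
    windows = [w for w in windows
               if M3.predict([hog(frame, w)])[0] == 1]
    # Stage 4: deep convolution features + Softmax, then window correction.
    feat_map = vgg16_features(normalize(frame))   # 14 x 14 feature map
    detections = []
    for w in windows:
        vec = roi_pool(feat_map, map_to_feature_map(w))
        if softmax_human_probability(M4, vec) > 0.5:
            detections.append(refine_window(w))   # linear scale/translate
    return detections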
Specific examples of methods employing the present invention are provided below.
The HD-BING human target detection algorithm needs to train four classifiers in total.
The first stage is the training of the linear SVM model based on the BING features. The data set used to train the model is VOC 2007, containing 20 target classes, 9963 images and 24640 targets. The training parameter C of the linear SVM model is set to 10, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1, and candidate windows whose overlap rate with a calibrated candidate window is below 50% as negative samples with label -1; the BING features of the image regions of all samples (candidate windows) are computed, and the linear SVM model is finally obtained after training.
The second stage is the training of the linear SVM model based on the BING features over multiple scales. The data set used to train the model is likewise VOC 2007. The training parameter C of the linear SVM model is set to 100, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1, and candidate windows whose overlap rate with a calibrated candidate window is below 50% as negative samples with label -1; the BING features of the image regions of all samples (candidate windows) are computed, and the linear SVM model is finally obtained after training.
The third stage is the training of the linear SVM model based on HOG features. The data sets used for training are INRIA and NICTA. From the 64 × 80-resolution sample set of the NICTA data set, a 32 × 64 human target is cut out of the middle of each sample image, i.e. 8 pixels are cut from the top and bottom and 16 pixels from the left and right. All sample images are then normalized to 64 × 128 by bilinear interpolation. In the model-training stage, the 37344 images with a single human target in the training set of the NICTA data set are selected as the positive sample set, with label 1; since the data set contains 500000 images without human targets, only 30000 of them are randomly selected as the negative sample set in each experiment, with label -1, giving 67344 samples in total for the first training.
In the testing stage, 50000 images with a single human target in the test set of the NICTA data set are selected and labeled 1; since the test set contains 6879 images without human targets, 5000 images are randomly selected per test and labeled -1, so that the total number of test samples is finally 11879. 20000 false-detection windows are collected as hard samples and added to the negative sample set of the training set, so that after adding the hard samples the number of negative samples is 50000 and the total number of samples is 87344. For the second training, with the hard samples added, the training parameter C of the linear SVM model is set to 0.01, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1 and negative-sample windows with label -1, and the third-level linear SVM model is obtained by training.
For feature extraction from the samples, HOG features are used in the experiment: the HOG cell (Cell) size is set to 8 × 8 pixels, the HOG block (Block) size to 16 × 16 pixels, the block strides in the x and y directions to 8 pixels each, and the window size to 64 × 128 pixels; the gradient directions are counted with a 9-bin histogram. Finally, a 3781-dimensional HOG feature is extracted for each sample.
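With OpenCV, the stated parameters correspond to the following descriptor configuration (a sketch; the image path is hypothetical, and the raw descriptor is 3780-dimensional — the 3781 in the text presumably includes a bias term):

import cv2

sample = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
# win 64x128, block 16x16, block stride 8x8, cell 8x8, 9 orientation bins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)
features = hog.compute(cv2.resize(sample, (64, 128)))
# 7 x 15 blocks * 4 cells * 9 bins = 3780 values per sample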
The fourth stage is the classifier based on deep convolution features. The Caffe deep learning framework is used in the training process, the model architecture is shown in FIG. 2, and the learning rate is set to 0.0001; the model is tuned on the augmented INRIA data set, yielding the fourth-stage classification model.
For the extraction of the deep convolution features, as shown in fig. 2, the original image of size M × N is normalized to 224 × 224 and input into the adjusted VGG-16 deep convolutional network for feature extraction, yielding a feature map of size 14 × 14. Every candidate window R_i of size m_i × n_i obtained by screening the image is mapped through the network's max-pooling layers (kernel size 2, stride 2) onto the feature map to obtain the mapped candidate window R_i'; R_i' is divided into a 7 × 7 grid, so that each grid cell after division has size

(m_i / 7) × (n_i / 7),

and the features of each candidate window are uniformly pooled to 7 × 7 dimensions by max pooling, so that k feature vectors of dimension 7 × 7 are obtained on each convolution feature map of the image. Since the fifth convolutional layer has 512 convolution kernels, as shown in fig. 2, the final number of features is 512 × k × (7 × 7). Finally, the 512 × k feature vectors of dimension 7 × 7 are input into the fully connected layers of the VGG-16 network, and a 4096-dimensional feature is obtained for each of the k candidate windows.
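A NumPy sketch of this 7 × 7 ROI max-pooling step (an illustrative reconstruction; the actual model realizes it with Caffe layers):

import numpy as np

def roi_max_pool(feat, x0, y0, x1, y1, out=7):
    # feat: (C, H, W) feature map; (x0, y0)-(x1, y1): mapped window R_i'.
    pooled = np.empty((feat.shape[0], out, out), dtype=feat.dtype)
    ys = np.linspace(y0, y1, out + 1).astype(int)
    xs = np.linspace(x0, x1, out + 1).astype(int)
    for i in range(out):
        ya, yb = ys[i], max(ys[i] + 1, ys[i + 1])   # guard against empty cells
        for j in range(out):
            xa, xb = xs[j], max(xs[j] + 1, xs[j + 1])
            pooled[:, i, j] = feat[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled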
The k 4096-dimensional features obtained are input into a fully connected layer to obtain the score of each candidate window; the candidate-window scores and the label information (background label 0, human target label 1) are then input into the Softmax layer of Caffe, which computes the prediction probability P_c (c ∈ {0, 1}) of each candidate window for each category according to Formula 8 and Formula 9. P_c with c = 0 denotes that the current candidate window is background, P_c with c = 1 that it is a human target, and the classification loss function of the Softmax layer is computed according to Formula 10.

x_i = x_i - max(x_0, x_1, …, x_n)    (Formula 8)

P_c = exp(x_c) / Σ_j exp(x_j)    (Formula 9)

Loss_c = -log(P_c), c = 0, 1    (Formula 10)

where x_0, x_1, …, x_n are the n-dimensional feature values output by the convolutional network, and P_0, P_1, …, P_{k-1} are the prediction probability results over the k candidate windows.
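Formulas 8-10 are the numerically stable Softmax and its cross-entropy loss; a minimal NumPy sketch:

import numpy as np

def softmax_loss(x, c):
    # x: score vector from the fully connected layer; c: true label (0 or 1).
    x = x - np.max(x)                  # Formula 8: shift for numerical stability
    p = np.exp(x) / np.exp(x).sum()    # Formula 9: Softmax probabilities
    return p, -np.log(p[c])            # Formula 10: classification loss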
Because the candidate windows produced by the third-level SVM classifier and the fourth-level classifier of the HD-BING algorithm still deviate considerably from the true positions of the targets, using the windows from the fourth-level classifier directly for classification and as the final result would greatly reduce detection performance. The method proposed in the document (Girshick R. Fast R-CNN [J]. Computer Science, 2015) is therefore applied: the candidate windows finally containing human targets are linearly scaled and translated to increase their coincidence rate with the true target positions, improving the detection accuracy.
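In the spirit of the Fast R-CNN window correction, the refinement applies learned translation/scale offsets to each window; an illustrative sketch (the offsets dx, dy, dw, dh would come from the window correction layer; the parameterization is an assumption of this example):

import math

def refine_window(x, y, w, h, dx, dy, dw, dh):
    # Translate the window center, then rescale its width and height.
    cx, cy = x + w / 2 + dx * w, y + h / 2 + dy * h
    w2, h2 = w * math.exp(dw), h * math.exp(dh)
    return cx - w2 / 2, cy - h2 / 2, w2, h2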
The HD-BING target detection algorithm obtains 95.7% accuracy on the NICTA data set. To better demonstrate its practicality, the INRIA data set was first augmented to twice its original size; the added content consists of images containing more human targets in different postures such as sitting, standing, leaning, running and walking, sourced from Baidu and Google images of realistic scenes such as tourist photos, movie screenshots and snapshots of daily life, and the algorithm was tested on the augmented data. Because real-life scenes are more complex and their backgrounds changeable, the comprehensive performance evaluation of the human target detection algorithm is thereby more referable and scientific. The HD-BING target detection algorithm obtains an 86.08% detection rate on the augmented INRIA data set, with a miss rate of 8.77% and a false-detection rate of 5.15% (the evaluation criterion computes the coincidence rate between the prediction window and the calibrated target window; a window with a coincidence rate above 65% counts as correctly detected, a window with a coincidence rate below 50% as a false detection, and a calibrated target window for which no candidate window reaches 65% coincidence as a missed target).
Description of other embodiments of the present invention:
1. The first and second stages use classifiers based on visual attention detection with BING features, which is essentially an optimization of candidate-window generation by the conventional sliding window; other target detection algorithms can be used instead, for example Edge Boxes (Zitnick C L, Dollár P. Edge Boxes: Locating Object Proposals from Edges [C]. Cham, 2014: 391-405) for generating the first- and second-stage candidates. On the other hand, other visual saliency detection algorithms (such as simple linear iterative clustering, or saliency detection algorithms based on background discrimination) can replace the visual attention detection aspect of the method, achieving the same purpose.
2. The third level combines HOG features with a linear SVM classifier to further screen the candidate windows. For feature extraction, the HOG features can be replaced by edge features, color features or even deep convolution features; for classification, strong classifiers such as a nonlinear SVM, a regression classifier, a decision tree or a neural network can replace the linear SVM as the third-level classifier, achieving the same purpose.
3. For the fourth-level classifier based on the deep convolutional network that screens the remaining candidate windows, other deep convolutional network models can be substituted, for example a VGG network with fewer than 16 layers where performance is a concern, or models based on CNN, R-CNN, DCNN (Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]) and the like; image features are extracted by training such models, achieving the goal of human target detection by this method.
4. The method detects human targets by combining visual saliency detection with a cascade classifier; the same purpose can be achieved by replacing the saliency detection algorithm or replacing the cascade classifier.
In accordance with the above method, another embodiment of the present invention provides a human target visual inspection device based on the BING characteristics, which includes:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. A human target visual detection method based on BING characteristics is characterized by comprising the following steps:
1) performing visual saliency detection on the image of the video frame based on the BING characteristics, and screening out a candidate window which may contain a human body target in the image;
2) screening the candidate windows through a cascade classifier, and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows;
wherein, step 1) comprises the following substeps:
1-1) training a first-stage SVM classifier based on BING characteristics, screening candidate windows by using the first-stage SVM classifier, and calculating scores of the candidate windows;
1-2) training a second-stage SVM classifier based on BING characteristics, screening the candidate window obtained in the step 1-1) by using the second-stage SVM classifier, and calculating the score of the candidate window to be used as the measure of the significance of the candidate window region; wherein, step 2) includes the following substeps:
2-1) training a third-level SVM classifier based on HOG characteristics, and screening the candidate window obtained in the step 1) by using the third-level SVM classifier;
2-2) training a fourth-level classifier based on the deep convolution characteristics, and screening the candidate windows obtained in the step 2-1) by using the fourth-level classifier to obtain all final candidate windows containing human body targets;
when the third-level SVM classifier is trained, firstly, each candidate region is trained by adopting HOG characteristics to obtain a linear SVM model, then, the model is used for detecting a test set, a false detection window obtained in the detection process is added into a negative sample set of a training set to be used as a hard sample, and training is carried out on the training set added with the hard sample to obtain a new SVM model;
the fourth-level classifier uses a deep convolution network to perform feature extraction on an input image to obtain a deep convolution feature map, then candidate windows screened out by the third-level SVM classifier are mapped onto the deep convolution feature map to obtain feature vectors of regions where the candidate windows are located, and finally a Softmax layer is used for classifying features of the windows.
2. The method of claim 1, wherein the first-level SVM classifier is trained using a candidate window containing a general target object as a positive sample, using a candidate window not containing a general target object or a candidate window having an overlap rate with a candidate window in which a general target object is located of less than 50% as a negative sample; and when the second-stage SVM classifier is trained, a candidate window with the overlapping rate of more than 50% with the area where the general target object is located is used as a positive sample, and a candidate window with the overlapping rate of less than 50% is used as a negative sample.
3. The method of claim 1, wherein the first-level SVM classifier computes a score for a candidate window using the following formula:
score_L = <w, (NG²)_L>,
L = (i, x, y),
where L represents the position information, i represents the current scale, (x, y) represents the coordinates of the current window, (NG²)_L represents the BING feature of the window at position L, and w represents the weights of the linear SVM model.
4. The method of claim 3, wherein the second-level SVM classifier computes a score for a candidate window using the following formula:
(Score_F)_L = (w1)_i × score_L + (w2)_i,
where w1 represents the weight of the first-level SVM classifier, w2 represents the weight of the second-level SVM classifier, and (Score_F)_L represents the final score of the candidate window at position L; (Score_F)_L is used as the final measure of the saliency of the candidate window region.
5. The method of claim 1, wherein the fourth stage classifier is trained using a Caffe deep learning framework and tuning training using an augmented INRIA dataset.
6. A human target visual detection device based on BING characteristics by adopting the method of any claim 1 to 5, which is characterized by comprising:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
CN201810374551.2A 2018-04-24 2018-04-24 Human target visual detection method and device based on BING (Binarized Normed Gradients) features Active CN108734200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810374551.2A CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810374551.2A CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Publications (2)

Publication Number Publication Date
CN108734200A CN108734200A (en) 2018-11-02
CN108734200B true CN108734200B (en) 2022-03-08

Family

ID=63939765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810374551.2A Active CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Country Status (1)

Country Link
CN (1) CN108734200B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726754A (en) * 2018-12-25 2019-05-07 浙江大学昆山创新中心 A kind of LCD screen defect identification method and device
CN111488893B (en) * 2019-01-25 2023-05-30 银河水滴科技(北京)有限公司 Image classification method and device
CN110188811A (en) * 2019-05-23 2019-08-30 西北工业大学 Underwater target detection method based on normed Gradient Features and convolutional neural networks
CN110458004B (en) * 2019-07-02 2022-12-27 浙江吉利控股集团有限公司 Target object identification method, device, equipment and storage medium
CN112733652B (en) * 2020-12-31 2024-04-19 深圳赛安特技术服务有限公司 Image target recognition method, device, computer equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446890A (en) * 2016-10-28 2017-02-22 中国人民解放军信息工程大学 Candidate area extraction method based on window scoring and superpixel segmentation
CN106503742A (en) * 2016-11-01 2017-03-15 广东电网有限责任公司电力科学研究院 A kind of visible images insulator recognition methods
CN106845458A (en) * 2017-03-05 2017-06-13 北京工业大学 A kind of rapid transit label detection method of the learning machine that transfinited based on core

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443164B2 (en) * 2014-12-02 2016-09-13 Xerox Corporation System and method for product identification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446890A (en) * 2016-10-28 2017-02-22 中国人民解放军信息工程大学 Candidate area extraction method based on window scoring and superpixel segmentation
CN106503742A (en) * 2016-11-01 2017-03-15 广东电网有限责任公司电力科学研究院 A kind of visible images insulator recognition methods
CN106845458A (en) * 2017-03-05 2017-06-13 北京工业大学 A kind of rapid transit label detection method of the learning machine that transfinited based on core

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Street-scene occluded pedestrian detection with a cascade of multiple classifiers; Wu Zhe; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 2018-01-15; pp. 9-10 and 21-43 *

Also Published As

Publication number Publication date
CN108734200A (en) 2018-11-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant