CN108734200B - Human target visual detection method and device based on BING (Binarized Normed Gradients) features - Google Patents


Info

Publication number
CN108734200B
CN108734200B (application CN201810374551.2A)
Authority
CN
China
Prior art keywords
candidate
window
bing
detection
classifier
Prior art date
Legal status
Active
Application number
CN201810374551.2A
Other languages
Chinese (zh)
Other versions
CN108734200A (en)
Inventor
杨戈
黄尚仁
黄静
Current Assignee
Beijing Normal University Zhuhai
Original Assignee
Beijing Normal University Zhuhai
Priority date
Filing date
Publication date
Application filed by Beijing Normal University Zhuhai filed Critical Beijing Normal University Zhuhai
Priority to CN201810374551.2A
Publication of CN108734200A
Application granted
Publication of CN108734200B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]

Abstract

The invention relates to a human target visual detection method and device based on BING (Binarized Normed Gradients) features. The method processes video frames with BING-feature-based visual saliency detection to alleviate the performance problem caused by using a traditional sliding window for target detection; for the small number of target-containing candidate windows screened out by the saliency detection, an SVM and a cascade classifier based on deep convolution features are further designed to finely screen the candidate windows where the human targets are located, finally obtaining the positions and sizes of all human targets in the video frame. The invention optimizes traditional sliding-window candidate extraction with the BING-feature-based visual saliency detection method and directionally screens the candidate windows with a method based on a cascaded SVM (Support Vector Machine) classifier, which effectively and rapidly reduces the number of candidate windows, ensuring the human target detection precision while reducing the detection time.

Description

Human target visual detection method and device based on BING (Binarized Normed Gradients) features
Technical Field
The invention belongs to the technical field of computer vision, relates to human target visual detection technology based on visual attention, and particularly relates to a human target visual detection method and device based on BING features.
Background
Cognitive psychology has shown, through extensive observation, that humans retain strong perceptual ability even in extremely complex scenes because they are good at locating meaningful targets and then observing, identifying and recognizing them; that is, humans are highly effective at filtering useful information. From this starting point, if a computer is to be given similar cognition, the saliency of target objects in an image or video frame must first be defined.
Visual Saliency Detection is generally divided into two categories. One is bottom-up saliency detection (Bottom-up Approach), which essentially takes the difference between the pixels of a candidate region of an image and its surrounding pixels as the measure of saliency: the greater the difference, the more salient the region. Examples include the histogram-contrast method (HC, Histogram Contrast) (Zeng P, Meng F, Shi R, et al. Salient Object Detection Based on Histogram-Based Contrast and Guided Image Filtering [M]. Intelligent Data Analysis and Applications. Springer International Publishing, 2016), in which the authors use histograms to count the color value of each pixel and obtain the saliency result after histogram reduction and quantization; the region-contrast method (RC, Regional Contrast) (Shi Y, Yi Y, Yan H, et al. Region contrast and supervised locality-preserving projection-based saliency detection [J]. The Visual Computer, 2015, 31(9): 1191-); and the SLIC (Simple Linear Iterative Clustering) superpixel method (Achanta R, Shaji A, Smith K, et al. SLIC Superpixels Compared to State-of-the-Art Superpixel Methods [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2012, 34(11): 2274-2282).
The other category is top-down saliency detection (Top-down Approach). Whereas bottom-up saliency detection is driven by low-level data, top-down saliency detection is task-driven: a preliminary detection is performed from low-level features, then prior knowledge and specific task requirements further narrow the detection range, yielding a more accurate salient region for the task at hand. For example, Wu et al. (Wu Y. A unified approach to salient object detection via low rank matrix recovery [C]. Computer Vision and Pattern Recognition. IEEE, 2012: 853-860) propose to represent an image as a low-rank matrix (Low-Rank Matrix) plus noise, and to use matrix recovery to restore the low-rank matrix and obtain the corresponding noise, i.e. the salient region; the authors further guide the model at a higher level with prior information on position, semantics and color, which gives the algorithm better performance.
Target detection algorithms can likewise be broadly divided into two categories. One category depends on candidate regions (Proposal Region): the algorithm traverses all regions of the image with sliding windows (Sliding Windows) of different scales and finds every region that may contain a target during the traversal. The DPM (Deformable Part Models) proposed by Felzenszwalb (Felzenszwalb P F, Girshick R B, McAllester D, et al. Object Detection with Discriminatively Trained Part-Based Models [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2010, 32(9): 1627-1645) is representative: it matches mixtures of deformable part models against each window, integrates the feature matching degree and the deviation of each part model from its ideal position to compute an optimal response score, and screens out the windows containing targets according to the window scores. In recent years the field of target detection has mainly used methods combined with deep learning, with quite good results. In 2014 Ross Girshick proposed the R-CNN (Region-based Convolutional Neural Network) algorithm (Girshick R, Donahue J, Darrell T, et al. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation [J]. 2014: 580-587), successfully combining candidate regions and CNN for the first time: a deep neural network is trained to extract features from candidate windows, and a linear SVM (Support Vector Machine) screens the candidate regions that may finally contain specific targets. He et al. subsequently proposed the SPP-Net (Spatial Pyramid Pooling Network) algorithm (He K, Zhang X, Ren S, et al. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015, 37(9): 1904-1916). SPP-Net needs only one forward CNN pass over the whole image when extracting features, computes the CNN features of all candidate windows through spatial mapping, and adds a special layer (SPP Layer) to accommodate input images of various sizes without clipping them, so SPP-Net is on average 50 times faster than R-CNN. In 2015 Girshick proposed, on the basis of R-CNN and SPP-Net, a multi-task loss to replace the SVM classification and added an ROI (Region of Interest) layer to the original network, realizing the Fast R-CNN algorithm (Girshick R. Fast R-CNN [J]. Computer Science, 2015) with end-to-end training and detection, further improving the efficiency and accuracy of the algorithm.
The other category of target detection methods needs no candidate windows in advance, saving the time of extracting them and eliminating the wasted computation caused by repeatedly computing features over overlapping candidate windows. To realize end-to-end real-time detection, Joseph Redmon et al. proposed the YOLO (You Only Look Once) detection algorithm (Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection [J]. Computer Science, 2016: 779-788) to solve the repeated feature extraction and detection caused by the overlap of Faster-R-CNN candidate windows. The algorithm trains and detects on the whole image pre-divided into S × S regions, so the model better distinguishes target and background: if a target falls in a region, that region detects it; each region predicts B candidate windows, and the algorithm judges whether a target exists in a box from the confidence values of the candidate windows, so targets can be detected quickly. In addition, Wei Liu combined the advantages of the YOLO and Faster-R-CNN algorithms and proposed the SSD (Single Shot MultiBox Detector) algorithm based on forward-propagation CNN (Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector [C]. ECCV, 2016). SSD needs no candidate windows, takes the whole image as input and predicts candidate-window positions with small convolution kernels on feature maps of different scales, so the algorithm guarantees both speed and prediction precision and obtains good predictions even on low-resolution input images.
The above prior art has two main problems. First, with the traditional method of obtaining candidate windows by sliding a window, the candidate windows obtained while sliding across the image overlap severely, and this overlapping sampling produces a very large number of candidate windows, creating a serious performance bottleneck for target detection. Second, feature extraction with deep convolutional networks based on deep learning consumes a large amount of computing resources, making detection slow.
Disclosure of Invention
In order to comprehensively solve these two problems, the invention optimizes the traditional candidate-window extraction method with visual saliency detection based on the BING feature, and directionally screens the candidate windows with a method based on a cascaded SVM classifier, thereby effectively and rapidly reducing the number of candidate windows and reducing the detection time while ensuring the human target detection precision.
The human target visual detection method based on BING-feature visual saliency detection (HD-BING, Human Detection with BING feature) provided by the invention processes video frames with the BING-feature-based visual saliency detection method, so as to alleviate the performance problem brought by the traditional sliding window in target detection. In addition, an SVM and a cascade classifier based on deep convolution features are further designed for the small number of target-containing candidate windows screened out by the saliency detection, finely screening the candidate windows where the human targets are located and finally obtaining the positions and sizes of all human targets in the video frame.
The technical scheme adopted by the invention is as follows:
a human target visual detection method based on BING characteristics comprises the following steps:
1) performing visual saliency detection on the image of the video frame based on the BING characteristics, and screening out a candidate window which may contain a human body target in the image;
2) and screening the candidate windows through a cascade classifier, and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
Further, step 1) comprises the following substeps:
1-1) training a first-stage SVM classifier based on BING characteristics, screening candidate windows by using the first-stage SVM classifier, and calculating scores of the candidate windows;
1-2) training a second-stage SVM classifier based on BING characteristics, screening the candidate window obtained in the step 1-1) by using the second-stage SVM classifier, and calculating the score of the candidate window to be used as the measure of the significance of the candidate window area.
Further, step 2) comprises the following substeps:
2-1) training a third-level SVM classifier based on HOG characteristics, and screening the candidate window obtained in the step 1) by using the third-level SVM classifier;
2-2) training a fourth-level classifier based on the deep convolution characteristics, and screening the candidate windows obtained in the step 2-1) by using the fourth-level classifier to obtain all final candidate windows containing the human body target.
Further, when the first-level SVM classifier is trained, candidate windows containing a general target object are used as positive samples, and candidate windows not containing a general target object, or whose overlap rate with the candidate window of the general target object is below 50%, are used as negative samples; when the second-level SVM classifier is trained, candidate windows whose overlap rate with the region of the general target object is above 50% are used as positive samples, and candidate windows with an overlap rate below 50% as negative samples.
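For illustration, the overlap rate of two windows can be computed as the intersection-over-union of their areas; a minimal Python sketch (the (x, y, w, h) window format is an assumption of this example):

def overlap_rate(a, b):
    # IoU of two windows given as (x, y, w, h) tuples.
    ax0, ay0, ax1, ay1 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx0, by0, bx1, by1 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# e.g. a window is a positive sample if overlap_rate(window, target) > 0.5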
Further, when the third-stage classifier is trained, firstly, the HOG features are adopted for training each candidate region to obtain a linear SVM model, then the model is used for detecting a test set, a false detection window obtained in the detection process is added into a negative sample set of the training set to serve as a hard sample, and training is carried out on the training set added with the hard sample to obtain a new SVM model.
Further, the fourth-level classifier uses a deep convolution network to perform feature extraction on an input image to obtain a deep convolution feature map, then candidate windows screened by the third-level SVM classifier are mapped onto the deep convolution feature map, a feature vector of an area where each candidate window is located is obtained, and finally a Softmax layer is used for classifying features of the windows.
Further, the fourth-level classifier is trained using the Caffe deep learning framework and tuning training using the augmented INRIA dataset.
Further, the candidate window finally containing the human body target is subjected to linear scaling and translation, so that the coincidence rate of the candidate window and the real position of the target is increased, and the detection accuracy is improved.
Corresponding to the above method, the present invention also provides a human target visual inspection device based on the BING feature, which comprises:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
The HD-BING target detection algorithm can effectively improve the accuracy of human target detection: it obtains 95.7% accuracy on the NICTA data set and an 86.08% detection rate on the augmented INRIA data set, with a miss rate of 8.77% and a false-detection rate of 5.15% (the evaluation criterion computes the coincidence rate between the prediction window and the calibrated target window; a window with a coincidence rate above 65% counts as correctly detected, a window with a coincidence rate below 50% as a false detection, and a calibrated target window for which no candidate window reaches 65% coincidence as a missed target). The average detection time for images of average size 500 × 500 is 0.8 seconds, lower than the 1.5 seconds of a method using only deep-learning deep convolution features, and also lower than the 2.1 seconds of methods combining a conventional sliding window with an SVM classifier.
Drawings
FIG. 1 is a detailed block diagram of a VGG-16 deep convolutional network.
Fig. 2 is a specific structure diagram of the improved VGG-16 deep convolutional network, in which, besides feature extraction with the original deep convolutional network, candidate-window mapping, a regression classification layer (Softmax layer) and a window correction layer are added, as proposed in the document "Girshick R. Fast R-CNN [J]. Computer Science, 2015".
Fig. 3 is a flow chart of the HD-BING algorithm detection.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
In its design, the HD-BING algorithm realizes a "coarse look first, fine look second" detection mode following the visual attention mechanism: candidate windows possibly containing target objects in the image are first screened out by the coarse look, and candidate windows possibly containing human bodies are then further screened out by the fine look. The overall flow of the algorithm is shown in fig. 3; the steps of the method of the present invention are described below with reference to the figure.
1. Saliency detection based on BING features
Ming-Ming Cheng, in the reference published at CVPR 2014 (Cheng M, Zhang Z, Lin W Y, et al. BING: Binarized Normed Gradients for Objectness Estimation at 300fps [C]. 2014: 3286-3293), states that general objects in an image share the characteristic of a closed contour; moreover, if these objects are normalized to a certain small scale, their pixel gradient magnitudes (Normed Gradients, NG) appear sharp and show a strong commonality, whereas the pixel gradient magnitudes of the background region, which has no closed contour, are single and regular. There is thus a significant difference between the pixel gradient magnitudes of general target objects and the background, i.e. regions with cluttered pixel gradient magnitudes have higher saliency. Based on this rule, the author proposes the BING (Binarized Normed Gradients) feature, an improvement on the pixel gradient magnitude, scores the regions of the image where the more salient general target objects are located with an SVM, screens according to the window scores, and finally selects a small number of candidate windows from these windows by the Non-Maximum Suppression algorithm (NMS) (Neubeck A, Van Gool L. Efficient Non-Maximum Suppression [C]. ICPR, 2006).
The BING feature is essentially a binarized pixel gradient magnitude. The gradient of a pixel in an image is the rate of change of gray level along the gradient direction and reflects the gray-level variation on the edges of objects in the image, which is a practical way to quantify the closed contours used for target detection. BING features are insensitive to color, scale and rotation, and therefore quite robust.
Once the image is regarded as a two-dimensional discrete function, computing the gradient becomes taking derivatives of that discrete function; the BING features of the candidate window regions are computed according to Formulas 1 to 3.
Gradient(x, y) = dx_{i,j} + dy_{i,j}    (Formula 1)
dx_{i,j} = |C_{i+1,j} - C_{i-1,j}| / 2    (Formula 2)
dy_{i,j} = |C_{i,j+1} - C_{i,j-1}| / 2    (Formula 3)
where i, j are the coordinates of the pixel and C_{i,j} is the RGB value of the pixel at (i, j). The binarized gradient magnitude NG² of the pixel at that point can be computed with Formula 4:

NG² ≈ Σ_{j=1}^{k} 2^{8-j} · NG_j    (Formula 4)

where k denotes how many leading bits of the feature value are kept, NG denotes the pixel gradient magnitude, NG² denotes the binarized pixel gradient magnitude, and NG_j denotes the j-th leading bit of the NG value, so that the first k bits of NG are retained.
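A minimal NumPy sketch of Formulas 1-4 (assuming a single-channel 8-bit image for C; border handling by wrap-around is a simplification of this example):

import numpy as np

def bing_gradient(C, k=4):
    # Formulas 1-3: central-difference gradient magnitude of image C.
    C = C.astype(np.float32)
    dx = np.abs(np.roll(C, -1, axis=1) - np.roll(C, 1, axis=1)) / 2  # Formula 2
    dy = np.abs(np.roll(C, -1, axis=0) - np.roll(C, 1, axis=0)) / 2  # Formula 3
    ng = np.clip(dx + dy, 0, 255).astype(np.uint8)                   # Formula 1
    # Formula 4: binarize by keeping only the top k of the 8 bits of NG.
    mask = (~((1 << (8 - k)) - 1)) & 0xFF
    return ng & mask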
An SVM classifier is therefore used to learn a classifier on the binarized pixel gradient features, so that objects sharing the closed-contour commonality can be distinguished from non-objects lacking it, achieving the goal of separating general target objects from the background. The candidate regions screened out at this stage, however, contain general target objects of all types. The first-level SVM classifier uses candidate windows containing a general target object as positive samples, and candidate windows not containing one, or overlapping the candidate window of the general target object by less than 50%, as negative samples, and learns a 64-dimensional linear SVM model M1 whose weights are denoted by w; the score score_L of the current candidate window is then computed according to Formula 5 and Formula 6:

score_L = <w, (NG²)_L>    (Formula 5)

L = (i, x, y)    (Formula 6)

where score_L is the first-stage measure of the window, i.e. "score S1" in fig. 3, L denotes the position information, i the current scale, (x, y) the coordinates of the current window, and (NG²)_L the BING feature of the window at position L.

The second-level SVM classifier uses candidate windows overlapping the region of the general target object by more than 50% as positive samples with label 1, and candidate windows overlapping by less than 50% as negative samples with label -1, and trains a linear SVM model M2 as the second classifier; the final score (Score_F)_L of a candidate window is computed according to Formula 7:

(Score_F)_L = (w1)_i × score_L + (w2)_i    (Formula 7)

where w1 represents the weight of the SVM model trained in the first stage, i.e. the first-level SVM classifier, w2 the weight of the SVM model trained in the second stage, i.e. the second-level SVM classifier, and (Score_F)_L the final score of the candidate window at position L. (Score_F)_L is used as the final measure of the saliency of the candidate window region: the higher its value, the more likely the region contains a general target object, i.e. the more salient the region. (Score_F)_L is "score S2" in fig. 3.
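Formulas 5-7 amount to a dot product with the learned 64-dimensional weights followed by a per-scale linear calibration; an illustrative Python sketch (w, w1_i and w2_i stand for the trained model weights at scale i):

import numpy as np

def first_stage_score(ng2_window, w):
    # Formula 5: score_L = <w, (NG^2)_L> over the normalized window.
    return float(np.dot(w.ravel(), ng2_window.ravel()))

def final_score(score_l, w1_i, w2_i):
    # Formula 7: per-scale linear calibration of the first-stage score.
    return w1_i * score_l + w2_i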
2. Cascade classifier for human target detection
Through this rapid screening of candidate windows by the BING-feature-based saliency detection, the number of candidate windows is reduced to the order of 10^3, and the probability that these candidate windows contain general target objects is very high; among them are the human targets required for the final detection.
In order to screen out the human targets, a third-level linear SVM classifier is introduced. To preserve the performance of the algorithm, only the relatively simple HOG (Histogram of Oriented Gradients) feature is used on each candidate region to train the third-level linear SVM classifier, finally yielding a linear SVM model.
To reduce the excessive false-detection rate of the SVM model trained on HOG features, the obtained model is first used to detect a test set; the false-detection windows produced during detection are added to the negative sample set of the training set as hard samples (Hard Samples), and a new SVM model M3 is obtained by training on the training set augmented with the hard samples.
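A hedged scikit-learn sketch of this hard-sample (bootstrapping) loop (the HOG features are assumed to be precomputed vectors; the function and variable names are hypothetical):

from sklearn.svm import LinearSVC

def train_with_hard_samples(X_pos, X_neg, X_nonhuman_windows, C=0.01):
    # First training pass on the initial positive/negative HOG features.
    X = X_pos + X_neg
    y = [1] * len(X_pos) + [-1] * len(X_neg)
    svm = LinearSVC(C=C).fit(X, y)
    # Windows from images without humans that are wrongly classified as
    # human are false detections, i.e. hard negative samples.
    hard = [x for x in X_nonhuman_windows if svm.predict([x])[0] == 1]
    X, y = X + hard, y + [-1] * len(hard)
    return LinearSVC(C=C).fit(X, y)   # retrained model M3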
After classification by the third-level SVM classifier, fewer than 1000 candidate windows most likely to contain human targets remain. Because the position of the human target must be acquired accurately, a classifier with stronger classification capability is needed to further classify the candidate windows, i.e. the trained model M4 is used as the fourth-level classifier.
The fourth-level classifier extracts features from the input image with a deep convolutional network, namely the 16-layer VGG (Visual Geometry Group) deep convolutional network VGG-16 (Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition [J]. Computer Science, 2014). The VGG-16 network is divided into eight main parts, the eighth being an output unit formed by fully connected layers; the specific network structure is shown in FIG. 1 and the specific parameters of each layer in Table 1.
TABLE 1 specific parameters of layers of VGG-16 networks
After the deep convolution feature map of the image is obtained, the candidate windows screened by the third-level SVM classifier are mapped onto the feature map, the feature vector of the region of each candidate window is obtained, and a Softmax layer finally classifies the features of each window and yields the classification confidence value. The model architecture shown in FIG. 2 is the one proposed in the document "Girshick R. Fast R-CNN [J]. Computer Science, 2015": each layer of the original VGG-16 network is adjusted, the fully connected layer originally used as the output layer is discarded, and a Softmax layer and a window correction layer are added. During training the model is tuned with the Caffe deep learning framework (Jia Yangqing, Shelhamer E, et al. Caffe: Convolutional Architecture for Fast Feature Embedding [J]. Eprint Arxiv, 2014: 675-678).
The pseudo code of the HD-BING algorithm of the present invention is shown in Table 2.
TABLE 2 pseudo code of HD-BING Algorithm
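As an illustrative reconstruction of the Table 2 pseudo code, the following Python-style sketch outlines the four-stage cascade it describes (all helper functions are hypothetical stand-ins for the stages explained above):

def hd_bing_detect(frame, M1, M2, M3, M4):
    # Stages 1-2: BING-based saliency proposals ("coarse look").
    windows = bing_proposals(frame, M1)           # score S1, Formula 5
    windows = rescore_and_filter(windows, M2)     # score S2, Formula 7
    windows = non_maximum_suppression(windows)    # keep ~10^3 proposals
    # Stage 3: HOG features + linear SVM screening.
    windows = [w for w in windows
               if M3.predict([hog(frame, w)])[0] == 1]
    # Stage 4: deep convolution features + Softmax, then window correction.
    feat_map = vgg16_features(normalize(frame))   # 14 x 14 feature map
    detections = []
    for w in windows:
        vec = roi_pool(feat_map, map_to_feature_map(w))
        if softmax_human_probability(M4, vec) > 0.5:
            detections.append(refine_window(w))   # linear scale/translate
    return detections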
Specific examples of methods employing the present invention are provided below.
The HD-BING human target detection algorithm needs to train four classifiers in total.
The first stage is the training of the linear SVM model based on the BING features. The data set used to train the model is VOC 2007, containing 20 target classes, 9963 images and 24640 targets. The training parameter C of the linear SVM model is set to 10, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1, and candidate windows whose overlap rate with a calibrated candidate window is below 50% as negative samples with label -1; the BING features of the image regions of all samples (candidate windows) are computed, and the linear SVM model is finally obtained after training.
The second stage is the training of the linear SVM model based on the BING features over multiple scales. The data set used to train the model is likewise VOC 2007. The training parameter C of the linear SVM model is set to 100, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1, and candidate windows whose overlap rate with a calibrated candidate window is below 50% as negative samples with label -1; the BING features of the image regions of all samples (candidate windows) are computed, and the linear SVM model is finally obtained after training.
The third stage is the training of the linear SVM model based on HOG features. The data sets used for training are INRIA and NICTA. From the 64 × 80-resolution sample set of the NICTA data set, a 32 × 64 human target is cut out of the middle of each sample image, i.e. 8 pixels are cut from the top and bottom and 16 pixels from the left and right. All sample images are then normalized to 64 × 128 by bilinear interpolation. In the model-training stage, the 37344 images with a single human target in the training set of the NICTA data set are selected as the positive sample set, with label 1; since the data set contains 500000 images without human targets, only 30000 of them are randomly selected as the negative sample set in each experiment, with label -1, giving 67344 samples in total for the first training.
In the testing stage, 50000 images with a single human target in the test set of the NICTA data set are selected and labeled 1; since the test set contains 6879 images without human targets, 5000 images are randomly selected per test and labeled -1, so that the total number of test samples is finally 11879. 20000 false-detection windows are collected as hard samples and added to the negative sample set of the training set, so that after adding the hard samples the number of negative samples is 50000 and the total number of samples is 87344. For the second training, with the hard samples added, the training parameter C of the linear SVM model is set to 0.01, the error e to 0.001 and the bias to 1. Candidate windows of calibrated targets are marked as positive samples with label 1 and negative-sample windows with label -1, and the third-level linear SVM model is obtained by training.
For feature extraction from the samples, HOG features are used in the experiment: the HOG cell (Cell) size is set to 8 × 8 pixels, the HOG block (Block) size to 16 × 16 pixels, the block strides in the x and y directions to 8 pixels each, and the window size to 64 × 128 pixels; the gradient directions are counted with a 9-bin histogram. Finally, a 3781-dimensional HOG feature is extracted for each sample.
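With OpenCV, the stated parameters correspond to the following descriptor configuration (a sketch; the image path is hypothetical, and the raw descriptor is 3780-dimensional — the 3781 in the text presumably includes a bias term):

import cv2

sample = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical path
# win 64x128, block 16x16, block stride 8x8, cell 8x8, 9 orientation bins
hog = cv2.HOGDescriptor((64, 128), (16, 16), (8, 8), (8, 8), 9)
features = hog.compute(cv2.resize(sample, (64, 128)))
# 7 x 15 blocks * 4 cells * 9 bins = 3780 values per sample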
The fourth stage is the classifier based on deep convolution features. The Caffe deep learning framework is used in the training process, the model architecture is shown in FIG. 2, and the learning rate is set to 0.0001; the model is tuned on the augmented INRIA data set, yielding the fourth-stage classification model.
For the extraction of the deep convolution features, as shown in fig. 2, the original image of size M × N is normalized to 224 × 224 and input into the adjusted VGG-16 deep convolutional network for feature extraction, yielding a feature map of size 14 × 14. Every candidate window R_i of size m_i × n_i obtained by screening the image is mapped through the network's max-pooling layers (kernel size 2, stride 2) onto the feature map to obtain the mapped candidate window R_i'; R_i' is divided into a 7 × 7 grid, so that each grid cell after division has size

(m_i / 7) × (n_i / 7),

and the features of each candidate window are uniformly pooled to 7 × 7 dimensions by max pooling, so that k feature vectors of dimension 7 × 7 are obtained on each convolution feature map of the image. Since the fifth convolutional layer has 512 convolution kernels, as shown in fig. 2, the final number of features is 512 × k × (7 × 7). Finally, the 512 × k feature vectors of dimension 7 × 7 are input into the fully connected layers of the VGG-16 network, and a 4096-dimensional feature is obtained for each of the k candidate windows.
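A NumPy sketch of this 7 × 7 ROI max-pooling step (an illustrative reconstruction; the actual model realizes it with Caffe layers):

import numpy as np

def roi_max_pool(feat, x0, y0, x1, y1, out=7):
    # feat: (C, H, W) feature map; (x0, y0)-(x1, y1): mapped window R_i'.
    pooled = np.empty((feat.shape[0], out, out), dtype=feat.dtype)
    ys = np.linspace(y0, y1, out + 1).astype(int)
    xs = np.linspace(x0, x1, out + 1).astype(int)
    for i in range(out):
        ya, yb = ys[i], max(ys[i] + 1, ys[i + 1])   # guard against empty cells
        for j in range(out):
            xa, xb = xs[j], max(xs[j] + 1, xs[j + 1])
            pooled[:, i, j] = feat[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled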
The k 4096-dimensional features obtained are input into a fully connected layer to obtain the score of each candidate window; the candidate-window scores and the label information (background label 0, human target label 1) are then input into the Softmax layer of Caffe, which computes the prediction probability P_c (c ∈ {0, 1}) of each candidate window for each category according to Formula 8 and Formula 9. P_c with c = 0 denotes that the current candidate window is background, P_c with c = 1 that it is a human target, and the classification loss function of the Softmax layer is computed according to Formula 10.

x_i = x_i - max(x_0, x_1, …, x_n)    (Formula 8)

P_c = exp(x_c) / Σ_j exp(x_j)    (Formula 9)

Loss_c = -log(P_c), c = 0, 1    (Formula 10)

where x_0, x_1, …, x_n are the n-dimensional feature values output by the convolutional network, and P_0, P_1, …, P_{k-1} are the prediction probability results over the k candidate windows.
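Formulas 8-10 are the numerically stable Softmax and its cross-entropy loss; a minimal NumPy sketch:

import numpy as np

def softmax_loss(x, c):
    # x: score vector from the fully connected layer; c: true label (0 or 1).
    x = x - np.max(x)                  # Formula 8: shift for numerical stability
    p = np.exp(x) / np.exp(x).sum()    # Formula 9: Softmax probabilities
    return p, -np.log(p[c])            # Formula 10: classification loss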
Because the candidate windows produced by the third-level SVM classifier and the fourth-level classifier of the HD-BING algorithm still deviate considerably from the true positions of the targets, using the windows from the fourth-level classifier directly for classification and as the final result would greatly reduce detection performance. The method proposed in the document (Girshick R. Fast R-CNN [J]. Computer Science, 2015) is therefore applied: the candidate windows finally containing human targets are linearly scaled and translated to increase their coincidence rate with the true target positions, improving the detection accuracy.
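In the spirit of the Fast R-CNN window correction, the refinement applies learned translation/scale offsets to each window; an illustrative sketch (the offsets dx, dy, dw, dh would come from the window correction layer; the parameterization is an assumption of this example):

import math

def refine_window(x, y, w, h, dx, dy, dw, dh):
    # Translate the window center, then rescale its width and height.
    cx, cy = x + w / 2 + dx * w, y + h / 2 + dy * h
    w2, h2 = w * math.exp(dw), h * math.exp(dh)
    return cx - w2 / 2, cy - h2 / 2, w2, h2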
The HD-BING target detection algorithm obtains 95.7% accuracy on the NICTA data set. To better demonstrate its practicality, the INRIA data set was first augmented to twice its original size; the added content consists of images containing more human targets in different postures such as sitting, standing, leaning, running and walking, sourced from Baidu and Google images of realistic scenes such as tourist photos, movie screenshots and snapshots of daily life, and the algorithm was tested on the augmented data. Because real-life scenes are more complex and their backgrounds changeable, the comprehensive performance evaluation of the human target detection algorithm is thereby more referable and scientific. The HD-BING target detection algorithm obtains an 86.08% detection rate on the augmented INRIA data set, with a miss rate of 8.77% and a false-detection rate of 5.15% (the evaluation criterion computes the coincidence rate between the prediction window and the calibrated target window; a window with a coincidence rate above 65% counts as correctly detected, a window with a coincidence rate below 50% as a false detection, and a calibrated target window for which no candidate window reaches 65% coincidence as a missed target).
Description of other embodiments of the present invention:
1. The first and second stages use classifiers based on visual attention detection with BING features, which is essentially an optimization of candidate-window generation by the conventional sliding window; other target detection algorithms can be used instead, for example Edge Boxes (Zitnick C L, Dollár P. Edge Boxes: Locating Object Proposals from Edges [C]. Cham, 2014: 391-405) for generating the first- and second-stage candidates. On the other hand, other visual saliency detection algorithms (such as simple linear iterative clustering, or saliency detection algorithms based on background discrimination) can replace the visual attention detection aspect of the method, achieving the same purpose.
2. The third level combines HOG features with a linear SVM classifier to further screen the candidate windows. For feature extraction, the HOG features can be replaced by edge features, color features or even deep convolution features; for classification, strong classifiers such as a nonlinear SVM, a regression classifier, a decision tree or a neural network can replace the linear SVM as the third-level classifier, achieving the same purpose.
3. For the fourth-level classifier based on the deep convolutional network that screens the remaining candidate windows, other deep convolutional network models can be substituted, for example a VGG network with fewer than 16 layers where performance is a concern, or models based on CNN, R-CNN, DCNN (Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]) and the like; image features are extracted by training such models, achieving the goal of human target detection by this method.
4. The method detects human targets by combining visual saliency detection with a cascade classifier; the same purpose can be achieved by replacing the saliency detection algorithm or replacing the cascade classifier.
In accordance with the above method, another embodiment of the present invention provides a human target visual inspection device based on the BING characteristics, which includes:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims (6)

1. A human target visual detection method based on BING characteristics is characterized by comprising the following steps:
1) performing visual saliency detection on the image of the video frame based on the BING characteristics, and screening out a candidate window which may contain a human body target in the image;
2) screening the candidate windows through a cascade classifier, and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows;
wherein, step 1) comprises the following substeps:
1-1) training a first-stage SVM classifier based on BING characteristics, screening candidate windows by using the first-stage SVM classifier, and calculating scores of the candidate windows;
1-2) training a second-stage SVM classifier based on BING characteristics, screening the candidate window obtained in the step 1-1) by using the second-stage SVM classifier, and calculating the score of the candidate window to be used as the measure of the significance of the candidate window region; wherein, step 2) includes the following substeps:
2-1) training a third-level SVM classifier based on HOG characteristics, and screening the candidate window obtained in the step 1) by using the third-level SVM classifier;
2-2) training a fourth-level classifier based on the deep convolution characteristics, and screening the candidate windows obtained in the step 2-1) by using the fourth-level classifier to obtain all final candidate windows containing human body targets;
when the third-level SVM classifier is trained, firstly, each candidate region is trained by adopting HOG characteristics to obtain a linear SVM model, then, the model is used for detecting a test set, a false detection window obtained in the detection process is added into a negative sample set of a training set to be used as a hard sample, and training is carried out on the training set added with the hard sample to obtain a new SVM model;
the fourth-level classifier uses a deep convolution network to perform feature extraction on an input image to obtain a deep convolution feature map, then candidate windows screened out by the third-level SVM classifier are mapped onto the deep convolution feature map to obtain feature vectors of regions where the candidate windows are located, and finally a Softmax layer is used for classifying features of the windows.
2. The method of claim 1, wherein the first-level SVM classifier is trained using a candidate window containing a general target object as a positive sample, using a candidate window not containing a general target object or a candidate window having an overlap rate with a candidate window in which a general target object is located of less than 50% as a negative sample; and when the second-stage SVM classifier is trained, a candidate window with the overlapping rate of more than 50% with the area where the general target object is located is used as a positive sample, and a candidate window with the overlapping rate of less than 50% is used as a negative sample.
3. The method of claim 1, wherein the first-level SVM classifier computes a score for a candidate window using the following formula:
score_L = <w, (NG²)_L>,
L = (i, x, y),
where L represents the position information, i represents the current scale, (x, y) represents the coordinates of the current window, (NG²)_L represents the BING feature of the window at position L, and w represents the weights of the linear SVM model.
4. The method of claim 3, wherein the second-level SVM classifier computes a score for a candidate window using the following formula:
(Score_F)_L = (w1)_i × score_L + (w2)_i,
where w1 represents the weight of the first-level SVM classifier, w2 represents the weight of the second-level SVM classifier, and (Score_F)_L represents the final score of the candidate window at position L; (Score_F)_L is used as the final measure of the saliency of the candidate window region.
5. The method of claim 1, wherein the fourth stage classifier is trained using a Caffe deep learning framework and tuning training using an augmented INRIA dataset.
6. A human target visual detection device based on BING characteristics by adopting the method of any claim 1 to 5, which is characterized by comprising:
the saliency detection module is responsible for carrying out visual saliency detection on the image of the video frame based on the BING characteristics and screening out a candidate window which may contain a human body target in the image;
and the candidate window screening module is responsible for screening the candidate windows through the cascade classifier and obtaining the positions and the sizes of all human body targets on the video frame according to the finally screened candidate windows.
CN201810374551.2A 2018-04-24 2018-04-24 Human target visual detection method and device based on BING (Binarized Normed Gradients) features Active CN108734200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810374551.2A CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810374551.2A CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Publications (2)

Publication Number Publication Date
CN108734200A CN108734200A (en) 2018-11-02
CN108734200B true CN108734200B (en) 2022-03-08

Family

ID=63939765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810374551.2A Active CN108734200B (en) Human target visual detection method and device based on BING (Binarized Normed Gradients) features

Country Status (1)

Country Link
CN (1) CN108734200B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109726754A (en) * 2018-12-25 2019-05-07 浙江大学昆山创新中心 A kind of LCD screen defect identification method and device
CN111488893B (en) * 2019-01-25 2023-05-30 银河水滴科技(北京)有限公司 Image classification method and device
CN110188811A (en) * 2019-05-23 2019-08-30 西北工业大学 Underwater target detection method based on normed Gradient Features and convolutional neural networks
CN110458004B (en) * 2019-07-02 2022-12-27 浙江吉利控股集团有限公司 Target object identification method, device, equipment and storage medium
CN112733652B (en) * 2020-12-31 2024-04-19 深圳赛安特技术服务有限公司 Image target recognition method, device, computer equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446890A (en) * 2016-10-28 2017-02-22 中国人民解放军信息工程大学 Candidate area extraction method based on window scoring and superpixel segmentation
CN106503742A (en) * 2016-11-01 2017-03-15 广东电网有限责任公司电力科学研究院 A kind of visible images insulator recognition methods
CN106845458A (en) * 2017-03-05 2017-06-13 北京工业大学 A kind of rapid transit label detection method of the learning machine that transfinited based on core

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9443164B2 (en) * 2014-12-02 2016-09-13 Xerox Corporation System and method for product identification

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446890A (en) * 2016-10-28 2017-02-22 中国人民解放军信息工程大学 Candidate area extraction method based on window scoring and superpixel segmentation
CN106503742A (en) * 2016-11-01 2017-03-15 广东电网有限责任公司电力科学研究院 A kind of visible images insulator recognition methods
CN106845458A (en) * 2017-03-05 2017-06-13 北京工业大学 A kind of rapid transit label detection method of the learning machine that transfinited based on core

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Street-scene occluded pedestrian detection with a cascade of multiple classifiers; Wu Zhe; China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology Series; 2018-01-15; pp. 9-10 and 21-43 *

Also Published As

Publication number Publication date
CN108734200A (en) 2018-11-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant