CN110705366A - Real-time human head detection method based on stair scene - Google Patents

Real-time human head detection method based on stair scene

Info

Publication number
CN110705366A
CN110705366A
Authority
CN
China
Prior art keywords
human head
stair
fchd
real
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910844880.3A
Other languages
Chinese (zh)
Inventor
张发恩
胡太祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Innovation Qizhi (guangzhou) Technology Co Ltd
Original Assignee
Innovation Qizhi (guangzhou) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Innovation Qizhi (guangzhou) Technology Co Ltd filed Critical Innovation Qizhi (guangzhou) Technology Co Ltd
Priority to CN201910844880.3A priority Critical patent/CN110705366A/en
Publication of CN110705366A publication Critical patent/CN110705366A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a real-time human head detection method for stair scenes in the field of computer vision, comprising the following specific steps: S1: collect a large image data set of stair scenes; S2: label the data set, where each labeling box must contain both the head and the shoulders; S3: divide the data set into a training set, a test set, and a validation set; S4: augment the training-set data; S5: extract the labeling-box data from the training set, perform k-means clustering on the box dimensions, and select cluster categories to obtain the different anchor sizes; S6: construct the FCHD + FPN network architecture; S7: train the FCHD + FPN network model on the labeled stair-head training set; S8: test the accuracy of the trained model on the validation set; S9: use the resulting model to detect human heads in real stair scenes. The method selects anchors by clustering, adjusts the labeled region by incorporating shoulder information, and improves the FCHD method by fusing an FPN network, thereby raising detection accuracy.

Description

Real-time human head detection method based on stair scene
Technical Field
The invention relates to the technical field of computer vision, in particular to a real-time human head detection method based on a stair scene.
Background
Existing human head detection methods follow two lines of thought. One is regression: a crowd density map is regressed from the image. This approach can only indicate how crowded the scene is; it cannot locate individual persons, and it demands high image resolution. The other is object detection, counting people with algorithms such as SSD, YOLO, and the Faster R-CNN family; these algorithms perform poorly when people occlude each other, and it is difficult for them to meet the accuracy and speed requirements of detection at the same time.
FCHD is a recent detection algorithm for this head detection scenario, but FCHD selects only two anchor sizes, so it generalizes poorly in practical applications and has a high miss rate, because the apparent head size varies greatly with the camera mounting position and the person's distance from the camera.
Based on this, a real-time human head detection method for stair scenes is designed: suitable anchors are selected by a clustering method, the labeled region is adjusted by incorporating shoulder information, and the FCHD method is improved by fusing an FPN network to raise detection accuracy, so as to solve the above problems.
Disclosure of Invention
The invention aims to provide a real-time human head detection method based on a stair scene so as to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: the real-time human head detection method based on the stair scene comprises the following specific steps:
s1: acquiring a large number of picture data sets of stair scenes in public places;
s2: labeling the data set, wherein a labeling box needs to contain information of human head and shoulders;
s3: dividing a data set into a training set, a testing set and a verification set;
s4: enhancing the training set data;
s5: extracting the labeling-box data from the training set, performing k-means clustering on the box dimensions, and selecting cluster categories to obtain the different anchor sizes;
s6: constructing an FCHD + FPN network architecture;
s7: training by using an FCHD + FPN network model according to the labeled stair head training set;
s8: testing the accuracy of the trained model in the verification set;
s9: and detecting the human head in a real stair scene by using the generated model.
Preferably, the public place of step S1 includes a shopping mall or a subway.
Preferably, in step S4, the augmentation methods include horizontal flipping, random cropping, color jittering, scale transformation, and rotation transformation.
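As an illustrative sketch (a hypothetical minimal example, not the patent's actual implementation), horizontal flipping must also mirror the labeling boxes so the head-and-shoulder annotations stay aligned; the (cx, cy, w, h) box convention here matches the center-plus-size convention used in the k-means clustering embodiment:

```python
import numpy as np

def hflip_with_boxes(image, boxes):
    """Horizontally flip an image and its labeling boxes.

    image: H x W x C array; boxes: N x 4 array of (cx, cy, w, h)
    in pixel coordinates (box center plus width/height).
    """
    h, w = image.shape[:2]
    flipped = image[:, ::-1].copy()      # mirror the pixel columns
    out = boxes.astype(float).copy()
    out[:, 0] = w - 1 - out[:, 0]        # only the x-center moves
    return flipped, out
```

The other augmentations listed (random cropping, color jittering, scaling, rotation) likewise require transforming the boxes consistently with the pixels.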
Preferably, in step S6, the FCHD + FPN network architecture adds an FPN network on the basis of FCHD, adopts resnet101 as the FCHD base model, and replaces the NMS algorithm with the SOFT-NMS algorithm.
Compared with the prior art, the invention has the beneficial effects that:
1. on the basis of the common detection framework Faster R-CNN, the FCHD (fully convolutional head detector) and FPN (feature pyramid network) architectures are fused; combining the one-stage FCHD model with the FPN greatly improves head detection speed;
2. the resnet101 + FPN network markedly improves detection accuracy, and the candidate boxes are partially optimized, reducing the miss rate;
3. in data processing, partial human-body features (the head plus the shoulders) are used when labeling the data, improving the accuracy of model detection;
4. the anchor boxes obtained by clustering the training set are closer to the real scene, reducing the miss rate of head detection.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a block diagram of the FCHD + FPN network model of the present invention;
fig. 3 is a diagram of the final required feature map generated by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1-2, the present invention provides a technical solution: the real-time human head detection method based on the stair scene comprises the following specific steps:
s1: acquiring a large number of picture data sets of stair scenes in public places, wherein the public places comprise shopping malls or subways;
s2: labeling the data set, wherein a labeling box needs to contain information of human head and shoulders;
s3: dividing a data set into a training set, a testing set and a verification set;
s4: augmenting the training-set data by horizontal flipping, random cropping, color jittering, scale transformation, and rotation transformation;
s5: extracting the labeling-box data from the training set, performing k-means clustering on the box dimensions, and selecting cluster categories to obtain the different anchor sizes;
s6: constructing the FCHD + FPN network architecture: an FPN network is added on the basis of FCHD, resnet101 is adopted as the FCHD base model, and the NMS algorithm is replaced with the SOFT-NMS algorithm;
s7: training by using an FCHD + FPN network model according to the labeled stair head training set;
s8: testing the accuracy of the trained model in the verification set;
s9: and detecting the human head in a real stair scene by using the generated model.
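The Soft-NMS replacement mentioned in step s6 can be sketched in plain NumPy (a minimal linear-penalty variant; the (x1, y1, x2, y2) box format and the threshold values are illustrative assumptions, not details disclosed by the patent):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one (x1, y1, x2, y2) box against an N x 4 array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def soft_nms(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Linear Soft-NMS: instead of deleting overlapping boxes outright
    (as hard NMS does), decay their scores in proportion to the overlap,
    so heavily occluded heads are not suppressed entirely."""
    scores = scores.astype(float).copy()
    keep = []
    idx = np.arange(len(scores))
    while len(idx) > 0:
        best = idx[np.argmax(scores[idx])]
        keep.append(best)
        idx = idx[idx != best]
        if len(idx) == 0:
            break
        ov = iou(boxes[best], boxes[idx])
        # decay only boxes that overlap the kept box strongly
        scores[idx] = np.where(ov > iou_thresh, scores[idx] * (1 - ov), scores[idx])
        idx = idx[scores[idx] > score_thresh]
    return keep
```

Soft-NMS suits the stair scenario because heads on stairs frequently overlap in the image, and hard NMS would discard the lower-scored of two genuinely distinct heads.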
As an embodiment of the present invention
k-means clustering:
1. the clustering data is a detection data set containing only labeling boxes; after the data is labeled, a file is generated listing the position and category of each labeling box, where each row contains (x_j, y_j, w_j, h_j), j ∈ {1, 2, …, N}, i.e. the coordinates of the ground-truth boxes relative to the original image: (x_j, y_j) is the center point of the box, (w_j, h_j) are its width and height, and N is the total number of labeling boxes;
2. first, k cluster center points (W_i, H_i), i ∈ {1, 2, …, k}, are given, where W_i and H_i are the width and height of the anchor boxes; since the anchor boxes have no fixed position, there are no (x, y) coordinates, only width and height;
3. the distance between each labeling box and each cluster center is computed as d = 1 − IoU(labeling box, cluster center), where the center point of the labeling box is placed on the cluster center during the computation so that the IoU can be evaluated, i.e. d = 1 − IoU[(x_j, y_j, w_j, h_j), (x_j, y_j, W_i, H_i)], j ∈ {1, 2, …, N}, i ∈ {1, 2, …, k}; each labeling box is assigned to its nearest cluster center;
4. after all labeling boxes have been assigned, the center point of each cluster is recomputed as the mean width and height of the boxes in that cluster: W_i = (1/N_i) Σ w, H_i = (1/N_i) Σ h, where N_i is the number of labeling boxes in the i-th cluster;
5. steps 3 and 4 are repeated until the change in the cluster centers is small.
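The clustering procedure above can be sketched in NumPy. Since a labeling box and a cluster center share the same center point, the IoU depends only on the widths and heights (this is an illustrative sketch, not the patent's code; the initialization and iteration cap are assumptions):

```python
import numpy as np

def iou_wh(wh, centers):
    """IoU between boxes and cluster centers that share a center point,
    so the intersection is min(w) * min(h). wh: N x 2, centers: k x 2."""
    inter = (np.minimum(wh[:, None, 0], centers[None, :, 0])
             * np.minimum(wh[:, None, 1], centers[None, :, 1]))
    union = wh.prod(1)[:, None] + centers.prod(1)[None, :] - inter
    return inter / union

def kmeans_anchors(wh, k, iters=100, seed=0):
    """Cluster labeling-box (w, h) pairs with distance d = 1 - IoU;
    the resulting cluster centers are the anchor sizes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(wh, centers), axis=1)    # step 3
        new = np.array([wh[assign == i].mean(0) if np.any(assign == i)
                        else centers[i] for i in range(k)])       # step 4
        if np.allclose(new, centers):                             # step 5
            break
        centers = new
    return centers
```

Using 1 − IoU rather than Euclidean distance keeps large and small boxes comparable, which matters here because head sizes vary widely with camera distance.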
As another embodiment of the present invention
FCHD + FPN network model:
FPN module
The pre-trained resnet101 model is used as the base model of the whole framework. The higher-level feature map is first upsampled by a factor of 2 (nearest-neighbour upsampling), a 1 × 1 convolution makes the channel counts of the two levels consistent, and the result is merged with the corresponding lower-level feature map by element-wise addition. This process is iterated until the finest feature map is generated. Each merged feature map is then processed with a 3 × 3 convolution kernel (to eliminate the aliasing effects of upsampling) to generate the final required feature map, as shown in fig. 3.
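A minimal sketch of one top-down merge step, with the 1 × 1 channel-matching convolution and the 3 × 3 smoothing convolution omitted for brevity (an illustration of the upsample-and-add operation only, not the actual network code):

```python
import numpy as np

def upsample2_nearest(x):
    """2x nearest-neighbour upsampling of a C x H x W feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_merge(top, lateral):
    """One top-down FPN step: upsample the coarser map 2x and add it
    element-wise to the lateral (lower-level) feature map."""
    return upsample2_nearest(top) + lateral

# toy pyramid: a 1-channel 2x2 high-level map merged into a 4x4 map
top = np.ones((1, 2, 2))
lateral = np.zeros((1, 4, 4))
merged = fpn_merge(top, lateral)
```

In the real network this step repeats down the pyramid, and each merged map is smoothed with a 3 × 3 convolution before use.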
Data set preparation
Brainwash public dataset: 11,917 images, 91,146 labeling boxes; source: store surveillance video data
SCUT_HEAD public dataset: 4,405 images, 111,251 labeling boxes; source: classroom surveillance video and web-crawled data
Self-annotated dataset: 2,000 images; source: subway video data
Loss function
The loss function used to train the model is a multi-task loss:

L({p_i}, {t_i}) = (1/N_cls) Σ_i L_cls(p_i, p_i*) + λ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)

where i is the index of a selected anchor, p_i is the predicted probability that anchor i contains a head, and p_i* is the ground-truth label (1 for a head anchor, 0 for background). L_cls denotes the classification loss and L_reg the regression loss: L_cls is computed over all sampled anchors, while L_reg is computed only over the positive anchors. L_cls is the log loss over the two classes (head and background), and L_reg is the smooth L1 loss. The two terms are normalized by N_cls and N_reg, the numbers of samples used for classification and regression respectively, with λ balancing them.
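The smooth L1 penalty used for the regression term has the standard Faster R-CNN form, which can be written out as a short sketch:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise.
    Quadratic near zero (stable gradients for small box errors),
    linear for large errors (robust to outlier boxes)."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x * x, ax - 0.5)
```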
Hyper-parametric design
The base model is initialized with the pre-trained resnet101 weights, and both the pre-trained layers and the newly added layers are trained further. The new layers are initialized with random weights sampled from a standard normal distribution. The weight decay during training is 0.0005. The entire model is fine-tuned using SGD. The learning rate is set to 0.001 and the model is trained for 30 epochs, approaching 440k iterations. After 15 epochs, the learning rate is decayed by a factor of 0.1.
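The stated schedule (base rate 0.001, decayed by a factor of 0.1 after 15 of 30 epochs) amounts to a simple step schedule; this helper is an illustration, not code from the patent:

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.1, step=15):
    """Step schedule: base_lr for the first `step` epochs,
    then multiplied by `decay` every `step` epochs."""
    return base_lr * (decay ** (epoch // step))
```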
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (4)

1. A real-time human head detection method based on a stair scene is characterized by comprising the following steps: the method comprises the following specific steps:
s1: acquiring a large number of picture data sets of stair scenes in public places;
s2: labeling the data set, wherein a labeling box needs to contain information of human head and shoulders;
s3: dividing a data set into a training set, a testing set and a verification set;
s4: enhancing the training set data;
s5: extracting the labeling-box data from the training set, performing k-means clustering on the box dimensions, and selecting cluster categories to obtain the different anchor sizes;
s6: constructing an FCHD + FPN network architecture;
s7: training by using an FCHD + FPN network model according to the labeled stair head training set;
s8: testing the accuracy of the trained model in the verification set;
s9: and detecting the human head in a real stair scene by using the generated model.
2. The stair scene-based real-time human head detection method according to claim 1, wherein: the public place of the step S1 includes a shopping mall or a subway.
3. The stair scene-based real-time human head detection method according to claim 1, wherein: in step S4, the augmentation methods include horizontal flipping, random cropping, color jittering, scale transformation, and rotation transformation.
4. The stair scene-based real-time human head detection method according to claim 1, wherein: in step S6, the FCHD + FPN network architecture adds an FPN network on the basis of FCHD, adopts resnet101 as the FCHD base model, and replaces the NMS algorithm with the SOFT-NMS algorithm.
CN201910844880.3A 2019-09-07 2019-09-07 Real-time human head detection method based on stair scene Pending CN110705366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910844880.3A CN110705366A (en) 2019-09-07 2019-09-07 Real-time human head detection method based on stair scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910844880.3A CN110705366A (en) 2019-09-07 2019-09-07 Real-time human head detection method based on stair scene

Publications (1)

Publication Number Publication Date
CN110705366A true CN110705366A (en) 2020-01-17

Family

ID=69194806

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910844880.3A Pending CN110705366A (en) 2019-09-07 2019-09-07 Real-time human head detection method based on stair scene

Country Status (1)

Country Link
CN (1) CN110705366A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368749A (en) * 2020-03-06 2020-07-03 创新奇智(广州)科技有限公司 Automatic identification method and system for stair area
CN111832465A (en) * 2020-07-08 2020-10-27 星宏集群有限公司 Real-time head classification detection method based on MobileNet V3
CN111950612A (en) * 2020-07-30 2020-11-17 中国科学院大学 FPN-based weak and small target detection method for fusion factor
CN113505771A (en) * 2021-09-13 2021-10-15 华东交通大学 Double-stage article detection method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070074A (en) * 2019-05-07 2019-07-30 安徽工业大学 A method of building pedestrian detection model
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110135243A (en) * 2019-04-02 2019-08-16 上海交通大学 A kind of pedestrian detection method and system based on two-stage attention mechanism
CN110070074A (en) * 2019-05-07 2019-07-30 安徽工业大学 A method of building pedestrian detection model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ADITYA VORA: "FCHD: A fast and accurate head detector", arXiv *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368749A (en) * 2020-03-06 2020-07-03 创新奇智(广州)科技有限公司 Automatic identification method and system for stair area
CN111368749B (en) * 2020-03-06 2023-06-13 创新奇智(广州)科技有限公司 Automatic identification method and system for stair area
CN111832465A (en) * 2020-07-08 2020-10-27 星宏集群有限公司 Real-time head classification detection method based on MobileNet V3
CN111832465B (en) * 2020-07-08 2022-03-29 星宏集群有限公司 Real-time head classification detection method based on MobileNet V3
CN111950612A (en) * 2020-07-30 2020-11-17 中国科学院大学 FPN-based weak and small target detection method for fusion factor
CN113505771A (en) * 2021-09-13 2021-10-15 华东交通大学 Double-stage article detection method and device
CN113505771B (en) * 2021-09-13 2021-12-03 华东交通大学 Double-stage article detection method and device

Similar Documents

Publication Publication Date Title
US10019652B2 (en) Generating a virtual world to assess real-world video analysis performance
CN110705366A (en) Real-time human head detection method based on stair scene
Etten City-scale road extraction from satellite imagery v2: Road speeds and travel times
Marin et al. Learning appearance in virtual scenarios for pedestrian detection
US11854244B2 (en) Labeling techniques for a modified panoptic labeling neural network
CN108537743A (en) A kind of face-image Enhancement Method based on generation confrontation network
CN111598030A (en) Method and system for detecting and segmenting vehicle in aerial image
CN110163188B (en) Video processing and method, device and equipment for embedding target object in video
CN112766160A (en) Face replacement method based on multi-stage attribute encoder and attention mechanism
CN112784736B (en) Character interaction behavior recognition method based on multi-modal feature fusion
CN111832489A (en) Subway crowd density estimation method and system based on target detection
CN111553397A (en) Cross-domain target detection method based on regional full convolution network and self-adaption
CN112084869A (en) Compact quadrilateral representation-based building target detection method
US11853892B2 (en) Learning to segment via cut-and-paste
CN107767416A (en) The recognition methods of pedestrian's direction in a kind of low-resolution image
CN114117614A (en) Method and system for automatically generating building facade texture
CN112633220A (en) Human body posture estimation method based on bidirectional serialization modeling
CN116453121B (en) Training method and device for lane line recognition model
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN115577768A (en) Semi-supervised model training method and device
CN115829915A (en) Image quality detection method, electronic device, storage medium, and program product
Liu et al. Translational Symmetry-Aware Facade Parsing for 3-D Building Reconstruction
CN113762009B (en) Crowd counting method based on multi-scale feature fusion and double-attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117