CN111523494A - Human body image detection method - Google Patents

Human body image detection method

Info

Publication number
CN111523494A
CN111523494A (application CN202010341723.3A)
Authority
CN
China
Prior art keywords
human body
image
detection
loss
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010341723.3A
Other languages
Chinese (zh)
Inventor
侯峦轩
马鑫
赫然
孙哲南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Original Assignee
Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd filed Critical Tianjin Zhongke Intelligent Identification Industry Technology Research Institute Co ltd
Priority to CN202010341723.3A priority Critical patent/CN111523494A/en
Publication of CN111523494A publication Critical patent/CN111523494A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Abstract

The invention discloses a human body image detection method comprising the following steps: preprocessing an input training image with automatic data augmentation and data expansion; detecting the human body with a feature pyramid network built on dilated-convolution bottleneck blocks; cropping the bounding box around the detected human body and keeping only the image inside the box; and feeding the cropped image into the designed model for training to obtain a pedestrian detection model. The invention performs two-dimensional spatial detection on an input image containing a human body; the generated image carries accurate human body spatial information at a small computational cost.

Description

Human body image detection method
Technical Field
The invention relates to the technical field of image processing, in particular to a human body image detection method.
Background
Human body image detection refers to marking the spatial geometric position of a human body in an image that contains one. Human detection is a computer technology, related to computer vision and image processing, for detecting semantic objects of a particular class in digital images and videos. Intensively researched object detection domains include face detection and pedestrian detection. Human detection has applications in many areas of computer vision, including image retrieval and video surveillance.
Because the human body is articulated, it takes on varied postures and shapes; its appearance is strongly affected by clothing, pose, and viewing angle, and is further subject to occlusion, lighting, and similar factors, which makes pedestrian detection a very challenging topic in computer vision. The main difficulties pedestrian detection must solve are:
First: large appearance variation, including viewing angle, pose, apparel and accessories, lighting, and imaging distance. Pedestrians look very different from different viewpoints, and pedestrians in different postures also differ greatly in appearance. Different clothing, and accessories such as open umbrellas, hats, scarves, and luggage, change the appearance substantially. Differences in illumination add further difficulty, and a distant human body looks very different from a nearby one.
Second: occlusion. In many application scenes pedestrians are dense and severely occluded, so only part of the body is visible, which poses a serious challenge to detection algorithms.
Third: complex backgrounds. Indoors or outdoors, the backgrounds faced by pedestrian detection are generally very complex, and some objects are very similar to the human body in appearance, shape, color, and texture, so the algorithm cannot distinguish them accurately.
Fourth: detection speed. Pedestrian detection generally uses complex models with a large amount of computation; reaching real time is difficult and usually requires extensive optimization.
The idea of background modeling algorithms is to learn a background model from previous frames and then compare the current frame against the background to obtain the moving target, i.e., the changed region of the image. Background modeling is simple and fast to implement, but has the following problems: only moving targets can be detected, and stationary targets cannot be handled; illumination changes and shadows have a large influence; if the target's color is very close to the background, misses and fragmentation occur; it is easily disturbed by bad weather such as rain and snow and by nuisance motion such as swaying leaves; and multiple adhered or overlapping targets cannot be separated. The root cause is that these background modeling algorithms use only pixel-level information and do not exploit higher-level semantic information in the image.
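As a toy illustration of the background-modeling idea just described (a background learned from previous frames, with the current frame thresholded against it), the sketch below uses a generic exponential running average and a fixed threshold; both are illustrative choices, not any specific algorithm from the patent:

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Exponential running average of past frames as the background model."""
    return (1 - alpha) * bg + alpha * frame

def foreground_mask(bg, frame, thresh=30):
    """Changed region: pixels whose difference from the background is large."""
    return np.abs(frame.astype(float) - bg) > thresh
```

Note how the mask is purely pixel-level, which is exactly why such methods miss stationary targets and confuse background-colored objects, as the text observes.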
The second family of methods is based on machine learning. Machine-learning-based methods are the mainstream of current pedestrian detection algorithms, mostly following the hand-crafted-features-plus-classifier scheme. The human body has its own appearance characteristics; features can be designed manually and then used to train a classifier that separates pedestrians from the background. Common features include color, edges, and texture; common classifiers include neural networks, SVM, AdaBoost, and random forests. Since this is a detection problem, a sliding-window technique is generally used.
As the technology develops further, high-quality, high-accuracy human bounding boxes matter greatly for user experience and market competitiveness. The bounding boxes produced by existing human body image methods do not yet meet this requirement and carry large uncertainty. It is therefore necessary to further improve human body detection methods.
Disclosure of Invention
The invention aims to provide a human body image detection method that addresses the speed and accuracy problems of existing detection methods, improving the quality of the generated human body bounding boxes and reducing uncertainty.
In order to achieve the purpose of the invention, the invention provides a human body image detection method, which comprises the following steps:
S1, preprocessing the image data in an image database: applying automatic data enhancement to the original images, where each specific automatic data enhancement operation is described by a triple of the operation, its application probability P, and its magnitude M;
S2, feeding the original image into a feature pyramid network based on dilated convolution for detection, and outputting only the human body image marked with a bounding box; a deep neural network model that marks human body images with bounding boxes is obtained through training; the human body images augmented and cropped in step S1 are used as the network input, the json files in the training set that annotate the human bounding boxes in xy-axis coordinate form are used as the ground truth, and the detection network in the deep neural network model is trained, yielding a trained detection neural network model that maps a human body image to the same image with a bounding box;
and S3, performing pose estimation processing on the images containing human bodies in the test data set with the trained deep neural network model.
Further, the enhancement process includes random flipping, random rotation, and random scaling, with specific parameters.
Furthermore, the feature pyramid network FPN processes the picture with a specific data enhancement method, modifies the last two stages of the FPN specifically for target detection, and crops the detected human body image as input,
the method specifically comprises the following steps:
adopting ResNet50 as the backbone network to extract features, and randomly initializing the ResNet50 network with a standard Gaussian distribution;
according to the features extracted by ResNet50, the four feature maps are retained and named P2, P3, P4, P5, and stage 5 is added by attaching a convolution with kernel size 1 × 1, its feature map being P6; and after stage 4 the spatial resolution of the feature maps is kept constant, i.e.

S_x = i / 2^x for x ≤ 4, and S_x = i / 2^4 for x > 4,

where S_x denotes the spatial resolution of the stage-x feature map, i is the original image size, and x ∈ {2, 3, 4, 5, 6}; at P4, P5, P6, convolutions with kernel size 1 × 1 are attached to keep the numbers of channels consistent;
and finally, the feature maps of stages 4-6 are summed according to the pyramid framework to form the FPN feature pyramid, target detection is performed with the Fast RCNN method, and the network is constrained by a regression loss and a classification loss. The classification loss and the regression loss are fused; the classification loss adopts the log loss, the regression loss is the same as in R-CNN, and the total loss function is:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

Two branches are attached to the last fully connected layer of the detection network. One is a softmax used to classify each ROI region; if K classes are to be distinguished, the output is p = (p_0, ..., p_K). The other is a bounding-box regressor for refining the ROI regions, whose output

t^k = (t^k_x, t^k_y, t^k_w, t^k_h)

represents the coordinates of the bounding box of class k, where (x, y) are the coordinates of the upper-left corner of the bounding box and (x + w, y + h) the coordinates of the lower-right corner; u is the ground truth of each ROI region, and v is the regression target of the ground-truth bounding box; λ is a hyperparameter controlling the balance between the two task losses, here λ = 1; and the indicator [u ≥ 1] equals 1 when u ≥ 1.

The classification loss is specifically:

L_cls(p, u) = -log p_u

a loss function in log form;

the regression loss is specifically:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i)

where v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) is the predicted box position of class u, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise.
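The multi-task loss described above follows the standard Fast R-CNN form, and can be sketched numerically as below; the function names and array shapes are illustrative assumptions, not code from the patent:

```python
import numpy as np

def smooth_l1(x):
    # smooth_L1(x) = 0.5 x^2 if |x| < 1, else |x| - 0.5
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x ** 2, ax - 0.5)

def total_loss(p, u, t_u, v, lam=1.0):
    """p: class probabilities (K+1,); u: ground-truth class id (0 = background);
    t_u: predicted box offsets for class u (4,); v: regression targets (4,)."""
    l_cls = -np.log(p[u])                                  # L_cls = -log p_u
    l_loc = smooth_l1(np.asarray(t_u, float) - np.asarray(v, float)).sum()
    return l_cls + lam * (1.0 if u >= 1 else 0.0) * l_loc  # [u >= 1] gate
```

The indicator [u ≥ 1] means background ROIs (u = 0) contribute only a classification term, since there is no box to regress for them.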
compared with the prior art, the human body detection network has the advantages that the problem of contradiction between operation performance and detection performance existing in detection is solved by the human body detection network according to the properties, the detection performance is improved by keeping the spatial resolution of the characteristic diagram and expanding the receptive field by using the cavity convolution, and the human body detection image with a very good perception effect can be generated by combining the human body image boundary frame detection model with the cavity convolution. In addition, because the existing detection method based on deep learning generally generalizes a classification network to a human body detection task by adding a convolution layer, most of the pre-training models are obtained based on the classification network at present and are not beneficial to directly generalizing to the human body detection model, by means of the proposed human body image detection model of the deep neural network fusing the cavity convolution, a residual error network is used as the basis for constructing the model, and a pyramid structure, particularly a related bounding lattice, is combined, so that the perception field of the model is larger, the effect is better, and the generalization capability is stronger.
Drawings
FIG. 1 is a process flow diagram of the method of the present invention;
FIG. 2 is a block diagram of a human body detection network according to the present invention;
FIG. 3 illustrates the connection of operations between p _4, p _5, and p _6 according to the present invention;
FIG. 4 is a block diagram of the different types of bottleneck blocks of the present invention;
FIG. 5 is a schematic diagram of the summing operation in the detection network of the present invention;
FIG. 6 is a sample visualization result of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
As shown in FIGS. 1-6, the human body image detection method comprises the following steps:
in step S1, specific data enhancement is first performed on the image training set data, and first we define all possible data enhancements that can be applied to the image, as shown in the following table (the parameters all correspond to the parameters of the TensorFlow corresponding function):
[Table of all candidate data enhancement operations, reproduced only as an image in the original publication]
the following specific operations were employed:
[Table of the specific operations adopted, reproduced only as an image in the original publication]
an enhancement policy is defined as an unordered set of K sub-policies. During training, one of the K sub-strategies will be randomly selected and then applied to the current image. Each sub-strategy has 2 image enhancement operations, where P is the probability value (between the range 0-1) for each operation, M is the parameter magnitude, and each parameter magnitude is normalized to be within the interval 0-10.
In step S2, the training input data are used to train the human body image detection model of the neural network fused with dilated-convolution bottlenecks, so as to complete the human body image detection task.
In step S2, the cropped images containing human bodies and the annotation information from step S1 are used as the network input, and the annotated human bounding boxes (json files with rectangular boxes marked in xy-axis coordinate form) are used as the ground truth for training the human detection network in the deep model, completing the task from image input to output of the image with a bounding box. Specifically, after the human body images detected by the detection network are cropped, feature maps are extracted with ResNet50 as the backbone network.
And in step S3, target detection is performed on the images in the training data set using the human detector; among all class boxes only the human bounding boxes are retained, a cropping operation generates human body images of corresponding size 224 × 224, the human-bounding-box json annotation files in the data set are used as the annotation information of the corresponding human bodies, and the COCO API is called to accelerate I/O reading.
The human body detection network is trained on all 80 classes of the COCO data set, and finally the human class is selected for output (the output image marks the human body with a bounding box). The specific structure is shown in FIG. 2; the specific design of the human detection network and the modules in the figure are explained as follows:
adopting ResNet50 as the backbone network to extract features, and randomly initializing the ResNet50 network with a standard Gaussian distribution;
according to the features extracted by ResNet50, the four feature maps are retained and named P2, P3, P4, P5, and stage 5 is added by attaching a convolution with kernel size 1 × 1, its feature map being P6;
and after stage 4 we keep the spatial resolution of the feature maps unchanged, i.e.

S_x = i / 2^x for x ≤ 4, and S_x = i / 2^4 for x > 4,

where downsampling between stages is accomplished by 3 × 3 convolutions or pooling layers with stride 2, S_x denotes the spatial resolution of the stage-x feature map, and i is the original image size (here 224 × 224), with x ∈ {2, 3, 4, 5, 6}; at P4, P5, P6, convolutions with kernel size 1 × 1 are attached to keep the numbers of channels consistent (256 channels).
The transformations among P4, P5, P6 are realized by two types of bottleneck blocks, A and B, whose design is shown in FIG. 4; each type of bottleneck consists of 1 × 1 convolutions, a 3 × 3 convolution with dilation rate 2, and ReLU layers.
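The point of the dilated 3 × 3 convolutions in these bottlenecks is to enlarge the receptive field without further downsampling. Standard convolution arithmetic (generic geometry, not code from the patent) makes this concrete: a 3 × 3 kernel with dilation d spans 3 + 2(d − 1) pixels, and with matching padding the spatial resolution is preserved:

```python
def effective_kernel(k, d):
    """Effective extent of a k x k convolution with dilation rate d."""
    return k + (k - 1) * (d - 1)

def out_size(n, k, stride=1, pad=0, d=1):
    """Output spatial size of a convolution over an n x n input."""
    return (n + 2 * pad - effective_kernel(k, d)) // stride + 1
```

So a 3 × 3 convolution with dilation 2 sees a 5 × 5 window, yet with stride 1 and padding 2 it leaves the feature-map size unchanged — which is how the network keeps the stage-4 resolution constant while growing the receptive field.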
And finally, the feature maps of stages 4-6 are summed according to the pyramid framework, with the lateral-connection summing shown in FIG. 5, to form the FPN feature pyramid; target detection is performed with the Fast RCNN method, constrained by a regression loss and a classification loss. The multi-loss fusion (classification loss plus regression loss) is the prediction operation in FIG. 2; the classification loss is a log loss (the negative log of the probability of the true class, the classification output being (K + 1)-dimensional), and the regression loss is the same as in R-CNN (smooth L1 loss). The overall loss function is:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

Two branches are attached to the last fully connected layer of the detection network. One is a softmax used to classify each ROI region; if K classes are to be distinguished (adding the background, K + 1 classes in total), the output is p = (p_0, ..., p_K). The other is a bounding-box regressor for refining the ROI regions, whose output

t^k = (t^k_x, t^k_y, t^k_w, t^k_h)

represents the coordinates of the bounding box of class k, where (x, y) are the coordinates of the upper-left corner of the bounding box and (x + w, y + h) the coordinates of the lower-right corner. u is the ground truth of each ROI region, and v is the regression target of the ground-truth bounding box. λ is a hyperparameter controlling the balance between the two task losses, here λ = 1. The indicator [u ≥ 1] equals 1 when u ≥ 1.
The classification loss is specifically:

L_cls(p, u) = -log p_u

a loss function in log form.

The regression loss is specifically:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i)

where v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) is the predicted box position of class u, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise.
In addition, the cropping operation first expands the box to a fixed aspect ratio, then crops, and then applies data enhancement such as random flipping, random rotation, and random scaling to the bounding-box region of the image containing the human bounding box.
Further, in all training steps the data set is the MSCOCO training set (57K images containing 150K human instances). After detection by the detector network (FPN + RoIAlign) in step S2, among all detected bounding boxes only the human bounding boxes are used (i.e., in all experiments, the boxes of the human class among the top 100 boxes of all classes). Each human bounding box is expanded to the fixed aspect ratio height:width = 384:288, and the cropped image is correspondingly resized to the default height of 384 pixels and width of 288 pixels. The corresponding data enhancement policy is then applied to the cropped image: random rotation (-45° to +45°) and random scaling (0.7 to 1.35). The annotation information of the corresponding picture (the json file containing the human-bounding-box positions) is used as the ground truth.
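Expanding a detected person box to the fixed height:width = 384:288 aspect ratio before cropping can be sketched as follows; the grow-about-center policy (and the omission of clipping to image bounds) is an assumption, since the patent does not specify how the box is anchored during expansion:

```python
def expand_to_aspect(x, y, w, h, target_h=384, target_w=288):
    """Grow (never shrink) a box about its center to height:width = 384:288.
    Clipping to image bounds is omitted for brevity."""
    cx, cy = x + w / 2.0, y + h / 2.0
    ratio = target_h / target_w          # 4:3
    if h / w > ratio:                    # too tall relative to 384:288 -> widen
        w = h / ratio
    else:                                # too wide -> heighten
        h = w * ratio
    return cx - w / 2.0, cy - h / 2.0, w, h
```

Growing only the shorter side ensures the whole detected body stays inside the crop, after which the region can be resized to 384 × 288 without distortion.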
To describe the specific embodiment of the invention in detail and verify its effectiveness, the proposed method was trained on a public data set. The database contains photos of natural scenes such as animals and animated characters (used as distractors to improve the robustness of the model and its applicability to real natural scenes). All images of the data set are selected as the training set, image data are automatically augmented, target detection is performed on all training images with the trained feature pyramid network (FPN), only the human-class bounding boxes are output, the corresponding cropped human body images are generated, and the global network and the refinement network are trained with gradient back-propagation until convergence, yielding the human detection model.
To test the validity of the model, input images were processed; the visualization is shown in FIG. 6. In the experiment, the results were compared with the ground-truth images, as shown in FIG. 1. This embodiment effectively demonstrates the effectiveness of the proposed method for human body image detection.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and such modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (3)

1. A human body image detection method is characterized by comprising the following steps:
S1, preprocessing the image data in an image database: applying automatic data enhancement to the original images, where each specific automatic data enhancement operation is described by a triple of the operation, its application probability P, and its magnitude M;
S2, feeding the original image into a feature pyramid network based on dilated convolution for detection, and outputting only the human body image marked with a bounding box; a deep neural network model that marks human body images with bounding boxes is obtained through training; the human body images augmented and cropped in step S1 are used as the network input, the json files in the training set that annotate the human bounding boxes in xy-axis coordinate form are used as the ground truth, and the detection network in the deep neural network model is trained, yielding a trained detection neural network model that maps a human body image to the same image with a bounding box;
and S3, performing pose estimation processing on the images containing human bodies in the test data set with the trained deep neural network model.
2. The human body image detection method according to claim 1, wherein the enhancement process includes random flipping, random rotation, and random scaling, with specific parameters.
3. The human body image detection method according to claim 1, wherein the feature pyramid network FPN processes the picture with a specific data enhancement method, modifies the last two stages of the FPN specifically for target detection, and crops the detected human body image as input,
the method specifically comprises the following steps:
adopting ResNet50 as the backbone network to extract features, and randomly initializing the ResNet50 network with a standard Gaussian distribution;
according to the features extracted by ResNet50, the four feature maps are retained and named P2, P3, P4, P5, and stage 5 is added by attaching a convolution with kernel size 1 × 1, its feature map being P6; and after stage 4 the spatial resolution of the feature maps is kept constant, i.e.

S_x = i / 2^x for x ≤ 4, and S_x = i / 2^4 for x > 4,

where S_x denotes the spatial resolution of the stage-x feature map, i is the original image size, and x ∈ {2, 3, 4, 5, 6}; at P4, P5, P6, convolutions with kernel size 1 × 1 are attached to keep the numbers of channels consistent;
and finally, the feature maps of stages 4-6 are summed according to the pyramid framework to form the FPN feature pyramid, target detection is performed with the Fast RCNN method, and the network is constrained by a regression loss and a classification loss. The classification loss and the regression loss are fused; the classification loss adopts the log loss, the regression loss is the same as in R-CNN, and the total loss function is:

L(p, u, t^u, v) = L_cls(p, u) + λ[u ≥ 1] L_loc(t^u, v)

Two branches are attached to the last fully connected layer of the detection network. One is a softmax used to classify each ROI region; if K classes are to be distinguished, the output is p = (p_0, ..., p_K). The other is a bounding-box regressor for refining the ROI regions, whose output

t^k = (t^k_x, t^k_y, t^k_w, t^k_h)

represents the coordinates of the bounding box of class k, where (x, y) are the coordinates of the upper-left corner of the bounding box and (x + w, y + h) the coordinates of the lower-right corner; u is the ground truth of each ROI region, and v is the regression target of the ground-truth bounding box; λ is a hyperparameter controlling the balance between the two task losses, here λ = 1; and the indicator [u ≥ 1] equals 1 when u ≥ 1,
the classification loss is specifically:

L_cls(p, u) = -log p_u

a loss function in log form;

the regression loss is specifically:

L_loc(t^u, v) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t^u_i - v_i)

where v = (v_x, v_y, v_w, v_h) is the position of the real box of class u, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) is the predicted box position of class u, and

smooth_L1(x) = 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise.
CN202010341723.3A 2020-04-27 2020-04-27 Human body image detection method Pending CN111523494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010341723.3A CN111523494A (en) 2020-04-27 2020-04-27 Human body image detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010341723.3A CN111523494A (en) 2020-04-27 2020-04-27 Human body image detection method

Publications (1)

Publication Number Publication Date
CN111523494A true CN111523494A (en) 2020-08-11

Family

ID=71903067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010341723.3A Pending CN111523494A (en) 2020-04-27 2020-04-27 Human body image detection method

Country Status (1)

Country Link
CN (1) CN111523494A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580778A (en) * 2020-11-25 2021-03-30 江苏集萃未来城市应用技术研究所有限公司 Job worker mobile phone use detection method based on YOLOv5 and Pose-animation
CN112686282A (en) * 2020-12-11 2021-04-20 天津中科智能识别产业技术研究院有限公司 Target detection method based on self-learning data
CN114693935A (en) * 2022-04-15 2022-07-01 湖南大学 Medical image segmentation method based on automatic data augmentation

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124415A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Subcategory-aware convolutional neural networks for object detection
CN108038409A (en) * 2017-10-27 2018-05-15 江西高创保安服务技术有限公司 A kind of pedestrian detection method
CN109063559A (en) * 2018-06-28 2018-12-21 东南大学 A kind of pedestrian detection method returned based on improvement region
CN110321923A (en) * 2019-05-10 2019-10-11 上海大学 Object detection method, system and the medium of different scale receptive field Feature-level fusion
CN110443144A (en) * 2019-07-09 2019-11-12 天津中科智能识别产业技术研究院有限公司 A kind of human body image key point Attitude estimation method
CN111160085A (en) * 2019-11-19 2020-05-15 天津中科智能识别产业技术研究院有限公司 Human body image key point posture estimation method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
R. Girshick: "Fast R-CNN", 2015 IEEE International Conference on Computer Vision (ICCV) *
Zeming Li et al.: "DetNet: A Backbone Network for Object Detection", arXiv:1804.06215v2 [cs.CV] *


Similar Documents

Publication Publication Date Title
CN108446617B (en) Side face interference resistant rapid human face detection method
JP4898800B2 (en) Image segmentation
CN111523494A (en) Human body image detection method
Geng et al. Using deep learning in infrared images to enable human gesture recognition for autonomous vehicles
CN112560675B (en) Bird visual target detection method combining YOLO and rotation-fusion strategy
CN113408584B (en) RGB-D multi-modal feature fusion 3D target detection method
Nedović et al. Stages as models of scene geometry
JP2002203239A (en) Image processing method for detecting human figure in digital image
CN113408594B (en) Remote sensing scene classification method based on attention network scale feature fusion
CN110110755B (en) Pedestrian re-identification detection method and device based on PTGAN region difference and multiple branches
CN112686928B (en) Moving target visual tracking method based on multi-source information fusion
CN113160062B (en) Infrared image target detection method, device, equipment and storage medium
CN110909724B (en) Thumbnail generation method of multi-target image
Wang et al. Mask-RCNN based people detection using a top-view fisheye camera
CN114565675A (en) Method for removing dynamic feature points at front end of visual SLAM
Al-Heety Moving vehicle detection from video sequences for traffic surveillance system
CN114782417A (en) Real-time detection method for digital twin characteristics of fan based on edge enhanced image segmentation
CN115019274A (en) Pavement disease identification method integrating tracking and retrieval algorithm
CN109448093B (en) Method and device for generating style image
Shuai et al. An improved YOLOv5-based method for multi-species tea shoot detection and picking point location in complex backgrounds
Ilehag et al. Classification and representation of commonly used roofing material using multisensorial aerial data
CN117079125A (en) Kiwi fruit pollination flower identification method based on improved YOLOv5
Li et al. Multi-class weather classification based on multi-feature weighted fusion method
CN115713546A (en) Lightweight target tracking algorithm for mobile terminal equipment
Bertozzi et al. Multi stereo-based pedestrian detection by daylight and far-infrared cameras

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200811