CN106815563B - Human body apparent structure-based crowd quantity prediction method

Info

Publication number: CN106815563B
Application number: CN201611225785.8A
Authority: CN
Inventors: 黄思羽; 张仲非; 李玺
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2016-12-27
Filing date: 2016-12-27
Publication date: 2020-06-02
Anticipated expiration: 2036-12-27
Also published as: CN106815563A

Abstract

The invention discloses a crowd quantity prediction method based on the apparent structure of the human body, which predicts the number of people in a given scene image. The method comprises the following steps: acquiring a monitoring image data set for training a crowd quantity prediction model, and defining the algorithm target; modeling the apparent semantic structure of the pedestrian bodies in the monitoring image data set, and jointly modeling the density distribution and body shape of the pedestrians; establishing a crowd quantity prediction model from these modeling results; and predicting the number of people in a scene image with the prediction model. The method is suited to predicting the number of people in real video surveillance scenes, and achieves better accuracy and robustness under a variety of complex conditions.

Description

Human body apparent structure-based crowd quantity prediction method

Technical Field

The invention belongs to the field of computer vision, and particularly relates to a crowd quantity prediction method based on a human body apparent structure.

Background

Since the end of the 20th century, with the development of computer vision, intelligent video surveillance technology has attracted widespread attention and research. Crowd counting is one of its important and challenging tasks; the goal is to accurately predict the number of pedestrians in images of high-density crowds. The three key factors of the crowd counting task are the pedestrian's body, the pedestrian's head, and their contextual structure. When a person counts a crowd by eye, the semantic structure of the different body parts serves as a cue for locating each individual. Accurately predicting the number of people therefore requires analyzing the semantic structure of the pedestrian body.

Existing crowd counting methods generally fall into three categories: 1. Crowd counting based on pedestrian detectors. Such methods use various pedestrian detectors to match each pedestrian in the image. 2. Crowd counting based on global regression. Such methods mainly model the mapping from the crowd image to the crowd count. 3. Crowd counting based on density estimation. Such methods model the density distribution of the crowd and predict the crowd count from that distribution. Existing methods either model the pedestrian's entire body as a whole or model only the pedestrian's head. They ignore the rich semantic structure information of the pedestrian's body parts, even though exploiting this information can improve the performance of a crowd counting algorithm.

Disclosure of Invention

To solve the above problems, the present invention provides a crowd quantity prediction method based on the apparent structure of the human body, which predicts the number of people in a given scene image. Using a deep neural network, the method semantically models the apparent structure of the pedestrian body and the pedestrian density distribution, and predicts the crowd count from the modeling results, so that it adapts better to the complex conditions found in real video surveillance scenes.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A crowd quantity prediction method based on the apparent structure of the human body comprises the following steps:

S1, acquiring a monitoring image data set for training a crowd quantity prediction model, and defining the algorithm target;

S2, modeling the apparent semantic structure of the pedestrian body in the monitoring image data set, and jointly modeling the density distribution and body shape of the pedestrians;

S3, establishing a crowd quantity prediction model according to the modeling results of step S2;

S4, predicting the number of people in a scene image by using the prediction model.

Further, in step S1, the monitoring image data set for training the crowd quantity prediction model comprises scene images X_train, manually labeled pedestrian head positions P_train, and scene depth maps.

The algorithm target is defined as: predicting the number of pedestrians C_test in a scene image X_test.

Further, in step S2, modeling the apparent semantic structure of the pedestrian body specifically comprises:

S21, according to the head positions P_train of all pedestrians in the monitoring image data set and their corresponding scene depth values, determining the position and size of the bounding box of each pedestrian image, and cropping the pedestrian images I_train from the scene images X_train;

S22, inputting each pedestrian image in I_train into a single-pedestrian semantic segmentation system for semantic segmentation;

S23, for each scene image X, restoring the segmentation results of all its pedestrians to their original sizes and positions to obtain the crowd semantic structure map B of the scene image X, which reflects the semantic structure information of the body parts of all pedestrians in X.

Further, in step S2, jointly modeling the density distribution and body shape of the pedestrians specifically comprises:

S24, for each scene image X, jointly modeling the density distribution and body shape of the pedestrians to obtain the structured crowd density map D as given by formula (1), wherein p is the position of a pixel on D; N_h is a two-dimensional Gaussian kernel that approximates the shape of a human head; N_b is a two-dimensional Gaussian kernel that approximates the shape of a human body; P_h and P_b are the center positions of the head and body of the i-th pedestrian, respectively; P_h is taken from P_train; P_b is estimated from P_h and the scene depth value at P_h; σ_h and σ_b are the variances of N_h and N_b, estimated from the scene depth values at P_h and P_b, respectively; B_m is obtained by binarizing the crowd semantic structure map B; C is the number of pedestrians in the scene image X; and Z is a normalization coefficient that makes the density of each pedestrian on D sum to 1. The structured crowd density map D reflects the density distribution and body shape information of all pedestrians in the scene image X.
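The display for formula (1) did not survive in this text. The following is a hedged reconstruction from the term definitions in step S24, not the patent's verified wording. It assumes that the head kernel and the body kernel are summed, that the binarized crowd semantic structure map B_m acts as a multiplicative mask, and that the normalization coefficient is computed separately for each pedestrian (written Z_i here). The kernel names N_h and N_b and the per-pedestrian superscript i are also assumptions.

```latex
D(p) = \sum_{i=1}^{C} \frac{1}{Z_i}
       \Bigl[ \mathcal{N}_h\bigl(p;\, P_h^{i}, \sigma_h^{i}\bigr)
            + \mathcal{N}_b\bigl(p;\, P_b^{i}, \sigma_b^{i}\bigr) \Bigr] B_m(p),
\qquad
Z_i = \sum_{p} \Bigl[ \mathcal{N}_h\bigl(p;\, P_h^{i}, \sigma_h^{i}\bigr)
            + \mathcal{N}_b\bigl(p;\, P_b^{i}, \sigma_b^{i}\bigr) \Bigr] B_m(p)
```

Under this form, summing D(p) over all pixels gives C, because each pedestrian contributes a total density of 1.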

Further, in step S3, establishing the crowd quantity prediction model specifically comprises:

S31, establishing a deep convolutional neural network whose input is a scene image X and whose outputs are the predicted crowd semantic structure map B̂, the predicted structured crowd density map D̂, and the predicted number of pedestrians Ĉ corresponding to X, so that the structure of the neural network can be represented as the mapping X → (B̂, D̂, Ĉ);

S32, the sub-mapping X → B̂ uses a soft maximum (Softmax) loss function L_b, given by formula (2), wherein B̂ is one of the outputs of the neural network, B̂(h, w, i) denotes the value of B̂ at pixel position (h, w) and channel i, B is generated by the method described in step S23, and B(h, w) denotes the value of B at pixel position (h, w);

S33, the sub-mapping X → D̂ uses a Euclidean loss function L_d, given by formula (3), wherein D̂ is one of the outputs of the neural network and D is generated by the method described in step S24;

S34, the sub-mapping X → Ĉ uses a Euclidean loss function L_c, given by formula (4), wherein Ĉ is one of the outputs of the neural network and C is the manually labeled number of people;

S35, the loss function of the whole neural network is

L = L_c + λ_d L_d + λ_b L_b    (5)

The whole neural network is trained under the loss function L using stochastic gradient descent and the back propagation algorithm.
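The displays for the three per-task loss functions are missing from this text. As a hedged sketch only, they can be written in their standard forms, where L_b is the Softmax loss on the crowd semantic structure map B, L_d is the Euclidean loss on the structured crowd density map D, and L_c is the Euclidean loss on the pedestrian count C. The hat notation for network outputs, the averaging over the H × W pixels, and the factors of 1/2 are assumptions; the patent's exact normalization may differ.

```latex
\begin{aligned}
L_b &= -\frac{1}{HW} \sum_{h=1}^{H} \sum_{w=1}^{W}
       \log \frac{\exp \hat{B}\bigl(h, w, B(h, w)\bigr)}
                 {\sum_{i} \exp \hat{B}(h, w, i)} \\
L_d &= \frac{1}{2} \sum_{h=1}^{H} \sum_{w=1}^{W}
       \bigl( \hat{D}(h, w) - D(h, w) \bigr)^{2} \\
L_c &= \frac{1}{2} \bigl( \hat{C} - C \bigr)^{2}
\end{aligned}
```

Here B(h, w) is used as the index of the ground-truth channel at pixel (h, w), which is how a per-pixel Softmax loss is usually applied to a segmentation label map.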

Further, in step S4, predicting the number of people in the scene image specifically comprises: inputting the scene image to be predicted into the trained neural network and outputting the predicted crowd count, which is the crowd quantity prediction result.

Compared with the existing crowd quantity prediction method, the crowd quantity prediction method based on the human body apparent structure has the following beneficial effects:

Firstly, the crowd quantity prediction method of the invention identifies the semantic attributes of the crowd counting problem, and defines and models the three key factors of the problem: the body, the head, and their contextual structure. This formulation adapts better to the complexity of real scenes.

Secondly, the crowd quantity prediction method of the invention builds its prediction model on a deep convolutional neural network. A deep convolutional neural network expresses visual features well; in addition, visual feature extraction, pedestrian semantic modeling, and crowd count regression are unified in a single framework, which improves the final performance of the method.

The crowd quantity prediction method based on the apparent structure of the human body has good application value in intelligent video surveillance analysis systems, and can effectively improve the efficiency and accuracy of crowd quantity prediction. For example, in public safety applications, the method can quickly and accurately predict the number of pedestrians in the area covered by a surveillance camera, providing a basis for decisions on daily operation and emergency response in public places.

Drawings

Fig. 1 is a schematic flow chart of a human body apparent structure-based crowd quantity prediction method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

On the contrary, the invention is intended to cover alternatives, modifications, and equivalents that may be included within the spirit and scope of the invention as defined by the appended claims. Furthermore, in the following detailed description of the present invention, certain specific details are set forth in order to provide a better understanding of the present invention. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details.

Referring to fig. 1, in a preferred embodiment of the present invention, a method for predicting the number of people based on the apparent structure of human body comprises the following steps:

First, a monitoring image data set for training the crowd quantity prediction model is acquired. The data set comprises scene images X_train, manually labeled pedestrian head positions P_train, and scene depth maps. The algorithm target is defined as: predicting the number of pedestrians C_test in a scene image X_test.

Secondly, the apparent semantic structure of the pedestrian bodies in the monitoring image data set is modeled. Specifically:

In the first step, according to the head positions P_train of all pedestrians in the monitoring image data set and their corresponding scene depth values, the position and size of the bounding box of each pedestrian image are determined, and the pedestrian images I_train are cropped from the scene images X_train;

In the second step, each pedestrian image in I_train is input into a single-pedestrian semantic segmentation system for semantic segmentation;

In the third step, for each scene image X, the segmentation results of all its pedestrians are restored to their original sizes and positions to obtain the crowd semantic structure map B of X, which reflects the semantic structure information of the body parts of all pedestrians in X.
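The cropping, segmentation, and restoring steps above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the patent's implementation. The text does not give the rule that turns a head position and a depth value into a bounding box, so `pedestrian_box` assumes that the depth value is the pedestrian's apparent height in pixels and that the head sits at the top centre of the box. `segment_pedestrian` is a hypothetical stand-in for the single-pedestrian semantic segmentation system, and all function and parameter names are illustrative.

```python
import numpy as np

def pedestrian_box(head_xy, depth_value, img_shape, width_ratio=0.4):
    """Bounding box (top, bottom, left, right) for one pedestrian.

    Assumption: depth_value is the pedestrian's apparent height in pixels,
    the head is at the top centre of the box, and the box width is
    width_ratio times its height.
    """
    x, y = head_xy
    height = max(int(round(depth_value)), 1)
    width = max(int(round(width_ratio * height)), 1)
    top = int(np.clip(y, 0, img_shape[0] - 1))
    left = int(np.clip(x - width // 2, 0, img_shape[1] - 1))
    bottom = int(np.clip(top + height, top + 1, img_shape[0]))
    right = int(np.clip(left + width, left + 1, img_shape[1]))
    return top, bottom, left, right

def segment_pedestrian(crop):
    """Hypothetical stand-in for the single-pedestrian segmentation system.

    Labels the top fifth of the crop as head (1) and the rest as body (2).
    """
    labels = np.full(crop.shape[:2], 2, dtype=np.uint8)
    head_rows = max(crop.shape[0] // 5, 1)
    labels[:head_rows] = 1
    return labels

def build_structure_map(image, heads, depth_map):
    """Paste every pedestrian's label map back at its original position."""
    B = np.zeros(image.shape[:2], dtype=np.uint8)  # 0 = background
    for x, y in heads:
        t, b, l, r = pedestrian_box((x, y), depth_map[y, x], image.shape)
        labels = segment_pedestrian(image[t:b, l:r])
        # Where the new crop is background, keep what earlier pedestrians
        # wrote; elsewhere overwrite with the new labels.
        B[t:b, l:r] = np.where(labels > 0, labels, B[t:b, l:r])
    return B

# Toy usage: a blank 120 x 160 scene with two annotated heads.
image = np.zeros((120, 160, 3), dtype=np.uint8)
depth_map = np.full((120, 160), 40.0)
heads = [(30, 10), (100, 50)]
B = build_structure_map(image, heads, depth_map)
```

The resulting map has the same height and width as the scene image, with 0 for background and a positive label for every pixel covered by a segmented body part.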

Next, the density distribution and body shape of the pedestrians are jointly modeled. For each scene image X, the structured crowd density map D is obtained according to formula (1), wherein p is the position of a pixel on D, N_h is a two-dimensional Gaussian kernel that approximates the shape of a human head, and N_b is a two-dimensional Gaussian kernel that approximates the shape of a human body. P_h and P_b are the center positions of the head and body of the i-th pedestrian, respectively; P_h is taken from P_train, and P_b is estimated from P_h and the scene depth value at P_h. σ_h and σ_b are the variances of N_h and N_b, estimated from the scene depth values at P_h and P_b, respectively. B_m is obtained by binarizing the crowd semantic structure map B. C is the number of pedestrians in the scene image X, and Z is a normalization coefficient that makes the density of each pedestrian on D sum to 1. The structured crowd density map D reflects the density distribution and body shape information of all pedestrians in the scene image X.
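One way to build such a structured crowd density map is sketched below in Python with NumPy. It is an illustration under assumptions rather than the patent's formula: the head and body kernels are summed, the binarized crowd semantic structure map masks the result, each pedestrian is normalized separately, and both kernels are isotropic with the spread parameter treated as a standard deviation in pixels. The function names and arguments are hypothetical.

```python
import numpy as np

def gaussian_2d(shape, center_xy, sigma):
    """Isotropic 2-D Gaussian evaluated on every pixel of an image grid."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cx, cy = center_xy
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def structured_density_map(B, heads, bodies, sigmas_h, sigmas_b):
    """Structured crowd density map D from a semantic structure map B.

    Assumptions: head and body kernels are summed, the binarized map B_m
    masks the result, and each pedestrian is normalized to a total of 1.
    """
    B_m = (B > 0).astype(np.float64)   # binarize the structure map
    D = np.zeros(B.shape, dtype=np.float64)
    for p_h, p_b, s_h, s_b in zip(heads, bodies, sigmas_h, sigmas_b):
        kernel = gaussian_2d(B.shape, p_h, s_h) + gaussian_2d(B.shape, p_b, s_b)
        masked = kernel * B_m
        Z = masked.sum()               # per-pedestrian normalizer
        if Z > 0:                      # skip a pedestrian that is fully masked out
            D += masked / Z
    return D

# Toy usage: two rectangular "pedestrians" in a 60 x 80 scene.
B = np.zeros((60, 80), dtype=np.uint8)
B[10:50, 20:36] = 2
B[10:50, 52:68] = 2
D = structured_density_map(
    B,
    heads=[(28, 14), (60, 14)],
    bodies=[(28, 32), (60, 32)],
    sigmas_h=[3.0, 3.0],
    sigmas_b=[8.0, 8.0],
)
```

Because each pedestrian's masked kernel is divided by its own Z, the sum of D over all pixels equals the number of pedestrians, and D is zero wherever the structure map is background.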

Then, the crowd quantity prediction model is established. Specifically:

In the first step, a deep convolutional neural network is established. Its input is a scene image X, and its outputs are the predicted crowd semantic structure map B̂, the predicted structured crowd density map D̂, and the predicted number of pedestrians Ĉ corresponding to X, so that the structure of the neural network can be represented as the mapping X → (B̂, D̂, Ĉ).

In the second step, the sub-mapping X → B̂ uses a soft maximum (Softmax) loss function L_b, given by formula (2), wherein B̂ is one of the outputs of the neural network, B̂(h, w, i) denotes the value of B̂ at pixel position (h, w) and channel i, and B(h, w) denotes the value of the crowd semantic structure map B at pixel position (h, w).

In the third step, the sub-mapping X → D̂ uses a Euclidean loss function L_d, given by formula (3), wherein D̂ is one of the outputs of the neural network and D is generated by the method described in formula (1).

In the fourth step, the sub-mapping X → Ĉ uses a Euclidean loss function L_c, given by formula (4), wherein Ĉ is one of the outputs of the neural network and C is the manually labeled number of people.

In the fifth step, the loss function of the whole neural network is

L = L_c + λ_d L_d + λ_b L_b    (5)
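The combined loss of formula (5) can be illustrated with a short NumPy sketch. The per-task losses are written in common textbook forms (a mean per-pixel Softmax cross-entropy and half the sum of squared differences), which is an assumption because the text does not preserve their exact expressions. The default weights of 1.0 are placeholders, since the patent does not state values for λ_d and λ_b, and the function names are hypothetical.

```python
import numpy as np

def softmax_loss(B_hat, B):
    """Mean per-pixel softmax cross-entropy.

    B_hat: (H, W, K) raw scores per channel; B: (H, W) integer labels.
    """
    shifted = B_hat - B_hat.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    h, w = np.indices(B.shape)
    return -log_probs[h, w, B].mean()

def euclidean_loss(pred, target):
    """Half the sum of squared differences (one common Euclidean-loss form)."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    return 0.5 * np.sum(diff ** 2)

def total_loss(B_hat, D_hat, C_hat, B, D, C, lambda_d=1.0, lambda_b=1.0):
    """L = L_c + lambda_d * L_d + lambda_b * L_b, following formula (5)."""
    L_b = softmax_loss(B_hat, B)      # semantic structure map term
    L_d = euclidean_loss(D_hat, D)    # structured density map term
    L_c = euclidean_loss(C_hat, C)    # pedestrian count term
    return L_c + lambda_d * L_d + lambda_b * L_b

# Toy usage on a 2 x 2 image with three label channels.
B = np.zeros((2, 2), dtype=np.int64)
B_hat = np.zeros((2, 2, 3))
D, D_hat = np.zeros((2, 2)), np.ones((2, 2))
loss = total_loss(B_hat, D_hat, 5.0, B, D, 3.0)
```

In a real training setup these terms would be computed by a deep learning framework so that gradients can be back-propagated; the sketch only shows how the three task losses are weighted and summed.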

Finally, the established model is used to predict the number of people in the scene image to be predicted. Specifically, the scene image to be predicted is input into the trained neural network, and the crowd count that the network outputs is the crowd quantity prediction result.

In the above embodiment, the crowd quantity prediction method of the invention first models the apparent body structure and the density distribution of the pedestrians as two semantic scene models. On this basis, the original problem is converted into a multi-task learning problem, and a crowd quantity prediction model is established based on a deep neural network. Finally, the trained crowd quantity prediction model is used to predict the number of pedestrians in a new scene image.

Through the above technical scheme, the embodiment of the invention provides a deep-learning-based crowd quantity prediction algorithm for video surveillance scenes. The invention effectively models the body semantic structure information and the density distribution information of pedestrians at the same time, and thereby predicts the crowd count accurately.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A crowd quantity prediction method based on the apparent structure of the human body, characterized by comprising the following steps:

S1, acquiring a monitoring image data set for training a crowd quantity prediction model, the data set comprising scene images X_train, manually labeled pedestrian head positions P_train, and scene depth maps; and defining the algorithm target as: predicting the number of pedestrians C_test in a scene image X_test;

S2, modeling the apparent semantic structure of the pedestrian body in the monitoring image data set, and jointly modeling the density distribution and body shape of the pedestrians, specifically comprising:

S21, according to the head positions P_train of all pedestrians in the monitoring image data set and their corresponding scene depth values, determining the position and size of the bounding box of each pedestrian image, and cropping the pedestrian images I_train from the scene images X_train;

S22, inputting each pedestrian image in I_train into a single-pedestrian semantic segmentation system for semantic segmentation;

S23, for each scene image X, restoring the segmentation results of all its pedestrians to their original sizes and positions to obtain the crowd semantic structure map B of the scene image X, which reflects the semantic structure information of the body parts of all pedestrians in X;

S24, for each scene image X, jointly modeling the density distribution and body shape of the pedestrians to obtain the structured crowd density map D as given by formula (1), wherein p is the position of a pixel on D; N_h is a two-dimensional Gaussian kernel that approximates the shape of a human head; N_b is a two-dimensional Gaussian kernel that approximates the shape of a human body; P_h and P_b are the center positions of the head and body of the i-th pedestrian, respectively; P_h is taken from P_train; P_b is estimated from P_h and the scene depth value at the head position P_h; σ_h and σ_b are the variances of N_h and N_b, estimated from the scene depth value at the head position P_h and the scene depth value at the body center position P_b, respectively; B_m is obtained by binarizing the crowd semantic structure map B; C is the number of pedestrians in the scene image X; Z is a normalization coefficient that makes the density of each pedestrian on D sum to 1; and the structured crowd density map D reflects the density distribution and body shape information of all pedestrians in the scene image X;

S3, establishing a crowd quantity prediction model according to the modeling results of step S2, specifically comprising:

S31, establishing a deep convolutional neural network whose input is a scene image X and whose outputs are the prediction B̂ of the crowd semantic structure map, the prediction D̂ of the structured crowd density map, and the prediction Ĉ of the number of pedestrians in X, so that the structure of the neural network can be represented as the mapping X → (B̂, D̂, Ĉ);

S32, the sub-mapping X → B̂ uses a soft maximum (Softmax) loss function L_b, given by formula (2), wherein B̂ is one of the outputs of the neural network, B̂(h, w, i) denotes the value of B̂ at pixel position (h, w) and channel i, B is generated by the method described in step S23, and B(h, w) denotes the value of B at pixel position (h, w);

S33, the sub-mapping X → D̂ uses a Euclidean loss function L_d, given by formula (3), wherein D̂ is one of the outputs of the neural network and D is generated by the method described in step S24;

S34, the sub-mapping X → Ĉ uses a Euclidean loss function L_c, given by formula (4);

S35, the loss function of the whole neural network is

L = L_c + λ_d L_d + λ_b L_b    (5)

and the whole neural network is trained under the loss function L using stochastic gradient descent and the back propagation algorithm;

S4, predicting the number of people in a scene image by using the prediction model.

2. The crowd quantity prediction method based on the apparent structure of the human body according to claim 1, wherein in step S4, predicting the number of people in the scene image specifically comprises: inputting the scene image X to be predicted into the trained neural network, the number of pedestrians C in the scene image X output by the network being the prediction result.