CN110889343B - Crowd density estimation method and device based on attention type deep neural network - Google Patents

Crowd density estimation method and device based on attention type deep neural network

Info

Publication number
CN110889343B
Authority
CN
China
Prior art keywords
attention
neural network
image
crowd density
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911118138.0A
Other languages
Chinese (zh)
Other versions
CN110889343A (en)
Inventor
陈宋健
冯瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201911118138.0A priority Critical patent/CN110889343B/en
Publication of CN110889343A publication Critical patent/CN110889343A/en
Application granted granted Critical
Publication of CN110889343B publication Critical patent/CN110889343B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a crowd density estimation method based on an attention-type deep neural network, characterized in that crowd density is detected from an image to be detected by a deep neural network model based on an attention mechanism. The method comprises the following steps: step S1, preprocessing the image to be detected to obtain preprocessed images; step S2, constructing a group expansion deep neural network model based on an attention mechanism; step S3, training the group expansion deep neural network model on a training set; and step S4, inputting the preprocessed images into the trained group expansion deep neural network model, thereby obtaining and outputting the crowd density result for each preprocessed image. The group expansion deep neural network model has two modules, a common convolution module and a decoding convolution module with an attention mechanism, and each convolution operation of the decoding convolution module is followed by a channel attention operation and a spatial attention operation.

Description

Crowd density estimation method and device based on attention type deep neural network
Technical Field
The invention belongs to the technical field of computer vision and artificial intelligence, relates to a crowd density estimation method and device for high-density crowd scenes, and particularly relates to a crowd density estimation method and device based on a deep neural network model with an attention mechanism.
Background
With the rapid improvement of machine learning techniques and computer hardware performance, breakthrough progress has been made in recent years in application fields such as computer vision, natural language processing and speech recognition. As a basic task in computer vision, crowd density estimation has also seen a great improvement in accuracy.
The crowd density estimation task may be described in detail as follows:
For a crowd scene captured in a photograph, a recorded video, or a live camera feed, generate a density map representing the crowd density per unit area; based on this density map, the per-unit-area densities are summed to obtain the overall crowd density of the scene, or the change in crowd density over the whole video.
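In other words, if $D(i,j)$ denotes the value of the density map at pixel $(i,j)$ of an $H \times W$ map, the estimated head count of the scene is obtained by summing the map (a standard formulation in crowd counting, stated here for clarity; the symbols are ours, not the patent's):

$$\hat{C} = \sum_{i=1}^{H}\sum_{j=1}^{W} D(i,j)$$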
Crowd density estimation is of great significance both to the field of computer vision and to practical applications, and over the past decades it has drawn close attention and sustained research effort from a large number of researchers. With the development of powerful machine learning theory and feature analysis techniques, research activity on crowd density estimation has kept increasing in recent decades, and new research results and practical applications are published every year. Furthermore, crowd density estimation is applied in many practical tasks such as intelligent video surveillance and crowd situation analysis.
However, the detection accuracy of the various crowd density estimation methods in the prior art is still low and cannot meet the needs of real, general estimation tasks. Thus, crowd density estimation has not yet been well solved and remains an important and challenging research topic.
In order to improve the accuracy of density estimation, a commonly used approach at present is to increase the amount of training data when training the prediction model. However, collecting a large amount of training data is an extremely difficult task, and a larger training set also extends the training time of the model, to the point where it may become practically impossible to complete the training.
Disclosure of Invention
In order to solve the above problems, the invention provides a crowd density estimation method and a crowd density estimation device that have a simple structure and low training cost while guaranteeing detection accuracy, and the invention adopts the following technical scheme:
the invention provides a crowd density estimation method based on an attention-based deep neural network, which is used for estimating crowd density and is characterized in that the crowd density is detected from an image to be detected by adopting a deep neural network model based on an attention mechanism, and the crowd density estimation method comprises the following steps: step S1, preprocessing an image to be detected to obtain a preprocessed image; step S2, constructing a group expansion deep neural network model based on an attention mechanism; s3, inputting a training set containing a plurality of training images into the built grouping expansion depth neural network model so as to perform model training; and S4, inputting the preprocessed images into a trained group expansion depth neural network model so as to obtain crowd density results in each preprocessed image and outputting the crowd density results, wherein the group expansion depth neural network model is provided with two modules, including a common convolution module and a decoding convolution module with an attention mechanism, and each convolution operation of the decoding convolution module is followed by a channel attention addition and a space attention addition.
The crowd density estimation method based on the attention-type deep neural network provided by the invention may also have the technical feature that the model optimizer of the group expansion deep neural network model is stochastic gradient descent with a learning rate of ten to the power of minus seven, and that step S3 comprises the following sub-steps: step S3-1, sequentially inputting each training image in the training set into the group expansion deep neural network model and performing one iteration; step S3-2, respectively calculating the loss errors using the model parameters of the last layer of the group expansion deep neural network model; step S3-3, back-propagating the loss error so as to update the model parameters of the group expansion deep neural network model; and step S3-4, repeating steps S3-2 to S3-3 until the training completion condition is reached, thereby obtaining the trained group expansion deep neural network model.
The crowd density estimation method based on the attention-based deep neural network provided by the invention can also have the technical characteristics that the image to be detected is a high-density crowd image, and the preprocessing performed in the step S1 comprises image segmentation of the image to be detected.
The crowd density estimation method based on the attention-based deep neural network provided by the invention can also have the technical characteristics that the image segmentation method is to equally divide the image into 9 parts.
The crowd density estimation method based on the attention-based deep neural network provided by the invention can also have the technical characteristics that the preprocessing of the step S1 further comprises regularizing the segmented image.
The invention also provides a crowd density estimation device based on an attention-type deep neural network, used for estimating crowd density, characterized in that the crowd density is detected from an image to be detected by a deep neural network model based on an attention mechanism. The crowd density estimation device comprises: a preprocessing part for preprocessing the image to be detected to obtain preprocessed images; and a density prediction part for predicting the crowd density result from the preprocessed images and outputting it, the density prediction part comprising a trained group expansion deep neural network model based on an attention mechanism. The group expansion deep neural network model has two modules, a common convolution module and a decoding convolution module with an attention mechanism, and each convolution operation of the decoding convolution module is followed by a channel attention operation and a spatial attention operation.
The actions and effects of the invention
According to the crowd density estimation method and device based on the attention-type deep neural network, a group expansion deep neural network model incorporating two attention mechanisms, channel attention and spatial attention, is introduced as the prediction model. This allows the model to better localize crowds and identify crowd density, so that the model can learn more features and express them better, is better suited to the crowd density estimation task for high-density crowds, and ultimately improves the crowd density estimation accuracy. In addition, the group expansion deep neural network model has only one common convolution module and one decoding convolution module, so its structure is simple and methods such as model mixing, multi-task training and metric learning are not needed; compared with existing high-accuracy models, it is fast and convenient to construct, and the amount of computation consumed in training is small.
Drawings
FIG. 1 is a flow chart of a crowd density estimation method based on an attention-based deep neural network in an embodiment of the invention;
FIG. 2 is a schematic diagram of a packet-expanded deep neural network model in an embodiment of the present invention;
FIG. 3 is a schematic diagram of the structure of the attention module according to an embodiment of the present invention; and
FIG. 4 is a schematic view of the channel attention structure in an embodiment of the present invention.
Detailed Description
In order to make the technical means, creative features, objectives and effects of the present invention easy to understand, the present invention is specifically described below with reference to the embodiments and the accompanying drawings.
< example >
The data set used in this embodiment is UCF-QNRF. UCF-QNRF is a challenging high-density scene data set containing 1535 annotated pictures with a total of 1,251,642 labeled heads; the number of people in each picture ranges from 49 to 12,865, and the average picture resolution is 2013 × 2902, making it a high-resolution data set.
In addition, the hardware platform realized by the crowd density estimation method based on the attention-based deep neural network in this embodiment requires an NVIDIA 1080Ti graphics card (GPU acceleration).
According to the embodiment, firstly, preprocessing is carried out on a data set picture, then a deep neural network model based on an attention mechanism is trained, and finally, the crowd density of the picture is obtained through the deep neural network model. The method specifically comprises 4 processes: preprocessing, constructing a model, training the model and predicting the density.
Fig. 1 is a flowchart of a crowd density estimation method based on an attention-based deep neural network in an embodiment of the invention.
As shown in fig. 1, the crowd density estimation method based on the attention-based deep neural network includes the following steps:
step S1, preprocessing an image to be detected to obtain a preprocessed image.
In this embodiment, the image to be detected is an image obtained from the UCF-QNRF data set. Because the data set images have very high resolution, an image cannot be input into the model directly; it is first divided into 9 equal parts, which are then input into the model. The divided data set is also duplicated, giving twice the amount of training data, which increases the data volume and improves the prediction accuracy of the model.
In the above process of this embodiment, segmenting the image is a countermeasure against the excessively high resolution of the image, while duplicating the training data increases the number of available images and thus realizes data augmentation, so that the data obtained from the images to be detected are richer and more training epochs can be iterated.
In other embodiments, the image to be detected may also be a single image (e.g., a photograph), in which case no segmentation operation is required. In addition, in other embodiments, the images may not be duplicated, or other existing data augmentation methods (e.g., vertical flipping, or a combination of horizontal and vertical flipping) may be used.
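Purely for illustration (not part of the patent), a minimal preprocessing sketch along these lines, assuming OpenCV/NumPy, a 3 × 3 split and simple mean/std regularization, might look as follows; the function names and normalization constants are assumptions:

```python
import cv2
import numpy as np

def split_into_nine(image: np.ndarray) -> list:
    """Divide an image into 9 equal parts (3 x 3 grid)."""
    h, w = image.shape[:2]
    return [image[i * h // 3:(i + 1) * h // 3, j * w // 3:(j + 1) * w // 3]
            for i in range(3) for j in range(3)]

def preprocess(image_path: str) -> list:
    """Split a high-resolution crowd image, duplicate the patches, and normalize them."""
    image = cv2.imread(image_path).astype(np.float32) / 255.0
    patches = split_into_nine(image)
    patches = patches + [p.copy() for p in patches]  # duplicate to double the data
    # Regularization (normalization); the mean/std values are placeholders.
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    return [(p - mean) / std for p in patches]
```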
Step S2, constructing a group expansion deep neural network model (hereinafter referred to as the deep neural network model, or simply the model).
In step S2 of this embodiment, a deep neural network model based on an attention mechanism is built using an existing deep learning framework. The attention-based deep neural network model is a deep neural network into which an attention mechanism has been introduced; it can be divided into two main modules, namely a VGG-16-based front-end network module (i.e., the common convolution module) and a decoding convolution module into which two kinds of attention, channel attention and spatial attention, are introduced.
Specifically, the deep neural network model of this embodiment is composed of convolution layers, pooling layers, channel attention and spatial attention, and in the decoding convolution module each convolution operation is followed by an attention module.
Fig. 2 is a schematic structural diagram of a packet-expanded deep neural network model in an embodiment of the present invention.
As shown in Fig. 2, the group expansion deep neural network model includes, arranged in sequence, an input layer I, a front-end network B1 consisting of the first ten layers of VGG-16, a convolutional layer C2, an attention module A1, a convolutional layer C3, an attention module A2, a convolutional layer C4, an attention module A3, a convolutional layer C5, an attention module A4, a convolutional layer C6, an attention module A5, a convolutional layer C7 and a convolutional layer C8. Specifically:
(1) An input layer I for inputting each preprocessed image
(2) A plurality of convolution structures, including:
convolution structure C1: convolution kernel size 3×3, sliding step size 1, padding 0, output 1/64 the size of the input image;
convolution structure C2: convolution kernel size 3×3, padding 0, output 1/64 the size of the input image;
convolution structure C3: convolution kernel size 3×3, sliding step size 1, padding 0, output 1/64 the size of the input image;
convolution structure C4: convolution kernel size 3×3, sliding step size 1, padding 0, output 1/64 the size of the input image;
convolution structure C5: convolution kernel size 3×3, sliding step size 1, padding 0, output 1/64 the size of the input image;
convolution structure C6: convolution kernel size 3×3, sliding step size 1, padding 0, output 1/64 the size of the input image;
convolution structure C7: convolution kernel size 1×1, padding 0, output 1/64 the size of the input image;
(3) A plurality of attention modules A1, A2, A3, A4 and A5. As shown in Fig. 3, each attention module consists of channel attention and spatial attention and is combined with the original input. In addition, a merging operation using a sigmoid function is performed after each attention.
FIG. 4 is a block diagram of channel attention in an embodiment of the invention.
As shown in Fig. 4, the input feature map is first max-pooled and average-pooled, the two results are summed, and the channel attention is then obtained through a sigmoid function.
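As an illustrative sketch only (not the patent's reference implementation), the attention components and the decoding convolution module described above could be expressed in PyTorch roughly as follows. The spatial-attention formulation, the channel counts and the dilation rates are assumptions made for illustration, since the text does not fully specify them:

```python
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ChannelAttention(nn.Module):
    """Channel attention (Fig. 4): max pooling + average pooling, summed, then a sigmoid."""
    def __init__(self):
        super().__init__()
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        return self.sigmoid(self.max_pool(x) + self.avg_pool(x))      # (N, C, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention (an assumed CBAM-style formulation; the patent does not detail it)."""
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        max_map, _ = torch.max(x, dim=1, keepdim=True)
        avg_map = torch.mean(x, dim=1, keepdim=True)
        return self.sigmoid(self.conv(torch.cat([max_map, avg_map], dim=1)))  # (N, 1, H, W)

class AttentionModule(nn.Module):
    """Attention module A1..A5: channel and spatial attention combined with the original input."""
    def __init__(self):
        super().__init__()
        self.channel_att = ChannelAttention()
        self.spatial_att = SpatialAttention()

    def forward(self, x):
        x = x * self.channel_att(x)    # re-weight channels of the original input
        x = x * self.spatial_att(x)    # re-weight spatial positions
        return x

class GroupDilatedAttentionNet(nn.Module):
    """VGG-16 front end (first ten convolution layers) followed by a decoding convolution
    module in which each convolution is followed by an attention module (Fig. 2).
    The channel counts and dilation rates below are illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.frontend = nn.Sequential(*list(vgg16().features.children())[:23])
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=2, dilation=2),
                nn.ReLU(inplace=True),
                AttentionModule(),
            )
        self.decoder = nn.Sequential(
            block(512, 512), block(512, 512), block(512, 256),
            block(256, 128), block(128, 64),
            nn.Conv2d(64, 1, 1),   # final 1x1 convolution producing the density map
        )

    def forward(self, x):
        return self.decoder(self.frontend(x))
```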
Step S3, inputting training data into the constructed attention-based deep neural network model so as to perform model training.
In this embodiment, the crowd data set UCF-QNRF is used as the training set. The same preprocessing as in step S1 is applied: starting from the 1535 images containing 1,251,642 labeled heads, the images are segmented and duplicated to achieve data augmentation, and regularization is then carried out; the resulting images form the training set of this embodiment.
The images in the training set enter the network model in batches for training; the batch size of the training images entering the network each time is 1, and training is iterated 2000 times.
The model parameters of the attention-based deep neural network model in this embodiment are randomly initialized, the optimizer is stochastic gradient descent (SGD), and the learning rate is ten to the power of minus seven (1e-7).
During model training, after each iteration (that is, after a training image has passed through the model), the loss error (squared loss) is calculated at the last layer and then back-propagated, so that the model parameters are updated. The training completion condition is the same as for a conventional deep neural network model, namely training is complete once the model parameters of each layer have converged.
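A minimal training-loop sketch consistent with the settings above (SGD, learning rate 1e-7, batch size 1, squared loss between predicted and ground-truth density maps); the data-loader object and function names are assumed placeholders, not part of the patent:

```python
import torch
import torch.nn as nn

def train(model, train_loader, iterations: int = 2000, lr: float = 1e-7, device: str = "cuda"):
    """Train the attention-based density model with SGD and a squared (MSE) loss."""
    model = model.to(device).train()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.MSELoss(reduction="sum")   # squared loss over the density map

    done = 0
    while done < iterations:
        for image, gt_density in train_loader:   # batch size 1
            image, gt_density = image.to(device), gt_density.to(device)
            pred_density = model(image)
            loss = criterion(pred_density, gt_density)
            optimizer.zero_grad()
            loss.backward()       # back-propagate the loss error
            optimizer.step()      # update the model parameters
            done += 1
            if done >= iterations:
                break
    return model
```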
Through this iterative training, with loss calculation and back-propagation at each iteration, the trained attention-based deep neural network model is obtained. In this embodiment, the trained model is used to perform crowd density estimation in complex scenes.
Step S4, inputting the preprocessed images obtained through preprocessing into the trained attention-based deep neural network model, so that the crowd density result is obtained from the model and output.
In this embodiment, after a preprocessed image passes through the attention-based deep neural network model, a corresponding density map is output, and the crowd density of the image can be obtained from the density map.
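For illustration (reusing the hypothetical model and preprocessing sketches above), inference then reduces to a forward pass per patch followed by summing the predicted density maps; the helper names are assumptions:

```python
import torch

@torch.no_grad()
def estimate_count(model, patches, device: str = "cuda") -> float:
    """Return the estimated head count of one image from its preprocessed patches."""
    model = model.to(device).eval()
    total = 0.0
    for patch in patches:                      # patches: list of (H, W, 3) float arrays
        x = torch.from_numpy(patch).permute(2, 0, 1).unsqueeze(0).float().to(device)
        density_map = model(x)                 # (1, 1, h, w) predicted density map
        total += density_map.sum().item()      # integrate the density over the patch
    return total
```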
In addition, the UCF-QNRF test set is used as the images to be detected to test the model; the scenes are high-density crowd scenes. The specific process is as follows:
and (3) preprocessing a plurality of images in the data set by using the UCF-QNRF data set as described in the step S1 to obtain 334 images (i.e. preprocessed images after preprocessing) as a test set, sequentially inputting a trained deep neural network model based on an attention mechanism, generating a corresponding density map, and calculating to obtain a crowd density result. In addition, other crowd density estimation models in the prior art are adopted to carry out comparison test on the same test set, and the results are shown in the following table:
Table 1: Comparative results of crowd density estimation on the UCF-QNRF dataset against other prior-art techniques
[Table 1 is provided only as an image in the original publication; its full contents are not reproduced here.]
In Table 1, MCNN, CP-CNN, TDF-CNN, ic-CNN, D-ConvNet and CSRNet are several prior-art models commonly cited for their relatively high crowd density estimation accuracy. MAE denotes the mean absolute error and MSE the mean squared error between the predicted and actual results.
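For reference, these metrics are commonly defined in the crowd counting literature as follows (the notation is ours and is added for clarity; $C_i$ and $\hat{C}_i$ denote the ground-truth and predicted counts of the $i$-th of $N$ test images, and MSE is conventionally reported as the root of the mean squared error in this literature):

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\bigl|\hat{C}_i - C_i\bigr|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\hat{C}_i - C_i\bigr)^2}$$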
The above tests show that the crowd density estimation method of this embodiment, based on the attention-mechanism deep neural network model, achieves a mean absolute error (MAE) of 237.0 and a mean squared error (MSE) of 276.9 on the test set, the most accurate prediction among the existing methods compared. Testing also shows that the error rate of the method on large-scale dense crowds is only about 10%.
Operation and Effects of the Embodiment
According to the crowd density estimation method and device based on the attention-type deep neural network of this embodiment, a group expansion deep neural network model incorporating two attention mechanisms, channel attention and spatial attention, is introduced as the prediction model. This allows the model to better localize crowds and identify crowd density, so that the model can learn more features and express them better, is better suited to the crowd density estimation task for high-density crowds, and ultimately improves the crowd density estimation accuracy. In addition, the group expansion deep neural network model of this embodiment has only one common convolution module and one decoding convolution module, so its structure is simple and methods such as model mixing, multi-task training and metric learning are not needed; compared with existing high-accuracy models, it is fast and convenient to construct, and the amount of computation consumed in training is small.
In addition, the results in Table 1 also show that, compared with traditional computer vision methods, the method of this embodiment greatly improves detection accuracy, maintains good precision under different detection difficulties and different detection environments, and performs particularly well in complex scenes.
In addition, the crowd density estimation method of this embodiment, based on a deep neural network with an attention mechanism, introduces two attention mechanisms, channel attention and spatial attention. The spatial attention plays a very significant role, while the channel attention also brings a slight improvement; introducing both at the same time allows them to complement each other, so that accuracy is improved further.
The above examples are only for illustrating the specific embodiments of the present invention, and the present invention is not limited to the description scope of the above examples.
For example, the above embodiment provides a crowd density estimation method and device based on an attention mechanism and a deep neural network, the method mainly comprising the steps of preprocessing, model construction, model training and crowd density estimation. For convenience in practical use, however, the trained group expansion deep neural network model can be packaged to form a density estimation part, and this density estimation part, together with the preprocessing part that preprocesses the image to be detected, can form a crowd density estimation device based on the attention-type deep neural network model, so that after the image to be detected is processed by the preprocessing part, the density estimation part estimates and outputs the corresponding crowd density.

Claims (6)

1. A crowd density estimation method based on an attention-based deep neural network is used for estimating crowd density, and is characterized in that the crowd density is detected from an image to be detected by adopting a deep neural network model based on an attention mechanism, and the method comprises the following steps:
step S1, preprocessing the image to be detected to obtain a preprocessed image;
step S2, constructing a group expansion deep neural network model based on an attention mechanism;
s3, inputting a training set containing a plurality of training images into the built group expansion depth neural network model so as to perform model training;
step S4, inputting the preprocessed images into the trained group expansion depth neural network model, thereby obtaining crowd density results in each preprocessed image and outputting the crowd density results,
wherein the group expansion depth neural network model is provided with two modules, including a common convolution module and a decoding convolution module with an attention mechanism,
each convolution operation of the decoding convolution module is followed by an attention module, which consists of channel attention and spatial attention,
each attention is followed by a sigmoid function merge operation,
the channel attention acquisition process comprises: carrying out maximum pooling and average pooling on the feature map input to the attention module, adding the two results obtained by the maximum pooling and the average pooling, and obtaining the channel attention of the feature map through the sigmoid function.
2. The crowd density estimation method based on an attention-based deep neural network of claim 1, wherein:
wherein the model optimizer included in the group expansion deep neural network model is stochastic gradient descent, and the learning rate is ten to the power of minus seven,
the step S3 includes the following sub-steps:
s3-1, sequentially inputting each training image in the training set into the group expansion depth neural network model and performing iteration once;
s3-2, respectively calculating loss errors by adopting model parameters of the last layer of the grouping expansion depth neural network model;
step S3-3, back-propagating the loss error so as to update model parameters of the group expansion depth neural network model;
and step S3-4, repeating the steps S3-2 to S3-3 until the training completion condition is reached, and obtaining the trained group expansion depth neural network model.
3. The crowd density estimation method based on an attention-based deep neural network of claim 1, wherein:
wherein the image to be measured is a high-density crowd image,
the preprocessing performed in the step S1 includes image segmentation of the image to be detected.
4. A crowd density estimation method based on an attention-based deep neural network according to claim 3, characterized in that:
the image segmentation method is to uniformly divide the image into 9 parts.
5. The crowd density estimation method based on an attention-based deep neural network of claim 1, wherein:
wherein the preprocessing in step S1 further includes regularizing the segmented image.
6. The crowd density estimation device based on the attention-based deep neural network is used for estimating crowd density, and is characterized in that the crowd density is detected from an image to be detected by adopting a deep neural network model based on an attention mechanism, and the device comprises:
a preprocessing part for preprocessing the image to be detected to obtain a preprocessed image;
a density predicting part for predicting crowd density results from the preprocessed images and outputting the crowd density results, wherein the density predicting part comprises a trained group expansion depth neural network model based on an attention mechanism,
wherein the group expansion depth neural network model is provided with two modules, including a common convolution module and a decoding convolution module with an attention mechanism,
each convolution operation of the decoding convolution module is followed by an attention module, which consists of channel attention and spatial attention,
each attention is then followed by a merge operation using a sigmoid function,
the channel attention acquisition process comprises: carrying out maximum pooling and average pooling on the feature map input to the attention module, adding the two results obtained by the maximum pooling and the average pooling, and obtaining the channel attention of the feature map through the sigmoid function.
CN201911118138.0A 2019-11-15 2019-11-15 Crowd density estimation method and device based on attention type deep neural network Active CN110889343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911118138.0A CN110889343B (en) 2019-11-15 2019-11-15 Crowd density estimation method and device based on attention type deep neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911118138.0A CN110889343B (en) 2019-11-15 2019-11-15 Crowd density estimation method and device based on attention type deep neural network

Publications (2)

Publication Number Publication Date
CN110889343A CN110889343A (en) 2020-03-17
CN110889343B true CN110889343B (en) 2023-05-05

Family

ID=69747608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911118138.0A Active CN110889343B (en) 2019-11-15 2019-11-15 Crowd density estimation method and device based on attention type deep neural network

Country Status (1)

Country Link
CN (1) CN110889343B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111640119B (en) * 2020-04-09 2023-11-17 北京邮电大学 Image processing method, processing device, electronic equipment and storage medium
CN112464971A (en) * 2020-04-09 2021-03-09 丰疆智能软件科技(南京)有限公司 Method for constructing pest detection model
CN111832414B (en) * 2020-06-09 2021-05-14 天津大学 Animal counting method based on graph regular optical flow attention network
CN111680648B (en) * 2020-06-12 2023-04-18 成都数之联科技股份有限公司 Training method of target density estimation neural network
CN112766123B (en) * 2021-01-11 2022-07-22 山东师范大学 Crowd counting method and system based on criss-cross attention network
CN112818849B (en) * 2021-01-31 2024-03-08 南京工业大学 Crowd density detection algorithm based on context attention convolutional neural network for countermeasure learning
CN112989952B (en) * 2021-02-20 2022-10-18 复旦大学 Crowd density estimation method and device based on mask guidance
CN112818904A (en) * 2021-02-22 2021-05-18 复旦大学 Crowd density estimation method and device based on attention mechanism
CN112818907A (en) * 2021-02-22 2021-05-18 复旦大学 Crowd density estimation method and device based on course learning mechanism
CN113361374B (en) * 2021-06-02 2024-01-05 燕山大学 Crowd density estimation method and system
CN114120245A (en) * 2021-12-15 2022-03-01 平安科技(深圳)有限公司 Crowd image analysis method, device and equipment based on deep neural network
CN114511636B (en) * 2022-04-20 2022-07-12 科大天工智能装备技术(天津)有限公司 Fruit counting method and system based on double-filtering attention module


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190113119A (en) * 2018-03-27 2019-10-08 삼성전자주식회사 Method of calculating attention for convolutional neural network
CN109447008A (en) * 2018-11-02 2019-03-08 中山大学 Population analysis method based on attention mechanism and deformable convolutional neural networks
CN109492618A (en) * 2018-12-06 2019-03-19 复旦大学 Object detection method and device based on grouping expansion convolutional neural networks model
CN110070073A (en) * 2019-05-07 2019-07-30 国家广播电视总局广播电视科学研究院 Pedestrian's recognition methods again of global characteristics and local feature based on attention mechanism
CN110188685A (en) * 2019-05-30 2019-08-30 燕山大学 A kind of object count method and system based on the multiple dimensioned cascade network of double attentions

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SCAR: Spatial-/Channel-wise Attention Regression Networks for Crowd Counting; Junyu Gao et al.; Neurocomputing; 2019-08-08; pp. 2-4 *

Also Published As

Publication number Publication date
CN110889343A (en) 2020-03-17

Similar Documents

Publication Publication Date Title
CN110889343B (en) Crowd density estimation method and device based on attention type deep neural network
Yang et al. DRFN: Deep recurrent fusion network for single-image super-resolution with large factors
Zhang et al. A late fusion cnn for digital matting
CN108320297B (en) Video target real-time tracking method and system
CN110852267B (en) Crowd density estimation method and device based on optical flow fusion type deep neural network
CN109271933A (en) The method for carrying out 3 D human body Attitude estimation based on video flowing
CN108121950B (en) Large-pose face alignment method and system based on 3D model
CN109711283A (en) A kind of joint doubledictionary and error matrix block Expression Recognition algorithm
CN105719248B (en) A kind of real-time Facial metamorphosis method and its system
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN112818904A (en) Crowd density estimation method and device based on attention mechanism
CN110827320B (en) Target tracking method and device based on time sequence prediction
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN114519844A (en) Crowd density estimation method and system based on visual converter
CN112507904A (en) Real-time classroom human body posture detection method based on multi-scale features
CN113808047A (en) Human motion capture data denoising method
Son et al. Partial convolutional LSTM for spatiotemporal prediction of incomplete data
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN111401209B (en) Action recognition method based on deep learning
Hu et al. Reduced-reference image deblurring quality assessment based on multi-scale feature enhancement and aggregation
CN112380967B (en) Spatial artificial target spectrum unmixing method and system based on image information
Jiang et al. Tcgan: Semantic-aware and structure-preserved gans with individual vision transformer for fast arbitrary one-shot image generation
CN114485417A (en) Structural vibration displacement identification method and system based on deep circulation neural network optical flow estimation model
Song et al. Spatial-aware dynamic lightweight self-supervised monocular depth estimation
CN112818907A (en) Crowd density estimation method and device based on course learning mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant