CN113052827B - Crowd counting method and system based on multi-branch expansion convolutional neural network - Google Patents
Crowd counting method and system based on multi-branch expansion convolutional neural network
- Publication number: CN113052827B
- Application number: CN202110354656.3A
- Authority: CN (China)
- Prior art keywords: image, crowd, head position, module, label
- Prior art date: 2021-03-30
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/0002—Image analysis; Inspection of images, e.g. flaw detection
- G06N3/045—Neural networks; Architecture; Combinations of networks
- G06N3/08—Neural networks; Learning methods
- G06T2207/10004—Image acquisition modality; Still image; Photographic image
- G06T2207/20081—Special algorithmic details; Training; Learning
- G06T2207/20084—Special algorithmic details; Artificial neural networks [ANN]
- G06T2207/30196—Subject of image; Human being; Person
- G06T2207/30242—Subject of image; Counting objects in image
Abstract
The invention belongs to the field of computer vision and provides a crowd counting method and system based on a multi-branch dilated convolutional neural network. The method comprises the following steps: acquiring scene images containing crowds and generating, from each scene image, a crowd density map label and a head position binary map label; constructing a training set from the training samples, where each image together with its crowd density map label and head position binary map label forms one training sample; training a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model; inputting the image to be detected into the trained multi-branch dilated convolution crowd counting network model and outputting a crowd density map; and summing the pixel values of the crowd density map to obtain the crowd counting result.
Description
Technical Field
The invention belongs to the field of computer vision and particularly relates to a crowd counting method and system based on a multi-branch dilated convolutional neural network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Crowd counting aims to estimate, in real time, the distribution of the crowd appearing in image or video data and to count the number of people present. In recent years, crowd counting has become a research hotspot in the field of computer vision. Its main application is intelligent security: providing the crowd distribution and the number of people in real time makes it possible to analyze and control pedestrian flow effectively and to prevent safety accidents.
Because of the influence of shooting angle and shooting distance, the apparent size of people in the target crowd varies greatly within an image or video, which poses a great challenge for research on crowd counting methods.
Disclosure of Invention
To address the scale differences of target crowds, the invention provides a crowd counting method and system based on a multi-branch dilated convolutional neural network. A multi-branch dilated convolutional network with shared training parameters is designed to extract features with different receptive fields using fewer network parameters, and a supervised head position binary map is used to guide the network to attend to head positions, thereby achieving more accurate crowd counting.
In order to achieve the purpose, the invention adopts the following technical scheme:
A first aspect of the invention provides a crowd counting method based on a multi-branch dilated convolutional neural network.
The crowd counting method based on a multi-branch dilated convolutional neural network comprises the following steps:
acquiring scene images containing crowds and generating, from each scene image, a crowd density map label and a head position binary map label;
constructing a training set from the training samples, where each image together with its crowd density map label and head position binary map label forms one training sample;
training a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model;
inputting the image to be detected into the trained multi-branch dilated convolution crowd counting network model and outputting a crowd density map;
and summing the pixel values of the crowd density map to obtain the crowd counting result.
A second aspect of the invention provides a crowd counting system based on a multi-branch dilated convolutional neural network.
The crowd counting system based on a multi-branch dilated convolutional neural network comprises:
a label generation module configured to: acquire scene images containing crowds and generate, from each scene image, a crowd density map label and a head position binary map label;
a training set building module configured to: construct a training set from the training samples, where each image together with its crowd density map label and head position binary map label forms one training sample, and the samples together form the training set;
a model training module configured to: train a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model;
a crowd counting application module configured to: input the image to be detected into the trained multi-branch dilated convolution crowd counting network model and output a crowd density map;
an output module configured to: sum the pixel values of the crowd density map to obtain the crowd counting result.
A third aspect of the invention provides a computer-readable storage medium.
A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the crowd counting method based on a multi-branch dilated convolutional neural network described in the first aspect above.
A fourth aspect of the invention provides a computer device.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the crowd counting method based on a multi-branch dilated convolutional neural network described in the first aspect.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses dilated convolution to design three feature extraction branches whose extracted features have different receptive fields, which effectively handles the head-size differences caused by shooting angle and shooting distance.
2. The invention shares network parameters among the multiple dilated convolution branches, which effectively reduces the number of trainable parameters and improves the training speed of the network.
3. The head position binary map estimation module designed by the invention, on the one hand, supervises and guides the network to extract more stable features and, on the other hand, helps the crowd density map estimation module locate head positions more accurately, thereby improving the accuracy of crowd counting.
4. The original image is used as the input of the multi-branch dilated convolution crowd counting network model, and its outputs are the head position binary map generated by the binary map estimation module and the crowd density map generated by the crowd density map estimation module. After supervised training, the binary map estimation module outputs a binary map that represents head positions and sizes; the Hadamard product of this binary map and the fused features is then computed and used as the input of the crowd density map estimation module, so that the crowd density map estimation module can locate head positions more accurately for density estimation, which mitigates counting errors caused by scale differences in the target crowd.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
FIG. 1 is a flow chart of the crowd counting method based on a multi-branch dilated convolutional neural network according to the present invention;
FIG. 2 is a flow chart of the crowd counting method in an embodiment;
FIG. 3 is a block diagram of the multi-branch dilated convolutional neural network in an embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.
Example one
As shown in FIGS. 1-2, this embodiment provides a crowd counting method based on a multi-branch dilated convolutional neural network. The embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system comprising a terminal and a server and implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application. In this embodiment, the method comprises the following steps:
step (1): acquiring a scene image containing crowds and respectively generating a crowd density map label and a head position binary map label according to the scene image;
specifically, scene images containing crowds are obtained, and the position of the head of each image is marked;
generating a crowd density map label according to the marked head position;
and generating a human head position binary image label by using a binarization function according to the crowd density image label.
In an example, the server may obtain crowd scene images and annotate the head position in each image. The crowd scene images may be captured by cameras used for scene monitoring, for example subway monitoring or shopping mall monitoring.
In step (1), the crowd density map label is generated with fixed-size Gaussian kernels: a Gaussian kernel of fixed size whose values sum to 1 is placed at each head position. For the head position binary map, the head size is first estimated with a nearest-neighbour algorithm, then the pixels at the head position are set to 1 and all other positions to 0, so that the head position binary map label carries head size information.
Specifically, the crowd density map label is generated from the annotated head positions as shown in formula (1):

Σ_{l_i ∈ l} δ(x - l_i) * G_{σ_i}(x)    (1)

where l denotes the set of all head positions in the image and l_i denotes the coordinates of the centre of the i-th target head position; δ(·) is the impulse function, G(·) is the Gaussian kernel, and σ_i denotes the variance of the Gaussian kernel, which is set to 8 in this embodiment. That is, each head position is covered with a Gaussian kernel whose values sum to 1 and whose variance is 8, and non-head positions are set to 0.
Specifically, the head position binary map label is generated from the annotated head positions as shown in formula (2):

B( Σ_{l_i ∈ l} δ(x - l_i) * G_{σ_i}(x) )    (2)

In formula (2), B(·) is the binarization function, and the other variables and functions have the same meaning as in formula (1). The head position binary map label is therefore generated as follows: a response map is first built in the same way as the crowd density map label but with the variance of the Gaussian kernel set to 15, and the binarization function then converts the result into a binary map, i.e. non-zero pixels are reset to 1 and all remaining pixels are 0.
Step (2): constructing a training set from the training samples; each image together with its crowd density map label and head position binary map label forms one training sample.
Constructing the training set from the training samples comprises:
performing data expansion on each training sample by random cropping, mirroring and rotation to construct the training set.
Specifically, random cropping, mirroring and rotation are used for data expansion: 50 image blocks whose length and width are multiples of 32 and smaller than the original image size are cropped at random; horizontal and vertical mirroring are then applied to the 50 image blocks, giving 150 image blocks in total; finally, each of the 150 image blocks is additionally rotated by 15 degrees, giving 300 image blocks. Note that the same operations are applied to the crowd density map label and the head position binary map label.
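A rough sketch of this expansion scheme is given below, assuming images and labels are NumPy arrays; the crop-size rule (multiples of 32, smaller than the original) and the counts (50/150/300) follow the text above, while everything else (random generator, rotation settings) is an illustrative choice.

```python
import numpy as np
from scipy.ndimage import rotate

def expand(image, seed=0):
    """Expand one training image into 300 blocks: 50 random crops, plus
    horizontal and vertical mirrors (150), plus a 15-degree rotation of
    each (300). Assumes the image is at least 64x64; apply the same
    transforms to the density and binary map labels."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    crops = []
    for _ in range(50):                                  # 50 random crops
        ch = 32 * int(rng.integers(1, max(2, h // 32)))  # multiple of 32,
        cw = 32 * int(rng.integers(1, max(2, w // 32)))  # smaller than the image
        y = int(rng.integers(0, h - ch + 1))
        x = int(rng.integers(0, w - cw + 1))
        crops.append(image[y:y + ch, x:x + cw])
    blocks = (crops
              + [c[:, ::-1] for c in crops]              # horizontal mirror
              + [c[::-1, :] for c in crops])             # vertical mirror -> 150
    blocks += [rotate(b, 15, reshape=False) for b in blocks]  # rotate -> 300
    return blocks
```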
Step (3): training the multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model.
The multi-branch dilated convolution crowd counting network model comprises a multi-branch dilated convolution module, a feature fusion module, a binary map estimation module and a density map estimation module.
Considering that dilated convolution obtains features with a larger receptive field using fewer parameters, this embodiment designs a multi-branch convolution module that extracts multi-scale features with three parameter-sharing dilated convolution branches, and designs a head position binary map estimation module to enhance the features at head positions, which helps the crowd density map estimation module locate head positions more accurately and improves the accuracy of crowd counting.
In one embodiment, the multi-branch convolution module comprises three dilated convolution branches that share network parameters and have different dilation rates, and performs multi-scale feature extraction on the crowd images. The feature fusion module fuses the features of the three dilated convolution branches and then performs further feature extraction on the fused features to generate a feature map. The binary map estimation module estimates the binary map under supervision with a cross-entropy loss function. The density map estimation module receives the output of the binary map estimation module, computes its Hadamard product with the feature map generated by the feature fusion module, and then estimates the crowd density map with a three-layer convolution operation under supervision with a cross-entropy loss function.
Specifically, FIG. 3 shows the crowd counting network based on the multi-branch dilated convolutional network. As shown in FIG. 3, the multi-branch dilated convolution module first performs multi-scale feature extraction on a crowd image block. The module comprises three branches; each branch uses 3x3 convolution kernels, the branches share parameters, and their dilation rates are set to 1, 2 and 3 respectively. With this arrangement, the network can extract features with different receptive fields using fewer parameters and can effectively cope with differences in head size.
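To make the parameter-sharing idea concrete, the following is a minimal PyTorch sketch in which a single set of 3x3 convolution weights is reused at dilation rates 1, 2 and 3; the channel sizes, activation and single-layer depth are placeholders chosen for this sketch, not the patent's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedDilatedBranches(nn.Module):
    """Three dilated convolution branches that share one set of weights."""
    def __init__(self, in_ch=3, out_ch=64):
        super().__init__()
        # one set of 3x3 weights, shared by all three branches
        self.weight = nn.Parameter(torch.empty(out_ch, in_ch, 3, 3))
        self.bias = nn.Parameter(torch.zeros(out_ch))
        nn.init.kaiming_normal_(self.weight)

    def forward(self, x):
        # same weights, different dilation rates -> different receptive fields
        return [F.relu(F.conv2d(x, self.weight, self.bias,
                                padding=d, dilation=d)) for d in (1, 2, 3)]
```

Because the three branches reuse the same weights, the trainable parameter count is that of a single branch, while the receptive field still changes with the dilation rate.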
Next, the feature fusion module fuses the features extracted by the different branches; specifically, the three feature maps are added together, a 1x1 convolution reduces the dimensionality, and a 3x3 convolution performs further feature extraction.
After fusion, the features are split into two paths that are fed into the head position binary map generation module and the crowd density estimation module respectively. The head position binary map generation module further extracts features from the fused features, and the head position binary map label supervises the prediction of head positions; this supervised training allows the network to extract more stable features. In addition, the Hadamard product of the head position binary map produced by the supervised head position binary map generation module and the features obtained by the feature fusion module is used as the input of the crowd density map estimation module, which helps the crowd density map estimation module locate head positions more accurately when estimating the crowd density.
Step (4): inputting the image to be detected into the trained multi-branch dilated convolution crowd counting network model and outputting a crowd density map.
Step (5): summing the pixel values of the crowd density map to obtain the crowd counting result.
For a test image, the image to be tested is input into the trained multi-branch dilated convolution crowd counting network model, a crowd density map is estimated for the newly received image data, and finally the pixel values of the output crowd density map are summed to obtain the predicted number of people in the image.
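As a usage illustration under the assumptions of the previous sketch, the count is obtained simply by summing the predicted density map:

```python
model = MultiBranchCrowdCounter()
model.eval()
with torch.no_grad():
    image = torch.rand(1, 3, 384, 512)        # stand-in for a test image tensor
    density, _ = model(image)
    predicted_count = density.sum().item()    # pixel-wise sum = estimated head count
print(f"estimated number of people: {predicted_count:.1f}")
```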
Example two
This embodiment provides a crowd counting system based on a multi-branch dilated convolutional neural network.
The crowd counting system based on a multi-branch dilated convolutional neural network comprises:
a label generation module configured to: acquire scene images containing crowds and generate, from each scene image, a crowd density map label and a head position binary map label;
a training set building module configured to: construct a training set from the training samples, where each image together with its crowd density map label and head position binary map label forms one training sample;
a model training module configured to: train a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model;
a crowd counting application module configured to: input the image to be detected into the trained multi-branch dilated convolution crowd counting network model and output a crowd density map;
an output module configured to: sum the pixel values of the crowd density map to obtain the crowd counting result.
The multi-branch dilated convolution crowd counting network model comprises a multi-branch dilated convolution module, a feature fusion module, a binary map estimation module and a crowd density map estimation module. The multi-branch dilated convolution module consists of three dilated convolution branches with different dilation rates, the feature fusion module consists of a feature summation layer and convolution layers, and the binary map estimation module and the crowd density map estimation module each consist of three convolution layers. Note that the original image is used as the input of the multi-branch dilated convolution crowd counting network model, and its outputs are the head position binary map generated by the binary map estimation module and the crowd density map generated by the crowd density map estimation module. After supervised training, the binary map estimation module outputs a binary map that represents head positions and sizes; the Hadamard product of this binary map and the fused features is then computed and used as the input of the crowd density map estimation module, helping the crowd density map estimation module locate head positions more accurately for density estimation.
Illustratively, the multi-branch dilated convolution module consists of three convolution branches with dilation rates of 1, 2 and 3 and convolution kernel size 3x3; each branch comprises four convolution layers, of which the first two are each followed by max pooling. The feature fusion module first sums the features of the three dilated convolution branches, then reduces the dimensionality with a 1x1 convolution and extracts further features with a 3x3 convolution. The binary map estimation module comprises three convolution layers and estimates the binary map under cross-entropy loss supervision. The crowd density map estimation module first receives the output of the binary map estimation module and computes its Hadamard product with the feature map generated by the feature fusion module, and then estimates the crowd density map with a three-layer convolution operation under cross-entropy loss supervision.
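For illustration, one possible training step under the sketches above is given below. The patent specifies cross-entropy supervision for both outputs; this sketch keeps binary cross-entropy for the head position binary map but substitutes a pixel-wise MSE for the density map (a common choice for density regression), and the optimizer, learning rate and loss weighting are assumptions rather than the patent's stated settings.

```python
import torch.optim as optim  # model, nn from the previous sketches are assumed in scope

model = MultiBranchCrowdCounter()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
bce, mse = nn.BCELoss(), nn.MSELoss()

def train_step(images, density_labels, binary_labels, weight=1.0):
    """One optimization step combining both supervision signals."""
    optimizer.zero_grad()
    density_pred, binary_pred = model(images)
    loss = mse(density_pred, density_labels) + weight * bce(binary_pred, binary_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```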
Example three
This embodiment provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the crowd counting method based on a multi-branch dilated convolutional neural network described in the first embodiment above.
Example four
This embodiment provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the crowd counting method based on a multi-branch dilated convolutional neural network described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (6)
1. A crowd counting method based on a multi-branch dilated convolutional neural network, characterized by comprising the following steps:
acquiring scene images containing crowds and generating, from each scene image, a crowd density map label and a head position binary map label;
constructing a training set from the training samples, wherein each image together with its crowd density map label and head position binary map label forms one training sample;
training a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model;
inputting the image to be detected into the trained multi-branch dilated convolution crowd counting network model and outputting a crowd density map;
summing the pixel values of the crowd density map to obtain the crowd counting result;
wherein the multi-branch dilated convolution crowd counting network model comprises a multi-branch convolution module, a feature fusion module, a binary map estimation module and a density map estimation module;
the multi-branch convolution module comprises three dilated convolution branches that share network parameters and have different dilation rates, and performs multi-scale feature extraction on the crowd images;
the binary map estimation module estimates the binary map under supervision with a cross-entropy loss function;
the density map estimation module receives the output of the binary map estimation module, computes its Hadamard product with the feature map generated by the feature fusion module, and then estimates the crowd density map with a three-layer convolution operation under supervision with a cross-entropy loss function;
the feature fusion module fuses the features of the three dilated convolution branches and then performs further feature extraction on the fused features to generate a feature map; the fused features are split into two paths that are fed into the head position binary map generation module and the crowd density estimation module respectively; the head position binary map generation module further extracts features from the fused features, the head position binary map label supervises the prediction of head positions, and the Hadamard product of the head position binary map produced by the supervised head position binary map generation module and the features obtained by the feature fusion module serves as the input of the crowd density map estimation module;
the generated head position binary image label is as follows:
wherein B () is a binarization function, l represents a position set of all human heads in the image, and l is a binary function i Coordinates representing the center of the ith target head position, δ () being a pulse function, G () being a gaussian kernel, σ i The variance of the gaussian sum is indicated.
2. The method according to claim 1, wherein acquiring scene images containing crowds and generating a crowd density map label and a head position binary map label from each scene image comprises:
acquiring scene images containing crowds, and marking the positions of the heads in each image;
generating a crowd density map label according to the marked head position;
and generating a human head position binary image label by using a binarization function according to the crowd density image label.
3. The method according to claim 1, wherein constructing the training set from the training samples comprises:
performing data expansion on each training sample by random cropping, mirroring and rotation.
4. A crowd counting system based on a multi-branch dilated convolutional neural network, characterized by comprising:
a label generation module configured to: acquire scene images containing crowds and generate, from each scene image, a crowd density map label and a head position binary map label;
a training set construction module configured to: construct a training set from the training samples, wherein each image together with its crowd density map label and head position binary map label forms one training sample;
a model training module configured to: train a multi-branch dilated convolution crowd counting network model on the training set to obtain the optimal network parameters and generate a trained multi-branch dilated convolution crowd counting network model;
a crowd counting application module configured to: input the image to be detected into the trained multi-branch dilated convolution crowd counting network model and output a crowd density map;
an output module configured to: sum the pixel values of the crowd density map to obtain the crowd counting result;
wherein the multi-branch dilated convolution crowd counting network model comprises a multi-branch convolution module, a feature fusion module, a binary map estimation module and a density map estimation module;
the multi-branch convolution module comprises three dilated convolution branches that share network parameters and have different dilation rates, and performs multi-scale feature extraction on the crowd images;
the binary map estimation module estimates the binary map under supervision with a cross-entropy loss function;
the density map estimation module receives the output of the binary map estimation module, computes its Hadamard product with the feature map generated by the feature fusion module, and then estimates the crowd density map with a three-layer convolution operation under supervision with a cross-entropy loss function;
the feature fusion module fuses the features of the three dilated convolution branches and then performs further feature extraction on the fused features to generate a feature map; the fused features are split into two paths that are fed into the head position binary map generation module and the crowd density estimation module respectively; the head position binary map generation module further extracts features from the fused features, the head position binary map label supervises the prediction of head positions, and the Hadamard product of the head position binary map produced by the supervised head position binary map generation module and the features obtained by the feature fusion module serves as the input of the crowd density map estimation module;
the generated head position binary image label is as follows:
wherein B () is a binarization function, l represents a position set of all human heads in the image, and l is a binary function i Coordinates representing the center of the ith target head position, δ () being a pulse function, G () being a gaussian kernel, σ i The variance of the gaussian sum is indicated.
5. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, carries out the steps of the crowd counting method based on a multi-branch dilated convolutional neural network as claimed in any one of claims 1 to 3.
6. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the crowd counting method based on a multi-branch dilated convolutional neural network as claimed in any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110354656.3A CN113052827B (en) | 2021-03-30 | 2021-03-30 | Crowd counting method and system based on multi-branch expansion convolutional neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110354656.3A CN113052827B (en) | 2021-03-30 | 2021-03-30 | Crowd counting method and system based on multi-branch expansion convolutional neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113052827A CN113052827A (en) | 2021-06-29 |
CN113052827B true CN113052827B (en) | 2022-12-27 |
Family
ID=76517105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110354656.3A Active CN113052827B (en) | 2021-03-30 | 2021-03-30 | Crowd counting method and system based on multi-branch expansion convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113052827B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110135325A (en) * | 2019-05-10 | 2019-08-16 | 山东大学 | Crowd's number method of counting and system based on dimension self-adaption network |
CN110674704A (en) * | 2019-09-05 | 2020-01-10 | 同济大学 | Crowd density estimation method and device based on multi-scale expansion convolutional network |
CN111626184A (en) * | 2020-05-25 | 2020-09-04 | 齐鲁工业大学 | Crowd density estimation method and system |
CN111915627A (en) * | 2020-08-20 | 2020-11-10 | 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) | Semantic segmentation method, network, device and computer storage medium |
CN112101195A (en) * | 2020-09-14 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Crowd density estimation method and device, computer equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109815867A (en) * | 2019-01-14 | 2019-05-28 | 东华大学 | A kind of crowd density estimation and people flow rate statistical method |
CN112232140A (en) * | 2020-09-25 | 2021-01-15 | 浙江远传信息技术股份有限公司 | Crowd counting method and device, electronic equipment and computer storage medium |
- 2021-03-30: Application CN202110354656.3A filed in China; granted as patent CN113052827B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN113052827A (en) | 2021-06-29 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- GR01: Patent grant
- CP03: Change of name, title or address
  - Patentee after: Qilu University of Technology (Shandong Academy of Sciences), 250353 University Road, Changqing District, Ji'nan, Shandong Province, No. 3501, China
  - Patentee before: Qilu University of Technology, 250353 University Road, Changqing District, Ji'nan, Shandong Province, No. 3501, China