CN111144329B - Multi-label-based lightweight rapid crowd counting method - Google Patents

Multi-label-based lightweight rapid crowd counting method

Info

Publication number
CN111144329B
Authority
CN
China
Prior art keywords
size
convolution
map
network
layer
Prior art date
Legal status
Active
Application number
CN201911386325.7A
Other languages
Chinese (zh)
Other versions
CN111144329A (en)
Inventor
王素玉
杨滨
冯明宽
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201911386325.7A
Publication of CN111144329A
Application granted
Publication of CN111144329B
Status: Active

Classifications

    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F18/24 Classification techniques
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a multi-label-based lightweight rapid crowd counting method. A simple and efficient backbone feature-extraction network is designed around receptive-field size, with built-in dense context modules that preserve the information flow between network layers and improve the expressive capacity of the network. Six multi-scale intermediate supervision branches make the network converge faster and more stably. An up-sampling module raises the resolution step by step and improves density-map quality, enabling both accurate counting and accurate localization. Three labels are designed: the density-based crowd counting task is explicitly converted into a foreground/background segmentation task that assists the regression of the crowd density map, and the density map and segmentation map are predicted simultaneously, effectively reducing estimation error. Test results on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets show that the prediction performance of the method surpasses current mainstream algorithms, the prediction speed reaches real time, and the method can be conveniently deployed on terminal devices.

Description

Multi-label-based lightweight rapid crowd counting method
Technical Field
The invention belongs to the field of crowd counting in computer vision and discloses a method that uses a convolutional neural network to predict a density map and integrates the density map to obtain the total number of people in a single picture.
Background
Convolutional neural network (CNN) based crowd counting has made great progress. Most current state-of-the-art CNN methods use pre-trained backbone networks (e.g., VGG, ResNet101, DenseNet169) and complex module structures (e.g., Attention Module, Self-Attention Module, Perspective Module) to predict a density map of the input image, then sum the estimated density map to obtain the crowd count. Other methods use multi-column structures (MCNN and Switch-CNN) or multi-task methods (PCCNet) to improve prediction accuracy. These methods achieve rather high accuracy on the mainstream UCF_CC_50, ShanghaiTech and UCF-QNRF datasets. Although they work well, in the pursuit of high precision they often suffer from bulky network structures, large parameter counts and long prediction times; they cannot properly balance accuracy and speed and are difficult to deploy on terminal equipment.
Disclosure of Invention
Aiming at the problems that algorithms in the crowd counting field have high complexity, can hardly run in real time and cannot properly balance accuracy and speed, the invention designs a multi-label-based lightweight rapid crowd counting neural network that achieves a good balance between accuracy and computational efficiency and is easy to deploy on terminal equipment.
The invention adopts the following technical scheme: a multi-label-based lightweight rapid crowd counting method. The crowd counting algorithm proceeds as follows: preprocess and augment the data; feed the processed image data into the convolutional neural network proposed by the invention, and extract high-level head features through a series of backbone-network operations such as convolution, downsampling and dense residual connection; in this process, six branches of the network perform multi-scale intermediate supervision (applied only in the training phase); feed the extracted high-level features to an up-sampling module to generate the predicted density map and segmentation map; finally, integrate the density map over the whole image to obtain the final count. An overall flow chart of the proposed method is shown in fig. 1.
(1) Data preprocessing: the invention uses three public datasets, UCF_CC_50, ShanghaiTech and UCF-QNRF. To facilitate training and prediction, the data are preprocessed: the image height and width are limited so that the longest side does not exceed 1024 pixels. Since the neural network contains five downsampling operations and the decoding process contains successive upsamplings, the input image is resized so that both dimensions are divisible by 32, which keeps input and output sizes consistent and preserves localization accuracy.
(2) Data augmentation: to address the small number of samples in the datasets, the invention uses 5 different augmentation methods: random brightness, random saturation, random horizontal flipping, random noise and random crop-and-scale.
(3) Multi-label generation: a density-map label is generated with an adaptive Gaussian kernel, and on this basis a segmentation label is generated with a random scaling strategy. Both labels are applied to single-path multi-channel prediction: the density-based crowd counting task is explicitly converted into a foreground/background segmentation task that assists the regression of the crowd density map, and the density map and segmentation map are predicted simultaneously, effectively reducing prediction error. In addition, to support intermediate supervision during training, six groups of multi-scale intermediate supervision labels are derived from the two generated labels according to the receptive-field sizes at different stages of the network.
(4) Model setup and training: the network model mainly consists of a backbone convolutional network, an up-sampling module and intermediate supervision branches. The backbone consists of convolution layers with 1×1 and 3×3 kernels, ReLU activation functions, batch normalization layers and residual connections, with only 2.06M parameters in total. Augmented data with batch size 16 and size 640×640 is fed into the model; high-level crowd feature maps are extracted under the intermediate supervision of six branches and then fed into the up-sampling module, and network training is supervised with the strong supervision signals of the full-resolution segmentation map and density map.
During training the method uses no pre-trained model; all network parameters are initialized with the Xavier method. The mean-square-error loss function trains the density map, the cross-entropy loss function trains the segmentation map, the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations.
(5) Model prediction: after model training is completed, the pre-trained model parameters are loaded, test data of any size is input, the predicted density map is obtained end to end, and the total count is obtained by integration. In this stage only the backbone parameters need to be loaded and the intermediate supervision branches are inactive, which speeds up inference; together with the lightweight network structure this enables real-time prediction.
The evaluation metrics are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, and the method obtains competitive results. On Part A of the ShanghaiTech dataset, the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date.
Drawings
Fig. 1 is a schematic overall flow chart of the method according to the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network according to the present invention.
Fig. 3 is a schematic diagram of a multi-tag according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings, which illustrate in detail:
the invention relates to a multi-label-based lightweight rapid crowd counting method. As shown in fig. 1, the method specifically includes: preprocess and augment the data, input it into the convolutional neural network, and extract crowd feature maps through a series of backbone operations such as convolution, downsampling and residual connection; in this process, six branches of the network perform multi-scale intermediate supervision (applied only in the training phase); then generate the final predicted density map and segmentation map through the up-sampling module; finally, integrate the density map to obtain the final count. The specific algorithm is described below:
(1) Data preprocessing
Data preprocessing is performed on the three mainstream public datasets UCF_CC_50, ShanghaiTech and UCF-QNRF. Picture sizes in these datasets vary widely, so to unify training and save GPU memory, preprocessing limits the image height and width: the longest side does not exceed 1024 and the aspect ratio is kept unchanged. Since the encoding process involves five downsampling operations and the decoding process involves successive upsamplings, the resize operation makes both dimensions of the input image divisible by 32, ensuring that input and output agree in size and localization accuracy.
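As a rough illustration, the resizing rule above (longest side capped at 1024, both dimensions divisible by 32) can be sketched as follows; the function name and the rounding convention are assumptions, not the patent's exact implementation:

```python
def preprocess_size(h, w, max_side=1024, divisor=32):
    """Limit the longest side to max_side (keeping the aspect ratio),
    then round both dimensions to multiples of the divisor so that the
    five downsamplings and the matching upsamplings align exactly."""
    scale = min(1.0, max_side / max(h, w))
    h, w = h * scale, w * scale
    # snap each dimension to the nearest multiple of the divisor
    h = max(divisor, int(round(h / divisor)) * divisor)
    w = max(divisor, int(round(w / divisor)) * divisor)
    return h, w
```

For example, a 1536×2048 image would be halved to 768×1024, both already multiples of 32.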
(2) Data enhancement
UCF_CC_50 contains 50 images of different resolutions; the annotated count per image ranges from 94 to 4543, and the background scenes are simple. The ShanghaiTech dataset contains 1198 pictures with 330165 people in total; the crowds are large but the scenes are relatively uniform. The UCF-QNRF dataset uses 1201 pictures for training and 334 for testing; its scenes are complex and its crowd density higher, making it more realistic and more difficult. Given these dataset characteristics, the invention uses 5 different augmentation methods: random brightness, random saturation, random horizontal flipping, random noise and random crop-and-scale. The aim is to derive more training samples, reduce the influence of head size, position and color variations on the model, prevent overfitting and improve generalization.
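A minimal sketch of a few of these augmentations (brightness, horizontal flip with annotation mirroring, additive noise), assuming images are float arrays in [0, 1]; saturation and crop-and-scale are omitted for brevity and all parameter ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, points):
    """Apply simplified versions of some of the five enhancement ops.
    `image` is an HxWx3 float array in [0, 1]; `points` is an (N, 2)
    array of (x, y) head annotations that must follow geometric ops."""
    h, w = image.shape[:2]
    # random brightness
    image = np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0)
    # random horizontal flip (mirror the annotations as well)
    if rng.random() < 0.5:
        image = image[:, ::-1]
        points = np.column_stack([w - 1 - points[:, 0], points[:, 1]])
    # random additive Gaussian noise
    image = np.clip(image + rng.normal(0.0, 0.01, image.shape), 0.0, 1.0)
    return image, points
```

The key design point the sketch illustrates is that geometric transforms must be applied to the head annotations as well as to the pixels, while photometric transforms touch only the pixels.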
(3) Multi-label fabrication
As shown in fig. 3, the multi-label consists of three parts: the density map and the segmentation map of all heads at the end of the network, and the multi-scale labels for intermediate supervision.
Density map: the density map is generated by convolving a geometry-adaptive Gaussian kernel with each point-level head annotation. If pixel x_i is the center coordinate of the i-th head in the scene, the crowd density map D^{GT} for N annotated heads is generated as

D^{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i    (1)

where i is the head index, N the total head count, \delta(x - x_i) a delta function at the head position in the image, G_{\sigma_i} the adaptive Gaussian kernel with standard deviation \sigma_i, and \bar{d}_i the mean of the Euclidean distances from head i to its neighbouring heads in the image; \beta is set to 0.15 so that the head size is estimated more accurately.
Segmentation map: since the density map generated with the adaptive kernel covers the head regions fairly accurately, pixel-level annotations separating foreground and background are easy to derive from it. The invention additionally introduces a scaling factor \lambda to rescale each head region, slightly shrinking the areas covered by larger heads and slightly enlarging smaller head areas, so that the segmentation map contains more complete head information. The segmentation map S^{GT} is generated as

S^{GT}_j = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases}    (2)

where j is any position in the density map and p_j is the pixel value at position j of the rescaled density map \tilde{D}^{GT}, which is generated after introducing the scaling factors; \lambda_i is the scaling factor of the i-th head, and the corresponding Gaussian standard deviation is \sigma'_i = \lambda_i \sigma_i    (3)
Notably, estimating the segmentation map is a more basic and easier binary classification task than the regression task of density-map estimation. It provides a classification loss that assists the regression task and improves overall regression quality. The two kinds of labels are jointly applied to the single-path multi-channel prediction of the backbone network, realizing simultaneous prediction of the density map and the segmentation map, which effectively reduces prediction error.
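The two label types can be sketched together as a toy generator; it assumes (row, col) point annotations, uses up to three nearest neighbours for the adaptive standard deviation of formula (1), and applies a single global scaling factor `lam` in place of the per-head factors of formula (2):

```python
import numpy as np

def make_labels(points, shape, beta=0.15, lam=1.0):
    """Toy label generator: a density map whose per-head Gaussian std is
    beta times the mean distance to the nearest neighbours, plus a
    foreground/background mask obtained by thresholding the map built
    with std scaled by lam. Names and the global lam are assumptions."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros(shape)
    seg_src = np.zeros(shape)
    for i, (r, c) in enumerate(points):
        # mean distance to up to 3 nearest other heads; fixed fallback
        others = np.delete(points, i, axis=0)
        if len(others):
            d = np.sort(np.hypot(others[:, 0] - r, others[:, 1] - c))[:3]
            sigma = beta * d.mean()
        else:
            sigma = 4.0
        for out, s in ((density, sigma), (seg_src, lam * sigma)):
            g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * s ** 2))
            out += g / g.sum()          # each head integrates to ~1
    seg = (seg_src > 1e-4).astype(np.uint8)
    return density, seg
```

Because every Gaussian is normalized to unit mass, the sum over the density map recovers the head count, which is the property the counting step relies on.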
Multiscale tag: because the designed network comprises five downsampling operations, and the receptive fields of the network at different stages are different in size, the heads of people with different sizes can be covered by the receptive fields with corresponding sizes or larger sizes, and therefore, six branches of intermediate supervision are arranged. In order to match with the training process of intermediate supervision, six groups of labels are correspondingly required to be set for supervised learning. According to the set six head size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480], dividing the heads in a single image into six groups, regenerating a density image and a segmentation image in each group, and then respectively performing sampling operation of different scales, wherein the downsampling multiples of the six groups of labels are 4,8, 16, 32 and 32 in sequence. According to the size of the receptive field and the size of the human head of each branch, the human heads in different size ranges are distributed to different branch labels, so that six branches can completely cover all human heads. The present invention will be described in detail in the next section with receptive field calculation and human head size interval division.
(4) Model setup and training
As shown in fig. 2, the network model mainly consists of a backbone convolutional network, an up-sampling module and intermediate supervision branches. The processed data is input into the model, and high-level crowd features are extracted through the backbone network. Throughout this process the backbone uses only 1×1 and 3×3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations, keeping the network lightweight. The extracted features are then fed to the up-sampling module, and network training is supervised with the segmentation map and the density map.
To ensure that the model covers heads of all sizes while remaining lightweight, the backbone network is carefully designed: its maximum receptive field is 767, enough to fully cover the head regions of the three datasets. The backbone structure is divided into 4 sub dense-context modules, DenseBlock_1, DenseBlock_2, DenseBlock_3 and DenseBlock_4. The dense-context modules effectively preserve the information flow between network layers and retain more multi-scale context features. DenseBlock_1 is a stack of ten feature-extraction blocks: one block combining a convolution layer with kernel size 3×3×64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output is 1/4 the input size; and nine blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer and a ReLU activation, densely connected to one another, with the feature-map size unchanged throughout.
DenseBlock_2, DenseBlock_3 and DenseBlock_4 share the same structure, a stack of five feature-extraction blocks: one block combining a convolution layer with kernel size 3×3×128 and stride 2, a batch normalization layer and a ReLU activation, whose output is 1/2 the input size; and four blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged. DenseBlock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3×3×128 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged.
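The dense connection pattern inside these blocks, where each block consumes the channel-wise concatenation of the input and all earlier block outputs, can be illustrated with a toy numpy sketch; the random 1×1 projection stands in for the real convolution + batch-norm + ReLU blocks, so this shows only the information flow, not the patent's layers:

```python
import numpy as np

def dense_stack(x, num_blocks, out_channels, rng=np.random.default_rng(0)):
    """Dense connectivity sketch: block k sees the concatenation of the
    input and the outputs of blocks 1..k-1, mimicking the DenseBlocks.
    The 'conv' is a random 1x1 projection plus ReLU, purely illustrative."""
    feats = [x]
    for _ in range(num_blocks):
        inp = np.concatenate(feats, axis=-1)      # H x W x (sum of channels)
        w = rng.standard_normal((inp.shape[-1], out_channels))
        feats.append(np.maximum(inp @ w, 0.0))    # 1x1 conv + ReLU
    return feats[-1]
```

Note how the channel count of the concatenated input grows with every block while the spatial size stays fixed, which is exactly the property the text describes for the stride-1 blocks.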
Because the receptive fields at different stages of the designed model differ in size, heads of different sizes can be covered by receptive fields of the corresponding or larger size. Six intermediate supervision branches are provided according to the receptive field computed by equation (4), see fig. 2:

l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h    (4)

where l_k is the receptive-field size of the k-th layer, f_k the kernel size of the k-th convolution or pooling layer, and s_h the stride of the h-th layer convolution.
The receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively. The head-size intervals detected by the six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240] and [240, 480]. Because the random-scaling augmentation uses ratios of 0.7 to 1.6, even the maximum head size of 480 becomes 768 after 1.6× enlargement, which corresponds to the maximum receptive field. The largest actual head in the three datasets is 382×382, well below 480, so the network's receptive fields can fully cover head regions of every size, including scaled ones, which confirms the soundness of the network design.
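Formula (4) can be sketched as a small calculator over (kernel, stride) pairs; the layer lists in the usage below are illustrative and do not reproduce the patent's exact architecture or the branch values 39 to 767:

```python
def receptive_field(layers):
    """Receptive field per layer from (kernel, stride) pairs, using
    formula (4): l_k = l_{k-1} + (f_k - 1) * prod of strides before k."""
    l, jump, out = 1, 1, []
    for f, s in layers:
        l = l + (f - 1) * jump   # grow by (kernel - 1) * cumulative stride
        jump *= s                # cumulative stride up to this layer
        out.append(l)
    return out
```

For instance, three stride-1 3×3 convolutions give receptive fields 3, 5, 7, while a stride-2 3×3 convolution followed by a stride-1 one gives 3, 7.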
The six intermediate supervision branches have similar structures; each contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch, designed to be as lightweight as possible using only 1×1 convolution kernels. Each intermediate supervision branch first maps the features into a new feature space through a convolution layer with kernel size 1×1×c (the channel count c matches the output channels of the corresponding dense-context module), then feeds them into the two sub-branches. The segmentation-map prediction sub-branch contains two 1×1 convolution layers with c and 2 channels respectively and outputs a two-channel segmentation prediction. The density-map prediction sub-branch contains two 1×1 convolution layers with c and 1 channels respectively and outputs a single-channel density prediction.
Because the backbone output is 1/32 of the original size after five downsamplings, crowd localization information is partly destroyed. The simple and effective up-sampling module lets the network learn to raise the resolution step by step and finally output a density map and segmentation map at the original size, recovering the crowd position information, improving density-map quality and thus counting accuracy.
The up-sampling module is a stack of three sub-modules, each consisting of an up-sampling layer and a convolution layer; the overall up-sampling rate is 32, and the module finally outputs a density map and a segmentation map at the original image size. Specifically, UpsampleBlock_1 consists of a nearest-neighbor interpolation layer (4× up-sampling), a convolution layer with kernel size 3×3×16 and stride 1, a batch normalization layer and a ReLU activation. UpsampleBlock_1, UpsampleBlock_2 and UpsampleBlock_3 have similar structures and differ only in parameters: the nearest-neighbor up-sampling rates are 4, 4 and 2 respectively, and the convolution kernel sizes are 3×3×16, 3×3×8 and 3×3×3. The up-sampling module finally outputs a three-channel prediction: the first two channels are the predicted segmentation map, the third channel is the predicted density map, and the total count is then obtained by integrating the density map.
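The spatial side of the decoder (nearest-neighbour up-sampling by rates 4, 4 and 2 for an overall factor of 32) can be sketched as follows, with the interleaved convolution, batch-norm and ReLU layers omitted, so this only illustrates how the resolution is restored:

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbour interpolation as used in the UpsampleBlocks."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def upsample_head(features):
    """Decoder sketch: three up-sampling stages with rates 4, 4 and 2
    (overall x32), matching a 640x640 input whose backbone output is
    a 20x20 feature map."""
    for factor in (4, 4, 2):
        features = nearest_upsample(features, factor)
    return features
```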
This completes the deployment of the whole lightweight network: a single network simultaneously performs the classification task of predicting the segmentation map and the regression task of predicting the density map. Doing so shares network parameters and helps the two tasks learn better semantic features. Meanwhile, the segmentation map constrains the density-map prediction with position information, making the network focus on head regions and suppressing density-map responses on background regions and body parts, so counting becomes more accurate.
During training the method uses no pre-trained model and initializes all network parameters with the Xavier method. The mean-square-error loss function trains the density map, the cross-entropy loss function trains the segmentation map, the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations.
The mean-square-error loss function can be expressed as

L_{MSE} = \frac{1}{2H} \sum_{e=1}^{H} \lVert f(X_e) - D_e \rVert_2^2    (5)

where H is the number of training samples, X_e the e-th input image, D_e the corresponding density-map label, and f(X_e) the predicted density map that the network maps the input X_e to.
The cross-entropy loss function can be expressed as

L_S = -\frac{1}{M} \sum_{m=1}^{M} \left[ Y_m \log P_m + (1 - Y_m) \log (1 - P_m) \right]    (6)

where M is the total number of pixels, Y_m the m-th segmentation-map label, and P_m the probability that the m-th pixel of the segmentation map is foreground.
The loss function of the six branches can be expressed as

L_B = \sum_{n=1}^{C} \left( L_{MSE}^{n} + L_{S}^{n} \right)    (7)

where C is the total number of branches, and L_{MSE}^{n}, L_{S}^{n} are the density-map loss and segmentation-map loss of branch n, respectively. The joint loss function can be expressed as:
L = L_{MSE} + \alpha L_S + \varphi L_B    (8)
where α and φ are coefficients balancing the three losses; the invention sets α = 0.001 and φ = 0.1 and uses the joint loss function for end-to-end training.
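Formulas (5) to (8) can be combined into a single sketch; the normalization conventions (per-pixel means rather than sums) are assumptions where the patent's equation images are not reproduced, and `pred_p` is assumed to hold foreground probabilities in (0, 1):

```python
import numpy as np

def joint_loss(pred_d, gt_d, pred_p, gt_s,
               alpha=0.001, phi=0.1, branch_losses=()):
    """Joint objective of formula (8): MSE on the density map (5),
    cross-entropy on the segmentation map (6), plus the summed
    intermediate-branch losses (7)."""
    l_mse = 0.5 * np.mean((pred_d - gt_d) ** 2)
    eps = 1e-12                      # numerical guard for the logs
    l_s = -np.mean(gt_s * np.log(pred_p + eps)
                   + (1 - gt_s) * np.log(1 - pred_p + eps))
    l_b = float(sum(branch_losses))  # formula (7), precomputed per branch
    return l_mse + alpha * l_s + phi * l_b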
(5) Model prediction: after model training is completed, pre-trained model parameters are loaded, test data of any size are input, a predicted density map is directly obtained, and the total number of people is obtained through integration. Note that only the parameters trained by the backbone network are needed to be loaded in the stage, and the intermediate supervision branches are not effective, so that the model reasoning speed is improved, and the real-time prediction can be achieved by adding a light-weight network structure.
The evaluation metrics are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, and the method obtains competitive results. Since the algorithm aims at real-time estimation while balancing accuracy and speed, the comparison algorithms in tables 1 and 2 are selected accordingly.
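Hedged sketches of the three metrics follow; crowd-counting papers conventionally report the root of the mean squared error under the name MSE, which is assumed here, and the PSNR peak value defaulting to the ground-truth maximum is also an assumption:

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image counts."""
    d = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.mean(np.abs(d)))

def mse(pred_counts, gt_counts):
    """Root mean squared error over per-image counts (the quantity
    usually reported as 'MSE' in crowd-counting benchmarks)."""
    d = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.sqrt(np.mean(d ** 2)))

def psnr(pred_map, gt_map, peak=None):
    """PSNR between predicted and ground-truth density maps."""
    peak = gt_map.max() if peak is None else peak
    err = np.mean((pred_map - gt_map) ** 2)
    return float(10.0 * np.log10(peak ** 2 / err))
```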
TABLE 1 Comparison of the prediction performance of the proposed method
As shown in Table 1, on ShanghaiTech Part A the proposed algorithm ranks second on the MSE metric and best on all other metrics; the improvement is especially large on ShanghaiTech Part B and the UCF-QNRF dataset. Part B: MAE improves by 31.8% and MSE by 36.3%. UCF-QNRF: MAE improves by 16.1% and MSE by 10.4%.
TABLE 2 Comparison of the computational performance of the proposed method
As shown in Table 2, on Part A of the public ShanghaiTech dataset the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date, while counting accuracy also improves. Compared with Cascade-MTL, currently the fastest method, the proposed method sacrifices some speed but improves MAE by 30%.

Claims (6)

1. A multi-tag based lightweight fast crowd counting method, comprising:
step one: preprocessing data and enhancing the data;
step two: inputting the processed image data into a convolutional neural network, and extracting high-level crowd features through a series of convolution, downsampling and dense residual connection operations of a backbone network; in this process, six branches of the network are used for multi-scale intermediate supervision;
step three: feeding the extracted advanced features to an up-sampling module, and generating a density map and a segmentation map which are consistent with the original image in size after gradually improving the resolution;
step four: obtaining a final counting result by carrying out integral integration on the density map;
three labels are designed, wherein the three labels comprise a density map of the size of the head of a person generated by utilizing an adaptive Gaussian kernel, a segmentation map containing complete head information generated by utilizing a random scaling strategy and a multi-scale label for intermediate supervision branches, and a crowd counting task based on the density map is converted into a foreground and background segmentation task to assist a regression task of the crowd density map in a displaying manner;
the network model consists of a backbone convolutional network, an upsampling module, and intermediate supervision branches; the processed data are input into the model, and high-level crowd features are extracted by the backbone network; in this process, the backbone uses only 1×1 and 3×3 convolution layers, the ReLU activation function, batch normalization layers, and dense residual connections; the extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and the density map;
the maximum receptive field size of the backbone network is 767; the backbone structure is divided into 4 sub-dense context modules, namely Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4;
Denseblock_1 is formed by stacking ten feature extraction blocks: one block combining a convolution layer with kernel size 3×3×64 (kernel height × kernel width × channels) and stride 2, a batch normalization layer, and a ReLU activation function, whose output size is 1/4 of the input; and nine blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, connected among themselves in a dense manner, with the feature map dimensions kept unchanged during computation;
Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each formed by stacking five feature extraction blocks: one block combining a 3×3×128 convolution layer with stride 2, a batch normalization layer, and a ReLU activation function, whose output size is 1/2 of the input; and four blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, densely connected, with the feature map size kept unchanged during computation; Denseblock_5 is formed by stacking three repeated feature extraction blocks, each consisting of a 3×3×128 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, densely connected, with the feature map size kept unchanged during computation;
the receptive field is calculated according to the following formula, and six intermediate supervision branches are arranged:

l_k = l_{k-1} + (f_k − 1) × ∏_{h=1}^{k−1} s_h

where l_k denotes the receptive field size of the k-th layer, f_k the kernel size of the k-th convolution layer or the pooling size of the k-th pooling layer, and s_h the stride of the h-th convolution layer;
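The recurrence above can be sketched in a few lines; as a check, the structure described for Denseblock_1 (one 3×3 stride-2 convolution followed by nine 3×3 stride-1 convolutions) reproduces the receptive field of 39 quoted for the first branch. This is an illustrative sketch, not the patented implementation:

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.

    layers: sequence of (kernel_size, stride) pairs, input side first.
    Implements l_k = l_{k-1} + (f_k - 1) * product of earlier strides.
    """
    rf, jump = 1, 1
    for f, s in layers:
        rf += (f - 1) * jump  # each layer widens the field by (f-1)*jump pixels
        jump *= s             # strides compound the per-position step size
    return rf

# Denseblock_1: one 3x3 stride-2 conv, then nine 3x3 stride-1 convs
print(receptive_field([(3, 2)] + [(3, 1)] * 9))  # 39
```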
in the six intermediate supervision branch structures, each branch contains two sub-branches, a segmentation map prediction sub-branch and a density map prediction sub-branch, and uses only convolution kernels of size 1×1; each intermediate supervision branch first passes through a convolution layer with kernel size 1×1×c, where the channel number c matches the output channel number of the corresponding dense context module, mapping the features to a new feature space, which is then fed to the two sub-branches; the segmentation map prediction sub-branch contains two 1×1 convolution layers with channel numbers c and 2, finally outputting a two-channel segmentation map prediction; the density map prediction sub-branch contains two 1×1 convolution layers with channel numbers c and 1, finally outputting a single-channel density map prediction;
the backbone network performs five downsamplings, so its output size is 1/32 of the original; through the upsampling module, the network learns autonomously to restore the resolution step by step, finally outputting the density map and segmentation map at the original size;
the upsampling module is formed by stacking three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling ratio of 32, finally outputting a density map and a segmentation map at the original image size; specifically, UpsampleBlock_1 consists of a four-fold nearest-neighbor interpolation layer, a convolution layer with kernel size 3×3×16 and stride 1, a batch normalization layer, and a ReLU activation function; UpsampleBlock_1, UpsampleBlock_2 and UpsampleBlock_3 share the same structure and differ only in their parameters: the nearest-neighbor upsampling factors are 4, 4 and 2, and the convolution kernel sizes are 3×3×16, 3×3×8 and 3×3×3, respectively; the upsampling module finally outputs a three-channel prediction, of which the first two channels are the predicted segmentation map and the third channel is the predicted density map; the total head count is then obtained by integrating the density map.
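A minimal numpy sketch of the 4×, 4×, 2× nearest-neighbour upsampling chain and the count-by-integration step; the array sizes are illustrative, not taken from the patent:

```python
import numpy as np

def nn_upsample(x, scale):
    """Nearest-neighbour interpolation by an integer factor per spatial axis."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

# a 1/32-resolution prediction restored to full resolution (4 * 4 * 2 = 32)
low_res = np.random.rand(8, 8)
full_res = nn_upsample(nn_upsample(nn_upsample(low_res, 4), 4), 2)
print(full_res.shape)  # (256, 256)

# the estimated head count is the integral (sum) of the final density map
count = float(full_res.sum())
```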
2. The multi-label based lightweight fast crowd counting method of claim 1, wherein in step two, the backbone network is composed of four dense context modules, each composed of 1×1 and 3×3 convolution layers, ReLU activation functions, batch normalization layers, and dense residual connections.
3. The method of claim 1, wherein in step two, according to the head-size statistics of the dataset, six head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240] and [240, 480] are set, and six intermediate supervision branches are designed correspondingly, each branch containing two independent sub-branches, a segmentation map prediction sub-branch and a density map prediction sub-branch; the corresponding receptive field sizes are 39, 71, 143, 287, 575 and 767.
4. The multi-label based lightweight fast crowd counting method according to claim 1, wherein in step three, an upsampling module of repeatedly stacked nearest-neighbor interpolation and convolution layers is designed to raise the feature map resolution step by step, finally outputting a density map at the original size.
5. The multi-label based lightweight rapid crowd counting method of claim 1, wherein the density map generation process is expressed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) ∗ G_{σ_i}(x), with σ_i = β · d̄_i

where i denotes the index of the i-th head, N the total head count, δ(x − x_i) an impulse function marking the head position x_i in the image, G_{σ_i} the adaptive Gaussian kernel, σ_i the Gaussian kernel standard deviation, d̄_i the mean of the Euclidean distances between the head and its three nearest neighbouring heads in the image, and β a weight coefficient;
the generation process of the segmentation map is expressed as:

S_j = 1 if p̃_j > 0, and S_j = 0 otherwise

where j denotes any position in the density map, p_j the pixel value at position j of the density map, p̃ the density map generated after introducing the scaling factors, and λ_i the scaling factor corresponding to the i-th head;
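An illustrative numpy sketch of the geometry-adaptive kernel (σ_i = β · d̄_i) and a segmentation mask derived from the resulting map; the parameter values, threshold, and normalization choice are assumptions for the example, not the patent's exact procedure:

```python
import numpy as np

def adaptive_density_map(points, shape, beta=0.3, k=3, fallback_sigma=4.0):
    """Place one normalized Gaussian per head; sigma_i = beta * mean
    Euclidean distance to the k nearest neighbouring heads."""
    H, W = shape
    dm = np.zeros((H, W))
    pts = np.asarray(points, dtype=float)
    yy, xx = np.mgrid[0:H, 0:W]
    for r, c in pts:
        d = np.sort(np.hypot(pts[:, 0] - r, pts[:, 1] - c))[1:k + 1]
        sigma = beta * d.mean() if d.size else fallback_sigma
        g = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        dm += g / g.sum()          # each head integrates to exactly 1
    return dm

heads = [(20, 20), (25, 40), (40, 30)]
dm = adaptive_density_map(heads, (64, 64))
print(round(dm.sum()))             # 3: the map's integral equals the head count
seg = (dm > 1e-4).astype(np.uint8) # crude foreground/background mask
```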
multiscale tag: according to the set six head size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480], dividing the heads in a single image into six groups, regenerating a density image and a segmentation image in each group, and then respectively performing six-scale sampling operation, wherein the downsampling multiples of the six groups of labels are 4,8, 16, 32 and 32 in sequence.
6. The multi-label based lightweight rapid crowd counting method of claim 1, wherein two training strategies are designed: one is single-path multi-channel prediction, in which a single network composed of the backbone and the upsampling module outputs the segmentation map and the density map simultaneously, end to end; the other is multi-path single-channel prediction through the backbone and the six intermediate supervision branches, where each intermediate supervision branch has two sub-branches, one predicting the density map and one predicting the segmentation map.
CN201911386325.7A 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method Active CN111144329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Publications (2)

Publication Number Publication Date
CN111144329A CN111144329A (en) 2020-05-12
CN111144329B true CN111144329B (en) 2023-07-25

Family

ID=70521417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911386325.7A Active CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Country Status (1)

Country Link
CN (1) CN111144329B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Crowd counting method based on coding and decoding-jump connection scale pyramid network
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN111929054B (en) * 2020-07-07 2022-06-07 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN111985381B (en) * 2020-08-13 2022-09-09 杭州电子科技大学 Guidance area dense crowd counting method based on flexible convolution neural network
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112084959B (en) * 2020-09-11 2024-04-16 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112396126B (en) * 2020-12-02 2023-09-22 中山大学 Target detection method and system based on detection trunk and local feature optimization
CN112597985B (en) * 2021-03-04 2021-07-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233B (en) * 2021-05-28 2023-05-16 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
CN113887536B (en) * 2021-12-06 2022-03-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2505501B (en) * 2012-09-03 2020-09-09 Vision Semantics Ltd Crowd density estimation
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN104992223B (en) * 2015-06-12 2018-02-16 安徽大学 Intensive Population size estimation method based on deep learning
CN106203331B (en) * 2016-07-08 2019-05-17 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937B (en) * 2016-08-31 2019-08-09 郑州金惠计算机系统工程有限公司 Crowd density distribution estimation method based on convolutional neural networks
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089B (en) * 2017-12-11 2021-07-30 中山大学 Size-adaptive-based crowd counting method for head detection and density map
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110163060B (en) * 2018-11-07 2022-12-23 腾讯科技(深圳)有限公司 Method for determining crowd density in image and electronic equipment
CN110020606B (en) * 2019-03-13 2021-03-30 北京工业大学 Crowd density estimation method based on multi-scale convolutional neural network
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Conditional Marked Point Process-based Crowd Counting in Sparsely and Moderately Crowded Scenes; Yongsang Yoon et al.; 2016 International Conference on Control, Automation and Information Sciences (ICCAIS); full text *
Pedestrian counting in high-density crowds using the STLK algorithm; Chen Hanqi et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); full text *

Also Published As

Publication number Publication date
CN111144329A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN111639692B (en) Shadow detection method based on attention mechanism
CN110059710B (en) Apparatus and method for image classification using convolutional neural network
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN110909801B (en) Data classification method, system, medium and device based on convolutional neural network
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
CN111445418A (en) Image defogging method and device and computer equipment
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN111626184B (en) Crowd density estimation method and system
CN110443784B (en) Effective significance prediction model method
CN113095254B (en) Method and system for positioning key points of human body part
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
KR102128789B1 (en) Method and apparatus for providing efficient dilated convolution technique for deep convolutional neural network
CN113538402B (en) Crowd counting method and system based on density estimation
CN113705394A (en) Behavior identification method combining long and short time domain features
CN116246109A (en) Multi-scale hole neighborhood attention computing backbone network model and application thereof
CN115587628A (en) Deep convolutional neural network lightweight method
CN114724175B (en) Pedestrian image detection network, pedestrian image detection method, pedestrian image training method, electronic device and medium
CN113344825B (en) Image rain removing method and system
CN111832336B (en) Improved C3D video behavior detection method
CN113902904A (en) Lightweight network architecture system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant