CN111144329A - Light-weight rapid crowd counting method based on multiple labels - Google Patents

Light-weight rapid crowd counting method based on multiple labels Download PDF

Info

Publication number
CN111144329A
CN111144329A (application CN201911386325.7A)
Authority
CN
China
Prior art keywords
size
network
convolution
density
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911386325.7A
Other languages
Chinese (zh)
Other versions
CN111144329B (en)
Inventor
王素玉 (Wang Suyu)
杨滨 (Yang Bin)
冯明宽 (Feng Mingkuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911386325.7A priority Critical patent/CN111144329B/en
Publication of CN111144329A publication Critical patent/CN111144329A/en
Application granted granted Critical
Publication of CN111144329B publication Critical patent/CN111144329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight, fast crowd counting method based on multiple labels. A simple and efficient backbone feature-extraction network is designed according to receptive-field size, and dense context modules within the network ensure information flow between network layers and improve the network's expressive capacity. Six multi-scale intermediate supervision branches are designed so that the network converges faster and more stably. An upsampling module is designed to increase the resolution step by step and improve density-map quality, enabling accurate counting and accurate localization. Three kinds of labels are designed: the density-based crowd counting task is converted into a foreground-background segmentation task that assists the regression task of the crowd density map, so that the density map and the segmentation map are predicted simultaneously and estimation errors are effectively reduced. Test results on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets show that the prediction performance of the method surpasses current mainstream algorithms, prediction runs in real time, and the method can be conveniently deployed on terminal devices.

Description

Light-weight rapid crowd counting method based on multiple labels
Technical Field
The invention belongs to the field of crowd counting in computer vision and relates to a method that uses a convolutional neural network to predict a density map and integrates the density map to obtain the total number of people in a single image; unlike current mainstream convolutional networks, it is not built on VGG, ResNet or DenseNet backbones.
Background
In recent years, large-scale gatherings such as parades, festival celebrations, concerts and sporting events have become increasingly frequent, and emergencies caused by dense crowds have become a focus of public concern. Crowd counting, as an important means of crowd control and management, provides statistics on the crowd in the current scene, assists resource allocation, supports contingency planning for emergencies, and enhances the safety of public places. Moreover, crowd counting technology can easily be transferred to other fields to handle similar counting tasks. However, owing to occlusion, background noise, and variations in scale and viewing angle, traditional detection- and regression-based methods have their limitations, and accurate, fast crowd counting remains a difficult open problem in computer vision.
Currently, great progress has been made in crowd counting based on convolutional neural networks (CNNs). Most state-of-the-art CNN methods predict the density map of the input image with a pre-trained backbone network (e.g., VGG16, ResNet101 or DenseNet169) plus complex modular structures (e.g., attention modules, self-attention modules and perception modules), then sum the predicted density map to obtain the crowd count. Other methods use multi-column architectures (MCNN and Switch-CNN) or a multi-task approach (PCCNet) to improve prediction accuracy. These methods reach rather high accuracy on the mainstream UCF_CC_50, ShanghaiTech and UCF-QNRF datasets. Although they achieve good results, the pursuit of high precision often leaves them with bloated network structures, large parameter counts and long prediction times; they cannot properly balance accuracy and speed and are difficult to deploy on terminal devices.
Disclosure of Invention
Aiming at the problems that algorithms in the crowd counting field are highly complex, struggle to reach real time, and fail to balance accuracy and speed properly, the invention designs a lightweight, fast multi-label crowd counting neural network that strikes a good balance between accuracy and computational efficiency and is easy to deploy on terminal devices.
The invention adopts the following technical scheme: a lightweight, fast crowd counting method based on multiple labels. The specific flow of the crowd counting algorithm is as follows: the data are preprocessed and enhanced; the processed image data are input into the convolutional neural network proposed by the invention, and high-level head features are extracted through a series of backbone operations such as convolution, downsampling and dense residual connection; in this process, six branches of the network provide multi-scale intermediate supervision (applied only during the training phase); the extracted high-level features are then fed to an upsampling module, which generates the predicted density map and segmentation map; finally, the density map is integrated over the whole image to obtain the final count. The overall flow chart of the proposed method is shown in fig. 1.
(1) Data preprocessing: the invention uses three public datasets, UCF_CC_50, ShanghaiTech and UCF-QNRF. For convenient training and prediction, the data are preprocessed: image height and width are limited so that the longest side does not exceed 1024 pixels while the aspect ratio is preserved. Because the network contains five downsampling operations and the decoding process contains successive upsampling, the input image is resized so that its height and width are evenly divisible by 32, ensuring size consistency and positioning accuracy.
(2) Data enhancement: to address the small number of samples in the datasets, the invention uses five different data enhancement methods: random brightness, random saturation, random horizontal flipping, random noise, and random scaling with cropping.
(3) Multi-label generation: density-map labels are generated with an adaptive Gaussian kernel; on this basis, segmentation labels are generated with a random scaling strategy. Both labels are applied to single-path multi-channel prediction, converting the density-map-based crowd counting task into a foreground-background segmentation task that assists the regression task of the crowd density map, so that the density map and the segmentation map are predicted simultaneously and prediction errors are effectively reduced. In addition, to match the intermediate-supervision training process, six groups of multi-scale intermediate supervision labels are designed from the two generated labels and the receptive-field sizes of the network at different stages.
(4) Model setup and training: the network model mainly comprises a backbone convolutional network, an upsampling module and intermediate supervision branches. The backbone consists of convolution layers with 1 × 1 and 3 × 3 kernels, ReLU activation functions, batch normalization layers and residual connections, with a total of only 2.06M parameters. Enhanced data of size 640 × 640 are input into the model in batches of 16; with the intermediate supervision of six branches, high-level crowd feature maps are extracted and then fed to the upsampling module, where network training is supervised by the strong supervision signal of the full-size segmentation map and density map.
In the training process, the invention does not use a pre-trained model but initializes all network parameters with the Xavier method. The mean-squared-error loss function is used to train the density map and the cross-entropy loss function to train the segmentation map; the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations in total.
(5) Model prediction: after model training is completed, the trained model parameters are loaded, test data of any size are input, the predicted density map is obtained end to end, and the total number of people is obtained by integrating it. At this stage only the trained backbone parameters are loaded and the intermediate supervision branches are inactive, which raises inference speed; combined with the lightweight network structure, real-time prediction is achieved.
The evaluation indicators of the model are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, where the method obtains competitive results. On PartA of the ShanghaiTech dataset the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms per image, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date.
Drawings
Fig. 1 is a schematic overall flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network structure according to the present invention.
Fig. 3 is a schematic diagram of a multi-tag according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
The invention relates to a lightweight, fast crowd counting method based on multiple labels. As shown in fig. 1, the specific flow of the crowd counting method is as follows: the data are preprocessed, enhanced and input into the convolutional neural network, and a crowd feature map is extracted through a series of backbone operations such as convolution, downsampling and residual connection; in this process, six branches of the network provide multi-scale intermediate supervision (applied only during the training phase); the final predicted density map and segmentation map are then generated by the upsampling module; finally, the density map is integrated to obtain the final counting result.
The specific algorithm is as follows:
(1) Data preprocessing
The three mainstream public datasets UCF_CC_50, ShanghaiTech and UCF-QNRF are preprocessed. Image sizes in these datasets vary widely; to enable uniform training and save GPU memory, preprocessing limits the image height and width so that the longest side does not exceed 1024 pixels while the aspect ratio stays unchanged. Since the encoding process of the network contains five downsampling operations and the decoding process contains successive upsampling, a resize operation makes the input height and width evenly divisible by 32, ensuring that input and output sizes match and positioning stays accurate.
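For clarity, a minimal sketch of this preprocessing step is given below; the library choice (Pillow) and the helper name are illustrative assumptions, not part of the patent.

```python
from PIL import Image

MAX_SIDE = 1024   # longest-side limit stated above
STRIDE = 32       # five downsampling stages -> 2**5

def preprocess(img: Image.Image) -> Image.Image:
    w, h = img.size
    # Limit the longest side to 1024 while preserving the aspect ratio.
    scale = min(1.0, MAX_SIDE / max(w, h))
    w, h = int(w * scale), int(h * scale)
    # Round height and width down to multiples of 32 so that five
    # downsamplings followed by the decoder's upsampling reproduce the size.
    w = max(STRIDE, w // STRIDE * STRIDE)
    h = max(STRIDE, h // STRIDE * STRIDE)
    return img.resize((w, h), Image.BILINEAR)
```

Note that head-point annotations must be rescaled by the same factors so that the labels stay aligned with the resized image.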
(2) Data enhancement
UCF_CC_50 contains 50 images of different resolutions with per-image annotated counts ranging from 94 to 4543; the background scenes are relatively uniform. The ShanghaiTech dataset contains 1198 images with 330,165 annotated people in total; counts vary greatly, but the scenes are relatively uniform. The UCF-QNRF dataset uses 1201 images for training and 334 images for testing; its complex scenes and higher crowd density make it more realistic and more difficult. For these dataset characteristics, the invention uses five different enhancement methods: random brightness, random saturation, random horizontal flipping, random noise, and random scaling with cropping. The aim is to derive more training samples, reduce the influence of varying head sizes, positions and colors on the model, prevent overfitting, and improve the model's generalization ability.
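The following sketch illustrates the five enhancements with NumPy only; the probabilities and parameter ranges are illustrative assumptions (the patent fixes only the 0.7-1.6 scaling range).

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """img: float32 HxWx3 array scaled to [0, 1]."""
    if rng.random() < 0.5:                                   # random brightness
        img = np.clip(img * rng.uniform(0.8, 1.2), 0, 1)
    if rng.random() < 0.5:                                   # random saturation
        gray = img.mean(axis=2, keepdims=True)
        img = np.clip(gray + rng.uniform(0.8, 1.2) * (img - gray), 0, 1)
    if rng.random() < 0.5:                                   # random horizontal flip
        img = img[:, ::-1, :].copy()
    if rng.random() < 0.5:                                   # random Gaussian noise
        img = np.clip(img + rng.normal(0.0, 0.01, img.shape), 0, 1)
    # Random scaling (factor 0.7-1.6 per the text) and cropping must be
    # applied jointly to the image and its head annotations; omitted here.
    return img.astype(np.float32)
```

Flips, scaling and cropping change head coordinates, so the corresponding point annotations have to be transformed in the same call.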
(3) Multi-label generation
As shown in fig. 3, the multi-label comprises three parts: a density map and a segmentation map that supervise the network output and cover all heads, and multi-scale labels for intermediate supervision.
Density map: a density map is generated by convolving a geometry-adaptive Gaussian kernel with each head-point annotation. If pixel $x_i$ denotes the head-centre coordinate of the $i$-th person in a scene annotated with $N$ heads, the ground-truth crowd density map $D_{GT}$ is generated as

$$D_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i \quad (1)$$

where $i$ denotes the $i$-th head index, $N$ the total number of heads, $\delta(x - x_i)$ an impulse function marking the head position in the image, $G_{\sigma_i}$ an adaptive Gaussian kernel with standard deviation $\sigma_i$, and $\bar{d}_i$ the average of the Euclidean distances from the head to its three nearest neighbouring heads in the image. Setting $\beta = 0.15$ enables more accurate estimation of the head size.
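A minimal NumPy/SciPy sketch of Eq. (1) follows; the k-d tree lookup and the fallback sigma for a single-head image are assumptions made for illustration, and the per-head filtering loop favours clarity over speed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def density_map(points: np.ndarray, h: int, w: int, beta: float = 0.15) -> np.ndarray:
    """points: (N, 2) array of (x, y) head centres; returns an HxW density map."""
    D = np.zeros((h, w), dtype=np.float32)
    if len(points) == 0:
        return D
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(4, len(points)))  # self + up to 3 neighbours
    for (x, y), d in zip(points, np.atleast_2d(dists)):
        # sigma_i = beta * mean distance to the three nearest heads (Eq. 1).
        sigma = beta * float(d[1:].mean()) if len(points) > 1 else 15.0
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
        D += gaussian_filter(delta, sigma)  # each head contributes ~unit mass
    return D
```

Because every Gaussian integrates to one, summing (integrating) the map recovers the head count N.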
Segmentation map: since the density map generated by the adaptive Gaussian kernel covers the head regions fairly accurately, a pixel-level annotation separating foreground from background can easily be derived from it. Meanwhile, the invention introduces a scaling coefficient $\lambda$ to rescale each head region appropriately, slightly shrinking the area covered by larger heads and slightly enlarging the area covered by smaller heads, so that the segmentation map contains more complete head information. The segmentation map $S_{GT}$ is generated as

$$\tilde{D}_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\tilde{\sigma}_i}(x) \quad (2)$$

$$S_{GT}(j) = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases} \quad (3)$$

where $j$ denotes an arbitrary position in the density map, $p_j$ is the pixel value at position $j$ of $\tilde{D}_{GT}$, $\tilde{D}_{GT}$ denotes the density map obtained after introducing the scaling coefficients, $\lambda_i$ denotes the scaling coefficient of the $i$-th head, and the Gaussian standard deviation is $\tilde{\sigma}_i = \lambda_i \sigma_i$.
Notably, segmentation-map estimation is a more fundamental and easier binary classification task than the regression task of density-map estimation. It provides a classification loss that assists the regression task and improves the overall regression quality. The two kinds of labels are combined in the single-path multi-channel prediction of the backbone network, so that the density map and the segmentation map are predicted simultaneously, effectively reducing prediction error.
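Under the stated construction, the segmentation label can be derived by regenerating the density map with randomly scaled sigmas and binarizing it, as in the sketch below; the λ sampling range and the numerical threshold standing in for "p_j > 0" are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def segmentation_map(points, h, w, rng, beta=0.15, lam=(0.8, 1.2), eps=1e-3):
    """Binary head-region mask per Eqs. (2)-(3); lam range and eps are assumed."""
    S = np.zeros((h, w), dtype=np.float32)
    if len(points) == 0:
        return S.astype(np.uint8)
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(4, len(points)))
    for (x, y), d in zip(points, np.atleast_2d(dists)):
        sigma = beta * float(d[1:].mean()) if len(points) > 1 else 15.0
        sigma *= rng.uniform(*lam)       # per-head random scaling lambda_i (Eq. 2)
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
        S += gaussian_filter(delta, sigma)
    return (S > eps).astype(np.uint8)    # Eq. (3): foreground where density > 0
```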
Multi-scale labels: because the designed network contains five downsampling operations and the receptive fields at different stages differ in size, heads of different sizes can be covered by a receptive field of corresponding or larger size, so six intermediate supervision branches are set. To match the intermediate supervised training process, six corresponding groups of labels are required for supervised learning. Heads in a single image are divided into six groups according to the six set head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; a density map and a segmentation map are regenerated for each group and then downsampled at different scales, the downsampling factors of the groups being 4, 8, 16, 32, 32 in sequence. According to the receptive-field size and head size of each branch, heads in different size ranges are assigned to different branch labels, and the six branches together cover all heads. The calculation of the receptive field and the division of the head-size intervals are described in detail in the next section.
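The grouping itself reduces to routing each head to the branch whose size interval contains it; a hedged sketch follows (estimating the head size, e.g. from the adaptive Gaussian sigma, and the handling of out-of-range sizes are assumptions).

```python
INTERVALS = [(3, 15), (15, 30), (30, 60), (60, 120), (120, 240), (240, 480)]
# Downsampling factors exactly as listed in the text ("4, 8, 16, 32, 32").
DOWN_FACTORS = [4, 8, 16, 32, 32]

def branch_of(head_size: float) -> int:
    """Index of the intermediate-supervision branch responsible for this head."""
    if head_size < INTERVALS[0][0]:
        return 0                        # tiny heads go to the finest branch
    for b, (lo, hi) in enumerate(INTERVALS):
        if lo <= head_size < hi:
            return b
    return len(INTERVALS) - 1           # heads above 480 go to the widest branch
```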
(4) Model setup and training
As shown in fig. 2, the network model mainly consists of a backbone convolutional network, an upsampling module, and intermediate supervision branches. The processed data are input into the model, and high-level crowd features are extracted by the backbone network. In this process, the backbone network uses only 1 × 1 and 3 × 3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations, which keeps the network lightweight. The extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and density map.
To ensure that the model covers heads of all sizes while remaining lightweight, the backbone network is carefully designed: its maximum receptive-field size is 767, which fully covers the head regions of all three datasets. The backbone structure is divided into dense context sub-modules Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4. A dense context module effectively preserves information flow between network layers and retains more multi-scale context features. Denseblock_1 is a stack of ten feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/4 of the input; then nine combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected with one another, with the feature-map size unchanged throughout. Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each a stack of five feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 128 and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/2 of the input; then four combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged. Denseblock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3 × 3 × 128 and stride 1, a batch normalization layer and a ReLU activation, also densely connected, with the feature-map size unchanged.
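A hedged PyTorch sketch of one dense context module follows: a stride-2 3×3 reduction block plus densely connected stride-1 3×3 blocks, as described above. The 1×1 fusion convolutions that keep channel counts fixed after concatenation are an assumption, since the patent does not spell out how concatenated features are reduced.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DenseBlock(nn.Module):
    """One stride-2 reduction block + (n_blocks - 1) densely connected blocks."""
    def __init__(self, cin: int, cout: int, n_blocks: int):
        super().__init__()
        self.down = conv_bn_relu(cin, cout, stride=2)
        self.fuse = nn.ModuleList(
            nn.Conv2d(cout * k, cout, 1, bias=False) for k in range(1, n_blocks)
        )
        self.blocks = nn.ModuleList(
            conv_bn_relu(cout, cout) for _ in range(1, n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.down(x)]
        for fuse, block in zip(self.fuse, self.blocks):
            # Dense connection: every block sees all previous outputs.
            feats.append(block(fuse(torch.cat(feats, dim=1))))
        return feats[-1]

denseblock_1 = DenseBlock(3, 64, n_blocks=10)  # one stride-2 + nine stride-1 blocks
```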
Because the receptive fields at different stages of the designed model differ in size, heads of different sizes can be covered by receptive fields of corresponding or larger size. According to the receptive field calculated by formula (4), the invention sets six intermediate supervision branches, see fig. 2.
$$l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h \quad (4)$$

where $l_k$ denotes the receptive-field size of the $k$-th layer, $f_k$ is the kernel size of the convolution or pooling at the $k$-th layer, and $s_h$ denotes the stride of the $h$-th layer.
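Eq. (4) can be checked numerically with a few lines of Python; the helper below is generic, and the layer list shown for Denseblock_1 follows the configuration stated above.

```python
def receptive_fields(kernels, strides):
    """Cumulative receptive field after each layer, per Eq. (4)."""
    l, jump, out = 1, 1, []
    for f, s in zip(kernels, strides):
        l += (f - 1) * jump   # l_k = l_{k-1} + (f_k - 1) * prod(s_1 .. s_{k-1})
        jump *= s
        out.append(l)
    return out

# Denseblock_1: one stride-2 3x3 conv followed by nine stride-1 3x3 convs.
assert receptive_fields([3] * 10, [2] + [1] * 9)[-1] == 39  # matches Branch_1
```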
The receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively. The head-size intervals detected by the corresponding six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]. With the random-scaling image enhancement strategy (scaling factor 0.7-1.6), even the largest head-interval size of 480 enlarged by the maximum factor of 1.6 gives 768, which just matches the maximum receptive field. Statistics show the largest actual head size across the three datasets is 382 × 382, below 480, so the network's receptive fields can fully cover head regions of all sizes, including scaled ones, which also demonstrates the soundness of the network design.
The six intermediate supervision branches share a similar structure: each branch contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch. The design principle is to stay as lightweight as possible, so only 1 × 1 convolution kernels are used. Each intermediate supervision branch first maps the features to a new feature space through a convolution layer with kernel size 1 × 1 × c (the channel number c equals the output channel number of the corresponding dense context module), then feeds the features to the two sub-branches. The segmentation-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 2 channels, finally outputting a two-channel segmentation-map prediction. The density-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 1 channels, finally outputting a single-channel density-map prediction.
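A PyTorch sketch of one such branch is given below; the ReLU between the two 1×1 convolutions of each sub-branch is an assumption, since the text specifies only the convolution layers.

```python
import torch
import torch.nn as nn

class SupervisionBranch(nn.Module):
    """Shared 1x1 projection, then 1x1-only segmentation and density heads."""
    def __init__(self, c: int):
        super().__init__()
        self.proj = nn.Conv2d(c, c, 1)
        self.seg = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c, 2, 1))  # two-channel segmentation map
        self.den = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c, 1, 1))  # single-channel density map

    def forward(self, x: torch.Tensor):
        x = self.proj(x)
        return self.seg(x), self.den(x)
```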
Because the backbone network outputs feature maps at 1/32 of the original size after five downsamplings, crowd-positioning information is damaged to some extent. A simple and effective upsampling module is therefore set up so that, through the network's own learning, the resolution is raised step by step and a density map and segmentation map at the original size are finally output. This restores crowd-position information, improves density-map quality, and thereby further improves counting accuracy.
The upsampling module is a stack of three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling factor of 32; it finally outputs a density map and a segmentation map at the original image size. Specifically, Upsample_block_1 consists of a nearest-neighbour interpolation layer (4× upsampling), a convolution layer with kernel size 3 × 3 × 16 and stride 1, a batch normalization layer and a ReLU activation. Upsample_block_2 and Upsample_block_3 share the same structure as Upsample_block_1 except for their parameters: the upsampling factors of the nearest-neighbour interpolation layers are 4, 4 and 2, and the convolution kernel sizes are 3 × 3 × 16, 3 × 3 × 8 and 3 × 3 × 3, respectively. The upsampling module finally outputs a three-channel prediction: the first two channels are the predicted segmentation map, the third channel is the predicted density map, and integrating the density map yields the total count.
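A hedged PyTorch sketch of this decoder follows; the input channel count (128, the last backbone width named above) and the use of BN + ReLU in every block are assumptions.

```python
import torch.nn as nn

def up_block(cin: int, cout: int, scale: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="nearest"),
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    up_block(128, 16, 4),  # Upsample_block_1: x4, kernel 3x3x16
    up_block(16, 8, 4),    # Upsample_block_2: x4, kernel 3x3x8
    up_block(8, 3, 2),     # Upsample_block_3: x2, 2 seg channels + 1 density channel
)                          # overall upsampling factor: 4 * 4 * 2 = 32
```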
This completes the deployment of the whole lightweight network: a single network simultaneously performs the classification task of predicting the segmentation map and the regression task of predicting the density map. Doing so shares network parameters and promotes better semantic-feature learning for both tasks. Meanwhile, the segmentation map provides position-information constraints for the density-map prediction, making the network focus more on head regions while suppressing density-map responses to background regions and body parts, so that counting is more accurate.
In the training process, the invention does not use a pre-trained model and initializes all network parameters with the Xavier method. The mean-squared-error loss function is used to train the density map and the cross-entropy loss function to train the segmentation map; the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations in total.
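The stated setup translates into a few lines of PyTorch; reading the "momentum-based adaptive optimizer" as Adam is an assumption, and the one-layer model is only a stand-in for the assembled network.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))  # stand-in for the full network

def init_xavier(m: nn.Module) -> None:
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)   # Xavier init, no pre-trained weights
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_xavier)
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 0.0001
```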
The mean square error loss function can be expressed as:
$$L_{MSE} = \frac{1}{2H} \sum_{e=1}^{H} \left\| f(X_e) - D_e \right\|_2^2 \quad (5)$$

where $H$ is the number of training samples, $X_e$ is the $e$-th input image, $D_e$ is the corresponding density-map label, and $f(X_e)$ is the density map predicted from input $X_e$.
The cross entropy loss function can be expressed as:
$$L_{S} = -\frac{1}{M} \sum_{m=1}^{M} \left[ Y_m \log P_m + (1 - Y_m) \log (1 - P_m) \right] \quad (6)$$

where $M$ denotes the total number of pixels, $Y_m$ denotes the $m$-th segmentation-map label, and $P_m$ denotes the probability that the $m$-th pixel of the segmentation map is foreground.
The loss function for the six branches can be expressed as:
$$L_{B} = \sum_{n=1}^{C} \left( L_{MSE}^{n} + L_{S}^{n} \right) \quad (7)$$

where $C$ denotes the total number of branches, and $L_{MSE}^{n}$ and $L_{S}^{n}$ denote the density-map loss and segmentation-map loss of branch $n$, respectively.
The joint loss function can be expressed as:
$$L = L_{MSE} + \alpha L_{S} + \phi L_{B} \quad (8)$$

where $\alpha$ and $\phi$ are coefficients balancing the three losses, set to $\alpha = 0.001$ and $\phi = 0.1$ in the present invention.
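The four losses combine as in the sketch below; the tensor shapes and the use of logits for the segmentation head are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(den_pred, den_gt, seg_logits, seg_gt, branch_losses,
               alpha: float = 0.001, phi: float = 0.1) -> torch.Tensor:
    """den_*: (B,1,H,W) maps; seg_logits: (B,2,H,W); seg_gt: (B,H,W) int64."""
    l_mse = 0.5 * F.mse_loss(den_pred, den_gt)      # Eq. (5), density regression
    l_seg = F.cross_entropy(seg_logits, seg_gt)     # Eq. (6), foreground/background
    l_branch = sum(branch_losses)                   # Eq. (7), six supervision branches
    return l_mse + alpha * l_seg + phi * l_branch   # Eq. (8)
```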
(5) Model prediction: after model training is completed, the trained model parameters are loaded, test data of any size are input, the predicted density map is obtained directly, and the total number of people is obtained by integrating it. Note that only the trained backbone parameters are loaded at this stage and the intermediate supervision branches are inactive, which raises inference speed; combined with the lightweight network structure, real-time prediction is achieved.
The evaluation indicators of the model are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, where the method obtains competitive results. Since the algorithm aims at real-time estimation while balancing accuracy and speed, the comparison algorithms in Tables 1 and 2 were selected accordingly.
Table 1. Prediction-performance comparison of the method proposed by the present invention
(Table 1 is reproduced as an image in the original publication.)
Table 1 shows that on ShanghaiTech PartA the MSE of the proposed algorithm ranks second while all other indicators achieve the best results; the gains are especially large on ShanghaiTech PartB and on UCF-QNRF. PartB: MAE improves by 31.8% and MSE by 36.3%. UCF-QNRF: MAE improves by 16.1% and MSE by 10.4%.
Table 2. Computational-performance comparison of the method proposed by the present invention
(Table 2 is reproduced as an image in the original publication.)
Table 2 shows that on PartA of the public ShanghaiTech dataset the computational performance (model size, parameter count and running time) of the invention is verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date, while counting accuracy also improves. Compared with Cascade-MTL, currently the fastest method, MAE improves by 30%, so the modest speed sacrifice of the proposed method is worthwhile.
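For reference, the three reported indicators can be computed as below; note that crowd-counting work conventionally reports the root of the squared counting error under the name MSE (assumed here), and the PSNR peak value is likewise an assumption.

```python
import numpy as np

def mae(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred, gt):  # root-mean-squared counting error, reported as "MSE"
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred_map, gt_map):
    err = float(np.mean((np.asarray(pred_map) - np.asarray(gt_map)) ** 2))
    peak = max(float(np.max(gt_map)), 1e-8)  # peak value choice is assumed
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```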

Claims (8)

1. A lightweight fast crowd counting method based on multiple labels is characterized by comprising the following steps:
step one: preprocessing and enhancing data;
step two: inputting the processed image data into a convolutional neural network, and extracting high-level crowd features through a series of convolution, downsampling and dense-residual-connection operations of a backbone network; in this process, six branches of the network are used for multi-scale intermediate supervision;
step three: feeding the extracted high-level features to an upsampling module, which gradually increases the resolution and then generates a density map and a segmentation map consistent with the original image size;
step four: finally, obtaining the final counting result by integrating the density map over the whole image.
2. The method according to claim 1, wherein in step two the backbone network is composed of four dense context modules, each consisting of convolution layers with 1 × 1 and 3 × 3 kernels, a ReLU activation function, a batch normalization layer and dense residual connections.
3. The method as claimed in claim 1, wherein in step two six head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480] are set according to the head-size statistics of the datasets, and six intermediate supervision branches are correspondingly designed, each containing two independent sub-branches, namely a segmentation-map prediction sub-branch and a density-map prediction sub-branch, with corresponding receptive-field sizes of 39, 71, 143, 287, 575 and 767.
4. The method according to claim 1, wherein in step three an upsampling module built from repeated stacks of nearest-neighbour interpolation and convolution layers is designed, the feature-map resolution is gradually increased, and a full-size density map is finally output.
5. The method as claimed in claim 1, wherein three kinds of labels are designed, comprising a density map approximating head sizes generated with an adaptive Gaussian kernel, a segmentation map containing complete head information generated with a random scaling strategy, and multi-scale labels for the intermediate supervision branches; the density-map-based crowd counting task is converted into a foreground-background segmentation task that assists the regression task of the crowd density map.
6. The method of claim 5, wherein the generation process of the density map is represented as follows:
$$D_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i \quad (1)$$

where $i$ denotes the $i$-th head index, $N$ denotes the total number of heads, $\delta(x - x_i)$ is an impulse function marking the head position in the image, $G_{\sigma_i}$ is an adaptive Gaussian kernel with standard deviation $\sigma_i$, $\bar{d}_i$ denotes the average of the Euclidean distances from the head to its three nearest neighbouring heads in the image, and $\beta$ is a weight coefficient;
the generation process of the segmentation map is represented as:

$$\tilde{D}_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\tilde{\sigma}_i}(x), \qquad S_{GT}(j) = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases} \quad (2)$$

where $j$ denotes an arbitrary position in the density map, $p_j$ is the pixel value at position $j$ of $\tilde{D}_{GT}$, $\tilde{D}_{GT}$ denotes the density map obtained after introducing the scaling coefficients, $\lambda_i$ denotes the scaling coefficient of the $i$-th head, and the Gaussian standard deviation is $\tilde{\sigma}_i = \lambda_i \sigma_i$;
multi-scale labels: heads in a single image are divided into six groups according to the six set head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; a density map and a segmentation map are regenerated for each group and then downsampled at six scales, the downsampling factors of the groups being 4, 8, 16, 32, 32 in sequence.
7. The method as claimed in claim 1, wherein two training strategies are designed: one is single-path multi-channel prediction through the backbone network and the upsampling module, i.e., a single network outputs the segmentation map and density map simultaneously end to end; the other is multi-path single-channel prediction through the backbone network and the six intermediate supervision branches, i.e., each intermediate supervision branch has two sub-branches, one predicting the density map and one predicting the segmentation map.
8. The method of claim 1, wherein the network model is composed of a backbone convolutional network, an upsampling module and intermediate supervision branches; the processed data are input into the model, and high-level crowd features are extracted by the backbone network; in this process the backbone network uses only 1 × 1 and 3 × 3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations; the extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and density map;

the maximum receptive-field size of the backbone network is 767, which can fully cover the head regions of the three datasets; the backbone structure is divided into dense context sub-modules Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4;

Denseblock_1 is a stack of ten feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/4 of the input; then nine combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected with one another, with the feature-map size unchanged throughout;

Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each a stack of five feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 128 and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/2 of the input; then four combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged; Denseblock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3 × 3 × 128 and stride 1, a batch normalization layer and a ReLU activation, also densely connected, with the feature-map size unchanged;
setting the six intermediate supervision branches according to the receptive field calculated by formula (3):

$$l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h \quad (3)$$

where $l_k$ denotes the receptive-field size of the $k$-th layer, $f_k$ is the kernel size of the convolution or pooling at the $k$-th layer, and $s_h$ denotes the stride of the $h$-th layer;
the receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively; the head-size intervals detected by the corresponding six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; with the random-scaling image enhancement strategy (scaling factor 0.7-1.6), even the largest head-interval size of 480 enlarged by the maximum factor of 1.6 gives 768, which just matches the maximum receptive field; statistics show the largest actual head size across the three datasets is 382 × 382, below 480, so the network's receptive fields can fully cover head regions of all sizes, including scaled ones;
the six intermediate supervision branches share a similar structure: each branch contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch, using only 1 × 1 convolution kernels; each intermediate supervision branch first maps the features to a new feature space through a convolution layer with kernel size 1 × 1 × c, the channel number c equalling the output channel number of the corresponding dense context module, and then feeds the features to the two sub-branches; the segmentation-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 2 channels, finally outputting a two-channel segmentation-map prediction; the density-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 1 channels, finally outputting a single-channel density-map prediction;
because the backbone network outputs feature maps at 1/32 of the original size after five downsamplings, a simple and effective upsampling module is set up and, through the network's own learning, the resolution is raised step by step, finally outputting a density map and a segmentation map at the original size;
the upsampling module is a stack of three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling factor of 32, finally outputting a density map and a segmentation map at the original image size; specifically, Upsample_block_1 consists of a nearest-neighbour interpolation layer performing 4× upsampling, a convolution layer with kernel size 3 × 3 × 16 and stride 1, a batch normalization layer and a ReLU activation; Upsample_block_2 and Upsample_block_3 share the same structure as Upsample_block_1 except for their parameters: the upsampling factors of the nearest-neighbour interpolation layers are 4, 4 and 2, and the convolution kernel sizes are 3 × 3 × 16, 3 × 3 × 8 and 3 × 3 × 3, respectively; the upsampling module finally outputs a three-channel prediction, the first two channels being the predicted segmentation map and the third channel the predicted density map, and the total number of people is then obtained by integrating the density map.
CN201911386325.7A 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method Active CN111144329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Publications (2)

Publication Number Publication Date
CN111144329A true CN111144329A (en) 2020-05-12
CN111144329B CN111144329B (en) 2023-07-25

Family

ID=70521417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911386325.7A Active CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Country Status (1)

Country Link
CN (1) CN111144329B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Crowd counting method based on coding and decoding-jumping connection scale pyramid network
CN111898578A (en) * 2020-08-10 2020-11-06 腾讯科技(深圳)有限公司 Crowd density acquisition method and device, electronic equipment and computer program
CN111929054A (en) * 2020-07-07 2020-11-13 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111985381A (en) * 2020-08-13 2020-11-24 杭州电子科技大学 Guide area dense crowd counting method based on flexible convolutional neural network
CN112084959A (en) * 2020-09-11 2020-12-15 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112396126A (en) * 2020-12-02 2021-02-23 中山大学 Target detection method and system based on detection of main stem and local feature optimization
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN113887536A (en) * 2021-12-06 2022-01-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Intensive population estimation method based on deep learning
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN106203331A (en) * 2016-07-08 2016-12-07 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937A (en) * 2016-08-31 2017-01-11 郑州金惠计算机系统工程有限公司 Convolutional neural network based crowd density distribution estimation method
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089A (en) * 2017-12-11 2018-06-12 中山大学 A kind of people counting method of head detection and density map based on dimension self-adaption
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110020606A (en) * 2019-03-13 2019-07-16 北京工业大学 A kind of crowd density estimation method based on multiple dimensioned convolutional neural networks
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene
CN110163060A (en) * 2018-11-07 2019-08-23 腾讯科技(深圳)有限公司 The determination method and electronic equipment of crowd density in image

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Intensive population estimation method based on deep learning
CN106203331A (en) * 2016-07-08 2016-12-07 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937A (en) * 2016-08-31 2017-01-11 郑州金惠计算机系统工程有限公司 Convolutional neural network based crowd density distribution estimation method
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089A (en) * 2017-12-11 2018-06-12 中山大学 A kind of people counting method of head detection and density map based on dimension self-adaption
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110163060A (en) * 2018-11-07 2019-08-23 腾讯科技(深圳)有限公司 The determination method and electronic equipment of crowd density in image
CN110020606A (en) * 2019-03-13 2019-07-16 北京工业大学 A kind of crowd density estimation method based on multiple dimensioned convolutional neural networks
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YONGSANG YOON et al.: "Conditional Marked Point Process-based Crowd Counting in Sparsely and Moderately Crowded Scenes", 2016 International Conference on Control, Automation and Information Sciences (ICCAIS)
CHEN Hanqi et al.: "High-density crowd pedestrian counting using the STLK algorithm", Journal of Changchun University of Science and Technology (Natural Science Edition)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Crowd counting method based on coding and decoding-jumping connection scale pyramid network
CN111709290B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Crowd counting method based on coding and decoding-jump connection scale pyramid network
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN111929054A (en) * 2020-07-07 2020-11-13 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111898578A (en) * 2020-08-10 2020-11-06 腾讯科技(深圳)有限公司 Crowd density acquisition method and device, electronic equipment and computer program
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN111985381A (en) * 2020-08-13 2020-11-24 杭州电子科技大学 Guide area dense crowd counting method based on flexible convolutional neural network
CN111985381B (en) * 2020-08-13 2022-09-09 杭州电子科技大学 Guidance area dense crowd counting method based on flexible convolution neural network
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112084959A (en) * 2020-09-11 2020-12-15 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112084959B (en) * 2020-09-11 2024-04-16 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112396126A (en) * 2020-12-02 2021-02-23 中山大学 Target detection method and system based on detection of main stem and local feature optimization
CN112396126B (en) * 2020-12-02 2023-09-22 中山大学 Target detection method and system based on detection trunk and local feature optimization
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
CN113887536A (en) * 2021-12-06 2022-01-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance
CN113887536B (en) * 2021-12-06 2022-03-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Also Published As

Publication number Publication date
CN111144329B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111144329A (en) Light-weight rapid crowd counting method based on multiple labels
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
CN111639692B (en) Shadow detection method based on attention mechanism
CN110717851B (en) Image processing method and device, training method of neural network and storage medium
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN113095254A (en) Method and system for positioning key points of human body part
CN110956222A (en) Method for detecting network for underwater target detection
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN114972851B (en) Ship target intelligent detection method based on remote sensing image
CN116246109A (en) Multi-scale hole neighborhood attention computing backbone network model and application thereof
CN114550047B (en) Behavior rate guided video behavior recognition method
CN116051850A (en) Neural network target detection method, device, medium and embedded electronic equipment
CN116311349A (en) Human body key point detection method based on lightweight neural network
CN115587628A (en) Deep convolutional neural network lightweight method
CN111832336B (en) Improved C3D video behavior detection method
CN111639563B (en) Basketball video event and target online detection method based on multitasking
CN111489361B (en) Real-time visual target tracking method based on deep feature aggregation of twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant