CN111144329B - Multi-label-based lightweight rapid crowd counting method - Google Patents

Multi-label-based lightweight rapid crowd counting method

Info

Publication number
CN111144329B
Authority
CN
China
Prior art keywords
size
convolution
map
network
layer
Prior art date
Legal status
Active
Application number
CN201911386325.7A
Other languages
Chinese (zh)
Other versions
CN111144329A (en)
Inventor
王素玉
杨滨
冯明宽
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201911386325.7A
Publication of CN111144329A
Application granted
Publication of CN111144329B
Status: Active

Classifications

    • G06V20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G06F18/24 Classification techniques
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a multi-label-based lightweight rapid crowd counting method. A simple and efficient backbone feature-extraction network is designed around receptive-field size, with built-in dense context modules that preserve the information flow between network layers and improve the expressive capacity of the network. Six multi-scale intermediate supervision branches make the network converge faster and more stably. An up-sampling module raises the resolution step by step and improves density-map quality, enabling both accurate counting and accurate localization. Three labels are designed: the density-based crowd counting task is explicitly converted into a foreground/background segmentation task that assists the regression of the crowd density map, and the density map and segmentation map are predicted simultaneously, effectively reducing estimation error. Test results on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets show that the prediction performance of the method surpasses current mainstream algorithms, the prediction speed reaches real time, and the method can be conveniently deployed on terminal devices.

Description

Multi-label-based lightweight rapid crowd counting method
Technical Field
The invention belongs to the field of crowd counting in computer vision and discloses a method that uses a convolutional neural network to predict a density map and integrates the density map to obtain the total number of people in a single picture.
Background
Convolutional neural network (CNN) based crowd counting has made great progress. Most current state-of-the-art CNN methods use pre-trained backbone networks (e.g., VGG, ResNet101, DenseNet169) and complex module structures (e.g., Attention Module, Self-Attention Module, Perspective Module) to predict a density map of the input image, then sum the estimated density map to obtain the crowd count. Other methods use multi-column structures (MCNN and Switch-CNN) or multi-task methods (PCCNet) to improve prediction accuracy. These methods achieve rather high accuracy on the mainstream UCF_CC_50, ShanghaiTech and UCF-QNRF datasets. Although they work well, in the pursuit of high precision they often suffer from bulky network structures, large parameter counts and long prediction times; they cannot properly balance accuracy and speed and are difficult to deploy on terminal equipment.
Disclosure of Invention
Aiming at the problems that algorithms in the crowd counting field have high complexity, can hardly run in real time and cannot properly balance accuracy and speed, the invention designs a multi-label-based lightweight rapid crowd counting neural network that achieves a good balance between accuracy and computational efficiency and is easy to deploy on terminal equipment.
The invention adopts the following technical scheme: a multi-label-based lightweight rapid crowd counting method. The crowd counting algorithm proceeds as follows: preprocess and augment the data; feed the processed image data into the convolutional neural network proposed by the invention, and extract high-level head features through a series of backbone-network operations such as convolution, downsampling and dense residual connection; in this process, six branches of the network perform multi-scale intermediate supervision (applied only in the training phase); feed the extracted high-level features to an up-sampling module to generate the predicted density map and segmentation map; finally, integrate the density map over the whole image to obtain the final count. An overall flow chart of the proposed method is shown in fig. 1.
(1) Data preprocessing: the invention uses three public datasets, UCF_CC_50, ShanghaiTech and UCF-QNRF. To facilitate training and prediction, the data are preprocessed: the image height and width are limited so that the longest side does not exceed 1024 pixels. Since the neural network contains five downsampling operations and the decoding process contains successive upsamplings, the input image is resized so that both dimensions are divisible by 32, which keeps input and output sizes consistent and preserves localization accuracy.
(2) Data augmentation: to address the small number of samples in the datasets, the invention uses 5 different augmentation methods: random brightness, random saturation, random horizontal flipping, random noise and random crop-and-scale.
(3) Multi-label generation: a density-map label is generated with an adaptive Gaussian kernel, and on this basis a segmentation label is generated with a random scaling strategy. Both labels are applied to single-path multi-channel prediction: the density-based crowd counting task is explicitly converted into a foreground/background segmentation task that assists the regression of the crowd density map, and the density map and segmentation map are predicted simultaneously, effectively reducing prediction error. In addition, to support intermediate supervision during training, six groups of multi-scale intermediate supervision labels are derived from the two generated labels according to the receptive-field sizes at different stages of the network.
(4) Model setup and training: the network model mainly consists of a backbone convolutional network, an up-sampling module and intermediate supervision branches. The backbone consists of convolution layers with 1×1 and 3×3 kernels, ReLU activation functions, batch normalization layers and residual connections, with only 2.06M parameters in total. Augmented data with batch size 16 and size 640×640 is fed into the model; high-level crowd feature maps are extracted under the intermediate supervision of six branches and then fed into the up-sampling module, and network training is supervised with the strong supervision signals of the full-resolution segmentation map and density map.
During training the method uses no pre-trained model; all network parameters are initialized with the Xavier method. The mean-square-error loss function trains the density map, the cross-entropy loss function trains the segmentation map, the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations.
(5) Model prediction: after model training is completed, the pre-trained model parameters are loaded, test data of any size is input, the predicted density map is obtained end to end, and the total count is obtained by integration. In this stage only the backbone parameters need to be loaded and the intermediate supervision branches are inactive, which speeds up inference; together with the lightweight network structure this enables real-time prediction.
The evaluation metrics are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, and the method obtains competitive results. On Part A of the ShanghaiTech dataset, the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date.
Drawings
Fig. 1 is a schematic overall flow chart of the method according to the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network according to the present invention.
Fig. 3 is a schematic diagram of a multi-tag according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings, which illustrate in detail:
the invention relates to a multi-label-based lightweight rapid crowd counting method. As shown in fig. 1, the method specifically includes: preprocess and augment the data, input it into the convolutional neural network, and extract crowd feature maps through a series of backbone operations such as convolution, downsampling and residual connection; in this process, six branches of the network perform multi-scale intermediate supervision (applied only in the training phase); then generate the final predicted density map and segmentation map through the up-sampling module; finally, integrate the density map to obtain the final count. The specific algorithm is described below:
(1) Data preprocessing
Data preprocessing is performed on the three mainstream public datasets UCF_CC_50, ShanghaiTech and UCF-QNRF. Picture sizes in these datasets vary widely, so to unify training and save GPU memory, preprocessing limits the image height and width: the longest side does not exceed 1024 and the aspect ratio is kept unchanged. Since the encoding process involves five downsampling operations and the decoding process involves successive upsamplings, the resize operation makes both dimensions of the input image divisible by 32, ensuring that input and output agree in size and localization accuracy.
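As a rough illustration, the resizing rule above (longest side capped at 1024, both dimensions divisible by 32) can be sketched as follows; the function name and the rounding convention are assumptions, not the patent's exact implementation:

```python
def preprocess_size(h, w, max_side=1024, divisor=32):
    """Limit the longest side to max_side (keeping the aspect ratio),
    then round both dimensions to multiples of the divisor so that the
    five downsamplings and the matching upsamplings align exactly."""
    scale = min(1.0, max_side / max(h, w))
    h, w = h * scale, w * scale
    # snap each dimension to the nearest multiple of the divisor
    h = max(divisor, int(round(h / divisor)) * divisor)
    w = max(divisor, int(round(w / divisor)) * divisor)
    return h, w
```

For example, a 1536×2048 image would be halved to 768×1024, both already multiples of 32.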
(2) Data enhancement
UCF_CC_50 contains 50 images of different resolutions; the annotated count per image ranges from 94 to 4543, and the background scenes are simple. The ShanghaiTech dataset contains 1198 pictures with 330165 people in total; the crowds are large but the scenes are relatively uniform. The UCF-QNRF dataset uses 1201 pictures for training and 334 for testing; its scenes are complex and its crowd density higher, making it more realistic and more difficult. Given these dataset characteristics, the invention uses 5 different augmentation methods: random brightness, random saturation, random horizontal flipping, random noise and random crop-and-scale. The aim is to derive more training samples, reduce the influence of head size, position and color variations on the model, prevent overfitting and improve generalization.
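A minimal sketch of a few of these augmentations (brightness, horizontal flip with annotation mirroring, additive noise), assuming images are float arrays in [0, 1]; saturation and crop-and-scale are omitted for brevity and all parameter ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, points):
    """Apply simplified versions of some of the five enhancement ops.
    `image` is an HxWx3 float array in [0, 1]; `points` is an (N, 2)
    array of (x, y) head annotations that must follow geometric ops."""
    h, w = image.shape[:2]
    # random brightness
    image = np.clip(image * rng.uniform(0.8, 1.2), 0.0, 1.0)
    # random horizontal flip (mirror the annotations as well)
    if rng.random() < 0.5:
        image = image[:, ::-1]
        points = np.column_stack([w - 1 - points[:, 0], points[:, 1]])
    # random additive Gaussian noise
    image = np.clip(image + rng.normal(0.0, 0.01, image.shape), 0.0, 1.0)
    return image, points
```

The key design point the sketch illustrates is that geometric transforms must be applied to the head annotations as well as to the pixels, while photometric transforms touch only the pixels.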
(3) Multi-label fabrication
As shown in fig. 3, the multi-label consists of three parts: the density map and the segmentation map of all heads at the end of the network, and the multi-scale labels for intermediate supervision.
Density map: the density map is generated by convolving a geometry-adaptive Gaussian kernel with each point-level head annotation. If pixel x_i is the center coordinate of the i-th head in the scene, the crowd density map D^{GT} for N annotated heads is generated as

D^{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \quad \sigma_i = \beta \bar{d}_i    (1)

where i is the head index, N the total head count, \delta(x - x_i) a delta function at the head position in the image, G_{\sigma_i} the adaptive Gaussian kernel with standard deviation \sigma_i, and \bar{d}_i the mean of the Euclidean distances from head i to its neighbouring heads in the image; \beta is set to 0.15 so that the head size is estimated more accurately.
Segmentation map: since the density map generated with the adaptive kernel covers the head regions fairly accurately, pixel-level annotations separating foreground and background are easy to derive from it. The invention additionally introduces a scaling factor \lambda to rescale each head region, slightly shrinking the areas covered by larger heads and slightly enlarging smaller head areas, so that the segmentation map contains more complete head information. The segmentation map S^{GT} is generated as

S^{GT}_j = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases}    (2)

where j is any position in the density map and p_j is the pixel value at position j of the rescaled density map \tilde{D}^{GT}, which is generated after introducing the scaling factors; \lambda_i is the scaling factor of the i-th head, and the corresponding Gaussian standard deviation is \sigma'_i = \lambda_i \sigma_i    (3)
Notably, estimating the segmentation map is a more basic and easier binary classification task than the regression task of density-map estimation. It provides a classification loss that assists the regression task and improves overall regression quality. The two kinds of labels are jointly applied to the single-path multi-channel prediction of the backbone network, realizing simultaneous prediction of the density map and the segmentation map, which effectively reduces prediction error.
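The two label types can be sketched together as a toy generator; it assumes (row, col) point annotations, uses up to three nearest neighbours for the adaptive standard deviation of formula (1), and applies a single global scaling factor `lam` in place of the per-head factors of formula (2):

```python
import numpy as np

def make_labels(points, shape, beta=0.15, lam=1.0):
    """Toy label generator: a density map whose per-head Gaussian std is
    beta times the mean distance to the nearest neighbours, plus a
    foreground/background mask obtained by thresholding the map built
    with std scaled by lam. Names and the global lam are assumptions."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros(shape)
    seg_src = np.zeros(shape)
    for i, (r, c) in enumerate(points):
        # mean distance to up to 3 nearest other heads; fixed fallback
        others = np.delete(points, i, axis=0)
        if len(others):
            d = np.sort(np.hypot(others[:, 0] - r, others[:, 1] - c))[:3]
            sigma = beta * d.mean()
        else:
            sigma = 4.0
        for out, s in ((density, sigma), (seg_src, lam * sigma)):
            g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * s ** 2))
            out += g / g.sum()          # each head integrates to ~1
    seg = (seg_src > 1e-4).astype(np.uint8)
    return density, seg
```

Because every Gaussian is normalized to unit mass, the sum over the density map recovers the head count, which is the property the counting step relies on.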
Multiscale tag: because the designed network comprises five downsampling operations, and the receptive fields of the network at different stages are different in size, the heads of people with different sizes can be covered by the receptive fields with corresponding sizes or larger sizes, and therefore, six branches of intermediate supervision are arranged. In order to match with the training process of intermediate supervision, six groups of labels are correspondingly required to be set for supervised learning. According to the set six head size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480], dividing the heads in a single image into six groups, regenerating a density image and a segmentation image in each group, and then respectively performing sampling operation of different scales, wherein the downsampling multiples of the six groups of labels are 4,8, 16, 32 and 32 in sequence. According to the size of the receptive field and the size of the human head of each branch, the human heads in different size ranges are distributed to different branch labels, so that six branches can completely cover all human heads. The present invention will be described in detail in the next section with receptive field calculation and human head size interval division.
(4) Model setup and training
As shown in fig. 2, the network model mainly consists of a backbone convolutional network, an up-sampling module and intermediate supervision branches. The processed data is input into the model, and high-level crowd features are extracted through the backbone network. Throughout this process the backbone uses only 1×1 and 3×3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations, keeping the network lightweight. The extracted features are then fed to the up-sampling module, and network training is supervised with the segmentation map and the density map.
To ensure that the model covers heads of all sizes while remaining lightweight, the backbone network is carefully designed: its maximum receptive field is 767, enough to fully cover the head regions of the three datasets. The backbone structure is divided into 4 sub dense-context modules, DenseBlock_1, DenseBlock_2, DenseBlock_3 and DenseBlock_4. The dense-context modules effectively preserve the information flow between network layers and retain more multi-scale context features. DenseBlock_1 is a stack of ten feature-extraction blocks: one block combining a convolution layer with kernel size 3×3×64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output is 1/4 the input size; and nine blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer and a ReLU activation, densely connected to one another, with the feature-map size unchanged throughout.
DenseBlock_2, DenseBlock_3 and DenseBlock_4 share the same structure, a stack of five feature-extraction blocks: one block combining a convolution layer with kernel size 3×3×128 and stride 2, a batch normalization layer and a ReLU activation, whose output is 1/2 the input size; and four blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged. DenseBlock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3×3×128 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged.
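The dense connection pattern inside these blocks, where each block consumes the channel-wise concatenation of the input and all earlier block outputs, can be illustrated with a toy numpy sketch; the random 1×1 projection stands in for the real convolution + batch-norm + ReLU blocks, so this shows only the information flow, not the patent's layers:

```python
import numpy as np

def dense_stack(x, num_blocks, out_channels, rng=np.random.default_rng(0)):
    """Dense connectivity sketch: block k sees the concatenation of the
    input and the outputs of blocks 1..k-1, mimicking the DenseBlocks.
    The 'conv' is a random 1x1 projection plus ReLU, purely illustrative."""
    feats = [x]
    for _ in range(num_blocks):
        inp = np.concatenate(feats, axis=-1)      # H x W x (sum of channels)
        w = rng.standard_normal((inp.shape[-1], out_channels))
        feats.append(np.maximum(inp @ w, 0.0))    # 1x1 conv + ReLU
    return feats[-1]
```

Note how the channel count of the concatenated input grows with every block while the spatial size stays fixed, which is exactly the property the text describes for the stride-1 blocks.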
Because the receptive fields at different stages of the designed model differ in size, heads of different sizes can be covered by receptive fields of the corresponding or larger size. Six intermediate supervision branches are provided according to the receptive field computed by equation (4), see fig. 2:

l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h    (4)

where l_k is the receptive-field size of the k-th layer, f_k the kernel size of the k-th convolution or pooling layer, and s_h the stride of the h-th layer convolution.
The receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively. The head-size intervals detected by the six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240] and [240, 480]. Because the random-scaling augmentation uses ratios of 0.7 to 1.6, even the maximum head size of 480 becomes 768 after 1.6× enlargement, which corresponds to the maximum receptive field. The largest actual head in the three datasets is 382×382, well below 480, so the network's receptive fields can fully cover head regions of every size, including scaled ones, which confirms the soundness of the network design.
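Formula (4) can be sketched as a small calculator over (kernel, stride) pairs; the layer lists in the usage below are illustrative and do not reproduce the patent's exact architecture or the branch values 39 to 767:

```python
def receptive_field(layers):
    """Receptive field per layer from (kernel, stride) pairs, using
    formula (4): l_k = l_{k-1} + (f_k - 1) * prod of strides before k."""
    l, jump, out = 1, 1, []
    for f, s in layers:
        l = l + (f - 1) * jump   # grow by (kernel - 1) * cumulative stride
        jump *= s                # cumulative stride up to this layer
        out.append(l)
    return out
```

For instance, three stride-1 3×3 convolutions give receptive fields 3, 5, 7, while a stride-2 3×3 convolution followed by a stride-1 one gives 3, 7.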
The six intermediate supervision branches have similar structures; each contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch, designed to be as lightweight as possible using only 1×1 convolution kernels. Each intermediate supervision branch first maps the features into a new feature space through a convolution layer with kernel size 1×1×c (the channel count c matches the output channels of the corresponding dense-context module), then feeds them into the two sub-branches. The segmentation-map prediction sub-branch contains two 1×1 convolution layers with c and 2 channels respectively and outputs a two-channel segmentation prediction. The density-map prediction sub-branch contains two 1×1 convolution layers with c and 1 channels respectively and outputs a single-channel density prediction.
Because the backbone output is 1/32 of the original size after five downsamplings, crowd localization information is partly destroyed. The simple and effective up-sampling module lets the network learn to raise the resolution step by step and finally output a density map and segmentation map at the original size, recovering the crowd position information, improving density-map quality and thus counting accuracy.
The up-sampling module is a stack of three sub-modules, each consisting of an up-sampling layer and a convolution layer; the overall up-sampling rate is 32, and the module finally outputs a density map and a segmentation map at the original image size. Specifically, UpsampleBlock_1 consists of a nearest-neighbor interpolation layer (4× up-sampling), a convolution layer with kernel size 3×3×16 and stride 1, a batch normalization layer and a ReLU activation. UpsampleBlock_1, UpsampleBlock_2 and UpsampleBlock_3 have similar structures and differ only in parameters: the nearest-neighbor up-sampling rates are 4, 4 and 2 respectively, and the convolution kernel sizes are 3×3×16, 3×3×8 and 3×3×3. The up-sampling module finally outputs a three-channel prediction: the first two channels are the predicted segmentation map, the third channel is the predicted density map, and the total count is then obtained by integrating the density map.
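The spatial side of the decoder (nearest-neighbour up-sampling by rates 4, 4 and 2 for an overall factor of 32) can be sketched as follows, with the interleaved convolution, batch-norm and ReLU layers omitted, so this only illustrates how the resolution is restored:

```python
import numpy as np

def nearest_upsample(x, factor):
    """Nearest-neighbour interpolation as used in the UpsampleBlocks."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

def upsample_head(features):
    """Decoder sketch: three up-sampling stages with rates 4, 4 and 2
    (overall x32), matching a 640x640 input whose backbone output is
    a 20x20 feature map."""
    for factor in (4, 4, 2):
        features = nearest_upsample(features, factor)
    return features
```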
This completes the deployment of the whole lightweight network: a single network simultaneously performs the classification task of predicting the segmentation map and the regression task of predicting the density map. Doing so shares network parameters and helps the two tasks learn better semantic features. Meanwhile, the segmentation map constrains the density-map prediction with position information, making the network focus on head regions and suppressing density-map responses on background regions and body parts, so counting becomes more accurate.
During training the method uses no pre-trained model and initializes all network parameters with the Xavier method. The mean-square-error loss function trains the density map, the cross-entropy loss function trains the segmentation map, the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations.
The mean-square-error loss function can be expressed as

L_{MSE} = \frac{1}{2H} \sum_{e=1}^{H} \lVert f(X_e) - D_e \rVert_2^2    (5)

where H is the number of training samples, X_e the e-th input image, D_e the corresponding density-map label, and f(X_e) the predicted density map that the network maps the input X_e to.
The cross-entropy loss function can be expressed as

L_S = -\frac{1}{M} \sum_{m=1}^{M} \left[ Y_m \log P_m + (1 - Y_m) \log (1 - P_m) \right]    (6)

where M is the total number of pixels, Y_m the m-th segmentation-map label, and P_m the probability that the m-th pixel of the segmentation map is foreground.
The loss function of the six branches can be expressed as

L_B = \sum_{n=1}^{C} \left( L_{MSE}^{n} + L_{S}^{n} \right)    (7)

where C is the total number of branches, and L_{MSE}^{n}, L_{S}^{n} are the density-map loss and segmentation-map loss of branch n, respectively. The joint loss function can be expressed as:
L = L_{MSE} + \alpha L_S + \varphi L_B    (8)
where α and φ are coefficients balancing the three losses; the invention sets α = 0.001 and φ = 0.1 and uses the joint loss function for end-to-end training.
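Formulas (5) to (8) can be combined into a single sketch; the normalization conventions (per-pixel means rather than sums) are assumptions where the patent's equation images are not reproduced, and `pred_p` is assumed to hold foreground probabilities in (0, 1):

```python
import numpy as np

def joint_loss(pred_d, gt_d, pred_p, gt_s,
               alpha=0.001, phi=0.1, branch_losses=()):
    """Joint objective of formula (8): MSE on the density map (5),
    cross-entropy on the segmentation map (6), plus the summed
    intermediate-branch losses (7)."""
    l_mse = 0.5 * np.mean((pred_d - gt_d) ** 2)
    eps = 1e-12                      # numerical guard for the logs
    l_s = -np.mean(gt_s * np.log(pred_p + eps)
                   + (1 - gt_s) * np.log(1 - pred_p + eps))
    l_b = float(sum(branch_losses))  # formula (7), precomputed per branch
    return l_mse + alpha * l_s + phi * l_b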
(5) Model prediction: after model training is completed, pre-trained model parameters are loaded, test data of any size are input, a predicted density map is directly obtained, and the total number of people is obtained through integration. Note that only the parameters trained by the backbone network are needed to be loaded in the stage, and the intermediate supervision branches are not effective, so that the model reasoning speed is improved, and the real-time prediction can be achieved by adding a light-weight network structure.
The evaluation metrics are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, and the method obtains competitive results. Since the algorithm aims at real-time estimation while balancing accuracy and speed, the comparison algorithms in tables 1 and 2 are selected accordingly.
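Hedged sketches of the three metrics follow; crowd-counting papers conventionally report the root of the mean squared error under the name MSE, which is assumed here, and the PSNR peak value defaulting to the ground-truth maximum is also an assumption:

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error over per-image counts."""
    d = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.mean(np.abs(d)))

def mse(pred_counts, gt_counts):
    """Root mean squared error over per-image counts (the quantity
    usually reported as 'MSE' in crowd-counting benchmarks)."""
    d = np.asarray(pred_counts) - np.asarray(gt_counts)
    return float(np.sqrt(np.mean(d ** 2)))

def psnr(pred_map, gt_map, peak=None):
    """PSNR between predicted and ground-truth density maps."""
    peak = gt_map.max() if peak is None else peak
    err = np.mean((pred_map - gt_map) ** 2)
    return float(10.0 * np.log10(peak ** 2 / err))
```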
TABLE 1 Comparison of the prediction performance of the proposed method
As shown in Table 1, on ShanghaiTech Part A the proposed algorithm ranks second on the MSE metric and best on all other metrics; the improvement is especially large on ShanghaiTech Part B and the UCF-QNRF dataset. Part B: MAE improves by 31.8% and MSE by 36.3%. UCF-QNRF: MAE improves by 16.1% and MSE by 10.4%.
TABLE 2 Comparison of the computational performance of the proposed method
As shown in Table 2, on Part A of the public ShanghaiTech dataset the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date, while counting accuracy also improves. Compared with Cascade-MTL, currently the fastest method, the proposed method sacrifices some speed but improves MAE by 30%.

Claims (6)

1. A multi-tag based lightweight fast crowd counting method, comprising:
step one: preprocessing data and enhancing the data;
step two: inputting the processed image data into a convolutional neural network, and extracting high-level crowd features through a series of convolution, downsampling and dense residual connection operations of a backbone network; in this process, six branches of the network are used for multi-scale intermediate supervision;
step three: feeding the extracted advanced features to an up-sampling module, and generating a density map and a segmentation map which are consistent with the original image in size after gradually improving the resolution;
step four: obtaining a final counting result by carrying out integral integration on the density map;
three labels are designed, wherein the three labels comprise a density map of the size of the head of a person generated by utilizing an adaptive Gaussian kernel, a segmentation map containing complete head information generated by utilizing a random scaling strategy and a multi-scale label for intermediate supervision branches, and a crowd counting task based on the density map is converted into a foreground and background segmentation task to assist a regression task of the crowd density map in a displaying manner;
the network model consists of a backbone convolutional network, an upsampling module, and intermediate supervision branches; the processed data are input into the model, and high-level crowd features are extracted by the backbone network; in this process, the backbone uses only 1×1 and 3×3 convolution layers, the ReLU activation function, batch normalization layers, and dense residual connections; the extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and the density map;
the maximum receptive field size of the backbone network is 767; the backbone structure is divided into 4 sub-dense context modules, namely Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4;
Denseblock_1 is formed by stacking ten feature extraction blocks: one block combining a convolution layer with kernel size 3×3×64 (kernel height × kernel width × channels) and stride 2, a batch normalization layer, and a ReLU activation function, whose output size is 1/4 of the input; and nine blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, connected among themselves in a dense manner, with the feature map dimensions kept unchanged during computation;
Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each formed by stacking five feature extraction blocks: one block combining a 3×3×128 convolution layer with stride 2, a batch normalization layer, and a ReLU activation function, whose output size is 1/2 of the input; and four blocks each combining a 3×3×64 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, densely connected, with the feature map size kept unchanged during computation; Denseblock_5 is formed by stacking three repeated feature extraction blocks, each consisting of a 3×3×128 convolution layer with stride 1, a batch normalization layer, and a ReLU activation function, densely connected, with the feature map size kept unchanged during computation;
the receptive field is calculated according to the following formula, and six intermediate supervision branches are arranged:

l_k = l_{k-1} + (f_k − 1) × ∏_{h=1}^{k−1} s_h

where l_k denotes the receptive field size of the k-th layer, f_k the kernel size of the k-th convolution layer or the pooling size of the k-th pooling layer, and s_h the stride of the h-th convolution layer;
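The recurrence above can be sketched in a few lines; as a check, the structure described for Denseblock_1 (one 3×3 stride-2 convolution followed by nine 3×3 stride-1 convolutions) reproduces the receptive field of 39 quoted for the first branch. This is an illustrative sketch, not the patented implementation:

```python
def receptive_field(layers):
    """Receptive field of a stack of layers.

    layers: sequence of (kernel_size, stride) pairs, input side first.
    Implements l_k = l_{k-1} + (f_k - 1) * product of earlier strides.
    """
    rf, jump = 1, 1
    for f, s in layers:
        rf += (f - 1) * jump  # each layer widens the field by (f-1)*jump pixels
        jump *= s             # strides compound the per-position step size
    return rf

# Denseblock_1: one 3x3 stride-2 conv, then nine 3x3 stride-1 convs
print(receptive_field([(3, 2)] + [(3, 1)] * 9))  # 39
```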
in the six intermediate supervision branch structures, each branch contains two sub-branches, a segmentation map prediction sub-branch and a density map prediction sub-branch, and uses only convolution kernels of size 1×1; each intermediate supervision branch first passes through a convolution layer with kernel size 1×1×c, where the channel number c matches the output channel number of the corresponding dense context module, mapping the features to a new feature space, which is then fed to the two sub-branches; the segmentation map prediction sub-branch contains two 1×1 convolution layers with channel numbers c and 2, finally outputting a two-channel segmentation map prediction; the density map prediction sub-branch contains two 1×1 convolution layers with channel numbers c and 1, finally outputting a single-channel density map prediction;
the backbone network performs five downsamplings, so its output size is 1/32 of the original; through the upsampling module, the network learns autonomously to restore the resolution step by step, finally outputting the density map and segmentation map at the original size;
the upsampling module is formed by stacking three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling ratio of 32, finally outputting a density map and a segmentation map at the original image size; specifically, UpsampleBlock_1 consists of a four-fold nearest-neighbor interpolation layer, a convolution layer with kernel size 3×3×16 and stride 1, a batch normalization layer, and a ReLU activation function; UpsampleBlock_1, UpsampleBlock_2 and UpsampleBlock_3 share the same structure and differ only in their parameters: the nearest-neighbor upsampling factors are 4, 4 and 2, and the convolution kernel sizes are 3×3×16, 3×3×8 and 3×3×3, respectively; the upsampling module finally outputs a three-channel prediction, of which the first two channels are the predicted segmentation map and the third channel is the predicted density map; the total head count is then obtained by integrating the density map.
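A minimal numpy sketch of the 4×, 4×, 2× nearest-neighbour upsampling chain and the count-by-integration step; the array sizes are illustrative, not taken from the patent:

```python
import numpy as np

def nn_upsample(x, scale):
    """Nearest-neighbour interpolation by an integer factor per spatial axis."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

# a 1/32-resolution prediction restored to full resolution (4 * 4 * 2 = 32)
low_res = np.random.rand(8, 8)
full_res = nn_upsample(nn_upsample(nn_upsample(low_res, 4), 4), 2)
print(full_res.shape)  # (256, 256)

# the estimated head count is the integral (sum) of the final density map
count = float(full_res.sum())
```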
2. The multi-label based lightweight fast crowd counting method of claim 1, wherein in step two, the backbone network is composed of four dense context modules, each composed of 1×1 and 3×3 convolution layers, ReLU activation functions, batch normalization layers, and dense residual connections.
3. The method of claim 1, wherein in step two, according to the head-size statistics of the dataset, six head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240] and [240, 480] are set, and six intermediate supervision branches are designed correspondingly, each branch containing two independent sub-branches, a segmentation map prediction sub-branch and a density map prediction sub-branch; the corresponding receptive field sizes are 39, 71, 143, 287, 575 and 767.
4. The multi-label based lightweight fast crowd counting method according to claim 1, wherein in step three, an upsampling module of repeatedly stacked nearest-neighbor interpolation and convolution layers is designed to raise the feature map resolution step by step, finally outputting a density map at the original size.
5. The multi-label based lightweight rapid crowd counting method of claim 1, wherein the density map generation process is expressed as:

F(x) = Σ_{i=1}^{N} δ(x − x_i) ∗ G_{σ_i}(x), with σ_i = β · d̄_i

where i denotes the index of the i-th head, N the total head count, δ(x − x_i) an impulse function marking the head position x_i in the image, G_{σ_i} the adaptive Gaussian kernel, σ_i the Gaussian kernel standard deviation, d̄_i the mean of the Euclidean distances between the head and its three nearest neighbouring heads in the image, and β a weight coefficient;
the generation process of the segmentation map is expressed as:

S_j = 1 if p̃_j > 0, and S_j = 0 otherwise

where j denotes any position in the density map, p_j the pixel value at position j of the density map, p̃ the density map generated after introducing the scaling factors, and λ_i the scaling factor corresponding to the i-th head;
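An illustrative numpy sketch of the geometry-adaptive kernel (σ_i = β · d̄_i) and a segmentation mask derived from the resulting map; the parameter values, threshold, and normalization choice are assumptions for the example, not the patent's exact procedure:

```python
import numpy as np

def adaptive_density_map(points, shape, beta=0.3, k=3, fallback_sigma=4.0):
    """Place one normalized Gaussian per head; sigma_i = beta * mean
    Euclidean distance to the k nearest neighbouring heads."""
    H, W = shape
    dm = np.zeros((H, W))
    pts = np.asarray(points, dtype=float)
    yy, xx = np.mgrid[0:H, 0:W]
    for r, c in pts:
        d = np.sort(np.hypot(pts[:, 0] - r, pts[:, 1] - c))[1:k + 1]
        sigma = beta * d.mean() if d.size else fallback_sigma
        g = np.exp(-((yy - r) ** 2 + (xx - c) ** 2) / (2 * sigma ** 2))
        dm += g / g.sum()          # each head integrates to exactly 1
    return dm

heads = [(20, 20), (25, 40), (40, 30)]
dm = adaptive_density_map(heads, (64, 64))
print(round(dm.sum()))             # 3: the map's integral equals the head count
seg = (dm > 1e-4).astype(np.uint8) # crude foreground/background mask
```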
multiscale tag: according to the set six head size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480], dividing the heads in a single image into six groups, regenerating a density image and a segmentation image in each group, and then respectively performing six-scale sampling operation, wherein the downsampling multiples of the six groups of labels are 4,8, 16, 32 and 32 in sequence.
6. The multi-label based lightweight rapid crowd counting method of claim 1, wherein two training strategies are designed: one is single-path multi-channel prediction, in which a single network composed of the backbone and the upsampling module outputs the segmentation map and the density map simultaneously, end to end; the other is multi-path single-channel prediction through the backbone and the six intermediate supervision branches, where each intermediate supervision branch has two sub-branches, one predicting the density map and one predicting the segmentation map.
CN201911386325.7A 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method Active CN111144329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Publications (2)

Publication Number Publication Date
CN111144329A CN111144329A (en) 2020-05-12
CN111144329B true CN111144329B (en) 2023-07-25

Family

ID=70521417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911386325.7A Active CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Country Status (1)

Country Link
CN (1) CN111144329B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Crowd counting method based on coding and decoding-jump connection scale pyramid network
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN111929054B (en) * 2020-07-07 2022-06-07 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN111985381B (en) * 2020-08-13 2022-09-09 杭州电子科技大学 Guidance area dense crowd counting method based on flexible convolution neural network
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112084959B (en) * 2020-09-11 2024-04-16 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112396126B (en) * 2020-12-02 2023-09-22 中山大学 Target detection method and system based on detection trunk and local feature optimization
CN112597985B (en) * 2021-03-04 2021-07-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233B (en) * 2021-05-28 2023-05-16 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
CN113887536B (en) * 2021-12-06 2022-03-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2505501B (en) * 2012-09-03 2020-09-09 Vision Semantics Ltd Crowd density estimation
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN104992223B (en) * 2015-06-12 2018-02-16 安徽大学 Intensive Population size estimation method based on deep learning
CN106203331B (en) * 2016-07-08 2019-05-17 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937B (en) * 2016-08-31 2019-08-09 郑州金惠计算机系统工程有限公司 Crowd density distribution estimation method based on convolutional neural networks
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089B (en) * 2017-12-11 2021-07-30 中山大学 Size-adaptive-based crowd counting method for head detection and density map
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110163060B (en) * 2018-11-07 2022-12-23 腾讯科技(深圳)有限公司 Method for determining crowd density in image and electronic equipment
CN110020606B (en) * 2019-03-13 2021-03-30 北京工业大学 Crowd density estimation method based on multi-scale convolutional neural network
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Conditional Marked Point Process-based Crowd Counting in Sparsely and Moderately Crowded Scenes; Yongsang Yoon et al.; 2016 International Conference on Control, Automation and Information Sciences (ICCAIS); full text *
Pedestrian counting in high-density crowds using the STLK algorithm; Chen Hanqi et al.; Journal of Changchun University of Science and Technology (Natural Science Edition); full text *

Also Published As

Publication number Publication date
CN111144329A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144329B (en) Multi-label-based lightweight rapid crowd counting method
CN111639692B (en) Shadow detection method based on attention mechanism
CN110059710B (en) Apparatus and method for image classification using convolutional neural network
CN109271933B (en) Method for estimating three-dimensional human body posture based on video stream
CN110909801B (en) Data classification method, system, medium and device based on convolutional neural network
CN111968150B (en) Weak surveillance video target segmentation method based on full convolution neural network
CN111445418A (en) Image defogging method and device and computer equipment
CN112561027A (en) Neural network architecture searching method, image processing method, device and storage medium
CN111626184B (en) Crowd density estimation method and system
CN110443784B (en) Effective significance prediction model method
CN113095254B (en) Method and system for positioning key points of human body part
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN114821058A (en) Image semantic segmentation method and device, electronic equipment and storage medium
CN115410087A (en) Transmission line foreign matter detection method based on improved YOLOv4
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
KR102128789B1 (en) Method and apparatus for providing efficient dilated convolution technique for deep convolutional neural network
CN113538402B (en) Crowd counting method and system based on density estimation
CN113705394A (en) Behavior identification method combining long and short time domain features
CN116246109A (en) Multi-scale hole neighborhood attention computing backbone network model and application thereof
CN115587628A (en) Deep convolutional neural network lightweight method
CN114724175B (en) Pedestrian image detection network, pedestrian image detection method, pedestrian image training method, electronic device and medium
CN113344825B (en) Image rain removing method and system
CN111832336B (en) Improved C3D video behavior detection method
CN113902904A (en) Lightweight network architecture system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant