CN111144329A - Light-weight rapid crowd counting method based on multiple labels - Google Patents

Light-weight rapid crowd counting method based on multiple labels Download PDF

Info

Publication number
CN111144329A
CN111144329A (application CN201911386325.7A)
Authority
CN
China
Prior art keywords
size
network
convolution
density
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911386325.7A
Other languages
Chinese (zh)
Other versions
CN111144329B (en)
Inventor
王素玉 (Wang Suyu)
杨滨 (Yang Bin)
冯明宽 (Feng Mingkuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201911386325.7A priority Critical patent/CN111144329B/en
Publication of CN111144329A publication Critical patent/CN111144329A/en
Application granted granted Critical
Publication of CN111144329B publication Critical patent/CN111144329B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a lightweight, fast crowd counting method based on multiple labels. A simple and efficient backbone feature-extraction network is designed according to receptive-field size, and dense context modules within the network ensure information flow between network layers and improve the network's expressive capacity. Six multi-scale intermediate supervision branches are designed so that the network converges faster and more stably. An upsampling module is designed to increase the resolution step by step and improve density-map quality, enabling accurate counting and accurate localization. Three kinds of labels are designed: the density-based crowd counting task is converted into a foreground-background segmentation task that assists the regression task of the crowd density map, so that the density map and the segmentation map are predicted simultaneously and estimation errors are effectively reduced. Test results on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets show that the prediction performance of the method surpasses current mainstream algorithms, prediction runs in real time, and the method can be conveniently deployed on terminal devices.

Description

Light-weight rapid crowd counting method based on multiple labels
Technical Field
The invention belongs to the field of crowd counting in computer vision and relates to a method that uses a convolutional neural network to predict a density map and integrates the density map to obtain the total number of people in a single image; unlike current mainstream convolutional networks, it is not built on VGG, ResNet or DenseNet backbones.
Background
In recent years, large-scale gatherings such as parades, festival celebrations, concerts and sporting events have become increasingly frequent, and emergencies caused by dense crowds have become a focus of public concern. Crowd counting, as an important means of crowd control and management, provides statistics on the crowd in the current scene, assists resource allocation, supports contingency planning for emergencies, and enhances the safety of public places. Moreover, crowd counting technology can easily be transferred to other fields to handle similar counting tasks. However, owing to occlusion, background noise, and variations in scale and viewing angle, traditional detection- and regression-based methods have their limitations, and accurate, fast crowd counting remains a difficult open problem in computer vision.
Currently, great progress has been made in crowd counting based on convolutional neural networks (CNNs). Most state-of-the-art CNN methods predict the density map of the input image with a pre-trained backbone network (e.g., VGG16, ResNet101 or DenseNet169) plus complex modular structures (e.g., attention modules, self-attention modules and perception modules), then sum the predicted density map to obtain the crowd count. Other methods use multi-column architectures (MCNN and Switch-CNN) or a multi-task approach (PCCNet) to improve prediction accuracy. These methods reach rather high accuracy on the mainstream UCF_CC_50, ShanghaiTech and UCF-QNRF datasets. Although they achieve good results, the pursuit of high precision often leaves them with bloated network structures, large parameter counts and long prediction times; they cannot properly balance accuracy and speed and are difficult to deploy on terminal devices.
Disclosure of Invention
Aiming at the problems that algorithms in the crowd counting field are highly complex, struggle to reach real time, and fail to balance accuracy and speed properly, the invention designs a lightweight, fast multi-label crowd counting neural network that strikes a good balance between accuracy and computational efficiency and is easy to deploy on terminal devices.
The invention adopts the following technical scheme: a lightweight, fast crowd counting method based on multiple labels. The specific flow of the crowd counting algorithm is as follows: the data are preprocessed and enhanced; the processed image data are input into the convolutional neural network proposed by the invention, and high-level head features are extracted through a series of backbone operations such as convolution, downsampling and dense residual connection; in this process, six branches of the network provide multi-scale intermediate supervision (applied only during the training phase); the extracted high-level features are then fed to an upsampling module, which generates the predicted density map and segmentation map; finally, the density map is integrated over the whole image to obtain the final count. The overall flow chart of the proposed method is shown in fig. 1.
(1) Data preprocessing: the invention uses three public datasets, UCF_CC_50, ShanghaiTech and UCF-QNRF. For convenient training and prediction, the data are preprocessed: image height and width are limited so that the longest side does not exceed 1024 pixels while the aspect ratio is preserved. Because the network contains five downsampling operations and the decoding process contains successive upsampling, the input image is resized so that its height and width are evenly divisible by 32, ensuring size consistency and positioning accuracy.
(2) Data enhancement: to address the small number of samples in the datasets, the invention uses five different data enhancement methods: random brightness, random saturation, random horizontal flipping, random noise, and random scaling with cropping.
(3) Multi-label generation: density-map labels are generated with an adaptive Gaussian kernel; on this basis, segmentation labels are generated with a random scaling strategy. Both labels are applied to single-path multi-channel prediction, converting the density-map-based crowd counting task into a foreground-background segmentation task that assists the regression task of the crowd density map, so that the density map and the segmentation map are predicted simultaneously and prediction errors are effectively reduced. In addition, to match the intermediate-supervision training process, six groups of multi-scale intermediate supervision labels are designed from the two generated labels and the receptive-field sizes of the network at different stages.
(4) Model setup and training: the network model mainly comprises a backbone convolutional network, an upsampling module and intermediate supervision branches. The backbone consists of convolution layers with 1 × 1 and 3 × 3 kernels, ReLU activation functions, batch normalization layers and residual connections, with a total of only 2.06M parameters. Enhanced data of size 640 × 640 are input into the model in batches of 16; with the intermediate supervision of six branches, high-level crowd feature maps are extracted and then fed to the upsampling module, where network training is supervised by the strong supervision signal of the full-size segmentation map and density map.
In the training process, the invention does not use a pre-trained model but initializes all network parameters with the Xavier method. The mean-squared-error loss function is used to train the density map and the cross-entropy loss function to train the segmentation map; the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations in total.
(5) Model prediction: after model training is completed, the trained model parameters are loaded, test data of any size are input, the predicted density map is obtained end to end, and the total number of people is obtained by integrating it. At this stage only the trained backbone parameters are loaded and the intermediate supervision branches are inactive, which raises inference speed; combined with the lightweight network structure, real-time prediction is achieved.
The evaluation indicators of the model are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, where the method obtains competitive results. On PartA of the ShanghaiTech dataset the computational performance (model size, parameter count and running time) was verified: inference takes 44 ms per image, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date.
Drawings
Fig. 1 is a schematic overall flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network structure according to the present invention.
Fig. 3 is a schematic diagram of a multi-tag according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
The invention relates to a lightweight, fast crowd counting method based on multiple labels. As shown in fig. 1, the specific flow of the crowd counting method is as follows: the data are preprocessed, enhanced and input into the convolutional neural network, and a crowd feature map is extracted through a series of backbone operations such as convolution, downsampling and residual connection; in this process, six branches of the network provide multi-scale intermediate supervision (applied only during the training phase); the final predicted density map and segmentation map are then generated by the upsampling module; finally, the density map is integrated to obtain the final counting result.
The specific algorithm is as follows:
(1) Data preprocessing
The three mainstream public datasets UCF_CC_50, ShanghaiTech and UCF-QNRF are preprocessed. Image sizes in these datasets vary widely; to enable uniform training and save GPU memory, preprocessing limits the image height and width so that the longest side does not exceed 1024 pixels while the aspect ratio stays unchanged. Since the encoding process of the network contains five downsampling operations and the decoding process contains successive upsampling, a resize operation makes the input height and width evenly divisible by 32, ensuring that input and output sizes match and positioning stays accurate.
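For clarity, a minimal sketch of this preprocessing step is given below; the library choice (Pillow) and the helper name are illustrative assumptions, not part of the patent.

```python
from PIL import Image

MAX_SIDE = 1024   # longest-side limit stated above
STRIDE = 32       # five downsampling stages -> 2**5

def preprocess(img: Image.Image) -> Image.Image:
    w, h = img.size
    # Limit the longest side to 1024 while preserving the aspect ratio.
    scale = min(1.0, MAX_SIDE / max(w, h))
    w, h = int(w * scale), int(h * scale)
    # Round height and width down to multiples of 32 so that five
    # downsamplings followed by the decoder's upsampling reproduce the size.
    w = max(STRIDE, w // STRIDE * STRIDE)
    h = max(STRIDE, h // STRIDE * STRIDE)
    return img.resize((w, h), Image.BILINEAR)
```

Note that head-point annotations must be rescaled by the same factors so that the labels stay aligned with the resized image.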
(2) Data enhancement
UCF_CC_50 contains 50 images of different resolutions with per-image annotated counts ranging from 94 to 4543; the background scenes are relatively uniform. The ShanghaiTech dataset contains 1198 images with 330,165 annotated people in total; counts vary greatly, but the scenes are relatively uniform. The UCF-QNRF dataset uses 1201 images for training and 334 images for testing; its complex scenes and higher crowd density make it more realistic and more difficult. For these dataset characteristics, the invention uses five different enhancement methods: random brightness, random saturation, random horizontal flipping, random noise, and random scaling with cropping. The aim is to derive more training samples, reduce the influence of varying head sizes, positions and colors on the model, prevent overfitting, and improve the model's generalization ability.
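The following sketch illustrates the five enhancements with NumPy only; the probabilities and parameter ranges are illustrative assumptions (the patent fixes only the 0.7-1.6 scaling range).

```python
import numpy as np

def augment(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """img: float32 HxWx3 array scaled to [0, 1]."""
    if rng.random() < 0.5:                                   # random brightness
        img = np.clip(img * rng.uniform(0.8, 1.2), 0, 1)
    if rng.random() < 0.5:                                   # random saturation
        gray = img.mean(axis=2, keepdims=True)
        img = np.clip(gray + rng.uniform(0.8, 1.2) * (img - gray), 0, 1)
    if rng.random() < 0.5:                                   # random horizontal flip
        img = img[:, ::-1, :].copy()
    if rng.random() < 0.5:                                   # random Gaussian noise
        img = np.clip(img + rng.normal(0.0, 0.01, img.shape), 0, 1)
    # Random scaling (factor 0.7-1.6 per the text) and cropping must be
    # applied jointly to the image and its head annotations; omitted here.
    return img.astype(np.float32)
```

Flips, scaling and cropping change head coordinates, so the corresponding point annotations have to be transformed in the same call.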
(3) Multi-label generation
As shown in fig. 3, the multi-label comprises three parts: a density map and a segmentation map that supervise the network output and cover all heads, and multi-scale labels for intermediate supervision.
Density map: a density map is generated by convolving a geometry-adaptive Gaussian kernel with each head-point annotation. If pixel $x_i$ denotes the head-centre coordinate of the $i$-th person in a scene annotated with $N$ heads, the ground-truth crowd density map $D_{GT}$ is generated as

$$D_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i \quad (1)$$

where $i$ denotes the $i$-th head index, $N$ the total number of heads, $\delta(x - x_i)$ an impulse function marking the head position in the image, $G_{\sigma_i}$ an adaptive Gaussian kernel with standard deviation $\sigma_i$, and $\bar{d}_i$ the average of the Euclidean distances from the head to its three nearest neighbouring heads in the image. Setting $\beta = 0.15$ enables more accurate estimation of the head size.
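A minimal NumPy/SciPy sketch of Eq. (1) follows; the k-d tree lookup and the fallback sigma for a single-head image are assumptions made for illustration, and the per-head filtering loop favours clarity over speed.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def density_map(points: np.ndarray, h: int, w: int, beta: float = 0.15) -> np.ndarray:
    """points: (N, 2) array of (x, y) head centres; returns an HxW density map."""
    D = np.zeros((h, w), dtype=np.float32)
    if len(points) == 0:
        return D
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(4, len(points)))  # self + up to 3 neighbours
    for (x, y), d in zip(points, np.atleast_2d(dists)):
        # sigma_i = beta * mean distance to the three nearest heads (Eq. 1).
        sigma = beta * float(d[1:].mean()) if len(points) > 1 else 15.0
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
        D += gaussian_filter(delta, sigma)  # each head contributes ~unit mass
    return D
```

Because every Gaussian integrates to one, summing (integrating) the map recovers the head count N.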
Segmentation map: since the density map generated by the adaptive Gaussian kernel covers the head regions fairly accurately, a pixel-level annotation separating foreground from background can easily be derived from it. Meanwhile, the invention introduces a scaling coefficient $\lambda$ to rescale each head region appropriately, slightly shrinking the area covered by larger heads and slightly enlarging the area covered by smaller heads, so that the segmentation map contains more complete head information. The segmentation map $S_{GT}$ is generated as

$$\tilde{D}_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\tilde{\sigma}_i}(x) \quad (2)$$

$$S_{GT}(j) = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases} \quad (3)$$

where $j$ denotes an arbitrary position in the density map, $p_j$ is the pixel value at position $j$ of $\tilde{D}_{GT}$, $\tilde{D}_{GT}$ denotes the density map obtained after introducing the scaling coefficients, $\lambda_i$ denotes the scaling coefficient of the $i$-th head, and the Gaussian standard deviation is $\tilde{\sigma}_i = \lambda_i \sigma_i$.
Notably, segmentation-map estimation is a more fundamental and easier binary classification task than the regression task of density-map estimation. It provides a classification loss that assists the regression task and improves the overall regression quality. The two kinds of labels are combined in the single-path multi-channel prediction of the backbone network, so that the density map and the segmentation map are predicted simultaneously, effectively reducing prediction error.
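Under the stated construction, the segmentation label can be derived by regenerating the density map with randomly scaled sigmas and binarizing it, as in the sketch below; the λ sampling range and the numerical threshold standing in for "p_j > 0" are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from scipy.spatial import cKDTree

def segmentation_map(points, h, w, rng, beta=0.15, lam=(0.8, 1.2), eps=1e-3):
    """Binary head-region mask per Eqs. (2)-(3); lam range and eps are assumed."""
    S = np.zeros((h, w), dtype=np.float32)
    if len(points) == 0:
        return S.astype(np.uint8)
    tree = cKDTree(points)
    dists, _ = tree.query(points, k=min(4, len(points)))
    for (x, y), d in zip(points, np.atleast_2d(dists)):
        sigma = beta * float(d[1:].mean()) if len(points) > 1 else 15.0
        sigma *= rng.uniform(*lam)       # per-head random scaling lambda_i (Eq. 2)
        delta = np.zeros((h, w), dtype=np.float32)
        delta[int(np.clip(y, 0, h - 1)), int(np.clip(x, 0, w - 1))] = 1.0
        S += gaussian_filter(delta, sigma)
    return (S > eps).astype(np.uint8)    # Eq. (3): foreground where density > 0
```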
Multi-scale labels: because the designed network contains five downsampling operations and the receptive fields at different stages differ in size, heads of different sizes can be covered by a receptive field of corresponding or larger size, so six intermediate supervision branches are set. To match the intermediate supervised training process, six corresponding groups of labels are required for supervised learning. Heads in a single image are divided into six groups according to the six set head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; a density map and a segmentation map are regenerated for each group and then downsampled at different scales, the downsampling factors of the groups being 4, 8, 16, 32, 32 in sequence. According to the receptive-field size and head size of each branch, heads in different size ranges are assigned to different branch labels, and the six branches together cover all heads. The calculation of the receptive field and the division of the head-size intervals are described in detail in the next section.
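The grouping itself reduces to routing each head to the branch whose size interval contains it; a hedged sketch follows (estimating the head size, e.g. from the adaptive Gaussian sigma, and the handling of out-of-range sizes are assumptions).

```python
INTERVALS = [(3, 15), (15, 30), (30, 60), (60, 120), (120, 240), (240, 480)]
# Downsampling factors exactly as listed in the text ("4, 8, 16, 32, 32").
DOWN_FACTORS = [4, 8, 16, 32, 32]

def branch_of(head_size: float) -> int:
    """Index of the intermediate-supervision branch responsible for this head."""
    if head_size < INTERVALS[0][0]:
        return 0                        # tiny heads go to the finest branch
    for b, (lo, hi) in enumerate(INTERVALS):
        if lo <= head_size < hi:
            return b
    return len(INTERVALS) - 1           # heads above 480 go to the widest branch
```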
(4) Model setup and training
As shown in fig. 2, the network model mainly consists of a backbone convolutional network, an upsampling module, and intermediate supervision branches. The processed data are input into the model, and high-level crowd features are extracted by the backbone network. In this process, the backbone network uses only 1 × 1 and 3 × 3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations, which keeps the network lightweight. The extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and density map.
To ensure that the model covers heads of all sizes while remaining lightweight, the backbone network is carefully designed: its maximum receptive-field size is 767, which fully covers the head regions of all three datasets. The backbone structure is divided into dense context sub-modules Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4. A dense context module effectively preserves information flow between network layers and retains more multi-scale context features. Denseblock_1 is a stack of ten feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/4 of the input; then nine combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected with one another, with the feature-map size unchanged throughout. Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each a stack of five feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 128 and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/2 of the input; then four combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged. Denseblock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3 × 3 × 128 and stride 1, a batch normalization layer and a ReLU activation, also densely connected, with the feature-map size unchanged.
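A hedged PyTorch sketch of one dense context module follows: a stride-2 3×3 reduction block plus densely connected stride-1 3×3 blocks, as described above. The 1×1 fusion convolutions that keep channel counts fixed after concatenation are an assumption, since the patent does not spell out how concatenated features are reduced.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin: int, cout: int, stride: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DenseBlock(nn.Module):
    """One stride-2 reduction block + (n_blocks - 1) densely connected blocks."""
    def __init__(self, cin: int, cout: int, n_blocks: int):
        super().__init__()
        self.down = conv_bn_relu(cin, cout, stride=2)
        self.fuse = nn.ModuleList(
            nn.Conv2d(cout * k, cout, 1, bias=False) for k in range(1, n_blocks)
        )
        self.blocks = nn.ModuleList(
            conv_bn_relu(cout, cout) for _ in range(1, n_blocks)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [self.down(x)]
        for fuse, block in zip(self.fuse, self.blocks):
            # Dense connection: every block sees all previous outputs.
            feats.append(block(fuse(torch.cat(feats, dim=1))))
        return feats[-1]

denseblock_1 = DenseBlock(3, 64, n_blocks=10)  # one stride-2 + nine stride-1 blocks
```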
Because the receptive fields at different stages of the designed model differ in size, heads of different sizes can be covered by receptive fields of corresponding or larger size. According to the receptive field calculated by formula (4), the invention sets six intermediate supervision branches, see fig. 2.
$$l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h \quad (4)$$

where $l_k$ denotes the receptive-field size of the $k$-th layer, $f_k$ is the kernel size of the convolution or pooling at the $k$-th layer, and $s_h$ denotes the stride of the $h$-th layer.
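Eq. (4) can be checked numerically with a few lines of Python; the helper below is generic, and the layer list shown for Denseblock_1 follows the configuration stated above.

```python
def receptive_fields(kernels, strides):
    """Cumulative receptive field after each layer, per Eq. (4)."""
    l, jump, out = 1, 1, []
    for f, s in zip(kernels, strides):
        l += (f - 1) * jump   # l_k = l_{k-1} + (f_k - 1) * prod(s_1 .. s_{k-1})
        jump *= s
        out.append(l)
    return out

# Denseblock_1: one stride-2 3x3 conv followed by nine stride-1 3x3 convs.
assert receptive_fields([3] * 10, [2] + [1] * 9)[-1] == 39  # matches Branch_1
```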
The receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively. The head-size intervals detected by the corresponding six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]. With the random-scaling image enhancement strategy (scaling factor 0.7-1.6), even the largest head-interval size of 480 enlarged by the maximum factor of 1.6 gives 768, which just matches the maximum receptive field. Statistics show the largest actual head size across the three datasets is 382 × 382, below 480, so the network's receptive fields can fully cover head regions of all sizes, including scaled ones, which also demonstrates the soundness of the network design.
The six intermediate supervision branches share a similar structure: each branch contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch. The design principle is to stay as lightweight as possible, so only 1 × 1 convolution kernels are used. Each intermediate supervision branch first maps the features to a new feature space through a convolution layer with kernel size 1 × 1 × c (the channel number c equals the output channel number of the corresponding dense context module), then feeds the features to the two sub-branches. The segmentation-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 2 channels, finally outputting a two-channel segmentation-map prediction. The density-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 1 channels, finally outputting a single-channel density-map prediction.
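A PyTorch sketch of one such branch is given below; the ReLU between the two 1×1 convolutions of each sub-branch is an assumption, since the text specifies only the convolution layers.

```python
import torch
import torch.nn as nn

class SupervisionBranch(nn.Module):
    """Shared 1x1 projection, then 1x1-only segmentation and density heads."""
    def __init__(self, c: int):
        super().__init__()
        self.proj = nn.Conv2d(c, c, 1)
        self.seg = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c, 2, 1))  # two-channel segmentation map
        self.den = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(c, 1, 1))  # single-channel density map

    def forward(self, x: torch.Tensor):
        x = self.proj(x)
        return self.seg(x), self.den(x)
```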
Because the backbone network outputs feature maps at 1/32 of the original size after five downsamplings, crowd-positioning information is damaged to some extent. A simple and effective upsampling module is therefore set up so that, through the network's own learning, the resolution is raised step by step and a density map and segmentation map at the original size are finally output. This restores crowd-position information, improves density-map quality, and thereby further improves counting accuracy.
The upsampling module is a stack of three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling factor of 32; it finally outputs a density map and a segmentation map at the original image size. Specifically, Upsample_block_1 consists of a nearest-neighbour interpolation layer (4× upsampling), a convolution layer with kernel size 3 × 3 × 16 and stride 1, a batch normalization layer and a ReLU activation. Upsample_block_2 and Upsample_block_3 share the same structure as Upsample_block_1 except for their parameters: the upsampling factors of the nearest-neighbour interpolation layers are 4, 4 and 2, and the convolution kernel sizes are 3 × 3 × 16, 3 × 3 × 8 and 3 × 3 × 3, respectively. The upsampling module finally outputs a three-channel prediction: the first two channels are the predicted segmentation map, the third channel is the predicted density map, and integrating the density map yields the total count.
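A hedged PyTorch sketch of this decoder follows; the input channel count (128, the last backbone width named above) and the use of BN + ReLU in every block are assumptions.

```python
import torch.nn as nn

def up_block(cin: int, cout: int, scale: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Upsample(scale_factor=scale, mode="nearest"),
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

decoder = nn.Sequential(
    up_block(128, 16, 4),  # Upsample_block_1: x4, kernel 3x3x16
    up_block(16, 8, 4),    # Upsample_block_2: x4, kernel 3x3x8
    up_block(8, 3, 2),     # Upsample_block_3: x2, 2 seg channels + 1 density channel
)                          # overall upsampling factor: 4 * 4 * 2 = 32
```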
This completes the deployment of the whole lightweight network: a single network simultaneously performs the classification task of predicting the segmentation map and the regression task of predicting the density map. Doing so shares network parameters and promotes better semantic-feature learning for both tasks. Meanwhile, the segmentation map provides position-information constraints for the density-map prediction, making the network focus more on head regions while suppressing density-map responses to background regions and body parts, so that counting is more accurate.
In the training process, the invention does not use a pre-trained model and initializes all network parameters with the Xavier method. The mean-squared-error loss function is used to train the density map and the cross-entropy loss function to train the segmentation map; the optimizer is a momentum-based adaptive optimizer, the initial learning rate is set to 0.0001, and training runs for more than 400 iterations in total.
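The stated setup translates into a few lines of PyTorch; reading the "momentum-based adaptive optimizer" as Adam is an assumption, and the one-layer model is only a stand-in for the assembled network.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1))  # stand-in for the full network

def init_xavier(m: nn.Module) -> None:
    if isinstance(m, nn.Conv2d):
        nn.init.xavier_uniform_(m.weight)   # Xavier init, no pre-trained weights
        if m.bias is not None:
            nn.init.zeros_(m.bias)

model.apply(init_xavier)
optimizer = optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 0.0001
```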
The mean square error loss function can be expressed as:
$$L_{MSE} = \frac{1}{2H} \sum_{e=1}^{H} \left\| f(X_e) - D_e \right\|_2^2 \quad (5)$$

where $H$ is the number of training samples, $X_e$ is the $e$-th input image, $D_e$ is the corresponding density-map label, and $f(X_e)$ is the density map predicted from input $X_e$.
The cross entropy loss function can be expressed as:
$$L_{S} = -\frac{1}{M} \sum_{m=1}^{M} \left[ Y_m \log P_m + (1 - Y_m) \log (1 - P_m) \right] \quad (6)$$

where $M$ denotes the total number of pixels, $Y_m$ denotes the $m$-th segmentation-map label, and $P_m$ denotes the probability that the $m$-th pixel of the segmentation map is foreground.
The loss function for the six branches can be expressed as:
$$L_{B} = \sum_{n=1}^{C} \left( L_{MSE}^{n} + L_{S}^{n} \right) \quad (7)$$

where $C$ denotes the total number of branches, and $L_{MSE}^{n}$ and $L_{S}^{n}$ denote the density-map loss and segmentation-map loss of branch $n$, respectively.
The joint loss function can be expressed as:
$$L = L_{MSE} + \alpha L_{S} + \phi L_{B} \quad (8)$$

where $\alpha$ and $\phi$ are coefficients balancing the three losses, set to $\alpha = 0.001$ and $\phi = 0.1$ in the present invention.
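The four losses combine as in the sketch below; the tensor shapes and the use of logits for the segmentation head are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(den_pred, den_gt, seg_logits, seg_gt, branch_losses,
               alpha: float = 0.001, phi: float = 0.1) -> torch.Tensor:
    """den_*: (B,1,H,W) maps; seg_logits: (B,2,H,W); seg_gt: (B,H,W) int64."""
    l_mse = 0.5 * F.mse_loss(den_pred, den_gt)      # Eq. (5), density regression
    l_seg = F.cross_entropy(seg_logits, seg_gt)     # Eq. (6), foreground/background
    l_branch = sum(branch_losses)                   # Eq. (7), six supervision branches
    return l_mse + alpha * l_seg + phi * l_branch   # Eq. (8)
```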
(5) Model prediction: after model training is completed, the trained model parameters are loaded, test data of any size are input, the predicted density map is obtained directly, and the total number of people is obtained by integrating it. Note that only the trained backbone parameters are loaded at this stage and the intermediate supervision branches are inactive, which raises inference speed; combined with the lightweight network structure, real-time prediction is achieved.
The evaluation indicators of the model are mean absolute error (MAE), mean squared error (MSE) and peak signal-to-noise ratio (PSNR). Prediction performance was evaluated on the UCF_CC_50, ShanghaiTech and UCF-QNRF datasets, where the method obtains competitive results. Since the algorithm aims at real-time estimation while balancing accuracy and speed, the comparison algorithms in Tables 1 and 2 were selected accordingly.
Table 1. Prediction-performance comparison of the method proposed by the present invention
(Table 1 is reproduced as an image in the original publication.)
Table 1 shows that on ShanghaiTech PartA the MSE of the proposed algorithm ranks second while all other indicators achieve the best results; the gains are especially large on ShanghaiTech PartB and on UCF-QNRF. PartB: MAE improves by 31.8% and MSE by 36.3%. UCF-QNRF: MAE improves by 16.1% and MSE by 10.4%.
Table 2. Computational-performance comparison of the method proposed by the present invention
(Table 2 is reproduced as an image in the original publication.)
Table 2 shows that on PartA of the public ShanghaiTech dataset the computational performance (model size, parameter count and running time) of the invention is verified: inference takes 44 ms, twice as fast as PCCNet (89 ms), the most accurate lightweight network to date, while counting accuracy also improves. Compared with Cascade-MTL, currently the fastest method, MAE improves by 30%, so the modest speed sacrifice of the proposed method is worthwhile.
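For reference, the three reported indicators can be computed as below; note that crowd-counting work conventionally reports the root of the squared counting error under the name MSE (assumed here), and the PSNR peak value is likewise an assumption.

```python
import numpy as np

def mae(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred, gt):  # root-mean-squared counting error, reported as "MSE"
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def psnr(pred_map, gt_map):
    err = float(np.mean((np.asarray(pred_map) - np.asarray(gt_map)) ** 2))
    peak = max(float(np.max(gt_map)), 1e-8)  # peak value choice is assumed
    return float("inf") if err == 0 else 10.0 * np.log10(peak ** 2 / err)
```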

Claims (8)

1. A lightweight fast crowd counting method based on multiple labels is characterized by comprising the following steps:
step one: preprocessing and enhancing data;
step two: inputting the processed image data into a convolutional neural network, and extracting high-level crowd features through a series of convolution, downsampling and dense-residual-connection operations of a backbone network; in this process, six branches of the network are used for multi-scale intermediate supervision;
step three: feeding the extracted high-level features to an upsampling module, which gradually increases the resolution and then generates a density map and a segmentation map consistent with the original image size;
step four: finally, obtaining the final counting result by integrating the density map over the whole image.
2. The method according to claim 1, wherein in step two the backbone network is composed of four dense context modules, each consisting of convolution layers with 1 × 1 and 3 × 3 kernels, a ReLU activation function, a batch normalization layer and dense residual connections.
3. The method as claimed in claim 1, wherein in step two six head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480] are set according to the head-size statistics of the datasets, and six intermediate supervision branches are correspondingly designed, each containing two independent sub-branches, namely a segmentation-map prediction sub-branch and a density-map prediction sub-branch, with corresponding receptive-field sizes of 39, 71, 143, 287, 575 and 767.
4. The method according to claim 1, wherein in step three an upsampling module built from repeated stacks of nearest-neighbour interpolation and convolution layers is designed, the feature-map resolution is gradually increased, and a full-size density map is finally output.
5. The method as claimed in claim 1, wherein three kinds of labels are designed, comprising a density map approximating head sizes generated with an adaptive Gaussian kernel, a segmentation map containing complete head information generated with a random scaling strategy, and multi-scale labels for the intermediate supervision branches; the density-map-based crowd counting task is converted into a foreground-background segmentation task that assists the regression task of the crowd density map.
6. The method of claim 5, wherein the generation process of the density map is represented as follows:
$$D_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\sigma_i}(x), \qquad \sigma_i = \beta \bar{d}_i \quad (1)$$

where $i$ denotes the $i$-th head index, $N$ denotes the total number of heads, $\delta(x - x_i)$ is an impulse function marking the head position in the image, $G_{\sigma_i}$ is an adaptive Gaussian kernel with standard deviation $\sigma_i$, $\bar{d}_i$ denotes the average of the Euclidean distances from the head to its three nearest neighbouring heads in the image, and $\beta$ is a weight coefficient;
the generation process of the segmentation map is represented as:

$$\tilde{D}_{GT}(x) = \sum_{i=1}^{N} \delta(x - x_i) * G_{\tilde{\sigma}_i}(x), \qquad S_{GT}(j) = \begin{cases} 1, & p_j > 0 \\ 0, & p_j = 0 \end{cases} \quad (2)$$

where $j$ denotes an arbitrary position in the density map, $p_j$ is the pixel value at position $j$ of $\tilde{D}_{GT}$, $\tilde{D}_{GT}$ denotes the density map obtained after introducing the scaling coefficients, $\lambda_i$ denotes the scaling coefficient of the $i$-th head, and the Gaussian standard deviation is $\tilde{\sigma}_i = \lambda_i \sigma_i$;
multi-scale labels: heads in a single image are divided into six groups according to the six set head-size intervals [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; a density map and a segmentation map are regenerated for each group and then downsampled at six scales, the downsampling factors of the groups being 4, 8, 16, 32, 32 in sequence.
7. The method as claimed in claim 1, wherein two training strategies are designed: one is single-path multi-channel prediction through the backbone network and the upsampling module, i.e., a single network outputs the segmentation map and density map simultaneously end to end; the other is multi-path single-channel prediction through the backbone network and the six intermediate supervision branches, i.e., each intermediate supervision branch has two sub-branches, one predicting the density map and one predicting the segmentation map.
8. The method of claim 1, wherein the network model is composed of a backbone convolutional network, an upsampling module and intermediate supervision branches; the processed data are input into the model, and high-level crowd features are extracted by the backbone network; in this process the backbone network uses only 1 × 1 and 3 × 3 convolution layers, ReLU activation functions, batch normalization layers and dense residual connection operations; the extracted features are then fed to the upsampling module, and network training is supervised with the segmentation map and density map;

the maximum receptive-field size of the backbone network is 767, which can fully cover the head regions of the three datasets; the backbone structure is divided into dense context sub-modules Denseblock_1, Denseblock_2, Denseblock_3 and Denseblock_4;

Denseblock_1 is a stack of ten feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 64 (height × width × channels) and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/4 of the input; then nine combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected with one another, with the feature-map size unchanged throughout;

Denseblock_2, Denseblock_3 and Denseblock_4 share an identical structure, each a stack of five feature-extraction blocks: first a combination of a convolution layer with kernel size 3 × 3 × 128 and stride 2, a batch normalization layer and a ReLU activation, whose output size is 1/2 of the input; then four combinations of a convolution layer with kernel size 3 × 3 × 64 and stride 1, a batch normalization layer and a ReLU activation, densely connected, with the feature-map size unchanged; Denseblock_5 is a stack of three repeated feature-extraction blocks, each consisting of a convolution layer with kernel size 3 × 3 × 128 and stride 1, a batch normalization layer and a ReLU activation, also densely connected, with the feature-map size unchanged;
setting the six intermediate supervision branches according to the receptive field calculated by formula (3):

$$l_k = l_{k-1} + (f_k - 1) \prod_{h=1}^{k-1} s_h \quad (3)$$

where $l_k$ denotes the receptive-field size of the $k$-th layer, $f_k$ is the kernel size of the convolution or pooling at the $k$-th layer, and $s_h$ denotes the stride of the $h$-th layer;
the receptive fields of Branch_1 through Branch_6 are 39, 71, 143, 287, 575 and 767, respectively; the head-size intervals detected by the corresponding six branches are set to [3, 15], [15, 30], [30, 60], [60, 120], [120, 240], [240, 480]; with the random-scaling image enhancement strategy (scaling factor 0.7-1.6), even the largest head-interval size of 480 enlarged by the maximum factor of 1.6 gives 768, which just matches the maximum receptive field; statistics show the largest actual head size across the three datasets is 382 × 382, below 480, so the network's receptive fields can fully cover head regions of all sizes, including scaled ones;
the six intermediate supervision branches share a similar structure: each branch contains two sub-branches, a segmentation-map prediction sub-branch and a density-map prediction sub-branch, using only 1 × 1 convolution kernels; each intermediate supervision branch first maps the features to a new feature space through a convolution layer with kernel size 1 × 1 × c, the channel number c equalling the output channel number of the corresponding dense context module, and then feeds the features to the two sub-branches; the segmentation-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 2 channels, finally outputting a two-channel segmentation-map prediction; the density-map prediction sub-branch consists of two 1 × 1 convolution layers with c and 1 channels, finally outputting a single-channel density-map prediction;
because the backbone network outputs feature maps at 1/32 of the original size after five downsamplings, a simple and effective upsampling module is set up and, through the network's own learning, the resolution is raised step by step, finally outputting a density map and a segmentation map at the original size;
the upsampling module is a stack of three sub-modules, each consisting of an upsampling layer and a convolution layer, with an overall upsampling factor of 32, finally outputting a density map and a segmentation map at the original image size; specifically, Upsample_block_1 consists of a nearest-neighbour interpolation layer performing 4× upsampling, a convolution layer with kernel size 3 × 3 × 16 and stride 1, a batch normalization layer and a ReLU activation; Upsample_block_2 and Upsample_block_3 share the same structure as Upsample_block_1 except for their parameters: the upsampling factors of the nearest-neighbour interpolation layers are 4, 4 and 2, and the convolution kernel sizes are 3 × 3 × 16, 3 × 3 × 8 and 3 × 3 × 3, respectively; the upsampling module finally outputs a three-channel prediction, the first two channels being the predicted segmentation map and the third channel the predicted density map, and the total number of people is then obtained by integrating the density map.
CN201911386325.7A 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method Active CN111144329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911386325.7A CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Publications (2)

Publication Number Publication Date
CN111144329A true CN111144329A (en) 2020-05-12
CN111144329B CN111144329B (en) 2023-07-25

Family

ID=70521417

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911386325.7A Active CN111144329B (en) 2019-12-29 2019-12-29 Multi-label-based lightweight rapid crowd counting method

Country Status (1)

Country Link
CN (1) CN111144329B (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Crowd counting method based on coding and decoding-jumping connection scale pyramid network
CN111898578A (en) * 2020-08-10 2020-11-06 腾讯科技(深圳)有限公司 Crowd density acquisition method and device, electronic equipment and computer program
CN111929054A (en) * 2020-07-07 2020-11-13 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111985381A (en) * 2020-08-13 2020-11-24 杭州电子科技大学 Guide area dense crowd counting method based on flexible convolutional neural network
CN112084959A (en) * 2020-09-11 2020-12-15 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112396126A (en) * 2020-12-02 2021-02-23 中山大学 Target detection method and system based on detection of main stem and local feature optimization
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN113887536A (en) * 2021-12-06 2022-01-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Intensive population estimation method based on deep learning
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN106203331A (en) * 2016-07-08 2016-12-07 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937A (en) * 2016-08-31 2017-01-11 郑州金惠计算机系统工程有限公司 Convolutional neural network based crowd density distribution estimation method
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089A (en) * 2017-12-11 2018-06-12 中山大学 A kind of people counting method of head detection and density map based on dimension self-adaption
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110020606A (en) * 2019-03-13 2019-07-16 北京工业大学 A kind of crowd density estimation method based on multiple dimensioned convolutional neural networks
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene
CN110163060A (en) * 2018-11-07 2019-08-23 腾讯科技(深圳)有限公司 The determination method and electronic equipment of crowd density in image

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
US20160315682A1 (en) * 2015-04-24 2016-10-27 The Royal Institution For The Advancement Of Learning / Mcgill University Methods and systems for wireless crowd counting
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Intensive population estimation method based on deep learning
CN106203331A (en) * 2016-07-08 2016-12-07 苏州平江历史街区保护整治有限责任公司 A kind of crowd density evaluation method based on convolutional neural networks
CN106326937A (en) * 2016-08-31 2017-01-11 郑州金惠计算机系统工程有限公司 Convolutional neural network based crowd density distribution estimation method
CN107679503A (en) * 2017-10-12 2018-02-09 中科视拓(北京)科技有限公司 A kind of crowd's counting algorithm based on deep learning
CN107862261A (en) * 2017-10-25 2018-03-30 天津大学 Image people counting method based on multiple dimensioned convolutional neural networks
CN108154089A (en) * 2017-12-11 2018-06-12 中山大学 A kind of people counting method of head detection and density map based on dimension self-adaption
CN108549835A (en) * 2018-03-08 2018-09-18 深圳市深网视界科技有限公司 Crowd counts and its method, terminal device and the storage medium of model construction
CN108985256A (en) * 2018-08-01 2018-12-11 曜科智能科技(上海)有限公司 Based on the multiple neural network demographic method of scene Density Distribution, system, medium, terminal
CN110163060A (en) * 2018-11-07 2019-08-23 腾讯科技(深圳)有限公司 The determination method and electronic equipment of crowd density in image
CN110020606A (en) * 2019-03-13 2019-07-16 北京工业大学 A kind of crowd density estimation method based on multiple dimensioned convolutional neural networks
CN110059581A (en) * 2019-03-28 2019-07-26 常熟理工学院 People counting method based on depth information of scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YONGSANG YOON et al.: "Conditional Marked Point Process-based Crowd Counting in Sparsely and Moderately Crowded Scenes", 2016 International Conference on Control, Automation and Information Sciences (ICCAIS)
CHEN Hanqi et al.: "High-density crowd pedestrian counting using the STLK algorithm", Journal of Changchun University of Science and Technology (Natural Science Edition)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709290A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Crowd counting method based on coding and decoding-jumping connection scale pyramid network
CN111709290B (en) * 2020-05-18 2023-07-14 杭州电子科技大学 Crowd counting method based on coding and decoding-jump connection scale pyramid network
WO2021237727A1 (en) * 2020-05-29 2021-12-02 Siemens Aktiengesellschaft Method and apparatus of image processing
CN111929054A (en) * 2020-07-07 2020-11-13 中国矿业大学 PRVFLN-based pneumatic regulating valve concurrent fault diagnosis method
CN111898578A (en) * 2020-08-10 2020-11-06 腾讯科技(深圳)有限公司 Crowd density acquisition method and device, electronic equipment and computer program
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN111985381A (en) * 2020-08-13 2020-11-24 杭州电子科技大学 Guide area dense crowd counting method based on flexible convolutional neural network
CN111985381B (en) * 2020-08-13 2022-09-09 杭州电子科技大学 Guidance area dense crowd counting method based on flexible convolution neural network
CN112101164A (en) * 2020-09-06 2020-12-18 西北工业大学 Lightweight crowd counting method based on full convolution network
CN112084959A (en) * 2020-09-11 2020-12-15 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112084959B (en) * 2020-09-11 2024-04-16 腾讯科技(深圳)有限公司 Crowd image processing method and device
CN112418120B (en) * 2020-11-27 2021-09-28 湖南师范大学 Crowd detection method based on peak confidence map
CN112418120A (en) * 2020-11-27 2021-02-26 湖南师范大学 Crowd detection method based on peak confidence map
CN112396126A (en) * 2020-12-02 2021-02-23 中山大学 Target detection method and system based on detection of main stem and local feature optimization
CN112396126B (en) * 2020-12-02 2023-09-22 中山大学 Target detection method and system based on detection trunk and local feature optimization
CN112597985A (en) * 2021-03-04 2021-04-02 成都西交智汇大数据科技有限公司 Crowd counting method based on multi-scale feature fusion
CN113033638A (en) * 2021-03-16 2021-06-25 苏州海宸威视智能科技有限公司 Anchor-free frame target detection method based on receptive field perception
CN113327233A (en) * 2021-05-28 2021-08-31 北京理工大学重庆创新中心 Cell image detection method based on transfer learning
CN113887536A (en) * 2021-12-06 2022-01-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance
CN113887536B (en) * 2021-12-06 2022-03-04 松立控股集团股份有限公司 Multi-stage efficient crowd density estimation method based on high-level semantic guidance

Also Published As

Publication number Publication date
CN111144329B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN111144329A (en) Light-weight rapid crowd counting method based on multiple labels
CN110322446B (en) Domain self-adaptive semantic segmentation method based on similarity space alignment
CN111639692B (en) Shadow detection method based on attention mechanism
CN110717851B (en) Image processing method and device, training method of neural network and storage medium
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN112507777A (en) Optical remote sensing image ship detection and segmentation method based on deep learning
CN112348036A (en) Self-adaptive target detection method based on lightweight residual learning and deconvolution cascade
CN110569851B (en) Real-time semantic segmentation method for gated multi-layer fusion
CN110689599A (en) 3D visual saliency prediction method for generating countermeasure network based on non-local enhancement
CN110321805B (en) Dynamic expression recognition method based on time sequence relation reasoning
CN114898284B (en) Crowd counting method based on feature pyramid local difference attention mechanism
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN116152591B (en) Model training method, infrared small target detection method and device and electronic equipment
CN113095254A (en) Method and system for positioning key points of human body part
CN110956222A (en) Method for detecting network for underwater target detection
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN114972851B (en) Ship target intelligent detection method based on remote sensing image
CN116246109A (en) Multi-scale hole neighborhood attention computing backbone network model and application thereof
CN114550047B (en) Behavior rate guided video behavior recognition method
CN116051850A (en) Neural network target detection method, device, medium and embedded electronic equipment
CN116311349A (en) Human body key point detection method based on lightweight neural network
CN115587628A (en) Deep convolutional neural network lightweight method
CN111832336B (en) Improved C3D video behavior detection method
CN111639563B (en) Basketball video event and target online detection method based on multitasking
CN111489361B (en) Real-time visual target tracking method based on deep feature aggregation of twin network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant