CN113780092A - Crowd counting method based on block weak labeling - Google Patents
- Publication number: CN113780092A
- Application number: CN202110930559.4A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/241 — Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06T3/40 — Geometric image transformations in the plane of the image; scaling of whole images or parts thereof, e.g. expanding or contracting
Abstract
The invention discloses a crowd counting method based on block weak-labeling information, comprising a training stage and a testing stage. In the training stage, block-level counts are predicted by a CPNC network, and label smoothing, feature smoothing, several data enhancement strategies and an auxiliary loss function are applied. These address the long-tail distribution of per-block counts and inaccurate region prediction; performance similar to that of density-map methods is obtained while using far less annotation information, together with good transferability.
Description
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a crowd counting method based on block weak labeling.
Background
Crowd counting is an important computer vision task whose goal is to count the number of people appearing in an image. In recent years, this task has played an increasingly important role in security monitoring, traffic analysis in public places, and similar applications. Unlike the object detection task, which uses bounding boxes, the current mainstream crowd counting methods mainly use density maps as the learning target. A density map is generated by convolving the point-annotation map with a symmetric probability density function, and therefore has the same L1 norm as the point-annotation map. Compared with the point-annotation map, the density map varies more continuously, which makes it easier for a network to learn. However, current density-map methods have the following problems:
Generating the density map depends on point annotation, and annotating targets one by one is costly at high density. Moreover, the size of the probability density kernel used to generate the density map should ideally depend on the scale of the target; annotating scale as well would further increase labeling cost. Meanwhile, the density map cannot avoid the annotation noise that frequently arises when targets are dense.
Predicting a density map requires preserving the size of the feature map throughout the network to maintain the map's resolution. On the density map, one pixel corresponds to at most one target; if every pixel corresponded to a whole target, the density map would degenerate into a point-annotation map. Only a small amount of down-sampling can therefore be performed in the network, which increases the consumption of computing resources.
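For context, the density-map formulation described above can be sketched in a few lines of NumPy. This toy example (not from the patent; point coordinates and kernel width are assumptions) places a normalized Gaussian at each annotated head point, so the map sums to the person count, the L1-norm property mentioned above:

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Place a normalized Gaussian at each annotated head point; the
    resulting map sums to the person count (same L1 norm as the
    point-annotation map)."""
    h, w = shape
    dmap = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g /= g.sum()              # each person contributes exactly 1
        dmap += g
    return dmap

dm = density_map([(10, 12), (30, 8), (50, 40)], (64, 64))  # (x, y) heads
print(round(dm.sum(), 6))         # 3.0, the number of people
```

Normalizing each kernel after border truncation is what keeps the sum equal to the count, which is exactly the property the patent's block-count labels preserve without needing point positions.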
Therefore, a method for counting people based on images is needed, which reduces the labeling cost and the noise influence, thereby satisfying the current use requirement for counting people based on visual images.
Disclosure of Invention
To solve these problems, the invention provides a crowd counting method based on block weak labeling. The method divides the crowd image to be counted into several blocks to form a block-count map; because the block-count map needs no precise position information during annotation, labeling cost is reduced. The network structure CPNC takes the crowd image as input and outputs the number of people in each block. A smoothing strategy, a data enhancement strategy and an auxiliary loss function are introduced respectively, so that performance similar to that of density-map methods is obtained with less annotation information, together with good transferability, thereby completing the invention.
The invention aims to provide a crowd counting method based on block weak labeling.
The training phase performs block prediction through a CPNC network, where CPNC is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC).
The CPNC network comprises a feature extraction network, a bottleneck network and a prediction head.
The feature extraction network reduces the size of a training picture using a Focus module to obtain a reduced-size feature map. The bottleneck network uses the cross-stage partial component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the cross-stage partial component splits the feature channels into two halves: one half continues to extract deeper features through a branch bottleneck network, while the other half passes only through a low-complexity convolution transform, and the two results are combined. Preferably, the cross-stage partial component is as shown in formula (1), where g is the branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity.
f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half; and [·, ·] denotes concatenation.
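A minimal NumPy sketch of the channel-split idea in formula (1); the stand-in functions for the deep branch g and the cheap transform h are assumptions for illustration, not the patent's actual networks:

```python
import numpy as np

def csp_layer(f, g, h):
    """One cross-stage partial layer, as in formula (1): split the
    channels in half, send one half through the expensive branch g and
    the other through the cheap transform h, then concatenate."""
    n = f.shape[0]                       # channel dimension
    deep = g(f[: n // 2])                # high-complexity branch
    cheap = h(f[n // 2:])                # low-complexity branch
    return np.concatenate([deep, cheap], axis=0)

# toy stand-ins for g and h (assumptions, not the patent's networks)
g = lambda x: np.tanh(x) * 2.0
h = lambda x: x * 0.5
f_prev = np.ones((8, 4))                 # 8 channels, 4 spatial positions
f_next = csp_layer(f_prev, g, h)
print(f_next.shape)                      # (8, 4): channel count preserved
```

The design point is that only half the channels pay for the deep branch, which is where the complexity reduction comes from.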
The prediction head adopts the Bi-FPN network from EfficientDet.
In the invention, with a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 9–21; n′ is a person count inside the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².

In the present invention, standard whitening and recoloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i: using a Gaussian kernel to weight the distances between the person counts of samples, the mean μ_i and covariance Σ_i of the current sample's features are used to compute the corresponding smoothed values μ*_i and Σ*_i, with which the whitened feature is recolored. The Gaussian kernel weighting is shown in formula (3):

μ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}    (3)

where y_i and y_{i′} are the person counts of the i-th and i′-th images respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
On the network design, the invention introduces a Bi-FPN module and uses Mosaic data enhancement to increase the number of small-size targets.
In the training process, blocks of the data are randomly erased or rearranged, and the features of the blocks at the rearranged positions are compared with the features of the original blocks. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the auxiliary loss on the unlabeled part, and is defined as shown in formula (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)
where x is an input image sample; y and y′ are the true and predicted person counts of the image, respectively; f(x) is the corresponding feature layer; and x_h is the image after x has been transformed by h. The transform h includes random block erasure and rearrangement (GD) and block feature scaling and aggregation (GS); λ is a balance coefficient. The prediction error uses the SmoothL1(c) loss, which keeps the gradient from becoming too large at the start of training; it is defined as shown in formula (5):

SmoothL1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise    (5)
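The loss in formula (4) can be illustrated with a small NumPy sketch. The standard Smooth L1 definition is assumed for formula (5), and `total_loss` with its toy inputs is an illustrative name, not the patent's implementation:

```python
import numpy as np

def smooth_l1(c):
    """Smooth L1 (the standard definition we assume for formula (5)):
    quadratic near zero so early-training gradients stay bounded,
    linear for large errors."""
    c = np.abs(c)
    return np.where(c < 1.0, 0.5 * c ** 2, c - 0.5)

def total_loss(y, y_pred, feat, feat_restored, lam=0.1):
    """Formula (4) in miniature: labeled prediction error plus the
    unlabeled auxiliary MSE between features before and after the
    transform h is undone (`total_loss` is an illustrative name)."""
    aux = np.mean((feat - feat_restored) ** 2)
    return smooth_l1(y - y_pred).mean() + lam * aux

y_true, y_hat = np.array([3.0, 10.0]), np.array([2.5, 14.0])
feat = np.zeros(8)        # toy feature layer f(x)
feat_r = np.zeros(8)      # toy restored feature h^-1 f(x_h)
print(total_loss(y_true, y_hat, feat, feat_r))  # 1.8125
```

The small error (0.5) falls on the quadratic branch and the large one (4.0) on the linear branch, which is what keeps early gradients bounded.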
In the testing stage, the trained CPNC network is applied to a crowd counting detection task to verify the effectiveness of the model.
The crowd counting method based on block weak labeling has the following beneficial effects:
(1) The invention designs a lightweight network structure, CPNC, which takes a picture as input and outputs the number of people in each block. By analyzing the long-tail distribution problem and the small-target problem on several datasets, a smoothing strategy and a data enhancement strategy are introduced respectively, and an auxiliary loss function is further introduced, reducing labeling cost and yielding a crowd counting method that needs only partial block annotation information.
(2) In the invention, a Gaussian function is used as the radial basis function to smooth B_n by convolution; after smoothing, the negative correlation between the smoothed B*_n and the error Er_n is significantly improved.
(3) In the invention, standard whitening and recoloring are introduced on the three feature layers input to the Bi-FPN to smooth the output features and make them consistent, so that the network model attends to blocks of all densities in a more balanced way instead of overfitting, according to the data distribution of the dataset, the blocks whose densities occur most frequently; the model therefore transfers better.
(4) The data enhancement strategy introduced by the invention effectively improves prediction accuracy on small-size targets and makes the target sizes in the enhanced images vary continuously.
(5) In the invention, an auxiliary loss function is constructed to mine the supervision information in unlabeled data, further reducing labeling cost.
Drawings
Fig. 1 shows a schematic diagram of a CPNC network architecture according to the present invention;
FIG. 2 illustrates an example of random block erasure and block rearrangement in the auxiliary loss function according to the present invention;
fig. 3 shows the application test results of CPNC and CPNC + + on UCF-QNRF in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from the description.
The invention provides a crowd counting method based on block weak labeling.
The training phase performs block prediction through a CPNC network, where CPNC is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC), specifically described in the literature "WANG C Y, MARK LIAO H Y, WU Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2020: 1571-".
The CPNC network includes a feature extraction network, a bottleneck network, and a prediction head, and a network structure thereof is shown in fig. 1.
The feature extraction network reduces the size of a training picture using a Focus module to obtain a reduced-size feature map. The bottleneck network uses the cross-stage partial component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the cross-stage partial component splits the feature channels into two halves: one half continues to extract deeper features through a branch bottleneck network, while the other half passes only through a low-complexity convolution transform, and the two results are combined. Preferably, the cross-stage partial component is as shown in formula (1), where g is the branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity.

f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; and f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half.
The Focus module is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo, 2020".
The h is a 1 × 1 convolution module with low computational complexity, as described in the literature "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]// Advances in Neural Information Processing Systems. 2012".
According to different requirements, the branch bottleneck network can adopt network structures of different depths or complexities, such as ResNet, ResNeXt, NFNet and the like.
The ResNet is specifically described in the literature "HE K, ZHANG X, REN S, et al. deep residual learning for image Recognition [ C ]//2016IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2016: 770-778".
The ResNeXt is described in particular in the document "XIE S, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks [C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017".
The NFNet is described in detail in "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171".
The prediction head adopts the Bi-FPN network from EfficientDet. Unlike detection tasks, the targets in the crowd counting task are small, and small targets are not easily distinguished by high-level features. Meanwhile, to reduce network complexity, the Bi-FPN feature layers are set to 3–5 layers, preferably 3 layers, which enhances small-target recognition and yields the block-count map.
The EfficientDet and Bi-FPN networks are described in particular in the literature "TAN M, LE Q V. Efficientnetv2: Smaller models and faster training [J/OL]. CoRR, 2021, abs/2104.00298".
In image processing for crowd counting, label imbalance easily biases the model: statistics over datasets show that the number of blocks drops rapidly as density increases, presenting a long-tail distribution. In addition, as density increases, the average size of the targets within a block becomes smaller, increasing the difficulty of counting. Together, these make prediction errors larger for high-density blocks; to address this model bias, a sample-balancing strategy is designed to counter the influence of the long-tail distribution.
In the invention, with a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 9–21, preferably 12–18; n′ is a person count inside the window; N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ²; and B*_n is greater than 0.

After the label-smoothing strategy, the negative correlation between B*_n and Er_n, the average error of all blocks whose person count is exactly n, is significantly improved. Experiments show that on NWPU-Crowd the Pearson correlation between B*_n and Er_n is −0.72, on UCF-QNRF it is −0.79, and all B*_n are greater than 0.

Here k_i denotes the number of people in the i-th block, n is a person count, er_i is the prediction error of the i-th block whose person count is k_i, and Er_n is the average of er_i over all blocks with k_i = n.
The UCF-QNRF is specifically described in the document "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds [C]// European Conference on Computer Vision (ECCV). 2018: 544-".
The NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-".
Meanwhile, to make the output features consistent, the invention introduces standard whitening and recoloring on the feature-layer input of the Bi-FPN to smooth the output feature z_i: using a Gaussian kernel to weight the distances between the person counts of samples, the mean μ_i and covariance Σ_i of the current sample's features are used to compute the corresponding smoothed values μ*_i and Σ*_i, with which the whitened feature is recolored. The Gaussian kernel weighting is shown in formula (3):

μ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}    (3)

where y_i and y_{i′} are the person counts of the i-th and i′-th images respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
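The whitening-and-recoloring step can be sketched as follows. This is an illustrative reading of formula (3) that uses per-bin scalar variances instead of full covariances, and the row normalization of the kernel is an added assumption for numerical stability:

```python
import numpy as np

def smoothed_stats(labels, mus, varis, sigma=2.0):
    """Kernel-smooth per-label-bin feature statistics (our reading of
    formula (3)); rows are normalized, which is an added assumption."""
    d = labels[:, None] - labels[None, :]
    k = np.exp(-d ** 2 / (2 * sigma ** 2))
    k /= k.sum(axis=1, keepdims=True)
    return k @ mus, k @ varis

def fds(z, mu, var, mu_s, var_s, eps=1e-8):
    """Whiten a feature with its own bin statistics, then recolor with
    the smoothed statistics (scalar variance instead of covariance)."""
    return np.sqrt(var_s / (var + eps)) * (z - mu) + mu_s

labels = np.array([0.0, 1.0, 2.0, 10.0])   # per-bin person counts
mus = np.array([0.0, 0.2, 0.4, 5.0])       # per-bin feature means
varis = np.array([1.0, 1.0, 1.0, 4.0])     # per-bin feature variances
mu_s, var_s = smoothed_stats(labels, mus, varis)
print(round(float(fds(1.5, mus[1], varis[1], mu_s[1], var_s[1])), 3))  # 1.5
```

Bins with nearby person counts pull each other's statistics together, while the distant bin (count 10) contributes almost nothing; features in well-populated, similar bins are therefore changed only slightly.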
Through these two smoothing strategies, the model attends to blocks of all densities in a more balanced way and does not, following the dataset's data distribution, overfit the blocks whose densities occur most frequently; the model therefore transfers better.
To address the small-target problem, the invention introduces the Bi-FPN network in the network design and uses Mosaic data enhancement to increase the number of small-size targets. Meanwhile, to further reduce labeling cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function for effective training.
Said Mosaic is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo,2020.https:// doi.org/10.5281/zenodo.4154370".
During training, several pictures are combined using the Mosaic algorithm: preferably, several times the batch size of pictures are sampled and randomly divided into batch-size groups. Each time an enhanced picture is generated, the number n_i of real pictures in the i-th group is used (i = 1, 2, …). The canvas of the enhanced picture undergoes n_i − 1 divisions into n_i regions, the j-th region obtained from the i-th group being denoted R_i^j. At the a-th division of the i-th enhanced picture, the largest region is selected from {R_i^j} and divided into two parts, where a is an integer with 1 ≤ a ≤ n_i − 1; the divisions are performed alternately in horizontal and vertical order. Finally, the real pictures in the i-th group are sorted in ascending order of person count, giving the real picture set; the regions {R_i^j} are sorted in ascending order of area; and the picture ranked at the j-th position (j = 1, 2, …, n_i) is scaled to fit the region ranked at the j-th position.
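The largest-region split with alternating horizontal and vertical cuts can be sketched as follows (a simplified reading; the region bookkeeping and the even split point are assumptions):

```python
def partition(w, h, n_parts):
    """Split a canvas into n_parts rectangles: repeatedly take the
    largest region and cut it in two, alternating horizontal and
    vertical cuts (a simplified reading of the Mosaic grouping above)."""
    regions = [(0, 0, w, h)]                 # (x, y, width, height)
    for a in range(n_parts - 1):
        regions.sort(key=lambda r: r[2] * r[3], reverse=True)
        x, y, rw, rh = regions.pop(0)
        if a % 2 == 0:                       # horizontal cut
            cut = rh // 2
            regions += [(x, y, rw, cut), (x, y + cut, rw, rh - cut)]
        else:                                # vertical cut
            cut = rw // 2
            regions += [(x, y, cut, rh), (x + cut, y, rw - cut, rh)]
    return regions

regs = partition(640, 640, 4)
print(len(regs), sum(r[2] * r[3] for r in regs))  # 4 409600
```

In the patent's scheme, the n_i real pictures sorted ascending by person count would then be scaled into these regions sorted ascending by area.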
To further reduce labeling cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function to train the network effectively. In the training process, blocks of the data are randomly erased or rearranged, and the features of the blocks at the rearranged positions are compared with the features of the original blocks. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the auxiliary loss on the unlabeled part, and is defined as shown in formula (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)
where x is an input image sample; y and y′ are the true and predicted person counts of the image, respectively; f(x) is the corresponding feature layer; and x_h is the image after x has been transformed by h. The transform h includes random block erasure and rearrangement (GD) and block feature scaling and aggregation (GS); λ is a balance coefficient, as described in the literature "LIU X, VAN DE WEIJER J, BAGDANOV A D. Leveraging unlabeled data for crowd counting by learning to rank [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 7661-". The prediction error uses the SmoothL1(c) loss, which keeps the gradient from becoming too large at the start of training; it is defined as shown in formula (5):
SmoothL1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise    (5)

where c is the independent variable.
In the testing stage, the trained CPNC network is applied to a crowd counting detection task to verify the effectiveness of the model.
The invention provides a crowd counting method based on block weak-labeling information that trains an improved CPNC network and does not depend on precise position information, so labeling cost is low. The invention provides several improvement strategies, including label smoothing, feature smoothing, multiple data enhancement strategies and an auxiliary loss function; these address the long-tail effect of per-block counts and inaccurate region prediction, achieve performance similar to that of density-map methods while using less annotation information, and show good transferability.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
The CPNC network using the data enhancement and auxiliary loss functions is denoted CPNC++.
In the CPNC network:
(1) firstly, the Focus module is used for reducing the size of a training picture to obtain a feature map with reduced size. Focus modules are described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo, 2020".
The reduced-size feature map is input to the cross-stage partial component of CSPNet, which serves as the bottleneck network and proceeds as formula (1):

f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where g is the branch bottleneck network NFNet-F3 with high computational complexity, specifically described in the document "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171. https://arxiv.org/abs/2102.06171"; and h is a low-complexity convolution module with a 1 × 1 kernel, whose convolution operation is described in the document "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]// Advances in Neural Information Processing Systems. 2012".

n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; and f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half.
(2) The output of the bottleneck network is input to the Bi-FPN network of EfficientDet, which serves as the prediction head, to obtain the output result.
The EfficientDet and Bi-FPN networks are described in particular in the literature "TAN M, LE Q V. Efficientnetv2: Smaller models and faster training [J/OL]. CoRR, 2021, abs/2104.00298".
Using a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed (LDS) by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 15; n′ is a person count inside the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².

Here k_i denotes the number of people in the i-th block, n is a person count, er_i is the prediction error of the i-th block, and Er_n is the average of er_i over all blocks with k_i = n.

On NWPU-Crowd, the Pearson correlation between B*_n and Er_n is −0.72, and on UCF-QNRF it is −0.79; by contrast, the correlation between the unsmoothed B_n and Er_n is only −0.10 on NWPU-Crowd and −0.11 on UCF-QNRF, so the negative correlation between B*_n and Er_n is significantly improved.
(3) In addition, standard whitening and recoloring are introduced on the feature layer input of the Bi-FPN to smooth the output feature ziPerforming feature smoothing (FDS) with a smoothing value ofWeighting the distance between the population values of the samples by using a Gaussian kernel function to obtain the mean value mu of the characteristics of the current sampleiSum covariance ∑iCalculate a corresponding smoothing value ofAndthe Gaussian kernel function is shown as formula (3):
where y_i and y_{i′} are the people counts of the i-th and i′-th images, respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
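A simplified FDS statistics sketch follows. It replaces the full covariance Σ_i with a per-sample scalar variance for brevity, and `fds_statistics` and `sigma` are illustrative assumptions:

```python
import numpy as np

def fds_statistics(features, y, sigma=2.0):
    """FDS sketch (Eq. (3)): for each sample i, re-estimate the feature
    mean mu_i and (here, scalar) variance Sigma_i as a Gaussian-kernel-
    weighted average over samples with nearby people counts y_i'."""
    mu = features.mean(axis=1)    # per-sample feature mean
    var = features.var(axis=1)    # per-sample variance (covariance simplified)
    y = np.asarray(y, dtype=float)
    smoothed_mu, smoothed_var = [], []
    for yi in y:
        k = np.exp(-(yi - y) ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
        k /= k.sum()
        smoothed_mu.append(k @ mu)
        smoothed_var.append(k @ var)
    return np.array(smoothed_mu), np.array(smoothed_var)

feats = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
y = [2, 3, 50]                    # people counts of the three images
mu_s, var_s = fds_statistics(feats, y)
# the isolated sample (count 50) keeps its own statistics; nearby
# samples (counts 2 and 3) pull each other's means together
```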
In the CPNC++ network, the following is added on the basis of the CPNC network:

(1) Multiple pictures are combined using a Mosaic algorithm: 4 × batch-size pictures are randomly divided into batch-size groups. Each time an enhanced picture is generated, the number n_i of real pictures in the i-th group is used (i = 1, 2, 3, 4): the enhanced picture is divided by n_i − 1 splits into n_i regions, where r_i^j denotes the j-th region obtained for the i-th group of pictures. For the a-th split, where a is an integer with 1 ≤ a ≤ n_i − 1, the largest of the current regions is selected and divided into two parts; the splits are performed alternately in horizontal and vertical order. Finally, the real pictures in the i-th group are sorted in ascending order of people count, the regions r_i^j are sorted in ascending order of area, and the picture ranked at the j-th position (j = 1, 2, …, n_i) is scaled to fit the region ranked at the j-th position.
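The region-splitting step above can be sketched as follows. Whether the alternation starts with a horizontal split, and the name `split_regions`, are illustrative assumptions:

```python
def split_regions(width, height, n):
    """Mosaic layout sketch: perform n-1 alternating horizontal/vertical
    splits, always bisecting the currently largest region, yielding n
    regions (x, y, w, h) that tile the canvas."""
    regions = [(0, 0, width, height)]
    for a in range(n - 1):
        # pick the largest current region
        idx = max(range(len(regions)), key=lambda j: regions[j][2] * regions[j][3])
        x, y, w, h = regions.pop(idx)
        if a % 2 == 0:   # assumed order: horizontal split first, then vertical
            regions += [(x, y, w, h // 2), (x, y + h // 2, w, h - h // 2)]
        else:
            regions += [(x, y, w // 2, h), (x + w // 2, y, w - w // 2, h)]
    return regions

regions = split_regions(1024, 1024, 4)   # 3 splits -> 4 tiling regions
```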
(2) During training, blocks of the data are randomly erased or have their positions rearranged, and the features of the block at the corresponding position after the operation are compared with the features of the original block. In addition, four adjacent blocks can be aggregated into one block by scaling; the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes are supervised with MSE as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the label-free auxiliary loss, and is defined in Equation (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹(f(x_h)))²   (4)
where x is an input image sample, y and y′ are the true and predicted people counts of the input image, f(x) is the corresponding feature layer, and x_h is the image after the transform h. h comprises random block erasing and rearrangement (GD) and block feature scaling and aggregation (GS). The erasing operation follows the literature "Pathak D, Krähenbühl P, Donahue J, et al. Context Encoders: Feature Learning by Inpainting [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2536-2544"; the rearrangement operation follows "Noroozi M, Favaro P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles [C]// European Conference on Computer Vision. 2016: 69-84"; the scaling and aggregation operations follow "Noroozi M, Pirsiavash H, Favaro P. Representation Learning by Learning to Count [C]// International Conference on Computer Vision. 2017: 5898-5906". λ is 0.0001.
The prediction error uses the SmoothL1(c) loss, which ensures the gradient is not too large at the beginning of training. It is defined in Equation (5):

SmoothL1(c) = 0.5c² if |c| < 1; |c| − 0.5 otherwise   (5)

where c is the argument.
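Equations (4) and (5) can be sketched together in NumPy. The function names and the toy feature arrays are illustrative; in the patent, f(x) and h⁻¹(f(x_h)) are network feature maps:

```python
import numpy as np

def smooth_l1(c):
    """Smooth L1 (Eq. (5)): quadratic near zero, linear beyond |c| = 1,
    which keeps early-training gradients bounded."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) < 1, 0.5 * c ** 2, np.abs(c) - 0.5)

def total_loss(y, y_pred, f_x, f_x_restored, lam=1e-4):
    """Eq. (4) sketch: labeled prediction error plus the unlabeled
    auxiliary term comparing the original features f(x) with the
    inverse-transformed features h^-1(f(x_h))."""
    return smooth_l1(y - y_pred).sum() + lam * np.sum((f_x - f_x_restored) ** 2)
```

With identical features the auxiliary term vanishes, so `total_loss` reduces to the Smooth L1 counting error.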
The CPNC network and the CPNC++ network were evaluated on three public open-view dense-crowd datasets, ShanghaiTech, UCF-QNRF, and NWPU-Crowd, and compared with existing density-map-based methods. For a block ((w₁, h₁), (w₂, h₂)), a point-annotation dataset can be converted into a block-annotation dataset by Equation (11):

B((w₁, h₁), (w₂, h₂)) = Σ_{w=w₁}^{w₂} Σ_{h=h₁}^{h₂} Y(w, h)   (11)
where h₁ and h₂ are the ordinates of the upper-left and lower-right corners of the block, respectively, w₁ is the abscissa of the upper-left corner, w₂ is the abscissa of the lower-right corner, Y is the point label information (point label values), and Y(w, h) is the point label value at the position with coordinates (w, h).
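The conversion of Equation (11) is a simple sum over the block. In this sketch the annotation map is assumed to be a 2-D array indexed as `Y[h, w]` with a 1 at each annotated head; `block_count` is an illustrative name:

```python
import numpy as np

def block_count(Y, w1, h1, w2, h2):
    """Eq. (11) sketch: the weak label of block ((w1, h1), (w2, h2)) is
    the sum of the point-annotation map Y over that block."""
    return int(Y[h1:h2, w1:w2].sum())

Y = np.zeros((8, 8), dtype=int)
Y[1, 1] = 1     # one head inside the upper-left block
Y[6, 6] = 1     # one head outside it
n = block_count(Y, 0, 0, 4, 4)
```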
Unless otherwise specified, the experiments were performed on RTX 3090 GPUs with an input size of 1024 × 1024 and a batch size of 16 per GPU, using synchronized batch normalization during training. Training runs for 500 epochs with Adam as the optimizer and a fixed learning rate of 10⁻⁵. The trigger probability of each data enhancement is 0.3. During testing, if the picture size exceeds the training input size, predictions are averaged with an overlapping sliding window of size 1024 × 1024 and overlap ratio 0.25. If the picture is smaller than the training input size, its edges are zero-padded to a multiple of 64. Unless stated otherwise, the network backbone is NFNet-f3; with this backbone each inference takes only 0.06 seconds, whereas DM-Count takes 0.15 seconds on the same machine.
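The sliding-window placement along one axis can be sketched as below; `window_starts` is an illustrative name, and clamping the last window to the border is an assumption about how full coverage is achieved:

```python
def window_starts(length, win=1024, overlap=0.25):
    """Sketch of overlapping-sliding-window inference: generate window
    start offsets along one axis with the given overlap ratio, clamping
    the last window so it ends exactly at the image border. Images
    smaller than the window are assumed to be handled by zero-padding."""
    stride = int(win * (1 - overlap))
    starts = list(range(0, max(length - win, 0) + 1, stride))
    if starts[-1] + win < length:    # ensure the far border is covered
        starts.append(length - win)
    return starts

starts = window_starts(2000)         # e.g. a 2000-pixel-wide image
```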
The ShanghaiTech dataset is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
The UCF-QNRF dataset is described in detail in the literature "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds [C]// European Conference on Computer Vision (ECCV). 2018: 544-".
The NWPU-Crowd dataset is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149".
The NFNet-f3 backbone is described in detail in the literature "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171".
DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting [C]// Advances in Neural Information Processing Systems. 2020".
In addition, to demonstrate that the CPNC++ network can effectively utilize unlabeled data, 30% of the blocks and their people-count labels were randomly selected from the training data as supervision, the remaining 70% of the data were used as unlabeled data, and the resulting model is denoted CPNC++ (30%).
TABLE 1 Comparison of the CPNC network, CPNC++ network, and CPNC++ (30%) model of the present invention with existing methods on the open datasets
MCNN is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
SCNN is described in detail in the literature "SAM D B, SURYA S, BABU R V. Switching convolutional neural network for crowd counting [C]// IEEE Conference on Computer Vision and Pattern Recognition: 2017-January. 2017".
IG-NN is described in detail in the literature "SAM D B, SAJJAN N, BABU R V, et al. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3618-3626".
CSRNet is described in detail in the literature "LI Y, ZHANG X, CHEN D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 1091-1100".
SFCN-101 is described in detail in the literature "WANG Q, GAO J, LIN W, et al. Learning from synthetic data for crowd counting in the wild [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2019".
CAN is described in detail in the literature "LIU W, SALZMANN M, FUA P. Context-aware crowd counting [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2019: 5094-5103".
DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting [C]// Advances in Neural Information Processing Systems. 2020".
SDCNet is described in detail in the literature "XIONG H, LU H, LIU C, et al. From open set to closed set: Counting objects by spatial divide-and-conquer [C]// 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019: 8361-8370".
The mean absolute error (MAE) and mean squared error (MSE) are used as evaluation metrics, defined in Equations (12) and (13):

MAE = (1/N) Σ_{i=1}^{N} |C_i − C_i^{GT}|   (12)

MSE = √( (1/N) Σ_{i=1}^{N} (C_i − C_i^{GT})² )   (13)

where N is the total number of pictures, C_i is the predicted count, and C_i^{GT} is the true count.
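The two metrics can be computed directly; note that the crowd-counting literature calls the root of the mean squared error "MSE", which this sketch follows:

```python
import numpy as np

def mae(pred, gt):
    """Eq. (12): mean absolute counting error over N test images."""
    return np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float)))

def mse(pred, gt):
    """Eq. (13): root mean squared counting error (conventionally
    reported as 'MSE' in crowd counting)."""
    return np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(gt, float)) ** 2))

m1 = mae([10, 20], [12, 26])   # mean of |−2| and |−6|
m2 = mse([10, 20], [12, 26])   # sqrt of mean of 4 and 36
```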
The ShanghaiTech dataset consists of two parts, STA and STB; STA is more densely populated and harder than STB. The official training/test split was used in the experiments, as described in the literature "ZHANG Y, ZHOU D, CHEN S, et al." As the experimental results in Table 1 show, the MAE and MSE of CPNC++ on STA are 20.9% and 20.4% lower, respectively, than those of CPNC. Compared with density-map-based methods, CPNC++ achieves similar performance on both STA and STB while using less supervision information.
The ShanghaiTech dataset, STA, and STB are described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
NWPU-Crowd is a large dataset of 5109 high-resolution pictures; its training, validation, and test sets contain 3109, 500, and 1000 pictures, respectively. The dataset also provides bounding-box labels, from which density-map-based crowd counting methods can estimate more accurate Gaussian kernel sizes, but this information is not used here in training or testing. The performance comparison with previous methods is shown in Table 1. Among the density-map methods, DM-Count and SDCNet achieve the best MAE and MSE, respectively, while CPNC++ achieves similar performance without using bounding-box or point-label information.
NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149".
UCF-QNRF is a large crowd counting dataset on which the advantages of the proposed CPNC and CPNC++ networks are more pronounced. It consists of 1,535 pictures containing a total of 1.25 million head annotations. Because the pictures have higher resolution, the sliding-window strategy described above is used. In the experiments, the officially divided 1201 pictures were used as the training set and 334 as the test set. As the results in Table 1 show, CPNC++ improves on CPNC by 13.4% and surpasses DM-Count, the best density-map-based method, indicating the effectiveness of the proposed training strategy.
CPNC++ (30%) achieved good performance on each dataset over 5 random selections. As Table 1 shows, the performance of CPNC++ (30%) is close to that of some density-map-based approaches such as MCNN. On UCF-QNRF, the MAE of CPNC++ (30%) is 105.3, only 21.1 higher than that of CPNC++, approaching most density-map methods that use full point-level labeling. Representative results of CPNC and CPNC++ on UCF-QNRF are shown in Fig. 3, where GT denotes the true crowd count. CPNC++ improves effectively on CPNC: it predicts more accurately when targets are denser (rows 2 and 3) or smaller (row 4), while performance does not degrade on sparse-crowd data (row 1).
Example 2
Existing datasets are limited samples of real-world scenes, and previous methods may overfit them. CPNC++ generalizes better because it balances the imbalanced distribution. To verify this, the model was trained on NWPU-Crowd, selected according to the NWPU-Crowd validation set, and tested on STA, UCF-QNRF, and JHU-Crowd, in comparison with DM-Count, currently the method with the best overall performance. As the results in Table 2 show, CPNC++ transfers better when tested across datasets.
TABLE 2 MAE for cross-dataset test of CPNC + + and DM-Count
Example 3
To examine the impact of each enhancement strategy, ablation experiments were performed on UCF-QNRF. As shown in Table 3, LDS and FDS refer to label distribution smoothing and feature distribution smoothing, respectively; MON refers to the proposed data enhancement strategy (Mosaic enhancement); GD refers to block erasing and block rearrangement; and GS refers to block aggregation. The experiments show that each proposed strategy alone improves the accuracy of block-prediction-based crowd counting, and that the strategies are mutually compatible.
TABLE 3 Effect measurement for each enhancement strategy
The invention has been described in detail with reference to specific embodiments and/or illustrative examples and the accompanying drawings, which, however, should not be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.
Claims (10)
1. A crowd counting method based on block weak labeling, characterized in that the method comprises a training stage and a testing stage, wherein the training stage performs block prediction through a CPNC network, the CPNC network being a cross-stage partial network for crowd counting.
2. The method of claim 1, wherein the CPNC network comprises a feature extraction network, a bottleneck network, and a predictive head.
3. The method of claim 2,
the feature extraction network reduces the size of a training picture by using a Focus module to obtain a feature map with reduced size;
the bottleneck network uses cross-layer half-network components in CSPNet;
the prediction head adopts a Bi-FPN network as in EfficientDet.
4. The method of claim 3, wherein the cross-layer half-network component splits the features into two parts along the channel dimension, wherein one part continues to extract deeper features through a branch bottleneck network, the other part passes only through a low-complexity convolution transform, and the results of the two are combined,
preferably, the cross-layer half-network component is as shown in formula (1), where g is a branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity,
fi=[g(fi-1[0:ni-1/2]),h(fi-1[ni-1/2:ni-1])] (1)
where n_{i-1} is the dimension of the layer-(i−1) feature f_{i-1} of the cross-layer half-network component, i denotes the i-th layer of the component, f_i is the feature produced by the i-th layer, f_{i-1}[0 : n_{i-1}/2] is the first half of the elements of f_{i-1}, and f_{i-1}[n_{i-1}/2 : n_{i-1}] is the second half of the elements of f_{i-1}.
5. The method of claim 3, wherein the Bi-FPN network feature layer is set to 3-5 layers, preferably 3 layers, to enhance small object discrimination and obtain the block number map.
6. The method according to any one of claims 1 to 5, characterized in that a Gaussian function is used as a radial basis function, the number of blocks B_n containing n people is smoothed by convolution, and the reciprocal of the smoothed quantity B̃_n is used as the weight w of the corresponding block, the specific operation being as shown in formula (2):

B̃_n = Σ_{n′ ∈ window(n, ζ)} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B̃_{k_i}   (2)
7. The method according to any one of claims 1 to 6, characterized in that standard whitening and recoloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i, obtaining the smoothed value z̃_i; a Gaussian kernel function weights the distances between the people counts of samples to obtain, from the mean μ_i and covariance Σ_i of the current sample's features, the corresponding smoothed values μ̃_i and Σ̃_i, the specific formula being shown in (3):

μ̃_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ̃_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}   (3)
where y_i and y_{i′} are the people counts of the i-th and i′-th images, respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
8. Method according to one of the claims 1 to 7, wherein the number of small-sized targets is increased using Mosaic data enhancement.
9. Method according to one of claims 1 to 8, characterized in that a secondary loss function is used in the method, which is defined as shown in equation (4):
L(x,y,y′)=Smooth L1(y-y′)+λ(f(x)-h-1f(xh))2 (4)
where x is an input image sample, y and y′ are the true and predicted people counts of the image, f(x) is the corresponding feature layer, and x_h is the image after the transform h; h comprises random block erasing and rearrangement, and block feature scaling and aggregation; λ is a balance coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110930559.4A CN113780092B (en) | 2021-08-13 | 2021-08-13 | Crowd counting method based on block weak labeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113780092A true CN113780092A (en) | 2021-12-10 |
CN113780092B CN113780092B (en) | 2022-06-10 |
Family
ID=78837663
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110930559.4A Active CN113780092B (en) | 2021-08-13 | 2021-08-13 | Crowd counting method based on block weak labeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113780092B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114758288A (en) * | 2022-03-15 | 2022-07-15 | 华北电力大学 | Power distribution network engineering safety control detection method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104992223A (en) * | 2015-06-12 | 2015-10-21 | 安徽大学 | Intensive population estimation method based on deep learning |
CN106845621A (en) * | 2017-01-18 | 2017-06-13 | 山东大学 | Dense population number method of counting and system based on depth convolutional neural networks |
CN111882517A (en) * | 2020-06-08 | 2020-11-03 | 杭州深睿博联科技有限公司 | Bone age evaluation method, system, terminal and storage medium based on graph convolution neural network |
CN112215129A (en) * | 2020-10-10 | 2021-01-12 | 江南大学 | Crowd counting method and system based on sequencing loss and double-branch network |
CN112417288A (en) * | 2020-11-25 | 2021-02-26 | 南京大学 | Task cross-domain recommendation method for crowdsourcing software testing |
Non-Patent Citations (2)
Title |
---|
WILLIAM: "一文读懂YOLOV5与YOLOV4" (Understanding YOLOv5 and YOLOv4 in One Article), 《知乎》 (Zhihu) *
XIALEI LIU et al.: "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank", 《IEEE CVF》 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||