CN113780092A - Crowd counting method based on block weak labeling - Google Patents


Info

Publication number
CN113780092A
Authority
CN
China
Prior art keywords: network, block, layer, cpnc, feature
Prior art date
Legal status
Granted
Application number
CN202110930559.4A
Other languages
Chinese (zh)
Other versions
CN113780092B (en)
Inventor
李国荣
黄庆明
刘心岩
Current Assignee
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences
Priority to CN202110930559.4A
Publication of CN113780092A
Application granted
Publication of CN113780092B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformation in the plane of the image
    • G06T 3/40 Scaling the whole image or part thereof

Abstract

The invention discloses a crowd counting method based on block-level weak labeling information, comprising a training stage and a testing stage. In the training stage, block counts are predicted by a CPNC network, and label smoothing, feature smoothing, several data augmentation strategies and an auxiliary loss function are applied, which alleviates the long-tail effect in block person counts and the inaccuracy of region prediction. The method achieves performance similar to density-map-based methods while using much less annotation information, and shows good transferability.

Description

Crowd counting method based on block weak labeling
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a crowd counting method based on block weak labeling.
Background
Crowd counting is an important computer vision task whose goal is to count the number of people appearing in a picture. In recent years, this task has played an increasingly important role in security monitoring, public-place traffic analysis, and similar applications. Unlike object detection, which uses bounding boxes, mainstream crowd counting methods mainly use density maps as the learning target. A density map is generated by convolving the point-annotation map with a symmetric probability density function, and therefore has the same L1 norm as the point-annotation map. Compared with a point-annotation map, a density map is more continuous in value and thus easier for a network to learn. However, current density-map methods have the following problems.
Generating a density map depends on point annotations, and annotating targets one by one is costly at high density. Ideally, the size of the probability density kernel that generates the density map should depend on the scale of each target, but annotating scale would increase the cost further. Meanwhile, the density map cannot avoid the annotation noise that often arises when targets are dense.
Predicting a density map requires preserving the size of the feature map throughout the network to maintain the map's resolution. On a density map, one pixel corresponds to at most one target; if one pixel corresponded to exactly one target, the density map would degenerate into a point-annotation map. Therefore only a small amount of down-sampling can be performed in the network, which increases the consumption of computing resources.
Therefore, an image-based crowd counting method is needed that reduces annotation cost and the influence of noise, so as to satisfy current requirements for counting people from visual images.
Disclosure of Invention
In order to solve the above problems, the invention provides a crowd counting method based on block weak labeling. The method divides the crowd picture into several blocks to form a block-count map; block-count annotation needs no precise position information, which reduces annotation cost. The network structure CPNC takes the crowd picture as input and outputs the number of people in each block. A smoothing strategy, a data augmentation strategy and an auxiliary loss function are introduced respectively, so that performance similar to density-map methods is obtained with less annotation information, together with good transferability, thereby completing the invention.
The invention aims to provide a crowd counting method based on block weak labeling.
The training stage performs block prediction through a CPNC network, where the CPNC network is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC).
The CPNC network comprises a feature extraction network, a bottleneck network and a prediction head.
The feature extraction network reduces the size of the training picture with a Focus module, obtaining a reduced-size feature map. The bottleneck network uses the cross-layer half-network component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the component splits a feature in two along the channel dimension. One half continues through a branch bottleneck network to extract deeper features, while the other half passes only through a low-complexity convolution; the two results are then merged. Preferably, the component is as shown in formula (1), where g is the computationally expensive branch bottleneck network and h is a low-complexity 1×1 convolution module.
f_i = [g(f_{i-1}[0 : n_{i-1}/2]), h(f_{i-1}[n_{i-1}/2 : n_{i-1}])]    (1)
where n_{i-1} is the channel dimension of the layer-(i-1) feature f_{i-1} of the cross-layer half-network component, i denotes the i-th layer of the component, f_i is the feature produced by the i-th layer, f_{i-1}[0 : n_{i-1}/2] is the first half of the channels of f_{i-1}, and f_{i-1}[n_{i-1}/2 : n_{i-1}] is the second half.
The prediction head adopts the Bi-FPN network from EfficientDet.
In the invention, using a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B̃_n is used as the weight w of the corresponding block; the specific operation is shown in formula (2):

B̃_n = Σ_{n′} N(n − n′; 0, σ²) · B_{n′}  (the sum taken over the convolution window)
w_i = 1 / B̃_{k_i}    (2)

where B_n is the number of blocks whose person count is exactly n; k_i represents the number of people in the i-th block; ζ is the size of the convolution window, which is 9-21; n′ is a person count within the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².
In the present invention, standard whitening and re-coloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i into a smoothed value z̃_i. The distances between the per-sample person counts are weighted with a Gaussian kernel, so that the mean μ_i and covariance Σ_i of the current sample's features yield the corresponding smoothed values μ̃_i and Σ̃_i. The Gaussian kernel is shown in formula (3):

k(y_i, y_{i′}) = N(y_i − y_{i′}; 0, σ²)    (3)

where y_i and y_{i′} are the numbers of people in the i-th and i′-th images; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
On the network design, the invention introduces a Bi-FPN module and uses Mosaic data augmentation to increase the number of small-size targets.
In the training process, random block erasing or block position resetting is applied to the data, and the features of the block at the corresponding position after resetting are compared with the features of the original block. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in fig. 2. Both processes can use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the label-free auxiliary loss, defined as shown in formula (4):

L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)

where x is an input image sample, y and y′ are the ground-truth and predicted person counts of the image, f(x) is the corresponding feature layer, and x_h = h(x) is the image after transformation h; h comprises random block erasing and rearrangement (GD) and block feature scaling and aggregation (GS); λ is the balance coefficient. The prediction error uses the SmoothL1(c) loss to ensure the gradient is not too large at the beginning of training, defined as shown in formula (5):

SmoothL1(c) = 0.5c² if |c| < 1, and |c| − 0.5 otherwise    (5)
and in the testing stage, the trained CPNC network is applied to a detection task of population counting so as to verify the effectiveness of the model.
The crowd counting method based on block weak labeling has the following beneficial effects:
(1) The invention designs a lightweight network structure CPNC, which takes a picture as input and outputs the number of people in each block. By analyzing the long-tail distribution problem and the small-target problem in several datasets, a smoothing strategy and a data augmentation strategy are introduced respectively, and an auxiliary loss function is further introduced, reducing annotation cost and yielding a crowd counting method that needs only partial block annotation information.
(2) In the invention, a Gaussian function is used as the radial basis function and B_n is smoothed by convolution; after smoothing, the negative correlation between B̃_n and Er_n is significantly improved.
(3) In the invention, standard whitening and re-coloring are introduced on the three feature layers input to the Bi-FPN to smooth the output features. This makes the features consistent, lets the network model attend to blocks of all densities in a more balanced way, and prevents overfitting to the density values that occur most frequently in the dataset, so the model transfers better.
(4) The data augmentation strategy introduced by the invention effectively improves prediction accuracy for small-size targets and makes the target sizes in the augmented images vary continuously.
(5) In the invention, an auxiliary loss function is constructed to mine the supervision information in unlabeled data, further reducing annotation cost.
Drawings
Fig. 1 shows a schematic diagram of a CPNC network architecture according to the present invention;
FIG. 2 illustrates an exemplary diagram of random block erasure and block position reset in the auxiliary loss function according to the present invention;
fig. 3 shows the application test results of CPNC and CPNC++ on UCF-QNRF in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from the description.
The invention provides a crowd counting method based on block weak labeling.
The training stage performs block prediction through a CPNC network, where the CPNC network is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC), specifically described in the literature "WANG C Y, MARK LIAO H Y, WU Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2020: 1571-1580".
The CPNC network includes a feature extraction network, a bottleneck network, and a prediction head, and a network structure thereof is shown in fig. 1.
The feature extraction network reduces the size of the training picture with a Focus module, obtaining a reduced-size feature map. The bottleneck network uses the cross-layer half-network component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the component splits a feature in two along the channel dimension. One half continues through a branch bottleneck network to extract deeper features, while the other half passes only through a low-complexity convolution; the two results are then merged. Preferably, the component is as shown in formula (1), where g is the computationally expensive branch bottleneck network and h is a low-complexity 1×1 convolution module.

f_i = [g(f_{i-1}[0 : n_{i-1}/2]), h(f_{i-1}[n_{i-1}/2 : n_{i-1}])]    (1)

where n_{i-1} is the channel dimension of the layer-(i-1) feature f_{i-1} of the cross-layer half-network component, i denotes the i-th layer of the component, f_i is the feature produced by the i-th layer, f_{i-1}[0 : n_{i-1}/2] is the first half of the channels of f_{i-1}, and f_{i-1}[n_{i-1}/2 : n_{i-1}] is the second half.
The Focus module is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo, 2020".
The h is a low-complexity 1×1 convolution module; the convolution operation is described in the literature "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]//Advances in Neural Information Processing Systems. 2012".
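To make the cross-layer half-network split of formula (1) concrete, the following is a minimal NumPy sketch; the branch network g and the low-complexity transform h are stand-in callables here, not the patent's actual NFNet branch or convolution.

```python
import numpy as np

def csp_half_layer(f_prev, g, h):
    """One cross-layer half-network step, formula (1): split the channels,
    send the first half through the deep branch g and the second half
    through the cheap transform h, then concatenate.  g and h are
    stand-in callables for the branch bottleneck network and the 1x1
    convolution described in the text."""
    n = f_prev.shape[0]                        # channel count n_{i-1}
    deep = g(f_prev[: n // 2])                 # g(f_{i-1}[0 : n_{i-1}/2])
    cheap = h(f_prev[n // 2:])                 # h(f_{i-1}[n_{i-1}/2 : n_{i-1}])
    return np.concatenate([deep, cheap], axis=0)

# toy branches: g doubles its input, h adds a bias; shapes are preserved
f = np.arange(8, dtype=float).reshape(8, 1, 1)   # 8 channels, 1x1 spatial map
out = csp_half_layer(f, g=lambda x: 2 * x, h=lambda x: x + 1)
```

Only the first half of the channels pays the cost of the deep branch, which is the complexity saving the component is designed for.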
According to different requirements, the branch bottleneck network can adopt network structures of different depths or complexities, such as ResNet, ResNeXt, ResNeSt, NFNet, and the like.
The ResNet is specifically described in the literature "HE K, ZHANG X, REN S, et al. deep residual learning for image Recognition [ C ]//2016IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2016: 770-778".
The ResNeXt is specifically described in the literature "XIE S, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks [C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017: 1492-1500".
The ResNeSt is specifically described in the literature "ZHANG H, WU C, ZHANG Z, et al. ResNeSt: Split-attention networks [J/OL]. CoRR, 2020, abs/2004.08955".
The NFNet is described in detail in "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171".
The prediction head adopts the Bi-FPN network from EfficientDet. Unlike detection tasks, the targets in the crowd counting task are small, and small targets are not easily distinguished by high-level features. Meanwhile, to reduce network complexity, the number of Bi-FPN feature layers is set to 3-5, preferably 3, which enhances small-target recognition and yields the block-count map.
EfficientDet and the Bi-FPN network are specifically described in the literature "TAN M, PANG R, LE Q V. EfficientDet: Scalable and efficient object detection [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 10781-10790".
In image processing for crowd counting, label imbalance easily biases the model: dataset statistics show that the number of blocks drops rapidly as density increases, presenting a long-tail distribution. In addition, as density increases, the average size of targets within a block becomes smaller, increasing counting difficulty. As a result, prediction deviation on high-density blocks is larger. To address this model bias, a sample balancing strategy is designed to counter the influence of the long-tail distribution.
In the invention, using a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B̃_n is used as the weight w of the corresponding block; the specific operation is shown in formula (2):

B̃_n = Σ_{n′} N(n − n′; 0, σ²) · B_{n′}  (the sum taken over the convolution window)
w_i = 1 / B̃_{k_i}    (2)

where B_n is the number of blocks whose person count is exactly n; k_i represents the number of people in the i-th block; ζ is the size of the convolution window, which is 9-21, preferably 12-18; n′ is a person count within the window; N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ²; and every B̃_n is greater than 0.
After the label-smoothing strategy, the negative correlation between B̃_n and the average error Er_n of all blocks whose person count is exactly n is significantly improved. It is obtained through experiments that the Pearson correlation between B̃_n and Er_n is -0.72 on NWPU-Crowd and -0.79 on UCF-QNRF, and all B̃_n are greater than 0. Here

Er_n = (1/B_n) · Σ_{i : k_i = n} er_i

where k_i indicates the number of people in the i-th block, n is a person count, and er_i is the prediction error of the i-th block, whose person count is k_i.
UCF-QNRF is specifically described in the literature "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds [C]//European Conference on Computer Vision (ECCV). 2018".
NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149".
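The label-smoothing step of formula (2) can be sketched in NumPy as follows; the window size, σ, and the toy block counts are illustrative, and the histogram is zero-padded so the Gaussian window stays centered.

```python
import numpy as np

def lds_weights(block_counts, zeta=9, sigma=2.0):
    """Sketch of formula (2): smooth the histogram B_n of block person
    counts with a Gaussian window of (odd) size zeta, then use the
    reciprocal of the smoothed B~_{k_i} as the weight of block i."""
    block_counts = np.asarray(block_counts)
    B = np.bincount(block_counts)              # B_n: number of blocks with count n
    pad = zeta // 2
    kernel = np.exp(-(np.arange(zeta) - pad) ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()                     # N(n - n'; 0, sigma^2) over the window
    B_smooth = np.convolve(np.pad(B, pad), kernel, mode="valid")  # B~_n
    w = 1.0 / B_smooth[block_counts]           # w_i = 1 / B~_{k_i}
    return w, B_smooth

# toy example: many sparse blocks, one dense block
w, B_smooth = lds_weights([0, 0, 0, 1, 1, 2, 5])
```

Rare high-count blocks receive larger weights, which is how the strategy counters the long-tail distribution described above.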
Meanwhile, to make the output features consistent, in the invention standard whitening and re-coloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i into a smoothed value z̃_i. The distances between the per-sample person counts are weighted with a Gaussian kernel, so that the mean μ_i and covariance Σ_i of the current sample's features yield the corresponding smoothed values μ̃_i and Σ̃_i. The Gaussian kernel is shown in formula (3):

k(y_i, y_{i′}) = N(y_i − y_{i′}; 0, σ²)    (3)

where y_i and y_{i′} are the numbers of people in the i-th and i′-th images; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
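A simplified sketch of the whitening-and-re-coloring step, assuming per-dimension (diagonal) variances instead of full covariance matrices and grouping samples by their person-count label; the function and variable names are illustrative, not the patent's implementation.

```python
import numpy as np

def fds_smooth(z, y, sigma=2.0, eps=1e-6):
    """Feature-smoothing sketch: feature statistics are estimated per
    person-count label, smoothed across nearby labels with the Gaussian
    kernel of formula (3), and each feature is then whitened with its own
    statistics and re-colored with the smoothed ones.  Diagonal variances
    replace full covariances for simplicity."""
    z = np.asarray(z, dtype=float)
    y = np.asarray(y)
    labels = np.unique(y)
    mu = np.stack([z[y == b].mean(axis=0) for b in labels])   # per-label mean
    var = np.stack([z[y == b].var(axis=0) for b in labels])   # per-label variance
    # kernel k(y_i, y_i') = N(y_i - y_i'; 0, sigma^2), rows normalized
    d = labels[:, None].astype(float) - labels[None, :].astype(float)
    K = np.exp(-d ** 2 / (2 * sigma ** 2))
    K /= K.sum(axis=1, keepdims=True)
    mu_s, var_s = K @ mu, K @ var                             # smoothed statistics
    out = np.empty_like(z)
    for j, b in enumerate(labels):
        m = y == b
        # whiten with own statistics, re-color with the smoothed statistics
        out[m] = np.sqrt((var_s[j] + eps) / (var[j] + eps)) * (z[m] - mu[j]) + mu_s[j]
    return out

rng = np.random.default_rng(0)
z = rng.normal(size=(12, 4))                  # 12 samples, 4-dim features
y = np.array([0] * 4 + [1] * 4 + [2] * 4)     # person-count label per sample
z_smooth = fds_smooth(z, y)
```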
Through the two smoothing strategies, the model attends to blocks of all densities in a more balanced way, and does not overfit the blocks whose densities occur most frequently in the dataset, so the model transfers better.
To solve the small-size target problem, the invention introduces the Bi-FPN network in the architecture and uses Mosaic data augmentation to increase the number of small-size targets. Meanwhile, to further reduce annotation cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function for effective training.
Said Mosaic is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo,2020.https:// doi.org/10.5281/zenodo.4154370".
During training, several pictures are combined using the Mosaic algorithm; preferably, several times batch-size pictures are sampled and randomly divided into batch-size groups. Each time an enhanced picture is generated, using the number n_i of real pictures in the i-th group (i = 1, 2, ...), the enhanced picture is divided n_i − 1 times into n_i regions, where R_i^j denotes the j-th region obtained for the i-th group. At the a-th division of the i-th enhanced picture, the largest region among {R_i^j} is selected and split in two, where a is an integer with 1 ≤ a ≤ n_i − 1; the divisions are performed alternately in horizontal and vertical order. Finally, the real pictures {x_i^j} of the i-th group are sorted in ascending order of person count, and the regions {R_i^j} are sorted in ascending order of area; the picture ranked j-th (j = 1, 2, ..., n_i) is scaled to fit into the region ranked j-th.
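The region-splitting procedure above (repeatedly split the largest region, alternating horizontal and vertical cuts) can be sketched in plain Python; the midpoint split position and which axis is cut first are assumptions of this sketch, since the text does not fix them.

```python
def split_regions(width, height, n):
    """Mosaic layout sketch: starting from the whole enhanced picture,
    split the currently largest region in two until n regions exist,
    alternating the cut axis.  Regions are (x, y, w, h) tuples."""
    regions = [(0, 0, width, height)]
    for a in range(n - 1):                     # a-th division, n-1 divisions total
        i = max(range(len(regions)), key=lambda k: regions[k][2] * regions[k][3])
        x, y, w, h = regions.pop(i)
        if a % 2 == 0:                         # vertical cut: split the width
            regions += [(x, y, w // 2, h), (x + w // 2, y, w - w // 2, h)]
        else:                                  # horizontal cut: split the height
            regions += [(x, y, w, h // 2), (x, y + h // 2, w, h - h // 2)]
    return regions

regs = split_regions(640, 640, 4)              # one group with n_i = 4 pictures
```

The n real pictures of the group, sorted by person count, would then be scaled into these regions sorted by area.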
To further reduce annotation cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function to train the network effectively. In the training process, random block erasing or block position resetting is applied to the data, and the features of the block at the corresponding position after resetting are compared with the features of the original block. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in fig. 2. Both processes can use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the label-free auxiliary loss, defined as shown in formula (4):

L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)

where x is an input image sample, y and y′ are the ground-truth and predicted person counts of the image, f(x) is the corresponding feature layer, and x_h = h(x) is the image after transformation h; h comprises random block erasing and rearrangement (GD) and block feature scaling and aggregation (GS); λ is the balance coefficient, as described in the literature "LIU X, VAN DE WEIJER J, BAGDANOV A D. Leveraging unlabeled data for crowd counting by learning to rank [C]//IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 7661-7669". The prediction error uses the SmoothL1(c) loss to ensure the gradient is not too large at the beginning of training, defined as shown in formula (5):

SmoothL1(c) = 0.5c² if |c| < 1, and |c| − 0.5 otherwise    (5)

where c is the argument.
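Formulas (4) and (5) can be sketched numerically as follows; `f_x` and `f_xh_back` stand for the feature maps f(x) and h⁻¹f(x_h), which in the real method come from the network, so the signature here is illustrative.

```python
import numpy as np

def smooth_l1(c):
    """Formula (5): 0.5 * c**2 where |c| < 1, and |c| - 0.5 otherwise."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) < 1, 0.5 * c ** 2, np.abs(c) - 0.5)

def total_loss(y, y_pred, f_x, f_xh_back, lam=1e-4):
    """Formula (4): labeled prediction error plus the unlabeled auxiliary
    consistency term lambda * (f(x) - h^{-1} f(x_h))^2.  f_x and
    f_xh_back are stand-ins for the two feature maps being compared."""
    pred = np.mean(smooth_l1(np.asarray(y, float) - np.asarray(y_pred, float)))
    aux = np.mean((np.asarray(f_x, float) - np.asarray(f_xh_back, float)) ** 2)
    return float(pred + lam * aux)
```

The quadratic region near zero is what keeps early-training gradients small, as the text notes.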
In the testing stage, the trained CPNC network is applied to the crowd counting task to verify the effectiveness of the model.
The invention provides a crowd counting method based on block weak labeling information, which trains an improved CPNC network and does not depend on precise position information, so the annotation cost is lower. The invention provides several improvement strategies, including label smoothing, feature smoothing, multiple data augmentation strategies, and an auxiliary loss function, which address the long-tail effect of block person counts and the inaccuracy of region prediction; performance similar to density-map methods is obtained with less annotation information, together with good transferability.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
The CPNC network using the data augmentation and auxiliary loss functions is denoted CPNC++.
In the CPNC network:
(1) First, the Focus module is used to reduce the size of the training picture, obtaining a reduced-size feature map. The Focus module is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1 - Bug Fixes and Performance Improvements [CP/OL]. Zenodo, 2020".
The reduced-size feature map is input to the cross-layer half-network component of CSPNet as the bottleneck network, which proceeds as in formula (1):

f_i = [g(f_{i-1}[0 : n_{i-1}/2]), h(f_{i-1}[n_{i-1}/2 : n_{i-1}])]    (1)

where g is the computationally expensive branch bottleneck network NFNet-F3, specifically described in the document "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171. https://arxiv.org/abs/2102.06171"; h is a low-complexity 1×1 convolution module, whose convolution operation is described in the document "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]//Advances in Neural Information Processing Systems. 2012". n_{i-1} is the channel dimension of the layer-(i-1) feature f_{i-1} of the cross-layer half-network component, i denotes the i-th layer of the component, f_i is the feature produced by the i-th layer, f_{i-1}[0 : n_{i-1}/2] is the first half of the channels of f_{i-1}, and f_{i-1}[n_{i-1}/2 : n_{i-1}] is the second half.
(2) The output of the bottleneck network is input to the Bi-FPN network of EfficientDet as the prediction head, obtaining the output result.
EfficientDet and the Bi-FPN network are specifically described in the literature "TAN M, PANG R, LE Q V. EfficientDet: Scalable and efficient object detection [C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 10781-10790".
Using a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed (label distribution smoothing, LDS) by convolution, and the reciprocal of the smoothed quantity B̃_n is used as the weight w of the corresponding block; the specific operation is shown in formula (2):

B̃_n = Σ_{n′} N(n − n′; 0, σ²) · B_{n′}  (the sum taken over the convolution window)
w_i = 1 / B̃_{k_i}    (2)

where B_n is the number of blocks whose person count is exactly n; k_i represents the number of people in the i-th block; ζ is the size of the convolution window, here 15; n′ is a person count within the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ². The average error of blocks with person count n is

Er_n = (1/B_n) · Σ_{i : k_i = n} er_i

where er_i is the prediction error of the i-th block, whose person count is k_i.
On NWPU-Crowd, the Pearson correlation between B̃_n and Er_n is -0.72, and on UCF-QNRF it is -0.79; in contrast, the correlation between the unsmoothed B_n and Er_n is only -0.10 on NWPU-Crowd and -0.11 on UCF-QNRF, so the negative correlation between B̃_n and Er_n is significantly improved.
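The correlation measure used above is the plain Pearson coefficient between the per-count histogram and the average-error array; the values below are a toy check, not the patent's data.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient, as used to compare B~_n (or the
    unsmoothed B_n) against the average error Er_n."""
    return float(np.corrcoef(a, b)[0, 1])
```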
(3) In addition, standard whitening and re-coloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i (feature distribution smoothing, FDS) into a smoothed value z̃_i. The distances between the per-sample person counts are weighted with a Gaussian kernel, so that the mean μ_i and covariance Σ_i of the current sample's features yield the corresponding smoothed values μ̃_i and Σ̃_i. The Gaussian kernel is shown in formula (3):

k(y_i, y_{i′}) = N(y_i − y_{i′}; 0, σ²)    (3)

where y_i and y_{i′} are the numbers of people in the i-th and i′-th images; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
In the CPNC++ network, on the basis of the CPNC network:
(1) Several pictures are combined using the Mosaic algorithm: 4 times batch-size pictures are randomly divided into batch-size groups. Each time an enhanced picture is generated, using the number n_i of real pictures in the i-th group (i = 1, 2, 3, 4), the enhanced picture is divided n_i − 1 times into n_i regions, where R_i^j denotes the j-th region obtained for the i-th group. At the a-th division, the largest region among {R_i^j} is selected and split in two, where a is an integer with 1 ≤ a ≤ n_i − 1; the divisions are performed alternately in horizontal and vertical order. Finally, the real pictures {x_i^j} of the i-th group are sorted in ascending order of person count, and the regions {R_i^j} are sorted in ascending order of area; the picture ranked j-th (j = 1, 2, ..., n_i) is scaled to fit into the region ranked j-th.
(2) During training, the data are subjected to random block erasure or block position rearrangement, and the features of the block at the corresponding position after rearrangement are compared with the features of the original block. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts of the original four blocks, as shown in fig. 2. Both of the above processes use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the auxiliary loss function on the unlabeled part, and is defined as shown in formula (4):
L(x, y, y′) = Smooth_L1(y − y′) + λ(f(x) − h⁻¹f(x_h))²   (4)
wherein x is an input image sample; y and y′ are respectively the true value and the predicted value of the number of people in the input image; f(x) is the corresponding feature layer; and x_h is the image after x has been transformed by h. h comprises block random erasure and rearrangement (GD) and block feature scaling and aggregation (GS). The erasure operation is as in the document "Pathak D, Krähenbühl P, Donahue J, et al. Context Encoders: Feature Learning by Inpainting[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2536-2544."; the rearrangement operation is as in "Noroozi M, Favaro P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles[C]// European Conference on Computer Vision. 2016: 69-84."; the scaling and aggregation operations are as in the document "Noroozi M, Pirsiavash H, Favaro P. Representation Learning by Learning to Count[C]// International Conference on Computer Vision. 2017: 5898-5906."; λ is 0.0001.
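A toy sketch of the block rearrangement half of the transform h follows (hypothetical helper names; block erasure would simply zero a chosen block instead of permuting). Remembering the permutation makes the inverse h⁻¹ available for the consistency term of formula (4):

```python
import numpy as np

def block_shuffle(x, block, rng):
    """Randomly permute non-overlapping blocks of a square 2-D map.

    Returns the shuffled map and the inverse permutation (the h^{-1}
    needed to restore the original block order).
    """
    n = x.shape[0] // block  # assumes a square map whose side divides by block
    # decompose into n*n blocks in row-major block order
    blocks = (x[:n * block, :n * block]
              .reshape(n, block, n, block).swapaxes(1, 2)
              .reshape(n * n, block, block))
    perm = rng.permutation(n * n)
    shuffled = blocks[perm]
    out = (shuffled.reshape(n, n, block, block)
           .swapaxes(1, 2).reshape(n * block, n * block))
    inv = np.argsort(perm)   # inverse permutation
    return out, inv
```

Applying the inverse permutation to the shuffled blocks recovers the original map exactly, which is what the MSE consistency term compares against.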
The prediction error uses the Smooth_L1(c) loss, which ensures that the gradient is not too large at the beginning of training; it is defined as shown in formula (5):

Smooth_L1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise   (5)

wherein c is the function argument.
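Formulas (4) and (5) can be sketched together as follows. This is an illustrative sketch only: the names `total_loss`, `feat`, and `feat_inv_aug` are assumptions, and the feature-consistency term is computed on flat vectors rather than real network feature maps:

```python
import numpy as np

def smooth_l1(c):
    """Smooth L1 loss of formula (5): quadratic near zero, linear beyond."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) < 1, 0.5 * c ** 2, np.abs(c) - 0.5)

def total_loss(y, y_pred, feat, feat_inv_aug, lam=1e-4):
    """Formula (4): labeled prediction error plus the unlabeled
    feature-consistency term.

    feat:         f(x), features of the original image
    feat_inv_aug: h^{-1} f(x_h), features of the transformed image mapped back
    lam:          the balance coefficient lambda (0.0001 in the text)
    """
    consistency = np.sum((np.asarray(feat) - np.asarray(feat_inv_aug)) ** 2)
    return float(smooth_l1(y - y_pred) + lam * consistency)
```

When prediction matches the label and the transformed features match the originals, both terms vanish and the loss is zero.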
The CPNC network and the CPNC++ network were evaluated on three public open-view dense-crowd datasets, Shanghai Tech, UCF-QNRF, and NWPU-Crowd, and compared with existing density-map-based methods. For a block ((w₁, h₁), (w₂, h₂)), a point-annotation dataset can be converted into a block-annotation dataset by formula (11):

B((w₁, h₁), (w₂, h₂)) = ∑_{w=w₁}^{w₂} ∑_{h=h₁}^{h₂} Y(w, h)   (11)

wherein h₁ and h₂ are respectively the ordinates of the upper-left and lower-right corners of the block, w₁ is the abscissa of the upper-left corner of the block, w₂ is the abscissa of the lower-right corner of the block, Y is the point label information (point label values), and Y(w, h) is the point label value at the position with coordinates (w, h).
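A minimal sketch of this point-to-block conversion follows (a hypothetical helper: a per-pixel point-count map is summed over non-overlapping square blocks, as in formula (11)):

```python
import numpy as np

def points_to_block_labels(point_map, block_size):
    """Sum a per-pixel head-point map over non-overlapping square blocks.

    point_map:  (H, W) array, Y(w, h) head-point counts per pixel
    block_size: side length of a square block
    Returns an (H // block_size, W // block_size) array of block counts.
    """
    H, W = point_map.shape
    bh, bw = H // block_size, W // block_size
    trimmed = point_map[:bh * block_size, :bw * block_size]
    # group pixels into (block-row, in-block-row, block-col, in-block-col)
    # and sum within each block
    return trimmed.reshape(bh, block_size, bw, block_size).sum(axis=(1, 3))
```

Since the blocks tile the (trimmed) map, the block counts sum to the total head count, preserving the total annotation.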
Unless otherwise specified, the experiments were performed on RTX 3090 GPUs, with an input size of 1024 × 1024 and a batch size of 16 per GPU, using synchronized batch normalization during training. The number of iterations is 500 rounds, Adam is applied as the optimizer, and the fixed learning rate is 10⁻⁵. The data-enhancement trigger probabilities are all 0.3. During testing, if the picture size exceeds the training input size, predictions are averaged over an overlapping sliding window, with a window size of 1024 × 1024 and an overlap rate of 0.25. If the picture size is smaller than the training input size, the edges are zero-padded up to a multiple of 64. Unless otherwise stated, the network backbone is NFNet-f3; with this network each inference takes only 0.06 seconds, while DM-Count on the same machine takes 0.15 seconds.
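The overlapping sliding-window averaging used at test time can be sketched as follows (a hypothetical outline, not the patent's code; `predict_fn` stands in for the network's per-window prediction, and images smaller than the window are assumed to have been zero-padded beforehand, as described above):

```python
import numpy as np

def sliding_window_predict(image, predict_fn, win=1024, overlap=0.25):
    """Average overlapping window predictions over a large image.

    image:      (H, W, C) array with H, W >= win
    predict_fn: maps a (win, win, C) crop to a (win, win) prediction map
    """
    H, W = image.shape[:2]
    stride = max(int(win * (1 - overlap)), 1)
    ys = list(range(0, H - win + 1, stride))
    xs = list(range(0, W - win + 1, stride))
    # ensure the bottom and right borders are covered
    if ys[-1] != H - win: ys.append(H - win)
    if xs[-1] != W - win: xs.append(W - win)
    acc = np.zeros((H, W))   # accumulated predictions
    hits = np.zeros((H, W))  # number of windows covering each pixel
    for y in ys:
        for x in xs:
            acc[y:y + win, x:x + win] += predict_fn(image[y:y + win, x:x + win])
            hits[y:y + win, x:x + win] += 1
    return acc / np.maximum(hits, 1)
```

Dividing by the per-pixel hit count averages the predictions wherever windows overlap.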
The Shanghai Tech is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597.".
The UCF-QNRF is described in detail in the document "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds[C]// European Conference on Computer Vision (ECCV). 2018.".
The NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149.".
The NFNet-f3 is described in detail in the document "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization[J/OL]. CoRR, 2021, abs/2102.06171".
The DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting[C]// Advances in Neural Information Processing Systems. 2020.".
In addition, in order to demonstrate that the CPNC++ network can effectively utilize unlabeled data, 30% of the blocks together with their block-count labels were randomly selected from the training data as supervision data, the remaining 70% of the data were used as unlabeled data, and the model obtained by training is denoted CPNC++ (30%).
TABLE 1 comparison of CPNC network, CPNC + + network and CPNC + + (30%) models in the present invention with existing methods on open datasets
Figure BDA0003211160270000201
The MCNN is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597.".
The SCNN is described in detail in the document "SAM D B, SURYA S, BABU R V. Switching convolutional neural network for crowd counting[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2017.".
The IG-NN is described in detail in the literature "SAM D B, SAJJAN N, BABU R V, et al. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3618-3626.".
The CSRNet is described in detail in the literature "LI Y, ZHANG X, CHEN D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes[C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 1091-1100.".
The SFCN-101 is described in detail in the literature "WANG Q, GAO J, LIN W, et al. Learning from synthetic data for crowd counting in the wild[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2019.".
The CAN is described in detail in the literature "LIU W, SALZMANN M, FUA P. Context-aware crowd counting[C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2019: 5094-5103".
The DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting[C]// Advances in Neural Information Processing Systems. 2020.".
The SDCNet is described in detail in the literature "XIONG H, LU H, LIU C, et al. From open set to closed set: Counting objects by spatial divide-and-conquer[C]// 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019: 8361-8370".
The Mean Absolute Error (MAE) and the Mean Square Error (MSE) are used as evaluation indexes, defined as shown in formulas (12) and (13):

MAE = (1/N) ∑_{i=1}^{N} |C_i − C_i^GT|   (12)

MSE = √( (1/N) ∑_{i=1}^{N} (C_i − C_i^GT)² )   (13)

wherein N is the total number of pictures, C_i is the predicted value, and C_i^GT is the true value.
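Formulas (12) and (13) translate directly into code; a routine sketch:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over N images, formula (12)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred, gt):
    """Root of the mean squared counting error, formula (13)
    (conventionally called MSE in the crowd-counting literature)."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))
```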
The Shanghai Tech dataset consists of two parts: STA and STB. STA is more densely populated and more difficult than STB. The official partitioning of the training and test sets was used in the experiments. As can be seen from the experimental results listed in Table 1, the MAE and MSE of CPNC++ on STA were reduced by 20.9% and 20.4%, respectively, compared with CPNC. Compared with methods using density maps, CPNC++ achieves similar performance on both STA and STB while using less supervision information.
The Shanghai Tech dataset, STA and STB are described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597.".
The NWPU-Crowd is a large dataset comprising 5109 high-resolution pictures, of which the training, validation, and test sets contain 3109, 500, and 1000 pictures respectively. The dataset also provides bounding-box labels, and density-map-based crowd counting methods can estimate a more accurate Gaussian kernel size from the bounding boxes, but this information is not used here in training or testing. The performance comparison with previous methods is shown in Table 1. It can be seen that among the density-map methods, DM-Count and SDCNet achieved the best MAE and MSE respectively, and CPNC++ achieved performance similar to methods using bounding-box and point-label information without using either.
The NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149.".
The UCF-QNRF is a large crowd counting dataset, and on it the advantages of the proposed CPNC and CPNC++ networks are more obvious. This dataset consists of 1,535 pictures containing a total of 1.25 million head annotations. Since the resolution of the pictures is high, the sliding-window strategy described above is used. In the experiments, 1201 officially divided pictures were used as the training set and 334 as the test set. As can be seen from the results in Table 1, the performance of CPNC++ is improved by 13.4% compared with CPNC, exceeding DM-Count, the best method using density maps, which indicates the effectiveness of the proposed training strategy.
CPNC++ (30%) achieved good performance on each dataset over 5 random selections. As can be seen from Table 1, the performance of CPNC++ (30%) is close to that of some density-map-based approaches such as MCNN. On UCF-QNRF, the MAE of CPNC++ (30%) was 105.3, only 21.1 higher than that of CPNC++, approaching most density-map methods that use full point labels. Representative results of CPNC and CPNC++ on UCF-QNRF are shown in fig. 3, where GT denotes the true crowd count. It can be seen that CPNC++ improves effectively on the basis of CPNC. Compared with CPNC, CPNC++ predicts more accurately when targets are denser (rows 2, 3) and smaller (row 4), while performance does not degrade on data where the crowd is sparse (row 1).
Example 2
Existing datasets are limited samples of real-world scenes, and previous methods may overfit to a dataset. In the present invention, CPNC++ generalizes better because it balances the imbalanced distribution. To verify this, the model is trained on NWPU-Crowd, selected according to the NWPU-Crowd validation set, tested on STA, UCF-QNRF, and JHU-Crowd, and compared with DM-Count, currently the method with the best overall performance. As can be seen from the results in Table 2, CPNC++ transfers better when tested across datasets.
TABLE 2 MAE for cross-dataset test of CPNC + + and DM-Count
Figure BDA0003211160270000241
Example 3
In order to examine the impact of each enhancement strategy, ablation experiments were performed on UCF-QNRF, as shown in Table 3, where LDS and FDS refer to label smoothing and feature smoothing respectively, MON refers to the proposed Mosaic data enhancement strategy, GD refers to block erasure and block rearrangement, and GS refers to block aggregation. The experiments show that each proposed strategy used alone improves the accuracy of crowd counting with the block prediction method, and that the strategies are mutually compatible.
TABLE 3 Effect measurement for each enhancement strategy
Figure BDA0003211160270000251
The invention has been described in detail with reference to specific embodiments and/or illustrative examples and the accompanying drawings, which, however, should not be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A crowd counting method based on block weak labeling, characterized in that the method comprises a training stage and a testing stage, wherein the training stage performs block prediction through a CPNC network, the CPNC network being a cross-stage partial network for crowd counting.
2. The method of claim 1, wherein the CPNC network comprises a feature extraction network, a bottleneck network, and a prediction head.
3. The method of claim 2, wherein:
the feature extraction network reduces the size of a training picture by using a Focus module to obtain a feature map of reduced size;
the bottleneck network uses the cross-layer half-network component of CSPNet;
the prediction head adopts the Bi-FPN network of EfficientDet.
4. The method of claim 3, wherein the cross-layer half-network component splits the features into two parts by channel, wherein one part continues to extract deeper features through a branch bottleneck network, the other part only passes through a low-complexity convolution transform, and the results of the two are combined,
preferably, the cross-layer half-network component is as shown in formula (1), g is a branch bottleneck network of high computational complexity, h is a 1 × 1 convolution module of low computational complexity,
f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]   (1)
wherein n_{i−1} is the number of elements of the (i−1)-th layer feature f_{i−1} of the cross-layer half-network component, i denotes the i-th layer of the cross-layer half-network component, f_i is the feature obtained through the i-th layer, f_{i−1}[0 : n_{i−1}/2] is the first half of the elements of f_{i−1}, and f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half of the elements of f_{i−1}.
5. The method of claim 3, wherein the Bi-FPN network feature layers are set to 3-5 layers, preferably 3 layers, to enhance small-object discrimination and obtain the block count map.
6. The method according to one of claims 1 to 5, characterized in that the method uses a Gaussian function as a radial basis function, smooths by convolution the label B_n of the number of blocks whose crowd count is n, and uses the reciprocal of the smoothed quantity B̃_n as the weight w of the corresponding block, the specific operation being as shown in formula (2):

B̃_n = ∑_{n′} N(n − n′; 0, σ²)·B_{n′},   w_i = 1 / B̃_{k_i}   (2)

wherein k_i represents the number of people in the i-th block; ζ is the size of the convolution window, which is 9-21, and n′ ranges over the crowd counts within this window; N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².
7. The method according to one of claims 1 to 6, characterized in that standard whitening and recoloring are introduced on the feature-layer input of the Bi-FPN to perform feature smoothing on the output feature z_i: a Gaussian kernel function is used to weight the distances between the crowd-count values of the samples, and from the mean μ_i and covariance Σ_i of the features of the current sample the corresponding smoothed values μ̃_i and Σ̃_i are calculated, the specific formula being shown in (3):

μ̃_i = ∑_{i′} N(y_i − y_{i′}; 0, σ²)·μ_{i′},   Σ̃_i = ∑_{i′} N(y_i − y_{i′}; 0, σ²)·Σ_{i′}   (3)

wherein y_i, y_{i′} are respectively the numbers of people in the i-th and i′-th images; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; μ_{i′} is the mean of the features of the i′-th sample.
8. Method according to one of the claims 1 to 7, wherein the number of small-sized targets is increased using Mosaic data enhancement.
9. The method according to one of claims 1 to 8, characterized in that an auxiliary loss function is used in the method, defined as shown in formula (4):

L(x, y, y′) = Smooth_L1(y − y′) + λ(f(x) − h⁻¹f(x_h))²   (4)

wherein x is an input image sample, y and y′ are respectively the true value and the predicted value of the number of people in the image, f(x) is the corresponding feature layer, x_h is the image after x has been transformed by h, h comprises block random erasure and rearrangement and block feature scaling and aggregation, and λ is a balance coefficient.
10. The method of claim 9, wherein the Smooth_L1 is defined as shown in formula (5):

Smooth_L1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise   (5)

wherein c is the function argument.
CN202110930559.4A 2021-08-13 2021-08-13 Crowd counting method based on block weak labeling Active CN113780092B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110930559.4A CN113780092B (en) 2021-08-13 2021-08-13 Crowd counting method based on block weak labeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110930559.4A CN113780092B (en) 2021-08-13 2021-08-13 Crowd counting method based on block weak labeling

Publications (2)

Publication Number Publication Date
CN113780092A true CN113780092A (en) 2021-12-10
CN113780092B CN113780092B (en) 2022-06-10

Family

ID=78837663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110930559.4A Active CN113780092B (en) 2021-08-13 2021-08-13 Crowd counting method based on block weak labeling

Country Status (1)

Country Link
CN (1) CN113780092B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Dense population estimation method based on deep learning
CN106845621A (en) * 2017-01-18 2017-06-13 山东大学 Dense population number method of counting and system based on depth convolutional neural networks
CN111882517A (en) * 2020-06-08 2020-11-03 杭州深睿博联科技有限公司 Bone age evaluation method, system, terminal and storage medium based on graph convolution neural network
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network
CN112417288A (en) * 2020-11-25 2021-02-26 南京大学 Task cross-domain recommendation method for crowdsourcing software testing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Dense population estimation method based on deep learning
CN106845621A (en) * 2017-01-18 2017-06-13 山东大学 Dense population number method of counting and system based on depth convolutional neural networks
CN111882517A (en) * 2020-06-08 2020-11-03 杭州深睿博联科技有限公司 Bone age evaluation method, system, terminal and storage medium based on graph convolution neural network
CN112215129A (en) * 2020-10-10 2021-01-12 江南大学 Crowd counting method and system based on sequencing loss and double-branch network
CN112417288A (en) * 2020-11-25 2021-02-26 南京大学 Task cross-domain recommendation method for crowdsourcing software testing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WILLIAM: "Understanding YOLOv5 and YOLOv4 in one article", Zhihu *
XIALEI LIU等: "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank", 《IEEE CVF》 *

Also Published As

Publication number Publication date
CN113780092B (en) 2022-06-10

Similar Documents

Publication Publication Date Title
CN109344736B (en) Static image crowd counting method based on joint learning
Li et al. Adaptively constrained dynamic time warping for time series classification and clustering
WO2018023734A1 (en) Significance testing method for 3d image
CN110717411A (en) Pedestrian re-identification method based on deep layer feature fusion
Kim et al. Color–texture segmentation using unsupervised graph cuts
CN107240122A (en) Video target tracking method based on space and time continuous correlation filtering
Yan et al. Crowd counting via perspective-guided fractional-dilation convolution
Fang et al. Efficient and robust fragments-based multiple kernels tracking
Xian et al. Evaluation of low-level features for real-world surveillance event detection
Danelljan et al. Deep motion and appearance cues for visual tracking
Yi et al. Motion keypoint trajectory and covariance descriptor for human action recognition
CN110533100A (en) A method of CME detection and tracking is carried out based on machine learning
CN111709331A (en) Pedestrian re-identification method based on multi-granularity information interaction model
Mo et al. Background noise filtering and distribution dividing for crowd counting
CN106777159A (en) A kind of video clip retrieval and localization method based on content
CN114973112A (en) Scale-adaptive dense crowd counting method based on antagonistic learning network
KR20200010971A (en) Apparatus and method for detecting moving object using optical flow prediction
Aldhaheri et al. MACC Net: Multi-task attention crowd counting network
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
Zhu et al. Human detection under UAV: an improved faster R-CNN approach
Xu et al. Domain adaptation from synthesis to reality in single-model detector for video smoke detection
CN113780092B (en) Crowd counting method based on block weak labeling
Huang et al. Aerial image classification by learning quality-aware spatial pyramid model
Ma et al. PPDTSA: Privacy-preserving deep transformation self-attention framework for object detection
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant