CN113780092A - Crowd counting method based on block weak labeling - Google Patents
- Publication number: CN113780092A
- Application number: CN202110930559.4A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/241 — Pattern recognition; analysing; classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045 — Computing arrangements based on biological models; neural networks; architecture, e.g. interconnection topology; combinations of networks
- G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
- G06T3/40 — Geometric image transformations in the plane of the image; scaling of whole images or parts thereof, e.g. expanding or contracting
Abstract
The invention discloses a crowd counting method based on block weak-labeling information, comprising a training stage and a testing stage. In the training stage, block-level counts are predicted by a CPNC network, and label smoothing, feature smoothing, several data enhancement strategies and an auxiliary loss function are applied. These address the long-tail distribution of per-block counts and inaccurate region prediction; performance similar to that of density-map methods is obtained while using far less annotation information, together with good transferability.
Description
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a crowd counting method based on block weak labeling.
Background
Crowd counting is an important computer vision task whose goal is to count the number of people appearing in an image. In recent years, this task has played an increasingly important role in security monitoring, traffic analysis in public places, and similar applications. Unlike the object detection task, which uses bounding boxes, the current mainstream crowd counting methods mainly use density maps as the learning target. A density map is generated by convolving the point-annotation map with a symmetric probability density function, and therefore has the same L1 norm as the point-annotation map. Compared with the point-annotation map, the density map varies more continuously, which makes it easier for a network to learn. However, current density-map methods have the following problems:
Generating the density map depends on point annotation, and annotating targets one by one is costly at high density. Moreover, the size of the probability density kernel used to generate the density map should ideally depend on the scale of the target; annotating scale as well would further increase labeling cost. Meanwhile, the density map cannot avoid the annotation noise that frequently arises when targets are dense.
Predicting a density map requires preserving the size of the feature map throughout the network to maintain the map's resolution. On the density map, one pixel corresponds to at most one target; if every pixel corresponded to a whole target, the density map would degenerate into a point-annotation map. Only a small amount of down-sampling can therefore be performed in the network, which increases the consumption of computing resources.
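For context, the density-map formulation described above can be sketched in a few lines of NumPy. This toy example (not from the patent; point coordinates and kernel width are assumptions) places a normalized Gaussian at each annotated head point, so the map sums to the person count, the L1-norm property mentioned above:

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Place a normalized Gaussian at each annotated head point; the
    resulting map sums to the person count (same L1 norm as the
    point-annotation map)."""
    h, w = shape
    dmap = np.zeros((h, w))
    ys, xs = np.mgrid[0:h, 0:w]
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        g /= g.sum()              # each person contributes exactly 1
        dmap += g
    return dmap

dm = density_map([(10, 12), (30, 8), (50, 40)], (64, 64))  # (x, y) heads
print(round(dm.sum(), 6))         # 3.0, the number of people
```

Normalizing each kernel after border truncation is what keeps the sum equal to the count, which is exactly the property the patent's block-count labels preserve without needing point positions.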
Therefore, a method for counting people based on images is needed, which reduces the labeling cost and the noise influence, thereby satisfying the current use requirement for counting people based on visual images.
Disclosure of Invention
To solve these problems, the invention provides a crowd counting method based on block weak labeling. The method divides the crowd image to be counted into several blocks to form a block-count map; because the block-count map needs no precise position information during annotation, labeling cost is reduced. The network structure CPNC takes the crowd image as input and outputs the number of people in each block. A smoothing strategy, a data enhancement strategy and an auxiliary loss function are introduced respectively, so that performance similar to that of density-map methods is obtained with less annotation information, together with good transferability, thereby completing the invention.
The invention aims to provide a crowd counting method based on block weak labeling.
The training phase performs block prediction through a CPNC network, where CPNC is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC).
The CPNC network comprises a feature extraction network, a bottleneck network and a prediction head.
The feature extraction network reduces the size of a training picture using a Focus module to obtain a reduced-size feature map. The bottleneck network uses the cross-stage partial component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the cross-stage partial component splits the feature channels into two halves: one half continues to extract deeper features through a branch bottleneck network, while the other half passes only through a low-complexity convolution transform, and the two results are combined. Preferably, the cross-stage partial component is as shown in formula (1), where g is the branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity.
f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half; and [·, ·] denotes concatenation.
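A minimal NumPy sketch of the channel-split idea in formula (1); the stand-in functions for the deep branch g and the cheap transform h are assumptions for illustration, not the patent's actual networks:

```python
import numpy as np

def csp_layer(f, g, h):
    """One cross-stage partial layer, as in formula (1): split the
    channels in half, send one half through the expensive branch g and
    the other through the cheap transform h, then concatenate."""
    n = f.shape[0]                       # channel dimension
    deep = g(f[: n // 2])                # high-complexity branch
    cheap = h(f[n // 2:])                # low-complexity branch
    return np.concatenate([deep, cheap], axis=0)

# toy stand-ins for g and h (assumptions, not the patent's networks)
g = lambda x: np.tanh(x) * 2.0
h = lambda x: x * 0.5
f_prev = np.ones((8, 4))                 # 8 channels, 4 spatial positions
f_next = csp_layer(f_prev, g, h)
print(f_next.shape)                      # (8, 4): channel count preserved
```

The design point is that only half the channels pay for the deep branch, which is where the complexity reduction comes from.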
The prediction head adopts the Bi-FPN network from EfficientDet.
In the invention, with a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 9–21; n′ is a person count inside the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².

In the present invention, standard whitening and recoloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i: using a Gaussian kernel to weight the distances between the person counts of samples, the mean μ_i and covariance Σ_i of the current sample's features are used to compute the corresponding smoothed values μ*_i and Σ*_i, with which the whitened feature is recolored. The Gaussian kernel weighting is shown in formula (3):

μ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}    (3)

where y_i and y_{i′} are the person counts of the i-th and i′-th images respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
On the network design, the invention introduces a Bi-FPN module and uses Mosaic data enhancement to increase the number of small-size targets.
In the training process, blocks of the data are randomly erased or rearranged, and the features of the blocks at the rearranged positions are compared with the features of the original blocks. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the auxiliary loss on the unlabeled part, and is defined as shown in formula (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)
where x is an input image sample; y and y′ are the true and predicted person counts of the image, respectively; f(x) is the corresponding feature layer; and x_h is the image after x has been transformed by h. The transform h includes random block erasure and rearrangement (GD) and block feature scaling and aggregation (GS); λ is a balance coefficient. The prediction error uses the SmoothL1(c) loss, which keeps the gradient from becoming too large at the start of training; it is defined as shown in formula (5):

SmoothL1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise    (5)
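The loss in formula (4) can be illustrated with a small NumPy sketch. The standard Smooth L1 definition is assumed for formula (5), and `total_loss` with its toy inputs is an illustrative name, not the patent's implementation:

```python
import numpy as np

def smooth_l1(c):
    """Smooth L1 (the standard definition we assume for formula (5)):
    quadratic near zero so early-training gradients stay bounded,
    linear for large errors."""
    c = np.abs(c)
    return np.where(c < 1.0, 0.5 * c ** 2, c - 0.5)

def total_loss(y, y_pred, feat, feat_restored, lam=0.1):
    """Formula (4) in miniature: labeled prediction error plus the
    unlabeled auxiliary MSE between features before and after the
    transform h is undone (`total_loss` is an illustrative name)."""
    aux = np.mean((feat - feat_restored) ** 2)
    return smooth_l1(y - y_pred).mean() + lam * aux

y_true, y_hat = np.array([3.0, 10.0]), np.array([2.5, 14.0])
feat = np.zeros(8)        # toy feature layer f(x)
feat_r = np.zeros(8)      # toy restored feature h^-1 f(x_h)
print(total_loss(y_true, y_hat, feat, feat_r))  # 1.8125
```

The small error (0.5) falls on the quadratic branch and the large one (4.0) on the linear branch, which is what keeps early gradients bounded.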
In the testing stage, the trained CPNC network is applied to a crowd counting detection task to verify the effectiveness of the model.
The crowd counting method based on block weak labeling has the following beneficial effects:
(1) The invention designs a lightweight network structure, CPNC, which takes a picture as input and outputs the number of people in each block. By analyzing the long-tail distribution problem and the small-target problem on several datasets, a smoothing strategy and a data enhancement strategy are introduced respectively, and an auxiliary loss function is further introduced, reducing labeling cost and yielding a crowd counting method that needs only partial block annotation information.
(2) In the invention, a Gaussian function is used as the radial basis function to smooth B_n by convolution; after smoothing, the negative correlation between the smoothed B*_n and the error Er_n is significantly improved.
(3) In the invention, standard whitening and recoloring are introduced on the three feature layers input to the Bi-FPN to smooth the output features and make them consistent, so that the network model attends to blocks of all densities in a more balanced way instead of overfitting, according to the data distribution of the dataset, the blocks whose densities occur most frequently; the model therefore transfers better.
(4) The data enhancement strategy introduced by the invention effectively improves prediction accuracy on small-size targets and makes the target sizes in the enhanced images vary continuously.
(5) In the invention, an auxiliary loss function is constructed to mine the supervision information in unlabeled data, further reducing labeling cost.
Drawings
Fig. 1 shows a schematic diagram of a CPNC network architecture according to the present invention;
FIG. 2 illustrates an example of random block erasure and block rearrangement in the auxiliary loss function according to the present invention;
fig. 3 shows the application test results of CPNC and CPNC + + on UCF-QNRF in embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from the description.
The invention provides a crowd counting method based on block weak labeling.
The training phase performs block prediction through a CPNC network, where CPNC is a cross-stage partial network for crowd counting (CSPNet for Crowd Counting, CPNC), specifically described in the literature "WANG C Y, MARK LIAO H Y, WU Y H, et al. CSPNet: A new backbone that can enhance learning capability of CNN [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). 2020: 1571-".
The CPNC network includes a feature extraction network, a bottleneck network, and a prediction head, and a network structure thereof is shown in fig. 1.
The feature extraction network reduces the size of a training picture using a Focus module to obtain a reduced-size feature map. The bottleneck network uses the cross-stage partial component of CSPNet, which exploits cross-layer features efficiently and thereby reduces the complexity of processing the reduced-size feature map. Specifically, the cross-stage partial component splits the feature channels into two halves: one half continues to extract deeper features through a branch bottleneck network, while the other half passes only through a low-complexity convolution transform, and the two results are combined. Preferably, the cross-stage partial component is as shown in formula (1), where g is the branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity.

f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; and f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half.
The Focus module is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo, 2020".
The h is a 1 × 1 convolution module with low computational complexity, as described in the literature "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]// Advances in Neural Information Processing Systems. 2012".
According to different requirements, the branch bottleneck network can adopt network structures of different depths or complexities, such as ResNet, ResNeXt, NFNet and the like.
The ResNet is specifically described in the literature "HE K, ZHANG X, REN S, et al. deep residual learning for image Recognition [ C ]//2016IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 2016: 770-778".
The ResNeXt is described in particular in the document "XIE S, GIRSHICK R, DOLLÁR P, et al. Aggregated residual transformations for deep neural networks [C]// 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017".
The NFNet is described in detail in "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171".
The prediction head adopts the Bi-FPN network from EfficientDet. Unlike detection tasks, the targets in the crowd counting task are small, and small targets are not easily distinguished by high-level features. Meanwhile, to reduce network complexity, the Bi-FPN feature layers are set to 3–5 layers, preferably 3 layers, which enhances small-target recognition and yields the block-count map.
The EfficientDet and Bi-FPN networks are described in particular in the literature "TAN M, LE Q V. Efficientnetv2: Smaller models and faster training [J/OL]. CoRR, 2021, abs/2104.00298".
In image processing for crowd counting, label imbalance easily biases the model: statistics over datasets show that the number of blocks drops rapidly as density increases, presenting a long-tail distribution. In addition, as density increases, the average size of the targets within a block becomes smaller, increasing the difficulty of counting. Together, these make prediction errors larger for high-density blocks; to address this model bias, a sample-balancing strategy is designed to counter the influence of the long-tail distribution.
In the invention, with a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 9–21, preferably 12–18; n′ is a person count inside the window; N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ²; and B*_n is greater than 0.

After the label-smoothing strategy, the negative correlation between B*_n and Er_n, the average error of all blocks whose person count is exactly n, is significantly improved. Experiments show that on NWPU-Crowd the Pearson correlation between B*_n and Er_n is −0.72, on UCF-QNRF it is −0.79, and all B*_n are greater than 0.

Here k_i denotes the number of people in the i-th block, n is a person count, er_i is the prediction error of the i-th block whose person count is k_i, and Er_n is the average of er_i over all blocks with k_i = n.
The UCF-QNRF is specifically described in the document "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds [C]// European Conference on Computer Vision (ECCV). 2018: 544-".
The NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-".
Meanwhile, to make the output features consistent, the invention introduces standard whitening and recoloring on the feature-layer input of the Bi-FPN to smooth the output feature z_i: using a Gaussian kernel to weight the distances between the person counts of samples, the mean μ_i and covariance Σ_i of the current sample's features are used to compute the corresponding smoothed values μ*_i and Σ*_i, with which the whitened feature is recolored. The Gaussian kernel weighting is shown in formula (3):

μ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ*_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}    (3)

where y_i and y_{i′} are the person counts of the i-th and i′-th images respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
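The whitening-and-recoloring step can be sketched as follows. This is an illustrative reading of formula (3) that uses per-bin scalar variances instead of full covariances, and the row normalization of the kernel is an added assumption for numerical stability:

```python
import numpy as np

def smoothed_stats(labels, mus, varis, sigma=2.0):
    """Kernel-smooth per-label-bin feature statistics (our reading of
    formula (3)); rows are normalized, which is an added assumption."""
    d = labels[:, None] - labels[None, :]
    k = np.exp(-d ** 2 / (2 * sigma ** 2))
    k /= k.sum(axis=1, keepdims=True)
    return k @ mus, k @ varis

def fds(z, mu, var, mu_s, var_s, eps=1e-8):
    """Whiten a feature with its own bin statistics, then recolor with
    the smoothed statistics (scalar variance instead of covariance)."""
    return np.sqrt(var_s / (var + eps)) * (z - mu) + mu_s

labels = np.array([0.0, 1.0, 2.0, 10.0])   # per-bin person counts
mus = np.array([0.0, 0.2, 0.4, 5.0])       # per-bin feature means
varis = np.array([1.0, 1.0, 1.0, 4.0])     # per-bin feature variances
mu_s, var_s = smoothed_stats(labels, mus, varis)
print(round(float(fds(1.5, mus[1], varis[1], mu_s[1], var_s[1])), 3))  # 1.5
```

Bins with nearby person counts pull each other's statistics together, while the distant bin (count 10) contributes almost nothing; features in well-populated, similar bins are therefore changed only slightly.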
Through these two smoothing strategies, the model attends to blocks of all densities in a more balanced way and does not, following the dataset's data distribution, overfit the blocks whose densities occur most frequently; the model therefore transfers better.
To address the small-target problem, the invention introduces the Bi-FPN network in the network design and uses Mosaic data enhancement to increase the number of small-size targets. Meanwhile, to further reduce labeling cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function for effective training.
Said Mosaic is described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo,2020.https:// doi.org/10.5281/zenodo.4154370".
During training, several pictures are combined using the Mosaic algorithm: preferably, several times the batch size of pictures are sampled and randomly divided into batch-size groups. Each time an enhanced picture is generated, the number n_i of real pictures in the i-th group is used (i = 1, 2, …). The canvas of the enhanced picture undergoes n_i − 1 divisions into n_i regions, the j-th region obtained from the i-th group being denoted R_i^j. At the a-th division of the i-th enhanced picture, the largest region is selected from {R_i^j} and divided into two parts, where a is an integer with 1 ≤ a ≤ n_i − 1; the divisions are performed alternately in horizontal and vertical order. Finally, the real pictures in the i-th group are sorted in ascending order of person count, giving the real picture set; the regions {R_i^j} are sorted in ascending order of area; and the picture ranked at the j-th position (j = 1, 2, …, n_i) is scaled to fit the region ranked at the j-th position.
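The largest-region split with alternating horizontal and vertical cuts can be sketched as follows (a simplified reading; the region bookkeeping and the even split point are assumptions):

```python
def partition(w, h, n_parts):
    """Split a canvas into n_parts rectangles: repeatedly take the
    largest region and cut it in two, alternating horizontal and
    vertical cuts (a simplified reading of the Mosaic grouping above)."""
    regions = [(0, 0, w, h)]                 # (x, y, width, height)
    for a in range(n_parts - 1):
        regions.sort(key=lambda r: r[2] * r[3], reverse=True)
        x, y, rw, rh = regions.pop(0)
        if a % 2 == 0:                       # horizontal cut
            cut = rh // 2
            regions += [(x, y, rw, cut), (x, y + cut, rw, rh - cut)]
        else:                                # vertical cut
            cut = rw // 2
            regions += [(x, y, cut, rh), (x + cut, y, rw - cut, rh)]
    return regions

regs = partition(640, 640, 4)
print(len(regs), sum(r[2] * r[3] for r in regs))  # 4 409600
```

In the patent's scheme, the n_i real pictures sorted ascending by person count would then be scaled into these regions sorted ascending by area.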
To further reduce labeling cost, the invention mines the supervision information in unlabeled data and constructs an auxiliary loss function to train the network effectively. In the training process, blocks of the data are randomly erased or rearranged, and the features of the blocks at the rearranged positions are compared with the features of the original blocks. In addition, four adjacent blocks can be aggregated into one block by scaling, and the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes use MSE for supervision as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the auxiliary loss on the unlabeled part, and is defined as shown in formula (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹f(x_h))²    (4)
where x is an input image sample; y and y′ are the true and predicted person counts of the image, respectively; f(x) is the corresponding feature layer; and x_h is the image after x has been transformed by h. The transform h includes random block erasure and rearrangement (GD) and block feature scaling and aggregation (GS); λ is a balance coefficient, as described in the literature "LIU X, VAN DE WEIJER J, BAGDANOV A D. Leveraging unlabeled data for crowd counting by learning to rank [C]// IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 7661-". The prediction error uses the SmoothL1(c) loss, which keeps the gradient from becoming too large at the start of training; it is defined as shown in formula (5):
SmoothL1(c) = 0.5c², if |c| < 1;  |c| − 0.5, otherwise    (5)

where c is the independent variable.
In the testing stage, the trained CPNC network is applied to a crowd counting detection task to verify the effectiveness of the model.
The invention provides a crowd counting method based on block weak-labeling information that trains an improved CPNC network and does not depend on precise position information, so labeling cost is low. The invention provides several improvement strategies, including label smoothing, feature smoothing, multiple data enhancement strategies and an auxiliary loss function; these address the long-tail effect of per-block counts and inaccurate region prediction, achieve performance similar to that of density-map methods while using less annotation information, and show good transferability.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
The CPNC network using the data enhancement and auxiliary loss functions is denoted CPNC++.
In the CPNC network:
(1) firstly, the Focus module is used for reducing the size of a training picture to obtain a feature map with reduced size. Focus modules are described in particular in the document "JOCHER G, STOKEN A, BOROVEC J, et al. ultralytics/yolov5: v3.1-Bug Fixes and Performance Improvements [ CP/OL ]. Zenodo, 2020".
The reduced-size feature map is input to the cross-stage partial component of CSPNet, which serves as the bottleneck network and proceeds as formula (1):

f_i = [g(f_{i−1}[0 : n_{i−1}/2]), h(f_{i−1}[n_{i−1}/2 : n_{i−1}])]    (1)

where g is the branch bottleneck network NFNet-F3 with high computational complexity, specifically described in the document "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171. https://arxiv.org/abs/2102.06171"; and h is a low-complexity convolution module with a 1 × 1 kernel, whose convolution operation is described in the document "Krizhevsky A, Sutskever I, Hinton G E. ImageNet Classification with Deep Convolutional Neural Networks [C]// Advances in Neural Information Processing Systems. 2012".

n_{i−1} is the channel number of the layer-(i−1) feature f_{i−1}; i denotes the i-th layer of the cross-stage partial component; f_i is the feature produced by the i-th layer; f_{i−1}[0 : n_{i−1}/2] is the first half of the channels of f_{i−1}; and f_{i−1}[n_{i−1}/2 : n_{i−1}] is the second half.
(2) The output of the bottleneck network is input to the Bi-FPN network of EfficientDet, which serves as the prediction head, to obtain the output result.
The EfficientDet and Bi-FPN networks are described in particular in the literature "TAN M, LE Q V. Efficientnetv2: Smaller models and faster training [J/OL]. CoRR, 2021, abs/2104.00298".
Using a Gaussian function as the radial basis function, the number B_n of blocks whose person count is n is label-smoothed (LDS) by convolution, and the reciprocal of the smoothed quantity B*_n is used as the weight w of the corresponding blocks; the specific operation is shown in formula (2):

B*_n = Σ_{n′ ∈ ζ} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B*_{k_i}    (2)

where k_i represents the number of people in the i-th block; ζ is the convolution window, whose size is 15; n′ is a person count inside the window; and N(n − n′; 0, σ²) is the value at n − n′ of a normal distribution with mean 0 and variance σ².

Here k_i denotes the number of people in the i-th block, n is a person count, er_i is the prediction error of the i-th block, and Er_n is the average of er_i over all blocks with k_i = n.

On NWPU-Crowd, the Pearson correlation between B*_n and Er_n is −0.72, and on UCF-QNRF it is −0.79; by contrast, the correlation between the unsmoothed B_n and Er_n is only −0.10 on NWPU-Crowd and −0.11 on UCF-QNRF, so the negative correlation between B*_n and Er_n is significantly improved.
(3) In addition, standard whitening and recoloring are introduced on the feature layer input of the Bi-FPN to smooth the output feature ziPerforming feature smoothing (FDS) with a smoothing value ofWeighting the distance between the population values of the samples by using a Gaussian kernel function to obtain the mean value mu of the characteristics of the current sampleiSum covariance ∑iCalculate a corresponding smoothing value ofAndthe Gaussian kernel function is shown as formula (3):
where y_i and y_{i′} are the people counts of the i-th and i′-th images, respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
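A simplified FDS statistics sketch follows. It replaces the full covariance Σ_i with a per-sample scalar variance for brevity, and `fds_statistics` and `sigma` are illustrative assumptions:

```python
import numpy as np

def fds_statistics(features, y, sigma=2.0):
    """FDS sketch (Eq. (3)): for each sample i, re-estimate the feature
    mean mu_i and (here, scalar) variance Sigma_i as a Gaussian-kernel-
    weighted average over samples with nearby people counts y_i'."""
    mu = features.mean(axis=1)    # per-sample feature mean
    var = features.var(axis=1)    # per-sample variance (covariance simplified)
    y = np.asarray(y, dtype=float)
    smoothed_mu, smoothed_var = [], []
    for yi in y:
        k = np.exp(-(yi - y) ** 2 / (2 * sigma ** 2))   # Gaussian kernel weights
        k /= k.sum()
        smoothed_mu.append(k @ mu)
        smoothed_var.append(k @ var)
    return np.array(smoothed_mu), np.array(smoothed_var)

feats = np.array([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]])
y = [2, 3, 50]                    # people counts of the three images
mu_s, var_s = fds_statistics(feats, y)
# the isolated sample (count 50) keeps its own statistics; nearby
# samples (counts 2 and 3) pull each other's means together
```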
In the CPNC++ network, the following is added on the basis of the CPNC network:

(1) Multiple pictures are combined using a Mosaic algorithm: 4 × batch-size pictures are randomly divided into batch-size groups. Each time an enhanced picture is generated, the number n_i of real pictures in the i-th group is used (i = 1, 2, 3, 4): the enhanced picture is divided by n_i − 1 splits into n_i regions, where r_i^j denotes the j-th region obtained for the i-th group of pictures. For the a-th split, where a is an integer with 1 ≤ a ≤ n_i − 1, the largest of the current regions is selected and divided into two parts; the splits are performed alternately in horizontal and vertical order. Finally, the real pictures in the i-th group are sorted in ascending order of people count, the regions r_i^j are sorted in ascending order of area, and the picture ranked at the j-th position (j = 1, 2, …, n_i) is scaled to fit the region ranked at the j-th position.
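The region-splitting step above can be sketched as follows. Whether the alternation starts with a horizontal split, and the name `split_regions`, are illustrative assumptions:

```python
def split_regions(width, height, n):
    """Mosaic layout sketch: perform n-1 alternating horizontal/vertical
    splits, always bisecting the currently largest region, yielding n
    regions (x, y, w, h) that tile the canvas."""
    regions = [(0, 0, width, height)]
    for a in range(n - 1):
        # pick the largest current region
        idx = max(range(len(regions)), key=lambda j: regions[j][2] * regions[j][3])
        x, y, w, h = regions.pop(idx)
        if a % 2 == 0:   # assumed order: horizontal split first, then vertical
            regions += [(x, y, w, h // 2), (x, y + h // 2, w, h - h // 2)]
        else:
            regions += [(x, y, w // 2, h), (x + w // 2, y, w - w // 2, h)]
    return regions

regions = split_regions(1024, 1024, 4)   # 3 splits -> 4 tiling regions
```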
(2) During training, blocks of the data are randomly erased or have their positions rearranged, and the features of the block at the corresponding position after the operation are compared with the features of the original block. In addition, four adjacent blocks can be aggregated into one block by scaling; the count of the aggregated block should equal the sum of the target counts in the original four blocks, as shown in Fig. 2. Both processes are supervised with MSE as an auxiliary loss function. The overall loss function consists of the prediction error on the labeled part and the label-free auxiliary loss, and is defined in Equation (4):
L(x, y, y′) = SmoothL1(y − y′) + λ(f(x) − h⁻¹(f(x_h)))²   (4)
where x is an input image sample, y and y′ are the true and predicted people counts of the input image, f(x) is the corresponding feature layer, and x_h is the image after the transform h. h comprises random block erasing and rearrangement (GD) and block feature scaling and aggregation (GS). The erasing operation follows the literature "Pathak D, Krähenbühl P, Donahue J, et al. Context Encoders: Feature Learning by Inpainting [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 2536-2544"; the rearrangement operation follows "Noroozi M, Favaro P. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles [C]// European Conference on Computer Vision. 2016: 69-84"; the scaling and aggregation operations follow "Noroozi M, Pirsiavash H, Favaro P. Representation Learning by Learning to Count [C]// International Conference on Computer Vision. 2017: 5898-5906". λ is 0.0001.
The prediction error uses the SmoothL1(c) loss, which ensures the gradient is not too large at the beginning of training. It is defined in Equation (5):

SmoothL1(c) = 0.5c² if |c| < 1; |c| − 0.5 otherwise   (5)

where c is the argument.
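Equations (4) and (5) can be sketched together in NumPy. The function names and the toy feature arrays are illustrative; in the patent, f(x) and h⁻¹(f(x_h)) are network feature maps:

```python
import numpy as np

def smooth_l1(c):
    """Smooth L1 (Eq. (5)): quadratic near zero, linear beyond |c| = 1,
    which keeps early-training gradients bounded."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) < 1, 0.5 * c ** 2, np.abs(c) - 0.5)

def total_loss(y, y_pred, f_x, f_x_restored, lam=1e-4):
    """Eq. (4) sketch: labeled prediction error plus the unlabeled
    auxiliary term comparing the original features f(x) with the
    inverse-transformed features h^-1(f(x_h))."""
    return smooth_l1(y - y_pred).sum() + lam * np.sum((f_x - f_x_restored) ** 2)
```

With identical features the auxiliary term vanishes, so `total_loss` reduces to the Smooth L1 counting error.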
The CPNC network and the CPNC++ network were evaluated on three public open-view dense-crowd datasets, ShanghaiTech, UCF-QNRF, and NWPU-Crowd, and compared with existing density-map-based methods. For a block ((w₁, h₁), (w₂, h₂)), a point-annotation dataset can be converted into a block-annotation dataset by Equation (11):

B((w₁, h₁), (w₂, h₂)) = Σ_{w=w₁}^{w₂} Σ_{h=h₁}^{h₂} Y(w, h)   (11)
where h₁ and h₂ are the ordinates of the upper-left and lower-right corners of the block, respectively, w₁ is the abscissa of the upper-left corner, w₂ is the abscissa of the lower-right corner, Y is the point label information (point label values), and Y(w, h) is the point label value at the position with coordinates (w, h).
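The conversion of Equation (11) is a simple sum over the block. In this sketch the annotation map is assumed to be a 2-D array indexed as `Y[h, w]` with a 1 at each annotated head; `block_count` is an illustrative name:

```python
import numpy as np

def block_count(Y, w1, h1, w2, h2):
    """Eq. (11) sketch: the weak label of block ((w1, h1), (w2, h2)) is
    the sum of the point-annotation map Y over that block."""
    return int(Y[h1:h2, w1:w2].sum())

Y = np.zeros((8, 8), dtype=int)
Y[1, 1] = 1     # one head inside the upper-left block
Y[6, 6] = 1     # one head outside it
n = block_count(Y, 0, 0, 4, 4)
```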
Unless otherwise specified, the experiments were performed on RTX 3090 GPUs with an input size of 1024 × 1024 and a batch size of 16 per GPU, using synchronized batch normalization during training. Training runs for 500 epochs with Adam as the optimizer and a fixed learning rate of 10⁻⁵. The trigger probability of each data enhancement is 0.3. During testing, if the picture size exceeds the training input size, predictions are averaged with an overlapping sliding window of size 1024 × 1024 and overlap ratio 0.25. If the picture is smaller than the training input size, its edges are zero-padded to a multiple of 64. Unless stated otherwise, the network backbone is NFNet-f3; with this backbone each inference takes only 0.06 seconds, whereas DM-Count takes 0.15 seconds on the same machine.
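The sliding-window placement along one axis can be sketched as below; `window_starts` is an illustrative name, and clamping the last window to the border is an assumption about how full coverage is achieved:

```python
def window_starts(length, win=1024, overlap=0.25):
    """Sketch of overlapping-sliding-window inference: generate window
    start offsets along one axis with the given overlap ratio, clamping
    the last window so it ends exactly at the image border. Images
    smaller than the window are assumed to be handled by zero-padding."""
    stride = int(win * (1 - overlap))
    starts = list(range(0, max(length - win, 0) + 1, stride))
    if starts[-1] + win < length:    # ensure the far border is covered
        starts.append(length - win)
    return starts

starts = window_starts(2000)         # e.g. a 2000-pixel-wide image
```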
The ShanghaiTech dataset is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
The UCF-QNRF dataset is described in detail in the literature "IDREES H, TAYYAB M, ATHREY K, et al. Composition loss for counting, density map estimation and localization in dense crowds [C]// European Conference on Computer Vision (ECCV). 2018: 544-".
The NWPU-Crowd dataset is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149".
The NFNet-f3 backbone is described in detail in the literature "BROCK A, DE S, SMITH S L, et al. High-performance large-scale image recognition without normalization [J/OL]. CoRR, 2021, abs/2102.06171".
DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting [C]// Advances in Neural Information Processing Systems. 2020".
In addition, to demonstrate that the CPNC++ network can effectively utilize unlabeled data, 30% of the blocks and their people-count labels were randomly selected from the training data as supervision, the remaining 70% of the data were used as unlabeled data, and the resulting model is denoted CPNC++ (30%).
TABLE 1 Comparison of the CPNC network, CPNC++ network, and CPNC++ (30%) model of the present invention with existing methods on the open datasets
MCNN is described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
SCNN is described in detail in the literature "SAM D B, SURYA S, BABU R V. Switching convolutional neural network for crowd counting [C]// IEEE Conference on Computer Vision and Pattern Recognition: 2017-January. 2017".
IG-NN is described in detail in the literature "SAM D B, SAJJAN N, BABU R V, et al. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 3618-3626".
CSRNet is described in detail in the literature "LI Y, ZHANG X, CHEN D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2018: 1091-1100".
SFCN-101 is described in detail in the literature "WANG Q, GAO J, LIN W, et al. Learning from synthetic data for crowd counting in the wild [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2019".
CAN is described in detail in the literature "LIU W, SALZMANN M, FUA P. Context-aware crowd counting [C]// IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2019: 5094-5103".
DM-Count is described in detail in the literature "WANG B, LIU H, SAMARAS D, et al. Distribution matching for crowd counting [C]// Advances in Neural Information Processing Systems. 2020".
SDCNet is described in detail in the literature "XIONG H, LU H, LIU C, et al. From open set to closed set: Counting objects by spatial divide-and-conquer [C]// 2019 IEEE/CVF International Conference on Computer Vision (ICCV). 2019: 8361-8370".
The mean absolute error (MAE) and mean squared error (MSE) are used as evaluation metrics, defined in Equations (12) and (13):

MAE = (1/N) Σ_{i=1}^{N} |C_i − C_i^{GT}|   (12)

MSE = √( (1/N) Σ_{i=1}^{N} (C_i − C_i^{GT})² )   (13)

where N is the total number of pictures, C_i is the predicted count, and C_i^{GT} is the true count.
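The two metrics can be computed directly; note that the crowd-counting literature calls the root of the mean squared error "MSE", which this sketch follows:

```python
import numpy as np

def mae(pred, gt):
    """Eq. (12): mean absolute counting error over N test images."""
    return np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float)))

def mse(pred, gt):
    """Eq. (13): root mean squared counting error (conventionally
    reported as 'MSE' in crowd counting)."""
    return np.sqrt(np.mean((np.asarray(pred, float) - np.asarray(gt, float)) ** 2))

m1 = mae([10, 20], [12, 26])   # mean of |−2| and |−6|
m2 = mse([10, 20], [12, 26])   # sqrt of mean of 4 and 36
```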
The ShanghaiTech dataset consists of two parts, STA and STB; STA is more densely populated and harder than STB. The official training/test split was used in the experiments, as described in the literature "ZHANG Y, ZHOU D, CHEN S, et al." As the experimental results in Table 1 show, the MAE and MSE of CPNC++ on STA are 20.9% and 20.4% lower, respectively, than those of CPNC. Compared with density-map-based methods, CPNC++ achieves similar performance on both STA and STB while using less supervision information.
The ShanghaiTech dataset, STA, and STB are described in detail in the literature "ZHANG Y, ZHOU D, CHEN S, et al. Single-image crowd counting via multi-column convolutional neural network [C]// IEEE Conference on Computer Vision and Pattern Recognition. 2016: 589-597".
NWPU-Crowd is a large dataset of 5109 high-resolution pictures; its training, validation, and test sets contain 3109, 500, and 1000 pictures, respectively. The dataset also provides bounding-box labels, from which density-map-based crowd counting methods can estimate more accurate Gaussian kernel sizes, but this information is not used here in training or testing. The performance comparison with previous methods is shown in Table 1. Among the density-map methods, DM-Count and SDCNet achieve the best MAE and MSE, respectively, while CPNC++ achieves similar performance without using bounding-box or point-label information.
NWPU-Crowd is described in detail in the literature "WANG Q, GAO J, LIN W, et al. NWPU-Crowd: A large-scale benchmark for crowd counting and localization [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(6): 2141-2149".
UCF-QNRF is a large crowd counting dataset on which the advantages of the proposed CPNC and CPNC++ networks are more pronounced. It consists of 1,535 pictures containing a total of 1.25 million head annotations. Because the pictures have higher resolution, the sliding-window strategy described above is used. In the experiments, the officially divided 1201 pictures were used as the training set and 334 as the test set. As the results in Table 1 show, CPNC++ improves on CPNC by 13.4% and surpasses DM-Count, the best density-map-based method, indicating the effectiveness of the proposed training strategy.
CPNC++ (30%) achieved good performance on each dataset over 5 random selections. As Table 1 shows, the performance of CPNC++ (30%) is close to that of some density-map-based approaches such as MCNN. On UCF-QNRF, the MAE of CPNC++ (30%) is 105.3, only 21.1 higher than that of CPNC++, approaching most density-map methods that use full point-level labeling. Representative results of CPNC and CPNC++ on UCF-QNRF are shown in Fig. 3, where GT denotes the true crowd count. CPNC++ improves effectively on CPNC: it predicts more accurately when targets are denser (rows 2 and 3) or smaller (row 4), while performance does not degrade on sparse-crowd data (row 1).
Example 2
Existing datasets are limited samples of real-world scenes, and previous methods may overfit them. CPNC++ generalizes better because it balances the imbalanced distribution. To verify this, the model was trained on NWPU-Crowd, selected according to the NWPU-Crowd validation set, and tested on STA, UCF-QNRF, and JHU-Crowd, in comparison with DM-Count, currently the method with the best overall performance. As the results in Table 2 show, CPNC++ transfers better when tested across datasets.
TABLE 2 MAE for cross-dataset test of CPNC + + and DM-Count
Example 3
To examine the impact of each enhancement strategy, ablation experiments were performed on UCF-QNRF. As shown in Table 3, LDS and FDS refer to label distribution smoothing and feature distribution smoothing, respectively; MON refers to the proposed data enhancement strategy (Mosaic enhancement); GD refers to block erasing and block rearrangement; and GS refers to block aggregation. The experiments show that each proposed strategy alone improves the accuracy of block-prediction-based crowd counting, and that the strategies are mutually compatible.
TABLE 3 Effect measurement for each enhancement strategy
The invention has been described in detail with reference to specific embodiments and/or illustrative examples and the accompanying drawings, which, however, should not be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.
Claims (10)
1. A crowd counting method based on block weak labeling, characterized in that the method comprises a training stage and a testing stage, wherein the training stage performs block prediction through a CPNC network, the CPNC network being a cross-stage partial network for crowd counting.
2. The method of claim 1, wherein the CPNC network comprises a feature extraction network, a bottleneck network, and a predictive head.
3. The method of claim 2,
the feature extraction network reduces the size of a training picture by using a Focus module to obtain a feature map with reduced size;
the bottleneck network uses cross-layer half-network components in CSPNet;
the prediction head adopts a Bi-FPN network as in EfficientDet.
4. The method of claim 3, wherein the cross-layer half-network component splits the features into two parts along the channel dimension, wherein one part continues to extract deeper features through a branch bottleneck network, the other part passes only through a low-complexity convolution transform, and the results of the two are combined,
preferably, the cross-layer half-network component is as shown in formula (1), where g is a branch bottleneck network with high computational complexity and h is a 1 × 1 convolution module with low computational complexity,
fi=[g(fi-1[0:ni-1/2]),h(fi-1[ni-1/2:ni-1])] (1)
where n_{i-1} is the dimension of the layer-(i−1) feature f_{i-1} of the cross-layer half-network component, i denotes the i-th layer of the component, f_i is the feature produced by the i-th layer, f_{i-1}[0 : n_{i-1}/2] is the first half of the elements of f_{i-1}, and f_{i-1}[n_{i-1}/2 : n_{i-1}] is the second half of the elements of f_{i-1}.
5. The method of claim 3, wherein the Bi-FPN network feature layer is set to 3-5 layers, preferably 3 layers, to enhance small object discrimination and obtain the block number map.
6. The method according to any one of claims 1 to 5, characterized in that a Gaussian function is used as a radial basis function, the number of blocks B_n containing n people is smoothed by convolution, and the reciprocal of the smoothed quantity B̃_n is used as the weight w of the corresponding block, the specific operation being as shown in formula (2):

B̃_n = Σ_{n′ ∈ window(n, ζ)} N(n − n′; 0, σ²) · B_{n′},  w_i = 1 / B̃_{k_i}   (2)
7. The method according to any one of claims 1 to 6, characterized in that standard whitening and recoloring are introduced on the feature-layer input of the Bi-FPN to smooth the output feature z_i, obtaining the smoothed value z̃_i; a Gaussian kernel function weights the distances between the people counts of samples to obtain, from the mean μ_i and covariance Σ_i of the current sample's features, the corresponding smoothed values μ̃_i and Σ̃_i, the specific formula being shown in (3):

μ̃_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · μ_{i′},  Σ̃_i = Σ_{i′} N(y_i − y_{i′}; 0, σ²) · Σ_{i′}   (3)
where y_i and y_{i′} are the people counts of the i-th and i′-th images, respectively; N(y_i − y_{i′}; 0, σ²) is the value at y_i − y_{i′} of a normal distribution with mean 0 and variance σ²; Σ_{i′} is the covariance of the features of the i′-th sample; and μ_{i′} is the mean of the features of the i′-th sample.
8. Method according to one of the claims 1 to 7, wherein the number of small-sized targets is increased using Mosaic data enhancement.
9. Method according to one of claims 1 to 8, characterized in that a secondary loss function is used in the method, which is defined as shown in equation (4):
L(x,y,y′)=Smooth L1(y-y′)+λ(f(x)-h-1f(xh))2 (4)
where x is an input image sample, y and y′ are the true and predicted people counts of the image, f(x) is the corresponding feature layer, and x_h is the image after the transform h; h comprises random block erasing and rearrangement, and block feature scaling and aggregation; λ is a balance coefficient.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110930559.4A CN113780092B (en) | 2021-08-13 | 2021-08-13 | Crowd counting method based on block weak labeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113780092A true CN113780092A (en) | 2021-12-10 |
CN113780092B CN113780092B (en) | 2022-06-10 |
Family
ID=78837663
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110930559.4A Active CN113780092B (en) | 2021-08-13 | 2021-08-13 | Crowd counting method based on block weak labeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113780092B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114758288A (en) * | 2022-03-15 | 2022-07-15 | 华北电力大学 | Power distribution network engineering safety control detection method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104992223A (en) * | 2015-06-12 | 2015-10-21 | 安徽大学 | Intensive population estimation method based on deep learning |
CN106845621A (en) * | 2017-01-18 | 2017-06-13 | 山东大学 | Dense population number method of counting and system based on depth convolutional neural networks |
CN111882517A (en) * | 2020-06-08 | 2020-11-03 | 杭州深睿博联科技有限公司 | Bone age evaluation method, system, terminal and storage medium based on graph convolution neural network |
CN112215129A (en) * | 2020-10-10 | 2021-01-12 | 江南大学 | Crowd counting method and system based on sequencing loss and double-branch network |
CN112417288A (en) * | 2020-11-25 | 2021-02-26 | 南京大学 | Task cross-domain recommendation method for crowdsourcing software testing |
Non-Patent Citations (2)
Title |
---|
WILLIAM: "一文读懂YOLOV5与YOLOV4" (Understanding YOLOv5 and YOLOv4 in One Article), 《知乎》 (Zhihu) *
XIALEI LIU et al.: "Leveraging Unlabeled Data for Crowd Counting by Learning to Rank", 《IEEE CVF》 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||