CN112818884B

CN112818884B - Crowd counting method

Info

Publication number: CN112818884B
Application number: CN202110169724.9A
Authority: CN
Inventors: 李国荣; 刘心岩; 苏荔; 黄庆明
Original assignee: University of Chinese Academy of Sciences
Current assignee: University of Chinese Academy of Sciences
Priority date: 2021-02-07
Filing date: 2021-02-07
Publication date: 2021-11-30
Anticipated expiration: 2041-02-07
Also published as: CN112818884A

Abstract

The invention discloses a crowd counting method, which comprises a training stage and a testing stage, wherein the training stage comprises the following steps: step 1, obtaining the similarity between training images and selecting a training sample; step 2, clustering the selected training samples, and storing a group of weights for each class; and 3, training a weight retrieval module. The crowd counting method using storage enhancement disclosed by the invention constructs a multi-weight network, utilizes the relation among samples, improves the generalization capability of a single simple network with a plurality of parameter sets, can be integrated with most of the existing methods, and obviously improves the performance of the single simple network.

Description

Crowd counting method

Technical Field

The invention belongs to the technical field of computer vision, and particularly relates to a crowd counting method.

Background

The crowd counting task is used to estimate the number of objects in a picture, such as pedestrians, vehicles, animals, etc. This task has attracted increasing attention due to its wide application in a variety of scenarios, such as airports, stations, shopping malls, or people walking where people counting is important. Particularly in epidemic pandemics, the possibility of infection by viruses is remarkably increased by crowds, and the detection and warning of crowds gathering in public areas play an important role in controlling the spread of the epidemic.

Existing population counting methods have achieved reliable performance in certain application contexts, such as uniform density or fixed viewing angle. However, without limitation, the performance of existing methods is greatly impaired, mainly because the unconstrained target scene is complex in many ways, including different viewing angles, variable scales, different densities, and a wide range of brightness and contrast, etc., which ultimately results in significant changes in the visual characteristics of the objects being counted.

Most existing approaches attempt to handle unconstrained situations using a single network with multiple channels, with different channels being used to handle data of different scales. However, related research has indicated that it is difficult to solve the population count problem with a single network, suggesting the use of multiple networks, where each network is responsible for a particular size or density. For example, Switch-CNN designs a Switch structure before CNN of multiple channels to find the optimal channel for a given picture, but because it is impractical to design multiple channels manually, Switch-CNN can only handle limited scale changes.

Furthermore, to handle the cross-scene counting task, the prior art first pre-trains the model on a training data set, uses a coarse density map and a given perspective to find training samples similar to and fine with the test image during the inference phase, and then adjusts the pre-trained model on these samples to obtain a customized model of the test image. However, perspective views are not readily available, are not commonly available, and the similarity between density maps does not describe the complex correlation between images.

It can be seen that most of the existing population counting methods adopt a complex structure and a backbone network with a large parameter quantity to enhance the generalization of population counting, but when the population counting methods are tested on a large-scale data set, the improvement brought by the methods cannot be satisfied. Therefore, there is a need to provide a new people counting method to solve the above problems.

Disclosure of Invention

In order to overcome the above problems, the present inventors have conducted intensive studies and, as a result, found that: establishing a crowd counting network adopting a plurality of groups of parameters by analyzing the relation among the samples, wherein the network loads different parameters for different samples; meanwhile, a task-driven similarity is proposed, which is based on the mutual enhancement relationship between samples during fine tuning, similar samples are clustered into a cluster according to the similarity, each cluster is used for acquiring a group of specific parameters, the method utilizes the relationship between samples, improves the generalization capability of a single simple network with a plurality of parameter sets, can be integrated with most of the existing methods, and remarkably improves the performance of the single simple network, thereby completing the invention.

Specifically, the present invention aims to provide the following:

in a first aspect, there is provided a method of population counting using memory enhancement, the method comprising a training phase and a testing phase, the training phase comprising the steps of:

step 1, obtaining the similarity between training images and selecting a training sample;

step 2, clustering the selected training samples, and storing a group of weights for each class;

and 3, training a weight retrieval module.

In a second aspect, a computer readable storage medium is provided, storing a program for people counting using storage enhancement, which program, when executed by a processor, causes the processor to carry out the steps of the method for people counting using storage enhancement.

In a third aspect, a computer device is provided, comprising a memory storing a program for crowd counting using memory enhancement, and a processor, wherein the program, when executed by the processor, causes the processor to perform the steps of the method for crowd counting using memory enhancement.

The invention has the advantages that:

(1) the crowd counting method using storage enhancement provided by the invention constructs a multi-weight network, utilizes the relation among samples, improves the generalization capability of a single simple network with a plurality of parameter sets, can be integrated with most of the existing methods, and obviously improves the performance of the single simple network;

(2) according to the population counting method using storage enhancement, provided by the invention, a plurality of clusters of training data are obtained by adopting the mutual fine-tuning similarity and heuristic clustering method, and each cluster is used for learning a group of parameters, so that the method is beneficial to testing images similar to the clusters;

(3) according to the population counting method using storage enhancement, a simple and effective population counting model (FDC) is designed, a small density map regressor is provided, a plurality of FDCs (MFDCs) with a plurality of parameter sets are obtained through the proposed multi-parameter strategy, and the detection performance is remarkably improved.

Drawings

FIG. 1 illustrates a flow diagram of a population counting method using memory augmentation in accordance with a preferred embodiment of the present invention;

FIG. 2 is a diagram showing the improvement effect of the FDC method for MFDC method according to the embodiment of the present invention;

fig. 3 is a graph showing the comparison effect between the parameter and performance of the method according to the embodiment of the present invention and the existing methods.

Detailed Description

The present invention will be described in further detail below with reference to preferred embodiments and examples. The features and advantages of the present invention will become more apparent from the description.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

In a first aspect of the present invention, there is provided a method of population counting using memory enhancement, as shown in fig. 1, the method comprising a training phase and a testing phase, the training phase comprising the steps of:

step 1, obtaining the similarity between training images, and selecting a training sample according to the stability of pre-training;

step 2, clustering the selected training samples, training each type of sample and storing a group of weights;

and 3, training a weight retrieval module.

The steps of the training phase are described in further detail below:

step 1, establishing a crowd counting network model, and selecting a training sample according to the stability of pre-training.

Wherein, step 1 comprises the following substeps:

step 1-1, establishing a crowd counting network model.

The general network model includes a feature extractor and a density map generator, and the inventors consider that in the subsequent process of using the storage enhanced population counting method, the density map generator needs to be fine-tuned multiple times, and in addition to requiring a large parameter storage space, if the training set is small, the model with the large density map generator is easily over-fitted.

Therefore, in order to solve the above problem, a simple basic model called FDC (i.e. a built population counting network model) is adopted in the present invention.

According to a preferred embodiment of the present invention, the created population counting network model is composed of a standard Feature Pyramid Network (FPN) as a basic feature extractor and a dilated convolution as a density map generator.

Preferably, the FPN in the FDC can adopt different networks as backbone networks, such as ResNet-18, ResNet-34, ResNet-50 and the like, and preferably ResNet-18.

More preferably, to align the output of the FPN, the deconvolution layer is used as an upsampler of the population counting network model.

The inventors have found that the density map regression parameters of the FDC used in the present invention are less, but still sufficiently efficient, compared to the state of the art most methods (e.g.CSRNet, ACMNet, DM-Count method).

Specifically, the method comprises the following steps: in the training process, a part of images are cut out from the original image, the size of the part of images is 224 × 224, 448 × 448 and 896 × 896, and the size of the parts of images is adjusted to 224 × 224; FPN generates four feature maps, ranging in size from 7x 7 to 56x56, which are then upsampled by multiple deconvolution layers of step size 2 to generate feature maps of size 56x 56; these feature maps were then connected and fused through two 3 × 3 convolutional layers to generate an output density map.

And 1-2, obtaining the similarity between training image samples.

In general, in the case of a well-trained population count model (base model) and a sample, if the model is trimmed on a given sample, the performance of the trimmed model for that sample, and other similar samples, will improve. Similarities between other samples may be defined based on their performance improvement.

Specifically, T ═ x is defined_i，y_i) 1,2, N is a crop with N samples (sample and its corresponding point)Label) of the training set; the feature extractor and density map generator for a given model are defined as f ═ Ψ (x, θ), respectively₁) And d ═ Φ (f, θ)₂) Wherein, theta₁And theta₂Parameters of the feature extractor and the density map generator, respectively; the loss function is expressed as

Wherein ^ y ═ Φ (Ψ (x, θ)₁)，θ₂)。

In the present invention, a novel metric is proposed to evaluate the similarity between training image samples, i.e. a similarity directly dependent on a specific task and model-sample x when using the density map parameters trimmed by other samples_iThe loss function of (a) varies.

Preferably, the similarity between the training image samples is obtained by a method comprising the steps of:

step 1-2-1, the loss of the ith sample is obtained.

Wherein the ith sample (x) of the basic model (population counting network model)_i，y_i) Has a loss of l_i＝L(Φ(f_i，θ₂)，y_i)。

And 1-2-2, fine-tuning parameters of a density map generator of the crowd counting network model.

Wherein the parameters of the density map generator of the basic model are fine-tuned to obtain the ith sample (x)_i，y_i) And a specific and efficient set of weights for similar samples, the optimal parameter for i is obtained as

And 1-2-3, obtaining the loss of the jth sample, and obtaining the fine-tuning similarity between the ith sample and the jth sample.

Wherein a fine tuning model is obtained

Sample j (x)_j，y_j) Is lost in

In the present invention, it can be seen from the above description that if the sample i and the sample j are similar, the fine tuning model of the ith sample will achieve performance improvement on the jth sample. The more similar between samples, the greater the improvement will be. Therefore, the degree of improvement after fine tuning in the present invention can be regarded as the similarity between samples.

According to a preferred embodiment of the present invention, the fine-tuning similarity between the ith sample and the jth sample is obtained by:

in the present invention, when the predictions of two pictures have performance improvement by model weights finely adjusted from the base model to each other, the similarity (similarity) between them will be positive, and the larger the ratio (i.e., the degree of mutual improvement) is, the larger the similarity improves each other.

The inventor considers that calculating the fine tuning similarity of all training image samples is time-consuming, intuitively, a difficult sample (a sample with a large error from a true label in prediction) is important, but in the training process, some unstable samples exist, so that the loss of a basic model is unstable. Although some of these unstable samples are not difficult samples, trimming using some samples in the data set will greatly reduce the loss of these samples.

Therefore, in the present invention, it is preferable to calculate the fine-tuning similarity between unstable samples. In addition, since the loss of the stable samples does not change much during the training process, and the influence of fine tuning on the stable samples on the parameters is small, it is preferable that the fine tuning similarity between the samples can be directly estimated as 0.

And 1-3, selecting a training sample.

In the present invention, in order to evaluate the instability of the sample during training, it is preferable to use the sequence and inversion tests, and only consider the decreasing trend of the loss function.

According to a preferred embodiment of the present invention, the index I (I, m) of the decreasing trend of the ith sample in the mth training period is defined as follows:

wherein the content of the first and second substances,

represents the predicted value of the m-th training period to the i-th sample, y_iThe true value of the ith sample is represented, and e represents the hyper-parameter.

In a further preferred embodiment, the tolerance to small variations is adjusted using the over-parameter e, the instability of the training sample is preferably obtained by:

where M represents the total number of cycles, the closer the equation is to 1, the greater the instability of the training samples.

In the present invention, a threshold η is set, preferably in the range of (0) to (0.5), wherein samples with instability greater than the threshold form an unstable sample set, denoted Q.

In the present invention, all training samples in the unstable sample set Q are preferably selected to calculate the inter-trimmed similarity.

And 2, clustering the selected training samples, training each type of sample and storing a group of weights.

Preferably, the clustering is performed according to a method comprising the steps of:

step 2-1, obtaining the sum of the similarity between the sample u in the unstable sample set Q and all other samples in Q, and marking all the samples as unprocessed state;

step 2-2, performing descending order arrangement on all unstable samples according to the sum of the similarity, and traversing the samples;

and 2-3, clustering according to the processing state of the sample.

Preferably, when clustering is performed, firstly, whether the sample is unprocessed or not is judged, and if the sample is processed, the next cycle process is started; if the samples are not processed, a new cluster is created, all unstable samples are sorted in descending order according to the similarity of the currently processed samples, and the samples are traversed.

More preferably, in the created new cluster, the processing state of each sample is judged, and if the sample is processed, the next cycle process is entered; if the sample is not processed, it is determined whether the sample is similar to all samples in the cluster, if so, the sample is added to the cluster, and if not, the sample is skipped.

In the invention, the clustering method follows two principles, firstly, the fine tuning similarity of all samples in each cluster is positive; second, the number of clusters should be as small as possible to reduce the space required for model storage.

By the heuristic clustering method, time cost can be saved by real-time fine adjustment, each cluster is used for learning a group of parameters, and the method is very effective for testing images similar to the cluster.

Wherein samples not in the unstable sample set Q are designated as a cluster, denoted S₀. And (3) using each cluster to fine-tune the density map generator of the basic model to obtain K +1 weight sets, wherein K represents the number of clusters obtained in the step 2.

And 3, training a weight retrieval module.

In the present invention, in order to obtain the optimal weight of the test sample, each cluster is preferably regarded as a class, a multi-class classifier is trained, that is, the weight retrieval module, and the established population counting network model is marked as FDC (i.e., MFDC) with multiple parameter sets.

According to a preferred embodiment of the present invention, when training the multi-class classifier, the soft label is represented by the following formula:

wherein S is_jFor the jth cluster, x_iIs the ith sample, s (i, j) is the similarity between sample i and sample j,

is the label of sample i on cluster j.

The present inventors consider that a sample belonging to one cluster may have positive inter-trimmed similarity with some samples in other clusters, and therefore, in the present invention, a soft label described by the above formula is adopted instead of simply using a hard label, which is calculated based on the average similarity between the sample and the samples in the cluster.

In a further preferred embodiment, ResNet-18 is employed as the backbone for the multi-class classifier.

Wherein the input to the classifier comprises the original training image and the output of the feature extractor in the base model.

Preferably, the original training image is aligned to the same size of the feature extractor output by the shallow CNN-Pool-CNN structure.

In a further preferred embodiment, the cross-entropy loss function for training the multi-class classifier is as follows:

wherein L is a loss function value, T is a total number of samples, N is a total number of classes,

for the purpose of the calculated soft label,

the prediction probability for sample i classified as cluster j.

When a test image is tested, a prediction result of a weight retrieval module (a multi-class classifier) after training convergence is represented as the probability that the image belongs to a certain cluster. If each probability is small, the probability of representing data belonging to each cluster is small, and it is considered to be from cluster 0.

In the invention, the trained multi-class classifier can predict class labels of the test data, and the prediction result is used for retrieving the optimal weight so as to dynamically select a group of specific parameters according to the characteristics of the test image, thereby greatly improving the performance.

The population counting method using storage enhancement adopts a multi-weight strategy for population counting, the strategy utilizes the relation among samples, improves the generalization capability of a single simple network with a plurality of parameter sets, can be integrated with most of the existing methods, and can obviously improve the performance of the existing methods; and meanwhile, a plurality of clusters of the training images are obtained by adopting an effective task-driven similarity and clustering method, each cluster is used for learning a group of parameters, and the method is very effective for testing the images similar to the clusters.

The invention also provides a computer readable storage medium storing a program for population counting using memory enhancement, which program, when executed by a processor, causes the processor to carry out the steps of the method for population counting using memory enhancement.

The crowd counting method using memory enhancement in the present invention can be implemented by means of software plus necessary general hardware platform, the software is stored in a computer readable storage medium (including ROM/RAM, magnetic disk, optical disk), and includes several instructions to make a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) execute the method of the present invention.

The invention also provides a computer device comprising a memory and a processor, the memory storing a program for people counting using memory enhancement, which program, when executed by the processor, causes the processor to carry out the steps of the method for people counting using memory enhancement.

Examples

The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.

Example 1

1. Data set

This example was performed on three datasets of ShanghaiTech Part A, UCF-QNRF, and NWPU-crown, in that order.

Wherein, the ShanghaiTech Part A data set refers to: desenzhou/ShanghaiTechDataset, data set associated in Single Image Crowd Counting via Multi Column volumetric Neural Network (MCNN) (githu. com); the UCF-QNRF dataset refers to CRCV Center for Research in Computer Vision at the University of Central Florida (ucf.edu); the NWPU-Crow dataset refers to a Crown Benchmark.

A basic case introduction for these three data sets is shown in table 1.

TABLE 1

Name (R)	Number of pictures	Number of people
			ShangHaiTech Part A	482	241667
UCF-QNRF	1535	1.25million
			NWPU-Crowd	5109	Unknown

2. Performance evaluation criteria:

performance indicators include Mean Absolute Error (MAE) and Mean Square Error (MSE).

3. Task description

The network is trained using a training set provided by the public data set, and predictions are made on a test set provided by the public data set. For the ShangHaiTech Part A dataset and the UCF-QNRF dataset, the prediction index is calculated from the open test set. The NWPU-crown dataset is submitted on crown Benchmark to obtain feedback.

4. Results and analysis

The method provided by the invention is compared with the existing method on different data sets, and the comparison result of the average absolute error (MAE) and the Mean Square Error (MSE) is shown in tables 2-4.

Table 2 shows the comparison result between the method of the present invention and the existing method on the ShangHaiTech Part _ a dataset, and table 3 shows the comparison result between the method of the present invention and the existing method on the UCF-QNRF dataset, and the comparison result between the method of the present invention and the existing method on the NWPU-crwood dataset.

TABLE 2

The MCNN method is specifically shown in the literature Zhang, Y; zhou, d.; chen, s.; gao, S. & Ma, Y.Single Image Crowd Counting via Multi Column volume protocol of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp.589-597,2016 ";

the CSRNet method is specifically disclosed in the literature "Li, Y.; zhang, X. & Chen, D.CSRNet: scaled relational Neural Networks for applying the mapping the high mapped scenes. proceedings of the IEEE Computer Society reference on Computer Vision and Pattern Recognition, 2018';

the ResSFCN-101 method is specifically disclosed in the literature "Laradji, I.H.; rostamzadeh, N.; pinheiro, p.o.; vazzez, D. & Schmidt, M.wheel are the Blobs, Counting by Localization with Point supervision of proceedings of the European Conference on Computer Vision (ECCV), pp.547-562,2018 ";

CAN methods are specifically described in literature "Liu, w.; salzmann, M. & Fua, P.Context-Aware crown counting. proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol.2019-June,2019 ";

the DM-Count method is specifically disclosed in the literature "Wang, B.; liu, h.; samaras, D. & Hoai, M.distribution Matching for Crowd counting. proceedings of Advances in Neural Information Processing Systems, 2020';

S-DCNet method and SS-DCNet (cls) are specifically disclosed in the literature "Xiong, H.; lu, h.; liu, c.; liang, l.; cao, Z. & Shen, C.from Open Set to Closed Set, Supervised Spatial Divide-and-Conquer for Object counting. proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp.8362-8371,2019 ";

M-MCNN is MCNN modified by the method of the invention;

M-CSRNet is CSRNet improved using the method of the invention;

FDC-18 is a method using a basic model;

MFDC-18 is the method of the present invention (employing FDC with multiple parameter sets).

TABLE 3

Method	MAE	MSE
			MCNN	277	426
Switch-CNN	228	445
			CAN	107	183
CSRNet	98.2	157.2
			S-DCNet	97.7	167.6
DM-Count	85.6	148.3
			SS-DCNet(cls)	81.9	143.8
M-MCNN	234.1	381.8
			M-CSRNet	83.1	144.6
FDC-18	93.0	157.3
			MFDC-18	76.2	121.5

The Switch-CNN method is specifically disclosed in the literature "Sam, D.B.; surya, S. & Babu, R.V.switching capacitive neutral Network for Crowd counting. proceedings IEEE Conference on Computer Vision and Pattern Recognition, Vol.2017-January,2017 ".

TABLE 4

Where O _ MAE denotes MAE averaged by picture, O _ MSE denotes MSE averaged by picture, O _ NAE denotes MAE normalized by number of people, avg.

As can be seen from tables 2-4, on the data set of Shanghai Tech Part A and UCF-QNRF, the MFDC-18 method disclosed by the invention has the lowest MSE and MAE values, and the MCNN and CSRNet (namely M-MCNN and M-CSRNet) improved by the method disclosed by the invention are improved in performance compared with the original method.

On the NWPU-crown data set, the method disclosed by the invention is greatly superior to the previous method in indexes of O _ MAE and O _ MSE, and the effectiveness of the method is illustrated; the classification test error of the method based on the Avg, MAE (S/L), namely the scene and illumination is still greatly superior to that of the previous method, and the effectiveness of the method in various scene categories is proved.

Further, fig. 2 shows the effect of the improvement of the multi-weight Method (MFDC) to the single-weight method (FDC) according to the present invention, and as can be seen from fig. 2, the MFDC method greatly improves the prediction accuracy compared to the FDC method.

FIG. 3 shows the comparison of the population Count and performance of the population counting method using the memory enhancement of the present invention with the prior art methods (MCNN, SANet, PCCNet-light, CSRNet, SFCN-101, Bayesian, SCAR, CAN, SDCNT, DM-Count, respectively).

As can be seen from fig. 3, under the condition of similar parameter quantities, the method of the present invention greatly reduces the average error value compared with the conventional method.

The invention has been described in detail with reference to specific embodiments and illustrative examples, but the description is not intended to be construed in a limiting sense. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention.

Claims

1. A method of population counting, the method comprising a training phase and a testing phase, the training phase comprising the steps of:

step 1-1, establishing a crowd counting network model;

step 1-2, obtaining the similarity between training image samples;

definition T { (x)_i，y_i) 1,2, N, wherein i is a training set with N sample crops, and the sample crops are marked for samples and corresponding points thereof; the feature extractor and density map generator for a given model are defined as f ═ Ψ (x,θ₁) And d ═ Φ (f, θ)₂) Wherein, theta₁And theta₂Parameters of the feature extractor and the density map generator, respectively; the loss function is expressed as

Where ŷ ═ Φ (Ψ (x, θ)₁)，θ₂)；

The similarity between the training image samples is obtained by a method comprising the steps of:

step 1-2-1, obtaining the loss of the ith sample;

wherein the ith sample (x) of the basic model_i，y_i) Has a loss of l_i＝L(Φ(f_i，θ₂)，y_i) The basic model is a population counting network model;

step 1-2-2, fine-tuning parameters of a density map generator of the crowd counting network model;

wherein, the parameters of the density chart generator of the basic model are finely adjusted to obtain the optimal parameters

Step 1-2-3, obtaining the loss of the jth sample, and obtaining the fine tuning similarity between the ith sample and the jth sample;

the fine-tuning similarity between the ith and jth samples is obtained by:

l_i＝L(Φ(f_i，θ₂)，y_i) I-th sample (x) of the population count network model_i，y_i) Loss of (d);

representing fine-tuning models

Sample j (x)_j，y_j) Loss of (d);

step 1-3, selecting a training sample;

an index I (I, m) of the falling tendency of the ith sample in the mth training period is defined as follows:

x_iis the ith sample; y is_iRepresenting the true value of the ith sample, and epsilon representing the hyper-parameter;

the instability of the training sample is obtained by:

setting a threshold eta with the value of 0-0.5, wherein a sample with instability greater than the threshold is selected to form an unstable sample set, which is represented as Q;

the clustering is performed according to a method comprising the following steps:

step 2-3, clustering is carried out according to the processing state of the sample;

in the step 2, the process is carried out,

when clustering is carried out, firstly, judging whether a sample is unprocessed or not, and if the sample is processed, entering the next cycle process; if the samples are not processed, a new cluster is created, all unstable samples are sorted in a descending order according to the similarity of the currently processed samples, and the samples are traversed;

in the new cluster creation, the processing state of each sample is judged, and if the sample is processed, the next cycle process is started; if the sample is not processed, judging whether the sample is similar to all samples in the cluster, if so, adding the sample into the cluster, and if not, skipping the sample;

step 3, training a weight retrieval module;

in step 3, when the weight retrieval module is trained, the soft label shown as the following formula is adopted:

label of sample i on cluster j; k represents the number of clusters obtained in step 2.

2. The population counting method according to claim 1, wherein in step 1-1, said constructed population counting network model is composed of a standard Feature Pyramid Network (FPN) as a basic feature extractor and a dilated convolution as a density map generator.

3. A computer-readable storage medium, in which a program for population counting enhanced with storage is stored, which program, when executed by a processor, causes the processor to carry out the steps of the population counting method according to one of claims 1 to 2.

4. A computer device comprising a memory and a processor, the memory storing a population counting program enhanced with memory, the program, when executed by the processor, causing the processor to carry out the steps of the population counting method according to one of claims 1 to 2.