CN111428587A - Crowd counting and density estimating method and device, storage medium and terminal

Info

Publication number
CN111428587A
CN111428587A (application CN202010162552.8A)
Authority
CN
China
Prior art keywords
crowd
density
labeled
model
image data
Prior art date
Legal status
Granted
Application number
CN202010162552.8A
Other languages
Chinese (zh)
Other versions
CN111428587B (en)
Inventor
李莉
赵震
林国义
Current Assignee
Tongji University
Original Assignee
Tongji University
Priority date
Filing date
Publication date
Application filed by Tongji University
Priority to CN202010162552.8A
Publication of CN111428587A
Application granted
Publication of CN111428587B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention provides a crowd counting and density estimation method and device, a storage medium, and a terminal. First, a number of images are randomly selected from an unlabeled crowd image dataset and annotated to generate corresponding crowd density labels, which are added to a labeled crowd image dataset; a crowd counting and density estimation model is trained until it converges; further images are then selected from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, given crowd density labels, and added to the labeled crowd image dataset; this iteration and optimization continues until the model meets the performance requirement; an adversarial learning branch based on feature mixing and gradient reversal is then added to the model, which is trained with both labeled and unlabeled crowd images; finally, crowd counting and density estimation are performed on the crowd image to be predicted. The invention fundamentally reduces the annotation workload for crowd images and improves the generalization performance of the model.

Description

Crowd counting and density estimating method and device, storage medium and terminal
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for crowd counting and density estimation, a storage medium, and a terminal.
Background
In recent years, convolutional neural networks have achieved great success in crowd counting and density estimation. Convolutional neural networks typically have training parameters on the order of tens of millions and usually require large amounts of labeled data to prevent overfitting. When constructing a dataset for the crowd counting and density estimation task, it is generally necessary to mark the center of every person's head in every image and record its coordinate position. Since each crowd image may contain a large number of heads, this labeling approach is time-consuming and labor-intensive. The mainstream approach is to label all of the acquired crowd images and train the model with fully supervised learning, but this inevitably requires a great deal of effort and even financial resources for data annotation. Consequently, for scenes in which a large number of crowd images cannot be labeled for some reason, fully supervised learning may cause the model's performance to drop sharply or even collapse due to the lack of sufficient labeled data. The application of convolutional neural networks to crowd counting and density estimation with scarce labeled images therefore remains to be further studied.
In real scenes, crowd images can be acquired in large quantities from video surveillance and other equipment; they provide the most intuitive crowd distribution information and have important applications in security, business discovery, people counting, and the like. In general, a model trained on crowd images with a particular feature distribution or camera parameters may degrade or even collapse in other scenes. In practical applications, therefore, to obtain optimal performance, crowd images closest to the future deployment scene are usually collected and labeled for model training. At the same time, if every acquired image is labeled, a great deal of annotation work inevitably results. Related research has found that many training samples only contribute redundant information during model training and may even harm it. Therefore, if the data most beneficial to model training can be selected by some means and unlabeled data can be fully exploited, annotation work can be fundamentally reduced and an effect even superior to full labeling may be achieved.
However, picking out the images the model needs most from a large pool of unlabeled images, and making full use of the unlabeled images, is a great challenge. At present, in the field of crowd counting and density estimation, there is little research on screening images from the unlabeled pool for annotation, and only a few studies explore how to fully exploit unlabeled images: for example, using a generative adversarial network and adding to the model a binary classification branch that distinguishes real images from fake images generated by the model, so that unlabeled real images can participate in model training and improve the generalization of the model's shared parts; designing a special multi-stage autoencoder structure so that most model parameters can be trained with unlabeled images and only a small number of parameters require labeled images; constructing a self-supervised ranking loss by counting selected regions of crowd images, a loss that can be applied to labeled and unlabeled images simultaneously; or improving model performance with additional artificial images and domain adaptation techniques. These studies, however, still have many shortcomings. First, the improvements they bring are limited and fall short of practical requirements. Second, in order to introduce unlabeled data into training, they may add unstable modules, making the whole model hard to train or the training time too long, which is unsuitable for real scenes. Furthermore, active learning, although almost unexplored in crowd counting and density estimation, has demonstrated its advantages in reducing data annotation workload and speeding up model training in many other areas.
Disclosure of Invention
In view of the above-mentioned shortcomings of the prior art, it is an object of the present invention to provide a crowd counting and density estimation method, apparatus, storage medium, and terminal that solve the problems in the prior art.
To achieve the above and other related objects, a first aspect of the present invention provides a crowd counting and density estimation method, comprising: randomly selecting, subject to the labeling budget, a plurality of images from the unlabeled crowd image dataset for annotation to generate corresponding crowd density labels, and adding them to the labeled crowd image dataset; training a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges; based on the model, selecting a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy to generate corresponding crowd density labels, and adding them to the labeled crowd image dataset; continuing to iterate and optimize until the model meets the performance requirement; adding an adversarial learning branch based on feature mixing and gradient reversal to the model meeting the performance requirement, and training the model with the labeled and unlabeled crowd images; and performing crowd counting and density estimation on the crowd images to be predicted, using the model trained with the labeled and unlabeled crowd images.
In some embodiments of the first aspect of the present invention, selecting, based on the model, a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, generating corresponding crowd density labels, and adding them to the labeled crowd image dataset comprises: predicting the remaining unlabeled crowd image dataset with the model to obtain a density prediction map and a count prediction value for each unlabeled crowd image; partitioning the set of count prediction values into regions using a natural breakpoint method, and dividing the labeled and unlabeled crowd images according to the resulting regions; calculating, within each region, a grid-division-based similarity metric value between the labeled crowd image dataset and the unlabeled crowd image dataset; converting, within each region, the similarity metric value of each unlabeled crowd image into a probability value through normalization; and sampling the unlabeled crowd images in each region without replacement according to the probability values, annotating the heads in the sampled unlabeled crowd images to generate crowd density labels, and adding them to the labeled crowd image dataset.
In some embodiments of the first aspect of the present invention, the crowd density label is obtained by convolving the head annotations of the image with a Gaussian kernel or an adaptive Gaussian kernel.
In some embodiments of the first aspect of the present invention, the method of training a crowd counting and density estimation model based on the labeled crowd image dataset comprises a density map regression method.
In some embodiments of the first aspect of the present invention, the density map regression method can employ a mean squared error loss function, among others.
In some embodiments of the first aspect of the present invention, the adversarial learning branch comprises a feature mixing layer, a gradient reversal layer, and a subsequent binary classification module.
In some embodiments of the first aspect of the present invention, the adversarial learning branch can employ a binary cross-entropy loss function.
To achieve the above and other related objects, a second aspect of the present invention provides a crowd counting and density estimation apparatus, comprising: a labeling module, which randomly selects, subject to the labeling budget, a plurality of images from the unlabeled crowd image dataset for annotation, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset; a training module, which trains a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges; a screening module, which, based on the model, selects a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset; a loop module, which repeats the operations of the training module and the screening module until the model meets the performance requirement; a feature mixing module, which adds an adversarial learning branch based on feature mixing and gradient reversal to the model meeting the performance requirement and trains the model with the labeled and unlabeled crowd images; and a counting module, which performs crowd counting and density estimation on the crowd image to be predicted using the model trained with the labeled and unlabeled crowd images.
To achieve the above and other related objects, a third aspect of the present invention provides a computer-readable storage medium having a computer program stored thereon, the computer program, when executed by a processor, implementing the crowd counting and density estimation method.
To achieve the above and other related objects, a fourth aspect of the present invention provides an electronic terminal comprising: a processor and a memory; the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the terminal to execute the crowd counting and density estimation method.
As described above, the crowd counting and density estimating method, apparatus, storage medium and terminal according to the present invention have the following advantageous effects:
(1) The labeling cost of the method is low. In general, the main cost of deep learning comes from labeling the training data, and the labeling cost of a large dataset may reach tens of millions of dollars. The active learning method provided by the invention fundamentally reduces the annotation workload for crowd images while maintaining high model performance. In particular, small models and mobile-side models have fewer parameters and do not require an overly large amount of training data, so the training data selected by active learning plays an even greater role, and the model performance comes extremely close to, or even surpasses, fully supervised learning. Moreover, the semi-supervised learning based on feature mixing and gradient reversal makes full use of the acquired data, further improving the model with massive unlabeled images and achieving the practical goal of labeling only a small amount of data.
(2) The generalization performance of the model is good. Because only the data most beneficial to the model is labeled, the model is not affected by the negative influence of harmful data; at the same time, the semi-supervised learning based on feature mixing and gradient reversal allows the model to make full use of a large amount of multi-source unlabeled data, improving its generalization performance.
Drawings
FIG. 1 is a flow chart of the crowd counting and density estimation method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of a crowd counting and density estimating apparatus according to an embodiment of the invention.
Fig. 3 is a schematic structural diagram of an electronic terminal according to an embodiment of the invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It is noted that in the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the present invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "above," and the like, may be used herein to facilitate describing one element's or feature's relationship to another element or feature as illustrated in the figures.
In the present invention, unless otherwise expressly specified or limited, the terms "mounted," "connected," "secured," "retained," and the like are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected or indirectly connected through intervening media; or as internal communication between two elements. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to the specific situation.
Also, as used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," and/or "including," when used in this specification, specify the presence of stated features, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C." An exception to this definition occurs only when a combination of elements, functions, or operations is inherently mutually exclusive in some way.
The query screening strategy is one of the most critical steps of active learning and determines the final performance of the model to the greatest extent; it is mainly divided into uncertainty strategies and diversity strategies. At present, active learning is applied mainly to classification; because the classification task is relatively simple, uncertainty and diversity can be defined in several intuitive ways, such as the entropy or confidence of the classification probabilities. The crowd counting and density estimation task, however, is a regression problem with dense predictions, and there has been no research on a query screening strategy for it.
The active learning strategy is designed around sample diversity. Repeated experiments show that images with different crowd densities play an important role in the generalization performance of the model. If the crowd density is too uniform, the model overfits more severely and performs worse on the test set; the invention therefore holds that sampling uniformly within each interval of crowd density has a positive effect on the model. Assuming the images are of equal size and the crowd distribution within each image is relatively uniform, the most intuitive expression of crowd density is the crowd count of the image. Meanwhile, if the screened images are too similar to one another, information redundancy results; the GDSIM similarity measurement scheme is therefore proposed to avoid such redundancy as much as possible.
Adversarial semi-supervised learning has been widely used in various research fields, yet it is almost unexplored in crowd counting and density estimation. Since the labeled and unlabeled samples come from the same domain, it is difficult for generic adversarial learning to provide much benefit. The invention therefore provides an adversarial module based on feature mixing and gradient reversal, which makes discrete data as continuous as possible to reduce the difficulty of feature alignment for the model.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions in the embodiments of the present invention are further described in detail by the following embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Embodiment 1
Fig. 1 is a schematic flow chart of the crowd counting and density estimation method in the present embodiment, which comprises the following specific steps:
step S11: randomly selecting a plurality of images from the image data set of the non-labeled crowd with the limitation of label quantity to label, generating corresponding crowd density labels, and adding the corresponding crowd density labels into the image data set of the labeled crowd.
In a preferred implementation of this embodiment, the crowd density label is obtained by convolving the head annotations of the image with a Gaussian kernel or an adaptive Gaussian kernel, that is:

D(P) = Σ_{i=1..N} Gσ0(P − pi)   (fixed Gaussian kernel)

D(P) = Σ_{i=1..N} Gσi(P − pi),  σi = β·d̄i,  d̄i = (1/K) Σ_{k=1..K} dik   (adaptive Gaussian kernel)

where σ0 denotes the fixed Gaussian kernel bandwidth, σi the adaptive Gaussian kernel bandwidth of head i, P any pixel coordinate in image x, pi the coordinate of head i in image x, Gσ a Gaussian kernel function with variance σ, and N the crowd count of image x; the K heads closest to head i in image x are selected by the K-nearest-neighbour (KNN) algorithm, dik is the pixel distance between head i and the k-th of these K heads, and β is the scaling factor applied to the average distance d̄i.
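The density-label construction above can be sketched as follows. This is a minimal NumPy/SciPy sketch, not the patent's implementation: head annotations are assumed to be (row, col) pixel coordinates, and the default values of beta, k and sigma0 are illustrative choices borrowed from common practice rather than values fixed by the patent.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(head_points, img_shape, beta=0.3, k=3, sigma0=15.0, adaptive=True):
    """Build a crowd density label by placing a (possibly adaptive) Gaussian
    at every annotated head position."""
    h, w = img_shape
    density = np.zeros((h, w), dtype=np.float32)
    pts = np.asarray(head_points, dtype=np.float32)
    n = len(pts)
    if n == 0:
        return density
    for i, (r, c) in enumerate(pts):
        # Unit impulse at the head location, then smoothed by a Gaussian kernel.
        impulse = np.zeros((h, w), dtype=np.float32)
        impulse[int(min(max(r, 0), h - 1)), int(min(max(c, 0), w - 1))] = 1.0
        if adaptive and n > 1:
            # sigma_i = beta * mean distance to the K nearest neighbouring heads
            dists = np.sqrt(((pts - pts[i]) ** 2).sum(axis=1))
            nearest = np.sort(dists)[1:k + 1]          # skip distance to itself
            sigma = beta * nearest.mean()
        else:
            sigma = sigma0                              # fixed Gaussian kernel
        density += gaussian_filter(impulse, sigma, mode='constant')
    return density
```

By construction the returned map sums approximately to the head count N, which is what the density regression branch later learns to reproduce (mass lost at image borders by the truncated Gaussians is the usual caveat).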
Step S12: training a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges.
In a preferred embodiment of the present invention, the method for training the crowd counting and density estimation model based on the labeled crowd image dataset includes a density map regression method. The density map regression method can adopt a mean squared error loss function, specifically expressed as:

Lreg = (1/(2N)) Σ_{n=1..N} ‖D̂n − Dn‖²

where N is the batch size of the input data during training, and D̂n and Dn denote the density prediction map and the density label map of image n, respectively; the type of the model and whether it is pre-trained are not limited.
Specifically, the crowd counting and density estimation model comprises a feature extractor and a density regression module. The input of the feature extractor is a crowd image, and the input of the density regression module is the output feature of the feature extractor. The training process is end-to-end training.
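A minimal PyTorch sketch of this feature extractor / density regression split and of the Lreg loss above; the backbone layers and channel sizes are illustrative assumptions, since the patent leaves the model type and pre-training open, and the ground-truth density maps are assumed to be resized to the resolution of the predicted map.

```python
import torch
import torch.nn as nn

class CrowdCounter(nn.Module):
    """Feature extractor followed by a density regression head (1-channel density map)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(               # illustrative small backbone
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.regressor = nn.Sequential(               # density regression branch
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, x):
        feat = self.features(x)
        return self.regressor(feat), feat

def density_mse_loss(pred, target):
    # L_reg = 1/(2N) * sum_n ||D_hat_n - D_n||^2 over the batch
    n = pred.size(0)
    return ((pred - target) ** 2).sum() / (2 * n)
```

With model = CrowdCounter(), a forward pass pred, feat = model(images) followed by loss = density_mse_loss(pred, targets) is then the end-to-end training objective of step S12.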
Step S13: based on the model, selecting a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, generating corresponding crowd density labels, and adding them to the labeled crowd image dataset.
Optionally, step S13 in this embodiment can be implemented by steps S131 to S135 described below.
Step S131: predicting the remaining unlabeled crowd image dataset Ut with the model Rt to obtain, for each unlabeled crowd image u_j^t, a density prediction map D̂_j^t and a count prediction value Ĉ_j^t, where t is the active learning round and j is the index of the unlabeled crowd image.
Step S132: partitioning the set of count prediction values {Ĉ_j^t} into regions using a natural breakpoint method, and dividing the labeled and unlabeled crowd images according to the resulting regions. The natural breakpoint method classifies the data according to its statistical distribution, including its natural turning points and characteristic points, so that the differences between classes are maximized and the differences within a class are minimized. In practice, a frequency histogram, a gradient curve or a cumulative frequency histogram of the data all help locate the natural break points. The natural breakpoint method includes the Jenks natural breaks method.
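A self-contained sketch of the natural-breaks idea behind step S132, written here as Fisher's optimal one-dimensional partitioning, which minimizes the within-class sum of squared deviations (the criterion behind Jenks breaks); the number of regions is an assumption, as the patent does not fix it.

```python
import numpy as np

def natural_breaks(values, n_classes):
    """Partition 1-D count predictions into n_classes regions by minimising the
    within-class sum of squared deviations. Returns the upper bound of each class."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    n_classes = min(n_classes, n)
    s1 = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums
    s2 = np.concatenate(([0.0], np.cumsum(x ** 2)))   # prefix sums of squares

    def ssd(i, j):                                    # SSD of x[i..j], 0-based inclusive
        cnt = j - i + 1
        s = s1[j + 1] - s1[i]
        return (s2[j + 1] - s2[i]) - s * s / cnt

    # cost[j][c]: best SSD of x[0..j] split into c+1 classes; split[j][c]: start of last class
    cost = np.full((n, n_classes), np.inf)
    split = np.zeros((n, n_classes), dtype=int)
    for j in range(n):
        cost[j][0] = ssd(0, j)
    for c in range(1, n_classes):
        for j in range(c, n):
            for i in range(c, j + 1):                 # last class is x[i..j]
                cand = cost[i - 1][c - 1] + ssd(i, j)
                if cand < cost[j][c]:
                    cost[j][c] = cand
                    split[j][c] = i
    # Recover the break values (upper bound of each class but the last).
    breaks, j = [], n - 1
    for c in range(n_classes - 1, 0, -1):
        i = split[j][c]
        breaks.append(x[i - 1])
        j = i - 1
    return sorted(breaks) + [x[-1]]
```

Labeled and unlabeled images can then be assigned to regions by comparing their (ground-truth or predicted) counts against the returned break values, for example with np.digitize.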
Step S133: calculating, within each region Pk^t, a grid-division-based similarity metric value between the labeled crowd image dataset Vk^t and the unlabeled crowd image dataset Uk^t. Specifically, each image uj is evenly divided by area into 4^l blocks, with LA denoting the fine-granularity index of the grid; Ĉ_j^l denotes the count prediction value of the l-th block of the j-th unlabeled image, and C_i^l denotes the ground-truth count value of the l-th block of the i-th labeled image; the similarity metric value is computed from the block-wise differences between these count values.
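The exact GDSIM formula is only rendered as an image in the source text, so the sketch below is one plausible reading of the definitions above: per-block counts of an unlabeled image's predicted density map are compared against the labeled images of the same region, and the block-wise absolute count differences are aggregated over grid levels. The aggregation by summed absolute differences and the min over labeled images are assumptions, as is the convention that a larger value marks an image as more novel relative to what is already labeled.

```python
import numpy as np

def block_counts(density_map, level):
    """Sum a density map over a 2**level x 2**level grid (4**level blocks)."""
    h, w = density_map.shape
    g = 2 ** level
    counts = np.zeros((g, g), dtype=float)
    for a in range(g):
        for b in range(g):
            counts[a, b] = density_map[a * h // g:(a + 1) * h // g,
                                       b * w // g:(b + 1) * w // g].sum()
    return counts.ravel()

def gdsim(pred_density_u, gt_densities_labeled, max_level=2):
    """Grid-division score of one unlabeled image against a region's labeled images:
    block-wise count differences against the closest labeled image, summed over levels."""
    score = 0.0
    for level in range(1, max_level + 1):
        cu = block_counts(pred_density_u, level)
        diffs = [np.abs(cu - block_counts(g, level)).sum()
                 for g in gt_densities_labeled]
        score += min(diffs)
    return score
```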
Step S134: converting, within each region Pk^t, the similarity metric value s_j^t of each unlabeled crowd image u_j^t into a probability value through normalization. Specifically, the probability value can be computed as

p_j^t = s_j^t / Σ_{j′=1..J} s_{j′}^t

where J denotes the number of unlabeled images.
Step S135: in each region P according to the above probability valuesk tAnd performing non-return sampling on the image data of the non-labeled crowd, performing head labeling on the extracted image data of the non-labeled crowd, generating crowd density labels, and adding the crowd density labels into the image data set of the labeled crowd.
Step S14: and continuously iterating and optimizing until the model meets the performance requirement. In particular, the performance requirements include accuracy, sensitivity, specificity, and mahalanobis correlation coefficients of the model; error rate, accuracy, P-R curve, F1 metric, ROC curve, and cost curve are also included.
Step S15: and adding an antagonistic learning branch based on feature mixing and gradient inversion into the model meeting the performance requirement, and training the model by using the labeled crowd image and the unlabeled crowd image.
Specifically, the adversarial learning branch based on feature mixing and gradient reversal is attached after the feature extractor of the model, the density regression branch is left unchanged, and the two branches are independent of each other. The density regression branch is influenced only by the labeled crowd images, while the feature extractor and the adversarial learning branch are influenced by both the labeled and the unlabeled crowd images.
In a preferred embodiment of the present invention, the adversarial learning branch comprises a feature mixing layer, a gradient reversal layer and a subsequent binary classification module. The feature mixing layer can be expressed as:

λ ~ Beta(α, α);
λ′ = max(λ, 1 − λ);
F(x′) = λ′·F(x1) + (1 − λ′)·F(x2);
y′ = λ′·y1 + (1 − λ′)·y2

where Beta is the Beta distribution, α is the Beta distribution parameter, x1 and x2 are any two images, F(x) is the output feature of image x from the feature extractor, and y1 and y2 denote the class labels of x1 and x2, respectively.
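A PyTorch sketch of the two components described above: a gradient reversal layer (identity in the forward pass, negated and scaled gradient in the backward pass) and a mixup-style feature mixing step, followed by a small binary classification head. The value of alpha, the classifier architecture and the reversal strength lambd are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambd in backward."""
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def mix_features(f1, f2, y1, y2, alpha=0.2):
    """Feature-space mixup: lambda ~ Beta(alpha, alpha), lambda' = max(lambda, 1-lambda)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    lam = max(lam, 1.0 - lam)
    mixed_f = lam * f1 + (1.0 - lam) * f2
    mixed_y = lam * y1 + (1.0 - lam) * y2
    return mixed_f, mixed_y

class DomainClassifier(nn.Module):
    """Binary classifier that follows the gradient reversal layer."""
    def __init__(self, channels):
        super().__init__()
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, 64), nn.ReLU(inplace=True),
            nn.Linear(64, 1),                     # logit for "class 1"
        )

    def forward(self, feat, lambd=1.0):
        return self.head(GradReverse.apply(feat, lambd))
```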
Preferably, the adversarial learning branch adopts a binary cross-entropy loss function:

La = −(1/N) Σ_{n=1..N} [ yn·log(pn) + (1 − yn)·log(1 − pn) ]

where n is the data index, N is the batch size of the input data during training, yn is the class label of feature n, and pn is the predicted probability that feature n belongs to class 1.
Specifically, this embodiment adopts end-to-end training and jointly optimizes the model with the density regression branch and the adversarial learning branch; the total loss function is expressed as:

L = Lreg + λa·La

where λa is a scaling factor.
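A sketch of one joint training step combining the density regression loss and the adversarial loss with the weighting factor λa; it reuses the CrowdCounter, density_mse_loss, mix_features and DomainClassifier sketches above, assumes ground-truth density maps already match the prediction resolution, and assumes the convention that labeled images carry class label 1 and unlabeled images class label 0 (not stated explicitly in the patent).

```python
import torch
import torch.nn.functional as F

def train_step(model, domain_clf, optimizer, labeled_imgs, gt_density,
               unlabeled_imgs, lambda_a=0.1):
    optimizer.zero_grad()

    # Density regression branch: labeled images only.
    pred_density, feat_l = model(labeled_imgs)
    loss_reg = density_mse_loss(pred_density, gt_density)

    # Adversarial branch: labeled (assumed class 1) vs unlabeled (assumed class 0)
    # features, mixed in feature space before the gradient reversal layer.
    _, feat_u = model(unlabeled_imgs)
    y_l = torch.ones(feat_l.size(0), 1, device=feat_l.device)
    y_u = torch.zeros(feat_u.size(0), 1, device=feat_u.device)
    n = min(feat_l.size(0), feat_u.size(0))
    mixed_f, mixed_y = mix_features(feat_l[:n], feat_u[:n], y_l[:n], y_u[:n])
    logits = domain_clf(mixed_f)
    loss_adv = F.binary_cross_entropy_with_logits(logits, mixed_y)

    loss = loss_reg + lambda_a * loss_adv      # L = L_reg + lambda_a * L_a
    loss.backward()
    optimizer.step()
    return loss_reg.item(), loss_adv.item()
```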
Step S16: and performing crowd counting and density estimation on the crowd images to be predicted by using the model trained by using the labeled crowd images and the unlabeled crowd images. Specifically, the density regression branch of the model is utilized to perform population counting and density estimation on the image to be predicted.
In summary, the embodiments of the present invention build on an active learning idea: they use sample diversity to sample uniformly across crowd-density intervals and introduce the GDSIM similarity measurement scheme to avoid information redundancy as much as possible. They further provide a semi-supervised learning method based on feature mixing and gradient reversal that makes discrete data as continuous as possible, reducing the difficulty of feature alignment for the model and making full use of unlabeled image data. As a result, a crowd counting and density estimation model with good generalization performance can be obtained at an extremely low labeling cost.
Embodiment 2
Fig. 2 is a schematic structural diagram of a crowd counting and density estimation apparatus according to an embodiment of the invention, comprising: a labeling module 21, which randomly selects, subject to the labeling budget, a plurality of images from the unlabeled crowd image dataset for annotation, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset; a training module 22, which trains a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges; a screening module 23, which, based on the model, selects a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset; a loop module 24, which continues to iterate and optimize until the model meets the performance requirement; a feature mixing module 25, which adds an adversarial learning branch based on feature mixing and gradient reversal to the model meeting the performance requirement and trains the model with the labeled and unlabeled crowd images; and a counting module 26, which performs crowd counting and density estimation on the crowd image to be predicted using the model trained with the labeled and unlabeled crowd images.
It should be noted that the modules provided in this embodiment correspond to the method steps described above, so their detailed description is omitted. The division of the above apparatus into modules is only a logical division; in an actual implementation they may be wholly or partially integrated into one physical entity or kept physically separate. These modules may all be implemented as software invoked by a processing element, all as hardware, or partly as software invoked by a processing element and partly as hardware. For example, the labeling module may be a separately arranged processing element, may be integrated into a chip of the apparatus, or may be stored in the memory of the apparatus in the form of program code that a processing element of the apparatus calls to execute the module's functions; the other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described here may be an integrated circuit with signal-processing capability. During implementation, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. As yet another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
Embodiment 3
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the crowd counting and density estimation method.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Embodiment 4
Fig. 3 is a schematic structural diagram of an electronic terminal according to an embodiment of the present invention. This embodiment provides an electronic terminal, comprising: a processor 31, a memory 32, and a communicator 33; the memory 32 is connected to the processor 31 and the communicator 33 through a system bus for mutual communication; the memory 32 is used for storing a computer program, the communicator 33 is used for communicating with other devices, and the processor 31 is used for running the computer program so that the electronic terminal performs the steps of the above crowd counting and density estimation method.
The above-mentioned system bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus. The communication interface is used for realizing communication between the database access device and other equipment (such as a client, a read-write library and a read-only library). The Memory may include a Random Access Memory (RAM), and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In summary, the crowd counting and density estimating method, device, storage medium and terminal provided by the invention can fundamentally reduce the labeling workload of crowd images during crowd counting and improve the generalization performance of the model. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A crowd counting and density estimation method, comprising:
randomly selecting, subject to the labeling budget, a plurality of images from the unlabeled crowd image dataset for annotation to generate corresponding crowd density labels, and adding them to the labeled crowd image dataset;
training a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges;
based on the model, selecting a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy to generate corresponding crowd density labels, and adding them to the labeled crowd image dataset;
continuing to iterate and optimize until the model meets the performance requirement;
adding an adversarial learning branch based on feature mixing and gradient reversal to the model meeting the performance requirement, and training the model with the labeled and unlabeled crowd images;
and performing crowd counting and density estimation on the crowd images to be predicted, using the model trained with the labeled and unlabeled crowd images.
2. The method of claim 1, wherein selecting, based on the model, a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy to generate corresponding crowd density labels and adding them to the labeled crowd image dataset comprises:
predicting the remaining unlabeled crowd image dataset with the model to obtain a density prediction map and a count prediction value for each unlabeled crowd image;
partitioning the set of count prediction values into regions using a natural breakpoint method, and dividing the labeled and unlabeled crowd images according to the resulting regions;
calculating, within each region, a grid-division-based similarity metric value between the labeled crowd image dataset and the unlabeled crowd image dataset;
converting, within each region, the similarity metric value of each unlabeled crowd image into a probability value through normalization;
and sampling the unlabeled crowd images in each region without replacement according to the probability values, annotating the heads in the sampled unlabeled crowd images to generate crowd density labels, and adding them to the labeled crowd image dataset.
3. The method of claim 1, wherein the crowd density label is obtained by convolving the head annotations of the image with a Gaussian kernel or an adaptive Gaussian kernel.
4. The method of claim 1, wherein the method of training a crowd counting and density estimation model based on the labeled crowd image dataset comprises a density map regression method.
5. The method of claim 4, wherein the density map regression method can employ a mean squared error loss function.
6. The method of claim 1, wherein the adversarial learning branch comprises a feature mixing layer, a gradient reversal layer, and a subsequent binary classification module.
7. The method of claim 1, wherein the adversarial learning branch can employ a binary cross-entropy loss function.
8. A crowd counting and density estimation apparatus, comprising:
a labeling module, which randomly selects, subject to the labeling budget, a plurality of images from the unlabeled crowd image dataset for annotation, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset;
a training module, which trains a crowd counting and density estimation model based on the labeled crowd image dataset until the model converges;
a screening module, which, based on the model, selects a plurality of images from the remaining unlabeled crowd image dataset with a probability-weighted selection strategy, generates corresponding crowd density labels, and adds them to the labeled crowd image dataset;
a loop module, which continues to iterate and optimize until the model meets the performance requirement;
a feature mixing module, which adds an adversarial learning branch based on feature mixing and gradient reversal to the model meeting the performance requirement and trains the model with the labeled and unlabeled crowd images;
and a counting module, which performs crowd counting and density estimation on the crowd image to be predicted using the model trained with the labeled and unlabeled crowd images.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the crowd counting and density estimation method according to any one of claims 1 to 7.
10. An electronic terminal, comprising: a processor and a memory;
the memory is configured to store a computer program, and the processor is configured to execute the computer program stored in the memory, so as to cause the terminal to perform the crowd counting and density estimation method according to any one of claims 1 to 7.
CN202010162552.8A 2020-03-10 2020-03-10 Crowd counting and density estimating method, device, storage medium and terminal Active CN111428587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010162552.8A CN111428587B (en) 2020-03-10 2020-03-10 Crowd counting and density estimating method, device, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010162552.8A CN111428587B (en) 2020-03-10 2020-03-10 Crowd counting and density estimating method, device, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN111428587A true CN111428587A (en) 2020-07-17
CN111428587B CN111428587B (en) 2022-07-29

Family

ID=71551532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010162552.8A Active CN111428587B (en) 2020-03-10 2020-03-10 Crowd counting and density estimating method, device, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN111428587B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181312A (en) * 2020-10-23 2021-01-05 北京安石科技有限公司 Method and system for quickly reading hard disk data
CN112364788A (en) * 2020-11-13 2021-02-12 润联软件系统(深圳)有限公司 Monitoring video crowd quantity monitoring method based on deep learning and related components thereof
CN112906517A (en) * 2021-02-04 2021-06-04 广东省科学院智能制造研究所 Self-supervision power law distribution crowd counting method and device and electronic equipment
CN113361429A (en) * 2021-06-11 2021-09-07 长江大学 Analysis method and experimental device for movement behaviors of stored grain pests
CN113516029A (en) * 2021-04-28 2021-10-19 上海科技大学 Image crowd counting method, device, medium and terminal based on partial annotation

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853400A (en) * 2010-05-20 2010-10-06 武汉大学 Multiclass image classification method based on active learning and semi-supervised learning
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
CN106991444A (en) * 2017-03-31 2017-07-28 西南石油大学 The Active Learning Method clustered based on peak density
CN108804635A (en) * 2018-06-01 2018-11-13 广东电网有限责任公司 A kind of method for measuring similarity based on Attributions selection
CN110781941A (en) * 2019-10-18 2020-02-11 苏州浪潮智能科技有限公司 Human ring labeling method and device based on active learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101853400A (en) * 2010-05-20 2010-10-06 武汉大学 Multiclass image classification method based on active learning and semi-supervised learning
EP2704060A2 (en) * 2012-09-03 2014-03-05 Vision Semantics Limited Crowd density estimation
CN106991444A (en) * 2017-03-31 2017-07-28 西南石油大学 The Active Learning Method clustered based on peak density
CN108804635A (en) * 2018-06-01 2018-11-13 广东电网有限责任公司 A kind of method for measuring similarity based on Attributions selection
CN110781941A (en) * 2019-10-18 2020-02-11 苏州浪潮智能科技有限公司 Human ring labeling method and device based on active learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JENNIFER VANDONI, EMANUEL ALDEA: "Active Learning for High-Density Crowd Count Regression", IEEE *
张晓鹏: "Research on Active Learning and Semi-Supervised Learning Methods in Speech Emotion Recognition" (in Chinese), Wanfang Data *
韩新怡: "Research on Crowd Density Estimation Algorithms Based on Deep Learning" (in Chinese), Wanfang Data *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112181312A (en) * 2020-10-23 2021-01-05 北京安石科技有限公司 Method and system for quickly reading hard disk data
CN112364788A (en) * 2020-11-13 2021-02-12 润联软件系统(深圳)有限公司 Monitoring video crowd quantity monitoring method based on deep learning and related components thereof
CN112364788B (en) * 2020-11-13 2021-08-03 润联软件系统(深圳)有限公司 Monitoring video crowd quantity monitoring method based on deep learning and related components thereof
CN112906517A (en) * 2021-02-04 2021-06-04 广东省科学院智能制造研究所 Self-supervision power law distribution crowd counting method and device and electronic equipment
CN112906517B (en) * 2021-02-04 2023-09-19 广东省科学院智能制造研究所 Self-supervision power law distribution crowd counting method and device and electronic equipment
CN113516029A (en) * 2021-04-28 2021-10-19 上海科技大学 Image crowd counting method, device, medium and terminal based on partial annotation
CN113516029B (en) * 2021-04-28 2023-11-07 上海科技大学 Image crowd counting method, device, medium and terminal based on partial annotation
CN113361429A (en) * 2021-06-11 2021-09-07 长江大学 Analysis method and experimental device for movement behaviors of stored grain pests
CN113361429B (en) * 2021-06-11 2022-11-04 长江大学 Analysis method and experimental device for movement behaviors of stored grain pests

Also Published As

Publication number Publication date
CN111428587B (en) 2022-07-29

Similar Documents

Publication Publication Date Title
CN111428587B (en) Crowd counting and density estimating method, device, storage medium and terminal
Amid et al. TriMap: Large-scale dimensionality reduction using triplets
Wang et al. Fast visual object counting via example-based density estimation
Zhu et al. Learning from labeled and unlabeled data with label propagation
Zhang et al. Unsupervised feature selection via adaptive graph learning and constraint
Asadi et al. A convolution recurrent autoencoder for spatio-temporal missing data imputation
Jiang et al. Gaussian-induced convolution for graphs
Yu et al. Dynamic background subtraction using histograms based on fuzzy c-means clustering and fuzzy nearness degree
Seo et al. Self-organizing maps and clustering methods for matrix data
Wang et al. Multi-attention mutual information distributed framework for few-shot learning
Hou et al. Network pruning via resource reallocation
Gao et al. Feature redundancy based on interaction information for multi-label feature selection
Montero et al. Efficient large-scale face clustering using an online Mixture of Gaussians
Zaidi et al. Distributed deep variational information bottleneck
Chen et al. Semi-supervised dual-branch network for image classification
CN111259938B (en) Manifold learning and gradient lifting model-based image multi-label classification method
Umer et al. Comparative analysis of extreme verification latency learning algorithms
Jin et al. Blind image quality assessment for multiple distortion image
Xu et al. Semi-supervised deep density clustering
Ding et al. Saliency detection via background prior and foreground seeds
Jia et al. Revisiting saliency metrics: Farthest-neighbor area under curve
CN111310842A (en) Density self-adaptive rapid clustering method
Li et al. Learning graph neural networks with approximate gradient descent
He et al. Large-scale Graph Sinkhorn Distance Approximation for Resource-constrained Devices
de Freitas et al. Compressed Hierarchical Representations for Multi-task Learning and Task Clustering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant