CN112580616B - Crowd quantity determination method, device, equipment and storage medium - Google Patents

Crowd quantity determination method, device, equipment and storage medium

Info

Publication number
CN112580616B
CN112580616B (Application CN202110218075.7A)
Authority
CN
China
Prior art keywords
sub
image
classification
people
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110218075.7A
Other languages
Chinese (zh)
Other versions
CN112580616A (en)
Inventor
王昌安
宋庆宇
张博深
王亚彪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110218075.7A priority Critical patent/CN112580616B/en
Publication of CN112580616A publication Critical patent/CN112580616A/en
Application granted granted Critical
Publication of CN112580616B publication Critical patent/CN112580616B/en
Priority to PCT/CN2022/077578 priority patent/WO2022179542A1/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/53Recognition of crowd images, e.g. recognition of crowd congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30242Counting objects in image

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to the technical field of image processing, and discloses a method, an apparatus, a device and a storage medium for determining the number of people in a crowd. The method comprises the following steps: acquiring a first image; performing feature extraction on the first image to obtain a crowd density feature map corresponding to the first image; classifying the crowd density feature map based on at least two people number classification intervals; obtaining the people number information corresponding to each sub-region in the first image based on the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first image; and obtaining the number of people corresponding to the first image based on the people number information corresponding to each sub-region in the first image. The scheme can be applied to the field of intelligent traffic: the number of people in each region of the first image is estimated according to how the first image is classified under two people number classification intervals with different sub-intervals, which improves the accuracy of estimating the number of people in the image and thereby the accuracy of intelligent traffic scheduling.

Description

Crowd quantity determination method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer vision, and in particular, to a method, an apparatus, a device, and a storage medium for determining a number of people.
Background
Crowd density estimation is an application capable of automatically deducing the total number of people in an image, and plays an important role in the fields of video monitoring, public safety and the like.
Compared with traditional methods based on detection and direct regression, most current crowd density estimation algorithms combine thermodynamic (heat) map regression with deep learning for end-to-end training and inference, which copes well with the wide range of crowd densities and the wide variation of head scales, and improves counting accuracy to a certain extent. When identifying the number of people in an image, image blocks can be divided into different categories according to the total number of people in each block (each category corresponds to a people number range, i.e. a counting interval), which avoids the sensitivity to outliers that arises when regressing an exact head count.
In such schemes, when the image blocks are divided into categories, the predicted number of people for a block is uniformly set to the proxy count value of the corresponding interval. This introduces a large discretization error and degrades the accuracy of predicting the number of people in the image.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for determining the number of people, which can improve the accuracy of estimating the number of people in an image.
In one aspect, a method for determining a population number is provided, the method comprising:
acquiring a first image;
performing feature extraction based on the first image to obtain a crowd density feature map corresponding to the first image;
classifying the crowd density feature map respectively based on at least two people number classification intervals to obtain sub-intervals corresponding to each sub-area in the first image in the at least two people number classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
obtaining the people number information corresponding to each sub-region in the first image based on the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first image;
and acquiring the number of people corresponding to the first image based on the information of the number of people corresponding to each sub-region in the first image.
In another aspect, a method for determining a population number is provided, the method comprising:
acquiring a first sample image and sub-sections corresponding to each sub-area in the first sample image in at least two people number classification sections respectively; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
performing feature extraction on the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image;
classifying the sample crowd density feature map through feature classification layers respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model to obtain prediction results respectively corresponding to each sub-region in the first sample image; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
training the crowd density estimation model based on the prediction results, corresponding to the at least two feature classification layers, for the regions of the first sample image, and on the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first sample image;
and the trained crowd density estimation model is used for processing the input first image to obtain the crowd number corresponding to the first image.
In yet another aspect, there is provided a population quantity determination apparatus, the apparatus comprising:
the first image acquisition module is used for acquiring a first image;
the first image extraction module is used for extracting features based on the first image to obtain a crowd density feature map corresponding to the first image;
the characteristic map classification module is used for classifying the crowd density characteristic map respectively based on at least two crowd classification intervals to obtain sub-intervals corresponding to each sub-area in the first image in the at least two crowd classification intervals respectively; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
the people number information acquisition module is used for acquiring the people number information corresponding to each sub-region in the first image based on the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first image;
and the image number obtaining module is used for obtaining the number of people corresponding to the first image based on the number information respectively corresponding to each subarea in the first image.
In a possible implementation manner, the first image extraction module is further configured to,
performing feature extraction on the first image through a feature map acquisition layer in a crowd density estimation model to acquire the crowd density feature map corresponding to the first image;
the feature map classification module is further configured to,
classifying the crowd density feature map through feature classification layers which respectively correspond to the at least two people number classification intervals in the crowd density estimation model, to obtain the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first image;
the crowd density estimation model is a machine learning model obtained by training by taking a first sample image as a training sample and taking subintervals of all the subregions in the first sample image, which respectively correspond to the at least two people number classification intervals, as labels.
In one possible implementation, the apparatus further includes:
the first sample image acquisition module is used for acquiring a first sample image and the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image;
the first sample extraction module is used for extracting the features of the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image;
the first sample classification module is used for classifying the sample crowd density characteristic diagram through the characteristic classification layers which are respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model to obtain the prediction results which are respectively corresponding to the sub-regions in the first sample image; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
and the crowd density estimation model training module is used for training the crowd density estimation model based on the prediction results, corresponding to the at least two feature classification layers, for the regions of the first sample image, and on the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image.
In one possible implementation manner, the first sample image obtaining module includes:
a training sample set obtaining unit for obtaining a training sample set; the training sample set comprises all sample images and image labels corresponding to all the sample images; the image labels are used for indicating the positions of the people in each sample image;
the sample crowd number acquiring unit is used for acquiring the crowd number of each subarea in the first sample image based on the image label corresponding to the first sample image; the first sample image is any one of the respective sample images;
and the sub-interval obtaining unit is used for obtaining sub-intervals corresponding to the sub-areas in the first sample image in the at least two people number classification intervals respectively based on the number of people in each sub-area in the first sample image.
In one possible implementation, in response to the at least two people classification intervals including a first people classification interval and a second people classification interval, the apparatus further includes:
a sample crowd number obtaining module, configured to obtain, based on the image label corresponding to each sample image, a crowd number of each sub-region in each sample image;
a first interval obtaining module, configured to determine the first person classification interval based on the number of people in each sub-region in each sample image;
and the second interval acquisition module is used for determining the second person number classification interval based on the first person number classification interval.
In a possible implementation manner, the first interval obtaining module includes:
the endpoint set acquisition unit is used for acquiring a first endpoint set based on the maximum value of the number of people in each sub-region in each sample image; the first set of endpoints is to indicate interval endpoints of a first people classification interval;
a segmentation point set obtaining unit, configured to determine a first segmentation point set based on an interval endpoint of the first person classification interval; the first set of segmentation points is to indicate interval segmentation points of the first people classification interval;
a people-number-classification-interval determining unit configured to determine the first people-number classification interval based on the first endpoint set and the first segment point set.
In one possible implementation, the first person classification interval includes at least two first subintervals;
the second interval obtaining module includes:
an interval proxy value obtaining unit, configured to determine, based on each first subinterval, an interval proxy value corresponding to each first subinterval; the interval agent value is used for determining the number information of people in the image area corresponding to the subinterval;
and the second section determining unit is used for determining the second people number classification section based on the section proxy value corresponding to the first subinterval.
In a possible implementation manner, the second interval obtaining unit includes:
a second endpoint set obtaining subunit, configured to obtain a second endpoint set based on a maximum value of the number of people in each sub-region in each sample image; the second set of endpoints is to indicate an interval endpoint for the second people classification interval;
a second segmentation point set obtaining subunit, configured to obtain a second segmentation point set based on the interval proxy value corresponding to the first subinterval; the second segment point set is used for indicating an interval segment point of the second people number classification interval;
and the second people number classification interval obtaining subunit is used for obtaining the second people number classification interval based on the second endpoint set and the second segmentation point set.
In one possible implementation manner, the sample population number obtaining module is configured to,
obtaining a first sample hotspot graph corresponding to the first sample image based on the first sample image and the image label corresponding to the first sample image; the first sample hotspot graph is used for indicating the positions of the crowds in the first sample image;
based on the first sample heat point diagram, performing data processing through a Gaussian convolution kernel to obtain a first sample thermodynamic diagram corresponding to the first sample image;
and respectively integrating each sub-region in the first sample image based on the first sample thermodynamic diagram to obtain the number of people in each sub-region of the first sample image.
In one possible implementation, in response to the at least two people classification intervals including a first people classification interval and a second people classification interval, the feature map classification module is configured to,
classifying the crowd density feature map based on a feature classification layer corresponding to the first person classification interval to obtain sub-intervals corresponding to the sub-areas in the first person classification interval in the first image;
classifying the crowd density feature map based on a feature classification layer corresponding to the second crowd classification interval to obtain sub-intervals corresponding to the sub-areas in the second crowd classification interval in the first image;
obtaining the people number information corresponding to each sub-region in the first image based on the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first image, including:
determining first people number information corresponding to each sub-region in the first image based on the sub-interval, in the first people number classification interval, corresponding to each sub-region in the first image;
determining second people number information corresponding to each sub-region in the first image based on the sub-interval, in the second people number classification interval, corresponding to each sub-region in the first image;
and acquiring the people number information corresponding to each sub-region in the first image based on the first people number information corresponding to each sub-region in the first image and the second people number information corresponding to each sub-region in the first image.
In yet another aspect, there is provided a population quantity determination apparatus, the apparatus comprising:
the system comprises a sample image acquisition module, a first image acquisition module and a second image acquisition module, wherein the sample image acquisition module is used for acquiring a first sample image and sub-sections corresponding to each sub-area in the first sample image in at least two people number classification sections; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
the sample image extraction module is used for extracting the features of the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image;
a prediction result obtaining module, configured to classify the sample crowd density feature map through feature classification layers in the crowd density estimation model, where the feature classification layers correspond to the at least two crowd classification intervals, respectively, so as to obtain prediction results corresponding to the sub-regions in the first sample image, respectively; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
a model training module, configured to train the crowd density estimation model based on the prediction results, corresponding to the at least two feature classification layers, for the regions of the first sample image, and on the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image;
and the trained crowd density estimation model is used for processing the input first image to obtain the crowd number corresponding to the first image.
In yet another aspect, a computer-readable storage medium is provided, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, which is loaded and executed by a processor to implement the above-mentioned crowd quantity determination method.
In yet another aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the crowd amount determination method.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
classifying the crowd density feature map corresponding to the first image through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and determining the people number information of each sub-region according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results under two people number classification intervals with different sub-intervals, which reduces the discretization error generated during people number classification and improves the accuracy of people number estimation in the image.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
FIG. 1 illustrates a schematic diagram of a computer system provided by an exemplary embodiment of the present application;
FIG. 2 is a flow diagram illustrating a method for population quantity determination according to an exemplary embodiment;
FIG. 3 is a flow diagram illustrating a method for population quantity determination according to an exemplary embodiment;
FIG. 4 is a method flow diagram illustrating a method of crowd determination according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a method for determining annotation types according to the embodiment shown in FIG. 4;
FIG. 6 is a diagram illustrating data classification through two classification levels according to the embodiment shown in FIG. 4;
FIG. 7 is a schematic diagram of a model of a feature map acquisition layer according to the embodiment shown in FIG. 4;
FIG. 8 is a diagram illustrating a classification of prediction intervals according to the embodiment shown in FIG. 4;
FIG. 9 is a block flow diagram of a model training and population quantity estimation provided in accordance with an exemplary embodiment;
fig. 10 is a block diagram illustrating a configuration of a population quantity determining apparatus according to an exemplary embodiment;
fig. 11 is a block diagram illustrating a configuration of a population quantity determining apparatus according to an exemplary embodiment;
FIG. 12 is a block diagram illustrating a computer device according to an example embodiment.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
First, terms related to embodiments of the present application will be described.
1) Artificial Intelligence (AI)
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
2) Computer Vision (Computer Vision, CV)
Computer vision is a science that studies how to make a machine "see"; it uses cameras and computers in place of human eyes to identify, track and measure targets and to perform further graphics processing, so that the computer produces an image that is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also includes common biometric technologies such as face recognition and fingerprint recognition.
3) Machine Learning (Machine Learning, ML)
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence, is the fundamental way to make computers intelligent, and is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
4) Intelligent traffic (Intelligent Transport)
Intelligent traffic builds on intelligent transportation: technologies such as the Internet of Things, cloud computing, the Internet, artificial intelligence, automatic control and the mobile Internet are fully applied in the traffic field to collect traffic information and to manage and support traffic management, transportation, public travel and other aspects of the traffic field, as well as the whole process of traffic construction and management. This gives the traffic system the capabilities of perception, interconnection, analysis, prediction and control within a region, a city or an even larger spatio-temporal range, fully guarantees traffic safety, brings the efficiency of traffic infrastructure into play, improves the operation efficiency and management level of the traffic system, and serves smooth public travel and sustainable economic development.
The crowd quantity determining method provided by the embodiment of the application can be applied to computer equipment with stronger data processing capacity. The crowd quantity determining method can be a training method for a crowd density estimation model, and the crowd density estimation model can process the input image to obtain the crowd quantity corresponding to the input image. In a possible implementation manner, the crowd quantity determination method provided by the embodiment of the present application may be applied to a personal computer, a workstation, or a server, that is, training of the crowd density estimation model may be performed by the personal computer, the workstation, or the server. In a possible implementation manner, the crowd density estimation model trained by the crowd quantity determination method provided by the embodiment of the application can be applied to data processing of input image data to obtain prediction data of the crowd quantity corresponding to the image.
Referring to FIG. 1, a schematic diagram of a computer system provided by an exemplary embodiment of the present application is shown. The computer system 200 includes a terminal 110 and a server 120, wherein the terminal 110 and the server 120 perform data communication through a communication network, optionally, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 110 has an application program with an image processing function installed therein, where the application program may be a professional image processing application program, a social contact application program, a virtual reality application program, a game application program, or an Artificial Intelligence (AI) application program with an image processing function, and the application is not limited thereto in this embodiment.
Optionally, the terminal 110 may be a terminal device having an image acquisition component, where the image acquisition component is used to acquire an image and store the image in a data storage module in the terminal 110; the terminal 110 may also be a terminal device having a data transmission interface for receiving image data captured by an image capture device having an image capture component.
Optionally, the terminal 110 may be a mobile terminal such as a smart phone, a tablet computer or a laptop computer, or a terminal such as a desktop computer or a projection computer, or an intelligent terminal having a data processing component, which is not limited in the embodiments of the present application.
The server 120 may be implemented as one server, or may be implemented as a server cluster formed by a group of servers, which may be physical servers, or may be implemented as a cloud server. In one possible implementation, the server 120 is a backend server for the application in the terminal 110.
In a possible implementation manner of this embodiment, the server 120 trains the crowd density estimation model through a preset training sample set (i.e., sample images), where the training sample set may include sample images with different crowd densities. After the training process of the crowd density estimation model by the server 120 is completed, the trained crowd density estimation model is sent to the terminal 110 through wired or wireless connection. The terminal 110 receives the trained crowd density estimation model, and inputs data information corresponding to the crowd density estimation model into an application program with a crowd number determination function, so that when a user uses the application program to process image data, the image data can be processed according to the trained crowd density estimation model, and all or part of steps of the crowd number determination method can be realized.
FIG. 2 is a flow diagram illustrating a method for population quantity determination according to an exemplary embodiment. The method may be performed by a computer device, which may be the terminal 110 in the embodiment shown in fig. 1 described above. As shown in fig. 2, the flow of the population quantity determination method may include the following steps.
Step 201, a first image is acquired.
Step 202, performing feature extraction based on the first image to obtain a crowd density feature map corresponding to the first image.
And 203, classifying the crowd density feature map respectively based on at least two people number classification intervals to obtain sub-intervals corresponding to each sub-area in the first image in the at least two people number classification intervals.
Wherein the at least two people number classification intervals have different interval segmentation points; the segment segmentation point is used for segmenting the people number classification segment into at least two sub-segments.
And 204, acquiring the number information corresponding to each sub-region in the first image based on the sub-regions corresponding to each sub-region in the at least two people number classification regions in the first image.
In one possible implementation, the people number information is used to indicate a predicted number of people for each sub-region in the first image.
Step 205, acquiring the number of people corresponding to the first image based on the information of the number of people corresponding to each sub-region in the first image.
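By way of illustration only (this sketch is not part of the patent disclosure), steps 201 to 205 could be realised along the following lines; the PyTorch layers, the backbone producing the crowd density feature map, the 1x1 convolutional classification heads, and the rule of averaging the two heads' proxy counts are all assumptions made for the example:

```python
import torch
import torch.nn as nn

class CrowdCounter(nn.Module):
    """Feature map acquisition layer (backbone) followed by two feature
    classification layers, one per people number classification interval
    (steps 202-203)."""
    def __init__(self, backbone, feat_channels, n_classes_a, n_classes_b):
        super().__init__()
        self.backbone = backbone                          # crowd density feature map
        self.head_a = nn.Conv2d(feat_channels, n_classes_a, kernel_size=1)
        self.head_b = nn.Conv2d(feat_channels, n_classes_b, kernel_size=1)

    def forward(self, image):
        feat = self.backbone(image)        # one spatial cell per sub-region
        return self.head_a(feat), self.head_b(feat)

def estimate_count(model, image, proxy_a, proxy_b):
    """Steps 204-205: map each sub-region's predicted sub-interval to its
    interval proxy value and sum over all sub-regions; averaging the two
    heads is only one possible merge rule."""
    logits_a, logits_b = model(image.unsqueeze(0))
    count_a = proxy_a[logits_a.argmax(dim=1)]   # per-sub-region proxy counts
    count_b = proxy_b[logits_b.argmax(dim=1)]
    per_region = (count_a + count_b) / 2.0
    return per_region.sum().item()
```

Here proxy_a and proxy_b are assumed to be 1-D tensors holding the interval proxy value of every sub-interval of the first and second people number classification intervals, respectively.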
In summary, in the embodiment of the present application, the crowd density feature map corresponding to the first image is classified through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and the people number information of each sub-region is determined according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results under two people number classification intervals with different sub-intervals, which reduces the discretization error generated during people number classification and improves the accuracy of people number estimation in the image.
FIG. 3 is a flow diagram illustrating a method for population quantity determination according to an exemplary embodiment. The method may be performed by a computer device, which may be the server 120 in the embodiment illustrated in fig. 1 described above. As shown in fig. 3, the flow of the population quantity determination method may include the following steps.
Step 301, a first sample image and sub-regions corresponding to the sub-regions in the first sample image in the at least two people number classification regions are obtained.
Wherein the at least two people number classification intervals have different interval segmentation points; the segment segmentation point is used for segmenting the people number classification segment into at least two sub-segments.
Step 302, performing feature extraction on the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image.
And step 303, classifying the sample crowd density characteristic diagram through the characteristic classification layers respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model, and obtaining the prediction results respectively corresponding to the sub-regions in the first sample image.
The prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals.
Step 304, training the crowd density estimation model based on the prediction results of the regions in the first sample image respectively corresponding to the at least two feature classification layers and the sub-regions of the sub-regions in the first sample image respectively corresponding to the sub-regions in the at least two people number classification regions.
The trained crowd density estimation model is used for processing the input first image to obtain the crowd number corresponding to the first image.
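A minimal, non-authoritative sketch of one training iteration for steps 301 to 304, assuming the two-head model sketched after step 205, a per-sub-region cross-entropy loss against the sub-interval labels of each people number classification interval, and an externally constructed optimizer; the equal weighting of the two losses is also an assumption:

```python
import torch.nn as nn

def train_step(model, optimizer, image, labels_a, labels_b):
    """One training iteration for steps 301-304.  labels_a / labels_b are
    LongTensors of shape (1, H', W') holding, for every sub-region, the index
    of its sub-interval in the first / second people number classification
    interval (the label obtained in step 301)."""
    ce = nn.CrossEntropyLoss()
    logits_a, logits_b = model(image.unsqueeze(0))           # steps 302-303
    loss = ce(logits_a, labels_a) + ce(logits_b, labels_b)   # step 304
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```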
In summary, in the embodiment of the present application, the crowd density feature map corresponding to the first image is classified through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and the people number information of each sub-region is determined according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results under two people number classification intervals with different sub-intervals, which reduces the discretization error generated during people number classification and improves the accuracy of people number estimation in the image.
FIG. 4 is a method flow diagram illustrating a method of population quantity determination according to an exemplary embodiment. The method may be performed jointly by a model processing device and a data processing device, where the model processing device may be the server 120 in the embodiment shown in fig. 1 and the data processing device may be the terminal 110 in the embodiment shown in fig. 1. As shown in fig. 4, the flow of the population quantity determination method may include the following steps.
Step 401, a first sample image and sub-regions corresponding to the sub-regions in the first sample image in the at least two people number classification regions are obtained.
In one possible implementation, a training sample set is obtained; the training sample set comprises the sample images and image labels corresponding to the sample images; the image labels are used for indicating the positions of the people in each sample image; the number of people in each sub-region in the first sample image is acquired based on the image label corresponding to the first sample image; the first sample image is any one of the respective sample images; and the sub-intervals, in the at least two people number classification intervals, corresponding to each sub-region in the first sample image are acquired based on the number of people in each sub-region in the first sample image.
In one possible implementation, the image annotation may be generated based on the head position of each object (i.e., human body) on each sample image, that is, the image annotation determines the position of the crowd and the information of the number of people on the first sample image according to the head position information of each object on each sample image.
In one possible implementation, the sample images in the training sample set are sample images of the same resolution, i.e., the image pixel values of the sample images in the training sample set are the same.
When each sample image in the training sample set is a sample image with the same resolution, the image label on each sample image can be used for indicating the information of the number of people corresponding to each sample image, and can also indicate the information of the crowd density corresponding to each sample image.
In a possible implementation manner, a first sample hotspot graph corresponding to the first sample image is obtained based on the first sample image and the image label corresponding to the first sample image; the first sample hotspot graph is used for indicating the positions of the crowd in the first sample image; based on the first sample hotspot graph, data processing is performed through a Gaussian convolution kernel to obtain a first sample thermodynamic diagram corresponding to the first sample image; and each sub-region in the first sample image is integrated based on the first sample thermodynamic diagram to obtain the number of people in each sub-region of the first sample image.
When the first sample image and the image label corresponding to the first sample image are obtained, the positions corresponding to the image label on the first sample image may be highlighted according to the image label corresponding to the first sample image, so as to obtain the first sample hotspot graph corresponding to the first sample image. For example, consider the N head center points x1 to xN in the first sample hotspot graph. For each head center point xi, a two-dimensional response map Hi can be generated in which only the pixel at the head center position has value 1 and all other positions are 0; the Hi maps corresponding to all head center points are then summed to obtain the response map H (i.e. the first sample hotspot graph) of all heads in the first sample image, and the integral of this response map is the total number of people.
When the first sample image is partitioned into image blocks, a sub-region may contain the head center point of an object (human body) even though the object is not entirely located inside that sub-region; because the head center point lies in the sub-region, the object is counted as belonging entirely to it. The people number information of each sub-region obtained directly from the first sample hotspot graph is therefore inaccurate. In this case, the response map can be convolved with a normalized Gaussian convolution kernel to obtain the first sample thermodynamic diagram corresponding to the first sample image. The first sample thermodynamic diagram is a Gaussian distribution map formed around the center point of each human head in the first sample image, and the pixel value of each point indicates the crowd density at that point, so the first sample thermodynamic diagram can be used to indicate the crowd density at individual pixels of the first sample image. Because the Gaussian kernel is normalized, the value obtained by integrating the first sample thermodynamic diagram is still the total number of people in the first sample image; similarly, integrating over each sub-region of the first sample image yields the number of people corresponding to each sub-region.
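The following sketch (illustrative only; the fixed Gaussian width sigma and the use of scipy.ndimage.gaussian_filter are assumptions) shows how the response map and the normalized-Gaussian-smoothed thermodynamic diagram described above could be produced:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def build_density_map(head_points, height, width, sigma=4.0):
    """Response map H: value 1 at every annotated head center, 0 elsewhere.
    Convolving H with a normalized Gaussian kernel gives the first sample
    thermodynamic diagram (density map); because the kernel is normalized,
    the density map still integrates to the total number of people."""
    response = np.zeros((height, width), dtype=np.float32)
    for x, y in head_points:                 # image labels: head center points
        response[int(y), int(x)] = 1.0
    density = gaussian_filter(response, sigma=sigma)
    return response, density
```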
In one possible implementation, the first sample image is scanned according to a non-overlapping sliding window of a specified size, and sub-regions of the first sample image are obtained.
The sub-regions of the first sample image are obtained by scanning with a non-overlapping sliding window of the specified size; that is, the size of each sub-region of the first sample image is determined by the sliding window, and all sub-regions of the first sample image have the same size.
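Continuing the sketch above, the per-sub-region people numbers can then be obtained by integrating the density map within each non-overlapping window; the window size and the assumption that the image dimensions are divisible by it are illustrative:

```python
import numpy as np

def sub_region_counts(density, win=64):
    """Integrate (sum) the density map inside each non-overlapping win x win
    sub-region; every sub-region has the same size."""
    h, w = density.shape
    assert h % win == 0 and w % win == 0     # assumed divisible for simplicity
    blocks = density.reshape(h // win, win, w // win, win)
    return blocks.sum(axis=(1, 3))           # per-sub-region people numbers
```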
In a possible implementation manner, the number of people in each sub-region in each sample image is obtained based on the image label corresponding to each sample image; determining the first person classification interval based on the number of the persons in each sub-region in each sample image; based on the first people classification interval, the second people classification interval is determined.
The first people classification interval may be a people classification interval corresponding to a first classification layer in the people quantity estimation model, where the first classification layer is one of at least two feature classification layers of the people quantity estimation model.
Different feature classification layers have different people number classification intervals, the different people number classification intervals have different interval segmentation points, and the interval segmentation points divide a people number classification interval into sub-intervals, so the different people number classification intervals have different sub-intervals. Based on the number of people in each sub-region of the first sample image and on the people number classification intervals corresponding to the at least two feature classification layers, the sub-intervals, in the at least two different people number classification intervals (i.e. the people number classification intervals corresponding to the at least two feature classification layers), respectively corresponding to each sub-region of the first sample image can be obtained, so as to determine the label types of each sub-region of the first sample image with respect to the at least two feature classification layers.
Please refer to fig. 5, which illustrates a schematic diagram of a method for determining label types according to an embodiment of the present application. As shown in fig. 5, a first sample hotspot graph is generated from the first sample image and the image label corresponding to the first sample image, and the response map is convolved with a normalized Gaussian convolution kernel to obtain the first sample thermodynamic diagram 501 corresponding to the first sample image, where the pixel value of each point in the first sample thermodynamic diagram 501 indicates the crowd density at that point. Each sub-region of the first sample thermodynamic diagram is integrated to obtain the number of people 502 corresponding to each sub-region of the first sample image, and the number of people 502 corresponding to each sub-region is classified through a people number classification interval 503 in the feature classification layer, where the people number classification interval 503 comprises the sub-intervals [0, 1], [1, 2], [2, 3], [3, 4], [4, 5]. For example, the upper left part "1.2" of the people numbers 502 corresponding to the sub-regions of the first sample image can be classified by the people number classification interval 503 into the [0, 1] sub-interval; the lower left part "4.2" can be classified by the people number classification interval 503 into the [4, 5] sub-interval.
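For illustration, assigning each sub-region's people number to a sub-interval as in FIG. 5 could be sketched as follows; the boundary array and the use of numpy.searchsorted are assumptions, not the required implementation:

```python
import numpy as np

# illustrative sub-interval boundaries matching the [0,1], [1,2], ..., [4,5]
# sub-intervals of the people number classification interval 503 in FIG. 5
boundaries = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

def sub_interval_index(count, boundaries):
    """Index of the sub-interval a per-sub-region people number falls into;
    this index is the classification target of the corresponding feature
    classification layer."""
    idx = np.searchsorted(boundaries, count, side="right") - 1
    return int(np.clip(idx, 0, len(boundaries) - 2))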
In a possible implementation manner, a first endpoint set is obtained based on the maximum value of the number of people in each sub-region in each sample image; the first set of endpoints is to indicate interval endpoints of a first people classification interval; determining a first segmentation point set based on an interval endpoint of the first person classification interval; the first set of segmentation points is used to indicate interval segmentation points of the first people classification interval; determining the first person classification interval based on the first set of endpoints and the first set of segmentation points.
The interval end point of the first person classification interval may be determined according to the maximum value of the number of people in each sub-area in each sample image. Since the first person classification interval is used for classifying the number of people in the sub-region of each sample image in the training sample set, the first person classification interval needs to include the maximum value of the number of people in each sub-region of each sample image in the training sample set.
In a possible implementation manner, the minimum value of the number of people in each sub-region in each sample image is a minimum value that is not zero in the number of people in each sub-region in each sample image.
In one possible implementation, the maximum value of the population number of each sub-region in each sample image is obtained as the first endpoint set.
The minimum value of the number of people in each sub-region in each sample image is obtained as a left end point in the first end point set, the maximum value of the number of people in each sub-region in each sample image is obtained as a right end point in the first end point set, and the left end point and the right end point are interval end points of the first person group classification interval.
Provided that the people number classification interval is guaranteed to cover the number of people in every sub-region of all sample images in the training sample set, a smaller interval gives a more accurate classification; therefore the maximum value of the number of people among the sub-regions of all sample images in the training sample set can be directly determined as the interval endpoint of the first people number classification interval.
After the first endpoint set is determined, that is, after the interval endpoint of the first person group classification interval is determined, the interval segmentation point of the first person group classification interval may be determined according to the interval endpoint of the first person group classification interval.
In one possible implementation, the first classification number is obtained; based on the first classification number, an interval segmentation point of the first crowd classification interval is determined.
The first classification number is used for indicating the number of types which can be obtained after the first classification layer classifies the input sample image. For example, when the first classification number is N (N is greater than or equal to 2, and N is a positive integer), that is, after the data is classified by the first classification layer, the probabilities that the data are respectively of N types may be obtained, at this time, the number of the segment points of the first-person classification interval corresponding to the first classification layer may be N-1, and the first-person classification interval is segmented by the N-1 segment points, so that N first sub-intervals may be obtained.
In a possible implementation manner, based on the first people number classification interval, the interval is divided evenly according to the first classification number, thereby obtaining the interval segmentation points of the first people number classification interval.
In another possible implementation manner, the sub-interval boundaries (interval endpoints and segmentation points) of the first people number classification interval corresponding to the first classification layer may be taken as e^{k·(log(b) − log(a))/K + log(a)} for k = 0, 1, …, K, where a is the smallest and b the largest total number of people among the sub-regions, excluding sub-regions whose number of people is 0, and K is the number of sub-intervals to be divided. In this case the sizes of the sub-intervals follow a non-linear distribution: the sub-intervals used to classify smaller people numbers are packed more densely, and the sub-intervals used to classify larger people numbers are spread more widely, which gives a better classification effect for people numbers of different densities.
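A small sketch of these non-linearly distributed boundaries under the stated assumptions (a the smallest non-zero per-sub-region count, b the largest, K the number of sub-intervals):

```python
import numpy as np

def log_spaced_boundaries(a, b, K):
    """Sub-interval boundaries e^{k*(log(b) - log(a))/K + log(a)}, k = 0..K,
    so boundaries[0] == a and boundaries[K] == b; small-count sub-intervals
    come out densely packed, large-count sub-intervals wider."""
    k = np.arange(K + 1)
    return np.exp(k * (np.log(b) - np.log(a)) / K + np.log(a))
```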
In one possible implementation, the first person classification interval includes at least two first sub-intervals; determining an interval proxy value corresponding to each first subinterval based on each first subinterval; the interval agent value is used for determining the number information of the people in the image area corresponding to the subinterval; and determining the second people classification interval based on the interval agent value corresponding to the first subinterval.
When the first people number classification interval corresponding to the first classification layer includes at least two first sub-intervals, that is, when the first people number classification interval is composed of a plurality of first sub-intervals each delimited by the interval segmentation points, an interval proxy value corresponding to each first sub-interval can be determined from that sub-interval. When the type of an image block is determined through the first classification layer, that is, when the first sub-interval corresponding to the image block is determined, the number of people in the image block is determined according to the interval proxy value of that first sub-interval.
In a possible implementation manner, a second endpoint set is obtained based on the maximum value of the number of people in each sub-region in each sample image; the second endpoint set is used for indicating an interval endpoint of the second people number classification interval; acquiring a second segmentation point set based on the interval proxy value corresponding to the first subinterval; the second segmentation point set is used for indicating interval segmentation points of the second people number classification interval; and acquiring the second person number segmentation interval based on the second endpoint set and the second segmentation point set.
The second people number classification interval may be a people number classification interval corresponding to a second classification layer in the people number estimation model, where the second classification layer is one of at least two feature classification layers of the people number estimation model.
The interval endpoints of the second people number classification interval are determined according to the maximum value of the number of people in each sub-region of each sample image, and may be consistent with the interval endpoints of the people number classification interval corresponding to the first classification layer. The second segmentation point set may be determined based on the interval proxy values corresponding to the first sub-intervals.
In one possible implementation manner, the interval proxy value corresponding to each first sub-interval is determined as the second segmentation point set.
In one possible implementation, the interval proxy value of each first sub-interval is an interval midpoint of each first sub-interval.
The interval segmentation points of the second people number classification interval are determined according to the midpoints of the first sub-intervals in the first classification layer. When data lying at the edge of a first sub-interval is classified directly by the first classification layer, the discretization error is relatively large; such data can instead be classified through the second sub-intervals. Because the interval segmentation points of the second people number classification interval are determined from the midpoints of the first sub-intervals, the second sub-intervals are interleaved with the first sub-intervals, and data at the edge of a first sub-interval usually lies near the centre of a second sub-interval, where it can be classified more reliably. Determining the type of the first sample data through both the first classification layer and the second classification layer therefore improves the accuracy of labeling and classifying the sample data.
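A minimal sketch of one way to construct such a staggered second interval from the first one is shown below; the handling of the outermost endpoints (dropping the midpoint next to 0 and extending the upper edge by 0.5, as in the fig. 6 example described below) is an assumption about the embodiment, not a disclosed rule.

```python
def build_second_interval(first_endpoints):
    """Build the second (staggered) people-number classification interval whose
    interior segmentation points are the midpoints (proxy values) of the first
    sub-intervals, so that the two sets of sub-intervals interleave."""
    mids = [(lo + hi) / 2.0
            for lo, hi in zip(first_endpoints[:-1], first_endpoints[1:])]
    lower = first_endpoints[0]            # e.g. 0 people
    upper = first_endpoints[-1] + 0.5     # assumed extension of the top edge
    return [lower] + mids[1:] + [upper]   # assumed: drop the midpoint next to 0

first = [0, 1, 2, 3, 4, 5]
print(build_second_interval(first))       # [0, 1.5, 2.5, 3.5, 4.5, 5.5]
```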
Please refer to fig. 6, which illustrates a schematic diagram of data classification through two classification layers according to an embodiment of the present application. As shown in fig. 6, a response map (hotspot map) is first generated from the first sample image and its corresponding image label, and the response map is convolved with a normalized Gaussian convolution kernel to obtain a first sample heat map 601 corresponding to the first sample image, where the pixel value of each point in the first sample heat map 601 indicates the crowd density at that point. Integrating over each sub-region of the first sample heat map yields the number of people 602 corresponding to each sub-region of the first sample image, and the number of people 602 corresponding to each sub-region is classified through a first people number classification interval 603 and a second people number classification interval 604 in the feature classification layers, where the first people number classification interval 603 includes the sub-intervals [0, 1], [1, 2], [2, 3], [3, 4], [4, 5]; the second people number classification interval 604 includes the sub-intervals [0, 1.5], [1.5, 2.5], [2.5, 3.5], [3.5, 4.5], [4.5, 5.5]. For example, for the value "1.2" in the upper left portion of the number of people 602, the first people number classification interval 603 classifies it into the [1, 2] sub-interval, and the second people number classification interval 604 classifies it into the [0, 1.5] sub-interval; for the value "4.2" in the lower left portion, the first people number classification interval 603 classifies it into the [4, 5] sub-interval, and the second people number classification interval 604 classifies it into the [3.5, 4.5] sub-interval.
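The fig. 6 example can be reproduced with a small sketch like the following; the half-open handling of sub-interval boundaries is an assumption for illustration.

```python
import numpy as np

first  = [0, 1, 2, 3, 4, 5]               # first people-number classification interval
second = [0, 1.5, 2.5, 3.5, 4.5, 5.5]     # second, staggered interval (fig. 6)

def classify(count, endpoints):
    # index of the sub-interval that the per-region count falls into
    idx = int(np.searchsorted(endpoints, count, side="right")) - 1
    return max(0, min(idx, len(endpoints) - 2))

for count in (1.2, 4.2):                  # the two values from the fig. 6 example
    i, j = classify(count, first), classify(count, second)
    print(count,
          f"first: [{first[i]}, {first[i+1]}]",
          f"second: [{second[j]}, {second[j+1]}]")
# 1.2 -> first [1, 2], second [0, 1.5]
# 4.2 -> first [4, 5], second [3.5, 4.5]
```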
Step 402, performing feature extraction on the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image.
The feature map acquisition layer in the crowd density estimation model is used for performing feature extraction on a first sample image in the training sample set to obtain the image features corresponding to the first sample image. The image features obtained through the feature map acquisition layer are used for indicating the crowd information in the first sample image, so the sample crowd density feature map corresponding to the first sample image can be used for indicating the crowd number and crowd density corresponding to the first sample image.
In one possible implementation, the sample population density profile has the same size as the first sample image.
That is, the sample crowd density feature map obtained by performing feature extraction on the first sample image through the crowd density estimation model has the same pixel size as the input first sample image.
In one possible implementation, the feature map acquisition layer in the crowd density estimation model may be a U-shaped neural network with an encoder-decoder structure, where the encoder extracts deep features of the input sample image through downsampling, and the decoder restores the low-resolution deep features to high-resolution image features through upsampling. Please refer to fig. 7, which illustrates a model diagram of a feature map acquisition layer according to an embodiment of the present application. As shown in fig. 7, the feature map acquisition layer includes a down-sampling module 701 composed of convolution layers; the down-sampling module 701 may be a VGG16 or VGG19 convolutional neural network and extracts low-resolution, high-level semantic features of the input image. These high-level semantic features are input to the up-sampling module 702, which restores them to semantic features of higher resolution, and the feature map output by the up-sampling module 702 is then passed through the up-sampling module 703, which has the same structure, to further increase the resolution of the feature map. To obtain a high-resolution feature map that carries both high-level semantic information and detail information, the up-sampling module 702 and the up-sampling module 703 also introduce high-resolution detail information through skip connections.
Part 710 of fig. 7 is a schematic diagram of the structure of the up-sampling module 702 and the up-sampling module 703. The high-level semantic features are input into an upsampling module 711; after upsampling, they are input into a convolution module 712 for feature extraction; the extracted features are input into a cascade (concatenation) layer, where they are concatenated with the high-resolution detail features obtained through the skip connection, and the concatenated features are then passed through convolution layer 713 and convolution layer 714 in sequence to obtain the output of the up-sampling module.
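The following PyTorch sketch illustrates the general shape of such a feature map acquisition layer; the channel counts, the toy encoder standing in for VGG16/VGG19, and the final interpolation back to the input resolution are illustrative assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """Upsampling module in the spirit of part 710 of fig. 7: upsample, convolve,
    concatenate the high-resolution skip features, then two more convolutions."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv_in = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv1 = nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = F.relu(self.conv_in(x))
        x = torch.cat([x, skip], dim=1)      # skip connection brings in detail information
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

class FeatureMapLayer(nn.Module):
    """Heavily simplified encoder-decoder (U-shaped) feature map acquisition layer."""
    def __init__(self):
        super().__init__()
        # toy encoder standing in for the VGG16/VGG19 down-sampling module 701
        self.enc1 = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.enc3 = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.up1 = UpBlock(256, 128, 128)    # in the spirit of up-sampling module 702
        self.up2 = UpBlock(128, 64, 64)      # in the spirit of up-sampling module 703

    def forward(self, x):
        size = x.shape[-2:]
        f1 = self.enc1(x)                    # 1/2 resolution
        f2 = self.enc2(f1)                   # 1/4 resolution
        f3 = self.enc3(f2)                   # 1/8 resolution, high-level semantics
        y = self.up1(f3, f2)
        y = self.up2(y, f1)
        # restore the feature map to the input pixel size, as stated in the embodiment
        return F.interpolate(y, size=size, mode="bilinear", align_corners=False)

feats = FeatureMapLayer()(torch.randn(1, 3, 256, 256))
print(feats.shape)                           # torch.Size([1, 64, 256, 256])
```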
Step 403, classifying the sample crowd density feature map through the feature classification layers respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model to obtain the prediction results respectively corresponding to the sub-regions in the first sample image.
The prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals.
In one possible implementation, the prediction result is used to indicate a prediction probability set of each sub-region in the first sample image corresponding to the at least two feature classification layers.
In a possible implementation manner, based on the sample crowd density feature map, data processing is performed through at least two feature classification layers in the crowd density estimation model to obtain prediction probability sets of each sub-region in the first sample image corresponding to the at least two feature classification layers respectively; wherein the prediction probability set is used to indicate the probability that each sub-region in the first sample image belongs to the respective types corresponding to the at least two feature classification layers; and acquiring the prediction categories of the sub-regions in the first sample image respectively corresponding to the at least two feature classification layers based on the prediction probability sets of the sub-regions in the first sample image respectively corresponding to the at least two feature classification layers.
For example, when the at least two feature classification layers include a first classification layer and a second classification layer, the sample crowd density feature map may be processed by the first classification layer and the second classification layer simultaneously: the sample crowd density feature map is processed (classified) through the first classification layer to obtain the first prediction probability set corresponding to each sub-region in the sample crowd density feature map and the first classification layer, where the first prediction probability set is used for indicating the probability that each sub-region of the sample crowd density feature map belongs to each type of the first classification layer; and the sample crowd density feature map is processed through the second classification layer to obtain the second prediction probability set corresponding to each sub-region in the sample crowd density feature map and the second classification layer, where the second prediction probability set is used for indicating the probability that each sub-region of the sample crowd density feature map belongs to each type of the second classification layer.
After the first prediction probability set and the second prediction probability set corresponding to each sub-region are obtained, the type with the highest probability in the first prediction probability set corresponding to each sub-region can be determined as the classification of that sub-region of the sample crowd density feature map by the first classification layer, and the type with the highest probability in the second prediction probability set corresponding to each sub-region can be determined as the classification of that sub-region of the sample crowd density feature map by the second classification layer.
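For instance, taking the highest-probability type per sub-region for each classification layer can be sketched as follows; the shapes and the number of sub-intervals are assumptions.

```python
import torch

# assumed prediction probability sets: [num_sub_intervals, H_regions, W_regions]
probs_head1 = torch.softmax(torch.randn(5, 8, 8), dim=0)   # first classification layer
probs_head2 = torch.softmax(torch.randn(5, 8, 8), dim=0)   # second classification layer

# the type with the highest probability becomes each sub-region's classification
cls_head1 = probs_head1.argmax(dim=0)   # [8, 8] sub-interval indices, first interval
cls_head2 = probs_head2.argmax(dim=0)   # [8, 8] sub-interval indices, second interval
```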
In one possible implementation, the at least two feature classification layers in the crowd density estimation model are arranged in parallel and connected to the other parts of the crowd density estimation model.
When the crowd density estimation model receives a first sample image to be classified, the crowd density estimation model first performs feature extraction on the first sample image through the feature map acquisition layer to obtain the sample crowd density feature map corresponding to the first sample image; the sample crowd density feature map is divided into the feature maps of the respective sub-regions, which are input into the at least two feature classification layers of the crowd density estimation model, and the at least two feature classification layers each process the sample crowd density feature map to obtain the classification probabilities of each sub-region of the sample crowd density feature map under the at least two feature classification layers.
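A minimal sketch of two such parallel feature classification layers operating on the sample crowd density feature map is shown below; the pooling-per-sub-region design, the 1x1 convolution classifier and all sizes are assumptions rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """One feature classification layer: pools the density feature map into
    fixed-size sub-regions and predicts a sub-interval probability set per region."""
    def __init__(self, in_ch, num_subintervals, region_size=32):
        super().__init__()
        self.pool = nn.AvgPool2d(region_size)          # one cell per image sub-region
        self.cls = nn.Conv2d(in_ch, num_subintervals, kernel_size=1)

    def forward(self, feats):
        logits = self.cls(self.pool(feats))            # [B, K, H/region, W/region]
        return logits.softmax(dim=1)                   # prediction probability set

feats = torch.randn(1, 64, 256, 256)                   # sample crowd density feature map
head1 = ClassificationHead(64, num_subintervals=5)     # first people-number interval
head2 = ClassificationHead(64, num_subintervals=5)     # second, staggered interval
p1, p2 = head1(feats), head2(feats)                    # the two heads run in parallel
print(p1.shape, p2.shape)                              # [1, 5, 8, 8] each
```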
Step 404, training the crowd density estimation model based on the prediction results of the sub-regions in the first sample image respectively corresponding to the at least two feature classification layers and the sub-intervals of the sub-regions in the first sample image respectively corresponding to the at least two people number classification intervals.
In a possible implementation manner, in response to that the prediction results of the regions on the first sample image respectively corresponding to the at least two feature classification layers are prediction probability sets of the regions on the first sample image respectively corresponding to the at least two feature classification layers, the crowd density estimation model is trained based on the prediction probability sets of the regions on the first sample image respectively corresponding to the at least two feature classification layers and the label categories of the first sample image corresponding to the at least two feature classification layers.
In a possible implementation manner, the crowd density estimation model is trained through a loss function based on a prediction probability set corresponding to each region on the first sample image and the at least two feature classification layers and labeling categories corresponding to the first sample image and the at least two feature classification layers.
For example, when the at least two feature classification layers include a first classification layer and a second classification layer, the prediction probability set corresponding to each region on the first sample image and the first classification layer, the label category corresponding to each region on the first sample image and the first classification layer, the prediction probability set corresponding to each region on the first sample image and the second classification layer, and the label category corresponding to each region on the first sample image and the second classification layer are input into the loss function to obtain the loss function value, and the crowd density estimation model is updated based on the loss function value.
In a possible implementation manner, in response to the at least two feature classification layers including a first classification layer and a second classification layer, a first loss function value is obtained based on the prediction probability set corresponding to each region on the first sample image and the first classification layer and the labeling category corresponding to each region on the first sample image and the first classification layer; a second loss function value is obtained based on the prediction probability set corresponding to each region on the first sample image and the second classification layer and the labeling category corresponding to each region on the first sample image and the second classification layer; and the crowd density estimation model is trained based on the first loss function value and the second loss function value.
In a possible implementation manner, the first loss function value and the second loss function value are loss function values obtained based on a multi-class cross entropy loss function.
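A sketch of this two-head training objective is shown below; summing the two multi-class cross entropy terms is an assumption about how the first and second loss function values are combined, and all shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dual_head_loss(logits1, logits2, labels1, labels2):
    """Multi-class cross entropy computed per feature classification layer.
    logits*: [B, K, H, W] per-sub-region scores; labels*: [B, H, W] sub-interval indices."""
    loss1 = F.cross_entropy(logits1, labels1)   # first people-number classification interval
    loss2 = F.cross_entropy(logits2, labels2)   # second, staggered interval
    return loss1 + loss2                        # assumed combination used to update the model

# assumed: 5 sub-intervals per head, 8 x 8 sub-regions per image, batch of 2
logits1 = torch.randn(2, 5, 8, 8, requires_grad=True)
logits2 = torch.randn(2, 5, 8, 8, requires_grad=True)
labels1 = torch.randint(0, 5, (2, 8, 8))
labels2 = torch.randint(0, 5, (2, 8, 8))
dual_head_loss(logits1, logits2, labels1, labels2).backward()
```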
In this embodiment of the application, the training sample of the crowd density estimation model is a sample image, and the training label corresponding to the training sample is a category of each region of the sample image obtained by integrating each region of the sample image based on the crowd density thermodynamic diagram of the sample image and classifying each region according to the crowd classification intervals corresponding to the at least two feature classification layers.
The people number classification intervals corresponding to the at least two feature classification layers have different segmentation points. That is, when a value at the edge of a sub-interval of one people number classification interval is classified directly into the type corresponding to that sub-interval, the resulting discretization error is large; but when the same value is classified through a people number classification interval with different segmentation points, the value is no longer at the edge of a sub-interval, so the resulting discretization error is small. Therefore, when the category of each region of the sample image is judged through feature classification layers whose people number classification intervals have at least two different sets of segmentation points, the influence of discretization error on the classification can be effectively reduced.
Please refer to fig. 8, which illustrates a schematic diagram of prediction interval classification according to an embodiment of the present application. As shown in fig. 8, when a number of people of about 2 is classified through the first people number classification interval 801, the value lies near the segmentation point between the [1, 2] and [2, 3] sub-intervals, that is, it may be classified into either the [1, 2] or the [2, 3] sub-interval, which results in a large discretization error; whereas when it is classified through the second people number classification interval, it can be accurately classified into the [1.5, 2.5] sub-interval, which results in a small discretization error.
Step 405, acquiring a first image.
In a possible implementation manner, the first image may be an image with crowd information, and when the first image is processed by the crowd density estimation model, the crowd information corresponding to the first image may be obtained.
The crowd density estimation model is a machine learning model obtained by training by taking a first sample image as a training sample and taking sub-intervals, corresponding to each sub-area in the first sample image, in the at least two people number classification intervals as labels.
Step 406, performing feature extraction on the first image through the feature map acquisition layer in the crowd density estimation model to acquire the crowd density feature map corresponding to the first image.
In a possible implementation manner, the feature map obtaining layer in the crowd density estimation model is configured to perform feature extraction on the first image to obtain an image feature corresponding to the first image, where the image feature obtained by performing feature extraction through the feature map obtaining layer is used to indicate crowd information in the first image, and therefore the crowd density feature map corresponding to the first image may be used to indicate the number of crowds and the crowd density corresponding to the first image.
In one possible implementation, the population density feature map is the same size as the first image.
That is, the crowd density feature map obtained by performing feature extraction on the first image through the crowd density estimation model has the same pixel size as the input first image.
Step 407, classifying the crowd density feature map through the feature classification layers respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model, so as to obtain sub-intervals corresponding to the sub-areas in the first image respectively in the at least two crowd classification intervals.
In a possible implementation manner, in response to the at least two people number classification intervals including a first people number classification interval and a second people number classification interval, the crowd density feature map is classified based on the feature classification layer corresponding to the first people number classification interval to obtain the sub-intervals in the first people number classification interval corresponding to the sub-regions in the first image; and the crowd density feature map is classified based on the feature classification layer corresponding to the second people number classification interval to obtain the sub-intervals in the second people number classification interval corresponding to the sub-regions in the first image.
For example, when the at least two feature classification layers include a first classification layer (i.e., the feature classification layer corresponding to the first people number classification interval) and a second classification layer (i.e., the feature classification layer corresponding to the second people number classification interval), the crowd density feature map may be processed through the first classification layer and the second classification layer at the same time: the crowd density feature map is processed (classified) through the first classification layer to obtain the first prediction probability set corresponding to each sub-region in the crowd density feature map and the first classification layer, where the first prediction probability set is used for indicating the probability that each sub-region of the crowd density feature map belongs to each sub-interval of the first people number classification interval; and the crowd density feature map is processed through the second classification layer to obtain the second prediction probability set corresponding to each sub-region in the crowd density feature map and the second classification layer, where the second prediction probability set is used for indicating the probability that each sub-region of the crowd density feature map belongs to each sub-interval of the second people number classification interval.
After the first prediction probability set and the second prediction probability set corresponding to each sub-region are obtained, the type with the highest probability in the first prediction probability set corresponding to each sub-region can be determined as the classification of that sub-region of the crowd density feature map by the first classification layer, and the type with the highest probability in the second prediction probability set corresponding to each sub-region can be determined as the classification of that sub-region of the crowd density feature map by the second classification layer.
Step 408, obtaining the people number information corresponding to each sub-region in the first image based on the sub-intervals corresponding to each sub-region in the first image in the at least two people number classification intervals.
In one possible implementation manner, in response to the at least two people number classification intervals including a first people number classification interval and a second people number classification interval, determining first people number information corresponding to each sub-region in the first image based on the sub-interval corresponding to each sub-region in the first image in the first people number classification interval; determining second people number information corresponding to each sub-region in the first image based on the sub-interval corresponding to each sub-region in the first image in the second people number classification interval; and acquiring the people number information respectively corresponding to each sub-region in the first image based on the first people number information corresponding to each sub-region in the first image and the second people number information corresponding to each sub-region in the first image.
In a possible implementation manner, based on the prediction categories of the regions in the first image respectively corresponding to the at least two feature classification layers, determining the number of regional crowds of the regions in the first image respectively corresponding to the at least two feature classification layers; and determining the number information of people corresponding to each area in the first image based on the number of people in the area corresponding to each area in the first image and the at least two characteristic classification layers.
In a possible implementation manner, based on the prediction categories of the regions in the first image respectively corresponding to the at least two feature classification layers, determining the sub-intervals of the regions in the first image respectively corresponding to the at least two people number classification intervals; and determining the regional crowd number of each region in the first image respectively corresponding to the at least two feature classification layers based on the sub-intervals corresponding to each region in the first image in the at least two people number classification intervals.
According to the prediction categories of the regions in the first image corresponding to the at least two feature classification layers, the sub-intervals corresponding to the regions in the first image in the at least two people number classification intervals can be determined, and according to the proxy count values of these sub-intervals, the regional crowd number of each region in the first image corresponding to each of the at least two feature classification layers is determined.
In a possible implementation manner, the average of the regional crowd numbers obtained for each region in the first image from the at least two feature classification layers is determined as the people number information corresponding to that region.
In one possible implementation, in response to the at least two feature classification layers including the first classification layer and the second classification layer, determining first people number information corresponding to each region in the first image based on the prediction category corresponding to each region in the first image and the first classification layer; determining second people number information corresponding to each region in the first image based on the prediction category corresponding to each region in the first image and the second classification layer; and acquiring the people number information respectively corresponding to each sub-region in the first image based on the first people number information corresponding to each region in the first image and the second people number information corresponding to each region in the first image.
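The per-region fusion of the two classification layers can be sketched as follows; the proxy values are the midpoints of the fig. 6 sub-intervals, and the simple averaging follows the implementation described above.

```python
import numpy as np

first_proxies  = np.array([0.5, 1.5, 2.5, 3.5, 4.5])   # midpoints of the first sub-intervals
second_proxies = np.array([0.75, 2.0, 3.0, 4.0, 5.0])  # assumed proxies of the second sub-intervals

def region_counts(cls_head1, cls_head2):
    """Map each sub-region's predicted sub-interval to its proxy count and average
    the two classification layers to get the per-region people-number information."""
    counts1 = first_proxies[cls_head1]     # first people-number information
    counts2 = second_proxies[cls_head2]    # second people-number information
    return (counts1 + counts2) / 2.0

cls_head1 = np.random.randint(0, 5, size=(8, 8))        # assumed 8 x 8 grid of sub-regions
cls_head2 = np.random.randint(0, 5, size=(8, 8))
per_region = region_counts(cls_head1, cls_head2)
```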
Step 409, acquiring the number of people corresponding to the first image based on the information of the number of people corresponding to each sub-region in the first image.
In a possible implementation manner, the number of people corresponding to each area in the first image is summed to obtain the number of people corresponding to the first image.
In a possible implementation manner, the number of people meeting the specified condition in the number of people information corresponding to each area in the first image is summed to obtain the number of people corresponding to the first image.
The specified condition may be, for example, taking all the people number information corresponding to the respective regions in the first image except the maximum value and the minimum value.
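For example, the image-level count could then be obtained as in the following sketch, where dropping the single largest and smallest per-region values is one reading of the specified condition.

```python
import numpy as np

def image_count(per_region, exclude_extremes=False):
    """Total people count for the image: sum the per-region counts, optionally
    dropping the single largest and smallest values (the specified condition)."""
    values = np.sort(per_region.ravel())
    if exclude_extremes and values.size > 2:
        values = values[1:-1]
    return float(values.sum())
```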
Generally, a crowd density estimation algorithm based on deep learning takes a single image as input and extracts image features through a deep convolutional network. Since the crowd density estimation task needs both context features carrying high-level semantic information and local detail information, in order to obtain a high-resolution feature map with both, mainstream networks generally use a U-shaped structure that downsamples first and then upsamples, as shown in fig. 7, with skip connections introducing detail information for the upsampling, and finally output a predicted crowd density thermodynamic distribution map. On the extracted feature map, the categories corresponding to the same image block are predicted using the parallel, overlapping feature classification layers shown in fig. 8. Two predicted crowd numbers (the proxy count values corresponding to the two categories) can be obtained from the categories given by the people number classification intervals of the two feature classification layers, and their average is taken as the final predicted crowd number of the image block.
The scheme shown in the embodiments of the present application can also be applied in the field of intelligent transportation. In this field, the accuracy of passenger flow statistics directly affects the operation of an intelligent transportation system and the operating efficiency of its vehicles. A management platform for intelligent transportation can acquire real-time crowd images of the transportation sites to be monitored through monitoring equipment such as cameras, and determine the crowd number of each site according to the scheme shown in the embodiments of the present application, which reduces the influence of discretization error on crowd number estimation, realizes accurate density estimation on crowd images, and provides managers with real-time, intuitive and accurate passenger flow data, facilitating more efficient management and organization work.
In summary, in the embodiments of the present application, the crowd density feature map corresponding to the first image is classified through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and the people number information of each sub-region is determined according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results of the two people number classification intervals with different sub-intervals, the discretization error generated during people number classification is reduced, and the accuracy of people number estimation in the image is improved.
FIG. 9 is a block flow diagram of model training and population quantity estimation provided in accordance with an exemplary embodiment. The model training process may be applied to the model training device 900, which may be a server, and the population quantity determination process may be applied to the data processing device 910, which may be a user terminal, where the model training and population quantity estimation processes are as follows.
In the model training device 900, a hotspot map corresponding to the sample image 901 is obtained from the image label corresponding to the sample image 901, and the hotspot map is then subjected to Gaussian convolution processing to obtain a crowd density thermodynamic diagram corresponding to the sample image 901. The crowd density thermodynamic diagram is classified through the two people number classification intervals 903 corresponding to the two prediction heads (i.e., feature classification layers) in the crowd density estimation model 902; for each region in the crowd density thermodynamic diagram, the corresponding sub-classification intervals in the two people number classification intervals 903 are determined and stored as the annotation information 904 corresponding to the sample image.
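The label-generation side of this pipeline (hotspot map, Gaussian convolution, per-region integration) can be sketched as follows; the region size, the Gaussian sigma and the use of SciPy's gaussian_filter are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def region_people_counts(head_points, image_hw, region=32, sigma=4.0):
    """Rasterise annotated head positions into a hotspot map, smooth it with a
    (normalised) Gaussian kernel to get the crowd density thermodynamic diagram,
    then integrate (sum) the density over each sub-region."""
    h, w = image_hw
    hotspot = np.zeros((h, w), dtype=np.float32)
    for y, x in head_points:                    # one unit of mass per annotated person
        hotspot[int(y), int(x)] += 1.0
    heatmap = gaussian_filter(hotspot, sigma)   # kernel sums to 1, total mass preserved
    blocks = heatmap[: h // region * region, : w // region * region]
    blocks = blocks.reshape(h // region, region, w // region, region)
    return blocks.sum(axis=(1, 3))              # people number per sub-region

counts = region_people_counts([(40, 50), (41, 52), (200, 220)], (256, 256))
print(counts.shape, counts.sum())               # (8, 8)  ~3.0
```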
The sample image 901 is subjected to feature extraction by the crowd density estimation model 902, and then is classified with the two prediction heads respectively, so as to obtain prediction results 905 corresponding to each sub-region in the sample image 901 and the two prediction heads. The annotation information 904 corresponding to the sample image and the prediction result 905 corresponding to the sample image can be used to train the crowd density estimation model 902.
The model training device 900 may transmit the trained crowd density estimation model 902 to the data processing device 910; in the data processing device, the crowd density estimation model 902 is loaded as the crowd density estimation model 912, and the input first image is processed through this crowd density estimation model to obtain the number of people corresponding to the first image.
Fig. 10 is a block diagram showing the configuration of a population quantity determining apparatus according to an exemplary embodiment. The crowd quantity determination device may implement all or part of the steps in the method provided by the embodiment shown in fig. 2 or fig. 4, and includes the following parts:
a first image acquisition module 1001 configured to acquire a first image;
a first image extraction module 1002, configured to perform feature extraction based on the first image, so as to obtain a crowd density feature map corresponding to the first image;
the feature map classification module 1003 is configured to classify the crowd density feature map based on at least two crowd classification intervals, and obtain sub-intervals corresponding to the sub-areas in the first image in the at least two crowd classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
the people number information obtaining module 1004 is configured to obtain the people number information corresponding to each sub-region in the first image based on the sub-intervals corresponding to each sub-region in the first image in the at least two people number classification intervals;
an image people number obtaining module 1005, configured to obtain the number of people corresponding to the first image based on the information of the number of people corresponding to each sub-region in the first image.
In a possible implementation manner, the first image extraction module is further configured to,
performing feature extraction on the first image through a feature map acquisition layer in a crowd density estimation model to acquire a crowd density feature map corresponding to the first image;
the feature map classification module 1003 is further configured to,
classifying the crowd density feature map through the feature classification layers respectively corresponding to the at least two people number classification intervals in the crowd density estimation model, to obtain the sub-intervals in the at least two people number classification intervals respectively corresponding to each sub-region in the first image;
the crowd density estimation model is a machine learning model obtained by training by taking a first sample image as a training sample and taking subintervals of all the subregions in the first sample image, which respectively correspond to the at least two people number classification intervals, as labels.
In one possible implementation, the apparatus further includes:
the first sample image acquisition module is used for acquiring a first sample image and sub-sections corresponding to the sub-sections of the sub-sections in the first sample image in the at least two people number classification sections;
the first sample extraction module is used for extracting the features of the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image;
the first sample classification module is used for classifying the sample crowd density characteristic diagram through the characteristic classification layers which are respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model to obtain the prediction results which are respectively corresponding to the sub-regions in the first sample image; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
and the crowd density estimation model training module is used for training the crowd density estimation model based on the prediction results of the sub-regions in the first sample image respectively corresponding to the at least two feature classification layers and the sub-intervals of the sub-regions in the first sample image respectively corresponding to the at least two people number classification intervals.
In one possible implementation manner, the first sample image obtaining module includes:
a training sample set obtaining unit for obtaining a training sample set; the training sample set comprises the sample images and image labels corresponding to the sample images; the image labels are used for indicating the positions of the people in the corresponding sample images;
the sample crowd number acquiring unit is used for acquiring the crowd number of each subarea in the first sample image based on the image label corresponding to the first sample image; the first sample image is any one of the respective sample images;
and the sub-interval obtaining unit is used for obtaining sub-intervals corresponding to the sub-areas in the first sample image in the at least two people number classification intervals respectively based on the number of people in each sub-area in the first sample image.
In one possible implementation, in response to the at least two people classification intervals including a first people classification interval and a second people classification interval, the apparatus further includes:
a sample crowd number obtaining module, configured to obtain, based on the image label corresponding to each sample image, a crowd number of each sub-region in each sample image;
a first interval obtaining module, configured to determine the first people number classification interval based on the number of people in each sub-region in each sample image;
and the second interval obtaining module is used for determining the second people number classification interval based on the first people number classification interval.
In a possible implementation manner, the first interval obtaining module includes:
the endpoint set acquisition unit is used for acquiring a first endpoint set based on the maximum value of the number of people in each sub-region in each sample image; the first set of endpoints is to indicate interval endpoints of a first people classification interval;
a segmentation point set obtaining unit, configured to determine a first segmentation point set based on an interval endpoint of the first person classification interval; the first segmentation point set is used for indicating interval segmentation points of the people number classification interval corresponding to the first classification layer;
and the people number classification interval determining unit is used for determining the people number classification interval corresponding to the first classification layer based on the first endpoint set and the first segmentation point set.
In one possible implementation, the first person classification interval includes at least two first subintervals;
the second interval obtaining module includes:
an interval proxy value obtaining unit, configured to determine, based on each first sub-interval, the interval proxy value corresponding to each first sub-interval; the interval proxy value is used for determining the people number information of the image area corresponding to the sub-interval;
and the second interval determining unit is used for determining the second people number classification interval based on the interval proxy values corresponding to the first sub-intervals.
In a possible implementation manner, the second interval determining unit includes:
a second endpoint set obtaining subunit, configured to obtain a second endpoint set based on a maximum value of the number of people in each sub-region in each sample image; the second endpoint set is used for indicating the interval endpoint of the people number classification interval corresponding to the second classification layer;
a second segmentation point set obtaining subunit, configured to obtain a second segmentation point set based on the interval proxy value corresponding to the first subinterval; the second segmentation point set is used for indicating the interval segmentation points of the people number classification interval corresponding to the second classification layer;
and the second people number classification interval obtaining subunit is used for obtaining the second people number classification interval based on the second endpoint set and the second segmentation point set.
In one possible implementation manner, the sample population number obtaining module is configured to,
obtaining a first sample hotspot graph corresponding to the first sample image based on the first sample image and the image label corresponding to the first sample image; the first sample hotspot graph is used for indicating the positions of the crowds in the first sample image;
based on the first sample hotspot graph, performing data processing through a Gaussian convolution kernel to obtain a first sample thermodynamic diagram corresponding to the first sample image;
and respectively integrating each sub-region in the first sample image based on the first sample thermodynamic diagram to obtain the number of people in each sub-region of the first sample image.
In one possible implementation, in response to the at least two people classification intervals including a first people classification interval and a second people classification interval, the feature map classification module 1003 is configured to,
classifying the crowd density feature map based on the feature classification layer corresponding to the first people number classification interval, to obtain the sub-intervals in the first people number classification interval corresponding to the sub-regions in the first image;
and classifying the crowd density feature map based on the feature classification layer corresponding to the second people number classification interval, to obtain the sub-intervals in the second people number classification interval corresponding to the sub-regions in the first image;
the obtaining of the people number information corresponding to each sub-region in the first image based on the sub-intervals corresponding to each sub-region in the first image in the at least two people number classification intervals includes:
determining first people number information corresponding to each sub-region in the first image based on the sub-interval corresponding to each sub-region in the first image in the first people number classification interval;
determining second people number information corresponding to each sub-region in the first image based on the sub-interval corresponding to each sub-region in the first image in the second people number classification interval;
and acquiring the people number information corresponding to each sub-region in the first image based on the first people number information corresponding to each sub-region in the first image and the second people number information corresponding to each sub-region in the first image.
In summary, in the embodiments of the present application, the crowd density feature map corresponding to the first image is classified through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and the people number information of each sub-region is determined according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results of the two people number classification intervals with different sub-intervals, the discretization error generated during people number classification is reduced, and the accuracy of people number estimation in the image is improved.
Fig. 11 is a block diagram illustrating a configuration of a population quantity determining apparatus according to an exemplary embodiment. The crowd quantity determination device may implement all or part of the steps in the method provided by the embodiment shown in fig. 3 or fig. 4, and includes the following parts:
a sample image obtaining module 1101, configured to obtain a first sample image and the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
a sample image extraction module 1102, configured to perform feature extraction on the first sample image through the feature map acquisition layer, and acquire a sample crowd density feature map corresponding to the first sample image;
a prediction result obtaining module 1103, configured to classify the sample crowd density feature map through feature classification layers in the crowd density estimation model, where the feature classification layers correspond to the at least two crowd classification intervals, respectively, so as to obtain prediction results corresponding to each sub-region in the first sample image, respectively; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
a model training module 1104, configured to train the crowd density estimation model based on the prediction results of the sub-regions in the first sample image respectively corresponding to the at least two feature classification layers and the sub-intervals of the sub-regions in the first sample image respectively corresponding to the at least two people number classification intervals;
and the trained crowd density estimation model is used for processing the input first image to obtain the crowd number corresponding to the first image.
In summary, in the embodiments of the present application, the crowd density feature map corresponding to the first image is classified through at least two people number classification intervals to obtain the sub-intervals corresponding to each sub-region in the first image, and the people number information of each sub-region is determined according to the sub-intervals corresponding to each sub-region in the at least two people number classification intervals. Through this scheme, the input first image is classified based on at least two people number classification intervals with different interval segmentation points, so that when each image region in the first image is classified into a sub-interval of a people number classification interval, the number of people in each region of the first image can be estimated according to the classification results of the two people number classification intervals with different sub-intervals, the discretization error generated during people number classification is reduced, and the accuracy of people number estimation in the image is improved.
FIG. 12 is a block diagram illustrating a computer device according to an example embodiment. The computer device may be implemented as a model training device and/or a data processing device in the various method embodiments described above. The computer apparatus 1200 includes a Central Processing Unit (CPU) 1201, a system Memory 1204 including a Random Access Memory (RAM) 1202 and a Read-Only Memory (ROM) 1203, and a system bus 1205 connecting the system Memory 1204 and the Central Processing Unit 1201. The computer device 1200 also includes a basic input/output system 1206, which facilitates transfer of information between various components within the computer, and a mass storage device 1207, which stores an operating system 1213, application programs 1214, and other program modules 1215.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the computer device 1200. That is, the mass storage device 1207 may include a computer-readable medium (not shown) such as a hard disk or Compact disk Read-Only Memory (CD-ROM) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, flash memory or other solid state storage technology, CD-ROM, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that the computer storage media is not limited to the foregoing. The system memory 1204 and mass storage device 1207 described above may be collectively referred to as memory.
The computer device 1200 may be connected to the internet or other network devices through a network interface unit 1211 connected to the system bus 1205.
The memory further includes one or more programs, the one or more programs are stored in the memory, and the central processing unit 1201 implements all or part of the steps of the method shown in fig. 2, 3, or 4 by executing the one or more programs.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as a memory comprising computer programs (instructions), executable by a processor of a computer device to perform the methods shown in the various embodiments of the present application, is also provided. For example, the non-transitory computer readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods shown in the various embodiments described above.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (15)

1. A method for determining a number of people, the method comprising:
acquiring a first image;
performing feature extraction based on the first image to obtain a crowd density feature map corresponding to the first image;
classifying the crowd density feature map respectively based on at least two people number classification intervals to obtain sub-intervals corresponding to each sub-area in the first image in the at least two people number classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation point is used for segmenting the people number classification interval into at least two sub-intervals;
obtaining the number information of people corresponding to each sub-region in the first image based on the sub-regions corresponding to each sub-region in the at least two people number classification regions in the first image;
and acquiring the number of people corresponding to the first image based on the information of the number of people corresponding to each sub-region in the first image.
2. The method according to claim 1, wherein the performing feature extraction based on the first image to obtain a crowd density feature map corresponding to the first image comprises:
extracting the features of the first image through a feature map acquisition layer in a crowd density estimation model to acquire a crowd density feature map corresponding to the first image;
the classifying processing is respectively carried out on the crowd density characteristic map based on at least two people number classification intervals to obtain sub-intervals corresponding to each sub-area in the first image in the at least two people number classification intervals respectively, and the classifying processing method comprises the following steps:
classifying the crowd density feature map through feature classification layers respectively corresponding to the at least two people number classification intervals in the crowd density estimation model, to obtain the sub-intervals in the at least two people number classification intervals respectively corresponding to each sub-region in the first image;
the crowd density estimation model is a machine learning model obtained by training by taking a first sample image as a training sample and taking subintervals of all the subregions in the first sample image, which respectively correspond to the at least two people number classification intervals, as labels.
3. The method of claim 2, further comprising:
acquiring the first sample image and sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image;
performing feature extraction on the first sample image through the feature map acquisition layer to acquire a sample crowd density feature map corresponding to the first sample image;
classifying the sample crowd density feature map through feature classification layers respectively corresponding to the at least two crowd classification intervals in the crowd density estimation model to obtain prediction results respectively corresponding to each sub-region in the first sample image; the prediction result is used for indicating the corresponding relation between each sub-area in the first sample image and the at least two people number classification intervals;
and training the crowd density estimation model based on the prediction results of the sub-regions on the first sample image respectively corresponding to the at least two feature classification layers and the sub-intervals of the sub-regions in the first sample image respectively corresponding to the at least two people number classification intervals.
4. The method of claim 3, wherein the acquiring the first sample image and the sub-intervals, in the at least two people number classification intervals, respectively corresponding to each sub-region in the first sample image comprises:
acquiring a training sample set; the training sample set comprises all sample images and image labels corresponding to all the sample images; the image annotation is generated based on head positions of respective objects on the respective sample images;
acquiring the number of people in each sub-region in the first sample image based on the image label corresponding to the first sample image; the first sample image is any one of the respective sample images;
and acquiring sub-sections, corresponding to the sub-regions in the first sample image, in the at least two people number classification sections respectively based on the number of people in each sub-region in the first sample image.
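A sketch of the label construction in claim 4, assuming the sub-regions form a regular grid and that counts are binned by the interval segmentation points with numpy.digitize; the grid layout, function names, and example segmentation points are assumptions.

```python
import numpy as np

def subinterval_labels(head_points, image_hw, grid_hw, segmentation_points):
    """Per-sub-region people counts and sub-interval labels from head points.

    head_points: (N, 2) array of annotated head positions as (y, x).
    image_hw / grid_hw: image size and the assumed regular sub-region grid.
    segmentation_points: ascending segmentation points of one classification
    interval, e.g. [1, 3, 6, 10]; counts are binned between them.
    """
    H, W = image_hw
    gh, gw = grid_hw
    counts = np.zeros((gh, gw))
    for y, x in head_points:
        counts[int(y * gh / H), int(x * gw / W)] += 1      # head -> one sub-region
    labels = np.digitize(counts, segmentation_points)      # count -> sub-interval
    return counts, labels

counts, labels = subinterval_labels(
    head_points=np.array([[10, 10], [12, 14], [200, 300]]),
    image_hw=(256, 384), grid_hw=(4, 4), segmentation_points=[1, 3, 6, 10])
print(counts)
print(labels)
```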
5. The method of claim 4, wherein the at least two people number classification intervals comprise a first people number classification interval and a second people number classification interval, the method further comprising:
acquiring the number of people in each sub-region in each sample image based on the image annotation corresponding to each sample image;
determining the first people number classification interval based on the number of people in each sub-region in each sample image;
and determining the second people number classification interval based on the first people number classification interval.
6. The method of claim 5, wherein the determining the first people number classification interval based on the number of people in each sub-region in each sample image comprises:
acquiring a first endpoint set based on the maximum value of the number of people in each sub-region in each sample image; the first endpoint set is used for indicating the interval endpoints of the first people number classification interval;
determining a first segmentation point set based on the interval endpoints of the first people number classification interval; the first segmentation point set is used for indicating the interval segmentation points of the first people number classification interval;
and determining the first people number classification interval based on the first endpoint set and the first segmentation point set.
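Claim 6 fixes the interval endpoints from the data but leaves the rule for placing segmentation points open; the sketch below assumes geometrically spaced segmentation points between the endpoints (a common choice for long-tailed crowd counts), which is an illustrative assumption only.

```python
import numpy as np

def first_interval_boundaries(counts_per_region, num_subintervals=8):
    """Boundaries of the first people number classification interval.

    The endpoint set comes from the data (0 and the maximum sub-region count);
    geometric spacing of the inner segmentation points is only an assumption,
    chosen because sub-region counts are typically long-tailed.
    """
    upper = float(counts_per_region.max())                  # first endpoint set
    cuts = np.geomspace(1.0, upper, num_subintervals)[:-1]  # segmentation points
    return np.concatenate(([0.0], cuts, [upper]))           # sub-interval bounds

print(first_interval_boundaries(np.array([0, 2, 5, 40, 120.0])))
```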
7. The method of claim 5, wherein the first people number classification interval comprises at least two first sub-intervals;
the determining the second people number classification interval based on the first people number classification interval comprises:
determining an interval proxy value corresponding to each first sub-interval based on the first sub-intervals; the interval proxy value is used for determining the people number information of the image region corresponding to the first sub-interval;
and determining the second people number classification interval based on the interval proxy values corresponding to the first sub-intervals.
8. The method of claim 7, wherein the determining the second people number classification interval based on the interval proxy values corresponding to the first sub-intervals comprises:
acquiring a second endpoint set based on the maximum value of the number of people in each sub-region in each sample image; the second endpoint set is used for indicating the interval endpoints of the second people number classification interval;
acquiring a second segmentation point set based on the interval proxy values corresponding to the first sub-intervals; the second segmentation point set is used for indicating the interval segmentation points of the second people number classification interval;
and acquiring the second people number classification interval based on the second endpoint set and the second segmentation point set.
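Claims 7 and 8 can be read together as deriving the second interval from proxy values of the first. In the sketch below the proxy value is assumed to be each first sub-interval's midpoint and the second segmentation points are placed at those proxies while the endpoint set is reused; both choices are assumptions, the claims only require that the two intervals end up with different segmentation points.

```python
import numpy as np

def second_interval_boundaries(first_boundaries):
    """Derive the second classification interval from the first one.

    Assumptions: the interval proxy value of each first sub-interval is its
    midpoint, and the second segmentation point set is placed at those proxy
    values while the endpoint set (0 and the maximum count) is reused.
    """
    proxies = (first_boundaries[:-1] + first_boundaries[1:]) / 2.0
    return np.concatenate(
        ([first_boundaries[0]], proxies, [first_boundaries[-1]]))

first = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
print(second_interval_boundaries(first))  # segmentation points differ from `first`
```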
9. The method according to any one of claims 5 to 8, wherein the acquiring the number of people in each sub-region of the first sample image based on the image annotation corresponding to the first sample image comprises:
obtaining a first sample hotspot map corresponding to the first sample image based on the first sample image and the image annotation corresponding to the first sample image; the first sample hotspot map is used for indicating the positions of people in the first sample image;
performing data processing on the first sample hotspot map through a Gaussian convolution kernel to obtain a first sample heat map corresponding to the first sample image;
and integrating over each sub-region in the first sample image based on the first sample heat map to obtain the number of people in each sub-region of the first sample image.
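A sketch of claim 9's count extraction, assuming the hotspot map places a unit impulse at each annotated head, the heat map is produced with scipy.ndimage.gaussian_filter, and each sub-region is integrated by summing over a regular grid; the sigma value and grid layout are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def people_per_region(head_points, image_hw, grid_hw, sigma=4.0):
    """People count per sub-region via a Gaussian-smoothed hotspot map.

    A unit impulse is placed at every annotated head position, smoothed with
    a Gaussian kernel (so the heat map still sums to the number of heads),
    and the heat map is integrated (summed) over each sub-region of a regular
    grid; H and W are assumed divisible by the grid dimensions.
    """
    H, W = image_hw
    gh, gw = grid_hw
    hotspot = np.zeros((H, W))
    for y, x in head_points:
        hotspot[int(y), int(x)] += 1.0
    heat = gaussian_filter(hotspot, sigma=sigma)
    return heat.reshape(gh, H // gh, gw, W // gw).sum(axis=(1, 3))

counts = people_per_region(np.array([[30, 40], [32, 44], [200, 300]]),
                           image_hw=(256, 384), grid_hw=(4, 4))
print(counts.round(2))
```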
10. The method of claim 2, wherein the at least two people number classification intervals comprise a first people number classification interval and a second people number classification interval, and the classifying the crowd density feature map through the feature classification layers respectively corresponding to the at least two people number classification intervals in the crowd density estimation model to obtain the sub-intervals to which each sub-region in the first image respectively corresponds in the at least two people number classification intervals comprises:
classifying the crowd density feature map based on the feature classification layer corresponding to the first people number classification interval to obtain the sub-interval to which each sub-region in the first image corresponds in the first people number classification interval;
classifying the crowd density feature map based on the feature classification layer corresponding to the second people number classification interval to obtain the sub-interval to which each sub-region in the first image corresponds in the second people number classification interval;
the obtaining the people number information corresponding to each sub-region in the first image based on the sub-intervals to which each sub-region in the first image respectively corresponds in the at least two people number classification intervals comprises:
determining first people number information corresponding to each sub-region in the first image based on the sub-interval to which each sub-region in the first image corresponds in the first people number classification interval;
determining second people number information corresponding to each sub-region in the first image based on the sub-interval to which each sub-region in the first image corresponds in the second people number classification interval;
and obtaining the people number information corresponding to each sub-region in the first image based on the first people number information and the second people number information respectively corresponding to each sub-region in the first image.
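One way the two classification results of claim 10 could be converted into first and second people number information and merged: taking the expectation of proxy values under each head's softmax and averaging the two heads is an assumed refinement, not the literal claimed fusion, and all names are illustrative.

```python
import torch

def region_counts(logits, proxy_values):
    """Expected people count per sub-region from one head's logits.

    logits: (B, K, h, w) scores over K sub-intervals; proxy_values: length-K
    tensor of representative counts, one per sub-interval.
    """
    probs = logits.softmax(dim=1)
    return (probs * proxy_values.view(1, -1, 1, 1)).sum(dim=1)   # (B, h, w)

def fuse_people_info(logits_a, logits_b, proxy_a, proxy_b):
    first_info = region_counts(logits_a, proxy_a)    # first people number info
    second_info = region_counts(logits_b, proxy_b)   # second people number info
    return (first_info + second_info) / 2.0          # assumed fusion: mean

la, lb = torch.randn(1, 3, 4, 4), torch.randn(1, 3, 4, 4)
fused = fuse_people_info(la, lb,
                         torch.tensor([0.0, 2.5, 8.0]),
                         torch.tensor([0.0, 3.0, 9.0]))
print(fused.shape)  # (1, 4, 4) people number info per sub-region
```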
11. A crowd quantity determination method, the method comprising:
acquiring a first sample image and sub-intervals to which each sub-region in the first sample image respectively corresponds in at least two people number classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation points are used for segmenting the people number classification interval into at least two sub-intervals;
performing feature extraction on the first sample image through a feature map acquisition layer in a crowd density estimation model to obtain a sample crowd density feature map corresponding to the first sample image;
classifying the sample crowd density feature map through feature classification layers which respectively correspond to the at least two people number classification intervals in the crowd density estimation model, to obtain prediction results respectively corresponding to each sub-region in the first sample image; the prediction results are used for indicating the correspondence between each sub-region in the first sample image and the at least two people number classification intervals;
training the crowd density estimation model based on the prediction results output by the at least two feature classification layers for each sub-region in the first sample image and the sub-intervals to which each sub-region in the first sample image respectively corresponds in the at least two people number classification intervals;
and the trained crowd density estimation model is used for processing an input first image to obtain the number of people corresponding to the first image.
12. A crowd quantity determination apparatus, the apparatus comprising:
a first image acquisition module, configured to acquire a first image;
a first image extraction module, configured to perform feature extraction based on the first image to obtain a crowd density feature map corresponding to the first image;
a feature map classification module, configured to classify the crowd density feature map respectively based on at least two people number classification intervals to obtain sub-intervals to which each sub-region in the first image respectively corresponds in the at least two people number classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation points are used for segmenting the people number classification interval into at least two sub-intervals;
a people number information acquisition module, configured to obtain people number information corresponding to each sub-region in the first image based on the sub-intervals to which each sub-region in the first image respectively corresponds in the at least two people number classification intervals;
and an image people number acquisition module, configured to obtain the number of people corresponding to the first image based on the people number information respectively corresponding to each sub-region in the first image.
13. A crowd quantity determination apparatus, the apparatus comprising:
a sample image acquisition module, configured to acquire a first sample image and sub-intervals to which each sub-region in the first sample image respectively corresponds in at least two people number classification intervals; the at least two people number classification intervals have different interval segmentation points; the interval segmentation points are used for segmenting the people number classification interval into at least two sub-intervals;
a sample image extraction module, configured to perform feature extraction on the first sample image through a feature map acquisition layer in a crowd density estimation model to obtain a sample crowd density feature map corresponding to the first sample image;
a prediction result acquisition module, configured to classify the sample crowd density feature map through feature classification layers which respectively correspond to the at least two people number classification intervals in the crowd density estimation model, to obtain prediction results respectively corresponding to each sub-region in the first sample image; the prediction results are used for indicating the correspondence between each sub-region in the first sample image and the at least two people number classification intervals;
and a model training module, configured to train the crowd density estimation model based on the prediction results output by the at least two feature classification layers for each sub-region in the first sample image and the sub-intervals to which each sub-region in the first sample image respectively corresponds in the at least two people number classification intervals;
wherein the trained crowd density estimation model is used for processing an input first image to obtain the number of people corresponding to the first image.
14. A computer device comprising a processor and a memory, the memory storing at least one instruction, at least one program, or a set of codes, the at least one instruction, the at least one program, or the set of codes being loaded and executed by the processor to implement the crowd quantity determination method as claimed in any one of claims 1 to 11.
15. A computer-readable storage medium having stored therein at least one instruction, at least one program, or a set of codes, the at least one instruction, the at least one program, or the set of codes being loaded and executed by a processor to implement the crowd quantity determination method as claimed in any one of claims 1 to 11.
CN202110218075.7A 2021-02-26 2021-02-26 Crowd quantity determination method, device, equipment and storage medium Active CN112580616B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110218075.7A CN112580616B (en) 2021-02-26 2021-02-26 Crowd quantity determination method, device, equipment and storage medium
PCT/CN2022/077578 WO2022179542A1 (en) 2021-02-26 2022-02-24 Population quantity determination method and apparatus, device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110218075.7A CN112580616B (en) 2021-02-26 2021-02-26 Crowd quantity determination method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112580616A CN112580616A (en) 2021-03-30
CN112580616B true CN112580616B (en) 2021-06-18

Family

ID=75114068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110218075.7A Active CN112580616B (en) 2021-02-26 2021-02-26 Crowd quantity determination method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112580616B (en)
WO (1) WO2022179542A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580616B (en) * 2021-02-26 2021-06-18 腾讯科技(深圳)有限公司 Crowd quantity determination method, device, equipment and storage medium
CN115269670A (en) * 2021-04-30 2022-11-01 华为技术有限公司 Flow prediction method, device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463204A (en) * 2014-12-04 2015-03-25 四川九洲电器集团有限责任公司 Target quantity statistical method
CN104504394A (en) * 2014-12-10 2015-04-08 哈尔滨工业大学深圳研究生院 Dense population estimation method and system based on multi-feature fusion
CN104992223A (en) * 2015-06-12 2015-10-21 安徽大学 Intensive population estimation method based on deep learning
CN106845621A (en) * 2017-01-18 2017-06-13 山东大学 Dense population number method of counting and system based on depth convolutional neural networks

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI448990B (en) * 2012-09-07 2014-08-11 Univ Nat Chiao Tung Real-time people counting system using layer scanning method
CN107944327A (en) * 2016-10-10 2018-04-20 杭州海康威视数字技术股份有限公司 A kind of demographic method and device
JP6942488B2 (en) * 2017-03-03 2021-09-29 キヤノン株式会社 Image processing equipment, image processing system, image processing method, and program
CN108804992B (en) * 2017-05-08 2022-08-26 电子科技大学 Crowd counting method based on deep learning
SG10201802668QA (en) * 2018-03-29 2019-10-30 Nec Asia Pacific Pte Ltd Method and system for crowd level estimation
CN111898578B (en) * 2020-08-10 2023-09-19 腾讯科技(深圳)有限公司 Crowd density acquisition method and device and electronic equipment
CN112580616B (en) * 2021-02-26 2021-06-18 腾讯科技(深圳)有限公司 Crowd quantity determination method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2022179542A1 (en) 2022-09-01
CN112580616A (en) 2021-03-30

Similar Documents

Publication Publication Date Title
CN110929622B (en) Video classification method, model training method, device, equipment and storage medium
CN112560829B (en) Crowd quantity determination method, device, equipment and storage medium
CN111222500B (en) Label extraction method and device
CN110598620B (en) Deep neural network model-based recommendation method and device
CN111553419B (en) Image identification method, device, equipment and readable storage medium
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN112801236B (en) Image recognition model migration method, device, equipment and storage medium
CN112580616B (en) Crowd quantity determination method, device, equipment and storage medium
CN110457523B (en) Cover picture selection method, model training method, device and medium
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
CN113591674A (en) Real-time video stream-oriented edge environment behavior recognition system
CN112258250A (en) Target user identification method and device based on network hotspot and computer equipment
Alashban et al. Single convolutional neural network with three layers model for crowd density estimation
CN113822134A (en) Instance tracking method, device, equipment and storage medium based on video
CN113705293A (en) Image scene recognition method, device, equipment and readable storage medium
CN113762041A (en) Video classification method and device, computer equipment and storage medium
CN113160987B (en) Health state prediction method, apparatus, computer device and storage medium
CN114612246A (en) Object set identification method and device, computer equipment and storage medium
CN114596435A (en) Semantic segmentation label generation method, device, equipment and storage medium
CN113627514A (en) Data processing method and device of knowledge graph, electronic equipment and storage medium
CN112613341A (en) Training method and device, fingerprint identification method and device, and electronic device
CN111582404A (en) Content classification method and device and readable storage medium
CN111414884A (en) Facial expression recognition method based on edge calculation
CN116049660B (en) Data processing method, apparatus, device, storage medium, and program product
CN114332884B (en) Document element identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40041010

Country of ref document: HK