CN113822111A

CN113822111A - Crowd detection model training method and device and crowd counting method and device

Info

Publication number: CN113822111A
Application number: CN202110067279.5A
Authority: CN
Inventors: 谷爱国
Original assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Current assignee: Beijing Jingdong Zhenshi Information Technology Co Ltd
Priority date: 2021-01-19
Filing date: 2021-01-19
Publication date: 2021-12-21
Anticipated expiration: 2041-01-19
Also published as: CN113822111B

Abstract

The application discloses a crowd detection model training method and device and a crowd counting method and device. The model training method comprises: acquiring a sample data set, wherein the head and the five sense organs of each person in each sample picture are marked with detection frames; and training a pre-constructed crowd detection model by using the sample data set to obtain a target crowd detection model. The training comprises: detecting the heads and the five sense organs of the persons in the sample picture to obtain head candidate detection frames and five sense organ candidate detection frames; generating an attention feature vector for each corresponding head by using a heuristic attention weighting network based on the head candidate detection frames and the five sense organ candidate detection frames; identifying the authenticity of the corresponding head by using a classification network based on the attention feature vector; and adjusting parameters of the crowd detection model according to the identification result and the detection frames marked in the sample picture. The invention can improve the accuracy of crowd counting.

Description

Crowd detection model training method and device and crowd counting method and device

Technical Field

The invention relates to the technical field of computers, in particular to a crowd detection model training method and device and a crowd counting method and device.

Background

Crowd counting is an important computer vision technology for security applications. In the intelligent security field, an unmanned patrol vehicle can use crowd counting to judge crowd gathering conditions, issue early warnings, and help prevent abnormal behavior.

Human head detection is a common crowd counting method, which calculates the number of people by recognizing their heads.

Disclosure of Invention

In view of the above, the present invention is directed to a method and an apparatus for training a crowd detection model, and a method and an apparatus for counting crowd, which can improve the accuracy of crowd counting.

In order to achieve the above purpose, the embodiment of the present invention provides a technical solution:

a method of crowd detection model training, the method comprising:

acquiring a sample data set; wherein, the head and the five sense organs of the person in the sample picture are marked with a detection frame;

training a pre-constructed crowd detection model by using the sample data set to obtain a target crowd detection model; wherein the training comprises:

detecting the head of the person and the five sense organs in the sample picture to obtain a head candidate detection frame and a five sense organs candidate detection frame;

generating attention feature vectors of corresponding heads by utilizing a heuristic attention weighting network based on the head candidate detection boxes and the five sense organ candidate detection boxes;

identifying the authenticity of the corresponding head by utilizing a classification network based on the attention feature vector; and adjusting parameters of the crowd detection model according to the identification result and the detection frame identified in the sample picture.

In one embodiment, the detecting the head and the five sense organs of the person in the sample picture to obtain a head candidate detection frame and a five sense organ candidate detection frame includes:

detecting the head in the sample picture by using a pre-trained head detection model to obtain a head candidate detection frame;

obtaining a subgraph of the corresponding head based on the head candidate detection frame;

and detecting each five sense organs in the subgraph by using a five sense organs detection model to obtain the five sense organs candidate detection frame.

In one embodiment, the generating the attention feature vector for the respective head comprises:

extracting a corresponding head subregion characteristic matrix based on a first head candidate detection frame by utilizing a first interested region extraction layer of the heuristic attention weighting network;

utilizing a first global pooling layer of the heuristic attention weighting network to perform global average sampling on the head sub-region feature matrix to obtain a corresponding head average feature vector;

extracting a corresponding feature matrix of the sub-region of the facial features based on each of the facial feature candidate detection boxes by using a second region-of-interest extraction layer of the heuristic attention weighting network;

utilizing a second global pooling layer of the heuristic attention weighting network to carry out average sampling on the feature matrix of each facial feature subregion so as to obtain an average feature vector of the corresponding facial feature;

calculating an attention weight vector for each of the five sense organs in the respective head based on the average feature vector for the head and the average feature vector for the respective five sense organs;

and performing point multiplication on the head average feature vector and the attention weight vector of each corresponding five sense organs respectively, and summing the result of the point multiplication to obtain the attention feature vector of the head corresponding to the first head candidate detection box.

In one embodiment, said calculating an attention weight vector for each of said five sense organs in the respective head comprises:

if a sense organ has an average feature vector, point-multiplying the average feature vector of that sense organ with the head average feature vector to obtain the corresponding attention weight vector;

if a sense organ has no average feature vector, setting the corresponding attention weight vector to zero.

In one embodiment, the adjusting the parameters of the crowd detection model comprises:

adjusting parameters in the head detection model, the facial feature detection model, the heuristic attention weighting network, and the classification network.

In one embodiment, the five sense organs include:

left eye, right eye, left ear, right ear, and mouth.

A crowd counting method, comprising:

acquiring a target detection picture;

detecting the heads of persons in the target detection picture based on a crowd detection model, and counting the detected heads to obtain the number of persons in the target detection picture;

wherein the confidence of each counted head is greater than a preset threshold; the crowd detection model is obtained in advance by training with any of the crowd detection model training methods described above.

A crowd detection model training apparatus comprising:

the sample data acquisition module is used for acquiring a sample data set; wherein, the head and the five sense organs of the person in the sample picture are marked with a detection frame;

the model training module is used for training a pre-constructed crowd detection model by using the sample data set to obtain a target crowd detection model; wherein the training comprises:

A people counting device comprising:

the detection target acquisition module is used for acquiring a target detection picture;

the head detection module is used for detecting the heads of persons in the target detection picture based on a crowd detection model and counting the detected heads to obtain the number of persons in the target detection picture; wherein the confidence of each counted head is greater than a preset threshold; the crowd detection model is obtained by training with any of the crowd detection model training methods described above.

A crowd detection model training apparatus comprising a processor and a memory;

the memory has stored therein an application executable by the processor for causing the processor to perform the crowd detection model training method of any one of claims 1 to 6.

A computer readable storage medium having stored therein computer readable instructions for performing the crowd detection model training method as described above.

The embodiment of the invention also provides crowd counting equipment, which comprises a processor and a memory;

the memory has stored therein an application executable by the processor for causing the processor to perform the people counting method as described above.

Embodiments of the present invention also provide a computer-readable storage medium having stored therein computer-readable instructions for performing the people counting method as described above.

According to the above technical scheme, in the model training method and device and the crowd counting method and device provided by the embodiments of the invention, five sense organs detection is introduced on top of human head detection when the crowd detection model is trained with the sample pictures, and the results of the head detection and the five sense organs detection are processed together by a heuristic attention weighting mechanism to generate an attention feature vector for each detected head. Using the five sense organs detection results with the heuristic attention weighting mechanism sharpens the distinction between a human head and other objects of similar shape. This improves the accuracy of the attention feature vector input to the classification network, allows false detections in the head detection result to be screened out, and improves the detection accuracy of the crowd detection model. Accordingly, the accuracy of crowd counting performed with the crowd detection model is improved.

Drawings

FIG. 1 is a schematic flow chart of a method according to a first embodiment of the present invention;

FIG. 2 is a schematic diagram of a crowd detection model network structure according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart of a second embodiment of the method of the present invention;

FIG. 4 is a schematic structural diagram of an apparatus according to a third embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an apparatus according to a fourth embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

In the process of realizing the invention, the inventor found that existing schemes that count crowds by using human head detection suffer from a large counting error. Careful study and analysis showed the specific cause of the problem to be as follows:

The existing human head detection scheme is implemented on a general target detection framework. In such a scheme, target features are extracted first, and the category and position of the target are then obtained through classification and regression. In an actual application scenario, people may face the camera at different angles, and the features of a human head differ considerably between angles; for example, a frontal face differs considerably from the back of the head. Because of this large variation, objects whose features differ only slightly from those of a human head are easily misdetected as human heads. In a real scenario it is inevitable that, at certain angles, the difference between the features of a human head and those of another object is smaller than the difference between the features of the human head at different angles. For example, the head of a plush toy differs little from the back of a human head, so the head of a plush toy is easily identified as a human head by existing human head detection technology. The existing human head detection scheme is therefore prone to false detection, which leads to a large crowd counting error.

Fig. 1 is a schematic flow diagram of a model training method according to an embodiment of the present invention, and as shown in fig. 1, the model training method implemented in the embodiment mainly includes:

Step 101, obtaining a sample data set.

The head and the five sense organs of each person in each sample picture are marked with detection frames.

In practical application, a person skilled in the art can select the five sense organs to be detected according to actual needs, as long as the image of a real person's head includes at least one of the sense organs in the selected set regardless of the shooting angle. Preferably, in order to effectively screen out false detection results of human head detection, the five sense organs to be detected may include: left eye, right eye, left ear, right ear, and mouth. In this way, falsely detected heads can be screened out based on the five sense organs, which improves the detection accuracy.

In this step, the head and the preset five sense organs in each sample picture need to be marked with the detection frame identifiers, so that when the model is trained, the model parameters are adjusted based on the detection frame identifiers in the sample pictures and the detection results output by the model.
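As an illustration of the marking described in this step, the following minimal sketch shows one possible in-memory layout for an annotated sample picture. The class names (`HeadAnnotation`, `SamplePicture`), the organ names, the `(x_min, y_min, x_max, y_max)` box convention and the file name are assumptions made for the example; the source does not prescribe an annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed organ set, taken from the preferred embodiment.
SENSE_ORGANS = ("left_eye", "right_eye", "left_ear", "right_ear", "mouth")

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class HeadAnnotation:
    head_box: Box
    # Only the organs visible at this shooting angle are marked.
    organ_boxes: Dict[str, Box] = field(default_factory=dict)

    def is_valid(self) -> bool:
        """A real head should show at least one organ from the preset set."""
        known = all(name in SENSE_ORGANS for name in self.organ_boxes)
        return known and len(self.organ_boxes) >= 1


@dataclass
class SamplePicture:
    path: str
    heads: List[HeadAnnotation] = field(default_factory=list)


# A head seen from behind: only the two ears are visible (made-up coordinates).
back_view = HeadAnnotation(
    head_box=(40, 30, 120, 130),
    organ_boxes={"left_ear": (42, 70, 52, 95), "right_ear": (108, 70, 118, 95)},
)
sample = SamplePicture(path="sample_0001.jpg", heads=[back_view])  # hypothetical file name
```

The `is_valid` check mirrors the selection condition stated for the organ set: whatever the angle, a real head should carry at least one marked organ.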

Step 102, training a pre-constructed crowd detection model by using the sample data set to obtain a target crowd detection model.

Based on the sample pictures in the sample data set, training the crowd detection model can be specifically realized by adopting the following steps:

generating an attention feature vector for each head using a heuristic attention weighting network based on the head candidate detection box and the facial feature candidate detection box for the respective head;

Here, the specific method for adjusting the parameters of the crowd detection model according to the recognition result and the detection frame identified in the sample picture is known to those skilled in the art, and is not described herein again.

In the training method, both the heads of persons in the sample picture and the five sense organs in each head are detected. By using the detection results of the five sense organs together with the heuristic attention weighting network, objects misjudged as human heads in the head detection result can be effectively screened out. This improves the accuracy of the head attention feature vector input to the classification network, the accuracy of the head identification result output by the classification network, and in turn the detection accuracy of the crowd detection model.

In addition, because the training method introduces the heuristic attention weighting mechanism to improve classification accuracy, once the attention feature vector of a head has been generated, it only needs to be input into the classification network of the model to identify the authenticity of the head. Unlike existing head detection methods, the detection frame does not need to be fine-tuned through regression to improve detection accuracy, so the detection speed can be effectively improved compared with existing head detection methods.

The classification network identifies the authenticity of each head based on its attention feature vector. It can be implemented with an existing classifier, for example two fully connected layers followed by a softmax activation function, but is not limited to this; it can also be implemented with one fully connected layer or with more than two fully connected layers.
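The classifier structure mentioned above can be sketched as follows. This is a minimal, untrained illustration in plain Python, not the implementation of the embodiment: the class name `HeadClassifier`, the hidden size, the ReLU between the two layers and the random initialization are all assumptions.

```python
import math
import random
from typing import List

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dense(x: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class HeadClassifier:
    """Two fully connected layers, then softmax over (not a head, real head)."""

    def __init__(self, in_dim: int, hidden_dim: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
        self.b1 = [0.0] * hidden_dim
        self.w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(2)]
        self.b2 = [0.0] * 2

    def predict_proba(self, attention_vec: Vector) -> Vector:
        # ReLU between the layers is an assumption; the source names only the layers.
        hidden = [max(v, 0.0) for v in dense(attention_vec, self.w1, self.b1)]
        return softmax(dense(hidden, self.w2, self.b2))


clf = HeadClassifier(in_dim=3)
probs = clf.predict_proba([0.5, 1.0, 0.25])
# An all-zero attention vector leaves only the biases, which are zero here.
zero_probs = clf.predict_proba([0.0, 0.0, 0.0])
```

In this sketch an all-zero attention feature vector reaches the softmax carrying only the bias terms, so how such a candidate is classified is decided by the biases learned during training.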

In one embodiment, in the above model training method, the following method may be adopted to detect the head and the five sense organs in the sample picture, and obtain the head candidate detection frame and the five sense organ candidate detection frame:

Step a1, detecting the head in the sample picture by using a pre-trained head detection model to obtain a head candidate detection frame.

In this step, the detection frame of each head detected in the picture by the head detection model is used as a head candidate detection frame, so that the authenticity of the head can be verified in the subsequent steps.

Here, the head detection model is a model for detecting the head of a person in a picture.

Step a2, obtaining a subgraph of the corresponding head based on the head candidate detection frame.

In this step, the image inside each head candidate detection frame is used as the subgraph of the corresponding head, so that in the subsequent step the five sense organs can be detected in the subgraph and a sub-region feature map of each sense organ obtained.

Step a3, detecting each facial organ in the subgraph by using a facial organ detection model to obtain the facial organ candidate detection frame.

In this step, each preset sense organ is detected in the head subgraph, and each detected frame is used as the candidate detection frame of the corresponding sense organ. For example, if the five sense organs to be detected are the left eye, right eye, left ear, right ear and mouth, this step attempts to detect these five organs in the subgraph and obtain a detection frame for each of them.

It should be noted that, because shooting angles differ in practical applications, a head subgraph may not contain all of the preset five sense organs; that is, the subgraph may contain no detection frame for some of them.

In the above method, both the head detection model and the five sense organs detection model may be implemented by using an existing target detection method, for example, a region proposal network (RPN).
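The control flow of steps a1 to a3 can be sketched as below. The two detectors are passed in as plain callables and replaced by hypothetical stubs, since the source leaves their implementation open. Shifting organ boxes from subgraph coordinates back into picture coordinates is an assumption of this sketch, not something the source states.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), assumed convention
Image = List[List[float]]         # rows of pixel values, a stand-in for a real image


def crop(image: Image, box: Box) -> Image:
    """Step a2: the pixels inside a head candidate frame form the head subgraph."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def to_picture_coords(organ_box: Box, head_box: Box) -> Box:
    """Shift a box found inside the subgraph back into full-picture coordinates."""
    ox, oy = head_box[0], head_box[1]
    x0, y0, x1, y1 = organ_box
    return (x0 + ox, y0 + oy, x1 + ox, y1 + oy)


def detect_candidates(
    image: Image,
    head_detector: Callable[[Image], List[Box]],
    organ_detector: Callable[[Image], Dict[str, Box]],
) -> List[Tuple[Box, Dict[str, Box]]]:
    results = []
    for head_box in head_detector(image):        # step a1
        subgraph = crop(image, head_box)         # step a2
        organs = organ_detector(subgraph)        # step a3; may be empty
        organs = {name: to_picture_coords(b, head_box) for name, b in organs.items()}
        results.append((head_box, organs))
    return results


# Hypothetical stub detectors, used only to exercise the control flow.
picture = [[0.0] * 10 for _ in range(10)]


def stub_head_detector(img: Image) -> List[Box]:
    return [(2, 3, 8, 9)]


def stub_organ_detector(sub: Image) -> Dict[str, Box]:
    return {"mouth": (1, 4, 4, 5)}


candidates = detect_candidates(picture, stub_head_detector, stub_organ_detector)
```

An organ missing from the returned dictionary corresponds to the case noted above, where a subgraph contains no detection frame for some preset sense organ.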

In one embodiment, in the above model training method, for each detected head, generating the attention feature vector of the head using a heuristic attention weighting network may employ the following method:

Step b1, extracting a corresponding head sub-region feature matrix based on the first head candidate detection box by using a first region-of-interest extraction layer (ROI Pooling) of the heuristic attention weighting network.

In this step, for each head candidate detection box detected in step a1, a corresponding head sub-region feature matrix (i.e., a head sub-region feature map) is extracted based on the head candidate detection box, so that the head average feature vector of the corresponding head can be obtained. The first head candidate detection box denotes any one of the head candidate detection boxes detected in step a1.

Step b2, utilizing a first Global Pooling layer (Global Pooling) of the heuristic attention weighting network to perform Global average sampling on the head sub-region feature matrix to obtain a corresponding head average feature vector.

Step b3, extracting a corresponding feature matrix of the sub-region of the five sense organs based on each of the candidate detection boxes of the five sense organs in the first candidate detection box of the head by using a second region of interest extraction layer of the heuristic attention weighting network.

In this step, a sub-region feature matrix is extracted for each of the five sense organs of the head in the first head candidate detection frame. If a sense organ has no candidate detection frame, the corresponding sub-region feature matrix does not exist.

Step b4, utilizing the second global pooling layer of the heuristic attention weighting network to perform average sampling on the feature matrix of each facial feature subregion so as to obtain an average feature vector of the corresponding facial feature.

In this step, the sub-region feature matrix of each of the five sense organs of the head in the first head candidate detection frame is average-sampled to obtain the average feature vector of the corresponding sense organ, so that attention weighting can be performed based on these average feature vectors to screen out the features of objects falsely detected as human heads.
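Steps b1 to b4 apply the same two operations to head boxes and sense organ boxes: cut a sub-region out of a feature map, then average each channel. The sketch below shows this on a toy feature map. It is a simplification under stated assumptions: a plain crop stands in for ROI Pooling (which normally also resamples the region to a fixed grid), boxes are assumed to be non-empty and expressed in feature-map cells, and the feature values are made up.

```python
from typing import List, Tuple

FeatureMap = List[List[List[float]]]   # indexed as [channel][row][col]
Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max) in feature-map cells


def roi_extract(feature_map: FeatureMap, box: Box) -> FeatureMap:
    """Cut the sub-region feature matrix for one candidate box (no resizing)."""
    x0, y0, x1, y1 = box
    return [[row[x0:x1] for row in channel[y0:y1]] for channel in feature_map]


def global_average_pool(region: FeatureMap) -> List[float]:
    """Average every channel over its spatial extent: one value per channel."""
    pooled = []
    for channel in region:
        cells = [v for row in channel for v in row]
        pooled.append(sum(cells) / len(cells))
    return pooled


# Toy feature map with 2 channels of 4x4 cells.
fmap = [
    [[1.0, 1.0, 2.0, 2.0] for _ in range(4)],
    [[0.0, 4.0, 0.0, 4.0] for _ in range(4)],
]
head_vec = global_average_pool(roi_extract(fmap, (0, 0, 4, 4)))   # steps b1 and b2
organ_vec = global_average_pool(roi_extract(fmap, (2, 0, 4, 2)))  # steps b3 and b4
```

Both vectors have one entry per channel, so the head vector and every sense organ vector share the same dimension regardless of box size.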

Step b5, calculating an attention weight vector of each of the five sense organs in the head corresponding to the first head candidate detection box based on the head average feature vector and the average feature vector of the corresponding five sense organs.

In one embodiment, the attention weight vector of each of the five sense organs in the respective head may be calculated according to

$$ w_i = \begin{cases} m_i \odot h, & \text{if } m_i \text{ exists} \\ \mathbf{0}, & \text{otherwise} \end{cases} $$

wherein $w_i$ represents the attention weight vector of sense organ $i$, $m_i$ represents the average feature vector of sense organ $i$, $h$ represents the head average feature vector, $\odot$ denotes point (element-wise) multiplication, and $w_i$ has the same dimension as $h$.

In the above calculation, if a sense organ has an average feature vector, that vector is point-multiplied with the head average feature vector to obtain the corresponding attention weight vector. If a sense organ has no average feature vector, its attention weight vector is zero. Thus, for an object falsely detected as a human head, the attention weight vectors of all five sense organs are zero, because no sense organ detection frame is detected in its subgraph.
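The rule for step b5 can be sketched as follows. Point multiplication is read here as element-wise multiplication, which is an interpretation based on the statement that each weight vector has the same dimension as the head vector. The function name, organ names and numeric values are illustrative assumptions.

```python
from typing import Dict, List, Optional

SENSE_ORGANS = ("left_eye", "right_eye", "left_ear", "right_ear", "mouth")


def attention_weights(
    head_vec: List[float],
    organ_vecs: Dict[str, Optional[List[float]]],
) -> Dict[str, List[float]]:
    """Return one weight vector per preset organ, zero where the organ was not detected."""
    weights = {}
    for name in SENSE_ORGANS:
        m_i = organ_vecs.get(name)
        if m_i is None:
            # No candidate frame, hence no average feature vector: weight is zero.
            weights[name] = [0.0] * len(head_vec)
        else:
            # Element-wise product of the organ vector and the head vector.
            weights[name] = [m * h for m, h in zip(m_i, head_vec)]
    return weights


h = [1.5, 2.0]
w = attention_weights(h, {"left_ear": [2.0, 0.5]})   # only one organ was detected
```

Every preset organ receives an entry, so the later summation runs over a fixed set of organs whether or not each one was found.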

Step b6, performing point multiplication on the head average feature vector and the attention weight vector of each corresponding five sense organs respectively, and summing the result of the point multiplication to obtain the attention feature vector of the head corresponding to the first head candidate detection box.

Here, as described in the above step, every attention weight vector of the five sense organs of an object that merely resembles a human head is zero, and point-multiplying a zero vector with the head average feature vector yields a zero vector. The attention feature vector of such an object is therefore a zero vector, which sharpens the distinction between a human head and other objects of similar shape, so that objects falsely detected as human heads can be effectively screened out by step b6.
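Step b6 can be illustrated with the short sketch below, again reading point multiplication as element-wise multiplication (an interpretation, as the source does not spell it out). The function name and the numeric values are assumptions chosen so that the arithmetic is easy to follow.

```python
from typing import Dict, List


def attention_feature_vector(
    head_vec: List[float],
    weights: Dict[str, List[float]],
) -> List[float]:
    """Sum, over all organs, the element-wise product of the head vector and each weight vector."""
    total = [0.0] * len(head_vec)
    for w_i in weights.values():
        for k, (h_k, w_k) in enumerate(zip(head_vec, w_i)):
            total[k] += h_k * w_k
    return total


h = [1.5, 2.0]
# A real head with one detected organ, and a look-alike object with none.
real_head = attention_feature_vector(h, {"left_ear": [3.0, 1.0], "mouth": [0.0, 0.0]})
look_alike = attention_feature_vector(h, {"left_ear": [0.0, 0.0], "mouth": [0.0, 0.0]})
```

With these values the real head yields a non-zero vector while the look-alike collapses to the zero vector, which is the separation the classification network then relies on.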

In one embodiment, in the above model training method, when adjusting parameters of the crowd detection model according to the result output by the classification network, parameters of the head detection model, the five sense organs detection model, the heuristic attention weighting network, and the classification network in the model are specifically optimized and adjusted. The specific adjustment method is known to those skilled in the art and will not be described herein.

In one embodiment, the heuristic attention weighting network and the classification network may be trained by minimizing a cross-entropy loss function with stochastic gradient descent, but training is not limited thereto.

To facilitate a clear understanding of the crowd detection model structure provided by the embodiment of the invention, Fig. 2 shows a schematic diagram of a crowd detection model network structure obtained based on the above model training method. As shown in Fig. 2, the model includes a head detection model, a five sense organs detection model, a heuristic attention weighting network, and a classification network. In the network structure example, the head detection model and the five sense organs detection model are both implemented by using RPN.
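To make the combination of cross-entropy loss and stochastic gradient descent concrete, the sketch below performs a few updates on a single linear layer followed by softmax. It is deliberately reduced: the single layer, the learning rate, the zero initialization and the one training sample are assumptions for illustration, and the real model would back-propagate through all of its sub-networks.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], label: int) -> float:
    return -math.log(probs[label])


def sgd_step(weights: List[List[float]], bias: List[float],
             x: List[float], label: int, lr: float = 0.1) -> float:
    """One SGD update on one sample; returns the loss measured before the update."""
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    probs = softmax(logits)
    loss = cross_entropy(probs, label)
    for c in range(len(weights)):
        # Gradient of softmax cross-entropy with respect to logit c.
        grad_logit = probs[c] - (1.0 if c == label else 0.0)
        bias[c] -= lr * grad_logit
        for k in range(len(x)):
            weights[c][k] -= lr * grad_logit * x[k]
    return loss


weights = [[0.0, 0.0], [0.0, 0.0]]   # two classes: 0 = not a head, 1 = real head
bias = [0.0, 0.0]
sample_x, sample_label = [4.5, 2.0], 1   # made-up attention feature vector of a real head
losses = [sgd_step(weights, bias, sample_x, sample_label) for _ in range(5)]
```

Starting from zero parameters the first loss equals ln 2, and repeated updates on the same sample reduce it step by step.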

Based on the above embodiment of the model training method, an embodiment of the present invention further provides a crowd counting method. As shown in Fig. 3, the crowd counting method includes:

Step 301, obtaining a target detection picture.

Step 302, detecting the heads of persons in the target detection picture based on a crowd detection model, and counting the detected heads to obtain the number of persons in the target detection picture.

The confidence of each counted head is greater than a preset threshold; the crowd detection model is obtained in advance by using the above embodiment of the crowd detection model training method.

In step 302, the heads detected by the crowd detection model are counted according to their confidence; that is, only heads whose confidence is greater than the preset threshold are counted. The confidence of a detection result can be calculated by an existing method.

As described in the above analysis, the crowd detection model used in this step introduces five sense organs detection combined with the heuristic attention mechanism, so false detection results of human head detection can be effectively screened out and the detection accuracy of the crowd detection model can be ensured. Therefore, in step 302, the crowd detection model obtained by training in the first embodiment of the present invention is used to detect the heads of persons in the target detection picture, and the heads are counted according to the detection result, so that the accuracy of crowd detection can be improved.

Here, the threshold value, which is a constraint condition for limiting the head of the detected person to participate in counting, may be set to an appropriate value by those skilled in the art.
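The counting rule of step 302 amounts to a filtered count, sketched below. The function name, the default threshold of 0.5 and the detection values are assumptions for the example; the source leaves the threshold to the practitioner.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max), assumed convention


def count_people(detections: List[Tuple[Box, float]], threshold: float = 0.5) -> int:
    """Count heads whose confidence is strictly greater than the preset threshold."""
    return sum(1 for _box, confidence in detections if confidence > threshold)


# Made-up model output: two confident heads and one low-confidence candidate.
detections = [
    ((10, 10, 50, 60), 0.97),
    ((80, 15, 120, 70), 0.91),
    ((200, 40, 240, 90), 0.12),
]
n_people = count_people(detections, threshold=0.5)
```

Raising the threshold trades missed heads for fewer false counts, which is why it is left as a tunable constraint.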

Corresponding to the above embodiment of the model training method, an embodiment of the present invention further provides a model training apparatus, as shown in fig. 4, the apparatus includes:

a sample data obtaining module 401, configured to obtain a sample data set; wherein, the head and the five sense organs of the person in the sample picture are marked with a detection frame.

A model training module 402, configured to train a pre-constructed crowd detection model by using the sample data set to obtain a target crowd detection model; wherein the training comprises:

Corresponding to the above embodiment of the crowd counting method, an embodiment of the present invention further provides a crowd counting apparatus, as shown in fig. 5, the crowd counting apparatus includes:

a detected target obtaining module 501, configured to obtain a target detected picture;

a head detection module 502, configured to detect the heads of persons in the target detection picture based on a crowd detection model, and count the detected heads to obtain the number of persons in the target detection picture; wherein the confidence of each counted head is greater than a preset threshold; the crowd detection model is obtained by training with the above crowd detection model training method.

It can be seen from the above embodiments that, in the model training method, five sense organs detection is introduced on top of head detection when the crowd detection model is trained with the sample pictures, and the results of the head detection and the five sense organs detection are processed together by a heuristic attention weighting mechanism to generate an attention feature vector for each detected head. Using the five sense organs detection results with the heuristic attention weighting mechanism sharpens the distinction between a human head and other objects of similar shape. This improves the accuracy of the attention feature vector input to the classification network, allows false detection results of the head detection to be screened out, and improves the detection accuracy of the trained crowd detection model. Accordingly, the accuracy of crowd counting performed with the crowd detection model is improved.

The crowd detection model provided by the embodiment of the invention can effectively overcome the influence of the shooting angle on detection accuracy, so the crowd counting method based on it applies to a wider range of scenes, such as densely crowded scenes, sparsely crowded scenes, and scenes with widely varying shooting angles.

Corresponding to the embodiment of the crowd detection model training method, the embodiment of the invention also provides crowd detection model training equipment, which comprises a processor and a memory;

the memory has stored therein an application executable by the processor for causing the processor to perform the crowd detection model training method as described above.

Embodiments of the present invention also provide a computer-readable storage medium, in which computer-readable instructions are stored, and the computer-readable instructions are used for executing the crowd detection model training method described above.

Corresponding to the embodiment of the crowd counting method, the embodiment of the invention also provides crowd counting equipment, which comprises a processor and a memory;

In the above embodiments, the memory may be specifically implemented as various storage media such as an Electrically Erasable Programmable Read Only Memory (EEPROM), a Flash memory (Flash memory), a Programmable Read Only Memory (PROM), and the like. The processor may be implemented to include one or more central processors or one or more field programmable gate arrays, wherein the field programmable gate arrays integrate one or more central processor cores. In particular, the central processor or central processor core may be implemented as a CPU or MCU.

It should be noted that not all steps and modules in the above flows and structures are necessary, and some steps or modules may be omitted according to actual needs. The execution order of the steps is not fixed and can be adjusted as required. The division of each module is only for convenience of describing adopted functional division, and in actual implementation, one module may be divided into multiple modules, and the functions of multiple modules may also be implemented by the same module, and these modules may be located in the same device or in different devices.

The hardware modules in the various embodiments may be implemented mechanically or electronically. For example, a hardware module may include specially designed permanent circuits or logic devices (e.g., a special purpose processor such as an FPGA or ASIC) for performing specific operations. A hardware module may also include programmable logic devices or circuits (e.g., including a general-purpose processor or other programmable processor) that are temporarily configured by software to perform certain operations. Whether a hardware module is implemented mechanically, in a dedicated permanent circuit, or in a temporarily configured circuit (e.g., configured by software) may be decided based on cost and time considerations.

Examples of storage media for supplying the program code include floppy disks, hard disks, magneto-optical disks, optical disks (e.g., CD-ROMs, CD-Rs, CD-RWs, DVD-ROMs, DVD-RAMs, DVD-RWs, DVD+RWs), magnetic tapes, nonvolatile memory cards, and ROMs. Alternatively, the program code may be downloaded from a server computer or the cloud over a communication network.

As used herein, "exemplary" means "serving as an example, instance, or illustration", and any illustration, embodiment, or step described herein as "exemplary" should not be construed as a preferred or more advantageous alternative. For simplicity, the drawings only schematically show the parts relevant to the invention and do not represent the actual structure of a product. In addition, to keep the drawings concise and easy to understand, where several components in a drawing have the same structure or function, only one of them is schematically illustrated or labeled. In this document, "a" does not limit the number of the relevant parts of the invention to "only one", nor does it exclude the case of "more than one". In this document, "upper", "lower", "front", "rear", "left", "right", "inner", "outer", and the like are used only to indicate relative positional relationships between the relevant parts and do not limit their absolute positions.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for training a crowd detection model, the method comprising:

2. The method of claim 1, wherein the detecting the head and the five sense organs of the person in the sample picture to obtain a head candidate detection frame and a five sense organ candidate detection frame comprises:

3. The method of claim 1, wherein the generating the attention feature vector for the respective head comprises:

4. The method of claim 3, wherein said calculating an attention weight vector for each of said five sense organs in the respective head comprises:

5. The method of claim 2, wherein the adjusting the parameters of the crowd detection model comprises:

6. The method of claim 1, wherein the five sense organs comprise:

left eye, right eye, left ear, right ear, and mouth.

7. A crowd counting method, comprising:

acquiring a target detection picture;

wherein the confidence of each counted head is greater than a preset threshold; and the crowd detection model is trained in advance by the method of any one of claims 1 to 6.
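
By way of illustration only (this is not part of the claim), the counting step of claim 7 can be sketched as follows. This is a minimal sketch that assumes the crowd detection model has already produced head boxes with confidence scores; the `HeadDetection` type, the `count_heads` function, the sample scores, and the threshold value of 0.5 are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class HeadDetection:
    """Hypothetical container for one detected head (not from the application)."""
    box: Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
    confidence: float  # model confidence that the box contains a real head


def count_heads(detections: Iterable[HeadDetection], threshold: float = 0.5) -> int:
    """Count detections whose confidence is strictly greater than the threshold."""
    return sum(1 for d in detections if d.confidence > threshold)


# Illustrative detections; in practice these would come from the trained model.
detections = [
    HeadDetection((10.0, 10.0, 50.0, 50.0), 0.92),
    HeadDetection((60.0, 12.0, 95.0, 48.0), 0.40),
    HeadDetection((120.0, 30.0, 160.0, 70.0), 0.77),
]
people = count_heads(detections, threshold=0.5)
```

The comparison is strict because the claim requires the confidence to be greater than the preset threshold, so a detection whose confidence equals the threshold is not counted.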

8. A crowd detection model training device, comprising:

9. A crowd counting device, comprising:

a head detection module, configured to detect heads in the target detection picture based on a crowd detection model and to count the detected heads to obtain the number of people in the target detection picture; wherein the confidence of each counted head is greater than a preset threshold; and the crowd detection model is trained using the method of any one of claims 1 to 6.

10. A crowd detection model training apparatus comprising a processor and a memory;

11. A computer-readable storage medium having computer-readable instructions stored thereon for performing the crowd detection model training method of any one of claims 1 to 6.

12. A crowd counting apparatus comprising a processor and a memory;

the memory stores an application executable by the processor, the application causing the processor to perform the crowd counting method of claim 7.

13. A computer-readable storage medium having computer-readable instructions stored thereon for performing the crowd counting method of claim 7.