CN113435390A - Crowd positioning method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113435390A
CN113435390A
Authority
CN
China
Prior art keywords
crowd
human body
positioning
body frame
image
Legal status
Pending
Application number
CN202110777233.2A
Other languages
Chinese (zh)
Inventor
杨昆霖
李昊鹏
侯军
伊帅
Current Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202110777233.2A
Publication of CN113435390A


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Abstract

The present disclosure relates to a crowd positioning method and apparatus, an electronic device, and a storage medium, the method including: performing feature extraction on a crowd image to obtain at least one first feature map; performing human body key point detection on the at least one first feature map, and determining a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; weighting the at least one first feature map according to the positioning probability map to obtain at least one second feature map; and performing human body frame detection on the at least one second feature map, and determining a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result includes the target human body frames corresponding to all human bodies in the crowd image. The embodiments of the disclosure can improve the accuracy of crowd positioning.

Description

Crowd positioning method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a crowd positioning method and apparatus, an electronic device, and a storage medium.
Background
With population growth and accelerating urbanization, large crowd gatherings occur more and more frequently and at ever larger scales. Crowd analysis is therefore of great significance for public safety and city planning. Common crowd analysis tasks include crowd counting, crowd behavior parsing, crowd positioning, and the like, among which crowd positioning is the basis of the other tasks. Crowd positioning means estimating, through a computer vision algorithm, the position of each human body included in an image or a video and determining its coordinates, thereby providing a data basis for subsequent crowd analysis tasks such as crowd counting and crowd behavior analysis. The accuracy of crowd positioning directly affects the precision of crowd counting and the results of crowd behavior analysis. Therefore, a crowd positioning method with high accuracy is needed.
Disclosure of Invention
The disclosure provides a crowd positioning method and device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a crowd positioning method, including: performing feature extraction on a crowd image to obtain at least one first feature map; performing human body key point detection on the at least one first feature map, and determining a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; weighting the at least one first feature map according to the positioning probability map to obtain at least one second feature map; and performing human body frame detection on the at least one second feature map, and determining a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result includes the target human body frames corresponding to all human bodies in the crowd image.
In a possible implementation manner, the performing human key point detection on the at least one first feature map and determining a positioning probability map corresponding to the crowd image includes: performing convolution processing on the target first feature map to obtain a third feature map, wherein the target first feature map is one of the at least one first feature map; performing transpose convolution processing on the third feature map to obtain a fourth feature map, wherein the fourth feature map and the crowd image have the same size; and performing convolution processing on the fourth feature map to obtain the positioning probability map.
In a possible implementation manner, the weighting the at least one first feature map according to the positioning probability map to obtain at least one second feature map includes: and according to the positioning probability map, carrying out pixel-level weighting processing on each first feature map to obtain the second feature map corresponding to each first feature map.
In a possible implementation manner, the performing human body frame detection on the at least one second feature map and determining a target human body frame detection result corresponding to the crowd image includes: performing human body frame detection on the at least one second feature map to obtain at least one human body frame prediction result; and fusing the at least one human body frame prediction result to obtain the target human body frame detection result.
In a possible implementation manner, the fusing the at least one human body frame prediction result to obtain the target human body frame detection result includes: and carrying out non-maximum suppression processing on the at least one human body frame prediction result to obtain the target human body frame detection result.
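The non-maximum suppression step above can be sketched as a minimal greedy NMS over predicted human body frames. This is an illustrative sketch, not the disclosure's implementation; the IoU threshold of 0.5 is an assumption.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.
    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns indices of the kept boxes, highest score first."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the top-scoring box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes whose overlap with the kept box exceeds the threshold.
        order = order[1:][iou <= iou_thresh]
    return keep
```

Heavily overlapping predictions of the same human body are thus collapsed into the single highest-scoring frame, which is what the fusion of per-level prediction results requires.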
In one possible implementation, the method further includes: according to the positioning probability map, carrying out crowd analysis on the crowd image to obtain a first analysis result corresponding to the crowd image; according to the detection result of the target human body frame, carrying out crowd analysis on the crowd image to obtain a second analysis result corresponding to the crowd image; and fusing the first analysis result and the second analysis result to obtain a target analysis result corresponding to the crowd image.
In one possible implementation manner, the crowd positioning method is implemented by a crowd positioning neural network, and the training samples of the crowd positioning neural network include: a crowd sample image, a real positioning map corresponding to the crowd sample image, and a real human body frame detection result corresponding to the crowd sample image. The training method of the crowd positioning neural network includes: determining, through the crowd positioning neural network, a predicted positioning probability map corresponding to the crowd sample image and a predicted human body frame detection result corresponding to the crowd sample image, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the crowd sample image is a target human body point, and the predicted human body frame detection result includes a predicted human body frame corresponding to each human body in the crowd sample image; determining a first positioning loss based on the predicted positioning probability map and the real positioning map; determining a second positioning loss based on the predicted human body frame detection result and the real human body frame detection result; and optimizing the crowd positioning neural network based on the first positioning loss and the second positioning loss.
In one possible implementation, the optimizing the crowd positioning neural network based on the first positioning loss and the second positioning loss includes: summing the first positioning loss and the second positioning loss to obtain a target positioning loss; and optimizing the crowd positioning neural network based on the target positioning loss.
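A minimal sketch of the loss summation described above. The disclosure does not name the individual loss functions, so binary cross-entropy on the positioning probability map and smooth-L1 on the boxes are assumptions chosen purely for illustration:

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    # Illustrative first positioning loss: per-pixel binary cross-entropy
    # between predicted and real positioning maps.
    p = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(p) + (1 - target) * np.log(1 - p)).mean()

def smooth_l1(pred_boxes, true_boxes):
    # Illustrative second positioning loss on predicted human body frames.
    d = np.abs(pred_boxes - true_boxes)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).mean()

def target_loss(pred_map, true_map, pred_boxes, true_boxes):
    # Target positioning loss: the sum of the two positioning losses.
    return bce(pred_map, true_map) + smooth_l1(pred_boxes, true_boxes)
```

Only the final summation corresponds to what the disclosure states; the two component losses stand in for whatever losses the network is actually trained with.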
In a possible implementation manner, the real positioning map includes the positions of the target human body points corresponding to the human bodies in the crowd sample image, where one human body corresponds to one target human body point. The method further includes: performing face detection on the crowd sample image and determining a predicted face frame corresponding to the crowd sample image; determining the real human body frame corresponding to each human body in the crowd sample image through a linear interpolation algorithm, according to the position of the target human body point corresponding to each human body in the crowd sample image and the predicted face frame corresponding to the crowd sample image; and determining the real human body frame detection result according to the real human body frame corresponding to each human body in the crowd sample image.
According to an aspect of the present disclosure, there is provided a crowd positioning device, including: a feature extraction module, configured to perform feature extraction on a crowd image to obtain at least one first feature map; a human body key point detection module, configured to perform human body key point detection on the at least one first feature map and determine a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; a feature enhancement module, configured to weight the at least one first feature map according to the positioning probability map to obtain at least one second feature map; and a human body frame detection module, configured to perform human body frame detection on the at least one second feature map and determine a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result includes the target human body frames corresponding to all human bodies in the crowd image.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiment of the disclosure, feature extraction is performed on the crowd image to obtain at least one first feature map; human body key point detection is performed on the at least one first feature map, and a positioning probability map corresponding to the crowd image is determined, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; the at least one first feature map is weighted according to the positioning probability map to obtain at least one second feature map; and human body frame detection is performed on the at least one second feature map, and a target human body frame detection result corresponding to the crowd image is determined, wherein the target human body frame detection result includes the target human body frames corresponding to all human bodies in the crowd image. Since human body key point detection and human body frame detection are comprehensively considered in the crowd positioning process, the accuracy of crowd positioning can be effectively improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a method of crowd location according to an embodiment of the disclosure;
FIG. 2 illustrates a schematic diagram of a population-localizing neural network, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of a crowd locating device according to an embodiment of the disclosure;
FIG. 4 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;
fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality; for example, "including at least one of A, B and C" may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow chart of a crowd positioning method according to an embodiment of the disclosure. The crowd positioning method may be executed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and the like, and the crowd positioning method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the crowd location method may be performed by a server. As shown in fig. 1, the crowd locating method may include:
in step S11, feature extraction is performed on the crowd image to obtain at least one first feature map.
The crowd image is an image including a dense crowd. It may be obtained by image acquisition equipment capturing a dense crowd within a certain spatial range, may be a key image frame including a dense crowd extracted from a video, or may be obtained in other ways, which is not specifically limited by the present disclosure.
The feature extraction of the crowd image may be specifically performed by a feature extraction module in a convolutional neural network, and a detailed description of the feature extraction process will be described later in combination with possible implementation manners of the present disclosure, which is not described herein again.
In step S12, human body key point detection is performed on the at least one first feature map, and a positioning probability map corresponding to the crowd image is determined, where the positioning probability map is used to indicate the probability that each pixel point in the crowd image is a target human body point.
The target human body point may be any key point used for indicating a human body, and one human body in the crowd image corresponds to one target human body point. For example, the target human body point may be the center point of the head of a human body, a key point of the face of a human body, or the like, which is not specifically limited by the present disclosure.
By performing human body key point detection on the at least one first feature map corresponding to the crowd image, the probability that each pixel point in the crowd image is a target human body point can be determined, so that the positioning probability map corresponding to the crowd image can be obtained. The human body key point detection process will be described in detail later with reference to possible implementations of the present disclosure, and is not detailed here.
In step S13, at least one first feature map is weighted according to the positioning probability map to obtain at least one second feature map.
Because the positioning probability map is a detection result of human body key point detection, at least one first feature map corresponding to the crowd image is weighted according to the positioning probability map, so that feature enhancement of the at least one first feature map can be realized by using the detection result of the human body key point detection, and at least one second feature map after the feature enhancement corresponding to the crowd image is obtained. The weighting process will be described in detail later with reference to possible implementations of the present disclosure, and will not be described in detail here.
In step S14, performing human frame detection on the at least one second feature map, and determining a target human frame detection result corresponding to the crowd image, where the target human frame detection result includes target human frames corresponding to human bodies in the crowd image.
In the related art, crowd positioning is achieved by performing human body frame detection only on the first feature map obtained by feature extraction on the crowd image. By contrast, in the embodiment of the disclosure, the at least one second feature map is obtained after feature enhancement with the detection result of human body key point detection, so performing human body frame detection on the at least one second feature map allows both human body key point detection and human body frame detection to be considered comprehensively in the crowd positioning process, yielding a target human body frame detection result with higher accuracy for the crowd image and thus effectively improving the accuracy of crowd positioning. The human body frame detection process will be described in detail later with reference to possible implementations of the present disclosure, and is not detailed here.
In the embodiment of the disclosure, feature extraction is performed on the crowd image to obtain at least one first feature map; human body key point detection is performed on the at least one first feature map, and a positioning probability map corresponding to the crowd image is determined, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; the at least one first feature map is weighted according to the positioning probability map to obtain at least one second feature map; and human body frame detection is performed on the at least one second feature map, and a target human body frame detection result corresponding to the crowd image is determined, wherein the target human body frame detection result includes the target human body frames corresponding to all human bodies in the crowd image. Since human body key point detection and human body frame detection are comprehensively considered in the crowd positioning process, the accuracy of crowd positioning can be effectively improved.
In a possible implementation manner, a feature extraction module in the convolutional neural network may be utilized to perform feature extraction on the crowd image to obtain at least one first feature map.
In one example, the feature extraction module in the convolutional neural network may be composed of a deep convolutional neural network (e.g., a VGG-16 network) trained on a common image dataset (e.g., ImageNet) in the computer vision domain.
For example, the crowd image is input into the feature extraction module in the convolutional neural network. After the feature extraction module performs feature extraction on the crowd image, a four-layer feature pyramid is obtained, that is, four first feature maps F1, F2, F3 and F4, whose sizes are one fourth, one eighth, one sixteenth and one thirty-second of the crowd image, respectively.
The structure of the feature extraction module may be configured as other network structures according to actual situations, besides the VGG-16 network, which is not specifically limited in this disclosure. For the feature extraction modules of different network structures, the number and size of the first feature maps obtained after feature extraction is performed on the crowd images may be different, which is not specifically limited by the present disclosure.
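The pyramid geometry described above can be sketched as follows; the strides 4, 8, 16 and 32 follow directly from the stated fractions, while the backbone layers themselves are omitted:

```python
def pyramid_sizes(h, w):
    # Spatial sizes of the four first feature maps F1-F4 for an h x w crowd
    # image: one fourth, one eighth, one sixteenth and one thirty-second
    # of the input, i.e. strides 4, 8, 16 and 32.
    return [(h // s, w // s) for s in (4, 8, 16, 32)]
```

As noted above, a backbone with a different network structure may produce a different number of levels or different sizes.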
In a possible implementation manner, the performing human body key point detection on the at least one first feature map and determining a positioning probability map corresponding to the crowd image include: performing convolution processing on a target first feature map to obtain a third feature map, wherein the target first feature map is one of the at least one first feature map; performing transposed convolution processing on the third feature map to obtain a fourth feature map, wherein the fourth feature map and the crowd image have the same size; and performing convolution processing on the fourth feature map to obtain the positioning probability map.
Any one of the at least one first feature map is selected as the target first feature map. Convolution processing is performed on the target first feature map to achieve further feature extraction and obtain the third feature map; transposed convolution processing is performed on the third feature map so that the probability that each pixel point in the crowd image is the target human body point can be determined, yielding the fourth feature map with the same size as the crowd image; and convolution processing is then performed on the fourth feature map, which effectively yields the positioning probability map indicating the probability that each pixel point in the crowd image is a target human body point.
Any one of the at least one first feature map is selected as a target first feature map to detect the human key points, so that the human key point detection efficiency can be effectively improved, and the positioning probability map can be quickly obtained. The selection of the target first characteristic diagram can be determined according to actual conditions, and the selection is not specifically limited by the disclosure.
In one example, the localization probability map is the same size as the crowd image. For example, the crowd image is I ∈ R^(H×W×3), where H and W are the height and width of the crowd image respectively and the number of channels of the crowd image is 3 (for example, the three RGB channels). In this case, the positioning probability map can be recorded as Y ∈ R^(H×W), where Y(x) ∈ [0,1]; Y(x) indicates the probability that I(x) is a target human body point, and x is the coordinate of the pixel point at the same relative position in the crowd image I and the positioning probability map Y.
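A minimal illustration of these shapes; the values here are random placeholders, not an actual network output:

```python
import numpy as np

H, W = 480, 640
I = np.zeros((H, W, 3), dtype=np.float32)    # crowd image I in R^(H x W x 3)
Y = np.random.rand(H, W).astype(np.float32)  # positioning probability map Y in R^(H x W)
# Y assigns each pixel x of I one probability Y(x) in [0, 1] of being
# a target human body point: one map value per image pixel, no channels.
```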
In a possible implementation manner, a human key point detection module in a convolutional neural network can be utilized to perform human key point detection on the target first feature map to obtain a positioning probability map.
In an example, a human keypoint detection module in a convolutional neural network may include a convolutional layer for performing convolutional processing and a transposed convolutional layer for performing transposed convolutional processing. The specific structure of the human body key point detection module may be set according to actual conditions (for example, the number of layers of the convolutional layers, the arrangement manner of each layer, and the like), and this disclosure does not specifically limit this.
In one example, taking the first feature map F2 among the four first feature maps F1, F2, F3 and F4 as the target first feature map, the target first feature map F2 is input into the human body key point detection module in the convolutional neural network. The process in which the human body key point detection module performs human body key point detection on F2 can be described as follows: three convolutional layers (convolution kernel size 3, dilation rate 2, 512 channels) perform convolution processing on F2 to achieve further feature extraction and obtain the third feature map; then three transposed convolutional layers (convolution kernel size 4, stride 2, with 256, 128 and 64 channels respectively), each followed by a convolutional layer (convolution kernel size 3, dilation rate 2, with 256, 128 and 64 channels respectively), perform transposed convolution processing on the third feature map to transform the features to the size of the crowd image and obtain the fourth feature map with the same size as the crowd image; finally, one 1×1 convolutional layer performs convolution processing on the fourth feature map, transforming the number of feature channels to 1, and outputs the positioning probability map Y.
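The upsampling arithmetic above can be checked with the standard transposed-convolution output-size formula, out = (in − 1) * stride − 2 * padding + kernel. With kernel 4 and stride 2, and assuming padding 1 (the disclosure does not state the padding), each transposed layer doubles the spatial size, so three of them map the 1/8-scale F2 back to the full image size:

```python
def tconv_out(size, kernel=4, stride=2, padding=1):
    # Output extent of a transposed convolution along one spatial axis.
    return (size - 1) * stride - 2 * padding + kernel

# F2 is one eighth the size of a 512-pixel-wide crowd image, i.e. 64.
size = 512 // 8
for _ in range(3):  # three transposed conv layers, each doubling the extent
    size = tconv_out(size)
print(size)  # 512: back to the full image extent
```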
In a possible implementation manner, in order to improve the accuracy of the positioning probability map, the human body key point detection may be performed by using a plurality of first feature maps to determine the positioning probability map corresponding to the crowd image.
Still taking the above four first feature maps F1, F2, F3 and F4 as an example, the four first feature maps are respectively input into the human body key point detection module in the convolutional neural network, and based on the human body key point detection module, the positioning probability map Y1 corresponding to F1, the positioning probability map Y2 corresponding to F2, the positioning probability map Y3 corresponding to F3 and the positioning probability map Y4 corresponding to F4 are respectively obtained. The positioning probability maps Y1, Y2, Y3 and Y4 are then fused to obtain the positioning probability map Y corresponding to the crowd image. The detection process of the human body key point detection module on each first feature map is similar to its detection process on the target first feature map F2 described above, and is not repeated here.
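The disclosure does not specify the fusion operation for the per-level probability maps; one plausible minimal sketch is a per-pixel average of the four maps (averaging is an assumption chosen for illustration):

```python
import numpy as np

def fuse_probability_maps(maps):
    """Fuse per-level positioning probability maps (all already at image size)
    into a single map Y by per-pixel averaging, an illustrative choice."""
    stacked = np.stack(maps, axis=0)   # (levels, H, W)
    return stacked.mean(axis=0)        # (H, W), still in [0, 1]
```

An average keeps the fused values in [0, 1]; a learned fusion layer or a maximum over levels would be equally valid ways to realize the fusion the text describes.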
In a possible implementation manner, the weighting processing is performed on at least one first feature map according to the positioning probability map to obtain at least one second feature map, including: and according to the positioning probability map, performing pixel-level weighting processing on each first feature map to obtain a second feature map corresponding to each first feature map.
Because the positioning probability map is the detection result of the human body key point detection, each first feature map corresponding to the crowd image is weighted according to the positioning probability map, so that feature enhancement can be performed on each first feature map by using the detection result of the human body key point detection, and a feature-enhanced second feature map corresponding to each first feature map is obtained.
In a possible implementation manner, performing pixel-level weighting processing on each first feature map according to the positioning probability map to obtain a second feature map corresponding to each first feature map, includes: according to the size of each first feature map, downsampling the positioning probability map to obtain a sampling positioning probability map with the same size as each first feature map; and carrying out pixel-level weighting processing on each first feature map and the sampling positioning probability map with the same size as the first feature map to obtain a second feature map corresponding to each first feature map.
Still taking the above four first feature maps F1, F2, F3 and F4 and the positioning probability map Y as an example, the sizes of the four first feature maps are one fourth, one eighth, one sixteenth and one thirty-second of the crowd image, respectively, while the positioning probability map Y has the same size as the crowd image.
Therefore, to perform pixel-level weighting on the first feature maps F1, F2, F3 and F4 with the positioning probability map Y, the positioning probability map Y is first downsampled to obtain a sampling positioning probability map Y1 with the same size as the first feature map F1, a sampling positioning probability map Y2 with the same size as F2, a sampling positioning probability map Y3 with the same size as F3, and a sampling positioning probability map Y4 with the same size as F4. The positioning probability map may be downsampled with a bilinear interpolation algorithm, and other downsampling methods may also be adopted according to the actual situation, which is not specifically limited by the present disclosure.
Then, a pixel-level weighting process is performed on each first feature map by using the sampling localization probability map corresponding to each first feature map through the following formula (1), so as to obtain a second feature map corresponding to each first feature map.
Fi' = Fi ⊙ Yi,  i = 1, 2, 3, 4   (1)

wherein ⊙ denotes pixel-level (element-wise) multiplication.
Wherein F1' is the second feature map corresponding to the first feature map F1 and has the same size as F1; F2' is the second feature map corresponding to the first feature map F2 and has the same size as F2; F3' is the second feature map corresponding to the first feature map F3 and has the same size as F3; and F4' is the second feature map corresponding to the first feature map F4 and has the same size as F4.
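The pixel-level weighting of formula (1) amounts to broadcasting the single-channel sampled probability map over every channel of the matching feature map. A minimal NumPy sketch, with the channel count and sizes chosen only for illustration:

```python
import numpy as np

def weight_features(feature_map, sampled_prob_map):
    """F_i' = F_i ⊙ Y_i: broadcast the single-channel sampled
    localization probability map over every channel of the
    C x H x W feature map."""
    return feature_map * sampled_prob_map[None, :, :]

F1 = np.random.rand(256, 16, 16)   # C x H x W (channel count assumed)
Y1 = np.random.rand(16, 16)        # sampled probability map, same H x W
F1_prime = weight_features(F1, Y1)  # second feature map, same size as F1
```

Pixels with high probability of being a target human body point are amplified; background pixels are suppressed, which is the feature enhancement described above.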
In a possible implementation manner, performing human body frame detection on at least one second feature map, and determining a target human body frame detection result corresponding to the crowd image includes: performing human body frame detection on the at least one second characteristic diagram to obtain at least one human body frame prediction result; and fusing at least one human body frame prediction result to obtain a target human body frame detection result.
Human body frame detection is performed on each second feature map to obtain a human body frame prediction result corresponding to each second feature map, where the human body frame prediction result corresponding to each second feature map includes a predicted human body frame corresponding to each human body in the crowd image. The human body frame prediction results corresponding to the plurality of second feature maps are then fused, so that a target human body frame detection result with high accuracy corresponding to the crowd image can finally be obtained.
In a possible implementation manner, a human body frame detection module in the convolutional neural network may be utilized to perform human body frame detection on at least one second feature map to obtain a target human body frame.
In an example, the human body frame detection module in the convolutional neural network may be formed by a Single Shot MultiBox Detector (SSD) network. The structure of the human body frame detection module may be formed by an SSD network, or may be set to other network structures capable of implementing the human body frame detection function according to the actual situation, which is not specifically limited in this disclosure.
Still taking the above four second feature maps F1', F2', F3' and F4' as an example: the four second feature maps F1', F2', F3' and F4' are input into the human body frame detection module in the convolutional neural network. The human body frame detection module performs human body frame detection on each second feature map to obtain a human body frame prediction result corresponding to each second feature map, namely four human body frame prediction results, and the four human body frame prediction results are then fused to obtain the target human body frame detection result.
In a possible implementation manner, fusing at least one human body frame prediction result to obtain a target human body frame detection result, including: and carrying out non-maximum suppression processing on at least one human body frame prediction result to obtain a target human body frame detection result.
At least one human body frame prediction result is fused through a non-maximum suppression algorithm: for any human body in the crowd image, the predicted human body frames corresponding to that human body across the prediction results are determined, and among predicted human body frames whose intersection-over-union (IoU) exceeds the IoU threshold, one is retained as the target human body frame corresponding to that human body. A target human body frame detection result including the target human body frame corresponding to each human body in the crowd image can thus finally be obtained. The specific value of the IoU threshold may be determined according to actual conditions, which is not specifically limited by the present disclosure.
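The fusion step can be sketched as standard greedy non-maximum suppression, which is one common realization of the procedure described above (the box format, scores, and threshold below are illustrative assumptions, not the disclosure's exact algorithm):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS over the pooled box predictions: keep the
    highest-scored box, suppress any box overlapping a kept box
    with IoU above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
```

Applied to the pooled predictions from the four second feature maps, this leaves one target human body frame per human body.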
In one possible implementation, the crowd locating method further includes: according to the positioning probability map, carrying out crowd analysis on the crowd image to obtain a first analysis result corresponding to the crowd image; performing crowd analysis on the crowd image according to the detection result of the target human body frame to obtain a second analysis result corresponding to the crowd image; and fusing the first analysis result and the second analysis result to obtain a target analysis result corresponding to the crowd image.
Because the positioning probability map is used for indicating the probability that each pixel point in the crowd image is the target human body point, and the target human body frame detection result comprises the target human body frame corresponding to each human body in the crowd image, the crowd analysis, such as crowd counting and crowd behavior analysis, can be performed on the crowd image according to the positioning probability map and the target human body frame detection result.
In order to improve the accuracy of crowd analysis, the crowd images can be subjected to crowd analysis by using the positioning probability map to obtain first analysis results corresponding to the crowd images, the crowd images are subjected to crowd analysis by using the target human body frame detection results to obtain second analysis results corresponding to the crowd images, and then the first analysis results and the second analysis results are fused to obtain target analysis results with higher accuracy, so that the accuracy of the crowd analysis is effectively improved.
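As one hedged illustration of fusing the two analysis results for the crowd-counting case named above: a count estimated from the localization probability map (sum of per-pixel probabilities) can be combined with a count taken from the number of detected target human body frames. The fusion weights below are illustrative assumptions, not values specified by the disclosure:

```python
import numpy as np

def fuse_counts(prob_map, boxes, w_map=0.5, w_box=0.5):
    """Fuse two crowd-count estimates: one from the localization
    probability map (sum of probabilities) and one from the number
    of detected target human body frames. Weights are illustrative."""
    count_map = float(prob_map.sum())
    count_box = float(len(boxes))
    return w_map * count_map + w_box * count_box
```

Other analyses (e.g. crowd behavior analysis) would fuse their first and second results analogously.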
In one possible implementation, the crowd positioning method is implemented by a crowd positioning neural network, and the training samples of the crowd positioning neural network include: a crowd sample image, a real positioning map corresponding to the crowd sample image, and a real human body frame detection result corresponding to the crowd sample image. The training method of the crowd positioning neural network includes: determining, through the crowd positioning neural network, a predicted positioning probability map corresponding to the crowd sample image and a predicted human body frame detection result corresponding to the crowd sample image, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the crowd sample image is a target human body point, and the predicted human body frame detection result includes a predicted human body frame corresponding to each human body in the crowd sample image; determining a first positioning loss based on the predicted positioning probability map and the real positioning map; determining a second positioning loss based on the predicted human body frame detection result and the real human body frame detection result; and optimizing the crowd positioning neural network based on the first positioning loss and the second positioning loss.
In order to realize crowd positioning quickly, a crowd positioning neural network can be obtained by pre-training based on the crowd positioning method, the crowd positioning neural network comprises a human body key point detection branch and a human body frame detection branch, and then in practical application, the human body key point detection and the human body frame detection are comprehensively considered through the crowd positioning neural network, so that the crowd image can be positioned quickly and effectively.
And pre-constructing a training sample for network training of the crowd positioning neural network. The training sample comprises a crowd sample image, a real positioning image corresponding to the crowd sample image and a real human body frame detection result corresponding to the crowd sample image.
The crowd sample image is an image containing a dense crowd. It may be obtained by an image acquisition device performing image acquisition on a dense crowd within a certain spatial range, may be a key image frame containing a dense crowd obtained from a video, or may be obtained by other methods, which is not specifically limited by the present disclosure.
In one possible implementation manner, the real positioning map includes positions of target human body points corresponding to human bodies in the crowd sample image, where one human body corresponds to one target human body point.
The real positioning image and the crowd sample image have the same size, the pixel value of a pixel point in the real positioning image is 0 or 1, and the position of the pixel point with the pixel value of 1 is used for indicating the position of a target human body point included in the crowd sample image; and the position of the pixel point with the pixel value of 0 is used for indicating other positions except the target human body point included in the crowd sample image.
In one possible implementation, the crowd locating method further includes: determining an annotation result corresponding to the crowd sample image, wherein the annotation result comprises coordinates of the target human body point in the crowd sample image; and determining a real positioning diagram according to the labeling result.
By carrying out coordinate marking on the target human body point on the crowd sample image, the real positioning diagram corresponding to the crowd sample image can be effectively determined according to the marking result, and then the training sample for carrying out network training on the crowd positioning neural network can be effectively constructed according to the crowd sample image and the real positioning diagram.
In an example, the crowd sample image is I' ∈ R^(H×W×3), where H and W are respectively the height and width of the crowd sample image, and the number of channels of the crowd sample image is 3. The target human body points included in the crowd sample image I' are labeled to obtain the labeling result corresponding to the crowd sample image:

A' = { a_i | i = 1, …, n }

wherein a_i is the coordinate, in the crowd sample image I', of the target human body point corresponding to the i-th human body, and n is the number of human bodies included in the crowd sample image.

In an example, according to the labeling result A' corresponding to the crowd sample image I', the real positioning map M' corresponding to the crowd sample image I' can be determined by the following formula (2):

M'(y) = δ(ψ(y))   (2)

wherein y denotes the coordinate of a pixel at the same relative position in the crowd sample image I' and the real positioning map M'; K = [0, 1, 0; 1, 1, 1; 0, 1, 0] is a convolution kernel; ψ(·) is the map obtained by convolving the point annotation map derived from A' with K; and δ(·) is a multivariate delta function, a concrete form of which can be shown in the following formula (3):

δ(x) = 1 if x ≥ 1, and δ(x) = 0 otherwise.   (3)

According to the labeling result corresponding to the crowd sample image, the real positioning map can be determined by adopting formula (2) or by other methods, which is not specifically limited by the present disclosure.
The real human body frame detection result corresponding to the crowd sample image includes a real human body frame corresponding to each human body in the crowd sample image.
In one possible implementation, the crowd locating method further includes: carrying out face detection on the crowd sample image, and determining a predicted face frame corresponding to the crowd sample image; determining a real human body frame corresponding to each human body in the crowd sample image through a linear interpolation algorithm according to the position of a target human body point corresponding to each human body in the crowd sample image and a predicted human face frame corresponding to the crowd sample image; and determining a real human body frame detection result according to the real human body frame corresponding to each human body in the crowd sample image.
In one example, an initial detection network (e.g., RetinaNet, Faster RCNN, SSD, etc.) is pre-trained on a large-scale face detection data set (e.g., WIDER FACE) to obtain a face detection network. Face detection is performed on the crowd sample image by the face detection network using a relatively high detection threshold, obtaining predicted face frames with relatively high accuracy corresponding to the crowd sample image. The predicted face frames detected by the face detection network are determined as real human body frames corresponding to human bodies included in the crowd sample image.
Because the real positioning map includes the position of the target human body point corresponding to each human body in the crowd sample image, whether each human body in the crowd sample image has obtained a corresponding real human body frame can be judged from the real positioning map and the predicted face frames detected by the face detection network. When there exist target human body points without a corresponding real human body frame among the frames detected by the face detection network, the real human body frames corresponding to those human bodies are obtained through a linear interpolation algorithm according to the real positioning map and the already determined real human body frames, so that a real human body frame detection result including the real human body frame corresponding to each human body in the crowd sample image can be obtained.
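The disclosure does not spell out the linear interpolation algorithm. As a purely hypothetical sketch, a frame size for an unmatched target human body point could be interpolated from the sizes of nearby matched frames by inverse-distance weighting (function name, inputs, and weighting scheme are all assumptions):

```python
def interpolate_size(point, known_points, known_sizes):
    """Hypothetical sketch: estimate a human body frame size for an
    unmatched target human body point as an inverse-distance-weighted
    average of the sizes of nearby known frames."""
    weights, total = [], 0.0
    for q in known_points:
        d = ((point[0] - q[0]) ** 2 + (point[1] - q[1]) ** 2) ** 0.5
        w = 1.0 / (d + 1e-6)   # closer known frames weigh more
        weights.append(w)
        total += w
    return sum(w * s for w, s in zip(weights, known_sizes)) / total
```

A frame of the interpolated size would then be centered on the unmatched target human body point.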
After the real positioning map and the real human body frame detection result corresponding to the crowd sample image are determined by the above methods, the training sample for network training of the crowd positioning neural network can be effectively constructed according to the real positioning map and the real human body frame detection result corresponding to the crowd sample image.
After the training sample is determined, the training sample is utilized to carry out network training on the crowd positioning neural network, and firstly, a prediction positioning probability graph which corresponds to the crowd sample image in the training sample and is used for indicating the probability that each pixel point in the crowd sample image is the target human body point is determined through the crowd positioning neural network.
In one possible implementation manner, determining a prediction localization probability map corresponding to a crowd sample image through a crowd localization neural network includes: performing feature extraction on the crowd image to obtain at least one fifth feature map; and detecting key points of the human body according to the at least one fifth characteristic diagram, and determining a predicted positioning probability diagram.
The crowd positioning neural network comprises a feature extraction module, and at least one fifth feature map can be obtained after the crowd sample image is subjected to feature extraction through the feature extraction module.
The network structure of the feature extraction module in the crowd positioning neural network is similar to that of the feature extraction module in the convolutional neural network, and the process of extracting the features of the crowd sample image by the feature extraction module in the crowd positioning neural network is similar to that of extracting the features of the crowd image by the feature extraction module in the convolutional neural network, and is not repeated here.
The human key point detection branch in the crowd positioning neural network comprises a human key point detection module, and the predicted positioning probability map can be obtained after the human key point detection is carried out on at least one fifth feature map through the human key point detection module.
The network structure of the human key point detection module in the crowd positioning neural network is similar to the network structure of the human key point detection module in the convolutional neural network, the process of detecting the human key points by the human key point detection module in the crowd positioning neural network according to the at least one fifth feature map is similar to the process of detecting the human key points by the human key point detection module in the convolutional neural network according to the at least one first feature map, and details are not repeated here.
In one possible implementation, determining a predicted human body frame detection result corresponding to the crowd sample image through the crowd positioning neural network includes: performing weighting processing on each fifth feature map according to the predicted positioning probability map to obtain a sixth feature map corresponding to each fifth feature map; and performing human body frame detection on the sixth feature map corresponding to each fifth feature map, and determining a predicted human body frame detection result.
The process of performing weighting processing on each fifth feature map according to the predicted positioning probability map to obtain a sixth feature map corresponding to each fifth feature map is similar to the process of performing weighting processing on each first feature map according to the positioning probability map to obtain a second feature map corresponding to each first feature map, and details are not repeated here.
The human body frame detection branch in the crowd positioning neural network comprises a human body frame detection module, the crowd positioning neural network performs human body frame detection on the sixth characteristic diagram by using the human body frame detection module, and a predicted human body frame detection result corresponding to the crowd image can be determined.
The network structure of the human body frame detection module in the crowd positioning neural network is similar to that of the human body frame detection module in the convolutional neural network, the process of the human body frame detection module in the crowd positioning neural network for detecting the human body frame of the sixth feature map is similar to that of the human body frame detection module in the convolutional neural network for detecting the human body frame of the second feature map, and details are not repeated here.
Because the predicted positioning probability map is used for indicating the probability that each pixel point in the crowd sample image is a target human body point, and the real positioning map includes the position of the target human body point corresponding to each human body in the crowd sample image, the first positioning loss corresponding to the human body key point detection branch in the crowd positioning neural network can be determined based on the predicted positioning probability map and the real positioning map. In an example, the first positioning loss may be determined by using a cross entropy loss function; the first positioning loss may also be determined by using other loss functions, which is not specifically limited by the present disclosure.
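A minimal sketch of the cross-entropy option for the first positioning loss, treating the real positioning map as per-pixel binary labels (the epsilon clamping is a numerical-safety assumption, not part of the disclosure's formulation):

```python
import numpy as np

def first_positioning_loss(pred_prob, real_map, eps=1e-7):
    """Pixel-wise binary cross entropy between the predicted
    localization probability map and the binary real positioning map."""
    p = np.clip(pred_prob, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(real_map * np.log(p)
                          + (1 - real_map) * np.log(1 - p)))
```

With a perfectly confident, correct prediction the loss approaches 0; an uninformative prediction of 0.5 everywhere gives ln 2 ≈ 0.693.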
Because the predicted human body frame detection result includes the predicted human body frame corresponding to each human body in the crowd sample image, and the real human body frame detection result includes the real human body frame corresponding to each human body in the crowd sample image, the second positioning loss corresponding to the human body frame detection branch in the crowd positioning neural network can be determined based on the predicted human body frame detection result and the real human body frame detection result. In an example, the classification loss and the regression loss may be considered together to determine the second positioning loss; the second positioning loss may also be determined in other manners, which is not specifically limited by the present disclosure.
Based on the first positioning loss corresponding to the human body key point detection branch and the second positioning loss corresponding to the human body frame detection branch, the network parameters of the crowd positioning neural network can be adjusted to optimize the crowd positioning neural network.
In one possible implementation, optimizing a population positioning neural network based on the first positioning loss and the second positioning loss includes: summing the first positioning loss and the second positioning loss to obtain a target positioning loss; and optimizing the crowd positioning neural network based on the target positioning loss.
By summing the first positioning loss and the second positioning loss, the target positioning loss corresponding to the crowd positioning neural network can be obtained, and then the crowd positioning neural network is optimized based on the target positioning loss.
In a possible implementation manner, a first weight corresponding to the first positioning loss and a second weight corresponding to the second positioning loss may also be determined, and then the first positioning loss and the second positioning loss are subjected to weighted summation based on the first weight and the second weight to obtain a target positioning loss. Specific values of the first weight and the second weight may be determined according to actual conditions, and this disclosure does not specifically limit this.
After the target positioning loss is determined, network parameters corresponding to the crowd positioning neural network can be adjusted according to the target positioning loss so as to optimize the crowd positioning network, iterative training is carried out by adopting the network training method until the iterative training meets preset training conditions, and the trained crowd positioning neural network is finally obtained.
In one possible implementation, network parameter adjustment is performed using a gradient-descent method based on target location loss.
For example, the network parameter at the i-th iterative training is θ_i, and the target positioning loss determined after network training with the network parameter θ_i is L. The network parameter θ_{i+1} at the (i+1)-th iterative training can be determined by the following formula (4):

θ_{i+1} = θ_i − γ∇L   (4)

wherein ∇ denotes the gradient operator, and γ is the network learning rate. The specific value of the network learning rate γ may be determined according to actual situations, for example γ = 0.0001, which is not limited by this disclosure.
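The gradient-descent update above reduces to a one-line step; a sketch using the example learning rate γ = 0.0001 named in the text (the parameter and gradient values are illustrative):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.0001):
    """One gradient-descent update: theta_next = theta - gamma * grad,
    with gamma = 0.0001 as the example learning rate in the text."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])       # illustrative network parameters
grad = np.array([10.0, -10.0])      # illustrative gradient of the loss
theta_next = sgd_step(theta, grad)
```

In practice each parameter tensor of the crowd positioning neural network receives this update with the gradient of the target positioning loss.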
In one possible implementation, the preset training condition may be network convergence. For example, the network training method is adopted to perform iterative training until the network parameters are not changed any more, the network is considered to be converged, and the trained crowd positioning neural network is determined.
In one possible implementation, the preset training condition may be an iteration threshold. For example, the network training method is adopted to perform iterative training until the number of iterations reaches an iteration threshold, and the trained population positioning neural network is determined.
In one possible implementation, the preset training condition may be a positioning threshold. For example, the network training method is adopted to perform iterative training until the positioning accuracy corresponding to the network is greater than the positioning threshold value, and the trained crowd positioning neural network is determined.
The preset training condition may be, in addition to the network convergence, the iteration threshold, or the positioning threshold, other training conditions may be set according to practical situations, which is not specifically limited by the present disclosure.
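Putting the stopping conditions together, a toy training loop illustrating the network-convergence and iteration-threshold conditions (the gradient function, learning rate, and tolerance are placeholders, not the crowd positioning losses):

```python
import numpy as np

def train(theta, grad_fn, lr=0.1, max_iters=1000, tol=1e-8):
    """Iterative training with two of the preset training conditions
    described above: convergence (the parameter update becomes
    negligibly small) or an iteration threshold (max_iters)."""
    for i in range(max_iters):
        step = lr * grad_fn(theta)
        theta = theta - step
        if np.abs(step).max() < tol:   # parameters effectively unchanged
            return theta, i + 1
    return theta, max_iters
```

A positioning-accuracy threshold would replace the convergence check with an evaluation of the current network on a validation set.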
Fig. 2 shows a schematic diagram of a population-localizing neural network, in accordance with an embodiment of the present disclosure. As shown in fig. 2, the crowd positioning neural network 20 includes a feature extraction module 21, a human key point detection module 22 and a human frame detection module 23.
As shown in fig. 2, the crowd image required for crowd positioning is input to the crowd positioning neural network 20. The feature extraction module 21 performs feature extraction on the crowd images to obtain four layers of feature pyramids, that is, four first feature maps (the number of the first feature maps may be determined according to actual conditions, which is not specifically limited by the present disclosure); the human body key point detection module 22 performs human body key point detection according to the four first feature maps, and determines a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point; the crowd positioning neural network 20 performs weighting processing on the four first feature maps according to the positioning probability map to obtain four second feature maps; the human body frame detection module 23 performs human body frame detection according to the four second feature maps, and determines a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result includes target human body frames corresponding to all human bodies in the crowd image. The specific processing procedures of the feature extraction module 21, the human key point detection module 22 and the human frame detection module 23, and the weighting processing procedures of the four first feature maps are similar to the related descriptions above, and are not described here again.
In the embodiment of the disclosure, the crowd positioning neural network is utilized, and in the crowd positioning process, the human body key point detection and the human body frame detection are comprehensively considered, so that the target human body frame detection result with higher accuracy corresponding to the crowd image is obtained, and the crowd positioning accuracy is effectively improved.
It can be understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principle and logic; due to space limitations, details are not repeated in this disclosure. Those skilled in the art can understand that, in the above methods of the specific embodiments, the specific execution order of the steps should be determined by their functions and possible internal logic.
In addition, the present disclosure also provides a crowd positioning apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any crowd positioning method provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
FIG. 3 illustrates a block diagram of a crowd locating device according to an embodiment of the disclosure. As shown in fig. 3, the apparatus 30 includes:
the feature extraction module 31 is configured to perform feature extraction on the crowd image to obtain at least one first feature map;
a human body key point detection module 32, configured to perform human body key point detection on at least one first feature map, and determine a positioning probability map corresponding to the crowd image, where the positioning probability map is used to indicate a probability that each pixel point in the crowd image is a target human body point;
the feature enhancing module 33 is configured to perform weighting processing on the at least one first feature map according to the positioning probability map to obtain at least one second feature map;
and a human body frame detection module 34, configured to perform human body frame detection on the at least one second feature map, and determine a target human body frame detection result corresponding to the crowd image, where the target human body frame detection result includes a target human body frame corresponding to each human body in the crowd image.
In a possible implementation manner, the human body key point detecting module 32 is specifically configured to:
performing convolution processing on the target first feature map to obtain a third feature map, wherein the target first feature map is one of the at least one first feature map;
performing transposition convolution processing on the third feature map to obtain a fourth feature map, wherein the fourth feature map and the crowd image have the same size;
and performing convolution processing on the fourth feature map to obtain a positioning probability map.
In one possible implementation, the feature enhancing module 33 is specifically configured to:
and according to the positioning probability map, performing pixel-level weighting processing on each first feature map to obtain a second feature map corresponding to each first feature map.
In one possible implementation, the human frame detection module 34 includes:
the human body frame detection submodule is used for carrying out human body frame detection on the at least one second characteristic diagram to obtain at least one human body frame prediction result;
and the fusion submodule is used for fusing at least one human body frame prediction result to obtain a target human body frame detection result.
In a possible implementation, the fusion submodule is specifically configured to:
and carrying out non-maximum suppression processing on at least one human body frame prediction result to obtain a target human body frame detection result.
In one possible implementation, the apparatus 30 further includes:
the first crowd analysis module is used for carrying out crowd analysis on the crowd image according to the positioning probability map to obtain a first analysis result corresponding to the crowd image;
the second crowd analysis module is used for carrying out crowd analysis on the crowd image according to the detection result of the target human body frame to obtain a second analysis result corresponding to the crowd image;
and the fusion module is used for fusing the first analysis result and the second analysis result to obtain a target analysis result corresponding to the crowd image.
In one possible implementation, the apparatus 30 implements the crowd positioning method through a crowd positioning neural network, and the training samples of the crowd positioning neural network include: a crowd sample image, a real positioning map corresponding to the crowd sample image, and a real human body frame detection result corresponding to the crowd sample image;
the apparatus 30 further comprises, a network training module comprising:
the first determining submodule is configured to determine, through the crowd positioning neural network, a predicted positioning probability map corresponding to the crowd sample image and a predicted human body frame detection result corresponding to the crowd sample image, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the crowd sample image is a target human body point, and the predicted human body frame detection result comprises a predicted human body frame corresponding to each human body in the crowd sample image;
the second determining submodule is configured to determine a first positioning loss based on the predicted positioning probability map and the real positioning map;
the third determining submodule is configured to determine a second positioning loss based on the predicted human body frame detection result and the real human body frame detection result;
the optimization submodule is configured to optimize the crowd positioning neural network based on the first positioning loss and the second positioning loss.
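As a concrete sketch of the two losses and their combination, the snippet below assumes a per-pixel binary cross-entropy for the first positioning loss and a mean L1 distance over matched boxes for the second. The disclosure only states that each loss is computed from the corresponding predicted and real quantities, so these specific loss forms are assumptions.

```python
import numpy as np

def first_positioning_loss(pred_prob_map, real_map):
    """Per-pixel binary cross-entropy between the predicted positioning
    probability map and the real positioning map (an assumed loss form)."""
    eps = 1e-7
    p = np.clip(pred_prob_map, eps, 1.0 - eps)
    return float(np.mean(-(real_map * np.log(p) + (1 - real_map) * np.log(1 - p))))

def second_positioning_loss(pred_boxes, real_boxes):
    """Mean L1 distance between matched predicted and real human body
    frames (again an assumed concrete form)."""
    return float(np.mean(np.abs(np.asarray(pred_boxes) - np.asarray(real_boxes))))

def target_positioning_loss(loss1, loss2):
    """The optimization submodule sums the two positioning losses."""
    return loss1 + loss2
```

The summed target loss would then drive a standard gradient-based optimizer over the crowd positioning neural network's parameters.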
In one possible implementation, the first determining sub-module includes:
the feature extraction unit is configured to perform feature extraction on the crowd sample image to obtain at least one fifth feature map;
the human body key point detection unit is configured to perform human body key point detection according to the at least one fifth feature map and determine the predicted positioning probability map.
In a possible implementation manner, the first determining sub-module further includes:
the feature strengthening unit is configured to perform weighting processing on each fifth feature map according to the predicted positioning probability map to obtain a sixth feature map corresponding to each fifth feature map;
the human body frame detection unit is configured to perform human body frame detection on the sixth feature map corresponding to each fifth feature map and determine the predicted human body frame detection result.
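The feature strengthening step can be sketched with NumPy as a pixel-level weighting: each spatial position of a feature map is scaled by the corresponding positioning probability. The channel-first (C, H, W) layout and the premise that the probability map has already been resized to the feature map's spatial size are assumptions.

```python
import numpy as np

def weight_feature_map(feature_map, prob_map):
    """Pixel-level weighting: scale every spatial location of a (C, H, W)
    feature map by the positioning probability at that location in (H, W).
    Broadcasting the probability map across the channel axis applies the
    same weight to all channels of a pixel."""
    assert feature_map.shape[1:] == prob_map.shape
    return feature_map * prob_map[np.newaxis, :, :]
```

Locations likely to contain a target human body point thus keep their features, while background responses are attenuated before human body frame detection.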
In a possible implementation, the optimization submodule is specifically configured to:
summing the first positioning loss and the second positioning loss to obtain a target positioning loss;
optimizing the crowd positioning neural network based on the target positioning loss.
In a possible implementation manner, the real positioning map includes positions of target human body points corresponding to each human body in the crowd sample image, wherein one human body corresponds to one target human body point;
the apparatus 30 further comprises:
the face detection module is configured to perform face detection on the crowd sample image and determine a predicted face frame corresponding to the crowd sample image;
the first determining module is configured to determine, through a linear interpolation algorithm, a real human body frame corresponding to each human body in the crowd sample image according to the position of the target human body point corresponding to each human body and the predicted face frame corresponding to the crowd sample image;
the second determining module is configured to determine the real human body frame detection result according to the real human body frame corresponding to each human body in the crowd sample image.
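One plausible reading of the linear-interpolation step is sketched below: detected face heights are interpolated across image rows (apparent size typically grows with the row under perspective), yielding an estimated face size at every annotated human body point, from which a pseudo ground-truth body frame is extrapolated. The row-wise size model and the body-to-face size ratios are assumptions for illustration, not the disclosed algorithm.

```python
import numpy as np

def estimate_body_frames(points, face_boxes):
    """Derive pseudo ground-truth body frames from annotated head points
    plus detected face frames. Face height is linearly interpolated as a
    function of the face-box center row; the body frame is then a fixed
    multiple of the interpolated face size, hung below the head point.
    The 2x width and 7x height ratios are assumed values."""
    rows = np.array([(b[1] + b[3]) / 2.0 for b in face_boxes])
    heights = np.array([b[3] - b[1] for b in face_boxes])
    order = np.argsort(rows)  # np.interp requires increasing sample points
    frames = []
    for (x, y) in points:
        h_face = np.interp(y, rows[order], heights[order])
        w, h = 2.0 * h_face, 7.0 * h_face
        frames.append((x - w / 2, y, x + w / 2, y + h))
    return frames
```

A frame produced this way for every annotated point gives a complete real human body frame detection result even when face detection misses some people.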
In some embodiments, the functions of, or the modules included in, the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments; for its specific implementation, reference may be made to the description of the above method embodiments, which is not repeated here for brevity.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product, including computer readable code or a non-transitory computer readable storage medium carrying computer readable code; when the computer readable code runs in a processor of an electronic device, the processor executes the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 4 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 4, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to fig. 4, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 5, the electronic device 1900 may be provided as a server. Referring to fig. 5, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Microsoft Windows Server™, Apple Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and this electronic circuitry may execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (12)

1. A crowd positioning method, comprising:
performing feature extraction on the crowd image to obtain at least one first feature map;
performing human body key point detection on the at least one first feature map, and determining a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point;
performing weighting processing on the at least one first feature map according to the positioning probability map to obtain at least one second feature map;
performing human body frame detection on the at least one second feature map, and determining a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result comprises target human body frames corresponding to human bodies in the crowd image.
2. The method according to claim 1, wherein the performing human body key point detection on the at least one first feature map and determining a positioning probability map corresponding to the crowd image comprises:
performing convolution processing on the target first feature map to obtain a third feature map, wherein the target first feature map is one of the at least one first feature map;
performing transpose convolution processing on the third feature map to obtain a fourth feature map, wherein the fourth feature map and the crowd image have the same size;
and performing convolution processing on the fourth feature map to obtain the positioning probability map.
3. The method according to claim 1 or 2, wherein the weighting the at least one first feature map according to the positioning probability map to obtain at least one second feature map comprises:
performing pixel-level weighting processing on each first feature map according to the positioning probability map to obtain the second feature map corresponding to each first feature map.
4. The method according to any one of claims 1 to 3, wherein the performing human body frame detection on the at least one second feature map and determining a target human body frame detection result corresponding to the crowd image comprises:
performing human body frame detection on the at least one second feature map to obtain at least one human body frame prediction result;
fusing the at least one human body frame prediction result to obtain the target human body frame detection result.
5. The method according to claim 4, wherein the fusing the at least one human body frame prediction result to obtain the target human body frame detection result comprises:
performing non-maximum suppression processing on the at least one human body frame prediction result to obtain the target human body frame detection result.
6. The method according to any one of claims 1 to 5, further comprising:
performing crowd analysis on the crowd image according to the positioning probability map to obtain a first analysis result corresponding to the crowd image;
performing crowd analysis on the crowd image according to the target human body frame detection result to obtain a second analysis result corresponding to the crowd image;
fusing the first analysis result and the second analysis result to obtain a target analysis result corresponding to the crowd image.
7. The method according to any one of claims 1 to 6, wherein the crowd positioning method is implemented by a crowd positioning neural network, and the training samples of the crowd positioning neural network comprise: a crowd sample image, a real positioning map corresponding to the crowd sample image, and a real human body frame detection result corresponding to the crowd sample image;
the training method of the crowd positioning neural network comprises:
determining a predicted positioning probability map corresponding to the crowd sample image and a predicted human body frame detection result corresponding to the crowd sample image through the crowd positioning neural network, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the crowd sample image is a target human body point, and the predicted human body frame detection result comprises a predicted human body frame corresponding to each human body in the crowd sample image;
determining a first positioning loss based on the predicted positioning probability map and the real positioning map;
determining a second positioning loss based on the predicted human body frame detection result and the real human body frame detection result;
optimizing the crowd positioning neural network based on the first positioning loss and the second positioning loss.
8. The method of claim 7, wherein the optimizing the crowd positioning neural network based on the first positioning loss and the second positioning loss comprises:
summing the first positioning loss and the second positioning loss to obtain a target positioning loss;
optimizing the crowd positioning neural network based on the target positioning loss.
9. The method according to claim 7 or 8, wherein the real positioning map comprises positions of the target human body points corresponding to the human bodies in the crowd sample image, wherein one human body corresponds to one target human body point;
the method further comprises:
performing face detection on the crowd sample image, and determining a predicted face frame corresponding to the crowd sample image;
determining, through a linear interpolation algorithm, a real human body frame corresponding to each human body in the crowd sample image according to the position of the target human body point corresponding to each human body in the crowd sample image and the predicted face frame corresponding to the crowd sample image;
determining the real human body frame detection result according to the real human body frame corresponding to each human body in the crowd sample image.
10. A crowd positioning device, comprising:
the feature extraction module is configured to perform feature extraction on a crowd image to obtain at least one first feature map;
the human body key point detection module is configured to perform human body key point detection on the at least one first feature map and determine a positioning probability map corresponding to the crowd image, wherein the positioning probability map is used for indicating the probability that each pixel point in the crowd image is a target human body point;
the feature strengthening module is configured to perform weighting processing on the at least one first feature map according to the positioning probability map to obtain at least one second feature map;
the human body frame detection module is configured to perform human body frame detection on the at least one second feature map and determine a target human body frame detection result corresponding to the crowd image, wherein the target human body frame detection result comprises target human body frames corresponding to human bodies in the crowd image.
11. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any of claims 1 to 9.
12. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 9.
CN202110777233.2A 2021-07-09 2021-07-09 Crowd positioning method and device, electronic equipment and storage medium Pending CN113435390A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110777233.2A CN113435390A (en) 2021-07-09 2021-07-09 Crowd positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110777233.2A CN113435390A (en) 2021-07-09 2021-07-09 Crowd positioning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113435390A true CN113435390A (en) 2021-09-24

Family

ID=77759985

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110777233.2A Pending CN113435390A (en) 2021-07-09 2021-07-09 Crowd positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113435390A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination