CN113297983A - Crowd positioning method and device, electronic equipment and storage medium

Info

Publication number: CN113297983A
Application number: CN202110586959.8A
Authority: CN (China)
Prior art keywords: crowd, target, image, positioning, map
Legal status: Pending (assumed status; not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 杨昆霖 (Yang Kunlin), 李昊鹏 (Li Haopeng), 侯军 (Hou Jun), 伊帅 (Yi Shuai)
Current and original assignee: Shanghai Sensetime Intelligent Technology Co., Ltd.
Application filed by Shanghai Sensetime Intelligent Technology Co., Ltd.


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53: Recognition of crowd images, e.g. recognition of crowd congestion
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a crowd positioning method and apparatus, an electronic device, and a storage medium. The method includes: performing feature extraction on at least two frames of crowd images acquired from a crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include one frame of target crowd image and at least one frame of non-target crowd image; determining a first optical flow map between the target crowd image and the at least one frame of non-target crowd image; fusing the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image. The embodiments of the present disclosure can effectively improve the accuracy of crowd positioning.

Description

Crowd positioning method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a crowd positioning method and apparatus, an electronic device, and a storage medium.
Background
With population growth and accelerating urbanization, large-scale crowd gatherings are becoming more frequent and larger in scale. Crowd analysis is of great significance for public safety and city planning. Common crowd analysis tasks include crowd counting, crowd behavior analysis, crowd positioning, and the like, among which crowd positioning is the basis of the other crowd analysis tasks. Crowd positioning means estimating the positions of human bodies included in an image or a video through a computer vision algorithm and determining the coordinates of those human bodies, so as to provide a data basis for subsequent crowd analysis tasks such as crowd counting and crowd behavior analysis. The accuracy of crowd positioning directly affects the precision of crowd counting and the results of crowd behavior analysis. Therefore, a crowd positioning method with high accuracy is needed.
Disclosure of Invention
The present disclosure provides a technical solution of a crowd positioning method and apparatus, an electronic device, and a storage medium.
According to an aspect of the present disclosure, there is provided a crowd positioning method, including: performing feature extraction on at least two frames of crowd images acquired from a crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include one frame of target crowd image and at least one frame of non-target crowd image; determining a first optical flow map between the target crowd image and the at least one frame of non-target crowd image; fusing the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image.
In a possible implementation manner, the fusing the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image includes: performing, according to the first optical flow map, spatial transformation on the first feature map corresponding to the at least one frame of non-target crowd image to obtain at least one third feature map corresponding to the at least one frame of non-target crowd image; and fusing the target first feature map and the at least one third feature map to obtain the second feature map, where the target first feature map is the first feature map corresponding to the target crowd image.
In a possible implementation manner, the fusing the target first feature map and the at least one third feature map to obtain the second feature map includes: fusing the target first feature map and the at least one third feature map along a channel dimension to obtain a fourth feature map, wherein the channel number of the fourth feature map is the sum of the channel numbers of the target first feature map and the at least one third feature map; and performing dimension reduction processing on the fourth feature map along a channel dimension to obtain the second feature map, wherein the second feature map and the target first feature map have the same size and channel number.
In a possible implementation manner, the performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image includes: carrying out crowd positioning according to the second feature map to obtain a first positioning probability map, wherein the first positioning probability map is used for indicating the probability that each pixel point in the target crowd image is a human body; and according to a probability threshold value, carrying out image processing on the first positioning probability map to obtain the target positioning map.
In a possible implementation manner, the performing image processing on the first positioning probability map according to a probability threshold to obtain the target positioning map includes: carrying out average pooling operation on the first positioning probability map to obtain a mean pooling map; performing maximum pooling operation on the mean pooling image to obtain a maximum pooling image; obtaining a second positioning probability map according to the mean pooling map and the maximum pooling map; and performing threshold segmentation on the second positioning probability map according to the probability threshold to obtain the target positioning map.
In one possible implementation manner, the crowd positioning method is implemented by a crowd positioning neural network, and a training sample of the crowd positioning neural network includes a crowd sample video clip and a real positioning map corresponding to a target crowd sample image in the crowd sample video clip; the training method of the crowd positioning neural network includes the following steps: determining, through the crowd positioning neural network, a predicted positioning probability map corresponding to the target crowd sample image, where the predicted positioning probability map is used for indicating the probability that each pixel point in the target crowd sample image is a human body; determining a positioning loss based on the predicted positioning probability map and the real positioning map; and optimizing the crowd positioning neural network based on the positioning loss.
In a possible implementation manner, the determining, by the crowd positioning neural network, a predicted positioning probability map corresponding to the target crowd sample image includes: performing feature extraction on at least two frames of crowd sample images acquired from the crowd sample video clip to obtain at least two fifth feature maps, where the at least two frames of crowd sample images include the target crowd sample image and at least one frame of non-target crowd sample image; determining a second optical flow map between the target crowd sample image and the at least one frame of non-target crowd sample image; fusing the at least two fifth feature maps according to the second optical flow map to obtain a sixth feature map corresponding to the target crowd sample image; and performing crowd positioning according to the sixth feature map to obtain the predicted positioning probability map.
In one possible implementation, the determining a location loss based on the predicted location probability map and the real location map includes: and determining the positioning loss by utilizing a cross entropy loss function according to the predicted positioning probability map, the real positioning map and a positive sample weight, wherein the positive sample weight is the weight corresponding to a pixel point used for indicating the position of the human body in the real positioning map.
According to an aspect of the present disclosure, there is provided a crowd positioning apparatus, including: a feature extraction module, configured to perform feature extraction on at least two frames of crowd images acquired from a crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include one frame of target crowd image and at least one frame of non-target crowd image; an optical flow determination module, configured to determine a first optical flow map between the target crowd image and the at least one frame of non-target crowd image; a feature fusion module, configured to fuse the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and a crowd positioning module, configured to perform crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the present disclosure, feature extraction is performed on at least two frames of crowd images acquired from a crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include one frame of target crowd image and at least one frame of non-target crowd image; a first optical flow map between the target crowd image and the at least one frame of non-target crowd image is determined; the at least two first feature maps are fused according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and crowd positioning is performed according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image. In the crowd positioning process, the first optical flow map between the target crowd image and the at least one frame of non-target crowd image acquired from the crowd video clip is used to fuse the at least two first feature maps corresponding to the target crowd image and the at least one frame of non-target crowd image. The fused second feature map can reflect both the inter-frame timing information between the target crowd image and the other non-target crowd images and the intra-frame spatial information of the target crowd image, so that after crowd positioning is performed using the second feature map, a target positioning map with higher accuracy corresponding to the target crowd image can be obtained, effectively improving the accuracy of crowd positioning.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flow diagram of a method of crowd location according to an embodiment of the disclosure;
FIG. 2 illustrates a schematic diagram of a population-localizing neural network, in accordance with an embodiment of the present disclosure;
FIG. 3 illustrates a block diagram of a crowd locating device according to an embodiment of the disclosure;
FIG. 4 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure;
FIG. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flow chart of a crowd positioning method according to an embodiment of the disclosure. The crowd positioning method may be executed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and the like, and the crowd positioning method may be implemented by a processor calling a computer readable instruction stored in a memory. Alternatively, the crowd location method may be performed by a server. As shown in fig. 1, the crowd locating method may include:
in step S11, feature extraction is performed on at least two frames of crowd images obtained from the crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include a frame of target crowd image and at least one frame of non-target crowd image.
The crowd video clip herein is a video clip including a plurality of frames of crowd images, and may be obtained by video-capturing dense crowd within a certain spatial range (for example, places with large traffic, such as squares, shopping malls, subway stations, tourist attractions, and the like) by an image capturing device, or may be obtained by other methods, which is not specifically limited by the present disclosure.
In an example, an image acquisition device performs video acquisition on a dense crowd to obtain an original crowd video. When the frame rate of the original crowd video is high, the difference between adjacent frames of crowd images in the original crowd video is small. In order to better utilize the spatiotemporal relationship between different frames of crowd images, frame rate downsampling may be performed on the high-frame-rate original crowd video to obtain a crowd video clip whose frame rate is less than a threshold, so that the difference between adjacent frames of crowd images in the crowd video clip is larger; the spatiotemporal relationship between different frames of crowd images in the crowd video clip can then be better utilized, achieving higher-precision crowd positioning. For example, the frame rate of the crowd video clip is 5 frames per second.
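For illustration only, a minimal frame-rate downsampling sketch using OpenCV is given below; the source file name and its assumed 25 fps frame rate are placeholders, not taken from the disclosure.

```python
# A minimal sketch of frame rate downsampling; assumes a 25 fps source
# reduced to the 5 fps of the example above by keeping every 5th frame.
import cv2

cap = cv2.VideoCapture("original_crowd_video.mp4")  # placeholder path
keep_every = 5  # 25 fps -> 5 fps
clip_frames = []
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if index % keep_every == 0:
        clip_frames.append(frame)
    index += 1
cap.release()
```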
In an example, after the original crowd video is frame-rate downsampled to obtain the crowd video clip, each frame of crowd image in the crowd video clip may include background regions in addition to the dense crowd. In order to better perform crowd positioning, each frame of crowd image may be cropped so that the dense crowd portion of each frame is retained. Each cropped frame of crowd image needs to have the same scale, to ensure that subsequent image processing operations such as feature extraction and feature fusion can be applied to every frame of crowd image.
The crowd video clip includes a plurality of frames of crowd images, and at least two frames of crowd images are obtained from the crowd video clip for feature extraction, wherein the number of frames of the crowd images obtained from the crowd video clip for feature extraction may be determined according to an actual situation, for example, 2 frames, 3 frames, 5 frames, 7 frames, and the like, which is not specifically limited by the present disclosure.
The feature extraction is performed on the at least two frames of crowd images, specifically through a feature extraction module in a convolutional neural network, to obtain the at least two first feature maps. The feature extraction process will be described in detail later in combination with possible implementations of the present disclosure, and is not detailed here.
In step S12, a first optical flow map between the target crowd image and the at least one frame of non-target crowd image is determined.
The target crowd image is one of the at least two frames of crowd images obtained from the crowd video clip. For example, it may be the first frame of the at least two frames of crowd images; it may be the last frame; in the case that an odd number (greater than or equal to 3) of frames of crowd images are obtained from the crowd video clip, it may be the middle frame of those frames; it may also be any one of the at least two frames of crowd images, which is not limited in the present disclosure.
The non-target crowd image may be a frame crowd image other than the target crowd image, of at least two frame crowd images obtained from the crowd video clip.
In the related art, when crowd positioning is performed by using crowd video clips, at least two first feature maps obtained by extracting features of at least two frames of crowd images acquired from the crowd video clips are subjected to simple fusion of channel dimensions, and time sequence information between different frames of crowd images in the crowd video clips cannot be fully utilized.
Since optical flow can reflect the changes of pixels in an image sequence over the time domain, in the embodiments of the present disclosure the first optical flow map between the target crowd image and the other non-target crowd images is determined according to the at least two frames of crowd images acquired from the crowd video clip, so that spatiotemporal information between the target crowd image and the other non-target crowd images can be obtained. The process of determining the first optical flow map will be described in detail later in combination with possible implementations of the present disclosure, and is not detailed here.
In step S13, the at least two first feature maps are fused according to the first optical flow map to obtain a second feature map corresponding to the target crowd image.
Because the first optical flow map can reflect the spatiotemporal information between the target crowd image and the other non-target crowd images, fusing the at least two first feature maps according to the first optical flow map allows the feature fusion process to exploit both the intra-frame spatial information of the target crowd image and the inter-frame timing information between different frames of crowd images, yielding a second feature map, corresponding to the target crowd image, with more robust semantic features. The fusion of the at least two first feature maps according to the first optical flow map will be described in detail later in combination with possible implementations of the present disclosure, and is not detailed here.
In step S14, crowd positioning is performed according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image.
The second feature map obtained after feature fusion can reflect the spatiotemporal relationship between the target crowd image and the other non-target crowd images, so that after crowd positioning is performed using the second feature map, a target positioning map with higher accuracy corresponding to the target crowd image can be obtained. The crowd positioning process will be described in detail later in combination with possible implementations of the present disclosure, and is not detailed here.
In the embodiment of the disclosure, feature extraction is performed on at least two frames of crowd images acquired from a crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include one frame of target crowd image and at least one frame of non-target crowd image; a first optical flow map between the target crowd image and the at least one frame of non-target crowd image is determined; the at least two first feature maps are fused according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and crowd positioning is performed according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used for indicating the positions of human bodies included in the target crowd image.
Compared with performing crowd positioning based on a single crowd image as in the related art, performing crowd positioning of the target crowd image using a crowd video clip can mine the timing information in the crowd video clip, so the precision of crowd positioning can be improved.
In addition, compared with simply fusing the at least two first feature maps at the channel level as in the related art, the crowd positioning process uses the optical flow maps between the target crowd image and the other non-target crowd images to capture the association of pixels in the crowd video clip in both timing and spatial information when fusing the at least two first feature maps, so that more sufficient and more robust feature information can be mined. The fused second feature map can reflect the spatiotemporal relationship between the target crowd image and the other non-target crowd images, so that after crowd positioning is performed using the second feature map, a target positioning map with higher accuracy corresponding to the target crowd image can be obtained, effectively improving the accuracy of crowd positioning.
In a possible implementation manner, a feature extraction module in a convolutional neural network may be utilized to perform feature extraction on the at least two frames of crowd images acquired from the crowd video clip to obtain the at least two first feature maps.
In an example, a feature extraction module in a convolutional neural network may be comprised of convolutional layers in a deep convolutional neural network (e.g., the first 13 convolutional layers of a VGG-16 network) trained on a common image dataset (e.g., ImageNet) in the computer vision domain.
For example, three frames of crowd images $I_1$, $I_2$, and $I_3$ are obtained from a crowd video clip, where $I_1, I_2, I_3 \in \mathbb{R}^{H \times W \times 3}$, $H$ and $W$ are respectively the height and width of the crowd images, and the number of channels of each crowd image is 3 (e.g., the three RGB channels). The three frames $I_1$, $I_2$, and $I_3$ are input into the feature extraction module of the convolutional neural network for feature extraction, yielding three first feature maps: the first feature map $F_1$ corresponding to $I_1$, the first feature map $F_2$ corresponding to $I_2$, and the first feature map $F_3$ corresponding to $I_3$. The size of each first feature map is reduced to 1/8 that of the corresponding frame of crowd image, and the number of channels is 512, i.e., $F_1, F_2, F_3 \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 512}$.
The structure of the feature extraction module may be configured as other network structures according to actual situations, besides the structure of the first 13 convolutional layers of the VGG-16 network, which is not specifically limited in this disclosure. For feature extraction modules of different network structures, the size and the number of channels of the first feature map obtained after feature extraction is performed on the crowd video clip may be different, which is not specifically limited by the present disclosure.
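As a concrete illustration of such a feature extraction module, the sketch below builds the 13 convolutional layers of a pretrained VGG-16 in PyTorch. Dropping the last two max-pools so the output stride is 8 with 512 channels, matching the example above, is an assumption about the exact layer selection.

```python
# A minimal sketch of the feature extraction module, assuming PyTorch and
# torchvision; the pooling-layer selection is an assumption chosen to
# reproduce the 1/8-size, 512-channel first feature maps described above.
import torch
import torchvision

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
layers = list(vgg.features.children())[:30]  # all 13 conv layers + 4 max-pools
# Keep only the first three max-pools so the output stride is 8, not 16.
layers = [m for i, m in enumerate(layers)
          if not (isinstance(m, torch.nn.MaxPool2d) and i > 16)]
feature_extractor = torch.nn.Sequential(*layers)

frames = torch.randn(3, 3, 512, 512)            # I1, I2, I3 as a batch
first_feature_maps = feature_extractor(frames)  # shape (3, 512, 64, 64)
```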
In one possible implementation, the first optical flow map between the target crowd image and a non-target crowd image may be determined using an optical flow calculation module in the convolutional neural network.
In one example, the optical flow computation module may be a pre-trained PWC-Net module for computing optical flow. The optical flow calculation module may also be other modules with optical flow calculation function that are trained in advance, and the disclosure is not limited thereto. The training process of the optical flow calculation module may be trained according to a network training mode of the related art, which is not specifically limited by the present disclosure.
Still taking the three frames of crowd images $I_1$, $I_2$, and $I_3$ obtained from the crowd video clip as an example, assume that the intermediate frame $I_2$ is the target crowd image and that $I_1$ and $I_3$ are non-target crowd images. The three frames $I_1$, $I_2$, and $I_3$ are input into the optical flow calculation module to determine the first optical flow map $M_f$ (forward optical flow map) from the target crowd image $I_2$ to the non-target crowd image $I_1$, and the first optical flow map $M_b$ (backward optical flow map) from the target crowd image $I_2$ to the non-target crowd image $I_3$. The sizes of $M_f$ and $M_b$ are reduced to 1/8 that of the original crowd images, i.e., $M_f, M_b \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 2}$.
The first optical flow map may also be determined by other optical flow calculation methods besides the optical flow calculation module, which is not specifically limited in this disclosure.
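For illustration, the sketch below computes the forward and backward first optical flow maps with a pretrained flow network. Torchvision's RAFT is used as a stand-in because PWC-Net is not bundled with torchvision, so this is not the module the disclosure names.

```python
# A hedged sketch of the optical flow calculation step; RAFT stands in
# for PWC-Net. Inputs are expected normalized to [-1, 1].
import torch
import torch.nn.functional as F
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

flow_net = raft_large(weights=Raft_Large_Weights.DEFAULT).eval()

I1 = torch.rand(1, 3, 512, 512) * 2 - 1  # non-target frame
I2 = torch.rand(1, 3, 512, 512) * 2 - 1  # target frame
I3 = torch.rand(1, 3, 512, 512) * 2 - 1  # non-target frame

with torch.no_grad():
    # RAFT returns a list of iteratively refined flow fields; take the last.
    M_f = flow_net(I2, I1)[-1]  # forward flow map, shape (1, 2, 512, 512)
    M_b = flow_net(I2, I3)[-1]  # backward flow map

# The example above uses 1/8-resolution flow maps; pixel displacements
# scale with the resolution, hence the extra factor of 1/8.
M_f = F.interpolate(M_f, scale_factor=1 / 8, mode="bilinear",
                    align_corners=False) * (1 / 8)
M_b = F.interpolate(M_b, scale_factor=1 / 8, mode="bilinear",
                    align_corners=False) * (1 / 8)
```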
In a possible implementation manner, fusing the at least two first feature maps according to the first optical flow map to obtain the second feature map corresponding to the target crowd image includes: performing, according to the first optical flow map, spatial transformation on the first feature map corresponding to the at least one frame of non-target crowd image to obtain at least one third feature map corresponding to the at least one frame of non-target crowd image; and fusing the target first feature map and the at least one third feature map to obtain the second feature map, where the target first feature map is the first feature map corresponding to the target crowd image.
According to the first optical flow map between the target crowd image and a non-target crowd image, spatial transformation is performed on the first feature map corresponding to the non-target crowd image, so that it is feature-aligned with the target first feature map corresponding to the target crowd image, yielding the aligned third feature map. In this way, the target first feature map and the third feature map can be fused more effectively to obtain the second feature map corresponding to the target crowd image.
In an example, an optical flow feature fusion module in the convolutional neural network may be used to fuse the at least two first feature maps according to the first optical flow map, obtaining the second feature map corresponding to the target crowd image.
Still taking the first feature maps $F_1$, $F_2$, $F_3$ and the first optical flow maps $M_f$, $M_b$ as an example, $F_1$, $F_2$, $F_3$ and $M_f$, $M_b$ are input into the optical flow feature fusion module. Since the first optical flow map $M_f$ is the forward optical flow map from the target crowd image $I_2$ to the non-target crowd image $I_1$, the optical flow feature fusion module performs, according to $M_f$ and using the following formula (1), a spatial transformation on the first feature map $F_1$ corresponding to the non-target crowd image $I_1$, so as to feature-align $F_1$ with the target first feature map $F_2$ corresponding to the target crowd image $I_2$, obtaining the third feature map $F_1'$ corresponding to $I_1$:

$F_1' = \mathrm{WARP}(F_1, M_f)$  (1)

where $\mathrm{WARP}(\cdot)$ is an image affine transformation function for realizing the spatial transformation.

Likewise, since the first optical flow map $M_b$ is the backward optical flow map from the target crowd image $I_2$ to the non-target crowd image $I_3$, the optical flow feature fusion module performs, according to $M_b$ and using the following formula (2), a spatial transformation on the first feature map $F_3$ corresponding to the non-target crowd image $I_3$, so as to feature-align $F_3$ with the target first feature map $F_2$, obtaining the third feature map $F_3'$ corresponding to $I_3$:

$F_3' = \mathrm{WARP}(F_3, M_b)$  (2)
When performing the spatial transformation on the first feature map corresponding to a non-target crowd image according to the first optical flow map between the target crowd image and that non-target crowd image to obtain the corresponding third feature map, spatial transformation methods other than the WARP(·) image affine transformation function may also be adopted, which is not specifically limited in this disclosure.
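A minimal sketch of one common way to implement the WARP operation is given below, using PyTorch's grid_sample for flow-guided bilinear warping; the disclosure does not fix the exact implementation of its affine transformation function.

```python
# A sketch of flow-guided backward warping; implementation details such as
# the padding mode are assumptions.
import torch
import torch.nn.functional as F

def warp(feature: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (N, C, H, W) by a flow field (N, 2, H, W)."""
    n, _, h, w = feature.shape
    # Base sampling grid of pixel coordinates (x horizontal, y vertical).
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(feature)
    # Displace each pixel by the flow, then normalize to [-1, 1].
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(feature, sample_grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

F1_aligned = warp(torch.randn(1, 512, 64, 64), torch.randn(1, 2, 64, 64))
```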
In a possible implementation manner, the fusing the target first feature map and the at least one third feature map to obtain a second feature map, including: fusing the target first feature map and the at least one third feature map along the channel dimension to obtain a fourth feature map, wherein the channel number of the fourth feature map is the sum of the channel numbers of the target first feature map and the at least one third feature map; and performing dimensionality reduction on the fourth feature map along the channel dimension to obtain a second feature map, wherein the second feature map and the target first feature map have the same size and channel number.
The third feature map is determined based on the first optical flow map between the target crowd image and the non-target crowd image after spatial transformation, and can reflect the spatiotemporal relationship between the target crowd image and the non-target crowd image, so that compared with a mode of simply fusing a plurality of first feature maps at a channel level in the related art, the first feature map and the third feature map corresponding to the target crowd image are fused along the channel dimension, and a fourth feature map with more sufficient feature information and more robustness can be obtained.
Since the number of channels of the fourth feature map obtained after feature fusion along the channel dimension is the sum of the number of channels of at least two first feature maps, in order to make the feature map after fusion and the feature map before fusion have the same size and number of channels, dimension reduction processing is performed on the fourth feature map along the channel dimension, so that the second feature map after feature enhancement corresponding to the target crowd image is obtained.
The dimension reduction processing along the channel dimension may be implemented by using a convolution activation processing mode, and may also be implemented by using other channel dimension reduction processing modes, which is not specifically limited by the present disclosure.
Still taking the target first feature map $F_2$ and the third feature maps $F_1'$ and $F_3'$ as an example, the optical flow feature fusion module may fuse $F_2$, $F_1'$, and $F_3'$ along the channel dimension using the following formula (3) to obtain the fourth feature map $F_c$:

$F_c = \mathrm{CAT}(F_1', F_2, F_3')$  (3)

where $\mathrm{CAT}(\cdot)$ denotes the feature fusion (channel-wise concatenation) operation along the channel dimension. The convolution-activation layer in the optical flow feature fusion module then performs convolution activation processing on the fourth feature map $F_c$, obtaining the second feature map $F \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 512}$ with the same size and number of channels as the target first feature map $F_2$. The second feature map $F$ can reflect the spatiotemporal relationship between the target crowd image $I_2$ and the other non-target crowd images $I_1$ and $I_3$.
In another embodiment, when only two frames of crowd images $I_1$ and $I_2$ are obtained from the crowd video clip, with $I_2$ being the target crowd image and $I_1$ being a non-target crowd image, the second feature map corresponding to the target crowd image $I_2$ is obtained as follows: determine the first feature maps $F_1$ and $F_2$ corresponding to $I_1$ and $I_2$; determine, through the optical flow calculation module, the first optical flow map $M_f$ (forward optical flow map) from the target crowd image $I_2$ to the non-target crowd image $I_1$; input the first feature maps $F_1$ and $F_2$ and the first optical flow map $M_f$ into the optical flow feature fusion module, and obtain the third feature map $F_1'$ using formula (1); and fuse the first feature map $F_2$ with the third feature map $F_1'$ to obtain the second feature map $F$ corresponding to $I_2$. The method of fusing $F_2$ and $F_1'$ into the second feature map $F$ is similar to that of formula (3), differing only in the number of input feature maps, and is not described again here.

In another embodiment, when only two frames of crowd images $I_2$ and $I_3$ are obtained from the crowd video clip, with $I_2$ being the target crowd image and $I_3$ being a non-target crowd image, the second feature map corresponding to the target crowd image $I_2$ is obtained as follows: determine the first feature maps $F_2$ and $F_3$ corresponding to $I_2$ and $I_3$; determine, through the optical flow calculation module, the first optical flow map $M_b$ (backward optical flow map) from the target crowd image $I_2$ to the non-target crowd image $I_3$; input the first feature maps $F_2$ and $F_3$ and the first optical flow map $M_b$ into the optical flow feature fusion module, and obtain the third feature map $F_3'$ using formula (2); and fuse the first feature map $F_2$ with the third feature map $F_3'$ to obtain the second feature map $F$ corresponding to $I_2$. The method of fusing $F_2$ and $F_3'$ into the second feature map $F$ is similar to that of formula (3), differing only in the number of input feature maps, and is not described again here.
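Returning to the three-frame example, a minimal sketch of the optical flow feature fusion module is given below, assuming the warp() helper from the earlier sketch; the 1×1 convolution plus ReLU used for channel dimension reduction is one plausible form of the convolution activation processing mentioned above, not the only one.

```python
# A sketch of the fusion step: warp, concatenate along channels (formula (3)),
# then reduce back to the target first feature map's channel count.
import torch
import torch.nn as nn

class FlowFeatureFusion(nn.Module):
    def __init__(self, channels: int = 512, num_frames: int = 3):
        super().__init__()
        # 1x1 convolution reduces the concatenated maps back to `channels`.
        self.reduce = nn.Sequential(
            nn.Conv2d(channels * num_frames, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, F2, F1, F3, M_f, M_b):
        F1_aligned = warp(F1, M_f)  # align I1's features to the target frame
        F3_aligned = warp(F3, M_b)  # align I3's features to the target frame
        Fc = torch.cat([F1_aligned, F2, F3_aligned], dim=1)  # fourth feature map
        return self.reduce(Fc)      # second feature map, same shape as F2
```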
In a possible implementation manner, performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, includes: carrying out crowd positioning according to the second feature map to obtain a first positioning probability map, wherein the first positioning probability map is used for indicating the probability that each pixel point in the target crowd image is a human body; and according to the probability threshold, carrying out image processing on the first positioning probability map to obtain a target positioning map.
The second feature map can reflect the time-space relationship between the target crowd image and other non-target crowd images, so that the first positioning probability map with higher accuracy corresponding to the target crowd image can be obtained by utilizing the second feature map for crowd positioning.
The process of crowd sourcing according to the second profile is described in detail below in conjunction with possible implementations of the present disclosure.
In a possible implementation manner, performing crowd positioning according to the second feature map to obtain the first positioning probability map includes: performing convolution processing on the second feature map to obtain a seventh feature map; performing transposed convolution processing on the seventh feature map to obtain an eighth feature map, where the size of the eighth feature map is the same as that of the target crowd image; and performing convolution processing on the eighth feature map to obtain the first positioning probability map.
In order to determine the probability that each pixel point in the target crowd image is a human body, transposed convolution processing is performed on the seventh feature map to obtain an eighth feature map with the same size as the target crowd image; convolution processing is then performed on the eighth feature map, so that the first positioning probability map used for indicating the probability that each pixel point in the target crowd image is a human body can be effectively obtained.
The first positioning probability map has the same size as the target crowd image. Still taking the above target crowd image $I_2$ as an example, in the case that the target crowd image is $I_2 \in \mathbb{R}^{H \times W \times 3}$, the first positioning probability map may be written as $\hat{Y} \in \mathbb{R}^{H \times W}$, where $\hat{Y}(x)$ is used for indicating the probability that $I_2(x)$ is a human body, and $x$ is the coordinate of pixel points at the same relative position in the target crowd image $I_2$ and the first positioning probability map $\hat{Y}$.
In a possible implementation manner, a positioning prediction module in the convolutional neural network may be utilized to perform crowd positioning on the second feature map to obtain a first positioning probability map.
In an example, a localization prediction module in a convolutional neural network may include a convolutional layer for performing convolutional processing and a transposed convolutional layer for performing transposed convolutional processing. The specific structure of the positioning prediction module may be set according to actual conditions (for example, the number of layers of the convolutional layers, the arrangement manner of each layer, and the like), and this disclosure does not specifically limit this.
In an example, still taking the second feature map $F$ as an example, $F$ is input into the positioning prediction module of the convolutional neural network; the process by which the positioning prediction module performs crowd positioning on $F$ may be specifically described as follows: three convolutional layers (convolution kernel size 3, dilation rate 2, 512 channels) perform convolution processing on the second feature map $F$ for further feature extraction, obtaining the seventh feature map; then three transposed convolutional layers (convolution kernel size 4, stride 2, with 256, 128, and 64 channels respectively), each followed by a convolutional layer (convolution kernel size 3, dilation rate 2, with 256, 128, and 64 channels respectively), perform transposed convolution processing on the seventh feature map to restore the features to the size of the target crowd image, obtaining the eighth feature map with the same size as the target crowd image; finally, one 1×1 convolutional layer performs convolution processing on the eighth feature map to convert the number of feature channels to 1, and outputs the first positioning probability map $\hat{Y}$.
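A minimal sketch of such a positioning prediction module follows; the disclosure specifies layer counts, kernel sizes, dilation rates, strides, and channel numbers, while the ReLU activations, padding choices, and final sigmoid are assumptions added so the sketch runs end to end.

```python
# A sketch of the positioning prediction module described above: three
# dilated convs, three {transposed conv + dilated conv} stages, then a
# 1x1 conv producing a single-channel probability map.
import torch
import torch.nn as nn

def conv3(cin, cout):
    # kernel size 3, dilation 2; padding 2 keeps the spatial size unchanged
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=2, dilation=2),
                         nn.ReLU(inplace=True))

def up_block(cin, cout):
    # transposed conv: kernel 4, stride 2 (doubles H and W), then a dilated conv
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.ReLU(inplace=True),
                         conv3(cout, cout))

localization_head = nn.Sequential(
    conv3(512, 512), conv3(512, 512), conv3(512, 512),          # seventh feature map
    up_block(512, 256), up_block(256, 128), up_block(128, 64),  # eighth feature map
    nn.Conv2d(64, 1, kernel_size=1),                            # 1 output channel
    nn.Sigmoid(),                                               # probabilities in [0, 1]
)

second_feature_map = torch.randn(1, 512, 64, 64)   # H/8 x W/8, from the fusion step
prob_map = localization_head(second_feature_map)   # (1, 1, 512, 512)
```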
Because the first positioning probability map is only used for indicating the probability that each pixel point in the target crowd image is a human body, the first positioning probability map is subjected to image processing through a preset probability threshold, and therefore the target positioning map used for indicating the position of the human body in the target crowd image can be effectively obtained. The specific value of the probability threshold may be determined according to actual conditions, and this disclosure does not specifically limit this.
In a possible implementation manner, the image processing is performed on the first positioning probability map according to a probability threshold to obtain a target positioning map, including: carrying out average pooling operation on the first positioning probability map to obtain a mean pooling map; performing maximum pooling operation on the mean pooling image to obtain a maximum pooling image; obtaining a second positioning probability map according to the mean pooling map and the maximum pooling map; and performing threshold segmentation on the second positioning probability map according to the probability threshold to obtain a target positioning map.
An average pooling operation is performed on the first positioning probability map to obtain the mean pooling map, and a maximum pooling operation is then performed on the mean pooling map to obtain the maximum pooling map, which can effectively suppress image noise.
In one example, first, performing an average pooling operation on the first positioning probability map by using a first pooling kernel with a size of 3 and a step size of 1 to obtain a mean pooling map; then, the maximum pooling operation is performed on the mean pooling image using a second pooling kernel having a size of 3 and a step size of 1, resulting in a maximum pooling image. The specific values of the sizes and the step lengths of the first pooled kernel and the second pooled kernel may be set according to actual situations, which is not specifically limited in this disclosure.
Peak screening is performed on the mean pooling map and the maximum pooling map, so as to obtain a second positioning probability map with higher precision.
In one example, the pixel values of the pixels in the mean pooling map and the maximum pooling map are used to indicate the probability that the corresponding pixel in the target crowd image is a human body. Comparing the mean pooling image with the maximum pooling image pixel by pixel, and determining the pixel value of the pixel with the same relative position in the second positioning probability image as the probability value if the pixel values are the same probability value for the pixel with the same relative position in the mean pooling image and the maximum pooling image; and aiming at the pixel points with the same relative position in the mean pooling image and the maximum pooling image, if the pixel values are different probability values, determining the pixel value of the pixel point with the same relative position in the second positioning probability image as 0.
For example, in the case where the pixel value of the pixel point (i, j) in the mean pooling map (the pixel point at the ith row and the jth column) is 0.7 and the pixel value of the pixel point (i, j) in the maximum pooling map is 0.7, the pixel value of the pixel point (i, j) in the second localization probability map is determined to be 0.7.
For another example, when the pixel value of the pixel point (i, j) in the mean pooling image is 0.7 and the pixel value of the pixel point (i, j) in the maximum pooling image is 0.5, the pixel value of the pixel point (i, j) in the second localization probability map is determined to be 0.
Finally, threshold segmentation is carried out on the second positioning probability map through a preset probability threshold, so that a finally needed target positioning map can be obtained, and the position of the human body in the target crowd image is effectively determined.
In an example, comparing a probability value corresponding to each pixel point in the second positioning probability map with a probability threshold, and determining the pixel value of a pixel point with the same relative position in the target positioning map as 1 when the pixel value of a certain pixel point is greater than or equal to the probability threshold; and under the condition that the pixel value of a certain pixel point is smaller than the probability threshold, determining the pixel value of the pixel point with the same relative position in the target positioning image as 0.
The target positioning image and the target crowd image have the same size, and the position of a pixel point with a pixel value of 1 in the target positioning image is used for indicating the position of a human body included in the target crowd image. For example, under the condition that the pixel value of the pixel point (i, j) in the target positioning image is 1, the position of the pixel point (i, j) in the target crowd image corresponds to a human body; and under the condition that the pixel value of the pixel point (i, j) in the target positioning image is 0, the position of the pixel point (i, j) in the target crowd image corresponds to the part except the head of the human body.
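A minimal sketch of this post-processing pipeline is given below; the kernel size 3 and stride 1 follow the example above, while the padding used to keep sizes aligned and the 0.5 threshold are assumptions (the disclosure leaves the probability threshold to be configured).

```python
# A sketch of the post-processing: average pooling, max pooling, peak
# screening, and threshold segmentation into the target positioning map.
import torch
import torch.nn.functional as F

def localize(prob_map: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """prob_map: (N, 1, H, W) first positioning probability map."""
    avg = F.avg_pool2d(prob_map, kernel_size=3, stride=1, padding=1)  # mean pooling map
    mx = F.max_pool2d(avg, kernel_size=3, stride=1, padding=1)        # max pooling map
    # Peak screening: keep a pixel's probability only where the mean map
    # equals the max map (a local maximum); set it to 0 elsewhere.
    peaks = torch.where(avg == mx, avg, torch.zeros_like(avg))        # second map
    return (peaks >= threshold).float()                               # target positioning map
```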
According to the target positioning map, the positions of the human bodies included in the target crowd image can be determined, realizing crowd positioning and providing a data basis for other crowd analysis tasks (such as crowd counting and crowd behavior analysis). For example, the number of human bodies included in the target crowd image can be obtained by counting the number of pixel points with a pixel value of 1 in the target positioning map, realizing crowd counting for the target crowd image. For another example, the behavior trajectories of the human bodies included in the target crowd image can be obtained by counting the distribution of pixel points with a pixel value of 1 in the target positioning map, realizing crowd behavior analysis for the target crowd image.
In one possible implementation manner, the crowd positioning method is implemented through a crowd positioning neural network, and a training sample of the crowd positioning neural network comprises a crowd sample video clip and a real positioning map corresponding to a target crowd sample image in the crowd sample video clip; the training method of the crowd positioning neural network comprises the following steps: determining a prediction positioning probability map corresponding to the target crowd sample image through a crowd positioning neural network, wherein the prediction positioning probability map is used for indicating the probability that each pixel point in the target crowd sample image is a human body; determining the positioning loss based on the prediction positioning probability graph and the real positioning graph; and optimizing the crowd positioning neural network based on the positioning loss.
In order to realize crowd positioning quickly, the crowd positioning neural network may be obtained by pre-training based on the above crowd positioning method; then, in practical applications, the crowd positioning neural network can be used to quickly and effectively perform crowd positioning on crowd images.
The method comprises the steps of constructing a training sample for network training of the crowd positioning neural network in advance, wherein the training sample comprises a crowd sample video clip and a real positioning picture corresponding to a target crowd sample image in the crowd sample video clip.
The crowd sample video clip herein is a video clip including a plurality of frames of crowd sample images, and may be obtained by video acquisition of dense crowd in a certain spatial range (for example, places with large traffic, such as squares, shopping malls, subway stations, tourist attractions, and the like) by an image acquisition device, or may be obtained by other methods, which is not specifically limited by the present disclosure.
In one possible implementation, the crowd positioning method further includes: acquiring an original crowd sample video; and performing frame rate downsampling on the original crowd sample video to obtain the crowd sample video clip, where the frame rate of the crowd sample video clip is less than a threshold.
The image acquisition device performs video acquisition on a dense crowd to obtain the original crowd sample video. When the frame rate of the original crowd sample video is high, the difference between adjacent frames of crowd sample images in the original crowd sample video is small. In order to better utilize the spatiotemporal relationship between different frames of crowd sample images, frame rate downsampling may be performed on the high-frame-rate original crowd sample video to obtain a crowd sample video clip whose frame rate is less than a threshold, so that the difference between adjacent frames of crowd sample images in the clip is larger, the spatiotemporal relationship between different frames of crowd sample images can be better utilized, and the crowd positioning neural network can be better trained. For example, the frame rate of the crowd sample video clip is 5 frames per second.
After the original crowd sample video is frame-rate downsampled to obtain the crowd sample video clip, each frame of crowd sample image may include background regions in addition to the dense crowd; therefore, to better train the crowd positioning neural network, each frame of crowd sample image may be cropped so that the dense crowd portion of each frame is retained. Each cropped frame of crowd sample image needs to have the same scale, to ensure that subsequent image processing operations such as feature extraction and feature fusion can be applied to every frame of crowd sample image.
The crowd sample video clip includes a plurality of frames of crowd sample images, and at least two frames of crowd sample images are obtained from the crowd sample video clip for feature extraction, wherein the specific number of the at least two frames of crowd sample images for feature extraction may be determined according to actual conditions, for example, 2 frames, 3 frames, 5 frames, 7 frames, and the like, which is not specifically limited by the present disclosure.
One frame of crowd sample image may be randomly selected from the at least two frames of crowd sample images obtained from the crowd sample video clip and determined as the target crowd sample image. For example, the target crowd sample image may be the first frame of the at least two frames of crowd sample images; it may be the last frame of the at least two frames of crowd sample images; in the case that an odd number of frames greater than or equal to 3 is obtained from the crowd sample video clip, the target crowd sample image may be the middle frame among them; or it may be any one frame of the at least two frames of crowd sample images, which is not specifically limited in this disclosure.
The real positioning map and the target crowd sample image have the same size. The pixel value of a pixel point in the real positioning map is 0 or 1: the positions of pixel points with pixel value 1 indicate the positions of human bodies included in the target crowd sample image, and the positions of pixel points with pixel value 0 indicate positions other than human bodies in the target crowd sample image. The relationship between the real positioning map and the target crowd sample image is similar to the relationship between the target positioning map and the target crowd image, and is not described herein again.
In one possible implementation, the crowd positioning method further includes: determining a labeling result corresponding to the target crowd sample image, wherein the labeling result comprises the coordinates of the human bodies in the target crowd sample image; and determining the real positioning map according to the labeling result.
By labeling the human body coordinates of the target crowd sample image, the real positioning map corresponding to the target crowd sample image can be effectively determined according to the labeling result, and a training sample for network training of the crowd positioning neural network can be effectively constructed from the target crowd sample image and the real positioning map.
In one example, the target crowd sample image is $I_2' \in \mathbb{R}^{H \times W \times 3}$, where H and W are respectively the height and width of the target crowd sample image $I_2'$, and the number of channels of the target crowd sample image is 3. The human bodies included in the target crowd sample image $I_2'$ are labeled to obtain the labeling result corresponding to $I_2'$:

$$A = \{a_1, a_2, \ldots, a_m\}$$

where $a_i$ is the coordinate of the i-th human body in the target crowd sample image $I_2'$, and m is the number of human bodies included in the target crowd sample image $I_2'$.

In an example, based on the labeling result $A$ corresponding to the target crowd sample image $I_2'$, the real positioning map $Y \in \mathbb{R}^{H \times W}$ corresponding to the target crowd sample image $I_2'$ can be determined using the following formula (4):

$$Y(y) = \psi\left(\left(\sum_{i=1}^{m} \delta(\cdot - a_i)\right) * K\right)(y) \quad (4)$$

where y is the coordinate of a pixel point in the target crowd sample image $I_2'$ having the same relative position as that in the real positioning map Y; $K = [0, 1, 0;\ 1, 1, 1;\ 0, 1, 0]$ is a convolution kernel; $*$ denotes convolution; $\psi(\cdot)$ truncates the convolution result at 1, so that any nonzero response is mapped to 1; and $\delta(\cdot)$ is a multivariate delta function, a concrete form of which is shown in the following formula (5):

$$\delta(x) = \begin{cases} 1, & x = 0 \\ 0, & \text{otherwise} \end{cases} \quad (5)$$
according to the labeling result corresponding to the target population sample image, the real positioning map can be determined by adopting the formula (4) and other methods, which is not specifically limited by the present disclosure.
After the training sample is determined, the training sample is used to perform network training on the crowd positioning neural network. First, a predicted positioning probability map corresponding to the target crowd sample image, which indicates the probability that each pixel point in the target crowd sample image is a human body, is determined through the crowd positioning neural network.
In a possible implementation manner, determining a predicted positioning probability map corresponding to the target crowd sample image through the crowd positioning neural network includes: performing feature extraction on at least two frames of crowd sample images acquired from the crowd sample video clip to obtain at least two fifth feature maps, wherein the at least two frames of crowd sample images comprise the target crowd sample image and at least one frame of non-target crowd sample image; determining a second optical flow map between the target crowd sample image and the at least one frame of non-target crowd sample image; fusing the at least two fifth feature maps according to the second optical flow map to obtain a sixth feature map corresponding to the target crowd sample image; and performing crowd positioning according to the sixth feature map to obtain the predicted positioning probability map.
The crowd positioning neural network comprises a feature extraction module, and at least two fifth feature maps can be obtained after feature extraction is performed, through the feature extraction module, on the at least two frames of crowd sample images acquired from the crowd sample video clip. For example, three frames of crowd sample images are obtained from the crowd sample video clip: I'1, I'2 and I'3, wherein the crowd sample image I'2 is the target crowd sample image. Feature extraction is performed on the crowd sample images I'1, I'2 and I'3 by the feature extraction module in the crowd positioning neural network to obtain three fifth feature maps: the fifth feature map F'1 corresponding to the crowd sample image I'1, the fifth feature map F'2 corresponding to the crowd sample image I'2, and the fifth feature map F'3 corresponding to the crowd sample image I'3.
The network structure of the feature extraction module in the crowd positioning neural network is similar to the network structure of the feature extraction module in the convolutional neural network, the feature extraction process of the feature extraction module in the crowd positioning neural network on at least two frames of crowd sample images is similar to the feature extraction process of the feature extraction module in the convolutional neural network on at least two frames of crowd images, and details are not repeated here.
The crowd positioning neural network comprises an optical flow calculation module. The optical flow calculation module is used to determine a second optical flow map M'f (a forward optical flow map) from the target crowd sample image I'2 to the non-target crowd sample image I'1, and a second optical flow map M'b (a backward optical flow map) from the target crowd sample image I'2 to the non-target crowd sample image I'3.
The network structure of the optical flow calculation module in the crowd positioning neural network is similar to the network structure of the optical flow calculation module in the convolutional neural network, and the process of determining the second optical flow by the optical flow calculation module in the crowd positioning neural network is similar to the process of determining the first optical flow by the optical flow calculation module in the convolutional neural network, and details are not repeated here.
The crowd positioning neural network comprises an optical flow feature fusion module. According to the second optical flow map M'f and the second optical flow map M'b, the three fifth feature maps F'1, F'2 and F'3 are fused to obtain a sixth feature map F' corresponding to the target crowd sample image I'2.
The network structure of the optical flow feature fusion module in the crowd positioning neural network is similar to the network structure of the optical flow feature fusion module in the convolutional neural network, the process of fusing the fifth feature maps by the optical flow feature fusion module in the crowd positioning neural network according to the second optical flow graph is similar to the process of fusing the first feature maps by the optical flow feature fusion module in the convolutional neural network according to the first optical flow graph, and details are not repeated here.
The crowd positioning neural network further comprises a positioning prediction module. Crowd positioning is performed on the sixth feature map by the positioning prediction module, so that a predicted positioning probability map indicating the probability that each pixel point in the target crowd sample image is a human body can be obtained. Still taking the sixth feature map F' as an example, the positioning prediction module in the crowd positioning neural network performs crowd positioning on the sixth feature map F' to obtain the predicted positioning probability map $\hat{Y}$.
The network structure of the positioning prediction module in the crowd positioning neural network is similar to the network structure of the positioning prediction module in the convolutional neural network, and the crowd positioning process of the positioning prediction module in the crowd positioning neural network on the sixth feature map is similar to the crowd positioning process of the positioning prediction module in the convolutional neural network on the second feature map, and details are not repeated here.
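The four modules described above could be composed as in the following hedged PyTorch sketch. Only the interfaces follow the description; the backbone, the flow estimator, the argument order of the flow estimator, the assumption that flow fields are produced at the feature resolution, and the use of 1x1 convolutions for fusion and prediction are all illustrative choices:

```python
import torch
import torch.nn as nn

class CrowdPositioningNet(nn.Module):
    """Hedged sketch of the four-module pipeline; submodule internals are assumptions."""

    def __init__(self, feature_extractor, flow_estimator, warp, channels):
        super().__init__()
        self.feature_extractor = feature_extractor  # feature extraction module
        self.flow_estimator = flow_estimator        # optical flow calculation module
        self.warp = warp                            # spatial transform by a flow field
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)  # flow feature fusion
        self.head = nn.Conv2d(channels, 1, kernel_size=1)             # positioning prediction

    def forward(self, prev_img, target_img, next_img):
        f1 = self.feature_extractor(prev_img)    # fifth feature map F'1
        f2 = self.feature_extractor(target_img)  # fifth feature map F'2 (target frame)
        f3 = self.feature_extractor(next_img)    # fifth feature map F'3
        flow_f = self.flow_estimator(target_img, prev_img)  # forward flow map M'f
        flow_b = self.flow_estimator(target_img, next_img)  # backward flow map M'b
        aligned = torch.cat(
            [self.warp(f1, flow_f), f2, self.warp(f3, flow_b)], dim=1)
        fused = self.fuse(aligned)               # sixth feature map F'
        return torch.sigmoid(self.head(fused))  # predicted positioning probability map
```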
Because the predicted positioning probability map indicates the probability that each pixel point in the target crowd sample image is a human body, and the real positioning map indicates the positions of human bodies in the target crowd sample image, the positioning loss of the crowd positioning neural network can be determined based on the predicted positioning probability map and the real positioning map. The network parameters of the crowd positioning neural network can then be adjusted based on the positioning loss, so that the crowd positioning neural network is optimized.
After the crowd positioning neural network obtains the predicted positioning probability map corresponding to the target crowd sample image, the positioning loss of the crowd positioning neural network can be determined according to the difference between the predicted positioning probability map and the real positioning map.
In one possible implementation, determining a location loss based on the predicted location probability map and the true location map includes: and determining the positioning loss by utilizing a cross entropy loss function according to the predicted positioning probability map, the real positioning map and the positive sample weight, wherein the positive sample weight is the weight corresponding to the pixel point in the real positioning map for indicating the position of the human body.
In one example, according to the predicted positioning probability map $\hat{Y}$ and the real positioning map Y, the positioning loss L can be determined by the cross-entropy loss function shown in the following formula (6):

$$L = -\frac{1}{H \times W} \sum_{y} \left[\, \lambda\, Y(y) \log \hat{Y}(y) + \big(1 - Y(y)\big) \log\big(1 - \hat{Y}(y)\big) \,\right] \quad (6)$$

where H and W are respectively the height and width of the predicted positioning probability map $\hat{Y}$ and of the real positioning map Y; y is the coordinate of a pixel point in the predicted positioning probability map $\hat{Y}$ having the same relative position as that in the real positioning map Y; the value of Y(y) is 0 or 1, where Y(y) = 1 indicates that the pixel point y in the target crowd sample image is a human body position, and Y(y) = 0 indicates that the pixel point y in the target crowd sample image is a position other than a human body; the value of $\hat{Y}(y)$ lies in [0, 1] and indicates the probability that the pixel point y in the target crowd sample image is a human body; and λ is the positive sample weight.
In the target crowd sample image, the pixel points at human body positions can be regarded as positive samples, and the other pixel points as negative samples. Since the proportion of the background part in the target crowd sample image may be far greater than that of the human bodies, that is, the number of negative samples is far greater than the number of positive samples, setting the positive sample weight can balance the positive and negative samples during training.
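Formula (6) can be written directly as a PyTorch function; the default value of the positive sample weight λ and the small epsilon added for numerical safety are illustrative assumptions:

```python
import torch

def localization_loss(pred, target, positive_weight=10.0):
    """Weighted binary cross-entropy of formula (6).

    pred:   predicted positioning probability map, values in [0, 1]
    target: real positioning map, values in {0, 1}
    positive_weight: lambda, balancing scarce positives against many negatives
    """
    eps = 1e-7                         # numerical safety, not part of formula (6)
    pred = pred.clamp(eps, 1.0 - eps)
    per_pixel = (positive_weight * target * torch.log(pred)
                 + (1.0 - target) * torch.log(1.0 - pred))
    return -per_pixel.mean()           # average over the H x W pixels
```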
The positioning loss may be determined by other loss functions besides the cross entropy loss function shown in the above formula (6), which is not specifically limited in this disclosure.
After the positioning loss is determined, the network parameters of the crowd positioning neural network can be adjusted according to the positioning loss so as to optimize the crowd positioning neural network. Iterative training is performed with the above network training method until a preset training condition is met, and the trained crowd positioning neural network is finally obtained.
In one possible implementation, network parameter adjustment is performed using a gradient-descent method based on positioning losses.
For example, the network parameters at the i-th iterative training are $\theta_i$, and the positioning loss determined after network training with the parameters $\theta_i$ is L. The network parameters $\theta_{i+1}$ at the (i+1)-th iterative training can then be determined by the following formula (7):

$$\theta_{i+1} = \theta_i - \gamma \nabla_{\theta_i} L \quad (7)$$

where $\nabla$ denotes the gradient operator and γ is the network learning rate. The specific value of the network learning rate γ may be determined according to actual situations, for example, γ = 0.0001, which is not limited in this disclosure.
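In PyTorch, formula (7) corresponds to plain stochastic gradient descent. The following sketch reuses the `network` and `localization_loss` names from the sketches above, which are assumptions:

```python
import torch

# gamma = 0.0001, as in the example above
optimizer = torch.optim.SGD(network.parameters(), lr=1e-4)

pred = network(prev_img, target_img, next_img)  # predicted positioning probability map
loss = localization_loss(pred, real_map)        # positioning loss L of formula (6)
optimizer.zero_grad()
loss.backward()     # gradient of L with respect to the current parameters theta_i
optimizer.step()    # theta_{i+1} = theta_i - gamma * gradient, i.e. formula (7)
```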
In one possible implementation, the preset training condition may be network convergence. For example, the network training method is adopted to perform iterative training until the network parameters are not changed any more, the network is considered to be converged, and the trained crowd positioning neural network is determined.
In one possible implementation, the preset training condition may be an iteration threshold. For example, the network training method is adopted to perform iterative training until the number of iterations reaches an iteration threshold, and the trained population positioning neural network is determined.
In one possible implementation, the preset training condition may be a positioning threshold. For example, the network training method is adopted to perform iterative training until the positioning accuracy corresponding to the network is greater than the positioning threshold value, and the trained crowd positioning neural network is determined.
Besides network convergence, the iteration threshold, or the positioning threshold, other preset training conditions may be set according to practical situations, which is not specifically limited by the present disclosure.
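A hedged sketch of the iterative training procedure with two of the preset training conditions (the iteration threshold and the positioning-accuracy threshold); all names, default values, and the batching scheme are illustrative:

```python
def train(network, optimizer, training_pairs, max_iterations=100_000,
          accuracy_threshold=None, evaluate=None):
    """Iterative training until a preset training condition is met.

    training_pairs yields ((prev_img, target_img, next_img), real_map) tuples;
    localization_loss is the function sketched above.
    """
    for iteration, ((prev_img, target_img, next_img), real_map) in enumerate(training_pairs):
        pred = network(prev_img, target_img, next_img)
        loss = localization_loss(pred, real_map)  # positioning loss of formula (6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                          # update of formula (7)
        if iteration + 1 >= max_iterations:
            break                                 # iteration-threshold condition
        if (accuracy_threshold is not None and evaluate is not None
                and evaluate(network) > accuracy_threshold):
            break                                 # positioning-threshold condition
    return network
```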
Fig. 2 shows a schematic diagram of a crowd positioning neural network according to an embodiment of the present disclosure. As shown in fig. 2, the crowd positioning neural network 20 includes a feature extraction module 21, an optical flow calculation module 22, an optical flow feature fusion module 23, and a positioning prediction module 24.
As shown in fig. 2, at least two frames of crowd images acquired from a crowd video clip, on which crowd positioning needs to be performed, are input into the crowd positioning neural network 20. The feature extraction module 21 performs feature extraction on the at least two frames of crowd images to obtain at least two first feature maps; the optical flow calculation module 22 determines a first optical flow map (which may include, for example, a forward optical flow map and/or a backward optical flow map) between the target crowd image and the non-target crowd images in the at least two frames of crowd images; the optical flow feature fusion module 23 fuses the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image; and the positioning prediction module 24 performs crowd positioning on the second feature map to obtain a target positioning map indicating the positions of human bodies included in the target crowd image. The specific processing procedures of the feature extraction module 21, the optical flow calculation module 22, the optical flow feature fusion module 23 and the positioning prediction module 24 are similar to those described above, and are not described herein again.
In the embodiment of the disclosure, by using the crowd positioning neural network in the crowd positioning process, the at least two first feature maps corresponding to the at least two frames of crowd images are fused by means of the first optical flow map between the target crowd image and the non-target crowd images among the at least two frames of crowd images acquired from the crowd video clip. The fused second feature map can reflect the inter-frame timing information between the target crowd image and the other non-target crowd images, as well as the intra-frame spatial information of the target crowd image. Therefore, after crowd positioning is performed using the second feature map, a target positioning map with higher accuracy corresponding to the target crowd image can be obtained, thereby effectively improving the accuracy of crowd positioning.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; owing to space limitations, details are not repeated in the present disclosure. Those skilled in the art will appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a crowd positioning device, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any one of the crowd positioning methods provided by the present disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding descriptions in the method section, which are not repeated here.
FIG. 3 illustrates a block diagram of a crowd locating device according to an embodiment of the disclosure. As shown in fig. 3, the apparatus 30 includes:
the feature extraction module 31 is configured to perform feature extraction on at least two frames of crowd images acquired from the crowd video clip to obtain at least two first feature maps, where the at least two frames of crowd images include a target crowd image and at least one non-target crowd image;
an optical flow determination module 32, configured to determine a first optical flow map between the target crowd image and the at least one frame of non-target crowd image;
the feature fusion module 33 is configured to fuse the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image;
and the crowd positioning module 34 is configured to perform crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, where the target positioning map is used to indicate a position of a human body included in the target crowd image.
In one possible implementation, the feature fusion module 33 includes:
the spatial transformation submodule is configured to perform spatial transformation on the first feature map corresponding to the at least one frame of non-target crowd image according to the first optical flow map, to obtain at least one third feature map corresponding to the at least one frame of non-target crowd image (a sketch of this warping operation follows this list);
and the feature fusion submodule is used for fusing the target first feature map and at least one third feature map to obtain a second feature map, wherein the target first feature map is a first feature map corresponding to the target crowd image.
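A minimal sketch of the spatial transformation (warping) of a feature map by an optical flow field, assuming the flow stores per-pixel (horizontal, vertical) displacements; the bilinear resampling via `grid_sample` is one common realization, not the only possible one:

```python
import torch
import torch.nn.functional as F

def warp_by_flow(feat, flow):
    """Spatially transform a feature map with an optical flow field.

    feat: (N, C, H, W) first feature map of a non-target crowd image
    flow: (N, 2, H, W) optical flow; channel 0 = horizontal and channel 1 =
          vertical displacement (an assumed convention, not fixed by this text)
    """
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=feat.device),
                            torch.arange(w, device=feat.device), indexing='ij')
    base = torch.stack((xs, ys), dim=-1).float()          # (H, W, 2) pixel grid
    pos = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)    # displaced positions
    # grid_sample expects sampling positions normalized to [-1, 1]
    gx = 2.0 * pos[..., 0] / (w - 1) - 1.0
    gy = 2.0 * pos[..., 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(feat, grid, align_corners=True)  # bilinear resampling
```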
In one possible implementation, the feature fusion submodule is specifically configured to:
fusing the target first feature map and the at least one third feature map along the channel dimension to obtain a fourth feature map, wherein the channel number of the fourth feature map is the sum of the channel numbers of the target first feature map and the at least one third feature map;
and performing dimensionality reduction on the fourth feature map along the channel dimension to obtain a second feature map, wherein the second feature map and the target first feature map have the same size and channel number.
In one possible implementation, the crowd location module 34 includes:
the first determining submodule is configured to perform crowd positioning according to the second feature map to obtain a first positioning probability map, wherein the first positioning probability map is used for indicating the probability that each pixel point in the target crowd image is a human body;
and the second determining submodule is used for carrying out image processing on the first positioning probability map according to the probability threshold value to obtain a target positioning map.
In a possible implementation manner, the second determining submodule is specifically configured to perform the following operations, as shown in the sketch after this list:
carrying out average pooling operation on the first positioning probability map to obtain a mean pooling map;
performing maximum pooling operation on the mean pooling map to obtain a maximum pooling map;
obtaining a second positioning probability map according to the mean pooling map and the maximum pooling map;
and performing threshold segmentation on the second positioning probability map according to the probability threshold to obtain a target positioning map.
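A hedged sketch of this post-processing chain; the kernel size, the probability threshold, and the way the mean-pooled and max-pooled maps are combined (keeping local maxima) are assumptions, since only the order of operations is fixed above:

```python
import torch
import torch.nn.functional as F

def target_map_from_probabilities(prob, threshold=0.5, kernel_size=3):
    """Post-process a first positioning probability map into a target positioning map.

    prob: (N, 1, H, W) first positioning probability map, values in [0, 1]
    """
    pad = kernel_size // 2
    mean_pooled = F.avg_pool2d(prob, kernel_size, stride=1, padding=pad)  # smooth
    max_pooled = F.max_pool2d(mean_pooled, kernel_size, stride=1, padding=pad)
    # Keep a pixel only where it is the local maximum of the smoothed map
    # (an assumed combination rule), giving the second positioning probability map.
    second_prob = torch.where(mean_pooled == max_pooled, mean_pooled,
                              torch.zeros_like(mean_pooled))
    return (second_prob > threshold).float()  # threshold segmentation
```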
In a possible implementation manner, the apparatus 30 implements the crowd positioning method through a crowd positioning neural network, where a training sample of the crowd positioning neural network includes a crowd sample video clip and a real positioning map corresponding to a target crowd sample image in the crowd sample video clip;
the apparatus 30 further comprises, a network training module comprising:
the third determining submodule is used for determining a predicted positioning probability map corresponding to the target crowd sample image through a crowd positioning neural network, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the target crowd sample image is a human body;
the fourth determining submodule is used for determining the positioning loss based on the prediction positioning probability map and the real positioning map;
and the optimization submodule is used for optimizing the crowd positioning neural network based on the positioning loss.
In a possible implementation manner, the third determining submodule is specifically configured to:
performing feature extraction on at least two frames of crowd sample images acquired from the crowd sample video clip to obtain at least two fifth feature maps, wherein the at least two frames of crowd sample images comprise the target crowd sample image and at least one frame of non-target crowd sample image;
determining a second optical flow map between the target crowd sample image and the at least one frame of non-target crowd sample image;
fusing the at least two fifth feature maps according to the second optical flow map to obtain a sixth feature map corresponding to the target crowd sample image;
and performing crowd positioning according to the sixth feature map to obtain the predicted positioning probability map.
In a possible implementation manner, the fourth determining submodule is specifically configured to:
and determining the positioning loss by utilizing a cross entropy loss function according to the predicted positioning probability map, the real positioning map and the positive sample weight, wherein the positive sample weight is the weight corresponding to the pixel point in the real positioning map for indicating the position of the human body.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 4 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 4, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to fig. 4, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 5 shows a block diagram of an electronic device in accordance with an embodiment of the disclosure. As shown in fig. 5, the electronic device 1900 may be provided as a server. Referring to fig. 5, electronic device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the above-described method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface operating system of Apple Inc. (Mac OS X™), the multi-user multi-process computer operating system (Unix™), the free and open source Unix-like operating system (Linux™), the open source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry that can execute the computer-readable program instructions, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), implements aspects of the present disclosure by utilizing the state information of the computer-readable program instructions to personalize the electronic circuitry.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method for locating a group of people, comprising:
performing feature extraction on at least two frames of crowd images acquired from the crowd video clip to obtain at least two first feature maps, wherein the at least two frames of crowd images comprise one frame of target crowd image and at least one frame of non-target crowd image;
determining a first optical flow map between the target crowd image and the at least one frame of non-target crowd image;
fusing the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image;
and performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image, wherein the target positioning map is used for indicating the positions of human bodies included in the target crowd image.
2. The method according to claim 1, wherein the fusing the at least two first feature maps according to the first optical flow map to obtain a second feature map corresponding to the target crowd image comprises:
performing, according to the first optical flow map, spatial transformation on the first feature map corresponding to the at least one frame of non-target crowd image to obtain at least one third feature map corresponding to the at least one frame of non-target crowd image;
and fusing the target first feature map and the at least one third feature map to obtain the second feature map, wherein the target first feature map is the first feature map corresponding to the target crowd image.
3. The method according to claim 2, wherein the fusing the target first feature map and the at least one third feature map to obtain the second feature map comprises:
fusing the target first feature map and the at least one third feature map along a channel dimension to obtain a fourth feature map, wherein the channel number of the fourth feature map is the sum of the channel numbers of the target first feature map and the at least one third feature map;
and performing dimension reduction processing on the fourth feature map along a channel dimension to obtain the second feature map, wherein the second feature map and the target first feature map have the same size and channel number.
4. The method according to any one of claims 1 to 3, wherein the performing crowd positioning according to the second feature map to obtain a target positioning map corresponding to the target crowd image comprises:
carrying out crowd positioning according to the second feature map to obtain a first positioning probability map, wherein the first positioning probability map is used for indicating the probability that each pixel point in the target crowd image is a human body;
and according to a probability threshold value, carrying out image processing on the first positioning probability map to obtain the target positioning map.
5. The method of claim 4, wherein the image processing the first localization probability map according to a probability threshold to obtain the target localization map comprises:
carrying out average pooling operation on the first positioning probability map to obtain a mean pooling map;
performing maximum pooling operation on the mean pooling map to obtain a maximum pooling map;
obtaining a second positioning probability map according to the mean pooling map and the maximum pooling map;
and performing threshold segmentation on the second positioning probability map according to the probability threshold to obtain the target positioning map.
6. The method according to any one of claims 1 to 5, wherein the crowd positioning method is implemented by a crowd positioning neural network, and the training sample of the crowd positioning neural network comprises a crowd sample video clip and a real positioning map corresponding to the target crowd sample image in the crowd sample video clip;
the training method of the crowd positioning neural network comprises the following steps:
determining a predicted positioning probability map corresponding to the target crowd sample image through the crowd positioning neural network, wherein the predicted positioning probability map is used for indicating the probability that each pixel point in the target crowd sample image is a human body;
determining a positioning loss based on the predicted positioning probability map and the real positioning map;
optimizing the crowd-sourcing neural network based on the localization loss.
7. The method of claim 6, wherein the determining a predicted positioning probability map corresponding to the target crowd sample image through the crowd positioning neural network comprises:
performing feature extraction on at least two frames of crowd sample images acquired from the crowd sample video clip to obtain at least two fifth feature maps, wherein the at least two frames of crowd sample images comprise the target crowd sample image and at least one frame of non-target crowd sample image;
determining a second optical flow map between the target crowd sample image and the at least one frame of non-target crowd sample image;
fusing the at least two fifth feature maps according to the second optical flow map to obtain a sixth feature map corresponding to the target crowd sample image;
and performing crowd positioning according to the sixth feature map to obtain the predicted positioning probability map.
8. The method according to claim 6 or 7, wherein said determining a location loss based on said predicted location probability map and said true location map comprises:
and determining the positioning loss by utilizing a cross entropy loss function according to the predicted positioning probability map, the real positioning map and a positive sample weight, wherein the positive sample weight is the weight corresponding to a pixel point used for indicating the position of the human body in the real positioning map.
9. A crowd positioning device, comprising:
the system comprises a characteristic extraction module, a first feature graph acquisition module and a second feature graph acquisition module, wherein the characteristic extraction module is used for performing characteristic extraction on at least two frames of crowd images acquired from crowd video clips to obtain at least two first feature graphs, and the at least two frames of crowd images comprise a target crowd image and at least one non-target crowd image;
an optical flow determination module for determining a first optical flow graph between the target demographic image and the at least one non-target demographic image;
the characteristic fusion module is used for fusing the at least two first characteristic graphs according to the first light flow graph to obtain a second characteristic graph corresponding to the target crowd image;
and the crowd positioning module is used for carrying out crowd positioning according to the second characteristic diagram to obtain a target positioning diagram corresponding to the target crowd image, wherein the target positioning diagram is used for indicating the position of a human body included in the target crowd image.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 8.
11. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 8.
CN202110586959.8A 2021-05-27 2021-05-27 Crowd positioning method and device, electronic equipment and storage medium Pending CN113297983A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110586959.8A CN113297983A (en) 2021-05-27 2021-05-27 Crowd positioning method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110586959.8A CN113297983A (en) 2021-05-27 2021-05-27 Crowd positioning method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113297983A true CN113297983A (en) 2021-08-24

Family

ID=77325685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110586959.8A Pending CN113297983A (en) 2021-05-27 2021-05-27 Crowd positioning method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113297983A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018059408A1 (en) * 2016-09-29 2018-04-05 北京市商汤科技开发有限公司 Cross-line counting method, and neural network training method and apparatus, and electronic device
CN109978891A (en) * 2019-03-13 2019-07-05 浙江商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN112668431A (en) * 2020-12-22 2021-04-16 山东师范大学 Crowd abnormal behavior detection method and system based on appearance-motion fusion network
CN112767451A (en) * 2021-02-01 2021-05-07 福州大学 Crowd distribution prediction method and system based on double-current convolutional neural network

Similar Documents

Publication Publication Date Title
CN110287874B (en) Target tracking method and device, electronic equipment and storage medium
CN110378976B (en) Image processing method and device, electronic equipment and storage medium
CN109522910B (en) Key point detection method and device, electronic equipment and storage medium
CN111507408B (en) Image processing method and device, electronic equipment and storage medium
CN110503689B (en) Pose prediction method, model training method and model training device
CN111753822A (en) Text recognition method and device, electronic equipment and storage medium
CN112241673B (en) Video processing method and device, electronic equipment and storage medium
CN111340731B (en) Image processing method and device, electronic equipment and storage medium
CN110889469A (en) Image processing method and device, electronic equipment and storage medium
JP2022540072A (en) POSITION AND ATTITUDE DETERMINATION METHOD AND DEVICE, ELECTRONIC DEVICE, AND STORAGE MEDIUM
CN110458218B (en) Image classification method and device and classification network training method and device
CN110532956B (en) Image processing method and device, electronic equipment and storage medium
CN111539410B (en) Character recognition method and device, electronic equipment and storage medium
CN111340048B (en) Image processing method and device, electronic equipment and storage medium
CN112991381B (en) Image processing method and device, electronic equipment and storage medium
CN111680646B (en) Action detection method and device, electronic equipment and storage medium
WO2022247091A1 (en) Crowd positioning method and apparatus, electronic device, and storage medium
CN111931781A (en) Image processing method and device, electronic equipment and storage medium
CN111523555A (en) Image processing method and device, electronic equipment and storage medium
CN113139484B (en) Crowd positioning method and device, electronic equipment and storage medium
WO2022141969A1 (en) Image segmentation method and apparatus, electronic device, storage medium, and program
CN111988622B (en) Video prediction method and device, electronic equipment and storage medium
CN113538310A (en) Image processing method and device, electronic equipment and storage medium
CN111275055A (en) Network training method and device, and image processing method and device
CN114973359A (en) Expression recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination