CN114694205A - Social distance detection method and device, electronic equipment and storage medium - Google Patents

Social distance detection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN114694205A
Authority
CN
China
Prior art keywords
head
information
image
target person
branch network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011642568.5A
Other languages
Chinese (zh)
Inventor
黄德威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd filed Critical Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202011642568.5A priority Critical patent/CN114694205A/en
Publication of CN114694205A publication Critical patent/CN114694205A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the invention provide a social distance detection method and device, electronic equipment, and a storage medium. The method comprises the following steps: acquiring an image to be detected; predicting positioning point information and head frame information for each target person in the image to be detected through a pre-trained neural network model, and extracting the corresponding head depth information according to the head frame information; calculating the height information of each corresponding target person from the positioning point information; performing three-dimensional reconstruction of each target person's head according to the height information and head depth information to obtain the target person's three-dimensional head information; and calculating the social distances among the target persons based on their three-dimensional head information. By computing the target persons' height information, more accurate head information is extracted for the three-dimensional reconstruction, so the reconstructed heads are positioned more accurately in three-dimensional space and the computed social distances between target persons are more accurate.

Description

Social distance detection method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a social distance detection method and device, electronic equipment and a storage medium.
Background
In some scenarios, the separation distance between people needs to be controlled, for example at ticket windows, bank counters, public places during flu season, and special places where gathering is restricted. Existing approaches to detecting the distance between people rely either on staff visually inspecting the site or surveillance video, or on image-processing techniques. Visual inspection is time-consuming, labor-intensive, and costly, and is affected by subjective factors, so its accuracy is low; existing image-processing techniques mainly measure the distance between people in a two-dimensional image to approximate the actual distance, which is also inaccurate. Existing person-distance detection therefore suffers from low detection accuracy.
Disclosure of Invention
The embodiment of the invention provides a social distance detection method, which can improve the accuracy of social distance detection among people.
In a first aspect, an embodiment of the present invention provides a social distance detection method, where the method includes:
acquiring an image to be detected, wherein the image to be detected comprises image depth information and a head to be detected, and the head to be detected comprises heads of a plurality of target persons;
predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model, and extracting corresponding human head depth information according to the human head frame information;
calculating height information of a corresponding target person according to the positioning point information;
according to the height information and the head depth information, performing three-dimensional reconstruction on the head of the target person to obtain three-dimensional head information of the target person;
and calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons.
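The five steps of the first aspect end in a pairwise distance computation. The patent does not give the distance formula at this point; a straightforward reading is the Euclidean distance between the reconstructed three-dimensional head centres, sketched below (function and variable names are illustrative, not from the patent):

```python
import itertools
import math

def pairwise_social_distances(heads_3d):
    """Euclidean distance between every pair of reconstructed 3D head
    centres. `heads_3d` maps a person id to an (x, y, z) tuple in metres."""
    return {
        (a, b): math.dist(pa, pb)
        for (a, pa), (b, pb) in itertools.combinations(heads_3d.items(), 2)
    }

# two people 5 m apart in the ground plane, heads at the same height
heads = {"p1": (0.0, 0.0, 1.7), "p2": (3.0, 4.0, 1.7)}
print(pairwise_social_distances(heads))
```

Any pair whose distance falls below a configured threshold (e.g. 1.5 m) would then be flagged as a social-distance violation.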
Optionally, the pre-trained neural network model includes a public network and a first branch network, a second branch network, and a third branch network connected behind the public network, where the public network is used to extract the common features shared by the three branch networks, the first branch network and the second branch network are used to extract the positioning point information of the head, and the third branch network is used to extract the head frame information.
Optionally, the positioning point information of the human head includes a head central point, a head projection central point, and foot key points, and predicting the positioning point information of each target person in the image to be detected through a pre-trained neural network model includes:
extracting common features of the image to be detected through the public network;
inputting the common features into the first branch network, predicting and obtaining a Gaussian heat map of the head through the first branch network, and obtaining a head central point based on the Gaussian heat map of the head;
and inputting the common features into the second branch network, and predicting to obtain a head projection central point and a foot key point through the second branch network, wherein the second branch network performs offset prediction based on the head central point.
Optionally, predicting the human head frame information of each target person in the image to be detected through a pre-trained neural network model, including:
and inputting the common features into the third branch network, and predicting to obtain the head frame information through the third branch network, wherein the third branch network performs offset prediction based on the head central point.
Optionally, the calculating height information of the corresponding target person according to the positioning point information includes:
and calculating the height information of the corresponding target person by combining the head central point, the head projection central point and the foot key point.
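The claim combines the head central point, head projection central point, and foot key points but does not spell out the height formula in this excerpt. One plausible sketch, under an assumed pinhole-camera model (the scaling of pixel height by depth over focal length is our assumption, not the patent's):

```python
def estimate_height(head_center, foot_point, depth, fy):
    """Hedged sketch of height estimation: the vertical pixel span from
    head centre to foot keypoint, scaled to metres by depth / focal
    length under a pinhole-camera assumption."""
    pixel_height = abs(foot_point[1] - head_center[1])
    return pixel_height * depth / fy

# 800-pixel span at 4 m depth with a 2000-pixel focal length -> 1.6 m
print(estimate_height((500, 100), (500, 900), 4.0, 2000.0))
```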
Optionally, extracting the common features through the public network includes:
sequentially carrying out convolution downsampling of a first multiple, convolution downsampling of a second multiple and convolution downsampling of a third multiple on an image to be detected to sequentially obtain a first common feature map, a second common feature map and a third common feature map, wherein the numerical value of the first multiple is smaller than the numerical value of the second multiple, the numerical value of the second multiple is smaller than the numerical value of the third multiple, the tensor size of the first common feature map is larger than the tensor size of the second common feature map, and the tensor size of the second common feature map is larger than the tensor size of the third common feature map.
Optionally, the inputting the common feature into the first branch network, and predicting a gaussian heatmap of the head through the first branch network includes:
upsampling the third common feature map and then performing convolution calculation to obtain a first upsampled feature map with the same tensor size as the second common feature map;
performing convolution calculation on the second common feature map, and fusing it with the first upsampled feature map to obtain a first fusion feature map;
upsampling the first fusion feature map and then performing convolution calculation to obtain a second upsampled feature map with the same tensor size as the first common feature map;
performing convolution calculation on the first common feature map, and fusing it with the second upsampled feature map to obtain a second fusion feature map;
and obtaining a Gaussian heat map based on the second fusion feature map.
Optionally, the three-dimensional reconstruction of the head of the target person is performed according to the height information and the head depth information to obtain the three-dimensional head information of the target person, including:
determining the reconstruction position of the head of the target person in the three-dimensional space according to the height information;
and performing three-dimensional reconstruction on the head of the target person at the reconstruction position in the three-dimensional space according to the head depth information to obtain the three-dimensional head information of the target person.
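The reconstruction-position step can be illustrated with standard pinhole back-projection. The patent does not specify its camera model in this excerpt, so the intrinsic parameters (fx, fy, cx, cy) below are assumptions for illustration only:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole-camera back-projection (an assumed model): pixel (u, v)
    with known depth maps to a 3D point in camera coordinates."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# a head centred on the principal point at 2 m depth sits on the optical axis
print(backproject(640, 360, 2.0, 1000.0, 1000.0, 640.0, 360.0))
```

The height information from the previous step would constrain the vertical coordinate of the reconstructed head relative to the ground plane.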
Optionally, the method further includes:
and constructing the ground of the three-dimensional space according to the human head projection central point.
In a second aspect, an embodiment of the present invention further provides a social distance detecting device, where the device includes:
the system comprises an acquisition module, a detection module and a display module, wherein the acquisition module is used for acquiring an image to be detected, the image to be detected comprises image depth information and a human head to be detected, and the human head to be detected comprises human heads of a plurality of target people;
the prediction module is used for predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model and extracting corresponding human head depth information according to the human head frame information;
the first calculation module is used for calculating the height information of the corresponding target person according to the positioning point information;
the reconstruction module is used for performing three-dimensional reconstruction on the head of the target person according to the height information and the head depth information to obtain three-dimensional head information of the target person;
and the second calculation module is used for calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons.
In a third aspect, an embodiment of the present invention provides an electronic device, including: the system comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the computer program to realize the steps of the social distance detection method provided by the embodiment of the invention.
In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the social distance detection method provided by the embodiment of the present invention.
In the embodiment of the invention, an image to be detected is acquired, where the image includes image depth information and the heads of a plurality of target persons; positioning point information and head frame information are predicted for each target person in the image through a pre-trained neural network model, and the corresponding head depth information is extracted according to the head frame information; the height information of each corresponding target person is calculated from the positioning point information; the head of each target person is three-dimensionally reconstructed from the height information and head depth information to obtain the target person's three-dimensional head information; and the social distances among the target persons are calculated based on their three-dimensional head information. By computing the target persons' height information, more accurate head information is extracted for the three-dimensional reconstruction, so the reconstructed heads are positioned more accurately in three-dimensional space and the computed social distances between target persons are more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a social distance detection method according to an embodiment of the present invention;
Fig. 2 is a flowchart of a method for predicting positioning point information according to an embodiment of the present invention;
Fig. 2a is a schematic processing flow diagram of a neural network model according to an embodiment of the present invention;
Fig. 2b is a schematic processing flow diagram of a first branch network according to an embodiment of the present invention;
Fig. 3 is a flowchart of a method for predicting the head frame information according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a social distance detecting device according to an embodiment of the present invention;
Fig. 5 is a block diagram of a prediction module according to an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a second processing sub-module according to an embodiment of the present invention;
Fig. 7 is a schematic structural diagram of a reconstruction module according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of another social distance detecting device according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a social distance detection method according to an embodiment of the present invention, and as shown in fig. 1, the method is used for performing social distance detection in real time, and includes the following steps:
101. and acquiring an image to be detected.
In the embodiment of the invention, the image to be detected comprises image depth information and a human head to be detected, and the human head to be detected comprises human heads of a plurality of target people.
The target area may be acquired by a depth camera (also referred to as a 3D camera), and the image to be detected including the image depth information is obtained.
In a possible embodiment, the number of target people may be detected in an image collected by the depth camera, and when the number of the detected target people is 1, the image is not processed, that is, the social distance of the target people in the image is not detected; and when the number of the detected target persons is 2 or more, taking the image as an image to be detected.
The image to be detected can be an image acquired in real time or an image uploaded by a user. The image to be detected may be a continuous frame image (video stream image) or may be an independent frame image (photograph).
102. And predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model, and extracting corresponding human head depth information according to the human head frame information.
In the embodiment of the present invention, the positioning point information of the target person may be head positioning point information and foot positioning point information. Further, the positioning point information of the target person further includes head projection positioning point information.
The head frame information of the target person may be a bounding box (x, y, h, w) of the head detection frame, where x and y are the coordinates of the box's central point, h is the box height, and w is the box width. A head detection frame can be understood as a sub-image of the target image containing the head of one target person.
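The (x, y, h, w) convention above can be converted to corner coordinates when cropping the head sub-image. A minimal sketch (the helper name is ours, not the patent's):

```python
def box_center_to_corners(x, y, h, w):
    """Convert a centre-based head box (x, y, h, w), where (x, y) is the
    box centre, into (left, top, right, bottom) corner coordinates."""
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

# a 30-wide, 40-tall box centred at (100, 100)
print(box_center_to_corners(100, 100, 40, 30))
```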
In a possible embodiment, there may be multiple pre-trained neural network models, including a head detection network model for performing head detection on the image to be detected and a positioning point extraction network model for extracting positioning point information from the image to be detected.
In the embodiment of the invention, in order to speed up acquisition of the head frame information and positioning point information of the target persons, an optional neural network model is provided whose input is the image to be detected and whose output is the positioning point information and head frame information of the target persons in that image. That is, the positioning point information and the head frame information are obtained through the same neural network model, whose input is the image to be detected. The neural network model comprises at least a public network, a first branch network, and a third branch network, where the public network is connected to both the first and third branch networks, the first branch network is used to output the positioning point information, and the third branch network is used to output the head frame information.
Furthermore, the positioning point information of the target person may be head positioning point information and foot positioning point information. To speed up their acquisition, an optional neural network model is provided whose input is the image to be detected and whose output is the head positioning point information, foot positioning point information, and head frame information of the target persons in that image. That is, the positioning point information and the head frame information are obtained through the same neural network model, whose input is the image to be detected. The neural network model includes a public network and a first branch network, a second branch network, and a third branch network connected behind the public network, where the public network is used to extract the common features shared by the three branch networks; the first and second branch networks are used to extract the positioning point information, specifically the first branch network extracts the head positioning point information and the second branch network extracts the foot positioning point information; and the third branch network is used to extract the head frame information.
Furthermore, the positioning point information of the target person may be head positioning point information, head projection positioning point information, and foot positioning point information. To speed up their acquisition, an optional neural network model is provided whose input is the image to be detected and whose output is the head positioning point information, head projection positioning point information, foot positioning point information, and head frame information of the target persons in that image. That is, the positioning point information and the head frame information are obtained through the same neural network model, whose input is the image to be detected. The neural network model includes a public network and a first branch network, a second branch network, and a third branch network connected behind the public network, where the public network is used to extract the common features shared by the three branch networks; the first and second branch networks are used to extract the positioning point information, specifically the first branch network extracts the head positioning point information and the second branch network extracts the head projection positioning point information and foot positioning point information; and the third branch network is used to extract the head frame information, as shown in fig. 2a.
Specifically, the embodiment of the present invention is described by taking the positioning point information of the target person as the head positioning point information, the head projection positioning point information, and the foot positioning point information as examples, please refer to fig. 2, fig. 2 is a flowchart of a method for predicting positioning point information according to the embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
201. and extracting the common characteristics of the image to be detected through a public network.
In the embodiment of the present invention, the common feature may be understood as an implicit feature including a human head feature, a human head positioning point feature, a human head projection positioning point feature, and a human foot positioning point feature, where the implicit feature has no background feature.
The public network may be a convolutional neural network that performs feature extraction on the image to be detected with convolution kernels to obtain, as the common features, implicit features including the head feature, head positioning point feature, head projection positioning point feature, and foot positioning point feature.
Further, the public network in the embodiment of the present invention may be a convolutional neural network using Mobilenet v3 (a lightweight network) as a backbone network. The common network carries out convolution downsampling of a first multiple, convolution downsampling of a second multiple and convolution downsampling of a third multiple on an image to be detected in sequence to obtain a first common feature map, a second common feature map and a third common feature map in sequence, wherein the numerical value of the first multiple is smaller than the numerical value of the second multiple, the numerical value of the second multiple is smaller than the numerical value of the third multiple, the tensor size of the first common feature map is larger than the tensor size of the second common feature map, and the tensor size of the second common feature map is larger than the tensor size of the third common feature map. The convolution downsampling may be to perform convolution operation on the image to obtain the feature map and then perform downsampling to obtain the feature map with a small size. Of course, the convolution may also be performed by changing the sliding step of the convolution kernel to reduce the size of the obtained feature map while performing the convolution operation on the image, for example, if the sliding step of the convolution kernel is 2, the obtained feature map is 1/2 of the input image.
Further, the first multiple may be 4 times, the second multiple may be 8 times, and the third multiple may be 16 times. For example, when the input image tensor is 1024 × 1024, the tensor of the eigenmap (the first common eigenmap) obtained by the 4-fold convolution downsampling is 256 × 256, the tensor of the eigenmap (the second common eigenmap) obtained by the 8-fold convolution downsampling is 128 × 128, and the tensor of the eigenmap (the third common eigenmap) obtained by the 16-fold convolution downsampling is 64 × 64.
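The 4x / 8x / 16x downsampling ladder can be reproduced with the standard convolution output-size formula. The 3 × 3 kernel and padding of 1 below are illustrative assumptions; the patent only fixes the multiples:

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Standard convolution output-size formula:
    out = (in + 2 * padding - kernel) // stride + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size, sizes = 1024, []
for _ in range(4):          # four stride-2 stages halve the size each time
    size = conv_out(size)
    sizes.append(size)
# the 4x, 8x and 16x common feature maps are 256, 128 and 64
print(sizes)
```

This matches the example in the text: a 1024 × 1024 input yields 256 × 256, 128 × 128, and 64 × 64 common feature maps, and a stride of 2 halves the feature map, as stated.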
202. And inputting the common features into the first branch network, predicting to obtain a Gaussian heat map of the head through the first branch network, and obtaining a head central point based on the Gaussian heat map of the head.
In the embodiment of the invention, the first branch network predicts the Gaussian heat map of the human head. Because the output feature map of the Gaussian heat map is large, its spatial generalization ability is stronger, and outputting a Gaussian heat map is more accurate than directly regressing coordinate points.
Specifically, as shown in fig. 2b, the common features include a first common feature map, a second common feature map, and a third common feature map. The third common feature map may be upsampled and then subjected to convolution calculation to obtain a first upsampled feature map with the same tensor size as the second common feature map; the second common feature map is subjected to convolution calculation and fused with the first upsampled feature map to obtain a first fusion feature map; the first fusion feature map is upsampled and then subjected to convolution calculation to obtain a second upsampled feature map with the same tensor size as the first common feature map; the first common feature map is subjected to convolution calculation and fused with the second upsampled feature map to obtain a second fusion feature map; and the Gaussian heat map is obtained from the second fusion feature map. For example, with a first multiple of 4, a second multiple of 8, a third multiple of 16, and an input image tensor of 1024 × 1024, the tensors of the first, second, and third common feature maps are 256 × 256, 128 × 128, and 64 × 64 respectively, and the tensors of the first and second upsampled feature maps are 128 × 128 and 256 × 256. Feature fusion may be performed by adding the feature maps element-wise; addition does not change the tensor size, only the values, so the finally output Gaussian heat map tensor is 256 × 256.
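The upsample-then-fuse pattern can be sketched with plain lists; nearest-neighbour upsampling and element-wise addition below stand in for the learned upsampling and convolution the branch actually uses:

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (a stand-in
    for the branch's learned upsampling + convolution)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]
        out.append(wide)
        out.append(list(wide))
    return out

def fuse(a, b):
    """Element-wise addition: the tensor size is unchanged, only the
    values change, exactly as described in the text."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

f3 = [[1, 2], [3, 4]]                 # stand-in third common feature map
f2 = [[10] * 4 for _ in range(4)]     # stand-in second common feature map
fused = fuse(f2, upsample2x(f3))      # first fusion step of the branch
print(len(fused), len(fused[0]))      # fused map keeps f2's size
```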
After the Gaussian heat map is obtained, the head positioning coordinates (x, y) corresponding to each target person can be found by image processing and used as the head positioning point information. In a possible embodiment, the positioning point may be the head central point: after obtaining the Gaussian heat map, the head central coordinates (x_center, y_center) corresponding to each target person can be found by image processing and used as the head positioning point information.
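Extracting head central coordinates from the heat map amounts to peak finding. A minimal single-peak sketch (a real decoder would also suppress neighbouring responses to recover one peak per person):

```python
def heatmap_peak(heatmap):
    """Return the (x, y) position of the strongest response in a 2-D
    Gaussian heat map, i.e. the head central coordinates."""
    best, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best:
                best, best_xy = v, (x, y)
    return best_xy

hm = [[0.0, 0.1, 0.0],
      [0.1, 0.9, 0.2],
      [0.0, 0.2, 0.1]]
print(heatmap_peak(hm))
```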
It should be noted that, the first branch network is a convolutional neural network for predicting a gaussian heat map of the human head, and the common features extracted by the public network also include implicit features of the predicted gaussian heat map of the human head.
203. And inputting the shared features into the second branch network, and predicting to obtain a human head projection central point and a human foot key point through the second branch network.
In the embodiment of the invention, the head projection central point is used as the head projection positioning point and the foot key points are used as the foot positioning points. The second branch network performs offset prediction based on the head central point: it can extract, from the shared network, the implicit features of the predicted head Gaussian heat map, and it uses the predicted head central coordinates (x_center, y_center) as the basis for offset prediction. The second branch network outputs the offsets of the head projection central point and the foot key points relative to the head central point (x_center, y_center) as (Δx3, Δy3, Δx4, Δy4, Δx5, Δy5), where (Δx3, Δy3) is the offset of the head projection central point, (Δx4, Δy4) is the offset of the left-foot key point, and (Δx5, Δy5) is the offset of the right-foot key point, each relative to (x_center, y_center). The coordinates of the head projection central point are then (x_center + Δx3, y_center + Δy3), the left-foot key point is (x_center + Δx4, y_center + Δy4), and the right-foot key point is (x_center + Δx5, y_center + Δy5).
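The offset decoding described above is simple arithmetic on the head central coordinates; a sketch with illustrative names:

```python
def decode_offsets(center, offsets):
    """Apply the second branch's predicted offsets (dx3, dy3, dx4, dy4,
    dx5, dy5) to the head central point, yielding the head projection
    central point and the left/right foot key points."""
    xc, yc = center
    dx3, dy3, dx4, dy4, dx5, dy5 = offsets
    return {
        "head_projection": (xc + dx3, yc + dy3),
        "left_foot": (xc + dx4, yc + dy4),
        "right_foot": (xc + dx5, yc + dy5),
    }

pts = decode_offsets((100, 50), (0, 5, -8, 120, 8, 120))
print(pts)
```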
It will be appreciated that the second branch network has a structure similar to that of the first branch network for predicting the head center coordinates (x_center, y_center); however, the second branch network has one additional convolution part, which performs the offset prediction and regresses the corresponding offset values (Δx3, Δy3, Δx4, Δy4, Δx5, Δy5).
In the embodiment of the invention, the head projection positioning point and the foot positioning points are predicted as offsets from the head center coordinates. Because the resulting coordinates are relative to the head center point, consistency among the head center point, the head projection positioning point and the foot positioning points is guaranteed, which improves prediction accuracy.
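Decoding the second branch's output into absolute image coordinates is a direct application of the offsets described above. A minimal sketch (the function name is hypothetical; the offset ordering follows the patent's (Δx3, Δy3, Δx4, Δy4, Δx5, Δy5)):

```python
def decode_offsets(center, offsets):
    """Convert the second branch's six offsets, all relative to the head
    center (x_center, y_center), into absolute image coordinates for the
    head projection center, left foot and right foot key points."""
    xc, yc = center
    dx3, dy3, dx4, dy4, dx5, dy5 = offsets
    return {
        "head_projection": (xc + dx3, yc + dy3),
        "left_foot": (xc + dx4, yc + dy4),
        "right_foot": (xc + dx5, yc + dy5),
    }

# Head center at (100, 40); projection slightly offset, feet far below it.
points = decode_offsets((100.0, 40.0), (1.0, 2.0, -8.0, 120.0, 9.0, 121.0))
```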
Specifically, the neural network model includes a third branch network, and the third branch network is used for extracting the human head frame information. Referring to fig. 3, fig. 3 is a flowchart of a method for predicting human head frame information according to an embodiment of the present invention, as shown in fig. 3, including the following steps:
301. and extracting the common characteristics of the image to be detected through a public network.
In the embodiment of the invention, the common features extracted by the public network also comprise implicit features of the predicted head Gaussian heatmap.
302. And inputting the common characteristics into the third branch network, and predicting to obtain the human head frame information through the third branch network.
The third branch network performs offset prediction based on the head center point. It will be appreciated that the third branch network has a structure similar to that of the first branch network for predicting the head center coordinates (x_center, y_center); however, the third branch network has one additional convolution part, which performs the offset prediction and regresses the offset values of the head box corner points (Δx1, Δy1, Δx2, Δy2), where (Δx1, Δy1) may be the predicted offset of the upper left corner of the head box and (Δx2, Δy2) may be the predicted offset of the upper right corner of the head box. From the predicted offsets, the position coordinates of the upper left corner point are:

(x1, y1) = (x_center + Δx1, y_center + Δy1)

and the position coordinates of the upper right corner point are:

(x2, y2) = (x_center + Δx2, y_center + Δy2)
in the embodiment of the invention, the head frame is obtained by predicting the offset of the center coordinate of the head, and the relative coordinate of the corner position of the head frame is obtained, so that the consistency of the head frame and the center point of the head is ensured, and the accuracy of the head frame prediction is improved.
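The head box decoding mirrors the key point decoding: corner coordinates are the head center plus the regressed offsets. A sketch (function name hypothetical; it assumes, consistently with the offset pattern above, that the two corners are center + (Δx1, Δy1) and center + (Δx2, Δy2)):

```python
def decode_head_box(center, offsets):
    """Convert the third branch's four offsets (dx1, dy1, dx2, dy2),
    relative to the head center, into the two corner points of the head box."""
    xc, yc = center
    dx1, dy1, dx2, dy2 = offsets
    corner1 = (xc + dx1, yc + dy1)  # e.g. upper-left corner
    corner2 = (xc + dx2, yc + dy2)  # opposite corner
    return corner1, corner2

# Head center at (100, 40) with a 24x30 box around it.
box = decode_head_box((100.0, 40.0), (-12.0, -15.0, 12.0, 15.0))
```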
It should be noted that, although the second branch network and the third branch network both predict offsets of coordinate points, they are not merged into a single branch. Kept separate, the second branch network can be trained with the Wing loss function and the third branch network with the GIoU loss function, each of which yields a more accurate result for its respective task.
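For reference, the Wing loss mentioned for the second branch behaves logarithmically near zero (amplifying the gradient of small key point errors) and linearly for large errors. A minimal sketch with the commonly used default parameters w = 10, ε = 2 (these values are an assumption, not taken from the patent):

```python
import math

def wing_loss(error, w=10.0, eps=2.0):
    """Wing loss for key point regression: w * ln(1 + |x|/eps) for |x| < w,
    and |x| - c otherwise, where c makes the two pieces meet at |x| = w."""
    c = w - w * math.log(1.0 + w / eps)  # continuity constant
    x = abs(error)
    if x < w:
        return w * math.log(1.0 + x / eps)
    return x - c

small = wing_loss(0.5)   # logarithmic regime: small errors still penalized
large = wing_loss(20.0)  # linear regime: robust to large outliers
```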
Through the human head frame information, corresponding human head depth information can be extracted from the position corresponding to the human head frame in the image to be detected.
103. And calculating the height information of the corresponding target person according to the positioning point information.
In the embodiment of the present invention, the positioning point information may be head positioning point information and foot positioning point information, that is, the positioning points may be a head positioning point and a foot positioning point.
In one possible embodiment, the second branch network may output (Δx4, Δy4) as the offset of the left foot key point relative to the head center coordinates (x_center, y_center) and (Δx5, Δy5) as the offset of the right foot key point relative to the head center coordinates (x_center, y_center). The average of Δy4 and Δy5 may then be taken as the height information of the target person, i.e. the image height of the target person in the image to be detected.
In a possible embodiment, the positioning point information may be head positioning point information, head projection positioning point information and foot positioning point information; that is, the positioning points may be a head positioning point, a head projection positioning point and foot positioning points. The height information of the target person can then be calculated by a triangle method, taking the predicted head positioning point, head projection positioning point and foot positioning point as the three corner points of a triangle. This handles, for example, the case where an adult holds a child, so that the located head (the child's) and feet (the adult's) belong to different people, and thus improves the accuracy of the height information.
Furthermore, the head positioning point, the head projection positioning point and the foot positioning points may be, respectively, the head center point, the head projection center point, and the left and right foot key points. The left and right foot key points can be fitted into a single foot fitting point, and the height information of the target person is calculated by taking the head center point, the head projection center point and the foot fitting point as the three corner points of a triangle.
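The triangle-based height calculation can be sketched as follows. The patent does not give the exact formula, so this is one plausible reading (an assumption): fit the foot point as the midpoint of the two foot key points, treat the head projection center and foot fitting point as defining the ground line, and take the height as the distance from the head center to that line.

```python
import math

def foot_fitting_point(left_foot, right_foot):
    """Fuse the left and right foot key points into one foot fitting point
    (here simply their midpoint)."""
    return ((left_foot[0] + right_foot[0]) / 2.0,
            (left_foot[1] + right_foot[1]) / 2.0)

def triangle_height(head_center, head_projection, foot_point):
    """Image height of the person: distance from the head center to the
    ground line through the head projection center and the foot point
    (the three points being the triangle's corner points)."""
    (x0, y0), (x1, y1), (x2, y2) = head_center, head_projection, foot_point
    num = abs((x2 - x1) * (y1 - y0) - (x1 - x0) * (y2 - y1))
    den = math.hypot(x2 - x1, y2 - y1)
    # Degenerate case: projection and foot coincide -> fall back to the
    # head-center-to-projection distance.
    return num / den if den > 0 else math.dist(head_center, head_projection)

fp = foot_fitting_point((8.0, 170.0), (12.0, 170.0))
h = triangle_height((0.0, 0.0), (0.0, 170.0), fp)
```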
104. And performing three-dimensional reconstruction on the head of the target person according to the height information and the head depth information to obtain the three-dimensional head information of the target person.
In the embodiment of the invention, the human head of the target person can be reconstructed in a preset three-dimensional space, and the three-dimensional space can be constructed according to the coordinate system of the depth camera and the world coordinate system.
In one possible embodiment, since the human head projection is located on the ground, the construction of the ground in the three-dimensional space can be assisted by the human head projection positioning points. Specifically, the human head projection positioning point can be a human head projection central point, and the construction of the ground in the three-dimensional space can be assisted according to the human head projection central point.
Specifically, the reconstruction position of the head of the target person in the three-dimensional space can be determined according to the height information; and performing three-dimensional reconstruction on the head of the target person at the reconstruction position in the three-dimensional space according to the head depth information to obtain the three-dimensional head information of the target person.
More specifically, the height at which the head of the target person is reconstructed in the three-dimensional space, i.e. the height from the head reconstruction position to the ground of the three-dimensional space, can be determined from the height information. In this way, the three-dimensional head of the target person can be accurately reconstructed, obtaining the three-dimensional head information of the target person in the three-dimensional space. After the reconstruction height is determined, the head of the target person can be three-dimensionally reconstructed in the three-dimensional space according to the head depth information, obtaining the three-dimensional target head.
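Lifting a head pixel into the three-dimensional space with its depth value is a standard pinhole back-projection. A minimal sketch under assumed calibration parameters (the intrinsics fx, fy, cx, cy and the function name are illustrative, not values from the patent):

```python
def backproject(pixel, depth, fx, fy, cx, cy):
    """Back-project an image pixel with its depth value into 3-D camera
    coordinates using the pinhole model: X = (u - cx) * Z / fx,
    Y = (v - cy) * Z / fy, Z = depth."""
    u, v = pixel
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# A head center at the principal point, 2.5 m from the camera.
head3d = backproject((320.0, 240.0), 2.5, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```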
105. And calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons.
In the embodiment of the invention, the social distance between the target persons can be calculated based on the three-dimensional target heads obtained through reconstruction, and a plurality of three-dimensional target heads obtained through reconstruction can be projected onto a preset plane (the ground of a three-dimensional space) of the three-dimensional space to obtain a plurality of target head projections; calculating the distance between the head projections of different target persons to obtain the projection distance between the target persons; and converting the projection distance between the target persons into the social distance between the target persons according to a preset proportion. The three-dimensional space may be constructed based on calibrated camera coordinates, and the origin of the three-dimensional space may be an optical center point of the camera or a center point of two optical center points.
Specifically, the geometric center of each target head projection can be calculated as the head projection point, and the distance between head projection points on the three-dimensional projection plane is calculated as the Euclidean distance. The preset proportion is the ratio of the metric of the three-dimensional space to the metric of the actual space and can be determined from the depth information: the larger the depth value, the larger the matched proportion. The distance between head projection points is then converted, according to the matched preset proportion, into the real-world distance between the target heads, thereby obtaining the social distance of the target persons in the real scene.
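The final distance step reduces to a Euclidean distance on the ground plane scaled by the preset proportion. A sketch (the scale value is an illustrative assumption; in the patent it would be matched to the depth information):

```python
import math

def social_distance(head_proj_a, head_proj_b, scale=1.0):
    """Euclidean distance between two head projection points on the ground
    plane of the three-dimensional space, converted to a real-world distance
    by the preset proportion `scale`."""
    return scale * math.dist(head_proj_a, head_proj_b)

# Two head projections 5 units apart in the reconstructed space; with a
# preset proportion of 0.5 m per unit this is 2.5 m of social distance.
d = social_distance((0.0, 0.0), (3.0, 4.0), scale=0.5)
```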
In the embodiment of the invention, an image to be detected is obtained, wherein the image to be detected comprises image depth information and a human head to be detected, and the human head to be detected comprises human heads of a plurality of target persons; predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model, and extracting corresponding human head depth information according to the human head frame information; calculating height information of a corresponding target person according to the positioning point information; according to the height information and the head depth information, performing three-dimensional reconstruction on the head of the target person to obtain three-dimensional head information of the target person; and calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons. By calculating the height information of the target personnel, more accurate head information of the target personnel is extracted for three-dimensional reconstruction, so that the position of the head of the three-dimensional target personnel in a three-dimensional space is more accurate, and the accuracy of the social distance between the target personnel is improved.
It should be noted that the social distance detection method provided by the embodiment of the present invention may be applied to devices such as a mobile phone, a monitor, a computer, and a server that can perform social distance detection.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a social distance detecting device according to an embodiment of the present invention, and as shown in fig. 4, the device includes:
the system comprises an acquisition module 401, a detection module and a processing module, wherein the acquisition module is used for acquiring an image to be detected, the image to be detected comprises image depth information and a head to be detected, and the head to be detected comprises heads of a plurality of target persons;
the prediction module 402 is configured to predict positioning point information and head frame information of each head in the image to be detected through a pre-trained neural network model, and extract corresponding head depth information according to the head frame information;
a first calculating module 403, configured to calculate height information of a corresponding target person according to the location point information;
the reconstruction module 404 is configured to perform three-dimensional reconstruction on the head of the target person according to the height information and the head depth information to obtain three-dimensional head information of the target person;
a second calculating module 405, configured to calculate social distances between multiple target people based on the three-dimensional head information of the target people.
Optionally, the pre-trained neural network model includes a public network, and a first branch network, a second branch network, and a third branch network connected behind the public network, where the public network is used to extract common features of the first branch network, the second branch network, and the third branch network, the first branch network and the second branch network are used to extract location point information of the head, and the third branch network is used to extract the head frame information.
Optionally, the locating point information of the head includes a head central point, a head projection central point, and a foot key point, as shown in fig. 5, the predicting module 402 includes:
the first processing submodule 4021 is used for extracting common features of the images to be detected through the public network;
the second processing sub-module 4022 is configured to input the common features to the first branch network, predict and obtain a gaussian heat map of the head through the first branch network, and obtain a head center point based on the gaussian heat map of the head;
the third processing sub-module 4023 is configured to input the common features to the second branch network, and predict and obtain a head projection center point and a foot key point through the second branch network, where the second branch network performs offset prediction based on the head center point.
Optionally, the predicting module 402 is further specifically configured to input the common feature into the third branch network, and obtain the head frame information through third branch network prediction, where the third branch network performs offset prediction based on the head center point.
Optionally, the first calculating module 403 is further specifically configured to calculate height information of the corresponding target person by combining the head central point, the head projection central point, and the foot key point.
Optionally, the first processing sub-module 4021 is specifically further configured to sequentially perform convolution downsampling of a first multiple, convolution downsampling of a second multiple, and convolution downsampling of a third multiple on an image to be detected, and sequentially obtain a first common feature map, a second common feature map, and a third common feature map, where a numerical value of the first multiple is smaller than a numerical value of the second multiple, a numerical value of the second multiple is smaller than a numerical value of the third multiple, a tensor size of the first common feature map is larger than a tensor size of the second common feature map, and a tensor size of the second common feature map is larger than a tensor size of the third common feature map.
Optionally, as shown in fig. 6, the second processing sub-module 4022 includes:
a first calculation unit 40221, configured to perform convolution calculation after upsampling the third common feature map, to obtain a first upsampled feature map with the same tensor size as the second common feature map;
a first fusion unit 40222, configured to perform convolution calculation on the second common feature map, and perform feature fusion with the first upsampling feature map to obtain a first fusion feature map;
a second calculation unit 40223, configured to perform convolution calculation after performing upsampling on the first fusion feature to obtain a second upsampled feature map with the same tensor size as the first common feature map;
a second fusion unit 40224, configured to perform convolution calculation on the first common feature map, and perform feature fusion with the second upsampling feature to obtain a second fusion feature map;
the processing unit 40225 is configured to obtain a gaussian heatmap based on the second fusion feature map.
Optionally, as shown in fig. 7, the reconstructing module 404 includes:
the determining submodule 4041 is used for determining the reconstruction position of the head of the target person in the three-dimensional space according to the height information;
and the reconstruction submodule 4042 is configured to perform three-dimensional reconstruction on the head of the target person at the reconstruction position in the three-dimensional space according to the head depth information, so as to obtain three-dimensional head information of the target person.
Optionally, as shown in fig. 8, the apparatus further includes:
and the building module 406 is configured to build the ground of the three-dimensional space according to the human head projection center point.
It should be noted that the social distance detection apparatus provided in the embodiment of the present invention may be applied to a device capable of detecting social distance, such as a mobile phone, a monitor, a computer, and a server.
The social distance detection device provided by the embodiment of the invention can realize each process realized by the social distance detection method in the method embodiment, and can achieve the same beneficial effect. To avoid repetition, further description is omitted here.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 9, including: a memory 902, a processor 901 and a computer program stored on the memory 902 and executable on the processor 901, wherein:
the processor 901 is used for calling the computer program stored in the memory 902 and executing the following steps:
acquiring an image to be detected, wherein the image to be detected comprises image depth information and a head to be detected, and the head to be detected comprises heads of a plurality of target persons;
predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model, and extracting corresponding human head depth information according to the human head frame information;
calculating height information of a corresponding target person according to the positioning point information;
according to the height information and the head depth information, performing three-dimensional reconstruction on the head of the target person to obtain three-dimensional head information of the target person;
and calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons.
Optionally, the pre-trained neural network model includes a public network, and a first branch network, a second branch network, and a third branch network connected behind the public network, where the public network is used to extract common features of the first branch network, the second branch network, and the third branch network, the first branch network and the second branch network are used to extract location point information of the head, and the third branch network is used to extract the head frame information.
Optionally, the locating point information of the human head includes a human head central point, a human head projection central point, and a human foot key point, and the locating point information of each target person in the image to be detected is predicted by the pre-trained neural network model executed by the processor 901, and includes:
extracting common features of the image to be detected through the public network;
inputting the common features into the first branch network, predicting and obtaining a Gaussian heat map of the head through the first branch network, and obtaining a head central point based on the Gaussian heat map of the head;
and inputting the common features into the second branch network, and predicting to obtain a head projection central point and a foot key point through the second branch network, wherein the second branch network performs offset prediction based on the head central point.
Optionally, the predicting, by the processor 901, the human head frame information of each target person in the image to be detected through the pre-trained neural network model includes:
and inputting the common features into the third branch network, and predicting to obtain the head frame information through the third branch network, wherein the third branch network performs offset prediction based on the head central point.
Optionally, the calculating, by the processor 901, height information of a corresponding target person according to the positioning point information includes:
and calculating the height information of the corresponding target person by combining the head central point, the head projection central point and the foot key point.
Optionally, the extracting, by the processor 901, the common features through the public network includes:
sequentially carrying out convolution downsampling of a first multiple, convolution downsampling of a second multiple and convolution downsampling of a third multiple on an image to be detected to sequentially obtain a first common feature map, a second common feature map and a third common feature map, wherein the numerical value of the first multiple is smaller than the numerical value of the second multiple, the numerical value of the second multiple is smaller than the numerical value of the third multiple, the tensor size of the first common feature map is larger than the tensor size of the second common feature map, and the tensor size of the second common feature map is larger than the tensor size of the third common feature map.
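The three-scale downsampling above can be illustrated with a toy example. As an assumption for illustration, average pooling stands in for the learned strided convolutions, and the multiples are taken as 2x, 4x and 8x; only the shrinking tensor sizes matter here.

```python
import numpy as np

def downsample(feature, factor):
    """Stand-in for convolution downsampling: average-pool by `factor`
    so the spatial size shrinks by that multiple."""
    h, w = feature.shape
    return feature[:h - h % factor, :w - w % factor] \
        .reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

image = np.random.rand(64, 64)
f1 = downsample(image, 2)  # first common feature map,  32x32
f2 = downsample(image, 4)  # second common feature map, 16x16
f3 = downsample(image, 8)  # third common feature map,   8x8
```

As required, the first multiple < second multiple < third multiple, so the tensor size of the first map > second map > third map.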
Optionally, the inputting, by the processor 901, the common features into the first branch network, and predicting, by the first branch network, a gaussian heatmap of the head includes:
performing convolution calculation after upsampling the third common characteristic diagram to obtain a first upsampled characteristic diagram with the same tensor size as the second common characteristic diagram;
performing convolution calculation on the second common feature map, and performing feature fusion on the second common feature map and the first up-sampling feature map to obtain a first fusion feature map;
performing convolution calculation after the first fusion characteristic is subjected to upsampling to obtain a second upsampled characteristic diagram with the same tensor size as the first common characteristic diagram;
performing convolution calculation on the first common feature map, and performing feature fusion on the first common feature map and the second up-sampling feature map to obtain a second fusion feature map;
and obtaining a Gaussian heat map based on the second fusion feature map.
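The upsample-and-fuse sequence above (third map up to the second map's size, fuse, up again to the first map's size, fuse) can be sketched as follows. Nearest-neighbour upsampling and element-wise addition are illustrative assumptions; the patent does not specify the upsampling or fusion operators, and real networks would interleave learned convolutions.

```python
import numpy as np

def upsample2x(feature):
    """Nearest-neighbour 2x upsampling (stand-in for a learned upsampler)."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)

def fuse(a, b):
    """Feature fusion by element-wise addition (one common choice)."""
    return a + b

# Toy common feature maps at the three scales from the 2x/4x/8x example.
f1, f2, f3 = np.ones((32, 32)), np.ones((16, 16)), np.ones((8, 8))
up1 = upsample2x(f3)      # first upsampled map, same size as f2
fused1 = fuse(f2, up1)    # first fusion feature map
up2 = upsample2x(fused1)  # second upsampled map, same size as f1
fused2 = fuse(f1, up2)    # second fusion feature map -> heat map head
```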
Optionally, the three-dimensional reconstruction of the head of the target person performed by the processor 901 according to the height information and the head depth information to obtain the three-dimensional head information of the target person includes:
determining the reconstruction position of the head of the target person in the three-dimensional space according to the height information;
and performing three-dimensional reconstruction on the head of the target person at the reconstruction position in the three-dimensional space according to the head depth information to obtain the three-dimensional head information of the target person.
Optionally, the processor 901 further performs the following steps:
and constructing the ground of the three-dimensional space according to the human head projection central point.
The electronic device may be a mobile phone, a monitor, a computer, a server, or another device capable of performing social distance detection.
The electronic device provided by the embodiment of the invention can realize each process realized by the social distance detection method in the method embodiment, can achieve the same beneficial effects, and is not repeated here to avoid repetition.
The embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements each process of the social distance detection method provided in the embodiment of the present invention, and can achieve the same technical effect, and in order to avoid repetition, the computer program is not described herein again.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention and is not to be construed as limiting its scope; the scope of the invention is defined by the appended claims.

Claims (12)

1. A social distance detection method is characterized by comprising the following steps:
acquiring an image to be detected, wherein the image to be detected comprises image depth information and a head to be detected, and the head to be detected comprises heads of a plurality of target persons;
predicting positioning point information and human head frame information of each target person in the image to be detected through a pre-trained neural network model, and extracting corresponding human head depth information according to the human head frame information;
calculating height information of a corresponding target person according to the positioning point information;
according to the height information and the head depth information, performing three-dimensional reconstruction on the head of the target person to obtain three-dimensional head information of the target person;
and calculating the social distance among a plurality of target persons based on the three-dimensional head information of the target persons.
2. The method of claim 1, wherein the pre-trained neural network model comprises a public network, and a first branch network, a second branch network and a third branch network connected behind the public network, wherein the public network is used for extracting common features of the first branch network, the second branch network and the third branch network, the first branch network and the second branch network are used for extracting location point information of the head, and the third branch network is used for extracting the head frame information.
3. The method as claimed in claim 2, wherein the locating point information of the human head comprises a human head central point, a human head projection central point and human foot key points, and the predicting the locating point information of each target person in the image to be detected by a pre-trained neural network model comprises:
extracting common features of the image to be detected through the public network;
inputting the common features into the first branch network, predicting and obtaining a Gaussian heat map of the head through the first branch network, and obtaining a head central point based on the Gaussian heat map of the head;
and inputting the shared features into the second branch network, and predicting through the second branch network to obtain a human head projection central point and a human foot key point, wherein the second branch network performs offset prediction based on the human head central point.
4. The method of claim 3, wherein the predicting the head frame information of each target person in the image to be detected through a pre-trained neural network model comprises:
and inputting the common features into the third branch network, and predicting to obtain the head frame information through the third branch network, wherein the third branch network performs offset prediction based on the head central point.
5. The method of claim 3, wherein said calculating height information for a corresponding target person based on said location point information comprises:
and calculating the height information of the corresponding target person by combining the head central point, the head projection central point and the foot key point.
6. The method of claim 3, wherein extracting common features over the public network comprises:
sequentially carrying out convolution downsampling of a first multiple, convolution downsampling of a second multiple and convolution downsampling of a third multiple on an image to be detected to sequentially obtain a first common feature map, a second common feature map and a third common feature map, wherein the numerical value of the first multiple is smaller than the numerical value of the second multiple, the numerical value of the second multiple is smaller than the numerical value of the third multiple, the tensor size of the first common feature map is larger than the tensor size of the second common feature map, and the tensor size of the second common feature map is larger than the tensor size of the third common feature map.
7. The method of claim 6, wherein said inputting said common features into said first branch network, predicting a Gaussian heat map of a human head through said first branch network, comprises:
performing convolution calculation after upsampling the third common characteristic diagram to obtain a first upsampled characteristic diagram with the same tensor size as the second common characteristic diagram;
performing convolution calculation on the second common feature map, and performing feature fusion on the second common feature map and the first up-sampling feature map to obtain a first fusion feature map;
performing convolution calculation after the first fusion characteristic is subjected to up-sampling to obtain a second up-sampling characteristic image with the same tensor size as the first common characteristic image;
performing convolution calculation on the first common feature map, and performing feature fusion on the first common feature map and the second up-sampling feature to obtain a second fusion feature map;
and obtaining a Gaussian heat map based on the second fusion feature map.
8. The method of claim 3, wherein the three-dimensional reconstruction of the head of the target person based on the height information and the head depth information to obtain three-dimensional head information of the target person comprises:
determining the reconstruction position of the head of the target person in the three-dimensional space according to the height information;
and performing three-dimensional reconstruction on the head of the target person according to the head depth information at the reconstruction position in the three-dimensional space to obtain the three-dimensional head information of the target person.
9. The method of claim 8, wherein the method further comprises:
and constructing the ground of the three-dimensional space according to the human head projection central point.
10. An apparatus for social distance detection, the apparatus comprising:
an acquisition module, configured to acquire an image to be detected, wherein the image to be detected comprises image depth information and human heads to be detected, the human heads to be detected comprising the heads of a plurality of target persons;
a prediction module, configured to predict positioning point information and head frame information of each human head in the image to be detected through a pre-trained neural network model, and to extract corresponding head depth information according to the head frame information;
a first calculation module, configured to calculate height information of the corresponding target person according to the positioning point information;
a reconstruction module, configured to perform three-dimensional reconstruction of the head of the target person according to the height information and the head depth information to obtain three-dimensional head information of the target person;
and a second calculation module, configured to calculate social distances among a plurality of target persons based on the three-dimensional head information of the target persons.
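The second calculation module's job reduces to pairwise distances between reconstructed 3D head centres. A minimal sketch, assuming Euclidean distance in metres and a hypothetical 1.0 m alert threshold (the patent does not specify a metric or threshold here):

```python
from itertools import combinations

import numpy as np

def social_distances(heads, threshold=1.0):
    """Return (i, j, distance) for every pair of head centres closer than
    `threshold` metres; `heads` is a list of 3D head-centre coordinates."""
    alerts = []
    for (i, a), (j, b) in combinations(enumerate(heads), 2):
        d = float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
        if d < threshold:
            alerts.append((i, j, d))
    return alerts

# Hypothetical reconstructed head centres (x, y, z) in metres.
heads = [(0.0, 1.7, 4.0), (0.5, 1.6, 4.0), (3.0, 1.8, 6.0)]
alerts = social_distances(heads)
print(alerts)  # only persons 0 and 1 are within the threshold
```

Because the distances are computed between 3D reconstructions rather than 2D pixel positions, two people who merely overlap in the image but stand at different depths are not flagged.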
11. An electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the social distance detection method according to any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the social distance detection method according to any one of claims 1 to 9.
CN202011642568.5A 2020-12-31 2020-12-31 Social distance detection method and device, electronic equipment and storage medium Pending CN114694205A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011642568.5A CN114694205A (en) 2020-12-31 2020-12-31 Social distance detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011642568.5A CN114694205A (en) 2020-12-31 2020-12-31 Social distance detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114694205A true CN114694205A (en) 2022-07-01

Family

ID=82135820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011642568.5A Pending CN114694205A (en) 2020-12-31 2020-12-31 Social distance detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114694205A (en)

Similar Documents

Publication Publication Date Title
EP4002198A1 (en) Posture acquisition method and device, and key point coordinate positioning model training method and device
US10991072B2 (en) Method and device for fusing panoramic video images
CN106570507B (en) Multi-view-angle consistent plane detection and analysis method for monocular video scene three-dimensional structure
KR101608253B1 (en) Image-based multi-view 3d face generation
JP4950787B2 (en) Image processing apparatus and method
CN111598998A (en) Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
JPWO2018047687A1 (en) Three-dimensional model generation device and three-dimensional model generation method
WO2020134818A1 (en) Image processing method and related product
JP7292492B2 (en) Object tracking method and device, storage medium and computer program
EP4036863A1 (en) Human body model reconstruction method and reconstruction system, and storage medium
CN107194985A (en) A kind of three-dimensional visualization method and device towards large scene
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN111696196A (en) Three-dimensional face model reconstruction method and device
JP2023532285A (en) Object Recognition Neural Network for Amodal Center Prediction
CN113706373A (en) Model reconstruction method and related device, electronic equipment and storage medium
CN112802208B (en) Three-dimensional visualization method and device in terminal building
CN112883920A (en) Point cloud deep learning-based three-dimensional face scanning feature point detection method and device
CN115222895B (en) Image generation method, device, equipment and storage medium
CN111275610A (en) Method and system for processing face aging image
JP2023027782A (en) Image transition method, image transition model training method, device, electronics, storage medium, and computer program
US20220157016A1 (en) System and method for automatically reconstructing 3d model of an object using machine learning model
CN115345927A (en) Exhibit guide method and related device, mobile terminal and storage medium
CN114694205A (en) Social distance detection method and device, electronic equipment and storage medium
US20220207261A1 (en) Method and apparatus for detecting associated objects
CN113724378A (en) Three-dimensional modeling method and apparatus, computer-readable storage medium, and computer device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination