CN112465890A - Depth detection method and device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN112465890A
CN112465890A (application number CN202011335257.4A)
Authority
CN
China
Prior art keywords
image
key point
human body
human bodies
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011335257.4A
Other languages
Chinese (zh)
Inventor
李健华
李雷
王权
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202011335257.4A
Publication of CN112465890A
Priority to PCT/CN2021/109803 (WO2022110877A1)
Priority to TW110131410A (TW202221646A)
Legal status: Pending


Classifications

    • G06T 7/593 — Depth or shape recovery from stereo images
    • G06F 18/22 — Pattern recognition; matching criteria, e.g. proximity measures
    • G06T 7/11 — Region-based segmentation
    • G06T 7/55 — Depth or shape recovery from multiple images
    • G06T 7/62 — Analysis of geometric attributes of area, perimeter, diameter or volume
    • G06T 7/90 — Determination of colour characteristics
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners; connectivity analysis
    • G06V 40/10 — Human or animal bodies, e.g. vehicle occupants or pedestrians; body parts, e.g. hands
    • G06T 2207/10028 — Range image; depth image; 3D point clouds
    • G06T 2207/20081 — Training; learning
    • G06T 2207/20084 — Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a depth detection method, an apparatus, an electronic device, and a computer-readable storage medium. The method includes: acquiring at least one frame of image captured by an image capture device, the at least one frame of image including a current frame image; performing human body segmentation on the current frame image to obtain mask images of a plurality of human bodies; performing human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image; and determining depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information, the three-dimensional key point information, and the mask images of the plurality of human bodies.

Description

Depth detection method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to computer vision processing technologies, and in particular, to a depth detection method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In the related art, multi-person depth detection in images has important applications in Augmented Reality (AR) interaction, virtual photographing, and the like. How to achieve multi-person depth detection in images in scenarios that lack dedicated hardware such as a three-dimensional depth camera is a technical problem to be solved urgently.
Disclosure of Invention
The embodiment of the disclosure provides a depth detection method and device, electronic equipment and a computer-readable storage medium.
The technical solutions of the embodiments of the present disclosure are implemented as follows.
An embodiment of the present disclosure provides a depth detection method, which includes:
acquiring at least one frame of image captured by an image capture device, the at least one frame of image including a current frame image;
performing human body segmentation on the current frame image to obtain mask images of a plurality of human bodies; performing human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image;
and determining depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information, the three-dimensional key point information, and the mask images of the plurality of human bodies.
In some embodiments of the present disclosure, determining the depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information and three-dimensional key point information of the plurality of human bodies and the mask images of the plurality of human bodies includes:
matching the two-dimensional key point information of each of the plurality of human bodies with the mask image of each of the plurality of human bodies to obtain the two-dimensional key point information belonging to each human body;
and determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information belonging to each human body.
In some embodiments of the present disclosure, matching the two-dimensional key point information of each human body with the mask image of each human body to obtain the two-dimensional key point information belonging to each human body includes:
taking, from the two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of the human body whose position overlap degree with the mask image of each human body reaches a set value as the two-dimensional key point information of that human body.
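The overlap-based matching described above can be illustrated with a minimal numpy sketch. This is not the patent's implementation; the function name, the choice of "fraction of keypoints inside the mask" as the overlap degree, and the threshold value are all illustrative assumptions.

```python
import numpy as np

def match_keypoints_to_mask(keypoints_2d, mask, threshold=0.5):
    """Return True if a body's 2D keypoints overlap the mask enough.

    keypoints_2d: (N, 2) array of (x, y) pixel coordinates.
    mask: 2D boolean array, True where the mask covers a person.
    threshold: illustrative stand-in for the "set value" of the
    position overlap degree; here overlap = fraction of keypoints
    that land inside the mask.
    """
    h, w = mask.shape
    xs = np.clip(keypoints_2d[:, 0].astype(int), 0, w - 1)
    ys = np.clip(keypoints_2d[:, 1].astype(int), 0, h - 1)
    inside = mask[ys, xs]          # mask membership of each keypoint
    return bool(inside.mean() >= threshold)

# Toy example: a 10x10 mask covering the left half of the image.
mask = np.zeros((10, 10), dtype=bool)
mask[:, :5] = True
body_a = np.array([[1, 2], [2, 3], [3, 4]])   # all inside the mask
body_b = np.array([[8, 2], [9, 3], [7, 4]])   # all outside
print(match_keypoints_to_mask(body_a, mask))  # True
print(match_keypoints_to_mask(body_b, mask))  # False
```

Each set of detected keypoints would be tested against every mask, and the keypoints are assigned to the mask whose overlap reaches the set value.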
In some embodiments of the present disclosure, determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information belonging to each human body includes:
determining coordinate information of the three-dimensional key points corresponding to the two-dimensional key points of each human body; determining the depth information of the two-dimensional key points of each human body according to the coordinate information of the three-dimensional key points; and interpolating the depth information of the two-dimensional key points of each human body to obtain depth information of first pixel points in the mask image of each human body, where a first pixel point is any pixel point in the mask image of each human body other than the pixel points overlapping the two-dimensional key points.
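The interpolation step above — propagating sparse keypoint depths to every pixel of a mask — can be sketched as follows. The patent does not fix an interpolation scheme; inverse-distance weighting is used here purely as one simple, assumed choice, and the function name is hypothetical.

```python
import numpy as np

def interpolate_mask_depth(kp_xy, kp_depth, mask):
    """Fill every mask pixel with a depth interpolated from keypoint depths.

    kp_xy: (N, 2) keypoint pixel coordinates; kp_depth: (N,) depths.
    mask: 2D boolean array of one human body's mask image.
    Uses inverse-distance weighting (an assumed scheme, not the patent's).
    """
    ys, xs = np.nonzero(mask)
    pix = np.stack([xs, ys], axis=1).astype(float)             # (P, 2)
    d = np.linalg.norm(pix[:, None, :] - kp_xy[None], axis=2)  # (P, N) distances
    w = 1.0 / np.maximum(d, 1e-6)                              # inverse-distance weights
    depth = (w * kp_depth).sum(axis=1) / w.sum(axis=1)
    out = np.full(mask.shape, np.nan)                          # NaN outside the mask
    out[ys, xs] = depth
    return out

# Toy example: a 1x5 mask strip with keypoints at both ends.
kp_xy = np.array([[0.0, 0.0], [4.0, 0.0]])
kp_depth = np.array([1.0, 3.0])
mask = np.ones((1, 5), dtype=bool)
depth_map = interpolate_mask_depth(kp_xy, kp_depth, mask)
print(depth_map[0, 2])  # midpoint pixel receives the average depth, 2.0
```

In this way, every first pixel point of the mask obtains a depth value derived from the two-dimensional keypoints' depths.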
In some embodiments of the present disclosure, matching the two-dimensional key point information of each human body with the mask image of each human body to obtain the two-dimensional key point information belonging to each human body includes:
optimizing the two-dimensional key point information of the plurality of human bodies in the current frame image to obtain optimized two-dimensional key point information of the plurality of human bodies;
and taking, from the optimized two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of the human body whose position overlap degree with the mask image of each human body reaches a set value as the two-dimensional key point information of that human body.
In some embodiments of the present disclosure, optimizing the two-dimensional key point information of the plurality of human bodies in the current frame image includes:
in the case that the at least one frame of image also includes historical frame images, processing the two-dimensional key point information of the plurality of human bodies in the current frame image together with the two-dimensional key point information of the plurality of human bodies in the historical frame images to obtain the optimized two-dimensional key point information of the plurality of human bodies.
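The temporal optimization above — combining keypoints from the current frame with those of historical frames — can be sketched with exponential smoothing. The patent only states that current and historical keypoints are processed together; the smoothing scheme, the function name, and the weight `alpha` are illustrative assumptions.

```python
import numpy as np

def smooth_keypoints(history, alpha=0.7):
    """Temporally smooth 2D keypoints using historical frames.

    history: list of (N, 2) keypoint arrays, oldest first, with the
    last entry being the current frame. alpha weights the newer frame
    (exponential moving average -- an assumed optimization scheme).
    """
    smoothed = history[0].astype(float)
    for frame in history[1:]:
        smoothed = alpha * frame + (1 - alpha) * smoothed
    return smoothed

# One keypoint observed at (0, 0) in the historical frame and
# (10, 10) in the current frame; smoothing suppresses the jitter.
history = [np.array([[0.0, 0.0]]), np.array([[10.0, 10.0]])]
optimized = smooth_keypoints(history)  # weighted toward the newer frame
```

Such smoothing reduces frame-to-frame jitter before the optimized keypoints are matched against the mask images.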
In some embodiments of the present disclosure, the method further includes:
determining, according to the depth detection results of the plurality of human bodies in the current frame image, the position relationship between the plurality of human bodies and at least one target object in an AR scene;
determining a combined presentation mode of the plurality of human bodies and the at least one target object based on the position relationship;
and displaying, based on the combined presentation mode, an AR effect in which the plurality of human bodies and the at least one target object are superimposed.
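The AR presentation steps above amount to a per-pixel occlusion test between the detected human depths and a virtual object's depth. The sketch below is an assumed minimal compositor, not the patent's rendering pipeline; all names are illustrative.

```python
import numpy as np

def compose_ar(frame, human_depth, obj_rgb, obj_depth):
    """Composite a virtual object into the frame with depth-based occlusion.

    frame: (H, W, 3) camera image; obj_rgb: (H, W, 3) rendered object.
    human_depth: per-pixel human depth, np.inf where no human was detected.
    obj_depth: per-pixel object depth, np.inf where the object is absent.
    """
    out = frame.copy()
    obj_in_front = obj_depth < human_depth   # object closer than the human?
    out[obj_in_front] = obj_rgb[obj_in_front]
    return out

# Toy 2x2 scene: the object is in front of the human at (0,0),
# behind the human at nothing, and over empty background at (0,1).
frame = np.zeros((2, 2, 3), dtype=np.uint8)
human_depth = np.array([[1.0, np.inf], [1.0, np.inf]])
obj_depth = np.array([[0.5, 2.0], [np.inf, np.inf]])
obj_rgb = np.full((2, 2, 3), 255, dtype=np.uint8)
out = compose_ar(frame, human_depth, obj_rgb, obj_depth)
```

When the human's depth is smaller than the object's depth at a pixel, the human occludes the target object there, producing the combined presentation mode.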
In some embodiments of the present disclosure, the at least one frame of image captured by the image capturing device is a Red Green Blue (RGB) image.
In some embodiments of the present disclosure, the two-dimensional key point information represents two-dimensional key points of the human skeleton, and the three-dimensional key point information represents three-dimensional key points of the human skeleton.
The disclosed embodiment also provides a depth detection device, which comprises:
the device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring at least one frame of image acquired by image acquisition equipment, and the at least one frame of image comprises a current frame of image;
the processing module is configured to perform human body segmentation on the current frame image to obtain mask images of a plurality of human bodies, and to perform human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image;
and the detection module is used for determining the depth detection results of the human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the human bodies in the current frame image and the mask images of the human bodies.
In some embodiments of the present disclosure, the detection module is configured to determine the depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information and three-dimensional key point information of the plurality of human bodies and the mask images of the plurality of human bodies by:
matching the two-dimensional key point information of each of the plurality of human bodies with the mask image of each of the plurality of human bodies to obtain the two-dimensional key point information belonging to each human body;
and determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information belonging to each human body.
In some embodiments of the present disclosure, the detection module is configured to obtain the two-dimensional key point information belonging to each human body by:
taking, from the two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of the human body whose position overlap degree with the mask image of each human body reaches a set value as the two-dimensional key point information of that human body.
In some embodiments of the present disclosure, the detection module is configured to determine the depth detection result of each human body in the current frame image by:
determining coordinate information of the three-dimensional key points corresponding to the two-dimensional key points of each human body; determining the depth information of the two-dimensional key points of each human body according to the coordinate information of the three-dimensional key points; and interpolating the depth information of the two-dimensional key points of each human body to obtain depth information of first pixel points in the mask image of each human body, where a first pixel point is any pixel point in the mask image of each human body other than the pixel points overlapping the two-dimensional key points.
In some embodiments of the present disclosure, the detection module is configured to obtain the two-dimensional key point information belonging to each human body by:
optimizing the two-dimensional key point information of the plurality of human bodies in the current frame image to obtain optimized two-dimensional key point information of the plurality of human bodies;
and taking, from the optimized two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of the human body whose position overlap degree with the mask image of each human body reaches a set value as the two-dimensional key point information of that human body.
In some embodiments of the present disclosure, the detection module is configured to perform the optimization by:
in the case that the at least one frame of image also includes historical frame images, processing the two-dimensional key point information of the plurality of human bodies in the current frame image together with the two-dimensional key point information of the plurality of human bodies in the historical frame images to obtain the optimized two-dimensional key point information of the plurality of human bodies.
In some embodiments of the present disclosure, the processing module is further configured to:
determine, according to the depth detection results of the plurality of human bodies in the current frame image, the position relationship between the plurality of human bodies and at least one target object in an AR scene;
determine a combined presentation mode of the plurality of human bodies and the at least one target object based on the position relationship;
and display, based on the combined presentation mode, an AR effect in which the plurality of human bodies and the at least one target object are superimposed.
In some embodiments of the present disclosure, at least one frame of image captured by the image capturing device is an RGB image.
In some embodiments of the present disclosure, the two-dimensional key point information represents two-dimensional key points of the human skeleton, and the three-dimensional key point information represents three-dimensional key points of the human skeleton.
An embodiment of the present disclosure further provides an electronic device, which includes:
a memory configured to store executable instructions;
and a processor configured to execute the executable instructions stored in the memory to implement any one of the depth detection methods described above.
An embodiment of the present disclosure further provides a computer-readable storage medium storing executable instructions which, when executed by a processor, implement any one of the depth detection methods described above.
The embodiment of the disclosure has the following beneficial effects:
With the depth detection method and apparatus of the embodiments of the present disclosure, the depth detection results of a plurality of human bodies can be determined by combining the mask images of the plurality of human bodies with their two-dimensional key point information and three-dimensional key point information, without acquiring the depth information of the human bodies through dedicated hardware such as a three-dimensional depth camera. Depth detection of multiple human bodies in an image can therefore be achieved without relying on such dedicated hardware, and the method can be applied to scenarios such as AR interaction and virtual photographing.
Drawings
Fig. 1 is a schematic diagram of a terminal and a server connection provided by an embodiment of the present disclosure;
FIG. 2 is a schematic flow chart of an alternative depth detection method provided by the embodiments of the present disclosure;
FIG. 3 is a schematic diagram of two-dimensional key points of a human skeleton provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of another alternative depth detection method provided by the embodiments of the present disclosure;
FIG. 5 is a schematic diagram of a point cloud provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an alternative structure of a depth detection device provided in an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the examples provided herein are merely illustrative of the present disclosure and are not intended to limit the present disclosure. In addition, the embodiments provided below are some embodiments for implementing the disclosure, not all embodiments for implementing the disclosure, and the technical solutions described in the embodiments of the disclosure may be implemented in any combination without conflict.
It should be noted that, in the embodiments of the present disclosure, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a method or apparatus including a series of elements includes not only the explicitly recited elements but also other elements not explicitly listed or inherent to the method or apparatus. Without further limitation, an element preceded by the phrase "comprising a ..." does not exclude the presence of other elements (e.g., steps in a method, or elements in a device such as circuit portions, processors, programs, software, etc.) in the method or device that includes that element.
For example, the depth detection method provided by the embodiments of the present disclosure includes a series of steps, but is not limited to the described steps; similarly, the depth detection apparatus provided by the embodiments of the present disclosure includes a series of modules, but is not limited to the explicitly described modules, and may also include modules configured to acquire relevant information or perform processing based on the information.
The term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or C" may mean that A exists alone, A and C exist simultaneously, or C exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, C, and D" may mean including any one or more elements selected from the set consisting of A, C, and D.
In the related art, depth detection of human bodies in an image can be achieved using dedicated hardware such as a three-dimensional depth camera, for example a camera equipped with a binocular camera that obtains depth information using binocular vision; however, such dedicated hardware increases the application cost and limits the application scenarios to some extent.
When human body depth estimation is instead based on images captured by a monocular camera, the achievable depth estimation accuracy and the amount of information provided are relatively low. In some approaches, only the relative depth between the pixel points of a human body can be estimated, and the depth between the human body's pixel points and the camera cannot be estimated, which limits the application range to some extent; in other approaches, only a single depth value can be estimated for the pixel points of each human body, so the estimated depth information is limited; in still other approaches, depth estimation can be implemented with an image matching algorithm over consecutive frames, but this increases the consumption of time and computing resources and is unsuitable for low-power, real-time application scenarios.
In view of the foregoing technical problems, embodiments of the present disclosure provide a depth detection method, apparatus, electronic device, and computer-readable storage medium that can implement multi-person depth detection in an image without relying on dedicated hardware such as a three-dimensional depth camera. The depth detection method provided by the embodiments of the present disclosure can be applied to an electronic device; exemplary applications of the electronic device are described below.
In some embodiments, the electronic device provided by the embodiments of the present disclosure may be implemented as any of various terminals equipped with an image capture device, such as AR glasses, a notebook computer, a tablet computer, a desktop computer, or a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device). The image capture device may be, for example, a monocular camera; the terminal may be, for example, a handheld mobile phone with a camera.
For example, after receiving an image acquired by an image acquisition device, a terminal may perform depth detection on the image acquired by the image acquisition device according to the depth detection method of the embodiment of the present disclosure to obtain depth detection results of a plurality of human bodies in the image.
In some embodiments, the electronic device provided by the embodiments of the present disclosure may also be implemented as a server that forms a communication connection with the terminal. Fig. 1 is a schematic diagram of a terminal and a server in the embodiment of the present disclosure, and as shown in fig. 1, a terminal 100 is connected to a server 102 through a network 101, where the network 101 may be a wide area network or a local area network, or a combination of both.
In some embodiments, the server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present disclosure.
The terminal 100 is configured to acquire an image at a current mobile position through an image acquisition device; the captured image may be sent to the server 102; after receiving the image, the server 102 may perform depth detection on the received image according to the depth detection method of the embodiment of the present disclosure, to obtain depth detection results of a plurality of human bodies in the image.
The depth detection method according to the embodiments of the present disclosure is described below with reference to the exemplary applications above.
Fig. 2 is a schematic diagram of an optional process of the depth detection method according to the embodiment of the disclosure, and as shown in fig. 2, the process may include:
step 201: and acquiring at least one frame of image acquired by the image acquisition equipment, wherein the at least one frame of image comprises a current frame of image.
In the embodiment of the disclosure, the image acquisition device may acquire an image and may transmit at least one frame image including a current frame image to a processor of the electronic device.
In some embodiments, the at least one frame of image includes only the current frame image (the frame captured at the current time); in other embodiments, the at least one frame of image includes not only the current frame image but also historical frame images, i.e., one or more earlier frames captured by the image capture device.
In some embodiments, when the at least one frame of image comprises multiple frames, these may be consecutive frames continuously captured by the image capture device or non-consecutive frames, which is not limited in the embodiments of the present disclosure.
Step 202: performing human body segmentation on the current frame image to obtain mask images of a plurality of human bodies; and performing human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image.
In the embodiments of the present disclosure, the current frame image may be segmented using a pre-trained image segmentation model to obtain the mask images of the plurality of human bodies.
In some embodiments, the attributes of a human body image may include its area, the gray values of its pixel points, or other attributes. When the attribute used is area, segmenting the current frame image with the pre-trained image segmentation model can yield mask images of a plurality of human bodies whose areas are larger than a set area.
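The area-based filtering described above can be sketched in a few lines. This is an assumed post-processing step over the segmentation model's output masks; the function name and threshold are illustrative.

```python
import numpy as np

def filter_masks_by_area(masks, min_area):
    """Keep only human mask images whose pixel area exceeds a set area.

    masks: list of 2D boolean arrays, one per segmented human body.
    min_area: the "set area" threshold in pixels (illustrative value).
    """
    return [m for m in masks if int(m.sum()) > min_area]

# Toy example: a large mask (area 9) and a tiny spurious one (area 2).
big = np.ones((3, 3), dtype=bool)
small = np.zeros((3, 3), dtype=bool)
small[0, :2] = True
kept = filter_masks_by_area([big, small], min_area=5)
```

Discarding small masks suppresses spurious detections before keypoint matching.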
In the embodiment of the present disclosure, the image segmentation model may be implemented by a neural network, for example, the image segmentation model may be implemented by a full convolution neural network or other neural networks.
According to the embodiment of the disclosure, the image segmentation model can be predetermined according to actual requirements, wherein the actual requirements include but are not limited to time-consuming requirements, precision requirements and the like; that is, different image segmentation models can be set according to different actual requirements.
It should be noted that the above description is only an exemplary description of the image segmentation model, and the embodiments of the present disclosure are not limited thereto.
In the embodiment of the present disclosure, the two-dimensional key point information may include coordinate information of the two-dimensional key point, and the coordinate information of the two-dimensional key point includes an abscissa and an ordinate.
The three-dimensional key point information may include coordinate information of three-dimensional key points, where the coordinate information represents the coordinates of a three-dimensional key point in a camera coordinate system; the camera coordinate system is a three-dimensional rectangular coordinate system established with the optical center of the image acquisition device as the origin and the optical axis of the image acquisition device as the Z-axis, and the X-axis and Y-axis of the camera coordinate system are two mutually perpendicular coordinate axes of the image plane.
In some embodiments, after the two-dimensional key point information is determined, a three-dimensional key point corresponding to each two-dimensional key point can be determined according to the two-dimensional key point information, and the coordinate information of the three-dimensional key point can be determined; illustratively, a key point conversion model for realizing conversion from two-dimensional key points to three-dimensional key points can be trained in advance; in this way, after the trained key point conversion model is obtained, the coordinate information of a two-dimensional key point can be input into the trained key point conversion model to obtain the corresponding three-dimensional key point and its coordinate information. It should be noted that the above description is only an exemplary description for obtaining the coordinate information of the three-dimensional key points, and the embodiments of the present disclosure are not limited thereto.
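The key point conversion model described above could, for example, be a non-time-series fully-connected network that lifts flattened 2D key point coordinates to camera-space 3D coordinates. The sketch below only demonstrates the input/output shapes of such a model; the random weights are placeholders (a real model would be trained on paired 2D/3D key point data), and all names are illustrative assumptions.

```python
import numpy as np

def lift_2d_to_3d(kp2d, w1, b1, w2, b2):
    """Toy fully-connected lifting network: (num_kp, 2) in, (num_kp, 3) out."""
    x = kp2d.reshape(-1)                    # flatten to (num_kp * 2,)
    h = np.maximum(0.0, w1 @ x + b1)        # hidden layer with ReLU
    return (w2 @ h + b2).reshape(-1, 3)     # camera-space (X, Y, Z) per key point

rng = np.random.default_rng(0)
num_kp = 14                                 # e.g. 14 skeleton key points
w1 = rng.standard_normal((64, num_kp * 2)) * 0.1
b1 = np.zeros(64)
w2 = rng.standard_normal((num_kp * 3, 64)) * 0.1
b2 = np.zeros(num_kp * 3)
kp2d = rng.uniform(0, 255, size=(num_kp, 2))
kp3d = lift_2d_to_3d(kp2d, w1, b1, w2, b2)
```

A time-series convolution network, as also mentioned in the disclosure, would additionally consume key points from several consecutive frames rather than a single frame.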
In the embodiment of the present disclosure, the network structure of the keypoint conversion model is not limited, for example, the keypoint conversion model may be a time-series convolution network or a non-time-series fully-connected network; the network structure of the key point conversion model can be predetermined according to the actual application requirements.
In some embodiments, when the at least one frame of image is a multi-frame image, human body key point detection and tracking can be performed on the at least one frame of image, so as to obtain the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image; it can be understood that tracking the human body key points based on the multi-frame image allows the two-dimensional key point information of the plurality of human bodies in the current frame image to be obtained accurately, so that accurate three-dimensional key point information can be further obtained.
In some embodiments, when the at least one frame of image comprises consecutive frame images, human body key point detection and tracking can be performed on the consecutive frame images, so as to obtain the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image; it can be understood that tracking the human body key points based on consecutive frame images allows the two-dimensional key point information of the plurality of human bodies in the current frame image to be obtained even more accurately, so that accurate three-dimensional key point information can be further obtained.
Step 203: and determining depth detection results of the human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the human bodies in the current frame image and the mask images of the human bodies.
In practical applications, the steps 201 to 203 may be implemented based on a Processor of an electronic Device, where the Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic device implementing the above processor function may be other, and the embodiments of the present disclosure are not limited.
It can be seen that the depth detection results of the plurality of human bodies can be determined by combining the mask images of the plurality of human bodies with the two-dimensional key point information and three-dimensional key point information of the plurality of human bodies, without using special hardware such as a three-dimensional depth camera to acquire the depth information of the human bodies in the image; thus, depth detection of the plurality of human bodies in the image can be realized without depending on such special hardware, and the method can be applied to scenes such as AR interaction and virtual photographing.
Further, in the embodiment of the present disclosure, the depth information between each pixel point of a human body and the camera can be obtained, rather than estimating a single depth for the whole human body; the obtained depth information is therefore rich, and the method can be applied to multiple scenes. For example, the application range of the embodiment of the present disclosure includes but is not limited to: three-dimensional reconstruction and presentation of a dynamic human body in three-dimensional human body reconstruction; occluded display of a human body and a virtual scene in an augmented reality application; interaction between a human body and a reconstructed scene in an augmented reality application; and the like. Further, compared with a scheme that directly estimates the depth information of human body pixel points by an image matching algorithm based on continuous frames, determining the depth information of human body pixel points by using the two-dimensional key point information and the three-dimensional key point information of the human body reduces the consumption of time resources and computing resources, and achieves a balance between the estimation precision of the depth information and the time consumed to determine it.
In some embodiments, the at least one frame of image acquired by the image acquisition device is an RGB image; it can be seen that the depth detection of a plurality of human bodies can be realized based on easily acquired RGB images, which makes the method easy to implement.
In some embodiments, the two-dimensional keypoint information respectively belonging to each human body may be obtained by matching the two-dimensional keypoint information of each human body in the plurality of human bodies with the mask image of each human body in the mask images of the plurality of human bodies; and then, determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information respectively belonging to each human body.
It can be seen that, in the embodiment of the disclosure, the two-dimensional key point information of a plurality of human bodies in the current frame image is matched with the mask image of each human body, so that the two-dimensional key point information of each human body can be directly obtained, and further, the depth detection result of each human body is determined.
In the embodiment of the present disclosure, the two-dimensional key point information is information of two-dimensional key points representing a human skeleton, and the three-dimensional key point information is information of three-dimensional key points representing the human skeleton.
The two-dimensional key points of the human skeleton are used for representing key position points of the human body in the image plane, where the key position points include, but are not limited to, the facial features, neck, shoulders, elbows, hands, hips, knees, feet, and the like; the key position points of the human body can be preset according to actual conditions. Illustratively, referring to fig. 3, the two-dimensional key points of the human skeleton may represent 14 or 17 human key positions; in fig. 3, open circles represent the 14 human key positions, and the open circles and solid dots together represent the 17 human key positions.
It can be seen that the two-dimensional key points of each human skeleton can be obtained, and the depth detection result of each human body is determined based on the two-dimensional key points of its own skeleton; since the depth detection of different human bodies in the image depends on the two-dimensional skeleton key points of the respective human bodies, and the correlation between the skeleton key points of different human bodies is small, depth detection of multiple human bodies in the image can be realized.
In some embodiments, among the two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of the human body whose position overlapping degree with the mask image of each human body reaches a set value may be used as the two-dimensional key point information of that human body.
Here, the set value may be a value preset according to an actual application scenario, for example, the set value is between 80% and 90%; in the embodiment of the disclosure, the overlapping degree of the two-dimensional key point information of each human body and the human body mask image can be determined according to the coordinate information of the two-dimensional key point of each human body in a plurality of human bodies and the position information of the human body mask image.
In some embodiments, for any one human body mask image, if the position overlapping degree of the two-dimensional key point information of multiple human bodies and the mask image reaches a set value, the two-dimensional key point information of one human body with the highest position overlapping degree with the mask image may be selected from the two-dimensional key point information of multiple human bodies.
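The matching rule described in the two paragraphs above can be sketched as follows: for a given mask image, compute, for each body's 2D key point set, the fraction of key points falling inside the mask, and select the body whose overlap both reaches the set value and is highest. The function name, the 0.8 set value, and the toy data are illustrative assumptions.

```python
import numpy as np

def match_keypoints_to_mask(mask, keypoint_sets, set_value=0.8):
    """Return the index of the key point set best overlapping the mask, or None."""
    best_idx, best_overlap = None, 0.0
    h, w = mask.shape
    for idx, kps in enumerate(keypoint_sets):
        inside = sum(
            1 for x, y in kps
            if 0 <= int(y) < h and 0 <= int(x) < w and mask[int(y), int(x)]
        )
        overlap = inside / len(kps)
        # must reach the set value AND beat the best overlap seen so far
        if overlap >= set_value and overlap > best_overlap:
            best_idx, best_overlap = idx, overlap
    return best_idx

mask = np.zeros((10, 10), dtype=np.uint8)
mask[:, :5] = 1                              # mask covers the left half
body_a = [(1, 1), (2, 3), (4, 8)]            # (x, y) pairs, all inside the mask
body_b = [(7, 1), (8, 3), (9, 8)]            # all outside the mask
matched = match_keypoints_to_mask(mask, [body_a, body_b])
```

Here body A matches (overlap 1.0) and body B does not, so the mask is paired with body A's key point information.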
Therefore, in the embodiment of the disclosure, the two-dimensional key point information of each human body can be directly determined according to the position overlapping degree of the two-dimensional key point information and the human body mask image, which is beneficial to accurately obtaining the two-dimensional key point information of each human body.
In some embodiments, the two-dimensional keypoint information of a plurality of human bodies in the current frame image may be optimized to obtain optimized two-dimensional keypoint information of the plurality of human bodies; then, in the optimized two-dimensional key point information of a plurality of human bodies, the two-dimensional key point information of one human body, the position overlapping degree of which with the mask image of each human body reaches a set value, is used as the two-dimensional key point information of each human body.
As for the implementation of performing optimization processing on the two-dimensional key point information of the plurality of human bodies in the current frame image to obtain the optimized two-dimensional key point information: exemplarily, in the case that the at least one frame of image further includes a historical frame image, the two-dimensional key point information of the plurality of human bodies in the current frame image and the two-dimensional key point information of the plurality of human bodies in the historical frame image may be processed together to obtain the optimized two-dimensional key point information of the plurality of human bodies.
In some embodiments, time sequence filtering processing may be performed on the two-dimensional key point information of the plurality of human bodies in the current frame image and the two-dimensional key point information of the plurality of human bodies in the historical frame image, to obtain the filtered two-dimensional key point information of the plurality of human bodies; methods of the time sequence filtering processing include, but are not limited to, time sequence low-pass filtering and time sequence extended Kalman filtering. In other embodiments, skeleton limb length optimization processing may be performed on the two-dimensional key point information of the plurality of human bodies in the current frame image and in the historical frame image, to obtain the optimized two-dimensional key point information of the plurality of human bodies.
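As one concrete instance of the time sequence low-pass filtering mentioned above, a first-order filter can blend each body's current-frame key point coordinates with the filtered coordinates from the historical frame. This is a minimal sketch under assumed names; the smoothing factor `alpha` is an illustrative parameter, and an extended Kalman filter would replace this blend with a full predict/update cycle.

```python
def lowpass_keypoints(filtered_prev, current, alpha=0.5):
    """Blend current 2D key points with the previous filtered result."""
    if filtered_prev is None:                # first frame: nothing to blend with
        return list(current)
    return [
        (alpha * cx + (1.0 - alpha) * px, alpha * cy + (1.0 - alpha) * py)
        for (px, py), (cx, cy) in zip(filtered_prev, current)
    ]

# Simulate three frames of one jittery key point being smoothed over time.
state = None
for frame_kps in ([(10.0, 10.0)], [(14.0, 10.0)], [(10.0, 10.0)]):
    state = lowpass_keypoints(state, frame_kps)
```

After the three frames the filtered x-coordinate sits between the jittering raw values, illustrating the improved time sequence stability the disclosure aims for.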
Understandably, optimizing the two-dimensional key point information of the plurality of human bodies in the current frame image by combining the two-dimensional key point information of the plurality of human bodies in the historical frame image is beneficial to improving the time sequence stability of the two-dimensional key point information, and thus to improving the time sequence stability of the human body depth detection.
As for the implementation of determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information belonging to that human body: exemplarily, the coordinate information of the three-dimensional key points corresponding to the two-dimensional key points of each human body can be determined; the depth information of the two-dimensional key points of each human body is then determined according to the coordinate information of the three-dimensional key points; and interpolation processing is performed on the depth information of the two-dimensional key points of each human body to obtain the depth information of the first pixel points in the mask image of each human body, where a first pixel point represents any pixel point in the mask image of each human body other than the pixel points overlapping with the two-dimensional key point positions.
For example, since the two-dimensional key points correspond to the three-dimensional key points, the coordinate information of the three-dimensional key points of each human body (for example, the Z-axis coordinate in the camera coordinate system) may be used as the depth information of the two-dimensional key points of that human body, where the depth information of a two-dimensional key point represents the depth information of the pixel point overlapping with the position of that two-dimensional key point.
If a pixel point in the mask image of a human body does not overlap with any two-dimensional key point, that pixel point can be regarded as a first pixel point; in this case, the depth information of the first pixel points in the mask image of each human body can be obtained by performing interpolation processing on the depth information of the two-dimensional key points of that human body.
Interpolation is an important method for the approximation of a discrete function: from the values of the function at a limited number of points, the approximate values of the function at other points can be estimated. In some embodiments, the depth information of all pixel points in the mask image of each human body can be obtained based on interpolation processing under a preset spatial continuity constraint condition.
In some embodiments, after the depth information of each pixel point in the mask image of each human body is obtained, the depth information of each pixel point in the mask image of each human body may be subjected to smoothing filtering.
In some embodiments, after the depth information of each pixel point in the mask image of each human body is obtained, a depth map of each human body may be generated based on the depth information of each pixel point, and the depth map may be displayed in a display interface of the electronic device.
It can be seen that, in the embodiment of the present disclosure, depth information can be determined for any pixel point of the mask image of each human body, and depth detection of each human body in the image can be comprehensively achieved.
As for the implementation of performing interpolation processing on the depth information of the two-dimensional key points of each human body to obtain the depth information of the first pixel points in the mask image of that human body: exemplarily, a discrete function representing the relationship between pixel point positions and pixel point depth information can be determined according to the depth information of the two-dimensional key points of each human body; the values of the discrete function at the positions of the first pixel points are then supplemented according to the depth information of the two-dimensional key points, and the value of the discrete function at the position of a first pixel point is determined as the depth information of that first pixel point.
In the embodiment of the present disclosure, the above description is only for explaining the principle of interpolation processing, and a specific implementation of interpolation processing is not limited, and exemplary implementations of interpolation processing include, but are not limited to, nearest neighbor interpolation completion, interpolation completion based on breadth-first search, and the like.
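The nearest neighbor interpolation completion mentioned above can be sketched as follows: each first pixel point inside the mask takes the depth of its nearest 2D key point in the image plane. This brute-force version is for illustration only (names and toy data are assumptions); a breadth-first-search completion would instead propagate depths outward from the key points under the spatial continuity constraint.

```python
import numpy as np

def complete_depth(mask, keypoints_xy, keypoint_depths):
    """Assign each mask pixel the depth of its nearest 2D key point."""
    kp = np.asarray(keypoints_xy, dtype=float)       # (K, 2) as (x, y)
    d = np.asarray(keypoint_depths, dtype=float)     # (K,) key point depths
    depth_map = np.zeros(mask.shape, dtype=float)
    for y, x in zip(*np.nonzero(mask)):              # only pixels inside the mask
        dist2 = (kp[:, 0] - x) ** 2 + (kp[:, 1] - y) ** 2
        depth_map[y, x] = d[int(np.argmin(dist2))]
    return depth_map

mask = np.ones((4, 4), dtype=np.uint8)               # toy 4x4 human mask
depth_map = complete_depth(mask, [(0, 0), (3, 3)], [1.0, 2.0])
```

The resulting dense depth map could then be smoothed (for example with a spatial low-pass filter), matching the smoothing filtering step described earlier.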
Understandably, the depth information of the two-dimensional key points of each human body is subjected to interpolation processing, so that the requirement of spatial continuity of the depth information of the pixel points of each human body can be met, and further the spatial continuity of the human body depth detection result can be favorably improved.
The depth detection method according to the embodiment of the present disclosure is further described below with reference to the drawings.
Fig. 4 is another optional flowchart of the depth detection method according to the embodiment of the disclosure. As shown in fig. 4, the image acquisition device may send the acquired multi-frame images to the processor of the electronic device, where the multi-frame images include the current frame image and historical frame images, and the multi-frame images are RGB images; the processor can perform human body image segmentation on the current frame image of the at least one frame of image to obtain mask images of a plurality of human bodies, and perform human body key point detection and tracking based on the multi-frame images to obtain the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image. After the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image are obtained, post-processing optimization can be further executed, where the post-processing optimization includes the aforementioned process of performing time sequence filtering processing on the two-dimensional key point information and the aforementioned process of performing interpolation processing on the depth information of the two-dimensional key points.
After post-processing optimization is executed, according to two-dimensional key point information and three-dimensional key point information of a plurality of human bodies in a current frame image and mask images of the human bodies, depth detection results of the human bodies in the current frame image are determined, a depth map of each human body is generated based on the depth detection results of the human bodies in the current frame image, and the depth maps can be displayed in a display interface of electronic equipment to realize man-machine interaction.
In some embodiments, a point cloud corresponding to each pixel point in the depth map may also be shown; fig. 5 is a schematic diagram of a point cloud provided by an embodiment of the present disclosure, in fig. 5, points in a human body outline represent a point cloud formed by pixel points, thickened solid dots represent skeleton key points, and connecting lines between the thickened solid dots represent a skeleton of a human body.
Displaying the point cloud corresponding to each pixel point in the depth map makes it convenient to intuitively understand the positions of the pixel points; further, displaying the skeleton key points makes it convenient to intuitively understand the relationship between the pixel points and the skeleton key points.
In some embodiments, after obtaining the depth detection results of the multiple human bodies in the current frame image, the AR effect may be displayed based on the depth detection results of the multiple human bodies.
In some embodiments, the position relationship between each of the plurality of human bodies and at least one target object in the AR scene may be determined according to the depth detection results of the plurality of human bodies in the current frame image; determining a combined presentation mode of a plurality of human bodies and at least one target object based on the position relation; and displaying the AR effect superposed by the multiple human bodies and the at least one target object based on the combined presentation mode.
Here, the target object may be an object actually existing in the real scene, whose depth information may be known or may be determined from captured data of the target object; the target object may also be a preset virtual object whose depth information is predetermined.
In one embodiment, the position relationship between the plurality of human bodies and at least one target object in the AR scene, as well as the position relationship among the plurality of human bodies, may be determined according to the depth detection results of the plurality of human bodies and the depth information of the target object. Illustratively, the position relationship between a human body and a target object in the AR scene may be: 1) the human body is closer to the image acquisition device than the target object; 2) the target object is closer to the image acquisition device than the human body; 3) the human body is located on the right side, left side, upper side, or lower side of the target object; 4) a part of the human body is closer to the image acquisition device than the target object, and another part is farther from the image acquisition device than the target object. The position relationship between two human bodies may be: 1) one human body is closer to the image acquisition device than the other; 2) one human body is located on the right side, left side, upper side, or lower side of the other; 3) a part of one human body is closer to the image acquisition device than the other human body, and another part is farther away. It should be noted that the above is only an exemplary description of the position relationships between the plurality of human bodies and the at least one target object in the AR scene, and the embodiments of the present disclosure are not limited thereto.
After the position relation between the multiple human bodies and the at least one target object in the AR scene is determined, the combined presentation mode of the multiple human bodies and the at least one target object can be determined, and the combined presentation mode reflects the position relation, so that the AR effect of the multiple human bodies and the at least one target object which are overlapped is displayed based on the combined presentation mode, and the AR display effect is favorably improved.
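One simple combined presentation mode implied above is per-pixel occlusion: at each pixel, show whichever of the human body and the target object is closer to the image acquisition device (smaller depth). The sketch below assumes aligned per-pixel depth and color maps for both; all names and the toy 1x2-pixel data are illustrative.

```python
import numpy as np

def composite(body_depth, body_rgb, object_depth, object_rgb):
    """Per-pixel occlusion: the entity with smaller depth wins at each pixel."""
    body_in_front = body_depth < object_depth            # (H, W) boolean mask
    return np.where(body_in_front[..., None], body_rgb, object_rgb)

body_depth = np.array([[1.0, 3.0]])                      # body nearer on the left pixel
object_depth = np.full((1, 2), 2.0)                      # target object at depth 2
body_rgb = np.full((1, 2, 3), 255, dtype=np.uint8)       # white stands in for the body
object_rgb = np.zeros((1, 2, 3), dtype=np.uint8)         # black stands in for the object
frame = composite(body_depth, body_rgb, object_depth, object_rgb)
```

On the left pixel the body occludes the object; on the right pixel the object occludes the body, which is exactly the occluded AR display effect the per-pixel depth detection results enable.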
Based on the depth detection method described in the foregoing embodiment, the embodiments of the present disclosure further provide a depth detection apparatus, which may be located in the electronic device described above.
Fig. 6 is a schematic diagram of an optional component structure of a depth detection apparatus provided in an embodiment of the present disclosure, and as shown in fig. 6, the depth detection apparatus 600 may include:
an obtaining module 601, configured to obtain at least one frame of image acquired by an image acquisition device, where the at least one frame of image includes a current frame of image;
a processing module 602, configured to perform human body image segmentation on the current frame image to obtain a plurality of mask images of human bodies; detecting key points of the human body of the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of a plurality of human bodies in the current frame of image;
a detecting module 603, configured to determine depth detection results of the multiple human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the multiple human bodies in the current frame image, and the mask images of the multiple human bodies.
In some embodiments of the present disclosure, the detecting module 603 is configured to determine depth detection results of a plurality of human bodies in the current frame image according to two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image and mask images of the plurality of human bodies, and includes:
matching the two-dimensional key point information of each human body in the multiple human bodies with the mask image of each human body in the mask images of the multiple human bodies to obtain two-dimensional key point information respectively belonging to each human body;
and determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information respectively belonging to each human body.
In some embodiments of the present disclosure, the detecting module 603 is configured to obtain two-dimensional keypoint information respectively belonging to each human body by matching the two-dimensional keypoint information of each human body in the multiple human bodies with the mask image of each human body in the mask images of the multiple human bodies, and includes:
and in the two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of one human body, the position overlapping degree of which with the mask image of each human body reaches a set value, is used as the two-dimensional key point information of each human body.
In some embodiments of the present disclosure, the detecting module 603 is configured to determine a depth detection result of each human body in the current frame image according to three-dimensional key point information corresponding to two-dimensional key point information respectively belonging to each human body, and includes:
determining coordinate information of three-dimensional key point information corresponding to the two-dimensional key point information of each human body; determining the depth information of the two-dimensional key points of each human body according to the coordinate information of the three-dimensional key points; carrying out interpolation processing on the depth information of the two-dimensional key points of each human body to obtain the depth information of a first pixel point in the mask image of each human body; and the first pixel point represents any pixel point except the pixel point overlapped with the two-dimensional key point in the mask image of each human body.
In some embodiments of the present disclosure, the detecting module 603 is configured to obtain two-dimensional keypoint information respectively belonging to each human body by matching the two-dimensional keypoint information of each human body in the multiple human bodies with the mask image of each human body in the mask images of the multiple human bodies, and includes:
optimizing the two-dimensional key point information of the human bodies in the current frame image to obtain optimized two-dimensional key point information of the human bodies;
and in the optimized two-dimensional key point information of a plurality of human bodies, taking the two-dimensional key point information of one human body, of which the position overlapping degree with the mask image of each human body reaches a set value, as the two-dimensional key point information of each human body.
In some embodiments of the present disclosure, the detecting module 603 is configured to perform optimization processing on the two-dimensional key point information of multiple human bodies in the current frame image, so as to obtain the optimized two-dimensional key point information of multiple human bodies, and the method includes:
and under the condition that the at least one frame of image also comprises a historical frame of image, processing the two-dimensional key point information of the multiple human bodies in the current frame of image and the two-dimensional key point information of the multiple human bodies in the historical frame of image to obtain the optimized two-dimensional key point information of the multiple human bodies.
In some embodiments of the present disclosure, the processing module 602 is further configured to:
determining the position relation between the human bodies and at least one target object in the AR scene according to the depth detection results of the human bodies in the current frame image;
determining a combined presentation mode of the plurality of human bodies and the at least one target object based on the position relation;
and displaying the AR effect superposed by the human bodies and the at least one target object based on the combined presentation mode.
In some embodiments of the present disclosure, at least one frame of image captured by the image capturing device is an RGB image.
In some embodiments of the present disclosure, the two-dimensional key point information is information of two-dimensional key points representing a human skeleton, and the three-dimensional key point information is information of three-dimensional key points representing the human skeleton.
It should be noted that the above description of the embodiment of the apparatus, similar to the above description of the embodiment of the method, has similar beneficial effects as the embodiment of the method. For technical details not disclosed in the embodiments of the apparatus of the present disclosure, reference is made to the description of the embodiments of the method of the present disclosure.
It should be noted that, in the embodiment of the present disclosure, if the above depth detection method is implemented in the form of a software functional module and sold or used as a standalone product, it may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a terminal, a server, etc.) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present disclosure are not limited to any specific combination of hardware and software.
Correspondingly, the embodiment of the present disclosure further provides a computer program product, where the computer program product includes computer-executable instructions, and the computer-executable instructions are used to implement the depth detection method provided by the embodiment of the present disclosure.
Accordingly, an embodiment of the present disclosure further provides a computer storage medium, where computer-executable instructions are stored on the computer storage medium, and the computer-executable instructions are used to implement the depth detection method provided by the foregoing embodiment.
An embodiment of the present disclosure further provides an electronic device. FIG. 7 is a schematic structural diagram of an optional electronic device provided in an embodiment of the present disclosure. As shown in FIG. 7, the electronic device 700 includes:
a memory 701 for storing executable instructions;
a processor 702, configured to execute the executable instructions stored in the memory, to implement any one of the depth detection methods described above.
The memory 701 is configured to store computer programs and applications to be executed by the processor 702, and may also cache data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 702 and by modules in the electronic device. The memory 701 may be implemented by a flash memory (FLASH) or a Random Access Memory (RAM).
The processor 702, when executing the program, implements any of the depth detection methods described above.
The processor 702 may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It can be understood that other electronic devices may also implement the functions of the above processor, and the embodiments of the present disclosure are not limited thereto.
The computer-readable storage medium/memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, a Compact Disc Read-Only Memory (CD-ROM), or the like; it may also be any of various terminals that include one of the above-mentioned memories or any combination thereof, such as a mobile phone, a computer, a tablet device, or a personal digital assistant.
Here, it should be noted that the above description of the storage medium and device embodiments is similar to the description of the method embodiments, and the storage medium and device embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the storage medium and device embodiments of the present disclosure, reference is made to the description of the method embodiments of the present disclosure.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

It should be understood that, in the various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The above serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described device embodiments are merely illustrative. For example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present disclosure.
In addition, all the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may serve as a single unit separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Alternatively, if the integrated unit of the present disclosure is implemented in the form of a software functional module and sold or used as a separate product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device to perform all or part of the methods according to the embodiments of the present disclosure. The aforementioned storage medium includes various media that can store program code, such as a removable storage device, a ROM, a magnetic disk, or an optical disc.
The methods disclosed in the several method embodiments provided in this disclosure may be combined arbitrarily without conflict to arrive at new method embodiments.
The features disclosed in the several method or apparatus embodiments provided in this disclosure may be combined in any combination to arrive at a new method or apparatus embodiment without conflict.
The above description is merely of specific embodiments of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any person skilled in the art could readily conceive of changes or substitutions within the technical scope disclosed in the present disclosure, and all such changes or substitutions shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. A depth detection method, the method comprising:
acquiring at least one frame of image acquired by image acquisition equipment, wherein the at least one frame of image comprises a current frame of image;
performing human body segmentation on the current frame image to obtain mask images of a plurality of human bodies; performing human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image;
and determining depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image and the mask images of the plurality of human bodies.
2. The method according to claim 1, wherein the determining depth detection results of the plurality of human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the plurality of human bodies in the current frame image and the mask images of the plurality of human bodies comprises:
matching the two-dimensional key point information of each of the plurality of human bodies with the mask image of each of the plurality of human bodies to obtain two-dimensional key point information respectively belonging to each human body;
and determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information respectively belonging to each human body.
3. The method according to claim 2, wherein the matching the two-dimensional key point information of each of the plurality of human bodies with the mask image of each of the plurality of human bodies to obtain the two-dimensional key point information respectively belonging to each human body comprises:
taking, from among the two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of one human body whose degree of position overlap with the mask image of each human body reaches a set value as the two-dimensional key point information of said each human body.
4. The method according to claim 2, wherein the determining the depth detection result of each human body in the current frame image according to the three-dimensional key point information corresponding to the two-dimensional key point information respectively belonging to each human body comprises:
determining coordinate information of three-dimensional key point information corresponding to the two-dimensional key point information of each human body;
determining the depth information of the two-dimensional key points of each human body according to the coordinate information of the three-dimensional key points;
performing interpolation processing on the depth information of the two-dimensional key points of each human body to obtain depth information of a first pixel point in the mask image of each human body, wherein the first pixel point represents any pixel point in the mask image of each human body other than the pixel points overlapping with the two-dimensional key points.
5. The method according to any one of claims 2 to 4, wherein the matching the two-dimensional key point information of each of the human bodies with the mask image of each of the human bodies to obtain the two-dimensional key point information respectively belonging to each human body comprises:
optimizing the two-dimensional key point information of the human bodies in the current frame image to obtain optimized two-dimensional key point information of the human bodies;
and taking, from among the optimized two-dimensional key point information of the plurality of human bodies, the two-dimensional key point information of one human body whose degree of position overlap with the mask image of each human body reaches a set value as the two-dimensional key point information of said each human body.
6. The method according to claim 5, wherein the optimizing the two-dimensional key point information of the plurality of human bodies in the current frame image to obtain the optimized two-dimensional key point information of the plurality of human bodies comprises:
in a case where the at least one frame of image further comprises a historical frame image, processing the two-dimensional key point information of the plurality of human bodies in the current frame image and the two-dimensional key point information of the plurality of human bodies in the historical frame image to obtain the optimized two-dimensional key point information of the plurality of human bodies.
7. The method according to any one of claims 1 to 6, further comprising:
determining the position relation between the human bodies and at least one target object in an Augmented Reality (AR) scene according to the depth detection results of the human bodies in the current frame image;
determining a combined presentation mode of the plurality of human bodies and the at least one target object based on the position relation;
and displaying, based on the combined presentation mode, an AR effect in which the plurality of human bodies and the at least one target object are superimposed.
8. The method according to any one of claims 1 to 7, wherein at least one frame of image collected by the image collecting device is a red, green and blue (RGB) image.
9. The method according to any one of claims 1 to 8, wherein the two-dimensional key point information comprises two-dimensional key points representing the human skeleton, and the three-dimensional key point information comprises three-dimensional key points representing the human skeleton.
10. A depth detection apparatus, the apparatus comprising:
an acquisition module, a processing module and a detection module, wherein the acquisition module is configured to acquire at least one frame of image acquired by an image acquisition device, and the at least one frame of image comprises a current frame image;
the processing module is configured to perform human body segmentation on the current frame image to obtain mask images of a plurality of human bodies, and to perform human body key point detection on the at least one frame of image to obtain two-dimensional key point information and three-dimensional key point information of the plurality of human bodies in the current frame image;
and the detection module is used for determining the depth detection results of the human bodies in the current frame image according to the two-dimensional key point information and the three-dimensional key point information of the human bodies in the current frame image and the mask images of the human bodies.
11. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for executing executable instructions stored in the memory to implement the method of any one of claims 1 to 9.
12. A computer-readable storage medium having stored thereon executable instructions for, when executed by a processor, implementing the method of any one of claims 1 to 9.
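Reading the claims above as an algorithm, claim 3 assigns a set of 2D key points to a human body when their positional overlap with that body's mask image reaches a set value, and claim 4 spreads the key-point depths over the remaining mask pixels by interpolation. The sketch below is a hypothetical illustration only: the 80% overlap threshold and the nearest-key-point interpolation are assumptions for readability, not details disclosed in the patent.

```python
import numpy as np

def match_keypoints_to_mask(keypoints_2d, mask, threshold=0.8):
    """Claim 3 sketch: key point information matches a mask image when the
    fraction of key points falling inside the mask reaches a set value."""
    inside = [bool(mask[int(y), int(x)]) for x, y in keypoints_2d]
    return sum(inside) / len(inside) >= threshold

def densify_depth(keypoints_2d, keypoint_depths, mask):
    """Claim 4 sketch: assign each remaining mask pixel (the 'first pixel
    point') the depth of its nearest matched key point."""
    depth_map = np.zeros(mask.shape, dtype=np.float32)
    pts = np.asarray(keypoints_2d, dtype=np.float32)    # (K, 2), (x, y) order
    depths = np.asarray(keypoint_depths, dtype=np.float32)
    for y, x in zip(*np.nonzero(mask)):
        # Squared distance from this mask pixel to every key point.
        dist2 = (pts[:, 0] - x) ** 2 + (pts[:, 1] - y) ** 2
        depth_map[y, x] = depths[np.argmin(dist2)]
    return depth_map
```

With per-body masks from segmentation and per-body key points from detection, iterating these two steps over all detected bodies would yield the kind of dense, per-body depth detection result that claim 1 describes.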
CN202011335257.4A 2020-11-24 2020-11-24 Depth detection method and device, electronic equipment and computer readable storage medium Pending CN112465890A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202011335257.4A CN112465890A (en) 2020-11-24 2020-11-24 Depth detection method and device, electronic equipment and computer readable storage medium
PCT/CN2021/109803 WO2022110877A1 (en) 2020-11-24 2021-07-30 Depth detection method and apparatus, electronic device, storage medium and program
TW110131410A TW202221646A (en) 2020-11-24 2021-08-25 Depth detection method, electronic equipment, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011335257.4A CN112465890A (en) 2020-11-24 2020-11-24 Depth detection method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112465890A true CN112465890A (en) 2021-03-09

Family

ID=74798865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011335257.4A Pending CN112465890A (en) 2020-11-24 2020-11-24 Depth detection method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112465890A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022110877A1 (en) * 2020-11-24 2022-06-02 深圳市商汤科技有限公司 Depth detection method and apparatus, electronic device, storage medium and program

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631861A (en) * 2015-12-21 2016-06-01 浙江大学 Method of restoring three-dimensional human body posture from unmarked monocular image in combination with height map
CN105664462A (en) * 2016-01-07 2016-06-15 北京邮电大学 Auxiliary training system based on human body posture estimation algorithm
US20180047175A1 (en) * 2016-08-12 2018-02-15 Nanjing Huajie Imi Technology Co., Ltd Method for implementing human skeleton tracking system based on depth data
CN108460338A (en) * 2018-02-02 2018-08-28 北京市商汤科技开发有限公司 Human body posture estimation method and apparatus, electronic device, storage medium, and program
CN109325995A (en) * 2018-09-13 2019-02-12 叠境数字科技(上海)有限公司 Low-resolution multi-view hand reconstruction method based on a parametric hand model
CN109948526A (en) * 2019-03-18 2019-06-28 北京市商汤科技开发有限公司 Image processing method and device, detection device and storage medium
CN110276768A (en) * 2019-06-28 2019-09-24 京东方科技集团股份有限公司 Image partition method, image segmentation device, image segmentation apparatus and medium
CN110276831A (en) * 2019-06-28 2019-09-24 Oppo广东移动通信有限公司 Three-dimensional model construction method and apparatus, device, and computer-readable storage medium
CN110348524A (en) * 2019-07-15 2019-10-18 深圳市商汤科技有限公司 Human body key point detection method and apparatus, electronic device and storage medium
CN110598556A (en) * 2019-08-12 2019-12-20 深圳码隆科技有限公司 Human body shape and posture matching method and device
WO2020072972A1 (en) * 2018-10-05 2020-04-09 Magic Leap, Inc. A cross reality system
TW202025719A (en) * 2018-12-21 2020-07-01 大陸商北京市商湯科技開發有限公司 Method, apparatus and electronic device for image processing and storage medium thereof
WO2020144784A1 (en) * 2019-01-09 2020-07-16 株式会社Fuji Image processing device, work robot, substrate inspection device, and specimen inspection device
CN111539992A (en) * 2020-04-29 2020-08-14 北京市商汤科技开发有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111652110A (en) * 2020-05-28 2020-09-11 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN111753747A (en) * 2020-06-28 2020-10-09 高新兴科技集团股份有限公司 Violent motion detection method based on monocular camera and three-dimensional attitude estimation
WO2020207190A1 (en) * 2019-04-12 2020-10-15 Oppo广东移动通信有限公司 Three-dimensional information determination method, three-dimensional information determination device, and terminal apparatus
CN111882408A (en) * 2020-09-27 2020-11-03 北京达佳互联信息技术有限公司 Virtual trial method and device, electronic equipment and storage equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAROL MATUSIAK et al.: "Depth-based Descriptor for Matching Keypoints in 3D Scenes", Intl Journal of Electronics and Telecommunications, vol. 64, no. 3, 31 December 2018 (2018-12-31), pages 299-306 *
YAN Fenting et al.: "Video-based Real-time Multi-person Pose Estimation Method", Laser & Optoelectronics Progress, no. 02, 31 January 2020 (2020-01-31), pages 97-104 *


Similar Documents

Publication Publication Date Title
CN108062536B (en) Detection method and device and computer storage medium
CN112419388A (en) Depth detection method and device, electronic equipment and computer readable storage medium
CN113706699B (en) Data processing method and device, electronic equipment and computer readable storage medium
CN111028358B (en) Indoor environment augmented reality display method and device and terminal equipment
CN108895981A (en) A kind of method for three-dimensional measurement, device, server and storage medium
CN112927363A (en) Voxel map construction method and device, computer readable medium and electronic equipment
CN110866977A (en) Augmented reality processing method, device and system, storage medium and electronic equipment
WO2022110877A1 (en) Depth detection method and apparatus, electronic device, storage medium and program
CN111833457A (en) Image processing method, apparatus and storage medium
WO2023024441A1 (en) Model reconstruction method and related apparatus, and electronic device and storage medium
CN112115900B (en) Image processing method, device, equipment and storage medium
CN113220251A (en) Object display method, device, electronic equipment and storage medium
TW202244853A (en) 3d reconstruction method, apparatus and system, storage medium and computer device
CN112308977A (en) Video processing method, video processing apparatus, and storage medium
CN112270709A (en) Map construction method and device, computer readable storage medium and electronic device
CN114882106A (en) Pose determination method and device, equipment and medium
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN112465890A (en) Depth detection method and device, electronic equipment and computer readable storage medium
WO2021098554A1 (en) Feature extraction method and apparatus, device, and storage medium
CN111179408A (en) Method and apparatus for three-dimensional modeling
CN117711066A (en) Three-dimensional human body posture estimation method, device, equipment and medium
CN112907657A (en) Robot repositioning method, device, equipment and storage medium
CN117197405A (en) Augmented reality method, system and storage medium for three-dimensional object
CN112365530A (en) Augmented reality processing method and device, storage medium and electronic equipment
CN109816791B (en) Method and apparatus for generating information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40039400

Country of ref document: HK