WO2023273499A1

WO2023273499A1 - Depth measurement method and apparatus, electronic device, and storage medium

Info

Publication number: WO2023273499A1
Application number: PCT/CN2022/085920
Authority: WO
Inventors: 赵佳; 谢符宝; 刘文韬; 钱晨
Original assignee: 上海商汤智能科技有限公司
Priority date: 2021-06-28
Filing date: 2022-04-08
Publication date: 2023-01-05
Also published as: CN113345000A; TW202301276A

Abstract

The present disclosure relates to a depth measurement method and apparatus, an electronic device, and a storage medium. The method comprises: obtaining a plurality of frames to be detected, the plurality of frames to be detected comprising an image frame obtained by performing image acquisition on a target object from at least two acquisition viewing angles; performing key point detection on a target area in the target object according to the frames to be detected, and determining a plurality of key point detection results corresponding to the plurality of frames to be detected, the target area comprising a head area and/or a shoulder area; and determining depth information of the target object according to the plurality of key point detection results.

Description

Depth detection method and device, electronic equipment and storage medium

This application claims priority to Chinese patent application No. 202110721270.1, filed with the China Patent Office on June 28, 2021 and entitled "Depth Detection Method and Device, Electronic Equipment, and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical field

The present disclosure relates to the technical field of computers, and in particular to a depth detection method and device, electronic equipment and a storage medium.

Background

Depth information can reflect the distance of a human body in an image relative to the image acquisition device, and based on the depth information, the human body in the image can be spatially positioned. The binocular camera is a common and widely used image acquisition device. Based on at least two images collected by the binocular camera, the depth information of the human body in the image can be determined by matching between the images. However, the matching calculation between images is complex and its accuracy is easily affected. How to conveniently and accurately determine the depth information of the human body in the image has become an urgent problem to be solved.

Summary of the invention

The present disclosure proposes a technical solution for depth detection.

According to an aspect of the present disclosure, a depth detection method is provided, including:

Acquiring multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by capturing images of the target object from at least two acquisition angles of view; performing key point detection on a target area of the target object according to the frames to be detected, and determining multiple key point detection results corresponding to the multiple frames to be detected, wherein the target area includes a head area and/or a shoulder area; and determining depth information of the target object according to the multiple key point detection results.

In a possible implementation manner, determining the depth information of the target object according to the multiple key point detection results includes: acquiring at least two preset device parameters respectively corresponding to at least two acquisition devices, the at least two acquisition devices being used to capture images of the target object from the at least two acquisition angles of view; and determining the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results.

In a possible implementation manner, the depth information includes a depth distance, and the depth distance includes a distance between the target object and the optical center of the acquisition device; determining the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results includes: obtaining the depth distance according to preset external parameters among the at least two preset device parameters and coordinates of the multiple key point detection results in at least two forms; wherein the preset external parameters include relative parameters formed between the at least two acquisition devices.

In a possible implementation manner, the depth information includes an offset angle, and the offset angle includes a spatial angle of the target object relative to the optical axis of the acquisition device; determining the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results includes: obtaining the offset angle according to preset internal parameters among the at least two preset device parameters and coordinates of the multiple key point detection results in at least two forms; wherein the preset internal parameters include device parameters respectively corresponding to the at least two acquisition devices.

In a possible implementation manner, performing key point detection on the target area of the target object according to the frame to be detected includes: performing key point detection on the target area of the target object in the frame to be detected according to position information of the target object in a reference frame, to obtain the key point detection result corresponding to the frame to be detected, wherein the reference frame is a video frame located before the frame to be detected in the target video to which the frame to be detected belongs.

In a possible implementation manner, performing key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame, to obtain the key point detection result corresponding to the frame to be detected, includes: clipping the frame to be detected according to a first position of the target object in the reference frame to obtain a clipping result; and performing key point detection on the target area of the target object in the clipping result to obtain the key point detection result corresponding to the frame to be detected.

In a possible implementation manner, performing key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame, to obtain the key point detection result corresponding to the frame to be detected, includes: acquiring a second position of the target area of the target object in the reference frame; clipping the frame to be detected according to the second position to obtain a clipping result; and performing key point detection on the target area of the target object in the clipping result to obtain the key point detection result corresponding to the frame to be detected.

In a possible implementation manner, acquiring the second position of the target area of the target object in the reference frame includes: identifying the target area in the reference frame by using a first neural network to obtain the second position output by the first neural network; and/or obtaining the second position of the target area in the reference frame according to the key point detection result corresponding to the reference frame.

In a possible implementation manner, the method further includes: determining a position of the target object in a three-dimensional space according to depth information of the target object.

According to an aspect of the present disclosure, a depth detection device is provided, including:

An acquisition module, configured to acquire multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by capturing images of the target object from at least two acquisition angles of view; a key point detection module, configured to perform key point detection on a target area of the target object according to the frames to be detected, and determine multiple key point detection results corresponding to the multiple frames to be detected, wherein the target area includes a head area and/or a shoulder area; and a depth detection module, configured to determine the depth information of the target object according to the multiple key point detection results.

In a possible implementation manner, the depth detection module is configured to: acquire at least two preset device parameters respectively corresponding to at least two acquisition devices, the at least two acquisition devices being used to capture images of the target object from the at least two acquisition angles of view; and determine the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results.

In a possible implementation manner, the depth information includes a depth distance, and the depth distance includes a distance between the target object and the optical center of the acquisition device; the depth detection module is further configured to: obtain the depth distance according to preset external parameters among the at least two preset device parameters and coordinates of the multiple key point detection results in at least two forms; wherein the preset external parameters include relative parameters formed between the at least two acquisition devices.

In a possible implementation manner, the depth information includes an offset angle, and the offset angle includes a spatial angle of the target object relative to the optical axis of the acquisition device; the depth detection module is further configured to: obtain the offset angle according to preset internal parameters among the at least two preset device parameters and coordinates of the multiple key point detection results in at least two forms; wherein the preset internal parameters include device parameters respectively corresponding to the at least two acquisition devices.

In a possible implementation manner, the key point detection module is configured to: perform key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame, to obtain a key point detection result corresponding to the frame to be detected, wherein the reference frame is a video frame located before the frame to be detected in the target video to which the frame to be detected belongs.

In a possible implementation manner, the key point detection module is further configured to: clip the frame to be detected according to the first position of the target object in the reference frame to obtain a clipping result; and perform key point detection on the target area of the target object in the clipping result to obtain the key point detection result corresponding to the frame to be detected.

In a possible implementation manner, the key point detection module is further configured to: acquire a second position of the target area of the target object in the reference frame; clip the frame to be detected according to the second position to obtain a clipping result; and perform key point detection on the target area of the target object in the clipping result to obtain a key point detection result corresponding to the frame to be detected.

In a possible implementation manner, the key point detection module is further configured to: use a first neural network to identify the target area in the reference frame to obtain a second position output by the first neural network; and/or obtain the second position of the target area in the reference frame according to the key point detection result corresponding to the reference frame.

In a possible implementation manner, the apparatus is further configured to: determine the position of the target object in a three-dimensional space according to the depth information of the target object.

According to an aspect of the present disclosure, there is provided an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to call the instructions stored in the memory to execute the above-mentioned method.

According to one aspect of the present disclosure, there is provided a computer-readable storage medium, on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented.

According to an aspect of the present disclosure, a computer program product is provided, including computer readable codes, and when the computer readable codes are run in an electronic device, a processor in the electronic device executes the above method.

In the embodiments of the present disclosure, multiple frames to be detected are acquired from at least two acquisition angles of view, key point detection is performed on the target area according to the frames to be detected to determine multiple key point detection results corresponding to the multiple frames to be detected, and the depth information of the target object is determined based on the multiple key point detection results. In this way, the parallax formed between the frames to be detected collected from at least two acquisition angles of view can be exploited: the parallax-based calculation of the depth information uses the key point detection results corresponding to the target area in the multiple frames to be detected, which effectively reduces the amount of data processed in the parallax-based calculation and improves the efficiency and accuracy of depth detection.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Description of drawings

The accompanying drawings here are incorporated into the specification and constitute a part of the specification. These drawings show embodiments consistent with the present disclosure, and are used together with the description to explain the technical solutions of the present disclosure.

Fig. 1 shows a flowchart of a depth detection method according to an embodiment of the present disclosure.

FIG. 2 shows a schematic diagram of a target area according to an embodiment of the present disclosure.

Fig. 3 shows a flowchart of a depth detection method according to an embodiment of the present disclosure.

FIG. 4 shows a block diagram of a depth detection device according to an embodiment of the present disclosure.

Fig. 5 shows a schematic diagram of an application example according to the present disclosure.

FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

FIG. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed description

Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures indicate functionally identical or similar elements. While various aspects of the embodiments are shown in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as superior or better than other embodiments.

The term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships are possible. For example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them. For example, including at least one of A, B, and C may mean including any one or more elements selected from the set formed by A, B, and C.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following specific implementation manners. It will be understood by those skilled in the art that the present disclosure may be practiced without some of the specific details. In some instances, methods, means, components and circuits that are well known to those skilled in the art have not been described in detail, so as not to obscure the gist of the present disclosure.

Fig. 1 shows a flowchart of a depth detection method according to an embodiment of the present disclosure. The method can be performed by a depth detection device, and the depth detection device can be an electronic device such as a terminal device or a server. The terminal device can be user equipment (UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc. In some possible implementation manners, the method may be implemented by a processor invoking computer-readable instructions stored in a memory. Alternatively, the method can be performed by a server. As shown in Fig. 1, the method may include:

Step S11, acquiring multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by capturing images of the target object from at least two acquisition angles of view.

Wherein, the frame to be detected may be any image frame that requires depth detection, for example, it may be an image frame extracted from a captured video, or an image frame obtained by capturing an image. The number of multiple frames to be detected is not limited in this embodiment of the present disclosure, and may include two or more frames.

The acquisition angle of view can be the angle of image acquisition of the target object, and different frames to be detected can be acquired by image acquisition devices set at different acquisition angles of view, or can be acquired by the same image device under different acquisition angles of view.

The frame to be detected includes the target object to be subjected to depth detection. The type of the target object is not limited in the embodiments of the present disclosure, and may include various human objects, animal objects, or some mechanical objects, such as robots. Subsequent disclosed embodiments are described by taking the target object as a person object as an example. Implementations in which the target object is other types can be flexibly expanded by referring to the subsequent disclosed embodiments, and will not be elaborated one by one.

The number of target objects contained in the frame to be detected is also not limited in the embodiments of the present disclosure, and may contain one or more target objects, which can be flexibly determined according to actual conditions.

The manner of obtaining multiple frames to be detected is not limited in the embodiments of the present disclosure. In a possible implementation, frame extraction may be performed on one or more videos to obtain multiple frames to be detected, wherein the frame extraction may include one or more methods such as frame-by-frame extraction, frame sampling at a certain interval, or random frame sampling. In a possible implementation, it is also possible to capture images of the target object from multiple angles to obtain multiple frames to be detected, and so on.
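The three frame extraction modes named above can be illustrated with a minimal sketch. The function name `select_frame_indices` and its parameters (`mode`, `step`, `count`, `seed`) are hypothetical and chosen only for illustration; the disclosure does not prescribe any particular interface.

```python
import random


def select_frame_indices(total_frames, mode="interval", step=5, count=10, seed=None):
    """Return sorted indices of the frames to extract from a video.

    Hypothetical helper: the modes mirror frame-by-frame extraction,
    sampling at a fixed interval, and random sampling.
    """
    if total_frames <= 0:
        return []
    if mode == "every":
        # Frame-by-frame extraction: keep every frame.
        return list(range(total_frames))
    if mode == "interval":
        # Sampling at a fixed interval of `step` frames.
        if step < 1:
            raise ValueError("step must be at least 1")
        return list(range(0, total_frames, step))
    if mode == "random":
        # Random sampling without replacement.
        rng = random.Random(seed)
        picked = rng.sample(range(total_frames), min(count, total_frames))
        return sorted(picked)
    raise ValueError(f"unknown mode: {mode}")
```

In a two-view setup, the same indices would typically be applied to both videos so that each pair of frames to be detected is captured at the same moment; that pairing is an assumption of this sketch.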

Step S12, performing key point detection on the target area of the target object according to the frames to be detected, and determining multiple key point detection results corresponding to the multiple frames to be detected.

Wherein, the key point detection result may include the positions of the detected key points in the frame to be detected. The number and types of detected key points can be flexibly determined according to the actual situation. In some possible implementations, the number of detected key points may be from 2 to 150. In an example, the detected key points may include 14 limb key points of the human body (such as head key points, shoulder key points, neck key points, elbow key points, wrist key points, crotch key points, leg key points and foot key points), or 59 outline key points on the outline of the human body (such as key points on the periphery of the head or the periphery of the shoulders). In a possible implementation manner, in order to reduce the amount of calculation, the detected key points may include only three key points: the head key point, the left shoulder key point and the right shoulder key point.

The multiple key point detection results can respectively correspond to the multiple frames to be detected. For example, if key point detection is performed on each of the multiple frames to be detected, each frame to be detected corresponds to one key point detection result, so that multiple key point detection results are obtained.

The target area may include a head area and/or a shoulder area. The head area of the target object may be the area where the head of the target object is located, such as the area formed between the head key points and the neck key points; the shoulder area may be the area where the shoulders and neck of the target object are located, such as the area formed between the neck key points and the shoulder key points.

Fig. 2 shows a schematic diagram of a target area according to an embodiment of the present disclosure. As shown in Fig. 2, in a possible implementation, when the target area includes a head area and a shoulder area, the head-shoulder frame connecting the head key point, the left shoulder key point and the right shoulder key point can be used as the target area. In one example, the head-shoulder frame can be a rectangle as shown in Fig. 2. It can be seen from Fig. 2 that the head-shoulder frame connects the head key point at the top of the target object's head, the left shoulder key point at the left shoulder joint, and the right shoulder key point at the right shoulder joint. In an example, the head-shoulder frame may also have other shapes, such as a polygon, a circle, or an irregular shape.

The way of key point detection can be flexibly determined according to the actual situation. In one possible implementation, the frame to be detected can be input into any neural network with a key point detection function to realize key point detection. In some possible implementations, key point identification can also be performed on the frame to be detected through a relevant key point identification algorithm to obtain a key point detection result. In some possible implementations, key point detection can also be performed on only a part of the image area in the frame to be detected, selected according to position, to obtain the key point detection result. For some possible specific implementations of step S12, reference may be made to the following disclosed embodiments, which are not expanded here.
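As a minimal sketch of the rectangular case, the head-shoulder frame can be taken as the smallest axis-aligned rectangle that touches the three key points. The function name `head_shoulder_box` and the `(x, y)` pixel convention with the origin at the top-left corner are assumptions made for illustration.

```python
def head_shoulder_box(head, left_shoulder, right_shoulder):
    """Return (x_min, y_min, x_max, y_max) of the rectangle through three key points.

    Each key point is an (x, y) pixel coordinate. The rectangle is the
    tightest axis-aligned box containing the head vertex and both shoulder
    joints; this is an illustrative construction, not a prescribed one.
    """
    xs = [head[0], left_shoulder[0], right_shoulder[0]]
    ys = [head[1], left_shoulder[1], right_shoulder[1]]
    return (min(xs), min(ys), max(xs), max(ys))
```

With image coordinates growing downward, the head key point usually fixes the top edge and the shoulder key points fix the left, right and bottom edges.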

Step S13, according to the multiple key point detection results, determine the depth information of the target object in the frame to be detected.

Wherein, the information contained in the depth information can be flexibly determined according to the actual situation, and any information that can reflect the depth of the target object in three-dimensional space can serve as an implementation of the depth information. In a possible implementation manner, the depth information may include a depth distance and/or an offset angle.

The depth distance can be the distance between the target object and the collection device, and the collection device can be any device that captures images of the target object. In some possible implementations, the collection device can be a device for capturing static images, such as a camera; in some possible implementation manners, the collection device may also be a device for capturing dynamic images, such as a video camera or a camera.

As described in the above disclosed embodiments, different frames to be detected can be collected by image acquisition devices set at different acquisition angles of view, or by the same image acquisition device at different acquisition angles of view. Therefore, the number of acquisition devices can be one or more. In a possible implementation, the depth detection method proposed by the embodiments of the present disclosure can be implemented based on at least two acquisition devices. In this case, the at least two acquisition devices can capture images of the target object from at least two acquisition angles of view to obtain multiple frames to be detected.

In the case where the collection device includes at least two collection devices, the types of different collection devices may be the same or different, which can be flexibly selected according to the actual situation, and there is no limitation in this embodiment of the present disclosure.

The depth distance can be the distance between the target object and the collection device. This distance can be the distance between the target object and the collection device as a whole, or the distance between the target object and a certain component of the collection device. In some possible implementations, the distance between the target object and the optical center of the collection device may be used as the depth distance.

The offset angle may be an offset angle of the target object relative to the collection device, and in a possible implementation manner, the offset angle may be a spatial angle of the target object relative to the optical axis of the collection device.
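One way to picture the offset angle is through the pinhole camera model: a pixel defines a viewing ray from the optical center, and the offset angle is the angle between that ray and the optical axis. The sketch below assumes this model, with hypothetical intrinsic parameters `fx`, `fy` (focal lengths in pixels) and `cx`, `cy` (principal point); it is not the specific calculation of the disclosed embodiments.

```python
import math


def offset_angles(u, v, fx, fy, cx, cy):
    """Return the angles (in degrees) between a pixel's viewing ray and the optical axis.

    Assumes an undistorted pinhole camera. (u, v) is the pixel of a key
    point; fx, fy, cx, cy are illustrative intrinsic parameters.
    """
    # Normalized image coordinates: the ray direction is (x, y, 1).
    x = (u - cx) / fx
    y = (v - cy) / fy
    return {
        "horizontal": math.degrees(math.atan(x)),
        "vertical": math.degrees(math.atan(y)),
        # Full spatial angle between the ray and the optical axis.
        "total": math.degrees(math.atan(math.hypot(x, y))),
    }
```

A key point that projects onto the principal point lies on the optical axis, so all three angles are zero in that case.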

Since the multiple key point detection results can correspond to the multiple frames to be detected, and the multiple frames to be detected can be obtained by capturing images of the target object from at least two acquisition angles of view, the parallax formed between the multiple frames to be detected can be determined based on the multiple key point detection results, and the depth information of the target object can then be obtained through a parallax-based calculation. The parallax-based calculation method using the key point detection results can be flexibly determined according to the actual situation, and any method for realizing depth ranging based on parallax can be used in the implementation of step S13. For details, see the following disclosed embodiments, which are not expanded here.
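A minimal sketch of one such parallax-based method is shown below. It assumes a rectified two-camera setup with identical focal length, in which the depth along the optical axis follows the standard relation Z = f * B / d (f: focal length in pixels, B: baseline, d: horizontal disparity of a matched key point). The rectified setup, the function names, and the averaging over key points are all assumptions for illustration rather than the method of the disclosed embodiments.

```python
import math


def keypoint_depth(left_pt, right_pt, focal_px, baseline_m, cx, cy):
    """Return (Z, distance to the left optical center) for one matched key point.

    left_pt and right_pt are (u, v) pixels of the same key point in the
    left and right views of a rectified stereo pair (assumed setup).
    """
    disparity = left_pt[0] - right_pt[0]
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    z = focal_px * baseline_m / disparity
    # Back-project the left pixel to get the lateral offsets at depth z.
    x = (left_pt[0] - cx) * z / focal_px
    y = (left_pt[1] - cy) * z / focal_px
    return z, math.sqrt(x * x + y * y + z * z)


def object_depth(left_pts, right_pts, focal_px, baseline_m, cx, cy):
    """Average the per-key-point distances, e.g. over head and shoulder key points."""
    dists = [
        keypoint_depth(l, r, focal_px, baseline_m, cx, cy)[1]
        for l, r in zip(left_pts, right_pts)
    ]
    return sum(dists) / len(dists)
```

Because only a handful of matched key points enter the calculation, no dense pixel-by-pixel matching between the two images is needed, which is the data reduction the text describes.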

In a possible implementation manner, step S12 may include:

According to the position information of the target object in the reference frame, key point detection is performed on the target area of the target object in the frame to be detected, and a key point detection result corresponding to the frame to be detected is obtained.

Wherein, the reference frame may be a video frame located before the frame to be detected in the target video, and the target video may be a video including the frame to be detected. In some possible implementation manners, different frames to be detected may respectively belong to different target videos, and in this case, reference frames corresponding to different frames to be detected may also be different.

In some possible implementations, the reference frame can be the frame immediately preceding the frame to be detected in the target video. In some possible implementations, the reference frame can also be a video frame in the target video that is located before the frame to be detected and whose distance from the frame to be detected does not exceed a preset distance. The preset distance can be flexibly determined according to the actual situation and can be an interval of one or more frames, which is not limited in this embodiment of the present disclosure.

Since the reference frame is located before the frame to be detected, and its distance from the frame to be detected does not exceed the preset distance, the position of the target object in the reference frame is likely to be close to its position in the frame to be detected. In this case, the position of the target object in the frame to be detected can be roughly determined from the position information of the target object in the reference frame, so that key point detection on the target area of the target object in the frame to be detected can be more targeted and the amount of data to be processed is smaller. As a result, more accurate key point detection results can be obtained, and the efficiency of key point detection can also be improved.

In some possible implementations, the manner of performing key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame can be flexibly determined according to the actual situation. For example, the frame to be detected may be cropped according to the position information of the target object in the reference frame and key point detection then performed on the cropped result, or key point detection may be performed directly on the image area at the corresponding position in the frame to be detected according to the position information of the target object in the reference frame. For the various possible implementations, refer to the following disclosed embodiments, which are not expanded here.

Through the embodiments of the present disclosure, more targeted key point detection can be performed on the target area in the frame to be detected according to the position information of the target object in the reference frame, which improves the efficiency and accuracy of key point detection and thereby the efficiency and accuracy of the depth detection method.
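The choice of reference frame can be sketched as picking the nearest earlier frame that lies within the preset distance. The helper `pick_reference_index` and its parameter `max_gap` (the preset distance, counted in frames) are hypothetical names introduced for illustration.

```python
def pick_reference_index(frame_idx, available, max_gap=1):
    """Return the index of the nearest earlier frame within max_gap frames, or None.

    frame_idx is the index of the frame to be detected; available lists
    the indices of frames for which position information already exists.
    """
    candidates = [i for i in available if 0 < frame_idx - i <= max_gap]
    return max(candidates) if candidates else None
```

With `max_gap=1` this reduces to using the previous frame; when no frame qualifies, a caller would presumably fall back to detecting on the whole frame, which is an assumption of this sketch.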

In a possible implementation manner, according to the position information of the target object in the reference frame, the key point detection is performed on the target area of the target object in the frame to be detected, and the key point detection result corresponding to the frame to be detected is obtained, including:

clipping the frame to be detected according to the first position of the target object in the reference frame to obtain a clipping result;

The key point detection is performed on the target area of the target object in the clipping result, and the key point detection result corresponding to the frame to be detected is obtained.

Wherein, the first position may be the overall position coordinates of the target object in the reference frame. For example, if the target object is a person object, the first position may be the position coordinates of the body frame of the target object in the reference frame.

The manner of clipping the frame to be detected according to the first position is also not limited in the embodiments of the present disclosure, and is not limited to the following disclosed embodiments. In a possible implementation, the first coordinates of the human body frame in the reference frame can be determined according to the first position, and combined with the corresponding relationship between the position coordinates between the reference frame and the frame to be detected, it can be determined that the human body frame of the target object is in the frame to be detected. The second coordinates in the frame are detected, and the frame to be detected is cropped based on the second coordinates to obtain a cropping result.

In some possible implementations, the first coordinates of the human body frame in the reference frame and the border length of the human body frame can also be determined according to the first position. Combined with the correspondence between position coordinates in the reference frame and in the frame to be detected, the second coordinates of the human body frame of the target object in the frame to be detected are determined, and the frame to be detected is cropped based on the second coordinates and the border length to obtain a clipping result. In this case, the second coordinates determine the position of the clipping endpoint, and the border length determines the length of the clipping result. In one example, the length of the clipping result can be equal to the border length. In another example, the length of the clipping result can be proportional to the border length, such as N times the border length, where N can be any value not less than 1.
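As an illustration of cropping with a length of N times the border length, the following is a minimal sketch rather than the method of the disclosure. It assumes that the frame is stored as a list of pixel rows, that the box is given as (left, top, width, height), that the reference frame and the frame to be detected share the same pixel coordinates, and that the enlarged box is scaled about its centre and clipped to the frame. The names `expand_and_clip_box` and `crop_frame` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in pixels (assumed layout)


def expand_and_clip_box(box: Box, frame_w: int, frame_h: int, n: float = 1.0) -> Box:
    """Scale a box about its centre by n (n >= 1) and clip it to the frame bounds."""
    if n < 1.0:
        raise ValueError("n is expected to be not less than 1")
    left, top, w, h = box
    cx, cy = left + w / 2.0, top + h / 2.0
    new_w, new_h = w * n, h * n
    x0 = max(0, int(round(cx - new_w / 2.0)))
    y0 = max(0, int(round(cy - new_h / 2.0)))
    x1 = min(frame_w, int(round(cx + new_w / 2.0)))
    y1 = min(frame_h, int(round(cy + new_h / 2.0)))
    return x0, y0, x1 - x0, y1 - y0


def crop_frame(frame: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a frame stored as a list of pixel rows to the given box."""
    left, top, w, h = box
    return [row[left:left + w] for row in frame[top:top + h]]
```

Enlarging the box with N greater than 1 leaves a margin for movement of the target object between the reference frame and the frame to be detected, at the cost of a larger clipping result.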

The way to detect the key points of the target object in the clipping result can be flexibly determined according to the actual situation. For details, please refer to the following disclosed embodiments, which will not be expanded here.

Through the embodiments of the present disclosure, the target object in the frame to be detected can be preliminarily positioned according to the first position of the target object in the reference frame to obtain the clipping result, and key point detection of the target area can be performed based on the clipping result. On the one hand, this reduces the amount of data to be detected and improves detection efficiency. On the other hand, since the target object occupies a larger proportion of the clipping result than of the original frame, the accuracy of key point detection can be improved.

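In a possible implementation manner, performing key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame, to obtain the key point detection result corresponding to the frame to be detected, includes:

Acquiring a second position of the target area of the target object in the reference frame;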

Cutting the frame to be detected according to the second position to obtain a cutting result;

Perform key point detection on the target object in the clipping result to obtain the key point detection result.

Wherein, the second position may be the position coordinates of the target area of the target object in the reference frame. As described in the above disclosed embodiments, the target area may include the head area and/or the shoulder area, so in a possible implementation manner, the second position may be the position coordinates of the head and shoulders frame of the target object in the reference frame.

The manner of determining the second position of the target area in the reference frame can be flexibly chosen according to the actual situation. For example, it can be realized by performing head and shoulders frame recognition and/or key point recognition on the reference frame. For details, see the following disclosed embodiments, which will not be expanded here.

For the manner of clipping the frame to be detected according to the second position, reference may be made to the manner of clipping the frame to be detected according to the first position, which will not be repeated here.

The key point detection method applied to the target object in this clipping result can be the same as or different from the key point detection method applied to the clipping result obtained based on the first position. For details, see the following disclosed embodiments, which will not be expanded here.

Through the embodiments of the present disclosure, the key point detection result can be obtained according to the second position of the target area of the target object in the reference frame. In this way, detection focuses more specifically on the target area, which further reduces the amount of data to be processed and thereby further improves the accuracy and efficiency of depth detection.

In a possible implementation manner, obtaining the second position of the target area of the target object in the reference frame may include:

Using the first neural network to identify the target area in the reference frame to obtain the second position output by the first neural network; and/or,

According to the key point detection result corresponding to the reference frame, the second position of the target area in the reference frame is obtained.

Wherein, the first neural network may be any network used to determine the second position, and its implementation form is not limited in the embodiments of the present disclosure. In some possible implementations, the first neural network may be a target area detection network for identifying the second position of the target area directly from the reference frame. In one example, the target area detection network may be a Faster Region-based Convolutional Neural Network (Faster Regions with Convolutional Neural Networks, Faster RCNN). In some possible implementations, the first neural network can also be a key point detection network, which is used to identify one or more key points in the reference frame, after which the second position of the target area in the reference frame is determined according to the positions of the identified key points.

In some possible implementation manners, the reference frame may also be used as the frame to be detected for depth detection. In this case, the reference frame may have undergone key point detection and a corresponding key point detection result has been obtained. Therefore, in some possible implementation manners, the second position of the target area in the reference frame may be obtained according to the key point detection result corresponding to the reference frame.
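Deriving the second position from a key point detection result can be sketched as follows. This is a minimal illustration that assumes the head and shoulders frame is taken as the axis-aligned rectangle enclosing the detected key points, as in the application example later in this disclosure. The dictionary keys and the function name `head_shoulder_box` are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


def head_shoulder_box(keypoints: Dict[str, Point]) -> Tuple[float, float, float, float]:
    """Return (left, top, width, height) of the rectangle enclosing the key points."""
    if not keypoints:
        raise ValueError("at least one key point is required")
    xs = [p[0] for p in keypoints.values()]
    ys = [p[1] for p in keypoints.values()]
    left, top = min(xs), min(ys)
    return left, top, max(xs) - left, max(ys) - top
```

For example, key points at (50, 20), (30, 60) and (70, 62) give a frame whose top-left corner is (30, 20) with width 40 and height 42.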

In some possible implementation manners, the key point detection may also be directly performed on the reference frame to obtain the key point detection result. For the key point detection method, reference may be made to other disclosed embodiments, which will not be repeated here.

Through the embodiments of the present disclosure, the second position of the target area in the reference frame can be flexibly determined in multiple ways according to the actual situation of the reference frame, which improves the flexibility and versatility of depth detection. In addition, in some possible implementations, where the reference frame before the frame to be detected has already participated in depth detection, the second position can be determined directly from the intermediate result of the reference frame in depth detection, thereby reducing repeated calculation and improving the efficiency and accuracy of depth detection.

In a possible implementation manner, the key point detection is performed on the target object in the clipping result to obtain the key point detection result, which may include:

The second neural network is used to perform key point detection on the target object in the clipping result to obtain a key point detection result.

Wherein, the second neural network may be any neural network used to realize key point detection, and its implementation manner is not limited in the embodiments of the present disclosure. Where the first neural network is a key point detection network, the second neural network may be implemented in the same manner as the first neural network or in a different manner.

In some possible implementation manners, key point detection may also be performed on the target object in the clipping result through a related key point recognition algorithm, and the key point recognition algorithm to be used is also not limited in the embodiments of the present disclosure.

FIG. 3 shows a flowchart of a depth detection method according to an embodiment of the present disclosure. As shown in FIG. 3 , in a possible implementation, step S13 may include:

Step S131, acquiring at least two preset device parameters respectively corresponding to at least two capture devices, the at least two capture devices are used to capture images of the target object from at least two capture angles of view.

Step S132: Determine the depth information of the target object in the frame to be detected according to at least two preset device parameters and a plurality of key point detection results.

For the implementation manner of the collection device, reference may be made to the above disclosed embodiments, which will not be repeated here.

In some possible implementation manners, the at least two preset device parameters may include preset internal parameters respectively corresponding to the at least two acquisition devices. The preset internal parameters may be calibration parameters of the acquisition device itself, and the types and number of parameters contained therein may be flexibly determined according to the actual situation of the acquisition device. In some possible implementation manners, the preset internal parameters may include an intrinsic parameter matrix of the acquisition device, and the intrinsic parameter matrix may include one or more focal length parameters of the camera, one or more principal point positions of the camera, and the like.

In some possible implementations, since there are at least two acquisition devices, the at least two preset device parameters may also include preset external parameters. The preset external parameters may be relative parameters formed between different acquisition devices, which describe the relative positions of the different acquisition devices in the world coordinate system. In some possible implementation manners, the preset external parameters may include an external parameter matrix formed between different acquisition devices. In an example, the external parameter matrix may include a rotation matrix and/or a translation vector, and the like.
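One way to hold the preset internal and external parameters described above is sketched below. This is a minimal illustration, not a structure prescribed by the disclosure. It assumes a pinhole camera with zero skew, and the class names, field names and numeric values used with them are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass(frozen=True)
class Intrinsics:
    """Preset internal parameters of one acquisition device (zero skew assumed)."""
    fx: float  # focal length parameter along x, in pixels
    fy: float  # focal length parameter along y, in pixels
    u0: float  # principal point, x coordinate
    v0: float  # principal point, y coordinate

    def matrix(self) -> Matrix:
        """Return the 3x3 intrinsic parameter matrix K."""
        return [[self.fx, 0.0, self.u0],
                [0.0, self.fy, self.v0],
                [0.0, 0.0, 1.0]]


@dataclass(frozen=True)
class Extrinsics:
    """Preset external parameters: pose of one device relative to another."""
    rotation: Matrix          # 3x3 rotation matrix R
    translation: List[float]  # translation vector T of length 3

    def has_orthonormal_rotation(self, tol: float = 1e-6) -> bool:
        """Check that the rows of R are orthonormal (necessary for a rotation)."""
        r = self.rotation
        for i in range(3):
            for j in range(3):
                dot = sum(r[i][k] * r[j][k] for k in range(3))
                if abs(dot - (1.0 if i == j else 0.0)) > tol:
                    return False
        return True
```

Checking orthonormality after loading calibration data is a cheap way to catch a transposed or rescaled matrix before it is used in depth calculation.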

The way to obtain the preset device parameters is not limited in the embodiments of the present disclosure. In some possible implementations, the preset device parameters can be obtained directly according to the actual situation of the acquisition device. In some possible implementations, the preset device parameters can also be obtained by calibrating the acquisition device.

According to the positional relationship among the multiple key point detection results, combined with the at least two preset device parameters, the parallax formed between different frames to be detected in the three-dimensional world coordinate system can be determined. As mentioned in the above disclosed embodiments, the content of the depth information can be flexibly determined according to the actual situation. Therefore, as the content of the depth information differs, the process of determining the depth information according to the preset device parameters and the multiple key point detection results may change accordingly. For details, see the following disclosed embodiments, which will not be expanded here.

Through the embodiments of the present disclosure, the at least two preset device parameters and the multiple key point detection results can be used to determine the disparity formed between different frames to be detected, so that the depth information is determined simply and conveniently. This method requires a small amount of calculation and yields a more accurate result, which can improve the accuracy and efficiency of depth detection.

In a possible implementation manner, step S132 may include:

The depth distance is obtained according to the preset external parameters among the at least two preset device parameters and the coordinates of the multiple key point detection results in at least two forms.

Wherein, for the implementation form of the preset external parameters, reference may be made to the above disclosed embodiments, which will not be repeated here. The coordinates of the key point detection results in at least two forms can be the coordinates of the key point detection results in different coordinate systems. For example, they can include the pixel coordinates of the key point detection results in the image coordinate system, and/or the homogeneous coordinates formed separately in different acquisition devices. Which form of coordinates to use can be flexibly selected according to the actual situation, and is not limited to the following disclosed embodiments.

In the process of obtaining the depth distance, which key point coordinates in the key point detection results are used is not limited in the embodiments of the present disclosure. In some possible implementations, one or more of the head key point, the left shoulder key point and the right shoulder key point can be selected; in one example, the head key point is selected. In some possible implementations, the center point of the head and shoulders can also be selected.

Wherein, the center point of the head and shoulders may be the center point of the head and shoulders frame mentioned in the above disclosed embodiments. In some possible implementations, the overall position coordinates of the head and shoulders frame can be determined according to the position coordinates of the head key point, the left shoulder key point and the right shoulder key point, and the position coordinates of the center point of the head and shoulders can then be determined based on the overall position coordinates of the head and shoulders frame. In some possible implementations, the center point of the head and shoulders can also be used directly as a key point to be detected, so that its position coordinates are obtained directly in the key point detection result.

As the number of acquisition devices differs, the calculation method for obtaining the depth distance can be flexibly changed, and is not limited to the following disclosed embodiments. In one example, the acquisition devices may include two devices, a left camera and a right camera. In this case, the process of obtaining the depth distance according to the preset external parameters in the at least two preset device parameters and the coordinates of the multiple key point detection results in at least two forms can be expressed by formulas (1) and (2):

[Formulas (1) and (2)]

Among them, d is the depth distance. The formulas use the original coordinates of the key point in homogeneous form in the frame to be detected collected by the left camera, the transformed coordinates obtained by linearly transforming the original coordinates, and the coordinates of the key point in homogeneous form in the frame to be detected collected by the right camera. R is the rotation matrix of the right camera relative to the left camera in the preset external parameters, and T is the translation vector of the right camera relative to the left camera in the preset external parameters.

Through the embodiments of the present disclosure, the homogeneous coordinates of the key points in different camera coordinate systems and the linearly transformed coordinates of the key points can be combined with the relative preset external parameters between different cameras to determine the depth distance accurately with a small amount of calculation, thereby improving the accuracy and efficiency of depth detection.
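For illustration, a depth value can be recovered from one key point seen by two cameras with a standard two-ray least-squares triangulation. The sketch below is not necessarily the computation of formulas (1) and (2). It assumes that both key point coordinates are normalized homogeneous coordinates of the form (x/z, y/z, 1), and that R and T map left-camera coordinates to right-camera coordinates, that is, P_right = R · P_left + T. The function and variable names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def _dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(p * q for p, q in zip(a, b))


def _mat_vec(m: Sequence[Sequence[float]], v: Sequence[float]) -> Tuple[float, ...]:
    return tuple(_dot(row, v) for row in m)


def triangulate_depth(x_left: Sequence[float], y_right: Sequence[float],
                      rotation: Sequence[Sequence[float]],
                      translation: Sequence[float]) -> Tuple[float, float]:
    """Return (depth along the left optical axis, distance to the left optical centre).

    Solves d * (R x) - s * y = -T for the scalars d and s in the
    least-squares sense, where x and y are normalized homogeneous coordinates.
    """
    a = _mat_vec(rotation, x_left)  # linearly transformed coordinates R * x
    aa, yy, ay = _dot(a, a), _dot(y_right, y_right), _dot(a, y_right)
    at, yt = _dot(a, translation), _dot(y_right, translation)
    det = aa * yy - ay * ay
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is not observable")
    d = (ay * yt - yy * at) / det
    distance = d * math.sqrt(_dot(x_left, x_left))
    return d, distance
```

With an identity rotation and a right camera offset 0.1 along the x axis (T = (-0.1, 0, 0) under the assumed convention), a point at (0.2, 0.1, 2.0) in left-camera coordinates projects to (0.1, 0.05, 1) and (0.05, 0.05, 1), and the sketch recovers a depth of 2.0.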

In a possible implementation manner, step S132 may also include:

The offset angle is obtained according to the preset internal parameters in the at least two preset device parameters and the coordinates of the multiple key point detection results in at least two forms.

Wherein, the implementation forms of preset internal parameters and coordinates of key point detection results in at least two forms can also refer to the above disclosed embodiments, and will not be repeated here.

According to the preset internal parameters and the coordinates of the key point detection results in at least two forms, the way of obtaining the offset angle can also be flexibly selected, and is not limited to the following disclosed embodiments. In the process of determining the offset angle, the type of selected key points can also be flexibly selected according to the actual situation. You can refer to the type of key points selected in the above-mentioned determination of the depth distance, which will not be repeated here.

Similar to the determination process of the depth distance, as the number of acquisition devices differs, the calculation method for obtaining the offset angle can also be flexibly changed, and is not limited to the following disclosed embodiments. In one example, taking an acquisition device that includes a certain target camera as an example, the process of obtaining the offset angle relative to the target camera can be expressed by formulas (3) to (5):

[Formulas (3) to (5)]

Among them, θ_x is the offset angle of the target object in the x-axis direction, and θ_y is the offset angle of the target object in the y-axis direction. The formulas use the coordinates of the key point in homogeneous form in the frame to be detected collected by the target camera and the pixel coordinates of the key point in that frame. f_x and f_y are the focal length parameters in the intrinsic parameter matrix K of the target camera, and u₀ and v₀ are the principal point position in the intrinsic parameter matrix K of the target camera.

Through the embodiments of the present disclosure, the offset angle can be determined simply and conveniently by using the preset internal parameters and the coordinates, in different forms, of the key point detection results obtained in the depth detection process. This determination method requires no additional data and is easy to compute, which can improve the efficiency and convenience of depth detection.
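One reading consistent with the symbol definitions above is sketched below. It is an assumption, not a reproduction of formulas (3) to (5): the pixel coordinates are converted to normalized coordinates with a zero-skew pinhole model, and each offset angle is taken as the arctangent of the corresponding normalized coordinate. The function name `offset_angles` is hypothetical.

```python
import math
from typing import Tuple


def offset_angles(u: float, v: float, fx: float, fy: float,
                  u0: float, v0: float) -> Tuple[float, float]:
    """Return (theta_x, theta_y) in radians relative to the camera optical axis.

    Assumes K = [[fx, 0, u0], [0, fy, v0], [0, 0, 1]], so that
    x/z = (u - u0) / fx and y/z = (v - v0) / fy.
    """
    x_over_z = (u - u0) / fx
    y_over_z = (v - v0) / fy
    return math.atan(x_over_z), math.atan(y_over_z)
```

Under these assumptions, a key point at the principal point has zero offset in both directions, and a key point displaced horizontally by exactly f_x pixels has θ_x equal to 45 degrees.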

In a possible implementation manner, the method proposed in the embodiment of the present disclosure may further include:

According to the depth information of the target object, the position of the target object in the three-dimensional space is determined.

Wherein, the position of the target object in the three-dimensional space may be the three-dimensional coordinates of the target object in the three-dimensional space. The way to determine the position in the three-dimensional space based on the depth information can be flexibly selected according to the actual situation. In a possible implementation manner, the two-dimensional coordinates of the target object in the frame to be detected can be determined according to the key point detection result of the target object, and the two-dimensional coordinates are then combined with the depth distance and/or offset angle in the depth information to determine the three-dimensional coordinates of the target object in the three-dimensional space.
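As a hedged illustration of combining the depth distance with the offset angles, the sketch below assumes that the depth distance is measured from the optical center of the camera to the target object, and that the offset angles satisfy tan θ_x = X/Z and tan θ_y = Y/Z in the camera coordinate system. Neither convention is fixed by this passage, and the function name is hypothetical.

```python
import math
from typing import Tuple


def position_from_distance_and_angles(distance: float, theta_x: float,
                                      theta_y: float) -> Tuple[float, float, float]:
    """Return camera-frame coordinates (X, Y, Z) of the target object.

    Assumes distance = sqrt(X^2 + Y^2 + Z^2), tan(theta_x) = X / Z and
    tan(theta_y) = Y / Z, with angles in radians.
    """
    if distance < 0.0:
        raise ValueError("distance must be non-negative")
    tx, ty = math.tan(theta_x), math.tan(theta_y)
    z = distance / math.sqrt(1.0 + tx * tx + ty * ty)
    return tx * z, ty * z, z
```

If the depth distance is instead defined along the optical axis, Z equals the depth distance directly and the normalization by the square root is not needed.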

After the position of the target object in the three-dimensional space is determined, face recognition, living body recognition or route tracking can be performed on the target object based on the three-dimensional position information, or the position information can be applied to scenarios such as virtual reality (Virtual Reality, VR) or augmented reality (Augmented Reality, AR). Through the embodiments of the present disclosure, the depth information can be used to perform three-dimensional positioning of the target object, so as to realize various operations such as interaction with the target object. For example, in some possible implementations, the distance and angle between the target object and a smart air conditioner can be determined according to the position of the target object in the three-dimensional space, so as to dynamically adjust the wind direction and/or wind speed of the smart air conditioner. In some possible implementations, the target object can also be positioned in the game scene of an AR game platform based on its position in the three-dimensional space, so that human-computer interaction in the AR scene can be realized more realistically and naturally.

It can be understood that the above-mentioned method embodiments mentioned in this disclosure can all be combined with each other to form combined embodiments without violating principle and logic. Due to space limitations, this disclosure will not repeat them. Those skilled in the art can understand that, in the above methods of the specific implementation manners, the specific execution order of each step should be determined according to its function and possible internal logic.

In addition, the present disclosure also provides depth detection devices, electronic equipment, computer-readable storage media, and programs, all of which can be used to implement any depth detection method provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which will not be repeated here.

FIG. 4 shows a block diagram of a depth detection device according to an embodiment of the present disclosure. As shown in Figure 4, device 20 includes:

The obtaining module 21 is configured to obtain multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by collecting images of a target object from at least two collection angles of view.

The key point detection module 22 is used to perform key point detection of the target area in the target object according to the frame to be detected, and determine a plurality of key point detection results corresponding to multiple frames to be detected, wherein the target area includes the head area and/or shoulder area.

The depth detection module 23 is configured to determine the depth information of the target object according to the multiple key point detection results.

In a possible implementation manner, the depth detection module is configured to: acquire at least two preset device parameters respectively corresponding to at least two acquisition devices, where the at least two acquisition devices are used to perform image acquisition on the target object from at least two acquisition angles of view; and determine the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results.

In a possible implementation manner, the depth information includes a depth distance, and the depth distance includes a distance between the target object and the optical center of the acquisition device; the depth detection module is further configured to: obtain the depth distance according to the preset external parameters in the at least two preset device parameters and the coordinates of the multiple key point detection results in at least two forms; wherein, the preset external parameters include relative parameters formed between the at least two acquisition devices.

In a possible implementation manner, the depth information includes an offset angle, and the offset angle includes a spatial angle of the target object relative to the optical axis of the acquisition device; the depth detection module is further configured to: obtain the offset angle according to the preset internal parameters in the at least two preset device parameters and the coordinates of the multiple key point detection results in at least two forms; wherein, the preset internal parameters include device parameters respectively corresponding to the at least two acquisition devices.

In a possible implementation, the key point detection module is configured to: perform key point detection on the target area of the target object in the frame to be detected according to the position information of the target object in the reference frame, to obtain the key point detection result corresponding to the frame to be detected, wherein the reference frame is a video frame before the frame to be detected in the target video to which the frame to be detected belongs.

In a possible implementation, the key point detection module is further configured to: crop the frame to be detected according to the first position of the target object in the reference frame to obtain the cropping result; and perform key point detection on the target area of the target object in the cropping result to obtain the key point detection result corresponding to the frame to be detected.

In a possible implementation, the key point detection module is further configured to: obtain the second position of the target area of the target object in the reference frame; crop the frame to be detected according to the second position to obtain the cropping result; and perform key point detection on the target area of the target object in the cropping result to obtain the key point detection result corresponding to the frame to be detected.

In a possible implementation, the key point detection module is further configured to: use the first neural network to identify the target area in the reference frame to obtain the second position output by the first neural network; and/or obtain the second position of the target area in the reference frame according to the key point detection result corresponding to the reference frame.

In a possible implementation manner, the device is further configured to: determine the position of the target object in the three-dimensional space according to the depth information of the target object.

In some embodiments, the functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above. For the specific implementation, reference may be made to the description of the method embodiments above, which, for brevity, will not be repeated here.

Application Scenario Example

Fig. 5 shows a schematic diagram of an application example according to the present disclosure. As shown in Fig. 5, the application example of the present disclosure proposes a depth detection method, which may include the following process:

Step S31, use the Faster RCNN neural network to detect the head and shoulders frame of the human body in the two frames to be detected captured by the binocular camera (including the left camera and the right camera), and obtain the position of the head and shoulders frame in the first frame of the left camera and the position of the head and shoulders frame in the first frame of the right camera.

Step S32, obtain the target videos respectively corresponding to the left camera and the right camera; starting from the second frame of each target video, use the video frame as the frame to be detected and the previous frame of the frame to be detected as the reference frame; according to the second position of the head and shoulders frame in the reference frame, perform key point detection on the frame to be detected through the key point detection network to obtain the position coordinates of three key points, namely the head key point, the left shoulder key point and the right shoulder key point; and use the circumscribed rectangle of the three key points as the head and shoulders frame in the frame to be detected.

Step S33, according to the coordinates of the key points in the frame to be detected in at least two forms, and the internal reference matrix of the camera, calculate the offset angle of the target object relative to the camera:

Among them, according to the pixel coordinates (u, v, 1) of the head key point in the frame to be detected and the intrinsic parameter matrix K of the camera, the homogeneous coordinates (x/z, y/z, 1) corresponding to the head key point and the offset angles θ_x and θ_y relative to the camera optical axis can be calculated.

Step S34, according to the homogeneous coordinates of the key points in the frame to be detected in the left camera and the right camera, and the extrinsic matrix of the right camera relative to the left camera, calculate the depth distance of the target object:

Wherein, according to the homogeneous coordinates of the same key point in the left camera and the right camera respectively, and the external parameters R and T of the right camera relative to the left camera, the depth distance d of the target object is calculated through formulas (1) and (2) mentioned in the above disclosed embodiments.

In one example, after the depth information of the target object in the frame to be detected is determined through steps S33 and S34, the next frame of the frame to be detected in the target video corresponding to the left camera and the right camera can also be used as the frame to be detected , and return to step S32 to perform depth detection again.
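The frame-by-frame loop of steps S31 and S32 for one camera can be sketched as follows. This is a minimal illustration under stated assumptions: `detect_box` stands in for the head and shoulders frame detector applied to the first frame, `detect_keypoints` stands in for the key point detection network guided by the frame of the reference frame, and both are hypothetical callables supplied by the caller rather than APIs of any particular library.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, width, height)


def enclosing_box(keypoints: Dict[str, Point]) -> Box:
    """Circumscribed axis-aligned rectangle of the key points."""
    xs = [p[0] for p in keypoints.values()]
    ys = [p[1] for p in keypoints.values()]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def track_keypoints(frames: Iterable[object],
                    detect_box: Callable[[object], Box],
                    detect_keypoints: Callable[[object, Box], Dict[str, Point]],
                    ) -> Iterator[Dict[str, Point]]:
    """Yield key points for every frame after the first one.

    The first frame is only used to obtain the initial head and shoulders
    frame; each later frame uses the frame obtained from its previous frame.
    """
    iterator = iter(frames)
    try:
        first = next(iterator)
    except StopIteration:
        return
    box = detect_box(first)
    for frame in iterator:
        keypoints = detect_keypoints(frame, box)
        yield keypoints
        box = enclosing_box(keypoints)  # becomes the reference for the next frame
```

Running one such loop per camera and pairing the key points of frames captured at the same time gives the inputs required by steps S33 and S34.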

Through the application example of this disclosure, the head and shoulders frame of the human body and the key points in the head and shoulders frame can be used to calculate the disparity formed by the frames to be detected collected under different viewing angles. Compared with disparity estimation methods based on image matching, the amount of calculation is smaller and the range of application scenarios is wider.

Those skilled in the art can understand that, in the above methods of the specific implementation manners, the order in which the steps are written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined according to its function and possible internal logic.

Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor. The computer readable storage medium may be a non-volatile computer readable storage medium or a volatile computer readable storage medium.

An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.

An embodiment of the present disclosure also provides a computer program product, including computer readable codes. When the computer readable codes run on a device, a processor in the device executes instructions for implementing the depth detection method provided in any of the above embodiments.

The embodiments of the present disclosure also provide another computer program product, which is used for storing computer-readable instructions, and when the instructions are executed, the computer executes the operation of the depth detection method provided by any of the above-mentioned embodiments.

Electronic devices may be provided as terminals, servers, or other forms of devices.

FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure. As shown in FIG. 6 , the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to FIG. 6, the electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.

The processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like. The memory 804 can be implemented by any type of volatile or non-volatile storage device or their combination, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.

The power supply component 806 provides power to the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in operation modes, such as call mode, recording mode and voice recognition mode. Received audio signals may be further stored in memory 804 or sent via communication component 816 . In some embodiments, the audio component 810 also includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact. The sensor component 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wide Band (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the methods described above.

In an exemplary embodiment, there is also provided a non-volatile computer-readable storage medium, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.

FIG. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure. As shown in FIG. 7, the electronic device 1900 may be provided as a server. Referring to FIG. 7 , electronic device 1900 includes processing component 1922 , which further includes one or more processors, and memory resources represented by memory 1932 for storing instructions executable by processing component 1922 , such as application programs. The application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.

The electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface-based operating system introduced by Apple Inc. (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), a free and open source Unix-like operating system (Linux™), an open source Unix-like operating system (FreeBSD™), or the like.

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium, such as a memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.

The present disclosure can be a system, method and/or computer program product. A computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.

A computer-readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. A computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the above. As used herein, computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or electrical signals transmitted through wires.

Computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .

Computer program instructions for performing the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, via the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field programmable gate array (FPGA), or a programmable logic array (PLA), can be customized by utilizing state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for realizing the functions/actions specified in one or more blocks in the flowchart and/or block diagram. These computer-readable program instructions can also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices, and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture comprising instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

It is also possible to load computer-readable program instructions into a computer, other programmable data processing device, or other equipment, so that a series of operational steps are performed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process , so that instructions executed on computers, other programmable data processing devices, or other devices implement the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions that includes one or more executable instructions. In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by a dedicated hardware-based system that performs the specified function or action, or may be implemented by a combination of dedicated hardware and computer instructions.

The computer program product can be specifically realized by means of hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).
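Purely as a non-limiting illustration of the kind of routine such a software product might contain, the following minimal sketch estimates a depth distance and an offset angle for one key point (for example, a head key point) detected in two views. It is not the formula of the present disclosure. It assumes a rectified two-camera setup with identical internal parameters and a purely horizontal baseline, and the function name `keypoint_depth_and_angle` and all parameter names are hypothetical.

```python
import math


def keypoint_depth_and_angle(left_xy, right_xy, focal_px, cx, cy, baseline_m):
    """Return (distance_m, yaw_rad, pitch_rad) for one key point seen in both views.

    Assumptions (not taken from the disclosure): the two images are rectified,
    both cameras share the internal parameters focal_px, cx, cy (in pixels),
    and the only external parameter is a horizontal baseline in metres.
    """
    x_l, y_l = left_xy
    x_r, _ = right_xy

    # Horizontal shift of the key point between the two views.
    disparity = x_l - x_r
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of both cameras")

    # Back-project the key point into the left camera's coordinate system.
    z = focal_px * baseline_m / disparity
    x = (x_l - cx) * z / focal_px
    y = (y_l - cy) * z / focal_px

    # Distance from the key point to the optical center of the left camera.
    distance = math.sqrt(x * x + y * y + z * z)

    # Angles of the key point relative to the optical axis of the left camera.
    yaw = math.atan2(x_l - cx, focal_px)
    pitch = math.atan2(y_l - cy, focal_px)
    return distance, yaw, pitch
```

Under these assumptions, the baseline plays the role of a relative (external) parameter between the two acquisition devices, while the focal length and principal point are internal parameters; a real deployment would use calibrated values and would typically handle non-rectified images.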

Various embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and alterations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A depth detection method, characterized by comprising:

Acquiring multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by collecting images of the target object from at least two acquisition angles of view;

Performing key point detection on a target area of the target object according to the frames to be detected, and determining a plurality of key point detection results corresponding to the multiple frames to be detected, wherein the target area includes a head area and/or a shoulder area;

Determining depth information of the target object according to the multiple key point detection results.
The method according to claim 1, wherein the determining the depth information of the target object according to the multiple key point detection results comprises:

Acquiring at least two preset device parameters respectively corresponding to at least two acquisition devices, wherein the at least two acquisition devices are used to capture images of the target object from the at least two acquisition angles of view;

Determining depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results.
The method according to claim 2, wherein the depth information includes a depth distance, and the depth distance includes a distance between the target object and the optical center of the acquisition device;

The determining the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results includes:

Obtaining the depth distance according to preset external parameters in the at least two preset device parameters and the coordinates, in at least two forms, of the plurality of key point detection results; wherein the preset external parameters include relative parameters formed between the at least two acquisition devices.
The method according to claim 2 or 3, wherein the depth information includes an offset angle, and the offset angle includes a spatial angle of the target object relative to the optical axis of the acquisition device;

The determining the depth information of the target object in the frame to be detected according to the at least two preset device parameters and the multiple key point detection results includes:

Obtaining the offset angle according to preset internal parameters in the at least two preset device parameters and the coordinates, in at least two forms, of the plurality of key point detection results; wherein the preset internal parameters include device parameters respectively corresponding to the at least two acquisition devices.
The method according to any one of claims 1 to 4, wherein the key point detection of the target area in the target object according to the frame to be detected comprises:

Performing, according to position information of the target object in a reference frame, key point detection on the target area of the target object in the frame to be detected, to obtain a key point detection result corresponding to the frame to be detected, wherein the reference frame is a video frame before the frame to be detected in the target video to which the frame to be detected belongs.
The method according to claim 5, wherein the performing, according to the position information of the target object in the reference frame, key point detection on the target area of the target object in the frame to be detected, to obtain the key point detection result corresponding to the frame to be detected, comprises:

clipping the frame to be detected according to the first position of the target object in the reference frame to obtain a clipping result;

Key point detection is performed on the target area of the target object in the clipping result to obtain a key point detection result corresponding to the frame to be detected.
The method according to claim 5 or 6, wherein the performing, according to the position information of the target object in the reference frame, key point detection on the target area of the target object in the frame to be detected, to obtain the key point detection result corresponding to the frame to be detected, comprises:

Acquiring a second position of the target area of the target object in the reference frame;

clipping the frame to be detected according to the second position to obtain a clipping result;

Key point detection is performed on the target area of the target object in the clipping result to obtain a key point detection result corresponding to the frame to be detected.
The method according to claim 7, wherein said obtaining the second position of the target area of the target object in the reference frame comprises:

Identifying the target area in the reference frame by using a first neural network, to obtain the second position output by the first neural network; and/or,

Obtaining the second position of the target area in the reference frame according to the key point detection result corresponding to the reference frame.
The method according to any one of claims 1 to 8, further comprising:

Determining the position of the target object in three-dimensional space according to the depth information of the target object.
A depth detection device, characterized by comprising:

An acquisition module, configured to acquire multiple frames to be detected, wherein the multiple frames to be detected include image frames obtained by acquiring images of the target object from at least two acquisition angles of view;

A key point detection module, configured to perform key point detection on a target area of the target object according to the frames to be detected, and determine a plurality of key point detection results corresponding to the multiple frames to be detected, wherein the target area includes a head area and/or a shoulder area;

A depth detection module, configured to determine the depth information of the target object according to the multiple key point detection results.
An electronic device, characterized by comprising:

A processor;

A memory for storing processor-executable instructions;

Wherein the processor is configured to invoke the instructions stored in the memory to execute the method according to any one of claims 1 to 9.
A computer-readable storage medium on which computer program instructions are stored, wherein the computer program instructions implement the method according to any one of claims 1 to 9 when executed by a processor.
A computer program product, characterized by comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method according to any one of claims 1 to 9.