WO2023024440A1 - Posture estimation method and apparatus, computer device, storage medium, and program product


Info

Publication number
WO2023024440A1
Authority
WO
WIPO (PCT)
Prior art keywords
video image
image frame
key point
target
target object
Prior art date
Application number
PCT/CN2022/074929
Other languages
French (fr)
Chinese (zh)
Inventor
曹国良
邱丰
刘文韬
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023024440A1 publication Critical patent/WO2023024440A1/en



Classifications

    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06T - Image data processing or generation, in general
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person

Definitions

  • The present disclosure relates to the technical field of image processing, and in particular to a pose estimation method and apparatus, a computer device, a storage medium, and a program product.
  • In the related art, the pose of the recognized object is often captured with a webcam, but uncertain factors such as the camera's field of view and the distance between the object and the lens frequently cause parts of the object's limbs to fall outside the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the pose of the object.
  • Embodiments of the present disclosure at least provide a pose estimation method, device, computer equipment, storage medium, and program product.
  • In a first aspect, an embodiment of the present disclosure provides a pose estimation method executed by an electronic device. The method includes: acquiring a video image frame containing a body part of a target object; performing occlusion detection on a target limb part of the target object; when a first part of the target limb part is detected to be occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points; and determining a pose detection result of the target object based on the first key points and the second key points.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the method can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part that were captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that pose detection of the target object is still possible even when the video image frame does not contain the complete body part.
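  • For illustration only, the following minimal Python sketch shows this overall flow; the three helper functions are hypothetical stand-ins for the occlusion detection model and the pose detection model described here, and their names, signatures, and return values are assumptions rather than the disclosure's actual implementation.

```python
from typing import List, Tuple

import numpy as np

Keypoint = Tuple[float, float]  # (x, y) in frame coordinates

def detect_occlusion(frame: np.ndarray) -> Tuple[bool, str]:
    # Placeholder occlusion detector: reports whether a first part of the
    # target limb part is occluded, and which part it is.
    return True, "hands"  # pretend the hands are cut off by the frame edge

def predict_first_keypoints(frame: np.ndarray, part: str) -> List[Keypoint]:
    # Placeholder for predicting key points of the occluded first part;
    # note the point may lie outside the frame (below the bottom edge here).
    return [(160.0, float(frame.shape[0]) + 20.0)]

def detect_second_keypoints(frame: np.ndarray) -> List[Keypoint]:
    # Placeholder for detecting key points of the visible second part.
    return [(120.0, 80.0), (180.0, 90.0)]

def estimate_pose(frame: np.ndarray) -> List[Keypoint]:
    occluded, first_part = detect_occlusion(frame)               # occlusion detection
    second_kps = detect_second_keypoints(frame)                  # second key points
    if occluded:
        first_kps = predict_first_keypoints(frame, first_part)   # first key points
        return first_kps + second_kps                            # pose detection result
    return second_kps

pose = estimate_pose(np.zeros((240, 320, 3), dtype=np.uint8))
```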
  • In an optional implementation, the method further includes: determining a collection distance corresponding to the video image frame, where the collection distance characterizes the distance between the target object and the video capture device at the time the video image frame was captured; and judging, based on the collection distance, whether the video image frame satisfies a facial expression capture condition. Performing occlusion detection on the target limb part of the target object in the video image frame then includes: performing the occlusion detection when it is determined that the video image frame does not satisfy the facial expression capture condition.
  • In an optional implementation, the method further includes: when it is determined that the collection distance meets a preset distance requirement, determining that the video image frame satisfies the facial expression capture condition; and performing facial expression detection on the video image frame to obtain a facial expression detection result.
  • In this way, when it is judged from the collection distance that the facial expression capture condition is satisfied, facial expression capture can be performed on the anchor.
  • Otherwise, occlusion detection can be performed on the target limb part of the target object in the video image frame, and when the first part is detected to be occluded, the key points of the first part and the second part are predicted separately in the image frame. This realizes limb detection for parts not contained in the image, solving the problem in the related art that limb points outside the image cannot be predicted, and thereby alleviating the severe limb point jumping caused by unstable limb point detection in the above applications.
  • In an optional implementation, predicting the key points of the first part based on the video image frame to obtain the first key points includes: intercepting, from the video image frame, a first image containing the second part; performing edge filling on the first image to obtain a second image containing a filled area, where the filled area is the area used for key point detection of the first part; and predicting the key points of the first part in the filled area of the second image to obtain the first key points.
  • In an optional implementation, performing edge filling on the first image to obtain the second image containing the filled area includes: determining attribute information of the first part, where the attribute information includes at least one of limb type information and limb size information; determining filling parameters of the first image according to the attribute information, where the filling parameters include at least one of a filling position and a filling size; and performing edge filling on the first image based on the filling parameters to obtain the second image.
  • In this way, a targeted filling of the video image frame preserves as much image resolution as possible, which in turn ensures a higher accuracy of the pose detection result.
  • In an optional implementation, intercepting the first image containing the second part from the video image frame includes: determining a target frame in the video image frame, where the target frame is used to frame the second part; and intercepting the sub-image located within the target frame in the video image frame to obtain the first image.
  • Obtaining the second image by intercepting the sub-image within the target frame and performing edge filling on that sub-image broadens the application scenarios of the pose estimation method provided in the present disclosure, and ensures that applications based on this pose estimation can still run normally and stably in complex pose estimation scenarios.
  • In an optional implementation, predicting the key points of the first part based on the video image frame to obtain the first key points includes: determining an estimated area of the first part based on the video image frame; and predicting the key points of the first part based on the estimated area to obtain the first key points.
  • When the key points of the first part are predicted through the estimated area of the first part, the estimated area can guide the pose detection model to detect the first key points of the first part missing from the video image frame, thereby improving the accuracy of the detected key points and reducing the detection error.
  • In an optional implementation, the method further includes: after the pose detection result of the target object is obtained, generating a gesture trigger signal of a virtual object corresponding to the target object according to the pose detection result; and controlling, according to the gesture trigger signal, the virtual object to perform a corresponding trigger action.
  • In an optional implementation, the method further includes: determining a plurality of training samples, where each training sample contains part of the target limb part of the target object and includes annotation information of each key point of the target limb part; and training a pose detection model to be trained with the plurality of training samples to obtain a pose detection model. Predicting the key points of the first part based on the video image frame to obtain the first key points, and determining the key points of the second part of the target limb part contained in the video image frame to obtain the second key points, then includes: predicting, based on the pose detection model, the key points of the first part in the video image frame to obtain the first key points, and predicting, based on the pose detection model, the key points of the second part of the target limb part contained in the video image frame to obtain the second key points.
  • By training the pose detection model to be trained with the plurality of training samples, a pose detection model capable of performing pose detection on images containing only part of the target limb part of the target object can be obtained.
  • In this way, pose detection of the target object can still be performed, ensuring that the anchor application can detect body parts normally.
  • In an optional implementation, determining the plurality of training samples includes: collecting an original image containing the complete target limb part of the target object, and performing limb detection on the original image to obtain a plurality of key points; performing occlusion processing on at least part of the specified limb parts in the original image; and determining the plurality of training samples based on the occlusion-processed original image and the annotation information of the plurality of key points.
  • In a second aspect, an embodiment of the present disclosure further provides a pose estimation apparatus, including: an acquisition part configured to acquire a video image frame containing a body part of a target object; a detection part configured to perform occlusion detection on the target limb part of the target object in the video image frame; a first determination part configured to, when a first part of the target limb part is detected to be occluded, predict the key points of the first part based on the video image frame to obtain the first key points, and determine the key points of the second part of the target limb part contained in the video image frame to obtain the second key points; and a second determination part configured to determine the pose detection result of the target object based on the first key points and the second key points.
  • In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the first aspect, or of any possible implementation of the first aspect, are performed.
  • In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the first aspect, or of any possible implementation of the first aspect, are performed.
  • In a fifth aspect, an embodiment of the present disclosure further provides a computer program product comprising a computer program or instructions; when the computer program or instructions run on a computer, the computer executes the steps of the first aspect, or of any possible implementation of the first aspect.
  • FIG. 1 shows a flow chart of a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 2 shows a specific flow chart of predicting the key points of the first part based on the video image frame and obtaining the first key points in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of a video image frame provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a first image containing the second part, obtained by intercepting a video image frame, provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a detection result of key point detection of a first part and a second part in a second image provided by an embodiment of the present disclosure
  • FIG. 6 shows a specific flow chart of intercepting the first image containing the second part in the video image frame in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 7 shows a specific flow chart of performing edge padding on the first image to obtain a second image containing a padding area in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 8a shows a schematic diagram of a padding effect obtained by padding a first image to obtain a second image according to an embodiment of the present disclosure
  • Fig. 8b shows another schematic diagram of the padding effect of padding the first image to obtain the second image according to an embodiment of the present disclosure
  • FIG. 9 shows another specific flow chart of performing edge padding on the first image to obtain a second image containing a padding area in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 10 shows a specific flow chart of predicting the key points of the first part based on the video image frame and obtaining the first key points in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 11 shows a specific flow chart of a first optional manner of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 12 shows a specific flow chart of a second optional manner of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 13 shows a flow chart of another pose estimation method provided by an embodiment of the present disclosure
  • Fig. 14 shows a schematic structural diagram of a pose estimation device provided by an embodiment of the present disclosure
  • Fig. 15 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
  • During gesture recognition, the posture of the recognized object is often captured with a webcam, but uncertain factors such as the camera's field of view and the distance between the object and the lens frequently cause some limbs of the object to fall outside the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the pose of the object.
  • the present disclosure provides a pose estimation method, device, computer equipment, storage medium and program product.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the present disclosure can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that accurate pose detection of the target object is still possible when the video image frame does not contain the complete body part.
  • the execution subject of the pose estimation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities.
  • The computer device may be a live broadcast device; for example, the live broadcast device may be a smart phone, a tablet computer, a PC, or any other device capable of pose estimation.
  • the pose estimation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • As shown in FIG. 1, a flowchart of a pose estimation method provided by an embodiment of the present disclosure, the method includes steps S101 to S107, wherein:
  • S101 Acquire a video image frame including a body part of a target object.
  • the video image frame including the body parts of the target object may be acquired through the camera device of the computer equipment.
  • the limb parts included in the video image frame may be the whole body limb parts or the half body limb parts of the target object.
  • the body parts of the half body may include the following body parts: body parts above the waist of the target object (head, upper torso, arms, hands).
  • the technical solution of the present disclosure can be used in a live broadcast scene
  • the above-mentioned computer device can be a device capable of installing a live broadcast application program.
  • the above-mentioned target object may be the anchor
  • The acquired video image frame may be a video image frame containing the anchor's body parts, captured from the anchor during the live broadcast.
  • it can also be applied to other video playing scenarios.
  • S103 Perform occlusion detection on the target body part of the target object in the video image frame.
  • an occlusion detection model may be used to perform occlusion detection on the target body part of the target object in the video image frame.
  • the target body part may be all the body parts of the target object, and may also be a body part of the target object, which is not limited in the present disclosure.
  • The occlusion detection result is used to characterize the integrity of the target body part.
  • The integrity result is either complete or incomplete.
  • When the result is incomplete, the occlusion detection result may further include part information of the occluded part (i.e., the first part).
  • For example, suppose the target body part is the half-body part of the target object and the video image frame does not contain the target object's hands.
  • In this case, the occlusion detection result indicates that the target body part in the video image frame is incomplete and that the occluded part of the target body part is the hands of the target subject.
  • S105 When the first part of the target body part is detected to be occluded, the pose detection model can be used to predict the key points of the occluded first part and to predict the key points of the second part of the target body part contained in the video image.
  • In an optional implementation, edge filling can be performed on the video image frame, and the edge-filled video image frame is processed by the pose detection model, so that the key points of the first part are predicted in the filled area.
  • The pose detection model can also perform key point detection on the second part contained in the video image frame to obtain the second key points.
  • the target body part may contain multiple sub-parts.
  • the target body part may be the upper body part of the target subject.
  • For example, the target body part may include the following sub-parts: head, upper torso, arms, and hands.
  • the first part may be a complete sub-part, or a partial part of a sub-part.
  • For example, the first part may be both hands, indicating that the part occluded in the video image frame is the hands; the first part may also be the fingers, indicating that the part occluded in the video image frame is the fingers of both hands.
  • the following steps may also be performed:
  • Determine the acquisition moment of the video image frame, calculate the distance between the target object and the camera device of the computer equipment at that moment, and obtain the collection distance; then compare the collection distance with a first distance threshold and a second distance threshold. If the comparison shows that the collection distance is smaller than the first distance threshold, or greater than the second distance threshold, it is preliminarily determined that occlusion detection does not need to be performed on the video image frame. If the collection distance is greater than or equal to the first distance threshold and less than or equal to the second distance threshold, occlusion detection needs to be performed on the target body part in the video image frame.
  • the second distance threshold is greater than the first distance threshold.
  • the first distance threshold and the second distance threshold may be distance thresholds selected in advance according to experience, and may also be distance thresholds preset in the computer device of the target object.
  • The first distance threshold characterizes the distance between the target object and the camera at which the video image frame no longer contains the parts below the target object's head, or contains too few of them for pose detection.
  • The second distance threshold characterizes the distance between the target object and the camera device at which the complete target body part is contained in the video image frame.
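  • A small sketch of this gating logic is given below, assuming illustrative threshold values in metres; the disclosure only requires that the second distance threshold be greater than the first.

```python
# Assumed example thresholds; the disclosure does not fix concrete values.
FIRST_DISTANCE_THRESHOLD = 0.5   # below this, the face fills the frame
SECOND_DISTANCE_THRESHOLD = 2.5  # beyond this, the full target body part is visible

def satisfies_face_capture_condition(collection_distance: float) -> bool:
    # Facial expression capture applies when the subject is close to the camera.
    return collection_distance < FIRST_DISTANCE_THRESHOLD

def needs_occlusion_detection(collection_distance: float) -> bool:
    # Occlusion detection is only needed between the two thresholds.
    return FIRST_DISTANCE_THRESHOLD <= collection_distance <= SECOND_DISTANCE_THRESHOLD

print(needs_occlusion_detection(1.2))  # True: run occlusion detection
```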
  • S107 Determine a pose detection result of the target object based on the first key point and the second key point.
  • The pose detection result of the target object can be determined based on the first key points and the second key points.
  • the method also includes the following steps:
  • The collection distance is used to characterize the distance between the target object and the video collection device when the video collection device collects the video image frame; based on the collection distance, it is judged whether the above-mentioned video image frame satisfies the facial expression capture condition.
  • In step S103, when it is determined that the video image frame does not satisfy the facial expression capture condition, occlusion detection is performed on the target body part of the target object in the video image frame.
  • facial expression capture of the target object can also be performed.
  • The collection distance is the distance between the target object and the video capture device (for example, a camera of the computer device) at the moment the video image frame is collected.
  • In an optional implementation, the collection distance may be compared with the above-mentioned first distance threshold. If the collection distance is less than the first distance threshold, it is determined that the preset distance requirement is met, that is, the video image frame satisfies the facial expression capture condition; at this time, facial expression capture can be performed on the video image frame.
  • occlusion detection may be performed on the target body part of the target object in the video image frame.
  • When the first part is occluded, the key points of the first part can be predicted based on the video image frame to obtain the first key points, and the key points of the second part of the target limb part contained in the video image frame can be determined to obtain the second key points.
  • The pose detection result of the target object is then determined based on the first key points and the second key points.
  • The single-mode technical solution in the related art can be understood as follows: when the image contains a complete face, the application can only capture facial expressions; when the image contains the body parts to be detected, the application can only capture the body. When the application can neither capture facial expressions nor find the complete body parts to be detected in the image, it cannot run its facial capture and body detection functions normally and stably.
  • For example, the above-mentioned application program is virtual live broadcast software.
  • In the virtual live broadcast scenario, the anchor's body movements are often captured and recognized with a webcam, but uncertain factors such as the camera's field of view and the distance between the anchor and the lens interfere with this.
  • When the anchor is very close to the camera, part of the body extends beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame).
  • In that case, the virtual live broadcast software in the related art cannot perform limb detection normally.
  • With the technical solution of the present disclosure, a smooth transition between facial expression capture and body part capture can also be realized, thereby improving the robustness of the application program so that it can run stably.
  • Step S105, predicting the key points of the first part based on the video image frame to obtain the first key points, includes the following process:
  • S1051 Intercept a first image including the second part from the video image frame.
  • S1052 Perform edge padding on the first image to obtain a second image including a padding area, where the padding area is an area used to perform key point detection on the first part.
  • S1053 Predict the key point of the first part based on the filled area in the second image to obtain the first key point.
  • The image shown in FIG. 3 is a video image frame containing the upper-body parts of the target object, collected in step S101 above. As can be seen from FIG. 3, part of the target object's fingers (i.e., the above-mentioned first part) is occluded. At this time, the first image containing the second part is intercepted from the video image frame, yielding the first image shown in FIG. 4.
  • edge filling may be performed on the first image, so as to obtain a second image including the filled area.
  • Pose detection can then be performed on the second image through the pose detection model, so as to predict the key points of the first part in the filled area to obtain the first key points, and to predict the key points of the second part in the areas of the second image other than the filled area to obtain the second key points.
  • In the second image, the black image area is the above-mentioned filled area.
  • As shown in FIG. 5, edge filling may be performed on the first image with a black image.
  • Here, edge filling can be understood as filling the edges of the first image according to the position, within the video image frame, of the occluded first part, so as to obtain an area in which key point detection can be performed for the occluded first part.
  • The filled area is used to predict the key points, so that when the first part of the target body part in the video image frame is occluded, the key points of the second part collected in the video image frame can be detected, and the key points of the first part outside the video image frame can be predicted.
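  • A sketch of steps S1051 to S1053 using OpenCV is shown below; the target frame coordinates and the padding amount are illustrative assumptions, since in the disclosure they come from the detected second part and from the attribute information of the missing first part.

```python
import cv2
import numpy as np

def crop_and_pad(frame: np.ndarray, box, pad_bottom: int) -> np.ndarray:
    x0, y0, x1, y1 = box                # target frame around the second part
    first_image = frame[y0:y1, x0:x1]   # S1051: intercept the first image
    # S1052: edge-fill with a black background to create the filled area
    second_image = cv2.copyMakeBorder(
        first_image, 0, pad_bottom, 0, 0,
        borderType=cv2.BORDER_CONSTANT, value=(0, 0, 0))
    return second_image                 # S1053 runs the pose model on this image

frame = np.zeros((480, 640, 3), dtype=np.uint8)
second_image = crop_and_pad(frame, box=(100, 50, 540, 480), pad_bottom=120)
```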
  • S1051 Intercepting the first image containing the second part in the video image frame includes the following process:
  • S601 Determine a target frame in the video image frame, where the target frame is used to frame the second part.
  • S602 Intercept a sub-image within the target frame in the video image frame to obtain the first image.
  • Here, the occlusion of the first part of the target object in the video image frame can be understood in two ways: the first part is truncated by the edge of the image so that it is occluded, or the first part is blocked by another object in the video image frame so that it is occluded.
  • In an optional implementation, when the first part of the target body part is not detected in the video image frame and the first part is detected not to be at the edge of the video image frame, it may be determined that the first part is occluded by another object in the video image frame.
  • a target frame for framing the second part may be determined in the video image frame. Then, the sub-image located in the target frame in the video image frame is intercepted to obtain the first image.
  • Alternatively, edge filling can be performed directly on the video image frame, thereby obtaining a second image containing the filled area.
  • The process of performing edge filling on the video image frame is the same as the process of performing edge filling on the first image.
  • In the following, edge filling on the first image is described as an example.
  • Because the second image is obtained by intercepting the sub-image within the target frame and performing edge filling on that sub-image, the application scenarios of the pose estimation method provided in the present disclosure are expanded, and applications based on this pose estimation can still run normally and stably in complex pose estimation scenarios.
  • In an optional implementation, edge filling of the first image can be realized by filling a black image around the first image.
  • In addition to this edge filling method, the following method may also be chosen to fill the first image and obtain the second image:
  • Position information of an occluder that occludes the first part is determined in the video image frame; the image at that position in the video image frame is then replaced with a background image of a preset color, for example a black background image.
  • The preset color can be determined based on the training samples of the pose detection model, which will be described in detail below.
  • S701 Determine attribute information of the first part, where the attribute information includes: limb type information and/or limb size information.
  • S702 Determine a padding parameter of the first image according to the attribute information; wherein, the padding parameter includes: a padding position and/or a padding size.
  • S703 Perform edge padding on the first image based on the padding parameters to obtain the second image.
  • the above-mentioned first part can be understood as a body part that is missing in the video image frame and for which the pose detection model needs to perform pose detection.
  • For example, the pose detection model needs to detect the anchor's upper-body parts, but part of the hands is missing from the video image frame; in this case, the first part is the missing hand part.
  • The limb type information indicates the type of the first part missing from the video image frame, for example that the missing first part is a hand.
  • The limb size information indicates the size (or scale) information of the first part that is missing from the video image frame.
  • the positional relationship of the first part relative to the first image can be estimated.
  • The filling position and/or filling size of the first image can be determined based on the attribute information, and the first image can then be filled accordingly to obtain the second image.
  • In an optional implementation, the positional relationship of the first part relative to the first image may be determined based on the limb type information; for example, the first part should be located at the bottom edge of the first image.
  • the filling position may be determined based on the position information, for example, the bottom edge of the first image may be determined as the filling position.
  • the padding size of the first image may also be determined based on the limb size information, for example, the limb size information may be determined as the padding size.
  • For example, when the missing first part in the video image frame is a hand part, edge filling can be performed on the lower edge of the first image.
  • the lower edge of the video image frame may be filled with an image of a black background.
  • edge padding may be performed on the bottom edge, left edge, and right edge of the first image.
  • the bottom edge, left edge, and right edge of the first image may be filled with an image of a black background, respectively.
  • background images of other colors can also be selected to be filled, which is not limited in the present disclosure.
  • It should be noted that the second image after the padding process needs to be resized to a preset size, so filling more space reduces the resolution of the image region corresponding to the target object's limb parts in the resulting image.
  • Therefore, filling the video image frame in a targeted manner preserves as much image resolution as possible, which is beneficial for obtaining pose detection results with higher accuracy.
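  • A hypothetical mapping from the attribute information of the missing first part to the filling parameters of steps S701 to S703 is sketched below; the rule-table entries and sizes are examples only, not values taken from the disclosure.

```python
from typing import Optional, Tuple

# Assumed rule table: limb type -> (filling position, default filling size in px).
PADDING_RULES = {
    "hand": ("bottom", 120),
    "forearm": ("bottom", 160),
    "head": ("top", 100),
}

def padding_params(limb_type: str,
                   limb_size_px: Optional[int] = None) -> Tuple[str, int]:
    # S702: derive the filling parameters from the attribute information;
    # the limb size information, when known, overrides the default size.
    position, default_size = PADDING_RULES[limb_type]
    size = limb_size_px if limb_size_px is not None else default_size
    return position, size

print(padding_params("hand"))       # ('bottom', 120)
print(padding_params("hand", 95))   # ('bottom', 95)
```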
  • the limb type information of the first part may be determined in the following manner, including:
  • Method 1: the limb type information of the first part missing from the video image frame can be estimated according to the estimated distance between the target object and the camera device.
  • the distance between the target object and the camera device may be predicted by the distance measurement model, and the missing limb type information of the first part may be output by the distance measurement model.
  • Method 2: intercept the first image containing the second part from the video image frame through the target frame, and then recognize the first image to obtain the limb type information of the first part missing from the video image frame.
  • the limb size information of the first part may be determined in the following manner, including:
  • For example, actual length information of the target object is acquired first; this may be the actual height of the target object, or the actual length of any complete target limb part of the target object. Then, the in-image length of a complete designated limb part contained in the video image frame is determined, and the above-mentioned limb size information is estimated from that length information and the actual length information.
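  • A worked sketch of this scale-based estimate follows, under the assumption that pixel extent scales linearly with real-world length; all numeric values are illustrative.

```python
def estimate_missing_limb_px(ref_len_px: float, ref_len_actual: float,
                             missing_len_actual: float) -> float:
    # Pixels-per-unit scale from a complete reference limb visible in the frame.
    scale = ref_len_px / ref_len_actual
    return missing_len_actual * scale  # expected pixel extent of the first part

# e.g. a 0.30 m forearm spans 150 px, so a 0.19 m hand spans about 95 px
pad_px = estimate_missing_limb_px(150.0, 0.30, 0.19)
print(round(pad_px))  # 95
```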
  • In addition to performing edge filling on the first image in the manner described above to obtain the second image, the first image may also be edge-filled in the following ways, including:
  • Each image edge of the video image frame may be filled separately, and at this time, the padding size of each image edge may be a preset size.
  • The filling size used to fill an image edge can be a preset size, or it may be the limb size information determined above.
  • the method further includes the following process:
  • S901 Acquire scene type information of the video image frame.
  • S902 Adjust the padding parameters according to the scene type information, and pad the video image frames based on the adjusted padding parameters to obtain the second image.
  • In an optional implementation, the scene type information corresponding to the video image frame may be determined; for example, the scene type information may be a live sales (product showcase) scene, a game commentary scene, a performance scene, and the like.
  • In different scenes, the resolution requirements for the images input into the pose detection model may differ. Therefore, after the scene type information is determined, the image resolution matching the scene type information may be determined, and the padding parameters adjusted according to that image resolution.
  • For example, if a scene requires a higher image resolution, the padding size may be adaptively reduced so as to preserve the resolution of the second image obtained through the padding process.
  • If a scene's resolution requirement is lower, the padding size can be increased adaptively, or the original padding size can be kept unchanged.
  • It should be noted that, however the padding parameters are adjusted, the second image obtained after padding must include the area used for pose detection of the first part.
  • In this way, the padding parameters are adjusted through the scene type information of the video image frame, so that the video image frame is expanded according to the adjusted padding parameters and the key points of the complete target body part can be detected.
  • At the same time, the image resolution should be kept as high as possible, which is conducive to obtaining a more accurate pose detection result.
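  • A hypothetical sketch of this scene-aware adjustment (S901 and S902) is given below; the scene names and resolution values are assumptions for illustration only.

```python
# Assumed per-scene input resolutions; the disclosure leaves these open.
SCENE_RESOLUTION = {
    "game_commentary": 256,
    "performance": 384,
    "live_sales": 512,
}

def adjust_padding(base_pad: int, scene: str, default_res: int = 384) -> int:
    required = SCENE_RESOLUTION.get(scene, default_res)
    if required > default_res:
        return max(base_pad // 2, 16)  # high-resolution scene: pad less
    return base_pad                    # otherwise keep (or grow) the padding

pad = adjust_padding(120, "live_sales")  # -> 60
```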
  • step S105 predicting the key point of the first part based on the video image frame to obtain the first key point also includes the following process:
  • S1001 Determine an estimated area of the first part based on the video image frame.
  • S1002 Predict key points of the first part based on the estimated area, to obtain the first key points.
  • the estimated area of the first part may be determined through the following steps, including:
  • The estimated area of the first part is an area in the second image used for estimating the key points of the first part.
  • the estimation area may be a rectangular area, and may also be a circular area, and the disclosure does not limit the shape of the estimation area.
  • the estimated area of the first part may also be determined through the following steps, including:
  • S1101 Determine limb type information corresponding to the first part; and determine a target part associated with the first part among target limb parts.
  • the first part and the target part may be detection parts having a linkage relationship.
  • the first part moves under the drive of the target part, or the target part moves under the drive of the first part.
  • For example, when the first part is a hand, the target part may be the wrist.
  • Alternatively, the first part may be the wrist and the target part the forearm. Further combinations are not listed here.
  • S1102 Based on the limb type information of the first part and the limb type information of the target part, determine a position constraint between the first part and the target part. Wherein, the position constraint is used to constrain the position difference between the position of the first part in the second image and the position of the target part in the second image.
  • the second image is an image obtained after edge filling is performed on the first image in the above embodiment, and the first image is a sub-image including the second part in the video image frame.
  • S1103 Determine an estimated area of the first part in the second image based on the position constraint.
  • Using position constraints to determine the estimated area of the first part in the second image can reduce the phenomenon of large position differences between the first part and the target part, thereby improving the processing accuracy of the pose detection model.
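  • A sketch of S1101 to S1103 follows, under the assumption that the position constraint can be expressed as a fixed offset and search extent around the target part (for example, a hand expected a fixed distance below the wrist); the offset and size values are illustrative, not taken from the disclosure.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def estimate_first_part_area(target_xy: Tuple[float, float],
                             offset: Tuple[float, float] = (0.0, 60.0),
                             half_size: float = 40.0) -> Box:
    # Constrain the estimated area to lie within `offset` of the target part,
    # e.g. the hand is expected a fixed distance below the wrist.
    cx, cy = target_xy[0] + offset[0], target_xy[1] + offset[1]
    return (cx - half_size, cy - half_size, cx + half_size, cy + half_size)

area = estimate_first_part_area((320.0, 400.0))  # wrist -> expected hand area
```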
  • After the estimated area is determined, the pose detection model can perform pose detection on the second image annotated with the estimated area to obtain a pose detection result.
  • the posture detection result includes complete annotation information of key points of the target body part, wherein the annotation information includes: position information and category information.
  • In this way, the estimated area can guide the pose detection model to detect the first key points of the first part missing from the video image frame, thereby improving the accuracy of the detected key points and reducing the detection error.
  • the estimated area of the first part can also be determined through the following steps, including the following process:
  • S1201 Determine a target video image in the historical video images corresponding to the target object, wherein the similarity between the target video image and the video image frame meets a preset requirement, and the target video image contains the first part.
  • S1202 Determine the estimation area according to the position information of the first part included in the target video image.
  • In an optional implementation, the historical video images (for example, historical live images) of the target object are first acquired from the cache folder.
  • the cache folder is used to store a video image frame including a complete specified body part, and a pose detection result corresponding to the video image frame.
  • the historical video images can be screened to obtain the target video images.
  • the screening process is described as follows:
  • a feature distance between each historical video image and the video image frame is calculated, wherein the feature distance is used to characterize the similarity between the historical video image and the video image frame.
  • the target video image whose similarity with the video image frame meets the preset requirement is selected from the historical video image.
  • meeting the preset requirement can be understood as: the feature distance is greater than or equal to the preset distance threshold.
  • the estimation area can be determined according to the position information of the first part included in the target video image. For example, the location information of the first part included in the target video image may be determined as the estimated area.
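  • An illustrative screening of cached historical frames for S1201 and S1202 is sketched below, using cosine similarity over assumed per-frame feature vectors as the feature distance; the cache layout and threshold are assumptions.

```python
import numpy as np

def select_target_video_image(query_feat: np.ndarray, cache, thresh: float = 0.8):
    # cache: list of (feature_vector, first_part_box) tuples from frames that
    # contained the complete first part; returns the best-matching box or None.
    best_sim, best_box = thresh, None
    for feat, box in cache:
        sim = float(np.dot(query_feat, feat) /
                    (np.linalg.norm(query_feat) * np.linalg.norm(feat)))
        if sim >= best_sim:
            best_sim, best_box = sim, box
    return best_box

cache = [(np.array([1.0, 0.0]), (10, 20, 60, 80))]
box = select_target_video_image(np.array([0.9, 0.1]), cache)
```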
  • In an optional implementation, after the pose detection result of the target object is obtained, the method further includes the following process:
  • S1301 Generate a gesture trigger signal of the virtual object corresponding to the target object according to the pose detection result.
  • S1302 Control the virtual object to perform a corresponding trigger action according to the gesture trigger signal.
  • In an optional implementation, the gesture trigger signal of the virtual object may be generated according to the key points of the target body part of the target object in the pose detection result.
  • That is, a gesture trigger signal for triggering the virtual object to perform a corresponding trigger action is generated according to the pose detection result.
  • the trigger signal is used to indicate the position information of the key points of each virtual limb of the virtual object in the video image frame.
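  • A minimal, hypothetical sketch of generating the gesture trigger signal follows: the detected key points are packaged into a signal addressed to the matching virtual limb key points of the avatar. The identity mapping is purely illustrative; a real system would retarget to the avatar's skeleton.

```python
from typing import Dict, Tuple

Keypoint = Tuple[float, float]

def make_trigger_signal(pose_result: Dict[str, Keypoint]) -> Dict[str, Keypoint]:
    # Map each detected body key point to the matching virtual limb key point.
    return {f"virtual_{name}": xy for name, xy in pose_result.items()}

signal = make_trigger_signal({"wrist_l": (320.0, 400.0), "elbow_l": (300.0, 320.0)})
```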
  • The image of the above-mentioned virtual object is a preset virtual image (for example, a virtual anchor), where the preset image includes at least one of the following: a three-dimensional humanoid mimicry (the humanoid mimicry may be a person or a humanoid figure such as an alien), a three-dimensional animal mimicry (such as a dinosaur or a pet cat), a two-dimensional character mimicry, a two-dimensional animal mimicry, and the like.
  • the method also includes the following process:
  • each of the training samples includes part of the target body part of the target object, and each of the training samples includes label information of each key point of the target body part.
  • The pose detection model to be trained is trained with the plurality of training samples to obtain the trained pose detection model.
  • In an optional implementation, the multiple training samples are obtained first and then input into the pose detection model to be trained, so that it is trained.
  • By training the pose detection model to be trained with the plurality of training samples, a pose detection model capable of performing pose detection on images containing only part of the target body parts of the target object can be obtained.
  • In this way, pose detection of the target object can still be performed, so that the anchor application can run normally and stably.
  • determining a plurality of training samples includes the following process:
  • the original image including all target body parts of the target object is collected, and body detection is performed on the original image to obtain multiple key points.
  • After the multiple key points are obtained, occlusion processing can be performed on at least part of the target body parts in the original image, and the plurality of training samples can be determined based on the occlusion-processed original image and the annotation information of the multiple key points.
  • the original image including all target body parts is acquired.
  • the target limb part may be the upper body limb part of the human body
  • the original image must at least include the complete upper body limb part, for example, the whole body limb part may be included.
  • limb detection may be performed on the original image to obtain multiple key points, wherein the multiple key points include multiple key points of the target body part.
  • occlusion processing may be performed on the original image to obtain an original image after occlusion processing, wherein the original image after occlusion processing contains incomplete target body parts.
  • the original image after occlusion processing and the key points of the target body part determined in the above process can be determined as a training sample.
  • performing occlusion processing on at least part of the target body parts in the original image includes the following process:
  • In an optional implementation, a background image of a preset color can be used to occlude at least part of the target body parts to obtain the occlusion-processed original image; alternatively, at least part of the target body parts in the original image can be cropped out to obtain the occlusion-processed original image.
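  • A sketch of this sample construction is given below, assuming the occlusion is a preset-colour rectangle; the image, key points, and box are placeholder values, not data from the disclosure.

```python
import numpy as np

def make_training_sample(original: np.ndarray, keypoints, occlude_box,
                         colour=(0, 0, 0)):
    x0, y0, x1, y1 = occlude_box
    occluded = original.copy()
    occluded[y0:y1, x0:x1] = colour  # occlusion with a preset-colour background
    return occluded, keypoints       # annotation still covers the hidden part

img = np.full((480, 640, 3), 255, dtype=np.uint8)
sample, labels = make_training_sample(img, [(320, 240)], (0, 400, 640, 480))
```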
  • This disclosure relates to the field of augmented reality, in which Augmented Reality (AR) effects are combined with reality.
  • the target object may involve faces, limbs, gestures, actions, etc. related to the human body.
  • Applications can involve not only interactive scenes related to real scenes or objects, such as tours, navigation, explanation, reconstruction, and virtual effect overlay and display, but also special effects processing related to people, such as makeup beautification, body beautification, special effect display, and virtual model display.
  • the relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network.
  • An embodiment of the present disclosure also provides a pose estimation apparatus corresponding to the pose estimation method. Since the problem-solving principle of the apparatus is similar to that of the pose estimation method described above, the implementation of the apparatus may refer to the implementation of the method.
  • the device includes: an acquisition part 141, a detection part 142, a first determination part 143, and a second determination part 144; wherein,
  • the acquiring part 141 is configured to acquire a video image frame including a body part of a target object
  • the detection part 142 is configured to perform occlusion detection on the target limb part of the target object in the video image frame;
  • the first determining part 143 is configured to predict the key point of the first part based on the video image frame to obtain the first key point when it is detected that the first part of the target body part is blocked, and Determining the key points of the second part of the target limb part contained in the video image frame to obtain the second key points;
  • the second determining part 144 is configured to determine the gesture detection result of the target object based on the first key point and the second key point.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the apparatus can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that pose detection of the target object is still possible even when the video image frame does not contain the complete body part.
  • In a possible implementation, the apparatus is further configured to: determine the collection distance corresponding to the video image frame, where the collection distance characterizes the distance between the target object and the video capture device when the video image frame is collected; judge, based on the collection distance, whether the video image frame satisfies the facial expression capture condition; and perform occlusion detection on the target body part of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial expression capture condition.
  • In a possible implementation, the apparatus is further configured to: determine that the video image frame satisfies the facial expression capture condition when it is determined that the collection distance meets the preset distance requirement; and perform facial expression detection on the video image frame to obtain a facial expression detection result.
  • In a possible implementation, the first determining part 143 is further configured to: intercept the first image containing the second part from the video image frame; perform edge padding on the first image to obtain the second image containing the padded area, where the padded area is the area used to perform key point detection on the first part; and predict the key points of the first part based on the padded area in the second image to obtain the first key points.
  • the first determining part 143 is further configured to: determine attribute information of the first part, wherein the attribute information includes: limb type information and/or limb size information; according to the The attribute information determines the padding parameters of the first image; wherein the padding parameters include: a padding position and/or a padding size; edge padding is performed on the first image based on the padding parameters to obtain the second image.
  • In a possible implementation, the first determining part 143 is further configured to: determine a target frame in the video image frame, where the target frame is used to frame the second part; and intercept the sub-image within the target frame in the video image frame to obtain the first image.
  • In a possible implementation, the first determining part 143 is further configured to: determine the estimated area of the first part in the video image frame; and predict the key points of the first part based on the estimated area to obtain the first key points.
  • In a possible implementation, the apparatus is further configured to: after the pose detection result of the target object is obtained, generate a gesture trigger signal of the virtual object corresponding to the target object according to the pose detection result; and control, according to the gesture trigger signal, the virtual object to perform a corresponding trigger action.
  • In a possible implementation, the apparatus is further configured to: determine a plurality of training samples, where each training sample contains part of the target limb part of the target object and includes annotation information of each key point of the target limb part; and train the pose detection model to be trained with the plurality of training samples to obtain the pose detection model. The first determining part 143 is further configured to: predict, based on the pose detection model, the key points of the first part in the video image frame to obtain the first key points, and predict, based on the pose detection model, the key points of the second part of the target limb part contained in the video image frame to obtain the second key points.
  • In a possible implementation, the apparatus is further configured to: collect an original image containing the complete target body part of the target object, and perform limb detection on the original image to obtain multiple key points; perform occlusion processing on at least part of the target body parts in the original image; and determine the plurality of training samples based on the occlusion-processed original image and the annotation information of the multiple key points.
  • the embodiment of the present disclosure also provides a computer device 1500, as shown in FIG. 15, which is a schematic structural diagram of the computer device 1500 provided by the embodiment of the present disclosure, including:
  • a processor 151, a memory 152, and a bus 153. The memory 152 is used for storing execution instructions and includes an internal memory 1521 and an external memory 1522; the internal memory 1521 is used for temporarily storing operation data in the processor 151 and data exchanged with the external memory 1522 such as a hard disk. The processor 151 exchanges data with the external memory 1522 through the internal memory 1521. When the computer device 1500 runs, the processor 151 communicates with the memory 152 through the bus 153, so that the processor 151 executes the instructions of the pose estimation method described in the above method embodiments.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the pose estimation method described in the above-mentioned method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product, the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on a computer, the computer executes the method described in the above method embodiment.
  • For the steps of the pose estimation method, reference may be made to the foregoing method embodiments.
  • the above-mentioned computer program product may be realized by hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • the division of the units is a logical function division.
  • For example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
• Each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
• If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
• The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned computer-readable storage medium may be a tangible device capable of retaining and storing instructions used by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium.
  • a computer readable storage medium may be, for example but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above.
• A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random-access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
• The present disclosure provides a pose estimation method, device, computer equipment, storage medium, and program product, wherein the method includes: acquiring a video image frame containing a body part of a target object; performing occlusion detection on the target body part of the target object in the video image frame; in the case that a first part of the target body part is detected to be occluded, predicting the key point of the first part based on the video image frame to obtain a first key point, and determining the key point of the second part of the target body part contained in the video image frame to obtain a second key point; and determining a pose detection result of the target object based on the first key point and the second key point.
• In the present disclosure, by predicting the key points of the occluded first part and the key points of the unoccluded second part, both the key points of the second part captured within the video image frame and the key points of the first part outside the frame can be predicted when the target body part is occluded, yielding reasonable, stable, non-jumping key point positions; pose detection of the target object therefore remains possible even when the video image frame does not contain complete body parts.

Abstract

Provided in the present disclosure are a posture estimation method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a video image frame which contains a body part of a target object; performing occlusion detection on a target body part of the target object in the video image frame; when it is detected that a first part of the target body part is occluded, predicting a key point of the first part on the basis of the video image frame so as to obtain a first key point, and determining a key point of a second part of the target body part contained in the video image frame so as to obtain a second key point; and determining a posture detection result of the target object on the basis of the first key point and the second key point.

Description

A pose estimation method, device, computer equipment, storage medium, and program product
Cross-Reference to Related Applications
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110994673.3, filed on August 27, 2021 and entitled "A Pose Estimation Method, Device, Computer Equipment, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of image processing, and relates to a pose estimation method, device, computer equipment, storage medium, and program product.
Background
In related pose capture solutions, the pose of the recognized object is often captured using a webcam, but due to uncertain factors such as the camera's field-of-view specification and the distance between the object and the lens, parts of the object's limbs often extend beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the object's pose.
Summary
Embodiments of the present disclosure provide at least a pose estimation method, device, computer equipment, storage medium, and program product.
In a first aspect, an embodiment of the present disclosure provides a pose estimation method, executed by an electronic device, the method including: acquiring a video image frame containing a body part of a target object; performing occlusion detection on a target body part of the target object in the video image frame; in the case that a first part of the target body part is detected to be occluded, predicting a key point of the first part based on the video image frame to obtain a first key point, and determining a key point of a second part of the target body part contained in the video image frame to obtain a second key point; and determining a pose detection result of the target object based on the first key point and the second key point.
In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and the occluded first part is detected in it, predicting the key points of the occluded first part while also predicting the key points of the unoccluded second part makes it possible, when the target body part is occluded, to predict both the key points of the second part captured within the frame and the key points of the first part lying outside the frame. Reasonable, stable, non-jumping key point positions can thus be predicted, so that pose detection of the target object remains possible even when the video image frame does not contain the complete body part.
In an optional implementation, the method further includes: determining a capture distance corresponding to the video image frame, where the capture distance characterizes the distance between the target object and the video capture device when the video capture device captures the video image frame; and judging, based on the capture distance, whether the video image frame satisfies a facial expression capture condition. Performing occlusion detection on the target body part of the target object in the video image frame includes: performing occlusion detection on the target body part of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial expression capture condition.
In an optional implementation, the method further includes: when it is determined that the capture distance meets a preset distance requirement, determining that the video image frame satisfies the facial expression capture condition; and performing facial expression detection on the video image frame to obtain a facial expression detection result.
In the above implementations, facial expression capture can be performed on the anchor when it is judged, based on the capture distance, that the facial expression capture condition is satisfied. When it is judged that the condition is not satisfied, occlusion detection can be performed on the target body part of the target object in the video image frame, and when the first part is detected to be occluded, separately predicting the key points of the first part and the second part enables body detection for body parts not contained in the image. This solves the problem in the related art that body points outside the image cannot be predicted, and alleviates the severe body-point jumping caused by unstable body-point detection when such applications perform body detection.
In an optional implementation, predicting the key point of the first part based on the video image frame to obtain the first key point includes: cropping, from the video image frame, a first image containing the second part; performing edge padding on the first image to obtain a second image containing a padded area, where the padded area is the area used to perform key point detection for the first part; and predicting the key point of the first part based on the padded area in the second image to obtain the first key point.
In the above implementation, by cropping the first image containing the second part from the video image frame and performing edge padding on it to obtain the second image, the key points of the occluded first part can be predicted through the second image. Thus, when the first part of the target body part is occluded in the video image frame, both the key points of the second part captured within the frame and the key points of the first part outside the frame can be predicted.
In an optional implementation, performing edge padding on the first image to obtain the second image containing the padded area includes: determining attribute information of the first part, where the attribute information includes at least one of limb type information and limb size information; determining padding parameters of the first image according to the attribute information, where the padding parameters include at least one of a padding position and a padding size; and performing edge padding on the first image based on the padding parameters to obtain the second image.
In the above implementation, by determining the padding parameters according to the attribute information of the first part and padding the video image frame according to those parameters to obtain the second image, the image resolution can be kept as large as possible while the frame is padded, which helps ensure pose detection results of high accuracy.
In an optional implementation, cropping the first image containing the second part from the video image frame includes: determining a target box in the video image frame, where the target box is a box used to frame the second part; and cropping the sub-image located within the target box in the video image frame to obtain the first image.
In the above implementation, when the first part of the target body part of the target object is occluded, obtaining the second image by cropping the sub-image within the target box and performing edge padding on that sub-image broadens the application scenarios of the pose estimation method provided in the present disclosure; even in complex pose estimation scenarios, applications based on this pose estimation can still run normally and stably.
In an optional implementation, predicting the key point of the first part based on the video image frame to obtain the first key point includes: determining an estimated area of the first part based on the video image frame; and predicting the key point of the first part based on the estimated area to obtain the first key point.
In the above implementation, when the key points of the first part are predicted through its estimated area, the estimated area can guide the pose detection model in detecting the first key points of the first part missing from the video image frame, improving the accuracy of the detected key points and reducing the detection error.
In an optional implementation, the method further includes: after the pose detection result of the target object is obtained, generating a pose trigger signal of a virtual object corresponding to the target object according to the pose detection result; and controlling the virtual object to perform a corresponding trigger action according to the pose trigger signal.
In the above implementation, since more accurate detection results can be obtained when the body parts of the target object are detected with the pose detection model, the virtual object can be controlled accurately to perform the corresponding trigger action when triggered according to the pose detection result.
In an optional implementation, the method further includes: determining a plurality of training samples, where each training sample contains part of the target body parts of a target object and includes annotation information for each key point of the target body parts; and training a pose detection model to be trained using the plurality of training samples to obtain the pose detection model. Predicting the key point of the first part based on the video image frame to obtain the first key point, and determining the key point of the second part of the target body part contained in the video image frame to obtain the second key point, includes: predicting the key point of the first part in the video image frame based on the pose detection model to obtain the first key point, and predicting the key point of the second part of the target body part contained in the video image frame based on the pose detection model to obtain the second key point.
In the above implementation, training the pose detection model to be trained with a plurality of such training samples yields a model capable of performing pose detection on images containing only part of the target body parts of a target object. With this processing, pose detection of the target object remains possible even when the video image frame contains only part of the target body parts, ensuring that the anchor application can detect body parts normally.
In an optional implementation, determining the plurality of training samples includes: collecting an original image containing all target body parts of a target object, and performing body detection on the original image to obtain a plurality of key points; and performing occlusion processing on at least some specified body parts in the original image, and determining the plurality of training samples based on the occlusion-processed original image and the annotation information of the plurality of key points.
In the above implementation, performing occlusion processing on the original image simulates situations in which limbs are occluded or cropped in the corresponding application scenario. When the pose detection model to be trained is trained with training samples determined in this way, pose detection of the target object remains possible even when the video image frame does not contain all target body parts, ensuring that the corresponding application can run normally and stably.
In a second aspect, an embodiment of the present disclosure further provides a pose estimation device, including: an acquisition part configured to acquire a video image frame containing a body part of a target object; a detection part configured to perform occlusion detection on a target body part of the target object in the video image frame; a first determining part configured to, when it is detected that a first part of the target body part is occluded, predict a key point of the first part based on the video image frame to obtain a first key point, and determine a key point of a second part of the target body part contained in the video image frame to obtain a second key point; and a second determining part configured to determine a pose detection result of the target object based on the first key point and the second key point.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps in the above first aspect, or in any possible implementation of the first aspect, are performed.
In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps in the above first aspect, or in any possible implementation of the first aspect, are performed.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program product including a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to execute the steps in the above first aspect, or in any possible implementation of the first aspect.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below.
FIG. 1 shows a flowchart of a pose estimation method provided by an embodiment of the present disclosure;
FIG. 2 shows a flowchart of predicting the key point of the first part based on the video image frame to obtain the first key point in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a video image frame provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a first image containing the second part, obtained after cropping the video image frame, provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a detection result of key point detection for the first part and the second part in the second image provided by an embodiment of the present disclosure;
FIG. 6 shows a flowchart of cropping the first image containing the second part from the video image frame in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 7 shows a flowchart of performing edge padding on the first image to obtain a second image containing a padded area in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 8a shows a schematic diagram of the padding effect of padding a first image to obtain a second image provided by an embodiment of the present disclosure;
FIG. 8b shows another schematic diagram of the padding effect of padding a first image to obtain a second image provided by an embodiment of the present disclosure;
FIG. 9 shows another flowchart of performing edge padding on the first image to obtain a second image containing a padded area in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 10 shows a flowchart of predicting the key point of the first part based on the video image frame to obtain the first key point in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 11 shows a flowchart of a first optional way of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 12 shows a flowchart of a second optional way of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 13 shows a flowchart of another pose estimation method provided by an embodiment of the present disclosure;
FIG. 14 shows a schematic architectural diagram of a pose estimation device provided by an embodiment of the present disclosure;
FIG. 15 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures.
The term "and/or" herein describes an association relationship, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
Research has found that in related pose capture solutions, the pose of the recognized object is often captured using a webcam, but due to uncertain factors such as the camera's field-of-view specification and the distance between the object and the lens, parts of the object's limbs often extend beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the object's pose.
Based on the above research, the present disclosure provides a pose estimation method, device, computer equipment, storage medium, and program product. In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and the occluded first part is detected, predicting the key points of the occluded first part while predicting the key points of the unoccluded second part makes it possible, when the target body part is occluded, to predict both the key points of the second part captured within the frame and the key points of the first part outside the frame, yielding reasonable, stable, non-jumping key point positions; accurate pose detection of the target object therefore remains possible even when the video image frame does not contain the complete body part.
To facilitate understanding of this embodiment, a pose estimation method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the pose estimation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device may be a live-streaming device, for example, any device capable of pose estimation, such as a smartphone, a tablet computer, or a PC.
In some possible implementations, the pose estimation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a flowchart of a pose estimation method provided by an embodiment of the present disclosure, the method includes steps S101 to S107, wherein:
S101: Acquire a video image frame containing a body part of a target object.
In the embodiment of the present disclosure, a video image frame containing the body parts of a target object may first be acquired through the camera of the computer device. The body parts contained in the video image frame may be the full-body or half-body parts of the target object. Here, the half-body parts may include the body parts above the waist of the target object (head, upper torso, both arms, and hands).
Exemplarily, the technical solution of the present disclosure can be used in a live-streaming scenario, and the above computer device can be a device on which a live-streaming application can be installed. In this case, the target object may be an anchor, and the acquired video image frame may be one containing the anchor's body parts, collected during the anchor's live broadcast. Of course, in some embodiments the solution can also be applied in other video playback scenarios.
S103: Perform occlusion detection on the target body part of the target object in the video image frame.
In the embodiment of the present disclosure, occlusion detection may be performed on the target body part of the target object in the video image frame through an occlusion detection model. The target body part may be all the body parts of the target object or some of them; the present disclosure does not limit this.
After occlusion detection is performed on the target body part, a corresponding occlusion detection result can be obtained, where the occlusion detection result characterizes the completeness of the target body part. For example, the completeness includes a complete or incomplete result, and in the case of an incomplete result the occlusion detection result may further include part information of the occluded part (i.e., the first part).
Suppose the target body part is the half-body part of the target object and the video image frame does not contain the target object's hands. After occlusion detection is performed on the video image frame, an occlusion detection result can be obtained indicating that the target body part in the frame is incomplete and that the occluded part is the target object's hands.
S105: When it is detected that the first part of the target body part is occluded, predict the key point of the first part based on the video image frame to obtain the first key point, and determine the key point of the second part of the target body part contained in the video image frame to obtain the second key point.
In the embodiment of the present disclosure, when it is detected that the first part of the target body part is occluded, the key points of the occluded first part can be predicted through a pose detection model, which also predicts the key points of the second part of the target body part contained in the video image.
In the embodiment of the present disclosure, edge padding can be performed on the video image frame based on the position of the missing first part, and the padded frame can then be processed through the pose detection model so that the key points of the first part are predicted in the padded area. Meanwhile, the pose detection model can also perform key point detection on the second part contained in the frame to obtain the second key points.
In the embodiment of the present disclosure, the target body part may contain multiple sub-parts. For example, the target body part may be the upper-body parts of the target object, in which case it may include the following sub-parts: head, upper torso, both arms, and both hands. The first part may be a complete sub-part or a portion of one. For example, the first part may be both hands, meaning that the occluded first part in the video image frame is both hands; the first part may also be the fingers, meaning that the occluded first part is the fingers of both hands.
In the embodiment of the present disclosure, before occlusion detection is performed on the target body part of the target object in the video image frame, the following steps may also be executed:
Determine the capture moment of the video image frame, calculate the distance between the target object and the camera of the computer device at that moment to obtain the capture distance, and compare the capture distance with a first distance threshold and a second distance threshold. If the comparison shows that the capture distance is less than the first distance threshold, or greater than the second distance threshold, it is preliminarily determined that occlusion detection does not need to be performed on the video image frame. If the comparison shows that the capture distance is less than or equal to the second distance threshold and greater than or equal to the first distance threshold, occlusion detection needs to be performed on the target body part in the video image frame.
Here, the second distance threshold is greater than the first distance threshold. The first and second distance thresholds may be selected empirically in advance, or preset in the computer device for the target object. The first distance threshold characterizes the distance between the target object and the camera at which the video image frame does not contain the parts of the target object below the head, or at which the below-head parts it contains are insufficient for pose detection. The second distance threshold characterizes the distance between the target object and the camera at which the video image frame contains the complete target body part.
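This two-threshold gating can be pictured with a minimal Python sketch; the function and parameter names are illustrative assumptions, and the threshold values would be chosen empirically or preset as described above.

def needs_occlusion_detection(capture_distance: float,
                              first_threshold: float,
                              second_threshold: float) -> bool:
    """Run occlusion detection only in the middle distance band (hypothetical helper)."""
    if capture_distance < first_threshold:
        return False  # too close: parts below the head are absent or unusable
    if capture_distance > second_threshold:
        return False  # far enough that the complete target body part is in frame
    return True       # in between: the target body part may be truncated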
S107: Determine the pose detection result of the target object based on the first key point and the second key point.
After the first key point and the second key point are determined, the pose detection result of the target object can be determined based on them.
The above steps are described in detail below.
In an optional implementation, the method further includes the following steps:
Determine the capture distance corresponding to the video image frame, where the capture distance characterizes the distance between the target object and the video capture device when the device captures the frame, and judge, based on the capture distance, whether the video image frame satisfies the facial expression capture condition.
When it is determined that the capture distance meets the preset distance requirement, determine that the video image frame satisfies the facial expression capture condition, and perform facial expression detection on the video image frame to obtain a facial expression detection result.
Based on this, step S103, performing occlusion detection on the target body part of the target object in the video image frame, includes the following:
When it is determined that the video image frame does not satisfy the facial expression capture condition, perform occlusion detection on the target body part of the target object in the video image frame.
In the technical solution of the present disclosure, besides body detection, facial expression capture can also be performed on the target object. In this case, the distance between the target object and the video capture device (for example, the camera of the computer device) at the moment the device captures the video image frame can be determined.
Then, whether the video image frame satisfies the facial expression capture condition is judged based on the capture distance. In the embodiment of the present disclosure, the capture distance may be compared with the above first distance threshold. If the capture distance is less than the first distance threshold, it is determined that the preset distance requirement is met, that is, the video image frame satisfies the facial expression capture condition; at this point, facial expression capture can be performed on the frame.
If the capture distance is greater than or equal to the first distance threshold, it is determined that the video image frame does not satisfy the facial expression capture condition; at this point, occlusion detection can be performed on the target body part of the target object in the frame.
When it is detected that the first part of the target body part is occluded, the key point of the first part can be predicted based on the video image frame to obtain the first key point, and the key point of the second part of the target body part contained in the frame can be determined to obtain the second key point; the pose detection result of the target object is then determined based on the first key point and the second key point.
When it is detected that the target body part is not occluded, the pose detection result of the target object can be obtained by directly detecting the target body part in the video image frame.
Some related applications use a single-mode technical solution, which can be understood as follows: when the image contains a complete facial expression, the application can only perform facial expression capture, and when the image contains the complete body parts to be detected, the application can only perform body capture. When the application cannot perform facial expression capture and the image does not contain the complete body parts to be detected, the application cannot run the facial capture and body part detection functions normally and stably.
Suppose the above application is virtual live-streaming software. In a virtual live-streaming scenario, the anchor's body movements are often captured and recognized using a webcam, but uncertain factors such as the camera's field-of-view specification and the distance between the anchor and the lens exist. In particular, when the anchor is extremely close to the camera, that is, when part of the person's body extends beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), virtual live-streaming software in the related art cannot detect body parts normally.
After the above technical solution is adopted, facial expression capture can be performed on the anchor when it is judged based on the capture distance that the facial expression capture condition is satisfied. When it is judged that the condition is not satisfied, occlusion detection can be performed on the target body part of the target object in the video image frame, and when the first part is detected to be occluded, separately predicting the key points of the first part and the second part enables body detection for the body parts to be detected that are not contained in the image. This solves the problem in the related art that body points outside the image cannot be predicted, and alleviates the severe body-point jumping caused by unstable body-point detection when such applications perform body detection.
The technical solution of the present disclosure can also achieve a smooth transition between facial expression capture and body part capture, which improves the robustness of the application so that it can run stably.
In an optional implementation, as shown in FIG. 2, the above step S105 of predicting the key point of the first part based on the video image frame to obtain the first key point includes the following process:
S1051: Crop a first image containing the second part from the video image frame.
S1052: Perform edge padding on the first image to obtain a second image containing a padded area, where the padded area is the area used to perform key point detection for the first part.
S1053: Predict the key point of the first part based on the padded area in the second image to obtain the first key point.
Suppose the image shown in FIG. 3 is the video image frame containing the upper-body parts of the target object acquired in the above step S101. As can be seen from FIG. 3, some finger parts of the target object (i.e., the above first part) are occluded. In this case, a first image containing the second part is cropped from the video image frame, yielding the first image shown in FIG. 4.
After the first image is cropped, edge padding can be performed on it to obtain a second image containing a padded area. Pose detection can then be performed on the second image through the pose detection model, so that the key points of the first part are predicted in the padded area to obtain the first key points, and the key points of the second part are predicted in the remaining area of the second image to obtain the second key points. For example, as shown in FIG. 5, the black image area is the above padded area. After the process described above, the key points of the first part can be predicted in the padded area, and the key points of the second part in the areas outside it.
In the embodiment of the present disclosure, when edge padding is performed on the first image, the first image may be padded with a black image in the manner shown in FIG. 5.
Here, edge padding can be understood as padding the edges of the first image based on the position, in the video image frame, of the occluded first part, so as to obtain an area in which key point detection can be performed for the occluded first part.
In the above implementation, by cropping the first image containing the second part from the video image frame and performing edge padding on it to obtain the second image, the key points of the occluded first part can be predicted through the second image; thus, when the first part of the target body part is occluded, both the key points of the second part captured within the frame and the key points of the first part outside the frame can be predicted, as the following sketch illustrates.
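A minimal sketch of this crop-pad-predict flow, assuming a bottom-only pad and a hypothetical pose_model callable that returns (x, y) keypoints in padded-image coordinates; OpenCV's copyMakeBorder performs the black-border padding.

import cv2
import numpy as np

def predict_with_padding(first_image: np.ndarray, pad_bottom: int, pose_model):
    """Pad the cropped image where the occluded part would be, run the model,
    then split the keypoints into padded-area (first) and visible (second)."""
    visible_height = first_image.shape[0]
    second_image = cv2.copyMakeBorder(first_image, 0, pad_bottom, 0, 0,
                                      cv2.BORDER_CONSTANT, value=(0, 0, 0))
    keypoints = pose_model(second_image)  # assumed to return (x, y) tuples
    first_keypoints = [kp for kp in keypoints if kp[1] >= visible_height]   # in padded area
    second_keypoints = [kp for kp in keypoints if kp[1] < visible_height]   # in visible area
    return first_keypoints, second_keypoints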
In an optional implementation, as shown in FIG. 6, S1051, cropping the first image containing the second part from the video image frame, includes the following process:
S601: Determine a target box in the video image frame, where the target box is a box used to frame the second part.
S602: Crop the sub-image located within the target box in the video image frame to obtain the first image.
In the embodiment of the present disclosure, the first part of the target object being occluded in the video image frame can be understood in two ways: the first part is truncated by the image edge, or the first part is occluded by another object in the frame.
In the embodiment of the present disclosure, when the first part of the target body part is not detected in the video image frame and it is detected that the first part is not at an edge position of the frame, it can be determined that the first part is occluded by another object in the frame.
In this case, a target box for framing the second part can be determined in the video image frame. Then, the sub-image located within the target box is cropped to obtain the first image.
In the embodiment of the present disclosure, when the first part of the target body part is not detected in the video image frame and it is detected that the first part is truncated by the image edge, edge padding can be performed directly on the video image frame to obtain the second image containing the padded area.
Here, the process of performing edge padding on the video image frame is the same as that of performing edge padding on the first image; in the following implementations, edge padding is described by taking the first image as an example.
In the above implementation, when it is detected that the first part of the target body part of the target object is occluded, obtaining the second image by cropping the sub-image within the target box and performing edge padding on it broadens the application scenarios of the pose estimation method provided in the present disclosure; in complex pose estimation scenarios, applications based on this pose estimation can still run normally and stably.
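A minimal sketch of the target-box crop in S601 and S602; the (x0, y0, x1, y1) box format is an assumption for illustration.

import numpy as np

def crop_target_box(video_frame: np.ndarray, target_box) -> np.ndarray:
    """Crop the sub-image enclosed by the box framing the visible second part."""
    x0, y0, x1, y1 = target_box
    frame_h, frame_w = video_frame.shape[:2]
    x0, y0 = max(0, x0), max(0, y0)              # clamp the box to the frame
    x1, y1 = min(frame_w, x1), min(frame_h, y1)
    return video_frame[y0:y1, x0:x1].copy()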
As described above, when edge padding is performed on the first image (or the video image frame), it can be implemented by padding the first image with a black image in the manner shown in FIG. 5.
Besides this edge padding manner, the first image can also be edge-padded in the manner described below to obtain the second image, including:
Determine, in the video image frame, the position information of the occluder that occludes the first part, and replace the image located within that position in the frame with a background image of a preset color, for example a black background image.
In the embodiment of the present disclosure, besides a black background image, background images of other colors may also be used. To improve the processing accuracy of the pose detection model, the preset color may be determined based on the training samples of the pose detection model, as described in detail in the following process.
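A minimal sketch of this alternative, replacing the detected occluder's region with a preset-color background; defaulting to black here is an assumption that matches the padding examples above.

import numpy as np

def mask_occluder(video_frame: np.ndarray, occluder_box, color=(0, 0, 0)) -> np.ndarray:
    """Replace the occluder's region with a preset-color background (hypothetical helper)."""
    x0, y0, x1, y1 = occluder_box
    result = video_frame.copy()
    result[y0:y1, x0:x1] = color  # the preset color may instead be derived from training samples
    return result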
In an optional implementation, as shown in FIG. 7, the above step of performing edge padding on the first image to obtain the second image containing the padded area includes the following process:
S701: Determine attribute information of the first part, where the attribute information includes limb type information and/or limb size information.
S702: Determine padding parameters of the first image according to the attribute information, where the padding parameters include a padding position and/or a padding size.
S703: Perform edge padding on the first image based on the padding parameters to obtain the second image.
在本公开实施例中,上述第一部位可以理解为视频图像帧中所缺少的,且该姿态检测模型需要进行姿态检测的肢体部位。举例来说,该姿态检测模型需要对主播的上半身肢体部位进行检测,然而,该视频图像帧中缺少部分手部部位,此时,第一部位即为视频图像帧中所缺少的手部部位。In the embodiment of the present disclosure, the above-mentioned first part can be understood as a body part that is missing in the video image frame and for which the pose detection model needs to perform pose detection. For example, the posture detection model needs to detect the anchor's upper body parts, however, some hand parts are missing in the video image frame, and at this time, the first part is the missing hand part in the video image frame.
这里,肢体类型信息用于指示视频图像帧中所缺少的第一部位的肢体类型信息,例如,该视频图像帧中所缺少的第一部位为手部。肢体尺寸信息用于指示视频图像帧中所缺少的第一部位的尺寸信息(或者大小信息)。Here, the limb type information is used to indicate the limb type information of the first part missing in the video image frame, for example, the first missing part in the video image frame is a hand. The body size information is used to indicate the size information (or size information) of the first part that is missing in the video image frame.
可以理解的是,在确定出肢体类型信息之后,就可以估计出该第一部位相对于第一图像的位置关系。It can be understood that after the body type information is determined, the positional relationship of the first part relative to the first image can be estimated.
在确定出上述属性信息之后,就可以基于该属性信息,确定第一图像的填补位置和/或填补尺寸,进而,基于该填补位置和/或填补尺寸对第一图像进行填补,得到第二图像。After the above attribute information is determined, the filling position and/or filling size of the first image can be determined based on the attribute information, and then the first image can be filled based on the filling position and/or filling size to obtain the second image .
在本公开实施例中,在基于第一部位的属性信息,确定第一图像的填补参数的情况下,可以基于肢体类型信息确定第一部位相对于第一图像的位置关系,例如,第一部位应当位于第一图像的下边缘位置,此时,可以基于该位置信息确定填补位置,例如,可以将第一图像的下边缘位置确定为填补位置。同时,还可以基于肢体尺寸信息,确定第一图像的填补尺寸,例如,可以将该肢体尺寸信息确定为该填补尺寸。In the embodiment of the present disclosure, in the case of determining the filling parameters of the first image based on the attribute information of the first part, the positional relationship of the first part relative to the first image may be determined based on the body type information, for example, the first part It should be located at the bottom edge of the first image. At this time, the filling position may be determined based on the position information, for example, the bottom edge of the first image may be determined as the filling position. At the same time, the padding size of the first image may also be determined based on the limb size information, for example, the limb size information may be determined as the padding size.
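As an illustration of S701 to S703, the following sketch maps the limb type to a padding position and uses the limb size as the padding size. The EDGE_BY_LIMB table and the limb names are assumptions made for this example, not values prescribed by the disclosure.

```python
import cv2
import numpy as np

# Hypothetical mapping from the missing limb's type to the edge(s) to pad.
EDGE_BY_LIMB = {"hand": ["bottom"], "arm": ["left", "right"]}

def pad_for_missing_part(first_image: np.ndarray,
                         limb_type: str,
                         limb_size_px: int) -> np.ndarray:
    # S702: derive the padding position from the limb type and the
    # padding size from the limb size information.
    pads = dict(top=0, bottom=0, left=0, right=0)
    for edge in EDGE_BY_LIMB.get(limb_type, []):
        pads[edge] = limb_size_px
    # S703: pad the chosen edges with a black (preset-color) background.
    return cv2.copyMakeBorder(first_image,
                              pads["top"], pads["bottom"],
                              pads["left"], pads["right"],
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```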
Suppose, as shown in Figure 8a, that the first part missing from the video image frame is a hand. It can then be determined that the hand lies below the first image, and the bottom edge of the first image is padded accordingly, for example with a black-background image.

Suppose, as shown in Figure 8b, that the first parts missing from the video image frame are the hands and both arms. It can then be determined that the hands lie below the first image and the arms lie on its left and right sides, so the bottom, left, and right edges of the first image are padded, for example each with a black-background image.

Here, instead of black, background images of other colors may also be used for padding; the present disclosure does not limit this.

Since the size of the image input to the pose detection model is preset, the second image must be resized to that preset size after padding. Consequently, padding a larger area lowers the resolution of the image region corresponding to the target object's body parts in the target image.

In the above implementation, determining the padding parameters from the attribute information of the first part and padding the video image frame according to those parameters to obtain the second image keeps the image resolution as high as possible while still padding the frame, which helps produce more accurate pose detection results.

In an optional implementation, the limb type information of the first part may be determined in the following ways:

Way 1: The limb type information of the first part missing from the video image frame can be estimated from the estimated distance between the target object and the camera device.

For example, a distance-measurement model may predict the distance between the target object and the camera device and output the limb type information of the missing first part.

Way 2: A first image containing the second part is cropped from the video image frame using the target bounding box; the first image is then analyzed to obtain the limb type information of the first part missing from the video image frame.

In an optional implementation, the limb size information of the first part may be determined as follows:

Obtain the actual length information of the target object, which may be the target object's actual height or the actual length of any complete target limb part of the target object. Then determine the in-frame length of a complete designated limb part contained in the video image frame, and estimate the limb size information from this in-frame length and the actual length information.
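A minimal sketch of this proportion-based estimate follows, assuming the real-world length of the missing part is known (for example, derived from the subject's height via standard body proportions, which is an assumption here, not part of the disclosure):

```python
def estimate_limb_size_px(missing_len_cm: float,
                          visible_len_cm: float,
                          visible_len_px: float) -> float:
    # Pixels-per-centimeter scale taken from a complete, visible limb part.
    px_per_cm = visible_len_px / visible_len_cm
    # Convert the missing part's real-world length into an in-frame size.
    return missing_len_cm * px_per_cm
```

For instance, if a visible forearm of 25 cm spans 100 px in the frame, a missing 18 cm hand would be estimated at 72 px.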
In embodiments of the present disclosure, besides the approaches described above for edge-padding the first image to obtain the second image, the first image may also be edge-padded as follows:

Each image edge of the video image frame may be padded separately; in this case, the padding size for each edge may be a preset size.

Alternatively, the image edge corresponding to the missing first part may be identified in the first image and only that edge padded; in this case, the padding size for that edge may be a preset size, or the limb size information determined above.

In embodiments of the present disclosure, as shown in Figure 9 and building on the implementation shown in Figure 7, after the padding parameters are determined, the method further includes the following process:

S901: Acquire scene type information of the video image frame.

S902: Adjust the padding parameters according to the scene type information, and pad the video image frame based on the adjusted padding parameters to obtain the second image.

In embodiments of the present disclosure, the scene type information corresponding to the video image frame may be determined; for example, the scene type information may be a live-commerce scene, a game commentary scene, a performance scene, and so on.

The resolution requirements on the image input to the pose detection model may differ for each scene type. Therefore, after the scene type information is determined, the image resolution matching that scene type can be determined, and the padding parameters adjusted according to that image resolution.

For example, for scenes with high image-resolution requirements, the padding size can be reduced adaptively, which benefits the resolution of the second image obtained from the padding process. For scenes with low image-resolution requirements, the padding size can be increased adaptively, or the original padding size can be kept unchanged.

It should be understood that when the padding parameters are adjusted according to the image resolution, padding the first image with the adjusted parameters should still leave the padded second image containing a region in which pose detection of the first part can be performed.

In the above implementation, adjusting the padding parameters through the scene type information of the video image frame and expanding the frame according to the adjusted parameters keeps the image resolution as high as possible while the key points of the complete target limb parts can still be detected, which helps produce more accurate pose detection results.
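The adjustment of S902 could look like the sketch below. The scene names echo the examples above; the scale factors and the minimum padding floor are illustrative assumptions:

```python
# Hypothetical per-scene scale factors for the padding size.
SCENE_PAD_SCALE = {"live_commerce": 0.5, "game_commentary": 1.0, "performance": 1.5}

def adjust_padding(pads: dict, scene_type: str, min_pad_px: int = 16) -> dict:
    # Shrink padding for resolution-hungry scenes, grow or keep it otherwise,
    # while keeping enough padded room to detect the missing first part.
    scale = SCENE_PAD_SCALE.get(scene_type, 1.0)
    return {edge: (max(int(round(size * scale)), min_pad_px) if size > 0 else 0)
            for edge, size in pads.items()}
```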
In embodiments of the present disclosure, as shown in Figure 10, step S105 of predicting the key points of the first part based on the video image frame to obtain the first key points further includes the following process:

S1001: Determine an estimated region of the first part based on the video image frame.

S1002: Predict the key points of the first part based on the estimated region to obtain the first key points.

In an optional implementation, the estimated region of the first part may be determined through the following steps:

First, determine the first image containing the second part based on the video image frame; then edge-pad the first image in the manner described above to obtain the second image; finally, determine the estimated region of the first part within the padded region of the second image. Here, the estimated region is the region of the second image used for estimating the key points of the first part.

Here, the estimated region may be rectangular or circular; the present disclosure does not limit the shape of the estimated region.

In an optional implementation, as shown in Figure 11, the estimated region of the first part may also be determined through the following steps:

S1101: Determine the limb type information corresponding to the first part, and determine the target part among the target limb parts that is associated with the first part.

Here, the first part and the target part may be detection parts in a linkage relationship: the first part moves as driven by the target part, or the target part moves as driven by the first part. For instance, if the first part is a hand, the target part may be the wrist; if the first part is the wrist, the target part may be the forearm. Further examples are omitted here.

S1102: Based on the limb type information of the first part and the limb type information of the target part, determine a position constraint between the first part and the target part, where the position constraint bounds the difference between the position of the first part in the second image and the position of the target part in the second image.

Here, different position constraints are set for different limb types. The second image is the image obtained after edge-padding the first image in the implementation above, and the first image is the sub-image of the video image frame containing the second part.

S1103: Determine the estimated region of the first part in the second image based on the position constraint.

Determining the estimated region of the first part in the second image through position constraints reduces large position differences between the first part and the target part, thereby improving the processing accuracy of the pose detection model.
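One way to realize S1102 and S1103 is to center a box on the associated target part and bound its extent by a per-pair offset limit. The offset table and the box form are assumptions for illustration:

```python
# Hypothetical maximum first-part/target-part offsets, in pixels.
MAX_OFFSET = {("hand", "wrist"): 40, ("wrist", "forearm"): 60}

def estimated_region(target_xy: tuple,
                     first_limb: str,
                     target_limb: str) -> tuple:
    # The position constraint bounds how far the first part may lie from
    # its associated target part, so the estimated region becomes a box of
    # that radius centered on the target part's position in the second image.
    r = MAX_OFFSET.get((first_limb, target_limb), 50)
    x, y = target_xy
    return (x - r, y - r, x + r, y + r)
```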
After the estimated region is determined, the pose detection model can perform pose detection on the second image annotated with the estimated region to obtain a pose detection result. The pose detection result contains annotation information for the key points of the complete target limb parts, where the annotation information includes position information and category information.

When the pose detection model performs pose detection on the second image annotated with the estimated region, the estimated region can guide the model in detecting the first key points of the first part missing from the video image frame, which improves the accuracy of the detected key points and reduces their detection error.

In an optional implementation, as shown in Figure 12, the estimated region of the first part may also be determined through the following process:

S1201: Determine a target video image among the historical video images corresponding to the target object, where the similarity between the target video image and the video image frame meets a preset requirement and the target video image contains the first part.

S1202: Determine the estimated region according to the position information of the first part contained in the target video image.

In embodiments of the present disclosure, the historical video images of the target object (for example, historical live-stream images) are first retrieved from a cache folder. The cache folder stores video image frames containing the complete designated limb parts, together with the pose detection results corresponding to those frames.

After the historical video images are obtained, they can be screened to select the target video image. The screening process is as follows:

Compute the feature distance between each historical video image and the video image frame, where the feature distance characterizes the similarity between the historical video image and the video image frame. Based on the computed feature distances, select from the historical video images the target video images whose similarity to the video image frame meets the preset requirement. Here, meeting the preset requirement can be understood as the feature distance being greater than or equal to a preset distance threshold.

After the target video image is selected, the estimated region can be determined according to the position information of the first part contained in the target video image; for example, that position information may be taken directly as the estimated region.
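A sketch of this screening step follows. Since the selection rule above keeps images whose score is at or above the threshold, cosine similarity is used here as a stand-in for the "feature distance"; the cache layout is an assumption:

```python
import numpy as np

def select_target_images(frame_feat: np.ndarray,
                         cache: list,  # (image, feature, first_part_box) tuples
                         threshold: float = 0.9) -> list:
    # Keep cached frames whose similarity to the current frame meets the
    # preset requirement; the stored first-part box can then serve directly
    # as the estimated region.
    selected = []
    for image, feat, box in cache:
        sim = float(np.dot(frame_feat, feat) /
                    (np.linalg.norm(frame_feat) * np.linalg.norm(feat) + 1e-8))
        if sim >= threshold:
            selected.append((image, box))
    return selected
```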
In the above implementation, considering that the same subject varies little between similar actions while using the computer device, determining the estimated region from a retrieved target video image improves the accuracy with which the estimated region is determined, thereby yielding more accurate pose detection results.

In embodiments of the present disclosure, as shown in Figure 13, the method further includes the following process:

S1301: After the pose detection result of the target object is obtained, generate a pose trigger signal for the virtual object corresponding to the target object according to the pose detection result.

S1302: Control the virtual object to perform the corresponding trigger action according to the pose trigger signal.

In embodiments of the present disclosure, the pose trigger signal of the virtual object may be generated from the key points of the target limb parts of the target object in the detection result above.

In embodiments of the present disclosure, after the pose detection result of the target object is obtained, a pose trigger signal that triggers the virtual object to perform the corresponding trigger action may be generated from that result, so as to trigger the virtual object accordingly.

Here, the trigger signal indicates the position information, in the video image frame, of the key points of each virtual limb of the virtual object.

It should be noted that the virtual object takes a preset avatar form (for example, a virtual streamer), where the preset form includes at least one of the following: a three-dimensional humanoid (which may be a human or a human-like figure such as an alien), a three-dimensional animal (such as a dinosaur or a pet cat), a two-dimensional character, or a two-dimensional animal.

In the above implementation, since detecting the body parts of the target object with this pose detection model yields more accurate detection results, triggering the virtual object to perform the corresponding trigger action according to the pose detection result enables the virtual object to be controlled accurately.
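As a sketch of S1301, the trigger signal can be built by mapping each detected real-limb key point to the corresponding virtual-limb key point. The one-to-one name mapping below is a simplifying assumption; a real avatar rig would typically require retargeting:

```python
def make_pose_trigger_signal(keypoints: dict) -> dict:
    # keypoints: {limb key point name: (x, y) position in the video image frame}
    # The signal indicates where each virtual limb's key point should be placed.
    return {f"virtual_{name}": xy for name, xy in keypoints.items()}
```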
In embodiments of the present disclosure, the method further includes the following process:

First, determine multiple training samples, where each training sample contains part of the target limb parts of a target object, and each training sample contains annotation information for every key point of the target limb parts. Then train the pose detection model to be trained with the multiple training samples to obtain the pre-trained pose detection model.

In embodiments of the present disclosure, multiple training samples are first obtained and then input into the pose detection model to be trained, so as to train it.

In the above implementation, training the pose detection model with multiple such training samples yields a pose detection model capable of performing pose detection on images that contain only part of the target limb parts of the target object. With this processing, pose detection of the target object remains possible even when the video image frame contains only part of the target limb parts, allowing the streaming application to run normally and stably.

In embodiments of the present disclosure, determining the multiple training samples includes the following process:

First, collect original images containing all target limb parts of the target object and perform limb detection on the original images to obtain multiple key points. After the multiple key points are obtained, occlusion processing can be applied to at least some of the target limb parts in the original images, and the multiple training samples determined based on the occlusion-processed original images and the annotation information of the multiple key points.

In embodiments of the present disclosure, first obtain an original image containing all the target limb parts. For example, if the target limb parts are the upper-body parts of a human body, the original image must contain at least the complete upper body, and may, for example, contain the limb parts of the whole body.

After the original image is obtained, limb detection can be performed on it to obtain multiple key points, among which are the key points of the target limb parts.

Afterwards, occlusion processing can be applied to the original image to obtain an occlusion-processed original image containing incomplete target limb parts. The occlusion-processed original image and the key points of the target limb parts determined in the process above can then be taken together as one training sample.

In an optional implementation, applying occlusion processing to at least some of the target limb parts in the original image includes the following:

A background image of a preset color may be used to occlude at least some of the target limb parts, yielding the occlusion-processed original image; alternatively, at least some of the target limb parts in the original image may be cropped out, likewise yielding the occlusion-processed original image.

In the above implementation, occlusion processing of the original images simulates limbs being occluded or cropped in the corresponding application scenarios. When the pose detection model to be trained is trained with training samples produced in this way, pose detection of the target object remains possible even when the video image frame does not contain all the target limb parts, so that the corresponding application can run normally and stably and limb detection can proceed normally.
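A minimal sketch of this sample-generation step follows. Occluding a random band at the bottom of the image with the preset color is one illustrative strategy for simulating a limb cut off by the frame edge, not the disclosure's prescribed one:

```python
import random
import numpy as np

def make_occluded_sample(original: np.ndarray,
                         keypoints: list,  # [(x, y, label), ...] from limb detection
                         occlude_color=(0, 0, 0)):
    # Paint a random band at the bottom of the original image with the
    # preset color, then pair the occluded image with the full key point
    # annotations to form one training sample.
    img = original.copy()
    h = img.shape[0]
    band = random.randint(h // 8, h // 3)
    img[h - band:, :] = occlude_color
    return img, keypoints
```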
The present disclosure relates to the field of augmented reality. By acquiring image information of a target object in a real environment and then using various vision-related algorithms to detect or recognize the target object's relevant features, states, and attributes, an augmented reality (AR) effect combining the virtual and the real that matches the application can be obtained. Illustratively, the target object may involve faces, limbs, gestures, actions, and the like related to the human body. Applications may involve not only interactive scenarios such as guided tours, navigation, explanation, reconstruction, and overlaid display of virtual effects related to real scenes or objects, but also person-related special-effects processing such as makeup beautification, body beautification, special-effects display, and virtual-model display. The detection or recognition of the target object's relevant features, states, and attributes can be implemented through a convolutional neural network.

Those skilled in the art can understand that, in the methods of the specific implementations above, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible inherent logic.

Based on the same inventive concept, embodiments of the present disclosure further provide a pose estimation apparatus corresponding to the pose estimation method. Since the principle by which the apparatus in the embodiments of the present disclosure solves the problem is similar to that of the pose estimation method above, the implementation of the apparatus may refer to the implementation of the method.

Referring to Figure 14, a schematic diagram of a pose estimation apparatus provided by an embodiment of the present disclosure, the apparatus includes: an acquisition part 141, a detection part 142, a first determination part 143, and a second determination part 144, where:

the acquisition part 141 is configured to acquire a video image frame containing body parts of a target object;

the detection part 142 is configured to perform occlusion detection on the target limb parts of the target object in the video image frame;

the first determination part 143 is configured to, when a first part of the target limb parts is detected to be occluded, predict the key points of the first part based on the video image frame to obtain first key points, and determine the key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

the second determination part 144 is configured to determine the pose detection result of the target object based on the first key points and the second key points.

In embodiments of the present disclosure, when occlusion detection is performed on the video image frame and an occluded first part is detected, predicting the key points of the occluded first part while predicting the key points of the unoccluded second part makes it possible, even when the target limb parts are occluded in the frame, to predict the key points of the second part captured within the video image frame as well as the key points of the first part lying outside it, thereby predicting reasonable, stable, non-jittering key-point position information and enabling pose detection of the target object even when the video image frame does not contain the complete limb parts.
In a possible implementation, the apparatus is further configured to: determine the capture distance corresponding to the video image frame, the capture distance characterizing the distance between the target object and the video capture device when the video capture device captured the frame; and judge, based on the capture distance, whether the video image frame satisfies the facial-expression capture condition. Performing occlusion detection on the target limb parts of the target object in the video image frame then includes: performing occlusion detection on the target limb parts of the target object in the video image frame when it is determined that the frame does not satisfy the facial-expression capture condition.

In a possible implementation, the apparatus is further configured to: when the capture distance is determined to meet the preset distance requirement, determine that the video image frame satisfies the facial-expression capture condition; and perform facial-expression detection on the video image frame to obtain a facial-expression detection result.

In a possible implementation, the first determination part 143 is further configured to: crop a first image containing the second part from the video image frame; perform edge padding on the first image to obtain a second image containing a padded region, the padded region being a region configured for key-point detection of the first part; and predict the key points of the first part based on the padded region in the second image to obtain the first key points.

In a possible implementation, the first determination part 143 is further configured to: determine attribute information of the first part, the attribute information including limb type information and/or limb size information; determine padding parameters of the first image according to the attribute information, the padding parameters including a padding position and/or a padding size; and perform edge padding on the first image based on the padding parameters to obtain the second image.

In a possible implementation, the first determination part 143 is further configured to: determine a target bounding box in the video image frame, the target bounding box being a box for framing the second part; and crop the sub-image within the target bounding box from the video image frame to obtain the first image.

In a possible implementation, the first determination part 143 is further configured to: determine the estimated region of the first part in the video image frame; and predict the key points of the first part based on the estimated region to obtain the first key points.

In a possible implementation, the apparatus is further configured to: after the pose detection result of the target object is obtained, generate a pose trigger signal for the virtual object corresponding to the target object according to the pose detection result; and control the virtual object to perform the corresponding trigger action according to the pose trigger signal.

In a possible implementation, the apparatus is further configured to: determine multiple training samples, each training sample containing part of the target limb parts of a target object and annotation information for every key point of the target limb parts; and train the pose detection model to be trained with the multiple training samples to obtain the pose detection model. The first determination part 143 is further configured to: predict the key points of the first part in the video image frame based on the pose detection model to obtain the first key points, and predict the key points of the second part of the target limb parts contained in the video image frame based on the pose detection model to obtain the second key points.

In a possible implementation, the apparatus is further configured to: collect original images containing all target limb parts of the target object and perform limb detection on the original images to obtain multiple key points; and apply occlusion processing to at least some of the target limb parts in the original images and determine the multiple training samples based on the occlusion-processed original images and the annotation information of the multiple key points.

For descriptions of the processing flow of each module in the apparatus and of the interaction flow between the modules, reference may be made to the relevant descriptions in the method embodiments above; details are not repeated here.
Corresponding to the pose estimation method in Figure 1, an embodiment of the present disclosure further provides a computer device 1500. As shown in Figure 15, a schematic structural diagram of the computer device 1500 provided by an embodiment of the present disclosure, the device includes:

a processor 151, a memory 152, and a bus 153. The memory 152 is used to store execution instructions and includes an internal memory 1521 and an external memory 1522. The internal memory 1521, also called main memory, temporarily stores operation data for the processor 151 and data exchanged with an external memory 1522 such as a hard disk; the processor 151 exchanges data with the external memory 1522 through the internal memory 1521. When the computer device 1500 runs, the processor 151 and the memory 152 communicate over the bus 153, causing the processor 151 to execute the following instructions:

acquire a video image frame containing body parts of a target object;

perform occlusion detection on the target limb parts of the target object in the video image frame;

when a first part of the target limb parts is detected to be occluded, predict the key points of the first part based on the video image frame to obtain first key points, and determine the key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

determine the pose detection result of the target object based on the first key points and the second key points.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the pose estimation method described in the method embodiments above are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

An embodiment of the present disclosure further provides a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to execute the steps of the pose estimation method described in the method embodiments above; reference may be made to the method embodiments for those steps.

The computer program product may be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments. In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are illustrative; for example, the division into units is a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, the part contributing to the related art, or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure.

The aforementioned computer-readable storage medium may be a tangible device capable of retaining and storing instructions for use by an instruction-execution device, and may be a volatile or non-volatile storage medium. It may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, or semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium as used here is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.

Finally, it should be noted that the embodiments above are only specific implementations of the present disclosure, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with the technical field may, within the technical scope disclosed herein, modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features; such modifications, changes, or substitutions do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
The present disclosure provides a pose estimation method, an apparatus, a computer device, a storage medium, and a program product, where the method includes: acquiring a video image frame containing body parts of a target object; performing occlusion detection on the target limb parts of the target object in the video image frame; when a first part of the target limb parts is detected to be occluded, predicting the key points of the first part based on the video image frame to obtain first key points, and determining the key points of a second part of the target limb parts contained in the video image frame to obtain second key points; and determining the pose detection result of the target object based on the first key points and the second key points. By predicting the key points of the occluded first part and predicting the key points of the unoccluded second part in the video image frame, embodiments of the present disclosure can, when the target limb parts are occluded in the frame, predict the key points of the second part captured within the video image frame as well as the key points of the first part lying outside it, thereby predicting reasonable, stable, non-jittering key-point position information and enabling pose detection of the target object even when the video image frame does not contain the complete limb parts.

Claims (14)

1. A pose estimation method, comprising:

acquiring a video image frame containing body parts of a target object;

performing occlusion detection on target limb parts of the target object in the video image frame;

when a first part of the target limb parts is detected to be occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

determining a pose detection result of the target object based on the first key points and the second key points.
2. The method according to claim 1, wherein the method further comprises:

determining a capture distance corresponding to the video image frame, the capture distance characterizing the distance between the target object and a video capture device when the video capture device captured the video image frame;

judging, based on the capture distance, whether the video image frame satisfies a facial-expression capture condition;

wherein performing occlusion detection on the target limb parts of the target object in the video image frame comprises:

performing occlusion detection on the target limb parts of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial-expression capture condition.

3. The method according to claim 2, wherein the method further comprises:

when it is determined that the capture distance meets a preset distance requirement, determining that the video image frame satisfies the facial-expression capture condition; and performing facial-expression detection on the video image frame to obtain a facial-expression detection result.
4. The method according to any one of claims 1 to 3, wherein predicting the key points of the first part based on the video image frame to obtain the first key points comprises:

cropping a first image containing the second part from the video image frame;

performing edge padding on the first image to obtain a second image containing a padded region, the padded region being a region used for key-point detection of the first part;

predicting the key points of the first part based on the padded region in the second image to obtain the first key points.

5. The method according to claim 4, wherein performing edge padding on the first image to obtain the second image containing the padded region comprises:

determining attribute information of the first part, the attribute information comprising at least one of limb type information and limb size information;

determining padding parameters of the first image according to the attribute information, the padding parameters comprising at least one of a padding position and a padding size;

performing edge padding on the first image based on the padding parameters to obtain the second image.

6. The method according to claim 4, wherein cropping the first image containing the second part from the video image frame comprises:

determining a target bounding box in the video image frame, the target bounding box being a box for framing the second part;

cropping a sub-image within the target bounding box from the video image frame to obtain the first image.
7. The method according to claim 1, wherein predicting the key points of the first part based on the video image frame to obtain the first key points comprises:

determining an estimated region of the first part based on the video image frame;

predicting the key points of the first part based on the estimated region to obtain the first key points.

8. The method according to claim 1, wherein the method further comprises:

after the pose detection result of the target object is obtained, generating a pose trigger signal of a virtual object corresponding to the target object according to the pose detection result;

controlling the virtual object to perform a corresponding trigger action according to the pose trigger signal.

9. The method according to claim 1, wherein the method further comprises:

determining a plurality of training samples, each of the plurality of training samples containing part of the target limb parts of a target object, and each training sample containing annotation information for every key point of the target limb parts; and training a pose detection model to be trained with the plurality of training samples to obtain a pose detection model;

wherein predicting the key points of the first part based on the video image frame to obtain the first key points, and determining the key points of the second part of the target limb parts contained in the video image frame to obtain the second key points, comprises: predicting the key points of the first part in the video image frame based on the pose detection model to obtain the first key points, and predicting the key points of the second part of the target limb parts contained in the video image frame based on the pose detection model to obtain the second key points.

10. The method according to claim 9, wherein determining the plurality of training samples comprises:

collecting original images containing all target limb parts of the target object and performing limb detection on the original images to obtain a plurality of key points;

performing occlusion processing on at least some of the target limb parts in the original images, and determining the plurality of training samples based on the occlusion-processed original images and the annotation information of the plurality of key points.
11. A pose estimation apparatus, comprising:

an acquisition part configured to acquire a video image frame containing body parts of a target object;

a detection part configured to perform occlusion detection on target limb parts of the target object in the video image frame;

a first determination part configured to, when a first part of the target limb parts is detected to be occluded, predict key points of the first part based on the video image frame to obtain first key points, and determine key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

a second determination part configured to determine a pose detection result of the target object based on the first key points and the second key points.

12. A computer device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate over the bus; and when the machine-readable instructions are executed by the processor, the steps of the pose estimation method according to any one of claims 1 to 10 are performed.

13. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is run by a processor, the steps of the pose estimation method according to any one of claims 1 to 10 are performed.

14. A computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the steps of the pose estimation method according to any one of claims 1 to 10.
PCT/CN2022/074929 (priority date 2021-08-27, filing date 2022-01-29): Posture estimation method and apparatus, computer device, storage medium, and program product. WO2023024440A1 (en)

Applications Claiming Priority (2)

Application Number / Priority Date / Filing Date / Title
CN202110994673.3, priority date 2021-08-27
CN202110994673.3A (CN113449696B), priority date 2021-08-27, filing date 2021-08-27: Attitude estimation method and device, computer equipment and storage medium

Publications (1)

Publication Number
WO2023024440A1

Family ID: 77818821

Family Applications (1)

Application Number / Title / Priority Date / Filing Date
PCT/CN2022/074929: Posture estimation method and apparatus, computer device, storage medium, and program product (WO2023024440A1), priority date 2021-08-27, filing date 2022-01-29

Country Status (3)

Country Link
CN (1) CN113449696B (en)
TW (1) TW202309782A (en)
WO (1) WO2023024440A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449696B (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Attitude estimation method and device, computer equipment and storage medium
CN116934848A (en) * 2022-03-31 2023-10-24 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN115171217B (en) * 2022-07-27 2023-03-03 北京拙河科技有限公司 Action recognition method and system under dynamic background
CN114998814B (en) * 2022-08-04 2022-11-15 广州此声网络科技有限公司 Target video generation method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070268295A1 (en) * 2006-05-19 2007-11-22 Kabushiki Kaisha Toshiba Posture estimation apparatus and method of posture estimation
US10296102B1 (en) * 2018-01-31 2019-05-21 Piccolo Labs Inc. Gesture and motion recognition using skeleton tracking
CN110929651A (en) * 2019-11-25 2020-03-27 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111027407A (en) * 2019-11-19 2020-04-17 东南大学 Color image hand posture estimation method for shielding situation
CN112115886A (en) * 2020-09-22 2020-12-22 北京市商汤科技开发有限公司 Image detection method and related device, equipment and storage medium
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium
CN113449696A (en) * 2021-08-27 2021-09-28 北京市商汤科技开发有限公司 Attitude estimation method and device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960011B (en) * 2017-05-23 2021-12-03 湖南生物机电职业技术学院 Partially-shielded citrus fruit image identification method
CN107633204B (en) * 2017-08-17 2019-01-29 平安科技(深圳)有限公司 Face occlusion detection method, apparatus and storage medium
CN108711175B (en) * 2018-05-16 2021-10-01 浙江大学 Head attitude estimation optimization method based on interframe information guidance
CN109784255B (en) * 2019-01-07 2021-12-14 深圳市商汤科技有限公司 Neural network training method and device and recognition method and device
CN110826519B (en) * 2019-11-14 2023-08-18 深圳华付技术股份有限公司 Face shielding detection method and device, computer equipment and storage medium
CN112257552B (en) * 2020-10-19 2023-09-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113239797B (en) * 2021-05-12 2022-02-25 中科视语(北京)科技有限公司 Human body action recognition method, device and system

Also Published As

Publication number Publication date
TW202309782A (en) 2023-03-01
CN113449696A (en) 2021-09-28
CN113449696B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
WO2023024440A1 (en) Posture estimation method and apparatus, computer device, storage medium, and program product
Sridhar et al. Real-time joint tracking of a hand manipulating an object from RGB-D input
CN107251096B (en) Image capturing apparatus and method
US8488888B2 (en) Classification of posture states
US20160202770A1 (en) Touchless input
US20200286302A1 (en) Method And Apparatus For Manipulating Object In Virtual Or Augmented Reality Based On Hand Motion Capture Apparatus
US20130251244A1 (en) Real time head pose estimation
KR101929077B1 (en) Image identification method and image identification device
EP3540574B1 (en) Eye tracking method, electronic device, and non-transitory computer readable storage medium
US9536132B2 (en) Facilitating image capture and image review by visually impaired users
KR20170014491A (en) Method and device for recognizing motion
WO2017084319A1 (en) Gesture recognition method and virtual reality display output device
WO2023024442A1 (en) Detection method and apparatus, training method and apparatus, device, storage medium and program product
Antoshchuk et al. Gesture recognition-based human–computer interaction interface for multimedia applications
US8970479B1 (en) Hand gesture detection
WO2019037257A1 (en) Password input control device and method, and computer readable storage medium
US11410398B2 (en) Augmenting live images of a scene for occlusion
US11854308B1 (en) Hand initialization for machine learning based gesture recognition
KR20200081529A (en) HMD based User Interface Method and Device for Social Acceptability
Schlattmann et al. Markerless 4 gestures 6 DOF real‐time visual tracking of the human hand with automatic initialization
CN114327063A (en) Interaction method and device of target virtual object, electronic equipment and storage medium
de Gusmao Lafayette et al. The virtual Kinect
EP3584688A1 (en) Information processing system, information processing method, and program
Albrektsen Using the Kinect Sensor for Social Robotics
CN116820251B (en) Gesture track interaction method, intelligent glasses and storage medium