CN113449696B - Attitude estimation method and device, computer equipment and storage medium - Google Patents

Attitude estimation method and device, computer equipment and storage medium

Info

Publication number
CN113449696B
CN113449696B (application CN202110994673.3A)
Authority
CN
China
Prior art keywords
video image
image frame
target
limb
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110994673.3A
Other languages
Chinese (zh)
Other versions
CN113449696A (en)
Inventor
曹国良
邱丰
刘文韬
钱晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110994673.3A
Publication of CN113449696A
Application granted
Publication of CN113449696B
Priority to PCT/CN2022/074929 (WO2023024440A1)
Priority to TW111117369A (TW202309782A)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/70 — Determining position or orientation of objects or cameras
    • G06T 7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 — Subject of image; Context of image processing
    • G06T 2207/30196 — Human being; Person

Abstract

The present disclosure provides a pose estimation method, apparatus, computer device and storage medium. The method comprises: acquiring a video image frame containing a limb part of a target object; performing occlusion detection on a target limb part of the target object in the video image frame; when it is detected that a first part of the target limb part is occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points; and determining a pose detection result of the target object based on the first key points and the second key points.

Description

Attitude estimation method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular to a pose estimation method and apparatus, a computer device, and a storage medium.
Background
In existing pose capture schemes, the pose of a subject is captured with a webcam. However, owing to uncertain factors such as the camera's field-of-view specification and the distance between the subject and the lens, the subject's limbs often partially extend beyond the camera frame (for example, the forearms are out of frame, or everything below the shoulders and chest is out of frame), so that the pose capture device cannot accurately recognize the subject's pose.
Disclosure of Invention
Embodiments of the present disclosure provide at least a pose estimation method and apparatus, a computer device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a pose estimation method, including: acquiring a video image frame containing a limb part of a target object; performing occlusion detection on a target limb part of the target object in the video image frame; when it is detected that a first part of the target limb part is occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points; and determining a pose detection result of the target object based on the first key points and the second key points.
In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and an occluded first part is detected, key points of the occluded first part are predicted and key points of the unoccluded second part are determined. In this way, both the key points of the second part captured inside the video image frame and the key points of the first part lying outside it can be predicted even though the target limb part is occluded, yielding key point positions that are reasonable, stable and free of jitter, so that pose detection of the target object can still be performed when the video image frame does not contain the complete limb part.
In an optional embodiment, the method further comprises: determining an acquisition distance corresponding to the video image frame, the acquisition distance representing the distance between the target object and the video acquisition device when the video acquisition device acquired the video image frame; and judging, based on the acquisition distance, whether the video image frame satisfies a facial expression capturing condition. In this case, the occlusion detection of the target limb part of the target object in the video image frame includes: performing occlusion detection on the target limb part of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial expression capturing condition.
In an optional embodiment, the method further comprises: determining that the video image frame meets the facial expression capturing condition under the condition that the acquisition distance meets the requirement of a preset distance; and carrying out facial expression detection on the video image frame to obtain a facial expression detection result.
In the above embodiment, facial expression capturing may be performed on the anchor when it is determined, based on the acquisition distance, that the facial expression capturing condition is satisfied. When it is determined that the condition is not satisfied, occlusion detection may be performed on the target limb part of the target object in the video image frame; if the first part is occluded, key points of the first part and the second part are predicted separately based on the video image frame, so that the limb part to be detected that is not contained in the image can still be detected. This solves the problem in the prior art that limb points outside the image cannot be predicted, and further alleviates the severe jitter of limb points caused by unstable limb point detection when an application performs limb detection.
In an alternative embodiment, the predicting the keypoints of the first part based on the video image frame to obtain first keypoints includes: intercepting a first image containing the second part in the video image frame; performing edge filling on the first image to obtain a second image containing a filled area, wherein the filled area is an area for performing key point detection on the first part; and predicting the key point of the first part based on the filled area in the second image to obtain the first key point.
In the above embodiment, a first image containing the second part is cropped from the video image frame and edge-filled to obtain the second image. The second image can then be used to predict the key points of the occluded first part, so that, when the first part of the target limb part is occluded in the video image frame, both the key points of the second part captured inside the frame and the key points of the first part lying outside the frame can be predicted.
In an optional embodiment, the edge-filling the first image to obtain a second image including a filled region includes: determining attribute information of the first location, wherein the attribute information comprises: limb type information and/or limb size information; determining a filling parameter of the first image according to the attribute information; wherein the padding parameters include: padding locations and/or padding sizes; and performing edge filling on the first image based on the filling parameters to obtain the second image.
In the above embodiment, by determining the filling parameter according to the attribute information of the first portion, and filling the video image frame according to the filling parameter to obtain the second image, a larger image resolution can be ensured as much as possible on the basis of filling the video image frame, thereby ensuring that a posture detection result with a higher accuracy is obtained.
In an alternative embodiment, said intercepting a first image containing said second portion in said video image frame includes: determining a target frame body in the video image frame, wherein the target frame body is used for framing the frame body of the second part; and intercepting a sub-image positioned in the target frame in the video image frame to obtain the first image.
In the above embodiment, in the case that the first part of the target limb part of the target object is blocked, the second image is obtained by intercepting the sub-image in the target frame and performing edge filling on the sub-image, so that the application scenario of the posture estimation method provided by the disclosure can be expanded, and in the complex posture estimation scenario, the normal and stable operation of the application program based on the posture estimation can still be ensured.
In an alternative embodiment, the predicting the keypoints of the first part based on the video image frame to obtain first keypoints includes: determining an estimated region of the first location based on the video image frame; and predicting the key point of the first part based on the estimated area to obtain the first key point.
In the above embodiment, when the keypoints of the first portion are predicted by the first portion estimation region, the pose detection model may be guided by the estimation region to detect the first keypoints of the first portion that are missing in the video image frame, so that the accuracy of the detected keypoints is improved, and the detection error of the keypoints is reduced.
In an optional embodiment, the method further comprises: after the attitude detection result of the target object is obtained, generating an attitude trigger signal of a virtual object corresponding to the target object according to the attitude detection result; and controlling the virtual object to execute a corresponding trigger action according to the attitude trigger signal.
In the above embodiment, since a more accurate detection result can be obtained when the limb part of the target object is detected according to the posture detection model, when the virtual object is triggered to execute the corresponding trigger action according to the posture detection result, the virtual object can be accurately controlled to execute the corresponding trigger action.
In an optional embodiment, the method further comprises: determining a plurality of training samples; each training sample comprises part of target limb parts of a target object, and each training sample comprises labeling information of each key point of the target limb parts; training a posture detection model to be trained through the plurality of training samples to obtain a posture detection model; predicting the key point of the first part based on the video image frame to obtain a first key point, determining the key point of a second part of the target limb part contained in the video image frame to obtain a second key point, wherein the method comprises the following steps: and predicting key points of the first part in the video image frame based on the gesture detection model to obtain first key points, and predicting key points of a second part of the target limb part contained in the video image frame based on the gesture detection model to obtain second key points.
In the above embodiment, the gesture detection model to be trained is trained by the plurality of training samples, so that the gesture detection model capable of performing gesture detection on the image including the part of the target limb portion of the target object can be obtained.
In an alternative embodiment, the determining a plurality of training samples includes: acquiring original images containing all target limb parts of a target object, and performing limb detection on the original images to obtain a plurality of key points; and performing occlusion processing on at least part of a specified limb part in the original images, and determining the training samples based on the occlusion-processed original images and the labeling information of the key points.
In the above embodiment, occlusion processing of the original images simulates situations in the corresponding application scenario where the lower limbs are occluded or cut off. When training samples produced in this way are used to train the pose detection model to be trained, pose detection of the target object can still be performed when the video image frame does not contain all of the target limb parts, ensuring that the corresponding application can run normally and stably.
In a second aspect, an embodiment of the present disclosure further provides an attitude estimation apparatus, including: the system comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a video image frame of a limb part containing a target object; the detection unit is used for carrying out occlusion detection on the target limb part of the target object in the video image frame; a first determining unit, configured to predict a key point of a first part of the target limb part based on the video image frame to obtain a first key point, and determine a key point of a second part of the target limb part included in the video image frame to obtain a second key point, when it is detected that the first part of the target limb part is occluded; a second determination unit configured to determine a posture detection result of the target object based on the first key point and the second key point.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this disclosed embodiment also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; other related drawings can be derived from them by those skilled in the art without creative effort.
FIG. 1 illustrates a flow chart of a method of attitude estimation provided by an embodiment of the present disclosure;
fig. 2 shows a specific flowchart of predicting a keypoint of the first portion based on the video image frame to obtain a first keypoint in a pose estimation method provided by the embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a video image frame provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram illustrating a first image including a second portion obtained after a video image frame is acquired according to an embodiment of the disclosure;
fig. 5 is a schematic diagram illustrating a detection result of performing keypoint detection on a first portion and a second portion in a second image according to an embodiment of the disclosure;
fig. 6 shows a specific flowchart for intercepting a first image containing the second portion in the video image frame in a pose estimation method provided by the embodiment of the present disclosure;
fig. 7 shows a specific flowchart of performing edge filling on the first image to obtain a second image including a filled region in the pose estimation method provided by the embodiment of the present disclosure;
fig. 8a is a schematic diagram illustrating a filling effect of filling a first image into a second image according to an embodiment of the present disclosure;
fig. 8b is a schematic diagram illustrating another filling effect of filling a first image to obtain a second image according to the embodiment of the present disclosure;
fig. 9 shows another specific flowchart for performing edge filling on the first image to obtain a second image including a filled region in the pose estimation method provided by the embodiment of the present disclosure;
fig. 10 is a specific flowchart illustrating a first keypoint obtained by predicting a keypoint of the first portion based on the video image frame in an attitude estimation method provided by an embodiment of the present disclosure;
fig. 11 shows a specific flowchart of a first alternative method for determining an estimated region of a first location in an attitude estimation method provided by an embodiment of the present disclosure;
fig. 12 is a specific flowchart illustrating a second alternative method for determining an estimated region of a first location in an attitude estimation method provided by an embodiment of the present disclosure;
FIG. 13 illustrates a flow chart of another method of attitude estimation provided by embodiments of the present disclosure;
FIG. 14 is a schematic diagram illustrating an architecture of an attitude estimation device provided in an embodiment of the present disclosure;
fig. 15 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Research has shown that, in existing pose capture schemes, the pose of a subject is captured and recognized with a webcam. However, owing to uncertain factors such as the camera's field-of-view specification and the distance between the subject and the lens, the subject's limbs often partially extend beyond the camera frame (for example, the forearms are out of frame, or everything below the shoulders and chest is out of frame), so that the pose capture device cannot accurately recognize the subject's pose.
Based on the above research, the present disclosure provides a pose estimation method and apparatus, a computer device, and a storage medium. In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and an occluded first part is detected, key points of the occluded first part are predicted and key points of the unoccluded second part are determined. In this way, both the key points of the second part captured inside the video image frame and the key points of the first part lying outside it can be predicted even though the target limb part is occluded, yielding key point positions that are reasonable, stable and free of jitter, so that accurate pose detection of the target object can still be performed when the video image frame does not contain the complete limb part.
To facilitate understanding of the present embodiment, the pose estimation method disclosed in the embodiments of the present disclosure is first described in detail. The execution subject of the pose estimation method provided in the embodiments of the present disclosure is generally a computer device with certain computing power. The computer device may be a live-streaming device, for example any of a smartphone, a tablet computer, or a PC capable of performing pose estimation.
In some possible implementations, the pose estimation method can be implemented by a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of an attitude estimation method provided in the embodiment of the present disclosure is shown, where the method includes steps S101 to S107, where:
s101: video image frames containing limb portions of a target object are acquired.
In the embodiment of the present disclosure, a video image frame containing a limb portion of a target object may be first acquired by an image pickup device of a computer apparatus. The limb part included in the video image frame can be a whole body limb part or a half body limb part of the target object. Here, the half-body limb part may include the following limb parts: the limb parts (head, upper torso, arms, hands) above the waist of the target subject.
For example, the technical solution of the present disclosure may be used in a live broadcast scenario, and the computer device may be a device capable of installing a live broadcast application program. At this time, the target object may be a anchor, and the obtained video image frame may be a video image frame that is acquired by the anchor in a live broadcast process and includes a limb portion of the anchor. Of course, in some embodiments, the method can also be applied to other video playing scenes.
S103: and carrying out occlusion detection on the target limb part of the target object in the video image frame.
In specific implementation, the occlusion detection model can be used for carrying out occlusion detection on the target limb part of the target object in the video image frame. The target limb portion may be all limb portions of the target object, and may also be a part limb portion of the target object, which is not specifically limited by the present disclosure.
After the target limb part is subjected to occlusion detection, a corresponding occlusion detection result can be obtained, wherein the occlusion detection result is used for representing the complete condition of the target limb part. For example, the complete case includes a complete or incomplete result, and the occlusion detection result may further include location information of a location (i.e., the first location) that is occluded in the case of an incomplete result.
If the target limb part is a half body part of the target object and the video image frame does not include the hand of the target object, after occlusion detection is performed on the video image frame, an occlusion detection result representing that the target limb part in the video image frame is incomplete and that the occluded part of the target limb part is the hand of the target object can be obtained.
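For concreteness, the following is a minimal illustrative sketch (in Python, not part of the patent) of one data structure such an occlusion detection result could take; the field names are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OcclusionResult:
    complete: bool                                             # True if the target limb part is fully visible
    occluded_parts: List[str] = field(default_factory=list)    # e.g. ["left_hand", "right_hand"]

# Example: half-body target limb part with both hands cut off by the frame edge
result = OcclusionResult(complete=False, occluded_parts=["left_hand", "right_hand"])
if not result.complete:
    print("Occluded first part(s):", result.occluded_parts)
```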
S105: and under the condition that the first part of the target limb part is detected to be blocked, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points.
In the embodiment of the present disclosure, when it is detected that the first part of the target limb part is occluded, the pose detection model may predict the key point of the occluded first part, and predict the key point of the second part of the target limb part included in the video image.
In a specific implementation, the edges of the video image frame can be filled based on the position of the missing first part, and the edge-filled video image frame is processed by the pose detection model, so that the key points of the first part are predicted within the filled area. Meanwhile, the pose detection model can also detect the key points of the second part contained in the video image frame to obtain the second key points.
In the embodiment of the present disclosure, the target limb portion may include a plurality of sub-portions, for example, the target limb portion may be an upper half limb portion of the target object, and in this case, the target limb portion may include the following sub-portions: head, upper torso, arms, and hands. In this case, the first portion may be a complete sub-portion or a partial sub-portion. For example, the first portion may be two hands, which represents that the first portion that is occluded in the video image frame is two hands; the first part can also be a finger part, and the first part which is blocked in the video image frame is the finger part of the two hands.
In the embodiment of the present disclosure, before performing occlusion detection on a target limb part of a target object in a video image frame, the following steps may also be performed:
determining, at the time the video image frame was captured, the distance between the target object and the camera device of the computer equipment, to obtain an acquisition distance; and comparing the acquisition distance with a first distance threshold and a second distance threshold. If the comparison shows that the acquisition distance is smaller than the first distance threshold, or that it is larger than the second distance threshold, it is preliminarily determined that occlusion detection on the video image frame is not needed. If the comparison shows that the acquisition distance is smaller than or equal to the second distance threshold and larger than or equal to the first distance threshold, occlusion detection needs to be performed on the target limb part in the video image frame.
Here, the second distance threshold is greater than the first distance threshold. The first distance threshold and the second distance threshold may be distance thresholds empirically selected in advance, or may be distance thresholds set in the computer device in advance for the target object. The first distance threshold is used for representing the distance between the target object and the camera when the part below the head of the target object is not included in the video image frame or the part below the head of the target object is not enough for gesture detection. The second distance threshold is used for representing the distance between the target object and the camera when the video image frame contains the complete target limb part.
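The following is a hedged sketch of this threshold logic; the thresholds themselves are application-specific values, not fixed by the disclosure.

```python
def needs_occlusion_detection(capture_distance: float,
                              first_threshold: float,
                              second_threshold: float) -> bool:
    """Decide whether occlusion detection should run on this frame.

    first_threshold  ~ distance at which little more than the head is visible
    second_threshold ~ distance at which the complete target limb part is visible
    (second_threshold > first_threshold)
    """
    if capture_distance < first_threshold or capture_distance > second_threshold:
        # Either only the face is usable (expression capture) or the full target
        # limb part is in frame, so occlusion detection is preliminarily unnecessary.
        return False
    return True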
S107: determining a pose detection result of a target object based on the first and second keypoints.
After the first and second key points are determined, the pose detection result of the target object can be determined based on the first and second key points.
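A minimal sketch of the overall flow of steps S101 to S107, with the occlusion-detection model, the edge-padding step and the pose-detection model passed in as stand-in callables; the key point dictionary layout is an assumption for illustration only.

```python
def estimate_pose(frame, detect_occlusion, pad_for_missing_part, pose_model) -> dict:
    """Sketch of S101–S107. The three callables stand in for the occlusion-detection
    model, the edge-padding step and the pose-detection model; their interfaces are
    assumptions made for this illustration."""
    occlusion = detect_occlusion(frame)                     # S103: occlusion detection
    if occlusion.complete:
        return {"keypoints": pose_model(frame)}             # no occlusion: ordinary detection
    # S105: pad the frame so the model has room to place key points for the occluded
    # first part, then run the pose model once on the padded image.
    padded = pad_for_missing_part(frame, occlusion.occluded_parts)
    keypoints = pose_model(padded)                          # each keypoint: {"part": ..., "xy": ...}
    first_kpts = [k for k in keypoints if k["part"] in occlusion.occluded_parts]
    second_kpts = [k for k in keypoints if k["part"] not in occlusion.occluded_parts]
    # S107: the pose detection result is assembled from both key point sets.
    return {"first_keypoints": first_kpts, "second_keypoints": second_kpts}
```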
The above steps are described in detail below.
In an alternative embodiment, the method further comprises the steps of:
determining a corresponding acquisition distance of the video image frame; the acquisition distance is used for representing the distance between a target object and the video acquisition equipment when the video acquisition equipment acquires the video image frame; and determining whether the video image frame satisfies a facial expression capturing condition based on the acquisition distance.
Determining that the video image frame meets the facial expression capturing condition under the condition that the acquisition distance meets the requirement of a preset distance; and carrying out facial expression detection on the video image frame to obtain a facial expression detection result.
Based on this, in step S103, the occlusion detection is performed on the target limb part of the target object in the video image frame, which specifically includes the following contents:
and under the condition that the video image frame is determined not to meet the facial expression capturing condition, carrying out occlusion detection on the target limb part of the target object in the video image frame.
In the technical scheme of the disclosure, besides limb detection on a target object, facial expression capture can be performed on the target object. At this time, the distance between the target object and the camera when the video image frame is captured by the video capture device (e.g., the camera of the computer device) can be determined.
Then, it is determined whether the video image frame satisfies a facial expression capturing condition based on the acquisition distance. In particular, the acquisition distance may be compared to the first distance threshold described above. If the comparison result shows that the acquisition distance is smaller than the first distance threshold, the preset distance requirement is determined to be met, namely, the video image frame is determined to meet the facial expression capturing condition, and at the moment, the facial expression capturing can be carried out on the video image frame.
And if the acquisition distance is judged to be greater than or equal to the first distance threshold, determining that the video image frame does not meet the facial expression capturing condition, and at the moment, carrying out occlusion detection on the target limb part of the target object in the video image frame.
Under the condition that the first part of the target limb part is detected to be blocked, the key point of the first part can be predicted based on the video image frame to obtain a first key point, and the key point of the second part of the target limb part contained in the video image frame is determined to obtain a second key point; and determining a posture detection result of the target object based on the first key point and the second key point.
And under the condition that the target limb part is not blocked, detecting the target limb part in the video image frame to obtain the posture detection result of the target object.
Some existing applications adopt a single-mode technical solution: the application can only perform facial expression capture when the image contains a complete face, and can only perform limb capture when the image contains the complete limb part to be detected. When the application can neither capture facial expressions nor find the complete limb part to be detected in the image, its face capture and limb detection functions cannot run normally and stably.
Assume the application is virtual live-streaming software. In a virtual live-streaming scenario, the limb movements of an anchor are often captured and recognized with a webcam, but uncertain factors such as the camera's field-of-view specification and the distance between the anchor and the lens exist. In particular, when the anchor is extremely close to the camera, parts of the person's limbs extend out of the camera frame (for example, the forearms are out of frame, or the shoulders and chest are out of frame), and virtual live-streaming software in the prior art cannot detect the limbs normally.
With the technical solution of the present disclosure, facial expression capture can be performed on the anchor when it is determined, based on the acquisition distance, that the facial expression capturing condition is satisfied. When it is determined that the condition is not satisfied, occlusion detection can be performed on the target limb part of the target object in the video image frame; if the first part is occluded, the limb part to be detected that is not contained in the image is detected by separately predicting the key points of the first part and the second part. This solves the problem in the prior art that limb points outside the image cannot be predicted, and further alleviates the severe jitter of limb points caused by unstable limb point detection when the application performs limb detection.
In the technical scheme, the smooth transition between facial expression capture and limb part capture can be realized, so that the robustness of the application program is improved, and the stable operation of the application program is ensured.
In an alternative embodiment, as shown in fig. 2, in step S105, predicting the keypoints of the first portion based on the video image frame to obtain first keypoints, specifically includes the following steps:
s1051: and intercepting a first image containing the second part in the video image frame.
S1052: and performing edge filling on the first image to obtain a second image containing a filled area, wherein the filled area is an area for performing key point detection on the first part.
S1053: and predicting the key point of the first part based on the filled area in the second image to obtain the first key point.
Assume that the image shown in fig. 3 is a video image frame acquired in step S101 above and including the upper limb part of the target object. As can be seen from fig. 3, a part of the finger portion (i.e., the above-described first portion) of the target object is occluded. At this time, the first image including the second portion is cut out from the video image frame, and the first image shown in fig. 4 is obtained.
After the first image is captured, edge filling may be performed on the first image, so as to obtain a second image including a filled region. After the second image is obtained, the pose of the second image is detected through a pose detection model, so that the key points of the first part are predicted in the filling area to obtain first key points, and the key points of the second part are predicted in other areas except the filling area in the second image to obtain second key points. For example, as shown in fig. 5, the black image area is the filled area. After processing using the above described procedure, the keypoints of the first location may be predicted in the padding area, and the keypoints of the second location may be predicted in other areas than the padding area.
In the embodiment of the present disclosure, when performing edge padding on the first image, the first image may be edge-padded based on the black image in a manner as shown in fig. 5.
Here, edge padding may be understood as performing edge padding on a first image based on a position of an occluded first part in the video image frame, so as to obtain an area capable of performing keypoint detection on the occluded first part.
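For example, a minimal sketch of such edge padding using OpenCV, assuming the occluded first part lies below the cropped first image as in Fig. 5; the padding amount is a placeholder value, not taken from the patent.

```python
import cv2
import numpy as np

def pad_bottom_with_black(first_image: np.ndarray, pad_pixels: int) -> np.ndarray:
    """Add a black filled region below the cropped first image (cf. the black area in
    Fig. 5), giving the pose model a region in which to place key points of the
    occluded first part."""
    return cv2.copyMakeBorder(first_image, 0, pad_pixels, 0, 0,
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))

# Usage: second_image = pad_bottom_with_black(first_image, pad_pixels=120)
```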
In the above embodiment, the first image including the second portion is captured in the video image frame, and the edge of the first image is filled to obtain the second image, so that the prediction of the keypoint of the occluded first portion in the video image frame can be realized through the second image, and thus, the keypoint of the second portion captured in the video image frame and the keypoint of the first portion outside the video image frame can be predicted under the condition that the first portion of the target limb portion in the video image frame is occluded.
In an alternative embodiment, as shown in fig. 6, S1051: intercepting a first image containing the second part in the video image frame, comprising the following processes:
s601: and determining a target frame body in the video image frame, wherein the target frame body is used for framing the frame body of the second part.
S602: and intercepting a sub-image positioned in the target frame in the video image frame to obtain the first image.
In the embodiments of the present disclosure, occlusion of the first part of the target object in a video image frame covers two cases: the first part is truncated by the image edge, so that it is occluded; or the first part is blocked by another object in the video image frame, so that it is occluded.
In the embodiments of the present disclosure, when the first part of the target limb part is not detected in the video image frame and the position where the first part should appear is not at the edge of the video image frame, it may be determined that the first part is occluded by another object in the video image frame.
In this case, a target frame for framing the second part may be determined in the video image frame. Then, the sub-image in the video image frame located in the target frame body is intercepted, and a first image is obtained.
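An illustrative sketch of cropping the sub-image inside the target box; the (x1, y1, x2, y2) pixel-coordinate box format is an assumption.

```python
import numpy as np

def crop_target_box(frame: np.ndarray, box: tuple) -> np.ndarray:
    """Cut out the sub-image inside the target box framing the second (visible) part.
    box = (x1, y1, x2, y2) in pixel coordinates, clamped to the frame bounds."""
    x1, y1, x2, y2 = [int(v) for v in box]
    h, w = frame.shape[:2]
    x1, y1 = max(0, x1), max(0, y1)
    x2, y2 = min(w, x2), min(h, y2)
    return frame[y1:y2, x1:x2].copy()   # first image containing the second part
```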
In the embodiment of the present disclosure, if the first portion of the target limb portion is not detected in the video image frame and the first portion is detected to be truncated by the image edge, the edge of the video image frame may be directly filled, so as to obtain the second image including the filled region.
Here, the process of edge-filling the video image frame is the same as the process of edge-filling the first image, and in the following embodiments, the process of edge-filling will be described by taking edge-filling the first image as an example.
In the above embodiment, when it is detected that the first part of the target limb part of the target object is blocked, the application scenario of the posture estimation method provided by the present disclosure can be expanded by capturing the sub-image in the target frame and performing edge filling on the sub-image to obtain the second image, and in a complex posture estimation scenario, the application program based on the posture estimation can still be ensured to be normally and stably operated.
As can be seen from the above description, when performing edge padding on the first image (or the video image frame), the edge padding on the first image can be implemented by padding a black image in the first image as shown in fig. 5.
In addition to the edge filling method described above, the following method may also be used to perform edge filling on the first image to obtain the second image, and specifically includes:
determining, in the video image frame, position information of the occluder that blocks the first part, and replacing the image region indicated by this position information with a background image of a preset color, for example a black background image.
In the embodiment of the present disclosure, in addition to the black background image, other color background images may be substituted. In order to improve the processing accuracy of the pose detection model, the preset color may be determined based on a training sample of the pose detection model, which will be described in detail in the following process.
In an alternative embodiment, as shown in fig. 7, the above steps: performing edge filling on the first image to obtain a second image containing a filled area, specifically comprising the following processes:
s701: determining attribute information of the first location, wherein the attribute information comprises: limb type information and/or limb size information.
S702: determining a filling parameter of the first image according to the attribute information; wherein the padding parameters include: padding locations and/or padding sizes.
S703: and performing edge filling on the first image based on the filling parameters to obtain the second image.
In the embodiment of the present disclosure, the first part may be understood as a limb part which is absent in the video image frame and needs to be subjected to posture detection by the posture detection model. For example, the gesture detection model needs to detect the upper body limb part of the anchor, however, a part of the hand part is absent in the video image frame, and at this time, the first part is the absent hand part in the video image frame.
Here, the limb type information indicates the type of the first part missing in the video image frame, for example that the missing first part is a hand. The limb size information indicates the size (or dimension) information of the first part missing in the video image frame.
It will be appreciated that after the limb type information is determined, the positional relationship of the first location with respect to the first image can be estimated.
After the attribute information is determined, a padding position and/or a padding size of the first image may be determined based on the attribute information, and the first image is padded based on the padding position and/or the padding size to obtain a second image.
In the embodiment of the present disclosure, when determining the filling-in parameter of the first image based on the attribute information of the first location, the positional relationship of the first location with respect to the first image may be determined based on the limb type information, for example, the first location should be located at a lower edge position of the first image, and at this time, the filling-in position may be determined based on the positional information, for example, the lower edge position of the first image may be determined as the filling-in position. Meanwhile, a padding size of the first image may also be determined based on the limb size information, for example, the limb size information may be determined as the padding size.
Assuming that the first part missing in the video image frame is a hand part, as shown in fig. 8a, it can be determined that the hand part is located below the first image, and at this time, the lower edge of the first image can be edge-filled. For example, the lower edge of the video image frame may be filled with an image of a black background.
Assuming that the first portion missing in the video image frame of fig. 8b is a hand portion and two arm portions, it can be determined that the hand portion is located below the first image, and the two arm portions are located at the left and right sides of the first image, and at this time, the lower edge, the left edge, and the right edge of the first image can be edge-filled. For example, an image of a black background may be filled in the lower edge, the left edge, and the right edge of the first image, respectively.
Here, in addition to the image that may be filled with the black background, a background image filled with other colors may be selected, and this disclosure is not particularly limited thereto.
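A sketch of steps S701 to S703 under the assumption that the limb type determines which edge(s) to pad and the limb size gives the padding amount; the mapping table below is illustrative only, not part of the patent.

```python
import cv2
import numpy as np

# Illustrative mapping from limb type of the missing first part to the edge(s) to pad.
SIDE_FOR_LIMB = {
    "hands": ("bottom",),
    "hands_and_arms": ("bottom", "left", "right"),
}

def pad_by_attributes(first_image: np.ndarray, limb_type: str, limb_size_px: int) -> np.ndarray:
    """Derive the filling position from the limb type and the filling size from the
    limb size, then edge-pad the first image with a black background (S701–S703)."""
    pads = {"top": 0, "bottom": 0, "left": 0, "right": 0}
    for side in SIDE_FOR_LIMB.get(limb_type, ()):
        pads[side] = limb_size_px
    return cv2.copyMakeBorder(first_image,
                              pads["top"], pads["bottom"], pads["left"], pads["right"],
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```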
Since the size of the image input to the pose detection model is set in advance, the padded second image needs to be adjusted to this preset size after the padding process. Therefore, filling more space reduces the resolution available for the image region corresponding to the limb part of the target object in the second image.
In the above embodiment, by determining the filling parameter according to the attribute information of the first portion, and filling the video image frame according to the filling parameter to obtain the second image, a larger image resolution can be ensured as much as possible on the basis of filling the video image frame, thereby ensuring that a posture detection result with a higher accuracy is obtained.
In an alternative embodiment, the limb type information of the first part may be determined in the following manner, specifically including:
the first method is as follows: the limb type information of the first part which is lacked in the video image frame can be estimated according to the estimated distance between the target object and the camera device.
For example, the distance between the target object and the imaging device may be predicted by a ranging model, and the limb type information of the missing first part may be output by the ranging model.
The second method comprises the following steps: and intercepting a first image containing a second part in the video image frame through the target frame body, and then identifying the first image to obtain the limb type information of the first part which is lacked in the video image frame.
In an alternative embodiment, the limb size information of the first part may be determined in the following manner, specifically including:
and acquiring the actual length information of the target object, wherein the actual length information can be the actual height information of the target object and can also be the actual length information of any complete target limb part of the target object. And then, determining the length information of the completely specified limb part contained in the video image frame, and estimating the limb size information according to the length information and the actual length information.
In the embodiment of the present disclosure, in addition to performing edge filling on the first image in the above-described manner to obtain the second image, the edge filling may be performed on the first image in the following manner, specifically including:
each image edge of the video image frame can be filled up respectively, and at the moment, the filling size for filling up each image edge can be a preset size.
The image edge corresponding to the missing first part may be determined in the first image, and then the image edge is filled, at this time, the filling size for filling the image edge may be a preset size, or may be the determined limb size information.
In the embodiment of the present disclosure, as shown in fig. 9, on the basis of the embodiment shown in fig. 7, after determining the padding parameter, the method further includes the following steps:
s901: and acquiring scene type information of the video image frame.
S902: and adjusting the filling parameters according to the scene type information, and filling the video image frame based on the adjusted filling parameters to obtain the second image.
In the embodiments of the present disclosure, scene type information corresponding to the video image frame may be determined; for example, the scene type information may be a live-commerce (product showcase) scene, a game commentary scene, a performance scene, and the like.
The resolution requirements for the image input into the pose detection model may not be the same for each scene type information. Therefore, after the scene type information is determined, the image resolution matched with the scene type information can be determined, and the filling parameters are adjusted according to the image resolution.
For example, for a scene with a high image resolution requirement, the filling size can be adaptively reduced, so as to preserve the resolution of the second image obtained by the padding processing. For a scene with a low image resolution requirement, the filling size can be adaptively increased, or the original filling size can be kept unchanged.
It should be understood that, when the above-mentioned padding parameter is adjusted according to the image resolution, it should be ensured that when the first image is padded according to the adjusted padding parameter, the second image after padding may include an area for performing the posture detection on the first part.
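An illustrative sketch of steps S901 and S902, with assumed scene labels and scale factors, that also enforces the minimum padding needed so a region for detecting the first part remains.

```python
# Illustrative adjustment of the padding size by scene type; the scene labels and
# scale factors below are assumptions, not values from the patent.
SCENE_PAD_SCALE = {
    "live_commerce": 0.8,     # high resolution demand: shrink the padding
    "game_commentary": 1.0,   # keep the original padding size
    "performance": 1.2,       # lower resolution demand: allow more padding
}

def adjust_pad_size(pad_size_px: int, scene_type: str, min_pad_px: int) -> int:
    """Scale the padding size for the scene, but never below the minimum needed to
    keep a region in which the occluded first part can still be detected."""
    scaled = int(pad_size_px * SCENE_PAD_SCALE.get(scene_type, 1.0))
    return max(scaled, min_pad_px)
```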
In the above embodiment, the filling parameter is adjusted according to the scene type information of the video image frame, so as to expand the video image frame according to the adjusted filling parameter, and a larger image resolution can be ensured as much as possible under the condition that a complete key point of the target limb part is detected, thereby ensuring that a posture detection result with a higher accuracy is obtained.
In this embodiment of the present disclosure, as shown in fig. 10, in step S105, predicting a keypoint of the first portion based on the video image frame to obtain a first keypoint, specifically, the following process is further included:
s1001: an estimated region of the first location is determined based on the video image frame.
S1002: and predicting the key point of the first part based on the estimated area to obtain the first key point.
In an alternative embodiment, the determining the estimated region of the first portion may include:
firstly, determining a first image containing a second part based on a video image frame; then, edge filling is carried out on the first part according to the mode to obtain a second image; an estimated area of the first location is then determined in the padded area of the second image. Wherein the estimation region is a region in the second image for estimating a keypoint of the second portion.
Here, the estimation region may be a rectangular region or a circular region, and the shape of the estimation region is not particularly limited in the present disclosure.
In an optional embodiment, as shown in fig. 11, the determining the estimated area of the first portion may further include:
S1101: determining limb type information corresponding to the first part, and determining a target part, among the target limb parts, that is associated with the first part. Here, the first part and the target part may be detection parts having a linkage relationship; for example, the first part moves with the target part, or the target part moves with the first part. If the first part is a hand, the target part may be the wrist; or the first part may be the wrist and the target part may be the forearm. Further examples are not enumerated here.
S1102: a positional constraint between the first location and the target location is determined based on the limb type information for the first location and the limb type information for the target location. Wherein the position constraint is for constraining a difference in position between the position of the first location in the second image and the position of the target location in the second image.
Here, different position constraints are set for different types of limb type information. The second image is an image obtained by edge-filling the first image in the above embodiment, and the first image is a sub-image including the second portion in the video image frame.
S1103: an estimated region of the first location in the second image is determined based on the location constraint.
By adopting the mode of determining the estimation area of the first part in the second image by position constraint, the phenomenon of larger position difference between the first part and the target part can be reduced, and the processing precision of the attitude detection model is improved.
After the estimation region is determined, pose detection may be performed on the second image marked with the estimation region by the pose detection model to obtain a pose detection result. The pose detection result includes labeling information of the key points of the complete target limb part, where the labeling information includes position information and category information.
When the attitude detection model is used for detecting the attitude of the second image marked with the estimation area, the attitude detection model can be guided to detect the first key point of the first part lacked in the video image frame through the estimation area, so that the accuracy of the detected key point is improved, and the detection error of the key point is reduced.
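A sketch of how a position constraint could place the estimation region relative to the detected target part; the offset and region size stand in for per-limb-type constraints and are assumptions made for illustration.

```python
import numpy as np

def estimate_region_from_constraint(target_kpt_xy, offset_xy, region_wh) -> tuple:
    """Place a rectangular estimation region for the occluded first part relative to
    the detected target part (e.g. hand relative to wrist) in the padded second image."""
    cx, cy = np.asarray(target_kpt_xy, dtype=float) + np.asarray(offset_xy, dtype=float)
    w, h = region_wh
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)   # (x1, y1, x2, y2)

# e.g. a hand expected ~60 px below the detected wrist:
# region = estimate_region_from_constraint([320.0, 480.0], [0.0, 60.0], (90, 90))
```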
In an alternative embodiment, as shown in fig. 12, the determining the estimated area of the first portion may further include the following steps:
s1201: and determining a target video image in the historical video image corresponding to the target object, wherein the similarity between the target video image and the video image frame meets a preset requirement, and the target video image comprises the first part.
S1202: and determining the estimation area according to the position information of the first part contained in the target video image.
In the disclosed embodiment, the historical video images (e.g., the historical live images) of the target object are first obtained in the cache folder. The cache folder is used for storing video image frames containing complete specified limb parts and gesture detection results corresponding to the video image frames.
After the historical video image is obtained, the historical video image can be screened to obtain a target video image, and the specific screening process is described as follows:
and calculating a characteristic distance between each historical video image and the video image frame, wherein the characteristic distance is used for representing the similarity between the historical video image and the video image frame. And screening target video images with the similarity meeting preset requirements with the video image frames in the historical video images according to the calculated characteristic distance. Here, satisfying the preset requirement may be understood as: the characteristic distance is greater than or equal to a preset distance threshold.
After the target video image is screened out, the estimation region may be determined according to the position information of the first part contained in the target video image. For example, that position information may be used directly as the estimation region.
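The screening described above can be sketched as follows. This is an illustrative example only: it assumes the cached features are embedding vectors and uses cosine similarity as the feature distance (so that larger values mean more similar), which is one possible choice rather than the method mandated by the present disclosure.

```python
import numpy as np

def select_target_frame(frame_feature, history, threshold=0.8):
    """history: list of (feature_vector, first_part_box) cached from frames that
    contain the complete target limb part. Returns the box of the most similar
    cached frame whose similarity meets the preset threshold, or None."""
    best_box, best_score = None, threshold
    f = frame_feature / (np.linalg.norm(frame_feature) + 1e-8)
    for feature, box in history:
        g = feature / (np.linalg.norm(feature) + 1e-8)
        score = float(np.dot(f, g))   # feature distance (cosine similarity assumed)
        if score >= best_score:       # meets the preset requirement
            best_score, best_box = score, box
    return best_box                   # used as the estimation region of the first part
```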
In the above embodiment, similar motions performed by the same object while using the computer device differ only slightly from one another; therefore, determining the estimation region from a retrieved target video image improves the accuracy of the estimation region and thus yields a more accurate pose detection result.
In the embodiment of the present disclosure, as shown in fig. 13, the method further includes the following processes:
S1301: After the pose detection result of the target object is obtained, generate a pose trigger signal of the virtual object corresponding to the target object according to the pose detection result.
S1302: Control the virtual object to execute the corresponding trigger action according to the pose trigger signal.
In the embodiment of the present disclosure, after the pose detection result of the target object is obtained, a pose trigger signal may be generated from the key points of the target limb part in the detection result, and this signal is used to trigger the virtual object to execute the corresponding trigger action.
Here, the trigger signal indicates the position information of the key points of each virtual limb of the virtual object in the video image frame.
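A minimal sketch of S1301 and S1302 is given below. The key point names, the VirtualObject interface, and the direct position mapping are assumptions made for illustration; an actual driver would typically retarget the 2D key points onto the avatar's skeleton.

```python
# Hypothetical interface: the names and methods below are not taken from the patent.

class VirtualObject:
    def apply_pose(self, trigger_signal):
        for name, (x, y) in trigger_signal.items():
            print(f"move virtual {name} to ({x}, {y})")

def build_trigger_signal(pose_result):
    """pose_result: {keypoint_name: (x, y, category)} from the pose detection model.
    The trigger signal carries the position of each virtual limb key point."""
    return {name: (x, y) for name, (x, y, _category) in pose_result.items()}

pose_result = {
    "left_shoulder": (210, 160, "visible"),
    "left_elbow": (180, 270, "predicted"),   # key point of the occluded first part
}
avatar = VirtualObject()
avatar.apply_pose(build_trigger_signal(pose_result))
```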
It should be noted that the avatar of the virtual object is a preset avatar (e.g., a virtual anchor), where the preset avatar includes at least one of the following: a three-dimensional humanoid figure (which may be a human or a human-like image such as an alien), a three-dimensional animal figure (such as a dinosaur or a pet cat), a two-dimensional humanoid figure, a two-dimensional animal figure, and the like.
In the above embodiment, since the pose detection model yields a more accurate detection result for the limb part of the target object, the virtual object can be controlled accurately when it is triggered to execute the corresponding trigger action according to that result.
In an embodiment of the present disclosure, the method further includes the following process:
First, a plurality of training samples are determined; each training sample contains part of the target limb part of the target object, together with labeling information for each key point of the target limb part. Then, the pose detection model to be trained is trained with the plurality of training samples to obtain the pre-trained pose detection model.
In the embodiment of the present disclosure, the plurality of training samples are first obtained and then input into the pose detection model to be trained, so as to train it.
In the above embodiment, training the pose detection model to be trained with these samples yields a pose detection model capable of performing pose detection on images that contain only part of the target limb part of the target object.
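A minimal PyTorch-style training sketch is shown below. The network architecture, the regression loss, and the data loader are assumptions chosen for brevity; the present disclosure only requires that each sample contain a partially occluded target limb part together with key point labels.

```python
import torch
from torch import nn

class PoseNet(nn.Module):
    """Toy key point regressor standing in for the pose detection model."""
    def __init__(self, num_keypoints=13):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(32, num_keypoints * 2)   # (x, y) per key point

    def forward(self, x):
        return self.head(self.backbone(x))

def train(model, loader, epochs=10):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for images, keypoints in loader:   # images: occluded samples; keypoints: (B, num_kp * 2)
            optimizer.zero_grad()
            loss = loss_fn(model(images), keypoints)
            loss.backward()
            optimizer.step()
    return model
```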
In the embodiment of the present disclosure, determining a plurality of training samples specifically includes the following processes:
First, an original image containing all target limb parts of the target object is acquired, and limb detection is performed on the original image to obtain a plurality of key points. After the key points are obtained, occlusion processing may be performed on at least part of the target limb part in the original image, and the plurality of training samples are determined based on the occluded original image and the labeling information of the key points.
In the embodiment of the present disclosure, an original image containing the complete target limb part is acquired first. For example, the target limb part may be the upper-body limbs of a human body, in which case the original image contains at least the upper body and may contain the whole body.
After the original image is obtained, limb detection may be performed on it to obtain a plurality of key points, which include the key points of the target limb part.
Occlusion processing is then performed on the original image; the occluded image contains an incomplete target limb part. The occluded image, together with the key points of the target limb part determined above, may then be taken as a training sample.
In an optional embodiment, performing occlusion processing on at least part of the target limb part in the original image specifically includes the following:
A background image of a preset color may be used to occlude at least part of the target limb part, yielding the occluded original image; alternatively, the occluded original image may be obtained by cropping out at least part of the target limb part from the original image.
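Both occlusion options can be sketched with plain NumPy as below; the occluded region, the colour, and the crop height are illustrative values rather than parameters specified by the present disclosure.

```python
import numpy as np

def occlude_with_color(image, box, color=(128, 128, 128)):
    """Cover part of the target limb with a preset-colour background.
    image: H x W x 3 uint8 array; box: (x, y, w, h)."""
    out = image.copy()
    x, y, w, h = box
    out[y:y + h, x:x + w] = color
    return out

def occlude_by_cropping(image, keep_height):
    """Simulate the lower limbs being cut off by the frame edge."""
    return image[:keep_height].copy()

original = np.zeros((256, 192, 3), dtype=np.uint8)           # stand-in original image
sample_a = occlude_with_color(original, (40, 120, 80, 100))  # colour-block occlusion
sample_b = occlude_by_cropping(original, keep_height=160)    # cropping occlusion
```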
In the above embodiment, applying occlusion processing to the original image simulates the application scenario in which the lower limbs are occluded or cropped out of the frame. When the pose detection model to be trained is trained with samples generated in this way, pose detection of the target object can still be performed even when the video image frame does not contain the complete target limb part, which ensures that the corresponding application runs normally and stably.
The present disclosure relates to the field of augmented reality. By acquiring image information of a target object in a real environment and detecting or identifying its relevant features, states, and attributes with various vision algorithms, an AR effect that combines the virtual with the real and matches a specific application can be obtained. Illustratively, the target object may involve a face, limbs, gestures, or actions associated with the human body. Specific applications include not only interactive scenarios related to real scenes or objects, such as navigation, explanation, reconstruction, and virtual-effect overlay display, but also person-related special-effect processing, such as makeup beautification, body beautification, special-effect display, and virtual model display. The detection or identification of the relevant features, states, and attributes of the target object may be implemented with convolutional neural networks.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their function and possible internal logic.
Based on the same inventive concept, an embodiment of the present disclosure further provides an attitude estimation device corresponding to the attitude estimation method. Since the device solves the problem on a principle similar to that of the attitude estimation method in the embodiment of the present disclosure, its implementation may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 14, a schematic diagram of an attitude estimation apparatus provided in an embodiment of the present disclosure is shown. The apparatus includes an acquisition unit 141, a detection unit 142, a first determination unit 143, and a second determination unit 144, where:
an acquiring unit 141, configured to acquire a video image frame including a limb portion of a target object;
a detecting unit 142, configured to perform occlusion detection on a target limb portion of the target object in the video image frame;
a first determining unit 143, configured to, when it is detected that a first portion of the target limb portion is occluded, predict a keypoint of the first portion based on the video image frame to obtain a first keypoint, and determine a keypoint of a second portion of the target limb portion included in the video image frame to obtain a second keypoint;
a second determining unit 144, configured to determine a posture detection result of the target object based on the first key point and the second key point.
In the embodiment of the present disclosure, when occlusion detection is performed on a video image frame and an occluded first part is detected, the key points of the occluded first part are predicted and the key points of the unoccluded second part in the video image frame are determined. In this way, even when the target limb part in the video image frame is occluded, both the key points of the second part captured in the frame and the key points of the first part lying outside the frame can be obtained, so that reasonable, stable, non-jumping key point positions are predicted; pose detection of the target object can therefore still be performed even when the video image frame does not contain the complete limb part.
In one possible embodiment, the apparatus is further configured to: determining a corresponding acquisition distance of the video image frame; the acquisition distance is used for representing the distance between a target object and the video acquisition equipment when the video acquisition equipment acquires the video image frame; judging whether the video image frame meets facial expression capturing conditions or not based on the acquisition distance; the occlusion detection of the target limb part of the target object in the video image frame includes: and under the condition that the video image frame is determined not to meet the facial expression capturing condition, carrying out occlusion detection on the target limb part of the target object in the video image frame.
In one possible embodiment, the apparatus is further configured to: determining that the video image frame meets the facial expression capturing condition under the condition that the acquisition distance meets the requirement of a preset distance; and carrying out facial expression detection on the video image frame to obtain a facial expression detection result.
In one possible implementation, the first determining unit 143 is further configured to: intercepting a first image containing the second part in the video image frame; performing edge filling on the first image to obtain a second image containing a filled area, wherein the filled area is an area for performing key point detection on the first part; and predicting the key point of the first part based on the filled area in the second image to obtain the first key point.
In one possible implementation, the first determining unit 143 is further configured to: determining attribute information of the first location, wherein the attribute information comprises: limb type information and/or limb size information; determining a filling parameter of the first image according to the attribute information; wherein the padding parameters include: padding locations and/or padding sizes; and performing edge filling on the first image based on the filling parameters to obtain the second image.
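The edge filling described in this implementation can be illustrated with the NumPy sketch below; the mapping from limb type to padding side and padding ratio is an assumption made for the example, not a mapping defined by the present disclosure.

```python
import numpy as np

PAD_RULES = {
    # limb type of the occluded first part -> which edge to fill and how much
    "left_arm":   {"side": "left",   "ratio": 0.5},
    "right_arm":  {"side": "right",  "ratio": 0.5},
    "lower_body": {"side": "bottom", "ratio": 1.0},
}

def edge_fill(first_image, limb_type):
    """Pad the first image (the sub-image of the second part) so that the filled
    band provides room to predict key points of the occluded first part."""
    h, w = first_image.shape[:2]
    rule = PAD_RULES[limb_type]
    pad = int((h if rule["side"] in ("top", "bottom") else w) * rule["ratio"])
    pads = {"top": ((pad, 0), (0, 0)), "bottom": ((0, pad), (0, 0)),
            "left": ((0, 0), (pad, 0)), "right": ((0, 0), (0, pad))}[rule["side"]]
    return np.pad(first_image, pads + ((0, 0),), mode="constant")

second_image = edge_fill(np.zeros((192, 128, 3), dtype=np.uint8), "lower_body")
print(second_image.shape)  # (384, 128, 3): the lower half is the filled area
```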
In one possible implementation, the first determining unit 143 is further configured to: determining a target frame body in the video image frame, wherein the target frame body is used for framing the frame body of the second part; and intercepting a sub-image positioned in the target frame in the video image frame to obtain the first image.
In a possible implementation, the first determining unit 143 is further configured to: determining an estimated region of the first part in the video image frame; and predicting the key point of the first part based on the estimated region to obtain the first key point.
In one possible embodiment, the apparatus is further configured to: after the attitude detection result of the target object is obtained, generating an attitude trigger signal of a virtual object corresponding to the target object according to the attitude detection result; and controlling the virtual object to execute a corresponding trigger action according to the attitude trigger signal.
In one possible embodiment, the apparatus is further configured to: determining a plurality of training samples; each training sample comprises part of target limb parts of a target object, and each training sample comprises labeling information of each key point of the target limb parts; the posture detection model to be trained is trained through the training samples to obtain a posture detection model, and the first determining unit 143 is further configured to: and predicting key points of the first part in the video image frame based on the gesture detection model to obtain first key points, and predicting key points of a second part of the target limb part contained in the video image frame based on the gesture detection model to obtain second key points.
In one possible embodiment, the apparatus is further configured to: acquiring original images of all target limb parts including a target object, and performing limb detection on the original images to obtain a plurality of key points; and performing occlusion processing on at least part of the target limb part in the original image, and determining the training samples based on the original image after the occlusion processing and the labeling information of the key points.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Corresponding to the attitude estimation method in fig. 1, an embodiment of the present disclosure further provides a computer device 1500. As shown in fig. 15, the schematic structural diagram of the computer device 1500 provided in the embodiment of the present disclosure includes:
a processor 151, a memory 152, and a bus 153. The memory 152 is used to store execution instructions and includes an internal memory 1521 and an external memory 1522. The internal memory 1521 temporarily stores operation data in the processor 151 as well as data exchanged with the external memory 1522 such as a hard disk; the processor 151 exchanges data with the external memory 1522 through the internal memory 1521. When the computer device 1500 runs, the processor 151 communicates with the memory 152 through the bus 153, so that the processor 151 executes the following instructions:
acquiring a video image frame containing a limb part of a target object;
carrying out occlusion detection on a target limb part of the target object in the video image frame;
under the condition that the first part of the target limb part is detected to be blocked, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points;
determining a pose detection result of the target object based on the first and second keypoints.
The embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the attitude estimation method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide a computer program product carrying program code, where the instructions included in the program code may be used to execute the steps of the attitude estimation method in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not described herein again.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and the apparatus described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present disclosure. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-mentioned embodiments are merely specific embodiments of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes to them, or make equivalent substitutions of some of their technical features within the technical scope of the present disclosure; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

1. An attitude estimation method, comprising:
acquiring a video image frame containing a limb part of a target object;
carrying out occlusion detection on a target limb part of the target object in the video image frame;
under the condition that the first part of the target limb part is detected to be blocked, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points;
determining a pose detection result of the target object based on the first key point and the second key point;
predicting the key point of the first part based on the video image frame to obtain a first key point, wherein the predicting the key point of the first part based on the video image frame comprises:
intercepting a first image containing the second part in the video image frame;
performing edge filling on the first image to obtain a second image containing a filled area, wherein the filled area is an area for performing key point detection on the first part;
and predicting the key point of the first part based on the filled area in the second image to obtain the first key point.
2. The method of claim 1, further comprising:
determining a corresponding acquisition distance of the video image frame; the acquisition distance is used for representing the distance between a target object and the video acquisition equipment when the video acquisition equipment acquires the video image frame;
judging whether the video image frame meets facial expression capturing conditions or not based on the acquisition distance;
the occlusion detection of the target limb part of the target object in the video image frame includes:
and under the condition that the video image frame is determined not to meet the facial expression capturing condition, carrying out occlusion detection on the target limb part of the target object in the video image frame.
3. The method of claim 2, further comprising:
determining that the video image frame meets the facial expression capturing condition under the condition that the acquisition distance meets the requirement of a preset distance; and carrying out facial expression detection on the video image frame to obtain a facial expression detection result.
4. The method of claim 1, wherein edge-filling the first image to obtain a second image comprising a filled area comprises:
determining attribute information of the first location, wherein the attribute information comprises: limb type information and/or limb size information;
determining a filling parameter of the first image according to the attribute information; wherein the padding parameters include: padding locations and/or padding sizes;
and performing edge filling on the first image based on the filling parameters to obtain the second image.
5. The method of claim 1, wherein said intercepting a first image containing the second part in the video image frame comprises:
determining a target frame body in the video image frame, wherein the target frame body is used for framing the frame body of the second part;
and intercepting a sub-image positioned in the target frame in the video image frame to obtain the first image.
6. The method of claim 1, wherein said predicting the key point of the first part based on the video image frame to obtain a first key point comprises:
determining an estimated region of the first part based on the video image frame;
and predicting the key point of the first part based on the estimated region to obtain the first key point.
7. The method of claim 1, further comprising:
after the pose detection result of the target object is obtained, generating a pose trigger signal of a virtual object corresponding to the target object according to the pose detection result;
and controlling the virtual object to execute a corresponding trigger action according to the pose trigger signal.
8. The method of claim 1, further comprising:
determining a plurality of training samples; each training sample comprises part of a target limb part of a target object, and each training sample comprises labeling information of each key point of the target limb part; training a pose detection model to be trained through the plurality of training samples to obtain a pose detection model;
wherein the predicting the key point of the first part based on the video image frame to obtain a first key point, and the determining the key point of a second part of the target limb part contained in the video image frame to obtain a second key point, comprise: predicting key points of the first part in the video image frame based on the pose detection model to obtain the first key point, and predicting key points of the second part of the target limb part contained in the video image frame based on the pose detection model to obtain the second key point.
9. The method of claim 8, wherein determining the plurality of training samples comprises:
acquiring original images of all target limb parts including a target object, and performing limb detection on the original images to obtain a plurality of key points;
and performing occlusion processing on at least part of the target limb part in the original image, and determining the training samples based on the original image after the occlusion processing and the labeling information of the key points.
10. An attitude estimation device, characterized by comprising:
the system comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a video image frame of a limb part containing a target object;
the detection unit is used for carrying out occlusion detection on the target limb part of the target object in the video image frame;
a first determining unit, configured to predict a key point of a first part of the target limb part based on the video image frame to obtain a first key point, and determine a key point of a second part of the target limb part included in the video image frame to obtain a second key point, when it is detected that the first part of the target limb part is occluded;
a second determination unit configured to determine a posture detection result of the target object based on the first key point and the second key point;
wherein the first determining unit is further configured to: intercepting a first image containing the second part in the video image frame; performing edge filling on the first image to obtain a second image containing a filled area, wherein the filled area is an area for performing key point detection on the first part; and predicting the key point of the first part based on the filled area in the second image to obtain the first key point.
11. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the computer device runs, the machine-readable instructions, when executed by the processor, performing the steps of the attitude estimation method according to any one of claims 1 to 9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the attitude estimation method according to any one of claims 1 to 9.
CN202110994673.3A 2021-08-27 2021-08-27 Attitude estimation method and device, computer equipment and storage medium Active CN113449696B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202110994673.3A CN113449696B (en) 2021-08-27 2021-08-27 Attitude estimation method and device, computer equipment and storage medium
PCT/CN2022/074929 WO2023024440A1 (en) 2021-08-27 2022-01-29 Posture estimation method and apparatus, computer device, storage medium, and program product
TW111117369A TW202309782A (en) 2021-08-27 2022-05-09 Pose estimation method, computer equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN113449696A (en) 2021-09-28
CN113449696B (en) 2021-12-07

Country Status (3)

Country Link
CN (1) CN113449696B (en)
TW (1) TW202309782A (en)
WO (1) WO2023024440A1 (en)

Also Published As

Publication number Publication date
WO2023024440A1 (en) 2023-03-02
TW202309782A (en) 2023-03-01
CN113449696A (en) 2021-09-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40056154

Country of ref document: HK