WO2023024440A1 - Posture estimation method and apparatus, computer device, storage medium, and program product


Info

Publication number
WO2023024440A1
Authority
WO
WIPO (PCT)
Prior art keywords
video image
image frame
key point
target
target object
Prior art date
Application number
PCT/CN2022/074929
Other languages
French (fr)
Chinese (zh)
Inventor
曹国良
邱丰
刘文韬
钱晨
Original Assignee
上海商汤智能科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海商汤智能科技有限公司
Publication of WO2023024440A1 publication Critical patent/WO2023024440A1/en



Classifications

    • G - Physics
    • G06 - Computing; Calculating or Counting
    • G06T - Image data processing or generation, in general
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 2207/00 - Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 - Subject of image; Context of image processing
    • G06T 2207/30196 - Human being; Person

Definitions

  • The present disclosure relates to the technical field of image processing, and in particular to a pose estimation method and apparatus, a computer device, a storage medium, and a program product.
  • In the related art, the pose of the recognized object is often captured with a webcam, but uncertain factors such as the camera's field of view and the distance between the object and the lens frequently cause parts of the object's limbs to fall outside the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the pose of the object.
  • Embodiments of the present disclosure at least provide a pose estimation method, device, computer equipment, storage medium, and program product.
  • In a first aspect, an embodiment of the present disclosure provides a pose estimation method executed by an electronic device. The method includes: acquiring a video image frame containing a body part of a target object; performing occlusion detection on a target limb part of the target object; when a first part of the target limb part is detected to be occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb part contained in the video image frame to obtain second key points; and determining a pose detection result of the target object based on the first key points and the second key points.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the method can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part that were captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that pose detection of the target object is still possible even when the video image frame does not contain the complete body part.
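  • For illustration only, the following minimal Python sketch shows this overall flow; the three helper functions are hypothetical stand-ins for the occlusion detection model and the pose detection model described here, and their names, signatures, and return values are assumptions rather than the disclosure's actual implementation.

```python
from typing import List, Tuple

import numpy as np

Keypoint = Tuple[float, float]  # (x, y) in frame coordinates

def detect_occlusion(frame: np.ndarray) -> Tuple[bool, str]:
    # Placeholder occlusion detector: reports whether a first part of the
    # target limb part is occluded, and which part it is.
    return True, "hands"  # pretend the hands are cut off by the frame edge

def predict_first_keypoints(frame: np.ndarray, part: str) -> List[Keypoint]:
    # Placeholder for predicting key points of the occluded first part;
    # note the point may lie outside the frame (below the bottom edge here).
    return [(160.0, float(frame.shape[0]) + 20.0)]

def detect_second_keypoints(frame: np.ndarray) -> List[Keypoint]:
    # Placeholder for detecting key points of the visible second part.
    return [(120.0, 80.0), (180.0, 90.0)]

def estimate_pose(frame: np.ndarray) -> List[Keypoint]:
    occluded, first_part = detect_occlusion(frame)               # occlusion detection
    second_kps = detect_second_keypoints(frame)                  # second key points
    if occluded:
        first_kps = predict_first_keypoints(frame, first_part)   # first key points
        return first_kps + second_kps                            # pose detection result
    return second_kps

pose = estimate_pose(np.zeros((240, 320, 3), dtype=np.uint8))
```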
  • In an optional implementation, the method further includes: determining a collection distance corresponding to the video image frame, where the collection distance characterizes the distance between the target object and the video capture device at the time the video image frame was captured; and judging, based on the collection distance, whether the video image frame satisfies a facial expression capture condition. Performing occlusion detection on the target limb part of the target object in the video image frame then includes: performing the occlusion detection when it is determined that the video image frame does not satisfy the facial expression capture condition.
  • In an optional implementation, the method further includes: when it is determined that the collection distance meets a preset distance requirement, determining that the video image frame satisfies the facial expression capture condition; and performing facial expression detection on the video image frame to obtain a facial expression detection result.
  • In this way, when it is judged from the collection distance that the facial expression capture condition is satisfied, facial expression capture can be performed on the anchor.
  • Otherwise, occlusion detection can be performed on the target limb part of the target object in the video image frame, and when the first part is detected to be occluded, the key points of the first part and the second part are predicted separately in the image frame. This realizes limb detection for parts not contained in the image, solving the problem in the related art that limb points outside the image cannot be predicted, and thereby alleviating the severe limb point jumping caused by unstable limb point detection in the above applications.
  • In an optional implementation, predicting the key points of the first part based on the video image frame to obtain the first key points includes: intercepting, from the video image frame, a first image containing the second part; performing edge filling on the first image to obtain a second image containing a filled area, where the filled area is the area used for key point detection of the first part; and predicting the key points of the first part in the filled area of the second image to obtain the first key points.
  • In an optional implementation, performing edge filling on the first image to obtain the second image containing the filled area includes: determining attribute information of the first part, where the attribute information includes at least one of limb type information and limb size information; determining filling parameters of the first image according to the attribute information, where the filling parameters include at least one of a filling position and a filling size; and performing edge filling on the first image based on the filling parameters to obtain the second image.
  • In this way, a targeted filling of the video image frame preserves as much image resolution as possible, which in turn ensures a higher accuracy of the pose detection result.
  • In an optional implementation, intercepting the first image containing the second part from the video image frame includes: determining a target frame in the video image frame, where the target frame is used to frame the second part; and intercepting the sub-image located within the target frame in the video image frame to obtain the first image.
  • Obtaining the second image by intercepting the sub-image within the target frame and performing edge filling on that sub-image broadens the application scenarios of the pose estimation method provided in the present disclosure, and ensures that applications based on this pose estimation can still run normally and stably in complex pose estimation scenarios.
  • In an optional implementation, predicting the key points of the first part based on the video image frame to obtain the first key points includes: determining an estimated area of the first part based on the video image frame; and predicting the key points of the first part based on the estimated area to obtain the first key points.
  • When the key points of the first part are predicted through the estimated area of the first part, the estimated area can guide the pose detection model to detect the first key points of the first part missing from the video image frame, thereby improving the accuracy of the detected key points and reducing the detection error.
  • In an optional implementation, the method further includes: after the pose detection result of the target object is obtained, generating a gesture trigger signal of a virtual object corresponding to the target object according to the pose detection result; and controlling, according to the gesture trigger signal, the virtual object to perform a corresponding trigger action.
  • In an optional implementation, the method further includes: determining a plurality of training samples, where each training sample contains part of the target limb part of the target object and includes annotation information of each key point of the target limb part; and training a pose detection model to be trained with the plurality of training samples to obtain a pose detection model. Predicting the key points of the first part based on the video image frame to obtain the first key points, and determining the key points of the second part of the target limb part contained in the video image frame to obtain the second key points, then includes: predicting, based on the pose detection model, the key points of the first part in the video image frame to obtain the first key points, and predicting, based on the pose detection model, the key points of the second part of the target limb part contained in the video image frame to obtain the second key points.
  • By training the pose detection model to be trained with the plurality of training samples, a pose detection model capable of performing pose detection on images containing only part of the target limb part of the target object can be obtained.
  • In this way, pose detection of the target object can still be performed, ensuring that the anchor application can detect body parts normally.
  • In an optional implementation, determining the plurality of training samples includes: collecting an original image containing the complete target limb part of the target object, and performing limb detection on the original image to obtain a plurality of key points; performing occlusion processing on at least part of the specified limb parts in the original image; and determining the plurality of training samples based on the occlusion-processed original image and the annotation information of the plurality of key points.
  • In a second aspect, an embodiment of the present disclosure further provides a pose estimation apparatus, including: an acquisition part configured to acquire a video image frame containing a body part of a target object; a detection part configured to perform occlusion detection on the target limb part of the target object in the video image frame; a first determination part configured to, when a first part of the target limb part is detected to be occluded, predict the key points of the first part based on the video image frame to obtain the first key points, and determine the key points of the second part of the target limb part contained in the video image frame to obtain the second key points; and a second determination part configured to determine the pose detection result of the target object based on the first key points and the second key points.
  • In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps of the first aspect, or of any possible implementation of the first aspect, are performed.
  • In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the first aspect, or of any possible implementation of the first aspect, are performed.
  • In a fifth aspect, an embodiment of the present disclosure further provides a computer program product comprising a computer program or instructions; when the computer program or instructions run on a computer, the computer executes the steps of the first aspect, or of any possible implementation of the first aspect.
  • FIG. 1 shows a flow chart of a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 2 shows a specific flow chart of predicting the key points of the first part based on the video image frame and obtaining the first key points in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 3 shows a schematic diagram of a video image frame provided by an embodiment of the present disclosure
  • FIG. 4 shows a schematic diagram of a first image containing the second part, obtained by intercepting a video image frame, provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of a detection result of key point detection of a first part and a second part in a second image provided by an embodiment of the present disclosure
  • FIG. 6 shows a specific flow chart of intercepting the first image containing the second part in the video image frame in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 7 shows a specific flow chart of performing edge padding on the first image to obtain a second image containing a padding area in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 8a shows a schematic diagram of a padding effect obtained by padding a first image to obtain a second image according to an embodiment of the present disclosure
  • Fig. 8b shows another schematic diagram of the padding effect of padding the first image to obtain the second image according to an embodiment of the present disclosure
  • FIG. 9 shows another specific flow chart of performing edge padding on the first image to obtain a second image containing a padding area in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 10 shows a specific flow chart of predicting the key points of the first part based on the video image frame and obtaining the first key points in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 11 shows a specific flow chart of a first optional manner of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure
  • FIG. 12 shows a specific flow chart of a second optional manner of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure
  • Fig. 13 shows a flow chart of another pose estimation method provided by an embodiment of the present disclosure
  • Fig. 14 shows a schematic structural diagram of a pose estimation device provided by an embodiment of the present disclosure
  • Fig. 15 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
  • During gesture recognition, the posture of the recognized object is often captured with a webcam, but uncertain factors such as the camera's field of view and the distance between the object and the lens frequently cause some limbs of the object to fall outside the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the pose of the object.
  • the present disclosure provides a pose estimation method, device, computer equipment, storage medium and program product.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the present disclosure can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that accurate pose detection of the target object is still possible when the video image frame does not contain the complete body part.
  • the execution subject of the pose estimation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities.
  • The computer device may be a live broadcast device; for example, the live broadcast device may be a smart phone, a tablet computer, a PC, or any other device capable of pose estimation.
  • the pose estimation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • As shown in FIG. 1, a flowchart of a pose estimation method provided by an embodiment of the present disclosure, the method includes steps S101 to S107, wherein:
  • S101 Acquire a video image frame including a body part of a target object.
  • the video image frame including the body parts of the target object may be acquired through the camera device of the computer equipment.
  • the limb parts included in the video image frame may be the whole body limb parts or the half body limb parts of the target object.
  • the body parts of the half body may include the following body parts: body parts above the waist of the target object (head, upper torso, arms, hands).
  • the technical solution of the present disclosure can be used in a live broadcast scene
  • the above-mentioned computer device can be a device capable of installing a live broadcast application program.
  • the above-mentioned target object may be the anchor
  • The acquired video image frame may be a video image frame containing the anchor's body parts, captured from the anchor during the live broadcast.
  • it can also be applied to other video playing scenarios.
  • S103 Perform occlusion detection on the target body part of the target object in the video image frame.
  • an occlusion detection model may be used to perform occlusion detection on the target body part of the target object in the video image frame.
  • the target body part may be all the body parts of the target object, and may also be a body part of the target object, which is not limited in the present disclosure.
  • The occlusion detection result is used to characterize the integrity of the target body part.
  • The integrity result is either complete or incomplete.
  • When the result is incomplete, the occlusion detection result may further include part information of the occluded part (i.e., the first part).
  • For example, suppose the target body part is the half-body part of the target object and the video image frame does not contain the target object's hands.
  • In this case, the occlusion detection result indicates that the target body part in the video image frame is incomplete and that the occluded part of the target body part is the hands of the target subject.
  • S105 When the first part of the target body part is detected to be occluded, the pose detection model can be used to predict the key points of the occluded first part and to predict the key points of the second part of the target body part contained in the video image.
  • In an optional implementation, edge filling can be performed on the video image frame, and the edge-filled video image frame is processed by the pose detection model, so that the key points of the first part are predicted in the filled area.
  • The pose detection model can also perform key point detection on the second part contained in the video image frame to obtain the second key points.
  • the target body part may contain multiple sub-parts.
  • the target body part may be the upper body part of the target subject.
  • For example, the target body part may include the following sub-parts: head, upper torso, arms, and hands.
  • the first part may be a complete sub-part, or a partial part of a sub-part.
  • For example, the first part may be both hands, indicating that the part occluded in the video image frame is the hands; the first part may also be the fingers, indicating that the part occluded in the video image frame is the fingers of both hands.
  • the following steps may also be performed:
  • Determine the acquisition moment of the video image frame, calculate the distance between the target object and the camera device of the computer equipment at that moment, and obtain the collection distance; then compare the collection distance with a first distance threshold and a second distance threshold. If the comparison shows that the collection distance is smaller than the first distance threshold, or greater than the second distance threshold, it is preliminarily determined that occlusion detection does not need to be performed on the video image frame. If the collection distance is greater than or equal to the first distance threshold and less than or equal to the second distance threshold, occlusion detection needs to be performed on the target body part in the video image frame.
  • the second distance threshold is greater than the first distance threshold.
  • the first distance threshold and the second distance threshold may be distance thresholds selected in advance according to experience, and may also be distance thresholds preset in the computer device of the target object.
  • The first distance threshold characterizes the distance between the target object and the camera at which the video image frame no longer contains the parts below the target object's head, or contains too few of them for pose detection.
  • The second distance threshold characterizes the distance between the target object and the camera device at which the complete target body part is contained in the video image frame.
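  • A small sketch of this gating logic is given below, assuming illustrative threshold values in metres; the disclosure only requires that the second distance threshold be greater than the first.

```python
# Assumed example thresholds; the disclosure does not fix concrete values.
FIRST_DISTANCE_THRESHOLD = 0.5   # below this, the face fills the frame
SECOND_DISTANCE_THRESHOLD = 2.5  # beyond this, the full target body part is visible

def satisfies_face_capture_condition(collection_distance: float) -> bool:
    # Facial expression capture applies when the subject is close to the camera.
    return collection_distance < FIRST_DISTANCE_THRESHOLD

def needs_occlusion_detection(collection_distance: float) -> bool:
    # Occlusion detection is only needed between the two thresholds.
    return FIRST_DISTANCE_THRESHOLD <= collection_distance <= SECOND_DISTANCE_THRESHOLD

print(needs_occlusion_detection(1.2))  # True: run occlusion detection
```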
  • S107 Determine a pose detection result of the target object based on the first key point and the second key point.
  • The pose detection result of the target object can be determined based on the first key points and the second key points.
  • the method also includes the following steps:
  • The collection distance is used to characterize the distance between the target object and the video collection device when the video collection device collects the video image frame; based on the collection distance, it is judged whether the above-mentioned video image frame satisfies the facial expression capture condition.
  • In step S103, when it is determined that the video image frame does not satisfy the facial expression capture condition, occlusion detection is performed on the target body part of the target object in the video image frame.
  • facial expression capture of the target object can also be performed.
  • The collection distance is the distance between the target object and the video capture device (for example, a camera of the computer device) at the moment the video image frame is collected.
  • In an optional implementation, the collection distance may be compared with the above-mentioned first distance threshold. If the collection distance is less than the first distance threshold, it is determined that the preset distance requirement is met, that is, the video image frame satisfies the facial expression capture condition; at this time, facial expression capture can be performed on the video image frame.
  • occlusion detection may be performed on the target body part of the target object in the video image frame.
  • When the first part is occluded, the key points of the first part can be predicted based on the video image frame to obtain the first key points, and the key points of the second part of the target limb part contained in the video image frame can be determined to obtain the second key points.
  • The pose detection result of the target object is then determined based on the first key points and the second key points.
  • The single-mode technical solution in the related art can be understood as follows: when the image contains a complete face, the application can only capture facial expressions; when the image contains the body parts to be detected, the application can only capture the body. When the application can neither capture facial expressions nor find the complete body parts to be detected in the image, it cannot run its facial capture and body detection functions normally and stably.
  • For example, the above-mentioned application program is virtual live broadcast software.
  • In the virtual live broadcast scenario, the anchor's body movements are often captured and recognized with a webcam, but uncertain factors such as the camera's field of view and the distance between the anchor and the lens interfere with this.
  • When the anchor is very close to the camera, part of the body extends beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame).
  • In that case, the virtual live broadcast software in the related art cannot perform limb detection normally.
  • With the technical solution of the present disclosure, a smooth transition between facial expression capture and body part capture can also be realized, thereby improving the robustness of the application program so that it can run stably.
  • Step S105, predicting the key points of the first part based on the video image frame to obtain the first key points, includes the following process:
  • S1051 Intercept a first image including the second part from the video image frame.
  • S1052 Perform edge padding on the first image to obtain a second image including a padding area, where the padding area is an area used to perform key point detection on the first part.
  • S1053 Predict the key point of the first part based on the filled area in the second image to obtain the first key point.
  • The image shown in FIG. 3 is a video image frame containing the upper-body parts of the target object, collected in step S101 above. As can be seen from FIG. 3, part of the target object's fingers (i.e., the above-mentioned first part) is occluded. At this time, the first image containing the second part is intercepted from the video image frame, yielding the first image shown in FIG. 4.
  • edge filling may be performed on the first image, so as to obtain a second image including the filled area.
  • Pose detection can then be performed on the second image through the pose detection model, so as to predict the key points of the first part in the filled area to obtain the first key points, and to predict the key points of the second part in the areas of the second image other than the filled area to obtain the second key points.
  • In the second image, the black image area is the above-mentioned filled area.
  • As shown in FIG. 5, edge filling may be performed on the first image with a black image.
  • Here, edge filling can be understood as filling the edges of the first image according to the position, within the video image frame, of the occluded first part, so as to obtain an area in which key point detection can be performed for the occluded first part.
  • The filled area is used to predict the key points, so that when the first part of the target body part in the video image frame is occluded, the key points of the second part collected in the video image frame can be detected, and the key points of the first part outside the video image frame can be predicted.
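  • A sketch of steps S1051 to S1053 using OpenCV is shown below; the target frame coordinates and the padding amount are illustrative assumptions, since in the disclosure they come from the detected second part and from the attribute information of the missing first part.

```python
import cv2
import numpy as np

def crop_and_pad(frame: np.ndarray, box, pad_bottom: int) -> np.ndarray:
    x0, y0, x1, y1 = box                # target frame around the second part
    first_image = frame[y0:y1, x0:x1]   # S1051: intercept the first image
    # S1052: edge-fill with a black background to create the filled area
    second_image = cv2.copyMakeBorder(
        first_image, 0, pad_bottom, 0, 0,
        borderType=cv2.BORDER_CONSTANT, value=(0, 0, 0))
    return second_image                 # S1053 runs the pose model on this image

frame = np.zeros((480, 640, 3), dtype=np.uint8)
second_image = crop_and_pad(frame, box=(100, 50, 540, 480), pad_bottom=120)
```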
  • S1051 Intercepting the first image containing the second part in the video image frame includes the following process:
  • S601 Determine a target frame in the video image frame, where the target frame is used to frame the second part.
  • S602 Intercept a sub-image within the target frame in the video image frame to obtain the first image.
  • Here, the occlusion of the first part of the target object in the video image frame can be understood in two ways: the first part is truncated by the edge of the image so that it is occluded, or the first part is blocked by another object in the video image frame so that it is occluded.
  • In an optional implementation, when the first part of the target body part is not detected in the video image frame and the first part is detected not to be at the edge of the video image frame, it may be determined that the first part is occluded by another object in the video image frame.
  • a target frame for framing the second part may be determined in the video image frame. Then, the sub-image located in the target frame in the video image frame is intercepted to obtain the first image.
  • Alternatively, edge filling can be performed directly on the video image frame, thereby obtaining a second image containing the filled area.
  • The process of performing edge filling on the video image frame is the same as the process of performing edge filling on the first image.
  • In the following, edge filling on the first image is described as an example.
  • Because the second image is obtained by intercepting the sub-image within the target frame and performing edge filling on that sub-image, the application scenarios of the pose estimation method provided in the present disclosure are expanded, and applications based on this pose estimation can still run normally and stably in complex pose estimation scenarios.
  • In an optional implementation, edge filling of the first image can be realized by filling a black image around the first image.
  • In addition to this edge filling method, the following method may also be chosen to fill the first image and obtain the second image:
  • Position information of an occluder that occludes the first part is determined in the video image frame; the image at that position in the video image frame is then replaced with a background image of a preset color, for example a black background image.
  • The preset color can be determined based on the training samples of the pose detection model, which will be described in detail below.
  • S701 Determine attribute information of the first part, where the attribute information includes: limb type information and/or limb size information.
  • S702 Determine a padding parameter of the first image according to the attribute information; wherein, the padding parameter includes: a padding position and/or a padding size.
  • S703 Perform edge padding on the first image based on the padding parameters to obtain the second image.
  • the above-mentioned first part can be understood as a body part that is missing in the video image frame and for which the pose detection model needs to perform pose detection.
  • For example, the pose detection model needs to detect the anchor's upper-body parts, but part of the hands is missing from the video image frame; in this case, the first part is the missing hand part.
  • The limb type information indicates the type of the first part missing from the video image frame, for example that the missing first part is a hand.
  • The limb size information indicates the size (or scale) information of the first part that is missing from the video image frame.
  • the positional relationship of the first part relative to the first image can be estimated.
  • The filling position and/or filling size of the first image can be determined based on the attribute information, and the first image can then be filled accordingly to obtain the second image.
  • In an optional implementation, the positional relationship of the first part relative to the first image may be determined based on the limb type information; for example, the first part should be located at the bottom edge of the first image.
  • the filling position may be determined based on the position information, for example, the bottom edge of the first image may be determined as the filling position.
  • the padding size of the first image may also be determined based on the limb size information, for example, the limb size information may be determined as the padding size.
  • For example, when the missing first part in the video image frame is a hand part, edge filling can be performed on the lower edge of the first image.
  • the lower edge of the video image frame may be filled with an image of a black background.
  • edge padding may be performed on the bottom edge, left edge, and right edge of the first image.
  • the bottom edge, left edge, and right edge of the first image may be filled with an image of a black background, respectively.
  • background images of other colors can also be selected to be filled, which is not limited in the present disclosure.
  • It should be noted that the second image after the padding process needs to be resized to a preset size, so filling more space reduces the resolution of the image region corresponding to the target object's limb parts in the resulting image.
  • Therefore, filling the video image frame in a targeted manner preserves as much image resolution as possible, which is beneficial for obtaining pose detection results with higher accuracy.
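  • A hypothetical mapping from the attribute information of the missing first part to the filling parameters of steps S701 to S703 is sketched below; the rule-table entries and sizes are examples only, not values taken from the disclosure.

```python
from typing import Optional, Tuple

# Assumed rule table: limb type -> (filling position, default filling size in px).
PADDING_RULES = {
    "hand": ("bottom", 120),
    "forearm": ("bottom", 160),
    "head": ("top", 100),
}

def padding_params(limb_type: str,
                   limb_size_px: Optional[int] = None) -> Tuple[str, int]:
    # S702: derive the filling parameters from the attribute information;
    # the limb size information, when known, overrides the default size.
    position, default_size = PADDING_RULES[limb_type]
    size = limb_size_px if limb_size_px is not None else default_size
    return position, size

print(padding_params("hand"))       # ('bottom', 120)
print(padding_params("hand", 95))   # ('bottom', 95)
```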
  • the limb type information of the first part may be determined in the following manner, including:
  • Method 1: the limb type information of the first part missing from the video image frame can be estimated according to the estimated distance between the target object and the camera device.
  • the distance between the target object and the camera device may be predicted by the distance measurement model, and the missing limb type information of the first part may be output by the distance measurement model.
  • Method 2: intercept the first image containing the second part from the video image frame through the target frame, and then recognize the first image to obtain the limb type information of the first part missing from the video image frame.
  • the limb size information of the first part may be determined in the following manner, including:
  • For example, actual length information of the target object is acquired first; this may be the actual height of the target object, or the actual length of any complete target limb part of the target object. Then, the in-image length of a complete designated limb part contained in the video image frame is determined, and the above-mentioned limb size information is estimated from that length information and the actual length information.
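  • A worked sketch of this scale-based estimate follows, under the assumption that pixel extent scales linearly with real-world length; all numeric values are illustrative.

```python
def estimate_missing_limb_px(ref_len_px: float, ref_len_actual: float,
                             missing_len_actual: float) -> float:
    # Pixels-per-unit scale from a complete reference limb visible in the frame.
    scale = ref_len_px / ref_len_actual
    return missing_len_actual * scale  # expected pixel extent of the first part

# e.g. a 0.30 m forearm spans 150 px, so a 0.19 m hand spans about 95 px
pad_px = estimate_missing_limb_px(150.0, 0.30, 0.19)
print(round(pad_px))  # 95
```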
  • In addition to performing edge filling on the first image in the manner described above to obtain the second image, the first image may also be edge-filled in the following ways, including:
  • Each image edge of the video image frame may be filled separately, and at this time, the padding size of each image edge may be a preset size.
  • The filling size used to fill an image edge can be a preset size, or it may be the limb size information determined above.
  • the method further includes the following process:
  • S901 Acquire scene type information of the video image frame.
  • S902 Adjust the padding parameters according to the scene type information, and pad the video image frames based on the adjusted padding parameters to obtain the second image.
  • In an optional implementation, the scene type information corresponding to the video image frame may be determined; for example, the scene type information may be a live sales (product showcase) scene, a game commentary scene, a performance scene, and the like.
  • In different scenes, the resolution requirements for the images input into the pose detection model may differ. Therefore, after the scene type information is determined, the image resolution matching the scene type information may be determined, and the padding parameters adjusted according to that image resolution.
  • For example, if a scene requires a higher image resolution, the padding size may be adaptively reduced so as to preserve the resolution of the second image obtained through the padding process.
  • If a scene's resolution requirement is lower, the padding size can be increased adaptively, or the original padding size can be kept unchanged.
  • It should be noted that, however the padding parameters are adjusted, the second image obtained after padding must include the area used for pose detection of the first part.
  • In this way, the padding parameters are adjusted through the scene type information of the video image frame, so that the video image frame is expanded according to the adjusted padding parameters and the key points of the complete target body part can be detected.
  • At the same time, the image resolution should be kept as high as possible, which is conducive to obtaining a more accurate pose detection result.
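  • A hypothetical sketch of this scene-aware adjustment (S901 and S902) is given below; the scene names and resolution values are assumptions for illustration only.

```python
# Assumed per-scene input resolutions; the disclosure leaves these open.
SCENE_RESOLUTION = {
    "game_commentary": 256,
    "performance": 384,
    "live_sales": 512,
}

def adjust_padding(base_pad: int, scene: str, default_res: int = 384) -> int:
    required = SCENE_RESOLUTION.get(scene, default_res)
    if required > default_res:
        return max(base_pad // 2, 16)  # high-resolution scene: pad less
    return base_pad                    # otherwise keep (or grow) the padding

pad = adjust_padding(120, "live_sales")  # -> 60
```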
  • step S105 predicting the key point of the first part based on the video image frame to obtain the first key point also includes the following process:
  • S1001 Determine an estimated area of the first part based on the video image frame.
  • S1002 Predict key points of the first part based on the estimated area, to obtain the first key points.
  • the estimated area of the first part may be determined through the following steps, including:
  • The estimated area of the first part is an area in the second image used for estimating the key points of the first part.
  • the estimation area may be a rectangular area, and may also be a circular area, and the disclosure does not limit the shape of the estimation area.
  • the estimated area of the first part may also be determined through the following steps, including:
  • S1101 Determine limb type information corresponding to the first part; and determine a target part associated with the first part among target limb parts.
  • the first part and the target part may be detection parts having a linkage relationship.
  • the first part moves under the drive of the target part, or the target part moves under the drive of the first part.
  • For example, when the first part is a hand, the target part may be the wrist.
  • Alternatively, the first part may be the wrist and the target part the forearm. Further combinations are not listed here.
  • S1102 Based on the limb type information of the first part and the limb type information of the target part, determine a position constraint between the first part and the target part. Wherein, the position constraint is used to constrain the position difference between the position of the first part in the second image and the position of the target part in the second image.
  • the second image is an image obtained after edge filling is performed on the first image in the above embodiment, and the first image is a sub-image including the second part in the video image frame.
  • S1103 Determine an estimated area of the first part in the second image based on the position constraint.
  • Using position constraints to determine the estimated area of the first part in the second image can reduce the phenomenon of large position differences between the first part and the target part, thereby improving the processing accuracy of the pose detection model.
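  • A sketch of S1101 to S1103 follows, under the assumption that the position constraint can be expressed as a fixed offset and search extent around the target part (for example, a hand expected a fixed distance below the wrist); the offset and size values are illustrative, not taken from the disclosure.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def estimate_first_part_area(target_xy: Tuple[float, float],
                             offset: Tuple[float, float] = (0.0, 60.0),
                             half_size: float = 40.0) -> Box:
    # Constrain the estimated area to lie within `offset` of the target part,
    # e.g. the hand is expected a fixed distance below the wrist.
    cx, cy = target_xy[0] + offset[0], target_xy[1] + offset[1]
    return (cx - half_size, cy - half_size, cx + half_size, cy + half_size)

area = estimate_first_part_area((320.0, 400.0))  # wrist -> expected hand area
```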
  • After the estimated area is determined, the pose detection model can perform pose detection on the second image annotated with the estimated area to obtain a pose detection result.
  • the posture detection result includes complete annotation information of key points of the target body part, wherein the annotation information includes: position information and category information.
  • In this way, the estimated area can guide the pose detection model to detect the first key points of the first part missing from the video image frame, thereby improving the accuracy of the detected key points and reducing the detection error.
  • the estimated area of the first part can also be determined through the following steps, including the following process:
  • S1201 Determine a target video image in the historical video images corresponding to the target object, wherein the similarity between the target video image and the video image frame meets a preset requirement, and the target video image contains the first part.
  • S1202 Determine the estimation area according to the position information of the first part included in the target video image.
  • In an optional implementation, the historical video images (for example, historical live images) of the target object are first acquired from the cache folder.
  • the cache folder is used to store a video image frame including a complete specified body part, and a pose detection result corresponding to the video image frame.
  • the historical video images can be screened to obtain the target video images.
  • the screening process is described as follows:
  • a feature distance between each historical video image and the video image frame is calculated, wherein the feature distance is used to characterize the similarity between the historical video image and the video image frame.
  • the target video image whose similarity with the video image frame meets the preset requirement is selected from the historical video image.
  • meeting the preset requirement can be understood as: the feature distance is greater than or equal to the preset distance threshold.
  • the estimation area can be determined according to the position information of the first part included in the target video image. For example, the location information of the first part included in the target video image may be determined as the estimated area.
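  • An illustrative screening of cached historical frames for S1201 and S1202 is sketched below, using cosine similarity over assumed per-frame feature vectors as the feature distance; the cache layout and threshold are assumptions.

```python
import numpy as np

def select_target_video_image(query_feat: np.ndarray, cache, thresh: float = 0.8):
    # cache: list of (feature_vector, first_part_box) tuples from frames that
    # contained the complete first part; returns the best-matching box or None.
    best_sim, best_box = thresh, None
    for feat, box in cache:
        sim = float(np.dot(query_feat, feat) /
                    (np.linalg.norm(query_feat) * np.linalg.norm(feat)))
        if sim >= best_sim:
            best_sim, best_box = sim, box
    return best_box

cache = [(np.array([1.0, 0.0]), (10, 20, 60, 80))]
box = select_target_video_image(np.array([0.9, 0.1]), cache)
```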
  • In an optional implementation, after the pose detection result of the target object is obtained, the method further includes the following process:
  • S1301 Generate a gesture trigger signal of the virtual object corresponding to the target object according to the pose detection result.
  • S1302 Control the virtual object to perform a corresponding trigger action according to the gesture trigger signal.
  • In an optional implementation, the gesture trigger signal of the virtual object may be generated according to the key points of the target body part of the target object in the pose detection result.
  • That is, a gesture trigger signal for triggering the virtual object to perform a corresponding trigger action is generated according to the pose detection result.
  • the trigger signal is used to indicate the position information of the key points of each virtual limb of the virtual object in the video image frame.
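  • A minimal, hypothetical sketch of generating the gesture trigger signal follows: the detected key points are packaged into a signal addressed to the matching virtual limb key points of the avatar. The identity mapping is purely illustrative; a real system would retarget to the avatar's skeleton.

```python
from typing import Dict, Tuple

Keypoint = Tuple[float, float]

def make_trigger_signal(pose_result: Dict[str, Keypoint]) -> Dict[str, Keypoint]:
    # Map each detected body key point to the matching virtual limb key point.
    return {f"virtual_{name}": xy for name, xy in pose_result.items()}

signal = make_trigger_signal({"wrist_l": (320.0, 400.0), "elbow_l": (300.0, 320.0)})
```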
  • The image of the above-mentioned virtual object is a preset virtual image (for example, a virtual anchor), where the preset image includes at least one of the following: a three-dimensional humanoid mimicry (the humanoid mimicry may be a person or a humanoid figure such as an alien), a three-dimensional animal mimicry (such as a dinosaur or a pet cat), a two-dimensional character mimicry, a two-dimensional animal mimicry, and the like.
  • the method also includes the following process:
  • each of the training samples includes part of the target body part of the target object, and each of the training samples includes label information of each key point of the target body part.
  • The pose detection model to be trained is trained with the plurality of training samples to obtain the trained pose detection model.
  • In an optional implementation, the multiple training samples are obtained first and then input into the pose detection model to be trained, so that it is trained.
  • By training the pose detection model to be trained with the plurality of training samples, a pose detection model capable of performing pose detection on images containing only part of the target body parts of the target object can be obtained.
  • In this way, pose detection of the target object can still be performed, so that the anchor application can run normally and stably.
  • determining a plurality of training samples includes the following process:
  • the original image including all target body parts of the target object is collected, and body detection is performed on the original image to obtain multiple key points.
  • After the multiple key points are obtained, occlusion processing can be performed on at least part of the target body parts in the original image, and the plurality of training samples can be determined based on the occlusion-processed original image and the annotation information of the multiple key points.
  • the original image including all target body parts is acquired.
  • the target limb part may be the upper body limb part of the human body
  • the original image must at least include the complete upper body limb part, for example, the whole body limb part may be included.
  • limb detection may be performed on the original image to obtain multiple key points, wherein the multiple key points include multiple key points of the target body part.
  • occlusion processing may be performed on the original image to obtain an original image after occlusion processing, wherein the original image after occlusion processing contains incomplete target body parts.
  • the original image after occlusion processing and the key points of the target body part determined in the above process can be determined as a training sample.
  • performing occlusion processing on at least part of the target body parts in the original image includes the following process:
  • In an optional implementation, a background image of a preset color can be used to occlude at least part of the target body parts to obtain the occlusion-processed original image; alternatively, at least part of the target body parts in the original image can be cropped out to obtain the occlusion-processed original image.
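  • A sketch of this sample construction is given below, assuming the occlusion is a preset-colour rectangle; the image, key points, and box are placeholder values, not data from the disclosure.

```python
import numpy as np

def make_training_sample(original: np.ndarray, keypoints, occlude_box,
                         colour=(0, 0, 0)):
    x0, y0, x1, y1 = occlude_box
    occluded = original.copy()
    occluded[y0:y1, x0:x1] = colour  # occlusion with a preset-colour background
    return occluded, keypoints       # annotation still covers the hidden part

img = np.full((480, 640, 3), 255, dtype=np.uint8)
sample, labels = make_training_sample(img, [(320, 240)], (0, 400, 640, 480))
```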
  • This disclosure relates to the field of augmented reality, in which Augmented Reality (AR) effects are combined with reality.
  • the target object may involve faces, limbs, gestures, actions, etc. related to the human body.
  • Applications can involve not only interactive scenes related to real scenes or objects, such as tours, navigation, explanation, reconstruction, and virtual effect overlay and display, but also special effects processing related to people, such as makeup beautification, body beautification, special effect display, and virtual model display.
  • the relevant features, states and attributes of the target object can be detected or identified through the convolutional neural network.
  • An embodiment of the present disclosure also provides a pose estimation apparatus corresponding to the pose estimation method. Since the problem-solving principle of the apparatus is similar to that of the pose estimation method described above, the implementation of the apparatus may refer to the implementation of the method.
  • the device includes: an acquisition part 141, a detection part 142, a first determination part 143, and a second determination part 144; wherein,
  • the acquiring part 141 is configured to acquire a video image frame including a body part of a target object
  • the detection part 142 is configured to perform occlusion detection on the target limb part of the target object in the video image frame;
  • the first determining part 143 is configured to predict the key point of the first part based on the video image frame to obtain the first key point when it is detected that the first part of the target body part is blocked, and Determining the key points of the second part of the target limb part contained in the video image frame to obtain the second key points;
  • the second determining part 144 is configured to determine the gesture detection result of the target object based on the first key point and the second key point.
  • By predicting the key points of the occluded first part and detecting the key points of the unoccluded second part, the apparatus can, when the target limb part is partially occluded in the video image frame, detect the key points of the second part captured in the frame and predict the key points of the first part lying outside it.
  • This yields reasonable, stable key point positions without jumping, so that pose detection of the target object is still possible even when the video image frame does not contain the complete body part.
  • In a possible implementation, the apparatus is further configured to: determine the collection distance corresponding to the video image frame, where the collection distance characterizes the distance between the target object and the video capture device when the video image frame is collected; judge, based on the collection distance, whether the video image frame satisfies the facial expression capture condition; and perform occlusion detection on the target body part of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial expression capture condition.
  • In a possible implementation, the apparatus is further configured to: determine that the video image frame satisfies the facial expression capture condition when it is determined that the collection distance meets the preset distance requirement; and perform facial expression detection on the video image frame to obtain a facial expression detection result.
  • In a possible implementation, the first determining part 143 is further configured to: intercept the first image containing the second part from the video image frame; perform edge padding on the first image to obtain the second image containing the padded area, where the padded area is the area used to perform key point detection on the first part; and predict the key points of the first part based on the padded area in the second image to obtain the first key points.
  • the first determining part 143 is further configured to: determine attribute information of the first part, wherein the attribute information includes: limb type information and/or limb size information; according to the The attribute information determines the padding parameters of the first image; wherein the padding parameters include: a padding position and/or a padding size; edge padding is performed on the first image based on the padding parameters to obtain the second image.
  • In a possible implementation, the first determining part 143 is further configured to: determine a target frame in the video image frame, where the target frame is used to frame the second part; and intercept the sub-image within the target frame in the video image frame to obtain the first image.
  • In a possible implementation, the first determining part 143 is further configured to: determine the estimated area of the first part in the video image frame; and predict the key points of the first part based on the estimated area to obtain the first key points.
  • In a possible implementation, the apparatus is further configured to: after the pose detection result of the target object is obtained, generate a gesture trigger signal of the virtual object corresponding to the target object according to the pose detection result; and control, according to the gesture trigger signal, the virtual object to perform a corresponding trigger action.
  • In a possible implementation, the apparatus is further configured to: determine a plurality of training samples, where each training sample contains part of the target limb part of the target object and includes annotation information of each key point of the target limb part; and train the pose detection model to be trained with the plurality of training samples to obtain the pose detection model. The first determining part 143 is further configured to: predict, based on the pose detection model, the key points of the first part in the video image frame to obtain the first key points, and predict, based on the pose detection model, the key points of the second part of the target limb part contained in the video image frame to obtain the second key points.
  • In a possible implementation, the apparatus is further configured to: collect an original image containing the complete target body part of the target object, and perform limb detection on the original image to obtain multiple key points; perform occlusion processing on at least part of the target body parts in the original image; and determine the plurality of training samples based on the occlusion-processed original image and the annotation information of the multiple key points.
  • the embodiment of the present disclosure also provides a computer device 1500, as shown in FIG. 15, which is a schematic structural diagram of the computer device 1500 provided by the embodiment of the present disclosure, including:
  • a processor 151, a memory 152, and a bus 153. The memory 152 is used for storing execution instructions and includes an internal memory 1521 and an external memory 1522; the internal memory 1521 is used for temporarily storing operation data in the processor 151 and data exchanged with the external memory 1522 such as a hard disk. The processor 151 exchanges data with the external memory 1522 through the internal memory 1521. When the computer device 1500 runs, the processor 151 communicates with the memory 152 through the bus 153, so that the processor 151 executes the instructions of the pose estimation method described in the above method embodiments.
  • Embodiments of the present disclosure further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is run by a processor, the steps of the pose estimation method described in the above-mentioned method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • An embodiment of the present disclosure also provides a computer program product, the computer program product includes a computer program or an instruction, and when the computer program or instruction is run on a computer, the computer executes the method described in the above method embodiment.
  • For the steps of the pose estimation method, reference may be made to the foregoing method embodiments.
  • the above-mentioned computer program product may be realized by hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK) and the like.
  • the disclosed devices and methods may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • the division of the units is a logical function division.
  • For example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some communication interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
• Each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
• If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
• The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned computer-readable storage medium may be a tangible device capable of retaining and storing instructions used by an instruction execution device, and may be a volatile storage medium or a nonvolatile storage medium.
  • a computer readable storage medium may be, for example but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the above.
• A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random-access memory (SRAM), a portable compact disk read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punched card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
• The present disclosure provides a pose estimation method, device, computer equipment, storage medium, and program product, wherein the method includes: acquiring a video image frame containing a body part of a target object; performing occlusion detection on the target body part of the target object in the video image frame; in the case that a first part of the target body part is detected to be occluded, predicting the key point of the first part based on the video image frame to obtain a first key point, and determining the key point of the second part of the target body part contained in the video image frame to obtain a second key point; and determining a pose detection result of the target object based on the first key point and the second key point.
• In the present disclosure, by predicting the key points of the occluded first part and the key points of the unoccluded second part, both the key points of the second part captured within the video image frame and the key points of the first part outside the frame can be predicted when the target body part is occluded, yielding reasonable, stable, non-jumping key point positions; pose detection of the target object therefore remains possible even when the video image frame does not contain complete body parts.

Abstract

Provided in the present disclosure are a posture estimation method and apparatus, a computer device, and a storage medium. The method comprises: acquiring a video image frame which contains a body part of a target object; performing occlusion detection on a target body part of the target object in the video image frame; when it is detected that a first part of the target body part is occluded, predicting a key point of the first part on the basis of the video image frame so as to obtain a first key point, and determining a key point of a second part of the target body part contained in the video image frame so as to obtain a second key point; and determining a posture detection result of the target object on the basis of the first key point and the second key point.

Description

A pose estimation method, device, computer equipment, storage medium, and program product
Cross-Reference to Related Applications
The present disclosure is based on, and claims priority to, the Chinese patent application with application number 202110994673.3, filed on August 27, 2021 and entitled "A Pose Estimation Method, Device, Computer Equipment, and Storage Medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of image processing, and relates to a pose estimation method, device, computer equipment, storage medium, and program product.
Background
In related pose capture solutions, the pose of the recognized object is often captured using a webcam, but due to uncertain factors such as the camera's field-of-view specification and the distance between the object and the lens, parts of the object's limbs often extend beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the object's pose.
Summary
Embodiments of the present disclosure provide at least a pose estimation method, device, computer equipment, storage medium, and program product.
In a first aspect, an embodiment of the present disclosure provides a pose estimation method, executed by an electronic device, the method including: acquiring a video image frame containing a body part of a target object; performing occlusion detection on a target body part of the target object in the video image frame; in the case that a first part of the target body part is detected to be occluded, predicting a key point of the first part based on the video image frame to obtain a first key point, and determining a key point of a second part of the target body part contained in the video image frame to obtain a second key point; and determining a pose detection result of the target object based on the first key point and the second key point.
In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and the occluded first part is detected in it, predicting the key points of the occluded first part while also predicting the key points of the unoccluded second part makes it possible, when the target body part is occluded, to predict both the key points of the second part captured within the frame and the key points of the first part lying outside the frame. Reasonable, stable, non-jumping key point positions can thus be predicted, so that pose detection of the target object remains possible even when the video image frame does not contain the complete body part.
In an optional implementation, the method further includes: determining a capture distance corresponding to the video image frame, where the capture distance characterizes the distance between the target object and the video capture device when the video capture device captures the video image frame; and judging, based on the capture distance, whether the video image frame satisfies a facial expression capture condition. Performing occlusion detection on the target body part of the target object in the video image frame includes: performing occlusion detection on the target body part of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial expression capture condition.
In an optional implementation, the method further includes: when it is determined that the capture distance meets a preset distance requirement, determining that the video image frame satisfies the facial expression capture condition; and performing facial expression detection on the video image frame to obtain a facial expression detection result.
In the above implementations, facial expression capture can be performed on the anchor when it is judged, based on the capture distance, that the facial expression capture condition is satisfied. When it is judged that the condition is not satisfied, occlusion detection can be performed on the target body part of the target object in the video image frame, and when the first part is detected to be occluded, separately predicting the key points of the first part and the second part enables body detection for body parts not contained in the image. This solves the problem in the related art that body points outside the image cannot be predicted, and alleviates the severe body-point jumping caused by unstable body-point detection when such applications perform body detection.
In an optional implementation, predicting the key point of the first part based on the video image frame to obtain the first key point includes: cropping, from the video image frame, a first image containing the second part; performing edge padding on the first image to obtain a second image containing a padded area, where the padded area is the area used to perform key point detection for the first part; and predicting the key point of the first part based on the padded area in the second image to obtain the first key point.
In the above implementation, by cropping the first image containing the second part from the video image frame and performing edge padding on it to obtain the second image, the key points of the occluded first part can be predicted through the second image. Thus, when the first part of the target body part is occluded in the video image frame, both the key points of the second part captured within the frame and the key points of the first part outside the frame can be predicted.
In an optional implementation, performing edge padding on the first image to obtain the second image containing the padded area includes: determining attribute information of the first part, where the attribute information includes at least one of limb type information and limb size information; determining padding parameters of the first image according to the attribute information, where the padding parameters include at least one of a padding position and a padding size; and performing edge padding on the first image based on the padding parameters to obtain the second image.
In the above implementation, by determining the padding parameters according to the attribute information of the first part and padding the video image frame according to those parameters to obtain the second image, the image resolution can be kept as large as possible while the frame is padded, which helps ensure pose detection results of high accuracy.
In an optional implementation, cropping the first image containing the second part from the video image frame includes: determining a target box in the video image frame, where the target box is a box used to frame the second part; and cropping the sub-image located within the target box in the video image frame to obtain the first image.
In the above implementation, when the first part of the target body part of the target object is occluded, obtaining the second image by cropping the sub-image within the target box and performing edge padding on that sub-image broadens the application scenarios of the pose estimation method provided in the present disclosure; even in complex pose estimation scenarios, applications based on this pose estimation can still run normally and stably.
In an optional implementation, predicting the key point of the first part based on the video image frame to obtain the first key point includes: determining an estimated area of the first part based on the video image frame; and predicting the key point of the first part based on the estimated area to obtain the first key point.
In the above implementation, when the key points of the first part are predicted through its estimated area, the estimated area can guide the pose detection model in detecting the first key points of the first part missing from the video image frame, improving the accuracy of the detected key points and reducing the detection error.
In an optional implementation, the method further includes: after the pose detection result of the target object is obtained, generating a pose trigger signal of a virtual object corresponding to the target object according to the pose detection result; and controlling the virtual object to perform a corresponding trigger action according to the pose trigger signal.
In the above implementation, since more accurate detection results can be obtained when the body parts of the target object are detected with the pose detection model, the virtual object can be controlled accurately to perform the corresponding trigger action when triggered according to the pose detection result.
In an optional implementation, the method further includes: determining a plurality of training samples, where each training sample contains part of the target body parts of a target object and includes annotation information for each key point of the target body parts; and training a pose detection model to be trained using the plurality of training samples to obtain the pose detection model. Predicting the key point of the first part based on the video image frame to obtain the first key point, and determining the key point of the second part of the target body part contained in the video image frame to obtain the second key point, includes: predicting the key point of the first part in the video image frame based on the pose detection model to obtain the first key point, and predicting the key point of the second part of the target body part contained in the video image frame based on the pose detection model to obtain the second key point.
In the above implementation, training the pose detection model to be trained with a plurality of such training samples yields a model capable of performing pose detection on images containing only part of the target body parts of a target object. With this processing, pose detection of the target object remains possible even when the video image frame contains only part of the target body parts, ensuring that the anchor application can detect body parts normally.
In an optional implementation, determining the plurality of training samples includes: collecting an original image containing all target body parts of a target object, and performing body detection on the original image to obtain a plurality of key points; and performing occlusion processing on at least some specified body parts in the original image, and determining the plurality of training samples based on the occlusion-processed original image and the annotation information of the plurality of key points.
In the above implementation, performing occlusion processing on the original image simulates situations in which limbs are occluded or cropped in the corresponding application scenario. When the pose detection model to be trained is trained with training samples determined in this way, pose detection of the target object remains possible even when the video image frame does not contain all target body parts, ensuring that the corresponding application can run normally and stably.
In a second aspect, an embodiment of the present disclosure further provides a pose estimation device, including: an acquisition part configured to acquire a video image frame containing a body part of a target object; a detection part configured to perform occlusion detection on a target body part of the target object in the video image frame; a first determining part configured to, when it is detected that a first part of the target body part is occluded, predict a key point of the first part based on the video image frame to obtain a first key point, and determine a key point of a second part of the target body part contained in the video image frame to obtain a second key point; and a second determining part configured to determine a pose detection result of the target object based on the first key point and the second key point.
In a third aspect, an embodiment of the present disclosure further provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps in the above first aspect, or in any possible implementation of the first aspect, are performed.
In a fourth aspect, embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps in the above first aspect, or in any possible implementation of the first aspect, are performed.
In a fifth aspect, an embodiment of the present disclosure further provides a computer program product including a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to execute the steps in the above first aspect, or in any possible implementation of the first aspect.
Brief Description of the Drawings
In order to illustrate the technical solutions of the embodiments of the present disclosure more clearly, the drawings required in the embodiments are briefly introduced below.
FIG. 1 shows a flowchart of a pose estimation method provided by an embodiment of the present disclosure;
FIG. 2 shows a flowchart of predicting the key point of the first part based on the video image frame to obtain the first key point in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a video image frame provided by an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of a first image containing the second part, obtained after cropping the video image frame, provided by an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a detection result of key point detection for the first part and the second part in the second image provided by an embodiment of the present disclosure;
FIG. 6 shows a flowchart of cropping the first image containing the second part from the video image frame in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 7 shows a flowchart of performing edge padding on the first image to obtain a second image containing a padded area in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 8a shows a schematic diagram of the padding effect of padding a first image to obtain a second image provided by an embodiment of the present disclosure;
FIG. 8b shows another schematic diagram of the padding effect of padding a first image to obtain a second image provided by an embodiment of the present disclosure;
FIG. 9 shows another flowchart of performing edge padding on the first image to obtain a second image containing a padded area in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 10 shows a flowchart of predicting the key point of the first part based on the video image frame to obtain the first key point in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 11 shows a flowchart of a first optional way of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 12 shows a flowchart of a second optional way of determining the estimated area of the first part in a pose estimation method provided by an embodiment of the present disclosure;
FIG. 13 shows a flowchart of another pose estimation method provided by an embodiment of the present disclosure;
FIG. 14 shows a schematic architectural diagram of a pose estimation device provided by an embodiment of the present disclosure;
FIG. 15 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the purpose, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions in the embodiments are described clearly and completely below in conjunction with the accompanying drawings.
It should be noted that similar reference numerals and letters denote similar items in the following figures; therefore, once an item is defined in one figure, it does not need to be further defined or explained in subsequent figures.
The term "and/or" herein describes an association relationship, indicating that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
Research has found that in related pose capture solutions, the pose of the recognized object is often captured using a webcam, but due to uncertain factors such as the camera's field-of-view specification and the distance between the object and the lens, parts of the object's limbs often extend beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), which prevents the pose capture device from accurately recognizing the object's pose.
Based on the above research, the present disclosure provides a pose estimation method, device, computer equipment, storage medium, and program product. In the embodiments of the present disclosure, when occlusion detection is performed on a video image frame and the occluded first part is detected, predicting the key points of the occluded first part while predicting the key points of the unoccluded second part makes it possible, when the target body part is occluded, to predict both the key points of the second part captured within the frame and the key points of the first part outside the frame, yielding reasonable, stable, non-jumping key point positions; accurate pose detection of the target object therefore remains possible even when the video image frame does not contain the complete body part.
To facilitate understanding of this embodiment, a pose estimation method disclosed in the embodiments of the present disclosure is first introduced in detail. The execution subject of the pose estimation method provided in the embodiments of the present disclosure is generally a computer device with certain computing capabilities. The computer device may be a live-streaming device, for example, any device capable of pose estimation, such as a smartphone, a tablet computer, or a PC.
In some possible implementations, the pose estimation method may be implemented by a processor invoking computer-readable instructions stored in a memory.
Referring to FIG. 1, which is a flowchart of a pose estimation method provided by an embodiment of the present disclosure, the method includes steps S101 to S107, wherein:
S101: Acquire a video image frame containing a body part of a target object.
In the embodiment of the present disclosure, a video image frame containing the body parts of a target object may first be acquired through the camera of the computer device. The body parts contained in the video image frame may be the full-body or half-body parts of the target object. Here, the half-body parts may include the body parts above the waist of the target object (head, upper torso, both arms, and hands).
Exemplarily, the technical solution of the present disclosure can be used in a live-streaming scenario, and the above computer device can be a device on which a live-streaming application can be installed. In this case, the target object may be an anchor, and the acquired video image frame may be one containing the anchor's body parts, collected during the anchor's live broadcast. Of course, in some embodiments the solution can also be applied in other video playback scenarios.
S103: Perform occlusion detection on the target body part of the target object in the video image frame.
In the embodiment of the present disclosure, occlusion detection may be performed on the target body part of the target object in the video image frame through an occlusion detection model. The target body part may be all the body parts of the target object or some of them; the present disclosure does not limit this.
After occlusion detection is performed on the target body part, a corresponding occlusion detection result can be obtained, where the occlusion detection result characterizes the completeness of the target body part. For example, the completeness includes a complete or incomplete result, and in the case of an incomplete result the occlusion detection result may further include part information of the occluded part (i.e., the first part).
Suppose the target body part is the half-body part of the target object and the video image frame does not contain the target object's hands. After occlusion detection is performed on the video image frame, an occlusion detection result can be obtained indicating that the target body part in the frame is incomplete and that the occluded part is the target object's hands.
S105: When it is detected that the first part of the target body part is occluded, predict the key point of the first part based on the video image frame to obtain the first key point, and determine the key point of the second part of the target body part contained in the video image frame to obtain the second key point.
In the embodiment of the present disclosure, when it is detected that the first part of the target body part is occluded, the key points of the occluded first part can be predicted through a pose detection model, which also predicts the key points of the second part of the target body part contained in the video image.
In the embodiment of the present disclosure, edge padding can be performed on the video image frame based on the position of the missing first part, and the padded frame can then be processed through the pose detection model so that the key points of the first part are predicted in the padded area. Meanwhile, the pose detection model can also perform key point detection on the second part contained in the frame to obtain the second key points.
In the embodiment of the present disclosure, the target body part may contain multiple sub-parts. For example, the target body part may be the upper-body parts of the target object, in which case it may include the following sub-parts: head, upper torso, both arms, and both hands. The first part may be a complete sub-part or a portion of one. For example, the first part may be both hands, meaning that the occluded first part in the video image frame is both hands; the first part may also be the fingers, meaning that the occluded first part is the fingers of both hands.
In the embodiment of the present disclosure, before occlusion detection is performed on the target body part of the target object in the video image frame, the following steps may also be executed:
Determine the capture moment of the video image frame, calculate the distance between the target object and the camera of the computer device at that moment to obtain the capture distance, and compare the capture distance with a first distance threshold and a second distance threshold. If the comparison shows that the capture distance is less than the first distance threshold, or greater than the second distance threshold, it is preliminarily determined that occlusion detection does not need to be performed on the video image frame. If the comparison shows that the capture distance is less than or equal to the second distance threshold and greater than or equal to the first distance threshold, occlusion detection needs to be performed on the target body part in the video image frame.
Here, the second distance threshold is greater than the first distance threshold. The first and second distance thresholds may be selected empirically in advance, or preset in the computer device for the target object. The first distance threshold characterizes the distance between the target object and the camera at which the video image frame does not contain the parts of the target object below the head, or at which the below-head parts it contains are insufficient for pose detection. The second distance threshold characterizes the distance between the target object and the camera at which the video image frame contains the complete target body part.
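This two-threshold gating can be pictured with a minimal Python sketch; the function and parameter names are illustrative assumptions, and the threshold values would be chosen empirically or preset as described above.

def needs_occlusion_detection(capture_distance: float,
                              first_threshold: float,
                              second_threshold: float) -> bool:
    """Run occlusion detection only in the middle distance band (hypothetical helper)."""
    if capture_distance < first_threshold:
        return False  # too close: parts below the head are absent or unusable
    if capture_distance > second_threshold:
        return False  # far enough that the complete target body part is in frame
    return True       # in between: the target body part may be truncated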
S107: Determine the pose detection result of the target object based on the first key point and the second key point.
After the first key point and the second key point are determined, the pose detection result of the target object can be determined based on them.
The above steps are described in detail below.
In an optional implementation, the method further includes the following steps:
Determine the capture distance corresponding to the video image frame, where the capture distance characterizes the distance between the target object and the video capture device when the device captures the frame, and judge, based on the capture distance, whether the video image frame satisfies the facial expression capture condition.
When it is determined that the capture distance meets the preset distance requirement, determine that the video image frame satisfies the facial expression capture condition, and perform facial expression detection on the video image frame to obtain a facial expression detection result.
Based on this, step S103, performing occlusion detection on the target body part of the target object in the video image frame, includes the following:
When it is determined that the video image frame does not satisfy the facial expression capture condition, perform occlusion detection on the target body part of the target object in the video image frame.
In the technical solution of the present disclosure, besides body detection, facial expression capture can also be performed on the target object. In this case, the distance between the target object and the video capture device (for example, the camera of the computer device) at the moment the device captures the video image frame can be determined.
Then, whether the video image frame satisfies the facial expression capture condition is judged based on the capture distance. In the embodiment of the present disclosure, the capture distance may be compared with the above first distance threshold. If the capture distance is less than the first distance threshold, it is determined that the preset distance requirement is met, that is, the video image frame satisfies the facial expression capture condition; at this point, facial expression capture can be performed on the frame.
If the capture distance is greater than or equal to the first distance threshold, it is determined that the video image frame does not satisfy the facial expression capture condition; at this point, occlusion detection can be performed on the target body part of the target object in the frame.
When it is detected that the first part of the target body part is occluded, the key point of the first part can be predicted based on the video image frame to obtain the first key point, and the key point of the second part of the target body part contained in the frame can be determined to obtain the second key point; the pose detection result of the target object is then determined based on the first key point and the second key point.
When it is detected that the target body part is not occluded, the pose detection result of the target object can be obtained by directly detecting the target body part in the video image frame.
Some related applications use a single-mode technical solution, which can be understood as follows: when the image contains a complete facial expression, the application can only perform facial expression capture, and when the image contains the complete body parts to be detected, the application can only perform body capture. When the application cannot perform facial expression capture and the image does not contain the complete body parts to be detected, the application cannot run the facial capture and body part detection functions normally and stably.
Suppose the above application is virtual live-streaming software. In a virtual live-streaming scenario, the anchor's body movements are often captured and recognized using a webcam, but uncertain factors such as the camera's field-of-view specification and the distance between the anchor and the lens exist. In particular, when the anchor is extremely close to the camera, that is, when part of the person's body extends beyond the camera frame (for example, the forearm is out of frame, or the body below the shoulders and chest is out of frame), virtual live-streaming software in the related art cannot detect body parts normally.
After the above technical solution is adopted, facial expression capture can be performed on the anchor when it is judged based on the capture distance that the facial expression capture condition is satisfied. When it is judged that the condition is not satisfied, occlusion detection can be performed on the target body part of the target object in the video image frame, and when the first part is detected to be occluded, separately predicting the key points of the first part and the second part enables body detection for the body parts to be detected that are not contained in the image. This solves the problem in the related art that body points outside the image cannot be predicted, and alleviates the severe body-point jumping caused by unstable body-point detection when such applications perform body detection.
The technical solution of the present disclosure can also achieve a smooth transition between facial expression capture and body part capture, which improves the robustness of the application so that it can run stably.
In an optional implementation, as shown in FIG. 2, the above step S105 of predicting the key point of the first part based on the video image frame to obtain the first key point includes the following process:
S1051: Crop a first image containing the second part from the video image frame.
S1052: Perform edge padding on the first image to obtain a second image containing a padded area, where the padded area is the area used to perform key point detection for the first part.
S1053: Predict the key point of the first part based on the padded area in the second image to obtain the first key point.
Suppose the image shown in FIG. 3 is the video image frame containing the upper-body parts of the target object acquired in the above step S101. As can be seen from FIG. 3, some finger parts of the target object (i.e., the above first part) are occluded. In this case, a first image containing the second part is cropped from the video image frame, yielding the first image shown in FIG. 4.
After the first image is cropped, edge padding can be performed on it to obtain a second image containing a padded area. Pose detection can then be performed on the second image through the pose detection model, so that the key points of the first part are predicted in the padded area to obtain the first key points, and the key points of the second part are predicted in the remaining area of the second image to obtain the second key points. For example, as shown in FIG. 5, the black image area is the above padded area. After the process described above, the key points of the first part can be predicted in the padded area, and the key points of the second part in the areas outside it.
In the embodiment of the present disclosure, when edge padding is performed on the first image, the first image may be padded with a black image in the manner shown in FIG. 5.
Here, edge padding can be understood as padding the edges of the first image based on the position, in the video image frame, of the occluded first part, so as to obtain an area in which key point detection can be performed for the occluded first part.
In the above implementation, by cropping the first image containing the second part from the video image frame and performing edge padding on it to obtain the second image, the key points of the occluded first part can be predicted through the second image; thus, when the first part of the target body part is occluded, both the key points of the second part captured within the frame and the key points of the first part outside the frame can be predicted, as the following sketch illustrates.
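A minimal sketch of this crop-pad-predict flow, assuming a bottom-only pad and a hypothetical pose_model callable that returns (x, y) keypoints in padded-image coordinates; OpenCV's copyMakeBorder performs the black-border padding.

import cv2
import numpy as np

def predict_with_padding(first_image: np.ndarray, pad_bottom: int, pose_model):
    """Pad the cropped image where the occluded part would be, run the model,
    then split the keypoints into padded-area (first) and visible (second)."""
    visible_height = first_image.shape[0]
    second_image = cv2.copyMakeBorder(first_image, 0, pad_bottom, 0, 0,
                                      cv2.BORDER_CONSTANT, value=(0, 0, 0))
    keypoints = pose_model(second_image)  # assumed to return (x, y) tuples
    first_keypoints = [kp for kp in keypoints if kp[1] >= visible_height]   # in padded area
    second_keypoints = [kp for kp in keypoints if kp[1] < visible_height]   # in visible area
    return first_keypoints, second_keypoints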
In an optional implementation, as shown in FIG. 6, S1051, cropping the first image containing the second part from the video image frame, includes the following process:
S601: Determine a target box in the video image frame, where the target box is a box used to frame the second part.
S602: Crop the sub-image located within the target box in the video image frame to obtain the first image.
In the embodiment of the present disclosure, the first part of the target object being occluded in the video image frame can be understood in two ways: the first part is truncated by the image edge, or the first part is occluded by another object in the frame.
In the embodiment of the present disclosure, when the first part of the target body part is not detected in the video image frame and it is detected that the first part is not at an edge position of the frame, it can be determined that the first part is occluded by another object in the frame.
In this case, a target box for framing the second part can be determined in the video image frame. Then, the sub-image located within the target box is cropped to obtain the first image.
In the embodiment of the present disclosure, when the first part of the target body part is not detected in the video image frame and it is detected that the first part is truncated by the image edge, edge padding can be performed directly on the video image frame to obtain the second image containing the padded area.
Here, the process of performing edge padding on the video image frame is the same as that of performing edge padding on the first image; in the following implementations, edge padding is described by taking the first image as an example.
In the above implementation, when it is detected that the first part of the target body part of the target object is occluded, obtaining the second image by cropping the sub-image within the target box and performing edge padding on it broadens the application scenarios of the pose estimation method provided in the present disclosure; in complex pose estimation scenarios, applications based on this pose estimation can still run normally and stably.
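A minimal sketch of the target-box crop in S601 and S602; the (x0, y0, x1, y1) box format is an assumption for illustration.

import numpy as np

def crop_target_box(video_frame: np.ndarray, target_box) -> np.ndarray:
    """Crop the sub-image enclosed by the box framing the visible second part."""
    x0, y0, x1, y1 = target_box
    frame_h, frame_w = video_frame.shape[:2]
    x0, y0 = max(0, x0), max(0, y0)              # clamp the box to the frame
    x1, y1 = min(frame_w, x1), min(frame_h, y1)
    return video_frame[y0:y1, x0:x1].copy()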
As described above, when edge padding is performed on the first image (or the video image frame), it can be implemented by padding the first image with a black image in the manner shown in FIG. 5.
Besides this edge padding manner, the first image can also be edge-padded in the manner described below to obtain the second image, including:
Determine, in the video image frame, the position information of the occluder that occludes the first part, and replace the image located within that position in the frame with a background image of a preset color, for example a black background image.
In the embodiment of the present disclosure, besides a black background image, background images of other colors may also be used. To improve the processing accuracy of the pose detection model, the preset color may be determined based on the training samples of the pose detection model, as described in detail in the following process.
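A minimal sketch of this alternative, replacing the detected occluder's region with a preset-color background; defaulting to black here is an assumption that matches the padding examples above.

import numpy as np

def mask_occluder(video_frame: np.ndarray, occluder_box, color=(0, 0, 0)) -> np.ndarray:
    """Replace the occluder's region with a preset-color background (hypothetical helper)."""
    x0, y0, x1, y1 = occluder_box
    result = video_frame.copy()
    result[y0:y1, x0:x1] = color  # the preset color may instead be derived from training samples
    return result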
In an optional implementation, as shown in FIG. 7, the above step of performing edge padding on the first image to obtain the second image containing the padded area includes the following process:
S701: Determine attribute information of the first part, where the attribute information includes limb type information and/or limb size information.
S702: Determine padding parameters of the first image according to the attribute information, where the padding parameters include a padding position and/or a padding size.
S703: Perform edge padding on the first image based on the padding parameters to obtain the second image.
在本公开实施例中,上述第一部位可以理解为视频图像帧中所缺少的,且该姿态检测模型需要进行姿态检测的肢体部位。举例来说,该姿态检测模型需要对主播的上半身肢体部位进行检测,然而,该视频图像帧中缺少部分手部部位,此时,第一部位即为视频图像帧中所缺少的手部部位。In the embodiment of the present disclosure, the above-mentioned first part can be understood as a body part that is missing in the video image frame and for which the pose detection model needs to perform pose detection. For example, the posture detection model needs to detect the anchor's upper body parts, however, some hand parts are missing in the video image frame, and at this time, the first part is the missing hand part in the video image frame.
这里,肢体类型信息用于指示视频图像帧中所缺少的第一部位的肢体类型信息,例如,该视频图像帧中所缺少的第一部位为手部。肢体尺寸信息用于指示视频图像帧中所缺少的第一部位的尺寸信息(或者大小信息)。Here, the limb type information is used to indicate the limb type information of the first part missing in the video image frame, for example, the first missing part in the video image frame is a hand. The body size information is used to indicate the size information (or size information) of the first part that is missing in the video image frame.
可以理解的是,在确定出肢体类型信息之后,就可以估计出该第一部位相对于第一图像的位置关系。It can be understood that after the body type information is determined, the positional relationship of the first part relative to the first image can be estimated.
在确定出上述属性信息之后,就可以基于该属性信息,确定第一图像的填补位置和/或填补尺寸,进而,基于该填补位置和/或填补尺寸对第一图像进行填补,得到第二图像。After the above attribute information is determined, the filling position and/or filling size of the first image can be determined based on the attribute information, and then the first image can be filled based on the filling position and/or filling size to obtain the second image .
在本公开实施例中,在基于第一部位的属性信息,确定第一图像的填补参数的情况下,可以基于肢体类型信息确定第一部位相对于第一图像的位置关系,例如,第一部位应当位于第一图像的下边缘位置,此时,可以基于该位置信息确定填补位置,例如,可以将第一图像的下边缘位置确定为填补位置。同时,还可以基于肢体尺寸信息,确定第一图像的填补尺寸,例如,可以将该肢体尺寸信息确定为该填补尺寸。In the embodiment of the present disclosure, in the case of determining the filling parameters of the first image based on the attribute information of the first part, the positional relationship of the first part relative to the first image may be determined based on the body type information, for example, the first part It should be located at the bottom edge of the first image. At this time, the filling position may be determined based on the position information, for example, the bottom edge of the first image may be determined as the filling position. At the same time, the padding size of the first image may also be determined based on the limb size information, for example, the limb size information may be determined as the padding size.
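As an illustration of S701 to S703, the following sketch maps the limb type to a padding position and uses the limb size as the padding size. The EDGE_BY_LIMB table and the limb names are assumptions made for this example, not values prescribed by the disclosure.

```python
import cv2
import numpy as np

# Hypothetical mapping from the missing limb's type to the edge(s) to pad.
EDGE_BY_LIMB = {"hand": ["bottom"], "arm": ["left", "right"]}

def pad_for_missing_part(first_image: np.ndarray,
                         limb_type: str,
                         limb_size_px: int) -> np.ndarray:
    # S702: derive the padding position from the limb type and the
    # padding size from the limb size information.
    pads = dict(top=0, bottom=0, left=0, right=0)
    for edge in EDGE_BY_LIMB.get(limb_type, []):
        pads[edge] = limb_size_px
    # S703: pad the chosen edges with a black (preset-color) background.
    return cv2.copyMakeBorder(first_image,
                              pads["top"], pads["bottom"],
                              pads["left"], pads["right"],
                              cv2.BORDER_CONSTANT, value=(0, 0, 0))
```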
Suppose, as shown in Figure 8a, that the first part missing from the video image frame is a hand. It can then be determined that the hand lies below the first image, and the bottom edge of the first image is padded accordingly, for example with a black-background image.

Suppose, as shown in Figure 8b, that the first parts missing from the video image frame are the hands and both arms. It can then be determined that the hands lie below the first image and the arms lie on its left and right sides, so the bottom, left, and right edges of the first image are padded, for example each with a black-background image.

Here, instead of black, background images of other colors may also be used for padding; the present disclosure does not limit this.

Since the size of the image input to the pose detection model is preset, the second image must be resized to that preset size after padding. Consequently, padding a larger area lowers the resolution of the image region corresponding to the target object's body parts in the target image.

In the above implementation, determining the padding parameters from the attribute information of the first part and padding the video image frame according to those parameters to obtain the second image keeps the image resolution as high as possible while still padding the frame, which helps produce more accurate pose detection results.

In an optional implementation, the limb type information of the first part may be determined in the following ways:

Way 1: The limb type information of the first part missing from the video image frame can be estimated from the estimated distance between the target object and the camera device.

For example, a distance-measurement model may predict the distance between the target object and the camera device and output the limb type information of the missing first part.

Way 2: A first image containing the second part is cropped from the video image frame using the target bounding box; the first image is then analyzed to obtain the limb type information of the first part missing from the video image frame.

In an optional implementation, the limb size information of the first part may be determined as follows:

Obtain the actual length information of the target object, which may be the target object's actual height or the actual length of any complete target limb part of the target object. Then determine the in-frame length of a complete designated limb part contained in the video image frame, and estimate the limb size information from this in-frame length and the actual length information.
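A minimal sketch of this proportion-based estimate follows, assuming the real-world length of the missing part is known (for example, derived from the subject's height via standard body proportions, which is an assumption here, not part of the disclosure):

```python
def estimate_limb_size_px(missing_len_cm: float,
                          visible_len_cm: float,
                          visible_len_px: float) -> float:
    # Pixels-per-centimeter scale taken from a complete, visible limb part.
    px_per_cm = visible_len_px / visible_len_cm
    # Convert the missing part's real-world length into an in-frame size.
    return missing_len_cm * px_per_cm
```

For instance, if a visible forearm of 25 cm spans 100 px in the frame, a missing 18 cm hand would be estimated at 72 px.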
In embodiments of the present disclosure, besides the approaches described above for edge-padding the first image to obtain the second image, the first image may also be edge-padded as follows:

Each image edge of the video image frame may be padded separately; in this case, the padding size for each edge may be a preset size.

Alternatively, the image edge corresponding to the missing first part may be identified in the first image and only that edge padded; in this case, the padding size for that edge may be a preset size, or the limb size information determined above.

In embodiments of the present disclosure, as shown in Figure 9 and building on the implementation shown in Figure 7, after the padding parameters are determined, the method further includes the following process:

S901: Acquire scene type information of the video image frame.

S902: Adjust the padding parameters according to the scene type information, and pad the video image frame based on the adjusted padding parameters to obtain the second image.

In embodiments of the present disclosure, the scene type information corresponding to the video image frame may be determined; for example, the scene type information may be a live-commerce scene, a game commentary scene, a performance scene, and so on.

The resolution requirements on the image input to the pose detection model may differ for each scene type. Therefore, after the scene type information is determined, the image resolution matching that scene type can be determined, and the padding parameters adjusted according to that image resolution.

For example, for scenes with high image-resolution requirements, the padding size can be reduced adaptively, which benefits the resolution of the second image obtained from the padding process. For scenes with low image-resolution requirements, the padding size can be increased adaptively, or the original padding size can be kept unchanged.

It should be understood that when the padding parameters are adjusted according to the image resolution, padding the first image with the adjusted parameters should still leave the padded second image containing a region in which pose detection of the first part can be performed.

In the above implementation, adjusting the padding parameters through the scene type information of the video image frame and expanding the frame according to the adjusted parameters keeps the image resolution as high as possible while the key points of the complete target limb parts can still be detected, which helps produce more accurate pose detection results.
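The adjustment of S902 could look like the sketch below. The scene names echo the examples above; the scale factors and the minimum padding floor are illustrative assumptions:

```python
# Hypothetical per-scene scale factors for the padding size.
SCENE_PAD_SCALE = {"live_commerce": 0.5, "game_commentary": 1.0, "performance": 1.5}

def adjust_padding(pads: dict, scene_type: str, min_pad_px: int = 16) -> dict:
    # Shrink padding for resolution-hungry scenes, grow or keep it otherwise,
    # while keeping enough padded room to detect the missing first part.
    scale = SCENE_PAD_SCALE.get(scene_type, 1.0)
    return {edge: (max(int(round(size * scale)), min_pad_px) if size > 0 else 0)
            for edge, size in pads.items()}
```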
In embodiments of the present disclosure, as shown in Figure 10, step S105 of predicting the key points of the first part based on the video image frame to obtain the first key points further includes the following process:

S1001: Determine an estimated region of the first part based on the video image frame.

S1002: Predict the key points of the first part based on the estimated region to obtain the first key points.

In an optional implementation, the estimated region of the first part may be determined through the following steps:

First, determine the first image containing the second part based on the video image frame; then edge-pad the first image in the manner described above to obtain the second image; finally, determine the estimated region of the first part within the padded region of the second image. Here, the estimated region is the region of the second image used for estimating the key points of the first part.

Here, the estimated region may be rectangular or circular; the present disclosure does not limit the shape of the estimated region.

In an optional implementation, as shown in Figure 11, the estimated region of the first part may also be determined through the following steps:

S1101: Determine the limb type information corresponding to the first part, and determine the target part among the target limb parts that is associated with the first part.

Here, the first part and the target part may be detection parts in a linkage relationship: the first part moves as driven by the target part, or the target part moves as driven by the first part. For instance, if the first part is a hand, the target part may be the wrist; if the first part is the wrist, the target part may be the forearm. Further examples are omitted here.

S1102: Based on the limb type information of the first part and the limb type information of the target part, determine a position constraint between the first part and the target part, where the position constraint bounds the difference between the position of the first part in the second image and the position of the target part in the second image.

Here, different position constraints are set for different limb types. The second image is the image obtained after edge-padding the first image in the implementation above, and the first image is the sub-image of the video image frame containing the second part.

S1103: Determine the estimated region of the first part in the second image based on the position constraint.

Determining the estimated region of the first part in the second image through position constraints reduces large position differences between the first part and the target part, thereby improving the processing accuracy of the pose detection model.
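One way to realize S1102 and S1103 is to center a box on the associated target part and bound its extent by a per-pair offset limit. The offset table and the box form are assumptions for illustration:

```python
# Hypothetical maximum first-part/target-part offsets, in pixels.
MAX_OFFSET = {("hand", "wrist"): 40, ("wrist", "forearm"): 60}

def estimated_region(target_xy: tuple,
                     first_limb: str,
                     target_limb: str) -> tuple:
    # The position constraint bounds how far the first part may lie from
    # its associated target part, so the estimated region becomes a box of
    # that radius centered on the target part's position in the second image.
    r = MAX_OFFSET.get((first_limb, target_limb), 50)
    x, y = target_xy
    return (x - r, y - r, x + r, y + r)
```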
After the estimated region is determined, the pose detection model can perform pose detection on the second image annotated with the estimated region to obtain a pose detection result. The pose detection result contains annotation information for the key points of the complete target limb parts, where the annotation information includes position information and category information.

When the pose detection model performs pose detection on the second image annotated with the estimated region, the estimated region can guide the model in detecting the first key points of the first part missing from the video image frame, which improves the accuracy of the detected key points and reduces their detection error.

In an optional implementation, as shown in Figure 12, the estimated region of the first part may also be determined through the following process:

S1201: Determine a target video image among the historical video images corresponding to the target object, where the similarity between the target video image and the video image frame meets a preset requirement and the target video image contains the first part.

S1202: Determine the estimated region according to the position information of the first part contained in the target video image.

In embodiments of the present disclosure, the historical video images of the target object (for example, historical live-stream images) are first retrieved from a cache folder. The cache folder stores video image frames containing the complete designated limb parts, together with the pose detection results corresponding to those frames.

After the historical video images are obtained, they can be screened to select the target video image. The screening process is as follows:

Compute the feature distance between each historical video image and the video image frame, where the feature distance characterizes the similarity between the historical video image and the video image frame. Based on the computed feature distances, select from the historical video images the target video images whose similarity to the video image frame meets the preset requirement. Here, meeting the preset requirement can be understood as the feature distance being greater than or equal to a preset distance threshold.

After the target video image is selected, the estimated region can be determined according to the position information of the first part contained in the target video image; for example, that position information may be taken directly as the estimated region.
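A sketch of this screening step follows. Since the selection rule above keeps images whose score is at or above the threshold, cosine similarity is used here as a stand-in for the "feature distance"; the cache layout is an assumption:

```python
import numpy as np

def select_target_images(frame_feat: np.ndarray,
                         cache: list,  # (image, feature, first_part_box) tuples
                         threshold: float = 0.9) -> list:
    # Keep cached frames whose similarity to the current frame meets the
    # preset requirement; the stored first-part box can then serve directly
    # as the estimated region.
    selected = []
    for image, feat, box in cache:
        sim = float(np.dot(frame_feat, feat) /
                    (np.linalg.norm(frame_feat) * np.linalg.norm(feat) + 1e-8))
        if sim >= threshold:
            selected.append((image, box))
    return selected
```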
In the above implementation, considering that the same subject varies little between similar actions while using the computer device, determining the estimated region from a retrieved target video image improves the accuracy with which the estimated region is determined, thereby yielding more accurate pose detection results.

In embodiments of the present disclosure, as shown in Figure 13, the method further includes the following process:

S1301: After the pose detection result of the target object is obtained, generate a pose trigger signal for the virtual object corresponding to the target object according to the pose detection result.

S1302: Control the virtual object to perform the corresponding trigger action according to the pose trigger signal.

In embodiments of the present disclosure, the pose trigger signal of the virtual object may be generated from the key points of the target limb parts of the target object in the detection result above.

In embodiments of the present disclosure, after the pose detection result of the target object is obtained, a pose trigger signal that triggers the virtual object to perform the corresponding trigger action may be generated from that result, so as to trigger the virtual object accordingly.

Here, the trigger signal indicates the position information, in the video image frame, of the key points of each virtual limb of the virtual object.

It should be noted that the virtual object takes a preset avatar form (for example, a virtual streamer), where the preset form includes at least one of the following: a three-dimensional humanoid (which may be a human or a human-like figure such as an alien), a three-dimensional animal (such as a dinosaur or a pet cat), a two-dimensional character, or a two-dimensional animal.

In the above implementation, since detecting the body parts of the target object with this pose detection model yields more accurate detection results, triggering the virtual object to perform the corresponding trigger action according to the pose detection result enables the virtual object to be controlled accurately.
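As a sketch of S1301, the trigger signal can be built by mapping each detected real-limb key point to the corresponding virtual-limb key point. The one-to-one name mapping below is a simplifying assumption; a real avatar rig would typically require retargeting:

```python
def make_pose_trigger_signal(keypoints: dict) -> dict:
    # keypoints: {limb key point name: (x, y) position in the video image frame}
    # The signal indicates where each virtual limb's key point should be placed.
    return {f"virtual_{name}": xy for name, xy in keypoints.items()}
```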
In embodiments of the present disclosure, the method further includes the following process:

First, determine multiple training samples, where each training sample contains part of the target limb parts of a target object, and each training sample contains annotation information for every key point of the target limb parts. Then train the pose detection model to be trained with the multiple training samples to obtain the pre-trained pose detection model.

In embodiments of the present disclosure, multiple training samples are first obtained and then input into the pose detection model to be trained, so as to train it.

In the above implementation, training the pose detection model with multiple such training samples yields a pose detection model capable of performing pose detection on images that contain only part of the target limb parts of the target object. With this processing, pose detection of the target object remains possible even when the video image frame contains only part of the target limb parts, allowing the streaming application to run normally and stably.

In embodiments of the present disclosure, determining the multiple training samples includes the following process:

First, collect original images containing all target limb parts of the target object and perform limb detection on the original images to obtain multiple key points. After the multiple key points are obtained, occlusion processing can be applied to at least some of the target limb parts in the original images, and the multiple training samples determined based on the occlusion-processed original images and the annotation information of the multiple key points.

In embodiments of the present disclosure, first obtain an original image containing all the target limb parts. For example, if the target limb parts are the upper-body parts of a human body, the original image must contain at least the complete upper body, and may, for example, contain the limb parts of the whole body.

After the original image is obtained, limb detection can be performed on it to obtain multiple key points, among which are the key points of the target limb parts.

Afterwards, occlusion processing can be applied to the original image to obtain an occlusion-processed original image containing incomplete target limb parts. The occlusion-processed original image and the key points of the target limb parts determined in the process above can then be taken together as one training sample.

In an optional implementation, applying occlusion processing to at least some of the target limb parts in the original image includes the following:

A background image of a preset color may be used to occlude at least some of the target limb parts, yielding the occlusion-processed original image; alternatively, at least some of the target limb parts in the original image may be cropped out, likewise yielding the occlusion-processed original image.

In the above implementation, occlusion processing of the original images simulates limbs being occluded or cropped in the corresponding application scenarios. When the pose detection model to be trained is trained with training samples produced in this way, pose detection of the target object remains possible even when the video image frame does not contain all the target limb parts, so that the corresponding application can run normally and stably and limb detection can proceed normally.
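A minimal sketch of this sample-generation step follows. Occluding a random band at the bottom of the image with the preset color is one illustrative strategy for simulating a limb cut off by the frame edge, not the disclosure's prescribed one:

```python
import random
import numpy as np

def make_occluded_sample(original: np.ndarray,
                         keypoints: list,  # [(x, y, label), ...] from limb detection
                         occlude_color=(0, 0, 0)):
    # Paint a random band at the bottom of the original image with the
    # preset color, then pair the occluded image with the full key point
    # annotations to form one training sample.
    img = original.copy()
    h = img.shape[0]
    band = random.randint(h // 8, h // 3)
    img[h - band:, :] = occlude_color
    return img, keypoints
```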
The present disclosure relates to the field of augmented reality. By acquiring image information of a target object in a real environment and then using various vision-related algorithms to detect or recognize the target object's relevant features, states, and attributes, an augmented reality (AR) effect combining the virtual and the real that matches the application can be obtained. Illustratively, the target object may involve faces, limbs, gestures, actions, and the like related to the human body. Applications may involve not only interactive scenarios such as guided tours, navigation, explanation, reconstruction, and overlaid display of virtual effects related to real scenes or objects, but also person-related special-effects processing such as makeup beautification, body beautification, special-effects display, and virtual-model display. The detection or recognition of the target object's relevant features, states, and attributes can be implemented through a convolutional neural network.

Those skilled in the art can understand that, in the methods of the specific implementations above, the order in which the steps are written does not imply a strict execution order or impose any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible inherent logic.

Based on the same inventive concept, embodiments of the present disclosure further provide a pose estimation apparatus corresponding to the pose estimation method. Since the principle by which the apparatus in the embodiments of the present disclosure solves the problem is similar to that of the pose estimation method above, the implementation of the apparatus may refer to the implementation of the method.

Referring to Figure 14, a schematic diagram of a pose estimation apparatus provided by an embodiment of the present disclosure, the apparatus includes: an acquisition part 141, a detection part 142, a first determination part 143, and a second determination part 144, where:

the acquisition part 141 is configured to acquire a video image frame containing body parts of a target object;

the detection part 142 is configured to perform occlusion detection on the target limb parts of the target object in the video image frame;

the first determination part 143 is configured to, when a first part of the target limb parts is detected to be occluded, predict the key points of the first part based on the video image frame to obtain first key points, and determine the key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

the second determination part 144 is configured to determine the pose detection result of the target object based on the first key points and the second key points.

In embodiments of the present disclosure, when occlusion detection is performed on the video image frame and an occluded first part is detected, predicting the key points of the occluded first part while predicting the key points of the unoccluded second part makes it possible, even when the target limb parts are occluded in the frame, to predict the key points of the second part captured within the video image frame as well as the key points of the first part lying outside it, thereby predicting reasonable, stable, non-jittering key-point position information and enabling pose detection of the target object even when the video image frame does not contain the complete limb parts.
In a possible implementation, the apparatus is further configured to: determine the capture distance corresponding to the video image frame, the capture distance characterizing the distance between the target object and the video capture device when the video capture device captured the frame; and judge, based on the capture distance, whether the video image frame satisfies the facial-expression capture condition. Performing occlusion detection on the target limb parts of the target object in the video image frame then includes: performing occlusion detection on the target limb parts of the target object in the video image frame when it is determined that the frame does not satisfy the facial-expression capture condition.

In a possible implementation, the apparatus is further configured to: when the capture distance is determined to meet the preset distance requirement, determine that the video image frame satisfies the facial-expression capture condition; and perform facial-expression detection on the video image frame to obtain a facial-expression detection result.

In a possible implementation, the first determination part 143 is further configured to: crop a first image containing the second part from the video image frame; perform edge padding on the first image to obtain a second image containing a padded region, the padded region being a region configured for key-point detection of the first part; and predict the key points of the first part based on the padded region in the second image to obtain the first key points.

In a possible implementation, the first determination part 143 is further configured to: determine attribute information of the first part, the attribute information including limb type information and/or limb size information; determine padding parameters of the first image according to the attribute information, the padding parameters including a padding position and/or a padding size; and perform edge padding on the first image based on the padding parameters to obtain the second image.

In a possible implementation, the first determination part 143 is further configured to: determine a target bounding box in the video image frame, the target bounding box being a box for framing the second part; and crop the sub-image within the target bounding box from the video image frame to obtain the first image.

In a possible implementation, the first determination part 143 is further configured to: determine the estimated region of the first part in the video image frame; and predict the key points of the first part based on the estimated region to obtain the first key points.

In a possible implementation, the apparatus is further configured to: after the pose detection result of the target object is obtained, generate a pose trigger signal for the virtual object corresponding to the target object according to the pose detection result; and control the virtual object to perform the corresponding trigger action according to the pose trigger signal.

In a possible implementation, the apparatus is further configured to: determine multiple training samples, each training sample containing part of the target limb parts of a target object and annotation information for every key point of the target limb parts; and train the pose detection model to be trained with the multiple training samples to obtain the pose detection model. The first determination part 143 is further configured to: predict the key points of the first part in the video image frame based on the pose detection model to obtain the first key points, and predict the key points of the second part of the target limb parts contained in the video image frame based on the pose detection model to obtain the second key points.

In a possible implementation, the apparatus is further configured to: collect original images containing all target limb parts of the target object and perform limb detection on the original images to obtain multiple key points; and apply occlusion processing to at least some of the target limb parts in the original images and determine the multiple training samples based on the occlusion-processed original images and the annotation information of the multiple key points.

For descriptions of the processing flow of each module in the apparatus and of the interaction flow between the modules, reference may be made to the relevant descriptions in the method embodiments above; details are not repeated here.
Corresponding to the pose estimation method in Figure 1, an embodiment of the present disclosure further provides a computer device 1500. As shown in Figure 15, a schematic structural diagram of the computer device 1500 provided by an embodiment of the present disclosure, the device includes:

a processor 151, a memory 152, and a bus 153. The memory 152 is used to store execution instructions and includes an internal memory 1521 and an external memory 1522. The internal memory 1521, also called main memory, temporarily stores operation data for the processor 151 and data exchanged with an external memory 1522 such as a hard disk; the processor 151 exchanges data with the external memory 1522 through the internal memory 1521. When the computer device 1500 runs, the processor 151 and the memory 152 communicate over the bus 153, causing the processor 151 to execute the following instructions:

acquire a video image frame containing body parts of a target object;

perform occlusion detection on the target limb parts of the target object in the video image frame;

when a first part of the target limb parts is detected to be occluded, predict the key points of the first part based on the video image frame to obtain first key points, and determine the key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

determine the pose detection result of the target object based on the first key points and the second key points.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the pose estimation method described in the method embodiments above are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

An embodiment of the present disclosure further provides a computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to execute the steps of the pose estimation method described in the method embodiments above; reference may be made to the method embodiments for those steps.

The computer program product may be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a Software Development Kit (SDK).

Those skilled in the art can clearly understand that, for convenience and brevity of description, the working processes of the systems and apparatuses described above may refer to the corresponding processes in the foregoing method embodiments. In the embodiments provided in the present disclosure, it should be understood that the disclosed apparatuses and methods may be implemented in other ways. The apparatus embodiments described above are illustrative; for example, the division into units is a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure in essence, the part contributing to the related art, or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure.

The aforementioned computer-readable storage medium may be a tangible device capable of retaining and storing instructions for use by an instruction-execution device, and may be a volatile or non-volatile storage medium. It may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, or semiconductor storage device, or any suitable combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random-access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove with instructions stored thereon, and any suitable combination of the foregoing. A computer-readable storage medium as used here is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.

Finally, it should be noted that the embodiments above are only specific implementations of the present disclosure, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with the technical field may, within the technical scope disclosed herein, modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions of some technical features; such modifications, changes, or substitutions do not remove the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Industrial Applicability
The present disclosure provides a pose estimation method, an apparatus, a computer device, a storage medium, and a program product, where the method includes: acquiring a video image frame containing body parts of a target object; performing occlusion detection on the target limb parts of the target object in the video image frame; when a first part of the target limb parts is detected to be occluded, predicting the key points of the first part based on the video image frame to obtain first key points, and determining the key points of a second part of the target limb parts contained in the video image frame to obtain second key points; and determining the pose detection result of the target object based on the first key points and the second key points. By predicting the key points of the occluded first part and predicting the key points of the unoccluded second part in the video image frame, embodiments of the present disclosure can, when the target limb parts are occluded in the frame, predict the key points of the second part captured within the video image frame as well as the key points of the first part lying outside it, thereby predicting reasonable, stable, non-jittering key-point position information and enabling pose detection of the target object even when the video image frame does not contain the complete limb parts.

Claims (14)

1. A pose estimation method, comprising:

acquiring a video image frame containing body parts of a target object;

performing occlusion detection on target limb parts of the target object in the video image frame;

when a first part of the target limb parts is detected to be occluded, predicting key points of the first part based on the video image frame to obtain first key points, and determining key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

determining a pose detection result of the target object based on the first key points and the second key points.
2. The method according to claim 1, wherein the method further comprises:

determining a capture distance corresponding to the video image frame, the capture distance characterizing the distance between the target object and a video capture device when the video capture device captured the video image frame;

judging, based on the capture distance, whether the video image frame satisfies a facial-expression capture condition;

wherein performing occlusion detection on the target limb parts of the target object in the video image frame comprises:

performing occlusion detection on the target limb parts of the target object in the video image frame when it is determined that the video image frame does not satisfy the facial-expression capture condition.

3. The method according to claim 2, wherein the method further comprises:

when it is determined that the capture distance meets a preset distance requirement, determining that the video image frame satisfies the facial-expression capture condition; and performing facial-expression detection on the video image frame to obtain a facial-expression detection result.
4. The method according to any one of claims 1 to 3, wherein predicting the key points of the first part based on the video image frame to obtain the first key points comprises:

cropping a first image containing the second part from the video image frame;

performing edge padding on the first image to obtain a second image containing a padded region, the padded region being a region used for key-point detection of the first part;

predicting the key points of the first part based on the padded region in the second image to obtain the first key points.

5. The method according to claim 4, wherein performing edge padding on the first image to obtain the second image containing the padded region comprises:

determining attribute information of the first part, the attribute information comprising at least one of limb type information and limb size information;

determining padding parameters of the first image according to the attribute information, the padding parameters comprising at least one of a padding position and a padding size;

performing edge padding on the first image based on the padding parameters to obtain the second image.

6. The method according to claim 4, wherein cropping the first image containing the second part from the video image frame comprises:

determining a target bounding box in the video image frame, the target bounding box being a box for framing the second part;

cropping a sub-image within the target bounding box from the video image frame to obtain the first image.
7. The method according to claim 1, wherein predicting the key points of the first part based on the video image frame to obtain the first key points comprises:

determining an estimated region of the first part based on the video image frame;

predicting the key points of the first part based on the estimated region to obtain the first key points.

8. The method according to claim 1, wherein the method further comprises:

after the pose detection result of the target object is obtained, generating a pose trigger signal of a virtual object corresponding to the target object according to the pose detection result;

controlling the virtual object to perform a corresponding trigger action according to the pose trigger signal.

9. The method according to claim 1, wherein the method further comprises:

determining a plurality of training samples, each of the plurality of training samples containing part of the target limb parts of a target object, and each training sample containing annotation information for every key point of the target limb parts; and training a pose detection model to be trained with the plurality of training samples to obtain a pose detection model;

wherein predicting the key points of the first part based on the video image frame to obtain the first key points, and determining the key points of the second part of the target limb parts contained in the video image frame to obtain the second key points, comprises: predicting the key points of the first part in the video image frame based on the pose detection model to obtain the first key points, and predicting the key points of the second part of the target limb parts contained in the video image frame based on the pose detection model to obtain the second key points.

10. The method according to claim 9, wherein determining the plurality of training samples comprises:

collecting original images containing all target limb parts of the target object and performing limb detection on the original images to obtain a plurality of key points;

performing occlusion processing on at least some of the target limb parts in the original images, and determining the plurality of training samples based on the occlusion-processed original images and the annotation information of the plurality of key points.
11. A pose estimation apparatus, comprising:

an acquisition part configured to acquire a video image frame containing body parts of a target object;

a detection part configured to perform occlusion detection on target limb parts of the target object in the video image frame;

a first determination part configured to, when a first part of the target limb parts is detected to be occluded, predict key points of the first part based on the video image frame to obtain first key points, and determine key points of a second part of the target limb parts contained in the video image frame to obtain second key points;

a second determination part configured to determine a pose detection result of the target object based on the first key points and the second key points.

12. A computer device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor and the memory communicate over the bus; and when the machine-readable instructions are executed by the processor, the steps of the pose estimation method according to any one of claims 1 to 10 are performed.

13. A computer-readable storage medium having a computer program stored thereon, wherein, when the computer program is run by a processor, the steps of the pose estimation method according to any one of claims 1 to 10 are performed.

14. A computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the steps of the pose estimation method according to any one of claims 1 to 10.
PCT/CN2022/074929 (priority date 2021-08-27, filing date 2022-01-29): Posture estimation method and apparatus, computer device, storage medium, and program product. WO2023024440A1 (en)

Applications Claiming Priority (2)

Application Number / Priority Date / Filing Date / Title
CN202110994673.3, priority date 2021-08-27
CN202110994673.3A (CN113449696B), priority date 2021-08-27, filing date 2021-08-27: Attitude estimation method and device, computer equipment and storage medium

Publications (1)

Publication Number
WO2023024440A1

Family ID: 77818821

Family Applications (1)

Application Number / Title / Priority Date / Filing Date
PCT/CN2022/074929: Posture estimation method and apparatus, computer device, storage medium, and program product (WO2023024440A1), priority date 2021-08-27, filing date 2022-01-29

Country Status (3)

Country Link
CN (1) CN113449696B (en)
TW (1) TW202309782A (en)
WO (1) WO2023024440A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449696B (en) * 2021-08-27 2021-12-07 北京市商汤科技开发有限公司 Attitude estimation method and device, computer equipment and storage medium
CN116934848A (en) * 2022-03-31 2023-10-24 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN115171217B (en) * 2022-07-27 2023-03-03 北京拙河科技有限公司 Action recognition method and system under dynamic background
CN114998814B (en) * 2022-08-04 2022-11-15 广州此声网络科技有限公司 Target video generation method and device, computer equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070268295A1 (en) * 2006-05-19 2007-11-22 Kabushiki Kaisha Toshiba Posture estimation apparatus and method of posture estimation
US10296102B1 (en) * 2018-01-31 2019-05-21 Piccolo Labs Inc. Gesture and motion recognition using skeleton tracking
CN110929651A (en) * 2019-11-25 2020-03-27 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN111027407A (en) * 2019-11-19 2020-04-17 东南大学 Color image hand posture estimation method for shielding situation
CN112115886A (en) * 2020-09-22 2020-12-22 北京市商汤科技开发有限公司 Image detection method and related device, equipment and storage medium
CN112232194A (en) * 2020-10-15 2021-01-15 广州云从凯风科技有限公司 Single-target human body key point detection method, system, equipment and medium
CN113449696A (en) * 2021-08-27 2021-09-28 北京市商汤科技开发有限公司 Attitude estimation method and device, computer equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960011B (en) * 2017-05-23 2021-12-03 湖南生物机电职业技术学院 Partially-shielded citrus fruit image identification method
CN107633204B (en) * 2017-08-17 2019-01-29 平安科技(深圳)有限公司 Face occlusion detection method, apparatus and storage medium
CN108711175B (en) * 2018-05-16 2021-10-01 浙江大学 Head attitude estimation optimization method based on interframe information guidance
CN109784255B (en) * 2019-01-07 2021-12-14 深圳市商汤科技有限公司 Neural network training method and device and recognition method and device
CN110826519B (en) * 2019-11-14 2023-08-18 深圳华付技术股份有限公司 Face shielding detection method and device, computer equipment and storage medium
CN112257552B (en) * 2020-10-19 2023-09-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium
CN113239797B (en) * 2021-05-12 2022-02-25 中科视语(北京)科技有限公司 Human body action recognition method, device and system

Also Published As

Publication number Publication date
TW202309782A (en) 2023-03-01
CN113449696A (en) 2021-09-28
CN113449696B (en) 2021-12-07

Similar Documents

Publication Publication Date Title
WO2023024440A1 (en) Posture estimation method and apparatus, computer device, storage medium, and program product
Sridhar et al. Real-time joint tracking of a hand manipulating an object from RGB-D input
CN107251096B (en) Image capturing apparatus and method
US8488888B2 (en) Classification of posture states
US20160202770A1 (en) Touchless input
US20200286302A1 (en) Method And Apparatus For Manipulating Object In Virtual Or Augmented Reality Based On Hand Motion Capture Apparatus
US20130251244A1 (en) Real time head pose estimation
KR101929077B1 (en) Image identification method and image identification device
EP3540574B1 (en) Eye tracking method, electronic device, and non-transitory computer readable storage medium
US9536132B2 (en) Facilitating image capture and image review by visually impaired users
KR20170014491A (en) Method and device for recognizing motion
WO2017084319A1 (en) Gesture recognition method and virtual reality display output device
WO2023024442A1 (en) Detection method and apparatus, training method and apparatus, device, storage medium and program product
Antoshchuk et al. Gesture recognition-based human–computer interaction interface for multimedia applications
US8970479B1 (en) Hand gesture detection
WO2019037257A1 (en) Password input control device and method, and computer readable storage medium
US11410398B2 (en) Augmenting live images of a scene for occlusion
US11854308B1 (en) Hand initialization for machine learning based gesture recognition
KR20200081529A (en) HMD based User Interface Method and Device for Social Acceptability
Schlattmann et al. Markerless 4 gestures 6 DOF real‐time visual tracking of the human hand with automatic initialization
CN114327063A (en) Interaction method and device of target virtual object, electronic equipment and storage medium
de Gusmao Lafayette et al. The virtual Kinect
EP3584688A1 (en) Information processing system, information processing method, and program
Albrektsen Using the Kinect Sensor for Social Robotics
CN116820251B (en) Gesture track interaction method, intelligent glasses and storage medium