CN110874865A - Three-dimensional skeleton generation method and computer equipment - Google Patents

Info

Publication number
CN110874865A
Authority
CN
China
Prior art keywords
dimensional
video frame
loss function
position information
sequence
Prior art date
Legal status
Pending
Application number
CN201911111436.7A
Other languages
Chinese (zh)
Inventor
杨博
王博
程禹
陈明辉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911111436.7A priority Critical patent/CN110874865A/en
Publication of CN110874865A publication Critical patent/CN110874865A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 - Manipulating 3D models or images for computer graphics
    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 - Image analysis
    • G06T 7/70 - Determining position or orientation of objects or cameras
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application relates to a three-dimensional skeleton generation method and computer equipment, wherein the method comprises the following steps: acquiring a sequence of video frames comprising a target object; respectively determining original position information of key points of the target object in the video frames for each video frame in the video frame sequence; respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain corrected two-dimensional position information corresponding to each video frame; inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network respectively to obtain three-dimensional position information corresponding to each video frame; and determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information. The scheme provided by the application can improve the accuracy of three-dimensional skeleton extraction.

Description

Three-dimensional skeleton generation method and computer equipment
Technical Field
The present application relates to the field of neural network technology, and in particular, to a three-dimensional skeleton generation method and a computer device.
Background
Three-dimensional skeletons are a compact way to characterize biological models in three-dimensional space. A three-dimensional skeleton can be used to represent the motion of a target object simply and efficiently, where the target object may be a human or an animal. Therefore, three-dimensional skeletons are often used in games or videos to drive virtual objects so that the virtual objects perform corresponding actions.
In order to provide natural and consistent motion for virtual objects in a game or animation, it is common to hire an actor to wear three-dimensional position sensors and then perform various motions, so that the pose of the detected three-dimensional skeleton can be captured. However, this three-dimensional skeleton extraction method is often limited and inflexible, because hired actors cannot perform highly professional actions, such as fancy basketball moves.
Disclosure of Invention
Based on this, it is necessary to provide a three-dimensional skeleton generation method, apparatus, computer-readable storage medium, and computer device, to solve the technical problem that, in the conventional scheme, an actor wearing three-dimensional position sensors must perform various actions so that the three-dimensional skeleton can be captured and detected, which makes action extraction limited and inflexible.
A three-dimensional skeleton generation method, comprising:
acquiring a sequence of video frames comprising a target object;
respectively determining original position information of key points of the target object in the video frames for each video frame in the video frame sequence;
respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain corrected two-dimensional position information corresponding to each video frame;
inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network respectively to obtain three-dimensional position information corresponding to each video frame;
and determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
A three-dimensional skeleton generation apparatus comprising:
an obtaining module for obtaining a sequence of video frames comprising a target object;
a determining module, configured to determine, for each video frame in the sequence of video frames, original position information of a key point of the target object in the video frame;
the correction module is used for respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain the corrected two-dimensional position information corresponding to each video frame;
the conversion module is used for respectively inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network to obtain three-dimensional position information corresponding to each video frame;
the determining module is further configured to determine a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
acquiring a sequence of video frames comprising a target object;
respectively determining original position information of key points of the target object in the video frames for each video frame in the video frame sequence;
respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain corrected two-dimensional position information corresponding to each video frame;
inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network respectively to obtain three-dimensional position information corresponding to each video frame;
and determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of:
acquiring a sequence of video frames comprising a target object;
respectively determining original position information of key points of the target object in the video frames for each video frame in the video frame sequence;
respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain corrected two-dimensional position information corresponding to each video frame;
inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network respectively to obtain three-dimensional position information corresponding to each video frame;
and determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
According to the three-dimensional skeleton generation method, the three-dimensional skeleton generation apparatus, the computer-readable storage medium, and the computer device, initial prediction of the key points of the target object is realized by detecting a group of temporally consecutive video frames, and the original position information of the key points in each frame is obtained. The original position information of each key point is then input into a pre-trained first time convolution network to obtain relatively smooth and stable two-dimensional position information. The two-dimensional position information is then further input into a pre-trained second time convolution network for estimating the three-dimensional posture, so that the three-dimensional skeleton of the target object is extracted from the video frame sequence. In this way, the three-dimensional skeleton of the target object can be accurately extracted from existing video data without hiring an actor to perform the corresponding actions in order to acquire three-dimensional coordinates; the extracted three-dimensional skeleton can assist a virtual object in realizing the corresponding actions; the need to hire actors for motion capture is greatly reduced or eliminated; the extraction of the three-dimensional skeleton is not limited by scene conditions; and the flexibility of extracting the three-dimensional skeleton is greatly improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for generating a three-dimensional skeleton according to an embodiment;
FIG. 2 is a schematic flow chart illustrating a method for generating a three-dimensional skeleton according to an embodiment;
FIG. 3A is an interface display diagram illustrating an embodiment of applying a three-dimensional skeleton extracted by a three-dimensional skeleton generation method to a street basketball game;
FIG. 3B is an interface display diagram of the three-dimensional skeleton extracted by the three-dimensional skeleton generation method applied to the design of the dance game in one embodiment;
FIG. 4 is a flowchart illustrating the steps of determining the original location information of key points of a target object in a video frame for each video frame in a sequence of video frames, respectively, in one embodiment;
FIG. 5 is a schematic flow chart diagram illustrating a method for generating a three-dimensional skeleton according to an exemplary embodiment;
FIG. 6 is a schematic flow chart diagram illustrating the training steps for a three-dimensional skeleton generation model in one embodiment;
FIG. 7 is a flowchart illustrating steps for obtaining a sample sequence of video frames and annotation information corresponding to each video frame in the sample sequence of video frames according to one embodiment;
FIG. 8A is a diagram showing the structure of a human body cylinder model according to an embodiment;
FIG. 8B is a reference diagram for determining whether a key point is occluded by the human body cylinder model according to an embodiment;
FIG. 9 is a flowchart illustrating the training steps for the three-dimensional skeleton generation model in one embodiment;
FIG. 10 is a schematic flow chart diagram illustrating a method for generating a three-dimensional skeleton according to an embodiment;
FIG. 11 is a block diagram showing a configuration of a three-dimensional skeleton generating apparatus according to an embodiment;
FIG. 12 is a block diagram showing the structure of a three-dimensional skeleton generating apparatus according to another embodiment;
FIG. 13 is a block diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment of a method for generating a three-dimensional skeleton according to an embodiment. Referring to fig. 1, the three-dimensional skeleton generation method is applied to a three-dimensional skeleton generation system. The three-dimensional skeleton generation system includes a terminal 110 and a server 120. The terminal 110 and the server 120 are connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may specifically be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as a stand-alone server or a server cluster composed of a plurality of servers. Both the terminal 110 and the server 120 can be independently used for executing the three-dimensional skeleton generation method provided in the embodiment of the present application. The terminal 110 and the server 120 may also be cooperatively used to execute the three-dimensional skeleton generation method provided in the embodiment of the present application.
As shown in FIG. 2, in one embodiment, a three-dimensional skeleton generation method is provided. The embodiment is mainly exemplified by applying the method to a computer device, and the computer device may specifically be the terminal 110 or the server 120 in fig. 1. Referring to fig. 2, the three-dimensional skeleton generation method specifically includes the following steps:
s202, a video frame sequence comprising the target object is obtained.
The video frame sequence is a sequence of a group of temporally consecutive video frames. The target object is an object that performs a target action, and may specifically be a virtual object, a real person, an animal, or the like. The target action is a change in the position of a character's body with a certain motivation and purpose, such as a stretching action, a running action, or a high-jump action. Specifically, the computer device may obtain video data collected by a local camera, or receive video data sent by another computer device through a network connection. The computer device may intercept, from the video data, a video frame sequence that includes a target object performing a target action.
S204, respectively determining the original position information of the key points of the target object in the video frame for each video frame in the video frame sequence.
The key points of the target object are the key points used to determine the skeleton of the target object, and may specifically be the position points of each joint of the target object. For example, when the target object is a human figure, the key points of the target object may be the position points of the various joints of the human body, such as the shoulders, elbows, wrists, neck, head, hips, crotch, knees, or ankles. When the target object is a quadruped, the key points of the target object may specifically be the position points of the various joints of the animal, such as the positions of the neck, head, hips, knees, or ankles.
The original position information of each key point is position information determined by preliminarily detecting each key point of the target object, and can be used to represent the respective rough positions of each key point. The original position information of each keypoint may be, specifically, a two-dimensional coordinate or a relative position coordinate of each keypoint. Wherein, the two-dimensional coordinates are the coordinates of the key points in the two-dimensional space; the relative position coordinates are the coordinates of a keypoint relative to another keypoint in two-dimensional space.
Specifically, for each frame of the video frame sequence, the computer device may detect and track the target object appearing in that frame, and for each independent target object in each frame, the computer device may crop an area including the target object according to a detected bounding box, and then detect the position of each key point of the target object through a key point detection algorithm, such as the Stacked Hourglass algorithm. When detecting and tracking the target object, algorithms such as Mask R-CNN (Mask Region-based Convolutional Neural Network) or Pose Flow (a human body pose tracking algorithm) may specifically be used; of course, other effective algorithms may also be used, which is not limited in this embodiment of the present application.
In one embodiment, when the target object is a character object, the computer device may detect and track the characters in each video frame in the sequence of video frames. For each individual person in each frame, the computer device may crop the surrounding area according to the detected bounding box and normalize it to a uniform size. Then, a top-down key point detection algorithm is used to detect the position of each key point of the target human body.
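As an illustration of the per-frame flow described above, a minimal Python sketch is given below. The three callables are hypothetical placeholders for a person detector and tracker (such as Mask R-CNN with Pose Flow), the bounding-box normalization step, and a key point estimator (such as Stacked Hourglass); none of them is part of the present application.

```python
def extract_raw_keypoints(video_frames, detect_target, crop_and_resize, estimate_keypoints):
    """Sketch of the per-frame pipeline: locate the target object in each frame,
    crop and normalize its bounding box, then estimate the key point positions.
    All three callables are hypothetical placeholders, not a specific library API."""
    raw_positions = []
    for frame in video_frames:
        bbox = detect_target(frame)                    # bounding box of the tracked target
        standard_image = crop_and_resize(frame, bbox)  # square crop scaled to a uniform size
        raw_positions.append(estimate_keypoints(standard_image))  # e.g. an array of (x, y) pairs
    return raw_positions
```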
In one embodiment, the step S204, that is, the step of determining the original position information of each key point of the target object in each video frame specifically includes the following steps: respectively detecting candidate objects included in the video frames of each video frame in the video frame sequence; screening out a target object for executing a target action from candidate objects included in each video frame; and determining original position information of each key point of the target object in each video frame.
Specifically, for each video frame in the sequence of video frames, the computer device may detect the objects appearing in the video frame through a target detection algorithm, and frame each object with a bounding box. The computer device may also determine the category to which each object belongs. The computer device takes objects belonging to the target category as candidate objects, and screens out, from the candidate objects, a candidate object that performs the target action as the target object. The computer device can then detect key points of the target object in each video frame and determine the original position information corresponding to each key point. The categories of objects include, for example, people, cats, dogs, trees, and tables, so the computer device can detect and distinguish the different objects appearing in the video frame. When the target category is the person category, the computer device may take the different persons appearing in the video frame as candidate objects.
In one embodiment, when the object class is a people class, the computer device may perform detection of people for each frame in the video and then distinguish different people by a tracking method. The target detection algorithm can adopt a Mask RCNN algorithm, and the tracking algorithm can adopt a Pose Flow algorithm. Of course, any other reliable detection and tracking algorithm may be used herein, and the embodiment of the present application is not limited thereto.
In an embodiment, the computer device may perform target object detection and tracking on each video frame in the video frame sequence through a pre-trained convolutional network. After a rectangular bounding box is obtained by detecting each person in each video frame, the computer device may expand the rectangular bounding box into a square bounding box whose side length is the long side of the rectangle and whose center point is unchanged, and then scale the area determined by the square bounding box to obtain a standard image. Furthermore, the computer device can perform key point detection and estimation through a pre-trained deep convolutional network, such as a Stacked Hourglass network. It can be understood that the network structures for target detection and tracking and for key point detection mentioned in this embodiment may be trained in a conventional training manner, and the trained network structures are used directly. For the first time convolution network and the second time convolution network mentioned in the embodiments of the present application, the pre-trained network parameters of the network structures for target detection and tracking and for key point detection can be frozen during training, and only the network parameters of the first time convolution network and the second time convolution network are adjusted, so as to train the whole three-dimensional skeleton generation model. The training mode of the three-dimensional skeleton generation model will be described in detail in the following embodiments.
In one embodiment, when the computer device detects key points in each video frame, some key points may be occluded; these occluded key points may be called invalid key points, while the non-occluded key points are valid key points. In this case, the computer device may directly set the original position information of an occluded invalid key point to the origin coordinate (0, 0) instead of using the detected position coordinate. The specific manner of determining the original position information of the valid key points will be described in detail in the following embodiments.
In the above embodiment, the target object for executing the target action may be accurately screened out from the candidate objects included in each video frame, and then the subsequent three-dimensional skeleton may be extracted, so that the three-dimensional skeleton extracted from each video frame may form a three-dimensional skeleton corresponding to the target action according to the time sequence of the video frame sequence formed by each video frame.
And S206, respectively inputting the original position information corresponding to each video frame in the video frame sequence to the first time convolution network to obtain the corrected two-dimensional position information corresponding to each video frame.
A time convolution network (Temporal Convolutional Network) is a network structure capable of processing time-series input data, and can predict the corresponding output according to the order of a known sequence. The first time convolution network is a time convolution network used to correct the original position information corresponding to the key points of the target object in each video frame. The second time convolution network is a time convolution network used to convert two-dimensional position information into three-dimensional position information. The two-dimensional position information is the position information of a key point in two-dimensional space, and may specifically be a two-dimensional position coordinate. The three-dimensional position information is the position information of a key point in three-dimensional space, and may specifically be a three-dimensional position coordinate.
It should be noted that the original position information and the two-dimensional position information are both position information in two-dimensional space. Unlike the two-dimensional position information, the original position information is position information determined by a preliminary detection of each key point of the target object, and may contain detection errors. The two-dimensional position information is the position information obtained by correcting the original position information, and its estimate of the key point positions is smoother and more accurate in time sequence. The three-dimensional position information is the position information of a key point in three-dimensional space, and can be understood as the three-dimensional position information obtained by mapping the two-dimensional position information into three-dimensional space. For example, the computer device may perform key point detection on a target object in one video frame and obtain the original position information corresponding to key point A as (X1, Y1). After the computer device inputs the original position information corresponding to each key point in the video frame sequence into the first time convolution network, the first time convolution network can fuse the time sequence information among the video frames and output the corrected two-dimensional position information corresponding to key point A as (X2, Y2). After the computer device inputs the two-dimensional position information corresponding to each key point in the video frame sequence into the second time convolution network, the three-dimensional position information corresponding to key point A is output as (X3, Y3, Z1).
Specifically, the computer device may sequentially input the original position information corresponding to each video frame to the pre-trained first time convolution network according to the time sequence of the video frame sequence formed by the video frames. When the corresponding output of a certain video frame is predicted through one or more convolution layers in the first time convolution network, the information of several preceding video frames and several subsequent video frames is fused to obtain the predicted output corresponding to that video frame, namely the corrected two-dimensional position information mentioned in the embodiments of the present application. In this way, the two-dimensional position information corresponding to each video frame in the video frame sequence can be output through the first time convolution network. It can be understood that the two-dimensional position information here is the position information corresponding to each key point of the target object in each video frame.
In one embodiment, the original position information includes two-dimensional original coordinates, and the two-dimensional position information includes two-dimensional modified coordinates. Step S206, namely the step of inputting the original position information corresponding to each video frame in the video frame sequence into the pre-trained first time convolution network to obtain the corrected two-dimensional position information corresponding to each video frame, specifically includes: for each video frame, splicing the two-dimensional original coordinates corresponding to each key point in the video frame to obtain a corresponding first vector; sequentially inputting the first vector corresponding to each video frame to the pre-trained first time convolution network according to the time sequence of the video frame sequence formed by the video frames; and, for each video frame, fusing the information of the preceding video frames and subsequent video frames of that video frame through one or more convolution layers in the first time convolution network to obtain the corrected second vector corresponding to that video frame; the second vector is used to represent the two-dimensional modified coordinates.
In one embodiment, for each video frame, the computer device flattens and splices the two-dimensional original coordinates corresponding to each key point in the video frame to obtain the corresponding first vector. For example, when the target object has 17 key points, the two-dimensional original coordinates corresponding to each key point are: key point 1 (x1, y1); key point 2 (x2, y2); ...; key point 17 (x17, y17). The computer device can flatten the two-dimensional original coordinates of the 17 key points and splice them to obtain a 34-dimensional first vector. The first vector may be represented as H1 = (x1, y1, x2, y2, ..., x17, y17) or H2 = (x1, x2, ..., x17, y1, y2, ..., y17).
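As a minimal sketch of this flattening step (assuming NumPy, the 17-key-point example, and the H1 ordering above; the function name is illustrative):

```python
import numpy as np

def flatten_keypoints(keypoints_2d):
    """Flatten the two-dimensional original coordinates of one video frame's key points
    into a first vector with the ordering (x1, y1, x2, y2, ..., x17, y17)."""
    # keypoints_2d has shape (17, 2); occluded (invalid) key points are already (0, 0)
    return np.asarray(keypoints_2d, dtype=np.float32).reshape(-1)

frame_keypoints = np.zeros((17, 2), dtype=np.float32)  # placeholder detections for one frame
first_vector = flatten_keypoints(frame_keypoints)
print(first_vector.shape)                               # (34,)
```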
Furthermore, the computer device may sequentially input the first vector corresponding to each video frame to the pre-trained first time convolution network according to the time sequence of the video frame sequence formed by the video frames. For each video frame, the information of the preceding video frames and subsequent video frames of that video frame is fused through one or more convolution layers in the first time convolution network to obtain the corrected second vector corresponding to that video frame. The second vector may be used to represent the two-dimensional modified coordinates.
In one embodiment, the computer device may perform one-dimensional convolution on the series of first vectors and, after several convolution layers, output a new second vector for each video frame. The dimensions of the second vector and the first vector remain the same, but the second vector's estimate of the key point position information is smoother and more accurate in time sequence. For the coordinates corresponding to the invalid key points mentioned in the foregoing embodiment, the first time convolution network does not update the zeroed coordinates of the invalid key points.
In the above embodiment, the original position information of the key points corresponding to each video frame in the video frame sequence is processed through one or more convolution layers in the first time convolution network, so that the information of the preceding video frames and subsequent video frames of each video frame can be fused to obtain the two-dimensional modified coordinates of the key points corresponding to each video frame, and the two-dimensional modified coordinates are smoother and more accurate in time sequence.
And S208, respectively inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a pre-trained second time convolution network to obtain the three-dimensional position information corresponding to each video frame.
Specifically, the computer device may sequentially input the two-dimensional position information corresponding to each video frame to the pre-trained second time convolution network according to the time sequence of the video frame sequence formed by the video frames. When the corresponding output of a certain video frame is predicted through one or more convolution layers in the second time convolution network, the information of several preceding video frames and several subsequent video frames is fused to obtain the predicted output corresponding to that video frame, namely the three-dimensional position information mentioned in the embodiments of the present application. In this way, the three-dimensional position information corresponding to each video frame in the video frame sequence can be output through the second time convolution network. It can be understood that the three-dimensional position information here is the position information in three-dimensional space corresponding to each key point of the target object in each video frame.
In one embodiment, the three-dimensional position information includes three-dimensional position coordinates. Step S208, namely the step of inputting the two-dimensional position information corresponding to each video frame in the video frame sequence into the pre-trained second time convolution network to obtain the three-dimensional position information corresponding to each video frame, specifically includes: sequentially inputting the second vector corresponding to each video frame to the pre-trained second time convolution network according to the time sequence of the video frame sequence formed by the video frames; and, for each video frame, fusing the information of the preceding video frames and subsequent video frames of that video frame through one or more convolution layers in the second time convolution network to obtain the third vector corresponding to that video frame; the third vector is used to represent the three-dimensional position coordinates.
In one embodiment, the computer device may sequentially input the second vector corresponding to each video frame to the pre-trained second time convolution network according to the time sequence in which the video frames form the video frame sequence. For each video frame, the information of the preceding video frames and subsequent video frames of that video frame is fused through one or more convolution layers in the second time convolution network to obtain the third vector corresponding to that video frame. The third vector may be used to represent the three-dimensional position coordinates. When processing, the second time convolution network gives a final prediction for all key points, whether or not they were reliably detected.
It will be appreciated that the third vector may be used to represent three-dimensional modified coordinates, and that accordingly the third vector will include 1.5 times the number of elements included in the second vector. For example, when the target object has 17 key points, the second vector corresponding to each frame of video is a 34-dimensional vector, and correspondingly, the third vector corresponding to each frame of video is a 51-dimensional vector.
In one embodiment, the second time convolution network and the first time convolution network may have the same network structure, except that, for the last convolution layer, the number of convolution kernels (filters) of the second time convolution network is 1.5 times that of the first time convolution network. For example, the last layer of the first time convolution network has 34 filters, while the last layer of the second time convolution network has 51 filters. Of course, the network structures of the first time convolution network and the second time convolution network may also be different, which is not limited in this embodiment of the present application.
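A minimal sketch of such a pair of time convolution networks is given below (assuming PyTorch; the hidden width, kernel size, number of layers, and sequence length are illustrative assumptions, and only the 34-dimensional input/output of the first network and the 51-dimensional output of the second network follow from the 17-key-point example above):

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """One-dimensional convolution over the time axis of a sequence of per-frame
    key point vectors (a sketch, not the architecture claimed in this application)."""
    def __init__(self, in_dim, out_dim, hidden=128, kernel_size=3, layers=3):
        super().__init__()
        blocks, dim = [], in_dim
        for _ in range(layers - 1):
            # padding keeps the sequence length, so every frame receives an output
            blocks += [nn.Conv1d(dim, hidden, kernel_size, padding=kernel_size // 2),
                       nn.ReLU()]
            dim = hidden
        blocks.append(nn.Conv1d(dim, out_dim, kernel_size, padding=kernel_size // 2))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                    # x: (batch, frames, in_dim)
        y = self.net(x.transpose(1, 2))      # convolve along the frame axis
        return y.transpose(1, 2)             # (batch, frames, out_dim)

first_tcn = TemporalConvNet(in_dim=34, out_dim=34)   # corrects the 2D coordinates
second_tcn = TemporalConvNet(in_dim=34, out_dim=51)  # lifts 2D coordinates to 3D

frames = torch.randn(1, 100, 34)             # 100 consecutive frames of first vectors (illustrative)
second_vectors = first_tcn(frames)           # (1, 100, 34)
third_vectors = second_tcn(second_vectors)   # (1, 100, 51)
```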
Before the first time convolution network and the second time convolution network are used, they are trained with both two-dimensional and three-dimensional data through a semi-supervised learning method, so that a well-trained first time convolution network and second time convolution network are obtained. The training process of the first time convolution network and the second time convolution network will be described in detail in the following embodiments.
In the above embodiment, the two-dimensional position information of the key points corresponding to each video frame in the video frame sequence is processed through one or more convolution layers in the second time convolution network, and the information of the preceding video frames and subsequent video frames of each video frame can be fused to obtain the three-dimensional position information of the key points corresponding to each video frame. In this way, the three-dimensional coordinates corresponding to each key point can be predicted through the second time convolution network.
And S210, determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
Specifically, the computer device may determine three-dimensional position information of key points corresponding to a target object in each video frame in the sequence of video frames, and connect the key points corresponding to each video frame to generate a three-dimensional skeleton corresponding to the target object. The posture of the target object in the three-dimensional space can be determined by the three-dimensional skeleton obtained by connecting the key points. In this way, according to the time sequence of the video frame sequence formed by the video frames, the computer device can extract the three-dimensional skeleton for executing the target action according to the three-dimensional skeleton corresponding to each video frame.
In one embodiment, the computer device obtains, through the processing of the second time convolution network, the third vector corresponding to each video frame. The three-dimensional coordinates corresponding to each key point are extracted from the third vector by reversing the flattening and splicing previously performed on the two-dimensional original coordinates of the key points in the video frame. Based on the three-dimensional coordinates corresponding to each key point, the key points are connected in three-dimensional space to obtain the three-dimensional skeleton of the target object in three-dimensional space.
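A minimal sketch of this reversal step (assuming NumPy and 17 key points; the bone connection list is illustrative only and is not taken from the present application):

```python
import numpy as np

def unflatten_keypoints_3d(third_vector, num_keypoints=17):
    """Recover per-key-point 3D coordinates from a 51-dimensional third vector,
    reversing the flattening order used to build the first vector."""
    return np.asarray(third_vector).reshape(num_keypoints, 3)   # shape (17, 3)

# Illustrative bone list: pairs of key point indices connected into a skeleton.
BONES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6)]

third_vector = np.zeros(51)                      # placeholder network output for one frame
joints_3d = unflatten_keypoints_3d(third_vector)
skeleton = [(joints_3d[a], joints_3d[b]) for a, b in BONES]     # 3D line segments
```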
In one embodiment, the method for generating a three-dimensional skeleton further includes a step of implementing a target action executed by the virtual object, where the step specifically includes: acquiring a preset virtual object; and adjusting the three-dimensional skeleton of the virtual object according to the three-dimensional skeleton corresponding to each video frame in the video frame sequence so as to realize that the virtual object executes the target action.
In particular, the computer device may extract, from the video data, a three-dimensional skeleton that performs the target action. The computer device may acquire a preset virtual object, which may be a virtual character, such as a game character or an animation character. The computer device then adjusts the three-dimensional skeleton of the virtual object according to the extracted three-dimensional skeleton, so that the virtual object executes the corresponding target action. In this way, the computer device can restore the actions of various real scenes through the extracted three-dimensional skeleton. For example, for a game application, the computer device may extract actions from a large number of videos on the network and apply the extracted actions to the design of actions for a game, such as street basketball or dance. Extracting actions from a large number of videos on the network can therefore greatly reduce cost, and actions in various real scenes can be restored more faithfully.
Referring to fig. 3A and 3B, fig. 3A is an interface display diagram of an embodiment in which the three-dimensional skeleton extracted by the above-described three-dimensional skeleton generation method is applied to a design such as a street basketball game. Referring to fig. 3A, in the game scene, the three-dimensional skeleton of the virtual object 301 may be adjusted according to the extracted three-dimensional skeleton, so that the virtual object 301 performs the corresponding target action. For example, the computer device may obtain a video of a basketball player performing a shooting action, and perform the three-dimensional skeleton generation method mentioned in the embodiments of the present application on the video to obtain a three-dimensional skeleton of a human body of the basketball player during the shooting action. And then the human body three-dimensional skeleton is applied to the game design of the street basketball, so that the game role of the street basketball game executes corresponding shooting actions according to the human body three-dimensional skeleton. Fig. 3B is an interface display diagram in an embodiment, in which the three-dimensional skeleton extracted by the three-dimensional skeleton generation method is applied to a design of a dance game. Referring to fig. 3B, in the game scene, the three-dimensional skeleton of the virtual object 302 may be adjusted according to the extracted three-dimensional skeleton, so as to implement that the virtual object 302 performs the corresponding target action. For example, the computer device may obtain a video of a dancing actor dancing, and execute the three-dimensional skeleton generation method mentioned in the embodiments of the present application on the video to obtain a three-dimensional skeleton of a human body of the dancing actor when dancing. And then the human body three-dimensional skeleton is applied to the dance game design, so that the dance game role executes corresponding dance actions according to the human body three-dimensional skeleton.
According to the above three-dimensional skeleton generation method, initial prediction of the key points of the target object is realized by detecting a group of temporally consecutive video frames, and the original position information of the key points in each frame is obtained. The original position information of each key point is then input into a pre-trained first time convolution network to obtain relatively smooth and stable two-dimensional position information. The two-dimensional position information is then further input into a pre-trained second time convolution network for estimating the three-dimensional posture, so that the three-dimensional skeleton of the target object is extracted from the video frame sequence. In this way, the three-dimensional skeleton of the target object can be accurately extracted from existing video data without hiring an actor to perform the corresponding actions in order to acquire three-dimensional coordinates; the extracted three-dimensional skeleton can assist a virtual object in realizing the corresponding actions; the need to hire actors for motion capture is greatly reduced or eliminated; the extraction of the three-dimensional skeleton is not limited by scene conditions; and the flexibility of extracting the three-dimensional skeleton is greatly improved.
In one embodiment, the step S204, that is, the step of determining the original position information of the key points of the target object in the video frames for each video frame in the sequence of video frames specifically includes:
s402, respectively determining standard images including the target object from each video frame of the video frame sequence.
Wherein the standard image is an image having a preset size. Specifically, the computer device may perform target detection on each video frame in the sequence of video frames, divide an area including the target object from each video frame, and adjust the divided area to a standard size to obtain a standard image.
In one embodiment, the step S402 specifically includes: respectively carrying out target detection on each video frame in the video frame sequence, and determining a rectangular boundary frame comprising a target object in each video frame; for each frame of video frame, respectively adjusting the corresponding rectangular bounding box into a square bounding box which takes the long edge of the rectangular bounding box as the side length and has a constant central point; and for each frame of video, respectively carrying out normalization processing on the regions determined by the corresponding square bounding boxes to obtain a standard image comprising the target object.
Specifically, the computer device may perform target detection on each video frame, and determine different objects appearing in each video frame, where the detected objects in the video frames are marked by bounding boxes of corresponding sizes. The computer device may sift out rectangular bounding boxes that include the target object from the bounding boxes that label different objects. Furthermore, to avoid the proportional distortion of the target object, the computer device may adjust the rectangular bounding box to a square bounding box with the long side of the rectangular bounding box as the side length and the central point unchanged. Then, the area determined by the square bounding box is normalized, for example, the area determined by the square bounding box is scaled to a preset standard size, for example, 256 × 256, so as to obtain a standard image including the target object.
In the above embodiment, the region including the target object is extracted from each video frame and normalized to obtain a standard image with the standard size, so that key point detection can be conveniently and accurately performed on the standard image including the target object in the subsequent steps.
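A minimal sketch of this normalization step is shown below (assuming OpenCV; the 256 x 256 target size is the example size mentioned above, and the function name and border handling are illustrative assumptions):

```python
import cv2

def crop_standard_image(frame, bbox, size=256):
    """Expand a rectangular bounding box into a square box with the same center and
    the long side as side length, crop that region, and scale it to size x size."""
    x, y, w, h = bbox                                  # rectangular box of the target object
    cx, cy, side = x + w / 2, y + h / 2, max(w, h)
    x0, y0 = int(round(cx - side / 2)), int(round(cy - side / 2))
    x1, y1 = x0 + int(side), y0 + int(side)
    # pad the frame so the square region stays valid even near the image border
    pad = int(side)
    padded = cv2.copyMakeBorder(frame, pad, pad, pad, pad, cv2.BORDER_CONSTANT)
    region = padded[y0 + pad:y1 + pad, x0 + pad:x1 + pad]
    return cv2.resize(region, (size, size))
```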
S404, respectively carrying out key point detection on each standard image to obtain heat maps corresponding to different key points in each standard image; the values of the heat map represent the certainty that the corresponding location in the standard image was detected as a keypoint.
Specifically, the computer device may perform key point detection on each standard image using a key point detection algorithm, such as the Stacked Hourglass algorithm, that is, make a key point estimate of the target object. In the process of detecting the key points, the computer device outputs an estimated heat map for each key point, the heat maps corresponding respectively to the distribution probabilities of the different key points. The value at each position in a heat map represents the certainty that the corresponding position in the standard image is detected as the key point.
For example, when there are 17 key points, the computer device may output 17 heat maps for each standard image, corresponding respectively to the distribution probabilities of the 17 key points. For the heat map corresponding to each key point, the computer device may take the position of its maximum value as the estimate of the position of the current key point.
S406, determining the original position information of the key points of the target object in each video frame according to the peak values of the heat map corresponding to the key points.
Specifically, for each keypoint, the computer device may select a position coordinate corresponding to a peak of its corresponding heat map as the original position information of the current keypoint.
In an embodiment, the step S406, that is, the step of determining the original position information of the key point of the target object in each video frame according to the peak value of the heat map corresponding to the key point specifically includes: for each frame of video frame, when the peak value of a heat map corresponding to a key point in the video frame is greater than or equal to a preset threshold value, taking the coordinate of the position corresponding to the peak value as the original position information of the corresponding key point; and regarding each frame of video frame, when the peak value of the heat map corresponding to the key point in the video frame is smaller than a preset threshold value, taking the origin point coordinate as the original position information of the corresponding key point.
In one embodiment, the values in the heat map themselves represent the confidence of the key point estimate; a higher value in the heat map indicates a greater likelihood that the position is detected as the key point. When the peak value of the heat map is greater than or equal to the preset threshold value, the computer device may use the coordinates of the position corresponding to the peak value as the original position information of the corresponding key point. The coordinates (x, y) of the position corresponding to the peak may specifically be coordinates determined with the lower left corner of the standard image as the origin. When the peak value is smaller than the preset threshold value, the key point may be occluded, so the reliability of the detection is reduced and the accuracy is not high enough, and the key point can be determined to be an invalid key point. The computer device can set the coordinates of that key point to zero (0, 0), that is, take the origin coordinate as the original position information of the invalid key point. In this way, the position information of unreliable key points is set to zero, and in the subsequent processing by the first time convolution network and the second time convolution network, the three-dimensional position information of the whole sequence can be estimated from the original position information of the other valid key points, which greatly improves the accuracy of the three-dimensional position information.
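A minimal sketch of this peak-selection rule (assuming NumPy; the threshold value is illustrative, and the sketch uses the usual array-index origin rather than the lower-left-corner convention described above):

```python
import numpy as np

def keypoint_from_heatmap(heatmap, threshold=0.3):
    """Return the original position information for one key point: the coordinates of
    the heat map peak if the peak is confident enough, otherwise the origin (0, 0)."""
    peak = heatmap.max()
    if peak < threshold:                       # occluded / unreliable: invalid key point
        return (0, 0)
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return (int(x), int(y))                    # (x, y) position of the peak

heatmap = np.zeros((64, 64))
heatmap[40, 12] = 0.9                          # a confident detection at (x=12, y=40)
print(keypoint_from_heatmap(heatmap))          # (12, 40)
```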
In the embodiment, the key point detection is performed on the standard image including the target object, and the heat map corresponding to each key point can be quickly and accurately determined, so that the original position information of the key point can be intuitively and accurately determined according to the heat map.
In a specific embodiment, the method for generating a three-dimensional skeleton specifically includes the following steps:
s502, a video frame sequence comprising the target object is obtained.
S504, for each video frame in the video frame sequence, candidate objects included in the video frame are respectively detected.
S506, a target object for executing the target action is screened out from the candidate objects included in each video frame.
And S508, respectively carrying out target detection on each video frame, and determining a rectangular boundary frame including a target object in each video frame.
And S510, for each frame of video frame, respectively adjusting the corresponding rectangular bounding box into a square bounding box with the long side of the rectangular bounding box as the side length and the central point unchanged.
S512, for each frame of video, respectively carrying out normalization processing on the areas determined by the corresponding square bounding boxes to obtain a standard image comprising the target object.
S514, respectively detecting key points of each standard image to obtain heat maps corresponding to different key points in each standard image; the values of the heat map represent the certainty that the corresponding location in the standard image was detected as a keypoint.
S516, regarding each frame of video, when the peak value of the heat map corresponding to the key point in the video frame is greater than or equal to a preset threshold value, taking the coordinate of the position corresponding to the peak value as the two-dimensional original coordinate of the corresponding key point.
S518, for each frame of video, when the peak value of the heat map corresponding to the key point in the video frame is smaller than a preset threshold value, the origin point coordinate is used as the two-dimensional original coordinate of the corresponding key point.
S520, for each frame of video frame, splicing the two-dimensional original coordinates corresponding to each key point in the video frame to obtain a corresponding first vector.
S522, according to the time sequence of the video frame sequence formed by the video frames, the first vector corresponding to each video frame is input to the pre-trained first time convolution network in sequence.
S524, for each video frame, fusing the information of the preceding video frames and subsequent video frames of that video frame through one or more convolution layers in the first time convolution network to obtain the corrected second vector corresponding to that video frame; the second vector is used to represent the two-dimensional modified coordinates.
And S526, sequentially inputting the second vector corresponding to each video frame to the pre-trained second time convolution network according to the time sequence of the video frame sequence formed by the video frames.
S528, for each video frame, fusing the information of the preceding video frames and subsequent video frames of that video frame through one or more convolution layers in the second time convolution network to obtain the third vector corresponding to that video frame; the third vector is used to represent the three-dimensional position coordinates.
S530, determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position coordinates.
S532, the preset virtual object is obtained.
S534, the three-dimensional skeleton of the virtual object is adjusted according to the three-dimensional skeleton corresponding to each video frame in the video frame sequence, so that the virtual object executes the target action.
According to the above three-dimensional skeleton generation method, initial prediction of the key points of the target object is realized by detecting a group of temporally consecutive video frames, and the original position information of the key points in each frame is obtained. The original position information of each key point is then input into a pre-trained first time convolution network to obtain relatively smooth and stable two-dimensional position information. The two-dimensional position information is then further input into a pre-trained second time convolution network for estimating the three-dimensional posture, so that the three-dimensional skeleton of the target object is extracted from the video frame sequence. In this way, the three-dimensional skeleton of the target object can be accurately extracted from existing video data without hiring an actor to perform the corresponding actions in order to acquire three-dimensional coordinates; the extracted three-dimensional skeleton can assist a virtual object in realizing the corresponding actions; the need to hire actors for motion capture is greatly reduced or eliminated; the extraction of the three-dimensional skeleton is not limited by scene conditions; and the flexibility of extracting the three-dimensional skeleton is greatly improved.
FIG. 5 is a flowchart illustrating a method for generating a three-dimensional skeleton according to an embodiment. It should be understood that, although the steps in the flowchart of FIG. 5 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in FIG. 5 may include multiple sub-steps or multiple stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, the three-dimensional skeleton generation method is performed by a three-dimensional skeleton generation model, the three-dimensional skeleton generation model comprising a first time convolution network and a second time convolution network; the three-dimensional skeleton generation model is obtained by training a video frame sample sequence and marking information corresponding to key points of sample objects included in each video frame in the video frame sample sequence, wherein the video frame sample sequence comprises a first video frame sample sequence and a second video frame sample sequence, and the marking information comprises two-dimensional marking information corresponding to each first video frame in the first video frame sample sequence and two-dimensional marking information and three-dimensional marking information corresponding to each second video frame in the second video frame sample sequence.
In one embodiment, the three-dimensional skeleton generation model further comprises a network structure for performing target detection and tracking and key point detection, and the network structure can be trained in advance through a conventional training mode. In the process of training the three-dimensional skeleton generation model, the network parameters of the network structure can be kept unchanged, and only the network parameters of the first time convolution network and the second time convolution network to be trained are adjusted, so that the model training efficiency can be greatly improved.
In one embodiment, the training step of the three-dimensional skeleton generation model comprises the steps of:
S602, acquiring a video frame sample sequence and labeling information corresponding to each video frame in the video frame sample sequence.
The video frame sample sequence and the labeling information corresponding to each video frame in the video frame sample sequence are training data. The sequence of video frame samples comprises a sample object for performing a sample action, the sequence of video frame samples comprising a first sequence of video frame samples and a second sequence of video frame samples. The annotation information corresponding to each first video frame in the first video frame sample sequence is two-dimensional annotation information for performing two-dimensional annotation on the key point of the sample object included in each first video frame. The labeling information corresponding to each second video frame in the second video frame sample sequence is two-dimensional labeling information for performing two-dimensional labeling on the key point of the sample object included in each second video frame and three-dimensional labeling information for performing three-dimensional labeling.
Specifically, the computer device may obtain training data required by embodiments of the present application from a data set published on a network. Or, the computer device may obtain the video frame sample sequence from a local or other computer device, and label the key points of the sample object in each video frame in a manual labeling or machine labeling manner to obtain corresponding labeling information.
In one embodiment, the step S602, that is, the step of obtaining the video frame sample sequence and the annotation information corresponding to each video frame in the video frame sample sequence, includes steps S702 to S708, which are described as follows:
S702, acquiring a first video frame sample sequence and two-dimensional labeling information for performing two-dimensional labeling on the key points of the sample object included in each first video frame in the first video frame sample sequence.
In one embodiment, the first video frame sample sequence may specifically be video data including a sample object obtained from a network, or video data including a sample object captured by the computer device through a camera. The valid key points of the sample object in each video frame are labeled in a manual or machine labeling manner to obtain the two-dimensional labeling coordinates of the key points, and these two-dimensional labeling coordinates are the two-dimensional labeling information. It is understood that valid key points here refer to key points in the first video frame that are unoccluded and visible to the human eye, while the position coordinates of invalid key points in the first video frame, i.e. occluded key points not visible to the human eye, are zeroed out.
S704, acquiring a second video frame sample sequence and three-dimensional labeling information, collected by a three-dimensional sensor, corresponding to the key points of the sample object included in each second video frame in the second video frame sample sequence.
In an embodiment, the second video frame sample sequence may be obtained by recording, in an indoor motion capture room, a user wearing three-dimensional sensors, and the three-dimensional labeling coordinates corresponding to the key points may be obtained directly from the three-dimensional sensors worn by the user; these three-dimensional labeling coordinates are the three-dimensional labeling information.
S706, determining invalid key points and valid key points among the key points corresponding to each second video frame in the second video frame sample sequence by combining the acquired three-dimensional labeling information with the preset three-dimensional model.
The preset three-dimensional model may be a biological model having a three-dimensional shape in three-dimensional space. For example, when the sample object is a person, the preset three-dimensional model may be a human body cylinder model, which simulates the shape of the human body and the space it occupies in three dimensions. In other embodiments, the computer device may estimate the three-dimensional space occupied by the human body with methods other than a cylinder model, for example SMPL (skinned multi-person linear model, a three-dimensional model of the human body), and the embodiments of the present application are not limited herein.
Specifically, the computer device can judge, through the preset three-dimensional model and in combination with the collected three-dimensional labeling information of each key point, whether each key point is occluded in the preset three-dimensional model. The occluded key points are invalid key points, and the unoccluded key points are valid key points. Occlusion means that, viewed from a given viewing angle, the key point is blocked by other parts of the preset three-dimensional model and therefore cannot be observed from that viewing angle.
Referring to fig. 8A and 8B, fig. 8A is a structural view of a human body cylinder model in one embodiment. The human body cylinder model can be used to approximate different parts of the human body, with the purpose of judging which key points are occluded for a given three-dimensional skeleton. For any three-dimensional skeleton, the computer device can approximate the space occupied by the human body in three dimensions through the cylinder model. Referring to fig. 8B, fig. 8B is a schematic reference diagram illustrating, in one embodiment, how the human body cylinder model determines whether a key point is occluded. As shown in fig. 8B, given a viewing angle, any key point P is connected to the observation point by a line segment; if that line segment passes through at least one cylindrical part, the computer device can determine that this key point (such as key point o in fig. 8B) is occluded.
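To make the occlusion test concrete, the following is a minimal sketch, not the patented implementation: the segment between the observation point and the key point is sampled densely, and if any sample falls inside one of the body cylinders, the key point is treated as occluded. The cylinder parameters, coordinates, and sampling density are illustrative assumptions.

```python
import numpy as np

def point_in_cylinder(p, a, b, radius):
    """Return True if point p lies inside the finite cylinder whose axis runs from a to b."""
    axis = b - a
    t = np.dot(p - a, axis) / np.dot(axis, axis)   # position along the axis, in [0, 1] between the caps
    if t < 0.0 or t > 1.0:
        return False
    closest = a + t * axis                          # closest point on the axis
    return np.linalg.norm(p - closest) <= radius

def is_occluded(keypoint, viewpoint, cylinders, num_samples=100):
    """Sample the segment viewpoint->keypoint; occluded if any interior sample hits a cylinder."""
    for s in np.linspace(0.0, 1.0, num_samples)[1:-1]:   # skip the two endpoints themselves
        p = viewpoint + s * (keypoint - viewpoint)
        if any(point_in_cylinder(p, a, b, r) for a, b, r in cylinders):
            return True
    return False

# Example with one assumed upper-arm cylinder between shoulder and elbow (made-up coordinates).
cylinders = [(np.array([0.2, 1.4, 0.0]), np.array([0.45, 1.1, 0.0]), 0.06)]
viewpoint = np.array([0.0, 1.2, 3.0])
keypoint = np.array([0.3, 1.25, -0.05])
print(is_occluded(keypoint, viewpoint, cylinders))
```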
S708, taking the position information corresponding to the coordinate origin as the two-dimensional labeling information corresponding to the invalid key points, and determining the two-dimensional labeling information corresponding to the valid key points according to the three-dimensional labeling information.
Specifically, the computer device may zero out the occluded invalid key points and use the position information corresponding to the coordinate origin as their two-dimensional labeling information. For the valid key points, the computer device can obtain the camera viewing angle defined when the three-dimensional labeling information was collected and determine the corresponding rotation-translation matrix, which represents the viewing orientation of the camera. Multiplying the three-dimensional coordinates of these key points by the rotation-translation matrix yields the corresponding two-dimensional labeling coordinates, which are the two-dimensional labeling information of the valid key points.
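As a rough illustration of this step, the sketch below transforms valid three-dimensional labeling coordinates into the camera frame with a rotation-translation matrix and reduces them to two dimensions, while invalid key points are set to the origin. This is a simplified assumption-based sketch: the orthographic reduction to (x, y) stands in for a full camera model, which would also apply intrinsics and perspective division.

```python
import numpy as np

def project_keypoints(points_3d, R, t, valid_mask):
    """
    points_3d:  (K, 3) three-dimensional labeling coordinates
    R, t:       camera rotation (3, 3) and translation (3,) from the annotation viewpoint
    valid_mask: (K,) True for valid (unoccluded) key points
    Returns (K, 2) two-dimensional labeling coordinates; invalid key points get the origin.
    """
    cam = points_3d @ R.T + t          # rotate and translate into the camera coordinate system
    labels_2d = cam[:, :2].copy()      # drop the depth axis (orthographic simplification)
    labels_2d[~valid_mask] = 0.0       # occluded key points: zeroed to the coordinate origin
    return labels_2d
```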
It can be understood that, in the model training process, because the amount of training data with three-dimensional labeling information is small, a large amount of two-dimensional data is often needed as a supplement during training. Conventional approaches cannot handle the occlusion problem when augmenting training with such two-dimensional data. In the manner described in this application, however, the preset three-dimensional model is used to judge which key points are occluded, so that the two-dimensional data of the occluded key points can be corrected by suppressing their two-dimensional labeling information to zero, which can greatly improve the accuracy of model training.
S604, inputting the video frame sample sequence as a sample of the three-dimensional skeleton generation model, and inputting the sample into the three-dimensional skeleton generation model for training.
Specifically, the computer device may input the video frame sample sequence as a sample of the three-dimensional skeleton generation model, and in a training process of the three-dimensional skeleton generation model, respectively input the first video frame sample sequence and the second video frame sample sequence into the three-dimensional skeleton generation model to be trained for training.
S606, determining the predicted two-dimensional position information corresponding to the sample input through a first time convolution network in the three-dimensional skeleton generation model, and determining the predicted three-dimensional position information corresponding to the sample input through a second time convolution network in the three-dimensional skeleton generation model.
Specifically, the computer device may process each video frame in the input video frame sample sequence through the three-dimensional skeleton generation model and determine the original position information of the key points of the sample object in each video frame. The original position information corresponding to the sample input is then input into the first time convolution network to be trained in the three-dimensional skeleton generation model, and the predicted two-dimensional position information corresponding to the sample input is obtained through the processing of the first time convolution network. The predicted two-dimensional position information is then input into the second time convolution network to be trained of the three-dimensional skeleton generation model, and the predicted three-dimensional position information corresponding to the sample input is determined through that network.
S608, according to the predicted two-dimensional position information, the predicted three-dimensional position information and the labeling information, a first loss function corresponding to the first video frame sample sequence and a second loss function corresponding to the second video frame sample sequence are respectively constructed.
Specifically, when the sample input is a first sequence of video frame samples, the computer device may construct a first loss function according to predicted two-dimensional position information, predicted three-dimensional position information, and two-dimensional label information corresponding to the sample input, and train the first and second time convolutional networks through the first loss function. When the sample input is a second sequence of video frame samples, the computer device may construct a second loss function based on the predicted two-dimensional position information, the predicted three-dimensional position information, and the two-dimensional annotation information and the three-dimensional annotation information corresponding to the sample input, and train the first time convolution network and the second time convolution network with the second loss function.
Step S608 itself is not shown in fig. 6; instead, the more detailed steps S608A to S608C that refine this step are shown in fig. 6. Steps S608A-S608C will be explained in detail in the following embodiments.
S610, for different sample inputs, executing the corresponding loss functions respectively, adjusting the network parameters of the first time convolution network and the second time convolution network according to the execution results of the corresponding loss functions, and continuing training until the training stop condition is met.
The training stop condition is the condition for ending model training. It may be that a preset number of iterations is reached, or that the performance index of the three-dimensional skeleton generation model with the adjusted network parameters reaches a preset index. Adjusting the network parameters of the three-dimensional skeleton generation model means adjusting the network parameters of the first time convolution network and the second time convolution network in the model.
Specifically, for different sample inputs, the computer device may respectively execute corresponding loss functions, adjust network parameters of the first time convolutional network and the second time convolutional network according to execution results of the corresponding loss functions, and continue training until training stopping conditions are met. For example, when the current sample input is a first sequence of video frame samples, the computer device may adjust network parameters of the first and second time convolutional networks according to a first loss function. When the current sample input is a second sequence of video frame samples, the computer device may adjust network parameters of the first and second time convolutional networks according to a second loss function.
It will be appreciated that during each training session, for each loss function, the computer device may adjust the network parameters in a direction to reduce the difference between the corresponding prediction and the annotation information. Therefore, corresponding predicted two-dimensional position information and predicted three-dimensional position information are obtained by continuously inputting a video frame sample sequence, and network parameters are adjusted according to the loss function so as to train a three-dimensional skeleton generation model and obtain the trained three-dimensional skeleton generation model.
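A compact way to picture this alternating scheme is the training loop sketched below. It is an illustration only: the batching, optimizer setup, and function names such as `first_loss` and `second_loss` are assumptions rather than the patented code. Batches from the first sample sequence drive the first loss function, batches from the second sample sequence drive the second loss function, and both update the two time convolution networks.

```python
import torch

def train(model, optimizer, loader_2d, loader_3d, first_loss, second_loss, num_epochs):
    """model is assumed to wrap the first and second time convolution networks; the detection
    backbone is frozen, so only the TCN parameters are registered with the optimizer."""
    for epoch in range(num_epochs):
        for batch_2d, batch_3d in zip(loader_2d, loader_3d):
            # First video frame sample sequence: only 2D labels -> first loss function.
            pred_2d, pred_3d = model(batch_2d["frames"])
            loss = first_loss(pred_2d, pred_3d, batch_2d["labels_2d"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Second video frame sample sequence: 2D + 3D labels -> second loss function.
            pred_2d, pred_3d = model(batch_3d["frames"])
            loss = second_loss(pred_2d, pred_3d, batch_3d["labels_2d"], batch_3d["labels_3d"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```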
With continued reference to fig. 6, in one embodiment, the step S608, that is, the step of constructing the first loss function corresponding to the first video frame sample sequence and the second loss function corresponding to the second video frame sample sequence according to the predicted two-dimensional position information, the predicted three-dimensional position information and the annotation information, respectively, specifically includes the steps S608A-S608C, which are detailed as follows:
S608A, a two-dimensional keypoint loss function is constructed according to a difference between two-dimensional labeling information corresponding to the sample input and predicted two-dimensional position information.
Specifically, for each frame of video frame sample in the sample input, the computer device may construct a two-dimensional keypoint loss function according to a difference between two-dimensional annotation information and predicted two-dimensional position information corresponding to the frame of video frame sample, respectively. The difference between the two-dimensional labeling information and the predicted two-dimensional position information may be a relative distance between a predicted two-dimensional coordinate of each estimated key point and an actual two-dimensional labeling coordinate. Here, the computer device may perform the calculation of the loss function only for the valid keypoints of the visible portion, and the invisible invalid keypoints are not included in the calculation.
In one embodiment, the two-dimensional key point loss corresponding to each video frame sample may be obtained by summing, over the key points, the squared distance between the predicted two-dimensional coordinate of each key point and the actual two-dimensional labeling coordinate, where the squared distance may be calculated as the square of the 2-norm of their difference. The two-dimensional key point loss function can be represented by the following formula:

$$L_{2D} = \sum_{i=1}^{K} \left\| M_i - \hat{M}_i \right\|_2^2$$

wherein $L_{2D}$ represents the two-dimensional key point loss corresponding to a single video frame sample; $i$ denotes the $i$-th key point and $K$ denotes the total number of key points; $M_i$ represents the two-dimensional labeling coordinate of the $i$-th key point; and $\hat{M}_i$ represents the predicted two-dimensional coordinate of the $i$-th key point.
For the entire video frame sample sequence, the two-dimensional key point loss corresponding to the sample input can be obtained by superimposing the two-dimensional key point losses of the individual video frame samples. The superposition may specifically be a direct sum of the two-dimensional key point losses of the video frame samples, or a weighted sum with preset weighting coefficients, and the embodiments of the present application are not limited herein.
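For reference, a minimal sketch of this two-dimensional key point loss is shown below (valid key points only, summed over frames). The tensor layout and the explicit validity mask are assumptions; torch tensors are assumed as inputs.

```python
import torch

def keypoint_loss_2d(pred_2d, labels_2d, valid_mask):
    """
    pred_2d, labels_2d: (T, K, 2) predicted and labeled 2D coordinates for T frames, K key points
    valid_mask:         (T, K) 1.0 for valid (visible) key points, 0.0 for invalid ones
    Squared 2-norm of the per-keypoint difference, summed over key points and frames.
    """
    sq_dist = ((pred_2d - labels_2d) ** 2).sum(dim=-1)   # (T, K)
    return (valid_mask * sq_dist).sum()
```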
S608B, when the sample input is the first video frame sample sequence, mapping the predicted three-dimensional position information corresponding to the sample input back to the two-dimensional space to obtain corresponding mapped two-dimensional position information, constructing a two-dimensional mapping loss function according to a difference between the two-dimensional labeling information corresponding to the sample input and the mapped two-dimensional position information, and constructing a first loss function according to the two-dimensional keypoint loss function and the two-dimensional mapping loss function.
Specifically, in the training process of the three-dimensional skeleton generation model, training data are input into the model and trained in batches. When the current sample input is the first video frame sample sequence, the labeling information of the first video frame sample sequence is only two-dimensional labeling information. At this time, for each input first video frame, the computer device may map the predicted three-dimensional position information corresponding to the current sample input back to the two-dimensional space to obtain the corresponding mapped two-dimensional position information. For example, if the current predicted three-dimensional position information is the predicted three-dimensional coordinates (x, y, z), the computer device may drop z and use (x, y) as the mapped two-dimensional coordinates.
Furthermore, the computer device may construct a two-dimensional mapping loss function according to a difference between two-dimensional annotation information and two-dimensional mapping position information corresponding to a current sample input, that is, each first video frame in the first video frame sample sequence. The difference between the two-dimensional labeling information and the two-dimensional mapping position information may be a relative distance between a two-dimensional mapping coordinate of each estimated key point and an actual two-dimensional labeling coordinate. Here, the computer device may perform the calculation of the two-dimensional mapping loss function only for the valid keypoints of the visible portion, and the invisible invalid keypoints are not included in the calculation.
In an embodiment, the two-dimensional mapping loss corresponding to each first video frame may be obtained by summing, over the valid key points, the squared distance between the mapped two-dimensional coordinate of each key point and the actual two-dimensional labeling coordinate, where the squared distance may be calculated as the square of the 2-norm of their difference. The two-dimensional mapping loss function can be expressed by the following formula:

$$L_{proj} = \sum_{i=1}^{K} v_i \left\| M_i - \tilde{M}_i \right\|_2^2$$

wherein $L_{proj}$ represents the two-dimensional mapping loss corresponding to a single first video frame; $i$ denotes the $i$-th key point and $K$ denotes the total number of key points; $v_i$ indicates whether the $i$-th key point is occluded in the data labeling, with $v_i = 1$ when the $i$-th key point is a valid key point and $v_i = 0$ when it is an invalid key point; $M_i$ represents the two-dimensional labeling coordinate of the $i$-th key point; and $\tilde{M}_i$ represents the mapped two-dimensional coordinate of the $i$-th key point.
For the entire first video frame sample sequence, the two-dimensional mapping loss corresponding to the sample input can be obtained by superimposing the two-dimensional mapping losses of the individual first video frames. The superposition may specifically be a direct sum of the two-dimensional mapping losses of the first video frames, or a weighted sum with preset weighting coefficients, and the embodiments of the present application are not limited herein.
Further, the computer device may construct a first loss function according to the two-dimensional keypoint loss function and the two-dimensional mapping loss function, where the first loss function is a loss function corresponding to the first video frame sample sequence.
In one embodiment, the computer device may perform weighted summation on the two-dimensional keypoint loss function and the two-dimensional mapping loss function to obtain a first loss function. The first loss function may be expressed by the following equation:
$$L_1 = a_1 L_{2D} + b_1 L_{proj}$$

wherein $a_1$ and $b_1$ are preset weighting coefficients, which may both be the constant 1 or may be different constants.
In one embodiment, during training of the three-dimensional skeleton generation model, the computer device may adjust network parameters of the first time convolution network and the second time convolution network according to the first loss function when the current sample input is the first video frame sample sequence. For example, the network parameter at which the first loss function is minimized may be used as the network parameter of the first time convolution network and the second time convolution network in the current training process. Next, the computer device may input a next batch of sample inputs and continue training until the training stop condition is met.
S608C, when the sample input is a second video frame sample sequence, constructing a three-dimensional keypoint loss function according to a difference between three-dimensional labeling information and predicted three-dimensional position information corresponding to the sample input, and constructing a second loss function according to the two-dimensional keypoint loss function and the three-dimensional keypoint loss function.
Specifically, when the current sample input is the second video frame sample sequence, it is obvious that the annotation information of the second video frame sample sequence includes two-dimensional annotation information and three-dimensional annotation information. At this time, for each frame of the second video frame that is input, the computer device may construct a three-dimensional keypoint loss function from differences between the three-dimensional annotation information and the predicted three-dimensional position information respectively corresponding to the currently input second video frame. The difference between the three-dimensional labeling information and the predicted three-dimensional position information may be a relative distance between a predicted three-dimensional coordinate of each estimated key point and an actual three-dimensional labeling coordinate.
In one embodiment, the three-dimensional key point loss corresponding to each second video frame may be obtained by summing, over the key points, the squared distance between the predicted three-dimensional coordinate of each key point and the actual three-dimensional labeling coordinate, where the squared distance may be calculated as the square of the 2-norm of their difference. The three-dimensional key point loss function can be represented by the following formula:

$$L_{3D} = \sum_{i=1}^{K} \left\| N_i - \hat{N}_i \right\|_2^2$$

wherein $L_{3D}$ represents the three-dimensional key point loss corresponding to a single second video frame; $i$ denotes the $i$-th key point and $K$ denotes the total number of key points; $N_i$ represents the three-dimensional labeling coordinate of the $i$-th key point; and $\hat{N}_i$ represents the predicted three-dimensional coordinate of the $i$-th key point.
For the entire second video frame sample sequence, the three-dimensional key point loss corresponding to the sample input can be obtained by superimposing the three-dimensional key point losses of the individual second video frames. The superposition may specifically be a direct sum of the three-dimensional key point losses of the second video frames, or a weighted sum with preset weighting coefficients, and the embodiments of the present application are not limited herein.
Further, the computer device may construct a second loss function according to the two-dimensional keypoint loss function and the three-dimensional keypoint loss function, where the second loss function is a loss function corresponding to the second video frame sample sequence.
In one embodiment, the computer device may perform weighted summation on the two-dimensional key point loss function and the three-dimensional key point loss function to obtain the second loss function. The second loss function may be expressed by the following equation:

$$L_2 = a_2 L_{2D} + b_2 L_{3D}$$

wherein $a_2$ and $b_2$ are preset weighting coefficients, which may both be the constant 1 or may be different constants.
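Following the same pattern, a sketch of the three-dimensional key point loss and the second loss function might look as follows; torch tensors are assumed as inputs and `keypoint_loss_2d` is the earlier sketch.

```python
def keypoint_loss_3d(pred_3d, labels_3d):
    """pred_3d, labels_3d: (T, K, 3); squared 2-norm summed over key points and frames."""
    return ((pred_3d - labels_3d) ** 2).sum()

def second_loss(pred_2d, pred_3d, labels_2d, labels_3d, valid_mask, a2=1.0, b2=1.0):
    """Weighted sum of the 2D key point loss and the 3D key point loss (assumed unit weights)."""
    return a2 * keypoint_loss_2d(pred_2d, labels_2d, valid_mask) \
         + b2 * keypoint_loss_3d(pred_3d, labels_3d)
```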
In one embodiment, during the training of the three-dimensional skeleton generation model, when the current sample input is a second video frame sample sequence, the computer device may adjust network parameters of the first time convolution network and the second time convolution network according to the second loss function. For example, the network parameter at which the second loss function is minimized may be used as the network parameter of the first time convolution network and the second time convolution network in the current training process. Next, the computer device may input a next batch of sample inputs and continue training until the training stop condition is met.
In the above embodiment, a semi-supervised learning method that uses both two-dimensional data and three-dimensional data is adopted in the model training process. When three-dimensional labeling information exists, a loss function is calculated on the predicted three-dimensional position information of the key points; when only two-dimensional labeling information exists, the predicted three-dimensional position information can be mapped back to the two-dimensional space for loss calculation. In this way, a well-performing three-dimensional skeleton generation model can be trained even with a small amount of three-dimensional data, ensuring the accuracy with which the trained model extracts the three-dimensional skeleton.
In one embodiment, the three-dimensional skeleton generation model in the training phase further comprises a countermeasure network, and the training step of the three-dimensional skeleton generation model further comprises: taking the predicted three-dimensional position information or the three-dimensional labeling information corresponding to each video frame in the sample input as input data of the countermeasure network, and classifying the input data through the countermeasure network to obtain the prediction category of the input data; and constructing a countermeasure loss function according to the prediction category and the reference category corresponding to the input data. In this case, the step of constructing the first loss function according to the two-dimensional key point loss function and the two-dimensional mapping loss function includes: constructing the first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, and the countermeasure loss function. The step of constructing the second loss function according to the two-dimensional key point loss function and the three-dimensional key point loss function includes: constructing the second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, and the countermeasure loss function.
It should be noted that the countermeasure network included in the three-dimensional skeleton generation model during the training phase assists in training the first time convolution network and the second time convolution network, and can be removed after training is finished. The countermeasure network is used to determine whether a set of three-dimensional key points constitutes a reasonable motion posture, for example a human motion posture that does not violate joint motion constraints.
In one embodiment, the computer device may use the predicted three-dimensional position information or the three-dimensional label information corresponding to each video frame in the sample input as input data for the countermeasure network. When the input data is input into the countermeasure network, the input data can be classified through the countermeasure network to obtain the prediction type of the input data, and then the computer equipment can construct the countermeasure loss function according to the prediction type and the reference type corresponding to the input data.
For example, when the input data of the countermeasure network is predicted three-dimensional position information, its corresponding reference category is a network-generated pose (which can be set to 0); when the input data is three-dimensional labeling information, its corresponding reference category is a true pose (which can be set to 1). Correspondingly, the countermeasure network classifies the input data and outputs the prediction category corresponding to the input data (the output may be a predicted value of pose plausibility in the range [0, 1]).
In one embodiment, the computer device may represent the counter-loss function by the following equation: l isdis=-∑j[ujlogqj+(1-uj)log(1-qj)](ii) a Wherein q isjFor the prediction class corresponding to the j-th input data of the input, ujAnd the j-th input data corresponds to a reference category.
Further, after calculating the countermeasure loss function, the computer device may construct the first loss function and the second loss function in combination with it. The step of constructing the first loss function from the two-dimensional key point loss function and the two-dimensional mapping loss function then specifically includes: constructing the first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, and the countermeasure loss function. The step of constructing the second loss function from the two-dimensional key point loss function and the three-dimensional key point loss function specifically includes: constructing the second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, and the countermeasure loss function.
That is, the computer device may perform weighted summation on the two-dimensional keypoint loss function, the two-dimensional mapping loss function, and the countermeasure loss function to obtain the first loss function. The first loss function may be expressed by the following equation:
$$L_1 = a_1 L_{2D} + b_1 L_{proj} + c_1 L_{dis}$$

wherein $a_1$, $b_1$, and $c_1$ are preset weighting coefficients, which may all be the constant 1 or may be different constants.
Correspondingly, the computer equipment can perform weighted summation processing on the two-dimensional key point loss function, the three-dimensional key point loss function and the countermeasure loss function to obtain a second loss function. The second loss function may be expressed by the following equation:
$$L_2 = a_2 L_{2D} + b_2 L_{3D} + c_2 L_{dis}$$

wherein $a_2$, $b_2$, and $c_2$ are preset weighting coefficients, which may all be the constant 1 or may be different constants.
Furthermore, in the training process of the three-dimensional skeleton generation model, when the current sample input is the first video frame sample sequence, the computer device may adjust the network parameters of the first time convolution network and the second time convolution network according to the first loss function. When the current sample input is the second video frame sample sequence, the computer device may adjust the network parameters of the first time convolution network and the second time convolution network according to the second loss function.
In this embodiment, the countermeasure loss function is added to the loss function, so that during training the model learns to adjust the output three-dimensional skeleton to a reasonable posture and to avoid violating joint motion constraints, which further improves the accuracy of three-dimensional skeleton extraction.
In one embodiment, the training step of the three-dimensional skeleton generation model further comprises: for each key point in each video frame in sample input, screening invalid two-dimensional position information corresponding to the invalid key point from the corresponding predicted two-dimensional position information, and determining predicted three-dimensional position information corresponding to the invalid two-dimensional position information; the invalid key points are shielded key points; determining a predicted shielding category corresponding to the invalid key point by combining the determined predicted three-dimensional position information and a preset three-dimensional model; and constructing an occlusion loss function according to the predicted occlusion category corresponding to each video frame sample in the sample input. Constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function and the countermeasure loss function, wherein the first loss function comprises the following steps: and constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, the countermeasure loss function and the shielding loss function. Constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function and the countermeasure loss function, wherein the second loss function comprises the following steps: and constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, the countermeasure loss function and the shielding loss function.
It can be understood that, in the model training process, if the confidence of the two-dimensional coordinates of a certain key point is low, that is, both its original position information and its two-dimensional position information are suppressed to zero, the model will tend to treat that key point as occluded in the preset three-dimensional model. If the key point is in fact not occluded, its prediction is deviated, and a corresponding loss needs to be added during training.
In one embodiment, the computer device may also construct an occlusion loss function. For each key point in each video frame of the sample input, the computer device can screen out, from the corresponding predicted two-dimensional position information, the invalid two-dimensional position information corresponding to the invalid key points, and determine the predicted three-dimensional position information corresponding to that invalid two-dimensional position information, i.e. the predicted three-dimensional position information of the invalid key points whose two-dimensional position information is suppressed to zero. The computer device can then determine the predicted occlusion category of each invalid key point by combining the determined predicted three-dimensional position information with the preset three-dimensional model. The predicted occlusion category is either occluded or unoccluded, where occluded may be represented by the value zero and unoccluded by the value one. Furthermore, the computer device may construct the occlusion loss function according to the predicted occlusion categories corresponding to the video frame samples in the sample input.
In one embodiment, the computer device may represent the occlusion loss function by the following formula:

$$L_{occ} = w \sum_{i \in Occ} o_i$$

wherein $w$ indicates whether the corresponding sample input has three-dimensional labeling information, taking one value when the sample input has three-dimensional labeling information and another value when it does not; $i$ denotes the $i$-th key point, and $Occ$ denotes the set of invalid key points whose predicted two-dimensional position information is suppressed to zero; and $o_i$ indicates whether the $i$-th invalid key point is judged to be occluded by combining its predicted three-dimensional position information with the preset three-dimensional model, taking one value when the preset three-dimensional model determines that the $i$-th invalid key point is not occluded and another value when it determines that the key point is occluded.
for the whole video frame sample sequence, the corresponding occlusion loss calculation mode of the sample input can be used for superposing the occlusion loss corresponding to each video frame sample. The overlapping mode may specifically adopt a mode of directly adding the occlusion losses corresponding to each video frame sample, or a mode of performing weighted summation according to a preset weighting coefficient, and the like, and the embodiment of the present application is not limited herein.
Further, after the computer device calculates the occlusion loss function, the computer device may respectively construct a first loss function and a second loss function in combination with the occlusion loss function. Wherein, according to two-dimensional key point loss function, two-dimensional mapping loss function and confrontation loss function, construct the first loss function, including: and constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, the countermeasure loss function and the shielding loss function. Constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function and the countermeasure loss function, wherein the second loss function comprises the following steps: and constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, the countermeasure loss function and the shielding loss function.
That is, the computer device may perform weighted summation processing on the two-dimensional keypoint loss function, the two-dimensional mapping loss function, the countermeasure loss function, and the occlusion loss function to obtain the first loss function. The first loss function may be expressed by the following equation:
$$L_1 = a_1 L_{2D} + b_1 L_{proj} + c_1 L_{dis} + d_1 L_{occ}$$

wherein $a_1$, $b_1$, $c_1$, and $d_1$ are preset weighting coefficients, which may all be the constant 1 or may be different constants.
Correspondingly, the computer device can perform weighted summation processing on the two-dimensional key point loss function, the three-dimensional key point loss function, the countermeasure loss function and the occlusion loss function to obtain a second loss function. The second loss function may be expressed by the following equation:
$$L_2 = a_2 L_{2D} + b_2 L_{3D} + c_2 L_{dis} + d_2 L_{occ}$$

wherein $a_2$, $b_2$, $c_2$, and $d_2$ are preset weighting coefficients, which may all be the constant 1 or may be different constants.
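Putting the pieces together, the full first and second loss functions are simply weighted sums of the four terms. The sketch below assumes unit weights by default and takes the individual loss values (for example from the earlier illustrative functions) as scalar inputs.

```python
def full_first_loss(l_2d, l_proj, l_dis, l_occ, a1=1.0, b1=1.0, c1=1.0, d1=1.0):
    """First loss: 2D key point + 2D mapping + countermeasure + occlusion terms."""
    return a1 * l_2d + b1 * l_proj + c1 * l_dis + d1 * l_occ

def full_second_loss(l_2d, l_3d, l_dis, l_occ, a2=1.0, b2=1.0, c2=1.0, d2=1.0):
    """Second loss: 2D key point + 3D key point + countermeasure + occlusion terms."""
    return a2 * l_2d + b2 * l_3d + c2 * l_dis + d2 * l_occ
```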
Furthermore, in the training process of the three-dimensional skeleton generation model, when the current sample input is the first video frame sample sequence, the computer device may adjust the network parameters of the first time convolution network and the second time convolution network according to the first loss function. When the current sample input is the second video frame sample sequence, the computer device may adjust the network parameters of the first time convolution network and the second time convolution network according to the second loss function.
In this embodiment, the occlusion loss function is added to the loss function, which reduces the error of erroneously estimating key points as occluded during model training, so that the accuracy of the three-dimensional skeleton generation model obtained by training with this loss function can be further improved.
Referring to FIG. 9, in a specific embodiment, the training step of the three-dimensional skeleton generation model includes:
S902, acquiring a first video frame sample sequence and two-dimensional labeling information for performing two-dimensional labeling on key points of sample objects included in each first video frame in the first video frame sample sequence.
S904, acquiring a second video frame sample sequence and three-dimensional labeling information, collected by a three-dimensional sensor, corresponding to the key points of the sample object included in each second video frame in the second video frame sample sequence.
S906, determining invalid key points and valid key points among the key points corresponding to each second video frame in the second video frame sample sequence by combining the acquired three-dimensional labeling information with the preset three-dimensional model.
S908, using the position information corresponding to the coordinate origin as the two-dimensional labeling information corresponding to the invalid key points, and determining the two-dimensional labeling information corresponding to the valid key points according to the three-dimensional labeling information.
S910, inputting the video frame sample sequence as a sample of the three-dimensional skeleton generation model, and inputting the sample into the three-dimensional skeleton generation model for training.
S912, determining predicted two-dimensional position information corresponding to the sample input through a first time convolution network in the three-dimensional skeleton generation model, and determining predicted three-dimensional position information corresponding to the sample input through a second time convolution network in the three-dimensional skeleton generation model.
S914, constructing a two-dimensional key point loss function according to the difference between the two-dimensional labeling information corresponding to the sample input and the predicted two-dimensional position information.
S916, using the predicted three-dimensional position information or the three-dimensional labeling information corresponding to each video frame in the sample input as input data of the countermeasure network, and classifying the input data through the countermeasure network to obtain the prediction category of the input data.
S918, constructing the countermeasure loss function according to the prediction category and the reference category corresponding to the input data.
S920, for each key point in each video frame in the sample input, screening invalid two-dimensional position information corresponding to the invalid key point from the corresponding predicted two-dimensional position information, and determining predicted three-dimensional position information corresponding to the invalid two-dimensional position information; invalid keypoints are occluded keypoints.
S922, determining the predicted occlusion category corresponding to the invalid key points by combining the determined predicted three-dimensional position information with the preset three-dimensional model.
S924, constructing an occlusion loss function according to the predicted occlusion categories corresponding to the video frame samples in the sample input.
S926, when the sample input is the first video frame sample sequence, mapping the predicted three-dimensional position information corresponding to the sample input back to a two-dimensional space to obtain corresponding mapped two-dimensional position information, constructing a two-dimensional mapping loss function according to the difference between the two-dimensional labeling information corresponding to the sample input and the mapped two-dimensional position information, and constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, the countermeasure loss function, and the occlusion loss function.
S928, when the sample input is the second video frame sample sequence, constructing a three-dimensional key point loss function according to the difference between the three-dimensional labeling information corresponding to the sample input and the predicted three-dimensional position information, and constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, the countermeasure loss function, and the occlusion loss function.
S930, for different sample inputs, executing the corresponding loss functions respectively, adjusting the network parameters of the first time convolution network and the second time convolution network according to the execution results of the corresponding loss functions, and continuing training until the training stop condition is met.
In the above embodiment, in the model training process, a semi-supervised learning method using two-dimensional data and three-dimensional data is adopted, when three-dimensional labeling information exists, a loss function is calculated for the predicted three-dimensional position information of the key point, and when only two-dimensional labeling information exists, the predicted three-dimensional position information can be mapped back to the two-dimensional space for loss function calculation. Moreover, an occlusion loss function and a countermeasure loss function are added to the loss function. Under the condition of small data volume of the three-dimensional data, the three-dimensional skeleton generation model with good effect can be obtained through training, and the accuracy of extracting the three-dimensional skeleton from the trained three-dimensional skeleton generation model is improved.
FIG. 9 is a flowchart illustrating the model training step in a three-dimensional skeleton generation method according to an embodiment. It should be understood that, although the steps in the flowchart of fig. 9 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not strictly limited in order and may be performed in other orders. Moreover, at least a portion of the steps in fig. 9 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
In a specific embodiment, referring to fig. 10, fig. 10 is a flowchart illustrating a three-dimensional skeleton generation method in an embodiment. As shown in fig. 10, the computer device may train a three-dimensional skeleton generation model through an AI (artificial intelligence) model training module according to videos in an action video collection library, thereby obtaining a trained three-dimensional skeleton generation model. On the other hand, the computer device can acquire a video including the target action, process and predict the video through the trained three-dimensional skeleton generation model, and obtain a three-dimensional skeleton that performs the target action. The three-dimensional skeleton is then applied to a game character model, so that the game character model performs the target action.
As shown in fig. 11, in one embodiment, a three-dimensional skeleton generation apparatus 1100 is provided, comprising an obtaining module 1101, a determining module 1102, a correcting module 1103, and a conversion module 1104, wherein:
an obtaining module 1101 is configured to obtain a sequence of video frames including a target object.
The determining module 1102 is configured to determine, for each video frame in the video frame sequence, original position information of a key point of a target object in the video frame.
The correcting module 1103 is configured to input the original position information corresponding to each video frame in the video frame sequence to the first time convolution network, respectively, to obtain corrected two-dimensional position information corresponding to each video frame.
The conversion module 1104 is configured to input the two-dimensional position information corresponding to each video frame in the sequence of video frames to the second time convolution network, respectively, so as to obtain three-dimensional position information corresponding to each video frame.
The determining module 1102 is further configured to determine a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
In one embodiment, the determining module 1102 is further configured to detect, for each video frame in the sequence of video frames, a candidate object included in the video frame; screening out a target object for executing a target action from candidate objects included in each video frame; and determining original position information of each key point of the target object in each video frame.
In one embodiment, the determining module 1102 is further configured to determine a standard image including the target object from each video frame of the sequence of video frames; respectively detecting key points of each standard image to obtain heat maps corresponding to different key points in each standard image; the value of the heat map represents the certainty that the corresponding position in the standard image is detected as a key point; and determining the original position information of the key points of the target object in each video frame according to the peak values of the heat maps corresponding to the key points.
In one embodiment, the determining module 1102 is further configured to perform target detection on each video frame in the sequence of video frames, and determine a rectangular bounding box including a target object in each video frame; for each frame of video frame, respectively adjusting the corresponding rectangular bounding box into a square bounding box which takes the long edge of the rectangular bounding box as the side length and has a constant central point; and for each frame of video, respectively carrying out normalization processing on the regions determined by the corresponding square bounding boxes to obtain a standard image comprising the target object.
In one embodiment, the determining module 1102 is further configured to, for each frame of the video frame, when a peak value of a heat map corresponding to a key point in the video frame is greater than or equal to a preset threshold, use a coordinate of a position corresponding to the peak value as original position information of the corresponding key point; and regarding each frame of video frame, when the peak value of the heat map corresponding to the key point in the video frame is smaller than a preset threshold value, taking the origin point coordinate as the original position information of the corresponding key point.
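A minimal sketch of this peak-and-threshold rule is given below; the array shapes and the threshold value are assumptions for illustration.

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps, threshold=0.3):
    """
    heatmaps: (K, H, W) one confidence map per key point
    Returns (K, 2) original position info: the peak (x, y) when the peak is confident enough,
    otherwise the origin (0, 0) for a suppressed key point.
    """
    coords = np.zeros((heatmaps.shape[0], 2))
    for i, hm in enumerate(heatmaps):
        y, x = np.unravel_index(np.argmax(hm), hm.shape)
        if hm[y, x] >= threshold:
            coords[i] = (x, y)
    return coords
```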
In one embodiment, the original position information includes two-dimensional original coordinates and the two-dimensional position information includes two-dimensional corrected coordinates. The correcting module 1103 is further configured to, for each video frame, splice the two-dimensional original coordinates corresponding to the key points in the video frame to obtain a corresponding first vector; sequentially input the first vectors corresponding to the video frames to a pre-trained first time convolution network in the time order of the video frame sequence; and, for each video frame, fuse information of the preceding and succeeding video frames of that video frame through more than one convolution layer in the first time convolution network to obtain a corrected second vector corresponding to the video frame, the second vector being used to represent the two-dimensional corrected coordinates.
In one embodiment, the three-dimensional position information includes three-dimensional position coordinates. The conversion module 1104 is further configured to sequentially input the second vectors corresponding to the video frames to a pre-trained second time convolution network in the time order of the video frame sequence, and, for each video frame, fuse information of the preceding and succeeding video frames of that video frame through more than one convolution layer in the second time convolution network to obtain a third vector corresponding to the video frame, the third vector being used to represent the three-dimensional position coordinates.
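The two time convolution networks can be pictured as 1D convolutions over the time axis of the concatenated key point coordinates, as in the sketch below. The layer sizes, kernel widths, and the assumed number of key points are illustrative assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Maps a sequence of per-frame key point vectors to refined per-frame vectors by
    fusing each frame with its preceding and succeeding frames via 1D convolutions."""
    def __init__(self, in_dim, out_dim, hidden=256, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2                      # keep the sequence length unchanged
        self.net = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size, padding=pad), nn.ReLU(),
            nn.Conv1d(hidden, out_dim, kernel_size, padding=pad),
        )

    def forward(self, x):                           # x: (batch, T, in_dim)
        return self.net(x.transpose(1, 2)).transpose(1, 2)   # back to (batch, T, out_dim)

K = 17                                              # assumed number of key points
first_tcn = TemporalConvNet(in_dim=2 * K, out_dim=2 * K)    # 2D correction
second_tcn = TemporalConvNet(in_dim=2 * K, out_dim=3 * K)   # 2D -> 3D lifting
frames = torch.randn(1, 50, 2 * K)                  # one sequence of 50 frames of spliced 2D coords
coords_3d = second_tcn(first_tcn(frames)).reshape(1, 50, K, 3)
```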
In one embodiment, the three-dimensional skeleton generation apparatus 1100 further includes an adjusting module 1105, wherein the obtaining module 1101 is further configured to obtain a preset virtual object, and the adjusting module 1105 is configured to adjust the three-dimensional skeleton of the virtual object according to the three-dimensional skeleton corresponding to each video frame in the sequence of video frames, so that the virtual object executes the target action.
Referring to fig. 12, in an embodiment, the three-dimensional skeleton generation apparatus 1100 further includes a model training module 1106, where the model training module 1106 is configured to obtain a video frame sample sequence and labeling information corresponding to each video frame in the video frame sample sequence; input the video frame sample sequence as a sample of the three-dimensional skeleton generation model into the three-dimensional skeleton generation model for training; determine predicted two-dimensional position information corresponding to the sample input through a first time convolution network in the three-dimensional skeleton generation model, and determine predicted three-dimensional position information corresponding to the sample input through a second time convolution network in the three-dimensional skeleton generation model; respectively construct a first loss function corresponding to the first video frame sample sequence and a second loss function corresponding to the second video frame sample sequence according to the predicted two-dimensional position information, the predicted three-dimensional position information and the labeling information; and, for different sample inputs, execute the corresponding loss functions respectively, adjust the network parameters of the first time convolution network and the second time convolution network according to the execution results of the corresponding loss functions, and continue training until the training stop condition is met.
In one embodiment, the model training module 1106 is further configured to: construct a two-dimensional key point loss function according to the difference between the two-dimensional annotation information corresponding to the sample input and the predicted two-dimensional position information; when the sample input is the first video frame sample sequence, map the predicted three-dimensional position information corresponding to the sample input back to two-dimensional space to obtain corresponding mapped two-dimensional position information, construct a two-dimensional mapping loss function according to the difference between the two-dimensional annotation information corresponding to the sample input and the mapped two-dimensional position information, and construct the first loss function according to the two-dimensional key point loss function and the two-dimensional mapping loss function; and when the sample input is the second video frame sample sequence, construct a three-dimensional key point loss function according to the difference between the three-dimensional annotation information corresponding to the sample input and the predicted three-dimensional position information, and construct the second loss function according to the two-dimensional key point loss function and the three-dimensional key point loss function.
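The two loss functions could, for example, be assembled as follows; the mean squared error and the simple orthographic mapping that drops the depth coordinate are assumptions, since the embodiment does not fix a particular distance measure or camera model:

import torch.nn.functional as F

def project_back_to_2d(pred_3d):
    # assumed orthographic mapping back to two-dimensional space: keep (x, y), drop z
    b, t = pred_3d.shape[:2]
    return pred_3d.reshape(b, t, -1, 3)[..., :2].reshape(b, t, -1)

def build_first_loss(pred_2d, pred_3d, sample):
    loss_kp2d = F.mse_loss(pred_2d, sample["label_2d"])           # two-dimensional key point loss
    loss_proj = F.mse_loss(project_back_to_2d(pred_3d),           # two-dimensional mapping loss
                           sample["label_2d"])
    return loss_kp2d + loss_proj

def build_second_loss(pred_2d, pred_3d, sample):
    loss_kp2d = F.mse_loss(pred_2d, sample["label_2d"])           # two-dimensional key point loss
    loss_kp3d = F.mse_loss(pred_3d, sample["label_3d"])           # three-dimensional key point loss
    return loss_kp2d + loss_kp3d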
In one embodiment, the three-dimensional skeleton generation model in the training phase further includes an adversarial network, and the model training module 1106 is further configured to: take the predicted three-dimensional position information or the three-dimensional annotation information corresponding to each video frame in the sample input as input data of the adversarial network, and classify the input data through the adversarial network to obtain a prediction category of the input data; construct an adversarial loss function according to the prediction category and a reference category corresponding to the input data; construct the first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function and the adversarial loss function; and construct the second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function and the adversarial loss function.
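A minimal sketch of such an adversarial network is given below; the discriminator architecture and the binary cross-entropy formulation are assumptions, shown only to illustrate classifying a three-dimensional pose sequence as annotated or predicted:

import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseDiscriminator(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                   # one logit: annotated (real) vs. predicted (fake)

    def forward(self, poses):                       # poses: (batch, frames, 3 * K)
        return self.net(poses.mean(dim=1))          # pool over time, then classify the sequence

def adversarial_loss(discriminator, pred_3d):
    # generator-side term: predicted poses should be classified as annotated data
    logits = discriminator(pred_3d)
    return F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))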
In one embodiment, the model training module 1106 is further configured to: for each key point in each video frame in the sample input, screen out invalid two-dimensional position information corresponding to invalid key points from the corresponding predicted two-dimensional position information, and determine the predicted three-dimensional position information corresponding to the invalid two-dimensional position information, where the invalid key points are occluded key points; determine a predicted occlusion category corresponding to each invalid key point by combining the determined predicted three-dimensional position information with a preset three-dimensional model; construct an occlusion loss function according to the predicted occlusion category corresponding to each video frame sample in the sample input; construct the first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, the adversarial loss function and the occlusion loss function; and construct the second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, the adversarial loss function and the occlusion loss function.
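The occlusion term could look roughly like the sketch below. How the preset three-dimensional model judges whether a key point can legitimately be occluded is abstracted behind a hypothetical occlusion_probability helper, and identifying invalid key points by origin coordinates follows the annotation convention described for key points whose detection fails:

import torch
import torch.nn.functional as F

def occlusion_loss(pred_2d, pred_3d, K):
    b, t = pred_2d.shape[:2]
    pts_2d = pred_2d.reshape(b, t, K, 2)
    pts_3d = pred_3d.reshape(b, t, K, 3)
    invalid = pts_2d.abs().sum(dim=-1) == 0         # invalid (occluded) key points sit at the origin
    if not invalid.any():
        return pred_2d.new_zeros(())
    # hypothetical helper: probability that each invalid key point is truly occluded,
    # judged from its predicted 3D position against a preset three-dimensional model
    p_occluded = occlusion_probability(pts_3d[invalid])
    return F.binary_cross_entropy(p_occluded, torch.ones_like(p_occluded))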
In an embodiment, the model training module 1106 is further configured to: obtain the first video frame sample sequence and two-dimensional annotation information obtained by two-dimensionally annotating the key points of the sample objects included in each first video frame of the first video frame sample sequence; obtain the second video frame sample sequence and three-dimensional annotation information, acquired by a three-dimensional sensor, corresponding to the key points of the sample objects included in each second video frame of the second video frame sample sequence; determine invalid key points and valid key points among the key points corresponding to each second video frame by combining the acquired three-dimensional annotation information with a preset three-dimensional model; and take the position information corresponding to the coordinate origin as the two-dimensional annotation information of the invalid key points, and determine the two-dimensional annotation information of the valid key points according to the three-dimensional annotation information.
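As a small sketch of that labeling step, valid key points take two-dimensional annotations derived from the three-dimensional sensor data while invalid key points take the coordinate origin; the project_to_image camera projection is an assumed helper, not an interface defined by this embodiment:

import torch

def make_2d_labels(label_3d, valid_mask, camera):
    # label_3d: (frames, K, 3) from the three-dimensional sensor; valid_mask: (frames, K) boolean
    label_2d = project_to_image(label_3d, camera)   # assumed helper mapping 3D annotations to 2D
    label_2d[~valid_mask] = 0.0                     # coordinate origin marks invalid (occluded) key points
    return label_2d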
The three-dimensional skeleton generation apparatus detects a group of temporally continuous video frames to obtain a preliminary prediction of the key points of the target object, yielding the original position information of the key points in each frame. The original position information of the key points is then input into the pre-trained first time convolution network to obtain relatively smooth and stable two-dimensional position information. The two-dimensional position information is further input into the pre-trained second time convolution network for three-dimensional pose estimation, so that the three-dimensional skeleton of the target object is extracted from the video frame sequence. In this way, the three-dimensional skeleton of the target object can be accurately extracted from existing video data without hiring actors to perform the corresponding actions for three-dimensional coordinate acquisition, and the extracted three-dimensional skeleton can drive a virtual object to perform the corresponding actions. The need for hired motion-capture actors is greatly reduced or eliminated, the extraction of the three-dimensional skeleton is no longer limited by scene conditions, and the flexibility of three-dimensional skeleton extraction is greatly improved.
FIG. 13 is a diagram illustrating the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in fig. 1. As shown in fig. 13, the computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the three-dimensional skeleton generation method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the three-dimensional skeleton generation method. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covering the display screen, a key, a trackball or a touchpad arranged on the housing of the computer device, or an external keyboard, touchpad or mouse.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, the three-dimensional skeleton generation apparatus provided in the present application may be implemented in the form of a computer program, which is executable on a computer device as shown in fig. 13. The memory of the computer device may store various program modules constituting the three-dimensional skeleton generation apparatus, such as the acquisition module, the determination module, the modification module, and the conversion module shown in fig. 11. The computer program constituted by the respective program modules causes the processor to execute the steps in the three-dimensional skeleton generation method of the respective embodiments of the present application described in the present specification.
For example, the computer device shown in fig. 13 may execute step S202 by an acquisition module in the three-dimensional skeleton generation apparatus shown in fig. 11. The computer device may perform steps S204 and S210 by the determination module. The computer device may perform step S206 through the modification module. The computer device may perform step S208 through the conversion module.
In one embodiment, there is provided a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the above-described three-dimensional skeleton generation method. Here, the steps of the three-dimensional skeleton generation method may be the steps in the three-dimensional skeleton generation method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to perform the steps of the above-described three-dimensional skeleton generation method. Here, the steps of the three-dimensional skeleton generation method may be steps in the three-dimensional skeleton generation method of each of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; nevertheless, any combination of them that contains no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present application, and although they are described in relative detail, they are not to be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A three-dimensional skeleton generation method, comprising:
acquiring a sequence of video frames comprising a target object;
respectively determining original position information of key points of the target object in the video frames for each video frame in the video frame sequence;
respectively inputting the original position information corresponding to each video frame in the video frame sequence into a first time convolution network to obtain corrected two-dimensional position information corresponding to each video frame;
inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a second time convolution network respectively to obtain three-dimensional position information corresponding to each video frame;
and determining a three-dimensional skeleton corresponding to the target object according to the three-dimensional position information.
2. The method of claim 1, wherein the determining, for each video frame in the sequence of video frames, original location information of a key point of the target object in the video frame comprises:
for each video frame in the video frame sequence, respectively detecting candidate objects included in the video frame;
screening out a target object for executing a target action from candidate objects included in each video frame;
and determining original position information of each key point of the target object in each video frame.
3. The method of claim 1, wherein the determining, for each video frame in the sequence of video frames, original location information of a key point of the target object in the video frame comprises:
respectively determining a standard image comprising the target object from each video frame of the video frame sequence;
respectively detecting key points of each standard image to obtain heat maps corresponding to different key points in each standard image; the value of the heat map represents the certainty that the corresponding position in the standard image is detected as a keypoint;
and determining the original position information of the key points of the target object in each video frame according to the peak values of the heat maps corresponding to the key points.
4. The method of claim 3, wherein the determining a standard image comprising the target object from each video frame of the sequence of video frames comprises:
respectively carrying out target detection on each video frame in the video frame sequence, and determining a rectangular bounding box comprising the target object in each video frame;
for each video frame, adjusting the corresponding rectangular bounding box into a square bounding box whose side length equals the long edge of the rectangular bounding box and whose center point is unchanged;
and for each video frame, normalizing the region determined by the corresponding square bounding box to obtain a standard image comprising the target object.
5. The method according to claim 3, wherein the determining original position information of the key points of the target object in each video frame according to the peak values of the heat map corresponding to the key points comprises:
for each video frame, when the peak value of the heat map corresponding to a key point in the video frame is greater than or equal to a preset threshold value, taking the coordinates of the position corresponding to the peak value as the original position information of the corresponding key point;
and for each video frame, when the peak value of the heat map corresponding to a key point in the video frame is smaller than the preset threshold value, taking the origin coordinates as the original position information of the corresponding key point.
6. The method of claim 1, wherein the raw location information comprises two-dimensional raw coordinates; the two-dimensional position information comprises two-dimensional correction coordinates; the step of inputting the original position information corresponding to each video frame in the video frame sequence to a pre-trained first time convolution network respectively to obtain the corrected two-dimensional position information corresponding to each video frame includes:
for each video frame, splicing the two-dimensional original coordinates corresponding to each key point in the video frame to obtain a corresponding first vector;
sequentially inputting first vectors corresponding to the video frames to a pre-trained first time convolution network according to a time sequence of the video frame sequence formed by the video frames;
for each video frame, fusing information of the preceding and subsequent video frames of the video frame through one or more convolution layers in the first time convolution network to obtain a corrected second vector corresponding to the video frame; the second vector being used to represent the two-dimensional correction coordinates.
7. The method of claim 6, wherein the three-dimensional position information comprises three-dimensional position coordinates; the step of inputting the two-dimensional position information corresponding to each video frame in the video frame sequence to a pre-trained second time convolution network respectively to obtain the three-dimensional position information corresponding to each video frame includes:
sequentially inputting second vectors corresponding to the video frames to a pre-trained second time convolution network according to the time sequence of the video frame sequence formed by the video frames;
for each video frame, fusing information of the preceding and subsequent video frames of the video frame through one or more convolution layers in the second time convolution network to obtain a third vector corresponding to the video frame; the third vector being used to represent the three-dimensional position coordinates.
8. The method according to any one of claims 1 to 7, further comprising:
acquiring a preset virtual object;
and adjusting the three-dimensional skeleton of the virtual object according to the three-dimensional skeleton corresponding to each video frame in the video frame sequence so as to realize that the virtual object executes the target action.
9. The method of any one of claims 1 to 7, wherein the method is performed by a three-dimensional skeleton generation model comprising a first time convolution network and a second time convolution network; the three-dimensional skeleton generation model is obtained by training a video frame sample sequence and annotation information corresponding to key points of sample objects included in each video frame in the video frame sample sequence, wherein the video frame sample sequence comprises a first video frame sample sequence and a second video frame sample sequence, and the annotation information comprises two-dimensional annotation information corresponding to each first video frame in the first video frame sample sequence and two-dimensional annotation information and three-dimensional annotation information corresponding to each second video frame in the second video frame sample sequence.
10. The method of claim 9, wherein the step of training the three-dimensional skeleton-generating model comprises:
acquiring a video frame sample sequence and annotation information corresponding to each video frame in the video frame sample sequence;
taking the video frame sample sequence as sample input of a three-dimensional skeleton generation model, and inputting the sample input into the three-dimensional skeleton generation model for training;
determining predicted two-dimensional position information corresponding to sample input through a first time convolution network in the three-dimensional skeleton generation model, and determining predicted three-dimensional position information corresponding to the sample input through a second time convolution network in the three-dimensional skeleton generation model;
respectively constructing a first loss function corresponding to the first video frame sample sequence and a second loss function corresponding to the second video frame sample sequence according to the predicted two-dimensional position information, the predicted three-dimensional position information and the labeling information;
and evaluating the corresponding loss function for different sample inputs, adjusting network parameters of the first time convolution network and the second time convolution network according to the evaluation results of the corresponding loss functions, and continuing training until a training stop condition is met.
11. The method of claim 10, wherein constructing a first loss function corresponding to the first sequence of video frame samples and a second loss function corresponding to the second sequence of video frame samples from the predicted two-dimensional position information, the predicted three-dimensional position information, and the annotation information, respectively, comprises:
constructing a two-dimensional key point loss function according to the difference between two-dimensional labeling information corresponding to the sample input and predicted two-dimensional position information;
when the sample input is a first video frame sample sequence, mapping the predicted three-dimensional position information corresponding to the sample input back to a two-dimensional space to obtain corresponding mapped two-dimensional position information, constructing a two-dimensional mapping loss function according to the difference between two-dimensional labeling information corresponding to the sample input and mapped two-dimensional position information, and constructing a first loss function according to the two-dimensional key point loss function and the two-dimensional mapping loss function;
and when the sample input is a second video frame sample sequence, constructing a three-dimensional key point loss function according to the difference between three-dimensional labeling information corresponding to the sample input and predicted three-dimensional position information, and constructing a second loss function according to the two-dimensional key point loss function and the three-dimensional key point loss function.
12. The method of claim 11, wherein the three-dimensional skeleton generation model in the training phase further comprises an adversarial network, and wherein the training step of the three-dimensional skeleton generation model further comprises:
taking the predicted three-dimensional position information or three-dimensional labeling information corresponding to each video frame in the sample input as input data of the adversarial network, and classifying the input data through the adversarial network to obtain a prediction category of the input data;
constructing an adversarial loss function according to the prediction category and a reference category corresponding to the input data;
the constructing a first loss function according to the two-dimensional key point loss function and the two-dimensional mapping loss function comprises:
constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function and the adversarial loss function;
the constructing a second loss function according to the two-dimensional key point loss function and the three-dimensional key point loss function comprises:
and constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function and the adversarial loss function.
13. The method of claim 12, wherein the step of training the three-dimensional skeleton-generating model further comprises:
for each key point in each video frame in the sample input, screening out invalid two-dimensional position information corresponding to invalid key points from the corresponding predicted two-dimensional position information, and determining predicted three-dimensional position information corresponding to the invalid two-dimensional position information; the invalid key points being occluded key points;
determining a predicted occlusion category corresponding to each invalid key point by combining the determined predicted three-dimensional position information and a preset three-dimensional model;
constructing an occlusion loss function according to the predicted occlusion category corresponding to each video frame sample in the sample input;
the constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function and the adversarial loss function comprises:
constructing a first loss function according to the two-dimensional key point loss function, the two-dimensional mapping loss function, the adversarial loss function and the occlusion loss function;
the constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function and the adversarial loss function comprises:
and constructing a second loss function according to the two-dimensional key point loss function, the three-dimensional key point loss function, the adversarial loss function and the occlusion loss function.
14. The method of claim 10, wherein the obtaining the sequence of video frame samples and the annotation information corresponding to each video frame in the sequence of video frame samples comprises:
acquiring a first video frame sample sequence and two-dimensional labeling information for performing two-dimensional labeling on key points of sample objects included in each first video frame in the first video frame sample sequence;
acquiring a second video frame sample sequence and three-dimensional labeling information corresponding to key points of sample objects included in each second video frame in the second video frame sequence acquired by a three-dimensional sensor;
determining invalid key points and valid key points in the key points corresponding to each second video frame in the second video frame sequence by combining the collected three-dimensional labeling information and a preset three-dimensional model;
and taking the position information corresponding to the origin of coordinates as the two-dimensional labeling information corresponding to the invalid key points, and determining the two-dimensional labeling information corresponding to the valid key points in the key points according to the three-dimensional labeling information.
15. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method of any one of claims 1 to 14.
CN201911111436.7A 2019-11-14 2019-11-14 Three-dimensional skeleton generation method and computer equipment Pending CN110874865A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911111436.7A CN110874865A (en) 2019-11-14 2019-11-14 Three-dimensional skeleton generation method and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911111436.7A CN110874865A (en) 2019-11-14 2019-11-14 Three-dimensional skeleton generation method and computer equipment

Publications (1)

Publication Number Publication Date
CN110874865A true CN110874865A (en) 2020-03-10

Family

ID=69718330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911111436.7A Pending CN110874865A (en) 2019-11-14 2019-11-14 Three-dimensional skeleton generation method and computer equipment

Country Status (1)

Country Link
CN (1) CN110874865A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021237875A1 (en) * 2020-05-29 2021-12-02 广州幻境科技有限公司 Hand data recognition method and system based on graph convolutional network, and storage medium
CN111753669A (en) * 2020-05-29 2020-10-09 广州幻境科技有限公司 Hand data identification method, system and storage medium based on graph convolution network
CN111798486A (en) * 2020-06-16 2020-10-20 浙江大学 Multi-view human motion capture method based on human motion prediction
CN111798486B (en) * 2020-06-16 2022-05-17 浙江大学 Multi-view human motion capture method based on human motion prediction
CN111832648A (en) * 2020-07-10 2020-10-27 北京百度网讯科技有限公司 Key point marking method and device, electronic equipment and storage medium
CN111832648B (en) * 2020-07-10 2024-02-09 北京百度网讯科技有限公司 Key point labeling method and device, electronic equipment and storage medium
CN112200041A (en) * 2020-09-29 2021-01-08 Oppo(重庆)智能科技有限公司 Video motion recognition method and device, storage medium and electronic equipment
CN112200041B (en) * 2020-09-29 2022-08-02 Oppo(重庆)智能科技有限公司 Video motion recognition method and device, storage medium and electronic equipment
CN112926475B (en) * 2021-03-08 2022-10-21 电子科技大学 Human body three-dimensional key point extraction method
CN112926475A (en) * 2021-03-08 2021-06-08 电子科技大学 Human body three-dimensional key point extraction method
CN113033426A (en) * 2021-03-30 2021-06-25 北京车和家信息技术有限公司 Dynamic object labeling method, device, equipment and storage medium
CN113033426B (en) * 2021-03-30 2024-03-01 北京车和家信息技术有限公司 Dynamic object labeling method, device, equipment and storage medium
CN113205090A (en) * 2021-04-29 2021-08-03 北京百度网讯科技有限公司 Picture rectification method and device, electronic equipment and computer readable storage medium
CN113205090B (en) * 2021-04-29 2023-10-24 北京百度网讯科技有限公司 Picture correction method, device, electronic equipment and computer readable storage medium
CN113556600A (en) * 2021-07-13 2021-10-26 广州虎牙科技有限公司 Drive control method and device based on time sequence information, electronic equipment and readable storage medium
CN113420719A (en) * 2021-07-20 2021-09-21 北京百度网讯科技有限公司 Method and device for generating motion capture data, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110874865A (en) Three-dimensional skeleton generation method and computer equipment
Wang et al. Deep 3D human pose estimation: A review
CN110135375B (en) Multi-person attitude estimation method based on global information integration
Liu et al. Human pose estimation in video via structured space learning and halfway temporal evaluation
CN111191622A (en) Posture recognition method and system based on thermodynamic diagram and offset vector and storage medium
Kumarapu et al. Animepose: Multi-person 3d pose estimation and animation
Martínez-González et al. Efficient convolutional neural networks for depth-based multi-person pose estimation
Ye et al. Faster voxelpose: Real-time 3d human pose estimation by orthographic projection
CN113255522B (en) Personalized motion attitude estimation and analysis method and system based on time consistency
CN111709296A (en) Scene identification method and device, electronic equipment and readable storage medium
CN111062263A (en) Method, device, computer device and storage medium for hand pose estimation
Gouidis et al. Accurate hand keypoint localization on mobile devices
Sharma et al. An end-to-end framework for unconstrained monocular 3D hand pose estimation
Dong et al. ADORE: An adaptive holons representation framework for human pose estimation
Yu et al. Multiview human body reconstruction from uncalibrated cameras
Nguyen et al. Combined YOLOv5 and HRNet for high accuracy 2D keypoint and human pose estimation
Ling et al. Human object inpainting using manifold learning-based posture sequence estimation
JP2012113438A (en) Posture estimation apparatus and posture estimation program
Mehta et al. Single-shot multi-person 3d body pose estimation from monocular rgb input
Lin et al. Overview of 3d human pose estimation
CN116958872A (en) Intelligent auxiliary training method and system for badminton
Barioni et al. Human pose tracking from rgb inputs
Le et al. Openpose’s evaluation in the video traditional martial arts presentation
CN111783497A (en) Method, device and computer-readable storage medium for determining characteristics of target in video
JP7396364B2 (en) Image processing device, image processing method, and image processing program

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40021993

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination