CN116638512A - Method, device and equipment for driving digital human limb actions based on video


Info

Publication number: CN116638512A
Application number: CN202310624134.XA
Authority: CN (China)
Prior art keywords: picture data, coordinates, joint, dimensional, preset
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 胡强 (Hu Qiang), 陈聪 (Chen Cong)
Current Assignee: Beijing Yingfeng Technology Co ltd
Original Assignee: Beijing Yingfeng Technology Co ltd
Priority date / Filing date: 2023-05-30
Publication date: 2023-08-25

Classifications

    • B25J 9/1664 Programme controls characterised by programming, planning systems for manipulators; motion, path, trajectory planning
    • B25J 11/00 Manipulators not otherwise provided for
    • B25J 9/161 Programme controls characterised by the control system, structure, architecture; hardware, e.g. neural networks, fuzzy logic, interfaces, processor
    • B25J 9/1697 Programme controls using sensors other than normal servo-feedback; vision controlled systems
    • Y02T 10/40 Engine management systems (general tagging of climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science; Robotics; Mechanical Engineering; Automation & Control Theory; Physics & Mathematics; Artificial Intelligence; Evolutionary Computation; Fuzzy Systems; Mathematical Physics; Software Systems; Processing Or Creating Images

Abstract

The invention relates to the technical field of artificial intelligence, and in particular to a method, device and equipment for driving digital human limb actions based on video. The method comprises the following steps: acquiring video stream data, and decomposing it frame by frame into a plurality of pieces of picture data arranged in a preset sequence; sequentially inputting the pieces of picture data into a preset model in the preset sequence, and sequentially generating the three-dimensional pose coordinates corresponding to each piece of picture data; obtaining the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; obtaining the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on its standard skeleton coordinates; and driving the limbs of the digital person to act according to the rotation quaternions and the preset sequence. This solves the problems of extensive professional equipment, high cost and inaccurate bone capture in conventional limb motion capture technology.

Description

Method, device and equipment for driving digital human limb actions based on video
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method, a device and equipment for driving digital human limb actions based on video.
Background
With the continuous development of technology, motion capture is receiving more and more attention. Motion capture collects human body posture and motion information, applies suitable data processing methods for accurate quantitative feature analysis, and finally obtains the desired motion information for subsequent processing. Motion capture is of fundamental academic significance and is widely applied in virtual and augmented reality, robotics, and biomechanics. A three-dimensional motion capture system is a device for comprehensively recording the motion of an object in three-dimensional space; according to the underlying principle, such systems are divided into mechanical, acoustic, electromagnetic, optical and inertial motion capture.
However, optical motion capture equipment is expensive, its precision is limited and may not meet user requirements, and it is susceptible to interference from the environment; inertial motion capture requires a large amount of equipment, is costly, places high demands on indoor deployment, and its capture precision and accuracy are likewise limited.
Therefore, there is an urgent need to solve the problems of extensive and expensive professional equipment and inaccurate bone capture in conventional limb motion capture technology.
Disclosure of Invention
In view of the above, the invention provides a method, a device and equipment for driving digital human limb movements based on video, so as to solve the problems of extensive professional equipment, high cost and inaccurate bone capture in conventional limb motion capture technology.
According to a first aspect of an embodiment of the present invention, a method for driving digital human limb movements based on video comprises:
acquiring video stream data, and decomposing the video stream data frame by frame into a plurality of pieces of picture data arranged in a preset sequence;
wherein the video stream data comprises a series of character limb actions;
sequentially inputting the plurality of pieces of picture data into a preset model in the preset sequence, and sequentially generating the three-dimensional pose coordinates corresponding to each piece of picture data; wherein the three-dimensional pose coordinates comprise three-dimensional coordinates of a preset number of human joints;
obtaining the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; wherein the standard skeleton coordinates comprise the standard coordinates of the joints of the digital person, and the joints of the digital person correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates;
obtaining the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data;
and driving the limbs of the digital person to act according to the rotation quaternions and the preset sequence.
Further, the preset model includes:
a basic two-dimensional pose recognition module, configured to generate a two-dimensional pose sequence based on the picture data;
and a dilated convolution module, configured to establish associations between the spatial and temporal domains for the two-dimensional pose sequence to obtain the three-dimensional pose coordinates.
Further, sequentially inputting the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generating the three-dimensional pose coordinates corresponding to each piece of picture data includes:
acquiring the plurality of pieces of picture data, and sequentially generating an offset map and a heatmap corresponding to each piece of picture data;
and generating the three-dimensional pose coordinates of a preset number of human joints corresponding to each piece of picture data based on the offset map and the heatmap.
Further, the method also comprises the following step:
performing adaptive Kalman filtering and low-pass filtering on the three-dimensional pose coordinates.
Further, obtaining the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data includes:
generating an initial Y-axis orientation and an initial Z-axis orientation of each joint based on the first piece of picture data;
generating an initial rotation matrix of each joint based on its initial Y-axis orientation and initial Z-axis orientation;
generating an initial alignment matrix of each joint based on its initial rotation matrix;
generating, based on newly input picture data, the Y-axis orientation and the Z-axis orientation of each joint corresponding to that newly input picture data;
generating a rotation matrix of each joint corresponding to the newly input picture data based on those Y-axis and Z-axis orientations;
and multiplying the rotation matrix of each joint in the newly input picture data by the initial alignment matrix to obtain the rotation quaternion for the newly input picture data.
Further, the process of generating the Y-axis orientation and the Z-axis orientation of each joint includes:
determining the parent-child relationship of each joint in the picture data;
determining the Z-axis orientation of a target joint based on the position difference between the target joint and its associated joint, the associated joint being a joint with which it has a parent-child relationship;
and determining the Y-axis orientation of the picture data based on the normal direction of the plane defined by the spine and the left and right hip joints in the picture data.
Further, the human joints include:
nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
Further, the method also comprises the following step:
constraining the normativity of the digital human limb actions through a preset plug-in.
According to a second aspect of an embodiment of the present invention, an apparatus for driving digital human limb movements based on video comprises:
an acquisition module, configured to acquire video stream data and decompose it frame by frame into a plurality of pieces of picture data arranged in a preset sequence;
wherein the video stream data comprises a series of character limb actions;
a training module, configured to sequentially input the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generate the three-dimensional pose coordinates corresponding to each piece of picture data; wherein the three-dimensional pose coordinates comprise three-dimensional coordinates of a preset number of human joints;
a generation module, configured to obtain the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; wherein the standard skeleton coordinates comprise the standard coordinates of the joints of the digital person, and the joints of the digital person correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates;
a rotation calculation module, configured to obtain the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data;
and a driving module, configured to drive the limb actions of the digital person according to the rotation quaternions and the preset sequence.
According to a third aspect of an embodiment of the present invention, an intelligent device includes:
a processor, and a memory connected to the processor;
wherein the memory is configured to store a computer program at least for executing the method for driving digital human limb actions based on video according to the first aspect;
and the processor is configured to invoke and execute the computer program in the memory.
According to the method for driving digital human limb actions based on video described above, a series of character limb actions in the video stream data is decomposed into a plurality of pieces of picture data arranged in a preset sequence; the picture data are sequentially input into a preset model in the preset sequence to generate the three-dimensional pose coordinates corresponding to each piece of picture data, and the standard skeleton coordinates of each piece are obtained from its three-dimensional pose coordinates; the rotation quaternion of each joint of the digital person is then obtained from the standard skeleton coordinates, and the digital person's limbs are driven to act according to the rotation quaternions and the preset sequence. This avoids the extensive professional equipment, high cost and inaccurate bone capture of conventional limb motion capture technology.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow chart of a method for driving digital human limb movements based on video, according to an exemplary embodiment;
FIG. 2 is a schematic diagram of the human joints, according to an exemplary embodiment;
FIG. 3 is a diagram of the processing effect of adaptive Kalman filtering and low-pass filtering smoothing, according to an exemplary embodiment;
FIG. 4 is a schematic block diagram of an apparatus for driving digital human limb movements based on video, according to an exemplary embodiment;
FIG. 5 is a schematic structural diagram of a smart device, according to an exemplary embodiment;
FIG. 6 is a specific flow chart of a method for driving digital human limb movements based on video, according to another exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention; rather, they are merely examples of apparatuses and methods consistent with aspects of the invention as detailed in the appended claims.
Example 1
Referring to FIG. 1, which is a flow chart of a method for driving digital human limb movements based on video according to an exemplary embodiment; as shown in FIG. 1, the method may specifically include the following steps:
Step S11: acquire video stream data, and decompose the video stream data frame by frame into a plurality of pieces of picture data arranged in a preset sequence.
Wherein the video stream data comprises a series of character limb actions.
Step S12: sequentially input the plurality of pieces of picture data into a preset model in the preset sequence, and sequentially generate the three-dimensional pose coordinates corresponding to each piece of picture data.
The three-dimensional pose coordinates comprise the three-dimensional coordinates of a preset number of human joints.
Step S13: obtain the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; the standard skeleton coordinates comprise the standard coordinates of the digital person's joints, which correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates.
Step S14: obtain the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data.
Step S15: drive the limbs of the digital person to act according to the rotation quaternions and the preset sequence.
In step S11, the video stream data is decomposed frame by frame into picture data; the preset sequence may be the temporal order in which the video plays its frames, i.e. the pieces of picture data are arranged in the video's playback order.
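By way of illustration only, a minimal sketch of this decomposition step, assuming OpenCV (cv2) is used for decoding; the function name and structure are illustrative and not part of the claimed method:

```python
import cv2  # OpenCV, assumed here for video decoding

def decompose_video(path):
    """Decompose video stream data frame by frame into pieces of picture data."""
    capture = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = capture.read()  # frames are returned in playback order
        if not ok:
            break                   # end of the video stream
        frames.append(frame)        # one piece of picture data per frame
    capture.release()
    return frames                   # the list order is the preset sequence
```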
In step S12, the plurality of pieces of picture data are sequentially input into the preset model in the preset sequence, and the three-dimensional pose coordinates corresponding to each piece of picture data are generated in turn; the three-dimensional pose coordinates comprise the three-dimensional coordinates of a preset number of human joints.
Referring to FIG. 2, which is a schematic diagram of the human joints according to an exemplary embodiment: there are 17 joints, i.e. 17 key points of the human pose.
Preferably, the human joints include:
0 nose, 1 left eye, 2 right eye, 3 left ear, 4 right ear, 5 left shoulder, 6 right shoulder, 7 left elbow, 8 right elbow, 9 left wrist, 10 right wrist, 11 left hip, 12 right hip, 13 left knee, 14 right knee, 15 left ankle and 16 right ankle.
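For reference, the 17 key points can be held as an indexed list; the names below are taken directly from the enumeration above:

```python
# Indices 0-16 of the 17 human joints (key points), as enumerated above
JOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]
```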
In a specific application, the preset model is a three-dimensional pose estimation network model obtained by improving a basic two-dimensional pose recognition module. The basic two-dimensional pose recognition module may specifically be the videoPose model.
Specifically, the input of the videoPose model is a two-dimensional image sequence and its output is two-dimensional pose coordinates. The improved preset model adds a dilated convolution module and a skeleton direction adjustment module, which are used to obtain the three-dimensional pose coordinates.
Specifically, the dilated convolution module establishes associations between the spatial and temporal domains of the two-dimensional image sequence, so that dynamic information in the sequence is better captured. In the temporal domain, dilated convolution fuses pose information across preceding and following frames, which helps resolve occlusion and alleviate jitter in action recognition. In the spatial domain, dilated convolution enlarges the receptive field and improves the model's understanding of pose information.
The skeleton direction adjustment module fine-tunes the convolution parameters so that the three-dimensional pose coordinates are more consistent with the two-dimensional pose coordinates. Specifically, the module projects the three-dimensional pose coordinates to generate two-dimensional pose coordinates, and calculates the loss between the three-dimensional pose coordinates and the two-dimensional pose coordinates. If a bone position is inconsistent with the previous forward prediction, the module iterates backwards to fine-tune the convolution parameters, making the prediction more accurate. In this way, the model better learns three-dimensional pose information and generates more accurate three-dimensional pose coordinates, yielding more accurate estimation results.
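By way of illustration, the reprojection check can be sketched as follows; the pinhole camera model and the mean-squared-error loss are assumptions of this sketch, not details fixed by the patent:

```python
import numpy as np

def project_to_2d(points_3d, focal=1.0):
    """Pinhole projection of (N, 3) three-dimensional pose coordinates to (N, 2)."""
    depth = np.clip(points_3d[:, 2:3], 1e-6, None)  # guard against division by zero
    return focal * points_3d[:, :2] / depth

def reprojection_loss(pred_3d, pose_2d):
    """Loss between projected 3D pose coordinates and the 2D pose coordinates."""
    return float(np.mean((project_to_2d(pred_3d) - pose_2d) ** 2))
```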
Further, with the above two improvements to the videoPose model, training is performed on the Human3.6M dataset (Human3.6M is one of the most commonly used datasets for the 3D HPE task, containing 3.6 million frames of picture data with corresponding 2D/3D human poses), which yields the improved preset model, i.e. the three-dimensional pose estimation network model.
Preferably, sequentially inputting the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generating the three-dimensional pose coordinates corresponding to each piece of picture data includes:
acquiring the plurality of pieces of picture data, and sequentially generating an offset map and a heatmap corresponding to each piece of picture data;
and generating the three-dimensional pose coordinates of a preset number of human joints corresponding to each piece of picture data based on the offset map and the heatmap.
It can be understood that the most accurate three-dimensional coordinates of the preset number of human joints in each piece of picture data are located via the offset map and the heatmap, and these coordinates are scaled by a certain ratio, so that the picture data are transformed into three-dimensional pose coordinates that better conform to the rules of human body posture.
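A minimal sketch of decoding one joint from a heatmap and offset maps follows; the exact map layout (one offset channel per axis, with depth carried by a z-offset channel) is an assumption made for illustration:

```python
import numpy as np

def decode_joint(heatmap, off_x, off_y, off_z, scale=1.0):
    """Locate the most confident heatmap cell and refine it with the offset maps."""
    iy, ix = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x = (ix + off_x[iy, ix]) * scale   # sub-cell refinement along x
    y = (iy + off_y[iy, ix]) * scale   # sub-cell refinement along y
    z = off_z[iy, ix] * scale          # depth read from the z-offset channel
    return np.array([x, y, z])         # scaled three-dimensional joint coordinate
```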
Preferably, the method also comprises: performing adaptive Kalman filtering and low-pass filtering smoothing on the three-dimensional pose coordinates.
Referring to FIG. 3, which shows the processing effect of the adaptive Kalman filtering and low-pass filtering smoothing.
Specifically, in the first aspect, the adaptive Kalman filtering algorithm is used for smoothing to prevent jitter.
The prediction formulas of the adaptive Kalman filter are:

$$\hat{x}_k^- = A \hat{x}_{k-1} + B u_{k-1}$$
$$P_k^- = A P_{k-1} A^T + Q$$

where $\hat{x}_k^-$ is the state estimate predicted at time step $k$ from the state $\hat{x}_{k-1}$ and control input $u_{k-1}$ of the previous time step $k-1$ via the state transition matrix $A$ and the control input matrix $B$, and $P_k^-$ is the state estimation covariance matrix predicted at time step $k$ from the covariance $P_{k-1}$ of the previous time step via $A$ and the process noise covariance matrix $Q$. $A$ describes how the state evolves over time, $B$ describes how the control input affects that evolution, $u_{k-1}$ is the control input at time step $k-1$, and $Q$ describes the uncertainty or noise in the state evolution.

The update formulas are:

$$K_k = P_k^- H^T \left( H P_k^- H^T + R \right)^{-1}$$
$$\hat{x}_k = \hat{x}_k^- + K_k \left( z_k - H \hat{x}_k^- \right)$$
$$P_k = \left( I - K_k H \right) P_k^-$$

where $K_k$ is the Kalman gain calculated at time step $k$ from the predicted covariance matrix $P_k^-$, the observation matrix $H$, and the observation noise covariance matrix $R$; $\hat{x}_k$ is the state estimate obtained by updating the predicted estimate $\hat{x}_k^-$ with the Kalman gain $K_k$, the observation matrix $H$, and the observation $z_k$; and $P_k$ is the state estimation covariance matrix obtained by correcting $P_k^-$ with $K_k$ and $H$. $H$ describes the relationship between states and observations, $R$ describes the uncertainty or noise of the observations, $z_k$ is the observation obtained at time step $k$, and $I$ is the identity matrix.
First, the five parameters K, A, R, P and H are initialized; parameter tuning shows that the smaller the value of R, the closer the optimal estimate is to the observed data, and the larger the value of Q, the closer the optimal estimate is to the observed data.
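A minimal NumPy sketch of one predict/update cycle implementing the formulas above (the adaptive tuning of Q and R described in the text is left out for brevity):

```python
import numpy as np

def kalman_step(x, P, z, A, B, u, H, Q, R):
    """One predict/update cycle of the Kalman filter defined by the formulas above."""
    # Prediction
    x_pred = A @ x + B @ u                         # \hat{x}_k^-
    P_pred = A @ P @ A.T + Q                       # P_k^-
    # Update
    S = H @ P_pred @ H.T + R                       # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)            # Kalman gain K_k
    x_new = x_pred + K @ (z - H @ x_pred)          # \hat{x}_k
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred  # P_k
    return x_new, P_new
```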
In the second aspect, a low-pass filtering algorithm is applied to the three-dimensional pose coordinates, with the formula

$$\text{result} = \text{smooth} \cdot \text{now} + (1 - \text{smooth}) \cdot \text{prev}$$

where now denotes the data point at the current time and prev the data point at the previous time. smooth is the smoothing coefficient, with value range [0, 1], and sets the relative weights of the current and previous data points: the larger smooth is, the greater the weight of the current data point; the smaller smooth is, the greater the weight of the previous data point and the stronger the smoothing effect.
The low-pass filtering algorithm itself is prior art, so its specific implementation is not described here.
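The smoothing step itself reduces to one line; here smooth, now and prev follow the definitions above:

```python
def low_pass(now, prev, smooth):
    """Exponential smoothing: blend the current and previous data points.

    smooth lies in [0, 1]; larger values give the current data point more weight.
    """
    return smooth * now + (1.0 - smooth) * prev
```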
In step S13, the standard skeleton coordinates of each piece of picture data are obtained based on its three-dimensional pose coordinates; the standard skeleton coordinates comprise the standard coordinates of the digital person's joints, and the digital person's joints correspond one-to-one to the human joints in the three-dimensional pose coordinates.
In practical application, as described above, the three-dimensional pose coordinates have been smoothed by adaptive Kalman filtering and low-pass filtering. Illustratively, for the left elbow joint, the child joint of the left elbow is the left wrist joint, and the relationship from the left elbow to the left wrist is called a parent-child relationship. The distance D1 between the digital person's left elbow and left wrist joints is calculated from their initial position coordinates in the first piece of picture data. The distance D2 between the video character's left elbow and left wrist is calculated from their smoothed three-dimensional pose coordinates, and dividing the coordinate difference by D2 yields a unit direction vector. Multiplying the distance D1 by this direction vector and adding the initial position coordinate of the left elbow joint gives the standard skeleton coordinate of the left wrist joint.
It will be appreciated that calculating the standard skeleton coordinate of the left wrist establishes the relationship between the left elbow joint and its child node (the left wrist joint); the standard skeleton coordinate of the left elbow in turn requires the relationship between the left shoulder joint and its child node (the left elbow joint). Proceeding by analogy, the standard coordinates of the digital person's joints corresponding to each piece of picture data, i.e. the standard skeleton coordinates, are calculated in turn from the three-dimensional pose coordinates of the human pose joints.
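Following the left elbow/left wrist example, a sketch of the bone-by-bone conversion; treating the parent's already-computed standard coordinate as the base point is an interpretation of the recursive description above:

```python
import numpy as np

def child_standard_coordinate(parent_std, d1, parent_pose, child_pose):
    """Standard skeleton coordinate of a child joint (e.g. the left wrist).

    parent_std: standard coordinate of the parent joint (e.g. the left elbow);
    d1: the digital person's bone length from the first piece of picture data;
    parent_pose, child_pose: smoothed three-dimensional pose coordinates.
    """
    diff = child_pose - parent_pose
    d2 = np.linalg.norm(diff)           # bone length of the video character (D2)
    direction = diff / d2               # unit direction vector (difference / D2)
    return parent_std + d1 * direction  # keep length D1, follow the video direction
```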
Preferably, obtaining the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data includes:
generating an initial Y-axis orientation and an initial Z-axis orientation of each joint based on the first piece of picture data;
generating an initial rotation matrix of each joint based on its initial Y-axis orientation and initial Z-axis orientation;
generating an initial alignment matrix of each joint based on its initial rotation matrix;
generating, based on newly input picture data, the Y-axis orientation and the Z-axis orientation of each joint corresponding to that newly input picture data;
generating a rotation matrix of each joint corresponding to the newly input picture data based on those Y-axis and Z-axis orientations;
and multiplying the rotation matrix of each joint in the newly input picture data by that joint's initial alignment matrix from the initial picture data to obtain the rotation quaternion for the newly input picture data.
It is worth noting that, because the pieces of picture data are input into the preset model in the preset sequence to generate each picture's rotation quaternions, the pose of the digital person changes each time a newly input picture passes through the preset model; the rotation quaternions generated for the current picture therefore drive the digital person's limb actions in real time.
In practical application, the first piece of picture data is taken as the digital person's initial pose image, and subsequent pose images of the digital person are acquired from the newly input picture data in the preset sequence; each pose of the digital person is then driven according to the rotation quaternions. How the rotation quaternions are generated is illustrated below:
step a: calculating an initial rotation matrix (I) based on the respective Z-axis orientations and Y-axis orientations of the first piece of picture data;
namely, firstly, calculating the directions of the Z axis and the Y axis of the initial position of the digital person: i.e. the initial orientation of the digital person is calculated. For example: the Z-axis orientation of the large arm is determined by the difference in position of the large arm and the small arm.
The Y-axis direction, i.e., the human body direction, is determined by a normal direction (i.e., a vertical vector) based on a plane defined by three points of the Spine and the left and right crotch joints, and the vertical vector is obtained by vector cross-multiplying a vector connecting the left crotch joint and the Spine with a vector connecting the right crotch joint and the Spine.
The desired joint coordinates used for the Y-axis orientation calculation of the forearm herein are three-dimensional pose coordinates. The Z-axis orientation of the large arm is determined by the difference between the three-dimensional pose coordinates of the parent-child nodes.
Each frame of the character action in the video stream data is different, so that the Z-axis orientation and the Y-axis orientation of each piece of picture data of the frame-by-frame decomposition of the video stream data are different.
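A sketch of the two orientation computations described in step a; the joint positions are three-dimensional pose coordinates, and the spine/hip names match those used above:

```python
import numpy as np

def bone_z_axis(parent_pos, child_pos):
    """Z-axis orientation of a bone from the parent/child position difference."""
    v = child_pos - parent_pos
    return v / np.linalg.norm(v)

def body_y_axis(spine, left_hip, right_hip):
    """Y-axis (body) orientation: normal of the spine / left-hip / right-hip plane."""
    normal = np.cross(left_hip - spine, right_hip - spine)  # vector cross product
    return normal / np.linalg.norm(normal)
```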
Step b: take the Z-axis and Y-axis orientations of the picture data as those of the current picture, and calculate a rotation matrix (R1) from them by the LookRotation method. Invert the rotation matrix (R1) to obtain its inverse (R2), and multiply the inverse (R2) by the joint's initial rotation matrix (I) to obtain the joint's initial alignment matrix (A).
For example: the inverse matrix (R2) is: (-0.1, -0.6, -0.8,0.1), corresponds to (x, y, z, w), where x, y, z are imaginary parts representing the coordinates of the rotation axis in three dimensions. w is the real part, representing the rotation angle;
The initial rotation matrix (I) is (0.3, -0.5, -0.7);
A=R2*I=(R2.w*I.x+R2.x*I.w+R2.y*I.z-R2.z*I.y,
R2.w*I.y+R2.y*I.w+R2.z*I.x-R2.x*I.z,
R2.w*I.z+R2.z*I.w+R2.x*I.y-R2.y*I.x,
R2.w*I.w-R2.x*I.x-R2.y*I.y-R2.z*I.z)
=(-0.32,-0.14,-0.56,-0.66)
assume that the initial alignment matrix for the large arm is calculated as a.
It should be noted that the alignment matrices of the upper arm and forearm calculated here are calculated only once and remain fixed, i.e. they are the initial alignment matrices.
Step c: acquire the Z-axis and Y-axis orientations of the newly input picture data, and calculate a rotation matrix (T1) by the LookRotation method.
Multiply the rotation matrix (T1) by the previously calculated initial alignment matrix (A); the result B is the rotation quaternion that drives the digital human limbs in real time.
For example: T1 = (0.0, 0.6, 0.8, 0.0), A = (-0.32, -0.14, -0.56, -0.66).
The rotation quaternion of the upper arm for the current frame, i.e. the current picture data, is then B = T1 * A = (-0.084, -0.064, 0.35, -0.51).
The rotation quaternion of each piece of picture data is calculated by the method described above.
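For illustration, the quaternion operations used in steps a to c can be sketched as follows; look_rotation is a simplified stand-in for a LookRotation-style routine (not the exact engine implementation), and quaternions are stored as (x, y, z, w) as in the worked example:

```python
import numpy as np

def quat_mul(q1, q2):
    """Hamilton product of (x, y, z, w) quaternions, as in the formula for A above."""
    x1, y1, z1, w1 = q1
    x2, y2, z2, w2 = q2
    return np.array([
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 + y1 * w2 + z1 * x2 - x1 * z2,
        w1 * z2 + z1 * w2 + x1 * y2 - y1 * x2,
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
    ])

def quat_inverse(q):
    """Inverse of a unit quaternion, i.e. its conjugate."""
    return np.array([-q[0], -q[1], -q[2], q[3]])

def look_rotation(z_axis, y_axis):
    """Quaternion whose frame faces z_axis with y_axis as up (w assumed non-zero)."""
    z = z_axis / np.linalg.norm(z_axis)
    x = np.cross(y_axis, z)
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    m = np.column_stack([x, y, z])  # rotation matrix whose columns are the axes
    w = np.sqrt(max(1e-12, 1.0 + m[0, 0] + m[1, 1] + m[2, 2])) / 2.0
    return np.array([(m[2, 1] - m[1, 2]) / (4 * w),
                     (m[0, 2] - m[2, 0]) / (4 * w),
                     (m[1, 0] - m[0, 1]) / (4 * w), w])

# Per steps a-c: A = quat_mul(quat_inverse(R1), I) once for the first picture,
# then B = quat_mul(T1, A) for each newly input picture.
```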
Preferably, the process of generating the Y-axis orientation and the Z-axis orientation of each joint includes:
determining the parent-child relationship of each joint in the picture data;
determining the Z-axis orientation of a target joint based on the position difference between the target joint and its associated joint, the associated joint being a joint with which it has a parent-child relationship;
and determining the Y-axis orientation of the picture data based on the normal direction of the plane defined by the spine and the left and right hip joints in the picture data.
The Y-axis orientation, i.e. the body orientation, is determined by the normal direction (the perpendicular vector) of the plane defined by the spine (Spine) and the left and right hip joints; the perpendicular vector is obtained as the vector cross product of the vector connecting the left hip joint with the spine and the vector connecting the right hip joint with the spine.
Illustratively, in the video stream data, the Y-axis orientation of the character's upper arm is calculated using the spine (Spine) and left and right hip joint coordinates, specifically the smoothed three-dimensional pose coordinates; the Z-axis orientation of the upper arm is determined by the differences between the three-dimensional pose coordinates of parent and child nodes.
The character's action differs in every frame of the video stream data, so the Z-axis and Y-axis orientations differ for each piece of picture data obtained by the frame-by-frame decomposition of the video stream data.
Preferably, the method also comprises: constraining the normativity of the digital human limb actions through a preset plug-in, so as to avoid distortion of the limb joints.
Specifically, the action normativity of the digital character model can be constrained through the FullBodyIK plug-in, avoiding distortion of the limb joints; for example, 17 key-point positions are used to control 12 effectors via script.
In some embodiments, in the FullBodyIK plug-in, an effector is a specific controller that controls certain key points of the digital character model. These controllers can be implemented by scripts and are used to control the motions and poses of the digital character model; specifically, they cover 12 key points that control the hands, feet, head, torso and so on. By constraining these key points, the action normativity of the digital character model is ensured and distortion of the limb joints is avoided. The controllers can also be used to realize special effects, such as simulated jumping, climbing and crawling.
It will be appreciated that effectors are an important control mechanism that helps achieve more natural and smooth digital character actions and improves game and animation quality.
In this embodiment, the digital human skeleton is controlled by the 12 effectors, which impose an inverse-kinematics constraint so that the digital person moves in a coordinated manner according to the standard skeleton coordinates.
In some embodiments, each joint of the digital person corresponds to a key point position output by the preset model, i.e. to one of the preset number of human joints; a refresh rate is defined for each frame of picture, and at each refresh the next frame of the motion capture data is read. The digital human limbs are then driven according to the calculated rotation quaternions of each joint in each frame.
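A sketch of this per-frame refresh loop; set_local_rotation is a placeholder for whatever interface the rendering engine actually exposes, not an API defined by the patent or by FullBodyIK:

```python
import time

def drive_digital_human(frame_quats, joints, fps=30.0):
    """Apply each frame's per-joint rotation quaternions at a fixed refresh rate.

    frame_quats: list (in the preset sequence) of {joint_name: (x, y, z, w)};
    joints: mapping from joint name to an engine object with a placeholder
    set_local_rotation method.
    """
    period = 1.0 / fps
    for quats in frame_quats:               # read the next frame at each refresh
        for name, q in quats.items():
            joints[name].set_local_rotation(q)
        time.sleep(period)                  # wait for the next refresh tick
```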
It can be understood that a series of character limb actions in the video stream data is decomposed into a plurality of pieces of picture data arranged in a preset sequence; the picture data are sequentially input into a preset model in the preset sequence to generate the three-dimensional pose coordinates corresponding to each piece of picture data, and the standard skeleton coordinates of each piece are obtained from its three-dimensional pose coordinates; the rotation quaternion of each joint of the digital person is then obtained from the standard skeleton coordinates, and the digital person's limbs are driven to act according to the rotation quaternions and the preset sequence, thereby avoiding the extensive professional equipment, high cost and inaccurate bone capture of conventional limb motion capture technology.
Referring to FIG. 4, which is a schematic block diagram of an apparatus for driving digital human limb movements based on video; as shown in FIG. 4, the apparatus comprises:
an acquisition module 101, configured to acquire video stream data and decompose it frame by frame into a plurality of pieces of picture data arranged in a preset sequence; wherein the video stream data comprises a series of character limb actions;
a training module 102, configured to sequentially input the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generate the three-dimensional pose coordinates corresponding to each piece of picture data; wherein the three-dimensional pose coordinates comprise three-dimensional coordinates of a preset number of human joints;
a generation module 103, configured to obtain the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; wherein the standard skeleton coordinates comprise the standard coordinates of the digital person's joints, and the digital person's joints correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates;
a rotation calculation module 104, configured to obtain the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data;
and a driving module 105, configured to drive the limb actions of the digital person according to the rotation quaternions and the preset sequence.
For the specific implementation of the apparatus for driving digital human limb movements based on video, reference may be made to the implementation of the method described in any of the above embodiments, which is not repeated here.
It can be understood that a series of character limb actions in the video stream data is decomposed into a plurality of pieces of picture data arranged in a preset sequence; the picture data are sequentially input into a preset model in the preset sequence to generate the three-dimensional pose coordinates corresponding to each piece of picture data, and the standard skeleton coordinates of each piece are obtained from its three-dimensional pose coordinates; the rotation quaternion of each joint of the digital person is then obtained from the standard skeleton coordinates, and the digital person's limbs are driven to act according to the rotation quaternions and the preset sequence, thereby avoiding the extensive professional equipment, high cost and inaccurate bone capture of conventional limb motion capture technology.
An embodiment of the present invention further provides an intelligent device. Referring to FIG. 5, which is a schematic structural diagram of the intelligent device; as shown in FIG. 5, the intelligent device includes:
a processor 202, and a memory 201 connected to the processor 202.
The memory 201 is configured to store a computer program at least for performing the method for driving digital human limb movements based on video according to any embodiment of the present invention.
The processor 202 is configured to invoke and execute the computer program in the memory 201.
For the specific implementation of the intelligent device, reference may be made to the implementation of the method for driving digital human limb movements based on video described in any of the above embodiments, which is not repeated here.
It can be understood that a series of character limb actions in the video stream data is decomposed into a plurality of pieces of picture data arranged in a preset sequence; the picture data are sequentially input into a preset model in the preset sequence to generate the three-dimensional pose coordinates corresponding to each piece of picture data, and the standard skeleton coordinates of each piece are obtained from its three-dimensional pose coordinates; the rotation quaternion of each joint of the digital person is then obtained from the standard skeleton coordinates, and the digital person's limbs are driven to act according to the rotation quaternions and the preset sequence, thereby avoiding the extensive professional equipment, high cost and inaccurate bone capture of conventional limb motion capture technology.
Example 2
Referring to FIG. 6, which is a specific flow chart of a method for driving digital human limb movements based on video; as shown in FIG. 6, the method comprises:
Step S21: acquire video stream data, and decompose it frame by frame into a plurality of pieces of picture data.
Step S22: input each frame of picture data into the preset model, and output the three-dimensional pose coordinates of the 17 key points of the human pose in real time.
Step S23: perform adaptive Kalman filtering and low-pass filtering on the three-dimensional pose coordinates.
Step S24: convert the smoothed three-dimensional pose coordinates into standard skeleton coordinates.
Step S25: calculate the Z-axis and Y-axis orientations of the first piece of picture data from the standard skeleton coordinates, convert these orientations into rotation angles to generate the initial rotation matrix, and then calculate the initial alignment matrix of each joint through matrix transformation.
Step S26: for each subsequent piece of picture data, calculate the rotation matrix of each joint and multiply it by the initial alignment matrix to obtain the rotation quaternion of each joint.
Step S27: drive the digital human limbs to act according to the rotation quaternion of each joint in each piece of picture data.
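Tying steps S21 to S27 together, a high-level sketch might look like the following; every helper name is a placeholder standing in for the routines sketched in Example 1, and none of them is defined by the patent itself:

```python
def run_pipeline(video_path):
    """End-to-end sketch of steps S21 to S27 (all helper names are placeholders)."""
    frames = decompose_video(video_path)                       # S21: video -> picture data
    poses = [estimate_3d_pose(f) for f in frames]              # S22: preset model, 17 key points
    smoothed = smooth_poses(poses)                             # S23: adaptive Kalman + low-pass
    skeletons = [to_standard_skeleton(p) for p in smoothed]    # S24: standard skeleton coordinates
    align = initial_alignment(skeletons[0])                    # S25: first picture only
    quats = [frame_quaternions(s, align) for s in skeletons]   # S26: per-joint rotation quaternions
    drive_digital_human(quats, load_joints())                  # S27: drive the limbs in order
```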
It can be understood that a series of character limb actions in the video stream data is decomposed into a plurality of pieces of picture data arranged in a preset sequence; the picture data are sequentially input into a preset model in the preset sequence to generate the three-dimensional pose coordinates corresponding to each piece of picture data, and the standard skeleton coordinates of each piece are obtained from its three-dimensional pose coordinates; the rotation quaternion of each joint of the digital person is then obtained from the standard skeleton coordinates, and the digital person's limbs are driven to act according to the rotation quaternions and the preset sequence, thereby avoiding the extensive professional equipment, high cost and inaccurate bone capture of conventional limb motion capture technology.
It is to be understood that the same or similar parts in the above embodiments may be referred to each other, and that in some embodiments, the same or similar parts in other embodiments may be referred to.
It should be noted that in the description of the present invention, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise indicated, the meaning of "plurality" means at least two.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and further implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A method for driving digital human limb movements based on video, comprising:
acquiring video stream data, and decomposing the video stream data frame by frame into a plurality of pieces of picture data arranged in a preset sequence;
wherein the video stream data comprises a series of character limb actions;
sequentially inputting the plurality of pieces of picture data into a preset model in the preset sequence, and sequentially generating three-dimensional pose coordinates corresponding to each piece of picture data; wherein the three-dimensional pose coordinates comprise three-dimensional coordinates of a preset number of human joints;
obtaining standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; wherein the standard skeleton coordinates comprise standard coordinates of the joints of the digital person, and the joints of the digital person correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates;
obtaining a rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data;
and driving the limbs of the digital person to act according to the rotation quaternions and the preset sequence.
2. The method of claim 1, wherein the preset model comprises:
a basic two-dimensional pose recognition module, configured to generate two-dimensional pose coordinates based on the two-dimensional image sequence of the picture data;
a dilated convolution module, configured to establish associations between the spatial and temporal domains for the two-dimensional image sequence to obtain the three-dimensional pose coordinates;
and a skeleton direction adjustment module, configured to project the three-dimensional pose coordinates to generate the two-dimensional pose coordinates and to calculate the loss between the three-dimensional pose coordinates and the two-dimensional pose coordinates.
3. The method according to claim 1, wherein sequentially inputting the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generating the three-dimensional pose coordinates corresponding to each piece of picture data comprises:
acquiring the plurality of pieces of picture data, and sequentially generating an offset map and a heatmap corresponding to each piece of picture data;
and generating the three-dimensional pose coordinates of a preset number of human joints corresponding to each piece of picture data based on the offset map and the heatmap.
4. The method as recited in claim 1, further comprising:
performing adaptive Kalman filtering and low-pass filtering on the three-dimensional pose coordinates.
5. The method according to claim 1, wherein obtaining the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data comprises:
generating an initial Y-axis orientation and an initial Z-axis orientation of each joint based on the first piece of picture data;
generating an initial rotation matrix of each joint based on its initial Y-axis orientation and initial Z-axis orientation;
generating an initial alignment matrix of each joint based on its initial rotation matrix;
generating, based on newly input picture data, the Y-axis orientation and the Z-axis orientation of each joint corresponding to that newly input picture data;
generating a rotation matrix of each joint corresponding to the newly input picture data based on those Y-axis and Z-axis orientations;
and multiplying the rotation matrix of each joint in the newly input picture data by the initial alignment matrix to obtain the rotation quaternion for the newly input picture data.
6. The method of claim 5, wherein generating the Y-axis orientation and the Z-axis orientation of each joint comprises:
determining the parent-child relationship of each joint in the picture data;
determining the Z-axis orientation of a target joint based on the position difference between the target joint and its associated joint, the associated joint being a joint with which it has a parent-child relationship;
and determining the Y-axis orientation of the picture data based on the normal direction of the plane defined by the spine and the left and right hip joints in the picture data.
7. The method of claim 1, wherein the human joints comprise:
nose, left eye, right eye, left ear, right ear, left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, and right ankle.
8. The method as recited in claim 1, further comprising:
constraining the normativity of the digital human limb actions through a preset plug-in.
9. A device for driving digital human limb movements based on video, comprising:
an acquisition module, configured to acquire video stream data and decompose it frame by frame into a plurality of pieces of picture data arranged in a preset sequence;
wherein the video stream data comprises a series of character limb actions;
a training module, configured to sequentially input the plurality of pieces of picture data into a preset model in the preset sequence and sequentially generate the three-dimensional pose coordinates corresponding to each piece of picture data; wherein the three-dimensional pose coordinates comprise three-dimensional coordinates of a preset number of human joints;
a generation module, configured to obtain the standard skeleton coordinates of each piece of picture data based on its three-dimensional pose coordinates; wherein the standard skeleton coordinates comprise the standard coordinates of the joints of the digital person, and the joints of the digital person correspond one-to-one to the joints of the human pose in the three-dimensional pose coordinates;
a rotation calculation module, configured to obtain the rotation quaternion of each joint of the digital person corresponding to each piece of picture data based on the standard skeleton coordinates of that picture data;
and a driving module, configured to drive the limb actions of the digital person according to the rotation quaternions and the preset sequence.
10. An intelligent device, comprising:
a processor, and a memory connected to the processor;
wherein the memory is configured to store a computer program at least for executing the method for driving digital human limb movements based on video according to any one of claims 1-8;
and the processor is configured to invoke and execute the computer program in the memory.
CN202310624134.XA (filed 2023-05-30, priority date 2023-05-30) Method, device and equipment for driving digital human limb actions based on video. Status: Pending. Publication: CN116638512A.

Priority Applications (1)

CN202310624134.XA, filed 2023-05-30 (also the priority date): Method, device and equipment for driving digital human limb actions based on video

Publications (1)

CN116638512A, published 2023-08-25

Family

ID: 87624205

Country Status (1)

CN: CN116638512A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination