CN111899320A - Data processing method, and training method and device of dynamic capture denoising model - Google Patents


Info

Publication number
CN111899320A
Authority
CN
China
Prior art keywords
noise
dynamic capture
feature vector
Prior art date
Legal status
Granted
Application number
CN202010844348.4A
Other languages
Chinese (zh)
Other versions
CN111899320B (en)
Inventor
张榕
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010844348.4A
Publication of CN111899320A
Application granted
Publication of CN111899320B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00: Animation
    • G06T 13/20: 3D [Three Dimensional] animation
    • G06T 13/40: 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a data processing method applied to the field of artificial intelligence, which specifically includes: acquiring a target dynamic capture key frame; acquiring dynamic capture data to be processed according to the target dynamic capture key frame; based on the dynamic capture data to be processed, acquiring a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model; and determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector. Because the dynamic capture denoising model can predict the noise offset vector of the target dynamic capture key frame on the animation curve, the application retains the original trend of the motion details while saving the time and labor costs of animation production.

Description

Data processing method, and training method and device of dynamic capture denoising model
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a data processing method and a training method and apparatus for a dynamic capture denoising model.
Background
Motion capture (also known as dynamic capture, or mocap for short) refers to a technique for recording and processing the motion of a person or other object. The industry commonly uses optical motion capture, in which multiple markers are attached to a performer and several cameras capture the marker positions; the marker positions are then reconstructed and rendered onto a corresponding virtual character, finally mapping the real actor's performance onto skeletal animation.
In the process of collecting dynamic capture data, noise is unavoidable because the dynamic capture equipment has limited precision and the software introduces solving errors. At present, most production work still relies on animators to manually repair animation noise in skeletal animation; skeletal animation refined by animators is closer to requirements and has higher motion accuracy.
However, although manual repair of animation noise by animators is effective, it is costly and slow. Depending on the magnitude of the dynamic capture noise, a moderately skilled animator needs anywhere from tens of seconds to tens of minutes to process one set of dynamic capture data, resulting in excessive time and labor costs for animation production.
Disclosure of Invention
The embodiments of the present application provide a data processing method and a training method and apparatus for a dynamic capture denoising model. The dynamic capture denoising model can predict the noise offset vector of a target dynamic capture key frame on the animation curve, so the original trend of the motion details can be retained while the time and labor costs of animation production are saved.
In view of the above, an aspect of the present application provides a data processing method, including:
acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
acquiring dynamic capture data to be processed according to the target dynamic capture key frames, wherein the dynamic capture data to be processed represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frames, the noise feature vector set comprises target noise feature vectors, and N is an integer greater than 1;
based on the dynamic capture data to be processed, acquiring a noise offset vector corresponding to a target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model, wherein the noise offset vector comprises offset data of M joints;
and determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of M joints.
Another aspect of the present application provides a method for training a dynamic capture denoising model, including:
acquiring dynamic capture noise-free data corresponding to dynamic capture sample key frames, wherein the dynamic capture noise-free data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
acquiring dynamic capture noise data according to dynamic capture noise-free data, wherein the dynamic capture noise data comprises a noise feature vector set corresponding to N continuous dynamic capture key frames, the noise feature vector set comprises second feature vectors corresponding to dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
based on the dynamic capture noise data, acquiring a noise offset vector corresponding to a dynamic capture sample key frame through a time sequence network included in a dynamic capture denoising model to be trained, wherein the noise offset vector comprises offset data of M joints;
and updating the model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector, and the noise offset vector until the model training conditions are met, and outputting the dynamic capture denoising model.
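The training method above leaves open how the dynamic capture noise data is derived from the noise-free data. Purely as a hypothetical sketch, synthetic noise could be injected into the clean rotation curves as small jitter plus occasional spikes (matching how noise appears on joint animation curves); the function name and every parameter below are illustrative assumptions, not choices fixed by this application:

```python
import numpy as np

def add_synthetic_noise(clean: np.ndarray, sigma: float = 0.5,
                        spike_prob: float = 0.02, spike_scale: float = 5.0,
                        seed: int = 0) -> np.ndarray:
    """Turn dynamic capture noise-free data of shape (T, D) into dynamic
    capture noise data by adding small Gaussian jitter everywhere plus
    occasional large spikes (mimicking the spikes that noise shows on
    joint animation curves)."""
    rng = np.random.default_rng(seed)
    noisy = clean + rng.normal(0.0, sigma, clean.shape)
    spikes = rng.random(clean.shape) < spike_prob
    noisy[spikes] += rng.normal(0.0, spike_scale, int(spikes.sum()))
    return noisy

clean = np.random.randn(100, 54)  # stand-in for noise-free rotation data
noisy = add_synthetic_noise(clean)
print(noisy.shape)                # (100, 54)
```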
One aspect of the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the acquisition module is further used for acquiring dynamic capture data to be processed according to the target dynamic capture key frames, wherein the dynamic capture data to be processed represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frames, the noise feature vector set comprises target noise feature vectors, and N is an integer greater than 1;
the acquisition module is further used for acquiring a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in the dynamic capture denoising model based on the dynamic capture data to be processed, wherein the noise offset vector comprises offset data of M joints;
and the determining module is used for determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of M joints.
In one possible design, in one implementation of another aspect of the embodiments of the present application, the target noise feature vector further includes an angular velocity corresponding to the root joint;
the acquisition module is specifically used for acquiring N continuous dynamic capture key frames according to the target dynamic capture key frame;
and acquiring to-be-processed dynamic capture data according to the continuous N dynamic capture key frames, wherein the to-be-processed dynamic capture data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and an angular velocity corresponding to a root joint.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model based on the dynamic capture data to be processed, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a target time sequence characteristic vector through a forward time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set;
or,
and acquiring a target time sequence characteristic vector through a backward time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically used for obtaining a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
obtaining a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating a target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically used for obtaining a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
obtaining a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the determining module is specifically used for acquiring, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to that joint from the noise offset vector;
adding the offset data corresponding to that joint to the noise data corresponding to that joint to obtain the denoised data corresponding to that joint;
and when the denoised data corresponding to each of the M joints has been acquired, generating the joint denoising result corresponding to the target dynamic capture key frame.
Another aspect of the present application provides a dynamic capture denoising model training apparatus, including:
the acquisition module is used for acquiring dynamic capture noise-free data corresponding to dynamic capture sample key frames, wherein the dynamic capture noise-free data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the acquisition module is further used for acquiring dynamic capture noise data according to the dynamic capture noise-free data, wherein the dynamic capture noise data comprises a noise feature vector set corresponding to N continuous dynamic capture key frames, the noise feature vector set comprises second feature vectors corresponding to dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
the acquisition module is further used for acquiring a noise offset vector corresponding to the key frame of the dynamic capture sample through a time sequence network included in a dynamic capture denoising model to be trained based on dynamic capture noise data, wherein the noise offset vector comprises offset data of M joints;
and the training module is used for updating the model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector, and the noise offset vector until the model training conditions are met, and outputting the dynamic capture denoising model.
In one possible design, in one implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a dynamic capture sample key frame;
acquiring continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring dynamic capture noise-free data according to the continuous N dynamic capture key frames, wherein the dynamic capture noise-free data comprises N original feature vectors, and each original feature vector comprises label data corresponding to M joints and label data corresponding to a root joint;
the acquisition module is specifically configured to acquire dynamic capture noise data according to an original feature vector set corresponding to N continuous dynamic capture keyframes, where the dynamic capture noise data includes N noise feature vectors, and each noise feature vector includes noise data corresponding to M joints and an angular velocity corresponding to a root joint.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the dynamic capture sample key frame through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the acquisition module is specifically used for acquiring a target time sequence characteristic vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the coding characteristic vector set;
or,
and acquiring a target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically used for obtaining a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
obtaining a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generating a target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the obtaining module is specifically used for obtaining a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
obtaining a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In one possible design, in another implementation of another aspect of an embodiment of the present application,
the training module is specifically used for determining a joint denoising result according to the noise offset vector and the second feature vector;
determining a loss value by using a loss function according to the joint denoising result and the first feature vector;
and updating the model parameters of the dynamic capture denoising model to be trained according to the loss value.
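Read together, these three steps amount to one supervised training iteration: add the predicted offset to the second (noisy) feature vector, compare the result against the first (clean) feature vector with a loss, and back-propagate. The sketch below is a hypothetical rendering of such a step; the MSE loss, the Adam optimizer, and the placeholder linear model are assumptions, since this application does not fix them here:

```python
import torch
import torch.nn as nn

# A placeholder model stands in for the dynamic capture denoising model to
# be trained; these tensors stand in for one training batch.
model = nn.Linear(31 * 57, 54)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = nn.MSELoss()                                     # assumed loss function

noisy_window = torch.randn(8, 31, 57)      # dynamic capture noise data
second = noisy_window[:, 15, :54]          # second feature vector (noisy sample frame)
first = second + 0.1 * torch.randn(8, 54)  # first feature vector (clean labels, stand-in)

offset = model(noisy_window.flatten(1))    # predicted noise offset vector
denoised = second + offset                 # joint denoising result
loss = loss_fn(denoised, first)            # compare against the clean labels

optimizer.zero_grad()
loss.backward()
optimizer.step()                           # update the model parameters
```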
Another aspect of the present application provides a computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein, the memory is used for storing programs;
the processor is configured to execute the program in the memory, performing the methods provided by the above aspects according to the instructions in the program code;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
Another aspect of the present application provides a computer-readable storage medium having stored therein instructions, which when executed on a computer, cause the computer to perform the method of the above-described aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided by the above aspects.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a data processing method, which includes the steps of firstly obtaining a target dynamic capture key frame, then obtaining dynamic capture data to be processed according to the target dynamic capture key frame, wherein the dynamic capture data to be processed represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frame, the noise feature vector set comprises a target noise feature vector, then obtaining a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model based on the dynamic capture data to be processed, the noise offset vector comprises offset data of M joints, and finally determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of the M joints. By adopting the mode, the noise offset vector of the target moving-capture key frame on the animation curve can be predicted through the moving-capture denoising model, the joint denoising result can be obtained based on the noise offset vector, the original trend of motion details can be reserved, the time cost and the labor cost of animation production can be saved, and the efficiency of animation noise repair is improved.
Drawings
FIG. 1 is a schematic diagram illustrating a comparison between a smooth curve and a non-smooth curve in an embodiment of the present application;
FIG. 2 is a schematic view of a joint animation curve in an embodiment of the present application;
FIG. 3 is a schematic diagram of an animated character's Y-axis rotation curves and the corresponding skeletal animation obtained with a low-pass filter;
FIG. 4 is a block diagram of an architecture of a data processing system according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of a method for data processing in an embodiment of the present application;
FIG. 6 is a schematic diagram of the joint positions of a humanoid character in an embodiment of the present application;
FIG. 7 is a schematic comparison of an animated character's joint curve before and after noise removal in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of obtaining a noise offset vector based on a dynamic capture denoising model in the embodiment of the present application;
FIG. 9 is a schematic diagram of outputting a target time sequence feature vector based on a forward time sequence network in an embodiment of the present application;
FIG. 10 is a schematic diagram of outputting a target time sequence feature vector based on a backward time sequence network in an embodiment of the present application;
FIG. 11 is another schematic structural diagram of obtaining a noise offset vector based on a dynamic capture denoising model in the embodiment of the present application;
FIG. 12 is another schematic structural diagram of obtaining a noise offset vector based on a dynamic capture denoising model in the embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of a dynamic capture denoising model training method in the embodiment of the present application;
FIG. 14 is a schematic comparison of rotation curves of dynamic capture noise-free data and dynamic capture noise data in an embodiment of the present application;
FIG. 15 is a comparison of Y-axis rotation curves based on experimental data in an embodiment of the present application;
FIG. 16 is a comparison of X-axis rotation curves based on experimental data in an embodiment of the present application;
FIG. 17 is a schematic diagram of an embodiment of a data processing apparatus in an embodiment of the present application;
FIG. 18 is a schematic diagram of an embodiment of a dynamic capture denoising model training device in an embodiment of the present application;
fig. 19 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a data processing method and a training method and apparatus for a dynamic capture denoising model. The dynamic capture denoising model can predict the noise offset vector of a target dynamic capture key frame on the animation curve, so the original trend of the motion details can be retained while the time and labor costs of animation production are saved.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that the method for processing dynamic capture data provided by this application can be used in scenarios such as three-dimensional (3D) games, animated films, and Virtual Reality (VR). Specifically, the processing of the dynamic capture data and the training of the dynamic capture denoising model are realized through Machine Learning (ML) methods based on Artificial Intelligence (AI). Artificial intelligence is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The method for processing dynamic capture data provided by this application is mainly used to remove noise from skeletal animation collected by dynamic capture equipment, yielding skeletal animation that is closer to requirements and has higher motion accuracy. It can be applied to film production, television production, advertising, computer game production, video game production, and similar fields for digital character creation. This application involves a series of terms, which are explained below:
1. Maya is a sophisticated three-dimensional computer graphics software package, widely used for digital effects creation in films, television, advertising, computer games, and video games.
2. MotionBuilder is 3D character animation software made by Autodesk, used for virtual production, motion capture, and traditional keyframe animation.
3. Motion capture (also known as dynamic capture, or mocap for short) refers to a technique for recording and processing the motion of a person or other object. Dynamic capture equipment can be divided into mechanical, acoustic, optical, electromagnetic, and other categories according to its working principle. Currently the industry commonly uses optical motion capture: several markers are attached to the performing actor, multiple cameras capture the marker positions, and the marker positions are then reconstructed and rendered onto the corresponding virtual character, finally mapping the real actor's performance onto skeletal animation.
In this process, the actor's limb movements may occlude markers, causing marker data to be lost or to jitter; combined with optical camera imaging errors and the limited modeling and solving precision of the reconstruction software, the finally generated animation exhibits obvious noise. In the animation, noise appears as unnatural or abnormal movements of the character's limbs; on the joint animation curves, noise appears as spikes recurring along otherwise smooth curves. For ease of illustration, please refer to fig. 1, which compares a smooth curve and a non-smooth curve in an embodiment of the present application: A1 indicates a typical smooth curve, A2 indicates a non-smooth curve, and the fluctuating portions are marked with boxes.
Approaches to the noise problem in dynamic capture data can be divided, according to the stage at which the data is generated, into two directions: marker data denoising and dynamic capture animation denoising. Marker data is not always easy to acquire, and differing dynamic capture configurations place high demands on its analysis. Dynamic capture animation, however, is common and easy to obtain; many open-source databases provide large amounts of dynamic capture animation data, such as the Carnegie Mellon University (CMU) motion capture database, and the animation noise problem is comparatively widespread.
4. Skeletal animation is one type of model animation. Dynamic capture data is often processed into skeletal animation form and applied to game production. In skeletal animation, a model has a skeletal structure of interconnected "bones", and animation is generated for the model by changing the orientation and position of these bones over time.
5. Euler angles are a set of three independent angular parameters used to determine the orientation of a rigid body rotating about a fixed point, consisting of the nutation angle, the precession angle, and the spin angle.
6. A joint animation curve represents the motion of one joint as a time series of three-dimensional Euler angle rotations, i.e., three motion curves. In skeletal animation, character motion is formed by superimposing the rotations of the interconnected joints, and the rotation of a single joint at a given moment can be expressed in Euler angles. For ease of understanding, please refer to fig. 2, which shows joint animation curves in an embodiment of the present application: the three curves are the rotation curves of a character's right thigh displayed in the Maya curve editor, where B1 indicates the X-axis rotation curve, B2 the Y-axis rotation curve, and B3 the Z-axis rotation curve of the right thigh.
7. Animation retargeting is a technique for sharing one animation among multiple different characters. After retargeting is applied, the same animation can play normally on skeletons from different skeletal resources.
8. A low-pass filter allows low-frequency signals to pass while attenuating (or reducing) signals with frequencies above the cutoff frequency. The common Gaussian filter is a low-pass filter, suitable for removing Gaussian noise, and is widely used in the denoising stage of image processing. Gaussian filtering is a weighted-averaging process over the whole image: the value of each pixel is obtained as a weighted average of its own value and the values of the other pixels in its neighborhood.
A low-pass filter can only smooth out spike noise, and therefore preserves animation details insufficiently. In terms of curve shape, it can only flatten peaks and valleys toward the average position and cannot restore the magnitude of the fluctuation. When the fluctuation of the noisy data is small, the filter suppresses the peaks even more, so the processed curve fluctuates too little. Curve fluctuation reflects the intensity of the animated character's joint motion, i.e., the range and force of the movement. For ease of understanding, please refer to fig. 3, which shows an animated character's Y-axis rotation curves and the skeletal animation obtained with a low-pass filter: C1 and C2 both indicate the Y-axis rotation curve of the character's right foot joint. The curve indicated by C1 fluctuates strongly, with higher amplitude at its peaks and valleys, and corresponds to the character shown at C3 striding forward; the curve indicated by C2 is comparatively smooth and corresponds to the character shown at C4 stepping in place.
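For reference, the Gaussian low-pass baseline criticized above can be sketched in a few lines on a single joint rotation channel; the kernel width and the synthetic curve below are illustrative assumptions:

```python
import numpy as np

def gaussian_smooth(curve: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """Low-pass filter one joint rotation channel: each frame becomes a
    weighted average of its neighborhood. Spikes are flattened, but genuine
    peaks and valleys are also pulled toward the mean."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2 * sigma**2))
    kernel /= kernel.sum()
    return np.convolve(curve, kernel, mode="same")

# A noisy Y-axis rotation curve: genuine motion plus sparse spike noise.
t = np.linspace(0, 4 * np.pi, 200)
curve = 30 * np.sin(t) + np.where(np.random.rand(200) < 0.05, 8.0, 0.0)
smoothed = gaussian_smooth(curve)  # spikes are gone, but the peaks are damped too
```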
Having explained the terminology, refer now to fig. 4, which shows the architecture of a data processing system in an embodiment of the present application. In the process of training the dynamic capture denoising model, dynamic capture noise-free data is first obtained from a database and subjected to noise-addition processing to obtain dynamic capture noise data. The server can use the dynamic capture noise-free data and the dynamic capture noise data to train the dynamic capture denoising model to be trained, and outputs the corresponding dynamic capture denoising model when the training condition is satisfied. The server can store the dynamic capture denoising model locally or send it to the terminal device, where it is then used. Alternatively, the training process may also be executed by the terminal device, i.e., the terminal device stores the dynamic capture denoising model locally after outputting it. In actual prediction, the terminal device may input the dynamic capture data to be processed, which includes the target dynamic capture key frame, into the dynamic capture denoising model, which outputs the noise offset vector corresponding to the target dynamic capture key frame, thereby correcting the noise data corresponding to the target dynamic capture key frame.
It should be noted that the server involved in this application may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal device may be, but is not limited to, a smartphone, tablet computer, notebook computer, palmtop computer, personal computer, smart television, or smart watch. The terminal device and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in this application. The numbers of servers and terminal devices are likewise not limited.
With reference to fig. 5, an embodiment of a data processing method in the present application is described below, where the data processing method in the embodiment of the present application includes:
101. acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
In this embodiment, the data processing apparatus obtains a target dynamic capture key frame, i.e., the dynamic capture key frame to be corrected, which includes noise data corresponding to M joints. This application takes M equal to 18 as an example, which should not be construed as a limitation. It is understood that the data processing apparatus may be deployed in a server or in a terminal device, which is not limited here.
For ease of introduction, please refer to fig. 6, which shows the joint positions of a humanoid character in an embodiment of the present application. As shown, assume the humanoid character has 19 joints, where the joint indicated by D19 is the root joint, the topmost parent node, located at the character's pelvis. The remaining 18 joints are the thorax joint indicated by D1, the neck joint indicated by D2, the right leg joint indicated by D3, the left leg joint indicated by D4, the right knee joint indicated by D5, the left knee joint indicated by D6, the right ankle joint indicated by D7, the left ankle joint indicated by D8, the right foot joint indicated by D9, the left foot joint indicated by D10, the right elbow joint indicated by D11, the left elbow joint indicated by D12, the right hand joint indicated by D13, the left hand joint indicated by D14, the right shoulder joint indicated by D15, the left shoulder joint indicated by D16, the right hip joint indicated by D17, and the left hip joint indicated by D18. It is understood that a humanoid character may also have a different number of joints; the above is merely illustrative and should not be construed as limiting.
Specifically, each joint corresponds to one set of noise data, and each set includes a parameter for the X-axis rotation channel, a parameter for the Y-axis rotation channel, and a parameter for the Z-axis rotation channel. Assuming M is 18 (excluding the root joint), the target noise feature vector may therefore include 54-dimensional parameters, which also means the target dynamic capture key frame has 54 rotation channels. Taking the right foot joint as an example, its noise data can be represented as "foot_r_xrotation", "foot_r_yrotation", and "foot_r_zrotation".
Based on this, the target noise feature vector can be obtained from the target dynamic capture key frame and expressed as:

$[(x_{1,t}, y_{1,t}, z_{1,t}), (x_{2,t}, y_{2,t}, z_{2,t}), \ldots, (x_{M,t}, y_{M,t}, z_{M,t})]$

where x, y, and z denote the parameters on the X-, Y-, and Z-axes respectively, subscripts 1 through M index the M joints, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame.
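To make the layout concrete, the following minimal sketch assembles such a 54-dimensional target noise feature vector with NumPy; the array contents and variable names are illustrative assumptions, not data or identifiers from this application:

```python
import numpy as np

M = 18  # number of joints, excluding the root joint

# Per-joint Euler rotations (X, Y, Z) of the t-th dynamic capture key frame.
# Shape (M, 3); random placeholders stand in for actually captured data.
joint_rotations_t = np.random.uniform(-180.0, 180.0, size=(M, 3))

# Flatten row by row into the target noise feature vector
# [(x1t, y1t, z1t), (x2t, y2t, z2t), ..., (xMt, yMt, zMt)].
target_noise_feature_vector = joint_rotations_t.reshape(-1)

assert target_noise_feature_vector.shape == (3 * M,)  # 54 dimensions
```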
102. Acquiring dynamic capture data to be processed according to the target dynamic capture key frames, wherein the dynamic capture data to be processed represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frames, the noise feature vector set comprises target noise feature vectors, and N is an integer greater than 1;
In this embodiment, the data processing apparatus obtains N consecutive dynamic capture key frames according to the target dynamic capture key frame, where the N frames include the target dynamic capture key frame. It should be noted that the N dynamic capture key frames are consecutive; this application takes N equal to 31 as an example, which should not be construed as a limitation.
Specifically, to help the dynamic capture denoising model understand the joint movement trend, the [(N-1)/2] dynamic capture key frames before and after the target dynamic capture key frame can be extracted and spliced in sequence into to-be-processed dynamic capture data of length N. When fewer than [(N-1)/2] dynamic capture key frames exist before or after the target frame, padding can be performed in a centrally symmetric (mirrored) manner. Taking N equal to 31 as an example, assume the target dynamic capture key frame is the 1st dynamic capture key frame of the whole animation; the following 15 dynamic capture key frames (frames t+15, t+14, ..., t+1) are then used as the preamble padding, which keeps the sequence temporally continuous and keeps the target dynamic capture key frame in the middle of the N frames. This yields 31 dynamic capture key frames, namely frames t+15, t+14, ..., t+1, t, t+1, ..., t+15, where the t-th frame is the target dynamic capture key frame.
Based on this, the dynamic capture data to be processed (i.e., the noise feature vector set) can be expressed as:

$$
\begin{bmatrix}
(x_{1,t-k}, y_{1,t-k}, z_{1,t-k}) & (x_{2,t-k}, y_{2,t-k}, z_{2,t-k}) & \cdots & (x_{M,t-k}, y_{M,t-k}, z_{M,t-k}) \\
\vdots & \vdots & \ddots & \vdots \\
(x_{1,t}, y_{1,t}, z_{1,t}) & (x_{2,t}, y_{2,t}, z_{2,t}) & \cdots & (x_{M,t}, y_{M,t}, z_{M,t}) \\
\vdots & \vdots & \ddots & \vdots \\
(x_{1,t+k}, y_{1,t+k}, z_{1,t+k}) & (x_{2,t+k}, y_{2,t+k}, z_{2,t+k}) & \cdots & (x_{M,t+k}, y_{M,t+k}, z_{M,t+k})
\end{bmatrix},
\qquad k = \frac{N-1}{2}
$$

where each row is one noise feature vector, and the N noise feature vectors form the noise feature vector set; x, y, and z denote the parameters on the X-, Y-, and Z-axes, subscripts 1 through M index the M joints, N is the total number of dynamic capture key frames, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame, whose target noise feature vector is:

$(x_{1,t}, y_{1,t}, z_{1,t}), (x_{2,t}, y_{2,t}, z_{2,t}), \ldots, (x_{M,t}, y_{M,t}, z_{M,t})$

It can be understood that, alternatively, the (N-1) dynamic capture key frames before the target dynamic capture key frame may be extracted and spliced in sequence into to-be-processed dynamic capture data of length N, or the (N-1) dynamic capture key frames after the target dynamic capture key frame may be extracted and spliced in sequence.
103. Based on the dynamic capture data to be processed, acquiring a noise offset vector corresponding to a target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model, wherein the noise offset vector comprises offset data of M joints;
In this embodiment, the data processing apparatus inputs the dynamic capture data to be processed into the dynamic capture denoising model, which outputs the noise offset vectors corresponding to the N dynamic capture key frames. Since the position of the target dynamic capture key frame within the N frames is predetermined, the noise offset vector corresponding to the target dynamic capture key frame can be extracted from them. The noise offset vector includes offset data for the M joints; similar to the noise data, each joint corresponds to one set of offset data, and each set includes an offset parameter for the X-axis rotation channel, an offset parameter for the Y-axis rotation channel, and an offset parameter for the Z-axis rotation channel. Assuming M is 18, the noise offset vector may therefore include 54-dimensional parameters. Taking the right foot joint as an example, its offset data can be represented as "foot_r_xrotation_offset", "foot_r_yrotation_offset", and "foot_r_zrotation_offset".
104. And determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of M joints.
In this embodiment, the data processing device may adjust the target noise feature vector with noise according to the noise offset vector corresponding to the target dynamic capture keyframe, so as to obtain a final joint denoising result.
Specifically, assume the noise offset vector includes offset data for 18 joints, with the offset data for the right foot joint being [-0.1, 0, 0.1]; correspondingly, the target noise feature vector includes noise data for 18 joints, with the noise data for the right foot joint being [60.2, 70, 59.8]. Adding the right foot joint's offset data to its noise data element-wise yields its denoised data [60.1, 70, 59.9]. By analogy, adding each joint's offset data in the noise offset vector to the corresponding noise data yields the denoised data for that joint; the denoised data of the M joints together form the joint denoising result, which can be expressed in the form of a feature vector.
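The correction is a plain per-channel addition, as the worked numbers above show; in a sketch (NumPy, with the remaining 51 channels zeroed purely for brevity):

```python
import numpy as np

# 54-dimensional vectors: 18 joints x 3 rotation channels; only the right
# foot joint's three channels are populated here, for illustration.
noise = np.array([60.2, 70.0, 59.8] + [0.0] * 51)   # target noise feature vector
offset = np.array([-0.1, 0.0, 0.1] + [0.0] * 51)    # predicted noise offset vector

denoised = noise + offset   # joint denoising result
print(denoised[:3])         # [60.1 70.  59.9]
```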
For ease of illustration, please refer to fig. 7, which compares an animated character's joint before and after noise removal in an embodiment of the present application: part (a) of fig. 7 shows the Z-axis rotation curve of one of the character's joints before denoising, and part (b) shows the same curve after denoising. The denoised Z-axis rotation curve is clearly smoother.
The embodiment of the present application thus provides a data processing method in which the dynamic capture denoising model predicts the noise offset vector of the target dynamic capture key frame on the animation curve and the joint denoising result is obtained from that vector; the original trend of the motion details is retained, the time and labor costs of animation production are saved, and the efficiency of animation noise repair is improved.
Optionally, on the basis of the various embodiments corresponding to fig. 5, in an optional embodiment provided in the embodiments of the present application, the target noise feature vector further includes an angular velocity corresponding to the root joint;
acquiring to-be-processed kinetic capture data according to a target kinetic capture key frame, and specifically comprising the following steps of:
acquiring continuous N dynamic capture key frames according to the target dynamic capture key frame;
and acquiring to-be-processed dynamic capture data according to the continuous N dynamic capture key frames, wherein the to-be-processed dynamic capture data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and an angular velocity corresponding to a root joint.
This embodiment introduces a method of determining the target noise feature vector that incorporates the angular velocity of the root joint. The data processing apparatus acquires N consecutive dynamic capture key frames according to the target dynamic capture key frame, where the N frames include the target frame; one noise feature vector can be extracted for each dynamic capture key frame, and each noise feature vector includes data corresponding to (M+1) joints. The (M+1) joints include the root joint, whose angular velocity specifically comprises the rotational angular velocity about the X-axis, the Y-axis, and the Z-axis. These three rotational angular velocities reflect the speed of the current motion, so the angular velocity of the root joint can help the dynamic capture denoising model understand the speed and period of the action.
Based on this, the dynamic capture data to be processed (i.e., the noise feature vector set) can be expressed as:

$$
\begin{bmatrix}
(x_{1,t-k}, y_{1,t-k}, z_{1,t-k}) & \cdots & (x_{M,t-k}, y_{M,t-k}, z_{M,t-k}) & (x_{R,t-k}, y_{R,t-k}, z_{R,t-k}) \\
\vdots & \ddots & \vdots & \vdots \\
(x_{1,t}, y_{1,t}, z_{1,t}) & \cdots & (x_{M,t}, y_{M,t}, z_{M,t}) & (x_{R,t}, y_{R,t}, z_{R,t}) \\
\vdots & \ddots & \vdots & \vdots \\
(x_{1,t+k}, y_{1,t+k}, z_{1,t+k}) & \cdots & (x_{M,t+k}, y_{M,t+k}, z_{M,t+k}) & (x_{R,t+k}, y_{R,t+k}, z_{R,t+k})
\end{bmatrix},
\qquad k = \frac{N-1}{2}
$$

where each row is one noise feature vector, and the N noise feature vectors form the noise feature vector set; x, y, and z denote the parameters on the X-, Y-, and Z-axes, subscripts 1 through M index the M joints, R denotes the root joint, N is the total number of dynamic capture key frames, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame, whose target noise feature vector is:

$(x_{1,t}, y_{1,t}, z_{1,t}), (x_{2,t}, y_{2,t}, z_{2,t}), \ldots, (x_{M,t}, y_{M,t}, z_{M,t}), (x_{R,t}, y_{R,t}, z_{R,t})$
it is understood that the noise data corresponding to the root node may be arranged at any position in the noise feature vector, but it should be noted that the noise data corresponding to the joints in each noise feature vector needs to be consistent, and for ease of understanding, please refer to table 1, where table 1 is the noise data of each joint extracted from the target motion capture keyframe.
TABLE 1

| Joint | Noise data | Joint | Noise data |
| --- | --- | --- | --- |
| Thorax joint | (124, 255, 300) | Neck joint | (135, 55, 310) |
| Right leg joint | (15, 55, 71) | Left leg joint | (147, 135, 241) |
| Right knee joint | (14, 67, 35) | Left knee joint | (19, 25, 68) |
| Right ankle joint | (46, 215, 244) | Left ankle joint | (117, 46, 112) |
| Right foot joint | (322, 157, 42) | Left foot joint | (129, 155, 111) |
| Right elbow joint | (310, 255, 300) | Left elbow joint | (19, 255, 62) |
| Right hand joint | (114, 255, 300) | Left hand joint | (249, 295, 78) |
| Right shoulder joint | (74, 92, 30) | Left shoulder joint | (286, 99, 152) |
| Right hip joint | (12, 341, 100) | Left hip joint | (42, 155, 110) |
| Root joint | angular velocity (57, 62, 72) | | |
As can be seen from Table 1, the target noise feature vector may include 57-dimensional parameters, the additional three dimensions being the angular velocity of the root joint; likewise, each noise feature vector may include 57-dimensional parameters. It should be noted that, in practice, the parameters shown in Table 1 may also carry one or more decimal places.
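One way the root joint's angular velocity might be computed and appended to each noise feature vector is sketched below. Approximating the angular velocity as the frame-to-frame difference of the root's Euler rotation scaled by an assumed frame rate is an illustrative choice, not an estimator fixed by this application:

```python
import numpy as np

FPS = 30   # assumed capture frame rate
M = 18     # joints, excluding the root joint

def root_angular_velocity(root_rotations: np.ndarray) -> np.ndarray:
    """Approximate the root joint's per-frame angular velocity (degrees per
    second) as the finite difference of its Euler rotations, scaled by the
    frame rate. Input shape (T, 3); output shape (T, 3)."""
    return np.diff(root_rotations, axis=0, prepend=root_rotations[:1]) * FPS

T = 100
joint_rotations = np.random.randn(T, M, 3)  # noise data for the M joints
root_rotations = np.random.randn(T, 3)      # the root joint's Euler rotations

velocity = root_angular_velocity(root_rotations)
# Each noise feature vector: 54 joint dimensions + 3 root angular velocities.
noise_feature_vectors = np.concatenate(
    [joint_rotations.reshape(T, -1), velocity], axis=1)
print(noise_feature_vectors.shape)          # (100, 57)
```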
Thus, this embodiment of the present application provides a method of determining the target noise feature vector based on the angular velocity of the root joint. Because the root joint's angular velocity reflects how fast the character is moving, using it as a reference when predicting the noise offset vector can improve the accuracy of the model's prediction.
Optionally, on the basis of each embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture keyframe is obtained through a time series network included in the dynamic capture denoising model, which specifically includes the following steps:
based on the dynamic capture data to be processed, acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model, wherein the coding feature vector set includes N coding feature vectors;
acquiring a target time sequence characteristic vector through a time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set;
and acquiring a noise offset vector corresponding to the target dynamic-capture key frame through a decoder included in the dynamic-capture denoising model based on the target time sequence feature vector.
In this embodiment, a way of outputting noise offset vectors based on the dynamic capture denoising model is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57, where 54 dimensions are noise data and 3 dimensions are angular velocities). Based on this, the dynamic capture data to be processed, formed by 31 frames of dynamic capture keyframes, is denoted as a 31 × 57 matrix.
For convenience of understanding, please refer to fig. 8, which is a schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in the embodiment of the present application. As shown in the figure, it is assumed that the target dynamic capture keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture data to be processed may be represented as a 31 × 57 matrix. The dynamic capture data to be processed is first input into an encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data. It should be noted that the number of network layers of the encoder and the output dimensions are only an illustration.
The coding feature vector set is then input to the time series network, which outputs a 31 × 54 matrix, and the target time series feature vector corresponding to the target dynamic capture keyframe is selected from the 31 × 54 matrix; the target time series feature vector has 54 dimensions. It should be noted that the time series network may be a bidirectional long short-term memory (Bi-LSTM) network, a long short-term memory (LSTM) network, a gated recurrent unit (GRU) network, a temporal convolutional network (TCN), or a recurrent neural network (RNN). Fig. 8 illustrates an LSTM as an example, which should not be construed as a limitation to the present application.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 54-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the target noise feature vector to obtain the joint denoising result of the target dynamic capture keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
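As an illustration only, the following PyTorch sketch mirrors the pipeline just described, using the example dimensions from the text (two fully-connected encoder layers 57 → 48 → 32, an LSTM producing a 54-dimensional vector per frame, and a two-layer decoder producing the 54-dimensional noise offset vector); the class and variable names are hypothetical, and the activation functions are an assumption:

```python
import torch
import torch.nn as nn

class MocapDenoiser(nn.Module):
    def __init__(self, in_dim=57, enc_dims=(48, 32), hidden=54, out_dim=54):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, enc_dims[0]), nn.ReLU(),
            nn.Linear(enc_dims[0], enc_dims[1]), nn.ReLU(),
        )
        # Unidirectional LSTM time series network: 54-dim output per key frame.
        self.lstm = nn.LSTM(enc_dims[1], hidden, batch_first=True)
        self.decoder = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x, t):
        # x: (batch, 31, 57) dynamic capture data; t: target frame index.
        encoded = self.encoder(x)            # (batch, 31, 32)
        seq_out, _ = self.lstm(encoded)      # (batch, 31, 54)
        target_feat = seq_out[:, t, :]       # target time series feature vector
        return self.decoder(target_feat)     # (batch, 54) noise offset vector

model = MocapDenoiser()
offset = model(torch.randn(1, 31, 57), t=15)  # t = 15: the middle key frame
```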
In the embodiment of the application, a method for outputting noise offset vectors based on a dynamic capture denoising model is provided, and through the method, N coding feature vectors obtained through coding are coded again by using a time sequence network, so that time sequence features implicit in N dynamic capture key frames are extracted, more accurate noise offset vectors can be predicted, and the effect of repairing animation noise is improved.
Optionally, on the basis of each embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target time series feature vector through the time series network included in the dynamic capture denoising model specifically includes the following steps:
acquiring a target time sequence characteristic vector through a forward time sequence network included in a dynamic capture denoising model based on the coding characteristic vector set;
or,
and acquiring a target time sequence characteristic vector through a backward time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set.
In this embodiment, a method for predicting the target time series feature vector based on a unidirectional time series network is described, taking N equal to 31 and an LSTM time series network as an example.
Specifically, referring to fig. 9, which is a schematic diagram of outputting a target time series feature vector based on a forward time series network in the embodiment of the present application: as shown in the figure, it is assumed that the coding feature vector set includes the coding feature vectors of N dynamic capture keyframes, where the coding feature vector corresponding to the 1st dynamic capture keyframe is represented as x_1, the coding feature vector corresponding to the t-th dynamic capture keyframe is represented as x_t, and the coding feature vector corresponding to the N-th dynamic capture keyframe is represented as x_N; the target dynamic capture keyframe may be any one of the N dynamic capture keyframes. The coding feature vector x_1 is encoded to obtain a hidden vector h_1; then the hidden vector h_1 and the coding feature vector x_2 corresponding to the next frame's dynamic capture keyframe are encoded together to obtain a hidden vector h_2. This is repeated until the target time series feature vector is obtained.
Referring to fig. 10, which is a schematic diagram of outputting a target time series feature vector based on a backward time series network in the embodiment of the present application: it is assumed that the coding feature vector set includes the coding feature vectors of N dynamic capture keyframes, where the coding feature vector corresponding to the N-th dynamic capture keyframe is represented as x_N, the coding feature vector corresponding to the t-th dynamic capture keyframe is represented as x_t, and the coding feature vector corresponding to the 1st dynamic capture keyframe is represented as x_1; the target dynamic capture keyframe may be any one of the N dynamic capture keyframes. The coding feature vector x_N is encoded to obtain a hidden vector h_N; then the hidden vector h_N and the next coding feature vector in the backward order, x_{N-1}, are encoded together to obtain a hidden vector h_{N-1}. This is repeated until the target time series feature vector is obtained.
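A minimal sketch of the two unidirectional passes, assuming PyTorch and the dimensions above (in a real backward time series network the backward pass would have its own weights; one LSTM is reused here purely for illustration):

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=54, batch_first=True)
encoded = torch.randn(1, 31, 32)    # N = 31 encoded 32-dim feature vectors
t = 15                              # index of the target key frame

forward_out, _ = lstm(encoded)                  # h_1 ... h_N, front to back
h_t_forward = forward_out[:, t, :]              # forward target feature vector

reversed_seq = torch.flip(encoded, dims=[1])    # consume x_N ... x_1
backward_out, _ = lstm(reversed_seq)            # h_N ... h_1, back to front
h_t_backward = backward_out[:, 30 - t, :]       # position of frame t when reversed
```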
Further, in the embodiment of the present application, a method for predicting the target time series feature vector based on a unidirectional time series network is provided. Through the above method, the sequence information in the N dynamic capture keyframes can be effectively extracted by the sequence encoder: the dynamic capture keyframes are encoded one after another, and information from the preceding or following frame is introduced, which is beneficial to improving the effect of model prediction.
Optionally, on the basis of each embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target time series feature vector through the time series network included in the dynamic capture denoising model specifically includes the following steps:
based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model;
based on the coding feature vector set, acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model;
and generating a target time sequence characteristic vector according to the first time sequence characteristic vector and the second time sequence characteristic vector.
In this embodiment, a manner of predicting the target time series feature vector based on a bidirectional time series network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57). Based on this, the dynamic capture data to be processed, composed of 31 dynamic capture keyframes, is denoted as a 31 × 57 matrix.
For convenience of understanding, please refer to fig. 11, which is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in the embodiment of the present application. As shown in the figure, it is assumed that the target dynamic capture keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture data to be processed may be represented as a 31 × 57 matrix. The dynamic capture data to be processed is first input into an encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data. It should be noted that the number of network layers of the encoder and the output dimensions are only an illustration.
The coding feature vector set is input into a bidirectional time series network (comprising a forward time series network and a backward time series network), which outputs a 31 × 108 matrix, and the target time series feature vector corresponding to the target dynamic capture keyframe is selected from the 31 × 108 matrix; the target time series feature vector has 108 dimensions. Specifically, the first time series feature vector corresponding to the target dynamic capture keyframe output by the forward time series network is spliced with the second time series feature vector corresponding to the target dynamic capture keyframe output by the backward time series network, yielding the 108-dimensional target time series feature vector. It should be noted that the bidirectional time series network may be a Bi-LSTM.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 108-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the target noise feature vector to obtain the joint denoising result of the target dynamic capture keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
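Under the same assumptions, a sketch of the bidirectional variant; PyTorch's bidirectional LSTM performs the forward/backward splice internally, so each frame yields 2 × 54 = 108 dimensions, matching the 31 × 108 matrix described above (decoder layer sizes are illustrative):

```python
import torch
import torch.nn as nn

bi_lstm = nn.LSTM(input_size=32, hidden_size=54,
                  batch_first=True, bidirectional=True)
decoder = nn.Sequential(nn.Linear(108, 108), nn.ReLU(), nn.Linear(108, 54))

encoded = torch.randn(1, 31, 32)  # coding feature vector set (N = 31)
seq_out, _ = bi_lstm(encoded)     # (1, 31, 108): forward/backward vectors spliced
target_feat = seq_out[:, 15, :]   # 108-dim target time series feature vector
offset = decoder(target_feat)     # (1, 54) noise offset vector
```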
Furthermore, in the embodiment of the present application, a method for predicting the target time series feature vector based on a bidirectional time series network is provided. Through this method, the time series information is processed by a bidirectional LSTM, which makes the motion trends easier to understand; the bidirectional LSTM helps the model use both past and future curve trend information, thereby improving the effect of model prediction.
Optionally, on the basis of each embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target time series feature vector through the time series network included in the dynamic capture denoising model specifically includes the following steps:
acquiring a third time sequence characteristic vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In this embodiment, a way of predicting the target time series feature vector based on a two-layer bidirectional time series network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57). Based on this, the dynamic capture data to be processed, composed of 31 frames of dynamic capture keyframes, is denoted as a 31 × 57 matrix.
For convenience of understanding, please refer to fig. 12, which is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in the embodiment of the present application. As shown in the figure, it is assumed that the target dynamic capture keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture data to be processed may be represented as a 31 × 57 matrix. The dynamic capture data to be processed is first input into an encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data. It should be noted that the number of network layers of the encoder and the output dimensions are only an illustration.
And inputting the coding feature vector set into a double-layer bidirectional time sequence network (comprising a first bidirectional time sequence network and a second bidirectional time sequence network, wherein each bidirectional time sequence network comprises a forward time sequence network and a backward time sequence network), outputting a matrix of 31 x 108 by the first bidirectional time sequence network, inputting the matrix of 31 x 108 into the second bidirectional time sequence network, outputting another matrix of 31 x 108 by the second bidirectional time sequence network, and selecting a target time sequence feature vector corresponding to the target dynamic capturing key frame from the matrix of 31 x 108, wherein the target time sequence feature vector has 108 dimensions. Specifically, a third time sequence feature vector corresponding to a target motion capture key frame output by the first bidirectional time sequence network is spliced with a fourth time sequence feature vector corresponding to a target motion capture key frame output by the second bidirectional time sequence network, and a 108-dimensional target time sequence feature vector is output. It should be noted that the bidirectional timing network may be Bi-LSTM.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 108-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the target noise feature vector to obtain the joint denoising result of the target dynamic capture keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
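A sketch of the two-layer bidirectional variant under the same assumptions, written as two explicit Bi-LSTM modules so that the first network's 31 × 108 output feeds the second, as described above (a single nn.LSTM with num_layers=2 and bidirectional=True would be an equivalent shorthand):

```python
import torch
import torch.nn as nn

bi_lstm1 = nn.LSTM(input_size=32, hidden_size=54,
                   batch_first=True, bidirectional=True)
bi_lstm2 = nn.LSTM(input_size=108, hidden_size=54,
                   batch_first=True, bidirectional=True)

encoded = torch.randn(1, 31, 32)   # coding feature vector set
out1, _ = bi_lstm1(encoded)        # first 31 x 108 matrix
out2, _ = bi_lstm2(out1)           # second 31 x 108 matrix
target_feat = out2[:, 15, :]       # 108-dim target time series feature vector
```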
Furthermore, in the embodiment of the application, a method for predicting a target time sequence feature vector based on a two-layer bidirectional time sequence network is provided, and through the method, time sequence information is processed by using two layers of bidirectional LSTM, so that action trends are convenient to understand, and the two layers of bidirectional LSTM are helpful for a model to use past and future curve trend information, so that the effect of model prediction is improved.
Optionally, on the basis of each embodiment corresponding to fig. 5, in another optional embodiment provided in the embodiment of the present application, the determining a joint denoising result corresponding to the target dynamic capturing keyframe according to the target noise feature vector and the noise offset vector specifically includes the following steps:
acquiring, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to that joint from the noise offset vector;
adding the offset data corresponding to the joint and the noise data corresponding to the joint to obtain the denoised data corresponding to the joint;
and when the denoised data corresponding to each of the M joints has been acquired, generating the joint denoising result corresponding to the target dynamic capture keyframe.
In this embodiment, a manner of generating the joint denoising result is introduced. The dynamic capture denoising model outputs the noise offset vector corresponding to the target dynamic capture keyframe; it can be understood that the noise offset vector includes the offset data of M joints, and taking M equal to 18 as an example, the noise offset vector includes 54 parameters. The noise offset vector is added to the original target noise feature vector to obtain the joint denoising result, thereby realizing the denoising function.
Specifically, for ease of understanding, please refer to table 2, where table 2 is offset data of each joint extracted from the target motion capture keyframe, and these offset data constitute a noise offset vector of the target motion capture keyframe.
TABLE 2
Joint | Offset data | Joint | Offset data
Thorax joint | (1,2,3) | Neck joint | (1,-1,-1)
Right leg joint | (0,1,2) | Left leg joint | (1,1,1)
Right knee joint | (-1,-1,5) | Left knee joint | (0,5,8)
Right ankle joint | (2,2,2) | Left ankle joint | (1,-1,-2)
Right foot joint | (-3,-7,2) | Left foot joint | (0,-3,1)
Right elbow joint | (2,5,3) | Left elbow joint | (1,0,2)
Right hand joint | (-1,-1,-3) | Left hand joint | (2,-1,-2)
Right shoulder joint | (-5,1,0) | Left shoulder joint | (-1,0,2)
Right hip joint | (0,1,0) | Left hip joint | (-2,1,0)
As can be seen from table 2, the noise offset vector of the target dynamic capture keyframe may include 54-dimensional parameters, i.e., the offset data of the root joint is not included at this time. It should be noted that, in practical cases, the parameters shown in table 2 may also include at least one decimal place. Taking the noise data corresponding to the "thorax joint" in the target noise feature vector as an example, please refer to table 1 again: the noise data of the thorax joint is (124,255,300), and the offset data of the thorax joint is (1,2,3), so the corresponding parameters are added to obtain the denoised data of the thorax joint, (125,257,303). Similar processing is applied to the other joints until the denoised data corresponding to all M joints is obtained; this denoised data constitutes the joint denoising result corresponding to the target dynamic capture keyframe.
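The addition itself is elementary; a minimal sketch using the thorax and neck joint values from tables 1 and 2 (the dictionary keys are illustrative):

```python
import numpy as np

noise = {"thorax": np.array([124, 255, 300]), "neck": np.array([135, 55, 310])}
offset = {"thorax": np.array([1, 2, 3]), "neck": np.array([1, -1, -1])}

# Denoised data = noise data + offset data, joint by joint.
denoised = {joint: noise[joint] + offset[joint] for joint in noise}
print(denoised["thorax"])   # [125 257 303], as in the worked example above
```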
Secondly, in the embodiment of the application, a mode for generating a joint denoising result is provided, and through the mode, the joint denoising result can be automatically determined based on the target noise characteristic vector and the noise offset vector, so that the noise of the joint can be corrected without manual calculation, and the convenience of operation is further improved.
With reference to fig. 13, a method for training a dynamic capture denoising model in the present application will be described below, where an embodiment of the method for training a dynamic capture denoising model in the present application includes:
201. acquiring dynamic capture noise-free data corresponding to dynamic capture sample key frames, wherein the dynamic capture noise-free data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
in this embodiment, the dynamic capture denoising model training device obtains dynamic capture noise-free data from a dynamic capture database. Since open-source data is generally relatively clean, with little jitter and noise, such dynamic capture data can be considered clean and used as the dynamic capture noise-free data. The dynamic capture noise-free data comprises N continuous dynamic capture keyframes, which include the dynamic capture sample keyframe to be predicted. Each of the N continuous dynamic capture keyframes has an original feature vector, and the N original feature vectors form the original feature vector set; for convenience of description, the original feature vector corresponding to the dynamic capture sample keyframe is referred to as the first feature vector, and the first feature vector includes the label data corresponding to the M joints. The present application is described by taking M equal to 18 and N equal to 31 as an example; however, this should not be construed as limiting the present application.
Similar to the foregoing embodiment, the original feature vector corresponding to each frame of the kinetic capture key frame includes the label data of M joints, and further, may also include the label data of the root joint. Assuming that M is 18 and the tag data of each joint includes parameters of an X-axis rotation channel, parameters of a Y-axis rotation channel, and parameters of a Z-axis rotation channel, each raw feature vector includes 54-dimensional parameters. If the rotation angular velocity of the root joint on the X axis, the rotation angular velocity on the Y axis, and the rotation angular velocity on the Z axis are also included, each of the original feature vectors includes 57-dimensional parameters.
Specifically, the initial sample data set contains as many as 6.79 million frames, mostly from Biovision Hierarchy (BVH) files or Filmbox (FBX) files of open-source motion capture databases in the industry, where BVH and FBX are common animation formats. The dynamic capture databases used include, but are not limited to, the Karlsruhe Institute of Technology (KIT) motion capture database, the Simon Fraser University and National University of Singapore (SFU) motion capture database, the Advanced Computing Center for the Arts and Design (ACCAD) motion capture database of Ohio State University, and the Carnegie Mellon University (CMU) motion capture database; the specific information of these databases is shown in table 3.
TABLE 3
Database | KIT | ACCAD | SFU | CMU
Number of files | 2141 | 81 | 44 | 2548
Total number of frames | 2397855 | 19599 | 109653 | 4264355
Frame length/second | 0.01 | 0.0333333 | 0.008333 | 0.00833333
Frame rate/Hz | 100 | 30 | 120 | 120
As can be seen from table 3, the motion capture data in the above databases differ in frame rate, and their human skeletons also differ. To ensure that the input and output of the dynamic capture denoising model are fixed, the motion capture data are first unified. For this purpose, the motion capture data can be retargeted in MotionBuilder software so that the motion skeleton is uniform. In addition, MotionBuilder can also unify the animation frame rate; for example, the training data can all be unified to a frame rate of 30 frames per second (FPS), which can yield about 1.83 million frames.
It can be understood that the dynamic capture denoising model training device may be deployed in a server, or may be deployed in a terminal device, which is not limited herein.
202. Acquiring dynamic capture noise data according to dynamic capture noise-free data, wherein the dynamic capture noise data comprises a noise feature vector set corresponding to N continuous dynamic capture key frames, the noise feature vector set comprises second feature vectors corresponding to dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
in this embodiment, in order to simulate noise data during training, noise may be added to the data of different channels of the M joints (not including the root joint) in the dynamic capture noise-free data. Generally speaking, a joint with a larger range of motion is more likely to exhibit severe jitter, so the noise of each channel is related to the range of that channel's rotation values. In addition, noise is added to only part of the dynamic capture keyframes (e.g., 80% of the animation frames in the dynamic capture noise-free data), considering that noise may be widespread in practice but not every frame contains noise.
Specifically, for a given rotation channel (for example, the X-axis rotation channel), the variance of the channel's rotation values is recorded as σ; part of the dynamic capture keyframes corresponding to the channel are extracted, and Gaussian noise with a mean of 0 and a variance of 0.05 × σ is added to the rotation value of each such frame, thereby constructing the dynamic capture noise data. It is to be understood that the dynamic capture noise data includes the noise feature vector set corresponding to the N continuous dynamic capture keyframes; each of the N continuous dynamic capture keyframes has one noise feature vector, and the N noise feature vectors form the noise feature vector set. For convenience of description, the noise feature vector corresponding to the dynamic capture sample keyframe is referred to as the second feature vector, and the second feature vector includes the noise data corresponding to the M joints.
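A sketch of this noise-construction step, assuming NumPy; whether 0.05 × σ is applied as a variance or a standard deviation, and the exact fraction of noisy frames, follow the text's wording and should be treated as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_channel_noise(clean, noisy_fraction=0.8, scale=0.05):
    """clean: (num_frames, num_channels) rotation values, one column per channel."""
    noisy = np.array(clean, dtype=float)          # work on a float copy
    num_frames, num_channels = noisy.shape
    for j in range(num_channels):
        sigma = noisy[:, j].std()                 # spread of this rotation channel
        mask = rng.random(num_frames) < noisy_fraction  # frames that get noise
        noisy[mask, j] += rng.normal(0.0, scale * sigma, mask.sum())
    return noisy

noisy_data = add_channel_noise(rng.normal(size=(31, 54)))  # toy clean data
```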
For convenience of illustration, please refer to fig. 14, which compares the rotation curves of a certain joint of an animated character over a period of time with and without noise. As shown, the horizontal axis represents time and the vertical axis represents the joint rotation value, where E1 indicates the noisy curve, E2 indicates the noise-free curve, E3 indicates the curve predicted by a fixed Gaussian filter (fixed parameters, mean 0, variance 1), and E4 indicates the curve predicted by an adaptive Gaussian filter. Within the rectangular box, the curve shown by E4 has completely deviated from the curve indicated by E2 and fluctuates along with the curve indicated by E1.
203. Based on the dynamic capture noise data, acquiring a noise offset vector corresponding to a dynamic capture sample key frame through a time sequence network included in a dynamic capture denoising model to be trained, wherein the noise offset vector comprises offset data of M joints;
in this embodiment, the dynamic capture denoising model training device inputs dynamic capture noise data into a dynamic capture denoising model to be trained, the dynamic capture denoising model to be trained outputs noise offset vectors corresponding to N dynamic capture key frames, and since positions of dynamic capture sample key frames in the N dynamic capture key frames are predetermined, the noise offset vectors corresponding to the dynamic capture sample key frames can be further extracted, where the noise offset vectors include offset data of M joints, which are similar to the noise data, each joint corresponds to one set of offset data, and each set of offset data includes an offset parameter of an X-axis rotation channel, an offset parameter of a Y-axis rotation channel, and an offset parameter of a Z-axis rotation channel.
204. And updating the model parameters of the dynamic-capture denoising model to be trained according to the first characteristic vector, the second characteristic vector and the noise offset vector until the model training conditions are met, and outputting the dynamic-capture denoising model.
In this embodiment, the dynamic capture denoising model training device may adjust the second feature vector with noise according to the noise offset vector corresponding to the dynamic capture sample key frame, so as to obtain a final joint denoising result, where the joint denoising result is a predicted value, and the first feature vector is a true value, so that the joint denoising result is compared with the first feature vector to obtain a loss value, and a model parameter of the dynamic capture denoising model to be trained is updated by using the loss value until a model training condition is satisfied, and the model parameter obtained by the last iteration is used as the model parameter of the dynamic capture denoising model.
In the embodiment of the present application, a training method for the dynamic capture denoising model is provided. In the above manner, the dynamic capture denoising model is trained using the dynamic capture noise-free data and the dynamic capture noise data. The trained model can predict the noise offset vector of a dynamic capture sample keyframe on the animation curve, and the joint denoising result can be obtained based on the noise offset vector; this preserves the original trend of the motion details and keeps the denoised animation natural and smooth, saving the time and labor costs of animation production. An animator can subsequently perform secondary manual processing on the model's denoising result at a greatly reduced cost, improving the efficiency of animation noise repair.
Optionally, on the basis of each embodiment corresponding to fig. 13, in an optional embodiment provided in this application, the obtaining dynamic capture noiseless data corresponding to the dynamic capture sample keyframe specifically includes the following steps:
acquiring a dynamic capture sample key frame;
acquiring continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring dynamic capture noise-free data according to the continuous N dynamic capture key frames, wherein the dynamic capture noise-free data comprises N original feature vectors, and each original feature vector comprises label data corresponding to M joints and label data corresponding to a root joint;
acquiring dynamic capture noise data according to dynamic capture noiseless data, which specifically comprises the following steps:
and acquiring dynamic capture noise data according to the original feature vector set corresponding to the continuous N dynamic capture key frames, wherein the dynamic capture noise data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and an angular velocity corresponding to a root joint.
In this embodiment, a method for training the dynamic capture denoising model based on the angular velocity of the root joint is introduced. The dynamic capture denoising model training device obtains N continuous dynamic capture keyframes according to the dynamic capture sample keyframe, where the N dynamic capture keyframes include the dynamic capture sample keyframe. An original feature vector can be extracted for each dynamic capture keyframe, and each original feature vector includes the label data corresponding to (M + 1) joints. The (M + 1) joints include the root joint, whose label data specifically includes the rotation angular velocity on the X axis, the rotation angular velocity on the Y axis, and the rotation angular velocity on the Z axis. Accordingly, one noise feature vector may be extracted for each frame of dynamic capture keyframe, and each noise feature vector also includes the noise data corresponding to (M + 1) joints. The angular velocity of the root joint likewise comprises the rotation angular velocities on the X, Y, and Z axes; these three-dimensional rotation angular velocities reflect the speed of the current motion, so the label data and noise data of the root joint can help the dynamic capture denoising model understand the speed and period of the motion.
It is understood that, similar to the representation manner of the noisy feature vectors described in the foregoing embodiments, the original feature vectors are also represented in a corresponding manner, and based on this, the dynamic capture noise-free data (i.e., N original feature vectors) can be represented as:
$$
\begin{bmatrix}
(x'_{11}, y'_{11}, z'_{11}) & (x'_{21}, y'_{21}, z'_{21}) & \cdots & (x'_{M1}, y'_{M1}, z'_{M1}) & (x'_{R1}, y'_{R1}, z'_{R1}) \\
\vdots & \vdots & & \vdots & \vdots \\
(x'_{1t}, y'_{1t}, z'_{1t}) & (x'_{2t}, y'_{2t}, z'_{2t}) & \cdots & (x'_{Mt}, y'_{Mt}, z'_{Mt}) & (x'_{Rt}, y'_{Rt}, z'_{Rt}) \\
\vdots & \vdots & & \vdots & \vdots \\
(x'_{1N}, y'_{1N}, z'_{1N}) & (x'_{2N}, y'_{2N}, z'_{2N}) & \cdots & (x'_{MN}, y'_{MN}, z'_{MN}) & (x'_{RN}, y'_{RN}, z'_{RN})
\end{bmatrix}
$$
where each row represents an original feature vector. x′ represents a label parameter on the X axis, y′ represents a label parameter on the Y axis, z′ represents a label parameter on the Z axis, the subscripts 1 to M index the M joints, R denotes the root joint, N represents the total number of dynamic capture keyframes, and t represents the t-th dynamic capture keyframe, namely the dynamic capture sample keyframe. The first feature vector corresponding to the dynamic capture sample keyframe is:
$(x'_{1t}, y'_{1t}, z'_{1t}), (x'_{2t}, y'_{2t}, z'_{2t}), \ldots, (x'_{Mt}, y'_{Mt}, z'_{Mt}), (x'_{Rt}, y'_{Rt}, z'_{Rt})$;
It is understood that the label data (or noise data) corresponding to the root joint may be arranged at any position in the first feature vector (or the second feature vector), but it should be noted that the joint types and their order in the first feature vector and the second feature vector need to be consistent.
Secondly, in the embodiment of the application, a method for training a dynamic capture denoising model based on the angular velocity of the root joint is provided, and the angular velocity of the root joint can reflect the speed of the action of the character through the method, so that the accuracy of model prediction can be improved by taking the angular velocity of the root joint as a reference for predicting a noise offset vector.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in the embodiment of the present application, based on the dynamic capture noise data, a noise offset vector corresponding to a keyframe of the dynamic capture sample is obtained through a time series network included in the dynamic capture denoising model to be trained, which specifically includes the following steps:
based on the dynamic capture noise data, acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model to be trained, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence characteristic vector through a time sequence network included in a dynamic capture denoising model to be trained based on the coding characteristic vector set;
and acquiring a noise offset vector corresponding to the key frame of the dynamic capture sample through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence characteristic vector.
In this embodiment, a way of outputting noise offset vectors based on the dynamic capture denoising model to be trained is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57). Based on this, the dynamic capture noise data composed of 31 dynamic capture keyframes is denoted as a 31 × 57 matrix.
Similar to what is described in the previous embodiment, this can be understood with reference to fig. 8 again. It is assumed that the dynamic capture sample keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture noise data can be represented as a 31 × 57 matrix. The dynamic capture noise data is first input into the encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data.
Inputting the coding feature vector set into the time sequence network, outputting a 31 × 54 matrix by the time sequence network, selecting a target time sequence feature vector corresponding to the captured sample key frame from the 31 × 54 matrix, wherein the target time sequence feature vector has 54 dimensions, and it should be noted that the time sequence network may be a Bi-LSTM network, an LSTM network, a GRU network, a TCN, or an RNN, which is not limited herein.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 54-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the second feature vector to obtain the joint denoising result of the dynamic capture sample keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
In the embodiment of the application, a method for outputting noise offset vectors based on a to-be-trained motion capture denoising model is provided, and through the method, N coding feature vectors obtained through coding are coded again by using a time sequence network, so that time sequence features implicit in N motion capture key frames are extracted, more accurate noise offset vectors can be predicted, and the effect of repairing animation noise is improved.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target timing feature vector through the timing network included in the dynamic capture denoising model to be trained specifically includes the following steps:
acquiring a target time sequence characteristic vector through a forward time sequence network included in a dynamic capture denoising model to be trained based on a coding characteristic vector set;
or,
and acquiring a target time sequence characteristic vector through a backward time sequence network included in the dynamic catching and denoising model to be trained based on the coding characteristic vector set.
In this embodiment, a method for predicting the target time series feature vector based on a unidirectional time series network is described, taking N equal to 31 and an LSTM time series network as an example. Similar to the content described in the previous embodiments, this can be understood by referring to fig. 8 and fig. 9 again, and the description is not repeated here.
Further, in the embodiment of the present application, a method for predicting the target time series feature vector based on a unidirectional time series network is provided. Through the above method, the sequence information in the N dynamic capture keyframes can be effectively extracted by the sequence encoder: the dynamic capture keyframes are encoded one after another, and information from the preceding or following frame is introduced, which is beneficial to improving the effect of model prediction.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target timing feature vector through the timing network included in the dynamic capture denoising model to be trained specifically includes the following steps:
based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in a dynamic capture denoising model to be trained;
and acquiring a second time sequence characteristic vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding characteristic vector set.
And generating a target time sequence characteristic vector according to the first time sequence characteristic vector and the second time sequence characteristic vector.
In this embodiment, a manner of predicting the target time series feature vector based on a bidirectional time series network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57). Based on this, the dynamic capture noise data composed of 31 dynamic capture keyframes is denoted as a 31 × 57 matrix.
For ease of understanding, please refer to fig. 11 again. It is assumed that the dynamic capture sample keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture noise data can be represented as a 31 × 57 matrix. The dynamic capture noise data is first input into the encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data. It should be noted that the number of network layers of the encoder and the output dimensions are only an illustration.
And inputting the coding feature vector set into a bidirectional time sequence network (comprising a forward time sequence network and a backward time sequence network), outputting a 31 x 108 matrix by the bidirectional time sequence network, and selecting a target time sequence feature vector corresponding to the live capture sample key frame from the 31 x 108 matrix, wherein the target time sequence feature vector has 108 dimensions. Specifically, a first time sequence feature vector corresponding to the moving capture sample key frame output by the forward time sequence network is spliced with a second time sequence feature vector corresponding to the moving capture sample key frame output by the backward time sequence network, and a 108-dimensional target time sequence feature vector is output. It should be noted that the bidirectional timing network may be Bi-LSTM.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 108-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the second feature vector to obtain the joint denoising result of the dynamic capture sample keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
Furthermore, in the embodiment of the application, a method for predicting a target time sequence feature vector based on a two-way time sequence network is provided, through the method, time sequence information is processed by using a two-way LSTM, so that action trends are convenient to understand, and the two-way LSTM is helpful for a model to use curve trend information in the past and the future, so that the effect of model prediction is improved.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in the embodiment of the present application, based on the encoding feature vector set, the method for obtaining the target timing feature vector through the timing network included in the dynamic capture denoising model to be trained specifically includes the following steps:
acquiring a third time sequence characteristic vector through a first bidirectional time sequence network included in a dynamic capture denoising model to be trained based on the coding characteristic vector set;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In this embodiment, a method for predicting the target time series feature vector based on a two-layer bidirectional time series network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3 × 18 + 3 = 57). Based on this, the dynamic capture noise data composed of 31 dynamic capture keyframes is denoted as a 31 × 57 matrix.
For ease of understanding, please refer to fig. 12 again. It is assumed that the dynamic capture sample keyframe is the t-th frame dynamic capture keyframe in the animation; the 15 consecutive preceding dynamic capture keyframes and the 15 consecutive following dynamic capture keyframes are obtained based on the t-th frame, so that the dynamic capture noise data can be represented as a 31 × 57 matrix. The dynamic capture noise data is first input into the encoder. Assuming the encoder includes two fully-connected layers, the input of the first fully-connected layer is the 31 × 57 matrix, which is encoded into a 31 × 48 matrix; the input of the second fully-connected layer is the 31 × 48 matrix, which is encoded into a 31 × 32 matrix. The 31 × 32 matrix is the coding feature vector set, and each coding feature vector includes 32-dimensional data. It should be noted that the number of network layers of the encoder and the output dimensions are only an illustration.
And inputting the coding feature vector set into a double-layer bidirectional time sequence network (comprising a first bidirectional time sequence network and a second bidirectional time sequence network, wherein each bidirectional time sequence network comprises a forward time sequence network and a backward time sequence network), outputting a matrix of 31 x 108 by the first bidirectional time sequence network, inputting the matrix of 31 x 108 into the second bidirectional time sequence network, outputting another matrix of 31 x 108 by the second bidirectional time sequence network, and selecting a target time sequence feature vector corresponding to the dynamic capture sample key frame from the matrix of 31 x 108, wherein the target time sequence feature vector has 108 dimensions. Specifically, a third time sequence feature vector corresponding to the live capture sample key frame output by the first bidirectional time sequence network and a fourth time sequence feature vector corresponding to the live capture sample key frame output by the second bidirectional time sequence network are spliced, and a 108-dimensional target time sequence feature vector is output. It should be noted that the bidirectional timing network may be Bi-LSTM.
The target time series feature vector is input to a decoder, whose main function is to map the information output by the time series network to the rotation space. Assuming that the decoder includes two fully-connected layers, the input of the first fully-connected layer is the 108-dimensional target time series feature vector, and a 54-dimensional noise offset vector is obtained after the two decoding steps; at this point, the 54-dimensional noise offset vector does not include offset data corresponding to the root joint. The noise offset vector is then combined with the second feature vector to obtain the joint denoising result of the dynamic capture sample keyframe. It should be noted that the number of network layers of the decoder is only an illustration.
Furthermore, in the embodiment of the application, a method for predicting a target time sequence feature vector based on a two-layer bidirectional time sequence network is provided, and through the method, time sequence information is processed by using two layers of bidirectional LSTM, so that action trends are convenient to understand, and the two layers of bidirectional LSTM are helpful for a model to use past and future curve trend information, so that the effect of model prediction is improved.
Optionally, on the basis of each embodiment corresponding to fig. 13, in another optional embodiment provided in this application, the updating of the model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector, and the noise offset vector specifically includes the following steps:
determining a joint denoising result according to the noise offset vector and the second characteristic vector;
determining a loss value by adopting a loss function according to a joint denoising result and the first feature vector;
and updating the model parameters of the dynamic catching denoising model to be trained according to the loss values.
In this embodiment, a method for training the dynamic capture denoising model to be trained with a loss function is introduced. Since the goal of denoising is for the denoised noisy data to be close to the clean motion capture data, the Huber loss function can be used in training; however, this should not be construed as a limitation of the present application.
Specifically, the Huber loss function employed in training is as follows:
$$
L_{\delta}\bigl(y, f(x)\bigr) =
\begin{cases}
\dfrac{1}{2}\bigl(y - f(x)\bigr)^{2}, & \bigl|y - f(x)\bigr| \le \delta \\[4pt]
\delta\,\bigl|y - f(x)\bigr| - \dfrac{1}{2}\delta^{2}, & \text{otherwise}
\end{cases}
$$
where y represents the true value, e.g., the first feature vector of the dynamic capture sample keyframe; f(x) represents the model's predicted value, e.g., the joint denoising result of the dynamic capture sample keyframe, which is obtained by adding the noise offset vector to the second feature vector; |y − f(x)| represents the absolute value of the difference between y and f(x); and δ is the threshold parameter of the Huber loss.
Further, the dropout rate of the time series network used during training is 0.5, the learning rate is 5e-5, and the number of iterations (epochs) is 5; these values are only an illustration and should not be construed as a limitation to the present application. When training the dynamic capture denoising model to be trained, the predicted joint denoising result is obtained by superposing the noise offset vector on the second feature vector; the loss function then computes the loss value between the predicted joint denoising result and the noise-free first feature vector, the gradient is back-propagated using this loss value, and the model is updated until the loss value converges or the upper limit of iterations is reached, at which point the model training condition is considered satisfied. The model parameters obtained after the last iteration are used as the model parameters of the dynamic capture denoising model, so that the model learns to predict the noise offset from the input noisy features.
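A sketch of one such training step, assuming PyTorch (the stand-in model, batch size, and data are illustrative; the 0.5 dropout on the time series network is omitted for brevity, and the Huber threshold is left at the library default):

```python
import torch
import torch.nn as nn

# Stand-in for the encoder-LSTM-decoder sketched earlier: 57-dim frame
# features in, 54-dim noise offset vector out for the target frame t.
class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(57, 32)
        self.lstm = nn.LSTM(32, 54, batch_first=True)
        self.decoder = nn.Linear(54, 54)   # maps to the rotation space

    def forward(self, x, t):
        out, _ = self.lstm(torch.relu(self.encoder(x)))
        return self.decoder(out[:, t, :])

model = TinyDenoiser()
criterion = nn.HuberLoss()                                # delta = 1.0 default
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

noisy_batch = torch.randn(8, 31, 57)   # dynamic capture noise data (batch of 8)
clean_target = torch.randn(8, 54)      # first feature vector (noise-free labels)
noisy_center = noisy_batch[:, 15, :54] # second feature vector's 54 joint dims
                                       # (root angular velocity assumed last)

offset = model(noisy_batch, t=15)                       # predicted noise offset
loss = criterion(noisy_center + offset, clean_target)   # denoised vs. clean
optimizer.zero_grad()
loss.backward()
optimizer.step()
```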
Secondly, in the embodiment of the present application, a way of training the dynamic capture denoising model to be trained is provided. By adopting the Huber loss function as the loss function for model training, the model's sensitivity to outliers can be reduced, so that it does not over-attend to abnormal noise points and generates as smooth a motion curve as possible, which benefits the reasonableness and reliability of model training.
The data processing method provided by the present application will be further explained below with reference to experimental data.
In evaluating the model, the error of the rotation data before and after denoising is used to measure the model effect; the per-frame errors can be cumulatively averaged with the root mean square error (RMSE), whose formula is as follows:
$$
\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{W \times K} \sum_{i=1}^{W} \sum_{j=1}^{K} e_{i,j}^{2}}
$$
where W represents the total number of frames in the denoised animation, and K represents the total number of channels included in each dynamic capture keyframe; if each dynamic capture keyframe has 18 joints and each joint corresponds to three channels, then K is 54. e_{i,j} represents the difference between the rotation values of the j-th channel of the i-th dynamic capture keyframe before and after denoising, and MSE represents the mean squared error.
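A minimal sketch of this evaluation, assuming NumPy; the function name and array layout are illustrative:

```python
import numpy as np

def rotation_rmse(before, after):
    """before, after: (W, K) rotation values, e.g. K = 54 channels per frame."""
    e = after - before          # e[i, j]: per-frame, per-channel rotation difference
    mse = np.mean(e ** 2)       # mean squared error over W frames and K channels
    return np.sqrt(mse)         # RMSE as in the formula above
```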
The data processing method provided by the present application can be used to clean dynamic capture data: it removes the jitter of most dynamic capture data while preserving the motion trend well, so that the details of the denoised animation's motion are in place and the motion is natural and smooth. This reduces the cost of manual denoising, improves the denoising effect, and improves the efficiency of game animation production.
Based on this, please refer to table 4, which shows the evaluation results on a test set of 10 files containing 3500 dynamic capture keyframes.
TABLE 4
Evaluation metric    Conventional scheme    This application    Noise error between noisy and clean data
MSE                  6.286521               5.508197            19.652893
HUBER                2.462729               2.336673            7.422279
The conventional scheme refers to denoising with an adaptive Gaussian filter. As can be seen from Table 4, the dynamic capture data processing method provided by the present application is clearly better on the evaluation metrics. In the animation itself, the method also preserves the action details and stays closer to the original curve in local trend. For ease of illustration, please refer to fig. 15, a schematic diagram comparing Y-axis rotation curves based on experimental data in the embodiment of the present application, in which four Y-axis rotation curves (spine02_rotation) are shown: F1 indicates the noisy curve, F2 the noise-free curve, F3 the curve predicted by the conventional scheme (i.e., the adaptive Gaussian filter), and F4 the curve predicted by the dynamic capture data processing method provided by the present application. Clearly, the curve indicated by F3 undulates insufficiently at the peaks and troughs (for example, its first peak is too low), while the curve indicated by F4 is closer to the original curve and retains the action details.
Further, referring to fig. 16, a schematic diagram comparing X-axis rotation curves based on experimental data in the embodiment of the present application, fig. 16 (A) and fig. 16 (B) show four X-axis rotation curves (leaf_r_Xrotation); for ease of observation, the rectangular box selected in fig. 16 (A) is enlarged to give fig. 16 (B). Here F1 indicates the noisy curve, F2 the noise-free curve, F3 the curve predicted by the conventional scheme (i.e., the adaptive Gaussian filter), and F4 the curve predicted by the dynamic capture data processing method provided by the present application. Clearly, the local fluctuation of the curve indicated by F4 is closer to the curve indicated by F2, while the curve indicated by F3 deviates toward the curve indicated by F1 and oscillates noticeably around the curve indicated by F2. This demonstrates that the scheme provided by the present application can better reproduce character actions: errors remain at a very small granularity (e.g., 0.5 degrees), with almost no visible difference.
Referring to fig. 17, fig. 17 is a schematic diagram of an embodiment of a data processing apparatus in an embodiment of the present application, and the data processing apparatus 30 includes:
an obtaining module 301, configured to obtain a target dynamic capture key frame, where the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector includes noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the obtaining module 301 is further configured to obtain motion capture data to be processed according to the target motion capture key frame, where the motion capture data to be processed represents a noise feature vector set corresponding to N consecutive motion capture key frames, the N consecutive motion capture key frames include the target motion capture key frame, the noise feature vector set includes a target noise feature vector, and N is an integer greater than 1;
the obtaining module 301 is further configured to obtain, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture keyframe through a time sequence network included in the dynamic capture denoising model, where the noise offset vector includes offset data of M joints;
the determining module 302 is configured to determine joint denoising results corresponding to the target dynamic capture keyframe according to the target noise feature vector and the noise offset vector, where the joint denoising results include denoising data of M joints.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application, the target noise feature vector further includes an angular velocity corresponding to the root joint;
an obtaining module 301, configured to obtain N consecutive motion capture key frames according to a target motion capture key frame;
and acquiring to-be-processed dynamic capture data according to the continuous N dynamic capture key frames, wherein the to-be-processed dynamic capture data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and an angular velocity corresponding to a root joint.
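As a rough sketch of how such a window of N consecutive key frames might be assembled (the window length N = 9, the centering on the target frame, and the absence of boundary padding are assumptions for illustration only):

import numpy as np

def build_window(frames: np.ndarray, target_idx: int, n: int = 9) -> np.ndarray:
    # frames: (num_frames, feat_dim) noise feature vectors, where feat_dim covers
    # the noise data of the M joints plus the root joint's angular velocity.
    half = n // 2
    lo = max(target_idx - half, 0)
    hi = min(target_idx + half + 1, len(frames))
    return frames[lo:hi]  # noise feature vector set around the target key frame

A production implementation would additionally pad or clamp at sequence boundaries so that every window contains exactly N frames.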
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application,
the obtaining module 301 is specifically configured to obtain, based on the dynamic capture data to be processed, a coding feature vector set through a coder included in the dynamic capture denoising model, where the coding feature vector set includes N coding feature vectors;
acquiring a target time sequence characteristic vector through a time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set;
and acquiring a noise offset vector corresponding to the target dynamic-capture key frame through a decoder included in the dynamic-capture denoising model based on the target time sequence feature vector.
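A minimal sketch of this coder / time sequence network / decoder pipeline might look as follows; the layer widths, the choice of a single LSTM as the time sequence network, the feature width of 55, and taking the last frame as the target key frame are all assumptions, not the architecture mandated by this application:

import torch
import torch.nn as nn

class MocapDenoiser(nn.Module):
    # Hypothetical layout: per-frame coder -> time sequence network -> decoder.
    def __init__(self, feat_dim: int = 55, hidden: int = 128):
        super().__init__()
        self.coder = nn.Linear(feat_dim, hidden)                   # N noise feature vectors -> N coding feature vectors
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True)  # time sequence network over the window
        self.decoder = nn.Linear(hidden, feat_dim)                 # target time sequence feature -> noise offset vector

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        coded = torch.relu(self.coder(window))  # window: (batch, N, feat_dim)
        out, _ = self.temporal(coded)
        target = out[:, -1, :]                  # time sequence feature for the target key frame
        return self.decoder(target)             # noise offset vector (offset data for the M joints)

offset = MocapDenoiser()(torch.randn(4, 9, 55))  # 4 windows of N = 9 frames each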
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application,
an obtaining module 301, specifically configured to obtain a target timing feature vector through a forward timing network included in the dynamic capture denoising model based on the coding feature vector set;
or,
and acquiring a target time sequence characteristic vector through a backward time sequence network included in the dynamic capture denoising model based on the coding characteristic vector set.
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application,
an obtaining module 301, specifically configured to obtain a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
based on the coding feature vector set, acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model;
and generating a target time sequence characteristic vector according to the first time sequence characteristic vector and the second time sequence characteristic vector.
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application,
an obtaining module 301, configured to obtain a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
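For the variant with two bidirectional time sequence networks, one hedged reading (stacking two bidirectional LSTMs and fusing their outputs by concatenation; the fusion rule and the widths are assumptions for illustration) is:

import torch
import torch.nn as nn

hidden = 128  # assumed width of the coding feature vectors
bi1 = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)      # first bidirectional time sequence network
bi2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)  # second bidirectional time sequence network

coded = torch.randn(4, 9, hidden)            # coding feature vector set: (batch, N frames, hidden)
third, _ = bi1(coded)                        # third time sequence feature vector
fourth, _ = bi2(third)                       # fourth time sequence feature vector
target = torch.cat([third, fourth], dim=-1)  # one possible way to generate the target time sequence feature vector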
Alternatively, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application,
a determining module 302, configured to acquire, for noise data corresponding to any joint in the target noise feature vector, offset data corresponding to that joint from the noise offset vector;
add the offset data corresponding to that joint to the noise data corresponding to that joint, to obtain denoising data corresponding to that joint;
and, when the denoising data corresponding to each of the M joints has been acquired, generate the joint denoising result corresponding to the target dynamic capture key frame.
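Concretely, the superposition performed by this module reduces to an element-wise addition over the joints; a sketch with assumed shapes:

import numpy as np

M = 18                               # number of joints (assumed)
noise_data = np.random.randn(M, 3)   # noise data per joint: three rotation channels
offset_data = np.random.randn(M, 3)  # offset data per joint from the noise offset vector

denoised = noise_data + offset_data  # denoising data, joint by joint
# Once all M joints are covered (here in a single vectorized addition), the
# joint denoising result for the target dynamic capture key frame is complete.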
Referring to fig. 18, fig. 18 is a schematic view of an embodiment of a dynamic-capture denoising model training apparatus in an embodiment of the present application, where the dynamic-capture denoising model training apparatus 40 includes:
an obtaining module 401, configured to obtain dynamic capture noise-free data corresponding to dynamic capture sample key frames, where the dynamic capture noise-free data includes an original feature vector set of N consecutive dynamic capture key frames, the N consecutive dynamic capture key frames include dynamic capture sample key frames, the original feature vector set includes first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors include label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the obtaining module 401 is further configured to obtain dynamic capture noise data according to dynamic capture noise-free data, where the dynamic capture noise data includes a noise feature vector set corresponding to N continuous dynamic capture key frames, the noise feature vector set includes second feature vectors corresponding to dynamic capture sample key frames, and the second feature vectors include noise data corresponding to M joints;
the obtaining module 401 is further configured to obtain, based on the dynamic capture noise data, a noise offset vector corresponding to the dynamic capture sample key frame through a time sequence network included in the dynamic capture denoising model to be trained, where the noise offset vector includes offset data of M joints;
and the training module 402 is configured to update the model parameters of the dynamic-capture denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector, and output the dynamic-capture denoising model until the model training conditions are met.
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
an obtaining module 401, specifically configured to acquire a dynamic capture sample key frame;
acquiring continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring dynamic capture noise-free data according to the continuous N dynamic capture key frames, wherein the dynamic capture noise-free data comprises N original feature vectors, and each original feature vector comprises label data corresponding to M joints and label data corresponding to a root joint;
the obtaining module 401 is specifically configured to obtain dynamic capture noise data according to an original feature vector set corresponding to N consecutive dynamic capture key frames, where the dynamic capture noise data includes N noise feature vectors, and each noise feature vector includes noise data corresponding to M joints and an angular velocity corresponding to a root joint.
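One hedged way to derive the dynamic capture noise data from the noise-free data is to perturb the clean channels with random jitter; the Gaussian noise model and its scale below are assumptions for illustration, not a noise model prescribed by this application:

import numpy as np

def add_noise(clean: np.ndarray, sigma: float = 0.5, seed: int = 0) -> np.ndarray:
    # clean: (N, feat_dim) original feature vectors (label data for the M joints
    # plus the root joint's channel); returns the second-feature-vector set.
    rng = np.random.default_rng(seed)
    return clean + rng.normal(0.0, sigma, size=clean.shape)  # injected noise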
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain, based on the dynamic capture noise data, a coding feature vector set through a coder included in the dynamic capture denoising model to be trained, where the coding feature vector set includes N coding feature vectors;
acquiring a target time sequence characteristic vector through a time sequence network included in a dynamic capture denoising model to be trained based on the coding characteristic vector set;
and acquiring a noise offset vector corresponding to the key frame of the dynamic capture sample through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence characteristic vector.
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain a target timing feature vector through a forward timing network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
or,
and acquiring a target time sequence characteristic vector through a backward time sequence network included in the dynamic catching and denoising model to be trained based on the coding characteristic vector set.
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain, based on the coding feature vector set, a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained;
Based on the coding feature vector set, acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained;
and generating a target time sequence characteristic vector according to the first time sequence characteristic vector and the second time sequence characteristic vector.
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain, based on the encoding feature vector set, a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
Optionally, on the basis of the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training device 40 provided in the embodiment of the present application,
a training module 402, configured to determine a joint denoising result according to the noise offset vector and the second feature vector;
determining a loss value by adopting a loss function according to a joint denoising result and the first feature vector;
and updating the model parameters of the dynamic catching denoising model to be trained according to the loss values.
Fig. 19 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 500 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 522 (e.g., one or more processors), a memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 542 or data 544. The memory 532 and the storage media 530 may be transient or persistent storage. The programs stored on the storage media 530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processing unit 522 may be configured to communicate with the storage media 530 and execute, on the server 500, the series of instruction operations stored in the storage media 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server configuration shown in fig. 19.
Embodiments of the present application also provide a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product including a program, which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A method of data processing, comprising:
acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
acquiring to-be-processed dynamic capture data according to the target dynamic capture key frame, wherein the to-be-processed dynamic capture data represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frame, the noise feature vector set comprises the target noise feature vector, and N is an integer greater than 1;
based on the dynamic capture data to be processed, acquiring a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model, wherein the noise offset vector comprises offset data of the M joints;
and determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of the M joints.
2. The method of claim 1, wherein the target noise feature vector further comprises an angular velocity corresponding to a root joint;
the acquiring the dynamic capture data to be processed according to the target dynamic capture key frame comprises the following steps:
acquiring the continuous N dynamic capture key frames according to the target dynamic capture key frames;
and acquiring the dynamic capture data to be processed according to the continuous N dynamic capture key frames, wherein the dynamic capture data to be processed comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to the M joints and an angular velocity corresponding to the root joint.
3. The method according to claim 1 or 2, wherein the obtaining, based on the motion capture data to be processed, a noise offset vector corresponding to the target motion capture keyframe through a time series network included in a motion capture denoising model comprises:
based on the dynamic capture data to be processed, acquiring a coding characteristic vector set through a coder included in the dynamic capture denoising model, wherein the coding characteristic vector set comprises N coding characteristic vectors;
based on the coding feature vector set, acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model;
and acquiring a noise offset vector corresponding to the target motion capture key frame through a decoder included in the motion capture denoising model based on the target time sequence feature vector.
4. The method according to claim 3, wherein the obtaining a target time series feature vector through the time series network included in the dynamic capture denoising model based on the encoding feature vector set comprises:
based on the coding feature vector set, acquiring the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model;
or,
and acquiring the target time sequence characteristic vector through a backward time sequence network included by the dynamic capture denoising model based on the coding characteristic vector set.
5. The method according to claim 3, wherein the obtaining a target time series feature vector through the time series network included in the dynamic capture denoising model based on the encoding feature vector set comprises:
based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model;
based on the coding feature vector set, acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model;
and generating the target time sequence characteristic vector according to the first time sequence characteristic vector and the second time sequence characteristic vector.
6. The method according to claim 3, wherein the obtaining a target time series feature vector through the time series network included in the dynamic capture denoising model based on the encoding feature vector set comprises:
based on the coding feature vector set, acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model;
and generating the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
7. The method according to claim 1, wherein the determining a joint denoising result corresponding to the target motion capture keyframe according to the target noise feature vector and the noise offset vector comprises:
acquiring, for noise data corresponding to any joint in the target noise feature vector, offset data corresponding to the joint from the noise offset vector;
adding the offset data corresponding to the joint to the noise data corresponding to the joint, to obtain denoising data corresponding to the joint;
and when the denoising data corresponding to each joint in the M joints is acquired, generating a joint denoising result corresponding to the target dynamic capture key frame.
8. A training method of a dynamic capture denoising model is characterized by comprising the following steps:
acquiring dynamic capture noise-free data corresponding to dynamic capture sample key frames, wherein the dynamic capture noise-free data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
acquiring dynamic capture noise data according to the dynamic capture noise-free data, wherein the dynamic capture noise data comprises a noise feature vector set corresponding to the N continuous dynamic capture key frames, the noise feature vector set comprises second feature vectors corresponding to the dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
based on the dynamic capture noise data, acquiring noise offset vectors corresponding to the dynamic capture sample key frames through a time sequence network included in a dynamic capture denoising model to be trained, wherein the noise offset vectors include offset data of the M joints;
updating the model parameters of the to-be-trained dynamic-catching denoising model according to the first feature vector, the second feature vector and the noise offset vector until the model training conditions are met, and outputting the dynamic-catching denoising model.
9. The training method according to claim 8, wherein the obtaining of the dynamic capture noise-free data corresponding to the dynamic capture sample keyframe comprises:
acquiring the dynamic capture sample key frame;
acquiring the continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring the dynamic capturing noise-free data according to the continuous N dynamic capturing key frames, wherein the dynamic capturing noise-free data comprises N original feature vectors, and each original feature vector comprises label data corresponding to the M joints and label data corresponding to a root joint;
the acquiring of dynamic capture noise data according to the dynamic capture noise-free data comprises:
and acquiring the dynamic capturing noise data according to the original feature vector set corresponding to the N continuous dynamic capturing key frames, wherein the dynamic capturing noise data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to the M joints and an angular velocity corresponding to a root joint.
10. The training method according to claim 8 or 9, wherein the obtaining, based on the dynamic noise data, the noise offset vector corresponding to the dynamic sample keyframe through a timing network included in a dynamic noise reduction model to be trained comprises:
based on the dynamic capture noise data, acquiring a coding characteristic vector set through a coder included in the dynamic capture denoising model to be trained, wherein the coding characteristic vector set comprises N coding characteristic vectors;
based on the coding feature vector set, acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model to be trained;
and acquiring a noise offset vector corresponding to the key frame of the dynamic capture sample through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence feature vector.
11. The training method according to claim 10, wherein the obtaining a target timing feature vector through the timing network included in the to-be-trained motion capture denoising model based on the coding feature vector set comprises:
based on the coding feature vector set, acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained;
based on the coding feature vector set, acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained;
and generating the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
12. A data processing apparatus, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a target dynamic capture key frame, the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the acquisition module is further configured to acquire dynamic capture data to be processed according to the target dynamic capture key frame, where the dynamic capture data to be processed represents a noise feature vector set corresponding to N consecutive dynamic capture key frames, the N consecutive dynamic capture key frames include the target dynamic capture key frame, the noise feature vector set includes the target noise feature vector, and N is an integer greater than 1;
the obtaining module is further configured to obtain, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture keyframe through a time sequence network included in a dynamic capture denoising model, where the noise offset vector includes offset data of the M joints;
and the determining module is used for determining joint denoising results corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising results comprise denoising data of the M joints.
13. A dynamic-capture denoising model training device is characterized by comprising:
an obtaining module, configured to obtain dynamic capture noise-free data corresponding to dynamic capture sample key frames, where the dynamic capture noise-free data includes an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames include the dynamic capture sample key frames, the original feature vector set includes first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors include label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the obtaining module is further configured to obtain dynamic capture noise data according to the dynamic capture noise-free data, where the dynamic capture noise data includes a noise feature vector set corresponding to the N continuous dynamic capture key frames, the noise feature vector set includes second feature vectors corresponding to the dynamic capture sample key frames, and the second feature vectors include noise data corresponding to M joints;
the obtaining module is further configured to obtain, based on the dynamic capture noise data, noise offset vectors corresponding to the dynamic capture sample keyframe through a time sequence network included in a dynamic capture denoising model to be trained, where the noise offset vectors include offset data of the M joints;
and the training module is used for updating the model parameters of the to-be-trained dynamic-catching denoising model according to the first characteristic vector, the second characteristic vector and the noise offset vector until a model training condition is met, and outputting the dynamic-catching denoising model.
14. A computer device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory and to perform, according to instructions in the program code, the method of any one of claims 1 to 7 or the training method of any one of claims 8 to 11;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any of claims 1 to 7, or the training method of any of claims 8 to 11.
CN202010844348.4A 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model Active CN111899320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010844348.4A CN111899320B (en) 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010844348.4A CN111899320B (en) 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model

Publications (2)

Publication Number Publication Date
CN111899320A true CN111899320A (en) 2020-11-06
CN111899320B CN111899320B (en) 2023-05-23

Family

ID=73230095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010844348.4A Active CN111899320B (en) 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model

Country Status (1)

Country Link
CN (1) CN111899320B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN112634409A (en) * 2020-12-28 2021-04-09 稿定(厦门)科技有限公司 Custom animation curve generation method and device
CN112990154A (en) * 2021-05-11 2021-06-18 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN113112019A (en) * 2021-05-14 2021-07-13 电子科技大学成都学院 Data label noise reduction system based on improved anti-noise robustness learning algorithm
CN113505662A (en) * 2021-06-23 2021-10-15 广州大学 Fitness guidance method, device and storage medium
CN114051148A (en) * 2021-11-10 2022-02-15 拓胜(北京)科技发展有限公司 Virtual anchor generation method and device and electronic equipment
CN114880314A (en) * 2022-05-23 2022-08-09 烟台聚禄信息科技有限公司 Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013174671A1 (en) * 2012-05-22 2013-11-28 Telefonica, S.A. A method and a system for generating a realistic 3d reconstruction model for an object or being
CN110232672A (en) * 2019-06-20 2019-09-13 合肥工业大学 A kind of denoising method and system of exercise data
CN111340211A (en) * 2020-02-19 2020-06-26 腾讯科技(深圳)有限公司 Training method of action control model, related device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013174671A1 (en) * 2012-05-22 2013-11-28 Telefonica, S.A. A method and a system for generating a realistic 3d reconstruction model for an object or being
CN110232672A (en) * 2019-06-20 2019-09-13 合肥工业大学 A kind of denoising method and system of exercise data
CN111340211A (en) * 2020-02-19 2020-06-26 腾讯科技(深圳)有限公司 Training method of action control model, related device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
O. ONDER et al.: "KEYFRAME REDUCTION TECHNIQUES FOR MOTION CAPTURE DATA" *
XU Meng et al.: "Research on virtual human motion control technology" *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN112634409A (en) * 2020-12-28 2021-04-09 稿定(厦门)科技有限公司 Custom animation curve generation method and device
CN112634409B (en) * 2020-12-28 2022-04-19 稿定(厦门)科技有限公司 Custom animation curve generation method and device
CN112990154A (en) * 2021-05-11 2021-06-18 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN112990154B (en) * 2021-05-11 2021-07-30 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN113112019A (en) * 2021-05-14 2021-07-13 电子科技大学成都学院 Data label noise reduction system based on improved anti-noise robustness learning algorithm
CN113505662A (en) * 2021-06-23 2021-10-15 广州大学 Fitness guidance method, device and storage medium
CN113505662B (en) * 2021-06-23 2024-03-01 广州大学 Body-building guiding method, device and storage medium
CN114051148A (en) * 2021-11-10 2022-02-15 拓胜(北京)科技发展有限公司 Virtual anchor generation method and device and electronic equipment
CN114880314A (en) * 2022-05-23 2022-08-09 烟台聚禄信息科技有限公司 Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system

Also Published As

Publication number Publication date
CN111899320B (en) 2023-05-23

Similar Documents

Publication Publication Date Title
CN111899320B (en) Data processing method, training method and device of dynamic capture denoising model
Luo et al. 3d human motion estimation via motion compression and refinement
Guo et al. Ad-nerf: Audio driven neural radiance fields for talking head synthesis
Wang et al. Video-to-video synthesis
CN110637323A (en) Robust mesh tracking and fusion by using part-based keyframes and prior models
He et al. Nemf: Neural motion fields for kinematic animation
CN112037310A (en) Game character action recognition generation method based on neural network
CN113808047B (en) Denoising method for human motion capture data
CN112837215B (en) Image shape transformation method based on generation countermeasure network
CN114723760B (en) Portrait segmentation model training method and device and portrait segmentation method and device
CN114339409A (en) Video processing method, video processing device, computer equipment and storage medium
CN111028166A (en) Video deblurring method based on iterative neural network
Chen et al. Markerless monocular motion capture using image features and physical constraints
CN113989928A (en) Motion capturing and redirecting method
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
RU2713695C1 (en) Textured neural avatars
CN116524121A (en) Monocular video three-dimensional human body reconstruction method, system, equipment and medium
CN114170353B (en) Multi-condition control dance generation method and system based on neural network
Li et al. Srinpaintor: When super-resolution meets transformer for image inpainting
Gomes et al. Creating and reenacting controllable 3d humans with differentiable rendering
CN1666234B (en) Topological image model
Borodulina Application of 3D human pose estimation for motion capture and character animation
Regateiro et al. Deep4d: A compact generative representation for volumetric video
Zhu et al. Attention-Based Recurrent Autoencoder for Motion Capture Denoising
Bao et al. 3d gaussian splatting: Survey, technologies, challenges, and opportunities

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030654

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant