CN111899320B - Data processing method, training method and device of dynamic capture denoising model

Info

Publication number: CN111899320B (granted publication; earlier publication CN111899320A)
Application number: CN202010844348.4A
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张榕
Applicant and current assignee: Tencent Technology (Shenzhen) Co., Ltd.
Legal status: Active (application granted)

Classifications

    • G06T 13/20 — 3D [Three Dimensional] animation (G06T: image data processing or generation, in general)
    • G06T 13/40 — 3D animation of characters, e.g. humans, animals or virtual beings
    • G06N 3/045 — Neural network architectures: combinations of networks
    • G06N 3/049 — Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 — Neural network learning methods


Abstract

The application discloses a data processing method applied to the field of artificial intelligence, comprising the following steps: acquiring a target dynamic capture key frame; acquiring dynamic capture data to be processed according to the target dynamic capture key frame; acquiring, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model; and determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector. The application also discloses a training method and a training device for the dynamic capture denoising model. Because the dynamic capture denoising model predicts the noise offset vector of the target dynamic capture key frame on the animation curve, the original trend of motion details is kept while the time cost and labor cost of animation production are saved.

Description

Data processing method, training method and device of dynamic capture denoising model
Technical Field
The application relates to the field of artificial intelligence, in particular to a data processing method, a training method of a dynamic capture denoising model and a device thereof.
Background
Motion capture, also known as dynamic capture, is a technique for recording and processing the motion of a person or other object. The industry commonly uses optical motion capture, in which a plurality of marker points are attached to a performer, multiple cameras capture the positions of these markers, and the marker positions are then reconstructed and rendered onto a corresponding virtual character, finally mapping the real actor's performance onto skeletal animation.
In the dynamic capture data acquisition process, noise cannot be avoided owing to the limited precision of the dynamic capture equipment and errors in software calculation. At present, skeletal animation is mostly denoised by animators who repair the animation noise manually; the skeletal animation refined by an animator is closer to the requirement and has higher motion accuracy.
However, although manual repair by an animator yields good results, its cost is high and the processing period is long. Depending on the dynamic capture noise, a skilled animator generally needs tens of minutes to process tens of seconds of dynamic capture data, so the time cost and labor cost of animation production are too high.
Disclosure of Invention
The embodiment of the application provides a data processing method, a training method and a training device of a dynamic capture denoising model, wherein the dynamic capture denoising model can be used for predicting a noise offset vector of a target dynamic capture key frame on an animation curve, so that the original trend of action details can be kept, and the time cost and the labor cost of animation production can be saved.
In view of this, the present application provides in one aspect a method of data processing, comprising:
acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
acquiring to-be-processed dynamic capture data according to the target dynamic capture key frame, wherein the to-be-processed dynamic capture data represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frame, the noise feature vector set comprises the target noise feature vector, and N is an integer greater than 1;
acquiring noise offset vectors corresponding to target dynamic capture key frames through a time sequence network included in the dynamic capture denoising model based on dynamic capture data to be processed, wherein the noise offset vectors comprise offset data of M joints;
and determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise characteristic vector and the noise offset vector, wherein the joint denoising result comprises denoising data of M joints.
In another aspect, the present application provides a training method for a dynamic capture denoising model, including:
acquiring dynamic capture noiseless data corresponding to dynamic capture sample key frames, wherein the dynamic capture noiseless data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
acquiring dynamic capture noise data according to the dynamic capture noiseless data, wherein the dynamic capture noise data comprises a noise feature vector set corresponding to the N continuous dynamic capture key frames, the noise feature vector set comprises second feature vectors corresponding to the dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to the M joints;
acquiring noise offset vectors corresponding to key frames of the dynamic capture samples through a time sequence network included in the dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the noise offset vectors comprise offset data of M joints;
and updating model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector until the model training condition is met, and outputting the dynamic capture denoising model.
In one aspect, the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the acquisition module is further used for acquiring to-be-processed dynamic capture data according to the target dynamic capture key frames, wherein the to-be-processed dynamic capture data represent noise feature vector sets corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frames, the noise feature vector sets comprise target noise feature vectors, and N is an integer greater than 1;
the acquisition module is further used for acquiring noise offset vectors corresponding to the target dynamic capture key frames through a time sequence network included in the dynamic capture denoising model based on the dynamic capture data to be processed, wherein the noise offset vectors comprise offset data of M joints;
the determining module is used for determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise characteristic vector and the noise offset vector, wherein the joint denoising result comprises denoising data of M joints.
In one possible design, in one implementation of another aspect of the embodiments of the present application, the target noise feature vector further includes an angular velocity corresponding to the root joint;
the acquisition module is specifically used for acquiring N continuous dynamic capture key frames according to the target dynamic capture key frames;
and acquiring the dynamic capture data to be processed according to the continuous N dynamic capture key frames, wherein the dynamic capture data to be processed comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and angular velocity corresponding to a root joint.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a coding feature vector set based on the dynamic capture data to be processed through an encoder included in the dynamic capture denoising model, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
or alternatively, the process may be performed,
and acquiring a target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
acquire a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generate a target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
acquire a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generate a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the determining module is specifically configured to acquire, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to that joint in the noise offset vector;
add the offset data corresponding to that joint to the noise data corresponding to that joint to obtain denoising data corresponding to that joint;
and when denoising data corresponding to each of the M joints is obtained, generate a joint denoising result corresponding to the target dynamic capture key frame.
Another aspect of the present application provides a dynamic capture denoising model training apparatus, comprising:
the acquisition module is used for acquiring dynamic capture noiseless data corresponding to dynamic capture sample key frames, wherein the dynamic capture noiseless data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the acquisition module is further used for acquiring dynamic capture noise data according to the dynamic capture noiseless data, wherein the dynamic capture noise data comprises noise feature vector sets corresponding to N continuous dynamic capture key frames, the noise feature vector sets comprise second feature vectors corresponding to dynamic capture sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
the acquisition module is further used for acquiring noise offset vectors corresponding to the dynamic capture sample key frames through a time sequence network included in the dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the noise offset vectors comprise offset data of M joints;
and the training module is used for updating the model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector until the model training conditions are met, and outputting the dynamic capture denoising model.
In one possible design, in one implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a dynamic capture sample key frame;
acquiring N continuous dynamic capture key frames according to the dynamic capture sample key frame;
acquiring dynamic capture noiseless data according to the N continuous dynamic capture key frames, wherein the dynamic capture noiseless data comprises N original feature vectors, and each original feature vector comprises label data corresponding to the M joints and label data corresponding to the root joint;
the acquisition module is specifically configured to acquire dynamic capture noise data according to an original feature vector set corresponding to N continuous dynamic capture key frames, where the dynamic capture noise data includes N noise feature vectors, and each noise feature vector includes noise data corresponding to M joints and an angular velocity corresponding to a root joint.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
based on the target time sequence feature vector, a noise offset vector corresponding to a dynamic capture sample key frame is obtained through a decoder included in the dynamic capture denoising model to be trained.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically used for acquiring a target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
or alternatively, the process may be performed,
and acquiring a target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
acquire a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generate a target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the acquisition module is specifically configured to acquire a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
acquire a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generate a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In one possible design, in another implementation of another aspect of the embodiments of the present application,
the training module is specifically used for determining a joint denoising result according to the noise offset vector and the second feature vector;
determining a loss value by adopting a loss function according to the joint denoising result and the first feature vector;
and updating the model parameters of the dynamic capture denoising model to be trained according to the loss value.
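A sketch of one training update implied by these clauses, assuming PyTorch, a simple stand-in model, and mean squared error as the loss function (the patent only says "a loss function"); all names and shapes below are illustrative:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the dynamic capture denoising model to be trained;
# it maps an (N, feature)-shaped noisy window to a 54-dim noise offset vector.
model = nn.Sequential(nn.Flatten(), nn.Linear(31 * 57, 54))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # assumption: the patent does not name the loss

def training_step(noisy_window, clean_target, noisy_target):
    # noisy_window: (batch, 31, 57) dynamic capture noise data (second feature vectors)
    # clean_target: (batch, 54) label data of the sample key frame (first feature vector)
    # noisy_target: (batch, 54) noisy rotations of the sample key frame
    offset = model(noisy_window)            # predicted noise offset vector
    denoised = noisy_target + offset        # joint denoising result
    loss = loss_fn(denoised, clean_target)  # compare against the noiseless labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```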
Another aspect of the present application provides a computer device comprising: memory, transceiver, processor, and bus system;
wherein the memory is used for storing programs;
the processor is used for executing the program in the memory, performing the methods provided in the above aspects according to the instructions in the program code;
the bus system is used to connect the memory and the processor so that the memory and the processor can communicate.
Another aspect of the present application provides a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
In another aspect of the present application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the above aspects.
From the above technical solutions, the embodiments of the present application have the following advantages:
In the embodiment of the application, a method of data processing is provided. First, a target dynamic capture key frame is obtained; then, dynamic capture data to be processed is obtained according to the target dynamic capture key frame, where the dynamic capture data to be processed represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames include the target dynamic capture key frame, and the noise feature vector set includes the target noise feature vector. Next, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture key frame is obtained through a time sequence network included in the dynamic capture denoising model, where the noise offset vector includes offset data of M joints. Finally, a joint denoising result corresponding to the target dynamic capture key frame is determined according to the target noise feature vector and the noise offset vector, where the joint denoising result includes denoising data of the M joints. In this way, the dynamic capture denoising model can predict the noise offset vector of the target dynamic capture key frame on the animation curve, and the joint denoising result can be obtained based on the noise offset vector; the original trend of motion details is kept while the time cost and labor cost of animation production are saved, thereby improving the efficiency of repairing animation noise.
Drawings
FIG. 1 is a schematic diagram of a comparison of a smooth curve and a non-smooth curve according to an embodiment of the present application;
FIG. 2 is a schematic illustration of a joint animation curve according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a low frequency filter based acquisition of Y-axis rotation curves of animated characters and skeletal animation;
FIG. 4 is a schematic diagram of an architecture of a data processing system in an embodiment of the present application;
FIG. 5 is a schematic diagram of one embodiment of a method of data processing in an embodiment of the present application;
FIG. 6 is a schematic illustration of a joint position of a persona in an embodiment of the application;
FIG. 7 is a schematic comparison of a joint animation curve of an animated character before and after denoising in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in an embodiment of the present application;
FIG. 9 is a schematic diagram of outputting a target time sequence feature vector based on a forward time sequence network in an embodiment of the present application;
FIG. 10 is a schematic diagram of outputting a target time sequence feature vector based on a backward time sequence network in an embodiment of the present application;
FIG. 11 is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in an embodiment of the present application;
FIG. 12 is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in an embodiment of the present application;
FIG. 13 is a schematic diagram of an embodiment of the training method of the dynamic capture denoising model in an embodiment of the present application;
FIG. 14 is a schematic comparison of rotation curves of dynamic capture noiseless data and dynamic capture noise data in an embodiment of the present application;
FIG. 15 is a schematic comparison of Y-axis rotation curves based on experimental data in an embodiment of the present application;
FIG. 16 is a schematic comparison of X-axis rotation curves based on experimental data in an embodiment of the present application;
FIG. 17 is a schematic diagram of an embodiment of a data processing apparatus in an embodiment of the present application;
FIG. 18 is a schematic diagram of an embodiment of a training apparatus of the dynamic capture denoising model in an embodiment of the present application;
FIG. 19 is a schematic structural diagram of a computer device in an embodiment of the present application.
Detailed Description
The embodiment of the application provides a data processing method, a training method and a training device of a dynamic capture denoising model, wherein the dynamic capture denoising model can be used for predicting a noise offset vector of a target dynamic capture key frame on an animation curve, so that the original trend of action details can be kept, and the time cost and the labor cost of animation production can be saved.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be appreciated that the method of dynamic capture data processing provided herein may be used in three-dimensional (3D) games, animated movies, Virtual Reality (VR), and other scenarios. Specifically, a Machine Learning (ML) method based on Artificial Intelligence (AI) is used to implement the processing of dynamic capture data and the training of a dynamic capture denoising model. Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making. Artificial intelligence technology is a comprehensive subject covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The method for processing dynamic capture data provided herein is mainly used to remove noise from the skeletal animation acquired by dynamic capture equipment, thereby obtaining skeletal animation that is closer to the requirement and has higher motion accuracy. It can be applied in particular to the fields of movie production, television production, advertisement production, computer game production, video game production, and the like, for the purpose of digital character creation. The present application involves a series of terms, which are described below:
1. Maya, a three-dimensional computer graphics application widely used for the digital special effect creation of movies, television, advertising, computer games, video games, and the like.
2. MotionBuilder, 3D character animation software produced by Autodesk, used for virtual production, motion capture, and traditional key frame animation.
3. Motion capture, also known as dynamic capture, is a technique for recording and processing the motion of a person or other object. Dynamic capture devices can be classified as mechanical, acoustic, optical, electromagnetic, and so on according to their working principles. At present, optical motion capture is most commonly used in the industry: a plurality of marker points (markers) are attached to the performing actor, multiple cameras capture the marker positions, the marker positions are then reconstructed and rendered onto the corresponding virtual character, and the mapping from the real actor's performance to skeletal animation is finally realized.
In this process, marker data may be lost or jittered because markers can be occluded by the actors' limb movements, optical cameras have imaging errors, and the modeling and calculation accuracy of the solving software is limited, so the finally generated animation has obvious noise. In the animation, noise appears as unnatural or abnormal limb motion of the character; in the joint animation curve, noise appears as spikes repeatedly occurring on an otherwise smooth curve. For convenience of description, referring to FIG. 1, FIG. 1 is a schematic comparison of a smooth curve and a non-smooth curve in an embodiment of the present application; as shown in the figure, a typical smooth curve is indicated by A1, a non-smooth curve is indicated by A2, and the undulating portions are marked with boxes.
For the problem of dynamic capture data noise, work can be divided into two directions according to the data generation stage: denoising marker data and denoising the dynamic capture animation. Marker data is not always easy to obtain and differs across dynamic capture devices, so analyzing it requires more effort. Dynamic capture animation, by contrast, is common and easy to obtain; a large number of open-source databases, such as the Carnegie Mellon University Motion Capture Database (CMU), provide large amounts of dynamic capture animation data, so addressing animation noise is more broadly applicable.
4. Skeletal animation, a type of model animation; dynamic capture data is often processed into skeletal animation for game production. In skeletal animation, a model has a skeletal structure of interconnected "bones", and animation is produced by changing the orientation and position of these bones.
5. Euler angles, a triplet of independent angle parameters used to determine the position of a rigid body rotating about a fixed point, consisting of the nutation angle, the precession angle, and the spin angle.
6. The joint animation curve represents the motion of a joint, which can be represented as a series of three-dimensional Euler-angle rotations over time, i.e., three motion curves. In skeletal animation, character actions are composed by superimposing the rotations of the individual interconnected joints, and the rotation of a single joint at a certain moment can be represented by Euler angles. For ease of understanding, referring to FIG. 2, FIG. 2 is a schematic diagram of a joint animation curve according to an embodiment of the present application: the three curves are the rotation curves of a character's right thigh displayed in the Maya curve editor, where B1 indicates the X-axis rotation curve, B2 indicates the Y-axis rotation curve, and B3 indicates the Z-axis rotation curve.
7. Animation retargeting, a technique that enables multiple different characters to share an animation. After retargeting is applied, the same animation can be played normally on skeletons of different skeletal resources.
8. A low-pass filter allows low-frequency signals to pass but attenuates (or reduces) signals with frequencies above the cutoff frequency. The commonly used Gaussian filter is a low-pass filter; it is suitable for eliminating Gaussian noise and is widely applied in the noise-reduction stage of image processing. Gaussian filtering is a process of weighted averaging over the whole image: the value of each pixel is obtained by a weighted average of that pixel and the other pixel values in its neighborhood.
A low-pass filter can only smooth out spike noise, which results in insufficient preservation of animation detail. In terms of curve undulation, a low-pass filter can only smooth peaks or valleys toward the average position and cannot restore the amplitude of the undulation. When the noisy data fluctuates less, the filter suppresses the peaks even more, so the processed curve fluctuates less. The undulation of the curve reflects the intensity of the animated character's joint motion and the feel and effect of the action. For ease of understanding, referring to FIG. 3, FIG. 3 is a schematic diagram of Y-axis rotation curves of an animated character's skeletal animation processed by a low-pass filter, where C1 and C2 each indicate a Y-axis rotation curve of the animated character's right foot joint. The curve indicated by C1 fluctuates more strongly, i.e., the amplitude at the peaks and troughs is higher, corresponding to the animated character indicated by C3 stepping forward, while the curve indicated by C2 is relatively smooth, corresponding to the animated character indicated by C4 stepping in place.
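To make this limitation concrete, the following sketch smooths a spiky single-joint rotation curve with a Gaussian low-pass filter. It assumes NumPy and SciPy (scipy.ndimage.gaussian_filter1d), which are not part of the original disclosure, and the curve itself is synthetic:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

# Synthetic Y-axis rotation curve for one joint: a smooth undulation (degrees)
# plus small jitter and repeated spikes, like the non-smooth curve in FIG. 1.
t = np.linspace(0.0, 2.0 * np.pi, 200)
clean = 30.0 * np.sin(6.0 * t)
noisy = clean + rng.normal(0.0, 1.5, t.shape)
noisy[::25] += 8.0  # spike noise

smoothed = gaussian_filter1d(noisy, sigma=4.0)

# The spikes are flattened, but the peaks are also pulled toward the average,
# which is exactly the loss of motion detail described above.
print(f"clean peak amplitude:    {clean.max():.1f}")
print(f"smoothed peak amplitude: {smoothed.max():.1f}")
```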
Having explained the related terms above, the method of data processing and the training method of the dynamic capture denoising model in the present application will be described below with reference to FIG. 4. Referring to FIG. 4, FIG. 4 is a schematic diagram of the architecture of a data processing system in an embodiment of the present application. As shown in the figure, in the process of training the dynamic capture denoising model, dynamic capture noiseless data is first obtained from a database, and noise is added to the dynamic capture noiseless data to obtain dynamic capture noise data. The server can use the dynamic capture noiseless data and the dynamic capture noise data to train the dynamic capture denoising model to be trained, and output the corresponding dynamic capture denoising model when the training conditions are satisfied. The server can store the dynamic capture denoising model locally, or send it to the terminal device, which then uses the model. Alternatively, the training process may be performed by the terminal device, i.e., after the terminal device outputs the dynamic capture denoising model, the model is stored locally. In actual prediction, the terminal device may input the dynamic capture data to be processed, which includes the target dynamic capture key frame, into the dynamic capture denoising model, and the model outputs the noise offset vector corresponding to the target dynamic capture key frame, thereby correcting the noise data corresponding to the target dynamic capture key frame.
The server related to the application may be an independent physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content distribution network (Content Delivery Network, CDN), and basic cloud computing services such as big data and an artificial intelligence platform. The terminal device may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a palm computer, a personal computer, a smart television, a smart watch, etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. The number of servers and terminal devices is not limited either.
With reference to the foregoing description, the method of data processing in the present application is described below. Referring to FIG. 5, one embodiment of the method of data processing in the embodiment of the present application includes:
101. acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
In this embodiment, the data processing device acquires a target dynamic capture key frame, i.e., the dynamic capture key frame to be corrected, which includes noise data corresponding to M joints. M equal to 18 is used for illustration in this application, but this should not be construed as limiting the application. It will be appreciated that the data processing device may be deployed on a server or on a terminal device, which is not limited herein.
For convenience of description, referring to FIG. 6, FIG. 6 is a schematic diagram of the joint positions of a human character in the embodiment of the application. As shown in the figure, it is assumed that the human character has 19 joints, where the joint indicated by D19 is the root joint, which is also the topmost parent node, located at the pelvis of the character. The remaining 18 joints are respectively a chest joint indicated by D1, a neck joint indicated by D2, a right leg joint indicated by D3, a left leg joint indicated by D4, a right knee joint indicated by D5, a left knee joint indicated by D6, a right ankle joint indicated by D7, a left ankle joint indicated by D8, a right foot joint indicated by D9, a left foot joint indicated by D10, a right elbow joint indicated by D11, a left elbow joint indicated by D12, a right hand joint indicated by D13, a left hand joint indicated by D14, a right shoulder joint indicated by D15, a left shoulder joint indicated by D16, a right hip joint indicated by D17, and a left hip joint indicated by D18. It will be appreciated that the humanoid character may also include a different number of joints; the above is only illustrative and should not be construed as limiting the present application.
Specifically, each joint corresponds to a set of noise data, and each set of noise data includes a parameter for the X-axis rotation channel, a parameter for the Y-axis rotation channel, and a parameter for the Z-axis rotation channel. Assuming that M is 18 (excluding the root joint), each joint corresponds to a set of noise data containing three parameters; therefore, the target noise feature vector may include 54-dimensional parameters, which also means that the target dynamic capture key frame has 54 rotation channels. Taking the right foot joint as an example, the noise data corresponding to the right foot joint may be expressed as "foot_r_Xrotation", "foot_r_Yrotation" and "foot_r_Zrotation".
Based on this, a target noise feature vector can be obtained from the target motion capture keyframe, and the target noise feature vector can be expressed as:
$[(x1_t, y1_t, z1_t), (x2_t, y2_t, z2_t), \ldots, (xM_t, yM_t, zM_t)]$;
wherein x represents a parameter on the X axis, y a parameter on the Y axis, z a parameter on the Z axis, the indices 1 to M denote the M joints, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame.
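As a small illustration of this layout (the array names and the use of NumPy are assumptions, not from the original disclosure), the 54-dimensional target noise feature vector is simply the per-joint Euler angles flattened in joint order:

```python
import numpy as np

M = 18  # joints, excluding the root joint

# Hypothetical per-joint Euler rotations for frame t: one (X, Y, Z) row per joint.
joint_rotations = np.random.uniform(-180.0, 180.0, size=(M, 3))

# Flatten to [(x1_t, y1_t, z1_t), ..., (xM_t, yM_t, zM_t)].
target_noise_feature = joint_rotations.reshape(-1)
assert target_noise_feature.shape == (3 * M,)  # 54 rotation channels
```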
102. Acquiring to-be-processed dynamic capture data according to the target dynamic capture key frame, wherein the to-be-processed dynamic capture data represents a noise feature vector set corresponding to N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the target dynamic capture key frame, the noise feature vector set comprises the target noise feature vector, and N is an integer greater than 1;
In this embodiment, the data processing apparatus acquires N continuous dynamic capture key frames according to the target dynamic capture key frame, where the N key frames include the target dynamic capture key frame. It should be noted that the N dynamic capture key frames are continuous; N equal to 31 is taken as an example in this application, which should not be construed as limiting the application.
Specifically, to help the dynamic capture denoising model understand the trend of the joint motion, the (N-1)/2 dynamic capture key frames before and the (N-1)/2 key frames after the target dynamic capture key frame can be extracted and spliced in order into dynamic capture data to be processed with a length of N. When the preceding or following (N-1)/2 key frames are missing, the key frames can be filled in a center-symmetric (mirrored) manner. Taking N as 31 as an example, assume the target dynamic capture key frame is the 1st key frame of the whole animation; then the following 15 key frames (i.e., frame t+15, frame t+14, ..., frame t+1) are used, in that order, to fill the preceding part, which both keeps the sequence consistent and keeps the target key frame in the middle of the N key frames. Thus 31 frames are obtained: frame t+15, frame t+14, ..., frame t+1, frame t, frame t+1, ..., frame t+15, where the t-th dynamic capture key frame is the target dynamic capture key frame.
Based on this, the dynamic capture data to be processed (i.e., the noise feature vector set) can be expressed as:
$$\begin{bmatrix}
(x1_{t-\frac{N-1}{2}}, y1_{t-\frac{N-1}{2}}, z1_{t-\frac{N-1}{2}}) & \cdots & (xM_{t-\frac{N-1}{2}}, yM_{t-\frac{N-1}{2}}, zM_{t-\frac{N-1}{2}}) \\
\vdots & \ddots & \vdots \\
(x1_{t}, y1_{t}, z1_{t}) & \cdots & (xM_{t}, yM_{t}, zM_{t}) \\
\vdots & \ddots & \vdots \\
(x1_{t+\frac{N-1}{2}}, y1_{t+\frac{N-1}{2}}, z1_{t+\frac{N-1}{2}}) & \cdots & (xM_{t+\frac{N-1}{2}}, yM_{t+\frac{N-1}{2}}, zM_{t+\frac{N-1}{2}})
\end{bmatrix}$$
wherein each row is a noise feature vector, and the N noise feature vectors form the noise feature vector set; x represents a parameter on the X axis, y a parameter on the Y axis, z a parameter on the Z axis, the indices 1 to M denote the M joints, N denotes the total number of dynamic capture key frames, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame, whose corresponding target noise feature vector is:
$(x1_t, y1_t, z1_t), (x2_t, y2_t, z2_t), \ldots, (xM_t, yM_t, zM_t)$;
It can be understood that, alternatively, the (N-1) dynamic capture key frames before the target dynamic capture key frame can be extracted and spliced in order into dynamic capture data to be processed of length N, or the (N-1) dynamic capture key frames after the target dynamic capture key frame can be extracted and spliced in order into dynamic capture data to be processed of length N.
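A minimal sketch of the window assembly and center-symmetric filling described above, assuming the animation is stored as a NumPy array of per-frame feature vectors (the function name and array layout are illustrative):

```python
import numpy as np

def extract_window(frames: np.ndarray, t: int, n: int = 31) -> np.ndarray:
    """Return n consecutive feature vectors centered on frame t.

    frames: (num_frames, feature_dim) array of per-frame noise feature vectors.
    Missing preceding/following frames are filled center-symmetrically,
    mirroring the frames on the other side of the target frame.
    """
    half = (n - 1) // 2
    window = []
    for offset in range(-half, half + 1):
        i = t + offset
        if i < 0 or i >= len(frames):
            i = t - offset          # mirror across the target frame
        window.append(frames[i])
    return np.stack(window)         # shape (n, feature_dim), frame t in the middle

# Example: target key frame is frame 0, so the preceding half of the window
# is filled with frames 15, 14, ..., 1 (center-symmetric padding).
frames = np.random.randn(100, 57)
window = extract_window(frames, t=0)
assert window.shape == (31, 57)
```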
103. Acquiring noise offset vectors corresponding to target dynamic capture key frames through a time sequence network included in the dynamic capture denoising model based on dynamic capture data to be processed, wherein the noise offset vectors comprise offset data of M joints;
In this embodiment, the data processing device inputs the dynamic capture data to be processed into the dynamic capture denoising model, and the model outputs noise offset vectors corresponding to the N dynamic capture key frames. Because the position of the target dynamic capture key frame among the N key frames is predetermined, the noise offset vector corresponding to the target dynamic capture key frame can then be extracted. The noise offset vector includes offset data of the M joints; similar to the noise data, each joint corresponds to a set of offset data, and each set of offset data includes an offset parameter for the X-axis rotation channel, an offset parameter for the Y-axis rotation channel, and an offset parameter for the Z-axis rotation channel. Assuming M is 18, each joint corresponds to a set of offset data containing three parameters, so the noise offset vector may include 54-dimensional parameters. Taking the right foot joint as an example, the offset data corresponding to the right foot joint may be expressed as "foot_r_Xrotation_offset", "foot_r_Yrotation_offset" and "foot_r_Zrotation_offset".
104. And determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise characteristic vector and the noise offset vector, wherein the joint denoising result comprises denoising data of M joints.
In this embodiment, the data processing apparatus adjusts the noisy target noise feature vector according to the noise offset vector corresponding to the target dynamic capture key frame, thereby obtaining the final joint denoising result.
Specifically, assume the noise offset vector includes offset data for 18 joints, where the offset data corresponding to the right foot joint is [-0.1, 0, 0.1]. Correspondingly, the target noise feature vector includes noise data for the 18 joints, where the noise data corresponding to the right foot joint is [60.2, 70, 59.8]. On this basis, the offset data corresponding to the right foot joint is added to its noise data to obtain the denoising data corresponding to the right foot joint, [60.1, 70, 59.9]. In this way, when the offset data of each joint in the noise offset vector is added to the corresponding noise data, the denoising data corresponding to each joint is obtained, and the denoising data corresponding to the M joints form the joint denoising result, which can be expressed in the form of a feature vector.
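The correction itself is an element-wise addition; this snippet reproduces the right foot joint example from the text (array names are illustrative):

```python
import numpy as np

# Noise data and predicted offsets for the right foot joint (X, Y, Z rotations).
foot_r_noise = np.array([60.2, 70.0, 59.8])
foot_r_offset = np.array([-0.1, 0.0, 0.1])

# Joint denoising result = noisy rotation + predicted noise offset.
foot_r_denoised = foot_r_noise + foot_r_offset
print(foot_r_denoised)  # [60.1 70.  59.9]

# For all M joints at once: an (M, 3) noise matrix plus an (M, 3) offset matrix.
```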
For convenience of explanation, please refer to FIG. 7, which compares a joint animation curve of an animated character before and after denoising in an embodiment of the present application: FIG. 7 (A) shows the Z-axis rotation curve of a joint of the animated character before denoising, and FIG. 7 (B) shows the Z-axis rotation curve of the same joint after denoising. Obviously, the denoised Z-axis rotation curve is smoother.
According to the data processing method, the noise offset vector of the target dynamic capture key frame on the animation curve can be predicted through the dynamic capture denoising model, the joint denoising result can be obtained based on the noise offset vector, the original trend of motion details can be kept, the time cost and the labor cost of animation production can be saved, and therefore the efficiency of repairing animation noise is improved.
Optionally, based on the foregoing embodiments corresponding to fig. 5, in an optional embodiment provided in this embodiment of the present application, the target noise feature vector further includes an angular velocity corresponding to a root joint;
the acquiring of the dynamic capture data to be processed according to the target dynamic capture key frame specifically comprises the following steps:
acquiring N continuous dynamic capture key frames according to the target dynamic capture key frames;
And acquiring the dynamic capture data to be processed according to the continuous N dynamic capture key frames, wherein the dynamic capture data to be processed comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to M joints and angular velocity corresponding to a root joint.
In this embodiment, a method for determining the target noise feature vector based on the angular velocity of the root joint is described. The data processing apparatus obtains N continuous dynamic capture key frames according to the target dynamic capture key frame, where the N key frames include the target dynamic capture key frame. A noise feature vector can be extracted for each dynamic capture key frame, and each noise feature vector includes data corresponding to (M+1) joints: the (M+1) joints include the root joint, and the angular velocity of the root joint specifically includes the rotational angular velocity about the X axis, the Y axis, and the Z axis. The angular velocities in these three dimensions reflect the speed of the current motion, so the angular velocity of the root joint can help the dynamic capture denoising model understand the speed and cycle of the action.
Based on this, the dynamic capture data to be processed (i.e., the noise feature vector set) can be expressed as:
$$\begin{bmatrix}
(x1_{t-\frac{N-1}{2}}, y1_{t-\frac{N-1}{2}}, z1_{t-\frac{N-1}{2}}) & \cdots & (xM_{t-\frac{N-1}{2}}, yM_{t-\frac{N-1}{2}}, zM_{t-\frac{N-1}{2}}) & (xR_{t-\frac{N-1}{2}}, yR_{t-\frac{N-1}{2}}, zR_{t-\frac{N-1}{2}}) \\
\vdots & \ddots & \vdots & \vdots \\
(x1_{t+\frac{N-1}{2}}, y1_{t+\frac{N-1}{2}}, z1_{t+\frac{N-1}{2}}) & \cdots & (xM_{t+\frac{N-1}{2}}, yM_{t+\frac{N-1}{2}}, zM_{t+\frac{N-1}{2}}) & (xR_{t+\frac{N-1}{2}}, yR_{t+\frac{N-1}{2}}, zR_{t+\frac{N-1}{2}})
\end{bmatrix}$$
wherein each row is a noise feature vector, and the N noise feature vectors form the noise feature vector set; x represents a parameter on the X axis, y a parameter on the Y axis, z a parameter on the Z axis, the indices 1 to M denote the M joints, R denotes the root joint, N denotes the total number of dynamic capture key frames, and t denotes the t-th dynamic capture key frame, i.e., the target dynamic capture key frame, whose corresponding target noise feature vector is:
$(x1_t, y1_t, z1_t), (x2_t, y2_t, z2_t), \ldots, (xM_t, yM_t, zM_t), (xR_t, yR_t, zR_t)$;
It should be understood that the angular velocity corresponding to the root joint may be arranged at any position in the noise feature vector, but the ordering of joints must be kept consistent across all noise feature vectors. For ease of understanding, refer to Table 1, which lists the noise data of each joint extracted from the target dynamic capture key frame.
TABLE 1
Joint Noise data Joint Noise data
Chest joint (124,255,300) Neck joint (135,55,310)
Right leg joint (15,55,71) Left leg joint (147,135,241)
Right knee joint (14,67,35) Left knee joint (19,25,68)
Right ankle joint (46,215,244) Left ankle joint (117,46,112)
Right foot joint (322,157,42) Left foot joint (129,155,111)
Right elbow joint (310,255,300) Left elbow joint (19,255,62)
Right hand joint (114,255,300) Left hand joint (249,295,78)
Right shoulder joint (74,92,30) Left shoulder joint (286,99,152)
Right hip joint (12,341,100) Left hip joint (42,155,110)
Root joint Angular velocity (57,62,72)
As can be seen from Table 1, the target noise feature vector may include 57-dimensional parameters, i.e., it now also includes the angular velocity of the root joint; similarly, each noise feature vector may include 57-dimensional parameters. It should be noted that, in practice, the parameters shown in Table 1 may further include one or more decimal places.
In addition, this embodiment of the application provides a method for determining the target noise feature vector based on the angular velocity of the root joint. Because the angular velocity of the root joint represents the speed of the character's action, using it as a reference for predicting the noise offset vector can improve the accuracy of model prediction.
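The disclosure does not specify how the root joint's angular velocity is computed; a common choice, shown here purely as an assumption, is a finite difference of the root rotations between consecutive key frames:

```python
import numpy as np

def root_angular_velocity(root_rotations: np.ndarray, fps: float = 30.0) -> np.ndarray:
    """Assumed finite-difference estimate; not specified in the patent.

    root_rotations: (num_frames, 3) Euler rotations of the root joint in degrees.
    Returns (num_frames, 3) per-frame angular velocity in degrees per second.
    """
    velocity = np.zeros_like(root_rotations)
    velocity[1:] = (root_rotations[1:] - root_rotations[:-1]) * fps
    velocity[0] = velocity[1]  # pad the first frame
    return velocity

# Appending the (3,) angular velocity to the 54-dim joint vector yields the
# 57-dimensional noise feature vector described in Table 1.
```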
Optionally, based on the foregoing embodiments corresponding to FIG. 5, in another optional embodiment provided in this application, the acquiring of the noise offset vector corresponding to the target dynamic capture key frame through the time sequence network included in the dynamic capture denoising model, based on the dynamic capture data to be processed, specifically includes the following steps:
based on the dynamic capture data to be processed, acquiring a coding feature vector set through an encoder included in the dynamic capture denoising model, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
In this embodiment, a manner of outputting the noise offset vector based on the dynamic capture denoising model is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e., 3×18+3=57, where 54 dimensions are noise data and 3 dimensions are angular velocity). On this basis, the 31 dynamic capture key frames constitute the dynamic capture data to be processed, denoted as a 31×57 matrix.
For ease of understanding, referring to fig. 8, fig. 8 is a schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in this embodiment of the application. As shown in the figure, assuming that the target dynamic capture key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture data to be processed can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data. It should be noted that the number of layers and the output dimensions of the encoder are only illustrative.
The coding feature vector set is then input to the timing network, which outputs a 31×54 matrix; the target time sequence feature vector corresponding to the target dynamic capture key frame, which has 54 dimensions, is selected from this matrix. It should be noted that the timing network may be a bidirectional long short-term memory (Bi-LSTM) network, a long short-term memory (LSTM) network, a gated recurrent unit (GRU) network, a temporal convolutional network (TCN) or a recurrent neural network (RNN); fig. 8 takes the LSTM as an example, but this should not be construed as limiting the application.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 54-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the target dynamic capture key frame. It should be noted that the number of decoder layers is only illustrative.
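For ease of understanding, the encoder/timing network/decoder pipeline described above can be sketched in PyTorch as follows. This is a minimal illustration under the example dimensions (57→48→32 encoder, 54-unit LSTM, 54-dimensional offset output); the class name MocapDenoiser and the decoder's hidden width are assumptions, not the application's actual implementation:

```python
import torch
import torch.nn as nn

class MocapDenoiser(nn.Module):
    def __init__(self, in_dim=57, hid1=48, hid2=32, lstm_dim=54, out_dim=54):
        super().__init__()
        # Encoder: two fully connected layers, 57 -> 48 -> 32 per frame.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hid1), nn.ReLU(),
            nn.Linear(hid1, hid2), nn.ReLU(),
        )
        # Timing network: unidirectional LSTM, 32 -> 54 per frame.
        self.lstm = nn.LSTM(hid2, lstm_dim, batch_first=True)
        # Decoder: two fully connected layers mapping to rotation space.
        self.decoder = nn.Sequential(
            nn.Linear(lstm_dim, lstm_dim), nn.ReLU(),
            nn.Linear(lstm_dim, out_dim),
        )

    def forward(self, x, t):
        # x: (batch, 31, 57) noisy key frames; t: index of the target frame.
        h = self.encoder(x)          # (batch, 31, 32) coding feature vectors
        seq, _ = self.lstm(h)        # (batch, 31, 54) time sequence vectors
        target = seq[:, t, :]        # 54-dim target time sequence vector
        return self.decoder(target)  # 54-dim noise offset vector

offsets = MocapDenoiser()(torch.randn(1, 31, 57), t=15)  # shape (1, 54)
```

A Bi-LSTM, GRU or TCN could be substituted for the nn.LSTM module without changing the surrounding structure.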
In this embodiment of the application, a manner of outputting the noise offset vector based on the dynamic capture denoising model is provided. In this way, the N coding feature vectors obtained by encoding are encoded again by the timing network, so that the implicit temporal features in the N dynamic capture key frames are extracted and a more accurate noise offset vector is predicted, improving the effect of repairing animation noise.
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in this embodiment of the present application obtains, based on the set of encoding feature vectors, a target timing feature vector through a timing network included in a dynamic capture denoising model, and specifically includes the following steps:
acquiring the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
or,
acquiring the target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set.
In this embodiment, a method for predicting the target time sequence feature vector based on a unidirectional time sequence network is described, where N is equal to 31 and the timing network is exemplified by an LSTM.
Specifically, referring to fig. 9, fig. 9 is a schematic diagram of outputting the target time sequence feature vector based on a forward time sequence network according to an embodiment of the present application. As shown in the figure, assume that the coding feature vector set includes the coding feature vectors of N dynamic capture key frames, where the coding feature vector corresponding to the 1st dynamic capture key frame is denoted as x_1, the coding feature vector corresponding to the t-th dynamic capture key frame is denoted as x_t, and the coding feature vector corresponding to the N-th dynamic capture key frame is denoted as x_N; the target dynamic capture key frame may be any one of the N dynamic capture key frames. The coding feature vector x_1 is encoded to obtain a hidden vector h_1; then the hidden vector h_1 is encoded together with the coding feature vector x_2 of the next key frame to obtain a hidden vector h_2, and so on until the target time sequence feature vector is obtained.
Referring to fig. 10, fig. 10 is a schematic diagram of outputting the target time sequence feature vector based on a backward time sequence network according to an embodiment of the present application. As shown in the figure, assume that the coding feature vector set includes the coding feature vectors of N dynamic capture key frames, where the coding feature vector corresponding to the N-th dynamic capture key frame is denoted as x_N, the coding feature vector corresponding to the t-th dynamic capture key frame is denoted as x_t, and the coding feature vector corresponding to the 1st dynamic capture key frame is denoted as x_1; the target dynamic capture key frame may be any one of the N dynamic capture key frames. The coding feature vector x_N is encoded to obtain a hidden vector h_N; then the hidden vector h_N is encoded together with the coding feature vector x_(N-1) of the next key frame in the reversed order to obtain a hidden vector h_(N-1), and so on until the target time sequence feature vector is obtained.
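As a rough sketch of these two passes (layer sizes follow the running example and are assumptions), a single unidirectional LSTM can play the role of either the forward or the backward time sequence network simply by reversing the order in which the coding feature vectors are fed in:

```python
import torch
import torch.nn as nn

# N = 31 frames of 32-dimensional coding feature vectors.
lstm = nn.LSTM(input_size=32, hidden_size=54, batch_first=True)
codes = torch.randn(1, 31, 32)                  # coding feature vector set

fwd_seq, _ = lstm(codes)                        # forward pass: h_1 ... h_31
bwd_seq, _ = lstm(torch.flip(codes, dims=[1]))  # backward pass: h_31 ... h_1

t = 15                                   # index of the target key frame
target_fwd = fwd_seq[:, t, :]            # 54-dim target time sequence vector
target_bwd = bwd_seq[:, 30 - t, :]       # same frame in the reversed order
```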
Further, in this embodiment of the application, a method for predicting the target time sequence feature vector based on a unidirectional time sequence network is provided. In this way, a sequence encoder can effectively extract the sequence information in the N dynamic capture key frames: the key frames are encoded one after another, introducing information from the preceding or following frames, thereby improving the effect of model prediction.
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in this embodiment of the present application obtains, based on the set of encoding feature vectors, a target timing feature vector through a timing network included in a dynamic capture denoising model, and specifically includes the following steps:
based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model;
acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating a target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In this embodiment, a manner of predicting the target time sequence feature vector based on a bidirectional time sequence network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e. 3×18+3=57); on this basis, 31 frames of dynamic capture key frames constitute the 31×57 dynamic capture data to be processed.
For ease of understanding, referring to fig. 11, fig. 11 is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in this embodiment of the application. As shown in the figure, assuming that the target dynamic capture key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture data to be processed can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data. It should be noted that the number of layers and the output dimensions of the encoder are only illustrative.
The coding feature vector set is then input into a bidirectional time sequence network (comprising a forward time sequence network and a backward time sequence network), which outputs a 31×108 matrix; the target time sequence feature vector corresponding to the target dynamic capture key frame, which has 108 dimensions, is selected from this matrix. Specifically, the first time sequence feature vector corresponding to the target dynamic capture key frame output by the forward time sequence network and the second time sequence feature vector corresponding to the target dynamic capture key frame output by the backward time sequence network are spliced to output the 108-dimensional target time sequence feature vector. It should be noted that the bidirectional time sequence network may specifically be a Bi-LSTM.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 108-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the target dynamic capture key frame. It should be noted that the number of decoder layers is only illustrative.
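A hedged sketch of this bidirectional variant (dimensions again taken from the running example, not from the application itself): with bidirectional=True, a PyTorch LSTM concatenates the forward and backward 54-dimensional outputs per frame, directly producing the 31×108 matrix described above:

```python
import torch
import torch.nn as nn

bi_lstm = nn.LSTM(input_size=32, hidden_size=54,
                  batch_first=True, bidirectional=True)
# Assumed decoder widths: 108 -> 108 -> 54.
decoder = nn.Sequential(nn.Linear(108, 108), nn.ReLU(), nn.Linear(108, 54))

codes = torch.randn(1, 31, 32)     # coding feature vector set
seq, _ = bi_lstm(codes)            # (1, 31, 108): spliced fwd/bwd vectors
offset = decoder(seq[:, 15, :])    # 54-dim noise offset vector for frame t
```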
Further, in this embodiment of the application, a method for predicting the target time sequence feature vector based on a bidirectional time sequence network is provided. Processing the timing information with a bidirectional LSTM makes the motion trend easier to capture, since the bidirectional LSTM lets the model use both past and future curve-trend information, thereby improving the effect of model prediction.
Optionally, based on the foregoing respective embodiments corresponding to fig. 5, another optional embodiment provided in this embodiment of the present application obtains, based on the set of encoding feature vectors, a target timing feature vector through a timing network included in a dynamic capture denoising model, and specifically includes the following steps:
acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In this embodiment, a manner of predicting the target time sequence feature vector based on a two-layer bidirectional time sequence network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture data to be processed, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e. 3×18+3=57); on this basis, 31 frames of dynamic capture key frames constitute the dynamic capture data to be processed, denoted as 31×57.
For ease of understanding, referring to fig. 12, fig. 12 is another schematic structural diagram of obtaining a noise offset vector based on the dynamic capture denoising model in this embodiment of the application. As shown in the figure, assuming that the target dynamic capture key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture data to be processed can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data. It should be noted that the number of layers and the output dimensions of the encoder are only illustrative.
The coding feature vector set is then input into a two-layer bidirectional time sequence network (comprising a first bidirectional time sequence network and a second bidirectional time sequence network, each of which comprises a forward time sequence network and a backward time sequence network). The first bidirectional time sequence network outputs a 31×108 matrix, which is input into the second bidirectional time sequence network; the second bidirectional time sequence network outputs another 31×108 matrix, from which the target time sequence feature vector corresponding to the target dynamic capture key frame, with 108 dimensions, is selected. Specifically, the third time sequence feature vector corresponding to the target dynamic capture key frame output by the first bidirectional time sequence network and the fourth time sequence feature vector corresponding to the target dynamic capture key frame output by the second bidirectional time sequence network are spliced to output the 108-dimensional target time sequence feature vector. It should be noted that the bidirectional time sequence network may specifically be a Bi-LSTM.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 108-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the target dynamic capture key frame. It should be noted that the number of decoder layers is only illustrative.
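Stacking two such bidirectional networks gives a sketch of the two-layer variant (again an illustration under the example dimensions, not the application's actual implementation):

```python
import torch
import torch.nn as nn

# First Bi-LSTM: 32-dim coding vectors in, 2 x 54 = 108-dim vectors out.
bi_lstm1 = nn.LSTM(32, 54, batch_first=True, bidirectional=True)
# Second Bi-LSTM: consumes the 108-dim output of the first network.
bi_lstm2 = nn.LSTM(108, 54, batch_first=True, bidirectional=True)

codes = torch.randn(1, 31, 32)  # coding feature vector set
seq1, _ = bi_lstm1(codes)       # (1, 31, 108) from the first network
seq2, _ = bi_lstm2(seq1)        # (1, 31, 108) from the second network
target = seq2[:, 15, :]         # 108-dim target time sequence vector
```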
Further, in this embodiment of the application, a manner of predicting the target time sequence feature vector based on a two-layer bidirectional time sequence network is provided. Processing the timing information with a two-layer bidirectional LSTM makes the motion trend easier to capture, since it lets the model use both past and future curve-trend information, thereby improving the effect of model prediction.
Optionally, based on the foregoing embodiments corresponding to fig. 5, in another optional embodiment provided in this application, determining, according to the target noise feature vector and the noise offset vector, a joint denoising result corresponding to the target dynamic capture keyframe specifically includes the following steps:
acquiring, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to that joint in the noise offset vector;
adding the offset data corresponding to the joint to the noise data corresponding to the joint to obtain denoising data corresponding to the joint;
and when denoising data corresponding to each of the M joints is obtained, generating the joint denoising result corresponding to the target dynamic capture key frame.
In this embodiment, a way of generating the joint denoising result is described. The dynamic capture denoising model outputs the noise offset vector corresponding to the target dynamic capture key frame; it can be understood that the noise offset vector includes the offset data of the M joints. Taking M equal to 18 as an example, the noise offset vector includes 54 parameters. Adding the noise offset vector to the initial target noise feature vector yields the joint denoising result, thereby realizing the denoising function.
Specifically, for ease of understanding, refer to Table 2, which lists the offset data of each joint extracted for the target dynamic capture key frame; these offset data constitute the noise offset vector of the target dynamic capture key frame.
TABLE 2
(Table 2 is reproduced as an image in the original document; it lists the three-dimensional offset data of each of the 18 joints, for example an offset of (1, 2, 3) for the chest joint.)
As can be seen from Table 2, the noise offset vector of the target dynamic capture key frame may include 54-dimensional parameters, i.e. it does not include offset data of the root joint. It should be noted that, in practice, the parameters shown in Table 2 may carry one or more decimal places. Taking the noise data corresponding to the chest joint in the target noise feature vector as an example, refer to Table 1 again: the noise data of the chest joint is (124, 255, 300) and the offset data of the chest joint is (1, 2, 3), so adding the corresponding parameters gives denoising data of (125, 257, 303) for the chest joint. Each of the other joints is processed in the same way until the denoising data corresponding to all M joints is obtained; this denoising data constitutes the joint denoising result corresponding to the target dynamic capture key frame.
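The denoising step itself is a plain element-wise addition; the following sketch reproduces the chest-joint example above (the neck-joint offset here is a made-up placeholder, since Table 2 is not fully legible):

```python
import numpy as np

noise = np.array([[124., 255., 300.],   # chest joint noise data (Table 1)
                  [135.,  55., 310.]])  # neck joint, and so on for M joints
offset = np.array([[1., 2., 3.],        # chest joint offset (Table 2)
                   [0., -1., 2.]])      # neck joint offset (placeholder)

denoised = noise + offset               # per-joint joint denoising result
print(denoised[0])                      # [125. 257. 303.]
```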
Secondly, in this embodiment of the application, a method for generating the joint denoising result is provided. In this way, the joint denoising result can be determined automatically from the target noise feature vector and the noise offset vector, and the joint noise can be corrected without manual calculation, further improving convenience of operation.
With reference to the foregoing description, the training method of the dynamic capture denoising model in this application is described below. Referring to fig. 13, one embodiment of the training method of the dynamic capture denoising model in this embodiment of the application includes:
201. acquiring dynamic capture noiseless data corresponding to dynamic capture sample key frames, wherein the dynamic capture noiseless data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
in this embodiment, the dynamic capture denoising model training device acquires dynamic capture noiseless data from a dynamic capture database. Since open-source data is generally quite clean, with little jitter and noise, it can be considered clean and noise-free and used as the dynamic capture noiseless data. The dynamic capture noiseless data comprises N continuous dynamic capture key frames, which include the dynamic capture sample key frame to be predicted. Each of the N continuous dynamic capture key frames has an original feature vector, and the N original feature vectors form the original feature vector set. For convenience of explanation, the original feature vector corresponding to the dynamic capture sample key frame is referred to as the first feature vector, which includes the label data corresponding to the M joints. This application is illustrated with M equal to 18 and N equal to 31; however, this should not be construed as limiting the application.
Similar to the foregoing embodiment, the original feature vector corresponding to each frame of dynamic capture key frame includes the label data of the M joints and may further include the label data of the root joint. Assuming that M is 18 and the label data of each joint includes a parameter for the X-axis rotation channel, a parameter for the Y-axis rotation channel and a parameter for the Z-axis rotation channel, each original feature vector includes 54-dimensional parameters. If the rotational angular velocities of the root joint on the X, Y and Z axes are also included, each original feature vector includes 57-dimensional parameters.
Specifically, the initial sample dataset reaches 6.79 million frames, mainly Biovision Hierarchy (BVH) files or FilmBox (FBX) files from open-source dynamic capture databases widely used in the industry, where BVH and FBX are both common animation formats. The dynamic capture databases used include, but are not limited to, the Karlsruhe Institute of Technology (KIT) dynamic capture database, the Simon Fraser University (SFU) dynamic capture database, the dynamic capture database of the Advanced Computing Center for the Arts and Design (ACCAD) at Ohio State University, and the Carnegie Mellon University (CMU) dynamic capture database; their specific information is shown in Table 3.
TABLE 3

Database | KIT | ACCAD | SFU | CMU
Number of files | 2141 | 81 | 44 | 2548
Total number of frames | 2397855 | 19599 | 109653 | 4264355
Frame length/s | 0.01 | 0.0333333 | 0.008333 | 0.00833333
Frame rate/Hz | 100 | 30 | 120 | 120
As Table 3 shows, the frame rates and humanoid skeletons of the dynamic capture databases differ to some extent. To keep the input and output of the dynamic capture denoising model fixed, the dynamic capture data is first unified. MotionBuilder software may be used to retarget the animations so that the motion skeletons are uniform; it can also unify the animation frame rate, for example to 30 frames per second (FPS), after which the training data reaches about 1.83 million frames.
It can be understood that the dynamic capture denoising model training apparatus may be deployed on a server or on a terminal device, which is not limited herein.
202. Acquiring dynamic capturing noise data according to dynamic capturing noiseless data, wherein the dynamic capturing noise data comprises noise feature vector sets corresponding to N continuous dynamic capturing key frames, the noise feature vector sets comprise second feature vectors corresponding to dynamic capturing sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
in this embodiment, to simulate noise during training, noise may be added to the data of the M joints (excluding the root joint) on different channels of the dynamic capture noiseless data. In general, a joint with a larger range of motion is more likely to jitter severely, so the noise of each channel is related to the value range of that channel's rotation. In addition, considering that noise may be abundant but not every frame contains noise, noise is added only to a portion of the dynamic capture key frames (e.g. 80% of the animation frames in the dynamic capture noiseless data).
Specifically, for a given rotation channel (for example, the X-axis rotation channel), the variance of its rotation values is recorded as σ. A portion of the dynamic capture key frames is extracted for that channel, and Gaussian noise with mean 0 and variance 0.05×σ is added to the rotation value of each extracted key frame, thereby constructing the dynamic capture noise data. It can be understood that the dynamic capture noise data includes the noise feature vector set corresponding to the N continuous dynamic capture key frames: each of the N key frames has a noise feature vector, and the N noise feature vectors form the set. For convenience of explanation, the noise feature vector corresponding to the dynamic capture sample key frame is referred to as the second feature vector, which includes the noise data corresponding to the M joints.
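The noise-synthesis procedure just described can be sketched as follows (an illustration only; the function name is an assumption, and the 80% frame ratio follows the example above):

```python
import numpy as np

def add_channel_noise(clean, frame_ratio=0.8, scale=0.05, rng=None):
    # clean: (num_frames, num_channels) noiseless rotation values.
    rng = rng or np.random.default_rng()
    noisy = clean.copy()
    for ch in range(clean.shape[1]):
        sigma = clean[:, ch].var()                 # channel rotation variance
        picked = rng.random(clean.shape[0]) < frame_ratio
        # Gaussian noise with mean 0 and variance 0.05 * sigma,
        # added only to the extracted key frames of this channel.
        noisy[picked, ch] += rng.normal(0.0, np.sqrt(scale * sigma),
                                        size=picked.sum())
    return noisy

noisy_data = add_channel_noise(np.random.uniform(-90, 90, size=(1000, 54)))
```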
For convenience of explanation, please refer to fig. 14, which compares the rotation curves of dynamic capture noiseless data and dynamic capture noise data in this embodiment of the application. Taking the rotation curve of one joint of an animated character over a period of time as an example, the horizontal axis represents time and the vertical axis represents the joint curve value: E1 indicates the curve after noise is added, E2 the noise-free curve, E3 the curve predicted by a fixed Gaussian filter (fixed parameters, mean 0, variance 1), and E4 the curve predicted by an adaptive Gaussian filter. In the region marked by the rectangular frame, the curve indicated by E4 has completely deviated from the curve indicated by E2 and fluctuates with the curve indicated by E1.
203. Acquiring, based on the dynamic capture noise data, a noise offset vector corresponding to the dynamic capture sample key frame through a time sequence network included in the dynamic capture denoising model to be trained, wherein the noise offset vector comprises the offset data of the M joints;
in this embodiment, the dynamic capture denoising model training device inputs the dynamic capture noise data into the dynamic capture denoising model to be trained, which outputs the noise offset vectors corresponding to the N dynamic capture key frames. Since the position of the dynamic capture sample key frame among the N key frames is predetermined, the noise offset vector corresponding to the dynamic capture sample key frame can be extracted. The noise offset vector includes the offset data of the M joints; similar to the noise data, each joint corresponds to a set of offset data comprising an offset parameter for the X-axis rotation channel, an offset parameter for the Y-axis rotation channel and an offset parameter for the Z-axis rotation channel.
204. Updating the model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector until the model training condition is satisfied, and outputting the dynamic capture denoising model.
In this embodiment, the dynamic capture denoising model training device may adjust the noisy second feature vector according to the noise offset vector corresponding to the dynamic capture sample key frame to obtain the final joint denoising result. The joint denoising result is the predicted value and the first feature vector is the true value, so the two are compared to obtain a loss value, which is used to update the model parameters of the dynamic capture denoising model to be trained until the model training condition is satisfied; the model parameters obtained in the last iteration are then used as the model parameters of the dynamic capture denoising model.
According to the training method of the dynamic capture denoising model provided above, the model is trained using the dynamic capture noiseless data and the dynamic capture noise data. The trained model can predict the noise offset vector of the dynamic capture sample key frame on the animation curve, and the joint denoising result obtained from this offset preserves the original trend of the motion details, so the denoised animation moves naturally and smoothly. This saves the time cost and labor cost of animation production: an animator may still manually post-process the model's denoising result, but the processing cost at that point is greatly reduced, improving the efficiency of repairing animation noise.
Optionally, based on the foregoing embodiments corresponding to fig. 13, in an optional embodiment provided in the present application, acquiring the dynamic capture noiseless data corresponding to the dynamic capture sample key frame specifically includes the following steps:
acquiring a dynamic capture sample key frame;
acquiring continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring the dynamic capture noiseless data according to the N continuous dynamic capture key frames, wherein the dynamic capture noiseless data comprises N original feature vectors, and each original feature vector comprises the label data corresponding to the M joints and the label data corresponding to the root joint;
Acquiring the dynamic capture noise data according to the dynamic capture noiseless data specifically includes the following step:
acquiring the dynamic capture noise data according to the original feature vector sets corresponding to the N continuous dynamic capture key frames, wherein the dynamic capture noise data comprises N noise feature vectors, and each noise feature vector comprises the noise data corresponding to the M joints and the angular velocity corresponding to the root joint.
In this embodiment, a method for training the dynamic capture denoising model based on the angular velocity of the root joint is described. The dynamic capture denoising model training device acquires the N continuous dynamic capture key frames according to the dynamic capture sample key frame, where the N key frames include the dynamic capture sample key frame. An original feature vector can be extracted for each dynamic capture key frame, and each original feature vector includes the label data corresponding to (M+1) joints. The (M+1) joints include the root joint, whose label data specifically includes the rotational angular velocity on the X axis, the rotational angular velocity on the Y axis and the rotational angular velocity on the Z axis. Correspondingly, a noise feature vector can be extracted for each dynamic capture key frame, and each noise feature vector also includes the noise data corresponding to (M+1) joints, including the root joint. The rotational angular velocities in these three dimensions reflect the current speed of the motion, so the label data and noise data of the root joint help the dynamic capture denoising model understand the speed and period of the action.
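The text does not spell out how the angular velocity is computed; one plausible sketch (an assumption, not the application's method) is a per-axis finite difference of the root joint's rotation values between consecutive key frames:

```python
import numpy as np

def root_angular_velocity(root_rotations, frame_length=1.0 / 30.0):
    # root_rotations: (num_frames, 3) X/Y/Z rotation values of the root joint;
    # frame_length follows the 30 FPS unification mentioned above.
    velocity = np.zeros_like(root_rotations)
    velocity[1:] = (root_rotations[1:] - root_rotations[:-1]) / frame_length
    velocity[0] = velocity[1]   # pad the first frame with its neighbor
    return velocity             # rotation units per second, per axis
```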
It will be appreciated that, similar to the noise feature vector representation described in the previous embodiment, the original feature vectors are represented in a corresponding manner; the dynamic capture noiseless data (i.e., the N original feature vectors) can thus be represented as:
(x′1^1, y′1^1, z′1^1), (x′2^1, y′2^1, z′2^1), ..., (x′M^1, y′M^1, z′M^1), (x′R^1, y′R^1, z′R^1)
...
(x′1^N, y′1^N, z′1^N), (x′2^N, y′2^N, z′2^N), ..., (x′M^N, y′M^N, z′M^N), (x′R^N, y′R^N, z′R^N)
wherein each row represents an original feature vector. X′ represents a label parameter on the X axis, Y′ represents a label parameter on the Y axis, Z′ represents a label parameter on the Z axis, M represents the M joints, R represents the root joint, N represents the total number of dynamic capture key frames, and t represents the t-th frame dynamic capture key frame, namely the dynamic capture sample key frame. The first feature vector corresponding to the dynamic capture sample key frame is:
(x′1^t, y′1^t, z′1^t), (x′2^t, y′2^t, z′2^t), ..., (x′M^t, y′M^t, z′M^t), (x′R^t, y′R^t, z′R^t);
it is to be understood that the label data (or noise data) corresponding to the root joint may be arranged at any position in the first feature vector (or the second feature vector); note, however, that the joint types and their order of appearance must be consistent between the first feature vector and the second feature vector.
Secondly, in this embodiment of the application, a method for training the dynamic capture denoising model based on the angular velocity of the root joint is provided. Since the angular velocity of the root joint reflects the speed of the character's motion, using it as a reference for predicting the noise offset vector can improve the accuracy of model prediction.
Optionally, in another optional embodiment provided in the present application based on the respective embodiments corresponding to fig. 13, based on the dynamic capturing noise data, a noise offset vector corresponding to a dynamic capturing sample key frame is obtained through a timing network included in a dynamic capturing denoising model to be trained, and specifically includes the following steps:
acquiring a coding feature vector set based on the dynamic capture noise data through an encoder included in the dynamic capture denoising model to be trained, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through a time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
based on the target time sequence feature vector, a noise offset vector corresponding to a dynamic capture sample key frame is obtained through a decoder included in the dynamic capture denoising model to be trained.
In this embodiment, a manner of outputting the noise offset vector based on the dynamic capture denoising model to be trained is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e. 3×18+3=57); on this basis, the dynamic capture noise data composed of 31 dynamic capture key frames is denoted as 31×57.
Similar to what has been described in the foregoing embodiment, this can be understood with reference to fig. 8 again. Assuming that the dynamic capture sample key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture noise data can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data.
The coding feature vector set is input to the timing network, which outputs a 31×54 matrix; the target time sequence feature vector corresponding to the dynamic capture sample key frame, which has 54 dimensions, is selected from this matrix. It should be noted that the timing network may be a Bi-LSTM, LSTM, GRU, TCN or RNN network, which is not limited herein.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 54-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the dynamic capture sample key frame. It should be noted that the number of decoder layers is only illustrative.
In this embodiment of the application, a manner of outputting the noise offset vector based on the dynamic capture denoising model to be trained is provided. In this way, the N coding feature vectors obtained by encoding are encoded again by the timing network, so that the implicit temporal features in the N dynamic capture key frames are extracted and a more accurate noise offset vector is predicted, improving the effect of repairing animation noise.
Optionally, in another optional embodiment provided in the present application based on the respective embodiments corresponding to fig. 13, the acquiring, based on the set of coding feature vectors, the target timing feature vector through a timing network included in the dynamic denoising model to be trained specifically includes the following steps:
acquiring the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
or,
acquiring the target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set.
In this embodiment, a method for predicting the target time sequence feature vector based on a unidirectional time sequence network is described, where N is equal to 31 and the timing network is exemplified by an LSTM. This is similar to what is described in the foregoing embodiments and can be understood with reference to fig. 8 and fig. 9 again; details are not repeated here.
Further, in this embodiment of the application, a method for predicting the target time sequence feature vector based on a unidirectional time sequence network is provided. In this way, a sequence encoder can effectively extract the sequence information in the N dynamic capture key frames: the key frames are encoded one after another, introducing information from the preceding or following frames, thereby improving the effect of model prediction.
Optionally, in another optional embodiment provided in the present application based on the respective embodiments corresponding to fig. 13, the acquiring, based on the set of coding feature vectors, the target timing feature vector through a timing network included in the dynamic denoising model to be trained specifically includes the following steps:
Based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained;
acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generating the target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
In this embodiment, a manner of predicting the target time sequence feature vector based on a bidirectional time sequence network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e. 3×18+3=57); on this basis, the 31 frames of dynamic capture key frames constitute the dynamic capture noise data, represented as 31×57.
For ease of understanding, please refer to fig. 11 again. Assuming that the dynamic capture sample key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture noise data can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data. It should be noted that the number of layers and the output dimensions of the encoder are only illustrative.
The coding feature vector set is then input into a bidirectional time sequence network (comprising a forward time sequence network and a backward time sequence network), which outputs a 31×108 matrix; the target time sequence feature vector corresponding to the dynamic capture sample key frame, which has 108 dimensions, is selected from this matrix. Specifically, the first time sequence feature vector corresponding to the dynamic capture sample key frame output by the forward time sequence network and the second time sequence feature vector corresponding to the dynamic capture sample key frame output by the backward time sequence network are spliced to output the 108-dimensional target time sequence feature vector. It should be noted that the bidirectional time sequence network may specifically be a Bi-LSTM.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 108-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the dynamic capture sample key frame. It should be noted that the number of decoder layers is only illustrative.
Further, in this embodiment of the application, a method for predicting the target time sequence feature vector based on a bidirectional time sequence network is provided. Processing the timing information with a bidirectional LSTM makes the motion trend easier to capture, since the bidirectional LSTM lets the model use both past and future curve-trend information, thereby improving the effect of model prediction.
Optionally, in another optional embodiment provided in the present application based on the respective embodiments corresponding to fig. 13, the acquiring, based on the set of coding feature vectors, the target timing feature vector through a timing network included in the dynamic denoising model to be trained specifically includes the following steps:
acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generating a target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
In this embodiment, a manner of predicting the target time sequence feature vector based on a two-layer bidirectional time sequence network is described, taking M equal to 18 and N equal to 31 as an example. In the dynamic capture noise data, each noise feature vector may further include the angular velocity of the root joint, so each noise feature vector includes 57-dimensional parameters (i.e. 3×18+3=57); on this basis, the 31 frames of dynamic capture key frames constitute the 31×57 dynamic capture noise data.
For ease of understanding, please refer to fig. 12 again. Assuming that the dynamic capture sample key frame is the t-th frame in the animation, the 15 consecutive frames before it and the 15 consecutive frames after it are taken, so the dynamic capture noise data can be represented as a 31×57 matrix. The data is first input to an encoder, which is assumed to comprise two fully connected layers: the first layer takes the 31×57 matrix as input and outputs a 31×48 matrix after encoding; the second layer takes the 31×48 matrix as input and outputs a 31×32 matrix after encoding. The 31×32 matrix is the coding feature vector set, in which each coding feature vector comprises 32-dimensional data. It should be noted that the number of layers and the output dimensions of the encoder are only illustrative.
The coding feature vector set is then input into a two-layer bidirectional time sequence network (comprising a first bidirectional time sequence network and a second bidirectional time sequence network, each of which comprises a forward time sequence network and a backward time sequence network). The first bidirectional time sequence network outputs a 31×108 matrix, which is input into the second bidirectional time sequence network; the second bidirectional time sequence network outputs another 31×108 matrix, from which the target time sequence feature vector corresponding to the dynamic capture sample key frame, with 108 dimensions, is selected. Specifically, the third time sequence feature vector corresponding to the dynamic capture sample key frame output by the first bidirectional time sequence network and the fourth time sequence feature vector corresponding to the dynamic capture sample key frame output by the second bidirectional time sequence network are spliced to output the 108-dimensional target time sequence feature vector. It should be noted that the bidirectional time sequence network may specifically be a Bi-LSTM.
The target time sequence feature vector is input to a decoder, whose primary role is to map the information output by the timing network to the rotation space. The decoder is assumed to comprise two fully connected layers: the first layer takes the 108-dimensional target time sequence feature vector as input, and after two decoding passes a 54-dimensional noise offset vector is obtained, which does not include offset data corresponding to the root joint. Combining the noise offset vector yields the joint denoising result of the dynamic capture sample key frame. It should be noted that the number of decoder layers is only illustrative.
Further, in this embodiment of the application, a manner of predicting the target time sequence feature vector based on a two-layer bidirectional time sequence network is provided. Processing the timing information with a two-layer bidirectional LSTM makes the motion trend easier to capture, since it lets the model use both past and future curve-trend information, thereby improving the effect of model prediction.
Optionally, based on the foregoing respective embodiments corresponding to fig. 13, in another optional embodiment provided in this application, updating model parameters of a dynamic capture denoising model to be trained according to a first feature vector, a second feature vector, and a noise offset vector specifically includes the following steps:
Determining a joint denoising result according to the noise offset vector and the second feature vector;
determining a loss value by adopting a loss function according to the joint denoising result and the first feature vector;
and updating the model parameters of the dynamic capture denoising model to be trained according to the loss value.
In this embodiment, a way of training the dynamic capture denoising model to be trained using a loss function is described. Because the purpose of denoising is to make the noisy data as close as possible to clean dynamic capture data after denoising, a Huber loss function may be used in training, although this should not be construed as limiting the application.
Specifically, the huber loss function employed in training is as follows:
L(y, f(x)) = 0.5 × (y − f(x))²,  if δ ≤ 1
L(y, f(x)) = δ − 0.5,            if δ > 1
where y represents the true value, e.g., the first feature vector of the dynamic capture sample key frame; f(x) represents the model prediction, e.g., the joint denoising result of the dynamic capture sample key frame, obtained by adding the noise offset vector to the second feature vector; and δ represents the absolute value of the difference between y and f(x).
Further, the dropout rate of the time sequence network used in training is 0.5, the learning rate is 5e-5, and the number of epochs is 5; this is only an illustration and should not be construed as limiting the application. When the dynamic capture denoising model to be trained is trained, the predicted noise offset vector is superimposed on the second feature vector to obtain the joint denoising result, the loss function compares this result against the noise-free first feature vector to compute a loss value, and the gradient is back-propagated according to the loss value to update the model, until the loss value converges or the upper limit of iterations is reached. At that point, the model parameters obtained in the last iteration of the dynamic capture denoising model to be trained are taken as the model parameters of the dynamic capture denoising model, which has thus learned to predict the noise offset from the input noisy features.
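A hedged sketch of one training step under the hyper-parameters quoted above follows; the stand-in model is a single linear layer purely so the snippet runs (in practice it would be the encoder/timing-network/decoder model sketched earlier, with dropout 0.5 applied in the timing network), and delta=1.0 in the Huber loss is an assumption:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(31 * 57, 54))  # stand-in model
criterion = nn.HuberLoss(delta=1.0)   # delta=1.0 is an assumption
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)

def train_step(noisy_window, second_vec, first_vec):
    # noisy_window: (batch, 31, 57) dynamic capture noise data
    # second_vec:   (batch, 54) noisy features of the sample key frame
    # first_vec:    (batch, 54) clean label features of the sample key frame
    optimizer.zero_grad()
    offset = model(noisy_window)           # predicted noise offset vector
    denoised = second_vec + offset         # joint denoising result
    loss = criterion(denoised, first_vec)  # compare against the clean labels
    loss.backward()                        # back-propagate the gradient
    optimizer.step()
    return loss.item()

loss = train_step(torch.randn(8, 31, 57),
                  torch.randn(8, 54), torch.randn(8, 54))
```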
Secondly, in this embodiment of the application, a way of training the dynamic capture denoising model to be trained using a loss function is provided. Specifically adopting the Huber loss as the training loss reduces the model's sensitivity to outliers, so that it does not over-fit abnormal noise points and generates as smooth an action curve as possible, which benefits the rationality and reliability of model training.
The data processing method provided in the present application will be further described below in conjunction with experimental data.
In evaluating the model, the error between the rotation data before and after denoising is used to measure the model's effect; the root mean square error (RMSE) can be used to accumulate and average the per-frame errors. The RMSE formula is as follows:
MSE = (1 / (W × K)) × Σ_{i=1}^{W} Σ_{j=1}^{K} (e_{i,j})²

RMSE = √MSE
where W represents the total number of frames in the denoised animation and K represents the total number of channels per dynamic capture key frame; if each key frame has 18 joints and each joint corresponds to three channels, K is 54. e_{i,j} represents the difference between the rotation values of the j-th channel in the i-th frame before and after denoising, and MSE represents the mean square error.
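These formulas translate directly into a few lines of Python (a sketch; `before` and `after` denote the W×K rotation matrices before and after denoising):

```python
import numpy as np

def rmse(before, after):
    e = before - after      # (W, K) per-frame, per-channel differences e_ij
    mse = np.mean(e ** 2)   # 1/(W*K) times the sum of squared errors
    return np.sqrt(mse)

print(rmse(np.random.rand(3500, 54), np.random.rand(3500, 54)))
```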
The data processing method provided in this application can clean dynamic capture data: it removes most of the jitter in the data while preserving the motion trend well, so the details of the denoised animation are in place and the movement is natural and smooth. This in turn reduces the cost of manual denoising, improves the denoising effect, and raises the efficiency of game animation production.
Based on this, refer to Table 4, which shows the evaluation results on the key frame test set (10 files, 3500 frames in total).
TABLE 4

(Table 4 is reproduced as an image in the original document; it compares the evaluation results of the conventional scheme with those of the method provided in this application on the key frame test set.)
The conventional scheme denotes a denoising method using an adaptive Gaussian filter. As can be seen from Table 4, the dynamic capture data processing method provided in this application yields a clear improvement in the evaluation index. In terms of animation quality, the method also preserves action details and stays closer to the original curve in local trend. For convenience of explanation, please refer to fig. 15, a comparison of Y-axis rotation curves based on experimental data in this embodiment of the application, which shows four Y-axis rotation curves (spin 02_rotation): F1 indicates the curve after noise is added, F2 the noise-free curve, F3 the curve predicted by the conventional scheme (i.e. the adaptive Gaussian filter), and F4 the curve predicted by the method provided in this application. Clearly, the curve indicated by F3 does not undulate sufficiently at the peaks and troughs, for example the first peak is lower, whereas the curve indicated by F4 is closer to the original curve and better preserves the details of the action.
Further, referring to fig. 16, a comparison of X-axis rotation curves based on experimental data in this embodiment of the application: fig. 16 (A) and fig. 16 (B) each show four X-axis rotation curves (clip_r_xrotation), with the rectangular frame selected in fig. 16 (A) enlarged as fig. 16 (B) for ease of observation. F1 indicates the curve after noise is added, F2 the noise-free curve, F3 the curve predicted by the conventional scheme (i.e. the adaptive Gaussian filter), and F4 the curve predicted by the dynamic capture data processing method provided in this application. Clearly, the local fluctuation of the curve indicated by F4 is closer to the curve indicated by F2, whereas the curve indicated by F3 is biased toward the curve indicated by F1 and visibly oscillates around the curve indicated by F2. This shows that the solution provided in this application better reproduces the character's action; errors remain only at very small granularity (e.g. 0.5 degrees), which is barely distinguishable visually.
Referring to fig. 17, fig. 17 is a schematic diagram of an embodiment of a data processing apparatus according to an embodiment of the present application, and the data processing apparatus 30 includes:
The acquiring module 301 is configured to acquire a target dynamic capture keyframe, where the target dynamic capture keyframe corresponds to a target noise feature vector, the target noise feature vector includes noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the obtaining module 301 is further configured to obtain to-be-processed dynamic capture data according to the target dynamic capture key frame, where the to-be-processed dynamic capture data represents a noise feature vector set corresponding to N consecutive dynamic capture key frames, the N consecutive dynamic capture key frames include the target dynamic capture key frame, the noise feature vector set includes the target noise feature vector, and N is an integer greater than 1;
the obtaining module 301 is further configured to obtain, based on the dynamic capture data to be processed, a noise offset vector corresponding to the target dynamic capture keyframe through a timing network included in the dynamic capture denoising model, where the noise offset vector includes offset data of M joints;
the determining module 302 is configured to determine a joint denoising result corresponding to the target dynamic capture keyframe according to the target noise feature vector and the noise offset vector, where the joint denoising result includes denoising data of M joints.
Optionally, on the basis of the embodiment corresponding to fig. 17, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application, the target noise feature vector further includes an angular velocity corresponding to a root joint;
The acquiring module 301 is specifically configured to acquire the N continuous dynamic capture key frames according to the target dynamic capture key frame;
and acquire the dynamic capture data to be processed according to the continuous N dynamic capture key frames, where the dynamic capture data to be processed includes N noise feature vectors, and each noise feature vector includes the noise data corresponding to the M joints and the angular velocity corresponding to the root joint.
Alternatively, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 17 described above,
the obtaining module 301 is specifically configured to obtain, based on the dynamic capture data to be processed, a set of encoded feature vectors through an encoder included in the dynamic capture denoising model, where the set of encoded feature vectors includes N encoded feature vectors;
obtain a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
and obtain the noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
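A minimal sketch of such an encoder / time sequence network / decoder arrangement is given below; the patent fixes only the three-stage structure, so the choice of a bidirectional LSTM, the layer sizes, and all names (MocapDenoiser, etc.) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MocapDenoiser(nn.Module):
    """Encoder -> bidirectional time sequence network -> decoder, predicting a noise offset."""

    def __init__(self, channels: int = 54, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU())
        self.timing = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.decoder = nn.Linear(2 * hidden, channels)

    def forward(self, window: torch.Tensor) -> torch.Tensor:
        # window: (batch, N, channels) - noise feature vectors of N consecutive keyframes
        encoded = self.encoder(window)        # set of N encoded feature vectors
        seq, _ = self.timing(encoded)         # per-frame time sequence feature vectors
        target = seq[:, seq.size(1) // 2]     # feature vector of the target (middle) frame
        return self.decoder(target)           # noise offset vector for the target keyframe

model = MocapDenoiser()
offset = model(torch.randn(2, 9, 54))  # e.g. N = 9 consecutive keyframes
print(offset.shape)                    # torch.Size([2, 54])
```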
Alternatively, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 17 described above,
The acquiring module 301 is specifically configured to acquire the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
or,
acquire the target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors.
Alternatively, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 17 described above,
the obtaining module 301 is specifically configured to obtain a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
obtain a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
and generate the target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
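One plausible realization of this combination (the use of GRUs and the concatenation of the two target-frame features are assumptions; the patent does not fix how the two vectors are merged):

```python
import torch
import torch.nn as nn

class FwdBwdTiming(nn.Module):
    """Run a forward GRU and a backward GRU, then fuse their target-frame features."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.fwd = nn.GRU(hidden, hidden, batch_first=True)
        self.bwd = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        mid = encoded.size(1) // 2
        f, _ = self.fwd(encoded)                      # first time sequence feature vectors
        b, _ = self.bwd(torch.flip(encoded, [1]))     # second, computed over reversed time
        b = torch.flip(b, [1])                        # restore original frame order
        return torch.cat([f[:, mid], b[:, mid]], -1)  # target time sequence feature vector
```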
Alternatively, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 17 described above,
the obtaining module 301 is specifically configured to obtain a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
obtain a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model based on the set of encoded feature vectors;
and generate the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
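Similarly, two bidirectional time sequence networks could be stacked as sketched below (whether the second network consumes the full sequence output of the first, and the concatenation used to generate the target vector, are assumptions):

```python
import torch
import torch.nn as nn

class StackedBiTiming(nn.Module):
    """Two bidirectional LSTMs whose target-frame outputs are fused."""

    def __init__(self, hidden: int = 128):
        super().__init__()
        self.bi1 = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.bi2 = nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        mid = encoded.size(1) // 2
        seq1, _ = self.bi1(encoded)   # third time sequence feature vector (per frame)
        seq2, _ = self.bi2(seq1)      # fourth time sequence feature vector (per frame)
        # target time sequence feature vector generated from the third and fourth vectors
        return torch.cat([seq1[:, mid], seq2[:, mid]], dim=-1)
```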
Alternatively, in another embodiment of the data processing apparatus 30 provided in the embodiment of the present application based on the embodiment corresponding to fig. 17 described above,
the determining module 302 is specifically configured to acquire, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to that joint in the noise offset vector;
add the offset data corresponding to that joint to the noise data corresponding to that joint to obtain the denoising data corresponding to that joint;
and, when the denoising data corresponding to each joint of the M joints is obtained, generate the joint denoising result corresponding to the target dynamic capture key frame.
Referring to fig. 18, fig. 18 is a schematic diagram of an embodiment of a dynamic capture denoising model training apparatus according to an embodiment of the present application, and the dynamic capture denoising model training apparatus 40 includes:
The acquiring module 401 is configured to acquire dynamic capture noise-free data corresponding to dynamic capture sample key frames, where the dynamic capture noise-free data includes an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames include dynamic capture sample key frames, the original feature vector set includes a first feature vector corresponding to the dynamic capture sample key frames, the first feature vector includes tag data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the obtaining module 401 is further configured to obtain dynamic capturing noise data according to dynamic capturing noiseless data, where the dynamic capturing noise data includes noise feature vector sets corresponding to N consecutive dynamic capturing key frames, the noise feature vector sets include second feature vectors corresponding to dynamic capturing sample key frames, and the second feature vectors include noise data corresponding to M joints;
the obtaining module 401 is further configured to obtain, based on the dynamic capture noise data, a noise offset vector corresponding to the dynamic capture sample key frame through a time sequence network included in the dynamic capture denoising model to be trained, where the noise offset vector includes offset data of the M joints;
the training module 402 is configured to update model parameters of the dynamic capture denoising model to be trained according to the first feature vector, the second feature vector, and the noise offset vector until a model training condition is satisfied, and output the dynamic capture denoising model.
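A compressed sketch of one training step consistent with this description (the additive-noise synthesis, the optimizer, and the reuse of the illustrative MocapDenoiser sketched earlier are assumptions):

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, clean_window: torch.Tensor, noise_std: float = 0.05) -> float:
    """One update: corrupt clean motion, predict an offset, regress toward the clean frame."""
    mid = clean_window.size(1) // 2
    noisy = clean_window + noise_std * torch.randn_like(clean_window)  # dynamic capture noise data
    offset = model(noisy)                              # noise offset vector for the sample keyframe
    denoised = noisy[:, mid] + offset                  # joint denoising result (second feature vector + offset)
    loss = F.mse_loss(denoised, clean_window[:, mid])  # compare against the first (label) feature vector
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. optimizer = torch.optim.Adam(model.parameters(), lr=1e-3), looped until the
# model training condition (such as a validation-loss plateau) is satisfied.
```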
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the acquiring module 401 is specifically configured to acquire a dynamic capture sample key frame;
acquiring continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquire the dynamic capture noiseless data according to the N continuous dynamic capture key frames, where the dynamic capture noiseless data includes N original feature vectors, and each original feature vector includes label data corresponding to the M joints and label data corresponding to the root joint;
the obtaining module 401 is specifically configured to obtain the dynamic capture noise data according to the original feature vector set corresponding to the N continuous dynamic capture key frames, where the dynamic capture noise data includes N noise feature vectors, and each noise feature vector includes noise data corresponding to the M joints and an angular velocity corresponding to the root joint.
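The noise feature vectors might, for example, be synthesized from the noiseless originals as follows (additive Gaussian jitter and all names are assumptions; this passage does not fix the noise model):

```python
import numpy as np

def make_noise_data(original: np.ndarray, std: float = 0.5, rng=None) -> np.ndarray:
    """Corrupt clean feature vectors to create dynamic capture noise data.

    original: (N, D) original feature vectors for N consecutive keyframes,
              where D covers the M joints' channels (plus root-joint terms).
    """
    rng = rng or np.random.default_rng()
    jitter = rng.normal(0.0, std, size=original.shape)  # per-channel, per-frame noise
    return original + jitter                            # N noise feature vectors
```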
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain, based on the dynamic capture noise data, a set of encoded feature vectors through an encoder included in the dynamic capture denoising model to be trained, where the set of encoded feature vectors includes N encoded feature vectors;
obtain a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
and obtain the noise offset vector corresponding to the dynamic capture sample key frame through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence feature vector.
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
or,
obtain the target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors.
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
obtain a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
and generate the target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the obtaining module 401 is specifically configured to obtain a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
obtain a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the set of encoded feature vectors;
and generate the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
Alternatively, based on the embodiment corresponding to fig. 18, in another embodiment of the dynamic capture denoising model training apparatus 40 provided in the embodiment of the present application,
the training module 402 is specifically configured to determine a joint denoising result according to the noise offset vector and the second feature vector;
determine a loss value by using a loss function according to the joint denoising result and the first feature vector;
and update the model parameters of the dynamic capture denoising model to be trained according to the loss value.
Fig. 19 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 500 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 522 (e.g., one or more processors), memory 532, and one or more storage media 530 (e.g., one or more mass storage devices) storing applications 542 or data 544. The memory 532 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 522 may be configured to communicate with the storage medium 530 and execute, on the server 500, the series of instruction operations in the storage medium 530.
The server 500 may also include one or more power supplies 526, one or more wired or wireless network interfaces 550, one or more input/output interfaces 558, and/or one or more operating systems 541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 19.
Also provided in embodiments of the present application is a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the methods as described in the foregoing embodiments.
Also provided in embodiments of the present application is a computer program product comprising a program which, when run on a computer, causes the computer to perform the methods described in the foregoing embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (13)

1. A method of data processing, comprising:
acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
acquiring to-be-processed dynamic capture data according to the target dynamic capture key frames, wherein the to-be-processed dynamic capture data represent noise feature vector sets corresponding to continuous N dynamic capture key frames, the continuous N dynamic capture key frames comprise the target dynamic capture key frames, the noise feature vector sets comprise the target noise feature vectors, and the N is an integer greater than 1;
acquiring a noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in a dynamic capture denoising model based on the dynamic capture data to be processed, wherein the noise offset vector comprises offset data of the M joints;
determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising result comprises denoising data of the M joints;
the step of obtaining the noise offset vector corresponding to the target dynamic capture key frame through a time sequence network included in the dynamic capture denoising model based on the dynamic capture data to be processed comprises the following steps:
acquiring a coding feature vector set based on the dynamic capture data to be processed through an encoder included in the dynamic capture denoising model, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
2. The method of claim 1, wherein the target noise feature vector further comprises an angular velocity corresponding to a root joint;
the obtaining the dynamic capture data to be processed according to the target dynamic capture key frame comprises the following steps:
acquiring the continuous N dynamic capture key frames according to the target dynamic capture key frames;
and acquiring the dynamic capture data to be processed according to the continuous N dynamic capture key frames, wherein the dynamic capture data to be processed comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to the M joints and angular velocity corresponding to the root joint.
3. The method of claim 1, wherein the obtaining, based on the set of encoded feature vectors, a target timing feature vector through the timing network included in the dynamic capture denoising model comprises:
acquiring the target time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
or,
acquiring the target time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set.
4. The method of claim 1, wherein the obtaining, based on the set of encoded feature vectors, a target timing feature vector through the timing network included in the dynamic capture denoising model comprises:
based on the coding feature vector set, acquiring a first time sequence feature vector through a forward time sequence network included in the dynamic capture denoising model;
acquiring a second time sequence feature vector through a backward time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating the target time sequence feature vector according to the first time sequence feature vector and the second time sequence feature vector.
5. The method of claim 1, wherein the obtaining, based on the set of encoded feature vectors, a target timing feature vector through the timing network included in the dynamic capture denoising model comprises:
acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and generating the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
6. The method of claim 1, wherein the determining the joint denoising result corresponding to the target dynamic capture keyframe according to the target noise feature vector and the noise offset vector comprises:
acquiring, for the noise data corresponding to any joint in the target noise feature vector, the offset data corresponding to the joint in the noise offset vector;
adding the offset data corresponding to the joint to the noise data corresponding to the joint to obtain the denoising data corresponding to the joint;
and when the denoising data corresponding to each joint of the M joints is obtained, generating a joint denoising result corresponding to the target dynamic capture key frame.
7. A training method of a dynamic capture denoising model, comprising:
acquiring dynamic capture noiseless data corresponding to dynamic capture sample key frames, wherein the dynamic capture noiseless data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
acquiring dynamic capturing noise data according to the dynamic capturing noiseless data, wherein the dynamic capturing noise data comprises noise feature vector sets corresponding to the continuous N dynamic capturing key frames, the noise feature vector sets comprise second feature vectors corresponding to the dynamic capturing sample key frames, and the second feature vectors comprise noise data corresponding to M joints;
acquiring a noise offset vector corresponding to the dynamic capture sample key frame through a time sequence network included in a dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the noise offset vector comprises offset data of the M joints;
updating the model parameters of the dynamic denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector until the model training condition is met, and outputting the dynamic denoising model;
the step of acquiring the noise offset vector corresponding to the dynamic capture sample key frame through the time sequence network included in the dynamic capture denoising model to be trained based on the dynamic capture noise data comprises the following steps:
acquiring a coding feature vector set based on the dynamic capture noise data through an encoder included in the dynamic capture denoising model to be trained, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the dynamic capture sample key frame through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence feature vector.
8. The training method of claim 7, wherein the obtaining dynamic capture noiseless data corresponding to the dynamic capture sample keyframes comprises:
acquiring the dynamic capture sample key frame;
acquiring the continuous N dynamic capture key frames according to the dynamic capture sample key frames;
acquiring the dynamic capturing noiseless data according to the continuous N dynamic capturing key frames, wherein the dynamic capturing noiseless data comprises N original feature vectors, and each original feature vector comprises tag data corresponding to the M joints and tag data corresponding to a root joint;
the step of obtaining the dynamic capturing noise data according to the dynamic capturing noise-free data comprises the following steps:
and acquiring the dynamic capture noise data according to the original feature vector sets corresponding to the continuous N dynamic capture key frames, wherein the dynamic capture noise data comprises N noise feature vectors, and each noise feature vector comprises noise data corresponding to the M joints and angular velocity corresponding to a root joint.
9. The training method of claim 7, wherein the obtaining, based on the set of encoded feature vectors, a target timing feature vector through the timing network included in the dynamic capture denoising model to be trained, comprises:
acquiring a third time sequence feature vector through a first bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
acquiring a fourth time sequence feature vector through a second bidirectional time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and generating the target time sequence feature vector according to the third time sequence feature vector and the fourth time sequence feature vector.
10. A data processing apparatus, comprising:
the acquisition module is used for acquiring a target dynamic capture key frame, wherein the target dynamic capture key frame corresponds to a target noise feature vector, the target noise feature vector comprises noise data corresponding to M joints, and M is an integer greater than or equal to 1;
the acquisition module is further configured to acquire to-be-processed dynamic capture data according to the target dynamic capture key frames, where the to-be-processed dynamic capture data represents a noise feature vector set corresponding to N consecutive dynamic capture key frames, the N consecutive dynamic capture key frames include the target dynamic capture key frames, the noise feature vector set includes the target noise feature vector, and N is an integer greater than 1;
The acquisition module is further configured to acquire a noise offset vector corresponding to the target dynamic capture key frame through a timing network included in the dynamic capture denoising model based on the dynamic capture data to be processed, where the noise offset vector includes offset data of the M joints;
the determining module is used for determining a joint denoising result corresponding to the target dynamic capture key frame according to the target noise feature vector and the noise offset vector, wherein the joint denoising result comprises denoising data of the M joints;
the acquisition module is specifically configured to, when acquiring a noise offset vector corresponding to the target dynamic capture key frame through a timing network included in the dynamic capture denoising model based on the dynamic capture data to be processed:
acquiring a coding feature vector set based on the dynamic capture data to be processed through an encoder included in the dynamic capture denoising model, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the target dynamic capture key frame through a decoder included in the dynamic capture denoising model based on the target time sequence feature vector.
11. A dynamic capture denoising model training apparatus, comprising:
the acquisition module is used for acquiring dynamic capture noiseless data corresponding to dynamic capture sample key frames, wherein the dynamic capture noiseless data comprises an original feature vector set of N continuous dynamic capture key frames, the N continuous dynamic capture key frames comprise the dynamic capture sample key frames, the original feature vector set comprises first feature vectors corresponding to the dynamic capture sample key frames, the first feature vectors comprise label data corresponding to M joints, N is an integer greater than 1, and M is an integer greater than or equal to 1;
the acquisition module is further configured to acquire dynamic capture noise data according to the dynamic capture noiseless data, where the dynamic capture noise data includes noise feature vector sets corresponding to the N continuous dynamic capture key frames, the noise feature vector sets include second feature vectors corresponding to the dynamic capture sample key frames, and the second feature vectors include noise data corresponding to M joints;
the acquisition module is further configured to acquire a noise offset vector corresponding to the dynamic capture sample key frame through a time sequence network included in the dynamic capture denoising model to be trained based on the dynamic capture noise data, wherein the noise offset vector includes offset data of the M joints;
The training module is used for updating the model parameters of the dynamic denoising model to be trained according to the first feature vector, the second feature vector and the noise offset vector until the model training conditions are met, and outputting the dynamic denoising model;
the acquisition module is specifically configured to, when acquiring the noise offset vector corresponding to the dynamic capture sample key frame through the time sequence network included in the dynamic capture denoising model to be trained based on the dynamic capture noise data:
acquiring a coding feature vector set based on the dynamic capture noise data through an encoder included in the dynamic capture denoising model to be trained, wherein the coding feature vector set comprises N coding feature vectors;
acquiring a target time sequence feature vector through the time sequence network included in the dynamic capture denoising model to be trained based on the coding feature vector set;
and acquiring a noise offset vector corresponding to the dynamic capture sample key frame through a decoder included in the dynamic capture denoising model to be trained based on the target time sequence feature vector.
12. A computer device, comprising: memory, transceiver, processor, and bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, and to perform the method of any one of claims 1 to 6 or the training method of any one of claims 7 to 9 according to instructions in the program;
the bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
13. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 6 or to perform the training method of any one of claims 7 to 9.
CN202010844348.4A 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model Active CN111899320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010844348.4A CN111899320B (en) 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model

Publications (2)

Publication Number Publication Date
CN111899320A CN111899320A (en) 2020-11-06
CN111899320B true CN111899320B (en) 2023-05-23

Family

ID=73230095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010844348.4A Active CN111899320B (en) 2020-08-20 2020-08-20 Data processing method, training method and device of dynamic capture denoising model

Country Status (1)

Country Link
CN (1) CN111899320B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528891A (en) * 2020-12-16 2021-03-19 重庆邮电大学 Bidirectional LSTM-CNN video behavior identification method based on skeleton information
CN112634409B (en) * 2020-12-28 2022-04-19 稿定(厦门)科技有限公司 Custom animation curve generation method and device
CN112990154B (en) * 2021-05-11 2021-07-30 腾讯科技(深圳)有限公司 Data processing method, computer equipment and readable storage medium
CN113112019A (en) * 2021-05-14 2021-07-13 电子科技大学成都学院 Data label noise reduction system based on improved anti-noise robustness learning algorithm
CN113505662B (en) * 2021-06-23 2024-03-01 广州大学 Body-building guiding method, device and storage medium
CN114051148A (en) * 2021-11-10 2022-02-15 拓胜(北京)科技发展有限公司 Virtual anchor generation method and device and electronic equipment
CN114880314B (en) * 2022-05-23 2023-03-24 北京正远达科技有限公司 Big data cleaning decision-making method applying artificial intelligence strategy and AI processing system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013174671A1 (en) * 2012-05-22 2013-11-28 Telefonica, S.A. A method and a system for generating a realistic 3d reconstruction model for an object or being
CN110232672A (en) * 2019-06-20 2019-09-13 合肥工业大学 A kind of denoising method and system of exercise data
CN111340211A (en) * 2020-02-19 2020-06-26 腾讯科技(深圳)有限公司 Training method of action control model, related device and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
O. Onder et al. Keyframe Reduction Techniques for Motion Capture Data. 2008 3DTV Conference, 2008, pp. 293-296. *
Xu Meng et al. Research on Motion Control Technology for Virtual Humans. Journal of System Simulation, 2003, Vol. 15, No. 3, pp. 338-342, 346. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40030654)
GR01 Patent grant