CN117894068A - Motion data processing method, training method and device for key frame extraction model - Google Patents

Motion data processing method, training method and device for key frame extraction model

Info

Publication number
CN117894068A
Authority
CN
China
Prior art keywords
training
key frame
sequence
key
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311794535.6A
Other languages
Chinese (zh)
Inventor
张栩凌
张子儒
王宇阳
许彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hong Kong University Of Science And Technology Guangzhou
Original Assignee
Hong Kong University Of Science And Technology Guangzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hong Kong University Of Science And Technology Guangzhou filed Critical Hong Kong University Of Science And Technology Guangzhou
Priority to CN202311794535.6A priority Critical patent/CN117894068A/en
Publication of CN117894068A publication Critical patent/CN117894068A/en
Pending legal-status Critical Current


Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The application discloses a motion data processing method, and a training method and device for a key frame extraction model, and relates to the field of computer technology. In the motion data processing method, a motion sequence comprising a plurality of data frames is first acquired, and the motion sequence is then input into a pre-trained key frame extraction model for key frame extraction to obtain a target key frame. The key frame extraction model is a deep reinforcement learning model that calculates a return value for each data frame and selects the target key frame based on the return value. Finally, the other data frames in the motion sequence are reconstructed using the target key frame to obtain reconstructed frames corresponding to those data frames, and a reconstructed sequence of the motion sequence is obtained based on the reconstructed frames and the target key frame. Because key frames are extracted from the motion sequence by the deep reinforcement learning model, the target key frame is selected based on the return values of the data frames, and the motion sequence can be reconstructed from the target key frame, the extraction efficiency and accuracy of key frames are effectively improved.

Description

Motion data processing method, training method and device for key frame extraction model
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a motion data processing method, a training method for a key frame extraction model, and a device thereof.
Background
Motion data can be used to create vivid character animation, perform human motion analysis and simulation, support real-time interactive games, and so on, and is widely applied in fields such as the metaverse, film production, game development, sports science, and virtual reality. In practical applications, however, large amounts of motion data need to be transmitted in a short time to achieve synchronization, resulting in high delay.
In the related art, unsupervised learning methods such as clustering, or heuristic algorithms such as genetic algorithms, are used to extract key frames from the motion data, so that only the key frames are transmitted and the whole motion data is reconstructed from them, thereby reducing the amount of transmitted data. However, the key frame extraction process consumes a lot of time, and the extraction accuracy is not ideal.
Disclosure of Invention
The present application aims to solve at least one of the technical problems in the related art. Therefore, embodiments of the present application provide a motion data processing method, and a training method and device for a key frame extraction model, which can effectively improve the efficiency and accuracy of key frame extraction.
In a first aspect, an embodiment of the present application provides a motion data processing method, including:
acquiring a motion sequence comprising a plurality of data frames;
inputting the motion sequence into a pre-trained key frame extraction model for key frame extraction to obtain a target key frame of the motion sequence; the key frame extraction model is a deep reinforcement learning model and is used for calculating a return value corresponding to each data frame and selecting the target key frame based on the return value;
and reconstructing the other data frames in the motion sequence by using the target key frame to obtain reconstructed frames corresponding to those data frames, and obtaining a reconstructed sequence of the motion sequence based on the reconstructed frames and the target key frame.
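As an illustrative, non-limiting sketch of the reconstruction step above (the function name, array shapes, and the use of linear interpolation are assumptions for illustration; the first and last frames of the sequence are assumed to be key frames):

```python
import numpy as np

def reconstruct_sequence(key_frames, key_indices, seq_len):
    """Rebuild a motion sequence from its key frames by linear interpolation.

    key_frames:  list of pose vectors (flattened keypoint coordinates), one per key frame.
    key_indices: sorted frame indices of the key frames; the first and last frames
                 of the sequence are assumed to be key frames.
    seq_len:     total number of frames N in the original motion sequence.
    """
    kf = np.asarray(key_frames, dtype=float)
    recon = np.empty((seq_len, kf.shape[1]))
    for j in range(len(key_indices) - 1):
        i0, i1 = key_indices[j], key_indices[j + 1]
        for t in range(i0, i1 + 1):
            w = (t - i0) / (i1 - i0)              # interpolation weight in [0, 1]
            recon[t] = (1.0 - w) * kf[j] + w * kf[j + 1]
    return recon
```

Key frames are copied through unchanged (w = 0 or 1), so the reconstructed sequence agrees with the original at every key frame index.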
In a second aspect, an embodiment of the present application further provides a training method of a keyframe extraction model, which is applied to the motion data processing method described in the embodiment of the first aspect of the present application, and includes:
acquiring a training data set; the training data set comprises a plurality of training states, the training states comprise a training sequence and an index sequence, the training sequence comprises a plurality of training data frames, and the index sequence is composed of key frame indexes of the training data frames;
inputting the training state into a key frame extraction model, selecting a key frame according to the decision value of each training data frame when taken as a key frame, and updating the training state based on the key frame to obtain an updated state;
calculating decision rewards of the key frames according to the training state and the updating state;
and taking the training state, the updated state, the key frame and the decision reward as unit training data, calculating a loss value according to the unit training data and the decision reward of the key frame, and updating model parameters of the key frame extraction model according to the loss value until a trained key frame extraction model is obtained.
In some embodiments of the present application, the selecting a key frame according to the decision value when each training data frame is used as the key frame includes:
taking each training data frame in the training sequence as a key frame, and calculating the decision value of the training data frame;
and taking the highest of the decision values as a target decision value, and selecting the training data frame corresponding to the target decision value as the key frame.
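The selection step above amounts to an argmax over decision values. A minimal sketch (the function name and the masking of previously chosen frames are illustrative assumptions, not part of the claimed method):

```python
def select_key_frame(decision_values, already_selected):
    """Return the index of the frame with the highest decision value,
    skipping frames that have already been chosen as key frames."""
    best_value, best_idx = float("-inf"), -1
    for i, v in enumerate(decision_values):
        if i in already_selected:
            continue
        if v > best_value:
            best_value, best_idx = v, i
    return best_idx
```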
In some embodiments of the present application, the calculating the decision reward for the key frame based on the training state and the update state comprises:
calculating a training reconstruction error based on the index sequence of the training state, and calculating an updated reconstruction error based on the index sequence of the updated state;
subtracting the updated reconstruction error from the training reconstruction error to obtain a reference decision reward;
normalizing the reference decision reward using the initial reconstruction error corresponding to an initial state to obtain the decision reward of the key frame, the initial state being obtained according to the initial value of the index sequence.
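The two steps above (error difference, then normalization) can be sketched as follows; the function name is illustrative, and division by the initial reconstruction error is one plausible reading of "normalizing":

```python
def decision_reward(training_error, updated_error, initial_error):
    """Reward for adding a key frame: the reduction in reconstruction error,
    normalized by the reconstruction error of the initial state."""
    reference_reward = training_error - updated_error  # error reduction
    return reference_reward / initial_error
```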
In some embodiments of the present application, the training data frame includes a plurality of keypoints therein; the step of calculating the reconstruction error includes:
obtaining the key frames according to the index sequence, and selecting a preceding key frame and a following key frame that are adjacent in time from the key frames; wherein a plurality of keypoints in the preceding key frame correspond one-to-one with a plurality of keypoints in the following key frame, forming keypoint groups;
performing interpolation calculation on each keypoint group using a preset interpolation algorithm to reconstruct the non-key frames between the preceding key frame and the following key frame, obtaining training reconstructed frames, wherein the training reconstructed frames and the key frames form a reconstruction sequence;
calculating a reconstruction error according to the training sequence and the reconstruction sequence, where the reconstruction error is the training reconstruction error, the updated reconstruction error, or the initial reconstruction error.
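One way to realize the final step above is a mean per-keypoint Euclidean distance between the original and reconstructed sequences; this metric and the array layout (frames, keypoints, coordinates) are assumptions for illustration, since the embodiment does not fix a specific error formula:

```python
import numpy as np

def reconstruction_error(original_seq, reconstructed_seq):
    """Mean Euclidean distance between corresponding keypoints of the
    original and reconstructed sequences, shape (frames, keypoints, coords)."""
    original = np.asarray(original_seq, dtype=float)
    reconstructed = np.asarray(reconstructed_seq, dtype=float)
    per_point = np.linalg.norm(original - reconstructed, axis=-1)
    return float(per_point.mean())
```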
In some embodiments of the present application, the calculating a loss value from the unit training data and the decision rewards of the key frames includes:
calculating an expected value of the unit training data according to the decision reward, a reward discount parameter, and a long-term reward;
inputting the unit training data into the key frame extraction model to obtain corresponding reference values;
and calculating a loss value of the key frame extraction model according to the reference value and the expected value.
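Since the model is a deep Q-learning model, the expected value and loss above are consistent with a standard temporal-difference target; the following is a sketch under that assumption (the exact formula used by the application is not stated, and the function names are hypothetical):

```python
def expected_value(decision_reward, discount, long_term_reward, terminal=False):
    """TD target y = r + gamma * (long-term reward of the updated state);
    no bootstrap term at the end of an episode."""
    return decision_reward + (0.0 if terminal else discount * long_term_reward)

def loss_value(reference_value, target_value):
    """Squared temporal-difference error for one unit of training data."""
    return (reference_value - target_value) ** 2
```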
In some embodiments of the present application, the unit training data is stored in a playback buffer, the playback buffer comprising a buffer capacity; the method further comprises the steps of:
randomly selecting the training state from the training data set, inputting the training state into the key frame extraction model, obtaining corresponding unit training data, and storing the corresponding unit training data into the playback cache until the cache capacity is reached;
and calculating a loss value according to the unit training data and the decision rewards of the key frames, updating the model parameters of the key frame extraction model according to the loss value, and, if the key frame extraction model does not meet a preset convergence condition, repeating the process to update the unit training data in the playback cache and continue training the key frame extraction model.
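The playback cache described above behaves like a standard experience replay buffer; a minimal sketch (class and method names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of unit training data; once full, the oldest
    transitions are evicted as new ones are pushed."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        # transition: (training_state, updated_state, key_frame, decision_reward)
        self.buffer.append(transition)

    def is_full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size):
        # uniform random minibatch for computing the loss
        return random.sample(self.buffer, batch_size)
```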
In some embodiments of the present application, before the acquiring the training data set, the method further includes:
acquiring an action sequence from a preset motion database, and adjusting the frame rate of the action sequence;
dividing the action sequence according to a preset time interval to obtain a plurality of training sequences; wherein each of the training sequences comprises the same number of training data frames;
initializing key frame indexes of the training data frames to obtain initial values of the index sequences; and the key frame index indicates that the training data frame is a key frame when the key frame index is a first index value, and indicates that the training data frame is a non-key frame when the key frame index is a second index value.
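The data preparation described in this embodiment (frame-rate adjustment, fixed-length segmentation, and index initialization with the second index value 0 for "non-key frame") can be sketched as follows; downsampling by taking every k-th frame is an assumption about how the frame rate is adjusted:

```python
def prepare_training_states(action_seq, downsample, frames_per_clip):
    """Downsample an action sequence, split it into fixed-length training
    sequences, and pair each with an all-zero (all non-key) index sequence."""
    seq = action_seq[::downsample]  # e.g. 120 fps -> 30 fps with downsample=4
    clips = [
        seq[i:i + frames_per_clip]
        for i in range(0, len(seq) - frames_per_clip + 1, frames_per_clip)
    ]
    # each training state = (training sequence, initial index sequence)
    return [(clip, [0] * frames_per_clip) for clip in clips]
```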
In a third aspect, an embodiment of the present application further provides a training device for a key frame extraction model, which applies the training method of the key frame extraction model according to the embodiment of the second aspect of the present application, the device including:
the acquisition module is used for acquiring a training data set; the training data set comprises a plurality of training states, the training states comprise a training sequence and an index sequence, the training sequence comprises a plurality of training data frames, and the index sequence is composed of key frame indexes of the training data frames;
the extraction module is used for inputting the training state into a key frame extraction model, selecting a key frame according to the decision value of each training data frame when taken as a key frame, and updating the training state based on the key frame to obtain an updated state;
the rewarding module is used for calculating decision rewards of the key frames according to the training state and the updating state;
the training module is used for taking the training state, the updated state, the key frame and the decision reward as unit training data, calculating a loss value according to the unit training data and the decision reward of the key frame, and updating the model parameters of the key frame extraction model according to the loss value until a trained key frame extraction model is obtained.
In a fourth aspect, an embodiment of the present application further provides an electronic device, including a memory, and a processor, where the memory stores a computer program, and the processor implements a training method of a keyframe extraction model according to an embodiment of the second aspect of the present application when executing the computer program.
In a fifth aspect, embodiments of the present application further provide a computer readable storage medium storing a program, where the program is executed by a processor to implement a training method for a keyframe extraction model according to embodiments of the second aspect of the present application.
The embodiment of the application at least comprises the following beneficial effects:
Embodiments of the present application provide a motion data processing method, and a training method and device for a key frame extraction model. In the motion data processing method, a motion sequence comprising a plurality of data frames is first acquired, and the motion sequence is then input into a pre-trained key frame extraction model for key frame extraction to obtain a target key frame of the motion sequence. The key frame extraction model is a deep reinforcement learning model that calculates a return value for each data frame and selects the target key frame based on the return value. Finally, the other data frames in the motion sequence are reconstructed using the target key frame to obtain reconstructed frames corresponding to those data frames, and a reconstructed sequence of the motion sequence is obtained based on the reconstructed frames and the target key frame. Because key frames are extracted from the motion sequence by the deep reinforcement learning model and the target key frame is selected based on the return values of the data frames, the motion sequence can be reconstructed from the target key frame, effectively improving the extraction efficiency and accuracy of key frames.
Additional aspects and advantages of the application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 is a schematic flow chart of a training method of a key frame extraction model according to an embodiment of the present application;
fig. 2 is a schematic flow chart before step S101 in fig. 1;
fig. 3 is a schematic flow chart of step S103 in fig. 1;
fig. 4 is a schematic flow chart of step S104 in fig. 1;
FIG. 5 is a schematic flow chart of step S102 in FIG. 1;
FIG. 6 is a flowchart of a training method of another keyframe extraction model according to one embodiment of the present application;
FIG. 7 is a flowchart of a training method for a further key frame extraction model according to one embodiment of the present application;
FIG. 8 is a schematic diagram of a visual reconstruction result of an action sequence provided by one embodiment of the present application;
FIG. 9 is a schematic diagram of a training device module of a keyframe extraction model according to one embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Reference numerals: the system comprises an acquisition module 100, an extraction module 200, a rewarding module 300, a training module 400, an electronic device 1000, a processor 1001 and a memory 1002.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
In the description of the present application, it should be understood that references to orientation descriptions, such as directions of up, down, front, back, left, right, etc., are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present application and simplifying the description, and do not indicate or imply that the apparatus or element referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present application.
In the description of the present application, "several" means one or more, and "a plurality of" means two or more; greater than, less than, exceeding, and the like are understood to exclude the stated number, while above, below, within, and the like are understood to include it. The terms first and second are used only to distinguish technical features and should not be construed as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present application, unless explicitly defined otherwise, terms such as arrangement, installation, connection, etc. should be construed broadly and the specific meaning of the terms in the present application can be reasonably determined by a person skilled in the art in combination with the specific contents of the technical solution.
A motion capture device is a technical device for capturing and recording the motion of a human body or an object. Using sensors, cameras, inertial measurement units and other technologies, the device can accurately identify and record the positions of the skeletal points of a human body or object in space, and save the spatial coordinates of those skeletal points at each moment as motion data, thereby forming a time sequence that represents the motion process and accurately describes the motion trajectory and posture of the subject. Motion capture devices are widely used in many fields, including the metaverse, film production, game development, sports science, and virtual reality. They can be used to create realistic character animation, perform human motion analysis and simulation, support real-time interactive games, and so on. Using the captured motion data sequence, a user can synchronize motion in the real world to the virtual world, providing a more real and accurate interactive experience and giving the user an immersive experience.
In actual use, synchronizing motion data to a virtual character often requires transmitting a large amount of motion data in a short time, and for scenarios involving wireless communication, such as the metaverse or virtual reality games, the data transmission process inevitably causes higher delay, thereby affecting the user experience. Key frame extraction and motion reconstruction techniques can be used to solve this problem. Key frame extraction techniques are commonly used in video to summarize video content by selecting a concise set of representative frames in different ways, enhancing the ability to store, transmit, and summarize the video. When network conditions are poor, only key frames need be transmitted to the user, and the intermediate frames of the low-frame-rate video are reconstructed in different ways, thereby increasing the frame rate and ensuring the smoothness of the video. For a motion capture system and its motion data, only part of the frames in the sequence need be transmitted by extracting representative key frames, and the whole sequence is reconstructed at the user side using the extracted key frames, greatly reducing the amount of transmitted data without affecting the user experience.
In the related art, key frames of any given sequence can be extracted using unsupervised learning methods such as clustering or heuristic algorithms such as genetic algorithms, but these methods lack generalization capability, and their results cannot be transferred to other key frame extraction problems. Such schemes have to be recomputed for each new action sequence, so the key frame extraction process requires a lot of time and the extraction accuracy is not ideal.
Based on the above, embodiments of the present application provide a motion data processing method, and a training method and device for a key frame extraction model, in which key frame extraction is performed on a motion sequence through a deep reinforcement learning model and the target key frame is selected based on the return values of the data frames, so that the motion sequence can be reconstructed from the target key frame, thereby effectively improving the extraction efficiency and accuracy of key frames. Moreover, the deep learning model has strong generalization capability and can be transferred to different key frame extraction problems.
The embodiment of the application provides a motion data processing method, a training method of a key frame extraction model and a training device of the key frame extraction model, and specifically describes the following embodiments.
The embodiments of the present application provide a motion data processing method and a training method for a key frame extraction model, relating to the field of computer technology, in particular to the field of deep reinforcement learning. The motion data processing method and the training method of the key frame extraction model provided by the embodiments of the present application can be applied to a terminal, a server, or a computer program running in a terminal or a server. For example, the computer program may be a native program or a software module in an operating system; a local application, i.e., a program that needs to be installed in an operating system to run, such as a client supporting motion data processing and key frame extraction model training; or a web application, i.e., a program that only needs to be downloaded into a browser environment to run. In general, the computer program may be any form of application, module, or plug-in. The terminal communicates with the server through a network. The motion data processing method and the training method of the key frame extraction model can be executed by the terminal, by the server, or cooperatively by the terminal and the server.
In some embodiments, the terminal may be a smartphone, tablet, notebook computer, desktop computer, smart watch, or the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms; or it may be a service node in a blockchain system, where the service nodes form a peer-to-peer (Peer-To-Peer, P2P) network, the P2P protocol being an application layer protocol running on top of the Transmission Control Protocol (TCP). The server may host the motion data processing system and the training system of the key frame extraction model, through which the server interacts with the terminal; for example, the server may be provided with corresponding software, which may be an application implementing the motion data processing method and the training method of the key frame extraction model, but is not limited to the above forms. The terminal and the server may be connected through Bluetooth, a universal serial bus (Universal Serial Bus, USB), a network, or other communication connections, which is not limited herein.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The following describes a motion data processing method in the embodiment of the present application.
In some embodiments of the present application, the motion sequence may be any form of motion data, such as a person's actions, a robot's movement, or a vehicle's motion. In the motion data processing method, a motion sequence to be transmitted is first obtained, and the motion sequence is then input into a pre-trained key frame extraction model for key frame extraction to obtain a target key frame of the motion sequence. Specifically, the target key frame represents the main content or key actions of the whole motion sequence and can clearly show the main actions or changes of the moving subject. Therefore, only the target key frame need be transmitted; the other data frames in the motion sequence can be reconstructed at the receiving end from the target key frame to obtain reconstructed frames corresponding to those data frames, and a reconstructed sequence of the motion sequence is obtained based on the reconstructed frames and the target key frame. A large amount of motion data is thereby effectively compressed and summarized while its key content is preserved, reducing the amount of transmitted data.
In some embodiments, the key frame extraction model is a deep reinforcement learning model. Deep reinforcement learning combines deep learning with reinforcement learning so that a neural network can make optimal decisions through interactive learning with the environment, to achieve a specific goal or maximize rewards. Unlike common supervised learning algorithms, a deep reinforcement learning algorithm does not need optimal decision solutions for different scenarios as training data and can realize the training process by means of a reward function alone, giving it strong adaptability and generalization capability.
Since key frame extraction faces variable scenes and actions in application, it is difficult to obtain comprehensive key frame extraction decision data to serve as a training data set; in this embodiment, key frame extraction is therefore performed through a deep reinforcement learning model. Specifically, the key frame extraction model is a deep Q-learning model, which calculates the corresponding return value when each data frame is taken as a key frame and selects the target key frame based on the return values, for example, selecting the data frame with the highest return value as the target key frame, which is not limiting in this embodiment.
It can be understood that the number of target key frames can be preset according to actual requirements, and the key frame extraction model can extract a corresponding number of target key frames from the motion sequence according to the set number of key frames.
Key frames are extracted from the motion sequence by the deep reinforcement learning model, and the target key frame is selected based on the return values of the data frames, so that the motion sequence can be reconstructed from the target key frame, effectively improving the extraction efficiency and accuracy of key frames. Moreover, the deep learning model has strong generalization capability and can be transferred to different key frame extraction problems.
The following describes a training method of a key frame extraction model in an embodiment of the present application, where the key frame extraction model is applied to the above-described motion data processing method.
Referring to fig. 1, fig. 1 is an optional flowchart of a training method of a keyframe extraction model provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S104. It should be understood that the order of steps S101 to S104 in fig. 1 is not particularly limited, and the order of steps may be adjusted, or some steps may be reduced or added according to actual requirements.
Step S101, a training data set is acquired.
In some embodiments, the training data set includes a plurality of training states, each training state including a training sequence and an index sequence. The training sequence comprises a plurality of training data frames, and the index sequence is composed of the key frame indexes of the training data frames. The key frame index is used to indicate whether a training data frame is a key frame, which is not limiting in this embodiment.
Referring to fig. 2, the following steps S201 to S203 may be further included, but are not limited thereto, before acquiring the training data set.
Step S201, obtaining an action sequence from a preset motion database, and adjusting the frame rate of the action sequence.
The action sequence is obtained from a preset motion database, for example, the CMU Graphics Lab Motion Capture Database (CMU database). The CMU database is a large database widely used in the fields of computer animation and motion capture. It contains a large amount of real human motion capture data covering different motions such as walking, running, jumping, and dancing, where each action sequence is captured by multiple high-precision sensors to obtain accurate human posture and motion information.
It will be appreciated that, since an action sequence often contains all the data over a long period of time and therefore a large number of data frames, it is difficult to perform key frame extraction on it directly. The frame rate of the action sequence can therefore be adjusted according to actual requirements. Illustratively, the frame rate is reduced from an initial 120 frames per second to 30 frames per second, i.e., the time interval between two adjacent frames becomes approximately 0.033 seconds, which is not limited in this embodiment.
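For illustration only, the frame-rate adjustment above can be sketched as follows (a minimal Python sketch; the function name and default rates are illustrative, not part of the application):

```python
import numpy as np

def downsample(sequence, src_fps=120, dst_fps=30):
    # Keep every (src_fps // dst_fps)-th frame; 120 fps -> 30 fps keeps 1 frame in 4.
    if src_fps % dst_fps != 0:
        raise ValueError("source frame rate must be a multiple of the target")
    return sequence[::src_fps // dst_fps]

one_second = np.arange(120)       # stand-in for 120 frames of motion data
reduced = downsample(one_second)  # 30 frames, roughly 0.033 s apart
```

Any integer ratio of source to target frame rate can be handled the same way; non-integer ratios would require resampling by interpolation instead.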
Step S202, dividing the action sequence according to a preset time interval to obtain a plurality of training sequences.
In some embodiments, the action sequence is segmented into multiple training sequences, each containing the same number of training data frames and having the same duration. The preset time interval may be determined according to actual requirements and available computing power, for example, 1 second, 2 seconds, or 4 seconds, which is not limited in this embodiment.
Step S203, initializing key frame indexes of the training data frames to obtain initial values of index sequences.
In some embodiments, the key frame indicator indicates that the training data frame is a key frame when it takes a first indicator value, and indicates that the training data frame is a non-key frame when it takes a second indicator value. Illustratively, for any training sequence S = {F_n | 1 ≤ n ≤ N}, F_n is the n-th training data frame in the training sequence and N is the number of frames, determined by the frame rate and the segmentation time interval. The index sequence of the training sequence S is K = {k_n | 1 ≤ n ≤ N}, where k_n is the key frame indicator of the n-th training data frame. When k_n = 1, the n-th training data frame is a key frame; when k_n = 0, the n-th training data frame is a non-key frame. Key frames in the training sequence S can thus be selected according to the index sequence K, such that the selected key frames preserve as much of the sequence information as possible for transmission.
In some embodiments, the key frame indicators of the training data frames in each training sequence are initialized. Specifically, the key frame indicator of the first training data frame in each training sequence is set to the first indicator value, the key frame indicator of the last training data frame is also set to the first indicator value, and the key frame indicators of all other training data frames are set to the second indicator value. The initialized training sequence therefore contains only two key frames, and the corresponding initial value of the index sequence is obtained, which is not limited in this embodiment.
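The initialization above can be sketched as follows (an illustrative sketch assuming indicator values 1 and 0 for key and non-key frames, as in the example above; the function name is hypothetical):

```python
def init_indicators(n_frames):
    # First and last frames start as key frames (indicator 1); the rest are non-key (0).
    k = [0] * n_frames
    k[0] = 1
    k[-1] = 1
    return k

k0 = init_indicators(6)  # initial index sequence K_0 for a 6-frame training sequence
```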
It can be understood that each initialized training sequence and its corresponding index sequence form a training state s = {S, K}, and the plurality of training states constitute the training data set of this embodiment, which is used for training the key frame extraction model.
Step S102, inputting the training state into a key frame extraction model, selecting a key frame according to the decision value when each training data frame is used as the key frame, and updating the training state based on the key frame to obtain an updated state.
In some embodiments, the input to the key frame extraction model may be the state s_i at any round, and the output is the decision value Q(s_i, a_i) of taking each decision action a_i in that state. Specifically, the training state is input to the key frame extraction model, and the different decision actions correspond to selecting different training data frames as key frames. According to the decision value of each training data frame when used as a key frame, a training data frame meeting the condition can be selected as the key frame, and the training state is updated based on this key frame to obtain an updated state, which is not limited in this embodiment.
The process of extracting key frames can be seen as a multi-round decision process. In each round, a training data frame is selected as a new key frame based on the latest input training state, until the number of extracted key frames reaches the preset number of key frames. For the next round of key frame extraction, the updated state is input to the key frame extraction model as the new initial state.
In some embodiments, denote by s_i = {S, K_i} the state at the i-th round and by a_i the decision at the i-th round, meaning that the a_i-th frame is extracted as a key frame. The training state is updated based on this key frame to obtain the updated state: the updated index sequence is K_{i+1}, i.e., the key frame indicator of the a_i-th frame is set to the first indicator value, and the corresponding updated state is s_{i+1} = {S, K_{i+1}}, which is not limited in this embodiment.
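The state transition described above can be sketched as follows (an illustrative sketch; a state is modeled as a (training sequence, indicator sequence) pair, and the names are hypothetical):

```python
def apply_decision(state, a_i):
    # Return s_{i+1} after round i extracts frame a_i as a key frame. The
    # training sequence S is unchanged; only the indicator of frame a_i
    # flips to 1 (the first indicator value).
    S, K = state
    K_next = list(K)  # copy, so the previous state s_i stays intact
    K_next[a_i] = 1
    return (S, K_next)

s_i = (["F1", "F2", "F3", "F4"], [1, 0, 0, 1])
s_next = apply_decision(s_i, 2)  # frame 2 becomes a key frame
```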
Step S103, calculating decision rewards of the key frames according to the training state and the updating state.
Referring to fig. 3, in some embodiments of the present application, calculating a decision reward for a key frame based on training status and update status may include, but is not limited to, the following steps S301 to S303.
Step S301, calculating a training reconstruction error based on the index sequence of the training state, and calculating an updating reconstruction error based on the index sequence of the updating state.
In some embodiments, key frames of the training sequence can be extracted according to the index sequence of the training state, and the training sequence is reconstructed from these key frames; likewise, updated key frames of the training sequence are extracted according to the index sequence of the updated state, and the training sequence is reconstructed from the updated key frames. From these reconstructions, the training reconstruction error R(S, K_i) and the updated reconstruction error R(S, K_{i+1}) are obtained, which is not limited in this embodiment.
Step S302, subtracting the updated reconstruction error from the training reconstruction error to obtain a reference decision reward.
In some embodiments, the updated reconstruction error is subtracted from the training reconstruction error to yield a reference decision reward, i.e., R(S, K_i) - R(S, K_{i+1}).
Step S303, carrying out normalization operation on the reference decision rewards by utilizing the initial reconstruction errors corresponding to the initial states to obtain the decision rewards of the key frames.
In some embodiments, the initial state is obtained from the initial value of the index sequence, and the corresponding initial reconstruction error is denoted R(S, K_0). The reference decision reward is divided by the initial reconstruction error, thereby normalizing the reference decision reward and yielding the decision reward r_i of the key frame:

r_i = (R(S, K_i) - R(S, K_{i+1})) / R(S, K_0)
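The normalized decision reward above can be sketched as follows (an illustrative sketch; argument names are hypothetical):

```python
def decision_reward(err_i, err_next, err_init):
    # r_i = (R(S, K_i) - R(S, K_{i+1})) / R(S, K_0): the reconstruction-error
    # reduction brought by the new key frame, normalized by the initial error.
    return (err_i - err_next) / err_init

r = decision_reward(0.4, 0.3, 0.5)  # the new key frame removed 20% of the initial error
```

The normalization keeps rewards on a comparable scale across training sequences whose absolute reconstruction errors differ.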
step S104, taking the training state, the updating state, the key frame and the decision rewards as unit training data, calculating a loss value according to the unit training data and the decision rewards of the key frame, and updating the model parameters of the key frame extraction model according to the loss value until the trained key frame extraction model is obtained.
In some embodiments, the training state s_i, the updated state s_{i+1}, the key frame a_i, and the decision reward r_i are taken as one unit of training data, denoted (s_i, a_i, r_i, s_{i+1}). A loss value of the key frame extraction model is calculated from the unit training data and the decision reward of the key frame, and the model parameters of the key frame extraction model are updated according to the loss value using gradient descent, until the trained key frame extraction model is obtained.
Referring to fig. 4, in some embodiments of the present application, calculating a loss value from the unit training data and the decision rewards of the key frames may include, but is not limited to, the following steps S401 to S403.
Step S401, calculating expected value of unit training data according to decision rewards, rewards discount parameters and long-term rewards.
It will be appreciated that a deep reinforcement learning model requires the target decision to maximize not only the current reward but also the long-term reward, so the expected long-term reward of the best decision action, Q*(s_{i+1}, a*) = max_a Q(s_{i+1}, a), can be output by the model. In some embodiments, the expected value y_i of the unit training data is computed from the decision reward r_i, the reward discount parameter γ, and the long-term reward Q*(s_{i+1}, a*) as:

y_i = r_i + γ · Q*(s_{i+1}, a*)

where the reward discount parameter γ is a hyperparameter in the range [0, 1], used to balance the current decision reward against the long-term reward, which is not limited in this embodiment.
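The expected-value (target) computation above can be sketched as follows (an illustrative sketch; names are hypothetical, and the Q estimates for the updated state are assumed to come from the model):

```python
def td_target(r_i, gamma, q_next_values):
    # y_i = r_i + gamma * max_a Q(s_{i+1}, a); q_next_values holds the model's
    # Q estimates for every decision action available in the updated state.
    return r_i + gamma * max(q_next_values)

y = td_target(0.2, 0.9, [0.5, 1.0, 0.3])  # 0.2 now plus 0.9 * best future value
```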
Step S402, inputting the unit training data into a key frame extraction model to obtain corresponding reference values.
In some embodiments, the unit training data is input to the keyframe extraction model, and the reference value output by the keyframe extraction model can be obtained, which is not limited in this embodiment.
Step S403, calculating the loss value of the key frame extraction model according to the reference value and the expected value.
In some embodiments, the loss value of the key frame extraction model is calculated from the reference value and the expected value of the unit training data using the Huber loss function, and the model parameters are updated according to the loss value. Specifically, the Huber loss function is a loss function for regression problems that combines the characteristics of the mean squared error (MSE) and the mean absolute error (MAE): it is quadratic for small errors and linear for large ones. A person skilled in the art can set the loss function according to actual requirements, and this embodiment is not limited thereto.
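A scalar form of the Huber loss can be sketched as follows (an illustrative sketch with the conventional threshold parameter delta; in practice a framework implementation such as a built-in Huber loss would be used):

```python
def huber(pred, target, delta=1.0):
    # Quadratic (MSE-like) when |error| <= delta, linear (MAE-like) beyond it,
    # which makes training less sensitive to occasional large TD errors.
    e = abs(pred - target)
    if e <= delta:
        return 0.5 * e * e
    return delta * (e - 0.5 * delta)
```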
Referring to fig. 5, in some embodiments of the present application, selecting a key frame in step S102 according to the decision value of each training data frame when used as a key frame may include, but is not limited to, the following steps S501 to S502.
In step S501, each training data frame in the training sequence is used as a key frame, and the decision value of the training data frame is calculated.
In some embodiments, each training data frame in the training sequence corresponds to one decision, namely using that frame as a key frame. The decision value of each training data frame when used as a key frame is then calculated and denoted Q.
Step S502, taking the highest value of the decision value as a target decision value, and selecting a training data frame of the target decision value as a key frame.
In some embodiments, the highest value of the decision value is taken as the target decision value, and the training data frame corresponding to the target decision value is selected as the key frame, that is, the action with the largest Q value is selected as the decision, which is not limited in this embodiment.
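The greedy selection above can be sketched as follows (an illustrative sketch; skipping frames already marked as key frames is an assumption added for clarity, since re-selecting a key frame would not change the state):

```python
def select_key_frame(q_values, indicators):
    # Among frames not yet marked as key frames, pick the one with the
    # highest Q value as the next decision a_i.
    best_n, best_q = -1, float("-inf")
    for n, (q, k) in enumerate(zip(q_values, indicators)):
        if k == 0 and q > best_q:
            best_n, best_q = n, q
    return best_n

a_i = select_key_frame([0.2, 0.9, 0.4, 0.8], [1, 0, 0, 1])  # frame 1 has the top Q
```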
Referring to fig. 6, in some embodiments of the present application, the step of calculating the reconstruction error may include, but is not limited to, the following steps S601 to S603.
Step S601, obtaining key frames according to the index sequence, and selecting a front key frame and a rear key frame with time sequence from the key frames.
In some embodiments, each key frame indicator in the index sequence determines whether the corresponding training data frame is a key frame. Each training data frame includes a plurality of key points; in some embodiments, 23 primary joints of the human skeleton are used as the key points of each training data frame. A front key frame and a rear key frame with a temporal order are selected from the key frames, where the front key points in the front key frame correspond one-to-one to the rear key points in the rear key frame and form key point groups. The front and rear key frames in this embodiment may be adjacent or non-adjacent key frames; it is only required that the front key frame precedes the rear key frame in time, which is not limited in this embodiment.
Step S602, performing interpolation calculation on each key point group by using a preset interpolation algorithm, and reconstructing a non-key frame between a front key frame and a rear key frame to obtain a training reconstruction frame, wherein the training reconstruction frame and the key frame form a reconstruction sequence.
Any non-key frame can be reconstructed from the key frames. Specifically, for any key point in a non-key frame, the spatial positions of that key point in the front key frame and the rear key frame can be read through the key point group; the time offsets of the non-key frame relative to the two key frames are then computed, and the spatial coordinates of the key point in the non-key frame are obtained by interpolation using a preset interpolation algorithm. By performing a similar interpolation for each key point, the non-key frame located between the front key frame and the rear key frame can be reconstructed, yielding a training reconstructed frame. The training reconstructed frames and the key frames form a reconstructed sequence, denoted S' = {F'_n | 1 ≤ n ≤ N}, where F'_n is the n-th frame of the reconstructed sequence.
Step S603, calculating a reconstruction error according to the training sequence and the reconstruction sequence.
Since motion reconstruction has errors, the reconstruction error R(S, K) of the training sequence S under the index sequence K can be expressed as:

R(S, K) = (1 / (N·M)) · Σ_{n=1}^{N} Σ_{m=1}^{M} ‖P_{n,m} - P'_{n,m}‖

where P_{n,m} is the position of the m-th key point of the n-th frame in the training sequence, P'_{n,m} is the position of the m-th key point of the n-th frame in the reconstructed sequence, N is the number of frames, and M is the total number of key points in each frame.
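The interpolation-based reconstruction and the reconstruction error can be sketched together as follows (an illustrative sketch assuming linear interpolation as the preset interpolation algorithm and a mean Euclidean key-point error; function names are hypothetical):

```python
import numpy as np

def reconstruct(seq, indicators):
    # Linearly interpolate each non-key frame between its nearest key frames.
    # seq: array of shape (N, M, 3) holding M key-point positions per frame.
    keys = [n for n, k in enumerate(indicators) if k == 1]
    recon = seq.astype(float).copy()
    for a, b in zip(keys[:-1], keys[1:]):
        for n in range(a + 1, b):
            t = (n - a) / (b - a)  # relative time offset in [0, 1]
            recon[n] = (1 - t) * seq[a] + t * seq[b]
    return recon

def reconstruction_error(seq, recon):
    # R(S, K): mean Euclidean key-point error over all N frames and M key points.
    return float(np.linalg.norm(seq - recon, axis=-1).mean())

# A single key point moving along x, with a bump at frame 2 that interpolation
# between the two end-point key frames cannot recover.
seq = np.array([[[n, 0.0, 0.0]] for n in range(5)])
seq[2, 0, 1] = 1.0
err = reconstruction_error(seq, reconstruct(seq, [1, 0, 0, 0, 1]))
```

Purely linear motion is reconstructed exactly under this scheme, so its error is zero; the example's bump contributes an error of 1 at one of five frames, giving a mean error of 0.2.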
It will be appreciated that the reconstruction errors, including the training reconstruction error R(S, K_i), the updated reconstruction error R(S, K_{i+1}), and the initial reconstruction error R(S, K_0), are all calculated with the above reconstruction error formula.
Specifically, when calculating the initial reconstruction error R(S, K_0), the key frames corresponding to the initial value K_0 of the index sequence are selected, the non-key frames are reconstructed according to the temporal order of the key frames and the preset interpolation algorithm to obtain training reconstructed frames, and the training reconstructed frames and the key frames form a reconstructed sequence. The initial reconstruction error is then calculated from the training sequence and the reconstructed sequence.

When calculating the training reconstruction error R(S, K_i), the key frames corresponding to the index sequence K_i are selected, the reconstructed sequence is formed in the same way, and the training reconstruction error is calculated from the training sequence and the reconstructed sequence.

When calculating the updated reconstruction error R(S, K_{i+1}), the key frames corresponding to the index sequence K_{i+1} are selected, the reconstructed sequence is formed in the same way, and the updated reconstruction error is calculated from the training sequence and the reconstructed sequence, which is not limited in this embodiment.
In some embodiments of the present application, the unit training data is stored in a playback buffer having a preset buffer capacity. A training state is randomly selected from the training data set and input to the key frame extraction model to obtain the corresponding unit training data, which is stored in the playback buffer until the buffer capacity is reached. It can be understood that one unit of training data is obtained each time a key frame is extracted, so that, given a preset number of key frames to extract, each training state yields multiple units of training data.
A batch of unit training data is randomly sampled from the playback buffer and input to the key frame extraction model to obtain the corresponding reference values. The expected values are calculated from the decision rewards, the reward discount parameter, and the long-term rewards of the unit training data, so that the loss value of the key frame extraction model is calculated from the reference values and the expected values, and the model parameters are updated according to the loss value. If the key frame extraction model does not meet a preset convergence condition, for example, the loss value is not smaller than a preset value or has not stabilized, the above process is repeated to update the unit training data in the playback buffer and continue training the key frame extraction model, until the convergence condition is met and the trained key frame extraction model is obtained.
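The playback buffer above can be sketched as follows (an illustrative sketch; the class and method names are hypothetical):

```python
import random

class ReplayBuffer:
    # Fixed-capacity store of (s_i, a_i, r_i, s_{i+1}) transitions; once full,
    # the oldest transition is evicted so the buffer tracks recent experience.
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []

    def add(self, transition):
        if len(self.data) == self.capacity:
            self.data.pop(0)  # evict the oldest transition
        self.data.append(transition)

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between consecutive
        # decision rounds before the gradient update.
        return random.sample(self.data, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(4):
    buf.add((f"s{i}", i, 0.0, f"s{i+1}"))
batch = buf.sample(2)
```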
It can be appreciated that in deep reinforcement learning, a deeper neural network with more parameters can be used for decision-making, the set sequence length can be extended to improve efficiency, and a more comprehensive and accurate motion database can be used to ensure the training effect, which is not limited in this embodiment.
The present application is illustrated by one complete example below:
referring to fig. 7, when training of the key frame extraction model is started, an initialization operation is first performed. Specifically, the model parameters of the key frame extraction model may be randomly initialized, and the buffer capacity of the playback buffer may be initialized, for example, the buffer capacity is initialized to 10000 data. The number of key frames to be extracted, the frame rate and the duration of the training sequence in the training data set can be initialized, and the embodiment is not limited to this.
A training state, comprising a training sequence and an index sequence, is then randomly selected and input to the key frame extraction model, and the decision values (Q values) of the different training data frames when used as key frames are calculated. The training data frame corresponding to the maximum Q value is determined to be the key frame, the training state is updated based on this key frame to obtain the updated state, and the decision reward of the key frame is calculated from the training reconstruction error of the training state and the updated reconstruction error of the updated state. The training state, the updated state, the key frame, and the decision reward are stored as unit training data in the playback buffer.

If the unit training data stored in the playback buffer has not reached the buffer capacity and the number of key frames extracted from the training sequence has not reached the preset number, the updated state is input again, as the new training state, to the key frame extraction model for key frame extraction, until the number of extracted key frames reaches the preset number. If at that point the unit training data stored in the playback buffer still has not reached the buffer capacity, a training state is again randomly selected from the training data set and input to the key frame extraction model, until the buffer capacity is reached, which is not limited in this embodiment.
A batch of unit training data is then randomly sampled from the playback buffer to train the key frame extraction model. Specifically, the unit training data is input to the key frame extraction model to obtain the corresponding reference values, and the corresponding expected values are calculated from the decision rewards, the reward discount parameter, and the long-term rewards of the unit training data, so that the loss value of the model is calculated from the reference values and the expected values. Finally, the model parameters of the key frame extraction model are updated according to the loss value; if the model does not meet a preset convergence condition, for example, the loss value is not smaller than the preset value, the above process is repeated to update the unit training data in the playback buffer and train the key frame extraction model, until the trained key frame extraction model is obtained.
Referring to table 1 below, in practical experiments, a comparison was made when extracting different numbers of key frames. Compared with the two baseline methods of randomly extracting key frames (Randomly Extracting Key Frames, RC) and uniformly extracting key frames (Uniformly Extracting Key Frames, UC), the reconstruction error of the method of the present application is smaller for every number of key frames, showing that it selects key frames more relevant to the motion content, thereby improving the accuracy of key frame extraction and reducing the error.
Table 1: average error table for extracting different numbers of key frames
Method Extracting 5 key frames Extracting 10 key frames Extracting 15 key frames
RC 0.1437 0.0833 0.0549
UC 0.0945 0.0490 0.0328
The method of the application 0.0844 0.0311 0.0200
Referring to fig. 8, in this embodiment a running action sequence is selected and the visual reconstruction result of extracting 5 key frames is shown. Because the motion amplitude of the running sequence is large, the shoulder and leg motions reconstructed by the RC and UC methods are not natural and coherent enough, and their reconstruction errors are poor. In contrast, the reconstructed frames of the method of the present application (Ours) are very close to the real frames, preserving the coherence and naturalness of the motion while also excelling in reconstruction error. This shows that the method can better extract key frames and achieve more accurate motion reconstruction.
The training method and apparatus for a key frame extraction model provided in the embodiments of the present application have strong decision capability and fast decision speed. Key frames in motion data are extracted by a deep reinforcement learning method, which exploits the strong learning and generalization capabilities of artificial intelligence algorithms and gives the neural network the ability to extract key frames intelligently. Meanwhile, a dedicated reward function is built around the accuracy of motion reconstruction, so that the extracted key frames yield high accuracy during motion reconstruction. In addition, because the deep reinforcement learning algorithm does not rely on manually labeled training data to train the neural network, the transferability of the model is greatly improved, and the method has wide application scenarios.
In addition, by combining the invention with conventional motion capture equipment, the required number of key frames can be extracted quickly and accurately according to the actual needs of users; the key frames, rather than the original action sequence, are transmitted to the user, and the whole sequence is reconstructed on the user side. This greatly reduces the amount of transmitted data, lowers latency without affecting the user experience, improves quality of service, and reduces network costs.
In application scenarios such as the metaverse and virtual reality, it is important to accurately capture human motion in the real world and map it to the virtual world. As virtual environments become more immersive, human motion data is typically captured at high frequencies, and the huge amount of motion data increases transmission costs, resulting in higher latency of character motion. However, in the metaverse, character motion must meet low-latency requirements, so extremely high computing power and bandwidth would otherwise be required to guarantee its normal functioning. The key frame extraction model provided by the present application can therefore become an important tool for meeting low-latency requirements and achieving real-time motion synchronization between virtual characters and humans.
The embodiment of the invention also provides a training device for a key frame extraction model, which can realize the training method for the key frame extraction model, and referring to fig. 9, in some embodiments of the present application, the training device for the key frame extraction model includes:
an acquisition module 100 for acquiring a training data set; the training data set comprises a plurality of training states, the training states comprise a training sequence and an index sequence, the training sequence comprises a plurality of training data frames, and the index sequence is composed of key frame indexes of the training data frames;
The extraction module 200 is configured to input a training state into a key frame extraction model, select a key frame according to a decision value when each training data frame is used as a key frame, and update the training state based on the key frame to obtain an updated state;
a rewarding module 300 for calculating decision rewards of key frames according to the training state and the update state;
the training module 400 is configured to take the training state, the updated state, the key frame and the decision reward as unit training data, calculate a loss value according to the unit training data and the decision reward of the key frame, and update the model parameters of the key frame extraction model according to the loss value until a trained key frame extraction model is obtained.
The specific implementation manner of the training device for the key frame extraction model in this embodiment is substantially identical to the specific implementation manner of the training method for the key frame extraction model, and will not be described in detail herein.
Fig. 10 shows an electronic device 1000 provided in an embodiment of the present application. The electronic device 1000 includes: the processor 1001, the memory 1002, and a computer program stored on the memory 1002 and executable on the processor 1001, the computer program when executed is for performing the above-described exercise data processing method or training method of a key frame extraction model.
The processor 1001 and the memory 1002 may be connected by a bus or other means.
The memory 1002 is used as a non-transitory computer readable storage medium for storing non-transitory software programs and non-transitory computer executable programs, such as the motion data processing method or the training method of the keyframe extraction model described in the embodiments of the present application. The processor 1001 implements the above-described exercise data processing method or training method of the key frame extraction model by running a non-transitory software program and instructions stored in the memory 1002.
Memory 1002 may include a storage program area and a storage data area; the storage program area may store an operating system and at least one application program required for functionality, and the storage data area may store data used in performing the above-described motion data processing method or training method of the key frame extraction model. In addition, the memory 1002 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some implementations, the memory 1002 optionally includes memory located remotely from the processor 1001, and such remote memory can be connected to the electronic device 1000 over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The non-transitory software programs and instructions required to implement the above-described training methods for motion data processing methods or keyframe extraction models are stored in the memory 1002, and when executed by the one or more processors 1001, the above-described training methods for motion data processing methods or keyframe extraction models are performed, for example, the method steps S101 to S104 in fig. 1, the method steps S201 to S203 in fig. 2, the method steps S301 to S303 in fig. 3, the method steps S401 to S403 in fig. 4, the method steps S501 to S502 in fig. 5, and the method steps S601 to S603 in fig. 6.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the training method of the motion data processing method or the key frame extraction model when being executed by a processor. The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
According to the motion data processing method, the training method of the key frame extraction model, and the apparatus provided by the embodiments of the present application, a motion sequence comprising a plurality of data frames is first acquired, and the motion sequence is then input to a pre-trained key frame extraction model for key frame extraction, obtaining the target key frames of the motion sequence. The key frame extraction model is a deep reinforcement learning model that calculates a return value for each data frame and selects the target key frames according to these return values. Finally, the other data frames in the motion sequence are reconstructed using the target key frames to obtain the corresponding reconstructed frames, and the reconstructed frames and the target key frames together form the reconstructed sequence of the motion sequence. Since the deep reinforcement learning model extracts key frames from the motion sequence and selects the target key frames based on the return values of the data frames, the motion sequence can be reconstructed from the target key frames, effectively improving the efficiency and accuracy of key frame extraction.
The method and the device have the advantages that the key frames in the motion data are extracted by utilizing the deep reinforcement learning method, so that the strong learning capacity and generalization capacity of an artificial intelligent algorithm are ensured, the neural network has the intelligent key frame extraction technology, and meanwhile, a unique reward function is built aiming at the motion reconstruction precision, so that the extracted key frames have high precision in the motion reconstruction process. In addition, the training of the neural network is not dependent on the training data of the manual mark by using the deep reinforcement learning algorithm, so that the mobility of the model is greatly improved, and the method has wide application scenes.
The method and the device can rapidly and accurately extract the key frames with required quantity by using the deep reinforcement learning method, replace the original action sequence and transmit the key frames to the user, and reconstruct the whole sequence at the user side. This reduces the amount of data transmitted, delays, and network costs while improving quality of service. Especially in application scenes such as meta universe, virtual reality and the like, the key frame extraction algorithm can realize real-time action synchronization between the virtual roles and the human beings. This is critical to meeting low latency requirements, as the demands for computing power and bandwidth can be effectively reduced, thereby ensuring normal functioning of the meta-universe, achieving higher quality character animation, and improving the quality and immersion of game and entertainment products.
The embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of a given embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media.
It should also be appreciated that the various embodiments provided in the embodiments of the present application may be arbitrarily combined to achieve different technical effects. While the preferred embodiments of the present application have been described in detail, the present application is not limited to the above embodiments, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit and scope of the present application.

Claims (11)

1. A method of motion data processing, comprising:
acquiring a motion sequence comprising a plurality of data frames;
inputting the motion sequence into a pre-trained key frame extraction model for key frame extraction to obtain a target key frame of the motion sequence; wherein the key frame extraction model is a deep reinforcement learning model configured to calculate a return value corresponding to each data frame and to select the target key frame based on the return value;
reconstructing the other data frames in the motion sequence by using the target key frame to obtain reconstructed frames corresponding to the other data frames, and obtaining a reconstructed sequence of the motion sequence based on the reconstructed frames and the target key frame.
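As an illustrative sketch of the pipeline in claim 1 (not the disclosed implementation), the following code selects key frames from per-frame return values and linearly interpolates the remaining frames. The return values stand in for the deep reinforcement learning model's output, whose architecture the claim does not fix; keeping the first and last frames and the `num_keys` parameter are assumptions for illustration:

```python
import numpy as np

def select_key_frames(return_values, num_keys):
    """Pick the frames with the highest return values as key frames.

    The first and last frames are always kept (an assumption here) so
    that every other frame lies between two key frames.
    """
    T = len(return_values)
    ranked = np.argsort(return_values)[::-1]   # indices, best first
    keys = {0, T - 1}
    for idx in ranked:
        if len(keys) >= num_keys:
            break
        keys.add(int(idx))
    return sorted(keys)

def reconstruct(motion, key_indices):
    """Linearly interpolate non-key frames between adjacent key frames."""
    motion = np.asarray(motion, dtype=float)
    recon = motion.copy()
    for a, b in zip(key_indices[:-1], key_indices[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)              # interpolation weight in [0, 1]
            recon[t] = (1 - w) * motion[a] + w * motion[b]
    return recon

# Toy motion sequence: 5 frames, 2 key points, 2D coordinates.
motion = np.arange(5 * 2 * 2, dtype=float).reshape(5, 2, 2)
returns = np.array([0.9, 0.1, 0.8, 0.2, 0.7])  # hypothetical return values
keys = select_key_frames(returns, num_keys=3)
recon = reconstruct(motion, keys)
```

Because the toy sequence varies linearly over time, the reconstruction here is exact; real motion data would generally incur some reconstruction error.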
2. A training method of a key frame extraction model, applied to the motion data processing method of claim 1, the training method comprising:
acquiring a training data set; wherein the training data set comprises a plurality of training states, each training state comprises a training sequence and an index sequence, the training sequence comprises a plurality of training data frames, and the index sequence is composed of key frame indexes of the training data frames;
inputting the training state into a key frame extraction model, selecting a key frame according to a decision value obtained when each training data frame is taken as a candidate key frame, and updating the training state based on the selected key frame to obtain an updated state;
calculating a decision reward of the key frame according to the training state and the updated state;
taking the training state, the updated state, the key frame, and the decision reward as unit training data, calculating a loss value according to the unit training data and the decision reward of the key frame, and updating model parameters of the key frame extraction model according to the loss value until a trained key frame extraction model is obtained.
3. The training method of a key frame extraction model according to claim 2, wherein the selecting a key frame according to a decision value when each training data frame is taken as a candidate key frame comprises:
taking each training data frame in the training sequence as a candidate key frame, and calculating the decision value of that training data frame;
taking the highest decision value as a target decision value, and selecting the training data frame corresponding to the target decision value as the key frame.
4. The training method of a key frame extraction model according to claim 2, wherein the calculating a decision reward of the key frame according to the training state and the updated state comprises:
calculating a training reconstruction error based on the index sequence of the training state, and calculating an updated reconstruction error based on the index sequence of the updated state;
subtracting the updated reconstruction error from the training reconstruction error to obtain a reference decision reward;
normalizing the reference decision reward using an initial reconstruction error corresponding to an initial state to obtain the decision reward of the key frame; wherein the initial state is obtained according to an initial value of the index sequence.
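The reward of claim 4 can be illustrated with a small numeric sketch: the reward is the reconstruction-error drop achieved by the newly selected key frame, normalized by the initial state's reconstruction error. The exact normalization form is an assumption consistent with the claim wording, and the error values below are hypothetical placeholders:

```python
def decision_reward(err_before, err_after, err_initial):
    """Reward = error reduction achieved by the newly selected key frame,
    normalized by the initial state's reconstruction error."""
    return (err_before - err_after) / err_initial

# Hypothetical reconstruction errors (arbitrary units): adding the key
# frame reduced the error from 4.0 to 1.0; the initial error was 8.0.
r = decision_reward(err_before=4.0, err_after=1.0, err_initial=8.0)
```

Normalizing by the initial error keeps rewards on a comparable scale across motion sequences of different magnitudes, which stabilizes training.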
5. The method according to claim 4, wherein the training data frame includes a plurality of key points; the step of calculating the reconstruction error includes:
obtaining the key frames according to the index sequence, and selecting a front key frame and a rear key frame that are adjacent in time sequence from the key frames; wherein a plurality of front key points in the front key frame correspond one-to-one with a plurality of rear key points in the rear key frame to form key point groups;
performing interpolation calculation on each key point group using a preset interpolation algorithm to reconstruct the non-key frames between the front key frame and the rear key frame, thereby obtaining training reconstructed frames, the training reconstructed frames and the key frames forming a reconstruction sequence;
calculating a reconstruction error according to the training sequence and the reconstruction sequence; wherein the reconstruction error includes the training reconstruction error, the updated reconstruction error, or the initial reconstruction error.
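The reconstruction-error computation of claim 5 can be sketched as follows, assuming linear interpolation as the "preset interpolation algorithm" and mean Euclidean key-point distance as the error metric; both are illustrative choices the claim leaves open:

```python
import numpy as np

def reconstruction_error(sequence, key_indices):
    """Interpolate each key point between consecutive key frames and
    return the mean Euclidean distance to the original sequence."""
    seq = np.asarray(sequence, dtype=float)     # shape (T, J, D)
    recon = seq.copy()
    for a, b in zip(key_indices[:-1], key_indices[1:]):
        for t in range(a + 1, b):
            w = (t - a) / (b - a)
            # Each key-point group (front point, rear point) is
            # interpolated independently via vectorized broadcasting.
            recon[t] = (1 - w) * seq[a] + w * seq[b]
    return float(np.linalg.norm(recon - seq, axis=-1).mean())

# 4-frame toy sequence with one key point per frame in 2D.
seq = np.array([[[0.0, 0.0]], [[1.0, 0.0]], [[4.0, 0.0]], [[6.0, 0.0]]])
err = reconstruction_error(seq, key_indices=[0, 3])
```

Here frame 1 is interpolated to x = 2.0 against a true value of 1.0 and frame 2 is reconstructed exactly, so the mean per-key-point error over the four frames is 0.25.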
6. The method of training a key frame extraction model according to claim 2, wherein said calculating a loss value from said unit training data and a decision reward for said key frame comprises:
calculating an expected value of the unit training data according to the decision reward, a reward discount parameter, and a long-term reward;
inputting the unit training data into the key frame extraction model to obtain a corresponding reference value;
calculating a loss value of the key frame extraction model according to the reference value and the expected value.
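The expected value and loss of claim 6 can be read as a standard deep Q-learning temporal-difference target. The exact form below (Bellman-style target plus squared error) is an assumption consistent with the claim wording, not the patent's disclosed formula, and the numbers are hypothetical:

```python
def expected_value(reward, gamma, long_term_reward):
    """Bellman-style target: immediate decision reward plus the
    discounted long-term reward of the updated state."""
    return reward + gamma * long_term_reward

def loss_value(reference_value, target_value):
    """Squared temporal-difference error between the model's reference
    value and the expected (target) value."""
    return (reference_value - target_value) ** 2

# Hypothetical numbers for one unit of training data: decision reward
# 0.5, discount parameter 0.9, long-term reward 2.0, model output 2.0.
target = expected_value(reward=0.5, gamma=0.9, long_term_reward=2.0)
loss = loss_value(reference_value=2.0, target_value=target)
```

In practice the loss would be averaged over a minibatch of unit training data and minimized by gradient descent on the model parameters.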
7. The training method of a key frame extraction model according to claim 6, wherein the unit training data are stored in a playback buffer having a buffer capacity; the method further comprises:
randomly selecting a training state from the training data set, inputting the training state into the key frame extraction model to obtain corresponding unit training data, and storing the unit training data into the playback buffer until the buffer capacity is reached;
calculating a loss value according to the unit training data and the decision rewards of the key frames, updating the model parameters of the key frame extraction model according to the loss value, and, if the key frame extraction model does not meet a preset convergence condition, repeating the above process to update the unit training data in the playback buffer and continue training the key frame extraction model.
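The playback buffer of claim 7 behaves like a fixed-capacity store of unit training data from which batches are drawn for training. A minimal sketch using Python's `collections.deque`; the class name `ReplayBuffer` and the tuple layout are illustrative assumptions, not disclosed by the patent:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity playback buffer holding unit training data as
    (training_state, updated_state, key_frame, decision_reward) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, unit):
        # When the buffer is full, the oldest entry is evicted
        # automatically by deque's maxlen semantics.
        self.buffer.append(unit)

    def full(self):
        return len(self.buffer) == self.buffer.maxlen

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=3)
for i in range(5):                 # pushing past capacity evicts old data
    buf.push(("state", "updated", i, 0.1 * i))
```

After five pushes into a capacity-3 buffer, only the three most recent units remain, which is how the claim's "update the unit training data in the playback buffer" step can be realized.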
8. The training method of a key frame extraction model according to any one of claims 2 to 7, further comprising, prior to the acquiring a training data set:
acquiring an action sequence from a preset motion database, and adjusting the frame rate of the action sequence;
dividing the action sequence at a preset time interval to obtain a plurality of training sequences; wherein each of the training sequences comprises the same number of training data frames;
initializing the key frame indexes of the training data frames to obtain an initial value of the index sequence; wherein a key frame index equal to a first index value indicates that the corresponding training data frame is a key frame, and a key frame index equal to a second index value indicates that the corresponding training data frame is a non-key frame.
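The dataset preparation of claim 8 (equal-length division plus index initialization) can be sketched as follows. Which frames start as key frames is not stated in the claim; marking the first and last frames with the first index value is a common convention and an assumption here, as are the index values 1 and 0:

```python
def split_into_training_sequences(action_sequence, frames_per_sequence):
    """Divide an action sequence into equal-length training sequences,
    dropping any trailing remainder so all sequences match in length."""
    n = len(action_sequence) // frames_per_sequence
    return [action_sequence[i * frames_per_sequence:(i + 1) * frames_per_sequence]
            for i in range(n)]

def init_index_sequence(num_frames, first_index=1, second_index=0):
    """Initialize the key frame indexes: the first index value marks a
    key frame, the second index value marks a non-key frame."""
    idx = [second_index] * num_frames
    idx[0] = idx[-1] = first_index      # assumed endpoint convention
    return idx

frames = list(range(10))                # stand-in for real data frames
seqs = split_into_training_sequences(frames, frames_per_sequence=4)
index_seq = init_index_sequence(4)
```

A 10-frame sequence split into 4-frame training sequences yields two sequences (the 2-frame remainder is dropped), each paired with an initial index sequence such as [1, 0, 0, 1].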
9. A training device for a key frame extraction model, characterized by applying the training method of the key frame extraction model according to any one of claims 2 to 8, the device comprising:
the acquisition module is used for acquiring a training data set; the training data set comprises a plurality of training states, the training states comprise a training sequence and an index sequence, the training sequence comprises a plurality of training data frames, and the index sequence is composed of key frame indexes of the training data frames;
the extraction module is used for inputting the training state into a key frame extraction model, selecting a key frame according to a decision value obtained when each training data frame is taken as a candidate key frame, and updating the training state based on the selected key frame to obtain an updated state;
the rewarding module is used for calculating a decision reward of the key frame according to the training state and the updated state;
the training module is used for taking the training state, the updated state, the key frame, and the decision reward as unit training data, calculating a loss value according to the unit training data and the decision reward of the key frame, and updating the model parameters of the key frame extraction model according to the loss value until a trained key frame extraction model is obtained.
10. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the training method of the key frame extraction model according to any one of claims 2 to 8.
11. A computer-readable storage medium, wherein the storage medium stores a program which, when executed by a processor, implements the training method of the key frame extraction model according to any one of claims 2 to 8.
CN202311794535.6A 2023-12-25 2023-12-25 Motion data processing method, training method and device for key frame extraction model Pending CN117894068A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311794535.6A CN117894068A (en) 2023-12-25 2023-12-25 Motion data processing method, training method and device for key frame extraction model


Publications (1)

Publication Number Publication Date
CN117894068A (en) 2024-04-16

Family

ID=90648141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311794535.6A Pending CN117894068A (en) 2023-12-25 2023-12-25 Motion data processing method, training method and device for key frame extraction model

Country Status (1)

Country Link
CN (1) CN117894068A (en)

Similar Documents

Publication Publication Date Title
CN105409224B (en) The method of the smooth streaming transmission coding of control based on game editing popularity
CN110809175B (en) Video recommendation method and device
CN111369681A (en) Three-dimensional model reconstruction method, device, equipment and storage medium
CN110339569B (en) Method and device for controlling virtual role in game scene
CN106128196A (en) E-Learning system based on augmented reality and virtual reality and its implementation
CN109271933A (en) The method for carrying out 3 D human body Attitude estimation based on video flowing
CN114339409B (en) Video processing method, device, computer equipment and storage medium
CN110163938B (en) Animation control method and device, storage medium and electronic device
US9497487B1 (en) Techniques for video data encoding
JP2023545189A (en) Image processing methods, devices, and electronic equipment
CN106407932A (en) Handwritten number recognition method based on fractional calculus and generalized inverse neural network
CN116208623B (en) Information synchronization method, device, engine server and storage medium
CN111389007B (en) Game control method and device, computing equipment and storage medium
WO2024012007A1 (en) Animation data generation method and apparatus, and related product
CN117894068A (en) Motion data processing method, training method and device for key frame extraction model
CN113393544A (en) Image processing method, device, equipment and medium
CN110516153B (en) Intelligent video pushing method and device, storage medium and electronic device
CN110287912A (en) Method, apparatus and medium are determined based on the target object affective state of deep learning
CN111753855B (en) Data processing method, device, equipment and medium
CN113762648A (en) Public defense black swan event prediction method, device, equipment and medium
Zhang et al. Towards an information model of consistency maintenance in distributed interactive applications
CN115564803B (en) Animation processing method, device, equipment, storage medium and product
CN114664138B (en) Teaching resource interaction method and system based on data stream pushing
CN114885215B (en) Training method of code rate self-adaptive model, video code rate self-adaptive method and device
CN117688113A (en) Method and device for determining motion trail curve, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination