CN116468827A - Data processing method and related product

Data processing method and related product

Info

Publication number
CN116468827A
Authority
CN
China
Prior art keywords
action
target
virtual character
control strategy
control
Prior art date
Legal status
Pending
Application number
CN202210028117.5A
Other languages
Chinese (zh)
Inventor
李世迪
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210028117.5A
Publication of CN116468827A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/75 Determining position or orientation of objects or cameras using feature-based methods involving models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

The embodiments of the present application disclose a data processing method and related products. The method includes: acquiring a user instruction for a target virtual character, the user instruction including a target action type; selecting a target control strategy model belonging to the target action type from a control strategy model library, the target control strategy model being trained on target action segments belonging to the target action type; processing the state of the target virtual character in the t-th control period using the target control strategy model to obtain an action control strategy of the target virtual character for the t+1th control period, where t is a positive integer; generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period; and outputting the action animation of the target virtual character in all control periods. With the embodiments of the present application, action animation can be generated automatically and intelligently, effectively improving the efficiency and quality of character animation production.

Description

Data processing method and related product
Technical Field
The present application relates to the field of computer technology, and in particular, to a data processing method, a data processing apparatus, a computer device, a computer readable storage medium and a computer program product.
Background
With the rapid development of Internet technology, game animation increasingly pursues natural and realistic visual effects, especially the animation of virtual characters in games. However, generating animation for virtual characters in real time is a very challenging task. In the conventional approach, an animator manually creates the animation of a virtual character and stores it in an animation material library in memory; such manual creation is inefficient and the resulting quality is often poor.
Disclosure of Invention
The embodiments of the present application provide a data processing method and related products, which can automatically and intelligently generate action animation and effectively improve the efficiency and quality of character animation production.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring a user instruction for a target virtual character, wherein the user instruction includes a target action type;
selecting a target control strategy model belonging to the target action type from a control strategy model library; the target control strategy model is a control strategy model trained according to target action fragments belonging to the target action type;
processing the state of the target virtual character in the t-th control period by using the target control strategy model to obtain an action control strategy of the target virtual character for the t+1th control period, wherein the action control strategy is used to represent the action executed by the target virtual character, and t is an integer greater than or equal to 1;
generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period;
and outputting the action animation of the target virtual character in all control periods.
In one aspect, an embodiment of the present application provides a data processing apparatus, including:
the acquisition module is used for acquiring a user instruction for the target virtual character, wherein the user instruction includes a target action type;
the selection module is used for selecting a target control strategy model belonging to the target action type from the control strategy model library; the target control strategy model is a control strategy model trained according to target action fragments belonging to the target action type;
the processing module is used for processing the state of the target virtual character in the t-th control period by using the target control strategy model to obtain an action control strategy of the target virtual character for the t+1th control period, wherein the action control strategy is used to represent the action executed by the target virtual character, and t is an integer greater than or equal to 1;
the processing module is also used for generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period;
and the output module is used for outputting the action animation of the target virtual character in all control periods.
In one aspect, embodiments of the present application provide a computer device, including: a processor, a memory, and a network interface; the processor is connected with the memory and the network interface, wherein the network interface is used for providing a network communication function, the memory is used for storing program codes, and the processor is used for calling the program codes to execute the data processing method in the embodiment of the application.
Accordingly, embodiments of the present application provide a computer readable storage medium storing a computer program comprising program instructions that, when executed by a processor, perform the data processing method of embodiments of the present application.
Accordingly, embodiments of the present application provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, implement a data processing method as provided in an aspect of embodiments of the present application.
In the embodiments of the present application, by acquiring a user instruction, a corresponding target control strategy model can be automatically selected from the control strategy model library according to the target action type included in the user instruction, and the action control strategy is determined based on the target control strategy model, so that the action animation of the target virtual character in all control periods is generated. In this scheme, the target control strategy model is intelligently matched according to the target action type. Because the target control strategy model is trained on action segments of the target action type, it accurately learns and captures the action characteristics of that specific action type, so it can accurately generate action animation of the target action type and guarantee animation quality. The target control strategy model automatically processes the state of the virtual character to generate animation that meets the user's requirements, replacing manual production and effectively improving animation production efficiency. In addition, when the scheme is applied to real-time animation generation, animation materials of different action types no longer need to be stored; the animation can be generated in real time simply by invoking a trained control strategy model. In particular, for the generation of game character animation, this can effectively reduce the storage occupied by the game and the memory used at run time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a block diagram of a data processing system according to one illustrative embodiment of the present application;
FIG. 2 is a flow chart of a method of data processing according to an exemplary embodiment of the present application;
FIG. 3 illustrates a relationship between a character coordinate system and a center-of-gravity joint coordinate system of a bipedal character model according to an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of a target control strategy model processing data provided in an exemplary embodiment of the present application;
FIG. 5 is a flowchart of a target control strategy model training method according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of a target control strategy model training framework provided in an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of another control strategy model training framework provided in an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an exemplary embodiment of the present application;
fig. 9 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Detailed Description
In order to better understand the embodiments of the present application, the following description will clearly and completely describe the embodiments of the present application with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
For a better understanding of aspects of embodiments of the present application, related terms and concepts that may be related to embodiments of the present application are described below.
1. UE (Unreal Engine)
The UE is a real-time three-dimensional creation tool that can produce photorealistic visual effects and immersive experiences, and is applied in industries such as games, architecture, film and television, and simulation. The dynamics simulation function provided by the UE4 (Unreal Engine 4) editor can make generated action animation more realistic, because it simulates gravity, external collisions and the self-motion driven by the virtual character's muscles and joints strictly according to the laws of physics. The corresponding control problem is very complex: given the positions and poses that a player or game developer expects the game character to reach, it is difficult to deduce what commands should be given to the simulator, and no rule-based control strategy is available. There are many examples of training such control strategies with deep reinforcement learning, such as DReCon and DeepMimic (motion generation algorithms), which can be used to train control strategies for bipedal character skeletons.
2. Quaternion
A quaternion is a simple hypercomplex number. A complex number consists of a real number plus an imaginary unit i, where i² = -1. Similarly, a quaternion consists of a real number plus three imaginary units i, j and k, each of which can be understood geometrically as a rotation: the i rotation represents a rotation from the positive X-axis toward the positive Y-axis in the plane spanned by the X-axis and the Y-axis, the j rotation represents a rotation from the positive Z-axis toward the positive X-axis in the plane spanned by the Z-axis and the X-axis, and the k rotation represents a rotation from the positive Y-axis toward the positive Z-axis in the plane spanned by the Y-axis and the Z-axis; -i, -j and -k represent the corresponding negative rotations.
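For reference, the standard quaternion relations implied by this description can be written as follows (general mathematical background, not a formula taken from the original disclosure):

\[
q = w + x\,\mathrm{i} + y\,\mathrm{j} + z\,\mathrm{k}, \qquad \mathrm{i}^2 = \mathrm{j}^2 = \mathrm{k}^2 = \mathrm{i}\mathrm{j}\mathrm{k} = -1, \qquad p' = q\,p\,q^{-1},
\]

where a unit quaternion q encodes a 3D rotation and a point p, written as a pure quaternion, is rotated by conjugation.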
3. Kinematics
Kinematics describes only the motion of an object, i.e., the evolution of its position in space over time, without considering the factors that influence the motion, such as forces or masses.
4. Dynamics
Dynamics mainly studies the relationship between the forces acting on an object and the object's motion.
5. Deep reinforcement learning
Deep Reinforcement Learning, DRL for short, is, as the name suggests, a combination of deep learning and reinforcement learning. In the field of games, deep reinforcement learning enables an agent to explore its environment and learn strategies without labeled samples. Reinforcement learning learns a mapping from environmental states to an action space, based on a Markov Decision Process (MDP); it can mine optimal strategies through a reward-and-punishment mechanism and focuses on learning strategies for solving problems. Deep learning, in turn, learns a distributed representation of data through multi-layer networks and nonlinear transformations, focusing on the perception and expression of things.
6. Agent
An agent is an entity that can sense its environment through sensors and act on the environment through actuators. A deep reinforcement learning agent is a computing entity based on a deep reinforcement learning algorithm.
7. Artificial intelligence
Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The scheme provided by the embodiment of the application relates to computer vision technology and machine learning/deep learning in the field of artificial intelligence.
Computer vision is the science of how to make machines "see": using cameras and computers in place of human eyes to recognize, track and measure targets, and further performing graphics processing so that the result becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstration. In the embodiments of the present application, the training of the control strategy model may use reinforcement learning, specifically deep reinforcement learning, for example training with the DeepMimic algorithm (a motion generation algorithm); the resulting target control strategy model can then be used to generate action animation.
It should be understood that the specific embodiments of the present application involve data related to the virtual character, such as its state, associated states, other virtual-character-related data and user instructions. When the embodiments of the present application are applied to specific products or technologies, user permission or consent must be obtained, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
The architecture of a data processing system provided in embodiments of the present application will be described below with reference to the accompanying drawings.
Referring now to FIG. 1, FIG. 1 is a schematic diagram of a data processing system according to an exemplary embodiment of the present application. As shown in fig. 1, the system includes a data processing device 100 and a data processing device 101, which can be communicatively connected in a wired or wireless manner; the data processing device 100 can be a local database of the data processing device 101 or a cloud database accessible to the data processing device 101. The data processing device 101 may be a terminal device, which in one embodiment may be a smart phone, a tablet computer, a smart wearable device, a smart voice interaction device, a smart home appliance, a personal computer, a vehicle-mounted terminal, etc., without limitation.
In one embodiment, the database 100 includes a control strategy model library for storing control strategy models belonging to different action types. Each control strategy model is trained using target action segments of a certain action type and can generate a certain kind of action animation of a virtual character, such as walking, running, jumping, back flip, side flip, whirlwind kick, etc. These action animations all satisfy kinematic and dynamic characteristics and, when applied to virtual characters, have very realistic visual effects. In addition, the database 100 further includes an animation material library for storing action animations generated for virtual characters by the control strategy models; the action types covered by the animation material library may be equal to or greater in number than the action types to which the control strategy models belong. Optionally, the animation material library may also store action animations of virtual objects manually produced by animators.
The data processing device 101 may obtain a user instruction for a target virtual character and, according to the target action type included in the user instruction, select a target control strategy model belonging to the target action type from the control strategy model library. It then processes the state of the target virtual character in the t-th control period using the target control strategy model to obtain the action control strategy for the t+1th control period. The state is used to represent feature information of the target virtual character, for example features obtained by encoding and extracting the raw information of the current game picture using techniques such as CNNs.
The action animation and state of the target virtual character in the t+1th control period are generated according to the action control strategy; the target control strategy model then processes the state of the target virtual character in the t+1th control period to generate the action animation and state of the next control period. This process loops, so that the action animation of the target virtual character in all control periods can be generated and output (see the sketch below). Optionally, the data processing device 101 is a terminal device. In one embodiment, the terminal device may combine the action animations of all control periods into an animation sequence in order of generation time and store the animation sequence in the animation material library; the terminal device may subsequently, in response to an animation playing instruction from the user, extract the action animation of the target action type from the animation material library and play it on the terminal device. In another embodiment, the terminal device may also, upon detecting that the action animations of all control periods have been generated, play the action animation of the target virtual character directly in the graphical interface after rendering and skinning.
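The per-control-period flow described above can be summarized with the following sketch. It is an illustrative outline only; the parameter names and the simulation callback are hypothetical and do not appear in the original disclosure.

```python
def generate_action_animation(model_library, simulate_period,
                              target_action_type, initial_state, num_periods):
    # Select the target control strategy model by the target action type
    # carried in the user instruction (hypothetical dictionary lookup).
    model = model_library[target_action_type]

    state = initial_state        # state of the target virtual character in period t
    animations = []
    for _ in range(num_periods):
        # The model maps the state of period t to the action control strategy
        # (e.g. target joint angles) to be executed in period t+1.
        action_control_strategy = model(state)
        # Dynamics simulation executes the strategy, producing one control
        # period of action animation and the new state for period t+1.
        animation, state = simulate_period(state, action_control_strategy)
        animations.append(animation)
    return animations            # action animation of all control periods
```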
The data processing system can be applied to scenarios in which action animation of virtual characters is generated, such as games and film production. Through the target action type included in the user instruction, a target control strategy model of the target action type can be selected from the control strategy model library, and the action animation corresponding to the target action type can be generated from the target control strategy model; the action animation can thus be generated efficiently, and using a trained target control strategy model effectively improves the quality of the generated action animation.
The following describes in detail a specific implementation manner of the data processing method according to the embodiment of the present application with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flow chart of a data processing method according to an exemplary embodiment of the present application. The method may be performed by a computer device, such as the data processing device 101 shown in fig. 1. Wherein the data processing method includes, but is not limited to, the following steps.
First, a scenario to which the data processing scheme provided in the embodiment of the present application is applicable will be briefly described. In one embodiment, this data processing scheme may be applied to the following two animation generation scenarios, respectively:
1) Animation pre-generation scenario: this scenario is suitable for users who design action animations, such as animators and art staff. A user designing action animation initiates a user instruction by setting corresponding parameters in the client (such as the frame rate of the action animation and the target action type to which the action animation belongs). Through the preset target action type, the corresponding target control strategy model is called from the control strategy model library to generate the action animation of the target action type, and the action animation can be stored in the animation material library, so that it can be conveniently retrieved from the animation material library for display when an animation playing instruction for that action type is later received.
2) Real-time animation generation scenario: this scenario is suitable for users who use action animations, such as game players. When a user instruction initiated by the user is received, a target control strategy model belonging to the target action type included in the user instruction can be called from the control strategy model library, the action animation of the target virtual character is generated in real time according to the target control strategy model, and the action animation is displayed in the user interface after skinning, rendering and similar processing. Further details are given below.
S201, acquiring a user instruction for a target virtual character.
The target virtual character refers to the virtual character for which an action animation is to be generated, and it may be a virtual character in a game, animation, film, television or the like, for example a game character. Virtual characters include, but are not limited to, human characters, simulated robots and animal characters (for example a dinosaur or a lion), and a virtual character may be obtained by modeling and drawing a person or object from the real environment.
Based on the animation generation scenarios introduced above, when the scheme is applied to the animation pre-generation scenario, the obtained user instruction for the target virtual character can be triggered by an operation, in the client, of a user who designs action animation (such as art staff); when the scheme is applied to the real-time animation generation scenario, the obtained user instruction for the target virtual character can be triggered by an operation, in the client, of a user who uses action animation (such as a player). The client may run on the terminal device. The user instruction may be triggered, for example, through the physical keyboard or a mouse click of the terminal device, or through screen touch, gesture sensing (such as long press, single tap or repeated taps), voice sensing, body-motion sensing, and the like; the triggering manner of the user instruction is not limited here.
The acquired user instruction includes a target action type. The target action type may be set directly by the user (mainly for the animation pre-generation scenario) or selected indirectly through a mapped identifier (mainly for the scenario in which the action animation is used), so the target action type is the type of action the user expects and can be used to instruct the target virtual character to finally generate the action animation as the user intends. Action types may include running, walking, jumping, back flip, kicking and so on, each corresponding to a segment of action animation. In the process of generating action animation, the target action type tells the client which control strategy model to select from the control strategy model library as the target control strategy model, i.e., step S202 described below.
S202, selecting a target control strategy model belonging to the target action type from a control strategy model library.
The control strategy model library is used to store trained control strategy models belonging to different action types, and each control strategy model is trained using action segments of a certain action type. The action segments may be obtained through motion capture, manually designed by artists, generated by various intelligent motion generation algorithms, or obtained from the Internet; the source of the action segments is not limited. It should be noted that an action segment need not conform to the dynamics simulation characteristics and may not look realistic, but it may also conform to them, for example an action segment taken directly from a real video; this is not limited.
It can be seen that there is a mapping relationship between the control strategy models stored in the control strategy model library and the action types. Therefore, according to the target action type included in the user instruction for the target virtual character, a target control strategy model belonging to the target action type can be selected from the plurality of control strategy models included in the control strategy model library; the target control strategy model is a control strategy model trained on target action segments belonging to the target action type. The target action type here, i.e., the target action type in the user instruction, is the action type, among all action types covered by the control strategy model library, that matches the input, for example a back flip. By selecting from the control strategy model library a target control strategy model that generates action animation meeting the user's requirements, the action animation can be generated in real time or in advance according to the following steps. The specific training procedure of the target control strategy model is described in the embodiment corresponding to fig. 5 and is not detailed here. It should be noted that all control strategy models stored in the control strategy model library may be obtained using the training manner of the target control strategy model shown in fig. 5.
Each trained control strategy model thus corresponds to one action type: in the training stage, a plurality of control strategy models are trained with action segments of different action types, rather than one control strategy model being generalized over all action types. In the application stage, a control strategy model can therefore be selected directly by action type without inputting target action segments, and the action animation is obtained using the target control strategy model of the target action type, which effectively reduces the amount of data the target control strategy model has to process.
In one embodiment, the control strategy model may include a policy neural network (or policy network), which may be a fully-connected neural network (MLP) built on a deep reinforcement learning algorithm (e.g., the DeepMimic algorithm), a convolutional neural network, a residual neural network, or the like; the type of the policy neural network is not limited. The training of the control strategy model with target action segments of the target action type is described in the embodiment corresponding to fig. 5 and is not detailed here. Optionally, the control strategy model may be embedded in the program of an action simulation engine (e.g. UE4) of the client, and reinforcement learning training is performed using the physical simulator of the action simulation engine, so that a complex control strategy is learned and the trained target control strategy model can generate high-quality, dynamics-simulated animation.
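As a concrete illustration of what such a policy network could look like, the sketch below builds a small fully-connected network mapping a 287-dimensional observation (the state format described later in this embodiment) to 32 joint-angle outputs. The use of PyTorch and the layer sizes are assumptions made here for illustration; the patent does not specify them.

```python
import torch
import torch.nn as nn

class ControlStrategyNetwork(nn.Module):
    """Illustrative fully-connected policy network (architecture assumed)."""
    def __init__(self, obs_dim: int = 287, action_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),   # target joint rotation angles
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        return self.net(observation)

# Example: one observation vector in, one action control strategy out.
policy = ControlStrategyNetwork()
action = policy(torch.zeros(1, 287))   # shape: (1, 32)
```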
S203, processing the state of the target virtual character in the t control period by utilizing the target control strategy model to obtain an action control strategy of the target virtual character in the t+1th control period, wherein t is an integer greater than or equal to 1.
The t-th control period is the current control period of the target virtual character. A control period represents the time over which the simulated action of the target virtual character is controlled, for example 0.1 s. The duration of each control period may be the same or different; for example, the 1st control period may be 0.1 s, the 2nd control period 0.2 s and the 3rd control period 0.05 s, which may depend on the amount of data processed by the control strategy model. A control period may also be independent of time and merely indicate that a set of events occurs repeatedly in the same order, each repetition being referred to as one control period. The target virtual character has a corresponding state in each control period; the state represents feature information of the target virtual character and can be collected by the client in each control period. Optionally, the state may include various data describing the target virtual character, such as joint position, joint pose and joint rotation angle. It should be noted that the state of the target virtual character collected by the client in each control period is not the rawest data of the target virtual character, but is obtained by calculating and consolidating the collected raw data.
In one embodiment, the client may collect initial state information of the target virtual character in the t-th control period, where the initial state information is the rawest data of the target virtual character in the current control period. The state in the current control period is then determined from the initial state information and organized into the data format required by the target control strategy model; for example, the state of the target virtual character in the t-th control period is a multidimensional vector organized according to a preset format.
Optionally, the initial state information may include the current character state, environmental parameters, the current simulation time, and the like. The current character state reflects the state of the target virtual character in the current processing period and may include pose information of the target virtual character (hereinafter referred to as the character for convenience of description), for example the position and pose of the character in the world coordinate system, the position and pose of the character's center-of-gravity joint (for example the pelvis joint) relative to the character, and the positions of all skeletal joints participating in the physical simulation relative to the character, where a position or pose may be represented by a quaternion. The environmental parameters represent the virtual environment in which the target virtual character is located and may be environmental parameters related to deep reinforcement learning. The current simulation time refers to the time used for integration in the dynamics simulation, not the time of the world clock.
Illustratively, FIG. 3 shows the relationship between the character coordinate system of a bipedal character model in a dynamics simulation engine (e.g., UE4) and the center-of-gravity joint coordinate system. The character coordinate system is a world coordinate system in which the position of the target virtual character may be described; the center-of-gravity joint coordinate system is used to describe the motion position of the center-of-gravity joint of the bipedal character model, and the center-of-gravity joint is operated in this coordinate system to drive it to move to a desired position, where the center-of-gravity joint may be the pelvis (pelvic) joint.
Among the above data, from the poses of all skeletal joints participating in the physical simulation (i.e., the dynamics simulation) relative to the character, the rotation of each such joint relative to its parent joint, i.e., the joint rotation angle, can be obtained; the remaining state information, including velocity and angular velocity, can be obtained by differencing the joint rotation angles.
Taking the bipedal model as an example, there are 18 rotatable skeletal joints in addition to pelvis joints, each rotatable skeletal joint having a different dimensional degree of freedom, as shown in table 1 below.
Table 1 rotatable joints and degrees of freedom
The rotational degrees of freedom of each skeletal joint range from 1 to 3 dimensions, and, when rotations are converted to rotation angles, the 18 skeletal joints yield a total of 32 degrees of freedom. Therefore, after collecting the initial state information, the client can convert the 18 groups of pose quaternions representing skeletal joint rotation in the current character state into 32 angle values representing joint rotation, i.e., the joint angles.
The current state of the target virtual character (hereinafter referred to as the character state) can be completely represented by the joint rotation angles, the joint position and the joint pose. Illustratively, the joint position is the center-of-gravity joint position and the joint pose is the center-of-gravity joint pose: the center-of-gravity joint position is represented by the 3-dimensional position coordinates of the pelvis joint in the world coordinate system, and the center-of-gravity joint pose is the pose of the center-of-gravity joint relative to the character, represented by 6-dimensional forward and up vectors (e.g., coordinate system 1 and coordinate system 2 shown in fig. 3). Thus, the state of the target virtual character can include a 41-dimensional vector in total: the 32-dimensional joint rotation angles, the 3-dimensional center-of-gravity joint position and the 6-dimensional center-of-gravity joint pose.
Optionally, the state of the target virtual character in the t-th control period includes the character states of multiple control periods; each character state is characterized by the joint angles, joint position and joint pose and is a multidimensional vector. Illustratively, the character state of each control period is a 41-dimensional vector, and the state of the target virtual character in the t-th control period is set as a 41×7 = 287-dimensional vector that includes the character state of the target virtual character in the t-th control period and the character states of the target virtual character in historical control periods, such as the t-1th and t-2th control periods; the remaining 4 control-period slots can be set to 0 or other constants. That is, the character states of 7 control periods are combined together to form the state of the target virtual character in the t-th control period.
In another embodiment, the initial state information may further include an action segment of the target action type, whose relevant parameter information is similar in content to the character state of the target virtual character in each control period. For example, if the current state of the target virtual character includes a 41-dimensional vector consisting of joint rotation angles, center-of-gravity joint position and center-of-gravity joint pose, then the relevant parameter information of the target action segment may also be a 41-dimensional vector of joint rotation angles, center-of-gravity joint position and center-of-gravity joint pose. This information describes the target pose of the reference virtual character in the target action segment, which is the pose that the target virtual character tracks in the next control period or periods. To ensure the best processing effect of the target control strategy model, the relevant parameter information of the target action segment and the character state of the target virtual character can be arranged together into an observation vector, i.e., the state of the t-th control period, which is then processed by the target control strategy model. For example, taking the t-th control period as the reference, the character states of the t+1th, t+2th, t+5th and t+10th control periods are selected from the target action segment, the character states of the target virtual character in the t-1th and t-2th control periods are selected, and these are combined with the character state of the target virtual character in the t-th control period to form the observation vector. The character state of each control period consists of 32-dimensional joint angles, a 3-dimensional joint position and a 6-dimensional joint pose, i.e. 41 dimensions in total, so the character states of 7 different control periods form a 287-dimensional observation vector representing the state of the target virtual character in the t-th control period; in this case none of its dimensions are padded with 0 or other constants.
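A minimal sketch of how such a 287-dimensional observation vector could be assembled is given below. NumPy and the helper names are illustrative assumptions; the choice of control periods follows the example in the text.

```python
import numpy as np

def character_state_41d(joint_angles_32, root_position_3, root_orientation_6):
    """One control period's character state: 32 + 3 + 6 = 41 dimensions."""
    return np.concatenate([joint_angles_32, root_position_3, root_orientation_6])

def build_observation(reference_states, history_states, current_state):
    """Concatenate 7 character states (41-dim each) into a 287-dim observation.

    reference_states: target poses from the action segment at periods
                      t+1, t+2, t+5 and t+10 (4 states).
    history_states:   character states at periods t-1 and t-2 (2 states).
    current_state:    character state at period t.
    """
    states = list(reference_states) + list(history_states) + [current_state]
    assert len(states) == 7
    return np.concatenate(states)          # 7 * 41 = 287 dimensions
```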
That is, the state of the t-th control period may include the joint rotation angles, the joint position and the joint pose. Other parameters (e.g., environmental parameters, the current simulation time) may also be included in the state, but subsequent processing focuses mainly on the joint rotation angles, joint position and joint pose. Optionally, these data may be organized into the format of an observation vector and then input to the target control strategy model for processing, which outputs the action control strategy of the target virtual character for the t+1th control period. Specific ways of processing are described below.
Fig. 4 is a schematic diagram of processing data by using a target control policy model according to an embodiment of the present application, where a client uploads a state of a target virtual character to the target control policy model, and then the state is calculated by the target control policy model to obtain a control command, i.e. an action control policy.
The action control strategy is used to represent the action executed by the target virtual character and is also called the control command action; specifically, in the dynamics simulation engine it guides the dynamics-simulated target virtual character to complete the set action indicated by the control command. Taking a bipedal humanoid skeleton model as the target virtual character, the action represented by the action control strategy can be expressed as the angle value of each skeletal joint of the target virtual character, i.e., the joint angles of the t+1th control period; these joint angles can also serve as data contained in the state of the t+1th control period.
S204, according to the action control strategy of the (t+1) th control period, generating an action animation of the target virtual character in the (t+1) th control period and a state of the target virtual character in the (t+1) th control period.
The action control strategy of the t+1th control period is executed after the t-th control period ends, and the data generated by the target virtual character executing the action takes effect in the t+1th control period. That is, after the t-th control period ends, the target virtual character moves in the t+1th control period according to the action control strategy generated from the state of the t-th control period, and a corresponding action animation can be generated. As the target virtual character completes the action according to the action control strategy in the t+1th control period, its pose, position and other information change; that is, the state of the target virtual character changes, producing a new state, namely the state of the t+1th control period.
In an embodiment, the specific implementation manner of step S204 may be the following steps 1) to 3):
1) And determining the rotation moment of each target rotation object in the target virtual character according to the state of the t control period and the action control strategy of the t+1th control period.
The target virtual character includes a plurality of target rotation objects. For example, when the target virtual character is a bipedal character model, the target rotation objects include a plurality of rotatable joints, such as the 18 rotatable joints shown in Table 1. In order for the target virtual character's motion to conform to dynamics, the action control strategy may be applied to the dynamics simulation calculation of the physical simulator (i.e., the dynamics simulator) in the action simulation engine (e.g., UE4) until the end of the t-th control period. The rotation moment of each target rotation object is also referred to as torque, and may be expressed as in formula (1):
Here spring and damping are two control system coefficients, by default set to the constants 10000 and 2000, and the user can adjust them to achieve a more desirable control effect. v_x is the angular velocity of the joint rotation angle in the current control period, and q_x is the current joint rotation angle, where x denotes continuous time; the angular velocity of a joint rotation angle can be obtained by differencing joint rotation angles. Both v_x and q_x belong to the state in the t-th control period.
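Formula (1) itself is not legible in this text. Given the coefficients and quantities described above, it most plausibly takes the form of a standard proportional-derivative (PD) control law; the following reconstruction should therefore be read as an assumption rather than the patent's exact expression:

\[
\tau_x = \mathrm{spring}\cdot\bigl(q_x^{\mathrm{target}} - q_x\bigr) - \mathrm{damping}\cdot v_x
\]

where q_x^target would be the joint rotation angle indicated by the action control strategy for the t+1th control period, q_x the current joint rotation angle, and v_x its angular velocity.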
2) According to the rotation moments, each target rotation object of the target virtual character is controlled to move in the t+1th control period, so as to generate the action animation of the target virtual character in the t+1th control period.
According to the calculated torques, each rotatable joint rotates with its corresponding torque, and the rotations of the rotatable joints together constitute the movement of the target virtual character in the t+1th control period, generating the action animation; after the movement of the target virtual character ends, the resulting new state of the target virtual character is taken as the state of the target virtual character in the t+1th control period.
3) And after the movement of the target virtual character is finished, acquiring the state of the target virtual character in the t+1th control period.
In one embodiment, acquiring the state of the target virtual character in the t+1th control period includes: acquiring environment information and pose information of the target virtual character, and generating the state of the target virtual character in the t+1th control period according to the pose information and the environment information. The environment information represents the virtual environment in which the target virtual character is located; the pose information represents the position and pose of the target virtual character. The target virtual character moves according to the action animation generated by the action control strategy, for example an animation of stepping forward with the right foot while walking; when the movement ends, the corresponding pose indicated by the action control strategy has been reached. At this point the positions and poses of the rotatable target objects of the target virtual character, for example its rotatable joints, have changed, so the collected pose information is new compared with the previous period. The environment information and pose information correspond to the initial state information; for example, the pose information includes the position and pose quaternion of the character (component) in the world coordinate system, the position and pose quaternion of the character's pelvis joint relative to the character, and the pose quaternions of all skeletal joints participating in the physical simulation relative to the character. Determining the state from the initial state information collected in the t+1th control period is as described in step S203 and is not repeated here.
The state of the t+1th control period is then processed in the same way as the state of the t-th control period, and the process repeats in this way until a preset number of control periods has been processed, after which the action animations of all control periods are obtained. When the scheme is applied to the animation pre-generation scenario, the preset number may be determined by the user designing the action animation, such as an animator or game artist; when the scheme is applied to the real-time generation scenario, the preset number may be included in the user instruction, i.e., determined by the game scenario in which the target virtual character is located and the player's operations.
By applying the motion control strategy to a physical simulator of the motion simulation engine to perform dynamics simulation calculation, the rotation moment of each rotatable target object is obtained, and then the motion is performed according to the rotation moment, so that a high-quality motion animation with more vivid animation effect and satisfying dynamics characteristics can be generated.
S205, outputting the motion animation of the target virtual character in all control periods.
Each time the state of the next control period is acquired from the state of the target virtual character in the current control period, step S204 is performed again, so that the action animations of all control periods can be generated. The action animation of each control period is one frame of animation data, and the action animations of all control periods can form an action segment (or character animation) of a certain duration, for example a 3 s walking segment or a 2 s back flip. It should be noted that the action animations of all periods generated according to the action control strategies output by the target control strategy model belong to the target action type, i.e., the target action type included in the user instruction, so the generated action animation meets the user's expectations. The action animation of the target virtual character in all control periods can then be output, as described in more detail below.
Thus, through the above steps, the target control strategy model automatically generates an animation sequence of the target action type. If a user designing action animation repeats this process with different action types as input, action animations of different action types can be obtained efficiently, further improving the production efficiency of action animations of different action types and greatly reducing the workload of animators who would otherwise manually produce the action animations of target virtual characters. When the method is applied to the game field, this efficient way of generating action animations of different action types can further reduce game development work and shorten the development process.
In one embodiment, an alternative implementation of step S205 may be:
combining the action animations of all control periods into an animation sequence in ascending order of their recorded generation times, and storing the animation sequence in the animation material library, where the action type of the animation sequence belongs to the target action type; and, when an animation playing instruction containing the target action type is received, calling the animation sequence from the animation material library and playing it in response to the animation playing instruction.
A generation time is recorded for the action animation of each control period; the generation time can be a point in time under the world clock. The action animations of all control periods can be sorted by their recorded generation times to form an ordered animation sequence, which may correspond to motion in a continuous action space. The animation sequence can then be stored in the animation material library and labeled with the target action type, so that when an animation playing instruction containing the target action type is received, the animation sequence of the target action type is called from the animation material library and played in response to the instruction.
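A sketch of this store-and-play flow, assuming a simple in-memory material library keyed by action type (the class and attribute names are illustrative only and do not appear in the original disclosure):

```python
from dataclasses import dataclass, field

@dataclass
class AnimationFrame:
    generation_time: float      # world-clock time when the frame was generated
    pose_data: list             # per-joint pose data for this control period

@dataclass
class AnimationMaterialLibrary:
    sequences: dict = field(default_factory=dict)   # action type -> ordered frames

    def store(self, action_type: str, frames: list) -> None:
        # Combine frames into a sequence ordered by generation time (ascending).
        self.sequences[action_type] = sorted(frames, key=lambda f: f.generation_time)

    def play(self, action_type: str) -> list:
        # Called when an animation playing instruction containing this action
        # type is received; returns the stored sequence for playback.
        return self.sequences[action_type]
```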
The animation sequences obtained by processing different target control strategy models can be stored in an animation material library, each animation sequence corresponds to an action type, optionally, all the animation sequences in the animation material library are animation sequences generated by different target control strategy models, and when a user initiates an animation playing instruction, the user can call the animation sequences of the corresponding action types from the animation material library to play according to the action types included in the animation playing instruction.
In this way, the animation sequence corresponding to the target virtual character is automatically generated by the target control strategy model and stored in the animation material library, and when a user initiates an animation playing instruction the animation sequence of the corresponding action type is matched, so that users who need to use action animation can quickly and conveniently find the animation sequence of the target action type in the animation material library and play it.
In another embodiment, an alternative implementation of step S205 may also be:
rendering the action animations of the target virtual character in all control periods according to a set rendering frame rate, and applying a skinned mesh to the rendered action animations to obtain the action animation to be played, where the animation to be played comprises a plurality of animation frames; and playing the plurality of animation frames in the graphical interface, where each of the animation frames is displayed only after the synchronization waiting duration has elapsed, the synchronization waiting duration being determined according to the set rendering frame rate.
In one embodiment, after the motion animations of all control periods have been generated, the motion animations of the target virtual character may be rendered and then played directly in the graphical interface; in another embodiment, after the motion animation of one control period is obtained, the motion animation of that control period is rendered, skinned, and then displayed in the graphical interface once the synchronization waiting duration is reached. The latter is suitable for scenes in which the action animation is generated in real time without waiting for an animation playing instruction initiated by the user. The set rendering frame rate is the frame rate at which the animation sequence is rendered; the higher the number of frames, the more action animation content the sequence contains. The action animations of the target virtual character in all control periods are rendered respectively to obtain the rendered action animations, which are then skinned. Skinning is the process of attaching the vertices of the three-dimensional mesh to the positions of the bones. Specifically, a skinned mesh, which refers to a triangular mesh wrapped around the bones, may be applied to the rendered action animation, where one or more bones of the target virtual character control each vertex of the mesh. The action animation to be played, namely the skeletal skinned animation, is composed in such a way that the skeleton drives the skin motion, and the motion of the skeleton itself is driven by the action control strategy. Then, when the plurality of animation frames included in the action animation to be played are played on the graphical interface, a synchronization waiting duration is applied, and each animation frame is displayed on the graphical interface only after the synchronization waiting duration, so that the action animation matches the real time rate.
Taking an action animation with a frame rate of 120 frames per second as an example, the action animations of the control periods in the animation sequence are rendered respectively. The time required to compute each picture (i.e. each animation frame) in the action animation may be far less than 1/120 second, so to make the finally played pictures consistent with the real time rate, each animation frame is not played until the full 1/120 second has elapsed. This is called the time synchronization waiting mechanism; with it, the animation frames played by the graphical interface conform to the real time rate, making the animation more lifelike.
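A minimal sketch of this time synchronization waiting mechanism is given below: each frame is computed quickly and then the remaining share of the 1/frame-rate slot is spent waiting, so playback matches real time. The render_frame callback and the default frame rate are assumptions for illustration only.

```python
import time

def play_animation(frames, render_frame, rendering_frame_rate=120):
    """Play rendered animation frames at the set rendering frame rate.

    Computing a frame may take far less than 1/frame_rate seconds, so the
    remainder of each slot is spent waiting (the time synchronization waiting
    mechanism), keeping playback consistent with the real time rate."""
    frame_interval = 1.0 / rendering_frame_rate       # synchronization waiting duration
    for frame in frames:
        start = time.perf_counter()
        render_frame(frame)                           # assumed rendering callback
        elapsed = time.perf_counter() - start
        if elapsed < frame_interval:
            time.sleep(frame_interval - elapsed)      # wait out the rest of the frame slot
```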
In one embodiment, rendering, skinning and similar processing are functions of the action simulation engine used for deployment; that is, the action simulation engine used for deployment has graphics rendering and a graphical interface, so that the action animation of the target virtual character can be vividly shown to the user. The action simulation engine used for training the control strategy model (for example, UE4) is configured differently from the action simulation engine used for deploying the target control strategy model (for example, the UE4 program): in order to speed up data collection and improve training efficiency, it has no graphics rendering and no graphical interface for showing animation, and no waiting mechanism; after each training period (for example, 0.1 seconds) is computed by the dynamics simulator in the engine, the latest state is immediately packaged and sent to the server, which likewise maximizes training efficiency.
According to the scheme provided by the embodiment of the application, by obtaining the user instruction, the corresponding target control strategy model can be automatically and intelligently selected from the control strategy model library according to the target action type included in the user instruction. Because the target control strategy model is trained on action segments of one action type, it accurately learns and captures the action characteristics of that specific action type, so the target control strategy model can accurately generate the action animation of the target action type and the animation quality is guaranteed. The target control strategy model automatically processes the state of the virtual character to generate an animation meeting the user's requirements, abandoning the manual production mode and effectively improving animation production efficiency. In the process of processing data with the target control strategy model, the action animation of the target virtual character can be physically constrained by computing the rotation moments of the target virtual character, so that the action animation conforms to the corresponding physical laws and a more realistic action animation can be generated. In addition, when the scheme is applied to a scene of real-time animation generation, since animation materials of different action types do not need to be stored and the animation can be generated in real time simply by invoking the trained control strategy model, the storage occupied by the game and the memory used at runtime can be effectively reduced, especially in the generation of game character animation. Moreover, when the action animations of all periods are output, the time synchronization waiting mechanism further improves the playback effect of the action animation.
In order to better understand the content of the present solution, a detailed description is given below of the training process of the target control strategy model. Fig. 5 is a flowchart of a target control strategy model training method according to an embodiment of the present application, where the method is executed by a client, and includes but is not limited to the following steps.
S501, acquiring the association state of the target virtual character in the ith training period; wherein i is an integer greater than or equal to 1.
The training period is the time step of the control strategy model during training, for example 0.1 s, and every training period has the same length throughout the training of the control strategy model. The ith training period is the period currently being trained, and the associated state of the ith training period may include the states of one or more previous historical training periods; it can be understood that if it is the first training period, the associated state includes only the state of the target virtual character in the first training period. These associated states are used to train the action control strategy of the target virtual character, and may include data of the target virtual character itself and/or data of the reference virtual character in the target action segment.
In one embodiment, the association state of the ith training period includes: a first state of the target virtual character and a second state of the reference virtual character in the target action segment; the first state comprises a state of the target virtual character in an ith training period and a state of the target virtual character in a historical training period; the second state includes a state of the reference virtual character in a reference training period; the reference training period is determined by an ith training period and a set training period number; the state includes position information, posture information, and joint rotation angle.
Specifically, the target action segment is a given fixed animation segment belonging to a target action type, where the target action type is any one of a plurality of action types, such as a kicking action. The target action segment may be acquired through motion capture, may be manually designed by artists, or may be generated by various intelligent action generation algorithms; it may or may not conform to dynamics simulation characteristics and may or may not look lifelike, which is not limited here. The task of control strategy model training is to make the target virtual character, which is driven by dynamics simulation in the action simulation engine, track the target action segment as closely as possible, so that the action animation generated for the target virtual character under the action control strategy conforms to the dynamics characteristics.
The actions made by the reference virtual character in the target action segment are the actions tracked by the target virtual character. Thus, the reference virtual character also has a state, specifically represented by the second state described below. The first state in the associated state is the state directly associated with the target virtual character; it includes not only the state of the target virtual character in the current training period, i.e. the state of the ith training period, but also the states of historical training periods, where the historical training periods are selected based on the ith training period, for example the previous training period (the (i-1)th training period) and the training period before that (the (i-2)th training period). Of course, the selected historical training periods are only meaningful when i is greater than or equal to 3. Including the states of historical training periods in the first state enables the subsequent control strategy model to reinforce learning from historical action animations and avoid forgetting.
The second state is the state directly associated with the reference virtual character; the state of the reference virtual character in the reference training period included in the second state covers a number of future training periods determined based on the ith training period, the number of training periods being, for example, 1, 2 or 3. For example, taking the current training period as reference, the next training period (i.e. the (i+1)th training period), the training period after that (i.e. the (i+2)th training period), and so on. The position information, the posture information and the joint rotation angle are contained in the state of the target virtual character and in the state of the reference virtual character in the target action segment. With the states in the target action segment as references, the target virtual character can obtain the rotation angle of each joint in the next training period according to the states of the current training period and the historical training periods, so that tracking training can be carried out better.
In an embodiment, following the description of obtaining the state of the target virtual character in the embodiment corresponding to fig. 2, the specific contents included in the first state and the second state may also be obtained in the same manner. By way of example, the obtained first state and second state may respectively include the following contents:
1) First state
41-dimensional current (ith training period) state: p_i, o_i, q_i
41-dimensional previous-period ((i-1)th training period) state: p_{i-1}, o_{i-1}, q_{i-1}
41-dimensional period-before-previous ((i-2)th training period) state: p_{i-2}, o_{i-2}, q_{i-2}
2) Second state
41-dimensional state of the target action segment in the next period ((i+1)th training period): p^g_{i+1}, o^g_{i+1}, q^g_{i+1}
41-dimensional state of the target action segment two periods ahead ((i+2)th training period): p^g_{i+2}, o^g_{i+2}, q^g_{i+2}
41-dimensional state of the target action segment five periods ahead ((i+5)th training period): p^g_{i+5}, o^g_{i+5}, q^g_{i+5}
41-dimensional state of the target action segment ten periods ahead ((i+10)th training period): p^g_{i+10}, o^g_{i+10}, q^g_{i+10}
Where p represents a joint position (e.g., the pelvis joint position), o represents a joint pose (e.g., the pelvis joint orientation), q represents the joint rotation angles, and the 41 dimensions consist of a 3-dimensional joint position, a 6-dimensional joint pose, and 32-dimensional joint rotation angles.
All the states included in the associated state processed in the ith training period of the above example are organized into a 287-dimensional (41×7=287) vector, which may be referred to as the observation vector, observation. It should be noted that the first state and the second state above are merely exemplary, and states of other training periods may also be selected; for example, the states of the target action segment in the (i+2)th and (i+4)th training periods.
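As a sketch under the example dimensions above, the observation vector can be assembled by concatenating the 41-dimensional per-period states; the function name and the tuple layout of each state are assumptions introduced only for illustration.

```python
import numpy as np

def build_observation(first_states, second_states):
    """Concatenate 41-dimensional per-period states (3-D joint position,
    6-D joint pose, 32-D joint rotation angles) into one observation vector.

    With 3 periods in the first state and 4 reference periods in the second
    state, the result is a 7 * 41 = 287-dimensional vector."""
    parts = []
    for p, o, q in list(first_states) + list(second_states):
        parts.append(np.asarray(p, dtype=np.float32))   # 3-D joint position
        parts.append(np.asarray(o, dtype=np.float32))   # 6-D joint pose
        parts.append(np.asarray(q, dtype=np.float32))   # 32-D joint rotation angles
    observation = np.concatenate(parts)
    # holds for the example of 3 + 4 periods of 41 dimensions each
    assert observation.shape == (287,)
    return observation
```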
S502, the associated state of the ith training period is processed by using the control strategy model, so that the action control strategy of the target virtual character in the (i+1) th training period is obtained, and action rewards are determined according to the associated state of the ith training period.
In the embodiment of the application, the control strategy model is stored in the action simulation engine of the client. The control strategy model may include a strategy network trained based on a deep reinforcement learning algorithm, which may specifically be a basic neural network, such as a convolutional neural network, a residual neural network, a fully connected neural network, and so forth. The client's action simulation engine is an engine with dynamics simulation functionality, e.g. an engine with dynamics simulation of the UE4, which can be started on the client's CPU core.
After the client obtains the associated state of the target virtual character in the ith training period, the control strategy model processes the associated state of the ith training period to obtain the action control strategy to be applied in the (i+1)th training period. Meanwhile, an action reward can be determined from the associated state of the ith training period; the action reward can be used to evaluate the effect of the action control strategy applied to the target virtual character, and the learning target can be continuously adjusted through the action reward so as to generate an action control strategy that meets expectations.
In one embodiment, the step of processing the ith training period using the control strategy model may include: determining the action expectation and the action variance of the target virtual character according to the association state of the target virtual character in the ith training period and the control strategy model; determining the motion control strategy distribution of the target virtual character in the (i+1) th training period according to the motion variance and the motion expectation; and sampling the motion control strategy distribution in the (i+1) th training period to obtain the motion control strategy of the target virtual character in the (i+1) th training period.
Here, taking the control strategy model including a simple three-layer perceptron neural network as an example, the principle of the control strategy model processing is explained with the following expression. The three-layer perceptron neural network is a Multi-Layer Perceptron (MLP) comprising an input layer, a hidden layer and an output layer, and the processing of the neural network can be expressed as:
mean = W_2 · tanh(W_1 · tanh(W_0 · observation + b_0) + b_1) + b_2        (2)
Where W_0, W_1, W_2 are the parameter matrices of the layers of the control strategy model, b_0, b_1, b_2 are the bias vectors of the layers of the control strategy model, the parameter matrices and bias vectors are collectively called the strategy weights (or the weights of the strategy neural network), and observation represents the associated state of the ith training period. mean is a 32-dimensional vector and is the action expectation output by the neural network.
Then, the mean vector may be subjected to gaussian distribution sampling to obtain an action control policy action, that is, a control command, where the action of the control command obeys gaussian distribution, that is:
action ~ N(mean, std)        (3)
Where action is the target joint rotation angle, which may be denoted a^g_{i+1}; mean is the action expectation and std is the action variance. This expression represents the action control strategy distribution determined from the action expectation and the action variance. In some embodiments, the action variance may be a constant output by the control strategy model, i.e. the MLP network may output mean and std simultaneously; in other embodiments, the action variance may be delivered by the server to the client.
Optionally, with observation being the 287-dimensional vector organized from the first state and the second state of the above example, the mean vector first obtained by the control strategy model is 32-dimensional, and the action finally output is also a 32-dimensional vector.
Then, the probability of randomly sampling the action under the mean vector output by the strategy network can be obtained through the probability density function of the Gaussian distribution; this probability is called the strategy sampling probability and can subsequently be used to train the control strategy model.
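The sketch below puts expressions (2) and (3) together: the MLP forward pass producing mean, Gaussian sampling of the action, and the (log) sampling probability from the Gaussian density. Passing std as an argument and returning a log-probability are choices made for this illustration only.

```python
import numpy as np

def policy_forward(observation, W0, b0, W1, b1, W2, b2, std):
    """Forward pass of the three-layer MLP in expression (2), followed by the
    Gaussian sampling of expression (3) and the strategy sampling probability."""
    h0 = np.tanh(W0 @ observation + b0)
    h1 = np.tanh(W1 @ h0 + b1)
    mean = W2 @ h1 + b2                            # 32-dimensional action expectation
    action = np.random.normal(mean, std)           # action ~ N(mean, std)
    # log of the Gaussian probability density evaluated at the sampled action
    log_prob = -0.5 * np.sum(((action - mean) / std) ** 2
                             + np.log(2.0 * np.pi * std ** 2))
    return action, log_prob
```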
In one embodiment, the manner of determining the action rewards based on the association status of the ith training period may be: performing differential processing on the state of the target virtual character in the ith training period and the state of the reference virtual character of the target action segment in the ith training period to obtain differential information; determining a reward component corresponding to the differential information; the reward components include a position reward component, a posture reward component, and a joint rotation angle reward component; and carrying out weighted summation on the position rewarding component, the gesture rewarding component and the joint rotation angle rewarding component to obtain the action rewarding of the target virtual character.
The calculation of the action reward introduces the target action segment, and the tracking error between the postures of the target virtual character and of the target action segment can be converted into the action reward based on the target action segment. Specifically, since the state described in step S501 above includes the position information, the posture information and the joint rotation angle, a differential calculation can be performed between the state of the target virtual character and the state of the reference virtual character in the target action segment: for example, the position information of the target virtual character in the ith training period is subtracted from the position information of the target action segment in the ith training period, and the same is done for the corresponding posture information and joint rotation angle, so as to obtain the corresponding items of differential information. Optionally, matching the contents included in the state, the reward components may include a position reward component, a posture reward component and a joint rotation angle reward component; then, the respective weights are assigned to the reward components and they are summed to obtain the final action reward. The weighted summation of the reward components measures the influence of the different reward components on the action reward: the larger the weight, the larger the influence of that reward component on the action reward, so that the action reward can pay more attention to the more important reward components and strategy learning can be carried out better.
Illustratively, following an introduction to the action rewards principle and an introduction to the association state, the expression of the action rewards is designed as follows:
r_i = α_1 · r_i^q + α_2 · r_i^p + α_3 · r_i^o

Where r_i^q, r_i^p and r_i^o respectively represent the joint rotation angle reward component, the position reward component and the posture reward component, each determined from the corresponding item of differential information; α_1, α_2, α_3 are the reward weights. The smaller the differential information, the higher the action coincidence between the target virtual character and the target action segment, and the larger the action reward. Illustratively, the reward weights may be set to 0.7, 0.2 and 0.1 respectively, i.e. the joint rotation angle reward component determines the action reward to a greater extent.
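For illustration, the sketch below computes the three reward components from the differential information and sums them with the example weights. The mapping from differential magnitude to a component via exp(-·) is an assumption (the text only states that a smaller difference gives a larger reward); the dictionary keys "p", "o", "q" are likewise hypothetical names.

```python
import numpy as np

def action_reward(state_char, state_ref, alphas=(0.7, 0.2, 0.1)):
    """Action reward from the differential information between the target virtual
    character and the reference virtual character of the target action segment."""
    q_diff = np.linalg.norm(np.subtract(state_char["q"], state_ref["q"]))
    p_diff = np.linalg.norm(np.subtract(state_char["p"], state_ref["p"]))
    o_diff = np.linalg.norm(np.subtract(state_char["o"], state_ref["o"]))
    r_q = np.exp(-q_diff)   # joint rotation angle reward component (assumed form)
    r_p = np.exp(-p_diff)   # position reward component (assumed form)
    r_o = np.exp(-o_diff)   # posture reward component (assumed form)
    a1, a2, a3 = alphas     # example weights 0.7 / 0.2 / 0.1 as above
    return a1 * r_q + a2 * r_p + a3 * r_o
```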
With reference to the above example, the inputs and outputs of the control strategy model deployed in the action simulation engine of the client, together with the specific reward calculation, can be summarized as in table 2 below.
Table 2  Inputs, outputs and reward calculation of the control strategy model
During the training phase of the control strategy model, the control strategy model input/output and rewards of the client may be performed according to the design of table 2.
Before step S503 is performed, the method further includes: applying the action control strategy to the target virtual character, and determining the torque of each rotatable joint in the target virtual character. Specifically, referring to the embodiment corresponding to fig. 4, the torque is obtained from the target joint rotation angle action, the current joint rotation angle q_c of the target virtual character and the angular velocity v_c. In this way, the associated state of the target virtual character in the (i+1)th training period can be conveniently collected.
The above two steps are what the action simulation engine of the client completes within one training period (e.g. within 0.1 s) using the control strategy model; this is repeated period by period, and after each training period ends, the following step S503 is performed.
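The text only states that the torque is obtained from the target joint rotation angle action, the current joint rotation angle q_c and the angular velocity v_c; a proportional-derivative (PD) rule and the gain values below are assumptions used purely to sketch one plausible computation.

```python
import numpy as np

def joint_torques(action, q_c, v_c, kp=300.0, kd=30.0):
    """Illustrative torque computation for each rotatable joint.

    action: target joint rotation angles output by the control strategy
    q_c:    current joint rotation angles of the target virtual character
    v_c:    current joint angular velocities
    kp, kd: assumed PD gains, not specified by the embodiment"""
    action, q_c, v_c = map(np.asarray, (action, q_c, v_c))
    return kp * (action - q_c) - kd * v_c   # PD-style tracking torque (assumed)
```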
S503, combining the association state of the ith training period, the action control strategy of the (i+1) th training period, the action rewards and the association state of the (i+1) th training period into a training sample.
The client acquires the association state of the target virtual character in the ith training period, the action control strategy of the (i+1) th training period, the action rewards and the association state of the (i+1) th training period, which are output by the control strategy model, and can be combined into a training sample, and the training sample is used for training the control strategy model subsequently so as to update model parameters. The association state of the (i+1) th training period is acquired after the action control strategy of the (i+1) th training period is applied to the target virtual character.
Here the training sample is denoted train = {observation_i, a^g_{i+1}, r_i, observation_{i+1}}, whose elements correspond in turn to the associated state of the current training period, the action control strategy of the (i+1)th training period, the action reward, and the associated state of the (i+1)th training period. These data are organized into one group of training samples and then sent to the server. The data uploaded to the server may vary somewhat depending on the type of deep reinforcement learning algorithm chosen, but consists mainly of these items. Optionally, the control strategy model training may use the DeepMimic algorithm, where the strategy optimization may be performed using the Proximal Policy Optimization (PPO) algorithm.
Optionally, the training sample further includes the strategy sampling probability, which is generated when the action control strategy distribution is sampled. For example, the value obtained by sampling the action control strategy that obeys the Gaussian distribution is the target rotation angle that is passed to the action simulation engine for physical simulation; substituting this target rotation angle into the Gaussian probability density function yields the sampling probability of action. From the strategy sampling probability, the loss of the control strategy model can be calculated, so that the control strategy model can be better evaluated and adjusted.
Optionally, the training sample further includes the current simulation time, which is the time used for integration in the dynamics simulation; the order in which the data included in the states were generated can be recorded through the current simulation time.
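A minimal sketch of organizing one training sample and uploading it over a socket is given below; the host, port and the JSON wire format are assumptions for illustration, and the client does not wait for any reply before starting the next training period.

```python
import json
import socket

def send_training_sample(obs_i, action_next, reward_i, obs_next,
                         log_prob=None, sim_time=None,
                         host="127.0.0.1", port=9000):
    """Assemble train = {observation_i, a^g_{i+1}, r_i, observation_{i+1}}
    (plus the optional strategy sampling probability and current simulation
    time) and send it to the server through a socket."""
    sample = {
        "observation": list(obs_i),
        "action": list(action_next),
        "reward": float(reward_i),
        "next_observation": list(obs_next),
    }
    if log_prob is not None:
        sample["log_prob"] = float(log_prob)
    if sim_time is not None:
        sample["sim_time"] = float(sim_time)
    with socket.create_connection((host, port)) as sock:
        sock.sendall(json.dumps(sample).encode("utf-8"))
    # no blocking on a reply: the client immediately begins the next training period
```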
And S504, transmitting the training samples to a server so that the server trains the control strategy model stored in the server according to the training samples.
The client sends the training sample to the server through a Socket, and the specific training process is carried out at the server. The process by which the server trains the stored control strategy model with the training samples is briefly described here. During the data collection phase, the server receives, within each training period, the training samples sent by the clients and stores them uniformly in a database. The training process of the control strategy model is specifically a deep reinforcement learning training process, and the database storing the training samples is also called the experience replay pool. In each iteration of deep reinforcement learning, when the training samples collected in the database reach a certain number (e.g. 2^16 groups), the deep reinforcement learning agent in the server may randomly extract training samples several times (e.g. 2^6 times) to form batch training sample sets, each batch training sample set containing multiple groups of training samples (e.g. 2^10 groups); these batches are used in deep reinforcement learning training based on stochastic gradient descent to update the control strategy model. For example, the control strategy model is a strategy neural network, and the strategy neural network is trained with the Proximal Policy Optimization (PPO) algorithm of the DeepMimic approach. The updated control strategy model has new model parameters, e.g. the strategy weights W_0, W_1, W_2, b_0, b_1, b_2 exemplified in step S502; these weights will in turn be issued to the client, and step S505 will be performed by the client.
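The sketch below shows the server-side experience replay pool and batch extraction under the example thresholds above; the class name and the ppo_update callback are assumptions, and clearing the pool after an update round is likewise an assumption made only to keep the example short.

```python
import random

class ExperienceReplayPool:
    """Server-side database of training samples, as described above."""
    def __init__(self, min_size=2**16, num_batches=2**6, batch_size=2**10):
        self.samples = []
        self.min_size = min_size        # samples required before an update round
        self.num_batches = num_batches  # number of random extractions per round
        self.batch_size = batch_size    # groups of training samples per batch

    def add(self, sample):
        self.samples.append(sample)

    def maybe_update(self, ppo_update):
        # once enough samples are collected, randomly extract several batches and
        # run stochastic-gradient-based PPO updates of the control strategy model
        if len(self.samples) < self.min_size:
            return None
        new_params = None
        for _ in range(self.num_batches):
            batch = random.sample(self.samples, self.batch_size)
            new_params = ppo_update(batch)   # assumed PPO update routine
        self.samples.clear()                 # assumed: pool emptied after a round
        return new_params                    # model parameters to issue to clients
```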
In combination with the foregoing steps, the control strategy model in the client is mainly responsible for forward computation, that is, computing the control command from the associated state of the current target virtual character, while the update operations of the control strategy model, e.g. stochastic gradient descent updates of the control strategy model, are performed on the server through the GPU.
In an embodiment, the client sends the sorted training samples to the server through Socket (Socket), and at this time, the client can start dynamics simulation and data collection of the next training period by itself by using the control strategy model without waiting for the reply of the server.
Namely: acquiring states generated after the motion control strategy of the (i+1) th training period is applied to the target virtual character, and obtaining the associated state of the target virtual character in the (i+1) th training period; if the model parameters sent by the server are not received in the (i+1) th training period, the associated state of the (i+1) th training period is processed by using the control strategy model, and the action control strategy of the target virtual character in the (i+2) th training period is obtained.
After the action control strategy of the (i+1)th training period is obtained, it is applied in the (i+1)th training period to obtain the associated state of the target virtual character in the (i+1)th training period. It should be noted that this associated state is not the raw data of the target virtual character after the action control strategy is applied, but is obtained after organization and calculation; the specific process can refer to the acquisition process of the associated state of the ith training period and is not repeated here. After the ith training period, the training sample has been sent to the server, but before the associated state of the (i+1)th training period is input into the control strategy model, the data issued by the server has not yet been received and no new model parameters have been loaded into the control strategy model; therefore the control strategy model that processed the associated state of the ith training period is still used to process the associated state of the (i+1)th training period. That is, the ith and (i+1)th training periods use the same control strategy model to compute on the new associated state and obtain a new action control strategy, namely the action control strategy applied in the (i+2)th training period; in the (i+2)th training period the dynamics simulation calculation, i.e. the torque calculation, is then performed based on this action control strategy.
Before the model parameters are received, the control strategy model is not updated, but the client still carries out data collection, action control strategy calculation and so on for the target virtual character in order of training period; once the client receives the model parameters, the control strategy model that processes the associated state changes, but the client's progress in processing data is not affected. Therefore, the client can quickly obtain the action control strategy for the dynamics simulation locally and continuously simulate and compute actions without communicating with the server to obtain the action control strategy, avoiding unnecessary waiting time, so that the dynamics simulation can be carried out efficiently.
It can be understood that, because the server only extracts data for training after the training samples reach a certain number, the server issues the model parameters to the client at a low frequency, and the client does not have to wait for the server to issue model parameters before processing the next step of data. Thus the communication frequency required for data exchange between the server and the client is extremely low, which effectively reduces the communication frequency and improves the training efficiency.
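For illustration, the sketch below shows one way the client-side loop could keep simulating and collecting data with its local control strategy model, swapping in new model parameters only when the server happens to have sent them; env, policy and server are hypothetical interfaces, not APIs defined by the embodiment.

```python
def client_training_loop(env, policy, server, num_periods):
    """Non-blocking client loop: forward computation and dynamics simulation
    continue locally; model parameters are loaded only when they arrive."""
    obs = env.reset()
    for i in range(num_periods):
        action, log_prob = policy.act(obs)                 # forward computation only
        next_obs, reward = env.step(action)                # dynamics simulation
        server.send_sample(obs, action, reward, next_obs, log_prob)  # no waiting
        params = server.poll_parameters()                  # None if nothing has arrived
        if params is not None:
            policy.load_parameters(params)                 # low-frequency update
        obs = next_obs
```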
S505, receiving the model parameters of the trained control strategy model sent by the server, and updating the control strategy model according to the received model parameters to obtain a target control strategy model.
In one embodiment, the step of updating the control policy model by the client according to the received model parameters to obtain the target control policy model may include: according to the received model parameters, adjusting the model parameters of the control strategy model, wherein the adjusted control strategy model is used for generating a new training sample, and the new training sample is used for continuously adjusting the model parameters of the control strategy model; and when the adjusted control strategy model meets the model convergence condition, taking the adjusted control strategy model as a target control strategy model.
The control strategy model deployed or stored in the action simulation engine of the client can be adjusted according to the model parameters issued by the server, and the adjusted control strategy model processes the input associated state of the target virtual character and outputs the action control strategy. The input/output pairs of the control strategy model and other related data (such as one or more of the associated state of the next training period, the action reward and the strategy sampling probability) are organized into new training samples, which are sent to the server; when the collected training samples reach a certain number, the server extracts them to train the control strategy model and then sends the updated model parameters to the client, which continues to adjust the model, process data, and so on. This process is repeated until the control strategy model of the client meets the model convergence condition, and the corresponding adjusted control strategy model can then be used as the target control strategy model. The model convergence condition may be that the number of adjustments reaches a preset number, or that the action reward reaches a reward threshold, or that the change between the model parameters received in two consecutive rounds is within an error range, which is not limited here.
After receiving the model parameters sent by the server, the client reloads the model parameters into the local control strategy model, i.e. replaces the original model parameters with the model parameters sent by the server, producing a new control strategy model. The action control strategy can then be generated with the new control strategy model, the associated state of the target virtual character is collected, and training samples are organized from the associated state and sent to the server. This is repeated in a loop until a strategy performance meeting the task requirement is obtained, where the task requirement may be that the action reward reaches a reward threshold; the target control strategy model can then be obtained. Because the target control strategy model is generated with reference to the target action segment of the target action type, it can capture the action characteristics of the target action type well, and it imitates actions of that action type to generate the corresponding action animation.
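The sketch below gathers the possible model convergence conditions named above into one check; the concrete thresholds and the parameter-difference tolerance are placeholders, not values specified by the embodiment.

```python
import numpy as np

def is_converged(update_count, max_updates, reward, reward_threshold,
                 old_params=None, new_params=None, tol=1e-4):
    """Convergence conditions mentioned in the text: the number of adjustments
    reaches a preset count, the action reward reaches a reward threshold, or the
    change between consecutively received model parameters stays within an error
    range (tol). Any one condition being met counts as convergence here."""
    if update_count >= max_updates:
        return True
    if reward >= reward_threshold:
        return True
    if old_params is not None and new_params is not None:
        diff = max(float(np.max(np.abs(np.asarray(o) - np.asarray(n))))
                   for o, n in zip(old_params, new_params))
        if diff <= tol:
            return True
    return False
```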
It should be noted that, according to the training steps of steps S501-S505, a plurality of control strategy models may be trained, each belonging to one action type and, when invoked, able to generate action animations of that action type; this is caused by the differences in the action types to which the action segments used as training references belong. That is, whatever action type the action segment input during training belongs to, the trained control strategy model belongs to that action type, and in the application phase it can correspondingly generate action animations of that action type.
In summary, the control strategy model training process in the data processing scheme provided by the embodiment of the application is quite efficient: in the training stage, the control strategy model in the client only needs simple forward computation, i.e. computing the action control strategy of the next training period from the associated state of the current training period, while the gradient computation related to back-propagation training of the control strategy model, such as the stochastic policy gradient computation, is performed by the server. The client receives and stores the model parameters issued by the server and updates the local control strategy model according to them, so the time cost and space cost required for the forward computation are very low; the issuing of model parameters is also low-frequency, and even before the control strategy model of the client is updated, the corresponding data can be processed in order. The dependence of the client on the server and the frequency of data exchange are thereby greatly reduced, so that higher training efficiency can be obtained at the cost of only a small amount of memory on the client.
In combination with the above description of the training process of the control strategy model, fig. 6 provides a schematic structural diagram of a training framework of the control strategy model, including a server and N clients. The server side is deployed with a Python-based deep reinforcement learning algorithm, the same strategy neural network computation graph, i.e. the control strategy model, is built in each client, and the N clients correspond to N control strategy models. Engines with the dynamics simulation of UE4 are started in the multiple clients (e.g. N=1000), and these engines are started on different CPU cores. Currently, the UE4 editor is commonly developed in the C++ or C# language, while deep reinforcement learning algorithms are commonly developed in the Python language; data exchange between the deep reinforcement learning agent and the data on UE4 can be achieved with the structure shown in fig. 6.
The following describes the specific process flow related to the training framework:
1) After the previous training period (for example, 0.1 s) ends, each UE4 client can record the current associated state of the target virtual character, the target posture of the target action segment (i.e., the action to be tracked in the next training period), the environment parameters, the current simulation time, and so on, and immediately process them. Specifically, the data can be organized according to the format of the observation vector input to the control strategy model defined in the deep reinforcement learning algorithm, and the organized data is forward-computed and randomly sampled by the control strategy model to obtain the action control strategy action (i.e., the control command), which is immediately applied to the control of the UE4 dynamics simulation. Meanwhile, the client can compute the action reward return value required for reinforcement learning according to the current state of the target virtual character and the current state of the target action segment.
2) The client can upload the observation vector, the action vector, the latest state of the target virtual character after the action has been executed, the random strategy sampling probability, the reward return value and the current simulation time to the server, and the client can immediately start the simulation of the next training period without blocking or waiting for a reply from the server.
3) The server receives the data from any client and can store it directly in the database. Whenever the number of data groups in the database reaches a certain value (e.g. 2^16 groups), the deep reinforcement learning agent randomly extracts multiple groups of data from the database according to the principle of stochastic gradient descent to perform strategy gradient calculation, updates the control strategy model with the computed gradient through the PPO algorithm, and the control strategy model updated by the deep reinforcement learning agent is stored on the server as a backup.
4) After the policy update is completed, the model parameters (e.g., policy weights) of the control policy model at the server end are issued to each client. The client loads the updated model parameters and then proceeds to efficiently collect data for the next round of updates.
Steps 1) to 4) are executed repeatedly; after a period of reinforcement learning training, a well-performing target control strategy model is obtained, which can stably generate control commands for the target virtual character with dynamics simulation in UE4, guide the target virtual character to complete a given action, and generate high-quality animation. In this scheme, the same strategy neural network computation graph is built in each client, and after each strategy upgrade completed by the deep reinforcement learning agent, the strategy weights are issued to each client. Each client can locally use the current state of the target virtual character to compute the control command and the dynamics simulation, and then organize the local data into the format required by deep reinforcement learning and upload it to the database of the server. Data can thus be collected continuously without waiting for a reply from the server, until the strategy weights updated in the next round are received from the server.
It can be seen that by placing the forward computation of the control strategy model on the client, the dependence of the client on the server can be greatly reduced and the frequency of downloading data from the server can be reduced. The operation of upgrading the strategy with the deep reinforcement learning algorithm is carried out on the server through the GPU, and the server only needs to issue the strategy weights at low frequency. Because the space cost and time cost of the forward strategy computation are very low while the time cost of communication is very high, moving this forward computation function to the client only sacrifices a small amount of running memory on the client (for example about 10 MB) but greatly saves the time required for deep reinforcement learning data collection, and locally deploying the control strategy model on UE4 can greatly improve the running efficiency during game execution.
To better illustrate the effect of the present solution, please refer to fig. 7, which is a schematic structural diagram of another training framework of a control strategy model provided by an embodiment of the present application. As shown in fig. 7, the framework includes N clients and a server. The control strategy model is trained based on deep reinforcement learning, the Python-based deep reinforcement learning algorithm is placed at the server side, and engines with the dynamics simulation of UE4 are started at the multiple clients (e.g. N=1000). These engines are started on different CPU cores. The data processing flow based on fig. 7 is as follows:
1) Uploading state: in the time period (for example, 0.1 second) simulated by each simulator, each client organizes the information of the current state of the target virtual character, the target gesture of the target action segment (i.e., the action tracked in the next training period), the environmental parameter, the current simulation time and the like into a data packet, and sends the data packet to the server through a Socket.
2) Downloading a command: after receiving the data packet, the server arranges the data in the data packet according to the format of the control strategy model input vector defined in the deep reinforcement learning algorithm, and the arranged data is subjected to forward calculation by the control strategy model, and the obtained control command is sent back to the corresponding client.
3) Data storage: the server can compute the action reward return value required for reinforcement learning according to the current state of the target virtual character uploaded by the client and the known current state of the target action segment, and store information such as the input-output pair of the control strategy model, the random sampling probability and the action reward in the database.
4) Data extraction and strategy update: whenever the number of data groups in the database reaches a certain value (for example, 2^16 groups), the deep reinforcement learning agent randomly extracts multiple groups of data from the database according to the principle of stochastic gradient descent to perform strategy gradient calculation, and updates the strategy neural network with the computed gradient through the PPO algorithm.
Steps 1) to 4) are repeated, and after a period of reinforcement learning training a well-performing target control strategy model is obtained, which can stably generate control commands for a character with dynamics simulation in UE4 and guide the character to complete a given action, generating high-quality animation.
The above Socket-communication-based scheme for deep reinforcement learning training of the neural network provides the control strategy for the target virtual character with dynamics simulation in UE4. Since the control strategy model is stored only in the server, the client must communicate with the server once for every simulated time step (e.g. 0.1 s): it uploads related information such as the current state of the target virtual character, waits for the control strategy model of the server to perform the strategy calculation, and downloads the control command. As a result, a great deal of unnecessary communication delay and waiting exists when the deep reinforcement learning agent collects data, which wastes time and introduces a new time-synchronization problem for the clients.
Specifically, in the reinforcement learning process, the data collection process has strict sequential logic: only after the previous character state of the target virtual character (i.e. the state generated by applying the previous control command to the character) has been computed can the control command of the current step be computed, and only after the control command of the current step has been obtained can the computation of the next character state begin. That is, each step depends on the result of the previous step, so the computations cannot be parallelized. In addition, after the client uploads the computed character state of the target virtual character, UE4 has to stop the running simulation because it has not yet received a new control command; only after the server receives the state, computes the control command and issues it can the client continue the simulation. The time spent this way is extremely costly. For example, if the strategy frequency is 10 Hz, i.e. a control command needs to be obtained every 0.1 seconds, then simulating a 10 s action segment requires 100 communications with the server, and each local communication using Socket takes about 20 ms; the client therefore actually spends about 2 seconds waiting and doing nothing at all. It should be noted that when the client simulates an action segment of length 10 s without a GUI, without rendering, and only collecting data, the real running time is far less than 10 s; thus the 2 s waiting time even exceeds the time actually used for computing the dynamics simulation.
Compared with fig. 7, in the training framework shown in fig. 6 provided by the embodiment of the application, the client no longer needs to communicate with the server once for every simulated time step (0.1 s) to upload the state of the target virtual character, wait for the computation of the server's strategy network, and download the control command.
Taking the action simulation engine as UE4 as an example, with the dynamics simulation enabled in UE4 the present scheme can ensure that the control command is computed within one tick (the basic time unit) of UE4 and complete high-quality animation with dynamics simulation. By using this training framework in which the UE4 physics simulator is used for distributed deep reinforcement learning, a computation graph is built in the UE4 program of each client and the strategy network weights are stored there. Compared with fig. 7, this improvement only sacrifices a small amount of memory on the client (on the order of 10 MB) but can greatly reduce the communication frequency (by 10^2 to 10^4 times, depending on the task), with good real-time performance and high accuracy, and training efficiency is improved by reducing the network communication frequency. Using the UE4 physics simulator for reinforcement learning training helps generate complex control strategies and finally produces high-quality animation with dynamics simulation, which lightens game development work, shortens the development flow, and can reduce the storage occupied by the game and the memory used at runtime. Under the same conditions, the training mode provided by the embodiment of the application effectively shortens the required strategy training time, and the finally obtained strategy is automatically compatible with the UE4 program without additional programming, which facilitates use by game developers.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an exemplary embodiment of the present application. The data processing means may be a computer program (comprising program code) running in a computer device, for example the data processing means is an application software; the data processing device may be used to perform the corresponding steps in the methods provided by the embodiments of the present application. As shown in fig. 8, the data processing apparatus 800 may include: acquisition module 801, selection module 802, processing module 803, output module 804, and training module 805.
An obtaining module 801, configured to obtain a user instruction for a target virtual character, where the user instruction includes a target action type;
a selecting module 802, configured to select a target control policy model belonging to a target action type from a control policy model library; the target control strategy model is a control strategy model trained according to target action fragments belonging to the target action type;
the processing module 803 is configured to process, using the target control policy model, a state of the target virtual character in a t-th control period to obtain an action control policy of the target virtual character in a (t+1) -th control period, where the action control policy is used to characterize an action executed by the target virtual character, and t is an integer greater than or equal to 1;
the processing module 803 is further configured to generate an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control policy of the t+1th control period;
and the output module 804 is used for outputting the motion animation of the target virtual character in all control periods.
In one embodiment, the output module 804 is specifically configured to: combining the motion animations of all the control periods into an animation sequence according to the sequence of the recorded generation time from small to large, and storing the animation sequence into an animation material library; the action type of the animation sequence belongs to the target action type; and when an animation playing instruction containing the target action type is received, invoking an animation sequence from the animation material library to play in response to the animation playing instruction.
In another embodiment, the output module 804 is specifically further configured to: respectively rendering the action animations of the target virtual character in all control periods according to the set rendering frame rate, and applying the skin grid to the action animations after rendering to obtain action animations to be played; the animation to be played comprises a plurality of animation frames; playing a plurality of animation frames in the graphical interface, wherein each animation frame in the plurality of animation frames is displayed after reaching the synchronous waiting time length; the synchronization waiting time period is determined according to the set rendering frame rate.
In one embodiment, the target virtual character includes a plurality of target rotating objects; the processing module 803 is specifically configured to: determine the rotation moment of each target rotating object in the target virtual character according to the state of the tth control period and the action control strategy of the (t+1)th control period; control each target rotating object of the target virtual character to move in the (t+1)th control period according to the rotation moment, so as to generate the action animation of the target virtual character in the (t+1)th control period; and obtain the state of the target virtual character in the (t+1)th control period after the movement of the target virtual character ends.
In one embodiment, the processing module 803 is specifically configured to: acquiring environment information and pose information of a target virtual character; the environment information is used for representing the virtual environment where the target virtual role is located; the pose information is used for representing the position and the pose of the target virtual character; and generating the state of the target virtual character in the t+1th control period according to the pose information and the environment information.
In an embodiment, the method is performed by a client, and the data processing apparatus further comprises a training module 805 configured to: acquire the associated state of the target virtual character in the ith training period, where i is an integer greater than or equal to 1; process the associated state of the ith training period with the control strategy model to obtain the action control strategy of the target virtual character in the (i+1)th training period, and determine the action reward according to the associated state of the ith training period, the control strategy model being stored in the action simulation engine of the client; combine the associated state of the ith training period, the action control strategy of the (i+1)th training period, the action reward and the associated state of the (i+1)th training period into a training sample, the associated state of the (i+1)th training period being acquired after the action control strategy of the (i+1)th training period is applied to the target virtual character; transmit the training sample to a server so that the server trains the control strategy model stored in the server according to the training sample; and receive the model parameters of the trained control strategy model sent by the server, and update the control strategy model according to the received model parameters to obtain the target control strategy model.
In one embodiment, the association state of the ith training period includes: a first state of the target virtual character and a second state of the reference virtual character in the target action segment; the first state comprises a state of the target virtual character in an ith training period and a state of the target virtual character in a historical training period; the second state includes a state of the reference virtual character in a reference training period; the reference training period is determined by an ith training period and a set training period number; the state includes position information, posture information, and joint rotation angle.
In one embodiment, the training module 805 is further configured to: acquiring states generated after the motion control strategy of the (i+1) th training period is applied to the target virtual character, and obtaining the associated state of the target virtual character in the (i+1) th training period; if the model parameters sent by the server are not received in the (i+1) th training period, the associated state of the (i+1) th training period is processed by using the control strategy model, and the action control strategy of the target virtual character in the (i+2) th training period is obtained.
In one embodiment, the training module 805 is specifically configured to: determining the action expectation and the action variance of the target virtual character according to the association state of the target virtual character in the ith training period and the control strategy model; determining the motion control strategy distribution of the target virtual character in the (i+1) th training period according to the motion variance and the motion expectation; sampling the motion control strategy distribution in the (i+1) th training period to obtain the motion control strategy of the target virtual character in the (i+1) th training period; the training samples further comprise strategy sampling probabilities, and the sampling probabilities are generated after sampling the action control strategy distribution.
In one embodiment, the training module 805 is specifically configured to: according to the received model parameters, adjusting the model parameters of the control strategy model, wherein the adjusted control strategy model is used for generating a new training sample, and the new training sample is used for continuously adjusting the model parameters of the control strategy model; and when the adjusted control strategy model meets the model convergence condition, taking the adjusted control strategy model as a target control strategy model.
In one embodiment, the training module 805 is specifically configured to: performing differential processing on the state of the target virtual character in the ith training period and the state of the reference virtual character of the target action segment in the ith training period to obtain differential information; determining reward components corresponding to the differential information; the reward components include a position reward component, a posture reward component, and a joint rotation angle reward component; and carrying out weighted summation on the position reward component, the posture reward component, and the joint rotation angle reward component to obtain the action reward of the target virtual character.
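A simplified sketch of this reward computation is given below; the exponential mapping of each squared difference to a bounded component and the default weights are assumptions, while the three reward components and the weighted sum follow the description above.

import numpy as np

def action_reward(char_state, ref_state, w_pos=0.3, w_pose=0.3, w_joint=0.4):
    """Weighted-sum action reward between the controlled and reference character.

    char_state and ref_state are dicts with 'position', 'posture' and
    'joint_angles' arrays for the same training period.
    """
    def component(a, b):
        diff = np.sum((np.asarray(a) - np.asarray(b)) ** 2)   # differential information
        return np.exp(-diff)                                   # bounded reward component

    r_pos = component(char_state["position"], ref_state["position"])           # position reward
    r_pose = component(char_state["posture"], ref_state["posture"])            # posture reward
    r_joint = component(char_state["joint_angles"], ref_state["joint_angles"]) # joint angle reward

    return w_pos * r_pos + w_pose * r_pose + w_joint * r_joint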
It may be understood that the functions of each functional module of the data processing apparatus described in the embodiments of the present application may be specifically implemented according to the method in the foregoing method embodiments, and the specific implementation process may refer to the relevant description of the foregoing method embodiments, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 900 may be a stand-alone device (for example, one or more of a server, a node, a terminal, and the like), or may be a component (for example, a chip, a software module, or a hardware module) inside a stand-alone device. The computer device 900 may include at least one processor 901 and a communication interface 902; further optionally, the computer device 900 may also include at least one memory 903 and a bus 904, wherein the processor 901, the communication interface 902 and the memory 903 are coupled through the bus 904.
The processor 901 is a module for performing arithmetic and/or logic operations, and may specifically be one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor unit (MPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a complex programmable logic device (CPLD), a coprocessor (which assists the central processing unit in completing corresponding processing and applications), a microcontroller unit (MCU), and other processing modules.
The communication interface 902 may be used to provide information input or output for the at least one processor, and/or may be used to receive data sent from the outside and/or send data to the outside. It may be a wired link interface (including, for example, an Ethernet cable) or a wireless link interface (for example, Wi-Fi, Bluetooth, general-purpose wireless transmission, vehicle-mounted short-range communication, or other short-range wireless communication technologies).
The memory 903 is used to provide storage space, in which data such as an operating system and a computer program may be stored. The memory 903 may be one or more of a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable read-only memory (CD-ROM), and the like.
The at least one processor 901 of the computer device 900 is configured to invoke the computer program stored in the at least one memory 903 to perform the aforementioned data processing method, such as the data processing method described in the embodiments shown in fig. 2 and fig. 5.
In a possible implementation, the processor 901 in the computer device 900 is configured to invoke a computer program stored in the at least one memory 903 for performing the following operations: acquiring a user instruction for a target virtual character through the communication interface, wherein the user instruction comprises a target action type; selecting a target control strategy model belonging to the target action type from a control strategy model library; the target control strategy model is a control strategy model trained according to target action segments belonging to the target action type; processing the state of the target virtual character in the t control period by utilizing the target control strategy model to obtain an action control strategy of the target virtual character in the t+1th control period, wherein the action control strategy is used for representing the action executed by the target virtual character, and t is an integer greater than or equal to 1; generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period; and outputting the motion animation of the target virtual character in all control periods.
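For illustration, the overall control loop run by the processor could be sketched as follows; model_library, sim_engine and their methods are hypothetical names rather than interfaces defined by this application.

def generate_action_animation(user_instruction, model_library, sim_engine, num_periods):
    """Inference-time control loop (a sketch; all names are assumptions).

    model_library maps action types to trained target control strategy models;
    sim_engine.step applies a control strategy for one control period and
    returns that period's animation together with the character's new state.
    """
    target_model = model_library[user_instruction["action_type"]]  # select by target action type
    state = sim_engine.get_state()                                  # state in control period t
    animations = []
    for _ in range(num_periods):
        strategy = target_model.predict(state)     # action control strategy for period t+1
        frame, state = sim_engine.step(strategy)   # animation and state of period t+1
        animations.append(frame)
    return animations                               # action animations of all control periods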
In one embodiment, the processor 901 is specifically configured to: combining the motion animations of all the control periods into an animation sequence in ascending order of the recorded generation time, and storing the animation sequence into an animation material library; the action type of the animation sequence belongs to the target action type; and when an animation playing instruction containing the target action type is received, invoking the animation sequence from the animation material library to play in response to the animation playing instruction.
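A minimal sketch of assembling and storing the animation sequence, assuming each action animation records its generation time and the animation material library is a plain dictionary, might be:

def store_animation_sequence(animations, material_library, action_type):
    """Assemble and store an animation sequence (assumed structures, for illustration)."""
    # Sort in ascending order of the recorded generation time.
    sequence = sorted(animations, key=lambda a: a["generation_time"])
    # Key the sequence by its action type in the material library.
    material_library.setdefault(action_type, []).append(sequence)
    return sequence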
In another embodiment, the processor 901 is further specifically configured to: respectively rendering the action animations of the target virtual character in all control periods according to the set rendering frame rate, and applying a skinned mesh to the rendered action animations to obtain an action animation to be played; the action animation to be played comprises a plurality of animation frames; playing the plurality of animation frames in the graphical interface, wherein each animation frame in the plurality of animation frames is displayed after the synchronization waiting duration is reached; the synchronization waiting duration is determined according to the set rendering frame rate.
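One straightforward reading of the synchronization waiting duration is the reciprocal of the set rendering frame rate; a frame-pacing sketch under that assumption is shown below, with display standing in for whatever presentation call the graphical interface provides.

import time

def play_frames(frames, render_frame_rate, display):
    """Play rendered animation frames with a per-frame synchronization wait."""
    sync_wait = 1.0 / render_frame_rate          # synchronization waiting duration
    for frame in frames:
        start = time.monotonic()
        display(frame)                           # show the current animation frame
        elapsed = time.monotonic() - start
        if elapsed < sync_wait:
            time.sleep(sync_wait - elapsed)      # hold until the waiting duration is reached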
In one embodiment, the target virtual character includes a plurality of target rotating objects; the processor 901 is specifically configured to: determining the rotation moment of each target rotating object in the target virtual character according to the state of the t control period and the action control strategy of the t+1th control period; according to the rotation moment, controlling each target rotating object of the target virtual character to move in the (t+1) th control period so as to generate an action animation of the target virtual character in the (t+1) th control period; and after the movement of the target virtual character is finished, acquiring the state of the target virtual character in the t+1th control period.
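The text does not state how the rotation moment is computed from the period-t state and the period-t+1 control strategy; a proportional-derivative (PD) controller driving each joint toward target joint angles, a common choice in physics-based character animation, is sketched below purely as an assumption.

import numpy as np

def joint_torques(current_angles, angular_velocities, target_angles, kp=300.0, kd=30.0):
    """Rotation moments for the target rotating objects (joints) of the character."""
    current_angles = np.asarray(current_angles)
    angular_velocities = np.asarray(angular_velocities)
    target_angles = np.asarray(target_angles)
    # Spring toward the target angle, damped by the current angular velocity.
    return kp * (target_angles - current_angles) - kd * angular_velocities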
In one embodiment, the processor 901 is specifically configured to: acquiring environment information and pose information of the target virtual character; the environment information is used for representing the virtual environment where the target virtual character is located; the pose information is used for representing the position and the posture of the target virtual character; and generating the state of the target virtual character in the t+1th control period according to the pose information and the environment information.
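A simple way to combine the pose information and the environment information into the state of the next control period is to concatenate flat feature vectors, as sketched below; the flat-vector representation itself is an assumption.

import numpy as np

def build_state(pose_info, environment_info):
    """Compose the character state used for the next control period."""
    pose_info = np.asarray(pose_info, dtype=np.float32)                 # position and posture features
    environment_info = np.asarray(environment_info, dtype=np.float32)   # surrounding virtual environment features
    return np.concatenate([pose_info, environment_info])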
In one embodiment, the data processing method is performed by a client, and the processor 901 is further configured to: acquiring the association state of the target virtual character in the ith training period; wherein i is an integer greater than or equal to 1; processing the association state of the ith training period by using a control strategy model to obtain an action control strategy of the target virtual character in the (i+1) th training period, and determining action rewards according to the association state of the ith training period; the control strategy model is stored in an action simulation engine of the client; combining the association state of the ith training period, the action control strategy of the ith training period, the action rewards and the association state of the (i+1) th training period into a training sample; the association state of the (i+1) th training period is acquired after the action control strategy of the (i+1) th training period is applied to the target virtual character; transmitting the training samples to a server so that the server trains a control strategy model stored in the server according to the training samples; and receiving model parameters of the trained control strategy model sent by the server, and updating the control strategy model according to the received model parameters to obtain a target control strategy model.
In one embodiment, the association state of the ith training period includes: a first state of the target virtual character and a second state of the reference virtual character in the target action segment; the first state comprises a state of the target virtual character in an ith training period and a state of the target virtual character in a historical training period; the second state includes a state of the reference virtual character in a reference training period; the reference training period is determined by an ith training period and a set training period number; the state includes position information, posture information, and joint rotation angle.
In an embodiment, the processor 901 is further configured to: acquiring states generated after the motion control strategy of the (i+1) th training period is applied to the target virtual character, and obtaining the associated state of the target virtual character in the (i+1) th training period; if the model parameters sent by the server are not received in the (i+1) th training period, the associated state of the (i+1) th training period is processed by using the control strategy model, and the action control strategy of the target virtual character in the (i+2) th training period is obtained.
In one embodiment, the processor 901 is specifically configured to: determining the action expectation and the action variance of the target virtual character according to the association state of the target virtual character in the ith training period and the control strategy model; determining the motion control strategy distribution of the target virtual character in the (i+1) th training period according to the motion variance and the motion expectation; sampling the motion control strategy distribution in the (i+1) th training period to obtain the motion control strategy of the target virtual character in the (i+1) th training period; the training samples further comprise strategy sampling probabilities, and the sampling probabilities are generated after sampling the action control strategy distribution.
In one embodiment, the processor 901 is specifically configured to: according to the received model parameters, adjusting the model parameters of the control strategy model, wherein the adjusted control strategy model is used for generating a new training sample, and the new training sample is used for continuously adjusting the model parameters of the control strategy model; and when the adjusted control strategy model meets the model convergence condition, taking the adjusted control strategy model as a target control strategy model.
In one embodiment, the processor 901 is specifically configured to: performing differential processing on the state of the target virtual character in the ith training period and the state of the reference virtual character of the target action segment in the ith training period to obtain differential information; determining reward components corresponding to the differential information; the reward components include a position reward component, a posture reward component, and a joint rotation angle reward component; and carrying out weighted summation on the position reward component, the posture reward component, and the joint rotation angle reward component to obtain the action reward of the target virtual character.
It should be understood that the computer device 900 described in the embodiments of the present application may perform the description of the data processing method in the foregoing embodiments corresponding to fig. 2 or fig. 5, and may also perform the description of the data processing apparatus 800 in the foregoing embodiments, which is not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
In addition, it should be noted that an exemplary embodiment of the present application further provides a storage medium, where a computer program of the foregoing data processing method is stored, and the computer program includes program instructions. When one or more processors load and execute the program instructions, the description of the data processing method in the foregoing embodiments can be implemented, which is not repeated herein; the description of the beneficial effects of the same method is likewise omitted. It will be appreciated that the program instructions may be executed on one or more computer devices that are capable of communicating with each other.
The computer-readable storage medium may be an internal storage unit of the data processing apparatus provided in any one of the foregoing embodiments or of the foregoing computer device, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method provided by the foregoing embodiment.
The steps in the method of the embodiment of the application can be sequentially adjusted, combined and deleted according to actual needs.
The modules in the device of the embodiment of the application can be combined, divided and deleted according to actual needs.
The above disclosure is merely an example of the present invention and is not intended to limit the scope of the present invention. Those skilled in the art will understand that all or part of the processes for implementing the above embodiments, as well as equivalent modifications made according to the claims of the present invention, still fall within the scope of the invention.

Claims (15)

1. A method of data processing, the method comprising:
acquiring a user instruction for a target virtual character, wherein the user instruction comprises a target action type;
selecting a target control strategy model belonging to the target action type from a control strategy model library; the target control strategy model is a control strategy model trained according to target action segments belonging to the target action type;
processing the state of the target virtual character in the t control period by utilizing the target control strategy model to obtain an action control strategy of the target virtual character in the t+1th control period, wherein the action control strategy is used for representing the action executed by the target virtual character, and t is an integer greater than or equal to 1;
generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period;
and outputting the motion animation of the target virtual character in all control periods.
2. The method of claim 1, wherein the outputting of the motion animation of the target virtual character in all control periods comprises:
combining the motion animations of all the control periods into an animation sequence in ascending order of the recorded generation time, and storing the animation sequence into an animation material library; the action type of the animation sequence belongs to the target action type;
and when receiving an animation playing instruction containing the target action type, calling the animation sequence from the animation material library to play in response to the animation playing instruction.
3. The method of claim 1, wherein the outputting of the motion animation of the target virtual character in all control periods comprises:
respectively rendering the action animations of the target virtual character in all control periods according to the set rendering frame rate, and applying a skinned mesh to the rendered action animations to obtain an action animation to be played; the action animation to be played comprises a plurality of animation frames;
playing the plurality of animation frames in a graphical interface, wherein each animation frame in the plurality of animation frames is displayed after reaching a synchronous waiting time period; the synchronization waiting duration is determined according to the set rendering frame rate.
4. A method according to any one of claims 1 to 3, wherein the target virtual character comprises a plurality of target rotating objects;
the generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period comprises:
determining the rotation moment of each target rotating object in the target virtual character according to the state of the t control period and the action control strategy of the t+1th control period;
according to the rotation moment, controlling each target rotating object of the target virtual character to move in the (t+1) th control period so as to generate an action animation of the target virtual character in the (t+1) th control period;
and after the movement of the target virtual character is finished, acquiring the state of the target virtual character in the (t+1) th control period.
5. The method of claim 4, wherein the acquiring the state of the target virtual character in the t+1th control period comprises:
acquiring environment information and pose information of the target virtual character; the environment information is used for representing the virtual environment where the target virtual character is located; the pose information is used for representing the position and the posture of the target virtual character;
and generating the state of the target virtual character in the t+1th control period according to the pose information and the environment information.
6. The method of claim 1, wherein the method is performed by a client, the method further comprising:
acquiring the association state of the target virtual character in the ith training period; wherein i is an integer greater than or equal to 1;
processing the association state of the ith training period by using a control strategy model to obtain an action control strategy of the target virtual character in the (i+1) th training period, and determining action rewards according to the association state of the ith training period; the control strategy model is stored in an action simulation engine of the client;
combining the association state of the ith training period, the action control strategy of the ith training period, the action rewards and the association state of the (i+1) th training period into a training sample; the association state of the (i+1) th training period is acquired after the action control strategy of the (i+1) th training period is applied to the target virtual character;
transmitting the training sample to a server so that the server trains the control strategy model stored in the server according to the training sample;
and receiving model parameters of the trained control strategy model sent by the server, and updating the control strategy model according to the received model parameters to obtain the target control strategy model.
7. The method of claim 6, wherein the association state of the ith training period comprises: the first state of the target virtual character and the second state of the reference virtual character in the target action segment;
the first state comprises a state of the target virtual character in an ith training period and a state of the target virtual character in a historical training period;
the second state comprises a state of the reference virtual character in a reference training period; the reference training period is determined by the ith training period and a set training period number;
the state includes position information, posture information, and a joint rotation angle.
8. The method of claim 6 or 7, wherein the method further comprises:
acquiring states generated after the motion control strategy of the (i+1) th training period is applied to the target virtual character, and obtaining the associated state of the target virtual character in the (i+1) th training period;
if the model parameters sent by the server are not received in the (i+1) th training period, the control strategy model is utilized to process the association state of the (i+1) th training period, and the action control strategy of the target virtual character in the (i+2) th training period is obtained.
9. The method of claim 6, wherein the processing the associated state of the ith training period using the control strategy model to obtain the motion control strategy of the target virtual character in the (i+1) th training period comprises:
determining the action expectation and the action variance of the target virtual character according to the association state of the target virtual character in the ith training period and the control strategy model;
determining the motion control strategy distribution of the target virtual character in the (i+1) th training period according to the motion variance and the motion expectation;
sampling the motion control strategy distribution in the (i+1) th training period to obtain the motion control strategy of the target virtual character in the (i+1) th training period;
the training sample further comprises a strategy sampling probability, and the sampling probability is generated after sampling the action control strategy distribution.
10. The method of claim 6, wherein updating the control strategy model based on the received model parameters to obtain the target control strategy model comprises:
adjusting the model parameters of the control strategy model according to the received model parameters, wherein the adjusted control strategy model is used for generating a new training sample, and the new training sample is used for continuously adjusting the model parameters of the control strategy model;
and when the adjusted control strategy model meets the model convergence condition, taking the adjusted control strategy model as the target control strategy model.
11. The method of claim 7, wherein the determining action rewards according to the association state of the ith training period comprises:
performing differential processing on the state of the target virtual character in the ith training period and the state of the reference virtual character of the target action segment in the ith training period to obtain differential information;
determining reward components corresponding to the differential information; the reward components comprise a position reward component, a posture reward component and a joint rotation angle reward component;
and carrying out weighted summation on the position reward component, the posture reward component and the joint rotation angle reward component to obtain the action reward of the target virtual character.
12. A data processing apparatus, comprising:
an acquisition module, used for acquiring a user instruction for a target virtual character, wherein the user instruction comprises a target action type;
the selecting module is used for selecting a target control strategy model belonging to the target action type from a control strategy model library; the target control strategy model is a control strategy model trained according to target action fragments belonging to the target action type;
the processing module is used for processing the state of the target virtual character in the t control period by utilizing the target control strategy model to obtain an action control strategy of the target virtual character in the t+1th control period, wherein the action control strategy is used for representing actions executed by the target virtual character, and t is an integer greater than or equal to 1;
the processing module is further used for generating an action animation of the target virtual character in the t+1th control period and a state of the target virtual character in the t+1th control period according to the action control strategy of the t+1th control period;
and the output module is used for outputting the motion animation of the target virtual character in all control periods.
13. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide network communication functions, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the data processing method of any of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, performs the data processing method of any one of claims 1 to 11.
15. A computer program product, characterized in that the computer program product comprises a computer program or computer instructions which, when executed by a processor, implement the steps of the data processing method according to any of claims 1 to 11.
CN202210028117.5A 2022-01-11 2022-01-11 Data processing method and related product Pending CN116468827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210028117.5A CN116468827A (en) 2022-01-11 2022-01-11 Data processing method and related product


Publications (1)

Publication Number Publication Date
CN116468827A true CN116468827A (en) 2023-07-21

Family

ID=87179373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210028117.5A Pending CN116468827A (en) 2022-01-11 2022-01-11 Data processing method and related product

Country Status (1)

Country Link
CN (1) CN116468827A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
REG: Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40088008)