CN117414580A - Training and role control method and device of policy neural network and electronic equipment - Google Patents

Training and role control method and device of policy neural network and electronic equipment

Info

Publication number
CN117414580A
Authority
CN
China
Prior art keywords
neural network
strategy
training
policy
state vector
Prior art date
Legal status
Pending
Application number
CN202210804233.1A
Other languages
Chinese (zh)
Inventor
李世迪
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210804233.1A
Publication of CN117414580A
Legal status: Pending


Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/57 Simulating properties, behaviour or motion of objects in the game world, e.g. computing tyre load in a car race game
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The application provides a training and character-control method and device for a policy neural network, and an electronic device. The method includes: in a simulation environment for virtual character motion control in a virtual scene, each client acquires training samples continuously generated by the interaction between its policy neural network and the simulation environment, and periodically transmits the training sample set of the target generation period among the training samples to a server, so that the server continuously trains a backup policy network based on the periodically transmitted training sample sets and returns the policy parameters of the trained backup policy network to each client after each round of training; and, in response to receiving the policy parameters of the trained backup policy network returned by the server, each client updates its policy neural network based on those parameters and continues to generate training samples with the updated policy neural network.

Description

Training and role control method and device of policy neural network and electronic equipment
Technical Field
The present disclosure relates to artificial intelligence technologies, and in particular, to a method and apparatus for training and role control of a policy neural network, and an electronic device.
Background
Virtual characters in games typically exhibit diverse and flexible actions, such as running and jumping. In the related art, these actions are captured from professional motion actors to build an animation material library, and at runtime the closest animation is matched from the library in real time according to the character state and player instructions and then played. However, the memory footprint and computation required by this approach grow linearly with the number of animation materials.
In addition, the related art also trains a policy neural network based on deep reinforcement learning so as to control the game character with the learned control policy. However, although data can be collected in parallel across different clients, the clients and the server run alternately within each iteration, so they spend time waiting for each other, resources are not fully utilized, and the control policy model is trained inefficiently, which may cause the animation to stutter when the game is deployed online.
Disclosure of Invention
The embodiments of the present application provide a training method for a policy neural network based on a distributed architecture, a character control method, an apparatus, a device, a computer-readable storage medium, and a computer program product, which can improve the training efficiency of the control policy network and achieve accurate control of virtual character actions while making full use of computing resources.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a training method of a strategy neural network based on a distributed architecture, wherein the distributed architecture comprises at least one client and a server, the strategy neural network is arranged on each client, and a backup strategy network corresponding to the strategy neural network is arranged on the server; the method comprises the following steps:
in a simulation environment for controlling the movement of a virtual character in a virtual scene on each client, acquiring training samples continuously generated by the interaction between each policy neural network and the simulation environment, and periodically sending a training sample set of a target generation period among the training samples to the server, so that
the server continuously trains the backup policy network based on the periodically sent training sample sets and returns the trained policy parameters of the backup policy network to each client after each round of training of the backup policy network;
in response to receiving the trained policy parameters of the backup policy network returned by the server, updating the policy neural network based on the trained policy parameters of the backup policy network to continuously generate the training samples based on the updated policy neural network;
The strategy neural network is used for outputting a target action instruction aiming at the virtual character so as to control the virtual character to execute a target action based on the target action instruction.
The embodiment of the application provides a control method of a virtual character, which comprises the following steps:
acquiring a third state vector corresponding to a virtual character in a virtual scene at a third time and a control instruction triggered by the virtual character;
determining a fourth state vector which corresponds to the virtual character at a fourth moment based on a control instruction vector which corresponds to the control instruction and the third state vector;
based on the third state vector and the fourth state vector, invoking a strategy neural network to perform action prediction processing to obtain a target prediction action instruction, and controlling the virtual character to execute corresponding actions based on the target prediction action instruction;
the strategy neural network is obtained by training the training method of the strategy neural network based on the distributed architecture.
The embodiment of the application provides a training device of a strategy neural network based on a distributed architecture, wherein the distributed architecture comprises at least one client and a server, the strategy neural network is arranged on each client, and a backup strategy network corresponding to the strategy neural network is arranged on the server; the device comprises:
The first processing module is used for acquiring training samples generated continuously by interaction between each strategy neural network and the simulation environment in the simulation environment of virtual character motion control in the virtual scene of each client, periodically transmitting a training sample set of a target generation period in the training samples to a server, enabling the server to continuously train the backup strategy network based on the periodically transmitted training sample set, and returning strategy parameters of the trained backup strategy network to each client after training the backup strategy network in each round;
the second processing module is used for responding to the received strategy parameters of the trained backup strategy network returned by the server, updating the strategy neural network based on the trained strategy parameters of the backup strategy network, and continuously generating the training sample based on the updated strategy neural network;
the strategy neural network is used for outputting a target action instruction aiming at the virtual character so as to control the virtual character to execute a target action based on the target action instruction.
The embodiment of the application provides a virtual character control device, which comprises:
the acquisition module is used for acquiring a third state vector corresponding to a virtual character in a virtual scene at a third moment and a control instruction triggered by the virtual character;
the determining module is used for determining a fourth state vector which is needed to be corresponding to the virtual character at a fourth moment based on the control instruction vector corresponding to the control instruction and the third state vector;
the prediction module is used for calling a strategy neural network to perform action prediction processing based on the third state vector and the fourth state vector so as to obtain a target prediction action instruction;
the strategy neural network is obtained by training the training method of the strategy neural network based on the distributed architecture, which is provided by the embodiment of the application;
and the control module is used for controlling the virtual character to execute corresponding actions based on the target prediction action instruction.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the training method of the strategy neural network based on the distributed architecture or the control method of the virtual role, which are provided by the embodiment of the application, when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores executable instructions for causing a processor to execute, so as to implement the training method of the policy neural network or the control method of the virtual character based on the distributed architecture.
The embodiment of the application provides a computer program product, which comprises a computer program or an instruction, wherein the computer program or the instruction realizes the training method of the strategy neural network based on the distributed architecture or the control method of the virtual role when being executed by a processor.
The embodiment of the application has the following beneficial effects:
when the policy neural network is trained, each client continuously generates training samples and periodically sends part of them (a training sample set) to the server. The server continuously trains the backup policy network of the policy neural network based on the received training sample sets and returns the policy parameters of the backup policy network after each round of training to each client, so that the client can update its policy neural network with those parameters and continue generating training samples with the updated network. Sample collection and network training are repeated in this way, so that every client and the server stay fully loaded, and an accurate policy neural network can be trained quickly. After training is completed, the client in each terminal device can output a target action instruction for the virtual character based on the trained policy neural network, so as to control the virtual character to execute the target action, achieving precise control of the virtual character in the virtual scene and producing smooth, realistic, high-quality animation.
Drawings
FIG. 1 is a schematic diagram of a distributed deep reinforcement learning architecture according to an embodiment of the present application;
fig. 2 is a communication schematic diagram of a training strategy neural network according to an embodiment of the present application;
fig. 3 is a schematic architecture diagram of a training system 100 based on a policy neural network of a distributed architecture according to an embodiment of the present application;
fig. 4A is a schematic structural diagram of an electronic device 500 according to an embodiment of the present application;
fig. 4B is a schematic structural diagram of a terminal device 400 according to an embodiment of the present application;
fig. 5 is a flow chart of a training method of a neural network model according to an embodiment of the present application;
fig. 6 is a flow chart of a method for controlling a virtual character according to an embodiment of the present application;
fig. 7 is a communication schematic diagram of a training strategy neural network according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a bipedal character model in the Unreal Engine 4 provided in an embodiment of the application;
fig. 9 is a schematic diagram of a policy neural network according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by persons of ordinary skill in the art without making creative efforts are within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the term "first/second …" is merely to distinguish similar objects and does not represent a particular ordering for objects, it being understood that the "first/second …" may be interchanged with a particular order or precedence where allowed to enable embodiments of the present application described herein to be implemented in other than those illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Policy neural network: a complex network system formed by a large number of simple processing units (called neurons) that are widely interconnected. It reflects many basic characteristics of human brain function and is a highly complex nonlinear dynamical learning system. Neural network models offer massive parallelism, distributed storage and processing, and self-organizing, self-adaptive and self-learning capabilities, making them particularly suitable for imprecise and fuzzy information-processing problems in which many factors and conditions must be considered simultaneously.
2) Virtual scene: is the scene that the application displays (or provides) when running on the terminal device. The virtual scene may be a simulation environment for the real world, a semi-simulation and semi-fictional virtual environment, or a pure fictional virtual environment. The virtual scene may be any one of a two-dimensional virtual scene, a 2.5-dimensional virtual scene or a three-dimensional virtual scene, and the dimensions of the virtual scene are not limited in the embodiment of the present application. For example, a virtual scene may include sky, land, sea, etc., the land may include environmental elements of a desert, city, etc., and a user may control a virtual character to move in the virtual scene.
3) Virtual characters: the avatars of various people and objects in the virtual scene that can be interacted with, i.e., movable objects in the virtual scene. A movable object may be a virtual person, a virtual animal, a cartoon character, etc., displayed in the virtual scene. A virtual character may also be an avatar representing a user in the virtual scene. A virtual scene may contain multiple virtual characters, each with its own shape and volume that occupies part of the space in the virtual scene.
4) Queue: a restricted linear list that allows deletions only at the front and insertions only at the rear; like a stack, a queue is an operation-limited linear list. Because insertions are allowed only at one end and deletions only at the other, the sample that entered the queue first is the first that can be removed, so a queue is also known as a first-in-first-out (FIFO) linear list.
5) Unreal Engine: a game engine is the core component of some editable computer game systems or interactive real-time graphics applications. These systems provide game designers with the tools needed to write games, allowing them to build game programs easily and quickly without starting from scratch. In short, a game engine is a set of code frameworks written in advance, which game developers use to implement games rapidly.
Referring to fig. 1, fig. 1 is a schematic diagram of a distributed deep reinforcement learning architecture provided in an embodiment of the present application. The architecture includes a plurality of terminal devices and a server. Each terminal device runs a client (e.g., a UE4 client with a dynamics simulation engine) and a policy neural network; the server holds a backup policy network corresponding to the policy neural network (identical in structure to the policy neural network in the client), and the deep reinforcement learning algorithm that trains the policy neural network is deployed on the server. The clients (client 1, client 2, …, client N, where N is a positive integer greater than 2, e.g., N = 1000) each start a UE4 client with a dynamics simulation engine, and the clients are started on cores of different central processing units (CPU, Central Processing Unit). In each iteration of deep reinforcement learning, the server issues the policy parameters of the policy neural network to each client; the local policy neural network on each client interacts with the simulation environment, collects the current state of the virtual character and the corresponding action data as training samples, and uploads them to the server; the server then trains the backup policy network on a graphics processing unit (GPU, Graphics Processing Unit). Afterwards, the server issues the updated policy parameters of the backup policy network to each client and starts the next iteration; training of the policy neural network is achieved by repeating this process.
The inventor finds that, as shown in fig. 2 (a communication schematic diagram of training the policy neural network provided in an embodiment of the present application), the related art performs policy gradient estimation (i.e., training of the backup policy network) on the server with Proximal Policy Optimization (PPO), an on-policy deep reinforcement learning algorithm which requires that only training samples collected with the current policy neural network be used to train the backup policy network of that same policy. Because this scheme is used in the UE4 client to compute the controller policy of the virtual character through simulation, all sample data of a collected trajectory segment (roughly 80-400 steps) are uploaded to the server synchronously, and only after a certain total number of training samples (for example, 800 samples) has been collected does the server start training the backup policy network on the GPU. During that time the CPU on each client stops running and waits for the GPU's results; conversely, while each iteration is in the phase where clients collect training samples on their CPUs, the GPU on the server has no other task to execute and sits idle. Thus, although the related art can collect data in parallel across different CPUs (clients), the CPU and GPU run alternately within each iteration, they wait for each other, resources are underutilized, and the control policy model is trained inefficiently.
In view of this, the embodiments of the present application provide a training method of a policy neural network based on a distributed architecture, a method and apparatus for controlling a virtual character, an electronic device, a computer readable storage medium, and a computer program product, which can train out an accurate policy neural network at a relatively high speed, thereby implementing accurate control on actions of the virtual character. An exemplary application of the electronic device provided by the embodiment of the present application is described below, where the electronic device provided by the embodiment of the present application may be implemented as a server, or may be implemented cooperatively by the server and the terminal device. The following describes an example of a control method for cooperatively implementing the virtual roles provided in the embodiments of the present application by using a server and a terminal device.
For example, referring to fig. 3, fig. 3 is a schematic architecture diagram of a training system 100 based on a policy neural network with a distributed architecture according to an embodiment of the present application, for supporting an application capable of implementing precise control over actions of virtual characters in a resource intensive manner, as shown in fig. 3, the training system 100 includes: the server 200, the network 300 and the terminal devices (the terminal devices 400-1, …, the terminal device 400-N), wherein the terminal device is provided with a client, the client 410 can be various application programs supporting virtual scenes, can provide a simulation environment for the motion control of virtual characters in the virtual scenes, and can be any one of a First person shooting game (FPS, first-Person Shooting game), a third person shooting game, a virtual reality application program, a three-dimensional map program or a multi-person gunfight survival game; in addition, each client is provided with a strategy neural network, and the server is provided with a backup strategy network corresponding to the strategy neural network. The terminal device is connected to the server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both.
In some embodiments, in the simulation environment for controlling the movement of the virtual character in the virtual scene, each terminal device obtains the training samples continuously generated by the interaction between its policy neural network and the simulation environment, and periodically sends the training sample set of the target generation period among the training samples to the server 200. The server 200 continuously trains the backup policy network based on the training sample sets periodically sent by the terminal devices, and returns the policy parameters of the trained backup policy network to the client in each terminal device after each round of training of the backup policy network. In response to receiving the policy parameters of the trained backup policy network returned by the server, the client in each terminal device updates its policy neural network based on those parameters and continues to generate training samples with the updated policy neural network. Repeating the training in this way keeps every client and the server fully loaded, so an accurate policy neural network can be trained quickly. After the policy neural network has been trained, the client in each terminal device can output target action instructions for the virtual character based on the trained policy neural network, controlling the virtual character to execute the target actions, achieving accurate control of the virtual character in the virtual scene, producing smooth and realistic high-quality animation, and improving the user's visual experience.
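As an illustration of the server side of this loop only (not an implementation from the patent; the queue-based message passing and the methods train_one_round and parameters are assumptions), the following Python sketch shows how the server could keep training on whatever sample sets have arrived and broadcast fresh parameters after every round:

```python
import queue

def server_training_loop(backup_policy, sample_inbox: queue.Queue,
                         param_outboxes, rounds: int):
    """Continuously train the backup policy on incoming sample sets and
    broadcast the updated parameters to every client after each round.

    sample_inbox: queue filled by client connections with training-sample sets.
    param_outboxes: one queue per client, used to return policy parameters.
    """
    for _ in range(rounds):
        batch = [sample_inbox.get()]            # block until at least one sample set arrives
        while not sample_inbox.empty():         # drain everything sent since the last round
            batch.append(sample_inbox.get_nowait())

        backup_policy.train_one_round(batch)    # assumed method: one round of GPU training

        params = backup_policy.parameters()     # assumed method
        for outbox in param_outboxes:
            outbox.put(params)                  # return trained parameters to each client
```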
In other embodiments, embodiments of the present application may also be implemented by means of Cloud Technology (Cloud Technology), which refers to a hosting Technology that integrates a series of resources such as hardware, software, networks, etc. together in a wide area network or a local area network, to implement calculation, storage, processing, and sharing of data. The cloud technology is a generic term of network technology, information technology, integration technology, management platform technology, application technology and the like based on cloud computing business model application, can form a resource pool, and is flexible and convenient as required. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources.
By way of example, the server 200 shown in fig. 3 may be an independent physical server, a server cluster or a distributed architecture formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, content Delivery Network), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal device 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a car terminal, etc. The terminal device and the server 200 may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
Referring to fig. 4A, fig. 4A is a schematic structural diagram of an electronic device 500 provided in an embodiment of the present application, in an actual application, the electronic device 500 may be the terminal or the server 200 in fig. 3, taking the electronic device as an example of the terminal shown in fig. 3, and the electronic device 500 shown in fig. 4A includes: at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540. It is appreciated that bus system 540 is used to facilitate connected communications between these components. The bus system 540 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled as bus system 540 in fig. 4A.
The processor 510 may be an integrated circuit chip with signal processing capabilities such as a general purpose processor, such as a microprocessor or any conventional processor, or the like, a digital signal processor (DSP, digital Signal Processor), or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The user interface 530 includes one or more output devices 531 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 530 also includes one or more input devices 532, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 550 may optionally include one or more storage devices physically located remote from processor 510.
Memory 550 includes volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (RAM, random Access Memory). The memory 550 described in embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 550 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks; network communication module 552 is used to reach other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 include: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (USB, universal Serial Bus), etc.; a presentation module 553 for enabling presentation of information (e.g., a user interface for operating a peripheral device and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530; an input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.
In some embodiments, the training device for a policy neural network based on a distributed architecture provided in the embodiments of the present application may be implemented in a software manner, and fig. 4A shows a training device 555 for a policy neural network based on a distributed architecture stored in a memory 550, which may be software in the form of a program and a plug-in, and includes the following software modules: the first processing module 5551 and the second processing module 5552 are logical, so that any combination or further splitting may be performed according to the implemented functions, and the functions of the respective modules will be described below.
In other embodiments, the training apparatus of the distributed architecture-based policy neural network provided in the embodiments of the present application may be implemented in hardware, and as an example, the training apparatus of the distributed architecture-based policy neural network provided in the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the training method of the distributed architecture-based policy neural network provided in the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, programmable Logic Device), complex programmable logic devices (CPLD, complex Programmable Logic Device), field programmable gate arrays (FPGA, field-Programmable Gate Array), or other electronic components.
In some embodiments, referring to fig. 4B, fig. 4B is a schematic structural diagram of a terminal device 400 provided in an embodiment of the present application. As shown in fig. 4B, the terminal device 400 includes: at least one network interface 420, a user interface 430, a system bus 440, a memory 450, and a processor 460. The user interface 430 includes, among other things, one or more output devices 431 that enable presentation of media content, including one or more speakers and/or one or more visual displays. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, and other input buttons and controls. Included in the memory 450 are: an operating system 451, a network communication module 452, a presentation module 453 for enabling display of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430, an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions, and a control device 455 for the virtual character. The control device 455 of the virtual character stored in the memory 450 includes: an acquisition module 4551, a determination module 4552, a prediction module 4553 and a control module 4554; these modules are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of each module will be described below.
The following describes a training method of the distributed architecture-based policy neural network provided in the embodiment of the present application in combination with the training system of the distributed architecture-based policy neural network provided in the embodiment of the present application. Referring to fig. 5, fig. 5 is a flowchart of a training method of a neural network model according to an embodiment of the present application, and a coordinated implementation of a client and a server in each terminal device will be described with reference to the steps shown in fig. 5.
In step 101, each client obtains training samples generated continuously by interaction between each policy neural network and a simulation environment in a simulation environment for controlling movement of a virtual character in a virtual scene, and periodically sends training sample sets of target generation periods in the training samples to a server.
In some embodiments, each client may obtain the training samples continuously generated by the interaction of its policy neural network with the simulation environment as follows: in each sampling period of at least one sampling period, during the interaction between the policy neural network and the simulation environment, obtain a first state vector corresponding to the virtual character at a first time and a second state vector corresponding to the virtual character at a second time; based on the first state vector and the second state vector, invoke the policy neural network to perform action prediction and obtain a predicted action instruction; based on the predicted action instruction, invoke a Gaussian probability density function to perform probability prediction and determine the probability value of the predicted action instruction; based on the predicted action instruction, invoke a dynamics model to simulate a third state vector corresponding to the virtual character at the second time, and, based on the third state vector and the second state vector, invoke a reward function to perform reward prediction and obtain a reward value used to control the update of the policy neural network; finally, combine the first state vector, the second state vector, the predicted action instruction, the probability value, and the reward value into a training sample.
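A minimal Python sketch of one such sampling step, under the assumptions above (not from the patent; get_control_instruction_vector, step, forward, and reward_fn are hypothetical helpers standing in for the simulation environment, the policy network, and the reward function):

```python
import numpy as np

def collect_sample(policy_net, sim_env, reward_fn, s1):
    """Assemble one training sample from one sampling step (illustrative only)."""
    # Desired state at the second time: add the control-instruction vector to s1.
    ctrl = sim_env.get_control_instruction_vector()            # assumed helper
    s2 = s1 + ctrl

    # Action prediction from the concatenated (first, second) state vectors.
    mean, std = policy_net.forward(np.concatenate([s1, s2]))   # assumed helper
    action = np.random.normal(mean, std)                       # Gaussian sampling

    # Probability of the sampled action, stored as a negative log-probability.
    neglogp = (0.5 * np.sum(((action - mean) / std) ** 2)
               + np.sum(np.log(std))
               + 0.5 * action.size * np.log(2.0 * np.pi))

    # Dynamics simulation gives the actually reached state at the second time.
    s3 = sim_env.step(action)                                   # assumed helper
    reward = reward_fn(s3, s2)

    return {"s1": s1, "s2": s2, "action": action,
            "neglogp": neglogp, "reward": reward}
```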
In some embodiments, each client may obtain the second state vector corresponding to the virtual character at the second time as follows: acquire a control instruction triggered for the virtual character; add the control instruction vector corresponding to the control instruction to the first state vector, and take the result of the addition as the second state vector that the virtual character needs to reach at the second time.
Here, the policy neural network is a network capable of controlling the actions of the virtual character. During the interaction between the policy neural network and the simulation environment, the policy neural network can obtain from the simulation environment the first state vector of the virtual character at the first time. Taking a skeleton-based virtual character A in a UE4 client as an example, each client first obtains the first state vector corresponding to virtual character A at the first time (i.e., a vector describing the first state of virtual character A at the first time; for example, a multi-dimensional vector a may represent the first state, which includes the first position, first orientation, first body posture, etc. of virtual character A at the first time). Then, based on the first state vector and a manipulation instruction triggered for virtual character A by a real player or an Agent, the client determines the second state vector that virtual character A needs to reach at the second time (i.e., a vector describing the desired second state of virtual character A at the second time; for example, a multi-dimensional vector g may represent the second state, which includes the second position, second orientation, second body posture, etc. that virtual character A needs to reach at the second time). For example, the manipulation instruction vector b corresponding to the manipulation instruction (a vector with the same dimension as vector a) may be added to the first state vector, i.e., each component of the manipulation instruction vector b is added to the corresponding component of vector a. Assuming the first state vector a = [a1, a2, a3] and the manipulation instruction vector b = [b1, b2, b3], the vector obtained by adding a and b is c = [a1 + b1, a2 + b2, a3 + b3]; that is, adding two vectors does not change the dimension of the vector, only the value of each component. The result of the addition is taken as the second state vector that virtual character A needs to reach at the second time.
In the training method of the policy neural network provided in the embodiments of the present application, the policy neural network is called many times, so the first time and the second time here are simply the two moments before and after each call of the policy neural network rather than fixed moments (that is, whenever the policy neural network is called to perform action prediction, the moment before the call is the first time and the moment after the call is the second time); the same applies to the first state vector and the second state vector. For example, for the n-th call of the policy neural network (n is an integer greater than 1), the second state vector can serve as the first state vector of the (n+1)-th call; for the first call of the policy neural network, the first state vector may be the vector corresponding to the initial state of the virtual character when the virtual scene is initialized or when the virtual character is restored from the last saved progress.
In some embodiments, each client may invoke the policy neural network to perform action prediction based on the first state vector and the second state vector and obtain the predicted action instruction as follows: concatenate the first state vector and the second state vector to obtain a concatenated vector; then, based on the concatenated vector, invoke the policy neural network to perform action prediction processing and obtain the predicted action instruction.
In some embodiments, the policy neural network may be a rule model that contains motion control rules (e.g., forward dynamics equations); the rule model (e.g., a Whole Body Control (WBC) controller, a Model Predictive Control (MPC) controller, etc.) can be invoked to determine the predicted action instruction based on the motion control rules. For example, with a WBC controller as the rule model, the first state vector and the second state vector may be substituted into the forward dynamics equations contained in the WBC controller to compute the predicted action instruction. In other embodiments, the policy neural network may also be a multi-layer perceptron, a decision tree model, a gradient boosting tree model, a support vector machine, a convolutional neural network model, a deep convolutional neural network model, or a fully connected neural network model; the embodiments of the present application do not limit the type of the policy neural network.
Here, the concatenation process appends the second state vector to the end of the first state vector. Assuming the first state vector a = [a1, a2, a3] and the second state vector g = [g1, g2, g3], the concatenated vector obtained by concatenating a and g is d = [a1, a2, a3, g1, g2, g3]. It can be seen that, unlike the addition of two vectors, concatenating two vectors changes the dimension of the vector, while the value of each component remains unchanged.
In some embodiments, the first state vector includes at least one of the following state parameters: the first position of the virtual character at the first time (for example, the three-dimensional coordinates of the virtual character in the world coordinate system), the first orientation of the virtual character at the first time (the orientation may be represented, for example, by three attitude angles: yaw, roll and pitch), and the first body posture of the virtual character at the first time (for example, the body posture may be represented by the angles of the joints of the virtual character; assuming the virtual character has 18 joints, each with 1 to 3 rotational degrees of freedom, the body posture may be represented by 32 angles). The second state vector includes at least one of the following state parameters: the second position that the virtual character needs to reach at the second time (assuming the three-dimensional coordinates of the first position are [x1, y1, z1] and the position components of the manipulation instruction vector are [0, 1, 0], the three-dimensional coordinates of the second position are [x1, y1 + 1, z1]), the second orientation that the virtual character needs to reach at the second time (assuming the attitude angles of the first orientation are [R1, P1, Y1] and the attitude-angle components of the manipulation instruction vector are [10, 5, 30], the attitude angles of the second orientation are [R1 + 10, P1 + 5, Y1 + 30]), and the second body posture that the virtual character needs to reach at the second time (assuming the 32 angles of the first body posture are [d1, d2, …, d32] and the corresponding components of the manipulation instruction vector are [5, 10, …, 8], the 32 angles of the second body posture are [d1 + 5, d2 + 10, …, d32 + 8]). Each client may concatenate the first state vector and the second state vector into the concatenated vector as follows: determine a first difference between the first three-dimensional coordinates corresponding to the first position and the second three-dimensional coordinates corresponding to the second position, and determine a second difference between the first yaw angle corresponding to the first orientation and the second yaw angle corresponding to the second orientation; then concatenate the first difference, the second difference, the first roll angle and first pitch angle corresponding to the first orientation, the second roll angle and second pitch angle corresponding to the second orientation, the first body posture and the second body posture to obtain the concatenated vector.
Taking virtual character A as an example: after obtaining the first state vector of virtual character A at the first time and the second state vector of virtual character A at the second time, and because the feature the policy neural network cares about is the difference between the two state vectors rather than the state vectors themselves, each client may first compute the first difference between the first three-dimensional coordinates contained in the first state vector (e.g., the 3D position [x1, y1, z1] of virtual character A in the world coordinate system at the first time) and the second three-dimensional coordinates contained in the second state vector (e.g., the 3D position [x2, y2, z2] that virtual character A needs to reach in the world coordinate system at the second time), i.e., [x2 - x1, y2 - y1, z2 - z1]. It may then compute the second difference between the first yaw angle corresponding to the first orientation (e.g., Yaw1, the yaw angle of virtual character A at the first time) and the second yaw angle corresponding to the second orientation (e.g., Yaw2, the yaw angle virtual character A needs to reach at the second time), i.e., Yaw2 - Yaw1. The client then concatenates the first difference, the second difference, the first roll angle and first pitch angle corresponding to the first orientation, the second roll angle and second pitch angle corresponding to the second orientation, the first body posture and the second body posture into the concatenated vector, takes the concatenated vector as the input of the policy neural network, and obtains the predicted action instruction through the prediction processing of the policy neural network. Using differences in this way reduces the dimension of the input concatenated vector and improves training efficiency.
It should be noted that each client may also directly concatenate the first state vector and the second state vector and use the resulting concatenated vector as the input of the policy neural network; the embodiments of the present application do not limit this.
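As an illustration only (the exact state layout is not specified here; the field ordering and sizes below are assumptions), the following sketch shows how the difference features described above could be assembled into the policy-network input:

```python
import numpy as np

def build_policy_input(s1, s2):
    """Assemble the policy-network input from the first and second state vectors.

    Assumed layout (illustrative): [x, y, z, roll, pitch, yaw, 32 joint angles].
    """
    pos1, pos2 = s1[0:3], s2[0:3]
    roll1, pitch1, yaw1 = s1[3], s1[4], s1[5]
    roll2, pitch2, yaw2 = s2[3], s2[4], s2[5]
    pose1, pose2 = s1[6:38], s2[6:38]

    pos_diff = pos2 - pos1               # first difference (3 values)
    yaw_diff = np.array([yaw2 - yaw1])   # second difference (1 value)

    # Differences replace the absolute position and yaw, shrinking the input.
    return np.concatenate([pos_diff, yaw_diff,
                           [roll1, pitch1], [roll2, pitch2],
                           pose1, pose2])
```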
In some embodiments, each client may invoke the policy neural network to perform action prediction processing based on the concatenated vector and obtain the predicted action instruction as follows: when the concatenated vector has at least two dimensions, invoke the policy neural network to perform forward computation based on the concatenated vector and obtain the sub-prediction instruction corresponding to each dimension of the concatenated vector; then determine the mean and standard deviation of the sub-prediction instructions over the dimensions, and, based on the mean and standard deviation, call a Gaussian distribution to sample and obtain the predicted action instruction.
For example, as shown in Table 2, the concatenated vector of the virtual character is a vector of dimension 41 × 7 = 287; it is used as the input of the policy neural network, and forward computation through the policy neural network yields the mean (a mean vector) of the sub-prediction instructions over the dimensions. For example, taking the policy neural network as a multi-layer perceptron (MLP) and assuming it is a 3-layer perceptron, its expression can be written as: mean = W2 * tanh(W1 * tanh(W0 * Input + b0) + b1) + b2, where the matrices W0, W1, W2 and the bias vectors b0, b1, b2 are collectively the weights of the policy neural network. The output of the network (the mean vector) is sampled through a Gaussian distribution to obtain the predicted action instruction (action): action ~ N(mean, std), where std is the standard deviation of the sub-prediction instructions over the dimensions. This action, i.e., the output of the policy neural network in Table 2, is a 32-dimensional feed-forward torque.
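A minimal Python sketch of this forward pass and Gaussian sampling (the weight shapes, hidden sizes, and fixed std below are illustrative assumptions, not values from the patent):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 287-dimensional input, two hidden layers, 32-dimensional action.
W0, b0 = rng.standard_normal((256, 287)) * 0.01, np.zeros(256)
W1, b1 = rng.standard_normal((256, 256)) * 0.01, np.zeros(256)
W2, b2 = rng.standard_normal((32, 256)) * 0.01, np.zeros(32)
std = np.full(32, 0.1)  # assumed fixed standard deviation per action dimension

def policy_forward(x):
    """mean = W2*tanh(W1*tanh(W0*x + b0) + b1) + b2, as in the expression above."""
    h0 = np.tanh(W0 @ x + b0)
    h1 = np.tanh(W1 @ h0 + b1)
    return W2 @ h1 + b2

x = rng.standard_normal(287)        # stand-in for the concatenated state vector
mean = policy_forward(x)
action = rng.normal(mean, std)      # action ~ N(mean, std): 32-dim feed-forward torque
```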
In some embodiments, each client may determine the probability value of the predicted action instruction as follows: based on the mean, the standard deviation, and the predicted action instruction, call the probability density function of the Gaussian distribution to perform probability prediction and obtain the probability value of the predicted action instruction being sampled under the sub-prediction instructions (i.e., under the current mean and standard deviation).
Here, based on the mean vector of the sub-prediction instructions over the dimensions, the corresponding standard deviation (std), and the predicted action instruction (action), the probability density function of the Gaussian distribution is called to perform probability prediction and obtain the probability that the action is sampled under the current mean. This probability can be represented by its negative logarithm, the neglogp value; for a diagonal Gaussian the standard expression is: neglogp = sum_i [ (action_i - mean_i)^2 / (2 * std_i^2) + log(std_i) ] + (k / 2) * log(2π), where k is the action dimension.
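A short sketch of this computation (the standard diagonal-Gaussian negative log-probability, shown here for illustration):

```python
import numpy as np

def neglogp(action, mean, std):
    """Negative log-probability of `action` under N(mean, diag(std**2))."""
    k = action.size
    return (0.5 * np.sum(((action - mean) / std) ** 2)
            + np.sum(np.log(std))
            + 0.5 * k * np.log(2.0 * np.pi))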
After the predicted action instruction (action) is obtained, it is used for dynamics simulation by the physics simulator in UE4 until the current control period ends. Based on the action, the dynamics model is called to simulate the third state vector corresponding to the virtual character at the next time (namely q_{i+1}, p_{i+1}, o_{i+1}); based on this simulated character state and the second state vector (q^g_{i+1}, p^g_{i+1}, o^g_{i+1}), the reward function is called to perform reward prediction and obtain the reward value used to control the update of the policy neural network, where the expression is r = 0.7 * exp(-Σ(q_{i+1} - q^g_{i+1})²) + 0.2 * exp(-Σ(p_{i+1} - p^g_{i+1})²) + 0.1 * exp(-Σ(o_{i+1} - o^g_{i+1})²).
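A sketch of this reward under the formula above (q, p, o denote the state components compared against their goal values, as in the expression; the variable names and array shapes are illustrative):

```python
import numpy as np

def reward(q, p, o, q_goal, p_goal, o_goal):
    """r = 0.7*exp(-sum((q-qg)^2)) + 0.2*exp(-sum((p-pg)^2)) + 0.1*exp(-sum((o-og)^2))."""
    return (0.7 * np.exp(-np.sum((q - q_goal) ** 2))
            + 0.2 * np.exp(-np.sum((p - p_goal) ** 2))
            + 0.1 * np.exp(-np.sum((o - o_goal) ** 2)))
```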
Each client proceeds in this way within one sampling period. After each sampling period ends, the client combines the first state vectors, the second state vectors, the predicted action instructions, the probability values, the reward values, etc. acquired during that sampling period into training samples and sends them to the server.
In some embodiments, after obtaining the training samples continuously generated by the interaction between its policy neural network and the simulation environment, each client may store the training samples in a sample queue; accordingly, each client may periodically send the training sample set of the target generation period among the training samples to the server as follows: in every sampling period other than the first sampling period, detect whether policy parameters returned by the server have been received; in response to not having received policy parameters returned by the server, send the training sample set consisting of the training samples of that sampling period in the sample queue to the server; and, in response to receiving policy parameters returned by the server, delete the training sample set stored in the sample queue.
Here, a training sample set corresponds to one sampling period. Each client directly uploads the training samples acquired in the first sampling period to the server so that the server can train its backup policy network on them. In every subsequent sampling period, the client detects in real time whether policy parameters returned by the server have been received: if not, it stores the training samples acquired in that sampling period in the sample queue and uploads them to the server in time; if policy parameters have been received, it promptly deletes the training samples of that sampling period stored in the sample queue to avoid uploading outdated training samples to the server, updates its policy neural network based on the policy parameters returned by the server, and continues to acquire the training samples of the next sampling period with the updated policy neural network. A sketch of this per-period client logic is given below.
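The following Python sketch outlines the per-sampling-period decision described above (illustrative only; receive_params, upload, load_parameters, and the queue object are assumptions):

```python
def run_sampling_period(policy_net, server, sample_queue, new_samples, first_period):
    """One sampling period of the client-side protocol sketched above.

    sample_queue: a list or collections.deque holding not-yet-uploaded samples.
    """
    sample_queue.extend(new_samples)

    if first_period:
        server.upload(list(sample_queue))            # assumed helper
        sample_queue.clear()
        return

    params = server.receive_params(block=False)      # assumed helper: None if nothing arrived
    if params is None:
        # No new policy yet: the queued samples are still on-policy, upload them.
        server.upload(list(sample_queue))
        sample_queue.clear()
    else:
        # New policy arrived: queued samples are now stale, drop them and update.
        sample_queue.clear()
        policy_net.load_parameters(params)           # assumed helper
```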
In some embodiments, each client may obtain training samples generated continuously for each strategic neural network to interact with the simulation environment by: acquiring interaction time length of each strategy neural network for interaction with the simulation environment in each sampling period in at least one sampling period in the process of interaction between each strategy neural network and the simulation environment; determining random moments corresponding to all clients based on the interaction time, wherein different clients correspond to different random moments; and acquiring training samples generated by each client at corresponding random moments.
For example, when a skeletal character is to imitate the action in a known motion-capture segment, the embodiment of the application generates a random number uniformly distributed between 0 and m (0 < m < 1, which can be set, for example, to 0.618), multiplies the total duration (i.e., the interaction duration) of the known motion-capture segment by the random number, and takes the character state corresponding to the calculated time in the motion-capture segment as the initialization of the skeletal character in UE4. In this way, the characters in different clients have random starting points in simulation time, so that strong correlation between the training samples collected by the server from different clients can be avoided and the training accuracy of the policy neural network is improved.
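A one-line sketch of this random starting point, assuming the clip duration is known in seconds:

```python
import random

def random_start_time(clip_duration_s, m=0.618):
    # A uniform random number in (0, m) scaled by the clip duration gives each
    # client its own starting point inside the motion-capture segment.
    return random.uniform(0.0, m) * clip_duration_s
```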
In some embodiments, the policy neural network set on each client includes a first policy neural network and a second policy neural network, and each client may first obtain training samples generated continuously by interaction between each first policy neural network and the simulation environment, and store the training samples in a sample queue; in the process of continuously generating training samples based on the first strategy neural network, responding to the strategy parameters of the trained backup strategy network returned by the server, and updating the second strategy neural network based on the strategy parameters of the trained backup strategy network; in response to the second strategic neural network being updated, stopping generating training samples based on the first strategic neural network, deleting training samples stored in the sample queue, and continuously generating training samples based on the updated strategic neural network.
Here, the embodiment of the application proposes an in-client policy swap: two sets of policy neural networks are maintained locally on the client. Suppose that, while using the first policy neural network to control the character's movement and collect training samples, the client receives a command returned by the server to update the policy neural network; it then decodes the policy parameters of the received new version of the policy neural network and fills them into the local second policy neural network. Meanwhile, the first policy neural network is still used in the main thread of the client to control the character's movement, and the collected training samples are only pushed into the sample queue and are not uploaded to the server for the time being. Once the second policy neural network has been filled, the main thread is notified to abandon the first policy neural network and enable the second policy neural network, and the training samples in the sample queue are emptied at the same time, thereby preventing outdated training samples from being uploaded. When the next policy update command from the server arrives, the parameters are filled into the first policy neural network, and this alternation repeats until training is finished. Of course, after training is finished, only the latest version of the policy neural network needs to be saved; it can stably generate control commands for the virtual character with dynamics simulation in UE4, guide the virtual character to complete the given actions, and generate high-quality animation.
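The double-network swap could look roughly like the sketch below; the holder class, its method names and the policy objects (each with an assumed `load_parameters` method) are illustrative placeholders for the hand-over between the communication sub-thread and the main thread.

```python
class DualPolicyHolder:
    def __init__(self, first_policy, second_policy):
        self.active = first_policy    # drives the character in the main thread
        self.standby = second_policy  # filled by the communication sub-thread
        self.sample_queue = []
        self.swap_pending = False

    def on_new_parameters(self, params):
        # Communication sub-thread: decode into the standby network only.
        self.standby.load_parameters(params)
        self.swap_pending = True

    def on_period_end(self):
        # Main thread, between control periods: switch networks and drop
        # the samples collected with the outdated policy.
        if self.swap_pending:
            self.active, self.standby = self.standby, self.active
            self.sample_queue.clear()
            self.swap_pending = False
```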
In step 102, the server continuously trains the backup policy network based on the training sample set that is periodically sent, and returns the policy parameters of the trained backup policy network to each client after each round of training the backup policy network.
In some embodiments, after each client periodically sends a training sample set of a target generation period in the training samples to the server, the server transfers the received training sample set to the backup policy network according to a first-in first-out order, so as to continuously train the backup policy network; in the process of training the backup strategy network, in response to the capacity of the stored training sample set exceeding the target capacity, deleting the training sample set received first and storing the training sample set received last according to the first-in first-out sequence.
Here, after receiving the training sample set periodically sent by each client, the server may store the training sample set in a database, where the database also complies with the first-in-first-out queue structure rule, and when the capacity of the training samples existing in the database reaches the target capacity, the first-collected training sample is deleted.
Illustratively, when the capacity of the training sample set stored in the database reaches the target capacity, for example when the training samples stored in the database exceed a capacity corresponding to 2^20 steps, the server deletes the oldest training sample set (namely the training sample set that was first stored in the database) from the database according to the first-in-first-out order, and transmits the most recently received training sample set to the backup policy network. The earlier a training sample set entered the database, the smaller its contribution to the gradient calculation of the subsequent backup policy network, while the training sample set that has just entered the database contributes the most. Under the condition of limited storage space, removing old samples that contribute little to the gradient calculation and retaining new samples that contribute more can improve the performance of the trained policy neural network.
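A first-in-first-out sample store with the capacities mentioned above might be sketched as follows (class and method names are assumptions):

```python
import random
from collections import deque

class SampleDatabase:
    def __init__(self, capacity=2 ** 20):
        # A deque with maxlen silently discards the oldest samples once full,
        # matching the first-in-first-out rule described above.
        self.buffer = deque(maxlen=capacity)

    def add_batch(self, training_sample_set):
        self.buffer.extend(training_sample_set)

    def sample_batch(self, batch_size=2 ** 10):
        # Copy to a list so random.sample can draw a uniform random batch.
        return random.sample(list(self.buffer), batch_size)
```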
In some embodiments, the server may continually update the backup policy network based on the periodically transmitted training data set by: averaging the reward value (r) of each training sample in the training data set to obtain a reward average value, and subtracting the reward average value from the reward value of each training sample to obtain a corresponding first subtraction result; subtracting the mean value (mean) from the predicted action instruction (action) of each training sample in the training data set to obtain a second subtraction result, and dividing the square of the second subtraction result by the square of the standard deviation to obtain a first division result; dividing the exponential of the probability value of each training sample in the training data set by the exponential of the first division result to obtain a second division result; multiplying the first subtraction result, the first division result and the second division result of each training sample to obtain a first multiplication result for each training sample; and summing the first multiplication results of the training samples to obtain a loss function of the backup policy network, and updating the backup policy network based on the loss function.
In light of the above description, the expression of the loss function of the policy neural network can be written as: Loss = Σ[(r - r̄)·((action - mean)²/std²)·exp(neglogp)/exp((action - mean)²/std²)], where r̄ is the reward average value over the batch, the inner squared term is summed over the action dimensions, and neglogp is the stored probability value of the training sample.
Here, both the predicted action instruction (action) and the neglogp value are already determined when the training sample is sent to the server. Thus, in engineering, the neglogp value must be calculated through the current version of the policy neural network at the moment the action is sampled, and stored in the database as part of the training sample.
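The loss assembly described above could be sketched as follows. This is only an illustrative reading of the described steps, not the exact patented formula: the squared-difference term is assumed to apply to (action - mean), since that is what makes the exponential ratio well defined, and constant factors of the Gaussian density are omitted.

```python
import numpy as np

def backup_policy_loss(rewards, actions, means, stds, neglogp_old):
    # rewards, neglogp_old: shape (N,); actions, means, stds: shape (N, action_dim)
    advantage = rewards - np.mean(rewards)                         # first subtraction result
    sq_term = np.sum((actions - means) ** 2 / stds ** 2, axis=-1)  # first division result
    ratio = np.exp(neglogp_old) / np.exp(sq_term)                  # second division result
    per_sample = advantage * sq_term * ratio                       # first multiplication result
    return np.sum(per_sample)                                      # loss of the backup policy network
```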
After obtaining the loss function, the value of the loss function of the backup strategy network can be determined, when the value of the loss function reaches a preset threshold value, a corresponding error signal is determined based on the loss function, the error signal is reversely transmitted in the corresponding backup strategy network, and model parameters of each layer of the corresponding backup strategy network are updated in the transmission process, so that training and updating of the backup strategy network are realized.
In step 103, each client, in response to receiving the policy parameters of the trained backup policy network returned by the server, updates the policy neural network based on the policy parameters of the trained backup policy network to continuously generate training samples based on the updated policy neural network.
The strategy neural network is used for outputting target action instructions aiming at the virtual roles so as to control the virtual roles to execute target actions based on the target action instructions.
In some embodiments, after each client obtains the predicted action instruction, the first state vector may be further converted into a reference state vector of the adaptive physical simulation engine, for example, multiplying the first time derivative of the first state vector with the second parameter to obtain a second multiplication result, and adding the second multiplication result and the target action instruction to obtain a first addition result; dividing the first addition result and the first parameter to obtain a third division result; adding the third division result and the first state vector to obtain a second addition result, and determining the second addition result as a reference state vector; the reference state vector is input into an interface of the physical simulation engine to cause the physical simulation engine to control the actions of the virtual character based on the reference state vector.
Taking UE4 (Unreal Engine 4) as an example: since the interface of the PhysX physical simulation engine built into the UE4 editor can only receive a reference state vector (denoted as q_d) and cannot receive the target action instruction, each client may further perform the following processing after obtaining the target action instruction. First, multiply the first-order time derivative of the first state vector by the second parameter to obtain a second multiplication result (namely damping·q̇_i), and add the second multiplication result and the target action instruction (denoted as τ_int) to obtain a first addition result (namely τ_int + damping·q̇_i). Then divide the first addition result by the first parameter to obtain a third division result (namely (τ_int + damping·q̇_i)/spring). Then add the third division result to the first state vector to obtain a second addition result (namely q_i + (τ_int + damping·q̇_i)/spring), and determine the second addition result as the reference state vector q_d. Finally, input the reference state vector into the interface of the physical simulation engine to complete the setting of the desired instruction, so that the physical simulation engine controls the action of the virtual character based on the reference state vector (for example, the physical simulation engine may control the corresponding joints of the virtual character to rotate according to the angles contained in the reference state vector).
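Under the reading above, the conversion reduces to a single expression; the sketch below assumes that the first parameter is the spring gain and the second parameter is the damping gain mentioned later in the text.

```python
def to_reference_state(q, q_dot, tau, spring=10000.0, damping=200.0):
    # q, q_dot: current joint angles and their first-order time derivative;
    # tau: target action instruction (feed-forward torque).
    # Reference state fed to the PhysX interface: q_d = q + (tau + damping*q_dot)/spring.
    return q + (tau + damping * q_dot) / spring
```

Substituting this q_d into the PD control law used by the simulator, torque = spring*(q_d - q) + damping*(0 - q̇), reproduces the commanded torque τ, which is presumably why the conversion takes this form.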
The training process of the strategic neural network has been described so far. The method for controlling the virtual character according to the embodiment of the present application will be described below with reference to exemplary applications and implementations of the terminal device according to the embodiment of the present application.
Referring to fig. 6, fig. 6 is a flowchart of a method for controlling a virtual character according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 6.
In step 201, the terminal device obtains a third state vector corresponding to the virtual character in the virtual scene at a third time, and a manipulation instruction triggered by the virtual character.
In some embodiments, the terminal device may first obtain a third state vector corresponding to the virtual character in the virtual scene at a third time, and a manipulation instruction triggered by the virtual character, where the manipulation instruction may be triggered by a real player or may be triggered by an agent (e.g., a robot for simulating a real player).
In step 202, a fourth state vector corresponding to the virtual character at a fourth time is determined based on the manipulation instruction vector corresponding to the manipulation instruction and the third state vector.
In some embodiments, after receiving a manipulation instruction triggered by a user or an agent for the virtual character, the terminal device may perform addition processing on a manipulation instruction vector corresponding to the manipulation instruction and a third state vector, and determine an addition result as a fourth state vector that the virtual character needs to correspond to at a fourth time.
The third time and the fourth time are merely the two times before and after one invocation of the policy neural network for action prediction processing and do not refer to fixed times; the same applies to the third state vector and the fourth state vector. For example, taking the n-th invocation of the policy neural network as an example, the fourth state vector may serve as the third state vector when the policy neural network is invoked for the (n+1)-th time to perform the action prediction processing.
In step 203, the policy neural network is invoked to perform the prediction processing based on the third state vector and the fourth state vector, so as to obtain the target prediction action instruction.
Here, the policy neural network may be obtained by training according to the training method of the policy neural network based on the distributed architecture provided in the embodiment of the present application.
In some embodiments, after training the backup policy network of the policy neural network based on the plurality of training samples stored in the sample queue, the server may send the policy parameters of the trained backup policy network to the terminal device, and the terminal device updates the local policy neural network based on the policy parameters, to obtain the policy neural network of the final version. In this way, after the terminal device obtains the third state vector and the fourth state vector, the third state vector and the fourth state vector may be first spliced, and the spliced vector obtained after the splicing is used as an input of the final version of the policy neural network, so that the final version of the policy neural network model performs motion prediction processing based on the spliced vector, and outputs the target prediction motion instruction.
In step 204, the virtual character is controlled to execute a corresponding action according to the target predicted action instruction.
In some embodiments, after obtaining the target predicted action instruction, the terminal device may input the target predicted action instruction to an interface of the physical simulation engine, so that the physical simulation engine controls the action of the virtual character according to the target predicted action instruction, for example, the physical simulation engine may control the corresponding joint of the virtual character to rotate according to the angle value when the angle value of each joint of the virtual character needs to rotate is carried in the target predicted action instruction.
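Steps 201 to 204 could be combined into one deployment-time control step roughly as follows; `policy` and `physics` are assumed wrappers around the trained network and the simulation engine interface.

```python
import numpy as np

def control_step(policy, physics, third_state, manipulation_vec):
    # Step 202: the desired (fourth) state is the current state shifted by the input.
    fourth_state = third_state + manipulation_vec
    # Step 203: concatenate both states and run the trained policy network.
    observation = np.concatenate([third_state, fourth_state])
    target_action = policy.predict(observation)
    # Step 204: hand the target predicted action instruction to the physics engine.
    physics.apply(target_action)
    return target_action
```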
According to the method described above, when the policy neural network is trained, each client continuously generates training samples and periodically sends part of the generated training samples (the training sample set of the target generation period) to the server. The server continuously trains the backup policy network of the policy neural network based on the received training sample sets and returns the policy parameters of the backup policy network after each round of training to each client. Each client updates its policy neural network based on the policy parameters and continues to generate training samples based on the updated policy neural network, and the training proceeds iteratively in this way, so that each client and the server are both in a fully loaded working state and an accurate policy neural network can be trained as quickly as possible. After the training of the policy neural network is completed, the client in each terminal device can output target action instructions for the virtual character based on the trained policy neural network and control the virtual character to execute target actions based on the target action instructions, thereby accurately controlling the virtual character in the virtual scene and obtaining smooth and vivid high-quality animation.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
In the embodiment of the application, in both the client and the server the communication module is placed in a sub-thread outside the main thread, so as to realize asynchronous non-blocking communication and achieve more efficient parallelism between the CPU on the client and the GPU on the server. A sample queue (Queue) is maintained in each client. In each sampling period (such as 1 second) of the client (UE4), training samples are collected at every step and stored in the sample queue, and the communication sub-thread of the client detects in real time whether updated policy parameters returned by the server have been received. When the updated policy parameters returned by the server are not received within one sampling period, all the training samples in the sample queue are uploaded to the server; when the updated policy parameters returned by the server are received within one sampling period, the main thread of the client loads the updated policy parameters in the next sampling period and all training samples in the sample queue that have not yet been uploaded are deleted. The communication sub-thread of the server is usually in a waiting (Block(Timeout)) state; when it receives training samples sent by the communication sub-thread of a client, the received training samples are stored in a database for training the backup policy network. After the GPU finishes a round of training of the backup policy network, the main thread of the server generates the policy parameters of the new version of the backup policy network and sends them to each client in time through the communication sub-thread, so that each client can update its policy neural network based on the new version of the policy parameters and continue to collect data based on the updated policy neural network.
Therefore, the multithreaded asynchronous communication mechanism provided by the embodiment of the application can upload training samples to the server much more frequently, instead of waiting until the end of a game (or trajectory) to upload them all at once, and it allows the policy parameters of the policy neural network to be swapped in the middle of a game (or trajectory). Thus, each client only needs to collect a small number of training samples (e.g., about 5 steps) for the server to have enough training samples (about 5000 steps when there are 1000 clients) to train the backup policy network immediately. During this time the CPU in each client does not stop collecting training samples, but continues to collect data with the old policy neural network. Because the collected samples contain the probability value of the predicted action instruction, this information can be used, following the importance sampling theory, so that data acquired with the old policy neural network can still be used to train the new policy neural network. Conversely, when the server finishes updating the policy parameters and issues them to each client in time, even if a game (or trajectory) on the client is still in progress, the client is allowed to interrupt, replace the parameters of the existing policy neural network with the new ones, and continue to collect data using the updated policy neural network. The database on the server is managed with a queue data structure, discarding the earliest training samples while continuously receiving new ones. Through this mechanism the characteristics and performance of all terminal devices can be fully exploited: both the CPUs on the clients and the GPU on the server stay in a fully loaded working state, the mutual waiting caused by a synchronous communication mechanism is avoided, and the advantages of distributed deep reinforcement learning training are exploited to the greatest extent.
The technical scheme provided by the embodiment of the application includes the following key points: client data processing, server-side deep reinforcement learning training, policy deployment, and the communication framework. The communication framework adopts the architecture shown in fig. 7; to adapt to the new communication framework, several further measures are provided, namely an importance-sampling improvement of the server algorithm, a client random initialization mechanism, and permission to replace the policy neural network in the client during a game. These are described one by one below.
1. Client data processing
Taking 1/30 second sampling period of the simulation time Tick () of the simulator as an example, after each sampling period of Tick (), each client records displacement and posture quaternion of a Component part (Component) of the current virtual character in a world coordinate system, displacement and posture quaternion of a pelvic bone (Pelvis) joint relative to the Component, and posture quaternion of all bone joints participating in physical simulation relative to the Component. Note that the simulation time here refers to the time for integration in the dynamics simulation, not the time of the world clock.
For example, referring to fig. 8, fig. 8 is a schematic diagram of a bipedal character model in the Unreal Engine 4 provided in an embodiment of the present application, and shows the relationship between the Component coordinate system 802 of a bipedal character model and the pelvic bone (Pelvis) coordinate system 801 in a UE4 client. Pelvis is the root of the entire character's skeleton, and its position and attitude in the world coordinate system represent the position and orientation of the character in the world coordinate system. It has 6 degrees of freedom and can be characterized by 6 scalar quantities, of which 3 are the three-dimensional position coordinates and the other 3 are the three angles representing the orientation, namely the roll angle (Roll), the pitch angle (Pitch) and the yaw angle (Yaw).
Furthermore, the body posture of the skeletal character can then be represented by 32 angles. As shown in fig. 8, in the bipedal human skeletal model taken as an example, there are 18 rotatable joints in addition to Pelvis as the root skeletal joint, as listed in Table 1. The rotational degrees of freedom of each joint range from 1 to 3 dimensions, so the 18 joints provide a total of 32 degrees of freedom. Therefore, the 18 groups of bone-rotation quaternions carried in the data packet can be converted into 32 angle values. Using these 32 angle values, plus the 3-dimensional position coordinates of Pelvis in the world coordinate system and the 6-dimensional attitude angles, a 41-dimensional vector can fully characterize the current state of the character.
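A sketch of how such a 41-dimensional state vector could be assembled is shown below; the split into 3 position values, 6 attitude values and 32 joint angles follows the description above, while the argument names and array shapes are assumptions.

```python
import numpy as np

def build_character_state(pelvis_position, pelvis_attitude, joint_angles):
    pelvis_position = np.asarray(pelvis_position, dtype=float)  # shape (3,)
    pelvis_attitude = np.asarray(pelvis_attitude, dtype=float)  # shape (6,)
    joint_angles = np.asarray(joint_angles, dtype=float)        # shape (32,)
    state = np.concatenate([pelvis_position, pelvis_attitude, joint_angles])
    assert state.shape == (41,)
    return state
```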
Table 1 rotatable joint and degrees of freedom thereof
Table 2 gives the relevant definitions of the input (observation) and output (action) of the policy neural network. The training of the policy neural network aims to let the character with dynamics simulation in UE4 follow a target segment as closely as possible; the target segment can be acquired by motion capture, designed manually by artists, or generated by various motion generation algorithms, and is likely to be inconsistent with the dynamics of the simulation and thus not look realistic enough. The control policy generated by the training method provided in the embodiment of the application controls the character to generate actions that both conform to the dynamics and follow the given target segment as closely as possible.
Table 2 input and output of a strategic neural network
Here, according to the observation format in Table 2, the current character state is organized into a vector of 41×7=287 dimensions (i.e., the splicing vector mentioned above), and this observation vector is passed through a forward calculation of the policy neural network to obtain the mean vector of the action command (i.e., the mean of the sub-prediction instructions of each dimension of the predicted action instruction mentioned above). Illustratively, taking the policy neural network as an MLP and assuming that the policy neural network in the present application is a 3-layer perceptron, its expression can be written as: mean = W2*tanh(W1*tanh(W0*Input + b0) + b1) + b2, wherein the matrices W0, W1, W2 and the offset vectors b0, b1, b2 are collectively called the weights of the policy neural network. The output (mean vector) of the network is sampled through a Gaussian distribution to obtain the action, with the expression action ~ N(mean, std); as shown in Table 2, the output of the policy neural network is a 32-dimensional feed-forward torque.
Then, based on the mean vector, the corresponding standard deviation (std) and the action, the probability density function of the Gaussian distribution is invoked to perform probability prediction, obtaining the probability of sampling this action under the current mean (namely the probability value of the predicted action instruction obtained by sampling each sub-prediction instruction mentioned above). This probability can be represented by its negative logarithm, the neglogp value, whose expression is: neglogp = 0.5·Σ((action - mean)/std)² + Σ log(std) + 0.5·k·log(2π), where k is the dimension of the action.
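Putting the forward pass, the Gaussian sampling and the neglogp computation together gives roughly the following sketch; std is assumed to be a vector of per-dimension standard deviations, and the constant terms of the Gaussian log-density are an assumption since only the mean, std and squared term appear explicitly in the text.

```python
import numpy as np

def policy_forward(obs, W0, b0, W1, b1, W2, b2, std, rng=None):
    rng = rng if rng is not None else np.random.default_rng()
    # 3-layer perceptron: mean = W2*tanh(W1*tanh(W0*obs + b0) + b1) + b2
    mean = W2 @ np.tanh(W1 @ np.tanh(W0 @ obs + b0) + b1) + b2
    # Sample the action from N(mean, std)
    action = rng.normal(mean, std)
    # Negative log-probability of the sampled action under the diagonal Gaussian
    neglogp = (0.5 * np.sum(((action - mean) / std) ** 2)
               + np.sum(np.log(std))
               + 0.5 * mean.size * np.log(2.0 * np.pi))
    return action, neglogp
```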
After the action is obtained, it is used by the physics simulator in UE4 for dynamics simulation until the current control period ends. The dynamics model is simulated based on the action to obtain the character state of the virtual character at the next moment (the third state vector, namely q_{i+1}, p_{i+1}, o_{i+1}); together with the character state that the virtual character needs to reach at the next moment (namely the second state vector), the reward function is then invoked to perform reward prediction, obtaining a reward value used to control the updating of the policy neural network. The expression is r = 0.7*exp(-Σ(q_{i+1} - q^g_{i+1})²) + 0.2*exp(-Σ(p_{i+1} - p^g_{i+1})²) + 0.1*exp(-Σ(o_{i+1} - o^g_{i+1})²).
UE4 applies, to each of the 32 joint angles of the skeletal character model, a torque given by: torque = spring*(q^g_{i+1} - q_t) + damping*(0 - v_t), wherein spring and damping are two control-system coefficients which generally default to the constants 10000 and 200; in practical applications these values can be adjusted empirically to achieve a more ideal control effect. Here v_t is the angular velocity of the joint rotation angle, which can be obtained by simple differencing of the angle q_t.
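The per-joint PD torque can be written as a small helper, assuming q_desired denotes the desired angle taken from the reference or target state:

```python
def pd_torque(q_desired, q_current, v_current, spring=10000.0, damping=200.0):
    # torque = spring*(q_desired - q_current) + damping*(0 - v_current),
    # where v_current is the angular velocity obtained by differencing q_current.
    return spring * (q_desired - q_current) + damping * (0.0 - v_current)
```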
The above is what UE4 on the client does within one control period (1/30 second), and the cycle repeats in this way. After each period ends, the client organizes the observation, the action of the previous period, the neglogp value representing the sampling probability of the action, the reward r calculated from the observation, and other data into a group of data serving as one training sample, puts the training sample into the sample queue, and then uploads it to the server. The training samples uploaded to the server may vary depending on the type of deep reinforcement learning algorithm selected.
After the client puts the training samples into the sample queue, it can directly start the dynamics simulation and data acquisition of the next period without waiting for a reply from the server. That is, in the asynchronous communication framework used in the embodiment of the application, as long as updated policy parameters returned by the server have not yet been received, the communication thread of the client uploads the collected training samples of each step in time.
It should be noted that Table 1 and Table 2 are given for the bipedal character model in the UE4 engine; the scheme provided in the embodiments of the present application is fully applicable to any animated character figure based on a serial (tandem) skeleton.
2. Server deep reinforcement learning training
In this embodiment, the server continuously receives the training samples sent by each client and stores them in the database, which also follows the first-in-first-out queue structure rule; when the capacity of the training samples in the database reaches the target capacity (e.g., 2^20 steps of training samples), the training samples collected earliest are deleted. The deep reinforcement learning agent randomly extracts a number of training samples to form a batch (e.g., 2^10 samples) and uses this data set to perform deep reinforcement learning training based on stochastic gradient descent (namely training the backup policy network in the server), updating the policy parameters of the backup policy network. The weights W0, W1, W2, b0, b1, b2 of the updated backup policy network and the std used for sampling are then issued to each client in turn. After receiving the weights, each client reloads them into its local policy neural network calculation to form a new policy neural network, acquires training samples using the new policy neural network, and sends the acquired training samples to the server. This cycle repeats until a policy neural network meeting the task requirements is trained.
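One server-side training round as described above might be sketched like this; `database`, `backup_policy` and `clients` are assumed wrappers around the components in the text.

```python
def server_training_round(database, backup_policy, clients, batch_size=2 ** 10):
    # Draw a random batch from the FIFO database and take one stochastic
    # gradient step on the backup policy network using the loss defined earlier.
    batch = database.sample_batch(batch_size)
    backup_policy.sgd_update(batch)
    # Broadcast the updated weights (W0, W1, W2, b0, b1, b2 and std) to every client.
    params = backup_policy.export_weights()
    for client in clients:
        client.send_parameters(params)
```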
3. Policy deployment
After the deep reinforcement learning training is finished, the policy neural network can be fixed and does not change any more, and no additional programming is needed at this time, as shown in fig. 9, fig. 9 is a schematic diagram of the policy neural network provided in the embodiment of the present application, and the policy neural network directly generates a control policy for a game character in a client, that is, the trained policy neural network is used for outputting a target action instruction for the virtual character, so as to control the virtual character to execute a target action based on the target action instruction.
It should be noted that, in practical applications, the UE4 environment used for training the policy neural network and the UE4 environment used for policy deployment may differ slightly. For example, the UE4 used for deploying the policy has graphics rendering and a graphical interface: the generated character actions are skinned, rendered and played, and a time-synchronization waiting mechanism is used. For an animation with a rendering frame rate of 30 frames per second, UE4 may need far less than 1/30 second to complete the calculation of the next frame, but in order to make the picture appear close to real-time speed it is forced to wait until 1/30 second has elapsed before playing the frame. In order to accelerate data collection and improve training efficiency, the UE4 environment used for training the policy neural network cancels the graphical interface and does not have this waiting mechanism: once the dynamics simulator has calculated the character state after each period (1/30 second in this example), the next sampling period starts immediately, which also maximizes the efficiency of deep reinforcement learning.
4. Server algorithm importance sampling
The loss function of the policy neural network can be expressed as Loss = Σ(r - r̄)·neglogp, and the gradient of this loss function with respect to the policy parameters is called the policy gradient, where Σ refers to summing over all training samples in this batch, r̄ is the mean of the reward values of the training samples in the batch, and neglogp is related to the policy parameters of the current version of the policy neural network. That is, training samples collected using one version of the policy neural network give an unbiased estimate only when they are used to estimate the policy gradient of that same version of the policy neural network.
When the old framework shown in fig. 2 is used for training, because synchronous communication is adopted, data acquisition and policy upgrading strictly alternate, so the data acquired with each version of the policy neural network are strictly used for the gradient estimation of that same version of the policy neural network (e.g., the backup policy network of the corresponding version). However, with the asynchronous communication framework shown in fig. 7 used in the embodiment of the present application, it is difficult to guarantee that data collected by an old version of the policy neural network will not, because of late arrival, be used for the gradient estimation of a newer version of the policy neural network. In order to eliminate the error caused by this biased estimation, based on the importance sampling principle, the loss function of the policy neural network is rewritten as: Loss = Σ[(r - r̄)·((action - mean)²/std²)·exp(neglogp)/exp((action - mean)²/std²)],
wherein mean = W2*tanh(W1*tanh(W0*observation + b0) + b1) + b2 is computed with the latest version of the control policy at run time, while action and neglogp were already determined when the training sample was sent to the server. Therefore, in engineering, the neglogp value must be calculated through the then-current version of the policy neural network at the moment the action is sampled, and stored in the database as part of the training sample.
It should be noted that the rewritten loss function only largely cancels the error of the biased estimation; a small deviation still remains, and the expected deviation grows as the gap between policy versions increases. This is why the database is organized as a queue data structure, so that training samples collected by older versions of the policy are discarded earlier.
5. Client random frame initialization
In the scheme shown in fig. 2, each client uploads data to the server only after collecting at least one complete trajectory (one complete trajectory usually contains 80-400 steps of data), and deep learning training then starts. However, in the scheme shown in fig. 7 provided in the embodiment of the present application, each time a client collects one step of training samples it pushes them into the sample queue to be uploaded to the server; if N = 1000 clients are started simultaneously, each client only needs to collect about 2 steps before the server obtains more than one batch (2^10 steps) and can immediately start updating the policy gradient. In this case, the training samples received by the server at the beginning would have observations distributed very unevenly, concentrated around the virtual character birth point preset by UE4, so that there is strong correlation between the training samples, which is very detrimental to the estimation of the policy gradient. Therefore, if the communication framework shown in fig. 7 is to be used, a measure such as random frame initialization needs to be introduced into the client to initialize the virtual characters in different states.
For example, when it is desired to make a skeletal character imitate the action in a known motion-capture segment, the embodiment of the present application generates a random number from a uniform distribution between 0 and 0.618, multiplies the random number by the total duration of the known motion-capture segment, and takes the character state corresponding to the calculated time in the motion-capture segment as the initialization of the skeletal character in UE4. In this way, the characters in different clients have random starting points in simulation time, so that strong correlation between the observations of the training samples collected by the server from different clients can be avoided.
6. Exchange strategy in customer-premises
As described above, each client only needs to collect one or two steps of data before the server can start a deep reinforcement learning policy update, so the client will receive a policy-parameter update command from the server within a very short time. Since one game (or one trajectory of about 13 seconds) usually contains 80-400 steps of data, this means that if the client waited until the game or trajectory ends before adopting the updated policy parameters of the policy neural network, the data collected in the meantime would probably be seriously out of date because the current policy version differs too much from the latest version, so that all of the collected data would have to be discarded. On the other hand, if the policy parameters of the new version of the policy neural network were forcibly decoded and replaced within one Tick() period without pausing the other threads of the game, the time consumed would exceed the duration of a Tick() period, the frame rate of the client would become unstable, the calculation results of the physics simulator would be affected, and the calculation of the controller would be endangered.
In order to replace the policy without disturbing the frame rate, the embodiment of the application provides an in-client policy replacement method: two sets of policy neural networks are maintained locally in the client. Suppose the client receives a command from the server to update the policy neural network while it is using policy neural network A to control the character's movement and collect data; it then decodes the policy parameters of the received new version of the policy neural network in the socket (communication) thread and fills them into the local policy neural network B. Meanwhile, policy neural network A is still used in the main thread of the client to control the character, and the collected data are only pushed into the sample queue (not uploaded to the server at this time). Once policy neural network B has been filled, the main thread is notified to abandon policy neural network A and enable policy neural network B, and the sample queue is emptied at the same time, thereby preventing outdated training samples from being uploaded. When the next policy update command from the server arrives, the parameters are filled into policy neural network A, and this alternation repeats until training is finished. Of course, after training is finished, only the latest version of the policy neural network (as shown in fig. 9) needs to be saved; it can stably generate control commands for the virtual character with dynamics simulation in UE4, guide the virtual character to complete the given actions, and generate high-quality animation.
In this way, the embodiment of the application proposes an improvement of the communication framework for training a control policy by deep reinforcement learning with the dynamics simulator of UE4. By using an asynchronous communication mechanism, the data exchange frequency is increased without affecting the frame-rate stability of the main thread of the UE4 client, the waiting time between the CPUs used for data acquisition and the GPU used for policy-gradient calculation in the distributed training framework is greatly reduced, and the utilization rate of the equipment and the training efficiency are greatly improved. Moreover, the physics simulator of UE4 is used for reinforcement learning training to help generate complex control policies, producing in real time, for the game characters in UE4, smooth and vivid high-quality animations that interact with collisions in the environment.
Continuing with the description below of an exemplary structure of the training device 555 for a policy neural network based on a distributed architecture provided in an embodiment of the present application, implemented as software modules. In some embodiments, the software modules are stored in the training device 555 of the memory 550 in fig. 4A, where the distributed architecture includes at least one client and one server, each client is provided with a policy neural network, and the server is provided with a backup policy network corresponding to the policy neural network. The software modules may include:
A first processing module 5551, configured to obtain, in a simulation environment in which each client performs virtual character motion control in a virtual scene, training samples generated continuously by each policy neural network in interaction with the simulation environment, and periodically send a training sample set of a target generation period in the training samples to a server, so that the server continuously trains the backup policy network based on the periodically sent training sample set, and after training the backup policy network in each round, returns policy parameters of the trained backup policy network to each client; a second processing module 5552, configured to update, in response to receiving the trained policy parameters of the backup policy network returned by the server, the policy neural network based on the trained policy parameters of the backup policy network, so as to continuously generate the training samples based on the updated policy neural network; the strategy neural network is used for outputting a target action instruction aiming at the virtual character so as to control the virtual character to execute a target action based on the target action instruction.
In some embodiments, the first processing module is further configured to, during each of the sampling periods in at least one sampling period, each of the clients perform the following operations in an interaction process between each of the policy neural networks and the simulation environment: acquiring a first state vector corresponding to the virtual character at a first moment, and acquiring a second state vector corresponding to the virtual character at a second moment; based on the first state vector and the second state vector, invoking the strategy neural network to conduct action prediction to obtain a predicted action instruction; calling a Gaussian distribution probability density function to carry out probability prediction based on the predicted action instruction, and determining a probability value of the predicted action instruction; calling dynamic simulation based on the predicted action instruction to obtain a third state vector corresponding to the virtual character at the second moment, and calling a reward function to conduct reward prediction based on the third state vector and the second state vector to obtain a reward value for controlling the strategy neural network to update; the first state vector, the second state vector, the predicted action instruction, the probability value, and the prize value are combined into a training sample.
In some embodiments, the first processing module is further configured to obtain a manipulation instruction triggered for the virtual character; and adding the control instruction vector corresponding to the control instruction and the first state vector, and determining an addition result as a second state vector which is needed to be corresponding to the virtual character at a second moment.
In some embodiments, the first processing module is further configured to perform a stitching process on the first state vector and the second state vector to obtain a stitched vector; and calling the strategy neural network to conduct action prediction processing based on the splicing vector to obtain a predicted action instruction.
In some embodiments, the first state vector includes at least one of the following state parameters: a first position of the virtual character at the first moment, a first direction of the virtual character at the first moment, and a first body posture of the virtual character at the first moment; the second state vector includes at least one of the following state parameters: a second position at which the avatar needs to reach at the second moment, a second orientation at which the avatar needs to reach at the second moment, a second body posture at which the avatar needs to reach at the second moment; the first processing module is further configured to determine a first difference between a first three-dimensional coordinate corresponding to the first position and a second three-dimensional coordinate corresponding to the second position, and determine a second difference between a first yaw angle corresponding to the first orientation and a second yaw angle corresponding to the second orientation; and performing splicing treatment on the first difference value, the second difference value, the first rolling angle and the first pitch angle corresponding to the first direction, the second rolling angle and the second pitch angle corresponding to the second direction, the first body posture and the second body posture to obtain a spliced vector.
In some embodiments, the first processing module is further configured to call the policy neural network to perform forward computation based on the stitching vector when the number of dimensions of the stitching vector is at least two, so as to obtain a sub-prediction instruction corresponding to the stitching vector in each dimension; and determining the mean value and standard deviation of the sub-prediction instruction corresponding to the splicing vector of each dimension, and calling a Gaussian distribution function to sample based on the mean value and the standard deviation to obtain a prediction action instruction.
In some embodiments, the first processing module is further configured to call a probability density function of gaussian distribution to perform probability prediction based on the mean value, the standard deviation and the predicted action instruction, so as to obtain probability values of the predicted action instruction by downsampling each of the sub-predicted instructions.
In some embodiments, the first processing module is further configured to average a reward value of each training sample in the training data set to obtain a reward average value, and subtract the reward value of each training sample from the reward average value to obtain a corresponding first subtraction result; subtracting the average value from the predicted action instruction of each training sample in the training data set to obtain a second subtraction result, and dividing the square of the first subtraction result by the square of the standard deviation to obtain a first division result; dividing the index of the probability value of each training sample in the training data set with the index of the first division result to obtain a second division result; multiplying the first subtraction result, the first division result and the second division result of each training sample to obtain a first multiplication result of each training sample; and summing the first multiplication results of the training samples to obtain a loss function of the backup strategy network, and updating the backup strategy network based on the loss function.
In some embodiments, after the training samples generated by interaction between each of the strategic neural networks and the simulation environment are obtained, the first processing module is further configured to store the training samples in a sample queue; detecting whether policy parameters returned by the server are received or not in other sampling periods except the first sampling period in the at least one sampling period; responding to the strategy parameters which are not received and returned by the server, and sending a training sample set consisting of a plurality of training samples corresponding to other sampling periods in the sample queue to the server; and deleting training samples stored in the sample queue in response to receiving the strategy parameters returned by the server.
In some embodiments, after the training sample set of the target generation period in the training samples is periodically sent to a server, the first processing module is further configured to send the received training sample set to the backup policy network according to a first-in first-out order by the server, so as to continuously train the backup policy network; and in the process of training the backup strategy network, deleting the training sample set received first according to the first-in first-out sequence in response to the stored capacity of the training sample set exceeding the target capacity, and storing the training sample set received last.
In some embodiments, the first processing module is further configured to obtain, in each of the sampling periods in at least one sampling period, an interaction duration of each of the policy neural networks interacting with the simulation environment during each of the policy neural networks interacting with the simulation environment; determining random time corresponding to each client based on the interaction time, wherein different clients correspond to different random time; and acquiring training samples generated by the clients at corresponding random moments.
In some embodiments, the policy neural network set on each client includes a first policy neural network and a second policy neural network, and the first processing module is further configured to obtain training samples generated continuously by interaction between each first policy neural network and the simulation environment, and store the training samples in a sample queue; the second processing module is further configured to, in a process of generating the training sample based on the persistence of the first policy neural network, respond to receiving the trained policy parameters of the backup policy network returned by the server, and update a second policy neural network based on the trained policy parameters of the backup policy network; and stopping generating the training samples based on the first strategy neural network in response to the updating of the second strategy neural network, deleting the training samples stored in the sample queue, and continuously generating the training samples based on the updated strategy neural network.
In some embodiments, the apparatus further comprises: the third processing module is used for multiplying the first-order time derivative of the first state vector with a second parameter to obtain a second multiplication result, and adding the second multiplication result and the target action instruction to obtain a first addition result; performing division processing on the first addition result and the first parameter to obtain a third division result; adding the third division result and the first state vector to obtain a second addition result, and determining the second addition result as a reference state vector; the reference state vector is input into an interface of a physical simulation engine, so that the physical simulation engine controls the execution target action of the virtual character based on the reference state vector.
Continuing with the description below of an exemplary structure implemented as a software module for the virtual character control device 455 provided in embodiments of the present application, in some embodiments, as shown in fig. 4B, the software module stored in the virtual character control device 455 of the memory 450 may include: the obtaining module 4551 is configured to obtain a third state vector corresponding to a virtual character in a virtual scene at a third time, and a manipulation instruction triggered by the virtual character; a determining module 4552, configured to determine, based on a manipulation instruction vector corresponding to the manipulation instruction and the third state vector, a fourth state vector that is required to be corresponding to the virtual character at a fourth time; the prediction module 4553 is configured to invoke a policy neural network to perform motion prediction processing based on the third state vector and the fourth state vector, so as to obtain a target predicted motion instruction; the strategy neural network is obtained by training the strategy neural network training method based on the distributed architecture according to the embodiment of the application; and the control module 4554 is used for controlling the virtual character to execute corresponding actions based on the target prediction action instruction.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device executes the training method of the policy neural network or the control method of the virtual character based on the distributed architecture according to the embodiment of the application.
The embodiments of the present application provide a computer readable storage medium storing executable instructions, where the executable instructions are stored, which when executed by a processor, cause the processor to perform a method for training a policy neural network or controlling virtual roles based on a distributed architecture provided in the embodiments of the present application, for example, a method as shown in fig. 3 or fig. 4.
In some embodiments, the computer readable storage medium may be FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices located at one site or, alternatively, on multiple computing devices distributed across multiple sites and interconnected by a communication network.
The foregoing descriptions are merely exemplary embodiments of the present application and are not intended to limit the scope of the present application. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present application shall fall within the protection scope of the present application.

Claims (19)

1. A training method for a strategy neural network based on a distributed architecture, characterized in that the distributed architecture comprises at least one client and a server, wherein the strategy neural network is arranged on each client, and a backup strategy network corresponding to the strategy neural network is arranged on the server; the method comprises the following steps:
in a simulation environment in which each client controls the motion of a virtual character in a virtual scene, acquiring training samples continuously generated by the interaction of each strategy neural network with the simulation environment, and periodically sending a training sample set of a target generation period among the training samples to the server, so that the server continuously trains the backup strategy network based on the periodically sent training sample sets and, after each round of training of the backup strategy network, returns strategy parameters of the trained backup strategy network to each client;
In response to receiving the trained policy parameters of the backup policy network returned by the server, updating the policy neural network based on the trained policy parameters of the backup policy network to continuously generate the training samples based on the updated policy neural network;
the strategy neural network is used for outputting a target action instruction aiming at the virtual character so as to control the virtual character to execute a target action based on the target action instruction.
2. The method of claim 1, wherein the acquiring training samples continuously generated by the interaction of each of the strategy neural networks with the simulation environment comprises:
during the interaction of each of the strategy neural networks with the simulation environment in each sampling period of at least one sampling period, each of the clients performs the following operations:
acquiring a first state vector corresponding to the virtual character at a first moment, and acquiring a second state vector corresponding to the virtual character at a second moment;
based on the first state vector and the second state vector, invoking the strategy neural network to conduct action prediction, and obtaining a predicted action instruction;
calling a Gaussian distribution probability density function to perform probability prediction based on the predicted action instruction, to obtain a probability value of the predicted action instruction;
calling a dynamics model to simulate and obtain a third state vector corresponding to the virtual character at the second moment based on the predicted action instruction, and calling a reward function to conduct reward prediction based on the third state vector and the second state vector to obtain a reward value for controlling the strategy neural network to update;
and combining the first state vector, the second state vector, the predicted action instruction, the probability value, and the reward value into one training sample.
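As an illustration of the per-step sampling recited in claim 2 above, the following Python sketch assembles one training sample. The interfaces of `policy`, `dynamics_model`, and `reward_fn` are assumed, and the Gaussian parameterization follows a literal reading of claims 6 and 7; this is a sketch under those assumptions, not the claimed implementation itself.

```python
import numpy as np

def sample_one_transition(policy, dynamics_model, reward_fn, s1, s2):
    # Action prediction: the policy yields a mean and standard deviation for the action
    mean, std = policy(np.concatenate([s1, s2]))
    action = np.random.normal(mean, std)          # predicted action instruction
    # Probability of the sampled action under the Gaussian density (cf. claim 7)
    prob = np.prod(np.exp(-(action - mean) ** 2 / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi)))
    # Simulated state actually reached at the second moment, and the resulting reward
    s3 = dynamics_model(s1, action)
    reward = reward_fn(s3, s2)
    # One training sample as described in the claim
    return (s1, s2, action, prob, reward)
```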
3. The method of claim 2, wherein the acquiring a second state vector corresponding to the virtual character at a second moment comprises:
acquiring a control instruction triggered by the virtual character;
and adding the control instruction vector corresponding to the control instruction to the first state vector, and determining the addition result as the second state vector that the virtual character is required to correspond to at the second moment.
4. The method of claim 2, wherein the invoking the strategy neural network to perform action prediction based on the first state vector and the second state vector to obtain a predicted action instruction comprises:
Performing splicing processing on the first state vector and the second state vector to obtain a spliced vector;
and calling the strategy neural network to perform action prediction processing based on the spliced vector to obtain a predicted action instruction.
5. The method of claim 4, wherein:
the first state vector includes at least one of the following state parameters: a first position of the virtual character at the first moment, a first orientation of the virtual character at the first moment, and a first body posture of the virtual character at the first moment;
the second state vector includes at least one of the following state parameters: a second position that the avatar needs to reach at the second moment, a second orientation that the avatar needs to reach at the second moment, a second body posture that the avatar needs to reach at the second moment;
the performing splicing processing on the first state vector and the second state vector to obtain a spliced vector comprises:
determining a first difference between a first three-dimensional coordinate corresponding to the first position and a second three-dimensional coordinate corresponding to the second position, and determining a second difference between a first yaw angle corresponding to the first orientation and a second yaw angle corresponding to the second orientation;
and performing splicing processing on the first difference value, the second difference value, a first roll angle and a first pitch angle corresponding to the first orientation, a second roll angle and a second pitch angle corresponding to the second orientation, the first body posture, and the second body posture, to obtain the spliced vector.
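The splicing of claim 5 can be pictured with the short sketch below. The argument layout and the sign convention of the two difference terms are assumptions; the claim only fixes which quantities enter the spliced vector.

```python
import numpy as np

def build_spliced_vector(pos1, yaw1, roll1, pitch1, pose1,
                         pos2, yaw2, roll2, pitch2, pose2):
    pos_diff = np.asarray(pos2) - np.asarray(pos1)   # first difference: 3D position offset
    yaw_diff = np.array([yaw2 - yaw1])               # second difference: yaw offset
    return np.concatenate([
        pos_diff, yaw_diff,
        [roll1, pitch1], [roll2, pitch2],            # remaining orientation angles
        np.asarray(pose1), np.asarray(pose2),        # body postures at both moments
    ])
```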
6. The method of claim 4, wherein the calling the strategy neural network to perform action prediction processing based on the spliced vector to obtain a predicted action instruction comprises:
when the number of dimensions of the spliced vector is at least two, calling the strategy neural network to perform forward calculation based on the spliced vector to obtain a sub-prediction instruction corresponding to the spliced vector in each dimension;
and determining a mean value and a standard deviation of the sub-prediction instructions corresponding to the spliced vector in each dimension, and calling a Gaussian distribution function to perform sampling based on the mean value and the standard deviation to obtain the predicted action instruction.
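Read literally, claim 6 has the network emit one sub-prediction per dimension of the spliced vector, takes the mean and standard deviation over those sub-predictions, and samples the action from a Gaussian with those statistics. The sketch below follows that literal reading with an assumed two-layer network; practical Gaussian policies often instead output the mean and (log-)standard deviation directly.

```python
import numpy as np

def predict_action(w1, b1, w2, b2, spliced_vec):
    hidden = np.tanh(w1 @ spliced_vec + b1)     # forward calculation of the strategy network
    sub_predictions = w2 @ hidden + b2          # one sub-prediction instruction per dimension
    mean = sub_predictions.mean()               # mean value of the sub-predictions
    std = sub_predictions.std() + 1e-6          # standard deviation (small floor for stability)
    action = np.random.normal(mean, std)        # Gaussian sampling -> predicted action instruction
    return action, mean, std
```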
7. The method of claim 6, wherein the calling a Gaussian distribution probability density function to perform probability prediction based on the predicted action instruction, to obtain a probability value of the predicted action instruction, comprises:
and calling the probability density function of the Gaussian distribution to perform probability prediction based on the mean value, the standard deviation, and the predicted action instruction, to obtain the probability value of the predicted action instruction being sampled under each sub-prediction instruction.
8. The method of claim 6, wherein the continuously training, by the server, the backup strategy network based on the periodically sent training sample sets comprises:
averaging the reward values of the training samples in the training sample set to obtain a reward average value, and subtracting the reward value of each training sample from the reward average value to obtain a corresponding first subtraction result;
subtracting the mean value from the predicted action instruction of each training sample in the training sample set to obtain a second subtraction result, and dividing the square of the second subtraction result by the square of the standard deviation to obtain a first division result;
dividing the exponential of the probability value of each training sample in the training sample set by the exponential of the first division result to obtain a second division result;
multiplying the first subtraction result, the first division result and the second division result of each training sample to obtain a first multiplication result of each training sample;
and summing the first multiplication results of the training samples to obtain a loss function of the backup strategy network, and updating the backup strategy network based on the loss function.
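One possible reading of the loss in claim 8 is sketched below in Python: the reward-minus-average term plays the role of an advantage, and the ratio of exponentials plays the role of an importance weight on each sample. Variable names and the exact grouping of terms are interpretations of the claim language, not an authoritative formula.

```python
import numpy as np

def backup_policy_loss(rewards, actions, old_probs, mean, std):
    rewards = np.asarray(rewards, dtype=float)
    advantage = rewards.mean() - rewards                              # first subtraction result
    gauss_exponent = (np.asarray(actions) - mean) ** 2 / std ** 2     # first division result
    ratio = np.exp(np.asarray(old_probs)) / np.exp(gauss_exponent)    # second division result
    per_sample = advantage * gauss_exponent * ratio                   # first multiplication result
    return per_sample.sum()                                           # summed loss for the backup network
```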
9. The method of claim 1, wherein after the acquiring training samples continuously generated by the interaction of each of the strategy neural networks with the simulation environment, the method further comprises:
storing the training samples in a sample queue;
the periodically sending the training sample set of the target generation period among the training samples to the server comprises:
detecting, in each sampling period other than the first sampling period of the at least one sampling period, whether the strategy parameters returned by the server are received;
in response to the strategy parameters returned by the server not being received, sending a training sample set consisting of a plurality of training samples corresponding to the other sampling periods in the sample queue to the server;
the method further comprises the steps of:
and deleting the training sample set stored in the sample queue in response to receiving the strategy parameters returned by the server.
10. The method of claim 1, wherein after periodically sending the training sample set of target generation periods in the training samples to a server, the method further comprises:
transmitting, by the server, the received training sample sets to the backup strategy network in a first-in first-out order, so as to continuously train the backup strategy network;
and in the process of training the backup strategy network, in response to the stored capacity of training sample sets exceeding a target capacity, deleting the training sample set received earliest according to the first-in first-out order, and storing the most recently received training sample set.
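The first-in first-out handling of training sample sets in claim 10 amounts to a bounded queue, as in the sketch below; `TARGET_CAPACITY` and the helper names are assumed for illustration.

```python
from collections import deque

TARGET_CAPACITY = 8  # maximum number of buffered training sample sets (assumed value)

sample_set_queue = deque()

def receive_sample_set(sample_set):
    # Drop the earliest-received set once the buffer would exceed the target capacity
    if len(sample_set_queue) >= TARGET_CAPACITY:
        sample_set_queue.popleft()
    sample_set_queue.append(sample_set)

def next_training_batch():
    # The backup strategy network consumes sample sets in arrival order
    return sample_set_queue.popleft() if sample_set_queue else None
```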
11. The method of claim 1, wherein the acquiring training samples continuously generated by the interaction of each of the strategy neural networks with the simulation environment comprises:
acquiring, in the process of each of the strategy neural networks interacting with the simulation environment, an interaction duration of each strategy neural network with the simulation environment in each sampling period of at least one sampling period;
determining a random moment corresponding to each client based on the interaction duration, wherein different clients correspond to different random moments;
and acquiring training samples generated by the clients at corresponding random moments.
12. The method of claim 1, wherein the policy neural network provided on each of the clients comprises a first policy neural network and a second policy neural network, and the acquiring training samples continuously generated by the interaction of each of the policy neural networks with the simulation environment comprises:
acquiring training samples continuously generated by the interaction of each first policy neural network with the simulation environment, and storing the training samples into a sample queue;
the updating the policy neural network based on the trained policy parameters of the backup policy network in response to receiving the trained policy parameters of the backup policy network returned by the server, so as to continuously generate the training samples based on the updated policy neural network, comprises:
in the process of continuously generating the training samples based on the first policy neural network, in response to receiving the trained policy parameters of the backup policy network returned by the server, updating the second policy neural network based on the trained policy parameters of the backup policy network;
and in response to the second policy neural network having been updated, stopping generating the training samples based on the first policy neural network, deleting the training samples stored in the sample queue, and continuously generating the training samples based on the updated second policy neural network.
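The double-network arrangement of claim 12 can be summarized by the sketch below: the first network keeps producing samples while the second is refreshed with the server's parameters, and the roles swap once the refresh finishes. The `load_parameters` helper and the `clear()` call on the queue are assumptions about the surrounding code.

```python
def on_parameters_received(first_net, second_net, new_params, sample_queue):
    second_net.load_parameters(new_params)  # refresh the standby network in the background
    sample_queue.clear()                    # stale samples from the old policy are discarded
    # The freshly updated network takes over sample generation; the old one becomes standby
    return second_net, first_net
```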
13. The method of claim 1, wherein the method further comprises:
converting the first state vector into a reference state vector adapted to a physics simulation engine;
and inputting the reference state vector into an interface of the physics simulation engine, so that the physics simulation engine controls the action of the virtual character based on the reference state vector.
14. A method for controlling a virtual character, the method comprising:
acquiring a third state vector corresponding to a virtual character in a virtual scene at a third time and a control instruction triggered by the virtual character;
determining a fourth state vector which corresponds to the virtual character at a fourth moment based on a control instruction vector which corresponds to the control instruction and the third state vector;
based on the third state vector and the fourth state vector, invoking a strategy neural network to perform action prediction processing to obtain a target prediction action instruction, and controlling the virtual character to execute corresponding actions based on the target prediction action instruction;
wherein the policy neural network is trained according to the training method of the policy neural network based on the distributed architecture as claimed in any one of claims 1 to 13.
15. A training device for a strategy neural network based on a distributed architecture, characterized in that the distributed architecture comprises at least one client and a server, wherein the strategy neural network is arranged on each client, and a backup strategy network corresponding to the strategy neural network is arranged on the server; the device comprises:
a first processing module, configured to acquire, in a simulation environment in which each of the clients controls the motion of a virtual character in a virtual scene, training samples continuously generated by the interaction of each of the strategy neural networks with the simulation environment, and to periodically send a training sample set of a target generation period among the training samples to the server, so that the server continuously trains the backup strategy network based on the periodically sent training sample sets and, after each round of training of the backup strategy network, returns strategy parameters of the trained backup strategy network to each client;
a second processing module, configured to update, in response to receiving the strategy parameters of the trained backup strategy network returned by the server, the strategy neural network based on the strategy parameters of the trained backup strategy network, so as to continuously generate the training samples based on the updated strategy neural network;
the strategy neural network is used for outputting a target action instruction aiming at the virtual character so as to control the virtual character to execute a target action based on the target action instruction.
16. A virtual character control apparatus, the apparatus comprising:
the acquisition module is used for acquiring a third state vector corresponding to a virtual character in a virtual scene at a third moment and a control instruction triggered by the virtual character;
the determining module is used for determining, based on the control instruction vector corresponding to the control instruction and the third state vector, a fourth state vector that the virtual character is required to correspond to at a fourth moment;
the prediction module is used for calling a strategy neural network to perform motion prediction processing based on the third state vector and the fourth state vector so as to obtain a target prediction motion instruction;
wherein the policy neural network is trained according to the training method of the policy neural network based on the distributed architecture of any one of claims 1 to 13;
and the control module is used for controlling the virtual character to execute corresponding actions based on the target prediction action instruction.
17. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the method of any one of claims 1 to 14 when executing executable instructions stored in said memory.
18. A computer readable storage medium storing executable instructions for implementing the method of any one of claims 1 to 14 when executed by a processor.
19. A computer program product comprising a computer program or instructions which, when executed by a processor, implements the method of any one of claims 1 to 14.
CN202210804233.1A 2022-07-07 2022-07-07 Training and role control method and device of policy neural network and electronic equipment Pending CN117414580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210804233.1A CN117414580A (en) 2022-07-07 2022-07-07 Training and role control method and device of policy neural network and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210804233.1A CN117414580A (en) 2022-07-07 2022-07-07 Training and role control method and device of policy neural network and electronic equipment

Publications (1)

Publication Number Publication Date
CN117414580A true CN117414580A (en) 2024-01-19

Family

ID=89527129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210804233.1A Pending CN117414580A (en) 2022-07-07 2022-07-07 Training and role control method and device of policy neural network and electronic equipment

Country Status (1)

Country Link
CN (1) CN117414580A (en)

Similar Documents

Publication Publication Date Title
US20230029460A1 (en) Method, apparatus, and device for scheduling virtual objects in virtual environment
WO2021143261A1 (en) Animation implementation method and apparatus, electronic device, and storage medium
WO2022001652A1 (en) Virtual character control method and apparatus, computer device, and storage medium
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
CN111028317B (en) Animation generation method, device and equipment for virtual object and storage medium
CN107610208B (en) Motion simulation method of animation character in particle medium environment
CN110163938B (en) Animation control method and device, storage medium and electronic device
CN112669194B (en) Animation processing method, device, equipment and storage medium in virtual scene
WO2022184128A1 (en) Skill release method and apparatus for virtual object, and device and storage medium
US11816772B2 (en) System for customizing in-game character animations by players
US20190272024A1 (en) System, method and apparatus of simulating physics in a virtual environment
WO2022051460A1 (en) 3d asset generation from 2d images
CN116704103A (en) Image rendering method, device, equipment, storage medium and program product
WO2019144346A1 (en) Object processing method in virtual scene, device and storage medium
CN114758108A (en) Virtual object driving method, device, storage medium and computer equipment
Roberts et al. Steps towards prompt-based creation of virtual worlds
CN115082607A (en) Virtual character hair rendering method and device, electronic equipment and storage medium
CN112843683A (en) Control method and device of virtual role, electronic equipment and storage medium
CN117414580A (en) Training and role control method and device of policy neural network and electronic equipment
WO2023284634A1 (en) Data processing method and related device
CN115797517A (en) Data processing method, device, equipment and medium of virtual model
KR20240055025A (en) Inferred skeletal structures for practical 3D assets
US20230267668A1 (en) Joint twist generation for animation
CN112138394B (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
Jones et al. Dynamic sprites: artistic authoring of interactive animations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination