CN111026272A - Training method and device for virtual object behavior strategy, electronic equipment and storage medium - Google Patents
- Publication number
- CN111026272A CN111026272A CN201911254761.9A CN201911254761A CN111026272A CN 111026272 A CN111026272 A CN 111026272A CN 201911254761 A CN201911254761 A CN 201911254761A CN 111026272 A CN111026272 A CN 111026272A
- Authority
- CN
- China
- Prior art keywords
- virtual object
- reward
- sub
- subtask
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/012—Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment
Abstract
The application provides a training method and apparatus for a virtual object behavior strategy, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence. The method specifically comprises the following steps: acquiring state data from before and after the virtual object executes an interactive action; calculating the reward value for the virtual object executing the interactive action according to a reward function with a changing gradient, configured in advance for the task executed by the virtual object, where the gradient varies with the distance between the current state of the virtual object after the interactive action and the target state; and training a behavior strategy for reaching the target state using the before-and-after state data and the reward value. In this way, the change in the reward value better matches the way humans and animals learn, thereby improving training efficiency and simulating the learning process of humans and animals more quickly.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for training a behavior policy of a virtual object, an electronic device, and a computer-readable storage medium.
Background
Reinforcement learning is a sub-field of machine learning. Its main idea is to update an agent's understanding of its environment through interaction with that environment and the reward obtained from it, so that a better strategy can be generated that increases the cumulative long-term reward the agent obtains from the environment; in theory, through continuous training, the agent can gradually converge to an optimal strategy for a given environment.
As shown in FIG. 1, for example, in the game "Montezuma's Revenge", the agent receives a +1 reward for each step towards the bottom of the ladder and a -1 penalty for each step away from it, so the agent should learn to go down the ladder quickly, because that yields more reward.
However, this may not improve the agent's learning efficiency well. If an agent needs to be trained to cross a river along a small bridge, the prior art designs the reward to give a fixed score every time the agent moves one step closer to the river. But in the way humans and animals actually learn, the degree of inner excitement does not change linearly during the process of crossing the river; a design based on the existing linear reward therefore does not match the change in mental state of a real player, so the training effect is poor and the training efficiency is low.
Disclosure of Invention
An embodiment of the application provides a training method for a virtual object behavior strategy, which is used to improve training efficiency.
The application provides a training method of a virtual object behavior strategy, which comprises the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
In an embodiment, the calculating a reward value for the virtual object to perform the interaction according to a reward function with gradient change configured for the task performed by the virtual object in advance includes:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
In an embodiment, the overlaying of the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions includes:
and weighting and summing the branch rewards corresponding to each subtask according to the weight configured for each subtask, to obtain the reward value for the virtual object executing all the sub-interactive actions.
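The weighted superposition described in this embodiment can be sketched as follows; the function name and the example weights are illustrative, not taken from the application:

```python
def total_reward(branch_rewards, weights):
    """Weighted superposition of per-subtask branch rewards:
    each branch reward is multiplied by the weight configured
    for its subtask, and the products are summed."""
    return sum(w * r for r, w in zip(branch_rewards, weights))

# Two subtasks with branch rewards 0.8 and 0.2 and weights 0.7 and 0.3:
r = total_reward([0.8, 0.2], weights=[0.7, 0.3])
```

Here `r` is the reward value for the virtual object executing all the sub-interactive actions in one step.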
In an embodiment, the calculating, according to the reward function corresponding to each subtask and the sub-interaction action under each subtask, a branch reward for executing the corresponding sub-interaction action under each subtask includes:
for each subtask, calculating the distance from the current sub-state to the target state of the virtual object according to the sub-state data obtained after the corresponding sub-interactive action is executed under that subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the training of the behavior strategy to reach the goal state using the contextual state data and the reward value for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
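The target profit value described in this embodiment (the reward from the experience tuple plus the maximum model output for the subsequent state) can be sketched as follows; a discount factor `gamma` is assumed here for generality, although the claim does not name one:

```python
def td_target(reward, next_q_values, gamma=1.0):
    """Target profit value: the reward value in the experience data
    plus the maximum output of the neural network model for the
    next-state data, optionally discounted by gamma (assumed 1.0)."""
    return reward + gamma * max(next_q_values)

# Model outputs for three candidate interactive actions in the next state:
y = td_target(reward=0.5, next_q_values=[1.0, 2.0, 0.5])
```

The network parameters are then updated so the expected value output for the previous state and action approaches `y`.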
In one embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
The present application further provides a training apparatus for a virtual object behavior strategy, the apparatus including:
the data acquisition module is used for acquiring the state data before and after the virtual object executes the interactive action;
the reward calculation module is used for calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change, which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and the strategy training module is used for training the behavior strategy reaching the target state by utilizing the state data before and after the interactive action is executed and the reward value.
In addition, the present application also provides an electronic device, which includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the training method of the virtual object behavior strategy.
Further, the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is executable by a processor to perform the method for training the behavior policy of the virtual object provided in the present application.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with a changing gradient is used to calculate the reward value for executing the interactive action in each state, so that a behavior strategy for reaching the target state is trained from the experience data. Because the gradient of the reward function changes with the distance between the current state and the target state of the virtual object, the change in the reward value better matches the learning rules of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
FIG. 1 is a diagram of the interface of the "Montezuma's Revenge" game in the background art;
FIG. 2 is a schematic diagram illustrating deep reinforcement learning provided by an embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a training method for a behavior strategy of a virtual object according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for training a behavior policy of a virtual object according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a forward reward function provided by an embodiment of the present application;
FIG. 6 is a diagram of a negative reward function provided by an embodiment of the present application;
FIG. 7 is a detailed flowchart of step 420 in a corresponding embodiment of FIG. 4;
FIG. 8 is a detailed flowchart of steps 410 and 420 in the corresponding embodiment of FIG. 4;
FIG. 9 is a flowchart showing details of step 421 in the corresponding embodiment of FIG. 8;
FIG. 10 is a detailed flowchart of step 430 in the corresponding embodiment of FIG. 4;
FIG. 11 is a schematic diagram of a robot arm environment provided by an embodiment of the present application;
FIG. 12 is a graph illustrating training efficiency comparison of different reward functions provided by embodiments of the present application;
fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
FIG. 2 is a schematic diagram of the deep reinforcement learning principle provided by an embodiment of the present application. As shown in FIG. 2, at timestamp t the agent obtains state data s_t from the environment; the agent performs an action a_t; the environment reacts to this action, produces the next state data s_{t+1}, and feeds back a reward value (reward) to the agent. By continuously cycling through this process, based on the reward values fed back for different actions executed in different states, the optimal strategy for achieving the goal can finally be obtained.
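The interaction loop of FIG. 2 can be sketched as follows; the toy one-dimensional `GridEnv` environment and all names are illustrative, not part of the application:

```python
class GridEnv:
    """Toy 1-D environment: the agent starts at position 0, the goal is at 5."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: +1 (right) or -1 (left)
        self.pos = max(0, self.pos + action)
        done = self.pos == 5
        reward = 1.0 if done else -0.1  # environment feeds back a reward value
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    s_t, trajectory = env.reset(), []
    for _ in range(max_steps):
        a_t = policy(s_t)                  # agent performs action a_t
        s_next, r_t, done = env.step(a_t)  # environment reacts: s_{t+1}, reward
        trajectory.append((s_t, a_t, s_next, r_t))
        s_t = s_next
        if done:
            break
    return trajectory

traj = run_episode(GridEnv(), policy=lambda s: 1)  # always move right
```

Cycling this loop produces the (state, action, next state, reward) transitions from which a strategy is learned.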
When solving practical problems with reinforcement learning algorithms, the design of the reward values is an important component: colloquially, it can be understood as the main indicator from which an agent constructs its "value view". The agent can only judge how good an executed action was from the reward value it receives for it. On this basis, the present application uses a reward function with a changing gradient and provides a training method for the behavior strategy of a virtual object. A virtual object refers to an agent in a virtual scene such as a game, for example a virtual character in the game. A behavior strategy means that the virtual object can automatically execute the optimal action when facing a task; how high or low the reward value is measures the action.
Fig. 3 is a schematic view of an application scenario of a training method for a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 3, the application scenario includes a plurality of clients 310, and the clients 310 may be Personal Computers (PCs), tablet computers, smart phones, Personal Digital Assistants (PDAs), and the like, in which application programs are installed. The client 310 may use the method provided in the present application to train the behavior policy of the virtual object, so as to automatically execute the optimal policy for completing the task.
In an embodiment, the application scenario further includes a server 320, and the server 320 may be a server, a server cluster, or a cloud computing center. The server 320 and the client 310 may be connected through a wired or wireless network. The server 320 may train the behavior policy of the virtual object by using the method provided by the present application, and then the server 320 may control the client 310 to automatically execute the optimal policy for completing the task according to the trained behavior policy.
In an embodiment, the present application further provides an electronic device, which may be the client 310 or the server 320. By way of example with the server 320, as shown in fig. 3, the electronic device may include a processor 321; a memory 322 for storing instructions executable by the processor 321; the processor 321 is configured to execute the training method of the virtual object behavior strategy provided in the present application.
The Memory 322 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk.
A computer-readable storage medium is also provided, and the storage medium stores a computer program, which can be executed by the processor 321 to perform the method for training the behavior policy of the virtual object provided in the present application.
Fig. 4 is a schematic flowchart of a method for training a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 4, the method includes the following steps 410-430.
In step 410, pre-and post-state data of the virtual object performing the interaction is obtained.
The before-and-after state data are the state data before the interactive action is executed and the state data after it is executed. The state data may be, for example, the current position and target position of a virtual character in a game, the positions of the pieces on a chessboard, or the road conditions and motion state of an autonomous vehicle. By preprocessing the chessboard image when playing chess, the image of the virtual environment where the virtual character is located in a game, or the road-surface image, image features can be extracted to determine the position state of the virtual character in the game, the piece positions, the road conditions, and so on.
The interactive action refers to an action that can be performed by a virtual object. For example, the virtual object may be a virtual character in a basketball game that can perform actions such as "walk one step left", "jump", "shoot", and "defend". To maximize the cumulative reward, the interactive action performed by the virtual object may be chosen with an ε-greedy algorithm: with probability ε a random action is selected, and otherwise the action known to maximize the expected future return Q is selected.
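The ε-greedy selection described above can be sketched as follows; the function and variable names are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index; otherwise
    pick the action with the largest expected future return Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

a = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # purely greedy: index 1
```

Early in training ε is typically kept high to encourage exploration and decayed as the value estimates improve.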
In step 420, calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The virtual object may perform one or more tasks, and different tasks may correspond to different reward functions. The reward function is used to indicate the amount of reward value given after different actions are performed in different states. With a gradient change is meant that the reward function is non-linear. The reward function can be set in advance and stored in the server side, and therefore training efficiency is improved. The current state refers to the state of the virtual object after the interactive action is performed, for example, the position of the virtual object is moved forward and backward. The target state may be an end position of travel of the virtual object, a desired motion state parameter, or the like, depending on the task. The current state and the target state may be represented by a multi-dimensional feature vector. The distance between the current state and the target state may be a euclidean distance between the current state vector and the target state vector.
The reward functions can be divided into two broad categories, the first category being positive reward functions with a gradient as shown in fig. 5, and the second category being negative reward functions with a gradient as shown in fig. 6. The abscissa represents the distance of the current state from the goal state and the ordinate represents the prize value. The two types of reward functions are distinguished in that the first type of reward values are both positive values to encourage good actions by the virtual object, and the second type of reward values are both negative values to penalize bad actions by the virtual object. The selection of a particular reward function may be based on actual task needs.
Taking the positive reward function y = 1 - x^0.4 of FIG. 5 as an example: as x decreases, the value of y gradually increases, and the gradient also gradually increases. This reward function can be used to guide the virtual object to approach a target; the closer the virtual object gets to the target, the larger the change in its reward value per step, which guides the virtual object to the target better than a linear reward curve.
The corresponding learning process of humans and animals can be understood as follows: for example, when taking part in a marathon, one's mood is likely to differ depending on the distance from the finish line, and the eagerness to succeed grows stronger the closer one gets. Because the gradient changes with the distance between the current state and the target state of the virtual object, learning the behavior strategy better fits the real learning process of humans and animals, and learning efficiency is higher. Similarly, for the negative reward function y = -x^2.8 in FIG. 6: when x becomes larger, the virtual object is farther from the target point, which is not the desired result, so the farther away it is, the stronger the penalty should be, corresponding to a gradually increasing gradient of the reward-function curve. The gradient of the reward function can thus be considered to vary with the distance between the current state and the target state of the virtual object; depending on the task, the gradient may grow as the distance grows, or grow as the distance shrinks.
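Both example reward functions can be written out directly; the comparison below illustrates the steepening gradient near the goal that the text describes (x is assumed normalized to [0, 1]):

```python
def positive_reward(x):
    """FIG. 5: y = 1 - x**0.4.  The reward grows, and its gradient
    steepens, as the agent approaches the goal (x -> 0)."""
    return 1 - x ** 0.4

def negative_reward(x):
    """FIG. 6: y = -x**2.8.  The penalty, and its gradient, grow as
    the agent drifts farther from the goal (x -> 1)."""
    return -x ** 2.8

# One step of progress near the goal changes the reward more than the
# same step far from it, matching the "eager to succeed" intuition:
near = positive_reward(0.1) - positive_reward(0.2)
far = positive_reward(0.8) - positive_reward(0.9)
```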
In one embodiment, suppose the virtual object performs a task whose reward function is y = 1 - x^0.4, the virtual object is in state s_t, and the state after executing interactive action a is s_{t+1}. s_{t+1} can be considered the current state; if its distance from the target state is x_{t+1}, then x_{t+1} can be substituted into y = 1 - x^0.4 to calculate the corresponding y value. The calculated y value is the reward value for executing interactive action a in state s_t.
In an embodiment, the client or server may, in state s_0, control the virtual object to perform an interactive action a_0, obtaining a new state s_1 and a reward value r_0; (s_0, a_0, s_1, r_0) can be constructed as a four-tuple of experience data and added to an experience pool. Continuing in state s_1, the virtual object performs interactive action a_1, obtaining a new state s_2 and reward value r_1; this new four-tuple of experience data is likewise added to the experience pool. By continuously cycling, a large amount of experience data (s_t, a_t, s_{t+1}, r_t) is stored in the experience pool.
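The experience pool described above can be sketched as a bounded buffer of four-tuples; the class name and capacity are illustrative:

```python
from collections import deque
import random

class ExperiencePool:
    """Bounded buffer of (s_t, a_t, s_{t+1}, r_t) experience tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, s_t, a_t, s_next, r_t):
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, batch_size):
        """Draw a random batch for training."""
        return random.sample(self.buffer, batch_size)

pool = ExperiencePool()
pool.add(0, 1, 1, 0.5)  # (s0, a0, s1, r0)
pool.add(1, 1, 2, 0.6)  # (s1, a1, s2, r1)
```

Sampling random batches rather than consecutive transitions is a common way to decorrelate the training data.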
In step 430, a behavior strategy for reaching the goal state is trained using the pre-and post-interaction state data and the reward value.
Based on the large amount of (s_t, a_t, s_{t+1}, r_t) experience data stored in the experience pool, a behavior strategy for reaching the target state may be trained through a reinforcement learning algorithm such as DQN (Deep Q-Network), Policy Gradient, Actor-Critic, and the like. Training the behavior strategy for reaching the target state means finding a policy that can automatically control the virtual object to execute the optimal action when facing any given state, so that the overall number of steps to reach the target state is minimal.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with a changing gradient is used to calculate the reward value for executing the interactive action in each state, so that a behavior strategy for reaching the target state is trained from the experience data. Because the gradient of the reward function changes with the distance between the current state and the target state of the virtual object, the change in the reward value better matches the learning rules of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
In an embodiment, as shown in fig. 7, the step 420 may include the following steps 701 and 702.
In step 701, according to the current state after the interactive action is performed, a distance from the current state to the target state of the virtual object is calculated.
The virtual object, in state s_t, performs interactive action a_t and then enters the next state s_{t+1}. The state data of the virtual object after the interactive action, i.e. the current state, can be obtained by capturing an image of the next state and extracting image features. For example, the state data after the interaction is performed may be represented by a multidimensional vector (a_x, b_x, c_x, d_x), and the target state may be a known multidimensional vector (a_y, b_y, c_y, d_y); the distance from the current state to the target state may then be calculated as the Euclidean distance between (a_x, b_x, c_x, d_x) and (a_y, b_y, c_y, d_y).
In one embodiment, the current state after the interaction is performed may be the position coordinates of the virtual character, and the target state is the destination coordinates of the virtual character, so the distance from the virtual object to the target state may be the euclidean distance between the position coordinates and the destination coordinates.
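The Euclidean distance computation of this step can be sketched as follows; the names are illustrative:

```python
import math

def distance_to_target(current, target):
    """Euclidean distance between the current-state vector, e.g.
    (a_x, b_x, c_x, d_x), and the target-state vector (a_y, b_y, c_y, d_y)."""
    return math.sqrt(sum((c - t) ** 2 for c, t in zip(current, target)))

# Position coordinates of the virtual character vs. destination coordinates:
d = distance_to_target((0.0, 0.0), (3.0, 4.0))
```

The same function works for any state dimensionality, since it zips the two vectors component-wise.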
In step 702, the distance is used as the input of the reward function, and the reward value of the interaction action output by the reward function is obtained.
For example, the reward function may be y = 1 - x^0.4; x can be considered the input of the reward function and y its output. After the distance is calculated in step 701, it can be substituted for x in the reward function to calculate the corresponding y value, which can be considered the reward value for executing interactive action a_t in state s_t.
In one embodiment, the task referred to at step 420 may include a plurality of subtasks, each subtask having a corresponding reward function. Thus, as shown in fig. 8, the above step 410 may include the following steps 411 and 412. The step 420 may include the following steps 421 and 422.
In step 411, a selection of sub-interactions is made for each sub-task.
For multiple tasks that need to be completed simultaneously, each task may be referred to as a subtask. For example, a task may be learning to sing while dancing, and the subtasks may be learning to sing and learning to dance. Different subtasks may correspond to different reward functions. A sub-interactive action is an action the virtual object can execute for a given subtask, and the action types can be configured in advance. For each subtask, the server may select an interactive action corresponding to that subtask. The specific selection may be made, as described above, either by choosing randomly or by choosing the action with the greatest known expected future return.
In step 412, the virtual object is controlled to execute the sub-interaction under each sub-task, and front and back sub-state data of the sub-interaction under each sub-task is obtained.
The server can control the virtual object to simultaneously execute the sub-interactive action selected under each subtask, and acquire, for each subtask, the state data before and after the corresponding sub-interactive action is executed. For example, suppose there are two subtasks U and V. Under the U task, the virtual object selects and executes action a_u1, obtaining the next state s_u1; under the V task, it selects and executes action a_v1, obtaining the next state s_v1. The next states are then processed in the same way, and an experience pool is built by continuously looping this process.
In step 421, the branch reward for executing the corresponding sub-interaction under each sub-task is calculated according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task.
On the basis of the above embodiment, for subtask U the server can calculate, through the reward function corresponding to subtask U, the reward value r_u1 for executing action a_u1. For subtask V, the server can calculate, through the reward function corresponding to subtask V, the reward value r_v1 for executing action a_v1. A branch reward is the reward value for executing the corresponding sub-interactive action under a given subtask; for example, the reward values r_u1 and r_v1 can each be regarded as branch rewards.
In step 422, the branch rewards for executing the corresponding sub-interaction under each sub-task are superimposed, and the reward value for executing all the sub-interaction by the virtual object is obtained.
The superposition may be adding or multiplying the branch rewards of the corresponding sub-interactive actions executed under each subtask, and the superposition result is taken as the reward value of the virtual object for executing the interactive actions. For example, r_u1 and r_v1 can be added as the reward value for executing the actions (a_u1, a_v1) in the states (s_u1, s_v1).
In an embodiment, the step 422 may specifically include the following steps: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
For example, the subtasks may be singing and dancing. Assuming more emphasis is placed on learning to sing, the singing subtask may be given a larger weight (e.g., 60%) and the dancing subtask a smaller weight (e.g., 40%). The branch reward r_u1 of the singing subtask is then multiplied by its weight 60% and added to the branch reward r_v1 of the dancing subtask multiplied by its weight 40%, giving the reward value of the virtual object for executing the interactive actions: r_1 = 60% × r_u1 + 40% × r_v1.
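The weighted superposition of branch rewards can be sketched as follows; the subtask names, weights, and branch reward values are illustrative, and the helper enforces the implicit assumption that the weights sum to 1:

```python
def combined_reward(branch_rewards, weights):
    # Weighted addition of per-subtask branch rewards,
    # e.g. r_1 = 0.6 * r_u1 + 0.4 * r_v1
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[task] * r for task, r in branch_rewards.items())

r1 = combined_reward({"sing": 0.8, "dance": 0.5},
                     {"sing": 0.6, "dance": 0.4})
# 0.6 * 0.8 + 0.4 * 0.5 = 0.68
```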
In an embodiment, as shown in fig. 9, the step 421 may specifically include the following steps 4211-4213.
In step 4211, for each subtask, the distance from the sub-state data to the target state of the virtual object is calculated according to the sub-state data obtained after the corresponding sub-interactive action is executed under that subtask.
For example, under the U task the virtual object selects and executes action a_u1, obtaining the next state s_u1; under the V task it selects and executes action a_v1, obtaining the next state s_v1. Then s_u1 can be considered the sub-state data after performing the sub-interactive action a_u1 under task U, and s_v1 the sub-state data after performing the sub-interactive action a_v1 under task V.
The sub-state data can also be represented by a multidimensional feature vector, and the target state can be considered a known quantity set in advance for each subtask. Therefore, for subtask U, the distance x_U between the sub-state data s_u1 and the target state data can be calculated; for subtask V, the distance x_V between the sub-state data s_v1 and the target state data can be calculated.
In step 4212, the distance under each subtask is normalized.
Normalization refers to constraining the distance to the interval [0, 1]. The normalization may be done by dividing the distance under each subtask by the maximum value of the distance under that subtask.
In step 4213, for each subtask, the distance normalized under the subtask is used as an input of a reward function corresponding to the subtask, and a branch reward of a sub interaction action corresponding to the subtask output by the reward function is obtained.
For example, assume subtasks U and V exist with normalized distances X_U and X_V respectively, the reward function of subtask U is y = 1 - x^0.4, and the reward function of subtask V is y = 1 - x^2.8. Then X_U may be substituted as the variable x of the reward function y = 1 - x^0.4, and the corresponding y value is the branch reward for executing action a_u1 under the U task. Likewise, X_V may be substituted as the variable x of the reward function y = 1 - x^2.8, and the corresponding y value is the branch reward for executing action a_v1 under the V task.
It can also be seen from fig. 5 and 6 that when the variable x lies in the interval [0, 1], the y value of the corresponding reward function is also constrained to [0, 1]; that is, once the variable x is normalized into [0, 1], the output y of the reward function always lies between 0 and 1. In other words, by normalizing the distance under each subtask, the branch reward of the sub-interactive action under each subtask is controlled within the [0, 1] interval. The rewards of different tasks are thus kept on the same order of magnitude, making it convenient to control the reward proportion between tasks during multi-task learning.
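Steps 4211-4213 can be sketched together as follows; the raw distances and per-subtask maximum distances are illustrative assumptions, while the exponents 0.4 and 2.8 follow the example reward functions above:

```python
def normalize(distance, max_distance):
    # Divide by the maximum distance under the subtask, constraining to [0, 1]
    return distance / max_distance

def branch_reward(normalized_distance, exponent):
    # Gradient-changing reward y = 1 - x^exponent on the normalized distance
    return 1.0 - normalized_distance ** exponent

x_u = normalize(3.0, 10.0)    # subtask U, reward function y = 1 - x^0.4
x_v = normalize(40.0, 50.0)   # subtask V, reward function y = 1 - x^2.8
r_u = branch_reward(x_u, 0.4)
r_v = branch_reward(x_v, 2.8)
# Both branch rewards are guaranteed to lie in [0, 1], keeping the two
# subtasks' rewards on the same order of magnitude
```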
In one embodiment, as shown in fig. 10, the step 430 may include the following steps:
in step 431, a neural network model of the behavioral policy is built.
The goal of reinforcement learning is to maximize some expectation. The outcome of a currently executed action affects subsequent states, so it is necessary to judge whether the action will receive a good return in the future; this return is delayed. Take the game of Go (weiqi) as an example: a single move does not end the game immediately, but it influences the subsequent game, so the probability of winning in the future, which is itself random, must be maximized. Therefore, a neural network model is constructed to learn the Q value (the expected future profit value), i.e., the expected future return of performing a certain action in a certain state. At this point the parameters of the neural network model are initial values and the calculated Q value is inaccurate, so the parameters need to be updated by gradient descent using the experience pool to fit the Q value through training.
In step 432, a set of empirical data including the front-back state data, the interaction actions, and the reward value is obtained, the rear state data in the empirical data is used as the input of the neural network model, and the maximum output value of the neural network model is obtained according to the output of the neural network model corresponding to different interaction actions in the rear state data.
The server can randomly extract experience data (s_t, a_t, s_{t+1}, r_t) from the experience pool for learning. Specifically, if s_{t+1} is not the target state, s_{t+1} is used as an input of the neural network model. Assuming 4 interactive actions (left, right, up, and down) can be performed, each interactive action is also used as an input of the neural network model, and the Q values of performing each of the four actions in state s_{t+1} can be calculated by the model; these Q values are the outputs of the neural network model. Obtaining the maximum output value of the neural network model means obtaining the largest expected future profit, Q_max(s_{t+1}, a'). If s_{t+1} is the target state, the reward value r_t is the expected future profit of executing action a_t in state s_t and can be used directly to update the parameters of the neural network model.
In step 433, the reward value in the experience data is added to the maximum output value to obtain a target profit value.
If s_{t+1} is not the target state, the reward value r_t is added to the discounted maximum output value γQ_max(s_{t+1}, a'); the result can be regarded as the expected future profit of executing action a_t in state s_t, referred to here as the target profit value for differentiation. The target profit value is thus r_t + γQ_max(s_{t+1}, a'). γ is called the discount factor and takes a value in [0, 1]. The discount factor is used because the future carries more uncertainty, so the return value decays over time; adding such an exponentially decaying discount factor also prevents the summation from diverging to infinity.
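The claim that the discount factor keeps the summation finite can be checked numerically: for a constant per-step reward r and γ in [0, 1), the discounted sum Σ γ^t · r is bounded by r / (1 - γ). The values below are illustrative.

```python
gamma, r = 0.9, 1.0

# Discounted return over a long horizon; each term is gamma^t * r
discounted = sum(gamma ** t * r for t in range(1000))

# Geometric-series bound: the infinite sum converges to r / (1 - gamma)
bound = r / (1 - gamma)  # 10.0 here
```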
In step 434, the pre-state data in the empirical data and the interaction under the pre-state data are used as the input of the neural network model, and the parameters of the neural network model are updated, so that the future expected value output by the neural network model approaches to the target profit value.
s_t and a_t are used as inputs of the neural network model, and the value output by the model may be referred to as the future expected value. Since step 433 has already calculated the target profit value of executing action a_t in state s_t, the parameters of the neural network model can be updated to make the future expected value approach the target profit value. By continuously looping through steps 432-434, the gap between the target profit value and the future expected value can be minimized; when the gap is less than a threshold, training may be considered complete. With the trained neural network model, the interactive action with the largest expected future return in each state can be determined, so the virtual object can be controlled to execute the action with the largest Q value at each step, thereby learning an optimal strategy for completing the task.
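The update rule of steps 432-434 can be sketched with a tabular stand-in for the neural network Q model: the target profit value is r_t + γ·max_a' Q(s_{t+1}, a') (or just r_t at the target state), and the stored estimate Q(s_t, a_t) is nudged toward that target. The states, actions, learning rate, and table contents below are illustrative assumptions, not the application's implementation.

```python
def q_update(Q, s, a, r, s_next, terminal, gamma=0.9, lr=0.5):
    if terminal:
        target = r                                   # s_next is the target state
    else:
        target = r + gamma * max(Q[s_next].values()) # r_t + gamma * Q_max(s_{t+1}, a')
    # Move the future expected value toward the target profit value
    Q[s][a] += lr * (target - Q[s][a])
    return Q[s][a]

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 2.0}}
q_update(Q, "s0", "right", 1.0, "s1", terminal=False)
# target = 1.0 + 0.9 * 2.0 = 2.8; Q["s0"]["right"] moves from 0.0 to 1.4
```

In the application this table is replaced by a neural network whose parameters are updated by gradient descent, but the target the estimate is pulled toward is the same.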
In order to verify the advantages of the training method using a reward function with gradient change provided by the present application, a simple verification was performed using the robot-arm training environment shown in fig. 11 (the arm 101 aims to approach the block 102); the algorithm may use the DDPG (Deep Deterministic Policy Gradient) algorithm. The experimental results are as follows.
FIG. 12 is a schematic diagram of the learning processes. The curve labeled "None" is the learning process of an agent without a reward function, from which it can be seen that the agent never manages to learn the task. The curve labeled "y = -x" is the learning process of an agent using a linear reward function, and the curve labeled "y = -(x)^2.8" is the learning process of an agent using the reward function with gradient change proposed by the present application. From the results, the reward function with gradient change is the most effective in terms of learning efficiency.
The following is an embodiment of the apparatus of the present application, which may be used to execute a training embodiment of a virtual object behavior policy executed by the client or the server in the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the training method for the behavior policy of the virtual object of the present application.
Fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application. As shown in fig. 13, the training device of the virtual object behavior strategy may include the following modules: a data acquisition module 1310, a reward calculation module 1320, and a policy training module 1330.
A data obtaining module 1310, configured to obtain pre-and post-state data of the virtual object performing the interaction.
A reward calculation module 1320, configured to calculate a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for a task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The strategy training module 1330 is configured to train the behavior strategy to reach the goal state by using the pre-and post-interaction state data and the reward value.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of the corresponding step in the training method of the virtual object behavior policy, and is not described herein again.
In an embodiment, the reward calculation module 1320 is specifically configured to calculate, according to a current state after the interaction is performed, a distance from the current state to the target state of the virtual object; and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the data obtaining module 1310 is specifically configured to: selecting a sub-interaction action for each sub-task; and controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring the front and back sub-state data of the sub-interactive action under each sub-task.
The reward calculation module 1320 specifically includes: the device comprises a branch reward calculation unit and a branch reward superposition unit. The branch reward calculation unit is used for calculating branch rewards for executing corresponding sub-interaction actions under each sub-task according to the reward functions corresponding to each sub-task and the sub-interaction actions under each sub-task; and the branch reward overlapping unit is used for overlapping branch rewards for executing corresponding sub-interactive actions under each sub-task and obtaining reward values of all the sub-interactive actions executed by the virtual object.
In an embodiment, the branch prize stacking unit is specifically configured to: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
In an embodiment, the branch reward calculating unit is specifically configured to: for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask; normalizing the distance under each subtask; and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the strategy training module 1330 includes the following elements: the device comprises a network building unit, a maximum value obtaining unit, a target calculating unit and a parameter updating unit.
And the network building unit is used for building a neural network model of the behavior strategy.
And the maximum value acquisition unit is used for acquiring a group of experience data comprising the front and rear state data, the interactive action and the reward value, taking the rear state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the rear state data.
And the target calculating unit is used for adding the reward value in the experience data with the maximum output value to obtain a target profit value.
And the parameter updating unit is used for taking the previous state data in the empirical data and the interactive action under the previous state data as the input of the neural network model, updating the parameters of the neural network model and enabling the future expected value output by the neural network model to approach the target profit value.
In an embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Claims (10)
1. A training method of a virtual object behavior strategy is characterized by comprising the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
2. The method according to claim 1, wherein the calculating of the reward value for the virtual object to perform the interaction according to the reward function with gradient change configured for the task performed by the virtual object in advance comprises:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
3. The method of claim 1, wherein the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
4. The method according to claim 3, wherein the step of superposing the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions comprises the following steps:
and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
5. The method according to claim 3, wherein the calculating of the branch reward for executing the corresponding sub-interaction under each sub-task according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task comprises:
for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
6. The method of claim 1, wherein training the behavior strategy to reach the goal state using pre-and post-state data and reward values for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
7. The method of claim 1,
the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases along with the decrease of the distance between the current state and the target state of the virtual object;
or,
the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
8. An apparatus for training a behavior strategy of a virtual object, comprising:
the data acquisition module is used for acquiring the state data before and after the virtual object executes the interactive action;
the reward calculation module is used for calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change, which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and the strategy training module is used for training the behavior strategy reaching the target state by utilizing the state data before and after the interactive action is executed and the reward value.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of training of virtual object behavior strategy of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of training a behavior strategy of a virtual object according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911254761.9A CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911254761.9A CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111026272A true CN111026272A (en) | 2020-04-17 |
CN111026272B CN111026272B (en) | 2023-10-31 |
Family
ID=70208257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911254761.9A Active CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111026272B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401556A (en) * | 2020-04-22 | 2020-07-10 | 清华大学深圳国际研究生院 | Selection method of opponent type imitation learning winning incentive function |
CN112101563A (en) * | 2020-07-22 | 2020-12-18 | 西安交通大学 | Confidence domain strategy optimization method and device based on posterior experience and related equipment |
CN112221140A (en) * | 2020-11-04 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Motion determination model training method, device, equipment and medium for virtual object |
CN113663335A (en) * | 2021-07-15 | 2021-11-19 | 广州三七极耀网络科技有限公司 | AI model training method, device, equipment and storage medium for FPS game |
CN114146420A (en) * | 2022-02-10 | 2022-03-08 | 中国科学院自动化研究所 | Resource allocation method, device and equipment |
WO2022100363A1 (en) * | 2020-11-13 | 2022-05-19 | 腾讯科技(深圳)有限公司 | Robot control method, apparatus and device, and storage medium and program product |
CN115648204A (en) * | 2022-09-26 | 2023-01-31 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient |
CN109445456A (en) * | 2018-10-15 | 2019-03-08 | 清华大学 | A kind of multiple no-manned plane cluster air navigation aid |
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agency for autonomous driving application |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
CN109847366A (en) * | 2019-01-29 | 2019-06-07 | 腾讯科技(深圳)有限公司 | Data for games treating method and apparatus |
CN109974737A (en) * | 2019-04-11 | 2019-07-05 | 山东师范大学 | Route planning method and system based on combination of safety evacuation signs and reinforcement learning |
CN110025959A (en) * | 2019-01-25 | 2019-07-19 | 清华大学 | Method and apparatus for controlling intelligent body |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | A kind of parking strategy based on deeply study |
US20190258938A1 (en) * | 2016-11-04 | 2019-08-22 | Deepmind Technologies Limited | Reinforcement learning with auxiliary tasks |
CN110178364A (en) * | 2017-01-13 | 2019-08-27 | 微软技术许可有限责任公司 | Optimum scanning track for 3D scene |
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
US20190258938A1 (en) * | 2016-11-04 | 2019-08-22 | Deepmind Technologies Limited | Reinforcement learning with auxiliary tasks |
CN110178364A (en) * | 2017-01-13 | 2019-08-27 | 微软技术许可有限责任公司 | Optimal scanning trajectory for 3D scenes |
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agent for autonomous driving applications |
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | Parking strategy based on deep reinforcement learning |
CN109445456A (en) * | 2018-10-15 | 2019-03-08 | 清华大学 | Multi-UAV swarm navigation method |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | Human-like car-following model for autonomous driving based on deep reinforcement learning |
CN110025959A (en) * | 2019-01-25 | 2019-07-19 | 清华大学 | Method and apparatus for controlling an intelligent agent |
CN109847366A (en) * | 2019-01-29 | 2019-06-07 | 腾讯科技(深圳)有限公司 | Game data processing method and apparatus |
CN109974737A (en) * | 2019-04-11 | 2019-07-05 | 山东师范大学 | Route planning method and system based on combination of safety evacuation signs and reinforcement learning |
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | Game following method and system based on curriculum reinforcement learning |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401556A (en) * | 2020-04-22 | 2020-07-10 | 清华大学深圳国际研究生院 | Method for selecting reward functions in adversarial imitation learning |
CN112101563A (en) * | 2020-07-22 | 2020-12-18 | 西安交通大学 | Confidence domain strategy optimization method and device based on posterior experience and related equipment |
CN112221140A (en) * | 2020-11-04 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Motion determination model training method, device, equipment and medium for virtual object |
CN112221140B (en) * | 2020-11-04 | 2024-03-22 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for training action determination model of virtual object |
WO2022100363A1 (en) * | 2020-11-13 | 2022-05-19 | 腾讯科技(深圳)有限公司 | Robot control method, apparatus and device, and storage medium and program product |
CN113663335A (en) * | 2021-07-15 | 2021-11-19 | 广州三七极耀网络科技有限公司 | AI model training method, device, equipment and storage medium for FPS game |
CN114146420A (en) * | 2022-02-10 | 2022-03-08 | 中国科学院自动化研究所 | Resource allocation method, device and equipment |
CN115648204A (en) * | 2022-09-26 | 2023-01-31 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
CN115648204B (en) * | 2022-09-26 | 2024-08-27 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
Also Published As
Publication number | Publication date |
---|---|
CN111026272B (en) | 2023-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111026272A (en) | Training method and device for virtual object behavior strategy, electronic equipment and storage medium | |
US9679258B2 (en) | Methods and apparatus for reinforcement learning | |
CN108920221B (en) | Game difficulty adjusting method and device, electronic equipment and storage medium | |
CN107158708A (en) | Multi-player video game matching optimization | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
Efthymiadis et al. | Using plan-based reward shaping to learn strategies in StarCraft: Broodwar |
Mousavi et al. | Applying Q(λ)-learning in deep reinforcement learning to play Atari games |
CN114404975B (en) | Training method, device, equipment, storage medium and program product of decision model | |
Khan et al. | Playing first-person shooter games with machine learning techniques and methods using the VizDoom Game-AI research platform | |
Varghese et al. | A hybrid multi-task learning approach for optimizing deep reinforcement learning agents | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN113509726B (en) | Interaction model training method, device, computer equipment and storage medium | |
CN109731338A (en) | Artificial intelligence training method and device, storage medium and electronic device in game | |
Yin et al. | A data-driven approach for online adaptation of game difficulty | |
CN116510302A (en) | Analysis method and device for abnormal behavior of virtual object and electronic equipment | |
Pons et al. | Scenario control for (serious) games using self-organizing multi-agent systems | |
Kuo et al. | Applying hybrid learning approach to RoboCup's strategy | |
US11478716B1 (en) | Deep learning for data-driven skill estimation | |
US11413541B2 (en) | Generation of context-aware, personalized challenges in computer games | |
Daswani et al. | Reinforcement learning with value advice | |
Togelius et al. | Evolutionary Machine Learning and Games | |
CN110831677A (en) | System and method for managing content presentation in a multiplayer online game | |
CN117648585B (en) | Intelligent decision model generalization method and device based on task similarity | |
Picardi | A comparison of Different Machine Learning Techniques to Develop the AI of a Virtual Racing Game | |
West | Self-play deep learning for games: Maximising experiences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||