CN111026272A - Training method and device for virtual object behavior strategy, electronic equipment and storage medium - Google Patents
- Publication number
- CN111026272A CN111026272A CN201911254761.9A CN201911254761A CN111026272A CN 111026272 A CN111026272 A CN 111026272A CN 201911254761 A CN201911254761 A CN 201911254761A CN 111026272 A CN111026272 A CN 111026272A
- Authority
- CN
- China
- Prior art keywords
- virtual object
- reward
- sub
- subtask
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/011—Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2203/00—Indexing scheme relating to G06F3/00 - G06F3/048
- G06F2203/01—Indexing scheme relating to G06F3/01
- G06F2203/012—Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment
Abstract
The application provides a training method and apparatus for a virtual object behavior strategy, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence. The method specifically comprises the following steps: acquiring state data from before and after the virtual object executes an interactive action; calculating the reward value for the virtual object executing the interactive action according to a reward function with a changing gradient, configured in advance for the task executed by the virtual object, where the gradient varies with the distance between the current state of the virtual object after the interactive action and the target state; and training a behavior strategy for reaching the target state using the before-and-after state data and the reward value. In this way, the change in the reward value better matches the way humans and animals learn, thereby improving training efficiency and simulating the learning process of humans and animals more quickly.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for training a behavior policy of a virtual object, an electronic device, and a computer-readable storage medium.
Background
Reinforcement learning is a sub-field of machine learning. Its main idea is to update an agent's understanding of its environment through interaction with that environment and the reward obtained from it, so that a better strategy can be generated that increases the cumulative long-term reward the agent obtains from the environment; in theory, through continuous training, the agent can gradually converge to an optimal strategy for a given environment.
As shown in FIG. 1, for example, in the game "Montezuma's Revenge", the agent receives a +1 reward for each step towards the bottom of the ladder and a -1 penalty for each step away from it, so the agent should learn to go down the ladder quickly, because that yields more reward.
However, this may not improve the agent's learning efficiency well. If an agent needs to be trained to cross a river along a small bridge, the prior art designs the reward to give a fixed score every time the agent moves one step closer to the river. But in the way humans and animals actually learn, the degree of inner excitement does not change linearly during the process of crossing the river; a design based on the existing linear reward therefore does not match the change in mental state of a real player, so the training effect is poor and the training efficiency is low.
Disclosure of Invention
An embodiment of the application provides a training method for a virtual object behavior strategy, which is used to improve training efficiency.
The application provides a training method of a virtual object behavior strategy, which comprises the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
In an embodiment, the calculating a reward value for the virtual object to perform the interaction according to a reward function with gradient change configured for the task performed by the virtual object in advance includes:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
In an embodiment, the overlaying of the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions includes:
and weighting and summing the branch rewards corresponding to each subtask according to the weight configured for each subtask, to obtain the reward value for the virtual object executing all the sub-interactive actions.
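The weighted superposition described in this embodiment can be sketched as follows; the function name and the example weights are illustrative, not taken from the application:

```python
def total_reward(branch_rewards, weights):
    """Weighted superposition of per-subtask branch rewards:
    each branch reward is multiplied by the weight configured
    for its subtask, and the products are summed."""
    return sum(w * r for r, w in zip(branch_rewards, weights))

# Two subtasks with branch rewards 0.8 and 0.2 and weights 0.7 and 0.3:
r = total_reward([0.8, 0.2], weights=[0.7, 0.3])
```

Here `r` is the reward value for the virtual object executing all the sub-interactive actions in one step.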
In an embodiment, the calculating, according to the reward function corresponding to each subtask and the sub-interaction action under each subtask, a branch reward for executing the corresponding sub-interaction action under each subtask includes:
for each subtask, calculating the distance from the current sub-state to the target state of the virtual object according to the sub-state data obtained after the corresponding sub-interactive action is executed under that subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the training of the behavior strategy to reach the goal state using the contextual state data and the reward value for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
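The target profit value described in this embodiment (the reward from the experience tuple plus the maximum model output for the subsequent state) can be sketched as follows; a discount factor `gamma` is assumed here for generality, although the claim does not name one:

```python
def td_target(reward, next_q_values, gamma=1.0):
    """Target profit value: the reward value in the experience data
    plus the maximum output of the neural network model for the
    next-state data, optionally discounted by gamma (assumed 1.0)."""
    return reward + gamma * max(next_q_values)

# Model outputs for three candidate interactive actions in the next state:
y = td_target(reward=0.5, next_q_values=[1.0, 2.0, 0.5])
```

The network parameters are then updated so the expected value output for the previous state and action approaches `y`.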
In one embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
The present application further provides a training apparatus for a virtual object behavior strategy, the apparatus including:
the data acquisition module is used for acquiring the state data before and after the virtual object executes the interactive action;
the reward calculation module is used for calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change, which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and the strategy training module is used for training the behavior strategy reaching the target state by utilizing the state data before and after the interactive action is executed and the reward value.
In addition, the present application also provides an electronic device, which includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the training method of the virtual object behavior strategy.
Further, the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is executable by a processor to perform the method for training the behavior policy of the virtual object provided in the present application.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with a changing gradient is used to calculate the reward value for executing the interactive action in each state, so that a behavior strategy for reaching the target state is trained from the experience data. Because the gradient of the reward function changes with the distance between the current state and the target state of the virtual object, the change in the reward value better matches the learning rules of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
FIG. 1 is a diagram of the interface of the "Montezuma's Revenge" game in the background art;
FIG. 2 is a schematic diagram illustrating deep reinforcement learning provided by an embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a training method for a behavior strategy of a virtual object according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for training a behavior policy of a virtual object according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a forward reward function provided by an embodiment of the present application;
FIG. 6 is a diagram of a negative reward function provided by an embodiment of the present application;
FIG. 7 is a detailed flowchart of step 420 in a corresponding embodiment of FIG. 4;
FIG. 8 is a detailed flowchart of steps 410 and 420 in the corresponding embodiment of FIG. 4;
FIG. 9 is a flowchart showing details of step 421 in the corresponding embodiment of FIG. 8;
FIG. 10 is a detailed flowchart of step 430 in the corresponding embodiment of FIG. 4;
FIG. 11 is a schematic diagram of a robot arm environment provided by an embodiment of the present application;
FIG. 12 is a graph illustrating training efficiency comparison of different reward functions provided by embodiments of the present application;
fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
FIG. 2 is a schematic diagram of the deep reinforcement learning principle provided by an embodiment of the present application. As shown in FIG. 2, at timestamp t the agent obtains state data s_t from the environment; the agent performs an action a_t; the environment reacts to this action, produces the next state data s_{t+1}, and feeds back a reward value (reward) to the agent. By continuously cycling through this process, based on the reward values fed back for different actions executed in different states, the optimal strategy for achieving the goal can finally be obtained.
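The interaction loop of FIG. 2 can be sketched as follows; the toy one-dimensional `GridEnv` environment and all names are illustrative, not part of the application:

```python
class GridEnv:
    """Toy 1-D environment: the agent starts at position 0, the goal is at 5."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):  # action: +1 (right) or -1 (left)
        self.pos = max(0, self.pos + action)
        done = self.pos == 5
        reward = 1.0 if done else -0.1  # environment feeds back a reward value
        return self.pos, reward, done

def run_episode(env, policy, max_steps=100):
    s_t, trajectory = env.reset(), []
    for _ in range(max_steps):
        a_t = policy(s_t)                  # agent performs action a_t
        s_next, r_t, done = env.step(a_t)  # environment reacts: s_{t+1}, reward
        trajectory.append((s_t, a_t, s_next, r_t))
        s_t = s_next
        if done:
            break
    return trajectory

traj = run_episode(GridEnv(), policy=lambda s: 1)  # always move right
```

Cycling this loop produces the (state, action, next state, reward) transitions from which a strategy is learned.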
When solving practical problems with reinforcement learning algorithms, the design of the reward values is an important component: colloquially, it can be understood as the main indicator from which an agent constructs its "value view". The agent can only judge how good an executed action was from the reward value it receives for it. On this basis, the present application uses a reward function with a changing gradient and provides a training method for the behavior strategy of a virtual object. A virtual object refers to an agent in a virtual scene such as a game, for example a virtual character in the game. A behavior strategy means that the virtual object can automatically execute the optimal action when facing a task; how high or low the reward value is measures the action.
Fig. 3 is a schematic view of an application scenario of a training method for a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 3, the application scenario includes a plurality of clients 310, and the clients 310 may be Personal Computers (PCs), tablet computers, smart phones, Personal Digital Assistants (PDAs), and the like, in which application programs are installed. The client 310 may use the method provided in the present application to train the behavior policy of the virtual object, so as to automatically execute the optimal policy for completing the task.
In an embodiment, the application scenario further includes a server 320, and the server 320 may be a server, a server cluster, or a cloud computing center. The server 320 and the client 310 may be connected through a wired or wireless network. The server 320 may train the behavior policy of the virtual object by using the method provided by the present application, and then the server 320 may control the client 310 to automatically execute the optimal policy for completing the task according to the trained behavior policy.
In an embodiment, the present application further provides an electronic device, which may be the client 310 or the server 320. By way of example with the server 320, as shown in fig. 3, the electronic device may include a processor 321; a memory 322 for storing instructions executable by the processor 321; the processor 321 is configured to execute the training method of the virtual object behavior strategy provided in the present application.
The Memory 322 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk.
A computer-readable storage medium is also provided, and the storage medium stores a computer program, which can be executed by the processor 321 to perform the method for training the behavior policy of the virtual object provided in the present application.
Fig. 4 is a schematic flowchart of a method for training a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 4, the method includes the following steps 410-430.
In step 410, pre-and post-state data of the virtual object performing the interaction is obtained.
The before-and-after state data are the state data before the interactive action is executed and the state data after it is executed. The state data may be, for example, the current position and target position of a virtual character in a game, the positions of the pieces on a chessboard, or the road conditions and motion state of an autonomous vehicle. By preprocessing the chessboard image when playing chess, the image of the virtual environment where the virtual character is located in a game, or the road-surface image, image features can be extracted to determine the position state of the virtual character in the game, the piece positions, the road conditions, and so on.
The interactive action refers to an action that can be performed by a virtual object. For example, the virtual object may be a virtual character in a basketball game that can perform actions such as "walk one step left", "jump", "shoot", and "defend". To maximize the cumulative reward, the interactive action performed by the virtual object may be chosen with an ε-greedy algorithm: with probability ε a random action is selected, and otherwise the action known to maximize the expected future return Q is selected.
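The ε-greedy selection described above can be sketched as follows; the function and variable names are illustrative:

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action index; otherwise
    pick the action with the largest expected future return Q."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

a = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)  # purely greedy: index 1
```

Early in training ε is typically kept high to encourage exploration and decayed as the value estimates improve.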
In step 420, calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The virtual object may perform one or more tasks, and different tasks may correspond to different reward functions. The reward function is used to indicate the amount of reward value given after different actions are performed in different states. With a gradient change is meant that the reward function is non-linear. The reward function can be set in advance and stored in the server side, and therefore training efficiency is improved. The current state refers to the state of the virtual object after the interactive action is performed, for example, the position of the virtual object is moved forward and backward. The target state may be an end position of travel of the virtual object, a desired motion state parameter, or the like, depending on the task. The current state and the target state may be represented by a multi-dimensional feature vector. The distance between the current state and the target state may be a euclidean distance between the current state vector and the target state vector.
The reward functions can be divided into two broad categories, the first category being positive reward functions with a gradient as shown in fig. 5, and the second category being negative reward functions with a gradient as shown in fig. 6. The abscissa represents the distance of the current state from the goal state and the ordinate represents the prize value. The two types of reward functions are distinguished in that the first type of reward values are both positive values to encourage good actions by the virtual object, and the second type of reward values are both negative values to penalize bad actions by the virtual object. The selection of a particular reward function may be based on actual task needs.
Taking the positive reward function y = 1 - x^0.4 of FIG. 5 as an example: as x decreases, the value of y gradually increases, and the gradient also gradually increases. This reward function can be used to guide the virtual object to approach a target; the closer the virtual object gets to the target, the larger the change in its reward value per step, which guides the virtual object to the target better than a linear reward curve.
The corresponding learning process of humans and animals can be understood as follows: for example, when taking part in a marathon, one's mood is likely to differ depending on the distance from the finish line, and the eagerness to succeed grows stronger the closer one gets. Because the gradient changes with the distance between the current state and the target state of the virtual object, learning the behavior strategy better fits the real learning process of humans and animals, and learning efficiency is higher. Similarly, for the negative reward function y = -x^2.8 in FIG. 6: when x becomes larger, the virtual object is farther from the target point, which is not the desired result, so the farther away it is, the stronger the penalty should be, corresponding to a gradually increasing gradient of the reward-function curve. The gradient of the reward function can thus be considered to vary with the distance between the current state and the target state of the virtual object; depending on the task, the gradient may grow as the distance grows, or grow as the distance shrinks.
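Both example reward functions can be written out directly; the comparison below illustrates the steepening gradient near the goal that the text describes (x is assumed normalized to [0, 1]):

```python
def positive_reward(x):
    """FIG. 5: y = 1 - x**0.4.  The reward grows, and its gradient
    steepens, as the agent approaches the goal (x -> 0)."""
    return 1 - x ** 0.4

def negative_reward(x):
    """FIG. 6: y = -x**2.8.  The penalty, and its gradient, grow as
    the agent drifts farther from the goal (x -> 1)."""
    return -x ** 2.8

# One step of progress near the goal changes the reward more than the
# same step far from it, matching the "eager to succeed" intuition:
near = positive_reward(0.1) - positive_reward(0.2)
far = positive_reward(0.8) - positive_reward(0.9)
```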
In one embodiment, suppose the virtual object performs a task whose reward function is y = 1 - x^0.4, the virtual object is in state s_t, and the state after executing interactive action a is s_{t+1}. s_{t+1} can be considered the current state; if its distance from the target state is x_{t+1}, then x_{t+1} can be substituted into y = 1 - x^0.4 to calculate the corresponding y value. The calculated y value is the reward value for executing interactive action a in state s_t.
In an embodiment, the client or server may, in state s_0, control the virtual object to perform an interactive action a_0, obtaining a new state s_1 and a reward value r_0; (s_0, a_0, s_1, r_0) can be constructed as a four-tuple of experience data and added to an experience pool. Continuing in state s_1, the virtual object performs interactive action a_1, obtaining a new state s_2 and reward value r_1; this new four-tuple of experience data is likewise added to the experience pool. By continuously cycling, a large amount of experience data (s_t, a_t, s_{t+1}, r_t) is stored in the experience pool.
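The experience pool described above can be sketched as a bounded buffer of four-tuples; the class name and capacity are illustrative:

```python
from collections import deque
import random

class ExperiencePool:
    """Bounded buffer of (s_t, a_t, s_{t+1}, r_t) experience tuples."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def add(self, s_t, a_t, s_next, r_t):
        self.buffer.append((s_t, a_t, s_next, r_t))

    def sample(self, batch_size):
        """Draw a random batch for training."""
        return random.sample(self.buffer, batch_size)

pool = ExperiencePool()
pool.add(0, 1, 1, 0.5)  # (s0, a0, s1, r0)
pool.add(1, 1, 2, 0.6)  # (s1, a1, s2, r1)
```

Sampling random batches rather than consecutive transitions is a common way to decorrelate the training data.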
In step 430, a behavior strategy for reaching the goal state is trained using the pre-and post-interaction state data and the reward value.
Based on the large amount of (s_t, a_t, s_{t+1}, r_t) experience data stored in the experience pool, a behavior strategy for reaching the target state may be trained through a reinforcement learning algorithm such as DQN (Deep Q-Network), Policy Gradient, Actor-Critic, and the like. Training the behavior strategy for reaching the target state means finding a policy that can automatically control the virtual object to execute the optimal action when facing any given state, so that the overall number of steps to reach the target state is minimal.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with a changing gradient is used to calculate the reward value for executing the interactive action in each state, so that a behavior strategy for reaching the target state is trained from the experience data. Because the gradient of the reward function changes with the distance between the current state and the target state of the virtual object, the change in the reward value better matches the learning rules of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
In an embodiment, as shown in fig. 7, the step 420 may include the following steps 701 and 702.
In step 701, according to the current state after the interactive action is performed, a distance from the current state to the target state of the virtual object is calculated.
The virtual object, in state s_t, performs interactive action a_t and then enters the next state s_{t+1}. The state data of the virtual object after the interactive action, i.e. the current state, can be obtained by capturing an image of the next state and extracting image features. For example, the state data after the interaction is performed may be represented by a multidimensional vector (a_x, b_x, c_x, d_x), and the target state may be a known multidimensional vector (a_y, b_y, c_y, d_y); the distance from the current state to the target state may then be calculated as the Euclidean distance between (a_x, b_x, c_x, d_x) and (a_y, b_y, c_y, d_y).
In one embodiment, the current state after the interaction is performed may be the position coordinates of the virtual character, and the target state is the destination coordinates of the virtual character, so the distance from the virtual object to the target state may be the euclidean distance between the position coordinates and the destination coordinates.
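The Euclidean distance computation of this step can be sketched as follows; the names are illustrative:

```python
import math

def distance_to_target(current, target):
    """Euclidean distance between the current-state vector, e.g.
    (a_x, b_x, c_x, d_x), and the target-state vector (a_y, b_y, c_y, d_y)."""
    return math.sqrt(sum((c - t) ** 2 for c, t in zip(current, target)))

# Position coordinates of the virtual character vs. destination coordinates:
d = distance_to_target((0.0, 0.0), (3.0, 4.0))
```

The same function works for any state dimensionality, since it zips the two vectors component-wise.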
In step 702, the distance is used as the input of the reward function, and the reward value of the interaction action output by the reward function is obtained.
For example, the reward function may be y = 1 - x^0.4; x can be considered the input of the reward function and y its output. After the distance is calculated in step 701, it can be substituted for x in the reward function to calculate the corresponding y value, which can be considered the reward value for executing interactive action a_t in state s_t.
In one embodiment, the task referred to at step 420 may include a plurality of subtasks, each subtask having a corresponding reward function. Thus, as shown in fig. 8, the above step 410 may include the following steps 411 and 412. The step 420 may include the following steps 421 and 422.
In step 411, a selection of sub-interactions is made for each sub-task.
For multiple tasks that need to be completed simultaneously, each task may be referred to as a subtask. For example, a task may be learning to sing while dancing, and the subtasks may be learning to sing and learning to dance. Different subtasks may correspond to different reward functions. A sub-interactive action is an action the virtual object can execute for a given subtask, and the action types can be configured in advance. For each subtask, the server may select an interactive action corresponding to that subtask. The specific selection may be made, as described above, either by choosing randomly or by choosing the action with the greatest known expected future return.
In step 412, the virtual object is controlled to execute the sub-interaction under each sub-task, and front and back sub-state data of the sub-interaction under each sub-task is obtained.
The server can control the virtual object to simultaneously execute the sub-interactive action selected under each subtask, and acquire, for each subtask, the state data before and after the corresponding sub-interactive action is executed. For example, suppose there are two subtasks U and V. Under the U task, the virtual object selects and executes action a_u1, obtaining the next state s_u1; under the V task, it selects and executes action a_v1, obtaining the next state s_v1. The next states are then processed in the same way, and an experience pool is built by continuously looping this process.
In step 421, the branch reward for executing the corresponding sub-interaction under each sub-task is calculated according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task.
On the basis of the above embodiment, for subtask U the server can calculate, through the reward function corresponding to subtask U, the reward value r_u1 for executing action a_u1. For subtask V, the server can calculate, through the reward function corresponding to subtask V, the reward value r_v1 for executing action a_v1. A branch reward is the reward value for executing the corresponding sub-interactive action under a given subtask; for example, the reward values r_u1 and r_v1 can each be regarded as branch rewards.
In step 422, the branch rewards for executing the corresponding sub-interaction under each sub-task are superimposed, and the reward value for executing all the sub-interaction by the virtual object is obtained.
The superposition may be adding or multiplying the branch rewards of the corresponding sub-interactive actions executed under each subtask, and the superposition result is taken as the reward value of the virtual object for executing the interactive actions. For example, r_u1 and r_v1 can be added as the reward value for executing the actions (a_u1, a_v1) in the states (s_u1, s_v1).
In an embodiment, the step 422 may specifically include the following steps: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
For example, the subtasks may be singing and dancing. Assuming more emphasis is placed on learning to sing, the singing subtask may be given a larger weight (e.g., 60%) and the dancing subtask a smaller weight (e.g., 40%). The branch reward r_u1 of the singing subtask is then multiplied by its weight 60% and added to the branch reward r_v1 of the dancing subtask multiplied by its weight 40%, giving the reward value of the virtual object for executing the interactive actions: r_1 = 60% × r_u1 + 40% × r_v1.
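The weighted superposition of branch rewards can be sketched as follows; the subtask names, weights, and branch reward values are illustrative, and the helper enforces the implicit assumption that the weights sum to 1:

```python
def combined_reward(branch_rewards, weights):
    # Weighted addition of per-subtask branch rewards,
    # e.g. r_1 = 0.6 * r_u1 + 0.4 * r_v1
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights should sum to 1"
    return sum(weights[task] * r for task, r in branch_rewards.items())

r1 = combined_reward({"sing": 0.8, "dance": 0.5},
                     {"sing": 0.6, "dance": 0.4})
# 0.6 * 0.8 + 0.4 * 0.5 = 0.68
```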
In an embodiment, as shown in fig. 9, the step 421 may specifically include the following steps 4211-4213.
In step 4211, for each subtask, the distance from the sub-state data to the target state of the virtual object is calculated according to the sub-state data obtained after the corresponding sub-interactive action is executed under that subtask.
For example, under the U task the virtual object selects and executes action a_u1, obtaining the next state s_u1; under the V task it selects and executes action a_v1, obtaining the next state s_v1. Then s_u1 can be considered the sub-state data after performing the sub-interactive action a_u1 under task U, and s_v1 the sub-state data after performing the sub-interactive action a_v1 under task V.
The sub-state data can also be represented by a multidimensional feature vector, and the target state can be considered a known quantity set in advance for each subtask. Therefore, for subtask U, the distance x_U between the sub-state data s_u1 and the target state data can be calculated; for subtask V, the distance x_V between the sub-state data s_v1 and the target state data can be calculated.
In step 4212, the distance under each subtask is normalized.
Normalization refers to constraining the distance to the interval [0, 1]. The normalization may be done by dividing the distance under each subtask by the maximum value of the distance under that subtask.
In step 4213, for each subtask, the distance normalized under the subtask is used as an input of a reward function corresponding to the subtask, and a branch reward of a sub interaction action corresponding to the subtask output by the reward function is obtained.
For example, assume subtasks U and V exist with normalized distances X_U and X_V respectively, the reward function of subtask U is y = 1 - x^0.4, and the reward function of subtask V is y = 1 - x^2.8. Then X_U may be substituted as the variable x of the reward function y = 1 - x^0.4, and the corresponding y value is the branch reward for executing action a_u1 under the U task. Likewise, X_V may be substituted as the variable x of the reward function y = 1 - x^2.8, and the corresponding y value is the branch reward for executing action a_v1 under the V task.
It can also be seen from fig. 5 and 6 that when the variable x lies in the interval [0, 1], the y value of the corresponding reward function is also constrained to [0, 1]; that is, once the variable x is normalized into [0, 1], the output y of the reward function always lies between 0 and 1. In other words, by normalizing the distance under each subtask, the branch reward of the sub-interactive action under each subtask is controlled within the [0, 1] interval. The rewards of different tasks are thus kept on the same order of magnitude, making it convenient to control the reward proportion between tasks during multi-task learning.
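Steps 4211-4213 can be sketched together as follows; the raw distances and per-subtask maximum distances are illustrative assumptions, while the exponents 0.4 and 2.8 follow the example reward functions above:

```python
def normalize(distance, max_distance):
    # Divide by the maximum distance under the subtask, constraining to [0, 1]
    return distance / max_distance

def branch_reward(normalized_distance, exponent):
    # Gradient-changing reward y = 1 - x^exponent on the normalized distance
    return 1.0 - normalized_distance ** exponent

x_u = normalize(3.0, 10.0)    # subtask U, reward function y = 1 - x^0.4
x_v = normalize(40.0, 50.0)   # subtask V, reward function y = 1 - x^2.8
r_u = branch_reward(x_u, 0.4)
r_v = branch_reward(x_v, 2.8)
# Both branch rewards are guaranteed to lie in [0, 1], keeping the two
# subtasks' rewards on the same order of magnitude
```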
In one embodiment, as shown in fig. 10, the step 430 may include the following steps:
in step 431, a neural network model of the behavioral policy is built.
The goal of reinforcement learning is to maximize some expectation. The outcome of a currently executed action affects subsequent states, so it is necessary to judge whether the action will receive a good return in the future; this return is delayed. Take the game of Go (weiqi) as an example: a single move does not end the game immediately, but it influences the subsequent game, so the probability of winning in the future, which is itself random, must be maximized. Therefore, a neural network model is constructed to learn the Q value (the expected future profit value), i.e., the expected future return of performing a certain action in a certain state. At this point the parameters of the neural network model are initial values and the calculated Q value is inaccurate, so the parameters need to be updated by gradient descent using the experience pool to fit the Q value through training.
In step 432, a set of empirical data including the front-back state data, the interaction actions, and the reward value is obtained, the rear state data in the empirical data is used as the input of the neural network model, and the maximum output value of the neural network model is obtained according to the output of the neural network model corresponding to different interaction actions in the rear state data.
The server can randomly extract experience data (s_t, a_t, s_{t+1}, r_t) from the experience pool for learning. Specifically, if s_{t+1} is not the target state, s_{t+1} is used as an input of the neural network model. Assuming 4 interactive actions (left, right, up, and down) can be performed, each interactive action is also used as an input of the neural network model, and the Q values of performing each of the four actions in state s_{t+1} can be calculated by the model; these Q values are the outputs of the neural network model. Obtaining the maximum output value of the neural network model means obtaining the largest expected future profit, Q_max(s_{t+1}, a'). If s_{t+1} is the target state, the reward value r_t is the expected future profit of executing action a_t in state s_t and can be used directly to update the parameters of the neural network model.
In step 433, the reward value in the experience data is added to the maximum output value to obtain a target profit value.
If s_{t+1} is not the target state, the reward value r_t is added to the discounted maximum output value γQ_max(s_{t+1}, a'); the result can be regarded as the expected future profit of executing action a_t in state s_t, referred to here as the target profit value for differentiation. The target profit value is thus r_t + γQ_max(s_{t+1}, a'). γ is called the discount factor and takes a value in [0, 1]. The discount factor is used because the future carries more uncertainty, so the return value decays over time; adding such an exponentially decaying discount factor also prevents the summation from diverging to infinity.
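The claim that the discount factor keeps the summation finite can be checked numerically: for a constant per-step reward r and γ in [0, 1), the discounted sum Σ γ^t · r is bounded by r / (1 - γ). The values below are illustrative.

```python
gamma, r = 0.9, 1.0

# Discounted return over a long horizon; each term is gamma^t * r
discounted = sum(gamma ** t * r for t in range(1000))

# Geometric-series bound: the infinite sum converges to r / (1 - gamma)
bound = r / (1 - gamma)  # 10.0 here
```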
In step 434, the pre-state data in the empirical data and the interaction under the pre-state data are used as the input of the neural network model, and the parameters of the neural network model are updated, so that the future expected value output by the neural network model approaches to the target profit value.
s_t and a_t are used as inputs of the neural network model, and the value output by the model may be referred to as the future expected value. Since step 433 has already calculated the target profit value of executing action a_t in state s_t, the parameters of the neural network model can be updated to make the future expected value approach the target profit value. By continuously looping through steps 432-434, the gap between the target profit value and the future expected value can be minimized; when the gap is less than a threshold, training may be considered complete. With the trained neural network model, the interactive action with the largest expected future return in each state can be determined, so the virtual object can be controlled to execute the action with the largest Q value at each step, thereby learning an optimal strategy for completing the task.
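The update rule of steps 432-434 can be sketched with a tabular stand-in for the neural network Q model: the target profit value is r_t + γ·max_a' Q(s_{t+1}, a') (or just r_t at the target state), and the stored estimate Q(s_t, a_t) is nudged toward that target. The states, actions, learning rate, and table contents below are illustrative assumptions, not the application's implementation.

```python
def q_update(Q, s, a, r, s_next, terminal, gamma=0.9, lr=0.5):
    if terminal:
        target = r                                   # s_next is the target state
    else:
        target = r + gamma * max(Q[s_next].values()) # r_t + gamma * Q_max(s_{t+1}, a')
    # Move the future expected value toward the target profit value
    Q[s][a] += lr * (target - Q[s][a])
    return Q[s][a]

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 2.0}}
q_update(Q, "s0", "right", 1.0, "s1", terminal=False)
# target = 1.0 + 0.9 * 2.0 = 2.8; Q["s0"]["right"] moves from 0.0 to 1.4
```

In the application this table is replaced by a neural network whose parameters are updated by gradient descent, but the target the estimate is pulled toward is the same.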
In order to verify the advantages of the training method using a reward function with gradient change provided by the present application, a simple verification was performed using the robot-arm training environment shown in fig. 11 (the arm 101 aims to approach the block 102); the algorithm may use the DDPG (Deep Deterministic Policy Gradient) algorithm. The experimental results are as follows.
FIG. 12 is a schematic diagram of the learning processes. The curve labeled "None" is the learning process of an agent without a reward function, from which it can be seen that the agent never manages to learn the task. The curve labeled "y = -x" is the learning process of an agent using a linear reward function, and the curve labeled "y = -(x)^2.8" is the learning process of an agent using the reward function with gradient change proposed by the present application. From the results, the reward function with gradient change is the most effective in terms of learning efficiency.
The following is an embodiment of the apparatus of the present application, which may be used to execute a training embodiment of a virtual object behavior policy executed by the client or the server in the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the training method for the behavior policy of the virtual object of the present application.
Fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application. As shown in fig. 13, the training device of the virtual object behavior strategy may include the following modules: a data acquisition module 1310, a reward calculation module 1320, and a policy training module 1330.
A data obtaining module 1310, configured to obtain pre-and post-state data of the virtual object performing the interaction.
A reward calculation module 1320, configured to calculate a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for a task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The strategy training module 1330 is configured to train the behavior strategy to reach the goal state by using the pre-and post-interaction state data and the reward value.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of the corresponding step in the training method of the virtual object behavior policy, and is not described herein again.
In an embodiment, the reward calculation module 1320 is specifically configured to calculate, according to a current state after the interaction is performed, a distance from the current state to the target state of the virtual object; and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the data obtaining module 1310 is specifically configured to: selecting a sub-interaction action for each sub-task; and controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring the front and back sub-state data of the sub-interactive action under each sub-task.
The reward calculation module 1320 specifically includes: the device comprises a branch reward calculation unit and a branch reward superposition unit. The branch reward calculation unit is used for calculating branch rewards for executing corresponding sub-interaction actions under each sub-task according to the reward functions corresponding to each sub-task and the sub-interaction actions under each sub-task; and the branch reward overlapping unit is used for overlapping branch rewards for executing corresponding sub-interactive actions under each sub-task and obtaining reward values of all the sub-interactive actions executed by the virtual object.
In an embodiment, the branch prize stacking unit is specifically configured to: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
In an embodiment, the branch reward calculating unit is specifically configured to: for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask; normalizing the distance under each subtask; and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the strategy training module 1330 includes the following elements: the device comprises a network building unit, a maximum value obtaining unit, a target calculating unit and a parameter updating unit.
And the network building unit is used for building a neural network model of the behavior strategy.
And the maximum value acquisition unit is used for acquiring a group of experience data comprising the front and rear state data, the interactive action and the reward value, taking the rear state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the rear state data.
And the target calculating unit is used for adding the reward value in the experience data with the maximum output value to obtain a target profit value.
And the parameter updating unit is used for taking the previous state data in the empirical data and the interactive action under the previous state data as the input of the neural network model, updating the parameters of the neural network model and enabling the future expected value output by the neural network model to approach the target profit value.
In an embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Claims (10)
1. A training method of a virtual object behavior strategy is characterized by comprising the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
2. The method according to claim 1, wherein the calculating of the reward value for the virtual object to perform the interaction according to the reward function with gradient change configured for the task performed by the virtual object in advance comprises:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
3. The method of claim 1, wherein the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
4. The method according to claim 3, wherein the step of superposing the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions comprises the following steps:
and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
5. The method according to claim 3, wherein the calculating of the branch reward for executing the corresponding sub-interaction under each sub-task according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task comprises:
for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
6. The method of claim 1, wherein training the behavior strategy to reach the goal state using pre-and post-state data and reward values for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
7. The method of claim 1,
the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases along with the decrease of the distance between the current state and the target state of the virtual object;
or,
the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
8. An apparatus for training a behavior strategy of a virtual object, comprising:
the data acquisition module is used for acquiring the state data before and after the virtual object executes the interactive action;
the reward calculation module is used for calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change, which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and the strategy training module is used for training the behavior strategy reaching the target state by utilizing the state data before and after the interactive action is executed and the reward value.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the method of training of virtual object behavior strategy of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of training a behavior strategy of a virtual object according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911254761.9A CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911254761.9A CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111026272A true CN111026272A (en) | 2020-04-17 |
CN111026272B CN111026272B (en) | 2023-10-31 |
Family
ID=70208257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911254761.9A Active CN111026272B (en) | 2019-12-09 | 2019-12-09 | Training method and device for virtual object behavior strategy, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111026272B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401556A (en) * | 2020-04-22 | 2020-07-10 | 清华大学深圳国际研究生院 | Selection method of opponent type imitation learning winning incentive function |
CN112101563A (en) * | 2020-07-22 | 2020-12-18 | 西安交通大学 | Confidence domain strategy optimization method and device based on posterior experience and related equipment |
CN112221140A (en) * | 2020-11-04 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Motion determination model training method, device, equipment and medium for virtual object |
CN113663335A (en) * | 2021-07-15 | 2021-11-19 | 广州三七极耀网络科技有限公司 | AI model training method, device, equipment and storage medium for FPS game |
CN114146420A (en) * | 2022-02-10 | 2022-03-08 | 中国科学院自动化研究所 | Resource allocation method, device and equipment |
WO2022100363A1 (en) * | 2020-11-13 | 2022-05-19 | 腾讯科技(深圳)有限公司 | Robot control method, apparatus and device, and storage medium and program product |
CN115648204A (en) * | 2022-09-26 | 2023-01-31 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient |
CN109445456A (en) * | 2018-10-15 | 2019-03-08 | 清华大学 | A kind of multiple no-manned plane cluster air navigation aid |
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agency for autonomous driving application |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
CN109847366A (en) * | 2019-01-29 | 2019-06-07 | 腾讯科技(深圳)有限公司 | Data for games treating method and apparatus |
CN109974737A (en) * | 2019-04-11 | 2019-07-05 | 山东师范大学 | Route planning method and system based on combination of safety evacuation signs and reinforcement learning |
CN110025959A (en) * | 2019-01-25 | 2019-07-19 | 清华大学 | Method and apparatus for controlling intelligent body |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | A kind of parking strategy based on deeply study |
US20190258938A1 (en) * | 2016-11-04 | 2019-08-22 | Deepmind Technologies Limited | Reinforcement learning with auxiliary tasks |
CN110178364A (en) * | 2017-01-13 | 2019-08-27 | 微软技术许可有限责任公司 | Optimum scanning track for 3D scene |
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | A kind of game follower method and system based on course intensified learning |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105637540A (en) * | 2013-10-08 | 2016-06-01 | 谷歌公司 | Methods and apparatus for reinforcement learning |
US20190258938A1 (en) * | 2016-11-04 | 2019-08-22 | Deepmind Technologies Limited | Reinforcement learning with auxiliary tasks |
CN110178364A (en) * | 2017-01-13 | 2019-08-27 | 微软技术许可有限责任公司 | Optimal scanning trajectory for 3D scenes |
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agent for autonomous driving applications |
CN108600379A (en) * | 2018-04-28 | 2018-09-28 | 中国科学院软件研究所 | Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient |
CN110136481A (en) * | 2018-09-20 | 2019-08-16 | 初速度(苏州)科技有限公司 | Parking strategy based on deep reinforcement learning |
CN109445456A (en) * | 2018-10-15 | 2019-03-08 | 清华大学 | Multi-UAV swarm navigation method |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | Human-like car-following model for autonomous driving based on deep reinforcement learning |
CN110025959A (en) * | 2019-01-25 | 2019-07-19 | 清华大学 | Method and apparatus for controlling an intelligent agent |
CN109847366A (en) * | 2019-01-29 | 2019-06-07 | 腾讯科技(深圳)有限公司 | Game data processing method and apparatus |
CN109974737A (en) * | 2019-04-11 | 2019-07-05 | 山东师范大学 | Route planning method and system based on combination of safety evacuation signs and reinforcement learning |
CN110327624A (en) * | 2019-07-03 | 2019-10-15 | 广州多益网络股份有限公司 | Game following method and system based on curriculum reinforcement learning |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111401556A (en) * | 2020-04-22 | 2020-07-10 | 清华大学深圳国际研究生院 | Method for selecting reward functions in adversarial imitation learning |
CN112101563A (en) * | 2020-07-22 | 2020-12-18 | 西安交通大学 | Confidence domain strategy optimization method and device based on posterior experience and related equipment |
CN112221140A (en) * | 2020-11-04 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Motion determination model training method, device, equipment and medium for virtual object |
CN112221140B (en) * | 2020-11-04 | 2024-03-22 | 腾讯科技(深圳)有限公司 | Method, device, equipment and medium for training action determination model of virtual object |
WO2022100363A1 (en) * | 2020-11-13 | 2022-05-19 | 腾讯科技(深圳)有限公司 | Robot control method, apparatus and device, and storage medium and program product |
CN113663335A (en) * | 2021-07-15 | 2021-11-19 | 广州三七极耀网络科技有限公司 | AI model training method, device, equipment and storage medium for FPS game |
CN114146420A (en) * | 2022-02-10 | 2022-03-08 | 中国科学院自动化研究所 | Resource allocation method, device and equipment |
CN115648204A (en) * | 2022-09-26 | 2023-01-31 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
CN115648204B (en) * | 2022-09-26 | 2024-08-27 | 吉林大学 | Training method, device, equipment and storage medium of intelligent decision model |
Also Published As
Publication number | Publication date |
---|---|
CN111026272B (en) | 2023-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111026272A (en) | Training method and device for virtual object behavior strategy, electronic equipment and storage medium | |
US9679258B2 (en) | Methods and apparatus for reinforcement learning | |
CN108920221B (en) | Game difficulty adjusting method and device, electronic equipment and storage medium | |
CN107158708A (en) | Multi-player video game matching optimization | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
Efthymiadis et al. | Using plan-based reward shaping to learn strategies in StarCraft: Broodwar |
Mousavi et al. | Applying Q(λ)-learning in deep reinforcement learning to play Atari games |
CN114404975B (en) | Training method, device, equipment, storage medium and program product of decision model | |
Khan et al. | Playing first-person shooter games with machine learning techniques and methods using the VizDoom Game-AI research platform | |
Varghese et al. | A hybrid multi-task learning approach for optimizing deep reinforcement learning agents | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
CN113509726B (en) | Interaction model training method, device, computer equipment and storage medium | |
CN109731338A (en) | Artificial intelligence training method and device, storage medium and electronic device in game | |
Yin et al. | A data-driven approach for online adaptation of game difficulty | |
CN116510302A (en) | Analysis method and device for abnormal behavior of virtual object and electronic equipment | |
Pons et al. | Scenario control for (serious) games using self-organizing multi-agent systems | |
Kuo et al. | Applying hybrid learning approach to RoboCup's strategy | |
US11478716B1 (en) | Deep learning for data-driven skill estimation | |
US11413541B2 (en) | Generation of context-aware, personalized challenges in computer games | |
Daswani et al. | Reinforcement learning with value advice | |
Togelius et al. | Evolutionary Machine Learning and Games | |
CN110831677A (en) | System and method for managing content presentation in a multiplayer online game | |
CN117648585B (en) | Intelligent decision model generalization method and device based on task similarity | |
Picardi | A comparison of Different Machine Learning Techniques to Develop the AI of a Virtual Racing Game | |
West | Self-play deep learning for games: Maximising experiences |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||