CN111026272A - Training method and device for virtual object behavior strategy, electronic equipment and storage medium - Google Patents

Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Info

Publication number
CN111026272A
CN111026272A (application CN201911254761.9A; granted publication CN111026272B)
Authority
CN
China
Prior art keywords
virtual object
reward
sub
subtask
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911254761.9A
Other languages
Chinese (zh)
Other versions
CN111026272B (en)
Inventor
贾航天
林磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN201911254761.9A priority Critical patent/CN111026272B/en
Publication of CN111026272A publication Critical patent/CN111026272A/en
Application granted granted Critical
Publication of CN111026272B publication Critical patent/CN111026272B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F 2203/01 Indexing scheme relating to G06F3/01
    • G06F 2203/012 Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application provides a training method and apparatus for a virtual object behavior strategy, an electronic device, and a storage medium, belonging to the technical field of artificial intelligence. The method specifically comprises the following steps: acquiring state data from before and after the virtual object executes an interactive action; calculating the reward value for the virtual object executing the interactive action according to a reward function with gradient change configured in advance for the task executed by the virtual object, wherein the gradient varies with the distance between the current state of the virtual object after the interactive action is performed and the target state; and training a behavior strategy for reaching the target state by using the state data from before and after the interactive action and the reward value. In this way, the variation of the reward value better matches the learning pattern of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.

Description

Training method and device for virtual object behavior strategy, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method and an apparatus for training a behavior policy of a virtual object, an electronic device, and a computer-readable storage medium.
Background
Reinforcement learning is a sub-field of machine learning. Its main idea is to update an agent's understanding of its environment through interaction with that environment and the reward obtained from it, so that better strategies can be generated that increase the accumulated long-term reward the agent obtains from the environment; in theory, with continuous training the agent can gradually arrive at an optimal strategy for a given environment.
As shown in FIG. 1, in the game "Montezuma's Revenge", for example, the agent receives a reward of 1 point for each step toward the bottom of the ladder and a penalty of -1 point for each step away from it, so the agent should learn to go down the ladder quickly, because that yields more reward.
However, such a design may not improve the agent's learning efficiency well. If an agent needs to be trained to cross a river along a small bridge, the prior art would design the reward to give a fixed score every time the agent moves one step closer to the river. But considering how humans or animals learn, the degree of inner excitement does not change linearly during the process of crossing the river; a design based on the existing linear reward does not match the change in a real player's mental state, so the training effect is poor and the training efficiency is low.
Disclosure of Invention
The embodiment of the application provides a training method of a virtual object behavior strategy, which is used for improving the training efficiency.
The application provides a training method of a virtual object behavior strategy, which comprises the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
In an embodiment, the calculating a reward value for the virtual object to perform the interaction according to a reward function with gradient change configured for the task performed by the virtual object in advance includes:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
In an embodiment, the overlaying of the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions includes:
and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
In an embodiment, the calculating, according to the reward function corresponding to each subtask and the sub-interaction action under each subtask, a branch reward for executing the corresponding sub-interaction action under each subtask includes:
for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the training of the behavior strategy to reach the goal state using the pre- and post-state data and the reward value for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
In one embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
The present application further provides a training apparatus for a virtual object behavior strategy, the apparatus including:
the data acquisition module is used for acquiring the state data before and after the virtual object executes the interactive action;
the reward calculation module is used for calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change, which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and the strategy training module is used for training the behavior strategy reaching the target state by utilizing the state data before and after the interactive action is executed and the reward value.
In addition, the present application also provides an electronic device, which includes:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the training method of the virtual object behavior strategy.
Further, the present application also provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is executable by a processor to perform the method for training the behavior policy of the virtual object provided in the present application.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with gradient change is used to calculate the reward value of the virtual object executing the interactive action in each state; the resulting experience data is then used to train a behavior strategy for reaching the target state. Because the gradient of the reward function varies with the distance between the current state and the target state of the virtual object, the variation of the reward function better matches the learning pattern of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required to be used in the embodiments of the present application will be briefly described below.
FIG. 1 is a diagram of the interface of the "Montezuma's Revenge" game in the background art;
FIG. 2 is a schematic diagram illustrating deep reinforcement learning provided by an embodiment of the present application;
fig. 3 is a schematic view of an application scenario of a training method for a behavior strategy of a virtual object according to an embodiment of the present application;
fig. 4 is a flowchart illustrating a method for training a behavior policy of a virtual object according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a positive reward function provided by an embodiment of the present application;
FIG. 6 is a diagram of a negative reward function provided by an embodiment of the present application;
FIG. 7 is a detailed flowchart of step 420 in a corresponding embodiment of FIG. 4;
FIG. 8 is a detailed flowchart of steps 410 and 420 in the corresponding embodiment of FIG. 4;
FIG. 9 is a flowchart showing details of step 421 in the corresponding embodiment of FIG. 8;
FIG. 10 is a detailed flowchart of step 430 in the corresponding embodiment of FIG. 4;
FIG. 11 is a schematic diagram of a robot arm environment provided by an embodiment of the present application;
FIG. 12 is a graph illustrating training efficiency comparison of different reward functions provided by embodiments of the present application;
fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
FIG. 2 is a schematic diagram of the deep reinforcement learning principle provided by an embodiment of the present application. As shown in FIG. 2, at time step t the agent obtains state data s_t from the environment; the agent performs an action a_t; the environment reacts to this action, produces the next state data s_{t+1}, and feeds back a reward value (reward) to the agent. By continuously cycling this process, and based on the reward values fed back for different actions executed in different states, the optimal strategy for achieving the goal can finally be obtained.
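For illustration only (the environment and agent interfaces below are assumed placeholders in the style of common reinforcement learning toolkits, not part of the patent), this interaction loop can be sketched in Python as follows:

    # Minimal sketch of the agent-environment loop of FIG. 2.
    # `env` and `agent` are assumed objects: env.reset()/env.step(action)
    # and agent.act(state)/agent.learn(...) are illustrative names.
    def run_episode(env, agent, max_steps=1000):
        s_t = env.reset()                       # state data s_t at time step t
        for t in range(max_steps):
            a_t = agent.act(s_t)                # agent performs action a_t
            s_next, r_t, done = env.step(a_t)   # environment returns s_{t+1} and reward r_t
            agent.learn(s_t, a_t, s_next, r_t)  # update the agent's understanding
            s_t = s_next
            if done:
                break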
In solving practical problems with reinforcement learning algorithms, the design of the reward values is an important component, since this part can be understood, colloquially, as the key indicator from which an agent constructs its "sense of value": the agent can only judge how good an action was from the reward value it receives for taking it. On this basis, the present application uses a reward function with gradient change and provides a training method for the behavior strategy of a virtual object. A virtual object refers to an agent in a virtual scene such as a game, for example a virtual character in the game. A behavior strategy means that the virtual object can automatically execute the optimal action for the task it faces, and the reward value measures how good or bad an action is.
Fig. 3 is a schematic view of an application scenario of a training method for a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 3, the application scenario includes a plurality of clients 310, and the clients 310 may be Personal Computers (PCs), tablet computers, smart phones, Personal Digital Assistants (PDAs), and the like, in which application programs are installed. The client 310 may use the method provided in the present application to train the behavior policy of the virtual object, so as to automatically execute the optimal policy for completing the task.
In an embodiment, the application scenario further includes a server 320, and the server 320 may be a server, a server cluster, or a cloud computing center. The server 320 and the client 310 may be connected through a wired or wireless network. The server 320 may train the behavior policy of the virtual object by using the method provided by the present application, and then the server 320 may control the client 310 to automatically execute the optimal policy for completing the task according to the trained behavior policy.
In an embodiment, the present application further provides an electronic device, which may be the client 310 or the server 320. By way of example with the server 320, as shown in fig. 3, the electronic device may include a processor 321; a memory 322 for storing instructions executable by the processor 321; the processor 321 is configured to execute the training method of the virtual object behavior strategy provided in the present application.
The Memory 322 may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk or optical disk.
A computer-readable storage medium is also provided, and the storage medium stores a computer program, which can be executed by the processor 321 to perform the method for training the behavior policy of the virtual object provided in the present application.
Fig. 4 is a schematic flowchart of a method for training a behavior policy of a virtual object according to an embodiment of the present application. As shown in fig. 4, the method includes the following steps 410-430.
In step 410, pre- and post-state data of the virtual object performing the interaction is obtained.
The pre- and post-state data refers to the state data before the interactive action is executed and the state data after it is executed. The state data may be, for example, the current position and target position of a virtual character in a game, the positions of the pieces on a chessboard, or the road conditions and motion state of an autonomous vehicle. By preprocessing the chessboard image, the image of the virtual environment where the game character is located, or the road-surface image captured while driving, image features can be extracted to determine the position state of the virtual character in the game, the positions of the pieces, the road conditions, and so on.
An interactive action refers to an action that the virtual object can perform; for example, the virtual object may be a virtual character in a basketball game that can perform actions such as "step left", "jump", "shoot", and "defend". To maximize the cumulative reward, the interaction performed by the virtual object may be chosen with an ε-greedy algorithm: with probability ε a random action is selected, and otherwise the action with the largest known expected future return Q is selected.
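A minimal sketch of such a choice, assuming an illustrative q_value(state, action) helper that returns the expected future return Q of a candidate action (the helper and the action list are assumptions, not defined in the patent):

    import random

    def epsilon_greedy(q_value, state, actions, epsilon=0.1):
        # With probability epsilon, explore by choosing a random interactive action;
        # otherwise exploit the action with the largest known expected future return Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_value(state, a))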
In step 420, calculating a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The virtual object may perform one or more tasks, and different tasks may correspond to different reward functions. The reward function indicates how much reward is given after different actions are performed in different states; "with gradient change" means that the reward function is non-linear. The reward function can be set in advance and stored on the server side, which improves training efficiency. The current state refers to the state of the virtual object after the interactive action is performed, for example the position the virtual object has moved to. The target state may be the end position of the virtual object's travel, a desired motion state parameter, or the like, depending on the task. The current state and the target state may be represented by multi-dimensional feature vectors, and the distance between them may be the Euclidean distance between the current state vector and the target state vector.
The reward functions can be divided into two broad categories, the first category being positive reward functions with a gradient as shown in fig. 5, and the second category being negative reward functions with a gradient as shown in fig. 6. The abscissa represents the distance of the current state from the goal state and the ordinate represents the prize value. The two types of reward functions are distinguished in that the first type of reward values are both positive values to encourage good actions by the virtual object, and the second type of reward values are both negative values to penalize bad actions by the virtual object. The selection of a particular reward function may be based on actual task needs.
Taking the positive reward function y = 1 - x^0.4 of FIG. 5 as an example, the value of y gradually increases as x decreases, and its gradient also gradually increases. This reward function can be used to guide the virtual object toward a target: the closer the virtual object gets to the target, the larger the change in its reward value becomes, which guides the virtual object to the target better than a linear reward curve.
The corresponding learning process in humans and animals can be understood as follows: for example, when running a marathon, a person's mood may differ depending on the distance to the finish line, and the eagerness to succeed grows stronger as the finish approaches. Making the gradient vary with the distance between the current state and the target state of the virtual object therefore makes the learning of the behavior strategy fit the real learning process of humans and animals more closely, and the learning efficiency is higher. Similarly, for the negative reward function y = -x^2.8 in FIG. 6, a larger x means the virtual object is farther from the target point, which is not the desired result, so a stronger penalty should be given the farther away it is; this corresponds to the gradient of the reward function curve gradually increasing. The gradient of the reward function can thus be considered to vary with the distance between the current state and the target state of the virtual object: depending on the task, the gradient may increase or decrease as the distance increases.
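The two example curves used in this description can be written directly as functions of the distance x (assumed here to be normalized to [0, 1], as described in the later steps):

    def positive_reward(x):
        # FIG. 5: y = 1 - x**0.4; the gradient grows as the distance x shrinks.
        return 1.0 - x ** 0.4

    def negative_reward(x):
        # FIG. 6: y = -x**2.8; the gradient grows as the distance x increases.
        return -(x ** 2.8)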
In one embodiment, suppose the virtual object performs a task whose reward function is y = 1 - x^0.4. The virtual object is in state s_t, and the state after the interactive action a is executed is s_{t+1}; s_{t+1} can be regarded as the current state, and its distance from the target state is x_{t+1}. Then x_{t+1} can be substituted into y = 1 - x^0.4 to calculate the corresponding y value. The calculated y value is the reward value of performing interaction a in state s_t.
In an embodiment, the client or server may, in state s_0, control the virtual object to perform an interaction a_0, obtaining a new state s_1 and a reward value r_0; the quadruple of experience data (s_0, a_0, s_1, r_0) can be constructed and added to an experience pool. Continuing, in state s_1 the virtual object is controlled to perform an interaction a_1, obtaining a new state s_2 and a reward value r_1, yielding new quadruple experience data that is added to the experience pool. By cycling continuously in this way, a large amount of experience data (s_t, a_t, s_{t+1}, r_t) is stored in the experience pool.
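A sketch of building such an experience pool, reusing the assumed env/agent interfaces from the earlier sketch (a bounded deque is one possible container; the patent does not prescribe one):

    from collections import deque

    experience_pool = deque(maxlen=100_000)  # stores quadruples (s_t, a_t, s_{t+1}, r_t)

    def collect_experience(env, agent, episodes=100):
        for _ in range(episodes):
            s_t = env.reset()
            done = False
            while not done:
                a_t = agent.act(s_t)
                s_next, r_t, done = env.step(a_t)
                experience_pool.append((s_t, a_t, s_next, r_t))
                s_t = s_next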
In step 430, a behavior strategy for reaching the goal state is trained using the pre- and post-interaction state data and the reward value.
Based on the large amount of experience data (s_t, a_t, s_{t+1}, r_t) stored in the experience pool, a behavior strategy for reaching the target state can be trained through a reinforcement learning algorithm such as DQN (Deep Q-Network), Policy Gradient, or Actor-Critic. Training a behavior strategy that reaches the target state means finding a policy that can automatically control the virtual object to execute the optimal action when facing a given state, so that the overall number of steps to reach the target state is minimal.
According to the technical solution provided by the embodiments of the application, the state data from before and after the virtual object executes an interactive action is acquired, and a reward function with gradient change is used to calculate the reward value of the virtual object executing the interactive action in each state; the resulting experience data is then used to train a behavior strategy for reaching the target state. Because the gradient of the reward function varies with the distance between the current state and the target state of the virtual object, the variation of the reward function better matches the learning pattern of humans and animals, which improves training efficiency and simulates the learning process of humans and animals more quickly.
In an embodiment, as shown in fig. 7, the step 420 may include the following steps 701 and 702.
In step 701, according to the current state after the interactive action is performed, a distance from the current state to the target state of the virtual object is calculated.
The virtual object, in state s_t, performs an interaction a_t and then enters the next state s_{t+1}. The state data of the virtual object after the interactive action is executed, i.e., the current state, can be obtained by collecting an image of the next state and extracting image features. For example, the state data after the interaction is performed may be represented by a multidimensional vector (a_x, b_x, c_x, d_x), and the target state may be a known multidimensional vector (a_y, b_y, c_y, d_y); the distance from the current state of the virtual object to the target state may then be calculated as the Euclidean distance between (a_x, b_x, c_x, d_x) and (a_y, b_y, c_y, d_y).
In one embodiment, the current state after the interaction is performed may be the position coordinates of the virtual character, and the target state is the destination coordinates of the virtual character, so the distance from the virtual object to the target state may be the euclidean distance between the position coordinates and the destination coordinates.
In step 702, the distance is used as the input of the reward function, and the reward value of the interaction action output by the reward function is obtained.
For example, the reward function may be y = 1 - x^0.4, where x can be regarded as the input of the reward function and y as its output. After the distance is calculated in step 701, it can be substituted as the x value into the reward function to calculate the corresponding y value, which can be regarded as the reward value of performing interaction a_t in state s_t.
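Steps 701 and 702 together can be sketched as follows, assuming the state and target are plain numeric feature vectors and reusing the positive_reward sketch above (the distance would still need to be normalized to [0, 1] if the reward function expects that range, as discussed later):

    import math

    def distance_to_target(current_state, target_state):
        # Euclidean distance between the multidimensional feature vectors.
        return math.sqrt(sum((c - t) ** 2 for c, t in zip(current_state, target_state)))

    def reward_for_action(current_state, target_state, reward_fn=positive_reward):
        x = distance_to_target(current_state, target_state)  # step 701
        return reward_fn(x)                                   # step 702: reward value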
In one embodiment, the task referred to at step 420 may include a plurality of subtasks, each subtask having a corresponding reward function. Thus, as shown in fig. 8, the above step 410 may include the following steps 411 and 412. The step 420 may include the following steps 421 and 422.
In step 411, a selection of sub-interactions is made for each sub-task.
For multiple tasks that need to be completed simultaneously, each task may be referred to as a subtask. For example, the task may be learning to sing and dance at the same time, and the subtasks are then learning to sing and learning to dance. Different subtasks may correspond to different reward functions. A sub-interactive action refers to an action the virtual object can perform for each subtask, and the action types can be configured in advance. For each subtask, the server may select an interaction corresponding to that subtask; as described above, the selection may be made either randomly or by choosing the action with the largest known expected future return.
In step 412, the virtual object is controlled to execute the sub-interaction under each sub-task, and front and back sub-state data of the sub-interaction under each sub-task is obtained.
The server can control the virtual object to simultaneously execute the sub-interactive action selected under each subtask, and, for each subtask, acquire the state data before the corresponding sub-interactive action is executed and the state data after it is executed. For example, suppose there are two subtasks U and V. Under subtask U, action a_u1 is selected and executed in state s_u1 to obtain the next state s_u2; under subtask V, action a_v1 is selected and executed in state s_v1 to obtain the next state s_v2. The next states are then processed in the same way, and an experience pool is built by continuously cycling through this process.
In step 421, the branch reward for executing the corresponding sub-interaction under each sub-task is calculated according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task.
Continuing the above embodiment, for subtask U the server can calculate, through the reward function corresponding to subtask U, the reward value r_u1 of executing action a_u1 in state s_u1; for subtask V, the server can calculate, through the reward function corresponding to subtask V, the reward value r_v1 of executing action a_v1 in state s_v1. A branch reward is the reward value of performing the corresponding sub-interaction under a given subtask; for example, the reward value r_u1 can be regarded as a branch reward, and so can the reward value r_v1.
In step 422, the branch rewards for executing the corresponding sub-interaction under each sub-task are superimposed, and the reward value for executing all the sub-interaction by the virtual object is obtained.
The superposition may consist of adding or multiplying the branch rewards of the corresponding sub-interactions executed under each subtask, with the result taken as the reward value of the virtual object performing the interactive action. For example, r_u1 and r_v1 can be added and used as the reward value of executing actions (a_u1, a_v1) in states (s_u1, s_v1).
In an embodiment, the step 422 may specifically include the following steps: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
For example, the subtasks may be singing and dancing. Assuming more emphasis is placed on learning to sing, the singing subtask may be given a larger weight (e.g., 60%) and the dancing subtask a smaller weight (e.g., 40%). The branch reward r_u1 of the singing subtask is multiplied by its weight of 60%, the branch reward r_v1 of the dancing subtask is multiplied by its weight of 40%, and the two are added to obtain the reward value of the virtual object performing the interactive action: r_1 = 60% · r_u1 + 40% · r_v1.
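A sketch of this weighted superposition (the task names, weights, and example branch-reward values are purely illustrative):

    def combined_reward(branch_rewards, weights):
        # Weighted sum of the branch rewards of all subtasks.
        return sum(weights[task] * r for task, r in branch_rewards.items())

    # r_1 = 60% * r_u1 + 40% * r_v1
    r_1 = combined_reward({"sing": 0.8, "dance": 0.5}, {"sing": 0.6, "dance": 0.4})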
In an embodiment, as shown in fig. 9, the step 421 may specifically include the following steps 4211-4213.
In step 4211, for each subtask, the distance from the sub-state data to the target state of the virtual object is calculated according to the sub-state data obtained after the corresponding sub-interaction action is executed under that subtask.
For example, under subtask U, action a_u1 is selected and executed in state s_u1 to obtain the next state s_u2; under subtask V, action a_v1 is selected and executed in state s_v1 to obtain the next state s_v2. s_u2 can be regarded as the sub-state data after performing sub-interaction a_u1 under subtask U, and s_v2 as the sub-state data after performing sub-interaction a_v1 under subtask V.
The sub-state data can also be represented by a multi-dimensional feature vector, and the target state can be regarded as a known quantity set in advance for each subtask. Therefore, for subtask U, the distance x_U between the sub-state data s_u2 and the target state data can be calculated; for subtask V, the distance x_V between the sub-state data s_v2 and the target state data can be calculated.
In step 4212, the distance under each subtask is normalized.
Normalization refers to constraining the distance to the range [0, 1]. The normalization may be done by dividing the distance under each subtask by the maximum value of the distance under that subtask.
In step 4213, for each subtask, the distance normalized under the subtask is used as an input of a reward function corresponding to the subtask, and a branch reward of a sub interaction action corresponding to the subtask output by the reward function is obtained.
For example, suppose subtasks U and V exist and their normalized distances are X_U and X_V respectively. If the reward function of subtask U is y = 1 - x^0.4 and the reward function of subtask V is y = 1 - x^2.8, then X_U can be used as the value of the variable x in y = 1 - x^0.4, and the corresponding y value is the branch reward of selecting and executing action a_u1 in state s_u1 under subtask U; X_V can be used as the value of the variable x in y = 1 - x^2.8, and the corresponding y value is the branch reward of selecting and executing action a_v1 in state s_v1 under subtask V.
It can also be seen from FIG. 5 and FIG. 6 that when the variable x lies in the interval [0, 1], the y value of the corresponding reward function is also constrained to the interval [0, 1]; that is, when the variable x is normalized to [0, 1], the output y of the reward function always stays between 0 and 1. In other words, by normalizing the distance under each subtask, the branch reward of the sub-interaction under each subtask can be kept within the [0, 1] interval. The rewards of different tasks are thus controlled to the same order of magnitude, which makes it easy to control the reward proportions between different tasks during multi-task learning.
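Steps 4211 to 4213 can be sketched as follows, assuming each subtask supplies its own target state, maximum possible distance, and reward function (all names are illustrative, and distance_to_target is the helper sketched earlier):

    def branch_reward(sub_state, target_state, max_distance, reward_fn):
        x = distance_to_target(sub_state, target_state)  # step 4211: distance to target
        x_norm = min(x / max_distance, 1.0)              # step 4212: constrain to [0, 1]
        return reward_fn(x_norm)                         # step 4213: branch reward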
In one embodiment, as shown in fig. 10, the step 430 may include the following steps:
in step 431, a neural network model of the behavioral policy is built.
The goal of reinforcement learning is to maximize some expected return. The outcome of the currently executed action affects the subsequent states, so it is necessary to judge whether the action will bring a good return in the future, and that return is delayed. Take Go as an example: the current move does not end the game immediately, but it affects the subsequent positions, so what must be maximized is the probability of winning in the future, and the future is stochastic. Therefore, a neural network model is constructed to learn the Q value (the expected future return), i.e., the expected future return of performing a certain action in a certain state. Initially, the parameters of the neural network model are initial values and the calculated Q value is inaccurate, so the parameters need to be updated by gradient descent, using the experience pool, to fit the Q value.
In step 432, a set of empirical data including the front-back state data, the interaction actions, and the reward value is obtained, the rear state data in the empirical data is used as the input of the neural network model, and the maximum output value of the neural network model is obtained according to the output of the neural network model corresponding to different interaction actions in the rear state data.
The server can randomly draw experience data (s_t, a_t, s_{t+1}, r_t) from the experience pool for learning. Specifically, if s_{t+1} is not the target state, s_{t+1} is used as the input of the neural network model. Assuming that 4 interactions (left, right, up, and down) can be performed, each interaction is also used as an input of the neural network model, and the model can compute the Q value of performing each of the four interactions in state s_{t+1}; the Q value is the output of the neural network model. Obtaining the maximum output value of the neural network model means obtaining the maximum expected future return Q_max(s_{t+1}, a'). If s_{t+1} is the target state, the reward value r_t is itself the expected future return of executing action a_t in state s_t, and it can be used directly to update the parameters of the neural network model.
In step 433, the reward value in the experience data is added to the maximum output value to obtain a target profit value.
If s_{t+1} is not the target state, the reward value r_t is added to Q_max(s_{t+1}, a'), and the result can be regarded as the expected future return of executing action a_t in state s_t; for distinction, it is referred to here as the target profit value. The target profit value may be r_t + γ·Q_max(s_{t+1}, a'), where γ is called the discount factor and takes a value in [0, 1]. The discount factor is used because the future carries more uncertainty, so the value of returns decays over time; adding a discount factor that decays exponentially with time also prevents the summation from growing to infinity.
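Written out in the notation of this description (with θ added here to denote the parameters of the neural network model), the target profit value and the gap minimized in the next step are:

    y_t  = r_t + γ · max_a' Q(s_{t+1}, a'; θ)      (target profit value, step 433)
    L(θ) = ( y_t - Q(s_t, a_t; θ) )^2              (gap to be minimized, step 434)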
In step 434, the pre-state data in the empirical data and the interaction under the pre-state data are used as the input of the neural network model, and the parameters of the neural network model are updated, so that the future expected value output by the neural network model approaches to the target profit value.
s_t and a_t are used as inputs of the neural network model, and the value output by the model may be referred to as the future expected value. Since step 433 has already calculated the target profit value of executing action a_t in state s_t, the parameters of the neural network model can be updated so that the future expected value approaches the target profit value. By continuously looping through steps 432-434, the difference between the target profit value and the future expected value can be minimized; when the gap is smaller than a threshold, training can be considered complete. Using the trained neural network model, it can be determined which interactive action has the largest expected future return in each state, so the virtual object can be controlled to execute the action with the largest Q value at each step, and an optimal strategy for completing the task is learned.
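As a hedged sketch of steps 432-434 (a DQN-style update written with PyTorch; the network architecture, batch size, and discount factor are assumptions, and terminal-state handling, where the target is simply r_t, is omitted for brevity):

    import random
    import torch
    import torch.nn as nn

    GAMMA = 0.9  # discount factor in [0, 1]

    def train_step(q_net, optimizer, experience_pool, batch_size=32):
        # experience_pool: list of (s_t, a_t, s_{t+1}, r_t) with numeric state
        # vectors and integer action indices.
        batch = random.sample(experience_pool, batch_size)
        s, a, s_next, r = zip(*batch)
        s = torch.tensor(s, dtype=torch.float32)
        a = torch.tensor(a, dtype=torch.int64).unsqueeze(1)
        s_next = torch.tensor(s_next, dtype=torch.float32)
        r = torch.tensor(r, dtype=torch.float32)

        # Step 432: maximum output of the model for the post-state data.
        with torch.no_grad():
            q_max = q_net(s_next).max(dim=1).values
        # Step 433: target profit value = reward + discounted maximum output.
        target = r + GAMMA * q_max
        # Step 434: push the future expected value for (s_t, a_t) toward the target.
        q_sa = q_net(s).gather(1, a).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()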
In order to verify the advantages of the training method for the virtual object behavior strategy provided by the present application, which uses a reward function with gradient change, a simple verification was performed using the robot-arm environment training process shown in FIG. 11 (the goal of arm 101 is to approach block 102); the algorithm may be the DDPG (Deep Deterministic Policy Gradient) algorithm. The experimental results are as follows.
In FIG. 12, the curve labeled "None" shows the learning process of an agent without a reward function; it can be seen that this agent is never able to learn the task. The curve labeled "y = -x" is the learning process of an agent using a linear reward function, and the curve labeled "y = -x^2.8" is the learning process of an agent using the reward function with gradient change proposed by the present application. From the results it can be seen that the reward function with gradient change yields better learning efficiency.
The following is an embodiment of the apparatus of the present application, which may be used to execute a training embodiment of a virtual object behavior policy executed by the client or the server in the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the training method for the behavior policy of the virtual object of the present application.
Fig. 13 is a block diagram of a training apparatus for a virtual object behavior strategy according to an embodiment of the present application. As shown in fig. 13, the training device of the virtual object behavior strategy may include the following modules: a data acquisition module 1310, a reward calculation module 1320, and a policy training module 1330.
A data obtaining module 1310, configured to obtain the pre- and post-state data of the virtual object performing the interaction.
A reward calculation module 1320, configured to calculate a reward value of the virtual object for executing the interaction action according to a reward function with gradient change configured for a task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed.
The strategy training module 1330 is configured to train the behavior strategy to reach the goal state by using the pre- and post-interaction state data and the reward value.
The implementation process of the functions and actions of each module in the device is specifically detailed in the implementation process of the corresponding step in the training method of the virtual object behavior policy, and is not described herein again.
In an embodiment, the reward calculation module 1320 is specifically configured to calculate, according to a current state after the interaction is performed, a distance from the current state to the target state of the virtual object; and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
In one embodiment, the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the data obtaining module 1310 is specifically configured to: selecting a sub-interaction action for each sub-task; and controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring the front and back sub-state data of the sub-interactive action under each sub-task.
The reward calculation module 1320 specifically includes a branch reward calculation unit and a branch reward superposition unit. The branch reward calculation unit is used for calculating the branch reward for executing the corresponding sub-interaction action under each subtask according to the reward function corresponding to each subtask and the sub-interaction action under each subtask; the branch reward superposition unit is used for superposing the branch rewards for executing the corresponding sub-interaction actions under each subtask to obtain the reward value of the virtual object executing all the sub-interaction actions.
In an embodiment, the branch prize stacking unit is specifically configured to: and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
In an embodiment, the branch reward calculating unit is specifically configured to: for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask; normalizing the distance under each subtask; and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
In one embodiment, the strategy training module 1330 includes the following units: a network building unit, a maximum value obtaining unit, a target calculating unit, and a parameter updating unit.
And the network building unit is used for building a neural network model of the behavior strategy.
And the maximum value acquisition unit is used for acquiring a group of experience data comprising the front and rear state data, the interactive action and the reward value, taking the rear state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the rear state data.
And the target calculating unit is used for adding the reward value in the experience data with the maximum output value to obtain a target profit value.
And the parameter updating unit is used for taking the previous state data in the empirical data and the interactive action under the previous state data as the input of the neural network model, updating the parameters of the neural network model and enabling the future expected value output by the neural network model to approach the target profit value.
In an embodiment, the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases as the distance between the current state and the target state of the virtual object decreases; or the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
In the embodiments provided in the present application, the disclosed apparatus and method can be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

Claims (10)

1. A training method of a virtual object behavior strategy is characterized by comprising the following steps:
acquiring the state data before and after the virtual object executes the interactive action;
calculating the reward value of the virtual object for executing the interaction action according to a reward function with gradient change which is configured for the task executed by the virtual object in advance; wherein the gradient varies with a distance between a current state and a target state of the virtual object after the interactive action is performed;
and training a behavior strategy reaching the target state by utilizing the pre-state data and the post-state data for executing the interactive action and the reward value.
2. The method according to claim 1, wherein the calculating of the reward value for the virtual object to perform the interaction according to the reward function with gradient change configured for the task performed by the virtual object in advance comprises:
according to the current state after the interactive action is executed, calculating the distance from the current state to the target state of the virtual object;
and taking the distance as the input of the reward function, and obtaining the reward value output by the reward function for executing the interactive action.
3. The method of claim 1, wherein the task comprises a plurality of subtasks, each subtask having a corresponding reward function; the acquiring of the front and back state data of the virtual object executing the interactive action includes:
selecting a sub-interaction action for each sub-task;
controlling the virtual object to execute the sub-interactive action under each sub-task, and acquiring front and back sub-state data of the sub-interactive action under each sub-task;
the calculating of the reward value of the virtual object for executing the interaction action according to the reward function with gradient change configured for the task executed by the virtual object in advance comprises:
calculating branch rewards for executing corresponding sub-interaction actions under each subtask according to the reward function corresponding to each subtask and the sub-interaction actions under each subtask;
and superposing the branch rewards for executing the corresponding sub-interactive actions under each sub-task to obtain the reward value for executing all the sub-interactive actions by the virtual object.
4. The method according to claim 3, wherein the step of superposing the branch reward for executing the corresponding sub-interaction under each sub-task to obtain the reward value for the virtual object to execute all the sub-interactions comprises the following steps:
and according to the weight configured for each subtask, the branch rewards corresponding to each subtask are added in a weighted mode to obtain reward values of all the sub interaction actions executed by the virtual object.
5. The method according to claim 3, wherein the calculating of the branch reward for executing the corresponding sub-interaction under each sub-task according to the reward function corresponding to each sub-task and the sub-interaction under each sub-task comprises:
for each subtask, calculating the distance from the subtask data to the target state of the virtual object according to the subtask data after the corresponding subtask is executed under the subtask;
normalizing the distance under each subtask;
and aiming at each subtask, taking the distance after normalization under the subtask as the input of the reward function corresponding to the subtask, and obtaining the branch reward of the corresponding sub-interaction action under the subtask output by the reward function.
6. The method of claim 1, wherein training the behavior strategy to reach the goal state using pre-and post-state data and reward values for performing the interaction comprises:
building a neural network model of the behavior strategy;
acquiring a group of experience data comprising the front and back state data, the interactive actions and the reward values, taking the back state data in the experience data as the input of the neural network model, and acquiring the maximum output value of the neural network model according to the output of the neural network model corresponding to different interactive actions under the back state data;
adding the reward value in the experience data and the maximum output value to obtain a target profit value;
and taking the previous state data in the empirical data and the interaction action under the previous state data as the input of the neural network model, updating the parameters of the neural network model, and enabling the future expected value output by the neural network model to approach the target profit value.
7. The method of claim 1,
the reward function is a positive reward function, the reward value is a positive value, and the gradient of the positive reward function increases along with the decrease of the distance between the current state and the target state of the virtual object;
alternatively,
the reward function is a negative reward function, the reward value is a negative value, and the gradient of the negative reward function increases as the distance between the current state and the target state of the virtual object increases.
8. An apparatus for training a behavior strategy of a virtual object, comprising:
a data acquisition module, configured to acquire state data before and after the virtual object executes an interactive action;
a reward calculation module, configured to calculate a reward value for the virtual object executing the interactive action according to a reward function with a gradient change configured in advance for the task executed by the virtual object, wherein the gradient varies with the distance between the current state and the target state of the virtual object after the interactive action is executed; and
a strategy training module, configured to train the behavior strategy for reaching the target state using the state data before and after the interactive action is executed and the reward value.
9. An electronic device, characterized in that the electronic device comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the training method for a virtual object behavior strategy according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium stores a computer program executable by a processor to perform the method of training a behavior strategy of a virtual object according to any one of claims 1-7.
CN201911254761.9A 2019-12-09 2019-12-09 Training method and device for virtual object behavior strategy, electronic equipment and storage medium Active CN111026272B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911254761.9A CN111026272B (en) 2019-12-09 2019-12-09 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911254761.9A CN111026272B (en) 2019-12-09 2019-12-09 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111026272A (en) 2020-04-17
CN111026272B (en) 2023-10-31

Family

ID=70208257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911254761.9A Active CN111026272B (en) 2019-12-09 2019-12-09 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111026272B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
US20190258938A1 (en) * 2016-11-04 2019-08-22 Deepmind Technologies Limited Reinforcement learning with auxiliary tasks
CN110178364A (en) * 2017-01-13 2019-08-27 微软技术许可有限责任公司 Optimum scanning track for 3D scene
CN109460015A (en) * 2017-09-06 2019-03-12 通用汽车环球科技运作有限责任公司 Unsupervised learning agency for autonomous driving application
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN110136481A (en) * 2018-09-20 2019-08-16 初速度(苏州)科技有限公司 A kind of parking strategy based on deeply study
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN110025959A (en) * 2019-01-25 2019-07-19 清华大学 Method and apparatus for controlling intelligent body
CN109847366A (en) * 2019-01-29 2019-06-07 腾讯科技(深圳)有限公司 Data for games treating method and apparatus
CN109974737A (en) * 2019-04-11 2019-07-05 山东师范大学 Route planning method and system based on combination of safety evacuation signs and reinforcement learning
CN110327624A (en) * 2019-07-03 2019-10-15 广州多益网络股份有限公司 A kind of game follower method and system based on course intensified learning

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN112101563A (en) * 2020-07-22 2020-12-18 西安交通大学 Confidence domain strategy optimization method and device based on posterior experience and related equipment
CN112221140A (en) * 2020-11-04 2021-01-15 腾讯科技(深圳)有限公司 Motion determination model training method, device, equipment and medium for virtual object
CN112221140B (en) * 2020-11-04 2024-03-22 腾讯科技(深圳)有限公司 Method, device, equipment and medium for training action determination model of virtual object
WO2022100363A1 (en) * 2020-11-13 2022-05-19 腾讯科技(深圳)有限公司 Robot control method, apparatus and device, and storage medium and program product
CN113663335A (en) * 2021-07-15 2021-11-19 广州三七极耀网络科技有限公司 AI model training method, device, equipment and storage medium for FPS game
CN114146420A (en) * 2022-02-10 2022-03-08 中国科学院自动化研究所 Resource allocation method, device and equipment
CN115648204A (en) * 2022-09-26 2023-01-31 吉林大学 Training method, device, equipment and storage medium of intelligent decision model

Also Published As

Publication number Publication date
CN111026272B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN111026272A (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
US20210374538A1 (en) Reinforcement learning using target neural networks
Mnih et al. Human-level control through deep reinforcement learning
CN108920221B (en) Game difficulty adjusting method and device, electronic equipment and storage medium
CN107158708A (en) Multi-player video game matching optimization
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
Efthymiadis et al. Using plan-based reward shaping to learn strategies in starcraft: Broodwar
Mousavi et al. Applying q (λ)-learning in deep reinforcement learning to play atari games
Khan et al. Playing first-person shooter games with machine learning techniques and methods using the VizDoom Game-AI research platform
CN113509726B (en) Interaction model training method, device, computer equipment and storage medium
CN114004149A (en) Intelligent agent training method and device, computer equipment and storage medium
CN109731338A (en) Artificial intelligence training method and device, storage medium and electronic device in game
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Yin et al. A data-driven approach for online adaptation of game difficulty
Pons et al. Scenario control for (serious) games using self-organizing multi-agent systems
CN116510302A (en) Analysis method and device for abnormal behavior of virtual object and electronic equipment
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
Kuo et al. Applying hybrid learning approach to RoboCup's strategy
US11413541B2 (en) Generation of context-aware, personalized challenges in computer games
Daswani et al. Reinforcement learning with value advice
Togelius et al. Evolutionary Machine Learning and Games
CN110831677A (en) System and method for managing content presentation in a multiplayer online game
CN117648585B (en) Intelligent decision model generalization method and device based on task similarity
Picardi A comparison of Different Machine Learning Techniques to Develop the AI of a Virtual Racing Game
Dann Learning and planning in videogames via task decomposition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant