CN112016678A - Training method and device for strategy generation network for reinforcement learning and electronic equipment - Google Patents
- Publication number
- CN112016678A (application number CN202010867107.1A)
- Authority
- CN
- China
- Prior art keywords
- continuous
- information
- state
- value
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
- G06N20/00—Machine learning
- G06N3/08—Learning methods
Abstract
A training method, a training apparatus, and an electronic device for a policy generation network for reinforcement learning are disclosed. The training method comprises the following steps: acquiring continuous object state information of an object that executes a task and continuous environment state information of the environment on which the object acts; determining the joint probability distribution of the continuous object state information and the continuous environment state information, as well as their respective first and second marginal distributions; determining the KL divergence between the joint probability distribution and the product of the first and second marginal distributions; and updating the parameters of the policy generation network by a predetermined policy, with the KL divergence value as the reward function. In this way, the quality of the policies generated by the policy generation network is improved.
Description
Technical Field
The present application relates to the field of reinforcement learning technology, and more particularly, to a training method, a training apparatus, and an electronic device for a strategy generation network for reinforcement learning.
Background
Recently, Reinforcement Learning (RL) combined with Deep Learning (DL) has succeeded in many reward-driven tasks, including surpassing human performance in various games, as well as performing well in continuous robot-control tasks, navigation within complex environments, and object-manipulation tasks.
However, despite these successes, in current reinforcement learning tasks the objects that perform the task, such as robots that manipulate objects, typically learn only from external reward signals, unlike the human learning process. For example, when a person learns to manipulate objects, the person not only attempts to complete a task but also learns to grasp the controllable aspects of the environment. Even in an unsupervised setting, a person can quickly discover the association between his own actions and changes in the state of the environment, and then use this skill to steer the environment toward a desired state.
Moreover, in practical reinforcement learning tasks it is difficult to design an external reward function that guarantees that the object performing the task learns a desired behavior, such as manipulating an object. Accordingly, it is desirable to provide an improved training scheme for a policy generation network that generates actions for objects performing tasks, so that effective actions can be generated even in the absence of external rewards.
Disclosure of Invention
The present application is proposed to solve the above-mentioned technical problems. Embodiments of the present application provide a training method, a training apparatus, and an electronic device for a policy generation network for reinforcement learning, which determine the mutual information between the object state and the environment state, that is, the KL divergence between the probability distributions of the object state and the environment state, and use it as the reward function to train the policy generation network, thereby improving the quality of the policies the network generates.
According to an aspect of the present application, there is provided a training method for a policy generation network for reinforcement learning, including: acquiring continuous object state information of an object executing a task and continuous environment state information of an environment on which the object acts, wherein the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment; determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; determining the KL divergence between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution; and updating parameters of the policy generation network by a predetermined policy, with the KL divergence value as a reward function.
According to another aspect of the present application, there is provided a training apparatus for a policy generation network for reinforcement learning, including: a state acquisition unit configured to acquire continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment; a distribution determining unit configured to determine a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquisition unit, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; a divergence value determining unit configured to determine the KL divergence between the joint probability distribution determined by the distribution determining unit and the product of the first marginal distribution and the second marginal distribution; and a network updating unit for updating the parameters of the policy generation network by a predetermined policy, with the KL divergence value determined by the divergence value determining unit as a reward function.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the training method for a strategy generation network for reinforcement learning as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the training method for a strategy generation network for reinforcement learning as described above.
According to the training method, training apparatus, and electronic device for a policy generation network for reinforcement learning provided by the present application, the mutual information between the object state and the environment state, that is, the KL divergence between their probability distributions, is determined and used as the reward function for training the policy generation network. This is equivalent to the object performing the task learning to control the environment through the policies generated by the network, thereby improving the effectiveness of those policies.
In addition, according to the training method, training apparatus, and electronic device for a policy generation network for reinforcement learning provided by the present application, using the mutual information between the object state and the environment state as the reward function allows policies to be generated effectively even when no external reward function is hand-crafted or manually specified, or when rewards in the environment are sparse, thereby improving the performance of the policy generation network.
In addition, the training method, training apparatus, and electronic device for a policy generation network for reinforcement learning provided by the present application can help the object performing the task quickly adapt to unknown tasks: by learning to control the environment through the policies generated by the policy generation network, the object imitates the way humans learn to perform tasks.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a schematic diagram of a standard reinforcement learning model.
Fig. 2 illustrates a flow chart of a training method for a strategy generation network for reinforcement learning according to an embodiment of the present application.
Fig. 3 illustrates a block diagram of a training apparatus for a strategy generation network for reinforcement learning according to an embodiment of the present application.
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
FIG. 1 illustrates a schematic diagram of a standard reinforcement learning model. As shown in Fig. 1, the policy generation network N generates an action A for an object (agent) performing a task; the current environment state S0 transitions to the next environment state S1 based on the action A, and p denotes the transition probability from the current state to the next state. In addition, the reward function r is fed back to the policy generation network N so that the network which generates action A can update its policy, typically with the goal of maximizing the cumulative value of the reward function.
Taking the reinforcement learning task of controlling a robot as an example, the policy generation network N generates a policy that controls the robot to perform action A, for example moving in a certain direction. The current environment state S0 may then be the robot's current position, which transitions to the next environment state S1, i.e., the robot's next position, based on action A.
In the present application, taking the reinforcement learning task of controlling a robot to manipulate an object as an example, the policy generation network N generates a policy that controls the robot to perform action A, by which the robot changes the manipulated object, for example an object to be moved, from the current state S0 (e.g., the current position) to the next state S1 (e.g., the next position). Here, the current state S0 can be divided into the current state of the robot and the current state of the object to be moved, and the next state S1 can likewise be divided into the next state of the robot and the next state of the object to be moved. As described above, the reward function r is fed back to the policy generation network N so that it can update the policy that generates action A.
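This interaction loop, with the environment state split into a robot part and a manipulated-object part as described above, can be sketched as follows. The random policy, the toy contact-based transition dynamics, and all names here are illustrative assumptions, not taken from the patent:

```python
import random

def policy(state):
    """Stand-in for the policy generation network N: returns a random 3-D action."""
    return [random.uniform(-0.2, 0.2) for _ in range(3)]

def transition(robot_pos, obj_pos, action):
    """Toy transition p: the robot moves by the action, and if it ends up
    close enough to the object, the object is pushed along with it."""
    new_robot = [r + a for r, a in zip(robot_pos, action)]
    dist = sum((r - o) ** 2 for r, o in zip(new_robot, obj_pos)) ** 0.5
    if dist < 1.0:  # contact: the robot manipulates the object
        obj_pos = [o + a for o, a in zip(obj_pos, action)]
    return new_robot, obj_pos

def rollout(steps=10, seed=0):
    """Collect a trajectory of (robot state, object state) pairs, i.e. the
    split of S0 -> S1 -> ... described in the text."""
    random.seed(seed)
    robot, obj = [0.0, 0.0, 0.0], [0.5, 0.5, 0.5]
    trajectory = []
    for _ in range(steps):
        a = policy((robot, obj))
        robot, obj = transition(robot, obj, a)
        trajectory.append((tuple(robot), tuple(obj)))
    return trajectory

traj = rollout()
```

Splitting the trajectory into the two state streams is exactly what makes the mutual-information reward of the later sections computable.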
As described above, reinforcement learning tasks often lack a hand-crafted or manually specified external reward function, since such functions are difficult to design, and in many environments external rewards are sparse.
Thus, in these tasks, the policy generation network must be updated in a way that learns to control the actions of the object performing the task fully autonomously from the environment, without an external reward function — that is, with some form of intrinsic reward driving the generation of the object's actions.
Therefore, the basic concept of the present application is to divide the environment state of the conventional reinforcement learning task into the object state of the object performing the task and the environment state of the environment on which the object acts, and to use the mutual information between the object state and the environment state as the reward function.
That is, by estimating the mutual information between the object state and the environment state during the learning process, the object performing the task receives a high intrinsic reward whenever there is high mutual information between its own state and the environment state, which is equivalent to the object learning to control the environment.
Specifically, the training method, training apparatus, and electronic device provided by the present application first acquire continuous object state information of the object performing the task and continuous environment state information of the environment on which the object acts, where the continuous object state information includes a plurality of object states of the object and the continuous environment state information includes a plurality of environment states of the environment. They then determine the joint probability distribution of the continuous object state information and the continuous environment state information, as well as their respective first and second marginal distributions; next determine the KL divergence between the joint probability distribution and the product of the first and second marginal distributions; and finally update the parameters of the policy generation network by a predetermined policy, with the KL divergence value as the reward function.
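On a toy discrete example, the steps just listed — collect paired states, estimate the joint distribution and the marginals, and compute the KL-divergence reward — can be sketched end to end. The toy dynamics and all names below are our own illustration, not from the patent:

```python
import math
import random
from collections import Counter

def collect_states(num_steps=1000, seed=1):
    """Step 1: collect paired (object state, environment state) samples;
    here both are coarse integer grid cells, and the environment copies
    the object's state 80% of the time (strong dependence)."""
    random.seed(seed)
    pairs, x = [], 0
    for _ in range(num_steps):
        x = (x + random.choice([-1, 0, 1])) % 4                   # object state
        y = x if random.random() < 0.8 else random.randrange(4)   # env state
        pairs.append((x, y))
    return pairs

def estimate_distributions(pairs):
    """Step 2: empirical joint distribution and the two marginals."""
    n = len(pairs)
    joint = {k: v / n for k, v in Counter(pairs).items()}
    p_x = {k: v / n for k, v in Counter(x for x, _ in pairs).items()}
    p_y = {k: v / n for k, v in Counter(y for _, y in pairs).items()}
    return joint, p_x, p_y

def kl_reward(joint, p_x, p_y):
    """Step 3: D_KL(joint || product of marginals), i.e. the mutual
    information used as the intrinsic reward in the final update step."""
    return sum(p * math.log(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items())

reward = kl_reward(*estimate_distributions(collect_states()))
```

With the dependent dynamics above the reward is strictly positive, whereas independent object and environment states would drive it toward zero — which is what makes it usable as an intrinsic reward.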
In this way, in the training method for a policy generation network for reinforcement learning provided by the present application, by using the KL divergence between the joint probability distribution of the continuous object state information and the continuous environment state information and the product of the first and second marginal distributions as an intrinsic reward for training the policy generation network, the network can learn to control the environment state without external supervision, which yields the following advantages:
First, the action policy of the object performing the task can be learned without a hand-crafted or manually specified external reward function. Second, learning to grasp the state of the manipulated environment helps the object performing the task achieve goals in environments where rewards are sparse; that is, the object can discover manipulation skills and quickly adapt to a specific task despite sparse external rewards. Third, learning to control the state of the manipulated environment helps the object performing the task quickly adapt to unknown tasks.
In addition, in the training method for a policy generation network for reinforcement learning provided by the present application, the learned mutual information can be used for purposes beyond the intrinsic reward; for example, when the policy generation network generates several candidate actions from experience, the mutual information can be used to rank the candidate actions by priority.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 2 illustrates a flow chart of a training method for a strategy generation network for reinforcement learning according to an embodiment of the present application.
As shown in fig. 2, the training method of the strategy generation network for reinforcement learning according to the embodiment of the present application includes the following steps.
Step S110, acquiring continuous object state information of an object executing a task and continuous environment state information of an environment acted by the object, where the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment.
In the embodiment of the present application, the object performing the reinforcement learning task and the environment on which it acts depend on the task type, and can be objects and environments of different kinds. For example, in the task of a robot manipulating an object, the object performing the task is the robot, and the environment on which it acts is the manipulated object.
As described above, in a general reinforcement learning scheme, only a single environmental state is included. In the embodiment of the present application, a single environment state is divided into two parts, that is, an object state of an object that performs a task and an environment state of an environment on which the object acts. For example, an object state of an object that performs a task refers to a state of a robot, and an environmental state of an environment in which the object acts is a state of an object manipulated by the robot.
In particular, the robot may manipulate objects through various actions, such as pushing, picking, placing, and the like. For the robot, the state thereof may include the position of each joint of the robot, i.e., the coordinate position expressed by (x, y, z), and may further include the orientation, linear velocity, angular velocity, and the like of each joint. In the present embodiment, for simplicity, the state of the robot may be described only in terms of the position expressed in (x, y, z) coordinates. In addition, the state of the object manipulated by the robot can also be described simply in terms of a position in (x, y, z) coordinates. Thus, the continuous object state of the robot refers to a continuous set of (x, y, z) coordinates, and the continuous environment state of the object manipulated by the robot refers to a continuous set of (x, y, z) coordinates.
For example, if the state of the robot is denoted s_c, then s_c = (x_c, y_c, z_c), and if the state of the object manipulated by the robot is denoted s_i, then s_i = (x_i, y_i, z_i). The continuous object state information of the robot may then be represented as S_c = {s_c,1, s_c,2, ..., s_c,T}, the robot states over T time steps, and the continuous environment state information of the robot-manipulated object may be represented as S_i = {s_i,1, s_i,2, ..., s_i,T}.
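As a minimal illustration (variable names are ours, not the patent's), the continuous state information is simply a time-indexed sequence of (x, y, z) coordinates for each of the two parts of the state:

```python
# Robot states s_c,t and manipulated-object states s_i,t over T = 3 steps.
robot_states = [(0.0, 0.0, 0.5), (0.1, 0.0, 0.5), (0.2, 0.1, 0.4)]   # S_c
object_states = [(0.5, 0.5, 0.0), (0.5, 0.5, 0.0), (0.6, 0.5, 0.0)]  # S_i

# Each time step pairs one robot state with one object state, which is the
# sample form needed later to estimate the joint distribution.
assert len(robot_states) == len(object_states)
paired = list(zip(robot_states, object_states))
```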
Therefore, in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, acquiring continuous object state information of an object that performs a task and continuous environment state information of an environment in which the object acts includes: acquiring continuous three-dimensional position information of the object executing the task as the continuous object state information; and acquiring continuous three-dimensional position information of an environment on which the object acts as the continuous environment state information.
In this way, by acquiring the continuous three-dimensional position information of the object and of the environment as the state information, the state information is kept simple in form, and, since it focuses on the spatial positions of the object and the environment, the policy generation network is well suited to tasks defined over spatial positions.
Further, in the above training method of a policy generation network for reinforcement learning, acquiring continuous three-dimensional position information of the object performing the task as the continuous object state information includes: acquiring continuous three-dimensional position information of the object performing the task, together with at least one of continuous orientation information, linear velocity information, and angular velocity information, as the continuous object state information.
That is, by acquiring the three-dimensional position information of the object performing the task, together with other motion information such as orientation, linear velocity, and angular velocity, as the state information, the policy generation network can be trained so that the generated policies control these aspects of the object's motion, enabling complicated functions such as a robot picking up an article.
At the beginning of the training method according to the embodiment of the present application, the object executing the task may act according to a partially random policy, such as ε-greedy, to explore the environment and collect object states and environment states, thereby obtaining the continuous object state information and the continuous environment state information.
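A minimal sketch of the ε-greedy exploration mentioned above (the action-value estimates passed in are a hypothetical input; the patent does not specify how they are obtained):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a uniformly random action (explore);
    otherwise take the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

random.seed(0)
# Mostly picks action 1 (highest value), occasionally explores.
actions = [epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1) for _ in range(100)]
```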
Step S120, determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively.
Because an object performing a task that can keep the environment state in high mutual information with its own state has a better grasp of the environment, in the embodiment of the present application the mutual information between the object state and the environment state is used to drive the policy generation network to learn a policy, denoted π_θ(a_t | s_t), where a_t represents the action, s_t represents the state, and θ is the parameter of the policy generation network, without an external reward function.
Mathematically, the mutual information between two random variables can be expressed as the KL divergence between their joint probability distribution and the product of their marginal distributions. Therefore, in the embodiment of the present application, in order to determine the mutual information between the continuous object state information and the continuous environment state information, the joint probability distribution of the continuous object state information and the continuous environment state information, and their first and second marginal distributions, respectively, are first determined.
Step S130, determining the KL divergence between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution. That is, as described above, following the mathematical expression of mutual information, the KL divergence between the joint probability distribution and the product of the first and second marginal distributions is determined as the mutual information between the continuous object state information and the continuous environment state information. Specifically, it can be expressed by the following formula:
I(S_i; S_c) = KL( P(S_i, S_c) ‖ P(S_i) ⊗ P(S_c) )
where P(S_i, S_c) represents the joint probability distribution of the continuous environment state information and the continuous object state information, and P(S_i) ⊗ P(S_c) represents the product of the marginal distribution of the continuous environment state information and the marginal distribution of the continuous object state information.
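As a concrete illustration of this quantity (our example, not from the patent): when the object state and environment state are jointly Gaussian with correlation coefficient rho, the KL divergence between the joint and the product of the marginals has the closed form I = -0.5 * ln(1 - rho^2), so the intrinsic reward grows as the two states become more strongly coupled:

```python
import math

def gaussian_mutual_information(rho):
    """Mutual information of a bivariate Gaussian with correlation rho:
    I = -0.5 * ln(1 - rho**2); zero iff the two states are independent."""
    return -0.5 * math.log(1.0 - rho ** 2)

independent = gaussian_mutual_information(0.0)   # no coupling -> no reward
weak = gaussian_mutual_information(0.5)
strong = gaussian_mutual_information(0.9)
```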
Step S140, updating the parameters of the policy generation network by a predetermined policy, with the KL divergence value as the reward function. That is, the policy generation network may be updated using any update rule commonly used in reinforcement learning, with the goal of maximizing the cumulative value of the reward function — that is, maximizing the mutual information between the state information of the object performing the task and the state information of the environment on which it acts.
Specifically, in the embodiment of the present application, Deep Deterministic Policy Gradient (DDPG) may be used to update the parameters of the policy generation network. This method updates the parameters in a relatively aggressive manner to improve the policy, which is preferable when the object performing the task needs to start learning quickly.
In addition, in the embodiment of the present application, Soft Actor-Critic (SAC) may be used to update the parameters of the policy generation network. This method updates the parameters in a relatively conservative manner to improve the policy, enabling a more comprehensive exploration of the environment.
Therefore, in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, updating the parameters of the policy generation network by a predetermined policy with the KL divergence value as the reward function includes: updating the parameters of the policy generation network through a Deep Deterministic Policy Gradient, with the KL divergence value as the reward function.
And in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, updating the parameters of the policy generation network by a predetermined policy with the KL divergence value as the reward function includes: updating the parameters of the policy generation network through Soft Actor-Critic, with the KL divergence value as the reward function.
In this way, according to the training method for the strategy generation network for reinforcement learning in the embodiment of the present application, the strategy generation network can be trained by using, as a reward function, the mutual information between the object state and the environment state, that is, the KL divergence value between the joint probability distribution of the object state and the environment state and the product of their marginal distributions. This is equivalent to the object performing the task learning to control the environment through the policy generated by the strategy generation network, thereby improving the effectiveness of the policy generated by the strategy generation network.
In addition, according to the training method for the strategy generation network for reinforcement learning in the embodiment of the application, the mutual information between the object state and the environment state can be used as the reward function for training the strategy generation network, so that effective policies can be generated even when no external reward function is manually designed or specified, or when rewards in the environment are sparse, thereby improving the performance of the strategy generation network.
In addition, according to the training method of the strategy generation network for reinforcement learning, the object performing the task can learn to control the environment using the policy generated by the strategy generation network, which helps the object performing the task adapt quickly to unknown tasks, in a manner that mimics how humans learn to perform tasks.
An example of calculating the KL divergence value between the joint probability distribution and the product of the first edge distribution and the second edge distribution will be described in further detail below.
In one example, a lower bound is used to approximate the value of the mutual information, i.e., $I(S_i; S_c)$. First, the Donsker-Varadhan representation can be used to rewrite the KL form of the mutual information as:

$I(S_i; S_c) = D_{KL}\left(\mathbb{P}_{S_i S_c} \,\middle\|\, \mathbb{P}_{S_i} \otimes \mathbb{P}_{S_c}\right) = \sup_{T} \, \mathbb{E}_{\mathbb{P}_{S_i S_c}}[T] - \log \mathbb{E}_{\mathbb{P}_{S_i} \otimes \mathbb{P}_{S_c}}\left[e^{T}\right]$

where the supremum is taken over all functions $T$ such that both expectations are finite. Then, a lower bound on the mutual information in the Donsker-Varadhan representation is obtained via the compression lemma in the PAC-Bayes literature, expressed as:

$I(S_i; S_c) \geq I_{\Phi}(S_i, S_c) = \sup_{\phi \in \Phi} \, \mathbb{E}_{\mathbb{P}_{S_i S_c}}[T_{\phi}] - \log \mathbb{E}_{\mathbb{P}_{S_i} \otimes \mathbb{P}_{S_c}}\left[e^{T_{\phi}}\right]$

The expectations in the above formula may be estimated using samples from $\mathbb{P}_{S_i S_c}$ and $\mathbb{P}_{S_i} \otimes \mathbb{P}_{S_c}$, or by shuffling samples from the joint distribution along the sample axis. $I_{\Phi}(S_i, S_c)$ can be trained by gradient ascent. The statistical network $T_{\phi}$ can be parameterized by a deep neural network with parameters $\phi \in \Phi$, with the aim of estimating the mutual information with arbitrary precision. The estimator of the mutual information used in training the statistical network is as follows:

$\hat{I}(S_i; S_c) = \frac{1}{N} \sum_{n=1}^{N} T_{\phi}\left(s_i^{(n)}, s_c^{(n)}\right) - \log\left(\frac{1}{N} \sum_{n=1}^{N} e^{T_{\phi}\left(s_i^{(n)},\, \bar{s}_c^{(n)}\right)}\right)$

wherein the state pairs $(s_i^{(n)}, s_c^{(n)})$ are sampled from the joint distribution $\mathbb{P}_{S_i S_c}$, and the other states $\bar{s}_c^{(n)}$ are sampled from the marginal distribution $\mathbb{P}_{S_c}$. After the lower bound estimate $\hat{I}(S_i; S_c)$ is obtained, the parameters of the statistical network $T_{\phi}$ are optimized using back propagation.
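As an illustrative sketch (not part of the claimed embodiments), the lower-bound estimator described above may be implemented in a few lines of NumPy. Here `t_phi` is a hypothetical stand-in for the statistical network, and the marginal samples are produced by the shuffling trick mentioned above:

```python
import numpy as np

def mine_lower_bound(t_phi, s_obj, s_env, rng=None):
    """Donsker-Varadhan lower-bound estimate of I(S_i; S_c) (a sketch).

    t_phi : callable mapping batches (s_obj, s_env) -> scores of shape (N,)
    s_obj, s_env : arrays of shape (N, d_i) and (N, d_c), sampled jointly.
    Samples from the product of marginals are approximated by shuffling
    s_env along the sample axis.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # First term: expectation of T_phi under the joint distribution.
    joint_term = t_phi(s_obj, s_env).mean()
    # Second term: log of the expectation of exp(T_phi) under the marginals.
    s_env_shuffled = s_env[rng.permutation(len(s_env))]
    marginal_term = np.log(np.exp(t_phi(s_obj, s_env_shuffled)).mean())
    return float(joint_term - marginal_term)
```

In practice the estimate would be recomputed each training step while $\phi$ is updated by gradient ascent.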
Also, in the embodiment of the present application, the transition mutual information value is defined from the KL divergence value described above as the increase in mutual information from the current state $s_t$ to the next state $s_{t+1}$, expressed as:

$r_t = \mathrm{clip}\left(\hat{I}(S_i; S_c)_{t+1} - \hat{I}(S_i; S_c)_{t},\ 0,\ I_{\max}\right)$

Here, $I_{\max}$ is a predefined maximum transition mutual information value. The clip function limits the transition mutual information value to the interval $[0, I_{\max}]$, where the lower limit 0 forces the mutual information estimate to be non-negative. And in practice, to mitigate the effect of certain particularly large transition mutual information values, it is beneficial to apply the threshold $I_{\max}$ as an upper limit on the transition mutual information value, i.e., on the intrinsic reward function value. By using this clip function, the training of the policy generation network can be stabilized. The threshold may be treated as a hyper-parameter.
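As a small illustration, the clip operation described above might look like the following; the function name and its arguments are illustrative stand-ins, not from the embodiment:

```python
import numpy as np

def intrinsic_reward(i_next, i_curr, i_max):
    """Clip the transition mutual information value to [0, i_max] (a sketch).

    i_next, i_curr : mutual-information estimates at states s_{t+1} and s_t.
    i_max : predefined maximum transition mutual information value
            (treated as a hyper-parameter).
    """
    return float(np.clip(i_next - i_curr, 0.0, i_max))
```

The lower bound 0 keeps the reward non-negative, and the upper bound suppresses occasional very large mutual-information jumps that would otherwise destabilize training.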
Therefore, in the training method of a strategy generation network for reinforcement learning according to an embodiment of the present application, determining the KL divergence value between the joint probability distribution and the product of the first edge distribution and the second edge distribution includes: sampling a first current state pair and a first next state pair from the joint probability distribution; sampling a current state and a next state from the continuous object state information and the second edge distribution to form a second current state pair and a second next state pair, respectively; determining two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information; determining two second mutual information values of the second current state pair and the second next state pair through the statistical network, and determining two second index values by using the two second mutual information values as exponents of the natural constant e; determining a transition mutual information value based on the two first mutual information values and the two second index values; and obtaining the KL divergence value based on the transition mutual information value.
In particular, the first current state pair and the first next state pair may be the pairs $(s_i^{(n)}, s_c^{(n)})$ in the above formula, where $n = t$ and $t+1$; the second current state pair and the second next state pair may be the pairs $(s_i^{(n)}, \bar{s}_c^{(n)})$ in the above formula, where $n = t$ and $t+1$; the statistical network may be $T_{\phi}$ as in the above formula; the two first mutual information values of the first current state pair and the first next state pair may be $T_{\phi}(s_i^{(n)}, s_c^{(n)})$, where $n = t$ and $t+1$; the two second mutual information values of the second current state pair and the second next state pair may be $T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})$, where $n = t$ and $t+1$; and the two second index values may be $e^{T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})}$, where $n = t$ and $t+1$.
In this way, the KL divergence values may be obtained with relatively simple calculations.
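One possible reading of the per-transition computation described above is the following sketch; `t_phi` and the pair arguments are hypothetical stand-ins, and the clip to $[0, I_{\max}]$ that yields the final KL divergence value is applied afterwards, as described below in the document:

```python
import math

def transition_mi_value(t_phi, joint_t, joint_t1, marg_t, marg_t1):
    """Transition mutual information value from sampled state pairs (a sketch).

    t_phi   : statistical network, callable on one (s_obj, s_env) pair -> scalar.
    joint_* : first current / next state pairs sampled from the joint distribution.
    marg_*  : second current / next state pairs (object state plus a state
              sampled from the second edge distribution).
    """
    # Two first mutual information values (scores of the joint pairs).
    first_t, first_t1 = t_phi(*joint_t), t_phi(*joint_t1)
    # Two second index values: e raised to the second mutual information values.
    second_t, second_t1 = math.exp(t_phi(*marg_t)), math.exp(t_phi(*marg_t1))
    # Per-step lower-bound estimates at t and t+1, and their difference.
    i_t = first_t - math.log(second_t)
    i_t1 = first_t1 - math.log(second_t1)
    return i_t1 - i_t
```

With single samples the log and exp cancel analytically; they are kept here to mirror the steps named in the method (in a batched estimator the log of an average of index values does not cancel).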
In the training method for a strategy generation network for reinforcement learning, obtaining the KL divergence value based on the transition mutual information value includes: determining whether the transition mutual information value is less than zero or greater than a predefined maximum transition mutual information value; setting the KL divergence value to zero in response to the transition mutual information value being less than zero; setting the KL divergence value to a predefined maximum transition mutual information value in response to the transition mutual information value being greater than the predefined maximum transition mutual information value; and setting the KL divergence value to the transitional mutual information value in response to the transitional mutual information value being greater than zero and less than the predefined maximum transitional mutual information value.
In this way, by defining the KL divergence value between zero and the maximum transition mutual information value, the training of the policy generation network may be stabilized.
In addition, in the training method of the policy generation network for reinforcement learning, the statistical network is obtained by training, and the training process includes: sampling a plurality of training first state pairs from the joint probability distribution; sampling states from the continuous object state information and the second edge distribution respectively to form a plurality of training second state pairs; calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network; calculating a plurality of training second mutual information values of the plurality of training second state pairs by using the statistical network, and calculating a plurality of training second index values by using the plurality of training second mutual information values as exponents of the natural constant e; subtracting the logarithm of the average value of the plurality of training second index values from the average value of the plurality of training first mutual information values to obtain a transition mutual information value for training; and updating parameters of the statistical network by back propagation to maximize the transition mutual information value for training.
In particular, the plurality of training first state pairs may be the pairs $(s_i^{(n)}, s_c^{(n)})$ in the above formula; the plurality of training second state pairs may be the pairs $(s_i^{(n)}, \bar{s}_c^{(n)})$ in the above formula; the statistical network may be $T_{\phi}$ as in the above formula; the plurality of training first mutual information values may be $T_{\phi}(s_i^{(n)}, s_c^{(n)})$; the plurality of training second mutual information values may be $T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})$; and the plurality of training second index values may be $e^{T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})}$.
In this way, in the training process of the statistical network, the calculation of the transition mutual information value for training is simple, so that the time cost and the calculation cost of the training of the statistical network are reduced.
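To make the training procedure concrete, the following toy NumPy sketch replaces the deep statistical network with a hand-rolled linear model over fixed features (a simplifying assumption, not the embodiment's network); the analytic gradient plays the role of back propagation:

```python
import numpy as np

def train_statistics_network(x, y, steps=300, lr=0.05, seed=0):
    """Toy gradient-ascent training of a simplified statistics network
    T_phi(x, y) = phi . [x, y, x*y], maximizing the transition mutual
    information value for training: mean(T_joint) - log(mean(exp(T_marginal)))."""
    rng = np.random.default_rng(seed)

    def feats(a, b):
        # Fixed features; a deep network would learn a richer mapping.
        return np.stack([a, b, a * b], axis=1)

    phi = np.zeros(3)
    joint = feats(x, y)  # training first state pairs (joint samples)
    for _ in range(steps):
        # Training second state pairs: shuffle y to sample the marginal.
        marg = feats(x, y[rng.permutation(len(y))])
        tm = marg @ phi
        # Gradient of mean(T_joint) - log(mean(exp(T_marginal))) w.r.t. phi.
        w = np.exp(tm - tm.max())
        w /= w.sum()  # softmax weights from the log-mean-exp term
        phi += lr * (joint.mean(axis=0) - w @ marg)
    bound = (joint @ phi).mean() - np.log(np.exp(marg @ phi).mean())
    return phi, float(bound)
```

On strongly dependent data the learned coefficient on the product feature becomes positive and the bound rises above zero, reflecting a non-trivial mutual-information estimate.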
Exemplary devices
Fig. 3 illustrates a block diagram of a training apparatus for a strategy generation network for reinforcement learning according to an embodiment of the present application.
As shown in fig. 3, a training apparatus 200 for a strategy generation network for reinforcement learning according to an embodiment of the present application includes: a state acquiring unit 210 configured to acquire continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment; a distribution determining unit 220, configured to determine a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquiring unit 210, and a first edge distribution and a second edge distribution of the continuous object state information and the continuous environment state information, respectively; a divergence value determining unit 230 for determining a KL divergence value between the joint probability distribution determined by the distribution determining unit 220 and the product of the first edge distribution and the second edge distribution; and a network updating unit 240 for updating the parameters of the policy generation network by a predetermined policy with the KL divergence value determined by the divergence value determining unit 230 as a reward function.
In an example, in the training apparatus 200 of the policy generation network for reinforcement learning, the state acquiring unit 210 includes: an object state acquisition subunit configured to acquire continuous three-dimensional position information of the object that performs the task as the continuous object state information; and an environment state acquisition subunit configured to acquire continuous three-dimensional position information of the environment on which the object acts as the continuous environment state information.
In an example, in the training apparatus 200 of the policy generation network for reinforcement learning, the object state obtaining subunit is configured to: acquiring continuous three-dimensional position information of the object performing the task, and at least one of continuous azimuth information, linear velocity information, and angular velocity information as the continuous object state information.
In an example, in the training apparatus 200 for generating a network of strategies for reinforcement learning, the divergence value determining unit 230 is configured to: sampling a first current state pair and a first next state pair from the joint probability distribution; sampling a current state and a next state from the continuous object state information and the second edge distribution to form a second current state pair and a second next state pair, respectively; determining two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information; determining two second mutual information values of the second current state pair and the second next state pair through the statistical network, and determining two second index values by using the two second mutual information values as exponents of the natural constant e; determining a transition mutual information value based on the two first mutual information values and the two second index values; and obtaining the KL divergence value based on the transition mutual information value.
In one example, in the training apparatus 200 for a strategy generation network for reinforcement learning, the obtaining, by the divergence value determining unit 230, the KL divergence values based on the transition mutual information value includes: determining whether the transition mutual information value is less than zero or greater than a predefined maximum transition mutual information value; setting the KL divergence value to zero in response to the transition mutual information value being less than zero; setting the KL divergence value to a predefined maximum transition mutual information value in response to the transition mutual information value being greater than the predefined maximum transition mutual information value; and setting the KL divergence value to the transitional mutual information value in response to the transitional mutual information value being greater than zero and less than the predefined maximum transitional mutual information value.
In an example, in the training apparatus 200 for generating a network of strategies for reinforcement learning, the statistical network is obtained by training, and the training process includes: sampling a plurality of training first state pairs from the joint probability distribution; sampling states from the continuous object state information and the second edge distribution respectively to form a plurality of training second state pairs; calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network; calculating a plurality of training second mutual information values of the plurality of training second state pairs by using the statistical network, and calculating a plurality of training second index values by using the plurality of training second mutual information values as exponents of the natural constant e; subtracting the logarithm of the average value of the plurality of training second index values from the average value of the plurality of training first mutual information values to obtain a transition mutual information value for training; and updating parameters of the statistical network by back propagation to maximize the transition mutual information value for training.
In an example, in the training apparatus 200 for generating a network according to the above-mentioned strategy for reinforcement learning, the network updating unit 240 is configured to: updating parameters of the policy generation network through a Deep Deterministic Policy Gradient (DDPG) with the KL divergence value as a reward function.
In an example, in the training apparatus 200 for generating a network according to the above-mentioned strategy for reinforcement learning, the network updating unit 240 is configured to: updating parameters of the policy generation network through Soft Actor-Critic (SAC) with the KL divergence value as a reward function.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the training apparatus 200 for a strategy generation network for reinforcement learning described above have been described in detail in the description of the training method for a strategy generation network for reinforcement learning with reference to fig. 2, and thus, a repetitive description thereof will be omitted.
As described above, the training apparatus 200 for a strategy generation network for reinforcement learning according to the embodiment of the present application can be implemented in various terminal devices, such as a server for reinforcement learning tasks and the like. In one example, the training apparatus 200 for a strategy generation network for reinforcement learning according to the embodiment of the present application may be integrated into a terminal device as a software module and/or a hardware module. For example, the training apparatus 200 of the policy generation network for reinforcement learning may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the training apparatus 200 for the strategy generation network for reinforcement learning may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the training apparatus 200 of the strategy generation network for reinforcement learning and the terminal device may be separate devices, and the training apparatus 200 of the strategy generation network for reinforcement learning may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to an agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 4.
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 4, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including parameters of the trained policy generation network to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 4, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method for a policy generation network for reinforcement learning according to various embodiments of the present application described in the "exemplary methods" section of this specification above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method for a reinforcement learning policy generation network according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (10)
1. A training method for a strategy generation network for reinforcement learning, comprising:
acquiring continuous object state information of an object executing a task and continuous environment state information of an environment on which the object acts, wherein the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment;
determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first edge distribution and a second edge distribution of the continuous object state information and the continuous environment state information respectively;
determining a KL divergence value between the joint probability distribution and the product of the first edge distribution and the second edge distribution; and
updating parameters of the policy generating network with a predetermined policy with the KL divergence value as a reward function.
2. The training method of a policy generation network for reinforcement learning according to claim 1, wherein acquiring continuous object state information of an object that performs a task and continuous environment state information of an environment in which the object acts comprises:
acquiring continuous three-dimensional position information of the object executing the task as the continuous object state information; and
acquiring continuous three-dimensional position information of an environment in which the object acts as the continuous environment state information.
3. The training method of a policy generation network for reinforcement learning according to claim 2, wherein acquiring continuous three-dimensional position information of the object performing the task as the continuous object state information includes:
acquiring continuous three-dimensional position information of the object performing the task, and at least one of continuous azimuth information, linear velocity information, and angular velocity information as the continuous object state information.
4. The training method for a strategy generation network for reinforcement learning of claim 1, wherein determining a KL divergence value between the joint probability distribution and the product of the first edge distribution and the second edge distribution comprises:
sampling a first current state pair and a first next state pair from the joint probability distribution;
sampling a current state and a next state from the continuous object state information and the second edge distribution to form a second current state pair and a second next state pair, respectively;
determining two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information;
determining two second mutual information values of the second current state pair and the second next state pair through the statistical network, and determining two second index values by using the two second mutual information values as exponents of the natural constant e;
determining a transition mutual information value based on the two first mutual information values and the two second index values; and
and obtaining the KL divergence value based on the transition mutual information value.
5. The training method for a strategy generation network for reinforcement learning of claim 4, wherein obtaining the KL divergence value based on the transitional mutual information value comprises:
determining whether the transition mutual information value is less than zero or greater than a predefined maximum transition mutual information value;
setting the KL divergence value to zero in response to the transition mutual information value being less than zero;
setting the KL divergence value to a predefined maximum transition mutual information value in response to the transition mutual information value being greater than the predefined maximum transition mutual information value; and
setting the KL divergence value to the transitional mutual information value in response to the transitional mutual information value being greater than zero and less than the predefined maximum transitional mutual information value.
6. The training method of the strategy generation network for reinforcement learning of claim 4, wherein the statistical network is obtained by training, and the training process comprises:
sampling a plurality of training first state pairs from the joint probability distribution;
sampling states from the continuous object state information and the second edge distribution respectively to form a plurality of training second state pairs;
calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network;
calculating a plurality of training second mutual information values of the plurality of training second state pairs by using the statistical network, and calculating a plurality of training second index values by using the plurality of training second mutual information values as exponents of the natural constant e;
subtracting the logarithm of the average value of the plurality of training second index values from the average value of the plurality of training first mutual information values to obtain a transition mutual information value for training; and
updating parameters of the statistical network by back propagation to maximize the transition mutual information value for training.
7. The training method of a policy generation network for reinforcement learning according to claim 1, wherein updating parameters of the policy generation network by a predetermined policy with the KL divergence value as a reward function includes:
updating parameters of the policy generation network through a Deep Deterministic Policy Gradient (DDPG) with the KL divergence value as a reward function.
8. The training method of a policy generation network for reinforcement learning according to claim 1, wherein updating parameters of the policy generation network by a predetermined policy with the KL divergence value as a reward function includes:
updating parameters of the policy generation network through Soft Actor-Critic (SAC) with the KL divergence value as a reward function.
9. A training apparatus for a strategy generation network for reinforcement learning, comprising:
a state acquisition unit configured to acquire continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment;
a distribution determining unit, configured to determine a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquiring unit, and a first edge distribution and a second edge distribution of the continuous object state information and the continuous environment state information, respectively;
a divergence value determining unit configured to determine a KL divergence value between the joint probability distribution determined by the distribution determining unit and the product of the first edge distribution and the second edge distribution; and
a network updating unit configured to update parameters of the policy generation network by a predetermined policy with the KL divergence value determined by the divergence value determining unit as a reward function.
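For a discrete toy case, the quantity this unit computes — the KL divergence between a joint distribution and the product of its two marginal distributions — is exactly the mutual information between the two state variables. A small self-contained sketch, with 2x2 distributions chosen purely for illustration:

```python
import numpy as np

def kl_joint_vs_marginals(p_xy):
    """KL( p(x,y) || p(x) p(y) ), i.e. the mutual information I(X;Y)."""
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1, keepdims=True)   # first marginal distribution
    p_y = p_xy.sum(axis=0, keepdims=True)   # second marginal distribution
    prod = p_x * p_y                        # product of the marginals
    mask = p_xy > 0                         # 0 * log 0 contributes nothing
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / prod[mask])))

coupled = np.array([[0.5, 0.0],
                    [0.0, 0.5]])            # the two variables are fully coupled
independent = np.outer([0.5, 0.5], [0.5, 0.5])

kl_coupled = kl_joint_vs_marginals(coupled)          # = log(2) ≈ 0.693
kl_independent = kl_joint_vs_marginals(independent)  # = 0
```

A larger divergence value thus rewards the policy for visiting object/environment state pairs that are statistically dependent, which matches the unit's role as a reward source.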
10. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a method of training a policy generation network for reinforcement learning according to any of claims 1-8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201962904419P | 2019-09-23 | 2019-09-23 | |
US62/904,419 | 2019-09-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112016678A true CN112016678A (en) | 2020-12-01 |
Family
ID=73503476
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010867107.1A Pending CN112016678A (en) | 2019-09-23 | 2020-08-26 | Training method and device for strategy generation network for reinforcement learning and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112016678A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949933A (en) * | 2021-03-23 | 2021-06-11 | Chengdu University of Information Technology | Traffic organization scheme optimization method based on multi-agent reinforcement learning
CN113537406A (en) * | 2021-08-30 | 2021-10-22 | Chongqing Unisinsight Technology Co., Ltd. | Method, system, medium and terminal for automatic image data augmentation
CN113705777A (en) * | 2021-08-07 | 2021-11-26 | AVIC Shenyang Aircraft Design and Research Institute | Unmanned aerial vehicle autonomous path-finding model training method and device
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | Tsinghua University | Deep reinforcement learning method and device based on environment state prediction
US20190126472A1 (en) * | 2017-10-27 | 2019-05-02 | Deepmind Technologies Limited | Reinforcement and imitation learning for a task |
CN110081893A (en) * | 2019-04-01 | 2019-08-02 | Dongguan University of Technology | Navigation path planning method based on policy reuse and reinforcement learning
US20190258918A1 (en) * | 2016-11-03 | 2019-08-22 | Deepmind Technologies Limited | Training action selection neural networks |
- 2020-08-26 CN CN202010867107.1A patent/CN112016678A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190258918A1 (en) * | 2016-11-03 | 2019-08-22 | Deepmind Technologies Limited | Training action selection neural networks |
CN110235148A (en) * | 2016-11-03 | 2019-09-13 | DeepMind Technologies Limited | Training action selection neural networks
US20190126472A1 (en) * | 2017-10-27 | 2019-05-02 | Deepmind Technologies Limited | Reinforcement and imitation learning for a task |
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | Tsinghua University | Deep reinforcement learning method and device based on environment state prediction
CN110081893A (en) * | 2019-04-01 | 2019-08-02 | Dongguan University of Technology | Navigation path planning method based on policy reuse and reinforcement learning
Non-Patent Citations (2)
Title |
---|
REIN HOUTHOOFT ET AL.: "VIME: Variational Information Maximizing Exploration", ARXIV:1605.09674V4, 27 January 2017 (2017-01-27), pages 1 - 2 *
LI, JIANGUO ET AL.: "Policy Optimization Based on KL Divergence", Computer Science, 30 June 2019 (2019-06-30) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112949933A (en) * | 2021-03-23 | 2021-06-11 | Chengdu University of Information Technology | Traffic organization scheme optimization method based on multi-agent reinforcement learning
CN113705777A (en) * | 2021-08-07 | 2021-11-26 | AVIC Shenyang Aircraft Design and Research Institute | Unmanned aerial vehicle autonomous path-finding model training method and device
CN113705777B (en) * | 2021-08-07 | 2024-04-12 | AVIC Shenyang Aircraft Design and Research Institute | Unmanned aerial vehicle autonomous path-finding model training method and device
CN113537406A (en) * | 2021-08-30 | 2021-10-22 | Chongqing Unisinsight Technology Co., Ltd. | Method, system, medium and terminal for automatic image data augmentation
CN113537406B (en) * | 2021-08-30 | 2023-04-07 | Chongqing Unisinsight Technology Co., Ltd. | Method, system, medium and terminal for automatic image data augmentation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4231197B1 (en) | Training machine learning models on multiple machine learning tasks | |
US11779837B2 (en) | Method, apparatus, and device for scheduling virtual objects in virtual environment | |
CN112016678A (en) | Training method and device for strategy generation network for reinforcement learning and electronic equipment | |
US20130325774A1 (en) | Learning stochastic apparatus and methods | |
JP7013489B2 (en) | Learning device, live-action image classification device generation system, live-action image classification device generation device, learning method and program | |
WO2017116814A1 (en) | Calibrating object shape | |
EP4303767A1 (en) | Model training method and apparatus | |
US20210107144A1 (en) | Learning method, learning apparatus, and learning system | |
US20230311335A1 (en) | Natural language control of a robot | |
CN111352419B (en) | Path planning method and system for updating experience playback cache based on time sequence difference | |
US20200276704A1 (en) | Determining control policies for robots with noise-tolerant structured exploration | |
CN114840322A (en) | Task scheduling method and device, electronic equipment and storage | |
CN114398834A (en) | Training method of particle swarm optimization algorithm model, particle swarm optimization method and device | |
Wodziński et al. | Sequential classification of palm gestures based on A* algorithm and MLP neural network for quadrocopter control | |
CN112069662A (en) | Complex product autonomous construction method and module based on man-machine hybrid enhancement | |
US20200134498A1 (en) | Dynamic boltzmann machine for predicting general distributions of time series datasets | |
CN113419424A (en) | Modeling reinforcement learning robot control method and system capable of reducing over-estimation | |
CN116968024A (en) | Method, computing device and medium for obtaining control strategy for generating shape closure grabbing pose | |
CN115421387B (en) | Variable impedance control system and control method based on inverse reinforcement learning | |
CN112334914A (en) | Imitation learning using generative leading neural networks | |
CN116710974A (en) | Domain adaptation using domain countermeasure learning in composite data systems and applications | |
CN114880740A (en) | Data-mechanics-rule driven structure support intelligent arrangement method and device | |
CN114882587A (en) | Method, apparatus, electronic device, and medium for generating countermeasure sample | |
CN112016611A (en) | Training method and device for generator network and strategy generation network and electronic equipment | |
US20210182459A1 (en) | Simulation device, simulation method, and computer-readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||