CN112016678A - Training method and device for strategy generation network for reinforcement learning and electronic equipment - Google Patents

Training method and device for strategy generation network for reinforcement learning and electronic equipment

Info

Publication number
CN112016678A
Authority
CN
China
Prior art keywords
continuous
information
state
value
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010867107.1A
Other languages
Chinese (zh)
Inventor
赵瑞 (Zhao Rui)
徐伟 (Xu Wei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Horizon Robotics Technology Co Ltd
Original Assignee
Nanjing Horizon Robotics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Horizon Robotics Technology Co Ltd filed Critical Nanjing Horizon Robotics Technology Co Ltd
Publication of CN112016678A publication Critical patent/CN112016678A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 - Programme-controlled manipulators
    • B25J9/16 - Programme controls
    • B25J9/1628 - Programme controls characterised by the control loop
    • B25J9/163 - Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

A training method, a training apparatus, and an electronic device for a policy generation network for reinforcement learning are disclosed. The training method comprises: acquiring continuous object state information of an object executing a task and continuous environment state information of the environment on which the object acts; determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; determining a KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution; and updating parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value as a reward function. In this way, the quality of the policies generated by the policy generation network is improved.

Description

Training method and device for strategy generation network for reinforcement learning and electronic equipment
Technical Field
The present application relates to the field of reinforcement learning technology, and more particularly, to a training method, a training apparatus, and an electronic device for a strategy generation network for reinforcement learning.
Background
Recently, Reinforcement Learning (RL) combined with Deep Learning (DL) has achieved success in many reward-driven tasks, including surpassing human performance in various games, and has also shown strong performance in continuous robot control tasks, navigation tasks in complex environments, and object manipulation tasks.
However, despite these successes, in current reinforcement learning tasks the object performing the task, such as a robot manipulating objects, typically learns only from external reward signals, unlike the human learning process. For example, when a person learns to manipulate objects, the person not only attempts to complete a task, but also learns to grasp the controllable aspects of the environment. Even in an unsupervised setting, a person can quickly discover the association between his or her own actions and changes in the state of the environment, and then use this skill to manipulate the environment into a desired state.
Moreover, in practical reinforcement learning tasks it is difficult to design an external reward function that guarantees that the object performing the task will learn a desired behavior, such as manipulating an object. Accordingly, it is desirable to provide an improved training scheme for a policy generation network that generates actions for the object performing the task, so that actions can be generated efficiently even in the absence of external rewards.
Disclosure of Invention
The present application is proposed to solve the above technical problems. Embodiments of the present application provide a training method, a training apparatus, and an electronic device for a policy generation network for reinforcement learning, which determine the mutual information between the object state and the environment state, that is, the KL divergence value between the probability distributions of the object state and the environment state, and use it as a reward function to train the policy generation network, thereby improving the quality of the policies generated by the policy generation network.
According to an aspect of the present application, there is provided a training method for a policy generation network for reinforcement learning, including: acquiring continuous object state information of an object executing a task and continuous environment state information of an environment on which the object acts, wherein the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment; determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; determining a KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution; and updating parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value as a reward function.
According to another aspect of the present application, there is provided a training apparatus for a policy generation network for reinforcement learning, including: a state acquisition unit for acquiring continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment; a distribution determining unit for determining a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquisition unit, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; a divergence value determining unit for determining a KL divergence value between the joint probability distribution determined by the distribution determining unit and the product of the first marginal distribution and the second marginal distribution; and a network updating unit for updating parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value determined by the divergence value determining unit as a reward function.
According to still another aspect of the present application, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the training method for a strategy generation network for reinforcement learning as described above.
According to yet another aspect of the present application, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the training method for a strategy generation network for reinforcement learning as described above.
According to the training method, the training apparatus, and the electronic device for a policy generation network for reinforcement learning provided by the present application, the mutual information between the object state and the environment state, that is, the KL divergence value between the probability distributions of the object state and the environment state, is determined and used as the reward function for training the policy generation network. This is equivalent to the object performing the task learning to control the environment through the policies generated by the policy generation network, thereby improving the effectiveness of the generated policies.
In addition, according to the training method, the training apparatus, and the electronic device for a policy generation network for reinforcement learning provided by the present application, using the mutual information between the object state and the environment state as the reward function to train the policy generation network allows policies to be generated effectively even when no external reward function has been hand-crafted or manually specified, or when rewards in the environment are sparse, thereby improving the performance of the policy generation network.
In addition, the training method, the training apparatus, and the electronic device for a policy generation network for reinforcement learning provided by the present application can, by having the object performing the task learn to control the environment through the policies generated by the policy generation network, help the object quickly adapt to unknown tasks in a way that imitates how humans learn to perform tasks.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a schematic diagram of a standard reinforcement learning model.
Fig. 2 illustrates a flow chart of a training method for a strategy generation network for reinforcement learning according to an embodiment of the present application.
Fig. 3 illustrates a block diagram of a training apparatus for a strategy generation network for reinforcement learning according to an embodiment of the present application.
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
FIG. 1 illustrates a schematic diagram of a standard reinforcement learning model. As shown in FIG. 1, the policy generation network N generates an action a for the object (agent) performing the task; the current state S0 of the environment transitions to the next state S1 of the environment based on the action a, and p denotes the transition probability from the current state to the next state. In addition, a reward function r is fed back to the policy generation network N, which uses it to update the policy from which action a is generated, generally with the aim of maximizing the cumulative value of the reward function.
Taking the reinforcement learning task of controlling a robot as an example, the policy generation network N generates a policy for controlling the robot to perform action a. For instance, if the robot is moving in a certain direction, the current state S0 of the environment may be represented as the current position of the robot, which transitions to the next state S1 of the environment, i.e., the next position of the robot, based on the action a.
In the present application, taking the reinforcement learning task of controlling a robot to manipulate an object as an example, the policy generation network N generates a policy for controlling the robot to perform action a; through action a, the robot changes the object to be manipulated, for example an object to be moved, from the current state S0, for example its current position, to the next state S1, for example its next position. Here, the current state S0 can be divided into the current state of the robot and the current state of the object to be moved. Likewise, the next state S1 can be divided into the next state of the robot and the next state of the object to be moved. In addition, as described above, the reward function r is fed back to the policy generation network N and used to update the policy from which action a is generated.
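For illustration only, the interaction loop of FIG. 1 can be sketched in code as follows; the PolicyNetwork and Environment classes below are simplified stand-ins introduced here for illustration, not components defined by the present application.

```python
import numpy as np

class PolicyNetwork:
    """Stand-in for the policy generation network N: maps a state to an action."""
    def __init__(self, state_dim, action_dim):
        self.w = np.zeros((state_dim, action_dim))  # toy linear policy parameters

    def act(self, state):
        return state @ self.w  # generate action a from the current state

class Environment:
    """Stand-in environment: applying action a moves state S0 to the next state S1."""
    def __init__(self, state_dim):
        self.state = np.zeros(state_dim)

    def step(self, action):
        self.state = self.state + 0.1 * action  # toy transition p(S1 | S0, a)
        reward = 0.0                            # reward r fed back to the policy network
        return self.state, reward

policy = PolicyNetwork(state_dim=3, action_dim=3)
env = Environment(state_dim=3)
state = env.state
for t in range(10):
    action = policy.act(state)        # N generates action a
    state, reward = env.step(action)  # environment transitions S0 -> S1 and returns r
```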
As described above, in reinforcement learning tasks there are often cases where no hand-crafted or manually specified external reward function exists, such a function is difficult to design, and in many environments external rewards are sparse.
Thus, in these tasks, what is needed is a way of updating the policy generation network that can learn to control the actions of the object performing the task completely autonomously from the environment, without an external reward function, that is, with some type of intrinsic reward driving the generation of the object's actions.
Therefore, the basic concept of the present application is to divide the environment state of the conventional reinforcement learning task into an object state of the object performing the task and an environment state of the environment on which the object acts, and to use the mutual information between the object state and the environment state as the reward function.
That is, by estimating the mutual information between the object state and the environment state during the learning process of the object performing the task, the object receives a high intrinsic reward when there is high mutual information between its own state and the environment state, which is equivalent to the object performing the task learning to control the environment.
Specifically, the training method, the training apparatus, and the electronic device provided by the present application first acquire continuous object state information of an object executing a task and continuous environment state information of the environment on which the object acts, where the continuous object state information includes a plurality of object states of the object and the continuous environment state information includes a plurality of environment states of the environment; then determine a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; determine a KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution; and finally update the parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value as a reward function.
In this way, in the training method for a policy generation network for reinforcement learning provided by the present application, by using the KL divergence value between the joint probability distribution of the continuous object state information and the continuous environment state information and the product of the first marginal distribution and the second marginal distribution as an intrinsic reward for training the policy generation network, it is possible to learn to control the environment state without external supervision, thereby obtaining the following advantages:
First, the action policy of the object performing the task can be learned without a hand-crafted or manually specified external reward function. Second, learning to grasp and manipulate the state of the environment can help the object performing the task learn to achieve goals in an environment where rewards are sparse; that is, the object performing the task can discover manipulation skills and quickly adapt to a specific task even with sparse external rewards. Third, learning to manipulate the state of the environment can help the object performing the task quickly adapt to unknown tasks.
In addition, in the training method for the strategy generation network for reinforcement learning provided by the application, the learned mutual information can be used for other purposes besides being used as the intrinsic reward, for example, when the strategy generation network generates a plurality of candidate actions according to experience, the mutual information can be used for judging the priority of the candidate actions.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary method
Fig. 2 illustrates a flow chart of a training method for a strategy generation network for reinforcement learning according to an embodiment of the present application.
As shown in fig. 2, the training method of the strategy generation network for reinforcement learning according to the embodiment of the present application includes the following steps.
Step S110: acquire continuous object state information of an object executing a task and continuous environment state information of the environment on which the object acts, where the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment.
In the embodiments of the present application, in a reinforcement learning task, the object executing the task and the environment on which it acts depend on the task type and can be objects and environments of different kinds. For example, in a task in which a robot manipulates an object, the object executing the task is the robot, and the environment on which it acts refers to the state of the manipulated object.
As described above, a general reinforcement learning scheme includes only a single environment state. In the embodiments of the present application, this single environment state is divided into two parts: the object state of the object that performs the task and the environment state of the environment on which the object acts. For example, the object state of the object performing the task is the state of the robot, and the environment state of the environment on which the object acts is the state of the object manipulated by the robot.
In particular, the robot may manipulate objects through various actions, such as pushing, picking, and placing. For the robot, its state may include the position of each joint, i.e., the coordinate position expressed as (x, y, z), and may further include the orientation, linear velocity, angular velocity, and the like of each joint. In the present embodiment, for simplicity, the state of the robot is described only in terms of the position expressed in (x, y, z) coordinates. Likewise, the state of the object manipulated by the robot is described simply in terms of its position in (x, y, z) coordinates. Thus, the continuous object state of the robot is a sequence of (x, y, z) coordinates, and the continuous environment state of the object manipulated by the robot is likewise a sequence of (x, y, z) coordinates.
For example, the state of the robot is denoted s_c, with s_c = (x_c, y_c, z_c), and the state of the object manipulated by the robot is denoted s_i, with s_i = (x_i, y_i, z_i). The continuous object state information of the robot may then be represented as S_c, i.e., the sequence of robot states

S_c = \{ s_c(0), s_c(1), \ldots, s_c(t), \ldots \},

and the continuous environment state information of the object manipulated by the robot may be represented as S_i, i.e., the sequence of object states

S_i = \{ s_i(0), s_i(1), \ldots, s_i(t), \ldots \}.
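As an illustrative sketch (variable names and the random placeholder values are assumptions made here, not part of the application), the sequences S_c and S_i can be collected as arrays of (x, y, z) positions:

```python
import numpy as np

T = 100              # number of time steps collected
robot_states = []    # will become S_c: continuous object state information
object_states = []   # will become S_i: continuous environment state information
for t in range(T):
    s_c = np.random.uniform(-1.0, 1.0, size=3)  # placeholder robot position (x_c, y_c, z_c)
    s_i = np.random.uniform(-1.0, 1.0, size=3)  # placeholder object position (x_i, y_i, z_i)
    robot_states.append(s_c)
    object_states.append(s_i)

S_c = np.stack(robot_states)   # shape (T, 3)
S_i = np.stack(object_states)  # shape (T, 3)
```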
Therefore, in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, acquiring continuous object state information of an object that performs a task and continuous environment state information of an environment in which the object acts includes: acquiring continuous three-dimensional position information of the object executing the task as the continuous object state information; and acquiring continuous three-dimensional position information of an environment on which the object acts as the continuous environment state information.
In this way, by acquiring the continuous three-dimensional position information of the object and of the environment as the state information, the form of the state information is kept simple; and because the state information focuses on the spatial positions of the object and the environment, the policy generation network is well suited to tasks defined over spatial positions.
Further, in the above training method of a policy generation network for reinforcement learning, acquiring continuous three-dimensional position information of the object performing the task as the continuous object state information includes: acquiring continuous three-dimensional position information of the object performing the task, and at least one of continuous orientation information, linear velocity information, and angular velocity information, as the continuous object state information.
That is, by acquiring the three-dimensional position information of the object performing the task, together with other motion information such as orientation, linear velocity, and angular velocity, as the state information, the policy generation network can be trained so that the generated policies control these aspects of the motion of the object performing the task, thereby realizing more complex functions such as a robot picking up an article.
At the beginning of the training method according to the embodiments of the present application, the object executing the task may act according to a partially random strategy, such as an ε-greedy algorithm, to explore the environment and collect object states and environment states, thereby obtaining the continuous object state information and the continuous environment state information.
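A minimal sketch of such a partially random, ε-greedy-style action choice, assuming the policy action and a random exploratory action are already available:

```python
import random

def partially_random_action(policy_action, random_action, epsilon=0.1):
    """With probability epsilon take a random exploratory action, otherwise the policy action."""
    return random_action if random.random() < epsilon else policy_action
```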
Step S120: determine a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively.
Because an object performing a task that can drive the environment state to have high mutual information with its own state has a better grasp of the environment, in the embodiments of the present application the mutual information between the object state and the environment state is used to drive the policy generation network to learn a policy, denoted π_θ(a_t | s_t), without an external reward function, where a_t denotes the action, s_t denotes the state, and θ denotes the parameters of the policy generation network.
Mathematically, the mutual information between two random variables can be expressed as the KL divergence between the joint probability distribution of the two random variables and the product of their respective marginal distributions. Therefore, in the embodiments of the present application, in order to determine the mutual information between the continuous object state information and the continuous environment state information, the joint probability distribution of the continuous object state information and the continuous environment state information, and the first marginal distribution and the second marginal distribution of the continuous object state information and the continuous environment state information, respectively, are first determined.
Step S130: determine a KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution. That is, as described above, the KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution is determined as the mutual information between the continuous object state information and the continuous environment state information, in accordance with the mathematical expression of mutual information. Specifically, it can be written as:

I(S_i; S_c) = D_{\mathrm{KL}}\!\left( P_{(S_i, S_c)} \,\middle\|\, P_{S_i} \otimes P_{S_c} \right),

where P_{(S_i, S_c)} denotes the joint probability distribution of the continuous environment state information and the continuous object state information, and P_{S_i} \otimes P_{S_c} denotes the product of the marginal distribution of the continuous environment state information and the marginal distribution of the continuous object state information.
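Written out as an expectation (a standard identity, included here for clarity), this KL divergence is exactly the mutual information between the two state variables:

```latex
I(S_i; S_c)
  = D_{\mathrm{KL}}\!\left(P_{(S_i, S_c)} \,\middle\|\, P_{S_i} \otimes P_{S_c}\right)
  = \mathbb{E}_{(s_i, s_c) \sim P_{(S_i, S_c)}}\!\left[\log \frac{p(s_i, s_c)}{p(s_i)\, p(s_c)}\right]
```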
Step S140: update the parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value as a reward function. That is, the policy generation network may be updated using an update strategy commonly used in reinforcement learning, with the goal of maximizing the cumulative value of the reward function, that is, maximizing the mutual information between the state information of the object performing the task and the environment state information of the environment on which the object acts.
Specifically, in the embodiments of the present application, Deep Deterministic Policy Gradient (DDPG) may be used to update the parameters of the policy generation network. This strategy improves the policy by updating the parameters of the policy generation network in a relatively aggressive manner, which works better when the object performing the task should start learning quickly.
Alternatively, in the embodiments of the present application, Soft Actor-Critic (SAC) may be used to update the parameters of the policy generation network. This strategy improves the policy by updating the parameters of the policy generation network in a relatively conservative manner, enabling a more thorough exploration of the environment.
Therefore, in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, updating the parameters of the policy generation network according to a predetermined update strategy with the KL divergence value as a reward function includes: updating the parameters of the policy generation network through a deep deterministic policy gradient, with the KL divergence value as a reward function.
And in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, updating the parameters of the policy generation network according to a predetermined update strategy with the KL divergence value as a reward function includes: updating the parameters of the policy generation network through soft actor-critic, with the KL divergence value as a reward function.
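The following sketch shows where the KL-divergence-based intrinsic reward would replace the external reward in a generic off-policy update loop; the agent, env, replay_buffer, and intrinsic_reward objects are assumed interfaces for illustration, not the DDPG or SAC implementations themselves.

```python
def collect_and_update(env, agent, intrinsic_reward, replay_buffer, num_steps=1000):
    """Generic off-policy loop: the stored reward is the KL-divergence-based intrinsic reward.

    env, agent, replay_buffer, and intrinsic_reward are assumed interfaces; agent.update
    stands for either a DDPG-style (aggressive) or SAC-style (conservative) update.
    """
    state = env.reset()
    for _ in range(num_steps):
        action = agent.select_action(state)
        next_state, _, done = env.step(action)   # the external reward is ignored
        r = intrinsic_reward(state, next_state)  # KL divergence value used as reward
        replay_buffer.add(state, action, r, next_state, done)
        agent.update(replay_buffer)
        state = env.reset() if done else next_state
```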
In this way, according to the training method for a policy generation network for reinforcement learning of the embodiments of the present application, the mutual information between the object state and the environment state, that is, the KL divergence value between the probability distributions of the object state and the environment state, is determined and used as a reward function to train the policy generation network. This is equivalent to the object performing the task learning to control the environment through the policies generated by the policy generation network, thereby improving the effectiveness of the generated policies.
In addition, according to the training method for a policy generation network for reinforcement learning of the embodiments of the present application, using the mutual information between the object state and the environment state as the reward function to train the policy generation network allows policies to be generated effectively even when no external reward function has been hand-crafted or manually specified, or when rewards in the environment are sparse, thereby improving the performance of the policy generation network.
In addition, according to the training method of a policy generation network for reinforcement learning of the embodiments of the present application, the object performing the task can learn to control the environment using the policies generated by the policy generation network, which helps the object quickly adapt to unknown tasks in a way that imitates how humans learn to perform tasks.
An example of calculating the KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution is described in further detail below.
In one example, a lower bound is used to approximate the mutual information value I(S_i; S_c). First, the Donsker-Varadhan representation can be used to rewrite the KL form of the mutual information as:

D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_{P}[T] - \log\!\left(\mathbb{E}_{Q}\!\left[e^{T}\right]\right),

where the supremum is taken over all functions T such that both expectations are finite. Then, a lower bound on the mutual information is obtained from the Donsker-Varadhan representation, in line with the compression lemma in the PAC-Bayes literature, and is expressed as:

I(S_i; S_c) \;\geq\; I_{\Phi}(S_i, S_c) = \sup_{\phi \in \Phi} \; \mathbb{E}_{P_{(S_i, S_c)}}\!\left[T_{\phi}\right] - \log\!\left(\mathbb{E}_{P_{S_i} \otimes P_{S_c}}\!\left[e^{T_{\phi}}\right]\right).

The expectations in the above formula can be estimated using samples drawn from the joint distribution P_{(S_i, S_c)} and from the product of marginals P_{S_i} \otimes P_{S_c}, or by shuffling samples from the joint distribution along the batch axis. I_{\Phi}(S_i, S_c) can then be trained by gradient ascent. The statistical network T_{\phi} can be parameterized by a deep neural network with parameters \phi \in \Phi, with the aim of estimating the mutual information with arbitrary precision. The expression for the mutual information used in training the statistical network is as follows:

I_{\Phi}(S_i, S_c) \;\approx\; \frac{1}{N}\sum_{n=1}^{N} T_{\phi}\!\left(s_i^{(n)}, s_c^{(n)}\right) - \log\!\left(\frac{1}{N}\sum_{n=1}^{N} e^{T_{\phi}\left(s_i^{(n)}, \bar{s}_c^{(n)}\right)}\right),

where the state pairs (s_i^{(n)}, s_c^{(n)}) are sampled from the joint distribution P_{(S_i, S_c)}, and the other states \bar{s}_c^{(n)} are sampled from the marginal distribution P_{S_c}. After the lower bound I_{\Phi}(S_i, S_c) has been estimated, the parameters of the statistical network T_{\phi} are optimized using back-propagation.
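A minimal PyTorch sketch of such a statistical network T_φ and of the Donsker-Varadhan lower bound estimate is given below; the architecture, layer sizes, and the batch-shuffling trick for marginal samples are illustrative assumptions rather than the application's prescribed implementation.

```python
import torch
import torch.nn as nn

class StatisticNetwork(nn.Module):
    """T_phi: maps an (environment state, object state) pair to a scalar statistic."""
    def __init__(self, env_state_dim=3, obj_state_dim=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(env_state_dim + obj_state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s_i, s_c):
        return self.net(torch.cat([s_i, s_c], dim=-1)).squeeze(-1)

def dv_lower_bound(t_phi, s_i, s_c):
    """Donsker-Varadhan lower bound on I(S_i; S_c) from a batch of joint samples.

    Rows of (s_i, s_c) are joint samples; marginal-product samples are obtained by
    shuffling s_c along the batch axis to break the pairing.
    """
    n = s_c.shape[0]
    joint_term = t_phi(s_i, s_c).mean()
    s_c_shuffled = s_c[torch.randperm(n)]
    # log of the mean of exp(T_phi) over marginal-product samples
    marginal_term = torch.logsumexp(t_phi(s_i, s_c_shuffled), dim=0) - torch.log(torch.tensor(float(n)))
    return joint_term - marginal_term
```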
Also, in the embodiments of the present application, the transition mutual information value is defined as the KL divergence value described above, namely the increase in mutual information from the current state s_t to the next state s_{t+1}, expressed as:

r_t = \mathrm{clip}\!\left( I_{\phi}(s_{t+1}) - I_{\phi}(s_t),\; 0,\; I_{\max} \right),

where I_{\phi}(s_t) denotes the mutual information estimated by the statistical network at step t, and I_{\max} is a predefined maximum transition mutual information value. The clip function limits the transition mutual information value to the interval [0, I_{\max}]. The lower limit 0 forces the mutual information estimate to be non-negative. In practice, to mitigate the effect of occasional particularly large transition mutual information values, it is beneficial to apply the threshold I_{\max} as an upper limit on the transition mutual information value, i.e., on the value of the intrinsic reward function. By using this clip function, the training of the policy generation network can be stabilized. The threshold I_{\max} may be treated as a hyper-parameter.
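A minimal sketch of the clipped transition mutual information value used as the intrinsic reward, assuming per-step mutual information estimates are already available from the statistical network:

```python
def transition_mutual_information_reward(mi_estimate_t, mi_estimate_t_plus_1, mi_max):
    """Intrinsic reward: increase in estimated mutual information from step t to t+1,
    clipped to the interval [0, mi_max] to keep training stable."""
    delta = mi_estimate_t_plus_1 - mi_estimate_t
    return min(max(delta, 0.0), mi_max)
```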
Therefore, in the training method of a policy generation network for reinforcement learning according to an embodiment of the present application, determining the KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution includes: sampling (either from experience or along the batch axis) a first current state pair and a first next state pair from the joint probability distribution; sampling a current state and a next state from the continuous object state information and from the second marginal distribution to form a second current state pair and a second next state pair, respectively; determining two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information; determining two second mutual information values of the second current state pair and the second next state pair through the statistical network for calculating mutual information, and determining two second exponential values by using the second mutual information values as exponents of the natural constant e; determining a transition mutual information value based on the two first mutual information values and the two second exponential values; and obtaining the KL divergence value based on the transition mutual information value.
In particular, the first current state pair and the first next state pair may be the pairs (s_i^{(n)}, s_c^{(n)}) in the above formula with n = t and t+1; the second current state pair and the second next state pair may be the pairs (s_i^{(n)}, \bar{s}_c^{(n)}) in the above formula with n = t and t+1; the statistical network may be T_{\phi} in the above formula; the two first mutual information values of the first current state pair and the first next state pair may be T_{\phi}(s_i^{(n)}, s_c^{(n)}) with n = t and t+1; the two second mutual information values of the second current state pair and the second next state pair may be T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)}) with n = t and t+1; and the two second exponential values may be e^{T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})} with n = t and t+1.
In this way, the KL divergence values may be obtained with relatively simple calculations.
In the above training method for a policy generation network for reinforcement learning, obtaining the KL divergence value based on the transition mutual information value includes: determining whether the transition mutual information value is less than zero or greater than a predefined maximum transition mutual information value; setting the KL divergence value to zero in response to the transition mutual information value being less than zero; setting the KL divergence value to the predefined maximum transition mutual information value in response to the transition mutual information value being greater than the predefined maximum transition mutual information value; and setting the KL divergence value to the transition mutual information value in response to the transition mutual information value being greater than zero and less than the predefined maximum transition mutual information value.
In this way, by limiting the KL divergence value to between zero and the maximum transition mutual information value, the training of the policy generation network can be stabilized.
In addition, in the training method of the policy generation network for reinforcement learning, the statistical network is obtained by training, and the training process includes: sampling a plurality of training first state pairs from the joint probability distribution; sampling states from the continuous object state information and from the second marginal distribution, respectively, to form a plurality of training second state pairs; calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network; calculating a plurality of training second mutual information values for the plurality of training second state pairs using the statistical network, and calculating a plurality of training second exponential values by using the plurality of training second mutual information values as exponents of the natural constant e; subtracting the logarithm of the average of the plurality of training second exponential values from the average of the plurality of training first mutual information values to obtain a transition mutual information value for training; and updating the parameters of the statistical network by back-propagation to maximize the transition mutual information value for training.
In particular, the plurality of training first state pairs may be the pairs (s_i^{(n)}, s_c^{(n)}) in the above formula; the plurality of training second state pairs may be the pairs (s_i^{(n)}, \bar{s}_c^{(n)}) in the above formula; the statistical network may be T_{\phi} in the above formula; the plurality of training first mutual information values may be T_{\phi}(s_i^{(n)}, s_c^{(n)}); the plurality of training second mutual information values may be T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)}); and the plurality of training second exponential values may be e^{T_{\phi}(s_i^{(n)}, \bar{s}_c^{(n)})}.
In this way, in the training process of the statistical network, the calculation of the transition mutual information value for training is simple, which reduces the time cost and the computational cost of training the statistical network.
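A sketch of one training update of the statistical network, reusing the StatisticNetwork and dv_lower_bound sketches above; the Adam optimizer and the batch shapes are assumptions for illustration:

```python
import torch

def train_statistic_network(t_phi, optimizer, s_i_batch, s_c_batch):
    """One gradient-ascent step on the Donsker-Varadhan lower bound via back-propagation.

    s_i_batch and s_c_batch are (batch, state_dim) tensors sampled from the joint
    distribution; dv_lower_bound and StatisticNetwork refer to the sketches above.
    """
    bound = dv_lower_bound(t_phi, s_i_batch, s_c_batch)
    loss = -bound                 # maximizing the bound == minimizing its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return bound.item()

# Example usage (shapes and optimizer choice are illustrative assumptions):
# t_phi = StatisticNetwork()
# optimizer = torch.optim.Adam(t_phi.parameters(), lr=1e-3)
# value = train_statistic_network(t_phi, optimizer, torch.randn(128, 3), torch.randn(128, 3))
```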
Exemplary devices
Fig. 3 illustrates a block diagram of a training apparatus for a strategy generation network for reinforcement learning according to an embodiment of the present application.
As shown in FIG. 3, a training apparatus 200 for a policy generation network for reinforcement learning according to an embodiment of the present application includes: a state acquisition unit 210 for acquiring continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment; a distribution determining unit 220 for determining a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquisition unit 210, and a first marginal distribution and a second marginal distribution of the continuous object state information and the continuous environment state information, respectively; a divergence value determining unit 230 for determining a KL divergence value between the joint probability distribution determined by the distribution determining unit 220 and the product of the first marginal distribution and the second marginal distribution; and a network updating unit 240 for updating the parameters of the policy generation network according to a predetermined update strategy, using the KL divergence value determined by the divergence value determining unit 230 as a reward function.
In an example, in the training apparatus 200 for a policy generation network for reinforcement learning, the state acquisition unit 210 includes: an object state acquisition subunit for acquiring continuous three-dimensional position information of the object that performs the task as the continuous object state information; and an environment state acquisition subunit for acquiring continuous three-dimensional position information of the environment on which the object acts as the continuous environment state information.
In an example, in the training apparatus 200 for a policy generation network for reinforcement learning, the object state acquisition subunit is configured to: acquire continuous three-dimensional position information of the object performing the task, and at least one of continuous orientation information, linear velocity information, and angular velocity information, as the continuous object state information.
In an example, in the training apparatus 200 for a policy generation network for reinforcement learning, the divergence value determining unit 230 is configured to: sample a first current state pair and a first next state pair from the joint probability distribution; sample a current state and a next state from the continuous object state information and from the second marginal distribution to form a second current state pair and a second next state pair, respectively; determine two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information; determine two second mutual information values of the second current state pair and the second next state pair through the statistical network for calculating mutual information, and determine two second exponential values by using the second mutual information values as exponents of the natural constant e; determine a transition mutual information value based on the two first mutual information values and the two second exponential values; and obtain the KL divergence value based on the transition mutual information value.
In one example, in the training apparatus 200 for a policy generation network for reinforcement learning, the obtaining, by the divergence value determining unit 230, of the KL divergence value based on the transition mutual information value includes: determining whether the transition mutual information value is less than zero or greater than a predefined maximum transition mutual information value; setting the KL divergence value to zero in response to the transition mutual information value being less than zero; setting the KL divergence value to the predefined maximum transition mutual information value in response to the transition mutual information value being greater than the predefined maximum transition mutual information value; and setting the KL divergence value to the transition mutual information value in response to the transition mutual information value being greater than zero and less than the predefined maximum transition mutual information value.
In an example, in the training apparatus 200 for a policy generation network for reinforcement learning, the statistical network is obtained by training, and the training process includes: sampling a plurality of training first state pairs from the joint probability distribution; sampling states from the continuous object state information and from the second marginal distribution, respectively, to form a plurality of training second state pairs; calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network; calculating a plurality of training second mutual information values for the plurality of training second state pairs using the statistical network, and calculating a plurality of training second exponential values by using the plurality of training second mutual information values as exponents of the natural constant e; subtracting the logarithm of the average of the plurality of training second exponential values from the average of the plurality of training first mutual information values to obtain a transition mutual information value for training; and updating the parameters of the statistical network by back-propagation to maximize the transition mutual information value for training.
In an example, in the above training apparatus 200 for a policy generation network for reinforcement learning, the network updating unit 240 is configured to: update the parameters of the policy generation network through a deep deterministic policy gradient, with the KL divergence value as a reward function.
In an example, in the above training apparatus 200 for a policy generation network for reinforcement learning, the network updating unit 240 is configured to: update the parameters of the policy generation network through soft actor-critic, with the KL divergence value as a reward function.
Here, it will be understood by those skilled in the art that the specific functions and operations of the respective units and modules in the training apparatus 200 for a strategy generation network for reinforcement learning described above have been described in detail in the description of the training method for a strategy generation network for reinforcement learning with reference to fig. 2, and thus, a repetitive description thereof will be omitted.
As described above, the training apparatus 200 for a strategy generation network for reinforcement learning according to the embodiment of the present application can be implemented in various terminal devices, such as a server for reinforcement learning tasks and the like. In one example, the training apparatus 200 for a strategy generation network for reinforcement learning according to the embodiment of the present application may be integrated into a terminal device as a software module and/or a hardware module. For example, the training apparatus 200 of the policy generation network for reinforcement learning may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the training apparatus 200 for the strategy generation network for reinforcement learning may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the training apparatus 200 of the strategy generation network for reinforcement learning and the terminal device may be separate devices, and the training apparatus 200 of the strategy generation network for reinforcement learning may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information according to an agreed data format.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 4.
FIG. 4 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 4, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the training method for a reinforcement learning policy generation network of the various embodiments of the present application described above and/or other desired functions. Various contents such as object state information, environment state information, mutual information values, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
The input device 13 may include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including parameters of the trained policy generation network to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 4, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method for a policy generation network for reinforcement learning according to various embodiments of the present application described in the "exemplary methods" section of this specification above.
The computer program product may be written with program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the training method for a reinforcement learning policy generation network according to various embodiments of the present application described in the "exemplary methods" section above in this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A training method for a strategy generation network for reinforcement learning, comprising:
acquiring continuous object state information of an object executing a task and continuous environment state information of an environment on which the object acts, wherein the continuous object state information includes a plurality of object states of the object, and the continuous environment state information includes a plurality of environment states of the environment;
determining a joint probability distribution of the continuous object state information and the continuous environment state information, and a first marginal distribution of the continuous object state information and a second marginal distribution of the continuous environment state information, respectively;
determining a KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution; and
updating parameters of the policy generation network by a predetermined policy with the KL divergence value as a reward function.
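By way of illustration only, and not as a limitation of the claims, one possible realization of the method of claim 1 is sketched below in Python (PyTorch is assumed; the names train_policy_step, kl_estimator and agent_update are hypothetical and do not appear in the specification):

import torch

def train_policy_step(policy_net, kl_estimator, agent_update, object_states, env_states):
    # object_states: (T, d_obj) tensor of continuous object states collected while
    # the object executes the task; env_states: (T, d_env) tensor of the
    # corresponding continuous environment states.
    # kl_estimator is assumed to return a scalar tensor holding the KL divergence
    # between the joint distribution of (object state, environment state) and the
    # product of the two marginal distributions, estimated from these samples.
    kl_value = kl_estimator(object_states, env_states)
    reward = kl_value.detach()  # the KL divergence value serves as the reward function
    # agent_update applies the predetermined policy update (e.g. a DDPG- or
    # SAC-style step, cf. claims 7 and 8) to the policy generation network.
    agent_update(policy_net, object_states, env_states, reward)
    return reward

In this sketch the KL divergence between the joint distribution and the product of the marginals, i.e. the mutual information between object states and environment states, plays the role of the reward function.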
2. The training method of a policy generation network for reinforcement learning according to claim 1, wherein acquiring continuous object state information of an object that performs a task and continuous environment state information of an environment in which the object acts comprises:
acquiring continuous three-dimensional position information of the object executing the task as the continuous object state information; and
acquiring continuous three-dimensional position information of an environment in which the object acts as the continuous environment state information.
3. The training method of a policy generation network for reinforcement learning according to claim 2, wherein acquiring continuous three-dimensional position information of the object performing the task as the continuous object state information includes:
acquiring continuous three-dimensional position information of the object performing the task, together with at least one of continuous orientation information, linear velocity information, and angular velocity information, as the continuous object state information.
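As a purely illustrative sketch of claims 2 and 3 (Python with PyTorch assumed; the function name make_object_state and its arguments are hypothetical), the continuous object state may be assembled from the three-dimensional position together with whichever orientation and velocity components are available:

import torch

def make_object_state(position_xyz, orientation=None, linear_velocity=None, angular_velocity=None):
    # position_xyz: length-3 sequence holding the continuous three-dimensional
    # position of the object; the optional arguments correspond to the
    # orientation, linear velocity and angular velocity information of claim 3.
    parts = [torch.as_tensor(position_xyz, dtype=torch.float32)]
    for extra in (orientation, linear_velocity, angular_velocity):
        if extra is not None:
            parts.append(torch.as_tensor(extra, dtype=torch.float32))
    return torch.cat(parts)  # concatenated continuous object state vector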
4. The training method for a strategy generation network for reinforcement learning of claim 1, wherein determining the KL divergence value between the joint probability distribution and the product of the first marginal distribution and the second marginal distribution comprises:
sampling a first current state pair and a first next state pair from the joint probability distribution;
sampling a current state and a next state from the continuous object state information and the second marginal distribution to form a second current state pair and a second next state pair, respectively;
determining two first mutual information values of the first current state pair and the first next state pair through a statistical network for calculating mutual information;
determining two second mutual information values of the second current state pair and the second next state pair through the statistical network for calculating mutual information, and determining two second exponential values by using the two second mutual information values as exponents of the natural constant e;
determining a transitional mutual information value based on the two first mutual information values and the two second exponential values; and
obtaining the KL divergence value based on the transitional mutual information value.
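One possible, non-authoritative reading of the computation in claim 4, which combines the two groups of values in a Donsker-Varadhan style estimate (PyTorch assumed; statistical_net and the pair arguments are placeholders for whatever representation an implementation uses), is sketched below:

import torch

def transitional_mutual_information(statistical_net, first_current_pair, first_next_pair,
                                    second_current_pair, second_next_pair):
    # statistical_net maps a concatenated (object state, environment state) pair
    # to a scalar score and is assumed to be already trained (see claim 6).
    # Two first mutual information values: scores of the first current/next
    # state pairs sampled from the joint probability distribution.
    first_values = torch.stack([statistical_net(first_current_pair),
                                statistical_net(first_next_pair)])
    # Two second exponential values: e raised to the scores of the second
    # current/next state pairs, whose environment states were drawn from the
    # second marginal distribution.
    second_values = torch.exp(torch.stack([statistical_net(second_current_pair),
                                           statistical_net(second_next_pair)]))
    # Combine the two groups of values (mean of scores minus log of the mean
    # exponential value, mirroring the training objective of claim 6).
    return first_values.mean() - torch.log(second_values.mean())

Whether the statistical network receives each pair as a single concatenated vector, as in this sketch, is an implementation assumption.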
5. The training method for a strategy generation network for reinforcement learning of claim 4, wherein obtaining the KL divergence value based on the transitional mutual information value comprises:
determining whether the transitional mutual information value is less than zero or greater than a predefined maximum transitional mutual information value;
setting the KL divergence value to zero in response to the transitional mutual information value being less than zero;
setting the KL divergence value to the predefined maximum transitional mutual information value in response to the transitional mutual information value being greater than the predefined maximum transitional mutual information value; and
setting the KL divergence value to the transitional mutual information value in response to the transitional mutual information value being greater than zero and less than the predefined maximum transitional mutual information value.
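The clipping rule of claim 5 can be written out directly, as in the following plain-Python sketch (the argument names are hypothetical):

def kl_divergence_from_transitional_mi(transitional_mi, max_transitional_mi):
    # Negative estimates are set to zero, estimates above the predefined maximum
    # transitional mutual information value are capped, and values in between
    # are used as the KL divergence value unchanged.
    if transitional_mi < 0.0:
        return 0.0
    if transitional_mi > max_transitional_mi:
        return max_transitional_mi
    return transitional_mi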
6. The training method of the strategy generation network for reinforcement learning of claim 4, wherein the statistical network is obtained by training, and the training process comprises:
sampling a plurality of training first state pairs from the joint probability distribution;
sampling states from the continuous object state information and the second marginal distribution, respectively, to form a plurality of training second state pairs;
calculating a plurality of training first mutual information values for the plurality of training first state pairs using the statistical network;
calculating a plurality of training second mutual information values of the plurality of training second state pairs by using the statistical network, and calculating a plurality of training second exponential values by using the plurality of training second mutual information values as exponents of the natural constant e;
subtracting the logarithm of the average value of the plurality of training second exponential values from the average value of the plurality of training first mutual information values to obtain a transitional mutual information value for training; and
updating parameters of the statistical network by back-propagation to maximize the transitional mutual information value for training.
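A minimal sketch of the statistical-network training of claim 6 is given below; it assumes PyTorch, a gradient-based optimizer, and that sampling from the second marginal distribution is approximated by shuffling the environment states, all of which are assumptions of this sketch rather than statements of the specification:

import torch

def train_statistical_network(statistical_net, optimizer, object_states, env_states, steps=100):
    # object_states / env_states: (N, d) tensors whose matching rows are samples
    # from the joint probability distribution.
    for _ in range(steps):
        # Training first state pairs: matched rows, i.e. joint samples.
        first_pairs = torch.cat([object_states, env_states], dim=1)
        # Training second state pairs: object states combined with environment
        # states re-drawn from the second marginal distribution (shuffled here).
        perm = torch.randperm(env_states.shape[0])
        second_pairs = torch.cat([object_states, env_states[perm]], dim=1)

        first_mi = statistical_net(first_pairs)                # training first mutual information values
        second_exp = torch.exp(statistical_net(second_pairs))  # training second exponential values

        # Transitional mutual information value for training: average of the first
        # values minus the logarithm of the average second exponential value.
        transitional_mi = first_mi.mean() - torch.log(second_exp.mean())

        # Maximize it by minimizing its negative through back-propagation.
        optimizer.zero_grad()
        (-transitional_mi).backward()
        optimizer.step()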
7. The training method of a policy generation network for reinforcement learning according to claim 1, wherein updating parameters of the policy generation network by a predetermined policy with the KL divergence value as a reward function includes:
updating parameters of the policy generation network with the KL divergence value as a reward function through a deep deterministic policy gradient (DDPG) algorithm.
8. The training method of a policy generation network for reinforcement learning according to claim 1, wherein updating parameters of the policy generation network by a predetermined policy with the KL divergence value as a reward function includes:
updating parameters of the policy generation network by soft actor-critic (SAC) with the KL divergence value as a reward function.
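For claims 7 and 8, the KL divergence value can simply replace the environment reward before an ordinary DDPG or soft actor-critic update is applied; the sketch below (PyTorch assumed, with a hypothetical replay-batch layout) shows only this relabelling step, since the agent update itself is standard:

import torch

def relabel_batch_with_kl_reward(batch, kl_value):
    # batch: (observations, actions, rewards, next_observations, dones) tensors
    # as stored in an ordinary off-policy replay buffer.  The stored reward is
    # discarded and replaced by the KL divergence value, after which a standard
    # DDPG (claim 7) or soft actor-critic (claim 8) update can be run unchanged.
    obs, actions, _, next_obs, dones = batch
    intrinsic_reward = torch.full((obs.shape[0],), float(kl_value))
    return obs, actions, intrinsic_reward, next_obs, dones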
9. A training apparatus for a strategy generation network for reinforcement learning, comprising:
a state acquisition unit configured to acquire continuous object state information of an object that executes a task and continuous environment state information of an environment on which the object acts, the continuous object state information including a plurality of object states of the object, and the continuous environment state information including a plurality of environment states of the environment;
a distribution determining unit configured to determine a joint probability distribution of the continuous object state information and the continuous environment state information acquired by the state acquisition unit, and a first marginal distribution of the continuous object state information and a second marginal distribution of the continuous environment state information, respectively;
a divergence value determining unit configured to determine a KL divergence value between the joint probability distribution determined by the distribution determining unit and the product of the first marginal distribution and the second marginal distribution; and
a network updating unit configured to update the parameters of the policy generation network by a predetermined policy with the KL divergence value determined by the divergence value determining unit as a reward function.
10. An electronic device, comprising:
a processor; and
memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform a method of training a policy generation network for reinforcement learning according to any of claims 1-8.
CN202010867107.1A 2019-09-23 2020-08-26 Training method and device for strategy generation network for reinforcement learning and electronic equipment Pending CN112016678A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201962904419P 2019-09-23 2019-09-23
US62/904,419 2019-09-23

Publications (1)

Publication Number Publication Date
CN112016678A true CN112016678A (en) 2020-12-01

Family

ID=73503476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010867107.1A Pending CN112016678A (en) 2019-09-23 2020-08-26 Training method and device for strategy generation network for reinforcement learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN112016678A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190258918A1 (en) * 2016-11-03 2019-08-22 Deepmind Technologies Limited Training action selection neural networks
CN110235148A (en) * 2016-11-03 2019-09-13 渊慧科技有限公司 Training action selects neural network
US20190126472A1 (en) * 2017-10-27 2019-05-02 Deepmind Technologies Limited Reinforcement and imitation learning for a task
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction
CN110081893A (en) * 2019-04-01 2019-08-02 东莞理工学院 A kind of navigation path planning method reused based on strategy with intensified learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
REIN HOUTHOOFT ET AL.: "VIME: Variational Information Maximizing Exploration", arXiv:1605.09674v4, 27 January 2017 (2017-01-27), pages 1-2 *
李建国等 (LI JIANGUO ET AL.): "基于KL散度的策略优化" [Policy Optimization Based on KL Divergence], 计算机科学 [Computer Science], 30 June 2019 (2019-06-30) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112949933A (en) * 2021-03-23 2021-06-11 成都信息工程大学 Traffic organization scheme optimization method based on multi-agent reinforcement learning
CN113705777A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle autonomous path-finding model training method and device
CN113705777B (en) * 2021-08-07 2024-04-12 中国航空工业集团公司沈阳飞机设计研究所 Unmanned aerial vehicle autonomous path-finding model training method and device
CN113537406A (en) * 2021-08-30 2021-10-22 重庆紫光华山智安科技有限公司 Method, system, medium and terminal for enhancing image automatic data
CN113537406B (en) * 2021-08-30 2023-04-07 重庆紫光华山智安科技有限公司 Method, system, medium and terminal for enhancing image automatic data

Similar Documents

Publication Publication Date Title
EP4231197B1 (en) Training machine learning models on multiple machine learning tasks
US11779837B2 (en) Method, apparatus, and device for scheduling virtual objects in virtual environment
CN112016678A (en) Training method and device for strategy generation network for reinforcement learning and electronic equipment
US20130325774A1 (en) Learning stochastic apparatus and methods
JP7013489B2 (en) Learning device, live-action image classification device generation system, live-action image classification device generation device, learning method and program
WO2017116814A1 (en) Calibrating object shape
EP4303767A1 (en) Model training method and apparatus
US20210107144A1 (en) Learning method, learning apparatus, and learning system
US20230311335A1 (en) Natural language control of a robot
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
US20200276704A1 (en) Determining control policies for robots with noise-tolerant structured exploration
CN114840322A (en) Task scheduling method and device, electronic equipment and storage
CN114398834A (en) Training method of particle swarm optimization algorithm model, particle swarm optimization method and device
Wodziński et al. Sequential classification of palm gestures based on A* algorithm and MLP neural network for quadrocopter control
CN112069662A (en) Complex product autonomous construction method and module based on man-machine hybrid enhancement
US20200134498A1 (en) Dynamic boltzmann machine for predicting general distributions of time series datasets
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
CN116968024A (en) Method, computing device and medium for obtaining control strategy for generating shape closure grabbing pose
CN115421387B (en) Variable impedance control system and control method based on inverse reinforcement learning
CN112334914A (en) Mock learning using generative lead neural networks
CN116710974A (en) Domain adaptation using domain countermeasure learning in composite data systems and applications
CN114880740A (en) Data-mechanics-rule driven structure support intelligent arrangement method and device
CN114882587A (en) Method, apparatus, electronic device, and medium for generating countermeasure sample
CN112016611A (en) Training method and device for generator network and strategy generation network and electronic equipment
US20210182459A1 (en) Simulation device, simulation method, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination