CN112101563A - Trust region policy optimization method and device based on hindsight experience, and related equipment - Google Patents

Trust region policy optimization method and device based on hindsight experience, and related equipment

Info

Publication number
CN112101563A
Authority
CN
China
Prior art keywords
strategy
experience
virtual
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010713458.7A
Other languages
Chinese (zh)
Inventor
兰旭光
张翰博
柏思特
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202010713458.7A
Publication of CN112101563A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a trust region policy optimization method and device based on hindsight experience, and related equipment. The method comprises the following steps: S100, taking goal points actually reached in the experience data as virtual goal points and generating virtual hindsight experience data; S200, filtering the virtual goals with a hindsight goal filtering algorithm to obtain the corresponding training data; S300, correcting, through weighted importance sampling, the distribution deviation between the virtual experience data and the original experience data, and estimating the objective function value; S400, estimating the KL divergence value between policies on the basis of the corrected data; S500, correcting the gradient direction of the policy through the KL divergence, and computing the policy update step size from the maximum KL divergence step size. The method enables the agent to explore the environment and the task effectively from a small amount of interaction data and a simply designed reward function, and to learn and update its behavior policy efficiently.

Description

Trust region policy optimization method and device based on hindsight experience, and related equipment
Technical Field
The invention belongs to the field of machine learning and intelligent robots, and particularly relates to a trust region policy optimization method and device based on hindsight experience, and related equipment.
Background
With the rapid development of artificial intelligence technology, intelligent and automated information processing has spread across many industries. However, most mainstream deep learning methods still depend on large-scale manually labeled data, and how a robot or agent can acquire data and complete the learning process through autonomous interaction with its environment remains a major open problem in artificial intelligence. Reinforcement learning, an important branch of artificial intelligence, can help a robot explore and learn during autonomous interaction with its environment. However, reinforcement learning currently faces many problems, such as slow learning, difficult reward-function design, and low exploration efficiency, which make it hard to apply to real, complex tasks. In particular, reinforcement-learning agents often need tens of millions of interaction samples or more to obtain a reliable behavior policy. Moreover, for complex tasks, an elaborate reward function must be designed for the task at hand to characterize the task reward and to prevent the agent from learning a suboptimal policy.
Therefore, how to design an efficient reinforcement learning method that can learn an effective policy through autonomous exploration, using only a simply designed reward function and a small amount of interaction data, is a prominent problem currently faced by reinforcement learning.
Disclosure of Invention
The invention aims to overcome the above defects and provides a trust region policy optimization method and device based on hindsight experience, and related equipment. The method enables the agent to explore the environment and the task effectively from a small amount of interaction data and a simply designed reward function, and to learn and update its behavior policy efficiently.
In order to achieve the above object, the present invention provides the following technical solution:
A trust region policy optimization method based on hindsight experience comprises the following steps:
s100, using the acquired experience data of the robot executing action in the strategy training process under the target condition, and taking the arrived target point in the experience data as a virtual target point to generate virtual after experience data;
s200, filtering the virtual target based on a posterior target filtering algorithm, and acquiring training data corresponding to the posterior target which is distributed approximately to the original target;
s300, based on the virtual experience data, correcting the distribution deviation of the virtual experience data and the original experience data through weighting importance sampling, estimating a target function value according to the distribution deviation, and acquiring an original strategy gradient;
s400, when the strategy distribution is similar, the secondary KL divergence is used for approximating the KL divergence, and the distribution deviation of the virtual empirical data and the original empirical data is corrected based on the weighted importance sampling, so that the KL divergence value between the strategies is estimated;
s500, correcting the gradient direction of the strategy through the KL divergence, and calculating and updating the strategy step length through the maximum KL divergence step length; and updating the existing strategy according to the strategy step length, returning to S100, and repeating the strategy updating process until the strategy converges.
As a further improvement of the present invention, in S100, the robot executes its current action policy π(a_t | s_t, g) and interacts with the environment to acquire experience data of the executed actions, i.e., trajectories of tuples (s_t, a_t, r_t) conditioned on the goal g. The policy π(a_t | s_t, g) takes the robot's current state s_t as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment. A virtual goal g' = φ(s) is then generated from the experience data, and the original experience data are re-conditioned on the virtual goal to generate the virtual hindsight experience data.
As a further improvement of the present invention, the current state of the robot includes the joint angles, the joint velocities, and the target position.
as a further improvement of the present invention, in S300, the objective function value is:
Figure BDA0002597382030000031
wherein A represents the dominance function, i.e. at state skUnder the condition of (a), performing action (a)kAdvantages over current strategies; the x is a representation of a normalization factor,
Figure BDA0002597382030000032
representing a pre-update strategy, theta representing a post-update strategy, gamma representing a reward discount factor, N representing the number of tracks under the virtual target g', and t representing a reinforcement learning time step.
As a further improvement of the present invention, in S400, the KL divergence value between the pre-update and post-update policies is estimated on the hindsight experience data using the quadratic approximation of the KL divergence, with the distribution deviation corrected by the weighted importance sampling (the formula is reproduced only as an image in the original publication).
as a further improvement of the present invention, in S500, the policy step size:
Figure BDA0002597382030000034
where e represents the maximum KL divergence limit.
A trust region policy optimization apparatus based on hindsight experience, comprising:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
A trust region policy optimization device based on hindsight experience, comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience.
A computer-readable storage medium, on which a hindsight-experience-based trust region policy optimization program is stored, wherein the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience.
Compared with the prior art, the invention has the following advantages:
according to the method, the original experience data is combined with the post-event target filtering to generate post-event experience data capable of helping the strategy to be effectively updated, and the convergence speed, the stability and the final performance of the strategy updating method are greatly improved through the post-event experience data. By using the quadratic KL divergence to replace the KL divergence, the variance of the existing KL divergence estimation method is greatly reduced, and the accuracy of KL divergence estimation is greatly improved. In the method operation process, the robot can take the current state as the strategy input, obtain the optimal action under the current state condition and execute the given task. Compared with the prior art, the method successfully applies the post experience data to the trust domain strategy optimization method, overcomes the defects of low reinforcement learning performance, slow convergence and incapability of acquiring effective strategies caused by low exploration efficiency, less effective data and difficult design of reward functions, and successfully realizes a high-performance reinforcement learning method. In the existing benchmark test, compared with the previous method, the method obtains the current optimal performance in a plurality of tasks, and solves the complex problems which cannot be solved by a plurality of existing methods, such as FetchSlideDiscrete, FetchReachDiscrete and the like. The method has the advantages of high efficiency, high convergence rate and high performance, and can have excellent performance in various control tasks (such as control strategy learning based on image input, terminal speed control, speed difference control, joint speed control and the like), so the method has great application potential.
Drawings
The conception, specific structure and technical effects of the present invention will be further described with reference to the accompanying drawings, so that the objects, features and effects of the invention can be fully understood.
FIG. 1 is a flowchart of the trust region policy optimization method based on hindsight experience according to the present invention.
FIG. 2 shows the results of the present invention (success rates of different methods on each task); each subplot corresponds to one task, named in its title, and the curves in each subplot show the performance of the different methods.
FIG. 3 is a schematic diagram of the trust region policy optimization apparatus based on hindsight experience according to the present invention.
FIG. 4 is a schematic diagram of the trust region policy optimization device based on hindsight experience according to the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
The invention discloses a trust region policy optimization method based on hindsight experience. The distribution deviation between the hindsight experience data and the original experience data is corrected with importance sampling. The KL divergence estimation error is reduced by approximating the KL divergence with its quadratic form. Hindsight goal filtering yields more effective virtual goal points and improves the generalization ability of the method.
Compared with previous inventions, the method achieves higher data efficiency and better performance on sparse-reward reinforcement learning tasks. Introducing hindsight experience data greatly increases the amount of effective data obtained during training and greatly reduces the training time. Introducing the quadratic KL divergence greatly reduces the variance of KL divergence estimation between nearby distributions and provides a highly accurate KL divergence estimate. The hindsight goal filtering algorithm markedly reduces the distribution deviation between the hindsight goals and the original goals and improves the method's generalization to the original goals. Compared with previous methods, the trust region policy optimization method based on hindsight experience greatly simplifies the design of the reward function and performs well under a simple sparse-reward setting.
In practical applications, the method helps the robot learn skills through autonomous interaction and does not depend on large amounts of manually labeled data. In addition, it performs well in large-scale, nonlinear, and image-input application scenarios, and therefore has great application potential.
As shown in FIG. 1, the present invention specifically includes the following steps:
Step one: the robot interacts with the environment using its current policy π(a_t | s_t, g) and obtains experience data, i.e., trajectories of tuples (s_t, a_t, r_t) conditioned on the goal g. The policy π(a_t | s_t, g) takes the robot's current state s_t (e.g., joint angles, joint velocities, and the target position) as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment. A virtual goal g' = φ(s) is generated from the experience data, and the original experience data are re-conditioned on this virtual goal to produce the virtual hindsight experience data.
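As an illustration of this relabeling step, the following minimal sketch (in Python) builds virtual hindsight experience from one goal-conditioned trajectory. The helper names (phi, relabel_with_hindsight_goals), the layout of the state vector, and the sparse-reward recomputation are assumptions made for the example and are not prescribed by the patent.

```python
import numpy as np

def phi(state):
    """Map a state to the goal it achieves. Assumption for this sketch:
    the achieved position is stored in the last three entries of the state."""
    return state[-3:]

def relabel_with_hindsight_goals(trajectory, tol=0.05):
    """Build virtual hindsight experience from one goal-conditioned trajectory.

    trajectory: list of dicts with keys 'state', 'action', 'reward', 'goal'.
    The goal actually reached at the end of the trajectory is used as the
    virtual goal g' = phi(s), and every step is re-conditioned on it.
    """
    virtual_goal = phi(trajectory[-1]["state"])            # g' = phi(s)
    virtual_traj = []
    for step in trajectory:
        achieved = phi(step["state"])
        virtual_traj.append({
            "state": step["state"],
            "action": step["action"],
            "goal": virtual_goal,                           # condition on g'
            # sparse reward recomputed w.r.t. g' (assumed convention:
            # 0 when the virtual goal is reached, -1 otherwise)
            "reward": 0.0 if np.linalg.norm(achieved - virtual_goal) < tol else -1.0,
        })
    return virtual_traj
```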
Step two: based on a posterior target filtering algorithm, filtering the virtual target, and acquiring training data corresponding to the posterior target which is as close to the original target distribution as possible;
Step three: based on the virtual experience data, correct the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimate the objective function value accordingly (the formula is reproduced only as an image in the original publication), and obtain the original policy gradient. Here A denotes the advantage function, i.e., the advantage of executing action a_k in state s_k over the current policy; a normalization factor normalizes the importance weights; π_θ̃ denotes the pre-update policy and θ the parameters of the post-update policy.
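Because the explicit objective in step three is only available as an image in the original publication, the sketch below should be read as an assumed form of a weighted-importance-sampling surrogate rather than the patent's exact formula: a cumulative hindsight weight corrects each step for the change of goal, the weights are self-normalized, and the discounted, importance-weighted advantages are summed; the policy gradient then follows from automatic differentiation.

```python
import torch

def hindsight_surrogate_objective(logp_new, logp_old, logp_old_orig_goal,
                                  advantages, gamma=0.98):
    """Weighted-importance-sampling estimate of the surrogate objective.

    Shapes are (N, T) for N trajectories of length T.
    logp_new:           log pi_theta(a_t | s_t, g')   (new policy, virtual goal)
    logp_old:           log pi_old(a_t | s_t, g')     (old policy, virtual goal)
    logp_old_orig_goal: log pi_old(a_t | s_t, g)      (old policy, original goal)
    advantages:         advantage estimates A(s_t, a_t, g') under the old policy
    """
    N, T = logp_new.shape
    discount = gamma ** torch.arange(T, dtype=logp_new.dtype)
    # hindsight correction: cumulative log-ratio between behaviour conditioned
    # on g' and on the original goal g (corrects the distribution shift)
    hindsight_w = torch.exp(torch.cumsum(logp_old - logp_old_orig_goal, dim=1))
    hindsight_w = hindsight_w / (hindsight_w.sum() + 1e-8)   # self-normalisation
    ratio = torch.exp(logp_new - logp_old)                   # policy improvement ratio
    return (hindsight_w * discount * ratio * advantages).sum()

# The original policy gradient is then obtained by automatic differentiation:
#   loss = -hindsight_surrogate_objective(...); loss.backward()
```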
Step four: when the policy distributions are close, approximate the KL divergence with the quadratic KL divergence, and correct the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies (the formula is reproduced only as an image in the original publication), where γ denotes the reward discount factor.
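Since the step-four expression is likewise only reproduced as an image, the sketch below shows the standard second-order (quadratic) approximation of the KL divergence between two nearby policies, estimated on the hindsight data with the same importance weights; whether the patent uses exactly this estimator is an assumption.

```python
import torch

def quadratic_kl_estimate(logp_new, logp_old, weights):
    """Quadratic (second-order) approximation of KL(pi_old || pi_new).

    For nearby distributions, KL(p || q) is approximately
    0.5 * E_p[(log p - log q)^2]; here the expectation is taken over the
    hindsight data using the (already normalised) importance weights that
    correct for the virtual-goal relabelling.
    """
    diff = logp_old - logp_new
    return 0.5 * (weights * diff.pow(2)).sum()
```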
Step five: correct the gradient direction of the policy through the KL divergence, and compute the update step size from the maximum KL divergence step size (the trust region) ε (the formula is reproduced only as an image in the original publication), where ε denotes the maximum KL divergence limit. Update the current policy according to this step size, return to step one, and repeat the policy update process until the policy converges.
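To make the step-five update concrete, the sketch below follows the standard trust-region recipe (conjugate gradient for the natural-gradient direction, then a step size that saturates the maximum KL limit ε); the patent's formula is shown only as an image, so treat this as an assumed, conventional instantiation rather than the exact method of the filing.

```python
import numpy as np

def trust_region_update(theta, policy_grad, fisher_vector_product,
                        epsilon=1e-2, cg_iters=10):
    """One trust-region parameter update (standard TRPO-style recipe).

    The KL-corrected ascent direction d solves H d = g, where g is the policy
    gradient and H the Hessian of the KL estimate; it is found by conjugate
    gradient using only Hessian-vector products. The step size
    beta = sqrt(2 * epsilon / (d^T H d)) saturates the maximum KL limit epsilon.
    """
    d = np.zeros_like(policy_grad)
    r = policy_grad.copy()
    p = r.copy()
    rs_old = r @ r
    for _ in range(cg_iters):                 # conjugate gradient for d = H^-1 g
        Hp = fisher_vector_product(p)
        alpha = rs_old / (p @ Hp + 1e-8)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    dHd = d @ fisher_vector_product(d)        # curvature along the direction
    beta = np.sqrt(2.0 * epsilon / (dHd + 1e-8))
    return theta + beta * d                   # updated policy parameters
```

Working with Hessian-vector products avoids forming the full Fisher matrix, which is what keeps this kind of update tractable for neural-network policies.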
FIG. 2 shows the results of the present invention (success rates of different methods on each task). The performance of different algorithms and of HTRPO is compared on 13 tasks, including the Fetch robot simulation environments, with the final success rate of each algorithm used as the performance metric.
It can be seen that on simpler tasks, such as FlipBit, EmptyRoom, FourRoom, and FetchReach, the trust region policy optimization method based on hindsight experience (HTRPO) obtains the optimal action policy with less data while maintaining the same or similar performance as other methods, i.e., it is more data efficient. On complex tasks such as FetchPush, FetchSlide, and FetchPickAndPlace, the method has clear advantages in both data efficiency and performance, and obtains a better action policy with less data.
Referring to FIG. 3, a second aspect of the present application provides a trust region policy optimization apparatus based on hindsight experience.
The trust region policy optimization apparatus based on hindsight experience provided by the embodiment of the present application comprises:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
As shown in FIG. 4, a third aspect of the present application provides a trust region policy optimization device based on hindsight experience, comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience.
A fourth aspect of the present application provides a computer-readable storage medium, on which a hindsight-experience-based trust region policy optimization program is stored, wherein the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, B, or C, may represent: a, B, C, "A and B", "A and C", "B and C", or "A and B and C", wherein A, B, C may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A trust region policy optimization method based on hindsight experience, characterized by comprising the following steps:
S100, using the acquired experience data of the robot executing actions under the goal condition during policy training, and taking goal points actually reached in the experience data as virtual goal points to generate virtual hindsight experience data;
S200, filtering the virtual goals with a hindsight goal filtering algorithm, and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
S300, based on the virtual experience data, correcting the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
S400, when the policy distributions are close, approximating the KL divergence with the quadratic KL divergence, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
S500, correcting the gradient direction of the policy through the KL divergence, and computing the policy update step size from the maximum KL divergence step size; updating the current policy according to the step size, returning to S100, and repeating the update process until the policy converges.
2. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S100, the robot executes its current action policy π(a_t | s_t, g) and interacts with the environment to acquire experience data of the executed actions, where π(a_t | s_t, g) takes the robot's current state s_t as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment; a virtual goal g' = φ(s) is generated from the experience data, and the original experience data are re-conditioned on the virtual goal to generate the virtual hindsight experience data.
3. The method of claim 1, wherein the current state of the robot comprises the joint angles, the joint velocities, and the target position.
4. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S300, the objective function value is estimated over the hindsight experience data as a normalized sum of discounted, importance-weighted advantage values (the formula is reproduced only as an image in the original publication), where A denotes the advantage function, i.e., the advantage of executing action a_k in state s_k over the current policy, a normalization factor normalizes the importance weights, π_θ̃ denotes the pre-update policy, θ the parameters of the post-update policy, γ the reward discount factor, N the number of trajectories under the virtual goal g', and t the reinforcement-learning time step.
5. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S400, the KL divergence value between policies is estimated using the quadratic approximation of the KL divergence, corrected by the weighted importance sampling (the formula is reproduced only as an image in the original publication).
6. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S500, the policy step size is computed from the maximum KL divergence limit ε (the formula is reproduced only as an image in the original publication), where ε denotes the maximum KL divergence limit.
7. A trust region policy optimization apparatus based on hindsight experience, characterized by comprising:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
8. A trust region policy optimization device based on hindsight experience, characterized by comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that a hindsight-experience-based trust region policy optimization program is stored thereon, and the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience according to any one of claims 1 to 6.
CN202010713458.7A 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment Pending CN112101563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010713458.7A CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010713458.7A CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Publications (1)

Publication Number Publication Date
CN112101563A (en) 2020-12-18

Family

ID=73749581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010713458.7A Pending CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Country Status (1)

Country Link
CN (1) CN112101563A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114102600A (en) * 2021-12-02 2022-03-01 西安交通大学 Multi-space fusion man-machine skill migration and parameter compensation method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANBO ZHANG et al.: "Hindsight Trust Region Policy Optimization", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114102600A (en) * 2021-12-02 2022-03-01 西安交通大学 Multi-space fusion man-machine skill migration and parameter compensation method and system
CN114102600B (en) * 2021-12-02 2023-08-04 西安交通大学 Multi-space fusion human-machine skill migration and parameter compensation method and system

Similar Documents

Publication Publication Date Title
KR102170105B1 (en) Method and apparatus for generating neural network structure, electronic device, storage medium
US10410362B2 (en) Method, device, and non-transitory computer readable storage medium for image processing
CN110276442B (en) Searching method and device of neural network architecture
US11580378B2 (en) Reinforcement learning for concurrent actions
JP2013242761A (en) Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
Chirodea et al. Comparison of tensorflow and pytorch in convolutional neural network-based applications
EP3612356B1 (en) Determining control policies for robots with noise-tolerant structured exploration
CN113222123A (en) Model training method, device, equipment and computer storage medium
US20210141383A1 (en) Systems and methods for improving generalization in visual navigation
CN111652371A (en) Offline reinforcement learning network training method, device, system and storage medium
CN112085056A (en) Target detection model generation method, device, equipment and storage medium
KR20200132305A (en) Method for performing convolution operation at predetermined layer within the neural network by electronic device, and electronic device thereof
Skolik et al. Robustness of quantum reinforcement learning under hardware errors
CN112016678A (en) Training method and device for strategy generation network for reinforcement learning and electronic equipment
CN112101563A (en) Confidence domain strategy optimization method and device based on posterior experience and related equipment
CN113156473B (en) Self-adaptive judging method for satellite signal environment of information fusion positioning system
CN117422005A (en) Method for automatically controlling simulation errors of analog circuit and application
CN110502975A (en) A kind of batch processing system that pedestrian identifies again
CN113361380B (en) Human body key point detection model training method, detection method and device
CN113361381B (en) Human body key point detection model training method, detection method and device
CN115879536A (en) Learning cognition analysis model robustness optimization method based on causal effect
Chen et al. Kalman Filtering Under Information Theoretic Criteria
US20200250493A1 (en) Apparatus for q-learning for continuous actions with cross-entropy guided policies and method thereof
CN103971339B (en) A kind of nuclear magnetic resonance image dividing method and equipment based on parametric method
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201218)