CN112101563A - Trust region policy optimization method and device based on hindsight experience, and related equipment - Google Patents

Trust region policy optimization method and device based on hindsight experience, and related equipment

Info

Publication number
CN112101563A
Authority
CN
China
Prior art keywords
strategy
experience
virtual
data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010713458.7A
Other languages
Chinese (zh)
Inventor
兰旭光
张翰博
柏思特
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xi'an Jiaotong University
Original Assignee
Xi'an Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xi'an Jiaotong University
Priority to CN202010713458.7A
Publication of CN112101563A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a trust region policy optimization method and device based on hindsight experience, and related equipment. The method comprises the following steps: S100, taking goal points actually reached in the experience data as virtual goal points and generating virtual hindsight experience data; S200, filtering the virtual goals with a hindsight goal filtering algorithm to obtain the corresponding training data; S300, correcting, through weighted importance sampling, the distribution deviation between the virtual experience data and the original experience data, and estimating the objective function value; S400, estimating the KL divergence value between policies on the basis of the corrected data; S500, correcting the gradient direction of the policy through the KL divergence, and computing the policy update step size from the maximum KL divergence step size. The method enables the agent to explore the environment and the task effectively from a small amount of interaction data and a simply designed reward function, and to learn and update its behavior policy efficiently.

Description

Trust region policy optimization method and device based on hindsight experience, and related equipment
Technical Field
The invention belongs to the field of machine learning and intelligent robots, and particularly relates to a trust region policy optimization method and device based on hindsight experience, and related equipment.
Background
With the rapid development of artificial intelligence technology, intelligent and automated information processing has spread across many industries. However, most mainstream deep learning methods still depend on large-scale manually labeled data, and how a robot or agent can acquire data and complete the learning process through autonomous interaction with its environment remains a major open problem in artificial intelligence. Reinforcement learning, an important branch of artificial intelligence, can help a robot explore and learn during autonomous interaction with its environment. However, reinforcement learning currently faces many problems, such as slow learning, difficult reward-function design, and low exploration efficiency, which make it hard to apply to real, complex tasks. In particular, reinforcement-learning agents often need tens of millions of interaction samples or more to obtain a reliable behavior policy. Moreover, for complex tasks, an elaborate reward function must be designed for the task at hand to characterize the task reward and to prevent the agent from learning a suboptimal policy.
Therefore, how to design an efficient reinforcement learning method that can learn an effective policy through autonomous exploration, using only a simply designed reward function and a small amount of interaction data, is a prominent problem currently faced by reinforcement learning.
Disclosure of Invention
The invention aims to overcome the above defects and provides a trust region policy optimization method and device based on hindsight experience, and related equipment. The method enables the agent to explore the environment and the task effectively from a small amount of interaction data and a simply designed reward function, and to learn and update its behavior policy efficiently.
In order to achieve the above object, the present invention provides the following technical solution:
A trust region policy optimization method based on hindsight experience comprises the following steps:
s100, using the acquired experience data of the robot executing action in the strategy training process under the target condition, and taking the arrived target point in the experience data as a virtual target point to generate virtual after experience data;
s200, filtering the virtual target based on a posterior target filtering algorithm, and acquiring training data corresponding to the posterior target which is distributed approximately to the original target;
s300, based on the virtual experience data, correcting the distribution deviation of the virtual experience data and the original experience data through weighting importance sampling, estimating a target function value according to the distribution deviation, and acquiring an original strategy gradient;
s400, when the strategy distribution is similar, the secondary KL divergence is used for approximating the KL divergence, and the distribution deviation of the virtual empirical data and the original empirical data is corrected based on the weighted importance sampling, so that the KL divergence value between the strategies is estimated;
s500, correcting the gradient direction of the strategy through the KL divergence, and calculating and updating the strategy step length through the maximum KL divergence step length; and updating the existing strategy according to the strategy step length, returning to S100, and repeating the strategy updating process until the strategy converges.
As a further improvement of the present invention, in S100, the robot executes its current action policy π(a_t | s_t, g) and interacts with the environment to acquire experience data of the executed actions, i.e., trajectories of tuples (s_t, a_t, r_t) conditioned on the goal g. The policy π(a_t | s_t, g) takes the robot's current state s_t as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment. A virtual goal g' = φ(s) is then generated from the experience data, and the original experience data are re-conditioned on the virtual goal to generate the virtual hindsight experience data.
As a further improvement of the present invention, the current state of the robot includes the joint angles, the joint velocities, and the target position.
as a further improvement of the present invention, in S300, the objective function value is:
Figure BDA0002597382030000031
wherein A represents the dominance function, i.e. at state skUnder the condition of (a), performing action (a)kAdvantages over current strategies; the x is a representation of a normalization factor,
Figure BDA0002597382030000032
representing a pre-update strategy, theta representing a post-update strategy, gamma representing a reward discount factor, N representing the number of tracks under the virtual target g', and t representing a reinforcement learning time step.
As a further improvement of the present invention, in S400, the KL divergence value between the pre-update and post-update policies is estimated on the hindsight experience data using the quadratic approximation of the KL divergence, with the distribution deviation corrected by the weighted importance sampling (the formula is reproduced only as an image in the original publication).
as a further improvement of the present invention, in S500, the policy step size:
Figure BDA0002597382030000034
where e represents the maximum KL divergence limit.
A trust region policy optimization apparatus based on hindsight experience, comprising:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
A trust region policy optimization device based on hindsight experience, comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience.
A computer-readable storage medium, on which a hindsight-experience-based trust region policy optimization program is stored, wherein the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience.
Compared with the prior art, the invention has the following advantages:
according to the method, the original experience data is combined with the post-event target filtering to generate post-event experience data capable of helping the strategy to be effectively updated, and the convergence speed, the stability and the final performance of the strategy updating method are greatly improved through the post-event experience data. By using the quadratic KL divergence to replace the KL divergence, the variance of the existing KL divergence estimation method is greatly reduced, and the accuracy of KL divergence estimation is greatly improved. In the method operation process, the robot can take the current state as the strategy input, obtain the optimal action under the current state condition and execute the given task. Compared with the prior art, the method successfully applies the post experience data to the trust domain strategy optimization method, overcomes the defects of low reinforcement learning performance, slow convergence and incapability of acquiring effective strategies caused by low exploration efficiency, less effective data and difficult design of reward functions, and successfully realizes a high-performance reinforcement learning method. In the existing benchmark test, compared with the previous method, the method obtains the current optimal performance in a plurality of tasks, and solves the complex problems which cannot be solved by a plurality of existing methods, such as FetchSlideDiscrete, FetchReachDiscrete and the like. The method has the advantages of high efficiency, high convergence rate and high performance, and can have excellent performance in various control tasks (such as control strategy learning based on image input, terminal speed control, speed difference control, joint speed control and the like), so the method has great application potential.
Drawings
The conception, specific structure and technical effects of the present invention will be further described with reference to the accompanying drawings, so that the objects, features and effects of the invention can be fully understood.
FIG. 1 is a flowchart of the trust region policy optimization method based on hindsight experience according to the present invention.
FIG. 2 shows the results of the present invention (success rates of different methods on each task); each subplot corresponds to one task, named in its title, and the curves in each subplot show the performance of the different methods.
FIG. 3 is a schematic diagram of the trust region policy optimization apparatus based on hindsight experience according to the present invention.
FIG. 4 is a schematic diagram of the trust region policy optimization device based on hindsight experience according to the present invention.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
The invention discloses a trust region policy optimization method based on hindsight experience. The distribution deviation between the hindsight experience data and the original experience data is corrected with importance sampling. The KL divergence estimation error is reduced by approximating the KL divergence with its quadratic form. Hindsight goal filtering yields more effective virtual goal points and improves the generalization ability of the method.
Compared with previous inventions, the method achieves higher data efficiency and better performance on sparse-reward reinforcement learning tasks. Introducing hindsight experience data greatly increases the amount of effective data obtained during training and greatly reduces the training time. Introducing the quadratic KL divergence greatly reduces the variance of KL divergence estimation between nearby distributions and provides a highly accurate KL divergence estimate. The hindsight goal filtering algorithm markedly reduces the distribution deviation between the hindsight goals and the original goals and improves the method's generalization to the original goals. Compared with previous methods, the trust region policy optimization method based on hindsight experience greatly simplifies the design of the reward function and performs well under a simple sparse-reward setting.
In practical applications, the method helps the robot learn skills through autonomous interaction and does not depend on large amounts of manually labeled data. In addition, it performs well in large-scale, nonlinear, and image-input application scenarios, and therefore has great application potential.
As shown in FIG. 1, the present invention specifically includes the following steps:
Step one: the robot interacts with the environment using its current policy π(a_t | s_t, g) and obtains experience data, i.e., trajectories of tuples (s_t, a_t, r_t) conditioned on the goal g. The policy π(a_t | s_t, g) takes the robot's current state s_t (e.g., joint angles, joint velocities, and the target position) as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment. A virtual goal g' = φ(s) is generated from the experience data, and the original experience data are re-conditioned on this virtual goal to produce the virtual hindsight experience data.
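As an illustration of this relabeling step, the following minimal sketch (in Python) builds virtual hindsight experience from one goal-conditioned trajectory. The helper names (phi, relabel_with_hindsight_goals), the layout of the state vector, and the sparse-reward recomputation are assumptions made for the example and are not prescribed by the patent.

```python
import numpy as np

def phi(state):
    """Map a state to the goal it achieves. Assumption for this sketch:
    the achieved position is stored in the last three entries of the state."""
    return state[-3:]

def relabel_with_hindsight_goals(trajectory, tol=0.05):
    """Build virtual hindsight experience from one goal-conditioned trajectory.

    trajectory: list of dicts with keys 'state', 'action', 'reward', 'goal'.
    The goal actually reached at the end of the trajectory is used as the
    virtual goal g' = phi(s), and every step is re-conditioned on it.
    """
    virtual_goal = phi(trajectory[-1]["state"])            # g' = phi(s)
    virtual_traj = []
    for step in trajectory:
        achieved = phi(step["state"])
        virtual_traj.append({
            "state": step["state"],
            "action": step["action"],
            "goal": virtual_goal,                           # condition on g'
            # sparse reward recomputed w.r.t. g' (assumed convention:
            # 0 when the virtual goal is reached, -1 otherwise)
            "reward": 0.0 if np.linalg.norm(achieved - virtual_goal) < tol else -1.0,
        })
    return virtual_traj
```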
Step two: based on a posterior target filtering algorithm, filtering the virtual target, and acquiring training data corresponding to the posterior target which is as close to the original target distribution as possible;
Step three: based on the virtual experience data, correct the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimate the objective function value accordingly (the formula is reproduced only as an image in the original publication), and obtain the original policy gradient. Here A denotes the advantage function, i.e., the advantage of executing action a_k in state s_k over the current policy; a normalization factor normalizes the importance weights; π_θ̃ denotes the pre-update policy and θ the parameters of the post-update policy.
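Because the explicit objective in step three is only available as an image in the original publication, the sketch below should be read as an assumed form of a weighted-importance-sampling surrogate rather than the patent's exact formula: a cumulative hindsight weight corrects each step for the change of goal, the weights are self-normalized, and the discounted, importance-weighted advantages are summed; the policy gradient then follows from automatic differentiation.

```python
import torch

def hindsight_surrogate_objective(logp_new, logp_old, logp_old_orig_goal,
                                  advantages, gamma=0.98):
    """Weighted-importance-sampling estimate of the surrogate objective.

    Shapes are (N, T) for N trajectories of length T.
    logp_new:           log pi_theta(a_t | s_t, g')   (new policy, virtual goal)
    logp_old:           log pi_old(a_t | s_t, g')     (old policy, virtual goal)
    logp_old_orig_goal: log pi_old(a_t | s_t, g)      (old policy, original goal)
    advantages:         advantage estimates A(s_t, a_t, g') under the old policy
    """
    N, T = logp_new.shape
    discount = gamma ** torch.arange(T, dtype=logp_new.dtype)
    # hindsight correction: cumulative log-ratio between behaviour conditioned
    # on g' and on the original goal g (corrects the distribution shift)
    hindsight_w = torch.exp(torch.cumsum(logp_old - logp_old_orig_goal, dim=1))
    hindsight_w = hindsight_w / (hindsight_w.sum() + 1e-8)   # self-normalisation
    ratio = torch.exp(logp_new - logp_old)                   # policy improvement ratio
    return (hindsight_w * discount * ratio * advantages).sum()

# The original policy gradient is then obtained by automatic differentiation:
#   loss = -hindsight_surrogate_objective(...); loss.backward()
```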
Step four: when the policy distributions are close, approximate the KL divergence with the quadratic KL divergence, and correct the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies (the formula is reproduced only as an image in the original publication), where γ denotes the reward discount factor.
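Since the step-four expression is likewise only reproduced as an image, the sketch below shows the standard second-order (quadratic) approximation of the KL divergence between two nearby policies, estimated on the hindsight data with the same importance weights; whether the patent uses exactly this estimator is an assumption.

```python
import torch

def quadratic_kl_estimate(logp_new, logp_old, weights):
    """Quadratic (second-order) approximation of KL(pi_old || pi_new).

    For nearby distributions, KL(p || q) is approximately
    0.5 * E_p[(log p - log q)^2]; here the expectation is taken over the
    hindsight data using the (already normalised) importance weights that
    correct for the virtual-goal relabelling.
    """
    diff = logp_old - logp_new
    return 0.5 * (weights * diff.pow(2)).sum()
```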
Step five: correct the gradient direction of the policy through the KL divergence, and compute the update step size from the maximum KL divergence step size (the trust region) ε (the formula is reproduced only as an image in the original publication), where ε denotes the maximum KL divergence limit. Update the current policy according to this step size, return to step one, and repeat the policy update process until the policy converges.
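To make the step-five update concrete, the sketch below follows the standard trust-region recipe (conjugate gradient for the natural-gradient direction, then a step size that saturates the maximum KL limit ε); the patent's formula is shown only as an image, so treat this as an assumed, conventional instantiation rather than the exact method of the filing.

```python
import numpy as np

def trust_region_update(theta, policy_grad, fisher_vector_product,
                        epsilon=1e-2, cg_iters=10):
    """One trust-region parameter update (standard TRPO-style recipe).

    The KL-corrected ascent direction d solves H d = g, where g is the policy
    gradient and H the Hessian of the KL estimate; it is found by conjugate
    gradient using only Hessian-vector products. The step size
    beta = sqrt(2 * epsilon / (d^T H d)) saturates the maximum KL limit epsilon.
    """
    d = np.zeros_like(policy_grad)
    r = policy_grad.copy()
    p = r.copy()
    rs_old = r @ r
    for _ in range(cg_iters):                 # conjugate gradient for d = H^-1 g
        Hp = fisher_vector_product(p)
        alpha = rs_old / (p @ Hp + 1e-8)
        d += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < 1e-10:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    dHd = d @ fisher_vector_product(d)        # curvature along the direction
    beta = np.sqrt(2.0 * epsilon / (dHd + 1e-8))
    return theta + beta * d                   # updated policy parameters
```

Working with Hessian-vector products avoids forming the full Fisher matrix, which is what keeps this kind of update tractable for neural-network policies.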
FIG. 2 shows the results of the present invention (success rates of different methods on each task). The performance of different algorithms and of HTRPO is compared on 13 tasks, including the Fetch robot simulation environments, with the final success rate of each algorithm used as the performance metric.
It can be seen that on simpler tasks, such as FlipBit, EmptyRoom, FourRoom, and FetchReach, the trust region policy optimization method based on hindsight experience (HTRPO) obtains the optimal action policy with less data while maintaining the same or similar performance as other methods, i.e., it is more data efficient. On complex tasks such as FetchPush, FetchSlide, and FetchPickAndPlace, the method has clear advantages in both data efficiency and performance, and obtains a better action policy with less data.
Referring to FIG. 3, a second aspect of the present application provides a trust region policy optimization apparatus based on hindsight experience.
The trust region policy optimization apparatus based on hindsight experience provided by the embodiment of the present application comprises:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
As shown in FIG. 4, a third aspect of the present application provides a trust region policy optimization device based on hindsight experience, comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience.
A fourth aspect of the present application provides a computer-readable storage medium, on which a hindsight-experience-based trust region policy optimization program is stored, wherein the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, B, or C, may represent: a, B, C, "A and B", "A and C", "B and C", or "A and B and C", wherein A, B, C may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (9)

1. A trust region policy optimization method based on hindsight experience, characterized by comprising the following steps:
S100, using the acquired experience data of the robot executing actions under the goal condition during policy training, and taking goal points actually reached in the experience data as virtual goal points to generate virtual hindsight experience data;
S200, filtering the virtual goals with a hindsight goal filtering algorithm, and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
S300, based on the virtual experience data, correcting the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
S400, when the policy distributions are close, approximating the KL divergence with the quadratic KL divergence, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
S500, correcting the gradient direction of the policy through the KL divergence, and computing the policy update step size from the maximum KL divergence step size; updating the current policy according to the step size, returning to S100, and repeating the update process until the policy converges.
2. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S100, the robot executes its current action policy π(a_t | s_t, g) and interacts with the environment to acquire experience data of the executed actions, where π(a_t | s_t, g) takes the robot's current state s_t as input and outputs the action a_t to be executed; by executing a_t, the robot obtains the reward value r_t from the environment; a virtual goal g' = φ(s) is generated from the experience data, and the original experience data are re-conditioned on the virtual goal to generate the virtual hindsight experience data.
3. The method of claim 1, wherein the current state of the robot comprises the joint angles, the joint velocities, and the target position.
4. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S300, the objective function value is estimated over the hindsight experience data as a normalized sum of discounted, importance-weighted advantage values (the formula is reproduced only as an image in the original publication), where A denotes the advantage function, i.e., the advantage of executing action a_k in state s_k over the current policy, a normalization factor normalizes the importance weights, π_θ̃ denotes the pre-update policy, θ the parameters of the post-update policy, γ the reward discount factor, N the number of trajectories under the virtual goal g', and t the reinforcement-learning time step.
5. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S400, the KL divergence value between policies is estimated using the quadratic approximation of the KL divergence, corrected by the weighted importance sampling (the formula is reproduced only as an image in the original publication).
6. The trust region policy optimization method based on hindsight experience according to claim 1, wherein, in S500, the policy step size is computed from the maximum KL divergence limit ε (the formula is reproduced only as an image in the original publication), where ε denotes the maximum KL divergence limit.
7. A trust region policy optimization apparatus based on hindsight experience, characterized by comprising:
the acquisition module is used for generating virtual hindsight experience data by using the acquired experience data of the robot executing actions under the goal condition during policy training and taking goal points actually reached in the experience data as virtual goal points;
the filtering module is used for filtering the virtual goals with a hindsight goal filtering algorithm and obtaining the training data corresponding to the hindsight goals whose distribution is close to that of the original goals;
the correction module is used for correcting, based on the virtual experience data, the distribution deviation between the virtual experience data and the original experience data through weighted importance sampling, estimating the objective function value accordingly, and obtaining the original policy gradient;
the estimation module is used for approximating the KL divergence with the quadratic KL divergence when the policy distributions are close, and correcting the distribution deviation between the virtual experience data and the original experience data based on weighted importance sampling, so as to estimate the KL divergence value between policies;
the updating module is used for correcting the gradient direction of the policy through the KL divergence, computing the policy step size from the maximum KL divergence step size, updating the current policy according to the step size, and repeating the update process until the policy converges.
8. A trust region policy optimization device based on hindsight experience, characterized by comprising: a memory, a processor, and a hindsight-experience-based trust region policy optimization program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the trust region policy optimization method based on hindsight experience according to any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that a hindsight-experience-based trust region policy optimization program is stored thereon, and the program, when executed by a processor, implements the steps of the trust region policy optimization method based on hindsight experience according to any one of claims 1 to 6.
CN202010713458.7A 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment Pending CN112101563A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010713458.7A CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010713458.7A CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Publications (1)

Publication Number Publication Date
CN112101563A (en) 2020-12-18

Family

ID=73749581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010713458.7A Pending CN112101563A (en) 2020-07-22 2020-07-22 Confidence domain strategy optimization method and device based on posterior experience and related equipment

Country Status (1)

Country Link
CN (1) CN112101563A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114102600A (en) * 2021-12-02 2022-03-01 西安交通大学 Multi-space fusion man-machine skill migration and parameter compensation method and system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026272A (en) * 2019-12-09 2020-04-17 网易(杭州)网络有限公司 Training method and device for virtual object behavior strategy, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HANBO ZHANG et al.: "Hindsight Trust Region Policy Optimization", arXiv *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114102600A (en) * 2021-12-02 2022-03-01 西安交通大学 Multi-space fusion man-machine skill migration and parameter compensation method and system
CN114102600B (en) * 2021-12-02 2023-08-04 西安交通大学 Multi-space fusion human-machine skill migration and parameter compensation method and system

Similar Documents

Publication Publication Date Title
KR102170105B1 (en) Method and apparatus for generating neural network structure, electronic device, storage medium
US10410362B2 (en) Method, device, and non-transitory computer readable storage medium for image processing
CN110276442B (en) Searching method and device of neural network architecture
US11580378B2 (en) Reinforcement learning for concurrent actions
JP2013242761A (en) Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
Chirodea et al. Comparison of tensorflow and pytorch in convolutional neural network-based applications
EP3612356B1 (en) Determining control policies for robots with noise-tolerant structured exploration
CN113222123A (en) Model training method, device, equipment and computer storage medium
US20210141383A1 (en) Systems and methods for improving generalization in visual navigation
CN111652371A (en) Offline reinforcement learning network training method, device, system and storage medium
CN112085056A (en) Target detection model generation method, device, equipment and storage medium
KR20200132305A (en) Method for performing convolution operation at predetermined layer within the neural network by electronic device, and electronic device thereof
Skolik et al. Robustness of quantum reinforcement learning under hardware errors
CN112016678A (en) Training method and device for strategy generation network for reinforcement learning and electronic equipment
CN112101563A (en) Confidence domain strategy optimization method and device based on posterior experience and related equipment
CN113156473B (en) Self-adaptive judging method for satellite signal environment of information fusion positioning system
CN117422005A (en) Method for automatically controlling simulation errors of analog circuit and application
CN110502975A (en) A kind of batch processing system that pedestrian identifies again
CN113361380B (en) Human body key point detection model training method, detection method and device
CN113361381B (en) Human body key point detection model training method, detection method and device
CN115879536A (en) Learning cognition analysis model robustness optimization method based on causal effect
Chen et al. Kalman Filtering Under Information Theoretic Criteria
US20200250493A1 (en) Apparatus for q-learning for continuous actions with cross-entropy guided policies and method thereof
CN103971339B (en) A kind of nuclear magnetic resonance image dividing method and equipment based on parametric method
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201218)