CN111144580B - Hierarchical reinforcement learning training method and device based on imitation learning - Google Patents

Hierarchical reinforcement learning training method and device based on imitation learning

Info

Publication number
CN111144580B
CN111144580B CN201911406220.3A
Authority
CN
China
Prior art keywords
training
data
reinforcement learning
learning
teaching data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911406220.3A
Other languages
Chinese (zh)
Other versions
CN111144580A (en)
Inventor
唐思琦
李明强
陈思
高放
黄彬城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN201911406220.3A priority Critical patent/CN111144580B/en
Publication of CN111144580A publication Critical patent/CN111144580A/en
Application granted granted Critical
Publication of CN111144580B publication Critical patent/CN111144580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/10Interfaces, programming languages or software development kits, e.g. for simulating neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention discloses a hierarchical reinforcement learning training method and device based on imitation learning, and an electronic device, comprising the following steps: acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial strategy; and retraining based on reinforcement learning according to the initial strategy to determine a training model. Pre-training and retraining are each carried out using the teaching data, so prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.

Description

Hierarchical reinforcement learning training method and device based on imitation learning
Technical Field
The present application relates to the field of machine learning, and more particularly, to a hierarchical reinforcement learning training method, apparatus and electronic device based on imitation learning.
Background
Reinforcement learning is one of the most active research directions in artificial intelligence in recent years and has achieved remarkable results in many fields. The agent learns behavior in the environment by performing actions and observing the rewards or outcomes obtained from those actions. However, an important disadvantage of reinforcement learning is the curse of dimensionality: as the dimensionality of the system state increases, the number of parameters to be trained grows exponentially, consuming large amounts of computing and storage resources.
Based on the inherent hierarchical and compositional nature of real-world tasks, hierarchical reinforcement learning breaks a complex problem down into several sub-problems and learns at multiple policy layers, each layer being responsible for time and behavior abstractions at a different level. Through such hierarchical abstraction, the curse of dimensionality can be greatly alleviated.
However, in the multi-step decision making of hierarchical reinforcement learning, the agent may receive rewards only infrequently during training, and this incremental, reward-driven learning approach faces an enormous search space.
Imitation learning learns directly from expert data provided by a demonstrator and is therefore well suited to multi-step decision problems. It is thus natural to exploit the advantages of imitation learning when training hierarchical reinforcement learning models.
In existing approaches, however, imitation learning on teaching data and hierarchical reinforcement learning are performed as two separate steps: the teaching data is used only to train the initial parameters of the hierarchical reinforcement learning model and is not applied in the hierarchical reinforcement learning step itself, so the internal connection between the two stages of training is ignored.
Accordingly, there is a need for an improved hierarchical reinforcement learning training method based on imitation learning.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides a hierarchical reinforcement learning training method, device and electronic equipment based on imitation learning, which perform pre-training and retraining using the teaching data, effectively utilize prior knowledge and strategies, reduce the search space, and improve training efficiency.
According to an aspect of the present invention, there is provided a hierarchical reinforcement learning training method based on imitation learning, including: abstracting and decomposing the reinforcement learning task into a plurality of subtasks; acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial strategy; and retraining based on reinforcement learning according to the initial strategy to determine a training model.
Further, the abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
Further, the teaching data of the human expert may be acquired indirectly using one or more sensors such as optical, visual, inertial navigation, or data glove sensors, or acquired directly by remotely controlling the agent through a device.
Further, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein
{s_1, s_2, …} represents a sequence of states;
{g_1, g_2, …} represents a sequence of subtask strategies;
{τ_1, τ_2, …} represents the sequences of actions under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the sequence of states and basic actions that need to be completed in order to reach the corresponding subtask target g.
Further, the pre-training process includes: extracting data from the teaching data according to a given initial state; and determining the reinforcement learning initial strategy according to an imitation learning method based on the extracted data.
Further, the retraining process includes: executing the initial strategy according to a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model based on a hierarchical reinforcement learning method using the experience data pool.
According to another aspect of the present invention, there is provided a hierarchical reinforcement learning training device based on imitation learning, including: the training system comprises a teaching data acquisition module, a pre-training module and a retraining module.
The teaching data acquisition module is used for acquiring teaching data required by reinforcement learning training.
The pre-training module is used for pre-training based on imitation learning by using teaching data and determining a reinforcement learning initial strategy.
The retraining module is used for retraining based on reinforcement learning according to an initial strategy and determining a training model.
Further, the acquiring means includes acquiring through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive, or a tape drive, or through wireless transmission (including but not limited to Bluetooth, infrared, Wi-Fi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G, 5G).
According to still another aspect of the present invention, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to yet another aspect of the present invention, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to the training method of the present invention, the teaching data is used for both pre-training and retraining, prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Drawings
Various other advantages and benefits of the present application will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. It is apparent that the drawings described below are only some embodiments of the present application and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort. Also, like reference numerals are used to designate like parts throughout the figures.
FIG. 1 is a general process of reinforcement learning;
FIG. 2 is a flow diagram of a hierarchical reinforcement learning training method based on imitation learning in accordance with one embodiment of the present invention;
FIG. 3 is a flow chart of pre-training based on imitation learning using teaching data in accordance with one embodiment of the present invention;
FIG. 4 is a flow chart of retraining based on reinforcement learning according to the initial strategy, in accordance with one embodiment of the invention;
FIG. 5 is a block diagram of a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention;
FIG. 6 is a block diagram of an electronic device according to one embodiment of the invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
Fig. 1 shows the general process of reinforcement learning. In general, a reinforcement learning system includes an agent and an execution environment; the agent constantly learns and optimizes its strategy through interaction with, and feedback from, the execution environment. Specifically, the agent observes the state of the execution environment and, according to its strategy, determines the action to take with respect to the current state. The action acts on the execution environment, changes its state, and generates feedback to the agent, also known as a reward. Based on the obtained reward, the agent judges whether its previous behavior was correct and whether the strategy needs to be adjusted, and updates the strategy accordingly. By repeatedly observing states, determining actions and receiving feedback, the agent continually updates its strategy, with the ultimate goal of learning a strategy that maximizes the accumulated reward.
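This observe-act-reward loop can be made concrete with a minimal, self-contained sketch. Everything in it (the toy line-world environment, the tabular Q-learning agent, and all names and constants) is an illustrative assumption added for clarity, not part of the patented method:

```python
import random

class LineWorld:
    """Toy environment: the agent moves along positions 0..5 and is rewarded at 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):                        # action is -1 or +1
        self.position = max(0, min(5, self.position + action))
        reward = 1.0 if self.position == 5 else 0.0
        return self.position, reward, self.position == 5

class QLearningAgent:
    """Minimal epsilon-greedy tabular Q-learning agent."""
    ACTIONS = (1, -1)

    def __init__(self, epsilon=0.1, alpha=0.5, gamma=0.9):
        self.q = {}                                # (state, action) -> estimated value
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma

    def act(self, state):                          # observe the state, choose an action
        if random.random() < self.epsilon:
            return random.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q.get((state, a), 0.0))

    def update(self, state, action, reward, next_state):
        # adjust the strategy from the received feedback (one Q-learning step)
        best_next = max(self.q.get((next_state, a), 0.0) for a in self.ACTIONS)
        old = self.q.get((state, action), 0.0)
        self.q[(state, action)] = old + self.alpha * (reward + self.gamma * best_next - old)

agent = QLearningAgent()
for episode in range(200):                         # repeatedly observe, act, receive feedback
    env, state, done, steps = LineWorld(), 0, False, 0
    while not done and steps < 100:
        action = agent.act(state)
        next_state, reward, done = env.step(action)
        agent.update(state, action, reward, next_state)
        state, steps = next_state, steps + 1
```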
Exemplary method
FIG. 2 illustrates a flow diagram of a hierarchical reinforcement learning training method based on imitation learning, according to one embodiment of the invention.
As shown in fig. 2, a hierarchical reinforcement learning training method based on imitation learning according to one embodiment of the present invention includes:
S21: Acquiring teaching data of a human expert
Teaching data of a human expert may be acquired indirectly using one or more sensors such as optical, visual, inertial navigation, or data glove sensors, or acquired directly by remotely controlling the agent through a device.
Further, indirect acquisition of teaching data using sensors means first acquiring task data from a human through the sensors and then, for a humanoid agent with a structure similar to the human body, mapping the human data onto that agent.
For example, a human wearing a data glove performs the task, and the recorded data is mapped onto a humanoid gripper to obtain teaching data for machine learning of that gripper.
Alternatively, teaching data may be acquired directly by remotely controlling the agent: for example, a teach pendant controls the motion of the bionic agent body and the motion data is recorded. Because the data is acquired from the agent's own movements, it can be used directly for machine learning of the bionic body.
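As a hedged sketch of this direct acquisition mode, one recording step might look as follows in Python; the agent and teach-pendant interfaces (read_state, read_command, apply, segment_done), the subtask label, and the dictionary layout are assumptions made for illustration only, not part of the claimed device:

```python
from typing import Any, Dict, List, Tuple

def record_teaching_segment(agent, pendant, subtask_id: int) -> Dict[str, Any]:
    """Record one (s, g, tau) segment while a human drives the agent with a teach pendant.

    The agent/pendant interfaces used here are hypothetical placeholders.
    """
    start_state = agent.read_state()
    tau: List[Tuple[Any, Any]] = []
    while not pendant.segment_done():
        state = agent.read_state()
        action = pendant.read_command()   # the operator's command becomes the action label
        agent.apply(action)               # the agent itself executes, so no remapping is needed
        tau.append((state, action))
    return {"state": start_state, "subtask": subtask_id, "trajectory": tau}
```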
In addition, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein
{s_1, s_2, …} represents a sequence of states;
{g_1, g_2, …} represents a sequence of subtask strategies;
{τ_1, τ_2, …} represents the sequences of actions under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the sequence of states and basic actions that need to be completed in order to reach the corresponding subtask target g.
The goal of hierarchical reinforcement learning training is to obtain a subtask strategy π_μ: S→G, and a sub-strategy π_g: S→A for each specific sub-target g.
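As an illustration, the teaching sequence D and the two supervision signals it carries might be represented as follows; the dataclass, its field names, and the toy values are assumptions added for clarity rather than a prescribed format:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SubtaskSegment:
    """One (s, g, tau) triple from the teaching sequence D."""
    state: Tuple[float, ...]                           # state s_i at which subtask g_i was chosen
    subtask: int                                       # subtask / sub-goal index g_i
    trajectory: List[Tuple[Tuple[float, ...], int]]    # tau_i = [(s, a), (s, a), ...]

# A toy teaching sequence D = {s_1, g_1, tau_1, s_2, g_2, tau_2, ...}
teaching_data: List[SubtaskSegment] = [
    SubtaskSegment(state=(0.0, 0.0), subtask=0,
                   trajectory=[((0.0, 0.0), 1), ((0.5, 0.0), 1)]),
    SubtaskSegment(state=(1.0, 0.0), subtask=1,
                   trajectory=[((1.0, 0.0), 0), ((1.0, 0.5), 2)]),
]

# From D one can read off the two supervision signals:
#   (s_i, g_i) pairs supervise the subtask strategy pi_mu : S -> G
#   (s, a) pairs inside each tau_i supervise the sub-strategy pi_g : S -> A
subtask_pairs = [(seg.state, seg.subtask) for seg in teaching_data]
action_pairs = [(s, a) for seg in teaching_data for (s, a) in seg.trajectory]
```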
S22: pre-training based on simulated learning using teaching data, determining reinforcement learning initial strategy
FIG. 3 illustrates a flow chart for pre-training based on simulated learning using teaching data, according to one embodiment of the invention.
As shown in fig. 3, a flowchart for pre-training based on simulated learning using teaching data according to one embodiment of the present invention includes:
S31: Extracting data from the teaching data according to a given initial state
According to the given initial state, the corresponding data is extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where the state-action pairs {(s, a)} within each τ_i are used to train the policy network π_g, and the state-subtask pairs {(s_i, g_i)} are used to train the policy network π_μ.
S32: Determining the reinforcement learning initial strategy according to an imitation learning method based on the extracted data
Based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, for example a behavioral cloning method, yielding the corresponding output policy networks π_g^0 and π_μ^0, and (π_g^0, π_μ^0) is taken as the reinforcement learning initial strategy.
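A minimal behavioral-cloning sketch of this pre-training step is given below, assuming PyTorch. The network sizes, the toy (s, g) and (s, a) pairs, the input/output dimensions, and the "_0" names for the pre-trained outputs are illustrative assumptions, not the specific architecture of the invention:

```python
import torch
import torch.nn as nn

def behavior_clone(inputs, targets, in_dim, out_dim, epochs=200, lr=1e-3):
    """Fit a small policy network to (input, label) pairs by supervised learning,
    i.e. a basic behavioral-cloning step; sizes here are illustrative."""
    net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    x = torch.tensor(inputs, dtype=torch.float32)
    y = torch.tensor(targets, dtype=torch.long)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        optimizer.step()
    return net

# Toy supervision extracted from the teaching data D:
subtask_pairs = [((0.0, 0.0), 0), ((1.0, 0.0), 1)]                 # (s_i, g_i) pairs -> pi_mu
action_pairs = [((0.0, 0.0), 1), ((0.5, 0.0), 1),
                ((1.0, 0.0), 0), ((1.0, 0.5), 2)]                  # (s, a) pairs from tau_i -> pi_g

# pi_mu^0 : S -> G, cloned from the (s, g) pairs
pi_mu_0 = behavior_clone([s for s, g in subtask_pairs],
                         [g for s, g in subtask_pairs], in_dim=2, out_dim=2)

# pi_g^0 : S -> A, cloned from the (s, a) pairs; in practice the active sub-goal g
# would also be fed to this network (e.g. concatenated to the state).
pi_g_0 = behavior_clone([s for s, a in action_pairs],
                        [a for s, a in action_pairs], in_dim=2, out_dim=3)
```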
S23: retraining based on reinforcement learning according to an initial strategy, determining a training model
FIG. 4 illustrates a flow chart for retraining based on reinforcement learning according to an initial strategy, according to one embodiment of the invention.
As shown in fig. 4, retraining based on reinforcement learning according to the initial strategy according to one embodiment of the invention includes:
S41: Executing the initial strategy according to a given initial state to acquire training data
According to the given initial state, the initial strategies π_μ^0 and π_g^0 are executed to acquire training data (s, g, τ).
S42: Determining the experience data pool based on the teaching data and the training data
The union of the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …} and the training data (s, g, τ) is taken as the experience data pool.
S43: Determining the final training model based on a hierarchical reinforcement learning method using the experience data pool
The model is trained on the experience data pool based on a hierarchical reinforcement learning method, such as the MAXQ method, until the termination goal is met; the policy networks π_μ and π_g at that point constitute the final training model.
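The S41 to S43 loop could be sketched as follows. This is only a scaffold under stated assumptions: the environment interface (reset, step, subtask_done, task_done), the policies being treated as callables, and the update_fn placeholder standing in for a MAXQ-style hierarchical update are all hypothetical, not an implementation of the claimed method:

```python
import random

def collect_rollout(env, pi_mu, pi_g, initial_state, max_steps=50):
    """Run the current policies from a given initial state and record (s, g, tau) segments.
    The env methods used here are assumed placeholders."""
    rollout, state, steps = [], env.reset(initial_state), 0
    while steps < max_steps:
        g = pi_mu(state)                        # high-level policy chooses a subtask
        tau = []
        while not env.subtask_done(state, g) and steps < max_steps:
            a = pi_g(state, g)                  # low-level action for subtask g
            next_state, reward, done = env.step(a)
            tau.append((state, a, reward))      # store the transition for the pool
            state, steps = next_state, steps + 1
            if done:
                break
        rollout.append((state, g, tau))
        if env.task_done(state):
            break
    return rollout

def retrain(env, pi_mu_0, pi_g_0, teaching_data, initial_states, iterations, update_fn):
    """Experience pool = teaching data union rollouts of the current policies;
    update_fn stands in for one hierarchical RL update (e.g. a MAXQ-style step)."""
    experience_pool = list(teaching_data)
    pi_mu, pi_g = pi_mu_0, pi_g_0
    for _ in range(iterations):
        s0 = random.choice(initial_states)
        experience_pool.extend(collect_rollout(env, pi_mu, pi_g, s0))
        pi_mu, pi_g = update_fn(pi_mu, pi_g, experience_pool)
    return pi_mu, pi_g
```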
In this way, hierarchical reinforcement learning is performed on an experience data pool built from the initial strategy, the teaching data and the training data, so that the expert's prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Further, the hierarchical reinforcement learning training method based on imitation learning also includes: abstracting and decomposing the reinforcement learning task into a plurality of subtasks.
Based on the idea of hierarchical reinforcement learning, the reinforcement learning task is abstracted and decomposed into a plurality of subtasks before the reinforcement learning training described above.
The abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
State space decomposition divides the state space into a plurality of different subspaces; each subspace is trained separately, so that each training run takes place in a smaller space.
Temporal abstraction treats a task as a set of extended actions and expands single-step training into multi-step training, so the number of decisions required at any single moment is reduced and the training burden is lightened.
For example, the task of placing fruit in a refrigerator is broken down into a series of sub-tasks of "pick up fruit", "open refrigerator", "place refrigerator", "close refrigerator", and specific operations are performed under each sub-task, such as the type and number of fruits selected in the pick up fruit sub-task, whether the place in the refrigerator sub-task requires refrigeration or freshness, which compartment of the refrigerator to place in, and so forth.
Spatial abstraction is realized by ignoring the state dimensions that are irrelevant to a given subtask.
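For illustration, the refrigerator example above can be written down as a small task hierarchy; the Subtask class, the action names, and the flatten helper are assumptions used only to make the temporal-abstraction idea concrete:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Subtask:
    """A node in a task hierarchy produced by temporal abstraction."""
    name: str
    primitive_actions: List[str] = field(default_factory=list)
    children: List["Subtask"] = field(default_factory=list)

# The "put fruit in the refrigerator" task decomposed into subtasks; the high-level
# strategy pi_mu chooses among the children, and each child's sub-strategy pi_g
# chooses primitive actions, so single-step decisions are replaced by multi-step ones.
put_fruit_away = Subtask(
    name="put fruit in refrigerator",
    children=[
        Subtask("pick up fruit", primitive_actions=["choose type", "choose quantity", "grasp"]),
        Subtask("open refrigerator", primitive_actions=["reach handle", "pull door"]),
        Subtask("place in refrigerator", primitive_actions=["choose compartment", "release"]),
        Subtask("close refrigerator", primitive_actions=["push door"]),
    ],
)

def print_hierarchy(task: Subtask, depth: int = 0) -> None:
    """Print the hierarchy; each level is trained in its own, smaller decision space."""
    print("  " * depth + task.name)
    for child in task.children:
        print_hierarchy(child, depth + 1)

print_hierarchy(put_fruit_away)
```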
Exemplary apparatus
FIG. 5 illustrates a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention.
As shown in fig. 5, a hierarchical reinforcement learning training device 500 based on imitation learning according to an embodiment of the present invention includes: a teaching data acquisition module 510, a pre-training module 520, a retraining module 530.
The teaching data obtaining module 510 is configured to obtain teaching data required for reinforcement learning training.
Specifically, the acquiring mode includes acquiring through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive, or a tape drive, or through wireless transmission (including but not limited to Bluetooth, infrared, Wi-Fi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G, 5G).
For example, the teaching data is acquired from the sensor through a bluetooth module.
The pre-training module 520 is configured to perform pre-training based on imitation learning using the teaching data, and determine the reinforcement learning initial strategy.
The retraining module 530 is configured to retrain based on reinforcement learning according to an initial strategy, and determine a training model.
Exemplary electronic device
Next, a block diagram of an electronic device according to an embodiment of the present application is described with reference to fig. 6.
As shown in fig. 6, the electronic device 60 includes one or more processors 61 and memory 62.
The processor 61 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in the electronic device 60 to perform the desired functions.
Memory 62 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, read-only memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer readable storage medium and may be executed by the processor 61 to implement the hierarchical reinforcement learning training method based on imitation learning of the embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 60 may further include: an input device 63 and an output device 64, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 63 may include a keyboard, a mouse, or the like, which may be used to input the teaching data.
The output device 64 may output various information to the outside, and may include, for example, a display, speakers, a printer, or a communication network and its connected remote output devices, and may be used to output the reinforcement learning training results.
Of course, only some of the components of the electronic device 60 that are relevant to the present application are shown in fig. 6 for simplicity; components such as buses and input/output interfaces are omitted. In addition, the electronic device 60 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods, apparatus and systems described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning according to embodiments of the present application described in the "Exemplary method" section of this specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning in embodiments of the present application.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, devices, systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, devices, systems may be connected, arranged, configured in any manner.
Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to," and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to."
It is also noted that in the methods, apparatus and devices of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A hierarchical reinforcement learning training method based on imitation learning, comprising:
acquiring teaching data of a human expert; the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as: D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where {s_1, s_2, …} represents a sequence of states, {g_1, g_2, …} represents a sequence of subtask strategies, {τ_1, τ_2, …} represents the sequences of actions under the given subtasks, and τ = {s_1, a_1, s_2, a_2, …} represents the sequence of states and basic actions that need to be completed in order to reach the corresponding subtask target g;
pre-training based on imitation learning using the teaching data, determining an initial strategy; the pre-training process comprises: extracting data from the teaching data according to a given initial state; and determining the reinforcement learning initial strategy according to an imitation learning method based on the extracted data; specifically: according to the given initial state, the corresponding data is extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where the state-action pairs {(s, a)} within each τ_i are used to train the policy network π_g and the state-subtask pairs {(s_i, g_i)} are used to train the policy network π_μ; based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, yielding the corresponding output policy networks π_g^0 and π_μ^0, and (π_g^0, π_μ^0) is taken as the reinforcement learning initial strategy;
retraining based on reinforcement learning according to the initial strategy, determining a training model; the retraining process comprises: executing the initial strategy according to a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model based on a hierarchical reinforcement learning method using the experience data pool.
2. The training method of claim 1, further comprising abstracting the reinforcement learning task into a plurality of subtasks.
3. The training method of claim 2, wherein the abstraction methods comprise: state space decomposition, temporal abstraction, and spatial abstraction.
4. The training method of claim 1, wherein the teaching data of the human expert is acquired indirectly using one or more of optical, visual, inertial navigation, or data glove sensors, or acquired directly by remotely controlling the agent through a device.
5. A hierarchical reinforcement learning training device based on imitation learning, characterized by comprising:
a teaching data acquisition module, configured to acquire the teaching data required for reinforcement learning training; the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as: D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where {s_1, s_2, …} represents a sequence of states, {g_1, g_2, …} represents a sequence of subtask strategies, {τ_1, τ_2, …} represents the sequences of actions under the given subtasks, and τ = {s_1, a_1, s_2, a_2, …} represents the sequence of states and basic actions that need to be completed in order to reach the corresponding subtask target g;
a pre-training module, configured to perform pre-training based on imitation learning using the teaching data and determine the reinforcement learning initial strategy; the pre-training process comprises: extracting data from the teaching data according to a given initial state; and determining the reinforcement learning initial strategy according to an imitation learning method based on the extracted data; specifically: according to the given initial state, the corresponding data is extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where the state-action pairs {(s, a)} within each τ_i are used to train the policy network π_g and the state-subtask pairs {(s_i, g_i)} are used to train the policy network π_μ; based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, yielding the corresponding output policy networks π_g^0 and π_μ^0, and (π_g^0, π_μ^0) is taken as the reinforcement learning initial strategy;
a retraining module, configured to retrain based on reinforcement learning according to the initial strategy and determine a training model; the retraining process comprises: executing the initial strategy according to a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model based on a hierarchical reinforcement learning method using the experience data pool.
6. The training device of claim 5, wherein the teaching data acquisition module acquires the teaching data through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive, or a tape drive, or through wireless transmission (including but not limited to Bluetooth, infrared, Wi-Fi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G, 5G).
7. An electronic device, comprising: a processor;
and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the training method of any of claims 1-4.
8. A computer readable storage medium storing a computer program for performing the training method of any one of claims 1-4.
CN201911406220.3A 2019-12-31 2019-12-31 Hierarchical reinforcement learning training method and device based on imitation learning Active CN111144580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911406220.3A CN111144580B (en) 2019-12-31 2019-12-31 Hierarchical reinforcement learning training method and device based on imitation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911406220.3A CN111144580B (en) 2019-12-31 2019-12-31 Hierarchical reinforcement learning training method and device based on imitation learning

Publications (2)

Publication Number Publication Date
CN111144580A CN111144580A (en) 2020-05-12
CN111144580B true CN111144580B (en) 2024-04-12

Family

ID=70522686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911406220.3A Active CN111144580B (en) 2019-12-31 2019-12-31 Hierarchical reinforcement learning training method and device based on imitation learning

Country Status (1)

Country Link
CN (1) CN111144580B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112034888B (en) * 2020-09-10 2021-07-30 南京大学 Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle
CN112162564B (en) * 2020-09-25 2021-09-28 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN112264995B (en) * 2020-10-16 2021-11-16 清华大学 Robot double-shaft hole assembling method based on hierarchical reinforcement learning
CN113408621B (en) * 2021-06-21 2022-10-14 中国科学院自动化研究所 Rapid simulation learning method, system and equipment for robot skill learning
CN113837396B (en) * 2021-09-26 2023-08-04 中国联合网络通信集团有限公司 B-M2M-based device simulation learning method, MEC and storage medium
CN114609925B (en) * 2022-01-14 2022-12-06 中国科学院自动化研究所 Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish
CN114386524A (en) * 2022-01-17 2022-04-22 深圳市城图科技有限公司 Power equipment identification method for dynamic self-adaptive graph layering simulation learning
CN115204387B (en) * 2022-07-21 2023-10-03 法奥意威(苏州)机器人系统有限公司 Learning method and device under layered target condition and electronic equipment
CN116079737A (en) * 2023-02-23 2023-05-09 南京邮电大学 Mechanical arm complex operation skill learning method and system based on layered reinforcement learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11562287B2 (en) * 2017-10-27 2023-01-24 Salesforce.Com, Inc. Hierarchical and interpretable skill acquisition in multi-task reinforcement learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108288094A (en) * 2018-01-31 2018-07-17 清华大学 Deeply learning method and device based on ambient condition prediction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
戴朝晖; 袁姣红; 吴敏; 陈鑫. Dynamic hierarchical reinforcement learning based on probabilistic models. Control Theory & Applications. 2011, (11), full text. *
隋洪建; 尚伟伟; 李想; 丛爽. Robot control policy transfer based on progressive neural networks. Journal of University of Science and Technology of China. 2019, (10), full text. *

Also Published As

Publication number Publication date
CN111144580A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
CN111144580B (en) Hierarchical reinforcement learning training method and device based on imitation learning
US11584008B1 (en) Simulation-real world feedback loop for learning robotic control policies
US10766136B1 (en) Artificial intelligence system for modeling and evaluating robotic success at task performance
Yang et al. Hierarchical deep reinforcement learning for continuous action control
Amarjyoti Deep reinforcement learning for robotic manipulation-the state of the art
US10766137B1 (en) Artificial intelligence system for modeling and evaluating robotic success at task performance
Heess et al. Actor-critic reinforcement learning with energy-based policies
US11714996B2 (en) Learning motor primitives and training a machine learning system using a linear-feedback-stabilized policy
Jeerige et al. Comparison of deep reinforcement learning approaches for intelligent game playing
CN113826051A (en) Generating digital twins of interactions between solid system parts
JP2010179454A5 (en)
EP2363251A1 (en) Robot with Behavioral Sequences on the basis of learned Petri Net Representations
JP2010179454A (en) Learning and use of schemata in robotic device
JP2021501433A (en) Generation of control system for target system
KR20210033809A (en) Control server and method for controlling robot using artificial neural network, and the robot implementing the same
CN114730407A (en) Modeling human behavior in a work environment using neural networks
EP4088228A1 (en) Autonomous control system and method using embodied homeostatic feedback in an operating environment
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
Hafez et al. Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space
Bellas et al. A cognitive developmental robotics architecture for lifelong learning by evolution in real robots
Goertzel et al. Cognitive synergy between procedural and declarative learning in the control of animated and robotic agents using the opencogprime agi architecture
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
Floyd et al. Building learning by observation agents using jloaf
Contardo et al. Learning states representations in pomdp
CN112230618A (en) Method for automatically synthesizing multi-robot distributed controller from global task

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant