CN111144580B - Hierarchical reinforcement learning training method and device based on imitation learning - Google Patents
- Publication number: CN111144580B (application CN201911406220.3A)
- Authority: CN (China)
- Legal status: Active (an assumption by Google, not a legal conclusion)
Classifications
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N20/00—Machine learning
Abstract
The invention discloses a hierarchical reinforcement learning training method, device and electronic equipment based on imitation learning, comprising the following steps: acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial policy; and retraining based on reinforcement learning from the initial policy to determine a training model. Because both pre-training and retraining use the teaching data, prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Description
Technical Field
The present application relates to the field of machine learning, and more particularly, to a hierarchical reinforcement learning training method, apparatus and electronic device based on imitation learning.
Background
Reinforcement learning is one of the most active research directions in the field of artificial intelligence in recent years and has achieved remarkable results in many fields. An agent learns behavior in an environment by performing actions and observing the rewards or outcomes of those actions. However, an important drawback of reinforcement learning is the curse of dimensionality: as the dimension of the system state increases, the number of parameters to be trained grows exponentially, consuming large amounts of computing and storage resources.
Exploiting the inherently hierarchical and compositional nature of real-world tasks, hierarchical reinforcement learning decomposes a complex problem into several sub-problems and learns over multiple policy layers, each layer responsible for control at a different level of temporal and behavioral abstraction. Through such hierarchical abstraction, the curse of dimensionality can be mitigated effectively.
However, in the multi-step decisions of hierarchical reinforcement learning, the agent may receive rewards only infrequently during training, and this incremental, reward-driven learning approach faces an enormous search space.
Imitation learning, which learns directly from expert data provided by a demonstrator, handles multi-step decision problems well. It is therefore natural to exploit the advantages of imitation learning when training a hierarchical reinforcement learning model.
In previous work, however, teaching data was used to train the parameters of the hierarchical reinforcement learning model only in two separate steps: the teaching data was not applied in the hierarchical reinforcement learning step itself, and the internal connection between the two stages of training was ignored.
Accordingly, there is a need for an improved hierarchical reinforcement learning training method based on imitation learning.
Disclosure of Invention
Aiming at the above defects and shortcomings of the prior art, the invention provides a hierarchical reinforcement learning training method, device and electronic equipment based on imitation learning, which use teaching data for both pre-training and retraining, effectively utilize prior knowledge and strategies, reduce the search space, and improve training efficiency.
According to one aspect of the present invention, there is provided a hierarchical reinforcement learning training method based on imitation learning, including: abstracting and decomposing the reinforcement learning task into a plurality of subtasks; acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial policy; and retraining based on reinforcement learning from the initial policy to determine a training model.
Further, the abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
Further, the teaching data of the human expert may be acquired indirectly, using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly, by remotely controlling the agent through a device.
Further, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where
{s_1, s_2, …} represents the state sequence;
{g_1, g_2, …} represents the subtask (sub-goal) sequence;
{τ_1, τ_2, …} represents the action sequences under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g.
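As an illustration only (all names here are hypothetical, not from the patent), the teaching-data sequence D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …} could be held in a container like the following, which also shows how the (s, g) pairs and the (s, a) pairs under each sub-goal can be read back out:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical container mirroring D = {s_1, g_1, tau_1, s_2, g_2, tau_2, ...}:
# each segment pairs a starting state s_i and chosen sub-goal g_i with the
# state-action sequence tau_i executed to reach that sub-goal.
@dataclass
class Segment:
    state: Tuple[float, ...]                  # s_i: state where sub-goal g_i was chosen
    subgoal: str                              # g_i: sub-goal label
    tau: List[Tuple[Tuple[float, ...], str]]  # [(s, a), ...] executed under g_i

@dataclass
class TeachingData:
    segments: List[Segment] = field(default_factory=list)

    def subtask_pairs(self):
        """(s, g) pairs: training data for the subtask policy pi_mu: S -> G."""
        return [(seg.state, seg.subgoal) for seg in self.segments]

    def action_pairs(self, g):
        """(s, a) pairs under sub-goal g: training data for pi_g: S -> A."""
        return [(s, a) for seg in self.segments if seg.subgoal == g
                for (s, a) in seg.tau]

# A toy two-segment demonstration
D = TeachingData([
    Segment((0.0,), "open_fridge", [((0.0,), "reach"), ((0.3,), "pull")]),
    Segment((0.6,), "place_item",  [((0.6,), "move"),  ((0.9,), "release")]),
])
```

Under this layout, `D.subtask_pairs()` yields the data for π_μ and `D.action_pairs(g)` the data for each π_g.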
Further, the pre-training process includes: extracting data from the teaching data according to a given initial state; and determining the reinforcement learning initial policy from the extracted data according to an imitation learning method.
Further, the retraining process includes: executing the initial policy from a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool based on a hierarchical reinforcement learning method.
According to another aspect of the present invention, there is provided a hierarchical reinforcement learning training device based on imitation learning, comprising: a teaching data acquisition module, a pre-training module and a retraining module.
The teaching data acquisition module is used for acquiring teaching data required by reinforcement learning training.
The pre-training module is used for pre-training based on imitation learning by using teaching data and determining a reinforcement learning initial strategy.
The retraining module is used for retraining based on reinforcement learning according to an initial strategy and determining a training model.
Further, the acquisition means includes acquisition through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, WiFi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G and 5G).
According to still another aspect of the present invention, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to yet another aspect of the present invention, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to the training method, the teaching data are used for both pre-training and retraining, so prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Drawings
Various other advantages and benefits of the present application will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. It is apparent that the drawings described below are only some embodiments of the present application and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort. Also, like reference numerals are used to designate like parts throughout the figures.
FIG. 1 is a diagram of the general process of reinforcement learning;
FIG. 2 is a flow diagram of a hierarchical reinforcement learning training method based on imitation learning in accordance with one embodiment of the present invention;
FIG. 3 is a flow chart of pre-training based on imitation learning using teaching data in accordance with one embodiment of the present invention;
FIG. 4 is a flow chart of retraining based on reinforcement learning according to an initial policy according to one embodiment of the invention;
FIG. 5 is a block diagram of a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention;
fig. 6 is a block diagram of an electronic device according to one embodiment of the invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
Fig. 1 shows the general process of reinforcement learning. In general, a reinforcement learning system includes an agent and an execution environment; the agent continually learns and optimizes its policy through interaction with, and feedback from, the execution environment. Specifically, the agent observes the state of the execution environment and, according to its policy, determines the action to take in the current state. When the action acts on the execution environment, it changes the environment's state and produces a feedback signal to the agent, also known as a reward. From the obtained reward, the agent judges whether its previous behavior was correct and whether the policy needs to be adjusted, and updates the policy accordingly. By repeatedly observing states, determining actions and receiving feedback, the agent continually updates the policy, with the ultimate goal of learning a policy that maximizes the accumulated reward.
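The observe-act-reward-update loop of Fig. 1 can be sketched minimally as follows. This is not the patent's method, only a generic illustration: a toy two-state environment, an epsilon-greedy tabular policy, and a Q-learning update stand in for the real agent and environment, and every name here is hypothetical.

```python
import random

def run_episode(q, env_step, n_steps=50, eps=0.1, alpha=0.5, gamma=0.9):
    """One episode of the Fig. 1 loop: observe, act, receive reward, update."""
    state, total = 0, 0.0
    for _ in range(n_steps):
        # Observe the state and determine the action to take
        if random.random() < eps:
            action = random.choice([0, 1])                     # explore
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])  # exploit
        next_state, reward = env_step(state, action)           # environment feedback
        # Adjust the policy using the obtained reward (Q-learning update)
        target = reward + gamma * max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (target - q[(state, action)])
        state, total = next_state, total + reward
    return total

def toy_env(state, action):
    # Toy 2-state environment: action 1 is rewarded and flips the state
    return (1 - state if action == 1 else state), (1.0 if action == 1 else 0.0)

random.seed(0)
q_table = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for _ in range(20):
    run_episode(q_table, toy_env)
```

After a few episodes the learned values favor action 1 in both states, i.e. the agent has learned the reward-maximizing policy for this toy environment.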
Exemplary method
FIG. 2 illustrates a flow diagram of a hierarchical reinforcement learning training method based on imitation learning, according to one embodiment of the invention.
As shown in fig. 2, a hierarchical reinforcement learning training method based on imitation learning according to one embodiment of the present invention includes:
S21: acquiring teaching data of a human expert
The teaching data of the human expert may be acquired indirectly, using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly, by remotely controlling the agent through a device.
Indirect acquisition via sensors means first collecting task data from a human through the sensors and then, for a humanoid agent whose structure is similar to the human body, mapping the human data onto the humanoid agent.
For example, a human wearing a data glove performs the task while the glove records the data, which is then mapped to a humanoid gripper to obtain teaching data for machine learning on that gripper.
Direct acquisition by remotely controlling the agent means, for example, using a teach pendant to control the motion of a biomimetic agent body while recording the motion data. Because these data are acquired through the agent's own motion, they can be used directly for machine learning on the biomimetic body.
In addition, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where
{s_1, s_2, …} represents the state sequence;
{g_1, g_2, …} represents the subtask (sub-goal) sequence;
{τ_1, τ_2, …} represents the action sequences under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g.
The goal of hierarchical reinforcement learning training is to obtain a subtask policy π_μ: S → G and, for each specific sub-goal g, a sub-policy π_g: S → A.
S22: pre-training based on imitation learning using the teaching data, determining the reinforcement learning initial policy
FIG. 3 illustrates a flow chart of pre-training based on imitation learning using teaching data, according to one embodiment of the invention.
As shown in fig. 3, pre-training based on imitation learning using teaching data according to one embodiment of the present invention includes:
S31: extracting data from the teaching data according to a given initial state
According to the given initial state, the corresponding data are extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}: the state-action subsequences τ are used to train the policy network π_g, and the state-sub-goal sequence {s_1, g_1, s_2, g_2, …} is used to train the policy network π_μ.
S32: determining the reinforcement learning initial policy from the extracted data according to an imitation learning method
Based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, e.g. behavioral cloning, yielding the corresponding output policy networks π̂_g and π̂_μ, which together serve as the reinforcement learning initial policy.
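A minimal behavioural-cloning sketch of step S32 could look as follows. It is only an illustration under simplifying assumptions (tabular policies "trained" by majority vote over the demonstrated labels, rather than neural policy networks; all names hypothetical): π_μ is cloned from (s, g) pairs and each π_g from the (s, a) pairs under sub-goal g.

```python
from collections import Counter, defaultdict

def clone_policy(pairs):
    """Tabular behavioural cloning: map each state to its most-demonstrated label."""
    votes = defaultdict(Counter)
    for state, label in pairs:
        votes[state][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

# Extracted from the teaching data (toy values):
# (s, g) pairs train pi_mu; (s, a) pairs under each sub-goal g train pi_g.
mu_pairs = [("at_counter", "pick_up"), ("at_fridge", "open_door"),
            ("at_counter", "pick_up")]
g_pairs = {"pick_up": [("at_counter", "reach"), ("holding", "lift")]}

pi_mu_hat = clone_policy(mu_pairs)                     # initial subtask policy
pi_g_hat = {g: clone_policy(p) for g, p in g_pairs.items()}  # initial sub-policies
```

The pair (pi_mu_hat, pi_g_hat) then plays the role of the initial policy handed to the retraining stage.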
S23: retraining based on reinforcement learning according to the initial policy, determining a training model
FIG. 4 illustrates a flow chart of retraining based on reinforcement learning according to the initial policy, according to one embodiment of the invention.
As shown in fig. 4, retraining based on reinforcement learning according to one embodiment of the invention includes:
S41: executing the initial policy according to a given initial state to acquire training data
According to the given initial state, the initial policies π̂_μ and π̂_g are executed to acquire training data (s, g, τ).
S42: determining an experience data pool based on the teaching data and the training data
The union of the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …} and the training data (s, g, τ) is taken as the experience data pool.
S43: determining the final training model based on a hierarchical reinforcement learning method using the experience data pool
Model training is performed on the experience-pool data based on a hierarchical reinforcement learning method, such as the MAXQ method, until the termination goal is met; the policy networks π_μ and π_g at that point constitute the final trained model.
Obviously, performing hierarchical reinforcement learning on an experience data pool composed of the initial policy's rollouts, the teaching data and the training data effectively utilizes the expert's prior knowledge and strategies, reduces the search space, and improves training efficiency.
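The S41-S43 flow above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: a trivial rollout of the initial policy stands in for S41, a list union for the experience pool of S42, and a counting stub stands in for a real MAXQ-style learner in S43.

```python
def rollout(pi_mu, pi_g, start_state, horizon=3):
    """S41: execute the initial policy from a given start state, logging (s, g, a)."""
    data, state = [], start_state
    for _ in range(horizon):
        g = pi_mu.get(state)
        if g is None:          # no sub-goal defined for this state: stop
            break
        a = pi_g[g].get(state, "noop")
        data.append((state, g, a))
        state = f"{state}'"    # toy deterministic transition: tick the state name
    return data

def retrain(experience_pool):
    """S43 stub for hierarchical RL (e.g. MAXQ): here, just count visits per sub-goal."""
    counts = {}
    for _, g, _ in experience_pool:
        counts[g] = counts.get(g, 0) + 1
    return counts

teaching = [("s0", "open", "reach"), ("s1", "open", "pull")]  # expert data
pi_mu = {"s0": "open"}                 # initial subtask policy from pre-training
pi_g = {"open": {"s0": "reach"}}       # initial sub-policy from pre-training

training = rollout(pi_mu, pi_g, "s0")  # S41: data from executing the initial policy
pool = teaching + training             # S42: experience pool = union of both sources
model = retrain(pool)                  # S43: train on the pooled experience
```

The point of the sketch is the data flow: the learner in S43 sees both the expert's teaching data and the initial policy's own rollouts.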
Further, the hierarchical reinforcement learning training method based on imitation learning also includes: abstracting and decomposing the reinforcement learning task into a plurality of subtasks.
Following the hierarchical reinforcement learning idea, the reinforcement learning task is abstracted and decomposed into a plurality of subtasks for subsequent reinforcement learning training.
The abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
State space decomposition divides the state space into several different subspaces, each of which is trained separately, so that each training run takes place in a small-scale space.
Temporal abstraction treats a task as a set of extended actions and expands single-step training to multiple steps, reducing the number of decisions made at any single moment and thus the training burden.
For example, the task of placing fruit in a refrigerator is broken down into the subtasks "pick up the fruit", "open the refrigerator", "put the fruit in the refrigerator" and "close the refrigerator", and specific operations are performed under each subtask, such as selecting the type and number of fruits in the pick-up subtask, or deciding, in the placing subtask, whether the fruit needs refrigeration or freshness preservation and which compartment of the refrigerator to use.
Spatial abstraction is realized by ignoring the many state dimensions that are irrelevant to a given subtask.
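The fruit-in-fridge example can illustrate two of these abstractions concretely. The sketch below (all variable names are hypothetical) encodes the temporal decomposition as an ordered list of subtasks and the spatial abstraction as a projection of the full state onto the variables each subtask actually needs:

```python
# Temporal abstraction: the task as an ordered set of extended subtasks
SUBTASKS = ["pick_up_fruit", "open_fridge", "place_in_fridge", "close_fridge"]

# Spatial abstraction: the state variables each subtask actually observes
RELEVANT = {
    "pick_up_fruit":   {"fruit_type", "fruit_count", "hand_pose"},
    "open_fridge":     {"door_angle", "hand_pose"},
    "place_in_fridge": {"compartment", "needs_freezing", "hand_pose"},
    "close_fridge":    {"door_angle", "hand_pose"},
}

def abstract_state(full_state, subtask):
    """Project the full state onto the subtask's relevant variables."""
    return {k: v for k, v in full_state.items() if k in RELEVANT[subtask]}

full = {"fruit_type": "apple", "fruit_count": 3, "hand_pose": (0, 0),
        "door_angle": 0.0, "compartment": "crisper", "needs_freezing": False,
        "room_light": True}   # room_light is irrelevant to every subtask

state_open = abstract_state(full, "open_fridge")
```

Each sub-policy π_g then only ever sees its projected state, which is exactly how spatial abstraction shrinks the effective state space per subtask.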
Exemplary apparatus
FIG. 5 illustrates a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention.
As shown in fig. 5, a hierarchical reinforcement learning training device 500 based on imitation learning according to an embodiment of the present invention includes: a teaching data acquisition module 510, a pre-training module 520 and a retraining module 530.
The teaching data acquisition module 510 is configured to acquire the teaching data required for reinforcement learning training.
Specifically, the acquisition mode includes acquisition through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, WiFi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G and 5G).
For example, the teaching data may be acquired from the sensors through a Bluetooth module.
The pre-training module 520 is configured to perform pre-training based on imitation learning using the teaching data and determine the reinforcement learning initial policy.
The retraining module 530 is configured to retrain based on reinforcement learning according to the initial policy and determine the training model.
Exemplary electronic device
Next, a block diagram of an electronic device according to an embodiment of the present application is described with reference to fig. 6.
As shown in fig. 6, the electronic device 60 includes one or more processors 61 and memory 62.
The processor 61 may be a Central Processing Unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 60 to perform desired functions.
The memory 62 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 61 to implement the hierarchical reinforcement learning training method based on imitation learning of the embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 60 may further include: an input device 63 and an output device 64, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 63 may include, for example, a keyboard, a mouse, or the like, which may be used to input the teaching data.
The output device 64 may output various information to the outside and may include, for example, a display, speakers, a printer, or a communication network and the remote output devices connected to it, and may be used to output the reinforcement learning training results.
Of course, only some of the components of the electronic device 60 that are relevant to the present application are shown in fig. 6 for simplicity, components such as buses, input/output interfaces, etc. are omitted. In addition, the electronic device 60 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods, apparatus and systems described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning according to embodiments of the present application described in the "Exemplary method" section of this specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning in the embodiments of the present application.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, devices, systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, devices, systems may be connected, arranged, configured in any manner.
Words such as "including", "comprising" and "having" are open-ended, mean "including but not limited to", and may be used interchangeably. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the methods, apparatus and devices of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
Claims (8)
1. A hierarchical reinforcement learning training method based on imitation learning, comprising:
acquiring teaching data of a human expert, the teaching data being a decision data sequence suitable for hierarchical reinforcement learning, expressed as D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where {s_1, s_2, …} represents the state sequence, {g_1, g_2, …} represents the subtask sequence, and {τ_1, τ_2, …} represents the action sequences under the given subtasks, in which τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g;
pre-training based on imitation learning using the teaching data, determining an initial policy, wherein the pre-training process comprises: extracting data from the teaching data according to a given initial state, and determining the reinforcement learning initial policy from the extracted data according to an imitation learning method; specifically: according to the given initial state, the corresponding data are extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein the state-action subsequences τ are used to train the policy network π_g and the state-sub-goal sequence {s_1, g_1, s_2, g_2, …} is used to train the policy network π_μ; based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, yielding the corresponding output policy networks π̂_g and π̂_μ, which are taken as the reinforcement learning initial policy; and
retraining based on reinforcement learning from the initial policy, determining a training model, wherein the retraining process comprises: executing the initial policy from a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool based on a hierarchical reinforcement learning method.
2. The training method of claim 1, further comprising abstracting and decomposing the reinforcement learning task into a plurality of subtasks.
3. The training method of claim 2, wherein the abstraction methods comprise: state space decomposition, temporal abstraction, and spatial abstraction.
4. The training method of claim 1, wherein the teaching data of the human expert is acquired indirectly using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly by remotely controlling the agent through a device.
5. Hierarchical reinforcement learning training device based on imitative learning, characterized by comprising:
the teaching data acquisition module is used for acquiring teaching data required by reinforcement learning training; the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, and is expressed as: d= { s 1 ,g 1 ,τ 1 ,s 2 ,g 2 ,τ 2 ,…},{s 1 ,s 2 … } represents a sequence of states; { g 1 ,g 2 … } represents a subtask policy sequence; { tau 1 ,τ 2 … } represents the sequence of actions under a given subtask, τ= { s 1 ,a 1 ,s 2 ,a 2 … represents a series of state base action sequences that need to be completed in order to reach the corresponding subtask target g;
the pre-training module is used for pre-training based on imitation learning using the teaching data and determining a reinforcement learning initial strategy; the pre-training process comprises the following steps: extracting data from the teaching data according to a given initial state; determining a reinforcement learning initial strategy according to an imitation learning method based on the extracted data; specifically: according to the given initial state, extracting the corresponding data from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein the subsequence of (s_i, g_i) pairs is used for training the policy network π_g, and the sequence of state-action pairs within each τ_i, paired with its subtask g_i, is used for training the policy network π_μ; training the policy networks π_g and π_μ on the extracted data according to the imitation learning method, outputting the corresponding trained policy networks, and taking their pair as the reinforcement learning initial strategy;
the retraining module is used for retraining based on reinforcement learning according to the initial strategy and determining a training model; the retraining process comprises the following steps: executing the initial strategy from the given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool using a hierarchical reinforcement learning method.
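The decision data sequence handled by the modules above can be sketched as follows; the flat-list encoding of D = {s_1, g_1, τ_1, …} and the function name are assumptions for illustration. The sketch groups D into (s, g, τ) triples and extracts the two training subsequences: (s, g) pairs for the high-level network π_g and (s_t, g, a_t) tuples for the low-level network π_μ.

```python
def split_teaching_data(D):
    """Group the flat sequence D = [s1, g1, tau1, s2, g2, tau2, ...] into
    (s, g, tau) triples, then extract (s, g) pairs for the high-level
    policy and (s_t, g, a_t) tuples for the low-level policy."""
    triples = [tuple(D[i:i + 3]) for i in range(0, len(D), 3)]
    high = [(s, g) for s, g, tau in triples]
    low = [(s_t, g, a_t) for s, g, tau in triples for s_t, a_t in tau]
    return high, low
```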
6. The training device of claim 5, wherein the teaching data acquisition module acquires the teaching data through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive, or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, Wi-Fi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G, or 5G).
7. An electronic device, comprising: a processor;
and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the training method of any of claims 1-4.
8. A computer readable storage medium storing a computer program for performing the training method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406220.3A CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406220.3A CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111144580A CN111144580A (en) | 2020-05-12 |
CN111144580B true CN111144580B (en) | 2024-04-12 |
Family
ID=70522686
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911406220.3A Active CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111144580B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112034888B (en) * | 2020-09-10 | 2021-07-30 | 南京大学 | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle |
CN112162564B (en) * | 2020-09-25 | 2021-09-28 | 南京大学 | Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm |
CN112264995B (en) * | 2020-10-16 | 2021-11-16 | 清华大学 | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
CN113408621B (en) * | 2021-06-21 | 2022-10-14 | 中国科学院自动化研究所 | Rapid simulation learning method, system and equipment for robot skill learning |
CN113837396B (en) * | 2021-09-26 | 2023-08-04 | 中国联合网络通信集团有限公司 | B-M2M-based device simulation learning method, MEC and storage medium |
CN114609925B (en) * | 2022-01-14 | 2022-12-06 | 中国科学院自动化研究所 | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish |
CN114386524A (en) * | 2022-01-17 | 2022-04-22 | 深圳市城图科技有限公司 | Power equipment identification method for dynamic self-adaptive graph layering simulation learning |
CN115204387B (en) * | 2022-07-21 | 2023-10-03 | 法奥意威(苏州)机器人系统有限公司 | Learning method and device under layered target condition and electronic equipment |
CN116079737A (en) * | 2023-02-23 | 2023-05-09 | 南京邮电大学 | Mechanical arm complex operation skill learning method and system based on layered reinforcement learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | 清华大学 | Deeply learning method and device based on ambient condition prediction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11562287B2 (en) * | 2017-10-27 | 2023-01-24 | Salesforce.Com, Inc. | Hierarchical and interpretable skill acquisition in multi-task reinforcement learning |
- 2019-12-31: CN application CN201911406220.3A filed (patent CN111144580B), status: Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | 清华大学 | Deeply learning method and device based on ambient condition prediction |
Non-Patent Citations (2)
Title |
---|
Dai Zhaohui; Yuan Jiaohong; Wu Min; Chen Xin. Dynamic hierarchical reinforcement learning based on probabilistic models. Control Theory & Applications. 2011, (11), full text. *
Sui Hongjian; Shang Weiwei; Li Xiang; Cong Shuang. Robot control policy transfer based on progressive neural networks. Journal of University of Science and Technology of China. 2019, (10), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111144580A (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144580B (en) | Hierarchical reinforcement learning training method and device based on imitation learning | |
US11584008B1 (en) | Simulation-real world feedback loop for learning robotic control policies | |
US10766136B1 (en) | Artificial intelligence system for modeling and evaluating robotic success at task performance | |
Yang et al. | Hierarchical deep reinforcement learning for continuous action control | |
Amarjyoti | Deep reinforcement learning for robotic manipulation-the state of the art | |
US10766137B1 (en) | Artificial intelligence system for modeling and evaluating robotic success at task performance | |
Heess et al. | Actor-critic reinforcement learning with energy-based policies | |
US11714996B2 (en) | Learning motor primitives and training a machine learning system using a linear-feedback-stabilized policy | |
Jeerige et al. | Comparison of deep reinforcement learning approaches for intelligent game playing | |
CN113826051A (en) | Generating digital twins of interactions between solid system parts | |
JP2010179454A5 (en) | ||
EP2363251A1 (en) | Robot with Behavioral Sequences on the basis of learned Petri Net Representations | |
JP2010179454A (en) | Learning and use of schemata in robotic device | |
JP2021501433A (en) | Generation of control system for target system | |
KR20210033809A (en) | Control server and method for controlling robot using artificial neural network, and the robot implementing the same | |
CN114730407A (en) | Modeling human behavior in a work environment using neural networks | |
EP4088228A1 (en) | Autonomous control system and method using embodied homeostatic feedback in an operating environment | |
Hafez et al. | Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination | |
Hafez et al. | Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space | |
Bellas et al. | A cognitive developmental robotics architecture for lifelong learning by evolution in real robots | |
Goertzel et al. | Cognitive synergy between procedural and declarative learning in the control of animated and robotic agents using the opencogprime agi architecture | |
CN114529010A (en) | Robot autonomous learning method, device, equipment and storage medium | |
Floyd et al. | Building learning by observation agents using jloaf | |
Contardo et al. | Learning states representations in pomdp | |
CN112230618A (en) | Method for automatically synthesizing multi-robot distributed controller from global task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |