CN111144580B - Hierarchical reinforcement learning training method and device based on imitation learning - Google Patents
- Publication number: CN111144580B (application CN201911406220.3A)
- Authority: CN (China)
- Legal status: Active (an assumption by Google, not a legal conclusion)
Classifications
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/10—Interfaces, programming languages or software development kits, e.g. for simulating neural networks
- G06N20/00—Machine learning
Abstract
The invention discloses a hierarchical reinforcement learning training method, device and electronic equipment based on imitation learning, comprising the following steps: acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial policy; and retraining based on reinforcement learning from the initial policy to determine a training model. Because both pre-training and retraining use the teaching data, prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Description
Technical Field
The present application relates to the field of machine learning, and more particularly, to a hierarchical reinforcement learning training method, apparatus and electronic device based on imitation learning.
Background
Reinforcement learning is one of the most active research directions in the field of artificial intelligence in recent years and has achieved remarkable results in many fields. An agent learns behavior in an environment by performing actions and observing the rewards or outcomes of those actions. However, an important drawback of reinforcement learning is the curse of dimensionality: as the dimension of the system state increases, the number of parameters to be trained grows exponentially, consuming large amounts of computing and storage resources.
Exploiting the inherently hierarchical and compositional nature of real-world tasks, hierarchical reinforcement learning decomposes a complex problem into several sub-problems and learns over multiple policy layers, each layer responsible for control at a different level of temporal and behavioral abstraction. Through such hierarchical abstraction, the curse of dimensionality can be mitigated effectively.
However, in the multi-step decisions of hierarchical reinforcement learning, the agent may receive rewards only infrequently during training, and this incremental, reward-driven learning approach faces an enormous search space.
Imitation learning, which learns directly from expert data provided by a demonstrator, handles multi-step decision problems well. It is therefore natural to exploit the advantages of imitation learning when training a hierarchical reinforcement learning model.
In previous work, however, teaching data was used to train the parameters of the hierarchical reinforcement learning model only in two separate steps: the teaching data was not applied in the hierarchical reinforcement learning step itself, and the internal connection between the two stages of training was ignored.
Accordingly, there is a need for an improved hierarchical reinforcement learning training method based on imitation learning.
Disclosure of Invention
Aiming at the above defects and shortcomings of the prior art, the invention provides a hierarchical reinforcement learning training method, device and electronic equipment based on imitation learning, which use teaching data for both pre-training and retraining, effectively utilize prior knowledge and strategies, reduce the search space, and improve training efficiency.
According to one aspect of the present invention, there is provided a hierarchical reinforcement learning training method based on imitation learning, including: abstracting and decomposing the reinforcement learning task into a plurality of subtasks; acquiring teaching data from a human expert; pre-training based on imitation learning using the teaching data to determine an initial policy; and retraining based on reinforcement learning from the initial policy to determine a training model.
Further, the abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
Further, the teaching data of the human expert may be acquired indirectly, using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly, by remotely controlling the agent through a device.
Further, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where
{s_1, s_2, …} represents the state sequence;
{g_1, g_2, …} represents the subtask (sub-goal) sequence;
{τ_1, τ_2, …} represents the action sequences under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g.
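As an illustration only (all names here are hypothetical, not from the patent), the teaching-data sequence D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …} could be held in a container like the following, which also shows how the (s, g) pairs and the (s, a) pairs under each sub-goal can be read back out:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical container mirroring D = {s_1, g_1, tau_1, s_2, g_2, tau_2, ...}:
# each segment pairs a starting state s_i and chosen sub-goal g_i with the
# state-action sequence tau_i executed to reach that sub-goal.
@dataclass
class Segment:
    state: Tuple[float, ...]                  # s_i: state where sub-goal g_i was chosen
    subgoal: str                              # g_i: sub-goal label
    tau: List[Tuple[Tuple[float, ...], str]]  # [(s, a), ...] executed under g_i

@dataclass
class TeachingData:
    segments: List[Segment] = field(default_factory=list)

    def subtask_pairs(self):
        """(s, g) pairs: training data for the subtask policy pi_mu: S -> G."""
        return [(seg.state, seg.subgoal) for seg in self.segments]

    def action_pairs(self, g):
        """(s, a) pairs under sub-goal g: training data for pi_g: S -> A."""
        return [(s, a) for seg in self.segments if seg.subgoal == g
                for (s, a) in seg.tau]

# A toy two-segment demonstration
D = TeachingData([
    Segment((0.0,), "open_fridge", [((0.0,), "reach"), ((0.3,), "pull")]),
    Segment((0.6,), "place_item",  [((0.6,), "move"),  ((0.9,), "release")]),
])
```

Under this layout, `D.subtask_pairs()` yields the data for π_μ and `D.action_pairs(g)` the data for each π_g.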
Further, the pre-training process includes: extracting data from the teaching data according to a given initial state; and determining the reinforcement learning initial policy from the extracted data according to an imitation learning method.
Further, the retraining process includes: executing the initial policy from a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool based on a hierarchical reinforcement learning method.
According to another aspect of the present invention, there is provided a hierarchical reinforcement learning training device based on imitation learning, comprising: a teaching data acquisition module, a pre-training module and a retraining module.
The teaching data acquisition module is used for acquiring teaching data required by reinforcement learning training.
The pre-training module is used for pre-training based on imitation learning by using teaching data and determining a reinforcement learning initial strategy.
The retraining module is used for retraining based on reinforcement learning according to an initial strategy and determining a training model.
Further, the acquisition means includes acquisition through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, WiFi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G and 5G).
According to still another aspect of the present invention, there is provided an electronic apparatus including: a processor; and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to yet another aspect of the present invention, there is provided a computer readable medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the hierarchical reinforcement learning training method based on imitation learning as described above.
According to the training method, the teaching data are used for both pre-training and retraining, so prior knowledge and strategies are effectively utilized, the search space is reduced, and training efficiency is improved.
Drawings
Various other advantages and benefits of the present application will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. It is apparent that the drawings described below are only some embodiments of the present application and that other drawings may be obtained from these drawings by those of ordinary skill in the art without inventive effort. Also, like reference numerals are used to designate like parts throughout the figures.
FIG. 1 is a diagram of the general process of reinforcement learning;
FIG. 2 is a flow diagram of a hierarchical reinforcement learning training method based on imitation learning in accordance with one embodiment of the present invention;
FIG. 3 is a flow chart of pre-training based on imitation learning using teaching data in accordance with one embodiment of the present invention;
FIG. 4 is a flow chart of retraining based on reinforcement learning according to an initial policy according to one embodiment of the invention;
FIG. 5 is a block diagram of a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention;
fig. 6 is a block diagram of an electronic device according to one embodiment of the invention.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
Fig. 1 shows the general process of reinforcement learning. In general, a reinforcement learning system includes an agent and an execution environment; the agent continually learns and optimizes its policy through interaction with, and feedback from, the execution environment. Specifically, the agent observes the state of the execution environment and, according to its policy, determines the action to take in the current state. When the action acts on the execution environment, it changes the environment's state and produces a feedback signal to the agent, also known as a reward. From the obtained reward, the agent judges whether its previous behavior was correct and whether the policy needs to be adjusted, and updates the policy accordingly. By repeatedly observing states, determining actions and receiving feedback, the agent continually updates the policy, with the ultimate goal of learning a policy that maximizes the accumulated reward.
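The observe-act-reward-update loop of Fig. 1 can be sketched minimally as follows. This is not the patent's method, only a generic illustration: a toy two-state environment, an epsilon-greedy tabular policy, and a Q-learning update stand in for the real agent and environment, and every name here is hypothetical.

```python
import random

def run_episode(q, env_step, n_steps=50, eps=0.1, alpha=0.5, gamma=0.9):
    """One episode of the Fig. 1 loop: observe, act, receive reward, update."""
    state, total = 0, 0.0
    for _ in range(n_steps):
        # Observe the state and determine the action to take
        if random.random() < eps:
            action = random.choice([0, 1])                     # explore
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])  # exploit
        next_state, reward = env_step(state, action)           # environment feedback
        # Adjust the policy using the obtained reward (Q-learning update)
        target = reward + gamma * max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (target - q[(state, action)])
        state, total = next_state, total + reward
    return total

def toy_env(state, action):
    # Toy 2-state environment: action 1 is rewarded and flips the state
    return (1 - state if action == 1 else state), (1.0 if action == 1 else 0.0)

random.seed(0)
q_table = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
for _ in range(20):
    run_episode(q_table, toy_env)
```

After a few episodes the learned values favor action 1 in both states, i.e. the agent has learned the reward-maximizing policy for this toy environment.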
Exemplary method
FIG. 2 illustrates a flow diagram of a hierarchical reinforcement learning training method based on imitation learning, according to one embodiment of the invention.
As shown in fig. 2, a hierarchical reinforcement learning training method based on imitation learning according to one embodiment of the present invention includes:
S21: acquiring teaching data of a human expert
The teaching data of the human expert may be acquired indirectly, using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly, by remotely controlling the agent through a device.
Indirect acquisition via sensors means first collecting task data from a human through the sensors and then, for a humanoid agent whose structure is similar to the human body, mapping the human data onto the humanoid agent.
For example, a human wearing a data glove performs the task while the glove records the data, which is then mapped to a humanoid gripper to obtain teaching data for machine learning on that gripper.
Direct acquisition by remotely controlling the agent means, for example, using a teach pendant to control the motion of a biomimetic agent body while recording the motion data. Because these data are acquired through the agent's own motion, they can be used directly for machine learning on the biomimetic body.
In addition, the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, expressed as:
D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where
{s_1, s_2, …} represents the state sequence;
{g_1, g_2, …} represents the subtask (sub-goal) sequence;
{τ_1, τ_2, …} represents the action sequences under the given subtasks, in which
τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g.
The goal of hierarchical reinforcement learning training is to obtain a subtask policy π_μ: S → G and, for each specific sub-goal g, a sub-policy π_g: S → A.
S22: pre-training based on imitation learning using the teaching data, determining the reinforcement learning initial policy
FIG. 3 illustrates a flow chart of pre-training based on imitation learning using teaching data, according to one embodiment of the invention.
As shown in fig. 3, pre-training based on imitation learning using teaching data according to one embodiment of the present invention includes:
S31: extracting data from the teaching data according to a given initial state
According to the given initial state, the corresponding data are extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}: the state-action subsequences τ are used to train the policy network π_g, and the state-sub-goal sequence {s_1, g_1, s_2, g_2, …} is used to train the policy network π_μ.
S32: determining the reinforcement learning initial policy from the extracted data according to an imitation learning method
Based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, e.g. behavioral cloning, yielding the corresponding output policy networks π̂_g and π̂_μ, which together serve as the reinforcement learning initial policy.
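A minimal behavioural-cloning sketch of step S32 could look as follows. It is only an illustration under simplifying assumptions (tabular policies "trained" by majority vote over the demonstrated labels, rather than neural policy networks; all names hypothetical): π_μ is cloned from (s, g) pairs and each π_g from the (s, a) pairs under sub-goal g.

```python
from collections import Counter, defaultdict

def clone_policy(pairs):
    """Tabular behavioural cloning: map each state to its most-demonstrated label."""
    votes = defaultdict(Counter)
    for state, label in pairs:
        votes[state][label] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}

# Extracted from the teaching data (toy values):
# (s, g) pairs train pi_mu; (s, a) pairs under each sub-goal g train pi_g.
mu_pairs = [("at_counter", "pick_up"), ("at_fridge", "open_door"),
            ("at_counter", "pick_up")]
g_pairs = {"pick_up": [("at_counter", "reach"), ("holding", "lift")]}

pi_mu_hat = clone_policy(mu_pairs)                     # initial subtask policy
pi_g_hat = {g: clone_policy(p) for g, p in g_pairs.items()}  # initial sub-policies
```

The pair (pi_mu_hat, pi_g_hat) then plays the role of the initial policy handed to the retraining stage.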
S23: retraining based on reinforcement learning according to the initial policy, determining a training model
FIG. 4 illustrates a flow chart of retraining based on reinforcement learning according to the initial policy, according to one embodiment of the invention.
As shown in fig. 4, retraining based on reinforcement learning according to one embodiment of the invention includes:
S41: executing the initial policy according to a given initial state to acquire training data
According to the given initial state, the initial policies π̂_μ and π̂_g are executed to acquire training data (s, g, τ).
S42: determining an experience data pool based on the teaching data and the training data
The union of the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …} and the training data (s, g, τ) is taken as the experience data pool.
S43: determining the final training model based on a hierarchical reinforcement learning method using the experience data pool
Model training is performed on the experience-pool data based on a hierarchical reinforcement learning method, such as the MAXQ method, until the termination goal is met; the policy networks π_μ and π_g at that point constitute the final trained model.
Obviously, performing hierarchical reinforcement learning on an experience data pool composed of the initial policy's rollouts, the teaching data and the training data effectively utilizes the expert's prior knowledge and strategies, reduces the search space, and improves training efficiency.
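The S41-S43 flow above can be sketched as follows. This is a hypothetical illustration, not the patent's implementation: a trivial rollout of the initial policy stands in for S41, a list union for the experience pool of S42, and a counting stub stands in for a real MAXQ-style learner in S43.

```python
def rollout(pi_mu, pi_g, start_state, horizon=3):
    """S41: execute the initial policy from a given start state, logging (s, g, a)."""
    data, state = [], start_state
    for _ in range(horizon):
        g = pi_mu.get(state)
        if g is None:          # no sub-goal defined for this state: stop
            break
        a = pi_g[g].get(state, "noop")
        data.append((state, g, a))
        state = f"{state}'"    # toy deterministic transition: tick the state name
    return data

def retrain(experience_pool):
    """S43 stub for hierarchical RL (e.g. MAXQ): here, just count visits per sub-goal."""
    counts = {}
    for _, g, _ in experience_pool:
        counts[g] = counts.get(g, 0) + 1
    return counts

teaching = [("s0", "open", "reach"), ("s1", "open", "pull")]  # expert data
pi_mu = {"s0": "open"}                 # initial subtask policy from pre-training
pi_g = {"open": {"s0": "reach"}}       # initial sub-policy from pre-training

training = rollout(pi_mu, pi_g, "s0")  # S41: data from executing the initial policy
pool = teaching + training             # S42: experience pool = union of both sources
model = retrain(pool)                  # S43: train on the pooled experience
```

The point of the sketch is the data flow: the learner in S43 sees both the expert's teaching data and the initial policy's own rollouts.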
Further, the hierarchical reinforcement learning training method based on imitation learning also includes: abstracting and decomposing the reinforcement learning task into a plurality of subtasks.
Following the hierarchical reinforcement learning idea, the reinforcement learning task is abstracted and decomposed into a plurality of subtasks for subsequent reinforcement learning training.
The abstraction methods include: state space decomposition, temporal abstraction, and spatial abstraction.
State space decomposition divides the state space into several different subspaces, each of which is trained separately, so that each training run takes place in a small-scale space.
Temporal abstraction treats a task as a set of extended actions and expands single-step training to multiple steps, reducing the number of decisions made at any single moment and thus the training burden.
For example, the task of placing fruit in a refrigerator is broken down into the subtasks "pick up the fruit", "open the refrigerator", "put the fruit in the refrigerator" and "close the refrigerator", and specific operations are performed under each subtask, such as selecting the type and number of fruits in the pick-up subtask, or deciding, in the placing subtask, whether the fruit needs refrigeration or freshness preservation and which compartment of the refrigerator to use.
Spatial abstraction is realized by ignoring the many state dimensions that are irrelevant to a given subtask.
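The fruit-in-fridge example can illustrate two of these abstractions concretely. The sketch below (all variable names are hypothetical) encodes the temporal decomposition as an ordered list of subtasks and the spatial abstraction as a projection of the full state onto the variables each subtask actually needs:

```python
# Temporal abstraction: the task as an ordered set of extended subtasks
SUBTASKS = ["pick_up_fruit", "open_fridge", "place_in_fridge", "close_fridge"]

# Spatial abstraction: the state variables each subtask actually observes
RELEVANT = {
    "pick_up_fruit":   {"fruit_type", "fruit_count", "hand_pose"},
    "open_fridge":     {"door_angle", "hand_pose"},
    "place_in_fridge": {"compartment", "needs_freezing", "hand_pose"},
    "close_fridge":    {"door_angle", "hand_pose"},
}

def abstract_state(full_state, subtask):
    """Project the full state onto the subtask's relevant variables."""
    return {k: v for k, v in full_state.items() if k in RELEVANT[subtask]}

full = {"fruit_type": "apple", "fruit_count": 3, "hand_pose": (0, 0),
        "door_angle": 0.0, "compartment": "crisper", "needs_freezing": False,
        "room_light": True}   # room_light is irrelevant to every subtask

state_open = abstract_state(full, "open_fridge")
```

Each sub-policy π_g then only ever sees its projected state, which is exactly how spatial abstraction shrinks the effective state space per subtask.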
Exemplary apparatus
FIG. 5 illustrates a hierarchical reinforcement learning training device based on imitation learning in accordance with one embodiment of the present invention.
As shown in fig. 5, a hierarchical reinforcement learning training device 500 based on imitation learning according to an embodiment of the present invention includes: a teaching data acquisition module 510, a pre-training module 520 and a retraining module 530.
The teaching data acquisition module 510 is configured to acquire the teaching data required for reinforcement learning training.
Specifically, the acquisition mode includes acquisition through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, WiFi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G and 5G).
For example, the teaching data may be acquired from the sensors through a Bluetooth module.
The pre-training module 520 is configured to perform pre-training based on imitation learning using the teaching data and determine the reinforcement learning initial policy.
The retraining module 530 is configured to retrain based on reinforcement learning according to the initial policy and determine the training model.
Exemplary electronic device
Next, a block diagram of an electronic device according to an embodiment of the present application is described with reference to fig. 6.
As shown in fig. 6, the electronic device 60 includes one or more processors 61 and memory 62.
The processor 61 may be a Central Processing Unit (CPU) or another form of processing unit having data processing and/or instruction execution capabilities, and may control other components in the electronic device 60 to perform desired functions.
The memory 62 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 61 to implement the hierarchical reinforcement learning training method based on imitation learning of the embodiments of the present application described above and/or other desired functions.
In one example, the electronic device 60 may further include: an input device 63 and an output device 64, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown).
For example, the input device 63 may include, for example, a keyboard, a mouse, or the like, which may be used to input the teaching data.
The output device 64 may output various information to the outside and may include, for example, a display, speakers, a printer, or a communication network and the remote output devices connected to it, and may be used to output the reinforcement learning training results.
Of course, only some of the components of the electronic device 60 that are relevant to the present application are shown in fig. 6 for simplicity, components such as buses, input/output interfaces, etc. are omitted. In addition, the electronic device 60 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer readable storage Medium
In addition to the methods, apparatus and systems described above, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning according to embodiments of the present application described in the "Exemplary method" section of this specification.
The computer program product may write program code for performing the operations of embodiments of the present application in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps of the hierarchical reinforcement learning training method based on imitation learning in the embodiments of the present application.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, devices, systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, the devices, apparatuses, devices, systems may be connected, arranged, configured in any manner.
Words such as "including", "comprising" and "having" are open-ended, mean "including but not limited to", and may be used interchangeably. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or" unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the methods, apparatus and devices of the present application, the components or steps may be disassembled and/or assembled. Such decomposition and/or recombination should be considered as equivalent to the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.
Claims (8)
1. A hierarchical reinforcement learning training method based on imitation learning, comprising:
acquiring teaching data of a human expert, the teaching data being a decision data sequence suitable for hierarchical reinforcement learning, expressed as D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, where {s_1, s_2, …} represents the state sequence, {g_1, g_2, …} represents the subtask sequence, and {τ_1, τ_2, …} represents the action sequences under the given subtasks, in which τ = {s_1, a_1, s_2, a_2, …} represents the state-action sequence that must be executed to reach the corresponding subtask goal g;
pre-training based on imitation learning using the teaching data, determining an initial policy, wherein the pre-training process comprises: extracting data from the teaching data according to a given initial state, and determining the reinforcement learning initial policy from the extracted data according to an imitation learning method; specifically: according to the given initial state, the corresponding data are extracted from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein the state-action subsequences τ are used to train the policy network π_g and the state-sub-goal sequence {s_1, g_1, s_2, g_2, …} is used to train the policy network π_μ; based on the extracted data, the policy networks π_g and π_μ are trained according to an imitation learning method, yielding the corresponding output policy networks π̂_g and π̂_μ, which are taken as the reinforcement learning initial policy; and
retraining based on reinforcement learning from the initial policy, determining a training model, wherein the retraining process comprises: executing the initial policy from a given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool based on a hierarchical reinforcement learning method.
2. The training method of claim 1, further comprising abstracting and decomposing the reinforcement learning task into a plurality of subtasks.
3. The training method of claim 2, wherein the abstraction methods comprise: state space decomposition, temporal abstraction, and spatial abstraction.
4. The training method of claim 1, wherein the teaching data of the human expert is acquired indirectly using one or more of optical, visual, inertial-navigation and data-glove sensors, or directly by remotely controlling the agent through a device.
5. Hierarchical reinforcement learning training device based on imitative learning, characterized by comprising:
the teaching data acquisition module is used for acquiring teaching data required by reinforcement learning training; the teaching data is a decision data sequence suitable for hierarchical reinforcement learning, and is expressed as: d= { s 1 ,g 1 ,τ 1 ,s 2 ,g 2 ,τ 2 ,…},{s 1 ,s 2 … } represents a sequence of states; { g 1 ,g 2 … } represents a subtask policy sequence; { tau 1 ,τ 2 … } represents the sequence of actions under a given subtask, τ= { s 1 ,a 1 ,s 2 ,a 2 … represents a series of state base action sequences that need to be completed in order to reach the corresponding subtask target g;
the pre-training module is used for pre-training based on imitation learning using the teaching data and determining a reinforcement learning initial strategy; the pre-training process comprises the following steps: extracting data from the teaching data according to a given initial state; determining a reinforcement learning initial strategy according to an imitation learning method based on the extracted data; specifically: according to the given initial state, extracting the corresponding data from the teaching data D = {s_1, g_1, τ_1, s_2, g_2, τ_2, …}, wherein the subsequence of (s_i, g_i) pairs is used for training the policy network π_g, and the sequence of state-action pairs within each τ_i, paired with its subtask g_i, is used for training the policy network π_μ; training the policy networks π_g and π_μ on the extracted data according to the imitation learning method, outputting the corresponding trained policy networks, and taking their pair as the reinforcement learning initial strategy;
the retraining module is used for retraining based on reinforcement learning according to the initial strategy and determining a training model; the retraining process comprises the following steps: executing the initial strategy from the given initial state to acquire training data; taking the union of the teaching data and the training data as an experience data pool; and determining the final training model from the experience data pool using a hierarchical reinforcement learning method.
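The decision data sequence handled by the modules above can be sketched as follows; the flat-list encoding of D = {s_1, g_1, τ_1, …} and the function name are assumptions for illustration. The sketch groups D into (s, g, τ) triples and extracts the two training subsequences: (s, g) pairs for the high-level network π_g and (s_t, g, a_t) tuples for the low-level network π_μ.

```python
def split_teaching_data(D):
    """Group the flat sequence D = [s1, g1, tau1, s2, g2, tau2, ...] into
    (s, g, tau) triples, then extract (s, g) pairs for the high-level
    policy and (s_t, g, a_t) tuples for the low-level policy."""
    triples = [tuple(D[i:i + 3]) for i in range(0, len(D), 3)]
    high = [(s, g) for s, g, tau in triples]
    low = [(s_t, g, a_t) for s, g, tau in triples for s_t, a_t in tau]
    return high, low
```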
6. The training device of claim 5, wherein the teaching data acquisition module acquires the teaching data through a USB interface, an Ethernet interface, a serial port, a parallel port, a magnetic disk drive, an optical disk drive, or a tape drive, or through a wireless transmission mode (including but not limited to Bluetooth, infrared, Wi-Fi, ZigBee, or mobile data such as GSM, CDMA, GPRS, 3G, 4G, or 5G).
7. An electronic device, comprising: a processor;
and a memory having stored therein computer program instructions that, when executed by the processor, cause the processor to perform the training method of any of claims 1-4.
8. A computer readable storage medium storing a computer program for performing the training method of any one of claims 1-4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406220.3A CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911406220.3A CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111144580A CN111144580A (en) | 2020-05-12 |
CN111144580B true CN111144580B (en) | 2024-04-12 |
Family
ID=70522686
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911406220.3A Active CN111144580B (en) | 2019-12-31 | 2019-12-31 | Hierarchical reinforcement learning training method and device based on imitation learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111144580B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112034888B (en) * | 2020-09-10 | 2021-07-30 | 南京大学 | Autonomous control cooperation strategy training method for fixed wing unmanned aerial vehicle |
CN112162564B (en) * | 2020-09-25 | 2021-09-28 | 南京大学 | Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm |
CN112264995B (en) * | 2020-10-16 | 2021-11-16 | 清华大学 | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
CN113408621B (en) * | 2021-06-21 | 2022-10-14 | 中国科学院自动化研究所 | Rapid simulation learning method, system and equipment for robot skill learning |
CN113837396B (en) * | 2021-09-26 | 2023-08-04 | 中国联合网络通信集团有限公司 | B-M2M-based device simulation learning method, MEC and storage medium |
CN114609925B (en) * | 2022-01-14 | 2022-12-06 | 中国科学院自动化研究所 | Training method of underwater exploration strategy model and underwater exploration method of bionic machine fish |
CN114386524A (en) * | 2022-01-17 | 2022-04-22 | 深圳市城图科技有限公司 | Power equipment identification method for dynamic self-adaptive graph layering simulation learning |
CN115204387B (en) * | 2022-07-21 | 2023-10-03 | 法奥意威(苏州)机器人系统有限公司 | Learning method and device under layered target condition and electronic equipment |
CN116079737A (en) * | 2023-02-23 | 2023-05-09 | 南京邮电大学 | Mechanical arm complex operation skill learning method and system based on layered reinforcement learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | 清华大学 | Deeply learning method and device based on ambient condition prediction |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11562287B2 (en) * | 2017-10-27 | 2023-01-24 | Salesforce.Com, Inc. | Hierarchical and interpretable skill acquisition in multi-task reinforcement learning |
- 2019-12-31: CN application CN201911406220.3A filed (patent CN111144580B), status: Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108288094A (en) * | 2018-01-31 | 2018-07-17 | 清华大学 | Deeply learning method and device based on ambient condition prediction |
Non-Patent Citations (2)
Title |
---|
Dai Zhaohui; Yuan Jiaohong; Wu Min; Chen Xin. Dynamic hierarchical reinforcement learning based on probabilistic models. Control Theory & Applications. 2011, (11), full text. *
Sui Hongjian; Shang Weiwei; Li Xiang; Cong Shuang. Robot control policy transfer based on progressive neural networks. Journal of University of Science and Technology of China. 2019, (10), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN111144580A (en) | 2020-05-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144580B (en) | Hierarchical reinforcement learning training method and device based on imitation learning | |
US11584008B1 (en) | Simulation-real world feedback loop for learning robotic control policies | |
US10766136B1 (en) | Artificial intelligence system for modeling and evaluating robotic success at task performance | |
Yang et al. | Hierarchical deep reinforcement learning for continuous action control | |
Amarjyoti | Deep reinforcement learning for robotic manipulation-the state of the art | |
US10766137B1 (en) | Artificial intelligence system for modeling and evaluating robotic success at task performance | |
Heess et al. | Actor-critic reinforcement learning with energy-based policies | |
US11714996B2 (en) | Learning motor primitives and training a machine learning system using a linear-feedback-stabilized policy | |
Jeerige et al. | Comparison of deep reinforcement learning approaches for intelligent game playing | |
CN113826051A (en) | Generating digital twins of interactions between solid system parts | |
JP2010179454A5 (en) | ||
EP2363251A1 (en) | Robot with Behavioral Sequences on the basis of learned Petri Net Representations | |
JP2010179454A (en) | Learning and use of schemata in robotic device | |
JP2021501433A (en) | Generation of control system for target system | |
KR20210033809A (en) | Control server and method for controlling robot using artificial neural network, and the robot implementing the same | |
CN114730407A (en) | Modeling human behavior in a work environment using neural networks | |
EP4088228A1 (en) | Autonomous control system and method using embodied homeostatic feedback in an operating environment | |
Hafez et al. | Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination | |
Hafez et al. | Efficient intrinsically motivated robotic grasping with learning-adaptive imagination in latent space | |
Bellas et al. | A cognitive developmental robotics architecture for lifelong learning by evolution in real robots | |
Goertzel et al. | Cognitive synergy between procedural and declarative learning in the control of animated and robotic agents using the opencogprime agi architecture | |
CN114529010A (en) | Robot autonomous learning method, device, equipment and storage medium | |
Floyd et al. | Building learning by observation agents using jloaf | |
Contardo et al. | Learning states representations in pomdp | |
CN112230618A (en) | Method for automatically synthesizing multi-robot distributed controller from global task |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |