CN101739601B - Frame and method for developing reinforcement learning system - Google Patents

Frame and method for developing reinforcement learning system

Info

Publication number
CN101739601B
Authority
CN
China
Prior art keywords
state
environment
action
interface
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200810051406
Other languages
Chinese (zh)
Other versions
CN101739601A (en)
Inventor
孟祥萍
谭万禹
皮玉珍
苑全德
纪秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jilin Electric Power Corp
Changchun Institute Technology
Original Assignee
Changchun Institute Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun Institute Technology filed Critical Changchun Institute Technology
Priority to CN 200810051406 priority Critical patent/CN101739601B/en
Publication of CN101739601A publication Critical patent/CN101739601A/en
Application granted granted Critical
Publication of CN101739601B publication Critical patent/CN101739601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a framework and method for developing a reinforcement learning system. The framework is composed of a learner interface that interacts with the external environment, a state interface that represents the environment state, an action interface through which the system executes actions via an execution component, a basic test environment, and other parts; the framework is then used to develop the reinforcement learning system. The learner interface acquires the environment state through the state interface, updates the internal state by learning, makes a decision, and calls the action interface to act on the environment. The inventors also provide an implementation of a new multi-robot reinforcement learning algorithm based on quantum theory as an example demonstration. Developers can complete the development of learning modules for robots or other intelligent devices simply by implementing the corresponding interfaces according to a few fixed steps. The framework of the invention is highly portable, can run on many platforms, can be combined with other robot system frameworks, greatly reduces the complexity of writing learning algorithms, and the method is simple.

Description

A framework and method for developing a reinforcement learning system
Technical field
The present invention relates to a framework and method for developing a reinforcement learning system.
Background art
Reinforcement learning is a distinctive, environment-adapting machine learning method that takes environmental feedback as its input. Since the late 1980s, with remarkable progress in its mathematical foundations and in artificial intelligence research, the study and application of reinforcement learning have advanced day by day, and it has become one of the hot topics in the field of machine learning.
Reinforcement learning techniques perceive the environment state and obtain uncertain reward values from the environment; merely through trial and error, they can learn the optimal behavior strategy of a dynamic system, which has attracted many researchers. Up to the present, however, reinforcement learning remains immature in many fields and needs further study.
A reinforcement learning system can be applied in numerous fields and is especially suited to developing learning and adaptation modules for robots and other adaptive intelligent devices. Through the learning system, a robot can execute tasks in an unknown, dynamic environment without a complete model of the environment having to be built (for other learning systems this is a very troublesome matter).
Traditionally, every system developed on the basis of reinforcement learning has had to start from scratch: no general framework has been available for reuse, which causes a large amount of duplicated labor, and because there is no standard to comply with, the resulting structure may be complex and chaotic.
Summary of the invention
The technical problem to be solved by the invention is to provide a framework for developing reinforcement learning systems that is highly portable, can run on numerous platforms, and can be combined with other agent system frameworks. The framework greatly reduces the complexity of writing learning algorithms: the heavy program design work originally required for reinforcement learning research is simplified, the repetitive parts of the design are completed by the framework, and the overall idea of learning is embodied within the framework.
The present invention also provides a method for developing a reinforcement learning system.
To solve the above technical problem, the invention provides a framework for developing a reinforcement learning system, characterized by comprising:
a learner interface that interacts with the external environment and is the module the reinforcement learning system uses to organize the other interfaces in order to learn and make decisions;
a state interface that represents the environment state, this interface providing a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained;
an action interface through which the system executes actions via an execution component, providing a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent.
The learner interface that interacts with the external environment comprises six overloadable methods: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner implements the Q-learning algorithm by default. The initialize-learning method sets the learning factor and the discount factor, returning true on success and false otherwise. The observe-environment method, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in a state interface object, and returns it. The obtain-reward method computes the reward from the current state and the Q-value table and returns it. The learn-and-update method updates the Q-value table from the reward and the obtained current state, returning true on success. The obtain-best-action method takes the current state identifier, obtains the optimal action, and carries it out through the execute-action method.
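By way of illustration, the six methods just listed could be declared in Java as follows. This is a minimal sketch, not the original disclosure's code; every identifier and signature is an assumption, and State and Action refer to the interfaces sketched after the method steps below.

public interface Learner {
    /** Initialize the learning factor and discount factor; true on success. */
    boolean init(double learningFactor, double discountFactor);

    /** Observe the test environment and return the agent's combined state,
        wrapped in a state interface object. */
    State observe();

    /** Compute the reward for the given state from the Q-value table. */
    double getReward(State current);

    /** Update the Q-value table from the reward and current state; true on success. */
    boolean learn(State current, double reward);

    /** Look up the optimal action for the given state. */
    Action bestAction(State current);

    /** Carry out the chosen action against the environment. */
    void execute(Action action);
}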
A method for developing a reinforcement learning system, characterized by comprising the following steps:
organizing the other interfaces through the learner interface that interacts with the external environment, in order to learn and make decisions;
using the mapping method provided by the state interface that represents the environment state to map states in the environment to internal system states, from which the optimal action is obtained (see the interface sketch after these steps);
providing, through the action interface by which the execution component executes actions, a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
setting the initial positions of the target, the obstacles, and the learning agent through the basic test environment.
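Illustrative Java declarations of the state and action interfaces described in these steps; the method names and the int[]-based internal representation are assumptions, not part of the original disclosure.

public interface State {
    /** Map the raw environment state to the system's internal representation
        (for example, an array or matrix) from which the optimal action is chosen. */
    int[] map();
}

public interface Action {
    /** Identifier of the currently selected action. */
    int current();

    /** Carry out the current action through the execution component. */
    void perform();
}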
The invention also provides, as an example demonstration, an implementation of a reinforcement learning algorithm based on quantum theory. The learning algorithm is described as follows:
1. Initialization:
(1) Set the learning parameters $\alpha$, $\delta_w$, $\delta_l$ and the discount factor $\beta$, and let $t = 0$;
(2) Initialize the states and actions, assigning every basis state and every basis action the same superposition amplitude;
(3) For all states $|s^{(m)}\rangle$ and actions $|a_s^{(i)}\rangle$, let $Q_0^i(s^{(0)}, a_s^{(1)}, a_s^{(2)}, \cdots, a_s^{(n)}) = 0$, $\pi_0^i(s^{(0)}, a_s^{(i)}) = \frac{1}{4}$, $\bar\pi_0^i(s^{(0)}, a_s^{(i)}) = \frac{1}{4}$, and $C^i(s) = 0$;
2. Within each cycle, repeat the following steps until $t = 200$:
(1) For every state, observe the action set and obtain an action $|a\rangle$;
(2) Execute the action $|a\rangle$ and observe the robot's reward $r$ and the new state $|s_t^{(m)}\rangle$;
(3) Update the policy with $\pi_{t+1}^i(s,a) = \pi_t^i(s,a) + \Delta\delta_{sa}$, where
$$\Delta\delta_{sa} = \begin{cases} \min\bigl(\pi_t^i(s,a),\ \delta/(|A_i|-1)\bigr) & \text{if } |a\rangle = U_{\mathrm{Grov}}^{L}\,|a_s^{(n)}\rangle \\ -\min\bigl(\pi_t^i(s,a),\ \delta/(|A_i|-1)\bigr) & \text{otherwise;} \end{cases}$$
(4) Update the average strategy: $C^i(s) = C^i(s) + 1$ and $\bar\pi^i(s^{(i)}, a_s^{(i)}) = \bar\pi^i(s^{(i)}, a_s^{(i)}) + \dfrac{\pi^i(s^{(i)}, a_s^{(i)}) - \bar\pi^i(s^{(i)}, a_s^{(i)})}{C^i(s)}$;
(5) Explore the next action, repeating the Grover iteration $L$ times to update the probability amplitudes.
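For illustration only, update steps (3) and (4) might be coded as follows. This is a sketch under stated assumptions: states and actions are flat indices, and the boolean groverSelected stands in for the condition $|a\rangle = U_{\mathrm{Grov}}^{L}\,|a_s^{(n)}\rangle$; none of these names come from the original disclosure.

class PolicyUpdater {
    /** One policy-update step for agent i, following steps (3) and (4) above. */
    void updatePolicy(int s, int a, boolean groverSelected,
                      double[][] pi, double[][] piBar, int[] counts,
                      double delta, int actionCount) {
        // Step (3): shift probability toward the Grover-amplified action,
        // or away from it, by at most delta / (|A_i| - 1).
        double step = Math.min(pi[s][a], delta / (actionCount - 1));
        pi[s][a] += groverSelected ? step : -step;

        // Step (4): update the running average strategy.
        counts[s] += 1;
        piBar[s][a] += (pi[s][a] - piBar[s][a]) / counts[s];
    }
}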
The flow of developing a reinforcement learning system with this framework is described in detail as follows:
Step 1: import the toolkit;
Step 2: implement the interfaces and write the strategy;
Step 3: if the default environment is used, the test can be run directly and the operation checked; if a custom environment is used, the work of writing the mapping rules from the environment to the robot's internal state must be done first, and then the test is run;
Step 4: if the result is normal, finish; otherwise return to step 2 (a hypothetical end-to-end example follows).
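The four steps might look as follows in code; QLearner (standing in for the framework's assumed default Q-learning learner), the parameter values, and the episode count are illustrative assumptions.

public class Demo {
    public static void main(String[] args) {
        // Step 2: choose or implement a learner strategy (here an assumed default).
        Learner learner = new QLearner();
        learner.init(0.1, 0.9);   // learning factor and discount factor

        // Step 3: run the test in the default grid-world environment.
        for (int t = 0; t < 200; t++) {
            State s = learner.observe();            // observe the environment
            Action a = learner.bestAction(s);       // decide on the optimal action
            learner.execute(a);                     // act on the environment
            learner.learn(s, learner.getReward(s)); // update the Q-value table
        }
        // Step 4: check the operation; if abnormal, return to step 2 and revise.
    }
}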
The framework of the invention is highly portable, can run on numerous platforms, can be combined with other robot system frameworks, greatly reduces the complexity of writing learning algorithms, and the method is simple.
Description of the drawings
Fig. 1 shows the reinforcement learning model;
Fig. 2 is the main class diagram of the framework of the invention;
Fig. 3 is the development flow chart based on the framework of the invention.
Embodiment
With reference to Fig. 1, the reinforcement learning system observes the environment to obtain the state and the reward, acts on the environment after learning and adjustment, then observes the environment again to determine the next action; the cycle repeats until the desired environment state is finally reached.
With reference to Fig. 2, the invention comprises a learner interface that interacts with the external environment and is the module with which the reinforcement learning system organizes the other interfaces to learn and make decisions. It comprises six overloadable methods: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner implements the Q-learning algorithm by default. The initialize-learning method sets the learning factor and the discount factor, returning true on success and false otherwise; the observe-environment method, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in a state interface object, and returns it; the obtain-reward method computes the reward from the current state and the Q-value table and returns it; the learn-and-update method updates the Q-value table from the reward and the obtained current state, returning true on success; the obtain-best-action method takes the current state identifier, obtains the optimal action, and carries it out through the execute-action method. The invention further comprises a state interface that represents the environment state and provides a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained; an action interface through which actions are executed by an execution component, providing a get-action method and an execute-action method used respectively to obtain the current action and to carry it out; and a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent. The Q-value table and the Q value are classes designed to implement the Q-learning algorithm: the Q-value table provides a method for updating Q values and a method for selecting the maximum Q value under the current state, while the Q-value class represents a concrete Q value, and the update method in the Q-value table delegates to the method in the Q-value class (a sketch of this pair follows).
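A minimal Java sketch of the Q-value table and Q-value classes just described; the field names, the key encoding, and the standard Q-learning backup formula used inside QValue.update are assumptions, since the disclosure describes the classes only functionally.

import java.util.HashMap;
import java.util.Map;

class QValue {
    double value;

    /** Standard Q-learning backup (assumed): Q += alpha * (r + beta * maxNext - Q). */
    void update(double reward, double maxNext, double alpha, double beta) {
        value += alpha * (reward + beta * maxNext - value);
    }
}

class QTable {
    private final Map<Long, QValue> table = new HashMap<>();

    private long key(int state, int action) {
        return ((long) state << 32) | (action & 0xffffffffL);
    }

    QValue get(int state, int action) {
        return table.computeIfAbsent(key(state, action), k -> new QValue());
    }

    /** Select the action with the maximum Q value under the given state. */
    int bestAction(int state, int actionCount) {
        int best = 0;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < actionCount; a++) {
            double q = get(state, a).value;
            if (q > bestQ) { bestQ = q; best = a; }
        }
        return best;
    }

    /** Updating a Q value delegates to QValue.update, as the embodiment describes. */
    void update(int state, int action, double reward, double maxNext,
                double alpha, double beta) {
        get(state, action).update(reward, maxNext, alpha, beta);
    }
}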
With reference to Fig. 3, the flow of developing one's own reinforcement learning system with this framework is described in detail as follows:
Step 1: import the toolkit. This step is mandatory when developing with the framework; to use the interfaces that the framework provides, this package must be imported.
Step 2: implement the interfaces and write the strategy. The learner interface class is the reference standard provided for designing robot learning modules, and a module that implements this standard can use the framework conveniently. The state interface is used to represent states; considering that different robots operating in different environments may differ in how states are expressed and in the number of states, abstraction reduces the interface to a single method, the mapping method, which maps the environment state to the corresponding robot's internal representation, such as an array or a matrix (an example follows).
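One possible implementation of that single mapping method for the grid-world case; the GridState class and its flat-index encoding are hypothetical.

class GridState implements State {
    private final int x, y, width;

    GridState(int x, int y, int width) {
        this.x = x; this.y = y; this.width = width;
    }

    /** Map the (x, y) grid cell to the learner's internal representation:
        here, a single flat index per cell. */
    @Override
    public int[] map() {
        return new int[] { y * width + x };
    }
}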
Step 3: if the default environment is used, the test can be run directly and the operation checked; if a custom environment is used, the work of writing the mapping rules from the environment to the robot's internal state must be completed first, and then the test is run.
Step 4: if the result is normal, finish; otherwise return to step 2.

Claims (3)

1. A framework for developing a reinforcement learning system, characterized by comprising:
a learner interface component that interacts with the external environment and is the module with which the reinforcement learning system organizes the other interface components to learn and make decisions;
a state interface component that represents the environment state, this interface component providing a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained;
an action interface component through which actions are executed by an execution component, providing a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent.
2. The framework for developing a reinforcement learning system according to claim 1, characterized in that the learner interface component that interacts with the external environment comprises six overloadable functional modules: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner interface component implements the Q-learning algorithm by default, wherein the initialize-learning module sets the learning factor and the discount factor, returning true on success and false otherwise; the observe-environment module, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in the state interface component, and returns it; the obtain-reward module computes the reward from the current state and the Q-value table and returns it; the learn-and-update module updates the Q-value table from the reward and the obtained current state, returning true on success; and the obtain-best-action module takes the current state identifier, obtains the optimal action, and carries it out through the execute-action module.
3. A method for developing a reinforcement learning system, characterized by comprising the following steps:
organizing the other interface components through the learner interface component that interacts with the external environment, in order to learn and make decisions;
the learner interface component acquiring the environment state through the state interface component, updating the state-value table by learning and making a decision, and calling the action interface component to act on the environment, the state interface component providing a mapping functional module used to map states in the environment to internal system states, which serve as the important reference for obtaining the optimal action;
providing, through the action interface component by which the execution component executes actions, a get-action functional module and an execute-action functional module, used respectively to obtain the current action and to carry it out;
setting the initial positions of the target, the obstacles, and the learning agent through the basic test environment;
importing the toolkit, which is a mandatory step when developing with the framework system, since using the interface components the framework system provides requires importing this package;
implementing the interfaces and writing the strategy, wherein the learner interface component provides a reference standard for designing robot learning modules, and a module that implements this standard can use the framework system conveniently; the state interface component is used to represent states, and, considering that different robots operating in different environments differ in how states are expressed and in the number of states, abstraction reduces it to a single mapping module that maps the environment state to the corresponding robot's internal representation;
if the default environment is used, running the test directly and checking the operation; if a custom environment is used, completing the work of writing the mapping rules from the environment to the robot's internal state, then running the test;
if the result is normal, finishing; otherwise returning to the step of implementing the interfaces and writing the strategy.
CN 200810051406 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system Active CN101739601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810051406 CN101739601B (en) 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system


Publications (2)

Publication Number Publication Date
CN101739601A CN101739601A (en) 2010-06-16
CN101739601B true CN101739601B (en) 2013-03-06

Family

ID=42463060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810051406 Active CN101739601B (en) 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system

Country Status (1)

Country Link
CN (1) CN101739601B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104640168B (en) * 2014-12-04 2018-10-09 北京理工大学 Vehicular ad hoc network method for routing based on Q study
JP2021018644A (en) * 2019-07-22 2021-02-15 コニカミノルタ株式会社 Mechanical learning device, mechanical learning method and mechanical learning program
CN113255347B (en) * 2020-02-10 2022-11-15 阿里巴巴集团控股有限公司 Method and equipment for realizing data fusion and method for realizing identification of unmanned equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256482A (en) * 2007-12-19 2008-09-03 深圳市同洲电子股份有限公司 Development system and method for built-in application program
CN101276279A (en) * 2008-05-21 2008-10-01 天柏宽带网络科技(北京)有限公司 Unified development system and method
CN101620535A (en) * 2009-07-29 2010-01-06 北京航空航天大学 General frame design method of airborne computer software

Also Published As

Publication number Publication date
CN101739601A (en) 2010-06-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right
Owner name: STATE GRID CORPORATION OF CHINA CHANGCHUN POWER SU
Effective date: 20140630

C41 Transfer of patent application or patent right or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information
Inventor after: Meng Xiangping, Tan Wanyu, Pi Yuzhen, Yuan Quande, Ji Xiu, Zhang Xilin
Inventor before: Meng Xiangping, Tan Wanyu, Pi Yuzhen, Yuan Quande, Ji Xiu
COR Change of bibliographic data
Free format text: CORRECT: INVENTOR; FROM: MENG XIANGPING TAN WANYU PI YUZHEN YUAN QUANDE JI XIU TO: MENG XIANGPING TAN WANYU PI YUZHEN YUAN QUANDE JI XIU ZHANG XILIN

TR01 Transfer of patent right
Effective date of registration: 20140630
Address after: 130012 Jilin province Changchun wide flat road No. 395
Patentee after: Changchun Engineering College; State Grid Corporation of China; Changchun Power Supply Company, State Grid Jilin Electric Power Co., Ltd.
Address before: 130012 Jilin province Changchun wide flat road No. 395
Patentee before: Changchun Engineering College