CN101739601B - Frame and method for developing reinforcement learning system - Google Patents

Frame and method for developing reinforcement learning system

Info

Publication number
CN101739601B
Authority
CN
China
Prior art keywords
state
environment
action
interface
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN 200810051406
Other languages
Chinese (zh)
Other versions
CN101739601A (en)
Inventor
孟祥萍
谭万禹
皮玉珍
苑全德
纪秀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Jilin Electric Power Corp
Changchun Institute Technology
Original Assignee
Changchun Institute Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changchun Institute Technology filed Critical Changchun Institute Technology
Priority to CN 200810051406 priority Critical patent/CN101739601B/en
Publication of CN101739601A publication Critical patent/CN101739601A/en
Application granted granted Critical
Publication of CN101739601B publication Critical patent/CN101739601B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a framework and method for developing a reinforcement learning system. The framework is composed of a learner interface that interacts with the external environment, a state interface that represents the environment state, an action interface through which the system executes actions via an execution component, a basic test environment, and other parts; the framework is then used to develop the reinforcement learning system. The learner interface acquires the environment state through the state interface, updates the internal state by learning, makes a decision, and calls the action interface to act on the environment. The inventors also provide an implementation of a new multi-robot reinforcement learning algorithm based on quantum theory as an example demonstration. Developers can complete the development of learning modules for robots or other intelligent devices simply by implementing the corresponding interfaces according to a few fixed steps. The framework of the invention is highly portable, can run on many platforms, can be combined with other robot system frameworks, greatly reduces the complexity of writing learning algorithms, and the method is simple.

Description

A framework and method for developing a reinforcement learning system
Technical field
The present invention relates to a framework and method for developing a reinforcement learning system.
Background art
Reinforcement learning is a distinctive, environment-adapting machine learning method that takes environmental feedback as its input. Since the late 1980s, with remarkable progress in its mathematical foundations and in artificial intelligence research, the study and application of reinforcement learning have advanced day by day, and it has become one of the hot topics in the field of machine learning.
Reinforcement learning techniques perceive the environment state and obtain uncertain reward values from the environment; merely through trial and error, they can learn the optimal behavior strategy of a dynamic system, which has attracted many researchers. Up to the present, however, reinforcement learning remains immature in many fields and needs further study.
A reinforcement learning system can be applied in numerous fields and is especially suited to developing learning and adaptation modules for robots and other adaptive intelligent devices. Through the learning system, a robot can execute tasks in an unknown, dynamic environment without a complete model of the environment having to be built (for other learning systems this is a very troublesome matter).
Traditionally, every system developed on the basis of reinforcement learning has had to start from scratch: no general framework has been available for reuse, which causes a large amount of duplicated labor, and because there is no standard to comply with, the resulting structure may be complex and chaotic.
Summary of the invention
The technical problem to be solved by the invention is to provide a framework for developing reinforcement learning systems that is highly portable, can run on numerous platforms, and can be combined with other agent system frameworks. The framework greatly reduces the complexity of writing learning algorithms: the heavy program design work originally required for reinforcement learning research is simplified, the repetitive parts of the design are completed by the framework, and the overall idea of learning is embodied within the framework.
The present invention also provides a method for developing a reinforcement learning system.
To solve the above technical problem, the invention provides a framework for developing a reinforcement learning system, characterized by comprising:
a learner interface that interacts with the external environment and is the module the reinforcement learning system uses to organize the other interfaces in order to learn and make decisions;
a state interface that represents the environment state, this interface providing a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained;
an action interface through which the system executes actions via an execution component, providing a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent.
The learner interface that interacts with the external environment comprises six overloadable methods: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner implements the Q-learning algorithm by default. The initialize-learning method sets the learning factor and the discount factor, returning true on success and false otherwise. The observe-environment method, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in a state interface object, and returns it. The obtain-reward method computes the reward from the current state and the Q-value table and returns it. The learn-and-update method updates the Q-value table from the reward and the obtained current state, returning true on success. The obtain-best-action method takes the current state identifier, obtains the optimal action, and carries it out through the execute-action method.
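By way of illustration, the six methods just listed could be declared in Java as follows. This is a minimal sketch, not the original disclosure's code; every identifier and signature is an assumption, and State and Action refer to the interfaces sketched after the method steps below.

public interface Learner {
    /** Initialize the learning factor and discount factor; true on success. */
    boolean init(double learningFactor, double discountFactor);

    /** Observe the test environment and return the agent's combined state,
        wrapped in a state interface object. */
    State observe();

    /** Compute the reward for the given state from the Q-value table. */
    double getReward(State current);

    /** Update the Q-value table from the reward and current state; true on success. */
    boolean learn(State current, double reward);

    /** Look up the optimal action for the given state. */
    Action bestAction(State current);

    /** Carry out the chosen action against the environment. */
    void execute(Action action);
}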
A method for developing a reinforcement learning system, characterized by comprising the following steps:
organizing the other interfaces through the learner interface that interacts with the external environment, in order to learn and make decisions;
using the mapping method provided by the state interface that represents the environment state to map states in the environment to internal system states, from which the optimal action is obtained (see the interface sketch after these steps);
providing, through the action interface by which the execution component executes actions, a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
setting the initial positions of the target, the obstacles, and the learning agent through the basic test environment.
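Illustrative Java declarations of the state and action interfaces described in these steps; the method names and the int[]-based internal representation are assumptions, not part of the original disclosure.

public interface State {
    /** Map the raw environment state to the system's internal representation
        (for example, an array or matrix) from which the optimal action is chosen. */
    int[] map();
}

public interface Action {
    /** Identifier of the currently selected action. */
    int current();

    /** Carry out the current action through the execution component. */
    void perform();
}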
The invention also provides, as an example demonstration, an implementation of a reinforcement learning algorithm based on quantum theory. The learning algorithm is described as follows:
1. Initialization:
(1) Set the learning parameters $\alpha$, $\delta_w$, $\delta_l$ and the discount factor $\beta$, and let $t = 0$;
(2) Initialize the states and actions, assigning every basis state and every basis action the same superposition amplitude;
(3) For all states $|s^{(m)}\rangle$ and actions $|a_s^{(i)}\rangle$, let $Q_0^i(s^{(0)}, a_s^{(1)}, a_s^{(2)}, \cdots, a_s^{(n)}) = 0$, $\pi_0^i(s^{(0)}, a_s^{(i)}) = \frac{1}{4}$, $\bar\pi_0^i(s^{(0)}, a_s^{(i)}) = \frac{1}{4}$, and $C^i(s) = 0$;
2. Within each cycle, repeat the following steps until $t = 200$:
(1) For every state, observe the action set and obtain an action $|a\rangle$;
(2) Execute the action $|a\rangle$ and observe the robot's reward $r$ and the new state $|s_t^{(m)}\rangle$;
(3) Update the policy with $\pi_{t+1}^i(s,a) = \pi_t^i(s,a) + \Delta\delta_{sa}$, where
$$\Delta\delta_{sa} = \begin{cases} \min\bigl(\pi_t^i(s,a),\ \delta/(|A_i|-1)\bigr) & \text{if } |a\rangle = U_{\mathrm{Grov}}^{L}\,|a_s^{(n)}\rangle \\ -\min\bigl(\pi_t^i(s,a),\ \delta/(|A_i|-1)\bigr) & \text{otherwise;} \end{cases}$$
(4) Update the average strategy: $C^i(s) = C^i(s) + 1$ and $\bar\pi^i(s^{(i)}, a_s^{(i)}) = \bar\pi^i(s^{(i)}, a_s^{(i)}) + \dfrac{\pi^i(s^{(i)}, a_s^{(i)}) - \bar\pi^i(s^{(i)}, a_s^{(i)})}{C^i(s)}$;
(5) Explore the next action, repeating the Grover iteration $L$ times to update the probability amplitudes.
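For illustration only, update steps (3) and (4) might be coded as follows. This is a sketch under stated assumptions: states and actions are flat indices, and the boolean groverSelected stands in for the condition $|a\rangle = U_{\mathrm{Grov}}^{L}\,|a_s^{(n)}\rangle$; none of these names come from the original disclosure.

class PolicyUpdater {
    /** One policy-update step for agent i, following steps (3) and (4) above. */
    void updatePolicy(int s, int a, boolean groverSelected,
                      double[][] pi, double[][] piBar, int[] counts,
                      double delta, int actionCount) {
        // Step (3): shift probability toward the Grover-amplified action,
        // or away from it, by at most delta / (|A_i| - 1).
        double step = Math.min(pi[s][a], delta / (actionCount - 1));
        pi[s][a] += groverSelected ? step : -step;

        // Step (4): update the running average strategy.
        counts[s] += 1;
        piBar[s][a] += (pi[s][a] - piBar[s][a]) / counts[s];
    }
}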
The flow of developing a reinforcement learning system with this framework is described in detail as follows:
Step 1: import the toolkit;
Step 2: implement the interfaces and write the strategy;
Step 3: if the default environment is used, the test can be run directly and the operation checked; if a custom environment is used, the work of writing the mapping rules from the environment to the robot's internal state must be done first, and then the test is run;
Step 4: if the result is normal, finish; otherwise return to step 2 (a hypothetical end-to-end example follows).
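The four steps might look as follows in code; QLearner (standing in for the framework's assumed default Q-learning learner), the parameter values, and the episode count are illustrative assumptions.

public class Demo {
    public static void main(String[] args) {
        // Step 2: choose or implement a learner strategy (here an assumed default).
        Learner learner = new QLearner();
        learner.init(0.1, 0.9);   // learning factor and discount factor

        // Step 3: run the test in the default grid-world environment.
        for (int t = 0; t < 200; t++) {
            State s = learner.observe();            // observe the environment
            Action a = learner.bestAction(s);       // decide on the optimal action
            learner.execute(a);                     // act on the environment
            learner.learn(s, learner.getReward(s)); // update the Q-value table
        }
        // Step 4: check the operation; if abnormal, return to step 2 and revise.
    }
}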
The framework of the invention is highly portable, can run on numerous platforms, can be combined with other robot system frameworks, greatly reduces the complexity of writing learning algorithms, and the method is simple.
Description of the drawings
Fig. 1 shows the reinforcement learning model;
Fig. 2 is the main class diagram of the framework of the invention;
Fig. 3 is the development flow chart based on the framework of the invention.
Embodiment
With reference to Fig. 1, the reinforcement learning system observes the environment to obtain the state and the reward, acts on the environment after learning and adjustment, then observes the environment again to determine the next action; the cycle repeats until the desired environment state is finally reached.
With reference to Fig. 2, the invention comprises a learner interface that interacts with the external environment and is the module with which the reinforcement learning system organizes the other interfaces to learn and make decisions. It comprises six overloadable methods: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner implements the Q-learning algorithm by default. The initialize-learning method sets the learning factor and the discount factor, returning true on success and false otherwise; the observe-environment method, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in a state interface object, and returns it; the obtain-reward method computes the reward from the current state and the Q-value table and returns it; the learn-and-update method updates the Q-value table from the reward and the obtained current state, returning true on success; the obtain-best-action method takes the current state identifier, obtains the optimal action, and carries it out through the execute-action method. The invention further comprises a state interface that represents the environment state and provides a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained; an action interface through which actions are executed by an execution component, providing a get-action method and an execute-action method used respectively to obtain the current action and to carry it out; and a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent. The Q-value table and the Q value are classes designed to implement the Q-learning algorithm: the Q-value table provides a method for updating Q values and a method for selecting the maximum Q value under the current state, while the Q-value class represents a concrete Q value, and the update method in the Q-value table delegates to the method in the Q-value class (a sketch of this pair follows).
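A minimal Java sketch of the Q-value table and Q-value classes just described; the field names, the key encoding, and the standard Q-learning backup formula used inside QValue.update are assumptions, since the disclosure describes the classes only functionally.

import java.util.HashMap;
import java.util.Map;

class QValue {
    double value;

    /** Standard Q-learning backup (assumed): Q += alpha * (r + beta * maxNext - Q). */
    void update(double reward, double maxNext, double alpha, double beta) {
        value += alpha * (reward + beta * maxNext - value);
    }
}

class QTable {
    private final Map<Long, QValue> table = new HashMap<>();

    private long key(int state, int action) {
        return ((long) state << 32) | (action & 0xffffffffL);
    }

    QValue get(int state, int action) {
        return table.computeIfAbsent(key(state, action), k -> new QValue());
    }

    /** Select the action with the maximum Q value under the given state. */
    int bestAction(int state, int actionCount) {
        int best = 0;
        double bestQ = Double.NEGATIVE_INFINITY;
        for (int a = 0; a < actionCount; a++) {
            double q = get(state, a).value;
            if (q > bestQ) { bestQ = q; best = a; }
        }
        return best;
    }

    /** Updating a Q value delegates to QValue.update, as the embodiment describes. */
    void update(int state, int action, double reward, double maxNext,
                double alpha, double beta) {
        get(state, action).update(reward, maxNext, alpha, beta);
    }
}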
With reference to Fig. 3, the flow of developing one's own reinforcement learning system with this framework is described in detail as follows:
Step 1: import the toolkit. This step is mandatory when developing with the framework; to use the interfaces that the framework provides, this package must be imported.
Step 2: implement the interfaces and write the strategy. The learner interface class is the reference standard provided for designing robot learning modules, and a module that implements this standard can use the framework conveniently. The state interface is used to represent states; considering that different robots operating in different environments may differ in how states are expressed and in the number of states, abstraction reduces the interface to a single method, the mapping method, which maps the environment state to the corresponding robot's internal representation, such as an array or a matrix (an example follows).
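One possible implementation of that single mapping method for the grid-world case; the GridState class and its flat-index encoding are hypothetical.

class GridState implements State {
    private final int x, y, width;

    GridState(int x, int y, int width) {
        this.x = x; this.y = y; this.width = width;
    }

    /** Map the (x, y) grid cell to the learner's internal representation:
        here, a single flat index per cell. */
    @Override
    public int[] map() {
        return new int[] { y * width + x };
    }
}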
Step 3: if the default environment is used, the test can be run directly and the operation checked; if a custom environment is used, the work of writing the mapping rules from the environment to the robot's internal state must be completed first, and then the test is run.
Step 4: if the result is normal, finish; otherwise return to step 2.

Claims (3)

1. A framework for developing a reinforcement learning system, characterized by comprising:
a learner interface component that interacts with the external environment and is the module with which the reinforcement learning system organizes the other interface components to learn and make decisions;
a state interface component that represents the environment state, this interface component providing a mapping method used to map states in the environment to internal system states, from which the optimal action is obtained;
an action interface component through which actions are executed by an execution component, providing a get-action method and an execute-action method, used respectively to obtain the current action and to carry it out;
a basic test environment, which is the classical grid world, used to set the initial positions of the target, the obstacles, and the learning agent.
2. The framework for developing a reinforcement learning system according to claim 1, characterized in that the learner interface component that interacts with the external environment comprises six overloadable functional modules: initialize learning, observe the environment, obtain the reward, learn and update the internal state values, obtain the best action, and execute the action; the learner interface component implements the Q-learning algorithm by default, wherein the initialize-learning module sets the learning factor and the discount factor, returning true on success and false otherwise; the observe-environment module, by default, obtains status information from the test environment, combines it with the agent's current state into an observation, encapsulates the observed state in the state interface component, and returns it; the obtain-reward module computes the reward from the current state and the Q-value table and returns it; the learn-and-update module updates the Q-value table from the reward and the obtained current state, returning true on success; and the obtain-best-action module takes the current state identifier, obtains the optimal action, and carries it out through the execute-action module.
3. A method for developing a reinforcement learning system, characterized by comprising the following steps:
organizing the other interface components through the learner interface component that interacts with the external environment, in order to learn and make decisions;
the learner interface component acquiring the environment state through the state interface component, updating the state-value table by learning and making a decision, and calling the action interface component to act on the environment, the state interface component providing a mapping functional module used to map states in the environment to internal system states, which serve as the important reference for obtaining the optimal action;
providing, through the action interface component by which the execution component executes actions, a get-action functional module and an execute-action functional module, used respectively to obtain the current action and to carry it out;
setting the initial positions of the target, the obstacles, and the learning agent through the basic test environment;
importing the toolkit, which is a mandatory step when developing with the framework system, since using the interface components the framework system provides requires importing this package;
implementing the interfaces and writing the strategy, wherein the learner interface component provides a reference standard for designing robot learning modules, and a module that implements this standard can use the framework system conveniently; the state interface component is used to represent states, and, considering that different robots operating in different environments differ in how states are expressed and in the number of states, abstraction reduces it to a single mapping module that maps the environment state to the corresponding robot's internal representation;
if the default environment is used, running the test directly and checking the operation; if a custom environment is used, completing the work of writing the mapping rules from the environment to the robot's internal state, then running the test;
if the result is normal, finishing; otherwise returning to the step of implementing the interfaces and writing the strategy.
CN 200810051406 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system Active CN101739601B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200810051406 CN101739601B (en) 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system


Publications (2)

Publication Number Publication Date
CN101739601A CN101739601A (en) 2010-06-16
CN101739601B true CN101739601B (en) 2013-03-06

Family

ID=42463060

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200810051406 Active CN101739601B (en) 2008-11-12 2008-11-12 Frame and method for developing reinforcement learning system

Country Status (1)

Country Link
CN (1) CN101739601B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104640168B (en) * 2014-12-04 2018-10-09 北京理工大学 Vehicular ad hoc network method for routing based on Q study
JP2021018644A (en) * 2019-07-22 2021-02-15 コニカミノルタ株式会社 Mechanical learning device, mechanical learning method and mechanical learning program
CN113255347B (en) * 2020-02-10 2022-11-15 阿里巴巴集团控股有限公司 Method and equipment for realizing data fusion and method for realizing identification of unmanned equipment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101256482A (en) * 2007-12-19 2008-09-03 深圳市同洲电子股份有限公司 Development system and method for built-in application program
CN101276279A (en) * 2008-05-21 2008-10-01 天柏宽带网络科技(北京)有限公司 Unified development system and method
CN101620535A (en) * 2009-07-29 2010-01-06 北京航空航天大学 General frame design method of airborne computer software

Also Published As

Publication number Publication date
CN101739601A (en) 2010-06-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right
Owner name: STATE GRID CORPORATION OF CHINA CHANGCHUN POWER SU
Effective date: 20140630

C41 Transfer of patent application or patent right or utility model
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information
Inventor after: Meng Xiangping, Tan Wanyu, Pi Yuzhen, Yuan Quande, Ji Xiu, Zhang Xilin
Inventor before: Meng Xiangping, Tan Wanyu, Pi Yuzhen, Yuan Quande, Ji Xiu
COR Change of bibliographic data
Free format text: CORRECT: INVENTOR; FROM: MENG XIANGPING TAN WANYU PI YUZHEN YUAN QUANDE JI XIU TO: MENG XIANGPING TAN WANYU PI YUZHEN YUAN QUANDE JI XIU ZHANG XILIN

TR01 Transfer of patent right
Effective date of registration: 20140630
Address after: 130012 Jilin province Changchun wide flat road No. 395
Patentee after: Changchun Engineering College; State Grid Corporation of China; Changchun Power Supply Company, State Grid Jilin Electric Power Co., Ltd.
Address before: 130012 Jilin province Changchun wide flat road No. 395
Patentee before: Changchun Engineering College