CN111178545A - Dynamic reinforcement learning decision training system - Google Patents


Info

Publication number
CN111178545A
CN111178545A (application CN201911412353.1A)
Authority
CN
China
Prior art keywords
module
training
environment
reinforcement learning
return
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911412353.1A
Other languages
Chinese (zh)
Other versions
CN111178545B (en)
Inventor
高放
李明强
陈思
唐思琦
黄彬城
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Information Science Research Institute
Original Assignee
CETC Information Science Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Information Science Research Institute filed Critical CETC Information Science Research Institute
Priority to CN201911412353.1A priority Critical patent/CN111178545B/en
Publication of CN111178545A publication Critical patent/CN111178545A/en
Application granted granted Critical
Publication of CN111178545B publication Critical patent/CN111178545B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module. The training environment module consists of an environment execution engine module, an observation construction module and a return calculation module. The environment execution engine module maintains an underlying state data structure and outputs underlying state data containing all state information. The observation construction module converts the underlying state data into a state information form suited to different algorithm requirements; during training, the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data and generate state information. The return calculation module sets return check points according to the various return generation conditions, and calculates and outputs the check-point return values within each execution step of the training environment module. The data interface between the reinforcement learning model and the training environment module comprises a state information sending interface, an action receiving interface and a return sending interface. The system greatly enhances algorithm universality, reduces interface design difficulty, and reduces the restrictions the environment places on the algorithm form.

Description

Dynamic reinforcement learning decision training system
Technical Field
The invention belongs to the field of computer artificial intelligence, and particularly relates to a training system for reinforcement machine learning.
Background
Reinforcement Learning (RL), also known as evaluative learning or augmented learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning a strategy, through interaction with an environment, that maximizes return or achieves a specific goal. Reinforcement learning is a learning method that requires no prior knowledge or labeled data: the strategy model continuously makes action attempts (exploration) in the environment, obtains learning information by receiving the environment's return (feedback) to each action, updates the model parameters, and finally converges. Deep reinforcement learning algorithms have already reached human level in Go (weiqi) and electronic games, demonstrating great potential for handling complex decision problems; they therefore have broad application prospects in industrial systems, games, marketing, advertising, finance, education and even data science, and reinforcement learning is a machine learning technology that may eventually realize general artificial intelligence.
A decision model formed by any reinforcement learning method needs a corresponding training/execution environment, together with a set of interfaces through which the model and the environment exchange states, actions and returns. Depending on the application field, the environment can be a real physical environment or a software environment such as a game or Go. Because training in a real environment is slow and costly, even reinforcement learning aimed at real applications such as robots more often performs rapid training iterations in a simulation software environment. Among virtual environments oriented to reinforcement learning research and development, OpenAI Gym is commonly used: in Gym, simple environment scene models are derived manually, while complex models require a powerful physics engine. NVIDIA also provides the Isaac Sim platform for autonomous-robot reinforcement learning training, which supports robots equipped with sensors such as lidar and cameras performing reinforcement learning training of autonomous behavior in a simulation environment. Google DeepMind, in cooperation with Blizzard Entertainment, introduced SC2LE, a reinforcement learning research environment for StarCraft II, providing a set of APIs for exchanging information and control instructions with the StarCraft II game to support artificial intelligence research on StarCraft II.
Such environments make it possible to verify reinforcement learning algorithms quickly and form effective reinforcement learning strategy models. However, each reinforcement learning environment platform provides a fixed set of training interaction interfaces, and researchers who build on these environments must follow the interface specifications, such as the data organization and interaction flow. On one hand, this limits the technical form of reinforcement learning algorithms: some algorithms do not fit the interface specification of the current platform, which either blocks their application on the platform or increases the researchers' adaptation workload. On the other hand, platform developers have to design interface specifications that are as universal as possible to suit model training in different forms, which increases the difficulty of platform design; yet because algorithms vary widely, the resulting interface universality is still unsatisfactory.
Disclosure of Invention
The invention aims to solve the technical problems caused by the fixed algorithm interfaces of traditional reinforcement learning training environments, such as the high difficulty of designing a universal interface and the high difficulty of adapting algorithms to it.
In order to achieve the purpose, the invention provides the following technical scheme:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
the method is characterized in that:
the training environment module consists of an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining an underlying state data structure and outputting underlying state data containing all state information;
the observation construction module is used for converting the underlying state data into a state information form suited to different algorithm requirements, and during training the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data and generate state information;
the return calculation module is used for setting return check points according to the various return generation conditions, and calculating and outputting the check-point return values within each execution step of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: a state information sending interface, an action receiving interface and a return sending interface.
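The three-part data interface described above can be sketched as a Gym-style interaction loop. All class and attribute names below are illustrative assumptions, not the patent's actual implementation; the sketch only shows how the engine, observation construction and return calculation modules could cooperate within one step.

```python
# Hypothetical sketch of the three data interfaces; names are illustrative.

class TrainingEnvironment:
    """Couples the environment execution engine with a pluggable
    observation construction module and return calculation module."""

    def __init__(self, obs_builder, reward_fn):
        self.obs_builder = obs_builder    # observation construction module
        self.reward_fn = reward_fn        # return calculation module
        self.raw_state = {"position": 0}  # underlying state data

    def step(self, action):
        # action receiving interface: actions go straight to the engine
        self.raw_state["position"] += action
        # state information sending interface: rebuild state for the algorithm
        observation = self.obs_builder(self.raw_state)
        # return sending interface: compute the per-step return value
        reward = self.reward_fn(self.raw_state)
        return observation, reward


env = TrainingEnvironment(
    obs_builder=lambda s: [s["position"]],
    reward_fn=lambda s: 1.0 if s["position"] >= 3 else 0.0,
)
obs, r = env.step(3)
print(obs, r)  # [3] 1.0
```

Because both the observation builder and the return function are injected, swapping either one changes the interface behaviour without touching the engine itself.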
The dynamic reinforcement learning decision training system has the advantages that:
the reinforcement learning training system and the interface architecture can greatly enhance the algorithm universality, reduce the interface design difficulty, simultaneously reduce the limit of the environment on the algorithm form, and reduce the workload of unnecessary interface adaptation of the reinforcement learning algorithm to the environment by a user.
Drawings
FIG. 1 is a schematic diagram of a dynamic reinforcement learning decision training system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from these embodiments without creative effort shall fall within the protection scope of the present invention.
As shown in fig. 1, the specific scheme of the invention is as follows:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model and a training environment module. The training environment module is composed of three key functional modules, namely an environment execution engine module, an observation construction module and a return calculation module. The system also comprises an observation generation algorithm definition module and a return generation definition module which are in man-machine interaction with the user, and the user can designate an observation construction algorithm and a return generation definition corresponding to a specific reinforcement learning model through the observation generation algorithm definition module and the return generation definition module.
The environment execution engine module maintains the underlying state data structure, and the observation construction modules are built on top of it. During training/execution, the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data and generate state information. The return calculation module sets return check points according to the various return generation conditions; the user defines the assignment rule of each check point through the return generation definition module, and the training environment module calculates and outputs the check-point return values within each execution step.
The data interface between the reinforcement learning model and the training environment module mainly comprises a state information sending interface, an action receiving interface and a return sending interface.
Regarding the state information sending interface: different reinforcement learning algorithms require different state data formats and information organization forms, such as state information based on discrete data, state information based on images, state information based on multi-layer data, and mixtures of several types; the environment therefore needs a set of interfaces that can satisfy the training and execution of any algorithm.
The underlying data containing all state information is output by the environment execution engine module. Various state information construction algorithms are developed for different algorithm requirements through observation construction modules. An observation construction module is responsible for converting the underlying state data into a state information form suited to a given algorithm's requirements; together these modules form a set of state construction algorithms offered for users to choose from. The user can directly select a preset state construction algorithm for algorithm training, or directly use a common algorithm on the underlying state interface. Using the observation generation algorithm definition module, a user can also independently customize an observation construction module that meets the algorithm's requirements. During training/execution, the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to generate the state information.
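One plausible realization of the preset-set-plus-dynamic-loading mechanism is a registry of builder functions combined with `importlib` for user-supplied modules. The registry, decorator and `"module:function"` spec format below are assumptions made for illustration, not the patent's actual design.

```python
# Sketch of a preset observation-builder set plus dynamic loading of
# user-defined builders; names and the spec format are illustrative.
import importlib

OBS_BUILDERS = {}  # preset state construction algorithm set


def register(name):
    """Register a builder into the preset set."""
    def deco(fn):
        OBS_BUILDERS[name] = fn
        return fn
    return deco


@register("discrete")
def discrete_obs(raw_state):
    # flatten the underlying state into a discrete vector
    return [raw_state["x"], raw_state["y"]]


@register("raw")
def raw_obs(raw_state):
    # share the underlying state interface directly
    return dict(raw_state)


def load_builder(spec):
    """Resolve a builder: first from the preset set, otherwise by
    dynamically importing a user-supplied "module:function" spec."""
    if spec in OBS_BUILDERS:
        return OBS_BUILDERS[spec]
    mod_name, fn_name = spec.split(":")
    return getattr(importlib.import_module(mod_name), fn_name)


builder = load_builder("discrete")
print(builder({"x": 1, "y": 2}))  # [1, 2]
```

With this shape, a preset algorithm and a fully custom one loaded from the user's own module travel through the same call path, which is what lets the environment stay agnostic to the algorithm's state format.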
Regarding the action receiving interface: the division of actions depends mainly on the environment, and actions are closely related to the environment, so no adaptive matching is performed. The action information output by the reinforcement learning model can be passed directly to the environment execution engine module in the training environment module.
When the output of the reinforcement learning model cannot directly match the actions the environment can receive (for example, when abstraction, extension or simplification is needed), the reinforcement learning model is designed to be responsible for the corresponding action information conversion.
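As a hedged illustration of model-side action conversion: suppose the model outputs a discrete action index while the environment's engine expects a `(dx, dy)` velocity pair. The mapping table below is purely hypothetical.

```python
# Illustrative model-side action conversion: discrete index -> (dx, dy).
# The table is an assumption for demonstration, not from the patent.

ACTION_TABLE = {
    0: (0, 1),    # up
    1: (0, -1),   # down
    2: (-1, 0),   # left
    3: (1, 0),    # right
}


def convert_action(model_output):
    """Action-information conversion performed on the model side before
    the action is passed to the environment execution engine."""
    return ACTION_TABLE[model_output]


print(convert_action(3))  # (1, 0)
```

Keeping the conversion on the model side is what allows the environment's action receiving interface to stay fixed while models with different output forms are swapped in.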
Regarding the return sending interface: a user (algorithm researcher) often needs to repeatedly modify the return generation rules and the return form to find the most effective return incentive scheme, so a traditional environment with a fixed return generation strategy hinders reinforcement learning algorithm research.
The return calculation module in the training environment module sets a return check point for each of several return generation conditions in the environment. Using the return generation definition module, the user writes a return definition script specifying the return value generated by each check point; the assignment of each check point can be positive or negative, and an unused check point is simply set to 0. After each step is executed, the environment calculates the sum of the returns generated by all check points and outputs this sum as the final return value.
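One possible reading of this mechanism is that each check point is a (condition, value) pair and the per-step return is the sum over all triggered check points. The function and field names below are assumptions for illustration.

```python
# Hypothetical check-point based return calculation; names are illustrative.

def make_checkpoints(definitions):
    """definitions: list of (predicate, value) pairs supplied by the user's
    return definition script; value may be positive, negative, or 0."""
    return list(definitions)


def step_return(checkpoints, state):
    """Sum the values of all check points whose condition fires this step."""
    return sum(value for predicate, value in checkpoints if predicate(state))


checkpoints = make_checkpoints([
    (lambda s: s["goal_reached"], +10.0),
    (lambda s: s["collided"], -5.0),
    (lambda s: True, 0.0),  # unused check point, assigned 0
])
print(step_return(checkpoints, {"goal_reached": True, "collided": False}))  # 10.0
```

Because the check-point list is data rather than code baked into the environment, the user can reshape the return incentive scheme between runs without touching the engine.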
Embodiment:
In practical application, the system can serve artificial intelligence decision training and execution software systems as well as unmanned systems such as unmanned aerial vehicles, unmanned vehicles and robots.
The artificial intelligence decision training and execution system is designed with a variable state information interface.
The reinforcement learning algorithm is written in Python to form an Agent class; the Agent class comprises a key member variable self.
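The member variable's name is truncated in the source, so the sketch below uses the hypothetical name `self.obs_builder` to illustrate how such a member could bind the Agent to its chosen observation construction algorithm; both the name and the placeholder policy are assumptions.

```python
# Hypothetical Agent class; the key member's real name is truncated in the
# source, so self.obs_builder here is an assumed, illustrative name.

class Agent:
    def __init__(self, obs_builder):
        # hypothetical key member: the observation construction algorithm
        self.obs_builder = obs_builder

    def observe(self, raw_state):
        """Turn underlying state data into this agent's state form."""
        return self.obs_builder(raw_state)

    def act(self, observation):
        # placeholder policy: always pick action 0
        return 0


agent = Agent(obs_builder=lambda s: [s["x"]])
print(agent.observe({"x": 7}))  # [7]
```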
The assignment of each return check point in the environment is defined in JSON; when the environment starts, it reads the JSON file and generates the assignment rules.
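A minimal sketch of that start-up step might look as follows; the JSON schema (a `"checkpoints"` object mapping check-point names to values) is an assumption, since the patent does not specify the file format.

```python
# Hedged sketch: load check-point assignment rules from JSON at start-up.
# The "checkpoints" schema is assumed for illustration.
import json


def load_assignment_rules(json_text):
    """Parse the user's return definition; unused entries are kept at 0."""
    spec = json.loads(json_text)
    return {name: float(value) for name, value in spec["checkpoints"].items()}


JSON_DEF = '{"checkpoints": {"goal": 10, "collision": -5, "idle": 0}}'
rules = load_assignment_rules(JSON_DEF)
print(rules["goal"])  # 10.0
```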
Finally, it should be noted that although the present invention has been described in detail, it will be apparent to those skilled in the art that changes may be made to the above embodiments and equivalents may be substituted for elements thereof. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
characterized in that:
the training environment module consists of an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining an underlying state data structure and outputting underlying state data containing all state information;
the observation construction module is used for converting the underlying state data into a state information form suited to different algorithm requirements, and during training the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data and generate state information;
the return calculation module is used for setting return check points according to the various return generation conditions, and calculating and outputting the check-point return values within each execution step of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: a state information sending interface, an action receiving interface and a return sending interface.
2. The system of claim 1, wherein for the status information transmission interface, the environment execution engine module outputs the underlying data containing all the status information; developing various state information construction algorithms aiming at different algorithm requirements through an observation construction module; the observation construction module is responsible for converting the bottom layer state data into a state information form which is suitable for different algorithm requirements, and a state construction algorithm set is formed and provided for users to select.
3. The system of claim 2, wherein the user can directly select the preset state construction algorithm for algorithm training, or directly use the underlying state interface common algorithm.
4. The dynamic reinforcement learning decision training system of claim 3, further comprising: the observation generation algorithm definition module performs man-machine interaction with a user, and the user can specify an observation construction algorithm corresponding to a specific reinforcement learning model through the observation generation algorithm definition module; by utilizing the observation generation algorithm definition module, a user can customize an observation construction module meeting the algorithm requirement.
5. The system of claim 1, wherein for the action receiving interface, the action information output from the reinforcement learning model can be directly output to the environment execution engine module in the training environment module.
6. The system of claim 5, wherein when the reinforcement learning model output cannot directly match the environment-receivable action, the reinforcement learning model is responsible for performing the corresponding action information transformation and outputting the transformed action information to the environment execution engine module.
7. The system of claim 1, wherein for the reward sending interface, a reward calculating module in the training environment module sets a reward checking point for a plurality of reward generating conditions; and after the execution of each step length is finished, the training environment module calculates the return sum generated by each check point, and the return sum is output as a final return value.
8. The system of claim 7, further comprising: and the user can specify a return generation definition corresponding to the specific reinforcement learning model through the return generation definition module.
9. The system of claim 8, wherein the reward generation definition module is utilized to write a reward definition script by a user, and specify the reward value generated by each checkpoint, and the assignment of each checkpoint can be positive or negative, and if not used, is directly set to 0.
10. The system of claim 1, applied to an artificial intelligence decision-making training and execution oriented software system and an unmanned autonomous machine system.
CN201911412353.1A 2019-12-31 2019-12-31 Dynamic reinforcement learning decision training system Active CN111178545B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911412353.1A CN111178545B (en) 2019-12-31 2019-12-31 Dynamic reinforcement learning decision training system


Publications (2)

Publication Number Publication Date
CN111178545A true CN111178545A (en) 2020-05-19
CN111178545B CN111178545B (en) 2023-02-24

Family

ID=70654185

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911412353.1A Active CN111178545B (en) 2019-12-31 2019-12-31 Dynamic reinforcement learning decision training system

Country Status (1)

Country Link
CN (1) CN111178545B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060224535A1 (en) * 2005-03-08 2006-10-05 Microsoft Corporation Action selection for reinforcement learning using influence diagrams
US20180357552A1 (en) * 2016-01-27 2018-12-13 Bonsai AI, Inc. Artificial Intelligence Engine Having Various Algorithms to Build Different Concepts Contained Within a Same AI Model
CN109947567A (en) * 2019-03-14 2019-06-28 深圳先进技术研究院 A kind of multiple agent intensified learning dispatching method, system and electronic equipment
CN110000785A (en) * 2019-04-11 2019-07-12 上海交通大学 Agriculture scene is without calibration robot motion's vision collaboration method of servo-controlling and equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wu Haolin et al.: "Heuristic Q-learning guided by online-updated information strength", Application Research of Computers *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882027A (en) * 2020-06-02 2020-11-03 东南大学 Robot reinforcement learning training environment system for RoboMaster artificial intelligence challenge competition
CN112138396A (en) * 2020-09-23 2020-12-29 中国电子科技集团公司第十五研究所 Intelligent training method and system for unmanned system simulation confrontation
CN112138396B (en) * 2020-09-23 2024-04-12 中国电子科技集团公司第十五研究所 Unmanned system simulation countermeasure-oriented intelligent body training method and system
CN112766508A (en) * 2021-04-12 2021-05-07 北京一流科技有限公司 Distributed data processing system and method thereof
CN114189517A (en) * 2021-12-03 2022-03-15 中国电子科技集团公司信息科学研究院 Heterogeneous autonomous unmanned cluster unified access control system
CN114189517B (en) * 2021-12-03 2024-01-09 中国电子科技集团公司信息科学研究院 Heterogeneous autonomous unmanned cluster unified access management and control system
CN117114088A (en) * 2023-10-17 2023-11-24 安徽大学 Deep reinforcement learning intelligent decision platform based on unified AI framework
CN117114088B (en) * 2023-10-17 2024-01-19 安徽大学 Deep reinforcement learning intelligent decision platform based on unified AI framework
CN117725985A (en) * 2024-02-06 2024-03-19 之江实验室 Reinforced learning model training and service executing method and device and electronic equipment
CN117725985B (en) * 2024-02-06 2024-05-24 之江实验室 Reinforced learning model training and service executing method and device and electronic equipment

Also Published As

Publication number Publication date
CN111178545B (en) 2023-02-24

Similar Documents

Publication Publication Date Title
CN111178545B (en) Dynamic reinforcement learning decision training system
Wang et al. Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning
Chen et al. DNNOff: offloading DNN-based intelligent IoT applications in mobile edge computing
Bianchi et al. Transferring knowledge as heuristics in reinforcement learning: A case-based approach
Li et al. Neural-network-based path planning for a multirobot system with moving obstacles
CN109690576A (en) The training machine learning model in multiple machine learning tasks
CN112272831A (en) Reinforcement learning system including a relationship network for generating data encoding relationships between entities in an environment
CN111966361B (en) Method, device, equipment and storage medium for determining model to be deployed
US20180314963A1 (en) Domain-independent and scalable automated planning system using deep neural networks
Mukadam et al. Riemannian motion policy fusion through learnable lyapunov function reshaping
CN114424208A (en) Gated attention neural network
Yu et al. Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem
Kono et al. Transfer learning method using ontology for heterogeneous multi-agent reinforcement learning
CN109635706A (en) Gesture identification method, equipment, storage medium and device neural network based
CN111667060B (en) Deep learning algorithm compiling method and device and related products
Shintani et al. A set based design method using Bayesian active learning
CN117011118A (en) Model parameter updating method, device, computer equipment and storage medium
CN115860113A (en) Training method and related device for self-antagonistic neural network model
WO2022127603A1 (en) Model processing method and related device
CN116710974A (en) Domain adaptation using domain countermeasure learning in composite data systems and applications
CN114707070A (en) User behavior prediction method and related equipment thereof
JP2022165395A (en) Method for optimizing neural network model and method for providing graphical user interface for neural network model
CN115114927A (en) Model training method and related device
CN114118374A (en) Multi-wisdom reinforcement learning method and system based on hierarchical consistency learning
Han et al. Three‐dimensional obstacle avoidance for UAV based on reinforcement learning and RealSense

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant