CN111178545B - Dynamic reinforcement learning decision training system - Google Patents
Dynamic reinforcement learning decision training system
- Publication number: CN111178545B
- Application number: CN201911412353.1A
- Authority: CN (China)
- Prior art keywords: module, training, environment, reinforcement learning, return
- Prior art date: 2019-12-31
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module. The training environment module consists of an environment execution engine module, an observation construction module and a return calculation module. The environment execution engine module maintains an underlying state data structure and outputs underlying state data containing all state information. The observation construction module converts the underlying state data into state information forms suited to different algorithm requirements; during training, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information. The return calculation module sets return checkpoints according to the various return-generating conditions, and computes and outputs the checkpoint return values within each execution step of the training environment module. The data interface between the reinforcement learning model and the training environment module comprises a state information sending interface, an action receiving interface and a return sending interface. The system greatly enhances algorithm generality, reduces interface design difficulty, and reduces the constraints the environment imposes on the form of the algorithm.
Description
Technical Field
The invention belongs to the field of computer artificial intelligence, and particularly relates to a training system for reinforcement machine learning.
Background
Reinforcement learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning, through interaction with an environment, a strategy that maximizes return or achieves a specific goal. Reinforcement learning needs no prior knowledge or labeled data: its basic working mode is that a policy model continuously makes action attempts in the environment (exploration), obtains learning information by receiving the environment's return on each action (feedback), updates the model parameters, and finally converges. Deep reinforcement learning algorithms have already reached human level in Go and in video games, demonstrating great potential for handling complex, multi-faceted decision problems. They therefore have broad application prospects in industrial systems, games, marketing, advertising, finance, education and even data science, and are a machine learning technology with the potential to realize general artificial intelligence.
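As a concrete illustration of this working mode, the following minimal training loop is a sketch only: it assumes a Gym-style environment API (the classic gym package, whose step returns a 4-tuple; gymnasium's signatures differ), and the PolicyModel class is a hypothetical placeholder, not part of any platform discussed here.

import random
import gym  # assumes the classic gym API; gymnasium's reset/step signatures differ

class PolicyModel:
    """Hypothetical placeholder policy: random exploration, no real learning."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)  # action attempt (exploration)

    def update(self, state, action, reward):
        pass  # a real model would update its parameters from the return here

env = gym.make("CartPole-v1")
model = PolicyModel(env.action_space.n)

for episode in range(10):
    state = env.reset()
    done = False
    while not done:
        action = model.act(state)                     # explore in the environment
        state, reward, done, info = env.step(action)  # receive the environment's return
        model.update(state, action, reward)           # learning information -> parameter update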
A decision model produced by any reinforcement learning method needs a corresponding training/execution environment, together with a set of interfaces that carry the state, action and return exchanged between the decision model and the environment. Depending on the application field, the environment may be a real physical environment or a software environment such as a game or Go. Because training in a real environment is slow and costly, even reinforcement learning aimed at real-world applications such as robots is more often iterated rapidly in a simulation software environment. Among virtual environments for reinforcement learning research and development, OpenAI Gym is in common use; simple environment scene models in Gym are derived manually, while complex models require powerful physics engines. NVIDIA likewise provides the Isaac Sim platform for autonomous robot reinforcement learning training, which supports robots carrying sensors such as lidar and cameras in performing reinforcement learning training of autonomous actions in a simulated environment. Google DeepMind, together with the game company Blizzard, introduced SC2LE, a reinforcement learning research environment for StarCraft II that provides a set of APIs, based on game information and control instructions, for interacting with StarCraft II to support artificial intelligence research on the game.
Such environments allow a reinforcement learning algorithm to be verified quickly and an effective reinforcement learning strategy model to be formed. Each reinforcement learning environment platform provides a fixed set of training interaction interfaces, and researchers building on these environments must follow the interface specifications, such as the data organization and the interaction flow. On the one hand, this constrains the technical form of the reinforcement learning algorithm: some algorithms do not fit the interface specification of the current platform, which either prevents the algorithm from being applied on that platform or increases the adaptation workload for developers. On the other hand, platform developers must design interface specifications that are as general as possible so as to suit model training in different forms, which increases the difficulty of platform design; yet because algorithms vary so widely, the interfaces rarely achieve good generality in practice.
Disclosure of Invention
The invention aims to solve the technical problems caused by the fixed algorithm interfaces of traditional reinforcement learning training environments, such as the high difficulty of designing general interfaces and the high difficulty of adapting algorithms to them.
To achieve this purpose, the invention provides the following technical scheme:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
the method is characterized in that:
the training environment module consists of three functional modules, namely an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining a bottom state data structure and outputting bottom state data containing all state information;
the observation construction module is used for converting the bottom layer state data into a state information form which is suitable for different algorithm requirements, and the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the bottom layer state data to generate state information in the training process;
the return calculation module is used for setting a return check point according to various return generation conditions, and calculating and outputting a check point return value in the execution step length of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: the device comprises a state information sending interface, an action receiving interface and a return sending interface.
The dynamic reinforcement learning decision training system has the following advantages:
the reinforcement learning training system and interface architecture greatly enhance algorithm generality and reduce interface design difficulty, while also reducing the constraints the environment imposes on the form of the algorithm and sparing users unnecessary interface-adaptation work when bringing a reinforcement learning algorithm to the environment.
Drawings
FIG. 1 is a schematic diagram of a dynamic reinforcement learning decision training system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, the specific scheme of the invention is as follows:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model and a training environment module. The training environment module is composed of three key functional modules, namely an environment execution engine module, an observation construction module and a return calculation module. The system also comprises an observation generation algorithm definition module and a return generation definition module which are in man-machine interaction with the user, and the user can designate an observation construction algorithm and a return generation definition corresponding to a specific reinforcement learning model through the observation generation algorithm definition module and the return generation definition module.
The environment execution engine module maintains the underlying state data structure, and observation construction modules are built on top of it. During training/execution, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information. The return calculation module sets return checkpoints according to the various return-generating conditions; the user defines the assignment rule of each checkpoint through the return generation definition module, and within each execution step the training environment module computes and outputs the checkpoint return values.
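The sketch below shows one way these three modules could be wired together. It is only an illustration of the architecture in FIG. 1: every class and method name in it (TrainingEnvironment, engine.execute, step_return and so on) is a hypothetical stand-in, not an interface defined by the invention.

from typing import Any, Callable

class TrainingEnvironment:
    """Hypothetical wiring of the three functional modules."""

    def __init__(self, engine: Any, obs_constructor: Callable, return_module: Any):
        self.engine = engine                    # environment execution engine module
        self.obs_constructor = obs_constructor  # observation construction module (callback)
        self.return_module = return_module      # return calculation module

    def step(self, action: Any):
        # 1. The engine applies the received action and updates the underlying
        #    state data structure, which contains all state information.
        raw_state = self.engine.execute(action)
        # 2. The observation construction module, invoked as a callback,
        #    reconstructs the underlying state data into the state information
        #    form required by the current algorithm.
        observation = self.obs_constructor(raw_state)
        # 3. The return calculation module evaluates the return checkpoints
        #    for this execution step and outputs their summed value.
        reward = self.return_module.step_return(raw_state)
        return observation, reward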
The data interface between the reinforcement learning model and the training environment module mainly comprises a state information sending interface, an action receiving interface and a return sending interface.
Regarding the state information sending interface: different reinforcement learning algorithms require different state data formats and information organization forms — for example, state information based on discrete data, state information based on images, state information based on multi-layer data, and mixed state information of several types — so the environment would conventionally need to design a single set of interfaces meeting the training and execution requirements of any algorithm.
In this system, the underlying data containing all state information is output by the environment execution engine module. Various state information construction algorithms are developed for different algorithm requirements through observation construction modules. The observation construction module is responsible for converting the underlying state data into a state information form suited to the algorithm's requirements, and together these form a set of state construction algorithms offered to users for selection. A user can directly select a preset state construction algorithm for algorithm training, or directly use the shared underlying-state-interface algorithm. Using the observation generation algorithm definition module, a user can also independently customize an observation construction module that meets the algorithm's requirements. During training/execution, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to generate the state information, as sketched below.
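A minimal sketch of how such a state construction algorithm set and the dynamic loading mechanism might look follows. The preset constructors, the raw-state keys, and the "package.module:function" naming convention are all assumptions made for illustration, not the patent's actual conventions.

import importlib

def vector_obs(raw_state):
    # preset: state information based on discrete/vector data
    return [raw_state["x"], raw_state["y"]]

def image_obs(raw_state):
    # preset: state information based on images
    return raw_state["frame"]

def raw_obs(raw_state):
    # shared underlying-state interface: pass all state information through
    return raw_state

PRESET_CONSTRUCTORS = {"vector": vector_obs, "image": image_obs, "raw": raw_obs}

def load_obs_constructor(spec):
    """Resolve a preset name, or dynamically load a user-defined
    constructor given as 'package.module:function'."""
    if spec in PRESET_CONSTRUCTORS:
        return PRESET_CONSTRUCTORS[spec]
    module_name, func_name = spec.split(":")
    module = importlib.import_module(module_name)  # dynamic loading mechanism
    return getattr(module, func_name)

A user would then pass, for example, load_obs_constructor("my_pkg.my_obs:build") to the training environment so that the custom constructor is invoked as the callback during training.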
Regarding the action receiving interface: the division of actions depends mainly on the environment itself and is closely tied to it, so no adaptive matching is performed. The action information output by the reinforcement learning model can be passed directly to the environment execution engine module within the training environment module.
When the output of the reinforcement learning model cannot directly match the actions the environment can receive — for example when the model's actions are abstractions, extensions or simplifications of them — the reinforcement learning model side is designed to take charge of the corresponding action information conversion.
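A sketch of such model-side conversion follows; the abstract action set and the engine command format are entirely hypothetical, chosen only to show where the conversion lives.

class ActionAdapter:
    """Hypothetical model-side conversion from abstract model outputs to
    actions the environment execution engine can receive directly."""

    # illustrative mapping: abstract discrete action -> concrete engine command
    ABSTRACT_TO_CONCRETE = {
        0: {"cmd": "move", "dx": -1.0},  # abstract "left"
        1: {"cmd": "move", "dx": +1.0},  # abstract "right"
    }

    def __init__(self, policy):
        self.policy = policy

    def act(self, observation):
        abstract_action = self.policy(observation)         # reinforcement learning model output
        return self.ABSTRACT_TO_CONCRETE[abstract_action]  # conversion handled on the model side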
Regarding the return sending interface: a user (algorithm researcher) often needs to repeatedly modify the return generation rules and the return form in order to find the most effective return incentive scheme, and the fixed return generation strategy adopted by traditional environments hinders reinforcement learning algorithm research.
The return calculation module in the training environment module sets a return checkpoint for each of the return-generating conditions in the environment. Using the return generation definition module, the user writes a return definition script that specifies the return value generated by each checkpoint; the assignment of each checkpoint may be positive or negative, and a checkpoint that is not used is simply set to 0. After each step is executed, the environment computes the sum of the returns generated by all checkpoints and outputs it as the final return value.
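A sketch of such checkpoint-based return calculation follows; the condition functions and values are invented examples, and ReturnCheckpoint/ReturnCalculator are hypothetical names rather than the patent's own.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ReturnCheckpoint:
    condition: Callable[[dict], bool]  # a return-generating condition in the environment
    value: float                       # user-assigned value: positive, negative, or 0 if unused

class ReturnCalculator:
    def __init__(self, checkpoints):
        self.checkpoints = checkpoints

    def step_return(self, raw_state):
        # After each step is executed, sum the return produced by every
        # triggered checkpoint and output it as the final return value.
        return sum(cp.value for cp in self.checkpoints if cp.condition(raw_state))

calculator = ReturnCalculator([
    ReturnCheckpoint(lambda s: s.get("goal_reached", False), +10.0),
    ReturnCheckpoint(lambda s: s.get("collision", False), -5.0),
    ReturnCheckpoint(lambda s: s.get("timeout", False), 0.0),  # unused checkpoint set to 0
])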
Embodiment:
In practical applications, the system can serve artificial intelligence decision training and execution software systems, as well as unmanned systems such as unmanned aerial vehicles, unmanned ground vehicles and robots.
The artificial intelligence decision training and execution system is designed with a variable state information interface.
The reinforcement learning algorithm is written in Python to form an Agent class; the Agent class contains a key member variable self.
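The name of that member variable is truncated in the source text, so the sketch below substitutes the hypothetical name self.obs_constructor; everything here is illustrative rather than the patent's actual class layout.

class Agent:
    """Illustrative Agent class for the variable state information interface."""

    def __init__(self, obs_constructor, policy):
        # Hypothetical key member variable: the observation construction
        # algorithm this agent's reinforcement learning model expects.
        self.obs_constructor = obs_constructor
        self.policy = policy

    def act(self, raw_state):
        observation = self.obs_constructor(raw_state)  # reconstruct state information
        return self.policy(observation)                # decide an action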
Each return checkpoint in the environment is assigned a value in a JSON definition; when the environment starts, it reads the JSON file and generates the assignment rules.
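The JSON schema below and the loading helper are assumptions for illustration; the patent does not spell out the file format, only that the environment reads the JSON file at startup and generates the assignment rules from it.

import json

SAMPLE_DEFINITION = """
{
  "checkpoints": [
    {"name": "goal_reached", "value": 10.0},
    {"name": "collision",    "value": -5.0},
    {"name": "timeout",      "value": 0.0}
  ]
}
"""

def load_assignment_rules(text):
    """Read the return definition at environment startup; a checkpoint
    without an explicit value is treated as unused and set to 0."""
    spec = json.loads(text)
    return {cp["name"]: float(cp.get("value", 0.0)) for cp in spec["checkpoints"]}

rules = load_assignment_rules(SAMPLE_DEFINITION)
# rules == {'goal_reached': 10.0, 'collision': -5.0, 'timeout': 0.0}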
Finally, it should be noted that although the present invention has been described in detail, it will be apparent to those skilled in the art that modifications may be made to the above embodiments and equivalents may be substituted for elements thereof. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its protection scope.
Claims (10)
1. A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
characterized in that:
the training environment module consists of an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining an underlying state data structure and outputting underlying state data containing all state information;
the observation construction module is used for converting the underlying state data into state information forms suited to different algorithm requirements, and during training the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information;
the return calculation module is used for setting return checkpoints according to the various return-generating conditions, and for computing and outputting the checkpoint return values within each execution step of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: a state information sending interface, an action receiving interface and a return sending interface.
2. The system of claim 1, wherein, for the state information sending interface, the environment execution engine module outputs the underlying data containing all state information; various state information construction algorithms are developed for different algorithm requirements through observation construction modules; and the observation construction module is responsible for converting the underlying state data into state information forms suited to different algorithm requirements, forming a set of state construction algorithms offered to users for selection.
3. The system of claim 2, wherein the user can directly select a preset state construction algorithm for algorithm training, or directly use the shared underlying-state-interface algorithm.
4. The dynamic reinforcement learning decision training system of claim 3, further comprising: an observation generation algorithm definition module for human-computer interaction with the user, through which the user can specify the observation construction algorithm corresponding to a specific reinforcement learning model and can independently customize an observation construction module meeting the algorithm's requirements.
5. The system of claim 1, wherein, for the action receiving interface, the action information output by the reinforcement learning model can be passed directly to the environment execution engine module in the training environment module.
6. The system of claim 5, wherein, when the output of the reinforcement learning model cannot directly match the actions the environment can receive, the reinforcement learning model is responsible for performing the corresponding action information conversion and outputting the converted action information to the environment execution engine module.
7. The system of claim 1, wherein, for the return sending interface, the return calculation module in the training environment module sets a return checkpoint for each of a plurality of return-generating conditions; after each step is executed, the training environment module computes the sum of the returns generated by all checkpoints and outputs it as the final return value.
8. The system of claim 7, further comprising: a return generation definition module through which the user can specify the return generation definition corresponding to a specific reinforcement learning model.
9. The system of claim 8, wherein, using the return generation definition module, the user writes a return definition script specifying the return value generated by each checkpoint; the assignment of each checkpoint may be positive or negative, and a checkpoint that is not used is set directly to 0.
10. The system of claim 1, applied to artificial intelligence decision training and execution software systems and to unmanned autonomous machine systems.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911412353.1A | 2019-12-31 | 2019-12-31 | Dynamic reinforcement learning decision training system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111178545A CN111178545A (en) | 2020-05-19 |
CN111178545B true CN111178545B (en) | 2023-02-24 |
Family
ID=70654185
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707131B2 (en) * | 2005-03-08 | 2010-04-27 | Microsoft Corporation | Thompson strategy based online reinforcement learning system for action selection |
US11775850B2 (en) * | 2016-01-27 | 2023-10-03 | Microsoft Technology Licensing, Llc | Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947567A * | 2019-03-14 | 2019-06-28 | 深圳先进技术研究院 | Multi-agent reinforcement learning scheduling method, system and electronic device |
CN110000785A * | 2019-04-11 | 2019-07-12 | 上海交通大学 | Calibration-free robot motion visual servo cooperative control method and device for agricultural scenes |
Non-Patent Citations (1)
Title |
---|
Heuristic Q-learning guided by online-updated information strength (在线更新的信息强度引导启发式Q学习); Wu Haolin et al.; Application Research of Computers (《计算机应用研究》); 2017-07-21 (No. 08); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111178545A (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |