CN111178545B - Dynamic reinforcement learning decision training system - Google Patents
Dynamic reinforcement learning decision training system
- Publication number: CN111178545B
- Application number: CN201911412353.1A
- Authority: CN (China)
- Prior art keywords: module, training, environment, reinforcement learning, return
- Prior art date: 2019-12-31
- Legal status: Active (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Abstract
A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module. The training environment module consists of an environment execution engine module, an observation construction module and a return calculation module. The environment execution engine module maintains an underlying state data structure and outputs underlying state data containing all state information. The observation construction module converts the underlying state data into state information forms suited to different algorithm requirements; during training, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information. The return calculation module sets return checkpoints according to the various return-generating conditions, and computes and outputs the checkpoint return values within each execution step of the training environment module. The data interface between the reinforcement learning model and the training environment module comprises a state information sending interface, an action receiving interface and a return sending interface. The system greatly enhances algorithm generality, reduces interface design difficulty, and reduces the constraints the environment imposes on the form of the algorithm.
Description
Technical Field
The invention belongs to the field of computer artificial intelligence, and particularly relates to a training system for reinforcement machine learning.
Background
Reinforcement learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It describes and solves the problem of an agent learning, through interaction with an environment, a strategy that maximizes return or achieves a specific goal. Reinforcement learning needs no prior knowledge or labeled data: its basic working mode is that a policy model continuously makes action attempts in the environment (exploration), obtains learning information by receiving the environment's return on each action (feedback), updates the model parameters, and finally converges. Deep reinforcement learning algorithms have already reached human level in Go and in video games, demonstrating great potential for handling complex, multi-faceted decision problems. They therefore have broad application prospects in industrial systems, games, marketing, advertising, finance, education and even data science, and are a machine learning technology with the potential to realize general artificial intelligence.
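As a concrete illustration of this working mode, the following minimal training loop is a sketch only: it assumes a Gym-style environment API (the classic gym package, whose step returns a 4-tuple; gymnasium's signatures differ), and the PolicyModel class is a hypothetical placeholder, not part of any platform discussed here.

import random
import gym  # assumes the classic gym API; gymnasium's reset/step signatures differ

class PolicyModel:
    """Hypothetical placeholder policy: random exploration, no real learning."""
    def __init__(self, n_actions):
        self.n_actions = n_actions

    def act(self, state):
        return random.randrange(self.n_actions)  # action attempt (exploration)

    def update(self, state, action, reward):
        pass  # a real model would update its parameters from the return here

env = gym.make("CartPole-v1")
model = PolicyModel(env.action_space.n)

for episode in range(10):
    state = env.reset()
    done = False
    while not done:
        action = model.act(state)                     # explore in the environment
        state, reward, done, info = env.step(action)  # receive the environment's return
        model.update(state, action, reward)           # learning information -> parameter update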
A decision model produced by any reinforcement learning method needs a corresponding training/execution environment, together with a set of interfaces that carry the state, action and return exchanged between the decision model and the environment. Depending on the application field, the environment may be a real physical environment or a software environment such as a game or Go. Because training in a real environment is slow and costly, even reinforcement learning aimed at real-world applications such as robots is more often iterated rapidly in a simulation software environment. Among virtual environments for reinforcement learning research and development, OpenAI Gym is in common use; simple environment scene models in Gym are derived manually, while complex models require powerful physics engines. NVIDIA likewise provides the Isaac Sim platform for autonomous robot reinforcement learning training, which supports robots carrying sensors such as lidar and cameras in performing reinforcement learning training of autonomous actions in a simulated environment. Google DeepMind, together with the game company Blizzard, introduced SC2LE, a reinforcement learning research environment for StarCraft II that provides a set of APIs, based on game information and control instructions, for interacting with StarCraft II to support artificial intelligence research on the game.
Such environments allow a reinforcement learning algorithm to be verified quickly and an effective reinforcement learning strategy model to be formed. Each reinforcement learning environment platform provides a fixed set of training interaction interfaces, and researchers building on these environments must follow the interface specifications, such as the data organization and the interaction flow. On the one hand, this constrains the technical form of the reinforcement learning algorithm: some algorithms do not fit the interface specification of the current platform, which either prevents the algorithm from being applied on that platform or increases the adaptation workload for developers. On the other hand, platform developers must design interface specifications that are as general as possible so as to suit model training in different forms, which increases the difficulty of platform design; yet because algorithms vary so widely, the interfaces rarely achieve good generality in practice.
Disclosure of Invention
The invention aims to solve the technical problems caused by the fixed algorithm interfaces of traditional reinforcement learning training environments, such as the high difficulty of designing general interfaces and the high difficulty of adapting algorithms to them.
To achieve this purpose, the invention provides the following technical scheme:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
the method is characterized in that:
the training environment module consists of three functional modules, namely an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining a bottom state data structure and outputting bottom state data containing all state information;
the observation construction module is used for converting the bottom layer state data into a state information form which is suitable for different algorithm requirements, and the training environment module calls the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the bottom layer state data to generate state information in the training process;
the return calculation module is used for setting a return check point according to various return generation conditions, and calculating and outputting a check point return value in the execution step length of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: the device comprises a state information sending interface, an action receiving interface and a return sending interface.
The dynamic reinforcement learning decision training system has the following advantages:
the reinforcement learning training system and interface architecture greatly enhance algorithm generality and reduce interface design difficulty, while also reducing the constraints the environment imposes on the form of the algorithm and sparing users unnecessary interface-adaptation work when bringing a reinforcement learning algorithm to the environment.
Drawings
FIG. 1 is a schematic diagram of a dynamic reinforcement learning decision training system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings. The described embodiments are only a part of the embodiments of the present invention, not all of them; all other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
As shown in FIG. 1, the specific scheme of the invention is as follows:
a dynamic reinforcement learning decision training system comprises a reinforcement learning model and a training environment module. The training environment module is composed of three key functional modules, namely an environment execution engine module, an observation construction module and a return calculation module. The system also comprises an observation generation algorithm definition module and a return generation definition module which are in man-machine interaction with the user, and the user can designate an observation construction algorithm and a return generation definition corresponding to a specific reinforcement learning model through the observation generation algorithm definition module and the return generation definition module.
The environment execution engine module maintains the underlying state data structure, and observation construction modules are built on top of it. During training/execution, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information. The return calculation module sets return checkpoints according to the various return-generating conditions; the user defines the assignment rule of each checkpoint through the return generation definition module, and within each execution step the training environment module computes and outputs the checkpoint return values.
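The sketch below shows one way these three modules could be wired together. It is only an illustration of the architecture in FIG. 1: every class and method name in it (TrainingEnvironment, engine.execute, step_return and so on) is a hypothetical stand-in, not an interface defined by the invention.

from typing import Any, Callable

class TrainingEnvironment:
    """Hypothetical wiring of the three functional modules."""

    def __init__(self, engine: Any, obs_constructor: Callable, return_module: Any):
        self.engine = engine                    # environment execution engine module
        self.obs_constructor = obs_constructor  # observation construction module (callback)
        self.return_module = return_module      # return calculation module

    def step(self, action: Any):
        # 1. The engine applies the received action and updates the underlying
        #    state data structure, which contains all state information.
        raw_state = self.engine.execute(action)
        # 2. The observation construction module, invoked as a callback,
        #    reconstructs the underlying state data into the state information
        #    form required by the current algorithm.
        observation = self.obs_constructor(raw_state)
        # 3. The return calculation module evaluates the return checkpoints
        #    for this execution step and outputs their summed value.
        reward = self.return_module.step_return(raw_state)
        return observation, reward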
The data interface between the reinforcement learning model and the training environment module mainly comprises a state information sending interface, an action receiving interface and a return sending interface.
Regarding the state information sending interface: different reinforcement learning algorithms require different state data formats and information organization forms — for example, state information based on discrete data, state information based on images, state information based on multi-layer data, and mixed state information of several types — so the environment would conventionally need to design a single set of interfaces meeting the training and execution requirements of any algorithm.
In this system, the underlying data containing all state information is output by the environment execution engine module. Various state information construction algorithms are developed for different algorithm requirements through observation construction modules. The observation construction module is responsible for converting the underlying state data into a state information form suited to the algorithm's requirements, and together these form a set of state construction algorithms offered to users for selection. A user can directly select a preset state construction algorithm for algorithm training, or directly use the shared underlying-state-interface algorithm. Using the observation generation algorithm definition module, a user can also independently customize an observation construction module that meets the algorithm's requirements. During training/execution, the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to generate the state information, as sketched below.
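A minimal sketch of how such a state construction algorithm set and the dynamic loading mechanism might look follows. The preset constructors, the raw-state keys, and the "package.module:function" naming convention are all assumptions made for illustration, not the patent's actual conventions.

import importlib

def vector_obs(raw_state):
    # preset: state information based on discrete/vector data
    return [raw_state["x"], raw_state["y"]]

def image_obs(raw_state):
    # preset: state information based on images
    return raw_state["frame"]

def raw_obs(raw_state):
    # shared underlying-state interface: pass all state information through
    return raw_state

PRESET_CONSTRUCTORS = {"vector": vector_obs, "image": image_obs, "raw": raw_obs}

def load_obs_constructor(spec):
    """Resolve a preset name, or dynamically load a user-defined
    constructor given as 'package.module:function'."""
    if spec in PRESET_CONSTRUCTORS:
        return PRESET_CONSTRUCTORS[spec]
    module_name, func_name = spec.split(":")
    module = importlib.import_module(module_name)  # dynamic loading mechanism
    return getattr(module, func_name)

A user would then pass, for example, load_obs_constructor("my_pkg.my_obs:build") to the training environment so that the custom constructor is invoked as the callback during training.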
Regarding the action receiving interface: the division of actions depends mainly on the environment itself and is closely tied to it, so no adaptive matching is performed. The action information output by the reinforcement learning model can be passed directly to the environment execution engine module within the training environment module.
When the output of the reinforcement learning model cannot directly match the actions the environment can receive — for example when the model's actions are abstractions, extensions or simplifications of them — the reinforcement learning model side is designed to take charge of the corresponding action information conversion.
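A sketch of such model-side conversion follows; the abstract action set and the engine command format are entirely hypothetical, chosen only to show where the conversion lives.

class ActionAdapter:
    """Hypothetical model-side conversion from abstract model outputs to
    actions the environment execution engine can receive directly."""

    # illustrative mapping: abstract discrete action -> concrete engine command
    ABSTRACT_TO_CONCRETE = {
        0: {"cmd": "move", "dx": -1.0},  # abstract "left"
        1: {"cmd": "move", "dx": +1.0},  # abstract "right"
    }

    def __init__(self, policy):
        self.policy = policy

    def act(self, observation):
        abstract_action = self.policy(observation)         # reinforcement learning model output
        return self.ABSTRACT_TO_CONCRETE[abstract_action]  # conversion handled on the model side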
Regarding the return sending interface: a user (algorithm researcher) often needs to repeatedly modify the return generation rules and the return form in order to find the most effective return incentive scheme, and the fixed return generation strategy adopted by traditional environments hinders reinforcement learning algorithm research.
The return calculation module in the training environment module sets a return checkpoint for each of the return-generating conditions in the environment. Using the return generation definition module, the user writes a return definition script that specifies the return value generated by each checkpoint; the assignment of each checkpoint may be positive or negative, and a checkpoint that is not used is simply set to 0. After each step is executed, the environment computes the sum of the returns generated by all checkpoints and outputs it as the final return value.
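A sketch of such checkpoint-based return calculation follows; the condition functions and values are invented examples, and ReturnCheckpoint/ReturnCalculator are hypothetical names rather than the patent's own.

from dataclasses import dataclass
from typing import Callable

@dataclass
class ReturnCheckpoint:
    condition: Callable[[dict], bool]  # a return-generating condition in the environment
    value: float                       # user-assigned value: positive, negative, or 0 if unused

class ReturnCalculator:
    def __init__(self, checkpoints):
        self.checkpoints = checkpoints

    def step_return(self, raw_state):
        # After each step is executed, sum the return produced by every
        # triggered checkpoint and output it as the final return value.
        return sum(cp.value for cp in self.checkpoints if cp.condition(raw_state))

calculator = ReturnCalculator([
    ReturnCheckpoint(lambda s: s.get("goal_reached", False), +10.0),
    ReturnCheckpoint(lambda s: s.get("collision", False), -5.0),
    ReturnCheckpoint(lambda s: s.get("timeout", False), 0.0),  # unused checkpoint set to 0
])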
Embodiment:
In practical applications, the system can serve artificial intelligence decision training and execution software systems, as well as unmanned systems such as unmanned aerial vehicles, unmanned ground vehicles and robots.
The artificial intelligence decision training and execution system is designed with a variable state information interface.
The reinforcement learning algorithm is written in Python to form an Agent class; the Agent class contains a key member variable self.
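The name of that member variable is truncated in the source text, so the sketch below substitutes the hypothetical name self.obs_constructor; everything here is illustrative rather than the patent's actual class layout.

class Agent:
    """Illustrative Agent class for the variable state information interface."""

    def __init__(self, obs_constructor, policy):
        # Hypothetical key member variable: the observation construction
        # algorithm this agent's reinforcement learning model expects.
        self.obs_constructor = obs_constructor
        self.policy = policy

    def act(self, raw_state):
        observation = self.obs_constructor(raw_state)  # reconstruct state information
        return self.policy(observation)                # decide an action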
Each return checkpoint in the environment is assigned a value in a JSON definition; when the environment starts, it reads the JSON file and generates the assignment rules.
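The JSON schema below and the loading helper are assumptions for illustration; the patent does not spell out the file format, only that the environment reads the JSON file at startup and generates the assignment rules from it.

import json

SAMPLE_DEFINITION = """
{
  "checkpoints": [
    {"name": "goal_reached", "value": 10.0},
    {"name": "collision",    "value": -5.0},
    {"name": "timeout",      "value": 0.0}
  ]
}
"""

def load_assignment_rules(text):
    """Read the return definition at environment startup; a checkpoint
    without an explicit value is treated as unused and set to 0."""
    spec = json.loads(text)
    return {cp["name"]: float(cp.get("value", 0.0)) for cp in spec["checkpoints"]}

rules = load_assignment_rules(SAMPLE_DEFINITION)
# rules == {'goal_reached': 10.0, 'collision': -5.0, 'timeout': 0.0}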
Finally, it should be noted that although the present invention has been described in detail, it will be apparent to those skilled in the art that modifications may be made to the above embodiments and equivalents may be substituted for elements thereof. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within its protection scope.
Claims (10)
1. A dynamic reinforcement learning decision training system comprises a reinforcement learning model, a training environment module, and a data interface between the reinforcement learning model and the training environment module;
characterized in that:
the training environment module consists of an environment execution engine module, an observation construction module and a return calculation module;
the environment execution engine module is used for maintaining an underlying state data structure and outputting underlying state data containing all state information;
the observation construction module is used for converting the underlying state data into state information forms suited to different algorithm requirements, and during training the training environment module invokes the corresponding observation construction module through a callback or dynamic loading mechanism to reconstruct the underlying state data into state information;
the return calculation module is used for setting return checkpoints according to the various return-generating conditions, and for computing and outputting the checkpoint return values within each execution step of the training environment module;
the data interface between the reinforcement learning model and the training environment module comprises: a state information sending interface, an action receiving interface and a return sending interface.
2. The system of claim 1, wherein, for the state information sending interface, the environment execution engine module outputs the underlying data containing all state information; various state information construction algorithms are developed for different algorithm requirements through observation construction modules; and the observation construction module is responsible for converting the underlying state data into state information forms suited to different algorithm requirements, forming a set of state construction algorithms offered to users for selection.
3. The system of claim 2, wherein the user can directly select a preset state construction algorithm for algorithm training, or directly use the shared underlying-state-interface algorithm.
4. The dynamic reinforcement learning decision training system of claim 3, further comprising: an observation generation algorithm definition module for human-computer interaction with the user, through which the user can specify the observation construction algorithm corresponding to a specific reinforcement learning model and can independently customize an observation construction module meeting the algorithm's requirements.
5. The system of claim 1, wherein, for the action receiving interface, the action information output by the reinforcement learning model can be passed directly to the environment execution engine module in the training environment module.
6. The system of claim 5, wherein, when the output of the reinforcement learning model cannot directly match the actions the environment can receive, the reinforcement learning model is responsible for performing the corresponding action information conversion and outputting the converted action information to the environment execution engine module.
7. The system of claim 1, wherein, for the return sending interface, the return calculation module in the training environment module sets a return checkpoint for each of a plurality of return-generating conditions; after each step is executed, the training environment module computes the sum of the returns generated by all checkpoints and outputs it as the final return value.
8. The system of claim 7, further comprising: a return generation definition module through which the user can specify the return generation definition corresponding to a specific reinforcement learning model.
9. The system of claim 8, wherein, using the return generation definition module, the user writes a return definition script specifying the return value generated by each checkpoint; the assignment of each checkpoint may be positive or negative, and a checkpoint that is not used is set directly to 0.
10. The system of claim 1, applied to artificial intelligence decision training and execution software systems and to unmanned autonomous machine systems.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911412353.1A | 2019-12-31 | 2019-12-31 | Dynamic reinforcement learning decision training system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111178545A CN111178545A (en) | 2020-05-19 |
CN111178545B true CN111178545B (en) | 2023-02-24 |
Family
ID=70654185
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7707131B2 (en) * | 2005-03-08 | 2010-04-27 | Microsoft Corporation | Thompson strategy based online reinforcement learning system for action selection |
US11775850B2 (en) * | 2016-01-27 | 2023-10-03 | Microsoft Technology Licensing, Llc | Artificial intelligence engine having various algorithms to build different concepts contained within a same AI model |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947567A * | 2019-03-14 | 2019-06-28 | 深圳先进技术研究院 | Multi-agent reinforcement learning scheduling method, system and electronic device |
CN110000785A * | 2019-04-11 | 2019-07-12 | 上海交通大学 | Calibration-free robot motion visual servo cooperative control method and device for agricultural scenes |
Non-Patent Citations (1)
Title |
---|
Heuristic Q-learning guided by online-updated information strength (在线更新的信息强度引导启发式Q学习); Wu Haolin et al.; Application Research of Computers (《计算机应用研究》); 2017-07-21 (No. 08); full text *
Also Published As
Publication number | Publication date |
---|---|
CN111178545A (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |