CN117725985A - Reinforcement learning model training and service execution method and device, and electronic device - Google Patents

Reinforcement learning model training and service execution method and device, and electronic device

Info

Publication number
CN117725985A
Authority
CN
China
Prior art keywords
reinforcement learning
target
configuration information
algorithm
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410171178.6A
Other languages
Chinese (zh)
Other versions
CN117725985B (en)
Inventor
张杨
王超
陈卫
陈振宇
王永恒
郑黄河
恽爽
曾洪海
连建晓
王梦丝
路游
周春来
鲁艺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202410171178.6A priority Critical patent/CN117725985B/en
Publication of CN117725985A publication Critical patent/CN117725985A/en
Application granted granted Critical
Publication of CN117725985B publication Critical patent/CN117725985B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The specification discloses a reinforcement learning model training and service execution method and apparatus, and an electronic device. The method comprises the following steps: obtaining reinforcement learning environment data constructed by a user for a specified service scenario; in response to a specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data, and determining agent configuration information of the target agent; determining a target reinforcement learning algorithm in a preset algorithm library; constructing a reinforcement learning model based on the target agent, the agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model, and storing the data generated in the simulation process into an experience playback pool; and training the reinforcement learning model according to training data obtained from the experience playback pool and the reward function information. The scheme greatly lowers the threshold for users and fully meets users' diverse requirements for reinforcement learning environments.

Description

Reinforcement learning model training and service execution method and device, and electronic device
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a reinforcement learning model training and service execution method and apparatus, and an electronic device.
Background
Unlike traditional supervised and unsupervised learning, reinforcement learning is an approach in which an agent continuously optimizes its own policy by repeatedly interacting with an uncertain environment. The two main roles in reinforcement learning are the agent and the environment, where the environment refers to the world in which the agent exists and with which it interacts. During each interaction with the environment, the agent obtains an observation of the environment state at the current moment and then decides what action to take. The agent can not only observe a certain environment state but also perceive reward information.
However, at present reinforcement learning can only be realized for environments adapted on a specific simulation platform, the available reinforcement learning algorithms are limited, and the agent settings are fixed, so it is difficult to meet users' diverse requirements for reinforcement learning environments; moreover, the construction of existing reinforcement learning models usually has to be completed manually by the user, which places high demands on the user's expertise.
Therefore, how to lower the expertise required of users during reinforcement learning model construction while meeting their diverse requirements for reinforcement learning environments is a problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides a reinforcement learning model training method, apparatus, storage medium and electronic device, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the specification provides a reinforcement learning model training method comprising the following steps:
obtaining reinforcement learning environment data constructed by a user aiming at a specified service scene;
in response to a specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data, and determining agent configuration information corresponding to the target agent, wherein the agent configuration information comprises: the number of agents, reward function information, action configuration information corresponding to the actions each target agent can execute, and state configuration information corresponding to the reinforcement learning environment states each target agent can acquire;
determining a target reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in a preset algorithm library;
constructing a reinforcement learning model based on the target intelligent agent, the intelligent agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model, and storing data generated in the simulation process into an experience playback pool;
training the reinforcement learning model according to training data obtained from the experience playback pool and the reward function information.
Optionally, the action configuration information includes: the action types of the actions the target agent can execute and the variation range of each action when it is executed;
the state configuration information includes: the dimensions of the states the target agent can acquire and the variation range of the state in each dimension.
Optionally, determining a target reinforcement learning algorithm matched with the service scene and the agent configuration information in a preset algorithm library, which specifically includes:
determining each reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in the algorithm library as a candidate algorithm;
the target reinforcement learning algorithm is determined in response to a selection operation performed by the user for each candidate algorithm.
Optionally, simulating the reinforcement learning model to obtain simulation data, which specifically includes:
for each moment, acquiring the reinforcement learning environment state acquired by each target agent at that moment, the action executed by each target agent at that moment, and the reward value obtained when each target agent executes different actions at that moment, as the training data corresponding to that moment;
and storing the training data corresponding to each moment in the experience playback pool.
Optionally, the training data corresponding to each moment is stored in the experience playback pool, which specifically includes:
acquiring a state mark of the reinforcement learning environment;
based on the status flag, invalid data in the training data is determined, and training data other than the invalid data is stored in the experience playback pool.
Optionally, the algorithm library is provided with a plurality of reinforcement learning algorithms and algorithm configuration information corresponding to the different reinforcement learning algorithms, where the algorithm configuration information includes: at least one of a reward discount coefficient, a learning rate, the size of the experience playback pool corresponding to each reinforcement learning algorithm, a batch size, a gradient clipping coefficient and an entropy clipping coefficient;
determining a target reinforcement learning algorithm matched with the specified service scenario and the agent configuration information in a preset algorithm library specifically includes:
and determining a target reinforcement learning algorithm matched with the specified service scene and the intelligent agent configuration information in a preset algorithm library, and determining target algorithm configuration information corresponding to the target reinforcement learning algorithm.
Optionally, before simulating the reinforcement learning model, the method further comprises:
in response to the user's specified operation, determining a simulation configuration, the simulation configuration at least including: the maximum number of simulation segments, the maximum number of time steps contained in each simulation segment and the termination condition of each simulation segment;
simulating the reinforcement learning model, and storing data generated in the simulation process into an experience playback pool, wherein the method specifically comprises the following steps:
constructing an experience playback pool corresponding to the reinforcement learning model according to the target algorithm configuration information;
and simulating the reinforcement learning model according to the simulation configuration.
The present specification provides a service execution method, including:
acquiring initial state information of a target service scene;
inputting the initial state information into a service model corresponding to the target service scene to determine an execution strategy corresponding to a target object in the target service scene through the service model, wherein the service model is obtained through training by the reinforcement learning model training method;
and executing the service according to the execution strategy.
The present specification provides a reinforcement learning model training device, comprising:
The acquisition module is used for acquiring reinforcement learning environment data constructed by a user aiming at a specified service scene;
the selection module is used for responding to the specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data and determining agent configuration information corresponding to the target agent, wherein the agent configuration information comprises: the number of agents, reward function information, action configuration information corresponding to the actions each target agent can execute, and state configuration information corresponding to the reinforcement learning environment states each target agent can acquire;
the determining module is used for determining a target reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in a preset algorithm library;
the building module is used for building a reinforcement learning model based on the target intelligent agent, the intelligent agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model and storing data generated in the simulation process into an experience playback pool;
and the training module is used for training the reinforcement learning model according to the training data obtained from the experience playback pool and the reward function information.
The present specification provides a service execution apparatus, including:
the acquisition module is used for acquiring initial state information of a target service scene;
the input module is used for inputting the initial state information into a service model corresponding to the target service scene so as to determine an execution strategy corresponding to a target object in the target service scene through the service model, wherein the service model is obtained by training by the reinforcement learning model training method;
and the execution module is used for executing the service according to the execution strategy.
The present specification provides a computer readable storage medium storing a computer program which when executed by a processor implements the reinforcement learning model training and business execution method described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the reinforcement learning model training and business execution method described above when executing the program.
At least one of the above technical solutions adopted in this specification can achieve the following beneficial effects:
In the reinforcement learning model training method provided by this specification, reinforcement learning environment data constructed by a user for a specified service scenario is obtained; a target agent selected by the user based on the reinforcement learning environment data is determined in response to a specified operation of the user, and agent configuration information of the target agent is determined; a target reinforcement learning algorithm is determined in a preset algorithm library; a reinforcement learning model is constructed based on the target agent, the agent configuration information and the target reinforcement learning algorithm, the reinforcement learning model is simulated, and the data generated in the simulation process is stored into an experience playback pool; and the reinforcement learning model is trained according to the data stored in the experience playback pool, with maximizing the reward value obtained by the agent as the optimization target. This greatly improves the training efficiency of the reinforcement learning model, lowers the threshold for users, and fully meets users' diverse requirements for reinforcement learning environments.
With this method, a reinforcement learning algorithm matching the user-defined reinforcement learning environment information can be determined in the algorithm library, and a complete reinforcement learning model can then be built on top of that environment information for training.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate and explain the exemplary embodiments of the present specification and their description, are not intended to limit the specification unduly. In the drawings:
FIG. 1 is a schematic flow chart of a reinforcement learning model training method provided in the present specification;
FIG. 2 is a schematic diagram of a reinforcement learning model construction process provided in the present specification;
FIG. 3 is a schematic diagram of a training and testing process of a reinforcement learning model provided in the present specification;
Fig. 4 is a schematic flow chart of a service execution method provided in the present specification;
FIG. 5 is a schematic diagram of a reinforcement learning model training device provided in the present disclosure;
fig. 6 is a schematic diagram of a service execution device provided in the present specification;
fig. 7 is a schematic diagram of an electronic device corresponding to fig. 1 provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a reinforcement learning model training method provided in the present specification, which includes the following steps:
s101: and obtaining reinforcement learning environment data constructed by the user aiming at the specified service scene.
Reinforcement learning studies how an agent learns by trial and error. The two main roles in reinforcement learning are the agent and the environment, where the environment refers to the world in which the agent exists and with which it interacts. During each interaction with the environment, the agent obtains the current observation of the environment and then decides what action to take. The state transition probability of the environment depends not only on each action of the agent but also on the environment itself. The agent can observe a certain environment state and perceive reward information, and the reward value generated by a state transition reflects how good that state is. The goal of the agent is to maximize the cumulative reward value, i.e. the return.
The policy of an agent is the rule by which it decides what action to take based on the observed state at the current moment; it may output a deterministic action or a probability distribution over all executable actions. While continuously interacting with the environment, the agent collects trajectory information including the current observed state, the executed action, the reward obtained for that action, the next state, and so on, and keeps optimizing its own policy by trial and error so as to obtain the maximum cumulative reward.
Because of this nature, using reinforcement learning requires mastering expertise across a wide range of machine learning areas. Different environment scenarios, different agent action spaces, or different numbers of agents may require different algorithms, so it is very difficult for a person without the relevant professional background to quickly apply reinforcement learning in a custom environment.
Based on the above, this specification provides a modular, configurable reinforcement learning training method that can perform reinforcement learning based on user-defined environment information without extensively modifying the original environment, while ensuring data security.
In the present specification, the execution subject for implementing the reinforcement learning model training and service execution method may be a designated device such as a server, or may be a client installed on a device such as a mobile phone, a notebook computer, or a desktop computer.
The server can acquire reinforcement learning environment data constructed by a user aiming at a specified business scene. The environment data may include several types of agents constructed by the user and an environment model for characterizing the operating environment of the agents, and may include other data such as execution functions and cost functions.
In this specification, the above specified service scenarios may include automatic driving, intelligent path finding, intelligent power supply and distribution, computing and storage resource allocation, and the like, which is not limited in this specification.
In addition, the server may create an algorithm library containing different reinforcement learning algorithms and import the algorithm configuration information corresponding to each reinforcement learning algorithm into the algorithm library, where the algorithm configuration information includes at least one of: a reward discount coefficient, a learning rate, the size of the experience playback pool corresponding to each reinforcement learning algorithm, a batch size, a gradient clipping coefficient and an entropy clipping coefficient.
At the same time, the server may create an empty dictionary in a private attribute of the instantiated reinforcement learning module. This dictionary records the various reinforcement learning simulation experiments (including training, testing, deduction, etc. of reinforcement learning models) created later, making it easy to determine in subsequent calls which simulation experiment to use to obtain an action or for which simulation experiment to save the policy network.
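For ease of understanding, the following is a minimal illustrative sketch in Python of how such an algorithm library with per-algorithm configuration information and an empty experiment dictionary could be organized; all names and default values are assumptions made for illustration rather than taken from the patent.

from dataclasses import dataclass

@dataclass
class AlgorithmConfig:
    reward_discount: float = 0.99      # gamma
    learning_rate: float = 3e-4
    replay_pool_size: int = 100_000    # size of the experience playback pool
    batch_size: int = 256
    gradient_clip: float = 0.5
    entropy_coefficient: float = 0.01

# Algorithm library: per-algorithm default configuration imported at start-up.
ALGORITHM_LIBRARY = {
    "DQN": AlgorithmConfig(learning_rate=1e-3, batch_size=64),
    "PPO": AlgorithmConfig(replay_pool_size=4096, batch_size=64),
    "SAC": AlgorithmConfig(entropy_coefficient=0.2),
    "MAPPO": AlgorithmConfig(replay_pool_size=8192),
}

# Empty dictionary kept as a private attribute of the instantiated module; every
# simulation experiment created later is recorded here under its user-chosen name.
_experiments = {}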
S102: in response to the specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data, and determining agent configuration information corresponding to the target agent, wherein the agent configuration information comprises: the method comprises the steps of agent quantity, rewarding function information, action configuration information corresponding to actions executable by each target agent and state configuration information corresponding to states of reinforcement learning environments acquired by each target agent.
The server can display a selection and configuration page containing all agents in the reinforcement learning environment data to a user through the client, and respond to specified operation executed by the user in the configuration page to determine at least one agent which needs to be optimized and is selected by the user as a target agent, and determine the configuration information of the agents set by the user.
The user can select multiple types of agents, or only one type of agent. The agent configuration information includes: the number of each type of target agent, reward function information, action configuration information corresponding to the actions each target agent can take at each moment, and state configuration information corresponding to the reinforcement learning environment states each target agent can acquire at each moment.
The reward function information may include a reward target (maximize or minimize), and of course may also include other information such as the type of the reward function and its calculation method, which is not specifically limited in this specification.
The action configuration information may include: the action types of the target agent and the variation range (size) of each action when it is executed; together these form the agent's action space. The action space may contain multiple actions, and each action may be either a finite discrete action or a continuous action with upper and lower bounds. Whether each action of the agent is discrete must be marked with a Boolean value when it is input.
The state configuration information may include: the dimensions of the states the target agent can acquire and the variation range (size) of the state in each dimension.
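Purely as an illustration (the field names below are assumptions, not the patent's actual data structures), the agent configuration information described above could be represented roughly as follows, with a Boolean flag per action marking whether it is discrete and a (low, high) range per state dimension:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ActionConfig:
    name: str
    is_discrete: bool        # marked with a Boolean value on input
    choices: int = 0         # number of options for a discrete action
    low: float = 0.0         # lower bound of a continuous action
    high: float = 0.0        # upper bound of a continuous action

@dataclass
class AgentConfig:
    num_agents: int
    reward_target: str                        # "maximize" or "minimize"
    actions: List[ActionConfig]               # the agent's action space
    state_ranges: List[Tuple[float, float]]   # one (low, high) per state dimension

config = AgentConfig(
    num_agents=2,
    reward_target="maximize",
    actions=[
        ActionConfig("direction", is_discrete=True, choices=4),
        ActionConfig("power", is_discrete=False, low=0.0, high=10.0),
    ],
    state_ranges=[(0.0, 1.0), (-5.0, 5.0), (0.0, 100.0)],
)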
Further, the server may further determine, according to an operation performed by the user, a simulation configuration, where the simulation configuration may include: the maximum number of simulation segments, the maximum number of time steps contained in each simulation segment, the termination condition of each simulation segment, and the like.
The server may set a corresponding identifier for a single agent or for multiple homogeneous agents; the identifier indicates the corresponding simulation experiment and determines which reinforcement learning simulation experiment is called in subsequent calls.
Taking the intelligent charging scheduling scenario of electric vehicles as an example, the agents included in the environment data constructed by the user may include electric vehicles, the main power grid, charging piles, charging stations, charging sockets, and so on. The electric vehicles and the charging stations may be selected as target agents, with the number of electric vehicles set to m and the number of charging piles set to n, distributed randomly in space, where n < m.
Under the above scenario, the electric vehicle needs to purchase electric energy from the main power grid according to a certain electric energy purchase strategy, and the charging station needs to distribute the electric energy to the electric vehicle according to a certain electric energy distribution strategy, so as to maximize the economic benefit of the charging station.
The execution functions of an electric vehicle may include initializing electric vehicle information and moving the vehicle to a charging station. The execution functions of a charging station include selecting electricity purchases, judging whether the battery charge has reached the online state, initializing the available electricity, updating the sellable electricity at each moment, updating the energy of the battery storage system, computing individual utility, and so on.
The global script functions corresponding to the environment model include global environment initialization, electric vehicle-to-charging station assignment, charging station utility calculation, time updates, updates of the electric energy allocation results, and so on.
For an electric vehicle, the executable action types may include the movement direction, the purchase amount, and so on, and its states may include: uncharged, charged, the remaining charge at each moment, the unmet charging demand, and so on.
For a charging station, the executable actions include starting power supply, the amount of power supplied, stopping power supply, and so on, and its states may include: the amount of electricity the charging station can dispatch at the current moment and the number of charging piles it can dispatch at the current moment.
Finally, the optimization target of the above agents may be: maximizing the economic benefit of each charging station.
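For illustration only, and under assumed names and example numbers (the patent only requires n < m), the charging scenario above could be configured along the following lines:

m, n = 20, 8    # example counts of electric vehicles and charging piles, n < m

ev_agent = {
    "name": "electric_vehicle",
    "count": m,
    "actions": {
        "move_direction": {"discrete": True, "choices": 4},
        "purchase_amount": {"discrete": False, "low": 0.0, "high": 50.0},   # kWh
    },
    "states": ["charged_flag", "remaining_charge", "unmet_demand"],
}

station_agent = {
    "name": "charging_station",
    "count": n,
    "actions": {
        "start_supply": {"discrete": True, "choices": 2},
        "supply_amount": {"discrete": False, "low": 0.0, "high": 200.0},    # kWh
    },
    "states": ["dispatchable_energy", "dispatchable_piles"],
    "reward": "maximize_station_profit",    # optimization target of each station
}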
Further, the server may also provide the user with optional configuration items through the client, including the number of cycles after which training starts, which algorithm to use, whether to load a model previously trained in the same environment, and so on, which allows finer-grained experimental adjustment if the user later wants it.
It should be noted that if the user does not select or set the optional configuration items, the server may perform the subsequent tasks according to the default configuration.
S103: and determining a target reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in a preset algorithm library.
The server can process the input parameters to a certain degree through a preset algorithm module to extract the needed information.
The algorithm module can determine whether the task is a single-agent task or a multi-agent task according to the number of agents passed in as a parameter; in a multi-agent task, the algorithm module also configures the created experience playback pool with an agent-number dimension when storing data.
The incoming state configuration information determines the input dimension of the policy network. In the multi-agent scenario, the input dimension of the algorithm's critic network also requires a global state dimension determined from the number of agents and each agent's observation space.
The incoming action configuration information, together with the flag indicating whether each action is discrete, determines the output dimension of the policy network; when actions are later output, different algorithms also apply certain processing such as linear transformations to the network output so that it matches the agent's action space.
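A simplified sketch of how these dimensions could be derived is shown below; the helper functions and the one-output-per-continuous-action convention are assumptions for illustration, not the patent's exact rules:

def policy_input_dim(state_ranges):
    # the number of state dimensions fixes the policy network's input size
    return len(state_ranges)

def policy_output_dim(actions):
    # a discrete action contributes one output per option,
    # a continuous action contributes a single output unit
    return sum(a["choices"] if a["discrete"] else 1 for a in actions)

def critic_global_state_dim(state_ranges, num_agents):
    # multi-agent case: the critic sees a global state built from every agent
    return len(state_ranges) * num_agents

actions = [{"discrete": True, "choices": 4},
           {"discrete": False, "low": 0.0, "high": 10.0}]
state_ranges = [(0.0, 1.0), (-5.0, 5.0), (0.0, 100.0)]

print(policy_input_dim(state_ranges))                        # 3
print(policy_output_dim(actions))                            # 5
print(critic_global_state_dim(state_ranges, num_agents=3))   # 9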
Because reinforcement learning updates its policy while continuously interacting with the environment, most reinforcement learning algorithms first collect a certain amount of data after training begins before they start updating the policy; configuring a training-start parameter lets the algorithm treat the period before training starts as a data-collection stage.
Whether to load a previously trained model is indicated by filling in a path. If the path of a previously saved network model and its parameters is provided, the algorithm loads the saved model parameters and the various training hyperparameters after initialization is completed. This allows the previous experiment to continue, saving training time and the effort of re-testing results.
The algorithm module can determine each reinforcement learning algorithm matched with the agent configuration information and the service scene as candidate algorithms, and then displays names of the candidate algorithms to a user through the client, so that the target reinforcement learning algorithm selected by the user is determined in response to selection operation of the user on each candidate algorithm in the client.
Specifically, the server may determine a data type in the current service scenario, where the data type includes continuous data and discrete data, and the server may match the candidate algorithm according to the data type in the current service scenario and the number of target agents.
For example, discrete data is suited to reinforcement learning algorithms such as Deep Q-Network (DQN) and SAC-Discrete (the discrete version of Soft Actor-Critic, SAC), while continuous data is suited to algorithms such as SAC and Proximal Policy Optimization (PPO).
For another example, a single-agent reinforcement learning model is better suited to reinforcement learning algorithms such as DQN, Rainbow, Deep Deterministic Policy Gradient (DDPG), PPO and SAC, while a multi-agent reinforcement learning model is better suited to multi-agent algorithms such as Multi-Agent Proximal Policy Optimization (MAPPO) and QMIX.
Of course, when the user does not perform the selection operation, the algorithm module may also automatically select the target reinforcement learning algorithm from among the candidate reinforcement learning algorithms.
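As a rough sketch of this matching step (the algorithm sets and the filtering rule below are illustrative assumptions, not the patent's exact mapping), candidates can be filtered by agent count and by whether the action data is continuous or discrete, with a default chosen automatically when the user does not select one:

SINGLE_AGENT = {"DQN", "Rainbow", "DDPG", "PPO", "SAC", "SAC-Discrete"}
MULTI_AGENT = {"MAPPO", "QMIX"}
CONTINUOUS_OK = {"DDPG", "PPO", "SAC", "MAPPO"}
DISCRETE_OK = {"DQN", "Rainbow", "PPO", "SAC-Discrete", "MAPPO", "QMIX"}

def candidate_algorithms(num_agents, has_continuous_actions):
    pool = MULTI_AGENT if num_agents > 1 else SINGLE_AGENT
    mask = CONTINUOUS_OK if has_continuous_actions else DISCRETE_OK
    return sorted(pool & mask)

def pick_algorithm(num_agents, has_continuous_actions, user_choice=None):
    candidates = candidate_algorithms(num_agents, has_continuous_actions)
    if user_choice in candidates:
        return user_choice
    return candidates[0]    # automatic selection when the user does not choose

print(candidate_algorithms(1, True))     # ['DDPG', 'PPO', 'SAC']
print(pick_algorithm(3, False))          # 'MAPPO'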
In addition, the algorithm records some customized hyperparameters of its own, together with the parameters corresponding to the algorithm, in the initialization configuration file. After the algorithm is created, it is recorded in the dictionary of the reinforcement learning algorithm module, where the key is the name passed in as a parameter and the value is the created algorithm. This completes the creation of a reinforcement learning simulation experiment.
S104: and constructing a reinforcement learning model based on the target agent, the agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model, and storing data generated in the simulation process into an experience playback pool.
The algorithm module of the server can construct a reinforcement learning model according to the determined target agent, agent configuration information and target reinforcement learning algorithm. For ease of understanding, the present disclosure provides a schematic diagram of a process for constructing a reinforcement learning model, as shown in fig. 2.
Fig. 2 is a schematic diagram of a construction process of a reinforcement learning model provided in the present specification.
After the server obtains the pre-built instantiated reinforcement learning algorithm library and the reinforcement learning environment data input by the user, it can determine the target agents, the number of target agents and the user-defined names, and obtain the state configuration information and action configuration information. A reinforcement learning model is then constructed based on this information to perform the subsequent training tasks.
And then the server can determine the target algorithm configuration information corresponding to the target reinforcement learning algorithm, construct an experience playback pool corresponding to the reinforcement learning model according to at least part (such as the size of the experience playback pool) of the target algorithm configuration information, and simulate the reinforcement learning model according to the simulation configuration.
Specifically, the initial state of one simulation cycle of the reinforcement learning environment may be determined, and the initial state may be a random initial state or a fixed state, or may be a state specified by a user.
After a period starts, the user can choose to call either the training function or the test function; that same function then needs to be called throughout the period to complete the training or testing, which ensures the completeness and chronological order of data collection. The name of the simulation experiment being called, the current observed state of the agent, the reward of the previously executed action, and whether the period has ended are passed in as parameters.
However, because information is acquired passively, the incoming information is displaced in time: what arrives is the state at the current moment, the reward obtained by the previous action, and whether the environment terminated after the previous action. Therefore, when saving data to the experience playback pool, a time buffer pool is used, and data is only passed to the experience playback pool after a complete information track has been collected.
That is, during simulation, for each moment the server may acquire the reinforcement learning environment state observed by each target agent at that moment, the actions performed by each target agent at that moment, and the reward values obtained when each target agent performs different actions; the data obtained at that moment is not treated as the training data corresponding to that moment until the complete data for the moment has been collected, after which the server stores the training data corresponding to that moment in the experience playback pool.
Because the modular reinforcement learning algorithm library is invoked passively by the environment, it is difficult to confirm what has been collected; that is, when information is passed in, the library cannot know on its own whether the environment has been reset or whether the information is useful. For example, at the initial first moment of the environment, only the incoming state is useful information; the reward of the last moment and whether the environment terminated at the last moment do not need to be saved and must be removed, but which incoming state is the first state of a period has to be recorded inside the algorithm. Similarly, if the environment already terminated at the previous moment, the incoming state does not need to be processed by the neural network or saved as unnecessary information. A number of flags inside the training function are used to determine whether incoming information is useful and to determine the environment status.
Thus, in transferring data to the experience playback pool, the server may acquire a status flag of the reinforcement learning environment, determine invalid data among the training data based on the status flag, and store the training data other than the invalid data in the experience playback pool.
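The following is a simplified sketch of that buffering behaviour (names and structure are assumptions): because each call carries the current state together with the reward and termination flag of the previous step, a transition is only written to the pool one call later, and the first call of a period stores nothing.

from collections import deque

class TransitionBuffer:
    def __init__(self, capacity=100_000):
        self.pool = deque(maxlen=capacity)   # the experience playback pool
        self._pending = None                 # (state, action) still waiting for its outcome

    def step(self, state, reward, done, action):
        # called passively once per environment step with time-shifted information
        if self._pending is not None:
            prev_state, prev_action = self._pending
            # only now is the full track (s, a, r, s', done) available
            self.pool.append((prev_state, prev_action, reward, state, done))
        # the first call of a period, and any call after termination,
        # leaves nothing pending to complete
        self._pending = None if done else (state, action)

buf = TransitionBuffer()
buf.step(state=[0.0], reward=0.0, done=False, action=1)   # first call: nothing stored yet
buf.step(state=[0.1], reward=1.0, done=False, action=0)   # completes the first transition
print(len(buf.pool))                                      # 1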
When the end mark set by the user is reached, one simulation period of the reinforcement learning environment ends, and the server may reset the environment for the next period of interaction.
In addition, during simulation the server may first determine that the environment is not in a terminal state; the state is then passed through the policy neural network, and the output is mapped onto the range of the agent's action space through a linear transformation or similar processing.
It should be noted that in the design of most reinforcement learning algorithms, the strategy for acquiring actions during training differs from the one used when later evaluating algorithm performance. During training, action acquisition increases the degree of exploration to avoid falling quickly into a local optimum, for example by taking random actions with a certain probability, adding noise to actions, or sampling actions from the output policy. When evaluating algorithm performance, the optimal action is selected and no extra exploration is needed; in other words, for the environment state at any moment, the target agent may execute several different actions.
Inside the training function, actions are therefore acquired in the corresponding policy form that increases the degree of exploration.
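A minimal epsilon-greedy variant of this training-versus-evaluation behaviour is sketched below; the actual exploration scheme differs per algorithm (random actions, action noise, or sampling the output distribution), so this is only one assumed example:

import random

def select_action(q_values, training, epsilon=0.1):
    # q_values: one estimated value per discrete action
    if training and random.random() < epsilon:
        return random.randrange(len(q_values))                      # explore during training
    return max(range(len(q_values)), key=q_values.__getitem__)      # exploit / evaluate

q = [0.2, 0.7, 0.1]
print(select_action(q, training=True))     # occasionally a random index
print(select_action(q, training=False))    # always 1, the optimal action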
Unlike traditional reinforcement learning training logic, where the training side obtains control of the environment, decides when the environment terminates and restarts and what actions the agents in the environment perform, actively interacts to obtain the information needed for training, and then organizes the collected information to train the neural network, the modular reinforcement learning provided in this specification is embedded in the user-defined environment. The user does not need to care about the details of data collection or how the algorithm is trained; the user only needs to call the training method and pass in the environment information, the training function returns the agent's action for the current moment, and the user simply replaces the agent's original action with the returned action.
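A hedged sketch of that usage pattern follows. The rl object, the experiment name and the train() signature are assumptions standing in for the module's exposed API; the point is only that the user's own loop stays in control and the original action is replaced by the returned one.

class DummyRLModule:
    def train(self, experiment, state, reward, done):
        # a real module would buffer the transition and update its networks here;
        # this stand-in simply returns a fixed placeholder action
        return 0

rl = DummyRLModule()

def run_episode(env_step, initial_state, max_steps=100):
    state, reward, done = initial_state, 0.0, False
    for _ in range(max_steps):
        # the only change to the user's loop: ask the module for the agent's action
        action = rl.train("my_experiment", state, reward, done)
        state, reward, done = env_step(action)    # user-defined environment dynamics
        if done:
            rl.train("my_experiment", state, reward, done)   # pass in the final reward
            break

# example with a trivial stand-in environment that terminates after one step
run_episode(lambda a: ([0.0], 1.0, True), initial_state=[1.0])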
S105: training the reinforcement learning model according to training data obtained from the experience playback pool and the reward function information.
Because reinforcement learning algorithms differ, different reinforcement learning model training processes use different calculation methods to update the policy network. After a certain amount of data has been collected in the experience playback pool, the server can randomly sample training data from the experience playback pool, train the reinforcement learning model according to the training data and the reward function information, and update the model's neural network, so that the actions executed by the target agent in different states obtain the maximum reward value.
Specifically, when the reward target of the reward function is to maximize the reward value, the server may train the reinforcement learning model according to the training data with maximizing the reward value as the optimization target; when the reward target of the reward function is to minimize the reward value, the server may train the reinforcement learning model according to the training data with minimizing the reward value as the optimization target.
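One possible shape of this sampling step is sketched below; the patent does not fix the exact update rule, so negating rewards for a "minimize" target and the helper names are assumptions made purely for illustration:

import random

def sample_batch(pool, batch_size):
    # draw a random mini-batch once enough data has been collected
    return random.sample(pool, batch_size) if len(pool) >= batch_size else []

def prepare_rewards(batch, reward_target):
    # flip the sign for a "minimize" target so the same maximization update applies
    sign = 1.0 if reward_target == "maximize" else -1.0
    return [(s, a, sign * r, s2, done) for (s, a, r, s2, done) in batch]

pool = [([0.0], 0, 1.0, [0.1], False)] * 10
batch = prepare_rewards(sample_batch(pool, 4), reward_target="minimize")
print(batch[0][2])    # -1.0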
After the training goal is met, the server can save the model parameters of the reinforcement learning model.
Furthermore, the server can call the test function through the algorithm module so as to test the performance of the reinforcement learning model, and in the process, the data in the interaction process is not saved, the neural network is not updated, and the optimal action of the strategy is selected.
The user-defined environment can call the save method of the algorithm library at any moment after the algorithm is instantiated, passing in the save path and the simulation experiment name, so that the neural network parameters of the specified simulation experiment are stored together with training-process information such as the optimizer and the learning rate; this ensures that the simulation experiment can be continued later. For ease of understanding, this specification provides a schematic diagram of the training and testing process of the reinforcement learning model, as shown in fig. 3.
FIG. 3 is a schematic diagram of a training and testing process of a reinforcement learning model provided in the present specification.
In any period of the model training process, the server can call the training function, store the data generated during simulation into the experience playback pool in time order, and then obtain data from the experience playback pool to update the parameters of the reinforcement learning model's policy network.
In any period of the model testing process, the server can call the test function, obtain the current action from the incoming information, and then determine the reward information obtained after the current action is executed in order to test the model.
Further, the present disclosure also provides a service execution method applied to the reinforcement learning model trained by the reinforcement learning model training method, as shown in fig. 4.
Fig. 4 is a flow chart of a service execution method provided in the present specification, which includes the following steps:
s401: and acquiring initial state information of the target service scene.
S402: and inputting the initial state information into a service model corresponding to the target service scene to determine an execution strategy corresponding to a target object in the target service scene through the service model, wherein the service model is trained by the reinforcement learning model training method.
S403: and executing the service according to the execution strategy.
Specifically, the server may obtain initial state information of a target service scenario, where the target service scenario may refer to a service scenario corresponding to reinforcement learning environment data in a model training process.
And then the server can input the initial state information into a service model corresponding to the target service scene, so that an execution strategy corresponding to the target object in the target service scene is determined through the service model, wherein the service model can be a reinforcement learning model obtained through training in the mode.
The server may then execute the corresponding service based on the execution policy output by the model.
Taking the intelligent charging scheduling scenario of electric vehicles as an example of the service scenario, the input initial state information may include the amount of electricity currently available at each charging station and the number of charging piles currently available, and of course may also include other information such as the number of vehicles to be charged and the electricity demand of each vehicle to be charged.
After this information is input into the reinforcement learning model, the model can determine and feed back to the user, under the condition that each charging station's revenue is maximized, information such as which electric vehicles each charging station charges, its sellable electricity, and the charging station and purchase quantity required by each electric vehicle. This completes the charging scheduling task for the vehicles to be charged and the charging stations.
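For illustration only, the service-execution step could look roughly like the sketch below; the service model object, its predict() method and the greedy dispatch logic are assumptions standing in for the trained reinforcement learning policy, not the patent's implementation:

class TrainedServiceModel:
    def predict(self, state):
        # stand-in for the trained policy: send every vehicle to the station
        # that currently has the most dispatchable energy
        stations = state["station_available_energy"]
        best = max(range(len(stations)), key=stations.__getitem__)
        return [{"vehicle": v, "station": best, "purchase_kwh": demand}
                for v, demand in enumerate(state["vehicle_demand_kwh"])]

initial_state = {
    "station_available_energy": [120.0, 300.0],   # per charging station
    "station_available_piles": [4, 6],
    "vehicle_demand_kwh": [30.0, 45.0, 20.0],     # per vehicle to be charged
}

model = TrainedServiceModel()
for dispatch in model.predict(initial_state):
    print(dispatch)    # execute the service according to the returned strategy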
For another example, in an intelligent driving scenario, the target object (agent) may be an unmanned device that is driving, and the initial state information may include information such as the vehicle speed, the distance between vehicles and the distance to the destination in the driving scenario. After the initial state information is input into the reinforcement learning model, the model can output the optimal driving strategy corresponding to the unmanned device, which is then fed back to the driving device to execute the intelligent driving task.
As can be seen from the above, this scheme adopts a passive calling mode for reinforcement learning, which greatly lowers the threshold for a user to add a custom environment: the user's own environment does not need to be heavily modified to adapt to the reinforcement learning function, and a simulation experiment can be created and used by adding or modifying only a few lines of code. Because a modular, configurable way of creating and using reinforcement learning is adopted and embedded into the user-defined environment, the custom environment does not need to be packaged or uploaded to other reinforcement learning platforms, which avoids the insecurity caused by moving information across platforms.
The invention implements multiple reinforcement learning algorithms, supports both single-agent and multi-agent algorithms, and satisfies the conditions for using simulation experiments in more custom environments. Construction and use of reinforcement learning model simulation experiments have been realized, and clear optimization effects have been obtained on various custom environments currently built on an independently constructed simulation platform.
Reinforcement learning algorithms are used, with a low threshold, to optimize agent behavior in a custom-created environment. A reinforcement learning algorithm can be used simply by determining the optimization target on top of the configured environment, and the user can also select another suitable reinforcement learning algorithm instead of the default recommended one, which greatly improves configurability.
Similar products in the industry have limited algorithm library support and do not support multi-agent reinforcement learning algorithms. On the basis of covering more of the reinforcement learning algorithms commonly used in industry, the invention also adapts several multi-agent reinforcement learning algorithms to enable application in multi-agent scenarios, supporting both homogeneous-agent and heterogeneous-agent scenarios within multi-agent reinforcement learning.
In similar products in the industry, the environment package built by the user has to be sent to a reinforcement learning server for training, which effectively makes reinforcement learning the main body that calls the environment module; this causes cross-platform information insecurity and requires extensive modification of the custom environment to match the reinforcement learning function. In this patent, the reinforcement learning library is built as a module, and a reinforcement learning simulation experiment can be constructed and used by adding or modifying only a few lines of code in the original custom environment, with the reinforcement learning function module embedded in the environment so that the environment remains the main body. Because reinforcement learning is embedded into the custom environment as a module, construction and use of the algorithm can be completed on the same platform, which is safer than a cross-platform approach. The environment itself does not need to be heavily modified and no cross-platform operation configuration is needed, making it easier to use.
The reinforcement learning algorithm library is packaged so that only a small number of APIs are exposed for the environment to call, with the environment as the main body calling the reinforcement learning module. Embedding the reinforcement learning function in the environment to replace the actions executed by the agent makes the module easy to call while also making it possible to visualize how the agent's performance changes during training.
Compared with traditional ways of creating reinforcement learning for users without a professional reinforcement learning background, this modular, configurable method of constructing and using reinforcement learning simulation experiments allows experiments to be adjusted at a finer granularity on the basis of a small number of exposed APIs, greatly improving the user's autonomy over reinforcement learning experiments without raising the expertise threshold.
The above is the reinforcement learning model training and service execution method provided by one or more embodiments of this specification; based on the same idea, this specification further provides corresponding reinforcement learning model training and service execution apparatuses, as shown in fig. 5 and fig. 6.
Fig. 5 is a schematic diagram of a reinforcement learning model training device provided in the present specification, including:
an obtaining module 501, configured to obtain reinforcement learning environment data constructed by a user for a specified service scenario;
A selection module 502, configured to determine, in response to a specified operation of the user, a target agent selected by the user based on the reinforcement learning environment data, and determine agent configuration information corresponding to the target agent, where the agent configuration information includes: the number of agents, reward function information, action configuration information corresponding to the actions each target agent can execute, and state configuration information corresponding to the reinforcement learning environment states each target agent can acquire;
a determining module 503, configured to determine a target reinforcement learning algorithm matching the service scenario and the agent configuration information in a preset algorithm library;
a building module 504, configured to build a reinforcement learning model based on the target agent, the agent configuration information, and the target reinforcement learning algorithm, simulate the reinforcement learning model, and store data generated in the simulation process into an experience playback pool;
the training module 505 is configured to train the reinforcement learning model according to training data obtained from the experience playback pool and the reward function information.
Optionally, the action configuration information includes: the action types of the actions the target agent can execute and the variation range of each action when it is executed; the state configuration information includes: the dimensions of the states the target agent can acquire and the variation range of the state in each dimension.
Optionally, the selecting module 502 is specifically configured to determine, in the algorithm library, each reinforcement learning algorithm that matches the service scenario and the agent configuration information as a candidate algorithm; the target reinforcement learning algorithm is determined in response to a selection operation performed by the user for each candidate algorithm.
Optionally, the building module 504 is specifically configured to, for each time, obtain, as training data corresponding to the time, a reinforcement learning environment state obtained by each target agent at the time, an action executed by each target agent at the time, and a reward value obtained when each target agent at the time executes different actions; and storing training data corresponding to each moment in the experience playback pool.
Optionally, the building module 504 is specifically configured to obtain a status flag of the reinforcement learning environment; based on the status flag, invalid data in the training data is determined, and training data other than the invalid data is stored in the experience playback pool.
Optionally, the algorithm library is provided with a plurality of reinforcement learning algorithms and algorithm configuration information corresponding to the different reinforcement learning algorithms, where the algorithm configuration information includes: at least one of a reward discount coefficient, a learning rate, the size of the experience playback pool corresponding to each reinforcement learning algorithm, a batch size, a gradient clipping coefficient and an entropy clipping coefficient;
The selection module 502 is specifically configured to determine a target reinforcement learning algorithm that matches the specified service scenario and the agent configuration information in a preset algorithm library, and determine target algorithm configuration information corresponding to the target reinforcement learning algorithm.
Optionally, before simulating the reinforcement learning model, the selecting module 502 is further configured to determine, in response to a specified operation of the user, a simulation configuration, where the simulation configuration at least includes: the maximum number of simulation segments, the maximum number of time steps contained in each simulation segment and the termination condition of each simulation segment;
the construction module 504 is specifically configured to construct an experience playback pool corresponding to the reinforcement learning model according to the target algorithm configuration information; and simulating the reinforcement learning model according to the simulation configuration.
Fig. 6 is a schematic diagram of a service execution device provided in the present specification, including:
the acquiring module 601 is configured to acquire initial state information of a target service scenario;
the input module 602 is configured to input the initial state information into a service model corresponding to the target service scenario, so as to determine, through the service model, an execution policy corresponding to a target object in the target service scenario, where the service model is obtained by training with the reinforcement learning model training method described above;
And the executing module 603 is configured to execute a service according to the executing policy.
The present specification also provides a computer readable storage medium having stored thereon a computer program operable to perform a reinforcement learning model training method as provided in fig. 1 above.
The present specification also provides a schematic structural diagram, shown in fig. 7, of an electronic device corresponding to fig. 1. As shown in fig. 7, at the hardware level the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and of course may also include the hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the reinforcement learning model training or service execution method described above with reference to fig. 1 or fig. 4. Of course, apart from a software implementation, this specification does not exclude other implementations, such as logic devices or combinations of hardware and software; that is, the execution subject of the above processing flows is not limited to logic units, and may also be hardware or logic devices.
Improvements to a technology could once be clearly distinguished as improvements in hardware (for example, improvements to circuit structures such as diodes, transistors and switches) or improvements in software (improvements to the method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (such as a Field Programmable Gate Array, FPGA) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a specific programming language called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can easily be obtained simply by lightly programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of the controller include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller in the form of pure computer-readable program code, it is entirely possible to logically program the method steps so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included therein for implementing various functions may also be regarded as structures within the hardware component. Or even, the means for implementing various functions may be regarded both as software modules implementing the method and as structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described by dividing its functions into various units. Of course, when this specification is implemented, the functions of the units may be implemented in one or more pieces of software and/or hardware.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present description is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the specification. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include a volatile memory in a computer-readable medium, a Random Access Memory (RAM), and/or a non-volatile memory, such as a Read-Only Memory (ROM) or a flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technologies, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
This specification may be described in the general context of computer-executable instructions, such as program modules, executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. This specification may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in this specification are described in a progressive manner; for identical or similar parts of the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations of this specification will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this specification shall be included within the scope of the claims of this specification.

Claims (12)

1. A method of training a reinforcement learning model, comprising:
obtaining reinforcement learning environment data constructed by a user for a specified service scene;
in response to the specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data, and determining agent configuration information corresponding to the target agent, wherein the agent configuration information comprises: the number of target agents, reward function information, action configuration information corresponding to actions executable by each target agent, and state configuration information corresponding to the reinforcement learning environment states acquired by each target agent;
determining a target reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in a preset algorithm library;
constructing a reinforcement learning model based on the target intelligent agent, the intelligent agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model, and storing data generated in the simulation process into an experience playback pool;
training the reinforcement learning model according to training data obtained from the experience playback pool and the reward function information.
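To illustrate how the steps of claim 1 fit together, here is a toy Python sketch of the claimed pipeline; every name (AgentConfig, ReplayPool, select_algorithm, train) is an assumed placeholder, and the environment dynamics, the algorithm matching rule, and the "training" step are deliberately reduced to stand-ins rather than the patented implementation.

```python
# Toy sketch of the claimed flow (all names and values are hypothetical): configure the
# target agents -> match an algorithm in a preset library -> simulate the model and fill
# an experience playback pool -> train from transitions sampled out of that pool.
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class AgentConfig:
    num_agents: int                              # number of target agents
    reward_fn: Callable[[List[float]], float]    # reward function information
    action_range: Tuple[float, float]            # action configuration information
    state_dim: int                               # state configuration information


class ReplayPool:
    def __init__(self, capacity: int) -> None:
        self.capacity, self.data = capacity, []

    def add(self, transition) -> None:
        self.data = (self.data + [transition])[-self.capacity:]

    def sample(self, batch_size: int):
        return random.sample(self.data, min(batch_size, len(self.data)))


def select_algorithm(scene: str, cfg: AgentConfig) -> str:
    # Stand-in for matching a target algorithm against the scenario and agent config.
    return "mappo" if cfg.num_agents > 1 else "ppo"


def train(scene: str, cfg: AgentConfig, episodes: int = 5, steps: int = 20) -> None:
    algorithm = select_algorithm(scene, cfg)
    pool = ReplayPool(capacity=1000)
    for _ in range(episodes):                               # simulate the model
        state = [0.0] * cfg.state_dim
        for _ in range(steps):
            action = random.uniform(*cfg.action_range)
            next_state = [s + random.gauss(0.0, 0.1) + action for s in state]
            pool.add((state, action, cfg.reward_fn(next_state), next_state))
            state = next_state
    batch = pool.sample(32)                                 # train from the replay pool
    print(f"algorithm={algorithm}, updated on {len(batch)} sampled transitions")


train("specified_scene", AgentConfig(2, lambda s: -abs(sum(s)), (-1.0, 1.0), 4))
```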
2. The method of claim 1, wherein the action configuration information comprises: the action type of the action which can be executed by the target intelligent agent and the action change range when each action is executed;
the state configuration information includes: the state dimensions of the states that the target agent can acquire, and the state change range corresponding to the state of each dimension.
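As a purely illustrative reading of claim 2, the action and state configuration information could be represented by simple data structures such as the following; the field names and example values are assumptions rather than the patented schema.

```python
# Hypothetical containers for the action/state configuration of one target agent.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ActionConfig:
    action_types: List[str]                    # types of actions the agent can execute
    action_ranges: List[Tuple[float, float]]   # change range when each action is executed


@dataclass
class StateConfig:
    state_dims: List[str]                      # dimensions of states the agent can acquire
    state_ranges: List[Tuple[float, float]]    # change range of each state dimension


vehicle_actions = ActionConfig(["throttle", "steering"], [(0.0, 1.0), (-30.0, 30.0)])
vehicle_states = StateConfig(["speed", "heading"], [(0.0, 50.0), (0.0, 360.0)])
```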
3. The method of claim 1, wherein determining a target reinforcement learning algorithm matching the business scenario and the agent configuration information in a preset algorithm library, specifically comprises:
determining each reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in the algorithm library as a candidate algorithm;
the target reinforcement learning algorithm is determined in response to a selection operation performed by the user for each candidate algorithm.
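One possible reading of claim 3 in code: first filter the preset library down to candidate algorithms compatible with the scenario and the agent configuration, then take the user's choice among the candidates as the target algorithm. The library contents and the matching rule below are illustrative assumptions, not the actual library.

```python
# Hypothetical candidate filtering followed by (simulated) user selection.
ALGORITHM_LIBRARY = {
    "dqn":   {"action_space": "discrete",   "multi_agent": False},
    "ppo":   {"action_space": "continuous", "multi_agent": False},
    "mappo": {"action_space": "continuous", "multi_agent": True},
}


def candidate_algorithms(action_space: str, num_agents: int):
    # Keep only algorithms whose metadata matches the scenario and agent configuration.
    return [name for name, meta in ALGORITHM_LIBRARY.items()
            if meta["action_space"] == action_space
            and meta["multi_agent"] == (num_agents > 1)]


candidates = candidate_algorithms("continuous", num_agents=3)   # -> ["mappo"]
target_algorithm = candidates[0]   # in practice this choice comes from the user's selection
```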
4. The method of claim 1, wherein simulating the reinforcement learning model to obtain simulation data comprises:
for each moment, acquiring the reinforcement learning environment state acquired by each target agent at the moment, the action executed by each target agent at the moment, and the reward value obtained when each target agent executes different actions at the moment, as training data corresponding to the moment;
And storing training data corresponding to each moment in the experience playback pool.
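A toy sketch of the per-moment collection described in claim 4: at each time step, the environment state observed by each target agent, the action that agent executed, and the resulting reward are recorded and stored in the experience playback pool. The environment transition and reward used here are placeholder assumptions.

```python
# Illustrative per-agent, per-time-step collection of training data.
import random

replay_pool = []                              # experience playback pool (simplified to a list)
num_agents, num_steps = 2, 5
states = [[0.0, 0.0] for _ in range(num_agents)]

for t in range(num_steps):
    for agent_id in range(num_agents):
        state = states[agent_id]
        action = random.choice([-1.0, 0.0, 1.0])            # action executed at moment t
        next_state = [s + 0.1 * action for s in state]      # toy environment transition
        reward = -abs(sum(next_state))                      # reward obtained for this action
        replay_pool.append((t, agent_id, state, action, reward, next_state))
        states[agent_id] = next_state

print(f"collected {len(replay_pool)} transitions")
```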
5. The method of claim 4, wherein storing training data corresponding to each time instant in the experience playback pool specifically comprises:
acquiring a state mark of the reinforcement learning environment;
based on the status flag, invalid data in the training data is determined, and training data other than the invalid data is stored in the experience playback pool.
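For claim 5, a small sketch of dropping invalid transitions according to a state flag of the reinforcement learning environment before storage; the flag name and the criterion for "invalid" are assumptions made for illustration only.

```python
# Hypothetical filter: transitions carrying an invalid environment-state flag
# (for example, data produced after an abnormal environment reset) are not stored.
def store_valid(transitions, replay_pool, invalid_flag="env_error"):
    for transition in transitions:
        if transition.get("state_flag") == invalid_flag:
            continue                        # skip invalid data
        replay_pool.append(transition)      # store everything else in the replay pool


pool = []
store_valid(
    [{"state_flag": "ok", "reward": 1.0}, {"state_flag": "env_error", "reward": 0.0}],
    pool,
)
print(len(pool))   # -> 1
```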
6. The method of claim 1, wherein a plurality of reinforcement learning algorithms and algorithm configuration information corresponding to different reinforcement learning algorithms are provided in the algorithm library, the algorithm configuration information comprising: at least one of a reward discount coefficient, a learning rate, the size of the experience playback pool corresponding to different reinforcement learning algorithms, a batch size, a gradient clipping coefficient, and an entropy clipping coefficient;
wherein determining a target reinforcement learning algorithm matched with the specified service scene and the intelligent agent configuration information in a preset algorithm library specifically comprises:
and determining a target reinforcement learning algorithm matched with the specified service scene and the intelligent agent configuration information in a preset algorithm library, and determining target algorithm configuration information corresponding to the target reinforcement learning algorithm.
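To picture the per-algorithm configuration information of claim 6, a configuration table along the following lines could be kept in the algorithm library; the hyperparameter values shown are placeholders, not values disclosed in the specification.

```python
# Hypothetical algorithm library configuration: each entry carries its own reward discount
# coefficient, learning rate, experience playback pool size, batch size, and clipping terms.
ALGORITHM_CONFIG = {
    "dqn": {
        "reward_discount": 0.99, "learning_rate": 1e-3,
        "replay_pool_size": 100_000, "batch_size": 64,
        "gradient_clip": 10.0, "entropy_clip": 0.0,
    },
    "ppo": {
        "reward_discount": 0.98, "learning_rate": 3e-4,
        "replay_pool_size": 2_048, "batch_size": 256,
        "gradient_clip": 0.5, "entropy_clip": 0.01,
    },
}

target_algorithm = "ppo"
target_algorithm_config = ALGORITHM_CONFIG[target_algorithm]   # target algorithm configuration
```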
7. The method of claim 6, wherein prior to simulating the reinforcement learning model, the method further comprises:
in response to the user's specified operation, determining a simulation configuration comprising at least: the maximum number of simulation fragments, the maximum number of time steps contained in each simulation fragment and the termination condition of each simulation fragment;
simulating the reinforcement learning model, and storing data generated in the simulation process into an experience playback pool, wherein the method specifically comprises the following steps:
constructing an experience playback pool corresponding to the reinforcement learning model according to the target algorithm configuration information;
and simulating the reinforcement learning model according to the simulation configuration.
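For claim 7, a minimal sketch of a simulation configuration (maximum number of simulation fragments, maximum time steps per fragment, and a termination condition) driving the simulation loop, with the experience playback pool sized from the target algorithm configuration; every name and value here is an illustrative assumption.

```python
# Hypothetical simulation configuration and the loop it drives.
import random
from collections import deque

sim_config = {
    "max_episodes": 3,                               # maximum number of simulation fragments
    "max_steps": 10,                                 # maximum time steps per fragment
    "terminate": lambda state: abs(state) > 1.0,     # per-fragment termination condition
}
replay_pool = deque(maxlen=2_048)   # capacity taken from the target algorithm configuration

for episode in range(sim_config["max_episodes"]):
    state = 0.0
    for step in range(sim_config["max_steps"]):
        action = random.uniform(-0.5, 0.5)
        next_state = state + action
        replay_pool.append((state, action, -abs(next_state), next_state))
        state = next_state
        if sim_config["terminate"](state):           # end the fragment early if triggered
            break

print(f"simulated {len(replay_pool)} transitions")
```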
8. A service execution method, comprising:
acquiring initial state information of a target service scene;
inputting the initial state information into a service model corresponding to the target service scene to determine an execution strategy corresponding to a target object in the target service scene through the service model, wherein the service model is obtained through training by the method of any one of claims 1-7;
and executing the service according to the execution strategy.
9. A reinforcement learning model training device, comprising:
the acquisition module is used for acquiring reinforcement learning environment data constructed by a user aiming at a specified service scene;
the selection module is used for responding to the specified operation of the user, determining a target agent selected by the user based on the reinforcement learning environment data, and determining agent configuration information corresponding to the target agent, wherein the agent configuration information comprises: the number of target agents, reward function information, action configuration information corresponding to actions executable by each target agent, and state configuration information corresponding to the reinforcement learning environment states acquired by each target agent;
the determining module is used for determining a target reinforcement learning algorithm matched with the service scene and the intelligent agent configuration information in a preset algorithm library;
the building module is used for building a reinforcement learning model based on the target intelligent agent, the intelligent agent configuration information and the target reinforcement learning algorithm, simulating the reinforcement learning model and storing data generated in the simulation process into an experience playback pool;
and the training module is used for training the reinforcement learning model according to the training data obtained from the experience playback pool and the rewarding function information.
10. A service execution apparatus, comprising:
the acquisition module is used for acquiring initial state information of a target service scene;
the input module is used for inputting the initial state information into a service model corresponding to the target service scene so as to determine an execution strategy corresponding to a target object in the target service scene through the service model, wherein the service model is obtained through training by the method of any one of claims 1-7;
and the execution module is used for executing the service according to the execution strategy.
11. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-8.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-8 when executing the program.
CN202410171178.6A 2024-02-06 2024-02-06 Reinforced learning model training and service executing method and device and electronic equipment Active CN117725985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410171178.6A CN117725985B (en) 2024-02-06 2024-02-06 Reinforced learning model training and service executing method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN117725985A true CN117725985A (en) 2024-03-19
CN117725985B CN117725985B (en) 2024-05-24

Family

ID=90205618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410171178.6A Active CN117725985B (en) 2024-02-06 2024-02-06 Reinforced learning model training and service executing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN117725985B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111178545A (en) * 2019-12-31 2020-05-19 中国电子科技集团公司信息科学研究院 Dynamic reinforcement learning decision training system
CN113011583A (en) * 2021-03-12 2021-06-22 支付宝(杭州)信息技术有限公司 Training method and device for reinforcement learning model
CN113272825A (en) * 2018-11-21 2021-08-17 亚马逊技术有限公司 Reinforcement learning model training by simulation
CN113919482A (en) * 2021-09-22 2022-01-11 上海浦东发展银行股份有限公司 Intelligent agent training method and device, computer equipment and storage medium
CN114117752A (en) * 2021-11-10 2022-03-01 杭州海康威视数字技术股份有限公司 Method and system for training reinforcement learning model of intelligent agent
CN114282433A (en) * 2021-12-15 2022-04-05 中国科学院深圳先进技术研究院 Automatic driving training method and system based on combination of simulation learning and reinforcement learning
KR20220102395A (en) * 2021-01-13 2022-07-20 부경대학교 산학협력단 System and Method for Improving of Advanced Deep Reinforcement Learning Based Traffic in Non signalalized Intersections for the Multiple Self driving Vehicles
CN116151363A (en) * 2022-10-21 2023-05-23 北京鼎成智造科技有限公司 Distributed reinforcement learning system
WO2023102962A1 (en) * 2021-12-06 2023-06-15 深圳先进技术研究院 Method for training end-to-end autonomous driving strategy
WO2023184676A1 (en) * 2022-04-01 2023-10-05 天津七一二通信广播股份有限公司 Implementation method for unmanned aerial vehicle reinforcement learning training system
CN117035122A (en) * 2023-10-08 2023-11-10 之江实验室 Reinforced learning model construction method and device, storage medium and electronic equipment


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yiqun Chen et al.: "PTDE: Personalized Training with Distillated Execution for Multi-Agent Reinforcement Learning", arXiv, 17 October 2022 (2022-10-17), pages 1-13 *
Cheng Cheng et al.: "Research and Development of a Multi-Agent Collaborative Decision-Making Simulation Platform", Journal of System Simulation, vol. 35, no. 12, 22 November 2023 (2023-11-22), pages 2669-2679 *
Xu Nuo et al.: "Multi-Agent Cooperation Based on the MADDPG Algorithm under Sparse Rewards", Modern Computer, no. 15, 25 May 2020 (2020-05-25), pages 47-51 *

Also Published As

Publication number Publication date
CN117725985B (en) 2024-05-24

Similar Documents

Publication Publication Date Title
CN107563512B (en) Data processing method, device and storage medium
CN116521380A (en) Resource self-adaptive collaborative model training acceleration method, device and equipment
CN109391680B (en) Timed task data processing method, device and system
CN116225669B (en) Task execution method and device, storage medium and electronic equipment
CN116432778B (en) Data processing method and device, storage medium and electronic equipment
CN116151363B (en) Distributed Reinforcement Learning System
CN117195997B (en) Model training method and device, storage medium and electronic equipment
CN115981870B (en) Data processing method and device, storage medium and electronic equipment
CN116822657B (en) Method and device for accelerating model training, storage medium and electronic equipment
CN115934344A (en) Heterogeneous distributed reinforcement learning calculation method, system and storage medium
CN108920183A (en) A kind of operational decision making method, device and equipment
CN113298445B (en) Method and device for model training and unmanned equipment scheduling
CN117393140B (en) Intelligent finger ring control method and device based on historical data
CN113407343A (en) Service processing method, device and equipment based on resource allocation
CN117725985B (en) Reinforced learning model training and service executing method and device and electronic equipment
CN117370536A (en) Task execution method and device, storage medium and electronic equipment
CN116091895B (en) Model training method and device oriented to multitask knowledge fusion
CN112990461B (en) Method, device, computer equipment and storage medium for constructing neural network model
CN116521350B (en) ETL scheduling method and device based on deep learning algorithm
CN117077882A (en) Unmanned equipment scheduling method and device, storage medium and electronic equipment
CN112306677B (en) Resource scheduling method and device
CN117011118A (en) Model parameter updating method, device, computer equipment and storage medium
CN116415103B (en) Data processing method, device, storage medium and electronic equipment
CN114429207B (en) Convolution processing method, device, equipment and medium for feature map
CN117522669A (en) Method, device, medium and equipment for optimizing internal memory of graphic processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant