CN118261228A - Unsupervised data generation framework suitable for offline reinforcement learning - Google Patents

Unsupervised data generation framework suitable for offline reinforcement learning

Info

Publication number
CN118261228A
CN118261228A (application number CN202410391685.0A)
Authority
CN
China
Prior art keywords
target
reinforcement learning
strategy
offline
networks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410391685.0A
Other languages
Chinese (zh)
Inventor
季向阳
何舜成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202410391685.0A priority Critical patent/CN118261228A/en
Publication of CN118261228A publication Critical patent/CN118261228A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G06N3/092 Reinforcement learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Feedback Control In General (AREA)

Abstract

The application relates to the technical field of deep reinforcement learning, and in particular to an unsupervised data generation framework suitable for offline reinforcement learning. The framework comprises: acquiring a plurality of policy networks provided to an agent; performing unsupervised reinforcement learning training based on the plurality of policy networks to obtain a plurality of trained policy networks, and using the trained policy networks to interact with the environment to obtain a plurality of data sets; and labelling the plurality of data sets according to the task objective of offline reinforcement learning, selecting from the labelled data sets a target data set that satisfies a target condition, and performing offline reinforcement learning based on the target data set to obtain a policy network learned offline. This solves the problems in the related art that offline data sets are narrowly distributed, that generalization in the offline reinforcement learning stage is poor, and that an optimal policy cannot be learned when the task objective is unknown.

Description

Unsupervised data generation framework suitable for offline reinforcement learning
Technical Field
The application relates to the technical field of deep reinforcement learning, in particular to an unsupervised data generation framework suitable for offline reinforcement learning.
Background
Reinforcement learning is a relatively complex research area within machine learning. It aims to solve more general problems such as human-machine interaction, games and robot control, and can also be applied to the development of large language models. Reinforcement learning therefore has broad prospects for realizing general artificial intelligence and is a current frontier research field.
Offline reinforcement learning considers the case where an agent has access only to given data and cannot interact with the environment. The data, produced by some existing policy interacting with the environment, comprise the state, the action, the state at the next moment and the return obtained by the agent. Compared with online reinforcement learning, offline reinforcement learning avoids the cost and safety concerns of interacting with the environment and is well suited to sequential decision-making tasks in the real world. However, during learning the agent cannot obtain feedback from the environment and has difficulty correcting its current policy; if an online reinforcement learning algorithm is simply adopted, deviations arise in the probability distribution and in the value-function estimate, so the learned policy performs poorly in practice.
Most current offline reinforcement learning algorithms suppress the value function outside the data distribution, or constrain the policy to stay near the data-set distribution, so the performance of a policy obtained by offline reinforcement learning is strongly affected by the data set. In the related art, an offline data set is usually constructed from a policy generated while an agent is trained under supervision, and this policy interacts with the environment to produce experience data. The policy may be the agent's initial policy, the optimal policy after training converges, or a policy from any point during training; a data set may also be mixed from interaction experience generated by multiple policies. Because such policies are obtained by supervised reinforcement learning under a specific task objective, the distribution of the data set is narrow, generalization in the offline reinforcement learning stage is poor, and once a task objective different from the previous one must be handled, offline reinforcement learning can hardly learn an optimal policy.
Disclosure of Invention
The application provides an unsupervised data generation framework, device, equipment and storage medium suitable for offline reinforcement learning, which are used for solving the problems that the distribution of offline data sets is narrow in the related technology, the generalization performance of the offline reinforcement learning stage is poor, and when a task target is unknown, an optimal strategy cannot be learned.
An embodiment of a first aspect of the present application provides an unsupervised data generation framework suitable for offline reinforcement learning, including the steps of: acquiring a plurality of policy networks provided for an agent; performing unsupervised reinforcement learning training based on the plurality of strategy networks to obtain a plurality of strategy networks after training is completed, and obtaining a plurality of data sets by utilizing interaction between the plurality of strategy networks after training and the environment; labeling the multiple data sets according to the task targets of the offline reinforcement learning, selecting a target data set meeting target conditions from the labeled multiple data sets, and performing the offline reinforcement learning based on the target data set to obtain a policy network learned by the offline learning.
Optionally, performing unsupervised reinforcement learning training based on the plurality of policy networks to obtain a plurality of trained policy networks, including: initializing a plurality of strategy networks and sample pools corresponding to the strategy networks, selecting a target strategy network from the strategy networks, and initializing the track of the target strategy network; acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable observed by the agent at the target moment into a target strategy network, and outputting an action variable at the target moment by the target strategy network; determining an environmental state variable at the next moment according to the action variable at the target moment, adding the environmental state variable at the next moment into the track of the target strategy network until the track of the target strategy network is finished, and adding the environmental state variable at the target moment, the environmental state variable at the next moment, the action variable at the target moment and the task return as samples into a sample pool; and randomly sampling from the sample pool to obtain a state variable sample, carrying out gradient back propagation on the state variable sample to optimize a target strategy network, and obtaining a plurality of trained strategy networks after optimizing the strategy networks.
Optionally, interacting with the environment using the trained plurality of policy networks to obtain a plurality of data sets, comprising: initializing sample pools corresponding to a plurality of strategy networks after training is completed; the sample generating action is repeated for each data set in the sample pool until the sample data in each data set reaches the target number.
Optionally, the sample generating action includes: acquiring an environment state variable observed by the agent at a target moment, inputting the environment state variable into an arbitrary policy network, and outputting, by that policy network, the action variable of the agent at the target moment; determining an environment state variable at the next moment according to the action variable at the target moment; and generating a sample according to the environment state variable at the target moment, the action variable at the target moment and the environment state variable at the next moment.
Optionally, labeling the plurality of data sets according to the task objective of offline reinforcement learning includes: for each of the plurality of data sets, calculating a task return for each sample in each data set according to the task objective of offline reinforcement learning, and labeling the data set based on the task returns.
Optionally, performing offline reinforcement learning based on the target data set to obtain a policy network learned by offline learning, including: the method comprises the steps of obtaining a strategy network and a group of dynamic models, wherein each dynamic model comprises a state transition model and a return model, the state transition model is used for calculating an environmental state variable at the next moment, and the return model is used for calculating task return; inputting the environmental state variable of each sample in the target data set into a strategy network, outputting the action variable by the strategy network, inputting the action variable into all dynamic models, outputting the environmental state variable and task return at the next moment by each dynamic model, and calculating the environmental state variable average value and task return average value of all dynamic models; and generating a new sample according to the environment state variable mean value, the task return mean value and the corresponding sample, and performing offline reinforcement learning on the strategy network based on the new sample and the target data set.
An embodiment of a second aspect of the present application provides an unsupervised data generation apparatus adapted for offline reinforcement learning, including: the acquisition module is used for acquiring a plurality of strategy networks provided for the intelligent agent; the training module is used for performing unsupervised reinforcement learning training based on the strategy networks to obtain a plurality of strategy networks with completed training, and obtaining a plurality of data sets by utilizing interaction between the strategy networks with completed training and the environment; the learning module is used for marking the plurality of data sets according to the task targets of the offline reinforcement learning, selecting the target data sets meeting the target conditions from the marked plurality of data sets, and performing the offline reinforcement learning based on the target data sets to obtain the policy network learned by the offline learning.
Optionally, the training module is further configured to initialize a plurality of policy networks and sample pools corresponding to the plurality of policy networks, select a target policy network from the plurality of policy networks, and initialize a track of the target policy network; acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable observed by the agent at the target moment into a target strategy network, and outputting an action variable at the target moment by the target strategy network; determining an environmental state variable at the next moment according to the action variable at the target moment, adding the environmental state variable at the next moment into the track of the target strategy network until the track of the target strategy network is finished, and adding the environmental state variable at the target moment, the environmental state variable at the next moment, the action variable at the target moment and the task return as samples into a sample pool; and randomly sampling from the sample pool to obtain a state variable sample, carrying out gradient back propagation on the state variable sample to optimize a target strategy network, and obtaining a plurality of trained strategy networks after optimizing the strategy networks.
Optionally, the training module is further configured to initialize sample pools corresponding to the plurality of trained policy networks; the sample generating action is repeated for each data set in the sample pool until the sample data in each data set reaches the target number.
Optionally, the sample generating action includes: acquiring an environment state variable observed by the agent at a target moment, inputting the environment state variable into an arbitrary policy network, and outputting, by that policy network, the action variable of the agent at the target moment; determining an environment state variable at the next moment according to the action variable at the target moment; and generating a sample according to the environment state variable at the target moment, the action variable at the target moment and the environment state variable at the next moment.
Optionally, the learning module is further configured to calculate, for each of the plurality of data sets, a task return for each sample in each data set according to a task objective of offline reinforcement learning, and annotate the data set based on the task returns.
Optionally, the learning module is further configured to obtain a policy network and a set of dynamic models, where each dynamic model includes a state transition model and a reporting model, where the state transition model is used to calculate an environmental state variable at a next moment, and the reporting model is used to calculate task reporting; inputting the environmental state variable of each sample in the target data set into a strategy network, outputting the action variable by the strategy network, inputting the action variable into all dynamic models, outputting the environmental state variable and task return at the next moment by each dynamic model, and calculating the environmental state variable average value and task return average value of all dynamic models; and generating a new sample according to the environment state variable mean value, the task return mean value and the corresponding sample, and performing offline reinforcement learning on the strategy network based on the new sample and the target data set.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to implement an unsupervised data generation framework suitable for offline reinforcement learning as described in the above embodiments.
A fourth aspect embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program for execution by a processor for implementing an unsupervised data generation framework suitable for offline reinforcement learning as described in the above embodiments.
A fifth aspect embodiment of the application provides a computer program product having stored thereon a computer program or instructions which, when executed, is adapted to implement an unsupervised data generation framework suitable for offline reinforcement learning as in the above-described embodiments.
Therefore, the application has the following beneficial effects:
According to the embodiment of the application, a plurality of policy networks can be trained with an unsupervised reinforcement learning method, and the trained policy networks interact with the environment to obtain a plurality of data sets, so that the data collected in the data generation stage cover a sufficiently wide distribution. The data sets are then re-labelled according to the task objective, and the data set with the largest average task return among the plurality of data sets is selected for offline reinforcement learning to obtain a policy network learned offline, so that a good policy can be learned even when the task objective is unknown. This solves the problems in the related art that offline data sets are narrowly distributed, that generalization in the offline reinforcement learning stage is poor, and that an optimal policy cannot be learned when the task objective is unknown.
Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.
Drawings
The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of an unsupervised data generation framework for offline reinforcement learning provided in accordance with an embodiment of the present application;
FIG. 2 is a flow chart of the physical quantity relationship of an unsupervised reinforcement learning stage according to one embodiment of the present application;
FIG. 3 is a flow chart of the physical quantity relationship of the offline reinforcement learning stage according to one embodiment of the present application;
FIG. 4 is a flow chart for generating new samples using a dynamic model provided in accordance with one embodiment of the present application;
FIG. 5 is a block diagram of an unsupervised data generation apparatus adapted for offline reinforcement learning according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.
An unsupervised data generation framework, apparatus, electronic device, storage medium, and computer program product suitable for offline reinforcement learning according to embodiments of the present application are described below with reference to the accompanying drawings. Aiming at the problems mentioned in the background art, the application provides an unsupervised data generation framework suitable for offline reinforcement learning. In this method, a plurality of policy networks are trained with an unsupervised reinforcement learning method, and the trained policy networks interact with the environment to obtain a plurality of data sets, so that the data collected in the data generation stage cover a sufficiently wide distribution; the data sets are then re-labelled according to the task objective, and the data set with the largest average task return is selected for offline reinforcement learning to obtain a policy network learned offline, so that a good policy can be learned even when the task objective is unknown. This solves the problems in the related art that offline data sets are narrowly distributed, that generalization in the offline reinforcement learning stage is poor, and that an optimal policy cannot be learned when the task objective is unknown.
Specifically, fig. 1 is a schematic flow chart of an unsupervised data generation framework suitable for offline reinforcement learning according to an embodiment of the present application.
As shown in fig. 1, the unsupervised data generation framework suitable for offline reinforcement learning includes the steps of:
in step S101, a plurality of policy networks provided to an agent are acquired.
In step S102, an unsupervised reinforcement learning training is performed based on the plurality of policy networks to obtain a plurality of trained policy networks, and a plurality of data sets are obtained by using the plurality of trained policy networks to interact with the environment.
It can be understood that, in the embodiment of the present application, an unsupervised reinforcement learning algorithm may be adopted in which a "pseudo-return" replaces the task return of supervised reinforcement learning, so as to train multiple policy networks for the agent and increase, as much as possible, the coverage of the whole state space by these policy networks. The agent then interacts with the environment using each of the trained policy networks to collect experience data, which is stored in a corresponding data cache.
In one embodiment of the present application, performing unsupervised reinforcement learning training based on a plurality of policy networks to obtain a plurality of trained policy networks includes: initializing a plurality of strategy networks and sample pools corresponding to the strategy networks, selecting a target strategy network from the strategy networks, and initializing the track of the target strategy network; acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable observed by the agent at the target moment into a target strategy network, and outputting an action variable at the target moment by the target strategy network; determining an environmental state variable at the next moment according to the action variable at the target moment, adding the environmental state variable at the next moment into the track of the target strategy network until the track of the target strategy network is finished, and adding the environmental state variable at the target moment, the environmental state variable at the next moment, the action variable at the target moment and the task return as samples into a sample pool; and randomly sampling from the sample pool to obtain a state variable sample, carrying out gradient back propagation on the state variable sample to optimize a target strategy network, and obtaining a plurality of trained strategy networks after optimizing the strategy networks.
For ease of understanding, the number of policy networks in the embodiments of the present application may be taken to be N, and each policy network may be optimized with a reinforcement learning algorithm such as SAC (Soft Actor-Critic). The method comprises the following steps:
Initializing N policy networks π_1, …, π_N and the sample pools D_1 = {}, …, D_N = {} corresponding to the N policy networks. After that, the following steps 21 to 23 are repeated until the preset number of iterations is reached, as shown in fig. 2, where the preset number of iterations may be set according to the actual situation and is not specifically limited. The method comprises the following steps:
Step 21: one of the policy networks l e {1, …, N } is selected randomly or sequentially in turn, and the trajectory s= { } is initialized.
Step 22: at time t, the environment state variable observed by the agent is s_t. The mapping from observation to action in the agent consists of an end-to-end deep neural network. The agent inputs the observed s_t into the policy network π_l to obtain the action variable a_t; after the environment receives a_t, the environment state variable s_{t+1} at the next moment is obtained according to the internal state-transition equation, and s_{t+1} is added to the trajectory: S = S ∪ {s_{t+1}}.
If the trajectory ends at this point, N-1 state-variable samples T_1, …, T_{N-1} are randomly sampled from the pools D_m (where m = 1, …, N, m ≠ l), and W_j = W(S, T_j) is calculated for j = 1, …, N-1 using the estimation method for the primal form of the Wasserstein distance in reference [4]. The subscript of the minimum Wasserstein distance is selected, k = argmin_j W_j, and the pseudo-return algorithm of reference [1] with input (S, T_k) is then used to obtain the pseudo-return r_t for every state in S. The resulting samples are added to the corresponding sample pool: D_l = D_l ∪ {(s_t, a_t, s_{t+1}, r_t)}.
Step 23: a state variable sample is sampled from the sample pool in step 22, and the strategy network of the agent is optimized by gradient back propagation using a deep reinforcement learning algorithm.
It should be noted that, in the embodiment of the present application, the choice of the deep reinforcement learning algorithm is flexible and may vary, including but not limited to SAC.
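As a concrete illustration, the following Python sketch outlines this unsupervised training stage (steps 21 to 23). It is only a sketch: the Wasserstein-distance estimator and the pseudo-return computation are crude placeholders for the methods of references [4] and [1], which are not reproduced in this description, and env_reset, env_step, policies and sac_update are assumed to be caller-supplied callables rather than part of the described framework.

```python
import numpy as np

# Crude placeholder for the Wasserstein-distance estimator cited as reference [4];
# here simply the mean pairwise Euclidean distance between two state sets.
def wasserstein_estimate(S, T):
    S, T = np.asarray(S, dtype=float), np.asarray(T, dtype=float)
    return float(np.mean(np.linalg.norm(S[:, None, :] - T[None, :, :], axis=-1)))

# Crude placeholder for the "pseudo-return" of reference [1]: each state of the
# trajectory is rewarded for being far from its nearest neighbour in T_k.
def pseudo_returns(S, T_k):
    S, T_k = np.asarray(S, dtype=float), np.asarray(T_k, dtype=float)
    d = np.linalg.norm(S[:, None, :] - T_k[None, :, :], axis=-1)
    return d.min(axis=1)

def unsupervised_stage(env_reset, env_step, policies, sac_update,
                       n_iters=1000, batch_size=256):
    """Steps 21 to 23: N policies, N sample pools D_l, pseudo-return labelling."""
    N = len(policies)
    pools = [[] for _ in range(N)]                    # D_1 ... D_N
    for it in range(n_iters):
        l = it % N                                    # step 21: pick a policy in turn
        s, done = env_reset(), False
        traj_states, transitions = [s], []
        while not done:                               # step 22: roll out one trajectory
            a = policies[l](s)
            s_next, done = env_step(a)
            transitions.append((s, a, s_next))
            traj_states.append(s_next)
            s = s_next
        # trajectory finished: compare it with the other policies' pools
        others = [m for m in range(N) if m != l and pools[m]]
        if others:
            T = {m: [tr[0] for tr in pools[m]] for m in others}
            k = min(others, key=lambda m: wasserstein_estimate(traj_states, T[m]))
            r = pseudo_returns(traj_states[:-1], T[k])
        else:
            r = np.zeros(len(transitions))
        for (st, at, st1), rt in zip(transitions, r):
            pools[l].append((st, at, st1, float(rt)))
        if len(pools[l]) >= batch_size:               # step 23: one gradient update
            idx = np.random.randint(len(pools[l]), size=batch_size)
            sac_update(policies[l], [pools[l][i] for i in idx])
    return policies, pools
```

The sac_update callable is where a concrete SAC (or other deep reinforcement learning) implementation would plug in.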
In one embodiment of the application, interacting with the environment using the plurality of trained policy networks to obtain a plurality of data sets includes: initializing sample pools corresponding to the plurality of trained policy networks; and repeating the sample generating action for each data set until the sample data in each data set reaches the target number.
Wherein the sample generating action comprises: acquiring an environment state variable observed by the agent at a target moment, inputting the environment state variable into an arbitrary policy network, and outputting, by that policy network, the action variable of the agent at the target moment; determining an environment state variable at the next moment according to the action variable at the target moment; and generating a sample according to the environment state variable at the target moment, the action variable at the target moment and the environment state variable at the next moment.
It can be appreciated that after the trained N policy networks π_1, …, π_N are obtained, they may interact with the environment to obtain N data sets. As shown in fig. 2, sample pools E_1 = {}, …, E_N = {} corresponding to the N policy networks are initialized. Cycling i from 1 to N, the following step is repeated for each i until the pre-specified data set size M is reached:
Step 21: at time t, the environment state variable observed by the agent is s_t. The mapping from observation to action in the agent consists of an end-to-end deep neural network. The agent inputs the observed s_t into the policy network π_i to obtain the action variable a_t; after the environment receives a_t, the environment state variable s_{t+1} at the next moment is obtained according to the internal state-transition equation, and the triple (s_t, a_t, s_{t+1}) is added to E_i.
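A minimal sketch of this collection loop is given below, under the same assumptions about the caller-supplied env_reset and env_step as in the previous sketch; the triples are stored without reward labels, exactly as in the pools E_i.

```python
def collect_datasets(env_reset, env_step, trained_policies, M):
    """Data-generation stage: one data set E_i per trained policy pi_i,
    filled with (s_t, a_t, s_{t+1}) triples until it holds M samples."""
    datasets = []
    for policy in trained_policies:
        E_i, s, done = [], env_reset(), False
        while len(E_i) < M:
            if done:                      # start a new trajectory when one ends
                s, done = env_reset(), False
            a = policy(s)                 # query the trained policy network
            s_next, done = env_step(a)
            E_i.append((s, a, s_next))    # no reward is recorded at this stage
            s = s_next
        datasets.append(E_i)
    return datasets
```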
In step S103, labeling the multiple data sets according to the task targets of the offline reinforcement learning, selecting a target data set meeting the target conditions from the labeled multiple data sets, and performing the offline reinforcement learning based on the target data set to obtain the policy network learned by the offline learning.
Wherein the target conditions include: the average task return of the data set is greater than the preset return, which may be set according to the actual situation, and is not specifically limited.
It can be appreciated that in the offline reinforcement learning stage, the embodiment of the present application may first label the returns of the multiple data sets in all data caches according to a given task objective and calculate the average return of the sample data in each data set. One or more data sets whose average task return is greater than the preset return are then selected as target data sets and provided to the offline reinforcement learning algorithm, so that the offline reinforcement learning algorithm can obtain a high-return policy on the given task.
In one embodiment of the application, labeling multiple data sets according to task goals of offline reinforcement learning includes: for each of the plurality of data sets, calculating a task return for each sample in each data set according to the task objective of offline reinforcement learning, and labeling the data set based on the task returns.
After obtaining the N data sets E_1, …, E_N, embodiments of the present application may re-label the return values of the data sets according to the current task requirements. Let i loop from 1 to N; for each data set E_i, take all M pieces of data in the data set in turn, calculate the return r_t = r(s_t, a_t) according to the return function r(s, a) determined by the task requirements, and put the quadruple (s_t, a_t, r_t, s_{t+1}) back into data set E_i.
Further, the embodiment of the application cycles i from 1 to N and, for each data set E_i, computes the average return r̄_i over all M pieces of data. The subscript with the largest average return, k = argmax_{i=1,…,N} r̄_i, and the corresponding data set E_k are then selected.
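As a sketch, the re-labelling and selection step can be written as follows; reward_fn stands for the task-determined return function r(s, a) and is assumed to be supplied by the user.

```python
import numpy as np

def relabel_and_select(datasets, reward_fn):
    """Attach task returns r_t = r(s_t, a_t) to every triple and pick the data
    set with the largest average return as the target data set E_k."""
    labelled, mean_returns = [], []
    for E_i in datasets:
        E_i_lab = [(s, a, float(reward_fn(s, a)), s_next) for (s, a, s_next) in E_i]
        labelled.append(E_i_lab)
        mean_returns.append(np.mean([r for (_, _, r, _) in E_i_lab]))
    k = int(np.argmax(mean_returns))      # subscript of the largest average return
    return labelled, k, labelled[k]       # labelled data sets, index k, target E_k
```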
In one embodiment of the present application, performing offline reinforcement learning based on a target data set to obtain a policy network learned by offline learning, including: the method comprises the steps of obtaining a strategy network and a group of dynamic models, wherein each dynamic model comprises a state transition model and a return model, the state transition model is used for calculating an environmental state variable at the next moment, and the return model is used for calculating task return; inputting the environmental state variable of each sample in the target data set into a strategy network, outputting the action variable by the strategy network, inputting the action variable into all dynamic models, outputting the environmental state variable and task return at the next moment by each dynamic model, and calculating the environmental state variable average value and task return average value of all dynamic models; and generating a new sample according to the environment state variable mean value, the task return mean value and the corresponding sample, and performing offline reinforcement learning on the strategy network based on the new sample and the target data set.
Based on the above embodiment, the embodiment of the present application may use any offline reinforcement learning method on the data set E_k to obtain the optimal policy learned offline. Taking a model-based offline reinforcement learning algorithm as an example, as shown in fig. 3, the method comprises the following steps:
Step 31: initializing a set of M dynamic models, each dynamic model including a state transition model Model for reporting(WhereinRepresenting a multivariate normal distribution), for each neural network Φ E { Φ 1,…,φH } a supervised learning method is used on the data set E k to update the network parameters Φ (including both state transition and return models) with maximum likelihood estimates, the parameter update is performed by SGD (StochasticGradient Descent, random gradient descent) method, and the cycle is repeated a predetermined number of times.
Step 32: initialize a policy network π_θ and learn on the data set E_k with a reinforcement learning algorithm such as SAC. For each sample (s_t, a_t, r_t, s_{t+1}) in E_k, h newly generated samples are obtained with the following algorithm and participate in SAC learning together with the existing samples; the steps are as follows:
step 41: given h, given λ, given policy network pi θ, given M dynamic models, given samples (s t,at,rt,st+1).
Step 42: initializing state s=s t+1, initializing sample cell E model = { }, and performing steps 43 to 46a total of h times.
Step 43: action a=pi θ () is obtained using the policy network.
Step 44: inputting (s, a) into all dynamic models, and outputting the estimated environment state variable mean value of the next step stateEstimated task return average
Step 45: calculating rewards with uncertainty penaltiesWherein II F is the Frobenius norm.
Step 46: new samples are takenAdded to E model, and let s=s'.
Step 47: and returning to the sample cell E model.
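A sketch of this rollout routine (steps 41 to 47), reusing the DynamicsModel interface from the previous sketch, might look as follows. The max-over-ensemble Frobenius-norm penalty on the predicted standard deviation is one common instantiation of the uncertainty penalty in step 45 and is an assumption here, as is the policy being a callable that maps a batched state tensor to a batched action tensor.

```python
import torch

@torch.no_grad()
def model_rollout(policy, models, sample, h, lam):
    """Steps 41 to 47: from one real sample, generate h synthetic samples whose
    rewards are penalised by the ensemble's predictive uncertainty."""
    s_t, a_t, r_t, s_t1 = sample                              # step 41
    s = torch.as_tensor(s_t1, dtype=torch.float32).unsqueeze(0)
    E_model = []                                              # step 42
    for _ in range(h):
        a = policy(s)                                         # step 43
        outs = [m(s, a) for m in models]                      # step 44
        mus = torch.stack([mu for mu, _ in outs])             # (H, 1, state_dim + 1)
        mean = mus.mean(dim=0)
        s_next, r_mean = mean[:, :-1], mean[:, -1]
        # step 45: lambda times the largest Frobenius norm of the predicted
        # (diagonal) standard deviation across the ensemble
        stds = torch.stack([torch.exp(0.5 * lv) for _, lv in outs])
        penalty = torch.stack([torch.norm(sd, p="fro") for sd in stds]).max()
        r_pen = r_mean - lam * penalty
        E_model.append((s.squeeze(0), a.squeeze(0),
                        float(r_pen), s_next.squeeze(0)))     # step 46
        s = s_next
    return E_model                                            # step 47
```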
In summary, the embodiment of the application can collect more widely distributed experience data through the plurality of policy networks obtained by unsupervised reinforcement learning training, then re-label and filter the data by task return and construct an offline data set suited to a given task objective, so that the offline reinforcement learning algorithm can achieve better performance under a variety of task objectives.
According to the unsupervised data generation framework suitable for offline reinforcement learning provided by the embodiment of the application, an unsupervised reinforcement learning method is used to train a plurality of policy networks, and the trained policy networks interact with the environment to obtain a plurality of data sets, so that the data collected in the data generation stage cover a sufficiently wide distribution. The data sets are re-labelled according to the task objective, and the data set with the largest average task return is selected for offline reinforcement learning to obtain a policy network learned offline, so that a good policy can be learned even when the task objective is unknown. This solves the problems in the related art that offline data sets are narrowly distributed, that generalization in the offline reinforcement learning stage is poor, and that an optimal policy cannot be learned when the task objective is unknown.
An unsupervised data generation apparatus for offline reinforcement learning according to an embodiment of the present application will be described next with reference to the accompanying drawings.
Fig. 5 is a block diagram of an unsupervised data generation apparatus for offline reinforcement learning according to an embodiment of the present application.
As shown in fig. 5, the unsupervised data generation apparatus 10 suitable for offline reinforcement learning includes: an acquisition module 100, a training module 200, and a learning module 300.
Wherein, the obtaining module 100 is configured to obtain a plurality of policy networks provided to an agent; the training module 200 is configured to perform unsupervised reinforcement learning training based on a plurality of strategy networks to obtain a plurality of trained strategy networks, and interact with the environment by using the plurality of trained strategy networks to obtain a plurality of data sets; the learning module 300 is configured to label a plurality of data sets according to a task target of offline reinforcement learning, select a target data set meeting a target condition from the labeled plurality of data sets, and perform offline reinforcement learning based on the target data set, so as to obtain a policy network learned by offline learning.
In one embodiment of the present application, the training module 200 is further configured to initialize a plurality of policy networks and sample pools corresponding to the plurality of policy networks, select a target policy network from the plurality of policy networks, and initialize a track of the target policy network; acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable observed by the agent at the target moment into a target strategy network, and outputting an action variable at the target moment by the target strategy network; determining an environmental state variable at the next moment according to the action variable at the target moment, adding the environmental state variable at the next moment into the track of the target strategy network until the track of the target strategy network is finished, and adding the environmental state variable at the target moment, the environmental state variable at the next moment, the action variable at the target moment and the task return as samples into a sample pool; and randomly sampling from the sample pool to obtain a state variable sample, carrying out gradient back propagation on the state variable sample to optimize a target strategy network, and obtaining a plurality of trained strategy networks after optimizing the strategy networks.
In one embodiment of the present application, the training module 200 is further configured to initialize sample pools corresponding to the plurality of policy networks after training is completed; the sample generating action is repeated for each data set in the sample pool until the sample data in each data set reaches the target number.
In one embodiment of the application, the sample generating action comprises: acquiring an environment state variable observed by the agent at a target moment, inputting the environment state variable into an arbitrary policy network, and outputting, by that policy network, the action variable of the agent at the target moment; determining an environment state variable at the next moment according to the action variable at the target moment; and generating a sample according to the environment state variable at the target moment, the action variable at the target moment and the environment state variable at the next moment.
In one embodiment of the present application, the learning module 300 is further configured to calculate, for each of the plurality of data sets, a task report for each sample in each data set based on the task objective of the offline reinforcement learning, and annotate the data set based on the task report.
In one embodiment of the application, the target conditions include: the average task return for the data set is greater than the preset return.
In one embodiment of the present application, the learning module 300 is further configured to obtain a policy network and a set of dynamic models, where each dynamic model includes a state transition model and a reporting model, where the state transition model is used to calculate an environmental state variable at a next moment, and the reporting model is used to calculate task reporting; inputting the environmental state variable of each sample in the target data set into a strategy network, outputting the action variable by the strategy network, inputting the action variable into all dynamic models, outputting the environmental state variable and task return at the next moment by each dynamic model, and calculating the environmental state variable average value and task return average value of all dynamic models; and generating a new sample according to the environment state variable mean value, the task return mean value and the corresponding sample, and performing offline reinforcement learning on the strategy network based on the new sample and the target data set.
It should be noted that the foregoing explanation of the embodiment of the unsupervised data generation framework suitable for offline reinforcement learning is also applicable to the unsupervised data generation device suitable for offline reinforcement learning of this embodiment, and will not be repeated here.
According to the unsupervised data generation device suitable for offline reinforcement learning provided by the embodiment of the application, an unsupervised reinforcement learning method is used to train a plurality of policy networks, and the trained policy networks interact with the environment to obtain a plurality of data sets, so that the data collected in the data generation stage cover a sufficiently wide distribution. The data sets are re-labelled according to the task objective, and the data set with the largest average task return is selected for offline reinforcement learning to obtain a policy network learned offline, so that a good policy can be learned even when the task objective is unknown. This solves the problems in the related art that offline data sets are narrowly distributed, that generalization in the offline reinforcement learning stage is poor, and that an optimal policy cannot be learned when the task objective is unknown.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
A memory 601, a processor 602, and a computer program stored on the memory 601 and executable on the processor 602.
The processor 602, when executing the program, implements the unsupervised data generation framework suitable for offline reinforcement learning provided in the above-described embodiments.
Further, the electronic device further includes:
a communication interface 603 for communication between the memory 601 and the processor 602.
A memory 601 for storing a computer program executable on the processor 602.
The memory 601 may comprise high-speed RAM (Random Access Memory ) memory, and may also include non-volatile memory, such as at least one disk memory.
If the memory 601, the processor 602 and the communication interface 603 are implemented independently, they may be connected to each other through a bus and communicate with each other. The bus may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or only one type of bus.
Alternatively, in a specific implementation, if the memory 601, the processor 602, and the communication interface 603 are integrated on a chip, the memory 601, the processor 602, and the communication interface 603 may perform communication with each other through internal interfaces.
The processor 602 may be a CPU (Central Processing Unit), an ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present application.
The embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above unsupervised data generation framework suitable for offline reinforcement learning.
Embodiments of the present application also provide a computer program product having a computer program or instructions stored thereon that, when executed, is adapted to implement the unsupervised data generation framework adapted for offline reinforcement learning as in the above embodiments.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the application, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims (10)

1. An unsupervised data generation framework adapted for offline reinforcement learning, comprising the steps of:
acquiring a plurality of policy networks provided for an agent;
Performing unsupervised reinforcement learning training based on the plurality of strategy networks to obtain a plurality of trained strategy networks, and utilizing the plurality of trained strategy networks to interact with the environment to obtain a plurality of data sets;
Labeling the multiple data sets according to the task targets of the offline reinforcement learning, selecting a target data set meeting target conditions from the labeled multiple data sets, and performing the offline reinforcement learning based on the target data set to obtain a policy network learned by the offline learning.
2. The unsupervised data generation framework adapted for offline reinforcement learning of claim 1, wherein the performing the unsupervised reinforcement learning training based on the plurality of policy networks results in a plurality of trained policy networks comprising:
Initializing the strategy networks and sample pools corresponding to the strategy networks, selecting a target strategy network from the strategy networks, and initializing the track of the target strategy network;
Acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable observed by the agent at the target moment into the target strategy network, and outputting an action variable at the target moment by the target strategy network;
Determining an environmental state variable at the next moment according to the action variable at the target moment, adding the environmental state variable at the next moment into the track of the target strategy network until the track of the target strategy network is finished, and adding the environmental state variable at the target moment, the environmental state variable at the next moment, the action variable at the target moment and the task return as samples to the sample pool;
And randomly sampling from the sample pool to obtain a state variable sample, carrying out gradient back propagation on the state variable sample to optimize the target strategy network, and obtaining a plurality of trained strategy networks after optimizing the strategy networks.
3. The unsupervised data generation framework adapted for offline reinforcement learning of claim 1, wherein the interacting with the environment using the plurality of trained strategy networks to obtain a plurality of data sets comprises:
initializing sample pools corresponding to the trained multiple strategy networks;
and repeating the sample generation action for each data set in the sample pool until the sample data in each data set reaches the target number.
4. The unsupervised data generation framework adapted for offline reinforcement learning of claim 3, wherein the sample generation actions comprise:
acquiring an environment state variable observed by an agent at a target moment, inputting the environment state variable into an arbitrary strategy network, and outputting, by the arbitrary strategy network, an action variable of the agent at the target moment;
determining an environment state variable at the next moment according to the action variable of the agent at the target moment;
and generating a sample according to the environment state variable at the target moment, the action variable at the target moment and the environment state variable at the next moment.
5. The unsupervised data generation framework adapted for offline reinforcement learning of claim 1, wherein the labeling of the plurality of data sets according to the task objective of offline reinforcement learning comprises:
For each data set in the plurality of data sets, calculating the task return of each sample in each data set according to the task target of offline reinforcement learning, and labeling the data set based on the task return.
6. The unsupervised data generation framework adapted for offline reinforcement learning according to claim 1, wherein said performing offline reinforcement learning based on said target data set to obtain a strategy network learned by offline learning comprises:
The method comprises the steps of obtaining a strategy network and a group of dynamic models, wherein each dynamic model comprises a state transition model and a return model, the state transition model is used for calculating an environment state variable at the next moment, and the return model is used for calculating task return;
Inputting the environmental state variable of each sample in the target data set into the strategy network, outputting the action variable by the strategy network, inputting the action variable into all dynamic models, outputting the environmental state variable and the task return at the next moment by each dynamic model, and calculating the environmental state variable average value and the task return average value of all dynamic models;
And generating a new sample according to the environment state variable mean value, the task return mean value and the corresponding sample, and performing offline reinforcement learning on the strategy network based on the new sample and the target data set.
7. An unsupervised data generation apparatus adapted for offline reinforcement learning, comprising:
the acquisition module is used for acquiring a plurality of strategy networks provided for the intelligent agent;
The training module is used for performing unsupervised reinforcement learning training based on the plurality of strategy networks to obtain a plurality of trained strategy networks, and utilizing the plurality of trained strategy networks to interact with the environment to obtain a plurality of data sets;
And the learning module is used for marking the plurality of data sets according to the task targets of the offline reinforcement learning, selecting a target data set meeting target conditions from the marked plurality of data sets, and performing the offline reinforcement learning based on the target data set to obtain a policy network learned by the offline learning.
8. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the unsupervised data generation framework of any one of claims 1-6 adapted for offline reinforcement learning.
9. A computer readable storage medium having stored thereon a computer program or instructions, which when executed, implements the unsupervised data generation framework of any one of claims 1-6 adapted for offline reinforcement learning.
10. A computer program product having a computer program or instructions stored thereon, which when executed, implements the unsupervised data generation framework of any one of claims 1-6 adapted for offline reinforcement learning.
CN202410391685.0A 2024-04-02 2024-04-02 Unsupervised data generation framework suitable for offline reinforcement learning Pending CN118261228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410391685.0A CN118261228A (en) 2024-04-02 2024-04-02 Unsupervised data generation framework suitable for offline reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410391685.0A CN118261228A (en) 2024-04-02 2024-04-02 Unsupervised data generation framework suitable for offline reinforcement learning

Publications (1)

Publication Number Publication Date
CN118261228A true CN118261228A (en) 2024-06-28

Family

ID=91607533

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410391685.0A Pending CN118261228A (en) 2024-04-02 2024-04-02 Unsupervised data generation framework suitable for offline reinforcement learning

Country Status (1)

Country Link
CN (1) CN118261228A (en)

Similar Documents

Publication Publication Date Title
US11790238B2 (en) Multi-task neural networks with task-specific paths
Kurach et al. Neural random-access machines
Heess et al. Actor-critic reinforcement learning with energy-based policies
CN112116090B (en) Neural network structure searching method and device, computer equipment and storage medium
CN112052948B (en) Network model compression method and device, storage medium and electronic equipment
CN113158608A (en) Processing method, device and equipment for determining parameters of analog circuit and storage medium
CN112329948A (en) Multi-agent strategy prediction method and device
CN109885667A (en) Document creation method, device, computer equipment and medium
CN111416797A (en) Intrusion detection method for optimizing regularization extreme learning machine by improving longicorn herd algorithm
CN111352419B (en) Path planning method and system for updating experience playback cache based on time sequence difference
WO2019006541A1 (en) System and method for automatic building of learning machines using learning machines
CN113591988B (en) Knowledge cognitive structure analysis method, system, computer equipment, medium and terminal
CN113469891A (en) Neural network architecture searching method, training method and image completion method
CN108009635A (en) A kind of depth convolutional calculation model for supporting incremental update
CN116643499A (en) Model reinforcement learning-based agent path planning method and system
CN118153658A (en) Offline reinforcement learning training method, action prediction method, device and medium
CN117252105B (en) Contrast multi-level playback method and assembly for online class increment continuous learning
CN114219066A (en) Unsupervised reinforcement learning method and unsupervised reinforcement learning device based on Watherstein distance
CN113419424A (en) Modeling reinforcement learning robot control method and system capable of reducing over-estimation
Badica et al. An approach of temporal difference learning using agent-oriented programming
CN109697511B (en) Data reasoning method and device and computer equipment
CN118261228A (en) Unsupervised data generation framework suitable for offline reinforcement learning
US20210383243A1 (en) Stable and efficient training of adversarial models by an iterated update operation of second order or higher
Contardo et al. Learning states representations in pomdp
CN114995818A (en) Method for automatically configuring optimized parameters from Simulink model to C language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination