CN113469372A - Reinforcement learning training method, device, electronic equipment and storage medium - Google Patents

Reinforcement learning training method, device, electronic equipment and storage medium

Info

Publication number
CN113469372A
Authority
CN
China
Prior art keywords
training
target
reinforcement learning
data
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110749747.7A
Other languages
Chinese (zh)
Inventor
刘宇
牛雅哲
张明
陈若冰
李楚鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd filed Critical Beijing Sensetime Technology Development Co Ltd
Priority to CN202110749747.7A priority Critical patent/CN113469372A/en
Publication of CN113469372A publication Critical patent/CN113469372A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods


Abstract

The disclosure provides a reinforcement learning training method, a reinforcement learning training apparatus, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a target training task; determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interaction environment of the target training task; allocating computing resources for the target training task in a container cluster, and determining training data in interaction data, the interaction data comprising data generated during interaction between an agent matched with the target training task and the interaction environment; and training the reinforcement learning sub-model based on the computing resources and the training data.

Description

Reinforcement learning training method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a reinforcement learning training method and apparatus, an electronic device, and a storage medium.
Background
Deep reinforcement learning has become the core technical route of decision intelligence in artificial intelligence research. Research and applications based on deep reinforcement learning have reached, or even exceeded, human-level performance in various fields, and the related technology can be applied to a wide range of scenarios that require decision planning.
However, the whole training process of these large-scale reinforcement learning training and evaluation techniques requires a huge amount of computing resources, and the training process also lasts a long time. Moreover, for a new environment or a new problem, the algorithm needs to be redesigned and re-implemented, and the whole training system needs to be rebuilt according to the specific requirements. Therefore, the existing training process requires a large number of deep reinforcement learning researchers and distributed system engineers to obtain a specialized training system through repeated iteration.
Disclosure of Invention
The embodiment of the disclosure at least provides a reinforcement learning training method, a reinforcement learning training device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a reinforcement learning training method, including: acquiring a target training task; determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interaction environment of the target training task; allocating computing resources for the target training task in a container cluster, and determining training data in interaction data, the interaction data comprising data generated during interaction between an agent matched with the target training task and the interaction environment; and training the reinforcement learning sub-model based on the computing resources and the training data.
In the embodiment of the disclosure, after the target training task is obtained, the reinforcement learning sub-model matched with the interaction environment indicated by the target training task is determined among the preset reinforcement learning models, so that the technical scheme of the disclosure can be applied to reinforcement learning models in various interaction environments, which expands the application range of the disclosed technology. Meanwhile, by allocating corresponding computing resources to the target training task through the container cluster and training the reinforcement learning sub-model with the computing resources and the training data, a plurality of target training tasks can be processed in parallel while ensuring that the target training tasks do not affect one another and can be executed normally.
In an optional embodiment, the training the reinforcement learning submodel based on the computing resources and the training data includes: determining target metadata based on the computing resources, wherein the target metadata is metadata of the interactive data; searching target interactive data matched with the target metadata in an interactive data queue; and taking the target interaction data as the training data to train the reinforcement learning sub-model.
In an alternative embodiment, the determining target metadata based on the computing resources includes: determining, based on the computing resources, metadata control nodes allocated by the container cluster for the target training task; the metadata control node is used for managing metadata of interaction data between the intelligent agent and the corresponding interaction environment; selecting the target metadata in the metadata control node.
In an optional implementation manner, the interaction data queue is stored in a device memory of a device deployed in the container cluster.
In the above embodiment, the interaction data is separated into two parts, metadata and full data. After the data generator generates new interaction data, only the metadata containing the basic information needs to be sent to the centralized metadata control node (i.e., the coordinator control node), while the specific full data is transmitted through an intermediate storage component (the interaction data queue). When the trainer loads the metadata, it loads the corresponding interaction data from the intermediate storage component, which avoids data pressure on the centralized coordinator. In this way, the bandwidth pressure on the coordinator control node always stays within a constant range across varying environments, and the intermediate storage component can transmit data of different sizes and generation frequencies using different storage media, thereby balancing efficiency and resource utilization.
In an alternative embodiment, the allocating computing resources for the target training task in the container cluster and determining training data in the interaction data includes: determining a resource demand for the target training task; and distributing the computing resources for the target training task according to the resource surplus of each working node in the container cluster, and determining training data in the interactive data.
In the above embodiment, by determining the computing resources according to the resource demand of the target training task and the resource residual amount of each working node in the container cluster, the matched computing resources can be accurately determined for the target training task, so as to ensure the normal operation of the target training task.
In an optional implementation manner, in a case that the target training task includes a plurality of sub-training tasks, the allocating the computing resources to the target training task according to the remaining amount of resources of each working node in the container cluster, and determining training data in the interaction data includes: determining the resource demand of each sub-training task; distributing sub-computing resources to each sub-training task according to the resource surplus of each working node in the container cluster, and determining training data matched with each sub-training task in the interactive data.
In the above embodiment, the target training task is divided into a plurality of sub-training tasks, and the computing resource is allocated to each sub-training task, so that distributed parallel processing of the plurality of sub-training tasks can be realized, and the execution efficiency of the target training task is improved.
In an optional embodiment, the determining, in a preset reinforcement learning model, a reinforcement learning sub-model to be trained based on the interaction environment of the target training task includes: obtaining a target screening dimension; the target screening dimension comprises at least one of: model type, training algorithm, data processing format and model scale data; and screening in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel.
In the above embodiment, the set target screening dimensions are used for screening among a plurality of preset reinforcement learning models, and a reinforcement learning sub-model that better fits the target training task can be obtained through screening, so that the technical scheme of the present disclosure can be applied to a wider range of interaction environments.
In an optional embodiment, each of the preset reinforcement learning models includes at least one network module; the screening is performed in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel, and the method comprises the following steps: based on the target screening dimension, screening a plurality of target network modules from a plurality of network modules corresponding to the preset reinforcement learning models to obtain a plurality of target network modules; determining a connection relationship between the plurality of target network modules; and determining the reinforcement learning submodel based on the connection relation and the target network modules.
In the above embodiment, by screening the target network modules matched with the target training task from the network modules included in each preset reinforcement learning model, the search range can be expanded within the preset reinforcement learning models, so as to search for a reinforcement learning sub-model that better fits the target training task.
In an optional embodiment, the interaction environment comprises a plurality of different types of interaction environments, and the method further comprises: determining a data demand amount of the training data; creating a plurality of synchronous processes for the interaction environment in a case where the data demand amount meets a preset amount requirement; and acquiring, through the plurality of synchronous processes, interaction data between the agent and the interaction environment, and taking the determined interaction data as the training data.
In the above embodiment, by creating a plurality of synchronous processes for the interactive environment, the efficiency of acquiring interactive data can be accelerated, so as to shorten the training time of the reinforcement learning submodel and improve the training efficiency of the reinforcement learning submodel.
In an optional embodiment, the method further comprises: detecting the training state of the target training task in the process of training the reinforcement learning submodel; and under the condition that the training state is detected to be failed, retraining the reinforcement learning submodel.
In the above embodiment, a corresponding automatic recovery and restart mechanism is provided for interruptions and abnormalities of the target training task. With this recovery and restart mechanism, the training task can run stably for a long time, and the use of all computing resources is scheduled at the cluster level, thereby improving the resource utilization of the whole cluster.
In a second aspect, an embodiment of the present disclosure provides a reinforcement learning training apparatus, including: an obtaining unit, configured to obtain a target training task; a first determining unit, configured to determine, based on the interaction environment of the target training task, a reinforcement learning sub-model to be trained in a preset reinforcement learning model; a resource allocation unit, configured to allocate computing resources for the target training task in a container cluster; a second determining unit, configured to determine training data in interaction data, the interaction data including data generated during interaction between an agent matched with the target training task and the interaction environment; and a training unit, configured to train the reinforcement learning sub-model based on the computing resources and the training data.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect described above, or any possible implementation of the first aspect.
In a fourth aspect, this disclosed embodiment also provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps in the first aspect or any one of the possible implementation manners of the first aspect.
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required in the embodiments are briefly described below. The drawings herein are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the technical solutions of the present disclosure. It should be appreciated that the following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those of ordinary skill in the art can derive other related drawings from these drawings without inventive effort.
FIG. 1 is a flow chart illustrating a reinforcement learning training method provided by an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a specific method for determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on an interaction environment of the target training task in the reinforcement learning training method provided in the embodiment of the present disclosure;
FIG. 3 is a flowchart illustrating a specific method for allocating computing resources for the target training task in a container cluster and determining training data in interactive data in a reinforcement learning training method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a reinforcement learning training apparatus provided by an embodiment of the present disclosure;
fig. 5 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, not all of the embodiments. The components of the embodiments of the present disclosure, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure, presented in the figures, is not intended to limit the scope of the claimed disclosure, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the disclosure without making creative efforts, shall fall within the protection scope of the disclosure.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
The term "and/or" herein merely describes an associative relationship, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
According to research, the whole training process of large-scale reinforcement learning training and evaluation techniques requires a huge amount of computing resources, and the training process lasts a long time. Moreover, for a new environment or a new problem, the algorithm needs to be redesigned and re-implemented, and the whole training system needs to be rebuilt according to the specific requirements. Therefore, the existing training process requires a large number of deep reinforcement learning researchers and distributed system engineers to obtain a specialized training system through repeated iteration.
Based on the above research, the present disclosure provides a reinforcement learning training method, apparatus, electronic device, and storage medium. In the embodiment of the disclosure, after the target training task is obtained, the reinforcement learning sub-model matched with the interaction environment indicated by the target training task is determined among the preset reinforcement learning models, so that the technical scheme of the disclosure can be applied to reinforcement learning models in various interaction environments, which expands the application range of the disclosed technology. Meanwhile, by allocating corresponding computing resources to the target training task through the container cluster and training the reinforcement learning sub-model with the computing resources and the training data, a plurality of target training tasks can be processed in parallel while ensuring that the target training tasks do not affect one another and can be executed normally.
First, words that may appear in the technical solutions of the present disclosure are explained.
Reinforcement learning: the agent (Agent) learns in a trial-and-error manner, guided by the rewards obtained by interacting with the environment; the goal is for the agent to obtain the maximum reward value.
Several basic elements of reinforcement learning: environment, agent, reward, and action. In the following, the principle and idea of reinforcement learning are explained by combining the above elements:
Assume that the brain represents the agent; the trainer of the neural network can operate the agent to make a decision, i.e., select an appropriate Action A1. Assume that the earth represents the environment under study, which contains the corresponding state model. After the brain chooses to perform action A1, the State of the environment may change. At this point, it may be found that the environment state has changed from S(t) to S(t+1), while a delayed Reward R(t+1) is also obtained for the brain taking action A1. The brain can then continue to select the next appropriate action, the state of the environment changes again, and a new reward value is obtained.
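To make the interaction loop above concrete, the following is a minimal Python sketch of an agent-environment loop. The environment and agent classes are hypothetical placeholders used purely for illustration; they are not components defined by this disclosure.

```python
# Minimal sketch of the agent-environment interaction loop described above.
# `SimpleEnv` and `RandomAgent` are hypothetical placeholders, not components
# defined by this disclosure.
import random


class SimpleEnv:
    """A toy environment: the state is a counter; reward is 1 when it grows."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state += action                 # state S(t) -> S(t+1)
        reward = 1.0 if action > 0 else 0.0  # delayed reward R(t+1)
        done = self.state >= 10
        return self.state, reward, done


class RandomAgent:
    """Selects an action; a real agent would use a trained policy."""

    def act(self, state):
        return random.choice([0, 1])


env, agent = SimpleEnv(), RandomAgent()
state, total_reward, done = 0, 0.0, False
while not done:
    action = agent.act(state)               # agent selects an action
    state, reward, done = env.step(action)  # environment returns S(t+1), R(t+1)
    total_reward += reward                  # goal: maximize cumulative reward
print("episode return:", total_reward)
```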
To facilitate understanding of the embodiment, a reinforcement learning training method disclosed in the embodiments of the present disclosure is first described in detail, and an execution subject of the reinforcement learning training method provided in the embodiments of the present disclosure is generally an electronic device with certain computing capability. In some possible implementations, the reinforcement learning training method may be implemented by a processor calling computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a reinforcement learning training method provided in an embodiment of the present disclosure is shown, where the method includes steps S101 to S107, where:
s101: and acquiring a target training task.
Here, there may be one or more target training tasks. For different target training tasks, the processing objects of the reinforcement learning models to be trained can be different or the same.
For example, the plurality of target training tasks include a target training task A and a target training task B, where the target training task A is used to train a reinforcement learning model capable of recognizing images, and the target training task B is used to train a reinforcement learning model that can process audio sequences.
In the embodiment of the present disclosure, the target training task includes task parameters, and the task parameters are used to indicate the type, scale, accuracy, processing object, running environment, and the like of the reinforcement learning model that needs to be obtained. For example, the reinforcement learning model may be intended for an autonomous driving environment, or for a face-scan payment environment.
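As a rough illustration of how such task parameters might be represented in code, the following sketch uses a hypothetical Python dataclass; every field name here is an assumption made for illustration, not a schema prescribed by the disclosure.

```python
# Hypothetical sketch of a target training task description. The field names
# are illustrative assumptions; the disclosure does not prescribe this schema.
from dataclasses import dataclass


@dataclass
class TargetTrainingTask:
    task_id: str
    interaction_env: str      # e.g. "autonomous_driving", "go_board_game"
    model_type: str           # e.g. "image", "audio_sequence"
    model_scale: str          # e.g. "small", "large"
    target_accuracy: float    # required model accuracy
    resource_demand: dict     # e.g. {"gpu": 2, "cpu": 16, "memory_gb": 64}


task_a = TargetTrainingTask(
    task_id="task-a",
    interaction_env="image_recognition",
    model_type="image",
    model_scale="small",
    target_accuracy=0.95,
    resource_demand={"gpu": 1, "cpu": 8, "memory_gb": 32},
)
```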
S103: and determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interactive environment of the target training task.
In the embodiment of the present disclosure, a plurality of reinforcement learning models may be preset, that is, preset reinforcement learning models, where the preset reinforcement learning models specifically include the following types of models:
the DQN (Deep Q-Network) series of value-function-based models and their derivative models; correlation models based on policy gradients and value functions; Monte Carlo tree search models suitable for the discrete decision spaces of board-game environments; VPN/MuZero models that build a model of the environment (model-based reinforcement learning); imitation learning models that combine human expert data and policies; inverse reinforcement learning correlation models such as BC/GAIL (Generative Adversarial Imitation Learning)/SQIL; HER (Hindsight Experience Replay) and related models for sparse-reward problems; RND (Random Network Distillation)/BeBold correlation models for hard-exploration problems; and the QMIX family of multi-agent reinforcement learning models for multi-agent problems.
In addition to the reinforcement learning models described above, other reinforcement learning models may be selected. For example, the predetermined reinforcement learning model may be a reinforcement learning model pre-built by a neural network trainer.
Here, the plurality of preset reinforcement learning models are not fixed, and the user may update the preset reinforcement learning models in real time, where the update includes at least one of: deleting, adding and modifying any one of the preset reinforcement learning models.
The modification here can be understood as adjusting the structure of a preset reinforcement learning model, for example, adding part of the structure, deleting part of the structure, adjusting part of the structure, and so on.
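As a rough illustration of such real-time updates, the following is a minimal sketch of a registry of preset models that supports adding, deleting and modifying entries; the class and method names are assumptions made for illustration, not APIs defined by the disclosure.

```python
# Hypothetical sketch of a registry of preset reinforcement learning models
# that a user can update (add / delete / modify) at any time. The names are
# illustrative assumptions, not components defined by the disclosure.
class PresetModelRegistry:
    def __init__(self):
        self._models = {}   # name -> model builder (a callable)

    def add(self, name, builder):
        self._models[name] = builder

    def delete(self, name):
        self._models.pop(name, None)

    def modify(self, name, builder):
        # Modification is treated here as replacing the model's structure.
        self._models[name] = builder

    def names(self):
        return list(self._models)


registry = PresetModelRegistry()
registry.add("dqn", lambda: "DQN model")            # stand-in builders
registry.add("qmix", lambda: "QMIX model")
registry.modify("dqn", lambda: "Dueling DQN model")  # adjust part of the structure
registry.delete("qmix")
print(registry.names())  # ['dqn']
```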
S105: allocating computing resources for the target training task in the container cluster, and determining training data in interaction data; the interaction data includes data generated during interaction between an agent matched with the target training task and the interaction environment.
In the embodiment of the present disclosure, the container cluster may be a Kubernetes container cluster, where the Kubernetes container cluster includes a Master node and at least one Worker node, and corresponding computing resources, for example, GPUs and CPUs, are deployed on each Worker node.
S107: training the reinforcement learning submodel based on the computing resources and the training data.
In the process of training the reinforcement learning submodel through the training data, the network parameters of the reinforcement learning submodel can be adjusted according to the training result.
In the embodiment of the disclosure, after the target training task is obtained, the reinforcement learning sub-model matched with the interaction environment indicated by the target training task is determined among the preset reinforcement learning models, so that the technical scheme of the disclosure can be applied to reinforcement learning models in various interaction environments, which expands the application range of the disclosed technology. Meanwhile, by allocating corresponding computing resources to the target training task through the container cluster and training the reinforcement learning sub-model with the computing resources and the training data, a plurality of target training tasks can be processed in parallel while ensuring that the target training tasks do not affect one another and can be executed normally.
As can be seen from the above description, in the embodiment of the present disclosure, first, a target training task for a reinforcement learning model is obtained, and then, a reinforcement learning sub-model is determined in a preset reinforcement learning model based on an interaction environment of the target training task.
As shown in fig. 2, for the step S103, determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interaction environment of the target training task, specifically including the following steps:
step S1031, obtaining a target screening dimension; the target screening dimension comprises at least one of: model type, training algorithm, data processing format and model scale data;
and S1032, screening in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel.
In the embodiment of the present disclosure, a model search space may be preset, and the preset reinforcement learning model is a part or all of the models in the model search space.
For example, when it is determined that a smaller-scale reinforcement learning model needs to be trained according to the target training task, a plurality of preset reinforcement learning models corresponding to the smaller-scale reinforcement learning model can be searched from the model search space. And then, determining a reinforcement learning submodel according to a plurality of preset reinforcement learning models.
Here, task parameters of the target training task may be obtained, and then, a target screening dimension may be determined in the task parameters, where the target screening dimension may be one or more. For example, the target screening dimension comprises at least one of: model type, training algorithm, data processing format, model scale data.
The model type may be used to characterize the type of input data and the type of output data. For example, the type of input data may be a combination of pictures, text sequences, voice, numerical values, and time sequences. The type of output data may be: discrete, continuous, or composite. The model type may also be used to characterize the number of agents, i.e., whether there is a single agent or multiple agents, and to characterize whether autoregressive, multi-way-branching, and reparameterization techniques are used, among other properties.
For example, a corresponding coding neural network may be selected from the preset reinforcement learning model according to the type of input data, and a corresponding output prediction neural network may be selected from the preset reinforcement learning model according to the type of output data.
The training algorithm is used to characterize at least one of:
the reward value and the advantage function (n-step TD, MC, TD-lambda, GAE, V-trace); the correction mode selected for training with different policies (off-policy training); the definition of the importance sampling factor; the balance between exploration and exploitation of the agent's output actions; and the selection of the optimizer and the correction of abnormal gradient signals.
The data processing format is used to characterize at least one of:
whether to normalize the observation space and the reward space; whether the normalization statistics are estimated by an exponentially weighted average or a maximum a posteriori estimate; and how to define the quality indicators of the training data, which specifically include: the freshness of a sample, its importance, and the number of times a sample can be reused.
In the above embodiment, the set target screening dimensions are used for screening among a plurality of preset reinforcement learning models, and a reinforcement learning sub-model that better fits the target training task can be obtained through screening, so that the technical scheme of the present disclosure can be applied to a wider range of interaction environments.
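As an illustration of how the quality indicators above (freshness, importance, reuse count) might be combined into a single score for selecting training data, the following is a minimal sketch; the weighting scheme and field names are assumptions for illustration only.

```python
# Hypothetical sketch of combining the quality indicators of a training sample
# (freshness, importance, reuse count) into one priority score. The weighting
# scheme is an illustrative assumption, not part of the disclosure.
import time


def sample_priority(sample_meta, now=None,
                    freshness_weight=1.0, importance_weight=1.0,
                    max_reuse=8):
    """Return a priority score; samples reused too often score zero."""
    now = now or time.time()
    age = now - sample_meta["created_at"]          # seconds since generation
    freshness = 1.0 / (1.0 + age)                  # newer samples score higher
    importance = sample_meta.get("td_error", 0.0)  # e.g. absolute TD error
    if sample_meta.get("reuse_count", 0) >= max_reuse:
        return 0.0                                 # sample exhausted
    return freshness_weight * freshness + importance_weight * importance


meta = {"created_at": time.time() - 5.0, "td_error": 0.3, "reuse_count": 2}
print(sample_priority(meta))
```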
In an optional embodiment, in the case that each of the preset reinforcement learning models includes at least one network module, the step S1032 performs screening in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning sub-model, and specifically includes the following steps:
(1) based on the target screening dimension, screening a plurality of target network modules from a plurality of network modules corresponding to the preset reinforcement learning models to obtain a plurality of target network modules;
(2) determining the connection relation among the target network modules;
(3) and determining the reinforcement learning sub-model based on the connection relation and the target network modules.
In the embodiment of the present disclosure, each preset reinforcement learning model may include at least one network module. For example, the preset reinforcement learning model A may include a network module A1, a network module A2, and a network module A3; the preset reinforcement learning model B may include a network module B1, a network module B2, and a network module B3; and the preset reinforcement learning model C may include a network module C1, a network module C2, and a network module C3.
At this time, at least one target network module may be obtained by screening from the plurality of network modules included in the plurality of preset reinforcement learning models. For example, network module A1, network module B2, network module C1, and network module C2 may be obtained through screening.
After determining the at least one target network module, a connection relationship between the at least one target network module may be determined.
In an alternative embodiment, the connection relationship between at least one target network module may be determined according to the connection relationship between the network modules included in the model search space.
In another alternative embodiment, the connection relationship between at least one target network module may be determined according to the module function of each of the at least one target network module.
After determining the at least one target network module, a reinforcement learning submodel may be determined based on the connection relationship and the at least one target network module.
In embodiments of the present disclosure, the network module may be a convolutional layer module, a pooling layer module, a normalization layer module, an attention module, or the like.
In the above embodiment, by screening the target network modules matched with the target training task from the network modules included in each preset reinforcement learning model, the search range can be expanded within the preset reinforcement learning models, so as to search for a reinforcement learning sub-model that better fits the target training task.
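As a rough illustration of the screening-and-connecting procedure above, the following sketch assembles a sub-model from a small pool of network modules; the module pool, the screening rule and the sequential connection rule are all assumptions made for illustration, not the disclosure's actual search space.

```python
# Hypothetical sketch of assembling a reinforcement learning sub-model from
# screened network modules. Module names, the screening rule and the
# connection rule are illustrative assumptions only.
import torch
import torch.nn as nn

# Network modules drawn from several preset models (stand-ins).
MODULE_POOL = {
    "conv_encoder":    lambda: nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    "mlp_encoder":     lambda: nn.Sequential(nn.Flatten(), nn.Linear(16 * 8 * 8, 64), nn.ReLU()),
    "discrete_head":   lambda: nn.Linear(64, 4),   # 4 discrete actions
    "continuous_head": lambda: nn.Linear(64, 2),   # 2-dim continuous action
}


def screen_modules(input_type, output_type):
    """Pick target modules according to the screening dimensions."""
    encoder = "conv_encoder" if input_type == "image" else "mlp_encoder"
    head = "discrete_head" if output_type == "discrete" else "continuous_head"
    return [encoder, "mlp_encoder", head] if encoder == "conv_encoder" else [encoder, head]


def build_submodel(module_names):
    """Connect the screened modules in sequence to form the sub-model."""
    return nn.Sequential(*[MODULE_POOL[name]() for name in module_names])


submodel = build_submodel(screen_modules("image", "discrete"))
logits = submodel(torch.randn(1, 3, 8, 8))   # a batch of one 8x8 RGB observation
print(logits.shape)  # torch.Size([1, 4])
```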
In an optional embodiment, as shown in fig. 3, for step S105, allocating computing resources to the target training task in the container cluster, and determining training data in the interactive data specifically includes the following steps:
step S1051, determining the resource demand of the target training task;
step S1052, allocating the computing resource to the target training task according to the resource remaining amount of each working node in the container cluster, and determining training data in the interactive data.
In the embodiment of the disclosure, a resource request end (e.g., a trainer of a neural network) may edit a Pod configuration file to indicate, through the information contained in the Pod configuration file, the resource demand of the computing module it requests to be allocated. Here, the resource demand may be understood as the number of GPUs, the number of CPUs, and the amount of memory required to execute the target training task.
In an embodiment of the present disclosure, the computing module may be a container, and the container may be created on a work node of a container cluster for performing the above target training task.
After the resource demand is determined, the resource surplus of each working node in the container cluster can be determined, then the working nodes with the resource surplus larger than the resource demand are determined from the plurality of working nodes, and computing resources are distributed to the target training task in the determined working nodes.
For a target training task, one or more computing modules may be correspondingly allocated to the target training task, and when a plurality of computing modules are allocated, the plurality of computing modules may be respectively deployed on different working nodes. At this time, the target training task can be executed simultaneously by a plurality of computing modules, so that the target training task is executed in a distributed manner.
For different target training tasks, the allocated computing modules can be respectively deployed on different working nodes. With this processing mode, the target training tasks are guaranteed not to affect one another, so that each target training task can be executed normally.
In the embodiment of the present disclosure, when it is determined that all the working nodes in the container cluster are in the working state, or it is determined that the resource surplus of the working nodes in the container cluster is smaller than the resource demand, the computing resources may be allocated to the target training task after the execution of other training tasks is finished.
In the above embodiment, by determining the computing resources according to the resource demand of the target training task and the remaining amount of resources of each working node in the container cluster, matched computing resources can be accurately determined for the target training task, so as to ensure the normal operation of the target training task.
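As a minimal sketch of allocating a task to a working node whose remaining resources cover the task's resource demand, consider the following; the node names and the greedy selection rule are assumptions for illustration only.

```python
# Hypothetical sketch of matching a task's resource demand against the
# remaining resources of each working node. Node names and the greedy
# selection rule are illustrative assumptions, not the disclosure's scheduler.
def pick_worker_node(resource_demand, worker_nodes):
    """Return the first node whose remaining resources cover the demand, else None."""
    for node_name, remaining in worker_nodes.items():
        if all(remaining.get(k, 0) >= v for k, v in resource_demand.items()):
            return node_name
    return None   # no node has enough resources; wait for other tasks to finish


worker_nodes = {
    "worker-1": {"gpu": 1, "cpu": 8,  "memory_gb": 32},
    "worker-2": {"gpu": 4, "cpu": 32, "memory_gb": 128},
}
demand = {"gpu": 2, "cpu": 16, "memory_gb": 64}
print(pick_worker_node(demand, worker_nodes))  # worker-2
```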
In an optional implementation manner, in a case that the target training task includes a plurality of sub-training tasks, step S1052, according to the remaining amount of resources of each working node in the container cluster, allocates the computing resources to the target training task, and determines training data in the interaction data, specifically including the following steps:
(1) determining the resource demand of each sub-training task;
(2) distributing sub-computing resources for each sub-training task according to the resource surplus of each working node in the container cluster, and determining training data matched with each sub-training task in the interactive data.
In the embodiment of the disclosure, the target training task may be divided into a plurality of sub-training tasks under the condition that the target training task satisfies the task division condition.
The target training task satisfying the task dividing condition comprises at least one of the following conditions:
the resource surplus of the working nodes in the container cluster is smaller than the resource demand of the target training task;
the target training task comprises a plurality of model branches which need to be trained through different training data;
the target training task comprises a plurality of training stages, and each training stage corresponds to different training data.
Based on this, when the target training task is divided into a plurality of sub-training tasks, the target training task can be divided into the plurality of sub-training tasks according to the way that different sub-training tasks correspond to different training data.
After the target training task is divided into a plurality of sub-training tasks, the resource demand of each sub-training task can be determined, and sub-computing resources are distributed to each sub-training task according to the resource surplus of each working node in the container cluster.
In the embodiment of the present disclosure, after allocating sub-computing resources to each of the sub-training tasks, it may also be determined in the interaction data that each of the sub-training tasks matches the training data. And then, training the network model corresponding to each sub-training task through the sub-computing resources and the training data.
In an optional embodiment, in step S107, training the reinforcement learning sub-model based on the computing resources and the training data specifically includes the following processes:
(1) determining target metadata based on the computing resources, wherein the target metadata is metadata of the interactive data, and the method specifically comprises the following steps:
determining, based on the computing resources, metadata control nodes allocated by the container cluster for the target training task; the metadata control node is used for managing metadata of interaction data between the intelligent agent and the corresponding interaction environment; then, the target metadata is selected in the metadata control node.
(2) Searching target interactive data matched with the target metadata in an interactive data queue; wherein the interactive data queue is stored in a device memory of a device deployed by the container cluster.
(3) And taking the target interaction data as the training data, training the reinforcement learning sub-model, and obtaining a target reinforcement learning model through training.
For different interaction environments and different reinforcement learning algorithms, the required training data differ greatly. For example, the size of the interaction data may vary, and the generation frequency of the interaction data may also vary, for different interaction environments. In the field of video games, interaction data change relatively quickly; in the field of Go (weiqi), interaction data change relatively slowly, and a group of interaction data is generated only after a game is finished.
Therefore, different interaction environments and different reinforcement learning algorithms impose different requirements on the system, and a single fixed system cannot satisfy the training tasks of multiple environments. To mask these differences, the interaction data may be split into metadata and full data, which are stored separately. For example, the metadata may be stored in the metadata control node and the full data in the interaction data queue.
In determining the training data, the target metadata may be first randomly selected from the metadata control node, and then the target interaction data matching the target metadata may be searched in the interaction data queue. And finally, determining the searched target interaction data as training data.
Because the data volume of the metadata is small, splitting the interaction data into metadata and full data, and searching the metadata to obtain the corresponding interaction data, means that the technical scheme of the present disclosure is not limited by the interaction environment or the reinforcement learning algorithm and can be applied to a wider range of interaction environments.
As can be seen from the above description, in the embodiment of the present disclosure, the interaction data is separated into two parts, metadata and full data. After the data generator generates new interaction data, only the metadata containing the basic information needs to be sent to the centralized metadata control node (i.e., the coordinator control node), while the specific full data is transmitted through an intermediate storage component (the interaction data queue). When the trainer loads the metadata, it loads the corresponding interaction data from the intermediate storage component, which avoids data pressure on the centralized coordinator. In this way, the bandwidth pressure on the coordinator control node always stays within a constant range across varying environments, and the intermediate storage component can transmit data of different sizes and generation frequencies using different storage media, thereby balancing efficiency and resource utilization.
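The following sketch illustrates the metadata/full-data split described above: the data generator stores the full interaction data in an intermediate store and sends only lightweight metadata to the coordinator, and the trainer samples metadata before loading the matching full data. The class and field names are assumptions for illustration, not components defined by the disclosure.

```python
# Hypothetical sketch of the metadata / full-data split. Class and field
# names are illustrative assumptions, not components of the disclosure.
import random
import uuid


class InteractionDataQueue:
    """Intermediate storage component holding the full interaction data."""
    def __init__(self):
        self._store = {}

    def put(self, data_id, full_data):
        self._store[data_id] = full_data

    def get(self, data_id):
        return self._store[data_id]


class Coordinator:
    """Centralized metadata control node; only sees lightweight metadata."""
    def __init__(self):
        self.metadata = []

    def register(self, meta):
        self.metadata.append(meta)

    def sample(self):
        return random.choice(self.metadata)


queue, coordinator = InteractionDataQueue(), Coordinator()

# Data generator side: store the full data, register only the metadata.
data_id = str(uuid.uuid4())
queue.put(data_id, {"observations": [0.1, 0.2], "actions": [1, 0], "rewards": [0.0, 1.0]})
coordinator.register({"data_id": data_id, "env": "go", "num_steps": 2})

# Trainer side: pick a metadata record, then load the matching full data.
meta = coordinator.sample()
training_batch = queue.get(meta["data_id"])
print(training_batch["rewards"])
```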
In the embodiment of the present disclosure, a shared object store may be created in advance in a device memory of a device deployed in the container cluster, where the shared object store is used to store an interaction data queue.
This addresses problems that an ordinary file system (hard disk) and a distributed file system (such as Ceph) exhibit in actual reinforcement learning training, such as low read/write speed, unstable reads and writes, and contention for distributed locks. In the technical scheme of the present disclosure, a cross-machine shared-memory object store is built in the device memory based on plasma and gRPC. With this shared-memory object store, data transmission can be kept in memory as much as possible, which greatly improves data transmission efficiency; a corresponding preloading and broadcasting mechanism is also supported, making the store suitable for some special reinforcement learning data transmission scenarios.
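The disclosure builds its store on plasma and gRPC; the following is only a rough stand-in that illustrates the shared-memory idea using the Python standard library, and does not reproduce the plasma or gRPC APIs or the preloading and broadcasting mechanisms.

```python
# Rough stand-in for an in-memory object store, using only the Python standard
# library; the disclosure's actual store is built on plasma and gRPC, whose
# APIs are not reproduced here.
import numpy as np
from multiprocessing import shared_memory

# Producer: place a batch of interaction data in a named shared-memory block.
batch = np.random.rand(128, 84, 84).astype(np.float32)
shm = shared_memory.SharedMemory(create=True, size=batch.nbytes, name="interaction_batch")
buf = np.ndarray(batch.shape, dtype=batch.dtype, buffer=shm.buf)
buf[:] = batch

# Consumer (another process on the same machine): attach by name, read without copying.
shm2 = shared_memory.SharedMemory(name="interaction_batch")
loaded = np.ndarray((128, 84, 84), dtype=np.float32, buffer=shm2.buf)
assert loaded[0, 0, 0] == batch[0, 0, 0]

shm2.close()
shm.close()
shm.unlink()   # release the shared-memory block
```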
In the embodiment of the present disclosure, in a case where the interaction environment includes a plurality of different types of interaction environments, the method further includes the following steps:
(1) determining the data demand of the training data;
(2) under the condition that the data demand quantity meets the requirement of a preset quantity, a plurality of synchronous processes are established for the interactive environment;
(3) and acquiring interactive data between the intelligent agent and the interactive environment in the interactive environment through the plurality of synchronous processes, and taking the determined interactive data as the training data.
For an interaction environment in which interaction data are generated at a low frequency while the data demand of the training data is high, a plurality of synchronous processes can be created for the interaction environment, so that interaction data between the agent and the interaction environment can be acquired as training data through the plurality of synchronous processes.
For example, in a Go playing environment, a training sample is the data of one game of Go; because a game of Go lasts a long time, interaction data are generated at a low frequency, which seriously affects the training efficiency of the reinforcement learning sub-model.
At this time, a plurality of synchronous processes can be created for the Go playing environment, each process being used to generate interaction data between the agent and the Go playing environment. For example, if 128 pieces of interaction data are needed in the Go playing environment, 128 synchronous processes can be created, and the 128 synchronous processes then generate the interaction data simultaneously, improving the training efficiency of the reinforcement learning sub-model.
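As a minimal sketch of collecting interaction data through several synchronous worker processes, consider the following; the play_one_game function is a stand-in for a full self-play episode and is an assumption for illustration only.

```python
# Hypothetical sketch of collecting interaction data with several worker
# processes. `play_one_game` is an illustrative stand-in for a full Go
# self-play episode, not part of the disclosure.
import random
from multiprocessing import Pool


def play_one_game(seed):
    """Simulate one game and return its interaction data (a stand-in)."""
    rng = random.Random(seed)
    num_moves = rng.randint(150, 300)
    return {"seed": seed, "num_moves": num_moves, "winner": rng.choice(["black", "white"])}


if __name__ == "__main__":
    num_processes = 8          # the disclosure's example uses 128
    with Pool(processes=num_processes) as pool:
        games = pool.map(play_one_game, range(num_processes))
    print(len(games), "games collected in parallel")
```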
In an embodiment of the disclosure, the method further comprises the steps of:
(1) detecting the training state of the target training task in the process of training the reinforcement learning submodel;
(2) and re-training the reinforcement learning sub-model when the training state failure is detected.
In the embodiment of the disclosure, in the process of training the reinforcement learning submodel, a training state of a target training task may also be detected, where the training state is used to represent whether the training process of the reinforcement learning submodel is normal.
When the training state failure is detected, the failure reason can be analyzed, and the reinforcement learning submodel is retrained according to the failure reason.
For example, the failure reason may be a lack of allocation of computing resources, or an interruption of training data acquisition, or the like.
After the failure reason is determined, a retraining strategy, such as adjusting computing resources, adjusting a generation mode of training data, and the like, may be determined according to the failure reason, and then the reinforcement learning submodel is retrained according to the training strategy.
In another optional embodiment, when it is detected that the training state has failed, the training state at the current moment may also be retained; training of the reinforcement learning sub-model may then continue from the recorded training state.
In the above embodiment, a corresponding automatic recovery and restart mechanism is provided for interruptions and abnormalities of the target training task. With this recovery and restart mechanism, the training task can run stably for a long time, and the use of all computing resources is scheduled at the cluster level, thereby improving the resource utilization of the whole cluster.
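As a rough illustration of such a recovery and restart mechanism, the following sketch resumes training from the last recorded training state when a failure is detected; the function names and retry policy are assumptions for illustration, not the disclosure's actual mechanism.

```python
# Hypothetical sketch of an automatic recovery/restart loop: on failure, resume
# from the last recorded training state. Names and the retry policy are
# illustrative assumptions only.
import time


def run_with_auto_restart(train_step, save_state, load_state, max_restarts=3):
    state = load_state()                       # resume from the last checkpoint, if any
    restarts = 0
    while not state.get("done", False):
        try:
            state = train_step(state)          # one training iteration
            save_state(state)                  # retain the current training state
        except Exception as err:               # training state detected as failed
            restarts += 1
            if restarts > max_restarts:
                raise                          # give up after repeated failures
            print(f"training failed ({err!r}); restarting from the last recorded state")
            time.sleep(1.0)                    # back off before resuming
            state = load_state()
    return state


_checkpoint = {"step": 0, "done": False}


def train_step(state):
    state = dict(state, step=state["step"] + 1)
    state["done"] = state["step"] >= 5
    return state


final = run_with_auto_restart(train_step,
                              save_state=lambda s: _checkpoint.update(s),
                              load_state=lambda: dict(_checkpoint))
print(final["step"])  # 5
```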
As can be seen from the above description, the technical solution of the present disclosure has the following advantages:
1. Universality. The technical scheme of the present disclosure aims to improve the universality of reinforcement learning algorithms, taking into account, in the algorithm design, universality in coping with various types of environments and with various scales of computing resources. Only a small amount of algorithm modification and hyper-parameter adjustment is required for different environments.
2. The overall data production and utilization efficiency is improved. Data transmission, model training and interaction-environment simulation are optimized in a customized way in combination with the characteristics of reinforcement learning, and compared with similar products the efficiency is significantly improved.
3. Algorithm and system design is carried out from the perspective of multiple training tasks (the cluster), and scheduling and optimization are carried out with respect to the resource utilization of the whole cluster, the total training time of all training tasks, and so on.
It will be understood by those skilled in the art that, in the method of the present disclosure, the order in which the steps are written does not imply a strict order of execution or constitute any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
Based on the same inventive concept, the embodiment of the present disclosure further provides a reinforcement learning training device corresponding to the reinforcement learning training method, and because the principle of solving the problem of the device in the embodiment of the present disclosure is similar to the reinforcement learning training method in the embodiment of the present disclosure, the implementation of the device can refer to the implementation of the method, and repeated details are not repeated.
Referring to fig. 4, a schematic diagram of a reinforcement learning training apparatus provided in an embodiment of the present disclosure is shown, where the apparatus includes: an acquisition unit 41, a first determination unit 42, a resource allocation unit 43, and a training unit 44; wherein:
an obtaining unit 41, configured to obtain a target training task;
the first determining unit 42 is configured to determine, based on the interaction environment of the target training task, a reinforcement learning sub-model to be trained in a preset reinforcement learning model;
a resource allocation unit 43, configured to allocate computing resources for the target training task in a container cluster, and to determine training data in interaction data; the interaction data includes data generated during interaction between an agent matched with the target training task and the interaction environment;
a training unit 44, configured to train the reinforcement learning sub-model based on the computing resources and the training data.
In the embodiment of the disclosure, after the target training task is obtained, the reinforcement learning sub-model matched with the interaction environment indicated by the target training task is determined among the preset reinforcement learning models, so that the technical scheme of the disclosure can be applied to reinforcement learning models in various interaction environments, which expands the application range of the disclosed technology. Meanwhile, by allocating corresponding computing resources to the target training task through the container cluster and training the reinforcement learning sub-model with the computing resources and the training data, a plurality of target training tasks can be processed in parallel while ensuring that the target training tasks do not affect one another and can be executed normally.
In a possible embodiment, the training unit is configured to: determining target metadata based on the computing resources, wherein the target metadata is metadata of the interactive data; searching target interactive data matched with the target metadata in an interactive data queue; and taking the target interaction data as the training data to train the reinforcement learning sub-model.
In a possible embodiment, the training unit is configured to: determining, based on the computing resources, metadata control nodes allocated by the container cluster for the target training task; the metadata control node is used for managing metadata of interaction data between the intelligent agent and the corresponding interaction environment; selecting the target metadata in the metadata control node.
In one possible implementation, the interaction data queue is stored in a device memory of a device deployed by the container cluster.
In a possible embodiment, the resource allocation unit is configured to: determining a resource demand for the target training task; and distributing the computing resources for the target training task according to the resource surplus of each working node in the container cluster, and determining training data in the interactive data.
In a possible implementation, the resource allocation unit is further configured to: determining the resource demand of each sub-training task under the condition that the target training task comprises a plurality of sub-training tasks; distributing sub-computing resources to each sub-training task according to the resource surplus of each working node in the container cluster, and determining training data matched with each sub-training task in the interactive data.
In one possible embodiment, the first determining unit is configured to: obtaining a target screening dimension; the target screening dimension comprises at least one of: model type, training algorithm, data processing format and model scale data; and screening in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel.
In one possible embodiment, the first determining unit is configured to: under the condition that each preset reinforcement learning model comprises at least one network module, screening a plurality of target network modules from a plurality of network modules corresponding to the plurality of preset reinforcement learning models based on the target screening dimension; determining a connection relationship between the plurality of target network modules; and determining the reinforcement learning submodel based on the connection relation and the target network modules.
In one possible implementation, the apparatus is further configured to: determine a data demand amount of the training data in a case where the interaction environment includes a plurality of different types of interaction environments; create a plurality of synchronous processes for the interaction environment in a case where the data demand amount meets a preset amount requirement; and acquire, through the plurality of synchronous processes, interaction data between the agent and the interaction environment, and take the determined interaction data as the training data.
In one possible implementation, the apparatus is further configured to: detect the training state of the target training task in the process of training the reinforcement learning sub-model; and retrain the reinforcement learning sub-model when it is detected that the training state has failed.
The description of the processing flow of each module in the device and the interaction flow between the modules may refer to the related description in the above method embodiments, and will not be described in detail here.
Corresponding to the reinforcement learning training method in fig. 1, an embodiment of the present disclosure further provides an electronic device 500. As shown in the schematic structural diagram of fig. 5, the electronic device 500 includes:
a processor 51, a memory 52, and a bus 53. The memory 52 is configured to store execution instructions and includes an internal memory 521 and an external memory 522. The internal memory 521 temporarily stores operation data for the processor 51 and data exchanged with the external memory 522, such as a hard disk; the processor 51 exchanges data with the external memory 522 through the internal memory 521. When the electronic device 500 runs, the processor 51 communicates with the memory 52 through the bus 53, so that the processor 51 executes the following instructions:
acquiring a target training task;
determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interactive environment of the target training task;
allocating computing resources to the target training task in a container cluster, and determining training data in interaction data, where the interaction data includes data generated during interaction between an agent matched with the target training task and the interaction environment;
training the reinforcement learning submodel based on the computing resources and the training data.
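Tying the four instructions together, the sketch below outlines the overall flow the processor would execute. Each step is represented by a trivial stand-in function; the function names, the task fields, and the returned values are hypothetical, with the real behaviour of each step described in the method embodiments above.

```python
def acquire_target_training_task() -> dict:
    # Step 1: acquire a target training task (stand-in values only).
    return {"task_id": "t-001", "interaction_environment": "env-A", "agent_id": "agent-1"}


def determine_submodel(interaction_environment: str) -> str:
    # Step 2: determine the reinforcement learning sub-model to be trained.
    return f"submodel-for-{interaction_environment}"


def allocate_and_select(task: dict) -> tuple[dict, list]:
    # Step 3: allocate computing resources in the container cluster and determine training data.
    return {"node": "node-a", "gpus": 1}, [{"obs": [0.0], "action": 0, "reward": 1.0}]


def train(submodel: str, resources: dict, training_data: list) -> None:
    # Step 4: train the sub-model based on the computing resources and training data.
    print(f"training {submodel} on {resources['node']} with {len(training_data)} samples")


task = acquire_target_training_task()
submodel = determine_submodel(task["interaction_environment"])
resources, training_data = allocate_and_select(task)
train(submodel, resources, training_data)
```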
An embodiment of the present disclosure further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the reinforcement learning training method in the method embodiments above. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product. The computer program product carries program code, and the instructions included in the program code may be used to perform the steps of the reinforcement learning training method in the method embodiments above; for details, reference may be made to those embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).
It is clear to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the system and apparatus described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the division of the units is only a logical division, and other divisions are possible in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a non-volatile, processor-executable computer-readable storage medium. Based on such an understanding, the technical solution of the present disclosure may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods in the embodiments of the present disclosure. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the embodiments above are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes, or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the embodiments of the present disclosure and shall fall within its protection scope. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (13)

1. A reinforcement learning training method is characterized by comprising the following steps:
acquiring a target training task;
determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interactive environment of the target training task;
allocating computing resources to the target training task in a container cluster, and determining training data in interaction data, wherein the interaction data includes data generated during interaction between an agent matched with the target training task and the interaction environment;
training the reinforcement learning submodel based on the computing resources and the training data.
2. The method of claim 1, wherein training the reinforcement learning submodel based on the computing resources and the training data comprises:
determining target metadata based on the computing resources, wherein the target metadata is metadata of the interactive data;
searching target interactive data matched with the target metadata in an interactive data queue;
and taking the target interaction data as the training data to train the reinforcement learning sub-model.
3. The method of claim 2, wherein determining target metadata based on the computing resources comprises:
determining, based on the computing resources, metadata control nodes allocated by the container cluster for the target training task; the metadata control node is used for managing metadata of interaction data between the intelligent agent and the corresponding interaction environment;
selecting the target metadata in the metadata control node.
4. The method of claim 2 or 3, wherein the interaction data queue is stored in a device memory of a device deployed by the container cluster.
5. The method of claim 1, wherein allocating computing resources for the target training task in a container cluster and determining training data in interaction data comprises:
determining a resource demand for the target training task;
and distributing the computing resources for the target training task according to the resource surplus of each working node in the container cluster, and determining training data in the interactive data.
6. The method according to claim 5, wherein, in a case that the target training task includes a plurality of sub-training tasks, the allocating the computing resources to the target training task according to the resource remaining amount of each working node in the container cluster and determining training data in the interaction data includes:
determining the resource demand of each sub-training task;
distributing sub-computing resources to each sub-training task according to the resource surplus of each working node in the container cluster, and determining training data matched with each sub-training task in the interactive data.
7. The method of claim 1, wherein the determining a reinforcement learning sub-model to be trained in a preset reinforcement learning model based on the interaction environment of the target training task comprises:
obtaining a target screening dimension; the target screening dimension comprises at least one of: model type, training algorithm, data processing format and model scale data;
and screening in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel.
8. The method according to claim 7, wherein each of the predetermined reinforcement learning models comprises at least one network module;
the screening is performed in a plurality of preset reinforcement learning models based on the target screening dimension to obtain the reinforcement learning submodel, and the method comprises the following steps:
based on the target screening dimension, screening a plurality of target network modules from a plurality of network modules corresponding to the preset reinforcement learning model;
determining a connection relationship between the plurality of target network modules;
and determining the reinforcement learning submodel based on the connection relation and the target network modules.
9. The method of claim 1, wherein the interactive environment comprises a plurality of different types of interactive environments; the method further comprises the following steps:
determining a data requirement of the training data;
under the condition that the data demand quantity meets the requirement of a preset quantity, a plurality of synchronous processes are established for the interactive environment;
and acquiring interactive data between the intelligent agent and the interactive environment in the interactive environment through the synchronous processes, and taking the determined interactive data as the training data.
10. The method of claim 1, further comprising:
detecting the training state of the target training task in the process of training the reinforcement learning submodel;
and under the condition that the training state is detected to be failed, retraining the reinforcement learning submodel.
11. A reinforcement learning training device, comprising:
the acquisition unit is used for acquiring a target training task;
the first determining unit is used for determining a reinforcement learning submodel to be trained in a preset reinforcement learning model based on the interactive environment of the target training task;
the resource allocation unit is used for allocating computing resources to the target training task in a container cluster and determining training data in interaction data, wherein the interaction data includes data generated during interaction between an agent matched with the target training task and the interaction environment;
and the training unit is used for training the reinforcement learning sub-model based on the computing resources and the training data.
12. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the reinforcement learning training method according to any one of claims 1 to 10.
13. A computer-readable storage medium, having stored thereon a computer program for executing the steps of the reinforcement learning training method according to any one of claims 1 to 10 when the computer program is executed by a processor.
CN202110749747.7A 2021-07-02 2021-07-02 Reinforcement learning training method, device, electronic equipment and storage medium Pending CN113469372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110749747.7A CN113469372A (en) 2021-07-02 2021-07-02 Reinforcement learning training method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113469372A true CN113469372A (en) 2021-10-01

Family

ID=77877479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110749747.7A Pending CN113469372A (en) 2021-07-02 2021-07-02 Reinforcement learning training method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113469372A (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200125586A1 (en) * 2018-10-19 2020-04-23 Oracle International Corporation Systems and methods for predicting actionable tasks using contextual models
WO2021051713A1 (en) * 2019-09-20 2021-03-25 广东浪潮大数据研究有限公司 Working method and device for deep learning training task
CN112633491A (en) * 2019-10-08 2021-04-09 华为技术有限公司 Method and device for training neural network
CN110928689A (en) * 2019-12-05 2020-03-27 中国人民解放军军事科学院国防科技创新研究院 Self-adaptive resource management method and device for distributed reinforcement learning training
CN111901294A (en) * 2020-06-09 2020-11-06 北京迈格威科技有限公司 Method for constructing online machine learning project and machine learning system
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device
CN111860835A (en) * 2020-07-17 2020-10-30 苏州浪潮智能科技有限公司 Neural network model training method and device
CN112241321A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Computing power scheduling method and device based on Kubernetes
CN112295229A (en) * 2020-10-28 2021-02-02 中国电子科技集团公司第二十八研究所 Intelligent game confrontation platform
CN112329948A (en) * 2020-11-04 2021-02-05 腾讯科技(深圳)有限公司 Multi-agent strategy prediction method and device
CN112418438A (en) * 2020-11-24 2021-02-26 国电南瑞科技股份有限公司 Container-based machine learning procedural training task execution method and system
CN113033806A (en) * 2021-04-12 2021-06-25 鹏城实验室 Method and device for training deep reinforcement learning model and scheduling method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KANERVISTO, A. et al.: "Action Space Shaping in Deep Reinforcement Learning", 2020 IEEE Conference on Games (IEEE CoG 2020), pages 479-486 *
LIU, Xiang; LEI, Jingmin; SHANG, Lei: "Campaign-level intelligent agent training system", Command Information System and Technology, no. 03, pages 53-58 *
HUANG, Xueyu; GUO, Qin: "Game algorithm integrating environment model and deep reinforcement learning", Journal of Jiangxi University of Science and Technology, no. 03, pages 87-92 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091029A (en) * 2022-01-24 2022-02-25 深信服科技股份有限公司 Training system, method, device, medium and platform for malicious file detection model
CN114091029B (en) * 2022-01-24 2022-06-21 深信服科技股份有限公司 Training system, method, device, medium and platform for malicious file detection model
CN115454654A (en) * 2022-11-11 2022-12-09 中诚华隆计算机技术有限公司 Adaptive resource matching obtaining method and device
CN115454654B (en) * 2022-11-11 2023-01-13 中诚华隆计算机技术有限公司 Adaptive resource matching obtaining method and device
CN115756846A (en) * 2022-11-17 2023-03-07 抖音视界有限公司 Model training method and device, electronic equipment and storage medium
CN116523030A (en) * 2023-06-30 2023-08-01 支付宝(杭州)信息技术有限公司 Method and device for training resources by dynamic scheduling model
CN116523030B (en) * 2023-06-30 2023-09-15 支付宝(杭州)信息技术有限公司 Method and device for training resources by dynamic scheduling model

Similar Documents

Publication Publication Date Title
CN113469372A (en) Reinforcement learning training method, device, electronic equipment and storage medium
Du et al. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications
Ruder An overview of gradient descent optimization algorithms
KR102611938B1 (en) Generate integrated circuit floorplans using neural networks
Addanki et al. Placeto: Learning generalizable device placement algorithms for distributed machine learning
US11715033B2 (en) Dynamically scaled training fleets for machine learning
Paliwal et al. Reinforced genetic algorithm learning for optimizing computation graphs
CN104951425B (en) A kind of cloud service performance self-adapting type of action system of selection based on deep learning
WO2024114399A1 (en) Optimization method for distributed execution of deep learning task, and distributed system
CN110956272A (en) Method and system for realizing data processing
Kelly et al. Emergent solutions to high-dimensional multitask reinforcement learning
CN110826708B (en) Method for realizing neural network model splitting by using multi-core processor and related product
Han et al. A scalable random forest algorithm based on mapreduce
CN105975342A (en) Improved cuckoo search algorithm based cloud computing task scheduling method and system
CN112329948A (en) Multi-agent strategy prediction method and device
CN109925718A (en) A kind of system and method for distributing the micro- end map of game
CN112163671A (en) New energy scene generation method and system
CN113255873A (en) Clustering longicorn herd optimization method, system, computer equipment and storage medium
CN116510302A (en) Analysis method and device for abnormal behavior of virtual object and electronic equipment
CN109857552A (en) A kind of game artificial intelligence action planning method and system
de Freitas Cunha et al. On the impact of MDP design for reinforcement learning agents in resource management
Ribeiro et al. National systems of innovation and technological differentiation: a multi-country model
Van Stralen et al. Fast scenario-based design space exploration using feature selection
Gozali et al. A dual dynamic migration policy for island model genetic algorithm
KR102187880B1 (en) Method and apparatus for game event guide

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination