CN112862108A - Componentized reinforcement learning model processing method, system, equipment and storage medium


Info

Publication number: CN112862108A
Application number: CN202110171433.3A
Authority: CN (China)
Prior art keywords: reinforcement learning, component, interaction, learning model, interactive
Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN112862108B (en)
Inventors: 朱恒满, 周正, 张正生, 刘永升
Current assignee: Super Parameter Technology Shenzhen Co ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Super Parameter Technology Shenzhen Co ltd
Application filed by Super Parameter Technology Shenzhen Co ltd
Priority to CN202110171433.3A
Publication of CN112862108A
Application granted; publication of CN112862108B
Current legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The application relates to a componentized reinforcement learning model processing method, system, computer device and storage medium. The method comprises the following steps: acquiring interaction data generated by a virtual object in the process of interacting with an interactive environment, wherein the virtual object is controlled by a running component in a reinforcement learning system deployed in the cloud, and the reinforcement learning system further comprises a learning component and an evaluation component; iteratively training a reinforcement learning model based on the interaction data through the learning component; during iterative training, evaluating the reinforcement learning model obtained by iterative training through the evaluation component, and judging from the evaluation result whether the model satisfies the interaction condition; and if not, updating the model associated with the running component according to the reinforcement learning model obtained by iterative training, so that the running component controls the virtual object based on the updated reinforcement learning model. By adopting the method, the complexity of multiplexing the reinforcement learning model training framework into different services can be reduced.

Description

Componentized reinforcement learning model processing method, system, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a componentized reinforcement learning model processing method, system, computer device, and storage medium.
Background
With the development of artificial intelligence technology, reinforcement learning has been widely applied in fields such as games, e-commerce recommendation, autonomous driving, and intelligent scheduling. The training process of a reinforcement learning model is complex, and how to train such models is an important problem for advancing reinforcement learning technology. Conventionally, reinforcement learning models are trained using RLLib (Reinforcement Learning Library), an open-source framework for distributed reinforcement learning. However, since the parts of the RLLib framework are tightly coupled, multiplexing an RLLib-based training process into various different services requires large-scale modification of the training framework, which is a complicated operation.
Disclosure of Invention
In view of the foregoing, there is a need to provide a reinforcement learning model processing method, system, computer device and storage medium that can be conveniently and flexibly multiplexed into various services without large-scale modification of the reinforcement learning model training framework.
A componentized reinforcement learning model processing method, the method comprising:
acquiring interaction data generated by a virtual object in the process of interacting with an interactive environment; the virtual object is controlled by a running component in a reinforcement learning system deployed in the cloud; the reinforcement learning system further comprises a learning component and an evaluation component;
iteratively training, by the learning component, a reinforcement learning model based on the interaction data;
in the iterative training process, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and whether the reinforcement learning model obtained by iterative training meets the interaction condition is judged according to the evaluation result;
and if not, updating the model associated with the running component according to the reinforcement learning model obtained by the iterative training, so that the running component controls the virtual object based on the updated reinforcement learning model.
In one embodiment, the method further comprises:
expanding the running component in the reinforcement learning system to obtain a running component copy;
the acquiring of interaction data generated by the virtual object in the interaction process with the interaction environment comprises:
and respectively controlling different virtual objects to interact with the interactive environment through the running component and the running component copy, so as to obtain interaction data generated in the interaction process.
In one embodiment, the method further comprises:
when the reinforcement learning model in the model library is not updated, calling the initial reinforcement learning model from the model library through the running component; or when the reinforcement learning model in the model library is updated, calling the updated reinforcement learning model from the model library through the running component;
controlling the virtual object to interact with the interaction environment through the called reinforcement learning model to obtain interaction data;
storing the obtained interaction data in a buffer through a cache component in the reinforcement learning system;
the acquiring of interaction data generated by the virtual object in the interaction process with the interaction environment comprises:
and acquiring the interactive data from the buffer.
In one embodiment, the controlling the virtual object to interact with the interaction environment through the invoked reinforcement learning model, and obtaining interaction data includes:
when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment so that a called reinforcement learning model determines a first interactive behavior according to the initial environment state;
controlling the virtual object to interact with the interaction environment through the first interaction behavior to obtain a first reward value;
interaction data is composed based on the first reward value, the first interaction behavior, and the initial environmental state.
In one embodiment, the method further comprises:
updating the interactive environment after the virtual object interacts with the interactive environment by the first interactive behavior;
acquiring an updated environment state in the interactive environment, so that a called reinforcement learning model determines a second interactive behavior according to the environment state;
controlling the virtual object to interact with the interaction environment through the second interaction behavior to obtain a second reward value;
composing new interaction data based on the second reward value, the second interaction behavior, and the environmental state.
In one embodiment, the reinforcement learning system further comprises a batch component and an agent component; storing the obtained interaction data in a buffer through a cache component in the reinforcement learning system comprises:
when interaction data in batch format is obtained through batching by the batch component, storing the interaction data in batch format in the buffer through the cache component in the reinforcement learning system;
the method further comprises the following steps: and reading the stored interactive data from the buffer through the agent component, and sending the interactive data to the learning component so as to execute the step of performing iterative training on the reinforcement learning model based on the interactive data through the learning component.
In one embodiment, the evaluating, by the evaluation component, the reinforcement learning model obtained by iterative training, and the determining whether the reinforcement learning model obtained by iterative training satisfies the interaction condition according to the evaluation result includes:
applying the reinforcement learning model obtained by the iterative training to a test environment through the evaluation component so as to enable the reinforcement learning model to interact with the test environment to obtain an interaction result;
evaluating the interaction result to obtain an evaluation index value;
and judging whether the reinforcement learning model obtained by the iterative training meets the interaction condition or not according to the evaluation index value.
A componentized reinforcement learning system, the system comprising: a running component, a learning component, and an evaluation component;
the running component is used for controlling the interaction between the virtual object and the interactive environment and storing the interaction data obtained by the interaction;
the learning component is used for acquiring the stored interaction data and iteratively training a reinforcement learning model based on the interaction data;
the evaluation component is used for evaluating the reinforcement learning model obtained by iterative training during the iterative training process and judging, according to the evaluation result, whether the reinforcement learning model obtained by iterative training satisfies the interaction condition;
the learning component is further configured to update the model associated with the running component with the reinforcement learning model obtained through the iterative training when that model does not meet the interaction condition, so that the running component controls the virtual object based on the updated reinforcement learning model.
In one embodiment, the system further comprises:
the extension component is used for expanding the running component in the reinforcement learning system to obtain a running component copy;
the running component copy is used for controlling other virtual objects except the virtual object controlled by the running component to interact with the interaction environment so as to obtain interaction data generated in the interaction process.
In one embodiment, the system further comprises:
the running component is further used for calling the initial reinforcement learning model from the model library when the reinforcement learning model in the model library is not updated, or for calling the updated reinforcement learning model from the model library when the reinforcement learning model in the model library is updated;
the running component is further used for controlling the virtual object to interact with the interactive environment through the called reinforcement learning model to obtain interaction data;
the cache component is used for storing the obtained interactive data in a cache;
the learning component is further configured to obtain the interaction data from the buffer.
In one embodiment, the running component is further configured to:
when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment so that a called reinforcement learning model determines a first interactive behavior according to the initial environment state;
controlling the virtual object to interact with the interaction environment through the first interaction behavior to obtain a first reward value;
interaction data is composed based on the first reward value, the first interaction behavior, and the initial environmental state.
In one embodiment, the running component is further configured to:
updating the interactive environment after the virtual object interacts with the interactive environment through the first interaction behavior;
acquiring an updated environment state in the interactive environment, so that a called reinforcement learning model determines a second interactive behavior according to the environment state;
controlling the virtual object to interact with the interaction environment through the second interaction behavior to obtain a second reward value;
composing new interaction data based on the second reward value, the second interaction behavior, and the environmental state.
In one embodiment, the system further comprises:
the batch component is used for batching the interaction data to obtain interaction data in batch format and storing the interaction data in batch format in a buffer;
and the agent component is used for reading the stored interaction data from the buffer and sending it to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
In one embodiment, the evaluation component is further configured to:
applying the reinforcement learning model obtained by the iterative training to a test environment so as to enable the reinforcement learning model to interact with the test environment to obtain an interaction result;
evaluating the interaction result to obtain an evaluation index value;
and judging whether the reinforcement learning model obtained by the iterative training meets the interaction condition or not according to the evaluation index value.
A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the componentized reinforcement learning model processing method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of the componentized reinforcement learning model processing method.
In the above embodiment, the server controls the virtual object to interact with the interactive environment through the running component in the reinforcement learning system, so as to obtain interaction data generated in the interaction process. The reinforcement learning model is then iteratively trained, via the learning component, based on the interaction data. Finally, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and whether it satisfies the interaction condition is judged. Each component in the reinforcement learning system performs a different function, and the components are relatively independent of one another. When the components are multiplexed into different services, modifying one component does not affect the others, which reduces modification complexity, so that the reinforcement learning model training framework formed by the components can be conveniently and flexibly multiplexed into different services.
Drawings
FIG. 1 is a diagram of an application environment of a componentized reinforcement learning model processing method in one embodiment;
FIG. 2 is a flow diagram that illustrates a processing method for a componentized reinforcement learning model, according to one embodiment;
FIG. 3 is a schematic diagram of a reinforcement learning system deployed in the cloud in one embodiment;
FIG. 4 is a flowchart illustrating a method for obtaining interactive data according to one embodiment;
FIG. 5 is a schematic diagram of an environmental state in one embodiment;
FIG. 6 is a flow diagram illustrating a method for evaluating reinforcement learning models, according to one embodiment;
FIG. 7 is a block diagram of a reinforcement learning system in accordance with an embodiment;
FIG. 8 is a block diagram of the structure of a componentized reinforcement learning system in one embodiment;
FIG. 9 is a block diagram of the structure of a componentized reinforcement learning system in another embodiment;
FIG. 10 is a diagram showing an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The componentized reinforcement learning model processing method provided by the application can be applied to the application environment shown in fig. 1. The server 102 controls the virtual object to interact with the interactive environment through the running component 1020 in the reinforcement learning system, so as to obtain interaction data generated in the interaction process. The reinforcement learning model is then iteratively trained based on the interaction data by the learning component 1022. Finally, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component 1024, whether it satisfies the interaction condition is judged, and iterative training stops when the interaction condition is satisfied. The server 102 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in fig. 2, a componentized reinforcement learning model processing method is provided, which is described by taking the method as an example applied to the server in fig. 1, and includes the following steps:
s202, acquiring interactive data generated in the process of interacting the virtual object with the interactive environment; the virtual object is controlled by an operation component in a reinforcement learning system deployed in the cloud; the reinforcement learning system also includes a learning component and an evaluation component.
The virtual object can be a movable object in a virtual scene, including a virtual character, a virtual animal, an animation character and the like. The virtual object may be an avatar in the virtual scene to represent the user. The virtual scene may include a plurality of virtual objects, each virtual object having its own shape and volume in the virtual scene and occupying a portion of the space in the virtual scene. Optionally, when the virtual scene is a three-dimensional virtual scene, the virtual object can be a three-dimensional model, the three-dimensional model can be a three-dimensional character constructed based on a three-dimensional human skeleton technology, and the same virtual object can show different external images by wearing different skins. In some embodiments, the virtual object can also be implemented using a 2.5-dimensional or 2-dimensional model.
Wherein the interactive environment is the sum of the various factors that can change according to the behaviors generated by the virtual object. For example, in a game scenario, the interactive environment is the game environment. The process of interaction between the virtual object and the interactive environment is as follows: for the environment state S_t of the interactive environment, the virtual object selects a suitable interaction behavior A_t and interacts with the interactive environment according to A_t, so that the environment state changes from S_t to S_{t+1}. For example, for an interactive environment of a board game, the virtual object interacts with the interactive environment by moving the playing pieces on the board, thereby changing the state of the interactive environment. For an interactive environment of a battle game, the virtual object performs actions such as moving and shooting in the interactive environment to interact with it, thereby changing the state of the interactive environment.
Wherein the interaction data is data generated based on interaction of the virtual object with the interaction environment. The interaction data may include characteristics of an environment state of the interaction environment, interaction behaviors selected by the virtual object for the environment state of the interaction environment, reward values fed back to the virtual object by the interaction environment according to the interaction behaviors selected by the virtual object, interaction policies composed of the interaction behaviors of the virtual object, and the like.
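By way of illustration only, the interaction data described above can be pictured as a simple transition record; the structure and all names below are hypothetical and not part of the disclosure:

    from dataclasses import dataclass
    from typing import Any

    @dataclass
    class Transition:
        """One step of interaction data (illustrative sketch)."""
        state: Any       # features of the environment state S_t
        action: int      # interaction behavior A_t selected for S_t
        reward: float    # reward value fed back by the interactive environment
        next_state: Any  # environment state S_{t+1} after the interaction
        done: bool       # whether the interaction episode has ended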
Reinforcement Learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning, used to describe and solve the problem of a virtual object learning a strategy that maximizes return or achieves a specific goal during interaction with an interactive environment. The reinforcement learning system is a system that trains an initial reinforcement learning model using a reinforcement learning method. The virtual object can interact with the interactive environment according to the reinforcement learning model obtained by training, so as to maximize return or achieve a specific goal. The reinforcement learning system may include various components, such as a running component, a learning component, and an evaluation component, which implement different functions. The reinforcement learning system can be deployed in the cloud through a K8s (Kubernetes) system. In one embodiment, as shown in fig. 3, the K8s system creates multiple namespaces in the cloud virtual cluster, and the reinforcement learning system in each namespace performs an independent training task.
Wherein a component is an encapsulation of data and methods, comprising properties and methods. Interfaces are defined in a component, through which the methods defined in the component can be called. The running component is the component in the reinforcement learning system that controls the interaction between the virtual object and the interactive environment so as to generate interaction data from the interaction process. The learning component is the component that iteratively trains the reinforcement learning model based on the interaction data. The evaluation component is the component that evaluates whether the reinforcement learning model generated during the learning component's iterative training satisfies the interaction condition.
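To make the component boundaries concrete, the following minimal sketch declares the three interfaces in Python; the class and method names are assumptions for illustration, not the disclosed API:

    from abc import ABC, abstractmethod

    class RunningComponent(ABC):
        @abstractmethod
        def run_episode(self) -> list:
            """Control the virtual object's interaction and return interaction data."""

    class LearningComponent(ABC):
        @abstractmethod
        def train(self, interaction_data: list) -> None:
            """Iteratively train the reinforcement learning model on interaction data."""

        @abstractmethod
        def current_model(self):
            """Return the reinforcement learning model obtained so far."""

    class EvaluationComponent(ABC):
        @abstractmethod
        def evaluate(self, model) -> float:
            """Interact the model with a test environment; return an evaluation index value."""

Because each component is reached only through such an interface, the internals of one component can be modified without affecting the others, which is the multiplexing property emphasized above.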
And S204, the server carries out iterative training on the reinforcement learning model based on the interactive data through the learning component.
The iterative training is the process of adjusting the parameters of the reinforcement learning model according to an optimization algorithm. Optimization algorithms include the Policy Gradient algorithm, the PPO (Proximal Policy Optimization) algorithm, and the like. The server computes over the interaction data according to the optimization algorithm and adjusts the parameters of the reinforcement learning model toward the optimization target of the algorithm. In one embodiment, the server enables distributed training of the learning component using ring-allreduce through the MPIJob object of the K8s system.
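As a hedged illustration of one such optimization step, the sketch below performs a generic policy-gradient (REINFORCE-style) update with PyTorch; it assumes simple tensor shapes and is not the specific algorithm of the disclosure:

    import torch

    def policy_gradient_step(policy_net, optimizer, states, actions, returns):
        """Adjust model parameters along the policy gradient (illustrative).

        states: (N, state_dim) float tensor; actions: (N,) long tensor;
        returns: (N,) float tensor of discounted reward values.
        """
        logits = policy_net(states)                    # unnormalized action scores
        log_probs = torch.log_softmax(logits, dim=-1)  # log pi(a|s)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = -(chosen * returns).mean()              # ascend the expected return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()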
And S206, the server evaluates the reinforcement learning model obtained by iterative training through the evaluation component in the iterative training process, and judges whether the reinforcement learning model obtained by iterative training meets the interaction condition according to the evaluation result.
Wherein evaluation is the process of testing and measuring the performance of the reinforcement learning model when it interacts with the interactive environment, for example, in a game scenario, the win rate of the reinforcement learning model against other players, the time taken to complete a task, the cost paid to complete a task, the number of tasks left unfinished, and so on. The interaction condition is a condition for judging whether the reinforcement learning model can achieve a set goal when interacting with the interactive environment. For example, in a Go-type game scenario, the interaction condition may be whether the reinforcement learning model can achieve a win rate of more than 30% in the game. The interaction condition may also be whether the reinforcement learning model can control a robot to successfully avoid obstacles, or whether the reinforcement learning model can successfully hit a target, and the like.
And S208, if not, the server updates the model associated with the running component according to the reinforcement learning model obtained by iterative training, so that the running component controls the virtual object based on the updated reinforcement learning model.
And the model associated with the running component is the model used by the running component to control the virtual object's interaction with the interactive environment. In one embodiment, the learning component sends the reinforcement learning model to the running component after a number of iterations of training. For example, every 100 training iterations, the learning component sends the reinforcement learning model obtained by iterative training to the running component.
In one embodiment, initially, the running component controls the virtual object through the reinforcement learning model with random parameters, and after the learning component performs iterative training on the reinforcement learning model, the reinforcement learning model obtained through the iterative training is sent to the running component, and the running component controls the virtual object through the reinforcement learning model obtained through the iterative training.
And in the iterative training process, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and when the reinforcement learning model obtained by iterative training is judged to meet the interaction condition according to the evaluation result, the learning component stops performing iterative training on the reinforcement learning model.
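Putting S202 to S208 together, the control flow of the training driver could look like the following sketch; the 100-iteration update cadence echoes the example above, and every name here is hypothetical:

    def training_driver(runner, learner, evaluator, model_library,
                        push_every=100, threshold=0.8):
        iteration = 0
        while True:
            interaction_data = runner.run_episode()   # S202: collect interaction data
            learner.train(interaction_data)           # S204: iterative training
            iteration += 1
            if iteration % push_every == 0:           # periodically update the model library
                model_library.update(learner.current_model())
            score = evaluator.evaluate(learner.current_model())  # S206: evaluate
            if score >= threshold:                    # interaction condition satisfied
                break                                 # stop iterative training
            # S208: otherwise the running component continues with the updated model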
In the above embodiment, the server controls the virtual object to interact with the interactive environment through the running component in the reinforcement learning system, so as to obtain interaction data generated in the interaction process. The reinforcement learning model is then iteratively trained, via the learning component, based on the interaction data. Finally, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and whether it satisfies the interaction condition is judged. Each component in the reinforcement learning system performs a different function, and the components are relatively independent of one another. When the components are multiplexed into different services, modifying one component does not affect the others, which reduces modification complexity, so that the reinforcement learning system can be conveniently and flexibly multiplexed into different services.
In one embodiment, the server expands the running component in the reinforcement learning system to obtain a running component copy, and respectively controls different virtual objects to interact with the interactive environment through the running component and the running component copy, so as to obtain interaction data generated in the interaction process.
Wherein expanding the running component is the process of creating a copy of the running component. In one embodiment, the server may expand the running component through the Deployment object in the K8s system. Components expanded via the Deployment object have no additional data dependencies and require no state maintenance. If the running component or a running component copy stops working, the server can automatically pull the stopped component back up through the Deployment object.
In one embodiment, the server manages data of intermediate states generated during interaction of the virtual objects with the interactive environment through a management component. For example, the hidden state (hidden state) and the cell state (cell state) of LSTM (Long Short-Term Memory network) are managed.
The server expands the running component in the reinforcement learning system so that the running component and a plurality of running component copies can operate simultaneously, each controlling a different virtual object to interact with the interactive environment to generate interaction data. The server therefore generates interaction data more efficiently, more interaction data can be produced to train the reinforcement learning model, and the training process of the reinforcement learning model is accelerated.
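For illustration only, scaling the running component with the official Kubernetes Python client might look like the following; the Deployment name and namespace are hypothetical:

    from kubernetes import client, config

    def scale_running_component(replicas: int) -> None:
        """Create running component copies by scaling their Deployment."""
        config.load_incluster_config()  # assumes this code runs inside the cluster
        apps = client.AppsV1Api()
        apps.patch_namespaced_deployment_scale(
            name="running-component",   # hypothetical Deployment name
            namespace="rl-training",    # hypothetical namespace
            body={"spec": {"replicas": replicas}},
        )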
In one embodiment, when the reinforcement learning model in the model library is not updated, the server calls the initial reinforcement learning model from the model library through the running component; or when the reinforcement learning model in the model library is updated, calls the updated reinforcement learning model from the model library through the running component; controls the virtual object to interact with the interactive environment through the called reinforcement learning model to obtain interaction data; and stores the obtained interaction data in a buffer through a cache component in the reinforcement learning system. Acquiring interaction data generated by the virtual object in the process of interacting with the interactive environment then comprises: acquiring the interaction data from the buffer.
The model library is used for storing the reinforcement learning model called by the running component. In one embodiment, when the running component starts, a reinforcement learning model with random parameters is stored in the model library. During training, the learning component sends the reinforcement learning model obtained by iterative training to the model library so as to update the reinforcement learning model in the model library. After the model library is updated, the running component invokes the updated reinforcement learning model.
In one embodiment, the server selects the invoked model from the model library through a model selection component. The model selection component can select a model from a model library through a variety of model selection algorithms. For example, the latest model is selected from a model library, or a model is randomly selected from models that are updated within a certain time range.
The cache component is a component for caching the interaction data. The cache component may store the interaction data in a Remote Dictionary Server (Redis) database, or alternatively in a Memcached database.
And the server updates the model in the model library so that the running component can call the updated model to control the virtual object. Because the updated model is more accurate, the running component controls the virtual object to interact with the interactive environment through the updated model, generating new, more accurate interaction data; training the reinforcement learning model on this new interaction data accelerates the training process and helps the model converge quickly.
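A minimal sketch of the cache component over Redis follows, assuming the redis-py client and pickle for serialization (the adaptation component's role); the key name is hypothetical:

    import pickle
    import redis

    class CacheComponent:
        """Stores interaction data in a Redis-backed buffer (illustrative only)."""

        def __init__(self, host="localhost", port=6379, key="interaction_data"):
            self._redis = redis.Redis(host=host, port=port)
            self._key = key

        def put(self, transitions) -> None:
            # Serialize (the adaptation component's job) and append to the buffer.
            self._redis.rpush(self._key, pickle.dumps(transitions))

        def get(self, timeout=5):
            # Blocking pop, so the learning component waits for fresh data.
            item = self._redis.blpop(self._key, timeout=timeout)
            return pickle.loads(item[1]) if item else None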
In one embodiment, as shown in fig. 4, the step of controlling, by the server, the virtual object to interact with the interaction environment through the invoked reinforcement learning model, and obtaining interaction data includes the following steps:
s402, when the interactive environment is the initialized interactive environment, the server obtains the initial environment state in the interactive environment, so that the called reinforcement learning model determines the first interactive behavior according to the initial environment state.
S404, the server controls the virtual object to interact with the interaction environment through the first interaction behavior, and a first reward value is obtained.
S406, the server composes interaction data based on the first reward value, the first interaction behavior and the initial environmental state.
Wherein the initialized interactive environment is the interactive environment in its initial environment state. For example, for an interactive environment of a board game, the initialized interactive environment is the interactive environment when no pieces are on the board. The environment state is the state of the interactive environment. For example, fig. 5 shows one environment state when the interactive environment is an elimination-type game environment: 7 box elements, 4 circle elements and 5 triangle elements remain to be eliminated, 12 steps have been consumed, 8 steps remain, and the remaining time is 1:30. For a battle-type game environment, the environment state includes the current teammate casualties of the battle participants, the position of each participant's virtual object, life value information, game level, and the like.
In one embodiment, the server predicts, through the called reinforcement learning model, based on the characteristics of the initial environment state, and determines the first interaction behavior of the interaction object in that environment state. When the virtual object interacts with the interactive environment according to the first interaction behavior, the interactive environment feeds back a reward value to the virtual object: if the interactive environment considers the first interaction behavior beneficial to the virtual object completing the interactive task, the reward value fed back is higher; otherwise it is lower.
In one embodiment, after the virtual object interacts with the interactive environment through the first interaction behavior, the server updates the interactive environment; acquires the environment state in the updated interactive environment, so that the called reinforcement learning model determines a second interaction behavior according to that environment state; controls the virtual object to interact with the interactive environment through the second interaction behavior to obtain a second reward value; and composes new interaction data based on the second reward value, the second interaction behavior and the environment state.
Wherein updating the interactive environment means changing the state of the interactive environment according to the interaction behavior of the virtual object. When the virtual object interacts with the interactive environment through an interaction behavior, the interactive environment is updated according to that behavior. The running component then predicts, through the called reinforcement learning model, the next interaction behavior of the virtual object based on the updated environment state.
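Steps S402 to S406 together with this update rule amount to a standard rollout loop. The sketch below assumes a Gym-style environment interface (reset/step) and a model with an act method; these names are assumptions, not the disclosed API:

    def collect_interaction_data(env, model, max_steps=1000):
        """Roll the virtual object through the interactive environment."""
        transitions = []
        state = env.reset()                    # initial environment state
        for _ in range(max_steps):
            action = model.act(state)          # model determines the interaction behavior
            next_state, reward, done, _ = env.step(action)  # environment updates, feeds back a reward
            transitions.append((state, action, reward))     # compose interaction data
            if done:
                break
            state = next_state                 # continue from the updated environment state
        return transitions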
In one embodiment, the reinforcement learning system further comprises a batch component and an agent component; storing the obtained interaction data in a buffer through a cache component in the reinforcement learning system comprises: when interaction data in batch format is obtained through batching by the batch component, storing the interaction data in batch format in the buffer through the cache component in the reinforcement learning system; the method further comprises: reading the stored interaction data from the buffer through the agent component and sending it to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
The batch component is used for batching the interaction data. The agent component is the component for reading and writing the interaction data in the buffer. In one embodiment, both the agent component and the batch component belong to the data component. The data component further includes an adaptation component for serializing and deserializing the interaction data so that the interaction data can be transmitted over the network.
In one embodiment, the server reads the interaction data from the buffer through the agent component, batches it through the batch component to obtain interaction data in batch format, and then returns the interaction data in batch format to the buffer.
And the server batches the interaction data through the batch component, iteratively trains the reinforcement learning model on the batched interaction data, and then acquires new interaction data using the strategy learned through iterative training. This improves the efficiency of iterative model training and accelerates model convergence.
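As an illustrative sketch of the batch component, the transitions gathered above can be stacked into fixed-shape arrays with NumPy; the field layout is assumed:

    import numpy as np

    def to_batch_format(transitions):
        """Convert a list of (state, action, reward) tuples into batched arrays."""
        states, actions, rewards = zip(*transitions)
        return {
            "states": np.stack(states).astype(np.float32),
            "actions": np.asarray(actions, dtype=np.int64),
            "rewards": np.asarray(rewards, dtype=np.float32),
        }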
In one embodiment, as shown in fig. 6, the server evaluates the reinforcement learning model obtained by iterative training through the evaluation component, and determining whether the reinforcement learning model obtained by iterative training satisfies the interaction condition according to the evaluation result includes the following steps:
s602, the server applies the reinforcement learning model obtained by iterative training to a test environment through the evaluation component, so that the reinforcement learning model and the test environment interact with each other, and an interaction result is obtained.
S604, the server evaluates the interaction result to obtain an evaluation index value.
And S606, the server judges whether the reinforcement learning model obtained by iterative training meets the interaction condition according to the evaluation index value.
Wherein the test environment is an interactive environment for evaluating the reinforcement learning model obtained by iterative training. The interaction result is the final result determined from the state of the test environment when the interaction between the reinforcement learning model and the test environment ends. For example, when the test environment is an elimination-type game environment, the interaction result may be whether the virtual object wins the game, the score obtained, the time taken, and so on. When the test environment is a battle-type game environment, the interaction result may be the number of opposing battle participants eliminated, the amount of health consumed, and the like.
The evaluation index is a parameter for evaluating the reinforcement learning model, and may be, for example, a game winning rate, an average score, or the like. The evaluation index value may be a numerical value, for example, the winning rate of the game is 80%, the average score is 200 points, and the like.
And the server judges whether the reinforcement learning model obtained by iterative training meets the interaction condition or not according to the evaluation index value. For example, when the virtual object is controlled by the reinforcement learning model obtained through iterative training to play a battle game, if the winning rate of the virtual object is higher than 80%, the reinforcement learning model obtained through iterative training is determined to meet the interaction condition.
In one embodiment, the server causes the evaluation component to periodically pull the model to be evaluated from the learning component through a CronJob (timed task) object in the K8s system and to evaluate the pulled model.
And the server makes the reinforcement learning model obtained by iterative training interact with the test environment, evaluates the interaction result, and judges whether the model satisfies the interaction condition. Evaluating the reinforcement learning model in a test environment yields real data about its performance, making the evaluation result more accurate.
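Steps S602 to S606 could be sketched as follows; the 80% win-rate threshold echoes the battle-game example above, and the environment interface and the "won" field are assumptions:

    def meets_interaction_condition(model, test_env, episodes=100, threshold=0.8):
        """Evaluate the model in the test environment and check the interaction condition."""
        wins = 0
        for _ in range(episodes):
            state, done, info = test_env.reset(), False, {}
            while not done:                      # interact until the episode ends
                state, _, done, info = test_env.step(model.act(state))
            wins += int(info.get("won", False))  # interaction result of this episode
        win_rate = wins / episodes               # evaluation index value
        return win_rate >= threshold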
In one embodiment, the server starts the running component, the learning component, the evaluation component and the other components through a script program; or it generates an application orchestration file and lets the orchestration system start the running component, the learning component, the evaluation component and the other components according to the application orchestration file. The script program can be a shell script. If the reinforcement learning system is deployed in a cluster, the server may submit the application orchestration file to the K8s cluster through kubectl to start the components in the reinforcement learning system and begin training the reinforcement learning model.
In one embodiment, the reinforcement learning system further comprises a cache component; the method further comprises: the server expands the cache component to obtain a cache component copy; acquires the address of the cache component and the address of the cache component copy; and sends the interaction data to the cache component according to the address of the cache component, or to the cache component copy according to the address of the cache component copy, so that the learning component can obtain the interaction data through the cache component or the cache component copy.
In one embodiment, the server may expand the cache component through a StatefulSet object in the K8s system. In one embodiment, the server obtains the addresses of the cache component and the cache component copies via a headless service provided by the K8s system.
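Because a headless service's DNS name resolves to the IP of every pod behind it, looking up the addresses of the cache component and its copies can be sketched with the standard library; the service name is hypothetical:

    import socket

    def cache_addresses(service="cache.rl-training.svc.cluster.local", port=6379):
        """Return (ip, port) pairs for all cache component replicas."""
        infos = socket.getaddrinfo(service, port, proto=socket.IPPROTO_TCP)
        return sorted({(info[4][0], port) for info in infos})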
In one embodiment, the server acquires monitoring indicators of the reinforcement learning system deployed in the cloud in real time through the Prometheus monitoring system, so as to monitor whether the reinforcement learning system is operating normally. The server can also perform log stream processing through the ElasticSearch engine, so as to monitor distributed computing tasks in real time.
In one embodiment, the server adds a layer of encapsulation over the K8s orchestration file to make it easier for the user to orchestrate the distributed application. In one embodiment, the server deploys the reinforcement learning system in the cloud using the Mesos distributed resource management framework.
In one embodiment, the server deploys the reinforcement learning system on a distributed cluster in the cloud through the K8s orchestration system. The reinforcement learning system includes a running component, a learning component, and a cache component. The reinforcement learning system can further include an evaluation component, a data component, a policy component, an inference component, and a schema component.
As shown in fig. 7, the running component controls the virtual object to interact with the interaction object, generating interaction data through the interaction. The running component sends the generated interaction data to the cache component through the agent component, so that the interaction data is stored in the buffer by the cache component. The learning component acquires the interaction data from the cache component through the agent component, and then iteratively trains the reinforcement learning model on the interaction data. When iterative training reaches a certain number of iterations, the learning component sends the reinforcement learning model to the running component to update the model in the running component's model library, so that the running component can call the updated reinforcement learning model and control the virtual object through it.
The running component can also comprise a role component, an environment component and a management component. The role component is used for enabling the virtual object to select the interaction behavior, and can further comprise a model selection component for selecting the model called by the running component from the model library. The environment component is used for simulating the interactive environment and deducing the state of the interactive environment from the interaction behavior of the virtual object. The management component is used for managing data of intermediate states in the interaction process between the virtual object and the interactive environment. The learning component can further comprise a data processing component, which pulls the interaction data from the buffer through the cache component, batches the interaction data, and provides the batched interaction data to the learning component for iterative training.
Wherein the data component is used to transfer data between the components, and may include an adaptation component, an agent component, and a batch component. The adaptation component is used for serializing and deserializing the interaction data. The agent component is used for reading and writing data in the cache component. The batch component is used for batching the interaction data. The policy component is configured to provide the policy by which the virtual object selects an interaction behavior; the selectable policies include a top1 policy, a softmax policy, and the like. The policy component can comprise a model component and an algorithm component, the algorithm component providing the optimization algorithm. The inference component is used to improve inference efficiency through batched model inference. The schema component is used to define the distributed structure of the reinforcement learning system.
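The two selectable policies named above can be sketched in a few lines: a top1 policy greedily picks the highest-scoring interaction behavior, while a softmax policy samples in proportion to the exponentiated scores:

    import numpy as np

    def top1_policy(scores):
        """Greedy: choose the interaction behavior with the highest score."""
        return int(np.argmax(scores))

    def softmax_policy(scores, temperature=1.0):
        """Stochastic: sample a behavior with probability proportional to exp(score/T)."""
        z = np.asarray(scores, dtype=np.float64) / temperature
        z -= z.max()                          # subtract the max for numerical stability
        probs = np.exp(z) / np.exp(z).sum()
        return int(np.random.choice(len(probs), p=probs))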
It should be understood that although the steps in the flowcharts of fig. 2, 4 and 6 are shown in sequence as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated otherwise, the steps are not strictly ordered and may be performed in other orders. Moreover, at least some of the steps in fig. 2, 4 and 6 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different moments, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 8, there is provided a componentized reinforcement learning system comprising: a running component 802, a learning component 804, and an evaluation component 806, wherein:
the running component 802 is used for controlling the interaction between the virtual object and the interactive environment and storing the interaction data obtained by the interaction;
the learning component 804 is used for acquiring the stored interactive data and performing iterative training on the reinforcement learning model based on the interactive data;
the evaluation component 806 is configured to evaluate the reinforcement learning model obtained through the iterative training in the iterative training process, and determine whether the reinforcement learning model obtained through the iterative training satisfies an interaction condition according to an evaluation result;
the learning component 804 is further configured to update the model associated with the running component with the reinforcement learning model obtained through iterative training when that model does not satisfy the interaction condition, so that the running component controls the virtual object based on the updated reinforcement learning model.
In the above embodiment, the server controls the virtual object to interact with the interactive environment through the running component in the reinforcement learning system, so as to obtain interaction data generated in the interaction process. The reinforcement learning model is then iteratively trained, via the learning component, based on the interaction data. Finally, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and whether it satisfies the interaction condition is judged. Each component in the reinforcement learning system performs a different function, and the components are relatively independent of one another. When the components are multiplexed into different services, modifying one component does not affect the others, which reduces modification complexity, so that the reinforcement learning system can be conveniently and flexibly multiplexed into different services.
In one embodiment, as shown in fig. 9, the system further comprises:
an extension component 808, configured to expand the running component in the reinforcement learning system to obtain a running component copy;
the running component copy 810 is used for controlling other virtual objects except the virtual object controlled by the running component to interact with the interaction environment so as to obtain interaction data generated in the interaction process.
In one embodiment, the system further comprises:
a running component 802, further configured to call the initial reinforcement learning model from the model library when the reinforcement learning model in the model library is not updated, or to call the updated reinforcement learning model from the model library when the reinforcement learning model in the model library is updated;
the running component 802 is further configured to control the virtual object to interact with the interactive environment through the called reinforcement learning model, so as to obtain interaction data;
a cache component 812 for storing the obtained interactive data in a cache;
a learning component 804 for obtaining interaction data from the buffer.
In one embodiment, the running component 802 is further configured to:
when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment so as to enable the called reinforcement learning model to determine a first interactive behavior according to the initial environment state;
controlling the virtual object to interact with the interaction environment by a first interaction behavior to obtain a first reward value;
interaction data is composed based on the first reward value, the first interaction behavior, and the initial environmental state.
In one embodiment, the running component 802 is further configured to:
updating the interactive environment after the virtual object interacts with the interactive environment by the first interactive behavior;
acquiring an environment state in the updated interactive environment so that the called reinforcement learning model determines a second interactive behavior according to the environment state;
controlling the virtual object to interact with the interactive environment through the second interaction behavior to obtain a second reward value;
new interaction data is composed based on the second reward value, the second interaction behavior and the environmental state.
In one embodiment, the system further comprises:
the batch component 814 is used for batching the interaction data to obtain interaction data in batch format and storing the interaction data in batch format in a buffer;
the agent component 816 is configured to read the stored interaction data from the buffer and send it to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
In one embodiment, the evaluation component 806 is further configured to:
applying the reinforcement learning model obtained by iterative training to a test environment so as to enable the reinforcement learning model to interact with the test environment to obtain an interaction result;
evaluating the interaction result to obtain an evaluation index value;
and judging whether the reinforcement learning model obtained by iterative training meets the interaction condition or not according to the evaluation index value.
For specific limitations of the componentized reinforcement learning system, reference may be made to the above limitations of the componentized reinforcement learning model processing method, which are not repeated here. The modules in the above componentized reinforcement learning system can be implemented in whole or in part by software, hardware, and combinations thereof. The modules can be embedded, in hardware form, in or be independent of a processor in the computer device, or can be stored, in software form, in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 10. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the processing data of the componentized reinforcement learning model. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a componentized reinforcement learning model processing method.
Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of part of the structure related to the present solution and does not limit the computer devices to which the present solution applies; a particular computer device may include more or fewer components than shown, combine certain components, or arrange the components differently.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring interactive data generated by a virtual object in an interactive process with an interactive environment; the virtual object is controlled by an operation component in a reinforcement learning system deployed in the cloud; the reinforcement learning system further comprises a learning component and an evaluation component; performing iterative training on the reinforcement learning model based on the interactive data through a learning component; in the iterative training process, the reinforcement learning model obtained by iterative training is evaluated through an evaluation component, and whether the reinforcement learning model obtained by iterative training meets the interaction condition is judged according to the evaluation result; and if not, updating the model associated with the operating component according to the reinforcement learning model obtained by iterative training so that the operating component controls the virtual object based on the updated reinforcement learning model.
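For illustration, the following sketch arranges those steps into a loop; all objects and method names (collect, train, meets_interaction_condition, update, reload_model) are hypothetical stand-ins for the components described above, not the patent's concrete API.

    # Illustrative orchestration of the recited steps.
    def training_loop(operation, learner, evaluator, model_library):
        while True:
            data = operation.collect()                        # acquire interaction data
            model = learner.train(data)                       # iterative training
            if evaluator.meets_interaction_condition(model):  # evaluate the trained model
                return model                                  # condition met: training done
            model_library.update(model)                       # otherwise update the model
            operation.reload_model(model_library)             # operation component uses it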
In one embodiment, the processor, when executing the computer program, further performs the steps of: expanding the operation component in the reinforcement learning system to obtain an operation component copy; the acquiring of interaction data generated by the virtual object in the interaction process with the interactive environment then comprises: controlling, through the operation component and the operation component copy respectively, different virtual objects to interact with the interactive environment, so as to obtain the interaction data generated in the interaction process.
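One way to realize such expansion, sketched below under the assumption of simple thread-based parallelism (the actual system deploys its components in the cloud), is to run several operation component copies concurrently and merge their interaction data.

    from concurrent.futures import ThreadPoolExecutor

    # Illustrative sketch of expanding the operation component into copies;
    # make_operation_component and collect are assumed interfaces.
    def parallel_collect(make_operation_component, num_copies=4):
        components = [make_operation_component() for _ in range(num_copies)]
        with ThreadPoolExecutor(max_workers=num_copies) as pool:
            results = pool.map(lambda c: c.collect(), components)
        # merge the interaction data produced by the component and its copies
        return [record for data in results for record in data]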
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the reinforcement learning model in the model library has not been updated, calling the initial reinforcement learning model from the model library through the operation component; or, when the reinforcement learning model in the model library has been updated, calling the updated reinforcement learning model from the model library through the operation component; controlling the virtual object to interact with the interactive environment through the called reinforcement learning model to obtain interaction data; storing the obtained interaction data in a buffer through a buffer component in the reinforcement learning system; and acquiring the interaction data from the buffer.
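The model-calling rule described here can be illustrated with the following sketch; the ModelLibrary class is a hypothetical stand-in for the model library, returning the updated model when one exists and the initial model otherwise.

    # Illustrative sketch of the model library's calling rule.
    class ModelLibrary:
        def __init__(self, initial_model):
            self.initial_model = initial_model
            self.latest_model = None          # set once training produces an update

        def update(self, model):
            self.latest_model = model

        def fetch(self):
            """Return the updated model if one has been stored, else the initial model."""
            if self.latest_model is not None:
                return self.latest_model      # model library has been updated
            return self.initial_model         # no update yet: call the initial model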
In one embodiment, the processor, when executing the computer program, further performs the steps of: when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment, so that the called reinforcement learning model determines a first interaction behavior according to the initial environment state; controlling the virtual object to interact with the interactive environment through the first interaction behavior to obtain a first reward value; and composing interaction data based on the first reward value, the first interaction behavior, and the initial environment state.
In one embodiment, the processor, when executing the computer program, further performs the steps of: updating the interactive environment after the virtual object interacts with the interactive environment through the first interaction behavior; acquiring the environment state of the updated interactive environment, so that the called reinforcement learning model determines a second interaction behavior according to that environment state; controlling the virtual object to interact with the interactive environment through the second interaction behavior to obtain a second reward value; and composing new interaction data based on the second reward value, the second interaction behavior, and the environment state.
In one embodiment, the reinforcement learning system further comprises a batch processing component and an agent component; the processor, when executing the computer program, further performs the steps of: batching, through the batch processing component, the interaction data to obtain interaction data in a batch format, and storing the batch-format interaction data in a buffer through the buffer component in the reinforcement learning system; and reading the stored interaction data from the buffer through the agent component and sending it to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
In one embodiment, the processor, when executing the computer program, further performs the steps of: applying the reinforcement learning model obtained by iterative training to a test environment through the evaluation component, so that the reinforcement learning model interacts with the test environment to obtain an interaction result; evaluating the interaction result to obtain an evaluation index value; and judging, according to the evaluation index value, whether the reinforcement learning model obtained by iterative training meets the interaction condition.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring interactive data generated by a virtual object in an interactive process with an interactive environment; the virtual object is controlled by an operation component in a reinforcement learning system deployed in the cloud; the reinforcement learning system further comprises a learning component and an evaluation component; performing iterative training on the reinforcement learning model based on the interactive data through a learning component; in the iterative training process, the reinforcement learning model obtained by iterative training is evaluated through an evaluation component, and whether the reinforcement learning model obtained by iterative training meets the interaction condition is judged according to the evaluation result; and if not, updating the model associated with the operating component according to the reinforcement learning model obtained by iterative training so that the operating component controls the virtual object based on the updated reinforcement learning model.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: expanding the operation component in the reinforcement learning system to obtain an operation component copy; the acquiring of interaction data generated by the virtual object in the interaction process with the interactive environment then comprises: controlling, through the operation component and the operation component copy respectively, different virtual objects to interact with the interactive environment, so as to obtain the interaction data generated in the interaction process.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: when the reinforcement learning model in the model library has not been updated, calling the initial reinforcement learning model from the model library through the operation component; or, when the reinforcement learning model in the model library has been updated, calling the updated reinforcement learning model from the model library through the operation component; controlling the virtual object to interact with the interactive environment through the called reinforcement learning model to obtain interaction data; storing the obtained interaction data in a buffer through a buffer component in the reinforcement learning system; and acquiring the interaction data from the buffer.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment, so that the called reinforcement learning model determines a first interaction behavior according to the initial environment state; controlling the virtual object to interact with the interactive environment through the first interaction behavior to obtain a first reward value; and composing interaction data based on the first reward value, the first interaction behavior, and the initial environment state.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: updating the interactive environment after the virtual object interacts with the interactive environment through the first interaction behavior; acquiring the environment state of the updated interactive environment, so that the called reinforcement learning model determines a second interaction behavior according to that environment state; controlling the virtual object to interact with the interactive environment through the second interaction behavior to obtain a second reward value; and composing new interaction data based on the second reward value, the second interaction behavior, and the environment state.
In one embodiment, the reinforcement learning system further comprises a batch processing component and an agent component; the computer program, when executed by the processor, further performs the steps of: batching, through the batch processing component, the interaction data to obtain interaction data in a batch format, and storing the batch-format interaction data in a buffer through the buffer component in the reinforcement learning system; and reading the stored interaction data from the buffer through the agent component and sending it to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
In one embodiment, the computer program, when executed by the processor, further performs the steps of: applying the reinforcement learning model obtained by iterative training to a test environment through the evaluation component, so that the reinforcement learning model interacts with the test environment to obtain an interaction result; evaluating the interaction result to obtain an evaluation index value; and judging, according to the evaluation index value, whether the reinforcement learning model obtained by iterative training meets the interaction condition.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, and the like. Volatile memory can include Random Access Memory (RAM) or an external cache. By way of illustration and not limitation, RAM is available in many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in a combination of these technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those of ordinary skill in the art without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of processing a componentized reinforcement learning model, the method comprising:
acquiring interactive data generated by a virtual object in an interactive process with an interactive environment; the virtual object is controlled by an operating component in a reinforcement learning system deployed in a cloud end; the reinforcement learning system further comprises a learning component and an evaluation component;
iteratively training, by the learning component, a reinforcement learning model based on the interaction data;
in the iterative training process, the reinforcement learning model obtained by iterative training is evaluated through the evaluation component, and whether the reinforcement learning model obtained by iterative training meets the interaction condition is judged according to the evaluation result;
and if not, updating the model associated with the operating component according to the reinforcement learning model obtained by the iterative training so as to enable the operating component to control the virtual object based on the updated reinforcement learning model.
2. The method of claim 1, further comprising:
expanding the operating component in the reinforcement learning system to obtain an operating component copy;
the acquiring of interaction data generated by the virtual object in the interaction process with the interaction environment comprises:
and respectively controlling different virtual objects to interact with the interaction environment through the operation component and the operation component copy so as to obtain interaction data generated in the interaction process.
3. The method of claim 1, further comprising:
when the reinforcement learning model in the model library has not been updated, calling the initial reinforcement learning model from the model library through the operation component; or, when the reinforcement learning model in the model library has been updated, calling the updated reinforcement learning model from the model library through the operation component;
controlling the virtual object to interact with the interaction environment through the called reinforcement learning model to obtain interaction data;
storing the obtained interaction data in a buffer through a buffer component in the reinforcement learning system;
the acquiring of interaction data generated by the virtual object in the interaction process with the interaction environment comprises:
and acquiring the interaction data from the buffer.
4. The method of claim 3, wherein the controlling the virtual object to interact with the interactive environment through the called reinforcement learning model to obtain interaction data comprises:
when the interactive environment is an initialized interactive environment, acquiring an initial environment state in the interactive environment so that a called reinforcement learning model determines a first interactive behavior according to the initial environment state;
controlling the virtual object to interact with the interaction environment through the first interaction behavior to obtain a first reward value;
interaction data is composed based on the first reward value, the first interaction behavior, and the initial environmental state.
5. The method of claim 4, further comprising:
updating the interactive environment after the virtual object interacts with the interactive environment by the first interactive behavior;
acquiring an updated environment state in the interactive environment, so that a called reinforcement learning model determines a second interactive behavior according to the environment state;
controlling the virtual object to interact with the interaction environment through the second interaction behavior to obtain a second reward value;
composing new interaction data based on the second reward value, the second interaction behavior, and the environmental state.
6. The method of claim 3, wherein the reinforcement learning system further comprises a batch processing component and an agent component; and the storing the obtained interaction data in a buffer through a buffer component in the reinforcement learning system comprises:
batching, through the batch processing component, the interaction data to obtain interaction data in a batch format, and storing the batch-format interaction data in the buffer through the buffer component in the reinforcement learning system;
the method further comprising: reading the stored interaction data from the buffer through the agent component, and sending the interaction data to the learning component, so as to execute the step of iteratively training the reinforcement learning model based on the interaction data through the learning component.
7. The method of claim 1, wherein the evaluating, through the evaluation component, the reinforcement learning model obtained by the iterative training, and the judging, according to the evaluation result, whether the reinforcement learning model obtained by the iterative training meets the interaction condition comprise:
applying the reinforcement learning model obtained by the iterative training to a test environment through the evaluation component so as to enable the reinforcement learning model to interact with the test environment to obtain an interaction result;
evaluating the interaction result to obtain an evaluation index value;
and judging, according to the evaluation index value, whether the reinforcement learning model obtained by the iterative training meets the interaction condition.
8. A componentized reinforcement learning system, the system comprising: an operation component, a learning component, and an evaluation component;
the operation component is used for controlling the interaction between the virtual object and the interactive environment and storing interactive data obtained by the interaction;
the learning component is used for acquiring stored interactive data and carrying out iterative training on a reinforcement learning model based on the interactive data;
the evaluation component is used for evaluating the reinforcement learning model obtained by iterative training in the iterative training process and judging whether the reinforcement learning model obtained by iterative training meets the interaction condition according to the evaluation result;
the learning component is further configured to update the model associated with the operating component with the reinforcement learning model obtained through the iterative training when the reinforcement learning model obtained through the iterative training does not meet the interaction condition, so that the operating component controls the virtual object based on the updated reinforcement learning model.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110171433.3A 2021-02-07 2021-02-07 Modularized reinforcement learning model processing method, system, equipment and storage medium Active CN112862108B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110171433.3A CN112862108B (en) 2021-02-07 2021-02-07 Modularized reinforcement learning model processing method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112862108A true CN112862108A (en) 2021-05-28
CN112862108B CN112862108B (en) 2024-05-07

Family

ID=75989098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110171433.3A Active CN112862108B (en) 2021-02-07 2021-02-07 Modularized reinforcement learning model processing method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112862108B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020106908A1 (en) * 2018-11-21 2020-05-28 Amazon Technologies, Inc. Reinforcement learning model training through simulation
CN111111204A (en) * 2020-04-01 2020-05-08 腾讯科技(深圳)有限公司 Interactive model training method and device, computer equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469372A (en) * 2021-07-02 2021-10-01 北京市商汤科技开发有限公司 Reinforcement learning training method, device, electronic equipment and storage medium
CN114020355A (en) * 2021-11-01 2022-02-08 上海米哈游天命科技有限公司 Object loading method and device based on cache space
CN114020355B (en) * 2021-11-01 2024-01-30 上海米哈游天命科技有限公司 Object loading method and device based on cache space

Also Published As

Publication number Publication date
CN112862108B (en) 2024-05-07

Similar Documents

Publication Publication Date Title
CN112169339B (en) Customized model for simulating player play in video games
CN110443284B (en) Artificial intelligence AI model training method, calling method, server and readable storage medium
Goldwaser et al. Deep reinforcement learning for general game playing
CN109621431B (en) Game action processing method and device
CN112862108B (en) Modularized reinforcement learning model processing method, system, equipment and storage medium
CN109460463A (en) Model training method, device, terminal and storage medium based on data processing
CN109847366B (en) Data processing method and device for game
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
JP2010525422A (en) A distributed network architecture that introduces dynamic content into an artificial environment
CN112329948A (en) Multi-agent strategy prediction method and device
CN110639208B (en) Control method and device for interactive task, storage medium and computer equipment
CN113015219B (en) Network resource selection method and device based on strategy gradient and storage medium
CN111026272A (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
JP2022525880A (en) Server load prediction and advanced performance measurement
Trescak et al. Case-based planning for large virtual agent societies
CN112905013A (en) Intelligent agent control method and device, computer equipment and storage medium
CN111589157A (en) AI model training method, AI model using method, equipment and storage medium
CN112933604B (en) Reinforcement learning model processing method, apparatus, computer device and storage medium
Schatten et al. Orchestration platforms for hybrid artificial intelligence in computer games-a conceptual model
US11241622B2 (en) Autoplayers for filling and testing online games
CN109977998B (en) Information processing method and apparatus, storage medium, and electronic apparatus
Rugaber et al. GAIA: A CAD environment for model-Based Adaptation of game-playing Software Agents
Goel et al. GAIA: A CAD-like environment for designing game-playing agents
CN106648895A (en) Data processing method and device, and terminal
Kovarsky et al. A first look at build-order optimization in real-time strategy games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant