CN112036578B - Agent training method and device, storage medium and electronic device


Info

Publication number
CN112036578B
Authority
CN
China
Prior art keywords
action
environment
algorithm
state
intermediate platform
Prior art date
Legal status
Active
Application number
CN202010901910.2A
Other languages
Chinese (zh)
Other versions
CN112036578A (en)
Inventor
李焱
覃小春
李佶学
Current Assignee
Chengdu Digital Sky Technology Co ltd
Original Assignee
Chengdu Digital Sky Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Chengdu Digital Sky Technology Co ltd
Priority to CN202010901910.2A
Publication of CN112036578A
Application granted
Publication of CN112036578B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 - Arrangements for software engineering
    • G06F 8/30 - Creation or generation of source code
    • G06F 8/31 - Programming languages or programming paradigms
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Stored Programmes (AREA)

Abstract

The application relates to the technical field of artificial intelligence, and provides an agent training method and device, a storage medium, and an electronic device. The agent training method includes the following steps: receiving a first action execution request initiated by a first algorithm side; sending a first action to a first environment side as the return result of a first action acquisition request previously initiated by the first environment side, so that the first environment side executes the first action; receiving a second action acquisition request initiated by the first environment side; and sending a second state to the first algorithm side as the return result of the first action execution request, so that the first algorithm side updates the agent according to the second state and acquires a second action selected by the updated agent, the second action being carried in a second action execution request to be initiated subsequently by the first algorithm side. The method allows both algorithm designers and environment developers to develop programs according to the logic they are accustomed to, so the efficiency of algorithm and environment development is significantly improved.

Description

Agent training method and device, storage medium and electronic device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an agent training method and device, a storage medium, and an electronic device.
Background
Reinforcement learning is a machine learning paradigm in which an agent interacts with an environment, continuously collects feedback data, and ultimately produces intelligent behavior. Compared with supervised learning, reinforcement learning requires no manually annotated data and can therefore be used in many scenarios.
Taking a game environment as an example, the reinforcement learning process mainly involves two sides, the algorithm and the game, which are developed by algorithm designers and game developers respectively. The logic preferred by algorithm designers is "the algorithm drives the game", i.e., the algorithm controls, as needed, when the game executes an action; the logic preferred by game developers is "the game drives the algorithm", i.e., the game controls, as needed, when to request the action to be executed from the algorithm. The action executed by the game is automatically selected by an agent in the algorithm according to the current state of the game and other factors, and that agent is the target trained by reinforcement learning.
However, the two kinds of logic, "the algorithm drives the game" and "the game drives the algorithm", conflict with each other: the former requires game development to adapt to the algorithm, so the game developer must invest a great deal of effort, while the latter requires algorithm development to adapt to the game, so the algorithm designer must invest a great deal of effort. Whichever logic is adopted, development is therefore inefficient.
Disclosure of Invention
An objective of the embodiments of the present application is to provide an agent training method and a corresponding device, storage medium, and electronic device, so as to address the above technical problems.
In order to achieve the above purpose, the present application provides the following technical solutions:
in a first aspect, an embodiment of the present application provides an agent training method, applied to an intermediate platform, the method including: receiving a first action execution request initiated by a first algorithm side, the first action execution request carrying a first action, the first action being selected by an agent according to a first state, and the first state being the state a first environment side is in after executing the action preceding the first action; sending the first action to the first environment side as the return result of a first action acquisition request previously initiated by the first environment side, so that the first environment side executes the first action; receiving a second action acquisition request initiated by the first environment side, the second action acquisition request carrying a second state, the second state being the state the first environment side is in after executing the first action; and sending the second state to the first algorithm side as the return result of the first action execution request, so that the first algorithm side updates the agent according to the second state and acquires a second action selected by the updated agent, the second action being carried in a second action execution request to be initiated subsequently by the first algorithm side.
The above method provides an intermediate platform between the algorithm side and the environment side (including but not limited to games), so that interaction between the algorithm side and the environment side is accomplished through the intermediate platform. The intermediate platform is transparent to both the environment side and the algorithm side. From the algorithm side's point of view, it controls, as needed, when the environment side executes an action (by sending an action execution request directed at the environment side), i.e., the logic of "the algorithm drives the environment" is realized; from the environment side's point of view, it controls, as needed, when to request the action to be executed from the algorithm (by sending an action acquisition request directed at the algorithm side), i.e., the logic of "the environment drives the algorithm" is realized. Therefore, both algorithm designers and environment developers can develop programs according to the logic they are accustomed to, so the efficiency of algorithm development and environment development is significantly improved and the reinforcement learning task (i.e., training the agent) can be completed in a short time.
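To make the pairing concrete, the following is a minimal in-process sketch of the intermediate platform's core relay logic (python pseudocode; the class name Relay and its method names are illustrative assumptions, the actual platform receives these calls as network requests from two separate parties, and the access and initialization steps are omitted):
import queue

class Relay:
    # one relay per bound pair of algorithm side and environment side
    def __init__(self):
        self._actions = queue.Queue(maxsize=1)   # algorithm side -> environment side
        self._states = queue.Queue(maxsize=1)    # environment side -> algorithm side

    def execute_action(self, action):
        # action execution request from the algorithm side: the carried action becomes
        # the return result of the environment side's previously initiated action
        # acquisition request; the call then waits for the state carried by the
        # environment side's next action acquisition request
        self._actions.put(action)
        return self._states.get()

    def get_action(self, state):
        # action acquisition request from the environment side: the carried state becomes
        # the return result of the algorithm side's pending action execution request;
        # the call then waits for the next action supplied by the algorithm side
        self._states.put(state)
        return self._actions.get()

Each side simply blocks inside its own call until the other side's next request arrives, which is how the two driving logics are reconciled.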
In an implementation manner of the first aspect, the first action is selected by the agent according to a first state and a first reward, where the first state is the state the first environment side is in after executing the action preceding the first action, and the first reward is the reward generated by the first environment side after executing the action preceding the first action; the second action acquisition request also carries a second reward, the second reward being generated by the first environment side after executing the first action; and sending the second state to the first algorithm side as the return result of the first action execution request, so that the first algorithm side updates the agent according to the second state and acquires the second action selected by the updated agent, includes: sending the second state and the second reward to the first algorithm side as the return result of the first action execution request, so that the first algorithm side updates the agent according to the second state and the second reward and acquires the second action selected by the updated agent.
In this implementation, the agent can be updated not only according to the state but also according to the reward, so the manner in which the agent is updated is more flexible.
In an implementation manner of the first aspect, before receiving the first action execution request sent by the first algorithm side, the method further includes: receiving an algorithm access request initiated by the first algorithm side, the algorithm access request carrying an environment keyword; searching, among all environment sides whose identifiers are recorded on the intermediate platform, for the environment side matching the environment keyword, the found environment side being the first environment side; and sending the identifier of the first environment side to the first algorithm side, so that the first algorithm side carries the identifier of the first environment side in subsequently initiated requests and the intermediate platform can determine, from that identifier, which environment side a request is directed at.
In this implementation, when the algorithm side accesses the intermediate platform, the accessing algorithm side and the designated environment side are bound using the identifier of the environment side, and the bound algorithm side and environment side can then interact with a definite target.
In an implementation manner of the first aspect, the algorithm access request further carries an algorithm registration credential; after receiving the algorithm access request initiated by the first algorithm side and before searching, among all environment sides whose identifiers are recorded on the intermediate platform, for the environment side matching the environment keyword, the method further includes: determining, according to the algorithm registration credential, that the first algorithm side has been registered on the intermediate platform.
In this implementation, when the algorithm side accesses the intermediate platform, the algorithm registration credential can be used to verify the algorithm side's qualification to access.
In an implementation manner of the first aspect, after sending the identifier of the first environment side to the first algorithm side, the method further includes: receiving an initial state acquisition request initiated by the first algorithm side, the initial state acquisition request carrying an initialization action; sending the initialization action and the identifier of the first environment side to the first environment side, the initialization action serving as the return result of an environment access request initiated by the first environment side, so that the first environment side executes the initialization action and records the identifier of the first environment side; receiving the first action acquisition request initiated by the first environment side, the first action acquisition request carrying the first state, the first state being the state the first environment side is in after executing the initialization action; and sending the first state to the first algorithm side as the return result of the initial state acquisition request, so that the first algorithm side updates the agent according to the first state and acquires the first action selected by the updated agent.
The first state in the above implementation is the initial state of the first environment side. Before a new training run starts, the environment side to be used in the training can be initialized, since it cannot be ruled out that the environment side has been used in earlier training and is therefore not in its initial state.
In one implementation manner of the first aspect, the first action acquisition request and the second action acquisition request are both initiated by calling the same interface provided by the intermediate platform.
In this implementation, the environment side only needs a single interface to interact with the algorithm side, which is simple and efficient.
In an implementation manner of the first aspect, before receiving the algorithm access request initiated by the first algorithm side, the method further includes: receiving an environment access request sent by the first environment side; in response to the environment access request, generating the identifier of the first environment side and recording the identifier of the first environment side on the intermediate platform; and blocking the environment access request, the return result of the environment access request not being sent to the first environment side during the blocking, until the blocking is released after the first algorithm side has accessed the intermediate platform, at which point the initialization action, serving as the return result of the environment access request, and the identifier of the first environment side are allowed to be sent to the first environment side; where the first algorithm side having accessed the intermediate platform means that the intermediate platform has finished processing the algorithm access request initiated by the first algorithm side.
In this implementation, when the environment side accesses the intermediate platform, an identifier is allocated to it, and this identifier can later be used to bind the algorithm side and the environment side. In addition, as a design choice, the environment side should access the intermediate platform before the algorithm side, which avoids the situation where the algorithm side has accessed but cannot be used for training because no corresponding environment side is available. During the period in which the environment side has accessed the intermediate platform but the algorithm side has not, the platform can only temporarily block the access request initiated by the environment side from returning, because the return result of that request has to be provided by the algorithm.
In an implementation manner of the first aspect, the environment access request further carries an environment registration credential; after the receiving the environment access request sent by the first environment side and before the responding to the environment access request and generating the identifier of the first environment side, the method further comprises: and determining that the first environment side is registered on the intermediate platform according to the environment registration credentials.
In the above implementation manner, when the environment side accesses the intermediate platform, the environment registration credential may be used to verify the access qualification of the environment side.
In one implementation manner of the first aspect, the method further includes: receiving training information collected by the first algorithm side in the training process of the intelligent agent, and displaying the training information on a visual interface provided by the intermediate platform; wherein the training information comprises training progress and/or resource usage status.
In this implementation, algorithm designers and environment developers can log in to the intermediate platform at any time during the training process and view the training information through the visual interface it provides, so as to keep track of the training situation in time.
In a second aspect, an embodiment of the present application provides an agent training method, applied to a first algorithm side, the method including: initiating a first action execution request to an intermediate platform, so that the intermediate platform sends a first action to a first environment side as the return result of a first action acquisition request previously initiated by the first environment side, the first action being carried in the first action execution request and selected by the agent according to a first state, and the first state being the state the first environment side is in after executing the action preceding the first action; receiving a second state sent by the intermediate platform as the return result of the first action execution request, the second state being the state the first environment side is in after executing the first action; and updating the agent according to the second state and acquiring a second action selected by the updated agent, the second action being carried in a second action execution request to be initiated subsequently by the first algorithm side.
In a third aspect, an embodiment of the present application provides an agent training method, applied to a first environment side, the method including: receiving a first action sent by an intermediate platform as the return result of a first action acquisition request previously initiated by the first environment side, the first action being selected by the agent according to a first state, and the first state being the state the first environment side is in after executing the action preceding the first action; executing the first action; and initiating a second action acquisition request to the intermediate platform, so that the intermediate platform sends a second state to a first algorithm side as the return result of a first action execution request initiated by the first algorithm side, the second state being carried in the second action acquisition request and being the state the first environment side is in after executing the first action.
In a fourth aspect, an embodiment of the present application provides an agent training device configured on an intermediate platform, the device including: a first request receiving module, configured to receive a first action execution request initiated by a first algorithm side, the first action execution request carrying a first action, the first action being selected by an agent according to a first state, and the first state being the state a first environment side is in after executing the action preceding the first action; a first request processing module, configured to send the first action to the first environment side as the return result of a first action acquisition request previously initiated by the first environment side, so that the first environment side executes the first action; a second request receiving module, configured to receive a second action acquisition request initiated by the first environment side, the second action acquisition request carrying a second state, the second state being the state the first environment side is in after executing the first action; and a second request processing module, configured to send the second state to the first algorithm side as the return result of the first action execution request, so that the first algorithm side updates the agent according to the second state and acquires a second action selected by the updated agent, the second action being carried in a second action execution request to be initiated subsequently by the first algorithm side.
In a fifth aspect, an embodiment of the present application provides an agent training device configured on a first algorithm side, the device including: a first request initiating module, configured to initiate a first action execution request to an intermediate platform, so that the intermediate platform sends a first action to a first environment side as the return result of a first action acquisition request previously initiated by the first environment side, the first action being carried in the first action execution request and selected by the agent according to a first state, and the first state being the state the first environment side is in after executing the action preceding the first action; a first result receiving module, configured to receive a second state sent by the intermediate platform as the return result of the first action execution request, the second state being the state the first environment side is in after executing the first action; and an agent updating module, configured to update the agent according to the second state and acquire a second action selected by the updated agent, the second action being carried in a second action execution request to be initiated subsequently by the first algorithm side.
In a sixth aspect, an embodiment of the present application provides an agent training device configured on a first environment side, the device including: a second result receiving module, configured to receive a first action sent by the intermediate platform as the return result of a first action acquisition request previously initiated by the first environment side, the first action being selected by the agent according to a first state, and the first state being the state the first environment side is in after executing the action preceding the first action; an action execution module, configured to execute the first action; and a second request initiating module, configured to initiate a second action acquisition request to the intermediate platform, so that the intermediate platform sends a second state to the first algorithm side as the return result of the first action execution request initiated by the first algorithm side, the second state being carried in the second action acquisition request and being the state the first environment side is in after executing the first action.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium having computer program instructions stored thereon which, when read and executed by a processor, perform the method provided by the first aspect, the second aspect, the third aspect, or any one of the possible implementations of these aspects.
In an eighth aspect, an embodiment of the present application provides an electronic device, including a memory and a processor, the memory storing computer program instructions which, when read and executed by the processor, perform the method provided by the first aspect, the second aspect, the third aspect, or any one of the possible implementations of these aspects.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and should not be regarded as limiting the scope; a person skilled in the art can obtain other related drawings from these drawings without inventive effort.
Fig. 1 shows a flow of an agent training method provided in an embodiment of the present application;
FIG. 2 shows a structure of an agent training device provided in an embodiment of the present application;
fig. 3 shows a structure of an electronic device provided in an embodiment of the present application.
Detailed Description
The industry often uses the Gym platform provided by OpenAI to study reinforcement learning algorithms; it provides many games that algorithm designers can use for testing to verify the effectiveness of new algorithms. The general flow of training an agent with a reinforcement learning algorithm on the Gym platform is as follows:
Algorithm side flow (python pseudocode):
import gym

env = gym.make("CartPole-v0")          # create the environment
agent = RLAgent()                      # obtain the agent
N = 100                                # number of training rounds
for i in range(N):
    state = env.reset()                # initialize the environment
    reward, done, info = 0, False, None
    while 1:
        action = agent.act(state, reward, info)         # get an action
        state, reward, done, info = env.step(action)    # execute the action
        if done:
            break                      # this round of training ends
......
The code mainly involves two entities, the environment env and the agent. The former can be a game (for example, CartPole-v0 above), but can also be a physical flight system, a stock trading system, an automatic recommendation system, an autonomous driving system, and so on; the latter is the training target of reinforcement learning, and the trained agent can be applied in the environment, giving the environment artificial intelligence (for example, as a computer-controlled player or character in a game).
The environment has two main interfaces. One is state = env.reset() above, which initializes the environment and returns its initial state; the other is state, reward, done, info = env.step(action) above, which executes the action selected by the agent (such as an operation in the game) and returns the new state after execution, the reward obtained for executing the action (such as the score gained in the game), the flag done indicating whether this round of training has ended (for example, when the game has been cleared), and other information info.
Inside agent.act, the agent is updated according to the state, reward, and other information returned by this call to env.step (for example, if the agent is implemented as a neural network, the network parameters can be updated), and the updated agent selects the action to be executed by the environment in the next step. The reinforcement learning algorithm to be developed by the algorithm designer mainly refers to the internal logic of agent.act, which may include, for example, the structure of the agent, how the agent is updated, the policy by which the agent selects actions, and so on.
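For concreteness, a bare-bones sketch of such an agent is given below (python pseudocode; the placeholder update and the random action selection are assumptions for illustration only, and a real algorithm such as DQN or PPO would update a neural network and follow its learned policy here):
import random

class RLAgent:
    def __init__(self, action_space=(0, 1)):
        self.action_space = action_space
        self.trajectory = []                     # feedback collected from the environment

    def act(self, state, reward, info):
        # update step: a real reinforcement learning algorithm would use the new
        # state and reward to update its policy or value network; here they are
        # only recorded
        self.trajectory.append((state, reward))
        # selection step: a real algorithm would follow its learned policy
        # (e.g. epsilon-greedy); here a random action stands in as a placeholder
        return random.choice(self.action_space)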
The above code implements the logic of "the algorithm drives the environment", i.e., the algorithm controls, as needed, when the environment executes an action (by calling an interface provided by the environment); this logic is friendly to algorithm designers. To implement "the algorithm drives the environment", the environment must be packaged as a Python interface (such as the reset and step interfaces above) or as a Python-callable library (referring to libraries written in languages other than Python) for the algorithm to invoke. However, generally only a relatively simple environment (such as the games provided by the Gym platform) can be packaged as a Python interface or a Python-callable library; packaging a somewhat complex environment is rather difficult. Alternatively, the environment can be made into a server and called by the algorithm over the network; however, modifying the environment in this way can take a significant amount of time, and some environments are not suitable for being turned into servers (e.g., client games). In summary, carrying out the reinforcement learning process according to the logic of "the algorithm drives the environment" places a large development burden on the environment developer.
The logic that environment developers prefer is "the environment drives the algorithm", i.e., the environment controls, as needed, when to request the action to be executed from the algorithm; its general flow is as follows:
Environment side flow (python pseudocode):
env = Env()
state = env.init()                     # start the environment
agent = RLAgent()                      # obtain the agent
reward, done, info = 0, False, None
while 1:
    action = agent.get_action(state, reward, info)      # get an action
    state, reward, done, info = env.step(action)        # execute the action
    if done:
        break
......
The parts of this code that are similar to the algorithm-side flow are not described again. agent.get_action means that the environment acquires the action to be executed; previously (when the agent was not being trained) the action was acquired from an input device such as a keyboard or mouse (for example, generated by a player through the input device while playing the game), whereas here it is acquired from the agent. When the environment needs a decision, the algorithm is invoked, which can be done over the network (for example, by sending an HTTP request), or the agent can be packaged as a library built into the game. In short, under the logic of "the environment drives the algorithm", the environment code needs very few changes (only the source from which the action is acquired changes), so this logic is friendly to environment developers.
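As a concrete illustration of how small the change is, consider an assumed decision point in a game loop (python pseudocode; read_action, input_device, and the surrounding names are illustrative, not taken from the original text):
def game_decision_point(state, reward, info, agent=None, input_device=None):
    if agent is None:
        # before training: the action is generated by the player through an input device
        return input_device.read_action()
    # during training: only the source of the action changes, it is now requested
    # from the agent (e.g. over the network or via a built-in library)
    return agent.get_action(state, reward, info)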
However, under this approach, because most environments outside the Gym platform do not implement Gym-style interfaces (such as the reset and step interfaces above), a great deal of modification is required on the algorithm side; and because existing reinforcement learning algorithms are designed for environments with Gym-style interfaces, many existing algorithm libraries cannot be used directly, so the algorithm designer has to bear a large development burden.
In summary, the requirements of the two kinds of logic, "the algorithm drives the environment" and "the environment drives the algorithm", conflict: the former requires environment development to adapt to the algorithm, and the latter requires algorithm development to adapt to the environment. The prior art does not unify these two requirements well, so whichever logic is adopted, one of the two parties bears a heavy development workload, and the reinforcement learning task cannot be completed quickly.
In the agent training method provided by the embodiments of the present application, the two kinds of logic, "the algorithm drives the environment" and "the environment drives the algorithm", are coupled together by arranging an intermediate platform between the algorithm side and the environment side, so that development efficiency on both sides is significantly improved. The algorithm side, the environment side, and the intermediate platform are divided mainly at the software function level; how they are deployed at the hardware level is not limited. For example, the three parties may be deployed on the same electronic device or on different electronic devices.
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. The terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Fig. 1 shows the flow of an agent training method provided in an embodiment of the present application. Referring to Fig. 1, one or more algorithm sides and one or more environment sides may access the intermediate platform; for example, the intermediate platform may be implemented as a server or a server cluster for the algorithm sides or environment sides to access, and the intermediate platform has already been started before the steps in Fig. 1 are performed. For a particular agent, the training process involves only one particular environment side and one particular algorithm side, so the method is described by taking the first algorithm side and the first environment side as examples. The agent training method in Fig. 1 includes:
Step S101: the first environment side initiates an environment access request to the intermediate platform.
Step S102: the intermediate platform determines, according to the environment registration credential, that the first environment side has been registered on the intermediate platform.
Step S101 and step S102 are described together. Before step S101 is executed, the environment developer may register an account on the intermediate platform and obtain an environment registration credential, which may be, for example, a string or a certificate. The environment registration credential can be carried in the environment access request initiated by the first environment side, so that the intermediate platform can verify the credential after receiving the request. If the environment access request does not carry the environment registration credential, or the carried credential is invalid, the intermediate platform may reject the access of the first environment side; otherwise, the intermediate platform continues with the subsequent steps. Of course, it is not excluded that in some implementations the intermediate platform allows free access by the environment side, i.e., it does not verify the environment registration credential.
In Fig. 1, the first environment side initiates the environment access request by calling the starttracking interface provided by the intermediate platform, whose parameters include an api_key, i.e., the environment registration credential.
The environment access request initiated by the first environment side is used not only to request access to the intermediate platform but also to request from the first algorithm side the first action to be executed by the first environment side; in this sense the environment access request is similar to the first action acquisition request and the second action acquisition request mentioned later. It is therefore not excluded that, in some implementations, the environment access request is split into two requests, one for accessing the intermediate platform and one for requesting the action to be executed from the first algorithm side, although this requires two network communications.
It should be noted that it is not excluded that the first environment side has interacted with another algorithm side (not the first algorithm side) before step S101 is executed and has participated in the training of another agent; therefore, the above-mentioned "first action to be executed by the first environment side" is only relative to the first algorithm side and does not mean that the first environment side has never executed other actions before.
Step S103: the intermediate platform generates and records an identification of the first environment side.
After confirming that the first environment side is qualified to access, the intermediate platform can allocate a unique identifier to it and record the identifier together with the related information of the first environment side locally on the intermediate platform; the first environment side can then be considered to have accessed the intermediate platform. The identifier of the first environment side can be used at a later stage to bind the algorithm side and the environment side. The intermediate platform could return the identifier to the first environment side immediately after step S103 is performed; however, the method shown in Fig. 1 takes another approach: in order to reduce the number of network communications, the identifier of the first environment side is not sent to the first environment side until step S109, together with the initialization action, as the return result of the environment access request.
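A minimal sketch of steps S102 and S103 on the intermediate platform is given below (python pseudocode; the in-memory registry, the name register_environment, and the credential check are assumptions for illustration):
import uuid

VALID_ENV_KEYS = {"demo-env-key"}     # assumed set of issued environment registration credentials
environments = {}                     # identifier of an environment side -> its related information

def register_environment(api_key, env_info):
    # step S102: verify the environment registration credential
    if api_key not in VALID_ENV_KEYS:
        raise PermissionError("environment registration credential is invalid")
    # step S103: generate and record the identifier of the environment side
    env_id = uuid.uuid4().hex
    environments[env_id] = env_info
    return env_id                     # sent to the environment side later, in step S109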
Step S104: the intermediate platform blocks the environment access request, and releases the blocking only after the first algorithm side has accessed the intermediate platform.
There are two possible access orders for the first environment side and the first algorithm side: either the first environment side accesses before the first algorithm side, or the first algorithm side accesses before the first environment side. To avoid the situation where the first algorithm side has accessed but cannot start training because no corresponding environment side is available, Fig. 1 adopts the former scheme (the latter is similar and is not described separately). In this scheme, after the intermediate platform receives the environment access request, the first algorithm side has quite possibly not accessed yet, so the intermediate platform cannot obtain from the first algorithm side the action requested by the first environment side; it can therefore only block the request. While the request is blocked, the first environment side cannot obtain the return result of the environment access request; only after the first algorithm side has accessed the intermediate platform does the intermediate platform release the blocking and return to the first environment side the action requested by the environment access request (see the later steps for details).
That the first algorithm side has accessed the intermediate platform means that the intermediate platform has finished processing the algorithm access request initiated by the first algorithm side, i.e., the binding relationship between the first algorithm side and the first environment side has been confirmed on the intermediate platform (see the later steps for details).
It should further be noted that if the first environment side accesses the intermediate platform before the first algorithm side, the first environment side does not know of the existence of the first algorithm side when it sends the environment access request, so the environment access request is not sent for a specific algorithm side (for example, it does not carry the identifier of the first algorithm side). Nevertheless, after the first algorithm side accesses the intermediate platform and establishes a binding relationship with the first environment side (see the later steps for details), the request can still be associated with the first algorithm side by the intermediate platform.
Step S105: the first algorithm side initiates an algorithm access request to the intermediate platform.
Step S106: the intermediate platform searches, among all environment sides whose identifiers have been recorded, for the environment side matching the environment keyword, and finds the first environment side.
Step S107: the intermediate platform sends the identification of the first environment side to the first algorithm side.
Step S105 to step S107 are described together. Before step S105 is executed, the algorithm designer may register an account on the intermediate platform and obtain an algorithm registration credential, which may be, for example, a string or a certificate. The algorithm registration credential can be carried in the algorithm access request initiated by the first algorithm side, so that the intermediate platform can verify the credential after receiving the request. If the algorithm access request does not carry the algorithm registration credential, or the carried credential is invalid, the intermediate platform may reject the access of the first algorithm side; otherwise, the intermediate platform continues with the subsequent steps. Of course, it is not excluded that in some implementations the intermediate platform allows free access by the algorithm side, i.e., it does not verify the algorithm registration credential.
In Fig. 1, the first algorithm side initiates the algorithm access request by calling the init interface provided by the intermediate platform (note: not to be confused with env.init in the environment-side code), whose parameters include an api_key, i.e., the algorithm registration credential.
As mentioned above, in Fig. 1 the first environment side accesses the intermediate platform before the first algorithm side, so when step S105 is performed, step S103 has already been performed, i.e., the intermediate platform has already recorded the identifier and related information of the first environment side. Of course, other environment sides may also have accessed the intermediate platform, and their identifiers and related information are likewise recorded.
The first algorithm side should be clear about which environment side it wants to bind to for agent training; for example, the algorithm access request may carry an environment keyword describing that environment side, such as the number of the environment side. The intermediate platform can then search the recorded related information of the environment sides for the environment side matching the keyword, and the search result is the first environment side. The intermediate platform returns the identifier of the first environment side to the first algorithm side, and every request subsequently initiated by the first algorithm side to the intermediate platform carries this identifier, so that the intermediate platform can determine from it that the request is directed at the first environment side; data sent by the first environment side (such as a state) is thus returned to the first algorithm side, and data sent by the first algorithm side (such as an action) is returned to the first environment side. In the following description, however, for simplicity it will not be pointed out each time that a request initiated by the first algorithm side carries the identifier of the first environment side.
After step S107 is performed, the intermediate platform can be considered to have finished processing the algorithm access request; at this point a binding relationship has been established between the first algorithm side and the first environment side, and the first algorithm side can be considered to have accessed the intermediate platform.
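Continuing the registry sketch above, steps S105 to S107 can be sketched as follows (python pseudocode; find_environment, bindings, and the matching rule are illustrative assumptions):
bindings = {}                         # identifier of an algorithm side -> identifier of its bound environment side

def find_environment(keyword):
    # step S106: search the recorded environment sides for one whose related
    # information matches the environment keyword
    for env_id, env_info in environments.items():
        if keyword in env_info.get("description", ""):
            return env_id
    return None

def handle_algorithm_access(algo_id, env_keyword):
    # credential verification is omitted here; see the earlier sketch
    env_id = find_environment(env_keyword)
    if env_id is None:
        raise LookupError("no environment side matches the keyword")
    bindings[algo_id] = env_id        # the binding relationship used to route later requests
    return env_id                     # step S107: returned to the first algorithm side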
In some implementations, the environment side may run as a resident service, while the algorithm-side program may be started and closed at any time; the binding relationship between the algorithm-side program and the environment side is naturally released after the algorithm-side program is closed, and to establish a binding relationship with the environment side again, the algorithm side must access the intermediate platform anew.
In addition, in some implementations, the algorithm may be provided directly by the intermediate platform, where the algorithm side may be considered part of the intermediate platform, so that the algorithm side no longer needs to access the intermediate platform by sending a request, although the algorithm side and environment side binding steps may still be performed similarly.
Step S108: the first algorithm side initiates an initial state acquisition request to the intermediate platform.
Step S109: the intermediate platform sends the initialization action and the identifier of the first environment side to the first environment side.
Step S110: the first environment side executes the initialization action and acquires the first state.
Step S108 to step S110 are described together. After the first algorithm side receives the identifier of the first environment side from the intermediate platform, it initiates an initial state acquisition request to the intermediate platform, so that the first environment side executes the initialization action and the initial state of the environment is acquired from the first environment side. The initial state acquisition request carries the initialization action.
In Fig. 1, the first algorithm side initiates the initial state acquisition request by calling the reset interface provided by the intermediate platform; referring to the algorithm-side flow above, the reset interface provided by the intermediate platform can be regarded as being encapsulated inside env.reset.
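A minimal sketch of such an encapsulation on the algorithm side is given below (python pseudocode; it assumes the intermediate platform exposes an HTTP-style interface, and the class PlatformEnv, the URL paths, and the field names are illustrative assumptions rather than the platform's actual interface):
import requests                        # assumed HTTP client; the real transport may differ

class PlatformEnv:
    # Gym-style wrapper so the existing algorithm-side flow needs no changes
    def __init__(self, base_url, api_key, env_keyword):
        self.base_url = base_url
        # algorithm access request (step S105): bind to a matching environment side
        r = requests.post(base_url + "/init",
                          json={"api_key": api_key, "keyword": env_keyword})
        self.env_id = r.json()["env_id"]

    def reset(self):
        # initial state acquisition request (step S108), carrying the initialization action
        r = requests.post(self.base_url + "/reset",
                          json={"env_id": self.env_id, "action": "init"})
        return r.json()["state"]

    def step(self, action):
        # action execution request (step S114), returning the new state and reward
        r = requests.post(self.base_url + "/step",
                          json={"env_id": self.env_id, "action": action})
        d = r.json()
        return d["state"], d["reward"], d["done"], d["info"]

With such a wrapper, env = PlatformEnv(...) can stand in for env = gym.make(...) in the algorithm-side flow without touching the rest of the code.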
The intermediate platform receives, in step S101, the environment access request initiated by the first environment side; in step S109, it returns the initialization action and the identifier of the first environment side (generated in step S103) to the first environment side as the return result of the environment access request, i.e., in Fig. 1, as the return value of starttracking called in step S101.
After the first environment side receives the initialization action, it executes the action to initialize itself (for example, restart the game level) and acquires its initial state, which is referred to as the first state. Referring to the environment-side flow, executing the initialization action means running the code env.step(action) in that flow, where action is now the initialization action and the output state is the first state. Of course, in different implementations env.step(action) may also output the first reward (here the initial reward, which can be regarded as the reward generated by the first environment side after executing the initialization action), the training end flag done, and other information info.
In addition, the first environment side also records the received identification of the first environment side for self identity recognition.
Step S111: the first environment side initiates a first action acquisition request to the intermediate platform.
Step S112: the intermediate platform sends the first state to the first algorithm side.
Step S113: the first algorithm side updates the agent according to the first state and acquires a first action selected by the updated agent. Step S111 to step S113 are described together. After the initialization action is performed, the first environment side may initiate a first action acquisition request to the intermediate platform, where the purpose is to acquire the first action from the first algorithm side.
In Fig. 1, the first environment side initiates the first action acquisition request by calling the get_action interface provided by the intermediate platform; referring to the environment-side flow above, the get_action interface provided by the intermediate platform can be regarded as being encapsulated inside agent.get_action. The interface has four parameters: the state, the reward, the training end flag done, and other information info, so these four items can be carried in the first action acquisition request. Of course, not every implementation includes all four parameters in the get_action interface; for example, if in some implementations the get_action interface only has the state parameter, then only the state needs to be carried in the first action acquisition request.
Since the first environment side has been initialized in step S110, the first action acquisition request may carry the first state, the first reward, the training end flag, and other information output after the first environment side executes the initialization action (i.e., after it runs env.step(action)). Among these, the first state is necessary and the latter three items are optional. In the following steps, for simplicity, the training end flag done and the other information info are temporarily ignored; that is, the first action acquisition request is considered to carry at least the first state and possibly the first reward.
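Mirroring the algorithm-side wrapper, a minimal sketch of the encapsulation on the environment side is given below (python pseudocode; the class PlatformAgent, the URL paths, and the field names are illustrative assumptions):
import requests                        # assumed HTTP client; the real transport may differ

class PlatformAgent:
    # stand-in for the agent on the environment side: get_action goes to the platform
    def __init__(self, base_url, api_key):
        self.base_url = base_url
        self.api_key = api_key
        self.env_id = None

    def start_training(self):
        # environment access request (the starttracking interface of step S101);
        # blocks on the platform until the algorithm side has accessed, then returns
        # the initialization action and the identifier of this environment side (step S109)
        d = requests.post(self.base_url + "/start_training",
                          json={"api_key": self.api_key}).json()
        self.env_id = d["env_id"]
        return d["action"]

    def get_action(self, state, reward=0, info=None):
        # first/second action acquisition request (steps S111 and S117)
        d = requests.post(self.base_url + "/get_action",
                          json={"env_id": self.env_id, "state": state,
                                "reward": reward, "info": info}).json()
        return d["action"]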
The intermediate platform receives, in step S108, the initial state acquisition request initiated by the first algorithm side; in step S112, it returns the first state (i.e., the initial state) to the first algorithm side as the return result of the initial state acquisition request, i.e., in Fig. 1, as the return value of reset called in step S108, which in the algorithm-side flow is the return value of env.reset.
If the first action acquisition request also carries the first reward, the intermediate platform may return the first reward to the first algorithm side as well. However, in the pseudocode given above, the return value of env.reset in the algorithm-side flow does not include the first reward; because the first reward is only the initial reward and is in many cases just a default value (e.g., 0), in some implementations the first algorithm side does not need to obtain the first reward from the first environment side and can specify a default value by itself. Of course, in other implementations the return value of env.reset may also include the first reward.
After the first algorithm side receives the first state, it updates the agent according to the first state, and the updated agent selects the first action. Selecting an action is not limited to choosing from a specified set of actions; it may also mean determining an action.
In the algorithm-side flow, the code corresponding to the first algorithm side updating the agent according to the first state and selecting the first action is agent.act(state, reward, info), where state is the first state.
Note that updating the agent according to the first state in step S113 is not limited to updating the agent according to the first state only; for example, if another state (a historical state) of the first environment side was received earlier, the agent may be updated according to the first state and the historical state.
If the first algorithm side also receives the first reward, the agent may be updated according to the first state and the first reward (i.e., the reward parameter in agent.act takes the value of the first reward). Similarly, updating the agent according to the first state and the first reward is not limited to using only these two items; if other states (historical states) and other rewards (historical rewards) of the first environment side were received earlier, the agent may also be updated according to the first state, the first reward, the historical states, and the historical rewards. Looking back at steps S108 to S113: the first environment side may already have participated in the training of other agents before training the agent on the first algorithm side, so it may not be in an initial state, and starting training directly could negatively affect the training of the current agent. Therefore, after accessing, the first algorithm side may first initiate an initial state acquisition request, on the one hand to acquire the first state (the initial state), and on the other hand to reinitialize the first environment side.
Step S114: the first algorithm side initiates a first action execution request to the intermediate platform.
Step S115: the intermediate platform sends a first action to the first environment side.
Step S116: the first environment side executes a first action to acquire a second state.
Step S114 to step S116 are described together. After the first algorithm side acquires the first action, a first action execution request is initiated to the intermediate platform, so that the first environment side executes the first action, and a second state of the environment is acquired from the first environment side. The first action execution request carries the first action.
In Fig. 1, the first algorithm side initiates the first action execution request by calling the step interface provided by the intermediate platform; referring to the algorithm-side flow above, the step interface provided by the intermediate platform can be regarded as being encapsulated inside env.step.
The intermediate platform receives, in step S111, the first action acquisition request initiated by the first environment side; in step S115, it returns the first action to the first environment side as the return result of the first action acquisition request, i.e., in Fig. 1, as the return value of get_action called in step S111, which in the environment-side flow is the return value of agent.get_action.
After the first environment side receives the first action, it executes the action and acquires the state it is in after execution, which is referred to as the second state. Referring to the environment-side flow, executing the first action means running the code env.step(action), where action is now the first action and the output state is the second state. Of course, in different implementations env.step(action) may also output a second reward, i.e., the reward generated by the first environment side after executing the first action.
Step S117: the first environment side initiates a second action acquisition request to the intermediate platform.
Step S118: the intermediate platform sends the second state to the first algorithm side.
Step S119: the first algorithm side updates the agent according to the second state and acquires a second action selected by the updated agent.
Step S117 to step S119 are described together. After executing the first action, the first environment side may initiate a second action acquisition request to the intermediate platform, the purpose of which is to acquire the second action from the first algorithm side. The second action acquisition request carries at least the second state and possibly the second reward.
The intermediate platform receives, in step S114, the first action execution request initiated by the first algorithm side; in step S118, it returns the second state to the first algorithm side as the return result of the first action execution request, i.e., in Fig. 1, as the return value of step called in step S114, which in the algorithm-side flow is the return value of env.step. If the second action acquisition request also carries the second reward, the intermediate platform may return the second reward to the first algorithm side as well.
After the first algorithm side receives the second state, it updates the agent according to the second state, and the updated agent selects the second action. In the algorithm-side flow, the code corresponding to updating the agent according to the second state and selecting the second action is agent.act(state, reward, info), where state is the second state.
Thereafter, the flow may jump back to step S114 for iterative execution, and the second action obtained in step S119 serves as the first action of the next step S114.
The training process of the agent can be divided into multiple rounds, each round of training corresponding to one value of N in the algorithm-side flow. After each round of iteration ends (when a round ends is controlled according to the training end flag; the specific logic is given in the foregoing pseudo code), the environment may be reinitialized. The trained agent may be deployed into an environment, for example as a computer-controlled player or character in a game.
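For comparison, the following is a minimal sketch of the algorithm-side flow over N training rounds; here env is assumed to be a thin wrapper around the reset/step interfaces provided by the intermediate platform, agent.act is the call mentioned above, and the loop structure, the done flag and the exact signatures are assumptions rather than the embodiment's code.

    # Algorithm-side flow (illustrative sketch; env wraps the intermediate platform's interfaces).
    for n in range(N):                              # one iteration per training round
        state = env.reset()                         # initial state acquisition request (carries the
                                                    # initialization action), returns the first state
        done = False
        while not done:
            action = agent.act(state)               # update the agent and let it select the next action
            state, reward, done = env.step(action)  # action execution request; returns the next state
                                                    # (and, in some implementations, a reward)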
In some implementations, the first algorithm side may report the collected training information to the intermediate platform during the training process, and the intermediate platform provides a visual interface (e.g., a web page or a client interface) for displaying the received training information, so that algorithm designers and environment developers may log in to the intermediate platform at any time during training and view the training information through the visual interface, so as to grasp the training situation in time. The training information includes training progress (e.g., the number of training rounds) and/or resource usage (e.g., memory or CPU usage), among other things.
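As an illustration of such reporting, the following is a hypothetical sketch; the endpoint name, the field names and the use of psutil/requests are assumptions, since the actual reporting interface is not specified in the text.

    # Hypothetical reporting of training progress and resource usage to the intermediate platform.
    import psutil
    import requests

    def report_training_info(platform_url: str, training_round: int) -> None:
        info = {
            "round": training_round,                              # training progress
            "cpu_percent": psutil.cpu_percent(),                  # resource usage
            "memory_mb": psutil.virtual_memory().used // 2**20,
        }
        requests.post(platform_url + "/training_info", json=info)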
In summary, according to the agent training method provided by the embodiments of the present application, an intermediate platform is provided between the algorithm side and the environment side, so that interaction between the algorithm side and the environment side is realized through the intermediate platform. The intermediate platform is transparent to both the environment side and the algorithm side. From the algorithm side's perspective, it can control on demand when the environment side executes an action (by sending an action execution request for the environment side), i.e. the logic of the "algorithm drives the environment" is realized; from the environment side's perspective, it can control on demand when an action to be executed is requested from the algorithm side (by sending an action acquisition request for the algorithm side), i.e. the logic of the "environment drives the algorithm" is realized. Therefore, both algorithm designers and environment developers can develop their programs according to the logic they are accustomed to (the logic in the foregoing pseudo code is basically unchanged), so that the efficiency of algorithm development and environment development is remarkably improved, and the reinforcement learning task (i.e. training the agent) can be completed in a short time.
The method and its possible implementations further have the following advantages:
1. Environment development is dominant (because in most cases the ultimate goal of training the agent is to apply it to the environment). The environment developer does not need any knowledge of reinforcement learning algorithms and only needs to change a few lines of code (the code for accessing the intermediate platform, i.e. changing manually generated actions into actions generated by the agent; see the sketch after this list of advantages); there is no need to package the environment into a python interface or a python callable library, nor to package the environment into a service, so the process of accessing the platform is very simple.
2. The training and inference interfaces remain the same. At present, the logic of the "algorithm drives the environment" is generally adopted in the training stage of an agent, which requires the environment side to develop the two interfaces env.reset and env.step, while the logic of the "environment drives the algorithm" is generally adopted in the inference stage, which requires the environment side to develop a get_action interface; that is, two sets of interfaces have to be developed for the training and inference stages, which is very cumbersome. In the scheme of the present application, the environment side only needs to work with a single get_action interface: in the training stage it calls the get_action interface provided by the intermediate platform (each action acquisition request is initiated through this interface), and in the inference stage it calls the get_action interface provided by the agent. Moreover, in the inference stage, data can also be collected through this interface for further training, so that the intelligence of the environment is continuously improved.
3. Existing algorithm libraries can be deployed as the intermediate platform, providing access for more environment developers. After the platform is built, the algorithm library can be continuously supplemented, or external algorithm platforms can be added as needed (i.e., an external algorithm platform is connected to the intermediate platform), and the environment side can access training and inference simply through the agreed interface.
4. It is safer for the environment side. If the logic of the "algorithm drives the environment" were adopted, the environment would need to be packaged into a python interface or a python callable library and submitted to the algorithm side, or packaged into a service to be called by the algorithm side, which carries considerable security risks: because the environment developer provides the environment's python interface or python callable library to the algorithm side, there is a risk that the environment code will be reverse-engineered, decompiled, leaked, and so on. By adopting the logic of the "environment drives the algorithm", the environment developer only needs to apply for a training instance (i.e., an environment side accessing the intermediate platform), and the environment side can be arranged in the developer's internal network, so security and privacy are ensured.
5. It is also safer for the algorithm side. Although the algorithm side may still be regarded as adopting the logic of the "algorithm drives the environment", its calls to the environment side are made through the intermediate platform, and there is no need to directly call packages provided by the environment developer locally on the algorithm side; thus, even if the code in these packages is insecure (e.g., contains a backdoor, carries a virus, or is bundled with a trojan), it does not directly endanger the algorithm side itself.
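To illustrate points 1 and 2 above, the following is a minimal sketch under assumed names (platform, trained_agent, scripted_policy and the training flag are not from the original text): the environment code changes only at the point where the action is produced, and the same get_action call serves both the training stage and the inference stage.

    # Before accessing the platform, the environment produced actions itself, e.g.:
    #     action = scripted_policy(state)
    # After access, only the source of the action changes; everything else stays the same.
    if training:
        get_action = platform.get_action         # training: the intermediate platform forwards
                                                 # the request to the algorithm side
    else:
        get_action = trained_agent.get_action    # inference: a locally deployed, trained agent

    action = get_action(state)                   # the rest of the environment code is unchanged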
Fig. 2 shows a functional block diagram of an agent training device 200 according to an embodiment of the present application. Referring to fig. 2, the agent training apparatus 200 includes:
A first request receiving module 210, configured to receive a first action execution request initiated by a first algorithm side; the first action execution request carries a first action, wherein the first action is selected by an agent according to a first state, and the first state is a state that a first environment side is in after executing the last action of the first action;
a first request processing module 220, configured to send, to the first environment side, the first action that is a return result of a first action acquisition request that is initiated before the first environment side, so that the first environment side executes the first action;
a second request receiving module 230, configured to receive a second action acquisition request initiated by the first environment side; the second action obtaining request carries a second state, and the second state is a state that the first environment side is in after executing the first action;
a second request processing module 240, configured to send the second state that is a return result of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state, and obtains a second action selected by the updated agent; and the second action is carried in a second action execution request to be initiated after the first algorithm side.
In one implementation of the agent training device 200, the first action is an action selected by the agent according to a first state and a first reward, where the first state is a state that the first environment side is in after executing a previous action of the first action, and the first reward is a reward generated by the first environment side after executing the previous action of the first action; the second action obtaining request also carries a second reward, wherein the second reward is generated by the first environment side after executing the first action; the second request processing module 240 sends the second state as a return result of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state, and obtains a second action selected by the updated agent, including: and sending the second state and the second reward which are returned results of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state and the second reward, and acquires the second action selected by the updated agent.
In one implementation manner of the agent training device 200, the first request receiving module 210 is further configured to receive an algorithm access request initiated by the first algorithm side before receiving the first action execution request sent by the first algorithm side, where the algorithm access request carries an environmental keyword; the first request processing module 220 is further configured to search for an environment side matching the environment keyword from all environment sides recorded with the identifier on the intermediate platform; the searched environment side is the first environment side; and the intermediate platform is used for sending the identifier of the first environment side to the first algorithm side so that the request initiated by the first algorithm side at the later time carries the identifier of the first environment side, and the intermediate platform can determine the environment side aimed at by the request according to the identifier of the first environment side.
In one implementation of the agent training device 200, the algorithm access request further carries an algorithm registration credential, and the first request processing module 220 is further configured to determine, after the first request receiving module 210 receives the algorithm access request initiated by the first algorithm side and before searching all environment sides recorded with the identifier on the intermediate platform for an environment side matching the environment keyword, that the first algorithm side has been registered on the intermediate platform according to the algorithm registration credential.
In one implementation manner of the agent training device 200, the first request receiving module 210 is further configured to receive an initial state acquisition request initiated by the first algorithm side after the first request processing module 220 sends the identifier of the first environment side to the first algorithm side, where the initial state acquisition request carries an initialization action; the first request processing module 220 is further configured to send, to the first environment side, the initialization action and the identifier of the first environment side, which are a return result of an environment access request initiated before the first environment side, so that the first environment side executes the initialization action and records the identifier of the first environment side; the second request receiving module 230 is further configured to receive the first action obtaining request initiated by the first environment side, where the first action obtaining request carries the first state, and the first state is a state of the first environment side after the initialization action is executed; the second request processing module 240 is further configured to send the first state that is a return result of the initial state acquisition request to the first algorithm side, so that the first algorithm side updates the agent according to the first state, and acquires the first action selected by the updated agent.
In one implementation of agent training device 200, the first action acquisition request and the second action acquisition request are both initiated by invoking the same interface provided by the intermediate platform.
In one implementation of the agent training device 200, the second request receiving module 230 is further configured to receive, before the first request receiving module 210 receives the algorithm access request initiated by the first algorithm side, an environment access request sent by the first environment side; the second request processing module 240 is further configured to generate an identifier of the first environment side in response to the environment access request, and record the identifier of the first environment side on the intermediate platform; and the device is used for blocking the environment access request, and does not send a return result of the environment access request to the first environment side during blocking until the first algorithm side is connected to the intermediate platform and then unblocks, and allows an initialization action serving as the return result of the environment access request and an identifier of the first environment side to be sent to the first environment side; wherein, the first algorithm side having accessed the intermediate platform means that the intermediate platform has processed the algorithm access request initiated by the first algorithm side.
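The blocking behaviour described above could, for example, be realized with a simple synchronization primitive. The following is a sketch under that assumption; the handler names, the use of a queue, and the way the initialization action is handed over are all illustrative and not taken from the original text.

    # Illustrative sketch of blocking an environment access request until the first algorithm
    # side has accessed the intermediate platform (all names and the mechanism are assumed).
    import itertools
    import queue

    _env_ids = itertools.count(1)
    _pending_init = queue.Queue(maxsize=1)   # filled once an initialization action is available

    def handle_environment_access(request):
        env_id = next(_env_ids)              # generate the identifier of the first environment side
                                             # (recording it on the platform is omitted in this sketch)
        # Block: no return result is sent to the environment side during this period.
        init_action = _pending_init.get()
        return init_action, env_id           # return result of the environment access request

    def handle_initial_state_request(init_action):
        _pending_init.put(init_action)       # unblocks the pending environment access request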
In one implementation of the agent training device 200, the environment access request further carries an environment registration credential; the second request processing module 240 is further configured to determine that the first environment side is registered on the intermediate platform according to the environment registration credential after the second request receiving module 230 receives the environment access request sent by the first environment side and before generating the identifier of the first environment side in response to the environment access request.
In one implementation of the agent training device 200, the device further comprises:
the information display module is used for receiving training information collected by the first algorithm side in the training process of the intelligent agent and displaying the training information on a visual interface provided by the intermediate platform; wherein the training information comprises training progress and/or resource usage status.
The implementation principles and technical effects of the agent training apparatus 200 provided in the embodiments of the present application have been described in the foregoing method embodiments; for brevity, where the apparatus embodiments are not mentioned, reference may be made to the corresponding contents of the method embodiments.
The embodiment of the application also provides an agent training device, which comprises:
The first request initiating module is used for initiating a first action executing request to the intermediate platform so that the intermediate platform sends a first action serving as a return result of a first action acquiring request initiated before the first environment side to the first environment side; the first action is carried in the first action execution request, and is selected by the intelligent agent according to a first state, wherein the first state is a state that a first environment side is in after executing the last action of the first action;
the first result receiving module is used for receiving a second state which is sent by the intermediate platform and is used as a return result of the first action execution request, wherein the second state is a state in which the first environment side is after executing the first action;
the agent updating module is used for updating the agent according to the second state and acquiring a second action selected by the updated agent; and the second action is carried in a second action execution request to be initiated after the first algorithm side.
For the above agent training apparatus and its possible implementations, reference may be made to the foregoing agent training apparatus 200; details are not repeated here.
The embodiment of the application also provides an agent training device, which comprises:
the second result receiving module is used for receiving a first action which is sent by the intermediate platform and is used as a return result of a first action acquisition request initiated before the first environment side; the first action is selected by the intelligent agent according to a first state, wherein the first state is a state in which a first environment side is in after executing a previous action of the first action;
an action execution module for executing the first action;
the second request initiating module is used for initiating a second action acquisition request to the intermediate platform so that the intermediate platform sends a second state serving as a return result of the first action execution request initiated by the first algorithm side to the first algorithm side; the second state is carried in the second action obtaining request, and the second state is a state that the first environment side is in after executing the first action.
For the above agent training apparatus and its possible implementations, reference may be made to the foregoing agent training apparatus 200; details are not repeated here.
Fig. 3 shows one possible structure of an electronic device 300 provided in an embodiment of the present application. Referring to fig. 3, the electronic device 300 includes: processor 310, memory 320, and communication interface 330, which are interconnected and communicate with each other by a communication bus 340 and/or other forms of connection mechanisms (not shown).
The memory 320 includes one or more memories (only one is shown in the figure), which may be, but are not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The processor 310, as well as other possible components, may access the memory 320 to read and/or write data therein.
The processor 310 includes one or more processors (only one is shown), which may be an integrated circuit chip having signal processing capability. The processor 310 may be a general-purpose processor, including a Central Processing Unit (CPU), a Micro Controller Unit (MCU), a Network Processor (NP), or other conventional processors; it may also be a special-purpose processor, including a Graphics Processing Unit (GPU), a Neural-network Processing Unit (NPU), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. Also, when there are a plurality of processors 310, some of them may be general-purpose processors and the others may be special-purpose processors.
The communication interface 330 includes one or more interfaces (only one is shown), which may be used to communicate directly or indirectly with other devices for data interaction. The communication interface 330 may include interfaces for wired and/or wireless communication.
One or more computer program instructions may be stored in memory 320 that may be read and executed by processor 310 to implement some or all of the steps in the agent training method provided in embodiments of the present application.
It is to be understood that the configuration shown in fig. 3 is illustrative only, and that electronic device 300 may also include more or fewer components than shown in fig. 3, or have a different configuration than shown in fig. 3. The components shown in fig. 3 may be implemented in hardware, software, or a combination thereof. The electronic device 300 may be a physical device such as a PC, a notebook, a tablet, a cell phone, a server, an embedded device, etc., or may be a virtual device such as a virtual machine, a virtualized container, etc. The electronic device 300 is not limited to a single device, and may be a combination of a plurality of devices or a cluster of a large number of devices. In the solution of the present application, the intermediate platform side, the first algorithm side, and the first environment side may all be deployed on the electronic device 300.
The embodiment of the application also provides a computer readable storage medium, and the computer readable storage medium stores computer program instructions, which when read and executed by a processor of a computer, execute part or all of the steps in the agent training method provided by the embodiment of the application. For example, the computer-readable storage medium may be implemented as memory 320 in electronic device 300 in FIG. 3.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application, and various modifications and variations may be suggested to one skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (16)

1. An agent training method, characterized in that it is applied to an intermediate platform provided between an algorithm side and an environment side, interaction between the algorithm side and the environment side being performed through the intermediate platform, the algorithm side controlling the environment side to execute an action by sending an action execution request for the environment side to the intermediate platform, and the environment side controlling when an action to be executed is requested from the algorithm side by sending an action acquisition request for the algorithm side to the intermediate platform;
The method comprises the following steps:
receiving a first action execution request initiated by a first algorithm side; the first action execution request carries a first action, wherein the first action is selected by an agent according to a first state, and the first state is a state that a first environment side is in after executing the last action of the first action;
sending the first action which is a return result of a first action acquisition request initiated before the first environment side to the first environment side so as to enable the first environment side to execute the first action;
receiving a second action acquisition request initiated by the first environment side; the second action obtaining request carries a second state, and the second state is a state that the first environment side is in after executing the first action;
sending the second state serving as a return result of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state, and acquires a second action selected by the updated agent; and the second action is carried in a second action execution request to be initiated after the first algorithm side.
2. The agent training method according to claim 1, wherein the first action is an action selected by an agent according to a first state and a first reward, the first state being a state in which the first environment side is after a previous action of the first action is performed, the first reward being a reward generated by the first environment side after the previous action of the first action is performed;
the second action obtaining request also carries a second reward, wherein the second reward is generated by the first environment side after executing the first action;
the sending the second state as a return result of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state, and obtains a second action selected by the updated agent, including:
and sending the second state and the second reward which are returned results of the first action execution request to the first algorithm side, so that the first algorithm side updates the agent according to the second state and the second reward, and acquires the second action selected by the updated agent.
3. The agent training method of claim 1, wherein prior to said receiving the first action execution request sent by the first algorithm side, the method further comprises:
receiving an algorithm access request initiated by the first algorithm side, wherein the algorithm access request carries an environment keyword;
searching the environment sides matched with the environment keywords from all environment sides recorded with the identifiers on the intermediate platform; the searched environment side is the first environment side;
and sending the identifier of the first environment side to the first algorithm side so that the first algorithm side carries the identifier of the first environment side in the request initiated later, and the intermediate platform can determine the environment side for which the request is directed according to the identifier of the first environment side.
4. The agent training method of claim 3, wherein the algorithm access request further carries an algorithm registration credential;
after the receiving the algorithm access request initiated by the first algorithm side and before the searching the environment sides matched with the environment keywords from all the environment sides recorded with the identifications on the intermediate platform, the method further comprises:
Determining that the first algorithmic side has been registered on the intermediate platform in accordance with the algorithmic registration credentials.
5. The agent training method of claim 3, wherein after said transmitting the identification of the first environmental side to the first algorithm side, the method further comprises:
receiving an initial state acquisition request initiated by the first algorithm side, wherein the initial state acquisition request carries an initialization action;
sending the initialization action and the identification of the first environment side to the first environment side, wherein the initialization action is used as a return result of an environment access request initiated by the first environment side, so that the first environment side executes the initialization action and records the identification of the first environment side;
receiving the first action acquisition request initiated by the first environment side, wherein the first action acquisition request carries the first state, and the first state is a state of the first environment side after the initialization action is executed;
and sending the first state serving as a return result of the initial state acquisition request to the first algorithm side, so that the first algorithm side updates the agent according to the first state, and acquires the first action selected by the updated agent.
6. The agent training method of claim 5, wherein the first action acquisition request and the second action acquisition request are both initiated by invoking the same interface provided by the intermediate platform.
7. The agent training method of claim 3, wherein prior to said receiving the first algorithm side initiated algorithm access request, the method further comprises:
receiving an environment access request sent by the first environment side;
responding to the environment access request, generating an identifier of the first environment side, and recording the identifier of the first environment side on the intermediate platform;
blocking the environment access request, and not sending a return result of the environment access request to the first environment side during blocking until the first algorithm side releases blocking after accessing the intermediate platform, and allowing to send an initialization action serving as the return result of the environment access request and an identifier of the first environment side to the first environment side; wherein, the first algorithm side having accessed the intermediate platform means that the intermediate platform has processed the algorithm access request initiated by the first algorithm side.
8. The agent training method of claim 7, wherein the environment access request further carries an environment registration credential;
after the receiving the environment access request sent by the first environment side and before the responding to the environment access request and generating the identifier of the first environment side, the method further comprises:
and determining that the first environment side is registered on the intermediate platform according to the environment registration credentials.
9. The agent training method of any one of claims 1-8, wherein the method further comprises:
receiving training information collected by the first algorithm side in the training process of the intelligent agent, and displaying the training information on a visual interface provided by the intermediate platform; wherein the training information comprises training progress and/or resource usage status.
10. An agent training method, applied to a first algorithm side, the method comprising:
initiating an algorithm access request to an intermediate platform, wherein the algorithm access request carries an environment keyword;
receiving an identifier of a first environment side sent by the intermediate platform; the first environment side is an environment side matched with the environment keyword and found from all environment sides locally recorded with the identifiers by the intermediate platform, and the first algorithm side carries the identifiers of the first environment side in the later initiated request, so that the intermediate platform can determine the environment side aimed at by the request according to the identifiers of the first environment side;

initiating a first action execution request to the intermediate platform, so that the intermediate platform sends a first action serving as a return result of a first action acquisition request initiated before the first environment side to the first environment side; the first action is carried in the first action execution request, and is selected by the intelligent agent according to a first state, wherein the first state is a state that a first environment side is in after executing the last action of the first action;
Receiving a second state which is sent by the intermediate platform and is used as a return result of the first action execution request, wherein the second state is a state of the first environment side after the first action is executed;
updating the intelligent agent according to the second state, and acquiring a second action selected by the updated intelligent agent; and the second action is carried in a second action execution request to be initiated after the first algorithm side.
11. An agent training method, for application to a first environmental side, the method comprising:
sending an environment access request to an intermediate platform so that the intermediate platform generates an identifier of the first environment side, recording the identifier of the first environment side on the intermediate platform, and blocking the environment access request, wherein the intermediate platform does not send a return result of the environment access request to the first environment side during blocking until a first algorithm side accesses the intermediate platform and then unblocks the intermediate platform;
receiving a first action which is sent by the intermediate platform and is used as a return result of a first action acquisition request initiated before the first environment side; the first action is selected by the intelligent agent according to a first state, wherein the first state is a state in which a first environment side is in after executing a previous action of the first action;
Performing the first action;
initiating a second action acquisition request to the intermediate platform so that the intermediate platform sends a second state serving as a return result of a first action execution request initiated by the first algorithm side to the first algorithm side; the second state is carried in the second action obtaining request, and the second state is a state that the first environment side is in after executing the first action.
12. An agent training device, characterized by being configured in an intermediate platform provided between an algorithm side and an environment side, interaction between the algorithm side and the environment side being performed through the intermediate platform, the algorithm side controlling the environment side to execute an action by sending an action execution request for the environment side to the intermediate platform, and the environment side controlling when an action to be executed is requested from the algorithm side by sending an action acquisition request for the algorithm side to the intermediate platform;
the device comprises:
the first request receiving module is used for receiving a first action execution request initiated by a first algorithm side; the first action execution request carries a first action, wherein the first action is selected by an agent according to a first state, and the first state is a state that a first environment side is in after executing the last action of the first action;
A first request processing module, configured to send, to the first environment side, the first action that is a return result of a first action acquisition request that was initiated before the first environment side, so that the first environment side executes the first action;
the second request receiving module is used for receiving a second action acquisition request initiated by the first environment side; the second action obtaining request carries a second state, and the second state is a state that the first environment side is in after executing the first action;
the second request processing module is used for sending the second state which is the return result of the first action execution request to the first algorithm side so that the first algorithm side updates the agent according to the second state and acquires a second action selected by the updated agent; and the second action is carried in a second action execution request to be initiated after the first algorithm side.
13. An agent training device configured on a first algorithm side, the device comprising:
the first request initiating module is used for initiating a first action executing request to the intermediate platform so that the intermediate platform sends a first action serving as a return result of a first action acquiring request initiated before the first environment side to the first environment side; the first action is carried in the first action execution request, and is selected by the intelligent agent according to a first state, wherein the first state is a state that a first environment side is in after executing the last action of the first action;
The first result receiving module is used for receiving a second state which is sent by the intermediate platform and is used as a return result of the first action execution request, wherein the second state is a state in which the first environment side is after executing the first action;
the agent updating module is used for updating the agent according to the second state and acquiring a second action selected by the updated agent; the second action is carried in a second action execution request to be initiated after the first algorithm side;
the device is also for:
before the first request initiating module initiates the first action executing request to the intermediate platform, initiating an algorithm access request to the intermediate platform, wherein the algorithm access request carries an environment keyword;
receiving an identifier of a first environment side sent by the intermediate platform; the first environment side is the environment side matched with the environment keyword and found from all environment sides with the identifiers recorded locally by the intermediate platform, and the first algorithm side carries the identifiers of the first environment side in the later initiated request, so that the intermediate platform can determine the environment side aimed at by the request according to the identifiers of the first environment side.
14. An agent training device configured on a first environmental side, the device comprising:
the second result receiving module is used for receiving a first action which is sent by the intermediate platform and is used as a return result of a first action acquisition request initiated before the first environment side; the first action is selected by the intelligent agent according to a first state, wherein the first state is a state in which a first environment side is in after executing a previous action of the first action;
an action execution module for executing the first action;
the second request initiating module is used for initiating a second action acquisition request to the intermediate platform so that the intermediate platform sends a second state serving as a return result of the first action execution request initiated by the first algorithm side to the first algorithm side; the second state is carried in the second action obtaining request, and the second state is a state that the first environment side is in after executing the first action;
the device is also for:
before the second result receiving module receives the first action sent by the intermediate platform, sending an environment access request to the intermediate platform, so that the intermediate platform generates the identifier of the first environment side, records the identifier of the first environment side on the intermediate platform, blocks the environment access request, and does not send a return result of the environment access request to the first environment side during the blocking period until the first algorithm side accesses the intermediate platform, and then unblocks the intermediate platform.
15. A computer readable storage medium, having stored thereon computer program instructions which, when read and executed by a processor, perform the method of any of claims 1-11.
16. An electronic device comprising a memory and a processor, the memory having stored therein computer program instructions that, when read and executed by the processor, perform the method of any of claims 1-11.