CN117518907A - Control method, device, equipment and storage medium of intelligent agent - Google Patents

Control method, device, equipment and storage medium of intelligent agent

Info

Publication number
CN117518907A
Authority
CN
China
Prior art keywords
intelligent agents
agents
intelligent
agent
online
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311451091.6A
Other languages
Chinese (zh)
Inventor
张亦正
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311451091.6A priority Critical patent/CN117518907A/en
Publication of CN117518907A publication Critical patent/CN117518907A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B19/00 - Programme-control systems
    • G05B19/02 - Programme-control systems electric
    • G05B19/04 - Programme control other than numerical control, i.e. in sequence controllers or logic controllers
    • G05B19/042 - Programme control other than numerical control, i.e. in sequence controllers or logic controllers using digital processors
    • G05B19/0423 - Input/output
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00 - Program-control systems
    • G05B2219/20 - Pc systems
    • G05B2219/25 - Pc structure of the system
    • G05B2219/25257 - Microcontroller

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A control method, a device, equipment and a storage medium for an intelligent agent belong to the technical field of artificial intelligence. The embodiment of the application can be applied to various scenes such as cloud technology, artificial intelligence, intelligent traffic and assisted driving. The method comprises the following steps: acquiring state information and online information respectively corresponding to N intelligent agents in a real environment; generating, through a neural network model, action information respectively corresponding to the N intelligent agents according to the state information and the online information respectively corresponding to the N intelligent agents; and controlling the first intelligent agent according to the action information corresponding to the first intelligent agent. According to the method, the state information and the online information of the intelligent agents are input into the neural network model, so that the model can flexibly control a plurality of intelligent agents, which solves the problem of poor generalization caused by training the model on only a fixed number of intelligent agents.

Description

Control method, device, equipment and storage medium of intelligent agent
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for controlling an agent.
Background Art
The intelligent agent control is applied to various scenes, and relates to scenes such as ground, air, underwater, outer space and the like. For example, in the industrial field, the control of the intelligent agent can be applied to the scenes such as warehouse logistics transportation, material transportation at different stations of factories, large-scale part processing or welding, long-distance object detection grabbing, and the like, and aims to improve the working efficiency, reduce the labor cost and reduce the danger of work.
In researching the control method of the intelligent agent, a network model of MLP (Multi-Layer Perceptron) combined with RNN (Recurrent Neural Network) is often adopted in the model training stage. The network model extracts the feature expression of each intelligent agent and the state information of the environment where the intelligent agent is located through the MLP, the RNN further extracts the dependency relation feature between the intelligent agents according to the feature expression extracted by the MLP, and the action information corresponding to the intelligent agents can be obtained through the network model according to the dependency relation feature. The network model decides the action information of the intelligent agent according to the state information through a reinforcement learning mode, continuously optimizes network parameters with the aim of maximizing long-term return in the training process, and finally obtains the trained network model. The action information of each agent can be obtained by inputting the state information of each agent and utilizing the trained network model.
The above method trains the network model using the MLP combined with the RNN network structure. Because the network model is trained with a fixed number of agents, the model cannot achieve good performance when controlling a variable number of agents, i.e., the generalization capability of the model is poor. Therefore, when the model is migrated to the real environment, it is poorly applicable when the number of agents changes, for example when an agent fails during execution of a task and the number therefore decreases, which may cause a problem of poor agent control efficiency.
Disclosure of Invention
The embodiment of the application provides a control method, device and equipment of an intelligent agent and a storage medium. The technical scheme provided by the embodiment of the application is as follows:
according to an aspect of the embodiments of the present application, there is provided a method for controlling an agent, the method including:
acquiring state information and online information respectively corresponding to N intelligent agents in a real environment, wherein the state information corresponding to the intelligent agents is used for indicating the state of the intelligent agents and the state of the environment where the intelligent agents are located, the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not, and N is an integer greater than 1;
Generating action information corresponding to the N intelligent agents respectively according to the state information and the online information corresponding to the N intelligent agents through a neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents, and the neural network model is a model obtained by training in a reinforcement learning mode;
and for the online first intelligent agent in the N intelligent agents, controlling the first intelligent agent according to the action information corresponding to the first intelligent agent.
According to an aspect of an embodiment of the present application, there is provided a training method of a neural network model, the method including:
acquiring state information corresponding to M intelligent agents in a simulation environment at a first time unit respectively, wherein the state information corresponding to the intelligent agents is used for indicating the states of the intelligent agents and the environment where the intelligent agents are located, and M is an integer larger than 1;
determining online information corresponding to the M intelligent agents at the first time unit respectively, wherein the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not;
generating action information corresponding to the M intelligent agents at a first time unit according to the state information and the online information corresponding to the M intelligent agents at the first time unit respectively through the neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents;
After simulation control is performed on the M intelligent agents based on action information respectively corresponding to the M intelligent agents in a first time unit, determining state information and rewarding information respectively corresponding to the M intelligent agents in a second time unit, wherein the rewarding information corresponding to the intelligent agents refers to rewarding points obtained after the action information corresponding to the intelligent agents is executed, and the second time unit is located behind the first time unit;
according to the state information, the action information and the rewarding information which are respectively corresponding to the M intelligent agents in at least one time unit, calculating to obtain a loss function value of the neural network model;
and adjusting parameters of the neural network model based on the loss function value to obtain a trained neural network model.
According to an aspect of the embodiments of the present application, there is provided a control device for an agent, the device including:
the system comprises an acquisition module, a storage module and a control module, wherein the acquisition module is used for acquiring state information and online information corresponding to N intelligent agents in a real environment respectively, the state information corresponding to the intelligent agents is used for indicating the state of the intelligent agents and the state of the environment where the intelligent agents are located, the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not, and N is an integer larger than 1;
The generating module is used for generating action information corresponding to the N intelligent agents respectively according to the state information and the online information corresponding to the N intelligent agents through a neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents, and the neural network model is a model trained by adopting a reinforcement learning mode;
and the control module is used for controlling the first intelligent agent on line in the N intelligent agents according to the action information corresponding to the first intelligent agent.
According to an aspect of an embodiment of the present application, there is provided a training apparatus for a neural network model, the apparatus including:
the system comprises an acquisition module, a first time unit and a second time unit, wherein the acquisition module is used for acquiring state information corresponding to M intelligent agents in a simulation environment respectively at the first time unit, the state information corresponding to the intelligent agents is used for indicating the states of the intelligent agents and the environment where the intelligent agents are located, and M is an integer larger than 1;
the first determining module is used for determining online information corresponding to the M intelligent agents at the first time unit respectively, and the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not;
The generation module is used for generating action information corresponding to the M intelligent agents respectively at the first time unit according to the state information and the online information corresponding to the M intelligent agents respectively at the first time unit through the neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents;
the second determining module is used for determining state information and rewarding information respectively corresponding to the M intelligent agents in a second time unit after performing simulation control on the M intelligent agents based on action information respectively corresponding to the M intelligent agents in the first time unit, wherein the rewarding information corresponding to the intelligent agents is rewarding points obtained after executing the action information corresponding to the intelligent agents, and the second time unit is located behind the first time unit;
the calculation module is used for calculating the loss function value of the neural network model according to the state information, the action information and the rewarding information which are respectively corresponding to the M intelligent agents in at least one time unit;
and the parameter adjusting module is used for adjusting parameters of the neural network model based on the loss function value to obtain a trained neural network model.
According to an aspect of the embodiments of the present application, there is provided a computer device including a processor and a memory, in which a computer program is stored, the computer program being loaded and executed by the processor to implement the control method of the agent or the training method of the neural network model.
According to an aspect of the embodiments of the present application, there is provided a computer-readable storage medium in which a computer program is stored, the computer program being loaded and executed by a processor to implement the control method of the agent or the training method of the neural network model.
According to an aspect of the embodiments of the present application, there is provided a computer program product including a computer program stored in a computer-readable storage medium, from which a processor reads and executes the computer program to implement the above-described control method of an agent or the above-described training method of a neural network model.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
because the input of the neural network model is the state information and the online information of the agents, the online information can feed the change in the number of agents back to the model in a timely manner. Therefore, the model can flexibly control a variable number of agents, which solves the problem of poor generalization caused by training the model on only a fixed number of agents; when the model is migrated to a real environment and the number of agents in the real environment changes, the model remains applicable, thereby improving agent control efficiency.
Drawings
FIG. 1 is a schematic diagram of a CTDE framework provided by one embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment implementation environment of the present application;
FIG. 3 is a flow chart of a method of controlling an agent according to one embodiment of the present application;
FIG. 4 is a flow chart of a method for controlling an agent according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a Transformer model provided in one embodiment of the present application;
FIG. 6 is a flow chart of a method of training a neural network model provided in one embodiment of the present application;
FIG. 7 is a block diagram of an agent's control device provided in one embodiment of the present application;
FIG. 8 is a block diagram of a training apparatus for neural network models provided in one embodiment of the present application;
fig. 9 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial intelligence (Artificial Intelligence, AI for short) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling the machines to have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, and mechatronics. The pre-training model, also called a large model or a foundation model, can be widely applied to downstream tasks in all major directions of artificial intelligence after fine-tuning. Artificial intelligence software technologies mainly include directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Machine Learning (ML) is a multi-domain interdiscipline involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching learning. The pre-training model is the latest development of deep learning and integrates these techniques.
With the research and advancement of artificial intelligence technology, artificial intelligence technology has been researched and applied in many fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, digital twins, virtual humans, robots, AIGC (Artificial Intelligence Generated Content), conversational interaction, smart medical care, smart customer service, and game AI. It is believed that with the development of technology, artificial intelligence technology will find application in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to an artificial intelligence reinforcement learning technology, and is specifically described by the following embodiment.
Before describing the technical scheme of the application, some concepts related to the application are defined and described.
CTDE (centralized training with decentralized execution): in CTDE, the training phase is performed in a centralized manner; all agents share global information and together learn a central policy (Centralized Policy) that receives the status information of the entire agent population and generates the corresponding action information. Referring to fig. 1, a schematic diagram of a CTDE framework provided in one embodiment of the present application is shown. Each agent has an independent Actor, and the Actors share the network parameters of the central policy. Based on the state information (o) observed by the agent itself, the Actor outputs, through the central policy, the corresponding action (a) of the agent as the action the actual agent takes in the environment. The Critic is a value function estimator that takes the inputs and outputs of all agents into account and gives a corresponding score or value. The Critic serves to evaluate the overall quality of the joint action of the multiple agents, providing a feedback signal or return for optimizing the shared network parameters.
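For illustration only, the actor-critic division under CTDE can be sketched roughly as follows; the class names, layer sizes and dimensions are assumptions made for this example and are not taken from the patent.
import torch
import torch.nn as nn

class SharedActor(nn.Module):
    # the shared central policy: every agent reuses the same parameters
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim))
    def forward(self, o):
        # per-agent observation in, per-agent action distribution out
        return torch.softmax(self.net(o), dim=-1)

class CentralCritic(nn.Module):
    # sees the observations of all agents at once and scores the joint behaviour
    def __init__(self, n_agents, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_agents * obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
    def forward(self, all_obs):
        # all_obs: (n_agents, obs_dim) -> single joint value used as the feedback signal
        return self.net(all_obs.reshape(-1))

actor = SharedActor(obs_dim=8, act_dim=4)
critic = CentralCritic(n_agents=3, obs_dim=8)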
In this application, this central policy is a neural network model, which in some embodiments may be a transformer neural network model.
Referring to fig. 2, a schematic diagram of an implementation environment of an embodiment of the present application is shown, where the implementation environment of the solution may include: a first device 10, a second device 20, and at least one agent 30.
The first device 10 is configured to train the neural network model in a simulation environment using a reinforcement learning manner, and obtain a trained neural network model. The trained neural network model is used for being directly migrated to a real environment for use, and the intelligent agent is controlled.
The second device 20 is configured to run the trained neural network model in the real environment, and generate, according to the state information of the real environment and the online information of the agent, action information for the agent 30 through the trained neural network model, where the action information is used to control the agent 30 to perform actions in the real environment.
In some embodiments, each agent 30 corresponds to a second device 20. The second device 20 may be provided separately from the agent 30, for example the second device 20 may be a separate device capable of communicating with the agent 30, such as an electronic device like a PC (Personal Computer), a cell phone, a tablet, a server, etc.
In some embodiments, the second device 20 may also be integrated with the agent 30, for example, the second device 20 may be a controller mounted on the agent 30, the controller having functions of information processing and controlling the operation of the agent 30.
In some embodiments, the first device 10 may be an electronic device, such as a PC, a server, or the like, that has computing and storage capabilities.
There are a plurality of agents 30. The agent 30 may be an automated device capable of sensing the environment, processing information, performing tasks, and the like, for the purpose of replacing or assisting humans in accomplishing work or entertainment in a variety of fields and scenarios. Illustratively, the agent 30 may be a smart cart, i.e., a cart that is movable and capable of controlling its movement. Illustratively, in a warehouse scenario, the agent 30 may be a robot for handling goods, a warehouse robot, a warehouse patrol robot, or the like. Of course, the technical scheme of the application can be applied to storage scenes as well as other scenes, such as assisted driving, inspection scenes, distribution scenes and the like, and the application is not limited thereto.
In addition, a sensing device for collecting status information of the real environment may be provided in the real environment in which the agent 30 is located. Illustratively, the sensing device includes, but is not limited to, a camera, a distance sensor, a speed sensor, a motion sensor, a temperature sensor, etc., which is capable of acquiring the status of the real environment as well as the status of the agent 30 in the real environment.
Referring to fig. 3, a flowchart of a method for controlling an agent according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device, such as the second device 20 in the implementation environment of the solution shown in fig. 2. The method may include at least one of the following steps 310-330.
Step 310, acquiring state information and online information corresponding to N intelligent agents in a real environment respectively, wherein the state information corresponding to the intelligent agents is used for indicating the state of the intelligent agents and the state of the environment where the intelligent agents are located, the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not, and N is an integer greater than 1.
An agent refers to an intelligent entity, such as a robot, unmanned vehicle, etc., that has autonomous decision-making, sensing, and execution capabilities. The real environment refers to a physical environment in which an agent moves, operates, and interacts, and may include physical parameters and features, such as terrain, obstacles, and the like. The fact that the real environment comprises N intelligent agents means that the real environment comprises a plurality of intelligent agents, the intelligent agents can mutually influence in a communication mode, a cooperation mode or a competition mode, and the technical scheme provided by the embodiment of the application is suitable for a scene of carrying out joint control on the intelligent agents.
The online information corresponding to the agent is used for reflecting whether the agent is currently online. Illustratively, if the agent is online, it means that the agent is able to communicate, receive instructions, send data, and update status information in real time. Illustratively, if the agent is not online, it is indicated that the agent is currently unavailable to participate in the joint operations of coordinated control, information interaction, and sharing decisions. This may be due to a depleted charge of the agent, hardware failure or other technical problem, and corresponding measures need to be taken to ensure that the agent can resume an online state as soon as possible in order to continue to participate in the task.
In some embodiments, for any one agent, the online information corresponding to the agent is updated in the event that the agent changes from online to offline, or from offline to online. Illustratively, if the agent's power supply is interrupted or the battery is depleted, the agent changes from online to offline. Illustratively, if the power supply problem of the agent is solved, such as battery charging being completed or a power failure being repaired, the agent changes from offline to online.
In some embodiments, the state information includes environmental state information and agent state information. The environment state information is used for indicating the state of the real environment where the intelligent agent is located. The environmental status information may include at least one of position, size, attitude, weight, etc. of the object in the environment. The agent status information is used to indicate the status of the agent. The agent status information may include at least one of position, attitude, velocity, acceleration, etc. information of the agent. Of course, the above description of the environmental status information and the agent status information is merely exemplary and illustrative, and the present application is not limited thereto.
In some embodiments, the state information includes at least all information required by the neural network model to determine the agent action information based on the state information. For example, in a scenario where the robot is handling goods, the robot state information may include at least one of position, attitude, speed, orientation, load, etc. information of the robot, which may reflect a motion state and a task state of the robot. The environment state information may include at least one of information of a position, a number, a weight, a target place, etc. of the goods in the real environment, and at least one of information of a map, an obstacle, a passage, etc. of the real environment, which may reflect distribution and demand of the goods and a structure and limitation of the real environment.
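As a purely illustrative sketch of how such state information might be organized in a goods-handling scene (all field names and values are assumptions, not taken from the patent):
# hypothetical state information for one robot and its environment
agent_state = {"position": (2.0, 5.5), "orientation": 90.0, "speed": 0.4, "load": 1}
environment_state = {
    "goods": [{"position": (7.0, 1.0), "weight": 3.0, "target": "dock_2"}],
    "obstacles": [(4.0, 4.0), (4.0, 5.0)],
    "map_size": (10.0, 10.0),
}
state_info = {"agent": agent_state, "environment": environment_state}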
In one possible implementation, the status information may be obtained by a sensing device installed on the real environment and/or the agent. For example, information of obstacles, personnel, objects and the like in the real environment can be acquired through sensing equipment such as cameras, radars, lasers and the like; the information of the position, the gesture, the speed and the like of the intelligent body can also be obtained through the sensing equipment arranged on the intelligent body.
Step 320, generating action information corresponding to the N agents respectively according to the state information and the online information corresponding to the N agents respectively through a neural network model, where the action information corresponding to the agents is used for indicating actions required to be executed by the agents, and the neural network model is a model trained by adopting a reinforcement learning mode.
A neural network model refers to a computational model that is composed of neurons and connections between neurons. It consists of multiple layers or components, each layer containing a number of neurons that enable characterization and processing of input data by learning and adjusting the connection weights between neurons. Reinforcement learning is a machine learning method that focuses on how an agent learns optimal behavior strategies through observed state information and rewards information in interactions with an environment. When training the neural network model in a reinforcement learning manner, the connection weights of the neural network are adjusted according to the experience of the interaction of the agent with the environment to maximize the desired jackpot. In this way, the resulting neural network model can exhibit optimal behavior strategies learned in interactions with the environment.
In some embodiments, the input data of the neural network model includes the status information and online information of the N agents. If the number of agents that are online is K, the number of agents that are not online is N-K, where K is an integer greater than or equal to 0. The input data includes N sets of data, where K sets of data indicate the status information corresponding to the online agents, and the remaining N-K sets may be invalid data indicating that the status information corresponding to the agents that are not online is null. The output data includes N sets of data, where K sets of data indicate the action information corresponding to the online agents, and the remaining N-K sets may be invalid data indicating that the agents that are not online are not controlled; the action information may include probabilities over a number of candidate actions.
And 330, for the online first agent in the N agents, controlling the first agent according to the action information corresponding to the first agent.
The first agent is any one of the N agents that is in an online state. The action information corresponding to each of the N agents can be obtained from the output of the neural network model, so that the currently online agents are controlled and the currently offline agents are not controlled. For example, assume that the current system includes 5 agents and that the agents in the online state are agent No. 1, agent No. 2 and agent No. 3. The neural network model accepts 5 sets of state information and online information, which can be denoted ((o_1, mask_1), (o_2, mask_2), (o_3, mask_3), (o_4, mask_4), (o_5, mask_5)), where (o_4, mask_4) and (o_5, mask_5) indicate that agent No. 4 and agent No. 5 are not online and that the entered state information is an invalid value. The 5 sets of action information output by the neural network model can be denoted (a_1, a_2, a_3, a_4, a_5), corresponding to the action information of each of the 5 agents. Because agent No. 4 and agent No. 5 are not online, a_4 and a_5 can take the value 0 to indicate that the action information is an invalid value, that is, agent No. 4 and agent No. 5 are not controlled; other preset symbols or values may also be used to indicate an invalid value, which is not limited in this application. Agent No. 1 executes the instruction a_1, agent No. 2 executes the instruction a_2, and agent No. 3 executes the instruction a_3, where a_1, a_2 and a_3 may be specific action instructions.
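A minimal sketch of the input and output layout in this 5-agent example follows; the array sizes, the policy call and the send_command helper are hypothetical.
import numpy as np

N = 5
o = np.zeros((N, 4))                 # per-agent state vectors; offline rows stay invalid (zeros)
o[0:3] = np.random.rand(3, 4)        # only agents No. 1-3 report real state information
mask = np.array([1, 1, 1, 0, 0])     # online information: agents No. 4 and No. 5 are offline
# actions = policy(o, mask)          # would return 5 entries; a_4 and a_5 are invalid values
# for i in range(N):
#     if mask[i] == 1:
#         send_command(i, actions[i])  # only the online agents are actually controlled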
According to the technical scheme provided by the embodiment of the application, because the input of the neural network model is the state information and the online information of the agents, the online information can feed the change in the number of agents back to the model in a timely manner. Therefore, the model can flexibly control a variable number of agents, which solves the problem of poor generalization caused by training the model on only a fixed number of agents; when the model is migrated to a real environment and the number of agents in the real environment changes, the model remains applicable, thereby improving agent control efficiency.
Referring to fig. 4, a flowchart of a method for controlling an agent according to another embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device, such as the second device 20 in the implementation environment of the solution shown in fig. 2. The method may include at least one of the following steps 410-460.
In step 410, the state information and the online information corresponding to the N agents in the real environment are obtained, the state information corresponding to the agents is used for indicating the state of the agents and the state of the environment in which the agents are located, the online information corresponding to the agents is used for indicating whether the agents are online, and N is an integer greater than 1.
In some embodiments, the status information and the online information respectively corresponding to the N agents may be input to a neural network model, which may be a Transformer model. Referring to fig. 5, a schematic diagram of a Transformer model according to an embodiment of the present application is shown. The area 51 is used to indicate the status information corresponding to the agents, where o represents status information and o_N represents the status information of the N-th agent. The area 52 is used to indicate the online information corresponding to the agents, where mask represents online information and mask_N represents the online information of the N-th agent.
In step 420, the embedded representation layer generates state embedded representations corresponding to the N agents according to the state information corresponding to the N agents, where the state embedded representations are vector representations obtained by converting the state information.
As shown in fig. 5, the neural network model includes an embedded representation layer 53, a feature extraction layer 54, and a multi-layer perceptron 55. The embedding operation of the state information by the embedding representation layer 53 may transform the state information into a vector representation that the model can understand and process, which vector representation contains semantic information of the state information.
Illustratively, the state embedded representation of the 1st agent obtained by the embedding operation may be denoted as (a_1, a_2, …, a_n), where n is an integer greater than 1, and a_1, a_2, …, a_n may represent vector representations of the different pieces of state information of agent No. 1. Similarly, the vector representation of the state information corresponding to each agent can be derived.
And 430, shielding the non-online intelligent agents in the N intelligent agents according to the online information respectively corresponding to the N intelligent agents, and reserving the state embedded representation corresponding to the online intelligent agents to obtain the input information of the feature extraction layer.
In some embodiments, an implementation of shielding an agent of the N agents that is not online may include: for each of the N agents, calculating a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, and obtaining an updated state embedded representation corresponding to the agent, wherein the online information corresponding to the agent being 1 indicates that the agent is online and the online information corresponding to the agent being 0 indicates that the agent is not online; and obtaining the input information of the feature extraction layer according to the updated state embedded representations respectively corresponding to the N agents.
Illustratively, the state embedded representation of the 1st agent is denoted as (a_1, a_2, …, a_n). If the 1st agent is online and its online information is recorded as 1, the product of the state embedded representation corresponding to the 1st agent and the online information corresponding to it is still (a_1, a_2, …, a_n). If the 1st agent is not online and its online information is recorded as 0, the product of the state embedded representation corresponding to the 1st agent and the online information corresponding to it is 0, i.e., the 1st agent is masked. Similarly, the product of the state embedded representation corresponding to each agent and the corresponding online information can be calculated, thereby realizing the shielding of the offline agents.
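A short sketch of this masking step, assuming the state embedded representations are stacked into a tensor of shape (number of agents, embedding size); the sizes are illustrative.
import torch

state_embed = torch.randn(5, 16)                 # one state embedded representation per agent
mask = torch.tensor([1., 1., 1., 0., 0.])        # online information: 1 = online, 0 = offline
masked_embed = state_embed * mask.unsqueeze(-1)  # rows of offline agents become all zeros
# masked_embed is then used as the input information of the feature extraction layer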
Step 440, processing the input information through the feature extraction layer to obtain an output feature vector;
in some embodiments, the feature extraction layer is built based on a Transformer structure. The Transformer is a neural network architecture based on the self-attention mechanism (Self-Attention). The core idea of the Transformer model is to use the self-attention mechanism to learn the correlations between the individual elements in the input sequence, capturing global context information without using recurrent or convolution operations. As shown in fig. 5, the Transformer structure may include X transformer blocks, where X is an integer greater than 0, and the transformer blocks may be connected in series. A similar effect can be achieved with different numbers of blocks, and the number of transformer blocks is not limited in this application.
In some embodiments, input information is input to a feature extraction layer, and the input information is processed through the feature extraction layer to obtain an output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
The input information may be an embedded representation of the state corresponding to the on-line agent, as shown in FIG. 5, with region 56 being a structural illustration of the attention mechanism. The attention mechanism is based on a mechanism called Self-attention mechanism (Self-attention), which allows the model to calculate at each location of the input information and assign a weight to each location, indicating the importance of that location in context. Illustratively, the attention weight is typically calculated using a set of queries (queries), keys (keys), and values (values). The query is used to specify a location, the keys and values are used to construct a representation, and the attention weight of the location to other locations is obtained by calculating the similarity between the query and the keys. Finally, the weight and the numerical value of the corresponding position are weighted and summed to obtain the output characteristic vector representing the importance of the position.
For example, the state information (such as the position information) corresponding to the 1 st agent may be used as the location of the query, the state information of each other agent may be used as the key, and the similarity may be calculated respectively, so as to obtain the attention weight of the position information to other state information. Therefore, the characteristic vector between each position in the input information can be calculated, and the association relation between every two intelligent agents is obtained.
In the method, the attention mechanism in the feature extraction layer can capture the association relationship between any two positions in the input information, so that the association relationship between any two intelligent agents is obtained. Such associations help the model to better understand the structure and semantic information of the input information and provide a more meaningful representation of the features for later decision making applications.
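The attention computation between agents can be sketched as standard scaled dot-product attention, for example as follows; this is a simplified illustration under assumed dimensions, not the exact network of the patent.
import torch
import torch.nn.functional as F

def agent_self_attention(x, d_k=16):
    # x: (n_agents, d_model); each agent's state features attend to every other agent's
    d_model = x.size(-1)
    w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / d_k ** 0.5        # similarity between every pair of agents
    weights = F.softmax(scores, dim=-1)  # attention weight of each agent towards the others
    return weights @ v                   # weighted sum of values = output feature vectors

features = agent_self_attention(torch.randn(3, 32))  # e.g. 3 online agents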
And 450, generating action information corresponding to the N intelligent agents respectively according to the output feature vectors through the multi-layer perceptron.
As shown in FIG. 5, region 57 is the output of the model, where a represents action information and a_N represents the action information of the N-th agent. The multi-layer perceptron is an MLP (Multilayer Perceptron), a classical artificial neural network model consisting of multiple fully connected neural network layers. Its basic structure consists of an input layer, hidden layers and an output layer, and there may be multiple hidden layers between the input layer and the output layer. Each neural network layer is made up of a plurality of neurons, and the neurons of the hidden layers and the output layer typically employ nonlinear activation functions to increase the expressive power of the model. The multi-layer perceptron takes the feature vectors output by the feature extraction layer as input and obtains the action information corresponding to each agent through forward-propagation computation. The action information may be a vector representing the action taken by the agent, or a probability distribution containing information indicative of the agent's behaviour.
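A sketch of such a multi-layer perceptron head is given below; the layer sizes and the size of the candidate action set are assumptions for illustration.
import torch
import torch.nn as nn

mlp_head = nn.Sequential(
    nn.Linear(16, 64),   # input: one output feature vector from the feature extraction layer
    nn.ReLU(),           # nonlinear activation in the hidden layer
    nn.Linear(64, 6),    # output: scores over 6 assumed candidate actions
    nn.Softmax(dim=-1),  # turn the scores into a probability distribution over actions
)
action_probs = mlp_head(torch.randn(5, 16))  # one action distribution per agent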
Step 460, for the online first agent in the N agents, controlling the first agent according to the action information corresponding to the first agent.
Step 460 is the same as that described in step 330, and the detailed description is referred to above, and is not repeated here.
According to the technical scheme provided by the embodiment of the application, control over a variable number of agents is realized according to the online information, which improves flexibility. By calculating the product of the state embedded representation corresponding to each agent and the online information corresponding to it, the offline agents are shielded; the attention structure of the model can extract the association relation between any two agents, and by shielding the offline agents and extracting these association relations, more accurate and reliable action information can be output.
The above embodiments describe a scheme for controlling an agent by applying a trained neural network model in a real environment, and the training process of the neural network model in a simulation environment will be described by way of embodiments. For the application and training of neural network models, both are associated, and details not described in detail in one embodiment may be found in the description of the other embodiment.
Referring to fig. 6, a flowchart of a training method of a neural network model according to an embodiment of the present application is shown. The subject of execution of the steps of the method may be a computer device, such as the first device 10 in the implementation environment of the solution shown in fig. 2. The method may include at least one of the following steps 610-660.
In step 610, state information corresponding to M agents in the simulation environment at the first time unit is obtained, where the state information corresponding to the agents is used to indicate a state of the agents and a state of an environment in which the agents are located, and M is an integer greater than 1.
The simulation environment refers to a constructed virtual intelligent body motion scene and is used for replacing a real environment to develop, debug, verify and optimize a control algorithm. The design of the simulation environment should be as close as possible to the real environment, which helps to improve the performance of the neural network model after migration from the simulation environment to the real environment.
An agent in a simulation environment is a model for simulating an agent in a real environment. The intelligent agent in the simulation environment and the intelligent agent in the real environment can have the same structure, attribute parameters, capability, quantity and the like, so that the simulation environment is more similar to the real environment, and the performance of the neural network model after the neural network model is migrated from the simulation environment to the real environment is improved.
In some embodiments, the simulation environment and the agents in the simulation environment may be set and generated according to the configuration file. Specifically, configuration data is acquired, wherein the configuration data is used for configuring characteristics of a simulation environment and the number of intelligent agents; based on the configuration data, a simulation environment is constructed, and M agents are created in the simulation environment.
Characteristics of the simulation environment include the size, shape, distribution of obstacles, properties of objects, etc. of the environment. And constructing a simulation environment according to the configuration data, creating M intelligent agents according to the configuration data, and initializing.
In some embodiments, different configuration data is used to configure different numbers of agents, and the neural network model is trained with different configuration data. Illustratively, by modifying the agent quantity parameter in the configuration data, the quantity of agents generated can be flexibly changed to accommodate different experimental scenarios and requirements.
According to the method, the characteristics of the simulation environment and the quantity of the intelligent agents can be specified according to the configuration data, and the simulation environment and the initialized intelligent agents are constructed according to different configuration data. By flexibly adjusting the configuration data, the system can adapt to different task demands and provide diversified training data for the neural network model, thereby improving the application capability of the intelligent agent under different situations and improving the generalization performance of the model.
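A hedged sketch of what such configuration data and the resulting construction could look like (the keys, values and helper function are hypothetical):
config = {
    "map_size": (50, 50),      # characteristics of the simulation environment
    "obstacle_density": 0.1,
    "goods_count": 20,
    "num_agents": 8,           # changing this value changes how many agents are created
}

def build_simulation(cfg):
    env = {"map_size": cfg["map_size"], "goods_left": cfg["goods_count"]}
    agents = [{"id": i, "position": (0.0, float(i)), "online": True}
              for i in range(cfg["num_agents"])]
    return env, agents

env, agents = build_simulation(config)  # different configuration data yields different agent counts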
In step 620, online information corresponding to the M agents in the first time unit is determined, where the online information corresponding to the agents is used to indicate whether the agents are online.
In some embodiments, the online information corresponding to each of the M agents at the first time unit is determined randomly or based on a predetermined rule. Illustratively, the online information of the agents is determined by random generation. The system may randomly generate a sequence of length M containing only 0s and 1s, where 0 represents that an agent is not online and 1 represents that an agent is online, so that the online situation of different agents in the first time unit can be simulated. This way of random generation provides a degree of randomness and reflects, to some extent, the uncertainty of the online status of agents in the real world.
Illustratively, the online information of the agents is determined based on established rules. In this case, the system may determine the online status of each agent according to a preset rule or condition. Illustratively, an agent changes from online to offline after it has performed 100 actions. Illustratively, an agent changes from online to offline after its online duration exceeds a preset threshold. Based on these rules, the system evaluates each agent and determines whether it is online.
According to the method, the online information corresponding to the M intelligent agents at the first time unit is determined through random generation or based on the established rule, and the online information is used for indicating whether the intelligent agents are online or not. The introduction of randomness or the change of the online condition of the intelligent agents based on rules can realize the dynamic adjustment of the number of the controllable intelligent agents, and can well simulate the change and the difference of the online state of the intelligent agents in the real environment.
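Both ways of producing the online information can be sketched as follows; the thresholds in the rule-based variant are assumptions used only for illustration.
import random

def random_online_mask(m):
    # random generation: each entry is 0 (offline) or 1 (online)
    return [random.randint(0, 1) for _ in range(m)]

def rule_based_online_mask(actions_executed, online_duration, max_actions=100, max_duration=600.0):
    # rule-based generation: an agent goes offline after too many actions or too long online
    return [0 if a >= max_actions or d >= max_duration else 1
            for a, d in zip(actions_executed, online_duration)]

mask = random_online_mask(8)  # e.g. M = 8 agents in the first time unit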
In step 630, according to the state information and the on-line information corresponding to the M agents in the first time unit, the neural network model generates action information corresponding to the M agents in the first time unit, where the action information corresponding to the agents is used to indicate actions to be executed by the agents.
In some embodiments, the neural network model includes an embedded representation layer, a feature extraction layer, and a multi-layer perceptron.
The specific implementation steps are as follows: generating state embedded representations corresponding to the M intelligent agents respectively according to the state information corresponding to the M intelligent agents through an embedded representation layer, wherein the state embedded representations are vector representations obtained by converting the state information; shielding the non-online intelligent agents in the M intelligent agents according to the online information respectively corresponding to the M intelligent agents, and reserving the state embedded representation corresponding to the online intelligent agents to obtain the input information of the feature extraction layer; processing the input information through a feature extraction layer to obtain an output feature vector; and generating action information corresponding to the M intelligent agents respectively according to the output characteristic vectors through the multi-layer perceptron.
In some embodiments, the process of shielding an agent of the M agents that is not online is as follows: for each of the M agents, calculating a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, and obtaining an updated state embedded representation corresponding to the agent, wherein the online information corresponding to the agent being 1 indicates that the agent is online and the online information corresponding to the agent being 0 indicates that the agent is not online; and obtaining the input information of the feature extraction layer according to the updated state embedded representations respectively corresponding to the M agents.
In some embodiments, input information is input to a feature extraction layer, and the input information is processed through the feature extraction layer to obtain an output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
In some embodiments, the feature extraction layer is built based on a transducer structure.
In some embodiments, the present application proposes multi-agent reinforcement learning training using a Transformer network structure under the CTDE training framework. The Transformer network can also be used in conjunction with other multi-agent training frameworks, such as IQL (Independent Q-Learning), MADDPG (a multi-agent algorithm based on DDPG), COMA (Counterfactual Multi-Agent Policy Gradients), and MAAC (Multi-Agent Soft Actor-Critic), the multi-agent extended version of the SAC algorithm; this application is not limited in this respect.
In step 640, after performing simulation control on the M agents based on the action information corresponding to the M agents in the first time unit, determining the state information and the reward information corresponding to the M agents in the second time unit, where the reward information corresponding to the agents is a reward score obtained after executing the action information corresponding to the agents, and the second time unit is located after the first time unit.
By simulating the interaction between the environment and the agents, the state information of the agents may be updated to reflect their new states at the second time unit. The reward information reflects the result that an agent achieves after performing an action and is typically used to evaluate and feed back the agent's behavior. The reward information may be a real value representing the quality of the action or of its effect. The reward information may correspond to the task goal; for example, in a goal-directed task, the agent may receive a positive reward for behavior that reaches the task goal, while for actions that violate constraints or fail to reach the goal, the agent may receive a negative reward.
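A toy example of such a reward signal in a goods-handling task (the values and conditions are invented for illustration only):
def reward_for_agent(goods_delivered, violated_constraint):
    # positive reward for progress towards the task goal, negative for violating a constraint
    reward = 1.0 * goods_delivered   # e.g. +1 per item delivered during this time unit
    if violated_constraint:
        reward -= 5.0                # e.g. collision with an obstacle or another agent
    return reward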
In some embodiments, pseudocode by which the model derives action information and reward information from the state information and online information may be as follows:
o, mask = environment.init()  # environment initialization: the scene of the whole environment is generated according to the configuration file, and the n agents to be controlled are generated at the same time; o represents the state information of the agents and the environment where the agents are located, and mask is the online information of the agents.
policy.reset(o, mask)  # neural network initialization: according to the configuration of the environment, a Transformer network is generated whose inputs are the n o_i and the n mask_i and whose outputs are the n a_i, where i refers to the i-th agent, o_i to the state information of the i-th agent, mask_i to the online information of the i-th agent, and a_i to the action information of the i-th agent.
done = False  # done is the task-end flag; done being False indicates that the task has not ended, and done being True indicates that the task has ended.
while done is False:
    action = policy(o, mask)  # the network infers and outputs the corresponding action information according to the current state information and the online information.
    o, reward, done, info = environment.step(action)  # the environment accepts the actions, lets each agent execute its respective action, and after all actions have been executed returns the state information (o) of the whole environment and the reward information (reward) of the corresponding actions. The boolean value of done can be determined from the state information returned by the environment; for example, in the scenario of carrying goods by multiple agents, done is True when the number of goods remaining in the environment is 0 and False otherwise. Additional physical information about the agents and the environment where the agents are located may be included in info.
Step 650, calculating to obtain the loss function value of the neural network model according to the state information, the action information and the rewarding information respectively corresponding to the M intelligent agents in at least one time unit.
In some embodiments, parameters of the neural network model may be adjusted by the PPO (Proximal Policy Optimization) algorithm. PPO is a reinforcement learning algorithm based on policy gradients: the policy is updated through proximal policy optimization so as to achieve stable and efficient training. The core idea of policy gradient algorithms is to optimize the policy by maximizing the expected return. Their advantages are that the policy can be optimized directly, no value function needs to be solved, and complex cases such as high-dimensional or continuous action spaces can be handled.
In some embodiments, an advantage function value and a target value function value are calculated according to the state information, the action information and the reward information respectively corresponding to the at least one time unit, the advantage function value being used to reflect how good the current state and action are relative to the average level, and the target value function value being used to reflect the expected return of the current state; a near-end ratio clipping loss and a value function loss are calculated according to the advantage function value and the target value function value, the near-end ratio clipping loss being used to limit the magnitude of policy updates, and the value function loss being used to optimize the policy; and the loss function value of the neural network model is calculated according to the near-end ratio clipping loss and the value function loss.
The advantage function value refers to the calculation result of the advantage function, which is the function of the difference between the value of the current state and action and the average value; the specific formula is Equation 2 below. The target value function value refers to the calculation result of the target value function, which is the function of the expected return of the current state; the specific formula is Equation 4 below.
Please refer to equation 1, which shows the equation for the near-end ratio clipping loss:
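A reconstruction of Equation 1 is given here, assuming the standard PPO clipped surrogate objective consistent with the term definitions that follow:

L^{clip}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right] \quad (\text{Equation 1})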
wherein t is the current time unit, and r_t(θ) is the update magnitude of the policy at the current time unit, i.e. the ratio of the probability that the current policy takes action a_t in state s_t to the probability that the old policy takes action a_t in state s_t. ε is a hyper-parameter used to control the clipping magnitude, and clip(r_t(θ), 1−ε, 1+ε) is the clipping function. The larger r_t(θ) is, the larger the update magnitude of the probability that the current policy takes action a_t in state s_t relative to the old policy. \hat{A}_t is the advantage function (Equation 2 below), which represents the difference between the value of the current state and action and the average value, and is used to compute the clipping magnitude in the near-end ratio clipping loss.
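A reconstruction of Equation 2, assuming the usual notation Q(s_t, a_t) for the value of taking action a_t in state s_t and V(s_t) for the average value in state s_t:

\hat{A}_t = Q(s_t, a_t) - V(s_t) \quad (\text{Equation 2})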
wherein Q(s_t, a_t) denotes the value of taking action a_t in state s_t, and V(s_t) denotes the average value in state s_t. The larger the advantage function value, the better the current state and action and the larger the reward should be. The advantage function is used to improve the stability of policy updates and to avoid an unstable optimization process caused by excessively drastic updates. Using the advantage function helps control the clipping magnitude when computing the near-end ratio clipping loss, thereby limiting the magnitude of policy updates.
The value function loss is shown in Equation 3:
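A reconstruction of Equation 3, assuming the usual squared-error form of the value function loss:

L^{vf}(\theta) = \mathbb{E}_t\left[\left(V_\theta(s_t) - V^{target}_t\right)^2\right] \quad (\text{Equation 3})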
V_θ(s) is the value function of the state s, and V^{target} is the target value function; please see Equation 4 below:
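A reconstruction of Equation 4, assuming a bootstrapped discounted-return form consistent with the description that follows:

V^{target}_t = \sum_{i=t}^{T} \gamma^{\,i-t}\, r_i + \gamma^{\,T-t}\, V_\theta(s_T) \quad (\text{Equation 4})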
wherein T is the last time unit, t is the current time unit, r_i is the reward information of the i-th time unit, and γ is the discount factor. The first term in the formula is the discounted sum of the instant reward information from the current time unit t to the last time unit T of the episode, and represents the future return. V_θ(s_T) represents the value of the target state s_T. The target value function value, i.e. the expected return, represents the future cumulative reward of state s_T.
The total loss function of the PPO algorithm can be defined as follows, please refer to equation 5:
L(\theta) = L^{clip}(\theta) - c_1 L^{vf}(\theta) + c_2 S(\pi_\theta) \quad (\text{Equation 5})
wherein c_1 and c_2 are hyper-parameters, and S(π_θ) is the entropy of the policy, which is used to increase the exploration of the policy, i.e. the randomness of the probability distribution over actions in each state. The larger the entropy of the policy, the more uniform the probability distribution of the actions taken by the policy in each state, and the more exploratory the policy.
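As a minimal sketch (an assumed PyTorch implementation, not the patent's own code; the function name ppo_loss and the default hyper-parameter values are illustrative), the total loss in Equation 5 can be assembled from the clipped surrogate term, the value function loss and the policy entropy as follows:

import torch

def ppo_loss(log_prob_new, log_prob_old, advantage, value, value_target, entropy,
             epsilon=0.2, c1=0.5, c2=0.01):
    # r_t(theta): probability ratio between the current policy and the old policy
    ratio = torch.exp(log_prob_new - log_prob_old)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Equation 1: clipped surrogate objective, averaged over the sampled time units
    l_clip = torch.min(ratio * advantage, clipped * advantage).mean()
    # Equation 3: squared-error value function loss
    l_vf = (value - value_target).pow(2).mean()
    # Equation 5 is an objective to be maximized; the optimizer minimizes,
    # so its negative is returned here as the loss value
    return -(l_clip - c1 * l_vf + c2 * entropy.mean())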
In some embodiments, other algorithms may be used to adjust the parameters of the neural network model, for example DQN (Deep Q-Networks), DDPG (Deep Deterministic Policy Gradient), A3C (Asynchronous Advantage Actor-Critic), SAC (Soft Actor-Critic) and the like, which is not limited in this application.
Step 660, adjusting parameters of the neural network model based on the loss function value to obtain a trained neural network model.
Taking the minimization of the loss function as the objective based on the above Equation 5, after multiple iterations of parameter adjustment the neural network model gradually converges to a better state, i.e. the trained model. The trained model can be used for prediction and decision-making of the agents in the real environment.
According to the technical solution provided by the embodiments of the present application, the input of the neural network model jointly takes into account the state information and the online information of the agents, and control models are trained for different numbers of agents during training, so that the model can adapt to changes in the number of agents and exhibits better generalization capability. As a result, when the model is migrated to the real environment and the number of agents there changes (for example, an agent automatically goes offline due to a hardware failure), the model still has good applicability.
In addition, during training, the number of agents in the initial simulation environment is adjusted by changing the configuration data, which improves the robustness of the model. This means that the model can handle different numbers of agents and make decisions and exercise control accordingly, so that its application in a real environment has better adaptability and stability.
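Purely as a hypothetical illustration (the configuration format and field names below are assumptions, not the patent's), different configuration data might specify different agent counts for the initial simulation environment, and the same neural network model is then trained under each configuration in turn:

# hypothetical configuration data: each entry configures one simulation
# environment with a different number of agents to be created in it
configs = [
    {"scene": "warehouse", "num_agents": 4},
    {"scene": "warehouse", "num_agents": 8},
    {"scene": "warehouse", "num_agents": 16},
]

Training the same policy across such configurations is what allows the Transformer-based model to keep working when the number of agents changes at deployment time.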
The following are device embodiments of the present application, which may be used to perform method embodiments of the present application. For details not disclosed in the device embodiments of the present application, please refer to the method embodiments of the present application.
Referring to fig. 7, a block diagram of an intelligent agent control device according to an embodiment of the present application is shown. The device has the function of realizing the control method of the intelligent agent, and the function can be realized by hardware or by executing corresponding software by the hardware. The device may be a computer device or may be provided in a computer device. The apparatus 700 may include: an acquisition module 710, a generation module 720, and a control module 730.
The acquiring module 710 is configured to acquire status information and online information corresponding to N agents in a real environment, where the status information corresponding to the agents is used to indicate a status of the agents and a status of an environment in which the agents are located, and the online information corresponding to the agents is used to indicate whether the agents are online or not, and N is an integer greater than 1.
The generating module 720 is configured to generate, through a neural network model, action information corresponding to the N agents according to the state information and the online information corresponding to the N agents, where the action information corresponding to the agents is used to indicate actions that need to be executed by the agents, and the neural network model is a model that is trained by using a reinforcement learning mode.
And the control module 730 is configured to control, for a first online agent among the N agents, the first agent according to the action information corresponding to the first agent.
In some embodiments, the neural network model includes an embedded representation layer, a feature extraction layer, and a multi-layer perceptron; the generating module 720 includes: a first generation unit, a masking unit, a deriving unit and a second generation unit (not shown in fig. 7).
The first generation unit is used for generating state embedded representations corresponding to the N intelligent agents respectively according to the state information corresponding to the N intelligent agents through the embedded representation layer, wherein the state embedded representations are vector representations obtained by converting the state information.
And the shielding unit is used for shielding the non-online intelligent agents in the N intelligent agents according to the online information respectively corresponding to the N intelligent agents, and keeping the embedded representation of the state corresponding to the online intelligent agents to obtain the input information of the feature extraction layer.
And the obtaining unit is used for processing the input information through the characteristic extraction layer to obtain an output characteristic vector.
And the second generation unit is used for generating action information corresponding to the N intelligent agents respectively through the multi-layer perceptron according to the output characteristic vector.
In some embodiments, the shielding unit is configured to calculate, for each of the N agents, a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, to obtain an updated state embedded representation corresponding to the agent; wherein online information of 1 indicates that the agent is online, and online information of 0 indicates that the agent is not online; and to obtain the input information of the feature extraction layer according to the updated state embedded representations respectively corresponding to the N agents.
In some embodiments, the obtaining unit is configured to input the input information to the feature extraction layer, and process the input information through the feature extraction layer to obtain the output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
In some embodiments, the feature extraction layer is built based on a Transformer structure.
In some embodiments, the apparatus 700 further includes an updating module (not shown in fig. 7) configured to update, for any one of the agents, online information corresponding to the agent in a case where the agent changes from online to offline, or from offline to online.
According to the technical solution provided by the embodiments of the present application, since the input of the neural network model consists of the state information and the online information of the agents, the online information can feed changes in the number of agents back to the model in a timely manner. Therefore, the model can flexibly control a variable number of agents, which solves the problem of poor generalization caused by training the model on a fixed number of agents; when the model is migrated to the real environment and the number of agents there changes, it still has good applicability, thereby improving agent control efficiency.
Referring to fig. 8, a block diagram of a training apparatus for a neural network model according to an embodiment of the present application is shown. The device has the function of realizing the training method of the neural network model, and the function can be realized by hardware or by executing corresponding software by the hardware. The device may be a computer device or may be provided in a computer device. The apparatus 800 may include: the system comprises an acquisition module 810, a first determination module 820, a generation module 830, a second determination module 840, a calculation module 850 and a parameter adjustment module 860.
The obtaining module 810 is configured to obtain state information corresponding to M agents in the simulation environment at a first time unit, where the state information corresponding to the agents is used to indicate a state of the agent and a state of an environment where the agent is located, and M is an integer greater than 1.
The first determining module 820 is configured to determine presence information corresponding to the M agents at the first time unit, where the presence information corresponding to the agents is used to indicate whether the agents are online.
The generating module 830 is configured to generate, according to the state information and the online information corresponding to the M agents in the first time unit, the action information corresponding to the M agents in the first time unit, where the action information corresponding to the agents is used to indicate an action that needs to be executed by the agents.
The second determining module 840 is configured to determine, after performing simulation control on the M agents based on the action information corresponding to the M agents in the first time unit, state information and rewarding information corresponding to the M agents in a second time unit, where the rewarding information corresponding to the agents is a rewarding score obtained after the action information corresponding to the agents is executed, and the second time unit is located after the first time unit.
And a calculating module 850, configured to calculate a loss function value of the neural network model according to the state information, the action information, and the rewarding information corresponding to the M agents in at least one time unit.
And the parameter adjusting module 860 is configured to adjust parameters of the neural network model based on the loss function value, so as to obtain a trained neural network model.
In some embodiments, the neural network model includes an embedded representation layer, a feature extraction layer, and a multi-layer perceptron; the generating module 830 includes: a first generation unit, a masking unit, a deriving unit and a second generation unit (not shown in fig. 8).
The first generation unit is used for generating state embedded representations corresponding to the M intelligent agents respectively according to the state information corresponding to the M intelligent agents through the embedded representation layer, wherein the state embedded representations are vector representations obtained by converting the state information.
And the shielding unit is used for shielding the non-online intelligent agents in the M intelligent agents according to the online information respectively corresponding to the M intelligent agents, and keeping the embedded representation of the state corresponding to the online intelligent agents to obtain the input information of the feature extraction layer.
And the obtaining unit is used for processing the input information through the characteristic extraction layer to obtain an output characteristic vector.
And the second generation unit is used for generating action information corresponding to the M intelligent agents respectively through the multi-layer perceptron according to the output characteristic vector.
In some embodiments, the shielding unit is configured to calculate, for each of the M agents, a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, to obtain an updated state embedded representation corresponding to the agent; wherein online information of 1 indicates that the agent is online, and online information of 0 indicates that the agent is not online; and to obtain the input information of the feature extraction layer according to the updated state embedded representations respectively corresponding to the M agents.
In some embodiments, the obtaining unit is configured to input the input information to the feature extraction layer, and process the input information through the feature extraction layer to obtain the output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
In some embodiments, the feature extraction layer is built based on a Transformer structure.
In some embodiments, the first determining module 820 is configured to determine online information corresponding to the M agents at the first time unit, respectively, randomly or based on a predetermined rule.
In some embodiments, the computing module 850 is configured to: calculate an advantage function value and a target value function value according to the state information, the action information and the reward information respectively corresponding to the at least one time unit, the advantage function value being used to reflect how good the current state and action are relative to the average level, and the target value function value being used to reflect the expected return of the current state; calculate a near-end ratio clipping loss and a value function loss according to the advantage function value and the target value function value, the near-end ratio clipping loss being used to limit the magnitude of policy updates, and the value function loss being used to optimize the policy; and calculate the loss function value of the neural network model according to the near-end ratio clipping loss and the value function loss.
In some embodiments, the apparatus 800 further comprises a configuration module (not shown in fig. 8) for obtaining configuration data for configuring characteristics of the simulation environment and the number of agents; constructing the simulation environment based on the configuration data, and creating the M agents in the simulation environment; different configuration data are used for configuring different numbers of the intelligent agents, and the neural network model is trained by adopting the different configuration data.
According to the technical solution provided by the embodiments of the present application, the input of the neural network model jointly takes into account the state information and the online information of the agents, and control models are trained for different numbers of agents during training, so that the model can adapt to changes in the number of agents and exhibits better generalization capability. As a result, when the model is migrated to the real environment and the number of agents there changes (for example, an agent automatically goes offline due to a hardware failure), the model still has good applicability.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to FIG. 9, a block diagram of a computer device 900 according to one embodiment of the present application is shown.
In general, the computer device 900 includes: a processor 910 and a memory 920.
Processor 910 may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor 910 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field Programmable Gate Array) and PLA (Programmable Logic Array). The processor 910 may also include a main processor and a coprocessor; the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 910 may be integrated with a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 910 may also include an AI processor for processing computing operations related to machine learning.
Memory 920 may include one or more computer-readable storage media, which may be non-transitory. Memory 920 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 920 is used to store a computer program configured to be executed by one or more processors to implement the above-described method of controlling an agent or the above-described method of training a neural network model.
Those skilled in the art will appreciate that the architecture shown in fig. 9 is not limiting of the computer device 900, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
In some embodiments, there is also provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the above-described agent control method or the above-described neural network model training method.
Alternatively, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State Drives, solid State disk), optical disk, or the like. The random access memory may include ReRAM (Resistance Random Access Memory, resistive random access memory) and DRAM (Dynamic Random Access Memory ), among others.
In some embodiments, there is also provided a computer program product comprising a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the above-described control method of an agent or the above-described training method of a neural network model.
It should be understood that references herein to "a plurality" mean two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. In addition, the step numbers described herein merely exemplarily show one possible execution sequence of the steps; in some other embodiments, the steps may be executed out of the numbered order, for example two differently numbered steps may be executed simultaneously, or two differently numbered steps may be executed in an order opposite to that shown, which is not limited by the embodiments of the present application.
The foregoing description of the exemplary embodiments of the present application is not intended to limit the invention to the particular embodiments disclosed; on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the invention.

Claims (19)

1. A method for controlling an agent, the method comprising:
acquiring state information and online information respectively corresponding to N intelligent agents in a real environment, wherein the state information corresponding to the intelligent agents is used for indicating the state of the intelligent agents and the state of the environment where the intelligent agents are located, the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not, and N is an integer greater than 1;
Generating action information corresponding to the N intelligent agents respectively according to the state information and the online information corresponding to the N intelligent agents through a neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents, and the neural network model is a model obtained by training in a reinforcement learning mode;
and for the online first intelligent agent in the N intelligent agents, controlling the first intelligent agent according to the action information corresponding to the first intelligent agent.
2. The method of claim 1, wherein the neural network model comprises an embedded representation layer, a feature extraction layer, and a multi-layer perceptron;
the generating, by the neural network model, action information corresponding to the N agents according to the state information and the online information corresponding to the N agents, includes:
generating state embedded representations corresponding to the N intelligent agents respectively according to the state information corresponding to the N intelligent agents through the embedded representation layer, wherein the state embedded representations are vector representations obtained by converting the state information;
shielding the non-online intelligent agents in the N intelligent agents according to the online information respectively corresponding to the N intelligent agents, and keeping the embedded representation of the state corresponding to the online intelligent agents to obtain the input information of the feature extraction layer;
Processing the input information through the feature extraction layer to obtain an output feature vector;
and generating action information corresponding to the N intelligent agents respectively according to the output characteristic vectors through the multi-layer perceptron.
3. The method according to claim 2, wherein the step of shielding the non-online agent of the N agents according to the online information corresponding to the N agents, and retaining the embedded representation of the state corresponding to the online agent, to obtain the input information of the feature extraction layer includes:
for each of the N agents, calculating a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, and obtaining an updated state embedded representation corresponding to the agent; wherein the online information corresponding to the intelligent agent being 1 indicates that the intelligent agent is on-line, and the online information corresponding to the intelligent agent being 0 indicates that the intelligent agent is not on-line;
and embedding the representation according to the updated states respectively corresponding to the N intelligent agents to obtain the input information of the feature extraction layer.
4. The method according to claim 2, wherein said processing said input information by said feature extraction layer to obtain an output feature vector comprises:
Inputting the input information into the feature extraction layer, and processing the input information through the feature extraction layer to obtain the output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
5. The method of claim 2, wherein the feature extraction layer is constructed based on a Transformer structure.
6. The method according to any one of claims 1 to 5, further comprising:
and updating the online information corresponding to any intelligent agent when the intelligent agent is changed from online to offline or from offline to online.
7. A method of training a neural network model, the method comprising:
acquiring state information corresponding to M intelligent agents in a simulation environment at a first time unit respectively, wherein the state information corresponding to the intelligent agents is used for indicating the states of the intelligent agents and the environment where the intelligent agents are located, and M is an integer larger than 1;
determining online information corresponding to the M intelligent agents at the first time unit respectively, wherein the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not;
Generating action information corresponding to the M intelligent agents at a first time unit according to the state information and the online information corresponding to the M intelligent agents at the first time unit respectively through the neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents;
after simulation control is performed on the M intelligent agents based on action information respectively corresponding to the M intelligent agents in a first time unit, determining state information and rewarding information respectively corresponding to the M intelligent agents in a second time unit, wherein the rewarding information corresponding to the intelligent agents refers to rewarding points obtained after the action information corresponding to the intelligent agents is executed, and the second time unit is located behind the first time unit;
according to the state information, the action information and the rewarding information which are respectively corresponding to the M intelligent agents in at least one time unit, calculating to obtain a loss function value of the neural network model;
and adjusting parameters of the neural network model based on the loss function value to obtain a trained neural network model.
8. The method of claim 7, wherein the neural network model comprises an embedded representation layer, a feature extraction layer, and a multi-layer perceptron;
The generating, by the neural network model, action information corresponding to the M agents at the first time unit according to the state information and the online information corresponding to the M agents at the first time unit, includes:
generating state embedded representations corresponding to the M intelligent agents respectively according to the state information corresponding to the M intelligent agents through the embedded representation layer, wherein the state embedded representations are vector representations obtained by converting the state information;
shielding the non-online intelligent agents in the M intelligent agents according to the online information respectively corresponding to the M intelligent agents, and keeping the embedded representation of the state corresponding to the online intelligent agents to obtain the input information of the feature extraction layer;
processing the input information through the feature extraction layer to obtain an output feature vector;
and generating action information corresponding to the M intelligent agents respectively according to the output characteristic vectors through the multi-layer perceptron.
9. The method of claim 7, wherein the shielding the non-online agent of the M agents according to the online information corresponding to the M agents, and retaining the state embedded representation corresponding to the online agent, to obtain the input information of the feature extraction layer, includes:
For each of the M agents, calculating a product of the state embedded representation corresponding to the agent and the online information corresponding to the agent, and obtaining an updated state embedded representation corresponding to the agent; wherein the online information corresponding to the intelligent agent being 1 indicates that the intelligent agent is on-line, and the online information corresponding to the intelligent agent being 0 indicates that the intelligent agent is not on-line;
and embedding the representation according to the updated states respectively corresponding to the M intelligent agents to obtain the input information of the feature extraction layer.
10. The method of claim 7, wherein the processing the input information by the feature extraction layer to obtain an output feature vector comprises:
inputting the input information into the feature extraction layer, and processing the input information through the feature extraction layer to obtain the output feature vector; the feature extraction layer adopts an attention mechanism to extract the association relation between every two intelligent agents.
11. The method of claim 7, wherein the feature extraction layer is constructed based on a Transformer structure.
12. The method according to any one of claims 7 to 11, wherein the determining online information of the M agents respectively corresponding to the first time units includes:
And determining the online information corresponding to the M intelligent agents at the first time unit randomly or based on a set rule.
13. The method according to any one of claims 7 to 11, wherein the calculating the loss function value of the neural network model according to the state information, the action information, and the reward information respectively corresponding to the M agents in at least one time unit includes:
calculating an advantage function value and a target value function value according to the state information, the action information and the rewarding information respectively corresponding to the at least one time unit, wherein the advantage function value is used for reflecting how good the current state and action are relative to the average level, and the target value function value is used for reflecting the expected return of the current state;
calculating a near-end ratio clipping loss and a value function loss according to the advantage function value and the target value function value, wherein the near-end ratio clipping loss is used for limiting the magnitude of policy updates, and the value function loss is used for optimizing the policy;
and calculating the loss function value of the neural network model according to the near-end ratio clipping loss and the value function loss.
14. The method according to any one of claims 7 to 11, further comprising:
acquiring configuration data, wherein the configuration data is used for configuring the characteristics of the simulation environment and the quantity of the intelligent agents;
constructing the simulation environment based on the configuration data, and creating the M agents in the simulation environment;
different configuration data are used for configuring different numbers of the intelligent agents, and the neural network model is trained by adopting the different configuration data.
15. A control device for an agent, the device comprising:
the system comprises an acquisition module, a storage module and a control module, wherein the acquisition module is used for acquiring state information and online information corresponding to N intelligent agents in a real environment respectively, the state information corresponding to the intelligent agents is used for indicating the state of the intelligent agents and the state of the environment where the intelligent agents are located, the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not, and N is an integer larger than 1;
the generating module is used for generating action information corresponding to the N intelligent agents respectively according to the state information and the online information corresponding to the N intelligent agents through a neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents, and the neural network model is a model trained by adopting a reinforcement learning mode;
And the control module is used for controlling, for an online first intelligent agent among the N intelligent agents, the first intelligent agent according to the action information corresponding to the first intelligent agent.
16. A training apparatus for a neural network model, the apparatus comprising:
the system comprises an acquisition module, a first time unit and a second time unit, wherein the acquisition module is used for acquiring state information corresponding to M intelligent agents in a simulation environment respectively at the first time unit, the state information corresponding to the intelligent agents is used for indicating the states of the intelligent agents and the environment where the intelligent agents are located, and M is an integer larger than 1;
the first determining module is used for determining online information corresponding to the M intelligent agents at the first time unit respectively, and the online information corresponding to the intelligent agents is used for indicating whether the intelligent agents are online or not;
the generation module is used for generating action information corresponding to the M intelligent agents respectively at the first time unit according to the state information and the online information corresponding to the M intelligent agents respectively at the first time unit through the neural network model, wherein the action information corresponding to the intelligent agents is used for indicating actions required to be executed by the intelligent agents;
the second determining module is used for determining state information and rewarding information respectively corresponding to the M intelligent agents in a second time unit after performing simulation control on the M intelligent agents based on action information respectively corresponding to the M intelligent agents in the first time unit, wherein the rewarding information corresponding to the intelligent agents is rewarding points obtained after executing the action information corresponding to the intelligent agents, and the second time unit is located behind the first time unit;
The calculation module is used for calculating the loss function value of the neural network model according to the state information, the action information and the rewarding information which are respectively corresponding to the M intelligent agents in at least one time unit;
and the parameter adjusting module is used for adjusting parameters of the neural network model based on the loss function value to obtain a trained neural network model.
17. A computer device comprising a processor and a memory in which a computer program is stored, the computer program being loaded and executed by the processor to implement the method of any one of claims 1 to 6 or the method of any one of claims 7 to 14.
18. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement the method of any one of claims 1 to 6 or the method of any one of claims 7 to 14.
19. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, from which a processor reads and executes the computer program to implement the method according to any one of claims 1 to 6 or the method according to any one of claims 7 to 14.
CN202311451091.6A 2023-11-02 2023-11-02 Control method, device, equipment and storage medium of intelligent agent Pending CN117518907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311451091.6A CN117518907A (en) 2023-11-02 2023-11-02 Control method, device, equipment and storage medium of intelligent agent

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311451091.6A CN117518907A (en) 2023-11-02 2023-11-02 Control method, device, equipment and storage medium of intelligent agent

Publications (1)

Publication Number Publication Date
CN117518907A true CN117518907A (en) 2024-02-06

Family

ID=89765501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311451091.6A Pending CN117518907A (en) 2023-11-02 2023-11-02 Control method, device, equipment and storage medium of intelligent agent

Country Status (1)

Country Link
CN (1) CN117518907A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118012077A (en) * 2024-04-08 2024-05-10 山东大学 Four-foot robot motion control method and system based on reinforcement learning motion simulation

Similar Documents

Publication Publication Date Title
Yao et al. Path planning method with improved artificial potential field—a reinforcement learning perspective
Shi et al. End-to-end navigation strategy with deep reinforcement learning for mobile robots
CN112119409B (en) Neural network with relational memory
Cao et al. Target search control of AUV in underwater environment with deep reinforcement learning
WO2019219969A1 (en) Graph neural network systems for behavior prediction and reinforcement learning in multple agent environments
CN112329948B (en) Multi-agent strategy prediction method and device
Lin et al. Evolutionary digital twin: A new approach for intelligent industrial product development
Guo et al. A fusion method of local path planning for mobile robots based on LSTM neural network and reinforcement learning
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
Yu et al. Hybrid attention-oriented experience replay for deep reinforcement learning and its application to a multi-robot cooperative hunting problem
KR20230119023A (en) Attention neural networks with short-term memory
US20220366246A1 (en) Controlling agents using causally correct environment models
JP2023528150A (en) Learning Options for Action Selection Using Metagradients in Multitask Reinforcement Learning
Liu et al. Self-attention-based multi-agent continuous control method in cooperative environments
Liu et al. Reinforcement learning-based collision avoidance: Impact of reward function and knowledge transfer
Hickling et al. Explainability in Deep Reinforcement Learning: A Review into Current Methods and Applications
CN116841317A (en) Unmanned aerial vehicle cluster collaborative countermeasure method based on graph attention reinforcement learning
Cao et al. A fuzzy-based potential field hierarchical reinforcement learning approach for target hunting by multi-AUV in 3-D underwater environments
Fu et al. Memory-enhanced deep reinforcement learning for UAV navigation in 3D environment
De Jesus et al. Deep deterministic policy gradient for navigation of mobile robots
CN112348285B (en) Crowd evacuation simulation method in dynamic environment based on deep reinforcement learning
CN115066686A (en) Generating implicit plans that achieve a goal in an environment using attention operations embedded to the plans
Wang et al. Behavioral decision-making of mobile robot in unknown environment with the cognitive transfer
Cummings et al. Development of a hybrid machine learning agent based model for optimization and interpretability
Zou et al. A neurobiologically inspired mapping and navigating framework for mobile robots

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication