CN115730630A - Control method and device of intelligent agent, electronic equipment and storage medium

Info

Publication number: CN115730630A
Application number: CN202211457362.4A
Authority: CN (China)
Legal status: Pending
Prior art keywords: agent, action, belonging, decoding, state data
Other languages: Chinese (zh)
Inventors: 韩翠云, 曾增烽, 张记袁
Original assignee: Baidu com Times Technology Beijing Co Ltd
Application filed by Baidu com Times Technology Beijing Co Ltd
Classification: Feedback Control In General

Abstract

The disclosure provides a control method and device of an agent, an electronic device and a storage medium, and relates to the technical fields of machine learning and natural language processing. The specific implementation scheme is as follows: generating a state sequence according to target state data of a plurality of agents, inputting the state sequence into an agent policy model, and determining an action sequence according to the output of the agent policy model, where the action sequence comprises the target control actions of all agents belonging to a first object; and performing action control on each agent belonging to the first object according to each target control action in the action sequence. Therefore, action control of each agent belonging to the first object (such as each own agent) can be realized in a multi-agent confrontation scenario, so that the agents belonging to the first object can cooperate to complete the confrontation task.

Description

Control method and device of intelligent agent, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of AI (Artificial Intelligence), in particular to the technical fields of machine learning and natural language processing, and more specifically to a control method and apparatus for an agent, an electronic device, and a storage medium.
Background
With the continuous development of intelligent control technology, machine learning (including deep learning, reinforcement learning, supervised learning and unsupervised learning) has been widely applied in many fields such as intelligent robots, unmanned aerial vehicles, unmanned vehicles and the industrial internet of things. Multi-agent confrontation is a research hotspot in the field of intelligent control; it mainly concerns driving the own agents to effectively complete confrontation tasks against the enemy agents, for example, having the own agents cooperate to complete confrontation tasks such as following, defense and attack.
Disclosure of Invention
The disclosure provides a control method and device for an agent, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a control method of an agent, including:
obtaining target state data for a plurality of agents, wherein the plurality of agents includes at least one agent belonging to a first object and at least one agent belonging to a second object;
generating a state sequence according to the target state data of the plurality of agents;
inputting the state sequence into an agent policy model to determine an action sequence according to an output of the agent policy model; wherein the action sequence comprises target control actions of agents belonging to the first object;
and performing action control on each agent belonging to the first object according to each target control action in the action sequence.
According to another aspect of the present disclosure, there is provided a control apparatus of an agent, including:
an obtaining module for obtaining target status data of a plurality of agents, wherein the plurality of agents comprises at least one agent belonging to a first object and at least one agent belonging to a second object;
the generating module is used for generating a state sequence according to the target state data of the plurality of agents;
the determining module is used for inputting the state sequence into an agent strategy model so as to determine an action sequence according to the output of the agent strategy model; the action sequence comprises target control actions of all agents belonging to the first object;
and the control module is used for controlling the action of each agent belonging to the first object according to each target control action in the action sequence.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method of controlling an agent as set forth in the above aspect of the disclosure.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions, the computer instructions being used to cause a computer to perform the control method of an agent set forth in the above aspect of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of controlling an agent as set forth in the above aspect of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a control method of an agent according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of reinforcement learning principle;
fig. 3 is a schematic flowchart of a control method of an agent according to a second embodiment of the present disclosure;
fig. 4 is a schematic flowchart of a control method for an agent according to a third embodiment of the present disclosure;
fig. 5 is a schematic flowchart of a control method of an agent according to a fourth embodiment of the present disclosure;
fig. 6 is a schematic flowchart of a control method of an agent according to a fifth embodiment of the present disclosure;
FIG. 7 is a schematic diagram illustrating the principle of implementing multi-agent confrontation based on spatio-temporal feature fusion and two-way action decoding according to the embodiments of the present disclosure;
fig. 8 is a schematic structural diagram of a control device of an agent according to a sixth embodiment of the present disclosure;
FIG. 9 illustrates a schematic block diagram of an example electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The related art mainly includes the following multi-agent confrontation schemes:
First, an unmanned aerial vehicle game confrontation scheme based on curriculum learning.
In this scheme, curriculum learning and imitation learning are used to train the unmanned aerial vehicle strategy in the confrontation scenario. Because the grading of maneuver difficulty in curriculum learning and the expert data in imitation learning both rely on a large amount of human resources, the labor cost is high. Moreover, the scheme depends on a specific simulation environment; when the simulation environment is replaced, the labor cost needs to be invested again, so the generalization is poor.
Second, a multi-agent game confrontation scheme combining an expert system with reinforcement learning.
This scheme, firstly, relies on expert experience; secondly, although adopting hierarchical reinforcement learning can reduce the difficulty of adversarial learning, it also lowers the upper limit of learning at the same time, because the upper limit of learning becomes the expert experience itself, rather than unexpected but highly flexible operations that could otherwise be learned.
Third, a multi-agent confrontation scheme based on a dynamic graph neural network.
In this scheme, a graph neural network is used to directly construct a graph structure between the agents to represent the relationships between them. However, the feature processing and relationship calculation before graph construction are simplistic, so the accuracy of the constructed graph is difficult to guarantee, and errors may accumulate and propagate.
In view of at least one of the above problems, the present disclosure provides a method and an apparatus for controlling an agent, an electronic device, and a storage medium.
A control method, apparatus, electronic device, and storage medium of an agent of an embodiment of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a control method of an agent according to an embodiment of the present disclosure. Before the embodiments of the present disclosure are described in detail, for ease of understanding, common technical terms are first introduced:
Reinforcement learning is a type of machine learning. As shown in fig. 2, reinforcement learning is a method of learning a strategy (policy) through feedback (i.e., rewards). The policy is the logic that outputs an action based on an observed value, and may be a model such as a neural network. Reinforcement learning can be applied to scenarios such as robot control, game AI, scheduling strategies and combinatorial optimization.
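For illustration only (this is not part of the disclosed embodiments), the feedback loop of fig. 2 can be sketched as a generic interaction loop; the environment and policy interfaces below are hypothetical placeholders:

```python
# Minimal sketch of the reinforcement learning loop in fig. 2 (illustrative only).
# "env" and "policy" are hypothetical stand-ins, not the disclosed model.

def run_episode(env, policy, max_steps=100):
    """Roll out one episode: the policy maps an observation to an action,
    and the environment returns the next observation and a reward (feedback)."""
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)               # strategy: observed value -> action
        obs, reward, done = env.step(action)
        total_reward += reward             # the reward is the feedback used to learn the strategy
        if done:
            break
    return total_reward
```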
The embodiment of the present disclosure is exemplified by the control method of the agent being configured in the control device of the agent, and the control device of the agent can be applied to any electronic device, so that the electronic device can execute the control function of the agent.
The electronic device may be any device with computing capability, for example, a PC (Personal Computer), a mobile terminal, a server, and the like, and the mobile terminal may be a hardware device with various operating systems, touch screens, and/or display screens, such as an in-vehicle device, a mobile phone, a tablet Computer, a Personal digital assistant, and a wearable device.
As shown in fig. 1, the control method of the agent may include the steps of:
step 101, obtaining target state data of a plurality of agents, wherein the plurality of agents comprise at least one agent belonging to a first object and at least one agent belonging to a second object.
In the disclosed embodiment, the first object and the second object are different objects; for example, the first object may be the own party and the second object may be the enemy party.
In embodiments of the present disclosure, the target state data may include pose, position, speed, weapon loaded, etc. information for the agent.
In an embodiment of the present disclosure, target state data of at least one agent belonging to the first object and target state data of at least one agent belonging to the second object may be obtained.
Step 102, generating a state sequence according to the target state data of the plurality of agents.
In embodiments of the present disclosure, a state sequence may be generated from target state data for a plurality of agents.
As an example, the number of agents belonging to the first object is denoted N and the number of agents belonging to the second object is denoted M. Assuming that the first object is the own party and the second object is the enemy party, the target state data of the agents belonging to the first object may be: target state data s_11 of own agent 1, target state data s_12 of own agent 2, ..., target state data s_1N of own agent N; and the target state data of the agents belonging to the second object may be: target state data s_21 of enemy agent 1, target state data s_22 of enemy agent 2, ..., target state data s_2M of enemy agent M. The state sequence may then be {s_11, s_12, ..., s_1N, s_21, s_22, ..., s_2M}.
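As a purely illustrative sketch (the function and variable names are assumptions, not part of the disclosure), the state sequence above could be assembled as follows:

```python
# Illustrative sketch: assemble the state sequence {s_11, ..., s_1N, s_21, ..., s_2M}
# from per-agent target state data; own_states and enemy_states are hypothetical inputs.
import numpy as np

def build_state_sequence(own_states, enemy_states):
    """Own-side states first, then enemy-side states, stacked into one sequence."""
    sequence = list(own_states) + list(enemy_states)   # {s_11,...,s_1N, s_21,...,s_2M}
    return np.stack(sequence, axis=0)                  # shape: (N + M, state_dim)

# Example with N = 2 own agents and M = 3 enemy agents, each state a 6-dimensional vector.
own = [np.random.rand(6) for _ in range(2)]
enemy = [np.random.rand(6) for _ in range(3)]
state_seq = build_state_sequence(own, enemy)           # shape (5, 6)
```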
Step 103, inputting the state sequence into the agent policy model to determine an action sequence according to the output of the agent policy model; wherein, the action sequence comprises the target control action of each agent belonging to the first object.
In the disclosed embodiment, the target control action may include a movement control action, wherein the movement control action is used to indicate at least one of a movement speed, a movement direction and a movement altitude of the agent, for example, when the agent is a drone, the movement control action may be used to indicate a flight speed, a heading and an altitude of the drone. Alternatively, the target control action may comprise a movement control action and an attack control action, wherein the attack control action is used for indicating a target weapon to be launched and a target agent belonging to the second object to be attacked.
In the disclosed embodiment, a state sequence may be input to the agent policy model to determine an action sequence according to an output of the agent policy model, where the action sequence may include a target control action for each agent belonging to the first object.
As an example, the action sequence may be {a_11, a_12, ..., a_1N}, where a_11 is the target control action of own agent 1, a_12 is the target control action of own agent 2, ..., and a_1N is the target control action of own agent N.
And 104, controlling the action of each agent belonging to the first object according to each target control action in the action sequence.
In the embodiment of the present disclosure, each agent belonging to the first object may be motion-controlled according to each target control motion in the motion sequence.
Continuing the above example, own agent 1 may be action-controlled according to a_11, own agent 2 according to a_12, ..., and own agent N according to a_1N.
According to the control method of the agent of the embodiment of the present disclosure, a state sequence is generated according to the target state data of a plurality of agents, the state sequence is input into the agent policy model, and an action sequence is determined according to the output of the agent policy model, where the action sequence comprises the target control actions of the agents belonging to the first object; and each agent belonging to the first object is action-controlled according to each target control action in the action sequence. Therefore, action control of each agent belonging to the first object (such as each own agent) can be realized in a multi-agent confrontation scenario, so that the agents belonging to the first object can cooperate to complete the confrontation task.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the personal information of the users involved are all carried out with the consent of the users, comply with the relevant laws and regulations, and do not violate public order and good customs.
In order to clearly illustrate how the above embodiments of the present disclosure obtain the target state data of each agent, the present disclosure also provides a control method of the agent.
Fig. 3 is a schematic flowchart of a control method of an agent according to a second embodiment of the present disclosure.
As shown in fig. 3, the control method of the agent may include the steps of:
step 301, for any agent in a plurality of agents, obtaining first state data of the agent at the current time.
Wherein the plurality of agents includes at least one agent belonging to the first object and at least one agent belonging to the second object.
In the disclosed embodiment, the first state data may include information of the pose, position, speed, weapon loaded, etc. of the agent at the current time.
It should be noted that in the multi-agent confrontation scenario, the weapon information loaded by each agent is transparent, for example, how many weapons are initially loaded on each agent is known, and by tracking each agent, the weapons that each agent has fired can be determined, so that the remaining weapons loaded can be determined according to the weapons initially loaded and the weapons fired by each agent. Moreover, by tracking each agent, information such as the pose, position, speed, etc. of each agent can be determined.
In a possible implementation manner of the embodiment of the present disclosure, for any one agent, the first state data of the agent at the current time may be determined through the following steps:
1. and acquiring first sub-state data of the agent at the current moment.
Wherein the first sub-status data is indicative of at least one of location information, velocity information and loaded weapon information of the agent at the current time.
2. Acquiring second sub-state data of the first agent at the current moment; wherein the second sub-status data is indicative of at least one of position information, velocity information and loaded weapon information for the first agent at the current time, the first agent and said agent belonging to the same object, and the distance between the first agent and said agent at the current time being smaller than a first distance threshold.
The first distance threshold is a preset distance threshold. That is, for any agent, the second sub-state data of other agents that are closer to the agent and belong to the same object at the current time may be acquired.
3. And acquiring third sub-state data of the second agent at the current moment, wherein the third sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the second agent at the current moment, the second agent and the agent belong to different objects, and the distance between the second agent and the agent at the current moment is less than a second distance threshold.
The second distance threshold is also a preset distance threshold. That is, for any agent, the third sub-state data of other agents that are closer to the agent and belong to different objects at the current time may be acquired.
4. And generating the first state data of the intelligent agent at the current moment according to at least one of the first sub-state data, the second sub-state data and the third sub-state data.
As an example, one of the first sub-state data, the second sub-state data, and the third sub-state data may be used as the first state data of the agent.
As another example, any two of the first sub-state data, the second sub-state data, and the third sub-state data may be used as the first state data of the agent.
As yet another example, the first sub-state data, the second sub-state data, and the third sub-state data may be used as the first state data of the agent.
Therefore, the state information of the agent can be determined based on the state information of other agents which are close to the agent in the spatial dimension, so that the target control action of each agent belonging to the first object can be predicted based on various information in the spatial dimension, and the reliability of the prediction result can be improved.
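A minimal sketch of this spatial assembly is given below, assuming each agent's sub-state is a set of position, velocity and weapon arrays; all names are hypothetical and the layout is only one possible choice:

```python
# Illustrative sketch of the spatial part of an agent's state (assumed data layout):
# its own sub-state plus the sub-states of nearby same-side and opposing-side agents.
import numpy as np

def spatial_state(agent, same_side, other_side, d_same, d_other):
    """agent / same_side / other_side hold 'pos', 'vel' and 'weapons' arrays;
    d_same and d_other are the first and second distance thresholds."""
    def sub_state(a):
        return np.concatenate([a["pos"], a["vel"], a["weapons"]])

    parts = [sub_state(agent)]                                    # first sub-state data
    for other in same_side:                                       # second sub-state data
        if np.linalg.norm(other["pos"] - agent["pos"]) < d_same:
            parts.append(sub_state(other))
    for other in other_side:                                      # third sub-state data
        if np.linalg.norm(other["pos"] - agent["pos"]) < d_other:
            parts.append(sub_state(other))
    return np.concatenate(parts)
```

Since the number of neighbours within the thresholds varies, a practical implementation would keep a fixed number of nearest neighbours or pad with default values, as noted later for the input layer.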
Step 302, second state data of the intelligent agent at least one historical moment before the current moment is obtained.
In the embodiment of the present disclosure, the historical time is a time before the current time, for example, if the current time is marked as time t, the historical time may be time t-1, time t-2, time t-3, and the like.
In the disclosed embodiment, the second state data may include information of the pose, position, speed, weapon loaded, etc. of the agent at historical time.
It should be noted that, considering that the greater the number of historical time instants, the higher the complexity of the model, in order to consider both the complexity of the model and the prediction accuracy, the number of historical time instants may not exceed 3 in one possible implementation manner of the embodiment of the present disclosure.
In the embodiment of the present disclosure, for any agent, the second state data of at least one historical time before the current time of the agent may be acquired.
In a possible implementation manner of the embodiment of the present disclosure, for any intelligent agent, the second state data of the intelligent agent at any historical time may be determined through the following steps:
1. and acquiring first historical sub-state data of the agent at the historical moment.
Wherein the first historical sub-state data is indicative of at least one of location information, velocity information, and loaded weapons information for the agent at the historical time.
2. Acquiring second historical sub-state data of a third agent at the historical moment; wherein the second historical sub-status data is indicative of at least one of position information, velocity information, and weapon loaded information for a third agent at a historical time, the third agent and the agent belonging to the same object, and a distance between the third agent and the agent at the historical time being less than a first distance threshold.
That is, for any agent, the second history sub-state data of other agents which are close to the agent and belong to the same object at the history time can be acquired.
3. And acquiring third historical sub-state data of a fourth agent at the historical moment, wherein the third historical sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the fourth agent at the historical moment, the fourth agent and the agent belong to different objects, and the distance between the fourth agent and the agent at the historical moment is smaller than a second distance threshold.
That is, for any agent, the third historical sub-state data at the historical time of another agent that is close to the agent and belongs to a different object can be acquired.
4. And generating second state data of the intelligent agent at the historical moment according to at least one of the first historical sub-state data, the second historical sub-state data and the third historical sub-state data.
As an example, one of the first, second, and third historical sub-state data may be used as the second state data for the agent.
As another example, any two of the first historical sub-state data, the second historical sub-state data, and the third historical sub-state data may be used as the second state data for the agent.
As yet another example, the first, second, and third historical sub-state data may be considered the second state data of the agent.
Step 303, generating target state data of the agent according to the first state data at the current moment and the second state data at least one historical moment.
In the embodiment of the present disclosure, the target state data of the agent may be generated according to the first state data at the current time and the second state data at least one historical time.
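For illustration, and assuming each state is a flat feature vector, the combination of the current first state data with up to t historical second state data could look like the following sketch (names are hypothetical):

```python
# Illustrative sketch: combine the current first state data with second state data from
# up to t preceding time instants (t <= 3 per the note above); names are assumptions.
import numpy as np

def target_state(current_state, history_states, t=3, pad_value=0.0):
    """current_state: 1-D feature vector at the current time;
    history_states: list of 1-D vectors for preceding time instants, newest first."""
    history = list(history_states[:t])
    while len(history) < t:                       # pad missing history with a default value
        history.append(np.full_like(current_state, pad_value))
    return np.concatenate([current_state] + history)
```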
Step 304, generating a state sequence according to the target state data of the plurality of agents.
Step 305, inputting the state sequence into the agent policy model to determine an action sequence according to the output of the agent policy model; wherein, the action sequence comprises the target control action of each agent belonging to the first object.
And step 306, performing action control on each agent belonging to the first object according to each target control action in the action sequence.
The explanation of steps 304 to 306 can refer to the related description in any embodiment of the present disclosure, and is not repeated herein.
The control method of the agents in the embodiment of the disclosure can predict the target control action of each agent belonging to the first object based on the state information of each agent at a plurality of moments in the time dimension, and can improve the reliability of the prediction result.
In order to clearly illustrate how the action sequence is determined according to the state sequence in the above embodiments of the present disclosure, the present disclosure also provides a control method of an agent.
Fig. 4 is a schematic flowchart of a control method of an agent according to a third embodiment of the present disclosure.
As shown in fig. 4, the control method of the agent may include the steps of:
step 401, obtaining target state data of a plurality of agents, wherein the plurality of agents includes at least one agent belonging to a first object and at least one agent belonging to a second object.
Step 402, generating a state sequence according to the target state data of the plurality of agents.
For the explanation of steps 401 to 402, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
Step 403, performing normalization processing on the state sequence.
In the embodiment of the present disclosure, normalization processing may be performed on each state data in the state sequence based on a normalization algorithm.
It should be noted that, in practical applications, normalization processing on the state sequence may not be required, and the disclosure does not limit this.
And step 404, coding the normalized state sequence by adopting a coding layer of the agent strategy model based on an attention mechanism to obtain coding characteristics.
In the embodiment of the present disclosure, the state sequence after the normalization processing may be encoded by using an encoding layer of the agent policy model based on an attention mechanism to obtain an encoding characteristic.
Step 405, decoding the coding features by using a decoding layer of the agent policy model to obtain an action sequence.
Wherein, the action sequence comprises the target control action of each agent belonging to the first object. It should be noted that the explanation of the action sequence in the foregoing embodiment is also applicable to this embodiment, and is not described herein again.
In the embodiment of the present disclosure, a decoding layer of the agent policy model may be adopted to decode the coding features to obtain an action sequence.
And step 406, performing action control on each agent belonging to the first object according to each target control action in the action sequence.
For the explanation of step 406, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
In the control method of the agent of this embodiment, encoding the state sequence through the coding layer yields coding features that represent the confrontation situation between the agents; because the state sequence is encoded based on the attention mechanism, the relative situation between the agents belonging to the first object and the agents belonging to the second object can be captured by the encoding. Decoding the coding features through the decoding layer then effectively yields the control action of each agent.
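The sketch below illustrates attention-based encoding of the state sequence with a generic Transformer encoder; it is only an assumption about one possible realization, uses a simple per-agent linear head in place of the sequence decoding described in the following embodiments, and none of the layer sizes or names come from the disclosure:

```python
# Illustrative sketch (not the disclosed network): embed the normalized state sequence,
# encode it with a self-attention (Transformer) encoder, and map the features of the
# own agents to action logits with a simple linear head.
import torch
import torch.nn as nn

class AgentPolicySketch(nn.Module):
    def __init__(self, state_dim, model_dim=64, num_actions=8, num_layers=2, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.action_head = nn.Linear(model_dim, num_actions)   # stand-in for the decoding layer

    def forward(self, state_seq, num_own):
        """state_seq: (batch, N + M, state_dim); the N own agents come first."""
        features = self.encoder(self.embed(state_seq))          # coding features
        own_features = features[:, :num_own, :]                 # keep the own-agent positions
        return self.action_head(own_features)                   # (batch, N, num_actions) logits

# Usage: logits = AgentPolicySketch(state_dim=24)(torch.randn(1, 7, 24), num_own=4)
```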
In order to clearly illustrate how the encoded features are decoded to obtain the action sequence in the above embodiments of the present disclosure, the present disclosure further provides a control method for an agent.
Fig. 5 is a schematic flowchart of a control method of an agent according to a fourth embodiment of the present disclosure.
As shown in fig. 5, the control method of the agent may include the steps of:
step 501, acquiring target state data of a plurality of agents; the plurality of agents comprise at least one agent belonging to a first object and at least one agent belonging to a second object, and the number of agents belonging to the first object is N.
Step 502, generating a state sequence according to the target state data of the plurality of agents.
Step 503, performing normalization processing on the state sequence.
And step 504, coding the normalized state sequence by adopting a coding layer of the agent strategy model based on an attention mechanism to obtain coding characteristics.
For the explanation of steps 501 to 504, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
And step 505, decoding the coding features by adopting a decoding layer of the agent policy model to obtain a target control action of the first agent belonging to the first object in the action sequence.
Wherein the action sequence comprises target control actions of agents belonging to the first object. It should be noted that the explanation of the action sequence and the target control action in the foregoing embodiment is also applicable to this embodiment, and the details are not described herein.
In the embodiment of the present disclosure, the decoding layer of the agent policy model may be adopted to perform sequence decoding on the coding features to obtain the action sequence. That is, the decoding layer may first decode the coding features to obtain the target control action of the first agent belonging to the first object in the action sequence.
In a possible implementation manner of the embodiment of the present disclosure, the decoding layer may be adopted to decode the coding feature to obtain probabilities that a first agent belonging to the first object in the action sequence performs multiple control actions, so that a control action with a highest probability may be determined from the multiple control actions according to the probabilities that the first agent belonging to the first object performs the multiple control actions, and the control action with the highest probability may be used as a target control action of the first agent belonging to the first object.
Therefore, the control action with the highest probability is used as the target control action of the agent, and the accuracy of the control action prediction result can be improved.
In another possible implementation manner of the embodiment of the present disclosure, a decoding layer may be used to decode the coding feature to obtain probabilities that a first agent belonging to a first object in an action sequence executes multiple control actions, and generate a probability distribution according to the probabilities that the first agent belonging to the first object executes the multiple control actions, so that the probability distribution may be sampled (for example, randomly sampled) to obtain a sampling probability, and a control action corresponding to the sampling probability is used as a target control action of the first agent belonging to the first object.
Therefore, the control action with the maximum probability can be used as the target control action of the intelligent agent, the control action corresponding to the sampling probability can be used as the target control action of the intelligent agent, the target control action of the intelligent agent can be determined based on different modes, and the flexibility and the applicability of the method are improved.
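Both selection modes can be sketched as follows (illustrative only; the probability values and function names are assumptions):

```python
# Illustrative sketch of the two selection modes: execute the most probable control action,
# or sample one action from the probability distribution.
import numpy as np

def select_action(probs, mode="greedy", rng=None):
    """probs: 1-D array of probabilities over the candidate control actions."""
    probs = np.asarray(probs, dtype=float)
    probs = probs / probs.sum()                    # ensure a valid probability distribution
    if mode == "greedy":
        return int(np.argmax(probs))               # control action with the highest probability
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))    # action corresponding to the sampled probability

# select_action([0.1, 0.6, 0.3]) -> 1; select_action([0.1, 0.6, 0.3], mode="sample") -> random index
```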
Step 506, decoding the coding characteristics by adopting a decoding layer based on the target control action of the ith agent belonging to the first object in the action sequence to obtain the target control action of the (i + 1) th agent belonging to the first object in the action sequence; wherein i is a positive integer less than N.
In the embodiment of the present disclosure, the decoding layer may be adopted to decode the coding features based on the target control action of the first agent belonging to the first object in the action sequence, so as to obtain the target control action of the 2nd agent belonging to the first object in the action sequence.
As an example, the decoding layer may be adopted to decode the coding features based on the target control action of the first agent belonging to the first object in the action sequence to obtain the probabilities that the 2nd agent belonging to the first object in the action sequence performs multiple control actions, so as to determine the control action with the highest probability from the multiple control actions according to these probabilities, and use the control action with the highest probability as the target control action of the 2nd agent belonging to the first object.
As another example, the decoding layer may decode the coded feature based on a target control action of a first agent belonging to a first object in the action sequence to obtain a probability that a 2 nd agent belonging to the first object in the action sequence performs multiple control actions, and generate a probability distribution according to the probability that the 2 nd agent belonging to the first object performs multiple control actions, so that the probability distribution may be sampled (e.g., randomly sampled) to obtain a sampling probability, and a control action corresponding to the sampling probability may be used as a target control action of the 2 nd agent belonging to the first object.
Similarly, the decoding layer may be adopted to decode the coded features based on the target control action of the agent 2 in the action sequence belonging to the first object, so as to obtain the target control action of the agent 3 in the action sequence belonging to the first object.
It should be noted that, the determination manner of the target control action of the 3 rd agent belonging to the first object in the action sequence is similar to the determination manner of the target control action of the 2 nd agent belonging to the first object in the action sequence or the first agent belonging to the first object in the action sequence, and is not described herein again.
By analogy, the decoding layer may be adopted to decode the coding features based on the target control action of the (N-1)-th agent belonging to the first object in the action sequence, so as to obtain the target control action of the N-th agent belonging to the first object in the action sequence.
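The sequence decoding described above can be sketched as a simple autoregressive loop; decode_step below is a hypothetical stand-in for the decoding layer and select_action for either selection mode described earlier:

```python
# Illustrative sketch of the sequence decoding loop: the (i+1)-th own agent's action is
# decoded from the coding features conditioned on the i-th agent's already-decoded action.
# decode_step and select_action are hypothetical callables standing in for the decoding
# layer and for the selection step described above.

def decode_action_sequence(coding_features, num_own_agents, decode_step, select_action):
    actions = []
    prev_action = None                             # no previous action for the 1st agent
    for i in range(num_own_agents):
        probs = decode_step(coding_features, prev_action, agent_index=i)
        action = select_action(probs)              # greedy or sampled, as described above
        actions.append(action)
        prev_action = action                       # condition the next agent on this action
    return actions                                 # target control actions {a_11, ..., a_1N}
```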
And step 507, controlling the action of each agent belonging to the first object according to each target control action in the action sequence.
For the explanation of step 507, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
The control method of the agent of this embodiment realizes sequence decoding through the decoding layer and effectively obtains the target control action of each agent belonging to the first object. Moreover, sequence decoding is more conducive to producing a strategy with a certain degree of cooperation, so that the agents belonging to the first object can cooperate to complete the confrontation task.
In order to clearly illustrate how the encoded features are decoded to obtain the action sequence in the above embodiments of the present disclosure, the present disclosure further provides a control method for an agent.
Fig. 6 is a schematic flowchart of a control method of an agent according to a fifth embodiment of the present disclosure.
As shown in fig. 6, the control method of the agent may include the steps of:
Step 601, acquiring target state data of a plurality of agents; the plurality of agents comprise at least one agent belonging to a first object and at least one agent belonging to a second object, and the number of agents belonging to the first object is N.
Step 602, generating a state sequence according to the target state data of the plurality of agents.
Step 603, normalization processing is performed on the state sequence.
And step 604, coding the normalized state sequence by adopting a coding layer of the agent strategy model based on an attention mechanism to obtain coding characteristics.
For the explanation of steps 601 to 604, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
Step 605, decoding the coding features by using the first decoding layer of the agent policy model to obtain the movement control action of the 1st agent belonging to the first object in the action sequence; wherein the movement control action is used for indicating at least one of the movement speed, the movement direction and the movement altitude of the corresponding agent.
In the embodiment of the present disclosure, the decoding layer of the agent policy model may include a first decoding layer, and the coding features may be decoded by the first decoding layer to obtain the movement control action of the 1st agent belonging to the first object in the action sequence.
Wherein the movement control action is used for indicating at least one of the movement speed, the movement direction and the movement height of the corresponding intelligent agent. For example, when the agent is a drone, the movement control action may be used to indicate the speed of flight, heading, altitude of the drone.
As a possible implementation manner, the decoding layer may further include a classification layer and a second decoding layer, and the classification layer may be used to classify the coding features to obtain a classification category of a first agent belonging to the first object.
In the case that the classification category indicates that the action intention of the first agent belonging to the first object is an attack, the target control action may further include an attack control action, and the encoding feature may be input into a second decoding layer to decode the encoding feature through the second decoding layer to obtain the attack control action; and the attack control action is used for indicating a target weapon to be launched and a target agent belonging to the second object to be attacked.
In the case that the classification category indicates that the action intention of the first agent belonging to the first object is not attack, the target control action may comprise only the movement control action, and in this case it is not necessary to input the coding features to the second decoding layer for decoding.
Therefore, the target control action can comprise not only the movement control action but also the attack control action, so that each agent can better handle the confrontation task and complete it efficiently.
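A minimal sketch of this two-way decoding structure is shown below, assuming one linear head per branch; the class name, head shapes and the intent encoding (1 = attack) are assumptions for illustration only:

```python
# Illustrative sketch of the two-way decoding for one own agent: a movement head is always
# evaluated, an intent classifier decides attack / non-attack, and only an "attack" intent
# triggers the attack head.
import torch
import torch.nn as nn

class TwoWayDecoderSketch(nn.Module):
    def __init__(self, feat_dim, num_move_actions, num_weapons, max_enemies):
        super().__init__()
        self.move_head = nn.Linear(feat_dim, num_move_actions)             # first decoding layer
        self.intent_head = nn.Linear(feat_dim, 2)                          # classification layer
        self.attack_head = nn.Linear(feat_dim, num_weapons * max_enemies)  # second decoding layer

    def forward(self, agent_feature):
        """agent_feature: 1-D coding feature for one own agent."""
        move_logits = self.move_head(agent_feature)                  # movement control action
        intent = int(self.intent_head(agent_feature).argmax())       # 1 = attack, 0 = non-attack
        attack_logits = self.attack_head(agent_feature) if intent == 1 else None
        return move_logits, intent, attack_logits
```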
Step 606, decoding the coding features by adopting a decoding layer based on the target control action of the i-th agent belonging to the first object in the action sequence to obtain the target control action of the (i+1)-th agent belonging to the first object in the action sequence; wherein i is a positive integer less than N.
In the embodiment of the present disclosure, the first decoding layer may be adopted to decode the coding features based on the movement control action of the 1st agent belonging to the first object in the action sequence, so as to obtain the movement control action of the 2nd agent belonging to the first object in the action sequence.
Optionally, the coding features may be further classified by using a classification layer to obtain a classification category of the 2 nd agent belonging to the first object. In the case that the classification category indicates that the action intent of the 2 nd agent belonging to the first object is an attack, the target control action may further include an attack control action, and the encoded features may be input into the second decoding layer to be decoded by the second decoding layer to obtain the attack control action. And in the case that the classification category indicates that the action of the 2 nd agent belonging to the first object is not intended to be an attack, the target control action may include only a movement control action, and in this case, the encoding feature may not need to be input to the second decoding layer for decoding.
Similarly, the first decoding layer may be adopted to decode the coded features based on the movement control action of the agent belonging to the first object at the 2 nd position in the action sequence, so as to obtain the movement control action of the agent belonging to the first object at the 3 rd position in the action sequence.
Optionally, the coding features may be further classified by using the classification layer to obtain the classification category of the 3rd agent belonging to the first object. In the case that the classification category indicates that the action intention of the 3rd agent belonging to the first object is attack, the target control action may further include an attack control action, and the coding features may be input into the second decoding layer to be decoded by the second decoding layer to obtain the attack control action. In the case that the classification category indicates that the action intention of the 3rd agent belonging to the first object is not attack, the target control action may only comprise the movement control action, and in this case the coding features do not need to be input to the second decoding layer for decoding.
By analogy, the first decoding layer may be adopted to decode the coding features based on the movement control action of the (N-1)-th agent belonging to the first object in the action sequence, so as to obtain the movement control action of the N-th agent belonging to the first object in the action sequence.
Optionally, the coding features may be further classified by using the classification layer to obtain the classification category of the N-th agent belonging to the first object. In the case that the classification category indicates that the action intention of the N-th agent belonging to the first object is attack, the target control action may further include an attack control action, and the coding features may be input into the second decoding layer to be decoded by the second decoding layer to obtain the attack control action. In the case that the classification category indicates that the action intention of the N-th agent belonging to the first object is not attack, the target control action may only comprise the movement control action, and in this case the coding features do not need to be input to the second decoding layer for decoding.
Step 607, performing motion control on each agent belonging to the first object according to each target control motion in the motion sequence.
For the explanation of step 607, reference may be made to the related description in any embodiment of the present disclosure, which is not described herein again.
According to the control method of the agents, the movement control action of each agent belonging to the first object can be predicted based on the first decoding layer, so that each agent belonging to the first object can be subjected to movement control according to the movement control action, and the agents belonging to the first object are prevented from being knocked down by each agent belonging to the second object.
In any embodiment of the present disclosure, multi-agent confrontation can be realized based on spatio-temporal feature fusion and two-way action decoding; the implementation principle is shown in fig. 7. For example, a multi-agent confrontation task can be viewed as a reinforcement learning task that maximizes the expected reward of all own agents winning (i.e., defeating the enemy agents). The reinforcement learning setup includes five parts: the environment, the reward, the agent policy (referred to as the agent policy model in this disclosure), the state space and the action space, where the state space is characterized by spatio-temporal feature fusion, the action space is characterized by two-way action decoding, and the agent policy model is a neural network supporting the state input and action output described above. These are described as follows:
1) State space: the state is the input data of the agent policy model and is composed of the information of all own agents and all observable enemy agents. Taking an unmanned aerial vehicle as an example of an agent, the state of an individual agent may include its position information, speed information and loaded weapon information. The state of each agent finally input into the agent policy model covers two dimensions, space and time: in the spatial dimension, the state (position information, speed information and weapon information) of the agent itself, the states of the m own agents nearest to the agent, and the states of the n enemy agents nearest to the agent; in the temporal dimension, the information of each of these agents at the current time and at the preceding t historical time instants (t may be 1, 2 or 3; considering the complexity of the model, the value of t generally does not exceed 3).
It should be noted that the fused spatio-temporal information can, to a certain extent, clarify the action intention or combat intention (such as attack, follow or escape) of the enemy agents, and on this basis the action intention or combat intention (such as attack, follow or escape) can be continuously output.
2) Action space: the action is the output data of the agent policy model. Since multi-agent confrontation is involved, the action space controls all own agents. For each agent, its action space may consist of: 1) movement control, including speed, heading and altitude; 2) attack control, which controls the firing of the various weapons and specifies the shooting target as required.
On the one hand, the action output for each agent depends on the action already determined for the previous agent, forming a dependence between the agents; such sequence decoding is more conducive to producing a strategy with a certain degree of cooperation. On the other hand, the control action output for each agent comprises movement control and attack control, where the attack control first determines whether to attack and, only if an attack is decided, then determines the attack target; this forms a sequence decoding of the action itself, which reduces the action complexity (determining the attack target at the same time as the attack decision would make the number of action types grow with the number of enemy agents) and reduces the pressure on the model. Each variable in the movement control is a continuous value; optionally, the continuous values can be discretized to further reduce the action complexity.
3) Agent policy model: the agent policy model is the core of the reinforcement learning and adopts a neural network structure. Its input is the state combination of all agents and its output is the action combination of all own agents, where the state combination and the action combination are as described in detail for the state space and the action space. To support this input and output, the neural network structure may be designed as follows:
A. Input layer: the input state includes numerical features and categorical features (such as whether the agent is manned, the agent type, and the like); the numerical features need to be normalized to keep the data balanced. When an agent is destroyed or withdraws from the battle, its corresponding position needs to be filled with a default value, so that the relative positions of the agents in the model input remain consistent and the learning difficulty is reduced.
B. Output layer: the output action combination comprises movement control and attack control. If not discretized, the movement control is a continuous action; generally, the action is assumed to obey a normal distribution N(μ, σ²), the neural network predicts the mean μ and the variance σ² of the normal distribution, and the action to be executed is then obtained by sampling according to this mean and variance. The attack control is a discrete action: the neural network predicts the probability of executing each action, and the action can be sampled according to the probability distribution, or the action with the highest probability can be executed directly.
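The continuous branch of the output layer can be sketched as follows; parameterizing the variance through its logarithm is an assumption made here for numerical convenience, not something stated in the disclosure:

```python
# Illustrative sketch of the continuous branch: one movement variable (e.g. speed) is assumed
# to follow N(mu, sigma^2); the network predicts mu and the variance (here via its logarithm),
# and the executed action is sampled from that distribution.
import torch

def sample_continuous_action(mu, log_var):
    """mu, log_var: tensors predicted by the network for one movement variable."""
    sigma = torch.exp(0.5 * log_var)               # standard deviation from the predicted variance
    dist = torch.distributions.Normal(mu, sigma)
    action = dist.sample()                         # action actually executed
    return action, dist.log_prob(action)           # log-probability is typically kept for training

# Speed, heading and altitude can each be sampled this way from their own (mu, sigma^2).
```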
It should be noted that, since the attack has a certain scope, when selecting an attack object, it is necessary to select an enemy agent that is still alive within the scope of the attack, and this can be implemented by a mask method without introducing rules.
The mask method works as follows: if the enemy initially has 5 agents, attack targets can be selected from all 5 agents at the beginning of the confrontation; after a period of battle, if only 3 enemy agents are still alive, the attack target is selected from those 3 surviving agents and the selection probabilities of the remaining 2 destroyed agents are masked out, so that the overall probability distribution is not affected (at this time, the selection probabilities of the 3 surviving enemy agents sum to 1).
The rule-based method works as follows: the selection probabilities of all enemy agents (e.g., all 5 agents) are output regardless of how many of them survive, and the surviving agent with the highest selection probability is then determined and used as the attack target.
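The mask method can be sketched as a masked softmax over the enemy agents; the function below is illustrative only and its names are assumptions:

```python
# Illustrative sketch of the mask method: the scores of non-alive enemy agents are masked out
# before normalization, so only surviving agents can be selected as the attack target.
import numpy as np

def masked_target_selection(target_logits, alive_mask, rng=None):
    """target_logits: one score per enemy agent (e.g. 5 values);
    alive_mask: booleans, True for enemy agents that are still alive."""
    logits = np.asarray(target_logits, dtype=float)
    alive = np.asarray(alive_mask, dtype=bool)
    logits = np.where(alive, logits, -np.inf)              # mask the non-alive agents
    exp = np.exp(logits - logits[alive].max())             # softmax over the surviving agents only
    probs = exp / exp.sum()                                # surviving probabilities sum to 1
    rng = rng or np.random.default_rng()
    return int(rng.choice(len(probs), p=probs))

# With 5 initial enemies of which 3 survive, the 2 destroyed agents get probability 0.
```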
C. Intermediate layer: the intermediate layer adopts a Transformer network comprising an encoding layer and a decoding layer. After the input is encoded by the encoding layer, the current combat situation is well represented, and the self-attention mechanism in the encoding layer can encode the relative states between the own agents and all enemy agents. The decoding layer supports the sequence decoding operations over different agents, and the action combination (movement + attack) of a single agent can be decoded by similar operations.
In summary, in terms of state representation, a better representation of the current agent's state is obtained by fusing the agent's historical states (time dimension) with the states of other agents (space dimension); in terms of action decoding, by modeling a structured combined action space and the decision cooperation among multiple agents, very complex cooperation strategies can be learned, with a certain robustness to changes in the number and types of agents and in the combat situation. The inventors have applied this scheme in an aerial intelligent game competition with good results, supporting multi-thread follow-up and promoting the practical deployment of multi-agent confrontation.
In the above method, the network structure of the agent policy model is carefully designed, and the state space and the action space are better adapted to the multi-agent confrontation scenario, so that there is no need to improve the strategy effect by introducing ways of optimizing the whole training process that require manual intervention, such as curriculum learning and imitation learning. In addition, the two-way action decoding proposed for the action space can learn very complex cooperation strategies: by modeling the structured combined action space in an autoregressive manner and modeling the decision cooperation among multiple agents, together with the better state representation learned, more complex cooperation strategies can be learned, with robustness when the number and types of agents and the combat situation change.
Corresponding to the control method of the agent provided in the embodiments of fig. 1 to 6, the present disclosure also provides a control device of an agent, and since the control device of an agent provided in the embodiments of the present disclosure corresponds to the control method of an agent provided in the embodiments of fig. 1 to 6, the implementation manner of the control method of an agent is also applicable to the control device of an agent provided in the embodiments of the present disclosure, and is not described in detail in the embodiments of the present disclosure.
Fig. 8 is a schematic structural diagram of a control device of an agent according to a sixth embodiment of the present disclosure.
As shown in fig. 8, the control apparatus 800 of the agent may include: an acquisition module 801, a generation module 802, a determination module 803, and a control module 804.
The obtaining module 801 is configured to obtain target status data of a plurality of agents, where the plurality of agents includes at least one agent belonging to a first object and at least one agent belonging to a second object.
A generating module 802, configured to generate a state sequence according to the target state data of the multiple agents.
A determining module 803, configured to input the state sequence into the agent policy model to determine an action sequence according to an output of the agent policy model; wherein the action sequence comprises target control actions of agents belonging to the first object.
And the control module 804 is used for controlling the action of each agent belonging to the first object according to each target control action in the action sequence.
In a possible implementation manner of the embodiment of the present disclosure, the obtaining module 801 is configured to: aiming at any intelligent agent, acquiring first state data of the intelligent agent at the current moment; acquiring second state data of the intelligent agent at least one historical moment before the current moment; and generating target state data of the agent according to the first state data at the current moment and the second state data at least one historical moment.
In a possible implementation manner of the embodiment of the present disclosure, the obtaining module 801 is configured to: acquiring first sub-state data of the intelligent agent at the current moment, wherein the first sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the intelligent agent at the current moment; acquiring second sub-state data of the first agent at the current moment; wherein the second sub-status data is used to indicate at least one of position information, velocity information and loaded weapon information of the first agent at the current time, the first agent and the agent belong to the same object, and the distance between the first agent and the agent at the current time is smaller than a first distance threshold; acquiring third sub-state data of the second agent at the current moment, wherein the third sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the second agent at the current moment, the second agent and the agent belong to different objects, and the distance between the second agent and the agent at the current moment is smaller than a second distance threshold; generating first state data of the agent according to at least one of the first sub-state data, the second sub-state data and the third sub-state data.
In a possible implementation manner of the embodiment of the present disclosure, the determining module 803 may include:
a processing unit, configured to perform normalization processing on the state sequence;
a coding unit, configured to encode the normalized state sequence by using an attention-mechanism-based coding layer of the agent policy model, so as to obtain coding characteristics; and
a decoding unit, configured to decode the coding characteristics by using a decoding layer of the agent policy model, so as to obtain the action sequence (a minimal sketch of this pipeline is given after this list).
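The sketch below illustrates one way such a normalize-encode-decode pipeline could look, assuming a Transformer-style encoder as the attention-based coding layer and a single linear decoding layer. All module sizes, layer counts, and the per-feature normalization are assumptions made for the example and are not taken from the disclosure.

```python
# A minimal sketch of the encode-decode pipeline, assuming a Transformer-style policy model.
import torch
import torch.nn as nn

class AgentPolicyModel(nn.Module):
    def __init__(self, state_dim=64, hidden_dim=128, num_actions=16, num_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, hidden_dim)
        encoder_layer = nn.TransformerEncoderLayer(hidden_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)  # attention-based coding layer
        self.decoder = nn.Linear(hidden_dim, num_actions)                  # decoding layer (simplified)

    def forward(self, state_seq: torch.Tensor) -> torch.Tensor:
        # state_seq: (batch, num_agents, state_dim)
        normalized = (state_seq - state_seq.mean(dim=-1, keepdim=True)) / (
            state_seq.std(dim=-1, keepdim=True) + 1e-6)                    # normalization processing
        encoded = self.encoder(self.embed(normalized))                     # coding characteristics
        return self.decoder(encoded)                                       # per-agent action logits
```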
In a possible implementation manner of the embodiment of the present disclosure, the number of agents belonging to the first object is N, and the decoding unit is configured to: decode the coding characteristics by using the decoding layer to obtain a target control action of a first agent belonging to the first object in the action sequence; and decode the coding characteristics by using the decoding layer based on the target control action of the ith agent belonging to the first object in the action sequence, to obtain the target control action of the (i + 1)th agent belonging to the first object in the action sequence, where i is a positive integer less than N.
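A hedged sketch of this agent-by-agent (autoregressive) decoding loop follows; the decoder interface, the start token, and the greedy choice inside the loop are assumptions made for the example rather than details given in the disclosure.

```python
# Hypothetical sketch: the action decoded for agent i conditions the step for agent i + 1.
import torch

def decode_action_sequence(decoder, encoded, num_own_agents, start_token):
    """`decoder(encoded, prev_action)` is assumed to return the next agent's action logits."""
    actions, prev_action = [], start_token
    for i in range(num_own_agents):
        logits = decoder(encoded, prev_action)    # conditioned on the previous agent's action
        action = torch.argmax(logits, dim=-1)     # greedy choice, purely for illustration
        actions.append(action)
        prev_action = action                      # feeds the (i + 1)th agent's decoding step
    return actions
```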
In a possible implementation manner of the embodiment of the present disclosure, the decoding unit is configured to: decoding the coding characteristics by adopting a decoding layer to obtain the probability of executing various control actions by a first agent belonging to a first object in the action sequence; determining the control action with the highest probability from the multiple control actions according to the probability of executing the multiple control actions by the first agent belonging to the first object; and taking the control action with the highest probability as the target control action of the first agent belonging to the first object.
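The greedy variant described in the preceding paragraph could, under the assumption that the decoding layer outputs one logit per candidate control action, be sketched as follows.

```python
# Sketch of the greedy variant: pick the control action with the highest probability.
import torch

def greedy_action(logits: torch.Tensor) -> int:
    # logits: 1-D tensor of scores for one agent's candidate control actions (an assumption).
    probs = torch.softmax(logits, dim=-1)      # probability of executing each control action
    return int(torch.argmax(probs).item())     # control action with the maximum probability
```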
In a possible implementation manner of the embodiment of the present disclosure, the decoding unit is configured to: decoding the coding characteristics by adopting a decoding layer to obtain the probability of executing various control actions by the first agent belonging to the first object in the action sequence; generating probability distribution according to the probability of executing various control actions by the first agent belonging to the first object; sampling the probability distribution to obtain a sampling probability; and taking the control action corresponding to the sampling probability as the target control action of the first agent belonging to the first object.
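The sampling variant could, under the same assumption about the decoding layer's output, be sketched as follows; torch.distributions.Categorical is used here purely for illustration and is not named in the disclosure.

```python
# Sketch of the sampling variant: draw the control action from the probability distribution.
import torch

def sampled_action(logits: torch.Tensor) -> int:
    # logits: 1-D tensor of scores for one agent's candidate control actions (an assumption).
    probs = torch.softmax(logits, dim=-1)                   # probability of each control action
    dist = torch.distributions.Categorical(probs=probs)     # probability distribution over actions
    return int(dist.sample().item())                        # control action drawn by sampling
```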
In a possible implementation manner of the embodiment of the present disclosure, the target control action includes a movement control action, and the decoding layer includes a first decoding layer; the decoding unit is configured to: decode the coding characteristics by using the first decoding layer to obtain the movement control action of the first agent belonging to the first object in the action sequence, where the movement control action is used to indicate at least one of the movement speed, the movement direction and the movement height of the corresponding agent.
In a possible implementation manner of the embodiment of the present disclosure, the target control action further includes an attack control action, and the decoding layer further includes a classification layer and a second decoding layer; the decoding unit is further configured to: classify the coding characteristics by using the classification layer to obtain a classification category of the first agent belonging to the first object; and, in a case that the classification category indicates that the action intention of the first agent belonging to the first object is attack, decode the coding characteristics by using the second decoding layer to obtain an attack control action, where the attack control action is used to indicate a target weapon to be launched and a target agent belonging to the second object to be attacked.
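Putting the two preceding paragraphs together, one possible (assumed) shape for such a multi-head decoding layer is sketched below: a first decoding layer for movement, a classification layer for the attack intention, and a second decoding layer that is only consulted when the intention is attack. The head sizes and the unbatched, single-agent input are assumptions of the sketch.

```python
# Illustrative multi-head decoding layer: movement head, intention classifier, attack head.
import torch
import torch.nn as nn

class MultiHeadActionDecoder(nn.Module):
    def __init__(self, hidden_dim=128, num_move_actions=9, num_weapons=4, num_enemies=5):
        super().__init__()
        self.move_head = nn.Linear(hidden_dim, num_move_actions)             # first decoding layer
        self.intent_head = nn.Linear(hidden_dim, 2)                          # classification layer: attack / not attack
        self.attack_head = nn.Linear(hidden_dim, num_weapons * num_enemies)  # second decoding layer

    def forward(self, encoded_feature: torch.Tensor):
        # encoded_feature: 1-D coding feature of a single own-side agent (an assumption).
        move_action = torch.argmax(self.move_head(encoded_feature), dim=-1)
        intent = torch.argmax(self.intent_head(encoded_feature), dim=-1)
        attack_action = None
        if int(intent.item()) == 1:   # action intention classified as "attack"
            # Joint index over (weapon to launch, enemy agent to attack), illustrative only.
            attack_action = torch.argmax(self.attack_head(encoded_feature), dim=-1)
        return move_action, attack_action
```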
The control device of the agent of the embodiment of the present disclosure generates a state sequence according to target state data of a plurality of agents, and inputs the state sequence into an agent policy model to determine an action sequence according to an output of the agent policy model; the action sequence comprises target control actions of all agents belonging to the first object; and action control is performed on each agent belonging to the first object according to each target control action in the action sequence. Therefore, action control of each agent (such as an own-side agent) belonging to the first object in a multi-agent confrontation scene can be realized, so that the agents belonging to the first object can cooperate to complete the confrontation task.
To implement the above embodiments, the present disclosure also provides an electronic device, which may include at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the method for controlling the agent according to any of the above embodiments of the disclosure.
In order to achieve the above embodiments, the present disclosure also provides a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the method for controlling an agent set forth in any one of the above embodiments of the present disclosure.
In order to implement the above embodiments, the present disclosure also provides a computer program product, which includes a computer program that, when being executed by a processor, implements the control method of the agent proposed in any of the above embodiments of the present disclosure.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 9 shows a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
The electronic device may include the server and the client in the above embodiments. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes in accordance with a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. In the RAM 903, various programs and data required for the operation of the electronic device 900 can also be stored. The calculation unit 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing Unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing Units running machine learning model algorithms, a DSP (Digital Signal Processor), and any suitable Processor, controller, microcontroller, and the like. The calculation unit 901 executes the respective methods and processes described above, such as the control method of the agent described above. For example, in some embodiments, the control methods of the agents described above may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into RAM 903 and executed by computing unit 901, one or more steps of the control method of the agent described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured by any other suitable means (e.g. by means of firmware) to perform the control method of the agent described above.
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, integrated circuitry, FPGAs (Field Programmable Gate arrays), ASICs (Application-Specific Integrated circuits), ASSPs (Application Specific Standard products), SOCs (System On Chip, system On a Chip), CPLDs (Complex Programmable Logic devices), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a RAM, a ROM, an EPROM (Electrically Programmable Read-Only-Memory) or flash Memory, an optical fiber, a CD-ROM (Compact Disc Read-Only-Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a Display device (e.g., a CRT (Cathode Ray Tube) or LCD (Liquid Crystal Display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: LAN (Local Area Network), WAN (Wide Area Network), internet, and blockchain Network.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in conventional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be noted that artificial intelligence is the discipline of studying how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
According to the technical solution of the embodiment of the present disclosure, a state sequence is generated according to target state data of a plurality of agents, and the state sequence is input into an agent policy model, so that an action sequence is determined according to the output of the agent policy model; the action sequence comprises target control actions of all agents belonging to the first object; and action control is performed on each agent belonging to the first object according to each target control action in the action sequence. Therefore, in a multi-agent confrontation scene, action control can be performed on each agent (such as an own-side agent) belonging to the first object, so that the agents belonging to the first object can cooperate to complete the confrontation task.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions proposed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (21)

1. A method of controlling an agent, the method comprising:
obtaining target state data for a plurality of agents, wherein the plurality of agents includes at least one agent belonging to a first object and at least one agent belonging to a second object;
generating a state sequence according to the target state data of the plurality of agents;
inputting the state sequence into an agent policy model to determine an action sequence according to an output of the agent policy model; wherein the action sequence comprises target control actions of agents belonging to the first object;
and performing action control on each agent belonging to the first object according to each target control action in the action sequence.
2. The method of claim 1, wherein said obtaining target state data for a plurality of agents comprises:
for any intelligent agent, acquiring first state data of the intelligent agent at the current moment;
acquiring second state data of the intelligent agent at at least one historical moment before the current moment;
and generating target state data of the agent according to the first state data at the current moment and the second state data at the at least one historical moment.
3. The method of claim 2, wherein the obtaining first state data of the agent at a current time comprises:
acquiring first sub-state data of the intelligent agent at the current time, wherein the first sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the intelligent agent at the current time;
acquiring second sub-state data of the first agent at the current moment; wherein the second sub-status data is indicative of at least one of position information, velocity information and loaded weapons information for the first agent at the current time, the first agent and the agent belong to the same object, and a distance between the first agent and the agent at the current time is less than a first distance threshold;
acquiring third sub-state data of a second agent at the current moment, wherein the third sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the second agent at the current moment, the second agent and the agent belong to different objects, and the distance between the second agent and the agent at the current moment is less than a second distance threshold;
generating first state data of the agent according to at least one of the first sub-state data, the second sub-state data and the third sub-state data.
4. The method of claim 1, wherein said inputting the sequence of states into a smart agent policy model to determine a sequence of actions from an output of the smart agent policy model comprises:
carrying out normalization processing on the state sequence;
coding the state sequence after normalization processing by adopting a coding layer of the intelligent agent strategy model based on an attention mechanism to obtain coding characteristics;
and decoding the coding characteristics by adopting a decoding layer of the intelligent agent strategy model to obtain an action sequence.
5. The method of claim 4, wherein the number of agents belonging to the first object is N, and the decoding of the encoded features using the decoding layer of the agent policy model to obtain the action sequence comprises:
decoding the coding characteristics by adopting the decoding layer to obtain a target control action of a first agent belonging to the first object in the action sequence;
decoding the coding characteristics by adopting the decoding layer based on the target control action of the ith agent belonging to the first object in the action sequence to obtain the target control action of the (i + 1) th agent belonging to the first object in the action sequence; wherein i is a positive integer less than N.
6. The method of claim 5, wherein said decoding the encoded features with the decoding layer to obtain a target control action of a first agent in the action sequence belonging to the first object comprises:
decoding the coding characteristics by adopting the decoding layer to obtain the probability that the first agent belonging to the first object in the action sequence executes various control actions;
determining the control action with the highest probability from the multiple control actions according to the probability of executing the multiple control actions by the first agent belonging to the first object;
and taking the control action with the maximum probability as the target control action of the first agent belonging to the first object.
7. The method of claim 5, wherein said decoding the encoded features with the decoding layer to obtain a target control action of a first agent in the action sequence belonging to the first object comprises:
decoding the coding characteristics by adopting the decoding layer to obtain the probability of executing various control actions by the first agent belonging to the first object in the action sequence;
generating probability distribution according to the probability of executing various control actions by the first agent belonging to the first object;
sampling the probability distribution to obtain a sampling probability;
and taking the control action corresponding to the sampling probability as the target control action of the first agent belonging to the first object.
8. The method of any of claims 5-7, wherein the target control action comprises a movement control action, the decoding layer comprises a first decoding layer;
the decoding, by using the decoding layer, the encoded feature to obtain a target control action of a first agent belonging to the first object in the action sequence includes:
decoding the coding characteristics by adopting the first decoding layer to obtain the movement control action of the first agent belonging to the first object in the action sequence; wherein the movement control action is used for indicating at least one of the movement speed, the movement direction and the movement height of the corresponding intelligent agent.
9. The method of claim 8, wherein the target control action further comprises an attack control action, and the decoding layer further comprises a classification layer and a second decoding layer;
the decoding, by using the decoding layer, the encoded feature to obtain a target control action of a first agent belonging to the first object in the action sequence, further includes:
classifying the coding features by adopting the classification layer to obtain the classification category of the first agent belonging to the first object;
under the condition that the classification type indicates that the action intention of the first agent belonging to the first object is attack, decoding the coding characteristics by adopting the second decoding layer to obtain an attack control action; the attack control action is used for indicating a target weapon to be launched and a target agent belonging to the second object to be attacked.
10. An apparatus for controlling an agent, the apparatus comprising:
an obtaining module for obtaining target state data of a plurality of agents, wherein the plurality of agents comprises at least one agent belonging to a first object and at least one agent belonging to a second object;
the generating module is used for generating a state sequence according to the target state data of the plurality of agents;
a determining module for inputting the state sequence into an agent policy model to determine an action sequence according to an output of the agent policy model; wherein the action sequence comprises target control actions of agents belonging to the first object;
and the control module is used for controlling the action of each agent belonging to the first object according to each target control action in the action sequence.
11. The apparatus of claim 10, wherein the means for obtaining is configured to:
for any intelligent agent, acquiring first state data of the intelligent agent at the current moment;
acquiring second state data of the intelligent agent at at least one historical moment before the current moment;
and generating target state data of the agent according to the first state data at the current moment and the second state data at the at least one historical moment.
12. The apparatus of claim 11, wherein the means for obtaining is configured to:
acquiring first sub-state data of the intelligent agent at the current time, wherein the first sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the intelligent agent at the current time;
acquiring second sub-state data of the first agent at the current moment; wherein the second sub-status data is indicative of at least one of position information, velocity information and loaded weapons information for the first agent at the current time, the first agent and the agent belong to the same object, and a distance between the first agent and the agent at the current time is less than a first distance threshold;
acquiring third sub-state data of a second agent at the current moment, wherein the third sub-state data is used for indicating at least one of position information, speed information and loaded weapon information of the second agent at the current moment, the second agent and the agent belong to different objects, and the distance between the second agent and the agent at the current moment is less than a second distance threshold;
generating first state data of the agent according to at least one of the first sub-state data, the second sub-state data and the third sub-state data.
13. The apparatus of claim 10, wherein the means for determining comprises:
the processing unit is used for carrying out normalization processing on the state sequence;
the coding unit is used for coding the state sequence after the normalization processing by adopting a coding layer of the agent strategy model based on an attention mechanism so as to obtain coding characteristics;
and the decoding unit is used for decoding the coding characteristics by adopting a decoding layer of the agent strategy model to obtain an action sequence.
14. The apparatus of claim 13, wherein the number of agents belonging to the first object is N, and the decoding unit is configured to:
decoding the coding characteristics by adopting the decoding layer to obtain a target control action of the first agent belonging to the first object in the action sequence;
decoding the coding characteristics by adopting the decoding layer based on the target control action of the ith agent belonging to the first object in the action sequence to obtain the target control action of the (i + 1) th agent belonging to the first object in the action sequence; wherein i is a positive integer less than N.
15. The apparatus of claim 14, wherein the decoding unit is configured to:
decoding the coding characteristics by adopting the decoding layer to obtain the probability of executing various control actions by the first agent belonging to the first object in the action sequence;
determining the control action with the highest probability from the multiple control actions according to the probability of executing the multiple control actions by the first agent belonging to the first object;
and taking the control action with the maximum probability as the target control action of the first agent belonging to the first object.
16. The apparatus of claim 14, wherein the decoding unit is configured to:
decoding the coding characteristics by adopting the decoding layer to obtain the probability that the first agent belonging to the first object in the action sequence executes various control actions;
generating probability distribution according to the probability of executing various control actions by the first agent belonging to the first object;
sampling the probability distribution to obtain a sampling probability;
and taking the control action corresponding to the sampling probability as the target control action of the first agent belonging to the first object.
17. The apparatus of any of claims 14-16, wherein the target control action comprises a movement control action, the decoding layer comprising a first decoding layer;
the decoding unit is configured to:
decoding the coding characteristics by adopting the first decoding layer to obtain the movement control action of the first agent belonging to the first object in the action sequence; wherein the movement control action is used for indicating at least one of the movement speed, the movement direction and the movement height of the corresponding intelligent agent.
18. The apparatus of claim 17, wherein the target control action further comprises an attack control action, and the decoding layer further comprises a classification layer and a second decoding layer;
the decoding unit is further configured to:
classifying the coding features by adopting the classification layer to obtain a classification category of the first agent belonging to the first object;
under the condition that the classification type indicates that the action intention of the first agent belonging to the first object is attack, decoding the coding characteristics by adopting the second decoding layer to obtain an attack control action; the attack control action is used for indicating a target weapon to be launched and a target agent belonging to the second object to be attacked.
19. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of controlling an agent of any of claims 1-9.
20. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of controlling an agent according to any one of claims 1-9.
21. A computer program product comprising a computer program which, when being executed by a processor, carries out the steps of the controlling method of an agent according to any one of claims 1-9.
CN202211457362.4A 2022-11-17 2022-11-17 Control method and device of intelligent agent, electronic equipment and storage medium Pending CN115730630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211457362.4A CN115730630A (en) 2022-11-17 2022-11-17 Control method and device of intelligent agent, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211457362.4A CN115730630A (en) 2022-11-17 2022-11-17 Control method and device of intelligent agent, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115730630A true CN115730630A (en) 2023-03-03

Family

ID=85296883

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211457362.4A Pending CN115730630A (en) 2022-11-17 2022-11-17 Control method and device of intelligent agent, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115730630A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050467A (en) * 2023-03-29 2023-05-02 上海数字大脑科技研究院有限公司 Multi-agent cooperative decision method, system, computer device and storage medium

Similar Documents

Publication Publication Date Title
KR102523888B1 (en) Method, Apparatus and Device for Scheduling Virtual Objects in a Virtual Environment
CN114596553B (en) Model training method, trajectory prediction method and device and automatic driving vehicle
CN112001585B (en) Multi-agent decision method, device, electronic equipment and storage medium
US20220176248A1 (en) Information processing method and apparatus, computer readable storage medium, and electronic device
CN112307622B (en) Autonomous planning system and planning method for generating force by computer
KR102290251B1 (en) Learning mehtod and learning device for controlling aircraft
CN115730630A (en) Control method and device of intelligent agent, electronic equipment and storage medium
CN114715145B (en) Trajectory prediction method, device and equipment and automatic driving vehicle
CN112016678B (en) Training method and device for strategy generation network for reinforcement learning and electronic equipment
CN112632380A (en) Training method of interest point recommendation model and interest point recommendation method
CN116661503A (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
Cao et al. Autonomous maneuver decision of UCAV air combat based on double deep Q network algorithm and stochastic game theory
CN113742457B (en) Response processing method, device, electronic equipment and storage medium
CN117518907A (en) Control method, device, equipment and storage medium of intelligent agent
CN115906673B (en) Combat entity behavior model integrated modeling method and system
Kuravsky et al. An applied multi-agent system within the framework of a player-centered probabilistic computer game
CN114964247A (en) Crowd sensing navigation method and system based on high-order graph convolution neural network
Liu et al. Forward-looking imaginative planning framework combined with prioritized-replay double DQN
CN111723941B (en) Rule generation method and device, electronic equipment and storage medium
CN114819095A (en) Method and device for generating business data processing model and electronic equipment
CN114443896A (en) Data processing method and method for training a predictive model
CN118171744B (en) Method and device for predicting space-time distribution, electronic equipment and storage medium
CN109934753B (en) Multi-Agent emergency action decision method based on JADE platform and reinforcement learning
KR102617794B1 (en) Learning method for aircraft control and electronic apparatus therefor
CN118034355B (en) Network training method, unmanned aerial vehicle obstacle avoidance method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination