CN112494949A - Intelligent agent action strategy making method, server and storage medium - Google Patents

Intelligent agent action strategy making method, server and storage medium

Info

Publication number
CN112494949A
Authority
CN
China
Prior art keywords
current frame
information
agent
parallel task
task information
Prior art date
Legal status
Granted
Application number
CN202011312201.7A
Other languages
Chinese (zh)
Other versions
CN112494949B (en)
Inventor
杨木
张弛
武建芳
王宇舟
郭仁杰
杨正云
杨少杰
李宏亮
刘永升
Current Assignee
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Priority date
Filing date
Publication date
Application filed by Super Parameter Technology Shenzhen Co ltd filed Critical Super Parameter Technology Shenzhen Co ltd
Priority to CN202011312201.7A priority Critical patent/CN112494949B/en
Publication of CN112494949A publication Critical patent/CN112494949A/en
Application granted granted Critical
Publication of CN112494949B publication Critical patent/CN112494949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/56 Computing the motion of game characters with respect to other game characters, game objects or elements of the game scene, e.g. for simulating the behaviour of a group of virtual soldiers or for path finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses an agent action strategy making method, a server and a storage medium. The method comprises: acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information between the agent and the 3D virtual environment; outputting, through an AI model, current frame parallel task information and current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information; outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information; controlling the agent to interact with the 3D virtual environment according to the current frame action output information, so as to acquire next frame state information and next frame interaction information of the agent; and outputting next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information. The application can realize highly anthropomorphic AI simulation.

Description

Intelligent agent action strategy making method, server and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, a server, and a storage medium for formulating an agent action policy.
Background
With the rapid development of Artificial Intelligence (AI) technology, AI is widely applied in fields such as 3D games, virtual traffic, autonomous driving simulation and robot trajectory planning, and AI simulation in 3D virtual space has great commercial value; for example, AI technology makes it possible for an agent to play against real players in various games.
Currently, in the AI simulation of some 3D virtual spaces, an agent needs to collect various resources in the 3D virtual space and confront other players within a continuously shrinking safe zone in order to survive to the end. During the AI simulation, the agent must make correct action decisions in different environments, so that it moves and explores with a relatively safe area as its target point and fights enemy agents, surviving to the end.
Therefore, in order to enhance the user's game experience, the agent is expected to make the AI simulation highly anthropomorphic, and how to realize highly anthropomorphic AI simulation has become a problem that urgently needs to be solved.
Disclosure of Invention
The embodiment of the application provides an agent action strategy making method, a server and a storage medium, aiming at realizing highly anthropomorphic AI simulation.
In a first aspect, an embodiment of the present application provides an agent action policy making method, where the method includes:
acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information of the agent and the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the agent through an AI model based on the current frame state information and the current frame interaction information;
outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information;
controlling the intelligent agent to interact with the 3D virtual environment according to the current frame action output information so as to obtain the next frame state information and the next frame interaction information of the intelligent agent;
and outputting the next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information.
In a second aspect, an embodiment of the present application further provides a server, where the server includes a processor and a memory; the memory stores a computer program and an AI model that can be invoked and executed by the processor, wherein the computer program, when executed by the processor, implements the agent action policy making method described above.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program, when executed by a processor, causes the processor to implement the method for making an action policy of an intelligent agent.
The embodiments of the application provide an agent action strategy making method, a server and a storage medium. The agent action strategy making method comprises: acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information between the agent and the 3D virtual environment; outputting, through an AI model, current frame parallel task information and current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information; outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information; controlling the agent to interact with the 3D virtual environment according to the current frame action output information, so as to acquire next frame state information and next frame interaction information of the agent; and outputting next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information. By analyzing the parallel task information and the non-parallel task information that the agent can execute in its current state, the actions that the agent can currently execute synchronously and the actions that are mutually exclusive are obtained, and the agent is controlled to output the corresponding actions accordingly, so that the actions output by the agent are more reasonable and more anthropomorphic.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart illustrating steps of a method for formulating an action policy of an agent according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of a method for making an action policy of an agent according to an embodiment of the present application;
FIG. 3 is a diagram illustrating actions that an agent may selectively output according to parallel task information and non-parallel task information in the application scenario corresponding to FIG. 2;
FIG. 4 is a schematic diagram of an AI model based agent action output provided by an embodiment of the application;
FIG. 5 is another schematic diagram of an AI model based agent action output provided by an embodiment of the application;
fig. 6 is a schematic block diagram of a server provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it should be understood that the described embodiments are some, but not all embodiments of the present application. All other embodiments that can be derived by a person skilled in the art from the embodiments given herein without making any inventive effort fall within the scope of protection of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be divided, combined or partially combined, so that the actual execution sequence may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
With the rapid development of Artificial Intelligence (AI) technology, AI is widely applied in fields such as 3D games, virtual traffic, autonomous driving simulation and robot trajectory planning, and AI simulation in 3D virtual space has great commercial value; for example, AI technology makes it possible for an agent to play against real players in various games.
Currently, in the AI simulation of some 3D virtual spaces, an agent needs to collect various resources in the 3D virtual space and confront other players within a continuously shrinking safe zone in order to survive to the end. During the AI simulation, the agent must make correct action decisions in different environments, so that it moves and explores with a relatively safe area as its target point and fights enemy agents, surviving to the end.
Therefore, in order to enhance the user's game experience, the agent is expected to make the AI simulation highly anthropomorphic, and how to realize highly anthropomorphic AI simulation has become a problem that urgently needs to be solved.
In order to solve the above problems, embodiments of the present application provide an agent action policy making method, a server, and a computer-readable storage medium for implementing highly anthropomorphic AI simulation. The method for formulating the action strategy of the intelligent agent can be applied to a server, and the server can be a single server or a server cluster consisting of a plurality of servers.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for making an intelligent agent action policy according to an embodiment of the present application.
As shown in fig. 1, the agent action policy making method specifically includes steps S101 to S105.
Step S101: the method comprises the steps of obtaining current frame state information of an agent in a 3D virtual environment and current frame interaction information of the agent and the 3D virtual environment.
For example, in various application scenarios of Artificial Intelligence (AI), such as robot simulation in a 3D virtual environment, robotic arms, autonomous driving and virtual traffic simulation, or for game AI in 3D games, a highly anthropomorphic action decision is made for an agent in the 3D virtual environment in order to realize highly anthropomorphic simulation. The current frame state information of the agent in the 3D virtual environment and the current frame interaction information between the agent and the 3D virtual environment are acquired, so that a corresponding action decision can be made according to the current frame state information and the current frame interaction information. An agent (Agent) is an entity that resides in a complex dynamic environment, autonomously senses environmental information, autonomously takes actions, and accomplishes a series of preset goals or tasks.
The current frame state information of the agent is the state data describing the agent itself in the current frame, and comprises the agent's own data and the information of the equipment it wears. The agent's own data comprises position information, motion information, blood volume information, equipment information, camp (team) information, and the like.
The current frame interaction information between the agent and the 3D virtual environment is the data describing the agent relative to the 3D virtual environment in the current frame, such as global information, poison circle information, material information and sound information.
In the present embodiment, the AI simulation of a 3D game match is taken as an example for explanation. The 3D game includes, but is not limited to, a 3D FPS (3D First-Person Shooter) game, and the method may also be applied to the AI simulation of other kinds of 3D game play; no limitation is imposed here.
As shown in fig. 2, in the 3D game match, the agent may compete against a preset number of other players, where the other players may be other agents or game characters operated by human players. In this embodiment the other players are taken to be other agents by way of example, but this is not limiting.
The agent may participate in the match in a team with other agents, or as a single-agent team, so that agents of different camps exist in the match. An agent participating in the match may select any region of the 3D virtual environment as a target region and descend to it by parachuting. The agent needs to collect resources in the 3D virtual environment, such as different weapons, defensive gear and props, to increase its combat capability. Meanwhile, as the match proceeds, the safe zone in the 3D virtual environment gradually shrinks and the poison circle gradually expands, so agents of different camps fight each other more frequently as they try to reach the safe zone; an agent can kill the enemy agents belonging to other camps through various strategies and thereby finally win the match.
The position information, motion information, blood volume information, equipment information and camp information of the agent in the current frame are acquired, so that the information related to the agent itself can be accurately evaluated.
Wherein the position information comprises the spatial position of the agent in the 3D virtual environment, which can be represented in a spatial coordinate system; the motion information comprises the agent's current orientation and moving speed; the blood volume information comprises the agent's total blood volume, remaining blood volume and the like; the equipment information comprises the armor and helmet worn by the agent and the weapon in each weapon slot, where the weapon information comprises the weapon type and the weapon state, such as the ammunition loaded and the ammunition remaining.
Global information, poison circle information, material information and sound information generated after the intelligent agent interacts with the 3D virtual environment are obtained, so that the current environment information of the intelligent agent can be accurately evaluated.
The global information mainly comprises the elapsed time of the current match, the number of surviving teammates, the total number of kills by the agent's own team, and the like. The poison circle information comprises the recorded information of the poison circle in the match, such as the center of the current poison circle, the radius of the current poison circle, the stage of the current poison circle, the remaining time of the current poison circle, the center of the next poison circle, the radius of the next poison circle and the total time of the next poison circle. The material information comprises the position, type, attributes and quantity of the materials visible in the agent's field of view, where the material types include but are not limited to guns, knives, armor, helmets, medicines, throwables and the like. The sound information mainly includes the position of the sound source, its relative bearing, the type of sound source, and the like.
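Purely as an illustration of how the two groups of information described above might be organized in practice (a minimal sketch; the field names, types and groupings below are assumptions and do not appear in the patent), the per-frame inputs could be represented as follows:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class AgentState:
    """Current-frame state information of the agent itself (assumed field layout)."""
    position: Tuple[float, float, float]   # spatial coordinates in the 3D virtual environment
    facing: Tuple[float, float, float]     # current orientation
    speed: float                           # moving speed
    hp: float                              # remaining blood volume
    hp_max: float                          # total blood volume
    team_id: int                           # camp (team) the agent belongs to
    equipment: List[dict] = field(default_factory=list)  # armor, helmet, weapon per slot

@dataclass
class EnvInteraction:
    """Current-frame interaction information between the agent and the environment (assumed)."""
    game_time: float                       # elapsed time of the current match
    teammates_alive: int                   # number of surviving teammates
    team_kills: int                        # total kills by the agent's own team
    safe_zone: dict = field(default_factory=dict)         # poison-circle center, radius, stage, timers
    visible_items: List[dict] = field(default_factory=list)  # position, type, attributes, quantity
    sounds: List[dict] = field(default_factory=list)         # source position, relative bearing, type
```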
Step S102: and outputting current frame parallel task information and current frame non-parallel task information corresponding to the agent through an AI model based on the current frame state information and the current frame interaction information.
The parallel task information represents the related actions that the agent can execute synchronously within the same time step. The parallel task information includes, but is not limited to, movement task information, first-direction aiming task information, second-direction aiming task information and non-parallel task selection information. The movement task information includes the movement direction, the movement speed, the posture during movement, and the like. The aiming task information is used to characterize the aiming direction of the agent.
That is, the agent may synchronously output at least one of an action corresponding to execution of the movement task, an action corresponding to execution of the first direction aiming task, an action corresponding to execution of the second direction aiming task, and an action corresponding to non-parallel task selection at the same time.
As shown in FIG. 3, the moving direction includes, but is not limited to, the eight directions of front, back, left, right, left front, left back, right front, and right back.
The aiming directions include, but are not limited to, four directions, i.e., aiming up, aiming down, aiming left and aiming right, wherein aiming up and aiming down lie in a first direction, aiming left and aiming right lie in a second direction, and the first direction and the second direction are perpendicular to each other.
The non-parallel task information represents the tasks whose output actions are mutually exclusive within the same time step, and includes, but is not limited to, at least one of attack task information, material picking task information, posture control task information and blood volume supplement task information.
That is, at the same moment the agent can output only one of the action corresponding to the attack task, the action corresponding to the material picking task, the action corresponding to the posture control task, and the action corresponding to the blood volume supplement task. The attack task information is used to control the selection and switching of weapons when the agent fights agents of other camps, such as firing a gun, switching to a close-range weapon, switching to a long-range weapon, collecting a weapon, or throwing a projectile.
The material picking task information is used to control the agent to pick up corresponding items within a preset range, such as weapons and blood bags within that range.
The posture control task information is related information for controlling the intelligent agent to switch postures, such as jumping, squatting, lying, standing, running and walking, as shown in fig. 3.
The blood volume supplement task information is used to control the agent to select a reasonable medicine to treat itself so as to restore its state. A preset AI model screens out the current frame parallel task information and the current frame non-parallel task information corresponding to the agent according to the current frame state information and the current frame interaction information. From the acquired parallel task information and non-parallel task information, it can be known that in the current frame state the agent may output the actions corresponding to one or more sub-tasks of the parallel task information, but can output the action corresponding to only one sub-task of the non-parallel task information. This prevents the agent from synchronously outputting mutually exclusive actions in its current state, so that the anthropomorphic effect of the AI simulation is better.
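To make the distinction concrete, the following sketch (an assumption for illustration only; the head names and option lists are examples, and uniform random sampling merely stands in for the distributions an AI model would actually predict) samples one option per parallel head every frame, while a single selection gates the mutually exclusive non-parallel tasks:

```python
import random

# Parallel tasks: each head is sampled independently every frame.
PARALLEL_HEADS = {
    "move": ["front", "back", "left", "right",
             "front_left", "back_left", "front_right", "back_right"],
    "aim_vertical":   ["none", "up", "down"],
    "aim_horizontal": ["none", "left", "right"],
}

# Non-parallel tasks: mutually exclusive, at most one is executed per frame.
NON_PARALLEL_HEADS = {
    "attack":  ["fire", "switch_melee", "switch_ranged", "collect_weapon", "throw"],
    "pick_up": ["weapon", "blood_bag", "prop"],
    "posture": ["jump", "crouch", "prone", "stand", "run", "walk"],
    "heal":    ["bandage", "first_aid", "drink"],
}

def decide_actions() -> dict:
    """Sample one action per parallel head, plus exactly one non-parallel action."""
    actions = {name: random.choice(options) for name, options in PARALLEL_HEADS.items()}
    # The "non-parallel task selection" output picks which exclusive task runs this frame.
    selected_task = random.choice(list(NON_PARALLEL_HEADS))
    actions[selected_task] = random.choice(NON_PARALLEL_HEADS[selected_task])
    return actions
```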
Referring to fig. 4, in some embodiments, the outputting, by the AI model, of the current frame parallel task information and the current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information includes: respectively performing feature extraction on the current frame state information and the current frame interactive information to obtain corresponding current frame state characteristic information and current frame interactive characteristic information; and acquiring the corresponding current frame parallel task information and current frame non-parallel task information through the timing sequence characteristic extraction module of the AI model based on the current frame state characteristic information and the current frame interactive characteristic information.
The acquiring, by the timing characteristic extraction module of the AI model, corresponding current frame parallel task information and current frame non-parallel task information based on current frame state characteristic information and current frame interactive characteristic information includes: inputting the current frame state characteristic information and the current frame interactive characteristic information into a first full-connection network corresponding to the AI model to obtain corresponding first output information of the current frame; acquiring current frame fusion state vector information corresponding to the agent through a time sequence feature extraction module of the AI model based on the current frame first output information; and inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to acquire corresponding current frame parallel task information and current frame non-parallel task information.
Referring to fig. 5, in some embodiments, the current frame interactive feature information includes current frame global feature information, current frame poison circle feature information, current frame material feature information, and current frame sound feature information. The current frame global feature information is obtained by extracting the current frame global information, the current frame poison circle feature information is obtained by extracting the current frame poison circle information, the current frame material feature information is obtained by extracting the current frame material information, and the current frame sound feature information is obtained by extracting the current frame sound information.
Respectively carrying out feature extraction on the current frame state information and the current frame interactive information of the intelligent agent to acquire corresponding current frame state feature information and current frame interactive feature information, and the method comprises the following steps:
and respectively taking the current frame global characteristic information, the current frame poison circle characteristic information, the current frame material characteristic information and the current frame sound characteristic information as the input of the first fully-connected network corresponding to the AI model so as to output the corresponding current frame first output information.
And the timing sequence feature extraction module based on the AI model performs timing sequence feature fusion on the first output information of the current frame to acquire fusion state vector information of the current frame corresponding to the agent.
And inputting the current frame fusion state vector information into the corresponding second full-connection network so as to obtain corresponding current frame parallel task information and current frame non-parallel task information.
In some embodiments, the current frame parallel task information includes mobile task information, first direction aiming task information, second direction aiming task information and non-parallel task selection information of the agent in the current frame; the current frame non-parallel task information comprises attack task information, material picking task information, attitude control task information and blood volume supplement task information of the intelligent agent in the current frame; inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information, including:
and respectively inputting the current frame fusion state vector information to a second full-connection network corresponding to the AI model so as to output corresponding current frame movement task information, current frame first direction aiming task information, current frame second direction aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplementing task information.
The same current frame fusion state vector information is used as the input of the second full-connection network, and a plurality of multi-task learning results are output, so that the learning generalization effect is better, and the simulation anthropomorphic effect is stronger.
In this embodiment, the AI model is provided with a corresponding fully-connected neural network and a timing feature extraction module, where the timing feature extraction module includes, but is not limited to, an LSTM (Long Short-Term Memory) module, a GRU (Gated Recurrent Unit) module, a Transformer module, and the like.
Taking the example that the timing feature extraction module is an LSTM module, the acquiring, by the timing feature extraction module of the AI model, current frame fusion state vector information corresponding to the agent based on the current frame first output information includes: acquiring previous frame hidden state information corresponding to the LSTM module; outputting, by the LSTM module, current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information; and acquiring current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
The LSTM module serves as an independent feature extraction unit, and can receive previous frame hidden state information and current frame first output information as inputs of the LSTM module, and output corresponding current frame hidden state information, where the hidden state information includes hidden information (hidden state) and cell state information (cell state), and the current frame hidden state information serves as an input of a next frame.
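Putting the pieces above together, the following is a minimal PyTorch-style sketch of such a network (an assumption for illustration only: the use of PyTorch, the layer sizes, the encoder structure and the head dimensions are not specified by the patent). Each input group passes through its own first fully-connected network, the concatenated outputs are fused by the LSTM together with the previous-frame hidden state, and a second fully-connected stage produces one output head per parallel and non-parallel task:

```python
import torch
import torch.nn as nn

class AgentPolicyNet(nn.Module):
    """Sketch of the described pipeline: per-group encoders -> LSTM -> task heads."""

    def __init__(self, group_dims: dict, hidden_size: int = 256):
        super().__init__()
        # One "first fully-connected network" per input group (agent state, global,
        # poison-circle, material and sound feature information).
        self.encoders = nn.ModuleDict(
            {name: nn.Sequential(nn.Linear(dim, 64), nn.ReLU())
             for name, dim in group_dims.items()}
        )
        # Timing feature extraction module (LSTM here; a GRU or Transformer could also be used).
        self.lstm = nn.LSTM(64 * len(group_dims), hidden_size, batch_first=True)
        # "Second fully-connected network": one output head per task (sizes are assumptions).
        head_sizes = {
            # parallel task heads
            "move": 8, "aim_vertical": 3, "aim_horizontal": 3, "task_select": 4,
            # non-parallel task heads
            "attack": 5, "pick_up": 3, "posture": 6, "heal": 3,
        }
        self.heads = nn.ModuleDict({name: nn.Linear(hidden_size, k)
                                    for name, k in head_sizes.items()})

    def forward(self, features: dict, hidden_state=None):
        # features[name] is a (batch, dim) tensor of current-frame feature information.
        encoded = torch.cat([self.encoders[n](x) for n, x in features.items()], dim=-1)
        # Fuse with the previous-frame hidden/cell state to obtain the current-frame
        # fusion state vector; the returned hidden_state is carried to the next frame.
        fused, hidden_state = self.lstm(encoded.unsqueeze(1), hidden_state)
        fused = fused.squeeze(1)
        logits = {name: head(fused) for name, head in self.heads.items()}
        return logits, hidden_state
```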
S103: and controlling the interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information so as to acquire the state information of the next frame and the interaction information of the next frame of the intelligent agent.
And controlling the intelligent agent to execute corresponding action output based on the output current frame action output information, so that the intelligent agent interacts with the 3D virtual environment, the state information and the interaction information of the intelligent agent are updated, and the state information and the interaction information of the next frame of the intelligent agent are obtained.
In some embodiments, the outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information includes: and outputting current frame action output information corresponding to the intelligent agent according to the current frame parallel task information and the current frame non-parallel task information based on a preset strategy gradient optimization function.
Illustratively, the preset policy gradient optimization function is defined over N learning trajectories, where A_t denotes the advantage function (Advantage function) at time t, T denotes all the moments in a learning sequence, and t denotes a particular moment in that sequence. The function combines the gradient of the parallel tasks with F(a_t | s_t), the gradient over the operating space of all the non-parallel tasks.
For the gradient of the parallel tasks, W denotes the number of parallel tasks and m denotes the size of the operating space of each task. The probability that any one operation in each parallel task is selected is predicted per task, and the W parallel tasks do not obey a single categorical distribution (i.e., the W tasks are independent of each other). Here a_t denotes the action selected at time t; s_t denotes the state at time t, including the agent's own state and its interaction state with the 3D virtual environment, such as the material information, sound information, global information and equipment information at time t; and a_{jq,t} denotes the operation selected for the q-th parallel task at time t.
For the gradient F(a_t | s_t) over the operating space of the non-parallel tasks, M denotes the number of non-parallel tasks and m denotes the size of the operating space of each task, i.e., the number of predictable actions. For each non-parallel task, its actions cannot all be predicted for execution at the same time; only one operation among the non-parallel tasks can be selected for execution at any moment.
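The published text renders the optimization function and its two gradient terms as formula images, which are not reproduced here. A reconstruction that is consistent with the symbol definitions above, offered as an assumption rather than as the patent's exact expression (the objective symbol, the log-probability form and the placement of the advantage weighting are guesses), is:

```latex
% Assumed reconstruction of the preset policy gradient optimization function.
\nabla_\theta J(\theta)
  = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} A_t
    \Bigg[ \underbrace{\sum_{q=1}^{W} \nabla_\theta \log p_\theta\!\left(a_{jq,t} \mid s_t\right)}_{\text{gradient of the } W \text{ independent parallel tasks}}
         + F\!\left(a_t \mid s_t\right) \Bigg],
\qquad
F\!\left(a_t \mid s_t\right) = \nabla_\theta \log p_\theta\!\left(a_{k,t} \mid s_t\right)
```

where a_{k,t} denotes the single operation chosen at time t from the mutually exclusive operating spaces (of size m each) of the M non-parallel tasks, and A_t, N, T, W, s_t and a_{jq,t} are as defined above.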
S104: and outputting the next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information.
After the next frame state information and the next frame interaction information of the agent are obtained, the next frame action output information corresponding to the agent is output through the AI model based on the next frame state information and the next frame interaction information, following the operation of step S102. For the specific operation process, reference can be made to steps S102 to S104, which are not repeated here.
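The frame-by-frame cycle of steps S101 to S105 can be sketched as follows (illustrative only; `env`, `extract_features` and the termination flag are hypothetical placeholders for the 3D virtual environment interface, which the patent does not define). The LSTM hidden state produced for one frame is carried into the next, which is what makes the next-frame action output depend on the fused history:

```python
import torch

def run_match(env, policy, max_frames: int = 10000):
    """Frame-by-frame decision loop for one agent (batch of one), using the
    AgentPolicyNet sketched above; `env` and `extract_features` are placeholders."""
    obs = env.reset()          # current frame state + interaction information
    hidden_state = None        # LSTM hidden/cell state, carried across frames
    for _ in range(max_frames):
        features = extract_features(obs)              # per-group feature extraction
        with torch.no_grad():
            logits, hidden_state = policy(features, hidden_state)
        # One action per parallel head, plus the selected non-parallel action.
        action = {name: int(head_logits.argmax(dim=-1))
                  for name, head_logits in logits.items()}
        obs, done = env.step(action)                  # interact with the 3D environment
        if done:                                      # e.g. the agent is eliminated or wins
            break
```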
The method for formulating the action strategy of the intelligent agent provided by the embodiment comprises the steps of obtaining current frame state information of the intelligent agent in a 3D virtual environment and current frame interaction information of the intelligent agent and the 3D virtual environment; outputting current frame parallel task information and current frame non-parallel task information corresponding to the agent through an AI model based on the current frame state information and the current frame interaction information; outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information; controlling the intelligent agent to interact with the 3D virtual environment according to the current frame action output information so as to obtain the next frame state information and the next frame interaction information of the intelligent agent; and outputting the next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information. The method comprises the steps of analyzing executable parallel task information and non-parallel task information of the intelligent agent in the current state, obtaining the action which can be synchronously executed and the mutually exclusive execution action of the current intelligent agent according to the parallel task information and the non-parallel task information, and controlling the intelligent agent to output the corresponding output action according to the parallel task information and the non-parallel task information, so that the action output by the intelligent agent is more reasonable and more humanized.
Referring to fig. 6, fig. 6 is a schematic block diagram of a server according to an embodiment of the present disclosure.
As shown in fig. 6, the server 30 may include a processor 301, a memory 302, and a network interface 303. The processor 301, memory 302, and network interface 303 are connected by a system bus, such as an I2C (Inter-integrated Circuit) bus.
Specifically, the Processor 301 may be a Micro-controller Unit (MCU), a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or the like.
Specifically, the Memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash drive, a removable hard disk, or the like.
The network interface 303 is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 6 is a block diagram of only a portion of the architecture associated with the present application and does not constitute a limitation on the servers to which the present application applies; a particular server may include more or fewer components than those shown, or combine certain components, or have a different arrangement of components.
Wherein the processor 301 is configured to run a computer program stored in the memory 302, and when executing the computer program, implement the following steps:
acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information of the agent and the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the agent through an AI model based on the current frame state information and the current frame interaction information;
outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information;
controlling the intelligent agent to interact with the 3D virtual environment according to the current frame action output information so as to obtain the next frame state information and the next frame interaction information of the intelligent agent;
and outputting the next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information.
In some embodiments, the outputting, by the processor 301 through the AI model, the current frame parallel task information and the current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information includes:
respectively extracting the characteristics of the current frame state information and the current frame interactive information to obtain corresponding current frame state characteristic information and current frame interactive characteristic information;
and acquiring corresponding current frame parallel task information and current frame non-parallel task information based on the current frame state characteristic information and the current frame interactive characteristic information through the time sequence characteristic extraction module of the AI model.
In some embodiments, the processor 301 obtains, through the timing characteristic extraction module of the AI model, corresponding current frame parallel task information and current frame non-parallel task information based on current frame state characteristic information and current frame interactive characteristic information, and includes:
inputting the current frame state characteristic information and the current frame interactive characteristic information into a first full-connection network corresponding to the AI model to obtain corresponding first output information of the current frame;
acquiring current frame fusion state vector information corresponding to the agent through a time sequence feature extraction module of the AI model based on the current frame first output information;
and inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to acquire corresponding current frame parallel task information and current frame non-parallel task information.
In some embodiments, the current frame interactive feature information includes current frame global feature information, current frame poison circle feature information, current frame material feature information, and current frame sound feature information, and the processor 301 inputs the current frame state feature information and the current frame interactive feature information to the first fully-connected network corresponding to the AI model to obtain corresponding first output information of the current frame, including:
and respectively inputting the current frame state characteristic information, the current frame global characteristic information, the current frame poison circle characteristic information, the current frame material characteristic information and the current frame sound characteristic information into corresponding first full-connection networks to obtain corresponding first output information of the current frame.
In some embodiments, the current frame parallel task information includes mobile task information, aiming task information and non-parallel task selection information of the agent in the current frame; the current frame non-parallel task information comprises attack task information, material picking task information, attitude control task information and blood volume supplement task information of the agent in the current frame; the processor 301 inputs the current frame fusion state vector information into the second fully-connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information, including:
and respectively inputting the current frame fusion state vector information to a second full-connection network corresponding to the AI model so as to obtain corresponding current frame movement task information, current frame aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplement task information.
In some embodiments, the timing feature extraction module includes an LSTM module, and the obtaining, by the processor 301 through the timing feature extraction module of the AI model, current frame fusion state vector information corresponding to the agent based on the current frame first output information includes:
acquiring previous frame hidden state information corresponding to the LSTM module;
outputting, by the LSTM module, current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information;
and acquiring current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
In some embodiments, the outputting, by the processor 301, current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information includes:
and outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information based on a preset strategy gradient optimization function.
In some embodiments, in the policy gradient optimization function, A_t denotes the advantage function at time t, N denotes the number of learning trajectories, the gradient of the parallel tasks is combined with F(a_t | s_t), the gradient over the operating space of the non-parallel tasks, T denotes all the moments in a learning sequence, and t denotes a particular moment in that sequence.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The computer readable storage medium may be an internal storage unit of the server in the foregoing embodiment, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk provided on the server, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like.
Since the computer program stored in the computer-readable storage medium can execute any intelligent agent action policy making method provided in the embodiments of the present application, beneficial effects that can be achieved by any intelligent agent action policy making method provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An agent action policy making method, the method comprising:
acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information of the agent and the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the agent through an AI model based on the current frame state information and the current frame interaction information;
outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information;
controlling the intelligent agent to interact with the 3D virtual environment according to the current frame action output information so as to obtain the next frame state information and the next frame interaction information of the intelligent agent;
and outputting the next frame action output information corresponding to the agent according to the next frame state information and the next frame interaction information.
2. The method according to claim 1, wherein the outputting, by the AI model, current frame parallel task information and current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information includes:
respectively extracting the characteristics of the current frame state information and the current frame interactive information to obtain corresponding current frame state characteristic information and current frame interactive characteristic information;
and acquiring corresponding current frame parallel task information and current frame non-parallel task information based on the current frame state characteristic information and the current frame interactive characteristic information through the time sequence characteristic extraction module of the AI model.
3. The method according to claim 2, wherein the obtaining, by the timing feature extraction module of the AI model, corresponding current frame parallel task information and current frame non-parallel task information based on current frame state feature information and current frame interactive feature information includes:
inputting the current frame state characteristic information and the current frame interactive characteristic information into a first full-connection network corresponding to the AI model to obtain corresponding first output information of the current frame;
acquiring current frame fusion state vector information corresponding to the agent through a time sequence feature extraction module of the AI model based on the current frame first output information;
and inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to acquire corresponding current frame parallel task information and current frame non-parallel task information.
4. The method according to claim 3, wherein the current frame interactive feature information includes current frame global feature information, current frame poison circle feature information, current frame material feature information, and current frame sound feature information, and the inputting the current frame state feature information and the current frame interactive feature information into the first fully-connected network corresponding to the AI model to obtain the corresponding current frame first output information includes:
and respectively inputting the current frame state characteristic information, the current frame global characteristic information, the current frame poison circle characteristic information, the current frame material characteristic information and the current frame sound characteristic information into corresponding first full-connection networks to obtain corresponding first output information of the current frame.
5. The method of claim 4, wherein the current frame parallel task information comprises mobile task information, targeting task information, and non-parallel task selection information of the agent at the current frame; the current frame non-parallel task information comprises attack task information, material picking task information, attitude control task information and blood volume supplement task information of the intelligent agent in the current frame; the inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information includes:
and respectively inputting the current frame fusion state vector information to a second fully-connected network corresponding to the AI model so as to obtain corresponding current frame movement task information, current frame aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplement task information.
6. The method according to claim 3, wherein the timing feature extraction module includes an LSTM module, and the obtaining, by the timing feature extraction module of the AI model, current frame fusion state vector information corresponding to the agent based on the current frame first output information includes:
acquiring previous frame hidden state information corresponding to the LSTM module;
outputting, by the LSTM module, current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information;
and acquiring current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
7. The method according to claim 1, wherein the outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information comprises:
and outputting current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information based on a preset strategy gradient optimization function.
8. The method of claim 7, wherein the policy gradient optimization function \nabla J(\theta) is:

\nabla J(\theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T} A_t \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) + F(a_t \mid s_t) \right]

wherein A_t denotes the advantage function at time t, N denotes the number of learning trajectories, \nabla_\theta \log \pi_\theta(a_t \mid s_t) denotes the gradient of the parallel tasks, F(a_t \mid s_t) denotes the gradient over the action space of the non-parallel tasks, T denotes all the time steps in a learning sequence, and t denotes a particular time step in that sequence.
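The sketch below only illustrates how a loss with the structure shown above could be accumulated over N sampled trajectories; the advantage estimates, the summed parallel-task log-probabilities, and the non-parallel term standing in for F(a_t | s_t) are all assumed inputs, not quantities defined by the patent.

import torch

def policy_gradient_loss(trajectories):
    # trajectories: list of N dicts, each holding per-step tensors of shape (T,):
    #   "parallel_logp"     - summed log-probabilities of the parallel task actions
    #   "non_parallel_term" - differentiable stand-in for F(a_t | s_t)
    #   "advantage"         - advantage estimate A_t (no gradient flows through it)
    total = torch.zeros(())
    for traj in trajectories:
        adv = traj["advantage"].detach()
        per_step = adv * (traj["parallel_logp"] + traj["non_parallel_term"])
        total = total + per_step.sum()
    # Minimizing the negative mean ascends the policy-gradient objective.
    return -total / len(trajectories)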
9. A server, comprising a processor and a memory;
the memory stores a computer program and an AI model that can be invoked and executed by the processor, wherein the computer program, when executed by the processor, implements the agent action policy making method according to any one of claims 1 to 8.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the agent action policy making method according to any one of claims 1 to 8.
CN202011312201.7A 2020-11-20 2020-11-20 Intelligent agent action policy making method, server and storage medium Active CN112494949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011312201.7A CN112494949B (en) 2020-11-20 2020-11-20 Intelligent agent action policy making method, server and storage medium

Publications (2)

Publication Number Publication Date
CN112494949A (en) 2021-03-16
CN112494949B (en) 2023-10-31

Family

ID=74959236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011312201.7A Active CN112494949B (en) 2020-11-20 2020-11-20 Intelligent agent action policy making method, server and storage medium

Country Status (1)

Country Link
CN (1) CN112494949B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110622174A (en) * 2017-05-19 2019-12-27 渊慧科技有限公司 Imagination-based agent neural network
US20190286979A1 (en) * 2018-03-14 2019-09-19 Electronic Arts Inc. Reinforcement Learning for Concurrent Actions
KR20200063309A (en) * 2018-11-20 2020-06-05 고려대학교 산학협력단 Method and system for performing environment adapting stategy based on ai
CN111401557A (en) * 2020-06-03 2020-07-10 超参数科技(深圳)有限公司 Agent decision making method, AI model training method, server and medium
CN111950726A (en) * 2020-07-09 2020-11-17 华为技术有限公司 Decision method based on multi-task learning, decision model training method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118400A (en) * 2021-10-11 2022-03-01 中国科学院自动化研究所 Concentration network-based cluster countermeasure method and device
CN114118400B (en) * 2021-10-11 2023-01-03 中国科学院自动化研究所 Concentration network-based cluster countermeasure method and device
CN116747521A (en) * 2023-08-17 2023-09-15 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office
CN116747521B (en) * 2023-08-17 2023-11-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Similar Documents

Publication Publication Date Title
CN110201403B (en) Method, device and medium for controlling virtual object to discard virtual article
KR20210028728A (en) Method, apparatus, and device for scheduling virtual objects in a virtual environment
CN112494949B (en) Intelligent body action policy making method, server and storage medium
CN105935493A (en) Computer system, game device and method used for controlling roles
CN108211361B (en) The determination method and apparatus of virtual resource acquisition probability, storage medium, electronic device in game
CN109529352A (en) The appraisal procedure of scheduling strategy, device and equipment in virtual environment
US11969654B2 (en) Method and apparatus for determining target virtual object, terminal, and storage medium
CN113144597B (en) Virtual vehicle display method, device, equipment and storage medium
CN109529356A (en) Battle result determines method, apparatus and storage medium
CN111589166A (en) Interactive task control, intelligent decision model training methods, apparatus, and media
US20220111292A1 (en) Systems and methods for using natural language processing (nlp) to control automated execution of in-game activities
CN111603766A (en) Control method and device of virtual carrier, storage medium and electronic device
Gajurel et al. Neuroevolution for rts micro
CN111544889A (en) Behavior control method and device of virtual object and storage medium
Gemine et al. Imitative learning for real-time strategy games
Hagelbäck Multi-agent potential field based architectures for real-time strategy game bots
Miyake Current status of applying artificial intelligence in digital games
Goecks et al. Combining learning from human feedback and knowledge engineering to solve hierarchical tasks in minecraft
JP2023548802A (en) Stage screen display method, device, and equipment
Hagelback et al. Dealing with fog of war in a real time strategy game environment
CN114344905A (en) Team interaction processing method, device, equipment, medium and program for virtual object
CN116531758A (en) Virtual character control method, virtual character control device, storage medium and electronic device
CN112138392B (en) Virtual object control method, device, terminal and storage medium
Oakes Practical and theoretical issues of evolving behaviour trees for a turn-based game
Bernstein et al. Evaluating the Effectiveness of Multi-Agent Organisational Paradigms in a Real-Time Strategy Environment: Engineering Multiagent Systems Track

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant