CN112494949B - Intelligent body action policy making method, server and storage medium - Google Patents

Info

Publication number
CN112494949B
CN112494949B (application CN202011312201.7A)
Authority
CN
China
Prior art keywords
current frame
information
agent
task information
parallel task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011312201.7A
Other languages
Chinese (zh)
Other versions
CN112494949A (en)
Inventor
杨木
张弛
武建芳
王宇舟
郭仁杰
杨正云
杨少杰
李宏亮
刘永升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Super Parameter Technology Shenzhen Co ltd filed Critical Super Parameter Technology Shenzhen Co ltd
Priority to CN202011312201.7A priority Critical patent/CN112494949B/en
Publication of CN112494949A publication Critical patent/CN112494949A/en
Application granted granted Critical
Publication of CN112494949B publication Critical patent/CN112494949B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/56 Computing the motion of game characters with respect to other game characters, game objects or elements of the game scene, e.g. for simulating the behaviour of a group of virtual soldiers or for path finding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses an agent action policy formulation method, a server and a storage medium. The method comprises: obtaining current-frame state information of an agent in a 3D virtual environment and current-frame interaction information between the agent and the 3D virtual environment; outputting, through an AI model, current-frame parallel task information and current-frame non-parallel task information corresponding to the agent based on the current-frame state information and the current-frame interaction information; outputting current-frame action output information corresponding to the agent according to the current-frame parallel task information and non-parallel task information; controlling the agent to interact with the 3D virtual environment according to the current-frame action output information, so as to obtain next-frame state information and next-frame interaction information of the agent; and outputting next-frame action output information corresponding to the agent according to the next-frame state information and the next-frame interaction information. The application can realize highly anthropomorphic AI simulation.

Description

Intelligent body action policy making method, server and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an agent action policy making method, a server, and a storage medium.
Background
With the rapid development of artificial intelligence (AI) technology, AI is widely applied in fields such as 3D games, virtual traffic, autonomous-driving simulation and robot trajectory planning, and AI simulation in 3D virtual space has great commercial value; for example, it enables an agent to interact with real players in various games.
At present, in some AI simulations of 3D virtual space, an agent needs to collect various resources in the 3D virtual space and fight other agent players inside a continuously shrinking safe area. During the AI simulation, the agent therefore needs to make correct action decisions in different environments, so that it can move and explore with the relatively safe area as its target point, fight enemy agents, and survive to the end.
Therefore, in order to enhance the user's gaming experience, the agent in an AI simulation should be highly personified, so how to realize highly personified AI simulation becomes a problem to be solved.
Disclosure of Invention
The embodiment of the application provides an agent action strategy making method, a server and a storage medium, aiming at realizing highly anthropomorphic AI simulation.
In a first aspect, an embodiment of the present application provides a method for formulating an action policy of an agent, where the method includes:
acquiring state information of a current frame of an agent in a 3D virtual environment and interaction information of the agent and the current frame of the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the intelligent agent based on the current frame state information and the current frame interaction information through an AI model;
outputting current frame action output information corresponding to the intelligent agent according to the current frame parallel task information and the non-parallel task information of the current frame;
according to the current frame action output information, the agent is controlled to interact with the 3D virtual environment so as to acquire next frame state information and next frame interaction information of the agent;
and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information.
In a second aspect, an embodiment of the present application further provides a server, where the server includes a processor and a memory; the memory stores a computer program and an AI model which can be called and executed by the processor, wherein the method for formulating the action policy of the agent is realized when the computer program is executed by the processor.
In a third aspect, an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium is configured to store a computer program, and the computer program, when executed by a processor, causes the processor to implement the agent action policy formulation method described above.
The embodiment of the application provides an agent action policy formulation method, a server and a storage medium. The method obtains current-frame state information of an agent in a 3D virtual environment and current-frame interaction information between the agent and the 3D virtual environment; outputs, through an AI model, current-frame parallel task information and current-frame non-parallel task information corresponding to the agent based on the current-frame state information and the current-frame interaction information; outputs current-frame action output information corresponding to the agent according to the current-frame parallel task information and non-parallel task information; controls the agent to interact with the 3D virtual environment according to the current-frame action output information, so as to obtain next-frame state information and next-frame interaction information of the agent; and outputs next-frame action output information corresponding to the agent according to the next-frame state information and the next-frame interaction information. By analyzing the parallel task information and non-parallel task information executable by the agent in its current state, the actions the agent can execute synchronously and the actions that are mutually exclusive are determined, and the agent is controlled to output the corresponding actions accordingly, so that the actions output by the agent are more reasonable and human-like.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating steps of a method for establishing an agent action policy according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an application scenario of an agent action policy formulation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a corresponding action that an agent can selectively output according to parallel task information and non-parallel task information in the application scenario corresponding to FIG. 2;
FIG. 4 is a schematic diagram of an AI model-based agent action output provided in accordance with one embodiment of the application;
FIG. 5 is another schematic diagram of an AI model-based agent action output provided in accordance with one embodiment of the application;
fig. 6 is a schematic block diagram of a server according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
To solve the above problems, embodiments of the present application provide an agent action policy formulation method, a server, and a computer-readable storage medium for implementing highly personified AI simulation. The method can be applied to a server, which may be a single server or a server cluster composed of multiple servers.
Referring to fig. 1, fig. 1 is a flow chart of an agent action policy making method according to an embodiment of the application.
As shown in fig. 1, the action decision making method specifically includes steps S101 to S105.
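As a rough illustration of how steps S101 to S105 chain together frame by frame, the loop below sketches the cycle in Python. All names (MockModel, MockEnv, run_frames) and the dummy observations are invented for this sketch and do not come from the patent:

```python
class MockModel:
    """Stands in for the AI model: maps current-frame observations to task info (S102)."""
    def decide(self, state, interaction):
        parallel = {"move": "forward", "aim_v": "up"}   # tasks that may run together
        exclusive = "pick_up"                            # exactly one mutually exclusive task
        return parallel, exclusive

class MockEnv:
    """Stands in for the 3D virtual environment, returning dummy per-frame observations."""
    def __init__(self):
        self.frame = 0
    def observe(self):
        return {"frame": self.frame}, {"circle_radius": 100 - self.frame}
    def step(self, action):
        self.frame += 1          # S104: acting in the environment advances to the next frame
        return self.observe()

def run_frames(model, env, n_frames):
    state, interaction = env.observe()                        # S101: current-frame observations
    actions = []
    for _ in range(n_frames):
        parallel, exclusive = model.decide(state, interaction)    # S102: task info
        action = {"parallel": parallel, "exclusive": exclusive}   # S103: action output info
        state, interaction = env.step(action)                     # S104: interact, observe next frame
        actions.append(action)                                    # S105: repeat with next frame
    return actions
```

The point of the loop is that the output of step S104 becomes the input of the next iteration's S102, so the agent's policy is re-evaluated every frame.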
Step S101: and acquiring state information of a current frame of the intelligent agent in the 3D virtual environment and interaction information of the intelligent agent and the current frame of the 3D virtual environment.
For example, in application scenarios such as artificial intelligence (AI), robot simulation in a 3D virtual environment, robotic arms, autonomous driving, virtual traffic simulation, or game AI in 3D games, in order to realize highly anthropomorphic simulation and make highly anthropomorphic action decisions for an agent in the 3D virtual environment, the agent's current-frame state information and the current-frame interaction information between the agent and the 3D virtual environment are obtained, so that corresponding action decisions can be made from them. An agent here is an entity that is situated in a complex dynamic environment, autonomously perceives environmental information, autonomously takes action, and accomplishes a series of preset goals or tasks.
The agent's current-frame state information is state data characterizing the agent itself in the current frame, and comprises the agent's own data and information about the equipment it wears. The agent's own data includes position information, motion information, blood volume information, equipment information, the camp it belongs to, and so on.
The agent-environment interaction information is relative data characterizing the agent and the 3D virtual environment in the current frame, such as global information, poison circle information, material information and sound information.
In the present embodiment, game AI simulation in a 3D game is taken as an example, including but not limited to 3D FPS (3D First-Person Shooter) games; the method may also be applied to game AI simulation in other types of 3D games, which is not limited here.
As shown in fig. 2, in the 3D game the agent may play against a preset number of other players, where the other players may be other agents or game characters controlled by human players. In this embodiment the other players are taken to be other agents for illustration, but they are not limited to being only other agents.
The agent may participate in the game in a team with other agents or alone, so that agents of different camps are present in the game. An agent participating in the game can select any area in the 3D virtual environment as a target area and land on it by parachuting. The agent then needs to collect different resources in the 3D virtual environment, such as weapons, protective equipment and props, to increase its own combat power. Meanwhile, as the game progresses, the safe area of the 3D virtual environment gradually shrinks and the poison circle area gradually expands, so the agents participating in the game fight more often in order to reach the safe area; an agent kills enemy agents belonging to other camps through various strategies and finally wins.
By acquiring the agent's current-frame position information, motion information, blood volume information, equipment information and camp information, the agent's own state can be accurately evaluated.
The position information comprises the agent's spatial location in the 3D virtual environment, which may be represented in a spatial coordinate system; the motion information comprises the agent's current heading and movement speed; the blood volume information comprises the agent's total blood volume, remaining blood volume and so on; the equipment information comprises the armor, helmet and the weapon held in each weapon slot of the agent, where the weapon information includes the weapon type and weapon state, such as whether the weapon is loaded and the amount of remaining ammunition.
Global information, poison circle information, material information and sound information generated by the agent's interaction with the 3D virtual environment are acquired, so that the environment the agent is currently in can be accurately evaluated.
The global information mainly comprises the elapsed time of the current game, the number of surviving teammates, the total kills of the agent's team, and so on. The poison circle information comprises the record of the current circle, such as the current circle center, current circle radius, stage of the current circle and remaining time of the current circle, as well as the center, radius and total time of the next circle. The material information comprises the location, type, properties and quantity of the visible items in the agent's field of view, where item types include but are not limited to firearms, knives, armor, helmets, medicine and throwables. The sound information mainly comprises the position, relative orientation and type of the sound source.
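The per-frame observations enumerated above can be pictured as two records, one for the agent's own state and one for the agent-environment interaction. The field names below are assumptions chosen for this sketch, not the patent's actual schema:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AgentState:
    """The agent's own current-frame state: position, motion, blood volume, equipment, camp."""
    position: Tuple[float, float, float]   # spatial coordinates in the 3D environment
    heading: Tuple[float, float]           # current direction
    speed: float                           # movement speed
    hp: int                                # remaining blood volume
    hp_max: int                            # total blood volume
    weapons: List[str]                     # weapon held in each weapon slot
    camp: int                              # which camp the agent belongs to

@dataclass
class InteractionInfo:
    """Relative agent-environment information for the current frame."""
    game_time: float                       # global info: elapsed game time
    teammates_alive: int                   # global info: surviving teammates
    team_kills: int                        # global info: team's total kills
    circle_center: Tuple[float, float]     # poison circle info: current circle center
    circle_radius: float                   # poison circle info: current circle radius
    next_circle_radius: float              # poison circle info: next circle radius
    visible_items: List[Tuple[str, Tuple[float, float, float]]]  # material: (type, location)
    sounds: List[Tuple[str, Tuple[float, float]]]                # sound: (kind, source position)
```

These two records together form the raw input from which the AI model's feature extraction starts.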
Step S102: and outputting current frame parallel task information and current frame non-parallel task information corresponding to the intelligent agent based on the current frame state information and the current frame interaction information through an AI model.
The parallel task information is information corresponding to related actions that the agent can execute synchronously at the same moment, and includes, but is not limited to, movement task information, first-direction aiming task information, second-direction aiming task information and non-parallel task selection information. The movement task information includes the movement direction, movement speed, posture while moving and so on. The aiming task information characterizes the agent's aiming direction.
That is, at the same moment the agent can synchronously output any combination of the action corresponding to the movement task, the action corresponding to the first-direction aiming task, the action corresponding to the second-direction aiming task, and the action corresponding to non-parallel task selection.
As shown in fig. 3, the moving directions include, but are not limited to, front, rear, left, right, front left, rear left, front right, and rear right directions.
The aiming direction includes, but is not limited to, aiming up, down, left and right, where aiming up and down is the first direction, aiming left and right is the second direction, and the first and second directions are perpendicular to each other.
The non-parallel task information is information corresponding to tasks whose actions are mutually exclusive for the agent at the same moment, and comprises at least one of attack task information, material pick-up task information, posture control task information and blood volume replenishment task information.
That is, at the same moment the agent can only output one of the action corresponding to the attack task, the action corresponding to the material pick-up task, the action corresponding to the posture control task, and the action corresponding to the blood volume replenishment task. The attack task information is used to control the selection and switching of weapons when the agent fights agents of other camps, such as firing, switching to a melee weapon, switching to a ranged weapon, holstering the weapon and throwing a throwable.
The material pick-up task information is used to control the agent to pick up corresponding items within a preset range, such as weapons and blood packs within that range.
The posture control task information is used to control the agent to switch posture, such as jumping, squatting, lying prone, standing, running and walking, as shown in fig. 3.
The blood volume replenishment task information is used to control the agent to select a suitable medicine to treat itself and restore its state.
A preset AI model screens out the current-frame parallel task information and current-frame non-parallel task information corresponding to the agent according to the current-frame state information and current-frame interaction information. From the obtained parallel task information and non-parallel task information, the agent knows that in the current-frame state it may output the actions corresponding to one or more subtasks in the parallel task information, but may only output the action corresponding to a single subtask in the non-parallel task information. This prevents the agent from synchronously outputting mutually exclusive actions in its current state, giving the AI simulation a better anthropomorphic effect.
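The split between parallel and non-parallel tasks amounts to a simple selection rule: every parallel head may act in the same frame, while exactly one non-parallel task is chosen. The sketch below illustrates this with made-up head names and scores; the thresholding scheme is an assumption for illustration, as the patent does not specify how head outputs are converted into actions:

```python
PARALLEL_HEADS = ("move", "aim_vertical", "aim_horizontal")
EXCLUSIVE_HEADS = ("attack", "pick_up", "change_pose", "heal")

def select_actions(parallel_scores, exclusive_scores, threshold=0.5):
    # Every parallel head whose score clears the threshold executes simultaneously.
    parallel = [h for h in PARALLEL_HEADS if parallel_scores[h] > threshold]
    # Exactly one non-parallel task is chosen: the argmax over the exclusive heads,
    # so mutually exclusive actions can never be output in the same frame.
    exclusive = max(EXCLUSIVE_HEADS, key=lambda h: exclusive_scores[h])
    return parallel, exclusive
```

For example, with high scores on "move" and "aim_vertical" the agent moves and aims in the same frame, but only the single best-scoring exclusive task (say, "pick_up") is executed alongside them.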
Referring to fig. 4, the AI model includes a first fully-connected network and a second fully-connected network, where a fully-connected network is also called a fully connected layer (FC). In some embodiments, outputting, through the AI model, the current-frame parallel task information and current-frame non-parallel task information corresponding to the agent based on the current-frame state information and current-frame interaction information includes: extracting features from the current-frame state information and the current-frame interaction information respectively to obtain corresponding current-frame state feature information and current-frame interaction feature information; and obtaining, through the time sequence feature extraction module of the AI model, the corresponding current-frame parallel task information and current-frame non-parallel task information based on the current-frame state feature information and current-frame interaction feature information.
The time sequence feature extraction module through the AI model obtains corresponding current frame parallel task information and current frame non-parallel task information based on current frame state feature information and current frame interaction feature information, and the time sequence feature extraction module comprises: inputting the current frame state characteristic information and the current frame interaction characteristic information into a first fully-connected network corresponding to the AI model to obtain corresponding current frame first output information; acquiring current frame fusion state vector information corresponding to the intelligent agent based on the first output information of the current frame through a time sequence feature extraction module of the AI model; and inputting the current frame fusion state vector information into a second fully-connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information.
Referring to fig. 5, in some embodiments, the current frame interaction feature information includes current frame global feature information, current frame poison circle feature information, current frame material feature information, and current frame sound feature information. The current frame global feature information is obtained through current frame global information extraction, the current frame poison circle feature information is obtained through current frame poison circle information extraction, the current frame material feature information is obtained through current frame material information extraction, and the current frame sound feature information is obtained through current frame sound information extraction.
Feature extraction is respectively carried out on the state information of the current frame and the interaction information of the current frame of the intelligent agent to obtain corresponding state feature information of the current frame and interaction feature information of the current frame, and the method comprises the following steps:
and respectively taking the global characteristic information of the current frame, the poison circle characteristic information of the current frame, the material characteristic information of the current frame and the sound characteristic information of the current frame as inputs of a first fully-connected network corresponding to the AI model so as to output corresponding first output information of the current frame.
And the time sequence feature extraction module based on the AI model performs time sequence feature fusion on the first output information of the current frame to acquire the fusion state vector information of the current frame corresponding to the intelligent agent.
And inputting the current frame fusion state vector information into a corresponding second fully-connected network so as to acquire corresponding current frame parallel task information and current frame non-parallel task information.
In some embodiments, the current frame parallel task information includes moving task information, first direction aiming task information, second direction aiming task information and non-parallel task selection information of the agent in the current frame; the non-parallel task information of the current frame comprises attack task information, material picking task information, attitude control task information and blood volume supplementing task information of the intelligent agent in the current frame; the step of inputting the current frame fusion state vector information into a second fully connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information, including:
and respectively inputting the current frame fusion state vector information into a second full-connection network corresponding to the AI model to output corresponding current frame movement task information, current frame first direction aiming task information, current frame second direction aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplementing task information.
The same current frame fusion state vector information is used as the input of the second fully-connected networks, and multiple multi-task learning results are output, so that the learning generalizes better and the simulated behavior is more human-like.
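The shared-input, multi-head arrangement described above can be sketched as follows, with one softmax head per task over that task's own operation space. The head names follow the embodiment; the action-space sizes and weights are made-up assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
fused = rng.normal(size=(64,))  # current-frame fusion state vector (assumed width)

# Hypothetical operation-space sizes per head; the patent fixes the head
# names (move, aim, non-parallel select, attack, pick, posture, heal)
# but not the sizes.
heads = {"move": 9, "aim_x": 11, "aim_y": 11, "np_select": 4,
         "attack": 2, "pick": 2, "posture": 3, "heal": 2}
params = {k: (rng.normal(size=(64, n)) * 0.1, np.zeros(n))
          for k, n in heads.items()}

# One shared input, many task heads: each head is a linear layer + softmax.
policies = {k: softmax(fused @ w + b) for k, (w, b) in params.items()}
for p in policies.values():
    assert abs(p.sum() - 1.0) < 1e-9  # each head is its own distribution
```

Parallel heads are sampled independently; the `np_select` head then picks which single non-parallel head's sample is actually executed.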
In this embodiment, the AI model is provided with corresponding fully-connected neural networks and a time sequence feature extraction module, where the time sequence feature extraction module includes, but is not limited to, an LSTM (Long Short-Term Memory) module, a GRU (Gated Recurrent Unit) module, a Transformer module, and the like.
Taking the time sequence feature extraction module as an LSTM module for illustration, acquiring, through the time sequence feature extraction module of the AI model, the current frame fusion state vector information corresponding to the agent based on the current frame first output information includes the following steps: acquiring the previous frame hidden state information corresponding to the LSTM module; outputting, through the LSTM module, the current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information; and acquiring the current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
The LSTM module serves as an independent feature extraction unit: it accepts the previous frame hidden state information and the current frame first output information as inputs and outputs the corresponding current frame hidden state information, where the hidden state information includes a hidden state (hidden state) and a cell state (cell state), and the current frame hidden state information in turn serves as an input for the next frame.
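A minimal NumPy sketch of the per-frame LSTM update described above: the previous frame's hidden state and cell state, together with the current frame first output information, produce the current frame hidden state, which is then carried to the next frame. Dimensions and weights are illustrative assumptions, not the patent's values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step: previous-frame hidden/cell state plus the current
    frame's first output information -> current-frame hidden state."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b             # stacked gate pre-activations
    i, f, o = (sigmoid(z[k * n:(k + 1) * n]) for k in range(3))
    g = np.tanh(z[3 * n:])
    c = f * c_prev + i * g                 # cell state carries long-term memory
    h = o * np.tanh(c)                     # hidden state: the fusion vector
    return h, c

rng = np.random.default_rng(2)
x_dim, h_dim = 12, 8                       # assumed input/hidden widths
W = rng.normal(size=(4 * h_dim, x_dim)) * 0.1
U = rng.normal(size=(4 * h_dim, h_dim)) * 0.1
b = np.zeros(4 * h_dim)

h, c = np.zeros(h_dim), np.zeros(h_dim)    # initial state before frame 1
for frame in range(3):                     # per-frame loop: state flows forward
    x = rng.normal(size=(x_dim,))          # current-frame first output info
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape)  # (8,)
```

The pair `(h, c)` output for the current frame is exactly what the next frame's step consumes, matching the frame-to-frame chaining described in the text.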
S103: and according to the current frame action output information, controlling interaction between the intelligent agent and the 3D virtual environment to acquire next frame state information and next frame interaction information of the intelligent agent.
And controlling the intelligent agent to execute corresponding action output based on the output current frame action output information, so that the intelligent agent interacts with the 3D virtual environment, and the state information and interaction information of the intelligent agent are updated to obtain the next frame state information and the next frame interaction information of the intelligent agent.
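The interaction loop of steps S102-S103 can be sketched with a stand-in environment and policy; `ToyEnv` and `policy` below are hypothetical placeholders for the 3D virtual environment and the AI model's inference, not part of the patent.

```python
import numpy as np

class ToyEnv:
    """Stand-in for the 3D virtual environment: consumes an action and
    returns next-frame state and interaction information (illustrative)."""
    def __init__(self):
        self.t = 0
    def step(self, action):
        self.t += 1
        state = np.array([self.t, action], dtype=float)      # next-frame state info
        interaction = np.array([self.t * 0.1])               # next-frame interaction info
        return state, interaction

def policy(state, interaction):
    # Placeholder for the AI model's S102 inference over both inputs.
    return int((state.sum() + interaction.sum()) % 3)

env = ToyEnv()
state, interaction = np.zeros(2), np.zeros(1)
for _ in range(4):                  # S102 -> S103 -> S102 ... frame by frame
    action = policy(state, interaction)          # current frame action output
    state, interaction = env.step(action)        # agent-environment interaction
print(env.t)  # 4 frames advanced
```

Each pass through the loop updates the agent's state and interaction information, which become the inputs for the next frame's inference, exactly as S103 and S104 describe.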
In some embodiments, the outputting the current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information includes: and outputting current frame action output information corresponding to the intelligent agent based on a preset strategy gradient optimization function according to the current frame parallel task information and the current frame non-parallel task information.
Illustratively, the preset policy gradient optimization function ∇J(θ) is expressed as:

∇J(θ) = (1/N) Σ_{t=0}^{T} A_t [G(a_t|s_t) + F(a_t|s_t)]

wherein A_t represents the advantage function (Advantage function) at time t, and N represents the number of learning trajectories; G(a_t|s_t) represents the gradient term of the parallel tasks, F(a_t|s_t) represents the gradient term of the operation space of all the non-parallel tasks, T represents all moments in a learning sequence, and t represents a certain moment in the sequence.
Specifically, the gradient term G(a_t|s_t) of the parallel tasks can be expressed as:

G(a_t|s_t) = Σ_{q=1}^{W} ∇ log π(a_{jq,t}|s_t)

wherein W represents the number of parallel tasks, m represents the operation space size of each task, and π(a_{jq,t}|s_t) represents the probability of the operation selected by each parallel task. Each of the W parallelizable tasks follows its own categorical distribution (categorical distribution), i.e., the W tasks are independent of each other.
a_t represents the action selected at time t, and s_t represents the state at time t, including the state of the agent at time t and the interaction state between the agent and the 3D virtual environment, such as the material information, sound information, global information and equipment information at time t; a_{jq,t} represents the action selected at time t by the q-th task, which may be any one of the parallel tasks.
The gradient term F(a_t|s_t) of the operation space of the non-parallel tasks can be expressed as:

F(a_t|s_t) = ∇ log π(a_{j,t}|s_t)

wherein M represents the number of non-parallel tasks, m represents the operation space size of each task, i.e., the number of actions that can be predicted, and a_{j,t} denotes the operation selected within the non-parallel task chosen at time t. For each non-parallel task, all of its actions cannot be predicted for simultaneous execution; only one operation among the non-parallel tasks can be selected for execution at any time.
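Under the definitions above, the quantity optimized at each step is the advantage A_t weighted by the sum of log-probabilities of the W independent parallel selections plus the log-probability of the single selected non-parallel operation. The sketch below estimates that surrogate over one toy trajectory; the trajectory length, head counts and uniform probabilities are assumptions for illustration.

```python
import numpy as np

def log_prob_step(parallel_probs, parallel_actions, np_probs, np_action):
    """Per-step log-probability under the factored policy: W independent
    categorical heads for parallel tasks, plus one categorical over the
    selected non-parallel task's operation space."""
    lp = sum(np.log(p[a]) for p, a in zip(parallel_probs, parallel_actions))
    lp += np.log(np_probs[np_action])   # only one non-parallel op per frame
    return lp

rng = np.random.default_rng(3)
T, W, m = 5, 3, 4                        # trajectory length, parallel heads, ops per head
advantages = rng.normal(size=T)          # A_t, e.g. from a learned value baseline

surrogate = 0.0
for t in range(T):
    probs = [np.full(m, 1.0 / m) for _ in range(W)]   # toy uniform parallel heads
    acts = rng.integers(0, m, size=W)
    np_probs = np.full(m, 1.0 / m)                    # toy non-parallel head
    np_act = int(rng.integers(0, m))
    surrogate += advantages[t] * log_prob_step(probs, acts, np_probs, np_act)
surrogate /= T   # Monte-Carlo estimate of the policy-gradient objective
```

Taking the gradient of this surrogate with respect to the policy parameters recovers the factored form: the independent parallel heads contribute additively, while only one non-parallel operation contributes per frame.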
S104: and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information.
After the next frame state information and the next frame interaction information of the agent are obtained, the next frame action output information corresponding to the agent is output through the AI model based on the next frame state information and the next frame interaction information, according to the operation in step S102. The specific operation process may refer to the description of steps S102-S105 and is not repeated here.
According to the method for formulating the action strategy of the intelligent agent, the current frame state information of the intelligent agent in the 3D virtual environment and the current frame interaction information of the intelligent agent and the 3D virtual environment are obtained; outputting current frame parallel task information and current frame non-parallel task information corresponding to the intelligent agent based on the current frame state information and the current frame interaction information through an AI model; outputting current frame action output information corresponding to the intelligent agent according to the current frame parallel task information and the non-parallel task information of the current frame; according to the current frame action output information, the agent is controlled to interact with the 3D virtual environment so as to acquire next frame state information and next frame interaction information of the agent; and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information. By analyzing the executable parallel task information and the non-parallel task information of the intelligent agent in the current state, the action which can be synchronously executed and the action which can be mutually exclusive executed by the current intelligent agent are obtained according to the parallel task information and the non-parallel task information, and the intelligent agent is controlled to output the corresponding output action according to the parallel task information and the non-parallel task information, so that the action output by the intelligent agent is more reasonable and humanized.
Referring to fig. 6, fig. 6 is a schematic block diagram of a server according to an embodiment of the present application.
As shown in fig. 6, the server 30 may include a processor 301, a memory 302, and a network interface 303. The processor 301, memory 302, and network interface 303 are connected by a system bus, such as an I2C (Inter-integrated Circuit) bus.
Specifically, the processor 301 may be a Micro-controller Unit (MCU), a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or the like.
Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash disk, a removable hard disk, or the like.
The network interface 303 is used for network communication such as transmission of assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of a portion of the structure associated with the present inventive arrangements and is not limiting of the server to which the present inventive arrangements are applied, and that a particular server may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Wherein the processor 301 is configured to run a computer program stored in the memory 302 and to implement the following steps when executing the computer program:
acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information between the agent and the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the intelligent agent based on the current frame state information and the current frame interaction information through an AI model;
outputting current frame action output information corresponding to the intelligent agent according to the current frame parallel task information and the non-parallel task information of the current frame;
according to the current frame action output information, the agent is controlled to interact with the 3D virtual environment so as to acquire next frame state information and next frame interaction information of the agent;
and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information.
In some embodiments, the outputting, by the processor 301, current frame parallel task information and current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information through an AI model includes:
extracting the characteristics of the current frame state information and the current frame interaction information respectively to obtain corresponding current frame state characteristic information and current frame interaction characteristic information;
and acquiring corresponding current frame parallel task information and current frame non-parallel task information based on the current frame state characteristic information and the current frame interaction characteristic information by the time sequence characteristic extraction module of the AI model.
In some embodiments, the processor 301 obtains, through the timing feature extraction module of the AI model, corresponding current frame parallel task information and current frame non-parallel task information based on the current frame state feature information and the current frame interaction feature information, including:
inputting the current frame state characteristic information and the current frame interaction characteristic information into a first fully-connected network corresponding to the AI model to obtain corresponding current frame first output information;
acquiring current frame fusion state vector information corresponding to the intelligent agent based on the first output information of the current frame through a time sequence feature extraction module of the AI model;
and inputting the current frame fusion state vector information into a second fully-connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information.
In some embodiments, the current frame interaction feature information includes current frame global feature information, current frame poison circle feature information, current frame material feature information, and current frame sound feature information, and the processor 301 inputs the current frame state feature information and the current frame interaction feature information to a first fully-connected network corresponding to the AI model to obtain corresponding current frame first output information, including:
and respectively inputting the current frame state characteristic information, the current frame global characteristic information, the current frame poison circle characteristic information, the current frame material characteristic information and the current frame sound characteristic information into a corresponding first fully-connected network to acquire corresponding first output information of the current frame.
In some embodiments, the current frame parallel task information includes moving task information, aiming task information and non-parallel task selection information of the agent in the current frame; the non-parallel task information of the current frame comprises attack task information, material picking task information, attitude control task information and blood volume supplementing task information of the intelligent agent in the current frame; the processor 301 inputs the current frame fusion state vector information into the second fully-connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information, including:
and respectively inputting the current frame fusion state vector information into a second fully-connected network corresponding to the AI model to obtain corresponding current frame movement task information, current frame aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplementing task information.
In some embodiments, the timing feature extraction module includes an LSTM module, and the processor 301 obtains, by using the timing feature extraction module of the AI model, current frame fusion state vector information corresponding to the agent based on the current frame first output information, including:
acquiring the previous frame hidden state information corresponding to the LSTM module;
outputting, through the LSTM module, the current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information;
and acquiring the current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
In some embodiments, the outputting, by the processor 301, the current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information includes:
and outputting current frame action output information corresponding to the intelligent agent based on a preset strategy gradient optimization function according to the current frame parallel task information and the current frame non-parallel task information.
In some embodiments, the policy gradient optimization function ∇J(θ) is:

∇J(θ) = (1/N) Σ_{t=0}^{T} A_t [G(a_t|s_t) + F(a_t|s_t)]

wherein A_t represents the advantage function at time t, N represents the number of learning trajectories, G(a_t|s_t) represents the gradient term of the parallel tasks, F(a_t|s_t) represents the gradient term of the operation space of the non-parallel tasks, T represents all moments in a learning sequence, and t represents a certain moment in the sequence.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer readable storage medium may be an internal storage unit of the server of the foregoing embodiment, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the server.
Because the computer program stored in the computer readable storage medium can execute any of the agent action policy formulation methods provided by the embodiments of the present application, the beneficial effects that any of the agent action policy formulation methods provided by the embodiments of the present application can be achieved, and detailed descriptions thereof are omitted herein.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (8)

1. An agent action policy formulation method, the method comprising:
acquiring current frame state information of an agent in a 3D virtual environment and current frame interaction information between the agent and the 3D virtual environment;
outputting current frame parallel task information and current frame non-parallel task information corresponding to the intelligent agent based on the current frame state information and the current frame interaction information through an AI model;
outputting current frame action output information corresponding to the intelligent agent according to the current frame parallel task information and the non-parallel task information of the current frame;
controlling interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information so as to acquire next frame state information and next frame interaction information of the intelligent agent;
outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame interaction information;
the outputting, by the AI model, current frame parallel task information and current frame non-parallel task information corresponding to the agent based on the current frame state information and the current frame interaction information includes:
extracting the characteristics of the current frame state information and the current frame interaction information respectively to obtain corresponding current frame state characteristic information and current frame interaction characteristic information;
acquiring corresponding current frame parallel task information and current frame non-parallel task information based on current frame state feature information and current frame interaction feature information through a time sequence feature extraction module of the AI model;
the time sequence feature extraction module through the AI model obtains corresponding current frame parallel task information and current frame non-parallel task information based on current frame state feature information and current frame interaction feature information, and the time sequence feature extraction module comprises:
inputting the current frame state characteristic information and the current frame interaction characteristic information into a first fully-connected network corresponding to the AI model to obtain corresponding current frame first output information;
acquiring current frame fusion state vector information corresponding to the intelligent agent based on the first output information of the current frame through a time sequence feature extraction module of the AI model;
and inputting the current frame fusion state vector information into a second fully-connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information.
2. The method of claim 1, wherein the current frame interaction characteristic information includes current frame global characteristic information, current frame poison circle characteristic information, current frame material characteristic information, and current frame sound characteristic information, and the inputting the current frame state characteristic information and the current frame interaction characteristic information into the first fully connected network corresponding to the AI model to obtain corresponding current frame first output information includes:
and respectively inputting the current frame state characteristic information, the current frame global characteristic information, the current frame poison circle characteristic information, the current frame material characteristic information and the current frame sound characteristic information into a corresponding first fully-connected network to acquire corresponding first output information of the current frame.
3. The method of claim 2, wherein the current frame parallel task information includes movement task information, aiming task information, and non-parallel task selection information of the agent at the current frame; the non-parallel task information of the current frame comprises attack task information, material picking task information, attitude control task information and blood volume supplementing task information of the intelligent agent in the current frame; the step of inputting the current frame fusion state vector information into a second fully connected network corresponding to the AI model to obtain corresponding current frame parallel task information and current frame non-parallel task information, including:
and respectively inputting the current frame fusion state vector information into a second fully-connected network corresponding to the AI model to obtain corresponding current frame movement task information, current frame aiming task information, current frame non-parallel task selection information, current frame attack task information, current frame material picking task information, current frame attitude control task information and current frame blood volume supplementing task information.
4. The method of claim 1, wherein the timing feature extraction module comprises an LSTM module, the timing feature extraction module that passes through the AI model obtains current frame fusion state vector information corresponding to the agent based on the current frame first output information, comprising:
acquiring the previous frame hidden state information corresponding to the LSTM module;
outputting, through the LSTM module, the current frame hidden state information corresponding to the LSTM module based on the current frame first output information and the previous frame hidden state information;
and acquiring the current frame fusion state vector information corresponding to the agent according to the current frame hidden state information.
5. The method according to claim 1, wherein outputting the current frame action output information corresponding to the agent according to the current frame parallel task information and the current frame non-parallel task information comprises:
and outputting current frame action output information corresponding to the intelligent agent based on a preset strategy gradient optimization function according to the current frame parallel task information and the current frame non-parallel task information.
6. The method of claim 5, wherein the policy gradient optimization function ∇J(θ) is:

∇J(θ) = (1/N) Σ_{t=0}^{T} A_t [G(a_t|s_t) + F(a_t|s_t)]

wherein A_t represents the advantage function at time t, N represents the number of learning trajectories, G(a_t|s_t) represents the gradient term of the parallel tasks, F(a_t|s_t) represents the gradient term of the operation space of the non-parallel tasks, T represents all moments in a learning sequence, and t represents a certain moment in the sequence.
7. A server, wherein the server comprises a processor and a memory;
the memory stores a computer program and an AI model that can be called and executed by the processor, wherein the computer program, when executed by the processor, implements the agent action policy formulation method according to any one of claims 1 to 6.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, causes the processor to implement the agent action policy formulation method according to any one of claims 1 to 6.
CN202011312201.7A 2020-11-20 2020-11-20 Intelligent body action policy making method, server and storage medium Active CN112494949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011312201.7A CN112494949B (en) 2020-11-20 2020-11-20 Intelligent body action policy making method, server and storage medium

Publications (2)

Publication Number Publication Date
CN112494949A CN112494949A (en) 2021-03-16
CN112494949B true CN112494949B (en) 2023-10-31

Family

ID=74959236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011312201.7A Active CN112494949B (en) 2020-11-20 2020-11-20 Intelligent body action policy making method, server and storage medium

Country Status (1)

Country Link
CN (1) CN112494949B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118400B (en) * 2021-10-11 2023-01-03 中国科学院自动化研究所 Concentration network-based cluster countermeasure method and device
CN116747521B (en) * 2023-08-17 2023-11-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110622174A (en) * 2017-05-19 2019-12-27 渊慧科技有限公司 Imagination-based agent neural network
KR20200063309A (en) * 2018-11-20 2020-06-05 고려대학교 산학협력단 Method and system for performing environment adapting stategy based on ai
CN111401557A (en) * 2020-06-03 2020-07-10 超参数科技(深圳)有限公司 Agent decision making method, AI model training method, server and medium
CN111950726A (en) * 2020-07-09 2020-11-17 华为技术有限公司 Decision method based on multi-task learning, decision model training method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580378B2 (en) * 2018-03-14 2023-02-14 Electronic Arts Inc. Reinforcement learning for concurrent actions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant