CN111401557B - Agent decision making method, AI model training method, server and medium - Google Patents


Info

Publication number
CN111401557B
Authority
CN
China
Prior art keywords
information
agent
current frame
frame
map
Prior art date
Legal status
Active
Application number
CN202010492473.3A
Other languages
Chinese (zh)
Other versions
CN111401557A (en)
Inventor
张弛
郭仁杰
王宇舟
武建芳
杨木
杨正云
李宏亮
刘永升
Current Assignee
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Application filed by Super Parameter Technology Shenzhen Co ltd filed Critical Super Parameter Technology Shenzhen Co ltd
Priority to CN202010492473.3A priority Critical patent/CN111401557B/en
Publication of CN111401557A publication Critical patent/CN111401557A/en
Application granted granted Critical
Publication of CN111401557B publication Critical patent/CN111401557B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress
    • A63F13/58 Controlling game characters or game objects based on the game progress by computing conditions of game characters, e.g. stamina, strength, motivation or energy level
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/65 Methods for processing data by generating or executing the game program for computing the condition of a game character

Abstract

The application discloses an agent decision making method based on an AI model, an AI model training method, a server, and a medium. The method comprises the following steps: acquiring current frame state information of an agent and current frame 3D map information in a 3D virtual environment; outputting, through a time sequence feature extraction module of an AI model, current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information; obtaining next frame state information of the agent according to the current frame action output information; acquiring historical position information of the agent, and generating next frame 3D map information according to the historical position information; and outputting next frame action output information corresponding to the agent according to the next frame state information and the next frame 3D map information. Reliable and efficient AI simulation is thereby achieved.

Description

Agent decision making method, AI model training method, server and medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to an agent decision making method, an AI model training method, a server, and a medium.
Background
With the rapid development of Artificial Intelligence (AI) technology, AI is now widely applied in fields such as 3D games, virtual traffic, autonomous driving simulation, and robot trajectory planning, and AI simulation in 3D virtual spaces has great commercial value.
At present, the correct decisions that an agent needs to make at different positions in an AI simulation are generally learned through the memory capacity of a neural network, and a soft-attention mechanism is used to perform decision analysis over all state information, including dynamically changing information and statically unchanging information, such as the continuously moving teammates and enemies in a 3D game and the positions of material points. This approach can cope with scenes where the environmental information changes simply, but it is not suitable for scenes where the environmental information changes rapidly, and the agent struggles to make long-term decisions. How to realize reliable and efficient AI simulation has therefore become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides an agent decision making method, an AI model training method, a server and a medium, which can realize reliable and efficient AI simulation.
In a first aspect, an embodiment of the present application provides an agent decision making method based on an AI model, including:
acquiring current frame state information of an agent and current frame 3D map information in a 3D virtual environment;
outputting current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information through a time sequence feature extraction module of an AI model;
obtaining the next frame state information of the intelligent agent according to the current frame action output information;
acquiring historical position information of the intelligent agent, and generating the next frame of 3D map information according to the historical position information;
and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame 3D map information.
In a second aspect, an embodiment of the present application further provides a method for training an AI model, including:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information and multi-frame 3D map information of an agent;
outputting multi-frame fusion state vector information corresponding to the agent through a timing sequence feature extraction module of an AI model to be trained based on the multi-frame state information and the multi-frame 3D map information;
constructing a loss function according to the multi-frame fusion state vector information;
and performing multi-step iteration on the loss function to train and update the AI model.
In a third aspect, an embodiment of the present application further provides a server, where the server includes a processor, a memory, and a computer program stored on the memory and executable by the processor, where the memory stores an AI model, and where the computer program, when executed by the processor, implements the AI model-based agent decision making method as described above; alternatively, a training method of the AI model as described above is implemented.
In a fourth aspect, the present application further provides a computer-readable storage medium for storing a computer program, which when executed by a processor, causes the processor to implement the above-mentioned AI model-based agent decision-making method; alternatively, the above-described training method of the AI model is implemented.
The embodiments of the application provide an agent decision making method based on an AI model, an AI model training method, a server, and a computer-readable storage medium. Based on the current frame state information of an agent and the current frame 3D map information in a 3D virtual environment, a time sequence feature extraction module of the AI model outputs the current frame action output information corresponding to the agent; the next frame state information of the agent is obtained according to the current frame action output information; the next frame 3D map information is generated according to the historical position information of the agent; and the next frame action output information of the agent is then obtained according to the next frame state information and the next frame 3D map information. The action output information of every frame is obtained in this manner, enabling long-term decisions and thereby reliable and efficient AI simulation.
Drawings
To more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart illustrating steps of an AI model based agent decision-making method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a first-layer channel of a 3D map provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of a second-layer channel of a 3D map provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a third-layer channel of a 3D map provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a fourth-layer channel of a 3D map provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of an AI model based agent action output provided by an embodiment of the application;
FIG. 7 is a flowchart illustrating steps of a method for training an AI model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of AI model training provided by an embodiment of the present application;
fig. 9 is a schematic block diagram of a server provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
At present, in AI simulation of a 3D virtual space, the memory capacity of a neural network is generally used to learn the correct decisions that an agent needs to make at different positions, and a soft-attention mechanism is used to analyze all state information, including dynamically changing information and statically unchanging information, such as the continuously moving teammates and enemies in a 3D game and the positions of material points. Such AI simulation can cope with scenes where the environmental information changes simply, but it is not suitable for scenes where the environmental information changes rapidly, and the agent struggles to make long-term decisions.
In order to solve the above problems, embodiments of the present application provide an AI model-based agent decision making method, an AI model training method, a server, and a computer-readable storage medium, which are used to implement reliable and efficient AI simulation. The AI model-based agent decision making method and the AI model training method can be applied to a server, and the server can be a single server or a server cluster consisting of a plurality of servers.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating an AI model-based agent decision-making method according to an embodiment of the present disclosure.
As shown in fig. 1, the AI model-based agent decision-making method specifically includes steps S101 to S105.
S101, obtaining current frame state information of an agent and current frame 3D map information in a 3D virtual environment.
For example, in application scenarios such as robot simulation in a 3D virtual environment, robotic arm control, autonomous driving, and virtual traffic simulation, or in game AI for 3D games, correct decisions must be made for an Agent in the 3D virtual environment to achieve rapid and efficient simulation; to this end, the current frame state information of the agent and the current frame 3D map information in the 3D virtual environment are acquired. An agent is an entity that resides in a complex dynamic environment, autonomously senses environmental information, autonomously takes actions, and accomplishes a series of preset goals or tasks. The state information of the agent includes, but is not limited to, its position information, motion information, combat power information, etc.
Illustratively, the 3D map information is relative map information within a preset range centered on the current position of the agent, rather than global map information of the 3D virtual environment; for example, relative map information within a 90 m × 90 m range with the agent's current position as the center point.
In an embodiment, the 3D map of the 3D virtual environment comprises multiple layers of channels, each channel composed of a plurality of grids, for example n × n grids; with n = 9, each channel is composed of 9 × 9 grids. Each grid has a side length of L m; with L = 10, each grid is 10 m.
It should be noted that the number of grids per channel and the grid size can be flexibly set according to the actual situation and are not specifically limited herein. By performing grid segmentation on the local 3D map, the problem of an excessively large map information dimension caused by an excessively large map is avoided.
Each layer of channel records different types of information, wherein the different types of information include but are not limited to whether the intelligent agent moves to the position of the grid, the frequency of the intelligent agent moving to the position of the grid, the sequence of the intelligent agent moving to the position of the grid, the number of material points in the position of the grid, the state information of the intelligent agent moving to the position of the grid, and the like.
Optionally, the multiple layers of channels of the 3D map include a first layer of channels, a second layer of channels, a third layer of channels, and a fourth layer of channels, where a grid of the first layer of channels records whether the agent moves to a position where the grid is located, a grid of the second layer of channels records a frequency with which the agent moves to the position where the grid is located, a grid of the third layer of channels records an order in which the agent moves to the position where the grid is located, and a grid of the fourth layer of channels records a number of material points at the position where the grid is located.
Illustratively, filling a corresponding grid of the first-layer channel with first identification information to represent that the intelligent agent moves to the position of the grid; and filling the second identification information into the corresponding grid of the first-layer channel to represent that the intelligent agent does not move to the position of the grid.
For example, as shown in fig. 2, the first identification information is set to be a value 1, the second identification information is set to be a value 0, and the grid of the first-layer channel stores the value 0 or the value 1 to represent whether the agent has moved to the location of the grid, where the grid storing the value 0 represents that the agent has not moved to the location of the grid, and the grid storing the value 1 represents that the agent has moved to the location of the grid.
Illustratively, the grids of the second-layer channel are filled with integers representing the frequency with which the agent has moved to each grid's location. For example, as shown in fig. 3, a grid of the second-layer channel filled with 0 represents that the agent has not moved to that grid's location, a grid filled with 1 represents that the agent has moved there 1 time, a grid filled with 2 represents 2 times, a grid filled with 3 represents 3 times, and so on.
Illustratively, the grids of the third-layer channel are filled with numbers of different magnitudes, characterizing the order in which the agent moved to the grids' locations. For example, as shown in fig. 4, the smaller the number stored in a grid of the third-layer channel, the later the agent moved to that grid's location. It should be noted that the representation may also be reversed, with smaller numbers indicating earlier visits.
Illustratively, different values are adopted to fill corresponding grids of the fourth layer of channels, and the quantity of material points at the positions of the grids is represented. For example, as shown in fig. 5, a grid filling value 0 of the fourth layer of channels represents that there are no material points at the position of the grid, a grid filling value 1 of the fourth layer of channels represents that there are 1 material points at the position of the grid, a grid filling value 2 of the fourth layer of channels represents that there are 2 material points at the position of the grid, a grid filling value 3 of the fourth layer of channels represents that there are 3 material points at the position of the grid, and so on.
Based on the 3D map information, information such as whether the intelligent agent moves to the position of the corresponding grid of the 3D map, the frequency of the intelligent agent moving to the position of the grid, the sequence of the intelligent agent moving to the position of the grid, the number of material points at the position of the grid and the like is obtained.
In an embodiment, the 3D map records, over a preset number of most recent records, information such as whether the agent has moved to the position of each grid, the frequency of such visits, the order of the visits, and the number of material points at each grid's position. For example, with the preset number set to 20: a first-layer-channel grid storing 0 indicates that the agent has not moved to that grid's position within the 20 historical records, and 1 indicates that it has; a second-layer-channel grid filled with 1 indicates the agent reached that grid's position 1 time within the 20 records, and 2 indicates 2 times; a third-layer-channel grid stores the order in which the agent reached that grid within the 20 records, numbered from 0 to 19, with later arrivals stored as smaller numbers.
By embedding the agent's corresponding position information into the channels of the 3D map and adding the material point information into a channel, position information recognition in the AI simulation is improved, which further improves the generalization of the AI model network.
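To make the channel layout concrete, the following is a minimal sketch of how the four-layer local map described above might be constructed. The patent provides no code, so the function names, the n = 9 / L = 10 m grid, the history length of 20, and the visit-order convention (later visits stored as larger numbers, i.e. the reversed convention that the text notes is also permissible) are all assumptions for illustration.

```python
import numpy as np

def build_local_map(agent_pos, history, material_points, n=9, cell=10.0):
    """Build a 4-channel n x n local map centered on the agent.

    Channel 0: whether the agent has visited the grid (0/1).
    Channel 1: how many times the agent has visited the grid.
    Channel 2: visit order (here later visits get larger numbers; the
               example in fig. 4 uses the opposite convention).
    Channel 3: number of material points in the grid.
    """
    channels = np.zeros((4, n, n), dtype=np.float32)
    half = n // 2

    def to_grid(pos):
        # Map a world (x, y) position to a grid index relative to the agent.
        gx = int((pos[0] - agent_pos[0]) // cell) + half
        gy = int((pos[1] - agent_pos[1]) // cell) + half
        if 0 <= gx < n and 0 <= gy < n:
            return gx, gy
        return None  # outside the local 90 m x 90 m window

    # history is ordered oldest -> newest, e.g. the last 20 recorded positions.
    for order, pos in enumerate(history):
        idx = to_grid(pos)
        if idx is None:
            continue
        channels[0, idx[0], idx[1]] = 1.0           # visited flag
        channels[1, idx[0], idx[1]] += 1.0          # visit frequency
        channels[2, idx[0], idx[1]] = float(order)  # visit order, 0..len-1

    for pos in material_points:
        idx = to_grid(pos)
        if idx is not None:
            channels[3, idx[0], idx[1]] += 1.0      # material point count

    return channels
```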
And S102, outputting current frame action output information corresponding to the agent through a time sequence feature extraction module of an AI model based on the current frame state information and the current frame 3D map information.
In this embodiment, the AI model is provided with a corresponding time sequence feature extraction module, which includes, but is not limited to, an LSTM (Long Short-Term Memory) module, a GRU (Gated Recurrent Unit) module, a Transformer module, and the like.
The AI model is called; its time sequence feature extraction module takes the current frame state information and the current frame 3D map information of the agent as input information, processes the input information, performs time sequence feature extraction, and outputs the current frame action output information corresponding to the agent.
In an embodiment, the current frame state information of the agent and the current frame 3D map information are first subjected to CONCAT fusion and then input to the time sequence feature extraction module for processing. Specifically, the state embedding vector feature S_t is first extracted from the current frame state information of the agent, and the map vector feature M_t is obtained from the current frame 3D map information; S_t and M_t are merged and input into a fully-connected neural network for processing, obtaining the fusion information corresponding to S_t and M_t. The fusion information is then input into the time sequence feature extraction module for processing, which outputs the current frame action output information corresponding to the agent.
In one embodiment, different types of information are recorded in the multi-layer channels of the 3D map, and the map vector feature M_t is obtained from the current frame 3D map information; specifically, multi-layer convolution calculation is performed on the different types of information to obtain the corresponding map vector feature M_t. For example, taking a 3D map comprising the four layers of channels described above, the current frame 3D map information is subjected to 4 layers of convolution calculation, with a flattening operation in the last convolution layer, to obtain the map vector feature M_t.
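As a hedged sketch of the convolution and fusion steps just described (the patent names no framework; PyTorch and all layer sizes here are assumptions), the 4-layer convolution with a final flattening produces M_t, which is concatenated with S_t and passed through a fully-connected network:

```python
import torch
import torch.nn as nn

class MapStateFusion(nn.Module):
    """Sketch of the fusion step: a 4-layer CNN encodes the 4-channel local
    map into the map vector feature M_t; M_t is concatenated with the state
    embedding S_t and passed through a fully-connected network.
    All layer sizes are illustrative assumptions, not values from the patent."""

    def __init__(self, state_dim=128, fused_dim=256, grid_n=9):
        super().__init__()
        self.map_encoder = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),                       # flattening in the last conv stage
        )
        map_dim = 8 * grid_n * grid_n           # assumes a 9 x 9 grid
        self.fusion = nn.Sequential(
            nn.Linear(state_dim + map_dim, fused_dim), nn.ReLU(),
        )

    def forward(self, state_embedding, local_map):
        m_t = self.map_encoder(local_map)                # map vector feature M_t
        x_t = torch.cat([state_embedding, m_t], dim=-1)  # CONCAT fusion with S_t
        return self.fusion(x_t)                          # fusion information x_t
```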
And S103, acquiring state information of the next frame of the intelligent agent according to the current frame action output information.
And S104, acquiring historical position information of the intelligent agent, and generating the next frame of 3D map information according to the historical position information.
The agent is controlled to execute the corresponding action output based on the output current frame action output information, interacting with the 3D virtual environment; the state information of the agent is updated, and the next frame state information of the agent is obtained. Meanwhile, the position information of the agent is recorded, and each recorded position is stored as the agent's historical position information, either locally on the server or in a storage device other than the server; this is not specifically limited herein.
The stored historical position information of the agent is queried and acquired, and the next frame 3D map information is constructed from it. For example, a preset number of historical positions used for constructing 3D map information is set in advance; that preset number of historical positions is acquired, and the next frame 3D map information is constructed based on them. Optionally, the preset number is set to 20, i.e., the next frame 3D map information is constructed from 20 sets of historical position information. The preset number can be flexibly set according to the actual situation and is not specifically limited herein.
In one embodiment, to save storage space, only a preset number of historical positions are retained. Each time new position information is recorded and stored, the earliest recorded entry is deleted so that the number of stored historical positions is maintained at the preset number. Specifically, each time the current position information of the agent is recorded, it is determined whether the number of stored historical positions has reached the preset number: if not, the current position information is stored directly; if so, the current position information is stored and the earliest recorded historical position is deleted, keeping the stored count at the preset number.
In one embodiment, the position information of the agent is not recorded every frame; instead, a preset duration is set, and the position information of the agent is recorded and stored once every preset duration as the agent's historical position information. The preset duration can be flexibly set according to the actual situation; for example, with the preset duration set to 10 s, the position information of the agent is recorded and stored every 10 s.
Combining this with the preset number above: assuming the preset number is 20 and the preset duration is 10 s, the position of the agent is recorded every 10 s and 20 sets of historical position information are kept in total, which is equivalent to saving the historical positions within a 200 s time span. The next frame 3D map information is then constructed from these 20 sets of historical positions within the 200 s span.
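A minimal sketch of this bookkeeping, assuming Python and illustrative names; a bounded deque naturally implements "store the newest, delete the earliest" at the preset number:

```python
import time
from collections import deque

class PositionHistory:
    """Keeps at most `max_len` positions, recorded at most once every
    `interval` seconds; the oldest entry is dropped automatically.
    With max_len=20 and interval=10 s this covers a 200 s time span,
    matching the example above. Names are illustrative assumptions."""

    def __init__(self, max_len=20, interval=10.0):
        self.buffer = deque(maxlen=max_len)  # deque drops the oldest item itself
        self.interval = interval
        self._last_record = None

    def record(self, position, now=None):
        now = time.monotonic() if now is None else now
        if self._last_record is None or now - self._last_record >= self.interval:
            self.buffer.append(position)
            self._last_record = now

    def snapshot(self):
        # Oldest -> newest; used to build the next frame 3D map information.
        return list(self.buffer)
```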
For example, taking the LSTM module as the time sequence feature extraction module: the LSTM module serves as an independent feature extraction unit that accepts the previous frame hidden state information and the current frame inputs, and outputs the corresponding current frame hidden state information, where the hidden state information comprises the hidden information (hidden state) and the cell state information (cell state); the current frame hidden state information then serves as an input for the next frame. Based on the current frame state information of the agent, the current frame 3D map information, and the previous frame hidden state information, the current frame state information and current frame 3D map information are first CONCAT-fused; the fusion information and the previous frame hidden state information are then input into the LSTM module, which outputs the corresponding current frame hidden state information. The current frame action output information corresponding to the agent is then obtained from the current frame hidden state information.
For example, as shown in fig. 6, the current frame 3D map information is subjected to 4-layer convolution calculation to obtain the current frame map vector feature M_t; the current frame state embedding vector feature S_t corresponding to the current frame state information of the agent is CONCAT-merged with M_t and input into the fully-connected neural network for processing to obtain the corresponding fusion information; the fusion information, the previous frame hidden information h_{t-1}, and the previous frame cell state information C_{t-1} are then input into the LSTM module for processing, which outputs the current frame action output information corresponding to the agent.
Three gates are designed in the LSTM module: a forget gate, an input gate, and an output gate, each processing the input information differently. The inputs are the previous frame hidden state information, comprising the previous frame hidden information h_{t-1} and the previous frame cell state information C_{t-1}, together with the fusion information x_t of the agent's current frame state information and current frame 3D map information; the outputs are the current frame hidden information h_t and the current frame cell state information C_t. The forget gate merges (CONCAT) h_{t-1} and x_t, passes the result through a forward network, and outputs the forget probability f_t (a value between 0 and 1) via a Sigmoid function. The input gate likewise merges (CONCAT) h_{t-1} and x_t, passes the result through a forward network and a Sigmoid function to output the corresponding input probability i_t (a value between 0 and 1), and passes x_t through another forward network with a tanh function to output the candidate cell state C̃_t. f_t is multiplied with the previous frame cell state information C_{t-1}, i_t is multiplied with C̃_t, and the two products are added; the sum is the updated current frame cell state information C_t:
C_t = f_t · C_{t-1} + i_t · C̃_t
The output gate controls the output information of the LSTM unit; the output current frame hidden information h_t integrates the previous frame hidden information h_{t-1}, the previous frame cell state information C_{t-1}, and the fusion information x_t. The output probability O_t of the fusion information x_t is calculated via a Sigmoid function; meanwhile, the current frame cell state information C_t is processed by a tanh function and multiplied with O_t to obtain the current frame hidden information h_t:
h_t = O_t · tanh(C_t)
The current frame hidden information h_t comprises the fusion state vector information corresponding to the agent; based on h_t in the output current frame hidden state information, the fusion state vector information corresponding to the agent is acquired, which fuses multi-frame state information of the agent. The current frame action output information of the agent is then obtained from the fusion state vector information.
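The gate equations above can be written out directly. The following didactic sketch (PyTorch assumed, each W_* a linear layer over the concatenated vector) is structurally equivalent to a standard library cell such as torch.nn.LSTMCell, which an actual implementation would more likely use:

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o):
    """One LSTM step written out to mirror the equations above.
    Each W_* is a linear layer acting on CONCAT(h_{t-1}, x_t);
    this is a didactic sketch of the standard LSTM cell."""
    z = torch.cat([h_prev, x_t], dim=-1)   # CONCAT(h_{t-1}, x_t)
    f_t = torch.sigmoid(W_f(z))            # forget probability f_t
    i_t = torch.sigmoid(W_i(z))            # input probability i_t
    c_tilde = torch.tanh(W_c(z))           # candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde     # C_t = f_t*C_{t-1} + i_t*candidate
    o_t = torch.sigmoid(W_o(z))            # output probability O_t
    h_t = o_t * torch.tanh(c_t)            # h_t = O_t * tanh(C_t)
    return h_t, c_t
```

Here each W_* would be constructed as, e.g., nn.Linear(hidden_dim + input_dim, hidden_dim); these dimensions are illustrative assumptions.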
And S105, outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame 3D map information.
After the next frame state information and the next frame 3D map information of the agent are obtained, according to the operation in step S102, the next frame action output information corresponding to the agent is output based on the next frame state information and the next frame 3D map information of the agent through the timing feature extraction module of the AI model. The specific operation process can refer to the process described in step S102, and is not described herein again.
In this way, based on each frame of state information of the agent and each frame of 3D map information, the action output information corresponding to the agent in each frame can be output, and the agent can make efficient and reliable long-term decisions accordingly. That is, by combining a time sequence feature extraction module, such as the LSTM module, with the 3D map, the agent can form a good memory in the 3D virtual environment and make long-term decisions.
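Putting steps S101 to S105 together, the per-frame decision loop might look like the following glue code; env, policy, history, and every method name here are hypothetical stand-ins, not APIs from the patent:

```python
# Hypothetical glue code for the per-frame decision loop (steps S101-S105);
# env, policy, history and all method names are assumptions for illustration.
def run_episode(env, policy, history, max_frames=1000):
    state = env.reset()
    h, c = policy.initial_hidden()
    local_map = history.build_map(state.position)            # current frame 3D map info
    for _ in range(max_frames):
        action, (h, c) = policy.act(state, local_map, (h, c))  # S102: action output
        state = env.step(action)                             # S103: next frame state
        history.record(state.position)                       # store position info
        local_map = history.build_map(state.position)        # S104: next frame map
    return state
```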
In the AI model-based agent decision making method provided by the above embodiments, based on the current frame state information of the agent and the current frame 3D map information in the 3D virtual environment, the time sequence feature extraction module of the AI model outputs the current frame action output information corresponding to the agent; the next frame state information of the agent is obtained according to the current frame action output information; the next frame 3D map information is generated according to the historical position information of the agent; and the next frame action output information of the agent is then obtained according to the next frame state information and the next frame 3D map information. The action output information of every frame is obtained in this manner, enabling long-term decisions and thereby reliable and efficient AI simulation.
The embodiment of the application also provides a training method of the AI model. The training method of the AI model can be applied to a server, so that reliable and efficient AI simulation can be realized by calling the trained AI model. The server may be a single server or a server cluster composed of a plurality of servers.
Referring to fig. 7, fig. 7 is a flowchart illustrating a method for training an AI model according to an embodiment of the present disclosure.
As shown in fig. 7, the AI model training method includes steps S201 to S204.
S201, obtaining a sample data set, wherein the sample data set comprises multi-frame state information and multi-frame 3D map information of the intelligent agent.
Illustratively, a sample data set for AI model training is stored in a redis (Remote Dictionary Server) database, where the sample data set includes, but is not limited to, multi-frame state information and multi-frame 3D map information of the agent. The sample data set corresponding to AI model training is acquired by querying the redis database.
S202, outputting multi-frame fusion state vector information corresponding to the intelligent agent through a time sequence feature extraction module of the AI model to be trained on the basis of the multi-frame state information and the multi-frame 3D map information.
As described in the embodiment of the AI model-based agent decision making method, the AI model is provided with a corresponding time sequence feature extraction module, wherein the time sequence feature extraction module includes, but is not limited to, an LSTM module, a GRU module, a Transformer module, and the like.
The time sequence feature extraction module of the AI model takes the multi-frame state information and multi-frame 3D map information of the agent as input information, processes the input information, performs time sequence feature extraction, and outputs the multi-frame fusion state vector information corresponding to the agent. Specifically, the state embedding vector features S_i in the multi-frame state information and the map vector features M_i corresponding to the multi-frame 3D map information are extracted; the multi-frame S_i and M_i are input into the time sequence feature extraction module for processing, which outputs the multi-frame fusion state vector information corresponding to the agent.
For example, as shown in fig. 8, taking the LSTM module as the time sequence feature extraction module: multi-layer convolution calculation is performed on each frame of 3D map information to obtain the map vector features M_i corresponding to each frame, comprising M_t, M_{t+1}, etc., and the state embedding vector features S_i corresponding to each frame of state information are obtained, comprising S_t, S_{t+1}, etc. S_t and M_t, together with the previous frame hidden information h_{t-1} and previous frame cell state information C_{t-1}, are input into the LSTM module for processing, which outputs the current frame hidden information h_t and current frame cell state information C_t, i.e., the corresponding fusion state vector information; S_{t+1} and M_{t+1}, together with h_t and C_t, are input into the LSTM module for processing, which outputs the next frame hidden information h_{t+1} and next frame cell state information C_{t+1}; the multi-frame fusion state vector information is obtained from the fusion state vector information corresponding to each frame.
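A sketch of this multi-frame unrolling under the same PyTorch assumption; fusion_net stands for the state/map fusion step sketched earlier and lstm_cell for a torch.nn.LSTMCell, with shapes chosen for illustration:

```python
import torch

def unroll(fusion_net, lstm_cell, state_embeddings, local_maps, h0, c0):
    """Unroll the LSTM over a multi-frame sample to collect the fusion state
    vectors h_t, h_{t+1}, ... used to build the loss. fusion_net is the
    MapStateFusion sketch above; shapes and names are illustrative assumptions."""
    h, c = h0, c0
    fused_states = []
    for s_i, m_i in zip(state_embeddings, local_maps):  # frames t, t+1, ...
        x_i = fusion_net(s_i, m_i)                      # fusion information x_i
        h, c = lstm_cell(x_i, (h, c))                   # one LSTM step
        fused_states.append(h)                          # fusion state vector
    return torch.stack(fused_states)                    # (T, batch, hidden)
```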
S203, constructing a loss function according to the multi-frame fusion state vector information.
The loss function includes terms such as a value loss, a policy gradient loss, and an entropy loss.
In an embodiment, for the multi-frame fusion state vector information, the action output information corresponding to each frame of fusion state vector information and the value function output value corresponding to that action output information are obtained based on each frame of fusion state vector information. The value function output value is used to evaluate the action output information: if the value function output value is high, execution of the corresponding action instruction can be controlled; if it is low, the corresponding action instruction is not executed. The corresponding loss function is then constructed based on the obtained multi-frame action output information and their corresponding value function output values.
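The text names the three loss terms but not their exact form; the following is one conventional actor-critic combination consistent with those terms, with the coefficients and the advantage estimate chosen as illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def build_loss(fused_states, actions, returns, policy_head, value_head,
               value_coef=0.5, entropy_coef=0.01):
    """Sketch of a loss with the three terms named in the text: value loss,
    policy gradient loss, and entropy loss. Coefficients and the advantage
    estimate are illustrative assumptions, not values from the patent."""
    logits = policy_head(fused_states)              # per-frame action output
    values = value_head(fused_states).squeeze(-1)   # value function output value
    dist = torch.distributions.Categorical(logits=logits)

    advantages = (returns - values).detach()        # simple advantage estimate
    policy_loss = -(dist.log_prob(actions) * advantages).mean()
    value_loss = F.mse_loss(values, returns)
    entropy_loss = -dist.entropy().mean()           # encourages exploration

    return policy_loss + value_coef * value_loss + entropy_coef * entropy_loss
```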
And S204, carrying out multi-step iteration on the loss function to train and update the AI model.
Optionally, as shown in fig. 8, the loss function is sent to a GPU (Graphics Processing Unit) for multi-step iterative optimization to obtain the iterated AI model parameters, which include, but are not limited to, the parameters of the time sequence feature extraction module, the parameters of the value function, and so on. The iterated AI model parameters are updated into the AI model to complete the training and updating of the AI model.
Meanwhile, various information generated by continuous interaction with the 3D virtual environment, such as the state information of the agent and the 3D map information, is stored in a data storage system such as redis and used as data in the sample data set for iterative training of the AI model.
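A minimal sketch of such a redis-backed sample buffer using the redis-py client; the key name, serialization, and connection details are assumptions:

```python
import pickle
import redis  # redis-py client; connection details are assumptions

r = redis.Redis(host="localhost", port=6379)

def push_sample(frame_states, frame_maps):
    # Serialize one multi-frame sample produced by interaction with the
    # 3D virtual environment and append it to the training queue.
    r.rpush("sample_data_set", pickle.dumps((frame_states, frame_maps)))

def pop_sample():
    raw = r.lpop("sample_data_set")
    return pickle.loads(raw) if raw is not None else None
```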
Referring to fig. 9, fig. 9 is a schematic block diagram of a server according to an embodiment of the present disclosure.
As shown in fig. 9, the server may include a processor, memory, and a network interface. The processor, memory, and network interface are connected by a system bus, such as an I2C (Inter-integrated Circuit) bus.
Specifically, the Processor may be a Micro-controller Unit (MCU), a Central Processing Unit (CPU), a Digital Signal Processor (DSP), or the like.
Specifically, the memory may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB disk, a removable hard disk, or the like.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 9 is a block diagram of only a portion of the architecture related to the present application and does not limit the servers to which the present application applies; a particular server may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
Wherein the processor is configured to run a computer program stored in the memory and to implement the following steps when executing the computer program:
acquiring current frame state information of an agent and current frame 3D map information in a 3D virtual environment;
outputting current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information through a time sequence feature extraction module of an AI model;
obtaining the next frame state information of the intelligent agent according to the current frame action output information;
acquiring historical position information of the intelligent agent, and generating the next frame of 3D map information according to the historical position information;
and outputting next frame action output information corresponding to the intelligent agent according to the next frame state information and the next frame 3D map information.
In some embodiments, before implementing the timing feature extraction module through the AI model to output the current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information, the processor further implements:
extracting state embedding vector features in the current frame state information, and acquiring map vector features according to the current frame 3D map information;
merging the state embedding vector features and the map vector features and inputting the merged state embedding vector features and the map vector features into a fully-connected neural network to obtain corresponding fusion information;
when the processor outputs the current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information through the timing characteristic extraction module of the AI model, the following steps are specifically implemented:
and inputting the fusion information into the time sequence feature extraction module, and outputting the current frame action output information corresponding to the agent.
In some embodiments, the 3D map of the 3D virtual environment comprises a plurality of layers of channels, each layer of channels consisting of a plurality of meshes, the plurality of layers of channels each recording different types of information.
In some embodiments, the multi-layer channels include at least two layers of channels among a first layer of channels, a second layer of channels, a third layer of channels, and a fourth layer of channels, where a grid of the first layer of channels records whether the agent moves to a position where the grid is located, a grid of the second layer of channels records a frequency of the agent moving to the position where the grid is located, a grid of the third layer of channels records a sequence of the agent moving to the position where the grid is located, and a grid of the fourth layer of channels records a number of material points existing at the position where the grid is located.
In some embodiments, the current frame 3D map information includes different types of information recorded in multiple channels of a 3D map, and the processor specifically implements, when implementing the obtaining of the map vector feature according to the current frame 3D map information:
and carrying out multilayer convolution calculation on the different types of information to obtain the map vector characteristics.
In some embodiments, the current frame 3D map information is relative map information within a preset range centered on a current location of the agent.
In some embodiments, the processor, when executing the computer program, further implements:
and recording and storing the position information of the intelligent agent every preset time, wherein the historical position information of the intelligent agent is a plurality of stored position information.
In some embodiments, when the processor implements the recording and storing of the location information of the agent, the processor implements:
determining whether the quantity of the stored historical position information reaches a preset quantity every time the current position information of the intelligent agent is recorded;
if the quantity of the stored historical position information does not reach the preset quantity, storing the current position information;
and if the quantity of the stored historical position information reaches the preset quantity, storing the current position information, and deleting the historical position information recorded earliest in the stored historical position information.
In some embodiments, the temporal feature extraction module comprises an LSTM module, the processor, when executing the computer program, further implementing:
acquiring previous frame hidden state information corresponding to the LSTM module;
when the processor outputs the current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information through the timing characteristic extraction module of the AI model, the following steps are specifically implemented:
outputting, by the LSTM module, current frame hidden state information corresponding to the LSTM module based on the current frame state information, the current frame 3D map information, and the previous frame hidden state information;
and acquiring the current frame action output information corresponding to the agent according to the current frame hidden state information.
In some embodiments, when the processor obtains the current frame action output information corresponding to the agent according to the current frame hidden state information, the following is specifically implemented:
acquiring fusion state vector information corresponding to the agent according to the current frame hidden state information;
and acquiring the current frame action output information according to the fusion state vector information.
In some embodiments, the processor, when executing the computer program, further implements:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information and multi-frame 3D map information of an agent;
outputting multi-frame fusion state vector information corresponding to the agent through a timing sequence feature extraction module of an AI model to be trained based on the multi-frame state information and the multi-frame 3D map information;
constructing a loss function according to the multi-frame fusion state vector information;
and performing multi-step iteration on the loss function to train and update the AI model.
In some embodiments, the time sequence feature extraction module includes an LSTM module, and the sample data set further includes hidden state information corresponding to the LSTM module; when outputting, through the time sequence feature extraction module of the AI model to be trained, the multi-frame fusion state vector information corresponding to the agent based on the multi-frame state information and the multi-frame 3D map information, the processor specifically implements:
outputting the multi-frame fusion state vector information based on the hidden state information, the multi-frame state information and the multi-frame 3D map information through the LSTM module.
In some embodiments, when the processor implements the constructing of the loss function according to the multi-frame fusion state vector information, the following is specifically implemented:
acquiring multiframe action output information and a value function output value corresponding to the multiframe action output information according to the multiframe fusion state vector information;
and constructing the loss function according to the multi-frame action output information and the value function output value corresponding to the multi-frame action output information.
It should be noted that, as will be clearly understood by those skilled in the art, for convenience and brevity of description, the specific working process of the server described above may refer to the corresponding process in the foregoing embodiment of the AI model-based intelligent agent decision making method and/or the AI model training method, and details are not repeated herein.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement the steps of the AI model-based agent decision-making method and/or the AI model training method provided in the foregoing embodiments. For example, the computer program is loaded by a processor and may perform the following steps:
acquiring current frame state information of an agent and current frame 3D map information in a 3D virtual environment;
calling an AI model, and outputting current frame action output information corresponding to the agent through a time sequence feature extraction module of the AI model based on the current frame state information and the current frame 3D map information;
obtaining the next frame state information of the intelligent agent according to the current frame action output information;
and acquiring historical position information of the intelligent agent, and generating the next frame of 3D map information according to the historical position information.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
The computer readable storage medium may be an internal storage unit of the server in the foregoing embodiment, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk provided on the server, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like.
Since the computer program stored in the computer-readable storage medium can execute any one of the AI model-based agent decision making methods and/or AI model training methods provided in the embodiments of the present application, beneficial effects that can be achieved by any one of the AI model-based agent decision making methods and/or AI model training methods provided in the embodiments of the present application can be achieved, which are detailed in the foregoing embodiments and will not be described herein again.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An AI model-based agent decision-making method, comprising:
acquiring current frame state information of an agent and current frame 3D map information in a 3D virtual environment, wherein the current frame 3D map information is relative map information within a preset range centered on the current position of the agent;
outputting, through a temporal feature extraction module of an AI model, current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information;
obtaining next frame state information of the agent according to the current frame action output information;
acquiring historical position information of the agent, and generating next frame 3D map information according to the historical position information;
outputting next frame action output information corresponding to the agent according to the next frame state information and the next frame 3D map information;
wherein, before the outputting of the current frame action output information corresponding to the agent through the temporal feature extraction module of the AI model based on the current frame state information and the current frame 3D map information, the method further comprises:
extracting state embedding vector features from the current frame state information, and acquiring map vector features according to the current frame 3D map information;
merging the state embedding vector features and the map vector features, and inputting the merged features into a fully-connected neural network to obtain corresponding fusion information;
and wherein the outputting of the current frame action output information corresponding to the agent through the temporal feature extraction module of the AI model based on the current frame state information and the current frame 3D map information comprises:
inputting the fusion information into the temporal feature extraction module, and outputting the current frame action output information corresponding to the agent.
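A minimal PyTorch sketch of the data flow recited in claim 1: state embedding vector features and convolutional map vector features are merged, fused by a fully-connected layer, and stepped through an LSTM serving as the temporal feature extraction module. All layer sizes, the channel count, and the action-space size are illustrative assumptions, not values from the embodiments.

```python
import torch
import torch.nn as nn

class DecisionNet(nn.Module):
    """Illustrative sketch of the claim 1 data flow; all sizes assumed."""
    def __init__(self, state_dim=64, map_channels=4, n_actions=8):
        super().__init__()
        # state embedding vector features
        self.state_embed = nn.Linear(state_dim, 128)
        # map vector features from the multi-channel map information
        self.map_conv = nn.Sequential(
            nn.Conv2d(map_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # fully-connected fusion of the merged features
        self.fuse = nn.Linear(128 + 32, 128)
        # temporal feature extraction module (LSTM)
        self.lstm = nn.LSTM(128, 128, batch_first=True)
        self.action_head = nn.Linear(128, n_actions)

    def forward(self, state, map3d, hidden=None):
        s = torch.relu(self.state_embed(state))               # (B, 128)
        m = self.map_conv(map3d)                              # (B, 32)
        fused = torch.relu(self.fuse(torch.cat([s, m], -1)))  # fusion information
        out, hidden = self.lstm(fused.unsqueeze(1), hidden)   # one frame step
        return self.action_head(out.squeeze(1)), hidden       # action output
```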
2. The method of claim 1, wherein the 3D map of the 3D virtual environment comprises a plurality of layers of channels, each layer of channels is composed of a plurality of grids, and the plurality of layers of channels respectively record different types of information.
3. The method of claim 2, wherein the plurality of layers of channels comprises at least two layers of channels selected from a first layer of channels, a second layer of channels, a third layer of channels, and a fourth layer of channels, wherein a grid of the first layer of channels records whether the agent has moved to the location of the grid, a grid of the second layer of channels records the frequency with which the agent moves to the location of the grid, a grid of the third layer of channels records the order in which the agent moved to the location of the grid, and a grid of the fourth layer of channels records the number of asset points at the location of the grid.
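One way to read claims 2 and 3 is the sketch below, which fills a four-channel grid from a position history. The grid size, the integer grid coordinates, and the source of the asset-point counts are assumptions.

```python
import numpy as np

# Sketch of claims 2-3: one grid map with four channels, each channel
# recording a different statistic. Grid size, coordinate convention,
# and the asset-point source are assumptions.
def build_channels(position_history, asset_points, size=32):
    channels = np.zeros((4, size, size), dtype=np.float32)
    for order, (x, y) in enumerate(position_history, start=1):
        channels[0, x, y] = 1.0    # channel 1: whether the agent visited the grid
        channels[1, x, y] += 1.0   # channel 2: how often the agent visited it
        channels[2, x, y] = order  # channel 3: the order of the visit
    for (x, y), count in asset_points.items():
        channels[3, x, y] = count  # channel 4: number of asset points there
    return channels

# e.g. build_channels([(1, 2), (1, 2), (3, 4)], {(5, 6): 2})
```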
4. The method of claim 1, wherein the current frame 3D map information includes different types of information recorded in a multi-layer channel of a 3D map, and the obtaining map vector features according to the current frame 3D map information includes:
performing multi-layer convolution calculation on the different types of information to obtain the map vector features.
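One plausible reading of the multi-layer convolution calculation, assuming the channels of claims 2-3 form the input and that 2D convolutions over the grid suffice:

```python
import torch
import torch.nn as nn

# Sketch of claim 4: the different types of information recorded in the
# multi-layer channels are stacked as input channels and passed through
# several convolution layers; the flattened output serves as the map
# vector features. Channel count and sizes are illustrative assumptions.
map_conv = nn.Sequential(
    nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten())

map3d = torch.zeros(1, 4, 32, 32)      # four channels over a 32x32 grid
map_vector_features = map_conv(map3d)  # shape (1, 32)
```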
5. The method of claim 1, further comprising:
recording and storing position information of the agent at preset time intervals, wherein the historical position information of the agent is a plurality of pieces of stored position information.
6. The method of claim 5, wherein the recording and storing the position information of the agent comprises:
each time current position information of the agent is recorded, determining whether the quantity of stored historical position information reaches a preset quantity;
if the quantity of stored historical position information has not reached the preset quantity, storing the current position information; and
if the quantity of stored historical position information has reached the preset quantity, storing the current position information and deleting the earliest recorded historical position information from the stored historical position information.
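Claims 5 and 6 together describe a bounded first-in-first-out position store. A minimal sketch, assuming Python's deque is acceptable as the storage and with the preset quantity chosen arbitrarily:

```python
from collections import deque

PRESET_QUANTITY = 10                     # illustrative preset quantity
history = deque(maxlen=PRESET_QUANTITY)  # stored historical position information

def record_position(current_position):
    # When the deque is full, appending automatically deletes the
    # earliest recorded entry, matching the second branch of claim 6.
    history.append(current_position)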
7. The method of any one of claims 1 to 6, wherein the temporal feature extraction module comprises an LSTM module, and the method further comprises:
acquiring previous frame hidden state information corresponding to the LSTM module;
wherein the outputting, by the temporal feature extraction module of the AI model, current frame action output information corresponding to the agent based on the current frame state information and the current frame 3D map information comprises:
outputting, by the LSTM module, current frame hidden state information corresponding to the LSTM module based on the current frame state information, the current frame 3D map information, and the previous frame hidden state information;
acquiring the current frame action output information corresponding to the agent according to the current frame hidden state information.
8. The method according to claim 7, wherein the obtaining the current frame action output information corresponding to the agent according to the current frame hidden state information includes:
acquiring fusion state vector information corresponding to the agent according to the current frame hidden state information;
acquiring the current frame action output information according to the fusion state vector information.
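To illustrate claims 7 and 8, the sketch below carries the LSTM hidden state from one frame to the next and derives the frame's action output from the new hidden state. The sizes, the random placeholder inputs, and the action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of claims 7-8: the previous frame's hidden state is fed back
# into the LSTM together with the current frame's fused features, and
# the new hidden state both drives the action output and is kept for
# the next frame.
lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
action_head = nn.Linear(128, 8)

hidden = None                          # no previous-frame hidden state at frame 0
for frame in range(3):
    fused = torch.randn(1, 1, 128)     # fusion of state + map features, one frame
    out, hidden = lstm(fused, hidden)  # previous hidden state in, current one out
    action_logits = action_head(out[:, -1])  # current frame action output
```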
9. A method for training an AI model, comprising:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information and multi-frame 3D map information of an agent;
outputting multi-frame fusion state vector information corresponding to the agent through a temporal feature extraction module of an AI model to be trained, based on the multi-frame state information and the multi-frame 3D map information;
constructing a loss function according to the multi-frame fusion state vector information;
performing multi-step iteration on the loss function to train and update the AI model;
wherein the constructing of the loss function according to the multi-frame fusion state vector information comprises:
acquiring multi-frame action output information and value function output values corresponding to the multi-frame action output information according to the multi-frame fusion state vector information; and
constructing the loss function according to the multi-frame action output information and the value function output values corresponding to the multi-frame action output information.
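Claim 9 leaves open how the action outputs and value-function outputs combine into a loss. One plausible actor-critic-style construction, with the advantage estimate and the coefficient as assumptions:

```python
import torch.nn.functional as F

# Sketch of claim 9's loss: a policy term built from the per-frame
# action outputs and a value term built from the value-function
# outputs. Advantage estimate and value_coef are assumptions.
def build_loss(action_logits, actions, values, returns, value_coef=0.5):
    advantages = (returns - values).detach()     # gain over the value prediction
    log_probs = F.log_softmax(action_logits, dim=-1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    policy_loss = -(chosen * advantages).mean()  # from the action output information
    value_loss = F.mse_loss(values, returns)     # from the value function outputs
    return policy_loss + value_coef * value_loss
```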
10. The method of claim 9, wherein the temporal feature extraction module comprises an LSTM module, the sample data set further includes hidden state information corresponding to the LSTM module, and the outputting, by the temporal feature extraction module of the AI model to be trained, multi-frame fusion state vector information corresponding to the agent based on the multi-frame state information and the multi-frame 3D map information comprises:
outputting, through the LSTM module, the multi-frame fusion state vector information based on the hidden state information, the multi-frame state information, and the multi-frame 3D map information.
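As a reading of claim 10, during training the LSTM can be unrolled over the multi-frame sample starting from the hidden state stored in the sample data set rather than from zeros. The shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of claim 10: unroll the LSTM over a multi-frame sample with
# the stored hidden state as its initial state. Shapes are assumed.
lstm = nn.LSTM(input_size=128, hidden_size=128, batch_first=True)
frames = torch.randn(1, 16, 128)                  # 16 frames of fused features
h0 = torch.zeros(1, 1, 128)                       # hidden state stored with the sample
c0 = torch.zeros(1, 1, 128)
fusion_state_vectors, _ = lstm(frames, (h0, c0))  # multi-frame fusion state vectors
```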
11. A server, characterized in that the server comprises a processor, a memory, and a computer program stored in the memory and executable by the processor, the memory further storing an AI model, wherein the computer program, when executed by the processor, implements the AI model-based agent decision-making method according to any one of claims 1 to 8, or implements the AI model training method according to claim 9 or 10.
12. A computer-readable storage medium, characterized in that a computer program is stored thereon, wherein the computer program, when executed by a processor, causes the processor to implement the AI model-based agent decision-making method according to any one of claims 1 to 8, or the AI model training method according to claim 9 or 10.
CN202010492473.3A 2020-06-03 2020-06-03 Agent decision making method, AI model training method, server and medium Active CN111401557B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010492473.3A CN111401557B (en) 2020-06-03 2020-06-03 Agent decision making method, AI model training method, server and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010492473.3A CN111401557B (en) 2020-06-03 2020-06-03 Agent decision making method, AI model training method, server and medium

Publications (2)

Publication Number Publication Date
CN111401557A (en) 2020-07-10
CN111401557B (en) 2020-09-18

Family

ID=71435720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010492473.3A Active CN111401557B (en) 2020-06-03 2020-06-03 Agent decision making method, AI model training method, server and medium

Country Status (1)

Country Link
CN (1) CN111401557B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738372B (en) * 2020-08-26 2020-11-17 中国科学院自动化研究所 Distributed multi-agent space-time feature extraction method and behavior decision method
CN112494949B (en) * 2020-11-20 2023-10-31 超参数科技(深圳)有限公司 Intelligent body action policy making method, server and storage medium
CN112295232B (en) * 2020-11-23 2024-01-23 超参数科技(深圳)有限公司 Navigation decision making method, AI model training method, server and medium
WO2023206532A1 (en) * 2022-04-29 2023-11-02 Oppo广东移动通信有限公司 Prediction method and apparatus, electronic device and computer-readable storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102146398B1 (en) * 2015-07-14 2020-08-20 삼성전자주식회사 Three dimensional content producing apparatus and three dimensional content producing method thereof
WO2019075276A1 (en) * 2017-10-11 2019-04-18 Aquifi, Inc. Systems and methods for object identification
CN108427989B (en) * 2018-06-12 2019-10-11 中国人民解放军国防科技大学 Deep space-time prediction neural network training method for radar echo extrapolation
CN109464803B (en) * 2018-11-05 2022-03-04 腾讯科技(深圳)有限公司 Virtual object control method, virtual object control device, model training device, storage medium and equipment
CN109711529B (en) * 2018-11-13 2022-11-08 中山大学 Cross-domain federated learning model and method based on value iterative network
CN110827320B (en) * 2019-09-17 2022-05-20 北京邮电大学 Target tracking method and device based on time sequence prediction

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737097A (en) * 2012-03-30 2012-10-17 北京峰盛博远科技有限公司 Three-dimensional vector real-time dynamic stacking technique based on LOD (Level of Detail) transparent textures
CN107679618A (en) * 2017-07-28 2018-02-09 北京深鉴科技有限公司 A kind of static policies fixed point training method and device
CN109241291A (en) * 2018-07-18 2019-01-18 华南师范大学 Knowledge mapping optimal path inquiry system and method based on deeply study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Survey of Research on Model-Based Reinforcement Learning; Zhao Tingting et al.; Journal of Frontiers of Computer Science and Technology; 2020-04-01; pp. 918-927 *

Also Published As

Publication number Publication date
CN111401557A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401557B (en) Agent decision making method, AI model training method, server and medium
Seff et al. Continual learning in generative adversarial nets
CN108073981B (en) Method and apparatus for processing convolutional neural network
CN110728317A (en) Training method and system of decision tree model, storage medium and prediction method
CN111295675B (en) Apparatus and method for processing convolution operations using kernels
US20200327409A1 (en) Method and device for hierarchical learning of neural network, based on weakly supervised learning
CN112199190B (en) Memory allocation method and device, storage medium and electronic equipment
CN110781893B (en) Feature map processing method, image processing method, device and storage medium
CN114915630B (en) Task allocation method, network training method and device based on Internet of Things equipment
CN111325664B (en) Style migration method and device, storage medium and electronic equipment
CN112163601B (en) Image classification method, system, computer device and storage medium
CN110570035B (en) People flow prediction system for simultaneously modeling space-time dependency and daily flow dependency
CN111589157B (en) AI model using method, apparatus and storage medium
CN110132282A (en) Unmanned plane paths planning method and device
CN112597217B (en) Intelligent decision platform driven by historical decision data and implementation method thereof
CN113177470B (en) Pedestrian trajectory prediction method, device, equipment and storage medium
CN111967271A (en) Analysis result generation method, device, equipment and readable storage medium
CN113625753A (en) Method for guiding neural network to learn maneuvering flight of unmanned aerial vehicle by expert rules
CN114445684A (en) Method, device and equipment for training lane line segmentation model and storage medium
CN112200310B (en) Intelligent processor, data processing method and storage medium
CN113744280A (en) Image processing method, apparatus, device and medium
CN115937801A (en) Vehicle track prediction method and device based on graph convolution
CN115168722A (en) Content interaction prediction method and related equipment
CN114692022A (en) Position prediction method and system based on space-time behavior mode
CN113134238A (en) Level setting method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant