CN112295232B - Navigation decision making method, AI model training method, server and medium - Google Patents

Navigation decision making method, AI model training method, server and medium

Info

Publication number
CN112295232B
Authority
CN
China
Prior art keywords: feature, information, target area, frame, current frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011325706.7A
Other languages
Chinese (zh)
Other versions
CN112295232A (en)
Inventor
张弛
武建芳
杨木
郭仁杰
王宇舟
杨正云
李宏亮
刘永升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Super Parameter Technology Shenzhen Co ltd
Priority to CN202011325706.7A
Publication of CN112295232A
Application granted
Publication of CN112295232B

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F 13/67 Generating or modifying game content adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/837 Shooting of targets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/003 Navigation within 3D models or images
    • A63F 2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F 2300/60 Methods for processing data by generating or executing the game program
    • A63F 2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
    • A63F 2300/80 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game, specially adapted for executing a specific type of game
    • A63F 2300/8076 Shooting

Abstract

The application discloses a navigation decision making method, an AI model training method, a server and a medium. The method comprises: acquiring current frame state information and current frame target area information of an agent in a 3D virtual environment; outputting, through an AI model, current frame action output information and current frame target area selection information corresponding to the agent based on the current frame state information and the current frame target area information; controlling interaction between the agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information, so as to acquire next frame state information and next frame target area information of the agent; and acquiring next frame action output information and next frame target area selection information of the agent according to the next frame state information and the next frame target area information. The method provided by the application enables an agent to reliably and efficiently make correct navigation decisions on a large-scale map in 3D space.

Description

Navigation decision making method, AI model training method, server and medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a navigation decision-making method, an AI model training method, a server and a medium.
Background
With the rapid development of artificial intelligence (AI) technology, AI is now widely applied in fields such as 3D games, virtual traffic, autonomous driving simulation, and robot trajectory planning, and AI simulation in 3D virtual space has great commercial value.
At present, in some AI simulations of 3D virtual spaces, an agent needs to collect various resources in the 3D virtual space and fight other agents within a continuously shrinking safe area. The agent must therefore make correct navigation decisions in different environments during the simulation, transferring and exploring with a relatively safe area as its target point, so that it survives to the end.
In traditional navigation strategy formulation methods, because the maps of most 3D virtual spaces are small, an agent only needs to learn small-range obstacle avoidance in the 3D virtual space before reaching a target point, so such short-term decisions are relatively easy to learn with traditional reinforcement learning. Consequently, traditional navigation decision making methods can only handle scenes in which the environment information changes simply, and are not suitable for navigating large-scale maps in 3D space.
Therefore, how an agent can reliably and efficiently make correct navigation decisions on a large-scale map in 3D space is a problem that currently needs to be solved.
Disclosure of Invention
The embodiments of the present application provide a navigation decision making method, an AI model training method, a server and a medium, which enable an agent to reliably and efficiently make correct navigation decisions on a large-scale map in 3D space.
In a first aspect, an embodiment of the present application provides a navigation decision-making method, including:
acquiring current frame state information and current frame target area information of an agent in a 3D virtual environment;
outputting current frame action output information and current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through an AI model;
controlling interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information so as to acquire next frame state information and next frame target area information of the intelligent agent;
and acquiring next frame action output information and next frame target area selection information of the intelligent agent according to the next frame state information and the next frame target area information.
In a second aspect, an embodiment of the present application further provides a training method of an AI model, including:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information, multi-frame target area information and multi-frame target area reward information of an intelligent agent;
outputting multi-frame fusion state vector information and multi-frame target region selection information corresponding to the intelligent agent based on the multi-frame state information and the multi-frame target region information through a preset AI model;
constructing a loss function according to the multi-frame fusion state vector information, the multi-frame target area selection information and the multi-frame target area reward information;
and performing multi-step iteration on the loss function to train and update the preset AI model.
In a third aspect, embodiments of the present application further provide a server, where the server includes a processor and a memory; the memory stores a computer program and an AI model that can be called and executed by the processor, and the computer program, when executed by the processor, implements the navigation decision making method described above or, alternatively, the AI model training method described above.
In a fourth aspect, embodiments of the present application further provide a computer readable storage medium for storing a computer program which, when executed by a processor, causes the processor to implement the navigation decision making method described above or, alternatively, the AI model training method described above.
The embodiments of the present application provide a navigation decision making method, an AI model training method, a server and a computer readable storage medium. The navigation decision making method acquires current frame state information and current frame target area information of an agent in a 3D virtual environment; outputs, through an AI model, current frame action output information and current frame target area selection information corresponding to the agent based on the current frame state information and the current frame target area information; controls interaction between the agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information, so as to acquire next frame state information and next frame target area information of the agent; and acquires next frame action output information and next frame target area selection information of the agent according to the next frame state information and the next frame target area information, so that the agent can reliably and efficiently make correct navigation decisions on a large-scale map in 3D space.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of steps of a navigation decision making method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an AI model-based agent action output as provided by an embodiment of the application;
FIG. 3 is a flowchart of steps for AI-model-based action output according to an embodiment of the application;
FIG. 4 is a schematic diagram of feature extraction according to status information according to an embodiment of the present application;
FIG. 5 is a schematic diagram of feature extraction according to target region information according to an embodiment of the present application;
FIG. 6 is a schematic diagram of first-class target area feature extraction according to an embodiment of the present application;
FIG. 7 is a schematic diagram of second-class target area feature extraction according to an embodiment of the present application;
FIG. 8 is a flowchart of the steps for AI model training provided in one embodiment of the application;
FIG. 9 is a schematic diagram of AI model training provided in one embodiment of the application;
fig. 10 is a schematic block diagram of a server provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
At present, in some AI simulations of 3D virtual spaces, an agent needs to collect various resources in the 3D virtual space and fight other agents within a continuously shrinking safe area. The agent must therefore make correct navigation decisions in different environments during the simulation, transferring and exploring with a relatively safe area as its target point, so that it survives to the end.
In traditional navigation strategy formulation methods, because most 3D space maps are small, an agent only needs to learn small-range obstacle avoidance in the 3D space before reaching a target point, so such short-term decisions are relatively easy to learn with traditional reinforcement learning. However, when the agent needs to make large-scale transfers and exploration on a large map, the time span often reaches hundreds or even thousands of frames, and besides short-term decisions, the agent is also required to make an overall decision about the direction toward the target point. Traditional navigation decision methods typically need to traverse all paths that can reach the target point and select the shortest obstacle-avoiding path as the navigation path, and this kind of navigation cannot be completed in time on a large map.
Consequently, traditional navigation decision making methods can only handle scenes in which the environment information changes simply; they are not suitable for navigating large-scale maps in 3D space, where an agent can hardly make accurate navigation decisions.
Therefore, how an agent can reliably and efficiently make correct navigation decisions on a large-scale map in 3D space is a problem that currently needs to be solved.
To solve the above-mentioned problems, embodiments of the present application provide a navigation decision-making method, an AI model training method, a server, and a computer-readable storage medium for an agent to reliably and efficiently make a correct navigation decision in a large-scale map in a 3D space. The navigation decision making method and the AI model training method can be applied to a server, and the server can be a single server or a server cluster consisting of a plurality of servers.
Referring to fig. 1, fig. 1 is a flowchart of a navigation decision-making method according to an embodiment of the present application.
As shown in fig. 1, the navigation decision-making method specifically includes steps S101 to S104.
Step S101: and acquiring the state information of the current frame of the intelligent agent in the 3D virtual environment and the target area information of the current frame.
Illustratively, in application scenarios such as artificial intelligence (AI), robot simulation in a 3D virtual environment, robotic arms, autonomous driving, virtual traffic simulation, or game AI in 3D games, in order to realize fast and efficient simulation and make correct navigation decisions for an agent in the 3D virtual environment, the current frame state information and current frame target area information of the agent in the 3D virtual environment are acquired. An agent is an entity that is situated in a complex dynamic environment, autonomously perceives environment information, autonomously takes action, and accomplishes a series of preset goals or tasks. The state information of an agent includes, but is not limited to, its location information, movement information, combat capability information, and information about the 3D environment in which it is currently located.
The target area information represents the area containing the target point toward which the agent is to move; the target point is the position coordinate point the agent selects to move to next, based on an adaptive judgment of the environment information corresponding to the 3D virtual environment. This environment information includes, but is not limited to, the positions of continuously moving teammates and enemies in a 3D game, the locations of supply points, and the like.
For example, taking a 3D FPS (3D First Person Shooter) game as the application background, the target area information is obtained when the agent adaptively judges the environment information of the 3D virtual environment it is currently in and selects a relatively safe area as the area containing the next position coordinate point to move to; the target area can be anywhere, such as a mountain area, a house, or even across a lake, and is not limited here.
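For concreteness, the per-frame inputs described above might be bundled as in the following sketch; the field names and shapes are illustrative assumptions, not a schema disclosed by the application.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameObservation:
    """Hypothetical per-frame input bundle; all field names are assumptions."""
    agent_position: np.ndarray     # 3D coordinates of the agent
    movement: np.ndarray           # movement information (e.g. velocity)
    combat_stats: np.ndarray       # combat capability information
    depth_map: np.ndarray          # depth map of the surrounding 3D environment
    tangent_plane_map: np.ndarray  # tangent plane map of the environment
    target_center: np.ndarray      # 3D coordinates of the current target point
    target_radius: float           # radius of the current target area
```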
Step S102: and outputting current frame action output information and current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through an AI model.
Referring to fig. 2, in this embodiment the AI model is provided with a corresponding fully connected neural network and a time sequence feature extraction module, where the time sequence feature extraction module includes, but is not limited to, an LSTM (Long Short-Term Memory) module, a GRU (Gated Recurrent Unit) module, a Transformer module, and the like.
By calling the AI model, the current frame state information and current frame target area information of the agent in the 3D environment are fed as input into the fully connected neural network of the AI model, yielding the fusion characteristic information that serves as input to the time sequence feature extraction module, together with the current frame target area selection information corresponding to the agent.
After the fusion characteristic information is acquired, it is used as the input of the time sequence feature extraction module of the AI model, which outputs the current frame action output information corresponding to the agent.
In this way, the information about the target area relative to the agent's position is expressed effectively and fused with the agent's current state information in the network model, ensuring that the agent can make long-distance navigation decisions from real-time state information alone, and that the navigation paths exhibit generalization, diversity and robustness.
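The application does not disclose concrete layer sizes or head dimensions; purely as an illustration, a minimal PyTorch sketch of this arrangement (convolutional encoders, CONCAT plus fully connected fusion, an LSTM time sequence module, and separate action and target area heads, with every dimension assumed) might look like:

```python
import torch
import torch.nn as nn

class NavigationModel(nn.Module):
    """Sketch only: layer sizes, action count and area count are assumptions."""
    def __init__(self, map_channels=2, hidden=256, n_actions=12, n_areas=16):
        super().__init__()
        # 4-layer convolutional encoder for depth map + tangent plane map
        self.map_encoder = nn.Sequential(
            nn.Conv2d(map_channels, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fuse = nn.LazyLinear(hidden)        # CONCAT -> fully connected fusion
        self.area_head = nn.Linear(hidden, n_areas)   # target area selection
        self.lstm = nn.LSTMCell(hidden, hidden)  # time sequence feature extraction
        self.action_head = nn.Linear(hidden, n_actions)

    def forward(self, pos_feat, area_feat, maps, hc):
        m = self.map_encoder(maps)
        x = torch.relu(self.fuse(torch.cat([pos_feat, area_feat, m], dim=-1)))
        area_logits = self.area_head(x)          # read from the fusion network
        h, c = self.lstm(x, hc)                  # h fuses multi-frame state
        return self.action_head(h), area_logits, (h, c)
```

Here the target area head reads the fusion network's output directly while the action head reads the LSTM output, matching the split between the two outputs described above.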
Referring to fig. 3, in some embodiments, outputting, by an AI model, current frame action output information and current frame target region selection information corresponding to the agent based on the current frame state information and the current frame target region information includes:
Step S1021: and acquiring the current frame state characteristics of the intelligent agent according to the current frame state information, wherein the current frame state characteristics comprise the position characteristics of the current frame of the intelligent agent, the depth map characteristics and the tangent plane map characteristics of the 3D virtual environment where the current frame of the intelligent agent is positioned, and the position characteristics are used for representing the position relationship between the intelligent agent and the intelligent agent of the enemy.
As shown in fig. 4, illustratively, corresponding state features are extracted according to state information of an agent, so as to take the obtained corresponding state features as input of an AI model, where the state features include a position feature of the agent, a depth map feature and a section map feature of a 3D virtual environment in which the agent is located, and the position feature of the agent is used for characterizing a position relationship between the agent and at least one of an enemy agent and an friendly agent, and includes at least the position feature of the enemy agent and the position feature of the friendly agent.
Step S1022: and acquiring target area characteristics according to the target area information of the current frame.
The AI model extracts the corresponding target area features from the current frame target area information.
As shown in fig. 5 to 7, in some embodiments the target area features include first-class target area features used to characterize whether the agent has reached the target area, and second-class target area features used to characterize the positional relationship between the agent and the target area while the agent travels to it. In this case, merging the position features, depth map features, tangent plane map features and target area features and inputting them into the fully connected neural network of the AI model, so as to obtain the corresponding fusion characteristic information and the current frame target area selection information corresponding to the agent, includes:
merging the position features, the depth map features, the tangent plane map features, the first-class target area features and the second-class target area features and inputting them into the fully connected neural network of the AI model to obtain the corresponding fusion characteristic information and the current frame target area selection information corresponding to the agent.
In some embodiments, the first-class target area features include a first feature for characterizing the target area size, a second feature for characterizing whether the agent is within the target area radius, a third feature for characterizing whether the agent has reached the target area, and a fourth feature for characterizing the agent's residence time in the target area.
In some embodiments, the second class of target area features includes a first location feature for characterizing a distance of the agent to a target area center point, a second location feature for characterizing a distance of the agent to a target area edge, a third location feature for characterizing a vector position difference of the agent to a target area center point, and a fourth location feature for characterizing yaw angle information of the agent to a target area center point.
Specifically, whether the agent is within the radius of the target point area can be expressed as a feature by calculating the distance between the agent's 3D coordinates and the target point's 3D coordinates: if the distance is smaller than the target area radius the feature is set to 1, otherwise to 0. Whether the agent has reached the target area can be expressed as a feature that is 0 while the agent has not reached the target area and 1 after the first arrival. The agent's residence time in the target area can be expressed as a feature by recording the number of frames since the agent first reached the target area.
The distance from the agent to the target area center can be obtained by calculating the distance between the agent's 3D coordinates and the target point's 3D coordinates. The vector position difference from the agent to the target area center point can be obtained by calculating the difference between the agent's 3D coordinates and the target point's 3D coordinates. The yaw angle (yaw) information of the agent relative to the target area center point can be obtained by calculating the difference between the yaw of the line connecting the target area center point and the agent's coordinates and the agent's own yaw, together with the sine and cosine of that difference.
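As a concrete illustration of the feature computations just described, a small helper might look like the following sketch; the function name, argument layout, axis convention and frame counter are assumptions, not details disclosed by the application.

```python
import numpy as np

def target_area_features(agent_pos, agent_yaw, center, radius,
                         frames_since_arrival):
    """Sketch of the first- and second-class target area features.
    agent_pos and center are 3D coordinates; agent_yaw is in radians."""
    diff = center - agent_pos                  # vector position difference
    dist_center = float(np.linalg.norm(diff))  # distance to the area center
    dist_edge = dist_center - radius           # distance to the area edge
    in_radius = 1.0 if dist_center < radius else 0.0
    reached = 1.0 if frames_since_arrival > 0 else 0.0
    # yaw of the agent-to-center line minus the agent's own yaw
    yaw_diff = np.arctan2(diff[1], diff[0]) - agent_yaw
    return np.array([radius, in_radius, reached, frames_since_arrival,
                     dist_center, dist_edge, *diff,
                     yaw_diff, np.sin(yaw_diff), np.cos(yaw_diff)],
                    dtype=np.float32)
```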
Illustratively, a 3DFPS (3D First Person Shooter) game is described as an application context.
The agent selects a target area of preset size as the safe area according to the 3D virtual environment information and moves toward it. While the agent moves toward the target area, the change in distance between the agent and the center point or edge of the safe area is recorded, so that the agent moves toward the safe area according to this distance change.
Meanwhile, the size of the movement yaw angle indicates whether the current movement direction is the optimal one, and the agent can select corresponding actions according to the yaw angle so as to reach the target area efficiently.
By monitoring whether the agent has reached the safe area serving as the target area, the agent can be controlled to switch from the run-to-target-point behavior mode back to the normal supply-pickup and combat behavior mode after arrival, so that the agent can freely execute small-range collection and combat strategies once it reaches the target area.
Step S1023: and merging the position feature, the depth map feature, the section map feature and the target area feature and inputting the merged position feature, the depth map feature, the section map feature and the target area feature into a fully-connected neural network of the AI model to obtain corresponding merged feature information and current frame target area selection information corresponding to the intelligent agent.
Wherein the target area selection information of the current frame may be the same as or different from the target area selection information of the previous adjacent frame. When the intelligent body reaches the target area for a preset time, a new target area is selected, namely the output target area selection information of the current frame is different from the target area selection information of the previous frame, or the currently selected target area cannot be close, and when the intelligent body needs to avoid, the target area is selected to be updated, namely the output target area selection information of the current frame is different from the target area selection information of the previous frame.
As shown in fig. 2, feature extraction is first performed on the agent's current frame state information and current frame target area information to obtain the corresponding position features, depth map features, tangent plane map features and target area features; these are input into the fully connected neural network of the AI model for CONCAT fusion, yielding the corresponding fusion characteristic information and the current frame target area selection information corresponding to the agent.
In this embodiment, target area selection can be regarded as part of the action output: like action outputs in ordinary reinforcement learning, it is a decision the agent makes from the current state information. Unlike the usual reinforcement learning setting, however, the agent does not choose to update the target area every frame, i.e. the decision frequency for selecting the target area is lower than that of the other action outputs. Only after staying in the target area for a preset time, which indicates that it has essentially finished searching the area for items, does the agent select a new target area; it also selects and updates the target area when the current one must be abandoned, for example when the poison circle floods the selected target area. In this way the agent can freely execute small-range collection and combat strategies after reaching the target area.
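A minimal sketch of this lower decision frequency, with the stay threshold and flag name assumed for illustration, is:

```python
def should_update_target_area(frames_in_area, area_must_be_avoided,
                              stay_threshold=300):
    """Re-select the target area only after the agent has searched the current
    one for a preset time, or when the area must be avoided (e.g. the poison
    circle floods it). The 300-frame threshold is an illustrative assumption."""
    return frames_in_area >= stay_threshold or area_must_be_avoided
```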
Step S1024: and inputting the fusion characteristic information into a time sequence characteristic extraction module of the AI model to obtain the current frame action output information corresponding to the intelligent agent.
And inputting the fusion characteristic information into a time sequence characteristic extraction module of the AI model for processing, so as to obtain the current frame action output information corresponding to the intelligent agent.
In some embodiments, the time sequence feature extraction module includes an LSTM module, and inputting the fusion characteristic information into the time sequence feature extraction module of the AI model to obtain the current frame action output information corresponding to the agent includes: acquiring the previous frame hidden state information corresponding to the LSTM module; outputting, through the LSTM module, the current frame hidden state information corresponding to the LSTM module based on the fusion characteristic information and the previous frame hidden state information; and acquiring the current frame action output information corresponding to the agent according to the current frame hidden state information.
In some embodiments, acquiring the current frame action output information corresponding to the agent according to the current frame hidden state information includes: acquiring the fusion state vector information corresponding to the agent according to the current frame hidden state information; and acquiring the current frame action output information according to the fusion state vector information.
Taking the time sequence feature extraction module as an LSTM module as an example, the LSTM module accepts the previous frame hidden state information and the current frame input features as input and outputs the corresponding current frame hidden state information, where the hidden state information includes hidden information (hidden state) and cell state information (cell state); the current frame hidden state information is then used as the input for the next frame.
As shown in fig. 2 and fig. 4, feature extraction is performed on the current frame state information and the current frame target area information to obtain the corresponding current frame position features, depth map features, tangent plane map features and target area features, which serve as inputs to the AI model. The AI model convolves and fuses the current frame target area features, position features, depth map features and tangent plane map features to obtain the current frame fusion characteristic information. The fusion characteristic information and the previous frame hidden state information are input into the LSTM module, which outputs the corresponding current frame hidden state information, and the current frame action output information corresponding to the agent is then obtained from the current frame hidden state information.
Specifically, a 4-layer convolution calculation is applied to the current frame target area features, position features, depth map features and tangent plane map features; the calculation results are fused and input into the fully connected neural network for processing to obtain the corresponding fusion characteristic information. The fusion characteristic information, the previous frame hidden information and the previous frame cell state information are then input into the LSTM module for processing, and the current frame action output information corresponding to the agent is output.
Three gates are designed in the LSTM module: a forget gate, an input gate and an output gate, each of which processes the input information differently. The module takes as input the previous frame hidden state information, comprising the previous frame hidden information h_{t-1} and the previous frame cell state information C_{t-1}, together with the fusion characteristic information x_t of the agent's current frame state information and current frame target area information, and outputs the current frame hidden information h_t and the current frame cell state information C_t.
The forget gate merges (CONCAT) the previous frame hidden information h_{t-1} with the fusion characteristic information x_t, passes the result through a forward network, and outputs the forget probability f_t (a value between 0 and 1) via a Sigmoid function:
    f_t = sigmoid(W_f · [h_{t-1}, x_t] + b_f)
The input gate likewise merges (CONCAT) h_{t-1} with x_t, passes the result through a forward network, and outputs the corresponding input probability i_t (a value between 0 and 1) via a Sigmoid function, while another forward network outputs the processing result C̃_t of the fusion characteristic information x_t via a Tanh function:
    i_t = sigmoid(W_i · [h_{t-1}, x_t] + b_i)
    C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)
The multiplication operation multiplies f_t with the previous frame cell state information C_{t-1} and multiplies i_t with C̃_t, and the sum of the two products updates the output current frame cell state information C_t:
    C_t = f_t · C_{t-1} + i_t · C̃_t
The output gate controls the output information of the LSTM unit. The output current frame hidden information h_t integrates the previous frame hidden information h_{t-1}, the previous frame cell state information C_{t-1} and the fusion characteristic information x_t: the output probability o_t of the fusion characteristic information x_t is computed via a Sigmoid function, and the current frame cell state information C_t is processed through a Tanh function and multiplied by o_t to obtain the current frame hidden information h_t:
    o_t = sigmoid(W_o · [h_{t-1}, x_t] + b_o)
    h_t = o_t · tanh(C_t)
Here the current frame hidden information h_t contains the fusion state vector information corresponding to the agent. The fusion state vector information, which incorporates multi-frame state information of the agent, is obtained from h_t in the output current frame hidden state information, and the current frame action output information of the agent is obtained from this fusion state vector information.
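The gate computations above can be written out directly. A self-contained numpy sketch of a single LSTM step follows; the weight dictionary layout is an assumption of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step implementing the forget, input and output gates above.
    Each W[k] maps the concatenated [h_prev, x_t] to one gate pre-activation."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(W["f"] @ z + b["f"])     # forget probability
    i_t = sigmoid(W["i"] @ z + b["i"])     # input probability
    c_cand = np.tanh(W["c"] @ z + b["c"])  # candidate cell state
    c_t = f_t * c_prev + i_t * c_cand      # current frame cell state
    o_t = sigmoid(W["o"] @ z + b["o"])     # output probability
    h_t = o_t * np.tanh(c_t)               # current frame hidden information
    return h_t, c_t
```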
S103: and controlling the interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information so as to acquire next frame state information and next frame target area information of the intelligent agent.
Based on the output current frame action output information and current frame target area selection information, the agent is controlled to execute the corresponding action output, so that the agent interacts with the 3D virtual environment and its state information and target area information are updated, yielding the next frame state information and next frame target area information of the agent.
In some embodiments, the interaction between the agent and the 3D virtual environment is controlled according to the current frame action output information and the current frame target area selection information so as to obtain the agent's next frame state information, next frame target area information and the corresponding target area reward information. The target area reward information is used to evaluate the quality of the selections and output actions the agent makes while interacting with the environment, and can be used as sample data for constructing a loss function to optimize the AI model.
The target area reward information includes first-class reward information for characterizing reward values generated from the agent's states and actions on the way to the target area, and second-class reward information for characterizing reward values generated from the agent's states and actions after it reaches the target area.
Specifically, the first-class reward information includes time reward information, distance reward information and yaw reward information. The time reward information characterizes the reward for the time the agent spends moving to the target area, i.e. the agent is penalized for the travel time, with a larger penalty the longer it takes; this reward drives the agent to reach the target area in a shorter time.
The distance reward information characterizes the reward for the distance the agent moves toward the target area in each frame: the more the agent closes the distance to the target area in a frame, the larger the reward. This reward drives the agent to approach the center of the target area as much as possible with every step forward.
The yaw angle reward information characterizes the reward for the offset angle of the movement direction while the agent moves to the target area; the larger the offset angle, the larger the penalty. This reward makes the agent head to the target area along as straight a path as possible, which is more human-like and efficient.
The second-class reward information includes first trade-off reward information and second trade-off reward information. The first trade-off reward information balances the rewards for moving to the target area against rewards for behaviors such as picking up supplies and fighting; for example, when an enemy agent is visible, no first-class rewards are given, ensuring that the agent attends only to combat-related rewards during a fight.
The second trade-off reward information balances the rewards for picking up supplies; for example, when supplies are visible, the first-class rewards are not set, ensuring that when picking up supplies the agent attends only to how to pick up the corresponding supplies quickly.
By setting these rewards on the agent's states and actions during its movement to the target area, the agent obtains the corresponding target area reward information while interacting with the 3D virtual environment, and this reward information is used as sample data for training and optimizing the AI model.
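Putting the first-class reward terms and the first trade-off together, a hedged sketch could look like this; all weights are assumptions, since the application does not disclose reward coefficients.

```python
def navigation_reward(dist_prev, dist_now, yaw_diff, enemy_visible,
                      w_time=0.01, w_dist=1.0, w_yaw=0.1):
    """Sketch of the first-class reward: time penalty, distance progress
    reward and yaw penalty. When an enemy is visible, the navigation reward
    is suppressed (the first trade-off), so only combat rewards apply."""
    if enemy_visible:
        return 0.0
    r_time = -w_time                          # per-frame time penalty
    r_dist = w_dist * (dist_prev - dist_now)  # reward for closing the distance
    r_yaw = -w_yaw * abs(yaw_diff)            # penalize a crooked path
    return r_time + r_dist + r_yaw
```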
S104: and acquiring next frame action output information and next frame target area selection information of the intelligent agent according to the next frame state information and the next frame target area information.
After the next frame state information and next frame target area information of the agent are obtained, following the operation of step S102, the next frame action output information and next frame target area selection information corresponding to the agent are output through the AI model based on the next frame state information and the next frame target area information. For the specific procedure, refer to the description of steps S102 to S104, which is not repeated here.
According to the navigation decision-making method based on the AI model, the current frame state information and the current frame target area information of the intelligent agent in the 3D virtual environment are obtained; outputting current frame action output information and current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through an AI model; controlling interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information so as to acquire next frame state information and next frame target area information of the intelligent agent; and acquiring next frame action output information and next frame target area selection information of the intelligent agent according to the next frame state information and the next frame target area information, so that the intelligent agent can reliably and efficiently make a correct navigation decision in a large-scale map in a 3D space.
The embodiment of the application also provides a training method of the AI model. The training method of the AI model can be applied to a server to realize reliable and efficient AI simulation by calling the trained AI model. The server may be a single server or a server cluster composed of a plurality of servers.
Referring to fig. 8, fig. 8 is a flowchart of a training method of an AI model according to an embodiment of the present application.
As shown in fig. 8, the training method of the AI model includes steps S201 to S204.
Step S201: and acquiring a sample data set, wherein the sample data set comprises multi-frame state information, multi-frame target area information and multi-frame target area rewarding information of the intelligent agent.
Illustratively, the sample data set for AI model training is stored in a redis (Remote Dictionary Server) database; the sample data set is used to train the AI model and includes, but is not limited to, the agent's multi-frame state information, multi-frame target area information and multi-frame target area reward information. The sample data set corresponding to AI model training is acquired by querying redis.
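As an illustration only, pulling a training batch from redis might look like the following sketch; the key name and the pickle serialization are assumptions, not details disclosed by the application.

```python
import pickle
import redis

r = redis.Redis(host="localhost", port=6379)  # connection details assumed

def fetch_batch(batch_size=64):
    """Pop up to batch_size pickled trajectory samples from a redis list.
    Each sample is assumed to hold multi-frame state information, target
    area information and target area reward information."""
    batch = []
    while len(batch) < batch_size:
        raw = r.lpop("sample_pool")  # hypothetical key written by the actors
        if raw is None:
            break                    # pool temporarily empty
        batch.append(pickle.loads(raw))
    return batch
```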
Step S202: and outputting multi-frame fusion state vector information and multi-frame target region selection information corresponding to the intelligent agent based on the multi-frame state information and the multi-frame target region information through a preset AI model.
As described in the navigation decision making method embodiments, the AI model is provided with a corresponding time sequence feature extraction module, where the time sequence feature extraction module includes, but is not limited to, an LSTM module, a GRU module, a Transformer module, and the like.
The sample data set further comprises hidden state information corresponding to the LSTM module, and outputs multi-frame fusion state vector information and multi-frame target region selection information corresponding to the intelligent agent based on the multi-frame state information and the multi-frame target region information through a preset AI model, and specifically comprises the following steps:
acquiring multi-frame state features of the agent according to the multi-frame state information, where the multi-frame state features include the agent's corresponding multi-frame position features and the multi-frame depth map features and multi-frame tangent plane map features corresponding to the 3D virtual environment in which the agent is located;
acquiring multi-frame target area characteristics according to the multi-frame target area information;
combining and inputting the multi-frame position feature, the multi-frame depth map feature, the multi-frame tangent plane map feature and the multi-frame target area feature into a fully-connected neural network of the preset AI model to obtain corresponding multi-frame fusion feature information and multi-frame target area selection information corresponding to the intelligent agent;
And outputting the multi-frame fusion state vector information based on the hidden state information and the multi-frame fusion characteristic information through the LSTM module.
As shown in fig. 9, for example, taking the time sequence feature extraction module as an LSTM module, the multi-frame state information and the multi-frame target area information are respectively subjected to multi-layer convolution calculation to obtain a position feature corresponding to each frame state information, a depth map feature corresponding to the 3D virtual environment where the intelligent agent is located, a tangent plane map feature, and a target area feature corresponding to each frame target area information.
And merging and inputting the multi-frame position features, the multi-frame depth map features, the multi-frame tangent plane map features and the multi-frame target region features into a fully-connected neural network of a preset AI model to obtain corresponding multi-frame fusion feature information and multi-frame target region selection information corresponding to the intelligent agent.
The previous frame fusion characteristic information, the previous frame hidden information h_{t-1} and the previous frame cell state information C_{t-1} are input into the LSTM module for processing, which outputs the current frame hidden information h_t, the current frame cell state information C_t and the corresponding fusion state vector information.
The current frame fusion characteristic information, the current frame hidden information h_t and the current frame cell state information C_t are then input into the LSTM module for processing, which outputs the next frame hidden information h_{t+1}, the next frame cell state information C_{t+1} and the corresponding fusion state vector information; proceeding in this way yields the multi-frame fusion state vector information.
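In code, this frame-by-frame unrolling simply threads the hidden state through the LSTM cell. A short sketch against the model sketched earlier (function and argument names assumed) is:

```python
import torch

def unroll(model, fused_frames, h0, c0):
    """Feed per-frame fusion characteristic information through the LSTM,
    carrying (h, c) from frame to frame, and collect the per-frame fusion
    state vectors. fused_frames is a list of fused feature tensors."""
    h, c = h0, c0
    fused_states = []
    for x_t in fused_frames:
        h, c = model.lstm(x_t, (h, c))  # h_t, C_t become the next frame's input
        fused_states.append(h)
    return torch.stack(fused_states), (h, c)
```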
The target area features comprise first type target area features used for representing whether the intelligent agent reaches a target area and second type target area features used for representing the position relation between the intelligent agent and the target area in the process of reaching the target area.
The first class of target area features includes a first feature for characterizing a size of a target area, a second feature for characterizing whether the agent is within a radius of the target area, a third feature for characterizing whether the agent reaches the target area, and a fourth feature for characterizing a residence time of the agent in the target area.
The second class of target area features includes a first location feature for characterizing a distance of the agent to a target area center point, a second location feature for characterizing a distance of the agent to a target area edge, a third location feature for characterizing a vector position difference of the agent to a target area center point, and a fourth location feature for characterizing yaw angle information of the agent to a target area center point.
Step S203: and constructing a loss function according to the multi-frame fusion state vector information, the multi-frame target area selection information and the multi-frame target area rewarding information.
The loss function includes a cost function loss (value loss), a policy gradient loss and an information entropy loss (entropy loss), among others.
In some embodiments, for the multi-frame fusion state vector information, the action output information corresponding to each frame's fusion state vector information and the cost function output value corresponding to that action output information are obtained based on each frame's fusion state vector information. The cost function output value is used to evaluate the action output information: if it is high, the related action instructions of the corresponding action output information can be executed; if it is low, they are not executed.
When the agent interacts with the 3D virtual environment according to the current frame action output information and the current frame target area selection information, the agent's next frame state information, next frame target area information and the corresponding target area reward information can be acquired.
The agent's multi-frame state information, multi-frame target area information and multi-frame target area reward information are stored in the sample data, and the corresponding loss function is constructed based on the obtained multi-frame action output information, the cost function output values corresponding to the multi-frame action output information, the multi-frame target area selection information and the multi-frame target area reward information.
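The application names the three loss terms but not the exact reinforcement learning algorithm; an actor-critic style combination, with all weights assumed, might be sketched as:

```python
import torch
import torch.nn.functional as F

def build_loss(values, returns, log_probs, advantages, entropy,
               w_value=0.5, w_entropy=0.01):
    """Sketch combining the three terms named above: value loss, policy
    gradient loss and entropy loss. Weights are illustrative assumptions."""
    value_loss = F.mse_loss(values, returns)                 # cost function loss
    policy_loss = -(log_probs * advantages.detach()).mean()  # policy gradient loss
    entropy_loss = -entropy.mean()                           # entropy loss
    return policy_loss + w_value * value_loss + w_entropy * entropy_loss
```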
Step S204: and performing multi-step iteration on the loss function to train and update the preset AI model.
Optionally, as shown in fig. 9, the loss function is sent to the GPU (Graphics Processing Unit) for multi-step iterative optimization to obtain the iterated AI model parameters, where the AI model parameters include, but are not limited to, the parameters of the time sequence feature extraction module, the parameters of the cost function, and the like. The AI model is updated based on the iterated AI model parameters, thereby completing the training and updating of the AI model.
Meanwhile, the state information, target area selection information, target area reward information and other information generated by the agent's continuous interaction with the 3D virtual environment are stored in a data storage system such as redis and used as data in the sample data set for iterative training of the AI model.
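Tying the pieces together, a hedged sketch of the multi-step iteration follows; the optimizer, learning rate and iteration count are assumptions, and compute_loss_from_batch is a hypothetical helper standing in for unrolling the model and building the loss as sketched above.

```python
import torch

model = NavigationModel().cuda()  # model sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(10_000):                        # multi-step iteration
    batch = fetch_batch()                         # samples pulled from redis
    if not batch:
        continue
    loss = compute_loss_from_batch(model, batch)  # hypothetical helper
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                              # update the preset AI model
```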
Referring to fig. 10, fig. 10 is a schematic block diagram of a server according to an embodiment of the present application.
As shown in fig. 10, the server 30 may include a processor 301, a memory 302, and a network interface 303. The processor 301, memory 302, and network interface 303 are connected by a system bus, such as an I2C (Inter-integrated Circuit) bus.
Specifically, the processor 301 may be a Micro-controller Unit (MCU), a central processing Unit (Central Processing Unit, CPU), a digital signal processor (Digital Signal Processor, DSP), or the like.
Specifically, the memory 302 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash disk, a removable hard disk, or the like.
The network interface 303 is used for network communication such as transmission of assigned tasks and the like. It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the server to which the present application is applied, and that a particular server may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Wherein the processor 301 is configured to run a computer program stored in the memory 302 and to implement the following steps when executing the computer program:
acquiring current frame state information and current frame target area information of an agent in a 3D virtual environment;
outputting current frame action output information and current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through an AI model;
Controlling interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information so as to acquire next frame state information and next frame target area information of the intelligent agent;
and acquiring next frame action output information and next frame target area selection information of the intelligent agent according to the next frame state information and the next frame target area information.
In some embodiments, when outputting the current frame action output information and the current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through the AI model, the processor 301 specifically performs:
acquiring current frame state features of the intelligent agent according to the current frame state information, wherein the current frame state features comprise a position feature of the current frame of the intelligent agent and a depth map feature and a section map feature of the 3D virtual environment where the intelligent agent is located in the current frame, the position feature being used for representing the positional relationship between the intelligent agent and an enemy agent;
acquiring target area features according to the current frame target area information;
concatenating the position feature, the depth map feature, the section map feature, and the target area features, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent;
inputting the fusion feature information into the timing feature extraction module of the AI model to obtain the current frame action output information corresponding to the intelligent agent.
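As an illustration of this forward pass, the following PyTorch sketch wires the described components together; all layer sizes, feature dimensions, and head names are assumptions made for the sketch, not the patented architecture.

```python
import torch
import torch.nn as nn

class NavModelSketch(nn.Module):
    """Sketch of the described forward pass; sizes and names are assumptions."""

    def __init__(self, pos_dim=8, depth_dim=64, section_dim=64, region_dim=10,
                 hidden_dim=256, num_regions=8, num_actions=16):
        super().__init__()
        in_dim = pos_dim + depth_dim + section_dim + region_dim
        # Fully connected network that fuses the concatenated features.
        self.fusion_fc = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        # Head producing the current frame target area selection information.
        self.region_head = nn.Linear(hidden_dim, num_regions)
        # Timing feature extraction module (an LSTM, per the embodiments).
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        # Head producing the current frame action output information.
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, pos_feat, depth_feat, section_feat, region_feat, hidden=None):
        merged = torch.cat([pos_feat, depth_feat, section_feat, region_feat], dim=-1)
        fused = self.fusion_fc(merged)                       # fusion feature info
        region_logits = self.region_head(fused)             # target area selection
        out, hidden = self.lstm(fused.unsqueeze(1), hidden)  # one frame LSTM step
        action_logits = self.action_head(out.squeeze(1))     # action output info
        return action_logits, region_logits, hidden
```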
In some embodiments, the target area features include a first class of target area features used for representing whether the intelligent agent has reached a target area, and a second class of target area features used for representing the positional relationship between the intelligent agent and the target area while it travels to the target area; in this case, the step of concatenating the position feature, the depth map feature, the section map feature, and the target area features and inputting them into the fully connected neural network of the AI model to obtain the corresponding fusion feature information and current frame target area selection information comprises:
concatenating the position feature, the depth map feature, the section map feature, the first class of target area features, and the second class of target area features, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
In some embodiments, the first class of target area features includes a first feature representing the size of the target area, a second feature representing whether the intelligent agent is within the target area radius, a third feature representing whether the intelligent agent has reached the target area, and a fourth feature representing the residence time of the intelligent agent in the target area; in this case, the concatenating and inputting step comprises:
concatenating the position feature, the depth map feature, the section map feature, the first feature, the second feature, the third feature, the fourth feature, and the second class of target area features, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
In some embodiments, the second class of target area features includes a first location feature representing the distance from the intelligent agent to the target area center point, a second location feature representing the distance from the intelligent agent to the target area edge, a third location feature representing the vector position difference between the intelligent agent and the target area center point, and a fourth location feature representing the yaw angle from the intelligent agent to the target area center point; in this case, the concatenating and inputting step comprises:
concatenating the position feature, the depth map feature, the section map feature, the first feature, the second feature, the third feature, the fourth feature, the first location feature, the second location feature, the third location feature, and the fourth location feature, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
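The following NumPy sketch shows one plausible way to assemble both classes of target area features from raw quantities; the names, normalization, and exact layout are assumptions.

```python
import numpy as np

def target_area_features(agent_pos, area_center, area_radius, reached, dwell_time):
    """Illustrative assembly of the two classes of target area features."""
    delta = area_center - agent_pos                 # vector position difference
    dist_center = np.linalg.norm(delta)             # distance to the center point
    dist_edge = dist_center - area_radius           # distance to the area edge
    yaw = np.arctan2(delta[1], delta[0])            # yaw angle to the center point
    first_class = np.array([
        area_radius,                                # first feature: area size
        float(dist_center <= area_radius),          # second: within the radius?
        float(reached),                             # third: reached the area?
        dwell_time,                                 # fourth: residence time
    ])
    second_class = np.concatenate(([dist_center, dist_edge], delta, [yaw]))
    return first_class, second_class
```

In the embodiments above, both vectors would simply be concatenated with the position, depth map, and section map features before the fully connected network.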
In some embodiments, the timing feature extraction module includes an LSTM module, and when the processor 301 inputs the fusion feature information into the timing feature extraction module of the AI model to obtain the current frame action output information corresponding to the intelligent agent, the method specifically includes:
acquiring the previous frame hidden state information corresponding to the LSTM module;
outputting, through the LSTM module, the current frame hidden state information based on the fusion feature information and the previous frame hidden state information;
acquiring the current frame action output information corresponding to the intelligent agent according to the current frame hidden state information.
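Continuing the model sketch above, the per-frame hidden-state handoff looks roughly as follows; the tensor sizes are again assumptions.

```python
import torch

model = NavModelSketch()                    # from the sketch above
fused = torch.randn(1, 256)                 # current frame fusion feature info
prev_hidden = None                          # previous frame hidden state (frame 0)
out, cur_hidden = model.lstm(fused.unsqueeze(1), prev_hidden)
action_logits = model.action_head(out.squeeze(1))  # current frame action output
prev_hidden = cur_hidden                    # carried into the next frame's step
```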
In some embodiments, when the processor 301 obtains the current frame action output information corresponding to the agent according to the current frame hidden state information, the method specifically includes:
acquiring fusion state vector information corresponding to the intelligent agent according to the current frame hidden state information;
acquiring the current frame action output information according to the fusion state vector information.
In some embodiments, the processor 301 is configured to run a computer program stored in the memory 302 and to implement the following steps when executing the computer program:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information, multi-frame target area information, and multi-frame target area reward information of an intelligent agent;
outputting, through a preset AI model, multi-frame fusion state vector information and multi-frame target area selection information corresponding to the intelligent agent based on the multi-frame state information and the multi-frame target area information;
constructing a loss function according to the multi-frame fusion state vector information, the multi-frame target area selection information, and the multi-frame target area reward information;
performing multi-step iteration on the loss function to train and update the preset AI model.
In some embodiments, the preset AI model includes an LSTM module, the sample data set further includes hidden state information corresponding to the LSTM module, and when the processor 301 outputs, through the preset AI model, multi-frame fusion state vector information and multi-frame target area selection information corresponding to the agent based on the multi-frame state information and the multi-frame target area information, the method specifically includes:
acquiring multi-frame state features of the intelligent agent according to the multi-frame state information, wherein the multi-frame state features comprise corresponding multi-frame position features of the intelligent agent and multi-frame depth map features and multi-frame section map features corresponding to the 3D virtual environment where the intelligent agent is located;
acquiring multi-frame target area features according to the multi-frame target area information;
concatenating the multi-frame position features, the multi-frame depth map features, the multi-frame section map features, and the multi-frame target area features, and inputting the combined features into the fully connected neural network of the preset AI model to obtain corresponding multi-frame fusion feature information and multi-frame target area selection information corresponding to the intelligent agent;
outputting, through the LSTM module, the multi-frame fusion state vector information based on the hidden state information and the multi-frame fusion feature information.
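At training time the same LSTM can be unrolled over the stored trajectory; the sketch below reuses the model class from the earlier sketch, with a hypothetical value head (`value_head` is not named in the embodiments).

```python
import torch
import torch.nn as nn

model = NavModelSketch()
fused_seq = torch.randn(4, 32, 256)         # (batch, frames, fusion feature)
hidden0 = None                              # stored hidden state info, or None
seq_out, _ = model.lstm(fused_seq, hidden0) # multi-frame fusion state vectors
value_head = nn.Linear(256, 1)              # hypothetical value function head
values = value_head(seq_out).squeeze(-1)    # per-frame value function outputs
```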
In some embodiments, when the processor 301 constructs the loss function according to the multi-frame fusion state vector information, the multi-frame target area selection information, and the multi-frame target area reward information, the method specifically includes:
acquiring multi-frame action output information and the value function output values corresponding to the multi-frame action output information according to the multi-frame fusion state vector information;
constructing the loss function according to the multi-frame target area selection information, the multi-frame target area reward information, the multi-frame action output information, and the value function output values.
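The embodiments do not name a specific reinforcement learning objective, so the following is only one plausible actor-critic style construction of such a loss, with discounted target area rewards, a value baseline, and policy terms for both the action output and the target area selection; every choice here (discounting, coefficients, loss form) is an assumption.

```python
import torch
import torch.nn.functional as F

def build_loss(action_logits, region_logits, actions, regions, rewards, values,
               gamma=0.99, value_coef=0.5):
    """One plausible loss over a stored multi-frame trajectory of length T."""
    # Discounted returns from the multi-frame target area reward information.
    returns, R = [], torch.zeros(())
    for r in reversed(rewards):
        R = r + gamma * R
        returns.append(R)
    returns = torch.stack(list(reversed(returns)))
    advantage = returns - values.detach()
    # Log-probabilities of the taken actions and target area selections
    # (actions and regions are long tensors of chosen indices, shape (T,)).
    act_logp = F.log_softmax(action_logits, dim=-1).gather(
        1, actions.unsqueeze(1)).squeeze(1)
    reg_logp = F.log_softmax(region_logits, dim=-1).gather(
        1, regions.unsqueeze(1)).squeeze(1)
    policy_loss = -((act_logp + reg_logp) * advantage).mean()
    value_loss = F.mse_loss(values, returns)    # value function regression
    return policy_loss + value_coef * value_loss
```

In the training loop sketched earlier, a call like `build_loss(...)` would replace the placeholder loss.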
For the specific implementation of each of the above operations, reference may be made to the previous embodiments; details are not repeated here.
The computer readable storage medium may be an internal storage unit of the server of the foregoing embodiments, for example, a hard disk or a memory of the server. The computer readable storage medium may also be an external storage device of the server, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory card (Flash Card) provided on the server.
Because the computer program stored in the computer readable storage medium can execute any navigation decision making method and/or AI model training method provided by the embodiments of the present application, it can achieve the beneficial effects of those methods, which are detailed in the previous embodiments and not repeated here.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present application are for description only and do not represent the relative merits of the embodiments. While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of navigation decision making, the method comprising:
acquiring current frame state information and current frame target area information of an agent in a 3D virtual environment;
outputting current frame action output information and current frame target area selection information corresponding to the intelligent agent based on the current frame state information and the current frame target area information through an AI model;
controlling interaction between the intelligent agent and the 3D virtual environment according to the current frame action output information and the current frame target area selection information so as to acquire next frame state information and next frame target area information of the intelligent agent;
acquiring next frame action output information and next frame target area selection information of the intelligent agent according to the next frame state information and next frame target area information;
the outputting, by the AI model, current frame action output information and current frame target region selection information corresponding to the agent based on the current frame state information and the current frame target region information, including:
acquiring current frame state features of the intelligent agent according to the current frame state information, wherein the current frame state features comprise a position feature of the current frame of the intelligent agent and a depth map feature and a section map feature of the 3D virtual environment where the intelligent agent is located in the current frame, the position feature being used for representing the positional relationship between the intelligent agent and an enemy agent;
acquiring target area features according to the current frame target area information;
concatenating the position feature, the depth map feature, the section map feature, and the target area features, and inputting the combined features into a fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent;
inputting the fusion feature information into a timing feature extraction module of the AI model to obtain the current frame action output information corresponding to the intelligent agent;
wherein the target area features comprise a first class of target area features for representing whether the intelligent agent has reached a target area and a second class of target area features for representing the positional relationship between the intelligent agent and the target area while it travels to the target area; and the step of concatenating the position feature, the depth map feature, the section map feature, and the target area features and inputting them into the fully connected neural network of the AI model to obtain the corresponding fusion feature information and current frame target area selection information comprises:
concatenating the position feature, the depth map feature, the section map feature, the first class of target area features, and the second class of target area features, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
2. The method of claim 1, wherein the first class of target area features comprises a first feature for representing the size of the target area, a second feature for representing whether the intelligent agent is within the target area radius, a third feature for representing whether the intelligent agent has reached the target area, and a fourth feature for representing the residence time of the intelligent agent in the target area; and the concatenating and inputting step comprises:
concatenating the position feature, the depth map feature, the section map feature, the first feature, the second feature, the third feature, the fourth feature, and the second class of target area features, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
3. The method of claim 2, wherein the second class of target area features comprises a first location feature for representing the distance from the intelligent agent to the target area center point, a second location feature for representing the distance from the intelligent agent to the target area edge, a third location feature for representing the vector position difference between the intelligent agent and the target area center point, and a fourth location feature for representing the yaw angle from the intelligent agent to the target area center point; and the concatenating and inputting step comprises:
concatenating the position feature, the depth map feature, the section map feature, the first feature, the second feature, the third feature, the fourth feature, the first location feature, the second location feature, the third location feature, and the fourth location feature, and inputting the combined features into the fully connected neural network of the AI model to obtain corresponding fusion feature information and current frame target area selection information corresponding to the intelligent agent.
4. The method of any of claims 1-3, wherein the timing feature extraction module includes an LSTM module, and the inputting the fusion feature information into the timing feature extraction module of the AI model to obtain the current frame action output information corresponding to the intelligent agent comprises:
acquiring the previous frame hidden state information corresponding to the LSTM module;
outputting, through the LSTM module, the current frame hidden state information based on the fusion feature information and the previous frame hidden state information;
acquiring the current frame action output information corresponding to the intelligent agent according to the current frame hidden state information.
5. The method of claim 4, wherein the obtaining the current frame action output information corresponding to the agent according to the current frame hidden state information comprises:
acquiring fusion state vector information corresponding to the intelligent agent according to the current frame hidden state information;
acquiring the current frame action output information according to the fusion state vector information.
6. A method of training an AI model, comprising:
acquiring a sample data set, wherein the sample data set comprises multi-frame state information, multi-frame target area information, and multi-frame target area reward information of an intelligent agent;
outputting, through a preset AI model, multi-frame fusion state vector information and multi-frame target area selection information corresponding to the intelligent agent based on the multi-frame state information and the multi-frame target area information;
constructing a loss function according to the multi-frame fusion state vector information, the multi-frame target area selection information, and the multi-frame target area reward information;
performing multi-step iteration on the loss function to train and update the preset AI model until an AI model is obtained, the AI model being applied in the navigation decision making method according to any one of claims 1-5.
7. The method of claim 6, wherein the preset AI model includes an LSTM module, the sample data set further includes hidden state information corresponding to the LSTM module, the outputting, by the preset AI model, multi-frame fusion state vector information and multi-frame target region selection information corresponding to the agent based on the multi-frame state information and the multi-frame target region information includes:
acquiring multi-frame state features of the intelligent agent according to the multi-frame state information, wherein the multi-frame state features comprise corresponding multi-frame position features of the intelligent agent and multi-frame depth map features and multi-frame section map features corresponding to the 3D virtual environment where the intelligent agent is located;
acquiring multi-frame target area features according to the multi-frame target area information;
concatenating the multi-frame position features, the multi-frame depth map features, the multi-frame section map features, and the multi-frame target area features, and inputting the combined features into the fully connected neural network of the preset AI model to obtain corresponding multi-frame fusion feature information and multi-frame target area selection information corresponding to the intelligent agent;
outputting, through the LSTM module, the multi-frame fusion state vector information based on the hidden state information and the multi-frame fusion feature information.
8. The method according to claim 6 or 7, wherein the constructing a loss function from the multi-frame fusion state vector information, the multi-frame target area selection information, and the multi-frame target area reward information comprises:
acquiring multi-frame action output information and the value function output values corresponding to the multi-frame action output information according to the multi-frame fusion state vector information;
constructing the loss function according to the multi-frame target area selection information, the multi-frame target area reward information, the multi-frame action output information, and the value function output values.
9. A server, wherein the server comprises a processor and a memory;
the memory stores a computer program and an AI model that can be invoked and executed by the processor, wherein the computer program, when executed by the processor, implements the navigation decision making method of any one of claims 1-5, or implements the AI model training method of any one of claims 6-8.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to implement the navigation decision making method according to any one of claims 1-5, or the AI model training method according to any one of claims 6-8.
CN202011325706.7A 2020-11-23 2020-11-23 Navigation decision making method, AI model training method, server and medium Active CN112295232B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011325706.7A CN112295232B (en) 2020-11-23 2020-11-23 Navigation decision making method, AI model training method, server and medium


Publications (2)

Publication Number Publication Date
CN112295232A CN112295232A (en) 2021-02-02
CN112295232B true CN112295232B (en) 2024-01-23

Family

ID=74335369


Country Status (1)

Country Link
CN (1) CN112295232B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340949A (en) * 2020-05-21 2020-06-26 超参数科技(深圳)有限公司 Modeling method, computer device and storage medium for 3D virtual environment
CN111401557A (en) * 2020-06-03 2020-07-10 超参数科技(深圳)有限公司 Agent decision making method, AI model training method, server and medium
CN111589157A (en) * 2020-05-14 2020-08-28 超参数科技(深圳)有限公司 AI model training method, AI model using method, equipment and storage medium
CN111589120A (en) * 2020-05-14 2020-08-28 深圳海普参数科技有限公司 Object control method, computer device, and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8257173B2 (en) * 2008-02-05 2012-09-04 Disney Enterprises, Inc. System and method for driving artificial intelligence (AI) characters having continuous reevaluation of current goals and navigation path
US20190258953A1 (en) * 2018-01-23 2019-08-22 Ulrich Lang Method and system for determining policies, rules, and agent characteristics, for automating agents, and protection


Also Published As

Publication number Publication date
CN112295232A (en) 2021-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant