CN118267709A - Deep reinforcement learning-based agent control method and system

Deep reinforcement learning-based agent control method and system

Info

Publication number: CN118267709A
Application number: CN202410275750.3A
Authority: CN (China)
Prior art keywords: current, shooting, agent, hostile, information
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 崔言, 周越
Current Assignee: Beijing Huaqing Feiyang Network Co., Ltd.
Original Assignee: Beijing Huaqing Feiyang Network Co., Ltd.
Application filed by Beijing Huaqing Feiyang Network Co., Ltd.

Abstract

The invention discloses an agent control method and system based on deep reinforcement learning, comprising the following steps. First, state information of a current agent and state information of hostile agents are obtained and passed through a multi-layer perception mechanism to generate feature vectors. A target enemy is then determined using an attention mechanism. Meanwhile, the encoded features of the game map where the agent is located are acquired and extracted. These features are input into a policy network to produce action commands. Next, a predicted evaluation result and an actual reward result of the action command are obtained using a preset value function network and a reward function, respectively. Finally, the policy network is optimized based on the two results so as to make better action decisions. Through this design, deep learning and reinforcement learning are combined and the decision-making capability of the agent is improved, thereby enhancing the game experience.

Description

Deep reinforcement learning-based agent control method and system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an agent control method and system based on deep reinforcement learning.
Background
In computer games, particularly in electronic competitive games in a multi-agent environment, how to improve decision efficiency and effect of agents (i.e., artificial intelligent characters) is an important research topic. The traditional rule-based method is too dependent on preset rules and judgment logic to cope with complex and dynamically-changed game environments. Accordingly, studies on agent control by artificial intelligence techniques such as deep learning have been attracting increasing attention.
Disclosure of Invention
The invention aims to provide an agent control method and system based on deep reinforcement learning.
In a first aspect, an embodiment of the present invention provides an agent control method based on deep reinforcement learning, including:
acquiring current state information of a current intelligent agent and hostile state information of a plurality of hostile intelligent agents, inputting the current state information of the current intelligent agent and hostile state information of the hostile intelligent agents into an entity scalar encoder constructed based on a multi-layer perception mechanism, and obtaining entity feature vectors of the current intelligent agent and hostile feature vectors corresponding to the hostile intelligent agents;
Inputting the entity feature vector and a plurality of hostile feature vectors into a target hostile selection unit constructed based on an attention mechanism, and determining a target hostile agent from the hostile agents;
acquiring a game map of the current intelligent agent, and extracting map coding features of the current intelligent agent;
Inputting the entity feature vector, the target hostile feature vector corresponding to the target hostile agent and the map coding feature into a strategy network, and obtaining an action command for indicating whether the current agent executes shooting operation for the target hostile agent;
acquiring a predicted evaluation result of the action command based on a preset value function network;
acquiring an actual rewarding result of the action command based on a preset rewarding function;
And optimizing and adjusting the strategy network according to the predicted evaluation result and the actual rewarding result so as to make an action decision of the current intelligent agent by utilizing the strategy network after optimizing and adjusting.
In a second aspect, an embodiment of the present invention provides a server system, including a server, where the server is configured to perform the method described in the first aspect.
Compared with the prior art, the invention has the following beneficial effects. By adopting the agent control method and system based on deep reinforcement learning disclosed by the invention, the state information of the current agent and the hostile agents is obtained and passed through a multi-layer perception mechanism to generate feature vectors. A target enemy is then determined using an attention mechanism. Meanwhile, the encoded features of the game map where the agent is located are acquired and extracted. These features are input into a policy network to produce action commands. Next, a predicted evaluation result and an actual reward result of the action command are obtained using a preset value function network and a reward function, respectively. Finally, the policy network is optimized based on the two results so as to make better action decisions. Through this design, deep learning and reinforcement learning are combined and the decision-making capability of the agent is improved, thereby enhancing the game experience.
Drawings
In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described. It is appreciated that the following drawings depict only certain embodiments of the invention and are therefore not to be considered limiting of its scope. Other relevant drawings may be made by those of ordinary skill in the art without undue burden from these drawings.
FIG. 1 is a schematic flow chart of steps of an agent control method based on deep reinforcement learning according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
The following describes specific embodiments of the present invention in detail with reference to the drawings.
In order to solve the foregoing technical problems in the background art, fig. 1 is a schematic flow chart of an agent control method based on deep reinforcement learning according to an embodiment of the present disclosure, and the following describes the agent control method based on deep reinforcement learning in detail.
Step S201, obtaining current state information of a current intelligent agent and hostile state information of a plurality of hostile intelligent agents, inputting the current state information of the current intelligent agent and hostile state information of the hostile intelligent agents into an entity scalar encoder constructed based on a multi-layer perception mechanism, and obtaining entity feature vectors of the current intelligent agent and hostile feature vectors corresponding to the hostile intelligent agents;
Step S202, inputting the entity feature vector and a plurality of hostile feature vectors into a target hostile selection unit constructed based on an attention mechanism, and determining a target hostile agent from the hostile agents;
Step S203, a game map in which the current intelligent agent is located is obtained, and map coding features of the current intelligent agent are extracted;
Step S204, inputting the entity feature vector, the target hostile feature vector corresponding to the target hostile agent and the map coding feature into a strategy network, and obtaining an action command for indicating whether the current agent executes shooting operation for the target hostile agent;
step S205, obtaining a predicted evaluation result of the action command based on a preset value function network;
step S206, obtaining an actual rewarding result of the action command based on a preset rewarding function;
And step S207, optimizing and adjusting the strategy network according to the predicted evaluation result and the actual rewarding result so as to make an action decision of the current agent by utilizing the strategy network after optimizing and adjusting.
In an embodiment of the present invention, the server first obtains current state information of a current agent (e.g., one game character or robot of my party), including location, vital value, equipment status, etc. Meanwhile, the server also acquires hostile state information of a plurality of hostile agents (hostile characters or robots), such as their positions, vital values, behavior patterns, and the like. In a shooting game in which 10 persons participate, a server acquires information such as the location (coordinates X1, Y1) of an agent "fighter a" on my side, a vital value (100%), and a state of equipment (holding a rifle, a grenade). Meanwhile, the server also monitors various state information of 9 enemy agents. The server inputs the acquired state information of the current agent and the state information of a plurality of hostile agents into a solid scalar encoder constructed based on a multi-layer perception mechanism. This encoder converts the state information into physical feature vectors and hostile feature vectors. The server inputs status information of "fighter a", such as position, vital value, equipment, etc., and status information of 9 enemy agents into the physical scalar encoder, and outputs a physical feature vector of "fighter a" and 9 enemy feature vectors. The server inputs the entity feature vector and the plurality of hostile feature vectors into a target hostile selection unit constructed based on an attention mechanism. The unit analyzes the feature vectors to determine a target hostile agent from the plurality of hostile agents that the current agent should preferentially attack. By analyzing the entity feature vector and 9 hostile feature vectors of "fighter a", the target hostile selection unit determines that the hostile agent "sniper B" (located at X2, Y2, 80% of life value, holding sniper gun) is the target that threatens "fighter a" the greatest, thus selecting "sniper B" as the target hostile agent. The server obtains the game map information of the current agent and extracts map coding features related to the current agent, such as obstacle positions, walkable areas and the like. The game map where "fighter a" is located contains a plurality of buildings, obstacles, and walkable paths. The server analyzes the information and extracts map-coded features associated with the current location and activity of "warrior a". The server inputs the entity feature vector, the target hostile feature vector, and the map-encoded feature into a policy network. The policy network outputs an action command based on the input information, indicating whether the current agent should perform a shooting operation for the target hostile agent. The strategy network outputs an action command according to the entity feature vector of the fighter A, the target hostile feature vector of the sniper B and the map coding feature, and indicates that the fighter A should move to a more favorable position and then shoots the sniper B. And the server predicts and evaluates the action command by using a preset value function network to obtain a predicted and evaluated result. Meanwhile, the server calculates an actual rewarding result for the action command according to the actual game condition (such as whether the target is hit or not, the injury is caused, and the like). The value function network predicts that the action of "fighter a" moving to a new location and shooting "sniper B" will get a higher prize. 
In fact, "fighter A" successfully moves to the new location and inflicts damage on "sniper B", and the server calculates an actual reward result for this action based on the amount of damage and the rules of the game. The server optimizes the policy network based on the predicted evaluation result and the actual reward result so as to make more accurate action decisions in future games. The server compares the predicted evaluation result with the actual reward result and finds the difference between the two. The server then adjusts the parameters of the policy network so that it can more accurately predict the best action decisions in similar situations in future games.
In order to more clearly describe embodiments of the present invention, a more detailed description is provided below.
In an embodiment of the present invention, by way of example, a current agent is meant an intelligent entity that we are focusing on or controlling, which may be a character, robot, or other type of intelligent agent in a game. In a multiplayer online shooting game, the current agent may be a character controlled by an AI on the player side or on the opposite side of the player, such as a game character known as "warrior A". The current state information refers to a data set of various attributes and conditions that the current agent has at a certain time. For "fighter A", the current state information may include its location coordinates (X1, Y1), vital values (e.g., 100%), equipment status (e.g., gun hold, number of ammunition, etc.), speed of movement, etc. An adversary agent refers to an intelligent entity in hostile relationship with a current agent, which is typically an opponent or enemy in a game. In a shooting game, the hostile agent may be an enemy AI-controlled character, such as "sniper B" or other AI-controlled enemy. Hostile state information refers to a collection of data of various attributes and conditions that a hostile agent has at a certain moment, which is very important to the current agent, as they determine how the current agent should deal with the hostile. For "marksman B", the hostile state information may include its location coordinates (X2, Y2), a life value (e.g., 80%), a weapon held (e.g., a marksman gun), a behavior pattern (e.g., hiding, attacking, etc.), and so forth. The multi-layer sensing mechanism is generally referred to as a neural network structure in deep learning, which contains multiple levels of neurons capable of learning and extracting complex features of input data. In a physical scalar encoder, the multi-layer perceptive mechanism may be a neural network consisting of a plurality of fully connected layers that receives as input state information of the current agent and the hostile agent and processes the information through neurons of the layers to ultimately output physical feature vectors and hostile feature vectors. The entity scalar encoder is a neural network model that functions to transform the raw state information of the current agent and the hostile agent into a feature representation, i.e., feature vector, that is easier to process and analyze. the entity scalar encoder can receive state information such as the position and the vital value of a fighter A, and state information of a hostile agent such as a sniper B as input, and then output an entity feature vector and a plurality of hostile feature vectors of the fighter A. The feature vectors capture key features in the original state information and can be used for subsequent tasks such as target enemy selection and action decision. The entity feature vector is a vector representation of the current agent state information, and the hostile feature vector is a vector representation of the hostile agent state information. These vectors are transformed by a solid scalar encoder, contain key features in the original state information, and can be used for input of a machine learning model. for "warrior A", the entity feature vector may be a multidimensional vector containing information about its location, vital value, equipment status, etc.; for "marksman B", the hostile feature vector is also a multidimensional vector containing information about its location, vital value, weapon being held, etc. 
These vectors play a key role in the subsequent target enemy selection and action decision process.
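By way of a non-limiting illustration only, an entity scalar encoder of the kind described above can be sketched as a small multi-layer perceptron in PyTorch. The class name EntityScalarEncoder, the input dimension, and the layer sizes below are assumptions made for the sketch rather than the exact network of this embodiment.

```python
# Minimal sketch of an entity scalar encoder built on a multi-layer perceptron.
# All dimensions and names are illustrative assumptions, not the patent's exact design.
import torch
import torch.nn as nn

class EntityScalarEncoder(nn.Module):
    def __init__(self, state_dim: int = 8, feature_dim: int = 64):
        super().__init__()
        # Two fully connected layers map raw scalar state info to a feature vector.
        self.mlp = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, feature_dim),
            nn.ReLU(),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim) scalars such as position, health, ammo, speed.
        return self.mlp(state)

if __name__ == "__main__":
    encoder = EntityScalarEncoder()
    my_state = torch.randn(1, 8)        # current agent ("fighter A") state scalars
    enemy_states = torch.randn(9, 8)    # 9 hostile agents' state scalars
    entity_vec = encoder(my_state)      # (1, 64) entity feature vector
    enemy_vecs = encoder(enemy_states)  # (9, 64) hostile feature vectors
    print(entity_vec.shape, enemy_vecs.shape)
```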
In addition, in embodiments of the present invention, the attention mechanism is a technique that allows a model to focus "attention" on relevant information while processing the information. In deep learning, the attention mechanism is typically implemented by calculating weights between different information elements so that the model can dynamically focus on the most important information for the current task. In the task of target enemy selection, the attention mechanism can help the model calculate a weight distribution according to the state of the current agent and the state information of the enemy agents, so as to indicate which enemy agents the current agent should pay more attention to. The target enemy selecting unit is a neural network module based on an attention mechanism, and has the function of selecting one or more main attack targets serving as current agents from a plurality of enemy agents according to the input entity feature vectors and the plurality of enemy feature vectors. In this unit, the attention mechanism calculates the attention weight of the current agent (e.g. "fighter A") to each hostile agent (e.g. "sniper B", "infantry C", "tank D"). Then, based on these weights, one of the highest-weighted hostile agents is selected as the target hostile agent. If "marksman B" poses the greatest threat to "fighter a", then "marksman B" may be selected as the target hostile agent. The target adversary agent is determined to be the adversary agent that the current agent should pay priority to pay attention to or attack after being processed by the target adversary selection unit in the current case. In a battle, a "fighter a" may face a plurality of enemies, but by calculation of the target enemy selection unit, a "sniper B" is determined as the enemy that should be focused most at present on the "fighter a" based on its own status and the status of the enemies, and thus the "sniper B" is the target enemy agent. Through such a process, the server can dynamically determine the primary attack target of the current agent in a complex gaming environment using the attention mechanism and the target enemy selection unit, thereby achieving a more intelligent and efficient decision.
Further, in the embodiment of the present invention, the game map refers to a two-dimensional or three-dimensional data structure representing the game space layout and the environmental information in the game world. It contains the information of the position, attribute and relation of all the interactable objects in the game. In a strategic game, the game map may be a two-dimensional plan view showing geographical elements such as mountains, rivers, cities, roads, etc. In a first person shooter game, the game map may be a three-dimensional space containing complex elements such as buildings, shelters, weapon refreshing points, etc. Acquiring a game map in which a current agent is located refers to extracting, from a game environment, map data of the portion in which the current agent is located or information of the entire game map. Such information is critical to decision making and action of the agent. It is assumed that in a strategy game, the character of the AI (i.e., the current agent) is located in a city on the map. Then "obtain the game map where the current agent is located" means to obtain map data of the city and its surrounding areas, including information such as topography, resource distribution, hostile force distribution, etc. of the city. Map coding features refer to converting information in a game map into a numerical or vector form so that machine learning models can process and understand. Such conversion is typically accomplished by means of feature engineering or deep learning. In a simple case, we can divide the game map into grids, each representing a specific area (e.g. grass, mountain, water, etc.). We can then use a two-dimensional matrix to represent the entire map, where each matrix element corresponds to a grid cell on the map and contains the type information for that region (e.g., 0 for grass, 1 for mountain, etc.). This two-dimensional matrix is an encoded feature of the map. Extracting map coding features of the current agent refers to extracting feature information related to decision and action of the current agent from a game map where the current agent is located. Such characteristic information may be the type of terrain, enemy distribution, resource location, etc. surrounding the current agent. In the previous strategic game example, if the current agent (AI's character) is located in a city and is ready to launch an attack, then extracting map-encoded features may include determining whether the terrain surrounding the city is favorable for the attack (e.g., whether mountain land is covered), the distribution of enemy forces (e.g., the location and number of enemy units), and the distribution of resources (e.g., whether available resource points are nearby), etc. This information will be converted into a numerical or vector form as input to the machine learning model.
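As a minimal sketch of the grid-style map encoding discussed above (the terrain codes, map size, and window radius are assumptions, not values taken from this embodiment), the map-coded features around the current agent could be extracted as follows.

```python
# Illustrative sketch of grid-based map encoding; terrain codes, map size,
# and the local-window extraction are assumptions for the example.
import numpy as np

GRASS, MOUNTAIN, WATER = 0, 1, 2  # assumed terrain type codes

def extract_local_map_features(game_map: np.ndarray, agent_xy: tuple, radius: int = 2) -> np.ndarray:
    """Crop the grid cells around the current agent as its map encoding feature."""
    x, y = agent_xy
    # Pad the map so windows near the border keep a fixed size.
    padded = np.pad(game_map, radius, mode="constant", constant_values=MOUNTAIN)
    window = padded[x:x + 2 * radius + 1, y:y + 2 * radius + 1]
    return window.astype(np.float32)

if __name__ == "__main__":
    game_map = np.zeros((10, 10), dtype=np.int64)   # 10x10 grid, all grass
    game_map[3, 4] = MOUNTAIN
    game_map[6, 6] = WATER
    features = extract_local_map_features(game_map, agent_xy=(5, 5))
    print(features)  # 5x5 window of terrain codes around the agent
```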
Furthermore, in embodiments of the present invention, the strategy network is a trained machine learning model, typically a deep learning network. The method receives the state information of the current intelligent agent, the state information of the target hostile intelligent agent and the map coding features as inputs and outputs an action decision or an action command. In shooting games, the policy network may be a neural network, which determines actions that the current agent should take, such as moving to a certain location, shooting the target agent, etc., based on the input entity feature vector, the target agent feature vector, and the map coding feature. The action command is the result of the policy network output and is used to indicate the specific action that the current agent should perform. In a shooting game, action commands may include moving, shooting, changing a bullet, using props, etc. After the strategy network has processed the input information, it may output an action command indicating "fighter a" shooting "sniper B". This command is then translated into a specific operation within the game, such as adjusting the scope, pulling the trigger, etc. By integrating these different feature information (entity feature vector, target hostile feature vector, map coding feature) and inputting into the policy network, the server is able to intelligently generate the optimal action command for the current game state, thereby achieving a more intelligent and adaptive game AI behavior.
Furthermore, in embodiments of the present invention, the value function network is a key component in reinforcement learning to estimate long-term return after taking some action in a given state. It is typically a neural network that is trained to learn how to predict the value of a state-action pair, i.e., the expected future rewards total. This network can help the agent take long term effects into account when selecting actions. In a shooting game, the value function network may receive as input the current game status (including the current agent's location, vital value, enemy's location, etc.) and a potential action command (e.g., shooting or moving to a location) and output a predicted value indicating the sum of rewards that the agent may have acquired in the future if the action was performed. This predictive value can help the agent determine which actions are more likely to be optimal. An action command is a decision result made by an agent based on current state and environmental information, which indicates a specific action that the agent should take. In the context of reinforcement learning, action commands are typically generated by a policy network. In a shooting game, an action command may be "shoot to enemy's location" or "move to the back of the shelter". These commands are generated based on the agent's understanding of the state of the game and its goal (e.g., defeating enemies). The predictive assessment result is a prediction of rewards that the value function network may bring in the future for a given action command. It reflects the long-term return that would be expected to be obtained if the agent performed the action. In a shooting game, when an agent considers whether to shoot an enemy, the value function network provides a predictive assessment of the future prize sum that the agent may have if performing the shooting action. This predictive assessment may help the agent determine if the shot is a good choice.
In addition, in the embodiment of the present invention, the preset reward function is a core concept in reinforcement learning, which defines a reward signal obtained from the environment by an agent after performing a certain action. This reward signal is the basis for the agent to learn how to optimize its behavior. The pre-set reward function means that this function is already defined before the agent starts to interact with the environment, and the reward can be calculated based on the agent's actions and the state of the environment. In a shooting game, the bonus function might be defined as: if the agent shoots and hits the enemy, a positive reward is obtained; if the agent is hit by an enemy, a negative reward is obtained; if the agent has completed a certain task (e.g., occupied a point), an additional positive reward is obtained. These rules are determined at the beginning of the game design and are used to evaluate each action of the agent. The actual rewards result is the rewards actually obtained from the environment after the agent executes the action command. This incentive is calculated according to a preset incentive function, reflecting the impact of the action command on the environment and whether to progress towards the objectives of the agent. In a shooting game, if an agent performs an action command of "shooting to enemy's location" and successfully hits an enemy, the agent will obtain a positive prize as an actual prize outcome according to a predetermined prize function. This positive reward indicates that the agent's action is effective and proceeds toward completing the game target. By combining the preset rewarding function and the generated action command, the server can calculate and acquire rewarding results brought by the commands after the commands are actually executed. These actual rewards results provide feedback to the agent to make it aware of which actions are beneficial and which are adverse, and adjust its strategy to optimize future behavior.
In addition, in the embodiment of the present invention, the optimization adjustment refers to a process of updating and improving parameters of the policy network according to the action results (the predicted evaluation results and the actual rewards results) of the agent. The goal is to enable the policy network to make better action decisions in the future to maximize the cumulative rewards obtained from the environment. In a shooting game, if an agent performs a policy network generated action decision and obtains an actual rewards result, this result is compared with the previous predictive assessment result. If there is a discrepancy (e.g., the actual reward is lower than expected), then the parameters of the policy network are adjusted by an optimization algorithm to generate more accurate action decisions in future similar scenarios.
In the embodiment of the present invention, the target enemy selection unit includes a cascaded Softmax architecture and Gumbel Softmax Sampling architecture, and the foregoing step S202 may be implemented by the following example execution.
Performing dot product calculation on the entity feature vector and a plurality of hostile feature vectors to obtain a target matrix constructed by a plurality of vector multiplication results, wherein each element in the target matrix is used for representing the correlation between the current intelligent agent and each hostile intelligent agent;
Normalizing the target matrix by using the Softmax architecture to obtain probability distribution corresponding to the target matrix, wherein each element in the probability distribution is used for representing discrete action probability of the current agent for selecting each hostile agent;
Processing the probability distribution by utilizing the Gumbel Softmax Sampling framework to obtain a new probability distribution;
In the training stage, calculating a loss function according to the new probability distribution and performing back propagation to update the model parameters;
And in the reasoning stage, sampling is carried out according to the new probability distribution, and finally the target hostile agent selected is determined.
In an exemplary embodiment of the present invention, in a multiplayer online shooting game, the server needs to select a target from a plurality of hostile agents (e.g., hostile warriors B, C, D) for the current agent (e.g., my warrior a). The server first extracts the entity feature vector of the current agent (warrior a), which includes information of warrior a's location, vital value, equipment, etc. Meanwhile, the server also extracts hostile feature vectors of hostile warriors B, C, D. Next, the server performs a dot product calculation on the physical feature vector of warrior a and the hostile feature vector of B, C, D. Through dot product calculation, the server obtains a target matrix. Each element in this matrix represents a correlation between warrior a and enemy warrior B, C, D. For example, if warrior A is in close proximity to warrior B and warrior B has a low life value, then warrior A may have a high correlation with warrior B. The server normalizes the target matrix by using a Softmax architecture. The Softmax function may convert each element in the target matrix to a probability value between 0 and 1, and the sum of all probability values is 1. After Softmax processing, the server gets a probability distribution. Each element in this probability distribution represents a discrete action probability that warrior a selects enemy warrior B, C, D as the target. For example, warrior A has a probability of 0.6 for warrior B, 0.3 for warrior C, and 0.1 for warrior D. In order to better explore the different action choices during the training phase and avoid always choosing the one with the highest probability (i.e. avoid the problem of excessive certainty), the server uses Gumbel Softmax Sampling architecture to process the probability distribution. This approach allows more activity space to be explored during training by adding some randomness (i.e., gummel noise) to the original probability distribution. After Gumbel Softmax Sampling processing, the server gets a new probability distribution. This new probability distribution still maintains the general shape of the original probability distribution, but adds some randomness. During the training phase of the game, the server needs to optimize its decision-making ability based on the results of the actions of the agent. In particular, the server needs to calculate the loss function and perform back propagation to update the parameters of the model. During the training phase, the server calculates a loss function based on the new probability distribution and the actual outcome of the action (e.g., whether warrior a successfully hit the selected target). This loss function measures the gap between the decision of the agent and the actual result. The server then performs back propagation using an optimization algorithm such as gradient descent, updating the parameters of the model to reduce the value of the loss function. Through repeated iterative training, the parameters of the model are gradually optimized, and the decision making capability of the intelligent agent is improved. For example, warrior A can more accurately determine which enemy warrior is the most threatening or most susceptible to defeat when selecting a target. In the actual run phase of the game (i.e., the inference phase), the server needs to quickly and accurately select a target hostile agent for the current agent. In the inference phase, the server samples according to the new probability distribution. 
Specifically, the server may decide which hostile agent to select as the target based on the size of each element in the probability distribution. For example, if warrior A has the highest probability of selecting warrior B, then the server selects warrior B as the target for warrior A. Through the sampling operation, the server ultimately determines the target hostile agent (e.g., warrior B) of the current agent (warrior A). Warrior A may then perform a corresponding action (e.g., firing at warrior B) based on this decision.
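A hedged sketch of this target-selection flow (dot-product scores, Softmax normalization, Gumbel-Softmax sampling during training, ordinary sampling at inference) is given below; the feature dimensions, temperature, and hard/soft sampling choices are illustrative assumptions, not the embodiment's exact settings.

```python
# Sketch of attention-style target selection with Gumbel-Softmax sampling.
import torch
import torch.nn.functional as F

def select_target(entity_vec: torch.Tensor, enemy_vecs: torch.Tensor, training: bool) -> torch.Tensor:
    # Dot product of the current agent's feature vector with each hostile feature
    # vector gives one relevance score ("target matrix" entry) per enemy.
    scores = enemy_vecs @ entity_vec.squeeze(0)          # (num_enemies,)
    probs = F.softmax(scores, dim=-1)                    # discrete action probabilities

    if training:
        # Gumbel-Softmax adds Gumbel noise to the logits so training can still explore
        # targets other than the current argmax while remaining differentiable.
        one_hot = F.gumbel_softmax(torch.log(probs + 1e-8), tau=1.0, hard=True)
    else:
        # At inference time, sample from the (new) probability distribution.
        idx = torch.multinomial(probs, num_samples=1)
        one_hot = F.one_hot(idx.squeeze(0), num_classes=probs.shape[-1]).float()
    return one_hot                                       # one-hot over hostile agents

if __name__ == "__main__":
    entity_vec = torch.randn(1, 64)   # "warrior A" entity feature vector
    enemy_vecs = torch.randn(3, 64)   # hostile feature vectors for B, C, D
    print(select_target(entity_vec, enemy_vecs, training=True))
    print(select_target(entity_vec, enemy_vecs, training=False))
```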
In the embodiment of the present invention, the policy network includes a mobile policy sub-network and a shooting policy sub-network, and the foregoing step S204 may be implemented by the following example execution.
Acquiring a spatial feature map of the map coding feature, wherein the spatial feature map comprises barrier information of the current intelligent agent and position information of the target hostile intelligent agent;
inputting the space feature map into the mobile strategy sub-network to obtain a feature map prediction result with a preset size;
normalizing the feature map prediction result to obtain the probability distribution of the moving direction of the current intelligent agent;
Performing element addition-based feature fusion on the entity feature vector, the target hostile feature vector and the map coding feature to obtain a first fusion feature vector;
inputting the first fusion feature vector into the shooting strategy sub-network to obtain the action probability distribution of whether the current intelligent agent shoots or not;
and acquiring an action command for indicating whether the current agent executes shooting operation for the target enemy agent according to the movement direction probability distribution and the action probability distribution.
In an embodiment of the present invention, the server needs to make action decisions for the current agent (e.g., my warrior a) in an exemplary multi-player online shooting game, including whether to shoot a target hostile agent (e.g., hostile warrior B) and how to move to optimize shooting location or avoid obstacles. The server first obtains coded features of the map including information on terrain, obstructions, shelters, etc. The server then generates a spatial signature from the encoded features. The space feature map not only contains obstacle information around the current agent, but also marks the location of the target hostile agent (hostile warrior B). The server obtains a feature map containing rich spatial information, which provides a basis for subsequent decision-making. The server inputs the spatial signature into the mobile policy subnetwork. This sub-network is a deep learning model dedicated to predicting the direction of movement of the current agent. The mobile strategy sub-network outputs a feature map prediction result with a preset size. This result includes the possibility that the current agent is moving in different directions. And the server normalizes the feature map prediction result and converts the feature map prediction result into a probability distribution form. Thus, each direction of movement is assigned a probability value between 0 and 1, and the sum of all probability values is 1. The server obtains the probability distribution of the moving direction of the current agent. This distribution clearly shows the magnitude of the probability of the current agent moving in all directions. And the server fuses the entity feature vector of the current agent, the feature vector of the target hostile agent and the map coding feature based on element addition. This fusion approach can preserve the information of the individual feature vectors and effectively combine them together. The server obtains a first fused feature vector. The vector integrates the information of the current agent, the target hostile agent and the map environment, and provides comprehensive input for subsequent shooting decisions. The server inputs the first fused feature vector into the firing strategy sub-network. This sub-network is also a deep learning model, dedicated to predicting whether the current agent should perform a shooting operation. The shooting strategy sub-network outputs the action probability distribution of whether the current agent shoots or not. This distribution contains the size of the likelihood that the current agent will or will not perform a fire. The server formulates the action command of the current agent according to the movement direction probability distribution and the action probability distribution. Specifically, the server may select the direction of movement and the shooting action with the highest probability as the action command of the current agent. The server gets an explicit action command indicating whether the current agent should perform a shooting operation for the target enemy agent and how it should move to optimize the location or avoid the risk. This action command is then sent to the current agent for execution.
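The following is a rough, non-authoritative sketch of the two policy sub-networks described above; the channel counts, the eight-direction movement head, and the two-way shoot/no-shoot head are assumptions made for the example.

```python
# Sketch of a movement policy sub-network (spatial feature map -> direction
# probabilities) and a shooting policy sub-network (element-wise-addition fusion
# -> shoot/no-shoot probabilities). Sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovePolicy(nn.Module):
    def __init__(self, in_channels: int = 4, num_directions: int = 8):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 16, kernel_size=3, padding=1)
        self.head = nn.Linear(16, num_directions)

    def forward(self, spatial_map: torch.Tensor) -> torch.Tensor:
        # spatial_map: (batch, C, H, W) with obstacle and target-position channels.
        h = F.relu(self.conv(spatial_map))
        h = h.mean(dim=(2, 3))                    # global average pool to (batch, 16)
        return F.softmax(self.head(h), dim=-1)    # movement-direction probabilities

class ShootPolicy(nn.Module):
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.head = nn.Linear(feature_dim, 2)     # [no-shoot, shoot]

    def forward(self, entity, target_enemy, map_feat) -> torch.Tensor:
        fused = entity + target_enemy + map_feat  # element-wise-addition fusion
        return F.softmax(self.head(fused), dim=-1)

if __name__ == "__main__":
    move_probs = MovePolicy()(torch.randn(1, 4, 16, 16))
    shoot_probs = ShootPolicy()(torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64))
    direction = torch.argmax(move_probs, dim=-1)
    shoot = torch.argmax(shoot_probs, dim=-1)
    print(direction.item(), bool(shoot.item()))
```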
In the embodiment of the present invention, the aforementioned step S205 may be implemented by the following example execution.
Acquiring the undetermined state information of the current intelligent agent and the undetermined state information of the plurality of hostile intelligent agents after the current intelligent agent executes the action command;
Extracting entity state vectors of the undetermined state information of the current intelligent agent and hostile state vectors corresponding to the undetermined state information of the hostile intelligent agents respectively;
Carrying out average processing on a plurality of hostile state vectors to obtain an average hostile state vector;
performing element-addition-based feature fusion on the entity state vector and the average hostile state vector to obtain a second fusion feature vector;
And inputting the second fusion feature vector into a preset value function network to obtain a prediction evaluation result of the action command.
In an embodiment of the present invention, the server needs to evaluate the effect of the current agent (e.g., my warrior a) after executing a certain action command in an exemplary multi-player online shooting game. This action command might be "shoot enemy fighter B and move to the back of shelter C". The server firstly acquires the undetermined state information of the current agent after executing the action command. This includes the current agent's new location, vital value, amount of ammunition, etc., as well as pending status information for all hostile agents (e.g., hostile warriors B, D, E), such as their location, vital value, whether they are behind a shelter, etc. The server has the latest state information of all relevant agents after executing the action command. The server extracts the entity state vector of the current intelligent agent and the hostile state vector of each hostile intelligent agent from the acquired undetermined state information. These vectors are a numerical representation that facilitates subsequent computation and processing. The server obtains the entity state vector of the current agent and the hostile state vectors of a plurality of hostile agents. The server averages the plurality of hostile state vectors. This means that the server calculates the average of all hostile state vectors in each dimension, resulting in one average hostile state vector. This vector represents the average state of all hostile agents. The server obtains an average hostile state vector that integrates the state information of all hostile agents. And the server performs element addition-based fusion on the entity state vector and the average hostile state vector of the current agent. This means that the elements of the two vectors at the corresponding positions are added to form a new fused feature vector. The server obtains a second fused feature vector that contains both the current agent's state and the average state information of all hostile agents. The server inputs the second fusion feature vector into a preset value function network. This network of value functions is a deep learning model that has been trained to predict long-term returns or effects of action commands in a given state. The value function network outputs a predictive evaluation result. This result is a value that represents the expected effect or return of executing the action command in the current state. The server may decide based on this result whether to execute the action command or select other more advantageous actions.
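A minimal sketch of this prediction-evaluation step is given below, assuming all state vectors share one dimension so that element-wise addition is well defined; the ValueNetwork body is illustrative only.

```python
# Sketch: average the hostile state vectors, fuse with the current agent's state
# vector by element-wise addition, and score the result with a value network.
import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        return self.net(fused)          # predicted evaluation result (one scalar per batch item)

def predict_evaluation(value_net, agent_state_vec, enemy_state_vecs):
    avg_enemy = enemy_state_vecs.mean(dim=0, keepdim=True)   # average hostile state vector
    fused = agent_state_vec + avg_enemy                      # second fused feature vector
    return value_net(fused)

if __name__ == "__main__":
    value_net = ValueNetwork()
    agent_vec = torch.randn(1, 64)     # current agent's pending state vector
    enemy_vecs = torch.randn(9, 64)    # pending state vectors of 9 hostile agents
    print(predict_evaluation(value_net, agent_vec, enemy_vecs))
```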
In the embodiment of the present invention, the aforementioned step S206 may be implemented by the following example execution.
Based on a preset reward function Rewards = Σ_i α_i·R_i, acquiring an actual rewarding result of the action command;
wherein Rewards is the actual rewarding result, R_i is the i-th reward type, and α_i is the weight of reward type R_i.
In an exemplary embodiment of the present invention, in order to avoid the win/loss reward being too sparse when it is given only after each game (episode) ends, intermediate process rewards are designed, including the following:
Kill reward R_1: when an Agent (the current agent) kills an enemy, a fixed positive reward value is given, encouraging the Agent to kill enemies.
Death penalty R_2: when an Agent dies, a fixed negative reward is given, encouraging the Agent to avoid death.
Shooting-hit reward R_3: when an Agent's shot hits an enemy, a reward proportional to the amount of damage caused is given, encouraging the Agent to shoot at enemies.
Damage penalty R_4: when an Agent takes damage, a penalty proportional to the damage value is given, encouraging the Agent to avoid being injured.
Shooting-miss penalty R_5: when an Agent's shot misses the enemy, which amounts to wasting ammunition, a smaller penalty is given to discourage ineffective shooting.
Movement reward R_6: a reward given when the Agent moves toward and finds the target enemy, used to prevent the Agent from wandering ineffectively at random for long periods.
Total reward: Rewards = Σ_{i=1}^{6} α_i·R_i
where α_i is the weight of each reward term, determining which part of the reward the reinforcement learning model attends to more; the weights can also be used to train Agents of different styles. For example, an assault-style Agent is given higher weights on the movement reward and the kill reward, while a detour-and-containment style Agent receives heavier penalties for taking damage.
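As a toy numerical illustration of the weighted total reward above (the term values and weights below are made up, not taken from this embodiment):

```python
# Toy computation of the total reward as the weighted sum of the six
# intermediate reward terms; all numbers are illustrative placeholders.
def total_reward(terms: dict, weights: dict) -> float:
    """Rewards = sum_i alpha_i * R_i over the six reward/penalty terms."""
    return sum(weights[name] * value for name, value in terms.items())

if __name__ == "__main__":
    terms = {
        "kill": 1.0,        # R1: killed an enemy this step
        "death": 0.0,       # R2: did not die
        "hit": 0.35,        # R3: proportional to damage dealt
        "hurt": -0.10,      # R4: proportional to damage taken
        "miss": -0.02,      # R5: wasted a shot
        "move": 0.05,       # R6: moved toward the target enemy
    }
    # Example weights for an assault-style agent (higher movement and kill weights).
    weights = {"kill": 1.0, "death": 1.0, "hit": 0.5, "hurt": 0.5, "miss": 0.2, "move": 0.3}
    print(total_reward(terms, weights))
```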
In an embodiment of the present invention, the policy network is obtained in the following manner.
Acquiring an initial strategy network, and interacting with a preset shooting game environment for a plurality of times based on the initial strategy network to obtain a game state track;
calculating discount rewards and advantage values of various states included in the game state track according to the game state track combined with a preset value function network;
Updating the initial strategy network based on the discount rewards and the dominance values to obtain an intermediate strategy network, and updating the preset value function network based on a mean square loss function;
And performing iterative training according to the intermediate strategy network and the updated preset value function network to obtain the strategy network after training is completed.
In the embodiment of the present invention, the server first needs an initial policy network as a starting point when developing a shooting game AI, for example. This initial policy network may be a simple neural network model whose structure and parameters have been obtained through some basic training or initialization method. The server loads or initializes an initial policy network. This network will be used for subsequent interactions and training. The server has an initial policy network that can interact with the environment. To train the strategy network, the server needs to have it interact multiple times in the shooting game environment to collect game state trajectory data. These data will be used to evaluate the performance of the policy network and guide the updating of the network. The server uses the initial strategy network to conduct multiple interactions in a preset shooting game environment. In each interaction, the policy network outputs an action command (e.g., fire, move, etc.) based on the current game state, and the game environment updates the state based on the action command and gives a bonus signal. The server records these status, action and rewards data to form a game status track. The server gathers a series of game state trajectory data reflecting the behavior of the initial policy network in the game. To evaluate the performance of the policy network and guide the updating of the network, the server needs to calculate discount rewards and advantage values for each state in the game state trajectory. The discounted rewards take into account the impact of future rewards on the current state, while the dominance value measures the difference between the actual rewards and the expected rewards. The server uses a pre-set network of value functions to calculate the value function (i.e., the expected rewards) for each state. The server then calculates a discount prize and advantage value for each state based on the prize signal and the value function in the game state trajectory. The discount rewards are obtained by accumulating future rewards at a discount rate, and the dominance value is the difference between the actual rewards and the value function. The server gets the discount rewards and advantage values for each state, which will be used to update the policy network and the value function network. With the discount rewards and advantage values, the server may begin updating the initial policy network and the value function network. The goal of the update is to enable the policy network to output better action commands to get higher rewards; and simultaneously, the value function network can predict the value function of the state more accurately. The server uses the discount rewards and the advantage values to update parameters of the initial policy network. In particular, the server may adjust parameters of the network by gradient ascent or the like to increase the probability of those action commands that result in high rewards. Meanwhile, the server also uses the mean square loss function to update the parameters of the preset value function network, so that the server can more accurately predict the value function of the state. This process may be iterated a number of times until the performance of the policy network and the value function network reaches a certain level. After multiple iterative training, the server obtains a trained strategy network and an updated value function network. 
These networks will be used for agent decision and evaluation in games.
In the embodiment of the present invention, the step of calculating the discount rewards and the advantage values of the game state track according to the game state track and combining with a preset value function network may be implemented by the following example.
According to the formula R_t = r_t + γ·r_{t+1} + ... + γ^{n−1−t}·r_{n−1} + γ^{n−t}·V_φ(s_n), calculating the discount reward of the game state track; wherein R_t is the discount reward, V_φ(s_n) is the value of state s_n estimated by the preset value function network, and γ is the discount factor;
According to the formula A_t = R_t − V_φ(s_t), calculating the advantage value of the game state track; wherein A_t is the advantage value.
In an embodiment of the invention, the policy network is trained based on self-play. A shooting game client developed with Unity establishes a long-lived connection with a Python server to send and receive game states and control instructions; the specific flow is as follows:
1. The Actor policy network π_θ interacts with the game environment n times, and the game state track (s_t, a_t, r_t), t ∈ {1, 2, ..., n−1}, is collected.
The game client sends, at a fixed period, the environment state s_t at the current time t of the game, namely the states of the player, the enemies, the map and the like; the Actor neural network produces an action a_t based on state s_t and returns it to the client. The client controls the Agent to move and shoot according to action a_t; after the Agent interacts with the environment, the environment state becomes s_{t+1}, the reward r_t obtained by action a_t is determined from the state change, and the server stores the game state track (s_t, a_t, r_t).
2. Calculating the discount rewards and the advantage function
When the n rounds of data collection are complete, the discount reward R_t and the advantage function A_t for each state are calculated from the track of triples (s_t, a_t, r_t).
Calculation of the discount reward:
R_t = r_t + γ·r_{t+1} + ... + γ^{n−1−t}·r_{n−1} + γ^{n−t}·V_φ(s_n)
where V_φ(s_n) is the value of state s_n estimated by the value function (critic); because the game agent interacts with the environment only n times and there is no (n+1)-th state from which to determine r_n, the critic's estimate V_φ(s_n) is used in its place. γ is the discount factor: rewards at future time steps are multiplied by powers of γ when carried back to the current time, in order to encourage the model to take long-term rewards into account. γ is a number between 0 and 1; when γ = 0, the model considers only the current reward and not future rewards.
Calculation of the advantage function:
A_t = R_t − V_φ(s_t)
where V_φ(s_t) is the value obtained from the critic network for the input state s_t. The difference between the discount reward R_t and V_φ(s_t) is the Advantage value A_t, which evaluates how the Actor's action performs relative to the baseline V_φ(s_t), i.e., how advantageous a given action is compared with the average action in the current state.
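A short sketch of this discount-reward and advantage computation is given below; the trajectory rewards, critic values, and discount factor are placeholder numbers, not data from this embodiment.

```python
# Sketch of R_t = r_t + γ r_{t+1} + ... + γ^{n-1-t} r_{n-1} + γ^{n-t} V_φ(s_n)
# and A_t = R_t - V_φ(s_t), computed over a short placeholder trajectory.
from typing import List

def discounted_returns(rewards: List[float], bootstrap_value: float, gamma: float) -> List[float]:
    """Compute R_t for each step, bootstrapping the tail with the critic's V_φ(s_n)."""
    returns = []
    running = bootstrap_value                 # stands in for rewards beyond s_n
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

def advantages(returns: List[float], values: List[float]) -> List[float]:
    """A_t = R_t - V_φ(s_t), the action's edge over the critic's baseline."""
    return [R - V for R, V in zip(returns, values)]

if __name__ == "__main__":
    rewards = [0.0, 0.5, -0.1, 1.0]           # r_t along a short trajectory
    values = [0.2, 0.4, 0.3, 0.6]             # V_φ(s_t) from the critic
    R = discounted_returns(rewards, bootstrap_value=0.25, gamma=0.99)
    print(R)
    print(advantages(R, values))
```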
3. Updating an Actor network
The loss function of the Actor model is as follows:
L_actor(θ) = −E_t[ min( r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t ) ]
where A_t is the advantage value and r_t(θ) is the probability ratio:
r_t(θ) = π_θ_new(a_t | s_t) / π_θ(a_t | s_t)
where π_θ denotes the old Actor model parameters, i.e., the parameters used to collect data by interacting with the environment, and π_θ_new denotes the new Actor parameters updated at each epoch. r_t(θ) represents the ratio of the action probabilities under the new and old strategies.
The truncation (clipping) term in the formula,
min( r_t(θ)·A_t, clip(r_t(θ), 1−ε, 1+ε)·A_t ),
limits r_t(θ) to the range (1−ε, 1+ε), where ε is a hyperparameter specifying the truncation range. The purpose of truncation is to ensure that the difference between the new and old parameters is not too large. When A_t > 0, indicating that the value of this action is higher than average, minimizing the Actor loss causes r_t(θ) to increase, i.e., the probability of the action under the new strategy increases; conversely, when A_t < 0, the probability of the action decreases. The parameter update uses the Adam optimizer with a gradient descent algorithm.
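For illustration only, the clipped Actor loss described above can be sketched in PyTorch as follows; the tensors are placeholders and the function is a sketch of the clipped surrogate objective, not the embodiment's training code.

```python
# Minimal sketch of the clipped surrogate Actor loss with placeholder tensors.
import torch

def actor_loss(new_log_probs: torch.Tensor,
               old_log_probs: torch.Tensor,
               advantages: torch.Tensor,
               epsilon: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_log_probs - old_log_probs)            # r_t(θ)
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)  # clip(r_t(θ), 1-ε, 1+ε)
    # Negative sign: minimizing this loss maximizes the clipped surrogate objective.
    return -torch.min(ratio * advantages, clipped * advantages).mean()

if __name__ == "__main__":
    new_lp = torch.tensor([-0.9, -1.2, -0.4], requires_grad=True)
    old_lp = torch.tensor([-1.0, -1.0, -0.5])
    adv = torch.tensor([0.8, -0.3, 1.5])
    loss = actor_loss(new_lp, old_lp, adv)
    loss.backward()                      # gradients flow to the new policy's log-probs
    print(loss.item(), new_lp.grad)
```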
4. Updating the Critic network:
The loss function of the Critic model is as follows:
L_critic(φ) = E_t[ (R_t − V_φ(s_t))² ]
where R_t is the discount reward, i.e., the target value computed from the rewards returned by the environment, and V_φ(s_t) is the value estimated by the critic. The overall loss is a mean-square loss; as training progresses, the mean squared error between R_t and V_φ(s_t) is minimized so that the critic's estimate V_φ(s_t) approaches R_t, allowing the critic to estimate the value of the current state more accurately.
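A companion sketch for the Critic update, again with placeholder values, showing the mean-square loss between the discount reward R_t and the critic estimate V_φ(s_t):

```python
# Sketch of the Critic mean-square loss with placeholder values.
import torch
import torch.nn.functional as F

def critic_loss(discounted_returns: torch.Tensor, value_estimates: torch.Tensor) -> torch.Tensor:
    return F.mse_loss(value_estimates, discounted_returns)

if __name__ == "__main__":
    R_t = torch.tensor([1.2, 0.4, -0.1])
    V_t = torch.tensor([1.0, 0.6, 0.0], requires_grad=True)
    loss = critic_loss(R_t, V_t)
    loss.backward()                      # gradients flow to the critic's estimates
    print(loss.item(), V_t.grad)
```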
After k epochs of training, the parameters of the old model are updated to those of the new model, the new model parameters are used to interact with the game, and new game data are collected. The training and data-collection processes are performed alternately, i.e., steps 1 to 4 are repeated until training is complete.
In the embodiment of the present invention, the following implementation manner is also provided.
Under the condition that the current intelligent agent executes an action command of shooting operation aiming at the target hostile intelligent agent, acquiring a shooting prop module which is configured by the current intelligent agent and is used for executing the shooting operation;
Acquiring continuous position coordinate information of the shooting prop module at a current time point; the continuous position coordinate information represents position coordinate information of a recorded time range and a predicted time range which are associated with the current time point by the shooting prop module;
Acquiring current prop interaction information of the shooting prop module at the current time point; the current prop interaction information characterizes the shooting prop state of the shooting prop module at the current time point;
based on the continuous position coordinate information and the current prop interaction information, prop operation information of the shooting prop module at the current time point is generated; the prop operation information characterizes an operation mode of the shooting prop module from the current time point to a subsequent time point of the current time point;
And determining the shooting prop state of the shooting prop module at the subsequent time point of the current time point based on the prop operation information and the current prop interaction information, and switching the shooting prop state indicated by the current prop interaction information of the shooting prop module into the shooting prop state at the subsequent time point of the current time point.
In an embodiment of the present invention, the server needs to process an action command of performing a shooting operation with respect to a target hostile agent (e.g., an adversary character) by a current agent (e.g., an AI-controlled character) in an exemplary shooting game. When the server receives an action command of the current agent to perform shooting operation for the target hostile agent, it first acquires a shooting prop module configured by the current agent for performing shooting operation. The firing prop module may be an AI-equipped weapon, such as a pistol, rifle, or the like. The server successfully acquires the shooting prop module information of the current intelligent agent configuration. In order to accurately simulate the behavior and effect of a shooting prop module, the server needs to know the position information of the shooting prop module at different time points. The server acquires continuous position coordinate information of the shooting prop module at the current time point. Such information includes position coordinates of a recorded time range (e.g., within the past few seconds) and position coordinates of a predicted time range (e.g., within the next few seconds). This may be achieved by physical engine calculations, interpolation algorithms or predictive models. The server obtains continuous position coordinate information of the shooting prop module in a period of time. In addition to the location information, the server needs to know the state information of the shooting prop module at the current point in time in order to properly handle its interaction with the environment. The server acquires the current prop interaction information of the shooting prop module at the current time point. Such information may include status information of the number of ammunition in the cartridge clip, the firing rate of the weapon, whether ammunition is being loaded, etc. The server successfully acquires the state information of the shooting prop module at the current time point. Based on the acquired position coordinate information and prop interaction information, the server needs to generate operation information of the shooting prop module at the current time point. And the server combines the continuous position coordinate information and the current prop interaction information to generate prop operation information of the shooting prop module at the current time point. Such information describes how the shooting prop module should operate, such as firing, loading ammunition, switching shooting modes, etc., from the current point in time to a subsequent point in time (e.g., the next frame or within a few seconds of the future). The server generates operational mode information of the shooting prop module at the current point in time. According to the generated prop operation information, the server needs to determine the state of the shooting prop module at the subsequent time point and execute corresponding state switching. And the server determines the shooting prop state of the shooting prop module at the subsequent time point of the current time point based on prop operation information and current prop interaction information. For example, if the operational information indicates firing and the current ammunition is sufficient, the state at the subsequent point in time may be a decrease in the number of ammunition, smoke generation at the muzzle, or the like. 
And then, the server switches the state of the shooting prop module indicated by the current prop interaction information into the state of a subsequent time point. The server successfully determines the state of the shooting prop module at the subsequent time point and executes corresponding state switching. This makes the shooting operation in the game more realistic and accurate.
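By way of non-limiting illustration, the following Python sketch shows one possible way to organize the flow described above (acquire the prop module's interaction information, generate prop operation information, then determine and switch to the state at the subsequent time point). The class names, fields and state labels are assumptions introduced here for clarity, not part of the embodiment.

```python
from dataclasses import dataclass, field

@dataclass
class PropInteraction:
    """Current prop interaction information at the current time point (assumed fields)."""
    ammo: int
    firing_rate: float            # rounds per second
    reloading: bool
    state: str = "ready"          # current shooting prop state

@dataclass
class ShootingProp:
    positions: list = field(default_factory=list)   # continuous position coordinates (x, y, z)
    interaction: PropInteraction = None

def generate_operation_info(positions, interaction):
    """Combine continuous position coordinates and interaction info into operation info."""
    if interaction.reloading:
        return {"action": "wait_for_reload"}
    if interaction.ammo > 0:
        return {"action": "fire", "muzzle_position": positions[-1]}
    return {"action": "reload"}

def switch_prop_state(operation, interaction):
    """Determine the shooting prop state at the subsequent time point and switch to it."""
    if operation["action"] == "fire":
        interaction.ammo -= 1
        interaction.state = "firing"
    elif operation["action"] == "reload":
        interaction.reloading = True
        interaction.state = "reloading"
    else:
        interaction.state = "waiting"
    return interaction.state

prop = ShootingProp(positions=[(0.0, 0.0, 1.5), (0.1, 0.0, 1.5)],
                    interaction=PropInteraction(ammo=12, firing_rate=10.0, reloading=False))
op = generate_operation_info(prop.positions, prop.interaction)
print(op["action"], "->", switch_prop_state(op, prop.interaction))   # fire -> firing
```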
In order to more clearly describe the solution provided by the embodiments of the present invention, the following description is made in more detail. In games, a shooting prop module generally refers to a virtual weapon or equipment for performing a shooting operation. This module contains the appearance of the weapon, performance parameters (e.g. range, speed, injury value, etc.), and interaction logic with other game elements.
In the embodiment of the present invention, the action command of the shooting operation is a specific instruction issued by the current agent for performing the shooting action, by way of example. This command may contain information about the target being shot, the weapon being used, the strength and direction of the shot, etc. In games, a shooting prop module generally refers to a virtual weapon or equipment for performing a shooting operation. This module contains the appearance of the weapon, performance parameters (e.g. range, speed, injury value, etc.), and interaction logic with other game elements. In a shooting game, the AK-47 rifle in the hand of an AI character is a shooting prop module. This module not only determines the visual effect of AI shooting, but also defines the accuracy of shooting, recoil, etc. physical characteristics. In the context of game programming, "retrieving" generally refers to the process of retrieving specific information or resources from a memory, database, or other storage location of a game. Such information or resources may be data regarding game status, character attributes, prop details, etc. When the server needs to handle the shooting operation, it must first "acquire" the information of the shooting prop module configured by the current agent. This includes the type of weapon, ammunition status, whether or not an accessory (such as a scope) is equipped, etc. The server retrieves this information by accessing the memory or database of the game. When the server receives an action command of a shooting operation from the current agent, it retrieves the relevant information of the shooting prop module configured by the current agent so as to correctly process and execute the shooting operation. This includes knowing the type, state, and interaction with other game elements, etc. of the weapon. The current point in time refers to the specific point in time that the system is currently processing or considering during analog or real-time operations. In game development, this is typically referred to as the moment when the game engine updates the world state. In a 60 frame per second game, every 1/60 second is a new current point in time, at which point the game engine updates the status of all objects in the game (including the shooting prop modules). The continuous position coordinate information refers to continuous position information of a certain object (in this example, a shooting prop module) in space for a period of time. Such position information is typically represented in the form of coordinates, such as (x, y, z) coordinates in three-dimensional space. When AI fires, a bullet is ejected from the muzzle and flies along a specific trajectory. In this process, the position coordinates of the bullet are updated for each frame (i.e., for each current point in time) to form a continuous sequence of position coordinate information. The recorded time range refers to a time range that has been recorded and processed by the system before the current point in time. In this time frame, the position coordinate information of the shooting prop module is recorded. Let us assume that we have a system that records the trajectory of the bullet flight in the last 5 seconds. If the current time point is the 3rd second after bullet firing, then the recorded time range is the first 3 seconds after bullet firing. The predicted time range refers to a time range predicted by the system based on existing information (such as the speed, direction, etc. 
of the object) after the current time point. In this time frame, the position coordinates of the firing prop module are calculated based on a predictive algorithm. Continuing with the example above, if we have a system that predicts the next 2 seconds of flight trajectory of a bullet, then the predicted time frame is 2 seconds after the current point in time. The system predicts the location coordinates of the bullet within 2 seconds of the future based on the current speed, direction, and other relevant factors of the bullet. When the server processes the shooting prop module, the server can acquire continuous position coordinate information of the shooting prop module at the current time point. The information includes actual position coordinates of the prop over a period of time (recorded time range) and predicted position coordinates over a period of time in the future (predicted time range). These data are critical to accurately simulating the behavior and effects of shooting prop modules in the game world.
In addition, in the embodiment of the present invention, the current prop interaction information refers to information generated or required by the shooting prop module when interacting with the game environment or other objects at the current time point. This information reflects the current state, behavior, and interrelationship of the prop with the surrounding environment. When the AI is aiming using a sniper rifle, the current prop interaction information may include the sight position of the sighting telescope, the remaining amount of ammunition, the recoil state of the weapon, etc. This information is necessary for the server to handle the firing operation, as it directly affects the outcome and effect of the firing. The shooting prop state refers to a specific state of the shooting prop module at the current time point. This includes the active state of the prop, ammunition state, cooling time, accessory state (e.g., whether a muffler or sight is provided), etc. If the AI sniper rifle is currently in a state of being charged with ammunition, its shooting prop state is "in charge". In this state, the AI cannot perform the shooting operation until the loading is completed. The server obtains the current prop interaction information to know the state of the prop, and judges whether the AI can execute shooting operation and the result of the shooting operation according to the state. When the server processes shooting operation, current prop interaction information of the shooting prop module at the current time point needs to be acquired. The information reflects the current state of the prop and the interaction condition with the game environment, and is an important basis for the server to judge whether the shooting operation is effective and what effect is generated. by acquiring and processing this information, the server can accurately simulate the behavior and effects of the shooting prop module in the game world. The prop operation information refers to a set of instructions or data generated based on continuous position coordinate information and current prop interaction information, and is used for describing an operation mode or a behavior mode of the shooting prop module from a current time point to a subsequent time point. Assuming that AI is aiming a shot using a sniper rifle, the prop operating information may include coordinates of aiming points, shooting strength, shooting frequency, etc. The information is calculated by the game engine according to the input of the AI and the current state of the prop module, and is used for guiding the prop module to perform shooting operation at the subsequent time point. The subsequent time points of the current time point refer to a certain time point or time points within a period of time after the current time point. In a game process, it is often necessary to predict or calculate the behavior or state of a prop module over a period of time in the future in order to make accurate game logic decisions and picture rendering. If the current point in time is the 10 th second in the game, the subsequent point in time may be the 11 th second, the 12 th second, etc. At these later points in time, the game engine updates the state and behavior of the shooting prop module based on prop operation information, such as updating the position of the bullet, determining whether the target is hit, etc. 
The server generates prop operation information based on the continuous position coordinate information and the current prop interaction information when processing the shooting operation. The information describes the operation mode or behavior mode of the shooting prop module from the current time point to the subsequent time point, and is an important basis for the server to perform game logic judgment and picture rendering. By accurately calculating and updating this information, the server can simulate a real and fluent shooting experience.
In addition, in the embodiment of the present invention, switching the shooting prop state refers to a process of transferring the prop module from one state to another state according to game logic and AI operation. This typically involves updating the internal properties and behavior patterns of the prop module. When the AI performs a reload, the prop module will switch from a "shootable" state to an "in-charge" state. During this process, the game engine may update the properties of the prop module, such as the number of ammunition, the loading progress, etc., and may temporarily disable the firing function until loading is complete. When the server processes shooting operation, firstly, the shooting prop state of the shooting prop module at a subsequent time point is determined based on prop operation information and current prop interaction information. Then, the server switches the prop module from the state indicated by the current prop interaction information to the state at the subsequent time point. This process involves updating of prop module attributes and adjustment of behavior patterns to ensure continuity of game logic and fluency in the player experience.
In the embodiment of the invention, the continuous position coordinate information is formed based on past position coordinate information and expected position coordinate information; the past position coordinate information is position coordinate information of the shooting prop module in a recorded time range associated with the current time point; the expected position coordinate information is obtained after analyzing initial expected position coordinate information and an operation command of the shooting prop module at the current time point, and the initial expected position coordinate information is position coordinate information of a predicted time range associated with the shooting prop module at the current time point and output based on a pre-trained position prediction model.
In the embodiment of the present invention, the server needs to handle the pistol module used by the AI in a first person shooter game, for example. The current time point is 10 seconds, and the server has recorded the position coordinate information of the pistol module from 1 st second to 10 th second. These position coordinate information constitute past position coordinate information of the pistol module. This information reflects the trajectory and position changes of the pistol module over the past 10 seconds. In the same shooting game, the server also uses a pre-trained position prediction model to predict the position coordinate information of the pistol module over a future period of time (e.g., the next 2 seconds). This prediction is based on the movement state, speed, direction, etc. of the pistol module over a period of time (e.g., the first 8 seconds). The predicted position coordinate information is referred to as initial expected position coordinate information. It represents the position that the pistol module would be expected to reach without intervention of a new AI operation command. At this point in time of 10 seconds, the AI simulation presses the left mouse button to fire and adjusts the aiming direction of the pistol slightly. These operational commands are captured and analyzed by the server. The server needs to evaluate the impact of these operating commands on the future position of the pistol module. The server, through analysis, finds that the firing operation of the AI will cause the pistol module to generate recoil, thereby changing its original flight trajectory. At the same time, adjustment of the aiming direction also affects the future position of the pistol module. Therefore, the server needs to adjust the initial expected position coordinate information by combining these factors, and obtains the adjusted expected position coordinate information. This adjusted information more accurately reflects the intended location of the pistol module after the AI operation is considered. Finally, the server combines the past position coordinate information (position coordinates of 1 st to 10 th seconds) with the adjusted expected position coordinate information (predicted position coordinates of 11 th to 12 th seconds) to form continuous position coordinate information of the pistol module. This continuous position coordinate information includes not only the past position of the pistol module, but also the future expected position after the AI operation is considered. The server uses this information to accurately simulate the movement and firing of the pistol module in the game world.
In the embodiment of the present invention, the following implementation manner is also provided.
Acquiring a plurality of historical moments in a recorded time range associated with the current time point;
Acquiring track coordinate points, orientation data and speed data of the shooting prop module at each historical moment in the plurality of historical moments;
And generating the past position coordinate information based on the track coordinate point, the orientation data and the speed data of the shooting prop module at each historical moment.
In an exemplary embodiment of the invention, the server is processing a sniper rifle module used by the AI in a combat simulation shooting game. The current time point is the 20th second in the game. In order to generate the past position coordinate information, the server first needs to acquire a plurality of historical moments within the recorded time range associated with the 20th second. For example, the server may take every 0.5 seconds as one historical moment, so that it acquires a total of 40 historical moments (0.5 seconds, 1 second, 1.5 seconds, ..., 20 seconds). For each of these 40 historical moments, the server obtains the position coordinates (trajectory coordinate points), orientation (e.g., aiming direction) and speed data of the sniper rifle module while it moves or shoots. These data completely record the position, direction and speed changes of the sniper rifle module in the game world. At the 5th second, the position coordinates of the sniper rifle module are (100, 200, 300), the orientation is northeast, and the speed is 0 (since the AI may be aiming rather than moving). At the 10th second, the position coordinates become (110, 210, 310), the orientation becomes due east, and the speed is still 0. At the 15th second, the position coordinates become (115, 215, 315), the orientation becomes southwest, and the speed becomes 2 meters/second because the AI makes a slight movement and aiming adjustment. Similarly, the server obtains such data for all 40 historical moments. The server now has the trajectory coordinate points, orientation and speed data of the sniper rifle module at each historical moment. These data are used to generate the past position coordinate information, a continuous data set that records in detail the position changes of the sniper rifle module within the recorded time range (from the 1st second to the 20th second). The server may store these data in some form (e.g., a data structure or graph) in memory for use in subsequent game-logic processing such as shooting operations, collision detection and visual rendering. For example, when the AI fires at the 20th second, the server can use these past position coordinates to calculate the initial flight trajectory and final drop point of the bullet.
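A minimal Python sketch of how such past position coordinate information could be assembled from a pre-sampled trajectory log is given below; the 0.5-second sampling interval, record fields and toy coordinates mirror the example above and are illustrative assumptions only.

```python
def build_past_position_info(track_log, current_t, recorded_range):
    """track_log: list of {'t', 'coord', 'heading', 'speed'} records, already sampled
    at the game's logging interval; keep those inside the recorded time range."""
    start = current_t - recorded_range
    return [rec for rec in track_log if start < rec["t"] <= current_t]

# toy log sampled every 0.5 s: a sniper rifle module that is mostly stationary
track_log = [{"t": 0.5 * k,
              "coord": (100 + 0.5 * k, 200 + 0.5 * k, 300.0),
              "heading": 45.0,                       # aiming direction in degrees
              "speed": 0.0 if 0.5 * k < 15 else 2.0}  # slight movement after 15 s
             for k in range(1, 41)]                   # 0.5 s .. 20.0 s

past = build_past_position_info(track_log, current_t=20.0, recorded_range=20.0)
print(len(past), past[-1])    # 40 records; the last one is the 20 s sample
```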
In the embodiment of the invention, the initial expected position coordinate information is obtained by prediction based on the pre-trained position prediction model in the process of determining the shooting prop state of the shooting prop module at the current time point at the moment above the current time point; the initial expected position coordinate information is composed of track coordinate points, orientation data and speed data of the shooting prop module at each predicted time.
In the embodiment of the invention, at the moment immediately preceding the current time point, the server predicts the shooting prop state of the shooting prop module at the current time point by using the pre-trained position prediction model, and obtains the initial expected position coordinate information based on the prediction result. The initial expected position coordinate information includes trajectory coordinate points, orientation data and speed data at a plurality of predicted moments. Consider an online multiplayer shooting game in which the server is handling the state of a machine gun module used by the AI character named "AI Character A". The behavior of the machine gun module in the game is affected by a number of factors, including player input, the game physics engine, and environmental interactions. Assuming the current time point is the 10th second, then at the preceding moment, i.e., at the end of the 9th second, the server performs a series of calculations and predictions. These calculations are typically done to maintain the real-time responsiveness of the game and smooth animation effects. The server has a pre-trained position prediction model, which may be based on machine learning algorithms (e.g., neural networks), and has been trained to predict changes in the position of the shooting prop module over a given time horizon. For example, the model may have learned the behavior pattern of the machine gun module under different AI operations (e.g., movement, aiming, shooting) and environmental conditions (e.g., wind speed, terrain, etc.). At the end of the 9th second, the server uses this prediction model to predict the shooting prop state of the machine gun module at the 10th second (the current time point). The predicted state may include the expected position, orientation and firing rate of the gun. The prediction model outputs a series of data that constitutes the initial expected position coordinate information. This information includes the expected position coordinates, orientation data (e.g., the angle the muzzle points at) and speed data (e.g., movement speed and firing rate) at each predicted moment (e.g., every 0.1 seconds) within the 10th second. For example, the prediction model may output the following information: at 10.0 seconds, the position coordinates of the machine gun module are (X1, Y1, Z1), the orientation is due north, and the firing rate is 600 rounds per minute; at 10.1 seconds, the predicted position is (X2, Y2, Z2), the orientation turns slightly northeast, and the firing rate remains unchanged; and so on until the predicted time range ends. These data provide a basis for the server to calculate the behavior of the machine gun module of AI Character A in the near future and provide the necessary information for other AI and game systems (e.g., collision detection, visual rendering, etc.). Meanwhile, the server can update and adjust the predictions in real time according to actual game conditions (such as new AI inputs, changes in the game environment, and the like).
In the embodiment of the invention, the operation command comprises a direction vector and a speed value for indicating the shooting prop module to move; the embodiment of the invention also provides the following implementation modes.
Based on the direction vector in the operation command, performing correction operation on the orientation data of the shooting prop module at each prediction moment in the initial expected position coordinate information to obtain a correction direction vector of the shooting prop module at each prediction moment;
Based on the direction vector and the speed value in the operation command, performing correction operation on the speed data of the shooting prop module at each prediction moment in the initial expected position coordinate information to obtain a correction speed value of the shooting prop module at each prediction moment;
Based on the correction moving path points and the correction speed values matched with the precursor predicted time of the plurality of predicted time, performing correction operation on the track coordinate points matched with the target predicted time of the plurality of predicted time to obtain correction moving path points matched with the shooting prop module at the target predicted time; the preamble prediction time is the time before the target prediction time;
And obtaining the expected position coordinate information based on the correction direction vector, the correction speed value and the correction moving path point of the shooting prop module at each prediction moment.
In the exemplary embodiment of the present invention, consider an online first person shooter game, AI controlling a character and holding a module of properties that can be shot, such as a rifle. The server is responsible for processing game logic, including AI input, prop module status updates, and physical simulation of the game world. The server receives operational commands from the AI, including a direction vector and a velocity value that indicate movement of the shooting pot module. For example, AI may control the forward, backward and left-right movements of a character by simulating pressing W, A, S, D keys on a keyboard, while simulating mouse movements is used to control the aiming direction of a shooting prop module. The server firstly corrects the orientation data of the shooting prop module at each prediction moment in the initial expected position coordinate information according to the direction vector in the operation command. This means that the server will adjust the orientation of the prop module at each predicted time to make it more in line with the actual inputs of AI. If the AI aims the prop module in one direction by simulating the mouse operation, the server adjusts the orientation data of the prop module at each prediction moment according to the input, so as to ensure that the aiming direction of the prop module is consistent with the input of the AI. And then, the server corrects the velocity data of the shooting prop module at each prediction moment in the initial expected position coordinate information according to the direction vector and the velocity value in the operation command. This involves adjusting the speed of movement and firing rate of the prop module at each predicted moment to reflect the AI input. If the AI simulates pressing the W key to move the character forward, the server will increase the speed of the prop module forward at each predicted time based on this input. Likewise, if the AI simulates pressing the S key to back the character, the server will decrease the forward speed or increase the back speed. The server then corrects the locus coordinate point at which the target predicted time matches based on the corrected moving path point and the corrected speed value at which the preamble predicted time (i.e., the time immediately preceding the target predicted time) at the plurality of predicted times matches. This means that the server will adjust the prop module position at the current moment according to the correction result at the previous moment. If at the previous predicted time the prop module has moved a distance to the left due to AI input, the server will take this movement distance into account to adjust the position coordinate point of the prop module at the current predicted time. And finally, the server obtains corrected expected position coordinate information based on the correction direction vector, the correction speed value and the correction moving path point of the shooting prop module at each prediction moment. This information more accurately reflects the future expected position and behavior of the prop module after consideration of the AI input. After correction by the above steps, the server now has an updated, more accurate set of expected position coordinate information containing the expected position, orientation and velocity of the prop module at each predicted time. This information will be used in subsequent game logic processing such as collision detection, visual rendering, etc.
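The correction steps above can be sketched as follows in Python. The blending rule, the treatment of orientation as a unit direction vector and the 0.1-second step are assumptions made for illustration; the embodiment only requires that orientation, speed and path points be corrected in the stated order, with each target moment building on the corrected point of the preceding predicted moment.

```python
import numpy as np

def correct_prediction(initial_pred, cmd_dir, cmd_speed, dt=0.1, blend=0.5):
    """initial_pred: list of {'coord', 'heading', 'speed'} per predicted moment.
    cmd_dir / cmd_speed: direction vector and speed value from the operation command.
    Returns corrected headings, speeds and moving path points."""
    cmd_dir = np.asarray(cmd_dir, dtype=float)
    cmd_dir = cmd_dir / (np.linalg.norm(cmd_dir) + 1e-8)
    corrected = []
    prev_point = np.asarray(initial_pred[0]["coord"], dtype=float)
    for step in initial_pred:
        # correct orientation: blend the predicted heading toward the commanded direction
        heading = (1 - blend) * np.asarray(step["heading"], dtype=float) + blend * cmd_dir
        heading = heading / (np.linalg.norm(heading) + 1e-8)
        # correct speed: blend the predicted speed toward the commanded speed value
        speed = (1 - blend) * step["speed"] + blend * cmd_speed
        # path point at this target moment builds on the corrected point of the
        # preceding (preamble) predicted moment
        prev_point = prev_point + heading * speed * dt
        corrected.append({"heading": heading, "speed": speed, "coord": prev_point.copy()})
    return corrected

pred = [{"coord": (0, 0, 0), "heading": (1, 0, 0), "speed": 1.0} for _ in range(5)]
out = correct_prediction(pred, cmd_dir=(0, 1, 0), cmd_speed=2.0)
print(out[-1]["coord"])   # corrected expected position at the last predicted moment
```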
In the embodiment of the invention, the shooting prop module is configured with a plurality of dynamic nodes in virtual construction; the step of obtaining the current prop interaction information of the shooting prop module at the current time point can be implemented through the following example implementation.
Acquiring the number of rotation shafts matched with each dynamic node of the shooting prop module respectively;
Calculating the rotation quantity of each dynamic node of the shooting prop module at the current time point under the matched rotation axis number;
And generating the current prop interaction information based on the rotation amount of each dynamic node under each matched rotation axis number.
In the exemplary embodiment of the invention, consider an on-line first person shooting game, the AI holds a complex prop module that can shoot and has multiple movable parts (e.g., barrel, body, scope, etc.), such as an assault rifle that can fold the stock, adjust the scope magnification, and change the grip position. These movable components are controlled in the game by a plurality of dynamic nodes of virtual construction. The server needs to process the motion of the dynamic nodes in real time to respond to the operation of the AI and generate corresponding prop interaction information. First, the shooting prop module is already configured with multiple dynamic nodes of virtual construction at the design stage of the game. The nodes correspond to different movable components in the prop module and each node has its own number of axes of rotation defining in which directions the node can freely rotate. The scope may have a dynamic node whose rotational axis number allows it to rotate in both the horizontal and vertical directions so that the AI can adjust the scope's field of view; the stock may have another dynamic node whose rotational axis number allows it to fold or unfold in one direction. When the server needs to process the shooting prop module, it firstly obtains the rotation axis number matched with each dynamic node of the shooting prop module. This information is typically pre-set at the time of game design and stored in the configuration file of the game. The server queries the configuration file to see that the dynamic node of the scope has two rotational axes (one for horizontal rotation and one for vertical rotation) and the dynamic node of the stock has one rotational axis (for fold/unfold). Next, the server calculates the amount of rotation of each dynamic node of the shooting prop module at the current point in time at each of the matched numbers of rotation axes. This calculation is typically based on the inputs of the AI and the simulation results of the game physics engine. If the AI presses a key for adjusting the magnification of the sighting telescope in the game, the server calculates the rotation amount of the sighting telescope dynamic node under the vertical rotation axis number according to the input. Likewise, if the AI has operated the action of folding the stock, the server calculates the amount of rotation of the stock dynamic node with its rotational axis number. And finally, the server generates current prop interaction information based on the rotation amount of each dynamic node under each matched rotation axis number. This information is a data set that contains the current position and rotation status of all dynamic nodes in the shooting pot module. The server gathers the rotation amounts of the dynamic nodes such as the sighting telescope, the gun stock and the like into a data structure, and current prop interaction information is formed. This information is then used to update the prop module status in the game screen, handle interactions with other game objects (e.g., collision detection), and communicate to other game systems that require this data (e.g., network synchronization, animation rendering, etc.).
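As a simplified illustration of the three steps above, the sketch below represents each dynamic node by its name, its matched rotation axes and a nominal rotation rate, and collects the per-axis rotation amounts into one data set. Node names, axis labels and rates are assumptions, not values prescribed by the embodiment.

```python
def rotation_amounts(node, inputs, dt):
    """Rotation of one dynamic node around each of its matched rotation axes
    for the current time step (axis names and input mapping are illustrative)."""
    return {axis: inputs.get((node["name"], axis), 0.0) * node["deg_per_sec"] * dt
            for axis in node["axes"]}

def current_prop_interaction(nodes, inputs, dt=1 / 60):
    """Gather the per-axis rotation of every dynamic node into one data set."""
    return {node["name"]: rotation_amounts(node, inputs, dt) for node in nodes}

nodes = [
    {"name": "scope", "axes": ["horizontal", "vertical"], "deg_per_sec": 30.0},
    {"name": "stock", "axes": ["fold"], "deg_per_sec": 90.0},
]
# AI input this frame: adjust the scope upward and start folding the stock
inputs = {("scope", "vertical"): 1.0, ("stock", "fold"): 1.0}
print(current_prop_interaction(nodes, inputs))
```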
In the embodiment of the invention, the shooting prop module is configured with a plurality of dynamic nodes in virtual construction; each dynamic node is configured with a matched number of rotation axes; the step of generating the prop operation information of the shooting prop module at the current time point based on the continuous position coordinate information and the current prop interaction information may be implemented by the following example.
Calculating the reference rotation quantity of each dynamic node of the shooting prop module from the current time point state to the subsequent time point of the current time point under the matched rotation axis number of each dynamic node based on the pre-trained position prediction model according to the continuous position coordinate information and the current prop interaction information;
and generating the prop operation information based on the calculated reference rotation quantity of each dynamic node under the corresponding rotation axis number.
In the exemplary embodiment of the invention, consider an on-line shooting game in which an AI-controlled character holds a shooting prop module, such as an assault rifle, that can adjust the positions of various components. This rifle is designed in a game with multiple dynamic nodes, each of which can be rotated about a particular axis of rotation. The server needs to process the movements of these nodes and generate prop operation information based on the continuous position coordinate information and the current prop interaction information. During the game design process, the shooting stage module is configured with a plurality of virtually structured dynamic nodes. Each dynamic node has its matching number of axes of rotation defining the direction and degree of freedom in which the node can rotate. The telescope of the rifle may be rotated along a vertical axis to adjust the aiming angle, while the grip may be rotated along a horizontal axis to change the grip. Each such rotational action corresponds to a dynamic node and one or more rotational axis numbers. When the server processes the shooting prop module, the server can calculate the reference rotation quantity of each dynamic node under each matched rotation axis number according to continuous position coordinate information and current prop interaction information by utilizing a pre-trained position prediction model. The calculation process considers the historical motion trail of the prop module, the current input of the AI and the simulation result of the game physical engine. If the AI adjusts the scope through an input device (e.g., mouse, keyboard) at successive points in time, the server will collect these input data and combine with the position prediction model to calculate how much the scope dynamic node should be rotated in the number of vertical axes of rotation to achieve the AI's desired effect. The server generates prop operation information based on the estimated reference rotation amount of each dynamic node at the corresponding number of each rotation axis. This information is an instruction set that tells the game engine how to update the state of the shooting prop module to achieve the operational intent of the AI. If it is deduced that the telescope needs to be rotated 10 degrees along the vertical axis and the grip needs to be rotated 5 degrees along the horizontal axis, the server will generate prop operation information containing these rotation instructions. The game engine then uses this information to update the performance of the shooting pot module in the game world, ensuring that the AI operation is properly responded to. Through the steps, the server can accurately process the movements of a plurality of dynamic nodes in the shooting prop module, and corresponding prop operation information is generated according to real-time input of AI and game physical simulation results. The processing mode ensures that the shooting prop module in the game has high interactivity and reality, and provides more immersive game experience for players.
In the embodiment of the present invention, the step of calculating, based on the pre-trained position prediction model, the reference rotation amount of each dynamic node of the shooting prop module from the current time point state to the subsequent time point of the current time point under the matched number of each rotation axis according to the continuous position coordinate information and the current prop interaction information may be implemented by the following example implementation.
Calculating a normal distribution center value for indicating dynamic change of each dynamic node from the current time point to a subsequent time point of the current time point under the corresponding rotation axis number based on the continuous position coordinate information and the current prop interaction information based on the pre-trained position prediction model; the normal distribution center value comprises component values of a plurality of aspects, and the component value of one aspect in the normal distribution center value is used for indicating dynamic change of one dynamic node on the corresponding rotation axis number;
Acquiring target normal distribution determined by the normal distribution central value, and respectively carrying out random extraction on the target normal distribution aiming at each aspect of the normal distribution central value to obtain random rotation quantity matched with each aspect;
accumulating the component values matched in each aspect and the random rotation quantity respectively to obtain a reference rotation quantity of each dynamic node under the matched rotation axis number; one reference rotation amount is an accumulated value of a component value to which the corresponding aspect belongs and a random rotation amount.
In the exemplary embodiment of the invention, consider an on-line shooting game in which a server is handling an AI-controlled shooting prop module, such as a rifle having a plurality of adjustable components (e.g., sighting telescope, stock, etc.). These components perform various actions in the game through dynamic nodes and rotational axis numbers. The server needs to predict its future state changes based on the AI's operation and the prop module's historical state. The server firstly utilizes a pre-trained position prediction model, combines continuous position coordinate information and current prop interaction information to calculate a normal distribution center value of dynamic change of each dynamic node from a current time point to a subsequent time point under the corresponding rotation axis number. This central value is in effect a multidimensional vector in which each component corresponds to the expected amount of change in the number of particular axes of rotation for a particular dynamic node. If the riflescope of the rifle is adjustable by two axes of rotation (e.g. horizontal and vertical directions), the server will calculate a normal distribution center value component for each of these two directions. These components will be estimated based on factors such as the historical operation of the AI, the current position of the scope, and the simulation results of the game physics engine. Next, the server determines a target normal distribution based on the calculated normal distribution center value, and randomly extracts a rotation amount from this distribution for each component (i.e., the change in the number of each rotation axes of each dynamic node). This random extraction process increases the diversity of predicted outcomes and game uncertainty so that each AI operation produces slightly different outcomes. The server may extract a small random amount of rotation for the scope in the horizontal direction and a large random amount of rotation in the vertical direction. These random values reflect the uncertainty of the AI operation and the physical simulation error of the game world. And finally, accumulating each component value (namely each component of the normal distribution center value) and the corresponding random rotation quantity by the server to obtain the final reference rotation quantity of each dynamic node under the matched rotation axis number. This reference rotation will be used to update the state of the shooting prop module and generate the corresponding prop operation information. If the server estimates that the normal distribution central value component of the sighting telescope in the horizontal direction is 5 degrees, the randomly extracted rotation amount is-1 degree, and the final reference rotation amount is 4 degrees (5 degrees-1 degree). Likewise, if the central value component in the vertical direction is 3 degrees and the randomly extracted rotation amount is 2 degrees, the final reference rotation amount will be 5 degrees (3 degrees+2 degrees). The server will use these reference rotation amounts to update the position and state of the scope in the game. Through the steps, the server can predict future state changes of the prop module based on AI operation and historical states of the prop module, and corresponding prop operation information is generated. 
The method not only considers the intention of AI and the simulation result of the game physical engine, but also increases the diversity and the interest of the game by introducing randomness.
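One possible reading of this sampling rule is sketched below: the position prediction model supplies a normal-distribution center value (mean) per dynamic node and rotation axis, a random rotation is drawn from the distribution it determines, and the two are accumulated into the reference rotation amount. The fixed standard deviation and the node/axis keys are assumptions.

```python
import numpy as np

def reference_rotations(center_values, sigma=1.0, rng=None):
    """center_values: {(node, axis): normal-distribution center value in degrees}
    predicted by the position prediction model.  A random rotation is drawn around
    each center value and accumulated with it (sigma is an assumed constant)."""
    rng = np.random.default_rng(rng)
    out = {}
    for key, mu in center_values.items():
        random_part = rng.normal(loc=0.0, scale=sigma)   # random extraction per aspect
        out[key] = mu + random_part                      # component value + random rotation
    return out

centers = {("scope", "horizontal"): 5.0, ("scope", "vertical"): 3.0, ("stock", "fold"): 0.0}
print(reference_rotations(centers, sigma=1.0, rng=42))
```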
In order to more clearly describe the solution provided by the embodiments of the present invention, an overall framework is provided in the following technical solutions.
1) Entity scalar encoder (Entities Encoder):
Input: information including current Agent (current Agent) and enemy Agents (enemy Agent): such as blood volume, camping, range, orientation, radiation (whether there is an obstacle), cartesian coordinates xy, enemy FSM state (shooting state, moving state, stationary state, etc.), etc. The values are normalized to account for the dimensional differences in the features that may lead to numerical instability.
Structure: the Entities Encoder is a multilayer perceptron (MLP). The scalars of the current Agent use an independent MLP, while the scalars of the other Agents share the parameters of one MLP. The feature vectors produced by all scalar encodings have the same dimension, which facilitates subsequent feature fusion. The Entities Encoder may also be implemented with, but is not limited to, other forms of networks (e.g., CNN, RNN, LSTM, etc.).
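A minimal PyTorch sketch of such an Entities Encoder is shown below, assuming 10-dimensional normalized scalars and a 64-dimensional output feature: the current Agent has its own MLP while all enemy Agents share one MLP, and both branches emit features of the same dimension.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class EntitiesEncoder(nn.Module):
    """Current-agent scalars get their own MLP; all enemy agents share one MLP.
    Both branches output the same feature dimension (dimensions are assumptions)."""
    def __init__(self, scalar_dim=10, feat_dim=64):
        super().__init__()
        self.self_mlp = mlp(scalar_dim, feat_dim)
        self.enemy_mlp = mlp(scalar_dim, feat_dim)      # parameters shared across enemies

    def forward(self, self_scalars, enemy_scalars):
        # self_scalars: (scalar_dim,), enemy_scalars: (num_enemies, scalar_dim), pre-normalized
        q = self.self_mlp(self_scalars)
        k = self.enemy_mlp(enemy_scalars)
        return q, k

enc = EntitiesEncoder()
q, k = enc(torch.rand(10), torch.rand(10, 10))
print(q.shape, k.shape)   # torch.Size([64]) torch.Size([10, 64])
```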
2) Target enemy selection module (attention+ Gumbel Softmax Sampling):
input: the characteristic vector of the current Agent code and the characteristic vector of the enemy Agents entity code.
Model structure: the enemy selection module employs an attention mechanism (Attention). The principle is that the feature encoding of the current Agent is used as the query and dot products are taken with the encodings of the enemies.
Assuming that the Agent feature vector is q and the feature vector of the i-th enemy is k_i, and assuming that there are at most 10 enemies, i ∈ {0, 1, 2, ..., 9}. The dot products of q with each k_i can be expressed as a matrix-vector multiplication, i.e.
a = qK^T = [q·k_0, q·k_1, ..., q·k_9]
where K is the matrix whose rows are the enemy feature vectors k_i. The result a of the matrix multiplication is a vector with one component per enemy entity, representing the correlation with that enemy. Normalizing with Softmax gives the discrete action probability of selecting each enemy, namely
π = Softmax(a) = [π_1, π_2, π_3, ..., π_i]
After the probability distribution π is obtained, random sampling according to π is needed during model inference to determine the enemy selected by the current Agent. However, neither argmax nor random sampling allows the gradient to be propagated back. To preserve differentiability during training, Gumbel Softmax Sampling is applied to the training of this module, so that the neural network can sample from the discrete distribution while the gradient can still be computed for back propagation. The formula is as follows:
y_k = exp((log(π_k) + g_k)/τ) / Σ_j exp((log(π_j) + g_j)/τ)
wherein k ∈ {0, 1, 2, ..., i} denotes the k-th of the i enemy entities;
the parameter τ > 0 is called the temperature coefficient, and the g_k are independent and identically distributed random variables drawn from the standard Gumbel distribution. g_k is generated from a uniform distribution by inverting the Gumbel distribution, i.e.:
g_k = -log(-log(x_k)), where x_k ~ U(0, 1)
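The attention-based selection and Gumbel Softmax Sampling described above can be sketched in PyTorch as follows; the straight-through (hard) variant and the feature dimensions are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def select_target_enemy(q, k, tau=1.0, training=True):
    """q: (feat,) current-agent feature, k: (num_enemies, feat) enemy features.
    a_i = q · k_i gives per-enemy correlations; Softmax turns them into the
    selection distribution; Gumbel-Softmax keeps sampling differentiable."""
    a = k @ q                              # (num_enemies,) dot-product attention scores
    pi = F.softmax(a, dim=-1)              # discrete selection probabilities
    if training:
        # straight-through Gumbel-Softmax: one-hot forward pass, soft gradients backward
        one_hot = F.gumbel_softmax(torch.log(pi + 1e-20), tau=tau, hard=True)
    else:
        # plain random sampling from pi during inference
        one_hot = F.one_hot(torch.multinomial(pi, 1).squeeze(-1), pi.numel()).float()
    return one_hot, pi

q = torch.rand(64)
k = torch.rand(10, 64)
one_hot, pi = select_target_enemy(q, k)
print(one_hot.argmax().item(), round(pi.sum().item(), 3))   # selected enemy index, 1.0
```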
3) Map information encoder (Map Encoder):
input: the present invention models a map by meshing a map within a game (for example, 1m×1m square is a minimum unit). The map information of a square having a side length of N meters is inputted to the network not as a game screen but as an Agent. The map information includes: 1. normalized terrain elevation map; 2. a fire Mask (Mask); 3. a movement Mask (Mask); 4. enemy location information.
The enemy position information is a 2D Gaussian distribution on the feature map, centered at the coordinates (μ_x, μ_y) obtained by mapping the enemy's coordinates (x, y) onto the feature map, with standard deviation σ around this center.
The Map Encoder is a Convolutional Neural Network (CNN) used to extract features from the map feature map. The Map Encoder of the invention adopts a fully convolutional network (FCN) structure, which preserves the spatial position characteristics. The Map Encoder may also be implemented with, but is not limited to, other forms of networks that process two-dimensional feature maps (e.g., VGG, ResNet, DenseNet, EfficientNet, ViT, etc.).
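The sketch below illustrates, under assumed sizes, how the four map channels can be stacked (including a Gaussian bump per enemy position) and passed through a small fully convolutional Map Encoder that preserves spatial layout; channel order, grid size and filter counts are assumptions.

```python
import torch
import torch.nn as nn

def enemy_position_channel(grid_n, enemies_xy, sigma=2.0):
    """2-D Gaussian bumps centered at each enemy's mapped cell (mu_x, mu_y)."""
    ys, xs = torch.meshgrid(torch.arange(grid_n), torch.arange(grid_n), indexing="ij")
    channel = torch.zeros(grid_n, grid_n)
    for mx, my in enemies_xy:
        channel += torch.exp(-((xs - mx) ** 2 + (ys - my) ** 2) / (2 * sigma ** 2))
    return channel

class MapEncoder(nn.Module):
    """Fully convolutional encoder that keeps the spatial layout of the 4 map channels:
    height map, fire mask, movement mask, enemy positions (channel order assumed)."""
    def __init__(self, channels=4, feat=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())

    def forward(self, x):                  # x: (batch, 4, N, N)
        return self.net(x)

n = 32
maps = torch.stack([torch.rand(n, n),                 # normalized terrain elevation map
                    torch.ones(n, n),                 # fire mask
                    torch.ones(n, n),                 # movement mask
                    enemy_position_channel(n, [(8, 20), (25, 5)])])
print(MapEncoder()(maps.unsqueeze(0)).shape)          # torch.Size([1, 16, 32, 32])
```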
4) Policy network (Actor):
The Actor of the invention comprises decision modules for two action types, shooting and moving. Different action types can be combined (for example, the Agent can shoot while moving), and further action types can be added.
Mobile policy subnetwork:
The input is the spatial feature map encoded by the Map Encoder, and the output is the moving direction of the Agent. Passing the spatial feature map through the movement policy sub-network yields a 3×3 feature map prediction, where each spatial position corresponds to one of the 9 possible movement directions of the Agent in the actual environment. These 9 directions may include forward, backward, left, right, upper-left, upper-right, lower-left, lower-right and staying in place. The feature value at each position represents the probability or confidence that the Agent selects that direction of movement. Thus, by examining the values at each position of this 3×3 feature map, the Agent can determine its next movement direction. For example, if the value for a certain direction is significantly higher than for the others, the Agent may choose to move in that direction. In this way, the movement policy sub-network guides the movement decisions of the Agent according to the spatial feature map.
The 3×3 feature map is normalized by Softmax and converted into a probability distribution over the moving directions. The movement policy sub-network may also be implemented with, but is not limited to, other forms of networks that process two-dimensional feature maps (e.g., VGG, ResNet, DenseNet, EfficientNet, ViT, etc.).
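A possible PyTorch sketch of such a movement policy sub-network is given below; reducing the spatial feature map to the 3×3 score grid with a 1×1 convolution and adaptive pooling is an assumption, as is the mapping from cell index to direction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MovePolicy(nn.Module):
    """Reduce the encoded spatial feature map to a 3x3 score map whose 9 cells
    correspond to the 9 move options (8 directions + stay); sizes are assumptions."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 1, kernel_size=1),        # one score per spatial cell
            nn.AdaptiveAvgPool2d((3, 3)))              # collapse to a 3x3 grid

    def forward(self, feat):                           # feat: (batch, in_ch, N, N)
        scores = self.head(feat).flatten(1)            # (batch, 9)
        return F.softmax(scores, dim=-1)               # probability over the 9 directions

probs = MovePolicy()(torch.rand(1, 16, 32, 32))
print(probs.shape, probs.argmax(dim=-1))               # assumed index 4 = stay in place
```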
Shooting strategy subnetwork:
The inputs of this network include the spatial features encoded by the Map Encoder, the player state features output by the MLP, and the target enemy features obtained by Gumbel sampling.
The three input features are first fused. The fusion is an element-wise add, i.e., the three features are added element by element to obtain a fused feature vector.
The action space of the shooting policy sub-network has size 2, namely the two actions of shooting and not shooting. The fused feature vector passes through the shooting policy sub-network to obtain a 2-dimensional feature vector, and the probabilities of the two actions (shooting and not shooting) are obtained through Softmax normalization.
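The element-wise fusion and two-way shooting decision can be sketched as follows; projecting the pooled map features to the shared feature dimension before the element-wise add is an assumption introduced so the three features have matching shapes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ShootPolicy(nn.Module):
    """Fuse spatial map features (pooled to a vector), the current-agent feature and
    the sampled target-enemy feature by element-wise addition, then score
    shoot / don't-shoot (all dimensions are assumptions)."""
    def __init__(self, feat_dim=64, map_ch=16):
        super().__init__()
        self.map_proj = nn.Linear(map_ch, feat_dim)    # bring map features to feat_dim
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 2))

    def forward(self, map_feat, agent_feat, target_feat):
        pooled = map_feat.mean(dim=(-2, -1))           # (batch, map_ch) global average pool
        fused = self.map_proj(pooled) + agent_feat + target_feat   # element-wise add
        return F.softmax(self.head(fused), dim=-1)     # P(shoot), P(no shoot)

p = ShootPolicy()(torch.rand(1, 16, 32, 32), torch.rand(1, 64), torch.rand(1, 64))
print(p)   # two-action probability distribution
```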
5) Value function network (Critic):
Critic is used to learn the relationship between the environment and rewards in order to evaluate the action selected by the Actor and assist in training.
The input to the Critic is the output vectors of the entity scalar encoder. In the network structure, the feature encodings of the k enemy entities are averaged to obtain one feature vector, which is added element-wise to the feature vector of the current Agent to obtain a fused feature vector. The fused feature vector passes through an MLP network and finally outputs the Critic value, i.e., the state value V(s) of state s, which is the expectation of the Q values of all actions in state s under the given policy.
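A minimal PyTorch sketch of such a Critic is shown below, with assumed feature dimensions: the k enemy feature vectors are averaged, added element-wise to the current-Agent feature, and mapped by an MLP to the scalar value V(s).

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Average the k enemy feature vectors, add element-wise to the current-agent
    feature, and map the fused vector to a scalar state value V(s)."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, agent_feat, enemy_feats):        # (batch, feat), (batch, k, feat)
        fused = agent_feat + enemy_feats.mean(dim=1)   # element-wise addition
        return self.mlp(fused).squeeze(-1)             # (batch,) value estimates

v = Critic()(torch.rand(2, 64), torch.rand(2, 10, 64))
print(v.shape)   # torch.Size([2])
```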
The embodiment of the invention provides a computer device 100, where the computer device 100 includes a processor and a nonvolatile memory storing computer instructions, and when the computer instructions are executed by the processor, the computer device 100 executes the foregoing agent control method based on deep reinforcement learning. As shown in fig. 2, fig. 2 is a block diagram of a computer device 100 according to an embodiment of the present invention. The computer device 100 comprises a memory 111, a processor 112 and a communication unit 113. For data transmission or interaction, the memory 111, the processor 112 and the communication unit 113 are electrically connected to each other directly or indirectly. For example, the elements may be electrically connected to each other via one or more communication buses or signal lines.
The foregoing description, for purpose of explanation, has been presented with reference to particular embodiments. The illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the disclosure and its practical application, to thereby enable others skilled in the art to best utilize the disclosure and various embodiments with various modifications as are suited to the particular use contemplated.

Claims (10)

1. The intelligent agent control method based on deep reinforcement learning is characterized by comprising the following steps:
acquiring current state information of a current intelligent agent and hostile state information of a plurality of hostile intelligent agents, inputting the current state information of the current intelligent agent and hostile state information of the hostile intelligent agents into an entity scalar encoder constructed based on a multi-layer perception mechanism, and obtaining entity feature vectors of the current intelligent agent and hostile feature vectors corresponding to the hostile intelligent agents;
Inputting the entity feature vector and a plurality of hostile feature vectors into a target hostile selection unit constructed based on an attention mechanism, and determining a target hostile agent from the hostile agents;
acquiring a game map of the current intelligent agent, and extracting map coding features of the current intelligent agent;
Inputting the entity feature vector, the target hostile feature vector corresponding to the target hostile agent and the map coding feature into a strategy network, and obtaining an action command for indicating whether the current agent executes shooting operation for the target hostile agent;
acquiring a predicted evaluation result of the action command based on a preset value function network;
acquiring an actual rewarding result of the action command based on a preset rewarding function;
And optimizing and adjusting the strategy network according to the predicted evaluation result and the actual rewarding result so as to make a action decision of the current intelligent agent by utilizing the strategy network after optimizing and adjusting.
2. The method of claim 1, wherein the target enemy selection unit includes a cascading Softmax architecture and Gumbel Softmax Sampling architecture, wherein the inputting the entity feature vector and the plurality of enemy feature vectors into the target enemy selection unit constructed based on an attention mechanism determines a target enemy agent from the plurality of enemy agents, comprising:
performing dot product calculation on the entity feature vector and a plurality of hostile feature vectors to obtain a target matrix constructed by a plurality of vector multiplication results, wherein each element in the target matrix is used for representing the correlation between the current intelligent agent and each hostile intelligent agent;
Normalizing the target matrix by using the Softmax architecture to obtain probability distribution corresponding to the target matrix, wherein each element in the probability distribution is used for representing discrete action probability of the current agent for selecting each hostile agent;
Processing the probability distribution by utilizing the Gumbel Softmax Sampling framework to obtain a new probability distribution;
In the training stage, calculating a loss function according to the new probability distribution and executing back propagation update model parameters;
And in the reasoning stage, sampling is carried out according to the new probability distribution, and finally the target hostile agent selected is determined.
3. The method of claim 1, wherein the policy network comprises a mobile policy sub-network and a shooting policy sub-network, the inputting the entity feature vector, the target hostile feature vector corresponding to the target hostile agent, and the map-coded feature into the policy network, obtaining an action command for indicating whether the current agent performs a shooting operation for the target hostile agent, comprising:
Acquiring a spatial feature map of the map coding feature, wherein the spatial feature map comprises barrier information of the current intelligent agent and position information of the target hostile intelligent agent;
inputting the space feature map into the mobile strategy sub-network to obtain a feature map prediction result with a preset size;
normalizing the feature map prediction result to obtain the probability distribution of the moving direction of the current intelligent agent;
Performing element addition-based feature fusion on the entity feature vector, the target hostile feature vector and the map coding feature to obtain a first fusion feature vector;
inputting the first fusion feature vector into the shooting strategy sub-network to obtain the action probability distribution of whether the current intelligent agent shoots or not;
and acquiring an action command for indicating whether the current agent executes shooting operation for the target enemy agent according to the movement direction probability distribution and the action probability distribution.
4. The method according to claim 1, wherein the obtaining the predicted evaluation result of the action command based on the preset value function network includes:
Acquiring the undetermined state information of the current intelligent agent and the undetermined state information of the plurality of hostile intelligent agents after the current intelligent agent executes the action command;
Extracting entity state vectors of the undetermined state information of the current intelligent agent and hostile state vectors corresponding to the undetermined state information of the hostile intelligent agents respectively;
Carrying out average processing on a plurality of hostile state vectors to obtain an average hostile state vector;
performing element-addition-based feature fusion on the entity state vector and the average hostile state vector to obtain a second fusion feature vector;
And inputting the second fusion feature vector into a preset value function network to obtain a prediction evaluation result of the action command.
5. The method of claim 1, wherein the obtaining the actual rewards results of the action command based on the pre-set rewards function comprises:
based on a preset reward function Rewards = Σ_i a_i·R_i, acquiring an actual rewarding result of the action command;
wherein Rewards is the actual rewarding result, R_i is the i-th reward type, and a_i is the weight of the reward type R_i.
6. The method of claim 1, wherein the policy network is obtained by:
acquiring an initial strategy network, and interacting with a preset shooting game environment for a plurality of times based on the initial strategy network to obtain a game state track;
calculating discount rewards and advantage values of various states included in the game state track according to the game state track combined with a preset value function network;
Updating the initial strategy network based on the discount rewards and the dominance values to obtain an intermediate strategy network, and updating the preset value function network based on a mean square loss function;
And performing iterative training according to the intermediate strategy network and the updated preset value function network to obtain the strategy network after training is completed.
7. The method of claim 6, wherein calculating discount rewards and advantage values for the game state trajectory based on the game state trajectory in combination with a network of preset value functions, comprises:
According to the formula R_t = Σ_{k=t}^{n-1} γ^{k-t}·r_k + γ^{n-t}·V_φ(s_n), calculating the discount rewards of the game state track; wherein R_t is the discount reward, r_k is the reward at step k of the track, V_φ(s_n) is the value of the state s_n estimated by the preset value function network, and γ is a discount factor;
According to the formula A_t = R_t - V_φ(s_t), calculating the dominance value of the game state track; wherein A_t is the dominance value.
8. The method according to claim 1, wherein the method further comprises:
Under the condition that the current intelligent agent executes an action command of shooting operation aiming at the target hostile intelligent agent, acquiring a shooting prop module which is configured by the current intelligent agent and is used for executing the shooting operation;
Acquiring continuous position coordinate information of the shooting prop module at a current time point; the continuous position coordinate information represents position coordinate information of the shooting prop module over a recorded time range and a prediction time range associated with the current time point; the continuous position coordinate information is formed from past position coordinate information and expected position coordinate information;

the past position coordinate information is position coordinate information of the shooting prop module within the recorded time range associated with the current time point; the expected position coordinate information is obtained by analyzing initial expected position coordinate information together with an operation command of the shooting prop module at the current time point, and the initial expected position coordinate information is position coordinate information, output by a pre-trained position prediction model, of the shooting prop module over the prediction time range associated with the current time point;

the past position coordinate information is generated by: acquiring a plurality of historical moments within the recorded time range associated with the current time point; acquiring a track coordinate point, orientation data and speed data of the shooting prop module at each of the plurality of historical moments; and generating the past position coordinate information based on the track coordinate point, the orientation data and the speed data of the shooting prop module at each historical moment;

the initial expected position coordinate information is predicted, at the current time point, by the pre-trained position prediction model in the process of determining the shooting prop state of the shooting prop module at the current time point; the initial expected position coordinate information is composed of a track coordinate point, orientation data and speed data of the shooting prop module at each prediction moment; the operation command comprises a direction vector and a speed value instructing the shooting prop module to move;

the expected position coordinate information is obtained by: correcting, based on the direction vector in the operation command, the orientation data of the shooting prop module at each prediction moment in the initial expected position coordinate information to obtain a corrected direction vector of the shooting prop module at each prediction moment; correcting, based on the direction vector and the speed value in the operation command, the speed data of the shooting prop module at each prediction moment in the initial expected position coordinate information to obtain a corrected speed value of the shooting prop module at each prediction moment; correcting, based on the corrected moving path points and the corrected speed values matched with the preceding prediction moments among the plurality of prediction moments, the track coordinate point matched with a target prediction moment among the plurality of prediction moments to obtain a corrected moving path point of the shooting prop module at the target prediction moment, the preceding prediction moments being the moments before the target prediction moment; and obtaining the expected position coordinate information based on the corrected direction vector, the corrected speed value and the corrected moving path point of the shooting prop module at each prediction moment;

the shooting prop module is configured with a plurality of dynamic nodes in its virtual construction;
acquiring the number of rotation axes matched with each dynamic node of the shooting prop module;

calculating, for each dynamic node of the shooting prop module, the rotation amount at the current time point on each of its matched rotation axes;

generating current prop interaction information based on the rotation amount of each dynamic node on each of its matched rotation axes; the current prop interaction information characterizes the shooting prop state of the shooting prop module at the current time point; the shooting prop module is configured with a plurality of dynamic nodes in its virtual construction, and each dynamic node is configured with a matched number of rotation axes;

calculating, based on the pre-trained position prediction model and according to the continuous position coordinate information and the current prop interaction information, the reference rotation amount of each dynamic node of the shooting prop module on each of its matched rotation axes from the state at the current time point to a subsequent time point of the current time point;

generating prop operation information based on the calculated reference rotation amount of each dynamic node on its corresponding rotation axes; the prop operation information characterizes an operation mode of the shooting prop module from the current time point to the subsequent time point of the current time point;

and determining the shooting prop state of the shooting prop module at the subsequent time point of the current time point based on the prop operation information and the current prop interaction information, and switching the shooting prop state indicated by the current prop interaction information into the shooting prop state at the subsequent time point of the current time point.
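The trajectory-correction portion of the claim above can be made concrete with a short sketch. The following Python fragment is a minimal illustration under stated assumptions: the linear blending weight `blend`, the fixed time step `dt` and the helper name `correct_expected_trajectory` are hypothetical, since the claim specifies which quantities are corrected but not the exact correction formula.

```python
import numpy as np

def correct_expected_trajectory(current_point, init_traj, cmd_dir, cmd_speed,
                                dt=0.05, blend=0.5):
    """Correct a predicted future trajectory with the current operation command.

    current_point: position of the shooting prop module at the current time point.
    init_traj:     initial expected position coordinate information -- one dict per
                   prediction moment with 'orientation' (unit vector), 'speed'
                   (scalar) and 'point' (track coordinate point), as output by the
                   pre-trained position prediction model.
    cmd_dir:       direction vector from the operation command.
    cmd_speed:     speed value from the operation command.
    blend, dt:     assumed blending weight and time step (not specified by the claim).
    """
    cmd_dir = np.asarray(cmd_dir, dtype=float)
    cmd_dir = cmd_dir / (np.linalg.norm(cmd_dir) + 1e-8)

    corrected = []
    prev_point = np.asarray(current_point, dtype=float)
    for step in init_traj:
        # Correct the predicted orientation with the command's direction vector.
        ori = np.asarray(step["orientation"], dtype=float)
        corr_dir = (1.0 - blend) * ori + blend * cmd_dir
        corr_dir = corr_dir / (np.linalg.norm(corr_dir) + 1e-8)

        # Correct the predicted speed with the command's direction and speed value.
        corr_speed = (1.0 - blend) * float(step["speed"]) + blend * cmd_speed

        # The corrected moving path point at this prediction moment is built from
        # the corrected path point and speed of the preceding prediction moment.
        corr_point = prev_point + corr_dir * corr_speed * dt
        prev_point = corr_point

        corrected.append({"point": corr_point,
                          "orientation": corr_dir,
                          "speed": corr_speed})
    return corrected
```

The corrected orientation, speed and moving path point produced for each prediction moment together form the expected position coordinate information, which is then joined with the past position coordinate information to give the continuous position coordinate information used in the later steps.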
9. The method of claim 8, wherein the calculating, based on the pre-trained position prediction model and according to the continuous position coordinate information and the current prop interaction information, of the reference rotation amount of each dynamic node of the shooting prop module on each of its matched rotation axes from the state at the current time point to the subsequent time point of the current time point comprises:
calculating, based on the pre-trained position prediction model and according to the continuous position coordinate information and the current prop interaction information, a normal distribution center value indicating the dynamic change of each dynamic node on each of its corresponding rotation axes from the current time point to the subsequent time point of the current time point; the normal distribution center value comprises component values in a plurality of dimensions, and each component value indicates the dynamic change of one dynamic node on one corresponding rotation axis;

acquiring a target normal distribution determined by the normal distribution center value, and randomly sampling from the target normal distribution for each dimension of the normal distribution center value to obtain a random rotation amount matched with that dimension;

summing, for each dimension, the matched component value and the random rotation amount to obtain the reference rotation amount of each dynamic node on each of its matched rotation axes; each reference rotation amount is the sum of the component value of the corresponding dimension and the random rotation amount drawn for that dimension.
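The sampling step of claim 9 amounts to drawing each reference rotation amount from a normal distribution centered on the model's predicted change. A minimal Python sketch follows; the standard deviation `sigma` and the function name are assumptions, since the claim fixes only the center value of the target normal distribution.

```python
import numpy as np

def sample_reference_rotations(center, sigma=0.05, rng=None):
    """Sample reference rotation amounts around a predicted normal-distribution center.

    center: array of shape (num_dynamic_nodes, num_rotation_axes); each component
            value indicates the predicted dynamic change of one dynamic node on one
            of its rotation axes between the current and the subsequent time point.
    sigma:  assumed spread of the target normal distribution (not fixed by the claim).
    """
    rng = np.random.default_rng() if rng is None else rng
    center = np.asarray(center, dtype=float)

    # Draw one random rotation amount per component from the target normal distribution.
    random_rotation = rng.normal(loc=0.0, scale=sigma, size=center.shape)

    # Each reference rotation amount is the sum of its component value and the
    # random rotation amount drawn for that component.
    return center + random_rotation
```

Sampling around the predicted center value rather than applying it directly is the usual stochastic-policy device in deep reinforcement learning, and it keeps the prop's motion from being identical across otherwise identical states.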
10. A server system comprising a server configured to perform the method of any one of claims 1 to 9.
CN202410275750.3A 2024-03-12 Deep reinforcement learning-based agent control method and system Pending CN118267709A (en)

Publications (1)

Publication Number Publication Date
CN118267709A (en) 2024-07-02

Similar Documents

Publication Publication Date Title
CN112329348B (en) Intelligent decision-making method for military countermeasure game under incomplete information condition
Foerster et al. Stabilising experience replay for deep multi-agent reinforcement learning
Cole et al. Using a genetic algorithm to tune first-person shooter bots
KR20210028728A (en) Method, apparatus, and device for scheduling virtual objects in a virtual environment
Laird It knows what you're going to do: Adding anticipation to a Quakebot
CN111111204B (en) Interactive model training method and device, computer equipment and storage medium
Gmytrasiewicz et al. Bayesian update of recursive agent models
CN112221149B (en) Artillery and soldier continuous intelligent combat drilling system based on deep reinforcement learning
CN114358141A (en) Multi-agent reinforcement learning method oriented to multi-combat-unit cooperative decision
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN116360503B (en) Unmanned plane game countermeasure strategy generation method and system and electronic equipment
Hostetler et al. Inferring strategies from limited reconnaissance in real-time strategy games
Oh et al. Learning to sample with local and global contexts in experience replay buffer
CN112651486A (en) Method for improving convergence rate of MADDPG algorithm and application thereof
CN116956007A (en) Pre-training method, device and equipment for artificial intelligent model and storage medium
Cao et al. Autonomous maneuver decision of UCAV air combat based on double deep Q network algorithm and stochastic game theory
CN117313561B (en) Unmanned aerial vehicle intelligent decision model training method and unmanned aerial vehicle intelligent decision method
CN113509726B (en) Interaction model training method, device, computer equipment and storage medium
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
CN112561032B (en) Multi-agent reinforcement learning method and system based on population training
CN113377099A (en) Robot pursuit game method based on deep reinforcement learning
CN118267709A (en) Deep reinforcement learning-based agent control method and system
CN116596343A (en) Intelligent soldier chess deduction decision method based on deep reinforcement learning
CN114662655A (en) Attention mechanism-based weapon and chess deduction AI hierarchical decision method and device
CN114757092A (en) System and method for training multi-agent cooperative communication strategy based on teammate perception

Legal Events

Date Code Title Description
PB01 Publication