CN111589157B - AI model using method, apparatus and storage medium
- Publication number
- CN111589157B (application CN202010408928.9A)
- Authority
- CN
- China
- Prior art keywords
- model
- behavior
- action
- player
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F2300/00—Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
- A63F2300/60—Methods for processing data by generating or executing the game program
- A63F2300/6027—Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment
Abstract
The application discloses an AI model using method, a device and a storage medium. The method comprises the following steps: acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information; invoking an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain an action behavior; sending the action behavior to the agent, so that the agent executes the action behavior and feedback information corresponding to the action behavior is obtained; acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behavior as a training sample; and training and updating the AI model according to the training sample. The application improves the accuracy of the AI model.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, and a storage medium for using an AI model.
Background
With the development of artificial intelligence (Artificial Intelligence, AI) technology, artificial intelligence is increasingly being applied to various fields such as autonomous driving, StarCraft, and Dota 2. In the field of game play, an agent controlled by an AI model can reach a level exceeding that of professional players.
However, most current AI models are trained with task-based planning and control methods, and AI models trained in such a task-based or rule-based manner perform poorly in the field of multi-agent vehicle control. This is because, in the field of multi-agent control, the AI model needs to consider the competition and cooperation relationships between multiple agents, or between agents and players, and control the vehicles based on these relationships. The resulting increase in situation complexity increases the amount of data the AI model needs to analyze, which not only slows down the data analysis of the AI model, but may also prevent the AI model from analyzing effectively based on the current situation, so that the agent acts slowly or unreasonably.
Therefore, how to improve the accuracy of AI models in the field of multi-agent control is a problem to be solved.
Disclosure of Invention
The application provides an AI model using method, equipment and a storage medium, which are used for improving the accuracy of the AI model in the field of multi-agent control.
In a first aspect, the present application provides a method for using an AI model, the method including:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
invoking an AI model corresponding to the intelligent agent, and inputting the observation characteristic into the AI model for prediction to obtain action behavior;
the action behavior is sent to the intelligent agent, so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behaviors as training samples;
training and updating the AI model according to the training sample;
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation characteristics into an AI model to obtain probabilities corresponding to a plurality of action behaviors;
determining a target action behavior from a plurality of action behaviors according to probabilities corresponding to the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes corresponding actions according to the target action behavior.
In a second aspect, the present application also provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the AI model using method as described above when executing the computer program.
In a third aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement an AI model using method as described above.
The application discloses an AI model using method, a device and a storage medium. Environment information observed by an agent in a virtual environment is acquired, and observation characteristics of the agent are extracted from the environment information; the AI model corresponding to the agent is then invoked, and the observation characteristics are input into the AI model for prediction to obtain an action behavior; the action behavior is sent to the agent so that the agent executes the action behavior, and feedback information corresponding to the action behavior is obtained; finally, the feedback information, the observation characteristics and the action behavior are used as a training sample, and the AI model is trained and updated according to the training sample. By extracting observation characteristics from the environment information and using, as training samples, the feedback information corresponding to the action behaviors executed by the agent based on those observation characteristics, the AI model is trained and updated, so that the accuracy of the AI model in the field of multi-agent control is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic view of a training AI model provided by an embodiment of the application;
FIG. 2 is a schematic flow chart of an AI model training method provided by an embodiment of the application;
FIG. 3 is a schematic view of angles provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of coding and behavior comparison provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a hierarchical structure of an AI model provided by an embodiment of the application;
FIG. 6 is a schematic diagram of a scenario using an AI model provided by an embodiment of the application;
FIG. 7 is a flowchart of an AI model using method according to an embodiment of the present application;
FIG. 8 is a flow chart of sub-steps of the AI model using method provided in FIG. 7;
fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiments of the application provide an AI model using method, a computer device and a storage medium. The AI model training method can be applied to a server, where the server can be a single server or a server cluster consisting of a plurality of servers.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
It should be noted that, in the following, the application of the AI model to a ship game will be described in detail as an example, and it is to be understood that the AI model may also be applied to other multi-agent control scenarios.
Referring to fig. 1, fig. 1 is a schematic view of a training AI model according to an embodiment of the present application.
As shown in fig. 1, the model training server includes a prediction section and a training section. The prediction part is used for predicting action behaviors to generate training samples, and the training part is used for training and updating the AI model.
The virtual environment server sends the environment information observed by the agent in the virtual environment to a prediction part in the model training server, the prediction part extracts the characteristics of the environment information to obtain the observation characteristics, and the observation characteristics are input into the AI model to conduct behavior prediction to obtain the action behavior output by the AI model. And the prediction part sends a behavior instruction corresponding to the action behavior to an agent in the virtual environment server so as to control the agent to act according to the behavior instruction and generate feedback information.
And then the prediction part sends the observation characteristics, the action behaviors and the feedback information to the training part together as training samples, the training part trains the AI model according to the training samples, and the updated parameters are returned to the AI model in the prediction part so as to update the AI model.
Referring to fig. 2, fig. 2 is a schematic flowchart of an AI model training method according to an embodiment of the application.
As shown in fig. 2, the AI model training method specifically includes: step S101 to step S105.
S101, acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information.
The virtual environment may refer to a virtual environment server, and after the agent and the real player access the virtual environment server, the corresponding actions may be performed in the virtual environment server. For example, the virtual environment server may refer to a game server, and after the agent and the real player access the game server, the game may be played in a virtual game scene provided by the game server.
After the agent accesses the virtual environment server, the virtual environment server can observe the surrounding environment from the perspective of the agent to obtain the environment information observed by the agent, and sends this environment information to the model training server. After acquiring the environment information observed by the agent, the model training server extracts the observation characteristics of the agent from the environment information. An observation characteristic is a characteristic that can be observed within the field of view of the agent, and includes characteristics of the agent itself and characteristics of other players within the field of view.
The model training server may include a prediction part and a training part, the prediction part is used for obtaining environment information observed by an agent in a virtual environment, extracting observation characteristics of the agent from the environment information, predicting action behaviors of the agent based on the observation characteristics, and obtaining feedback information corresponding to the action behaviors. The training part is used for carrying out model training on the AI model by taking the observation characteristics, the action behaviors and the corresponding feedback information as training samples.
In some embodiments, the agent observed environmental information may include user information, user configuration information, and global information. The extracting of the observation feature of the agent from the environment information may specifically be extracting a corresponding player feature, a configuration feature, and a global feature from the user information, the user configuration information, and the global information. That is, the observed features include player features, configuration features, and global features.
Specifically, an observation modeling method may be used to perform player modeling for each user in the virtual environment, to model the user configuration of each user in the virtual environment, and to perform global modeling, so as to obtain player characteristics, configuration characteristics and global characteristics. The users in the virtual environment can all be agents, or can be partly real players and at least one agent. The user configuration information may refer to the weapon information configured by a player.
In the ship game, the user information refers to player information. Players include master players, teammates players, and hostile players. The master player is a currently observed player, that is, if the surrounding environment is observed by the viewing angle of the agent, the agent is the master player, and if the surrounding environment is observed by the viewing angle of the real player, the real player is the master player. Teammate players are players who are on the same team as the master player, and hostile players are players who are hostile to the master player and can be observed in the field of view of the master player.
A hostile player who is outside the range of the master player's field of view is not treated as a hostile player; only when that player enters the range of the master player's field of view is it treated as a hostile player.
When there are multiple teammate players and/or hostile players, each player may be marked in order to facilitate distinguishing between the multiple players. For example, four teammates players in the same team as the master player can be respectively marked as teammate 1, teammate 2, teammate 3 and teammate 4, and the opponent players which can be observed in the visual field of the master player can be respectively marked as opponent 1, opponent 2, opponent 3, opponent 4 and opponent 5.
The user information includes basic information and pose information. The basic information is used to represent the status of the current player, e.g., whether the current player is the master player, whether it is alive, its movement speed, etc. The pose information is used to represent the global pose of the current player and its relative pose with respect to other players (i.e., teammate players and/or hostile players). The global pose comprises the current player's coordinates and the current player's angles, and the relative pose comprises the other player's coordinates, the relative distance to the other player, and the relative angles.
The angles in the global pose and the relative pose may include the following: the current player's view angle relative to its own hull, the bow angle relative to another player, the hull angle relative to another player, the view angle relative to another player, the current player's hull angle, and the current player's view angle.
As shown in fig. 3, in the course of the ship game, after the player enters the game, the player takes the initial direction as the facing direction to perform the activity, and the player can adjust the viewing angle based on the initial direction.
The player can rotate the view angle while keeping the hull position unchanged; during this rotation, the view angle of the player relative to its own hull can be obtained, and the angle of the current player's view angle relative to its own hull is recorded as ∠1. The angle of the current player's bow relative to another player is taken as the bow angle relative to that player and is recorded as ∠2. The angle of the current player's hull relative to another player is taken as the hull angle relative to that player and is recorded as ∠3. The angle of the current player's view angle relative to another player is taken as the view angle relative to that player and is recorded as ∠4. The angle of the current player's hull relative to the initial direction is taken as the current player's hull angle and is recorded as ∠5. The angle of the current player's view angle relative to the initial direction is taken as the current player's view angle and is recorded as ∠6.
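For illustration only, the following Python sketch computes angles of this kind from 2D coordinates and heading angles. The coordinate convention, the function name and the assumption that the bow direction coincides with the hull direction are not taken from the patent.

```python
import math

def pose_angles(px, py, hull_deg, view_deg, qx, qy):
    """Illustrative computation of the pose angles described above (degrees).

    (px, py): current player's position; hull_deg / view_deg: hull and view
    directions measured from the initial direction; (qx, qy): the other player.
    In this sketch the bow direction is taken to coincide with the hull
    direction, so angle 2 and angle 3 come out equal.
    """
    bearing = math.degrees(math.atan2(qy - py, qx - px)) % 360.0  # direction to the other player
    a1 = (view_deg - hull_deg) % 360.0   # view angle relative to own hull
    a2 = (bearing - hull_deg) % 360.0    # bow angle relative to the other player
    a3 = a2                              # hull angle relative to the other player
    a4 = (bearing - view_deg) % 360.0    # view angle relative to the other player
    a5 = hull_deg % 360.0                # hull angle relative to the initial direction
    a6 = view_deg % 360.0                # view angle relative to the initial direction
    return a1, a2, a3, a4, a5, a6
```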
When player modeling is performed, each player in the virtual environment is modeled separately, that is, each player in the virtual environment is in turn modeled as the master player, so that user information corresponding to each player is obtained. Thus, if the number of teammate players and/or hostile players is insufficient, default values may be used to fill the missing values in the player user information. An insufficient number of teammate players or hostile players means that teammate players or hostile players have been removed, for example through player death or disconnection.
If the angle information in the pose information cannot be obtained, the angles that cannot be obtained can be filled with a default value. For example, if the player currently being modeled is not the master player and that player's view angle cannot be acquired, then ∠1, ∠4 and ∠6 cannot be obtained, and the default value may be used to fill ∠1, ∠4 and ∠6.
In the implementation process, the default value may be set in advance, so as to indicate that the value cannot be obtained and is a null value.
The environment information observed by the agent is acquired, and each player in the virtual environment is modeled separately, so that player characteristics and the characteristic values corresponding to the player characteristics are extracted from the user information. The player characteristics include status characteristics and pose characteristics of the player. The status characteristics include whether the current player is the master player, whether it is alive, its movement speed, and so forth. The pose characteristics comprise global pose characteristics and relative pose characteristics, where the global pose characteristics comprise the current player's coordinates and the current player's angles, and the relative pose characteristics comprise the other player's coordinates, the relative distance to the other player, and the relative angles.
The user configuration information may refer to weapon information configured by the player. The user configuration information may include bow weapon information, board weapon information, and deck weapon information.
Observation modeling is performed according to the user configuration information, so that configuration characteristics and the characteristic values corresponding to the configuration characteristics are extracted from the user configuration information. The configuration characteristics include whether a weapon is held, whether the weapon is activated, whether the weapon is available, the weapon type, whether the current aiming direction is within firing range, whether there are hostile players within firing range, and the weapon switching time.
The global information includes team information and game information. Observation modeling is performed according to the global information, so that global characteristics and the characteristic values corresponding to the global characteristics are extracted from the global information. The global characteristics include team characteristics indicating the team's progress, such as the total numbers of surviving and dead team members, and game characteristics indicating the global progress of the game, such as the total number of remaining players and the remaining game time.
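As an illustration of how such observation characteristics might be assembled into model inputs, a minimal Python sketch follows; all field names, dimensions and the padding scheme are assumptions rather than the patent's exact feature layout.

```python
import numpy as np

def build_observation(master, teammates, enemies, weapons, team_info, game_info,
                      max_teammates=4, max_enemies=5, default=-1.0):
    """Assemble player, configuration and global features into arrays.

    Each player is described by status features (is_master, is_alive, speed)
    and pose features (coordinates, hull angle, relative distance and angle);
    missing teammates/enemies are padded with a default value.
    """
    def player_vec(p):
        if p is None:                      # missing player -> default padding
            return [default] * 8
        return [p["is_master"], p["is_alive"], p["speed"],
                p["x"], p["y"], p["hull_angle"], p["rel_dist"], p["rel_angle"]]

    players = [master] + list(teammates) + [None] * (max_teammates - len(teammates))
    players += list(enemies) + [None] * (max_enemies - len(enemies))
    player_feat = np.array([player_vec(p) for p in players], dtype=np.float32)

    # Configuration features: one row per weapon (held, active, available, ...).
    config_feat = np.array([[w["held"], w["active"], w["available"], w["type"],
                             w["in_range"], w["enemy_in_range"], w["switch_time"]]
                            for w in weapons], dtype=np.float32)

    # Global features: team progress and game progress.
    global_feat = np.array([team_info["alive"], team_info["dead"],
                            game_info["players_left"], game_info["time_left"]],
                           dtype=np.float32)
    return player_feat, config_feat, global_feat
```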
By modeling each player in the virtual environment separately through observation modeling, the relative relationships among the master player, teammate players and hostile players are expressed, so that the trained agent controlled by the AI model can cooperate with teammate players. Modeling the user configuration information yields configuration characteristics that express weapon attributes in the game, which improves how smoothly the trained agent controlled by the AI model operates weapons in the game.
S102, invoking an AI model corresponding to the agent, and inputting the observation characteristic into the AI model for prediction to obtain action behaviors.
After the observation characteristics are obtained, the model training server can call an AI model corresponding to the intelligent agent, the observation characteristics are input into the AI model, and the AI model predicts the behavior of the intelligent agent according to the input observation characteristics to obtain the action behavior.
The model training server stores an initial AI model; the observation characteristics are input into the initial AI model, and the initial AI model predicts the behavior of the agent according to the input observation characteristics to obtain the action behavior.
And S103, sending the action behavior to the intelligent agent so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior.
After obtaining the action behavior output by the AI model, the model training server sends the action behavior to the agent in the virtual environment so as to control the agent to execute the received action behavior. After executing the received action behavior, the agent generates feedback information corresponding to the executed action behavior to indicate whether the action behavior is appropriate. Specifically, the feedback information corresponding to the action behavior may be obtained by setting a feedback function.
In the process in which the agent executes action behaviors, an execution target is set for the agent; the execution target is the target that the agent needs to reach after multiple action behaviors. Each time the agent executes a corresponding action behavior, the action behavior of the agent is given feedback: if the action behavior brings the agent closer to the execution target, the feedback information is positive, indicating that the action behavior is appropriate; if the action behavior takes the agent farther from the execution target, the feedback information is negative, indicating that the action behavior is inappropriate.
For example, if the execution target of the agent is to move from point A to point B, then after the agent executes the action behavior sent by the model training server, the feedback information is negative if the agent is now farther from point B than it was at its original position, and positive if the agent is now closer to point B than it was at its original position.
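A toy sketch of such a distance-based feedback function is given below; it is an assumption for illustration only, and a real feedback function would combine many more signals.

```python
def distance_feedback(prev_pos, new_pos, target_pos):
    """Positive feedback when the action brings the agent closer to point B,
    negative when it moves the agent farther away."""
    def dist(a, b):
        return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
    return dist(prev_pos, target_pos) - dist(new_pos, target_pos)
```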
In some embodiments, whether the AI model converges may be determined from the feedback information over a continuous period of time; for example, the feedback information over a period of time may be plotted in a graph, and if the feedback information remains positive over that period, the AI model may be considered to have converged.
S104, acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behaviors as training samples.
After the feedback information is obtained through calculation, the model training server stores the observation characteristics, the feedback information and the action behaviors together as training samples so as to be used for training the AI model.
The action behavior of the agent is dynamic, i.e. the agent executes action behaviors continuously over a period of time. The environment information observed by the agent at each moment during that period may therefore differ, so the observation characteristics extracted from the environment information also differ. In order to improve the prediction accuracy of the trained AI model, the observation characteristic at the current moment, the action behavior at the current moment, the feedback information at the current moment, the observation characteristic at the next moment, and whether the agent stops running at the next moment can be stored together as one training sample.
In some embodiments, the method further comprises: acquiring the behavior sequence of a plurality of training samples, and storing the plurality of training samples in a training sample sequence according to the behavior sequence; and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
The behavior sequence refers to the order in which the agent executes each action behavior. For consecutive training samples acquired over a period of time, the training samples are stored in a training sample sequence according to the behavior sequence, and when the number of training samples stored in the training sample sequence is greater than a preset threshold, some of the training samples are deleted according to the behavior sequence.
The preset threshold may be the maximum number of training samples that can be stored in the training sample sequence. When training samples are deleted according to the behavior sequence, they are deleted in the order of the behavior sequence, from front to back (earliest first).
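A minimal sketch of such a bounded, behavior-ordered training sample sequence is shown below; the class name, tuple layout and the use of a deque are illustrative assumptions.

```python
from collections import deque

class SampleSequence:
    """Training samples kept in behavior order; once the preset threshold is
    exceeded, the oldest samples are dropped first (FIFO via a bounded deque).
    Each sample follows the five-element layout described above."""

    def __init__(self, max_len=10000):
        self.samples = deque(maxlen=max_len)   # front (oldest) is evicted first

    def add(self, obs_t, action_t, feedback_t, obs_next, done_next):
        self.samples.append((obs_t, action_t, feedback_t, obs_next, done_next))

    def __len__(self):
        return len(self.samples)
```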
In some embodiments, the number of the agents controlled by the model training server may be multiple, and when there are multiple agents, the behavior sequence of each agent for executing each action behavior may be obtained separately, and the training samples are saved according to the behavior sequence.
In some embodiments, the method further comprises: and encoding the action behaviors to obtain behavior codes.
The action behavior comprises a behavior category and a behavior parameter, when the action behavior is coded, the action behavior can be subjected to behavior modeling to distinguish the behavior category and the behavior parameter, and then the behavior category and the behavior parameter are coded based on a preset coding strategy to obtain a behavior code. In particular, one-hot coding may be used to code behavior categories and behavior parameters.
Specifically, the behavior modeling may be hierarchical modeling, and specifically has two hierarchies, including a behavior selection layer and a behavior parameter layer, where the behavior selection layer refers to selecting one behavior category, and the behavior parameter layer refers to selecting a corresponding behavior parameter according to the selected behavior category, and different behavior categories also correspond to different behavior parameters. Performing behavior modeling on the action behavior reduces the number of labels on each dimension, and is beneficial to training of an AI model.
For example, behavior categories may include horizontal direction aiming offset values, vertical direction aiming offset values, firing, movement, weapon switching, and the like.
The horizontal direction aiming offset value and the vertical direction aiming offset value can be provided with different numbers of labels in the horizontal direction or the vertical direction according to different ship body types and weapon types.
Firing refers to the agent performing one firing action at the current aiming angle.
The movement can be specifically classified into a linear velocity and an angular velocity, and the linear velocity and the angular velocity can be used in combination.
Weapon switching means that if the current weapon is not available or a plurality of weapons can be used at the same time, other weapons need to be switched, and different numbers of tags can be set according to weapon numbers or using sequences.
For example, the horizontal aiming offset may be set as: within the interval [-30 degrees, 30 degrees], one label per 1 degree, for a total of 61 labels.
Vertical aiming offset: within the interval [-10 degrees, 10 degrees], one label per 1 degree, 11 labels in total.
Firing: one label.
Movement, combining linear velocity and angular velocity: straight forward, forward right turn, right turn in place, backward right turn, straight backward, backward left turn, left turn in place, forward left turn, for a total of 8 labels.
Weapon switching: keep using the current weapon, switch to the previous weapon, or switch to the next weapon, for a total of 3 labels.
With reference to fig. 4, if the current action behavior is turning left in place, the output one-hot code is [[0, 0, 0, 1, 0], [0, ..., 1, ..., 0], [0, ..., 1, ...], [1], [0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0]]. The first element, [0, 0, 0, 1, 0], is the behavior selection layer and indicates that the class-4 behavior, i.e. movement, is selected. The second to sixth elements respectively represent the behavior parameters within each class of behavior; the behavior parameter selected for the class-4 behavior, namely [0, 0, 0, 0, 0, 0, 1, 0], corresponds to the specific behavior parameter of turning left in place.
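As a sketch of this hierarchical one-hot encoding, the snippet below uses the label counts listed above; the treatment of the non-selected categories' parameter slots follows the fig. 4 example and is otherwise an assumption.

```python
import numpy as np

# Label counts per behavior category as listed above: horizontal aim (61),
# vertical aim (11), firing (1), movement (8), weapon switching (3).
CATEGORY_SIZES = [61, 11, 1, 8, 3]

def encode_action(selected_category, param_indices):
    """Hierarchical one-hot encoding sketch: the first vector selects the
    behavior category; the following vectors give one one-hot parameter
    value per category, with the selection vector saying which one applies."""
    selection = np.zeros(len(CATEGORY_SIZES), dtype=np.int8)
    selection[selected_category] = 1
    params = []
    for size, idx in zip(CATEGORY_SIZES, param_indices):
        vec = np.zeros(size, dtype=np.int8)
        vec[idx] = 1
        params.append(vec)
    return [selection] + params

# e.g. selecting category 3 (movement, 0-indexed) with movement parameter 6,
# which corresponds to "left turn in place" in the label order given above.
encoded = encode_action(3, [30, 5, 0, 6, 0])
```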
And S105, training and updating the AI model according to the training sample.
After the training sample is obtained, the AI model can be trained and updated according to the training sample, and after the AI model is converged, the converged AI model is stored, so that the training is completed.
In some embodiments, referring to fig. 5, the AI model includes a first feature encoding layer, a second feature encoding layer, a third feature encoding layer, an encoding splice layer, a timing model layer, and an output layer; the training updating of the AI model according to the training sample comprises the following steps:
encoding the player characteristics through the first characteristic encoding layer to obtain player characteristic codes; encoding the configuration feature through the second feature encoding layer to obtain configuration feature encoding; coding the global feature through the third feature coding layer to obtain a global feature code; splicing the player feature codes, the configuration feature codes and the global feature codes through the code splicing layer to obtain spliced codes; inputting the splicing codes into a time sequence model layer to obtain time sequence splicing codes; the action behavior and the situation evaluation value are output through the output layer based on the time sequence splicing codes, and the dominance value is determined according to the feedback information and the situation evaluation value; and calculating a loss value according to the dominance value, and training and updating the AI model according to the loss value.
The player characteristics are encoded by the first feature encoding layer to obtain player feature codes; when there are a plurality of players, the player characteristics of the plurality of players are all encoded by the first feature encoding layer to obtain the player feature codes of the plurality of players. The configuration characteristics are encoded by the second feature encoding layer to obtain configuration feature codes, and the global characteristics are encoded by the third feature encoding layer to obtain global feature codes.
Wherein the first feature encoding layer, the second feature encoding layer, and the third feature encoding layer may all be implemented using a multi-layered perceptron. The player characteristics of a plurality of players are encoded by using the shared first characteristic encoding layer, and different types of characteristics, namely player characteristics, configuration characteristics and global characteristics are encoded by using different characteristic encoding layers, so that the number of parameters in the AI model is reduced, the training efficiency of the AI model is improved, and the expression capacity and generalization capacity of the AI model are also improved.
The code splicing layer splices the player feature codes, the configuration feature codes and the global feature codes to obtain a splice code, and inputs the splice code into the time sequence model layer (the time sequence model can be an LSTM) to obtain the time sequence splice code. Finally, the time sequence splice code is input into the output layer, which outputs the final result.
The output layer includes a behavior prediction section and a situation evaluation section. The behavior prediction section outputs, from the input time sequence splice code, a sampling probability for each action behavior, representing the probability of executing that action behavior. The situation evaluation section outputs, from the input time sequence splice code, a situation evaluation value, which is a predicted value of the accumulated sum of the feedback information for the action behaviors executed by the agent from the start of environment observation, and which is used for training the AI model.
After the output layer outputs the probabilities of executing the action behaviors, the action behavior to be sent to the agent for execution can be determined from the action behaviors according to these probabilities, so as to control the agent to execute the corresponding action and to acquire new environment information, from which further training samples are generated. This forms a cycle in which training samples are continuously generated to train the AI model until the AI model converges.
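For illustration, a minimal PyTorch sketch of such a structure follows: three MLP feature encoders (with one encoder shared across players), a splicing step, an LSTM timing layer, and behavior-prediction and situation-evaluation heads. The layer sizes, the flat action head and all names are assumptions, not the patent's actual network.

```python
import torch
import torch.nn as nn

class AIModel(nn.Module):
    """Sketch of the described hierarchy: feature encoders -> splice -> LSTM ->
    policy (behavior prediction) and value (situation evaluation) heads."""

    def __init__(self, player_dim, config_dim, global_dim, num_players,
                 num_weapons, action_dim, hidden=128):
        super().__init__()
        self.player_enc = nn.Sequential(nn.Linear(player_dim, hidden), nn.ReLU())
        self.config_enc = nn.Sequential(nn.Linear(config_dim, hidden), nn.ReLU())
        self.global_enc = nn.Sequential(nn.Linear(global_dim, hidden), nn.ReLU())
        concat_dim = hidden * (num_players + num_weapons + 1)
        self.lstm = nn.LSTM(concat_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, action_dim)   # behavior prediction
        self.value_head = nn.Linear(hidden, 1)             # situation evaluation

    def forward(self, player_feat, config_feat, global_feat, state=None):
        # player_feat: (B, T, num_players, player_dim); the same shared encoder
        # is applied to every player, as described above.
        p = self.player_enc(player_feat).flatten(2)   # (B, T, num_players*hidden)
        c = self.config_enc(config_feat).flatten(2)   # (B, T, num_weapons*hidden)
        g = self.global_enc(global_feat)               # (B, T, hidden)
        spliced = torch.cat([p, c, g], dim=-1)         # splice code
        seq, state = self.lstm(spliced, state)         # time sequence model layer
        return self.policy_head(seq), self.value_head(seq).squeeze(-1), state
```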
In some embodiments, if the output action is an encoded action, the encoded action may be decoded to enable the agent to execute the corresponding action.
For example, if the output encoded action behavior is [[0, 0, 0, 1, 0], [0, ..., 1, ..., 0], [0, ..., 1, ...], [1], [0, 0, 0, 0, 0, 0, 1, 0], [1, 0, 0]], the corresponding execution effect is turning left in place.
After the situation evaluation section of the output layer outputs the situation evaluation value, the dominance value can be calculated from the situation evaluation value and the feedback information in the training sample; the loss value of the AI model is then calculated using the dominance value, and finally the parameters of the AI model are updated in combination with a back-propagation algorithm until the AI model converges.
The training sample comprises five elements: the observation characteristic at the current moment, the action behavior at the current moment, the feedback information at the current moment, the observation characteristic at the next moment, and whether the agent stops running at the next moment.
For convenience of description, the training sample at time $t$ may be represented as $(s_t, a_t, r_t, s_{t+1}, d_{t+1})$, where $s_t$ denotes the observation characteristic at the current moment, $a_t$ the action behavior at the current moment, $r_t$ the feedback information at the current moment, $s_{t+1}$ the observation characteristic at the next moment, and $d_{t+1}$ whether the agent stops running at the next moment.
The consecutive training samples over a period of time, from time $t$ to time $T$, may then be expressed as $(s_t, a_t, r_t, s_{t+1}, d_{t+1}), (s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, d_{t+2}), \dots, (s_T, a_T, r_T, s_{T+1}, d_{T+1})$.
In the calculation of the dominance value, a formula of the generalized advantage estimation form may be used, for example:

$$\hat{A}_t = \sum_{l=0}^{T-t} (\gamma\lambda)^{l}\,\delta_{t+l}, \qquad \delta_t = r_t + \gamma V_{t+1} - V_t$$

wherein $\hat{A}_t$ is the dominance value obtained from the consecutive training samples from time $t$ to time $T$, $\delta_t$ represents the dominance value at time $t$, $V_t$ denotes the situation evaluation value output by the AI model at time $t$, and $\gamma$ and $\lambda$ are training hyper-parameters. Calculating the loss value of the AI model from the dominance value obtained in this way can reduce the influence of target shift caused by inaccurate situation evaluation.
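A sketch of this calculation in the form given above is shown below; the handling of episode ends via a `dones` flag is an assumption for illustration.

```python
import numpy as np

def dominance_values(rewards, values, dones, gamma=0.99, lam=0.95):
    """Dominance (advantage) values over consecutive samples t..T.

    `values` holds the situation evaluation values and has one extra entry for
    the state following the last sample; gamma and lam are hyper-parameters."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - float(dones[t])
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        adv[t] = running
    return adv
```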
The loss value of the AI model comprises a player behavior loss value and a situation evaluation loss value. The situation evaluation loss value is calculated from the dominance value, and the player behavior loss value adopts a policy-approximation style loss calculation. The policy approximation loss approximates a confidence-interval (trust-region) based calculation and gives similar results, so the calculation is faster and the AI model converges more quickly.
In calculating the player behavior loss value, a clipped policy-approximation loss may be used, for example of the following form.

Denote the training parameters of the current AI model as $P$ and the training parameters before updating as $P_{old}$, and let $\pi_{P}(a_t \mid s_t)$ represent the probability that the model with the current training parameters performs action behavior $a_t$ in state $s_t$. Then

$$r_t(P) = \frac{\pi_{P}(a_t \mid s_t)}{\pi_{P_{old}}(a_t \mid s_t)}$$

$$L_{policy} = -\mathbb{E}_t\Big[\min\big(r_t(P)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(P),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big]$$

$$L_{total} = L_{policy} + c_1\,L_{value} + c_2\,\lVert P\rVert_2^2$$

wherein $r_t(P)$ represents the overall approximate proportion (probability ratio), $\mathrm{clip}(r_t(P), 1-\epsilon, 1+\epsilon)$ represents the truncated approximate proportion obtained by clipping the overall approximate proportion, $\epsilon$ is a fixed value, $\hat{A}_t$ is the dominance value obtained from the consecutive training samples from time $t$ to time $T$, $L_{policy}$ represents the player behavior loss value, $L_{value}$ represents the situation evaluation loss value, $L_{total}$ is the final calculated total loss value, $c_1$ and $c_2$ are training hyper-parameters, and $\lVert P\rVert_2^2$ is an L2 regularization term on the model parameters.
The total loss value of the AI model can be calculated in this way. After the loss value of the AI model is calculated, whether the AI model has converged can be judged from the calculated loss value: the closer the calculated loss value of the AI model is to 0, the higher the prediction accuracy of the AI model and the closer the AI model is to convergence.
Calculating the loss value of the AI model by means of the dominance value and the policy approximation reduces the problem of target deviation of the AI model during training, improves training efficiency, and reduces time cost and calculation cost.
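A sketch of computing such a total loss in PyTorch follows; the value-loss form (squared error against returns), the coefficient values and the use of log-probabilities are assumptions for illustration, not taken from the patent.

```python
import torch

def total_loss(new_logp, old_logp, advantages, values, returns, params,
               eps=0.2, c1=0.5, c2=1e-4):
    """Clipped policy-approximation loss plus a situation-evaluation loss and
    an L2 regularization term, combined with hyper-parameter coefficients."""
    ratio = torch.exp(new_logp - old_logp)                  # approximate proportion
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)      # truncated proportion
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    value_loss = ((values - returns) ** 2).mean()           # situation evaluation loss
    l2 = sum((p ** 2).sum() for p in params)
    return policy_loss + c1 * value_loss + c2 * l2
```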
In some embodiments, when training the AI model, a plurality of agents may access the virtual environment server so that the agents learn cooperation among themselves through self-play. The plurality of agents may be connected to the same model training server, which collects multiple pieces of environment information and trains the AI model using the collected environment information as different batches.
According to the AI model training method provided by this embodiment, environment information observed by an agent in a virtual environment is acquired, observation characteristics of the agent are extracted from the environment information, the AI model corresponding to the agent is invoked, the observation characteristics are input into the AI model for prediction to obtain an action behavior, the action behavior is sent to the agent so that the agent executes it and feedback information corresponding to the action behavior is obtained, and finally the feedback information, the observation characteristics and the action behavior are used as a training sample, and the AI model is trained and updated according to the training sample. By extracting observation characteristics from the environment information and using, as training samples, the feedback information corresponding to the action behaviors executed by the agent based on those observation characteristics, the AI model is trained and updated, so that the accuracy of the AI model in the field of multi-agent control is improved.
Referring to fig. 6, fig. 6 is a schematic diagram of a scene using an AI model according to an embodiment of the application.
As shown in fig. 6, both a real player and an agent access a virtual environment server, the virtual environment server sends environmental information observed by the agent in the virtual environment to an AI online server, the AI online server performs feature extraction on the environmental information to obtain observation features, inputs the observation features into the trained AI model to perform behavior prediction, obtains action behaviors output by the AI model, and sends action instructions corresponding to the action behaviors output by the AI model to the agent in the virtual environment server so as to control the agent to perform actions according to the action instructions.
Referring to fig. 7, fig. 7 is a flowchart illustrating a method for using an AI model according to an embodiment of the application.
As shown in fig. 7, the AI model using method includes steps S201 to S204.
S201, acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information.
In actual use, a real player and an agent both access the virtual environment server; the AI online server acquires the environment information observed by the agent in the virtual environment and extracts the observation characteristics of the agent from the environment information.
The environment information observed by the agent may include user information, user configuration information, and global information. The extracting of the observation feature of the agent from the environment information may specifically be extracting a corresponding player feature, a configuration feature, and a global feature from the user information, the user configuration information, and the global information. That is, the observed features include player features, configuration features, and global features.
Specifically, a method of observation modeling may be utilized to model each user in the virtual environment for a player, model user configuration of the user in the virtual environment, and globally model the user, so as to obtain player characteristics, configuration characteristics, and global characteristics. Wherein the user configuration may refer to weapon information configured by the player.
S202, inputting the observation characteristics into an AI model to obtain probabilities corresponding to a plurality of action behaviors.
After the AI online server obtains the observation characteristics, the observation characteristics are input into the AI model, so that the probability of the agent executing each action behavior under these observation characteristics is output through the output layer of the AI model. The AI model is the model obtained with the model training method described above.
S203, determining a target action behavior from the action behaviors according to probabilities corresponding to the action behaviors.
The output layer of the AI model outputs probabilities corresponding to a plurality of action behaviors according to the input observation feature, and when determining the target action behavior, the action behavior with the highest probability can be used as the target action behavior.
In some embodiments, the action behaviors are encoded action behaviors. When determining the target action behavior, the probabilities corresponding to the plurality of action behaviors can be input into the behavior model, so that the behavior model samples the corresponding behavior category and behavior parameters according to the probabilities, and the target action behavior is finally determined from the plurality of action behaviors.
In some embodiments, referring to fig. 8, step S203 includes step S2031 and step S2032.
S2031, screening the action behaviors according to a preset action rule to obtain screened action behaviors.
After the probabilities corresponding to the action behaviors output by the AI model are obtained, the action behaviors can be respectively screened according to a preset behavior rule so as to filter illegal behaviors from the action behaviors, and the screened action behaviors are obtained.
The preset behavior rule may be a behavior rule set according to the current game rules; for example, a forbidden zone may be set in the current game scene map, indicating an area that a player cannot enter or is not recommended to enter. Action behaviors that would cause the agent to enter the forbidden zone can then be treated as illegal behaviors according to the behavior rule and filtered out of the action behaviors, so as to obtain the screened action behaviors.
S2032, determining a target action according to the probability corresponding to the action after screening.
After the screened action behaviors are obtained, the target action behavior can be determined from the screened action behaviors. When determining the target action behavior, the action behavior with the highest probability can be selected from the screened action behaviors as the target action behavior.
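A sketch of steps S2031/S2032 is given below; the rule check `is_forbidden` is game-specific and assumed here, as is the fallback when no legal action remains.

```python
import numpy as np

def pick_target_action(probs, action_ids, is_forbidden):
    """Drop action behaviors that violate the preset behavior rules (e.g. would
    move the agent into a forbidden zone), then pick the remaining action with
    the highest probability."""
    allowed = [(p, a) for p, a in zip(probs, action_ids) if not is_forbidden(a)]
    if not allowed:                      # nothing legal left: fall back to best overall
        return action_ids[int(np.argmax(probs))]
    return max(allowed, key=lambda pa: pa[0])[1]
```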
S204, the target action behavior is sent to the intelligent agent, so that the intelligent agent executes corresponding actions according to the target action behavior.
And the AI online server generates a behavior instruction according to the determined target action behavior and sends the behavior instruction to an agent in the virtual environment server so as to control the agent to execute a corresponding action according to the target action behavior characterized by the behavior instruction.
After the agent finishes executing the behavior instruction, the environment information at that moment can be acquired again and sent to the AI online server; the AI online server acquires the environment information and extracts the observation characteristics from it, so that the next action behavior of the agent can be continuously predicted and the agent can cooperate with real players.
According to the model using method provided by this embodiment, the environment information acquired by the agent is obtained, observation characteristics are extracted from the environment information, the observation characteristics are input into the AI model to obtain the probabilities corresponding to a plurality of action behaviors, a target action behavior is determined from the plurality of action behaviors according to these probabilities, and finally the target action behavior is sent to the agent so that the agent executes the corresponding action. When the AI model needs to be invoked to play a game together with real users, the agent can take corresponding actions according to the actual scene, so that fast invocation of the AI model is realized and the experience of real users is effectively improved.
The apparatus provided by the above embodiments may be implemented in the form of a computer program which may be run on a computer device as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal.
With reference to FIG. 9, the computer device includes a processor, memory, and a network interface connected by a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform any of the AI model training and/or AI model using methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to perform any of the AI model training and/or AI model using methods.
The network interface is used for network communication such as transmitting assigned tasks and the like. It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field-programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
invoking an AI model corresponding to the intelligent agent, and inputting the observation characteristic into the AI model for prediction to obtain action behavior;
the action behavior is sent to the intelligent agent, so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behaviors as training samples;
and training and updating the AI model according to the training sample.
In one embodiment, the context information includes user information, user configuration information, and global information; the processor, when implementing the extracting the observation feature of the agent from the environmental information, is configured to implement:
and extracting corresponding player characteristics, configuration characteristics and global characteristics from the user information, the user configuration information and the global information.
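As a purely illustrative sketch, the environment information is assumed below to arrive as a dictionary with user, configuration and global entries; the field names and the features picked from them are invented for the example.

```python
import numpy as np

def extract_observation_features(env_info: dict) -> dict:
    """Split the environment information into player, configuration and global features."""
    user = env_info["user"]        # user information, e.g. position and health
    config = env_info["config"]    # user configuration information, e.g. equipment
    glob = env_info["global"]      # global information, e.g. remaining time

    return {
        "player": np.asarray([user["x"], user["y"], user["hp"]], dtype=np.float32),
        "config": np.asarray(config["equipment_ids"], dtype=np.float32),
        "global": np.asarray([glob["time_left"], glob["alive_players"]], dtype=np.float32),
    }
```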
In one embodiment, the AI model includes a first feature encoding layer, a second feature encoding layer, a third feature encoding layer, an encoding splice layer, a time sequence model layer, and an output layer; the processor is configured to, when implementing the training update of the AI model according to the training sample, implement:
encoding the player characteristics through the first characteristic encoding layer to obtain player characteristic codes;
encoding the configuration features through the second feature encoding layer to obtain configuration feature codes;
coding the global feature through the third feature coding layer to obtain a global feature code;
splicing the player feature codes, the configuration feature codes and the global feature codes through the code splicing layer to obtain a spliced code;
inputting the spliced code into the time sequence model layer to obtain a time sequence spliced code;
outputting the action behavior and a situation evaluation value through the output layer based on the time sequence spliced code, and determining an advantage value according to the feedback information and the situation evaluation value;
and calculating a loss value according to the advantage value, and training and updating the AI model according to the loss value.
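For illustration only, the following PyTorch sketch shows one way the described structure could be assembled: three feature encoding layers, a code splicing (concatenation) step, a time sequence model layer, and an output layer producing both an action behavior distribution and a situation evaluation value, together with an advantage-based loss. The layer sizes, the GRU choice and the actor-critic form of the loss are assumptions of this example, not requirements of the embodiment.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class AIModel(nn.Module):
    """Sketch of the described network; all dimensions are illustrative."""

    def __init__(self, player_dim, config_dim, global_dim,
                 hidden_dim=128, num_actions=16):
        super().__init__()
        self.player_enc = nn.Linear(player_dim, hidden_dim)   # first feature encoding layer
        self.config_enc = nn.Linear(config_dim, hidden_dim)   # second feature encoding layer
        self.global_enc = nn.Linear(global_dim, hidden_dim)   # third feature encoding layer
        # time sequence model layer (a GRU is assumed here; an LSTM would also fit)
        self.temporal = nn.GRU(3 * hidden_dim, hidden_dim, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action behavior output
        self.value_head = nn.Linear(hidden_dim, 1)              # situation evaluation value

    def forward(self, player, config, glob, hidden=None):
        # Encode each feature group, then splice the codes together.
        p = torch.relu(self.player_enc(player))
        c = torch.relu(self.config_enc(config))
        g = torch.relu(self.global_enc(glob))
        spliced = torch.cat([p, c, g], dim=-1)          # spliced code
        seq, hidden = self.temporal(spliced, hidden)    # time sequence spliced code
        logits = self.policy_head(seq)                  # action behavior logits
        value = self.value_head(seq).squeeze(-1)        # situation evaluation value
        return logits, value, hidden


def advantage_loss(logits, value, actions, returns):
    """Advantage = feedback (return) minus the situation evaluation value."""
    advantage = returns - value.detach()
    log_prob = Categorical(logits=logits).log_prob(actions)
    policy_loss = -(advantage * log_prob).mean()          # policy term weighted by the advantage
    value_loss = nn.functional.mse_loss(value, returns)   # value term
    return policy_loss + value_loss
```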
In one embodiment, the processor, when implementing the encoding of the player characteristics by the first feature encoding layer to obtain player characteristic encodings, is configured to implement:
and encoding player characteristics of a plurality of players through the first characteristic encoding layer to obtain player characteristic codes of the plurality of players.
In one embodiment, the processor is further configured to implement:
and encoding the action behaviors to obtain behavior codes.
In one embodiment, the action behavior includes a behavior category and a behavior parameter; the processor is configured to, when implementing the encoding of the action behavior to obtain a behavior code, implement:
and coding the behavior category and the behavior parameter based on a preset coding strategy to obtain behavior codes.
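As a purely illustrative sketch of one possible preset coding strategy (not the one fixed by this embodiment), the behavior category may be one-hot encoded and concatenated with the behavior parameters; the category list below is invented for the example.

```python
import numpy as np

BEHAVIOR_CATEGORIES = ["move", "attack", "use_item", "idle"]  # assumed categories

def encode_action_behavior(category, params):
    """Encode a behavior category plus its behavior parameters into one flat vector."""
    one_hot = np.zeros(len(BEHAVIOR_CATEGORIES), dtype=np.float32)
    one_hot[BEHAVIOR_CATEGORIES.index(category)] = 1.0
    return np.concatenate([one_hot, np.asarray(params, dtype=np.float32)])

# Example: encode_action_behavior("move", [0.5, -1.0])
# -> array([1. , 0. , 0. , 0. , 0.5, -1. ], dtype=float32)
```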
In one embodiment, the processor is further configured to implement:
acquiring the behavior sequence of a plurality of training samples, and storing the plurality of training samples in a training sample sequence according to the behavior sequence;
and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
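A minimal sketch of such a bounded training sample sequence, assuming the oldest samples (earliest in the behavior sequence) are the ones deleted once the preset threshold is exceeded; the threshold value is illustrative.

```python
from collections import deque

MAX_SEQUENCE_LENGTH = 1024  # preset threshold, value assumed for illustration

# Samples are appended in behavior order; once the deque is full, appending a
# new sample automatically discards the sample at the front, i.e. the earliest
# behavior in the sequence.
training_sequence = deque(maxlen=MAX_SEQUENCE_LENGTH)

def add_training_sample(sample):
    training_sequence.append(sample)
```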
In one embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation characteristics into an AI model to obtain probabilities corresponding to a plurality of action behaviors, wherein the AI model is a model obtained by the AI model training method described above;
determining a target action behavior from a plurality of action behaviors according to probabilities corresponding to the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes corresponding actions according to the target action behavior.
In one embodiment, when the processor determines the target action behavior from the plurality of action behaviors according to probabilities corresponding to the plurality of action behaviors, the processor is configured to implement:
screening the action behaviors according to preset action rules to obtain screened action behaviors;
and determining the target action behavior according to the probabilities corresponding to the screened action behaviors.
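For illustration, a minimal sketch of this screening-then-selection step: action behaviors forbidden by the preset action rules are masked out and the target action behavior is chosen from the remaining ones by probability. The `is_allowed` rule check is a placeholder assumption.

```python
import numpy as np

def select_target_action(action_probs, is_allowed):
    """Screen actions with a rule predicate, then pick the most probable survivor."""
    mask = np.array([is_allowed(i) for i in range(len(action_probs))], dtype=bool)
    screened = np.where(mask, action_probs, 0.0)
    if screened.sum() == 0.0:
        screened = np.asarray(action_probs, dtype=float)  # no action passed the rules; fall back
    screened = screened / screened.sum()                  # renormalise the screened probabilities
    return int(np.argmax(screened))

# Example: forbid action 0 and keep the rest.
# select_target_action(np.array([0.6, 0.3, 0.1]), lambda i: i != 0) -> 1
```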
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and when the program instructions are executed by a processor, the processor implements any one of the AI model training methods and/or AI model using methods provided by the embodiments of the application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, which are provided on the computer device.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.
Claims (8)
1. A method of AI model use, comprising:
acquiring environment information observed by an agent in a virtual environment, wherein the environment information comprises user information, user configuration information and global information, and extracting corresponding player characteristics, configuration characteristics and global characteristics from the user information, the user configuration information and the global information;
invoking an AI model corresponding to the intelligent agent, and inputting the observation characteristic into the AI model for prediction to obtain action behavior; the AI model comprises a first feature coding layer, a second feature coding layer, a third feature coding layer, a coding splicing layer, a time sequence model layer and an output layer;
sending the action behavior to the intelligent agent, so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behaviors as training samples;
encoding the player characteristics through the first feature coding layer to obtain player feature codes; encoding the configuration features through the second feature coding layer to obtain configuration feature codes; encoding the global features through the third feature coding layer to obtain global feature codes; splicing the player feature codes, the configuration feature codes and the global feature codes through the coding splicing layer to obtain a spliced code; inputting the spliced code into the time sequence model layer to obtain a time sequence spliced code; outputting the action behavior and a situation evaluation value through the output layer based on the time sequence spliced code, and determining an advantage value according to the feedback information and the situation evaluation value; calculating a loss value according to the advantage value, and training and updating the AI model according to the loss value;
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation characteristics into the AI model to obtain probabilities corresponding to a plurality of action behaviors;
determining a target action behavior from a plurality of action behaviors according to probabilities corresponding to the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes corresponding actions according to the target action behavior.
2. The AI model use method of claim 1, wherein the encoding the player characteristics through the first feature coding layer to obtain player feature codes comprises:
and encoding player characteristics of a plurality of players through the first feature coding layer to obtain player feature codes of the plurality of players.
3. The AI model use method according to claim 1, characterized in that the method further comprises:
and encoding the action behaviors to obtain behavior codes.
4. The AI model use method of claim 3, wherein the action behavior includes a behavior category and a behavior parameter; the encoding the action behaviors to obtain behavior codes comprises:
and coding the behavior category and the behavior parameter based on a preset coding strategy to obtain behavior codes.
5. The AI model use method according to claim 1, characterized in that the method further comprises:
acquiring the behavior sequence of a plurality of training samples, and storing the plurality of training samples in a training sample sequence according to the behavior sequence;
and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
6. The AI model use method of claim 1, wherein the determining a target action behavior from a plurality of action behaviors according to probabilities corresponding to the plurality of action behaviors comprises:
screening the action behaviors according to preset action rules to obtain screened action behaviors;
and determining the target action behavior according to the probabilities corresponding to the screened action behaviors.
7. A computer device, the computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and, when executing the computer program, implement the AI model use method according to any one of claims 1 to 6.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which when executed by a processor causes the processor to implement the AI model use method according to any one of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202010408928.9A (CN111589157B) | 2020-05-14 | 2020-05-14 | AI model using method, apparatus and storage medium |
Publications (2)

| Publication Number | Publication Date |
| --- | --- |
| CN111589157A (en) | 2020-08-28 |
| CN111589157B (en) | 2023-10-31 |