CN111589157A - AI model training method, AI model using method, equipment and storage medium - Google Patents


Info

Publication number
CN111589157A
Authority
CN
China
Prior art keywords
model
behavior
action
agent
player
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010408928.9A
Other languages
Chinese (zh)
Other versions
CN111589157B (en)
Inventor
王宇舟
郭仁杰
杨木
张弛
武建芳
杨正云
李宏亮
刘永升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Super Parameter Technology Shenzhen Co ltd filed Critical Super Parameter Technology Shenzhen Co ltd
Priority to CN202010408928.9A priority Critical patent/CN111589157B/en
Publication of CN111589157A publication Critical patent/CN111589157A/en
Application granted granted Critical
Publication of CN111589157B publication Critical patent/CN111589157B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Abstract

The application discloses an AI model training method, an AI model using method, a computer device and a storage medium, wherein the method comprises the following steps: acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information; calling an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain action behaviors; sending the action behavior to the intelligent agent so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior; acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behavior as training samples; and training and updating the AI model according to the training samples. The application improves the accuracy of the AI model.

Description

AI model training method, AI model using method, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an AI model training method, an AI model using method, a computer device, and a storage medium.
Background
With the development of Artificial Intelligence (AI) technology, artificial intelligence is increasingly applied in various fields, such as self-driving cars, StarCraft, and Dota 2. In the field of game play, agents controlled by AI models can reach a level beyond that of professional players.
However, most current AI models are trained with task-based planning and control methods, and AI models trained with task-based or rule-based methods do not perform well in the field of multi-agent vehicle control. This is because, in the field of multi-agent control, the AI model needs to consider the competitive and cooperative relationships among multiple agents, or between an agent and a player, and control the vehicle based on those relationships. The increased complexity of the situation increases the amount of data the AI model has to analyze, which not only slows down the AI model's data analysis but may also prevent it from effectively analyzing the current situation, which in turn appears as slow or unreasonable movement of the agent.
Therefore, how to improve the accuracy of the AI model in the field of multi-agent control becomes an urgent problem to be solved.
Disclosure of Invention
The application provides an AI model training method, an AI model using method, a computer device and a storage medium, which are used for improving the accuracy of an AI model in the field of multi-agent control.
In a first aspect, the present application provides an AI model training method, including:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
calling an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain action behaviors;
sending the action behavior to the intelligent agent so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behavior as training samples;
and training and updating the AI model according to the training samples.
In a second aspect, the present application also provides an AI model using method, including:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation features into an AI model to obtain probabilities corresponding to a plurality of action behaviors, wherein the AI model is obtained by adopting the model training method of the first aspect;
determining a target action behavior from a plurality of action behaviors according to the corresponding probabilities of the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes a corresponding action according to the target action behavior.
In a third aspect, the present application further provides a computer device comprising a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and to implement the AI model training method and/or the AI model using method as described above when executing the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the AI model training method and/or the AI model using method as described above.
The application discloses an AI model training method, an AI model using method, computer equipment and a storage medium, wherein environmental information observed by an agent in a virtual environment is obtained, observation characteristics of the agent are extracted from the environmental information, then an AI model corresponding to the agent is called, the observation characteristics are input into the AI model for prediction to obtain action behaviors, the action behaviors are sent to the agent, the agent executes the action behaviors, feedback information corresponding to the action behaviors is obtained, and finally the feedback information, the observation characteristics and the action behaviors are used as training samples to train and update the AI model according to the training samples. The observation features are extracted from the environmental information, and the feedback information corresponding to the action executed by the agent based on the observation features is jointly used as a training sample to train and update the AI model, so that the accuracy of the AI model in the field of multi-agent control is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic view of a scenario for training an AI model according to an embodiment of the present disclosure;
fig. 2 is a schematic flow chart of an AI model training method provided in an embodiment of the present application;
FIG. 3 is a schematic angle view provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the coding and behavior comparison provided by embodiments of the present application;
FIG. 5 is a schematic diagram of a hierarchical structure of an AI model provided by an embodiment of the application;
FIG. 6 is a schematic diagram of a scenario using an AI model according to an embodiment of the present disclosure;
FIG. 7 is a schematic flow chart illustrating a method for using an AI model according to an embodiment of the present disclosure;
FIG. 8 is a flow diagram illustrating sub-steps of the AI model using method provided in FIG. 7;
fig. 9 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Embodiments of the present application provide an AI model training method, an AI model using method, a computer device, and a storage medium. The AI model training method can be applied to a server, and the server can be a single server or a server cluster consisting of a plurality of servers.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
It should be noted that, the following description will be given in detail by taking the application of the AI model to a ship game as an example, and it is understood that the AI model can also be applied to other multi-agent control scenarios.
Referring to fig. 1, fig. 1 is a schematic view of a scenario for training an AI model according to an embodiment of the present disclosure.
As shown in FIG. 1, the model training server includes a prediction portion and a training portion. The prediction part is used for predicting action behaviors so as to generate training samples, and the training part is used for training and updating the AI model.
The virtual environment server sends the environment information observed by the intelligent agent in the virtual environment to the prediction part in the model training server, the prediction part extracts the characteristics of the environment information to obtain observation characteristics, and the observation characteristics are input into the AI model to perform behavior prediction to obtain the action behavior output by the AI model. And the predicting part sends the behavior instruction corresponding to the action behavior to the intelligent agent in the virtual environment server so as to control the intelligent agent to act according to the behavior instruction and generate feedback information.
And then the prediction part takes the observation characteristics, the action behaviors and the feedback information as training samples to be sent to the training part, the training part trains the AI model according to the training samples, and the updated parameters are returned to the AI model in the prediction part so as to update the AI model.
Referring to fig. 2, fig. 2 is a schematic flow chart of an AI model training method according to an embodiment of the present disclosure.
As shown in fig. 2, the AI model training method specifically includes: step S101 to step S104.
S101, obtaining environment information observed by the intelligent agent in the virtual environment, and extracting observation characteristics of the intelligent agent from the environment information.
The virtual environment may refer to a virtual environment server, and after the agent and the real player access the virtual environment server, corresponding actions may be executed in the virtual environment server. For example, the virtual environment server may refer to a game server, and after the agent and the real player access the game server, the game can be played in a virtual game scene provided by the game server.
After the intelligent agent is accessed to the virtual environment server, the virtual environment server can observe the surrounding environment at the visual angle of the intelligent agent to obtain the environmental information observed by the intelligent agent, and sends the environmental information observed by the intelligent agent to the model training server, and the model training server extracts the observation characteristics of the intelligent agent from the environmental information after obtaining the environmental information observed by the intelligent agent. The observation features refer to features that can be observed in the field of view of the agent, including features of the agent and features of other players in the field of view.
The model training server may include a prediction part and a training part. The prediction part is used for acquiring the environment information observed by the agent in the virtual environment, extracting the observation features of the agent from the environment information, predicting the agent's action behavior based on the observation features, and obtaining the feedback information corresponding to the action behavior. The training part is used for training the AI model with the observation features, the action behaviors, and the corresponding feedback information as training samples.
In some embodiments, the context information observed by the agent may include user information, user configuration information, and global information. The extracting of the observation feature of the agent from the environment information may specifically be extracting a corresponding player feature, configuration feature and global feature from the user information, user configuration information and global information. That is, the observed features include player features, configuration features, and global features.
Specifically, a method of observation modeling may be utilized to perform player modeling for each user in the virtual environment, model user configuration for the users in the virtual environment, and perform global modeling for the whole world, thereby obtaining player characteristics, configuration characteristics, and global characteristics. Wherein, the users in the virtual environment can be a plurality of agents, or can be part of real players and at least one agent. The user configuration information may refer to player-configured weapons information.
In the ship game, the user information is player information. Players include primary players, teammate players, and opposing players. The main player refers to a currently observed player, that is, if the surrounding environment is observed at the perspective of the agent, the agent is the main player, and if the surrounding environment is observed at the perspective of the real player, the real player is the main player. Teammate players are players who are on the same team as the main player, and opponent players are players who are opponents to the main player and can be observed in the visual field of the main player.
If an enemy player is not within the visual field of the main player, that player is not regarded as an opposing player until it enters the main player's visual field.
When there are multiple teammate players and/or opposing players, each player may be flagged to facilitate differentiation between multiple players. Taking a five-player ship-class cooperative battle game as an example, for example, four teammates players who are in the same team as the main player can be respectively recorded as teammate 1, teammate 2, teammate 3 and teammate 4, and the enemy players which can be observed in the visual field range of the main player can be respectively recorded as enemy 1, enemy 2, enemy 3, enemy 4 and enemy 5.
The user information includes basic information and pose information. Wherein the basic information is used to indicate the status of the current player, e.g., whether the current player is the master player, whether it is alive, moving speed, etc. The pose information is then used to represent the global pose of the current player and the relative pose with respect to the relative players (i.e., other teammate players and/or opposing players). Wherein the global pose comprises the current player coordinates and the current player angle, and the relative pose comprises the relative player coordinates, the relative distance to the relative player, and the relative angle.
The angles in the global pose and the relative pose may include the following: the angle of the current player's viewing direction relative to its own hull, the bow angle relative to another player, the hull angle relative to another player, the viewing angle relative to another player, the current player's hull angle, and the current player's viewing angle.
As shown in fig. 3, in the ship game, after a player enters the game, the player moves facing an initial direction, and the player can adjust the viewing angle based on that initial direction.
The player can rotate the viewing angle while keeping the hull position unchanged. During this rotation, the angle of the player's viewing direction relative to the hull can be acquired; the angle of the current player's viewing direction relative to its own hull is recorded as ∠1. The angle of the current player's bow relative to another player is taken as the bow angle relative to that player and recorded as ∠2. The angle of the current player's hull relative to another player is taken as the hull angle relative to that player and recorded as ∠3. The angle of the current player's viewing direction relative to another player is taken as the viewing angle relative to that player and recorded as ∠4. The angle of the current player's hull relative to the initial direction is taken as the current player's hull angle and recorded as ∠5. The angle of the current player's viewing direction relative to the initial direction is taken as the current player's viewing angle and recorded as ∠6.
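A relative angle such as ∠1 to ∠6 can be computed as a signed heading difference; the following minimal Python sketch is illustrative only, and the variable names in the comments are assumptions.

```python
def relative_angle(angle_a, angle_b):
    """Signed difference between two headings in degrees, wrapped to (-180, 180]."""
    diff = (angle_a - angle_b) % 360.0
    return diff - 360.0 if diff > 180.0 else diff

# e.g. angle 1: current player's viewing direction relative to its own hull
# angle_1 = relative_angle(view_direction, hull_direction)
# e.g. angle 5: current player's hull relative to the initial direction
# angle_5 = relative_angle(hull_direction, initial_direction)
```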
When modeling players, every player in the virtual environment is modeled, that is, each player in the virtual environment is in turn treated as the main player, and the user information corresponding to each player is obtained. Default values can therefore be used to fill in missing values in a player's user information when the number of teammate players and/or opposing players is insufficient. An insufficient number of teammate or enemy players means that some of them are missing, for example because a player has died or disconnected.
If angle information in the pose information cannot be acquired, the missing angles are filled in with default values. For example, if the player currently being modeled is not the main player and its viewing angle cannot be obtained, then ∠1, ∠4, and ∠6 cannot be obtained, and default values can be used to fill in ∠1, ∠4, and ∠6.
In a specific implementation, the default value may be a preset value indicating that the value cannot be obtained and is null.
After the environment information observed by the agent is obtained, each player in the virtual environment is modeled separately, so that the player features and the feature values corresponding to those features are extracted from the user information. The player features include status features and pose features. The status features include whether the current player is the main player, whether it is alive, its movement speed, and so on. The pose features include global pose features, namely the current player's coordinates and angles, and relative pose features, namely the coordinates of the other player, the relative distance to that player, and the relative angle.
User configuration information may refer to player-configured weapons information. The user configuration information may include bow weapon information, side weapon information, and deck weapon information.
And carrying out observation modeling according to the user configuration information so as to extract the configuration characteristics and characteristic values corresponding to the configuration characteristics from the user configuration information. Configuration features include whether the weapon is held, whether the weapon is activated, whether the weapon is available, the type of weapon, whether the current aiming direction is within firing range, whether there is a hostile player within firing range, and the time to swap the weapon.
The global information includes team information and game information. Observation modeling is performed on the global information so as to extract the global features and their corresponding feature values. The global features comprise team features and game features: the team features indicate the game progress of the team the current player belongs to and include the number of surviving and dead members of the whole team, while the game features indicate the global progress of the game and include the number of players remaining globally and the remaining game time.
By modeling each player in the virtual environment separately through observation modeling, the relative relationships among the main player, teammate players, and opposing players are expressed, so that an agent controlled by the trained AI model can cooperate with its teammate players. Modeling the user configuration information yields configuration features that express the attributes of the in-game weapons, which improves how smoothly an agent controlled by the trained AI model operates those weapons.
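For illustration only, the following Python sketch shows one possible way to organize the player features, configuration features, and global features described above as data structures; all field names are hypothetical and the exact fields used in a deployment may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PlayerFeature:
    # status features
    is_main_player: bool = False
    is_alive: bool = True
    move_speed: float = 0.0
    # global pose features
    position: List[float] = field(default_factory=lambda: [0.0, 0.0])
    view_to_hull_angle: float = 0.0    # angle 1 (default-filled if unobtainable)
    hull_angle: float = 0.0            # angle 5
    view_angle: float = 0.0            # angle 6 (default-filled if unobtainable)
    # relative pose features with respect to one other player
    relative_position: List[float] = field(default_factory=lambda: [0.0, 0.0])
    relative_distance: float = 0.0
    bow_angle_to_player: float = 0.0   # angle 2
    hull_angle_to_player: float = 0.0  # angle 3
    view_angle_to_player: float = 0.0  # angle 4 (default-filled if unobtainable)

@dataclass
class ConfigFeature:
    # one instance per weapon slot (bow, side, or deck weapon)
    holds_weapon: bool = False
    weapon_activated: bool = False
    weapon_available: bool = False
    weapon_type: int = 0
    aim_in_firing_range: bool = False
    enemy_in_firing_range: bool = False
    weapon_swap_time: float = 0.0

@dataclass
class GlobalFeature:
    team_alive_count: int = 0
    team_dead_count: int = 0
    global_players_remaining: int = 0
    remaining_game_time: float = 0.0
```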
And S102, calling an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain action behaviors.
After the observation characteristics are obtained, the model training server can call an AI model corresponding to the intelligent agent, the observation characteristics are input into the AI model, and the AI model predicts the behavior of the intelligent agent according to the input observation characteristics to obtain the action behavior.
The model training server stores a pre-stored initial AI model, inputs the observation characteristics into the initial AI model, and predicts the behavior of the agent according to the input observation characteristics by the initial AI model, thereby obtaining the action behavior.
S103, sending the action behaviors to the intelligent agent so that the intelligent agent executes the action behaviors to obtain feedback information corresponding to the action behaviors.
After the action behaviors output by the AI model are obtained, the model training server sends the action behaviors to the intelligent agent in the virtual environment so as to control the intelligent agent to execute the received action behaviors. After executing the received action behavior, the agent generates feedback information corresponding to the executed action behavior to feed back whether the action behavior is appropriate. Specifically, feedback information corresponding to the action behavior may be obtained by setting a feedback function.
While the agent executes action behaviors, an execution target is set for it, namely the goal the agent needs to reach after multiple action behaviors. Each time the agent executes an action, that action is given feedback: if the action brings the agent closer to the execution target, the feedback information is positive and the action behavior is appropriate; if the action moves the agent away from the execution target, the feedback information is negative and the action behavior is not appropriate.
For example, if the agent's execution target is to move from point A to point B, then after the agent executes the action behavior sent by the model training server, the feedback information is negative if the agent is now farther from point B than it was at its original position, and positive if the agent is now closer to point B than it was at its original position.
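As a concrete illustration of this feedback rule, the following is a minimal Python sketch; the function name, the +1/-1 values, and the use of Euclidean distance are assumptions chosen only to mirror the point-A-to-point-B example.

```python
import math

def feedback(prev_position, new_position, target):
    """Positive feedback when the executed action behavior moves the agent
    closer to the execution target (e.g. point B), negative when it moves
    the agent farther away."""
    before = math.dist(prev_position, target)
    after = math.dist(new_position, target)
    return 1.0 if after < before else -1.0
```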
In some embodiments, whether the AI model converges may be determined by feedback information over a continuous period of time, for example, the feedback information over a period of time may be embodied in a graph, and if the feedback information is always positive over a period of time, the AI model may be considered to converge.
And S104, obtaining the feedback information, and taking the feedback information, the observation characteristics and the action behavior as training samples.
After the feedback information is obtained through calculation, the model training server stores the observation characteristics, the feedback information and the action behaviors as training samples together for training the AI model.
Because the agent's action behavior is dynamic, that is, the agent keeps executing action behaviors over a period of time, the environment information observed by the agent at each moment may differ, and so may the observation features extracted from it. To improve the prediction accuracy of the trained AI model, the observation features at the current moment, the action behavior at the current moment, the feedback information at the current moment, the observation features at the next moment, and whether the agent stops operating at the next moment may be stored together as one training sample.
In some embodiments, the method further comprises: acquiring a behavior sequence of a plurality of training samples, and storing the training samples in a training sample sequence according to the behavior sequence; and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
The action order refers to the order in which the agent performs each action. And for continuous training samples collected in a period of time, storing the training samples in a training sample sequence according to the behavior sequence of a plurality of training samples, and deleting part of the training samples according to the behavior sequence when the number of the training samples stored in the training sample sequence is greater than a preset threshold value.
The preset threshold may refer to the maximum number of training samples that can be stored in the training sample sequence. When training samples are deleted according to the behavior sequence, they are deleted in that order from front to back, i.e. the earliest samples are removed first.
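A minimal Python sketch of such a bounded training sample sequence is shown below; the class name and the threshold value are assumptions, and a deque is used only as one convenient way to drop the oldest samples first.

```python
from collections import deque

class TrainingSampleSequence:
    """Stores training samples in the order the action behaviors were executed;
    when the preset threshold is exceeded, the oldest samples are removed first."""
    def __init__(self, preset_threshold=4096):     # threshold value is an assumption
        self.samples = deque(maxlen=preset_threshold)

    def append(self, sample):
        # sample is the tuple (s_t, a_t, r_t, s_t+1, d_t+1) described above
        self.samples.append(sample)
```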
In some embodiments, the number of the agents controlled by the model training server may be multiple, and when there are multiple agents, the behavior sequence of each agent executing each action may be obtained, and the training samples may be stored according to the behavior sequence.
In some embodiments, the method further comprises: and coding the action behaviors to obtain behavior codes.
The action behaviors comprise behavior categories and behavior parameters, when the action behaviors are coded, behavior modeling can be carried out on the action behaviors to distinguish the behavior categories and the behavior parameters, and then the behavior categories and the behavior parameters are coded based on a preset coding strategy to obtain behavior codes. In particular, one-hot encoding can be adopted to encode the behavior category and the behavior parameter.
Specifically, the behavior modeling may be a hierarchical modeling, and specifically has two hierarchies including a behavior selection layer and a behavior parameter layer, where the behavior selection layer selects one behavior category, the behavior parameter layer selects one corresponding behavior parameter according to the selected behavior category, and different behavior categories also correspond to different behavior parameters. And the behavior modeling is carried out on the action behaviors, so that the number of labels in each dimension is reduced, and the training of an AI model is facilitated.
For example, the behavior categories may include horizontal aiming offset values, vertical aiming offset values, firing, movement, and weapon switching, among others.
The horizontal aiming offset value and the vertical aiming offset value can be provided with different numbers of labels in the horizontal direction or the vertical direction according to different ship types and weapon types.
Firing means that the agent performs a firing action according to its current aiming angle.
The movement may be specifically divided into a linear velocity and an angular velocity, and the linear velocity and the angular velocity may be used in combination.
Switching weapons means that if the current weapon is unavailable, or several weapons can be used at the same time, the agent needs to switch to another weapon; different numbers of tags can be set according to the weapon numbers or the order of use.
Illustratively, a horizontal aiming offset may be set: within the interval [ -30 degrees, 30 degrees ], one tag every 1 degree, for a total of 61 tags.
Aiming offset in vertical direction: within the interval [ -10 degrees, 10 degrees ], one tag every 1 degree, for a total of 11 tags.
Firing: one tag.
Linear and angular velocities of movement use a combination: straight forward, forward right turn, pivot right turn, reverse right turn, straight reverse, reverse left turn, pivot left turn, forward left turn, total 8 tags.
Switching weapons: switch to the previous weapon, switch to the next weapon, or do not switch, for a total of 3 tags.
If the current action behavior is a pivot left turn, the output one-hot codes are [[0,0,0,1,0], [0,0,1,0 …], [0,0, …,1, …], [1], [0,0,0,0,0,0,1,0], [1,0,0]]. The first element [0,0,0,1,0] is the behavior selection layer and indicates that the 4th behavior category, movement, is selected. The second to sixth elements each represent the behavior parameter within one behavior category; the parameter of the selected 4th category, namely [0,0,0,0,0,0,1,0], corresponds to the specific behavior parameter of a pivot left turn.
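The hierarchical one-hot encoding in the example above could be produced by a sketch like the following; the helper names and the placeholder parameter indices are assumptions, and the label counts follow the example given in this embodiment.

```python
def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

# number of labels per behavior category, following the example above:
# horizontal aim, vertical aim, fire, movement, switch weapon
PARAM_SIZES = [61, 11, 1, 8, 3]

def encode_action(selected_category, param_indices):
    """Behavior selection layer: a one-hot vector over the behavior categories.
    Behavior parameter layer: one one-hot vector per category; only the vector
    of the selected category is actually executed."""
    selection = one_hot(selected_category, len(PARAM_SIZES))
    params = [one_hot(idx, size) for idx, size in zip(param_indices, PARAM_SIZES)]
    return [selection] + params

# pivot left turn: movement (category index 3) with the 7th movement label (index 6);
# the other parameter indices here are arbitrary placeholders
example = encode_action(3, [2, 0, 0, 6, 0])
```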
And S105, training and updating the AI model according to the training samples.
And after the training sample is obtained, the AI model can be trained and updated according to the training sample, and after the AI model is converged, the converged AI model is stored to finish training.
In some embodiments, referring to fig. 5, the AI model includes a first feature encoding layer, a second feature encoding layer, a third feature encoding layer, an encoding splicing layer, a timing model layer, and an output layer; the training and updating the AI model according to the training samples comprises:
encoding the player characteristics through the first characteristic encoding layer to obtain player characteristic codes;
encoding the configuration characteristics through the second characteristic encoding layer to obtain configuration characteristic codes;
coding the global features through the third feature coding layer to obtain global feature codes;
splicing the player characteristic code, the configuration characteristic code and the global characteristic code through the code splicing layer to obtain a splicing code;
inputting the splicing codes into a time sequence model layer to obtain time sequence splicing codes;
outputting action behaviors and a local evaluation value through the output layer based on the time sequence splicing codes, and determining an advantage value according to the feedback information and the local evaluation value;
and calculating a loss value according to the advantage value so as to train and update the AI model according to the loss value.
The player feature codes are obtained by encoding the player features through the first feature encoding layer; when there are multiple players, the player features of the multiple players can each be encoded by the first feature encoding layer to obtain the player feature codes of the multiple players. The configuration features are encoded through the second feature encoding layer to obtain the configuration feature codes, and the global features are encoded through the third feature encoding layer to obtain the global feature codes.
The first feature encoding layer, the second feature encoding layer and the third feature encoding layer can be implemented by using a multi-layer perceptron. The player characteristics of a plurality of players are coded by using the shared first characteristic coding layer, and different characteristic coding layers are adopted for different types of characteristics, namely for the player characteristics, the configuration characteristics and the global characteristics, so that the quantity of parameters in the AI model is reduced, the training efficiency of the AI model is improved, and the expression capability and the generalization capability of the AI model are also improved.
The code splicing layer splices the player feature codes, the configuration feature codes and the global feature codes into a splicing code, which is input into the time sequence model layer (the time sequence model can be an LSTM) to obtain the time sequence splicing code; the time sequence splicing code is finally input into the output layer, which outputs the final result.
The output layer comprises a behavior prediction part and a situation evaluation part. The behavior prediction part outputs, from the input time sequence splicing code, a sampling probability for each action behavior, i.e. the probability of executing the corresponding action behavior. The situation evaluation part outputs, from the input time sequence splicing code, a situation evaluation value, which is a predicted value of the accumulated sum of the feedback information for the action behaviors executed after the agent starts observing the environment, and is used for training the AI model.
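A minimal sketch of this layered structure is given below, written with PyTorch as an assumed framework; the layer sizes, the pooling of multiple players' features, and the use of a single flat action head are illustrative simplifications rather than part of the embodiment.

```python
import torch
import torch.nn as nn

class AIModel(nn.Module):
    """Sketch of the layered structure described above: three feature encoders,
    a code splicing (concatenation) step, an LSTM time sequence layer, and an
    output layer with a behavior prediction head and a situation evaluation head."""
    def __init__(self, player_dim, config_dim, global_dim, num_actions, hidden=128):
        super().__init__()
        # first / second / third feature encoding layers (multi-layer perceptrons)
        self.player_enc = nn.Sequential(nn.Linear(player_dim, hidden), nn.ReLU())
        self.config_enc = nn.Sequential(nn.Linear(config_dim, hidden), nn.ReLU())
        self.global_enc = nn.Sequential(nn.Linear(global_dim, hidden), nn.ReLU())
        # time sequence model layer over the spliced codes
        self.lstm = nn.LSTM(hidden * 3, hidden, batch_first=True)
        # output layer: behavior prediction head and situation evaluation head
        self.policy_head = nn.Linear(hidden, num_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, player_feat, config_feat, global_feat, state=None):
        # inputs are assumed to have shape (batch, time, feature_dim); features of
        # several players would be encoded by the shared player encoder and pooled
        # before this point, which is omitted here for brevity
        spliced = torch.cat([self.player_enc(player_feat),
                             self.config_enc(config_feat),
                             self.global_enc(global_feat)], dim=-1)
        seq, state = self.lstm(spliced, state)
        # probability of each action behavior, plus the situation evaluation value
        return torch.softmax(self.policy_head(seq), dim=-1), self.value_head(seq), state
```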
After the probability of executing the action behaviors is output by the output layer, the action behaviors which are sent to the intelligent agent for execution can be determined from the action behaviors according to the probability of the action behaviors so as to control the intelligent agent to perform corresponding actions, a new round of environmental information collection is performed, and then training samples are generated to form a loop, and the training samples are continuously generated to train the AI model until the AI model converges.
In some embodiments, if the output action behavior is an encoded action behavior, the encoded action behavior may be decoded to enable the agent to execute the corresponding action behavior.
For example, if the output encoded action behavior is [[0,0,0,1,0], [0,0,1,0 …], [0,0, …,1, …], [1], [0,0,0,0,0,0,1,0], [1,0,0]], the corresponding execution effect is a pivot left turn.
After the situation evaluation value is output by the situation evaluation part of the output layer, an advantage value can be calculated according to the situation evaluation value and the feedback information in the training sample, then the loss value of the AI model is calculated by utilizing the advantage value, and finally the parameters of the AI model are updated by combining with a back propagation algorithm until the AI model is converged.
The training sample comprises five elements, namely the observation characteristic of the current moment, the action behavior of the current moment, the feedback information of the current moment, the observation characteristic of the next moment and whether the intelligent agent stops operating at the next moment.
Thus, the training sample at time t can be expressed as [s_t, a_t, r_t, s_{t+1}, d_{t+1}], where s_t represents the observation feature at the current time, a_t the action behavior at the current time, r_t the feedback information at the current time, s_{t+1} the observation feature at the next time, and d_{t+1} whether the agent stops operating at the next time.
Then, the training samples continuous from time t to time T may be expressed as [s_t, a_t, r_t, s_{t+1}, d_{t+1}], [s_{t+1}, a_{t+1}, r_{t+1}, s_{t+2}, d_{t+2}], …, [s_T, a_T, r_T, s_{T+1}, d_{T+1}].
In calculating the dominance value, the following formula may be used:
A_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T-t) δ_T

δ_t = r_t + γV(s_{t+1}) - V(s_t)

where A_t is the dominance value obtained from the training samples continuous from time t to time T, δ_t denotes the single-step dominance value at time t, V(s_t) denotes the situation evaluation value output by the AI model at time t, and γ and λ are training hyper-parameters. Calculating the loss value of the AI model from the dominance value reduces the influence of target offset caused by inaccurate situation evaluation.
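The dominance value computation above can be sketched in Python as follows; the default values of γ and λ are assumptions, and episode-termination flags are ignored for brevity.

```python
def compute_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Computes A_t following the formula above for one contiguous trajectory.
    `values` must hold one extra entry V(s_{T+1}) for the final delta; gamma and
    lam are training hyper-parameters (the defaults here are assumed values)."""
    deltas = [rewards[t] + gamma * values[t + 1] - values[t]
              for t in range(len(rewards))]
    advantages = []
    running = 0.0
    for delta in reversed(deltas):
        running = delta + gamma * lam * running
        advantages.insert(0, running)
    return advantages
```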
The loss value of the AI model comprises a player behavior loss value and a situation evaluation loss value. The situation evaluation loss value is the calculated dominance value, and the player behavior loss value is calculated with a policy approximation loss. The policy approximation loss approximates a trust-region based calculation: the results are similar, but the computation is faster and the AI model converges more quickly.
When calculating the player behavior loss value, the specific calculation method may be:
Denote the training parameters of the current AI model as P and the training parameters before updating as P'; P(s_t, a_t) denotes the probability of performing action a_t in state s_t under the current training parameters.

ratio = P(s_t, a_t) / P'(s_t, a_t)

ratio_clip = max(min(ratio, 1 + ε), 1 - ε)

loss_policy = min(ratio * A_t, ratio_clip * A_t)

loss_value = A_t

loss_total = loss_policy + c1 * loss_value + c2 * |W|

where ratio represents the overall approximation ratio, ratio_clip represents the truncated approximation ratio obtained by clipping the overall approximation ratio, ε is a constant, A_t is the dominance value obtained from the training samples continuous from time t to time T, loss_policy is the player behavior loss value, loss_value is the situation evaluation loss value, loss_total is the resulting total loss value, c1 and c2 are training hyper-parameters, and |W| is the L2 regularization term over the model parameters.
The total loss value loss_total of the AI model can be calculated with the formulas above. After the loss value is calculated, whether the AI model has converged can be judged from it: the closer the calculated loss value is to 0, the higher the prediction accuracy of the AI model and the closer it is to convergence.
Calculating the loss value of the AI model with the dominance value and the policy approximation reduces the problem of target deviation during training, improves training efficiency, and reduces time and computation costs.
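A direct transcription of the loss formulas above into Python might look like the following sketch; the default values of ε, c1 and c2 are assumptions, and the function operates on a single sample for clarity.

```python
def total_loss(ratio, advantage, l2_norm, eps=0.2, c1=1.0, c2=1e-4):
    """Computes loss_total for one sample following the formulas above;
    eps, c1 and c2 are training hyper-parameters (assumed default values)."""
    ratio_clip = max(min(ratio, 1 + eps), 1 - eps)
    loss_policy = min(ratio * advantage, ratio_clip * advantage)
    loss_value = advantage                      # situation evaluation loss value
    return loss_policy + c1 * loss_value + c2 * l2_norm
```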
In some embodiments, when the AI model is trained, a plurality of agents may be connected to the virtual environment server so that the agents learn cooperation among themselves through self-play. These agents may be connected to the same model training server, which collects multiple pieces of environment information and trains the AI model with the collected environment information as different batches.
In the AI model training method provided in the above embodiment, the environmental information observed by the agent in the virtual environment is obtained, the observation characteristics of the agent are extracted from the environmental information, the AI model corresponding to the agent is called, the observation characteristics are input into the AI model for prediction to obtain the action behavior, the action behavior is sent to the agent, the agent executes the action behavior, the feedback information corresponding to the action behavior is obtained, and finally the feedback information, the observation characteristics, and the action behavior are used as training samples to train and update the AI model according to the training samples. The observation features are extracted from the environmental information, and the feedback information corresponding to the action executed by the agent based on the observation features is jointly used as a training sample to train and update the AI model, so that the accuracy of the AI model in the field of multi-agent control is improved.
Referring to fig. 6, fig. 6 is a schematic view of a scenario using an AI model according to an embodiment of the present application.
As shown in fig. 6, both the real player and the agent access the virtual environment server, the virtual environment server sends the environmental information observed by the agent in the virtual environment to the AI online server, the AI online server performs feature extraction on the environmental information to obtain observation features, inputs the observation features into the trained AI model to perform behavior prediction to obtain the action behavior output by the AI model, and sends the action command corresponding to the action behavior output by the AI model to the agent in the virtual environment server to control the agent to act according to the action command.
Referring to fig. 7, fig. 7 is a flowchart illustrating an AI model using method according to an embodiment of the disclosure.
As shown in fig. 7, the AI model using method includes steps S201 to S204.
S201, obtaining environment information observed by the intelligent agent in the virtual environment, and extracting observation characteristics of the intelligent agent from the environment information.
In the actual use process, both the real player and the intelligent agent are accessed to the virtual environment server, the AI online server obtains the environment information observed by the intelligent agent in the virtual environment, and the observation characteristics of the intelligent agent are extracted from the environment information.
The environment information observed by the agent may include user information, user configuration information, and global information. The extracting of the observation feature of the agent from the environment information may specifically be extracting a corresponding player feature, configuration feature and global feature from the user information, user configuration information and global information. That is, the observed features include player features, configuration features, and global features.
Specifically, a method of observation modeling may be utilized to perform player modeling for each user in the virtual environment, model user configuration for the users in the virtual environment, and perform global modeling for the whole world, thereby obtaining player characteristics, configuration characteristics, and global characteristics. Wherein the user configuration may refer to player-configured weapon information.
S202, inputting the observation characteristics into an AI model to obtain corresponding probabilities of a plurality of action behaviors.
And after obtaining the observation characteristics, the AI online server inputs the observation characteristics into the AI model so as to output the probability of executing each action by the agent under the observation characteristics through an output layer of the AI model. The AI model is obtained by adopting the model training method.
S203, determining a target action behavior from the action behaviors according to the corresponding probabilities of the action behaviors.
The output layer of the AI model outputs probabilities corresponding to the plurality of action behaviors according to the input observation features, and when determining the target action behavior, the action behavior with the highest probability may be used as the target action behavior.
In some embodiments, the action behaviors are encoded action behaviors. When the target action behavior is determined, the probabilities corresponding to the plurality of action behaviors can be fed into the behavior model, which samples the corresponding behavior category and behavior parameter according to those probabilities, and the target action behavior is finally determined from the plurality of action behaviors.
In some embodiments, referring to fig. 8, step S203 includes step S2031 and step S2032.
S2031, screening the action behaviors according to a preset action rule to obtain the screened action behaviors.
After the probabilities corresponding to the plurality of action behaviors output by the AI model are obtained, the plurality of action behaviors can be respectively screened according to preset behavior rules to filter out illegal behaviors and obtain screened action behaviors.
The preset behavior rule may refer to a behavior rule set according to a current game rule, for example, a forbidden area is set in a current game scene map, which indicates that a player in the area cannot enter or does not suggest to enter. Then, according to the behavior rule, an action behavior that may cause the agent to enter the forbidden zone may be filtered out from the plurality of action behaviors as an illegal behavior, so as to obtain a filtered action behavior.
S2032, determining a target action behavior according to the probability corresponding to the action behavior after screening.
After the screened action behaviors are obtained, the target action behaviors can be determined according to the screened action behaviors. When determining the target action behavior, the action behavior with the highest probability may be selected from the filtered action behaviors as the target action behavior.
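Steps S2031 and S2032 can be sketched together as follows; the is_legal callback stands in for the preset behavior rule and is an assumption, as is the function name.

```python
def select_target_action(action_behaviors, probabilities, is_legal):
    """Filters out action behaviors that violate a preset behavior rule
    (is_legal is an assumed callback, e.g. rejecting moves into a forbidden
    zone), then returns the remaining action behavior with the highest
    probability as the target action behavior."""
    candidates = [(a, p) for a, p in zip(action_behaviors, probabilities)
                  if is_legal(a)]
    return max(candidates, key=lambda pair: pair[1])[0]
```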
And S204, sending the target action behavior to the intelligent agent so that the intelligent agent executes corresponding action according to the target action behavior.
And the AI online server generates a behavior instruction according to the determined target action behavior and sends the behavior instruction to the intelligent agent in the virtual environment server so as to control the intelligent agent to execute a corresponding action according to the target action behavior represented by the behavior instruction.
After the agent finishes executing the behavior instruction, the environment information can be acquired again and sent to the AI online server; the AI online server acquires the environment information, extracts the observation features from it, and then continues to predict the agent's next action behavior, so that the agent can cooperate with the real player.
The model using method provided in the foregoing embodiment obtains environment information acquired by the agent, extracts observation features from the environment information, inputs the observation features into the AI model to obtain probabilities corresponding to a plurality of action behaviors, determines a target action behavior from the plurality of action behaviors according to the probabilities corresponding to the plurality of action behaviors, and finally sends the target action behavior to the agent, so that the agent executes a corresponding action. When the AI model needs to be called to play a game together with a real user, the intelligent agent can make corresponding actions according to the actual scene, so that the AI model can be quickly called, and the user experience of the real user is effectively improved.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 9.
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 9, the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any one of an AI model training and/or AI model using method.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any one of the AI model training and/or AI model using methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 9 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
calling an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain action behaviors;
sending the action behavior to the intelligent agent so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behavior as training samples;
and training and updating the AI model according to the training samples.
In one embodiment, the environment information includes user information, user configuration information, and global information; the processor, in causing the extraction of the observed features of the agent from the environmental information to be performed, is configured to cause:
and extracting corresponding player characteristics, configuration characteristics and global characteristics from the user information, the user configuration information and the global information.
In one embodiment, the AI model includes a first feature encoding layer, a second feature encoding layer, a third feature encoding layer, an encoding splice layer, a timing model layer, and an output layer; when the processor implements the training and updating of the AI model according to the training samples, the processor is configured to implement:
encoding the player characteristics through the first characteristic encoding layer to obtain player characteristic codes;
encoding the configuration characteristics through the second characteristic encoding layer to obtain configuration characteristic codes;
coding the global features through the third feature coding layer to obtain global feature codes;
splicing the player characteristic code, the configuration characteristic code and the global characteristic code through the code splicing layer to obtain a splicing code;
inputting the splicing code into the timing model layer to obtain a time sequence splicing code;
outputting action behaviors and a local evaluation value through the output layer based on the time sequence splicing code, and determining an advantage value according to the feedback information and the local evaluation value;
and calculating a loss value according to the advantage value so as to train and update the AI model according to the loss value.
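As a non-authoritative sketch of the layered structure and the advantage-based loss described above, the PyTorch code below wires three feature encoders into a splicing step, a sequence (timing) model and a policy/value output layer; all dimensions, the choice of an LSTM, and the exact loss are assumptions, since the application does not fix them here.

    # Sketch of the layered AI model: three encoders, splice, timing model, output heads.
    import torch
    import torch.nn as nn

    class SketchAIModel(nn.Module):
        def __init__(self, player_dim=16, config_dim=8, global_dim=8, hidden=64, n_actions=10):
            super().__init__()
            self.enc_player = nn.Linear(player_dim, hidden)   # first feature encoding layer
            self.enc_config = nn.Linear(config_dim, hidden)   # second feature encoding layer
            self.enc_global = nn.Linear(global_dim, hidden)   # third feature encoding layer
            self.temporal = nn.LSTM(3 * hidden, hidden, batch_first=True)  # timing model layer
            self.policy_head = nn.Linear(hidden, n_actions)   # action behaviors
            self.value_head = nn.Linear(hidden, 1)            # local evaluation value

        def forward(self, player, config, glob):
            # Each input: (batch, time, dim); encode, then splice along the feature axis.
            x = torch.cat([torch.relu(self.enc_player(player)),
                           torch.relu(self.enc_config(config)),
                           torch.relu(self.enc_global(glob))], dim=-1)
            seq, _ = self.temporal(x)                          # time sequence splicing code
            return self.policy_head(seq), self.value_head(seq).squeeze(-1)

    model = SketchAIModel()
    p, c, g = torch.randn(2, 5, 16), torch.randn(2, 5, 8), torch.randn(2, 5, 8)
    logits, value = model(p, c, g)

    # Advantage from the feedback (e.g. a return) and the local evaluation value,
    # followed by a simple actor-critic style loss; the patent's exact loss may differ.
    feedback = torch.randn(2, 5)
    advantage = feedback - value.detach()
    action = torch.randint(0, 10, (2, 5))
    logp = torch.log_softmax(logits, dim=-1).gather(-1, action.unsqueeze(-1)).squeeze(-1)
    loss = -(logp * advantage).mean() + 0.5 * (feedback - value).pow(2).mean()
    loss.backward()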
In one embodiment, the processor, when implementing the encoding of the player characteristics by the first characteristic encoding layer to obtain a player characteristic encoding, is configured to implement:
and encoding the player characteristics of a plurality of players through the first characteristic coding layer to obtain player characteristic codes of the plurality of players.
In one embodiment, the processor is further configured to implement:
and coding the action behaviors to obtain behavior codes.
In one embodiment, the action behavior includes a behavior category and a behavior parameter; when the processor implements the encoding of the action behavior to obtain a behavior encoding, the processor is configured to implement:
and coding the behavior category and the behavior parameters based on a preset coding strategy to obtain a behavior code.
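One possible "preset coding strategy" consistent with this description is a one-hot behavior category concatenated with padded behavior parameters; the categories and parameter layout below are invented for illustration.

    # Illustrative behavior coding: one-hot category plus padded parameters.
    CATEGORIES = ["move", "attack", "use_item"]          # hypothetical behavior categories

    def encode_action(category, params, max_params=3):
        one_hot = [1.0 if category == c else 0.0 for c in CATEGORIES]
        padded = (list(params) + [0.0] * max_params)[:max_params]   # pad/truncate parameters
        return one_hot + padded                                     # behavior code

    print(encode_action("move", [0.5, -1.0]))   # -> [1.0, 0.0, 0.0, 0.5, -1.0, 0.0]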
In one embodiment, the processor is further configured to implement:
acquiring a behavior sequence of a plurality of training samples, and storing the training samples in a training sample sequence according to the behavior sequence;
and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
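A bounded, behavior-ordered training sample sequence of this kind can be sketched with a fixed-length deque; the threshold below is an arbitrary placeholder, not a value taken from this application.

    # Training samples kept in behavior order; the oldest are deleted past the threshold.
    from collections import deque

    MAX_LEN = 1024                            # preset threshold (illustrative)
    sample_sequence = deque(maxlen=MAX_LEN)   # automatically drops the oldest samples

    def add_sample(sample):
        sample_sequence.append(sample)        # preserves behavior (insertion) order

    for step in range(2000):
        add_sample({"step": step})
    print(len(sample_sequence), sample_sequence[0]["step"])  # -> 1024 976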
In one embodiment, the processor is configured to execute a computer program stored in the memory to perform the steps of:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation characteristics into an AI model to obtain probabilities corresponding to a plurality of action behaviors, wherein the AI model is obtained by adopting the model training method;
determining a target action behavior from a plurality of action behaviors according to the corresponding probabilities of the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes a corresponding action according to the target action behavior.
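For illustration, the usage path above can be sketched as drawing a target action behavior from the predicted probabilities; the probability values are made up, and a deterministic arg-max choice would also match the description.

    # Choosing a target action behavior from the per-behavior probabilities.
    import random

    action_probs = {"move": 0.6, "attack": 0.3, "use_item": 0.1}   # hypothetical model output
    actions, probs = zip(*action_probs.items())
    target_action = random.choices(actions, weights=probs, k=1)[0]
    print("target action behavior:", target_action)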
In one embodiment, the processor, when implementing the determining the target action behavior from the plurality of action behaviors according to the probabilities corresponding to the plurality of action behaviors, is configured to implement:
screening the action behaviors according to a preset action rule to obtain screened action behaviors;
and determining the target action behavior according to the probability corresponding to the screened action behavior.
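A toy version of this screening step is shown below: a hypothetical preset action rule removes behaviors that are currently not allowed, and the target action behavior is then the highest-probability remaining behavior. The rule and numbers are invented.

    # Screening action behaviors with a preset rule before picking the target behavior.
    action_probs = {"move": 0.5, "attack": 0.3, "use_item": 0.2}

    def action_rule(action, state):
        # Hypothetical rule: "use_item" is not allowed when the inventory is empty.
        return not (action == "use_item" and state["inventory_empty"])

    state = {"inventory_empty": True}
    screened = {a: p for a, p in action_probs.items() if action_rule(a, state)}
    target_action = max(screened, key=screened.get)
    print(target_action)   # -> "move"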
Embodiments of the present application further provide a computer-readable storage medium storing a computer program. The computer program includes program instructions which, when executed by a processor, cause the processor to implement any of the AI model training methods and/or AI model using methods provided in the embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) equipped on the computer device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (11)

1. An AI model training method, comprising:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
calling an AI model corresponding to the agent, and inputting the observation characteristics into the AI model for prediction to obtain action behaviors;
sending the action behavior to the intelligent agent so that the intelligent agent executes the action behavior to obtain feedback information corresponding to the action behavior;
acquiring the feedback information, and taking the feedback information, the observation characteristics and the action behavior as training samples;
and training and updating the AI model according to the training samples.
2. The AI model training method of claim 1, wherein the environment information includes user information, user configuration information, and global information; the extracting observed features of the agent from the environment information includes:
and extracting corresponding player characteristics, configuration characteristics and global characteristics from the user information, the user configuration information and the global information.
3. The AI model training method of claim 2, wherein the AI model comprises a first feature coding layer, a second feature coding layer, a third feature coding layer, a code concatenation layer, a timing model layer, and an output layer; the training and updating the AI model according to the training samples comprises:
encoding the player characteristics through the first characteristic encoding layer to obtain player characteristic codes;
encoding the configuration characteristics through the second characteristic encoding layer to obtain configuration characteristic codes;
coding the global features through the third feature coding layer to obtain global feature codes;
splicing the player characteristic code, the configuration characteristic code and the global characteristic code through the code concatenation layer to obtain a splicing code;
inputting the splicing code into the timing model layer to obtain a time sequence splicing code;
outputting action behaviors and a local evaluation value through the output layer based on the time sequence splicing code, and determining an advantage value according to the feedback information and the local evaluation value;
and calculating a loss value according to the advantage value so as to train and update the AI model according to the loss value.
4. The AI model training method of claim 3, wherein the encoding the player characteristics via the first characteristic encoding layer to obtain a player characteristic encoding comprises:
and encoding the player characteristics of a plurality of players through the first characteristic coding layer to obtain player characteristic codes of the plurality of players.
5. The AI model training method of claim 1, further comprising:
and coding the action behaviors to obtain behavior codes.
6. The AI model training method of claim 5, wherein the action behaviors include a behavior category and a behavior parameter; the encoding the action behavior to obtain a behavior code includes:
and coding the behavior category and the behavior parameters based on a preset coding strategy to obtain a behavior code.
7. The AI model training method of claim 1, further comprising:
acquiring a behavior sequence of a plurality of training samples, and storing the training samples in a training sample sequence according to the behavior sequence;
and if the length of the training sample sequence is greater than a preset threshold value, deleting the training samples according to the behavior sequence.
8. An AI model using method, comprising:
acquiring environment information observed by an agent in a virtual environment, and extracting observation characteristics of the agent from the environment information;
inputting the observation features into an AI model to obtain probabilities corresponding to a plurality of action behaviors, wherein the AI model is obtained by adopting the model training method of any one of claims 1 to 7;
determining a target action behavior from a plurality of action behaviors according to the corresponding probabilities of the action behaviors;
and sending the target action behavior to the intelligent agent so that the intelligent agent executes a corresponding action according to the target action behavior.
9. The AI model using method according to claim 8, wherein the determining a target action behavior from a plurality of action behaviors according to the probabilities corresponding to the plurality of action behaviors comprises:
screening the action behaviors according to a preset action rule to obtain screened action behaviors;
and determining the target action behavior according to the probability corresponding to the screened action behavior.
10. A computer device, wherein the computer device comprises a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and, when executing the computer program, to implement the AI model training method according to any one of claims 1 to 7 and/or the AI model using method according to any one of claims 8 to 9.
11. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, causes the processor to implement the AI model training method according to any one of claims 1 to 7 and/or the AI model using method according to any one of claims 8 to 9.
CN202010408928.9A 2020-05-14 2020-05-14 AI model using method, apparatus and storage medium Active CN111589157B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010408928.9A CN111589157B (en) 2020-05-14 2020-05-14 AI model using method, apparatus and storage medium

Publications (2)

Publication Number Publication Date
CN111589157A true CN111589157A (en) 2020-08-28
CN111589157B CN111589157B (en) 2023-10-31

Family

ID=72182683

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010408928.9A Active CN111589157B (en) 2020-05-14 2020-05-14 AI model using method, apparatus and storage medium

Country Status (1)

Country Link
CN (1) CN111589157B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190339688A1 (en) * 2016-05-09 2019-11-07 Strong Force Iot Portfolio 2016, Llc Methods and systems for data collection, learning, and streaming of machine signals for analytics and maintenance using the industrial internet of things
CN111111204A (en) * 2020-04-01 2020-05-08 腾讯科技(深圳)有限公司 Interactive model training method and device, computer equipment and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112295232A (en) * 2020-11-23 2021-02-02 超参数科技(深圳)有限公司 Navigation decision making method, AI model training method, server and medium
CN112295232B (en) * 2020-11-23 2024-01-23 超参数科技(深圳)有限公司 Navigation decision making method, AI model training method, server and medium
WO2022147786A1 (en) * 2021-01-08 2022-07-14 Lenovo (Beijing) Limited Method and apparatus for determining prediction for status of wireless network
CN112965803A (en) * 2021-03-22 2021-06-15 共达地创新技术(深圳)有限公司 AI model generation method and electronic equipment
WO2023168653A1 (en) * 2022-03-10 2023-09-14 上海莉莉丝科技股份有限公司 Model training method for virtual environment, medium, and electronic device

Also Published As

Publication number Publication date
CN111589157B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN111589157B (en) AI model using method, apparatus and storage medium
US11291917B2 (en) Artificial intelligence (AI) model training using cloud gaming network
JP7159458B2 (en) Method, apparatus, device and computer program for scheduling virtual objects in a virtual environment
CN109464803B (en) Virtual object control method, virtual object control device, model training device, storage medium and equipment
CN110443284B (en) Artificial intelligence AI model training method, calling method, server and readable storage medium
CN111886059A (en) Automatically reducing use of cheating software in an online gaming environment
CN108283809A (en) Data processing method, device, computer equipment and storage medium
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN109460463A (en) Model training method, device, terminal and storage medium based on data processing
CN112221159B (en) Virtual item recommendation method and device and computer readable storage medium
CN111111220A (en) Self-chess-playing model training method and device for multiplayer battle game and computer equipment
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
KR20220080191A (en) Information processing method and device, computer readable storage medium and electronic device
CN111841018B (en) Model training method, model using method, computer device, and storage medium
CN112791414B (en) Plug-in recognition model training method and device, electronic equipment and storage medium
CN111401557B (en) Agent decision making method, AI model training method, server and medium
CN112742028A (en) Formation decision method, system, medium and equipment for fighting game
CN113975812A (en) Game image processing method, device, equipment and storage medium
CN116510302A (en) Analysis method and device for abnormal behavior of virtual object and electronic equipment
CN116047902A (en) Method, device, equipment and storage medium for navigating robots in crowd
CN114373034A (en) Image processing method, image processing apparatus, image processing device, storage medium, and computer program
CN113919503A (en) Model training method, calling method, server and computer-readable storage medium
CN114282741A (en) Task decision method, device, equipment and storage medium
CN111589158B (en) AI model training method, AI model calling method, apparatus and readable storage medium
CN112149798B (en) AI model training method, AI model calling method, apparatus and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant