CN111841018B - Model training method, model using method, computer device, and storage medium - Google Patents

Model training method, model using method, computer device, and storage medium

Info

Publication number
CN111841018B
CN111841018B (application CN202010496299.XA)
Authority
CN
China
Prior art keywords
model
coding
trained
target
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010496299.XA
Other languages
Chinese (zh)
Other versions
CN111841018A (en)
Inventor
纪晓龙
朱晓龙
季兴
汤善敏
周正
李宏亮
张正生
刘永升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Super Parameter Technology Shenzhen Co ltd
Original Assignee
Super Parameter Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Super Parameter Technology Shenzhen Co ltd
Priority to CN202010496299.XA
Publication of CN111841018A
Application granted
Publication of CN111841018B
Legal status: Active

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67: Generating or modifying game content adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00: Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60: Methods for processing data by generating or executing the game program
    • A63F2300/6027: Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the application disclose a model training method, a model using method, a computer device, and a storage medium. Training sample data is obtained and observation variables are extracted from it; the true coding type of the observation variables and the real action of an agent are acquired; the observation variables are encoded by a coding model to obtain a predicted coding type; the parameters of the coding model are adjusted according to the true and predicted coding types to obtain a trained coding model; the observation variables are mapped to high-dimensional features by the trained coding model; a classification model determines the agent's predicted action from the high-dimensional features; and the parameters of the classification model are adjusted according to the real and predicted actions to obtain a trained classification model. The reliability and accuracy of model training are thereby improved.

Description

Model training method, model using method, computer device, and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a model training method, a model using method, a computer device, and a storage medium.
Background
With the rapid development of artificial intelligence (AI) technology, AI is widely used in many fields. In game entertainment, for example, AI enables virtual users to play against real players in a variety of games. In the prior art, game AI typically uses supervised learning or reinforcement learning to learn the agent's actions directly. This end-to-end approach leaves the model a complete black box: what it learns is opaque, its internal structure cannot be interpreted, and the accuracy of model training suffers. Moreover, in multi-party games the amount of information grows rapidly as the number of participants increases, and a fixed map design cannot flexibly accommodate more participants or changing map requirements, so the accuracy of the model's results is low.
Disclosure of Invention
The embodiment of the application provides a model training method, a model using method, computer equipment and a storage medium, which can improve the reliability and accuracy of model training.
In a first aspect, an embodiment of the present application provides a model training method, including:
acquiring training sample data;
extracting observation variables from the training sample data;
acquiring a real coding type of the observed variable and a real action of an intelligent agent;
coding the observed variable through a coding model to obtain a predictive coding type;
adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model;
mapping the observed variable into a high-dimensional characteristic through the trained coding model;
determining, by the classification model, a predicted action of the agent based on the high-dimensional features;
and adjusting parameters of the classification model according to the real actions and the predicted actions to obtain a trained classification model.
In a second aspect, an embodiment of the present application further provides a model training method, including:
acquiring training sample data, and extracting observation variables from the training sample data;
acquiring a real coding type of the observed variable and a real action of an intelligent agent;
mapping the observed variable into a high-dimensional characteristic through a coding model, and transmitting the high-dimensional characteristic to a classification model;
coding the observation variable based on the high-dimensional features through the coding model to obtain a predictive coding type;
determining, by the classification model, a predicted action of the agent based on the high-dimensional features;
adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model;
and adjusting parameters of the classification model according to the real actions and the predicted actions to obtain a trained classification model.
In a third aspect, an embodiment of the present application further provides a model using method, applied to a server, where the model is obtained by training with any of the model training methods provided in the embodiments of the present application and is deployed on the server. The method includes the following steps:
acquiring current frame data in an operation environment;
extracting a target observation variable from the current frame data;
mapping the target observation variable into a target high-dimensional characteristic through the trained coding model;
determining the action of a target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model;
and controlling the target intelligent agent to execute the action.
In a fourth aspect, an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the processor invokes the computer program in the memory, any one of the model training methods provided by the embodiment of the present application is executed.
In a fifth aspect, embodiments of the present application further provide a storage medium storing a computer program, where the computer program is loaded by a processor to perform any of the model training methods provided by the embodiments of the present application.
According to the embodiments of the application, training sample data can be obtained and observation variables extracted from it; the true coding type of the observation variables and the real action of the agent are acquired; the observation variables are encoded through the coding model to obtain a predicted coding type, and the coding model's parameters are adjusted according to the true and predicted coding types to obtain the trained coding model; the observation variables are then mapped to high-dimensional features through the trained coding model, and the classification model determines the agent's predicted action from those features; finally, the classification model's parameters are adjusted according to the real and predicted actions to obtain the trained classification model. This scheme encodes the observation variables effectively, making the coding model's training meaningful: through unsupervised learning the features it learns are more expressive and better represent the agent's observation variables, and the high-dimensional features it outputs are used by the classification model to predict and train the agent's actions, improving the reliability and accuracy of model training.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the model training provided by an embodiment of the present application;
FIG. 3 is another flow chart of a model training method according to an embodiment of the present application;
FIG. 4 is another structural schematic diagram of model training provided by an embodiment of the present application;
FIG. 5 is a flow chart of a method for using a model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
The embodiment of the application provides a model training method, a model using method, computer equipment and a storage medium. The model training method can be applied to computer equipment, wherein the computer equipment can comprise terminals such as smart phones, tablet computers, palm computers, notebook computers or desktop computers, and the computer equipment can also be a server.
The model training method provided by the embodiments of the application involves technologies such as machine learning within artificial intelligence; artificial intelligence and machine learning are explained below.
Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. It studies the design principles and implementation methods of various intelligent machines, giving machines the abilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, spanning both hardware-level and software-level technologies. Its basic technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, and operation/interaction systems and mechatronics. Its software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied across all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Referring to fig. 1, fig. 1 is a flow chart of a model training method according to an embodiment of the application. The model training method may include steps S101 to S107, etc., and specifically may be as follows:
s101, acquiring training sample data, and extracting observation variables from the training sample data.
The training sample data may be set flexibly according to the actual application scenario. For example, in a ball battle game, each set of training sample data may include multiple videos recorded from complete matches; for an Honor of Kings game scenario, each set may likewise include multiple videos recorded from complete matches. The training sample data may be collected manually or generated automatically by a computer device.
For example, the training sample data may be generated automatically by a preset artificial intelligence model, which may be a neural network model or another model; the details are not limited here. When the pre-stored artificial intelligence model needs to be trained, the model corresponding to each participant in the game can be invoked and each participant controlled to perform a sample-generation run; the training sample data is collected once the run finishes. Alternatively, the training sample data may be downloaded from a preset server or obtained from the computer device's local database, and it may consist of multiple frames of data.
After the training sample data is obtained, the observation variable may be extracted from the training sample data, for example, the observation variable may be extracted from each frame of data. In some implementations, the observation variables can include a location of the agent, a location of the counterparty, and a location of the partner, among others. The intelligent agent can be a game role controlled by the computer equipment, and the position of the intelligent agent can be the current position of the intelligent agent in a game map corresponding to the current frame data; the opponent can be an enemy of the agent, and the position of the opponent can be the current position of the enemy in the game map corresponding to the current frame data; the partner may be a teammate of the agent, and the location of the partner may be a current location of the teammate in the game map corresponding to the current frame data.
The observation variables may be defined as high-order data (also called high-order abstract data). In an action game, for example, there are typically high-level skills such as killing enemies and team cooperation; the agent observations corresponding to these high-level skills are defined as high-order data, i.e., high-order data may include information such as the agent's position, the enemies' positions, and teammates' positions.
It should be noted that, to improve the reliability of model training, valid frame data may be screened from the training sample videos; this screening may be automatic or manual. For example, to make screening more convenient, valid frames may be selected automatically based on analysis indices, which can be set flexibly according to actual needs. The indices may include a team-participation rate (the probability that the agent cooperates with teammates) and a non-cooperation rate (the probability that it does not). Specifically, each analysis index may be computed over the training sample videos and valid frames selected accordingly: frames with a high team-participation rate are kept as valid frame data, while frames with a high non-cooperation rate are removed. Observation variables may then be extracted from each valid frame.
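The screening and extraction steps above can be sketched as follows. This is an illustrative sketch, not code from the patent; all field names (`participation_rate`, `agent_pos`, and so on) and the 0.5 threshold are assumptions for illustration.

```python
# Sketch: screen valid frames by team-participation rate, then extract
# the observation variables (agent, enemy, and teammate positions).

def filter_valid_frames(frames, min_participation=0.5):
    """Keep frames whose team-participation rate meets the threshold."""
    return [f for f in frames if f["participation_rate"] >= min_participation]

def extract_observation(frame):
    """Pull the high-order observation variables out of one frame."""
    return {
        "agent_pos": frame["agent_pos"],        # agent's position on the map
        "enemy_pos": frame["enemy_pos"],        # opponents' positions
        "teammate_pos": frame["teammate_pos"],  # partners' positions
    }

frames = [
    {"agent_pos": (1, 2), "enemy_pos": [(5, 5)], "teammate_pos": [(2, 2)],
     "participation_rate": 0.8},
    {"agent_pos": (3, 4), "enemy_pos": [(6, 6)], "teammate_pos": [(4, 4)],
     "participation_rate": 0.2},
]
valid = filter_valid_frames(frames)
observations = [extract_observation(f) for f in valid]
```

The low-cooperation frame is dropped, and only the remaining frames feed the training pipeline.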
S102, acquiring the real coding type of the observation variable and the real action of the intelligent agent.
After the observation variables are obtained, their true coding type may be acquired. In some embodiments the true coding type may include the number of partners (teammates), the number of opponents (enemies), the number of strikes against opponents (enemy kills), and the number of team cooperations between the agent and its teammates. The true coding type may be counted automatically by the computer device or counted manually.
The agent's real action is also acquired. The real action may include a movement direction and an action type: the movement direction may be forward, backward, left, or right; in a ball battle game, for example, the action type may include eating spores, eating food pellets, splitting (e.g., splitting one ball into two), spitting spores, or moving, while in Honor of Kings the action type may include killing enemies, casting skill A, casting skill B, or moving. The real action may be counted automatically by the computer device or counted manually.
S103, coding the observation variable through a coding model to obtain a predictive coding type.
The type of coding model can be set flexibly according to actual needs; for example, it may be a deep neural network (DNN) or a convolutional neural network (CNN), and it may be an unsupervised coding model.
The coding model performs classification learning on the high-order data (i.e., the observation variables) in order to encode it. For example, from the agent's observations of enemy kills it learns to classify the number of enemies killed, thereby encoding the high-order data. Specifically, the kill skill can be classified by kill count into several types, such as killing one enemy, two enemies, or three enemies; the team-cooperation skill can be classified by the number of interactions (i.e., team cooperations) into one, two, three, or more interactions, or by the number of teammates and enemies involved; and the agent's position can be classified as left, right, or middle.
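The bucketing just described can be sketched as simple labeling functions. This is an illustrative sketch under assumed bucket boundaries (the patent does not fix the exact class boundaries); the function names are hypothetical.

```python
# Sketch: turn high-order statistics into the discrete coding-type
# labels used as unsupervised supervision signals.

def kill_class(kills):
    """Classify by number of enemies killed: 0, 1, 2, or 3-and-above."""
    return min(kills, 3)

def cooperation_class(interactions):
    """Classify by number of team interactions: 0, 1, 2, or 3-and-above."""
    return min(interactions, 3)

def position_class(x, map_width):
    """Classify the agent's horizontal position: left / middle / right."""
    third = map_width / 3
    if x < third:
        return "left"
    if x < 2 * third:
        return "middle"
    return "right"

# Example: 2 kills, 5 team interactions, x=40 on a width-90 map.
label = (kill_class(2), cooperation_class(5), position_class(40, 90))
```

Each observation variable thus receives a tuple of class labels that the coding model is trained to predict.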
After the observation variables are obtained, they can be encoded by the coding model to obtain a predicted coding type, so that a corresponding label is generated for each observation variable. For example, the coding model can map the observation variables to high-dimensional features and map those features to labels for the high-order data, such as the number of teammates or enemies, thereby optimizing the coding model's learning.
In some implementations, the predictive coding type of the observation variable may include the number of partners, the number of opponents, the number of strikes against the opponents, and the number of team cooperations. The number of the partners can be the number of teammates, the number of the opponents can be the number of enemies, the number of times of striking the opponents can be the number of times of striking the enemies, and the number of times of team cooperation can be the number of times of team cooperation between the intelligent body and teammates.
And S104, adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain the trained coding model.
Through unsupervised learning, the features the coding model produces from the observation variables are highly expressive, for example of the agent's shape and number or its teammates' shapes, and of scene understanding such as kill scenes or cooperation scenes. These good features can then be used to learn the actions corresponding to the observation variables, achieving unsupervised processing of high-order abstract data and making the coding model's learning more meaningful.
In some embodiments, adjusting parameters of the coding model according to the true coding type and the predictive coding type, and obtaining the trained coding model may include: and converging the real coding type and the predictive coding type through the first loss function so as to adjust the parameters of the coding model to proper values and obtain the trained coding model.
After the true coding type and the predicted coding type are obtained, the parameters of the coding model can be adjusted accordingly to obtain the trained coding model. To improve training accuracy, the true and predicted coding types can be converged through a first loss function, adjusting the coding model's parameters to appropriate values. The first loss function can be set flexibly according to actual needs. For example, as shown in fig. 2, the first loss function may be an auxiliary loss function: after the predicted coding type is obtained from the observation variables in the training sample data through the coding model, the true coding type of the unsupervised label and the predicted coding type are converged through the auxiliary loss function to adjust the coding model's parameters to appropriate values, completing the stage-1 unsupervised learning. Through this unsupervised processing, the features learned by the coding model are more expressive and better represent the agent's observation variables. The application abstracts high-order data such as the number of enemy kills, team cooperations, enemies, and teammates, and uses it as the supervision signal for the observation variables to train the coding model independently.
As a result, the coding model becomes more meaningful: learning the numbers of teammates and enemies lets it effectively encode individual entities in the observation variables; learning the number of enemy kills lets it effectively encode individuals and kill scenes; and learning team cooperation lets it effectively encode cooperation scenes.
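The patent does not specify the form of the auxiliary loss; a common concrete choice for converging a predicted class distribution toward a true class label, shown here purely as an assumption, is cross-entropy.

```python
import math

def cross_entropy(pred_probs, true_index):
    """Cross-entropy between a predicted distribution and a one-hot label."""
    return -math.log(pred_probs[true_index])

# A prediction that puts more mass on the true class yields a lower loss;
# driving this loss down is what adjusts the coding model's parameters.
confident = cross_entropy([0.7, 0.2, 0.1], 0)     # low loss, good prediction
uncertain = cross_entropy([0.34, 0.33, 0.33], 0)  # higher loss
```

Stage 1 repeats this comparison over all unsupervised labels (kill count, cooperation count, and so on) until the parameters settle at appropriate values.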
S105, mapping the observed variable into a high-dimensional characteristic through the trained coding model.
After the trained coding model is obtained, stage-2 action learning for the agent is performed. The observation variables can now be mapped to high-dimensional features through the trained coding model, where a high-dimensional feature is the feature information of an observation variable, so that the classification model can be trained with the output features of the coding model trained in stage 1.
S106, determining the prediction action of the agent based on the high-dimensional characteristics through the classification model.
The type of classification model can be set flexibly according to actual needs; for example, it may be a DNN or a CNN. Because the abstract structure of the high-order data, such as the numbers of teammates and enemies, serves as labels, the coding model learns better abstract features with stronger expressive power, so the classification model can learn from the learned coding features, using these good features to learn the actions corresponding to the observation variables. Specifically, after the high-dimensional features are obtained, the classification model determines the agent's predicted action from them. The predicted action may include a movement direction (forward, backward, left, or right) and an action type: in a ball battle game, for example, the action type may include eating spores, eating food pellets, splitting (e.g., splitting one ball into two), spitting spores, or moving, while in Honor of Kings it may include killing enemies, casting skill A, casting skill B, or moving.
And S107, adjusting parameters of the classification model according to the real actions and the predicted actions to obtain a trained classification model.
In some embodiments, adjusting parameters of the classification model according to the actual motion and the predicted motion, the obtaining the trained classification model may include: and converging the real action and the predicted action through the second loss function so as to adjust the parameters of the classification model to proper values and obtain the trained classification model.
After the real action and the predicted action are obtained, the parameters of the classification model can be adjusted accordingly to obtain the trained classification model. To improve training accuracy, the real and predicted actions can be converged through a second loss function, adjusting the classification model's parameters to appropriate values; the second loss function can be set flexibly according to actual needs. For example, as shown in fig. 2, the second loss function may be a classification loss function: after the predicted action is obtained from the high-dimensional features output by the trained coding model, the predicted and real actions are converged through the classification loss function to adjust the classification model's parameters to appropriate values, completing the stage-2 training.
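The two-stage structure can be sketched schematically as follows. This is an assumed skeleton, not the patent's implementation: `Encoder` and `Classifier` are hypothetical stand-ins whose `fit` methods represent the stage-1 and stage-2 parameter adjustments, with trivial placeholder internals.

```python
# Schematic two-stage training: stage 1 trains the coding model against
# the auxiliary (coding-type) labels; stage 2 trains the classification
# model on the frozen encoder's high-dimensional features.

class Encoder:
    def fit(self, observations, coding_labels):
        """Stage 1: adjust parameters against the auxiliary loss (stub)."""
        self.trained = True

    def features(self, observation):
        """Map an observation to a (toy) high-dimensional feature vector."""
        return [float(v) for v in observation]

class Classifier:
    def fit(self, features, actions):
        """Stage 2: adjust parameters against the classification loss (stub)."""
        self.actions = list(dict.fromkeys(actions))  # remember action set

    def predict(self, feature):
        return self.actions[0]  # placeholder decision rule

observations = [(1, 2), (3, 4)]
coding_labels = [0, 1]           # unsupervised coding-type labels
actions = ["move", "split"]      # real actions

encoder = Encoder()
encoder.fit(observations, coding_labels)          # stage 1
feats = [encoder.features(o) for o in observations]
classifier = Classifier()
classifier.fit(feats, actions)                    # stage 2
predicted = classifier.predict(encoder.features((1, 2)))
```

The point of the skeleton is the ordering: the classifier never sees raw observations, only the encoder's feature output.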
In some embodiments, the model training method may further include: acquiring current frame data in a running environment; extracting a target observation variable from the current frame data; mapping the target observation variable to a target high-dimensional feature through the trained coding model; determining, through the trained classification model, the action of the target agent in the current frame data based on the target high-dimensional feature; and controlling the target agent to execute the action.
After the trained coding model and the trained classification model are obtained, they can be used to play matches. Specifically, current frame data in the operation environment may be obtained, where the operation environment and the corresponding current frame data can be flexibly set according to the actual application scenario; the specific content is not limited herein. For example, in a Battle of Balls game, an Honor of Kings game, or an agar.io game, current frame data in the game running environment may be acquired. Since video data is recorded while the game runs, the current frame data of the video data may be acquired at intervals of a preset time, which can be flexibly set according to actual needs; for example, the preset time may be set to 1 second or 0.1 second. Taking the Battle of Balls game as an example, the current frame data may include an environment map, the location of the agent, the movement speed of the agent, the locations of teammates, the movement speeds of teammates, the locations of enemies, the movement speeds of enemies, the locations of spores, the weights of spores, the locations of grains, the weights of grains, the game time, and the like.
In the following, the agar.io-style IO game is taken as an example. In this game, a player freely controls his own balls in the manner of a big fish eating small fish: he can control his own balls to swallow balls smaller than his own, grains, or thorn balls. Meanwhile, the player can combine a plurality of small balls into a big ball through operations such as splitting and spore-spitting in cooperation with teammates, or eat enemy balls smaller than his own through splitting operations; the aim of the game is to enlarge the total area of one's own team as much as possible by eating resources and swallowing enemy balls. The game scene may include the following: a player can control a plurality of balls through a rocker; for example, player 1 can control 3 balls numbered 1, and these 3 balls are the control units of player 1. The player's field of view is not fixed and changes as the size, number, and positions of the player's balls change; each player can only see the game scene within his own view; for example, player 4 has a smaller view than player 1 and cannot see players 1, 2, 5, and 6. Different players may form a plurality of teams, with at least 2 teams per game; for example, players 1 and 2 form team A, players 3 and 4 form team B, and players 5 and 6 form team C. The game has different types of resources, such as grains and thorn balls, and a player can control his own balls to eat resources so as to increase their area. The game duration is fixed at 10 minutes or 12 minutes, and the team with the largest total area at the end wins.
The IO game has the following characteristics: multiple players fight in real time, and the fun comes from competing with other players; play is instantaneous, with no permanent growth system, and every match starts afresh; growth is limited, and a higher level brings drawbacks in some respects, such as a larger volume and a slower movement speed; the growth system is simple: acquire resources and kill opponents.
After the current frame data is obtained, a target observation variable can be extracted from it. The target observation variable can comprise the position of the target agent, the positions of opponents, the positions of partners, and the like. The target agent can be a game role controlled by the computer device, and its position can be the agent's current position in the game map corresponding to the current frame data; an opponent can be an enemy of the agent, and the position of the opponent can be the enemy's current position in that game map; a partner can be a teammate of the agent, and the position of the partner can be the teammate's current position in that game map.
The target observation variable may then be mapped by the trained coding model into a target high-dimensional feature, which may be characteristic information of the observation variable. At this time, the action of the target agent in the current frame data may be determined through the trained classification model based on the target high-dimensional feature. The action of the target agent may include a running direction and an action type: the running direction may include forward, backward, leftward, rightward, and the like; taking the Battle of Balls game as an example, the action type may include eating spores, eating grains, splitting, spitting spores, or moving; taking the Honor of Kings game as an example, the action type may include attacking enemies, launching the A skill, launching the B skill, or moving. Finally, the target agent can be controlled to execute the action.
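The frame-to-action pipeline described above can be sketched as follows. The `encode` and `classify` stubs below are hypothetical stand-ins for the trained coding model and trained classification model (which in the patent are neural networks such as a DNN or CNN); the observation fields and the trivial decision rule are illustrative only.

```python
def encode(observation):
    # Stand-in for the trained coding model: observation -> "high-dimensional feature".
    return [float(v) for v in observation.values()]

def classify(feature):
    # Stand-in for the trained classification model: feature -> action
    # (a running direction plus an action type), via a trivial rule.
    direction = "forward" if feature[0] >= 0 else "backward"
    return {"direction": direction, "action_type": "move"}

def step(current_frame):
    """One decision step: current frame -> target observation variable
    -> target high-dimensional feature -> action of the target agent."""
    observation = {                         # extract target observation variable
        "agent_x": current_frame["agent_pos"][0],
        "enemy_x": current_frame["enemy_pos"][0],
    }
    feature = encode(observation)           # trained coding model
    action = classify(feature)              # trained classification model
    return action                           # the target agent then executes this

frame = {"agent_pos": (3.0, 1.0), "enemy_pos": (-2.0, 4.0)}
action = step(frame)
```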
In order to improve the reliability of the model, before deploying the trained coding model and the trained classification model (hereinafter referred to, for convenience of description, as the trained model), the trained model may be evaluated, for example, from two aspects. On the one hand, the anthropomorphic level: the trained model can compete with human players, and the prediction accuracy of the trained model can then be computed on an evaluation data set to describe its anthropomorphic level. When the accuracy is greater than or equal to a first preset threshold, the performance of the trained model is reliable; the first preset threshold can be flexibly set according to actual needs. When the accuracy is smaller than the first preset threshold, the performance of the trained model is poor, and the model needs to be trained again until its prediction accuracy is greater than or equal to the first preset threshold. On the other hand, the difficulty level: the trained model can fight human players, and its win rate can then be counted in the evaluation results. When the win rate is greater than or equal to a second preset threshold, the performance of the trained model is good; the second preset threshold can be flexibly set according to actual needs. When the win rate is smaller than the second preset threshold, the performance of the trained model is poor, and the model needs to be trained again until its win rate is greater than or equal to the second preset threshold.
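The two evaluation checks can be sketched as below; the threshold values 0.9 and 0.5 are hypothetical examples of the first and second preset thresholds, which the patent leaves to be set according to actual needs.

```python
def evaluate(correct, total, wins, games,
             acc_threshold=0.9, win_threshold=0.5):
    """Evaluate a trained model on the anthropomorphic level (prediction
    accuracy on an evaluation data set) and the difficulty level (win rate
    against human players); retrain if either falls below its threshold."""
    accuracy = correct / total
    win_rate = wins / games
    needs_retraining = accuracy < acc_threshold or win_rate < win_threshold
    return accuracy, win_rate, needs_retraining

acc, win, retrain = evaluate(correct=930, total=1000, wins=60, games=100)
```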
After the trained model is obtained, it can be deployed into suitable players' games according to a strategy. The application scenarios can include: 1. Novice teaching: the model instructs a novice player in how to use his controlled units to maximum effect to win benefits. 2. Disconnection hosting: when a player drops offline, the model assists the player in performing reasonable actions so as to maximize the benefits of the game round, or minimize its losses, and avoid affecting the experience of other players. 3. Man-machine challenge: for high-level players, accessing a model trained to a high level lets the players challenge their scores and increases activity.
It should be noted that the next frame data may be obtained as the current frame data each time an action is completed. For example, when a game starts, the first frame data in the game running environment may be obtained as the current frame data; the target observation variable is extracted from the first frame data, mapped into a target high-dimensional feature by the trained coding model, the action of the target agent in the current frame data is determined based on the target high-dimensional feature by the trained classification model, and the target agent is controlled to execute the action. Then, after the action is completed, the second frame data in the game running environment is obtained as the current frame data, and a target observation variable is extracted from the second frame data so as to control the target agent to execute the next action; and so on until the game is finished.
Alternatively, the next frame data may be acquired as the current frame data every 1 second. For example, when the game timing starts, the first frame data in the game running environment may be acquired as the current frame data; the target observation variable is extracted from the first frame data, mapped into a target high-dimensional feature by the trained coding model, the action of the target agent in the current frame data is determined based on the target high-dimensional feature by the trained classification model, and the target agent is controlled to execute the action. Then, when the game has run for 1 second, the second frame data in the game running environment is obtained as the current frame data, and a target observation variable is extracted from the second frame data so as to control the target agent to execute an action; when the game has run for 2 seconds, the third frame data is obtained as the current frame data, and a target observation variable is extracted from the third frame data so as to control the target agent to execute an action; and so on until the game is finished.
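Either scheduling scheme (per completed action, or per fixed interval) reduces to the same driving loop: repeatedly take the next frame as the current frame and act on it until the game ends. A minimal sketch, with hypothetical names (`run_match`, `act`, the `game_over` flag) and the frame source abstracted as an iterator:

```python
import itertools

def run_match(frames, act, max_steps=None):
    """Drive the agent frame by frame: each iteration treats the next frame
    as the current frame and executes one action on it (extract observation,
    encode, classify, execute), stopping when the game is finished."""
    executed = []
    for frame in itertools.islice(frames, max_steps):
        if frame.get("game_over"):
            break
        executed.append(act(frame))
    return executed

# Toy frame stream standing in for first, second, third frame data.
frames = iter([{"id": 1}, {"id": 2}, {"id": 3, "game_over": True}])
actions = run_match(frames, act=lambda f: f["id"])
```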
It should be noted that online deployment of the trained model may be implemented; for example, the trained model may be deployed in an AI player access service, where the player clients and the AI agents (i.e., AI players) access the game server using the same protocol. The game server sends scene information to the clients and to the AI access service cluster in a frame-synchronized manner, and after each end derives an action instruction, the instruction is sent back to the game server. The game server finally merges the action instructions sent by the client players and the AI players in a fixed manner and iterates the game process. The trained model in this implementation models the perception and actions of a single agent in an IO game, adopts an image- and unit-based modeling approach, has fine decision granularity, and is applicable to a wide range of scenes; it is suitable for multi-player team scenes of IO games, and the actual experience is insensitive to the number of teams; it can flexibly handle changes in the player's field of view, is applicable to different map sizes, and is transferable. In addition, because the model and the players access online with the same protocol, access is flexible, and the scheme can be effectively applied to application scenarios such as offline hosting, man-machine fighting, and man-machine mixing in various IO games.
According to the embodiment of the application, training sample data can be obtained, and observation variables are extracted from the training sample data; the real coding type of the observation variable and the real action of the agent are acquired; the observation variable is coded through the coding model to obtain a predictive coding type, and the parameters of the coding model are then adjusted according to the real coding type and the predictive coding type to obtain the trained coding model; next, the observation variable is mapped into a high-dimensional feature through the trained coding model, and the predicted action of the agent is determined based on the high-dimensional feature through the classification model; the parameters of the classification model can then be adjusted according to the real action and the predicted action to obtain the trained classification model. This scheme can effectively encode the observation variables, making the training and learning of the coding model meaningful; through unsupervised learning, the features learned by the coding model become more expressive and can better represent the agent's observation variables, and the high-dimensional features output by the coding model are used by the classification model to predict and train the agent's actions, thereby improving the reliability and accuracy of model training.
Referring to fig. 3, fig. 3 is a flow chart of a model training method according to an embodiment of the application. The model training method may include steps S201 to S207, etc., and specifically may be as follows:
S201, training sample data are acquired, and observation variables are extracted from the training sample data.
S202, acquiring the real coding type of the observation variable and the real action of the intelligent agent.
Step S201 is similar to step S101, step S202 is similar to step S102, and reference may be made to the detailed descriptions of step S101 and step S102, which are not repeated herein.
S203, mapping the observed variable into high-dimensional characteristics through the coding model, and transmitting the high-dimensional characteristics to the classification model.
The type of the coding model can be flexibly set according to actual needs; for example, the coding model can be a DNN or a CNN, and the coding model can be an unsupervised coding model. After the observation variable is obtained, it can be mapped into a high-dimensional feature through the coding model, and the high-dimensional feature is transmitted to the classification model; the high-dimensional feature can be characteristic information of the observation variable.
S204, coding the observation variable based on the high-dimensional characteristics through a coding model to obtain a predictive coding type.
The coding model can perform classification learning on the high-order data (namely, the observation variables) so as to encode them; for example, from the agent's observations of killing enemies, it learns to classify the number of enemies killed, thereby achieving coding learning on the high-order data. Specifically, enemy-killing skill can be classified by the number of kills into multiple types such as killing one enemy, two enemies, or three enemies; team-cooperation skill can be classified by the number of interactions (namely, team cooperations) into multiple types such as one interaction, two interactions, or three interactions; classification can also be made according to the numbers of teammates and enemies; and the agent's position can be classified as left, right, middle, and the like.
After the high-dimensional features are obtained, the observation variables can be encoded based on them through the coding model to obtain the predictive coding type, so that corresponding labels can be generated for the observation variables. For example, the observation variables can be mapped into high-dimensional features through the coding model, and the high-dimensional features mapped into corresponding labels of the high-order data, such as the number of teammates or the number of enemies, so as to optimize the learning of the coding model.
In some implementations, the predictive coding type of the observation variable may include the number of partners, the number of opponents, the number of strikes against the opponents, and the number of team cooperations. The number of the partners can be the number of teammates, the number of the opponents can be the number of enemies, the number of times of striking the opponents can be the number of times of striking the enemies, and the number of times of team cooperation can be the number of times of team cooperation between the intelligent body and teammates.
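The derivation of such labels from an observation can be sketched as follows. The field names, the `map_width` parameter, and the three-way position split are illustrative assumptions, not from the patent; in the actual scheme these labels supervise the coding model's auxiliary learning.

```python
def auxiliary_labels(observation, map_width=100.0):
    """Derive coding-type labels from one observation: partner count,
    opponent count, kill count, team-cooperation count, and a coarse
    position class (left / middle / right)."""
    x = observation["agent_x"]
    third = map_width / 3.0
    position = "left" if x < third else ("middle" if x < 2 * third else "right")
    return {
        "num_partners": len(observation["teammates"]),
        "num_opponents": len(observation["enemies"]),
        "kills": observation["kills"],              # e.g. one/two/three enemies killed
        "cooperations": observation["cooperations"],
        "position": position,
    }

obs = {"agent_x": 80.0, "teammates": ["t1", "t2"], "enemies": ["e1"],
       "kills": 2, "cooperations": 1}
labels = auxiliary_labels(obs)
```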
S205, determining the prediction action of the agent based on the high-dimensional characteristics through the classification model.
The type of the classification model can be flexibly set according to actual needs; for example, the classification model can be a DNN or a CNN. With abstract structures of the high-order data, such as the number of teammates and the number of enemies, serving as labels, the coding model can learn better abstract features and gain better expression capability, so the classification model can learn with the learned coding features; that is, the classification model can learn the actions corresponding to the observation variables from good features. Specifically, after the high-dimensional features are obtained, the predicted action of the agent can be determined based on them through the classification model. The predicted action may include a running direction and an action type: the running direction may include traveling forward, backward, leftward, or rightward. Taking the Battle of Balls game as an example, the action type may include eating spores, eating grains, splitting (e.g., splitting one ball into two balls), spitting spores, or moving; taking the Honor of Kings game as an example, the action type may include attacking enemies, launching the A skill, launching the B skill, or moving.
S206, adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain the trained coding model.
Through unsupervised learning, the coding model acquires good expressivity for the features generated from the observation variables, such as the form and number of the agent or of its teammates; it also expresses cognition of scenes well, such as kill scenes or cooperation scenes. The actions corresponding to the observation variables can therefore be learned from good features, unsupervised processing of high-order abstract data is realized, and the learning of the coding model becomes more meaningful.
In some embodiments, adjusting the parameters of the coding model according to the real coding type and the predictive coding type to obtain the trained coding model may include: converging the real coding type and the predictive coding type through the first loss function to adjust the parameters of the coding model to appropriate values, thereby obtaining the trained coding model.
After the real coding type and the predictive coding type are obtained, the parameters of the coding model can be adjusted according to them to obtain the trained coding model. To improve the accuracy of model training, the real coding type and the predictive coding type can be converged through a first loss function so that the parameters of the coding model are adjusted to appropriate values, yielding the trained coding model. The first loss function can be flexibly set according to actual needs. For example, as shown in fig. 4, the first loss function may be an auxiliary loss function: a network model for multi-task learning may be constructed in which the auxiliary loss function supervises the learning of the coding model, so that the coding model learns abstract feature representations. The observation variables serve as inputs to the coding model, and the outputs of the coding model are sent to the classification model and to the auxiliary loss function for learning, respectively. The total loss function is then L = La(θ) + Lc(θ), where La(θ) is the auxiliary loss function and Lc(θ) is the classification loss function. After the predictive coding type is obtained through the coding model based on the observation variables in the training sample data, the real coding type of the unsupervised label and the predictive coding type can be converged through the auxiliary loss function, so that the parameters of the coding model are adjusted to appropriate values, the trained coding model is obtained, and the unsupervised learning is completed. Through this unsupervised processing, the features learned by the coding model become more expressive and can better represent the agent's observation variables.
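The total loss L = La(θ) + Lc(θ) can be computed as below. This is a minimal numeric sketch in which both losses are taken as cross-entropies (the patent does not fix their exact forms); the class counts and probabilities are toy values.

```python
import math

def cross_entropy(probs, true_idx):
    return -math.log(probs[true_idx] + 1e-12)

def total_loss(coding_probs, true_coding_type, action_probs, true_action):
    """Total multi-task loss L = La(theta) + Lc(theta): the auxiliary loss
    on the predicted coding type plus the classification loss on the
    predicted action, as in the fig. 4 setup."""
    La = cross_entropy(coding_probs, true_coding_type)   # auxiliary loss
    Lc = cross_entropy(action_probs, true_action)        # classification loss
    return La + Lc

L = total_loss(coding_probs=[0.8, 0.1, 0.1], true_coding_type=0,
               action_probs=[0.25, 0.25, 0.25, 0.25], true_action=2)
```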
The present application abstracts high-order data such as the number of enemies killed, the number of team cooperations, the number of enemies, and the number of teammates, and uses these data as supervised learning signals for the observation variables so as to train the coding model independently. This makes the learning of the coding model more meaningful: for example, by learning the numbers of teammates and enemies, the coding model can effectively encode a single individual in the observation variables; by learning the number of enemies killed, it can effectively encode individuals and kill scenes; and by learning team cooperation, it can effectively encode team-cooperation scenes.
S207, parameters of the classification model are adjusted according to the real actions and the predicted actions, and the trained classification model is obtained.
In some embodiments, adjusting the parameters of the classification model according to the real action and the predicted action to obtain the trained classification model may include: converging the real action and the predicted action through the second loss function to adjust the parameters of the classification model to appropriate values, thereby obtaining the trained classification model.
After the real action and the predicted action are obtained, the parameters of the classification model can be adjusted according to them to obtain the trained classification model. To improve the accuracy of model training, the real action and the predicted action can be converged through a second loss function so that the parameters of the classification model are adjusted to appropriate values, yielding the trained classification model; the second loss function can be flexibly set according to actual needs. For example, as shown in fig. 4, the second loss function may be a classification loss function: after the classification model obtains a predicted action based on the high-dimensional features output by the coding model, the predicted action and the real action may be converged through the classification loss function to adjust the parameters of the classification model to appropriate values, thereby obtaining the trained classification model.
In some embodiments, the model training method may further include: acquiring current frame data in an operation environment; extracting a target observation variable from the current frame data; mapping the target observation variable into a target high-dimensional feature through the trained coding model; determining, through the trained classification model, the action of the target agent in the current frame data based on the target high-dimensional feature; and controlling the target agent to execute the action.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
According to the embodiment of the application, training sample data can be obtained, and observation variables are extracted from the training sample data; the real coding type of the observation variable and the real action of the agent are acquired; the observation variable is mapped into a high-dimensional feature through the coding model, and the high-dimensional feature is transmitted to the classification model; the observation variable is encoded based on the high-dimensional feature through the coding model to obtain a predictive coding type; the predicted action of the agent is determined based on the high-dimensional feature through the classification model; the parameters of the coding model are adjusted according to the real coding type and the predictive coding type to obtain the trained coding model; and the parameters of the classification model are adjusted according to the real action and the predicted action to obtain the trained classification model. This scheme can effectively encode the observation variables, making the training and learning of the coding model meaningful; through unsupervised learning, the features learned by the coding model become more expressive and can better represent the agent's observation variables, and the high-dimensional features output by the coding model are used by the classification model to predict and train the agent's actions, thereby improving the reliability and accuracy of model training.
Referring to fig. 5, fig. 5 is a flow chart illustrating a method for using a model according to an embodiment of the application. The model using method can be applied to a server. The models can comprise a coding model, a classification model, and the like, which are obtained by training with the model training method described above and are deployed in the server. The method can comprise steps S301 to S305 and the like, specifically as follows:
S301, acquiring current frame data in an operation environment.
S302, extracting a target observation variable from current frame data.
S303, mapping the target observation variable into a target high-dimensional characteristic through the trained coding model.
S304, determining the action of the target agent in the current frame data based on the target high-dimensional feature through the trained classification model.
S305, controlling the target agent to execute the action.
After the trained coding model and the trained classification model are obtained, the trained coding model and the trained classification model can be used for performing the matching. Specifically, current frame data in the operation environment may be obtained, where the operation environment and the corresponding current frame data may be flexibly set according to an actual application scenario, and specific content is not limited herein. For example, in a ball game, an owner blaze game, or a agar io game, current frame data in a game running environment may be acquired, and since video data may be recorded and formed during the game running, the current frame data of the video data may be acquired at intervals of a preset time, which may be flexibly set according to actual needs, for example, the preset time may be set to 1 second or 0.1 second, or the like. Taking the ball big combat game as an example, the current frame data may include an environment map, a location of an agent, a speed of movement of the agent, a location of a teammate, a speed of movement of the teammate, a location of an enemy, a speed of movement of the enemy, a location of spores, a weight of spores, a location of grains, a weight of grains, a game time, and the like.
After the current frame data is obtained, a target observation variable can be extracted from the current frame data, wherein the target observation variable can comprise the position of an intelligent agent, the position of an opponent, the position of a partner and the like, the target intelligent agent can be a game role controlled by computer equipment, and the position of the target intelligent agent can be the current position of the intelligent agent in a game map corresponding to the current frame data; the opponent can be an enemy of the agent, and the position of the opponent can be the current position of the enemy in the game map corresponding to the current frame data; the partner may be a teammate of the agent, and the location of the partner may be a current location of the teammate in the game map corresponding to the current frame data.
The target observed variable may then be mapped by the trained coding model to a target high-dimensional feature, where the target high-dimensional feature may be characteristic information of the observed variable. At this time, the actions of the target agent in the current frame data may be determined based on the target dimensional features through the trained classification model, the actions of the target agent may include a running direction, an action type, and the like, the running direction may include a forward, backward, leftward, rightward, and the like, the action type may include actions such as eating spores, eating grains, splitting, spitting spores, or moving, and the like, the action type may include actions such as shooting enemies, launching a skills, launching B skills, or moving, and the like, taking the ball large combat game as an example. Finally, the target agent can be controlled to execute the action.
In order to improve the reliability of the model, before deploying the trained coding model and the trained classification model (hereinafter, for convenience of description, the trained coding model and the trained classification model are referred to as a trained model), the model obtained by training may be evaluated, for example, from two aspects: on the one hand, anthropomorphic level: the trained model can be used for competing with human beings, and then the accuracy of the trained model prediction can be obtained through statistics in an evaluation data set and used for describing the anthropomorphic level of the trained model. When the accuracy is greater than or equal to a first preset threshold, the performance of the trained model is reliable, the first preset threshold can be flexibly set according to actual needs, and when the accuracy is smaller than the first preset threshold, the performance of the trained model is poor, and the trained model needs to be trained again until the accuracy of the trained model prediction is greater than or equal to the first preset threshold. On the other hand, difficulty level: the method comprises the steps that a trained model can be utilized to fight human beings, then the success rate of the fight of the trained model can be counted in an evaluation result, when the success rate is larger than or equal to a second preset threshold value, the performance of the trained model is better, the second preset threshold value can be flexibly set according to actual needs, when the success rate is smaller than the second preset threshold value, the performance of the trained model is poorer, and the trained model needs to be trained again until the success rate of the trained model is larger than or equal to the second preset threshold value.
After obtaining the trained model, the trained model can be deployed to the appropriate player's game according to the strategy, and the application scenario can include: 1. novice teaching: the novice player is instructed how to bring his own controlled units to maximum value to win the benefits. 2. And (3) carrying out line dropping hosting: in the event of a loss of player's line, the player is assisted in performing a reasonable action to maximize the benefits of his game or minimize the losses of his game and avoid affecting the experience of other players. 3. Man-machine challenges: for high-level players, accessing the model after high-level training allows the players to challenge scores and increases liveness.
It should be noted that the next frame data may be acquired as the current frame data each time an action is completed. For example, when a game starts, the first frame data in the game running environment may be acquired as the current frame data, the target observation variable may be extracted from the first frame data, the target observation variable may be mapped to the target high-dimensional feature by the trained coding model, the action of the target agent in the current frame data may be determined based on the target high-dimensional feature by the trained classification model, and the target agent may be controlled to execute the action. Then, after that action is completed, the second frame data in the game running environment is acquired as the current frame data, and the target observation variable is extracted from the second frame data so as to control the target agent to execute the next action; and so on until the game ends.
Alternatively, the next frame data may be acquired as the current frame data every 1 second. For example, when the game timer starts, the first frame data in the game running environment may be acquired as the current frame data, the target observation variable may be extracted from the first frame data, the target observation variable may be mapped to the target high-dimensional feature by the trained coding model, the action of the target agent in the current frame data may be determined based on the target high-dimensional feature by the trained classification model, and the target agent may be controlled to execute the action. Then, 1 second after the game starts, the second frame data in the game running environment is acquired as the current frame data, and the target observation variable is extracted from the second frame data so as to control the target agent to execute an action; 2 seconds after the game starts, the third frame data in the game running environment is acquired as the current frame data, and the target observation variable is extracted from the third frame data so as to control the target agent to execute an action; and so on until the game ends.
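The per-frame inference procedure described above can be sketched as a simple control loop. The environment interface (`get_frame`, `extract_observation`, `execute`, `done`) and the two model callables below are assumptions for illustration, not an interface defined in this application:

```python
# Minimal sketch of the frame-by-frame control loop: acquire the current
# frame, extract the target observation variable, encode it into the target
# high-dimensional feature, classify the feature into an action, and have
# the target agent execute that action, repeating until the game ends.

def run_agent(env, coding_model, classification_model):
    """Drive the target agent frame by frame until the game is finished."""
    while not env.done():
        frame = env.get_frame()                  # current frame data
        obs = env.extract_observation(frame)     # target observation variable
        features = coding_model(obs)             # map to high-dimensional features
        action = classification_model(features)  # determine the agent's action
        env.execute(action)                      # control the agent to act
```

Whether the loop ticks on action completion or on a fixed 1-second cadence is decided by how `get_frame` paces itself; the loop body is the same in both variants.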
It should be noted that the trained model may be deployed online. For example, the trained model may be deployed in an AI player access service, where the player client and the AI agent (i.e., the AI player) access the game server through the same protocol. The game server sends scene information to the client and to the AI player access service cluster in a frame-synchronized manner, and after each end obtains an action instruction, the action instruction is transmitted back to the game server. The game server finally merges the action instructions sent by the client players and the AI players in a fixed manner and advances the game process by iteration. The trained model in this implementation can model the perception and action of a single agent in an IO game; it adopts an image-based and unit-based modeling mode, has fine decision granularity, and is suitable for a wide range of scenes. It is suitable for multi-player team scenarios of IO games, and the actual experience is insensitive to the number of teams. It can flexibly handle changes in the player's field of view, is suitable for different map sizes, and has transferability. In addition, because the AI accesses the server online through the same protocol as players, access is flexible, and the scheme can be effectively applied to application scenarios such as offline hosting, man-machine fighting, and man-machine mixing in various IO games.
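A toy sketch of the frame-synchronized merge described above may help: each frame, the server broadcasts the scene, collects one action instruction per participant (human clients and AI players alike, over the same protocol), merges them in a fixed order, and iterates the game state. All names below are illustrative assumptions:

```python
# Frame-synchronization sketch: the server treats human clients and AI
# players identically, merging their per-frame instructions in a fixed
# (here: sorted-by-id) order before stepping the game state.

def game_loop(state, players, step, max_frames):
    """players: mapping of player id -> callable(scene) -> action instruction."""
    for frame in range(max_frames):
        scene = state                            # broadcast scene information
        # collect one instruction per player; merge in a fixed order
        instructions = [(pid, act(scene)) for pid, act in sorted(players.items())]
        state = step(state, instructions)        # iterate the game process
    return state
```

The key property is that the server never distinguishes AI from human: both appear as entries in `players`.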
According to the embodiment of the application, the current frame data in the running environment can be acquired, the target observation variable can be extracted from the current frame data, the trained coding model can then map the target observation variable to the target high-dimensional feature, and the trained classification model can determine the action of the target agent in the current frame data based on the target high-dimensional feature, so that the target agent can be controlled to execute the action. Based on the trained coding model and classification model, this scheme can accurately determine the action of the target agent from the observation variable, thereby accurately controlling the target agent to execute the corresponding action and improving the accuracy and reliability of target agent control.
Referring to fig. 6, fig. 6 is a schematic block diagram of a computer device according to an embodiment of the present application.
As shown in fig. 6, the computer device 300 may include a processor 302, a memory 303, and a communication interface 304 connected by a system bus 301, wherein the memory 303 may include a non-volatile computer readable storage medium and an internal memory.
The non-transitory computer readable storage medium may store a computer program. The computer program comprises program instructions that, when executed, cause a processor to perform any one of the model training methods.
The processor 302 is used to provide computing and control capabilities to support the operation of the overall computer device.
The memory 303 provides an environment for the execution of a computer program in a non-transitory computer readable storage medium that, when executed by the processor 302, causes the processor 302 to perform any one of the model training methods.
The communication interface 304 is used for communication. It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of a portion of the structure related to the solution of the present application and does not constitute a limitation on the computer device 300 to which the solution is applied; a particular computer device 300 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be appreciated that the bus 301 may be, for example, an I2C (Inter-Integrated Circuit) bus; the memory 303 may be a Flash chip, a Read-Only Memory (ROM), a magnetic disk, an optical disk, a USB flash drive, a removable hard disk, or the like; the processor 302 may be a Central Processing Unit (CPU), or may be another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
In some embodiments, the processor 302 is configured to execute a computer program stored in the memory 303 to perform the following steps:
acquiring training sample data, and extracting observation variables from the training sample data; acquiring a real coding type of an observation variable and a real action of an intelligent agent; coding the observation variable through a coding model to obtain a predictive coding type; adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model; mapping the observed variable into a high-dimensional characteristic through the trained coding model; determining a prediction action of the agent based on the high-dimensional characteristics through the classification model; and adjusting parameters of the classification model according to the real actions and the predicted actions to obtain the trained classification model.
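The two-stage procedure recited above (fit the coding model against the true coding types first, then freeze it and fit the classification model against the true actions) can be sketched with a simple linear parameterization. The linear layers, learning rate, and step count below are illustrative assumptions; the actual models of the disclosure may be arbitrary neural networks:

```python
# Illustrative numpy sketch of the two-stage training: stage 1 adjusts the
# coding model so the predicted coding type converges to the true coding
# type (the first loss function); stage 2 maps observations to features
# with the frozen coding model and fits the classification model to the
# true actions (the second loss function).

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numeric stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear(x, labels, n_classes, lr=0.5, steps=300):
    """Fit a linear softmax layer by gradient descent on cross-entropy."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.1, size=(x.shape[1], n_classes))
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        probs = softmax(x @ w)
        w -= lr * x.T @ (probs - onehot) / len(x)
    return w

def train_coding_model(obs, coding_types, n_types):
    """Stage 1: converge predicted coding type to the true coding type."""
    return train_linear(obs, coding_types, n_types)

def train_classification_model(obs, w_enc, actions, n_actions):
    """Stage 2: features from the frozen coding model, fit to true actions."""
    features = obs @ w_enc          # "high-dimensional" features (sketch)
    return train_linear(features, actions, n_actions)
```

The order matters: the coding model's parameters are adjusted first, and only then are its outputs used as the classification model's inputs, mirroring the step sequence above.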
In one embodiment, the observation variables include the location of the agent, the location of the counterparty, and the location of the partner.
In one embodiment, the predictive coding type or the true coding type of the observed variable includes the number of partners, the number of opponents, the number of times the opponents were hit, and the number of team cooperations.
In one embodiment, when adjusting the parameters of the coding model according to the true coding type and the predictive coding type to obtain a trained coding model, the processor 302 further performs: and converging the real coding type and the predictive coding type through the first loss function so as to adjust the parameters of the coding model to proper values and obtain the trained coding model.
In one embodiment, when the parameters of the classification model are adjusted according to the actual motion and the predicted motion to obtain the trained classification model, the processor 302 further performs: and converging the real action and the predicted action through the second loss function so as to adjust the parameters of the classification model to proper values and obtain the trained classification model.
In one embodiment, the processor 302 also performs: acquiring current frame data in an operation environment; extracting a target observation variable from the current frame data; mapping the target observation variable into a target high-dimensional characteristic through the trained coding model; determining the action of the target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model; and controlling the target agent to execute the action.
In some embodiments, the processor 302 is configured to execute a computer program stored in the memory 303 to perform the following steps:
acquiring training sample data, and extracting observation variables from the training sample data; acquiring a real coding type of an observation variable and a real action of an intelligent agent; mapping the observed variable into high-dimensional characteristics through the coding model, and transmitting the high-dimensional characteristics to the classification model; coding the observation variable based on the high-dimensional characteristics through a coding model to obtain a predictive coding type; determining a prediction action of the agent based on the high-dimensional characteristics through the classification model; adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model; and adjusting parameters of the classification model according to the real actions and the predicted actions to obtain the trained classification model.
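The variant recited above differs from the two-stage version: the coding model maps the observed variable to shared high-dimensional features, both the coding-type prediction and the action prediction are made from those features, and both losses update the parameters. A hedged numpy sketch, again assuming linear layers and illustrative hyperparameters:

```python
# Illustrative joint-training sketch: a shared encoder feeds two heads
# (coding type, action); gradients from both losses flow back into the
# encoder so its features serve both predictions at once.

import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numeric stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_joint(obs, coding_types, actions, n_types, n_actions,
                hidden=8, lr=0.5, steps=400):
    rng = np.random.default_rng(0)
    w_enc = rng.normal(scale=0.1, size=(obs.shape[1], hidden))
    w_type = rng.normal(scale=0.1, size=(hidden, n_types))
    w_act = rng.normal(scale=0.1, size=(hidden, n_actions))
    y_type = np.eye(n_types)[coding_types]
    y_act = np.eye(n_actions)[actions]
    n = len(obs)
    for _ in range(steps):
        feats = obs @ w_enc                                # shared features
        d_type = (softmax(feats @ w_type) - y_type) / n    # coding-type loss grad
        d_act = (softmax(feats @ w_act) - y_act) / n       # action loss grad
        d_feats = d_type @ w_type.T + d_act @ w_act.T      # both reach the encoder
        w_type -= lr * feats.T @ d_type
        w_act -= lr * feats.T @ d_act
        w_enc -= lr * obs.T @ d_feats
    return w_enc, w_type, w_act
```

Here the coding-type prediction acts as an auxiliary task that shapes the shared features, rather than being trained in a separate preliminary stage.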
In some embodiments, the processor 302 is configured to execute a computer program stored in the memory 303 to perform the following steps:
acquiring current frame data in an operation environment; extracting a target observation variable from the current frame data; mapping the target observation variable into a target high-dimensional characteristic through the trained coding model; determining the action of the target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model; and controlling the target agent to execute the action.
In the foregoing embodiments, the descriptions of the embodiments are focused on, and for those portions of an embodiment that are not described in detail, reference may be made to the foregoing detailed description of the model training method, which is not repeated herein.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the computer readable storage medium stores a computer program, wherein the computer program comprises program instructions, and a processor executes the program instructions to realize any model training method provided by the embodiment of the application.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer readable storage medium may be an internal storage unit of the computer device of the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), etc. provided on the computer device.
Because the computer program stored in the computer readable storage medium can execute any model training method provided by the embodiment of the present application, the beneficial effects that any model training method provided by the embodiment of the present application can achieve can be achieved, which are detailed in the previous embodiments and are not described herein.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or system that comprises the element.
The foregoing embodiment numbers of the present application are for description only and do not represent the merits of the embodiments. While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (12)

1. A method of model training, comprising:
acquiring training sample data, and extracting observation variables from the training sample data;
acquiring a real coding type of the observed variable and a real action of an intelligent agent;
coding the observed variable through a coding model to obtain a predictive coding type;
adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model;
mapping the observed variable into a high-dimensional characteristic through the trained coding model;
determining a predicted action of the agent based on the high-dimensional features by a classification model;
And adjusting parameters of the classification model according to the real actions and the predicted actions to obtain a trained classification model.
2. The model training method of claim 1, wherein the observation variables include a location of the agent, a location of an opponent, and a location of a partner.
3. The model training method according to claim 1, wherein the predictive coding type or the true coding type of the observed variable includes the number of partners, the number of opponents, the number of striking opponents, and the number of team cooperations.
4. The model training method according to claim 1, wherein the adjusting parameters of the coding model according to the true coding type and the predictive coding type, to obtain a trained coding model, includes:
converging the real coding type and the predictive coding type through a first loss function to adjust parameters of the coding model to proper values, so as to obtain a trained coding model;
adjusting parameters of the classification model according to the real actions and the predicted actions, and obtaining a trained classification model comprises:
And converging the real action and the predicted action through a second loss function so as to adjust the parameters of the classification model to proper values and obtain the trained classification model.
5. The model training method according to any one of claims 1 to 4, characterized in that the model training method further comprises:
acquiring current frame data in an operation environment;
extracting a target observation variable from the current frame data;
mapping the target observation variable into a target high-dimensional characteristic through the trained coding model;
determining the action of a target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model;
and controlling the target intelligent agent to execute the action.
6. A method of model training, comprising:
acquiring training sample data, and extracting observation variables from the training sample data;
acquiring a real coding type of the observed variable and a real action of an intelligent agent;
mapping the observed variable into a high-dimensional characteristic through a coding model, and transmitting the high-dimensional characteristic to a classification model;
coding the observation variable based on the high-dimensional characteristics through the coding model to obtain a predictive coding type;
Determining, by the classification model, a predicted action of the agent based on the high-dimensional features;
adjusting parameters of the coding model according to the real coding type and the predictive coding type to obtain a trained coding model;
and adjusting parameters of the classification model according to the real actions and the predicted actions to obtain a trained classification model.
7. The model training method of claim 6, wherein the observation variables include a location of the agent, a location of an opponent, and a location of a partner.
8. The model training method of claim 6, wherein the predictive coding type or the true coding type of the observed variable includes a number of partners, a number of opponents, a number of striking opponents, and a number of team cooperations.
9. Model training method according to any of the claims 6 to 8, characterized in that the model training method further comprises:
acquiring current frame data in an operation environment;
extracting a target observation variable from the current frame data;
mapping the target observation variable into a target high-dimensional characteristic through the trained coding model;
Determining the action of a target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model;
and controlling the target intelligent agent to execute the action.
10. A model using method, characterized by being applied to a server, wherein the model is a model trained by the model training method according to any one of claims 1 to 5, or the model is a model trained by the model training method according to any one of claims 6 to 9, and is deployed in the server; the method comprises the following steps:
acquiring current frame data in an operation environment;
extracting a target observation variable from the current frame data;
mapping the target observation variable into a target high-dimensional characteristic through the trained coding model;
determining the action of a target agent in the current frame data based on the target high-dimensional characteristics through the trained classification model;
and controlling the target intelligent agent to execute the action.
11. A computer device comprising a processor and a memory, the memory having stored therein a computer program, the processor executing the model training method of any of claims 1 to 5 when calling the computer program in the memory, or the processor executing the model training method of any of claims 6 to 9 when calling the computer program in the memory.
12. A storage medium for storing a computer program to be loaded by a processor to perform the model training method of any one of claims 1 to 5 or to be loaded by a processor to perform the model training method of any one of claims 6 to 9.
CN202010496299.XA 2020-06-03 2020-06-03 Model training method, model using method, computer device, and storage medium Active CN111841018B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010496299.XA CN111841018B (en) 2020-06-03 2020-06-03 Model training method, model using method, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010496299.XA CN111841018B (en) 2020-06-03 2020-06-03 Model training method, model using method, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN111841018A CN111841018A (en) 2020-10-30
CN111841018B true CN111841018B (en) 2023-09-19

Family

ID=72985362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010496299.XA Active CN111841018B (en) 2020-06-03 2020-06-03 Model training method, model using method, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN111841018B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113559500B (en) * 2021-01-18 2023-07-21 腾讯科技(深圳)有限公司 Method and device for generating action data, electronic equipment and storage medium
CN112933605B (en) * 2021-03-08 2023-04-25 超参数科技(深圳)有限公司 Virtual object control and model training method and device and computer equipment
CN113384875B (en) * 2021-06-22 2023-06-16 吉林大学 Model training method and device, storage medium and electronic equipment
CN114492059B (en) * 2022-02-07 2023-02-28 清华大学 Multi-agent confrontation scene situation assessment method and device based on field energy
CN116747521B (en) * 2023-08-17 2023-11-03 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for controlling intelligent agent to conduct office

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104077597A (en) * 2014-06-25 2014-10-01 小米科技有限责任公司 Image classifying method and device
CN110046712A (en) * 2019-04-04 2019-07-23 天津科技大学 Decision search learning method is modeled based on the latent space for generating model
WO2019155065A1 (en) * 2018-02-09 2019-08-15 Deepmind Technologies Limited Neural network systems implementing conditional neural processes for efficient learning
CN110141867A (en) * 2019-04-23 2019-08-20 广州多益网络股份有限公司 A kind of game intelligence body training method and device


Also Published As

Publication number Publication date
CN111841018A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN111841018B (en) Model training method, model using method, computer device, and storage medium
Perez-Liebana et al. General video game ai: A multitrack framework for evaluating agents, games, and content generation algorithms
CN109499068B (en) Object control method and device, storage medium and electronic device
CN110882544B (en) Multi-agent training method and device and electronic equipment
WO2021218440A1 (en) Game character behavior control method and apparatus, and storage medium and electronic device
CN111111204B (en) Interactive model training method and device, computer equipment and storage medium
KR101246938B1 (en) Personalized behavior of computer controlled avatars in a virtual reality environment
CN108629422A (en) A kind of intelligent body learning method of knowledge based guidance-tactics perception
CN112791394B (en) Game model training method and device, electronic equipment and storage medium
CN110339569A (en) Control the method and device of virtual role in scene of game
CN112843725A (en) Intelligent agent processing method and device
CN111589120A (en) Object control method, computer device, and computer-readable storage medium
CN113209640B (en) Comment generation method, device, equipment and computer-readable storage medium
Fukushima et al. Mimicking an expert team through the learning of evaluation functions from action sequences
Edwards et al. The Role of Machine Learning in Game Development Domain-A Review of Current Trends and Future Directions
Ji et al. Improving decision-making efficiency of image game based on deep Q-learning
CN115888119A (en) Game AI training method, device, electronic equipment and storage medium
CN112870727B (en) Training and control method for intelligent agent in game
CN114344889B (en) Game strategy model generation method and control method of intelligent agent in game
Oh et al. Imitation learning for combat system in RTS games with application to starcraft
Sithungu et al. Adaptive Game AI-Based Dynamic Difficulty Scaling via the Symbiotic Game Agent
Wu et al. Reinforcement fuzzy tree: A method extracting rules from reinforcement learning models
Sehrawat et al. Intelligent PC Games: Comparison of Neural Network Based AI against Pre-Scripted AI
Kahng et al. Clear the fog: Combat value assessment in incomplete information games with convolutional encoder-decoders
Hasan et al. Implementing artificially intelligent ghosts to play Ms. Pac-Man game by using neural network at social media platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant