CN116999831A - Training method, apparatus, device, and storage medium for a game artificial intelligence model - Google Patents

Training method, apparatus, device, and storage medium for a game artificial intelligence model

Info

Publication number
CN116999831A
Authority
CN
China
Prior art keywords
game
model
training
version
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211288937.4A
Other languages
Chinese (zh)
Inventor
曾政文
张良鹏
张镇
万乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211288937.4A
Publication of CN116999831A
Legal status: Pending

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55: Controlling game characters or game objects based on the game progress
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00: Arrangements for software engineering
    • G06F8/60: Software deployment
    • G06F8/65: Updates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems


Abstract

The embodiments of the present application provide a training method, apparatus, device, and storage medium for a game artificial intelligence (AI) model, which are used to update a game AI model rapidly and efficiently. The method comprises: acquiring an initial reinforcement learning model and sample data of a game character set, wherein the game character set comprises game characters that are newly added or adjusted when a first game version is updated to a second game version; training the initial reinforcement learning model according to the sample data to obtain a first game AI model, wherein the first game AI model is the game AI model corresponding to the game character set; and jointly training the first game AI model and a second game AI model by policy distillation to obtain a third game AI model, wherein the second game AI model is the game AI model corresponding to the first game version, and the third game AI model is the game AI model corresponding to the second game version. The technical solution provided by the present application can be applied to the field of artificial intelligence.

Description

Training method, apparatus, device, and storage medium for a game artificial intelligence model
Technical Field
The present application relates to the field of artificial intelligence technology, and in particular to a training method, apparatus, device, and storage medium for a game artificial intelligence model.
Background
Artificial intelligence (Artificial Intelligence, AI) is a branch of technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. Since its inception, artificial intelligence theory and technology have matured and its application fields have expanded; it is expected that the technological products brought by artificial intelligence in the future will be "containers" of human intelligence. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is not human intelligence, but it can think like a human and may even exceed human intelligence. For example, AlphaGo (AlphaZero), implemented with artificial intelligence techniques, has defeated top human players in board games.
Compared with board games, fighting games are characterized by short decision times, a large decision space, and rich, varied strategies. During a fight, constrained by the fighting time and area defined by the game, a player needs to avoid risk through reasonable movement while damaging the opponent's character as much as possible. Because opponents have rich and changeable behavior strategies, the formulation, selection, and execution of strategies is a vital link in a game intelligence system, given the huge decision space and the real-time requirements on decisions.
The richness of characters and the high frequency of update iterations in current mainstream fighting games place high demands on the training of game AI models.
Disclosure of Invention
The embodiments of the present application provide a training method, apparatus, device, and storage medium for a game artificial intelligence model, which are used to update a game AI model rapidly and efficiently.
In view of this, the present application provides, in one aspect, a training method for a game artificial intelligence model, comprising: acquiring an initial reinforcement learning model and sample data of a game character set, wherein the game character set comprises at least one of a game character newly added or a game character adjusted when a first game version is updated to a second game version; training the initial reinforcement learning model according to the sample data to obtain a first game artificial intelligence (AI) model, wherein the first game AI model is the game AI model corresponding to the game character set and is used for controlling each game character in the game character set; and training the first game AI model and a second game AI model through policy distillation to obtain a third game AI model, wherein the second game AI model is the game AI model corresponding to the first game version, the third game AI model is the game AI model corresponding to the second game version, the second game version is the game version obtained by updating the first game version, the second game AI model is used for controlling each game character corresponding to the first game version, and the third game AI model is used for controlling each game character corresponding to the second game version.
Another aspect of the present application provides a training apparatus for a game artificial intelligence model, comprising: an acquisition module, configured to acquire an initial reinforcement learning model and sample data of a game character set, wherein the game character set comprises game characters that are newly added or adjusted when the first game version is updated to the second game version;
a training module, configured to train the initial reinforcement learning model according to the sample data to obtain a first game artificial intelligence (AI) model, wherein the first game AI model is the game AI model corresponding to the game character set and is used for controlling each game character in the game character set; and to train the first game AI model and a second game AI model through policy distillation to obtain a third game AI model, wherein the second game AI model is the game AI model corresponding to the first game version, the third game AI model is the game AI model corresponding to the second game version, the second game version is the game version obtained by updating the first game version, the second game AI model is used for controlling each game character corresponding to the first game version, and the third game AI model is used for controlling each game character corresponding to the second game version.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the acquisition module is further configured to acquire an initial game AI model, where the initial game AI model is an initialization of the game AI model corresponding to the second game version, and to acquire a first training sample of the initial game AI model;
the training module is specifically configured to train the initial game AI model according to the first training sample, the first game AI model, and the second game AI model to obtain a first intermediate game AI model;
the acquisition module is further configured to acquire a second training sample of the first intermediate game AI model;
the training module is specifically configured to train the first intermediate game AI model according to the second training sample, the first game AI model, and the second game AI model to obtain a second intermediate game AI model;
and the above operations are repeated until the third game AI model is obtained.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the acquisition module is specifically configured to invoke the initial game AI model to control a game character to interact with a game environment to generate a sample set, where the game environment includes a game scene and a fight opponent;
the first training sample is sampled from the sample set.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the acquisition module is specifically configured to invoke the first game AI model to control a game character to interact with a game environment to generate first data, and to invoke the second game AI model to control the game character to interact with the game environment to generate second data, where the game environment includes a game scene and a fight opponent;
the first data and the second data are saved as a sample set;
the first training sample is sampled from the sample set.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the training module is specifically configured to input the first training sample into the initial game AI model to obtain a first action probability distribution, and to input the first training sample into the first game AI model or the second game AI model to obtain a second action probability distribution;
a loss function is calculated from the first action probability distribution and the second action probability distribution;
and the parameters of the initial game AI model are adjusted with an optimization algorithm according to the loss function to obtain the first intermediate game AI model.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the loss function is a KL divergence, a cross-entropy loss function, a squared error loss function, or a regression loss function;
the optimization algorithm is a batch gradient descent algorithm, a stochastic gradient descent algorithm, or an adaptive moment estimation algorithm.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the game model training apparatus further includes a processing module, configured to obtain a first evaluation result of the third game AI model according to a first evaluation system, where the first evaluation system is used to evaluate whether the game characters controlled by the third game AI model meet a first behavior index;
when the first evaluation result indicates that the game characters controlled by the third game AI model meet the first behavior index, the third game AI model is confirmed to be a trained game AI model;
and when the first evaluation result indicates that the game characters controlled by the third game AI model do not meet the first behavior index, the policy distillation training process is repeated.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the processing module is specifically configured to invoke the third game AI model to control a first game character to interact with a game environment and output a first action probability, and to invoke the first game AI model to control the first game character to interact with the game environment and output a second action probability, where the first game character is a game character controlled by the first game AI model;
the third game AI model is invoked to control a second game character to interact with the game environment and output a third action probability, and the second game AI model is invoked to control the second game character to interact with the game environment and output a fourth action probability, where the second game character is a game character controlled by the second game AI model;
a first KL divergence is calculated from the first action probability and the second action probability, and a second KL divergence is calculated from the third action probability and the fourth action probability;
and the first evaluation result is obtained according to the first KL divergence and the second KL divergence.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the acquisition module is further configured to acquire a third training sample of the third game AI model, where the third training sample includes samples of game characters that do not meet the behavior index together with the corresponding target actions;
the training module is further configured to train the third game AI model according to the third training sample, the first game AI model, and the second game AI model to obtain a target game AI model, where the target game AI model is the game AI model corresponding to the second game version.
In another implementation of another aspect of the embodiments of the present application, the training module is specifically configured to train the initial reinforcement learning model according to the sample data, using a proximal policy gradient algorithm and self-play, to obtain the first game AI model, where the initial reinforcement learning model has a deep neural network structure.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the game model training apparatus further includes a processing module, configured to obtain a second evaluation result of the first game AI model according to a second evaluation system, where the second evaluation system is used to evaluate whether the game characters controlled by the first game AI model meet a second behavior index;
when the second evaluation result indicates that the game characters controlled by the first game AI model meet the second behavior index, the first game AI model is confirmed to be a trained game AI model;
and when the second evaluation result indicates that the game characters controlled by the first game AI model do not meet the second behavior index, the training process of the initial reinforcement learning model is repeated.
In one possible design, in another implementation of another aspect of the embodiments of the present application, the second behavior index includes, but is not limited to, the number of uses of each skill of the game character, the hit rate, the skill avoidance rate, the displacement distance, the use of in-game props, and customized behaviors of the game character.
Another aspect of the present application provides a game character control method applied to the third game AI model trained in the first aspect, comprising: acquiring current game situation information, where the game situation information includes state information of a first game character, state information of a second game character fighting against the first game character, and game scene information;
inputting the current game situation information into the third game AI model to obtain the action probability distribution of the first game character;
and controlling the action of the first game character according to the action probability distribution.
Another aspect of the present application provides a computer apparatus comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor is configured to execute the program in the memory, performing the methods of the above aspects according to the instructions in the program code;
the bus system is used to connect the memory and the processor so that the memory and the processor communicate with each other.
Another aspect of the application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the methods of the above aspects.
In another aspect of the application, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the above aspects.
From the above technical solutions, the embodiments of the present application have the following advantages: separate reinforcement learning is performed on the incremental game character information generated during the game version update to obtain a dedicated game AI model; this game AI model and the trained game AI model of the version before the update are then jointly trained by policy distillation to obtain the game AI model of the updated version. The model is thus updated according to the incremental information, and at the same time the policy distillation technique allows effective knowledge transfer between the incremental model and the original model, avoiding the process of performing reinforcement training from scratch again or performing reinforcement learning directly on the incremental information, thereby achieving the purpose of updating the game AI model rapidly and efficiently.
Drawings
FIG. 1 is a schematic diagram of a training system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of one embodiment of a training method for a game artificial intelligence model in an embodiment of the present application;
FIG. 3a is a schematic flow chart of policy distillation according to an embodiment of the present application;
FIG. 3b is another schematic flow chart of policy distillation according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of policy distillation in a game scene according to an embodiment of the present application;
FIG. 5 is a schematic diagram of evaluation by the first evaluation system according to an embodiment of the present application;
FIG. 6 is a flow chart of a training method of a game artificial intelligence model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of an embodiment of a method for controlling a game character according to an embodiment of the present application;
FIG. 8 is a schematic diagram of one embodiment of a game artificial intelligence model training apparatus in accordance with an embodiment of the present application;
FIG. 9 is a schematic diagram of another embodiment of a game artificial intelligence model training apparatus in accordance with an embodiment of the present application;
FIG. 10 is a schematic diagram of another embodiment of a game artificial intelligence model training apparatus in accordance with an embodiment of the present application;
FIG. 11 is a schematic diagram of another embodiment of a training apparatus for a game artificial intelligence model according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a training method, apparatus, device, and storage medium for a game artificial intelligence model, which are used to update a game AI model rapidly and efficiently.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "includes" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions. Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Artificial intelligence is applied to the game-model training process in the game field. In fighting games, for example, there is usually a human-computer interaction scenario in which the game characters operated by the machine are controlled by a game model. Fighting games are characterized by short decision times, a large decision space, and rich, varied strategies. During a fight, constrained by the fighting time and area defined by the game, a player needs to avoid risk through reasonable movement while damaging the opponent's character as much as possible. Because opponents have rich and changeable behavior strategies, the formulation, selection, and execution of strategies is a vital link in a game intelligence system, given the huge decision space and the real-time requirements on decisions. The richness of characters and the high frequency of update iterations in current mainstream fighting games place high demands on the training of game AI models.
In order to solve the above problems, the present application provides the following technical solution: acquiring an initial reinforcement learning model and sample data of a game character set, wherein the game character set comprises game characters that are newly added or adjusted when a first game version is updated to a second game version; training the initial reinforcement learning model according to the sample data to obtain a first game artificial intelligence (AI) model, wherein the first game AI model is the game AI model corresponding to the game character set and is used for controlling each game character in the game character set; and training the first game AI model and a second game AI model through policy distillation to obtain a third game AI model, wherein the second game AI model is the game AI model corresponding to the first game version, the third game AI model is the game AI model corresponding to the second game version, the second game version is the game version obtained by updating the first game version, the second game AI model is used for controlling each game character corresponding to the first game version, and the third game AI model is used for controlling each game character corresponding to the second game version.
For ease of understanding, some of the terms used in the present application are described below.
Game AI (Artificial Intelligence, AI): a game AI is a program or character introduced into a game, in combination with artificial intelligence techniques, to enrich gameplay and enhance the player's gaming experience.
Reinforcement learning: reinforcement learning is a form of machine learning in which a system learns from the environment so as to maximize reward. The model learns through continuous trial and error and feedback, and is often used for sequential decision-making or control problems, such as game AI and unmanned aerial vehicles.
Policy distillation: policy distillation is a form of knowledge transfer in which knowledge is transferred from a teacher network to a student network by training the student network to produce the same output as the teacher network.
Fighting game: a game based on role playing in which fists, weapons, or other props are used to fight against enemy characters.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior in order to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction. With the research and progress of artificial intelligence technology, it is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medicine, and smart customer service. It is believed that, with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
The training method, apparatus, device, and storage medium for a game artificial intelligence model provided by the present application can update a game AI model rapidly and efficiently. An exemplary application of the electronic device provided by the embodiments of the present application is described below; the electronic device may be implemented as various types of user terminals or as a server.
By running the training method of the game artificial intelligence model provided by the embodiments of the present application, the electronic device can update the game AI model quickly and efficiently. The method is suitable for many application scenarios in games, such as fighting games, multiplayer online battle arena (Multiplayer Online Battle Arena, MOBA) games, and shooting games.
Referring to fig. 1, fig. 1 is an optional architecture diagram of an application scenario of the training method of a game artificial intelligence model according to an embodiment of the present application. To support the training method of the game artificial intelligence model, a terminal device 100 is connected to a server 300 through a network 200, and the server 300 is connected to a database 400; the network 200 may be a wide area network, a local area network, or a combination of the two. A game client is deployed on the terminal device 100, where the client may run on the terminal device 100 in the form of a browser or as a stand-alone application (APP); the specific presentation form of the client is not limited herein.

The server 300 according to the present application may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal device 100 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a palmtop computer, a personal computer, a smart television, a smart watch, a vehicle-mounted device, or a wearable device. The terminal device 100 and the server 300 may be directly or indirectly connected through the network 200 by wired or wireless communication, which is not limited herein. The number of servers 300 and terminal devices 100 is also not limited. The solution provided by the present application may be completed independently by the terminal device 100, independently by the server 300, or by the terminal device 100 and the server 300 in cooperation, which is not specifically limited herein.

The database 400 may be regarded as an electronic filing cabinet, that is, a place where electronic files are stored, and a user may perform operations such as adding, querying, updating, and deleting data in the files. A "database" is a collection of data that is stored together in a manner that can be shared by multiple users, has as little redundancy as possible, and is independent of the application. A database management system (Database Management System, DBMS) is computer software designed for managing databases and generally provides basic functions such as storage, retrieval, security, and backup. Database management systems may be classified according to the database model they support, such as relational or extensible markup language (Extensible Markup Language, XML); according to the type of computer supported, such as server clusters or mobile phones; according to the query language used, such as structured query language (Structured Query Language, SQL) or XQuery; according to the performance focus, such as maximum scale or maximum operating speed; or according to other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example by supporting multiple query languages simultaneously.
In the present application, the database 400 may be used to store interaction data of game characters under the current game situation. Of course, the storage location of this interaction data is not limited to the database; it may also be stored, for example, in the terminal device 100, a blockchain, or a distributed file system of the server 300.
In some embodiments, the server 300 may perform the training method of the game artificial intelligence model provided by the embodiments of the present application in cooperation with the terminal device 100. In this embodiment, the specific flow may be as follows: the terminal device 100 invokes the initial reinforcement learning model to control each game character in the game character set to interact with the game environment to generate sample data; the terminal device 100 then stores the sample data in the database 400 or in a memory of the terminal device 100; the server 300 samples training samples for the initialized reinforcement learning model from the sample data stored in the database 400 or the terminal device 100, and trains the initial reinforcement learning model according to the training samples to obtain the first game AI model. Then the terminal device 100 invokes the first game AI model, the initial game AI model, or the second game AI model to obtain a sample set for the initial game AI model, where the initial game AI model is the initialized game AI model corresponding to the updated game version and the second game AI model is the trained game AI model corresponding to the version before the update; the terminal device 100 stores the sample set in the database 400 or a memory of the terminal device 100; the server 300 samples again from the sample set stored in the database 400 or the terminal device 100 to obtain training samples for the initial game AI model; finally, the server 300 trains the initial game AI model by policy distillation according to the training samples, the first game AI model, and the second game AI model to obtain the third game AI model. Finally, the server 300 may deploy the third game AI model to the terminal device 100, so that the terminal device 100 can invoke the third game AI model to control game characters during game play; or the server 300 may deploy the third game AI model on a server corresponding to the game client, so that the terminal device 100 can call the third game AI model on that server to control game characters during game play.
In another embodiment, the terminal device 100 independently performs the training method of the game artificial intelligence model provided by the embodiments of the present application. In this embodiment, the specific flow may be as follows: the terminal device 100 invokes the initial reinforcement learning model to control each game character in the game character set to interact with the game environment to generate sample data; the terminal device 100 then stores the sample data in the database 400 or in a memory of the terminal device 100; the terminal device 100 samples the stored sample data to obtain training samples for the initialized reinforcement learning model, and trains the initial reinforcement learning model according to the training samples to obtain the first game AI model. Then the terminal device 100 invokes the first game AI model, the initial game AI model, or the second game AI model to obtain a sample set for the initial game AI model, where the initial game AI model is the initialized game AI model corresponding to the updated game version and the second game AI model is the trained game AI model corresponding to the version before the update; the terminal device 100 stores the sample set in the database 400 or a memory of the terminal device 100; the terminal device 100 samples again from the stored sample set to obtain training samples for the initial game AI model; finally, the terminal device 100 trains the initial game AI model by policy distillation according to the training samples, the first game AI model, and the second game AI model to obtain the third game AI model. Finally, the terminal device 100 may deploy the third game AI model locally, so that it can invoke the third game AI model to control game characters during game play; or the terminal device 100 may deploy the third game AI model on the server corresponding to the game client, so that the terminal device 100 can call the third game AI model on that server to control game characters during game play.
Based on the above system, and referring specifically to fig. 2, an embodiment of the training method of a game artificial intelligence model according to the present application is described below with the server as the execution subject:
201. Acquire an initial reinforcement learning model and sample data of a game character set, the game character set comprising game characters that are newly added or adjusted when a first game version is updated to a second game version.
In this embodiment, the server acquires an initial reinforcement learning model for the game character set introduced by the game update, and uses the initial reinforcement learning model to obtain sample data corresponding to the game character set.
Optionally, the server may invoke the initial reinforcement learning model to control each game character in the game character set to interact with the game environment to obtain the sample data. The game environment describes, for the current fight, the opposing game character, the game scene, and the state information of both the opposing game character and the model-controlled game character. In one exemplary scenario, the server invokes the initialized reinforcement learning model to control the newly added game character A to fight against game character B in game scene a, where game character A may use skill set a, game character B may use skill set b, the health of game character A is at eighty percent, the health of game character B is at fifty percent, and so on.
It will be appreciated that the sample data includes the current game situation information as well as the label corresponding to the sample data. The game situation information may be any game situation information that the AI game character can acquire, and may include, but is not limited to, at least one of the following: position information, health information, skill information, distance information between the AI game character and other game characters, obstacle information, time information, and score information of the AI game character. Optionally, the time information may be the game duration or the like, which is not limited by the present application. Optionally, the game situation information further includes, but is not limited to, at least one of the following: position information, health information, score information, skill information, and the like of the opposing side that the AI game character can acquire. It should be understood that, in a human-computer fight scenario, the AI game character involved in the game situation information refers to the AI game character on the machine side, while the other game characters involved in the game situation information may be game characters on the player side. The label corresponding to the sample data may be the action information output by the game AI model. The actions of the AI game character, or of any fighting-game AI model referred to in the present application, include, but are not limited to: moving left, moving right, moving up, attacking, jumping, grabbing, and the like.
In one exemplary scenario, for each fight between two AI game characters, the server may obtain sample data as follows. Under the initial game situation, assume that the game situation information acquired by game character A is S1 and the game situation information acquired by game character B is s1; S1 is input into the initial reinforcement learning model to obtain the action probability distribution of game character A, and game character A executes action a1 according to that distribution; likewise, s1 is input into the initial reinforcement learning model to obtain the action probability distribution of game character B, and game character B executes action b1 according to that distribution. The fight then enters the next game situation, in which game character A executes action a2 and game character B executes action b2, and so on, until one of the two AI game characters wins or loses. Assuming that game character A finally wins after n rounds of fighting, its corresponding samples are: (S1, a1, 1), (S2, a2, 1) … (Sn, an, 1), and the samples corresponding to game character B are: (s1, b1, 0), (s2, b2, 0) … (sn, bn, 0). These samples may constitute part of the sample data of the initial reinforcement learning model described above, where 1 indicates that the game character won the fight and 0 indicates that the game character lost the fight.
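For illustration only, the sampling procedure described above can be reduced to the following minimal Python sketch. The environment interface (reset, step, done, winner) and the policy interface (sample_action) are assumptions introduced here for clarity, not elements disclosed by this application, and for simplicity both characters observe the same situation object.

```python
def collect_fight_samples(policy_a, policy_b, env):
    """Run one fight between two AI-controlled characters and label every
    (situation, action) pair with the final outcome: 1 = win, 0 = lose."""
    samples_a, samples_b = [], []
    situation = env.reset()                           # initial game situation S1
    while not env.done():
        action_a = policy_a.sample_action(situation)  # draw action from the distribution
        action_b = policy_b.sample_action(situation)
        samples_a.append((situation, action_a))
        samples_b.append((situation, action_b))
        situation = env.step(action_a, action_b)      # advance to the next situation
    win_a = 1 if env.winner() == "A" else 0           # final outcome of the fight
    labelled = [(s, a, win_a) for s, a in samples_a]
    labelled += [(s, a, 1 - win_a) for s, a in samples_b]
    return labelled
```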
It will be appreciated that the sample data from the losing side may contain information that, if used, could have helped avoid the loss. Therefore, the server may select at least one sample from the sample data corresponding to game character B for mutation. The server may select the samples randomly or according to a certain rule; in short, the present application does not limit how the server selects the samples corresponding to game character B. For example, the server may randomly select 50% of the samples corresponding to game character B for mutation. The server may adjust the second action probability distribution corresponding to a selected sample according to the second game situation information corresponding to that sample, so as to obtain a third action probability distribution. For example, when game character B is attacked, the server determines that the game character could jump or block; if the probabilities of jumping and blocking obtained from the action probability distribution under the current situation are not high, the server may mutate these probabilities, for example by increasing the probability of jumping or blocking. Although these mutated samples cannot guarantee that game character B would win, they can achieve a better result than losing, so the server may mark the outcome of game character B in these samples as 2, a label between winning and losing. On this basis, the server can construct the sample data from the samples corresponding to game character A, the samples corresponding to game character B, and the samples generated by mutating game character B's samples, so as to train the initial reinforcement learning model.
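A minimal sketch of this mutation step is given below, assuming each losing-side sample carries the action probability distribution for its game situation; the action names, the 50% mutation ratio, and the boost applied to the jump/block probabilities are illustrative assumptions.

```python
import random

ACTIONS = ["move_left", "move_right", "move_up", "attack", "jump", "block"]

def mutate_loser_samples(loser_samples, mutate_ratio=0.5, boost=0.3):
    """Pick a fraction of the losing character's samples, raise the probability
    of defensive actions (jump/block), and relabel them with the intermediate
    outcome label 2 (between win = 1 and lose = 0)."""
    n_mutate = int(len(loser_samples) * mutate_ratio)
    mutated = []
    for situation, action_probs, _ in random.sample(loser_samples, n_mutate):
        probs = list(action_probs)
        probs[ACTIONS.index("jump")] += boost     # encourage evasion
        probs[ACTIONS.index("block")] += boost    # encourage blocking
        total = sum(probs)
        probs = [p / total for p in probs]        # renormalise the distribution
        mutated.append((situation, probs, 2))
    return mutated
```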
Optionally, the initial reinforcement learning model may be a deep neural network. A deep neural network generally includes an input layer, hidden layers, and an output layer. The input layer is the first layer of the deep neural network, the output layer is the last layer, and all the layers in between are hidden layers; multiple hidden layers increase the expressive capacity of the model. In a deep neural network the layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. In this embodiment, the information input to the input layer is the sample data corresponding to each game character in the game character set, and the hidden layers are used to extract and express the features of the sample data; the predicted action probability distribution corresponding to the sample data is then obtained through the hidden layers, and the predicted action is obtained from the predicted action probability distribution.
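A policy network matching this description (input layer, fully connected hidden layers, output layer producing an action probability distribution) might look as follows; the use of PyTorch, the layer sizes, and the number of actions are assumptions for illustration.

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Fully connected network: game situation features in, action
    probability distribution out."""
    def __init__(self, state_dim=128, hidden_dim=256, num_actions=6):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),    # input layer
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),   # hidden layer: feature extraction
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # output layer: one logit per action
        )

    def forward(self, state):
        logits = self.layers(state)
        return torch.softmax(logits, dim=-1)     # predicted action probability distribution
```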
202. Train the initial reinforcement learning model according to the sample data to obtain a first game artificial intelligence (AI) model, where the first game AI model is the game AI model corresponding to the game character set and is used for controlling each game character in the game character set.
The server trains the initial reinforcement learning model according to the sample data, using a proximal policy gradient algorithm and self-play, to obtain the first game AI model. In this embodiment, the proximal policy gradient algorithm is a reinforcement learning algorithm that aims to solve the problems of low sample utilization and unstable training in traditional reinforcement learning algorithms. Specifically, the proximal policy gradient algorithm uses importance sampling to reuse samples and thereby improve sample utilization, and its objective function constrains the magnitude of policy updates so that the policy is always optimized in a good direction. As a learning paradigm, self-play can effectively improve the comprehensiveness and richness of the AI's policies. As the name suggests, self-play means playing against oneself. Specifically, during training with the proximal policy gradient algorithm, the AI saves its current policy model at intervals and plays against these saved models as opponents. As training time increases, the opponent models become more numerous and their strategies richer, and the AI must train continuously to defeat them; this is the training process of self-play.
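The self-play loop with a periodically enriched opponent pool can be sketched as below; the snapshot interval and the ppo_update step are placeholders standing in for a full proximal policy gradient implementation, and collect_fight_samples refers to the sampling sketch above.

```python
import copy
import random

def self_play_training(policy, env, iterations=1000, snapshot_every=50):
    """Train a policy against snapshots of its own past versions."""
    opponent_pool = [copy.deepcopy(policy)]              # start with the current policy
    for step in range(iterations):
        opponent = random.choice(opponent_pool)          # pick a past self as opponent
        rollouts = collect_fight_samples(policy, opponent, env)
        ppo_update(policy, rollouts)                     # placeholder: proximal policy gradient step
        if (step + 1) % snapshot_every == 0:
            opponent_pool.append(copy.deepcopy(policy))  # enrich the opponent pool
    return policy
```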
It will be appreciated that in a fighting game, the game characters of the two fighting parties oppose each other over several rounds in a certain game scene, and the core of the confrontation lies in blocking, counterattacking, combos, attack initiation, and spacing between the characters. Therefore, a series of behavior indexes can be designed from these angles, including the number of uses, hit rate, and counterattack hit rate of each skill (including normal attacks), skill avoidance, the duration and distance of keeping distance, the use of special props and skills, damage per second, customized behaviors of special characters, and the like. After the server obtains the first game AI model through reinforcement learning, it can evaluate the first game AI model with a second evaluation system to obtain an evaluation result, and then adjust and retrain the first game AI model according to the evaluation result to obtain the game AI model that best meets the requirements. Through this series of behavior indexes, the behavior of the fighting-game AI character can be made safe and controllable, so that the human-likeness of the first game AI model reaches an optimized state.
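Evaluation against such behavior indexes amounts to aggregating per-fight statistics and comparing them with thresholds, roughly as in the sketch below; the index names and threshold values are illustrative assumptions rather than values disclosed by this application.

```python
def evaluate_behavior(fight_stats, thresholds=None):
    """Check averaged per-fight statistics against behavior-index thresholds.

    fight_stats: list of dicts such as
        {"skill_hit_rate": 0.42, "skill_dodge_rate": 0.35, "damage_per_second": 12.0}
    """
    thresholds = thresholds or {"skill_hit_rate": 0.3,
                                "skill_dodge_rate": 0.25,
                                "damage_per_second": 8.0}
    averages = {key: sum(f[key] for f in fight_stats) / len(fight_stats)
                for key in thresholds}
    passed = all(averages[key] >= thresholds[key] for key in thresholds)
    return passed, averages
```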
203. Train the first game AI model and a second game AI model through policy distillation to obtain a third game AI model, where the second game AI model is the game AI model corresponding to the first game version, the third game AI model is the game AI model corresponding to the second game version, the second game version is the game version obtained by updating the first game version, the second game AI model is used for controlling each game character corresponding to the first game version, and the third game AI model is used for controlling each game character corresponding to the second game version.
In this embodiment, the flow of the policy distillation training performed by the server can be divided into two parts; that is, the iterative training process includes a sample acquisition process and a training process, as shown in fig. 3a.
The overall process may be as follows: the server acquires an initial game AI model, which is the initialization of the game AI model corresponding to the second game version, namely the student model shown in fig. 3a; the server then acquires a first training sample of the initial game AI model; the server trains the initial game AI model according to the first training sample, the first game AI model, and the second game AI model to obtain a first intermediate game AI model; a second training sample of the first intermediate game AI model is then acquired; the first intermediate game AI model is trained according to the second training sample, the first game AI model, and the second game AI model to obtain a second intermediate game AI model; and the above operations are repeated until the third game AI model is obtained.
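The alternation between sample acquisition and training in fig. 3a can be pictured with the loop below; the number of rounds, the batch size, the convergence check, and the helper functions collect_student_rollouts and has_converged are assumptions for illustration, and distill_update is sketched further below.

```python
import random

def policy_distillation(student, teachers, env, optimizer, rounds=100, batch_size=256):
    """Alternate between sample acquisition (the student controls characters in
    the game environment) and supervised training towards the teacher models."""
    sample_pool = []                                         # sample storage pool
    for _ in range(rounds):
        sample_pool.extend(collect_student_rollouts(student, env))
        batch = random.sample(sample_pool, min(batch_size, len(sample_pool)))
        distill_update(student, teachers, batch, optimizer)  # supervised distillation step
        if has_converged(student):                           # e.g. loss below a threshold
            break
    return student                                           # the third game AI model
```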
Optionally, when the server acquires the initial game AI model, the architecture of the initial game AI model is the same as that of the second game AI model, and the parameters of the initial game AI model may be taken from the parameters of the second game AI model or generated randomly. For example, assume that the second game AI model is based on a deep neural network structure; the initial game AI model is then also based on the deep neural network structure, while the parameters of each hidden layer may be randomly generated or copied from the second game AI model. If randomly generated parameters are used, the influence of the second game AI model on the initial game AI model is smaller and the feature expression during training is more comprehensive; if the parameters of the hidden layers of the second game AI model are used, the initial game AI model converges faster.
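The two initialization options described above (random parameters versus parameters taken from the second game AI model) correspond, in a PyTorch-style sketch, to keeping the default initialization or loading the old model's weights; PolicyNetwork refers to the illustrative network sketched earlier.

```python
def init_student_model(old_model=None):
    """Create the initial game AI model with the same architecture as the
    second game AI model, optionally reusing its parameters."""
    student = PolicyNetwork()                      # same architecture as the old model
    if old_model is not None:
        # Warm start: copying the old parameters speeds up convergence.
        student.load_state_dict(old_model.state_dict())
    # Otherwise the random default initialization is kept, which reduces the
    # old model's influence and lets features be learned more freely.
    return student
```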
Optionally, the process by which the server acquires the first training sample may be as follows: the server invokes the initial game AI model to control game characters to interact with the game environment to generate samples, and stores the samples in a sample storage pool; the server then samples the first training sample from the sample storage pool. It will be appreciated that the first training sample has the same structure as the sample data of the initial reinforcement learning model; that is, the process by which the server obtains the first training sample may refer to the sample acquisition process in step 201 above, which is not described in detail here.
Optionally, the process by which the server trains on the first training sample to obtain the first intermediate game AI model may be as follows: the first training sample is input into the initial game AI model to obtain a first action probability distribution, and the first training sample is input into the first game AI model or the second game AI model to obtain a second action probability distribution; a loss function is calculated from the first action probability distribution and the second action probability distribution; and the parameters of the initial game AI model are adjusted with an optimization algorithm according to the loss function to obtain the first intermediate game AI model. In this embodiment, when the server invokes the initial game AI model to obtain the first training sample, it needs to control all game characters to fight in all character combinations and obtain samples for all character combinations; for samples obtained while controlling newly added or adjusted characters, the teacher model in fig. 3a is the first game AI model, and for samples obtained while controlling game characters of the original game version, the teacher model is the second game AI model.
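One possible form of such a distillation step, assuming PyTorch and KL divergence as the loss, is sketched below. Each sample is assumed to carry a flag indicating whether it was generated while controlling a newly added or adjusted character, which decides whether the first game AI model or the second game AI model supplies the target distribution.

```python
import torch
import torch.nn.functional as F

def distill_update(student, teachers, batch, optimizer):
    """One supervised distillation step: pull the student's action probability
    distribution towards the appropriate teacher's distribution via KL divergence.

    batch: list of (state_tensor, is_new_character) pairs.
    teachers: {"incremental": first_game_ai_model, "original": second_game_ai_model}
    """
    states = torch.stack([state for state, _ in batch])
    with torch.no_grad():
        teacher_probs = torch.stack([
            (teachers["incremental"] if is_new else teachers["original"])(state)
            for state, is_new in batch])
    student_log_probs = torch.log(student(states) + 1e-8)
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```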
As shown in fig. 3a, the above training process may be repeated until the loss function reaches a convergence condition, thereby obtaining the third game AI model.
It will be appreciated that, when obtaining the sample data, the server may also control the game characters to interact with the game environment by invoking the first game AI model and the second game AI model. The flow of policy distillation in this case is shown in fig. 3b.
Based on the above flow diagrams, the following description is made with reference to the flow diagram of policy distillation in a game scene shown in fig. 4. Assume that there are M characters in the current 1v1 fighting game, so there are M×M fight combinations, and the current game AI model for the M characters is denoted as model A. After an iterative update of the fighting game version, the game adds N characters or adjusts N characters. Adding N characters means adding 2N×(M+N) fight combinations, and the AI model needs to cope with these new combinations; adjusting N characters means adjusting 2N×M fight combinations, and the AI model needs to adjust the policies adopted for these combinations. As shown in fig. 4, the final game AI model for both cases is denoted as model C. The specific training process may be as follows: first, the terminal device simulates sample data for the fight combinations of the N game characters using the initial reinforcement learning model, and trains the initial reinforcement learning model according to this sample data to obtain a first intermediate model; it then simulates sample data for the fight combinations of the N game characters using the first intermediate model, and trains the first intermediate model again according to the newly obtained sample data to obtain a second intermediate model; these operations are repeated until the model parameters converge, yielding a model for controlling the N game characters, denoted as model B in fig. 4. It will be appreciated that, in order to ensure the accuracy of model B, model B may also be evaluated by an evaluation system (which may also be referred to as a reward engineering technique). Finally, model B and model A are fused by policy distillation training to obtain model C. The final model C not only performs as expected on the newly added or adjusted game character combinations, but also keeps its behavior on the previous character combinations unchanged.
It will be appreciated that the core of policy distillation is to fuse the knowledge of multiple models together using supervised learning. In practical applications, however, the student model cannot fit the behavior of the teacher models one hundred percent, and deviations in behavior may occur. Therefore, in order to evaluate the model obtained by policy distillation, it is necessary to evaluate whether this deviation in performance is controllable and acceptable for game deployment. In the game field, a deviation in the strength or behavior of the student model means inconsistent behavior between the student model and a teacher model, so the behavior consistency of the two models is used as the evaluation index. From the perspective of behavior consistency, a first evaluation system can be built; an exemplary scheme is shown in fig. 5:
in the first evaluation system, the KL divergence is used as the optimization target of the policy distillation algorithm, where the KL divergence measures the difference between the action probability distribution output by the third game AI model and the action probability distribution output by the first game AI model, and the difference between the action probability distribution output by the third game AI model and the action probability distribution output by the second game AI model. Therefore, whether the behavior consistency between the student model and the teacher models meets the condition can be evaluated according to the KL divergence, so that the behavior of the fighting-game AI model is safe and controllable.
In this embodiment, the server may perform the following operations in the evaluation process: the server obtains a first evaluation result of the third game AI model according to a first evaluation system, wherein the first evaluation system is used for evaluating whether the game role controlled by the third game AI model meets a first behavior index; when the first evaluation result indicates that the game character controlled by the third game AI model meets the first behavior index, confirming that the third game AI model is a trained game AI model; and repeating the strategy distillation training process when the first evaluation result indicates that the game role controlled by the third game AI model does not meet the first behavior index.
Optionally, the process of the server obtaining the first evaluation result may be as follows: invoking the third game AI model to control a first game character to interact with the game environment and output a first action probability, and invoking the first game AI model to control the first game character to interact with the game environment and output a second action probability, where the first game character is a game character controlled by the first game AI model; invoking the third game AI model to control a second game character to interact with the game environment and output a third action probability, and invoking the second game AI model to control the second game character to interact with the game environment and output a fourth action probability, where the second game character is a game character controlled by the second game AI model; calculating a first KL divergence according to the first action probability and the second action probability, and calculating a second KL divergence according to the third action probability and the fourth action probability; and obtaining the first evaluation result according to the first KL divergence and the second KL divergence.
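A minimal sketch of this first evaluation procedure is given below. It assumes each game AI model is available as a callable that returns a discrete action probability distribution for a given game state, and the consistency threshold is an illustrative assumption; the disclosure does not fix a specific value.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action probability distributions."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    q = np.clip(np.asarray(q, dtype=np.float64), eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def first_evaluation(third_model, first_model, second_model,
                     states_first, states_second, threshold=0.05):
    # First KL divergence: third vs. first model on states of characters the first model controls.
    kl_first = float(np.mean([kl_divergence(first_model(s), third_model(s)) for s in states_first]))
    # Second KL divergence: third vs. second model on states of characters the second model controls.
    kl_second = float(np.mean([kl_divergence(second_model(s), third_model(s)) for s in states_second]))
    return {"first_kl": kl_first, "second_kl": kl_second,
            "meets_behavior_index": kl_first < threshold and kl_second < threshold}

# Dummy usage with placeholder uniform policies over six actions.
uniform = lambda s: np.full(6, 1.0 / 6.0)
print(first_evaluation(uniform, uniform, uniform, states_first=[0, 1], states_second=[2, 3]))
```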
In this embodiment, the action probability distribution output by the game AI model is a probability distribution over the actions of the game character. For example, the actions of the game character include moving left, moving right, moving up, attack, jump and block, with corresponding execution probabilities of 0.2, 0.3, 0.6, 0.8, 0.3 and 0.2 respectively, so the action probability distribution is (0.2, 0.3, 0.6, 0.8, 0.3, 0.2). The action probability distribution output by the game AI model is used by the client where the AI game character is located to determine the action to be performed in the current game situation, for example, the client executes the action with the highest probability according to the action probability distribution.
The training method of the game artificial intelligence model of the present application is described below in connection with an exemplary execution architecture. As shown in fig. 6, the execution architecture can be divided into three parts: an algorithm module, a training process, and an evaluation process. The algorithm module comprises a reinforcement learning module, a strategy distillation module and a strategy distillation optimization module. The training process may include reinforcement learning training, strategy distillation training, and strategy distillation optimization training; the evaluation process comprises an evaluation system oriented to reinforcement learning and an evaluation system oriented to strategy distillation. Based on this execution architecture, the specific execution flow may be as follows:
Reinforcement learning is performed on the newly added or newly adjusted game characters to obtain a game AI model for the specific game-character combinations involving those characters; during reinforcement learning, this model is evaluated by the evaluation system oriented to reinforcement learning, and its reinforcement learning is optimized and adjusted according to the evaluation result. Then, the optimized game AI model for the specific character combinations and the game AI model from before the game version update are fused through strategy distillation training to obtain a game AI model covering all game characters. Next, the game AI model for all game characters is evaluated by the evaluation system oriented to strategy distillation to obtain an evaluation result, and repeated strategy distillation training is performed on it according to this evaluation result, so as to obtain the incrementally updated game AI model.
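The control flow of this architecture can be summarized in a hedged sketch, in which the reinforcement learning routine, the two evaluation systems and the distillation routine are passed in as callables; every name and the bound on the number of rounds are assumptions made only for illustration.

```python
def incremental_update_pipeline(rl_train, rl_evaluate, distill, distill_evaluate,
                                pre_update_model, new_characters, max_rounds=10):
    """Reinforcement learning with its own evaluation, then distillation-based
    fusion with its own evaluation, repeated until the evaluation passes."""
    specific_model = rl_train(new_characters, warm_start=None)
    while not rl_evaluate(specific_model):                       # evaluation oriented to reinforcement learning
        specific_model = rl_train(new_characters, warm_start=specific_model)
    fused_model = distill(pre_update_model, specific_model, warm_start=None)
    for _ in range(max_rounds):
        if distill_evaluate(fused_model, pre_update_model, specific_model):
            break                                                # behavior consistency satisfied
        fused_model = distill(pre_update_model, specific_model, warm_start=fused_model)
    return fused_model                                           # incrementally updated game AI model
```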
The following describes the application of the third game AI model in the embodiment of the present application, as shown in fig. 7:
701. and acquiring current game situation information, wherein the game situation information comprises state information of a first game role, state information of a second game role competing with the first game role and game scene information, and the first game role is a game role controlled by the third game AI model.
In this embodiment, the game situation information may be any game situation information that the AI character can acquire, and may include, but is not limited to, at least one of the following: position information, blood volume information, skill information, distance information between the AI game character and other game characters, obstacle information, time information, and score information of the AI game character. Optionally, the time information may be a game duration or the like, which is not limited in the present application. Optionally, the game situation information further includes, but is not limited to, at least one of the following: position information, blood volume information, score information, skill information, and the like of the opponent, as far as the AI game character can acquire them. It should be understood that, in a fight scene of man-machine interaction, the AI game character involved in the game situation information refers to the AI game character on the machine side, while the other game characters involved in the game situation information may be player-side game characters. The label corresponding to the sample data may be the action information output by the game AI model. The actions of the AI game character, or of any fighting-game AI model related to the present application, include, but are not limited to: moving left, moving right, moving up, attack, jump, catch, and the like.
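As a hedged sketch of how such situation information might be flattened into a model input, the following encodes a few of the listed quantities into a fixed-length feature vector; all field names, the chosen features and the example values are assumptions for illustration, not the actual feature design of the disclosure.

```python
import numpy as np

def encode_game_situation(ai_state: dict, opponent_state: dict, scene: dict) -> np.ndarray:
    """Flattens situation information into a fixed-length feature vector."""
    dx = ai_state["position"][0] - opponent_state["position"][0]
    dy = ai_state["position"][1] - opponent_state["position"][1]
    features = [
        *ai_state["position"],             # AI character position (x, y)
        ai_state["hp"],                    # blood volume
        *ai_state["skill_cooldowns"],      # one entry per skill
        *opponent_state["position"],       # opponent position
        opponent_state["hp"],              # opponent blood volume
        np.hypot(dx, dy),                  # distance between the two characters
        scene["time_remaining"],           # time information
        scene["score_diff"],               # score information
    ]
    return np.asarray(features, dtype=np.float32)

example = encode_game_situation(
    {"position": (1.0, 0.0), "hp": 82.0, "skill_cooldowns": [0.0, 2.5]},
    {"position": (4.0, 0.0), "hp": 64.0},
    {"time_remaining": 55.0, "score_diff": 1.0},
)
print(example.shape)   # (11,)
```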
702. And inputting the current game situation information into the third game AI model to obtain the action probability distribution of the first game character.
The server inputs the current game situation information into the third game AI model to obtain the action probability distribution of the first game character. In this embodiment, the action probability distribution output by the game AI model is a probability distribution over the actions of the game character. For example, the actions of the game character include moving left, moving right, moving up, attack, jump and block, with corresponding execution probabilities of 0.2, 0.3, 0.6, 0.8, 0.3 and 0.2 respectively, so the action probability distribution is (0.2, 0.3, 0.6, 0.8, 0.3, 0.2).
703. And controlling the action of the first game character according to the action probability distribution.
In this embodiment, the action probability distribution output by the game AI model is used to determine the action to be executed under the current game situation, for example, the action with the highest probability is executed according to the action probability distribution.
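A minimal sketch of this selection step is shown below; the action names mirror the example above, the normalisation of the listed scores into a proper distribution is an added assumption, and sampling instead of taking the maximum is shown only as an option.

```python
import numpy as np

ACTIONS = ["move_left", "move_right", "move_up", "attack", "jump", "block"]

def select_action(action_probs, greedy=True):
    """Picks the executed action from the model's action probability distribution."""
    probs = np.asarray(action_probs, dtype=np.float64)
    probs = probs / probs.sum()                  # normalise scores into a distribution
    idx = int(np.argmax(probs)) if greedy else int(np.random.choice(len(probs), p=probs))
    return ACTIONS[idx]

print(select_action([0.2, 0.3, 0.6, 0.8, 0.3, 0.2]))   # -> 'attack', the highest-probability action
```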
Referring to fig. 8, fig. 8 is a schematic diagram of an embodiment of a game artificial intelligence model training device according to the present application, and the game artificial intelligence model training device 20 includes:
An acquisition module 201 for acquiring sample data of an initial reinforcement learning model and a game character set including a newly added game character and a newly adjusted game character when updating from a first game version to a second game version;
a training module 202, configured to train the initial reinforcement learning model according to the sample data to obtain a first artificial intelligence game AI model, where the first game AI model is a game AI model corresponding to the game character set, and the first game AI model is used to control each game character in the game character set; and training the first game AI model and a second game AI model through strategy distillation to obtain a third game AI model, wherein the second game AI model is a game AI model corresponding to the first game version, the third game AI model is a game AI model corresponding to the second game version, the second game version is a game version updated by the first game version, the second game AI model is used for controlling each game role corresponding to the first game version, and the third game AI model is used for controlling each game role corresponding to the second game version.
The embodiment of the application provides a game artificial intelligence model training device. With this device, independent reinforcement learning is performed on the incremental game character information generated during the game version update to obtain a specific game AI model; strategy distillation training is then performed on this game AI model and the game AI model trained before the version update to obtain the game AI model for the updated version. Model updating according to the incremental information is thus realized, and the strategy distillation technique enables effective knowledge migration between the incremental model and the original model; retraining from zero is avoided, reinforcement learning is performed directly on the incremental information, and the game AI model is updated rapidly and efficiently.
Alternatively, based on the embodiment corresponding to fig. 8, in another embodiment of the game artificial intelligence model training apparatus 20 provided in the embodiment of the present application,
the obtaining module 201 is further configured to obtain an initial game AI model, where the initial game AI model is an initialization model of a game AI model corresponding to the second game version; acquiring a first training sample of the initial game AI model;
the training module 202 is specifically configured to train the initial game AI model to obtain a first intermediate game AI model according to the first training sample, the first game AI model, and the second game AI model;
the obtaining module 201 is further configured to obtain a second training sample of the first intermediate game AI model;
the training module 202 is specifically configured to train the first intermediate game AI model to obtain a second intermediate game AI model according to the second training sample, the first game AI model, and the second game AI model;
repeating the above operation until the third game AI model is obtained.
The embodiment of the application provides a game artificial intelligence model training device. With this device, the incremental first game AI model and the trained second game AI model are fused through training, so that the third game AI model corresponding to the new version is obtained. Model updating according to the incremental information is thus realized; meanwhile, the strategy distillation technique enables effective knowledge migration between the incremental model and the original model, so that retraining from zero is avoided, reinforcement learning is performed directly on the incremental information, and the game AI model is updated rapidly and efficiently.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the game artificial intelligence model training device 20 provided by the embodiment of the present application, the obtaining module 201 is specifically configured to invoke the initial game AI model to control the interaction of the game character with the game environment to generate a sample set, where the game environment includes a game scene and a game fight object;
the first training sample is sampled from the sample set.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, the sample data is generated by utilizing the interaction between the student model and the game environment, so that the sample data can be more in line with the distribution requirement of the sample, and the model training is faster.
Optionally, in another embodiment of the game artificial intelligence model training device 20 provided in the embodiment of the present application based on the embodiment corresponding to fig. 8, the obtaining module 201 is specifically configured to invoke the first game AI model to control the game character to interact with the game environment to generate first data, and invoke the second game AI model to control the game character to interact with the game environment to generate second data, where the game environment includes a game scene and a game fight object;
Saving the first data and the second data as a sample set;
the first training sample is sampled from the sample set.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, the teacher model and the game environment are interacted to generate the sample data, so that the sample data is more representative, and the device is more suitable for sample collection of specific roles and targeted training.
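The two sample-generation options above (interaction driven by the student model, or by the teacher models) can both be sketched as a simple rollout-and-store loop. The environment interface (reset/step returning state, reward and a done flag) and all names are assumptions for illustration only.

```python
import random

def collect_samples(policy, env, episodes=8):
    """Rolls the given policy (student or teacher) in the game environment and stores
    (state, action_distribution) pairs as a sample set for distillation."""
    sample_set = []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            dist = policy(state)                                    # action probability distribution
            sample_set.append((state, dist))
            action = max(range(len(dist)), key=dist.__getitem__)    # greedy action for interaction
            state, _, done = env.step(action)                       # assumed (state, reward, done) interface
    return sample_set

def sample_batch(sample_set, batch_size=64):
    """Uniformly samples a training batch (e.g. the first training sample) from the sample set."""
    return random.sample(sample_set, min(batch_size, len(sample_set)))
```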
Optionally, in another embodiment of the game artificial intelligence model training device 20 according to the embodiment of the present application, the training module 202 is specifically configured to input the first training sample into the initial game AI model to obtain a first action probability distribution, and input the first training sample into the first game AI model or the second game AI model to obtain a second action probability distribution;
calculating a loss function from the first action probability distribution and the second action probability distribution;
and adjusting parameters of the initial game AI model by using an optimization algorithm according to the loss function to obtain the first intermediate game AI model.
The embodiment of the application provides a game artificial intelligence model training device. With this device, model updating according to the incremental information is realized; meanwhile, the strategy distillation technique enables effective knowledge migration between the incremental model and the original model, so that retraining from zero is avoided, reinforcement learning is performed directly on the incremental information, and the game AI model is updated rapidly and efficiently.
Optionally, on the basis of the embodiment corresponding to fig. 8, in another embodiment of the game artificial intelligence model training device 20 provided by the embodiment of the present application, the loss function is a KL divergence, a cross entropy loss function, a square error loss function or a regression loss function;
the optimization algorithm is a batch gradient descent algorithm, a random gradient descent algorithm or an adaptive moment estimation algorithm.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, various loss functions and optimization algorithms are provided, so that the feasibility and operability of the scheme are improved.
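As a small illustration of this flexibility, the sketch below maps the listed options to standard PyTorch components; the mapping itself is an assumption for illustration, and the learning rate is arbitrary.

```python
import torch
import torch.nn as nn

LOSSES = {
    "kl": nn.KLDivLoss(reduction="batchmean"),   # expects log-probabilities vs. probabilities
    "cross_entropy": nn.CrossEntropyLoss(),
    "mse": nn.MSELoss(),                         # square error loss
}

def make_optimizer(params, name="adam", lr=1e-4):
    if name == "adam":                           # adaptive moment estimation
        return torch.optim.Adam(params, lr=lr)
    if name == "sgd":                            # stochastic / batch gradient descent
        return torch.optim.SGD(params, lr=lr)
    raise ValueError(f"unknown optimizer: {name}")
```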
Optionally, in another embodiment of the game artificial intelligence model training device 20 according to the embodiment of fig. 8, as shown in fig. 9, the game artificial intelligence model training device further includes a processing module 203, where the processing module 203 is configured to obtain a first evaluation result of the third game AI model according to a first evaluation system, where the first evaluation system is configured to evaluate whether a game character controlled by the third game AI model meets a first behavior index;
when the first evaluation result indicates that the game character controlled by the third game AI model meets the first behavior index, confirming that the third game AI model is a trained game AI model;
And repeating the strategy distillation training process when the first evaluation result indicates that the game role controlled by the third game AI model does not meet the first behavior index.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, an evaluation system is constructed in the training process, so that the game AI model obtained through training can meet the customization requirement.
Optionally, in another embodiment of the game artificial intelligence model training device 20 according to the embodiment of the present application, based on the embodiment corresponding to fig. 9, the processing module 203 is specifically configured to invoke the third game AI model to control a first game role to interact with a game environment to output a first action probability, and invoke the first game AI model to control the first game role to interact with the game environment to output a second action probability, where the first game role is the game role controlled by the first game AI model;
invoking the third game AI model to control a second game character to interactively output a third action probability with the game environment, and invoking the second game AI model to control the second game character to interactively output a fourth action probability with the game environment, wherein the second game character is the game character controlled by the second game AI model;
Calculating a first KL divergence according to the first action probability and the second action probability, and calculating a second KL divergence according to the third action probability and the fourth action probability;
and obtaining the first evaluation result according to the first KL divergence and the second KL divergence.
The embodiment of the application provides a game artificial intelligence model training device. With this device, the evaluation system uses the KL divergence between the action probabilities of the third game AI model and of the first game AI model, and the KL divergence between the action probabilities of the third game AI model and of the second game AI model, to determine whether the behavior of the third game AI model in controlling game characters is consistent with that of the first game AI model and the second game AI model, and an optimization direction is provided for human-like adjustment according to the situation, so that the behavior of the game characters controlled by the game AI model is safe and controllable.
Optionally, in another embodiment of the game artificial intelligence model training device 20 according to the embodiment of fig. 8, the obtaining module 201 is further configured to obtain a third training sample of the third game AI model, where the third training sample includes a sample of game characters that do not meet the behavior index and a sample of the target action;
The training module is further configured to train the third game AI model to obtain a target game AI model according to the third training sample, the first game AI model and the second game AI model, where the target game AI model is a game AI model corresponding to the second game version.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, after passing through the evaluation system, a more targeted sample acquisition scheme is provided, so that training is more targeted, and model training is quickened.
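A hedged sketch of assembling such a third training sample is given below: states where the behavior index was not met are paired with one-hot target-action distributions that the model should imitate in the subsequent round of distillation. The action-space size and all names are illustrative assumptions.

```python
def build_third_training_sample(failed_states, target_actions, num_actions=6):
    """Pairs non-compliant states with one-hot target-action distributions."""
    samples = []
    for state, action in zip(failed_states, target_actions):
        target_dist = [0.0] * num_actions
        target_dist[action] = 1.0                  # desired action for this state
        samples.append((state, target_dist))
    return samples

print(build_third_training_sample(failed_states=["s0", "s1"], target_actions=[3, 5]))
```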
Optionally, in another embodiment of the game artificial intelligence model training apparatus 20 according to the embodiment of fig. 8, the training module 202 is specifically configured to train the initial reinforcement learning model through self-play by using a neighbor strategy gradient algorithm according to the sample data to obtain the first game AI model, where the initial reinforcement learning model is a deep neural network structure.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, the first game AI model can be obtained more quickly by training the model in a reinforcement learning mode, so that the game AI can be updated efficiently.
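If the "neighbor strategy gradient algorithm" is read as a proximal-policy-style clipped policy-gradient method, which is one plausible interpretation rather than a statement of the disclosure, a single surrogate-loss computation might look as follows; all names and values are illustrative.

```python
import torch

def clipped_policy_gradient_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective of a proximal-policy-style gradient method."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # probability ratio new / old policy
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                # maximize the surrogate -> minimize its negative

# Dummy usage with three actions sampled from self-play trajectories.
loss = clipped_policy_gradient_loss(torch.log(torch.tensor([0.30, 0.50, 0.20])),
                                    torch.log(torch.tensor([0.25, 0.50, 0.25])),
                                    torch.tensor([1.0, -0.5, 0.3]))
print(loss)
```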
Optionally, in another embodiment of the game artificial intelligence model training device 20 according to the embodiment of fig. 8, as shown in fig. 9, the game model training device further includes a processing module 203, where the processing module 203 is configured to obtain a second evaluation result of the first game AI model according to a second evaluation system, where the second evaluation system is configured to evaluate whether the game character controlled by the first game AI model meets a second behavior index;
when the second evaluation result indicates that the game character controlled by the first game AI model meets the second behavior index, confirming that the first game AI model is a trained game AI model;
and repeating the training process of the initial reinforcement learning model when the second evaluation result indicates that the game character controlled by the first game AI model does not meet the second behavior index.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, the behavior of the game character controlled by the first game AI model is safe and controllable.
Optionally, in another embodiment of the game artificial intelligence model training apparatus 20 according to the embodiment of fig. 8, the second behavior index includes, but is not limited to, a number of hits, a hit rate, a skill avoidance rate, a moving distance, a play object use condition, and a customized behavior of the game character.
The embodiment of the application provides a game artificial intelligence model training device. By adopting the device, various evaluation indexes are provided, so that the feasibility and operability of the scheme are improved.
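A hedged sketch of this second evaluation system is shown below: per-episode behavior statistics are compared against configured thresholds, and the model only passes when every index is satisfied. The metric names and threshold values are assumptions for illustration.

```python
def second_evaluation(episode_stats, thresholds):
    """Checks behavior indices such as hit rate, skill avoidance rate and moving distance."""
    per_metric = {name: episode_stats.get(name, 0.0) >= minimum
                  for name, minimum in thresholds.items()}
    return {"per_metric": per_metric, "meets_behavior_index": all(per_metric.values())}

print(second_evaluation(
    {"hit_rate": 0.41, "skill_avoidance_rate": 0.35, "moving_distance": 120.0},
    {"hit_rate": 0.30, "skill_avoidance_rate": 0.25, "moving_distance": 80.0},
))
```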
Referring to fig. 10, fig. 10 is a schematic diagram of a server structure according to an embodiment of the present application, where the server 300 may have a relatively large difference due to different configurations or performances, and may include one or more central processing units (central processing units, CPU) 322 (e.g., one or more processors) and a memory 332, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 342 or data 344. Wherein the memory 332 and the storage medium 330 may be transitory or persistent. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, the central processor 322 may be configured to communicate with the storage medium 330 and execute a series of instruction operations in the storage medium 330 on the server 300.
The server 300 may also include one or more power supplies 326, one or more wired or wireless network interfaces 350, one or more input/output interfaces 358, and/or one or more operating systems 341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 10.
The game artificial intelligence model training device provided by the application can be used for terminal equipment, and refer to fig. 11, which only shows the parts related to the embodiment of the application for convenience of explanation, and specific technical details are not disclosed, and refer to the method parts of the embodiment of the application. In the embodiment of the application, a terminal device is taken as a smart phone for example to describe:
fig. 11 is a block diagram showing a part of a structure of a smart phone related to a terminal device provided by an embodiment of the present application. Referring to fig. 11, the smart phone includes: radio Frequency (RF) circuitry 410, memory 420, input unit 430, display unit 440, sensor 450, audio circuitry 460, wireless fidelity (wireless fidelity, wiFi) module 470, processor 480, and power supply 490. Those skilled in the art will appreciate that the smartphone structure shown in fig. 11 is not limiting of the smartphone and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The following describes each component of the smart phone in detail with reference to fig. 11:
the RF circuit 410 may be used for receiving and transmitting signals during information reception and transmission or during a call; in particular, after downlink information of the base station is received, it is delivered to the processor 480 for processing, and uplink data is sent to the base station. In general, RF circuitry 410 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (low noise amplifier, LNA), a duplexer, and the like. In addition, the RF circuitry 410 may also communicate with networks and other devices via wireless communications. The wireless communications may use any communication standard or protocol including, but not limited to, global system for mobile communications (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS), and the like.
The memory 420 may be used to store software programs and modules, and the processor 480 may perform various functional applications and data processing of the smartphone by executing the software programs and modules stored in the memory 420. The memory 420 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program (such as a sound playing function, an image playing function, etc.) required for at least one function, and the like; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 420 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the smart phone. In particular, the input unit 430 may include a touch panel 431 and other input devices 432. The touch panel 431, also referred to as a touch screen, may collect touch operations thereon or thereabout by a user (e.g., operations of the user on the touch panel 431 or thereabout using any suitable object or accessory such as a finger, a stylus, etc.), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 431 may include two parts of a touch detection device and a touch controller. The touch detection device detects the touch azimuth of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device and converts it into touch point coordinates, which are then sent to the processor 480, and can receive commands from the processor 480 and execute them. In addition, the touch panel 431 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 430 may include other input devices 432 in addition to the touch panel 431. In particular, other input devices 432 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, mouse, joystick, etc.
The display unit 440 may be used to display information input by a user or information provided to the user and various menus of the smart phone. The display unit 440 may include a display panel 441, and optionally, the display panel 441 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 431 may cover the display panel 441, and when the touch panel 431 detects a touch operation thereon or nearby, the touch operation is transmitted to the processor 480 to determine the type of the touch event, and then the processor 480 provides a corresponding visual output on the display panel 441 according to the type of the touch event. Although in fig. 11, the touch panel 431 and the display panel 441 are two separate components to implement the input and output functions of the smart phone, in some embodiments, the touch panel 431 and the display panel 441 may be integrated to implement the input and output functions of the smart phone.
The smartphone may also include at least one sensor 450, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 441 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 441 and/or the backlight when the smartphone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (generally three axes), and can detect the gravity and direction when stationary, and can be used for identifying the application of the gesture of the smart phone (such as horizontal and vertical screen switching, related games, magnetometer gesture calibration), vibration identification related functions (such as pedometer and knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the smart phone are not described in detail herein.
The audio circuitry 460, speaker 461 and microphone 462 can provide an audio interface between the user and the smartphone. The audio circuit 460 may transmit the electrical signal obtained by converting the received audio data to the speaker 461, and the speaker 461 converts the electrical signal into a sound signal for output; on the other hand, the microphone 462 converts a collected sound signal into an electrical signal, which is received by the audio circuit 460 and converted into audio data; the audio data is then output to the processor 480 for processing and transmitted via the RF circuit 410 to, for example, another smart phone, or output to the memory 420 for further processing.
WiFi belongs to a short-distance wireless transmission technology, and a smart phone can help a user to send and receive emails, browse webpages, access streaming media and the like through a WiFi module 470, so that wireless broadband Internet access is provided for the user. Although fig. 11 shows a WiFi module 470, it is understood that it does not belong to the essential constitution of a smart phone, and can be omitted entirely as required within the scope of not changing the essence of the invention.
The processor 480 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions and processes data of the smart phone by running or executing software programs and/or modules stored in the memory 420 and invoking data stored in the memory 420, thereby performing overall monitoring of the smart phone. Optionally, the processor 480 may include one or more processing units; alternatively, the processor 480 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 480.
The smart phone also includes a power supply 490 (e.g., a battery) for powering the various components, optionally in logical communication with the processor 480 through a power management system that performs functions such as managing charge, discharge, and power consumption.
Although not shown, the smart phone may further include a camera, a bluetooth module, etc., which will not be described herein.
The steps performed by the terminal device in the above embodiments may be based on the terminal device structure shown in fig. 11.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored therein, which when run on a computer, causes the computer to perform the method as described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising a program which, when run on a computer, causes the computer to perform the method described in the previous embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A method for training an artificial intelligence model for a game, comprising:
acquiring sample data of an initial reinforcement learning model and a game character set, wherein the game character set comprises a newly added game character and a newly adjusted game character when a first game version is updated to a second game version;
training the initial reinforcement learning model according to the sample data to obtain a first game artificial intelligence AI model, wherein the first game AI model is a game AI model corresponding to the game role set, and the first game AI model is used for controlling each game role in the game role set;
and performing fusion training on the first game AI model and a second game AI model through a strategy distillation method to obtain a third game AI model, wherein the second game AI model is a game AI model corresponding to the first game version, the third game AI model is a game AI model corresponding to the second game version, the second game version is a game version obtained by updating the first game version, the second game AI model is used for controlling each game role corresponding to the first game version, and the third game AI model is used for controlling each game role corresponding to the second game version.
2. The method of claim 1, wherein the performing fusion training on the first game AI model and the second game AI model through the strategy distillation method to obtain the third game AI model comprises:
acquiring an initial game AI model, wherein the initial game AI model is an initialization model of a game AI model corresponding to the second game version;
acquiring a first training sample of the initial game AI model;
training the initial game AI model according to the first training sample, the first game AI model and the second game AI model to obtain a first intermediate game AI model;
acquiring a second training sample of the first mid-game AI model;
training the first intermediate game AI model according to the second training sample, the first game AI model and the second game AI model to obtain a second intermediate game AI model;
repeating the above operation until the third game AI model is obtained.
3. The method of claim 2, wherein the obtaining the first training sample of the initial game AI model comprises:
invoking the initial game AI model to control game roles to interact with a game environment to generate a sample set, wherein the game environment comprises a game scene and a game fight object;
Sampling from the sample set to obtain the first training sample.
4. The method of claim 2, wherein the obtaining the first training sample of the initial game AI model comprises:
invoking the first game AI model to control interaction between the game character and a game environment to generate first data, and invoking the second game AI model to control interaction between the game character and the game environment to generate second data, wherein the game environment comprises a game scene and a game fight object;
saving the first data and the second data as a sample set;
sampling from the sample set to obtain the first training sample.
5. The method of claim 2, wherein the training the initial game AI model to obtain a first intermediate game AI model based on the first training sample, the first game AI model, and a second game AI model comprises:
inputting the first training sample into the initial game AI model to obtain a first action probability distribution, and inputting the first training sample into the first game AI model or the second game AI model to obtain a second action probability distribution;
calculating a loss function according to the first action probability distribution and the second action probability distribution;
And adjusting parameters of the initial game AI model by using an optimization algorithm according to the loss function to obtain the first intermediate game AI model.
6. The method according to claim 5, wherein the loss function is a KL-divergence, a cross entropy loss function, a square error loss function, or a regression loss function;
the optimization algorithm is a batch gradient descent algorithm, a random gradient descent algorithm or an adaptive moment estimation algorithm.
7. The method according to any one of claims 1 to 6, further comprising:
obtaining a first evaluation result of the third game AI model according to a first evaluation system, wherein the first evaluation system is used for evaluating whether a game role controlled by the third game AI model meets a first behavior index;
when the first evaluation result indicates that the game character controlled by the third game AI model meets the first behavior index, confirming that the third game AI model is a trained game AI model;
and repeating the strategy distillation training process when the first evaluation result indicates that the game role controlled by the third game AI model does not meet the first behavior index.
8. The method of claim 7, wherein the obtaining the first evaluation result of the third game AI model according to the first evaluation system comprises:
invoking the third game AI model to control a first game character to interactively output a first action probability with a game environment, and invoking the first game AI model to control the first game character to interactively output a second action probability with the game environment, wherein the first game character is the game character controlled by the first game AI model;
invoking the third game AI model to control a second game character to interactively output a third action probability with a game environment, and invoking the second game AI model to control the second game character to interactively output a fourth action probability with the game environment, wherein the second game character is the game character controlled by the second game AI model;
calculating a first KL divergence according to the first action probability and the second action probability, and calculating a second KL divergence according to the third action probability and the fourth action probability;
and obtaining the first evaluation result according to the first KL divergence and the second KL divergence.
9. The method of claim 7, wherein the repeating the strategy distillation training process comprises:
Obtaining a third training sample of the third game AI model, wherein the third training sample comprises a sample of game characters which do not meet the behavior index and a sample of target actions;
and training the third game AI model according to the third training sample, the first game AI model and the second game AI model to obtain a target game AI model, wherein the target game AI model is a game AI model corresponding to the second game version.
10. The method of any of claims 1 to 6, wherein the training the initial reinforcement learning model from the sample data to obtain a first artificial intelligence game AI model comprises:
and training the initial reinforcement learning model through self-play by utilizing a neighbor strategy gradient algorithm according to the sample data to obtain the first game AI model, wherein the initial reinforcement learning model is of a deep neural network structure.
11. The method of claim 10, wherein after the training the initial reinforcement learning model through self-play by utilizing the neighbor strategy gradient algorithm according to the sample data to obtain the first game AI model, and before the training the first game AI model and the second game AI model through the strategy distillation method to obtain the third game AI model, the method further comprises:
Obtaining a second evaluation result of the first game AI model according to a second evaluation system, wherein the second evaluation system is used for evaluating whether the game role controlled by the first game AI model meets a second behavior index;
when the second evaluation result indicates that the game character controlled by the first game AI model meets the second behavior index, confirming that the first game AI model is a trained game AI model;
and repeating the reinforcement learning training process when the second evaluation result indicates that the game character controlled by the first game AI model does not meet the second behavior index.
12. The method of claim 11, wherein the second performance metrics include, but are not limited to, number of hits for each skill of the game character, hit rate, skill avoidance rate, distance moved, play object usage, custom behavior of the game character.
13. A game model training device, comprising:
the acquisition module is used for acquiring sample data of an initial reinforcement learning model and a game role set, wherein the game role set comprises a newly added game role and a newly adjusted game role when the first game version is updated to the second game version;
The training module is used for training the initial reinforcement learning model according to the sample data to obtain a first artificial intelligent game AI model, wherein the first game AI model is a game AI model corresponding to the game role set, and the first game AI model is used for controlling each game role in the game role set; and training the first game AI model and the second game AI model through strategy distillation to obtain a third game AI model, wherein the second game AI model is a game AI model corresponding to a first game version, the third game AI model is a game AI model corresponding to a second game version, the second game version is a game version updated by the first game version, the second game AI model is used for controlling each game role corresponding to the first game version, and the third game AI model is used for controlling each game role corresponding to the second game version.
14. A computer device, comprising: a memory, a processor, and a bus system;
wherein the memory is used for storing programs;
the processor being for executing a program in the memory, the processor being for executing the method of any one of claims 1 to 12 according to instructions in program code;
The bus system is used for connecting the memory and the processor so as to enable the memory and the processor to communicate.
15. A computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 12.
CN202211288937.4A 2022-10-20 2022-10-20 Training method, training device, training equipment and training storage medium for game artificial intelligent model Pending CN116999831A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211288937.4A CN116999831A (en) 2022-10-20 2022-10-20 Training method, training device, training equipment and training storage medium for game artificial intelligent model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211288937.4A CN116999831A (en) 2022-10-20 2022-10-20 Training method, training device, training equipment and training storage medium for game artificial intelligent model

Publications (1)

Publication Number Publication Date
CN116999831A true CN116999831A (en) 2023-11-07

Family

ID=88560676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211288937.4A Pending CN116999831A (en) 2022-10-20 2022-10-20 Training method, training device, training equipment and training storage medium for game artificial intelligent model

Country Status (1)

Country Link
CN (1) CN116999831A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118051782A (en) * 2024-04-16 2024-05-17 腾讯科技(深圳)有限公司 Model training method, business processing method and related device

Similar Documents

Publication Publication Date Title
CN109107161B (en) Game object control method, device, medium and equipment
CN111773696B (en) Virtual object display method, related device and storage medium
CN111282279B (en) Model training method, and object control method and device based on interactive application
US10729979B2 (en) Automated tuning of computer-implemented games
CN109893857B (en) Operation information prediction method, model training method and related device
CN108434740B (en) Method and device for determining policy information and storage medium
US9283485B2 (en) Game control device, game control method, program, and game system
CN113018848B (en) Game picture display method, related device, equipment and storage medium
CN111985640A (en) Model training method based on reinforcement learning and related device
CN110019840B (en) Method, device and server for updating entities in knowledge graph
CN108379834B (en) Information processing method and related equipment
CN111368171B (en) Keyword recommendation method, related device and storage medium
CN110841295B (en) Data processing method based on artificial intelligence and related device
US20140274410A1 (en) Smart ping system
CN111598169A (en) Model training method, game testing method, simulation operation method and simulation operation device
US8795080B1 (en) Gaming platform providing a game action sweep mechanic
WO2023024762A1 (en) Artificial intelligence object control method and apparatus, device, and storage medium
CN106445710A (en) Method for determining interactive type object and equipment thereof
CN108815850B (en) Method and client for controlling path finding of analog object
CN111803961B (en) Virtual article recommendation method and related device
CN116999831A (en) Training method, training device, training equipment and training storage medium for game artificial intelligent model
CN116943220A (en) Game artificial intelligence control method, device, equipment and storage medium
CN113599825A (en) Method and related device for updating virtual resources in game match
CN117205576A (en) Training method, device, equipment and storage medium for artificial intelligent model
CN116943146A (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination