CN115496191B - Model training method and related device

Model training method and related device

Info

Publication number
CN115496191B
Authority
CN
China
Prior art keywords
model
data sequence
game
training
game state
Prior art date
Legal status
Active
Application number
CN202211391056.5A
Other languages
Chinese (zh)
Other versions
CN115496191A (en)
Inventor
姜允执
黄新昊
万乐
徐志鹏
顾子卉
谢宇轩
刘林韬
郑规
殷俊
邓大付
欧阳卓能
金鼎健
廖明翔
刘总波
梁宇宁
官冰权
杨益浩
申家忠
刘思亮
高丽娜
漆舒汉
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211391056.5A
Publication of CN115496191A
Application granted
Publication of CN115496191B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/55 Controlling game characters or game objects based on the game progress

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Neurology (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a model training method and a related device in the field of artificial intelligence. The method comprises the following steps: acquiring skill sample data, wherein the skill sample data comprises a game state data sequence and an operation data sequence that have a corresponding relation, and both sequences correspond to a target frame length; jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm, wherein the variational self-encoder comprises an encoder and a decoder, the encoder is used for mapping the operation data sequence into a skill vector, the decoder is used for reconstructing the operation data sequence according to the skill vector, and the prior strategy model is used for determining a skill vector according to the game state data sequence; and training, by adopting a reinforcement learning algorithm, a game AI model constructed according to the prior strategy model and the decoder. The method can reduce the training data required for training the game AI model and the manpower required for that training.

Description

Model training method and related device
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a model training method and a related device.
Background
Game artificial intelligence (game AI) is a technology for automatically and intelligently controlling virtual characters in a game. When a virtual character in the game is controlled based on the game AI, the action to be performed by the virtual character can be decided according to the current game state, and the virtual character is then controlled to perform that action.
In the related art, a pre-trained model can be used to decide the actions to be performed by the virtual character, and the model can be obtained through supervised learning or reinforcement learning. However, training the model through supervised learning usually relies on a large amount of training data, and if the amount of training data is insufficient, the performance of the trained model suffers; training the model through reinforcement learning requires considerable manpower to repeatedly fine-tune the reward function used during training in order to ensure that the trained model behaves in a human-like way.
In summary, how to reduce the training data required for training a game AI model and the manpower invested in training it, while still ensuring that the trained model performs well, has become an urgent problem to be solved.
Disclosure of Invention
The embodiment of the application provides a model training method and a related device, which can reduce training data required by model training and reduce manpower required by model training under the condition of ensuring that a trained game AI model has better performance.
In view of the above, a first aspect of the present application provides a model training method, including:
acquiring skill sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have a corresponding relation, and the game state data sequence and the operation data sequence both correspond to a target frame length;
jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm; the variational self-encoder comprises an encoder for mapping the operation data sequence to a skill vector and a decoder for reconstructing the operation data sequence from the skill vector; the prior strategy model is used for determining a skill vector according to the game state data sequence;
training a game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the prior strategy model and the decoder.
A second aspect of the present application provides a model training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring skill sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have a corresponding relation, and the game state data sequence and the operation data sequence both correspond to a target frame length;
the supervised learning module is used for jointly training the variational self-encoder and the prior strategy model according to the skill sample data by adopting a supervised learning algorithm; the variational self-encoder comprises an encoder for mapping the operation data sequence to a skill vector and a decoder for reconstructing the operation data sequence from the skill vector; the prior strategy model is used for determining a skill vector according to the game state data sequence;
the reinforcement learning module is used for training the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the prior strategy model and the decoder.
A third aspect of the application provides a computer apparatus comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is adapted to perform the steps of the model training method according to the first aspect according to the computer program.
A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for performing the steps of the model training method of the first aspect described above.
A fifth aspect of the present application provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read by a processor of the computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of the model training method according to the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a model training method which innovatively proposes the concept of a "skill", namely the representation of an operation data sequence of a specific length in a low-dimensional vector space. By introducing the concept of a "skill", the embodiment of the present application splits the strategy that a conventional game AI model needs to learn (the mapping of game state data to operation data) into an upper-layer strategy and a lower-layer strategy, wherein the upper-layer strategy is the mapping of a game state data sequence to a skill and the lower-layer strategy is the mapping of a skill to an operation data sequence. Learning of the lower-layer strategy is realized by adopting a supervised learning algorithm to train a variational self-encoder according to the skill sample data; that is, a supervised learning algorithm is adopted to train the encoder and the decoder in the variational self-encoder based on the operation data sequences in the skill sample data, wherein the encoder is used for mapping an operation data sequence into a skill vector and the decoder is used for reconstructing the operation data sequence according to the skill vector. Because the mapping between the skill vector and the operation data sequence does not involve the game state data space, the complexity of this training task is greatly reduced, and a variational self-encoder with good performance can therefore be trained with only a small amount of skill sample data. Learning of the upper-layer strategy is realized by training a prior strategy model with a combination of a supervised learning algorithm and a reinforcement learning algorithm: first, a supervised learning algorithm is adopted to train the prior strategy model based on the game state data sequences in the skill sample data and the output of the encoder in the variational self-encoder, and then a reinforcement learning algorithm is adopted to train a game AI model comprising the prior strategy model and the decoder of the variational self-encoder. Because the prior strategy model already has a certain skill encoding capability after supervised learning, applying reinforcement learning to the game AI model that contains it reduces the training time of the reinforcement learning stage to a certain extent, makes adjustment of the reward function simpler, reduces the manpower required for tuning the reward function during reinforcement learning, and gives the trained game AI model better human-likeness.
Drawings
FIG. 1 is a schematic view of an application scenario of a model training method provided in an embodiment of the present application;
FIG. 2 is a schematic flowchart of a model training method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of jointly training a variational self-encoder and a prior strategy model provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of training a residual model in the reinforcement learning stage provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of training a game AI model in the reinforcement learning stage provided in an embodiment of the present application;
FIG. 6 is another schematic diagram of training a game AI model in the reinforcement learning stage provided in an embodiment of the present application;
FIG. 7 is a schematic interface diagram of an FPS game provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a server provided in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer simulates or realizes human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
The scheme provided by the embodiment of the application relates to a machine learning technology in an artificial intelligence technology, and is specifically explained by the following embodiment:
the game AI is used as a part of the electronic game, can control a non-player character (NPC) in the game to enrich the game experience of the player, can also be used for filling the position of an opponent or a teammate to reduce the matching time of an online battle game, and can also be used as a game testing means to help a developer debug parameters of relevant settings in the game and verify the rationality of the relevant settings. It can be seen that designing a more intelligent and comprehensive game AI is an important link in game production.
In the related art, a scheme for training the game AI model generally needs to rely on a large amount of training data, or needs to invest a large amount of manpower to adjust a reward function in the model training process, and the training cost is generally high. The embodiment of the application provides a model training method for training a game AI model, which aims to reduce the training cost of the game AI model, reduce training data used for training the game AI model and manpower input in the process of training the game AI model, and ensure that the trained game AI model has better performance.
It should be noted that the model training method provided by the embodiment of the present application may be executed by a computer device, and the computer device may be a terminal device or a server. Terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart household appliances, vehicle-mounted terminals, aircraft, and the like. The server may be an independent physical server, a server cluster or distributed system composed of a plurality of physical servers, or a cloud server. In addition, the data involved in the embodiments of the present application may be stored in a blockchain network.
In order to facilitate understanding of the model training method provided in the embodiment of the present application, an application scenario of the model training method is exemplarily described below by taking an execution subject of the model training method as a server as an example.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a model training method provided in an embodiment of the present application. As shown in fig. 1, the application scenario includes a server 110 and a database 120; the server 110 may access the database 120 via a network, or the database 120 may be integrated in the server 110. The server 110 is configured to execute the model training method provided in the embodiment of the present application to train a game AI model with better performance; the database 120 is used to store data used in training the game AI model.
In practical applications, the server 110 may retrieve several pieces of skill sample data from the database 120, where each piece of skill sample data includes a game state data sequence and an operation data sequence having a corresponding relationship, both sequences correspond to a target frame length, and the target frame length is the preset length of the data sequences on which skill-vector learning is based.
Server 110 may then jointly train the variational self-encoder and the prior strategy model based on the acquired skill sample data using a supervised learning algorithm. The variational self-encoder comprises an encoder and a decoder, wherein the encoder is used for mapping an input operation data sequence into a skill vector, and the decoder is used for reconstructing the operation data sequence according to the skill vector output by the encoder; the prior strategy model is used to determine skill vectors from the game state data sequence.
It should be noted that the embodiment of the present application proposes the concept of a "skill" based on operation data sequences of the target frame length, where a "skill" refers to the representation of an operation data sequence of the target frame length in a low-dimensional feature space; in the embodiment of the present application a "skill" is concretely represented as a skill vector. By introducing the concept of a "skill", the embodiment of the present application splits the strategy required to be learned by a traditional game AI model (the mapping of game state data to operation data) into an upper-layer strategy and a lower-layer strategy, wherein the upper-layer strategy is the mapping of a game state data sequence to a skill vector, and the lower-layer strategy is the mapping of a skill vector to an operation data sequence. The embodiment of the application realizes the learning of the lower-layer strategy by training the variational self-encoder in a supervised manner, so that the decoder in the variational self-encoder learns the ability to map a skill vector into an operation data sequence; because this mapping does not involve the game state data space, the difficulty of the training task is greatly reduced, so the training of the variational self-encoder can be completed with only a small amount of skill sample data. The embodiment of the application realizes the learning of the upper-layer strategy by training the prior strategy model in a supervised manner, so that the prior strategy model has the ability to map a game state data sequence into a skill vector; however, because the skill sample data used when training the prior strategy model may be insufficient, the performance of the prior strategy model may not meet practical application requirements, and further reinforcement learning of the prior strategy model is therefore needed.
That is, the server 110 may construct the game AI model by using the prior strategy model obtained through supervised learning and the decoder in the variational self-encoder, and then train the game AI model by using a reinforcement learning algorithm to further fine-tune the prior strategy model within it, thereby improving the performance of the prior strategy model and enabling the game AI model as a whole to meet practical application requirements. Because the prior strategy model already has a certain skill encoding capability after supervised learning, a large amount of manpower is not required to adjust the reward function in the reinforcement learning stage, and the trained game AI model can still exhibit good human-likeness.
It should be understood that the application scenario shown in fig. 1 is only an example, and in practical applications, the model training method provided by the embodiment of the present application may also be applied to other scenarios, for example, the server 110 may obtain the skill sample data from other channels, and for example, the model training method may be executed by the terminal device to train the game AI model, and so on. The application scenario to which the model training method provided in the embodiment of the present application is applicable is not limited at all.
The model training method provided by the present application is described in detail below by way of method embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart of a model training method provided in the embodiment of the present application. For convenience of description, the following embodiments are still described by taking the server as the execution subject of the model training method. As shown in fig. 2, the model training method includes the following steps:
step 201: acquiring skill sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have corresponding relations, and the game state data sequence and the operation data sequence correspond to the length of a target frame.
In the embodiment of the present application, before the server trains the game AI model, the server needs to acquire skill sample data for training the game AI model.
It should be noted that the game AI model is a neural network model for deciding the actions that a virtual character needs to perform in a game; in general, the game AI model can decide the action to be performed by the virtual character, i.e. the operation data for controlling the virtual character, according to the game state data in the game match. It should be understood that the virtual character controlled based on the game AI model may be an NPC in the game, a teammate character (i.e., a character in the same camp as the virtual character controlled by the real player), or an opponent character (i.e., a character in a different camp from the virtual character controlled by the real player); the embodiment of the present application does not limit the type of virtual character controlled by the game AI model.
It should be noted that the skill sample data is the training sample data used in the supervised learning stage of the embodiment of the present application. Each piece of skill sample data includes a game state data sequence and an operation data sequence having a corresponding relationship, and both sequences correspond to a target frame length, i.e. a preset target number of frames (e.g. 5 frames, 10 frames, etc.). That is, the game state data sequence in the skill sample data includes the sequentially arranged game state data of the target number of frames, and the operation data sequence in the skill sample data includes the sequentially arranged operation data of the target number of frames. For a game state data sequence and an operation data sequence belonging to the same piece of skill sample data, the game state data in the former and the operation data in the latter have a one-to-one correspondence. For example, if the target frame length is 5 frames, a piece of skill sample data may include a game state data sequence composed of the game state data of the first to fifth frames and an operation data sequence composed of the operation data of the first to fifth frames.
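For ease of understanding only, the following minimal Python sketch shows one possible in-memory representation of a piece of skill sample data; the class name, field names and example values are illustrative assumptions and are not details specified by the embodiment.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SkillSample:
    """One piece of skill sample data: two frame-aligned sequences of target frame length."""
    game_states: List[Dict]  # one game state data entry per frame (position, life value, ...)
    operations: List[Dict]   # one operation data entry per frame (movement, skill release, ...)

    def __post_init__(self) -> None:
        # the two sequences must cover the same frames, hence the same target frame length
        assert len(self.game_states) == len(self.operations)


# target frame length of 5: states of frames 1-5 paired with operations of frames 1-5
sample = SkillSample(
    game_states=[{"life_value": 100, "position": (0.0, 0.0)}] * 5,
    operations=[{"move": "forward", "fire": False}] * 5,
)
```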
In the embodiment of the present application, the game state data is data for describing the game play state, and may include, for example, any one or more of the following: position data of a virtual character, posture data of a virtual character, the moving speed of a virtual character, the life value of a virtual character, the energy value of a virtual character, the virtual prop used by a virtual character, the virtual material data included in a virtual character's backpack, and state data of a virtual character (such as being in a firing state, an aiming state, a moving state, and the like). The virtual characters here may include the virtual characters in the game play or the virtual characters within the field of view of the virtual character operated by the player. Of course, in practical applications, the game state data may also include other types of data, and the data included in the game state data is not limited in any way herein.
In the embodiment of the present application, the operation data is data for describing the manipulation operations triggered for a target virtual character, where the target virtual character is the virtual character whose behavior the game AI model is to learn; that is, the game AI model to be trained needs to learn the behavior of the target virtual character, which may be, for example, a virtual character operated by a specific player in a game session. The operation data may include, for example, any one or more of the following: movement control data, skill release control data, aiming control data, virtual item use control data, virtual item switching control data, and the like.
In one possible implementation, the server may obtain the skill sample data as follows: acquiring game instance data, wherein the game instance data comprises an original game state data sequence and an original operation data sequence generated in a training game match; then, segmenting the original game state data sequence and the original operation data sequence in the game instance data according to the target frame length to obtain game state data sequences and operation data sequences with corresponding relationships, and composing skill sample data from them.
It should be noted that the game instance data is generated by a human expert or a reference strategy performing game tasks in a machine learning task environment, and comprises an original game state data sequence and an original operation data sequence generated in a training game match; the original game state data sequence and the original operation data sequence may correspond to a complete training game match, that is, the original game state data sequence includes the game state data of each frame in the training game match, and the original operation data sequence includes the operation data for the target virtual character in each frame of the training game match.
For example, the server may obtain the game instance data from a database, from a terminal device, or from a background server of a game. Then, the server may segment the original game state data sequence and the original operation data sequence in the obtained game instance data according to the target frame length, thereby obtaining a plurality of game state data sequences and a plurality of operation data sequences each corresponding to the target frame length. Furthermore, the server can compose skill sample data from a game state data sequence and an operation data sequence having a corresponding relationship; here, a game state data sequence and an operation data sequence having a corresponding relationship can be understood as a game state data sequence and an operation data sequence covering the same frame range.
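The segmentation step described above can be illustrated with the following Python sketch; the function name, the plain-list representation of the raw sequences, and the choice to drop leftover frames shorter than the target length are assumptions made for brevity.

```python
from typing import List, Sequence, Tuple


def segment_game_instance(
    raw_states: Sequence, raw_operations: Sequence, target_frame_length: int
) -> List[Tuple[Sequence, Sequence]]:
    """Cut full-match state/operation sequences into frame-aligned chunks of target_frame_length frames."""
    assert len(raw_states) == len(raw_operations)
    samples = []
    for start in range(0, len(raw_states) - target_frame_length + 1, target_frame_length):
        end = start + target_frame_length
        # both chunks cover the same frame range, so together they form one piece of skill sample data
        samples.append((raw_states[start:end], raw_operations[start:end]))
    return samples
```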
In addition, in practical applications, the server can select corresponding game instance data to construct skill sample data according to the training requirements of the game AI model.
As an example, in an early development stage of a game, such as during development or shortly after launch, there is usually only a small amount of internal playtest data and closed beta data; at this time, these data may be used as the game instance data for constructing skill sample data, and a game AI model can be trained on the skill sample data so constructed. It should be appreciated that game instance data at this stage generally carries no specific stylistic characteristics, and accordingly the operations decided by a game AI model trained on such data also carry no specific stylistic characteristics; such a model can serve as a base model for subsequent game AI development and help the developer perform a cold start of the related AI models.
As another example, in a stable operation phase of a game, the embodiment of the present application may train a game AI model with a specific style for stylized requirements of an online game AI model.
For example, when training such a game AI model, the game instance data obtained by the server may include an original game state data sequence and an original operation data sequence having a target style; accordingly, a game AI model trained based on such game instance data can be used to instruct the virtual character to perform actions that conform to the target style. That is, the server may obtain game instance data having a target style, which may be, for example, a tendency toward rapid attack, a tendency toward conservative attack, a tendency toward self-defense, and the like, and train a game AI model having the target style based on that data; when a virtual character is manipulated using this game AI model, the virtual character tends to be controlled to perform actions in accordance with the target style.
For another example, when training such a game AI model, the game instance data acquired by the server may include an original game state data sequence and an original operation data sequence generated by the target player; accordingly, a game AI model trained based on such game instance data is used to instruct the virtual character to perform an action that conforms to the game style of the target player. That is, the server may acquire game instance data having a strong personal style, that is, an original operation data sequence generated when a specific target player participates in a game, and an original game state data sequence of the game as game instance data; further, based on game instance data having a strong personal style, a game AI model having the personal style is trained, and when a virtual character is manipulated using the game AI model, the virtual character tends to be controlled to perform an action in accordance with the game style of the target player.
It should be noted that, in the embodiment of the present application, before the server obtains the operation data generated by the player, it needs to obtain an authorization permission of the player. Namely, the server can send an authorization notification message to the player, and the player is requested to grant the server with the authority to acquire the operation data through the authorization notification message; if the permission that the player obtains the operation data of the player through the authorization notification message is detected, the server can legally obtain the operation data generated by the player in the game process; on the contrary, if the authority of the player for obtaining the operation data is not detected through the authorization notification message, the server does not have the authority to obtain the operation data generated by the player in the game process.
It should be understood that, in practical applications, the server may also obtain other types of game instance data to train the game AI model, and the application does not limit the type of the obtained game instance data in any way.
It should be understood that, in practical applications, besides obtaining game instance data first and then generating skill sample data from it as described above, the server may obtain the skill sample data in other ways; for example, the server may directly obtain previously constructed skill sample data, or the server may sample game state data sequences and operation data sequences of the target frame length during game matches to construct skill sample data, and so on.
Step 202: jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm; the variational self-encoder comprises an encoder for mapping the operation data sequence to a skill vector and a decoder for reconstructing the operation data sequence from the skill vector; the prior strategy model is used to determine skill vectors from the game state data sequence.
After the server acquires the skill sample data, a supervised learning algorithm can be adopted to jointly train the variational self-encoder and the prior strategy model according to the acquired skill sample data. The supervised learning algorithm is a method for training a model by using training sample data and enabling the model to have certain capacity, and requires the trained model to imitate and fit an expert strategy from the training sample data and output data consistent with label data in the training sample data as much as possible; the supervised learning algorithm can adjust the capability of the trained model by adjusting training sample data, does not need to interact with the environment, and can be completely trained offline.
It should be noted that the joint training in the embodiment of the present application refers to that, in the process of training the variational autocoder and the prior strategy model, an intermediate processing result generated by the variational autocoder is used to assist in training the prior strategy model, that is, the training of the prior strategy model depends on the training of the variational autocoder.
It should be noted that a variational self-encoder (variational auto-encoder, VAE) is used to compress features or data with a complex expression form; it can embed a high-dimensional data set into a low-dimensional space, reducing data complexity for subsequent processing and training. The variational self-encoder is also a generative model, which generates data that is not present in the training sample data set but is similar to it by sampling in the low-dimensional space. In the embodiment of the present application, the variational self-encoder includes an encoder and a decoder; the encoder is used for embedding the high-dimensional data set into the low-dimensional space, namely mapping a complex operation data sequence into a lower-dimensional skill vector; the decoder is used for sampling in the low-dimensional space to generate reconstructed data, i.e. for decoding the lower-dimensional skill vector into an operation data sequence.
It should be noted that the prior strategy model is a neural network model, proposed in the embodiment of the present application, for determining a skill vector according to the game state data sequence; that is, the prior strategy model is used for learning the mapping from a game state data sequence to a skill vector. The prior strategy model may include, for example, a fully-connected network with a preset number of layers, and the model structure of the prior strategy model is not limited herein.
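For concreteness, the following PyTorch-style sketch shows one possible shape of the three networks involved (the encoder and decoder of the variational self-encoder, and the prior strategy model); all class names, layer sizes, the use of Gaussian mean/log-variance outputs, and the flattening of sequences into single vectors are illustrative assumptions rather than details specified by the embodiment.

```python
import torch.nn as nn


class Encoder(nn.Module):
    """Maps a flattened operation data sequence to the mean/log-variance of a skill vector."""
    def __init__(self, op_dim: int, skill_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(op_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, skill_dim)
        self.logvar = nn.Linear(hidden, skill_dim)

    def forward(self, op_seq):
        h = self.net(op_seq)
        return self.mu(h), self.logvar(h)


class Decoder(nn.Module):
    """Reconstructs an operation data sequence from a skill vector."""
    def __init__(self, skill_dim: int, op_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(skill_dim, hidden), nn.ReLU(), nn.Linear(hidden, op_dim))

    def forward(self, skill):
        return self.net(skill)


class PriorPolicy(nn.Module):
    """Predicts a skill vector distribution from a flattened game state data sequence."""
    def __init__(self, state_dim: int, skill_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, skill_dim)
        self.logvar = nn.Linear(hidden, skill_dim)

    def forward(self, state_seq):
        h = self.net(state_seq)
        return self.mu(h), self.logvar(h)
```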
Fig. 3 is a schematic diagram of jointly training a variational self-encoder and a prior strategy model according to an embodiment of the present application. As shown in fig. 3, when training the variational self-encoder, the server may input the operation data sequence in the skill sample data into the variational self-encoder; the encoder in the variational self-encoder obtains the corresponding skill vector by processing the operation data sequence, and the decoder in the variational self-encoder then reconstructs the operation data sequence according to the skill vector output by the encoder; the variational self-encoder can then be trained according to the difference between the reconstructed operation data sequence and the operation data sequence originally input into the variational self-encoder. When training the prior strategy model, the server can input the game state data sequence in the skill sample data into the prior strategy model, and the prior strategy model obtains the corresponding skill vector by processing the game state data sequence; the prior strategy model can then be trained according to the difference between the skill vector output by the prior strategy model and the skill vector determined by the encoder in the variational self-encoder from the operation data sequence in the skill sample data.
In one possible implementation, the server may train the variational self-encoder and the prior strategy model simultaneously. Specifically, the server may determine, by the encoder in the variational self-encoder, a first skill vector according to the operation data sequence in a piece of skill sample data, and determine, by the decoder in the variational self-encoder, a reconstructed operation data sequence according to the first skill vector; and determine a second skill vector from the game state data sequence in the skill sample data through the prior strategy model. Further, the variational self-encoder and the prior strategy model are trained simultaneously based on the difference between the reconstructed operation data sequence and the operation data sequence and the difference between the second skill vector and the first skill vector.
For example, the server may input the operation data sequence and the game state data sequence in a piece of skill sample data into the variational self-encoder and the prior strategy model to be trained, respectively. The encoder in the variational self-encoder processes the input operation data sequence and maps it to a corresponding first skill vector, and the decoder in the variational self-encoder processes the first skill vector determined by the encoder to obtain a reconstructed operation data sequence, which can be understood as an operation data sequence regenerated by the decoder according to the skill vector. The prior strategy model processes the input game state data sequence to generate a second skill vector.
Further, the server may train the variational self-encoder using the difference between the reconstructed operation data sequence and the operation data sequence, and train the prior strategy model using the difference between the second skill vector and the first skill vector. In this implementation, the server may train the variational self-encoder and the prior strategy model synchronously based on these two differences; that is, the server may construct a comprehensive loss function based on the difference between the reconstructed operation data sequence and the operation data sequence and the difference between the second skill vector and the first skill vector, and use it to train the variational self-encoder and the prior strategy model synchronously.
As an example, the server may construct a first loss function based on the difference between the reconstructed operation data sequence and the operation data sequence, and construct a second loss function based on the difference between the second skill vector and the first skill vector; then, determine a comprehensive loss function according to the first loss function and the second loss function; and further adjust the model parameters of the variational self-encoder and the model parameters of the prior strategy model based on the comprehensive loss function.
Specifically, the server may determine a reconstruction error according to the difference between the reconstructed operation data sequence and the operation data sequence, and determine the first loss function based on the reconstruction error; meanwhile, the server may determine the KL (Kullback-Leibler) divergence between the second skill vector and the first skill vector, and determine the second loss function based on the KL divergence. The KL divergence is a metric used to measure the similarity of two probability distributions; in the embodiment of the present application, the KL divergence between the first and second skill vectors reflects the similarity between the first and second skill vectors.
Then, the server may add the first loss function and the second loss function to obtain a comprehensive loss function; or, the server may perform weighted addition on the first loss function and the second loss function according to a preset weight to obtain a comprehensive loss function. Furthermore, the server can simultaneously adjust the model parameters of the variational self-encoder and the model parameters of the prior strategy model based on the gradient descent algorithm with the aim of minimizing the comprehensive loss function until the variational self-encoder and the prior strategy model both meet the training end condition.
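Written out, the comprehensive loss function described above may take a form such as the following, where the squared-error reconstruction term, the direction of the KL divergence and the weights $w_1$, $w_2$ are illustrative assumptions (the embodiment only specifies a sum or weighted sum of the first and second loss functions):

$$\mathcal{L} = w_1\,\big\lVert \hat{a}_{1:T} - a_{1:T} \big\rVert^2 + w_2\, D_{\mathrm{KL}}\big(p_{\mathrm{prior}}(z \mid s_{1:T}) \,\big\Vert\, q_{\mathrm{enc}}(z \mid a_{1:T})\big)$$

Here $a_{1:T}$ is the operation data sequence of target frame length $T$, $\hat{a}_{1:T}$ is the reconstructed operation data sequence, $s_{1:T}$ is the game state data sequence, $q_{\mathrm{enc}}$ denotes the skill-vector distribution produced by the encoder (the first skill vector), and $p_{\mathrm{prior}}$ denotes the skill-vector distribution produced by the prior strategy model (the second skill vector).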
It should be understood that the above-mentioned training end condition is a condition for deciding whether to end the model training process. The training end condition may be, for example, that the number of training iterations of the model reaches a preset threshold; or that the performance of the trained model meets a preset requirement (e.g. the difference between the reconstructed operation data sequence output by the variational self-encoder and the input operation data sequence is smaller than a preset threshold, or the difference between the skill vector output by the prior strategy model and the skill vector output by the encoder in the variational self-encoder is smaller than a preset threshold); or that the performance of the trained model no longer improves significantly as training continues. The present application does not limit the training end condition in any way.
By training the variational self-encoder and the prior strategy model synchronously, the embodiment of the application can improve the model training efficiency in the supervised learning stage and shorten the model training time; experiments show that, compared with training the variational self-encoder and the prior strategy model separately, training them synchronously can reduce the training time by 25 percent.
In another possible implementation, the server may first train the variational self-encoder, and train the prior strategy model after the training of the variational self-encoder is completed. Specifically, the server may first determine, through the variational self-encoder to be trained, a reconstructed operation data sequence according to the operation data sequence in a piece of skill sample data, and train the variational self-encoder according to the difference between the reconstructed operation data sequence and the operation data sequence; after the variational self-encoder satisfies the training end condition, the server can determine, through the prior strategy model to be trained, a third skill vector according to the game state data sequence in a piece of skill sample data, and train the prior strategy model according to the difference between the third skill vector and a fourth skill vector; here, the fourth skill vector is determined, from the operation data sequence in that skill sample data, by the encoder in the variational self-encoder that satisfies the training end condition.
For example, the server may first train the variational self-encoder. Specifically, the operation data sequence in the skill sample data can be input into the variational self-encoder; the encoder in the variational self-encoder outputs a corresponding skill vector by processing the input operation data sequence, and the decoder in the variational self-encoder generates a reconstructed operation data sequence by processing the skill vector output by the encoder. The server may then determine a reconstruction error from the reconstructed operation data sequence and the operation data sequence, construct a loss function based on the reconstruction error, and adjust the model parameters of the variational self-encoder with the goal of minimizing the loss function. This operation is repeated iteratively based on the operation data sequence in each acquired piece of skill sample data until the variational self-encoder satisfies the training end condition; the training end condition here may be, for example, that the number of training iterations of the variational self-encoder reaches a preset threshold, that the performance of the variational self-encoder meets a preset performance requirement, or that the performance of the variational self-encoder no longer improves significantly as training continues.
After the training of the variational self-encoder is completed, the server may further train the prior strategy model. Specifically, the game state data sequence in the skill sample data can be input into the prior strategy model, and the prior strategy model outputs a corresponding third skill vector by processing the input game state data sequence; meanwhile, the server can input the operation data sequence in the skill sample data into the variational self-encoder that satisfies the training end condition, and obtain the fourth skill vector generated by the encoder in the variational self-encoder from that operation data sequence. The server may then determine the KL divergence between the third and fourth skill vectors, which reflects the similarity between them, construct a loss function based on the KL divergence, and adjust the model parameters of the prior strategy model with the goal of minimizing that loss function. This operation is repeated iteratively based on the game state data sequence in each acquired piece of skill sample data until the prior strategy model satisfies the training end condition; the training end condition here may be, for example, that the number of training iterations of the prior strategy model reaches a preset threshold, that the performance of the prior strategy model meets a preset performance requirement, or that the performance of the prior strategy model no longer improves significantly as training continues.
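A compressed PyTorch-style sketch of this sequential variant is given below, reusing the hypothetical Encoder, Decoder and PriorPolicy modules from the earlier sketch; the reparameterisation step, the closed-form Gaussian KL, and the optimiser handling are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def reparameterize(mu, logvar):
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)


def gaussian_kl(mu_a, logvar_a, mu_b, logvar_b):
    # closed-form KL( N(mu_a, var_a) || N(mu_b, var_b) ), summed over skill dimensions
    return 0.5 * (logvar_b - logvar_a
                  + (logvar_a.exp() + (mu_a - mu_b) ** 2) / logvar_b.exp() - 1).sum(-1).mean()


# Stage 1: train the variational self-encoder on operation data sequences only.
def vae_step(encoder, decoder, op_seq, optimizer):
    mu, logvar = encoder(op_seq)
    recon = decoder(reparameterize(mu, logvar))
    loss = F.mse_loss(recon, op_seq)  # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Stage 2: freeze the trained encoder and fit the prior strategy model to its skill vectors.
def prior_step(encoder, prior, state_seq, op_seq, optimizer):
    with torch.no_grad():  # the encoder output is the supervision target (fourth skill vector)
        mu_t, logvar_t = encoder(op_seq)
    mu_p, logvar_p = prior(state_seq)
    loss = gaussian_kl(mu_p, logvar_p, mu_t, logvar_t)  # KL(prior output || encoder target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```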
In this embodiment, the variational self-encoder and the prior strategy model are trained separately: the variational self-encoder is trained first, and the prior strategy model is trained after the variational self-encoder has been trained. This ensures the reliability and stability of model training, so that the trained variational self-encoder and prior strategy model have better performance.
Step 203: training a game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the prior strategy model and the decoder.
After the server completes the training of the variational self-encoder and the prior strategy model in the supervised learning stage, it obtains a prior strategy model for determining a skill vector according to a game state data sequence and a decoder, from the variational self-encoder, for reconstructing an operation data sequence according to a skill vector. On this basis, the server can use the prior strategy model and the decoder to construct a game AI model for determining an operation data sequence according to a game state data sequence.
Considering that the amount of skill sample data used in the supervised learning stage may not be sufficient, it is difficult to guarantee that the strength of the prior strategy model trained in that stage meets actual game requirements; that is, because the prior strategy model involves the mapping from a game state data sequence to a skill vector and the game state data space is extremely complex, a small amount of skill sample data alone cannot guarantee sufficient strength. Therefore, in the embodiment of the present application, the server needs to use a reinforcement learning algorithm to further train the game AI model constructed based on the prior strategy model and the decoder, mainly in order to improve the strength of the prior strategy model. The reinforcement learning algorithm is used to describe and solve the problem of an agent learning a strategy to maximize return or achieve a specific goal while interacting with the environment; its advantage is that high-strength AI models with diverse behaviors can be obtained without expert knowledge or heuristic rules. The core idea of the reinforcement learning algorithm is that at each time step the AI model obtains the state from the environment and takes an action, and the environment then gives corresponding feedback according to the action taken by the AI model; in the process of continuously interacting with the environment and receiving feedback, the AI model can learn an optimal behavior decision-making capability.
In a possible implementation, the server may additionally establish, in the game AI model, a residual model independent of the prior strategy model, and in the reinforcement learning stage no gradient update is performed on the prior strategy model; only the residual model is learned. That is, the server can train the residual model in the game AI model by adopting a reinforcement learning algorithm; the game AI model at this point includes the prior strategy model, the residual model for determining a correction amount for the skill vector output by the prior strategy model, and the decoder.
That is, in this implementation, the server fixes the model parameters of the prior strategy model and the decoder in the game AI model during the reinforcement learning stage, and only adjusts the model parameters of the residual model in the game AI model by gradient descent. Adjusting only the residual model in the reinforcement learning stage, on one hand, preserves the style characteristics that the prior strategy model learned in the supervised learning stage, preventing an operation style learned from stylized training sample data from being weakened or erased during reinforcement learning; on the other hand, it ensures that the finally trained game AI model has better stability.
As an example, the residual model in this embodiment may include a plurality of parallel initial fully-connected networks and an integrated fully-connected network, where different initial fully-connected networks are used to process different types of data in the game state data sequence, and the integrated fully-connected network is used to further process the outputs of the initial fully-connected networks to obtain the correction amount for the skill vector output by the prior strategy model. Of course, the residual model may also take other model structures, and the present application does not limit the model structure of the residual model.
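One possible reading of this residual model structure is sketched below in PyTorch; the number of branches, the way the game state data is split by type, and all layer sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ResidualModel(nn.Module):
    """Parallel fully-connected branches per type of game state data, followed by an
    integration network that outputs a correction amount for the prior strategy model's skill vector."""
    def __init__(self, branch_dims, skill_dim: int, hidden: int = 128):
        super().__init__()
        # one initial fully-connected branch per type of game state data
        self.branches = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU()) for d in branch_dims]
        )
        # the integrated fully-connected network fuses the branch outputs into a correction amount
        self.integrate = nn.Sequential(
            nn.Linear(hidden * len(branch_dims), hidden), nn.ReLU(), nn.Linear(hidden, skill_dim)
        )

    def forward(self, state_parts):
        feats = [branch(x) for branch, x in zip(self.branches, state_parts)]
        return self.integrate(torch.cat(feats, dim=-1))
```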
Fig. 4 is a schematic flowchart of training the residual model in the reinforcement learning stage according to an embodiment of the present application. As shown in fig. 4, the training process for the residual model specifically includes the following steps:
step 401: a training game state data sequence in a training game environment is obtained.
In the reinforcement learning phase, the server may train a residual model in the game AI model based on interactions between the game AI model and the training game environment. Specifically, the server needs to acquire a training game state data sequence in a training game environment for processing by a prior strategy model and a residual model in the game AI model.
It should be noted that the training game environment is a game-play environment in which virtual characters controlled by the game AI model are located in the reinforcement learning stage; for example, if the game AI model controls the virtual character to participate in a game play, the game environment of the game play is the training game environment. The training game state data sequence is a game state data sequence acquired in the training game environment, which is similar to the game state data sequence in the skill sample data described above, and which also corresponds to the target frame length.
Step 402: determining a basic skill vector according to the training game state data sequence through the prior strategy model; determining a correction quantity according to the training game state data sequence through the residual error model; determining, by the decoder, a sequence of prediction operation data based on the base trick vector and the correction.
Fig. 5 is a schematic diagram illustrating training of the game AI model in the reinforcement learning phase according to an embodiment of the present disclosure. As shown in fig. 5, the server may input the obtained training game state data sequence into the game AI model, that is, into the prior strategy model and the residual model in the game AI model. By processing the training game state data sequence, the prior strategy model outputs a basic skill vector, and the residual model outputs a correction amount for correcting the basic skill vector; the decoder in the game AI model then determines a predicted operation data sequence according to the basic skill vector output by the prior strategy model and the correction amount output by the residual model; for example, the decoder may process the sum of the basic skill vector and the correction amount to obtain the predicted operation data sequence.
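The data flow of fig. 5 can be summarized by the following sketch, in which the prior strategy model, residual model, and decoder are passed in as already-trained modules; the stand-in layers and dimensions are illustrative assumptions, not the networks of this application.

```python
import torch
import torch.nn as nn

class GameAIModel(nn.Module):
    def __init__(self, prior_policy, residual_model, decoder):
        super().__init__()
        self.prior_policy = prior_policy      # trained in the supervised stage, frozen here
        self.residual_model = residual_model  # the only module updated in this stage
        self.decoder = decoder                # decoder of the variational self-encoder, frozen
        for p in self.prior_policy.parameters():
            p.requires_grad = False
        for p in self.decoder.parameters():
            p.requires_grad = False

    def forward(self, state_seq):
        base_skill = self.prior_policy(state_seq)         # basic skill vector
        correction = self.residual_model(state_seq)       # correction amount
        pred_ops = self.decoder(base_skill + correction)  # predicted operation data sequence
        return pred_ops, base_skill, correction

# Stand-in submodules just to show the data flow; the real ones are the trained networks.
prior = nn.Linear(256, 32)
residual = nn.Linear(256, 32)
decoder = nn.Sequential(nn.Linear(32, 16 * 8), nn.Unflatten(1, (16, 8)))  # 16 frames x 8 action dims

model = GameAIModel(prior, residual, decoder)
ops, base, delta = model(torch.randn(1, 256))  # ops: (1, 16, 8) predicted operation sequence
```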
It should be noted that the predicted operation data sequence is the operation data sequence predicted by the game AI model according to the input training game state data sequence; it is used for guiding the next actions performed by the virtual character controlled by the game AI model; it is similar to the operation data sequence in the skill sample data described above and likewise corresponds to the target frame length.
Step 403: and controlling the virtual character in the training game environment to execute the action sequence indicated by the predicted operation data sequence, and acquiring game state change data generated when the virtual character executes the action sequence.
Further, the server may control a virtual character (i.e., a virtual character in the training game environment that is controlled by the currently trained game AI model) to perform an action sequence indicated by the predicted operation data sequence output by the game AI model. The virtual character executing the action in the action sequence may cause a game state change in the training game environment, and accordingly, the server needs to acquire game state change data generated when the virtual character executes the action sequence. The game state change data is data for characterizing a game state change caused by the virtual character performing an action, and may include, for example, any one or more of the following data: vital value change data, energy value change data, remaining virtual resource change data, and the like; it should be understood that the above change data may include change data of each virtual character in the training game environment, and is not limited to change data of virtual characters controlled by the game AI model, and the data included in the game state change data is not limited in any way herein.
As an example, the server may acquire the game state change data by: acquiring game state change data corresponding to each action in the action sequence to form a game state change data sequence; the game state change data corresponding to the action is used for representing the change situation of the game state after the virtual character executes the action.
Specifically, after the game AI model outputs a predicted operation data sequence of the target frame length, the server may control the virtual character to sequentially execute the action indicated by each operation data based on each operation data included in the predicted operation data sequence; the virtual character executes an action each time, which causes a corresponding change in the game state in the training game environment, and at this time, the server may obtain game state change data representing a change situation of the game state as game state change data corresponding to the action executed by the virtual character. Thus, each time the virtual character executes the action indicated by the operation data, the server can obtain game state change data corresponding to the action; after the virtual character executes the action sequence indicated by the prediction operation data sequence, the server can correspondingly obtain the game state change data of the target frame length, and then the game state change data are utilized to form a game state change data sequence. The sequence of game state change data may then be used as a basis for determining a target prize value for training the residual model.
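As an illustration of this rollout, the following sketch assumes a gym-style training game environment whose step() call returns per-action game state change data, and assumes the game AI model has the interface of the earlier sketch (returning the predicted operation data sequence first); both the environment interface and the discrete action encoding are assumptions for illustration.

```python
import torch

def rollout_one_skill(env, game_ai_model, state_seq):
    """Execute one predicted operation sequence and collect per-action game state change data."""
    with torch.no_grad():
        pred_ops, _, _ = game_ai_model(state_seq)      # (1, frames, action_dims)
    state_changes = []
    for op in pred_ops.squeeze(0):                     # one operation data item per frame
        action = op.argmax().item()                    # assumed discrete action encoding
        _, change_data, done, _ = env.step(action)     # change_data: e.g. HP/energy/resource deltas
        state_changes.append(change_data)
        if done:                                       # the game play may end early
            break
    return pred_ops, state_changes                     # state_changes = game state change data sequence
```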
It should be understood that the above manner of obtaining the game state change data for determining the target reward value is only an example; in practical applications, the server may obtain the game state change data for determining the target reward value in other manners. For example, after obtaining the predicted operation data sequence output by the game AI model, the server may control the virtual character to execute only the action indicated by the first operation data in the predicted operation data sequence, obtain the game state change data generated after the virtual character executes that action, and use this game state change data as the basis for determining the target reward value.
Step 404: determining a target reward value according to the game state change data through a reward function; training the residual model based on the target reward value.
As shown in fig. 5, after the server acquires the game state change data, the target reward value can be determined according to the acquired game state change data through the reward function. Further, the residual model in the game AI model is trained with the goal of maximizing the target reward value, and the model parameters of the residual model are adjusted accordingly.
The reward function is a function for determining a reward value for the game AI model according to a change of the game state; the target reward value is the reward value determined by the reward function according to the game state change, and it can be used for determining the gradient when the model parameters of the residual model are adjusted. The reward function can be adjusted according to the training target of the game AI model and the action style exhibited by the virtual character under the predicted operation data sequences output by the game AI model; for example, assuming that the training target of the game AI model is to control the virtual character so that it tends to perform attack operations, then if the virtual character performs defense operations according to the predicted operation data sequences output by the game AI model, the weight configured for attack operations in the reward function may be increased, and the weight configured for defense operations may be decreased. In the embodiment of the application, the prior strategy model in the game AI model has already obtained a certain skill encoding capability through supervised learning, so the difficulty of tuning the reward function in the reinforcement learning stage can be reduced to a certain extent, and accordingly the manpower required for tuning the reward function can be reduced.
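A reward function of this kind is typically a weighted sum over game state change quantities; the following sketch uses hypothetical field names and weights to show how the attack/defense trade-off described above might be tuned.

```python
REWARD_WEIGHTS = {
    "damage_dealt": 1.0,     # raise to push the virtual character toward attacking
    "damage_taken": -0.5,
    "defense_used": -0.1,    # make more negative to discourage passive defense
    "resources_gained": 0.2,
}

def reward_fn(change_data: dict) -> float:
    """Reward value for a single piece of game state change data."""
    return sum(REWARD_WEIGHTS.get(key, 0.0) * value for key, value in change_data.items())

# Target reward value over a whole game state change data sequence, e.g.:
# target_reward = sum(reward_fn(change) for change in state_change_sequence)
```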
As an example, if the game state change data acquired by the server for determining the target reward value is a game state change data sequence, the server may determine the target reward value by: determining, through the reward function, the reward value corresponding to each piece of game state change data in the game state change data sequence; and then determining the target reward value according to the reward values corresponding to the respective pieces of game state change data in the game state change data sequence.
For example, the server may determine, one by one through the reward function, the reward value corresponding to each piece of game state change data in the game state change data sequence, where the reward value corresponding to a piece of game state change data can be understood as the reward assigned to that particular game state change. The server can then sum the reward values corresponding to all pieces of game state change data in the sequence, and the resulting total reward value is the target reward value. In this way, the overall game state change caused by each piece of operation data in the predicted operation data sequence is comprehensively considered, the target reward value is determined according to this overall change, and the residual model in the game AI model is trained based on the target reward value, which is beneficial to improving the performance of the trained residual model.
Of course, in practical applications, the server may also determine the target reward value from a single piece of game state change data through the reward function, and adjust the residual model in the game AI model based on that target reward value; this is not limited in the present application.
As an example, the server may train the residual model in the game AI model based on the target reward value using the Proximal Policy Optimization (PPO) algorithm. That is, the server may determine a loss function correction coefficient based on the difference between the predicted operation data sequence and a basic operation data sequence, where the basic operation data sequence is determined by the decoder based on the basic skill vector; further, a target loss function is constructed according to the loss function correction coefficient and the target reward value by adopting the PPO algorithm; and the model parameters of the residual model are adjusted based on the target loss function.
It should be noted that the PPO algorithm is a method for constructing a loss function in reinforcement learning, and a loss function correction coefficient needs to be determined when the loss function is constructed based on the PPO algorithm. In the related art, the loss function correction coefficient usually adopts the entropy regularization term of the policy gradient method, so as to encourage the model trained by reinforcement learning to make diverse decisions. In the embodiment of the present application, the loss function correction coefficient is instead determined according to the difference between the strategy determined by the full game AI model and the strategy determined by the prior strategy model alone, so that the game AI model trained by reinforcement learning is encouraged to make decisions that conform to the decision style of the prior strategy model; this ensures that the information learned in the supervised learning stage from skill sample data of a specific style is not weakened or erased in the reinforcement learning stage.
Specifically, the server may use the decoder in the game AI model to decode the basic skill vector output by the prior strategy model alone, obtaining a basic operation data sequence; the server may then compute the KL divergence between this basic operation data sequence and the predicted operation data sequence output by the game AI model, and use this KL divergence in place of the entropy regularization term commonly used in the policy gradient method within the PPO algorithm, that is, use the KL divergence as the loss function correction coefficient. Then, the server may use the PPO algorithm to correct the loss function constructed based on the target reward value with the loss function correction coefficient, so as to obtain the target loss function.
More specifically, the target loss function $L(\theta)$ solved with the PPO algorithm can be written as shown in the following formula (1):

$L(\theta) = \mathbb{E}\left[\, L^{PG}(\theta) \;-\; c_1\, L^{VF}(\theta) \;-\; c_2\, D_{KL}\!\left(\hat{a}_{\phi} \,\middle\|\, \hat{a}_{\theta,\phi}\right) \right]$    (1)

wherein $\mathbb{E}[\cdot]$ represents the expectation function; $\theta$ represents the model parameters of the residual model, and $\phi$ represents the model parameters of the prior strategy model; $L^{PG}(\theta)$ is the policy gradient part of the target loss function, which is specifically expanded as shown in the following formula (2); $L^{VF}(\theta)$ is the value function part of the target loss function; $D_{KL}(\hat{a}_{\phi} \,\|\, \hat{a}_{\theta,\phi})$ is the loss function correction coefficient described above, in which $\hat{a}_{\phi}$ represents the basic operation data sequence and $\hat{a}_{\theta,\phi}$ represents the predicted operation data sequence; $c_1$ and $c_2$ are weighting coefficients.

$L^{PG}(\theta) = \min\!\left(\, r(\theta)\, A,\;\; \mathrm{clip}\!\left(r(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) A \,\right)$    (2)

wherein $r(\theta) = \pi_{\theta,\phi}(z+\delta \mid s) \,/\, \pi_{\mathrm{old}}(z+\delta \mid s)$ represents the strategy ratio function in the PPO algorithm; $A$ represents the advantage function, which is determined based on the target reward value; $s$ represents the game state data, $z$ represents the skill vector output by the prior strategy model, and $\delta$ represents the correction amount output by the residual model; $\mathrm{clip}(x, y, z)$ represents the truncation function in the PPO algorithm: if the input value $x$ lies within the interval $[y, z]$, $x$ is output directly; if $x$ is smaller than $y$, $y$ is output; and if $x$ is larger than $z$, $z$ is output; $\varepsilon$ represents a hyperparameter in the PPO algorithm, whose empirical value range is between 0.05 and 0.2 and which can be adjusted according to the actual situation.
Further, the server may adjust the model parameters of the residual model in the game AI model based on the target loss function, so as to maximize the target reward value.
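As an illustration, the following is a simplified PyTorch sketch of how such a PPO-style objective with a KL correction term might be computed; the coefficient values, tensor shapes, and the use of softmax distributions over operation data are assumptions for illustration rather than the exact implementation of this application.

```python
import torch
import torch.nn.functional as F

def ppo_loss_with_kl(ratio, advantage, value_pred, value_target,
                     base_op_logits, pred_op_logits,
                     clip_eps=0.1, vf_coef=0.5, kl_coef=0.01):
    # Clipped policy-gradient part, as in formula (2).
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value-function part.
    value_loss = F.mse_loss(value_pred, value_target)

    # Correction term: KL(basic operation distribution || predicted operation distribution),
    # used instead of the usual entropy bonus to keep the policy close to the prior style.
    kl = F.kl_div(F.log_softmax(pred_op_logits, dim=-1),
                  F.softmax(base_op_logits, dim=-1),
                  reduction="batchmean")

    return policy_loss + vf_coef * value_loss + kl_coef * kl
```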
In this way, the above steps 401 to 404 are executed in a loop until the training end condition is satisfied, and the game AI model which can be put into practical use is obtained. The training end condition here may be, for example, that the iterative training time of the residual model in the game AI model reaches a preset time threshold, or, for example, that the performance of the game AI model meets a preset performance requirement (for example, the target reward value reaches the preset requirement), or, for example, that the performance of the game AI model is no longer improved along with the progress of the training (for example, the target reward value is no longer significantly changed), or the like.
In this way, by training only the residual model in the game AI model while keeping the prior strategy model and the decoder in the game AI model fixed, the behavior style learned by the prior strategy model in the supervised learning stage can be preserved, and the trained game AI model can have better stability.
In another possible implementation manner, the server may also perform further training on the prior strategy model in the game AI model in the reinforcement learning stage to further improve the strength of the prior strategy model. Namely, the server can adopt a reinforcement learning algorithm to train a prior strategy model in the game AI model; in this implementation, the game AI model is used to determine a skill vector from the game state data sequence by the prior strategy model and an operational data sequence from the skill vector by the decoder.
That is to say, in this implementation, the server constructs the game AI model only from the prior strategy model trained in the supervised learning stage and the decoder of the variational self-encoder. Since the decoder has already reached good performance through supervised learning, while the strength of the prior strategy model is still insufficient to meet the actual requirements of the game, the prior strategy model needs to be further optimized through reinforcement learning to improve its strength.
Fig. 6 is a schematic diagram illustrating another way of training the game AI model in the reinforcement learning phase according to an embodiment of the present disclosure. As shown in fig. 6, the way of training the prior strategy model in the game AI model in the reinforcement learning phase in this implementation is similar to the way of training the residual model shown in fig. 4 and fig. 5 above, except that the game AI model works differently and the object trained within the game AI model is different. The difference in the working mode of the game AI model is that, in this implementation, the game AI model determines a skill vector from the training game state data sequence through the prior strategy model, and then determines a predicted operation data sequence from that skill vector through the decoder; that is, no residual model is involved. The difference in the object to be trained is that, in this implementation, after the target reward value is determined by the reward function according to the game state change, the model parameters of the prior strategy model, rather than those of a residual model, are adjusted based on the target reward value; the specific adjustment of the model parameters of the prior strategy model is otherwise basically the same as that of the residual model, and when the model parameters of the prior strategy model are adjusted based on the PPO algorithm, the entropy regularization term commonly used in the policy gradient method can be retained as the loss function correction coefficient.
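To make the contrast between the two implementations concrete, the following sketch uses placeholder PyTorch modules (not the actual networks of this application) to show which sub-models receive gradient updates in each case.

```python
import torch.nn as nn

prior = nn.Linear(256, 32)      # placeholder for the prior strategy model
residual = nn.Linear(256, 32)   # placeholder for the residual model
decoder = nn.Linear(32, 128)    # placeholder for the decoder

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Implementation of fig. 4/5: freeze prior strategy model and decoder, train only the residual model.
set_trainable(prior, False)
set_trainable(decoder, False)
set_trainable(residual, True)

# Implementation of fig. 6: no residual model; fine-tune the prior strategy model, decoder stays fixed.
set_trainable(prior, True)
set_trainable(decoder, False)
```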
The model training method provided by the embodiment of the application innovatively introduces the concept of a "skill", namely the representation of an operation data sequence of a specific length in a low-dimensional vector space. By introducing this concept, the embodiment of the application splits the strategy that a conventional game AI model needs to learn (the mapping from game state data to operation data) into an upper-layer strategy, the mapping from a game state data sequence to a skill, and a lower-layer strategy, the mapping from a skill to an operation data sequence. The lower-layer strategy is learned by training a variational self-encoder on the skill sample data with a supervised learning algorithm: the encoder maps an operation data sequence to a skill vector, and the decoder reconstructs the operation data sequence from the skill vector. Because this mapping between skill vectors and operation data sequences does not involve the game state data space, the complexity of the training task is greatly reduced, and a well-performing variational self-encoder can be trained with only a small amount of skill sample data. The upper-layer strategy is learned by combining a supervised learning algorithm and a reinforcement learning algorithm: the prior strategy model is first trained with a supervised learning algorithm based on the game state data sequences in the skill sample data and the outputs of the encoder in the variational self-encoder, and the game AI model composed of the prior strategy model and the decoder of the variational self-encoder is then trained with a reinforcement learning algorithm. Because supervised learning already gives the prior strategy model a certain skill encoding capability, reinforcement learning of the game AI model takes less training time, the reward function is simpler to tune, the manpower required for tuning the reward function is reduced, and the trained game AI model behaves in a more human-like way.
In order to further understand the model training method provided in the embodiment of the present application, the following describes the method as a whole, taking as an example its use for training a game AI model suitable for a First-Person Shooter (FPS) game. Fig. 7 is an interface schematic diagram of an FPS game according to an embodiment of the present disclosure.
In the embodiment of the application, the concept of a "skill" is proposed based on the fixed-length continuous operation data sequences generated in the interaction between the agent and the game environment; a "skill" is the representation of a fixed-length continuous operation data sequence in a low-dimensional vector space, where the fixed length may specifically be the target frame length. By introducing the concept of a "skill", the embodiment of the application divides the strategy that a conventional game AI model needs to learn (the mapping from game state data to operation data) into an upper-layer strategy (the mapping from a game state data sequence to a skill) and a lower-layer strategy (the mapping from a skill to an operation data sequence), and learns the two parts separately, thereby simplifying the problem and allowing the two parts to be optimized respectively, which improves the overall performance of the game AI model.
For the learning of the lower-layer strategy (the mapping from skills to operation data sequences), the embodiment of the application trains a variational self-encoder with supervised learning: the variational self-encoder is trained based on game example data and includes an encoder for mapping the operation data sequence to a skill vector and a decoder for reconstructing the operation data sequence from the skill vector. Because the mapping from the skill vector to the operation data sequence does not involve the game state data space, the complexity of the training task is greatly reduced, and the training of the variational self-encoder can be completed with a small amount of game example data.
For the learning of the upper-layer strategy (the mapping from a game state data sequence to a skill), the embodiment of the application first trains a prior strategy model based on game example data through supervised learning, and then constructs a game AI model based on the prior strategy model and the decoder of the variational self-encoder; the game AI model also includes a residual model used for determining a correction amount for the skill vector output by the prior strategy model, and this residual model is then trained through reinforcement learning, thereby realizing the correction of the prior strategy model and obtaining a game AI model with better performance.
The whole process of the technical scheme comprises the following steps:
1. Based on the FPS game scenario in which the game AI model is to be used, game example data is collected; the game example data includes a game state data sequence in the FPS game scenario and an operation data sequence generated by a target player in that scenario, where the target player may be a player with a relatively high FPS game level. The collected game example data is then segmented according to the target frame length: the game state data sequence in the game example data is segmented into a plurality of game state data sequences corresponding to the target frame length, the operation data sequence is segmented into a plurality of operation data sequences corresponding to the target frame length, and each pair of corresponding game state data sequence and operation data sequence obtained by the segmentation is used to form skill sample data (see the first sketch after this list).
2. The variational self-encoder and the prior strategy model are trained synchronously with a supervised learning algorithm. That is, the server may input the operation data sequence and the game state data sequence of a given piece of skill sample data into the variational self-encoder and the prior strategy model, respectively; the encoder in the variational self-encoder processes the input operation data sequence to obtain a skill vector, the decoder in the variational self-encoder processes the skill vector output by the encoder to output a reconstructed operation data sequence, and the prior strategy model processes the input game state data sequence to obtain a skill vector. The server can then synchronously adjust the model parameters of the variational self-encoder and the prior strategy model according to the difference between the operation data sequence input into the variational self-encoder and the reconstructed operation data sequence output by it, and the difference between the skill vector output by the prior strategy model and the skill vector output by the encoder in the variational self-encoder (see the second sketch after this list).
3. After the training of the variational self-encoder and the prior strategy model is completed through the supervised learning algorithm, a residual model can be additionally established, and the game AI model is formed from the prior strategy model, the residual model, and the decoder in the variational self-encoder. In the game AI model, the prior strategy model is used to determine a basic skill vector according to the input game state data sequence, the residual model is used to determine a correction amount for the basic skill vector according to the input game state data sequence, and the decoder is used to determine, according to the sum of the basic skill vector and the correction amount, a predicted operation data sequence for controlling the virtual character.
4. The residual model in the game AI model is trained with a reinforcement learning algorithm to improve the overall capability of the game AI model, finally yielding a game AI model that can be used in the FPS game. The game AI model can be used by the development team for testing and online deployment in the early development stage of the FPS game and can serve as a base model for subsequent game AI model development; in the stable operation stage of the FPS game, the game AI model can also control a virtual character to perform actions with a specific style, and the controlled virtual character may be an NPC in the FPS game, or a teammate or opponent virtual character that is filled into a match in order to reduce the game matchmaking time.
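As a rough illustration of step 1 above (the first sketch referenced in the list), the following Python snippet slices aligned game state and operation recordings into skill sample data; the variable names and the target frame length of 16 are assumptions for illustration.

```python
def build_skill_samples(state_seq, op_seq, target_frames=16):
    """Cut aligned recordings into paired sequences of the target frame length."""
    assert len(state_seq) == len(op_seq)
    samples = []
    for start in range(0, len(op_seq) - target_frames + 1, target_frames):
        end = start + target_frames
        samples.append({
            "state_seq": state_seq[start:end],  # game state data sequence
            "op_seq": op_seq[start:end],        # corresponding operation data sequence
        })
    return samples
```

As a rough illustration of step 2 above (the second sketch referenced in the list), a single joint supervised training step might look as follows; the Gaussian encoder output, the loss weights, the use of mean-squared-error losses, and the detaching of the encoder's skill vector are assumptions rather than details specified in this application.

```python
import torch
import torch.nn.functional as F

def joint_supervised_step(encoder, decoder, prior_policy, optimizer,
                          state_seq, op_seq, beta=0.1, gamma=1.0):
    mu, logvar = encoder(op_seq)                            # skill posterior from operations
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterisation trick
    recon_ops = decoder(z)                                  # reconstructed operation data sequence
    prior_skill = prior_policy(state_seq)                   # skill vector predicted from game states

    recon_loss = F.mse_loss(recon_ops, op_seq)                          # reconstruction difference
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE regulariser
    align_loss = F.mse_loss(prior_skill, mu.detach())                   # skill-vector difference

    loss = recon_loss + beta * kl_loss + gamma * align_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```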
It should be understood that, in practical applications, in addition to training a game AI model suitable for FPS games, the model training method provided in the embodiment of the present application may also be used for training game AI models suitable for other games, such as a game AI model suitable for a Third-Person Shooter (TPS) game or a game AI model suitable for a Multiplayer Online Battle Arena (MOBA) game; the game type to which the trained game AI model applies is not limited in any way herein.
Aiming at the model training method described above, the present application also provides a corresponding model training device, so that the model training method can be applied and implemented in practice.
Referring to fig. 8, fig. 8 is a schematic diagram of a model training apparatus 800 corresponding to the model training method shown in fig. 2. As shown in fig. 8, the model training apparatus 800 includes:
a sample obtaining module 801, configured to obtain trick sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have a corresponding relationship, and the game state data sequence and the operation data sequence correspond to the length of a target frame;
a supervised learning module 802, configured to jointly train a variational self-encoder and a prior strategy model according to the skill sample data by using a supervised learning algorithm; the variational self-encoder comprises an encoder for mapping the operational data sequence to a trick vector and a decoder for reconstructing the operational data sequence from the trick vector; the prior strategy model is used for determining skill vectors according to the game state data sequence;
the reinforcement learning module 803 is used for training the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the prior strategy model and the decoder.
Optionally, the supervised learning module 802 is specifically configured to:
determining, by the encoder in the variational self-encoder, a first trick vector according to the operational data sequence in the trick sample data; determining, by the decoder in the variational self-encoder, a reconstruction operation data sequence from the first trick vector;
determining a second skill vector according to the game state data sequence in the skill sample data through the prior strategy model;
training the variational self-encoder and the prior strategy model based on differences between the reconstructed operational data sequence and the operational data sequence and differences between the second skill vector and the first skill vector.
Optionally, the supervised learning module 802 is specifically configured to:
constructing a first loss function according to the difference between the reconstruction operation data sequence and the operation data sequence; constructing a second penalty function based on the difference between the second trick vector and the first trick vector;
determining a comprehensive loss function according to the first loss function and the second loss function;
and adjusting the model parameters of the variational self-encoder and the model parameters of the prior strategy model based on the comprehensive loss function.
Optionally, the supervised learning module 802 is specifically configured to:
determining a reconstruction operation data sequence according to the operation data sequence in the skill sample data through the variational self-encoder; training the variational self-encoder according to the difference between the reconstruction operation data sequence and the operation data sequence;
determining a third skill vector according to the game state data sequence in the skill sample data through the prior strategy model; training the prior strategy model according to the difference between the third and fourth skill vectors; the fourth trick vector is determined by an encoder of the variational auto-encoder that satisfies an end-of-training condition from a sequence of operational data in the trick sample data.
Optionally, the reinforcement learning module 803 is specifically configured to:
training a residual error model in the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model comprises the prior strategy model, the residual model and the decoder, and the residual model is used for determining the correction quantity of the skill vector output by the prior strategy model.
Optionally, the reinforcement learning module 803 includes:
the state data acquisition submodule is used for acquiring a training game state data sequence in a training game environment;
the operation data prediction sub-module is used for determining a basic skill vector according to the training game state data sequence through the prior strategy model; determining a correction quantity according to the training game state data sequence through the residual error model; determining, by the decoder, a sequence of prediction operation data based on the base trick vector and the correction;
the change data acquisition sub-module is used for controlling the virtual character in the training game environment to execute the action sequence indicated by the prediction operation data sequence and acquiring game state change data generated when the virtual character executes the action sequence;
the model training submodule is used for determining a target reward value according to the game state change data through a reward function; training the residual model based on the target reward value.
Optionally, the change data obtaining sub-module is specifically configured to:
acquiring game state change data corresponding to each action in the action sequence to form a game state change data sequence; the game state change data corresponding to the action is used for representing the change condition of the game state of the virtual character after the action is executed;
the model training submodule is specifically configured to:
determining a reward value corresponding to each game state change data in the game state change data sequence through the reward function; and determining the target reward value according to the reward value corresponding to each game state change data in the game state change data sequence.
Optionally, the model training sub-module is specifically configured to:
determining a loss function correction coefficient according to the difference between the prediction operation data sequence and a basic operation data sequence; the base sequence of operational data is determined by the decoder from the base trick vector;
constructing a target loss function according to the loss function correction coefficient and the target reward value by adopting a proximal policy optimization algorithm;
adjusting model parameters of the residual model based on the target loss function.
Optionally, the reinforcement learning module 803 is specifically configured to:
training the prior strategy model in the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is used for determining a skill vector according to a game state data sequence through the prior strategy model and determining an operation data sequence according to the skill vector through the decoder.
Optionally, the sample acquiring module 801 is specifically configured to:
acquiring game example data; the game example data comprises an original game state data sequence and an original operation data sequence generated in the training game play;
and according to the length of the target frame, respectively carrying out segmentation processing on the original game state data sequence and the original operation data sequence in the game example data to obtain the game state data sequence and the operation data sequence which have corresponding relations, and forming the skill sample data by using the game state data sequence and the operation data sequence.
Optionally, the game instance data includes the original game state data sequence and the original operation data sequence with a target style; the game artificial intelligence model is used for indicating the virtual character to execute the action conforming to the target style;
or, the game instance data includes the original game state data sequence and the original operation data sequence generated by the target player; the game artificial intelligence model is used for instructing the virtual character to execute the action which is in accordance with the game style of the target player.
The model training device provided by the embodiment of the application likewise relies on the concept of a "skill", namely the representation of an operation data sequence of a specific length in a low-dimensional vector space. By introducing this concept, the embodiment of the application splits the strategy that a conventional game AI model needs to learn (the mapping from game state data to operation data) into an upper-layer strategy, the mapping from a game state data sequence to a skill, and a lower-layer strategy, the mapping from a skill to an operation data sequence. The lower-layer strategy is learned by training a variational self-encoder on the skill sample data with a supervised learning algorithm: the encoder maps an operation data sequence to a skill vector, and the decoder reconstructs the operation data sequence from the skill vector. Because this mapping between skill vectors and operation data sequences does not involve the game state data space, the complexity of the training task is greatly reduced, and a well-performing variational self-encoder can be trained with only a small amount of skill sample data. The upper-layer strategy is learned by combining a supervised learning algorithm and a reinforcement learning algorithm: the prior strategy model is first trained with a supervised learning algorithm based on the game state data sequences in the skill sample data and the outputs of the encoder in the variational self-encoder, and the game AI model composed of the prior strategy model and the decoder of the variational self-encoder is then trained with a reinforcement learning algorithm. Because supervised learning already gives the prior strategy model a certain skill encoding capability, reinforcement learning of the game AI model takes less training time, the reward function is simpler to tune, the manpower required for tuning the reward function is reduced, and the trained game AI model behaves in a more human-like way.
The embodiment of the present application further provides a computer device for model training, where the computer device may specifically be a terminal device or a server, and the terminal device and the server provided in the embodiment of the present application will be described in terms of hardware implementation.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a terminal device provided in an embodiment of the present application. As shown in fig. 9, for convenience of explanation, only the parts related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a Personal Digital Assistant (PDA), a Point of Sale (POS) terminal, a vehicle-mounted computer, and the like; the following description takes the terminal being a computer as an example:
fig. 9 is a block diagram showing a partial structure of a computer related to a terminal provided in an embodiment of the present application. Referring to fig. 9, the computer includes: radio Frequency (RF) circuitry 910, memory 920, input unit 930 (including touch panel 931 and other input devices 932), display unit 940 (including display panel 941), sensor 950, audio circuitry 960 (which may connect speaker 961 and microphone 962), wireless fidelity (WiFi) module 970, processor 980, and power supply 990. Those skilled in the art will appreciate that the computer architecture shown in FIG. 9 is not intended to be limiting of computers, and may include more or fewer components than those shown, or some components in combination, or a different arrangement of components.
The memory 920 may be used to store software programs and modules, and the processor 980 executes various functional applications and data processing of the computer by running the software programs and modules stored in the memory 920. The memory 920 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the computer, and the like. Further, the memory 920 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 980 is a control center of the computer, connects various parts of the entire computer using various interfaces and lines, performs various functions of the computer and processes data by running or executing software programs and/or modules stored in the memory 920 and calling data stored in the memory 920. Alternatively, processor 980 may include one or more processing units; preferably, the processor 980 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 980.
In this embodiment, the processor 980 included in the terminal is further configured to execute the steps of any implementation manner of the model training method provided in this embodiment.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a server 1000 according to an embodiment of the present application. The server 1000, which may vary significantly due to configuration or performance, may include one or more Central Processing Units (CPUs) 1022 (e.g., one or more processors) and memory 1032, one or more storage media 1030 (e.g., one or more mass storage devices) that store applications 1042 or data 1044. Memory 1032 and storage medium 1030 may be, among other things, transient or persistent storage. The program stored on the storage medium 1030 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, a central processor 1022 may be disposed in communication with the storage medium 1030, and configured to execute a series of instruction operations in the storage medium 1030 on the server 1000.
The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input-output interfaces 1058, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 10.
The CPU 1022 may be configured to execute the steps of any implementation manner of the model training method provided in the embodiment of the present application.
The embodiment of the present application further provides a computer-readable storage medium for storing a computer program, where the computer program is configured to execute any one implementation manner of the model training method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, so that the computer device executes any one of the implementation manners of the model training method described in the foregoing embodiments.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one type of logical functional division, and other divisions may be realized in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be understood that, in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" is used to describe an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be singular or plural.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present application.

Claims (14)

1. A method of model training, the method comprising:
acquiring skill sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have a corresponding relation, and the game state data sequence and the operation data sequence correspond to the length of a target frame;
jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm; the variational self-encoder comprises an encoder for mapping an operational data sequence into a trick vector and a decoder for reconstructing the operational data sequence from the trick vector; the prior strategy model is used for determining skill vectors according to the game state data sequence; the trick vector is a representation of the sequence of operational data of the target frame length in a low dimensional feature space;
training a game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the prior strategy model and the decoder after joint training.
2. The method of claim 1, wherein said jointly training a variational self-encoder and a priori strategy model based on said skill sample data using a supervised learning algorithm comprises:
determining, by the encoder in the variational self-encoder, a first trick vector according to the operational data sequence in the trick sample data; determining, by the decoder in the variational self-encoder, a sequence of reconstruction operational data from the first trick vector;
determining a second skill vector according to the game state data sequence in the skill sample data through the prior strategy model;
training the variational auto-encoder and the a priori strategy model based on differences between the reconstructed operational data sequence and the operational data sequence and differences between the second trick vector and the first trick vector.
3. The method of claim 2, wherein said training the variational auto-encoder and the a priori strategy model based on differences between the reconstructed operational data sequence and the operational data sequence and differences between the second trick vector and the first trick vector comprises:
constructing a first loss function according to the difference between the reconstruction operation data sequence and the operation data sequence; constructing a second penalty function based on the difference between the second trick vector and the first trick vector;
determining a comprehensive loss function according to the first loss function and the second loss function;
and adjusting the model parameters of the variational self-encoder and the model parameters of the prior strategy model based on the comprehensive loss function.
4. The method of claim 1, wherein the jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm comprises:
determining a reconstruction operation data sequence according to the operation data sequence in the skill sample data through the variational self-encoder; training the variational self-encoder according to the difference between the reconstruction operation data sequence and the operation data sequence;
determining a third skill vector according to the game state data sequence in the skill sample data through the prior strategy model; training the prior strategy model according to the difference between the third and fourth skill vectors; the fourth trick vector is determined by an encoder of the variational auto-encoder that satisfies an end-of-training condition from a sequence of operational data in the trick sample data.
5. The method of claim 1, wherein training the game artificial intelligence model using the reinforcement learning algorithm comprises:
training a residual error model in the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model comprises the prior strategy model, the residual error model and the decoder, and the residual error model is used for determining the correction quantity of the skill vector output by the prior strategy model.
6. The method of claim 5, wherein training a residual model in the game artificial intelligence model using a reinforcement learning algorithm comprises:
acquiring a training game state data sequence in a training game environment;
determining basic skill vectors according to the training game state data sequence through the prior strategy model; determining a correction quantity according to the training game state data sequence through the residual error model; determining, by the decoder, a sequence of prediction operation data based on the base trick vector and the correction;
controlling virtual characters in the training game environment to execute action sequences indicated by the predicted operation data sequences, and acquiring game state change data generated when the virtual characters execute the action sequences;
determining a target reward value according to the game state change data through a reward function; training the residual model based on the target reward value.
7. The method of claim 6, wherein obtaining game state change data generated by the virtual character during the sequence of actions comprises:
obtaining game state change data corresponding to each action in the action sequence to form a game state change data sequence; the game state change data corresponding to the action is used for representing the change condition of the game state of the virtual character after the action is executed;
the determining a target prize value according to the game state change data through a prize function comprises:
determining a reward value corresponding to each game state change data in the game state change data sequence through the reward function; and determining the target reward value according to the reward value corresponding to each game state change data in the game state change data sequence.
8. The method of claim 6, wherein training the residual model based on the target reward value comprises:
determining a loss function correction coefficient according to the difference between the prediction operation data sequence and the basic operation data sequence; the base sequence of operational data is determined by the decoder from the base trick vector;
constructing a target loss function according to the loss function correction coefficient and the target reward value by adopting a proximal policy optimization algorithm;
adjusting model parameters of the residual model based on the target loss function.
9. The method of claim 1, wherein training the game artificial intelligence model using the reinforcement learning algorithm comprises:
training the prior strategy model in the game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is used for determining a skill vector according to a game state data sequence through the prior strategy model and determining an operation data sequence according to the skill vector through the decoder.
10. The method of claim 1 wherein said obtaining skill sample data comprises:
acquiring game example data; the game example data comprises an original game state data sequence and an original operation data sequence generated in the training game play;
and according to the length of the target frame, respectively segmenting the original game state data sequence and the original operation data sequence in the game example data to obtain the game state data sequence and the operation data sequence which have corresponding relations, and forming the skill sample data by using the game state data sequence and the operation data sequence.
11. The method of claim 10, wherein the game example data comprises the original game state data sequence and the original operation data sequence having a target style; the game artificial intelligence model is used for instructing the virtual character to execute actions conforming to the target style;
or, the game example data comprises the original game state data sequence and the original operation data sequence generated by a target player; the game artificial intelligence model is used for instructing the virtual character to execute actions conforming to the game style of the target player.
12. A model training apparatus, the apparatus comprising:
the sample acquisition module is used for acquiring skill sample data; the skill sample data comprises a game state data sequence and an operation data sequence which have a corresponding relation, and the game state data sequence and the operation data sequence correspond to a target frame length;
the supervised learning module is used for jointly training a variational self-encoder and a prior strategy model according to the skill sample data by adopting a supervised learning algorithm; the variational self-encoder comprises an encoder and a decoder, wherein the encoder is used for mapping the operation data sequence to a skill vector, and the decoder is used for reconstructing the operation data sequence from the skill vector; the prior strategy model is used for determining the skill vector according to the game state data sequence; the skill vector is a representation of the operation data sequence of the target frame length in a low-dimensional feature space;
and the reinforcement learning module is used for training a game artificial intelligence model by adopting a reinforcement learning algorithm; the game artificial intelligence model is constructed based on the jointly trained prior strategy model and decoder.
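For the supervised learning module of claim 12, the following sketch shows one way the variational self-encoder and the prior strategy model could be trained jointly: reconstruction and KL terms for the self-encoder, plus a term that pulls the prior strategy model's predicted skill vector toward the encoder's skill vector. The network objects, loss weights, and the specific distance terms are assumptions, not details disclosed in the claim.

import torch
import torch.nn.functional as F

def joint_supervised_loss(encoder, decoder, prior_policy, state_seq, op_seq, beta=0.01):
    mu, logvar = encoder(op_seq)                                # posterior over the skill vector
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # sampled skill vector
    recon = decoder(z)                                          # reconstructed operation data sequence
    recon_loss = F.mse_loss(recon, op_seq)                      # reconstruction term
    kl_loss = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    prior_pred = prior_policy(state_seq)                        # skill vector predicted from game states
    prior_loss = F.mse_loss(prior_pred, mu.detach())            # align the prior strategy model with the encoder
    return recon_loss + beta * kl_loss + prior_loss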
13. A computer device, the device comprising a processor and a memory;
the memory is used for storing a computer program;
the processor is configured to perform the model training method of any one of claims 1 to 11 in accordance with the computer program.
14. A computer-readable storage medium storing a computer program, wherein the computer program is used for performing the model training method of any one of claims 1 to 11.
CN202211391056.5A 2022-11-08 2022-11-08 Model training method and related device Active CN115496191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211391056.5A CN115496191B (en) 2022-11-08 2022-11-08 Model training method and related device

Publications (2)

Publication Number Publication Date
CN115496191A (en) 2022-12-20
CN115496191B (en) 2023-04-07

Family

ID=85115905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211391056.5A Active CN115496191B (en) 2022-11-08 2022-11-08 Model training method and related device

Country Status (1)

Country Link
CN (1) CN115496191B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116991388B (en) * 2023-09-26 2024-01-09 之江实验室 Graph optimization sequence generation method and device of deep learning compiler

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852419A (en) * 2019-11-08 2020-02-28 中山大学 Action model based on deep learning and training method thereof
WO2022160773A1 (en) * 2021-01-28 2022-08-04 武汉大学 Pedestrian re-identification method based on virtual samples

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106422332B (en) * 2016-09-08 2019-02-26 腾讯科技(深圳)有限公司 Artificial intelligence operating method and device applied to game
CN109663359B (en) * 2018-12-06 2022-03-25 广州多益网络股份有限公司 Game intelligent agent training optimization method and device, terminal device and storage medium
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40079087
Country of ref document: HK