CN112717415B - Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Info

Publication number: CN112717415B
Application number: CN202110091260.4A
Authority: CN (China)
Prior art keywords: model, training, game, reinforcement learning, parameters
Legal status: Active (granted)
Inventors: 张轶飞, 程帆, 张冬梅
Current Assignee: Shanghai Jiaotong University
Original Assignee: Shanghai Jiaotong University
Priority date / Filing date: 2021-01-22
Publication date (grant): 2022-08-16
Other versions: CN112717415A (en), published 2021-04-30
Other languages: Chinese (zh)
Events: application filed by Shanghai Jiaotong University; publication of CN112717415A; application granted; publication of CN112717415B

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67: Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention relates to an AI training method for a reinforcement learning battle game based on the information bottleneck theory, comprising the following steps: 1) initializing an AI training model; 2) performing decision interaction in a simulation environment through the game AI to obtain a sample training batch data set; 3) iteratively training the AI training model with a reinforcement learning algorithm on the sample training batch data set obtained from the interaction between the game AI and the environment, and saving the parameters of the AI training model in stages; 4) fixing part of the saved parameters of the AI training models from different stages and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning, so as to obtain final AI training models for AI of different levels and generate the battle game AI files. Compared with the prior art, the method has the advantages of high sampling efficiency, fast training, flexible testing, graded AI levels, and the like.

Description

Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game
Technical Field
The invention relates to the field of game AI learning, and in particular to an AI training method for a reinforcement learning battle game based on the information bottleneck theory.
Background
With the development of deep learning in recent years, many achievements have been obtained in the field of deep reinforcement learning, and an increasing number of methods combining deep learning with reinforcement learning algorithms (such as DQN, A2C, PPO and DDPG) have shown strong performance on video game AI. However, in many reinforcement learning problems the cost of interaction between the agent and the environment is high, so it is desirable to make the algorithm converge as fast as possible in order to save training cost, that is, to learn a higher-level strategy from the same amount of sampling.
In existing fighting games, the man-machine battle mode is one of the important components of the game. Existing game AI is designed by manually specifying a strategy distribution and hand-crafted action mappings, so its behavior patterns are monotonous and lack the flexibility of battles between human players. Meanwhile, existing methods that train game AI with reinforcement learning take raw pixels as input, which carry a large amount of redundant information and reduce the learning efficiency of the network and the speed of the reinforcement learning algorithm. In deep learning experiments, measured by the mutual information between the input-layer and representation-layer variables, a neural network first memorizes the input during training and then compresses the input information according to the specific learning task, discarding useless redundant information, i.e. reducing the mutual information between the input layer and the representation layer; this is the information E-C process. However, existing reinforcement learning algorithms are not optimized for this information extraction process.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an AI training method for a reinforcement learning battle game based on an information bottleneck theory.
The purpose of the invention can be realized by the following technical scheme:
an AI training method for a reinforcement learning battle game based on an information bottleneck theory comprises the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of the game AI and the environment, iteratively training an AI training model by adopting a reinforcement learning algorithm, and storing parameters of the AI training model in stages;
4) fixing part of the saved parameters of the AI training models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain final AI training models for AI of different levels, thereby generating the battle game AI files.
The step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
The step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from the environment

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} are the game picture sampled at the current time t and the game picture at the next time t+1, A_t is the set of AI operations available in the game at the current time t, R_t is the environment reward corresponding to the selected operation, i.e. the game score and the real-time attributes of the game character, and k is the total number of sampling moments;
22) for each sampled game picture X_t in D_env, the game AI obtains m operation samples

{Z_t^i}, i = 1, 2, …, m

according to the policy network model, wherein Z_t^i is the real-time AI operation sampled from the distribution given by the policy network model, and P_φ(·|X_t) is the probability distribution model;
23) the environmental sample batch data D_env and the m operation samples {Z_t^i} obtained by the AI according to the policy network model are integrated correspondingly to obtain the sample training batch data set

D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
In the step 3), an AI training model is iteratively trained by adopting an A2C algorithm.
In the step 3), when the model is iteratively trained, the gradient of the model representation layer is calculated by the following formula:

[formula rendered as an image in the original publication: gradient of the model representation layer]

wherein the left-hand side is the gradient of the model representation layer, P(X) is the distribution probability of the game picture X, φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X, and E denotes expectation.
In the step 3), when the model is iteratively trained, the gradient of the reinforcement learning algorithm is calculated by the following formula:

[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein the left-hand side is the gradient of the reinforcement learning algorithm, and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term.
The expression of the loss function J(Z; θ) is as follows:

[formulas rendered as images in the original publication: the loss function J(Z; θ) and its A2C component]

wherein the first component is the existing loss function in the framework of the A2C algorithm, R is the real-time reward value, i.e. the real-time change of the game score and the game character attributes, α is the reward attenuation coefficient in the A2C algorithm, Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ, and H(P_φ(a_t|X_t; θ)) is the entropy of the distribution P_φ(a_t|X_t; θ).
In the step 3), when the model is iteratively trained with the reinforcement learning algorithm, the network parameters θ and φ are updated respectively and tested for convergence: if they have not converged, training continues; if they have converged, training is stopped and the model is saved in stages. The update expressions for the network parameters θ and φ are as follows:

[formulas rendered as images in the original publication: the update expressions for θ and φ]
the step 4) specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_j (j = 1, 2, …, n-1) saved during the AI training process, each intermediate model comprising a value network model and a policy network model;
42) fixing the convolutional layer parameters of the intermediate model M_j, training again in the same way as in the AI training process to update the fully-connected layer parameters of the intermediate model M_j until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 intermediate models M_1, M_2, …, M_{n-1} retrained in step 42) and of the final converged model M_n obtained in the initial training, and generating from them the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the game AI strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of battle AI of n different capability levels, and merging them with the other code files to construct battle AI of n different capability levels.
In the step 3), when the AI training model is iteratively trained with the A2C algorithm, a Monte Carlo estimation method is adopted for fitting the expectation, and the Stein variational gradient descent method is adopted for the gradient update.
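For orientation only, the following is a minimal PyTorch sketch of a generic Stein variational gradient descent step with an RBF kernel and the median bandwidth heuristic. The gradient expressions of the invention appear only as images in the original publication, so this sketch illustrates the general SVGD mechanism rather than the exact update used here; the function names and the Gaussian toy example are assumptions.

    import torch

    def rbf_kernel(particles, bandwidth=None):
        # particles: (m, d) tensor; diff[i, j] = x_i - x_j
        diff = particles.unsqueeze(1) - particles.unsqueeze(0)
        sq_dist = (diff ** 2).sum(-1)
        if bandwidth is None:                       # median heuristic for the kernel bandwidth
            m = particles.size(0)
            bandwidth = sq_dist.median() / (2.0 * torch.log(torch.tensor(m + 1.0)))
            bandwidth = bandwidth.clamp(min=1e-6)
        k = torch.exp(-sq_dist / (2.0 * bandwidth))
        # grad_k[i, j] = d k(x_j, x_i) / d x_j = (x_i - x_j) / h * k(i, j)
        grad_k = diff / bandwidth * k.unsqueeze(-1)
        return k, grad_k

    def svgd_direction(particles, score_fn):
        # score_fn returns the gradient of the log target density at each particle, shape (m, d)
        m = particles.size(0)
        k, grad_k = rbf_kernel(particles)
        score = score_fn(particles)
        # phi(x_i) = (1/m) * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
        return (k @ score + grad_k.sum(dim=1)) / m

    # Toy usage: transport random particles toward a standard Gaussian target.
    particles = 3.0 * torch.randn(16, 2)
    for _ in range(200):
        particles = particles + 0.1 * svgd_direction(particles, lambda x: -x)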
Compared with the prior art, the invention has the following advantages:
First, existing policy gradient algorithms such as PPO, A2C and DDPG focus on the convergence of the reinforcement learning algorithm itself and do not consider the problem of extracting information from the environment state into the value function; the invention introduces the information bottleneck theory and adds a mutual information penalty term to the loss function to optimize this extraction process and accelerate training.
Second, in the invention the optimized gradient is obtained with the Stein variational gradient descent method, and the difficulty that the probability distribution of the representation layer conditioned on the input layer cannot be computed in the information bottleneck problem is resolved by using a lower bound in place of the unknown distribution.
Third, compared with traditional manually designed AI, the battle game AI designed by the invention can generate more battle strategies according to the real-time dynamic state of the game and is more flexible in actual testing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a diagram of the model architecture of the present invention.
FIG. 3 is a diagram of an AI training model according to the invention.
Fig. 4 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in fig. 1, the invention provides an AI training method for a reinforcement learning battle game based on an information bottleneck theory, comprising the following steps:
1) initializing network parameters and hyper-parameters of an AI training model (in this example, a CNN model is adopted, and the specific model structure is shown in FIG. 3), and setting learning rate and the number of samples sampled from parameter distribution;
2) performing decision interaction in a simulation environment through AI to obtain a sample training batch data set;
3) iteratively training an AI training model by adopting a reinforcement learning algorithm (in the example, an A2C algorithm) on the basis of a sample training batch data set obtained by the interaction of AI and the environment, and storing model parameters in stages;
4) fixing part of the saved parameters of the models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain the final strategy models for AI of different levels.
The specific process of each step is as follows:
the step 1) specifically comprises the following steps:
11) determining the AI optional operation set A (taking an arcade fighting game as an example, the set specifically comprises up/down and left/right movement, attack, defense, special-move (skill) operations, and the like) and the number n of AI of different required capability levels according to the game operation description;
12) initializing all corresponding network parameters θ and φ of the value network model and the policy network model;
13) determining a hyper-parameter beta, eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
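For illustration, a minimal PyTorch sketch of the initialization in step 1) is given below. The concrete network structure of FIG. 3 is not reproduced in the text, so the convolutional architecture, the action names, the number of stacked frame channels and the numeric hyper-parameter values used here are assumptions, not the exact configuration of the embodiment.

    import torch
    import torch.nn as nn

    # Hypothetical operation set A for an arcade fighting game (step 11);
    # the real set is taken from the game operation description.
    ACTIONS = ["idle", "up", "down", "left", "right", "attack", "defend", "skill"]

    n_levels = 3             # number n of AI capability levels (step 11)
    beta, eta = 1e-3, 1e-2   # hyper-parameters beta and eta (step 13); placeholder values
    learning_rate = 1e-4     # model learning rate E (step 14)
    m_samples = 8            # number m of samples drawn from the policy distribution (step 15)

    class PolicyNet(nn.Module):
        """Policy network P_phi(.|X): game frame -> distribution over operations."""
        def __init__(self, n_actions, frame_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(frame_channels, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(), nn.Flatten(),
            )
            self.fc = nn.Sequential(nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, n_actions))

        def forward(self, frames):
            return torch.distributions.Categorical(logits=self.fc(self.conv(frames)))

    class ValueNet(nn.Module):
        """Value network Theta: game frame -> scalar value estimate."""
        def __init__(self, frame_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(frame_channels, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(), nn.Flatten(),
            )
            self.fc = nn.Sequential(nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, 1))

        def forward(self, frames):
            return self.fc(self.conv(frames)).squeeze(-1)

    # Step 12: initialize the policy parameters phi and the value parameters theta.
    policy_net, value_net = PolicyNet(len(ACTIONS)), ValueNet()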
The step 2) specifically comprises the following steps:
21) a batch of data is sampled from the environment:

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} respectively denote the game picture at the current moment and the game picture at the next moment, A_t denotes the set of game AI operations available at that moment (including up/down/left/right movement, attack, defense and special-move operations in the fighting game), and R_t denotes the environment reward corresponding to the state-action pair (including the change in health points, the change in magic points, the distance to the opponent, the skill cooldown time, the real-time game score, and the like).
22) for each game picture X_t sampled in real time, the game AI takes m samples

{Z_t^i}, i = 1, 2, …, m

according to the policy network model, wherein Z_t^i is the real-time AI operation sampled from the distribution given by the policy network model (including how to move, whether to defend, how to attack, and so on).
23) the environmental sample batch data D_env and the data {Z_t^i} obtained by AI sampling are integrated correspondingly to obtain the training batch data

D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
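The sampling of steps 21)-23) can be sketched as follows, assuming a policy network like the one sketched in step 1) and a simplified environment interface in which reset() and step() return a game-frame tensor and a scalar reward; this interface and the helper name are assumptions, and the real reward combines the HP change, MP change, distance to the opponent, skill cooldown and game score described above.

    import torch

    def collect_batch(env, policy_net, k_steps, m_samples):
        """Collect one training batch D as in steps 21)-23) (a sketch)."""
        batch = []
        frame = env.reset()
        for _ in range(k_steps):
            with torch.no_grad():
                dist = policy_net(frame.unsqueeze(0))              # P_phi(.|X_t)
                z_samples = dist.sample((m_samples,)).squeeze(-1)  # m operation samples Z_t^1..Z_t^m
                action = z_samples[0].item()                       # operation actually executed
            next_frame, reward = env.step(action)                  # environment reward R_t
            batch.append((frame, z_samples, action, reward, next_frame))
            frame = next_frame
        return batch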
The gradient of the representation layer on the sample training batch data D in step 3) is calculated as:

[formula rendered as an image in the original publication: gradient of the representation layer]

wherein Z_i denotes a real-time operation sample drawn by the agent from the probability model P_φ(·|X_t), P(X) is the distribution probability of the game picture, and φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X.
The gradient of the reinforcement learning algorithm is:

[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein Z_i denotes a real-time operation sample drawn by the agent from the probability model P_φ(·|X_t), and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term:

[formula rendered as an image in the original publication: the loss function J(Z; θ)]

wherein the first component is the usual loss function in the framework of the A2C algorithm:

[formula rendered as an image in the original publication: the A2C loss function]

wherein R is the real-time reward value (including the change in health points ΔHP, the change in magic points ΔMP, the distance d to the opponent, the skill cooldown time, the real-time game Score, and so on; the concrete expression of R can be designed as required, one example being R = ΔHP + ΔMP + d + Score), α is the reward attenuation coefficient in the algorithm and can be adjusted according to different design requirements, Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ, and H(P) is the entropy of the distribution P. In the whole calculation process, the expectation E is fitted with a Monte Carlo estimation method, and the gradient optimization is computed with the Stein variational gradient descent method.
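To make the shape of the objective concrete, a compact PyTorch sketch of an A2C-style loss with an added mutual-information penalty follows. The exact loss J(Z; θ) of the invention is shown only as images in the original publication, so the particular bottleneck surrogate used here (the average KL divergence between the per-frame policy and the batch-marginal policy, an empirical estimate of the mutual information between the game picture and the sampled operation), the 0.5 value-loss weight and the function signature are assumptions.

    import torch
    import torch.nn.functional as F

    def a2c_ib_loss(policy_net, value_net, frames, actions, returns, beta=1e-3, eta=1e-2):
        """A2C loss plus a mutual-information (information bottleneck) penalty (a sketch)."""
        dist = policy_net(frames)                   # P_phi(.|X_t) for every frame in the batch
        values = value_net(frames)                  # value estimates Theta(X_t)
        advantages = (returns - values).detach()

        policy_loss = -(dist.log_prob(actions) * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_bonus = dist.entropy().mean()       # H(P_phi(a_t | X_t))

        probs = dist.probs                          # (batch, n_actions)
        marginal = probs.mean(dim=0, keepdim=True)  # empirical estimate of the action marginal
        mi_penalty = (probs * (probs.clamp_min(1e-8).log()
                               - marginal.clamp_min(1e-8).log())).sum(dim=1).mean()

        return policy_loss + 0.5 * value_loss - eta * entropy_bonus + beta * mi_penalty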
The A2C algorithm adopted in this embodiment is implemented mainly with PyTorch, and the specific AI training model architecture is shown in FIG. 2.
The step of specifically updating the model parameters comprises:
31) updating the value network parameter θ according to the following updating principle:

[formula rendered as an image in the original publication: the update expression for θ]

32) updating the policy network parameter φ according to the following updating principle:

[formula rendered as an image in the original publication: the update expression for φ]
33) judging whether the model parameters have converged according to the magnitude of the parameter updates; if not, training continues from step 2), and if convergence is reached, training is stopped and the model is saved. In particular, during the whole training process all network model parameters are saved in time order every 100 parameter updates.
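The outer loop of steps 31)-33) can then be sketched as follows, reusing the collect_batch and a2c_ib_loss helpers from the sketches above; the Adam optimizers, the simple discounted-return computation with coefficient alpha and the replacement of the convergence test by a fixed number of updates are assumptions, while the checkpoint interval of 100 updates follows the embodiment.

    import torch

    def train(policy_net, value_net, env, n_updates, k_steps=128, m_samples=8,
              learning_rate=1e-4, alpha=0.99, save_every=100):
        """Update theta and phi and save all parameters in time order (a sketch of 31)-33))."""
        opt_theta = torch.optim.Adam(value_net.parameters(), lr=learning_rate)
        opt_phi = torch.optim.Adam(policy_net.parameters(), lr=learning_rate)
        checkpoints = []
        for update in range(1, n_updates + 1):
            batch = collect_batch(env, policy_net, k_steps, m_samples)
            frames = torch.stack([b[0] for b in batch])
            actions = torch.tensor([b[2] for b in batch])
            rewards = [b[3] for b in batch]

            returns, g = [], 0.0                    # discounted return with attenuation alpha
            for r in reversed(rewards):
                g = r + alpha * g
                returns.append(g)
            returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

            loss = a2c_ib_loss(policy_net, value_net, frames, actions, returns)
            opt_theta.zero_grad()
            opt_phi.zero_grad()
            loss.backward()
            opt_theta.step()                        # 31) update the value parameters theta
            opt_phi.step()                          # 32) update the policy parameters phi

            if update % save_every == 0:            # 33) save all parameters every 100 updates
                checkpoints.append({
                    "update": update,
                    "policy": {k: v.clone() for k, v in policy_net.state_dict().items()},
                    "value": {k: v.clone() for k, v in value_net.state_dict().items()},
                })
        return checkpoints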
Step 4) fixes part of the saved model parameters from different stages and retrains the remaining parameters with the reinforcement learning algorithm for fine-tuning, obtaining the final strategy models for AI of different levels.
The method specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_i (i = 1, 2, …, n-1) saved during the AI training process, where each M_i comprises both a value network model and a policy network model;
42) fixing the convolutional layer parameters of the model M_i, training again in the same way as in the AI training process to update the fully-connected layer parameters of M_i until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 models M_1, M_2, …, M_{n-1} retrained in 42) and of the final converged model M_n obtained in the initial training, which yields the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of n battle AI of different capability levels, and merging them with the other code files to construct n battle AI of different capability levels.
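A sketch of the grading step follows, again in PyTorch: the convolutional parameters of each saved intermediate model are frozen, only the fully-connected parts are retrained with the same training routine, and the resulting policy networks are exported as the strategy distribution files F_1, …, F_n. The factory helpers, the conv/fc attribute split and the file naming are assumptions, and the selection of which n-1 checkpoints serve as the intermediate models M_1, …, M_{n-1} is assumed to have been made beforehand.

    import torch

    def finetune_levels(checkpoints, final_policy_net, make_policy_net, make_value_net,
                        env, n_updates):
        """Build graded AI levels from saved checkpoints (a sketch of steps 41)-44))."""
        strategy_files = []
        for i, ckpt in enumerate(checkpoints, start=1):     # intermediate models M_1..M_{n-1}
            policy_net, value_net = make_policy_net(), make_value_net()
            policy_net.load_state_dict(ckpt["policy"])
            value_net.load_state_dict(ckpt["value"])
            for net in (policy_net, value_net):             # 42) fix convolutional parameters
                for p in net.conv.parameters():
                    p.requires_grad_(False)
            train(policy_net, value_net, env, n_updates)    # retrain fully-connected parts only
            path = f"strategy_F{i}.pt"                      # 43) export the policy network alone
            torch.save(policy_net.state_dict(), path)
            strategy_files.append(path)
        path = f"strategy_F{len(checkpoints) + 1}.pt"       # M_n: final converged model
        torch.save(final_policy_net.state_dict(), path)
        strategy_files.append(path)
        return strategy_files                               # 44) F_1..F_n drive the n AI levels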
According to the method, by introducing the information bottleneck theory, a mutual information penalty term is added to the loss function of the A2C algorithm to accelerate the AI training process; meanwhile, the output strategies of AI of different grades are re-smoothed by pre-training and then fine-tuning the model parameters. Compared with the prior art, the invention achieves high sampling efficiency during training, further accelerates the training of battle game AI of different capability grades, and provides sufficient flexibility.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A reinforcement learning fighting game AI training method based on an information bottleneck theory is characterized by comprising the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of game AI and environment, adopting a reinforcement learning algorithm to iteratively train an AI training model, and saving parameters of the AI training model in stages, in the step 3), adopting an A2C algorithm to iteratively train the AI training model, wherein when the model is iteratively trained, the gradient of a model representation layer is calculated by the following formula:
[formula rendered as an image in the original publication: gradient of the model representation layer]

wherein the left-hand side is the gradient of the model representation layer, P(X) is the distribution probability of the game picture X, φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X, and E denotes expectation;
in the iterative training of the model, the gradient of the reinforcement learning algorithm is calculated as:
[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein the left-hand side is the gradient of the reinforcement learning algorithm, and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term;
the expression of the loss function J (Z; theta) is as follows:
[formula rendered as an image in the original publication: the loss function J(Z; θ)]

wherein the first component is the loss function in the framework of the A2C algorithm, R is the real-time reward value, i.e. the real-time change of the game score and the game character attributes, α is the reward attenuation coefficient in the A2C algorithm, and Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ;
4) fixing part of the saved parameters of the AI training models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain final AI training models for AI of different levels, thereby generating the battle game AI file.
2. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
3. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 2, wherein the step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from the environment

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} are the game picture sampled at the current time t and the game picture at the next time t+1, A_t is the set of AI operations available in the game at the current time t, R_t is the environment reward corresponding to the selected operation, namely the game score and the real-time attributes of the game character, and k is the total number of sampling moments;
22) for each sampled game frame
Figure FDA0003592515980000022
The game AI obtains m operation samples according to the strategy network model
Figure FDA0003592515980000023
Wherein the content of the first and second substances,
Figure FDA0003592515980000024
for AI real-time operations sampled in the distribution according to a strategic network model, P φ (.|X t ) Is a probability distribution model;
23) the environmental sample batch data D_env and the m operation samples {Z_t^i} obtained by the AI according to the policy network model are integrated correspondingly to obtain the sample training batch data set D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
4. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein in the step 3), when the reinforcement learning algorithm is adopted to iteratively train the model, the network parameters θ and φ are respectively updated, whether the network parameters are converged is judged, if the network parameters are not converged, the training is performed again, if the convergence is reached, the training is stopped and the model is saved in stages, and the network parameters θ and φ are updated according to the following expression:
[formulas rendered as images in the original publication: the update expressions for the network parameters θ and φ]
5. the AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 4) specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_j (j = 1, 2, …, n-1) saved during the AI training process, each intermediate model comprising a value network model and a policy network model;
42) fixing the convolutional layer parameters of the intermediate model M_j, training again in the same way as in the AI training process to update the fully-connected layer parameters of the intermediate model M_j until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 intermediate models M_1, M_2, …, M_{n-1} retrained in step 42) and of the final converged model M_n obtained in the initial training, and generating from them the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the game AI strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of battle AI of n different capability levels, and merging them with the other code files to construct battle AI of n different capability levels.
6. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 4, wherein in the step 3), when the AI training model is iteratively trained with the A2C algorithm, the expectation is fitted by a Monte Carlo estimation method, and the gradient is updated with the Stein variational gradient descent method.
CN202110091260.4A (Active, granted as CN112717415B) | Priority date: 2021-01-22 | Filing date: 2021-01-22 | Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Priority Applications (1)

Application Number: CN202110091260.4A (granted as CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Applications Claiming Priority (1)

Application Number: CN202110091260.4A (granted as CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Publications (2)

Publication Number | Publication Date
CN112717415A (en) | 2021-04-30
CN112717415B (en) | 2022-08-16

Family

ID=75595220

Family Applications (1)

Application Number: CN202110091260.4A (Active, CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Country Status (1)

Country Link
CN (1) CN112717415B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113641905B (en) * 2021-08-16 2023-10-03 京东科技信息技术有限公司 Model training method, information pushing method, device, equipment and storage medium
CN116109525B (en) * 2023-04-11 2024-01-05 北京龙智数科科技服务有限公司 Reinforcement learning method and device based on multidimensional data enhancement
CN116808590B (en) * 2023-08-25 2023-11-10 腾讯科技(深圳)有限公司 Data processing method and related device
CN117162102A (en) * 2023-10-30 2023-12-05 南京邮电大学 Independent near-end strategy optimization training acceleration method for robot joint action

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN111886059A (en) * 2018-03-21 2020-11-03 威尔乌集团 Automatically reducing use of cheating software in an online gaming environment
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
KR20180087060A (en) * 2017-01-24 2018-08-01 라인 가부시키가이샤 Method, apparatus, computer program and recording medium for providing game service
CN110327624B (en) * 2019-07-03 2023-03-17 广州多益网络股份有限公司 Game following method and system based on curriculum reinforcement learning
CN112169311A (en) * 2020-10-20 2021-01-05 网易(杭州)网络有限公司 Method, system, storage medium and computer device for training AI (Artificial Intelligence)
CN112221152A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Artificial intelligence AI model training method, device, equipment and medium


Also Published As

Publication number Publication date
CN112717415A (en) 2021-04-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant