CN112717415B - Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Info

Publication number: CN112717415B
Application number: CN202110091260.4A
Authority: CN (China)
Prior art keywords: model, training, game, reinforcement learning, parameters
Legal status: Active (granted)
Inventors: 张轶飞, 程帆, 张冬梅
Current Assignee: Shanghai Jiaotong University
Original Assignee: Shanghai Jiaotong University
Priority date / Filing date: 2021-01-22
Publication date (grant): 2022-08-16
Other versions: CN112717415A (en), published 2021-04-30
Other languages: Chinese (zh)
Events: application filed by Shanghai Jiaotong University; publication of CN112717415A; application granted; publication of CN112717415B

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67: Generating or modifying game content before or while executing the game program, adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods

Abstract

The invention relates to an AI training method for a reinforcement learning battle game based on the information bottleneck theory, comprising the following steps: 1) initializing an AI training model; 2) performing decision interaction in a simulation environment through the game AI to obtain a sample training batch data set; 3) iteratively training the AI training model with a reinforcement learning algorithm on the sample training batch data set obtained from the interaction between the game AI and the environment, and saving the parameters of the AI training model in stages; 4) fixing part of the saved parameters of the AI training models from different stages and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning, so as to obtain final AI training models for AI of different levels and generate the battle game AI files. Compared with the prior art, the method has the advantages of high sampling efficiency, fast training, flexible testing, graded AI levels, and the like.

Description

Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game
Technical Field
The invention relates to the field of game AI learning, and in particular to an AI training method for a reinforcement learning battle game based on the information bottleneck theory.
Background
With the development of deep learning in recent years, many achievements have been obtained in the field of deep reinforcement learning, and an increasing number of methods combining deep learning with reinforcement learning algorithms (such as DQN, A2C, PPO and DDPG) have shown strong performance on video game AI. However, in many reinforcement learning problems the cost of interaction between the agent and the environment is high, so it is desirable to make the algorithm converge as fast as possible in order to save training cost, that is, to learn a higher-level strategy from the same amount of sampling.
In existing fighting games, the man-machine battle mode is one of the important components of the game. Existing game AI is designed by manually specifying a strategy distribution and hand-crafted action mappings, so its behavior patterns are monotonous and lack the flexibility of battles between human players. Meanwhile, existing methods that train game AI with reinforcement learning take raw pixels as input, which carry a large amount of redundant information and reduce the learning efficiency of the network and the speed of the reinforcement learning algorithm. In deep learning experiments, measured by the mutual information between the input-layer and representation-layer variables, a neural network first memorizes the input during training and then compresses the input information according to the specific learning task, discarding useless redundant information, i.e. reducing the mutual information between the input layer and the representation layer; this is the information E-C process. However, existing reinforcement learning algorithms are not optimized for this information extraction process.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an AI training method for a reinforcement learning battle game based on an information bottleneck theory.
The purpose of the invention can be realized by the following technical scheme:
an AI training method for a reinforcement learning battle game based on an information bottleneck theory comprises the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of the game AI and the environment, iteratively training an AI training model by adopting a reinforcement learning algorithm, and storing parameters of the AI training model in stages;
4) fixing part of the saved parameters of the AI training models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain final AI training models for AI of different levels, thereby generating the battle game AI files.
The step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
The step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from the environment

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} are the game picture sampled at the current time t and the game picture at the next time t+1, A_t is the set of AI operations available in the game at the current time t, R_t is the environment reward corresponding to the selected operation, i.e. the game score and the real-time attributes of the game character, and k is the total number of sampling moments;
22) for each sampled game picture X_t in D_env, the game AI obtains m operation samples

{Z_t^i}, i = 1, 2, …, m

according to the policy network model, wherein Z_t^i is the real-time AI operation sampled from the distribution given by the policy network model, and P_φ(·|X_t) is the probability distribution model;
23) the environmental sample batch data D_env and the m operation samples {Z_t^i} obtained by the AI according to the policy network model are integrated correspondingly to obtain the sample training batch data set

D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
In the step 3), an AI training model is iteratively trained by adopting an A2C algorithm.
In the step 3), when the model is iteratively trained, the gradient of the model representation layer is calculated by the following formula:

[formula rendered as an image in the original publication: gradient of the model representation layer]

wherein the left-hand side is the gradient of the model representation layer, P(X) is the distribution probability of the game picture X, φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X, and E denotes expectation.
In the step 3), when the model is iteratively trained, the gradient of the reinforcement learning algorithm is calculated by the following formula:

[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein the left-hand side is the gradient of the reinforcement learning algorithm, and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term.
The expression of the loss function J(Z; θ) is as follows:

[formulas rendered as images in the original publication: the loss function J(Z; θ) and its A2C component]

wherein the first component is the existing loss function in the framework of the A2C algorithm, R is the real-time reward value, i.e. the real-time change of the game score and the game character attributes, α is the reward attenuation coefficient in the A2C algorithm, Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ, and H(P_φ(a_t|X_t; θ)) is the entropy of the distribution P_φ(a_t|X_t; θ).
In the step 3), when the model is iteratively trained with the reinforcement learning algorithm, the network parameters θ and φ are updated respectively and tested for convergence: if they have not converged, training continues; if they have converged, training is stopped and the model is saved in stages. The update expressions for the network parameters θ and φ are as follows:

[formulas rendered as images in the original publication: the update expressions for θ and φ]
the step 4) specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_j (j = 1, 2, …, n-1) saved during the AI training process, each intermediate model comprising a value network model and a policy network model;
42) fixing the convolutional layer parameters of the intermediate model M_j, training again in the same way as in the AI training process to update the fully-connected layer parameters of the intermediate model M_j until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 intermediate models M_1, M_2, …, M_{n-1} retrained in step 42) and of the final converged model M_n obtained in the initial training, and generating from them the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the game AI strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of battle AI of n different capability levels, and merging them with the other code files to construct battle AI of n different capability levels.
In the step 3), when the AI training model is iteratively trained with the A2C algorithm, a Monte Carlo estimation method is adopted for fitting the expectation, and the Stein variational gradient descent method is adopted for the gradient update.
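For orientation only, the following is a minimal PyTorch sketch of a generic Stein variational gradient descent step with an RBF kernel and the median bandwidth heuristic. The gradient expressions of the invention appear only as images in the original publication, so this sketch illustrates the general SVGD mechanism rather than the exact update used here; the function names and the Gaussian toy example are assumptions.

    import torch

    def rbf_kernel(particles, bandwidth=None):
        # particles: (m, d) tensor; diff[i, j] = x_i - x_j
        diff = particles.unsqueeze(1) - particles.unsqueeze(0)
        sq_dist = (diff ** 2).sum(-1)
        if bandwidth is None:                       # median heuristic for the kernel bandwidth
            m = particles.size(0)
            bandwidth = sq_dist.median() / (2.0 * torch.log(torch.tensor(m + 1.0)))
            bandwidth = bandwidth.clamp(min=1e-6)
        k = torch.exp(-sq_dist / (2.0 * bandwidth))
        # grad_k[i, j] = d k(x_j, x_i) / d x_j = (x_i - x_j) / h * k(i, j)
        grad_k = diff / bandwidth * k.unsqueeze(-1)
        return k, grad_k

    def svgd_direction(particles, score_fn):
        # score_fn returns the gradient of the log target density at each particle, shape (m, d)
        m = particles.size(0)
        k, grad_k = rbf_kernel(particles)
        score = score_fn(particles)
        # phi(x_i) = (1/m) * sum_j [ k(x_j, x_i) * score(x_j) + grad_{x_j} k(x_j, x_i) ]
        return (k @ score + grad_k.sum(dim=1)) / m

    # Toy usage: transport random particles toward a standard Gaussian target.
    particles = 3.0 * torch.randn(16, 2)
    for _ in range(200):
        particles = particles + 0.1 * svgd_direction(particles, lambda x: -x)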
Compared with the prior art, the invention has the following advantages:
First, existing policy gradient algorithms such as PPO, A2C and DDPG focus on the convergence of the reinforcement learning algorithm itself and do not consider the problem of extracting information from the environment state into the value function; the invention introduces the information bottleneck theory and adds a mutual information penalty term to the loss function to optimize this extraction process and accelerate training.
Second, in the invention the optimized gradient is obtained with the Stein variational gradient descent method, and the difficulty that the probability distribution of the representation layer conditioned on the input layer cannot be computed in the information bottleneck problem is resolved by using a lower bound in place of the unknown distribution.
Third, compared with traditional manually designed AI, the battle game AI designed by the invention can generate more battle strategies according to the real-time dynamic state of the game and is more flexible in actual testing.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a diagram of the model architecture of the present invention.
FIG. 3 is a diagram of an AI training model according to the invention.
Fig. 4 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in fig. 1, the invention provides an AI training method for a reinforcement learning battle game based on an information bottleneck theory, comprising the following steps:
1) initializing network parameters and hyper-parameters of an AI training model (in this example, a CNN model is adopted, and the specific model structure is shown in FIG. 3), and setting learning rate and the number of samples sampled from parameter distribution;
2) performing decision interaction in a simulation environment through AI to obtain a sample training batch data set;
3) iteratively training an AI training model by adopting a reinforcement learning algorithm (in the example, an A2C algorithm) on the basis of a sample training batch data set obtained by the interaction of AI and the environment, and storing model parameters in stages;
4) fixing part of the saved parameters of the models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain the final strategy models for AI of different levels.
The specific process of each step is as follows:
the step 1) specifically comprises the following steps:
11) determining the AI optional operation set A (taking an arcade fighting game as an example, the set specifically comprises up/down and left/right movement, attack, defense, special-move (skill) operations, and the like) and the number n of AI of different required capability levels according to the game operation description;
12) initializing all corresponding network parameters θ and φ of the value network model and the policy network model;
13) determining a hyper-parameter beta, eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
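For illustration, a minimal PyTorch sketch of the initialization in step 1) is given below. The concrete network structure of FIG. 3 is not reproduced in the text, so the convolutional architecture, the action names, the number of stacked frame channels and the numeric hyper-parameter values used here are assumptions, not the exact configuration of the embodiment.

    import torch
    import torch.nn as nn

    # Hypothetical operation set A for an arcade fighting game (step 11);
    # the real set is taken from the game operation description.
    ACTIONS = ["idle", "up", "down", "left", "right", "attack", "defend", "skill"]

    n_levels = 3             # number n of AI capability levels (step 11)
    beta, eta = 1e-3, 1e-2   # hyper-parameters beta and eta (step 13); placeholder values
    learning_rate = 1e-4     # model learning rate E (step 14)
    m_samples = 8            # number m of samples drawn from the policy distribution (step 15)

    class PolicyNet(nn.Module):
        """Policy network P_phi(.|X): game frame -> distribution over operations."""
        def __init__(self, n_actions, frame_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(frame_channels, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(), nn.Flatten(),
            )
            self.fc = nn.Sequential(nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, n_actions))

        def forward(self, frames):
            return torch.distributions.Categorical(logits=self.fc(self.conv(frames)))

    class ValueNet(nn.Module):
        """Value network Theta: game frame -> scalar value estimate."""
        def __init__(self, frame_channels=4):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(frame_channels, 32, 8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(), nn.Flatten(),
            )
            self.fc = nn.Sequential(nn.LazyLinear(512), nn.ReLU(), nn.Linear(512, 1))

        def forward(self, frames):
            return self.fc(self.conv(frames)).squeeze(-1)

    # Step 12: initialize the policy parameters phi and the value parameters theta.
    policy_net, value_net = PolicyNet(len(ACTIONS)), ValueNet()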
The step 2) specifically comprises the following steps:
21) a batch of data is sampled from the environment:

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} respectively denote the game picture at the current moment and the game picture at the next moment, A_t denotes the set of game AI operations available at that moment (including up/down/left/right movement, attack, defense and special-move operations in the fighting game), and R_t denotes the environment reward corresponding to the state-action pair (including the change in health points, the change in magic points, the distance to the opponent, the skill cooldown time, the real-time game score, and the like).
22) for each game picture X_t sampled in real time, the game AI takes m samples

{Z_t^i}, i = 1, 2, …, m

according to the policy network model, wherein Z_t^i is the real-time AI operation sampled from the distribution given by the policy network model (including how to move, whether to defend, how to attack, and so on).
23) the environmental sample batch data D_env and the data {Z_t^i} obtained by AI sampling are integrated correspondingly to obtain the training batch data

D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
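The sampling of steps 21)-23) can be sketched as follows, assuming a policy network like the one sketched in step 1) and a simplified environment interface in which reset() and step() return a game-frame tensor and a scalar reward; this interface and the helper name are assumptions, and the real reward combines the HP change, MP change, distance to the opponent, skill cooldown and game score described above.

    import torch

    def collect_batch(env, policy_net, k_steps, m_samples):
        """Collect one training batch D as in steps 21)-23) (a sketch)."""
        batch = []
        frame = env.reset()
        for _ in range(k_steps):
            with torch.no_grad():
                dist = policy_net(frame.unsqueeze(0))              # P_phi(.|X_t)
                z_samples = dist.sample((m_samples,)).squeeze(-1)  # m operation samples Z_t^1..Z_t^m
                action = z_samples[0].item()                       # operation actually executed
            next_frame, reward = env.step(action)                  # environment reward R_t
            batch.append((frame, z_samples, action, reward, next_frame))
            frame = next_frame
        return batch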
The gradient of the representation layer on the sample training batch data D in step 3) is calculated as:

[formula rendered as an image in the original publication: gradient of the representation layer]

wherein Z_i denotes a real-time operation sample drawn by the agent from the probability model P_φ(·|X_t), P(X) is the distribution probability of the game picture, and φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X.
The gradient of the reinforcement learning algorithm is:

[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein Z_i denotes a real-time operation sample drawn by the agent from the probability model P_φ(·|X_t), and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term:

[formula rendered as an image in the original publication: the loss function J(Z; θ)]

wherein the first component is the usual loss function in the framework of the A2C algorithm:

[formula rendered as an image in the original publication: the A2C loss function]

wherein R is the real-time reward value (including the change in health points ΔHP, the change in magic points ΔMP, the distance d to the opponent, the skill cooldown time, the real-time game Score, and so on; the concrete expression of R can be designed as required, one example being R = ΔHP + ΔMP + d + Score), α is the reward attenuation coefficient in the algorithm and can be adjusted according to different design requirements, Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ, and H(P) is the entropy of the distribution P. In the whole calculation process, the expectation E is fitted with a Monte Carlo estimation method, and the gradient optimization is computed with the Stein variational gradient descent method.
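To make the shape of the objective concrete, a compact PyTorch sketch of an A2C-style loss with an added mutual-information penalty follows. The exact loss J(Z; θ) of the invention is shown only as images in the original publication, so the particular bottleneck surrogate used here (the average KL divergence between the per-frame policy and the batch-marginal policy, an empirical estimate of the mutual information between the game picture and the sampled operation), the 0.5 value-loss weight and the function signature are assumptions.

    import torch
    import torch.nn.functional as F

    def a2c_ib_loss(policy_net, value_net, frames, actions, returns, beta=1e-3, eta=1e-2):
        """A2C loss plus a mutual-information (information bottleneck) penalty (a sketch)."""
        dist = policy_net(frames)                   # P_phi(.|X_t) for every frame in the batch
        values = value_net(frames)                  # value estimates Theta(X_t)
        advantages = (returns - values).detach()

        policy_loss = -(dist.log_prob(actions) * advantages).mean()
        value_loss = F.mse_loss(values, returns)
        entropy_bonus = dist.entropy().mean()       # H(P_phi(a_t | X_t))

        probs = dist.probs                          # (batch, n_actions)
        marginal = probs.mean(dim=0, keepdim=True)  # empirical estimate of the action marginal
        mi_penalty = (probs * (probs.clamp_min(1e-8).log()
                               - marginal.clamp_min(1e-8).log())).sum(dim=1).mean()

        return policy_loss + 0.5 * value_loss - eta * entropy_bonus + beta * mi_penalty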
The A2C algorithm adopted in this embodiment is implemented mainly with PyTorch, and the specific AI training model architecture is shown in FIG. 2.
The step of specifically updating the model parameters comprises:
31) updating the value network parameter θ according to the following updating principle:

[formula rendered as an image in the original publication: the update expression for θ]

32) updating the policy network parameter φ according to the following updating principle:

[formula rendered as an image in the original publication: the update expression for φ]
33) judging whether the model parameters have converged according to the magnitude of the parameter updates; if not, training continues from step 2), and if convergence is reached, training is stopped and the model is saved. In particular, during the whole training process all network model parameters are saved in time order every 100 parameter updates.
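The outer loop of steps 31)-33) can then be sketched as follows, reusing the collect_batch and a2c_ib_loss helpers from the sketches above; the Adam optimizers, the simple discounted-return computation with coefficient alpha and the replacement of the convergence test by a fixed number of updates are assumptions, while the checkpoint interval of 100 updates follows the embodiment.

    import torch

    def train(policy_net, value_net, env, n_updates, k_steps=128, m_samples=8,
              learning_rate=1e-4, alpha=0.99, save_every=100):
        """Update theta and phi and save all parameters in time order (a sketch of 31)-33))."""
        opt_theta = torch.optim.Adam(value_net.parameters(), lr=learning_rate)
        opt_phi = torch.optim.Adam(policy_net.parameters(), lr=learning_rate)
        checkpoints = []
        for update in range(1, n_updates + 1):
            batch = collect_batch(env, policy_net, k_steps, m_samples)
            frames = torch.stack([b[0] for b in batch])
            actions = torch.tensor([b[2] for b in batch])
            rewards = [b[3] for b in batch]

            returns, g = [], 0.0                    # discounted return with attenuation alpha
            for r in reversed(rewards):
                g = r + alpha * g
                returns.append(g)
            returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

            loss = a2c_ib_loss(policy_net, value_net, frames, actions, returns)
            opt_theta.zero_grad()
            opt_phi.zero_grad()
            loss.backward()
            opt_theta.step()                        # 31) update the value parameters theta
            opt_phi.step()                          # 32) update the policy parameters phi

            if update % save_every == 0:            # 33) save all parameters every 100 updates
                checkpoints.append({
                    "update": update,
                    "policy": {k: v.clone() for k, v in policy_net.state_dict().items()},
                    "value": {k: v.clone() for k, v in value_net.state_dict().items()},
                })
        return checkpoints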
Step 4) fixes part of the saved model parameters from different stages and retrains the remaining parameters with the reinforcement learning algorithm for fine-tuning, obtaining the final strategy models for AI of different levels.
The method specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_i (i = 1, 2, …, n-1) saved during the AI training process, where each M_i comprises both a value network model and a policy network model;
42) fixing the convolutional layer parameters of the model M_i, training again in the same way as in the AI training process to update the fully-connected layer parameters of M_i until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 models M_1, M_2, …, M_{n-1} retrained in 42) and of the final converged model M_n obtained in the initial training, which yields the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of n battle AI of different capability levels, and merging them with the other code files to construct n battle AI of different capability levels.
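A sketch of the grading step follows, again in PyTorch: the convolutional parameters of each saved intermediate model are frozen, only the fully-connected parts are retrained with the same training routine, and the resulting policy networks are exported as the strategy distribution files F_1, …, F_n. The factory helpers, the conv/fc attribute split and the file naming are assumptions, and the selection of which n-1 checkpoints serve as the intermediate models M_1, …, M_{n-1} is assumed to have been made beforehand.

    import torch

    def finetune_levels(checkpoints, final_policy_net, make_policy_net, make_value_net,
                        env, n_updates):
        """Build graded AI levels from saved checkpoints (a sketch of steps 41)-44))."""
        strategy_files = []
        for i, ckpt in enumerate(checkpoints, start=1):     # intermediate models M_1..M_{n-1}
            policy_net, value_net = make_policy_net(), make_value_net()
            policy_net.load_state_dict(ckpt["policy"])
            value_net.load_state_dict(ckpt["value"])
            for net in (policy_net, value_net):             # 42) fix convolutional parameters
                for p in net.conv.parameters():
                    p.requires_grad_(False)
            train(policy_net, value_net, env, n_updates)    # retrain fully-connected parts only
            path = f"strategy_F{i}.pt"                      # 43) export the policy network alone
            torch.save(policy_net.state_dict(), path)
            strategy_files.append(path)
        path = f"strategy_F{len(checkpoints) + 1}.pt"       # M_n: final converged model
        torch.save(final_policy_net.state_dict(), path)
        strategy_files.append(path)
        return strategy_files                               # 44) F_1..F_n drive the n AI levels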
According to the method, by introducing the information bottleneck theory, a mutual information penalty term is added to the loss function of the A2C algorithm to accelerate the AI training process; meanwhile, the output strategies of AI of different grades are re-smoothed by pre-training and then fine-tuning the model parameters. Compared with the prior art, the invention achieves high sampling efficiency during training, further accelerates the training of battle game AI of different capability grades, and provides sufficient flexibility.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A reinforcement learning fighting game AI training method based on an information bottleneck theory is characterized by comprising the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of game AI and environment, adopting a reinforcement learning algorithm to iteratively train an AI training model, and saving parameters of the AI training model in stages, in the step 3), adopting an A2C algorithm to iteratively train the AI training model, wherein when the model is iteratively trained, the gradient of a model representation layer is calculated by the following formula:
[formula rendered as an image in the original publication: gradient of the model representation layer]

wherein the left-hand side is the gradient of the model representation layer, P(X) is the distribution probability of the game picture X, φ(Z_i|X) is the probability of the AI operation Z_i given by the policy network model when the game picture is X, and E denotes expectation;
in the iterative training of the model, the gradient of the reinforcement learning algorithm is calculated as:
[formula rendered as an image in the original publication: gradient of the reinforcement learning algorithm]

wherein the left-hand side is the gradient of the reinforcement learning algorithm, and J(Z; θ) is the loss function of the A2C algorithm with the added information bottleneck loss term;
the expression of the loss function J (Z; theta) is as follows:
[formula rendered as an image in the original publication: the loss function J(Z; θ)]

wherein the first component is the loss function in the framework of the A2C algorithm, R is the real-time reward value, i.e. the real-time change of the game score and the game character attributes, α is the reward attenuation coefficient in the A2C algorithm, and Θ_Φ(Z_t, X) is the real-time value estimate, produced by the value network model, of the state-decision pair (Z_t, X) when the policy network model is φ;
4) fixing part of the saved parameters of the AI training models from different stages, and retraining the remaining parameters with the reinforcement learning algorithm for fine-tuning to obtain final AI training models for AI of different levels, thereby generating the battle game AI file.
2. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
3. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 2, wherein the step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from the environment

D_env = {(X_t, A_t, R_t, X_{t+1})}, t = 1, 2, …, k

wherein X_t and X_{t+1} are the game picture sampled at the current time t and the game picture at the next time t+1, A_t is the set of AI operations available in the game at the current time t, R_t is the environment reward corresponding to the selected operation, namely the game score and the real-time attributes of the game character, and k is the total number of sampling moments;
22) for each sampled game frame
Figure FDA0003592515980000022
The game AI obtains m operation samples according to the strategy network model
Figure FDA0003592515980000023
Wherein the content of the first and second substances,
Figure FDA0003592515980000024
for AI real-time operations sampled in the distribution according to a strategic network model, P φ (.|X t ) Is a probability distribution model;
23) the environmental sample batch data D_env and the m operation samples {Z_t^i} obtained by the AI according to the policy network model are integrated correspondingly to obtain the sample training batch data set D = {(X_t, {Z_t^i, i = 1, …, m}, A_t, R_t, X_{t+1})}, t = 1, 2, …, k.
4. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein in the step 3), when the reinforcement learning algorithm is adopted to iteratively train the model, the network parameters θ and φ are respectively updated, whether the network parameters are converged is judged, if the network parameters are not converged, the training is performed again, if the convergence is reached, the training is stopped and the model is saved in stages, and the network parameters θ and φ are updated according to the following expression:
[formulas rendered as images in the original publication: the update expressions for the network parameters θ and φ]
5. the AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 4) specifically comprises the following steps:
41) taking out, according to the required number n of AI grades, the n-1 intermediate models M_j (j = 1, 2, …, n-1) saved during the AI training process, each intermediate model comprising a value network model and a policy network model;
42) fixing the convolutional layer parameters of the intermediate model M_j, training again in the same way as in the AI training process to update the fully-connected layer parameters of the intermediate model M_j until convergence, and saving the converged model parameters;
43) taking out separately the policy network models of the n-1 intermediate models M_1, M_2, …, M_{n-1} retrained in step 42) and of the final converged model M_n obtained in the initial training, and generating from them the n game AI strategy distribution files F_1, F_2, …, F_n corresponding to different capability levels;
44) using the game AI strategy distribution files F_1, F_2, …, F_n as the real-time operation strategies of battle AI of n different capability levels, and merging them with the other code files to construct battle AI of n different capability levels.
6. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 4, wherein in the step 3), when the AI training model is iteratively trained with the A2C algorithm, the expectation is fitted by a Monte Carlo estimation method, and the gradient is updated with the Stein variational gradient descent method.
CN202110091260.4A (Active, granted as CN112717415B) | Priority date: 2021-01-22 | Filing date: 2021-01-22 | Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Priority Applications (1)

Application Number: CN202110091260.4A (granted as CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Applications Claiming Priority (1)

Application Number: CN202110091260.4A (granted as CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Publications (2)

Publication Number | Publication Date
CN112717415A (en) | 2021-04-30
CN112717415B (en) | 2022-08-16

Family

ID=75595220

Family Applications (1)

Application Number: CN202110091260.4A (Active, CN112717415B) | Priority Date: 2021-01-22 | Filing Date: 2021-01-22 | Title: Information bottleneck theory-based AI (Artificial Intelligence) training method for reinforcement learning battle game

Country Status (1)

Country Link
CN (1) CN112717415B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113641905B (en) * 2021-08-16 2023-10-03 京东科技信息技术有限公司 Model training method, information pushing method, device, equipment and storage medium
CN116109525B (en) * 2023-04-11 2024-01-05 北京龙智数科科技服务有限公司 Reinforcement learning method and device based on multidimensional data enhancement
CN116808590B (en) * 2023-08-25 2023-11-10 腾讯科技(深圳)有限公司 Data processing method and related device
CN117162102A (en) * 2023-10-30 2023-12-05 南京邮电大学 Independent near-end strategy optimization training acceleration method for robot joint action

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109923560A (en) * 2016-11-04 2019-06-21 谷歌有限责任公司 Neural network is trained using variation information bottleneck
CN111886059A (en) * 2018-03-21 2020-11-03 威尔乌集团 Automatically reducing use of cheating software in an online gaming environment
CN111985640A (en) * 2020-07-10 2020-11-24 清华大学 Model training method based on reinforcement learning and related device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9679258B2 (en) * 2013-10-08 2017-06-13 Google Inc. Methods and apparatus for reinforcement learning
KR20180087060A (en) * 2017-01-24 2018-08-01 라인 가부시키가이샤 Method, apparatus, computer program and recording medium for providing game service
CN110327624B (en) * 2019-07-03 2023-03-17 广州多益网络股份有限公司 Game following method and system based on curriculum reinforcement learning
CN112169311A (en) * 2020-10-20 2021-01-05 网易(杭州)网络有限公司 Method, system, storage medium and computer device for training AI (Artificial Intelligence)
CN112221152A (en) * 2020-10-27 2021-01-15 腾讯科技(深圳)有限公司 Artificial intelligence AI model training method, device, equipment and medium


Also Published As

Publication number Publication date
CN112717415A (en) 2021-04-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant