CN112717415B - Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game - Google Patents
Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game Download PDFInfo
- Publication number
- CN112717415B CN112717415B CN202110091260.4A CN202110091260A CN112717415B CN 112717415 B CN112717415 B CN 112717415B CN 202110091260 A CN202110091260 A CN 202110091260A CN 112717415 B CN112717415 B CN 112717415B
- Authority
- CN
- China
- Prior art keywords
- model
- training
- game
- reinforcement learning
- parameters
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Images
Classifications
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/60—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
- A63F13/67—Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to an AI training method of a reinforcement learning battle game based on an information bottleneck theory, which comprises the following steps: 1) initializing an AI training model; 2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set; 3) according to a sample training batch data set obtained by interaction of the game AI and the environment, iteratively training an AI training model by adopting a reinforcement learning algorithm, and storing parameters of the AI training model in stages; 4) and fixing part of the stored parameters of the AI training models in different stages, and retraining the rest parameters by using a reinforcement learning algorithm for fine adjustment to obtain the final AI training models of different levels of AI, thereby generating the fighting game AI file. Compared with the prior art, the method has the advantages of high sampling efficiency, high training speed, high testing flexibility, AI grading and the like.
Description
Technical Field
The invention relates to the field of game intelligent AI learning, in particular to an AI training method for a reinforcement learning battle game based on an information bottleneck theory.
Background
With the development of deep learning technology in recent years, many achievements are obtained in the field of deep reinforcement learning, and more methods (such as DQN, A2C, PPO, DDPG, and the like) combining deep learning and reinforcement learning algorithms show strong effects in the aspect of video games AI, however, in many cases, in the reinforcement learning problem, the interaction cost of an agent and the environment is high, so it is desirable to make the algorithms converge as fast as possible, so as to save the training cost, that is, a higher-level intelligent strategy is learned through the same sampling rate.
In the existing fighting game, a man-machine fighting mode is one of important components of the game, the existing game AI is designed by artificially setting strategy distribution and targeted action mapping, so that the course is single and the flexibility of fighting among players is not provided, and meanwhile, in the existing method for training the game AI by reinforcement learning, the original pixels are used as input, and a lot of redundant information is carried to influence the network learning efficiency and the speed of a reinforcement learning algorithm. In the deep learning experiment, the neural network firstly remembers the input by the mutual Information of the variables of the input layer and the representation layer in the training Process, and then compresses the input Information according to a specific learning task to discard useless redundant Information, namely, reduce the mutual Information between the input layer and the representation layer, which is the Information E-C Process, but the existing reinforcement learning algorithm is not optimized for the Information extraction Process.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an AI training method for a reinforcement learning battle game based on an information bottleneck theory.
The purpose of the invention can be realized by the following technical scheme:
an AI training method for a reinforcement learning battle game based on an information bottleneck theory comprises the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of the game AI and the environment, iteratively training an AI training model by adopting a reinforcement learning algorithm, and storing parameters of the AI training model in stages;
4) and fixing part of the stored parameters of the AI training models in different stages, and retraining the rest parameters by using a reinforcement learning algorithm for fine adjustment to obtain the final AI training models of different levels of AI, thereby generating the fighting game AI file.
The step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
The step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from environmentWherein, X t ,X t+1 A game picture sampled at the current time t and a game picture at the next time t +1, A t Optional set of operations AI for the current time t game, R t K is the total number of sampling moments in order to select the environment reward corresponding to the operation, namely the game integral and the real-time attribute of the game role;
22) for each sampled game frameThe game AI obtains m operation samples according to the strategy network modelWherein the content of the first and second substances,for AI real-time operations sampled in the distribution according to a strategic network model, P φ (.|X t ) Is a probability distribution model;
23) batch data of environmental samplesAnd obtaining m operation samples by AI according to the strategy network modelCorresponding integration is carried out to obtain a sample training batch data set
In the step 3), an AI training model is iteratively trained by adopting an A2C algorithm.
In the step 3), when the model is iteratively trained, the gradient of the model representation layer is calculated by the following formula:
wherein the content of the first and second substances,is the gradient of the model representation layer, P (X) is the distribution probability of the game picture X, phi (Z) i X) AI operation Z when the game screen obtained by the strategy network model is X i E represents expectation.
In the step 3), when the model is iteratively trained, the gradient of the reinforcement learning algorithm is calculated by the following formula:
wherein the content of the first and second substances,to strengthen the gradient of the learning algorithm, J (Z; θ) is the loss function of the A2C algorithm that adds an information bottleneck loss term.
The expression of the loss function J (Z; theta) is as follows:
wherein the content of the first and second substances,for the existing loss function in the framework of the A2C algorithm, R is the real-time prize value, i.e., the real-time change of the game credits and the game character attributes, α is the prize attenuation coefficient in the A2C algorithm, Θ Φ (Z t X) is a state decision pair (Z) estimated using the value network model when the policy network model is phi t X) real-time value estimation, H (P) φ (a t |X t (ii) a θ)) is P distribution P φ (a t |X t (ii) a θ) entropy.
In the step 3), when the model is iteratively trained by adopting a reinforcement learning algorithm, the network parameters theta and phi are respectively updated, whether the network parameters are converged is judged, if the network parameters are not converged, the training is carried out again, if the network parameters are converged, the training is stopped and the model is saved in stages, wherein the updating expression of the network parameters theta and phi is as follows:
the step 4) specifically comprises the following steps:
41) taking out n-1 intermediate models M stored in the AI training process according to the required AI grade number n j (j ═ 1,2, …, n-1), each intermediate model comprising a value network model and a policy network model;
42) the intermediate model M j The convolution layer parameters are fixed, and the middle model M is updated by adopting the same training mode in the AI training process again j The parameters of the middle full-connection layer part are converged, and the converged model parameters are stored;
43) the n-1 middle models M retrained in the step 42) are used 1 ,M 2 ,…,M n-1 And the final convergence model M obtained in the initial training n The strategy network model in the game is taken out independently, and n game AI strategy distribution files F corresponding to different capability levels are generated according to the strategy network model 1 ,F 2 ,…,F n ;
44) Distributing files F of game AI strategy 1 ,F 2 ,…,F n The real-time operation strategy is used as a real-time operation strategy of the fighting AI with n different capability levels, and is merged with other code files to construct the fighting AI with n different capability levels.
In the step 3), when an AI training model is iteratively trained by using an A2C algorithm, a monte carlo estimation method is adopted for fitting the expectation, and a Stein variation gradient descent method is adopted for gradient update.
Compared with the prior art, the invention has the following advantages:
firstly, the existing strategy gradient algorithms such as PPO, A2C and DDPG focus the visual field on the part of the convergence of the reinforcement learning algorithm, and the problem of information extraction from the environment state to the part of a cost function is not considered.
In the invention, the optimized gradient is obtained by adopting a Stein variation gradient descent method, and the problem that the probability distribution model of the representation layer under the condition of the input layer cannot be calculated in the information bottleneck problem is solved by utilizing the lower bound to normalize the unknown distribution.
Compared with the traditional manually designed AI, the fighting game AI designed by the invention has more possibly generated fighting strategies according to the real-time dynamic state of the game and more flexibility in actual test.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
FIG. 2 is a diagram of the model architecture of the present invention.
FIG. 3 is a diagram of an AI training model according to the invention.
Fig. 4 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of protection of the present invention.
Examples
As shown in fig. 1, the invention provides an AI training method for a reinforcement learning battle game based on an information bottleneck theory, comprising the following steps:
1) initializing network parameters and hyper-parameters of an AI training model (in this example, a CNN model is adopted, and the specific model structure is shown in FIG. 3), and setting learning rate and the number of samples sampled from parameter distribution;
2) performing decision interaction in a simulation environment through AI to obtain a sample training batch data set;
3) iteratively training an AI training model by adopting a reinforcement learning algorithm (in the example, an A2C algorithm) on the basis of a sample training batch data set obtained by the interaction of AI and the environment, and storing model parameters in stages;
4) and fixing part of the stored parameters of the models in different stages, and retraining the rest parameters by using a reinforcement learning algorithm for fine adjustment to obtain final strategy models of different levels of AI.
The specific process of each step is as follows:
the step 1) specifically comprises the following steps:
11) determining an AI optional operation set A (taking a street game as an example, specifically comprising operations of up-down, left-right movement, attack operation, defense operation, skill recruitment and the like in the street game) and the AI number n of required different capability levels according to the game operation description;
12) initializing all corresponding network parameters theta and phi in a value network model theta and a strategy network model phi;
13) determining a hyper-parameter beta, eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
The step 2) specifically comprises the following steps:
21) a batch of data is sampled from the environment:wherein X t ,X t+1 Respectively representing the game picture at that moment and the game picture at the next moment, A t Representing the game AI optional operation set (including up, down, left and right movement operation, attack operation, defense operation and skill recruitment in the street game)Operation), R t Representing environmental rewards corresponding to the state behaviors (including life value change, magic value change, distance from opponents, skill cooling time, game real-time scores and the like).
22) Real-time sampling of each game frameThe game AI takes m samples according to the strategy network modelWhereinThe AI sampled in the distribution according to the strategy network model is operated in real time (including operations of how to move, whether to defend, how to attack and the like).
23) Batch data of environmental samplesAnd data obtained by AI samplingCorresponding integration is carried out to obtain training batch data
Calculating the gradient of the representation layer in the sample training batch data D in the step 3) as follows:
wherein Z is i Representing agent slave probabilistic model P φ (.|X t ) The real-time operation sample obtained by sampling, P (X) is the game picture distribution probability, phi (Z) i X) AI operation Z when the game screen obtained by the strategy network model is X i The probability of (c).
The gradient of the reinforcement learning algorithm is:
wherein Z is i Representing agent slave probabilistic model P φ (.|X t ) The real-time operating sample obtained from the middle sampling, J (Z; θ) is the loss function of the A2C algorithm that adds the information bottleneck loss term:
wherein, R is real-time reward value (including life value change condition delta HP, magic value change condition delta MP, distance d between opponents, skill cooling time and the like, game real-time Score Score and the like, a concrete expression of R can be designed according to requirements, an example is given here, R is delta HP + delta MP + delta HP + d + Score), alpha is reward attenuation coefficient in the algorithm and can be automatically adjusted according to different design requirements, and theta is Φ (Z t X) is a state decision pair (Z) estimated using the value network model when the policy network model is phi t And X) real-time value estimation, wherein H (P) is the entropy of P distribution, in the whole calculation process, fitting the expected E by using a Monte Carlo estimation method, and performing gradient optimization calculation by using a Stein variation gradient descent method.
The A2C algorithm adopted in this embodiment is mainly implemented by a Pytorch, and a specific AI training model architecture is shown in fig. 2.
The step of specifically updating the model parameters comprises:
31) updating the model network parameter phi according to the following updating principle:
32) updating the model network parameter phi according to the following updating principle:
33) and judging whether the model parameters are converged according to the parameter updating range, if not, re-training from the step 2, and if the convergence is reached, stopping training and storing the model, and particularly, storing all network model parameters in a time sequence every 100 times of updating the parameters in the whole training process.
And 4) fixing part of stored parameters of the models in different stages, and retraining the rest parameters by using a reinforcement learning algorithm for fine adjustment to obtain final strategy models of different levels of AI.
The method specifically comprises the following steps:
41) taking out n-1 intermediate models M stored in the AI training process according to the required AI grade number n i (i-1, 2, …, n-1) M herein i Simultaneously, the method comprises a value network model and a strategy network model;
42) model M i The convolution layer parameters are fixed, and the model M is updated by adopting the same training mode in the AI training process again i The parameters of the middle full-connection layer part are up to convergence, and the converged model parameters are stored;
43) the n-1 models M retrained in 42) are added 1 ,M 2 ,…,M n-1 And the final convergence model M obtained in the initial training n The strategy network model in (1) is taken out independently, and n game AI strategy distribution files F corresponding to different capability levels can be obtained 1 ,F 2 ,…,F n 。
44) Distributing policy to file F 1 ,F 2 ,…,F n Real-time operating strategies as n competing AI's of different capability levels, and other generationsThe code files are merged together to construct n fighting AI with different capability levels.
According to the method, by introducing an information bottleneck theory, a mutual information penalty item is added into a loss function of an A2C algorithm to accelerate an AI training process, and meanwhile, the output strategies of AI of different grades are re-smoothed by using the technical means of pre-training and fine-tuning model parameters. Compared with the prior art, the invention has the effects of high sampling efficiency in the training process, further accelerating the training of the fighting games AI with different capability grades and enough flexibility.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A reinforcement learning fighting game AI training method based on an information bottleneck theory is characterized by comprising the following steps:
1) initializing an AI training model;
2) carrying out decision interaction in a simulation environment through a game AI to obtain a sample training batch data set;
3) according to a sample training batch data set obtained by interaction of game AI and environment, adopting a reinforcement learning algorithm to iteratively train an AI training model, and saving parameters of the AI training model in stages, in the step 3), adopting an A2C algorithm to iteratively train the AI training model, wherein when the model is iteratively trained, the gradient of a model representation layer is calculated by the following formula:
wherein the content of the first and second substances,ladder for model representation layerDegree, P (X) is the distribution probability of the game picture X, phi (Z) i X) AI operation Z when the game screen obtained by the strategy network model is X i E represents expectation;
in the iterative training of the model, the gradient of the reinforcement learning algorithm is calculated as:
wherein the content of the first and second substances,in order to strengthen the gradient of the learning algorithm, J (Z; theta) is a loss function of the A2C algorithm for increasing the information bottleneck loss term;
the expression of the loss function J (Z; theta) is as follows:
wherein the content of the first and second substances,is a loss function in the framework of the A2C algorithm, R is the real-time prize value, i.e., the real-time change in game credits and game character attributes, α is the prize attenuation coefficient in the A2C algorithm, Θ Φ (Z t X) is a state decision pair (Z) estimated using the value network model when the policy network model is phi t X) real-time value estimation;
4) and fixing part of the stored parameters of the AI training models in different stages, and retraining the rest parameters by using a reinforcement learning algorithm for fine adjustment to obtain the final AI training models of different levels of AI, thereby generating the fighting game AI file.
2. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) determining an AI optional operation set A and the number n of AI with different required capability levels according to the game operation description;
12) initializing all network parameters in the model, including a value network model parameter theta and a strategy network model parameter phi;
13) determining hyper-parameters beta and eta according to the resolution of the game picture;
14) setting a model learning rate E;
15) the number of samples m sampled from the probability distribution model is set.
3. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 2, wherein the step 2) specifically comprises the following steps:
21) sampling environmental sample batch data from environmentWherein, X t ,X t+1 A game picture sampled at the current time t and a game picture at the next time t +1, A t Optional set of operations AI for the current time t game, R t K is the total number of sampling moments in order to select the environment reward corresponding to the operation, namely the game integral and the real-time attribute of the game role;
22) for each sampled game frameThe game AI obtains m operation samples according to the strategy network modelWherein the content of the first and second substances,for AI real-time operations sampled in the distribution according to a strategic network model, P φ (.|X t ) Is a probability distribution model;
4. The AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein in the step 3), when the reinforcement learning algorithm is adopted to iteratively train the model, the network parameters θ and φ are respectively updated, whether the network parameters are converged is judged, if the network parameters are not converged, the training is performed again, if the convergence is reached, the training is stopped and the model is saved in stages, and the network parameters θ and φ are updated according to the following expression:
5. the AI training method for the reinforcement learning battle game based on the information bottleneck theory as claimed in claim 1, wherein the step 4) specifically comprises the following steps:
41) taking out n-1 intermediate models M stored in the AI training process according to the required AI grade number n j (j ═ 1,2, …, n-1), each intermediate model comprising a value network model and a policy network model;
42) the intermediate model M j Parameter fixing of the convolutional layer, re-samplingTraining in the same way as in the AI training process, and updating the intermediate model M j The parameters of the middle full-connection layer part are converged, and the converged model parameters are stored;
43) the n-1 middle models M retrained in the step 42) are used 1 ,M 2 ,…,M n-1 And the final convergence model M obtained in the initial training n The strategy network model in the game is taken out independently, and n game AI strategy distribution files F corresponding to different capability levels are generated according to the strategy network model 1 ,F 2 ,…,F n ;
44) Distributing files F of game AI strategy 1 ,F 2 ,…,F n The real-time operation strategy is used as a real-time operation strategy of the fighting AI with n different capability levels, and is merged with other code files to construct the fighting AI with n different capability levels.
6. The AI training method for reinforcement learning battle games based on information bottleneck theory as claimed in claim 4, wherein in the step 3), when the AI training model is iteratively trained by the A2C algorithm, the expectation is fitted by the Monte Carlo estimation method, and the gradient is updated by the Stein variation gradient descent method.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110091260.4A CN112717415B (en) | 2021-01-22 | 2021-01-22 | Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110091260.4A CN112717415B (en) | 2021-01-22 | 2021-01-22 | Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112717415A CN112717415A (en) | 2021-04-30 |
CN112717415B true CN112717415B (en) | 2022-08-16 |
Family
ID=75595220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110091260.4A Active CN112717415B (en) | 2021-01-22 | 2021-01-22 | Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112717415B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113269315B (en) * | 2021-06-29 | 2024-04-02 | 安徽寒武纪信息科技有限公司 | Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning |
CN113641905B (en) * | 2021-08-16 | 2023-10-03 | 京东科技信息技术有限公司 | Model training method, information pushing method, device, equipment and storage medium |
CN116109525B (en) * | 2023-04-11 | 2024-01-05 | 北京龙智数科科技服务有限公司 | Reinforcement learning method and device based on multidimensional data enhancement |
CN116808590B (en) * | 2023-08-25 | 2023-11-10 | 腾讯科技(深圳)有限公司 | Data processing method and related device |
CN117162102A (en) * | 2023-10-30 | 2023-12-05 | 南京邮电大学 | Independent near-end strategy optimization training acceleration method for robot joint action |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109923560A (en) * | 2016-11-04 | 2019-06-21 | 谷歌有限责任公司 | Neural network is trained using variation information bottleneck |
CN111886059A (en) * | 2018-03-21 | 2020-11-03 | 威尔乌集团 | Automatically reducing use of cheating software in an online gaming environment |
CN111985640A (en) * | 2020-07-10 | 2020-11-24 | 清华大学 | Model training method based on reinforcement learning and related device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9679258B2 (en) * | 2013-10-08 | 2017-06-13 | Google Inc. | Methods and apparatus for reinforcement learning |
KR20180087060A (en) * | 2017-01-24 | 2018-08-01 | 라인 가부시키가이샤 | Method, apparatus, computer program and recording medium for providing game service |
CN110327624B (en) * | 2019-07-03 | 2023-03-17 | 广州多益网络股份有限公司 | Game following method and system based on curriculum reinforcement learning |
CN112169311A (en) * | 2020-10-20 | 2021-01-05 | 网易(杭州)网络有限公司 | Method, system, storage medium and computer device for training AI (Artificial Intelligence) |
CN112221152A (en) * | 2020-10-27 | 2021-01-15 | 腾讯科技(深圳)有限公司 | Artificial intelligence AI model training method, device, equipment and medium |
-
2021
- 2021-01-22 CN CN202110091260.4A patent/CN112717415B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109923560A (en) * | 2016-11-04 | 2019-06-21 | 谷歌有限责任公司 | Neural network is trained using variation information bottleneck |
CN111886059A (en) * | 2018-03-21 | 2020-11-03 | 威尔乌集团 | Automatically reducing use of cheating software in an online gaming environment |
CN111985640A (en) * | 2020-07-10 | 2020-11-24 | 清华大学 | Model training method based on reinforcement learning and related device |
Also Published As
Publication number | Publication date |
---|---|
CN112717415A (en) | 2021-04-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112717415B (en) | Information bottleneck theory-based AI (Artificial intelligence) training method for reinforcement learning battle game | |
CN112668235B (en) | Robot control method based on off-line model pre-training learning DDPG algorithm | |
CN111310915B (en) | Data anomaly detection defense method oriented to reinforcement learning | |
CN111766782B (en) | Strategy selection method based on Actor-Critic framework in deep reinforcement learning | |
US20220176248A1 (en) | Information processing method and apparatus, computer readable storage medium, and electronic device | |
CN111260027B (en) | Intelligent agent automatic decision-making method based on reinforcement learning | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN109284812B (en) | Video game simulation method based on improved DQN | |
CN107346459B (en) | Multi-mode pollutant integrated forecasting method based on genetic algorithm improvement | |
CN111856925B (en) | State trajectory-based confrontation type imitation learning method and device | |
CN112488310A (en) | Multi-agent group cooperation strategy automatic generation method | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
CN111159489A (en) | Searching method | |
CN113947022B (en) | Near-end strategy optimization method based on model | |
CN112257348B (en) | Method for predicting long-term degradation trend of lithium battery | |
Tong et al. | Enhancing rolling horizon evolution with policy and value networks | |
KR102209917B1 (en) | Data processing apparatus and method for deep reinforcement learning | |
CN116090549A (en) | Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium | |
Liu et al. | Forward-looking imaginative planning framework combined with prioritized-replay double DQN | |
CN116204849A (en) | Data and model fusion method for digital twin application | |
CN115587615A (en) | Internal reward generation method for sensing action loop decision | |
Villarrubia-Martin et al. | A hybrid online off-policy reinforcement learning agent framework supported by transformers | |
CN116521584B (en) | MPC cache updating method and system based on multiple intelligent agents | |
Jia et al. | DQN Algorithm Based on Target Value Network Parameter Dynamic Update | |
Desai et al. | Deep Reinforcement Learning to Play Space Invaders |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |