CN114154397A - Implicit adversary modeling method based on deep reinforcement learning - Google Patents
- Publication number
- CN114154397A CN114154397A CN202111316717.3A CN202111316717A CN114154397A CN 114154397 A CN114154397 A CN 114154397A CN 202111316717 A CN202111316717 A CN 202111316717A CN 114154397 A CN114154397 A CN 114154397A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- A—HUMAN NECESSITIES
- A63—SPORTS; GAMES; AMUSEMENTS
- A63F—CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
- A63F13/00—Video games, i.e. games using an electronically generated display having two or more dimensions
- A63F13/80—Special adaptations for executing a specific game genre or game mode
- A63F13/822—Strategy games; Role-playing games
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses an implicit opponent modeling method based on deep reinforcement learning, belonging to the field of opponent modeling in multi-agent reinforcement learning. Addressing the opponent modeling problem in dynamic game environments, the invention provides an improved implicit opponent modeling method built on deep reinforcement learning technology. The implicit modeling method does not depend on domain-specific knowledge, can adapt to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.
Description
Technical Field
The invention belongs to the field of opponent modeling in multi-agent reinforcement learning, and in particular relates to an implicit opponent modeling method based on deep reinforcement learning.
Background
Intelligent decision-making aims to let an agent make sound decisions in a game environment so as to maximize its own payoff. If the opponent's actions, preferences, and so on are modeled during this process, the opponent's behavior can be better predicted and the agent's decisions optimized accordingly. For example, in a board game, if one side can predict the opponent's next move, it can lay out a targeted strategy in advance; in autonomous driving, if a vehicle can anticipate the movements of other vehicles or pedestrians, it can take evasive action early. Modeling other agents in a game environment is therefore critical to decision optimization, and opponent modeling has become an important research direction in the field of artificial intelligence.
Most existing opponent modeling techniques assume that the opponent follows a fixed strategy. In most real game environments, however, the opponent dynamically changes its strategy to maximize its own payoff, and the payoff obtained by the main agent is then strongly affected by these strategy changes. In this case, dynamic opponent features must be modeled to accommodate changes in the opponent's policy. Some existing opponent modeling techniques do achieve dynamic modeling of opponent features under different constraints. For example, the AWESOME algorithm proposed by Vincent Conitzer et al. at Carnegie Mellon University guarantees that the main agent makes the optimal decision provided the opponent's policy eventually becomes stationary. The DriftER algorithm proposed by Pablo Hernandez-Leal of CWI and Yusen Zhan et al. of Washington State University assumes that the opponent switches among a set of fixed policies; the main agent monitors the moments at which the opponent changes policy through its prediction error, and readjusts its own policy accordingly.
The above opponent modeling techniques all belong to explicit modeling. In explicit modeling, the modeling of the opponent is separated from the environment-based planning process, so modeling usually requires a large amount of domain-specific knowledge. This makes explicit modeling difficult to apply in domains lacking such knowledge and difficult to transfer from one domain to another. Implicit modeling, in contrast, combines the modeling and planning processes, requires no domain-specific knowledge, and models the opponent purely from the historical interaction information of both sides, making the construction of a general opponent modeling framework possible.
Thanks to the rapid development of deep reinforcement learning in recent years, a series of new deep reinforcement learning methods have been proposed, providing new ideas for opponent modeling. A representative work is the DRON algorithm proposed by He et al., which performs implicit opponent modeling in dynamic game environments. Building on the DQN algorithm, DRON reads the opponent's historical interaction information and implicitly encodes the opponent features together with the environment features in a neural network, so that opponent modeling can be performed without any domain knowledge. The model performs well in soccer and question-answering games but, limited by the characteristics of the DQN algorithm, it still suffers from overestimation and slow convergence.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an improved implicit opponent modeling method for the opponent modeling problem in dynamic game environments, built on deep reinforcement learning technology. The implicit modeling method does not depend on domain-specific knowledge, can adapt to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.
The technical scheme of the invention is as follows:
An implicit opponent modeling method based on deep reinforcement learning comprises two neural network models for implicit opponent modeling, DRON-DualFc2 (Deep Reinforcement Opponent Network - Dual and Fully Connected 2 Networks) and DRON-DualMOE (Deep Reinforcement Opponent Network - Dual and Mixture of Experts Networks), and a DecoupleDRON learning algorithm for alleviating the overestimation problem in the algorithm.
DRON-DualFc2 and DRON-DualMOE are two neural network models for opponent modeling. They enable the main agent to better understand the opponent's behavior by performing implicit opponent modeling from input opponent features. These features are based mainly on an assessment of the opponent's ability and on observations of the opponent's recent actions: in a question-answering game, for example, the opponent features may include the number of questions the opponent has answered and its average accuracy, while in a soccer game they may include the frequency with which the opponent intercepts the ball and its recent actions. Both DRON-DualFc2 and DRON-DualMOE consist of a policy learning network, which predicts the Q value, and an opponent model learning network, which performs the implicit opponent modeling. The two network models differ mainly in how the policy learning network and the opponent model learning network are fused. Specifically:
In DRON-DualFc2, the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o. The two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o. DRON-DualFc2 fuses the environment information and the opponent features by concatenating h_s and h_o; after subsequent hidden layers the network outputs a state value estimate V^π(s, o) and action advantage estimates A^π(s, o, a), and finally the state value estimate and the mean-normalized action advantage estimate are added to obtain the action value estimate Q:

Q^π(s, o, a) = V^π(s, o) + ( A^π(s, o, a) − (1/|A|) · Σ_{a′} A^π(s, o, a′) )    (1)

In the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively denote the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| denotes the number of all possible actions; Σ_{a′} A^π(s, o, a′) denotes the sum of the action advantage estimates over all possible actions.
By decomposing the Q value in this way, every update of a Q value also directly updates the shared state value estimate, which in turn updates all Q values in that state, so the DRON-DualFc2 network converges faster.
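As an illustration, the mean-normalized combination of the state value and the action advantages can be sketched in NumPy as follows; the toy numbers are invented for demonstration and are not taken from the patent:

```python
import numpy as np

def dueling_q(v, adv):
    # Combine the state value estimate V(s, o) with the action advantage
    # estimates A(s, o, a): the advantages are normalized by subtracting
    # their mean over all possible actions before being added to V.
    adv = np.asarray(adv, dtype=float)
    return float(v) + (adv - adv.mean())

# toy example: V(s, o) = 2 and three action advantages
q = dueling_q(2.0, [1.0, -1.0, 0.0])   # these advantages already average to zero
```

Because the mean advantage is subtracted, shifting all advantages by a constant leaves the Q values unchanged; only the state value moves them together.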
In DRON-DualMOE, the policy learning network can be regarded as an expert network whose input is the environment information s, and the opponent model learning network can be regarded as a weight network whose inputs are the opponent features o and the environment information s. The expert network comprises k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o_i) and a normalized action advantage estimate A^π(s, o_i, a); the weight network outputs a corresponding k-dimensional weight vector w, which can be interpreted as the confidence that the opponent adopts each of k different strategies. The outputs of the k expert subnetworks are weighted and summed by w to obtain the final state value estimate and action advantage estimate, and the final Q value is obtained through the same normalization operation as in DRON-DualFc2:

Q^π(s, o, a) = Σ_{i=1}^{k} w_i · ( V^π(s, o_i) + ( A^π(s, o_i, a) − (1/|A|) · Σ_{a′} A^π(s, o_i, a′) ) )    (2)

In the formula, w_i denotes the i-th component of the k-dimensional weight vector, and V^π(s, o_i) and A^π(s, o_i, a) respectively denote the state value estimate and the action advantage estimate output by the i-th expert subnetwork. Like DRON-DualFc2, DRON-DualMOE accelerates convergence by decomposing the Q value into a state value estimate and an action advantage estimate.
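A minimal sketch of the mixture-of-experts fusion, assuming k = 2 experts and 2 actions; the softmax weight computation and the toy numbers stand in for the learned weight network and expert subnetworks:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_q(w_logits, expert_v, expert_adv):
    # w: confidence over the k expert subnetworks, from the weight network.
    w = softmax(np.asarray(w_logits, dtype=float))
    # Weighted sums of the experts' state values and action advantages,
    # followed by the same mean-normalization as in DRON-DualFc2.
    v = w @ np.asarray(expert_v, dtype=float)        # scalar state value
    adv = w @ np.asarray(expert_adv, dtype=float)    # (n_actions,) advantages
    return v + (adv - adv.mean())

# two experts held with equal confidence
q = moe_q([0.0, 0.0],
          [1.0, 3.0],                    # each expert's V(s, o_i)
          [[1.0, -1.0], [2.0, -2.0]])    # each expert's A(s, o_i, a)
```

With equal logits the weights are 0.5 each, so the result is simply the average of the two experts' fused estimates.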
Meanwhile, in order to alleviate the overestimation problem common in Q-learning, the invention also uses a DecoupleDRON learning algorithm. The algorithm improves the training process of DRON: the current value estimation network selects the action and the target value estimation network evaluates it, thereby decoupling action selection from action evaluation and alleviating overestimation. In the DecoupleDRON learning algorithm, the target value y is calculated as follows:

y = r_t + γ · Q( s_{t+1}, o_{t+1}, argmax_a Q(s_{t+1}, o_{t+1}, a; θ_t); θ′_t )    (3)

where r_t is the payoff obtained by the main agent at time t, γ is the discount factor, s_{t+1} is the environment information at time t+1, o_{t+1} is the opponent feature at time t+1, θ_t are the parameters of the current value estimation network at time t, and θ′_t are the parameters of the target value estimation network at time t (the current value estimation network and the target value estimation network have the same structure, namely DRON-DualFc2 or DRON-DualMOE).
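The decoupled target computation can be sketched as follows; the Q-value arrays stand in for the outputs of the current and target value estimation networks on (s_{t+1}, o_{t+1}):

```python
import numpy as np

def decoupled_target(r_t, q_current_next, q_target_next, gamma):
    # Action SELECTION uses the current value estimation network...
    a_star = int(np.argmax(q_current_next))
    # ...while action EVALUATION uses the target value estimation network,
    # decoupling the two and alleviating overestimation.
    return r_t + gamma * float(np.asarray(q_target_next)[a_star])

q_cur = [1.0, 5.0, 2.0]   # current network's Q(s_{t+1}, o_{t+1}, ·)
q_tgt = [0.5, 3.0, 4.0]   # target network's Q(s_{t+1}, o_{t+1}, ·)
y = decoupled_target(1.0, q_cur, q_tgt, gamma=0.9)   # selects a = 1, evaluates 3.0
```

A coupled DQN-style target would instead take the maximum over the target network's own values (here 4.0), which is exactly how the maximization bias that the decoupling removes would enter.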
The method comprises the following specific steps:
Step S1: initialize an experience pool with capacity N to store the interaction experience generated during training. Each experience is a tuple (s, o, a, r, s′, o′), where s is the environment information of the current time step; o is the opponent feature of the current time step; a is the action of the main agent; r is the payoff obtained after the main agent executes action a; s′ is the environment information of the next time step; and o′ is the opponent feature of the next time step.
Step S2: select DRON-DualFc2 or DRON-DualMOE as the network structure of both the current value estimation network and the target value estimation network, then randomly initialize the two networks to the same parameters. Repeat steps S3 to S9 M times, where M is the number of training episodes; M is a hyper-parameter whose value depends on the specific application scenario.
Step S3: initialize the game environment, including the environment information and the opponent features, and initialize the current time step t to 1.
Step S4: the main agent obtains the environment information s_t and opponent feature o_t of the current time step. With probability ε the main agent executes a random action a_t; otherwise it executes the action a_t = argmax_a Q(s_t, o_t, a; θ_t).
Step S5: after executing action a_t, the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment into the experience pool.
Step S6: randomly sample a batch of experience from the experience pool. For each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, where j denotes the time step of the experience, the target value y_j is calculated according to the following formula:

y_j = r_j, if s_{j+1} is a terminal state;
y_j = r_j + γ · Q( s_{j+1}, o_{j+1}, argmax_a Q(s_{j+1}, o_{j+1}, a; θ_t); θ′_t ), otherwise.    (4)

Step S7: define the loss function L according to formula (5) and perform a gradient descent step on the parameters θ_t of the current value estimation network:

L(θ_t) = E[ ( y_j − Q(s_j, o_j, a_j; θ_t) )² ]    (5)

Step S8: every C time steps, update the parameters θ′_t of the target value estimation network to the parameters θ_t of the current value estimation network.
Step S9: if s_{t+1} is a non-terminal state, update the time step t to t+1 and repeat steps S4 to S8; otherwise end this training episode.
The invention has the beneficial effects that:
(1) The invention is an implicit opponent modeling method and constructs a general opponent modeling framework that does not depend on domain-specific knowledge.
(2) The method can be applied to dynamic game environments and better models dynamically changing opponent strategies.
(3) The invention better handles the overestimation problem during training and converges faster.
Drawings
FIG. 1 is a flow chart of the training process of the method of the present invention.
Fig. 2 is a flow chart of parameter updating of the neural network.
FIG. 3 is a data flow diagram of the training process of the method of the present invention.
FIG. 4 is a network structure diagram of DRON-DualFc 2.
Fig. 5 is a network structure diagram of DRON-dual moe.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
The training flow chart of the invention is shown in fig. 1, and the steps are described as follows:
the first step is as follows: and initializing an experience pool for storing interactive experiences generated by the main intelligent body in the algorithm training process.
The second step is that: DRON-DualFc2 is selected as the network structure of the current value estimation network and the target value estimation network, and then the current network and the target network are initialized randomly to the same parameters. Repeating the third step to the tenth step M times.
The third step: a gaming environment is initialized, including environmental information and opponent characteristics. The current time step is initialized to 1.
The fourth step: the master agent obtains environmental information and opponent features of the current time step.
The fifth step: the main agent selects the action to execute according to the ε-greedy algorithm. That is, with probability ε an action is selected at random; otherwise the acquired environment information and opponent features are fed to the current value estimation network, which outputs the action to execute.
And a sixth step: and the main intelligent agent executes the action obtained in the last step.
The seventh step: the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step, and the opponent feature o_{t+1}.
Eighth step: and storing the experience generated by the interaction of the main intelligent agent and the environment into an experience pool.
The ninth step: and updating parameters of the current value estimation network and the target value estimation network.
The tenth step: if the next moment is not in a termination state, updating the time step and repeatedly executing the processes from the fourth step to the ninth step; if the next moment is in a termination state, the training of the round is finished.
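The first through tenth steps above can be sketched as the following training skeleton; the environment and network objects (`env`, `q_net`, `target_net` and their methods) are hypothetical stand-ins for illustration, not part of the claimed method:

```python
import random
from collections import deque

def train(env, q_net, target_net, M, capacity, batch_size,
          epsilon, gamma, sync_every):
    buffer = deque(maxlen=capacity)        # first step: experience pool
    target_net.copy_from(q_net)            # second step: same initial parameters
    step = 0
    for _ in range(M):                     # repeat third-tenth steps M times
        s, o = env.reset()                 # third step: init environment
        done = False
        while not done:
            # fourth/fifth steps: observe, then choose an action epsilon-greedily
            if random.random() < epsilon:
                a = random.randrange(env.n_actions)
            else:
                a = q_net.best_action(s, o)
            # sixth-eighth steps: act, observe payoff, store the experience
            r, s2, o2, done = env.step(a)
            buffer.append((s, o, a, r, s2, o2, done))
            # ninth step: update parameters from a sampled batch
            if len(buffer) >= batch_size:
                batch = random.sample(list(buffer), batch_size)
                q_net.update(batch, target_net, gamma)
            step += 1
            if step % sync_every == 0:     # periodic target-network sync
                target_net.copy_from(q_net)
            s, o = s2, o2                  # tenth step: advance the time step
```

The `capacity`, `batch_size`, `epsilon`, `gamma`, and `sync_every` arguments correspond to the hyper-parameters N, batch size, ε, γ, and C of the method.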
The parameter updating process of the neural network is shown in fig. 2, and the specific steps are described as follows:
step 1: and randomly sampling a batch of interactive experiences from the experience pool, wherein the batch of interactive experiences mainly comprise the environmental state of the current time step, the opponent characteristics, the action taken by the main agent, the instant reward obtained by the main agent and the sequence of the environmental state and the opponent characteristics of the next time step, which are generated when the main agent interacts in the environment.
Step 2: the loss of the network is calculated according to the formula (4) and the formula (5).
Step 3: back-propagate the loss function and compute its gradient with respect to each parameter of the current value estimation network.
Step 4: update the parameters of the current value estimation network by gradient descent.
Step 5: every C time steps, update the parameters of the target value estimation network to the parameters of the current value estimation network.
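Steps 2 to 4 above amount to a mean-squared TD-error computation; a minimal sketch, with toy targets and predictions invented for illustration:

```python
import numpy as np

def td_loss(targets, q_pred):
    # Mean squared error between the targets y_j (formula (4)) and the
    # current network's predictions Q(s_j, o_j, a_j) (formula (5)).
    targets = np.asarray(targets, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    return float(np.mean((targets - q_pred) ** 2))

loss = td_loss([3.7, 1.0], [3.2, 1.5])   # two experiences, each off by 0.5
```

In a real implementation the gradient of this loss with respect to θ_t would be obtained by automatic differentiation and applied in the gradient descent step.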
The data flow of the training process is shown in FIG. 3: for the main agent, the current value estimation network takes the environment information and opponent features at time t−1 as input, and outputs the action the main agent should execute at time t together with its Q value.
The network structure of DRON-DualFc2 is shown in FIG. 4. All layers in the network are fully connected layers, and the activation function is the ReLU function. The network inputs are the current environment information s and the opponent features; the two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o, which are concatenated and fed into a subsequent hidden layer to obtain an output h. From h, two independent hidden layers produce the state value estimate and the action advantage estimates, and finally the state value estimate and the mean-normalized action advantage estimate are added to obtain the Q value.
FIG. 5 shows the network structure of DRON-DualMOE. All layers in the network are fully connected layers; except for the Softmax activation used when outputting the weight vector w, all other layers use the ReLU activation function. The DRON-DualMOE network consists of two parts, a weight network and an expert network. The weight network takes the current opponent features and the environment information s as input and, after two fully connected layers, outputs a k-dimensional weight vector w through a Softmax function. The expert network takes the current environment information s as input; the output obtained after two fully connected layers serves as the input of the k expert subnetworks, each of which outputs an independent state value estimate and a normalized action advantage estimate, and the Q value is obtained by weighted summation with the weight vector followed by normalization.
The above description is only illustrative of embodiments of the present invention; the appended claims are not limited thereto but encompass all modifications, equivalents, and improvements made within the spirit and scope of the present invention.
Claims (3)
1. An implicit opponent modeling method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: initializing an experience pool with capacity N to store the interaction experience generated during training; each experience is a tuple (s, o, a, r, s′, o′), where s is the environment information of the current time step; o is the opponent feature of the current time step; a is the action of the main agent; r is the payoff obtained after the main agent executes action a; s′ is the environment information of the next time step; and o′ is the opponent feature of the next time step;
step S2: selecting DRON-DualFc2 or DRON-DualMOE as the network structure of both the current value estimation network and the target value estimation network, then randomly initializing the two networks to the same parameters; repeating steps S3 to S9 M times, where M is the number of training episodes;
step S3: initializing the game environment, including the environment information and the opponent features; initializing the current time step t to 1;
step S4: the main agent obtains the environment information s_t and opponent feature o_t of the current time step; with probability ε the main agent executes a random action a_t, otherwise it executes the action a_t = argmax_a Q(s_t, o_t, a; θ_t);
step S5: after executing action a_t, the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment into the experience pool;
step S6: randomly sampling a batch of experience from the experience pool; for each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, where j denotes the time step of the experience, the target value y_j is calculated according to the following formula:

y_j = r_j, if s_{j+1} is a terminal state;
y_j = r_j + γ · Q( s_{j+1}, o_{j+1}, argmax_a Q(s_{j+1}, o_{j+1}, a; θ_t); θ′_t ), otherwise;    (1)

step S7: defining the loss function L according to formula (2) and performing gradient descent on the parameters θ_t of the current value estimation network:

L(θ_t) = E[ ( y_j − Q(s_j, o_j, a_j; θ_t) )² ];    (2)

step S8: every C time steps, updating the parameters θ′_t of the target value estimation network to the parameters θ_t of the current value estimation network;
step S9: if s_{t+1} is a non-terminal state, updating the time step t to t+1 and repeating steps S4 to S8; otherwise ending this training episode.
2. The implicit opponent modeling method based on deep reinforcement learning of claim 1, wherein the DRON-DualFc2 consists of a policy learning network and an opponent model learning network; the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o; the two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o; DRON-DualFc2 fuses the environment information and the opponent features by concatenating h_s and h_o, then outputs, after subsequent hidden layers, a state value estimate V^π(s, o) and action advantage estimates A^π(s, o, a), and finally adds the state value estimate and the normalized action advantage estimate to obtain the action value estimate Q:

Q^π(s, o, a) = V^π(s, o) + ( A^π(s, o, a) − (1/|A|) · Σ_{a′} A^π(s, o, a′) )    (3)

in the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively denote the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| denotes the number of all possible actions; Σ_{a′} A^π(s, o, a′) denotes the sum of the action advantage estimates over all possible actions.
3. The implicit opponent modeling method based on deep reinforcement learning of claim 1 or 2, wherein the DRON-DualMOE consists of a policy learning network and an opponent model learning network; the policy learning network is regarded as an expert network whose input is the environment information s; the opponent model learning network is regarded as a weight network whose inputs are the opponent feature o and the environment information s; the expert network comprises k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o_i) and a normalized action advantage estimate A^π(s, o_i, a); the weight network outputs a corresponding k-dimensional weight vector w; the outputs of the k expert subnetworks are weighted and summed by w to obtain the final state value estimate and action advantage estimate, and the final Q value is obtained through normalization:

Q^π(s, o, a) = Σ_{i=1}^{k} w_i · ( V^π(s, o_i) + ( A^π(s, o_i, a) − (1/|A|) · Σ_{a′} A^π(s, o_i, a′) ) )    (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111316717.3A CN114154397B (en) | 2021-11-09 | Implicit opponent modeling method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114154397A true CN114154397A (en) | 2022-03-08 |
CN114154397B CN114154397B (en) | 2024-05-10 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020024097A1 (en) * | 2018-07-30 | 2020-02-06 | 东莞理工学院 | Deep reinforcement learning-based adaptive game algorithm |
CA3060914A1 (en) * | 2018-11-05 | 2020-05-05 | Royal Bank Of Canada | Opponent modeling with asynchronous methods in deep rl |
CN113095488A (en) * | 2021-04-29 | 2021-07-09 | 电子科技大学 | Cooperative game method based on multi-agent maximum entropy reinforcement learning |
CN113326902A (en) * | 2021-07-08 | 2021-08-31 | 中国人民解放军国防科技大学 | Online learning-based strategy acquisition method, device and equipment |
Non-Patent Citations (3)
Title |
---|
LIU Qiang; JIANG Feng: "Research on group confrontation strategies based on deep reinforcement learning", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
CAO Lei; CHEN Xiliang; XU Zhixiong; LAI Jun: "A survey of multi-agent deep reinforcement learning", Computer Engineering and Applications, no. 05, 14 February 2020 (2020-02-14) *
SHI Wenhao; MENG Jun; ZHANG Peng; LIU Chanjuan: "A miRNA-lncRNA interaction prediction model fusing CNN and Bi-LSTM", Journal of Computer Research and Development, no. 008, 31 December 2019 (2019-12-31) *
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115018017A (en) * | 2022-08-03 | 2022-09-06 | 中国科学院自动化研究所 | Multi-agent credit allocation method, system and equipment based on ensemble learning |
CN117077553A (en) * | 2023-10-18 | 2023-11-17 | 崂山国家实验室 | Interaction strategy optimization method for underwater attack and defense rapid opponent modeling |
CN117077553B (en) * | 2023-10-18 | 2023-12-15 | 崂山国家实验室 | Interaction strategy optimization method for underwater attack and defense rapid opponent modeling |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109635917B (en) | Multi-agent cooperation decision and training method | |
CN108921298B (en) | Multi-agent communication and decision-making method for reinforcement learning | |
CN112132263A (en) | Multi-agent autonomous navigation method based on reinforcement learning | |
CN112488310A (en) | Multi-agent group cooperation strategy automatic generation method | |
CN113919482A (en) | Intelligent agent training method and device, computer equipment and storage medium | |
CN112069504A (en) | Model enhanced defense method for resisting attack by deep reinforcement learning | |
CN112613608A (en) | Reinforced learning method and related device | |
CN114881228A (en) | Average SAC deep reinforcement learning method and system based on Q learning | |
CN115409158A (en) | Robot behavior decision method and device based on layered deep reinforcement learning model | |
CN116147627A (en) | Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation | |
CN113276852B (en) | Unmanned lane keeping method based on maximum entropy reinforcement learning framework | |
CN114154397B (en) | Implicit opponent modeling method based on deep reinforcement learning | |
CN114154397A (en) | Implicit adversary modeling method based on deep reinforcement learning | |
CN115009291B (en) | Automatic driving assistance decision making method and system based on network evolution replay buffer area | |
CN114861368B (en) | Construction method of railway longitudinal section design learning model based on near-end strategy | |
CN113240118B (en) | Dominance estimation method, dominance estimation device, electronic device, and storage medium | |
CN115936058A (en) | Multi-agent migration reinforcement learning method based on graph attention network | |
Almalki et al. | Exploration of reinforcement learning to play snake game | |
JPH10340192A (en) | Fuzzy logic controller and its non-fuzzying method | |
Chen et al. | Modified PPO-RND method for solving sparse reward problem in ViZDoom | |
CN112906868A (en) | Behavior clone-oriented demonstration active sampling method | |
CN116757969B (en) | Image blind denoising method and system based on self-adaptive curvature feature fusion | |
CN112884129B (en) | Multi-step rule extraction method, device and storage medium based on teaching data | |
Awheda | On Multi-Agent Reinforcement Learning in Matrix, Stochastic and Differential Games | |
CN113869488A (en) | Game AI intelligent agent reinforcement learning method facing continuous-discrete mixed decision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |