CN114154397A - Implicit adversary modeling method based on deep reinforcement learning - Google Patents

Implicit adversary modeling method based on deep reinforcement learning

Info

Publication number
CN114154397A
Authority
CN
China
Prior art keywords
network
opponent
action
value
estimation
Prior art date
Legal status
Granted
Application number
CN202111316717.3A
Other languages
Chinese (zh)
Other versions
CN114154397B (en)
Inventor
刘婵娟
赵天昊
刘睿康
Current Assignee
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN202111316717.3A priority Critical patent/CN114154397B/en
Priority claimed from CN202111316717.3A external-priority patent/CN114154397B/en
Publication of CN114154397A publication Critical patent/CN114154397A/en
Application granted granted Critical
Publication of CN114154397B publication Critical patent/CN114154397B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 30/00 Computer-aided design [CAD]
    • G06F 30/20 Design optimisation, verification or simulation
    • G06F 30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention discloses an implicit opponent modeling method based on deep reinforcement learning, belonging to the field of opponent modeling in multi-agent reinforcement learning. Focusing on the opponent modeling problem in dynamic game environments, the invention uses deep reinforcement learning techniques to provide an improved implicit opponent modeling method. The method does not depend on domain-specific knowledge, can adapt to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.

Description

Implicit adversary modeling method based on deep reinforcement learning
Technical Field
The invention belongs to the field of opponent modeling in multi-agent reinforcement learning, and in particular relates to an implicit opponent modeling method based on deep reinforcement learning.
Background
Intelligent decision-making aims to let an agent make reasonable decisions in a game environment so as to maximize its own payoff. If the opponent's actions, preferences, and so on are modeled during this process, the opponent's behavior can be predicted more accurately and the agent's decisions optimized accordingly. For example, in a board game, if one side can predict where the opponent will play next, it can lay out a targeted strategy in advance; in autonomous driving, if a vehicle can anticipate the movement of other vehicles or pedestrians, it can take evasive action early. Modeling other agents in a game environment is therefore critical to decision optimization, and opponent modeling has become an important research direction in artificial intelligence.
Most existing opponent modeling techniques assume that the opponent adopts a fixed strategy. In most real game environments, however, the opponent dynamically changes its strategy to maximize its payoff, and the payoff obtained by the main agent is then strongly affected by those changes. In this case, dynamic opponent features must be modeled to adapt to changes in the opponent's policy. Some existing opponent modeling techniques can model dynamic opponent features under different constraints. For example, the AWESOME algorithm proposed by Vincent Conitzer et al. at Carnegie Mellon University can guarantee that the main agent makes the optimal decision provided that the opponent's policy eventually becomes stationary. The DriftER algorithm proposed by Pablo Hernandez-Leal et al. of CWI and Yusen Zhan et al. of Washington State University assumes that the opponent switches among a set of fixed policies; the main agent monitors, through prediction errors, when the opponent changes its policy and readjusts its own policy accordingly.
The above opponent modeling techniques all belong to explicit modeling. In explicit modeling, the process of modeling the opponent is separated from the process of planning against the environment, so the modeling process usually requires a large amount of domain-specific knowledge. This makes explicit modeling difficult to apply where domain knowledge is lacking and difficult to transfer from one domain to another. Implicit modeling, in contrast, combines the modeling and planning processes: it requires no domain-specific knowledge and can model the opponent purely from historical interaction information, which makes a general opponent modeling framework possible.
Thanks to the rapid development of deep reinforcement learning in recent years, a series of new deep reinforcement learning methods have been proposed, offering new ideas for opponent modeling. A representative work is the DRON algorithm proposed by He et al. DRON performs implicit opponent modeling in dynamic game environments: building on the DQN algorithm, it reads the opponent's historical interaction information and implicitly encodes the opponent features together with the environment features in a neural network, so opponent modeling can be carried out without any domain knowledge. The model performs well in soccer and question-answering games but, constrained by the characteristics of DQN, it still suffers from problems such as susceptibility to overestimation and slow convergence.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an improved implicit opponent modeling method that applies deep reinforcement learning to the opponent modeling problem in dynamic game environments. The method does not depend on domain-specific knowledge, adapts to dynamic changes in the opponent's strategy, alleviates the overestimation problem, and converges faster.
The technical scheme of the invention is as follows:
an implicit opponent modeling method based on deep reinforcement learning comprises two neural network models used for implicit opponent modeling, DRON-DualFc2 (Deep Reinforcement Opponent Network - Dual and Fully Connected 2 Networks) and DRON-DualMOE (Deep Reinforcement Opponent Network - Dual and Mixture of Experts Networks), together with a DecoupleDRON learning algorithm that alleviates the overestimation problem during training.
DRON-DualFc2 and DRON-DualMOE are two neural network models for opponent modeling. They perform implicit opponent modeling from input opponent features, enabling the main agent to better understand the opponent's behavior. The opponent features are mainly based on an assessment of the opponent's ability and observations of its recent actions; for example, in a question-answering game they can be characterized by the number of questions the opponent has answered and its average accuracy, while in a soccer game they can be characterized by how frequently the opponent intercepts the ball and which actions it has recently taken. Both DRON-DualFc2 and DRON-DualMOE consist of a policy learning network and an opponent model learning network: the policy learning network predicts the Q value, and the opponent model learning network performs implicit opponent modeling. The two models differ mainly in how the policy learning network and the opponent model learning network are fused. Specifically:
in DRON-DualFc2, the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o. The two inputs pass through their respective hidden layers to produce two hidden-layer outputs h_s and h_o. DRON-DualFc2 fuses the environment information and the opponent feature by concatenating h_s and h_o; after subsequent hidden layers, the network outputs a state value estimate V^π(s, o) and an action advantage estimate A^π(s, o, a), and finally the state value estimate and the normalized action advantage estimate are added to obtain the action value estimate Q:
$$ Q^{\pi}(s,o,a) = V^{\pi}(s,o) + A^{\pi}(s,o,a) - \frac{1}{|A|}\sum_{a'} A^{\pi}(s,o,a') \qquad (1) $$

In the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively denote the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| denotes the number of all possible actions; Σ_{a'} A^π(s, o, a') denotes the sum of the action advantage estimates over all possible actions.
By decomposing the Q value in this way, every update of a Q value directly updates the state value estimate, which in turn updates the Q values of all actions in that state, so the DRON-DualFc2 network converges faster.
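For illustration only, the following is a minimal PyTorch sketch of a DRON-DualFc2-style network consistent with the description above; the layer sizes and hidden dimensions are hypothetical choices, not values specified by the patent.

```python
import torch
import torch.nn as nn

class DRONDualFc2(nn.Module):
    """Sketch of a DRON-DualFc2-style network: a policy-learning branch over the
    environment information s and an opponent-model branch over the opponent
    feature o, fused by concatenation and decomposed into V(s,o) and A(s,o,a)."""

    def __init__(self, state_dim, opp_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.state_branch = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        self.opp_branch = nn.Sequential(nn.Linear(opp_dim, hidden_dim), nn.ReLU())
        self.shared = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        self.value_head = nn.Linear(hidden_dim, 1)               # V(s, o)
        self.advantage_head = nn.Linear(hidden_dim, n_actions)   # A(s, o, a)

    def forward(self, s, o):
        h_s = self.state_branch(s)
        h_o = self.opp_branch(o)
        h = self.shared(torch.cat([h_s, h_o], dim=-1))  # fuse by concatenating h_s and h_o
        v = self.value_head(h)                          # state value estimate
        a = self.advantage_head(h)                      # action advantage estimates
        # Q = V + (A - mean(A)), i.e. equation (1)
        return v + a - a.mean(dim=-1, keepdim=True)
```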
In DRON-DualMOE, the policy learning network can be regarded as an expert network whose input is the environment information s, and the opponent model learning network can be regarded as a weight network whose inputs are the opponent feature o and the environment information s. The expert network contains k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o) and a normalized action advantage estimate A^π(s, o, a). The weight network outputs a corresponding k-dimensional weight vector w, which can be regarded as the confidence that the opponent adopts each of the different strategies. The outputs V^π(s, o) and A^π(s, o, a) of the k expert subnetworks are weighted and summed with w to obtain the final state value estimate and action advantage estimate, and the final Q value is obtained with the same normalization as in DRON-DualFc2. Like DRON-DualFc2, DRON-DualMOE also accelerates convergence by decomposing the Q value into a state value estimate and an action advantage estimate:
$$ Q^{\pi}(s,o,a) = \sum_{i=1}^{k} w_i \left[ V^{\pi}(s,o_i) + A^{\pi}(s,o_i,a) - \frac{1}{|A|}\sum_{a'} A^{\pi}(s,o_i,a') \right] \qquad (2) $$

In the formula, w_i denotes the i-th component of the k-dimensional weight vector, and V^π(s, o_i) and A^π(s, o_i, a) denote the state value estimate and the action advantage estimate output by the i-th expert subnetwork, respectively.
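As a companion illustration, here is a minimal sketch of a DRON-DualMOE-style forward pass under the same assumptions (hypothetical layer sizes, PyTorch, and softmax gating over the k expert subnetworks, matching the Softmax output of the weight network described later).

```python
import torch
import torch.nn as nn

class DRONDualMOE(nn.Module):
    """Sketch of a DRON-DualMOE-style network: k dueling expert subnetworks over the
    environment information s, gated by a weight network that reads the opponent
    feature o together with s."""

    def __init__(self, state_dim, opp_dim, n_actions, k=3, hidden_dim=64):
        super().__init__()
        self.k = k
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # one (V, A) head pair per expert subnetwork
        self.value_heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(k)])
        self.adv_heads = nn.ModuleList([nn.Linear(hidden_dim, n_actions) for _ in range(k)])
        # weight (gating) network over opponent feature and environment information
        self.gate = nn.Sequential(
            nn.Linear(opp_dim + state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, k),
        )

    def forward(self, s, o):
        h = self.state_encoder(s)
        w = torch.softmax(self.gate(torch.cat([o, s], dim=-1)), dim=-1)  # k-dim weights
        q_experts = []
        for i in range(self.k):
            v_i = self.value_heads[i](h)                 # state value estimate of expert i
            a_i = self.adv_heads[i](h)                   # action advantage estimates of expert i
            q_experts.append(v_i + a_i - a_i.mean(dim=-1, keepdim=True))
        q_stack = torch.stack(q_experts, dim=1)          # (batch, k, n_actions)
        return (w.unsqueeze(-1) * q_stack).sum(dim=1)    # weighted sum of equation (2)
```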
Meanwhile, to alleviate the overestimation problem common in Q-learning, the invention also uses a DecoupleDRON learning algorithm. The algorithm improves the training process of DRON: the current value estimation network is used to select the action, and the target value estimation network is used to evaluate the selected action, decoupling action selection from action evaluation and thereby alleviating overestimation. In the DecoupleDRON learning algorithm, the target value y is calculated as follows:
$$ y = r_t + \gamma\, Q\big(s_{t+1}, o_{t+1}, \arg\max_{a} Q(s_{t+1}, o_{t+1}, a; \theta_t);\, \theta'_t\big) \qquad (3) $$
where r_t is the payoff obtained by the main agent at time t, γ is the discount factor, s_{t+1} is the environment information at time t+1, o_{t+1} is the opponent feature at time t+1, θ_t are the parameters of the current value estimation network at time t, and θ'_t are the parameters of the target value estimation network at time t (the current value estimation network and the target value estimation network have the same structure, namely DRON-DualFc2 or DRON-DualMOE).
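A minimal sketch of this decoupled target computation is shown below; the helper name and the (1 - done) mask that zeroes the bootstrap term at episode boundaries are illustrative additions, not part of the patent text.

```python
import torch

@torch.no_grad()
def decoupled_target(reward, s_next, o_next, done, current_net, target_net, gamma=0.99):
    """Target value of equation (3): select the action with the current value
    estimation network, evaluate it with the target value estimation network."""
    q_current = current_net(s_next, o_next)               # used only for action selection
    best_action = q_current.argmax(dim=-1, keepdim=True)  # argmax_a Q(s', o', a; theta_t)
    q_eval = target_net(s_next, o_next).gather(-1, best_action).squeeze(-1)
    return reward + gamma * (1.0 - done) * q_eval         # y; bootstrap suppressed at episode end
```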
The method comprises the following specific steps:
Step S1: initialize an experience pool with capacity N to store the interaction experience generated during algorithm training. Each experience is a tuple (s, o, a, r, s', o'), where s represents the environment information of the current time step; o represents the opponent feature of the current time step; a represents the action of the main agent; r represents the payoff obtained after the main agent executes action a; s' represents the environment information of the next time step; and o' represents the opponent feature of the next time step.
Step S2: select DRON-DualFc2 or DRON-DualMOE as the network structure of both the current value estimation network and the target value estimation network, then randomly initialize the two networks to the same parameters. Repeat steps S3 to S9 M times, where M is the number of training rounds; M is a hyperparameter whose value depends on the specific application scenario.
Step S3: initialize the game environment, including the environment information and the opponent features. Initialize the current time step t = 1.
Step S4: the main agent obtains the environment information s_t and the opponent feature o_t of the current time step. With probability ε the main agent performs a random action a_t; otherwise it performs the action

$$ a_t = \arg\max_{a} Q(s_t, o_t, a; \theta_t) $$

Step S5: after executing action a_t, the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step, and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment in the experience pool.
Step S6: randomly sample a batch of experiences from the experience pool. For each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, where j denotes the time step corresponding to that experience, the target value y_j is calculated according to the following formula:

$$ y_j = r_j + \gamma\, Q\big(s_{j+1}, o_{j+1}, \arg\max_{a} Q(s_{j+1}, o_{j+1}, a; \theta_t);\, \theta'_t\big) \qquad (4) $$

Step S7: define the loss function L according to formula (5) and perform gradient descent on the parameters θ_t of the current value estimation network.

$$ L(\theta_t) = \mathbb{E}\big[\,(y_j - Q(s_j, o_j, a_j; \theta_t))^2\,\big] \qquad (5) $$

Step S8: every C time steps, update the parameters θ'_t of the target value estimation network to the parameters θ_t of the current value estimation network.
Step S9: if s_{t+1} is not a terminal state, update the time step t = t + 1 and repeat steps S4 to S8; otherwise, end this round of training.
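To make steps S1 to S9 concrete, the following training-loop sketch ties the pieces together. It reuses the DRONDualFc2 and decoupled_target sketches above and assumes a hypothetical environment object whose reset() and step() return both environment information and opponent features; all hyperparameter values (N, M, C, ε, γ, batch size, learning rate) are illustrative, not prescribed by the patent.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def train(env, state_dim, opp_dim, n_actions, N=100_000, M=1000, C=200,
          eps=0.1, gamma=0.99, batch_size=32, lr=1e-3):
    buffer = deque(maxlen=N)                                   # step S1: experience pool
    current_net = DRONDualFc2(state_dim, opp_dim, n_actions)   # step S2: pick a network structure
    target_net = DRONDualFc2(state_dim, opp_dim, n_actions)
    target_net.load_state_dict(current_net.state_dict())       # same initial parameters
    optimizer = torch.optim.Adam(current_net.parameters(), lr=lr)
    step_count = 0

    for episode in range(M):
        s, o = env.reset()                                     # step S3 (assumed interface)
        done = False
        while not done:
            if random.random() < eps:                          # step S4: epsilon-greedy
                a = random.randrange(n_actions)
            else:
                with torch.no_grad():
                    q = current_net(torch.as_tensor(s, dtype=torch.float32).unsqueeze(0),
                                    torch.as_tensor(o, dtype=torch.float32).unsqueeze(0))
                a = int(q.argmax(dim=-1))
            s2, o2, r, done = env.step(a)                      # step S5 (assumed interface)
            buffer.append((s, o, a, r, s2, o2, float(done)))
            s, o = s2, o2
            step_count += 1

            if len(buffer) >= batch_size:                      # steps S6 and S7
                batch = random.sample(buffer, batch_size)
                sb, ob, ab, rb, s2b, o2b, db = (
                    torch.as_tensor(np.array(x), dtype=torch.float32) for x in zip(*batch))
                y = decoupled_target(rb, s2b, o2b, db, current_net, target_net, gamma)
                q_sa = current_net(sb, ob).gather(-1, ab.long().unsqueeze(-1)).squeeze(-1)
                loss = nn.functional.mse_loss(q_sa, y)         # loss of equation (5)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

            if step_count % C == 0:                            # step S8: sync target network
                target_net.load_state_dict(current_net.state_dict())
```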
The invention has the beneficial effects that:
(1) The invention is an implicit opponent modeling method that constructs a general opponent modeling framework without depending on domain-specific knowledge.
(2) The method can be applied in dynamic game environments and better models dynamically changing opponent strategies.
(3) The invention better handles the overestimation problem during training and converges faster.
Drawings
FIG. 1 is a flow chart of the training process of the method of the present invention.
Fig. 2 is a flow chart of parameter updating of the neural network.
FIG. 3 is a data flow diagram of the training process of the method of the present invention.
FIG. 4 is a network structure diagram of DRON-DualFc 2.
Fig. 5 is a network structure diagram of DRON-dual moe.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
The training flow chart of the invention is shown in fig. 1, and the steps are described as follows:
the first step is as follows: and initializing an experience pool for storing interactive experiences generated by the main intelligent body in the algorithm training process.
The second step is that: DRON-DualFc2 is selected as the network structure of the current value estimation network and the target value estimation network, and then the current network and the target network are initialized randomly to the same parameters. Repeating the third step to the tenth step M times.
The third step: a gaming environment is initialized, including environmental information and opponent characteristics. The current time step is initialized to 1.
The fourth step: the master agent obtains environmental information and opponent features of the current time step.
The fifth step: the main agent selects the action to perform according to the ε-greedy algorithm; that is, with probability ε a random action is selected, otherwise the acquired environment information and opponent features are fed into the current value estimation network to obtain the action to execute.
The sixth step: the main agent executes the action obtained in the previous step.
The seventh step: the main agent obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step, and the opponent feature o_{t+1}.
The eighth step: store the experience generated by the interaction between the main agent and the environment in the experience pool.
The ninth step: update the parameters of the current value estimation network and the target value estimation network.
The tenth step: if the next time step is not a terminal state, update the time step and repeat the fourth through ninth steps; if it is a terminal state, this round of training ends.
The parameter updating process of the neural network is shown in fig. 2, and the specific steps are described as follows:
step 1: and randomly sampling a batch of interactive experiences from the experience pool, wherein the batch of interactive experiences mainly comprise the environmental state of the current time step, the opponent characteristics, the action taken by the main agent, the instant reward obtained by the main agent and the sequence of the environmental state and the opponent characteristics of the next time step, which are generated when the main agent interacts in the environment.
Step 2: the loss of the network is calculated according to the formula (4) and the formula (5).
And 3, step 3: and calculating the gradient of the loss function relative to each parameter in the current value estimation network after the loss function is propagated reversely.
And 4, step 4: and updating the parameters of the current value estimation network according to the gradient descent optimization method.
And 5, step 5: and updating the parameters of the target value estimation network into the parameters of the current value estimation network every C time steps.
The data flow of the training process of the method is shown in Fig. 3. For the main agent, the current value estimation network takes the environment information and opponent features at time t-1 as input and outputs the Q values and the action that the main agent should execute at time t.
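As a usage note for the sketches above, a single decision step of the main agent could look like the following (hypothetical tensor shapes, reusing the DRONDualFc2 class defined earlier).

```python
import torch

net = DRONDualFc2(state_dim=8, opp_dim=4, n_actions=5)  # illustrative dimensions only
s_prev = torch.zeros(1, 8)   # environment information from the previous time step
o_prev = torch.zeros(1, 4)   # opponent features from the previous time step
with torch.no_grad():
    q_values = net(s_prev, o_prev)         # Q values for all candidate actions
action = int(q_values.argmax(dim=-1))      # action the main agent should execute
```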
The network structure of DRON-DualFc2 is shown in Fig. 4. All layers in the network are fully connected layers, and the activation function is ReLU. The network inputs are the current environment information s and the opponent feature o; the two inputs pass through their respective hidden layers to produce hidden-layer outputs h_s and h_o, which are then concatenated and fed into a subsequent hidden layer to obtain an output h. From h, a state value estimate and an action advantage estimate are obtained through two separate hidden layers, and finally the state value estimate and the normalized action advantage estimate are added to obtain the Q value.
The network structure of DRON-DualMOE is shown in Fig. 5. All layers in the network are fully connected layers; except for the output of the weight vector w, which uses a Softmax activation, all other layers use ReLU activations. The DRON-DualMOE network consists of two parts, a weight network and an expert network. The weight network takes the current opponent feature o and environment information s as input and, after two fully connected layers, outputs a k-dimensional weight vector w through a Softmax function. The expert network takes the current environment information s as input; the output h_s obtained after two fully connected layers serves as the input of the k expert subnetworks. Each expert subnetwork outputs an independent state value estimate and a normalized action advantage estimate, and the Q value is obtained by weighted summation with the weight vector followed by normalization.
The above description is only intended to illustrate embodiments of the present invention and is not intended to limit the invention; all modifications, equivalents, and improvements made within the spirit and scope of the present invention shall fall within its protection scope.

Claims (3)

1. An implicit opponent modeling method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: initializing an experience pool with capacity N to store the interaction experience generated during algorithm training; the interaction experience comprises (s, o, a, r, s', o'), wherein s represents the environment information of the current time step; o represents the opponent feature of the current time step; a represents the action of the main agent; r represents the payoff obtained after the main agent executes action a; s' represents the environment information of the next time step, and o' represents the opponent feature of the next time step;
step S2: selecting DRON-DualFc2 or DRON-DualMOE as the network structure of the current value estimation network and the target value estimation network, and then randomly initializing the current value estimation network and the target value estimation network to the same parameters; repeating steps S3 to S9 M times, wherein M is the number of training rounds;
step S3: initializing a game environment, including environment information and opponent features; initializing the current time step t = 1;
step S4: the main agent obtains the environment information s_t and the opponent feature o_t of the current time step; with probability ε the main agent performs a random action a_t, otherwise it performs the action

$$ a_t = \arg\max_{a} Q(s_t, o_t, a; \theta_t) $$

step S5: after the main agent executes action a_t, it obtains from the game environment the immediate payoff r_t, the environment information s_{t+1} of the next time step, and the opponent feature o_{t+1}, and stores the experience (s_t, o_t, a_t, r_t, s_{t+1}, o_{t+1}) generated by the interaction with the environment in the experience pool;
step S6: randomly sampling a batch of experiences from the experience pool; for each experience (s_j, o_j, a_j, r_j, s_{j+1}, o_{j+1}) in the batch, the target value y_j is calculated according to the following formula, wherein j represents the time step corresponding to that experience:

$$ y_j = r_j + \gamma\, Q\big(s_{j+1}, o_{j+1}, \arg\max_{a} Q(s_{j+1}, o_{j+1}, a; \theta_t);\, \theta'_t\big) \qquad (1) $$

step S7: defining a loss function L according to formula (2) and performing gradient descent on the parameters θ_t of the current value estimation network;

$$ L(\theta_t) = \mathbb{E}\big[\,(y_j - Q(s_j, o_j, a_j; \theta_t))^2\,\big] \qquad (2) $$

step S8: every C time steps, updating the parameters θ'_t of the target value estimation network to the parameters θ_t of the current value estimation network;
step S9: if s_{t+1} is a non-terminal state, updating the time step t = t + 1 and repeatedly executing steps S4 to S8; otherwise, ending this round of training.
2. The implicit opponent modeling method based on deep reinforcement learning of claim 1, wherein the DRON-DualFc2 is composed of a policy learning network and an opponent model learning network; the input of the policy learning network is the environment information s, and the input of the opponent model learning network is the opponent feature o; the two inputs are processed by respective hidden layers to obtain two hidden-layer outputs h_s and h_o; DRON-DualFc2 fuses the environment information and the opponent feature by concatenating h_s and h_o, and then, after subsequent hidden layers, outputs a state value estimate V^π(s, o) and an action advantage estimate A^π(s, o, a); finally, the state value estimate and the normalized action advantage estimate are added to obtain the action value estimate Q:

$$ Q^{\pi}(s,o,a) = V^{\pi}(s,o) + A^{\pi}(s,o,a) - \frac{1}{|A|}\sum_{a'} A^{\pi}(s,o,a') $$

in the formula, V^π(s, o), Q^π(s, o, a) and A^π(s, o, a) respectively represent the state value estimate, the Q value of action a, and the action advantage estimate when the environment information is s and the opponent feature is o; |A| represents the number of all possible actions; Σ_{a'} A^π(s, o, a') represents the sum of the action advantage estimates over all possible actions.
3. The implicit opponent modeling method based on deep reinforcement learning according to claim 1 or 2, wherein the DRON-DualMOE is composed of a policy learning network and an opponent model learning network; the policy learning network is regarded as an expert network whose input is the environment information s; the opponent model learning network is regarded as a weight network whose inputs are the opponent feature o and the environment information s; the expert network comprises k expert subnetworks, each of which outputs an independent state value estimate V^π(s, o) and a normalized action advantage estimate A^π(s, o, a), and the weight network outputs a corresponding k-dimensional weight vector w; the outputs V^π(s, o) and A^π(s, o, a) of the k expert subnetworks are weighted and summed with w to obtain the final state value estimate and action advantage estimate, which are normalized to obtain the final Q value:

$$ Q^{\pi}(s,o,a) = \sum_{i=1}^{k} w_i \left[ V^{\pi}(s,o_i) + A^{\pi}(s,o_i,a) - \frac{1}{|A|}\sum_{a'} A^{\pi}(s,o_i,a') \right] $$

in the formula, w_i represents the i-th component of the k-dimensional weight vector; V^π(s, o_i) and A^π(s, o_i, a) respectively represent the state value estimate and the action advantage estimate output by the i-th expert subnetwork.
CN202111316717.3A 2021-11-09 Implicit opponent modeling method based on deep reinforcement learning Active CN114154397B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111316717.3A CN114154397B (en) 2021-11-09 Implicit opponent modeling method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111316717.3A CN114154397B (en) 2021-11-09 Implicit opponent modeling method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN114154397A true CN114154397A (en) 2022-03-08
CN114154397B CN114154397B (en) 2024-05-10


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018017A (en) * 2022-08-03 2022-09-06 中国科学院自动化研究所 Multi-agent credit allocation method, system and equipment based on ensemble learning
CN117077553A (en) * 2023-10-18 2023-11-17 崂山国家实验室 Interaction strategy optimization method for underwater attack and defense rapid opponent modeling

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024097A1 (en) * 2018-07-30 2020-02-06 东莞理工学院 Deep reinforcement learning-based adaptive game algorithm
CA3060914A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada Opponent modeling with asynchronous methods in deep rl
CN113095488A (en) * 2021-04-29 2021-07-09 电子科技大学 Cooperative game method based on multi-agent maximum entropy reinforcement learning
CN113326902A (en) * 2021-07-08 2021-08-31 中国人民解放军国防科技大学 Online learning-based strategy acquisition method, device and equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020024097A1 (en) * 2018-07-30 2020-02-06 东莞理工学院 Deep reinforcement learning-based adaptive game algorithm
CA3060914A1 (en) * 2018-11-05 2020-05-05 Royal Bank Of Canada Opponent modeling with asynchronous methods in deep rl
CN113095488A (en) * 2021-04-29 2021-07-09 电子科技大学 Cooperative game method based on multi-agent maximum entropy reinforcement learning
CN113326902A (en) * 2021-07-08 2021-08-31 中国人民解放军国防科技大学 Online learning-based strategy acquisition method, device and equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
刘强; 姜峰: "Research on group confrontation strategies based on deep reinforcement learning", Intelligent Computer and Applications, no. 05, 1 May 2020 (2020-05-01) *
曹雷; 陈希亮; 徐志雄; 赖俊: "A survey of multi-agent deep reinforcement learning", Computer Engineering and Applications, no. 05, 14 February 2020 (2020-02-14) *
石文浩; 孟军; 张朋; 刘婵娟: "A miRNA-lncRNA interaction prediction model combining CNN and Bi-LSTM", Journal of Computer Research and Development, no. 008, 31 December 2019 (2019-12-31) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018017A (en) * 2022-08-03 2022-09-06 中国科学院自动化研究所 Multi-agent credit allocation method, system and equipment based on ensemble learning
CN117077553A (en) * 2023-10-18 2023-11-17 崂山国家实验室 Interaction strategy optimization method for underwater attack and defense rapid opponent modeling
CN117077553B (en) * 2023-10-18 2023-12-15 崂山国家实验室 Interaction strategy optimization method for underwater attack and defense rapid opponent modeling

Similar Documents

Publication Publication Date Title
CN109635917B (en) Multi-agent cooperation decision and training method
CN108921298B (en) Multi-agent communication and decision-making method for reinforcement learning
CN112132263A (en) Multi-agent autonomous navigation method based on reinforcement learning
CN112488310A (en) Multi-agent group cooperation strategy automatic generation method
CN113919482A (en) Intelligent agent training method and device, computer equipment and storage medium
CN112069504A (en) Model enhanced defense method for resisting attack by deep reinforcement learning
CN112613608A (en) Reinforced learning method and related device
CN114881228A (en) Average SAC deep reinforcement learning method and system based on Q learning
CN115409158A (en) Robot behavior decision method and device based on layered deep reinforcement learning model
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
CN113276852B (en) Unmanned lane keeping method based on maximum entropy reinforcement learning framework
CN114154397B (en) Implicit opponent modeling method based on deep reinforcement learning
CN114154397A (en) Implicit adversary modeling method based on deep reinforcement learning
CN115009291B (en) Automatic driving assistance decision making method and system based on network evolution replay buffer area
CN114861368B (en) Construction method of railway longitudinal section design learning model based on near-end strategy
CN113240118B (en) Dominance estimation method, dominance estimation device, electronic device, and storage medium
CN115936058A (en) Multi-agent migration reinforcement learning method based on graph attention network
Almalki et al. Exploration of reinforcement learning to play snake game
JPH10340192A (en) Fuzzy logic controller and its non-fuzzying method
Chen et al. Modified PPO-RND method for solving sparse reward problem in ViZDoom
CN112906868A (en) Behavior clone-oriented demonstration active sampling method
CN116757969B (en) Image blind denoising method and system based on self-adaptive curvature feature fusion
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data
Awheda On Multi-Agent Reinforcement Learning in Matrix, Stochastic and Differential Games
CN113869488A (en) Game AI intelligent agent reinforcement learning method facing continuous-discrete mixed decision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant