CN111898728A - Team robot decision-making method based on multi-Agent reinforcement learning - Google Patents

Team robot decision-making method based on multi-Agent reinforcement learning

Info

Publication number
CN111898728A
CN111898728A (Application CN202010490427.XA)
Authority
CN
China
Prior art keywords
action
robot
network
value
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010490427.XA
Other languages
Chinese (zh)
Inventor
Tian Yufei (田宇飞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010490427.XA
Publication of CN111898728A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The invention relates to a team robot decision-making method based on multi-Agent reinforcement learning, which comprises the following steps: initializing the networks with the DQN reinforcement learning method, randomly generating weights, and initializing an experience playback area; performing network training, generating the robot's next action with an ε-greedy strategy in each interaction with the environment; storing the transition sample produced by executing the action in the experience playback area, and randomly extracting part of the data for the network update; and updating the network by gradient descent, looping over these steps, and training a value function network with excellent decision-making capability through continuous interaction with the environment. The DQN method is adopted to train the decision-making capability of the team robots, which resolves the excessive complexity of the state space and action space caused by multiple Agents and gives the robots more excellent decision-making capability.

Description

Team robot decision-making method based on multi-Agent reinforcement learning
Technical Field
The invention relates to the field of multi-robot reinforcement learning decision making, in particular to a team robot decision making method based on multi-Agent reinforcement learning.
Background
Robot technology is one of the leading high technologies in the world today, and after more than 50 years of development it is entering a brand-new era. The global robot industry will show explosive growth in the coming years, and China will become one of the most important markets worldwide.
In robot technology, the decision-making system is the key to a robot's performance: with excellent decision-making capability, a robot can make the optimal decision in the face of environmental changes and obtain the highest benefit. Reinforcement learning lets a robot continuously explore and learn while interacting with the environment and form a decision-making policy from the returns it obtains. A team robot system belongs to the class of multi-Agent systems: it consists of a group of autonomous, mutually interacting robots that share the same environment, perceive it through sensors and act through actuators. The challenges faced by multi-Agent reinforcement learning are correspondingly more complex. First, the curse of dimensionality is more severe: the state-transition probability function and the reward function of each robot must be computed over the joint action space, and the computational complexity grows exponentially as states and actions increase. Second, the learning objective is harder to define: an Agent's return depends on the behaviour of the other Agents, so the return of a single Agent cannot be maximized independently. Non-stationarity is also a problem: because all Agents learn at the same time, each Agent faces a constantly changing environment, and its best strategy may change as the other Agents' strategies change. Finally, the trade-off between exploration and greediness is more complicated: with multiple Agents, exploration serves not only to acquire information about the environment but also information about the other Agents, so as to adapt to their behaviour, yet exploration must not be excessive or it will unbalance the learning of the other Agents. Therefore, when designing a decision algorithm for a multi-Agent system, the choice of the reinforcement learning method and of the specific strategy is particularly critical.
Disclosure of Invention
The invention aims to provide a team robot decision-making method based on multi-Agent reinforcement learning, so that team robots can make optimal decisions in the face of environmental changes across different scenarios and different tasks. The scheme applies a deep reinforcement learning method to train the robots' decision-making ability through interaction with the environment: it adopts the DQN method, regards the plurality of robots as a whole and outputs the decision action of every robot simultaneously, which addresses the multi-Agent problem, and it optimizes the robots' decision-making ability by adjusting parameters.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: the plurality of robots are regarded as a whole and the actions of every robot are output simultaneously, which avoids the multi-Agent problem; the DQN reinforcement learning method is adopted to train a neural network that approximates the value function, taking the environment state as input and outputting the Q value corresponding to each action; an ε-greedy strategy is adopted so that the known information is exploited while the environment is still explored; and the procedure iterates in a loop until a good value function network has been trained.
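The patent does not specify how the joint action of the team is encoded. As one minimal sketch of a possible encoding (an assumption, not the patented implementation), suppose each of the n_robots team members shares the same discrete action set of size n_actions; a single DQN output index can then stand for one action per robot:

```python
# Hypothetical joint-action encoding for a robot team treated as a single Agent.
# Assumes n_robots robots, each with the same n_actions discrete actions; the
# joint action space then has n_actions ** n_robots entries, so one DQN output
# head can select all robots' actions at once.

def joint_to_individual(joint_index: int, n_robots: int, n_actions: int) -> list[int]:
    """Decode a joint-action index into one action index per robot (base-n_actions digits)."""
    actions = []
    for _ in range(n_robots):
        actions.append(joint_index % n_actions)
        joint_index //= n_actions
    return actions

def individual_to_joint(actions: list[int], n_actions: int) -> int:
    """Encode per-robot actions back into a single joint-action index."""
    joint_index = 0
    for a in reversed(actions):
        joint_index = joint_index * n_actions + a
    return joint_index

# Example: 3 robots with 4 actions each -> 64 joint actions.
assert joint_to_individual(individual_to_joint([2, 0, 3], 4), 3, 4) == [2, 0, 3]
```

Under this encoding the joint action space grows exponentially with the team size, which is exactly the trade-off the description accepts in exchange for avoiding per-robot non-stationarity.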
In order to achieve the purpose, the technical scheme of the invention is as follows: a team robot decision-making method based on multi-Agent reinforcement learning comprises the following steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target; the Q network interacts with the environment and is updated continuously, whereas the target Q network does not interact with the environment and is not updated at every step but only at regular intervals, which ensures that the approximated value function can converge.
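A minimal sketch of this initialization follows, written in Python with PyTorch as an assumed framework (the patent prescribes none); the names q_network, target_network and memory_d, the layer sizes, and the capacity value are illustrative assumptions:

```python
import copy
from collections import deque

import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): joint state of the team and joint action count.
STATE_DIM = 24             # e.g. concatenated observations of all team members
N_JOINT_ACTIONS = 64       # e.g. 4 actions per robot, 3 robots -> 4**3
REPLAY_CAPACITY = 100_000  # capacity N of the experience playback area (Memory D)

# Step 1.1: experience playback area Memory D with capacity N.
memory_d = deque(maxlen=REPLAY_CAPACITY)

# Step 1.2: Q network with randomly generated weights ω (PyTorch initializes layers randomly).
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_JOINT_ACTIONS),
)

# Step 1.3: target Q network with weights ω⁻ = ω; it is only refreshed periodically.
target_network = copy.deepcopy(q_network)
for p in target_network.parameters():
    p.requires_grad_(False)
```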
Step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected; the action set is defined and adjusted according to the task performed by the robot. Through the ε-greedy strategy, the robot can exploit the known information while still taking in more information from the environment; it continues to explore the environment while using the knowledge it has learned, so that it is neither confined to its existing knowledge nor explores excessively.
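Step 2.3 could look like the following sketch, assuming the q_network from the earlier initialization sketch; the value ε = 0.1 is an illustrative choice, since the patent leaves ε to tuning:

```python
import random

import torch

def select_action(state, q_network, n_joint_actions, epsilon=0.1):
    """ε-greedy: random joint action with probability ε, greedy action with probability 1-ε."""
    if random.random() < epsilon:
        return random.randrange(n_joint_actions)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```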
Step 3: recording data, and randomly extracting data for network updating;
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly drawn from the experience playback area for the network update. Supervised learning requires the data to be independent; the data experienced by the Agent are stored in the experience playback area, but they are sequential, so when the parameters are updated a part of the data is drawn by sampling and used for the update, which breaks the correlation between the data. The Memory D also gives the robot richer learning experience, and shuffling the correlation among experiences keeps the robot's decisions from being influenced by the immediately preceding states.
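Steps 3.2 and 3.3 can be sketched as below; the done flag (terminal indicator) and the batch size of 32 are assumptions, added here because the terminal case is needed later in step 4.1, and random.sample is used precisely to break the temporal correlation the description mentions:

```python
import random

def store_transition(memory_d, state, action, reward, new_state, done):
    """Step 3.2: store one transition sample (state, action, reward, new state) in Memory D."""
    memory_d.append((state, action, reward, new_state, done))

def sample_batch(memory_d, batch_size=32):
    """Step 3.3: randomly draw a portion of the stored data for the network update."""
    batch = random.sample(memory_d, batch_size)
    states, actions, rewards, new_states, dones = zip(*batch)
    return states, actions, rewards, new_states, dones
```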
Step 4: updating the network, and iterating in a loop;
step 4.1: calculating the current Q value estimate target. DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value. If S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
this is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight;
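Continuing the same PyTorch assumptions and the (states, actions, rewards, new_states, dones) batch layout of the earlier replay sketch, the target of step 4.1 and the loss of step 4.2 could be computed as follows; folding the terminal case into a (1 - done) mask is an implementation convenience rather than something stated in the patent:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor γ (illustrative value)

def compute_loss(q_network, target_network, batch):
    states, actions, rewards, new_states, dones = batch
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    new_states = torch.as_tensor(new_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Step 4.1: target = R_t                                   if S_{t+1} is terminal
    #           target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)       otherwise
    with torch.no_grad():
        max_next_q = target_network(new_states).max(dim=1).values
        target = rewards + GAMMA * (1.0 - dones) * max_next_q

    # Step 4.2: squared difference between target and the predicted value Q(s, a; ω).
    predicted = q_network(states).gather(1, actions).squeeze(1)
    return F.mse_loss(predicted, target)
```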
step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
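Steps 4.3 and 4.4 then amount to a gradient step on ω and a periodic hard copy ω⁻ = ω; the Adam optimizer (a gradient-descent variant) and the update interval below are assumptions, not values fixed by the patent:

```python
import torch

TARGET_UPDATE_INTERVAL = 1000  # copy ω⁻ = ω every this many update steps (assumption)

def update_networks(loss, optimizer, q_network, target_network, step_count):
    # Step 4.3: update ω with a gradient-descent step on the loss function.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Step 4.4: every fixed number of steps, refresh the target network weights ω⁻ = ω.
    if step_count % TARGET_UPDATE_INTERVAL == 0:
        target_network.load_state_dict(q_network.state_dict())

# Example wiring (assumption): optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
```

A hard copy at fixed intervals matches step 4.4; a soft (Polyak-averaged) update would be an alternative design the patent does not mention.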
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged. The training effect can be monitored in real time through the rewards obtained by the robots, and the robots can learn different strategy tendencies by modifying the reward function.
Compared with the prior art, the invention has the following advantages:
1. Group reinforcement learning is adopted: all robots are merged into one whole for strategy learning, and all the actions and states within the group are combined into a joint action space and joint state space. This overcomes the difficulty that the strategies of independent robots are hard to converge, and solves the multi-Agent problem of the team robots.
2. When the robots face complex states, actions and environments, the approximate value function trained by the neural network can output, from the input environment state, the expected benefit of every action the robots could take; the action with the largest benefit is selected and executed to complete the optimal decision, and the strategy tendency of the robots can be adjusted by changing their state space, action space and reward function.
3. The adopted DQN algorithm is a model-free reinforcement learning algorithm, so when facing an unknown environment the robots can keep learning and exploring from the feedback obtained by interacting with that environment, and thereby obtain an optimal strategy for the current environment.
Drawings
FIG. 1 is a flow chart of the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: as shown in FIG. 1, the invention provides a team robot decision-making method based on multi-Agent reinforcement learning, which comprises the following detailed steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target;
step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected; the action set is defined and adjusted according to the task performed by the robot. Through the ε-greedy strategy, the robot can exploit the known information while still exploring more information in the environment;
step 3: recording data, and randomly extracting data for network updating;
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly drawn from the experience playback area for the network update. Supervised learning requires the data to be independent; the data experienced by the Agent are stored in the experience playback area, but they are sequential, so when the parameters are updated a part of the data is drawn by sampling and used for the update, which breaks the correlation between the data;
step 4: updating the network, and iterating in a loop;
step 4.1: calculating the current Q value estimate target. DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value. If S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
This is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight.
Step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged.
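To show how the steps of this embodiment fit together, the following self-contained sketch runs the whole loop on a deliberately trivial stand-in environment (random states, reward 1 for joint action 0); the toy_env_step function, every dimension and every hyperparameter here are illustrative assumptions that merely stand in for an actual team-robot task:

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_JOINT_ACTIONS = 8, 16     # assumed joint state size and joint action count
GAMMA, EPSILON, BATCH = 0.99, 0.1, 32  # illustrative hyperparameters
TARGET_SYNC = 200                      # steps between ω⁻ = ω copies

def toy_env_step(state, action):
    """Stand-in environment: random next state, reward 1 for joint action 0, never terminal."""
    reward = 1.0 if action == 0 else 0.0
    return torch.rand(STATE_DIM).tolist(), reward, False

# Step 1: networks ω, ω⁻ and experience playback area Memory D.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_JOINT_ACTIONS))
target_net = copy.deepcopy(q_net)
memory_d = deque(maxlen=10_000)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

state = torch.rand(STATE_DIM).tolist()
for step in range(1, 5001):
    # Step 2: ε-greedy action selection.
    if random.random() < EPSILON:
        action = random.randrange(N_JOINT_ACTIONS)
    else:
        with torch.no_grad():
            action = int(q_net(torch.tensor([state])).argmax(dim=1).item())

    # Step 3: execute the action, record the transition, sample a random batch.
    new_state, reward, done = toy_env_step(state, action)
    memory_d.append((state, action, reward, new_state, done))
    state = new_state
    if len(memory_d) < BATCH:
        continue
    s, a, r, s2, d = zip(*random.sample(memory_d, BATCH))

    # Step 4: target, loss, gradient step and periodic target-network refresh.
    s, s2 = torch.tensor(s), torch.tensor(s2)
    a = torch.tensor(a).unsqueeze(1)
    r, d = torch.tensor(r), torch.tensor(d, dtype=torch.float32)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - d) * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q_net(s).gather(1, a).squeeze(1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % TARGET_SYNC == 0:
        target_net.load_state_dict(q_net.state_dict())
```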
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent substitutions or substitutions made on the basis of the above-mentioned technical solutions belong to the scope of the present invention.

Claims (5)

1. A team robot decision-making method based on multi-Agent reinforcement learning is characterized in that: the method comprises the following steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 3: recording data, and randomly extracting data for network updating;
step 4: updating the network and iterating in a loop.
2. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 1: initializing a network, randomly generating weights, and initializing an experience playback area, wherein the method specifically comprises the following steps:
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target.
3. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment, wherein the method specifically comprises the following steps:
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected, the action set being defined and adjusted according to the task performed by the robot; through the ε-greedy strategy, the robot can utilize the known information while still exploring more information in the environment.
4. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: and step 3: recording data, and randomly extracting data for network updating, wherein the method specifically comprises the following steps:
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly extracted from the experience playback area for the network update; supervised learning requires the data to be independent, and the experience playback area stores the data experienced by the Agent.
5. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 4: updating the network, and iterating in a loop, specifically as follows:
step 4.1: calculating the current Q value estimate target, wherein DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value; if S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
this is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight;
step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged.
CN202010490427.XA 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning Pending CN111898728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010490427.XA CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010490427.XA CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN111898728A true CN111898728A (en) 2020-11-06

Family

ID=73206625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010490427.XA Pending CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN111898728A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650394A (en) * 2020-12-24 2021-04-13 深圳前海微众银行股份有限公司 Intelligent device control method, device and readable storage medium
CN112650394B (en) * 2020-12-24 2023-04-25 深圳前海微众银行股份有限公司 Intelligent device control method, intelligent device control device and readable storage medium
CN112734030B (en) * 2020-12-31 2022-09-02 中国科学技术大学 Unmanned platform decision learning method for empirical playback sampling by using state similarity
CN112734030A (en) * 2020-12-31 2021-04-30 中国科学技术大学 Unmanned platform decision learning method for empirical playback sampling by using state similarity
CN113031437A (en) * 2021-02-26 2021-06-25 同济大学 Water pouring service robot control method based on dynamic model reinforcement learning
CN113095500A (en) * 2021-03-31 2021-07-09 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113222253B (en) * 2021-05-13 2022-09-30 珠海埃克斯智能科技有限公司 Scheduling optimization method, device, equipment and computer readable storage medium
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113485119B (en) * 2021-07-29 2022-05-10 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113467481A (en) * 2021-08-11 2021-10-01 哈尔滨工程大学 Path planning method based on improved Sarsa algorithm
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment

Similar Documents

Publication Publication Date Title
CN111898728A (en) Team robot decision-making method based on multi-Agent reinforcement learning
Moriarty et al. Evolutionary algorithms for reinforcement learning
CN110515303B (en) DDQN-based self-adaptive dynamic path planning method
CN112052936B (en) Reinforced learning exploration method and device based on generation countermeasure mechanism
Wang et al. Learning robust manipulation strategies with multimodal state transition models and recovery heuristics
CN112613608A (en) Reinforced learning method and related device
Hameed et al. Gradient monitored reinforcement learning
US20230268035A1 (en) Method and apparatus for generating chemical structure using neural network
US20220413496A1 (en) Predictive Modeling of Aircraft Dynamics
Hagg et al. Modeling user selection in quality diversity
Mohammadpour et al. Chaotic genetic algorithm based on explicit memory with a new strategy for updating and retrieval of memory in dynamic environments
WO2021140698A1 (en) Information processing device, method, and program
CN113503885B (en) Robot path navigation method and system based on sampling optimization DDPG algorithm
Taranovic et al. Adversarial imitation learning with preferences
Liu et al. Quadratic interpolation based orthogonal learning particle swarm optimization algorithm
KR102259786B1 (en) Method for processing game data
Sun et al. Emulation Learning for Neuromimetic Systems
Olesen et al. Evolutionary planning in latent space
Zhu et al. CausalDyna: Improving Generalization of Dyna-style Reinforcement Learning via Counterfactual-Based Data Augmentation
Roth et al. MSVIPER
Beigi et al. A simple interaction model for learner agents: An evolutionary approach
Meena et al. A Survey on Intrinsically Motivated Reinforcement Learning
Andersen et al. Safer reinforcement learning for agents in industrial grid-warehousing
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario
Sopov Self-configuring Multi-strategy Genetic Algorithm for Non-stationary Environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Tian Yufei

Inventor after: Huang Yongming

Inventor before: Tian Yufei