CN111898728A - Team robot decision-making method based on multi-Agent reinforcement learning - Google Patents

Team robot decision-making method based on multi-Agent reinforcement learning

Info

Publication number
CN111898728A
CN111898728A (Application CN202010490427.XA)
Authority
CN
China
Prior art keywords
action
robot
network
value
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010490427.XA
Other languages
Chinese (zh)
Inventor
Tian Yufei (田宇飞)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202010490427.XA
Publication of CN111898728A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1628 Programme controls characterised by the control loop
    • B25J9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The invention relates to a team robot decision-making method based on multi-Agent reinforcement learning, which comprises the following steps: initializing the networks with the DQN reinforcement learning method, randomly generating weights, and initializing an experience playback area; performing network training, generating the robot's next action with an ε-greedy strategy in each interaction with the environment; storing the transition sample produced by executing the action in the experience playback area, and randomly extracting part of the data for the network update; and updating the network by gradient descent, looping over these steps, and training a value function network with excellent decision-making capability through continuous interaction with the environment. The DQN method is adopted to train the decision-making capability of the team robots, which resolves the excessive complexity of the state space and action space caused by multiple Agents and gives the robots more excellent decision-making capability.

Description

Team robot decision-making method based on multi-Agent reinforcement learning
Technical Field
The invention relates to the field of multi-robot reinforcement learning decision making, in particular to a team robot decision making method based on multi-Agent reinforcement learning.
Background
Robot technology is one of the leading high technologies in the world today, and after more than 50 years of development it is entering a brand-new era. The global robot industry will show explosive growth in the coming years, and China will become one of the most important markets worldwide.
In robot technology, the decision-making system is the key to a robot's performance: with excellent decision-making capability, a robot can make the optimal decision in the face of environmental changes and obtain the highest benefit. Reinforcement learning lets a robot continuously explore and learn while interacting with the environment and form a decision-making policy from the returns it obtains. A team robot system belongs to the class of multi-Agent systems: it consists of a group of autonomous, mutually interacting robots that share the same environment, perceive it through sensors and act through actuators. The challenges faced by multi-Agent reinforcement learning are correspondingly more complex. First, the curse of dimensionality is more severe: the state-transition probability function and the reward function of each robot must be computed over the joint action space, and the computational complexity grows exponentially as states and actions increase. Second, the learning objective is harder to define: an Agent's return depends on the behaviour of the other Agents, so the return of a single Agent cannot be maximized independently. Non-stationarity is also a problem: because all Agents learn at the same time, each Agent faces a constantly changing environment, and its best strategy may change as the other Agents' strategies change. Finally, the trade-off between exploration and greediness is more complicated: with multiple Agents, exploration serves not only to acquire information about the environment but also information about the other Agents, so as to adapt to their behaviour, yet exploration must not be excessive or it will unbalance the learning of the other Agents. Therefore, when designing a decision algorithm for a multi-Agent system, the choice of the reinforcement learning method and of the specific strategy is particularly critical.
Disclosure of Invention
The invention aims to provide a team robot decision-making method based on multi-Agent reinforcement learning, so that team robots can make optimal decisions in the face of environmental changes across different scenarios and different tasks. The scheme applies a deep reinforcement learning method to train the robots' decision-making ability through interaction with the environment: it adopts the DQN method, regards the plurality of robots as a whole and outputs the decision action of every robot simultaneously, which addresses the multi-Agent problem, and it optimizes the robots' decision-making ability by adjusting parameters.
In order to realize the purpose of the invention, the technical scheme adopted by the invention is as follows: the plurality of robots are regarded as a whole and the actions of every robot are output simultaneously, which avoids the multi-Agent problem; the DQN reinforcement learning method is adopted to train a neural network that approximates the value function, taking the environment state as input and outputting the Q value corresponding to each action; an ε-greedy strategy is adopted so that the known information is exploited while the environment is still explored; and the procedure iterates in a loop until a good value function network has been trained.
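The patent does not specify how the joint action of the team is encoded. As one minimal sketch of a possible encoding (an assumption, not the patented implementation), suppose each of the n_robots team members shares the same discrete action set of size n_actions; a single DQN output index can then stand for one action per robot:

```python
# Hypothetical joint-action encoding for a robot team treated as a single Agent.
# Assumes n_robots robots, each with the same n_actions discrete actions; the
# joint action space then has n_actions ** n_robots entries, so one DQN output
# head can select all robots' actions at once.

def joint_to_individual(joint_index: int, n_robots: int, n_actions: int) -> list[int]:
    """Decode a joint-action index into one action index per robot (base-n_actions digits)."""
    actions = []
    for _ in range(n_robots):
        actions.append(joint_index % n_actions)
        joint_index //= n_actions
    return actions

def individual_to_joint(actions: list[int], n_actions: int) -> int:
    """Encode per-robot actions back into a single joint-action index."""
    joint_index = 0
    for a in reversed(actions):
        joint_index = joint_index * n_actions + a
    return joint_index

# Example: 3 robots with 4 actions each -> 64 joint actions.
assert joint_to_individual(individual_to_joint([2, 0, 3], 4), 3, 4) == [2, 0, 3]
```

Under this encoding the joint action space grows exponentially with the team size, which is exactly the trade-off the description accepts in exchange for avoiding per-robot non-stationarity.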
In order to achieve the purpose, the technical scheme of the invention is as follows: a team robot decision-making method based on multi-Agent reinforcement learning comprises the following steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target; the Q network interacts with the environment and is updated continuously, whereas the target Q network does not interact with the environment and is not updated at every step but only at regular intervals, which ensures that the approximated value function can converge.
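A minimal sketch of this initialization follows, written in Python with PyTorch as an assumed framework (the patent prescribes none); the names q_network, target_network and memory_d, the layer sizes, and the capacity value are illustrative assumptions:

```python
import copy
from collections import deque

import torch
import torch.nn as nn

# Illustrative dimensions (assumptions): joint state of the team and joint action count.
STATE_DIM = 24             # e.g. concatenated observations of all team members
N_JOINT_ACTIONS = 64       # e.g. 4 actions per robot, 3 robots -> 4**3
REPLAY_CAPACITY = 100_000  # capacity N of the experience playback area (Memory D)

# Step 1.1: experience playback area Memory D with capacity N.
memory_d = deque(maxlen=REPLAY_CAPACITY)

# Step 1.2: Q network with randomly generated weights ω (PyTorch initializes layers randomly).
q_network = nn.Sequential(
    nn.Linear(STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_JOINT_ACTIONS),
)

# Step 1.3: target Q network with weights ω⁻ = ω; it is only refreshed periodically.
target_network = copy.deepcopy(q_network)
for p in target_network.parameters():
    p.requires_grad_(False)
```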
Step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected; the action set is defined and adjusted according to the task performed by the robot. Through the ε-greedy strategy, the robot can exploit the known information while still taking in more information from the environment; it continues to explore the environment while using the knowledge it has learned, so that it is neither confined to its existing knowledge nor explores excessively.
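Step 2.3 could look like the following sketch, assuming the q_network from the earlier initialization sketch; the value ε = 0.1 is an illustrative choice, since the patent leaves ε to tuning:

```python
import random

import torch

def select_action(state, q_network, n_joint_actions, epsilon=0.1):
    """ε-greedy: random joint action with probability ε, greedy action with probability 1-ε."""
    if random.random() < epsilon:
        return random.randrange(n_joint_actions)
    with torch.no_grad():
        q_values = q_network(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
    return int(q_values.argmax(dim=1).item())
```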
Step 3: recording data, and randomly extracting data for network updating;
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly drawn from the experience playback area for the network update. Supervised learning requires the data to be independent; the data experienced by the Agent are stored in the experience playback area, but they are sequential, so when the parameters are updated a part of the data is drawn by sampling and used for the update, which breaks the correlation between the data. The Memory D also gives the robot richer learning experience, and shuffling the correlation among experiences keeps the robot's decisions from being influenced by the immediately preceding states.
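Steps 3.2 and 3.3 can be sketched as below; the done flag (terminal indicator) and the batch size of 32 are assumptions, added here because the terminal case is needed later in step 4.1, and random.sample is used precisely to break the temporal correlation the description mentions:

```python
import random

def store_transition(memory_d, state, action, reward, new_state, done):
    """Step 3.2: store one transition sample (state, action, reward, new state) in Memory D."""
    memory_d.append((state, action, reward, new_state, done))

def sample_batch(memory_d, batch_size=32):
    """Step 3.3: randomly draw a portion of the stored data for the network update."""
    batch = random.sample(memory_d, batch_size)
    states, actions, rewards, new_states, dones = zip(*batch)
    return states, actions, rewards, new_states, dones
```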
Step 4: updating the network, and iterating in a loop;
step 4.1: calculating the current Q value estimate target. DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value. If S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
this is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight;
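Continuing the same PyTorch assumptions and the (states, actions, rewards, new_states, dones) batch layout of the earlier replay sketch, the target of step 4.1 and the loss of step 4.2 could be computed as follows; folding the terminal case into a (1 - done) mask is an implementation convenience rather than something stated in the patent:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99  # discount factor γ (illustrative value)

def compute_loss(q_network, target_network, batch):
    states, actions, rewards, new_states, dones = batch
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    new_states = torch.as_tensor(new_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Step 4.1: target = R_t                                   if S_{t+1} is terminal
    #           target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)       otherwise
    with torch.no_grad():
        max_next_q = target_network(new_states).max(dim=1).values
        target = rewards + GAMMA * (1.0 - dones) * max_next_q

    # Step 4.2: squared difference between target and the predicted value Q(s, a; ω).
    predicted = q_network(states).gather(1, actions).squeeze(1)
    return F.mse_loss(predicted, target)
```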
step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
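Steps 4.3 and 4.4 then amount to a gradient step on ω and a periodic hard copy ω⁻ = ω; the Adam optimizer (a gradient-descent variant) and the update interval below are assumptions, not values fixed by the patent:

```python
import torch

TARGET_UPDATE_INTERVAL = 1000  # copy ω⁻ = ω every this many update steps (assumption)

def update_networks(loss, optimizer, q_network, target_network, step_count):
    # Step 4.3: update ω with a gradient-descent step on the loss function.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Step 4.4: every fixed number of steps, refresh the target network weights ω⁻ = ω.
    if step_count % TARGET_UPDATE_INTERVAL == 0:
        target_network.load_state_dict(q_network.state_dict())

# Example wiring (assumption): optimizer = torch.optim.Adam(q_network.parameters(), lr=1e-3)
```

A hard copy at fixed intervals matches step 4.4; a soft (Polyak-averaged) update would be an alternative design the patent does not mention.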
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged. The training effect can be monitored in real time through the rewards obtained by the robots, and the robots can learn different strategy tendencies by modifying the reward function.
Compared with the prior art, the invention has the following advantages:
1. Group reinforcement learning is adopted: all robots are merged into one whole for strategy learning, and all the actions and states within the group are combined into a joint action space and joint state space. This overcomes the difficulty that the strategies of independent robots are hard to converge, and solves the multi-Agent problem of the team robots.
2. When the robots face complex states, actions and environments, the approximate value function trained by the neural network can output, from the input environment state, the expected benefit of every action the robots could take; the action with the largest benefit is selected and executed to complete the optimal decision, and the strategy tendency of the robots can be adjusted by changing their state space, action space and reward function.
3. The adopted DQN algorithm is a model-free reinforcement learning algorithm, so when facing an unknown environment the robots can keep learning and exploring from the feedback obtained by interacting with that environment, and thereby obtain an optimal strategy for the current environment.
Drawings
FIG. 1 is a flow chart of the present invention;
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1: as shown in FIG. 1, the invention provides a team robot decision-making method based on multi-Agent reinforcement learning, which comprises the following detailed steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target;
step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected; the action set is defined and adjusted according to the task performed by the robot. Through the ε-greedy strategy, the robot can exploit the known information while still exploring more information in the environment;
step 3: recording data, and randomly extracting data for network updating;
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly drawn from the experience playback area for the network update. Supervised learning requires the data to be independent; the data experienced by the Agent are stored in the experience playback area, but they are sequential, so when the parameters are updated a part of the data is drawn by sampling and used for the update, which breaks the correlation between the data;
step 4: updating the network, and iterating in a loop;
step 4.1: calculating the current Q value estimate target. DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value. If S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
This is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight.
Step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged.
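To show how the steps of this embodiment fit together, the following self-contained sketch runs the whole loop on a deliberately trivial stand-in environment (random states, reward 1 for joint action 0); the toy_env_step function, every dimension and every hyperparameter here are illustrative assumptions that merely stand in for an actual team-robot task:

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_JOINT_ACTIONS = 8, 16     # assumed joint state size and joint action count
GAMMA, EPSILON, BATCH = 0.99, 0.1, 32  # illustrative hyperparameters
TARGET_SYNC = 200                      # steps between ω⁻ = ω copies

def toy_env_step(state, action):
    """Stand-in environment: random next state, reward 1 for joint action 0, never terminal."""
    reward = 1.0 if action == 0 else 0.0
    return torch.rand(STATE_DIM).tolist(), reward, False

# Step 1: networks ω, ω⁻ and experience playback area Memory D.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_JOINT_ACTIONS))
target_net = copy.deepcopy(q_net)
memory_d = deque(maxlen=10_000)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

state = torch.rand(STATE_DIM).tolist()
for step in range(1, 5001):
    # Step 2: ε-greedy action selection.
    if random.random() < EPSILON:
        action = random.randrange(N_JOINT_ACTIONS)
    else:
        with torch.no_grad():
            action = int(q_net(torch.tensor([state])).argmax(dim=1).item())

    # Step 3: execute the action, record the transition, sample a random batch.
    new_state, reward, done = toy_env_step(state, action)
    memory_d.append((state, action, reward, new_state, done))
    state = new_state
    if len(memory_d) < BATCH:
        continue
    s, a, r, s2, d = zip(*random.sample(memory_d, BATCH))

    # Step 4: target, loss, gradient step and periodic target-network refresh.
    s, s2 = torch.tensor(s), torch.tensor(s2)
    a = torch.tensor(a).unsqueeze(1)
    r, d = torch.tensor(r), torch.tensor(d, dtype=torch.float32)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - d) * target_net(s2).max(dim=1).values
    loss = F.mse_loss(q_net(s).gather(1, a).squeeze(1), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % TARGET_SYNC == 0:
        target_net.load_state_dict(q_net.state_dict())
```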
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and all equivalent substitutions or substitutions made on the basis of the above-mentioned technical solutions belong to the scope of the present invention.

Claims (5)

1. A team robot decision-making method based on multi-Agent reinforcement learning is characterized in that: the method comprises the following steps:
step 1: initializing a network, randomly generating weights, and initializing an experience playback area;
step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment;
step 3: recording data, and randomly extracting data for network updating;
step 4: updating the network and iterating in a loop.
2. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 1: initializing a network, randomly generating weights, and initializing an experience playback area, wherein the method specifically comprises the following steps:
step 1.1: initializing an experience playback area Memory D, wherein the capacity of the experience playback area Memory D is N, and the experience playback area Memory D is used for storing data experienced by the robot;
step 1.2: initializing a Q network and randomly generating its weights ω as the value function network to be trained; its input is the state obtained by the robot, and its output is the Q value of executing each action in that state, which is used as the reference for robot decision making;
step 1.3: initializing a target Q network with weights ω⁻ = ω, used to calculate the Q value estimate target.
3. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 2: performing network training, and generating the next action of the robot by using an ε-greedy strategy in each interaction with the environment, wherein the method specifically comprises the following steps:
step 2.1: the robot acquires the input state by organizing the current environment; the types of state variables are set and adjusted according to the different environment scenarios;
step 2.2: calculating the value function through the neural network, namely obtaining the Q value of taking each action in the current state;
step 2.3: generating the next action of the robot by using the ε-greedy strategy: with probability ε one action is drawn at random from all actions, and with probability 1-ε the action that maximizes the value function is selected, the action set being defined and adjusted according to the task performed by the robot; through the ε-greedy strategy, the robot can utilize the known information while still exploring more information in the environment.
4. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: and step 3: recording data, and randomly extracting data for network updating, wherein the method specifically comprises the following steps:
step 3.1: executing the selected action, and calculating the reward obtained and the change of the surrounding environment after the action is executed, namely the new state; the reward is set and adjusted according to the strategy tendency of the task performed by the robot and the training effect;
step 3.2: recording transition data, storing transition samples (state, action, reward, new state) in an experience playback area;
step 3.3: a portion of the data (state, action, reward, new state) is randomly extracted from the experience playback area for the network update; supervised learning requires the data to be independent, and the experience playback area stores the data experienced by the Agent.
5. The multi-Agent reinforcement learning-based team robot decision-making method according to claim 1, characterized in that: step 4: updating the network, and iterating in a loop, specifically as follows:
step 4.1: calculating the current Q value estimate target, wherein DQN belongs to the Off-policy Learning methods, namely the strategy that generates actions differs from the strategy that is evaluated: actions are generated with the ε-greedy strategy, evaluation uses the greedy strategy, and target is computed with the value function corresponding to the action that maximizes the Q value; if S_{t+1} is a terminal state, target is the reward R_t of the current action; otherwise, the expression for computing target is:

target = R_t + γ·max_a Q(S_{t+1}, a; ω⁻)
wherein γ is the reinforcement-learning discount factor, and a is an action taken by the robot;
step 4.2: calculating a loss function, wherein the expression of the loss function is as follows:
L(ω) = E[(target - Q(s, a; ω))²]
this is a residual (squared-error) model, i.e. the square of the difference between the true value, for which the target estimated in step 4.1 is substituted, and the predicted value Q(s, a; ω) output by the neural network, where s is the current state, a is the executed action, and ω is the current network weight;
step 4.3: updating the Q network, and updating ω by using a gradient descent method according to the loss function;
step 4.4: updating the target Q network, setting the weights ω⁻ = ω every fixed number of steps;
Step 4.5: performing loop iteration, adjusting the state, action and reward composition of the robot according to the training effect, and trying various deep network structures until a good value function network is trained, namely the trained action strategy can meet the task requirements of the team robots, or it has converged.
CN202010490427.XA 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning Pending CN111898728A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010490427.XA CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010490427.XA CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN111898728A true CN111898728A (en) 2020-11-06

Family

ID=73206625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010490427.XA Pending CN111898728A (en) 2020-06-02 2020-06-02 Team robot decision-making method based on multi-Agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN111898728A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108762281A (en) * 2018-06-08 2018-11-06 哈尔滨工程大学 It is a kind of that intelligent robot decision-making technique under the embedded Real-time Water of intensified learning is associated with based on memory
CN110321666A (en) * 2019-08-09 2019-10-11 重庆理工大学 Multi-robots Path Planning Method based on priori knowledge Yu DQN algorithm

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650394A (en) * 2020-12-24 2021-04-13 深圳前海微众银行股份有限公司 Intelligent device control method, device and readable storage medium
CN112650394B (en) * 2020-12-24 2023-04-25 深圳前海微众银行股份有限公司 Intelligent device control method, intelligent device control device and readable storage medium
CN112734030B (en) * 2020-12-31 2022-09-02 中国科学技术大学 Unmanned platform decision learning method for empirical playback sampling by using state similarity
CN112734030A (en) * 2020-12-31 2021-04-30 中国科学技术大学 Unmanned platform decision learning method for empirical playback sampling by using state similarity
CN113031437A (en) * 2021-02-26 2021-06-25 同济大学 Water pouring service robot control method based on dynamic model reinforcement learning
CN113095500A (en) * 2021-03-31 2021-07-09 南开大学 Robot tracking method based on multi-agent reinforcement learning
CN113222253B (en) * 2021-05-13 2022-09-30 珠海埃克斯智能科技有限公司 Scheduling optimization method, device, equipment and computer readable storage medium
CN113222253A (en) * 2021-05-13 2021-08-06 珠海埃克斯智能科技有限公司 Scheduling optimization method, device and equipment and computer readable storage medium
CN113269315A (en) * 2021-06-29 2021-08-17 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing task using deep reinforcement learning
CN113269315B (en) * 2021-06-29 2024-04-02 安徽寒武纪信息科技有限公司 Apparatus, method and readable storage medium for performing tasks using deep reinforcement learning
CN113485119A (en) * 2021-07-29 2021-10-08 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113485119B (en) * 2021-07-29 2022-05-10 中国人民解放军国防科技大学 Heterogeneous homogeneous population coevolution method for improving swarm robot evolutionary capability
CN113467481A (en) * 2021-08-11 2021-10-01 哈尔滨工程大学 Path planning method based on improved Sarsa algorithm
CN115563527A (en) * 2022-09-27 2023-01-03 西南交通大学 Multi-Agent deep reinforcement learning framework and method based on state classification and assignment

Similar Documents

Publication Publication Date Title
CN111898728A (en) Team robot decision-making method based on multi-Agent reinforcement learning
Moriarty et al. Evolutionary algorithms for reinforcement learning
CN110515303B (en) DDQN-based self-adaptive dynamic path planning method
CN112052936B (en) Reinforced learning exploration method and device based on generation countermeasure mechanism
Wang et al. Learning robust manipulation strategies with multimodal state transition models and recovery heuristics
CN112613608A (en) Reinforced learning method and related device
Hameed et al. Gradient monitored reinforcement learning
US20230268035A1 (en) Method and apparatus for generating chemical structure using neural network
US20220413496A1 (en) Predictive Modeling of Aircraft Dynamics
Hagg et al. Modeling user selection in quality diversity
Mohammadpour et al. Chaotic genetic algorithm based on explicit memory with a new strategy for updating and retrieval of memory in dynamic environments
WO2021140698A1 (en) Information processing device, method, and program
CN113503885B (en) Robot path navigation method and system based on sampling optimization DDPG algorithm
Taranovic et al. Adversarial imitation learning with preferences
Liu et al. Quadratic interpolation based orthogonal learning particle swarm optimization algorithm
KR102259786B1 (en) Method for processing game data
Sun et al. Emulation Learning for Neuromimetic Systems
Olesen et al. Evolutionary planning in latent space
Zhu et al. CausalDyna: Improving Generalization of Dyna-style Reinforcement Learning via Counterfactual-Based Data Augmentation
Roth et al. MSVIPER
Beigi et al. A simple interaction model for learner agents: An evolutionary approach
Meena et al. A Survey on Intrinsically Motivated Reinforcement Learning
Andersen et al. Safer reinforcement learning for agents in industrial grid-warehousing
Jain RAMario: Experimental Approach to Reptile Algorithm--Reinforcement Learning for Mario
Sopov Self-configuring Multi-strategy Genetic Algorithm for Non-stationary Environments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Tian Yufei

Inventor after: Huang Yongming

Inventor before: Tian Yufei