CN114489035B - Multi-robot collaborative search method based on accumulated trace reinforcement learning - Google Patents

Multi-robot collaborative search method based on accumulated trace reinforcement learning

Info

Publication number
CN114489035B
CN114489035B (application number CN202011267650.4A)
Authority
CN
China
Prior art keywords
action
state
neural network
reinforcement learning
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011267650.4A
Other languages
Chinese (zh)
Other versions
CN114489035A (en)
Inventor
徐志雄
陈希亮
洪志理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011267650.4A priority Critical patent/CN114489035B/en
Publication of CN114489035A publication Critical patent/CN114489035A/en
Application granted granted Critical
Publication of CN114489035B publication Critical patent/CN114489035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The application provides a multi-robot collaborative search method based on accumulated-trace reinforcement learning. The method records the state-action pairs visited by the agent, accumulates the visit count of each state-action pair, introduces a Boltzmann distribution, designs an exploration scheme based on the accumulated trace, and guides the agent's exploration by adding an additional reward based on the accumulated trace. Taking a reinforcement learning algorithm as the main algorithm and improving on it, the method can be combined with any reinforcement learning algorithm, improves the learning efficiency of reinforcement learning, and achieves efficient collaborative exploration by multiple robots in complex environments.

Description

Multi-robot collaborative search method based on accumulated trace reinforcement learning
Technical Field
The application relates to a collaborative search method based on accumulated-trace reinforcement learning, and in particular to a multi-robot collaborative search method that introduces a cumulative-trace-based reward and exploration scheme into reinforcement learning. It belongs to the technical fields of intelligent decision-making, intelligent task planning, and intelligent command and control for military information systems.
Background
With the rapid development of military science and technology, unmanned combat is gradually becoming one of the main combat patterns of future warfare. For land unmanned combat, replacing human soldiers with robot soldiers is a trend in the development of future combat systems. As more and more unmanned aerial vehicles, unmanned ground vehicles and similar equipment are used in military systems, traditional centralized solutions can no longer meet the requirements of modern warfare. More and more military systems require distributed multi-robot systems with positioning and communication capabilities. Cooperation among multiple robots can often accomplish complex tasks more efficiently and quickly than a single robot; such tasks mainly include localization, task allocation, obstacle avoidance and path planning. Among them, the multi-robot search problem is one of the main research problems at present.
The multi-robot search problem can be divided into single-target search and multi-target search, the former being a special case of the latter. For single-target search there are intentional cooperation methods that explicitly plan robot behaviour, such as gradient-descent methods and game-theoretic methods, and there are emergent cooperation methods that accomplish a prescribed task through emergent intelligent behaviour, such as the extended particle swarm algorithm and the firefly algorithm. Of these, emergent cooperation is generally more efficient. Single-target collaborative search occurs only at the individual level, is essentially fine-grained cooperation, and has by now been studied fairly thoroughly. Multi-target search, by contrast, involves different requirements on the system's operating mechanism, on the functions and roles of each robot, and on the interrelations among robots during task decomposition, allocation, planning, decision-making and execution; it is oriented toward dynamic environments and demands autonomy and adaptivity, making it a considerably more complex problem.
The main task of multi-robot multi-target collaborative search is to find all targets in the shortest time. For this optimization problem, optimization algorithms such as the particle swarm algorithm, the firefly algorithm and the genetic algorithm can be used. For example, Pugh et al. extended the particle swarm algorithm for the cooperation of swarm robots; Tang and Eberhard applied the particle swarm algorithm to target-search research with a focus on algorithm parameter optimization; Feng Yangong, addressing the firefly algorithm's slow convergence, low solution precision in global optimization and tendency to fall into local extrema, proposed a dynamic-population firefly algorithm based on chaos theory; and Zhang Yi, addressing the traditional genetic algorithm's low search efficiency and tendency to fall into local optima, proposed an improved genetic algorithm: simple one-dimensional coding replaces complex two-dimensional coding to save storage space; in the design of the genetic operators, the crossover and mutation operators are redefined to avoid falling into local optima; and finally the shortest collision-free path is used as the fitness function for genetic optimization. However, in the problems these heuristic algorithms mainly address, the global position information of the targets is unknown, so it is difficult to obtain good solutions with heuristic algorithms alone, and all available information and knowledge must be exploited to guide the robots' search behaviour.
Reinforcement learning, as a machine learning method for solving sequential decision problems, learns a policy directly through trial-and-error interaction with the task environment, so it requires neither a hand-constructed reasoning model nor a large amount of sample data and therefore has strong applicability and generality. Its obvious drawback, however, is low learning efficiency: in a complex task environment it may even fail to learn an effective policy. How to improve the learning efficiency of reinforcement learning under limited computing resources and achieve efficient collaborative exploration by multiple robots in complex environments is the main problem to be solved by the application.
Disclosure of Invention
The aim of the application is to provide a cumulative-trace-based reward and exploration method that resolves the contradiction between exploration and exploitation in the training process of a reinforcement learning method and improves the learning efficiency of reinforcement learning.
The application adopts the following technical scheme.
The application provides a multi-robot collaborative search method based on accumulated trace reinforcement learning, which comprises the following steps: step 1: determining states and actions corresponding to the multiple targets at each moment according to the tasks;
step 2: initializing a neural network and a target neural network, and setting the neuron parameters of the neural network and the target neural network to be the same; setting a reward factor, a discount coefficient and a reward and punishment amount corresponding to each action;
step 3: select an action a_t according to the state s_t at the current moment; execute action a_t and determine the corresponding reward/penalty r_t and the state s_{t+1} at the next moment, obtaining the state-action-pair data (s_t, a_t, r_t, s_{t+1}) consisting of the current state s_t, the selected action a_t, the reward/penalty r_t and the next state s_{t+1}, and add 1 to the visit count count(s_t, a_t) of the state-action pair (s_t, a_t); repeat step 3 until a specific amount of state-action-pair data has been obtained. The visit count count(s_t, a_t) of the state-action pair (s_t, a_t) is the number of times action a_t has been selected in state s_t;
step 4: select a set number N of state-action pairs from the obtained state-action-pair data; for each selected state-action pair, calculate its additional reward based on the preset reward factor and the visit count of that state-action pair; calculate the output of the target neural network based on the obtained additional reward, the discount coefficient and the state-action-pair data; calculate the error between the output of the target neural network and the output of the neural network, and update the neuron parameters of the neural network by stochastic gradient descent; after a set number of steps, set the neuron parameters of the target neural network equal to those of the neural network;
step 5: returning to the step 3 until the neural network converges, and ending training to obtain a neural network strategy model;
step 6: and obtaining a multi-robot collaborative search strategy according to the neural network strategy model, and generating a multi-robot target collaborative search method.
Further, the specific method for selecting the action according to the state of the current moment is as follows:
generating a random number (optionally, the generated random number ranges from 0 to 1, the preset threshold is 0.5), and comparing the generated random number with the preset threshold;
if the random number is smaller than the preset threshold, calculating the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determining the action with the largest selection probability, and selecting that action;
if the random number is greater than or equal to the preset threshold, selecting the action that maximizes the neural network value in the state at the current moment, expressed as:
a_t = argmax_{a∈A} Q(s_t, a; θ)
where argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
Still further, the selection probability of an action a_j is calculated as follows:
where prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j has been selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k has been selected in state s; and k is the index of the actions, ranging over all actions in the action space (the action space is the action set A).
Further, for each state-action pair, the specific method of calculating the additional reward of that state-action pair based on the preset reward factor and its visit count is as follows:
where r⁺ is the additional reward, β is the reward factor, and count(s_i, a_i) denotes the visit count of the state-action pair (s_i, a_i), i.e., the number of times action a_i has been selected in state s_i.
Further, the method of calculating the output y_i of the target neural network based on the obtained additional reward, the discount coefficient and the reward/penalty of the state-action pair is as follows: if the state-action pair (s_i, a_i) is the last state-action pair selected, the reward/penalty r_i corresponding to that state-action pair is taken as the output y_i of the target neural network, i.e. y_i = r_i;
if the state-action pair is not the last state-action pair selected, the output y_i of the target neural network is calculated according to the following formula:
y_i = r_i + r⁺ + γ · max_{a∈A} Q(s_{i+1}, a; θ⁻)
where i = 1, 2, …, N; r⁺ is the additional reward; β is the reward factor; γ is the discount coefficient; max_{a∈A} Q(s_{i+1}, a; θ⁻) denotes the largest of the Q values output by the target neural network for the actions in state s_{i+1}; θ⁻ denotes the neuron parameters of the target neural network; and A is the action set.
Further, the error between the output of the target neural network for the state-action pair (s_i, a_i) and the output of the neural network for the state-action pair (s_i, a_i) is given by:
error = (y_i - Q(s_i, a_i; θ))²
where i = 1, 2, …, N; y_i is the output of the target neural network for the state-action pair (s_i, a_i); and Q(s_i, a_i; θ) is the output of the neural network for the state-action pair (s_i, a_i).
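As an illustration only, the following Python sketch shows how the additional reward, the target output y_i and the error of this section could be computed for one sampled batch. The exact functional form of the additional reward is not reproduced above, so the sketch assumes r⁺ = β / count(s_i, a_i); the helper names (q_net, target_q_net, batch) are likewise hypothetical.

import numpy as np

def compute_targets_and_errors(batch, counts, q_net, target_q_net, beta=0.1, gamma=0.95):
    """Sketch of the per-sample targets y_i and squared errors described above.

    batch  : list of (s, a, r, s_next, is_last) tuples sampled from the memory
    counts : dict mapping (state_key, action) -> visit count count(s, a)
    q_net, target_q_net : callables mapping a state to a vector of Q values
    Assumption: the additional reward is r_plus = beta / count(s, a); the text
    above only states that it depends on beta and the visit count.
    """
    targets, errors = [], []
    for s, a, r, s_next, is_last in batch:
        r_plus = beta / counts[(tuple(s), a)]      # cumulative-trace bonus (assumed form)
        if is_last:                                # last selected state-action pair: y_i = r_i
            y = r
        else:                                      # y_i = r_i + r_plus + gamma * max_a Q(s_{i+1}, a; theta_minus)
            y = r + r_plus + gamma * float(np.max(target_q_net(s_next)))
        q_sa = float(q_net(s)[a])                  # online-network output Q(s_i, a_i; theta)
        targets.append(y)
        errors.append((y - q_sa) ** 2)             # error = (y_i - Q(s_i, a_i; theta))^2
    return np.array(targets), np.array(errors)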
The beneficial technical effects obtained by the application are as follows:
the application provides a multi-robot collaborative search method based on accumulated trace reinforcement learning, which solves the contradiction between exploration and utilization in the training process of the reinforcement learning method. The application records the state action pairs accessed by the intelligent agent, accumulates the access times of each action state pair, introduces a Boltzmann distribution method, designs an exploration mode based on accumulated trace, and calculates and obtains the selection action a i The magnitude of the probability, temperature T, represents the magnitude of randomness. Along with the continuous deep learning, the temperature is reduced continuously, and the former learning result is ensured not to be destroyed. The application guides the intelligent agent to explore by adding the additional rewards based on accumulated trace, takes the reinforcement learning algorithm as a main algorithm, improves on the basis of the reinforcement learning algorithm, can be combined with any reinforcement learning algorithm, improves the learning efficiency of reinforcement learning, and realizes that multiple robots are complexAnd the method is efficiently and cooperatively explored in the environment.
Drawings
FIG. 1 is a reinforcement learning framework based on cumulative trace exploration and rewarding means in accordance with an embodiment of the present application;
FIG. 2 is a flowchart of a multi-robot target search method based on accumulated trace reinforcement learning according to an embodiment of the present application.
Detailed Description
The present application will be further described below in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the following examples are only for more clearly illustrating the technical solution of the present application, and are not intended to limit the scope of the present application.
The goal of reinforcement learning is for the agent to take trial-and-error actions through interaction with an unknown environment and to continuously learn to adjust its own policy so as to maximize the reward it obtains. Because the environment is unknown, the agent can only find the optimal policy after "fully" exploring the environment. In most cases, however, the agent must strike a trade-off under limited computing resources in order to obtain the highest reward; this is the exploration-exploitation dilemma, one of the key challenges in applying reinforcement learning to practical robot target-search problems. On the one hand, the agent needs to exploit existing experience to select the most beneficial behaviour policy; on the other hand, it needs to enlarge its search range, try new policies, and explore unknown space in order to find a better behaviour policy.
In traditional reinforcement learning, since no true target value labels the current sample, the usual temporal-difference idea in reinforcement learning is to let the current estimated value function Q stand in for the true value function as the target value y, namely
y = r_t + γ · max_{a∈A} Q(s_{t+1}, a; θ)
where r_t is the environmental reward received by the agent at time t, γ is the discount coefficient in reinforcement learning, s_{t+1} denotes the environmental state at time t+1, a denotes the agent's action at time t, and max_{a∈A} Q(s_{t+1}, a; θ) denotes the largest of the Q values output by the neural network for the actions in state s_{t+1}.
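Expressed in code, this standard temporal-difference target (before the cumulative-trace bonus introduced below is added) is simply the following; q_net is a hypothetical callable returning the vector of Q values for a state.

import numpy as np

def td_target(r_t, s_next, q_net, gamma=0.95):
    """Standard TD target: y = r_t + gamma * max_a Q(s_{t+1}, a; theta)."""
    return r_t + gamma * float(np.max(q_net(s_next)))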
Because of the contradiction between exploration and exploitation in the agent's action selection, the learning efficiency of reinforcement learning is low, and it converges only gradually after many iteration steps. To overcome this defect to a certain extent, the application provides a cumulative-trace-based reward and exploration method, which records the state-action pairs visited by the agent, accumulates the visit count of each state-action pair, introduces a Boltzmann distribution, and designs an exploration scheme based on the cumulative trace:
where prob(a_i) denotes the selection probability of action a_i and T is the temperature, representing the degree of randomness. As learning deepens, the temperature is reduced continuously, ensuring that earlier learning results are not destroyed.
In addition, the application guides the agent to explore by adding the bonus based on the accumulated trace, and the calculation mode of the bonus is as follows:
where β is the reward factor and count(s, a) records the number of times the agent has visited each state-action pair; a visit to the state-action pair (s, a) means selecting action a in state s, and count(s, a) is cumulatively incremented by one on each visit. The agent is trained with the reward (r + r⁺); during testing no additional reward is added.
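For concreteness, a minimal sketch of the accumulated-trace counter and of the training-time reward shaping just described; the class and method names are illustrative, and the bonus form β / count(s, a) is an assumption, since the exact expression is not reproduced here.

from collections import defaultdict

class CumulativeTrace:
    """Records visit counts of state-action pairs and shapes the training reward."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(lambda: 1)     # count(s, a) initialised to 1

    def visit(self, state, action):
        """Increment count(s, a) each time the pair (s, a) is visited."""
        self.counts[(tuple(state), action)] += 1

    def bonus(self, state, action):
        """Additional reward r+ (assumed form: beta / count(s, a))."""
        return self.beta / self.counts[(tuple(state), action)]

    def shaped_reward(self, state, action, reward, training=True):
        """Train with (r + r+); during testing no additional reward is added."""
        return reward + self.bonus(state, action) if training else reward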
In the following, taking as an example how multiple unmanned vehicles search for targets in a battlefield environment and efficiently find all targets, the multi-robot collaborative search process based on accumulated-trace reinforcement learning is described. The flow chart is shown in fig. 1 and comprises the following steps:
step 1 construction of space description and Environment of combat mission
1) Basic situation: multiple unmanned vehicles searching for multiple targets in a complex mountainous region
The unmanned vehicles are responsible for forward-position reconnaissance and survey tasks. According to the combat plan, the unmanned vehicles start from a given region; the battlefield environment contains several obstacles at unknown positions; the unmanned vehicles need to reach the planned locations and carry out reconnaissance and survey of the defending force's positions facing them. It is expected that within 1000 time steps they reach the key points X, Y and Z, completing the reconnaissance and survey task and ensuring that the higher-echelon deep-strike force can enter the battle.
2) Unmanned vehicle organization and deployment
The unmanned vehicles are organized into 5 groups; each group constitutes 1 reconnaissance unit and contains 1 unmanned vehicle. The unmanned vehicles are initially deployed in the northwest corner.
3) Conditions for completion of the reconnaissance task
The unmanned vehicles are responsible for the reconnaissance task; within 1000 time steps of combat time, each of the key points X, Y and Z must be reached by at least 1 unmanned-vehicle unit.
4) State set, action set, and reward feedback
The state s_t at each time t is:
s_t = {x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4, x_5, y_5, x_6, y_6, x_7, y_7, x_8, y_8}
where x and y denote the coordinates of each entity's location. The subscript is the entity number: {1, 2, 3, 4, 5} denote our combat units (the unmanned vehicles) and {6, 7, 8} denote the mission target locations. The range of each state dimension is x ∈ [0, 3000], y ∈ [0, 3000].
The action a_t at each time t is:
a_t = {move_1, move_2, move_3, move_4}
where move_1 denotes the unmanned vehicle maneuvering forward; move_2 denotes the unmanned vehicle backing up; move_3 denotes the unmanned vehicle turning left; and move_4 denotes the unmanned vehicle turning right. The range of a maneuver is move ∈ [0, 50], where 0 denotes no maneuver and 1 to 50 denote maneuvering by the corresponding number of adjacent grid squares.
Reward feedback:
reward for reaching the key points X, Y and Z: r = +500;
penalty for encountering an obstacle: r = -200;
penalty for a collision between unmanned vehicles: r = -100.
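A minimal sketch of how this reward feedback could be encoded; the boolean inputs stand for checks that the simulation environment is assumed to provide.

def reward_feedback(reached_key_point: bool, hit_obstacle: bool, vehicles_collided: bool) -> float:
    """Reward feedback of the reconnaissance task (values taken from the text above)."""
    r = 0.0
    if reached_key_point:       # an unmanned-vehicle unit reaches key point X, Y or Z
        r += 500.0
    if hit_obstacle:            # penalty for encountering an obstacle
        r -= 200.0
    if vehicles_collided:       # penalty for a collision between unmanned vehicles
        r -= 100.0
    return r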
step 2 initialization of algorithm parameters
The neural network Q(s, a; θ) and the target neural network Q(s, a; θ⁻) both adopt fully connected neural networks with 2 hidden layers of 64 neurons each, and their parameters are initialized to be synchronized, θ⁻ = θ; the optimizer is the Adam gradient-descent algorithm with a learning rate of 0.0001. The concrete implementation is programmed in the Python language based on Google's TensorFlow machine learning library. The memory unit D is initialized with a capacity of 20000. The visit count of each state-action pair (s, a) is initialized to count(s, a) = 1, where s is a state, a is an action, and the visit count of (s, a) is the number of times action a has been selected in state s. The discount coefficient is set to γ = 0.95, the greedy-strategy parameter ε = 0.01, the batch size η = 32, the temperature T = 20 and the target-network update interval C = 10.
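A minimal sketch of this initialisation using the Keras API of TensorFlow; the text states that the implementation is written in Python on top of TensorFlow, but the actual code is not given, so the layer and variable names here are illustrative.

import tensorflow as tf

def build_q_network(state_dim=16, num_actions=4, learning_rate=1e-4):
    """Fully connected Q network: 2 hidden layers of 64 neurons, Adam optimiser."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions),        # one Q value per action
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    return model, optimizer

# Hyper-parameters of step 2 (values from the text above)
GAMMA, EPSILON, BATCH_SIZE, TEMPERATURE, TARGET_UPDATE, MEMORY_CAPACITY = 0.95, 0.01, 32, 20, 10, 20000

q_net, optimizer = build_q_network()
target_q_net, _ = build_q_network()
target_q_net.set_weights(q_net.get_weights())      # synchronise theta_minus = theta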
Step 3 action selection
Select an action a_t according to the battlefield situation (i.e., the state at the current moment t) s_t. The selection method is as follows:
generating a random number, and comparing the generated random number with a preset threshold value;
if the random number is smaller than the preset threshold, calculate the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determine the action with the largest selection probability, and select that action;
if the random number is greater than or equal to the preset threshold, select the action that maximizes the neural network value in the state at the current moment, expressed as: a_t = argmax_{a∈A} Q(s_t, a; θ), where argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
The selection probability of an action a_j is calculated as follows:
where prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j has been selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k has been selected in state s; and k is the index of the actions, ranging over the actions in the action set A.
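A minimal Python sketch of this action-selection rule. The exact Boltzmann expression over the accumulated visit counts is not reproduced above, so the sketch assumes weights exp(-count(s, a)/T), i.e., that less-visited actions receive higher probability; the helper names are illustrative.

import numpy as np

def select_action(state, q_net, counts, num_actions=4, temperature=20.0, threshold=0.5):
    """Cumulative-trace action selection as described in step 3 above.

    Assumption: the Boltzmann weights are exp(-count(s, a) / T), so that rarely
    visited actions are preferred when exploring; the text does not spell out
    the exact expression.
    """
    q_values = q_net(state)                     # vector of Q(s_t, a; theta) for all actions
    greedy = int(np.argmax(q_values))           # a_t = argmax_a Q(s_t, a; theta)

    if np.random.rand() >= threshold:           # exploit: take the greedy action
        return greedy

    # Explore: Boltzmann distribution over the visit counts of the non-greedy actions
    others = [a for a in range(num_actions) if a != greedy]
    weights = np.array([np.exp(-counts[(tuple(state), a)] / temperature) for a in others])
    probs = weights / weights.sum()
    return others[int(np.argmax(probs))]        # action with the largest selection probability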
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to capacity, return to step 3; if it is full, go to step 7.
Step 7: randomly sample 32 groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_i denotes the state at moment i and s_{i+1} the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 if the state is motion pair(s) i ,a i ) The last state action pair is selected, and the punishment quantity r corresponding to the state action is selected i Output y as target neural network i The method comprises the steps of carrying out a first treatment on the surface of the The expression is: y is i =r i
If the state action pair is not the last state action pair selected, the output y of the target neural network is calculated according to the following formula iWhere i=1, 2 … N, r + For additional rewards, beta is the rewarding factor, gamma is the discount factor, ++>Represented in state s i+1 The largest Q value in the Q values corresponding to the actions output by the target neural network is evaluated; θ - And (3) representing neuron parameters of the target neural network, wherein A is an action set.
Note: when calculating y_i, the target neural network Q(s, a; θ⁻) is used rather than the neural network Q(s, a; θ).
Step 10: calculate error = (y_i - Q(s_i, a_i; θ))² and update the neural network's neuron parameters θ by stochastic gradient descent.
Step 11: every 10 steps, synchronize the neuron parameters of the target neural network with those of the neural network Q, i.e., set the neuron parameters of the target neural network equal to those of the neural network: θ⁻ = θ.
Step 12 returns to step 3 until the value function neural network Q converges, and ends.
Step 13: obtain the policy π(s) = argmax_{a∈A} Q(s, a; θ) from the value function Q, and deploy the policy in the simulation environment to generate the multi-robot target collaborative search method.
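Putting steps 3 to 13 together, the following is a training-loop skeleton that reuses the illustrative helpers sketched earlier (build_q_network, select_action, CumulativeTrace) together with a hypothetical environment object env providing reset() and step(); it is an outline of the procedure described above under those assumptions, not the actual implementation.

import random
import numpy as np
import tensorflow as tf
from collections import deque

def train(env, episodes=500, gamma=0.95, beta=0.1, batch_size=32, target_update=10, memory_capacity=20000):
    """Training-loop skeleton for the cumulative-trace method described in steps 3-13."""
    q_net, optimizer = build_q_network()
    target_q_net, _ = build_q_network()
    target_q_net.set_weights(q_net.get_weights())
    memory = deque(maxlen=memory_capacity)               # memory unit D
    trace = CumulativeTrace(beta=beta)                   # count(s, a) accumulator
    step_count = 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = select_action(state, lambda s: q_net(np.array([s], dtype=np.float32))[0].numpy(), trace.counts)  # step 3
            next_state, reward, done = env.step(action)  # step 4
            memory.append((state, action, reward, next_state, done))
            trace.visit(state, action)                   # step 5
            state = next_state
            step_count += 1

            if len(memory) < memory_capacity:            # step 6: train only once D is full
                continue
            batch = random.sample(memory, batch_size)    # step 7
            states = np.array([b[0] for b in batch], dtype=np.float32)
            actions = np.array([b[1] for b in batch], dtype=np.int32)
            targets = np.array([                         # steps 8-9: bonus r+ and target y_i
                r if last else r + trace.bonus(s, a) + gamma * float(np.max(target_q_net(np.array([s2], dtype=np.float32))[0]))
                for (s, a, r, s2, last) in batch], dtype=np.float32)

            with tf.GradientTape() as tape:              # step 10: squared error and gradient update
                q_all = q_net(states)
                q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[-1]), axis=1)
                loss = tf.reduce_mean(tf.square(targets - q_sa))
            grads = tape.gradient(loss, q_net.trainable_variables)
            optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

            if step_count % target_update == 0:          # step 11: synchronise theta_minus = theta
                target_q_net.set_weights(q_net.get_weights())
    return q_net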
The application takes a reinforcement learning algorithm as the main algorithm, improves on it, and can be combined with any reinforcement learning algorithm. The representative reinforcement learning algorithms DQN and DDPG are taken as examples of the improvement below.
In the second embodiment, the flow of the improved DQN algorithm is shown in fig. 1. The specific implementation steps of the improved DQN algorithm are as follows; the cumulative-trace-based reward and exploration design is located in steps 3 and 8. The multi-robot collaborative search method based on accumulated-trace reinforcement learning provided by this embodiment comprises the following steps:
and step 1, constructing a simulation environment for robot target searching. The output that the simulation environment can provide is: battlefield situation s at each time t t The method comprises the steps of carrying out a first treatment on the surface of the When the robot is based on the battlefield environment s t Decision a is made t Battlefield environment s at the next moment t+1 And rewarding r t
Step 2: initialize the algorithm parameters: initialize the neural network Q(s, a; θ); initialize the target neural network Q(s, a; θ⁻) and synchronize the neuron parameters of the two networks, θ⁻ = θ; initialize the state-action counter count(s, a) = 1; initialize the memory unit D, whose capacity is set to d; and set the discount coefficient γ, the temperature T, the batch size η and the target-network update interval C.
Step 3: select action a_t according to the battlefield situation s_t; the selection method is the same as in the embodiment above and is not described again in this embodiment.
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to its capacity d, return to step 3; if it is full, go to step 7.
Step 7: randomly sample N groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_{i+1} denotes the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 calculating y for each set of data i The method comprises the following steps: if the state moves the pair(s) i ,a i ) The last state action pair is selected, and the punishment quantity r corresponding to the state action is selected i Output y as target neural network i The method comprises the steps of carrying out a first treatment on the surface of the The expression is: y is i =r i
If the state action pair is not the last state action pair selected, the output y of the target neural network is calculated according to the following formula iWhere i=1, 2 … N, r + For additional rewards, beta is the rewarding factor, gamma is the discount factor, ++>Represented in state s i+1 The largest Q value in the Q values corresponding to the actions output by the target neural network is evaluated; θ - And (3) representing neuron parameters of the target neural network, wherein A is an action set.
Note: when calculating y_i, the target neural network Q(s, a; θ⁻) is used rather than the neural network Q(s, a; θ).
Step 10: calculate error = (y_i - Q(s_i, a_i; θ))² and update the neural network's neuron parameters θ by stochastic gradient descent.
Step 11: every C steps, synchronize the parameters of the target neural network Q(s, a; θ⁻) with those of Q, i.e., θ⁻ = θ.
Step 12 returns to step 3 until the value function neural network Q converges, and ends.
Step 13: obtain the policy π(s) = argmax_{a∈A} Q(s, a; θ) from the value function Q, and deploy the policy in the simulation environment to generate the robot target search strategy.
The specific implementation steps of the improved DDPG algorithm are as follows; the cumulative-trace-based reward and exploration design is located in steps 3 and 8:
and step 1, constructing a simulation environment for robot target searching. The output that the simulation environment can provide is: battlefield situation s at each time t t The method comprises the steps of carrying out a first treatment on the surface of the When the robot is based on the battlefield environment s t Decision a is made t Battlefield environment s at the next moment t+1 And rewarding r t
Step 2: initialize the algorithm parameters: initialize the critic neural network Q(s, a; θ^Q) and the actor network μ(s; θ^μ); initialize the target networks Q′(s, a; θ^{Q′}) and μ′(s; θ^{μ′}) and set their parameters θ^{Q′} = θ^Q and θ^{μ′} = θ^μ, where θ^{Q′}, θ^Q, θ^{μ′} and θ^μ denote, respectively, the parameters of the critic network among the target networks, the parameters of the critic network among the evaluation networks, the parameters of the actor network among the target networks, and the parameters of the actor network among the evaluation networks; initialize the state-action counter count(s, a) = 1; initialize the memory unit D, whose capacity is set to d; and set the discount coefficient γ, the temperature T, the batch size η, the target-network update interval C and the target-network update coefficient τ.
Step 3: select action a_t according to the battlefield situation s_t; the selection method is the same as described in embodiment one.
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to its capacity d, return to step 3; if it is full, go to step 7.
Step 7: randomly sample N groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_{i+1} denotes the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 computing for each set of data
Here, reaching the end state refers to the state action pair (s i ,a i ) Is the last state action pair selected.
Note: when calculating y_i, the target neural network Q′ is used rather than Q.
Step 10: calculate the error between y_i and the critic output, i.e. the mean squared error L = (1/N) Σ_i (y_i - Q(s_i, a_i; θ^Q))².
Step 11: calculate the gradients (for the actor, the standard DDPG sampled policy gradient ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s_i, a; θ^Q)|_{a=μ(s_i)} · ∇_{θ^μ} μ(s_i; θ^μ)) and update the neural-network neuron parameters θ^Q and θ^μ by stochastic gradient descent.
Step 12: every C steps, update the target-network parameters correspondingly: θ^{Q′} = τθ^Q + (1 - τ)θ^{Q′}, θ^{μ′} = τθ^μ + (1 - τ)θ^{μ′}.
Step 13: return to step 3 until the critic neural network Q(s, a; θ^Q) and the actor network μ(s; θ^μ) converge, then end.
Step 14: obtain the policy π(s) = μ(s; θ^μ) from the actor network μ(s; θ^μ), and deploy the policy in the simulation environment to generate the robot target search strategy.
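For the DDPG variant, the critic target of step 9 with the cumulative-trace bonus can be sketched as follows; actor_target and critic_target stand for the target networks μ′ and Q′, and the bonus form β / count is the same assumption as before.

import numpy as np

def ddpg_targets(batch, counts, actor_target, critic_target, beta=0.1, gamma=0.95):
    """Sketch of the DDPG critic targets y_i with the cumulative-trace bonus.

    batch         : list of (s, a, r, s_next, is_last) tuples sampled from memory D
    counts        : dict mapping (state_key, action_key) -> visit count count(s, a)
    actor_target  : callable s -> action, the target actor mu'(s; theta_mu')
    critic_target : callable (s, a) -> Q value, the target critic Q'(s, a; theta_Q')
    """
    targets = []
    for s, a, r, s_next, is_last in batch:
        if is_last:                                # terminal sample: y_i = r_i
            targets.append(r)
            continue
        r_plus = beta / counts[(tuple(s), tuple(np.atleast_1d(a)))]   # assumed bonus form
        a_next = actor_target(s_next)                                 # mu'(s_{i+1})
        targets.append(r + r_plus + gamma * float(critic_target(s_next, a_next)))
    return np.array(targets)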
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (6)

1. The multi-robot collaborative search method based on accumulated trace reinforcement learning is characterized by comprising the following steps of:
step 1: determining states and actions corresponding to the multiple targets at each moment according to the tasks;
step 2: initializing a neural network and a target neural network, and setting the neuron parameters of the neural network and the target neural network to be the same; setting a reward factor, a discount coefficient and a reward and punishment amount corresponding to each action;
step 3: selecting an action a_t according to the state s_t at the current moment; executing action a_t, determining the corresponding reward/penalty r_t and the state s_{t+1} at the next moment, obtaining the state-action-pair data (s_t, a_t, r_t, s_{t+1}) consisting of the current state s_t, the selected action a_t, the reward/penalty r_t and the next state s_{t+1}, and adding 1 to the visit count of the state-action pair (s_t, a_t); repeatedly performing step 3 until a specific number of state-action pairs are obtained, the visit count of the state-action pair (s_t, a_t) being the number of times action a_t is selected in state s_t;
step 4: selecting a set number N of state-action pairs from the obtained state-action-pair data; calculating, for each selected state-action pair, an additional reward based on a preset reward factor and the visit count of that state-action pair; calculating the output of the target neural network based on the obtained additional reward, the discount coefficient and the state-action-pair data; calculating the error between the output of the target neural network and the output of the neural network, and updating the neuron parameters of the neural network by stochastic gradient descent; after a set number of steps, setting the neuron parameters of the target neural network equal to those of the neural network;
step 5: returning to the step 3 until the neural network converges, and ending training to obtain a neural network strategy model;
step 6: obtaining a multi-robot collaborative search strategy according to the neural network strategy model, and generating a multi-robot target collaborative search method;
based on the selected set number N of action state pairs, for each action state pair data, a specific method for calculating the additional rewards of the state action pairs based on the preset rewards factors and the access times of the state action pairs comprises the following steps:
wherein r⁺ is the additional reward, β is the reward factor, i = 1, 2, …, N, and count(s_i, a_i) denotes the visit count of the state-action pair (s_i, a_i).
2. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the specific method of selecting an action according to the state of the current moment is as follows:
generating a random number, and comparing the generated random number with a preset threshold value;
if the random number is smaller than the preset threshold, calculating the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determining the action with the largest selection probability and selecting that action;
if the random number is greater than or equal to the preset threshold, selecting the action that maximizes the neural network value in the state at the current moment, expressed as:
a_t = argmax_{a∈A} Q(s_t, a; θ)
wherein argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
3. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 2, wherein the selection probability of action a_j is calculated as follows:
wherein prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j is selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k is selected in state s; and k is the index of the actions, ranging over the actions in the action set A.
4. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 3, wherein the generated random number ranges from 0 to 1, and the predetermined threshold is 0.5.
5. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the method of calculating the output y_i of the target neural network based on the obtained additional reward, the discount coefficient and the reward/penalty of the state-action pair is as follows:
if the state-action pair (s_i, a_i) is the last state-action pair selected, the reward/penalty r_i corresponding to that state-action pair is taken as the output y_i of the target neural network, i.e. y_i = r_i;
if the state-action pair is not the last state-action pair selected, the output y_i of the target neural network is calculated according to the following formula:
y_i = r_i + r⁺ + γ · max_{a∈A} Q(s_{i+1}, a; θ⁻)
where i = 1, 2, …, N; r⁺ is the additional reward; β is the reward factor; γ is the discount coefficient; max_{a∈A} Q(s_{i+1}, a; θ⁻) denotes the largest of the Q values output by the target neural network for the actions in state s_{i+1}; θ⁻ denotes the neuron parameters of the target neural network; and A is the action set.
6. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the formula for calculating the error between the output of the target neural network and the output of the neural network is as follows:
error = (y_i - Q(s_i, a_i; θ))²
where i = 1, 2, …, N; y_i is the output of the target neural network for the state-action pair (s_i, a_i); and Q(s_i, a_i; θ) is the output of the neural network for the state-action pair (s_i, a_i).
CN202011267650.4A 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning Active CN114489035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011267650.4A CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011267650.4A CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Publications (2)

Publication Number Publication Date
CN114489035A CN114489035A (en) 2022-05-13
CN114489035B true CN114489035B (en) 2023-09-01

Family

ID=81490522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267650.4A Active CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Country Status (1)

Country Link
CN (1) CN114489035B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of action prediction in multi-robot reinforcement learning cooperation; Cao Jie et al.; Computer Engineering and Applications; Vol. 49, No. 8; see pp. 257-259 *

Also Published As

Publication number Publication date
CN114489035A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110321666B (en) Multi-robot path planning method based on priori knowledge and DQN algorithm
CN114603564B (en) Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN116307464A (en) AGV task allocation method based on multi-agent deep reinforcement learning
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN116551703B (en) Motion planning method based on machine learning in complex environment
CN114489035B (en) Multi-robot collaborative search method based on accumulated trace reinforcement learning
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Panov Simultaneous learning and planning in a hierarchical control system for a cognitive agent
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Liu et al. Her-pdqn: A reinforcement learning approach for uav navigation with hybrid action spaces and sparse rewards
CN112827174B (en) Distributed multi-robot target searching method
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Wang et al. Improved grey wolf optimizer for multiple unmanned aerial vehicles task allocation
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
Yu et al. An intelligent robot motion planning method and application via lppo in unknown environment
CN117075596B (en) Method and system for planning complex task path of robot under uncertain environment and motion
CN114371626B (en) Discrete control fence function improvement optimization method, optimization system, terminal and medium
CN112131786B (en) Target detection and distribution method and device based on multi-agent reinforcement learning
CN117908565A (en) Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant