CN114489035B - Multi-robot collaborative search method based on accumulated trace reinforcement learning - Google Patents

Multi-robot collaborative search method based on accumulated trace reinforcement learning

Info

Publication number
CN114489035B
CN114489035B (application number CN202011267650.4A)
Authority
CN
China
Prior art keywords
action
state
neural network
reinforcement learning
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011267650.4A
Other languages
Chinese (zh)
Other versions
CN114489035A (en)
Inventor
徐志雄
陈希亮
洪志理
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Army Engineering University of PLA
Original Assignee
Army Engineering University of PLA
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Army Engineering University of PLA filed Critical Army Engineering University of PLA
Priority to CN202011267650.4A priority Critical patent/CN114489035B/en
Publication of CN114489035A publication Critical patent/CN114489035A/en
Application granted granted Critical
Publication of CN114489035B publication Critical patent/CN114489035B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The application provides a multi-robot collaborative search method based on accumulated-trace reinforcement learning. The method records the state-action pairs visited by the agent, accumulates the visit count of each state-action pair, introduces a Boltzmann distribution, designs an exploration scheme based on the accumulated trace, and guides the agent's exploration by adding an additional reward based on the accumulated trace. Taking a reinforcement learning algorithm as the main algorithm and improving on it, the method can be combined with any reinforcement learning algorithm, improves the learning efficiency of reinforcement learning, and achieves efficient collaborative exploration by multiple robots in complex environments.

Description

Multi-robot collaborative search method based on accumulated trace reinforcement learning
Technical Field
The application relates to a collaborative search method based on accumulated-trace reinforcement learning, and in particular to a multi-robot collaborative search method that introduces a cumulative-trace-based reward and exploration scheme into reinforcement learning. It belongs to the technical fields of intelligent decision-making, intelligent task planning, and intelligent command and control for military information systems.
Background
With the rapid development of military science and technology, unmanned combat is gradually becoming one of the main combat patterns of future warfare. For land unmanned combat, replacing human soldiers with robot soldiers is a trend in the development of future combat systems. As more and more unmanned aerial vehicles, unmanned ground vehicles and similar equipment are used in military systems, traditional centralized solutions can no longer meet the requirements of modern warfare. More and more military systems require distributed multi-robot systems with positioning and communication capabilities. Cooperation among multiple robots can often accomplish complex tasks more efficiently and quickly than a single robot; such tasks mainly include localization, task allocation, obstacle avoidance and path planning. Among them, the multi-robot search problem is one of the main research problems at present.
The multi-robot search problem can be divided into single-target search and multi-target search, the former being a special case of the latter. For single-target search there are intentional cooperation methods that explicitly plan robot behaviour, such as gradient-descent methods and game-theoretic methods, and there are emergent cooperation methods that accomplish a prescribed task through emergent intelligent behaviour, such as the extended particle swarm algorithm and the firefly algorithm. Of these, emergent cooperation is generally more efficient. Single-target collaborative search occurs only at the individual level, is essentially fine-grained cooperation, and has by now been studied fairly thoroughly. Multi-target search, by contrast, involves different requirements on the system's operating mechanism, on the functions and roles of each robot, and on the interrelations among robots during task decomposition, allocation, planning, decision-making and execution; it is oriented toward dynamic environments and demands autonomy and adaptivity, making it a considerably more complex problem.
The main task of multi-robot multi-target collaborative search is to find all targets in the shortest time. For this optimization problem, optimization algorithms such as the particle swarm algorithm, the firefly algorithm and the genetic algorithm can be used. For example, Pugh et al. extended the particle swarm algorithm for the cooperation of swarm robots; Tang and Eberhard applied the particle swarm algorithm to target-search research with a focus on algorithm parameter optimization; Feng Yangong, addressing the firefly algorithm's slow convergence, low solution precision in global optimization and tendency to fall into local extrema, proposed a dynamic-population firefly algorithm based on chaos theory; and Zhang Yi, addressing the traditional genetic algorithm's low search efficiency and tendency to fall into local optima, proposed an improved genetic algorithm: simple one-dimensional coding replaces complex two-dimensional coding to save storage space; in the design of the genetic operators, the crossover and mutation operators are redefined to avoid falling into local optima; and finally the shortest collision-free path is used as the fitness function for genetic optimization. However, in the problems these heuristic algorithms mainly address, the global position information of the targets is unknown, so it is difficult to obtain good solutions with heuristic algorithms alone, and all available information and knowledge must be exploited to guide the robots' search behaviour.
Reinforcement learning, as a machine learning method for solving sequential decision problems, learns a policy directly through trial-and-error interaction with the task environment, so it requires neither a hand-constructed reasoning model nor a large amount of sample data and therefore has strong applicability and generality. Its obvious drawback, however, is low learning efficiency: in a complex task environment it may even fail to learn an effective policy. How to improve the learning efficiency of reinforcement learning under limited computing resources and achieve efficient collaborative exploration by multiple robots in complex environments is the main problem to be solved by the application.
Disclosure of Invention
The aim of the application is to provide a cumulative-trace-based reward and exploration method that resolves the contradiction between exploration and exploitation in the training process of a reinforcement learning method and improves the learning efficiency of reinforcement learning.
The application adopts the following technical scheme.
The application provides a multi-robot collaborative search method based on accumulated trace reinforcement learning, which comprises the following steps: step 1: determining states and actions corresponding to the multiple targets at each moment according to the tasks;
step 2: initializing a neural network and a target neural network, and setting the neuron parameters of the neural network and the target neural network to be the same; setting a reward factor, a discount coefficient and a reward and punishment amount corresponding to each action;
step 3: select an action a_t according to the state s_t at the current moment; execute action a_t and determine the corresponding reward/penalty r_t and the state s_{t+1} at the next moment, obtaining the state-action-pair data (s_t, a_t, r_t, s_{t+1}) consisting of the current state s_t, the selected action a_t, the reward/penalty r_t and the next state s_{t+1}, and add 1 to the visit count count(s_t, a_t) of the state-action pair (s_t, a_t); repeat step 3 until a specific amount of state-action-pair data has been obtained. The visit count count(s_t, a_t) of the state-action pair (s_t, a_t) is the number of times action a_t has been selected in state s_t;
step 4: select a set number N of state-action pairs from the obtained state-action-pair data; for each selected state-action pair, calculate its additional reward based on the preset reward factor and the visit count of that state-action pair; calculate the output of the target neural network based on the obtained additional reward, the discount coefficient and the state-action-pair data; calculate the error between the output of the target neural network and the output of the neural network, and update the neuron parameters of the neural network by stochastic gradient descent; after a set number of steps, set the neuron parameters of the target neural network equal to those of the neural network;
step 5: returning to the step 3 until the neural network converges, and ending training to obtain a neural network strategy model;
step 6: and obtaining a multi-robot collaborative search strategy according to the neural network strategy model, and generating a multi-robot target collaborative search method.
Further, the specific method for selecting the action according to the state of the current moment is as follows:
generating a random number (optionally, the generated random number ranges from 0 to 1, the preset threshold is 0.5), and comparing the generated random number with the preset threshold;
if the random number is smaller than the preset threshold, calculating the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determining the action with the largest selection probability, and selecting that action;
if the random number is greater than or equal to the preset threshold, selecting the action that maximizes the neural network value in the state at the current moment, expressed as:
a_t = argmax_{a∈A} Q(s_t, a; θ)
where argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
Still further, the selection probability of an action a_j is calculated as follows:
where prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j has been selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k has been selected in state s; and k is the index of the actions, ranging over all actions in the action space (the action space is the action set A).
Further, for each state-action pair, the specific method of calculating the additional reward of that state-action pair based on the preset reward factor and its visit count is as follows:
where r⁺ is the additional reward, β is the reward factor, and count(s_i, a_i) denotes the visit count of the state-action pair (s_i, a_i), i.e., the number of times action a_i has been selected in state s_i.
Further, the method of calculating the output y_i of the target neural network based on the obtained additional reward, the discount coefficient and the reward/penalty of the state-action pair is as follows: if the state-action pair (s_i, a_i) is the last state-action pair selected, the reward/penalty r_i corresponding to that state-action pair is taken as the output y_i of the target neural network, i.e. y_i = r_i;
if the state-action pair is not the last state-action pair selected, the output y_i of the target neural network is calculated according to the following formula:
y_i = r_i + r⁺ + γ · max_{a∈A} Q(s_{i+1}, a; θ⁻)
where i = 1, 2, …, N; r⁺ is the additional reward; β is the reward factor; γ is the discount coefficient; max_{a∈A} Q(s_{i+1}, a; θ⁻) denotes the largest of the Q values output by the target neural network for the actions in state s_{i+1}; θ⁻ denotes the neuron parameters of the target neural network; and A is the action set.
Further, the error between the output of the target neural network for the state-action pair (s_i, a_i) and the output of the neural network for the state-action pair (s_i, a_i) is given by:
error = (y_i - Q(s_i, a_i; θ))²
where i = 1, 2, …, N; y_i is the output of the target neural network for the state-action pair (s_i, a_i); and Q(s_i, a_i; θ) is the output of the neural network for the state-action pair (s_i, a_i).
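As an illustration only, the following Python sketch shows how the additional reward, the target output y_i and the error of this section could be computed for one sampled batch. The exact functional form of the additional reward is not reproduced above, so the sketch assumes r⁺ = β / count(s_i, a_i); the helper names (q_net, target_q_net, batch) are likewise hypothetical.

import numpy as np

def compute_targets_and_errors(batch, counts, q_net, target_q_net, beta=0.1, gamma=0.95):
    """Sketch of the per-sample targets y_i and squared errors described above.

    batch  : list of (s, a, r, s_next, is_last) tuples sampled from the memory
    counts : dict mapping (state_key, action) -> visit count count(s, a)
    q_net, target_q_net : callables mapping a state to a vector of Q values
    Assumption: the additional reward is r_plus = beta / count(s, a); the text
    above only states that it depends on beta and the visit count.
    """
    targets, errors = [], []
    for s, a, r, s_next, is_last in batch:
        r_plus = beta / counts[(tuple(s), a)]      # cumulative-trace bonus (assumed form)
        if is_last:                                # last selected state-action pair: y_i = r_i
            y = r
        else:                                      # y_i = r_i + r_plus + gamma * max_a Q(s_{i+1}, a; theta_minus)
            y = r + r_plus + gamma * float(np.max(target_q_net(s_next)))
        q_sa = float(q_net(s)[a])                  # online-network output Q(s_i, a_i; theta)
        targets.append(y)
        errors.append((y - q_sa) ** 2)             # error = (y_i - Q(s_i, a_i; theta))^2
    return np.array(targets), np.array(errors)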
The beneficial technical effects obtained by the application are as follows:
the application provides a multi-robot collaborative search method based on accumulated trace reinforcement learning, which solves the contradiction between exploration and utilization in the training process of the reinforcement learning method. The application records the state action pairs accessed by the intelligent agent, accumulates the access times of each action state pair, introduces a Boltzmann distribution method, designs an exploration mode based on accumulated trace, and calculates and obtains the selection action a i The magnitude of the probability, temperature T, represents the magnitude of randomness. Along with the continuous deep learning, the temperature is reduced continuously, and the former learning result is ensured not to be destroyed. The application guides the intelligent agent to explore by adding the additional rewards based on accumulated trace, takes the reinforcement learning algorithm as a main algorithm, improves on the basis of the reinforcement learning algorithm, can be combined with any reinforcement learning algorithm, improves the learning efficiency of reinforcement learning, and realizes that multiple robots are complexAnd the method is efficiently and cooperatively explored in the environment.
Drawings
FIG. 1 is a reinforcement learning framework based on cumulative trace exploration and rewarding means in accordance with an embodiment of the present application;
FIG. 2 is a flowchart of a multi-robot target search method based on accumulated trace reinforcement learning according to an embodiment of the present application.
Detailed Description
The present application will be further described below in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the following examples are only for more clearly illustrating the technical solution of the present application, and are not intended to limit the scope of the present application.
The goal of reinforcement learning is for the agent to take trial-and-error actions through interaction with an unknown environment and to continuously learn to adjust its own policy so as to maximize the reward it obtains. Because the environment is unknown, the agent can only find the optimal policy after "fully" exploring the environment. In most cases, however, the agent must strike a trade-off under limited computing resources in order to obtain the highest reward; this is the exploration-exploitation dilemma, one of the key challenges in applying reinforcement learning to practical robot target-search problems. On the one hand, the agent needs to exploit existing experience to select the most beneficial behaviour policy; on the other hand, it needs to enlarge its search range, try new policies, and explore unknown space in order to find a better behaviour policy.
In traditional reinforcement learning, since no true target value labels the current sample, the usual temporal-difference idea in reinforcement learning is to let the current estimated value function Q stand in for the true value function as the target value y, namely
y = r_t + γ · max_{a∈A} Q(s_{t+1}, a; θ)
where r_t is the environmental reward received by the agent at time t, γ is the discount coefficient in reinforcement learning, s_{t+1} denotes the environmental state at time t+1, a denotes the agent's action at time t, and max_{a∈A} Q(s_{t+1}, a; θ) denotes the largest of the Q values output by the neural network for the actions in state s_{t+1}.
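Expressed in code, this standard temporal-difference target (before the cumulative-trace bonus introduced below is added) is simply the following; q_net is a hypothetical callable returning the vector of Q values for a state.

import numpy as np

def td_target(r_t, s_next, q_net, gamma=0.95):
    """Standard TD target: y = r_t + gamma * max_a Q(s_{t+1}, a; theta)."""
    return r_t + gamma * float(np.max(q_net(s_next)))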
Because of the contradiction between exploration and exploitation in the agent's action selection, the learning efficiency of reinforcement learning is low, and it converges only gradually after many iteration steps. To overcome this defect to a certain extent, the application provides a cumulative-trace-based reward and exploration method, which records the state-action pairs visited by the agent, accumulates the visit count of each state-action pair, introduces a Boltzmann distribution, and designs an exploration scheme based on the cumulative trace:
where prob(a_i) denotes the selection probability of action a_i and T is the temperature, representing the degree of randomness. As learning deepens, the temperature is reduced continuously, ensuring that earlier learning results are not destroyed.
In addition, the application guides the agent to explore by adding the bonus based on the accumulated trace, and the calculation mode of the bonus is as follows:
where β is the reward factor and count(s, a) records the number of times the agent has visited each state-action pair; a visit to the state-action pair (s, a) means selecting action a in state s, and count(s, a) is cumulatively incremented by one on each visit. The agent is trained with the reward (r + r⁺); during testing no additional reward is added.
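For concreteness, a minimal sketch of the accumulated-trace counter and of the training-time reward shaping just described; the class and method names are illustrative, and the bonus form β / count(s, a) is an assumption, since the exact expression is not reproduced here.

from collections import defaultdict

class CumulativeTrace:
    """Records visit counts of state-action pairs and shapes the training reward."""

    def __init__(self, beta=0.1):
        self.beta = beta
        self.counts = defaultdict(lambda: 1)     # count(s, a) initialised to 1

    def visit(self, state, action):
        """Increment count(s, a) each time the pair (s, a) is visited."""
        self.counts[(tuple(state), action)] += 1

    def bonus(self, state, action):
        """Additional reward r+ (assumed form: beta / count(s, a))."""
        return self.beta / self.counts[(tuple(state), action)]

    def shaped_reward(self, state, action, reward, training=True):
        """Train with (r + r+); during testing no additional reward is added."""
        return reward + self.bonus(state, action) if training else reward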
In the following, taking as an example how multiple unmanned vehicles search for targets in a battlefield environment and efficiently find all targets, the multi-robot collaborative search process based on accumulated-trace reinforcement learning is described. The flow chart is shown in fig. 1 and comprises the following steps:
step 1 construction of space description and Environment of combat mission
1) Basic situation: multiple unmanned vehicles searching for multiple targets in a complex mountainous region
The unmanned vehicles are responsible for forward-position reconnaissance and survey tasks. According to the combat plan, the unmanned vehicles start from a given region; the battlefield environment contains several obstacles at unknown positions; the unmanned vehicles need to reach the planned locations and carry out reconnaissance and survey of the defending force's positions facing them. It is expected that within 1000 time steps they reach the key points X, Y and Z, completing the reconnaissance and survey task and ensuring that the higher-echelon deep-strike force can enter the battle.
2) Unmanned vehicle organization and deployment
The unmanned vehicles are organized into 5 groups; each group constitutes 1 reconnaissance unit and contains 1 unmanned vehicle. The unmanned vehicles are initially deployed in the northwest corner.
3) Conditions for completion of the reconnaissance task
The unmanned vehicles are responsible for the reconnaissance task; within 1000 time steps of combat time, each of the key points X, Y and Z must be reached by at least 1 unmanned-vehicle unit.
4) State set, action set, and reward feedback
The state s_t at each time t is:
s_t = {x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4, x_5, y_5, x_6, y_6, x_7, y_7, x_8, y_8}
where x and y denote the coordinates of each entity's location. The subscript is the entity number: {1, 2, 3, 4, 5} denote our combat units (the unmanned vehicles) and {6, 7, 8} denote the mission target locations. The range of each state dimension is x ∈ [0, 3000], y ∈ [0, 3000].
The action a_t at each time t is:
a_t = {move_1, move_2, move_3, move_4}
where move_1 denotes the unmanned vehicle maneuvering forward; move_2 denotes the unmanned vehicle backing up; move_3 denotes the unmanned vehicle turning left; and move_4 denotes the unmanned vehicle turning right. The range of a maneuver is move ∈ [0, 50], where 0 denotes no maneuver and 1 to 50 denote maneuvering by the corresponding number of adjacent grid squares.
Reward feedback:
reward for reaching the key points X, Y and Z: r = +500;
penalty for encountering an obstacle: r = -200;
penalty for a collision between unmanned vehicles: r = -100.
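A minimal sketch of how this reward feedback could be encoded; the boolean inputs stand for checks that the simulation environment is assumed to provide.

def reward_feedback(reached_key_point: bool, hit_obstacle: bool, vehicles_collided: bool) -> float:
    """Reward feedback of the reconnaissance task (values taken from the text above)."""
    r = 0.0
    if reached_key_point:       # an unmanned-vehicle unit reaches key point X, Y or Z
        r += 500.0
    if hit_obstacle:            # penalty for encountering an obstacle
        r -= 200.0
    if vehicles_collided:       # penalty for a collision between unmanned vehicles
        r -= 100.0
    return r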
step 2 initialization of algorithm parameters
The neural network Q(s, a; θ) and the target neural network Q(s, a; θ⁻) both adopt fully connected neural networks with 2 hidden layers of 64 neurons each, and their parameters are initialized to be synchronized, θ⁻ = θ; the optimizer is the Adam gradient-descent algorithm with a learning rate of 0.0001. The concrete implementation is programmed in the Python language based on Google's TensorFlow machine learning library. The memory unit D is initialized with a capacity of 20000. The visit count of each state-action pair (s, a) is initialized to count(s, a) = 1, where s is a state, a is an action, and the visit count of (s, a) is the number of times action a has been selected in state s. The discount coefficient is set to γ = 0.95, the greedy-strategy parameter ε = 0.01, the batch size η = 32, the temperature T = 20 and the target-network update interval C = 10.
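A minimal sketch of this initialisation using the Keras API of TensorFlow; the text states that the implementation is written in Python on top of TensorFlow, but the actual code is not given, so the layer and variable names here are illustrative.

import tensorflow as tf

def build_q_network(state_dim=16, num_actions=4, learning_rate=1e-4):
    """Fully connected Q network: 2 hidden layers of 64 neurons, Adam optimiser."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(num_actions),        # one Q value per action
    ])
    optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
    return model, optimizer

# Hyper-parameters of step 2 (values from the text above)
GAMMA, EPSILON, BATCH_SIZE, TEMPERATURE, TARGET_UPDATE, MEMORY_CAPACITY = 0.95, 0.01, 32, 20, 10, 20000

q_net, optimizer = build_q_network()
target_q_net, _ = build_q_network()
target_q_net.set_weights(q_net.get_weights())      # synchronise theta_minus = theta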
Step 3 action selection
Select an action a_t according to the battlefield situation (i.e., the state at the current moment t) s_t. The selection method is as follows:
generating a random number, and comparing the generated random number with a preset threshold value;
if the random number is smaller than the preset threshold, calculate the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determine the action with the largest selection probability, and select that action;
if the random number is greater than or equal to the preset threshold, select the action that maximizes the neural network value in the state at the current moment, expressed as: a_t = argmax_{a∈A} Q(s_t, a; θ), where argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
The selection probability of an action a_j is calculated as follows:
where prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j has been selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k has been selected in state s; and k is the index of the actions, ranging over the actions in the action set A.
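A minimal Python sketch of this action-selection rule. The exact Boltzmann expression over the accumulated visit counts is not reproduced above, so the sketch assumes weights exp(-count(s, a)/T), i.e., that less-visited actions receive higher probability; the helper names are illustrative.

import numpy as np

def select_action(state, q_net, counts, num_actions=4, temperature=20.0, threshold=0.5):
    """Cumulative-trace action selection as described in step 3 above.

    Assumption: the Boltzmann weights are exp(-count(s, a) / T), so that rarely
    visited actions are preferred when exploring; the text does not spell out
    the exact expression.
    """
    q_values = q_net(state)                     # vector of Q(s_t, a; theta) for all actions
    greedy = int(np.argmax(q_values))           # a_t = argmax_a Q(s_t, a; theta)

    if np.random.rand() >= threshold:           # exploit: take the greedy action
        return greedy

    # Explore: Boltzmann distribution over the visit counts of the non-greedy actions
    others = [a for a in range(num_actions) if a != greedy]
    weights = np.array([np.exp(-counts[(tuple(state), a)] / temperature) for a in others])
    probs = weights / weights.sum()
    return others[int(np.argmax(probs))]        # action with the largest selection probability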
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to capacity, return to step 3; if it is full, go to step 7.
Step 7: randomly sample 32 groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_i denotes the state at moment i and s_{i+1} the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 if the state is motion pair(s) i ,a i ) The last state action pair is selected, and the punishment quantity r corresponding to the state action is selected i Output y as target neural network i The method comprises the steps of carrying out a first treatment on the surface of the The expression is: y is i =r i
If the state action pair is not the last state action pair selected, the output y of the target neural network is calculated according to the following formula iWhere i=1, 2 … N, r + For additional rewards, beta is the rewarding factor, gamma is the discount factor, ++>Represented in state s i+1 The largest Q value in the Q values corresponding to the actions output by the target neural network is evaluated; θ - And (3) representing neuron parameters of the target neural network, wherein A is an action set.
Note: when calculating y_i, the target neural network Q(s, a; θ⁻) is used rather than the neural network Q(s, a; θ).
Step 10: calculate error = (y_i - Q(s_i, a_i; θ))² and update the neural network's neuron parameters θ by stochastic gradient descent.
Step 11: every 10 steps, synchronize the neuron parameters of the target neural network with those of the neural network Q, i.e., set the neuron parameters of the target neural network equal to those of the neural network: θ⁻ = θ.
Step 12 returns to step 3 until the value function neural network Q converges, and ends.
Step 13: obtain the policy π(s) = argmax_{a∈A} Q(s, a; θ) from the value function Q, and deploy the policy in the simulation environment to generate the multi-robot target collaborative search method.
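Putting steps 3 to 13 together, the following is a training-loop skeleton that reuses the illustrative helpers sketched earlier (build_q_network, select_action, CumulativeTrace) together with a hypothetical environment object env providing reset() and step(); it is an outline of the procedure described above under those assumptions, not the actual implementation.

import random
import numpy as np
import tensorflow as tf
from collections import deque

def train(env, episodes=500, gamma=0.95, beta=0.1, batch_size=32, target_update=10, memory_capacity=20000):
    """Training-loop skeleton for the cumulative-trace method described in steps 3-13."""
    q_net, optimizer = build_q_network()
    target_q_net, _ = build_q_network()
    target_q_net.set_weights(q_net.get_weights())
    memory = deque(maxlen=memory_capacity)               # memory unit D
    trace = CumulativeTrace(beta=beta)                   # count(s, a) accumulator
    step_count = 0

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = select_action(state, lambda s: q_net(np.array([s], dtype=np.float32))[0].numpy(), trace.counts)  # step 3
            next_state, reward, done = env.step(action)  # step 4
            memory.append((state, action, reward, next_state, done))
            trace.visit(state, action)                   # step 5
            state = next_state
            step_count += 1

            if len(memory) < memory_capacity:            # step 6: train only once D is full
                continue
            batch = random.sample(memory, batch_size)    # step 7
            states = np.array([b[0] for b in batch], dtype=np.float32)
            actions = np.array([b[1] for b in batch], dtype=np.int32)
            targets = np.array([                         # steps 8-9: bonus r+ and target y_i
                r if last else r + trace.bonus(s, a) + gamma * float(np.max(target_q_net(np.array([s2], dtype=np.float32))[0]))
                for (s, a, r, s2, last) in batch], dtype=np.float32)

            with tf.GradientTape() as tape:              # step 10: squared error and gradient update
                q_all = q_net(states)
                q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, q_all.shape[-1]), axis=1)
                loss = tf.reduce_mean(tf.square(targets - q_sa))
            grads = tape.gradient(loss, q_net.trainable_variables)
            optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

            if step_count % target_update == 0:          # step 11: synchronise theta_minus = theta
                target_q_net.set_weights(q_net.get_weights())
    return q_net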
The application takes a reinforcement learning algorithm as the main algorithm, improves on it, and can be combined with any reinforcement learning algorithm. The representative reinforcement learning algorithms DQN and DDPG are taken as examples of the improvement below.
In the second embodiment, the flow of the improved DQN algorithm is shown in fig. 1. The specific implementation steps of the improved DQN algorithm are as follows; the cumulative-trace-based reward and exploration design is located in steps 3 and 8. The multi-robot collaborative search method based on accumulated-trace reinforcement learning provided by this embodiment comprises the following steps:
and step 1, constructing a simulation environment for robot target searching. The output that the simulation environment can provide is: battlefield situation s at each time t t The method comprises the steps of carrying out a first treatment on the surface of the When the robot is based on the battlefield environment s t Decision a is made t Battlefield environment s at the next moment t+1 And rewarding r t
Step 2: initialize the algorithm parameters: initialize the neural network Q(s, a; θ); initialize the target neural network Q(s, a; θ⁻) and synchronize the neuron parameters of the two networks, θ⁻ = θ; initialize the state-action counter count(s, a) = 1; initialize the memory unit D, whose capacity is set to d; and set the discount coefficient γ, the temperature T, the batch size η and the target-network update interval C.
Step 3: select action a_t according to the battlefield situation s_t; the selection method is the same as in the embodiment above and is not described again in this embodiment.
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to its capacity d, return to step 3; if it is full, go to step 7.
Step 7: randomly sample N groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_{i+1} denotes the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 calculating y for each set of data i The method comprises the following steps: if the state moves the pair(s) i ,a i ) The last state action pair is selected, and the punishment quantity r corresponding to the state action is selected i Output y as target neural network i The method comprises the steps of carrying out a first treatment on the surface of the The expression is: y is i =r i
If the state action pair is not the last state action pair selected, the output y of the target neural network is calculated according to the following formula iWhere i=1, 2 … N, r + For additional rewards, beta is the rewarding factor, gamma is the discount factor, ++>Represented in state s i+1 The largest Q value in the Q values corresponding to the actions output by the target neural network is evaluated; θ - And (3) representing neuron parameters of the target neural network, wherein A is an action set.
Note: when calculating y_i, the target neural network Q(s, a; θ⁻) is used rather than the neural network Q(s, a; θ).
Step 10: calculate error = (y_i - Q(s_i, a_i; θ))² and update the neural network's neuron parameters θ by stochastic gradient descent.
Step 11: every C steps, synchronize the parameters of the target neural network Q(s, a; θ⁻) with those of Q, i.e., θ⁻ = θ.
Step 12 returns to step 3 until the value function neural network Q converges, and ends.
Step 13: obtain the policy π(s) = argmax_{a∈A} Q(s, a; θ) from the value function Q, and deploy the policy in the simulation environment to generate the robot target search strategy.
The specific implementation steps of the improved DDPG algorithm are as follows; the cumulative-trace-based reward and exploration design is located in steps 3 and 8:
and step 1, constructing a simulation environment for robot target searching. The output that the simulation environment can provide is: battlefield situation s at each time t t The method comprises the steps of carrying out a first treatment on the surface of the When the robot is based on the battlefield environment s t Decision a is made t Battlefield environment s at the next moment t+1 And rewarding r t
Step 2: initialize the algorithm parameters: initialize the critic neural network Q(s, a; θ^Q) and the actor network μ(s; θ^μ); initialize the target networks Q′(s, a; θ^{Q′}) and μ′(s; θ^{μ′}) and set their parameters θ^{Q′} = θ^Q and θ^{μ′} = θ^μ, where θ^{Q′}, θ^Q, θ^{μ′} and θ^μ denote, respectively, the parameters of the critic network among the target networks, the parameters of the critic network among the evaluation networks, the parameters of the actor network among the target networks, and the parameters of the actor network among the evaluation networks; initialize the state-action counter count(s, a) = 1; initialize the memory unit D, whose capacity is set to d; and set the discount coefficient γ, the temperature T, the batch size η, the target-network update interval C and the target-network update coefficient τ.
Step 3: select action a_t according to the battlefield situation s_t; the selection method is the same as described in embodiment one.
Step 4: after executing action a_t, observe the environment to obtain the situation s_{t+1} at the next moment and the reward r_t, and store the data (s_t, a_t, r_t, s_{t+1}) in the memory unit D.
Step 5: increment the state-action counter by one:
count(s_t, a_t) ← count(s_t, a_t) + 1
Step 6: if the memory unit D has not yet been filled to its capacity d, return to step 3; if it is full, go to step 7.
Step 7: randomly sample N groups of data (s_i, a_i, r_i, s_{i+1}) (i = 1, 2, …, N) from D, where s_{i+1} denotes the state at the following moment.
Step 8: calculate the additional reward from the state-action counter:
step 9 computing for each set of data
Here, reaching the end state refers to the state action pair (s i ,a i ) Is the last state action pair selected.
Note: when calculating y_i, the target neural network Q′ is used rather than Q.
Step 10: calculate the error between y_i and the critic output, i.e. the mean squared error L = (1/N) Σ_i (y_i - Q(s_i, a_i; θ^Q))².
Step 11: calculate the gradients (for the actor, the standard DDPG sampled policy gradient ∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s_i, a; θ^Q)|_{a=μ(s_i)} · ∇_{θ^μ} μ(s_i; θ^μ)) and update the neural-network neuron parameters θ^Q and θ^μ by stochastic gradient descent.
Step 12: every C steps, update the target-network parameters correspondingly: θ^{Q′} = τθ^Q + (1 - τ)θ^{Q′}, θ^{μ′} = τθ^μ + (1 - τ)θ^{μ′}.
Step 13: return to step 3 until the critic neural network Q(s, a; θ^Q) and the actor network μ(s; θ^μ) converge, then end.
Step 14: obtain the policy π(s) = μ(s; θ^μ) from the actor network μ(s; θ^μ), and deploy the policy in the simulation environment to generate the robot target search strategy.
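For the DDPG variant, the critic target of step 9 with the cumulative-trace bonus can be sketched as follows; actor_target and critic_target stand for the target networks μ′ and Q′, and the bonus form β / count is the same assumption as before.

import numpy as np

def ddpg_targets(batch, counts, actor_target, critic_target, beta=0.1, gamma=0.95):
    """Sketch of the DDPG critic targets y_i with the cumulative-trace bonus.

    batch         : list of (s, a, r, s_next, is_last) tuples sampled from memory D
    counts        : dict mapping (state_key, action_key) -> visit count count(s, a)
    actor_target  : callable s -> action, the target actor mu'(s; theta_mu')
    critic_target : callable (s, a) -> Q value, the target critic Q'(s, a; theta_Q')
    """
    targets = []
    for s, a, r, s_next, is_last in batch:
        if is_last:                                # terminal sample: y_i = r_i
            targets.append(r)
            continue
        r_plus = beta / counts[(tuple(s), tuple(np.atleast_1d(a)))]   # assumed bonus form
        a_next = actor_target(s_next)                                 # mu'(s_{i+1})
        targets.append(r + r_plus + gamma * float(critic_target(s_next, a_next)))
    return np.array(targets)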
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (6)

1. The multi-robot collaborative search method based on accumulated trace reinforcement learning is characterized by comprising the following steps of:
step 1: determining states and actions corresponding to the multiple targets at each moment according to the tasks;
step 2: initializing a neural network and a target neural network, and setting the neuron parameters of the neural network and the target neural network to be the same; setting a reward factor, a discount coefficient and a reward and punishment amount corresponding to each action;
step 3: selecting an action a_t according to the state s_t at the current moment; executing action a_t, determining the corresponding reward/penalty r_t and the state s_{t+1} at the next moment, obtaining the state-action-pair data (s_t, a_t, r_t, s_{t+1}) consisting of the current state s_t, the selected action a_t, the reward/penalty r_t and the next state s_{t+1}, and adding 1 to the visit count of the state-action pair (s_t, a_t); repeatedly performing step 3 until a specific number of state-action pairs are obtained, the visit count of the state-action pair (s_t, a_t) being the number of times action a_t is selected in state s_t;
step 4: selecting a set number N of state-action pairs from the obtained state-action-pair data; calculating, for each selected state-action pair, an additional reward based on a preset reward factor and the visit count of that state-action pair; calculating the output of the target neural network based on the obtained additional reward, the discount coefficient and the state-action-pair data; calculating the error between the output of the target neural network and the output of the neural network, and updating the neuron parameters of the neural network by stochastic gradient descent; after a set number of steps, setting the neuron parameters of the target neural network equal to those of the neural network;
step 5: returning to the step 3 until the neural network converges, and ending training to obtain a neural network strategy model;
step 6: obtaining a multi-robot collaborative search strategy according to the neural network strategy model, and generating a multi-robot target collaborative search method;
based on the selected set number N of action state pairs, for each action state pair data, a specific method for calculating the additional rewards of the state action pairs based on the preset rewards factors and the access times of the state action pairs comprises the following steps:
wherein r⁺ is the additional reward, β is the reward factor, i = 1, 2, …, N, and count(s_i, a_i) denotes the visit count of the state-action pair (s_i, a_i).
2. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the specific method of selecting an action according to the state of the current moment is as follows:
generating a random number, and comparing the generated random number with a preset threshold value;
if the random number is smaller than the preset threshold, calculating the selection probability of every action in the action set other than the action that maximizes the neural network value in the state at the current moment, determining the action with the largest selection probability and selecting that action;
if the random number is greater than or equal to the preset threshold, selecting the action that maximizes the neural network value in the state at the current moment, expressed as:
a_t = argmax_{a∈A} Q(s_t, a; θ)
wherein argmax_{a∈A} Q(s_t, a; θ) denotes the action a that maximizes the neural network value in the state s_t at the current moment, θ is the neuron parameter of the neural network, and A is the action set.
3. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 2, wherein the selection probability of action a_j is calculated as follows:
wherein prob(a_j) denotes the selection probability of action a_j; T denotes the temperature; count(s, a_j) denotes the visit count of the state-action pair (s, a_j), i.e., the number of times action a_j is selected in state s; count(s, a_k) denotes the visit count of the state-action pair (s, a_k), i.e., the number of times action a_k is selected in state s; and k is the index of the actions, ranging over the actions in the action set A.
4. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 3, wherein the generated random number ranges from 0 to 1, and the predetermined threshold is 0.5.
5. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the method of calculating the output y_i of the target neural network based on the obtained additional reward, the discount coefficient and the reward/penalty of the state-action pair is as follows:
if the state-action pair (s_i, a_i) is the last state-action pair selected, the reward/penalty r_i corresponding to that state-action pair is taken as the output y_i of the target neural network, i.e. y_i = r_i;
if the state-action pair is not the last state-action pair selected, the output y_i of the target neural network is calculated according to the following formula:
y_i = r_i + r⁺ + γ · max_{a∈A} Q(s_{i+1}, a; θ⁻)
where i = 1, 2, …, N; r⁺ is the additional reward; β is the reward factor; γ is the discount coefficient; max_{a∈A} Q(s_{i+1}, a; θ⁻) denotes the largest of the Q values output by the target neural network for the actions in state s_{i+1}; θ⁻ denotes the neuron parameters of the target neural network; and A is the action set.
6. The multi-robot collaborative search method based on accumulated trace reinforcement learning according to claim 1, wherein the formula for calculating the error between the output of the target neural network and the output of the neural network is as follows:
error = (y_i - Q(s_i, a_i; θ))²
where i = 1, 2, …, N; y_i is the output of the target neural network for the state-action pair (s_i, a_i); and Q(s_i, a_i; θ) is the output of the neural network for the state-action pair (s_i, a_i).
CN202011267650.4A 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning Active CN114489035B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011267650.4A CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011267650.4A CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Publications (2)

Publication Number Publication Date
CN114489035A CN114489035A (en) 2022-05-13
CN114489035B true CN114489035B (en) 2023-09-01

Family

ID=81490522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011267650.4A Active CN114489035B (en) 2020-11-13 2020-11-13 Multi-robot collaborative search method based on accumulated trace reinforcement learning

Country Status (1)

Country Link
CN (1) CN114489035B (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning
CN111461321A (en) * 2020-03-12 2020-07-28 南京理工大学 Improved deep reinforcement learning method and system based on Double DQN
CN111563188A (en) * 2020-04-30 2020-08-21 南京邮电大学 Mobile multi-agent cooperative target searching method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of action prediction in multi-robot reinforcement learning cooperation; Cao Jie et al.; Computer Engineering and Applications; Vol. 49, No. 8; see pp. 257-259 *

Also Published As

Publication number Publication date
CN114489035A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN110321666B (en) Multi-robot path planning method based on priori knowledge and DQN algorithm
CN114603564B (en) Mechanical arm navigation obstacle avoidance method, system, computer equipment and storage medium
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
CN110442129B (en) Control method and system for multi-agent formation
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN113919485B (en) Multi-agent reinforcement learning method and system based on dynamic hierarchical communication network
CN116661503B (en) Cluster track automatic planning method based on multi-agent safety reinforcement learning
CN114083539B (en) Mechanical arm anti-interference motion planning method based on multi-agent reinforcement learning
CN116307464A (en) AGV task allocation method based on multi-agent deep reinforcement learning
CN116700327A (en) Unmanned aerial vehicle track planning method based on continuous action dominant function learning
CN116551703B (en) Motion planning method based on machine learning in complex environment
CN114489035B (en) Multi-robot collaborative search method based on accumulated trace reinforcement learning
CN116227622A (en) Multi-agent landmark coverage method and system based on deep reinforcement learning
Panov Simultaneous learning and planning in a hierarchical control system for a cognitive agent
CN113485323B (en) Flexible formation method for cascading multiple mobile robots
Liu et al. Her-pdqn: A reinforcement learning approach for uav navigation with hybrid action spaces and sparse rewards
CN112827174B (en) Distributed multi-robot target searching method
CN114967472A (en) Unmanned aerial vehicle trajectory tracking state compensation depth certainty strategy gradient control method
Wang et al. Improved grey wolf optimizer for multiple unmanned aerial vehicles task allocation
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph
Yu et al. An intelligent robot motion planning method and application via lppo in unknown environment
CN117075596B (en) Method and system for planning complex task path of robot under uncertain environment and motion
CN114371626B (en) Discrete control fence function improvement optimization method, optimization system, terminal and medium
CN112131786B (en) Target detection and distribution method and device based on multi-agent reinforcement learning
CN117908565A (en) Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant