CN116643586A - Complex scene-oriented multi-robot collaborative reconnaissance method - Google Patents

Complex scene-oriented multi-robot collaborative reconnaissance method

Info

Publication number
CN116643586A
Authority
CN
China
Prior art keywords
network
robot
local
global
strategy
Prior art date
Legal status
Pending
Application number
CN202310578631.0A
Other languages
Chinese (zh)
Inventor
刘惠
丁博
万天娇
冯大为
傅翔
许可乐
翟远钊
高梓健
贾宏达
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310578631.0A
Publication of CN116643586A
Pending legal status: Current


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • G05D1/104Simultaneous control of position or course in three dimensions specially adapted for aircraft involving a plurality of aircrafts, e.g. formation flying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]


Abstract

The invention discloses a complex scene-oriented multi-robot collaborative reconnaissance method, which aims to shorten reconnaissance time and to complete reconnaissance tasks in a variety of scenes. The technical scheme is as follows: construct a multi-robot reconnaissance system consisting of robot nodes and a cloud server node; pre-train robot node simulation models and store their personalized collaborative strategies on the server node; use the strategies trained in the simulation scenes to train the strategy networks for the multi-robot collaborative reconnaissance task in the real scene, obtaining a sampling strategy with which the multiple robots autonomously and collaboratively reconnoiter the real scene; the multi-robot system then completes multi-target reconnaissance tasks according to the sampling strategy. The invention adopts a double-layer evaluation mechanism and introduces incentive mechanisms based on both the overall situation and the individual situation, so that multiple robots can generate personalized, flexible decisions for different environments, perform efficient autonomous collaborative reconnaissance in a dynamically changing complex environment, and complete reconnaissance tasks efficiently in a shorter time.

Description

Complex scene-oriented multi-robot collaborative reconnaissance method
Technical Field
The invention relates to a method that enables distributed multi-robot systems to complete collaborative reconnaissance tasks in complex unknown environments by drawing on knowledge from the fields of multi-robot systems and multi-agent reinforcement learning. In particular, it relates to a method by which multiple robots, when facing reconnaissance task scenes that cannot be predicted in advance and lack prior information (such as post-disaster terrain reconnaissance and environmental reconnaissance), cooperate with one another to form a collaborative strategy while maintaining personalized decisions, so as to accomplish tasks such as target tracking and scene reconnaissance.
Background
A mobile robot is a complex system involving knowledge from multiple fields such as energy, control, materials and computing. With the development of artificial intelligence, sensors and related technologies in recent years, robots have acquired a considerable degree of intelligence: they can replace human beings in tasks such as reconnaissance, search and rescue in complex, dangerous or restricted environments, largely removing scene restrictions while reducing human risk and cost.
Compared with research on improving the performance of single robots, many countries have, since the 1980s, devoted great research enthusiasm to realizing cooperation among multiple robots and to constructing autonomous, efficiently cooperating multi-robot systems. This motivation stems from the fact that in many scenarios a multi-robot system is more applicable than a single robot. In particular, a multi-robot system is more efficient and reliable, avoids single points of failure, can realize multiple functions, and can even exhibit a certain degree of swarm intelligence when carrying out tasks. These characteristics enable multi-robot systems to provide more effective solutions in many practical scenarios such as search and rescue, exploration and transportation. Through years of development, multi-robot systems have gradually revealed their advantages in fields such as manufacturing, logistics and services, while also playing a vital role in the modernization of national defense. For example, in unmanned countermeasure scenarios, multi-robot systems already show mature intelligence and can realize functions such as autonomous aggregation, formation change and cooperative striking of unmanned aerial vehicles in the air. In scenarios such as post-disaster rescue, multi-robot systems also have great advantages: they greatly improve search and rescue efficiency and survivor rescue rates while reducing labor cost and risk. In addition, multi-robot cooperation plays a major role in other scenarios such as collaborative navigation, collaborative transportation and collaborative reconnaissance.
However, current strategies for accomplishing tasks through autonomous multi-robot cooperation are designed manually, usually require training in advance, and the trained strategy is only suitable for a specific task scene. Real task environments, in contrast, are neither static nor closed and contain great uncertainty: robots may join, exit or fail at any time during task execution, and the task targets and task environments also differ greatly from one execution to the next. When faced with such dynamically changing scenarios, manually designed strategies cannot take all possible situations into account, which presents additional challenges for multiple robots completing collaborative tasks. Therefore, to cope with real scenes, if the task scene is a complex, never-before-seen variation, the robots need to autonomously form a collaborative strategy according to the environment they observe. For example, after an earthquake, a post-disaster reconnaissance task is often required in order to obtain the changes in terrain and landform, the collapse conditions at each location, and the possible distribution of life signs. However, the environment that post-disaster robots need to explore is unknown and constantly changing. In this case, it is difficult to design an exploration strategy for the unknown environment by manual on-site survey, and valuable disaster relief time is inevitably wasted, which hinders more efficient and rapid disaster relief. Therefore, how to cope with unknown, complex and changing scenes and enable multiple robots to autonomously form a collaborative strategy, so as to quickly and effectively reconnoiter the field environment after a disaster occurs, is a difficult and hot problem that technicians in the field urgently wish to solve. In particular, how to train multiple robots to obtain an optimal reconnaissance strategy in a short time, and to reconnoiter an unknown complex scene adequately and accurately according to that strategy, is a difficulty to be solved in the field.
The main problem faced by multiple robots in reconnaissance tasks is how to coordinate their behavior autonomously. The number and roles of robots change continuously while the task is being explored and accomplished, the behavior of other robots affects the current robot's environment and rewards (defined below), and this interaction is persistent. Therefore, a high degree of coordination among the robots is needed to achieve adequate detection and reconnaissance of the target environment. A lack of coordination may result in overlapping detection zones or insufficient detection of the environment, leaving the task incomplete.
Most traditional research on multi-robot control targets first-order or second-order integrator models and rarely considers cooperative control under nonlinear models. In a complex dynamic environment, multi-robot problems are often nonlinear and time-varying, are influenced by random factors such as environmental disturbances, and are difficult to model accurately.
In recent years, reinforcement learning algorithms, particularly reinforcement learning algorithms for individual robots, have been used as an alternative to solve problems in the robotics field. Reinforcement learning is an important branch of machine learning. The robot obtains rewards fed back by the environment through continuous interaction with it, penalizes actions with lower value according to these rewards and encourages actions with higher rewards, so that the accumulated reward of the agent is maximized and an optimal strategy is learned. Reinforcement learning requires neither supervision nor manual modeling and hand-designed strategies, and an ideal strategy can be learned in an unknown environment. Specifically, when the robot performs a task, it takes action A in the current state S, obtains a new state S' and an immediate reward R fed back by the environment, forming an experience tuple <S, A, R, S'> which is saved. Through continuous exploration and interaction with the environment, a large number of experience samples are obtained, from which the robot gradually learns an optimal reconnaissance strategy and, when faced with a new state, selects the action that obtains the greatest reward. Because reinforcement learning algorithms are flexible and adaptive, they can solve multi-robot problems in complex unknown tasks and achieve accurate detection in open application scenarios.
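As an illustration of the interaction loop described above, the following minimal Python sketch shows how experience tuples <S, A, R, S'> accumulate in a replay pool; the toy environment, state dimensions and reward are assumptions made purely for illustration.

```python
# Illustrative sketch of the reinforcement learning interaction loop:
# in state S the robot takes action A, receives reward R and next state S',
# and stores the experience tuple <S, A, R, S'>.
import random

def toy_env_step(state, action):
    """Hypothetical environment: returns the next state and an immediate reward."""
    next_state = [s + a for s, a in zip(state, action)]
    reward = -sum(abs(x) for x in next_state)   # closer to the origin = higher reward
    return next_state, reward

replay_pool = []                                 # stores tuples <S, A, R, S'>
state = [1.0, -2.0]
for _ in range(100):
    action = [random.uniform(-0.1, 0.1) for _ in state]   # exploratory action
    next_state, reward = toy_env_step(state, action)
    replay_pool.append((state, action, reward, next_state))
    state = next_state                           # continue from the new state
```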
However, when reinforcement learning algorithms designed for a single robot are applied to a multi-robot scenario, they face the non-stationarity of the environment: actions taken by one robot may change the environment and affect the decisions of other robots. Therefore, in a multi-robot scenario, the strategy networks of the robots need joint training in order to guide the robots to cooperate and complete the task together. For example, when multiple robots carry out a real reconnaissance task, different robots are responsible for reconnoitering different areas while communicating with each other to cooperatively complete the task. The interaction and correlation among robots in a multi-robot system are critical to completing collaborative tasks, but they also bring great uncertainty and complexity to the generation of collaborative strategies.
To solve the above environmental non-stationarity problem, the "centralized training, distributed execution" paradigm has been proposed and widely applied in multi-robot reconnaissance scenarios. These methods fall mainly into two categories: actor-critic based methods and value-based methods. For the first category, the paper "Multi-agent deep deterministic policy gradient" (MADDPG, published on the arXiv preprint website, source code at https://github.com/openai/maddpg) proposes a deep deterministic policy gradient for multiple robots; to avoid environmental instability caused by mutual interference among robots, each robot considers the strategies of the other robots when making decisions. The paper "Counterfactual multi-agent policy gradients" (published at the AAAI conference in 2018, downloadable at https://ojs.aaai.org/index.php/AAAI/article/download/11794/11653) introduces a counterfactual baseline to solve the credit assignment problem among multiple agents, feeding rewards back to each robot according to its contribution to the goal so that rewards are allocated reasonably. The paper "Actor-Attention-Critic for Multi-Agent Reinforcement Learning" (published on the PMLR conference website at http://proceedings.mlr.press/v97/iqbal19a.html in 2019) proposes applying an attention mechanism to multi-robot scenarios, helping each agent consider the important information of other agents when making decisions; this not only addresses scalability but also helps agents selectively screen important environmental information and ignore unimportant information. The second category consists of value-based methods. For example, the paper "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning" (presented at ICML 2018, downloadable at https://www.jmlr.org/papers/volume21/20-081.pdf) uses value decomposition to factorize the joint action value of the robots and assigns an individual action value to each agent, thereby coordinating the behavior between robots more accurately. The paper "QTRAN: Learning to factorize with transformation for cooperative multi-agent reinforcement learning" (published at ICML 2019, downloadable at http://proceedings.mlr.press/v97/son19a.pdf) relaxes the monotonicity constraint of the QMIX method, facilitating more effective learning. In summary, both categories of multi-robot reinforcement learning methods (actor-critic based and value-based) jointly coordinate the behavior strategies of all robots to obtain a larger reward from an environment that evaluates the joint behavior of all robots. In other words, these methods guide the behavior strategy of each agent based on an evaluation of the behavior of all robots.
However, both kinds of methods share a problem: the assessment of the joint behavior of all robots is made from a global perspective, and individual contributions cannot be assessed to motivate individual behavior, which can lead to catastrophic discordance, known as contribution mismatch. By analogy with team management in human life, it is unreasonable for a manager to measure the contribution of a single member only by the performance of the entire team. This may not only mask individuals who are lazy and fail to fully exploit individual value, but may even trap the whole group in sub-optimal conditions. For example, when a group performs a task together, a boss who does not know each person's division of labor can only score the performance of the whole team. If the evaluation score of the whole team is taken as the final score of each individual, the scores are mismatched: some individuals contribute greatly to the team and greatly advance the task, while others do not exert their full utility and may even slow the task down. Such an approach only dampens individual enthusiasm and is detrimental to maximizing team value.
On the other hand, an evaluation mechanism that scores only the joint behavior of all robots is one-dimensional. If this single evaluation score is used to motivate the behavior of all robots, it leads to homogeneous agent behavior and thus suppresses personalized strategies. In a reconnaissance scene, however, the specific environment each robot faces is different: one robot may need to detect mountainous terrain while another needs to reconnoiter a lake area, and the strategies the robots adopt in these different scenes should be heterogeneous. Therefore, how to design a more accurate evaluation mechanism for a single robot, guide multiple robots to complete tasks cooperatively, keep each robot's strategy heterogeneous, and stimulate the personalized contribution of each robot, so that collaborative tasks can be completed more quickly and efficiently in complex scenes, has become one of the problems that technicians need to solve in collaborative reconnaissance scenarios.
At present, multi-robot collaborative reconnaissance methods addressing these problems fit the contribution of each robot in a team with a complex neural network, but face problems such as large consumption of computing resources and fixed, limited application scenarios. In a real-world scenario, a single robot may join or leave the system at any time, and each time the number or roles of robots change, neural-network-based techniques require retraining the neural network model and redeploying the reconnaissance strategy to the local robots, which is time-consuming and impractical. Whether in a disaster-area reconnaissance scenario where "time is money" or in other time-critical environmental reconnaissance, there are high demands on time and efficiency: the multi-robot system needs to cope quickly with dynamic changes in agents, tasks and environment, while the robots retain personalized strategies and coordinate rapidly with one another to accomplish the reconnaissance objective. The paper "Celebrating Diversity in Shared Multi-Agent Reinforcement Learning" (published at the NeurIPS conference in 2021, downloadable at https://proceedings.neurips.cc/paper/2021/file/20aee3a5f4643755a79ee5f6a73050ac-Paper.pdf) maximizes the mutual information between robot trajectories and identities by means of information theory, thereby encouraging extensive exploration and diverse personalized behavior. Although this method encourages agents to adopt heterogeneous behavior strategies, it still suffers from being fixed to a specific scene configuration; when the task scene changes, the originally trained model may not achieve the desired effect in the new task.
Disclosure of Invention
The application scenario of the invention is a multi-robot collaborative reconnaissance task carried out without prior information. This scenario poses the following challenges. First, the requirements on reconnaissance time and training effect are very demanding: whether in disaster reconnaissance or other environmental surveys, the field situation is complex and transient, which places high demands on both the training time and the training effect of the multi-robot system, and there may not be enough time to obtain an effective strategy. Second, the specific information of the scene is not known until the task is performed; at most the type of scene to be faced is known, but specific environmental details, such as the number of robots required and the environmental reward mechanism, are not.
The technical problem to be solved by the invention is to provide a complex scene-oriented multi-robot collaborative reconnaissance method, so that robots can quickly and efficiently obtain collaborative strategies in a new reconnaissance task scene while maintaining personalized behavior strategies among different robots, ensuring that the reconnaissance time is short and that reconnaissance tasks in the various scenes that may occur after a disaster can be completed.
Therefore, based on the above analysis of the limitations of the prior art, the invention provides a complex scene-oriented multi-robot collaborative reconnaissance method. By introducing contribution incentive mechanisms at both the global level and the individual level, the robots learn heterogeneous behavior strategies while remaining cooperative, generate personalized and flexible decisions for different environments, and finally achieve efficient autonomous collaborative reconnaissance in dynamically changing complex environments.
The technical scheme of the invention is as follows: construct a multi-robot reconnaissance system consisting of robot nodes and a cloud server node; pre-train robot node simulation models and store their personalized collaborative strategies on the server node; use the strategies trained in the simulation scenes to train the strategy networks for the multi-robot collaborative reconnaissance task in the real scene, obtaining a sampling strategy with which the multiple robots autonomously and collaboratively reconnoiter the real scene; the multi-robot system then completes multi-target reconnaissance tasks according to the sampling strategy. The invention adopts a double-layer evaluation mechanism and introduces incentive mechanisms based on both the overall situation and the individual situation, so that multiple robots can generate personalized, flexible decisions for different environments, perform efficient autonomous collaborative reconnaissance in a dynamically changing complex environment, and complete reconnaissance tasks efficiently in a shorter time. Specifically, when the strategy network updates its policy, it obtains evaluation values of the current actions from both the local evaluation network and the global evaluation network, and learns from these values how to obtain higher group rewards and higher individual rewards, so as to formulate better action instructions. At the global level, the global evaluation network evaluates the robots' behavior from a global perspective, coordinating the team to obtain the maximum reward and complete the collaborative reconnaissance task. At the individual level, a separate local evaluation network is introduced for each robot according to its local information, stimulating personalized contributions so that different robots can select personalized behavior strategies for different environments. An intuitive explanation is that in a team, a general manager is needed to evaluate the performance of the whole team and guide it to better results, while a direct supervisor is needed to hear individual opinions, judge each individual's behavior, and balance the maximization of individual and group benefits. With this technique, the robot system can rapidly cope with dynamic changes of agents, tasks and environments in unpredictable, complex reconnaissance task scenes, form flexible joint strategies that do not depend on a specific cooperation pattern, and at the same time keep the behavior strategies of different robots heterogeneous.
There is no disclosure related to the application of a double layer evaluation technique to multi-robot collaborative reconnaissance.
The invention comprises the following steps:
In the first step, a multi-robot reconnaissance system is constructed. The multi-robot reconnaissance system consists of K robot nodes and a cloud server node, where K is a positive integer, and all K robot nodes are connected to the cloud server node. The robot nodes are robots capable of observing, moving and communicating, can run a software system and have the same working mode, for example the Intel Aero unmanned aerial vehicle node or the TurtleBot 3 ground robot. Each robot node is installed with the Ubuntu operating system (version 16.04 or above) on an X86 architecture processor, or Ubuntu MATE (version 16.04 or above) on an ARM architecture processor, together with the robot operating system ROS and the deep learning framework PyTorch 1.7.0, and is also equipped with a detection module, a first calculation module, a first storage module, a motion module and a first communication module.
The detection module is a sensor for collecting environmental data, such as an infrared camera, a depth camera or a scanning radar. The detection module is connected with the first storage module; every t seconds it photographs or scans the environmental information and the states of other robot nodes within its field of view to obtain the state information of the reconnaissance task, and stores this state information in the first storage module. The recommended value of t is between 0.3 and 1 second.
The first storage module is connected with the detection module and the first calculation module, and has more than 1 GB of available space. An experience playback pool is set up in the first storage module, storing the latest N pieces of trajectory experience information of the robot node (the recommended range of N is 5000-10000). The experience information of the n-th trajectory (1 ≤ n ≤ N) is expressed as [s_n, a_n, r_n, s_{n+1}]. Here s_n is the task state information observed by the detection module at the n-th time point, including the environmental information within the observable range and the speed and position information of the other K-1 robot nodes. a_n denotes the action taken by robot node k (1 ≤ k ≤ K) between the n-th and (n+1)-th time points, for example applying a force or accelerating in a certain direction; when robot node k executes the action, the detection module automatically records a_n. r_n is the feedback score observed by the detection module for the current task completion at the n-th time point. s_{n+1} is the task state information observed by the detection module at the (n+1)-th time point, which, besides the current robot node k, also covers the speed and position information of the other K-1 robots and the position information of environmental obstacles. Logically, s_{n+1} can be understood as the new task state information obtained after the robot performs action a_n in state s_n.
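A minimal sketch of such an experience playback pool is shown below, assuming Python as the implementation language; only the capacity N and the [s_n, a_n, r_n, s_{n+1}] layout come from the text, while the class and method names are illustrative.

```python
# Sketch of the experience playback pool: keeps the latest N trajectory
# experiences and supports random sampling for training.
from collections import deque
import random

class ExperiencePool:
    def __init__(self, capacity=10000):              # N, recommended 5000-10000
        self.pool = deque(maxlen=capacity)            # oldest entries drop out

    def store(self, s_n, a_n, r_n, s_next):
        self.pool.append((s_n, a_n, r_n, s_next))     # [s_n, a_n, r_n, s_{n+1}]

    def sample(self, h):
        return random.sample(list(self.pool), h)      # h random experiences
```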
The first computing module is responsible for designing and optimizing the reconnaissance strategy and for sending action instructions to the motion module according to that strategy. The first computing module is connected with the first storage module, the motion module and the first communication module. The first computing module consists of six fully connected neural networks created with the PyTorch deep learning framework: a strategy network and a target strategy network for forming action instructions, and a local evaluation network, a global evaluation network, a local target evaluation network and a global target evaluation network for optimizing the strategy (for the underlying principle see the paper "Actor-Critic Algorithms", published by ACM in April 2001, downloadable at https://www.researchgate.net/publication/2354219_actor-critic_algorithms).
The strategy network is connected with the first storage module, the motion module, the first communication module, the target strategy network, the local evaluation network and the global evaluation network. It extracts h (1 ≤ h ≤ H) pieces of local trajectory experience information (i.e., trajectory experience information of this robot) from the experience playback pool of the first storage module, determines the next action according to the current task state information in the local trajectory experience, sends an instruction containing the next action to the motion module, and sends its network parameters (i.e., the weight matrices and bias vectors of each layer) to the target strategy network. The strategy network obtains from the local evaluation network and the global evaluation network the evaluation values of the action taken by the current strategy, learns from these values how to obtain higher action evaluation values, and formulates better action instructions. The strategy network saves its parameters as a data-format file and, after the action is finished, sends this file to the first communication module.
The target policy network is connected with the policy network, and the network parameters of the target policy network are updated according to the network parameters of the policy network acquired from the policy network.
The local evaluation network is connected with the first storage module, the strategy network and the local target evaluation network, receives h pieces of local track experience information extracted by the strategy network from the first storage module, receives a loss function measurement value from the local target evaluation network, evaluates the action value of the robot according to the loss function measurement value, and simultaneously sends own network parameters to the local target evaluation network. To direct policy network updates, the local evaluation network sends an evaluation value of the action taken by the current policy network to the policy network.
The local target evaluation network is connected with the local evaluation network, and the network parameters of the local target evaluation network are updated according to the network parameters of the local evaluation network acquired from the local evaluation network.
The global evaluation network is connected with the first storage module, the strategy network and the global target evaluation network. It receives from the first storage module the h pieces of global trajectory experience information (i.e., the joint trajectory experience information of all robots) extracted at the same time steps as those used by the local evaluation network, receives loss function metric values from the global target evaluation network, evaluates the joint action value of all robots according to these values, and sends its own network parameters to the global target evaluation network. To guide strategy network updates, the global evaluation network sends the evaluation value of the joint action (the action taken by the current strategy network combined with the actions taken by the other robots) to the strategy network.
The global target evaluation network is connected with the global evaluation network, and the network parameters of the global target evaluation network are updated according to the network parameters of the global evaluation network acquired from the global evaluation network.
A "policy" in this specification is expressed concretely as the parameters of the policy network: the current task state information s is input into the policy network and, through multiplication by the weight matrices between layers of neurons and addition of the bias vector of each layer of neurons, an action instruction a is obtained at the output layer. At a macroscopic level, the policy network enables a robot node to decide independently, based on the currently observed reconnaissance scene state, what action to perform next. Thus, the parameters of the policy network also reflect the decision process of the robot node, i.e., its "policy". Each robot node has independent policy network parameters and autonomously determines its own behavior. The other five neural networks assist in guiding the parameter updates of the policy network.
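The following PyTorch sketch illustrates such a fully connected policy network, whose parameters are the weight matrices and bias vectors mentioned above; the layer sizes and activation functions are assumptions, not values prescribed by the invention.

```python
# Minimal fully connected policy network: state in, action instruction out.
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded action command
        )

    def forward(self, state):
        return self.net(state)                         # action instruction a

policy = PolicyNetwork(state_dim=8, action_dim=2)
action = policy(torch.zeros(1, 8))                     # decide the next action
```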
The motion module is composed of a motor, a tire and other driving equipment and a digital-to-analog converter. The motion module is connected with the first calculation module, receives an action instruction from a strategy network of the first calculation module, converts a digital signal into an analog signal by means of a built-in digital-to-analog converter, and transmits the analog signal to the driving equipment, so that the robot makes corresponding actions, and finally, a reconnaissance scene is changed.
The first communication module is connected with the cloud server node and the first computing module, and the first communication module (such as a wireless network card) receives the parameter file in the data format from the policy network of the first computing module and uploads the parameter file in the data format to the cloud server node through the SSH communication service.
The cloud server node refers to a cloud device such as a workstation and a server, and is provided with a second communication module (for example, a wireless network card) and a second storage module (for example, a hard disk with a capacity greater than 500 GB). The second storage module is provided with an Ubuntu16.04 operating system and a Pytorch deep learning framework with the same version as the robot node. And on the cloud server, the second communication module is connected with the second storage module and is communicated with the K robot nodes through an SSH communication protocol.
In the second step, M reconnaissance scenes are built based on the Gazebo simulation environment (version 9.10.0 or above is required) for pre-training the six networks in the first computing module. The specific steps are as follows:
2.1 selecting a computer provided with a Ubuntu operating system (version should be consistent with that of the robot nodes), installing and operating a Gazebo simulation environment, simulating K robot nodes in the multi-robot system constructed in the first step, and establishing corresponding robot node simulation models for the K robot nodes.
2.2 referring to various environmental elements (such as barriers, target points, robots in the reconnaissance environment and the like) possibly occurring in the reconnaissance environment, performing equal-scale scaling modeling, and constructing an environment simulation model which approximates the real environment as much as possible.
2.3 randomly selecting a robot node simulation model and various environment elements, and randomly initializing the initial positions and the numbers of the robot node simulation model and the various environment elements, so as to form M reconnaissance scenes for simulating various possible real scenes. Wherein M is a positive integer, M is not less than 10, and the larger M is better under the condition of resource permission.
2.4 Design an evaluation index for the degree of completion of the reconnaissance task, used to evaluate the task completion effect of the current multi-robot joint strategy in the simulation environment. For example: if the robot node simulation model misses a target point during reconnaissance, 1 point is deducted; if it collides with an obstacle model while moving, 10 points are deducted; if it collides with another robot node simulation model, 5 points are deducted. For instance, if by a certain moment 5 target points have been missed since the robot started moving, and during reconnaissance the robot node simulation model has collided with an obstacle model once and with other robot node simulation models twice, the score at that moment is -(5×1 + 10 + 2×5) = -25 points. The task completion evaluation index is formulated flexibly by the user according to the situation on site; in principle, points are awarded for actions favorable to completing the task goal and deducted for actions unfavorable to it.
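A short Python sketch of this example scoring rule is given below; the function name and signature are illustrative, and the weights are the example values from step 2.4.

```python
# Example task-completion score: -1 per missed target point,
# -10 per collision with an obstacle, -5 per collision with another robot.
def completion_score(missed_targets, obstacle_collisions, robot_collisions):
    return -(1 * missed_targets + 10 * obstacle_collisions + 5 * robot_collisions)

# Worked example from the text: 5 missed targets, 1 obstacle collision,
# 2 robot-robot collisions -> -(5*1 + 10 + 2*5) = -25 points.
assert completion_score(5, 1, 2) == -25
```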
In the third step, the K robot node simulation models are pre-trained in the M reconnaissance scenes constructed in the second step, so that the robot node simulation models acquire collaborative reconnaissance strategies for different reconnaissance scenes, yielding K data-format parameter files that record the collaborative strategies of the K robot node simulation models in each simulated reconnaissance scene. A single robot node simulation model trains its strategy network with the DDPG reinforcement learning algorithm ("Continuous control with deep reinforcement learning", published on the arXiv preprint website at https://arxiv.53yu.com/abs/1509.02971 in September 2015); pre-training is not limited to this algorithm, and other algorithms such as PPO ("Proximal policy optimization algorithms", published on the arXiv preprint website at https://arxiv.53yu.com/abs/1707.06347 in July 2017) may also be adopted. The K robot node simulation models train their strategy networks independently and in parallel. The method by which the k-th (1 ≤ k ≤ K) robot node simulation model trains its strategy network with the DDPG reinforcement learning algorithm is as follows:
3.1 initializing a first calculation module of the robot node simulation model K, wherein six neural networks in the first calculation module all need to initialize parameters, the parameters comprise a weight matrix and a bias vector between each layer of each neural network, the values of the weight matrix and the bias vector are randomly generated according to normal distribution with an expected value of 0 and a variance of 2, and the parameters selected by the K robot node simulation models can be different from each other or the same.
3.2 initialisation variable m=0.
3.3 selecting an mth simulation scout scene from the M simulation scout scenes in the Gazebo simulation environment.
3.4 Randomly initialize the initial position of the k-th robot node simulation model and the initial positions of the various environment elements in the m-th simulated reconnaissance scene. Initialize the training round number i = 0 and set the maximum number of training rounds I (a positive integer; the recommended range is 400-1000).
3.5 Initialize the action step number j = 0 of the k-th robot node simulation model in the i-th training round, and set the maximum number of steps J per round (a positive integer; the recommended range is 30-50). If the k-th robot node simulation model can finish the task within J steps, its actual number of action steps in the i-th round will be smaller than J.
3.6 The strategy network of the k-th robot node simulation model obtains the local state information o_j of the j-th step (for example, the position of the k-th robot node simulation model) from the first storage module, inputs o_j into the strategy network and, through multiplication by the weight matrices between layers of neurons and addition of each layer's bias vector, outputs the action instruction a_j of the j-th step; a_j is sent to the first storage module and the motion module.
3.7 The first storage module stores a_j while the motion module performs action a_j.
3.8 The k-th robot node simulation model detects the actions executed by the other K-1 robot nodes in the j-th step and, combined with the action a_j it executed itself, obtains the joint action executed by the K robot nodes in the j-th step, denoted A_j, which is stored in the first storage module.
3.9 According to the evaluation index formulated in step 2.4, executing action a_j yields the task completion score r_j for the k-th robot node simulation model, and r_j is fed back to the first storage module of the k-th robot node simulation model.
3.10 After the k-th robot node simulation model executes action a_j, the state of the reconnaissance environment changes; the detection module of the k-th robot node simulation model observes the global state information s_{j+1} and the local state information o_{j+1} of the (j+1)-th step and stores s_{j+1} and o_{j+1} in the first storage module.
3.11 The first storage module of the k-th robot node simulation model combines s_j, A_j, r_j, s_{j+1} and o_j, a_j, r_j, o_{j+1} respectively to obtain the j-th piece of global trajectory experience information [s_j, A_j, r_j, s_{j+1}] and the j-th piece of local trajectory experience information [o_j, a_j, r_j, o_{j+1}]. The global and local trajectory experience information of the same step is called a trajectory experience information pair. The pair [s_j, A_j, r_j, s_{j+1}] and [o_j, a_j, r_j, o_{j+1}] is sent to the experience playback pool of the first storage module, which stores the two parts of the pair separately.
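A small sketch of step 3.11 is shown below; the helper name and the container used for the pool are assumptions, while the pairing of the global experience [s_j, A_j, r_j, s_{j+1}] with the local experience [o_j, a_j, r_j, o_{j+1}] follows the text.

```python
# Pair the global and local trajectory experiences of the same step and
# store them together in the experience playback pool.
def store_experience_pair(pool, s_j, A_j, o_j, a_j, r_j, s_next, o_next):
    global_exp = (s_j, A_j, r_j, s_next)    # joint view of all K robots
    local_exp = (o_j, a_j, r_j, o_next)     # view of robot k only
    pool.append((global_exp, local_exp))    # kept as a trajectory experience pair

experience_pool = []
store_experience_pair(experience_pool,
                      s_j=[0.0, 0.0], A_j=[[0.1], [0.2]],
                      o_j=[0.0], a_j=[0.1], r_j=-1.0,
                      s_next=[0.1, 0.2], o_next=[0.1])
```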
3.12 The first calculation module of the k-th robot node simulation model checks the amount of data in the experience playback pool. If H trajectory experience information pairs have already been stored, H trajectory experience information pairs are randomly extracted from the pool, the H pieces of global trajectory experience information are sent to the global evaluation network, the H pieces of local trajectory experience information are sent to the local evaluation network, and the procedure goes to step 3.13; otherwise, let j = j + 1 and go to step 3.6. The recommended value of H is 100-200.
3.13 The first calculation module of the k-th robot node simulation model numbers the H extracted trajectory experience information pairs from 1 to H in order, and initializes the marker sequence number q = 0.
3.14 Starting from the marker sequence number q, the first calculation module of the k-th robot node simulation model selects the next h trajectory experience information pairs in order, where 1 ≤ h ≤ H, and lets q = q + h, i.e., q is updated to the sequence number of the last of the h selected trajectory experience information pairs.
3.15 The global evaluation network of the k-th robot node simulation model uses the h selected pieces of global trajectory experience information to minimize the global loss function L_global by gradient descent; at the same time, the local evaluation network of the k-th robot node simulation model uses the h selected pieces of local trajectory experience information to minimize the local loss function L_local by gradient descent, thereby updating the parameters of the global and local evaluation networks:

L_global = (1/h) Σ_j [ r_j + α·Q′_global(s_{j+1}, μ′(s_{j+1})) − Q_global(s_j, A_j) ]²   (1)

L_local = (1/h) Σ_j [ r_j + α·Q′_local(o_{j+1}, μ′(o_{j+1})) − Q_local(o_j, a_j) ]²   (2)

where the sums run over the h selected trajectory experience pairs; Q_global denotes the global evaluation network, Q_local the local evaluation network, Q′_global the global target evaluation network and Q′_local the local target evaluation network. a_j denotes the action selected by the strategy network for the k-th robot node simulation model in the j-th step, and A_j denotes the joint action executed by the K robot node simulation models in the j-th step. μ′ denotes the target strategy network, and the symbols in brackets immediately after these letters are the inputs to the network: μ′(o_{j+1}) denotes the action instruction output by the target strategy network when the local state information o_{j+1} observed by the k-th robot node simulation model in the (j+1)-th step is input, and μ′(s_{j+1}) denotes the joint action predicted for the K robots in the (j+1)-th step from the global state information s_{j+1}. Q_local(o_j, a_j) denotes the estimated score of the k-th robot node simulation model's action, obtained by inputting the local state information o_j and action a_j into the local evaluation network. The term (r_j + α·Q′_global(s_{j+1}, μ′(s_{j+1})) − Q_global(s_j, A_j)) expresses that the closer the global evaluation network's estimated score Q_global(s_j, A_j) for the joint action is to the global target score r_j + α·Q′_global(s_{j+1}, μ′(s_{j+1})), the better; likewise, (r_j + α·Q′_local(o_{j+1}, μ′(o_{j+1})) − Q_local(o_j, a_j)) expresses that the closer the local evaluation network's estimated score Q_local(o_j, a_j) for the action of the k-th robot node simulation model is to the local target score r_j + α·Q′_local(o_{j+1}, μ′(o_{j+1})), the better.
Alpha represents a discount factor, and takes a constant of 0 to 1, and the preferred range of alpha is 0.2 to 0.3.
Part of formula (2) derives from the MDP (Markov Decision Process) formulation. The local evaluation network of the k-th robot node simulation model evaluates the score of the k-th robot node simulation model's action at step j. The strategy network of the k-th robot node simulation model predicts the action of step j+1 from the state information of step j+1, and the local target evaluation network of the k-th robot node simulation model evaluates the state and predicted action of step j+1 to obtain the estimated score of step j+1. This estimated score is multiplied by the discount factor α and added to the reward r_j obtained from the environment after step j, giving the target score of the j-th action. Subtracting the estimated score of the j-th step from the target score gives a difference, which is taken as the gradient optimization target of the local evaluation network; the parameters of the local evaluation network are updated with this gradient optimization target, and the parameters of the local target evaluation network of the k-th robot node simulation model are updated accordingly. Minimizing the objective function L_local effectively helps the local evaluation network make a better estimate of the value of the j-th action.
Similarly to formula (1), the global evaluation network of the k-th robot node simulation model estimates the score of the joint action of the K robot node simulation models in the j-th step. Meanwhile, the global state and joint action of step j+1 are evaluated to obtain the estimated score of the joint action of step j+1; this score is multiplied by the discount factor α and added to the reward r_j obtained from the environment after the joint action of step j, giving the target score of the j-th joint action. Subtracting the estimated score of the j-th joint action from the target score gives a difference, which is taken as the gradient optimization target of the global evaluation network; the parameters of the global evaluation network are updated with this target, and minimizing the objective function L_global effectively helps the global evaluation network make a better estimate of the joint action value.
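The following PyTorch sketch illustrates how the two TD targets and the two losses of formulas (1) and (2) could be computed over a sampled mini-batch; all function and variable names are assumptions, and the predicted joint action for step j+1 is taken as an input computed elsewhere.

```python
# Double-critic update: regress the global and local evaluation networks
# toward their respective target scores r_j + alpha * Q'(next, mu'(next)).
import torch
import torch.nn.functional as F

def critic_losses(s, A, o, a, r, s_next, o_next, joint_next_actions,
                  q_global, q_local, q_global_tgt, q_local_tgt,
                  target_policy, alpha=0.25):
    with torch.no_grad():
        a_next = target_policy(o_next)                                   # mu'(o_{j+1})
        y_global = r + alpha * q_global_tgt(s_next, joint_next_actions)  # target of (1)
        y_local = r + alpha * q_local_tgt(o_next, a_next)                # target of (2)
    loss_global = F.mse_loss(q_global(s, A), y_global)                   # L_global
    loss_local = F.mse_loss(q_local(o, a), y_local)                      # L_local
    return loss_global, loss_local
```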
3.16 The global evaluation network and the local evaluation network together guide the policy update through a double-layer evaluation mechanism, and the strategy network of the k-th robot node simulation model updates its parameters according to formula (3):

∇_θ J ≈ (1/h) Σ_j ∇_θ μ(o_j) · ∇_{a_j} [ Q_global(s_j, A_j) + Q_local(o_j, a_j) ] |_{a_j = μ(o_j)}   (3)

where a_j denotes the action selected by the strategy network μ for the k-th robot node simulation model in the j-th step, Q_global(s_j, A_j) denotes the estimated score of the joint action obtained by inputting the global state information s_j and joint action A_j into the global evaluation network, and Q_local(o_j, a_j) denotes the estimated score of the k-th robot node simulation model's action obtained by inputting the local state information o_j and action a_j into the local evaluation network. The double-layer evaluation mechanism of the global evaluation network and the local evaluation network can be seen from formula (3).
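A hedged sketch of this double-layer policy update is shown below: the strategy network is pushed toward actions that raise both the global and the local evaluation scores, which is one straightforward way to realize formula (3); the way the joint action is assembled here is an assumption.

```python
# Policy loss under the double-layer evaluation mechanism: maximize
# Q_global(s, A) + Q_local(o, a_k) by minimizing the negated sum.
import torch

def policy_loss(policy, q_global, q_local, s, o, other_actions):
    a_k = policy(o)                                  # action of robot k from mu(o_j)
    A = torch.cat([other_actions, a_k], dim=-1)      # joint action containing a_k
    return -(q_global(s, A) + q_local(o, a_k)).mean()
```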
3.17 if q+h > H, executing step 3.18; otherwise, go to step 3.13.
3.18 The global target evaluation network of the k-th robot node simulation model updates its own network parameters according to the parameters of the global evaluation network, as in formula (4); the local target evaluation network updates its own network parameters according to the parameters of the local evaluation network, as in formula (5); and the target strategy network updates its own network parameters according to the parameters of the strategy network, as in formula (6):

θ′_global ← τ_1·θ_global + (1 − τ_1)·θ′_global   (4)
θ′_local ← τ_1·θ_local + (1 − τ_1)·θ′_local   (5)
θ′_μ ← τ_2·θ_μ + (1 − τ_2)·θ′_μ   (6)

where θ_global, θ_local and θ_μ respectively denote the parameters of the global evaluation network, the local evaluation network and the strategy network at the j-th step, and θ′_global, θ′_local and θ′_μ denote the parameters of the global target evaluation network, the local target evaluation network and the target strategy network at the j-th step. τ_1 and τ_2 are weight constants between 0 and 1, with a recommended range of 0.2 to 0.3. Formulas (4), (5) and (6) realize the assignment of the three sets of target network parameters; by updating them only partially, the training pace is slowed down and over-training is avoided.
3.19 let j=j+1, if J reaches the set maximum number of steps per round J, execute step 3.20; otherwise, executing the step 3.6.
3.20, making i=i+1, if I reaches the set maximum training round number I, executing step 3.21; otherwise, executing the step 3.4.
3.21 Let m = m + 1; if m equals the maximum number of reconnaissance scenes M, execute step 3.23; otherwise, execute step 3.22.
3.22 The first calculation module of the k-th robot node simulation model saves the trained strategy network model, naming it with the serial number of the simulated reconnaissance scene just completed. Meanwhile, the first calculation module of the k-th robot node simulation model clears all network parameters of the strategy network, the target strategy network, the global evaluation network, the global target evaluation network, the local evaluation network and the local target evaluation network and re-initializes them with random values, and the first storage module of the k-th robot node simulation model empties the experience pool in preparation for training on the new scene task. Go to step 3.3.
3.23 The first calculation module of the k-th robot node simulation model saves the parameters of the strategy network (i.e., the collaborative strategy of the k-th robot node simulation model in the simulated reconnaissance scene) in a data-format parameter file generated by the PyTorch deep learning framework. Then the fourth step is performed.
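Assuming the standard PyTorch state_dict serialization, step 3.23 could be sketched as follows; the file-naming scheme combining the robot index k and scene index m is an illustrative assumption.

```python
# Save the trained strategy network parameters as a data-format file.
import torch

def save_policy(policy_net, robot_index_k, scene_index_m):
    filename = f"policy_robot{robot_index_k}_scene{scene_index_m}.pt"
    torch.save(policy_net.state_dict(), filename)    # parameter file to upload
    return filename
```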
The third step is executed by the K robot node simulation models in parallel, yielding K data-format parameter files for each simulation scene, which record the collaborative strategies of the K robot node simulation models in that simulated reconnaissance scene and are used to further adjust the robot node simulation model strategies.
In the fourth step, the K robot node simulation models respectively upload their data-format parameter files to the cloud server node through the SSH service, i.e., the personalized collaborative strategies obtained by the training in the third step are stored on the cloud server node, so that the strategy network models obtained in the pre-training stage are preserved and shared. The K robot node simulation models execute this step in parallel; the k-th robot node simulation model is taken as an example.
4.1 the first calculation module of the kth robot node simulation model sends the parameter file in the data format to the first communication module of the kth robot node simulation model.
4.2 The first communication module of the k-th robot node simulation model sends the data-format parameter file to the second communication module of the cloud server node through the SSH communication service.
4.3 The second communication module stores the received data-format parameter file in the second storage module of the cloud server node.
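The upload in the fourth step could, for example, be sketched with the third-party paramiko library as below; the host name, credentials and remote path are placeholders rather than values from the invention.

```python
# Upload the data-format parameter file to the cloud server node over SSH/SFTP.
import paramiko

def upload_policy_file(local_path, remote_path,
                       host="cloud-server", user="robot", password="***"):
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect(host, username=user, password=password)
    sftp = client.open_sftp()
    sftp.put(local_path, remote_path)   # transfer the parameter file
    sftp.close()
    client.close()
```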
In the fifth step, the multi-robot system constructed in the first step is deployed to the place where the reconnaissance task needs to be carried out, i.e., a real complex scene such as a disaster area or war zone in which the reconnaissance task must be completed. The strategy models trained in the simulation scenes, i.e., the data-format parameter files, are used to help the multiple robots train the strategy networks for the collaborative reconnaissance task in the real scene, improving the task completion effect in the real scene of the reconnaissance strategies trained in the simulated reconnaissance scenes. Specifically, the K robot nodes execute the following steps in parallel; the k-th robot node is taken as an example:
5.1 the first communication module of the kth robot node sends a downloading request to the second communication module of the cloud server node, and requests to download the policy model.
5.2 The second communication module reads, from the second storage module of the cloud server node, the saved data-format policy network model parameter file of the simulation scene most similar to the unknown scene faced by the kth robot node, and transmits the data-format parameter file to the first communication module of the kth robot node through the SSH service protocol.
And 5.3, the first communication module of the kth robot node transmits the parameter file in the data format to the first calculation module of the kth robot node.
5.4 The kth robot node loads the data-format parameter file into the policy network of the kth robot node, loading it directly with the PyTorch deep learning framework.
5.5 The parameters of the target strategy network, the local evaluation network and the global evaluation network of the kth robot node are initialized respectively. The weight matrices and bias vectors of each neural network are randomly generated from a normal distribution with mean 0 and variance 2; the model initialization parameters of the K robot nodes can be the same or different.
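A minimal sketch of the random initialization described in step 5.5, assuming fully connected PyTorch networks; the layer sizes and helper name are illustrative assumptions, not taken from the invention.

```python
import math
import torch
import torch.nn as nn

def init_normal(module: nn.Module, mean: float = 0.0, variance: float = 2.0) -> None:
    """Re-initialize every linear layer's weights and biases from N(mean, variance)."""
    std = math.sqrt(variance)
    for layer in module.modules():
        if isinstance(layer, nn.Linear):
            nn.init.normal_(layer.weight, mean=mean, std=std)
            nn.init.normal_(layer.bias, mean=mean, std=std)

# Hypothetical evaluation (critic) network: the actual layer sizes are not specified in the text.
local_critic = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 1))
init_normal(local_critic)  # each robot node may run this with the same or different random seeds
```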
5.6 emptying the experience playback pool of the first storage module of the kth robot node.
5.7 initializing the k-th robot node to perform action step number j=0, wherein the maximum executable step number is J (in order to ensure that the robot node can acquire an ideal strategy within the step J, the recommended value range of J is 1000-2000).
5.8 The policy network in the first calculation module of the kth robot node obtains the local state information o_j of the jth step of the reconnaissance scene from the first storage module, inputs o_j into the policy network to obtain an action instruction a_j, stores a_j in the first storage module, and sends a_j to the motion module;
5.9 The motion module performs action a_j according to the instruction.
5.10 After the kth robot node executes action a_j, the completion degree score r_j fed back by the task scene is obtained according to the reconnaissance task completion degree evaluation index designed in step 2.4, and r_j is stored in the first storage module of robot node k.
5.11 After the kth robot node performs action a_j, the global environment and the local state of the reconnaissance environment change; after the detection module of robot node k observes the (j+1)th-step global state information s_(j+1) and local state information o_(j+1), s_(j+1) and o_(j+1) are stored in the first storage module.
5.12 The kth robot node integrates the information s_j, A_j, r_j, s_(j+1) and o_j, a_j, r_j, o_(j+1) respectively to obtain the jth piece of global track experience information [s_j, A_j, r_j, s_(j+1)] and the jth piece of local track experience information [o_j, a_j, r_j, o_(j+1)], and stores [s_j, A_j, r_j, s_(j+1)] and [o_j, a_j, r_j, o_(j+1)] in the experience track playback pool of the first storage module.
5.13 The first calculation module of the kth robot node checks the data in the experience playback pool of the first storage module; if 2H pieces of track experience information have already been stored, the corresponding H pieces of global track experience information and H pieces of local track experience information are randomly extracted from the playback pool to optimize the parameters of the six neural networks in the first calculation module (the suggested value of H here is 50-100, which need not equal the value used in the pre-training stage), and then step 5.14 is executed; otherwise, let j=j+1 and go to step 5.8.
5.14 The global evaluation network and local evaluation network of the kth robot node respectively read the H pieces of global track experience information and the H pieces of local track experience information, and minimize the global loss function L_global in formula (1) and the local loss function L_local in formula (2) by the gradient descent method, thereby updating the parameters of the global evaluation network and the local evaluation network respectively.
And 5.15, the strategy network of the kth robot node reads H pieces of global track experience information and H pieces of local track experience information, and updates parameters of the strategy network according to a strategy gradient update formula in the formula (3) by a gradient descent method, so that optimization of the strategy network is realized.
And 5.16. The global target evaluation network, the local target evaluation network and the target policy network of the kth robot node update parameters of the global target evaluation network, the local target evaluation network and the target policy network according to the update formulas of formulas (4), (5) and (6).
5.17 Let j=j+1; if j reaches the maximum step number J, the collaborative reconnaissance strategy has been optimized in the real scene, and the sixth step is executed; otherwise, step 5.8 is executed.
After the fifth step is executed, according to the scout strategy obtained by training and optimizing the current real scene, the robot node can autonomously decide the action to be taken in the next step according to the observed global state and the local state information. The policy network parameters of the K robot nodes together form a sampling policy of the multi-robot autonomous collaborative reconnaissance real scene.
And sixthly, deploying K robot nodes into the real scene (namely the scene needing to perform the reconnaissance task) in the fifth step.
And seventhly, the multi-robot reconnaissance system autonomously and cooperatively completes the multi-target reconnaissance task in an open environment according to the sampling strategy of the multi-robot autonomous and cooperatively reconnaissance real scene obtained in the fifth step. The K robot nodes finish the multi-target reconnaissance task in parallel, wherein the method for finishing the multi-target reconnaissance task by the kth robot node comprises the following steps:
7.1, setting a plurality of target points to be detected by the multi-robot detection system according to the detection task requirement, storing coordinates of all the target points in a list L, and sending the L to a first communication module of a kth robot node. The first communication module forwards the list L to the kth robot node first calculation module. The robot node selects a target point from the list L with reference to environmental information such as the position of the target point.
7.2 initializing the execution step number j=0 of the kth robot node.
7.3 The detection module of the kth robot node obtains the jth-step global state information s_j (e.g., information about other surrounding robot nodes and various environmental elements) and the jth-step local state information o_j (e.g., position information of the kth robot node), and transmits s_j and o_j to the first calculation module of the kth robot node.
7.4 The first calculation module of the kth robot node integrates s_j, o_j and the coordinates of the target point selected from the list L into a state triplet.
7.5 The policy network of the first calculation module of the kth robot node makes a decision according to the state triplet, outputs an action instruction a_j, and sends a_j to the motion module.
7.6 After the motion module of the kth robot node receives the action instruction a_j, it performs the action, moving toward the selected target point.
7.7 If the kth robot node arrives within a meters of the target point coordinates (the suggested value of a is 0.5-0.8 meters), the robot is considered to have reconnoitered the target; the target point coordinates are deleted from the list L, and then step 7.8 is executed; otherwise, let j=j+1 and execute step 7.3.
7.8 The kth robot node judges whether any target point coordinates remain in L; if L is not empty, go to step 7.2; otherwise, go to the eighth step.
And eighth step, ending.
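As an illustration of the target bookkeeping in steps 7.7-7.8 above, a minimal Python sketch follows; the planar coordinates, helper names and default threshold are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]

def reached(robot_xy: Point, target_xy: Point, a: float = 0.5) -> bool:
    """Step 7.7: the target counts as reconnoitered when the robot is within a meters of it."""
    return math.dist(robot_xy, target_xy) <= a

def update_target_list(robot_xy: Point, selected: Point, targets: List[Point], a: float = 0.5) -> bool:
    """Remove the selected target if reached; return True when the list L is empty (go to the eighth step)."""
    if reached(robot_xy, selected, a):
        targets.remove(selected)   # delete the target point coordinates from the list L
    return len(targets) == 0
```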
The invention can achieve the following beneficial technical effects: the robots can quickly and efficiently obtain collaborative strategies in new reconnaissance task scenes while keeping personalized behavior strategies among different robots, so that the reconnaissance time is short and reconnaissance tasks can be completed in the various scenes that may occur after a disaster or in other environments.
1. The third step of the invention pre-trains the strategy network model in the simulation environment and transfers knowledge from the pre-training, realizing rapid optimization of the current multi-robot collaborative reconnaissance strategy (i.e., the strategy networks of the K robots) in a real open scene. This effectively solves the problem that multi-robot strategy learning in a complex scene requires a large amount of time and data, so the reconnaissance task can be completed more efficiently.
2. In the invention, the global evaluation network and the local evaluation network jointly guide the parameter updating of the strategy network of a single robot node simulation model or a single robot, so that when a single robot in the multi-robot system faces a complex environment, the multiple robots can complete the task through coordination and cooperation while keeping personalized decisions. This helps the multiple robots in the real scene learn flexible behavior strategies and complete reconnaissance tasks in the various scenes that may occur after a disaster or in other environments.
3. In the fifth step of the invention, the global evaluation network, the local evaluation network, the global target evaluation network, the local target evaluation network and the target strategy network jointly assist the training of the strategy network, so that when facing a reconnaissance task scene that cannot be predicted in advance, the multi-robot system does not depend on manually designed search routes and obstacle-avoidance strategies for different task scenes, but autonomously optimizes the strategy network for cooperatively completing the task; the task can thus be executed according to these steps even without professional expertise.
4. Through the test based on the simulation environment, compared with the MADDPG method in the background technology, the invention realizes the effect of helping robots to learn effective cooperative strategies in a shorter time and obtains higher accumulated rewards.
Drawings
Fig. 1 is a logic structure diagram of a multi-robot autonomous collaborative reconnaissance system constructed in the first step of the present invention.
Fig. 2 is a general flow chart of the present invention.
Fig. 3 is a software module deployment schematic diagram of the multi-robot autonomous collaborative reconnaissance system constructed in the first step of the present invention.
FIG. 4 is a schematic diagram of experiments conducted in a simulation environment to test the effects of the present invention. In Fig. 4(a), the diagram on the left of the arrow shows the simulated reconnaissance scene (i.e., the known reconnaissance scene) faced by the multi-robot system in the third step, and the diagram on the right of the arrow shows a simulation schematic of the reconnaissance scene (i.e., the real scene mentioned in the invention) faced by the multi-robot system in the fifth and sixth steps. Each scene is set with 3 robot nodes (black dots in the figure) reconnoitering the positions of 3 target points on the ground (light dots in the figure); when the 3 robot nodes reach the vicinity of the target points, the target point information is considered reconnoitered. Fig. 4(b) shows the experimental effect test in the scene of reconnoitering continuously moving targets: the diagram on the left of the arrow shows the reconnaissance scene (i.e., the known reconnaissance scene) faced by the multi-robot system in the third step, and the diagram on the right shows a simulation schematic of the real reconnaissance scene faced by the multi-robot system in the fifth and sixth steps; 4 robot nodes (solid dots in the figure) chase 2 moving target points (hollow dots in the figure) to acquire information such as the target point positions in real time.
Fig. 5 is a graph of the results of the effect test experiments. Fig. 5(a) and Fig. 5(b) show the cumulative reward feedback values obtained after the multi-robot system repeats the experimental setups of Fig. 4(a) and Fig. 4(b) 200 times using the multi-robot collaborative reconnaissance method based on the double-layer evaluation mechanism; the reward is the task completion score set for the specific reconnaissance scene.
Detailed Description
Fig. 2 is a general flow chart of the present invention, which includes the following steps, as shown in fig. 2:
the first step, a multi-robot reconnaissance system is constructed, consisting of K robot nodes and one cloud server node, where K is a positive integer. All K robot nodes are connected with the cloud server node. A robot node is a robot capable of observing, moving and communicating that can run a software system; all robot nodes work in the same way, for example the Intel Aero drone node or the TurtleBot3 ground robot. Each robot node is installed with an Ubuntu operating system (version 16.04 or above) matching an X86 architecture processor or an Ubuntu Mate operating system (version 16.04 or above) matching an ARM architecture processor, the robot operating system ROS, and the deep learning framework PyTorch 1.7.0, and is also installed with a detection module, a first calculation module, a first storage module, a motion module and a first communication module.
The detection module is a sensor for collecting environmental data, such as an infrared camera, a depth camera or a scanning radar. The detection module is connected with the first storage module; every t seconds it photographs or scans the environmental information and the states of other robot nodes within its field of view, obtains the state information of the reconnaissance task, and stores the state information in the first storage module. The recommended value of t is between 0.3 seconds and 1 second.
The first storage module is connected with the detection module and the first calculation module, and has more than 1 GB of available space. An experience playback pool is arranged in the first storage module to store the latest N pieces of track experience information of the robot node (the recommended value range of N is 5000-10000). The nth (1 ≤ n ≤ N) piece of track experience information is expressed as [s_n, a_n, r_n, s_(n+1)], where s_n is the task state information observed by the detection module at the nth time point, including the environment information within the observable range and the speed and position information of the other K-1 robot nodes apart from this robot node; a_n denotes the action taken by robot node k (1 ≤ k ≤ K) between the nth and (n+1)th time points, such as applying a force or accelerating in a certain direction (the detection module automatically records a_n when robot node k executes the action); r_n is the feedback score for the current task completion observed by the detection module at the nth time point; s_(n+1) is the task state information observed by the detection module at the (n+1)th time point, which, besides the state of the current robot node k, also includes the speed and position information of the other K-1 robots and the position information of environmental obstacles. Logically, s_(n+1) can be understood as the new task state information obtained after the robot performs action a_n in state s_n.
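A minimal sketch of the experience playback pool described above, assuming a fixed capacity N and tuples [s_n, a_n, r_n, s_(n+1)] stored as NumPy arrays; the class and field names are illustrative.

```python
from collections import deque
from dataclasses import dataclass
import numpy as np

@dataclass
class Transition:
    s: np.ndarray       # task state observed at time point n
    a: np.ndarray       # action taken between time points n and n+1
    r: float            # feedback score for task completion at time point n
    s_next: np.ndarray  # task state observed at time point n+1

class ReplayPool:
    """Keeps only the latest N pieces of track experience information."""
    def __init__(self, capacity: int = 5000):
        self.pool = deque(maxlen=capacity)

    def add(self, s, a, r, s_next) -> None:
        self.pool.append(Transition(s, a, r, s_next))

    def sample(self, batch_size: int):
        idx = np.random.choice(len(self.pool), batch_size, replace=False)
        return [self.pool[i] for i in idx]
```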
The first calculation module is responsible for designing and optimizing the reconnaissance strategy and sending action instructions to the motion module according to the reconnaissance strategy. The first calculation module is connected with the first storage module, the motion module and the first communication module. The first calculation module consists of 6 fully connected neural networks created with the PyTorch deep learning framework: a strategy network and a target strategy network for forming action instructions, and a local evaluation network, a global evaluation network, a local target evaluation network and a global target evaluation network for optimizing the strategy (for the principle, refer to the paper "Actor-Critic Algorithms", published April 2001, downloadable at https://www.researchgate.net/publication/2354219_actor-critic_algorithms).
The strategy network is connected with the first storage module, the motion module, the first communication module, the target strategy network, the local evaluation network and the global evaluation network. It extracts h (1 ≤ h ≤ H) pieces of local track experience information (i.e., track experience information of this robot) from the experience playback pool of the first storage module, determines the next action according to the current task state information in the local track experience information, sends an instruction containing the next action to the motion module, and sends the network parameters of the strategy network (i.e., the weight matrices and bias vectors of each layer of the network) to the target strategy network. The strategy network obtains from the local evaluation network and the global evaluation network the evaluation values of the actions taken by the current strategy network, learns from these evaluation values how to obtain higher action evaluation values, and formulates better action instructions. The strategy network saves its parameters as a data-format file and, after the action is completed, sends the file storing the strategy network parameters to the first communication module.
The target policy network is connected with the policy network, and the network parameters of the target policy network are updated according to the network parameters of the policy network acquired from the policy network.
The local evaluation network is connected with the first storage module, the strategy network and the local target evaluation network, receives h pieces of local track experience information extracted by the strategy network from the first storage module, receives a loss function measurement value from the local target evaluation network, evaluates the action value of the robot according to the loss function measurement value, and simultaneously sends own network parameters to the local target evaluation network. To direct policy network updates, the local evaluation network sends an evaluation value of the action taken by the current policy network to the policy network.
The local target evaluation network is connected with the local evaluation network, and the network parameters of the local target evaluation network are updated according to the network parameters of the local evaluation network acquired from the local evaluation network.
The global evaluation network is connected with the first storage module, the strategy network and the global target evaluation network. It receives h pieces of global track experience information (i.e., joint track experience information of all robots) extracted from the first storage module at the same time steps as the local evaluation network's samples, receives the loss function metric value from the global target evaluation network, evaluates the joint action value of all robots according to the loss function metric value, and sends its own network parameters to the global target evaluation network. To guide strategy network updates, the global evaluation network sends the evaluation value of the joint action (the action taken by the current strategy network combined with the actions taken by the other robots) to the strategy network.
The global target evaluation network is connected with the global evaluation network, and the network parameters of the global target evaluation network are updated according to the network parameters of the global evaluation network acquired from the global evaluation network.
A "strategy" in this specification is specifically expressed as the parameters of the strategy network: the current task state information s is input into the strategy network, and through multiplication by the weight matrices between layer-by-layer neurons and addition of the bias vectors of each layer of neurons, an action instruction a is obtained at the output layer of the strategy network. At a macroscopic level, the strategy network enables a robot node to independently decide what action to perform next based on the currently observed reconnaissance scene state. Thus, the parameters of the strategy network also reflect the decision process of the robot node, i.e., the "strategy". Each robot node has independent strategy network parameters and autonomously determines its own behavior. The other 5 neural networks besides the strategy network assist in guiding the parameter update of the strategy network.
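The description above amounts to a standard feed-forward actor: the observed state passes through layer-wise weight multiplications and bias additions to yield an action command. A minimal PyTorch sketch follows, with illustrative layer sizes (the actual architecture is not given in the text).

```python
import torch
import torch.nn as nn

class PolicyNetwork(nn.Module):
    """Maps a locally observed reconnaissance-scene state o to an action command a."""
    def __init__(self, obs_dim: int = 16, act_dim: int = 2, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),   # weight matrix * o + bias, layer by layer
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # bounded action command
        )

    def forward(self, o: torch.Tensor) -> torch.Tensor:
        return self.net(o)

# Each robot node holds its own, independently parameterized policy.
policy = PolicyNetwork()
action = policy(torch.randn(1, 16))
```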
The motion module is composed of a motor, a tire and other driving equipment and a digital-to-analog converter. The motion module is connected with the first calculation module, receives an action instruction from a strategy network of the first calculation module, converts a digital signal into an analog signal by means of a built-in digital-to-analog converter, and transmits the analog signal to the driving equipment, so that the robot makes corresponding actions, and finally, a reconnaissance scene is changed.
The first communication module is connected with the cloud server node and the first computing module, and the first communication module (such as a wireless network card) receives the parameter file in the data format from the policy network of the first computing module and uploads the parameter file in the data format to the cloud server node through the SSH communication service.
The cloud server node refers to a cloud device such as a workstation or a server, provided with a second communication module (e.g., a wireless network card) and a second storage module (e.g., a hard disk with a capacity greater than 500 GB). The cloud server node is installed with an Ubuntu 16.04 operating system and a PyTorch deep learning framework of the same version as on the robot nodes. On the cloud server, the second communication module is connected with the second storage module and communicates with the K robot nodes through the SSH communication protocol.
Second, M scout scenes are built based on Gazebo simulation environment (required versions 9.10.0 and above) for pre-training six networks in the first computing module. The method comprises the following specific steps:
2.1 selecting a computer provided with a Ubuntu operating system (version should be consistent with that of the robot nodes), installing and operating a Gazebo simulation environment, simulating K robot nodes in the multi-robot system constructed in the first step, and establishing corresponding robot node simulation models for the K robot nodes.
2.2 referring to various environmental elements (such as barriers, target points, robots in the reconnaissance environment and the like) possibly occurring in the reconnaissance environment, performing equal-scale scaling modeling, and constructing an environment simulation model which approximates the real environment as much as possible.
2.3 randomly selecting a robot node simulation model and various environment elements, and randomly initializing the initial positions and the numbers of the robot node simulation model and the various environment elements, so as to form M reconnaissance scenes for simulating various possible real scenes. Wherein M is a positive integer, M is not less than 10, and the larger M is better under the condition of resource permission.
2.4 An evaluation index of the reconnaissance task completion degree is designed to evaluate the task completion effect of the current multi-robot joint strategy in the simulation environment. For example: if the robot node simulation model misses a target point during reconnaissance, 1 point is deducted; if it collides with an obstacle model while moving, 10 points are deducted; if it collides with another robot node simulation model, 5 points are deducted. Thus, if at a certain moment 5 target points have been missed since the start of movement, and the robot node simulation model has collided with obstacle models 1 time and with other robot node simulation models 2 times during reconnaissance, the score obtained at that time point is -(5×1+1×10+2×5) = -25 points. The task completion evaluation index is flexibly formulated by the user according to the situation on site. In principle, points are awarded for actions favorable to completing the task goal, while points are deducted for actions unfavorable to completing the task.
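A sketch of the completion-degree score from step 2.4 under the example weights given above (missed target -1, obstacle collision -10, robot collision -5); users would substitute their own weights.

```python
def completion_score(missed_targets: int, obstacle_collisions: int, robot_collisions: int) -> int:
    """Negative score accumulating penalties, per the example index in step 2.4."""
    return -(missed_targets * 1 + obstacle_collisions * 10 + robot_collisions * 5)

# Example from the text: 5 missed targets, 1 obstacle collision, 2 robot collisions -> -25
assert completion_score(5, 1, 2) == -25
```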
The third step: the K robot node simulation models are pre-trained in the M reconnaissance scenes constructed in the second step, so that the robot node simulation models acquire collaborative reconnaissance strategies in different reconnaissance scenes, yielding K data-format parameter files that record the collaborative strategies of the K robot node simulation models in each reconnaissance simulation scene. A single robot node simulation model trains its strategy network with the DDPG reinforcement learning algorithm ("Continuous control with deep reinforcement learning", arXiv preprint, https://arxiv.53yu.com/abs/1509.02971, September 2015); of course, pre-training is not limited to this algorithm, and other algorithms such as PPO ("Proximal policy optimization algorithms", arXiv preprint, https://arxiv.53yu.com/abs/1707.06347, July 2017) can also be adopted. The K robot node simulation models are mutually independent and train their strategy networks in parallel. The method for the kth (1 ≤ k ≤ K) robot node simulation model to train the strategy network with the DDPG reinforcement learning algorithm is as follows:
3.1 The first calculation module of robot node simulation model k is initialized. All six neural networks in the first calculation module need their parameters initialized; the parameters comprise the weight matrices and bias vectors between the layers of each neural network, whose values are randomly generated from a normal distribution with mean 0 and variance 2. The parameters selected by the K robot node simulation models can be different from each other or the same.
3.2 initialisation variable m=0.
3.3 selecting an mth simulation scout scene from the M simulation scout scenes in the Gazebo simulation environment.
3.4 randomly initializing the initial position of the kth robot node simulation model and the initial points of various environment elements in the mth simulation reconnaissance scene. Initializing training round number i=0, and setting maximum training round number I (I is a positive integer, and the value range is recommended to be 400-1000).
3.5 initializing the action step number j=0 of the kth robot node simulation model in the ith training round number, setting the maximum step number J (J is a positive integer and the recommended value range is 30-50) of the kth robot node simulation model in the ith training round number, and if the kth robot node simulation model can finish tasks within the J steps, setting the actual action step number of the kth robot node simulation model in the ith training round number to be smaller than J.
3.6 The policy network of the kth robot node simulation model obtains the local state information o_j of the jth step from the first storage module (e.g., position information of the kth robot node simulation model), inputs o_j into the policy network and, through multiplication by the weight matrices between the layers of neurons and addition of the bias vectors of each layer of neurons, outputs the action instruction a_j of the jth step, and sends a_j to the first storage module and the motion module.
3.7 The first storage module stores a_j, and at the same time the motion module performs action a_j.
3.8 The kth robot node simulation model detects the actions executed by the other K-1 robot nodes in the jth step and, combined with the action a_j it executed itself, obtains the joint action executed by the K robot nodes in the jth step, denoted A_j, and stores it in the first storage module.
3.9 After the kth robot node simulation model executes action a_j, the task completion degree score r_j is obtained according to the evaluation index formulated in step 2.4, and r_j is fed back to the first storage module in the kth robot node simulation model.
3.10 After the kth robot node simulation model executes action a_j, the state of the reconnaissance environment changes; the detection module of the kth robot node simulation model observes the (j+1)th-step global state information s_(j+1) and local state information o_(j+1), and stores s_(j+1) and o_(j+1) in the first storage module.
3.11 The first storage module of the kth robot node simulation model combines s_j, s_(j+1), A_j, r_j and o_j, o_(j+1), a_j, r_j respectively to obtain the jth piece of global track experience information [s_j, A_j, r_j, s_(j+1)] and the jth piece of local track experience information [o_j, a_j, r_j, o_(j+1)]. The global and local track experience information of the same step are called a track experience information pair. The track experience information pair [s_j, A_j, r_j, s_(j+1)] and [o_j, a_j, r_j, o_(j+1)] is sent to the experience playback pool of the first storage module, which stores the two elements of the pair separately.
3.12 The first calculation module of the kth robot node simulation model checks the storage amount of the experience playback pool; if H track experience information pairs have already been stored, H track experience information pairs are randomly extracted from the experience playback pool, the H pieces of global track experience information are sent to the global evaluation network and the H pieces of local track experience information to the local evaluation network, and step 3.13 is executed; otherwise, let j=j+1 and go to step 3.6. The suggested value of h is 100-200.
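A minimal sketch of the paired experience pool of steps 3.11-3.12: the global and local track experience of the same step are kept at the same index, so a sampled index always yields a matched pair. Names are illustrative.

```python
import random

class PairedReplayPool:
    """Stores [s_j, A_j, r_j, s_(j+1)] and [o_j, a_j, r_j, o_(j+1)] of the same step as a pair."""
    def __init__(self):
        self.global_tracks = []   # joint experience of all K robots
        self.local_tracks = []    # experience of this robot only

    def add_pair(self, global_track, local_track) -> None:
        self.global_tracks.append(global_track)
        self.local_tracks.append(local_track)

    def sample_pairs(self, h: int):
        idx = random.sample(range(len(self.global_tracks)), h)
        return ([self.global_tracks[i] for i in idx],
                [self.local_tracks[i] for i in idx])
```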
3.13 The first calculation module of the kth robot node simulation model numbers the extracted H track experience information pairs from 1 to H in order, and initializes the marker sequence number q=0.
3.14 Starting from the marker sequence number q, the first calculation module of the kth robot node simulation model selects the next h track experience information pairs in order, where 1 ≤ h ≤ H, and lets q=q+h, i.e., updates q to the marker sequence number of the last of the h selected track experience information pairs.
3.15 The global evaluation network of the kth robot node simulation model minimizes the global loss function L_global by the gradient descent method according to the h selected pieces of global track experience information; meanwhile, the local evaluation network of the kth robot node simulation model minimizes the local loss function L_local by the gradient descent method according to the h selected pieces of local track experience information, thereby updating the parameters of the global and local evaluation networks:
where Q_global represents the global evaluation network, Q_local the local evaluation network, Q'_global the global target evaluation network, and Q'_local the local target evaluation network. a_j represents the action selected by the strategy network for the kth robot node simulation model in the jth step; A_j represents the joint action executed by the K robot node simulation models in the jth step. μ' represents the target strategy network, and the symbols in brackets immediately following these letters represent the input information of the network: μ'(o_(j+1)) represents the action instruction output after the local state information o_(j+1), observed by the kth robot node simulation model in the (j+1)th step, is input into the target strategy network; μ'(s_(j+1)) represents the joint action predicted by the K robots in the (j+1)th step based on the global state information s_(j+1). Q_local(o_j, a_j) represents the estimated score of the action of the kth robot node simulation model obtained by inputting the local state information o_j and the action a_j into the local evaluation network. The term (r_j + αQ'_global(s_(j+1), μ') - Q_global(s_j, A_j)) expresses that the closer the estimated score Q_global(s_j, A_j) given by the global evaluation network for the joint action is to the global target score r_j + αQ'_global(s_(j+1), μ'), the better; likewise, (r_j + αQ'_local(o_(j+1), μ'(o_(j+1))) - Q_local(o_j, a_j)) expresses that the closer the estimated score Q_local(o_j, a_j) given by the local evaluation network for the action of the kth robot node simulation model is to the local target score r_j + αQ'_local(o_(j+1), μ'(o_(j+1))), the better.
α represents a discount factor and is a constant between 0 and 1; the preferred range of α is 0.2 to 0.3.
Part of formula (2) comes from the MDP (Markov decision process): the local evaluation network of the kth robot node simulation model evaluates the score of the kth robot node simulation model's action at step j. The strategy network of the kth robot node simulation model predicts the (j+1)th-step action from the (j+1)th-step state information; the local target evaluation network of the kth robot node simulation model evaluates the (j+1)th-step state and the predicted action to obtain the estimated score of the (j+1)th-step action. This estimated score is multiplied by the discount factor α and added to the reward r_j obtained from the environment after the jth-step action, giving the target score of the jth-step action. The difference obtained by subtracting the jth-step estimated score from the target score is taken as the gradient optimization target of the local evaluation network; it is used to update the parameters of the local evaluation network and, in turn, the parameters of the local target evaluation network of the kth robot node simulation model. Minimizing the objective function L_local effectively helps the local evaluation network make a better estimate of the jth-step action value.
By the same principle as formula (1), the global evaluation network of the kth robot node simulation model estimates the score of the joint action of the K robot node simulation models in the jth step. Meanwhile, the (j+1)th-step global state and joint action are evaluated to obtain the estimated score of the (j+1)th-step joint action; this estimated score is multiplied by the discount factor α and added to the reward r_j obtained from the environment after the jth-step joint action to obtain the target score of the jth-step joint action. The difference obtained by subtracting the estimated score of the jth-step joint action from the target score is taken as the gradient optimization target of the global evaluation network and is used to update the parameters of the global evaluation network; minimizing the objective function L_global effectively helps the global evaluation network make a better estimate of the joint action value.
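Based on the verbal description of formulas (1) and (2) above, a minimal PyTorch sketch of the two critic updates follows; the tensor shapes, the critic call signature Q(state, action), and the batch field names are assumptions, and the published formulas may differ in detail.

```python
import torch
import torch.nn.functional as F

def critic_losses(batch, nets, alpha: float = 0.25):
    """TD losses for the global and local evaluation networks (cf. formulas (1) and (2))."""
    s, A, r, s_next = batch["s"], batch["A"], batch["r"], batch["s_next"]   # global track experience
    o, a, o_next = batch["o"], batch["a"], batch["o_next"]                  # local track experience

    with torch.no_grad():
        a_next = nets["target_policy"](o_next)          # mu'(o_j+1): this robot's predicted next action
        A_next = batch["A_next_pred"]                   # joint action predicted for step j+1
        y_global = r + alpha * nets["target_global_critic"](s_next, A_next)  # global target score
        y_local = r + alpha * nets["target_local_critic"](o_next, a_next)    # local target score

    L_global = F.mse_loss(nets["global_critic"](s, A), y_global)   # formula (1)
    L_local = F.mse_loss(nets["local_critic"](o, a), y_local)      # formula (2)
    return L_global, L_local
```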
3.16 the global evaluation network and the local evaluation network adopt a double-layer evaluation mechanism to guide policy updating together, and the kth robot node simulation model policy network carries out parameter updating according to a formula (3):
where a_j represents the action selected by the strategy network μ for the kth robot node simulation model in the jth step; Q_global(s_j, A_j) represents the estimated score of the joint action obtained by inputting the global state information s_j and the joint action A_j into the global evaluation network; Q_local(o_j, a_j) represents the estimated score of the action of the kth robot node simulation model obtained by inputting the local state information o_j and the action a_j into the local evaluation network.
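A sketch of the double-layer policy update of step 3.16: the actor is pushed to increase both the global and the local evaluation scores. How the two scores are weighted in formula (3) is not spelled out in the text, so an unweighted sum is assumed here; the batch fields are illustrative.

```python
import torch

def policy_update(batch, nets, optimizer: torch.optim.Optimizer) -> None:
    """One gradient step increasing Q_global(s_j, A_j) + Q_local(o_j, a_j) (cf. formula (3))."""
    o, s = batch["o"], batch["s"]
    a = nets["policy"](o)                         # action the current policy would take
    others = batch["others_actions"]              # recorded actions of the other K-1 robots at this step
    A = torch.cat([a, others], dim=-1)            # joint action built around the fresh action
    actor_loss = -(nets["global_critic"](s, A) + nets["local_critic"](o, a)).mean()
    optimizer.zero_grad()
    actor_loss.backward()
    optimizer.step()
```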
3.17 If q+h > H, execute step 3.18; otherwise, go to step 3.14.
3.18 the global target evaluation network of the kth robot node simulation model updates own network parameters according to the parameters of the global evaluation network, as shown in a formula (4); the local target evaluation network updates own network parameters according to the parameters of the local evaluation network, as shown in a formula (5); and the target policy network updates its own network parameters according to the parameters of the policy network, as shown in formula (6):
wherein ,parameters of the global evaluation network, the local evaluation network and the strategy network in the j-th step are respectively represented; />And (5) representing parameters of the global target evaluation network, the local target evaluation network and the target policy network in the j-th step. τ 1 、τ 2 Is a weight constant between 0 and 1, and the recommended value range is 0.2 to 0.3. The assignment of three target network parameters is realized by the formulas (4), (5) and (6), and the training pace is slowed down through incomplete updating, so that over training is avoided.
3.19 Let j=j+1; if j reaches the set maximum number of steps per round J, execute step 3.20; otherwise, execute step 3.6.
3.20 Let i=i+1; if i reaches the set maximum number of training rounds I, execute step 3.21; otherwise, execute step 3.4.
3.21 Let m=m+1; if m is equal to the maximum number of reconnaissance scenes M, execute step 3.23; otherwise, execute step 3.22.
3.22 The first calculation module of the kth robot node simulation model stores the trained strategy network model and names it with the serial number m of the simulated reconnaissance scene. Meanwhile, the first calculation module of the kth robot node simulation model empties all network parameters of the six neural networks (the strategy network, the target strategy network, the local evaluation network, the global evaluation network, the local target evaluation network and the global target evaluation network) and reassigns them random initial values. The first storage module of the kth robot node simulation model empties the experience pool, ready for training on the next scene task. Go to step 3.3.
3.23 The first calculation module of the kth robot node simulation model saves the parameters of the strategy network (i.e., the collaborative strategy of the kth robot node simulation model in the reconnaissance simulation scene) as a parameter file in the data format generated by the PyTorch deep learning framework. Next, the fourth step is performed.
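A sketch of step 3.23, assuming the data-format parameter file is an ordinary PyTorch state-dict file; the file naming by scene number follows step 3.22, and the name scheme itself is illustrative.

```python
import torch

def save_policy(policy_net, scene_id: int, robot_id: int) -> str:
    """Persist the trained collaborative policy of one robot node simulation model for one scene."""
    path = f"policy_robot{robot_id}_scene{scene_id}.pt"   # file name scheme is illustrative
    torch.save(policy_net.state_dict(), path)
    return path
```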
The third step is executed by the K robot node simulation models in parallel, so that K data-format parameter files for each simulation scene are obtained; they record the collaborative strategies of the K robot node simulation models in the reconnaissance simulation scenes and are used for further adjusting the robot node simulation model strategies.
The fourth step: the K robot node simulation models respectively upload their data-format parameter files to the cloud server node through the SSH service, i.e., the personalized collaborative strategies obtained through the training in the third step are stored in the cloud server node, so that the strategy network models obtained in the pre-training stage are saved and shared. The K robot node simulation models execute this step in parallel; the kth robot node simulation model is taken as an example for explanation.
4.1 the first calculation module of the kth robot node simulation model sends the parameter file in the data format to the first communication module of the kth robot node simulation model.
4.2 the first communication module of the kth robot node simulation model sends the parameter file in the data format to the second communication module (cloud server node) through SSH communication service.
And 4.3, the second communication module stores the received parameter file in the data format in a second storage module of the cloud server node.
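A sketch of the upload in steps 4.1-4.3 using the Paramiko SSH library; the invention only states that an SSH service is used, so Paramiko, the host name, the user and the remote directory are assumptions for illustration.

```python
import paramiko

def upload_policy(local_path: str, host: str = "cloud-server", user: str = "robot",
                  remote_dir: str = "/data/policies") -> None:
    """Send the data-format parameter file to the cloud server node over SSH/SFTP."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username=user)
    sftp = ssh.open_sftp()
    try:
        sftp.put(local_path, f"{remote_dir}/{local_path}")   # second storage module keeps the file
    finally:
        sftp.close()
        ssh.close()
```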
The fifth step: deploy the multi-robot system constructed in the first step to the place where the reconnaissance task needs to be carried out, i.e., a real complex scene such as a disaster area or a war zone where the reconnaissance task must be completed, and use the strategy model trained in the simulation scenes, i.e., the data-format parameter file, to help the multi-robot system train the strategy network for the reconnaissance task in the real scene, thereby improving the task completion effect, in the real scene, of the reconnaissance strategy trained in the simulated reconnaissance scenes. Specifically, the K robot nodes execute the following steps in parallel; the kth robot node is taken as an example:
5.1 the first communication module of the kth robot node sends a downloading request to the second communication module of the cloud server node, and requests to download the policy model.
5.2 The second communication module reads, from the second storage module of the cloud server node, the saved data-format policy network model parameter file of the simulation scene most similar to the unknown scene faced by the kth robot node, and transmits the data-format parameter file to the first communication module of the kth robot node through the SSH service protocol.
And 5.3, the first communication module of the kth robot node transmits the parameter file in the data format to the first calculation module of the kth robot node.
5.4 The kth robot node loads the data-format parameter file into the policy network of the kth robot node, loading it directly with the PyTorch deep learning framework.
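A sketch of step 5.4, assuming the downloaded parameter file is a PyTorch state dict saved during the pre-training stage; the file name is illustrative.

```python
import torch

def load_pretrained_policy(policy_net, path: str = "policy_robot1_scene3.pt"):
    """Load the pre-trained collaborative policy into this robot node's policy network."""
    state_dict = torch.load(path, map_location="cpu")
    policy_net.load_state_dict(state_dict)
    return policy_net
```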
5.5 The parameters of the target strategy network, the local evaluation network and the global evaluation network of the kth robot node are initialized respectively. The weight matrices and bias vectors of each neural network are randomly generated from a normal distribution with mean 0 and variance 2; the model initialization parameters of the K robot nodes can be the same or different.
5.6 emptying the experience playback pool of the first storage module of the kth robot node.
5.7 initializing the k-th robot node to perform action step number j=0, wherein the maximum executable step number is J (in order to ensure that the robot node can acquire an ideal strategy within the step J, the recommended value range of J is 1000-2000).
5.8 The policy network in the first calculation module of the kth robot node obtains the local state information o_j of the jth step of the reconnaissance scene from the first storage module, inputs o_j into the policy network to obtain an action instruction a_j, stores a_j in the first storage module, and sends a_j to the motion module;
5.9 The motion module performs action a_j according to the instruction.
5.10 After the kth robot node executes action a_j, the completion degree score r_j fed back by the task scene is obtained according to the reconnaissance task completion degree evaluation index designed in step 2.4, and r_j is stored in the first storage module of robot node k.
5.11 After the kth robot node performs action a_j, the global environment and the local state of the reconnaissance environment change; after the detection module of robot node k observes the (j+1)th-step global state information s_(j+1) and local state information o_(j+1), s_(j+1) and o_(j+1) are stored in the first storage module.
5.12 The kth robot node integrates the information s_j, A_j, r_j, s_(j+1) and o_j, a_j, r_j, o_(j+1) respectively to obtain the jth piece of global track experience information [s_j, A_j, r_j, s_(j+1)] and the jth piece of local track experience information [o_j, a_j, r_j, o_(j+1)], and stores [s_j, A_j, r_j, s_(j+1)] and [o_j, a_j, r_j, o_(j+1)] in the experience track playback pool of the first storage module.
5.13 The first calculation module of the kth robot node checks the data in the experience playback pool of the first storage module; if 2H pieces of track experience information have already been stored, the corresponding H pieces of global track experience information and H pieces of local track experience information are randomly extracted from the playback pool to optimize the parameters of the six neural networks in the first calculation module (the suggested value of H here is 50-100, which need not equal the value used in the pre-training stage), and then step 5.14 is executed; otherwise, let j=j+1 and go to step 5.8.
5.14 The global evaluation network and local evaluation network of the kth robot node respectively read the H pieces of global track experience information and the H pieces of local track experience information, and minimize the global loss function L_global in formula (1) and the local loss function L_local in formula (2) by the gradient descent method, thereby updating the parameters of the global evaluation network and the local evaluation network respectively.
And 5.15, the strategy network of the kth robot node reads H pieces of global track experience information and H pieces of local track experience information, and updates parameters of the strategy network according to a strategy gradient update formula in the formula (3) by a gradient descent method, so that optimization of the strategy network is realized.
And 5.16. The global target evaluation network, the local target evaluation network and the target policy network of the kth robot node update parameters of the global target evaluation network, the local target evaluation network and the target policy network according to the update formulas of formulas (4), (5) and (6).
5.17 Let j=j+1; if j reaches the maximum step number J, the collaborative reconnaissance strategy has been optimized in the real scene, and the sixth step is executed; otherwise, step 5.8 is executed.
After the fifth step is executed, according to the scout strategy obtained by training and optimizing the current real scene, the robot node can autonomously decide the action to be taken in the next step according to the observed global state and the local state information. The policy network parameters of the K robot nodes together form a sampling policy of the multi-robot autonomous collaborative reconnaissance real scene.
And sixthly, deploying K robot nodes into the real scene (namely the scene needing to perform the reconnaissance task) in the fifth step.
And seventhly, the multi-robot reconnaissance system autonomously and cooperatively completes the multi-target reconnaissance task in an open environment according to the sampling strategy of the multi-robot autonomous and cooperatively reconnaissance real scene obtained in the fifth step. The K robot nodes finish the multi-target reconnaissance task in parallel, wherein the method for finishing the multi-target reconnaissance task by the kth robot node comprises the following steps:
7.1, setting a plurality of target points to be detected by the multi-robot detection system according to the detection task requirement, storing coordinates of all the target points in a list L, and sending the L to a first communication module of a kth robot node. The first communication module forwards the list L to the kth robot node first calculation module. The robot node selects a target point from the list L with reference to environmental information such as the position of the target point.
7.2 initializing the execution step number j=0 of the kth robot node.
7.3 The detection module of the kth robot node obtains the jth-step global state information s_j (e.g., information about other surrounding robot nodes and various environmental elements) and the jth-step local state information o_j (e.g., position information of the kth robot node), and transmits s_j and o_j to the first calculation module of the kth robot node.
7.4 The first calculation module of the kth robot node integrates s_j, o_j and the coordinates of the target point selected from the list L into a state triplet.
7.5 The policy network of the first calculation module of the kth robot node makes a decision according to the state triplet, outputs an action instruction a_j, and sends a_j to the motion module.
7.6 After the motion module of the kth robot node receives the action instruction a_j, it performs the action, moving toward the selected target point.
7.7 If the kth robot node arrives within a meters of the target point coordinates (the suggested value of a is 0.5-0.8 meters), the robot is considered to have reconnoitered the target; the target point coordinates are deleted from the list L, and then step 7.8 is executed; otherwise, let j=j+1 and execute step 7.3.
7.8 The kth robot node judges whether any target point coordinates remain in L; if L is not empty, go to step 7.2; otherwise, go to the eighth step.
And eighth step, ending.
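A compact sketch of one decision step in the deployed loop above (steps 7.3-7.6): the state triplet of global state, local state and the selected target coordinates is fed to the policy network. The way the triplet is flattened into a single tensor is an assumption for illustration.

```python
import torch

def decide_action(policy_net, s_j, o_j, target_xy):
    """Build the state triplet and let the policy network output the action instruction a_j."""
    triplet = torch.cat([
        torch.as_tensor(s_j, dtype=torch.float32),        # global state: other robots, environment elements
        torch.as_tensor(o_j, dtype=torch.float32),        # local state: this robot's own position, etc.
        torch.as_tensor(target_xy, dtype=torch.float32),  # coordinates of the target point chosen from L
    ])
    with torch.no_grad():
        a_j = policy_net(triplet.unsqueeze(0))            # action instruction sent to the motion module
    return a_j.squeeze(0)
```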
The experimental effect of the invention is illustrated by experiments in the following simulation environment:
The technology of the invention was effect-tested in simulation experiments. The environment has a real physics engine with objective physical factors such as friction and inertia. The task goal is for the multi-robot system, helped by the detection modules, to autonomously and cooperatively reconnoiter multiple target points, which may be fixed or moving. The computer used for the experiments is equipped with a 64-bit Ubuntu operating system, an AMD graphics processor and an eight-core Intel Core CPU (clock frequency 3.6 GHz), with 16 GB of memory.
Fig. 4 is a schematic diagram of the experiments used to test the effect of the invention in the simulation test experiments. In Fig. 4(a), the left diagram shows a reconnaissance scene that the robot system may face, simulated in the simulation environment (i.e., a known reconnaissance scene): three robot nodes (black dots in the figure) reconnoiter the positions of three target points on the ground (light dots in the figure) based on the detection information; when the three robot nodes reach the vicinity of the target points, the target point information is considered reconnoitered, while collisions with other robot nodes must also be avoided. The right diagram of Fig. 4(a) shows a simulation schematic of a post-disaster or post-war reconnaissance scene (i.e., the real scene mentioned in the invention), which has a friction coefficient different from that of the simulation scene; the three robot nodes still need to reconnoiter the information of three target points. Fig. 4(b) is the schematic of simulation test experiment two for the moving-target reconnaissance scene: the left diagram shows a possible reconnaissance scene simulated in the simulation environment (i.e., a known reconnaissance scene), and the right diagram shows a simulation schematic of a reconnaissance scene that cannot be predicted in advance (i.e., the open scene referred to in the invention). Both are set up with four robot nodes (black dots in the figure) chasing two moving target points (hollow dots in the figure), and the task goal of the multiple robots is to acquire information such as the target point positions in real time. Environmental factors such as the friction coefficient differ between these two scenes.
Fig. 5 is a graph of the experimental results of the effect test experiments. The abscissa is the number of reconnaissance rounds in the scene, and the ordinate is the cumulative reward value, i.e., the task completion degree score measured by the completion degree evaluation index after the current round ends. When a task is executed, the closer the robot is to the corresponding target point, the higher the task completion degree score; conversely, if the robot collides with another robot or an obstacle, points are deducted as a penalty. Therefore, the higher the task completion degree score, the more intelligent the behavior of the multiple robots in the current round, the higher the degree of coordination, and the better the learning effect. In each round the robot acts at most 25 steps; the training period is 5000 rounds; and each time, a robot node extracts 64 pieces of track experience information from the experience playback pool for training. The experiment has two groups of data, one of which is the control group. The control group applies the MADDPG multi-robot reinforcement learning technique described in the background art, with a pre-trained model directly used for optimization in the real scene, forming a collaborative reconnaissance strategy while training; the other group applies the multi-robot collaborative reconnaissance method based on the double-layer evaluation mechanism proposed by the invention, so as to verify the effectiveness of the invention. Each group of data is obtained by the robot nodes repeating the run 200 times, and the average reward of each round is calculated to form a polyline in the graph.
Fig. 5(a) shows the completion of the multi-robot reconnaissance task on fixed target points, and Fig. 5(b) measures the learning of the multi-robot cooperative strategy in the real-time reconnaissance scene with moving targets. The experiment aims to verify that the multi-robot collaborative reconnaissance method of the invention, with the help of the pre-trained strategy network, can help the multiple robots learn an effective cooperative strategy faster and obtain a higher cumulative reward. As can be seen from Fig. 5(a), as the number of training rounds increases, the task completion scores of the different techniques all fluctuate, but all show an increasing trend. The data curve of the invention is clearly much higher than that of the control group, both in convergence speed (the rate of increase of the task completion score) and in learning score. The invention can quickly learn an effective cooperative strategy at the start of task execution, and its task completion degree is far higher than that of the control group method throughout the training process, indicating that the cooperative strategy learned with the invention performs better when completing the reconnaissance task.
The multi-robot collaborative reconnaissance technology for complex scenes provided by the invention is described in detail above. The principles and embodiments of the present invention are described herein to assist in understanding the core concepts of the invention. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the invention can be made without departing from the principles of the invention and such modifications and adaptations are intended to be within the scope of the invention as defined in the following claims.

Claims (10)

1. A complex scene-oriented multi-robot collaborative reconnaissance method is characterized by comprising the following steps of:
firstly, constructing a multi-robot reconnaissance system, wherein the multi-robot reconnaissance system consists of K robot nodes and a cloud server node, and K is a positive integer; the K robot nodes are all connected with the cloud server node; the robot nodes are robots capable of observing, moving and communicating that can run a software system, and their working modes are the same; each robot node is provided with an Ubuntu operating system matching an X86 architecture processor or an Ubuntu Mate operating system matching an ARM architecture processor, the robot operating system ROS and the deep learning framework PyTorch 1.7.0, and is also provided with a detection module, a first calculation module, a first storage module, a motion module and a first communication module;
the detection module is a sensor for collecting environmental data; it is connected with the first storage module, photographs or scans the environmental information and the states of the other robot nodes within its field of view every t seconds, obtains the state information of the reconnaissance task, and stores the state information in the first storage module;
the first storage module is connected with the detection module and the first calculation module, and an experience replay pool is arranged in the first storage module for storing the latest N pieces of trajectory experience information of the robot node; the nth piece of trajectory experience information is expressed as [s_n, a_n, r_n, s_{n+1}], 1≤n≤N, where s_n denotes the task state information observed by the detection module at the nth time point, including the environment information within the observable range and the speed and position information of the other K−1 robot nodes besides this robot node; a_n denotes the action taken by robot node k between the nth and (n+1)th time points, wherein 1≤k≤K; r_n is the feedback score of the environment on the current task completion, as observed by the detection module at the nth time point; s_{n+1} is the task state information observed by the detection module at the (n+1)th time point, i.e. the new task state information obtained after the robot performs action a_n in state s_n;
the first computing module is responsible for designing and optimizing the reconnaissance strategy and sending action instructions to the motion module according to that strategy; the first computing module is connected with the first storage module, the motion module and the first communication module; the first calculation module consists of 6 fully-connected neural networks created with the PyTorch deep learning framework: a strategy network and a target strategy network for generating action instructions, and a local evaluation network, a global evaluation network, a local target evaluation network and a global target evaluation network for optimizing the strategy;
The strategy network is connected with the first storage module, the motion module, the first communication module, the target strategy network, the local evaluation network and the global evaluation network; it extracts h pieces of local trajectory experience information from the experience replay pool of the first storage module, determines the next action according to the current task state information in the local trajectory experience information, sends an instruction containing the next action to the motion module, and sends the network parameters of the strategy network, namely the weight matrices and bias vectors of each layer, to the target strategy network; the strategy network obtains from the local evaluation network and the global evaluation network the evaluation values of the action taken by the current strategy network, learns to obtain higher action evaluation values according to these evaluation values, and formulates action instructions; the strategy network saves the strategy network parameters as a data format file and, after the action is completed, sends the file storing the strategy network parameters to the first communication module;
the target strategy network is connected with the strategy network, and the network parameters of the target strategy network are updated according to the network parameters of the strategy network acquired from the strategy network;
the local evaluation network is connected with the first storage module, the strategy network and the local target evaluation network, receives h pieces of local track experience information extracted by the strategy network from the first storage module, receives a loss function metric value from the local target evaluation network, evaluates the action value of the robot according to the loss function metric value, and simultaneously sends own network parameters to the local target evaluation network; the local evaluation network also transmits the evaluation value of the action taken by the current policy network to the policy network;
The local target evaluation network is connected with the local evaluation network, and the network parameters of the local target evaluation network are updated according to the network parameters of the local evaluation network acquired from the local evaluation network;
the global evaluation network is connected with the first storage module, the strategy network and the global target evaluation network; it receives from the first storage module h pieces of global trajectory experience information paired with the h pieces of local trajectory experience information extracted by the local evaluation network, receives a loss function metric value from the global target evaluation network, evaluates the joint action value of all robots according to the loss function metric value, and sends its own network parameters to the global target evaluation network; the global evaluation network also sends the evaluation value of the joint action to the strategy network, where the joint action refers to the action taken by the current strategy network together with the actions taken by the other robots;
the global target evaluation network is connected with the global evaluation network, and the network parameters of the global target evaluation network are updated according to the network parameters of the global evaluation network acquired from the global evaluation network;
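For readability only, the following sketch shows how the six fully-connected networks of the first computing module could be instantiated with PyTorch; the hidden-layer sizes, the state and action dimensions and the deep-copy initialisation of the target networks are assumptions, not values fixed by the claim.

    import copy
    import torch.nn as nn

    def mlp(in_dim, out_dim, hidden=64):
        # small fully-connected network (assumed architecture)
        return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                             nn.Linear(hidden, hidden), nn.ReLU(),
                             nn.Linear(hidden, out_dim))

    obs_dim, act_dim = 16, 2             # assumed local observation / action sizes
    global_state_dim, K = 48, 3          # assumed global state size and robot count

    policy               = mlp(obs_dim, act_dim)                    # strategy network
    target_policy        = copy.deepcopy(policy)                    # target strategy network
    local_critic         = mlp(obs_dim + act_dim, 1)                # local evaluation network
    target_local_critic  = copy.deepcopy(local_critic)              # local target evaluation network
    global_critic        = mlp(global_state_dim + K * act_dim, 1)   # global evaluation network
    target_global_critic = copy.deepcopy(global_critic)             # global target evaluation network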
the motion module consists of a driving device and a digital-to-analog converter; the motion module is connected with the first calculation module, receives an action instruction from a strategy network of the first calculation module, converts a digital signal into an analog signal by means of a built-in digital-to-analog converter, and transmits the analog signal to the driving equipment, so that the robot makes corresponding actions, and finally, a reconnaissance scene is changed;
The first communication module is connected with the cloud server node and the first computing module, and receives the data format file from the policy network of the first computing module and uploads the data format file to the cloud server node through SSH communication service;
the cloud server node refers to cloud equipment and is provided with a second communication module and a second storage module; the cloud server node is installed with an Ubuntu 16.04 operating system and a PyTorch deep learning framework of the same version as the robot nodes; on the cloud server, the second communication module is connected with the second storage module and communicates with the K robot nodes through the SSH communication protocol;
secondly, constructing M reconnaissance scenes based on a Gazebo simulation environment, and pre-training six networks in a first computing module; the method comprises the following steps:
2.1 selecting a computer installed with the same version of the Ubuntu or UbuntuMate operating system as the robot nodes, installing and running the Gazebo simulation environment, simulating the K robot nodes of the multi-robot system constructed in the first step, and establishing a corresponding robot node simulation model for each of the K robot nodes;
2.2, referring to various environment elements possibly occurring in the reconnaissance environment, carrying out equal-proportion scaling modeling, and constructing an environment simulation model which approximates to the real environment as much as possible;
2.3 randomly selecting a robot node simulation model and various environment elements, and randomly initializing the initial positions and the numbers of the robot node simulation model and the various environment elements to form M reconnaissance scenes for simulating various possible real scenes; wherein M is a positive integer;
2.4, designing an evaluation index of the reconnaissance task completion degree for evaluating the task completion effect of the current multi-robot joint strategy in the simulation environment: if the robot node simulation model misses a target point during reconnaissance, 1 point is deducted; if the robot node simulation model collides with an obstacle model during motion, 10 points are deducted; if it collides with another robot node simulation model, 5 points are deducted;
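A minimal sketch of the completion-degree evaluation index of step 2.4, assuming (beyond the penalties stated above) a distance-based bonus so that a robot closer to its target point scores higher; the bonus term is an assumption.

    def completion_score(dist_to_target, missed_targets, hit_obstacle, hit_robot):
        score = -float(dist_to_target)   # assumed bonus: closer to the target scores higher
        score -= 1 * missed_targets      # 1 point deducted per missed target point
        if hit_obstacle:
            score -= 10                  # 10 points deducted for colliding with an obstacle
        if hit_robot:
            score -= 5                   # 5 points deducted for colliding with another robot
        return score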
the third step, pre-training the K robot node simulation models in the M reconnaissance scenes constructed in the second step to obtain K parameter files in data format, wherein the K parameter files record the cooperative strategies of the K robot node simulation models in each reconnaissance simulation scene; each robot node simulation model trains its strategy network with the DDPG reinforcement learning algorithm, and the K robot node simulation models train their strategy networks independently and in parallel; the method by which the kth robot node simulation model trains its strategy network with the DDPG reinforcement learning algorithm is as follows:
3.1 initializing the first calculation module of robot node simulation model k, wherein the six neural networks in the first calculation module all need to initialize their parameters; the parameters comprise the weight matrices and bias vectors between the layers of each neural network, whose values are randomly generated from a normal distribution with expected value 0 and variance 2, and 1≤k≤K;
3.2 initializing variable m=0;
3.3 selecting an mth simulation scout scene from M simulation scout scenes in the Gazebo simulation environment;
3.4 randomly initializing the initial position of a kth robot node simulation model and the initial points of various environment elements in the mth simulation reconnaissance scene; initializing the training round number i=0, and setting the maximum training round number I, wherein I is a positive integer;
3.5 initializing the action step number j=0 of the kth robot node simulation model in the ith training round number, setting the maximum step number J which can be acted by the kth robot node simulation model in the ith training round number, wherein J is a positive integer, and if the kth robot node simulation model can finish tasks within the J steps, the actual action step number of the kth robot node simulation model in the ith training round number is smaller than J;
3.6 the strategy network of the kth robot node simulation model obtains the local state information o_j at step j from the first storage module, inputs o_j into the strategy network, multiplies it by the weight matrices between the layers of neurons and adds the bias vectors of each layer of neurons, outputs the action instruction a_j of step j, and sends a_j to the first storage module and the motion module;
3.7 the first storage module stores a_j, while the motion module performs action a_j;
3.8 the kth robot node simulation model detects the actions executed by the other K−1 robot nodes at step j and, combined with the action a_j executed by itself, obtains the joint action executed by the K robot nodes at step j; the joint action is denoted A_j and stored in the first storage module;
3.9 according to the evaluation index formulated in step 2.4, the kth robot node simulation model obtains the task completion degree score r_j for executing action a_j, and feeds r_j back to the first storage module of the kth robot node simulation model;
3.10 after the kth robot node simulation model executes action a_j, the state of the reconnaissance environment changes; the detection module of the kth robot node simulation model observes the global state information s_{j+1} and the local state information o_{j+1} at step j+1, and stores s_{j+1} and o_{j+1} in the first storage module;
3.11 the first storage module of the kth robot node simulation model combines s_j, s_{j+1}, A_j, r_j and o_j, o_{j+1}, a_j, r_j respectively to obtain the jth piece of global trajectory experience information [s_j, A_j, r_j, s_{j+1}] and the jth piece of local trajectory experience information [o_j, a_j, r_j, o_{j+1}]; the global and local trajectory experience information of the same step are called a trajectory experience information pair; the trajectory experience information pair [s_j, A_j, r_j, s_{j+1}] and [o_j, a_j, r_j, o_{j+1}] is sent to the experience replay pool of the first storage module, and the experience replay pool stores [s_j, A_j, r_j, s_{j+1}] and [o_j, a_j, r_j, o_{j+1}] separately;
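As an illustrative sketch (container choice and field layout are assumptions), steps 3.11-3.12 can be implemented as a replay pool that stores the global and local trajectory experience of the same step separately but keeps them aligned so that they are sampled as pairs.

    import random
    from collections import deque

    class PairedReplayPool:
        def __init__(self, capacity):
            self.global_pool = deque(maxlen=capacity)   # [s_j, A_j, r_j, s_{j+1}]
            self.local_pool  = deque(maxlen=capacity)   # [o_j, a_j, r_j, o_{j+1}]

        def store(self, global_exp, local_exp):
            # the two halves of a trajectory experience information pair are
            # stored separately but at the same index, so they stay synchronised
            self.global_pool.append(global_exp)
            self.local_pool.append(local_exp)

        def sample(self, h):
            idx = random.sample(range(len(self.global_pool)), h)
            return ([self.global_pool[i] for i in idx],
                    [self.local_pool[i] for i in idx])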
3.12 the first calculation module of the kth robot node simulation model judges the storage capacity of the experience playback pool, and if H track experience information pairs are already stored, H track experience information pairs are randomly extracted from the experience playback pool; h pieces of global track experience information are sent to a global evaluation network, H pieces of local track experience information are sent to a local evaluation network, and the step 3.13 is shifted; otherwise, let j=j+1, go to step 3.6, h is a positive integer;
3.13 the first calculation module of the kth robot node simulation model numbers the extracted H track experience information pairs with the numbers of 1-H according to the sequence; initializing a mark sequence number q=0;
3.14 the first calculation module of the kth robot node simulation model, starting from the mark sequence number q=0, selects the next h trajectory experience information pairs in order, wherein 1≤h≤H; and lets q=q+h, i.e. q is updated to the mark sequence number of the last of the h selected trajectory experience information pairs;
3.15 the global evaluation network of the kth robot node simulation model, according to the h selected pieces of global trajectory experience information, minimizes the global loss function L_global of formula (1) by the gradient descent method; meanwhile, the local evaluation network of the kth robot node simulation model, according to the h selected pieces of local trajectory experience information, minimizes the local loss function L_local of formula (2) by the gradient descent method, thereby updating the parameters of the global and local evaluation networks:
wherein Q_global denotes the global evaluation network, Q_local denotes the local evaluation network, Q′_global denotes the global target evaluation network, and Q′_local denotes the local target evaluation network; μ′ denotes the target strategy network, and the symbols in brackets immediately following these letters denote the information input to the network; μ′(o_{j+1}) denotes the action instruction output after the local state information o_{j+1} observed by the kth robot node simulation model at step j+1 is input into the target strategy network; μ′(s_{j+1}) denotes the joint action predicted by the K robots from the global state information s_{j+1} at step j+1; Q_local(o_j, a_j) denotes the estimated score of the action of the kth robot node simulation model obtained by inputting the local state information o_j and the action a_j into the local evaluation network; Q_global(s_j, A_j) denotes the estimated score of the joint action obtained by inputting the global state information s_j and the joint action A_j into the global evaluation network; (r_j + αQ′_global(s_{j+1}, μ′(s_{j+1})) − Q_global(s_j, A_j)) expresses that the closer the estimated score Q_global(s_j, A_j) given by the global evaluation network to the joint action is to the global target score r_j + αQ′_global(s_{j+1}, μ′(s_{j+1})), the better; (r_j + αQ′_local(o_{j+1}, μ′(o_{j+1})) − Q_local(o_j, a_j)) expresses that the closer the estimated score Q_local(o_j, a_j) given by the local evaluation network to the action of the kth robot node simulation model is to the local target score r_j + αQ′_local(o_{j+1}, μ′(o_{j+1})), the better;
alpha represents a discount factor, and takes a constant of 0 to 1;
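The images of formulas (1) and (2) are not reproduced in this text. A plausible reconstruction, assuming the standard DDPG temporal-difference loss and the symbol definitions above (averaged over the h selected pairs), is:

    L_{global} = \frac{1}{h}\sum_{j}\Big( r_j + \alpha\, Q'_{global}\big(s_{j+1},\, \mu'(s_{j+1})\big) - Q_{global}(s_j, A_j) \Big)^2   (1)

    L_{local}  = \frac{1}{h}\sum_{j}\Big( r_j + \alpha\, Q'_{local}\big(o_{j+1},\, \mu'(o_{j+1})\big) - Q_{local}(o_j, a_j) \Big)^2   (2)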
3.16 the global evaluation network and the local evaluation network adopt a double-layer evaluation mechanism to guide policy updating together, and the policy network of the kth robot node simulation model carries out parameter updating according to a formula (3):
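The image of formula (3) is likewise not reproduced. A hedged sketch of a policy-gradient update guided jointly by the two evaluation layers, consistent with the description above, is given below; the mixing weight λ is an assumption.

    \nabla_{\theta^{\mu}} J \approx \frac{1}{h}\sum_{j}
      \nabla_{\theta^{\mu}}\Big[ \lambda\, Q_{local}\big(o_j,\, \mu(o_j)\big)
        + (1-\lambda)\, Q_{global}\big(s_j,\, A_j\big|_{a_k=\mu(o_j)}\big) \Big]   (3, assumed form)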
3.17 if q+h > H, execute step 3.18; otherwise, go to step 3.14;
3.18 the global target evaluation network of the kth robot node simulation model updates own network parameters according to the parameters of the global evaluation network, as shown in a formula (4); the local target evaluation network updates own network parameters according to the parameters of the local evaluation network, as shown in a formula (5); the target policy network updates its own network parameters according to the parameters of the policy network, as shown in formula (6):
wherein θ_j^{Q_global}, θ_j^{Q_local} and θ_j^{μ} respectively denote the parameters of the global evaluation network, the local evaluation network and the strategy network at step j; θ_j^{Q′_global}, θ_j^{Q′_local} and θ_j^{μ′} denote the parameters of the global target evaluation network, the local target evaluation network and the target strategy network at step j; τ_1 and τ_2 are weight constants between 0 and 1;
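The images of formulas (4)–(6) are not reproduced; assuming the conventional soft-update form, with τ_1 applied to the two evaluation networks and τ_2 to the strategy network (the pairing of τ_1/τ_2 is an assumption):

    \theta^{Q'_{global}}_{j+1} = \tau_1\, \theta^{Q_{global}}_{j} + (1-\tau_1)\, \theta^{Q'_{global}}_{j}   (4, assumed form)

    \theta^{Q'_{local}}_{j+1}  = \tau_1\, \theta^{Q_{local}}_{j}  + (1-\tau_1)\, \theta^{Q'_{local}}_{j}    (5, assumed form)

    \theta^{\mu'}_{j+1}        = \tau_2\, \theta^{\mu}_{j}        + (1-\tau_2)\, \theta^{\mu'}_{j}          (6, assumed form)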
3.19 let j=j+1; if j reaches the set maximum number of steps per round J, execute step 3.20; otherwise, execute step 3.6;
3.20, let i=i+1; if i reaches the set maximum training round number I, execute step 3.21; otherwise, execute step 3.4;
3.21, let m=m+1; if m is equal to the maximum number of reconnaissance scenes M, execute step 3.23; otherwise, execute step 3.22;
3.22 the first calculation module of the kth robot node simulation model stores the trained strategy network model, naming it with the serial number m of the simulation reconnaissance scene; meanwhile, the first calculation module of the kth robot node simulation model clears all network parameters of the strategy network, the target strategy network, the global evaluation network, the global target evaluation network, the local evaluation network and the local target evaluation network, and assigns random initial values to them again; the first storage module of the kth robot node simulation model empties the experience pool in preparation for training on the new scene task; go to step 3.3;
3.23, a first calculation module of the kth robot node simulation model stores parameters of a strategy network, namely a collaborative strategy of the kth robot node simulation model in a reconnaissance simulation scene in a data format file, and a fourth step is executed;
the K robot node simulation models respectively upload data format files to the cloud server node through SSH service, namely, the personalized collaborative strategy obtained through the third training is stored in the cloud server node, and the strategy network model obtained in the pre-training stage is saved and shared; the method for uploading the data format file to the cloud server node through the SSH service is that the K robot node simulation models are executed in parallel;
4.1, the first calculation module of the kth robot node simulation model sends the data format file to the first communication module of the kth robot node simulation model;
4.2 the first communication module of the kth robot node simulation model sends the data format file to the second communication module through SSH communication service;
4.3 the second communication module stores the received data format file in a second storage module of the cloud server node;
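A minimal sketch of the upload of steps 4.1–4.3, assuming the data format file is a PyTorch parameter file and that the SSH service is used via scp; the host name, user and remote path are placeholders.

    import subprocess
    import torch

    # stand-in for the trained strategy network of robot node k (placeholder architecture)
    policy = torch.nn.Linear(16, 2)

    # 4.1: the first calculation module saves the strategy network parameters as a file
    torch.save(policy.state_dict(), "policy_scene_m.pt")

    # 4.2-4.3: the first communication module uploads the file to the second
    # communication module of the cloud server node over SSH (scp)
    subprocess.run(
        ["scp", "policy_scene_m.pt",
         "user@cloud-server:/data/strategies/robot_k/"],
        check=True,
    )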
And fifthly, deploying the multi-robot system constructed in the first step to the place where the reconnaissance task is to be carried out, and using the strategy models trained in the simulation scenes, namely the data format files, to help the training of the strategy networks for the multi-robot collaborative reconnaissance task in the real scene; the K robot nodes execute in parallel, and the execution process of the kth robot node is as follows:
5.1, the first communication module of the kth robot node sends a downloading request to the second communication module of the cloud server node to request for downloading of the strategy model;
5.2 the second communication module reads from the second storage module of the cloud server node the stored strategy network model parameter file of the simulation scene most similar to the unknown scene faced by the kth robot node, and transmits the data format file to the first communication module of the kth robot node through the SSH service protocol;
5.3 the first communication module of the kth robot node transmits the data format file to the first calculation module of the kth robot node;
5.4, the kth robot node loads the data format file into the local strategy network; the file is loaded directly by the PyTorch deep learning framework;
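A sketch of steps 5.2–5.4, loading the downloaded parameter file into the local strategy network with PyTorch; the file name is a placeholder, and the placeholder network must match the architecture of the saved one.

    import torch

    # stand-in for the local strategy network (same architecture as the saved one)
    policy = torch.nn.Linear(16, 2)

    # data format file downloaded from the cloud server node (placeholder name)
    state_dict = torch.load("policy_scene_m.pt", map_location="cpu")
    policy.load_state_dict(state_dict)   # pre-trained collaborative strategy
    policy.train()                       # keep the network trainable for the fifth-step fine-tuning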
5.5, respectively initializing parameters of a target strategy network, a local evaluation network and a global evaluation network of the kth robot node; the weight matrix and the bias vector of each neural network are randomly generated in a normal distribution with an expected value of 0 and a variance of 2;
5.6, emptying the experience playback pool of the first storage module of the kth robot node;
5.7 initializing the action step number j=0 of the kth robot node; the maximum executable step number is J, and J is a positive integer;
5.8 the strategy network in the first calculation module of the kth robot node obtains the local state information o_j of the jth step of the reconnaissance scene from the first storage module, inputs o_j into the strategy network to obtain the action instruction a_j, stores a_j in the first storage module, and sends a_j to the motion module;
5.9 the motion module performs action a_j according to the instruction;
5.10 according to the reconnaissance task completion degree evaluation index designed in step 2.4, the kth robot node obtains the completion degree score r_j fed back by the task scene for executing action a_j, and stores r_j in the first storage module of robot node k;
5.11 after the kth robot node executes action a_j, the global and local states of the reconnaissance environment change; the detection module of robot node k observes the global state information s_{j+1} and the local state information o_{j+1} at step j+1, and stores s_{j+1} and o_{j+1} in the first storage module;
5.12 the kth robot node combines the information s_j, s_{j+1}, A_j, r_j and o_j, o_{j+1}, a_j, r_j respectively to obtain the jth piece of global trajectory experience information [s_j, A_j, r_j, s_{j+1}] and the jth piece of local trajectory experience information [o_j, a_j, r_j, o_{j+1}], and stores [s_j, A_j, r_j, s_{j+1}] and [o_j, a_j, r_j, o_{j+1}] in the experience replay pool of the first storage module;
5.13 the first calculation module of the kth robot node checks the data in the experience replay pool of the first storage module; if 2H pieces of trajectory experience information (H pairs) are already stored, it randomly extracts the corresponding H pieces of global trajectory experience information and H pieces of local trajectory experience information from the replay pool to optimize the parameters of the six neural networks in the first calculation module, and then executes step 5.14; otherwise, let j=j+1 and go to step 5.8;
5.14 the global evaluation network and the local evaluation network of the kth robot node respectively read the H pieces of global trajectory experience information and the H pieces of local trajectory experience information, minimize the global loss function L_global in formula (1) and the local loss function L_local in formula (2) by the gradient descent method, and update the parameters of the global evaluation network and the local evaluation network respectively;
5.15 the strategy network of the kth robot node reads H pieces of global track experience information and H pieces of local track experience information, and updates parameters of the strategy network according to a strategy gradient update formula in a formula (3) through a gradient descent method, so that optimization of the strategy network is realized;
5.16 the global target evaluation network, the local target evaluation network and the target policy network of the kth robot node respectively update parameters of the global target evaluation network, the local target evaluation network and the target policy network according to update formulas of formulas (4), (5) and (6);
5.17, let j=j+1; if j reaches the maximum step number J, the collaborative reconnaissance strategy has been optimized in the real scene, and the sixth step is executed; otherwise, execute step 5.8;
after the fifth step is executed, the strategy network parameters of the K robot nodes jointly form a sampling strategy of autonomous collaborative reconnaissance real scenes of the multiple robots;
sixthly, deploying K robot nodes into a scene needing to develop a reconnaissance task;
seventhly, the multi-robot system autonomously and cooperatively completes multi-target reconnaissance tasks in an open environment according to the sampling strategy of the multi-robot autonomous and cooperatively reconnaissance real scene obtained in the fifth step; the method for completing the multi-target reconnaissance task by the kth robot node comprises the following steps:
7.1, setting a plurality of target points to be detected by a multi-robot detection system according to detection task requirements, storing coordinates of all the target points in a list L, and sending the L to a first communication module of a kth robot node; the first communication module forwards the list L to a first calculation module of a kth robot node; the robot node selects a target point from the list L with reference to the position of the target point;
7.2 initializing the execution step number j=0 of the kth robot node;
7.3 the detection module of the kth robot node obtains the global state information s_j and the local state information o_j of step j, and sends s_j and o_j to the first calculation module of the kth robot node;
7.4 the first calculation module of the kth robot node integrates s_j, o_j and the coordinates of the target point selected from the list L into a state triplet;
7.5 the strategy network of the first calculation module of the kth robot node makes a decision according to the state triplet, outputs the action instruction a_j, and sends a_j to the motion module;
7.6 after receiving the action instruction a_j, the motion module of the kth robot node executes the action towards the selected target point;
7.7 if the kth robot node arrives within a meters of the target point coordinates, indicating that the robot has reconnoitered the target, the target point coordinates are deleted from the list L and step 7.8 is executed; otherwise, let j=j+1 and go to step 7.3;
7.8 kth robot node judges whether the target point coordinates are stored in L, if L is not empty, the step 7.2 is shifted; otherwise, turning to an eighth step;
and eighth step, ending.
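For illustration only, the seventh-step reconnaissance loop (7.1–7.8) can be sketched as follows; observe(), policy_step(), act() and the nearest-target selection rule stand in for the detection, calculation and motion modules and are assumptions.

    import math

    def scout_targets(L, observe, policy_step, act, a=0.5):
        # L: list of target point coordinates still to be reconnoitred
        while L:                                              # 7.8: repeat until L is empty
            s, o, pos = observe()                             # 7.3: global/local state and own position
            target = min(L, key=lambda p: math.dist(pos, p))  # 7.1: choose a target by position
            while math.dist(pos, target) > a:                 # 7.7: within a metres counts as reconnoitred
                action = policy_step((s, o, target))          # 7.4-7.5: decide from the state triplet
                act(action)                                    # 7.6: move toward the selected target
                s, o, pos = observe()
            L.remove(target)                                   # 7.7: delete the reached target point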
2. The complex scene-oriented multi-robot collaborative reconnaissance method of claim 1, wherein the robot node is an unmanned aerial vehicle node Intel Aero or a ground robot Turtlebot3; the Ubuntu operating system is version 16.04 or above, and the UbuntuMate operating system is version 16.04 or above.
3. A complex scene oriented multi-robot collaborative reconnaissance method according to claim 1, wherein the detection module is an infrared camera or a depth camera or a scanning radar.
4. The complex scene oriented multi-robot collaborative reconnaissance method of claim 1, wherein the value of t is between 0.3 seconds and 1 second; the available space of the first storage module is more than 1 GB; the value range of N is 5000-10000; the M is not less than 10.
5. The complex scene-oriented multi-robot collaborative reconnaissance method of claim 1, wherein the cloud device is a workstation or a server, the first communication module and the second communication module are both wireless network cards, and the capacity of the second storage module is greater than 500 GB.
6. The complex scene oriented multi-robot collaborative reconnaissance method of claim 1, wherein in the second step said Gazebo simulation environment requires version 9.10.0 and above.
7. The complex scene-oriented multi-robot collaborative reconnaissance method of claim 1, wherein the maximum training round number I in step 3.4 is set to a positive integer between 400 and 1000; the J in step 3.5 is a positive integer between 30 and 50; the H in step 3.12 is a positive integer between 100 and 200; the α in step 3.15 is 0.2-0.3; the τ_1 and τ_2 in step 3.18 have a value range of 0.2-0.3.
8. The complex scene-oriented multi-robot collaborative reconnaissance method according to claim 1, wherein the value range of the J in the step 5.7 is a positive integer between 1000 and 2000.
9. The complex scene oriented multi-robot collaborative reconnaissance method according to claim 1, wherein the value of H is 50-100.
10. The complex scene oriented multi-robot collaborative reconnaissance method according to claim 1, wherein the value of a in the step 7.7 is 0.5-0.8 m.
CN202310578631.0A 2023-05-22 2023-05-22 Complex scene-oriented multi-robot collaborative reconnaissance method Pending CN116643586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310578631.0A CN116643586A (en) 2023-05-22 2023-05-22 Complex scene-oriented multi-robot collaborative reconnaissance method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310578631.0A CN116643586A (en) 2023-05-22 2023-05-22 Complex scene-oriented multi-robot collaborative reconnaissance method

Publications (1)

Publication Number Publication Date
CN116643586A true CN116643586A (en) 2023-08-25

Family

ID=87639249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310578631.0A Pending CN116643586A (en) 2023-05-22 2023-05-22 Complex scene-oriented multi-robot collaborative reconnaissance method

Country Status (1)

Country Link
CN (1) CN116643586A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117271967A (en) * 2023-11-17 2023-12-22 北京科技大学 Rescue co-location method and system based on reinforcement learning compensation filtering
CN117271967B (en) * 2023-11-17 2024-02-13 北京科技大学 Rescue co-location method and system based on reinforcement learning compensation filtering


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination