CN115933638A

CN115933638A - Multi-robot collaborative exploration method and device, electronic equipment and medium

Info

Publication number: CN115933638A
Application number: CN202211404643.3A
Authority: CN
Inventors: 张弛; 刘永新; 何强; 连亦芳; 陈天恒; 金诗晨; 马史栋; 卢乾坤
Original assignee: Shaoxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Shaoxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2022-11-10
Filing date: 2022-11-10
Publication date: 2023-04-07

Abstract

The invention discloses a multi-robot collaborative exploration method and device, electronic equipment and a medium. The method comprises: acquiring exploration information of a plurality of robots in the current environment; predicting, based on a behavior network, the behavior corresponding to each robot according to the exploration information of the robot; performing area division on the plurality of robots to obtain a target point set corresponding to each robot; selecting, based on the behavior corresponding to each robot, a moving target point from the target point set according to an evaluation network; performing path planning according to the moving target point to obtain a moving path corresponding to the robot; and controlling the robot to move to the moving target point according to the moving path. The application accurately segments the exploration environment of each robot, which reduces the possibility of repeated exploration and maximizes the exploration range of the robots. Because each robot has its own networks, decisions are made in a distributed-execution manner, so that as much of the unknown environment as possible can be explored in a shorter time, which improves exploration efficiency.

Description

Multi-robot collaborative exploration method and device, electronic equipment and medium

Technical Field

The invention mainly relates to the technical field of intelligent control, in particular to a multi-robot collaborative exploration method, a multi-robot collaborative exploration device, electronic equipment and a medium.

Background

With the rapid development of automation and computer technologies, research on the exploration of unknown environments by intelligent robots has become increasingly important. Recent industrial and scientific scenarios such as Mars landing, cave rescue and power grid planning have further stimulated the demand for autonomous exploration. Compared with single-robot exploration, multi-robot collaborative exploration improves the efficiency and robustness of task execution and, through information sharing, can complete tasks that a single robot cannot. Multi-robot collaborative exploration has therefore attracted wide attention from researchers.

At present, existing multi-robot collaborative exploration methods usually adopt a centralized control mode, of which the contract network model is the most common. However, evaluation in the contract network model is performed by the bidding party, which easily creates a system bottleneck. The coordination strategy for cooperative operation of a multi-robot system is computationally complex, places high demands on the hardware configuration of the central processor and requires a complex software structure, all of which implicitly increase the cost of the multi-robot system. In addition, the few strategies available for distributing multiple robots within an environment are only locally decentralized, resulting in inefficient exploration.

Disclosure of Invention

The application aims to provide a multi-robot collaborative exploration method, a multi-robot collaborative exploration device, electronic equipment and a medium, which can improve exploration efficiency while fully mobilizing each robot to conduct collaborative exploration.

In a first aspect, the present application provides a multi-robot collaborative exploration method, including: acquiring exploration information of a plurality of robots in the current environment; predicting behaviors corresponding to the robots according to the exploration information of the robots based on a behavior network; dividing the plurality of robots into areas to obtain a target point set corresponding to each robot; selecting a mobile target point from the target point set according to the evaluation network based on the corresponding behaviors of the robots; planning a path according to the moving target point to obtain a moving path corresponding to the robot; and controlling the robot to move to a moving target point according to the moving path, and completing the collaborative exploration tasks of all the robots in the current environment.

In the application, exploration information of a plurality of robots in the current environment is obtained; then, based on the behavior network, the behavior corresponding to each robot is predicted according to the exploration information of the robot; then, area division is performed on the plurality of robots to obtain a target point set corresponding to each robot; then, based on the behavior corresponding to each robot, a moving target point is selected from the target point set according to the evaluation network; then, path planning is performed according to the moving target point to obtain a moving path corresponding to the robot; and the robot is controlled to move to the moving target point according to the moving path, completing the collaborative exploration tasks of all the robots in the current environment. The application accurately segments the exploration environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robots. Because each robot is provided with its own networks, decisions are made in a distributed-execution manner, as much of the unknown environment as possible can be explored in a short time, and the requirements on the structure of the environment are relaxed.

In one implementation manner of the first aspect, acquiring exploration information of a plurality of robots in a current environment includes: acquiring the current position and the last moving position of any one of the multiple robots and the current positions of other robots in the multiple robots; acquiring the relative position between the current position and the last moving position of any one robot; and acquiring the relative positions between the current position of any one robot and the current positions of other robots respectively.

In the present application, the exploration information of the plurality of robots may include position information of the plurality of robots. The position information includes the current position and the last moving position of any one of the robots, the relative position between the current position and the last moving position of that robot, and the relative positions between the current position of that robot and the current positions of the other robots.

In an implementation manner of the first aspect, predicting, based on a behavior network, a behavior corresponding to each robot according to the exploration information of the robot includes: inputting the exploration information of any one robot into the behavior network; extracting behavior features corresponding to the exploration information through a plurality of first fully connected layers and a first gated recurrent unit (GRU) layer in the behavior network; and outputting the behavior corresponding to that robot through a first activation function output layer in the behavior network.

In the application, the behavior network comprises a first input layer, a plurality of first fully connected layers and a first activation function output layer. The exploration information of any one robot is fed to the first input layer of the behavior network; behavior features corresponding to the exploration information are then extracted through the plurality of first fully connected layers and a first gated recurrent unit (GRU) layer in the behavior network; and the behavior corresponding to that robot is then output through the first activation function output layer in the behavior network.

In an implementation manner of the first aspect, performing area division on a plurality of robots to obtain a target point set corresponding to each robot includes: connecting all adjacent robots among the plurality of robots into triangles, and constructing the perpendicular bisector of each side of each triangle; enclosing the perpendicular bisectors around each robot into a polygon to obtain a polygon area graph corresponding to the plurality of robots; and obtaining the target point set corresponding to each robot according to the polygon area graph.

In the application, when the plurality of robots are divided into areas, all adjacent robots among the plurality of robots are connected into triangles, and the perpendicular bisector of each side of each triangle is constructed; the perpendicular bisectors around each robot are then enclosed into a polygon, giving a polygon area graph corresponding to the plurality of robots; and the target point set corresponding to each robot is then obtained according to the polygon area graph. The polygon area graph comprises a plurality of polygons, and each polygon contains exactly one robot. Any point inside a polygon is closer to that polygon's robot than to any other robot, and any point on a polygon side is equidistant from the robots on the two sides of it.

In an implementation manner of the first aspect, selecting a moving target point from the target point set according to the evaluation network based on the behavior corresponding to each robot includes: inputting the behavior, the target point set and the exploration information corresponding to any one robot into the evaluation network; extracting the target point features corresponding to that robot through a plurality of second fully connected layers and a second gated recurrent unit (GRU) layer in the evaluation network; and outputting the moving target point corresponding to that robot through the first activation function output layer in the actor network, that is, the behavior network.

In the application, the evaluation network comprises a second input layer, a plurality of second fully connected layers and a second gated recurrent unit (GRU) layer. The behavior, the target point set and the exploration information corresponding to any one robot can be input into the evaluation network; the target point features corresponding to that robot are extracted through the plurality of second fully connected layers and the second GRU layer in the evaluation network; and the moving target point corresponding to that robot is then output through the first activation function output layer in the actor network, that is, the behavior network.

In an implementation manner of the first aspect, performing path planning according to the moving target point to obtain a moving path corresponding to the robot includes: taking the current position of any one robot as the planning start point and the moving target point of that robot as the planning target point, and completing the path planning of that robot by adopting a heuristic search algorithm.

In the method, the feasible track from the current position of the robot to the moving target point can be determined through the heuristic search algorithm, and the heuristic search algorithm can be utilized to independently design a path for each robot so as to facilitate obstacle avoidance walking of each robot, so that the efficiency of multi-robot collaborative exploration is improved.

In an implementation manner of the first aspect, the method further includes: when the robots are controlled to move to the moving target points according to the moving paths, acquiring the reward value of any one robot and the exploration information of that robot after moving to the moving target point; calculating a target value function value according to the reward value of that robot and the exploration information of that robot after moving to the moving target point; and calculating a loss value according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network.

In the application, after the robot is controlled to move to the moving target point according to the moving path, the reward value of any one robot and the exploration information of any one robot after moving to the moving target point can be acquired; then, calculating a target value function value according to the reward value of any one robot and the exploration information after any one robot moves to a moving target point; and calculating a loss value according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network. The network precision is improved, and meanwhile, the efficiency of multi-robot collaborative exploration is also improved.

In a second aspect, the present application provides a multi-robot collaborative exploration apparatus, including: the information acquisition module is used for acquiring exploration information of a plurality of robots in the current environment; the behavior prediction module is used for predicting the corresponding behaviors of the robots according to the exploration information of the robots based on the behavior network; the area division module is used for carrying out area division on the plurality of robots to obtain a target point set corresponding to each robot; the target point selection module is used for selecting a mobile target point from the target point set according to the evaluation network based on the corresponding behaviors of the robots; the path planning module is used for planning a path according to the mobile target point to obtain a mobile path corresponding to the robot; and the target point execution module is used for controlling the robot to move to a moving target point according to the moving path and completing the collaborative exploration tasks of all the robots in the current environment.

In a third aspect, the present application provides an electronic device, comprising: a memory storing a plurality of instructions; and a processor that loads the instructions from the memory to perform the steps of any one of the multi-robot collaborative exploration methods provided by the embodiments of the present application.

In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a multi-robot collaborative exploration apparatus, implements the steps of any one of the multi-robot collaborative exploration methods provided in this application.

The method comprises the steps of firstly obtaining exploration information of a plurality of robots in the current environment; then, based on the behavior network, predicting the corresponding behavior of each robot according to the exploration information of the robot; then, carrying out region division on the multiple robots to obtain a target point set corresponding to each robot; then, based on the corresponding behaviors of the robots, selecting a mobile target point from a target point set according to an evaluation network; then, planning a path according to the moving target point to obtain a moving path corresponding to the robot; and controlling the robot to move to a moving target point according to the moving path, and completing the collaborative exploration tasks of all the robots in the current environment. The application accurately segments the detection environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robot.

In the application, each robot is provided with its own networks, and all robots are simultaneously encouraged to explore the unknown environment, which reduces the possibility of redundant exploration. Decisions are made in a distributed-execution manner, so that as much of the unknown environment as possible can be explored in a short time, and the requirements on the structure of the environment are relaxed.

Drawings

Fig. 1 is a schematic view of an application scenario of a multi-robot collaborative exploration method according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating a multi-robot collaborative exploration method according to an embodiment of the present application.

Fig. 3 is a schematic view illustrating an update process of a network according to an embodiment of the present application.

Fig. 4 is a diagram of a simulated training and testing environment according to an embodiment of the present application.

Fig. 5 is a comparison of training curves for a multi-robot collaborative exploration method and baseline MADDPG in accordance with an embodiment of the present application.

Fig. 6 is a schematic structural diagram of a multi-robot collaborative exploration apparatus according to an embodiment of the present application.

Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and its several details are capable of modifications and/or changes in various respects, all without departing from the spirit of the present application. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present application, and the drawings only show the components related to the present application and are not drawn according to the number, shape and size of the components in actual implementation, and the type, number and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

Cooperative control technology means that multiple robots smoothly complete a series of tasks without human participation in control, realizing autonomous control of the whole multi-robot system. The following embodiments of the application provide a multi-robot collaborative exploration method, apparatus, electronic device and medium, where the multi-robot collaborative exploration apparatus may be specifically integrated in an electronic device, and the electronic device may be a terminal, a server or another device. The terminal may be a mobile phone, a tablet computer, a smart Bluetooth device, a notebook computer, a personal computer (PC) or the like; the server may be a single server or a server cluster composed of a plurality of servers.

In some embodiments, the multi-robot collaborative exploration apparatus may also be integrated into a plurality of electronic devices. For example, the apparatus may be integrated into a plurality of servers, and the multi-robot collaborative exploration method of the present application is implemented by the plurality of servers.

In some embodiments, the server may also be implemented in the form of a terminal.

For example, referring to Fig. 1, the electronic device may include a plurality of robots 10, a storage terminal 11, a server 12 and the like. The plurality of robots 10 include a plurality of intelligent robots, the storage terminal 11 is configured to store the exploration information of the plurality of robots, and the plurality of robots 10, the storage terminal 11 and the server 12 are communicatively connected to one another; the details are not repeated here.

The server 12 may include a processor, a memory, and the like. The server 12 may obtain exploration information of a plurality of robots in the current environment; predicting behaviors corresponding to the robots according to the exploration information of the robots based on the behavior network; dividing the plurality of robots into areas to obtain a target point set corresponding to each robot; selecting a mobile target point from the target point set according to the evaluation network based on the corresponding behaviors of the robots; planning a path according to the moving target point to obtain a moving path corresponding to the robot; and controlling the robot to move to a moving target point according to the moving path, and completing the collaborative exploration tasks of all the robots in the current environment.

The technical solutions in the embodiments of the present application will be described in detail below with reference to the drawings in the embodiments of the present application.

As shown in Fig. 2, with the server 12 as the executing agent, the embodiment provides a multi-robot collaborative exploration method, which includes steps S210 to S260 as follows:

S210, the server 12 acquires exploration information of a plurality of robots in the current environment.

In one embodiment, acquiring exploration information of a plurality of robots in a current environment comprises: acquiring the current position and the last moving position of any one of the multiple robots and the current positions of other robots in the multiple robots; acquiring the relative position between the current position and the last moving position of any one robot; and acquiring the relative positions between the current position of any one robot and the current positions of other robots respectively.

In this embodiment, the exploration information of the plurality of robots may include position information of the plurality of robots. The position information refers to information related to the positions of the robots, for example the geographic locations where the robots are located, or their positions relative to other communication nodes. For example, the position information may include the current position and the last moving position of any one of the robots, the relative position between the current position and the last moving position of that robot, and the relative positions between the current position of that robot and the current positions of the other robots.
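As an illustration only, the position part of the exploration information described above could be assembled as in the following Python sketch. The function name `build_position_info`, the dictionary layout and the 2-D coordinates are assumptions made for the example; the application does not specify a data format.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def build_position_info(
    robot_id: str,
    current: Dict[str, Point],
    previous: Dict[str, Point],
) -> dict:
    """Assemble the position part of one robot's exploration information.

    `current` and `previous` map robot ids to (x, y) positions.
    The returned layout is hypothetical, chosen only for this sketch.
    """
    cx, cy = current[robot_id]
    px, py = previous[robot_id]
    return {
        "current": (cx, cy),
        "previous": (px, py),
        # Displacement from the last moving position to the current position.
        "rel_to_previous": (cx - px, cy - py),
        # Offset from this robot to every other robot.
        "rel_to_others": {
            other: (ox - cx, oy - cy)
            for other, (ox, oy) in current.items()
            if other != robot_id
        },
    }


current = {"r0": (2.0, 3.0), "r1": (5.0, 7.0), "r2": (0.0, 1.0)}
previous = {"r0": (1.0, 3.0), "r1": (5.0, 6.0), "r2": (0.0, 0.0)}
info = build_position_info("r0", current, previous)
```

In such a layout, each robot's observation depends only on positions, so it can be built locally from shared position messages.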

In addition, the exploration information of the plurality of robots in the present embodiment may include environment information, state information and the like of the plurality of robots. The environment information refers to information related to the environment around the robot, for example the terrain, landmarks and obstacles around the robot; the state information refers to information related to the state of the robot itself, for example the attributes, battery level, orientation and control information of the robot.

S220, the server 12 predicts, based on the behavior network, the behavior corresponding to each robot according to the exploration information of the robot.

In one embodiment, predicting the behavior corresponding to each robot according to the exploration information of the robot based on the behavior network includes: inputting the exploration information of any one robot into the behavior network; extracting behavior features corresponding to the exploration information through a plurality of first fully connected layers and a first gated recurrent unit (GRU) layer in the behavior network; and outputting the behavior corresponding to that robot through a first activation function output layer in the behavior network.

In this embodiment, each robot includes its own behavior network and evaluation network. The behavior network is a strategy network of the robot and is used for outputting decision behaviors; and the evaluation network is used for evaluating and adjusting the behavior network so as to update the parameters of the behavior network.

The behavior network may comprise a first input layer, a plurality of first fully connected layers and a first activation function output layer. The exploration information of any one robot is fed to the first input layer of the behavior network; behavior features corresponding to the exploration information are then extracted through the plurality of first fully connected layers and a first gated recurrent unit (GRU) layer in the behavior network; and the behavior corresponding to that robot is then output through the first activation function output layer in the behavior network. In this embodiment, in order to introduce memory, the behavior network may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, or the like.
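A minimal forward-pass sketch of such a behavior network, written in plain Python, is given below. It assumes two fully connected layers with ReLU, a single GRU cell, and a softmax output over a small discrete action set; the layer sizes, the random initialization, the choice of softmax and the class name `BehaviorNet` are all assumptions for illustration and are not taken from the application.

```python
import math
import random


def linear(x, w, b):
    """Fully connected layer: y = W x + b, with w given as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def init(rows, cols, rng):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class BehaviorNet:
    """Hypothetical behavior network: FC -> FC -> GRU cell -> softmax."""

    def __init__(self, obs_dim, hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.fc1 = (init(hidden, obs_dim, rng), [0.0] * hidden)
        self.fc2 = (init(hidden, hidden, rng), [0.0] * hidden)
        # GRU gates act on the concatenation [features, previous hidden state].
        self.wz = (init(hidden, 2 * hidden, rng), [0.0] * hidden)
        self.wr = (init(hidden, 2 * hidden, rng), [0.0] * hidden)
        self.wh = (init(hidden, 2 * hidden, rng), [0.0] * hidden)
        self.out = (init(n_actions, hidden, rng), [0.0] * n_actions)

    def step(self, obs, h):
        """One time step: returns (action probabilities, new hidden state)."""
        x = [max(0.0, a) for a in linear(obs, *self.fc1)]  # FC + ReLU
        x = [max(0.0, a) for a in linear(x, *self.fc2)]    # FC + ReLU
        z = sigmoid(linear(x + h, *self.wz))               # update gate
        r = sigmoid(linear(x + h, *self.wr))               # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in linear(x + gated, *self.wh)]
        # One common GRU convention: blend old state and candidate state.
        h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        logits = linear(h_new, *self.out)
        m = max(logits)                                    # for numerical stability
        exps = [math.exp(a - m) for a in logits]
        total = sum(exps)
        return [e / total for e in exps], h_new


net = BehaviorNet(obs_dim=6, hidden=8, n_actions=4)
h = [0.0] * 8
probs, h = net.step([0.1, -0.2, 0.3, 0.0, 0.5, -0.1], h)
```

The hidden state `h` is carried from one call of `step` to the next, which is what gives the network the memory mentioned above.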

S230, the server 12 performs area division on the multiple robots to obtain a target point set corresponding to each robot.

In an embodiment, performing area division on a plurality of robots to obtain a target point set corresponding to each robot includes: connecting all adjacent robots among the plurality of robots into triangles, and constructing the perpendicular bisector of each side of each triangle; enclosing the perpendicular bisectors around each robot into a polygon to obtain a polygon area graph corresponding to the plurality of robots; and obtaining the target point set corresponding to each robot according to the polygon area graph.

In the embodiment, when the plurality of robots are divided into areas, all adjacent robots among the plurality of robots are connected into triangles, and the perpendicular bisector of each side of each triangle is constructed; the perpendicular bisectors around each robot are then enclosed into a polygon, giving a polygon area graph corresponding to the plurality of robots; and the target point set corresponding to each robot is then obtained according to the polygon area graph. The polygon area graph comprises a plurality of polygons, and each polygon contains exactly one robot. Any point inside a polygon is closer to that polygon's robot than to any other robot, and any point on a polygon side is equidistant from the robots on the two sides of it.
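The polygon property stated above (every point in a polygon is nearest to that polygon's robot) means that membership in a robot's polygon can be decided by a nearest-robot test, without constructing the triangles and bisectors explicitly. The sketch below uses that shortcut to build target point sets; it is an equivalent membership test under that property, not the construction described in the application, and the names `partition_targets`, `robots` and `candidates` are assumptions for the example.

```python
import math


def partition_targets(robots, candidates):
    """Assign each candidate target point to its nearest robot.

    `robots` maps robot ids to (x, y) positions; `candidates` is a list of
    (x, y) points. A point on a polygon side (a tie) goes to the robot that
    appears first in `robots`.
    """
    target_sets = {name: [] for name in robots}
    for p in candidates:
        nearest = min(robots, key=lambda name: math.dist(robots[name], p))
        target_sets[nearest].append(p)
    return target_sets


robots = {"r0": (0.0, 0.0), "r1": (10.0, 0.0), "r2": (5.0, 10.0)}
candidates = [(1.0, 1.0), (9.0, 2.0), (5.0, 8.0), (4.0, 0.0)]
target_sets = partition_targets(robots, candidates)
```

Because every candidate lands in exactly one set, no two robots are offered the same target point, which is how the partition reduces repeated exploration.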

S240, the server 12 selects a moving target point from the target point set according to the evaluation network based on the corresponding behaviors of the robots.

In one embodiment, selecting a moving target point from the target point set according to the evaluation network based on the behavior corresponding to each robot includes: inputting the behavior, the target point set and the exploration information corresponding to any one robot into the evaluation network; extracting the target point features corresponding to that robot through a plurality of second fully connected layers and a second gated recurrent unit (GRU) layer in the evaluation network; and outputting the moving target point corresponding to that robot through the first activation function output layer in the actor network, that is, the behavior network.

In the embodiment, the evaluation network comprises a second input layer, a plurality of second fully connected layers and a second gated recurrent unit (GRU) layer. The behavior, the target point set and the exploration information corresponding to any one robot can be input into the evaluation network; the target point features corresponding to that robot are extracted through the plurality of second fully connected layers and the second GRU layer in the evaluation network; and the moving target point corresponding to that robot is then output through the first activation function output layer in the actor network, that is, the behavior network.

In this embodiment, in order to introduce memory, the evaluation network may be an RNN (Recurrent Neural Network), an LSTM (Long Short-Term Memory) network, or the like.

S250, the server 12 performs path planning according to the moving target point to obtain a moving path corresponding to the robot.

In one embodiment, performing path planning according to the moving target point to obtain a moving path corresponding to the robot includes: taking the current position of any one robot as the planning start point and the moving target point of that robot as the planning target point, and completing the path planning of that robot by adopting a heuristic search algorithm.

In the embodiment, a feasible trajectory from the current position of the robot to the moving target point can be determined through the heuristic search algorithm, and the heuristic search algorithm can be used to plan a path for each robot independently so that each robot can move while avoiding obstacles, which improves the efficiency of multi-robot collaborative exploration. The heuristic search algorithm may be the A-Star (A*) algorithm.

Specifically, the search area of the robot is simplified into a set of discrete, quantifiable nodes. Starting from the current position of the robot as the start point, the nodes adjacent to the start point are traversed, then the nodes adjacent to those already traversed, and the search expands outward step by step until the end point, that is, the moving target point corresponding to the robot, is found.
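The node expansion described above can be sketched with a compact A* search on an occupancy grid. The 4-connected grid, unit step cost, Manhattan-distance heuristic and the function name `a_star` are assumptions made for this example; the application only states that a heuristic search such as A-Star may be used.

```python
import heapq


def a_star(grid, start, goal):
    """Return a list of cells from start to goal, or None if unreachable.

    `grid` is a list of rows where 0 is free and 1 is an obstacle;
    cells are (row, col) tuples and moves are 4-connected with unit cost.
    """
    def h(cell):
        # Manhattan distance: admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]  # entries are (f = g + h, g, cell)
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:     # walk parents back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue                    # stale heap entry, a better one was processed
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                new_g = g + 1
                if new_g < cost.get(nxt, float("inf")):
                    cost[nxt] = new_g
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_g + h(nxt), new_g, nxt))
    return None


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = a_star(grid, (0, 0), (2, 0))
```

Here the start cell plays the role of the robot's current position and the goal cell the role of its moving target point; the wall in the middle row forces the path through the single gap.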

S260, the server 12 controls the robot to move to the moving target point according to the moving path, completing the collaborative exploration tasks of all the robots in the current environment.

In the embodiment, after the robot is controlled to move to the moving target point according to the moving path, the reward value of any one robot and the exploration information of that robot after moving to the moving target point can be acquired; a target value function value is then calculated according to the reward value of that robot and the exploration information of that robot after moving to the moving target point; and a loss value is calculated according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network. This improves the accuracy of the network and, at the same time, the efficiency of multi-robot collaborative exploration.
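The application does not give the formulas for the target value function value or the loss. One common choice in actor-critic methods such as MADDPG, shown here purely as an assumption, is the one-step temporal-difference target y = r + gamma * Q'(s', a') with a mean squared error loss. In the sketch below, `next_q` stands for the value a target evaluation network would assign to the exploration information after the move, and `gamma` is a hypothetical discount factor.

```python
def td_target(reward, next_q, gamma=0.95, done=False):
    """One-step TD target: y = r + gamma * Q'(s', a'), or y = r at episode end."""
    return reward + (0.0 if done else gamma * next_q)


def critic_loss(q_values, targets):
    """Mean squared error between predicted values and TD targets."""
    return sum((y - q) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Two illustrative transitions (made-up numbers).
rewards = [1.0, 0.0]
next_qs = [2.0, 4.0]
targets = [td_target(r, nq, gamma=0.5) for r, nq in zip(rewards, next_qs)]
loss = critic_loss([1.5, 2.5], targets)
```

With these numbers both targets equal 2.0 and the loss is 0.25; in training, the gradient of this loss with respect to the evaluation network's parameters would drive the update.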

In one embodiment, the method further comprises: when the robot is controlled to move to the moving target point according to the moving path, acquiring the reward value of any one robot and the exploration information of that robot after moving to the moving target point; calculating a target value function value according to the reward value of that robot and the exploration information of that robot after moving to the moving target point; and calculating a loss value according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network.

In the embodiment, exploration information of a plurality of robots in the current environment is obtained; then, based on the behavior network, the behavior corresponding to each robot is predicted according to the exploration information of the robot; then, area division is performed on the plurality of robots to obtain a target point set corresponding to each robot; then, based on the behavior corresponding to each robot, a moving target point is selected from the target point set according to the evaluation network; then, path planning is performed according to the moving target point to obtain a moving path corresponding to the robot; and the robot is controlled to move to the moving target point according to the moving path, completing the collaborative exploration tasks of all the robots in the current environment. The application accurately segments the exploration environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robots. Because each robot is provided with its own networks, decisions are made in a distributed-execution manner, as much of the unknown environment as possible can be explored in a short time, and the requirements on the structure of the environment are relaxed.

The intelligent robots in the distributed structure are completely autonomous and are peers of one another, with no master-slave distinction. The intelligent robots exchange information with each other through communication and make decisions independently using local information. The distributed structure improves the stability and flexibility of the system and relieves the bottleneck that exists in centralized control. In consideration of the flexibility, stability and robustness of the whole multi-robot cooperative system, the distributed method is adopted in the embodiment to realize cooperative control among the multiple robots and to optimize the cooperative control indexes.

In the embodiment, the detection environment of each robot is accurately segmented through polygon partitioning and a deep reinforcement learning method, so that the possibility of repeated exploration is reduced and the exploration range of each robot is maximized. In the embodiment, the next ideal position point of each robot, namely the moving target point, can be obtained through centralized training, and a path is designed independently for each robot using a classical heuristic search algorithm, so that each robot can avoid obstacles while moving. Finally, all the robots cooperatively complete the exploration task, for example by cooperatively constructing a map of the current environment.

The embodiment also relates to updating of the behavior network and the evaluation network. As shown in fig. 3, the method specifically includes the following steps: S1, initializing the behavior network, the evaluation network and the target networks corresponding to each robot, and the experience replay pool R.

In this embodiment, each robot has its own behavior network and evaluation network. The behavior network is the policy network of the robot and is used for outputting decision behaviors; the evaluation network is used for evaluating the behavior network so as to update the parameters of the behavior network. The experience replay pool R is used for storing the experience obtained from the interaction between each robot and the environment in each state. The behavior network and the evaluation network of the embodiment each have a respective target network, that is, the target networks corresponding to each robot, and each target network provides periodic update and iteration of model parameters for the corresponding behavior network or evaluation network.

S2, resetting the environment, and initializing a Gaussian random process for behavior exploration.

S3, receiving the initial state S and the observation information o of each robot.

Specifically, the input to the neural network is a concatenated vector consisting of rangefinder data (a 48-dimensional vector), the robot's position relative to its previous position (a 2-dimensional vector), and its positions relative to the other mobile robots (a 4-dimensional vector). The input layer is first connected to three fully connected (FC) layers, each containing 256 nodes, followed by a gated recurrent unit (GRU) layer containing 256 nodes. The behavior network outputs an action, namely the behavior corresponding to the robot, through a sigmoid function.
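The composition of the input vector described above can be sketched as follows. This is a minimal illustration only: the function names `relative` and `build_observation` are hypothetical, positions are assumed to be 2D coordinates, and the 4-dimensional term is assumed to correspond to two other robots.

```python
RANGE_DIM = 48   # rangefinder readings, as stated in the description
OTHERS_DIM = 4   # relative positions to the other robots (two others assumed)


def relative(a, b):
    """Return the 2D vector pointing from position a to position b."""
    return (b[0] - a[0], b[1] - a[1])


def build_observation(ranges, current, previous, others):
    """Concatenate the three parts of the network input into one flat list."""
    if len(ranges) != RANGE_DIM:
        raise ValueError("expected 48 rangefinder readings")
    if 2 * len(others) != OTHERS_DIM:
        raise ValueError("expected exactly two other robots (assumption)")
    obs = list(ranges)
    # Position of the robot relative to its previous position (2 values).
    obs.extend(relative(previous, current))
    # Positions of the other robots relative to this robot (2 values each).
    for p in others:
        obs.extend(relative(current, p))
    return obs


obs = build_observation(
    ranges=[1.0] * 48,
    current=(2.0, 3.0),
    previous=(1.0, 1.0),
    others=[(5.0, 3.0), (2.0, 7.0)],
)
```

Under these assumptions the resulting vector has 48 + 2 + 4 = 54 entries, which would be the width of the first input layer.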

In this embodiment, all actions of the robot and its observations are used as inputs to the evaluation network, which processes them through three fully connected layers and one gated recurrent unit layer. The evaluation network outputs a Q value through a linear activation function; that is, with the state S and the action a as inputs, the output Q value refers to the possible reward for performing action a in state S.

S4, selecting the behavior a in the current state according to the policy network π, and calculating a selectable target point set through Voronoi division.

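In the embodiment, the selectable target point set is calculated by Voronoi division, and the target point set of robot i is

$V_i = \{\, q \in Q \mid \lVert q - p_i \rVert \le \lVert q - p_j \rVert,\ \forall j \ne i \,\}$

where Q is the robot's movable area, and $p_i$ and $p_j$ are the positions of robots i and j, respectively.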
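As an illustration of the Voronoi division, the following sketch assigns each candidate point of the movable area to its nearest robot. It is a discrete approximation under stated assumptions: the function name `voronoi_target_sets` is hypothetical, the movable area is represented by a finite list of candidate points, and ties are given to the robot with the lowest index.

```python
def voronoi_target_sets(robots, candidates):
    """Assign every candidate point to the nearest robot (discrete Voronoi cells)."""
    sets = {i: [] for i in range(len(robots))}
    for q in candidates:
        # Squared Euclidean distances are enough for a nearest-robot comparison.
        dists = [(q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2 for p in robots]
        nearest = min(range(len(robots)), key=dists.__getitem__)
        sets[nearest].append(q)
    return sets


robots = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
candidates = [(1.0, 1.0), (9.0, 1.0), (5.0, 8.0), (4.0, 0.0)]
target_sets = voronoi_target_sets(robots, candidates)
```

Each candidate point ends up in exactly one set, so no two robots are offered the same target point, which is the property the partition relies on to reduce repeated exploration.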

S5, mapping the behaviors to target points, and performing path planning through the A* algorithm.
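A minimal sketch of A* path planning is given below for illustration. The occupancy-grid representation, the 4-connected moves with unit cost, and the Manhattan-distance heuristic are assumptions of this sketch and are not specified in the embodiment; the function name `a_star` is likewise hypothetical.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid; grid[r][c] == 1 marks an obstacle."""
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cell[0] + dr, cell[1] + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            if grid[nxt[0]][nxt[1]] == 1:
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                came_from[nxt] = cell
                heapq.heappush(open_heap, (new_cost + h(nxt), new_cost, nxt))
    return None  # the goal cannot be reached


grid = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
path = a_star(grid, (0, 0), (2, 0))
blocked = a_star([[0, 1], [1, 0]], (0, 0), (1, 1))
```

In the example the wall in the middle row forces the path around the right-hand side of the grid, and the second call returns `None` because the goal is enclosed by obstacles.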

S6, obtaining the reward value r, the next observation state o' and the next state S', and storing them in the experience replay pool R.

In this embodiment, the reward value r may include a mapping reward, a task completion reward, a penalty portion, and the like. The server 12 can return a reward signal, i.e. a reward value, after the robot has interacted with the environment. The observation state is information observed by the robot, such as position information and the like; the state is state information of the robot itself, and the like.

S7, repeating steps S4 to S6 until the behavior networks corresponding to all the robots have been traversed once.

S8, sampling a plurality of samples from the experience replay pool R.
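The storing of step S6 and the sampling of step S8 can be sketched with a fixed-capacity buffer. This is an illustrative sketch only: the class name `ReplayPool`, the capacity limit, the tuple layout and uniform random sampling are assumptions, since the embodiment does not state how the pool is organized.

```python
import random
from collections import deque


class ReplayPool:
    """Fixed-capacity experience pool; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, transition):
        # transition is assumed to be (o, a, r, o_next, s_next).
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored experience.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


pool = ReplayPool(capacity=3)
for step in range(5):
    pool.store(([step], [0.5], float(step), [step + 1], [step + 1]))
batch = pool.sample(2, rng=random.Random(0))
```

With a capacity of 3 and five stored transitions, only the three most recent transitions remain available for sampling.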

S9, calculating a target value function, and updating the evaluation network by minimizing the loss value.

In this embodiment, when calculating the target value function, the target value for the z-th robot is calculated as follows: $y = r_z + \gamma \min_{c} Q_{z,c}$.

Here y is the target value, $Q_{z,c}$ is the estimated value, $r_z$ is the reward value of the z-th robot, and $\gamma$ is a discount factor whose value ranges between 0 and 1; c = 1 and c = 2 respectively represent the behavior network and the target network of the robot, and the smaller of the two values is selected.

The present embodiment updates the evaluation network by minimizing the loss value, i.e., minimizing the difference between the target value and the estimated value over the sampled batch, where B is the total number of samples.
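The target value and loss calculation can be illustrated with a small numeric sketch. The target value follows the expression $y = r_z + \gamma \min_c Q_{z,c}$ given above; the mean squared form of the loss is an assumption of this sketch, chosen as one common way to measure the difference between the target value and the estimated value over B samples. The function names are hypothetical.

```python
def target_value(reward, q_candidates, gamma):
    """Compute y = r_z + gamma * min_c Q_{z,c} for one sample."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    return reward + gamma * min(q_candidates)


def critic_loss(targets, estimates):
    """Mean squared difference between targets and estimates (assumed form)."""
    b = len(targets)  # B, the total number of samples
    return sum((y - q) ** 2 for y, q in zip(targets, estimates)) / b


y1 = target_value(reward=1.0, q_candidates=(2.0, 4.0), gamma=0.5)
y2 = target_value(reward=0.0, q_candidates=(6.0, 4.0), gamma=0.5)
loss = critic_loss([y1, y2], [1.0, 4.0])
```

Taking the smaller of the two Q values makes the target more conservative, which is commonly done to limit overestimation of the value.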

S10, repeating steps S8 to S9 until all the evaluation networks have been traversed once.

S11, if the number of iterations reaches a certain interval step length, updating the behavior network using a policy gradient method and updating the target network of each robot; otherwise, executing step S13.

The present embodiment updates the behavior network using a policy gradient method with delayed policy updates, that is, the behavior network is updated less frequently than the evaluation network, and the target network of each robot is updated at the same time. The update of the behavior network aims to maximize the value output by the evaluation network, and its weights are updated using the sampled policy gradient.
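The delayed update schedule can be sketched as a simple counter. This illustration does not perform any learning: the function name `run_updates` and the parameter `policy_delay` (standing for the interval step length) are hypothetical, and the counters only stand in for the actual network updates.

```python
def run_updates(total_iterations, policy_delay):
    """Count how often each network would be updated under a delayed schedule."""
    critic_updates = 0
    actor_updates = 0
    for iteration in range(1, total_iterations + 1):
        # The evaluation network is updated on every iteration (steps S8 to S10).
        critic_updates += 1
        # The behavior network and the target networks are updated only when
        # the iteration count reaches the interval step length (step S11).
        if iteration % policy_delay == 0:
            actor_updates += 1
    return critic_updates, actor_updates


counts = run_updates(total_iterations=10, policy_delay=2)
```

With an interval of 2, the behavior network is updated half as often as the evaluation network, so the policy is adjusted against a value estimate that has had more updates to settle.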

S12, repeating steps S4 to S11 until T cycles have been performed, wherein T is greater than 0.

S13, repeating steps S2 to S12 until a certain number of iterations is reached.

The present embodiment first constructs various 3D simulation environments containing static obstacles using Gazebo on the ROS platform. Each mobile robot is equipped with a 360-degree lidar for detecting selectable target points and for mapping in unknown environments. It has been shown that the platform can reduce the differences between the simulated environment and the real world as much as possible. FIG. 4 shows the simulated training and testing environment.

The input to the neural network is a concatenated vector consisting of rangefinder data (a 48-dimensional vector), the robot's position relative to its previous position (a 2-dimensional vector), and its positions relative to the other mobile robots (a 4-dimensional vector). The input layer is first connected to three fully connected layers, each containing 256 nodes, followed by a gated recurrent unit (GRU) layer containing 256 nodes. The actor network finally outputs an action through the sigmoid function. All actions of the robot and its observations are used as inputs to the evaluation network, which processes them through three FC layers and one GRU layer and outputs Q values through a linear activation function. The multi-robot collaborative exploration method provided by the invention is compared with a baseline, MADDPG (Multi-Agent Deep Deterministic Policy Gradient). The two methods have the same observation space, action space and reward space. MADDPG may explore more in the early stage, but the exploration degree of the algorithm provided by the invention gradually increases and surpasses that of MADDPG over time.

The training curves of the method and of the baseline MADDPG are shown in FIG. 5. It can be seen that the return values of both increase continuously and gradually converge, but the method provided by the invention learns faster and more stably than the baseline MADDPG and converges to a higher value.

The protection scope of the multi-robot collaborative exploration method according to the embodiment of the present application is not limited to the execution sequence of the steps listed in the embodiment, and all schemes in which steps are added, removed, or replaced using the prior art according to the principles of the present application are included in the protection scope of the present application.

The embodiment of the present application further provides a multi-robot collaborative exploration apparatus, which can implement the multi-robot collaborative exploration method of the present application, but the implementation apparatus of the multi-robot collaborative exploration method of the present application includes, but is not limited to, the structure of the multi-robot collaborative exploration apparatus listed in this embodiment, and all structural modifications and replacements in the prior art made according to the principles of the present application are included in the protection scope of the present application.

As shown in fig. 6, the present application provides a multi-robot collaborative exploration apparatus, including: an information obtaining module 310, configured to acquire exploration information of a plurality of robots in the current environment; a behavior prediction module 320, configured to predict the behavior corresponding to each robot according to the exploration information of the robot based on a behavior network; an area division module 330, configured to perform area division on the plurality of robots to obtain a target point set corresponding to each robot; a target point selection module 340, configured to select a moving target point from the target point set according to the evaluation network based on the behavior corresponding to each robot; a path planning module 350, configured to perform path planning according to the moving target point to obtain a moving path corresponding to the robot; and a target point executing module 360, configured to control the robot to move to the moving target point according to the moving path, completing the collaborative exploration task of all the robots in the current environment.

In the embodiment, exploration information of a plurality of robots in the current environment is acquired; then, based on the behavior network, the behavior corresponding to each robot is predicted according to the exploration information of the robot; then, region division is performed on the plurality of robots to obtain a target point set corresponding to each robot; then, based on the behavior corresponding to each robot, a moving target point is selected from the target point set according to the evaluation network; then, path planning is performed according to the moving target point to obtain a moving path corresponding to the robot; and the robot is controlled to move to the moving target point according to the moving path, completing the collaborative exploration task of all the robots in the current environment. The application accurately segments the detection environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robots. Because each robot is provided with its own network, decisions are made in a distributed execution mode, so that as much of the unknown environment as possible can be explored in a short time, while the requirements on the environment structure are relaxed.

In one embodiment, the information obtaining module 310 includes a position information obtaining module configured to: acquiring the current position and the last moving position of any one of the multiple robots and the current positions of other robots in the multiple robots; acquiring the relative position between the current position and the last moving position of any one robot; and acquiring the relative positions between the current position of any one robot and the current positions of other robots respectively.

In this embodiment, the exploration information of the multiple robots may include position information of the multiple robots, where the position information includes a current position and a previous moving position of any one of the multiple robots, and a relative position between the current position and the previous moving position of any one of the multiple robots; and the relative position between the current position of any one robot and the current positions of the other robots, respectively.

In one embodiment, the behavior prediction module 320 includes a behavior prediction sub-module configured to: input the detection information of any one robot into the behavior network; extract the behavior features corresponding to the detection information through a plurality of first fully connected layers and a first gated recurrent unit layer in the behavior network; and output the behavior corresponding to that robot through a first activation function output layer in the behavior network.

In the embodiment, the behavior network comprises a first input layer, a plurality of first fully connected layers, a first gated recurrent unit layer and a first activation function output layer. The detection information of any one robot is input to the first input layer of the behavior network; the behavior features corresponding to the detection information are extracted through the plurality of first fully connected layers and the first gated recurrent unit layer in the behavior network; and the behavior corresponding to that robot is then output through the first activation function output layer in the behavior network.

In one embodiment, the region partitioning module 330 includes a region partitioning sub-module configured to: connect all adjacent robots in the plurality of robots into triangles, and construct the perpendicular bisector of each side of each triangle; enclose the perpendicular bisectors around each robot into a polygon to obtain a polygon area graph corresponding to the plurality of robots; and obtain a target point set corresponding to each robot according to the polygon area graph.

In the embodiment, when the plurality of robots are divided into areas, all adjacent robots in the plurality of robots are connected into triangles, and the perpendicular bisector of each side of each triangle is constructed; then, the perpendicular bisectors around each robot are enclosed into a polygon, and a polygon area graph corresponding to the plurality of robots is obtained; and then a target point set corresponding to each robot is obtained according to the polygon area graph. The polygon area graph comprises a plurality of polygons, each of which contains exactly one robot; every point inside a polygon is closer to the robot of that polygon than to any other robot, and every point on a polygon side is equidistant from the robots on the two sides.
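The equidistance property of the polygon sides follows from the perpendicular bisector construction, which can be checked with a short sketch. The function names `perpendicular_bisector` and `point_on_line` and the example coordinates are hypothetical and serve only as an illustration.

```python
import math


def perpendicular_bisector(p, q):
    """Return (midpoint, direction) of the perpendicular bisector of segment pq."""
    if p == q:
        raise ValueError("the two robot positions must differ")
    mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)
    dx, dy = q[0] - p[0], q[1] - p[1]
    # Rotating the segment direction by 90 degrees gives the bisector direction.
    direction = (-dy, dx)
    return mid, direction


def point_on_line(mid, direction, t):
    """Return the point reached by moving t times the direction from mid."""
    return (mid[0] + t * direction[0], mid[1] + t * direction[1])


p_i, p_j = (0.0, 0.0), (4.0, 2.0)
mid, direction = perpendicular_bisector(p_i, p_j)
edge_point = point_on_line(mid, direction, 1.5)
d_i = math.dist(edge_point, p_i)
d_j = math.dist(edge_point, p_j)
```

Any point produced by `point_on_line` lies at the same distance from both robots, which is why the bisectors can serve as the boundaries between neighboring polygons.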

In one embodiment, the target point selection module 340 includes a target point selection sub-module configured to: input the behavior, the target point set and the exploration information corresponding to any one robot into the evaluation network; extract the target point features corresponding to that robot through a plurality of second fully connected layers and a second gated recurrent unit layer in the evaluation network; and output the moving target point corresponding to that robot through a first activation function output layer in the actor network.

In the embodiment, the evaluation network comprises a second input layer, a plurality of second fully connected layers and a second gated recurrent unit layer. The behavior, the target point set and the exploration information corresponding to any one robot can be input into the evaluation network; the target point features corresponding to that robot are extracted through the plurality of second fully connected layers and the second gated recurrent unit layer in the evaluation network; and the moving target point corresponding to that robot is then output through a first activation function output layer in the actor network.

In one embodiment, the path planning module 350 includes a path planning sub-module configured to: take the current position of any one robot as the planning start point and the moving target point of that robot as the planning target point, and complete the path planning of that robot using a heuristic search algorithm.

In the embodiment, the feasible track from the current position of the robot to the moving target point can be determined through the heuristic search algorithm, and the heuristic search algorithm can be utilized to independently design a path for each robot to facilitate obstacle avoidance walking of each robot, so that the efficiency of multi-robot collaborative exploration is improved.

In one embodiment, the multi-robot collaborative exploration apparatus further includes a network update module configured to: after the robot is controlled to move to the moving target point according to the moving path, acquire the reward value of any one robot and the exploration information of that robot after it moves to the moving target point; calculate a target value function value according to the reward value of the robot and the exploration information acquired after the robot moves to the moving target point; and calculate a loss value according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network.

In the embodiment, after the robot is controlled to move to the moving target point according to the moving path, the reward value of any one robot and the exploration information of any one robot after moving to the moving target point can be acquired; then, a target value function value is calculated according to the reward value of any one robot and the exploration information after any one robot moves to a moving target point; and calculating a loss value according to the target value function value, wherein the loss value is used for updating the parameters of the evaluation network. The network precision is improved, and meanwhile, the efficiency of multi-robot collaborative exploration is also improved.

In specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily, and implemented as the same or several entities, and specific implementations of the above modules may refer to the foregoing method embodiment, which is not described herein again.

As can be seen from the above, the multi-robot collaborative exploration apparatus provided by the present application can accurately segment the detection environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robots. Because each robot is provided with its own network, decisions are made in a distributed execution mode, so that as much of the unknown environment as possible can be explored in a short time, while the requirements on the environment structure are relaxed.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, or method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division into modules/units is only a division by logical function, and other divisions are possible in an actual implementation; for example, a plurality of modules or units may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, modules or units, and may be in an electrical, mechanical or other form.

Modules/units described as separate parts may or may not be physically separate, and parts displayed as modules/units may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules/units can be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in the embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units are integrated into one module/unit.

It will be further appreciated by those of ordinary skill in the art that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both, and that the components and steps of the examples have been described generally in terms of their functions in the foregoing description in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application further provides the electronic equipment which can be a terminal, a server and the like. The terminal can be a mobile phone, a tablet computer, an intelligent Bluetooth device, a notebook computer, a personal computer and the like; the server may be a single server, a server cluster composed of a plurality of servers, and the like.

In some embodiments, the multi-robot collaborative exploration apparatus provided by the present application may also be integrated into a plurality of electronic devices; for example, the multi-robot collaborative exploration apparatus may be integrated into a plurality of servers, and the multi-robot collaborative exploration method of the present application is implemented by the plurality of servers.

In this embodiment, the electronic device of this embodiment is taken as a server for example to describe in detail, for example, as shown in fig. 7, it shows a schematic structural diagram of the server according to the embodiment of the present application, and specifically:

the server may include components such as a processor 410 of one or more processing cores, memory 420 of one or more computer-readable storage media, a power supply 430, an input module 440, and a communication module 450. Those skilled in the art will appreciate that the server architecture shown in FIG. 7 is not meant to be limiting, and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:

the processor 410 is a control center of the server, connects various parts of the entire server using various interfaces and lines, performs various functions of the server and processes data by running or executing software programs and/or modules stored in the memory 420 and calling data stored in the memory 420, thereby performing overall monitoring of the server. In some embodiments, processor 410 may include one or more processing cores; in some embodiments, the processor 410 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 410.

The memory 420 may be used to store software programs and modules, and the processor 410 executes various functional applications and data processing by operating the software programs and modules stored in the memory 420. The memory 420 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the server, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, memory 420 may also include a memory controller to provide processor 410 access to memory 420.

The server also includes a power supply 430 for supplying power to the various components, and in some embodiments, the power supply 430 may be logically connected to the processor 410 via a power management system, so that the power management system performs functions of managing charging, discharging, and power consumption. The power supply 430 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.

The server may further include an input module 440, and the input module 440 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

The server may also include a communication module 450, and in some embodiments the communication module 450 may include a wireless module, through which the server may wirelessly transmit over short distances to provide wireless broadband internet access to the user. For example, the communication module 450 may be used to assist a user in emailing, browsing web pages, accessing streaming media, and the like.

Although not shown, the server may further include a display unit and the like, which will not be described in detail herein. Specifically, in the present embodiment, the processor 410 in the server loads the executable file corresponding to the process of one or more application programs into the memory 420 according to the following instructions, and the processor 410 runs the application programs stored in the memory 420, thereby implementing the various functions of the multi-robot collaborative exploration apparatus.

The server of the embodiment can first acquire exploration information of a plurality of robots in the current environment; then, based on the behavior network, predict the behavior corresponding to each robot according to the exploration information of the robot; then perform region division on the plurality of robots to obtain a target point set corresponding to each robot; then, based on the behavior corresponding to each robot, select a moving target point from the target point set according to the evaluation network; then perform path planning according to the moving target point to obtain a moving path corresponding to the robot; and control the robot to move to the moving target point according to the moving path, completing the collaborative exploration task of all the robots in the current environment. The application accurately segments the detection environment of each robot, thereby reducing the possibility of repeated exploration and maximizing the exploration range of the robots. Because each robot is provided with its own network, decisions are made in a distributed execution mode, so that as much of the unknown environment as possible can be explored in a short time, while the requirements on the environment structure are relaxed.

In some embodiments, the present application further provides a computer-readable storage medium. It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing a processor, and the program may be stored in a computer-readable storage medium, which is a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state drive, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.

Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. The procedures or functions according to the embodiments of the present application are generated in whole or in part when the computer instructions are loaded and executed on a computing device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.).

When the computer program product is executed by a computer, the computer executes the method of the aforementioned method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case it is desired to use the method as described above.

The descriptions of the flows or structures corresponding to the above-mentioned drawings have their respective emphasis, and a part that is not described in detail in a certain flow or structure may refer to the related descriptions of other flows or structures.

The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which may be made by those skilled in the art without departing from the spirit and technical spirit of the present disclosure be covered by the claims of the present application.

Claims

1. A multi-robot collaborative exploration method is characterized by comprising the following steps:

acquiring exploration information of a plurality of robots in the current environment;

predicting behaviors corresponding to the robots according to the exploration information of the robots based on a behavior network;

dividing the plurality of robots into areas to obtain a target point set corresponding to each robot;

selecting a moving target point from the target point set according to an evaluation network based on the corresponding behaviors of the robots;

planning a path according to the moving target point to obtain a moving path corresponding to the robot;

and controlling the robot to move to the moving target point according to the moving path, thereby completing the collaborative exploration tasks of all the robots in the current environment.

2. The method of claim 1, wherein the acquiring exploration information of a plurality of robots in the current environment comprises:

acquiring the current position and the last moving position of any one of the plurality of robots, and the current positions of the other robots in the plurality of robots;

acquiring a relative position between the current position of any one of the robots and the last moving position;

and acquiring the relative positions between the current position of any one robot and the current positions of other robots respectively.
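
The acquisition steps of claim 2 can be illustrated with a minimal sketch, assuming 2D coordinates and a dictionary layout for the exploration information. The function name `exploration_info` and the keys `position`, `own_motion`, and `to_others` are hypothetical and are not defined in the application.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]


def exploration_info(robot_id: str,
                     current: Dict[str, Point],
                     previous: Dict[str, Point]) -> dict:
    """Assemble the exploration information of one robot (assumed layout)."""
    cx, cy = current[robot_id]
    px, py = previous[robot_id]
    # Relative position between the current position and the last moving position.
    own_motion = (cx - px, cy - py)
    # Relative positions between this robot and each of the other robots.
    to_others = {
        other: (ox - cx, oy - cy)
        for other, (ox, oy) in current.items()
        if other != robot_id
    }
    return {"position": (cx, cy), "own_motion": own_motion, "to_others": to_others}


current = {"r1": (2.0, 3.0), "r2": (5.0, 7.0)}
previous = {"r1": (1.0, 1.0), "r2": (5.0, 6.0)}
info = exploration_info("r1", current, previous)
```

In this sketch each robot builds its own record from shared positions, which is consistent with the distributed execution described in the abstract.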

3. The method of claim 2, wherein the predicting behaviors corresponding to the robots according to the exploration information of the robots based on a behavior network comprises:

inputting the exploration information of any one robot into the behavior network;

extracting behavior features corresponding to the exploration information through a plurality of first fully connected layers and a first gated recurrent unit (GRU) layer in the behavior network;

and outputting the corresponding behavior of any one robot through a first activation function output layer in the behavior network.
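
The layer arrangement of claim 3 can be sketched as a forward pass in plain Python. This is an illustration under stated assumptions, not the network of the application: the class name `BehaviorNet`, the layer sizes, the random initialization, the ReLU activations, and the choice of softmax as the output activation are all assumptions, since the claim does not specify them.

```python
import math
import random


def linear(W, b, x):
    # Fully connected layer: y = W x + b.
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def init(rows, cols, rng):
    # Small random weights and zero biases (assumed initialization).
    W = [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return W, [0.0] * rows


class BehaviorNet:
    def __init__(self, n_in, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.fc1 = init(n_hidden, n_in, rng)       # first fully connected layer
        self.fc2 = init(n_hidden, n_hidden, rng)   # second fully connected layer
        # GRU gates act on the concatenation [x, h].
        self.wz = init(n_hidden, 2 * n_hidden, rng)  # update gate
        self.wr = init(n_hidden, 2 * n_hidden, rng)  # reset gate
        self.wh = init(n_hidden, 2 * n_hidden, rng)  # candidate state
        self.out = init(n_actions, n_hidden, rng)    # output layer

    def forward(self, obs, h):
        x = relu(linear(*self.fc1, obs))
        x = relu(linear(*self.fc2, x))
        z = sigmoid(linear(*self.wz, x + h))
        r = sigmoid(linear(*self.wr, x + h))
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in linear(*self.wh, x + gated)]
        h_new = [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
        # Softmax output: a probability for each candidate behavior.
        return softmax(linear(*self.out, h_new)), h_new


net = BehaviorNet(n_in=4, n_hidden=8, n_actions=3)
probs, hidden = net.forward([0.1, 0.2, -0.3, 0.4], [0.0] * 8)
```

The returned hidden state would be fed back into the next call, so the recurrent layer can carry information across exploration steps.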

4. The method of claim 3, wherein the dividing the plurality of robots into areas to obtain a target point set corresponding to each robot comprises:

connecting all adjacent robots in the plurality of robots into triangles, and constructing the perpendicular bisector of each side of each triangle;

forming a polygon around each robot from the perpendicular bisectors surrounding that robot, to obtain a polygonal area graph corresponding to the plurality of robots;

and obtaining a target point set corresponding to each robot according to the polygonal area graph.
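
The perpendicular-bisector construction of claim 4 appears to correspond to a Voronoi diagram of the robot positions, in which a point belongs to a robot's polygon exactly when that robot is the nearest one. Under that assumption, the target point sets can be sketched without building the polygons explicitly. The function name `partition_targets` and the notion of a list of candidate target points are hypothetical.

```python
from math import hypot


def partition_targets(robots, candidates):
    """Assign each candidate target point to the nearest robot.

    robots: dict mapping robot id to (x, y) position.
    candidates: list of (x, y) candidate target points (assumed input).
    """
    target_sets = {rid: [] for rid in robots}
    for p in candidates:
        nearest = min(
            robots,
            key=lambda rid: hypot(p[0] - robots[rid][0], p[1] - robots[rid][1]),
        )
        target_sets[nearest].append(p)
    return target_sets


robots = {"a": (0.0, 0.0), "b": (10.0, 0.0)}
candidates = [(1.0, 1.0), (9.0, 1.0), (4.0, 0.0)]
target_sets = partition_targets(robots, candidates)
```

Because every point falls in exactly one set, no two robots receive the same target point, which is how the division reduces repeated exploration.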

5. The method of claim 1, wherein the selecting the moving target point from the target point set according to the evaluation network based on the behavior corresponding to each robot comprises:

inputting the behavior, the target point set and the exploration information corresponding to any one robot into the evaluation network;

extracting the target point features corresponding to any one robot through a plurality of second fully connected layers and a second gated recurrent unit (GRU) layer in the evaluation network;

and outputting the moving target point corresponding to any one robot through a second activation function output layer in the evaluation network.

6. The method of claim 5, wherein the planning a path according to the moving target point to obtain a moving path corresponding to the robot comprises:

taking the current position of any one robot as the planning start point and the moving target point of that robot as the planning target point, and completing the path planning of that robot by a heuristic search algorithm.
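
Claim 6 does not name a particular heuristic search algorithm. As one possible instance, the following sketch uses A* on a 4-connected occupancy grid with a Manhattan-distance heuristic; the grid representation, the function name `astar`, and the unit step cost are assumptions.

```python
import heapq


def astar(grid, start, goal):
    """Return a list of cells from start to goal, or None if unreachable.

    grid[r][c] == 1 marks an obstacle; start and goal are (row, col) cells.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        # Manhattan distance: admissible for 4-connected unit-cost moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]
    came_from = {start: None}
    cost = {start: 0}
    while open_heap:
        _, g, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        if g > cost[cell]:
            continue  # stale heap entry, a cheaper route was already found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < cost.get(nxt, float("inf")):
                    cost[nxt] = ng
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None


grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
path = astar(grid, start=(0, 0), goal=(2, 0))
blocked = astar([[0, 1], [1, 0]], start=(0, 0), goal=(1, 1))
```

Here `start` plays the role of the planning start point (the robot's current position) and `goal` the planning target point (its moving target point).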

7. The method of claim 6, wherein the method further comprises:

when the robot is controlled to move to the moving target point according to the moving path, acquiring the reward value of any one robot and the exploration information of that robot after it moves to the moving target point;

calculating a target value function value according to the reward value of any one robot and the exploration information of that robot after it moves to the moving target point;
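
The formula for the target value function value is not given in this claim. A common choice in actor-critic methods is the one-step temporal-difference target, shown below purely as an assumption; the function name `td_target`, the discount factor `gamma`, and the `done` flag are hypothetical, and `next_value` stands for the evaluation network's estimate for the exploration information after the move.

```python
def td_target(reward, next_value, gamma=0.95, done=False):
    """One-step TD target: r + gamma * V(s'), or r alone at episode end."""
    if done:
        return reward
    return reward + gamma * next_value


y = td_target(reward=1.0, next_value=2.0, gamma=0.5)
y_final = td_target(reward=1.0, next_value=2.0, done=True)
```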

8. A multi-robot collaborative exploration apparatus, the apparatus comprising:

the information acquisition module is used for acquiring exploration information of a plurality of robots in the current environment;

the behavior prediction module is used for predicting the corresponding behaviors of the robots according to the exploration information of the robots based on a behavior network;

the area division module is used for carrying out area division on the plurality of robots to obtain a target point set corresponding to each robot;

the target point selection module is used for selecting a moving target point from the target point set according to an evaluation network based on the corresponding behaviors of the robots;

the path planning module is used for planning a path according to the moving target point to obtain a moving path corresponding to the robot;

and the target point execution module is used for controlling the robot to move to the moving target point according to the moving path and completing the collaborative exploration tasks of all the robots in the current environment.

9. An electronic device, characterized in that the electronic device comprises:

a memory storing a plurality of instructions;

a processor loading instructions from the memory to perform the steps in the multi-robot collaborative exploration method according to any one of claims 1 to 7.

10. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a multi-robot collaborative exploration apparatus, implements the steps in the multi-robot collaborative exploration method according to any one of claims 1 to 7.