CN117873089A - Multi-mobile robot cooperation path planning method based on clustering PPO algorithm - Google Patents

Multi-mobile robot cooperation path planning method based on clustering PPO algorithm

Info

Publication number
CN117873089A
Authority
CN
China
Prior art keywords
mobile robot
clustering
network
updating
path
Prior art date
Legal status
Granted
Application number
CN202410036441.0A
Other languages
Chinese (zh)
Other versions
CN117873089B (en)
Inventor
李骏
李马兵
夏鹏程
曾振平
于霄
Current Assignee
Nanjing University of Science and Technology
Original Assignee
Nanjing University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Nanjing University of Science and Technology filed Critical Nanjing University of Science and Technology
Priority to CN202410036441.0A priority Critical patent/CN117873089B/en
Publication of CN117873089A publication Critical patent/CN117873089A/en
Application granted granted Critical
Publication of CN117873089B publication Critical patent/CN117873089B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a path planning method for the cooperation of multiple mobile robots based on a clustering PPO algorithm, which comprises the following steps: S1, collecting the position information of all targets, and cleaning and standardizing the target data; S2, allocating target nodes using a K-means clustering algorithm; S3, optimizing the path of each mobile robot using a PPO algorithm; and S4, updating the policy network using the PPO algorithm, and stopping training when the policy network is stable or the preset number of iterations is reached. Combining a clustering algorithm with a deep reinforcement learning algorithm offers a new approach to path planning and has significant practical value for improving warehousing efficiency and reducing transportation costs.

Description

Multi-mobile robot cooperation path planning method based on clustering PPO algorithm
Technical Field
The invention relates to the technical field of warehouse logistics in an industrial field, in particular to a path planning method for cooperation of multiple mobile robots based on a clustering PPO algorithm.
Background
In the field of warehouse logistics at industrial sites, path planning for mobile robots can improve production efficiency and enhance the flexibility of a manufacturing system. In a given working environment, path planning must simultaneously consider avoiding obstacles, minimizing movement time, reducing energy consumption, and ensuring the coordination and safety of the robot system.
As the degree of industrial automation rises, more and more mobile robots are deployed in the same working area. The path planning system must therefore plan a path for each mobile robot while balancing the task load among the robots and minimizing the maximum task completion time.
Traditional path planning methods such as exact algorithms and heuristic algorithms suffer from high computational complexity, are difficult to apply to large-scale scenarios, and struggle to handle load balancing between mobile robots and tasks.
Disclosure of Invention
The invention aims to provide a path planning method for the cooperation of multiple mobile robots based on a clustering PPO algorithm, wherein a K-means clustering algorithm is used to allocate targets to the mobile robots, and a PPO algorithm is then used to optimize the path of each mobile robot. With this method, a solution close to the minimum of the maximum task completion time can be found while solving efficiency is maintained. Combining a clustering algorithm with a deep reinforcement learning algorithm provides a new approach to path planning and is expected to play an important role in practical applications.
In order to achieve the above purpose, the present invention provides a path planning method for cooperation of multiple mobile robots based on a clustering PPO algorithm, comprising the following steps:
S1, collecting the position information of all targets, and cleaning and standardizing the target data;
S2, performing target node allocation using a K-means clustering algorithm;
S3, optimizing the path of each mobile robot using a PPO algorithm;
and S4, updating the policy network using the PPO algorithm, and stopping training when the policy network is stable or the preset number of iterations is reached.
Preferably, in step S1, the collected position information of all the targets is expressed as:
$s=(x_1,x_2,\ldots,x_K)$;
the target data are cleaned, any missing or abnormal values are processed, and the target coordinates are standardized; the target node index set is then obtained, expressed as:
$\{0,1,2,\ldots,N\}$;
wherein node 0 represents the start point and the end point of each mobile robot.
Preferably, the step S2 specifically includes the following steps:
selecting a certain number of mobile robots, and setting the index set of the mobile robots as $\mathcal{M}=\{1,2,\ldots,M\}$;
S21, initializing a cluster center, distributing target nodes to each mobile robot by using a K-means clustering algorithm, and updating the cluster center according to a distribution result, wherein the cluster center is expressed as:
$C(i)=\arg\min_{j\in\{1,2,\ldots,K\}}\mathrm{distance}(x_i,y_j)$
wherein $C(i)$ is the index of the cluster center to which target $i$ is assigned, $x_i$ is the coordinate of target $i$, $y_j$ is the coordinate of cluster center $j$, and $\mathrm{distance}(\cdot)$ is the Euclidean distance between two points;
S22, updating the clustering result: the cluster center is recalculated as the average position of all data points in the cluster for which each mobile robot is responsible:
$L_j=\frac{1}{|S_j|}\sum_{x_i\in S_j} x_i$
wherein $L_j$ is the coordinate of the cluster center, $S_j$ is the set of targets assigned to the mobile robot, and $|S_j|$ is the number of nodes in set $S_j$;
allocation matrix u= [ U ] nm ]Is an N M matrix, where u nm Representing the allocation of the nth destination node to the mth mobile robot, the allocation function is represented as:
S23, repeating steps S21 and S22, stopping the update when the within-cluster sum of squared errors converges below a threshold or the preset number of iterations is reached, and checking the balance of the target allocation.
Preferably, in step S3, optimizing the path of each robot by the PPO algorithm specifically comprises:
initializing a policy network for each mobile robot, generating a path for each mobile robot by updating the policy network, and calculating the return of each path according to the moving distance or cost of the mobile robot.
Preferably, in step S3, the action space of mobile robot $m$ is denoted $A_m$, the state space $S_m$, and the reward function $r_m$; the action space of mobile robot $m$ at time step $t$ is expressed as:
$A_m(t)=\{s_{km}\mid s_{km}\in V_m,\ k=t+1\}$
the state space of mobile robot $m$ is built from its assigned target node set
$V_m=\{n\mid u_{nm}=1\}$,
i.e., the target nodes accessed by mobile robot $m$, $m\in\mathcal{M}$,
the reward function of mobile robot $m$ is determined by the distance from the node at the current time step $t$ to the node at the next time step $t+1$, together with a repeated-access penalty; defining $\pi_m(t)$ as the access policy of the mobile robot, $r_m$ is expressed as:
$r_m=-\mathrm{distance}(\pi_m(t),\pi_m(t+1))-\lambda r_{\mathrm{collision}}(t)$;
the cumulative reward function $R_m$ of mobile robot $m$ is then expressed as:
$R_m=\sum_{t=1}^{T} r_m(t)$,
wherein $\mathrm{distance}$ denotes the distance between two nodes, $\lambda$ is the weight coefficient of the repeated-access penalty, and $r_{\mathrm{collision}}$ is the repeated-access penalty.
Preferably, step S4 comprises the steps of:
S41, limiting the magnitude of the policy gradient update by proximal clipping, combined with the advantage function and importance sampling; the objective function is expressed as:
$L^{\mathrm{CLIP}}(\theta)=\hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\right)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$
wherein $\pi_\theta$ denotes the current policy, $\pi_{\theta_{\mathrm{old}}}$ the old policy, $\hat{A}_t$ the advantage function, and $\epsilon$ a hyperparameter controlling the clipping strength;
S42, updating the parameters of the neural networks by the PPO algorithm: initializing the policy network $\pi_\theta(a_t\mid s_t)$ and the value network $V_\phi(s_t)$, wherein $\theta$ and $\phi$ respectively denote the parameters of the policy network and the value network;
updating the parameters of the policy network by gradient ascent:
$\theta\leftarrow\theta+\alpha\nabla_\theta L^{\mathrm{CLIP}}(\theta)$;
updating the parameters of the value network by gradient descent:
$\phi\leftarrow\phi-\beta\nabla_\phi\,\hat{\mathbb{E}}_t\!\left[\big(V_\phi(s_t)-R_t\big)^2\right];$
Iterative updating is repeated until the mobile robots reach the maximum number of training episodes, and training stops; when the cumulative reward of the mobile robots converges to its maximum value, the optimal node-traversal path policy has been learned.
Therefore, the path planning method for cooperation of the multiple mobile robots based on the clustering PPO algorithm has the following beneficial effects:
(1) Through the clustering PPO algorithm, the invention obtains a good balance between stability and sample efficiency by limiting the step size of the policy update, and can rapidly and effectively allocate targets to the mobile robots, thereby ensuring that the task burden of each mobile robot is balanced and reducing the overall moving distance.
(2) According to the invention, the data set is divided into a pre-designated number of clusters (K) by the K-means clustering method, and samples are distributed to the nearest clusters by iterative optimization, so that the K-means clustering algorithm can be used for efficiently distributing targets to different mobile robots, and the task load balance of the mobile robots is ensured.
(3) The training method of the invention avoids large fluctuations when updating the policy and ensures the stability of the learning process, which is particularly important for complex path planning problems because of the large number of variables and constraints involved.
(4) The invention can adapt to the path planning problems of different scales and complexity, and can maintain good performance even under the condition of more targets or more mobile robots.
(5) According to the invention, the target allocation and the path optimization are processed separately, so that the calculation cost is reduced, and the solution of the large-scale path planning problem becomes more feasible.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is an overall flow chart of an embodiment of a path planning method for multi-mobile robot collaboration based on a clustered PPO algorithm of the present invention;
FIG. 2 is a graph of clustering partition results of target nodes of an embodiment of a path planning method for multi-mobile robot cooperation based on a clustering PPO algorithm;
FIG. 3 is an optimal path diagram of each industrial mobile robot of an embodiment of a path planning method for multi-mobile robot cooperation based on a clustered PPO algorithm of the present invention;
FIG. 4 is a training result diagram of an embodiment of a path planning method for multi-mobile robot cooperation based on PPO algorithm of the present invention;
fig. 5 is a training result diagram of an embodiment of a path planning method of multi-mobile robot cooperation based on a clustering PPO algorithm of the present invention.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
Considering complex target environments and congestion, the method combines a clustering algorithm with a deep reinforcement learning algorithm and applies them to the path planning problem of multiple industrial mobile robots, so as to optimize the robots' path selection and improve overall transportation efficiency. The goal is to find a set of paths that, subject to the given constraints, minimizes the maximum task completion time of the mobile robots.
As shown in fig. 1, the path planning method for multi-mobile robot cooperation based on the clustering PPO algorithm comprises the following steps:
S1, collecting the position information of all targets, including coordinate data, and expressing the collected position information of all targets as:
$s=(x_1,x_2,\ldots,x_K)$;
the target data are cleaned, any missing or abnormal values are processed, and the target coordinates are standardized to ensure that the data are on the same scale; the target node index set is then obtained, expressed as:
$\{0,1,2,\ldots,N\}$;
wherein node 0 represents the start point and the end point of each mobile robot.
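The patent gives no reference code; purely as a minimal Python sketch of step S1, assuming that standardization means z-score normalization and that missing values appear as NaN rows (the function name and these details are illustrative assumptions, not part of the patent):

```python
import numpy as np

def preprocess_targets(raw_coords):
    """Clean and standardize target coordinates s = (x_1, ..., x_K)."""
    coords = np.asarray(raw_coords, dtype=float)
    coords = coords[~np.isnan(coords).any(axis=1)]   # drop rows with missing values
    mean, std = coords.mean(axis=0), coords.std(axis=0)
    normalized = (coords - mean) / np.where(std > 0, std, 1.0)  # same scale per axis
    # Indices 1..K for the targets; index 0 is reserved for the common
    # start/end point of every mobile robot.
    node_index = np.arange(1, len(normalized) + 1)
    return normalized, node_index
```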
S2, as shown in FIG. 2, the differently colored areas represent the area for which each industrial mobile robot is responsible; target node allocation is performed using the K-means clustering algorithm. Step S2 specifically comprises the following steps:
selecting a certain number of mobile robots, and setting the index set of the mobile robots as $\mathcal{M}=\{1,2,\ldots,M\}$;
S21, initializing a cluster center, distributing target nodes to each mobile robot by using a K-means clustering algorithm, and updating the cluster center according to a distribution result, wherein the cluster center is expressed as:
$C(i)=\arg\min_{j\in\{1,2,\ldots,K\}}\mathrm{distance}(x_i,y_j)$
wherein $C(i)$ is the index of the cluster center to which target $i$ is assigned, $x_i$ is the coordinate of target $i$, $y_j$ is the coordinate of cluster center $j$, and $\mathrm{distance}(\cdot)$ is the Euclidean distance between two points;
S22, updating the clustering result: the cluster center is recalculated as the average position of all data points in the cluster for which each mobile robot is responsible:
$L_j=\frac{1}{|S_j|}\sum_{x_i\in S_j} x_i$
wherein $L_j$ is the coordinate of the cluster center, $S_j$ is the set of targets assigned to the mobile robot, and $|S_j|$ is the number of nodes in set $S_j$;
allocation matrix u= [ U ] nm ]Is an N M matrix, where u nm Representing the allocation of the nth destination node to the mth mobile robot, the allocation function is represented as:
S23, repeating steps S21 and S22 and continuously iterating the allocation process, stopping the update when the within-cluster sum of squared errors converges below a threshold or the preset number of iterations is reached, and checking the balance of the target allocation to ensure that no mobile robot is overloaded or underloaded.
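As an illustration of steps S21 to S23, the following Python sketch performs the allocation with scikit-learn's KMeans and builds the allocation matrix U; the balance check and all names are illustrative assumptions rather than the patent's own implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def allocate_targets(coords, num_robots, max_iter=300, tol=1e-4):
    """Assign each target node to one of M mobile robots via K-means."""
    km = KMeans(n_clusters=num_robots, max_iter=max_iter, tol=tol, n_init=10)
    labels = km.fit_predict(coords)          # C(i): cluster index for target i
    # Allocation matrix U (N x M): u[n, m] = 1 if node n goes to robot m.
    U = np.zeros((len(coords), num_robots), dtype=int)
    U[np.arange(len(coords)), labels] = 1
    loads = U.sum(axis=0)                    # crude balance check on cluster sizes
    if loads.max() > 2 * loads.mean():
        print("warning: unbalanced target allocation:", loads)
    return U, km.cluster_centers_
```

scikit-learn's KMeans already iterates the S21/S22 assignment and centroid updates internally until the inertia (the within-cluster sum of squared errors) converges below tol or max_iter is reached.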
S3, optimizing the path of each mobile robot by using a PPO algorithm;
a policy network is initialized for each mobile robot, a path is generated for each mobile robot by updating the policy network, and the return of each path is calculated according to the moving distance or cost of the mobile robot.
The action space of mobile robot $m$ is denoted $A_m$, the state space $S_m$, and the reward function $r_m$. The action space of mobile robot $m$ at time step $t$ is expressed as:
$A_m(t)=\{s_{km}\mid s_{km}\in V_m,\ k=t+1\}$
That is, mobile robot $m$ can only pick nodes from its own target set: at time step $t$, the action set of mobile robot $m$ consists of the target nodes it may access at time step $t+1$. Note that $k=t+1$ here, since the mobile robot is assumed to already be at target $s_{km}$ at time step $t$ and ready to move to the next target.
The state space of mobile robot $m$ is built from its assigned target node set
$V_m=\{n\mid u_{nm}=1\}$,
i.e., the target nodes accessed by mobile robot $m$, $m\in\mathcal{M}$.
The reward function of mobile robot $m$ is determined by the distance from the node at the current time step $t$ to the node at the next time step $t+1$, together with a repeated-access penalty; defining $\pi_m(t)$ as the access policy of the mobile robot, $r_m$ is expressed as:
$r_m=-\mathrm{distance}(\pi_m(t),\pi_m(t+1))-\lambda r_{\mathrm{collision}}(t)$;
the cumulative reward function $R_m$ of mobile robot $m$ is then expressed as:
$R_m=\sum_{t=1}^{T} r_m(t)$,
wherein $\mathrm{distance}$ denotes the distance between two nodes, $\lambda$ is the weight coefficient of the repeated-access penalty, and $r_{\mathrm{collision}}$ is the repeated-access penalty.
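A minimal Python sketch of this reward, assuming Euclidean node distances and a unit repeated-access penalty (both are illustrative assumptions; the patent does not fix these values):

```python
import numpy as np

def step_reward(pos_t, pos_t1, already_visited, lam=1.0):
    """r_m = -distance(pi_m(t), pi_m(t+1)) - lambda * r_collision(t)."""
    dist = float(np.linalg.norm(np.asarray(pos_t1) - np.asarray(pos_t)))
    r_collision = 1.0 if already_visited else 0.0   # penalty for revisiting a node
    return -dist - lam * r_collision

# The cumulative reward R_m is the sum of step_reward over the episode's time slots.
```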
S4, updating the policy network by the PPO algorithm, and stopping training when the policy network is stable or the preset number of iterations is reached;
Each mobile robot initializes its own policy network and value network and exchanges information with the environment; the optimal action is estimated from the policy network, and a state value is output by the value network.
The policy network requires the output of the value network to calculate the advantage function, while the value network requires the data generated by the policy network to calculate the state value. This dependency means that the performance of the two networks affects each other during training.
The policy network and the value network each comprise an input layer, two hidden layers, and an output layer. The input layer takes the state of the mobile robot as input; its dimension depends on the size of the mobile robot's state space. The policy network and the value network share the two hidden layers, which are fully connected layers with ReLU activation functions between layers, and then branch into their respective output layers; the number of neurons in the hidden layers is customized according to the complexity of the problem. The policy network outputs a softmax layer whose size equals the size of the action space, representing the probability distribution over actions. The value network outputs a scalar representing the state value estimate, thereby completing the construction of the mobile robot's policy network and value network.
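As one possible reading of this architecture, the following PyTorch sketch builds a shared-trunk network with a softmax policy head and a scalar value head; the hidden width of 128 is an assumed value, since the patent leaves the neuron count to be customized per problem:

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Policy and value heads on two shared fully connected hidden layers."""

    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(                   # shared hidden layers, ReLU
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden, action_dim)  # softmax over actions
        self.value_head = nn.Linear(hidden, 1)            # scalar state-value estimate

    def forward(self, state):
        h = self.trunk(state)
        action_probs = torch.softmax(self.policy_head(h), dim=-1)
        state_value = self.value_head(h)
        return action_probs, state_value
```

Sharing the trunk mirrors the description above: the two hidden layers are common, and the policy and value outputs branch from them.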
The specific implementation steps are as follows:
S41, limiting the magnitude of the policy gradient update by proximal clipping, combined with the advantage function and importance sampling; the objective function is expressed as:
$L^{\mathrm{CLIP}}(\theta)=\hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\right)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$
wherein $\pi_\theta$ denotes the current policy, $\pi_{\theta_{\mathrm{old}}}$ the old policy, $\hat{A}_t$ the advantage function, and $\epsilon$ a hyperparameter controlling the clipping strength;
S42, the core of the PPO algorithm is updating the parameters of the neural networks: initialize the policy network $\pi_\theta(a_t\mid s_t)$ and the value network $V_\phi(s_t)$, wherein $\theta$ and $\phi$ respectively denote the parameters of the policy network and the value network;
updating the parameters of the policy network by gradient ascent:
$\theta\leftarrow\theta+\alpha\nabla_\theta L^{\mathrm{CLIP}}(\theta)$;
updating the parameters of the value network by gradient descent:
$\phi\leftarrow\phi-\beta\nabla_\phi\,\hat{\mathbb{E}}_t\!\left[\big(V_\phi(s_t)-R_t\big)^2\right].$
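A sketch of S41/S42 in PyTorch using the standard clipped surrogate, assuming the ActorCritic sketch above and precomputed, detached old log-probabilities, returns, and advantages; the epoch count and loss weighting are illustrative:

```python
import torch

def ppo_update(net, optimizer, states, actions, old_log_probs, returns,
               advantages, eps=0.2, value_coef=0.5, epochs=4):
    """Clipped-surrogate ascent for the policy, squared-error descent for the value."""
    for _ in range(epochs):
        probs, values = net(states)
        dist = torch.distributions.Categorical(probs)
        ratio = torch.exp(dist.log_prob(actions) - old_log_probs)  # importance weight
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
        policy_loss = -torch.min(ratio * advantages,
                                 clipped * advantages).mean()      # -L^CLIP
        value_loss = (values.squeeze(-1) - returns).pow(2).mean()
        loss = policy_loss + value_coef * value_loss  # minimizing ascends the surrogate
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```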
Iterative updating is repeated until the mobile robot reaches the maximum number of training episodes, at which point training stops. As shown in fig. 3, the differently colored paths represent the movement path of each industrial mobile robot; when the cumulative reward of the mobile robot converges to its maximum value, the optimal node-traversal path policy has been learned.
Examples
A Python simulation is carried out on a computer running the Windows 11 operating system; the specific scenario is as follows:
representing path topology states of target clusters as graphsWherein-> Representing node set,/->Edge e is the edge set i,j Epsilon represents the path of the mobile robot from target node i to target node j.
According to the cluster partition model, the topology of the graph $G$ changes dynamically. After the partition is completed, each mobile robot has its own target node set, expressed as $V_m$. Time is discretized into time slots $t\in\{1,2,\ldots,T\}$; in each time slot, each mobile robot selects one target node from its own node set and saves it in its own history information, and a node can be selected only once.
After the target data nodes are cleaned, normalized and clustered, they enter the reinforcement learning module for training; in the subsequent simulation, when the mobile robots reach the maximum number of training episodes, the simulation ends. The specific simulation parameters are shown in Table 1:
Table 1. Main simulation parameters
Target area: 100 m × 100 m
Number of target nodes: 50
Number of mobile robots: 5
Maximum number of training episodes: 10000
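Purely to show how the sketches above fit the Table 1 scenario, a hypothetical wiring follows; it assumes preprocess_targets, allocate_targets, and ActorCritic from the earlier sketches are in scope, the dimensioning is illustrative, and the rollout loop is only outlined in comments:

```python
import numpy as np
import torch

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 100.0, size=(50, 2))           # 50 targets in a 100 m x 100 m area
coords, node_index = preprocess_targets(raw)          # S1: clean and normalize
U, centers = allocate_targets(coords, num_robots=5)   # S2: K-means allocation

num_nodes = coords.shape[0]                           # illustrative dimensioning
nets = [ActorCritic(state_dim=num_nodes, action_dim=num_nodes) for _ in range(5)]
optims = [torch.optim.Adam(net.parameters(), lr=3e-4) for net in nets]
# S3/S4: per episode, roll each robot out over its cluster V_m, score each
# step with step_reward(), then call ppo_update() on the collected batch,
# repeating until the maximum number of training episodes (10000) is reached.
```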
As shown in fig. 4 and fig. 5, the average cumulative reward is relatively low at first because the initial model has not yet learned sufficiently. As training progresses, the average cumulative reward continues to rise and eventually stabilizes. More specifically, comparison shows that the clustering-based PPO algorithm outperforms the plain PPO algorithm in both convergence speed and the final stable value of the average cumulative reward.
Therefore, the path planning method for the cooperation of multiple mobile robots based on the clustering PPO algorithm, applied to the field of warehouse logistics at industrial sites, can reduce the network training cost, ensure the stability of the learning process, and reduce the overall task completion time.
Finally, it should be noted that the above embodiments are only intended to illustrate, not to limit, the technical solution of the invention. Although the invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope.

Claims (6)

1. A path planning method for the cooperation of multiple mobile robots based on a clustering PPO algorithm, characterized by comprising the following steps:
S1, collecting the position information of all targets, and cleaning and standardizing the target data;
S2, performing target node allocation using a K-means clustering algorithm;
S3, optimizing the path of each mobile robot using a PPO algorithm;
and S4, updating the policy network using the PPO algorithm, and stopping training when the policy network is stable or the preset number of iterations is reached.
2. The path planning method for cooperation of multiple mobile robots based on clustering PPO algorithm according to claim 1, wherein: in step S1, the collected position information of all the targets is expressed as:
$s=(x_1,x_2,\ldots,x_K)$;
the target data are cleaned, any missing or abnormal values are processed, and the target coordinates are standardized; the target node index set is then obtained, expressed as:
$\{0,1,2,\ldots,N\}$;
wherein node 0 represents the start point and the end point of each mobile robot.
3. The path planning method for cooperation of multiple mobile robots based on the clustering PPO algorithm according to claim 1, wherein step S2 specifically comprises the steps of:
selecting a certain number of mobile robots, and setting the index set of the mobile robots as $\mathcal{M}=\{1,2,\ldots,M\}$;
S21, initializing a cluster center, distributing target nodes to each mobile robot by using a K-means clustering algorithm, and updating the cluster center according to a distribution result, wherein the cluster center is expressed as:
$C(i)=\arg\min_{j\in\{1,2,\ldots,K\}}\mathrm{distance}(x_i,y_j)$
wherein $C(i)$ is the index of the cluster center to which target $i$ is assigned, $x_i$ is the coordinate of target $i$, $y_j$ is the coordinate of cluster center $j$, and $\mathrm{distance}(\cdot)$ is the Euclidean distance between two points;
S22, updating the clustering result: the cluster center is recalculated as the average position of all data points in the cluster for which each mobile robot is responsible:
$L_j=\frac{1}{|S_j|}\sum_{x_i\in S_j} x_i$
wherein $L_j$ is the coordinate of the cluster center, $S_j$ is the set of targets assigned to the mobile robot, and $|S_j|$ is the number of nodes in set $S_j$;
allocation matrix u= [ U ] nm ]Is an N M matrix, where u nm Representing the allocation of the nth destination node to the mth mobile robot, the allocation function is represented as:
S23, repeating steps S21 and S22, stopping the update when the within-cluster sum of squared errors converges below a threshold or the preset number of iterations is reached, and checking the balance of the target allocation.
4. The path planning method for the cooperation of multiple mobile robots based on the clustering PPO algorithm according to claim 1, wherein in step S3, optimizing the path of each robot by the PPO algorithm specifically comprises:
initializing a policy network for each mobile robot, generating a path for each mobile robot by updating the policy network, and calculating the return of each path according to the moving distance or cost of the mobile robot.
5. The path planning method for the cooperation of multiple mobile robots based on the clustering PPO algorithm according to claim 4, wherein: in step S3, the action space of mobile robot $m$ is denoted $A_m$, the state space $S_m$, and the reward function $r_m$; the action space of mobile robot $m$ at time step $t$ is expressed as:
$A_m(t)=\{s_{km}\mid s_{km}\in V_m,\ k=t+1\}$
the state space of mobile robot $m$ is built from its assigned target node set
$V_m=\{n\mid u_{nm}=1\}$,
i.e., the target nodes accessed by mobile robot $m$, $m\in\mathcal{M}$,
the reward function of mobile robot $m$ is determined by the distance from the node at the current time step $t$ to the node at the next time step $t+1$, together with a repeated-access penalty; defining $\pi_m(t)$ as the access policy of the mobile robot, $r_m$ is expressed as:
$r_m=-\mathrm{distance}(\pi_m(t),\pi_m(t+1))-\lambda r_{\mathrm{collision}}(t)$;
the cumulative reward function $R_m$ of mobile robot $m$ is then expressed as:
$R_m=\sum_{t=1}^{T} r_m(t)$,
wherein $\mathrm{distance}$ denotes the distance between two nodes, $\lambda$ is the weight coefficient of the repeated-access penalty, and $r_{\mathrm{collision}}$ is the repeated-access penalty.
6. The path planning method for cooperation of multiple mobile robots based on the clustering PPO algorithm according to claim 1, wherein step S4 comprises the steps of:
S41, limiting the magnitude of the policy gradient update by proximal clipping, combined with the advantage function and importance sampling; the objective function is expressed as:
$L^{\mathrm{CLIP}}(\theta)=\hat{\mathbb{E}}_t\!\left[\min\!\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\hat{A}_t\right)\right],\qquad r_t(\theta)=\frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}$
wherein $\pi_\theta$ denotes the current policy, $\pi_{\theta_{\mathrm{old}}}$ the old policy, $\hat{A}_t$ the advantage function, and $\epsilon$ a hyperparameter controlling the clipping strength;
S42, updating the parameters of the neural networks by the PPO algorithm: initializing the policy network $\pi_\theta(a_t\mid s_t)$ and the value network $V_\phi(s_t)$, wherein $\theta$ and $\phi$ respectively denote the parameters of the policy network and the value network;
updating the parameters of the policy network by gradient ascent:
$\theta\leftarrow\theta+\alpha\nabla_\theta L^{\mathrm{CLIP}}(\theta)$;
updating the parameters of the value network by gradient descent:
$\phi\leftarrow\phi-\beta\nabla_\phi\,\hat{\mathbb{E}}_t\!\left[\big(V_\phi(s_t)-R_t\big)^2\right];$
iterative updating is repeated until the mobile robots reach the maximum number of training episodes, and training stops; when the cumulative reward of the mobile robots converges to its maximum value, the optimal node-traversal path policy has been learned.
CN202410036441.0A 2024-01-10 2024-01-10 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm Active CN117873089B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410036441.0A CN117873089B (en) 2024-01-10 2024-01-10 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm


Publications (2)

Publication Number Publication Date
CN117873089A (en) 2024-04-12
CN117873089B CN117873089B (en) 2024-08-02

Family

ID=90592859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410036441.0A Active CN117873089B (en) 2024-01-10 2024-01-10 Multi-mobile robot cooperation path planning method based on clustering PPO algorithm

Country Status (1)

Country Link
CN (1) CN117873089B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210341904A1 (en) * 2020-04-30 2021-11-04 Robert Bosch Gmbh Device and method for controlling a robot
CN113281993A (en) * 2021-05-11 2021-08-20 北京理工大学 Greedy K-mean self-organizing neural network multi-robot path planning method
CN114237235A (en) * 2021-12-02 2022-03-25 之江实验室 Mobile robot obstacle avoidance method based on deep reinforcement learning
CN114905510A (en) * 2022-04-29 2022-08-16 南京邮电大学 Robot action method based on adaptive near-end optimization
CN115373415A (en) * 2022-07-26 2022-11-22 西安电子科技大学 Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN117289691A (en) * 2023-04-12 2023-12-26 西交利物浦大学 Training method for path planning agent for reinforcement learning in navigation scene
CN116858241A (en) * 2023-07-01 2023-10-10 中国人民解放军空军勤务学院 Application method of mobile robot in reinforcement learning of matching network
CN117213497A (en) * 2023-10-10 2023-12-12 北京理工大学 AGV global path planning method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TONG Liang; WANG Zhun: "Research on the Application of Reinforcement Learning in Robot Path Planning", Computer Simulation (计算机仿真), vol. 30, no. 12, 15 December 2013 (2013-12-15), pages 351-364 *

Also Published As

Publication number Publication date
CN117873089B (en) 2024-08-02

Similar Documents

Publication Publication Date Title
Tang et al. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation
Kamoshida et al. Acquisition of automated guided vehicle route planning policy using deep reinforcement learning
CN106651086B (en) Automatic stereoscopic warehouse scheduling method considering assembly process
CN113051815B (en) Agile imaging satellite task planning method based on independent pointer network
Chen et al. Research on an improved ant colony algorithm fusion with genetic algorithm for route planning
CN115145285A (en) Multi-point goods taking and delivering optimal path planning method and system for storage AGV
Feng et al. Flexible job shop scheduling based on deep reinforcement learning
Fuji et al. Deep multi-agent reinforcement learning using dnn-weight evolution to optimize supply chain performance
CN117669992B (en) Intelligent storage multi-mobile robot-oriented real-time two-stage scheduling method and system
Lin et al. Development of new features of ant colony optimization for flowshop scheduling
CN117873089B (en) Multi-mobile robot cooperation path planning method based on clustering PPO algorithm
Deng et al. Solving the Food-Energy-Water Nexus Problem via Intelligent Optimization Algorithms
CN117495052A (en) Multi-agricultural machine multi-task scheduling method driven by reinforcement learning and genetic algorithm fusion
CN117075634A (en) Power distribution network multi-unmanned aerial vehicle scheduling inspection method and device based on improved ant colony algorithm
CN112486185A (en) Path planning method based on ant colony and VO algorithm in unknown environment
CN116797116A (en) Reinforced learning road network load balancing scheduling method based on improved reward and punishment mechanism
CN117361013A (en) Multi-machine shelf storage scheduling method based on deep reinforcement learning
CN115755801A (en) SQP-CS-based ship building workshop process optimization method and system
Botzheim et al. Genetic and bacterial programming for B-spline neural networks design
Yu et al. A novel automated guided vehicle (AGV) remote path planning based on RLACA algorithm in 5G environment
Zhao Multiple-Agent Task Allocation Algorithm Utilizing Ant Colony Optimization.
Chreim et al. AI-agent-based modeling for Supervision a System of Systems for Mushroom Harvesting
Saeheaw et al. Application of Ant colony optimization for Multi-objective Production Problems
Ma et al. Improved DRL-based energy-efficient UAV control for maximum lifecycle
CN116718198B (en) Unmanned aerial vehicle cluster path planning method and system based on time sequence knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant