CN115421494A

CN115421494A - Cleaning robot path planning method, system, computer device and storage medium

Info

Publication number: CN115421494A
Application number: CN202211147813.4A
Authority: CN
Inventors: 王羽钧; 洪晓鹏; 沈超
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2022-09-19
Filing date: 2022-09-19
Publication date: 2022-12-02

Abstract

The invention belongs to the field of artificial intelligence and robot path planning, and discloses a cleaning robot path planning method, a cleaning robot path planning system, computer equipment and a storage medium, wherein the cleaning robot path planning system comprises the following components: acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of a robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload; and calling a preset deep reinforcement learning model for path planning of the cleaning robot according to the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload, and obtaining a path planning result of each cleaning robot. The path planning method can realize path planning of multiple cleaning robots, solve the path planning problem of the cleaning robots with multiple robots and a large number of points to be cleaned, fit practical application scenes, obtain a path planning scheme superior to that of a traditional optimization method, and solve the path planning problem by far shorter operation time than traditional methods such as an ant colony algorithm and a dynamic planning algorithm.

Description

Cleaning robot path planning method, system, computer equipment and storage medium

Technical Field

The invention belongs to the field of artificial intelligence and robot path planning, and relates to a cleaning robot path planning method, a cleaning robot path planning system, computer equipment and a storage medium.

Background

The vigorous development of artificial intelligence and robot technology provides a prerequisite for the large-scale application of the cleaning robot, the continuous rise of labor cost and actual landing market space for the cleaning robot. Nowadays, the figure of the cleaning robot is seen in a small space such as a home house from a large public place such as an airport, a hospital, and a school. Obviously, it has become a trend of the times to complete cleaning work by a robot instead of a human.

The path planning must be performed before the robot starts to perform the cleaning task. The quality of the path planning directly affects the efficiency of completing the cleaning task and indirectly affects the energy consumption and wear rate of each robot. Existing path planning methods fall into two categories: the first type is a full-coverage path planning method represented by a cattle farming method, which enables a robot to traverse all cleaning areas according to some preset rules, and is simple to implement, but low in efficiency under the conditions of large cleaning space and sparse garbage distribution. The second type is a path planning method based on a traditional optimization technology represented by an ant colony algorithm, dynamic planning, gurobi and the like, the solving time of the method generally has an exponential relation with the number of path nodes and robots, and the method is not suitable for solving a large-scale multi-machine path planning problem.

Disclosure of Invention

The present invention is directed to overcome the above-mentioned disadvantage of the prior art that multi-robot path planning of a cleaning robot is difficult, and provides a method, a system, a computer device and a storage medium for path planning of a cleaning robot.

In order to achieve the purpose, the invention adopts the following technical scheme to realize the purpose:

in a first aspect of the present invention, a cleaning robot path planning method includes:

acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of a robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload;

and calling a preset deep reinforcement learning model for path planning of the cleaning robot according to the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload, and obtaining a path planning result of each cleaning robot.

Optionally, the deep reinforcement learning model for path planning of the cleaning robot is constructed in the following manner:

establishing a mathematical model of a path planning problem of the cleaning robot;

establishing a Markov decision process model of the cleaning robot path planning problem according to a mathematical model of the cleaning robot path planning problem;

establishing an initial deep reinforcement learning model for path planning of the cleaning robot according to a Markov decision process model of the path planning problem of the cleaning robot;

and training an initial deep reinforcement learning model for path planning of the cleaning robot through a preset training set to obtain the deep reinforcement learning model for path planning of the cleaning robot.

Optionally, the mathematical model of the cleaning robot path planning problem includes optimization variables, optimization objectives, and constraint conditions;

wherein the optimization variables comprise a first optimization variable Y and a second optimization variable Z:

Z＝{z _i,j |i∈P,j∈P}

wherein P is a node set formed by the robot library and the points to be cleaned

n is the number of points to be cleaned, p ₀ Representing a robot library node; r is a set of cleaning robots

k is the number of the cleaning robots,

to indicate the variables, indicate whether the cleaning robot r is from p _i Go out and arrive at p _j If the robot r is from p _i Go out and arrive at p _j Then, then

Otherwise

z _i,j Is p _i From x _i To p _j Coordinate x of (2) _j Total amount of waste of (a);

the optimization objective is shown as follows:

wherein, c _j Is the point p to be cleaned _j Cleaning workload of c ₀ ＝0；v _r Is the running speed of the cleaning robot r;

the constraint conditions comprise optimization variable value range constraint, region access frequency constraint, robot path continuity constraint, total garbage amount constraint and garbage transportation constraint which can be carried by the robot;

the value range constraint of the optimization variable is shown as the following formula:

z _i,j ≥0,i∈P,j∈P

the region access times constraint is as follows:

the robot path continuity constraint is given by:

the total amount of garbage that the robot can carry is constrained as follows:

wherein, b _r Is the garbage bin capacity of the cleaning robot r;

the refuse transport constraint is as follows:

wherein, P' = P- { P ₀ P' is a set of n points to be cleaned, g _j Is the point p to be cleaned _j Amount of garbage of g ₀ =0, m is a preset constant.

Optionally, the markov decision process model of the cleaning robot path planning problem includes an environmental state, an action, a state transition rule, and a cost;

wherein the environmental state S _t As shown in the following formula:

S _t ＝(D _t ,E _t )，

wherein, t is the number of steps,

to clean the remaining capacity of the robot r waste bin at step t,

in order to clean the node where the robot r is located at the t-th step,

a set formed by nodes visited by the cleaning robot r up to the t step;

to node p at the t-th step _i If node p has access to state _i Has been accessed, then

Otherwise

Action A _t As shown in the following formula:

A _t ＝(d _t ,p _t )

wherein, d _t For node decoders activated at step t, p _t E, P is the node selected in the t step;

state transition rules ST for actions according to _t The environmental state is changed from S by the following formula _t Transfer to S _t+1 ：

Wherein r is _t Is node decoder d _t A corresponding cleaning robot is arranged on the cleaning machine,

represents p is _t Is spliced at

A terminal end;

the cost F is shown as follows:

wherein T is the total number of steps,

is the cost of the cleaning robot r at step t,

obtained by the following formula:

wherein, the first and the second end of the pipe are connected with each other,

to represent

And

the distance of (a) to (b),

is p _t The coordinates of (a) are calculated,

is composed of

The coordinates of (c).

Optionally, the deep reinforcement learning model for cleaning robot path planning includes: an encoder and a decoder; the encoder comprises a node encoder and a robot encoder, and the decoder comprises a decoder selector and k node decoders; the output ends of the node encoder and the robot encoder are connected with the input end of a decoder selector, and the output end of the decoder selector is connected with the input ends of the k node decoders;

the node encoder comprises a linear mapping layer and L1 graph encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _node Is the index of the graph coding module of the node encoder, when 1 is less than or equal to l _node If < L1, the graph coding module L _node Output terminal of (1) and _node the input ends of +1 image coding modules are connected, when l _node If = L1, the pattern coding module L _node The output end of the decoder is connected with the input end of the decoder selector; the robot encoder comprises a linear mapping layer and L2 image encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _robot For the index of the image coding module of the robot encoder, when 1 is less than or equal to l _robot If < L2, the graph coding module L _robot Output terminal of (1) and _robot the input ends of +1 image coding modules are connected, when l _robot = L2, graph coding module L _robot The output end of the decoder is connected with the input end of the decoder selector; the decoder selector comprises a multi-head attention layer and a fitness layer, and the output end of the multi-head attention layer is connected with the input end of the fitness layer; the node decoder comprises a multi-head attention layer and a fitness layer, wherein the output end of the multi-head attention layer is connected with the input end of the fitness layer.

Optionally, the linear mapping layer is shown as follows:

Linear(x)＝Wx+B

wherein the content of the first and second substances,

is the input of the data to be transmitted,

and

is a learnable parameter, d _in Is the dimension of the data input, d _out Is the output dimension of the linear mapping layer;

the fitness layer is represented by the following formula:

wherein softmax () is a normalized exponential function;

the multiheaded attention layer is represented by the formula:

MHA(X)＝Concat(head ₁ ,head ₂ ,…,head _h )W ^O

is the input of the multi-head attention layer, nxd _x Is the dimension of the input data, concat is the matrix splicing operation,

is a trainable parameter, h is the number of heads of attention, d _v Is the dimension of a vector of values, head _i Is the output of the ith attention head; head _i The calculation method of (a) is as follows:

wherein Q is _i ＝XWi _i ^Q ,

V _i ＝XW _i ^V ；

And

is a learnable parameter, d _k Is the dimension of the key vector;

the graph encoding module is represented by the following formula:

X _l+1 ＝GraphEncoder(X _l )

wherein the content of the first and second substances,

is the input to the graph coding module and,

is the output of the graph coding module and,

the method comprises the steps that a graph coding module calculates a process vector, and FF is a forward propagation module and is formed by connecting a plurality of linear mapping layers and a ReLU function layer; BN () is a batch normalization layer;

the ReLU function layer is expressed as follows:

ReLU(x)＝max(0,x)

the batch normalization layer is shown below:

where γ and β are learnable parameters, E [ x ] is the expectation of x, var [ x ] is the variance of x, and E is a constant to prevent the denominator from being zero;

the input of the node encoder is I _P ＝{(x _i ,c _i ,g _i ) I belongs to P, and the output is

is the code of the ith node;

the input of the robot encoder is I _R ＝{(v _r ,b _r ) L R belongs to R, and the output is

is the code of the ith cleaning robot;

the decoder selector inputs at time step t as

is the path taken by the cleaning robot r up to time step t-1,

node decoder d with maximum probability output _t ；

The node decoder has the input of

Where r' is the node decoder d _t Corresponding cleaning robot, h _p Is the code of the node where the cleaning robot is located, h _r′ Is the code of the cleaning robot r'; node p with the output of the maximum probability _t 。

Optionally, when the initial depth-enhanced learning model for cleaning robot path planning is trained, model parameters of the initial depth-enhanced learning model for cleaning robot path planning are optimized according to the following formula:

where θ is the model parameter, s is the output path planning scheme, F _s Is the cost of the path planning scheme s, b(s) is the evaluation of the reference method to the path planning scheme s, the strategy of the pi reinforcement learning method, p _θ (π | s) represents the probability of outputting the path planning solution s under the parameter θ and the policy π.

In a second aspect of the present invention, a cleaning robot path planning system includes:

the data acquisition module is used for acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload;

and the model calling module is used for calling a preset deep reinforcement learning model for path planning of the cleaning robot according to the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload, so as to obtain a path planning result of each cleaning robot.

In a third aspect of the invention, a computer device comprises a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the cleaning robot path planning method when executing the computer program.

In a fourth aspect of the present invention, a computer-readable storage medium stores a computer program, which when executed by a processor implements the steps of the above cleaning robot path planning method.

Compared with the prior art, the invention has the following beneficial effects:

the cleaning robot path planning method is based on the calling of a deep reinforcement learning model for cleaning robot path planning, can realize the path planning of multiple cleaning robots only by acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of a robot library, the coordinates of points to be cleaned, the garbage amount and the cleaning workload, can solve the path planning problem of the cleaning robots with multiple robots and a large number of points to be cleaned, is more suitable for practical application scenes, fully utilizes cleaning task information and cleaning robot information, and the solved path planning scheme is superior to the traditional optimization method. Meanwhile, the deep reinforcement learning model for path planning of the cleaning robot is based on deep reinforcement learning, the operation speed can be greatly increased by using a graphic processor, and the operation time required for solving the path planning problem is far shorter than that of the traditional methods such as an ant colony algorithm and a dynamic planning algorithm.

Drawings

Fig. 1 is a flowchart of a cleaning robot path planning method according to an embodiment of the present invention.

FIG. 2 is a diagram of a deep reinforcement learning model architecture according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a detailed architecture of a deep reinforcement learning model according to an embodiment of the present invention.

Fig. 4 is a block diagram of a path planning system of a cleaning robot according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The invention is described in further detail below with reference to the accompanying drawings:

referring to fig. 1, in an embodiment of the present invention, a cleaning robot path planning method is provided, and particularly, a cleaning robot path planning method based on deep reinforcement learning, which can implement path planning of multiple cleaning robots, and has a fast solving speed and high solving quality.

Specifically, the cleaning robot path planning method comprises the following steps:

s1: and acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload.

S2: and calling a preset deep reinforcement learning model for path planning of the cleaning robot according to the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload, and obtaining a path planning result of each cleaning robot.

The garbage bin capacity and the running speed of each cleaning robot can be obtained from a specification or a manufacturer of the cleaning robot, and the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload are set according to an actual working scene.

In one possible embodiment, the deep reinforcement learning model for cleaning robot path planning is constructed by the following steps: establishing a mathematical model of a path planning problem of the cleaning robot; establishing a Markov decision process model of the cleaning robot path planning problem according to a mathematical model of the cleaning robot path planning problem; establishing an initial deep reinforcement learning model for path planning of the cleaning robot according to a Markov decision process model of the path planning problem of the cleaning robot; training an initial deep reinforcement learning model for path planning of the cleaning robot through a preset training set to obtain the deep reinforcement learning model for path planning of the cleaning robot.

Optionally, the mathematical model of the path planning problem of the cleaning robot includes an optimization variable, an optimization target, and constraint conditions, where the optimization variable includes a first optimization variable Y and a second optimization variable Z, and the constraint conditions include an optimization variable value range constraint, a region access frequency constraint, a robot path continuity constraint, a total garbage amount constraint that the robot can carry, and a garbage transportation constraint.

Set up robot storehouse and wait to clean some constitution node set

Where n is the number of spots to be cleaned, p ₀ Representing the node of the robot library, and setting P' = P- { P ₀ P' is a set of n points to be cleaned; set up robot storehouse and wait to clean coordinate composition set of point

Wherein x is _i Is p _i The coordinates of (a); set all cleaning robots to form a set

Where k is the number of cleaning robots.

The first optimization variable Y is shown as follows:

wherein k is the number of the cleaning robots,

to indicate the variable, the indication indicates whether the robot r is from p _i Go out and arrive at p _j If the robot r is from p _i Go out and arrive at p _j Then, then

Otherwise

The second optimization variable Z is shown as follows:

Z＝{z _i,j |i∈P,j∈P}

wherein z is _i,j Represents from x _i To x _j Total amount of waste.

The optimization objective is shown below:

wherein, c _j Is the point p to be cleaned _j Cleaning workload of c ₀ ＝0；v _r Is the running speed of the cleaning robot r.

z _i,j ≥0,i∈P,j∈P

the region access times constraint is as follows:

the robot path continuity constraint is given by:

the total amount of garbage that the robot can carry is constrained as follows:

wherein, b _r Is the garbage bin capacity of the robot r;

the refuse transport constraint is as follows:

wherein, g _j Is the point p to be cleaned _j The quantity of garbage is set as g ₀ And M is a larger preset constant number of 0.

Optionally, the markov decision process model of the cleaning robot path planning problem includes environmental states, actions, state transition rules, and costs.

In particular, the ambient state S _t ＝(D _t ,E _t )，

Wherein t is the number of steps,

to clean the remaining capacity of the robot r waste bin at step t,

in order to clean the node where the robot r is located at the t-th step,

a set formed by nodes visited by the cleaning robot r from the t step;

to the node p at the t step _i If node p has access to state _i Has been accessed, then

Otherwise

Action A _t ＝(d _t ,p _t ) Wherein d is _t For node decoders activated at step t, p _t And e P is the node selected in the t step.

State transition rule ST according to action A _t From the environmental state S _t Transfer to S _t+1 The method specifically comprises the following steps:

represents p is _t Is spliced at

And (4) ending.

Cost of

Wherein T is the total number of steps,

is the cost of the robot r at step t,

the calculation method of (a) is shown as follows:

wherein the content of the first and second substances,

indicating points

And point

The distance of (c).

Referring to fig. 2, optionally, the deep reinforcement learning model for cleaning robot path planning includes: an encoder and a decoder; the encoder comprises a node encoder and a robot encoder, and the decoder comprises a decoder selector and k node decoders; the output ends of the node encoder and the robot encoder are connected with the input end of the decoder selector, and the output end of the decoder selector is connected with the input ends of the k node decoders.

Referring to fig. 3, the components constituting the node encoder, the robot encoder, the decoder selector, and the node decoder include a linear mapping layer, a ReLU function layer, a single-headed attention layer, a multi-headed attention layer, a batch normalization layer, and a graph encoding module. Specifically, the node encoder comprises a linear mapping layer and L1 graph encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _node For the index of the graph coding module of the node encoder, when 1 is less than or equal to l _node If < L1, the graph coding module L _node Output terminal of (1) and _node the input ends of +1 image coding modules are connected, when l _node If = L1, the pattern coding module L _node The output end of the decoder is connected with the input end of the decoder selector; the robot encoder comprises a linear mapping layer and L2 image encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _rodot For the index of the image coding module of the robot encoder, when 1 is less than or equal to l _robot If < L2, the graph coding module L _robot Output terminal of (1) and _robot the input ends of +1 image coding modules are connected, when l _robot If = L2, the pattern coding module L _robot The output end of the decoder is connected with the input end of the decoder selector; the decoder selector comprises a multi-head attention layer and a fitness layer, and the output end of the multi-head attention layer is connected with the input end of the fitness layer; the node decoder comprises a multi-head attention layer and a fitness layer, wherein the output end of the multi-head attention layer is connected with the input end of the fitness layer.

Specifically, the linear mapping layer is represented by the following formula:

Linear(x)＝Wx+B

is the input of the data to be transmitted,

and

is a learnable parameter, d _in Is the dimension of the data input, d _out Is the output dimension of the linear mapping layer.

The fitness layer is represented by the following formula:

wherein softmax () is a normalized exponential function.

The multi-head attention layer is shown as follows:

MHA(X)＝COncat(head ₁ ,head ₂ ,…,head _h )W ^O

is the input of a multi-head attention layer, nxd _x Is the dimension of the input data, concat is the matrix splicing operation,

is a trainable parameter, h is the number of attention heads, when h is 1, i.e. single head attention level, d _v Is the dimension of the value vector, head _i Is the output of the ith attention head; head _j The calculation method of (a) is as follows:

wherein Q is _i ＝XW _i ^Q ,

V _i ＝XW _i ^V ；

And

is a learnable parameter, d _k Is the dimension of the key vector.

The single-headed attention layer is represented by the following formula:

wherein Q = XW ^Q ,K＝XW ^K ,V＝XW ^V ；

And

are learnable parameters.

The graph encoding module is represented by the following formula:

X _l+1 ＝GraphEncoder(X _l )

is the input to the graph coding module and,

is the output of the graph encoding module and,

the FF is a forward propagation module and is formed by connecting a plurality of linear mapping layers and a ReLU function layer; BN () is the batch normalization layer.

The ReLU function layer is shown as follows:

ReLU(x)＝max(0,x)

the batch normalization layer is shown below:

where γ and β are learnable parameters, E [ x ] is the expectation of x, var [ x ] is the variance of x, and E is a constant to prevent the denominator from being zero.

In this embodiment, the forward propagation module is formed by a linear mapping layer with an input dimension of 128 and an output dimension of 512, a ReLU activation function layer, and a linear mapping layer with an input dimension of 512 and an output dimension of 128.

In a possible implementation, the input of the node encoder is node information I _P ＝{(x _i ,c _i ,g _i ) I belongs to P, and the output is

is the code of the ith node; the input of the robot encoder is robot information I _R ＝{(v _r ,b _r ) L R belongs to R, and the output is

Wherein the content of the first and second substances,

is the code of the ith cleaning robot; the decoder selector has as input at time step t

Node decoder d with maximum output probability _t (ii) a The node decoder has the input of

Specifically, the input of the node encoder is

The node encoder firstly maps I through a linear mapping layer _P Mapping to a high-dimensional feature space:

wherein the content of the first and second substances,

Linear _P is 4 and the output dimension is 128.

Then extracting features through m graph coding modules:

wherein k is the serial number of the graph coding module; the output of the node decoder is

Wherein

Is the code of the ith node and is,

the input of the robot encoder is

The robot encoder first maps I through the linear mapping layer _R Mapping to a high-dimensional feature space:

wherein the content of the first and second substances,

Linear _R has an input dimension of 2 and an output dimension of 128.

Then extracting features through m graph coding modules:

the output of the robot encoder is

is the code of the ith cleaning robot.

The decoder selector has as input at time step t

Wherein

The path that the cleaning robot r has traveled when the time step i-1 is reached;

the decoder selector first extracts the Tour by maximum pooling ^t-1 The information in (1):

then, the extracted information is input into a forward propagation module to obtain

Wherein, FF _ST The device is formed by connecting a linear mapping layer with an input dimension of 5 and an output dimension of 128, a linear mapping layer with an input dimension of 128 and an output dimension of 512, a ReLU activation function layer, and a linear mapping layer with an input dimension of 512 and an output dimension of 128.

Then, the V is put ^t-1 Input into another forward propagation module to obtain

Wherein, FF _ST The device is formed by connecting a linear mapping layer with an input dimension of 640 and an output dimension of 128, a linear mapping layer with an input dimension of 128 and an output dimension of 512, a ReLU activation function layer, and a linear mapping layer with an input dimension of 512 and an output dimension of 128.

Will be provided with

And with

Splicing and inputting the linear layer to obtain the logarithmic probability:

wherein, linear _S 256 in input dimension and 5 in output dimension.

Will logits _S Inputting softmax function to obtain probability prob of selecting each decoder _S ：

prob _S ＝softmax(logits _S )

Wherein the content of the first and second substances,

representing the probability of selecting decoder i, and finally obtaining node decoder d with the maximum probability _t ：

The output of the decoder selector is d _t 。

The node decoder has the input of

Wherein r' is a node decoder d _t Corresponding cleaning robot, h _p Is the code of the node where the cleaning robot is located, h _r′ Is the code of the cleaning robot r'; the output is the node p with the maximum probability _t 。

Node decoder first decodes C _D Inputting the linear mapping layer to obtain

Wherein, linear _D Is 257 and the output dimension is 128.

Then will

And

splicing to obtain

Wherein the content of the first and second substances,

will be provided with

Inputting the multi-head attention layer to obtain

then calculating the probability of selecting the ith node

Wherein the content of the first and second substances,

d _key is key _i Dimension (d); finally, the product is processedGet the node p with the maximum probability _t ：

The output of the node decoder is p _t 。

In one possible embodiment, training an initial deep reinforcement learning model for cleaning robot path planning through a preset training set comprises:

s11: and setting the size of the training data set, the size of the batch, the number E of training rounds and the learning rate. In this embodiment, the training data set size is 1280000, the batch size is 512, the number of training rounds E =50, and the learning rate is 0.0001.

S12: generating a training sample set; the current number of training rounds e =1 is set.

S13: inputting training samples into a network in batches according to the set batch size, and calculating a path planning scheme; and optimizing the model parameters according to the path planning scheme output by the network and according to the following formula:

S14: training round number e = e +1.

S15: if E > E, the training is ended; otherwise, return to S12.

In one possible implementation, a test set containing 1280 samples is used, for three benchmark methods based on conventional optimization techniques: ant colony algorithm, genetic algorithm and Gurobi, two reference methods based on reinforcement learning: AM and DRL, and the cleaning robot path planning method of the invention, the results are shown in Table 1:

TABLE 1

Method	Optimizing the value of the target	Solution time (unit: second)
			Ant colony algorithm	7.07	261097
Genetic algorithm	8.85	175670
			Gurobi	7.38	129039
AM	7.09	0.63
			DRL	6.69	1.21
The invention	6.59	1.27

Therefore, from the perspective of an optimization target, the cleaning robot path planning method is superior to the five reference methods; from the perspective of solving time, the cleaning robot path planning method is obviously superior to three methods based on the traditional optimization technology.

The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details not disclosed in the device embodiments, reference is made to the method embodiments of the invention.

Referring to fig. 4, in a further embodiment of the present invention, a cleaning robot path planning system is provided, which can be used to implement the cleaning robot path planning method described above, and specifically, the cleaning robot path planning system includes a data obtaining module and a model invoking module.

The data acquisition module is used for acquiring the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload; the model calling module is used for calling a preset deep reinforcement learning model for path planning of the cleaning robot according to the garbage bin capacity and the running speed of each cleaning robot, the coordinates of the robot library, the coordinates of each point to be cleaned, the garbage amount and the cleaning workload, so as to obtain a path planning result of each cleaning robot.

In one possible embodiment, the deep reinforcement learning model for cleaning robot path planning is constructed by: establishing a mathematical model of a path planning problem of the cleaning robot; establishing a Markov decision process model of the cleaning robot path planning problem according to a mathematical model of the cleaning robot path planning problem; establishing an initial depth reinforcement learning model for path planning of the cleaning robot according to a Markov decision process model of the path planning problem of the cleaning robot; and training an initial deep reinforcement learning model for path planning of the cleaning robot through a preset training set to obtain the deep reinforcement learning model for path planning of the cleaning robot.

In one possible embodiment, the mathematical model of the cleaning robot path planning problem includes optimization variables, optimization objectives, and constraints; the optimization variables comprise a first optimization variable Y and a second optimization variable Z:

Z＝{z _i,j |i∈P,j∈P}

wherein, P is a node set formed by the robot library and the points to be cleaned

n is the number of points to be cleaned, p ₀ Representing a robot library node; r is a set formed by each cleaning robot

k is the number of the cleaning robots,

Otherwise

z _i,j Is p _i From x _i To p _j Coordinate x of _j Total amount of waste.

The optimization objective is shown as follows:

The constraint conditions comprise optimization variable value range constraint, area access frequency constraint, robot path continuity constraint, total garbage amount constraint and garbage transportation constraint which can be carried by the robot.

z _i,j ≥0,i∈P,j∈P

the region access times constraint is as follows:

the robot path continuity constraint is given by:

the total amount of garbage that the robot can carry is constrained as follows:

wherein, b _r Is the garbage bin capacity of the cleaning robot r;

the refuse transport constraint is shown by the following formula:

In one possible embodiment, the markov decision process model of the cleaning robot path planning problem includes environmental states, actions, state transition rules, and costs.

Wherein the environmental state S _t As shown in the following formula:

S _t ＝(D _t ,E _t )，

wherein t is the number of steps,

to clean the remaining capacity of the robot r waste bin at step t,

in order to clean the node where the robot r is located at the t-th step,

a set formed by nodes visited by the cleaning robot r from the t step;

Otherwise

Action A _t As shown in the following formula:

A _t ＝(d _t ,p _t )

wherein d is _t For node decoders activated at step t, p _t And e P is the node selected in the t step.

represents p is _t Is spliced at

And (4) ending.

The cost F is shown as follows:

wherein T is the total number of steps,

is the cost of the cleaning robot r at step t,

obtained by the following formula:

wherein the content of the first and second substances,

to represent

And

the distance of (a) to (b),

is p _t Is determined by the coordinate of (a) in the space,

is composed of

The coordinates of (a).

In one possible embodiment, the deep reinforcement learning model for cleaning robot path planning includes: an encoder and a decoder; the encoder comprises a node encoder and a robot encoder, and the decoder comprises a decoder selector and k node decoders; the output ends of the node encoder and the robot encoder are connected with the input end of the decoder selector, and the output end of the decoder selector is connected with the input ends of the k node decoders; the node encoder comprises a linear mapping layer and L1 graph encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _node Is the index of the graph coding module of the node encoder, when 1 is less than or equal to l _node If < L1, the graph coding module L _node Output terminal of (1) and _node the input ends of +1 image coding modules are connected, when l _node If = L1, the pattern coding module L _node The output end of the decoder is connected with the input end of the decoder selector; the robot encoder comprises a linear mapping layer and L2 image encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; let l _robot For the index of the image coding module of the robot encoder, when 1 is less than or equal to l _robot If < L2, the graph coding module L _robot Output terminal of (1) and _robot the input ends of +1 image coding modules are connected, when l _robot If = L2, the pattern coding module L _robot The output end of the decoder is connected with the input end of the decoder selector; the decoder selector comprises a multi-head attention layer and a fitness layer, and the output end of the multi-head attention layer is connected with the input end of the fitness layer; the node decoder comprises a multi-head attention layer and a fitness layer, wherein the output end of the multi-head attention layer is connected with the input end of the fitness layer.

In one possible embodiment, the linear mapping layer is represented by the following formula:

Linear(x)＝Wx+B

is the input of the data to be transmitted,

and

The fitness layer is represented by the following formula:

wherein softmax () is a normalized exponential function.

The multi-head attention layer is shown as follows:

MHA(X)＝Concat(head ₁ ,head ₂ ,…,head _h )W ^O

wherein the content of the first and second substances,

is the input of the multi-head attention layer, nxd _x Is the dimension of the input data, concat is the matrix stitchingIn the operation of the method, the operation,

is a trainable parameter, h is the number of heads of attention, d _v Is the dimension of the value vector, head _i Is the output of the ith attention head; head _i The calculation method of (a) is as follows:

wherein Q _i ＝XW _i ^Q ,

V _i ＝XW _i ^V ；

And

is a learnable parameter, d _k Is the dimension of the key vector.

The graph encoding module is represented by the following formula:

X _l+1 ＝GraphEncoder(X _l )

wherein the content of the first and second substances,

is the input to the graph coding module and,

is the output of the graph coding module and,

wherein the content of the first and second substances,

is a picture plaitThe code module calculates a process vector, and FF is a forward propagation module and is formed by connecting a plurality of linear mapping layers and a ReLU function layer; BN () is a batch normalization layer.

The ReLU function layer is expressed as follows:

ReLU(x)＝max(0,x)

the batch normalization layer is shown as follows:

Wherein the content of the first and second substances,

is the code of the ith node; the input of the robot encoder is I _R ＝{(v _r ,b _r ) I R belongs to R, and the output is

Wherein the content of the first and second substances,

Wherein the content of the first and second substances,

is cut off to the time step t-1 and the cleaning robot r walksThe path of the beam is a path of the beam,

node decoder d with maximum output probability _t (ii) a Node decoder d with maximum output probability _t (ii) a The node decoder has the input of

Where r' is the node decoder d _t Corresponding cleaning robot, h _p Is the code of the node where the cleaning robot is located, h _r′ Is the code of the cleaning robot r'; the output is the node p with the maximum probability _t 。

In one possible embodiment, the training of the initial deep-reinforcement learning model for cleaning robot path planning optimizes model parameters of the initial deep-reinforcement learning model for cleaning robot path planning by:

All relevant contents of each step related to the embodiment of the cleaning robot path planning method can be introduced to the functional description of the functional module corresponding to the cleaning robot path planning system in the embodiment of the present invention, and are not described herein again.

The division of the modules in the embodiments of the present invention is schematic, and only one logical function division is provided, and in actual implementation, there may be another division manner, and in addition, each functional module in each embodiment of the present invention may be integrated in one processor, or may exist alone physically, or two or more modules are integrated in one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.

In yet another embodiment of the present invention, a computer device is provided that includes a processor and a memory for storing a computer program comprising program instructions, the processor for executing the program instructions stored by the computer storage medium. The Processor may be a Central Processing Unit (CPU), or may be other general-purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf Programmable gate array (FPGA) or other Programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc., which is a computing core and a control core of the terminal, and is specifically adapted to load and execute one or more instructions in a computer storage medium to implement a corresponding method flow or a corresponding function; the processor of the embodiment of the invention can be used for the operation of the cleaning robot path planning method.

In yet another embodiment of the present invention, the present invention further provides a storage medium, specifically a computer-readable storage medium (Memory), which is a Memory device in a computer device and is used for storing programs and data. It is understood that the computer readable storage medium herein can include both built-in storage media in the computer device and, of course, extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also, one or more instructions, which may be one or more computer programs (including program code), are stored in the memory space and are adapted to be loaded and executed by the processor. It should be noted that the computer-readable storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory. One or more instructions stored in the computer-readable storage medium may be loaded and executed by a processor to perform the corresponding steps of the cleaning robot path planning method in the above-described embodiments.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims

1. A method for cleaning robot path planning, comprising:

2. The cleaning robot path planning method according to claim 1, wherein the deep reinforcement learning model for cleaning robot path planning is constructed by:

establishing an initial depth reinforcement learning model for path planning of the cleaning robot according to a Markov decision process model of the path planning problem of the cleaning robot;

3. The cleaning robot path planning method according to claim 2, wherein the mathematical model of the cleaning robot path planning problem includes optimization variables, optimization objectives, and constraints;

the optimization variables comprise a first optimization variable Y and a second optimization variable Z:

Z＝{z _i,j |i∈P,j∈P}

k is the number of the cleaning robots,

Otherwise

z _i,j Is p _i From x _i To p _j Coordinate x of _j Total amount of waste;

the optimization objective is shown as follows:

the constraint conditions comprise an optimized variable value range constraint, a region access frequency constraint, a robot path continuity constraint, a robot carried garbage total amount constraint and a garbage transportation constraint;

z _i,j ≥0,i∈P,j∈P

the region access times constraint is as follows:

the robot path continuity constraint is given by:

the total amount of garbage that the robot can carry is constrained as follows:

wherein, b _r Is the garbage bin capacity of the cleaning robot r;

the refuse transport constraint is as follows:

4. The cleaning robot path planning method of claim 3, wherein the Markov decision process model of the cleaning robot path planning problem includes environmental states, actions, state transition rules, and costs;

wherein the environmental state S _t As shown in the following formula:

wherein t is the number of steps,

to clean the remaining capacity of the robot r waste bin at step t,

in order to clean the node where the robot r is located at the t-th step,

a set formed by nodes visited by the cleaning robot r from the t step;

to be at the t-th nodep _i If node p has access to state _i Has been accessed, then

Otherwise

Action A _t As shown in the following formula:

A _t ＝(d _t ,p _t )

state transition rules ST for actions according to A _t The environmental state is changed from S by the following formula _t Transfer to S _t+1 ：

Wherein r is _t Is a node decoder s _t A corresponding cleaning robot is arranged on the base plate,

represents p is _t Is spliced at

A terminal end;

the cost F is shown below:

wherein, T is the total number of steps,

is the cost of the cleaning robot r at step t,

obtained by the following formula:

represent

And

the distance of (a) to (b),

is p _t Is determined by the coordinate of (a) in the space,

is composed of

The coordinates of (a).

5. The cleaning robot path planning method according to claim 4, wherein the deep reinforcement learning model for cleaning robot path planning includes: an encoder and a decoder; the encoder comprises a node encoder and a robot encoder, and the decoder comprises a decoder selector and k node decoders; the output ends of the node encoder and the robot encoder are connected with the input end of a decoder selector, and the output end of the decoder selector is connected with the input ends of the k node decoders;

the node encoder comprises a linear mapping layer and L1 graph encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; is provided with

For the index of the graph coding module of the node encoder, when

Time, picture coding module

To the output terminal and

the input end of the picture coding module is connected with

Time-graph coding module

The output end of the decoder is connected with the input end of the decoder selector; the robot encoder comprises a linear mapping layer and L2 image encoding modules; the output end of the linear mapping layer is connected with the input end of the first graph coding module; is provided with

For the indexing of the image coding modules of the robot encoder, when

Time-graph coding module

Output terminal and the second

The input end of the picture coding module is connected with

Time-graph coding module

The output end of the decoder is connected with the input end of the decoder selector; the decoder selector comprises a multi-head attention layer and a fitness layer, wherein the output end of the multi-head attention layer is connected with the input end of the fitness layer; the node decoder comprises a multi-head attention layer and a fitness layer, wherein the output end of the multi-head attention layer is connected with the input end of the fitness layer.

6. The cleaning robot path planning method according to claim 5, wherein the linear mapping layer is represented by the following formula:

Linear(x)＝Wx+B

is the input of the data to be transmitted,

and

the fitness layer is represented by the following formula:

wherein sofmtx () is a normalized exponential function;

the multi-head attention layer is shown as follows:

MHA(X)＝Concat(head ₁ ,head ₂ ,…,head _h )W ^O

wherein the content of the first and second substances,

is a trainable parameter, h is the number of attention heads, d _v Is the dimension of the value vector, head _i Is the output of the ith attention head; head _i The calculation method of (a) is as follows:

wherein Q _i ＝XW _i ^Q ,K _i ＝XW _i ^K ,V _i ＝XW _i ^V ；

And

is a learnable parameter, d _k Is the dimension of the key vector;

the graph encoding module is represented by the following formula:

X _l+1 ＝GraphEncoder(X _l )

is the input to the graph coding module and,

is the output of the graph encoding module and,

the FF is a forward propagation module and is formed by connecting a plurality of linear mapping layers and a ReLU function layer; BN () is a batch normalization layer;

the ReLU function layer is expressed as follows:

ReLU(x)＝max(0,x)

the batch normalization layer is shown below:

is the code of the ith node;

the input of the robot encoder is I _R ＝{(v _r ,b _r ) I R belongs to R, and the output is

is the code of the ith cleaning robot;

the decoder selector has as input at time step t

Wherein the content of the first and second substances,

is the path taken by the cleaning robot r up to time step t-1,

node decoder d with maximum output probability _t ；

The node decoder has the input of

7. The cleaning robot path planning method according to claim 2, wherein the training of the initial deep reinforcement learning model for cleaning robot path planning optimizes model parameters of the initial deep reinforcement learning model for cleaning robot path planning by:

where θ is the model parameter, s is the output path planning scheme, F _s Is the cost of the path planning scheme s, b(s) is the evaluation of the reference method to the path planning scheme s, the strategy of the pi reinforcement learning method, p _θ (π | s) represents the probability of outputting a path planning solution s under the parameter θ and the policy π.

8. A cleaning robot path planning system, comprising:

9. A computer arrangement comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, carries out the steps of the method for path planning of a cleaning robot according to any of claims 1-7.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of a method for path planning for a cleaning robot according to any of claims 1 to 7.