CN115235476B

CN115235476B - Full-coverage path planning method and device, storage medium and electronic equipment

Info

Publication number: CN115235476B
Application number: CN202211169283.3A
Authority: CN
Inventors: 娄君杰; 郑鑫宇; 章航嘉; 郑习羽
Original assignee: Ningbo Junsheng Intelligent Automobile Technology Research Institute Co ltd
Current assignee: Ningbo Junsheng Intelligent Automobile Technology Research Institute Co ltd
Priority date: 2022-09-26
Filing date: 2022-09-26
Publication date: 2023-01-17
Anticipated expiration: 2042-09-26
Also published as: CN115235476A

Abstract

The invention provides a full coverage path planning method, a full coverage path planning device, a storage medium and electronic equipment. The invention improves the traditional grid modeling mode, represents discrete environment by grid points, and designs a convolutional neural network model and a state input matrix. Designing a reward and punishment function on the model, and training the convolutional neural network model by using the current mainstream reinforcement learning algorithm; the motion can be output in a continuous motion space, and an optimal full-coverage path is finally formed.

Description

Full-coverage path planning method and device, storage medium and electronic equipment

Technical Field

The invention relates to the technical field of intelligent control, in particular to a full-coverage path planning method and device, a storage medium and electronic equipment.

Background

The problems to be solved by full Coverage Path Planning (Complete Coverage Path Planning) include traversing all regions except obstacles in a working region, effectively avoiding all obstacles in the traversing process, avoiding Path repetition as much as possible in the traversing process, and shortening the movement distance.

The traditional full coverage path planning method needs to grid the environment, i.e. divide the task area into a limited number of grids in equal size. The movements of the agent are then discretized into 9 actions of top left, top right, bottom left, bottom right, top, bottom, left, right, and motionless. For most of the motion platform agents, such discrete actions do not satisfy their kinematic constraints, and path planning is inefficient and space-consuming. Secondly, the size of the grid is generally designed according to the size of the intelligent agent, in a large-scale reconnaissance task, the task area is large, the intelligent agent is equivalent to a particle in the task area, the task area is rasterized, a large number of grids are generated, and the calculation difficulty of equipment is increased.

Disclosure of Invention

In order to solve the above problems, the present invention provides a full coverage path planning method based on deep reinforcement learning, which includes:

s10: dividing task area where agent is located into n ₁ ×n ₂ A plurality of grid points arranged in a matrix;

s20: according to the environment attribute of each grid point in the plurality of grid points at the current moment, respectively assigning values to each grid point to obtain a first environment state matrix for representing the environment state of the task area;

s30: according to the distance between the intelligent agent and each grid point at the current moment, respectively assigning values to each grid point to obtain a first position state matrix for representing the position state of the intelligent agent;

s40: according to the distances between the intelligent agent and each grid point at the N previous moments, respectively assigning values to each grid point to obtain N heading information matrixes for representing heading information of the intelligent agent;

s50: splicing the first environment state matrix, the first position state matrix and the N heading information matrices into N +2 state input matrices;

s60: constructing a convolutional neural network model, inputting the N +2 state input matrixes into the convolutional neural network model, so that the convolutional neural network model outputs according to the N +2 state input matrixes, and outputting an output value representing the next-step execution information of the agent;

s70: training the convolutional neural network model by adopting a deep reinforcement learning algorithm;

s80: adopting a trained convolutional neural network model to plan a path of the agent;

n previous moments are adjacent to the current moment and occur at moments before the current moment, and N is greater than or equal to 2; n is ₁ Is an integer from 1 to 1000; n is ₂ Is 1 to 1000 unitsAnd (4) counting.

The benefit of this scheme of adoption lies in: firstly, compared with the traditional rasterization modeling mode which uses discrete behavior actions, the scheme can output actions in a continuous action space and finally form an optimal full coverage path; the second step is as follows: the calculated amount is reduced, the convolutional neural network model is used in the scheme, the input state characteristics can be extracted, and the mode is smaller than the image matrix data amount; and thirdly: the designed high-dimensional state input has more obvious characteristics and can represent richer environment and intelligent state information; and the fourth step: the resource utilization rate is increased, and the model convolutional neural network model is trained by using a reinforcement learning algorithm and does not depend on an original data set; and fifthly, the universality is higher, by using the path planning method provided by the scheme, the initial position of the intelligent agent can be any position in the task area, and the area outside the task can use the same set of neural network model.

Further, the element m (i, j) in the first environment state matrix is any one of [ -1,0,1], and the element m (i, j) is assigned according to the following principle:

when the environment attribute is that the grid point is an obstacle, m (i, j) = -1;

when the environment attribute is that a grid point has been detected, m (i, j) =0;

when the environment attribute is that the grid point is not detected, m (i, j) =1.

The benefit of this scheme of adoption lies in: and assigning values according to the environment attributes, digitizing the current environment state to obtain a first environment state matrix, which can represent the environment type and the environment coverage condition of the current area.

Further, the elements dis in the first position state matrix _i,j Assigned according to the following principle:

；

wherein dis _i,j Is the Euclidean distance, X, between the agent and the grid point of the ith row and the jth column of the first position state matrix _agent For agents in a two-dimensional plane corresponding to the task areaX-coordinate, Y-coordinate in rectangular coordinate system _agent For the Y coordinate, X coordinate of the agent in a two-dimensional rectangular plane coordinate system _i,j Is a Yuan dis _i,j X-coordinate, Y-coordinate in a two-dimensional planar rectangular coordinate system _i,j Is a Yuan dis _i,j Y-coordinate, dis, in a two-dimensional plane rectangular coordinate system _max The longest distance in the task area.

Further, the method further comprises: and (3) limiting the output value within the range of [ -1,1] by using the tanh activation function at the output layer of the convolutional neural network model, and multiplying the limited output value by the maximum steering limit of the intelligent agent to obtain a steering action output value representing the steering action of the intelligent agent.

The benefit of this scheme of adoption lies in: output values are standardized, and subsequent calculation and action execution of the intelligent agent are facilitated.

Further, training the convolutional neural network model by adopting a deep reinforcement learning algorithm, comprising: constructing a reward and punishment function according to the detection process of the agent in the task area; and training the convolutional neural network model by adopting a depth reinforcement learning algorithm based on the reward and punishment function.

The benefit of this scheme of adoption lies in: the learned convolutional neural network model can effectively eliminate adverse effects caused by various factors such as noise and the like, so that the convolutional neural network model is more suitable for the actual situation in path planning, and is reasonable and effective.

Further, the rewarding and punishing function is constructed in the following manner: r = r _dot +r _full +r _fail +r _close ；

Where r is a reward or punishment function, r _dot An average distance difference between the agent and an undetected point in the plurality of grid points at a current time and at a next subsequent time relative to the current time; the agent moves towards an undetected point, r _dot For awards, the agent is not moving towards undetected points, r _dot Is punishment; r is a radical of hydrogen _full Reward for the agent to complete a full coverage task; r is _fail Punishment for collision of the intelligent body with an obstacle or driving away from a task area; r is a radical of hydrogen _close For the distance of the agent from the obstacle or from the task areaIs less than the penalty of the target distance.

The benefit of this scheme of adoption lies in: the reward and punishment function provided by the scheme can avoid sparse reward, has a corresponding reward and punishment value every time one step is executed, and accelerates model training; the generalization of the trained convolutional neural network model is higher.

Further, the method further comprises:

assigning values to each grid point according to the environment attribute of each grid point in the plurality of grid points at the next moment to obtain a second environment state matrix for representing the environment state of the task area;

assigning values to each grid point according to the distance between the intelligent agent and each grid point at the next moment to obtain a second position state matrix for representing the position state of the intelligent agent;

the next moment is a moment adjacent to the current moment and occurring after the current moment;

mean distance difference r _dot Obtained by the following formula:

；

wherein S is _cur Is a first environmental state matrix and is greater than 0,dis _cur Is a first position state matrix, S _next Is a second ambient state matrix and is greater than 0,dis _next And n is the number of undetected points in the second position state matrix.

The invention also provides a full-coverage path planning device based on deep reinforcement learning, which comprises:

the system comprises a first determining module, a second determining module and a control module, wherein the first determining module is used for determining a plurality of grid points according to a task area where an agent is located;

the second determining module is used for determining a first environment state matrix according to the environment attribute of each grid point in the plurality of grid points;

the third determining module is used for determining a first position state matrix according to the distance between the intelligent agent and each grid point at the current moment;

the fourth determining module is used for determining N heading information matrixes according to the distance between the intelligent agent and each grid point at N previous moments; the N previous moments are moments which are adjacent to the current moment and occur before the current moment;

the building module is used for building a convolutional neural network model, splicing the first environment state matrix, the first position state matrix and the N heading information matrices into N +2 state input matrices to be input into the convolutional neural network model, and outputting an output value representing the next-step execution information of the intelligent agent;

the training module is used for training the convolutional neural network model according to a deep reinforcement learning algorithm;

and the planning module is used for planning the path of the intelligent agent according to the trained convolutional neural network model.

The invention also provides an electronic device comprising a processor, a memory and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the method as in any of the above aspects.

The invention also provides a readable storage medium on which is stored a program or instructions which, when executed by a processor, performs a method as in any of the above.

Drawings

Fig. 1 is a flowchart of a full coverage path planning method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of grid points provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first environment state matrix according to an embodiment of the present invention;

fig. 4 is a schematic diagram of a first position state matrix according to an embodiment of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

Referring to fig. 1 to 4, the present embodiment provides a full coverage path planning method based on deep reinforcement learning, including:

wherein the N previous moments are moments adjacent to the current moment and occurring before the current momentN is greater than or equal to 2; n is ₁ Is an integer from 1 to 1000; n is a radical of an alkyl radical ₂ Is an integer of 1 to 1000.

In view of the prior art, the full coverage path planning needs to grid the environment, so that a large number of grids are generated, and the calculation difficulty of the device is increased.

Aiming at the problem, the embodiment provides a new path planning method; and changing the grid blocks into points by using a method similar to grid graph modeling, and assigning values to the grid blocks according to the environment attributes to obtain a first environment state matrix which can represent the environment type and the coverage condition of the environment. And further acquiring the distances between the intelligent body and the grid points when the current time and the N previous times are obtained, and respectively obtaining a first position state matrix and N heading information matrices. And splicing the first environment state matrix, the first position state matrix and the N heading information matrixes into N +2 state input matrixes which are used as input values of the convolutional neural network model. The constructed convolutional neural network model is trained and then used for path planning of the intelligent agent, wherein the path planning includes but is not limited to determination of speed, steering angle and the like of the intelligent agent. Further, a deep reinforcement learning algorithm is used for training the convolutional neural network model; the deep reinforcement learning algorithm includes, for example, DDPG, SAC, and the like. The trained convolutional neural network model can be used for full coverage path planning of the agent.

The benefit of this scheme of adoption lies in: firstly, compared with the traditional rasterization modeling mode which uses discrete behavior actions, the scheme can output actions in a continuous action space and finally form an optimal full coverage path; and the second step is as follows: the calculated amount is reduced, the convolutional neural network model is used in the scheme, the input state characteristics can be extracted, and the mode is smaller than the image matrix data amount; and thirdly: the designed high-dimensional state input has more obvious characteristics and can represent richer environment and intelligent state information; fourthly, the method comprises the following steps: the resource utilization rate is increased, and the model convolutional neural network model is trained by using a reinforcement learning algorithm and does not depend on an original data set; and fifthly, the universality is higher, by using the path planning method provided by the scheme, the initial position of the intelligent agent can be any position in the task area, and the area outside the task can use the same set of neural network model.

In the embodiment, a modeling mode of a traditional grid graph is abandoned, and a grid point is assigned, specifically, when the grid point is an obstacle, the assignment is-1; the grid point is a passable area, namely the grid point is assigned as 1 when not detected; the grid point is the covered traffic area, i.e. when it has been detected, the value is assigned to 0. Thus, a first environment state matrix with n × n dimensions is formed, and the environment type and the environment coverage condition of the current area can be represented.

；

wherein dis _i,j Is the Euclidean distance, X, between the agent and the grid point of the ith row and the jth column of the first position state matrix _agent For the X coordinate, Y coordinate of the agent in a two-dimensional rectangular plane coordinate system corresponding to the task area _agent For the Y coordinate, X coordinate of the intelligent body in a two-dimensional plane rectangular coordinate system _i,j Is a Yuan dis _i,j X-coordinate, Y-coordinate in a two-dimensional planar rectangular coordinate system _i,j Is Yuan Di _i,j Y-coordinate, dis, in a two-dimensional plane rectangular coordinate system _max The longest distance in the task area.

In the present embodiment, the output value is limited to the range of [ -1,1] using the tanh activation function. the convergence speed of the tanh activation function is high, and the iteration times are few; output values are standardized, and subsequent calculation and action execution of the intelligent agent are facilitated.

In the embodiment, the convolutional neural network model is trained by using a reinforcement learning algorithm, and does not depend on an original data set; the learned convolutional neural network model can effectively eliminate adverse effects caused by various factors such as noise and the like, so that the convolutional neural network model is more suitable for the actual situation in path planning, and is reasonable and effective.

Further, the reward and punishment function is constructed in the following manner: r = r _dot +r _full +r _fail +r _close ；

Where r is a reward or punishment function, r _dot An average distance difference between the agent and an undetected point in the plurality of grid points at a current time and at a next subsequent time relative to the current time; the agent moves towards the undetected point, r _dot For awards, the agent is not moving towards undetected points, r _dot Is punishment; r is _full Awarding for completing a full coverage task for the agent; r is _fail Punishment for collision of the intelligent body with an obstacle or driving away from a task area; r is _close Penalty for the agent being less than the target distance from the obstacle or from the boundary of the task area.

In the related art, sparse reward is mostly adopted, only scoring or not scoring is carried out, and no feedback is provided in the action execution process, so that the training result is poor. For this situation, the present embodiment constructs a reward and punishment function, which includes four parameters; the objectivity of the data can be fully respected, and the actions of the intelligent agent are comprehensively investigated; and each step of execution has a corresponding reward and punishment value, so that the model training is accelerated.

In the present embodiment, r _dot The method is used for evaluating the movement trend of the intelligent agent, and when the intelligent agent moves towards an undetected point, the intelligent agent is rewarded, and otherwise, the intelligent agent is punished. r is _full Is a reward for outcome that is assigned when the agent completes the full coverage task. r is _fail Is a penalty for agent errors, such as hitting an obstacle or driving out of a task area. r is _close The method can inspect the motion behavior degree of the intelligent body, set the target distance and take punishment when the distance between the intelligent body and the obstacle is smaller than the target distance. Through the setting of the four parameters, the convolutional neural network model can be further optimized, and more appropriate path planning can be performed on the intelligent agent.

Further, the method further comprises:

respectively assigning values to each grid point according to the environment attribute of each grid point in the plurality of grid points at the next moment to obtain a second environment state matrix for representing the environment state of the task area;

the next moment is a moment which is adjacent to the current moment and occurs after the current moment;

mean distance difference r _dot Obtained by the following formula:

；

In the present embodiment, r _dot The intelligent agent motion trend calculation method is used for calculating the motion trend of the intelligent agent and judging from the sum of two dimensions of time and distance; specifically, the movement trend of the agent is judged according to the distance between the agent and each grid point at the current moment and the distance between the agent and each grid point at the next moment and the change between the two distances. In one embodiment, the distance is a euclidean distance.

Example 2

The embodiment provides a full coverage path planning device based on deep reinforcement learning, which comprises:

the first determination module is used for determining a plurality of grid points according to the task area where the agent is located;

the second determining module is used for determining the first environment state matrix according to the environment attribute of each grid point in the plurality of grid points;

a fourth determining module, configured to determine N heading information matrices according to distances between the intelligent agent and each grid point at N previous moments; the N previous moments are moments which are adjacent to the current moment and occur before the current moment;

Example 3

The present embodiment provides an electronic device, which includes a processor, a memory, and a program or an instruction stored in the memory and executable on the processor, wherein the program or the instruction implements the steps of the method of the above embodiment when executed by the processor.

Example 4

The present embodiment provides a readable storage medium on which a program or instructions are stored, which when executed by a processor implement the steps of the method of the above embodiment.

The processor is the processor in the electronic device in the above embodiment. Readable storage media, including computer-readable storage media, such as computer Read-Only Memory (ROM), random Access Memory (RAM), magnetic or optical disks, etc.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A full coverage path planning method based on deep reinforcement learning is characterized by comprising the following steps:

dividing task area where intelligent agent is located into n ₁ ×n ₂ A plurality of grid points arranged in a matrix;

according to the environment attribute of each grid point in the plurality of grid points at the current moment, respectively assigning values to each grid point to obtain a first environment state matrix used for representing the environment state of the task area;

according to the distance between the intelligent agent and each grid point at the current moment, respectively assigning values to each grid point to obtain a first position state matrix for representing the position state of the intelligent agent;

according to the distances between the intelligent agent and each grid point at N previous moments, respectively assigning values to each grid point to obtain N heading information matrixes for representing heading information of the intelligent agent;

splicing the first environment state matrix, the first position state matrix and the N heading information matrixes into N +2 state input matrixes;

constructing a convolutional neural network model, and inputting the N +2 state input matrixes into the convolutional neural network model, so that the convolutional neural network model outputs an output value representing the next-step execution information of the agent according to the N +2 state input matrixes;

training the convolutional neural network model by adopting a deep reinforcement learning algorithm;

adopting the trained convolutional neural network model to plan a path of the agent;

n previous moments are adjacent to the current moment and occur at moments before the current moment, and N is greater than or equal to 2; n is ₁ Is an integer from 1 to 1000; n is a radical of an alkyl radical ₂ Is an integer from 1 to 1000;

the element m (i, j) in the first environment state matrix is any one of [ -1,0,1], which is assigned according to the following principle:

when the grid point is an obstacle, m (i, j) = -1;

the environment attribute is that when a grid point has been detected, m (i, j) =0;

when the environment attribute is that the grid point is not detected, m (i, j) =1;

meta dis in the first position state matrix _i,j Assigned according to the following principle:

；

wherein dis _i,j Is the Euclidean distance, X, between the agent and the grid point of the ith row and the jth column of the first position state matrix _agent For the X coordinate and the Y coordinate of the intelligent agent in a two-dimensional plane rectangular coordinate system corresponding to the task area _agent Is the Y coordinate and the X coordinate of the intelligent body in the two-dimensional plane rectangular coordinate system _i,j Is said meta dis _i,j X-coordinate, Y-coordinate in said two-dimensional rectangular plane coordinate system _i,j Is said meta dis _i,j Y-coordinate, dis, in said two-dimensional plane rectangular coordinate system _max Is the longest distance in the task area.

2. The method of claim 1, further comprising:

and limiting the output value within the range of [ -1,1] by using a tanh activation function at an output layer of the convolutional neural network model, and multiplying the limited output value by the maximum steering limit of the intelligent agent to obtain a steering action output value representing the steering action of the intelligent agent.

3. The method according to claim 1 or 2, wherein the training the convolutional neural network model by using a deep reinforcement learning algorithm comprises:

constructing a reward and punishment function according to the detection process of the agent in the task area;

and training the convolution neural network model by adopting a depth reinforcement learning algorithm based on the reward and punishment function.

4. The method of claim 3, wherein the reward function is constructed by:

r=r _dot +r _full +r _fail +r _close ；

wherein r is the rewarding function,

r _dot an average distance difference between the agent and an undetected point in the plurality of grid points at the current time and at a next time relative to the current time; the next moment is adjacent to the current moment, anda time occurring after the current time;

the agent moves towards the undetected point, r _dot For awards, the agent is not moving towards the undetected point, r _dot Is punishment;

r _full reward for the agent completing a full coverage task;

r _fail punishment of collision of the intelligent body with an obstacle or driving away from a task area is given;

r _close and punishment that the distance between the intelligent agent and the obstacle or the distance between the intelligent agent and the boundary of the task area is smaller than the target distance.

5. The method of claim 4, further comprising:

wherein the average distance difference r _dot Obtained by the following formula:

；

wherein S is _cur Is the first environmental state matrix and is greater than 0,dis _cur Is the first position state matrix, S _next Is the second ambient state matrix and is greater than 0,dis _next N is the number of undetected points for the second position state matrix.

6. A full coverage path planning device based on deep reinforcement learning is characterized by comprising:

a second determining module, configured to determine a first environment state matrix according to an environment attribute of each of the plurality of grid points;

a third determining module, configured to determine a first position state matrix according to distances between the agent and each grid point at the current time;

a fourth determining module, configured to determine N heading information matrices according to distances between the agent and each grid point at N previous times, respectively; the N previous moments are moments which are adjacent to the current moment and occur before the current moment;

the building module is used for building a convolutional neural network model, splicing the first environment state matrix, the first position state matrix and the N heading information matrices into N +2 state input matrices, inputting the state input matrices into the convolutional neural network model, and outputting an output value representing the next step execution information of the intelligent agent;

7. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, which when executed by the processor, implements the full coverage path planning method of any of claims 1 to 5.

8. A readable storage medium, on which a program or instructions are stored, which when executed by a processor, implement a full coverage path planning method according to any one of claims 1 to 5.