CN116203972B - Unknown environment exploration path planning method, system, equipment and medium - Google Patents

Unknown environment exploration path planning method, system, equipment and medium

Info

Publication number: CN116203972B
Authority: CN (China)
Prior art keywords: grid map, current moment, grid, path, moment
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202310491011.3A
Other languages: Chinese (zh)
Other versions: CN116203972A (en)
Inventors: 张旭东, 樊杰, 邹渊, 刘颖群
Current assignee: Beijing Institute of Technology (BIT)
Original assignee: Beijing Institute of Technology (BIT)
Application filed by Beijing Institute of Technology (BIT); priority to CN202310491011.3A
Publication of application CN116203972A; application granted and published as CN116203972B


Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05D: SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00: Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02: Control of position or course in two dimensions
    • G05D1/021: Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219: Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses an unknown environment exploration path planning method, system, equipment and medium, relating to the technical field of unknown environment exploration. A plurality of edge points of the grid map at the current moment are determined using an edge detection algorithm, and part of the edge points are uniformly selected as potential target points. One potential target point is then selected as the target edge point based on a reinforcement learning algorithm, a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point is planned, and the unmanned vehicle is controlled to move along the collision-free path. This process continues until the exploration of the unknown environment is completed, thereby completing the path planning process for the unknown environment and generating a grid map of the unknown environment. By selecting the target edge point through the reinforcement learning algorithm, the method can solve the problems of long planned paths and premature termination of exploration, improves the exploration efficiency of the unknown environment, and can achieve full exploration of the unknown environment.

Description

Unknown environment exploration path planning method, system, equipment and medium
Technical Field
The present invention relates to the field of unknown environment exploration technologies, and in particular, to a method, a system, an apparatus, and a medium for planning an unknown environment exploration path.
Background
Unknown environment exploration has important applications in fields such as disaster-area rescue, household cleaning, and outer-space exploration. The path planning methods currently in common use for unknown environment exploration fall into two categories: path guidance methods based on edge points and path optimization methods based on reinforcement learning. Map construction of the unknown environment is completed through these path planning methods.
Although the edge-point-based path guidance method is easy to deploy and implement, its optimality is poor because the target edge point is usually selected based on engineering experience, so the path planned for exploring the map is usually long. Moreover, in some special cases the target edge point may fall into a dead end, causing the exploration of the unknown environment to terminate prematurely, making full exploration of the unknown environment difficult to achieve.
Although the reinforcement-learning-based path optimization method has good optimality and plans relatively short paths, it must search for the optimal action over the full action space (namely, the speed and heading of the ground unmanned vehicle), so model training takes a relatively long time. In addition, because the next space to be explored is close to the current position, a short-sighted effect easily forms; that is, the method has difficulty quickly reaching distant unknown space to explore.
Based on this, a new technique for unknown environment exploration path planning is needed.
Disclosure of Invention
The invention aims to provide an unknown environment exploration path planning method, system, equipment and medium that combine the edge-point-based path guidance method with the reinforcement-learning-based path optimization method and select the target edge point through a reinforcement learning algorithm, which can solve the problems of long planned paths and premature termination of exploration, improve the exploration efficiency of the unknown environment, and achieve full exploration of the unknown environment.
In order to achieve the above object, the present invention provides the following solutions:
an unknown environment exploration path planning method, the unknown environment exploration path planning method comprising:
acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of unmanned vehicles for exploring unknown environments; the laser radar data are scanning data obtained by scanning surrounding environments by a laser radar carried on the unmanned vehicle;
updating the grid map at the previous moment by utilizing the vehicle pose data and the laser radar data at the current moment to obtain the grid map at the current moment; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupation probability of each grid; the occupancy probability is used to represent the likelihood that the grid is occupied;
Performing edge point detection on the grid map at the current moment by using an edge detection algorithm to obtain a plurality of edge points; based on the positions of all the edge points in the grid map at the current moment, uniformly selecting part of the edge points as potential target points;
taking the grid map at the current moment as input, and selecting one potential target point as a target edge point based on a reinforcement learning algorithm;
planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path until reaching the target edge point;
judging whether the exploration of the unknown environment is completed or not; if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment.
In some embodiments, updating the grid map at the previous time by using the vehicle pose data and the laser radar data at the current time, and obtaining the grid map at the current time specifically includes:
determining the update occupation probability of each grid at the current moment according to the vehicle pose data and the laser radar data at the current moment;
And for each grid, calculating the occupancy probability of the grid at the current moment according to the updated occupancy probability of the grid at the current moment and the prior occupancy probability at the initial moment so as to update the grid map at the previous moment and obtain the grid map at the current moment.
In some embodiments, the calculation formula of the occupancy probability of the grid at the current moment includes:

p(m \mid z_{1:t}) = 1 - \frac{1}{1 + \exp(l_t)}

wherein p(m \mid z_{1:t}) is the occupancy probability of the grid m at the current time t; z_{1:t} is the lidar data at each time between the initial time and the current time t; l_t is the intermediate parameter at the current time t;

l_t = l_{t-1} + \log \frac{p(m \mid z_t)}{1 - p(m \mid z_t)} - \log \frac{p(m)}{1 - p(m)}

wherein l_{t-1} is the intermediate parameter at the previous time t-1; p(m \mid z_t) is the updated occupancy probability of the grid m at the current time t; z_t is the lidar data at the current time t; p(m) is the prior occupancy probability of the grid m at the initial moment.
In some embodiments, the edge detection algorithm is a Sobel edge detection algorithm.
In some embodiments, before selecting one of the potential target points as a target edge point based on a reinforcement learning algorithm with the grid map at the current time as an input, the unknown environment exploration path planning method further includes: training an initial reinforcement learning algorithm;
The cumulative return function used in the training process is as follows:

G_{t_0} = \sum_{t'=t_0}^{T} \gamma^{t'-t_0} r_{t'}

wherein G_{t_0} is the accumulated return at time T; t_0 is the initial time; γ is a forgetting factor; r_{t'} is the instant reward at the intermediate time t';

r = \alpha r_{EI} + r_l + r_f + r_w

wherein r is the instant reward; α is a weighting constant; r_{EI} is the entropy-increase reward; r_l is the step cost; r_f is the unexpected-termination cost; r_w is the full-exploration return.
In some embodiments, the method of calculating the entropy-increase reward includes: calculating the difference between the image entropy of the training grid map at the intermediate moment and the image entropy of the training grid map at the moment immediately before the intermediate moment to obtain the entropy-increase reward;
the step cost is a constant negative number;
the unexpected termination cost is a constant negative number;
the method for calculating the full-exploration return includes: calculating the difference between the training grid map at the intermediate moment and the real grid map at the intermediate moment; if the difference is smaller than a preset threshold, the full-exploration return is the reward value; if not, the full-exploration return is 0.
In some embodiments, the planning, according to the grid map at the current time and the target edge point, a collision-free path from the current position of the unmanned vehicle at the current time to the target edge point, and controlling the unmanned vehicle to move along the collision-free path specifically includes:
Using the grid map at the current moment and the target edge point as inputs, and planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point by using a PRM algorithm; the collision-free path is composed of a plurality of path points;
taking the collision-free path and the vehicle pose data at the current moment as inputs, and calculating the linear speed and the angular speed corresponding to each path point of the unmanned vehicle by using a pure tracking algorithm;
taking the laser radar data at the current moment as input, and correcting the angular speed corresponding to each path point of the unmanned vehicle by utilizing a VFH algorithm to obtain corrected angular speeds corresponding to each path point of the unmanned vehicle;
and controlling the unmanned vehicle to move along the collision-free path based on the linear speed and the corrected angular speed corresponding to each path point.
An unknown environmental exploration path planning system, the unknown environmental exploration path planning system comprising:
the data acquisition module is used for acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of unmanned vehicles for exploring unknown environments; the laser radar data are scanning data obtained by scanning surrounding environments by a laser radar carried on the unmanned vehicle;
The map updating module is used for updating the grid map at the previous moment by utilizing the vehicle pose data and the laser radar data at the current moment to obtain the grid map at the current moment; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupation probability of each grid; the occupancy probability is used to represent the likelihood that the grid is occupied;
the edge detection module is used for detecting edge points of the grid map at the current moment by utilizing an edge detection algorithm to obtain a plurality of edge points; based on the positions of all the edge points in the grid map at the current moment, uniformly selecting part of the edge points as potential target points;
the target point selection module is used for taking the grid map at the current moment as input, and selecting one potential target point as a target edge point based on a reinforcement learning algorithm;
the motion control module is used for planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path until reaching the target edge point;
The return module is used for judging whether the unknown environment is explored; if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment.
An unknown environment exploration path planning apparatus, comprising:
a processor; and
a memory in which computer-readable program instructions are stored,
wherein the above-described unknown environment exploration path planning method is performed when the computer readable program instructions are executed by the processor.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of the above described unknown environment exploration path planning method.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides an unknown environment exploration path planning method, system, equipment and medium. First, the vehicle pose data and laser radar data at the current moment are acquired, and the grid map at the previous moment is updated with them to obtain the grid map at the current moment. Then, edge point detection is performed on the grid map at the current moment using an edge detection algorithm to obtain a plurality of edge points, part of the edge points are uniformly selected as potential target points, and one potential target point is selected as the target edge point based on a reinforcement learning algorithm. Finally, a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point is planned according to the grid map at the current moment and the target edge point, and the unmanned vehicle is controlled to move along the collision-free path until the target edge point is reached; this process continues until the exploration of the unknown environment is completed, thereby completing the path planning process for unknown environment exploration and generating the grid map of the unknown environment. The invention combines the edge-point-based path guidance method with the reinforcement-learning-based path optimization method and selects the target edge point through the reinforcement learning algorithm, which can solve the problems of long planned paths and premature termination of exploration, improve the exploration efficiency of the unknown environment, and achieve full exploration of the unknown environment.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a path planning method according to embodiment 1 of the present invention;
FIG. 2 is a schematic diagram of a hierarchical data flow processing architecture for unknown environment exploration path planning provided in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of a convolution filter of the Sobel operator according to embodiment 1 of the present invention; wherein fig. 3 (a) is a schematic diagram of a horizontal convolution filter; FIG. 3 (b) is a schematic diagram of a vertical convolution filter;
fig. 4 is a system block diagram of a path planning system according to embodiment 2 of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to provide an unknown environment exploration path planning method, system, equipment and medium that combine the edge-point-based path guidance method with the reinforcement-learning-based path optimization method and select the target edge point through a reinforcement learning algorithm, which can solve the problems of long planned paths and premature termination of exploration, improve the exploration efficiency of the unknown environment, and achieve full exploration of the unknown environment.
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Example 1:
the embodiment is used for providing an unknown environment exploration path planning method, as shown in fig. 1, including:
s1: acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of unmanned vehicles for exploring unknown environments; the laser radar data are scanning data obtained by scanning surrounding environments by a laser radar carried on the unmanned vehicle;
in this embodiment, only the movement of the unmanned vehicle in the two-dimensional plane is considered, so the vehicle pose data in this embodiment include the position and the attitude of the unmanned vehicle at the current moment. The position may be represented by two-dimensional coordinates (i.e., the abscissa x and the ordinate y of the position of the unmanned vehicle), and the attitude may be represented by the heading angle of the unmanned vehicle. The vehicle pose data may be acquired by a sensor, and the unmanned vehicle may be an unmanned ground vehicle (Unmanned Ground Vehicle, UGV). The laser radar data are scanning data obtained by the laser radar mounted on the unmanned vehicle performing a 360-degree scan of the environment surrounding the position of the unmanned vehicle.
S2: updating the grid map at the previous moment by utilizing the vehicle pose data and the laser radar data at the current moment to obtain the grid map at the current moment; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupation probability of each grid; the occupancy probability is used to represent the likelihood that the grid is occupied;
s3: performing edge point detection on the grid map at the current moment by using an edge detection algorithm to obtain a plurality of edge points; based on the positions of all the edge points in the grid map at the current moment, uniformly selecting part of the edge points as potential target points;
s4: taking the grid map at the current moment as input, and selecting one potential target point as a target edge point based on a reinforcement learning algorithm;
s5: planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path until reaching the target edge point;
s6: judging whether the exploration of the unknown environment is completed or not; if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment.
After the target edge point is reached, the method returns to step S1 and executes S1-S3 in sequence to determine the number of edge points obtained by edge point detection. If, among these edge points, the number of edge points corresponding to non-obstacles is smaller than a preset number, the exploration is considered complete. The method used to judge whether an edge point corresponds to a non-obstacle is as follows: judge whether the occupancy probability of the grid where the edge point is located is 1; if not, the edge point corresponds to a non-obstacle.
This embodiment can adopt a hierarchical data stream processing architecture (i.e., a hierarchical architecture) oriented to unknown environment exploration path planning, and under this hierarchical architecture adopt a path planning method that integrates reinforcement learning and edge points, including an edge point generation method based on image edge detection and a target edge point screening method based on reinforcement learning. As shown in fig. 2, the hierarchical data stream processing architecture used in this embodiment includes a perception layer, a planning layer and a control layer. First, the laser radar data and vehicle pose data of the ground unmanned vehicle are input to the perception layer, and the perception layer iteratively updates the grid map based on the laser radar data and the vehicle pose data using the Bayesian estimation criterion (also called the Bayes rule) (corresponding to S1 and S2). Then, the updated grid map is input to the planning layer, where two data streams are processed: edge detection is performed on the grid map to extract edge points (corresponding to S3), and the grid map and the edge points are input to the reinforcement learning algorithm, so that the reinforcement learning agent selects the target edge point from the edge points (corresponding to S4). Finally, the target edge point is input to the control layer; the control layer takes the current position of the vehicle as the starting position and the target edge point as the goal position, plans a collision-free path from the starting position to the goal position through the probabilistic roadmap (Probabilistic Road Map, PRM) algorithm, and realizes path tracking control of the unmanned vehicle on the collision-free path based on the pure tracking algorithm (also known as the pure pursuit algorithm) and the vector field histogram (Vector Field Histogram, VFH) algorithm (corresponding to S5). The above process is repeated until the unknown environment is completely explored (corresponding to S6).
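For illustration only, the following Python sketch outlines one pass through this hierarchical architecture. The perception, planning and control routines are passed in as callables, and the helper names and the termination parameter min_frontiers are assumptions of the sketch rather than elements of the patent:

```python
# A minimal sketch of one S1-S6 iteration through the hierarchical architecture.
# The concrete perception / planning / control routines are injected as
# callables, since their implementations are described separately below.

def exploration_step(read_pose, read_scan, update_map, extract_frontiers,
                     select_target, plan_and_track, grid_map, min_frontiers=5):
    """Run one S1-S6 iteration; return (updated_map, done)."""
    pose = read_pose()                       # S1: vehicle pose (x, y, heading)
    scan = read_scan()                       # S1: 360-degree lidar ranges
    grid_map = update_map(grid_map, pose, scan)       # S2: Bayesian map update

    frontiers = extract_frontiers(grid_map)           # S3: edge point detection
    # S6: done when too few non-obstacle edge points remain
    open_frontiers = [p for p in frontiers if grid_map[p] < 1.0]
    if len(open_frontiers) < min_frontiers:
        return grid_map, True

    target = select_target(grid_map, open_frontiers)  # S4: RL target selection
    plan_and_track(grid_map, pose, target, scan)      # S5: PRM + path tracking
    return grid_map, False
```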
In this embodiment, a grid map (a two-dimensional map) is used to describe the unknown environment. The size range of the area where the unknown environment is located is determined in advance to draw an initial map, and the initial map is then divided into grids to obtain the grid map. Each grid in the grid map has an occupancy probability indicating the likelihood of being occupied; the higher the value of the occupancy probability, the higher the probability that the grid is occupied. At the initial moment, the occupancy probability of each grid in the grid map equals the prior occupancy probability. The grid map is then iteratively updated based on the laser radar data and vehicle pose data acquired at each moment using the Bayesian estimation criterion (namely, the posterior probability at time t-1, which may also be called the prior probability at time t, is combined with the laser radar data at time t to estimate the posterior probability at time t), so as to obtain the grid map at each moment.
In S2, updating the grid map at the previous time by using the vehicle pose data and the laser radar data at the current time, the obtaining the grid map at the current time may include:
(1) And determining the updated occupation probability of each grid at the current moment according to the vehicle pose data and the laser radar data at the current moment.
The laser radar generally takes the heading angle of the unmanned vehicle (which can be determined from the vehicle pose data at the current moment) as the starting angle (0 degrees) and performs a 360-degree scan of the environment surrounding the current position of the unmanned vehicle at the current moment to obtain the laser radar data, which include a scanning length for every scanning angle between 0 and 360 degrees. The laser radar has a maximum scanning length. If the scanning length at a certain scanning angle is smaller than the maximum scanning length, an obstacle exists in the direction of that scanning angle; the grid where the obstacle is located can be determined from the scanning length at that angle, the state of that grid is set to the occupied state, and the states of the grids lying along that scanning angle between the grid where the unmanned vehicle currently is and the grid where the obstacle is located are set to the idle state. If the scanning length at a certain scanning angle equals the maximum scanning length, no obstacle exists in the direction of that angle; the farthest grid can be determined from the maximum scanning length, and the states of the grids lying along that angle between the grid where the unmanned vehicle currently is and the farthest grid are set to the idle state. For the other grids of the grid map not scanned by the laser radar, the states are all set to the unknown state. The update occupancy probability at the current moment is then set to a first preset value for grids in the occupied state, a second preset value for grids in the idle state, and a third preset value for grids in the unknown state, so that the update occupancy probability of each grid at the current moment is determined from the vehicle pose data and the laser radar data at the current moment. The first preset value may be 0.8, the second preset value may be 0.2, and the third preset value may be 0.5. Since one grid may correspond to several scanning angles, if the state of a grid at the current moment is determined to be both occupied and idle based on different scanning angles, its state at the current moment is taken to be the occupied state.
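The inverse sensor model described above may be sketched as follows. The 0.8/0.2/0.5 preset values follow the text, while the beam-stepping scheme, the grid indexing and the function name are illustrative assumptions:

```python
import numpy as np

P_OCC, P_FREE, P_UNKNOWN = 0.8, 0.2, 0.5   # first/second/third preset values

def scan_update_probabilities(pose, angles, ranges, max_range,
                              grid_shape, resolution):
    """Per-grid update occupancy p(m|z_t) for one 360-degree scan.

    pose = (x, y, heading); `angles` are scan angles relative to the
    heading; `ranges` are the measured scanning lengths.
    """
    update = np.full(grid_shape, P_UNKNOWN)      # unscanned grids stay unknown
    x0, y0, heading = pose
    for ang, rng in zip(angles, ranges):
        theta = heading + ang
        # step along the beam, marking traversed grids as idle (free)
        for d in np.arange(0.0, rng, resolution):
            i = int((x0 + d * np.cos(theta)) / resolution)
            j = int((y0 + d * np.sin(theta)) / resolution)
            if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
                if update[i, j] != P_OCC:        # occupied state takes precedence
                    update[i, j] = P_FREE
        if rng < max_range:                      # beam ended on an obstacle
            i = int((x0 + rng * np.cos(theta)) / resolution)
            j = int((y0 + rng * np.sin(theta)) / resolution)
            if 0 <= i < grid_shape[0] and 0 <= j < grid_shape[1]:
                update[i, j] = P_OCC
    return update
```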
(2) And for each grid, calculating the occupancy probability of the grid at the current moment according to the updated occupancy probability of the grid at the current moment and the prior occupancy probability at the initial moment so as to update the grid map at the previous moment and obtain the grid map at the current moment.
Suppose that at the current time t the unmanned vehicle, at vehicle pose x_t, reads the laser radar data z_t. The purpose of the perception layer is then to determine the occupancy probability p(m \mid z_t, x_t) of each grid m at the current time t. Based on the Bayesian estimation criterion, the grid map is continuously updated with the laser radar data, the grid map at time t being deduced from the grid map at time t-1 by the following recursion:

l_t = l_{t-1} + \log \frac{p(m \mid z_t)}{1 - p(m \mid z_t)} - \log \frac{p(m)}{1 - p(m)} ; (1)

In formula (1), l_t is the intermediate parameter at the current time t; l_{t-1} is the intermediate parameter at the previous time t-1; p(m \mid z_t) is the updated occupancy probability of the grid m at the current time t; z_t is the laser radar data at the current time t; p(m) is the prior occupancy probability of the grid m at the initial moment. This embodiment initializes the prior occupancy probabilities of all grids to 0.5.

This embodiment defines the intermediate parameter according to the following formula (2):

l_t = \log \frac{p(m \mid z_{1:t})}{1 - p(m \mid z_{1:t})} ; (2)

In formula (2), p(m \mid z_{1:t}) is the occupancy probability of the grid m at the current time t; z_{1:t} is the laser radar data at each time between the initial time and the current time t.

This embodiment derives from formula (2) the calculation formula of the occupancy probability of the grid at the current moment, as follows:

p(m \mid z_{1:t}) = 1 - \frac{1}{1 + \exp(l_t)} ; (3)

After the intermediate parameter l_t of the current time t is calculated by formula (1), the occupancy probability of the grid at the current moment is calculated by formula (3). The occupancy probability of each grid in the grid map at the previous moment is replaced by the occupancy probability of that grid at the current moment to obtain the grid map at the current moment.
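Formulas (1)-(3) translate directly into a log-odds map update; a minimal sketch, assuming numpy and an illustrative 200×200 map, is:

```python
import numpy as np

def logodds(p):
    """Log-odds of a probability (elementwise)."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1.0 - p))

def bayes_map_update(l_prev, p_update, p_prior=0.5):
    """Recursion (1): l_t = l_{t-1} + logodds(p(m|z_t)) - logodds(p(m))."""
    return l_prev + logodds(p_update) - logodds(p_prior)

def occupancy_from_logodds(l_t):
    """Formula (3): p(m|z_{1:t}) = 1 - 1 / (1 + exp(l_t))."""
    return 1.0 - 1.0 / (1.0 + np.exp(l_t))

# usage: carry l as the map state; all grids start at the 0.5 prior
l = np.zeros((200, 200))                # logodds(0.5) == 0
p_update = np.full((200, 200), 0.5)     # update probabilities from one scan
l = bayes_map_update(l, p_update)
grid_map = occupancy_from_logodds(l)    # occupancy probabilities at time t
```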
In S3, when the edge detection algorithm is used to detect edge points of the grid map at the current moment to obtain a plurality of edge points, the edge detection algorithm may be the Sobel edge detection algorithm, which has the advantages of good robustness and a small amount of computation. Of course, other existing edge detection algorithms may also be used in this embodiment.
In the following, a Sobel edge detection algorithm is taken as an example, and a detection process of an edge point in this embodiment is discussed in detail:
taking a grid map at the current moment as image data, wherein one grid is equivalent to one pixel point, the occupied probability of the grid at the current moment is equivalent to the pixel value of the pixel point, and extracting edge points in the grid map at the current moment based on a Sobel edge detection algorithm. The core of the Sobel edge detection algorithm is the Sobel operator, which first performs weighted averaging on an image using a 3×3 convolution filter, and then detects edge points of the image by first-order differentiation. Assuming that f (x, y) represents an image, its gradient can be expressed as:
\nabla f = \left[ \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right] = [G_x, G_y] ; (4)

In formula (4), \nabla f is the image gradient; G_x is the gradient in the horizontal x direction; G_y is the gradient in the vertical y direction.
The Sobel operator includes a horizontal-direction convolution filter and a vertical-direction convolution filter; as shown in fig. 3, fig. 3 (a) is the horizontal-direction convolution filter and fig. 3 (b) is the vertical-direction convolution filter. The specific steps of the Sobel edge detection algorithm are as follows: (1) Traverse the image to be processed corresponding to the grid map at the current moment from left to right and from top to bottom with the horizontal and vertical convolution filters respectively, and perform a discrete convolution operation on each traversed target pixel: specifically, the weighted sum of the horizontal convolution filter with the pixel values of all pixels in the 3×3 neighborhood centered on the target pixel (including the target pixel itself) gives the horizontal convolution result, and the weighted sum of the vertical convolution filter with the pixel values of all pixels in the same 3×3 neighborhood gives the vertical convolution result. (2) For each target pixel, replace its pixel value in the image to be processed with the maximum of its horizontal and vertical convolution results to obtain the processed image. (3) For each pixel of the processed image, if its pixel value is greater than or equal to a threshold T, the pixel is determined to be an edge point of the processed image, thereby extracting the plurality of edge points of the grid map at the current moment.
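A minimal numpy/scipy sketch of this extraction follows; taking the larger of the absolute filter responses and the "nearest" border handling are assumptions of the sketch:

```python
import numpy as np
from scipy.ndimage import convolve

# Sobel convolution filters (horizontal and vertical), as in fig. 3
KX = np.array([[-1, 0, 1],
               [-2, 0, 2],
               [-1, 0, 1]], dtype=float)
KY = KX.T

def sobel_edge_points(grid_map, threshold):
    """Return (row, col) indices of edge points of an occupancy grid map.

    Each grid is treated as a pixel whose value is its occupancy
    probability; a pixel is an edge point when the larger absolute
    convolution response reaches the threshold T.
    """
    gx = convolve(grid_map, KX, mode="nearest")    # horizontal convolution
    gy = convolve(grid_map, KY, mode="nearest")    # vertical convolution
    response = np.maximum(np.abs(gx), np.abs(gy))  # keep the larger response
    return np.argwhere(response >= threshold)
```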
After the plurality of edge points are extracted, this embodiment further uniformly selects part of the edge points as potential target points based on the positions of all the edge points in the grid map at the current moment; that is, the selected potential target points are uniformly distributed over the grid map at the current moment. Specifically, the selection may proceed as follows: each edge point is assigned a random number, and a specific number of edge points are randomly selected as potential target points in the program by generating random numbers. A potential target point should be an edge point corresponding to a non-obstacle. This embodiment may further sort all the potential target points by their distance to the current position of the unmanned vehicle at the current moment, in either descending or ascending order; the purpose of sorting is to determine quickly which potential target point the target edge point refers to when the reinforcement learning algorithm outputs the n-th potential target point as the target edge point.
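A sketch of this sampling and sorting step, assuming numpy, ascending distance order, and an assumed candidate count n_targets, is:

```python
import numpy as np

def sample_potential_targets(edge_points, grid_map, vehicle_pos,
                             n_targets=10, rng=None):
    """Randomly pick n non-obstacle edge points and sort them by distance.

    edge_points: (K, 2) array of (row, col) grids; vehicle_pos: (row, col)
    of the vehicle's current grid. Sorting makes "the n-th potential target
    point" output by the RL agent unambiguous.
    """
    rng = np.random.default_rng() if rng is None else rng
    # keep only edge points whose grid is not certainly occupied
    free = edge_points[grid_map[edge_points[:, 0], edge_points[:, 1]] < 1.0]
    if len(free) == 0:
        return free
    n = min(n_targets, len(free))
    picked = free[rng.choice(len(free), size=n, replace=False)]
    dists = np.linalg.norm(picked - np.asarray(vehicle_pos), axis=1)
    return picked[np.argsort(dists)]      # distance-sorted potential targets
```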
This embodiment optimally selects among the potential target points based on the reinforcement learning algorithm to determine the optimal target edge point, namely the grid that the unmanned vehicle needs to reach at the next moment. Specifically, the grid map at the current moment output by the perception layer is used as the input state s of the reinforcement learning algorithm, the potential target points output by the planning layer form the action space a of the reinforcement learning algorithm, and the optimal policy π* of the reinforcement learning algorithm is used to select the optimal action in the action space, i.e., to select an optimal potential target point from all the potential target points as the target edge point. The output of the reinforcement learning algorithm is that the n-th potential target point is the target edge point; since the potential target points are sorted by distance, the potential target points in the action space of the reinforcement learning algorithm are ordered, and it can then be determined quickly which potential target point the target edge point is.
Because the dimension of the input layer of the reinforcement learning network (a neural network) corresponding to the reinforcement learning algorithm is preset, while the shapes of the different unknown environments being explored may have different aspect ratios, this embodiment adjusts the image size of the grid map at the current moment through the resize function in MATLAB before the grid map is input to the reinforcement learning algorithm, so that it conforms to the input-layer dimension of the neural network and the subsequent neural network can process the grid map at the current moment.
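The patent performs the resizing with MATLAB's resize function; the following PyTorch sketch shows an equivalent pipeline, where the small Q-network architecture (two convolution layers, a 64×64 input) and the greedy selection rule are illustrative assumptions, and n_actions is taken equal to the number of sorted candidates:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TargetSelector(nn.Module):
    """Tiny Q-network sketch: state = resized grid map, action = index n of
    a potential target point (the candidates are distance-sorted, so the
    index unambiguously names one of them)."""

    def __init__(self, n_actions=10, side=64):
        super().__init__()
        self.side = side
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * (side // 4) ** 2, n_actions),
        )

    def forward(self, grid_map):
        # resize the (H, W) float map to the fixed input size of the network
        x = grid_map.unsqueeze(0).unsqueeze(0)
        x = F.interpolate(x, size=(self.side, self.side),
                          mode="bilinear", align_corners=False)
        return self.net(x)                 # Q-value for each candidate index

def select_target_edge_point(q_net, grid_map, sorted_targets):
    with torch.no_grad():
        q = q_net(grid_map)
    n = int(q.argmax(dim=1))               # greedy action = n-th candidate
    return sorted_targets[n]
```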
The reinforcement learning algorithm of this embodiment may be any reinforcement learning algorithm, such as a deep Q-learning algorithm or a trust-region policy optimization algorithm. Since the optimal policy π* of the reinforcement learning algorithm is needed when determining the target edge point, before taking the grid map at the current moment as input and selecting a potential target point as the target edge point based on the reinforcement learning algorithm, the unknown environment exploration path planning method of this embodiment further includes: training the initial reinforcement learning algorithm, that is, training the network corresponding to the reinforcement learning algorithm and determining the optimal policy π*.
In the training process, the grid map is used as the state, the potential target points are used as the action space, and maximizing the cumulative return function is used as the training target.
Wherein, the cumulative return function used is:

G_{t_0} = \sum_{t'=t_0}^{T} \gamma^{t'-t_0} r_{t'} ; (5)

In formula (5), G_{t_0} is the accumulated return at time T; t_0 is the initial time, and t' = t_0, t_0+1, ..., T; γ is the forgetting factor, a constant in [0,1]; r_{t'} is the instant reward obtained from the environment at the intermediate time t'.

r = \alpha r_{EI} + r_l + r_f + r_w ; (6)

In formula (6), r is the instant reward; α is a weighting constant; r_{EI} is the entropy-increase reward; r_l is the step cost; r_f is the unexpected-termination cost; r_w is the full-exploration return.
Specifically, the entropy-increase reward is the entropy increase of the updated grid map and is used to encourage the unmanned vehicle to explore the unknown environment. Its calculation method is: calculate the difference between the image entropy of the training grid map at the intermediate moment and the image entropy of the training grid map at the moment immediately before the intermediate moment. The specific calculation formula is:

r_{EI} = H(m_t) - H(m_{t-1}) ; (7)

In formula (7), H(m_t) is the image entropy of the training grid map m_t at the intermediate time t; H(m_{t-1}) is the image entropy of the training grid map at the time t-1 immediately preceding the intermediate time t.

H(m) = -\sum_{n=1}^{N} p_n \log p_n ; (8)

In formula (8), p_n is the probability that the occupancy probability of a grid in the training grid map m falls in a certain interval n, obtained by frequency statistics over all grids in the map. Specifically, the range 0-1 of the grid occupancy probability is divided into N non-overlapping intervals, n = 1, 2, ..., N. Then, for each interval, the number of grids falling in that interval is counted and divided by the total number of grids to obtain the probability corresponding to that interval. Summing over all intervals gives the image entropy.
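Formula (8) may be sketched as follows, assuming numpy; the number of intervals N (n_bins) is not fixed by the text and is an assumed parameter:

```python
import numpy as np

def image_entropy(grid_map, n_bins=10):
    """Formula (8): Shannon entropy of the occupancy-probability histogram.

    The [0, 1] occupancy range is split into n_bins non-overlapping
    intervals; p_n is the fraction of grids falling in interval n.
    """
    counts, _ = np.histogram(grid_map, bins=n_bins, range=(0.0, 1.0))
    p = counts / grid_map.size
    p = p[p > 0]                         # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log(p)))
```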
The step cost aims to prompt the reinforcement learning algorithm to find a relatively short path, and the faster the unmanned vehicle explores an unknown environment, the lower the accumulated step cost, thereby encouraging the unmanned vehicle to improve the exploration efficiency, so the step cost is a constant negative number.
The unexpected-termination cost is set to a constant negative number to penalize the case where the environment is not fully explored; in this embodiment it may be set to -20.
If the difference between the real map and the explored map falls below a certain value, a full-exploration reward is provided to the reinforcement learning algorithm. The calculation method of the full-exploration return is therefore: calculate the difference between the training grid map at the intermediate moment and the real grid map at the intermediate moment; if the difference is smaller than a preset threshold, the full-exploration return is the reward value; if not, the full-exploration return is 0. The difference is calculated as follows: compute the mean error between the occupancy probabilities of all grids of the free-space region (i.e., the region whose occupancy probability is not 1) in the explored training grid map and those of the same free-space region in the real grid map, obtaining the probability average error, which is taken as the difference. If the probability average error is smaller than the preset threshold 0.05, the full-exploration return is the reward value, the exploration is considered complete, and the training episode ends. The preset threshold may also be another small value.
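Combining formulas (6)-(8) with the costs above, the instant reward may be sketched as follows (reusing image_entropy from the previous sketch). The constant values, the application of the weight α to the entropy term only, and the mean-absolute-error reading of the probability average error are assumptions:

```python
import numpy as np

STEP_COST = -1.0        # r_l: a constant negative number (value assumed)
TERM_COST = -20.0       # r_f: unexpected-termination cost from the text
EXPLORE_REWARD = 50.0   # r_w reward value (assumed; not given in the text)

def instant_reward(map_t, map_prev, real_map, alpha=1.0,
                   done_unexpectedly=False, threshold=0.05):
    """Formula (6): r = alpha * r_EI + r_l + r_f + r_w (weighting assumed)."""
    r_ei = image_entropy(map_t) - image_entropy(map_prev)   # formula (7)
    r_l = STEP_COST
    r_f = TERM_COST if done_unexpectedly else 0.0
    # full-exploration return: probability average error over free space
    free = real_map < 1.0
    mean_err = float(np.mean(np.abs(map_t[free] - real_map[free])))
    r_w = EXPLORE_REWARD if mean_err < threshold else 0.0
    return alpha * r_ei + r_l + r_f + r_w
```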
After the state, the action space and the cumulative return function are determined, any reinforcement learning algorithm can be adopted to obtain the optimal policy π*, a function mapping from the state s to the optimal action. According to the optimal policy π*, the optimal potential target point can be selected from the random potential target points as the target edge point (also called the path target tracking point). The sample data used in the training process include a plurality of real grid maps of explored unknown environments; one training pass is performed for each real grid map, and the training pass may include: working according to S1-S6, updating the grid map once every time vehicle pose data and laser radar data are obtained, calculating the accumulated return at the current moment according to the above formula, calculating the state-value function in reinforcement learning from the accumulated return, comparing the difference between the state function actually output by the current network and the state function derived from the return function, and updating the parameters of the reinforcement learning network by gradient descent according to this difference until the parameters converge. When training the model, the exploration is considered complete when the probability average error between the free-space region of the training grid map obtained by exploration and all grids of the free-space region in the real grid map is less than 0.05. This training process is a conventional training process for reinforcement learning algorithms and is not described in detail here.
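As one concrete instance of this conventional loop, a DQN-style temporal-difference update is sketched below, assuming the TargetSelector network sketched earlier; the patent does not fix the algorithm, and the discount factor and MSE loss are assumptions:

```python
import torch
import torch.nn.functional as F

def q_update_step(q_net, optimizer, state, action, reward, next_state,
                  done, gamma=0.99):
    """One temporal-difference parameter update (sketch only; any standard
    reinforcement learning algorithm could be substituted)."""
    q_pred = q_net(state)[0, action]            # value actually output
    with torch.no_grad():
        q_next = 0.0 if done else q_net(next_state).max().item()
        target = reward + gamma * q_next        # return-based target value
    loss = F.mse_loss(q_pred, torch.tensor(target))
    optimizer.zero_grad()
    loss.backward()                             # gradient descent update
    optimizer.step()
    return float(loss)
```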
The control layer takes the grid map at the current moment from the perception layer and the target edge point from the planning layer as inputs, and first generates a collision-free path connecting the current position and the target edge point using the PRM algorithm. The generated collision-free path is then input to the pure tracking algorithm to obtain the control linear velocity and angular velocity of the unmanned vehicle. To avoid obstacles and improve collision-free performance, the angular velocity output by the pure tracking algorithm is further processed by the VFH algorithm, which outputs the corrected angular velocity. Finally, the linear velocity output by the pure tracking algorithm and the corrected angular velocity output by the VFH algorithm are applied to the unmanned vehicle, so that the unmanned vehicle moves along the collision-free path to the target edge point, the pose information of the unmanned vehicle is updated, and the unknown environment continues to be explored. Then, in S5, planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path, may include: using the grid map at the current moment and the target edge point as inputs, planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point with the PRM algorithm, the collision-free path consisting of a plurality of path points; taking the collision-free path and the vehicle pose data at the current moment as inputs, calculating the linear velocity and angular velocity of the unmanned vehicle corresponding to each path point with the pure tracking algorithm; taking the laser radar data at the current moment as input, correcting the angular velocity of the unmanned vehicle corresponding to each path point with the VFH algorithm to obtain the corrected angular velocity corresponding to each path point; and controlling the unmanned vehicle to move along the collision-free path based on the linear velocity and the corrected angular velocity corresponding to each path point.
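The pure tracking (pure pursuit) step may be sketched as follows; the constant cruise speed v_linear is an assumed parameter, and the VFH correction is left to a separate routine as described above:

```python
import numpy as np

def pure_pursuit_command(pose, waypoint, v_linear=0.5):
    """Linear and angular velocity toward one path point (pure pursuit law).

    pose = (x, y, heading); waypoint = (x, y). v_linear is an assumed
    constant cruise speed. The returned angular velocity would then be
    corrected by the VFH algorithm using the current lidar scan before
    being applied to the unmanned vehicle.
    """
    x, y, heading = pose
    dx, dy = waypoint[0] - x, waypoint[1] - y
    ld = np.hypot(dx, dy)                    # look-ahead distance
    if ld < 1e-6:                            # already at the path point
        return 0.0, 0.0
    alpha = np.arctan2(dy, dx) - heading     # bearing error in vehicle frame
    alpha = np.arctan2(np.sin(alpha), np.cos(alpha))  # wrap to [-pi, pi]
    curvature = 2.0 * np.sin(alpha) / ld     # pure pursuit curvature law
    return v_linear, v_linear * curvature    # (linear, angular) command
```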
The above unknown environment exploration path planning method combines the edge-point-based path guidance method with the reinforcement-learning-based path optimization method and selects the target edge point through the reinforcement learning algorithm; it can thus overcome the long planned paths and premature termination of exploration that occur in the edge-point-based path guidance method, improve the exploration efficiency of the unknown environment, and achieve full exploration of the unknown environment.
Example 2:
the embodiment is used for providing an unknown environment exploration path planning system, as shown in fig. 4, which includes:
the data acquisition module M1 is used for acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of unmanned vehicles for exploring unknown environments; the laser radar data are scanning data obtained by scanning surrounding environments by a laser radar carried on the unmanned vehicle;
the map updating module M2 is configured to update the grid map at the previous time by using the vehicle pose data and the laser radar data at the current time, so as to obtain the grid map at the current time; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupation probability of each grid; the occupancy probability is used to represent the likelihood that the grid is occupied;
The edge detection module M3 is used for detecting edge points of the grid map at the current moment by utilizing an edge detection algorithm to obtain a plurality of edge points; based on the positions of all the edge points in the grid map at the current moment, uniformly selecting part of the edge points as potential target points;
the target point selection module M4 is configured to select one of the potential target points as a target edge point based on a reinforcement learning algorithm with the grid map at the current time as an input;
the motion control module M5 is used for planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path until reaching the target edge point;
a return module M6, configured to determine whether the exploration of the unknown environment is completed; if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment.
Example 3:
the embodiment is used for providing an unknown environment exploration path planning device, which comprises the following steps:
a processor; and
a memory in which computer-readable program instructions are stored,
Wherein the computer readable program instructions, when executed by the processor, perform the unknown environment exploration path planning method as described in embodiment 1.
Example 4:
the present embodiment is directed to a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the unknown environment exploration path planning method described in embodiment 1.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant points refer to the description of the method section.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; also, it is within the scope of the present invention to be modified by those of ordinary skill in the art in light of the present teachings. In view of the foregoing, this description should not be construed as limiting the invention.

Claims (9)

1. An unknown environment exploration path planning method, which is characterized by comprising the following steps:
acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of unmanned vehicles for exploring unknown environments; the laser radar data are scanning data obtained by scanning surrounding environments by a laser radar carried on the unmanned vehicle;
updating the grid map at the previous moment by utilizing the vehicle pose data and the laser radar data at the current moment to obtain the grid map at the current moment; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupation probability of each grid; the occupancy probability is used to represent the likelihood that the grid is occupied;
performing edge point detection on the grid map at the current moment by using an edge detection algorithm to obtain a plurality of edge points; based on the positions of all the edge points in the grid map at the current moment, part of the edge points are uniformly selected as potential target points, and specifically, the selection mode can be as follows: assigning a random number to each edge point, randomly selecting a specific number of edge points as potential target points in a program in a mode of generating the random number, wherein the potential target points are the edge points corresponding to non-obstacle;
taking the grid map at the current moment as input, and selecting one potential target point as a target edge point based on a reinforcement learning algorithm; before the grid map at the current moment is input into the reinforcement learning algorithm, the image size of the grid map at the current moment is adjusted through the resize function in MATLAB;
planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path until reaching the target edge point;
judging whether the exploration of the unknown environment is completed or not; if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment;
the step of planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and controlling the unmanned vehicle to move along the collision-free path specifically comprises the following steps:
using the grid map at the current moment and the target edge point as inputs, and planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point by using a PRM algorithm; the collision-free path is composed of a plurality of path points;
Taking the collision-free path and the vehicle pose data at the current moment as inputs, and calculating the linear speed and the angular speed corresponding to each path point of the unmanned vehicle by using a pure tracking algorithm;
taking the laser radar data at the current moment as input, and correcting the angular speed corresponding to each path point of the unmanned vehicle by utilizing a VFH algorithm to obtain corrected angular speeds corresponding to each path point of the unmanned vehicle;
and controlling the unmanned vehicle to move along the collision-free path based on the linear speed and the corrected angular speed corresponding to each path point.
2. The unknown environment exploration path planning method according to claim 1, wherein updating the grid map of the previous time by using the vehicle pose data and the laser radar data of the current time, and obtaining the grid map of the current time specifically comprises:
determining the update occupation probability of each grid at the current moment according to the vehicle pose data and the laser radar data at the current moment;
and for each grid, calculating the occupancy probability of the grid at the current moment according to the updated occupancy probability of the grid at the current moment and the prior occupancy probability at the initial moment so as to update the grid map at the previous moment and obtain the grid map at the current moment.
3. The unknown environmental exploration path planning method of claim 2, wherein a calculation formula of an occupancy probability of said grid at a current time comprises:
p(m \mid z_{1:t}) = 1 - \frac{1}{1 + \exp(l_t)}

wherein p(m \mid z_{1:t}) is the occupancy probability of the grid m at the current time t; z_{1:t} is the lidar data at each time between the initial time and the current time t; l_t is the intermediate parameter at the current time t;

l_t = l_{t-1} + \log \frac{p(m \mid z_t)}{1 - p(m \mid z_t)} - \log \frac{p(m)}{1 - p(m)}

wherein l_{t-1} is the intermediate parameter at the previous time t-1; p(m \mid z_t) is the updated occupancy probability of the grid m at the current time t; z_t is the lidar data at the current time t; p(m) is the prior occupancy probability of the grid m at the initial moment.
4. The unknown environmental exploration path planning method of claim 1, wherein said edge detection algorithm is a Sobel edge detection algorithm.
5. The unknown environment exploration path planning method of claim 1, wherein prior to selecting one of said potential target points as a target edge point based on a reinforcement learning algorithm with a grid map of said current time as an input, said unknown environment exploration path planning method further comprises: training an initial reinforcement learning algorithm;
the cumulative return function used in the training process is as follows:
G_{t_0} = \sum_{t'=t_0}^{T} \gamma^{t'-t_0} r_{t'}

wherein G_{t_0} is the accumulated return at time T; t_0 is the initial time; γ is a forgetting factor; r_{t'} is the instant reward at the intermediate time t';

r = \alpha r_{EI} + r_l + r_f + r_w

wherein r is the instant reward; α is a weighting constant; r_{EI} is the entropy-increase reward; r_l is the step cost; r_f is the unexpected-termination cost; r_w is the full-exploration return.
6. The unknown environmental exploration path planning method of claim 5, wherein the method of calculating the entropy-increase reward comprises: calculating the difference between the image entropy of the training grid map at the intermediate moment and the image entropy of the training grid map at the moment immediately before the intermediate moment to obtain the entropy-increase reward;
the step cost is a constant negative number;
the unexpected termination cost is a constant negative number;
the method for calculating the full exploration return comprises the following steps: calculating the difference between the training grid map at the middle moment and the real grid map at the middle moment; if the difference is smaller than a preset threshold, the sufficient exploration returns to be rewarding values; if not, the full exploration return is 0.
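Taken together, claims 5-6 define the instant reward r = α·r_EI + r_l + r_f + r_w, which can be sketched as below. All numeric constants (α, the step cost, the termination cost, the reward value, the difference threshold) and the histogram-based image entropy are illustrative assumptions, not values from the patent.

```python
import numpy as np

def image_entropy(grid, bins=16):
    """Shannon entropy of the grid map treated as a grayscale image."""
    hist, _ = np.histogram(grid, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def instant_reward(grid_t, grid_prev, true_grid, terminated_unexpectedly,
                   alpha=1.0, step_cost=-0.1, term_cost=-10.0,
                   done_reward=50.0, diff_thresh=0.05):
    """Instant reward r = alpha * r_EI + r_l + r_f + r_w (claims 5-6)."""
    # Entropy increase reward: entropy difference between consecutive maps.
    r_ei = image_entropy(grid_t) - image_entropy(grid_prev)
    r_l = step_cost                                      # constant negative
    r_f = term_cost if terminated_unexpectedly else 0.0  # constant negative
    # Full exploration return: compare against the real grid map.
    diff = np.abs(grid_t - true_grid).mean()
    r_w = done_reward if diff < diff_thresh else 0.0
    return alpha * r_ei + r_l + r_f + r_w
```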
7. An unknown environment exploration path planning system, the unknown environment exploration path planning system comprising:
the data acquisition module, used for acquiring vehicle pose data and laser radar data at the current moment; the vehicle pose data are pose data of the unmanned vehicle exploring the unknown environment; the laser radar data are scanning data obtained by scanning the surrounding environment with a laser radar mounted on the unmanned vehicle;
the map updating module, used for updating the grid map at the previous moment by using the vehicle pose data and the laser radar data at the current moment to obtain the grid map at the current moment; the grid map comprises a plurality of grids obtained by dividing the area where the unknown environment is located, and the occupancy probability of each grid; the occupancy probability is used to represent the likelihood that a grid is occupied;
the edge detection module, used for detecting edge points of the grid map at the current moment by using an edge detection algorithm to obtain a plurality of edge points, and for uniformly selecting part of the edge points as potential target points based on the positions of all the edge points in the grid map at the current moment; specifically, the selection may be performed as follows: each edge point is assigned a random number, and a specific number of edge points are randomly selected in the program as potential target points by generating random numbers, the potential target points being edge points that do not correspond to obstacles (an illustrative selection sketch is given after this claim);
the target point selection module, used for taking the grid map at the current moment as an input and selecting one potential target point as a target edge point based on a reinforcement learning algorithm, wherein the image size of the grid map at the current moment is adjusted by the resize function in MATLAB before the grid map at the current moment is input into the reinforcement learning algorithm;
the motion control module, used for planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point according to the grid map at the current moment and the target edge point, and for controlling the unmanned vehicle to move along the collision-free path until the target edge point is reached; planning the collision-free path and controlling the unmanned vehicle to move along the collision-free path specifically comprises:
taking the grid map at the current moment and the target edge point as inputs, planning a collision-free path from the current position of the unmanned vehicle at the current moment to the target edge point by using the PRM algorithm, the collision-free path being composed of a plurality of path points;
taking the collision-free path and the vehicle pose data at the current moment as inputs, calculating the linear velocity and the angular velocity of the unmanned vehicle at each path point by using the pure pursuit algorithm;
taking the laser radar data at the current moment as input, correcting the angular velocity of the unmanned vehicle at each path point by using the VFH algorithm to obtain a corrected angular velocity for each path point;
controlling the unmanned vehicle to move along the collision-free path based on the linear velocity and the corrected angular velocity at each path point;
the return module, used for judging whether the exploration of the unknown environment is complete, and if not, returning to the step of acquiring the vehicle pose data and the laser radar data at the current moment.
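The random selection of potential target points referenced in the edge detection module above can be sketched in a few lines; the sample size k and the seed handling are assumptions for illustration, not values from the patent.

```python
import random

def select_potential_targets(edge_points, k=10, seed=None):
    """Uniformly sample k edge points as potential target points.

    edge_points: list of (x, y) non-obstacle edge points.
    """
    rng = random.Random(seed)
    if len(edge_points) <= k:
        return list(edge_points)
    return rng.sample(edge_points, k)
```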
8. An unknown environment exploration path planning apparatus, comprising:
a processor; and
a memory in which computer-readable program instructions are stored,
wherein the computer readable program instructions, when executed by the processor, perform the unknown environment exploration path planning method of any of claims 1-6.
9. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor implements the steps of the unknown environment exploration path planning method of any of claims 1-6.
CN202310491011.3A 2023-05-05 2023-05-05 Unknown environment exploration path planning method, system, equipment and medium Active CN116203972B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310491011.3A CN116203972B (en) 2023-05-05 2023-05-05 Unknown environment exploration path planning method, system, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310491011.3A CN116203972B (en) 2023-05-05 2023-05-05 Unknown environment exploration path planning method, system, equipment and medium

Publications (2)

Publication Number Publication Date
CN116203972A (en) 2023-06-02
CN116203972B (en) 2023-08-18

Family

ID=86511535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310491011.3A Active CN116203972B (en) 2023-05-05 2023-05-05 Unknown environment exploration path planning method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN116203972B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116429145B (en) * 2023-06-07 2023-08-25 福龙马城服机器人科技有限公司 Automatic docking navigation method and system for unmanned vehicle and garbage can under complex scene


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112033410A (en) * 2020-09-03 2020-12-04 中南大学 Mobile robot environment map construction method, system and storage medium
CN113433937A (en) * 2021-06-08 2021-09-24 杭州未名信科科技有限公司 Heuristic exploration-based layered navigation obstacle avoidance system and layered navigation obstacle avoidance method
CN114494329A (en) * 2022-04-15 2022-05-13 之江实验室 Guide point selection method for mobile robot to independently explore in non-planar environment
CN114859932A (en) * 2022-05-20 2022-08-05 郑州大学产业技术研究院有限公司 Exploration method and device based on reinforcement learning and intelligent equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
An unmanned vehicle motion planning algorithm for unknown unstructured environments; Zhang Xudong et al.; Proceedings of the 2021 China Society of Automotive Engineers Annual Congress (1); full text *

Also Published As

Publication number Publication date
CN116203972A (en) 2023-06-02

Similar Documents

Publication Publication Date Title
WO2022188663A1 (en) Target detection method and apparatus
CN111626128A (en) Improved YOLOv 3-based pedestrian detection method in orchard environment
CN116203972B (en) Unknown environment exploration path planning method, system, equipment and medium
Botteghi et al. On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach
US11790661B2 (en) Image prediction system
US11703596B2 (en) Method and system for automatically processing point cloud based on reinforcement learning
Jin et al. Neu-nbv: Next best view planning using uncertainty estimation in image-based neural rendering
CN108537825B (en) Target tracking method based on transfer learning regression network
US20230121534A1 (en) Method and electronic device for 3d object detection using neural networks
CN113593035A (en) Motion control decision generation method and device, electronic equipment and storage medium
Bauer et al. Deep, spatially coherent inverse sensor models with uncertainty incorporation using the evidential framework
CN117389305A (en) Unmanned aerial vehicle inspection path planning method, system, equipment and medium
CN116385493A (en) Multi-moving-object detection and track prediction method in field environment
Forkel et al. Dynamic resolution terrain estimation for autonomous (dirt) road driving fusing lidar and vision
CN114764252A (en) Dynamic obstacle avoidance method and device and mobile robot
CN115268461A (en) Mobile robot autonomous navigation method with fusion algorithm
CN114820973A (en) Cost map generation method and device, computer equipment and storage medium
CN115047871A (en) Multi-unmanned vehicle collaborative search method, device, equipment and medium for dynamic target
CN114326821A (en) Unmanned aerial vehicle autonomous obstacle avoidance system and method based on deep reinforcement learning
Polk et al. A parallel architecture for curvature-based road scene classification
CN112396611B (en) Self-adaptive optimization method, device and storage medium for point-line visual odometer
Choi et al. Robust modeling and prediction in dynamic environments using recurrent flow networks
CN116148879B (en) Method for improving obstacle marking precision by robot
CN115527034B (en) Vehicle end point cloud dynamic and static segmentation method, device and medium
CN117706942B (en) Environment sensing and self-adaptive driving auxiliary electronic control method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant