CN115826581A - Mobile robot path planning algorithm combining fuzzy control and reinforcement learning - Google Patents

Mobile robot path planning algorithm combining fuzzy control and reinforcement learning

Info

Publication number
CN115826581A
Authority
CN
China
Prior art keywords
mobile robot
value
target
reward
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211693190.0A
Other languages
Chinese (zh)
Inventor
刘春玲
郭楷文
裴萌韶
骆远翔
程惠
李想
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN202211693190.0A priority Critical patent/CN115826581A/en
Publication of CN115826581A publication Critical patent/CN115826581A/en
Pending legal-status Critical Current

Abstract

A mobile robot path planning algorithm combining fuzzy logic and reinforcement learning belongs to the technical field of robot deep learning. The fuzzy controller and the direct controller provide a large number of positive samples for the Dueling DQN, which improves its training speed; in the multi-controller mode, the mobile robot avoids taking a large number of meaningless actions at the initial stage of training, and the positive samples are not swamped by the large number of negative samples that would otherwise accompany them. After training is finished, the model combining Dueling DQN with improved fuzzy control does not exhibit the jittering forward motion of the plain Dueling DQN model. The combined model performs well when the target position or the mobile robot position changes and when the scene differs from the training scene, which illustrates the effectiveness of the method in path planning. In addition, experiments show that the model combining Dueling DQN with improved fuzzy control has faster convergence, higher stability and better path quality than the Dueling DQN model.

Description

Mobile robot path planning algorithm combining fuzzy control and reinforcement learning
Technical Field
The invention belongs to the technical field of deep learning of robots, and particularly relates to a mobile robot path planning algorithm combining fuzzy control and reinforcement learning.
Background
Path planning is generally divided into global path planning and local path planning. Global path planning finds a feasible, collision-free path through an intelligent algorithm under the condition that the global map information is known; examples include the artificial bee colony algorithm, the ant colony algorithm and the probabilistic roadmap method (PRM). The artificial bee colony algorithm lets the global optimum emerge in the colony through the local optimization of each individual and has a relatively high convergence rate. The ant colony algorithm is a heuristic global optimization algorithm. PRM builds a network map of possible paths based on free space and occupied space and finds an optimal path by evaluating metrics such as distance and time. Local planning means that the mobile robot perceives the environment and finds a collision-free path according to the information obtained by its sensors; local planning algorithms include the artificial potential field method (APF) and the dynamic window approach (DWA). The artificial potential field method regards the mobile robot as a point in the environment that is pushed by the resultant force of the attraction of the target and the repulsion of the obstacles, thereby finding a safe path. The dynamic window approach samples the linear and angular velocities at a given moment, predicts the possible trajectories of the mobile robot over a short horizon, evaluates them, and executes the trajectory with the highest score. Most global path planning is used in static scenarios and requires all the information of the map; as the environment becomes more complex, the efficiency of global planning algorithms drops sharply. Local path planning collects scene information through sensors, and this information contains noise, which influences the resulting path and easily leads to a locally optimal path. Therefore, traditional path planning algorithms suffer from dependence on global map information, poor real-time performance and similar problems.
Reinforcement Learning (RL) is an artificial intelligence approach that needs no prior knowledge: it performs trial-and-error interaction directly with the environment to obtain feedback and optimize a policy, supports autonomous and online learning, and has gradually become a research hotspot for path planning of robots in unknown environments. Traditional reinforcement learning is limited by the dimensionality of the action space and the sample space and has difficulty with complex problems close to real conditions, whereas Deep Learning (DL) has strong perception capability and can handle more complex problems but lacks decision-making capability. Google Brain therefore combined DL and RL into Deep Reinforcement Learning (DRL), which makes up for the deficiencies of both and provides a new idea and direction for motion planning of mobile robots in complex environments. In 2013, Google DeepMind proposed the deep Q-learning network (DQN), a milestone in the development of deep reinforcement learning: it breaks through the learning mechanism based on shallow value-function approximation and policy search in traditional RL and realizes an end-to-end mapping from a high-dimensional input space to the Q-value space through a multilayer deep convolutional neural network, imitating the activity of the human brain. Since the agent needs to interact continuously with its surroundings, the Deep Q Network (DQN) inevitably has to learn a large number of network parameters, resulting in low learning efficiency. In response to these problems, researchers have studied the DQN algorithm intensively from the aspects of training algorithms, neural network structure, learning mechanisms and AC-framework-based DRL algorithms, and have proposed many improvements.
The existing literature provides an industrial robot obstacle-avoidance path planning method based on deep reinforcement learning, which solves the problems that, when a conventional method defines the robot's obstacle-avoidance reward function, a reward is only given after the robot reaches the target position, the obstacle-avoidance reward is sparse, and consequently the planning time and path length are long and the planning success rate is low. The existing literature also provides an improved deep reinforcement learning algorithm based on depth image information in order to solve the poor exploration capability and sparse state-space reward of traditional deep reinforcement learning for mobile robot path planning in an indoor unknown environment. The existing literature provides a dynamic fusion deep double-Q algorithm (DTDDQN) for the overestimation problem of the deep Q-learning algorithm in robot path planning; the comparison results show that the DTDDQN algorithm can better solve the overestimation problem in path planning and achieves a certain improvement in action selection and planned path length. The existing literature, aiming at the challenges of complex battlefield environments, incomplete information, large uncertainty and high strategy complexity faced by military (tactical-level) combat mission planning, combs the basic concepts and process framework of combat mission planning, introduces the basic principles and development status of deep reinforcement learning, and analyzes its application to scene recognition, target detection, behavior judgment, threat assessment, path planning and firepower distribution in combat mission planning. Document [13] proposes a global dynamic path planning fusion algorithm based on a safe A* algorithm and a dual-speed-model dynamic window approach, which reduces the number of iterations, the computation cost and the storage cost and improves the algorithm efficiency. The existing literature proposes visual localization, SLAM and chain-based path planning methods to achieve obstacle avoidance in an obstacle force field. The existing literature constructs an obstacle-avoidance safety model by analyzing the obstacle-avoidance behavior of human drivers; based on this safety model, the artificial potential field method is improved and the repulsive field range of the obstacle is reconstructed; finally, a collision-free path of the autonomous vehicle is generated based on the improved artificial potential field. The existing literature uses the state estimation of an extended Kalman filter to construct a dynamic obstacle-avoidance risk region along the direction of motion of an obstacle, and then combines the robot with nonlinear model predictive control to realize safe obstacle avoidance of dynamic obstacles.
Reinforcement learning models can handle path planning situations that were previously difficult to deal with, and their range of application is wide. However, the training process of a deep reinforcement learning model is long: many meaningless actions are taken at the initial training stage, generating many negative samples, which slows down training. Therefore, the training efficiency problem and the problem of positive samples being swamped by negative samples in the early training stage need to be solved.
Application No. 202210512746.5 proposes a mobile robot path planning method based on deep reinforcement learning, comprising the following steps: acquiring a depth image based on a fully convolutional residual network; sensing whether an obstacle exists in the front area; planning a path for avoiding the obstacle with a deep reinforcement learning algorithm; driving the robot until the obstacle is avoided; drawing a two-dimensional local environment map based on FastSLAM; and repeating these steps until the final destination is reached. That invention needs to draw a two-dimensional local map online.
Application No. 202210043667.4 discloses a vehicle path planning method and device based on deep reinforcement learning. The method comprises: constructing a solving framework for the vehicle path planning problem and determining the initial parameter information; building a neural network model as a destruction strategy; casting the large-neighborhood search process as a Markov decision process according to the initial parameter information and the destruction strategy; training the neural network model with a reinforcement learning method according to the Markov decision process; and solving the vehicle path planning problem with the trained neural network model to obtain a vehicle path planning result. The method can shorten the solving time, guarantee the solution quality and be widely applied in the technical field of artificial intelligence; however, although it shortens the solving time, the model training time remains long.
A mechanical arm path planning method integrating reinforcement learning and fuzzy obstacle avoidance is disclosed in Application No. 202110393339.2. It uses reinforcement learning to search for an optimal trajectory in a prior three-dimensional space model, starts a fuzzy-control obstacle avoidance algorithm when an obstacle is encountered, and returns to the reinforcement learning algorithm after the obstacle is successfully avoided so that the mechanical arm moves to the target point. The method can plan feasible paths for different states in different environments, has short decision time and a high success rate, can meet the real-time requirement of online planning, overcomes the poor real-time performance and large computational load of traditional mechanical arm path planning methods, and also overcomes the difficulty of improving learning efficiency in traditional reinforcement learning. This method only uses fuzzy control when an obstacle is encountered. All of the above methods suffer from long initial training time.
The training process of a deep reinforcement learning model for path planning is long: many meaningless actions are taken at the initial training stage, generating many negative samples and slowing down training. Therefore, the training efficiency problem and the swamping of positive samples by negative samples in the initial training stage need to be solved. The invention provides a training model integrating fuzzy control, direct control and reinforcement learning, which supplies positive samples at the initial training stage and solves the problem of random collisions at the initial training stage.
The design quality of the reward function determines whether the mobile robot can learn the expected strategy and directly influences the convergence speed and final performance of the DRL algorithm. The commonly adopted shaping reward may cause abnormal behavior of the mobile robot; a negative-feedback shaping reward mechanism is therefore proposed, which addresses the instability and drift of ordinary reinforcement learning paths that easily lead to collisions. The improved fuzzy-control reinforcement learning model has less path deviation and can ensure that the path planning task is finally completed in a complex environment.
Disclosure of Invention
In order to solve the existing problems, the invention provides a mobile robot path planning algorithm combining fuzzy control and reinforcement learning. The path of the agent is controlled jointly by fuzzy control and a Dueling DQN: a fuzzy controller and a direct controller are added to the deep reinforcement learning model, and the two controllers react according to different sensor conditions. The agent is a four-wheel-drive cart that steers by means of the speed difference between the left and right wheels and carries 5 laser sensors, each with a maximum detection distance of 5 m. After training begins, the environment passes state space information to the model; a dispatching center partitions the state space information by different thresholds into three situations, each handled by one controller: a simple situation is handled by the direct controller, a complex situation by the Dueling DQN, and a dangerous situation by the fuzzy controller. The experience data generated in all three situations are put into the experience pool, and in each training episode the fuzzy controller and the direct controller reduce collisions of the agent and yield positive samples. A positive sample is a sample corresponding to the main-line event, which prevents the probability of exploring the main-line event in a random manner in a complex environment from becoming extremely small. The Dueling DQN thus obtains many positive samples at the initial training stage, which improves the convergence rate. During training, the reward function guides the mobile robot to complete the task efficiently; since the reward of path planning is naturally sparse, it is changed into a dense reward, and a negative-feedback shaping reward derived from analysis of the task is used as the reward function.
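The multi-controller scheme described above can be illustrated with a minimal Python sketch (the threshold values, helper names and the agent interface are illustrative assumptions, not the patented implementation):

```python
ALERT_THRESHOLD = 1.0   # assumed "dangerous" distance for the three middle sensors (m)
SIDE_THRESHOLD = 2.0    # assumed side-sensor safety threshold (m)

def dispatch(state):
    """Dispatching center: choose a controller from the 7-feature state of formula (13)."""
    se1, se2, se3, se4, se5 = state[:5]
    if min(se1, se2, se3) <= ALERT_THRESHOLD:
        return "fuzzy"        # dangerous situation
    if se4 <= SIDE_THRESHOLD or se5 <= SIDE_THRESHOLD:
        return "direct"       # simple situation
    return "dueling_dqn"      # complex situation

def train(env, fuzzy_ctrl, direct_ctrl, agent, episodes=500):
    """All three controllers feed experience into the agent's experience pool."""
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            controller = dispatch(state)
            if controller == "fuzzy":
                action = fuzzy_ctrl.act(state)
            elif controller == "direct":
                action = direct_ctrl.act(state)
            else:
                action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.pool.push(state, action, reward, next_state, done)
            agent.learn_if_ready()   # sample a batch from the pool every few steps
            state = next_state
```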
Analyzing the motion model of the mobile robot for the state space: the angle between the target and the forward direction of the mobile robot lets the robot find the correct direction, the distance between the target and the mobile robot indicates the degree of progress, and the sensor data guarantee that the mobile robot avoids obstacles, so the state space has 7 features:
s = {se_1, se_2, se_3, se_4, se_5, θ_angle, dis}   (13)
where se_i (i = 1, 2, 3, 4, 5) represents the sensor data, θ_angle denotes the angle between the target and the forward direction of the mobile robot, and dis is the distance between the mobile robot and the target.
Furthermore, the fuzzy controller is responsible for handling dangerous situations. When the data of the middle sensor reaches the alert threshold, the mobile robot is controlled by the fuzzy controller, whose data tuple consists of the middle three sensors and the angle between the target and the forward direction of the mobile robot, i.e. {se_1, se_2, se_3, θ_angle}. When the fuzzy controller obtains the data tuple, it first preprocesses the data: the value x = se_2 − se_3 is passed into the fuzzification module, which divides the data into different stages according to the magnitude of the incoming value. When −β ≤ x ≤ β, the membership degree is calculated by formula (1), where θ_max denotes the maximum deviation angle allowed for the angle between the target and the forward direction of the mobile robot, β denotes the maximum turning capability of the agent under the current motion model, α denotes the degree of closeness to the maximum turning capability, and θ denotes the angle between the target and the forward direction of the mobile robot at the current moment.
[Formula (1), giving the membership degrees for −β ≤ x ≤ β, is presented as an image in the original publication.]
When the data value x is in the range (−∞, −β) ∪ (β, +∞), the membership degree is calculated by the following formula:
[Formula (2), giving the membership degrees for x outside [−β, β], is presented as an image in the original publication.]
Finally, three membership values are obtained: the first represents the tendency to turn left, the second the tendency to turn right, and the third the tendency to let the Dueling DQN controller decide. The membership group is then passed to the defuzzification module, the action with the largest membership is selected according to the maximum-membership principle, and the action instruction is sent to the mobile robot.
Further, the direct controller is responsible for handling simple situations according to the received data tuple, i.e. when there are no obstacles around and a collision on the side is about to occur: when the sensor value se_4 or se_5 falls within the safety threshold, the direct controller drives the cart toward the target along the shortest path.
Further, deep reinforcement learning obtains a reward value through interaction with the environment until one iteration ends, yielding a total reward; different actions selected within an iteration yield different total rewards. The deep reinforcement learning model uses the Q value to predict the total reward obtained after executing a given action and then selects the action with the highest predicted total reward as the output. A Dueling DQN is adopted to decouple the action-independent state value from the Q value, which gives a more robust learning effect; the Q value is divided into a state value V and an action advantage value A:
Q_π(s, a) = V_π(s) + A_π(s, a)   (3)
Then the Dueling DQN separates the representation of the two parts:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + A(s, a; θ, θ_a)   (4)
where s denotes the state, a the action, and θ_v and θ_a the parameters of the two fully connected layers; the mean is used in place of the maximum in order to obtain better stability:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + ( A(s, a; θ, θ_a) − (1/M) Σ_{a'} A(s, a'; θ, θ_a) )   (5)
During training, the Dueling DQN interacts with the environment to generate experience data, which is stored in the experience pool; at regular intervals the Dueling DQN takes a batch of data out of the experience pool and learns an optimized policy from it;
the Dueling DQN approximates Q (s, a) in Q-learning to Q (s, a; theta), theta represents a parameter of the neural network, a loss function is a residual error between a true value and a predicted value, in order to relieve an overestimation problem, a target network is introduced, and the loss function is written in the form of:
L(θ) = ( r + γ max_{a'} Q(s', a'; θ^−) − Q(s, a; θ) )²   (6)
The parameter θ is updated by gradient descent, where r denotes the reward, γ the discount rate, s the state space information of the previous iteration, a the action of the previous iteration, θ the network parameters of the previous iteration, s' the current state space information, a' the current action, and θ^− the parameters of the target network. The input layer is the state information, the advantage (action) branch and the state (value) branch of the hidden part each have three layers, and the output layer has 5 actions. The network structure is as follows: each hidden layer has 128 neurons, the number of outputs of the action layer equals the number of action instructions, and the number of outputs of the state layer is 1; finally the Q value of each action is obtained through formula (5), and the action with the largest Q value is selected as the executed action.
Further, in the reinforcement learning task the mobile robot continuously improves its strategy during exploration according to a feedback signal from the environment; this feedback signal is the reward, and the negative-feedback shaping reward is proposed as follows:
R = R_success + R_collision + R_safe + R_danger + R_angle   (7)
when the mobile robot collides with an obstacle, the following rewards are obtained:
R_collision = −100   (8)
when the mobile robot reaches the target position, the obtained reward is as follows:
R_success = 100   (9)
when the mobile robot does not collide with an obstacle or reach a target point, its normal operation reward is defined as follows:
[Formula (10), defining the normal-operation reward R_safe in terms of d(n), d(n+1) and d_min, is presented as an image in the original publication.]
where d(n) denotes the distance between the mobile robot and the target at the previous moment, d(n+1) the distance at the current moment, and d_min the minimum effective distance of the mobile robot approaching or departing from the target;
when the mobile robot approaches or moves away from an obstacle, its situation is divided into a safe state S and an unsafe state NS, and the reward is defined as follows:
[Formula (11), defining the reward R_danger for the safe state S and the unsafe state NS in terms of sensor_min(n) and sensor_min(n+1), is presented as an image in the original publication.]
where sensor_min(n) denotes the minimum of the sensor data at the previous moment and sensor_min(n+1) the minimum of the sensor data at the current moment;
in order for the mobile robot to advance toward the target direction, the angle awards are defined as follows:
[Formula (12), defining the angle reward R_angle in terms of θ and θ_max, is presented as an image in the original publication.]
where θ denotes the angle between the target and the heading, and θ_max denotes the maximum angle allowed by the reward function.
The invention has the following beneficial effects: a method combining the Dueling DQN with improved fuzzy control is presented. The fuzzy controller and the direct controller provide a large number of positive samples for the Dueling DQN, which improves its training speed. In the multi-controller mode, the mobile robot avoids performing a large number of meaningless actions at the initial stage of training, and the positive samples are not swamped by the large number of negative samples that would otherwise accompany them. After training is finished, the model combining Dueling DQN with improved fuzzy control does not exhibit the jittering forward motion of the Dueling DQN model.
The model was evaluated by a series of generalization tests. Experiments show that the model combining Dueling DQN with improved fuzzy control performs well when the target position or the position of the mobile robot changes and when the scene differs from the training scene, which demonstrates the effectiveness of the method in path planning. In addition, experiments show that the model combining Dueling DQN with improved fuzzy control has faster convergence, higher stability and better path quality than the Dueling DQN model.
Drawings
FIG. 1 is a block diagram of the system framework of the present invention;
FIG. 2 is a diagram of the Dueling DQN neural network model architecture of the present invention;
FIG. 3 is a schematic diagram of a training scenario of the present invention;
FIG. 4 is a graph of reward values according to the present invention;
FIG. 5 is a schematic diagram of test scenario a of the present invention;
FIG. 6 is a schematic diagram of test scenario b of the present invention;
FIG. 7 is a schematic diagram of test scenario c of the present invention;
FIG. 8 is a bar graph of the success rate of two models of the present invention in a test scenario.
Detailed Description
A mobile robot path planning algorithm combining fuzzy control and reinforcement learning is provided. The traditional Dueling DQN model is trained on a large amount of collision data at the initial training stage, and this collision data makes convergence slow; aiming at the problems of slow convergence and poor generalization of the Dueling DQN, a path planning algorithm in which a fuzzy controller and the Dueling DQN jointly control the agent's deep reinforcement learning is proposed. The framework of the algorithm is shown in fig. 1.
In the deep reinforcement learning model, a fuzzy controller and a direct controller are added, and the two can react according to different sensor conditions. The agent is a four-wheel-drive cart that steers by the speed difference between the left and right wheels and carries 5 infrared sensors, each with a maximum detection distance of 5 m. After training begins, the environment passes the state space data to the model. The dispatching center compares the received data tuple with a preset rule: if the data matches the rule, the cart is controlled by the fuzzy controller; otherwise the Dueling DQN controls the movement of the cart; and if the data tuple exceeds the alert threshold, the direct controller takes over. Whether the cart is controlled by the fuzzy controller, the Dueling DQN or the direct controller, the experience it generates is put into the experience pool. The fuzzy controller and the direct controller not only reduce collisions of the agent but also generate a large number of positive samples, i.e. samples corresponding to the main-line event; this avoids the situation in which the probability of exploring the main-line event in a random manner in a complex environment becomes extremely low and a few positive samples are swamped in an ocean of negative samples. The Dueling DQN can thus obtain a considerable number of positive samples at the initial training stage, which effectively improves the convergence speed. During training, a good reward function guides the mobile robot to complete the task efficiently; since the reward of path planning is sparse, it needs to be changed into a dense reward, and a negative-feedback shaping reward derived from analysis of the task is proposed as the reward function.
(1) Fuzzy controller
The fuzzy controller is responsible for handling dangerous situations: when the data of the middle sensor reaches the alert threshold, the agent is controlled by the fuzzy controller. When the fuzzy controller obtains the data, it first preprocesses it, i.e. the value x = se_2 − se_3 is passed into the fuzzification module. The fuzzification module divides the incoming data values into different stages according to their magnitude; when −β ≤ x ≤ β, the membership degree is calculated by the following formula, where θ_max denotes the maximum deviation angle allowed for the angle between the target and the forward direction of the mobile robot, β denotes the maximum turning capability of the agent under the current motion model, α denotes the degree of closeness to the maximum turning capability, and θ denotes the angle between the target and the forward direction of the mobile robot at the current moment.
[Formula (1), giving the membership degrees for −β ≤ x ≤ β, is presented as an image in the original publication.]
When the data value x is in the range (−∞, −β) ∪ (β, +∞), the membership degree is calculated by the following formula:
[Formula (2), giving the membership degrees for x outside [−β, β], is presented as an image in the original publication.]
Finally, three membership degrees are obtained. The third is determined by the sizes of the first two: if the first two membership degrees are both 0, the third is 1, otherwise it is 0. The first represents the tendency to turn left, the second the tendency to turn right, and the third the tendency to let the Dueling DQN controller decide. The membership group is then passed into the defuzzification module, the action with the largest membership is selected according to the maximum-membership principle, and the action instruction is sent to the cart.
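A minimal sketch of this fuzzy controller is given below; since formulas (1) and (2) are only available as images in the original publication, the membership expressions are hypothetical placeholders that only respect the described structure (the case split at |x| ≤ β, the three membership values, and the maximum-membership rule):

```python
import math

BETA = 0.5              # assumed maximum turning capability under the motion model
THETA_MAX = math.pi / 3 # assumed maximum allowed deviation angle

def fuzzy_action(se2, se3, theta_angle):
    """Fuzzify the data tuple and pick the action with the largest membership."""
    x = se2 - se3                              # preprocessing step from the description
    if -BETA <= x <= BETA:
        # placeholder for formula (1): bias the turn toward the target direction
        mu_left = max(0.0, -theta_angle / THETA_MAX)
        mu_right = max(0.0, theta_angle / THETA_MAX)
    else:
        # placeholder for formula (2): turn away from the closer side
        mu_left = 1.0 if x > BETA else 0.0
        mu_right = 1.0 if x < -BETA else 0.0
    # third membership: defer to the Dueling DQN when neither turn is indicated
    mu_dqn = 1.0 if (mu_left == 0.0 and mu_right == 0.0) else 0.0
    memberships = {"turn_left": mu_left, "turn_right": mu_right, "defer_to_dqn": mu_dqn}
    return max(memberships, key=memberships.get)   # maximum-membership principle
```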
(2) Direct controller
The direct controller is responsible for handling simple situations according to the received data tuple, i.e. when there are no obstacles around and a collision on the side is about to occur: when the sensor value se_4 or se_5 falls within the safety threshold, the direct controller drives the cart toward the target along the shortest path.
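A minimal sketch of this direct behaviour, assuming a simple proportional steering law (the gain and forward speed are illustrative, not values from the patent):

```python
K_TURN = 1.0    # assumed proportional steering gain

def direct_action(theta_angle, forward_speed=1.0):
    """Simple situation: head straight for the target along the shortest path
    by steering the heading error theta_angle toward zero."""
    angular_velocity = -K_TURN * theta_angle
    return forward_speed, angular_velocity
```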
(3) Reinforced learning
The Dueling DQN is developed from the DQN. The DQN introduced deep learning into reinforcement learning, connecting Q-learning with deep learning and learning a control policy directly from high-dimensional data. The Dueling DQN decouples the action-independent state value from the Q value, which yields a more robust learning effect. The Q value can be divided into two parts, a state value and an action advantage:
Q_π(s, a) = V_π(s) + A_π(s, a)   (3)
Then the Dueling DQN separates the representation of the two parts:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + A(s, a; θ, θ_a)   (4)
where s denotes the state, a the action, and θ_v and θ_a the parameters of the two fully connected layers; the mean is used instead of the maximum in order to obtain better stability:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + ( A(s, a; θ, θ_a) − (1/M) Σ_{a'} A(s, a'; θ, θ_a) )   (5)
where M denotes the number of action dimensions and a' denotes the optimal action; the advantage function only needs to approach the average advantage rather than pursue the maximum advantage.
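As a brief worked note (not taken from the patent text), averaging formula (5) over the M actions makes the advantage term vanish, so the mean Q value equals the state value; this identifiability is what the mean aggregation provides:

```latex
\frac{1}{M}\sum_{a} Q(s,a;\theta,\theta_v,\theta_a)
  = V(s;\theta,\theta_v)
  + \underbrace{\frac{1}{M}\sum_{a}\Bigl(A(s,a;\theta,\theta_a)
      - \frac{1}{M}\sum_{a'}A(s,a';\theta,\theta_a)\Bigr)}_{=\,0}
  = V(s;\theta,\theta_v)
```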
During training, the Dueling DQN and the other controllers interact with the environment to generate experience data, which is stored in the experience pool; at regular intervals the Dueling DQN takes a batch of data out of the experience pool and learns an optimized policy from it.
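The experience pool can be illustrated with a generic replay-buffer sketch (capacity and batch size are assumptions, not values from the patent):

```python
import random
from collections import deque

class ExperiencePool:
    """Fixed-capacity experience pool shared by all three controllers."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is discarded first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        """Take a batch out of the pool for one learning update."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```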
The Dueling DQN approximates Q(s, a) in Q-learning by Q(s, a; θ), where θ denotes the parameters of the neural network. The loss function is the residual between the true value and the predicted value and can be written as follows:
L(θ) = ( r + γ max_{a'} Q(s', a'; θ^−) − Q(s, a; θ) )²   (6)
The parameter θ is updated by gradient descent, where r denotes the reward, γ the discount rate, s the state space information of the previous iteration, a the action of the previous iteration, θ the network parameters of the previous iteration, s' the current state space information, a' the current action, and θ^− the parameters of the target network.
The input layer is the state information, the advantage (action) branch and the state (value) branch of the hidden part each have three layers, and the output layer has 5 actions. The network structure is shown in fig. 2.
Each hidden layer has 128 neurons, the number of outputs of the action layer equals the number of action instructions, and the number of outputs of the state layer is 1; finally the Q value of each action is obtained through the averaging formula (5), and the action with the largest Q value is selected as the executed action.
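The network structure and the update of formula (6) can be sketched in PyTorch as follows (an illustrative assumption, not the patented implementation: the class name DuelingQNet, the branch depth, the optimizer, the discount factor and the helper train_step are placeholders chosen to roughly follow the description — a 7-dimensional state input, 128-unit hidden layers, a one-output state branch, a five-output action branch, mean aggregation as in formula (5), and a target network in the loss):

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """State (value) branch V(s) and action (advantage) branch A(s, a), combined with the mean rule (5)."""
    def __init__(self, state_dim=7, n_actions=5, hidden=128):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, out_dim))
        self.value = branch(1)              # state branch: one output
        self.advantage = branch(n_actions)  # action branch: one output per action instruction

    def forward(self, s):
        v = self.value(s)                                # shape (batch, 1)
        a = self.advantage(s)                            # shape (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)       # aggregation of formula (5)

def train_step(online, target, optimizer, batch, gamma=0.99):
    """One gradient-descent update of the squared loss in formula (6)."""
    s, a, r, s2, done = batch                            # tensors from the experience pool (a is int64, done is float)
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():
        q_next = target(s2).max(dim=1).values            # max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * q_next            # TD target
    loss = nn.functional.mse_loss(q, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```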
(4) Reward function
In reinforcement learning tasks, the mobile robot continuously improves its strategy during exploration according to feedback signals from the environment, which are called rewards. The design quality of the reward function determines whether the mobile robot can learn the expected strategy and directly influences the convergence speed and final performance of the DRL algorithm. Samples corresponding to the main-line event are generally called positive samples and the rest negative samples; as task complexity increases further, the probability of exploring positive samples in a random manner becomes smaller, the few positive samples are swamped by negative samples, and in this case the convergence of the algorithm slows down. In addition, the ordinary shaping reward may cause abnormal behaviour of the mobile robot, so the negative-feedback shaping reward is proposed as:
R = R_success + R_collision + R_safe + R_danger + R_angle   (7)
when the mobile robot collides with an obstacle, the following rewards are obtained:
R_collision = −100   (8)
when the mobile robot reaches the target position, the obtained reward is as follows:
R_success = 100   (9)
when the mobile robot does not collide with an obstacle or reach a target point, its normal operation reward is defined as follows:
[Formula (10), defining the normal-operation reward R_safe in terms of d(n), d(n+1) and d_min, is presented as an image in the original publication.]
where d(n) denotes the distance between the mobile robot and the target at the previous moment, d(n+1) the distance at the current moment, and d_min the minimum effective distance of the mobile robot approaching or departing from the target.
When the mobile robot approaches an obstacle or moves away from the obstacle, it is divided into a safe state (S) and an unsafe state (NS), and the reward is defined as follows:
[Formula (11), defining the reward R_danger for the safe state S and the unsafe state NS in terms of sensor_min(n) and sensor_min(n+1), is presented as an image in the original publication.]
where sensor_min(n) denotes the minimum of the sensor data at the previous moment and sensor_min(n+1) the minimum of the sensor data at the current moment.
In order to make the mobile robot advance to the target direction, an angle reward is designed, which is defined as follows:
[Formula (12), defining the angle reward R_angle in terms of θ and θ_max, is presented as an image in the original publication.]
where θ denotes the angle between the target and the heading of the cart, and θ_max denotes the maximum angle allowed by the reward function.
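The composition of the negative-feedback shaping reward can be sketched as follows (the exact piecewise forms of formulas (10)–(12) are only available as images in the original publication, so the R_safe, R_danger and R_angle terms below are hypothetical placeholders that merely follow the textual description; the constants are assumptions):

```python
import math

R_SUCCESS, R_COLLISION = 100.0, -100.0
D_MIN = 0.05            # assumed minimum effective approach distance (m)
THETA_MAX = math.pi / 3 # assumed maximum angle allowed by the angle reward

def shaping_reward(collided, reached, d_prev, d_cur, sensor_min_prev, sensor_min_cur, theta):
    """Negative-feedback shaping reward R = R_success + R_collision + R_safe + R_danger + R_angle."""
    if collided:
        return R_COLLISION                      # formula (8)
    if reached:
        return R_SUCCESS                        # formula (9)
    # placeholder for formula (10): reward approaching the target by more than d_min
    r_safe = 1.0 if (d_prev - d_cur) > D_MIN else -1.0
    # placeholder for formula (11): safe state S vs. unsafe state NS w.r.t. the nearest obstacle
    r_danger = 0.0 if sensor_min_cur >= sensor_min_prev else -1.0
    # placeholder for formula (12): encourage heading toward the target
    r_angle = 1.0 - abs(theta) / THETA_MAX if abs(theta) <= THETA_MAX else -1.0
    return r_safe + r_danger + r_angle
```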
(5) State space and action space
Analyzing the motion model of the mobile robot: the angle between the target and the forward direction of the mobile robot lets the robot find the correct direction, the distance between the target and the mobile robot indicates the degree of progress, and the sensor data guarantee that the mobile robot avoids obstacles, so the state space has 7 features:
s = {se_1, se_2, se_3, se_4, se_5, θ_angle, dis}   (13)
where se_i (i = 1, 2, 3, 4, 5) represents the sensor data, θ_angle denotes the angle between the target and the forward direction of the mobile robot, and dis is the distance between the mobile robot and the target.
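As a small illustration of formula (13), the 7-feature state could be assembled as follows (function and argument names are assumptions for illustration only):

```python
import math

def build_state(sensor_readings, robot_pose, target_xy):
    """Assemble s = {se_1..se_5, theta_angle, dis} from raw measurements."""
    se = list(sensor_readings[:5])                       # five range readings
    rx, ry, yaw = robot_pose                             # robot position and heading
    tx, ty = target_xy
    dis = math.hypot(tx - rx, ty - ry)                   # distance to the target
    # signed angle between the target direction and the robot's forward direction
    theta_angle = math.atan2(ty - ry, tx - rx) - yaw
    theta_angle = math.atan2(math.sin(theta_angle), math.cos(theta_angle))  # wrap to [-pi, pi]
    return se + [theta_angle, dis]
```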
2. Experimental results of the method
In the experiment, the Dueling DQN model and the model combining Dueling DQN with improved fuzzy control are each trained in the scenario of fig. 3 for 500 episodes; whenever a collision occurs or the target point is reached, the current episode ends, the scenario returns to its initial state and the next episode starts. The neural network updates its weights once every 5 iterations. The total reward of each episode is recorded during training, and the resulting reward curves are shown in fig. 4.
It can be seen from the reward curves that the Dueling DQN model keeps its reward at a low level at the initial training stage, indicating a low success rate, and shows abnormal reward values in the middle of training, indicating abnormal states during training, i.e. circling, wandering and similar behaviour in the environment. The model combining Dueling DQN with improved fuzzy control has a higher reward than the Dueling DQN model at the start of training, reaches a stable state faster, and produces far fewer abnormal reward values, because the multiple controllers provide high-quality positive samples for model training and avoid unnecessary mistakes. Finally the reward values of both models stabilize around 160; however, in the training environment the Dueling DQN model still collides in the late training stage, whereas the model combining Dueling DQN with improved fuzzy control does not.
After training is finished, the two models are each tested 100 times in the training scene; the success rate of the Dueling DQN model is 92%, and that of the model combining Dueling DQN with improved fuzzy control is 100%.
(1) Generalization ability
In order to test the adaptability of the models after the environment is changed, the two models are placed in the following three scenes for testing, as shown in figs. 5–7: scene a is a simple scene, scene b changes the positions of the cart and the target in the training scene, and scene c is a complex scene.
As shown in fig. 8, each model was tested for 100 rounds in each scenario, with the success rates and path lengths shown in the table below. The success rate of the Dueling DQN model is 79% in scene a, 94% in scene b and 74% in scene c, so the Dueling DQN model keeps a high success rate in the scene it was trained in, but the success rate drops once the scene changes. The success rate of the model combining Dueling DQN with improved fuzzy control is 100% in scene a, 100% in scene b and 100% in scene c, so the combined model not only performs better than the Dueling DQN model in the training scene but also performs better in scenes that are simpler or more complex than the training scene.
TABLE 1 success rates of the models in the test environment
[Table 1 is presented as an image in the original publication; for each test scene it lists the success rate and the mean path length of the Dueling DQN model and of the model combining Dueling DQN with improved fuzzy control.]
In the test scenarios, the path length of the mobile robot is recorded in every test. The mean path length in each scene is shown in Table 1; in every test scenario, the path length of the proposed algorithm is shorter than that of the Dueling DQN algorithm.
In summary, the test results show that the model combining Dueling DQN with fuzzy control not only performs better when the mobile robot and the target change within the training scene but also performs well in other, untrained scenes, i.e. it has strong generalization ability. In addition, compared with the Dueling DQN model, the model combining Dueling DQN with improved fuzzy control produces a more stable path and handles deviation cases better.
1. Aiming at the problems of slow convergence and poor generalization of the Dueling DQN, a path planning algorithm in which a fuzzy controller and the Dueling DQN jointly control the agent is proposed. First, a fuzzy controller and a direct controller are added to the deep reinforcement learning model; the two controllers react according to different sensor conditions and generate positive samples, which are stored in the experience pool to relieve the situation of having too many negative samples. The two controllers also counteract the meaningless actions of the deep reinforcement learning controller during training.
2. The fuzzy controller is responsible for handling dangerous situations. When the data of the middle sensor reaches the alert threshold, the mobile robot is controlled by the fuzzy controller, whose data tuple consists of the middle three sensors and the angle between the target and the forward direction of the mobile robot, i.e. {se_1, se_2, se_3, θ_angle}. When the fuzzy controller obtains the data tuple, it first preprocesses the data: the value x = se_2 − se_3 is passed into the fuzzification module, which divides it into different stages according to its magnitude, and different control methods are adopted according to the different parameters.
3. In reinforcement learning, the design quality of the reward function determines whether the mobile robot can learn the expected strategy and directly influences the convergence speed and final performance of the algorithm. Samples corresponding to the main-line event are generally called positive samples and the rest negative samples; as task complexity increases further, the probability of exploring positive samples in a random manner becomes smaller and the few positive samples are swamped by negative samples, in which case the convergence of the algorithm slows down. In addition, the ordinary shaping reward may cause abnormal behaviour of the mobile robot, so a negative-feedback shaping reward mechanism is proposed.
The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any equivalent substitution or change that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention.

Claims (6)

1. A mobile robot path planning algorithm combining fuzzy control and reinforcement learning, characterized in that the path of the agent is controlled jointly by fuzzy control and a Dueling DQN: a fuzzy controller and a direct controller are added to the deep reinforcement learning model, and the two controllers react according to different sensor conditions; the agent is a four-wheel-drive cart that steers by means of the speed difference between the left and right wheels and carries 5 laser sensors, each with a maximum detection distance of 5 m; after training begins, the environment passes state space information to the model, and a dispatching center partitions the state space information by different thresholds into three situations, each handled by one controller: a simple situation is handled by the direct controller, a complex situation by the Dueling DQN, and a dangerous situation by the fuzzy controller; the experience data generated in the three situations are put into the experience pool, and in each training episode the fuzzy controller and the direct controller reduce collisions of the agent and yield positive samples; a positive sample is a sample corresponding to the main-line event, which prevents the probability of exploring the main-line event in a random manner in a complex environment from becoming extremely small; the Dueling DQN obtains many positive samples at the initial training stage to improve the convergence rate; during training, the reward function guides the mobile robot to complete the task efficiently, the sparse reward of path planning is changed into a dense reward, and a negative-feedback shaping reward derived from analysis of the task is used as the reward function.
2. The mobile robot path planning algorithm combining fuzzy control and reinforcement learning according to claim 1, characterized in that, in the motion model of the mobile robot, the angle between the target and the forward direction of the mobile robot lets the robot find the correct direction, the distance between the target and the mobile robot indicates the degree of progress, and the sensor data guarantee that the mobile robot avoids obstacles, so the state space has 7 features:
s = {se_1, se_2, se_3, se_4, se_5, θ_angle, dis}   (13)
where se_i (i = 1, 2, 3, 4, 5) represents the sensor data, θ_angle denotes the angle between the target and the forward direction of the mobile robot, and dis is the distance between the mobile robot and the target.
3. The mobile robot path planning algorithm combining fuzzy control and reinforcement learning according to claim 1, characterized in that the fuzzy controller is responsible for handling dangerous situations: when the data of the middle sensor reaches the alert threshold, the mobile robot is controlled by the fuzzy controller, whose data tuple consists of the middle three sensors and the angle between the target and the forward direction of the mobile robot, i.e. {se_1, se_2, se_3, θ_angle}; when the fuzzy controller obtains the data tuple, it first preprocesses the data, i.e. the value x = se_2 − se_3 is passed into the fuzzification module, which divides it into different stages according to the magnitude of the incoming value; when −β ≤ x ≤ β, the membership degree is calculated by formula (1), where θ_max denotes the maximum deviation angle allowed for the angle between the target and the forward direction of the mobile robot, β denotes the maximum turning capability of the agent under the current motion model, α denotes the degree of closeness to the maximum turning capability, and θ denotes the angle between the target and the forward direction of the mobile robot at the current moment.
[Formula (1), giving the membership degrees for −β ≤ x ≤ β, is presented as an image in the original publication.]
When the data value x is in the range (−∞, −β) ∪ (β, +∞), the membership degree is calculated by the following formula:
[Formula (2), giving the membership degrees for x outside [−β, β], is presented as an image in the original publication.]
Finally, three membership values are obtained: the first represents the tendency to turn left, the second the tendency to turn right, and the third the tendency to let the Dueling DQN controller decide; the membership group is then passed to the defuzzification module, the action with the largest membership is selected according to the maximum-membership principle, and the action instruction is sent to the mobile robot.
4. The mobile robot path planning algorithm combining fuzzy control and reinforcement learning according to claim 1, characterized in that the direct controller is responsible for handling simple situations according to the received data tuple, i.e. when there are no obstacles around and a collision on the side is about to occur: when the sensor value se_4 or se_5 falls within the safety threshold, the direct controller drives the cart toward the target along the shortest path.
5. The mobile robot path planning algorithm combining fuzzy control and reinforcement learning according to claim 1, characterized in that deep reinforcement learning obtains a reward value through interaction with the environment until one iteration ends, yielding a total reward; different actions selected within an iteration yield different total rewards; the deep reinforcement learning model uses the Q value to predict the total reward obtained after executing a given action and then selects the action with the highest predicted total reward as the output; a Dueling DQN is adopted to decouple the action-independent state value from the Q value so as to obtain a more robust learning effect, and the Q value is divided into a state value V and an action advantage value A:
Q_π(s, a) = V_π(s) + A_π(s, a)   (3)
then the Dueling DQN separates the representation of the two parts:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + A(s, a; θ, θ_a)   (4)
where s denotes the state, a the action, and θ_v and θ_a the parameters of the two fully connected layers; the mean is used in place of the maximum in order to obtain better stability:
Q(s, a; θ, θ_v, θ_a) = V(s; θ, θ_v) + ( A(s, a; θ, θ_a) − (1/M) Σ_{a'} A(s, a'; θ, θ_a) )   (5)
during training, the Dueling DQN interacts with the environment to generate experience data, which is stored in the experience pool; at regular intervals the Dueling DQN takes a batch of data out of the experience pool and learns an optimized policy from it;
the Dueling DQN approximates Q (s, a) in Q-learning to Q (s, a; theta), theta represents a parameter of the neural network, a loss function is a residual error between a true value and a predicted value, in order to relieve an overestimation problem, a target network is introduced, and the loss function is written in the form of:
L(θ) = ( r + γ max_{a'} Q(s', a'; θ^−) − Q(s, a; θ) )²   (6)
the parameter θ is updated by gradient descent, where r denotes the reward, γ the discount rate, s the state space information of the previous iteration, a the action of the previous iteration, θ the network parameters of the previous iteration, s' the current state space information, a' the current action, and θ^− the parameters of the target network; the input layer is the state information, the advantage (action) branch and the state (value) branch of the hidden part each have three layers, and the output layer has 5 actions; the network structure is as follows: each hidden layer has 128 neurons, the number of outputs of the action layer equals the number of action instructions, and the number of outputs of the state layer is 1; finally the Q value of each action is obtained through formula (5), and the action with the largest Q value is selected as the executed action.
6. The mobile robot path planning algorithm combining fuzzy control and reinforcement learning according to claim 1, characterized in that, in the reinforcement learning task, the mobile robot continuously improves its strategy during exploration according to a feedback signal from the environment, the feedback signal being the reward, and the negative-feedback shaping reward is proposed as:
R = R_success + R_collision + R_safe + R_danger + R_angle   (7)
when the mobile robot collides with an obstacle, the following rewards are obtained:
R_collision = −100   (8)
when the mobile robot reaches the target position, the obtained reward is as follows:
R_success = 100   (9)
when the mobile robot does not collide with an obstacle or reach a target point, its normal operation reward is defined as follows:
[Formula (10), defining the normal-operation reward R_safe in terms of d(n), d(n+1) and d_min, is presented as an image in the original publication.]
where d(n) denotes the distance between the mobile robot and the target at the previous moment, d(n+1) the distance at the current moment, and d_min the minimum effective distance of the mobile robot approaching or departing from the target;
when the mobile robot approaches or moves away from an obstacle, its situation is divided into a safe state S and an unsafe state NS, and the reward is defined as follows:
[Formula (11), defining the reward R_danger for the safe state S and the unsafe state NS in terms of sensor_min(n) and sensor_min(n+1), is presented as an image in the original publication.]
where sensor_min(n) denotes the minimum of the sensor data at the previous moment and sensor_min(n+1) the minimum of the sensor data at the current moment;
in order for the mobile robot to advance toward the target direction, the angle reward is defined as follows:
[Formula (12), defining the angle reward R_angle in terms of θ and θ_max, is presented as an image in the original publication.]
where θ denotes the angle between the target and the heading of the cart, and θ_max denotes the maximum angle allowed by the reward function.
CN202211693190.0A 2022-12-28 2022-12-28 Mobile robot path planning algorithm combining fuzzy control and reinforcement learning Pending CN115826581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211693190.0A CN115826581A (en) 2022-12-28 2022-12-28 Mobile robot path planning algorithm combining fuzzy control and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211693190.0A CN115826581A (en) 2022-12-28 2022-12-28 Mobile robot path planning algorithm combining fuzzy control and reinforcement learning

Publications (1)

Publication Number Publication Date
CN115826581A true CN115826581A (en) 2023-03-21

Family

ID=85518840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211693190.0A Pending CN115826581A (en) 2022-12-28 2022-12-28 Mobile robot path planning algorithm combining fuzzy control and reinforcement learning

Country Status (1)

Country Link
CN (1) CN115826581A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116523154A (en) * 2023-03-22 2023-08-01 中国科学院西北生态环境资源研究院 Model training method, route planning method and related devices
CN116523154B (en) * 2023-03-22 2024-03-29 中国科学院西北生态环境资源研究院 Model training method, route planning method and related devices
CN116382304A (en) * 2023-05-26 2023-07-04 国网江苏省电力有限公司南京供电分公司 DQN model-based multi-inspection robot collaborative path planning method and system
CN116382304B (en) * 2023-05-26 2023-09-15 国网江苏省电力有限公司南京供电分公司 DQN model-based multi-inspection robot collaborative path planning method and system
CN116650110A (en) * 2023-06-12 2023-08-29 北京长木谷医疗科技股份有限公司 Automatic knee joint prosthesis placement method and device based on deep reinforcement learning
CN116650110B (en) * 2023-06-12 2024-05-07 北京长木谷医疗科技股份有限公司 Automatic knee joint prosthesis placement method and device based on deep reinforcement learning
CN116527567A (en) * 2023-06-30 2023-08-01 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN116527567B (en) * 2023-06-30 2023-09-12 南京信息工程大学 Intelligent network path optimization method and system based on deep reinforcement learning
CN117873118A (en) * 2024-03-11 2024-04-12 中国科学技术大学 Storage logistics robot navigation method based on SAC algorithm and controller

Similar Documents

Publication Publication Date Title
CN115826581A (en) Mobile robot path planning algorithm combining fuzzy control and reinforcement learning
Cai ROBOTICS: From Manipulator to Mobilebot
CN113485380B (en) AGV path planning method and system based on reinforcement learning
CN109976340B (en) Man-machine cooperation dynamic obstacle avoidance method and system based on deep reinforcement learning
US11900797B2 (en) Autonomous vehicle planning
Xu et al. Sensor-based fuzzy reactive navigation of a mobile robot through local target switching
Naveed et al. Trajectory planning for autonomous vehicles using hierarchical reinforcement learning
CN110745136A (en) Driving self-adaptive control method
CN105094124A (en) Method and model for performing independent path exploration based on operant conditioning
CN116540731B (en) Path planning method and system integrating LSTM and SAC algorithms
Chen et al. Automatic overtaking on two-way roads with vehicle interactions based on proximal policy optimization
CN116679719A (en) Unmanned vehicle self-adaptive path planning method based on dynamic window method and near-end strategy
CN116360434A (en) Ship path planning method based on improved CSAC-APF algorithm
Lou et al. Path planning in an unknown environment based on deep reinforcement learning with prior knowledge
CN113467462B (en) Pedestrian accompanying control method and device for robot, mobile robot and medium
Wu et al. Trajectory prediction based on planning method considering collision risk
US20220269948A1 (en) Training of a convolutional neural network
Raiesdana A hybrid method for industrial robot navigation
CN115599093A (en) Self-adaptive unmanned ship path planning method based on fuzzy set and deep reinforcement learning
CN114167856A (en) Service robot local path planning method based on artificial emotion
Garrote et al. Improving Local Motion Planning with a Reinforcement Learning Approach
Yang et al. Research on Artificial Bee Colony Method Based Complete Coverage Path Planning Algorithm for Search and Rescue Robot
Zhang et al. A deep reinforcement learning method for mobile robot path planning in unknown environments
Chen et al. Decision-Making Models Based on Meta-Reinforcement Learning for Intelligent Vehicles at Urban Intersections
Parasuraman et al. Fuzzy decision mechanism combined with neuro-fuzzy controller for behavior based robot navigation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination