CN116551703B - Motion planning method based on machine learning in a complex environment

- Publication number: CN116551703B
- Application number: CN202310851954.2A
- Authority: CN (China)
- Legal status: Active (granted)
Abstract
The invention discloses a motion planning method based on machine learning in a complex environment, relating to the fields of workshop transportation, machine learning, robot control, and motion planning. First, a motion planning algorithm based on FMP and a motion planning algorithm based on A3C are designed, and a game-theoretic idea is introduced to design the reward function when improving the A3C algorithm; the improved A3C algorithm is then fused with the FMP algorithm to form three motion planning strategies. Different motion planning strategies are adopted when solving the motion planning problem of a multi-robot system in different environment states, so that the resource waste caused by algorithm redundancy is avoided. Compared with other methods, the method not only improves the generalization capability of the model and the efficiency of motion planning, but also meets the requirements of a multi-robot system for cooperativity, safety and efficiency in a complex environment; it can improve the working efficiency of the multi-robot system and can be applied to fields such as the logistics industry, the service industry and agriculture.
Description
Technical Field
The invention relates to the fields of workshop transportation, machine learning, multi-robot control, motion planning and the like, in particular to a motion planning method based on machine learning in a complex environment.
Background
The development of intelligent factories and intelligent logistics has promoted the application and expansion of mobile robots in various industries, and they play a particularly active and important role in intelligent manufacturing. Although in-factory logistics has developed rapidly in recent years, it still lags far behind the level of developed countries; in particular, the data-driven and intelligent robots that are closely tied to core manufacturing process flows and oriented toward future manufacturing face great challenges. Problems such as cooperative operation, motion planning and transportation of multiple robots in the factory remain severely limited, which in turn restricts the development of intelligent manufacturing, especially matrix production. The next-generation process robots that serve matrix manufacturing and have intelligent characteristics require interaction and cooperation among many robots; the total data volume increases by about three orders of magnitude compared with the traditional mode, and the production and transportation efficiency of workshops can be greatly improved. However, such a large multi-robot interactive system turns the formerly simple transportation environment of a workshop into a dense dynamic environment, which causes many complex problems in the interaction between robots, such as collisions, path conflicts, traffic jams and poor timeliness, all of which seriously affect transportation safety, transportation efficiency and production efficiency.
In a multi-robot system, each robot needs to perform its specific task while taking into account the positions and actions of the other robots to ensure coordinated and optimized execution. Motion planning can generate a motion sequence for the robot through an algorithm, under the physical limitations of the robot system and the environmental constraints, and respond quickly to changes in the environment so as to complete a specific task. Therefore, the motion planning of the robots is important for workshop transportation efficiency and transportation safety. Existing robot motion planning methods mainly include the genetic algorithm, the ant colony algorithm and the like; these traditional methods are generally based on rules and constraints, but in a complex workshop environment such rule-and-constraint methods suffer from low efficiency and difficulty in guaranteeing safety, and cannot meet the actual requirements of a dense, dynamic workshop. In addition, because the production tasks of a workshop differ from day to day, the number of robots and the task requirements also differ greatly; if there are few production tasks on a given day and the transportation tasks can be completed by only a small number of robots, an overly redundant motion planning algorithm leads to a large amount of unnecessary computation and thus to wasted resources.
To address these problems and practical needs, the invention designs a hybrid motion planning method: a motion planning method based on reinforcement learning (RL) and a force-based motion planning (FMP) algorithm are designed and then fused. Different motion planning methods are adopted according to the environment; on the one hand this avoids wasting resources, and on the other hand the fused hybrid motion planning algorithm can meet the requirements of multi-robot transportation for cooperativity, safety and efficiency in a dense dynamic environment.
Disclosure of Invention
The invention designs a motion planning method based on machine learning in a complex environment.
The specific implementation steps are as follows:
Step 1: determine the environment and the action space. Workshop modeling is first established, and then the action space is defined. A reinforcement learning system consists mainly of two parts, the environment and the agent. The action space of the robot is defined, the robot interacts with the environment, and through continuous exploration it learns the environment so as to finally complete the assigned task.
Let s_t and a_t denote the state and action of one robot, and let s'_{j,t} denote the state of another robot j at time t. The state vector has two parts, an observable part s^o and an unobservable (hidden) part s^h, i.e. s = [s^o, s^h]. The observable part consists of the robot's position, velocity and radius; observable here means the portion of the state that is visible to neighbouring robots, so each robot has this information about its neighbours. The target position P_G, the optimal (preferred) speed, the heading direction (one angle in a two-dimensional environment, two angles in a three-dimensional environment) and other hidden quantities of the robot constitute the unobservable portion of the state. The action space comprises two parts: the speed, and the heading angle in a two-dimensional environment (plus the course angle in a three-dimensional environment).
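As a concrete reading of this state/action decomposition, the sketch below shows one way the observable part, the hidden part, and the action could be represented in code. The field names and the 2D-only layout are illustrative assumptions, not taken from the patent text.

```python
from dataclasses import dataclass

@dataclass
class ObservableState:
    """Part of a robot's state visible to neighbouring robots."""
    px: float      # position x
    py: float      # position y
    vx: float      # velocity x
    vy: float      # velocity y
    radius: float

@dataclass
class HiddenState:
    """Part of the state not observable by other robots."""
    goal_x: float
    goal_y: float
    v_pref: float   # optimal (preferred) speed
    heading: float  # heading direction in a 2D environment

@dataclass
class Action:
    """Action space: speed and heading angle (2D case)."""
    speed: float
    heading: float

# Example: a robot at the origin moving right, with its goal at (5, 0)
s_obs = ObservableState(0.0, 0.0, 0.5, 0.0, 0.3)
s_hid = HiddenState(5.0, 0.0, 1.0, 0.0)
a_t = Action(speed=1.0, heading=0.0)
```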
Step 2: design the FMP motion planning algorithm. FMP is a motion planning method based on field-quantity (potential field) theory: by establishing attractive and repulsive force fields, it can guide a robot to a designated target point while avoiding collisions with obstacles.
Step 2.1: generate the reference trajectory. FMP calculates the attractive and repulsive force fields from the current position of the robot and the position of the target point, and then superimposes them to obtain the total field quantity at the robot's current position; the robot moves according to the direction and magnitude of the total field quantity. The force on the robot can be expressed as

F = F_a + F_r,  with  F_a(P) = −η(P − P_G),

where F_a is the attractive force, F_r is the repulsive force obtained by summing the contributions of the obstacles, P is the current position of the robot, P_G is the target position, P_i is the position of obstacle i, N is the number of obstacles, and η and β are the coefficients that adjust the attractive and repulsive forces, respectively.
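A minimal numerical sketch of this field superposition follows. The attractive term matches F_a(P) = −η(P − P_G) as stated above; the inverse-square form of the repulsive term and the coefficient values are assumptions for illustration, since the patent only states that the repulsion sums over the N obstacle positions with coefficient β.

```python
import numpy as np

def fmp_force(p, p_goal, obstacles, eta=1.0, beta=0.5, eps=1e-6):
    """Total field-quantity force on a robot at position p.

    p, p_goal: np.array([x, y]); obstacles: list of np.array([x, y]).
    The repulsive term (assumed inverse-square, pointing away from each
    obstacle) is one common choice, not prescribed by the patent.
    """
    f_attract = -eta * (p - p_goal)
    f_repulse = np.zeros(2)
    for p_obs in obstacles:
        diff = p - p_obs
        dist = np.linalg.norm(diff) + eps
        f_repulse += beta * diff / dist**3   # unit direction scaled by 1/dist^2
    return f_attract + f_repulse

# The robot then moves along the direction of the total force,
# with a step proportional to its magnitude.
step = lambda p, f, dt=0.1: p + dt * f
```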
Step 2.2: collision detection. The problem of collisions between robots must be considered while generating the reference trajectory; therefore, collision detection is performed every time the reference trajectory is updated, to avoid collisions between different robots. The specific detection steps are as follows (an illustrative sketch follows the list):
step1: determining the motion trail and the limiting conditions of speed and acceleration of the robot;
step2: calculating the distance and relative speed between the robot and other robots or obstacles;
step3: predicting whether collision occurs according to the change rate of the distance and the speed;
step4: if collision is predicted, stopping the motion of the robot and rescheduling the motion trail of the robot to avoid collision;
step5: if no collision is predicted, the travel continues along the original trajectory.
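The sketch below illustrates one way Steps 1-5 above could be realized for a pair of agents, extrapolating their motion over a short horizon and checking whether the inter-agent distance ever drops below the sum of the radii. The prediction horizon, time step, and straight-line extrapolation are illustrative assumptions.

```python
import numpy as np

def predict_collision(p1, v1, p2, v2, r1, r2, horizon=2.0, dt=0.1):
    """Predict whether two agents collide within `horizon` seconds.

    Positions p and velocities v are 2-D numpy arrays; a collision is flagged
    when the extrapolated distance falls below r1 + r2 at any sampled time."""
    for k in range(int(horizon / dt) + 1):
        t = k * dt
        d = np.linalg.norm((p1 + v1 * t) - (p2 + v2 * t))
        if d <= r1 + r2:
            return True
    return False

# Usage: if a collision is predicted, the robot stops and replans its
# trajectory; otherwise it continues along the original reference track.
p1, v1 = np.array([0.0, 0.0]), np.array([1.0, 0.0])
p2, v2 = np.array([3.0, 0.0]), np.array([-1.0, 0.0])
must_replan = predict_collision(p1, v1, p2, v2, r1=0.3, r2=0.3)
```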
Step 2.3: in the process of generating the reference trajectory, obstacles and other robots must be avoided, so path planning is required; it is realized by the traditional A* algorithm.
The A* algorithm is an effective heuristic search algorithm for finding the shortest path in a simple map environment. Starting from a given starting point, the method searches the environment around the current point, computes for each candidate node the sum of the accumulated cost from that node back to the starting point and the estimated cost from that node to the target point, and selects the node with the minimum total cost as the next expansion node; the search ends when the expansion node is the target point. The heuristic evaluation function of A* can be expressed as

F(n) = G(n) + H(n),

where G(n) is the actual cost from the starting point to the current node, H(n) is the estimated cost from the current node to the target node, and F(n) is the total cost of the current node.

The invention uses the Euclidean distance to calculate the estimated cost:

H(n) = sqrt((x_n − x_g)^2 + (y_n − y_g)^2),

where (x_n, y_n) are the coordinates of the current node and (x_g, y_g) are the coordinates of the target point.
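A compact grid-based A* sketch follows, using the Euclidean distance H(n) above as the heuristic. The 4-connected grid and unit edge costs are illustrative assumptions; the workshop map representation is not specified in the patent.

```python
import heapq, math

def euclidean(n, goal):
    return math.hypot(n[0] - goal[0], n[1] - goal[1])

def a_star(start, goal, free):
    """A* on a 4-connected grid. `free(x, y)` -> True if the cell is drivable.
    F(n) = G(n) + H(n): accumulated cost plus Euclidean estimate to the goal."""
    open_set = [(euclidean(start, goal), 0.0, start, None)]
    came_from, g_best = {}, {start: 0.0}
    while open_set:
        f, g, node, parent = heapq.heappop(open_set)
        if node in came_from:
            continue
        came_from[node] = parent
        if node == goal:                      # expansion node is the target: stop
            path = [node]
            while came_from[path[-1]] is not None:
                path.append(came_from[path[-1]])
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if not free(*nxt):
                continue
            g_new = g + 1.0                   # unit edge cost (assumed)
            if g_new < g_best.get(nxt, float("inf")):
                g_best[nxt] = g_new
                heapq.heappush(open_set, (g_new + euclidean(nxt, goal), g_new, nxt, node))
    return None

path = a_star((0, 0), (4, 3), lambda x, y: 0 <= x <= 5 and 0 <= y <= 5)
```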
Step 3: determine the A3C algorithm learning strategy. The A3C (Asynchronous Advantage Actor-Critic) algorithm is an improved form of the Actor-Critic framework: several thread sub-networks share network parameter information with a global network, and the global network feeds the optimal parameters back to each thread, which facilitates interaction with the environment and yields an asynchronous, concurrent learning model. The input of the Actor network typically includes the robot's observed information, such as the state of the environment surrounding the robot and the positions and speeds of other robots, and its output is the robot's action; its objective is to maximize the reward function, i.e. to maximize the cumulative reward of the robot in the environment. The Critic network takes the same input as the Actor network and outputs a value function used to evaluate the quality of the action. The value function is generally represented by a Q function, defined as

Q(s_t, a_t) = r_t + γ · V(s_{t+1}),

where s_t denotes the current state, a_t denotes the current action, r_t denotes the reward obtained after executing the action, γ denotes the discount factor, and V(s_{t+1}) is the value function of the next state; the training objective is to minimize the error of the Q function.
Step 3.1: policy determination. In the present invention, the goal of each robot is to minimize the expected time E[t_G] to reach its target P_G while avoiding conflicts with the adjacent robots N_i, by formulating a policy π. The problem can be stated as

argmin_π E[t_G]
s.t.  ||P_t − P'_{j,t}||_2 ≥ r + r'_j,   P_t = P_G on reaching the goal,

together with the robot's kinematic constraint over the time step Δt, where s_t and s'_{j,t} represent the state of the robot and of another robot j at time t, s^o_t is the observable state, r denotes the radius, ||P_t − P'_{j,t}||_2 ≥ r + r'_j is the distance (collision-avoidance) constraint, P_t is the current position, and P_G is the target position.

Simplifying the robot model increases computational efficiency, and higher-order models can be accommodated with simple modifications in what follows. The RL framework generates policies based on the joint configuration of the robot and its neighbours, and applies a reward function that punishes the robot when a collision occurs and rewards it when the target is reached.
Step 3.2: parameter updating. In the A3C algorithm, each robot runs several threads that interact with their own copies of the environment in parallel; at the end of each round, every thread updates the global network parameters using the gradient computed from its loss terms.
When updating the Critic network parameters, the Actor-Critic algorithm estimates the value target by a single-step sampling approximation, namely

Q(s_t, a_t) ≈ r_t + γ · V(s_{t+1}).

The A3C algorithm further adopts an N-step sampling method, propagating the reward values within a multi-step horizon back to the current state through the attenuation (discount) factor γ in order to increase the convergence rate:

R_t = r_t + γ r_{t+1} + … + γ^{N−1} r_{t+N−1} + γ^N V(s_{t+N}).

When the network parameters are updated, the entropy term H of the policy π, weighted by a coefficient c, is added. Compared with the Actor-Critic algorithm, A(s_t, a_t) is the advantage (score) function and θ' denotes the policy parameters; the gradient update formula is

dθ' ← dθ' + ∇_{θ'} log π(a_t | s_t; θ') · A(s_t, a_t) + c · ∇_{θ'} H(π(s_t; θ')).
Step 3.3: define the loss functions. The Actor loss function is used to optimize the policy function so as to maximize the probability of taking the optimal action in a given state; the Critic loss function is used to optimize the value function so that the future reward obtained in a given state is maximized. In these losses, the discounted return estimate is used as the learning target; the total loss is composed of the policy loss and the value loss, with an adjustable constant parameter balancing the relative importance of the Actor and Critic loss functions, together with an entropy term; the value loss is used to train the value function.
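To make the N-step return and the combined Actor/Critic loss concrete, the sketch below computes them for one recorded rollout. The use of the advantage R_t − V(s_t), the squared-error value loss, and the specific entropy and value coefficients follow the standard A3C formulation and are assumptions insofar as the patent does not fix their exact form or values.

```python
import torch

def a3c_losses(log_probs, values, entropies, rewards, v_last, gamma=0.99,
               value_coef=0.5, entropy_coef=0.01):
    """N-step A3C losses for one rollout of length N.

    log_probs, values, entropies: 1-D tensors from the Actor/Critic networks
    for steps t .. t+N-1; rewards: list of floats; v_last: V(s_{t+N})."""
    returns, R = [], float(v_last)
    for r in reversed(rewards):
        R = r + gamma * R                       # R_t = r_t + gamma * R_{t+1}
        returns.append(R)
    returns = torch.tensor(list(reversed(returns)))

    advantages = returns - values               # A(s_t, a_t) = R_t - V(s_t)
    policy_loss = -(log_probs * advantages.detach()).mean()
    value_loss = advantages.pow(2).mean()       # Critic: minimise squared error
    entropy_loss = -entropies.mean()            # maximise entropy -> exploration
    total = policy_loss + value_coef * value_loss + entropy_coef * entropy_loss
    return total, policy_loss, value_loss
```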
Step 3.4: network model design. So that the algorithm model can bypass all dangerous regions during motion planning and obtain the safest and most suitable path, a recurrent neural network (RNN) is combined with fully connected layers to build the neural network model. The RNN has the advantage of processing data with sequential properties: it considers not only the input at the current moment but also retains a memory of the preceding network states.
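One possible RNN-plus-fully-connected network of the kind described is sketched below. The choice of a GRU cell, the layer sizes, and the two output heads (policy logits for the Actor and a scalar value for the Critic) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """RNN + fully connected layers: the recurrent layer carries a memory of
    previous observations; the heads output the policy and the state value."""
    def __init__(self, obs_dim, action_dim, hidden_dim=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim, hidden_dim, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU())
        self.policy_head = nn.Linear(hidden_dim, action_dim)  # Actor output
        self.value_head = nn.Linear(hidden_dim, 1)            # Critic output

    def forward(self, obs_seq, h0=None):
        # obs_seq: (batch, time, obs_dim); the hidden state provides the memory
        out, h_n = self.rnn(obs_seq, h0)
        feat = self.fc(out[:, -1])              # features at the current moment
        return self.policy_head(feat), self.value_head(feat), h_n

model = RecurrentActorCritic(obs_dim=10, action_dim=5)
logits, value, h = model(torch.zeros(1, 4, 10))
```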
Step 4: reward function design and improvement. The reward function R describes, for the robot state s_t in the above environment, the quality of the action a_t taken in the current state: through interaction with the environment, the merit of the action is obtained from the designed reward function. The reward function of the invention is set as follows: a reward of 1 is given on reaching the end point (P = P_G); a reward of −0.25 is given when colliding with obstacles or other robots; a reward related to the distance d_min from obstacles and other robots is given in the interval (0, 2); and a reward of 0 is given at all other positions. Here P is the current position of the robot, P_G is the target position, and d_min is the distance between the current robot and obstacles and other robots.
With the above multi-objective reward function, contradictory relationships may arise between different targets, making it difficult for the robot to make decisions: the robot cannot determine which target is more important or how to balance the trade-off between different targets, which may result in its failing to reach the intended task goal or behaving unreasonably. To solve this problem, the invention introduces a game-theoretic idea and designs different reward weights. Taking safety first as the basic principle, followed by conflict-freeness and then the basic reward for completing the task, each corresponding reward term is given a weight, from large to small, according to the priority of the behaviour, which provides the robot with a priority judgment. The total reward function is

R = ω_1 · R_task − ω_2 · R_s − ω_3 · R_c,

where R_task denotes the basic reward for the robot completing its task, the weights ω satisfy ω_2 > ω_3 > ω_1, and R_s and R_c denote the penalties for collision and conflict of the robot with other robots or the environment. In the definitions of R_task, R_s and R_c, N denotes the number of robots, R_max is the maximum positive reward, P is the current position of the robot, P_G is the target position, two constants are used to adjust the weight relationship between time and distance, d_{i,j} denotes the distance between robots i and j, c_{i,j} denotes a conflict between robots i and j, and δ is an indicator function that returns 1 when the condition in brackets is met and 0 otherwise; R_s represents the collision penalty between robots and R_c the conflict penalty between robots.
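The sketch below evaluates the weighted total reward R = ω_1·R_task − ω_2·R_s − ω_3·R_c for one robot, respecting the priority ordering ω_2 > ω_3 > ω_1. The particular shapes chosen for R_task, R_s and R_c (indicator-style penalties and a distance-decaying task term) and the numeric weights and safety distance are illustrative assumptions, since the patent's exact formulas are not reproduced here.

```python
import numpy as np

def total_reward(p, p_goal, d_ij, c_ij, reached, r_max=1.0,
                 w_task=0.2, w_collision=0.5, w_conflict=0.3, d_safe=0.5):
    """Weighted game-theoretic reward for robot i.

    d_ij: distances to the other robots; c_ij: booleans marking a conflict
    with each of them. Weights satisfy w_collision > w_conflict > w_task:
    safety first, then conflict-freeness, then task completion."""
    # Basic task reward (assumed shape): full reward at the goal, otherwise
    # a small progress term decaying with the remaining distance.
    dist_goal = np.linalg.norm(np.asarray(p) - np.asarray(p_goal))
    r_task = r_max if reached else r_max / (1.0 + dist_goal)

    # Collision penalty: indicator that any neighbour is closer than d_safe.
    r_s = float(any(d < d_safe for d in d_ij))
    # Conflict penalty: indicator of any path/priority conflict c_ij.
    r_c = float(any(c_ij))

    return w_task * r_task - w_collision * r_s - w_conflict * r_c

r = total_reward(p=[1.0, 2.0], p_goal=[4.0, 2.0],
                 d_ij=[1.2, 0.4], c_ij=[False, True], reached=False)
```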
Step 5: scheme selection and strategy updating. In a multi-robot cooperative system, four situations can occur: normal, extremely simple, conflict, and high density. To cope with the continuously changing environment of the multi-robot cooperative system and avoid algorithm redundancy, different motion planning methods are adopted for the environments in these different situations. The specific scheme selection is as follows: the A3C strategy is selected under normal conditions, the FMP algorithm is adopted in extremely simple environments, and the DRL-FMP strategy is selected in complex situations such as multi-robot conflict and high density.
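A minimal dispatcher of the kind this hybrid scheme implies is sketched below. The patent specifies the mapping from situation to strategy but not numeric criteria, so the thresholds used to classify the environment as extremely simple, normal, or complex are assumptions.

```python
from enum import Enum

class Situation(Enum):
    EXTREMELY_SIMPLE = "extremely_simple"   # static environment
    NORMAL = "normal"
    COMPLEX = "complex"                     # conflict or high density

def classify(num_dynamic_obstacles, num_conflicts, robot_density,
             density_threshold=0.5):
    """Illustrative classification rule (thresholds are assumptions)."""
    if num_dynamic_obstacles == 0 and num_conflicts == 0:
        return Situation.EXTREMELY_SIMPLE
    if num_conflicts > 0 or robot_density > density_threshold:
        return Situation.COMPLEX
    return Situation.NORMAL

def select_strategy(situation):
    """Map the detected situation to the planning strategy of the patent:
    FMP in extremely simple environments, A3C under normal conditions,
    DRL-FMP in complex (conflict / high-density) conditions."""
    return {Situation.EXTREMELY_SIMPLE: "FMP",
            Situation.NORMAL: "A3C",
            Situation.COMPLEX: "DRL-FMP"}[situation]

strategy = select_strategy(classify(num_dynamic_obstacles=3,
                                    num_conflicts=1, robot_density=0.7))
```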
Step 5.1: FMP strategy training in an extremely simple environment. An extremely simple environment refers to a static environment, and path planning is carried out with the A* algorithm. The specific steps are as follows:
step1: for each node, calculating an actual distance from the node to the start point and an estimated distance from the node to the end point;
step2: calculating the valuation function value of the node;
step3: sorting all the nodes to be expanded according to the valuation function value from small to large, and selecting the node with the smallest value for expansion;
step4: for all neighboring nodes of the node, its actual distance to the start point is calculated, as well as the estimated distance to the end point. If the node has not been expanded or the path through the current node to the node is shorter, updating the value of the valuation function for the node and recalculating the value;
step5: step3 and Step4 are repeated until the shortest path is found.
Step 5.2: A3C strategy training in a normal environment. Each robot is trained as a reinforcement learning agent, using a deep reinforcement learning method to learn how to take the optimal action in a given environment. A3C generates random scenarios to learn the optimal policy; every scenario includes a random initial position and target position for each robot, and the radius and optimal speed of each robot are drawn randomly and uniformly from their respective ranges. The specific strategy updating method comprises the following steps (a worker-loop sketch follows the steps):
step1: initializing a strategy function, namely, a neural network which is randomly initialized can be used as the strategy function, inputting the state of the current robot and outputting the action which the robot should execute;
step2: performing environment interaction, putting the robot into the environment, selecting an action according to the current strategy function, executing the action, and observing feedback of the environment, wherein the feedback comprises rewards and the state of the next moment;
step3: recording data, namely recording information such as states, actions, rewards and the like obtained by executing the actions each time, and using the information for subsequent strategy updating;
step4: training a strategy function, namely training the strategy function by using an A3C algorithm according to the recorded data, and updating the weight of the neural network so that the strategy function is better adapted to the current environment;
step5: and updating the strategy function in real time, and carrying out real-time decision making by using the updated strategy function after training is completed. If the environment changes, the training needs to be carried out again, and the strategy function is updated.
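Steps 1-5 above correspond to the interaction-record-update cycle of a single A3C worker; a schematic loop is sketched below. The environment interface (`reset`/`step` returning observation, reward, done), the rollout length, and the update frequency are assumptions, and `model` and `loss_fn` are expected to behave like the `RecurrentActorCritic` and `a3c_losses` sketches given earlier.

```python
import torch

def a3c_worker(env, model, optimizer, loss_fn, num_updates=1000, n_steps=20, gamma=0.99):
    """One asynchronous worker: interact, record the rollout, update the model."""
    obs, h = env.reset(), None
    for _ in range(num_updates):
        log_probs, values, entropies, rewards = [], [], [], []
        done = False
        for _ in range(n_steps):                              # steps 2-3: interact and record
            x = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
            logits, value, h = model(x, h)
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            obs, reward, done = env.step(action.item())
            log_probs.append(dist.log_prob(action).squeeze())
            entropies.append(dist.entropy().squeeze())
            values.append(value.squeeze())
            rewards.append(reward)
            if done:
                break
        if done:                                              # terminal state: no bootstrap
            v_last = 0.0
            obs, h = env.reset(), None
        else:
            with torch.no_grad():
                x = torch.as_tensor(obs, dtype=torch.float32).view(1, 1, -1)
                _, v, _ = model(x, h)
            v_last = float(v)
        total, _, _ = loss_fn(torch.stack(log_probs), torch.stack(values),
                              torch.stack(entropies), rewards, v_last, gamma=gamma)
        optimizer.zero_grad()                                 # step 4: train the policy function
        total.backward()
        optimizer.step()
        if h is not None:
            h = h.detach()                                    # truncate backprop between updates
```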
Step 5.3: DRL-FMP strategy training in a complex environment. The A3C algorithm is used to learn a policy function that controls the robot's motion strategy; through learning of the policy function, the robot can move along the reference trajectory and move cooperatively with the other robots. The policy function and the reference trajectory are then updated in real time so as to optimize the robot's motion control strategy. During real-time motion, the policy function is continuously updated with the A3C algorithm and the reference trajectory is updated in real time with the FMP algorithm to adapt to the task requirements of the complex environment. The specific steps are as follows (a control-loop sketch follows the list):
step1: calculating a reference track, and calculating the reference track of each robot by using an A3C algorithm, wherein the track can meet the requirements of safety, no conflict, short path and quick arrival;
step2: monitoring environmental changes, wherein the environment may change such as new obstacles, robot state changes and the like in the robot transportation process, so that the environmental changes need to be monitored in real time and the reference track needs to be recalculated;
step3: updating the reference track, if the environment changes, the reference track of each robot needs to be recalculated and updated into a path tracking controller of the robot;
step4: and after updating the reference track, the robot can realize real-time path tracking according to the path tracking controller and complete the transportation task of multiple robots.
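The interplay of Steps 1-4 above can be summarised as the following control loop. The function names `compute_reference_track`, `policy_action`, `track_path`, and `environment_changed` are placeholders assumed for illustration: they stand respectively for the reference-track computation, the learned A3C policy, the robot's path-tracking controller, and the environment monitor.

```python
def drl_fmp_loop(robots, environment, compute_reference_track, policy_action,
                 track_path, environment_changed, max_steps=1000):
    """Hybrid DRL-FMP execution loop for a complex environment.

    Each robot keeps a reference track; when the environment changes (new
    obstacles, changed robot states) the track is recomputed, and at every
    step the learned policy chooses an action sent to the tracking controller."""
    tracks = {r: compute_reference_track(r, environment) for r in robots}  # Step 1
    for _ in range(max_steps):
        if environment_changed(environment):                               # Step 2
            for r in robots:
                tracks[r] = compute_reference_track(r, environment)        # Step 3
        for r in robots:
            action = policy_action(r, tracks[r], environment)              # policy decision
            track_path(r, tracks[r], action)                               # Step 4: real-time tracking
```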
The invention designs a hybrid motion planning algorithm for solving the motion planning problem of the multi-robot system in different environment states. Compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
1. According to different environmental conditions, different motion planning strategies are selected, which avoids the large amount of computation and the waste of resources caused by algorithm redundancy; the fused algorithm adapts better to changes in various environments and solves the motion planning problem of robots in complex environments such as dynamic, conflicting and high-density environments;
2. A cooperative game idea is introduced and a new reward function is designed, which reduces the probability of robot collisions and conflicts in crowded and dynamic environments, so that the requirements of multi-robot transportation for cooperativity, safety and efficiency in such environments can be met;
3. The invention meets the requirements of the multi-robot system for cooperativity, safety and efficiency in complex environments; it is effective not only in the industrial field but also in fields such as the service industry and agriculture.
The invention will be described in further detail with reference to the accompanying drawings.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
fig. 2 is a diagram of a hybrid control framework in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Fig. 1 is an overall flowchart of an embodiment of the present invention, which provides a motion planning method based on machine learning in a complex environment, taking multiple automatic guided vehicles (Automated Guided Vehicle, AGV) transportation in a production shop as an example, specifically including the following procedures: determining environment and action space, generating a reference track by adopting an FMP algorithm, determining a learning strategy of an A3C algorithm, designing and improving a reward function, and selecting and updating a scheme.
Fig. 2 is a hybrid control frame diagram in an embodiment of the present invention, where scenes that may be generated in different environments are classified into four types, and in order to avoid situations of algorithm redundancy and inflexibility of robot decision, a scheme determination mode under hybrid control is designed, specifically: the A3C strategy is selected under normal conditions, the FMP algorithm is adopted under extremely simple environment, and the DRL-FMP strategy is selected under complex conditions of multi-robot conflict, high density and the like.
The specific implementation is as follows:
step1 is implemented: the environment and the agent are the main bodies of the reinforcement learning system, in this embodiment, the environment is set as a production workshop environment, the agent is a workshop transport AGV, and the determination of the environment and the action space of the AGV are defined as follows.
Step 1.1: workshop modeling, namely establishing a 3D simulation environment of a workshop by adopting Gazebo in the ROS according to the actual environment of a workshop site, specifically comprising information such as a drivable area, the position of production equipment and the like, and determining information such as a starting point, a terminal point and the like of the AGV according to the actual situation.
Step 1.2: define the action space. Let s_t and a_t denote the state and action of one AGV, and let s'_{j,t} denote the state of another AGV j at time t. The state vector consists of an observable part s^o and an unobservable part s^h, i.e. s = [s^o, s^h]. The observable part consists of the AGV's position, speed and radius; observable here means the portion of the state visible to adjacent AGVs, so each AGV has this information about its neighbouring AGVs. The target position P_G, the optimal speed, the heading direction (one angle in a two-dimensional environment, two angles in a three-dimensional environment) and other hidden quantities of the AGV constitute the unobservable portion of the state. The action space comprises two parts: the speed, and the heading angle in a two-dimensional environment (plus the course angle in a three-dimensional environment).
Step2 is implemented: the FMP algorithm is a motion planning method based on field quantity theory, and can guide the AGV to a designated target point and avoid collision with an obstacle by establishing attractive force and repulsive force fields.
Step 2.1: generate the reference trajectory. FMP calculates the attractive and repulsive force fields from the current position of the AGV and the position of the target point, and then superimposes them to obtain the total field quantity at the AGV's current position; the AGV moves according to the direction and magnitude of the total field quantity. The force exerted on the AGV can be expressed as

F = F_a + F_r,  with  F_a(P) = −η(P − P_G),

where F_a is the attractive force, F_r is the repulsive force obtained by summing the contributions of the obstacles, P is the current position of the AGV, P_G is the target position, P_i is the position of obstacle i, N is the number of obstacles, and η and β are the coefficients that adjust the attractive and repulsive forces, respectively.
Step 2.2: collision detection, in which collision problems between AGVs need to be considered in generating the reference trajectory. Therefore, collision detection is required each time the reference trajectory is updated to avoid collisions between different AGVs. The specific detection steps are as follows:
step1: determining a motion trail of the AGV and limiting conditions of speed and acceleration;
step2: calculating the distance and the relative speed between the AGV and other AGVs or obstacles;
step3: predicting whether collision occurs according to the change rate of the distance and the speed;
step4: if collision is predicted, stopping the motion of the AGV and rescheduling the motion trail of the AGV to avoid collision;
step5: if no collision is predicted, the travel continues along the original trajectory.
Step 2.3: in the process of generating the reference track, obstacles and other AGVs must be avoided, so path planning is required; it is realized by the traditional A* algorithm.
The A* algorithm is an effective heuristic search algorithm for finding the shortest path in a simple map environment. Starting from a given starting point, the method searches the environment around the current point, computes for each candidate node the sum of the accumulated cost from that node back to the starting point and the estimated cost from that node to the target point, and selects the node with the minimum total cost as the next expansion node; the search ends when the expansion node is the target point. The heuristic evaluation function of A* can be expressed as

F(n) = G(n) + H(n),

where G(n) is the actual cost from the starting point to the current node, H(n) is the estimated cost from the current node to the target node, and F(n) is the total cost of the current node.

The invention uses the Euclidean distance to calculate the estimated cost:

H(n) = sqrt((x_n − x_g)^2 + (y_n − y_g)^2),

where (x_n, y_n) are the coordinates of the current node and (x_g, y_g) are the coordinates of the target point.
Step 3 is implemented as follows: the A3C algorithm learning strategy is determined. The A3C (Asynchronous Advantage Actor-Critic) algorithm is an improved form of the Actor-Critic framework: several thread sub-networks share network parameter information with a global network, and the global network feeds the optimal parameters back to each thread, which facilitates interaction with the environment and yields an asynchronous, concurrent learning model. The input of the Actor network typically includes the AGV's observed information, such as the environmental conditions surrounding the AGV and the positions and speeds of other AGVs, and its output is the AGV's action; its objective is to maximize the reward function, i.e. to maximize the cumulative reward of the AGV in the environment. The Critic network takes the same input as the Actor network and outputs a value function used to evaluate the quality of the action. The value function is generally represented by a Q function, defined as

Q(s_t, a_t) = r_t + γ · V(s_{t+1}),

where s_t denotes the current state, a_t denotes the current action, r_t denotes the reward obtained after executing the action, γ denotes the discount factor, and V(s_{t+1}) is the value function of the next state; the training objective is to minimize the error of the Q function.
Step 3.1: policy determination. In the present invention, the goal of each AGV is to minimize the expected time E[t_G] to reach its target P_G while avoiding conflicts with the adjacent AGVs N_i, by formulating a policy π. The problem can be stated as

argmin_π E[t_G]
s.t.  ||P_t − P'_{j,t}||_2 ≥ r + r'_j,   P_t = P_G on reaching the goal,

together with the AGV's kinematic constraint over the time step Δt, where s_t and s'_{j,t} represent the state of the AGV and of another AGV j at time t, s^o_t is the observable state, r denotes the radius, ||P_t − P'_{j,t}||_2 ≥ r + r'_j is the distance (collision-avoidance) constraint, P_t is the current position, and P_G is the target position.

Simplifying the AGV model increases computational efficiency, and higher-order models can be accommodated with simple modifications in what follows. The RL framework generates policies based on the joint configuration of the AGV and its neighbours, and applies a reward function that penalizes the AGV when a collision occurs and rewards it when the target is reached.
Step 3.2: parameter updating. In the A3C algorithm, each AGV runs several threads that interact with their own copies of the environment in parallel; at the end of each round, every thread updates the global network parameters using the gradient computed from its loss terms.
When updating the Critic network parameters, the Actor-Critic algorithm estimates the value target by a single-step sampling approximation, namely

Q(s_t, a_t) ≈ r_t + γ · V(s_{t+1}).

The A3C algorithm further adopts an N-step sampling method, propagating the reward values within a multi-step horizon back to the current state through the attenuation (discount) factor γ in order to increase the convergence rate:

R_t = r_t + γ r_{t+1} + … + γ^{N−1} r_{t+N−1} + γ^N V(s_{t+N}).

When the network parameters are updated, the entropy term H of the policy π, weighted by a coefficient c, is added. Compared with the Actor-Critic algorithm, A(s_t, a_t) is the advantage (score) function and θ' denotes the policy parameters; the gradient update formula is

dθ' ← dθ' + ∇_{θ'} log π(a_t | s_t; θ') · A(s_t, a_t) + c · ∇_{θ'} H(π(s_t; θ')).
Step 3.3: define the loss functions. The Actor loss function is used to optimize the policy function so as to maximize the probability of taking the optimal action in a given state; the Critic loss function is used to optimize the value function so that the future reward obtained in a given state is maximized. In these losses, the discounted return estimate is used as the learning target; the total loss is composed of the policy loss and the value loss, with an adjustable constant parameter balancing the relative importance of the Actor and Critic loss functions, together with an entropy term; the value loss is used to train the value function.
Step 3.4: network model design. So that the algorithm model can bypass all dangerous regions during motion planning and obtain the safest and most suitable path, a recurrent neural network (RNN) is combined with fully connected layers to build the neural network model. The RNN has the advantage of processing data with sequential properties: it considers not only the input at the current moment but also retains a memory of the preceding network states.
Step 4 is implemented as follows: reward function design and improvement. The reward function R describes, for the AGV state s_t in the above environment, the quality of the action a_t taken in the current state: through interaction with the environment, the merit of the action is obtained from the designed reward function. The reward function of the invention is set as follows: a reward of 1 is given on reaching the end point (P = P_G); a reward of −0.25 is given when colliding with obstacles or other AGVs; a reward related to the distance d_min from obstacles and other AGVs is given in the interval (0, 2); and a reward of 0 is given at all other positions. Here P is the current position of the AGV, P_G is the target position, and d_min is the distance between the current AGV and obstacles and other AGVs.
With the above multi-objective reward function, contradictory relationships may arise between different targets, making it difficult for the AGV to make decisions: the AGV cannot determine which target is more important or how to balance the trade-off between different targets, which may result in its failing to achieve the intended task goal or exhibiting unreasonable behaviour. To solve this problem, the invention introduces a game-theoretic idea and designs different reward weights. Taking safety first as the basic principle, followed by conflict-freeness and then the basic reward for completing the task, each corresponding reward term is given a weight, from large to small, according to the priority of the behaviour, which provides the AGV with a priority judgment. The total reward function is

R = ω_1 · R_task − ω_2 · R_s − ω_3 · R_c,

where R_task denotes the basic reward for the AGV completing its task, the weights ω satisfy ω_2 > ω_3 > ω_1, and R_s and R_c denote the penalties for collision and conflict of the AGV with other AGVs or the environment. In the definitions of R_task, R_s and R_c, N denotes the number of AGVs, R_max is the maximum positive reward, P is the current position of the AGV, P_G is the target position, two constants are used to adjust the weight relationship between time and distance, d_{i,j} denotes the distance between AGVs i and j, c_{i,j} denotes a conflict between AGVs i and j, and δ is an indicator function that returns 1 when the condition in brackets is met and 0 otherwise; R_s represents the collision penalty between AGVs and R_c the conflict penalty between AGVs.
Step 5: scheme selection and strategy updating. In a multi-AGV cooperative system, four situations can occur: normal, extremely simple, conflict, and high density. To cope with the continuously changing environment of the multi-AGV cooperative system and avoid algorithm redundancy, different motion planning methods are adopted for the environments in these different situations: the A3C strategy is selected under normal conditions, the FMP algorithm is adopted in extremely simple environments, and the DRL-FMP strategy is selected in complex situations such as multi-AGV conflict and high density.
Step 5.1: FMP strategy training in an extremely simple environment. An extremely simple environment refers to a static environment, and path planning is carried out with the A* algorithm. The specific steps are as follows:
step1: for each node, calculating an actual distance from the node to the start point and an estimated distance from the node to the end point;
step2: calculating the valuation function value of the node;
step3: sorting all the nodes to be expanded according to the valuation function value from small to large, and selecting the node with the smallest value for expansion;
step4: for all neighboring nodes of the node, its actual distance to the start point is calculated, as well as the estimated distance to the end point. If the node has not been expanded or the path through the current node to the node is shorter, updating the value of the valuation function for the node and recalculating the value;
step5: step3 and Step4 are repeated until the shortest path is found.
Step 5.2: A3C strategy training in a normal environment. Each AGV is trained as a reinforcement learning agent, using a deep reinforcement learning method to learn how to take the optimal action in a given environment. A3C generates random scenarios to learn the optimal policy; every scenario includes a random initial position and target position for each AGV, and the radius and optimal speed of each AGV are drawn randomly and uniformly from their respective ranges. The specific strategy updating method comprises the following steps:
step1: initializing a strategy function, namely, a neural network which is randomly initialized can be used as the strategy function, inputting the current state of the AGV and outputting the action which the AGV should execute;
step2: performing environment interaction, namely placing the AGV into the environment, selecting an action according to a current strategy function, executing the action, and observing feedback of the environment, wherein the feedback comprises rewards and the state of the next moment;
step3: recording data, namely recording information such as states, actions, rewards and the like obtained by executing the actions each time, and using the information for subsequent strategy updating;
step4: training a strategy function, namely training the strategy function by using an A3C algorithm according to the recorded data, and updating the weight of the neural network so that the strategy function is better adapted to the current environment;
step5: and updating the strategy function in real time, and carrying out real-time decision making by using the updated strategy function after training is completed. If the environment changes, the training needs to be carried out again, and the strategy function is updated.
Step 5.3: DRL-FMP strategy training in a complex environment. The A3C algorithm is used to learn a policy function that controls the AGV's motion strategy; through learning of the policy function, the AGV can move along the reference trajectory and move cooperatively with the other AGVs. The policy function and the reference trajectory are then updated in real time so as to optimize the AGV's motion control strategy. During real-time motion, the policy function is continuously updated with the A3C algorithm and the reference trajectory is updated in real time with the FMP algorithm to adapt to the task requirements of the complex environment. The specific steps are as follows:
step1: calculating a reference track, namely calculating the reference track of each AGV by using an A3C algorithm, wherein the track meets the requirements of safety, no conflict, short path and quick arrival;
step2: monitoring environmental changes, wherein the environment may change such as new obstacles, change of AGV state and the like in the AGV transportation process, so that the environmental changes need to be monitored in real time and the reference track needs to be recalculated;
step3: updating the reference track, if the environment changes, the reference track of each AGV needs to be recalculated and updated into the path tracking controller of the AGV;
step4: and after updating the reference track, the AGV can realize real-time path tracking according to the path tracking controller and complete multiple AGV transportation tasks.
Claims (3)
1. The motion planning method based on machine learning in the complex environment is characterized by comprising the following steps:
step one, determining an environment and an action space;
designing an FMP motion planning algorithm, generating a reference track according to the total field quantity, planning a path by adopting an A-type algorithm, and calculating estimated cost by using Euclidean distance;
determining an A3C algorithm learning strategy, and designing a neural network model by adopting an RNN;
step four, reward function design and improvement: a cooperative game idea is introduced to improve the reward function, and the reward priority of the new reward function is determined through weights; the specific implementation is as follows:
the reward function is designed as follows:
wherein R represents the reward function, P is the current position of the robot, P_G is the target position, and d_min is the distance between the current robot and obstacles and other robots;
by designing different reward weights, the total reward function is as follows:
R = ω_1·R_task − ω_2·R_s − ω_3·R_c
wherein R_task represents the basic reward for the robot completing its task, ω represents the weights, satisfying ω_2 > ω_3 > ω_1, and R_s and R_c represent the penalties for collision and conflict of the robot with other robots or the environment; R_task, R_s and R_c are defined as follows:
wherein N represents the number of robots, R_max is the maximum positive reward, two constants are used to adjust the weight relationship between time and distance, d_{i,j} represents the distance between robots i and j, c_{i,j} represents a conflict between robots i and j, δ is an indicator function that returns 1 when the condition in brackets is met and 0 otherwise, R_s represents the collision penalty between robots, and R_c represents the conflict penalty between robots;
step five, scheme selection and strategy updating, adopting an A3C strategy under normal conditions, using an FMP algorithm under extremely simple environments, selecting a DRL-FMP strategy under complex conditions of multiple AGV conflicts and high density, and respectively updating and training the selected strategies.
2. The machine-learning-based motion planning method according to claim 1, wherein in step two the FMP motion planning algorithm is designed, a reference track is generated according to the total field quantity, path planning is performed with the A* algorithm, and the estimated cost is calculated with the Euclidean distance, specifically comprising the following steps:
the force applied to the robot when generating the reference trajectory is expressed as:
F = F_a + F_r
F_a(P) = −η(P − P_G)
wherein F_a is the attractive force, F_r is the repulsive force, P is the current position of the robot, P_G is the target position, P_i is the position of obstacle i, N is the number of obstacles, and η and β are coefficients for adjusting the attractive force and the repulsive force, respectively;
in the process of generating the reference track, the path planning of the process is realized by the traditional A* algorithm, wherein the heuristic evaluation function of A* is expressed as:
F(n) = G(n) + H(n)
the estimated cost is calculated by the Euclidean distance method, wherein G(n) is the actual cost from the starting point to the current node, H(n) is the estimated cost from the current node to the target node, F(n) is the total cost of the current node, (x_n, y_n) are the coordinates of the current node, and (x_g, y_g) are the coordinates of the target point; the calculation formula is as follows:
H(n) = sqrt((x_n − x_g)^2 + (y_n − y_g)^2)
the specific detection steps of the generated track are as follows:
step1: determining a motion trail, a speed and an acceleration;
step2: calculating a relative distance and a relative speed;
step3: calculating the change rate of the distance and the speed, and predicting whether collision occurs or not;
step4: predicting collision, stopping movement and re-planning the movement track;
step5: and predicting no collision and travelling according to the original track.
3. The machine-learning-based motion planning method in a complex environment according to claim 1, wherein in step three the A3C algorithm learning strategy is determined and the neural network model is designed using an RNN, specifically comprising the following steps:
a learning strategy is determined, with π as the formulated policy, satisfying:
argmin_π E[t_G]
s.t. ||P_t − P'_{j,t}||_2 ≥ r + r'_j
P_t = P_G
wherein E[t_G] is the objective (the expected time to reach the goal), N_i is the set of conflicting adjacent robots, s_t and s'_{j,t} represent the state of one robot and the state of the other robot j at time t, s^o_t is the observable state, r and r'_j denote the radii of the two robots, ||P_t − P'_{j,t}||_2 ≥ r + r'_j represents the distance constraint, P_t is the current position, P_G is the target position, the remaining relation is the subject kinematics, and Δt is the time step;
and (3) designing a network model, combining the RNN circulating neural network with the full-connection layer, and constructing the neural network model.