CN116679719A - Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy - Google Patents

Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy

Info

Publication number
CN116679719A
CN116679719A · CN202310792088.4A · CN202310792088A
Authority
CN
China
Prior art keywords
unmanned vehicle
network
rewards
model
ppo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310792088.4A
Other languages
Chinese (zh)
Inventor
张卫波
王单坤
黄赐坤
林景胜
丘英浩
陈虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202310792088.4A priority Critical patent/CN116679719A/en
Publication of CN116679719A publication Critical patent/CN116679719A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention relates to an adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy optimization. First, an agent-environment interaction model oriented to the unmanned vehicle is constructed; a proximal policy optimization (PPO) learning model based on the actor-critic framework is built; a reward function is defined according to the principle and main evaluation factors of the Dynamic Window Approach (DWA); model parameters such as the input layer, output layer, number of hidden layers and number of neurons are determined; and a DWA-PPO deep reinforcement learning model is thus constructed. The established DWA-PPO deep reinforcement learning model is then trained through continuous iteration until it converges to a network model that captures the latent relation between the surrounding environment information and the weight parameters of the evaluation function, completing the construction of the adaptive PPO-ADWA algorithm. Finally, the feasibility and effectiveness of the PPO-ADWA-based adaptive path planning strategy for unmanned vehicles are verified through comparative simulation experiments.

Description

Adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy

Technical Field

The invention relates to the technical field of unmanned-vehicle path planning and autonomous navigation, and in particular to an adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy.

Background Art

In recent years, with the rapid development of science and technology, a new round of technological and industrial revolution represented by the Internet, artificial intelligence and big data is redefining every sector of society, and the traditional automobile industry is facing profound change. Traditional vehicles are developing toward intelligence and autonomy, and intelligent connected vehicles and self-driving cars have become the strategic direction of the global automotive industry. Intelligent driving technology mainly includes environment perception, navigation and positioning, path planning, and control and decision-making. Path planning is an important link in intelligent driving and is of great significance to its development.

Path planning is an important component of autonomous intelligent vehicles. It refers to planning a safe, feasible and collision-free path in a known environment by means of an algorithm and selecting the optimal obstacle-avoiding path connecting the start point to the goal; in essence it is the optimal solution under several constraints, and it is a key part of unmanned navigation technology. Path planning algorithms can be divided into global planning based on an understanding of the complete map and local planning based on an understanding of the local area. The Dynamic Window Approach (DWA), a local path planning method that takes the vehicle's motion capabilities into account, is widely used in intelligent-vehicle navigation. The decision-making core of the DWA algorithm is its evaluation function, which consists of three parts: a heading function, an obstacle function and a velocity function; the evaluation function is the weighted sum of these three sub-functions. In the classic DWA algorithm the weights of the three sub-functions are fixed values. However, while the vehicle explores toward the goal, the surrounding obstacle environment is complex and changeable, and different obstacle distributions require different weights; the fixed-weight scheme of the classic DWA algorithm therefore easily traps the vehicle in a local optimum or makes the goal unreachable. For this reason, the classic DWA algorithm is improved with the help of the proximal policy optimization algorithm from deep reinforcement learning.

Summary of the Invention

The purpose of the present invention is to solve the problem that, because the weight coefficients in the evaluation function cannot be dynamically adjusted, the agent facing different obstacle environments often cannot reach the goal or compute the optimal path. It provides an adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy that improves on the classic DWA algorithm: the weight parameters of the classic DWA algorithm are combined with proximal policy optimization from deep reinforcement learning, and through learning and training, model parameters suited to different static obstacle environments are obtained, completing the construction of the adaptive PPO-ADWA algorithm.

To achieve the above purpose, the technical solution of the present invention is an adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy, comprising the following steps:

Step 1. Construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

Step 2. Establish the DWA algorithm model; according to the Ackermann intelligent vehicle, determine the velocity range, angular velocity range, acceleration range and angular acceleration range parameters, as well as the main elements and the evaluation function of the DWA algorithm.

Step 3. Establish a proximal policy optimization (PPO) learning model based on the actor-critic framework, simulate the actual application scenario of the unmanned vehicle as the learning environment of the model, and determine the states and actions of the model according to the application scenario.

Step 4. Construct the DWA-PPO deep reinforcement learning model, define the reward function including the main-line reward and the sub-goal rewards, and determine the model parameters including the input and output layer sizes and the numbers of hidden layers and neurons, completing the instantiation of the DWA-PPO deep reinforcement learning model.

Step 5. Construct the adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments to collect a training set for training the DWA-PPO model, and through repeated iteration converge to a model that outputs the corresponding weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6. Demonstrate the adaptive adjustment capability of unmanned-vehicle path planning based on the adaptive PPO-ADWA algorithm through comparative simulation experiments.

Compared with the prior art, the present invention has the following beneficial effects. Aimed at the fact that the weight coefficients in the evaluation function of the traditional DWA algorithm do not adjust dynamically with the environment of the intelligent vehicle and its own motion state, the method uses the proximal policy optimization algorithm from deep reinforcement learning to build a DWA-PPO deep reinforcement learning model; the network model obtained through continuous iterative training outputs the corresponding weight parameters, completing the construction of the adaptive PPO-ADWA algorithm. The method solves the problem that, because the weight coefficients in the evaluation function cannot be dynamically adjusted, the agent facing different obstacle environments often cannot reach the goal or compute the optimal path.

Brief Description of the Drawings

Figure 1 is a schematic diagram of the agent-environment interaction model.

Figure 2 is a schematic diagram of the principle of the DWA algorithm.

Figure 3 shows the velocity-angular velocity window.

Figure 4 is a schematic diagram of the heading angle ψ and δ.

Figure 5 is a schematic diagram of the actor-critic framework.

Figure 6 shows the state s.

Figure 7 shows the policy network structure.

Figure 8 shows the value network structure.

Figure 9 shows the DWA-PPO model.

Figure 10 shows the score and arrival-rate curves.

Figure 11 shows the simulation environment.

Figure 12 shows the classic DWA result.

Figure 13 shows the PPO-ADWA result.

Figure 14 shows the weight-parameter change curves.

Figure 15 is a flowchart of the method of the present invention.

Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below with reference to Figures 1-15.

As shown in Figure 15, the present invention provides an adaptive path planning method for unmanned vehicles based on the dynamic window method and proximal policy, comprising the following steps:

Step 1. Construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

Step 2. Establish the DWA algorithm model; according to the Ackermann intelligent vehicle, determine the velocity range, angular velocity range, acceleration range and angular acceleration range parameters, as well as the main elements and the evaluation function of the DWA algorithm.

Step 3. Establish a proximal policy optimization (PPO) learning model based on the actor-critic framework, simulate the actual application scenario of the unmanned vehicle as the learning environment of the model, and determine the states and actions of the model according to the application scenario.

Step 4. Construct the DWA-PPO deep reinforcement learning model, define the reward function including the main-line reward and the sub-goal rewards, and determine the model parameters including the input and output layer sizes and the numbers of hidden layers and neurons, completing the instantiation of the DWA-PPO deep reinforcement learning model.

Step 5. Construct the adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments to collect a training set for training the DWA-PPO model, and through repeated iteration converge to a model that outputs the corresponding weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6. Demonstrate the adaptive adjustment capability of unmanned-vehicle path planning based on the adaptive PPO-ADWA algorithm through comparative simulation experiments.

Each step is implemented as follows.

Step 1. As shown in Figure 1, construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

The agent plays the role of decision maker and learner in the deep reinforcement learning system; it is mainly responsible for outputting actions and receiving rewards and states. The environment is the object the agent interacts with, and the interaction process consists of the following three steps:

(1) The agent observes information Ot ∈ O from the environment state St ∈ S, where S is the state space, the set of possible environment states, and O is the observation space, the set of possible agent observations.

(2) Based on the known Ot, the agent makes a decision and determines the action At ∈ A to be applied to the environment, where A is the set of possible actions.

(3) Affected by At, the environment transitions from its state St to St+1 and gives the agent a reward Rt ∈ R, where R is the set of possible rewards. The discretized agent-environment interaction model can therefore be represented by the following sequence:

S0, O0, A0, R0, S1, O1, A1, R1, S2, O2, A2, R2, …, ST = S_terminal

When the state of the environment is fully observable by the agent, St = Ot, and the sequence simplifies to:

S0, A0, R0, S1, A1, R1, S2, A2, R2, …, ST = S_terminal
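The interaction loop above can be sketched in a few lines of Python; the gym-style `ObstacleMapEnv`/`Agent` interface below is purely illustrative and not part of the patent:

```python
# Minimal sketch of the discretized agent-environment interaction loop.
# ObstacleMapEnv and Agent are illustrative placeholders, not from the patent.
def run_episode(env, agent, max_steps=500):
    trajectory = []                        # stores (S_t, A_t, R_t)
    s_t = env.reset()                      # S_0; fully observable, so O_t = S_t
    for _ in range(max_steps):
        a_t = agent.act(s_t)               # agent decides A_t from S_t
        s_next, r_t, done = env.step(a_t)  # environment transitions and rewards
        trajectory.append((s_t, a_t, r_t))
        s_t = s_next
        if done:                           # terminal state S_T reached
            break
    return trajectory
```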

Step 2. Establish the DWA algorithm model; according to the Ackermann intelligent vehicle, determine the velocity range, angular velocity range, acceleration range and angular acceleration range parameters, as well as the main elements and the evaluation function of the DWA algorithm.

The DWA algorithm is a local path planning method that interprets the map environment of the unmanned vehicle intuitively from the velocity-space perspective. Its workflow is as follows: considering the constraints imposed on velocity and angular velocity at time t, obtain the velocity-angular velocity window Vwin reachable by the unmanned vehicle at time t; discretize the window and enumerate the combinations of discretized velocity and angular velocity; for every combination, simulate the vehicle moving forward for m intervals of length Δt according to the given motion model, obtaining the simulated trajectory set τ, i.e. a series of point sets; the evaluation function scores all simulated trajectories in τ, and the combination corresponding to the highest-scoring trajectory τb is selected; the vehicle is driven with this combination for a duration Δt, reaching time t+1; the cycle repeats until the goal is reached. Here m is the number of sampling steps and Δt the sampling interval, as shown in Figure 2.

At time t, the window Vwin of the unmanned vehicle is constrained by its own hardware and by the surrounding environment; the following three constraints are considered:

(1) Limit velocity and angular velocity constraint:

Vlim = {(v,w) | v ∈ [vmin, vmax] ∧ w ∈ [wmin, wmax]}

(2) Velocity and angular velocity constraint imposed by the acceleration limits:

Vacc = {(v,w) | v ∈ [vcu − v̇max·Δt, vcu + v̇max·Δt] ∧ w ∈ [wcu − ẇmax·Δt, wcu + ẇmax·Δt]}

(3) Velocity and angular velocity constraint imposed by the braking distance:

Vdis = {(v,w) | v ≤ √(2·dist(v,w)·v̇max) ∧ w ≤ √(2·dist(v,w)·ẇmax)}

Above, vmin and vmax are the limit linear velocities and wmin and wmax the limit angular velocities; vcu and wcu are the current linear and angular velocity; v̇max is the limit linear acceleration and ẇmax the limit angular acceleration; dist(v,w) is the closest distance between the simulated trajectory corresponding to the velocity pair (v,w) and the obstacles. The window Vwin at time t is finally expressed as:

Vwin = Vlim ∩ Vacc ∩ Vdis
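A compact sketch of how the window Vwin could be computed and how admissibility under the braking constraint could be checked; the parameter names and the `dist_to_obstacles` helper are assumptions for illustration only:

```python
import numpy as np

# Sketch of the dynamic window V_win = V_lim ∩ V_acc and of the braking-distance
# admissibility check V_dis; dist_to_obstacles is an assumed helper that returns
# the closest obstacle distance along the simulated trajectory of (v, w).
def dynamic_window(v_cu, w_cu, v_min, v_max, w_min, w_max, dv_max, dw_max, dt):
    v_lo = max(v_min, v_cu - dv_max * dt)   # intersect hardware and acceleration limits
    v_hi = min(v_max, v_cu + dv_max * dt)
    w_lo = max(w_min, w_cu - dw_max * dt)
    w_hi = min(w_max, w_cu + dw_max * dt)
    return v_lo, v_hi, w_lo, w_hi

def admissible(v, w, traj, obstacles, dv_max, dw_max):
    d = dist_to_obstacles(traj, obstacles)  # closest distance of the simulated trajectory
    return v <= np.sqrt(2.0 * d * dv_max) and abs(w) <= np.sqrt(2.0 * d * dw_max)
```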

The window is shown in Figure 3. The evaluation function comprises three sub-functions and jointly considers three factors: the driving speed of the unmanned vehicle, the risk of collision with obstacles, and the vehicle heading, as follows:

G(v,w) = σ(α·heading(v,w) + η·dist(v,w) + γ·vel(v,w))

where heading(v,w) evaluates how closely the vehicle heading angle ψ is aligned with δ, the angle between the line connecting the unmanned vehicle to the target point and the positive direction of the x-axis, as shown in Figure 4.

dist(v,w) is the Euclidean distance from the simulated trajectory to the nearest obstacle, vel(v,w) is the magnitude of the linear velocity of the unmanned vehicle, and α, η and γ are three weight coefficients. As can be seen, the evaluation function is composed of sub-functions of different dimensions; the normalization function σ() in the formula amounts to non-dimensionalization and unifies data of different dimensions into the same reference frame for combination or comparison, avoiding evaluation bias caused by differences in scale, specifically:

normal_heading(vi, wj) = heading(vi, wj) / Σi Σj heading(vi, wj)

dist(vi, wj) and vel(vi, wj) undergo the same normalization operation.
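The scoring of the discretized window can be sketched as follows; treating σ() as division by the sum of each sub-function over the window is an assumption consistent with the normalization described above:

```python
import numpy as np

# Sketch of the evaluation function G(v, w) over the discretized window;
# heading_scores, dist_scores and vel_scores are the raw sub-function values
# for every (v_i, w_j) pair (illustrative inputs).
def evaluate_window(heading_scores, dist_scores, vel_scores, alpha, eta, gamma):
    def sigma(x):                 # normalization: divide each value by the sum
        s = np.sum(x)
        return x / s if s > 0 else np.zeros_like(x)
    g = (alpha * sigma(heading_scores)
         + eta * sigma(dist_scores)
         + gamma * sigma(vel_scores))
    return int(np.argmax(g))      # index of the highest-scoring (v, w) pair
```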

The unmanned vehicle obtains the simulated trajectory according to a uniform-velocity motion model. Under the assumptions of this model, the linear velocity and angular velocity of the vehicle remain constant and the change in the direction of the linear velocity is linear in time; to simplify the model and speed up the computation, the velocity direction can be regarded as constant within a sufficiently small time interval, so the uniform-velocity motion model can be discretized. Here xt and yt denote the horizontal and vertical coordinates of the intelligent vehicle at time t, ψt denotes the heading angle at time t, and vt and wt denote the velocity and angular velocity at time t, as in the following formulas:

x_{t+1} = x_t + v_t·cos(ψ_t)·Δt
y_{t+1} = y_t + v_t·sin(ψ_t)·Δt
ψ_{t+1} = ψ_t + w_t·Δt
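A sketch of rolling out one simulated trajectory with this discretized model (function and variable names are illustrative):

```python
import math

# Sketch of the discretized uniform-velocity motion model used to roll out
# the simulated trajectory of one (v, w) pair over m sampling steps.
def simulate_trajectory(x, y, psi, v, w, dt, m):
    traj = [(x, y)]
    for _ in range(m):
        x += v * math.cos(psi) * dt    # x_{t+1} = x_t + v_t*cos(psi_t)*dt
        y += v * math.sin(psi) * dt    # y_{t+1} = y_t + v_t*sin(psi_t)*dt
        psi += w * dt                  # psi_{t+1} = psi_t + w_t*dt
        traj.append((x, y))
    return traj
```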

Step 3. Establish a proximal policy optimization (PPO) learning model based on the actor-critic framework (as shown in Figure 5), simulate the actual application scenario of the unmanned vehicle as the learning environment of the model, and determine the states and actions of the model according to the application scenario.

The proximal policy optimization (PPO) algorithm adds a DKL(p||q) penalty term to the objective function J(θ1), specifically:

J(θ1) = E[ (π(at|st; θ1) / π(at|st; θ2)) · Ut ]

J_PPO(θ1) = J(θ1) − β·DKL(p||q)

In the formula, J(θ1) is the importance-sampling-based policy-learning objective obtained with respect to the parameter θ1; θ is the parameter of the policy π, and the better the policy, the larger the objective J(θ1); γ is the parameter introduced by the Monte Carlo approximation; Ut is the term in the policy gradient; π(at|st; θ1) is the target policy and π(at|st; θ2) the behavior policy; E is the mathematical expectation over the policy network; β is a hyperparameter. The larger the difference between the distributions q and p, the larger the DKL(p||q) term and the larger the penalty on J_PPO(θ1); conversely, the smaller DKL(p||q), the smaller the penalty. Since the goal of reinforcement learning is to maximize J_PPO(θ1), the penalized objective keeps the behavior policy and the target policy within a certain similarity range.
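A minimal sketch of this KL-penalized objective, assuming PyTorch and a sample-based estimate of DKL between the behavior and target policies:

```python
import torch

# Sketch of the KL-penalized PPO objective J_PPO(θ1) = J(θ1) − β·D_KL(p||q).
# log_prob_new/log_prob_old are log π(a_t|s_t; θ1) and log π(a_t|s_t; θ2);
# u_t is the estimate of U_t; beta is the penalty coefficient (all assumed inputs).
def ppo_penalty_objective(log_prob_new, log_prob_old, u_t, beta):
    ratio = torch.exp(log_prob_new - log_prob_old)   # importance-sampling ratio
    surrogate = (ratio * u_t).mean()                 # J(θ1)
    kl = (log_prob_old - log_prob_new).mean()        # sample estimate of D_KL
    return surrogate - beta * kl                     # maximize this quantity
```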

The unmanned vehicle searches for the optimal path connecting the start point and the goal in an obstacle environment, so the learning environment of the model, i.e. the actual application scenario of the unmanned vehicle, is the obstacle map.

The state s in the model is the environment information perceived by the vehicle's sensors and may also include the vehicle's own position and motion-state information. The state s is the only source of information for the vehicle's action decisions and an important basis for maximizing the return, so its quality directly affects whether the algorithm converges, how fast it converges, and the final performance. The state s can be understood as a high-dimensional vector of surrounding environment information. The ultimate goal of the vehicle is to reach the goal along the optimal path, so the position and state of the vehicle, the distribution of surrounding obstacles and the position of the target point are the core basis for its decisions. To better fit the actual application scenario, the information returned by a lidar scanning a full revolution at a scanning interval of 2 degrees is taken as the main part of the state s; in addition, the state s includes the vehicle velocity vt, angular velocity wt, heading angle ψt and the current target-point position (xg_t, yg_t), as shown in Figure 6. The specific method is to use the policy-network output to replace the fixed weights of the evaluation function, constructing an adaptive evaluation function; the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as:

a = [μ1, σ1, μ2, σ2, μ3, σ3]

where [μ1, σ1] are the mean and variance used to describe the probability density function of the weight α:

f(α) = (1 / (√(2π)·σ1)) · exp(−(α − μ1)² / (2σ1²))

Similarly, [μ2, σ2] are the mean and variance describing the probability density function of the weight η, and [μ3, σ3] the mean and variance describing the probability density function of the weight γ. The weights (α, η, γ) are then determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] through the Tanh function.
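A sketch of how the three weights could be drawn from the Gaussian policy output and squashed by Tanh (PyTorch assumed; the softplus used to keep the variances positive is an added assumption):

```python
import torch

# Sketch of sampling the weights (alpha, eta, gamma) from the Gaussian policy
# output a = [mu1, sigma1, mu2, sigma2, mu3, sigma3] and mapping them into
# [-1, 1] with Tanh.
def sample_weights(policy_output):
    mu = policy_output[0::2]                                   # [mu1, mu2, mu3]
    sigma = torch.nn.functional.softplus(policy_output[1::2])  # keep sigma > 0
    dist = torch.distributions.Normal(mu, sigma)
    raw = dist.sample()                    # random draw for (alpha, eta, gamma)
    weights = torch.tanh(raw)              # map into [-1, 1]
    log_prob = dist.log_prob(raw).sum()    # needed later for the PPO ratio
    return weights, log_prob
```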

Once the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network π(a|s; θ) and the value network q(s, a; w) are also determined. The structures of the policy network and the value network are shown in Figures 7 and 8.
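A sketch of possible network shapes consistent with the state and action defined above; the 185-dimensional state (180 lidar returns at 2° spacing plus vt, wt, the heading angle and the target-point coordinates) and the hidden sizes are assumptions, since the patent leaves the exact sizes to Figures 7 and 8:

```python
import torch.nn as nn

# Sketch of possible pi(a|s; theta) and q(s, a; w) architectures; the state
# size and hidden sizes below are illustrative assumptions.
STATE_DIM, ACTION_DIM, HIDDEN = 185, 6, 128

policy_net = nn.Sequential(                 # outputs [mu1, sigma1, mu2, sigma2, mu3, sigma3]
    nn.Linear(STATE_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, ACTION_DIM),
)

value_net = nn.Sequential(                  # scores a state-action pair
    nn.Linear(STATE_DIM + ACTION_DIM, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
    nn.Linear(HIDDEN, 1),
)
```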

Step 4. Construct the DWA-PPO deep reinforcement learning model, define the reward function including the main-line reward and the sub-goal rewards, and determine the model parameters including the input and output layer sizes and the numbers of hidden layers and neurons, completing the instantiation of the DWA-PPO deep reinforcement learning model.

The reward function is the core of the learning model. According to whether a reward is obtained by triggering a main-line event, the rewards obtained by the unmanned vehicle are divided into the main-line reward and the sub-goal rewards:

Main-line reward: the main-line reward can be understood as the settlement reward the agent receives on reaching a terminal state; here it comprises the reward Rmain_goal obtained when the vehicle navigates to the goal, the penalty Rmain_out when the maximum number of iteration steps is exceeded, and the penalty Rmain_coll when the vehicle collides with an obstacle.

Sub-goal rewards: rewards other than the main-line reward are called auxiliary rewards, whose main form is the sub-goal reward. Combining the actual application scenario of navigation planning in an obstacle environment, the influence of factors such as local key points, the environment state, the vehicle's motion state and the relative relation between the vehicle and the target point on the main-line task of finding the optimal path is analyzed, and the following sub-goal rewards are given:

(1) Energy penalty reward Rsub_step: Rsub_step both limits the vehicle's own energy consumption and encourages it to find the optimal path. Et is the energy consumed during the t-th step, in which the vehicle travels for Δt at speed vt; after normalization, Rsub_step is defined as the corresponding penalty term.

(2) Distance-change reward Rsub_dis: during planning the vehicle may locally move away from the goal to avoid obstacles, but globally it must approach the goal. A reward related to the distance between the vehicle position and the target point is therefore defined; Rsub_dis is a positive reward, and the larger the distance moved toward the goal, the larger Rsub_dis.

(3) Obstacle-distance reward Rsub_obs: rt_obs is defined for the case in which no obstacle lies within the vehicle's safety distance when it brakes at maximum deceleration; avoiding collision during planning is the first prerequisite for driving safety. After normalization, Rsub_obs is defined from rt_obs.

(4) Azimuth reward Rsub_head: the goal of the vehicle is to reach the end point, so during navigation the more the vehicle heads toward the target point, the better its heading angle. rhead is defined so that a positive reward is obtained only when the vehicle heading is very close to the optimal azimuth; after normalization, Rsub_head is defined from rhead.

In summary, the reward Rt at the t-th step of the vehicle combines the main-line reward with the sub-goal rewards scaled by a sub-goal reward adjustment factor.
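A sketch of how the step reward could combine these terms; the additive form and the adjustment factor `xi` are assumptions, since the exact formula is given only in the patent figures:

```python
# Sketch of the per-step reward R_t combining the main-line reward with the
# sub-goal rewards; the additive form and the factor xi are assumptions.
def step_reward(r_main, r_sub_step, r_sub_dis, r_sub_obs, r_sub_head, xi=0.1):
    # r_main is nonzero only on terminal events: goal reached, step limit, collision
    r_sub = r_sub_step + r_sub_dis + r_sub_obs + r_sub_head
    return r_main + xi * r_sub
```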

The actor-critic (AC) framework constructs a value network to approximate the action value in the policy gradient, so the network architecture includes at least a value network and a policy network. The value network loss function is:

L(w) = ½·E[(q(st, at; w) − yt)²]

The learning target of the value network is:

yt = rt + γ·q(st+1, at+1; w)

It can be seen that the learning target includes part of the network's own prediction. If the value network itself overestimates the action value Q(s,a), this way of learning from itself keeps amplifying the overestimation; moreover the overestimation is non-uniform and seriously affects training. This phenomenon is called bootstrapping. To prevent the value network from bootstrapping, a separate parameter set w− is used to construct a target value network qT(s, a; w−), whose parameter structure is identical to that of the value network but whose values differ; it is used to compute the TD error:

δt = q(st, at; w) − (rt + γ·qT(st+1, at+1; w−))

The initial parameters of the target value network are identical to those of the value network; μ is a parameter that ensures the coefficients sum to 1, and subsequent updates follow:

w−_new ← μ·w_new + (1 − μ)·w−_now
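A sketch of this soft update of the target value network, assuming PyTorch and an illustrative value of μ:

```python
import torch

# Sketch of the soft update w-_new <- mu*w_new + (1 - mu)*w-_now of the
# target value network; the value of mu is illustrative.
@torch.no_grad()
def soft_update(value_net, target_value_net, mu=0.05):
    for w, w_tgt in zip(value_net.parameters(), target_value_net.parameters()):
        w_tgt.mul_(1.0 - mu).add_(mu * w)
```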

In summary, the network architecture under the DWA-PPO reinforcement learning model comprises three parts: the policy network π(a|s; θ), the value network q(s, a; w) and the target value network qT(s, a; w−). The DWA-PPO reinforcement learning model is shown in Figure 9.

The constructed model comprises the agent, the environment, the critic module and the actor module. The critic module comprises the value-network error function L(w), the value network q(s, a; w) and the target value network qT(s, a; w−). The actor module comprises the target network π(a|s; θ1), the behavior network π(a|s; θ2) and the policy-network objective function J_PPO(θ1). The first stage of training is the collection of the training set, as indicated by the black lines in the figure: at the initial moment of round 0, the vehicle observes the state s0 from the environment using its perception and positioning system; after receiving s0, the behavior network π(a|s; θ2) outputs a Gaussian distribution π(A0|s0; θ2) over the action A0, from which the action a0 is randomly drawn and passed to the intelligent vehicle, yielding the evaluation function G0(v, w) of the DWA algorithm at the initial moment; the evaluation of the simulated trajectory set of the DWA algorithm at the initial moment is completed, and the velocity and angular velocity command of the optimal trajectory is passed to the motion control module to drive the vehicle. At this point the vehicle's position, heading angle and the distribution of surrounding obstacles have changed, the environment transitions to state s1, and the reward function feeds the reward r0 back to the critic module according to the changed information. If s1 is not the terminal state sn, the round proceeds to the next time step; otherwise the map and the vehicle state are reset and trajectory collection continues with the next round, until i rounds have been collected, finally yielding the training set:

χ = [χ0, χ1, …, χi]

χ0 = [s0^0, a0^0, r0^0, …, sn−1^0, an−1^0, rn−1^0, sn^0]

where the superscript denotes the round index.
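A sketch of the training-set collection loop, reusing the illustrative `env`, `behavior_net` and `sample_weights` components from the earlier examples:

```python
import torch

# Sketch of training-set collection; env, behavior_net and sample_weights are
# the illustrative components defined in the earlier sketches.
def collect_training_set(env, behavior_net, num_rounds, max_steps):
    dataset = []                                   # chi = [chi_0, chi_1, ..., chi_i]
    for _ in range(num_rounds):
        episode, s = [], env.reset()               # s_0 of this round
        for _ in range(max_steps):
            out = behavior_net(torch.as_tensor(s, dtype=torch.float32))
            a, logp = sample_weights(out)          # (alpha, eta, gamma) via Gaussian + Tanh
            s_next, r, done = env.step(a.numpy())  # DWA drives the vehicle for dt
            episode.append((s, a, r, logp))
            s = s_next
            if done:                               # goal, collision or step limit
                break
        episode.append(s)                          # terminal state s_n
        dataset.append(episode)
    return dataset
```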

Step 5. Construct the PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments to collect the training set for training the network model, and through repeated iteration converge to model parameters that output the corresponding weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

After the training set is obtained, the value network q(s, a; w) is updated by backpropagating the error function L(w), and π(a|s; θ1) is updated by backpropagating the error function J_PPO(θ1) of the PPO algorithm. Let the current parameters of the networks q(s, a; w), qT(s, a; w−) and π(a|s; θ1) be w_now, w−_now and θ_now respectively; the following steps are repeated Z times to complete one generation of updates:

(1) Randomly draw M_I (the minibatch size) states s_N^I from the shuffled training set χ.

(2) Use qT(s, a; w−) to compute the K-step TD error MTD_N^I starting from the state s_N^I.

(3) Use the value network q(s, a; w) to compute the action-value estimate at the state s_N^I:

q_N^I = Q(s_N^I, a_N^I; w_now)

(4) Compute L(w).

(5) Compute J_PPO(θ1).

(6) Update the value network, the policy network and the target value network.

Assume the parameter before the update is θ_now and the parameter obtained after the importance-sampling update is θ_new; assume the parameter before the update is w_now and the parameter obtained after the value-network update is w_new; and assume w−_now is the parameter introduced before the update to prevent bootstrapping of the value network, which, with μ as the parameter ensuring the coefficients sum to 1, becomes w−_new after the update. After Z updates, the parameters of the target network π(a|s; θ1) are assigned to the behavior network π(a|s; θ2), which is recorded as one generation of updates; the training set is then emptied and the next generation of updates begins, until the model converges.
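A sketch of one generation of updates tying the earlier sketches together; the optimizers, the advantage estimate and the `gaussian_log_prob` helper are assumptions, not the patent's exact procedure:

```python
import random
import torch

# Sketch of one generation of updates (the inner steps repeated Z times).
# transitions: list of (s, a, td_target, old_logp) with precomputed TD targets
# built from q_T(s, a; w-); the helpers ppo_penalty_objective and soft_update
# come from the earlier sketches, gaussian_log_prob is an assumed helper.
def update_one_generation(transitions, policy_net, value_net, target_value_net,
                          pol_opt, val_opt, Z, M_I, beta, mu):
    for _ in range(Z):
        batch = random.sample(transitions, M_I)
        s, a, td_target, old_logp = (torch.stack(x) for x in zip(*batch))
        old_logp = old_logp.detach()

        # critic: regress q(s, a; w) toward the target built with q_T(s, a; w-)
        q = value_net(torch.cat([s, a], dim=-1)).squeeze(-1)
        val_loss = torch.mean((q - td_target) ** 2)
        val_opt.zero_grad(); val_loss.backward(); val_opt.step()

        # actor: maximize the KL-penalized PPO objective (minimize its negative)
        new_logp = gaussian_log_prob(policy_net(s), a)   # assumed helper
        adv = td_target - q.detach()                     # assumed advantage estimate
        loss = -ppo_penalty_objective(new_logp, old_logp, adv, beta)
        pol_opt.zero_grad(); loss.backward(); pol_opt.step()

        soft_update(value_net, target_value_net, mu)     # update w-
```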

Figure 10 shows the per-generation average score and arrival rate of the unmanned vehicle in the deep reinforcement learning environment during network training. As the model iterates and converges, the network model gradually learns a parameter network that can correctly guide the path planning of the unmanned vehicle, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6. Demonstrate the adaptive adjustment capability of PPO-ADWA-based unmanned-vehicle path planning through comparative simulation experiments.

To verify the self-adjustment capability of unmanned-vehicle path planning based on the PPO-ADWA algorithm, its robustness is verified in randomly generated complex static obstacle environments. The simulation environment is shown in Figure 11: the map is 60 m × 60 m, the green dot is the start point, the blue five-pointed star is the goal, and the black geometric shapes represent obstacles, whose shapes include regular polygons and circles and whose size and number are randomly generated within a certain range. The performance over 100 maps with different obstacle positions is summarized in Table 1.

Table 1. Comparison of simulation results

The arrival rate of the unmanned-vehicle path planning results under PPO-ADWA is 84%, 6 percentage points higher than under the classic DWA; the average path length is 93.04 m, a 5.00% improvement in path efficiency; the average number of steps is 251.95, a 4.85% reduction in the average step cost. The classic DWA planning results are shown in Figure 12 and the PPO-ADWA results in Figure 13. During path planning with the PPO-ADWA fusion strategy, the change curves of the weight parameters are shown in Figure 14; it can be seen that the weight parameters generally maintain the numerical relation η > γ > α.

The above are preferred embodiments of the present invention; any change made according to the technical solution of the present invention, so long as the functional effect produced does not exceed the scope of the technical solution of the present invention, falls within the protection scope of the present invention.

Claims (6)

1. An adaptive path planning method for unmanned vehicles based on the dynamic window method and the proximal policy, characterized by comprising the following steps:
Step 1, constructing an agent-environment interaction model, wherein the unmanned vehicle serves as the agent in deep reinforcement learning and an obstacle map serves as the environment;
Step 2, establishing a DWA algorithm model, and determining, according to the Ackermann intelligent vehicle, the velocity range, angular velocity range, acceleration range and angular acceleration range parameters, as well as the main elements and the evaluation function of the DWA algorithm;
Step 3, establishing a proximal policy optimization (PPO) learning model based on the actor-critic framework, simulating the actual application scenario of the unmanned vehicle as the learning environment of the model, and determining the states and actions of the model according to the application scenario;
Step 4, constructing a DWA-PPO deep reinforcement learning model, defining a reward function comprising a main-line reward and sub-goal rewards, and determining the parameters of the DWA-PPO deep reinforcement learning model, including the input layer and output layer sizes and the numbers of hidden layers and neurons, to complete the instantiation of the DWA-PPO deep reinforcement learning model;
Step 5, constructing an adaptive PPO-ADWA algorithm, namely using the established DWA-PPO deep reinforcement learning model to simulate the navigation planning of the unmanned vehicle in randomly generated complex static obstacle environments so as to collect a training set for training the DWA-PPO deep reinforcement learning model, and converging through repeated iteration to a model that outputs the corresponding weight parameters according to changes in the distribution of surrounding obstacles, to complete the construction of the adaptive PPO-ADWA algorithm;
Step 6, demonstrating the adaptive adjustment capability of unmanned-vehicle path planning based on the adaptive PPO-ADWA algorithm through comparative simulation experiments.
2. The adaptive path planning method for unmanned vehicles based on the dynamic window method and the proximal policy according to claim 1, characterized in that Step 1 is specifically implemented as follows:
the agent is responsible for outputting actions and receiving rewards and states, and the environment is the object with which the agent interacts; the interaction process comprises the following three steps:
(1) the agent observes information Ot ∈ O from the environment state St ∈ S, where S is the state space, the set of possible environment states, and O is the observation space, the set of possible agent observations;
(2) based on the known Ot, the agent makes a decision and determines the action At ∈ A to be applied to the environment, where A is the set of possible actions;
(3) affected by At, the environment transitions from its state St to St+1 and gives the agent a reward Rt ∈ R, where R is the set of possible rewards; the discretized agent-environment interaction model is thus represented by the following sequence:
S0, O0, A0, R0, S1, O1, A1, R1, S2, O2, A2, R2, …, ST = S_terminal
when the state of the environment is fully observable by the agent, St = Ot, and the sequence simplifies to:
S0, A0, R0, S1, A1, R1, S2, A2, R2, …, ST = S_terminal.
3. The adaptive path planning method for unmanned vehicles based on the dynamic window method and the proximal policy according to claim 2, characterized in that Step 2 is specifically implemented as follows:
the DWA algorithm is a local path planning method that interprets the map environment of the unmanned vehicle intuitively from the velocity-space perspective, and its workflow is as follows: considering the constraints imposed on velocity and angular velocity at time t, obtain the velocity-angular velocity window Vwin reachable by the unmanned vehicle at time t; discretize the window and enumerate the combinations of discretized velocity and angular velocity; for every combination, simulate the unmanned vehicle moving forward for m intervals of length Δt according to the given motion model to obtain the simulated trajectory set τ, i.e. a series of point sets; the evaluation function gives the scores of all simulated trajectories in the set τ, and the combination corresponding to the highest-scoring trajectory τb is selected; the unmanned vehicle is driven with this combination for a duration Δt, reaching time t+1; the cycle is repeated until the goal is reached, where m is the number of sampling steps and Δt the sampling interval;
at time t, the velocity-angular velocity window Vwin of the unmanned vehicle is constrained by its own hardware and the surrounding environment, and the following three constraints are considered:
(1) limit velocity and angular velocity constraint:
Vlim = {(v,w) | v ∈ [vmin, vmax] ∧ w ∈ [wmin, wmax]}
(2) velocity and angular velocity constraint imposed by the acceleration limits:
Vacc = {(v,w) | v ∈ [vcu − v̇max·Δt, vcu + v̇max·Δt] ∧ w ∈ [wcu − ẇmax·Δt, wcu + ẇmax·Δt]}
(3) velocity and angular velocity constraint imposed by the braking distance:
Vdis = {(v,w) | v ≤ √(2·dist(v,w)·v̇max) ∧ w ≤ √(2·dist(v,w)·ẇmax)}
above, vmin and vmax are the limit linear velocities, wmin and wmax the limit angular velocities, vcu and wcu the current linear and angular velocity, v̇max the limit linear acceleration and ẇmax the limit angular acceleration, and dist(v,w) is the closest distance between the simulated trajectory corresponding to the velocity pair (v,w) and the obstacles; the window Vwin of the unmanned vehicle at time t is finally expressed as:
Vwin = Vlim ∩ Vacc ∩ Vdis
the evaluation function comprises three sub-functions and jointly considers the driving speed of the unmanned vehicle, the risk of collision with obstacles and the vehicle heading, as follows:
G(v,w) = σ(α·heading(v,w) + η·dist(v,w) + γ·vel(v,w))
where heading(v,w) evaluates how closely the vehicle heading angle ψ is aligned with δ, the angle between the line connecting the unmanned vehicle to the target point and the positive direction of the x-axis; dist(v,w) is the Euclidean distance from the simulated trajectory to the nearest obstacle, vel(v,w) represents the linear velocity of the unmanned vehicle, and α, η and γ are three weight coefficients; the evaluation function is composed of sub-functions of different dimensions, and the normalization function σ() in the formula amounts to non-dimensionalization, unifying data of different dimensions into the same reference frame for combination or comparison and avoiding evaluation bias caused by differences in scale, specifically:
normal_heading(vi, wj) = heading(vi, wj) / Σi Σj heading(vi, wj)
dist(vi, wj) and vel(vi, wj) undergo the same normalization operation;
the unmanned vehicle obtains the simulated trajectory according to a uniform-velocity motion model; under the assumptions of this model the linear velocity and angular velocity of the vehicle remain constant and the change in the direction of the linear velocity is linear in time; to simplify the model and speed up the computation, the velocity direction is regarded as constant within a sufficiently small time interval, so the uniform-velocity motion model is discretized; xt and yt denote the horizontal and vertical coordinates of the intelligent vehicle at time t, ψt denotes the heading angle at time t, and vt and wt denote the velocity and angular velocity at time t, as in the following formulas:
x_{t+1} = x_t + v_t·cos(ψ_t)·Δt
y_{t+1} = y_t + v_t·sin(ψ_t)·Δt
ψ_{t+1} = ψ_t + w_t·Δt
4. The adaptive path planning method for unmanned vehicles based on the dynamic window method and the proximal policy according to claim 3, characterized in that Step 3 is specifically implemented as follows:
the proximal policy optimization algorithm adds a DKL(p||q) penalty term to the objective function J(θ1), specifically:
J(θ1) = E[ (π(at|st; θ1) / π(at|st; θ2)) · Ut ]
J_PPO(θ1) = J(θ1) − β·DKL(p||q)
where J(θ1) is the importance-sampling-based policy-learning objective obtained with respect to the parameter θ1; θ is the parameter of the policy π, and the better the policy, the larger the objective J(θ1); γ is the parameter introduced by the Monte Carlo approximation; Ut is the term in the policy gradient; π(at|st; θ1) is the target policy and π(at|st; θ2) the behavior policy; E is the mathematical expectation over the policy network; β is a hyperparameter; the larger the difference between the distributions q and p, the larger the DKL(p||q) term and the larger the penalty on J_PPO(θ1), and conversely the smaller DKL(p||q), the smaller the penalty; since the goal of reinforcement learning is to maximize J_PPO(θ1), the penalized objective keeps the behavior policy and the target policy within a preset similarity range;
the unmanned vehicle searches for the optimal path connecting the start point and the goal in an obstacle environment, so the learning environment of the model, i.e. the actual application scenario of the unmanned vehicle, is the obstacle map;
the state s in the model is the environment information perceived by the vehicle's sensors and comprises the vehicle's own position and motion-state information; the information reflected by the lidar scanning a full revolution at a scanning interval of 2 degrees forms the main part of the state s, and the state s further comprises the vehicle velocity vt, angular velocity wt, heading angle ψt and the current target-point position (xg_t, yg_t); the specific method is to use the policy-network output to replace the fixed weights of the evaluation function and construct an adaptive evaluation function; the action a corresponds to the weights (α, η, γ) in the evaluation function, so the action a is defined as:
a = [μ1, σ1, μ2, σ2, μ3, σ3]
where [μ1, σ1] are the mean and variance used to describe the probability density function of the weight α; similarly, [μ2, σ2] are the mean and variance describing the probability density function of the weight η, and [μ3, σ3] the mean and variance describing the probability density function of the weight γ; the weights (α, η, γ) are then determined by random sampling according to their respective probability density functions, and the actions are mapped into the interval [-1, 1] through the Tanh function;
after the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network and the value network are also determined.
5. The unmanned vehicle self-adaptive path planning method based on the dynamic window method and the near-end strategy according to claim 4, wherein the fourth implementation is as follows:
the rewarding function is to learn core content in a DWA-PPO deep reinforcement learning model, and rewards obtained by unmanned vehicles are divided into main line rewards and sub-target rewards according to whether the rewards obtained by triggering main line events are:
main line rewards: the main line rewards are settlement rewards for the agent reaching the ending state, namely rewards R obtained by the navigation of the unmanned vehicle to the ending point mian goal Punishment rewards R when the maximum number of iterative steps is exceeded mian out And punishment rewards R when unmanned vehicle collides with obstacle mian coll
Sub-target rewards: rewards other than the main rewards are called auxiliary rewards, and the main form of the rewards is sub-target rewards; and analyzing the influence of factors such as local key points, environment states, movement states of the unmanned aerial vehicle, relative relation between the unmanned aerial vehicle and target points and the like on a main line task of the unmanned aerial vehicle for finding an optimal path by combining with the actual application scene of navigation planning of the unmanned aerial vehicle in an obstacle environment, and giving the following sub-target rewards:
(1) Energy penalty prize R sub step :R sub step On one hand, the energy consumption of the unmanned vehicle can be limited, and meanwhile, the unmanned vehicle can be promoted to find an optimal path; e (E) t At speed v for the t step unmanned vehicle t Running delta t The energy consumed by the process, normalized, defines R sub step The method comprises the following steps:
(2) Distance change reward R sub dis : defining a prize related to the position-target point distance of the unmanned vehicle, R sub dis Should be a positive prize, and R is the greater the distance moved in the direction of the end point sub dis The larger;
(3) Obstacle distance rewards R sub obs :r t obs Defining that when no obstacle exists in the safety distance of the unmanned vehicle and the unmanned vehicle brakes at the maximum deceleration, the unmanned vehicle does not collide in the planning process, which is the primary premise of ensuring the driving safety, and defining R after normalization sub obs The method comprises the following steps:
(4) Azimuth prize R sub head : the target of the unmanned vehicle reaches the destination, so that the more the unmanned vehicle is considered to face the destination in navigation, the better the heading angle of the unmanned vehicle is; r is (r) head The method is defined as that forward rewards are obtained when the heading of the unmanned vehicle is very close to the optimal azimuth angle, and R is defined after normalization sub head The method comprises the following steps:
rewards R at t-th step of unmanned vehicle t Is a compound of the formula (I),awarding the adjustment factor for the sub-target;
the network architecture under the DWA-PPO reinforcement learning model at least comprises a value network and a strategy network architecture; according to the value network loss function:
the learning targets of the value network are:
learning objectives for value networks include a portion of predictions themselvesTo prevent bootstrap phenomenon of value network, w is used - Constructing a target value network q T (s,a;w - ) The parameter structure of the target value network is consistent with the value network but different in specific value, and is used for calculating TD error:
the initial parameters of the target value network are consistent with the value network, mu is a parameter, the sum of coefficients is ensured to be 1, and the following updating reference is carried out:
w - new ←μw new +(1-μ)w - now
the network architecture under the DWA-PPO reinforcement learning model includes three major parts: policy network pi (a|s; θ), value network q (s, a; w) and target value network q T (s,a;w);
The DWA-PPO reinforcement learning model is built in a comprehensive way and comprises an agent, an environment, a commentator module and an actor module; the commentator module comprises a value network error function L (w), a value network q (s, a; w) and a target value network q T (s,a;w - ) The method comprises the steps of carrying out a first treatment on the surface of the Actor modules include a target network pi (a|s; θ 1 ) A behavioral network pi (a|s; θ 2 ) Policy network objective functionThe training beginning stage is collection of training sets: the 0 th round of initial time unmanned vehicle observes state s from the environment using the sensing and positioning system 0 Behavior network pi (a|s; θ 2 ) Receiving s 0 Post-outputting a message about action A 0 Is of Gaussian distribution pi (A) 0 |s 0 ;θ 1 ) Then randomly extracting the determination action a from the probability distribution 0 Transmitting to an intelligent vehicle to obtain an evaluation function G of the DWA algorithm at the initial moment 0 (v, w) completing evaluation of a simulated track set of the DWA algorithm at the initial moment, and transmitting a speed angular speed instruction of the optimal track to an unmanned vehicle motion control module to drive the unmanned vehicle to move; to this end, information including the position, the angle of orientation, the surrounding obstacle distribution of the drone is changed, the environment is switched to state s 1 The rewarding function also feeds back the rewarding r to the commentator module according to the changed information 0 The method comprises the steps of carrying out a first treatment on the surface of the When s is 1 Not in the termination state s n And (3) entering the next time in the round, otherwise resetting the map and the unmanned vehicle state, and collecting the track of the next round until the round i is collected, so as to finally obtain a training set:
χ = [χ_0, χ_1, …, χ_i]
χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0].
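A minimal sketch of the training-set collection just described, reusing the behavior network from the sketch above and assuming a hypothetical env wrapper whose reset()/step() methods run the DWA simulated-trajectory evaluation and vehicle motion; each round's list of (s, a, r) tuples corresponds to one χ_k.

import torch

def collect_training_set(env, behavior, num_rounds, max_steps):
    # Collect chi = [chi_0, ..., chi_i]; each chi_k is one round's trajectory
    # [s_0, a_0, r_0, ..., s_{n-1}, a_{n-1}, r_{n-1}, s_n].
    # `env` is a hypothetical wrapper: reset() returns s_0, and step(a) evaluates
    # the DWA simulated-trajectory set with weights a, drives the vehicle, and
    # returns (s_next, r, done).
    dataset = []
    for _ in range(num_rounds):
        trajectory, s = [], env.reset()
        for _ in range(max_steps):
            with torch.no_grad():
                dist = behavior.distribution(torch.as_tensor(s, dtype=torch.float32))
                a = dist.sample()                  # random draw from pi(A|s; theta_2)
            s_next, r, done = env.step(a.numpy())  # DWA executes the best trajectory
            trajectory.append((s, a.numpy(), r))
            s = s_next
            if done:                               # termination state s_n reached
                break
        trajectory.append(s)                       # keep the final state s_n
        dataset.append(trajectory)
    return dataset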
6. The adaptive path planning method for unmanned vehicles based on the dynamic window method and the proximal strategy according to claim 5, wherein the fifth step is specifically implemented as follows:
After the training set is obtained, the value network q(s,a;w) is updated by back-propagating the error function L(w), and the target network π(a|s;θ_1) is updated by back-propagating the error function of the proximal policy optimization algorithm. Let the current parameters of q(s,a;w), q_T(s,a;w^-) and π(a|s;θ_1) be w_now, w^-_now and θ_now respectively; the following steps are repeated Z times to complete one generation of updating:
(1) Randomly extract a mini-batch of M^I states s_N^I from the shuffled training set χ;
(2) Use q_T(s,a;w^-) to calculate the k-step TD error MTD_N^I with state s_N^I as the starting point;
(3) Use the value network q(s,a;w) to calculate the action-value estimate at state s_N^I:
q_N^I = q(s_N^I, a_N^I; w_now)
(4) Calculate L(w):
(5) Calculate the policy network objective function:
(6) Update the value network, the policy network, and the target value network:
w^-_new ← (1-μ)·w_new + μ·w^-_now
Here the pre-update parameter θ_now yields the parameter θ_new after the importance-sampling update; the pre-update parameter w_now yields the parameter w_new after the value-network learning update; and the parameter w^-_now, introduced before the update to prevent the bootstrapping phenomenon of the value network, yields w^-_new after the update in which μ and 1-μ are the coefficients summing to 1. After Z such updates, the parameters of the target network π(a|s;θ_1) are assigned to the behavior network π(a|s;θ_2); this is recorded as one generation of updating, after which the training set is emptied and the next generation of updating begins, until the model converges;
The per-generation average score and arrival-rate curves of the unmanned vehicle in the deep reinforcement learning environment show that, as the model iterates and converges, a parameter network capable of correctly guiding unmanned-vehicle path planning is gradually learned; this completes the construction of the adaptive PPO-ADWA algorithm.
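A hedged Python sketch of one generation of updating (steps (1) to (6) above), reusing policy, behavior, q_net and soft_update from the sketches above. It stands in for the patent's exact loss formulas, which appear only as figures: L(w) is taken as a mean-squared error against a precomputed k-step TD target, the policy objective is the standard PPO clipped surrogate with importance ratio, and after Z repetitions θ_1 is copied into θ_2. The mini-batch preparation (TD targets, old log-probabilities and advantages from χ), the clip range eps and the learning rate lr are assumptions.

import torch

eps, lr = 0.2, 3e-4                               # clip range and learning rate (assumed)
opt_pi = torch.optim.Adam(policy.parameters(), lr=lr)
opt_q = torch.optim.Adam(q_net.parameters(), lr=lr)

def one_generation(minibatches, Z):
    # Repeat the mini-batch update Z times, then copy theta_1 into theta_2.
    # Each mini-batch holds tensors sampled from the shuffled training set:
    # states s, actions a, k-step TD targets, old log-probabilities, advantages.
    for _ in range(Z):
        for s, a, td_target, old_log_prob, advantage in minibatches:
            # (3)-(4): action-value estimate and value-network error L(w)
            q_pred = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
            loss_w = torch.mean((td_target - q_pred) ** 2)
            opt_q.zero_grad(); loss_w.backward(); opt_q.step()

            # (5): clipped PPO surrogate for the target policy pi(a|s; theta_1)
            dist = policy.distribution(s)
            ratio = torch.exp(dist.log_prob(a).sum(-1) - old_log_prob)
            surrogate = torch.min(ratio * advantage,
                                  torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage)
            opt_pi.zero_grad(); (-surrogate.mean()).backward(); opt_pi.step()

            # (6): soft update of the target value network parameters w^-
            soft_update()

    behavior.load_state_dict(policy.state_dict())  # assign theta_1 to theta_2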
CN202310792088.4A 2023-06-30 2023-06-30 Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy Pending CN116679719A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310792088.4A CN116679719A (en) 2023-06-30 2023-06-30 Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy

Publications (1)

Publication Number Publication Date
CN116679719A true CN116679719A (en) 2023-09-01

Family

ID=87782071

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310792088.4A Pending CN116679719A (en) 2023-06-30 2023-06-30 Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy

Country Status (1)

Country Link
CN (1) CN116679719A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117130263B (en) * 2023-10-26 2024-01-16 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117130263A (en) * 2023-10-26 2023-11-28 博创联动科技股份有限公司 Intelligent control method and system for whole vehicle based on big data of Internet of vehicles
CN117826713A (en) * 2023-11-22 2024-04-05 山东科技大学 An improved reinforcement learning AGV path planning method
CN117724478B (en) * 2023-11-27 2024-09-20 上海海事大学 Automatic container terminal AGV path planning method
CN117724478A (en) * 2023-11-27 2024-03-19 上海海事大学 Automatic container terminal AGV path planning method
CN117553800A (en) * 2024-01-04 2024-02-13 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117553800B (en) * 2024-01-04 2024-03-19 深圳市乐骑智能科技有限公司 AGV positioning and path planning method and device
CN117990119B (en) * 2024-01-29 2024-10-15 中山大学·深圳 A hierarchical off-road path planning method based on deep reinforcement learning
CN117990119A (en) * 2024-01-29 2024-05-07 中山大学·深圳 A hierarchical off-road path planning method based on deep reinforcement learning
CN117682429B (en) * 2024-02-01 2024-04-05 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system
CN117682429A (en) * 2024-02-01 2024-03-12 华芯(嘉兴)智能装备有限公司 Crown block carrying instruction scheduling method and device of material control system
CN118372851A (en) * 2024-04-15 2024-07-23 海南大学 Vehicle optimal control method based on deep reinforcement learning
CN118906859A (en) * 2024-09-06 2024-11-08 中氢投电力(北京)有限公司 Automatic parking charging system

Similar Documents

Publication Publication Date Title
CN116679719A (en) Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy
CN112356830B (en) Intelligent parking method based on model reinforcement learning
Zhu et al. Deep reinforcement learning based mobile robot navigation: A review
Zhang et al. Reinforcement learning-based motion planning for automatic parking system
CN110136481B (en) Parking strategy based on deep reinforcement learning
Faust et al. Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning
CN112162555B (en) Vehicle control method based on reinforcement learning control strategy in mixed fleet
CN111580544B (en) A UAV Target Tracking Control Method Based on Reinforcement Learning PPO Algorithm
CN112433525A (en) Mobile robot navigation method based on simulation learning and deep reinforcement learning
Botteghi et al. On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach
CN114020013B (en) Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning
CN113848974B (en) Aircraft trajectory planning method and system based on deep reinforcement learning
CN114741886B (en) Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115542733B (en) Adaptive dynamic window method based on deep reinforcement learning
CN116069023B (en) Multi-unmanned vehicle formation control method and system based on deep reinforcement learning
Gök Dynamic path planning via Dueling Double Deep Q-Network (D3QN) with prioritized experience replay
CN116551703B (en) Motion planning method based on machine learning in complex environment
CN113391633A (en) Urban environment-oriented mobile robot fusion path planning method
CN114089751A (en) A Path Planning Method for Mobile Robots Based on Improved DDPG Algorithm
Sun et al. Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments
CN113959446B (en) Autonomous logistics transportation navigation method for robot based on neural network
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Wang et al. An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle
CN118153431A (en) Underwater multi-agent cooperative trapping method and device based on deep reinforcement learning
CN117908565A (en) Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination