CN116679719A - Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy - Google Patents
- Publication number
- CN116679719A (Application CN202310792088.4A)
- Authority
- CN
- China
- Prior art keywords
- unmanned vehicle
- network
- rewards
- model
- ppo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 53
- 230000003044 adaptive effect Effects 0.000 title description 11
- 230000006870 function Effects 0.000 claims abstract description 64
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 50
- 230000002787 reinforcement Effects 0.000 claims abstract description 39
- 238000011156 evaluation Methods 0.000 claims abstract description 29
- 238000012549 training Methods 0.000 claims abstract description 25
- 238000005457 optimization Methods 0.000 claims abstract description 11
- 238000004088 simulation Methods 0.000 claims abstract description 11
- 238000010276 construction Methods 0.000 claims abstract description 10
- 230000003993 interaction Effects 0.000 claims abstract description 10
- 210000002569 neuron Anatomy 0.000 claims abstract description 7
- 238000002474 experimental method Methods 0.000 claims abstract description 5
- 230000009471 action Effects 0.000 claims description 28
- 230000001133 acceleration Effects 0.000 claims description 14
- 238000009826 distribution Methods 0.000 claims description 14
- 230000008859 change Effects 0.000 claims description 11
- 238000010606 normalization Methods 0.000 claims description 9
- 230000008569 process Effects 0.000 claims description 9
- 238000005070 sampling Methods 0.000 claims description 9
- 230000006399 behavior Effects 0.000 claims description 6
- 230000003068 static effect Effects 0.000 claims description 6
- 230000007613 environmental effect Effects 0.000 claims description 5
- 238000004364 calculation method Methods 0.000 claims description 4
- 238000013459 approach Methods 0.000 claims description 3
- 238000005265 energy consumption Methods 0.000 claims description 2
- 230000003542 behavioural effect Effects 0.000 claims 2
- 230000002452 interceptive effect Effects 0.000 claims 2
- 150000001875 compounds Chemical group 0.000 claims 1
- 230000010354 integration Effects 0.000 claims 1
- 238000013507 mapping Methods 0.000 claims 1
- 230000007704 transition Effects 0.000 claims 1
- 239000003795 chemical substances by application Substances 0.000 description 14
- 238000010586 diagram Methods 0.000 description 5
- 238000005516 engineering process Methods 0.000 description 5
- 238000011161 development Methods 0.000 description 3
- 230000008447 perception Effects 0.000 description 2
- 206010048669 Terminal state Diseases 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 239000003607 modifier Substances 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000001960 triggered effect Effects 0.000 description 1
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
Abstract
Description
Technical Field

The present invention relates to the technical field of unmanned-vehicle path planning and autonomous navigation, and in particular to an adaptive path planning method for unmanned vehicles based on the dynamic window approach and a proximal policy.

Background Art

In recent years, with the rapid development of science and technology, a new round of technological and industrial revolution represented by the Internet, artificial intelligence, and big data is redefining every sector of society, and the traditional automobile industry is facing profound change. Traditional vehicles are evolving toward intelligence and autonomy, and intelligent connected vehicles and self-driving cars have become the strategic direction of the global automotive industry. Intelligent driving technology mainly comprises environment perception, navigation and positioning, path planning, and control and decision-making. Path planning is an important part of intelligent driving and is of great significance to its development.

Path planning is an important component of autonomous intelligent vehicles. It refers to planning a safe, feasible, collision-free path in a known environment by means of an algorithm and selecting the optimal obstacle-avoiding path from the start point to the goal; in essence it is the optimal solution under several constraints, and it is a key part of unmanned navigation technology. Path planning algorithms can be divided into global planning, based on an understanding of the complete environment, and local planning, based on an understanding of the local environment. The Dynamic Window Approach (DWA), a local path planning method that takes the vehicle's kinematic performance into account, is widely used in intelligent-vehicle navigation. The decision-making core of DWA is its evaluation function, which consists of three parts: a heading function, an obstacle function, and a velocity function; the evaluation function is the weighted sum of these three sub-functions. In the classic DWA algorithm the weights of the three sub-functions are fixed. However, while the vehicle explores toward the goal, the surrounding obstacle environment is complex and changeable, and different obstacle distributions call for different weights; the fixed-weight scheme of the classic DWA algorithm easily traps the vehicle in a local optimum or makes the goal unreachable. The classic DWA algorithm is therefore improved with the help of the proximal policy optimization algorithm from deep reinforcement learning.
Summary of the Invention

The purpose of the present invention is to solve the problem that, because the weight coefficients in the evaluation function cannot be adjusted dynamically, the agent often cannot reach the goal or compute the optimal path when facing different obstacle environments. An adaptive path planning method for unmanned vehicles based on the dynamic window approach and a proximal policy is provided: building on the classic DWA algorithm, the weight parameters of its evaluation function are combined with proximal policy optimization from deep reinforcement learning, and through learning and training, model parameters suited to different static obstacle environments are obtained, completing the construction of the adaptive PPO-ADWA algorithm.

To achieve the above object, the technical solution of the present invention is an adaptive path planning method for unmanned vehicles based on the dynamic window approach and a proximal policy, comprising the following steps:

Step 1: Construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

Step 2: Establish the DWA algorithm model and determine, according to the Ackermann vehicle, the velocity range, angular velocity range, acceleration range, and angular acceleration range, as well as the main elements of the DWA algorithm and its evaluation function.

Step 3: Establish the proximal policy optimization (PPO) learning model based on the actor-critic framework, simulate the practical application scenario of the unmanned vehicle as the model's learning environment, and determine the states and actions of the model according to the application scenario.

Step 4: Construct the DWA-PPO deep reinforcement learning model; define a reward function comprising main-line rewards and sub-goal rewards; and determine the model parameters, including the sizes of the input and output layers and the numbers of hidden layers and neurons, to complete the instantiation of the DWA-PPO deep reinforcement learning model.

Step 5: Construct the adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated environments with complex static obstacles to collect training sets for the DWA-PPO model; through repeated iterations, converge to a model that outputs the appropriate weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6: Demonstrate the adaptive adjustment capability of unmanned-vehicle path planning based on the adaptive PPO-ADWA algorithm through comparative simulation experiments.

Compared with the prior art, the present invention has the following beneficial effects. The method addresses the fact that the weight coefficients in the evaluation function of the traditional DWA algorithm do not adjust dynamically with the vehicle's environment or its own motion state: using the proximal policy optimization algorithm from deep reinforcement learning, a DWA-PPO deep reinforcement learning model is constructed and trained iteratively to obtain a network model that outputs the appropriate weight parameters, completing the construction of the adaptive PPO-ADWA algorithm. The method thus solves the problem that the agent, when facing different obstacle environments, often cannot reach the goal or compute the optimal path because the weight coefficients in the evaluation function cannot be adjusted dynamically.
Brief Description of the Drawings

Figure 1 is a schematic diagram of the agent-environment interaction model.
Figure 2 is a schematic diagram of the DWA algorithm principle.
Figure 3 shows the velocity-angular velocity window.
Figure 4 is a schematic diagram of the heading angle and δ.
Figure 5 is a schematic diagram of the actor-critic framework.
Figure 6 shows the state s.
Figure 7 shows the policy network structure.
Figure 8 shows the value network structure.
Figure 9 shows the DWA-PPO model.
Figure 10 shows the score and arrival-rate curves.
Figure 11 shows the simulation environment.
Figure 12 shows the classic DWA result.
Figure 13 shows the PPO-ADWA result.
Figure 14 shows the weight-parameter curves.
Figure 15 is a flowchart of the method of the present invention.
Detailed Description of the Embodiments

The technical solution of the present invention is described in detail below with reference to Figures 1-15.

As shown in Figure 15, the present invention provides an adaptive path planning method for unmanned vehicles based on the dynamic window approach and a proximal policy, comprising the following steps:

Step 1: Construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

Step 2: Establish the DWA algorithm model and determine, according to the Ackermann vehicle, the velocity range, angular velocity range, acceleration range, and angular acceleration range, as well as the main elements of the DWA algorithm and its evaluation function.

Step 3: Establish the proximal policy optimization (PPO) learning model based on the actor-critic framework, simulate the practical application scenario of the unmanned vehicle as the model's learning environment, and determine the states and actions of the model according to the application scenario.

Step 4: Construct the DWA-PPO deep reinforcement learning model; define a reward function comprising main-line rewards and sub-goal rewards; and determine the model parameters, including the sizes of the input and output layers and the numbers of hidden layers and neurons, to complete the instantiation of the DWA-PPO deep reinforcement learning model.

Step 5: Construct the adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated environments with complex static obstacles to collect training sets for the DWA-PPO model; through repeated iterations, converge to a model that outputs the appropriate weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6: Demonstrate the adaptive adjustment capability of unmanned-vehicle path planning based on the adaptive PPO-ADWA algorithm through comparative simulation experiments.
Each step is implemented as follows.

Step 1: As shown in Figure 1, construct the agent-environment interaction model, with the unmanned vehicle as the agent in deep reinforcement learning and the obstacle map as the environment.

The agent plays the role of decision maker and learner in the deep reinforcement learning system and is mainly responsible for outputting actions and receiving rewards and states; the environment is the object the agent interacts with. The interaction process comprises the following three steps:

(1) From the environment state S_t the agent observes the information O_t; the state space is the set of values the environment state can take, and the observation space is the set of values the agent's observation can take.

(2) Based on the known O_t, the agent makes the corresponding decision and determines the action A_t to apply to the environment; the action space is the set of values the action can take.

(3) Affected by A_t, the environment transitions from its state S_t to S_{t+1} and gives the agent a reward R_t, drawn from the set of possible rewards. The discretized agent-environment interaction model can therefore be represented by the sequence:

S_0, O_0, A_0, R_0, S_1, O_1, A_1, R_1, S_2, O_2, A_2, R_2, …, S_T = S_terminal

When the state of the environment is fully observable by the agent, S_t = O_t, and the sequence simplifies to:

S_0, A_0, R_0, S_1, A_1, R_1, S_2, A_2, R_2, …, S_T = S_terminal
Step 2: Establish the DWA algorithm model and determine, according to the Ackermann vehicle, the velocity range, angular velocity range, acceleration range, and angular acceleration range, as well as the main elements of the DWA algorithm and its evaluation function.

The DWA algorithm is a local path planning method that interprets the vehicle's map environment intuitively from the perspective of velocity space. Its workflow is as follows: considering the constraints on velocity and angular velocity at time t, obtain the velocity-angular velocity window Vwin reachable by the vehicle at time t; discretize the window and form combinations of the discretized velocities and angular velocities; for every combination, simulate the vehicle moving forward for m time steps of length Δt according to the given motion model, obtaining the set of simulated trajectories τ, i.e., a series of point sets; the evaluation function scores every simulated trajectory in τ, and the combination corresponding to the highest-scoring trajectory τb is selected; the vehicle is driven with this combination for a duration Δt, reaching time t+1; this loop repeats until the goal is reached. Here m is the number of sampling steps and Δt is the sampling interval, as shown in Figure 2.
At time t, the vehicle's Vwin is constrained by its own hardware and by the surrounding environment; the following three constraints are considered.

(1) Limit velocity and angular velocity constraint:

Vlim = {(v,w) | v ∈ [vmin, vmax] ∧ w ∈ [wmin, wmax]}

(2) Velocity and angular velocity constraint imposed by the acceleration limits:

Vacc = {(v,w) | v ∈ [vcu − a_v·Δt, vcu + a_v·Δt] ∧ w ∈ [wcu − a_w·Δt, wcu + a_w·Δt]}

(3) Velocity and angular velocity constraint imposed by the braking distance:

Vdis = {(v,w) | v ≤ sqrt(2·dist(v,w)·a_v) ∧ w ≤ sqrt(2·dist(v,w)·a_w)}

In the above, vmin and vmax are the limit linear velocities, wmin and wmax are the limit angular velocities, vcu and wcu are the current linear and angular velocities, a_v is the limit linear acceleration, a_w is the limit angular acceleration, and dist(v,w) is the minimum distance from the simulated trajectory corresponding to the pair (v,w) to the obstacles. The window at time t is finally expressed as

Vwin = Vlim ∩ Vacc ∩ Vdis

as shown in Figure 3.
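The following is a minimal Python sketch of how Vwin could be computed as the intersection of the three constraints. The variable names, the concrete vehicle limits, and the use of a single precomputed nearest-obstacle distance (rather than a per-trajectory dist(v,w)) are illustrative assumptions, not values taken from the filing.

```python
import numpy as np

def dynamic_window(v_cu, w_cu, dist_nearest, dt,
                   v_lim=(0.0, 2.0), w_lim=(-1.0, 1.0),
                   a_v=0.5, a_w=1.0):
    """Sketch of V_win = V_lim ∩ V_acc ∩ V_dis.

    v_cu, w_cu   : current linear / angular velocity
    dist_nearest : nearest-obstacle distance (simplification of dist(v, w))
    dt           : sampling interval Δt
    v_lim, w_lim : hardware velocity limits (illustrative values)
    a_v, a_w     : limit linear / angular accelerations (illustrative values)
    """
    # V_lim: hardware limits
    v_min, v_max = v_lim
    w_min, w_max = w_lim
    # V_acc: velocities reachable within one Δt given the acceleration limits
    v_min = max(v_min, v_cu - a_v * dt)
    v_max = min(v_max, v_cu + a_v * dt)
    w_min = max(w_min, w_cu - a_w * dt)
    w_max = min(w_max, w_cu + a_w * dt)
    # V_dis: admissible velocities that still allow stopping before the obstacle
    v_max = min(v_max, np.sqrt(2.0 * dist_nearest * a_v))
    w_max = min(w_max, np.sqrt(2.0 * dist_nearest * a_w))
    return v_min, v_max, w_min, w_max

# Discretize the window into candidate (v, w) combinations
v_lo, v_hi, w_lo, w_hi = dynamic_window(v_cu=0.8, w_cu=0.1, dist_nearest=3.0, dt=0.1)
candidates = [(v, w) for v in np.linspace(v_lo, v_hi, 10)
                     for w in np.linspace(w_lo, w_hi, 10)]
```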
The evaluation function consists of three sub-functions and jointly considers the vehicle's travel speed, the risk of collision with obstacles, and the vehicle's heading, as follows:

G(v,w) = σ(α·heading(v,w) + η·dist(v,w) + γ·vel(v,w))

where heading(v,w) evaluates the deviation between the vehicle's heading angle and δ, the angle between the line from the vehicle to the target point and the positive x-axis, as shown in Figure 4; dist(v,w) is the Euclidean distance from the simulated trajectory to the nearest obstacle; vel(v,w) is the magnitude of the vehicle's linear velocity; and α, η, γ are the three weight coefficients. The evaluation function is thus composed of sub-functions with different dimensions; the normalization σ(·) acts as non-dimensionalization, bringing data of different dimensions into the same reference frame so that they can be combined or compared, thereby avoiding evaluation bias caused by differences in data scale. Each sub-function value heading(vi,wj) is normalized over all candidate pairs (vi,wj) in the window, and dist(vi,wj) and vel(vi,wj) undergo the same normalization.
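A minimal sketch of the normalized, weighted evaluation is given below. The grouping of σ(·) with the individual sub-functions and the sum-over-candidates form of the normalization are assumptions made for illustration; the filing only states that the sub-function values are normalized before being combined.

```python
import numpy as np

def evaluate(candidates, heading, dist, vel, alpha, eta, gamma):
    """Score every candidate (v, w) and return the best one (sketch).

    heading, dist, vel : arrays of raw sub-function values, one entry per candidate
    alpha, eta, gamma  : the three weight coefficients
    """
    def sigma(x):
        # Normalize over all candidates (illustrative choice of normalization)
        x = np.asarray(x, dtype=float)
        s = x.sum()
        return x / s if s > 0 else np.zeros_like(x)

    scores = alpha * sigma(heading) + eta * sigma(dist) + gamma * sigma(vel)
    best = int(np.argmax(scores))
    return candidates[best], scores
```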
The vehicle obtains the simulated trajectories from a constant-velocity motion model. Under the assumptions of this model, the magnitudes of the linear and angular velocities remain constant and the direction of the linear velocity changes linearly with time; to simplify the model and speed up computation, the velocity direction can be regarded as constant within a small time interval, so the constant-velocity motion model can be discretized. With xt, yt denoting the coordinates of the vehicle at time t, φt the heading angle at time t, and vt, wt the velocity and angular velocity at time t, the discretized model is:

x(t+1) = xt + vt·Δt·cos(φt),  y(t+1) = yt + vt·Δt·sin(φt),  φ(t+1) = φt + wt·Δt
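A short sketch of rolling out a candidate (v, w) for m steps of Δt with this discretized model; the default step count and interval are illustrative assumptions.

```python
import math

def simulate_trajectory(x, y, phi, v, w, dt=0.1, m=30):
    """Roll out the discretized constant-velocity model:
    x_{t+1} = x_t + v·Δt·cos(φ_t), y_{t+1} = y_t + v·Δt·sin(φ_t), φ_{t+1} = φ_t + w·Δt.
    Returns the simulated trajectory as a list of (x, y, φ) points."""
    traj = [(x, y, phi)]
    for _ in range(m):
        x += v * dt * math.cos(phi)
        y += v * dt * math.sin(phi)
        phi += w * dt
        traj.append((x, y, phi))
    return traj
```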
Step 3: Establish the proximal policy optimization (PPO) learning model based on the actor-critic framework (as shown in Figure 5), simulate the practical application scenario of the unmanned vehicle as the model's learning environment, and determine the states and actions of the model according to the application scenario.

The proximal policy optimization (PPO) algorithm adds a D_KL(p||q) penalty term to the objective function of importance-sampling-based policy learning:

J_PPO(θ1) = E[ (π(at|st;θ1) / π(at|st;θ2)) · Ut ] − β·D_KL(π(at|st;θ2) || π(at|st;θ1))

where the first term is the objective function of policy learning based on importance sampling with respect to the parameter θ1; θ is the parameter of the policy π, and the better the policy, the larger the objective; γ is the parameter introduced for the Monte Carlo approximation; Ut is the return term used in the policy gradient; π(at|st;θ1) is the target policy; π(at|st;θ2) is the behavior policy; E denotes the expectation under the policy network; and β is a hyperparameter. The larger the difference between the distributions q and p, the larger the D_KL(p||q) term and the heavier the penalty on the objective; conversely, the smaller the difference, the smaller the penalty. Since the goal of reinforcement learning is to maximize the objective, the penalized objective keeps the behavior policy and the target policy within a certain range of similarity.
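A minimal PyTorch sketch of this KL-penalized surrogate objective follows. The Gaussian parameterization of the policy and the use of Ut as an advantage-like weight are assumptions consistent with the rest of the description; the function returns a loss to be minimized, i.e., the negated objective.

```python
import torch
from torch.distributions import Normal, kl_divergence

def ppo_penalty_loss(mu_new, std_new, mu_old, std_old, actions, U_t, beta=0.01):
    """loss = -E[ (π_θ1(a|s) / π_θ2(a|s)) · U_t ] + β·E[ D_KL(π_θ2 || π_θ1) ]  (sketch).

    mu_old, std_old : Gaussian parameters from the behavior policy θ2 (detached)
    mu_new, std_new : Gaussian parameters from the target policy θ1 being optimized
    actions         : the actions that were actually sampled during collection
    """
    pi_new = Normal(mu_new, std_new)
    pi_old = Normal(mu_old.detach(), std_old.detach())
    # Importance-sampling ratio π_θ1(a|s) / π_θ2(a|s)
    ratio = torch.exp(pi_new.log_prob(actions).sum(-1)
                      - pi_old.log_prob(actions).sum(-1))
    surrogate = (ratio * U_t).mean()
    kl = kl_divergence(pi_old, pi_new).sum(-1).mean()
    # Negate because optimizers minimize; maximizing the objective = minimizing its negative
    return -(surrogate - beta * kl)
```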
The unmanned vehicle searches the obstacle environment for the optimal path connecting the start point and the goal, so the model's learning environment, i.e., the practical application scenario of the unmanned vehicle, is the obstacle map.

The state s in the model is the environmental information perceived by the vehicle's sensors and may also include the vehicle's own position and motion state. The state s is the only source of information for the vehicle's action decisions and is also the essential basis for maximizing the return; its quality therefore directly affects whether the algorithm converges, how fast it converges, and its final performance. The state s can be understood as a high-dimensional vector of information about the surrounding environment. Since the vehicle's ultimate goal is to reach the end point along the optimal path, the vehicle's position and motion state, the distribution of surrounding obstacles, and the position of the target point are clearly the core basis of its decisions. To better match the practical application scenario, the returns of a lidar scanning one full revolution at 2-degree intervals are taken as the main part of the state s; in addition, the state s includes the vehicle's velocity vt, angular velocity wt, heading angle, and the current target-point position (xg_t, yg_t), as shown in Figure 6. The specific approach is to replace the fixed weights of the evaluation function with the output of the policy network, constructing an adaptive evaluation function. The action a therefore corresponds to the weights (α, η, γ) of the evaluation function and is defined as:

a = [μ1, σ1, μ2, σ2, μ3, σ3]

where [μ1, σ1] are the mean and variance describing the probability density function of the weight α; by analogy, [μ2, σ2] are the mean and variance describing the probability density function of the weight η, and [μ3, σ3] those of the weight γ. The weights (α, η, γ) are then determined by random sampling from their respective probability density functions, and the actions are mapped into the interval [-1, 1] by the Tanh function.
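A sketch of how the six policy outputs could be turned into the three weights: each weight is drawn from its Gaussian and squashed into [-1, 1] by tanh. Treating the raw scale outputs with a softplus to keep them positive is an assumption; any further rescaling of the squashed values to an application-specific weight range is not specified here.

```python
import torch

def sample_weights(action_out):
    """action_out: 1-D tensor [μ1, σ1, μ2, σ2, μ3, σ3] from the policy network (sketch).
    Returns (α, η, γ) sampled from the three Gaussians and mapped into [-1, 1] by tanh."""
    mu = action_out[0::2]
    sigma = torch.nn.functional.softplus(action_out[1::2]) + 1e-6  # keep scales positive (assumed)
    dist = torch.distributions.Normal(mu, sigma)
    raw = dist.sample()
    alpha, eta, gamma = torch.tanh(raw)  # squash into [-1, 1]
    return alpha.item(), eta.item(), gamma.item()
```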
Once the state s and the action a are determined, the numbers of neurons in the input and output layers of the policy network π(a|s;θ) and the value network q(s,a;w) follow directly. Schematic diagrams of the policy network and value network structures are shown in Figures 7 and 8.
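A minimal PyTorch sketch of the two networks consistent with the state and action definitions above (180 lidar returns at 2-degree intervals plus velocity, angular velocity, heading, and the target position give roughly 185 inputs under these assumptions; the policy outputs the 6 Gaussian parameters). The hidden-layer sizes are illustrative; the actual structures are those of Figures 7 and 8.

```python
import torch
import torch.nn as nn

STATE_DIM = 185   # 180 lidar returns + v, w, heading, target (x, y): an assumption
ACTION_DIM = 6    # [μ1, σ1, μ2, σ2, μ3, σ3]

class PolicyNet(nn.Module):
    """π(a|s; θ): maps the state to the 6 Gaussian parameters (hidden sizes assumed)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(STATE_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, ACTION_DIM),
        )

    def forward(self, s):
        return self.body(s)

class ValueNet(nn.Module):
    """q(s, a; w): maps a state-action pair to a scalar action value (hidden sizes assumed)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.body(torch.cat([s, a], dim=-1)).squeeze(-1)
```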
Step 4: Construct the DWA-PPO deep reinforcement learning model; define a reward function comprising main-line rewards and sub-goal rewards; and determine the model parameters, including the sizes of the input and output layers and the numbers of hidden layers and neurons, to complete the instantiation of the DWA-PPO deep reinforcement learning model.

The reward function is the core of the learning model. According to whether they are triggered by main-line events, the rewards obtained by the unmanned vehicle are divided into main-line rewards and sub-goal rewards.

Main-line rewards: the main-line reward can be understood as the settlement reward when the agent reaches a terminal state. Here it comprises the reward R_main_goal obtained when the vehicle navigates to the goal, the penalty R_main_out when the maximum number of iteration steps is exceeded, and the penalty R_main_coll when the vehicle collides with an obstacle.

Sub-goal rewards: rewards other than the main-line rewards are called auxiliary rewards, and their main form is sub-goal rewards. Considering the practical scenario of navigation planning in an obstacle environment, the influence of factors such as local key points, the environment state, the vehicle's motion state, and the relative relation between the vehicle and the target point on the main-line task of finding the optimal path is analyzed, and the following sub-goal rewards are defined:

(1) Energy penalty reward R_sub_step: R_sub_step both limits the vehicle's own energy consumption and encourages it to find the optimal path. Et is the energy consumed during the t-th step while the vehicle travels for Δt at speed vt; R_sub_step is defined from the normalized Et.

(2) Distance-change reward R_sub_dis: during navigation the vehicle may locally move away from the goal while avoiding obstacles, but globally it must approach the goal. A reward related to the distance between the vehicle's position and the target point is therefore defined: R_sub_dis is a positive reward, and the greater the distance moved toward the goal, the larger R_sub_dis.

(3) Obstacle-distance reward R_sub_obs: r_t_obs is defined for the case in which no obstacle lies within the vehicle's safety distance when it brakes at maximum deceleration; avoiding collisions during planning is the primary prerequisite for driving safety. R_sub_obs is defined from the normalized r_t_obs.

(4) Azimuth reward R_sub_head: the vehicle's objective is to reach the end point, so during navigation the more its heading points toward the target, the better; r_head is defined so that a positive reward is obtained only when the vehicle's heading is very close to the optimal azimuth. R_sub_head is defined from the normalized r_head.

In summary, the reward Rt of the vehicle at step t is the sum of the main-line reward and the sub-goal rewards scaled by a sub-goal reward adjustment factor, as sketched below.
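The following is a hedged Python sketch of how the main-line and sub-goal rewards could be combined into Rt. The exact functional forms and magnitudes of the individual terms are not reproduced here, so every constant and expression below is an illustrative assumption.

```python
def step_reward(reached_goal, collided, out_of_steps,
                energy_norm, dist_progress, obstacle_norm, heading_norm,
                lam=0.1):
    """R_t = main-line reward, or λ · (sub-goal rewards); all values are assumptions."""
    # Main-line rewards: settlement rewards for terminal events
    if reached_goal:
        return 100.0            # R_main_goal
    if collided:
        return -100.0           # R_main_coll
    if out_of_steps:
        return -50.0            # R_main_out
    # Sub-goal rewards (inputs assumed already normalized to [0, 1])
    r_step = -energy_norm                 # energy penalty
    r_dis = max(dist_progress, 0.0)       # positive when moving toward the goal
    r_obs = obstacle_norm                 # larger when clear of obstacles within the safety distance
    r_head = heading_norm                 # positive only near the best azimuth
    return lam * (r_step + r_dis + r_obs + r_head)   # λ: sub-goal adjustment factor
```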
The actor-critic framework constructs a value network to approximate the action value in the policy gradient, so the network architecture includes at least a value network and a policy network. The value-network loss is

L(w) = (1/2)·[q(st,at;w) − yt]^2

and the learning target of the value network is

yt = rt + γ·q(s(t+1),a(t+1);w)

It can be seen that this learning target contains part of the value network's own prediction. If the value network itself overestimates the action value Q(s,a), then this way of using its own output to train itself continually amplifies the overestimation; moreover, the overestimation is non-uniform and seriously disturbs training. This phenomenon is called bootstrapping. To prevent bootstrapping in the value network, a target value network qT(s,a;w-) is constructed with parameters w-; its structure is identical to that of the value network but its parameter values differ, and it is used to compute the TD error:

yt = rt + γ·qT(s(t+1),a(t+1);w-),  δt = q(st,at;w) − yt

The initial parameters of the target value network are the same as those of the value network; μ is a coefficient ensuring that the coefficients sum to 1, and subsequent updates follow

w- ← μ·w- + (1−μ)·w
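A sketch of the target value network, its soft update, and the TD target computed from it. The soft-update form with coefficient μ follows the description above; the termination handling and default hyperparameters are assumptions.

```python
import copy
import torch

def make_target(value_net):
    """q_T(s, a; w-) starts as a copy of q(s, a; w)."""
    target = copy.deepcopy(value_net)
    for p in target.parameters():
        p.requires_grad_(False)
    return target

@torch.no_grad()
def soft_update(target, source, mu=0.995):
    """w- <- μ·w- + (1-μ)·w  (coefficients sum to 1)."""
    for p_t, p in zip(target.parameters(), source.parameters()):
        p_t.mul_(mu).add_((1.0 - mu) * p)

@torch.no_grad()
def td_target(target_net, r, s_next, a_next, gamma=0.99, done=None):
    """y_t = r_t + γ·q_T(s_{t+1}, a_{t+1}; w-), with bootstrapping cut at terminal states."""
    q_next = target_net(s_next, a_next)
    if done is not None:
        q_next = q_next * (1.0 - done)
    return r + gamma * q_next
```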
In summary, the network architecture of the DWA-PPO reinforcement learning model includes three parts: the policy network π(a|s;θ), the value network q(s,a;w), and the target value network qT(s,a;w-). The DWA-PPO reinforcement learning model is shown in Figure 9.
In summary, the constructed model comprises the agent, the environment, the critic module, and the actor module. The critic module includes the value-network error function L(w), the value network q(s,a;w), and the target value network qT(s,a;w-). The actor module includes the target network π(a|s;θ1), the behavior network π(a|s;θ2), and the policy-network objective function. The first stage of training is the collection of the training set, as indicated by the black lines in the figure: at the initial moment of episode 0, the vehicle observes the state s_0 from the environment through its perception and localization system; after receiving s_0, the behavior network π(a|s;θ2) outputs a Gaussian distribution π(A_0|s_0;θ2) over the action A_0, from which a concrete action a_0 is randomly sampled and passed to the vehicle, yielding the evaluation function G_0(v,w) of the DWA algorithm at the initial moment; the simulated trajectory set of the DWA algorithm at the initial moment is then evaluated, and the velocity and angular velocity command of the optimal trajectory is passed to the vehicle's motion control module to drive the vehicle. At this point the vehicle's position and heading angle and the distribution of surrounding obstacles have changed, the environment transitions to state s_1, and the reward function feeds the reward r_0 back to the critic module according to the changed information. If s_1 is not the terminal state s_n, the episode proceeds to the next time step; otherwise the map and the vehicle's state are reset and trajectory collection continues with the next episode, until i episodes have been collected, finally yielding the training set:

χ = [χ_0, χ_1, …, χ_i]

χ_0 = [s_0^0, a_0^0, r_0^0, …, s_{n-1}^0, a_{n-1}^0, r_{n-1}^0, s_n^0]
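A sketch of this training-set collection loop: the behavior network outputs the Gaussian action, the sampled weights parameterize the DWA evaluation function, the vehicle is driven for one step, and the transition is stored. The env interface (reset/step wrapping the obstacle map, the DWA planner, and the reward function) and the behavior_net.act helper are assumptions.

```python
def collect_rollouts(env, behavior_net, num_episodes, max_steps):
    """Collect χ = [χ_0, ..., χ_i], where χ_k = [s_0, a_0, r_0, ..., s_n]  (sketch)."""
    dataset = []
    for _ in range(num_episodes):
        episode = []
        s = env.reset()                       # random obstacle map, vehicle at the start point
        for _ in range(max_steps):
            a = behavior_net.act(s)           # sample [μ, σ] -> (α, η, γ); assumed helper
            s_next, r, done = env.step(a)     # DWA picks the best (v, w) under these weights
            episode.append((s, a, r))
            s = s_next
            if done:                          # goal reached, collision, or step limit
                break
        episode.append(s)                     # terminal state s_n closes the episode
        dataset.append(episode)
    return dataset
```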
Step 5: Construct the adaptive PPO-ADWA algorithm: using the established DWA-PPO deep reinforcement learning model, simulate the navigation planning of the unmanned vehicle in randomly generated environments with complex static obstacles to collect training sets for the network model; through repeated iterations, converge to a model that outputs the appropriate weight parameters according to changes in the distribution of surrounding obstacles, completing the construction of the adaptive PPO-ADWA algorithm.

After the training set is obtained, the value network q(s,a;w) is updated by back-propagating the error function L(w), and π(a|s;θ1) is updated by back-propagating the error function of the PPO algorithm. Let the current parameters of the networks q(s,a;w), qT(s,a;w-), and π(a|s;θ1) be w_now, w-_now, and θ_now respectively; the following steps are repeated Z times to complete one generation of updates:

(1) Randomly draw M_I (the minibatch size) states s_N^I from the shuffled training set χ.

(2) Use qT(s,a;w-) to compute the K-step TD error MTD_N^I starting from state s_N^I.

(3) Use the value network q(s,a;w) to compute the action-value estimate at state s_N^I:

q_N^I = Q(s_N^I, a_N^I; w_now)

(4) Compute L(w) from the action-value estimates and the K-step TD targets.

(5) Compute the PPO policy objective.

(6) Update the value network, the policy network, and the target value network: the parameter θ_now becomes θ_new after the importance-sampling (PPO) update; the parameter w_now becomes w_new after the policy-evaluation update; and the target-value-network parameter w-_now, introduced to prevent bootstrapping, becomes w-_new through the update with coefficient μ, which ensures the coefficients sum to 1. After Z such updates, the parameters of the target network π(a|s;θ1) are copied to the behavior network π(a|s;θ2); this is recorded as one generation of updates. The training set is then cleared and the next generation begins, until the model converges (a condensed sketch of one generation of updates is given below).
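A condensed sketch of one generation of updates: sample a minibatch, form K-step TD targets with qT, update the value network with L(w), update the target policy with the KL-penalized PPO objective, softly update w-, and finally copy θ1 into the behavior network θ2. It reuses the hypothetical ppo_penalty_loss and soft_update helpers sketched earlier; the K-step target form, the record fields, and all hyperparameters are assumptions.

```python
import random
import torch
import torch.nn.functional as F

def k_step_td_target(target_net, traj, t, gamma=0.99, K=5):
    """y_t = Σ_{k<K} γ^k r_{t+k} + γ^K q_T(s_{t+K}, a_{t+K}; w-)  (assumed K-step form).
    `traj` is one collected episode of records with fields s, a, r."""
    K = min(K, len(traj) - 1 - t)                 # truncate near the end of the episode
    y = sum((gamma ** k) * traj[t + k].r for k in range(K))
    with torch.no_grad():
        y = y + (gamma ** K) * target_net(traj[t + K].s, traj[t + K].a)
    return y

def one_generation(batch_pool, policy_net, behavior_net, value_net, target_net,
                   pi_opt, q_opt, Z=10, M_I=64, beta=0.01, mu=0.995):
    """One generation of updates; `batch_pool` holds records with fields
    s, a (6 Gaussian params), a_raw (sampled pre-tanh weights), y (K-step TD target)."""
    for _ in range(Z):
        batch = random.sample(batch_pool, M_I)         # M_I states s_N^I
        s = torch.stack([b.s for b in batch])
        a = torch.stack([b.a for b in batch])
        a_raw = torch.stack([b.a_raw for b in batch])
        y = torch.stack([b.y for b in batch])

        # (3)-(4): action-value estimates and value-network loss L(w)
        q = value_net(s, a)
        U = (y - q).detach()                           # advantage-like weight for the policy (assumed)
        q_loss = F.mse_loss(q, y)
        q_opt.zero_grad(); q_loss.backward(); q_opt.step()

        # (5): KL-penalized PPO objective for the target policy θ1
        mu_new, std_new = policy_net.dist_params(s)    # assumed helper returning Gaussian params
        with torch.no_grad():
            mu_old, std_old = behavior_net.dist_params(s)
        pi_loss = ppo_penalty_loss(mu_new, std_new, mu_old, std_old, a_raw, U, beta)
        pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()

        # (6): soft update of the target value network, w- <- μ·w- + (1-μ)·w
        soft_update(target_net, value_net, mu)

    behavior_net.load_state_dict(policy_net.state_dict())   # θ2 <- θ1 after Z updates
```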
Figure 10 shows the curves of the average score per generation and the arrival rate of the unmanned vehicle in the deep reinforcement learning environment during training. As the model iterates and converges, the network gradually learns parameters that correctly guide the vehicle's path planning, completing the construction of the adaptive PPO-ADWA algorithm.

Step 6: Demonstrate the adaptive adjustment capability of PPO-ADWA-based unmanned-vehicle path planning through comparative simulation experiments.

To verify the self-adjustment capability of unmanned-vehicle path planning based on the PPO-ADWA algorithm, its robustness is verified in randomly generated environments with complex static obstacles. The simulation environment is shown in Figure 11: the map is 60 m × 60 m, the green dot is the start point, the blue five-pointed star is the goal, and the black geometric shapes are obstacles; the obstacle shapes include regular polygons and circles, and their sizes and numbers are randomly generated within a certain range. Performance is evaluated on 100 maps with different obstacle placements; the simulation results are shown in Table 1.

Table 1 Comparison of simulation results

The arrival rate of the path-planning results under PPO-ADWA is 84%, an improvement of 6 percentage points over the classic DWA; the average path length is 93.04 m, a 5.00% improvement in path efficiency; and the average number of steps is 251.95, a 4.85% reduction in average step cost. The planning results of the classic DWA are shown in Figure 12 and those of PPO-ADWA in Figure 13. Figure 14 shows the weight-parameter curves during path planning with the PPO-ADWA fusion strategy; it can be seen that the weight parameters generally maintain the numerical relation η > γ > α.

The above are preferred embodiments of the present invention. Any changes made according to the technical solution of the present invention whose functional effects do not exceed the scope of the technical solution of the present invention fall within the protection scope of the present invention.
Claims (6)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310792088.4A CN116679719A (en) | 2023-06-30 | 2023-06-30 | Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310792088.4A CN116679719A (en) | 2023-06-30 | 2023-06-30 | Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116679719A true CN116679719A (en) | 2023-09-01 |
Family
ID=87782071
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310792088.4A Pending CN116679719A (en) | 2023-06-30 | 2023-06-30 | Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116679719A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117130263A (en) * | 2023-10-26 | 2023-11-28 | 博创联动科技股份有限公司 | Intelligent control method and system for whole vehicle based on big data of Internet of vehicles |
CN117553800A (en) * | 2024-01-04 | 2024-02-13 | 深圳市乐骑智能科技有限公司 | AGV positioning and path planning method and device |
CN117682429A (en) * | 2024-02-01 | 2024-03-12 | 华芯(嘉兴)智能装备有限公司 | Crown block carrying instruction scheduling method and device of material control system |
CN117724478A (en) * | 2023-11-27 | 2024-03-19 | 上海海事大学 | Automatic container terminal AGV path planning method |
CN117826713A (en) * | 2023-11-22 | 2024-04-05 | 山东科技大学 | An improved reinforcement learning AGV path planning method |
CN117990119A (en) * | 2024-01-29 | 2024-05-07 | 中山大学·深圳 | A hierarchical off-road path planning method based on deep reinforcement learning |
CN118372851A (en) * | 2024-04-15 | 2024-07-23 | 海南大学 | Vehicle optimal control method based on deep reinforcement learning |
CN118906859A (en) * | 2024-09-06 | 2024-11-08 | 中氢投电力(北京)有限公司 | Automatic parking charging system |
- 2023-06-30 CN CN202310792088.4A patent/CN116679719A/en active Pending
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117130263B (en) * | 2023-10-26 | 2024-01-16 | 博创联动科技股份有限公司 | Intelligent control method and system for whole vehicle based on big data of Internet of vehicles |
CN117130263A (en) * | 2023-10-26 | 2023-11-28 | 博创联动科技股份有限公司 | Intelligent control method and system for whole vehicle based on big data of Internet of vehicles |
CN117826713A (en) * | 2023-11-22 | 2024-04-05 | 山东科技大学 | An improved reinforcement learning AGV path planning method |
CN117724478B (en) * | 2023-11-27 | 2024-09-20 | 上海海事大学 | Automatic container terminal AGV path planning method |
CN117724478A (en) * | 2023-11-27 | 2024-03-19 | 上海海事大学 | Automatic container terminal AGV path planning method |
CN117553800A (en) * | 2024-01-04 | 2024-02-13 | 深圳市乐骑智能科技有限公司 | AGV positioning and path planning method and device |
CN117553800B (en) * | 2024-01-04 | 2024-03-19 | 深圳市乐骑智能科技有限公司 | AGV positioning and path planning method and device |
CN117990119B (en) * | 2024-01-29 | 2024-10-15 | 中山大学·深圳 | A hierarchical off-road path planning method based on deep reinforcement learning |
CN117990119A (en) * | 2024-01-29 | 2024-05-07 | 中山大学·深圳 | A hierarchical off-road path planning method based on deep reinforcement learning |
CN117682429B (en) * | 2024-02-01 | 2024-04-05 | 华芯(嘉兴)智能装备有限公司 | Crown block carrying instruction scheduling method and device of material control system |
CN117682429A (en) * | 2024-02-01 | 2024-03-12 | 华芯(嘉兴)智能装备有限公司 | Crown block carrying instruction scheduling method and device of material control system |
CN118372851A (en) * | 2024-04-15 | 2024-07-23 | 海南大学 | Vehicle optimal control method based on deep reinforcement learning |
CN118906859A (en) * | 2024-09-06 | 2024-11-08 | 中氢投电力(北京)有限公司 | Automatic parking charging system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN116679719A (en) | Adaptive path planning method for unmanned vehicles based on dynamic window method and proximal strategy | |
CN112356830B (en) | Intelligent parking method based on model reinforcement learning | |
Zhu et al. | Deep reinforcement learning based mobile robot navigation: A review | |
Zhang et al. | Reinforcement learning-based motion planning for automatic parking system | |
CN110136481B (en) | Parking strategy based on deep reinforcement learning | |
Faust et al. | Prm-rl: Long-range robotic navigation tasks by combining reinforcement learning and sampling-based planning | |
CN112162555B (en) | Vehicle control method based on reinforcement learning control strategy in mixed fleet | |
CN111580544B (en) | A UAV Target Tracking Control Method Based on Reinforcement Learning PPO Algorithm | |
CN112433525A (en) | Mobile robot navigation method based on simulation learning and deep reinforcement learning | |
Botteghi et al. | On reward shaping for mobile robot navigation: A reinforcement learning and SLAM based approach | |
CN114020013B (en) | Unmanned aerial vehicle formation collision avoidance method based on deep reinforcement learning | |
CN113848974B (en) | Aircraft trajectory planning method and system based on deep reinforcement learning | |
CN114741886B (en) | Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation | |
CN115542733B (en) | Adaptive dynamic window method based on deep reinforcement learning | |
CN116069023B (en) | Multi-unmanned vehicle formation control method and system based on deep reinforcement learning | |
Gök | Dynamic path planning via Dueling Double Deep Q-Network (D3QN) with prioritized experience replay | |
CN116551703B (en) | Motion planning method based on machine learning in complex environment | |
CN113391633A (en) | Urban environment-oriented mobile robot fusion path planning method | |
CN114089751A (en) | A Path Planning Method for Mobile Robots Based on Improved DDPG Algorithm | |
Sun et al. | Event-triggered reconfigurable reinforcement learning motion-planning approach for mobile robot in unknown dynamic environments | |
CN113959446B (en) | Autonomous logistics transportation navigation method for robot based on neural network | |
CN115265547A (en) | Robot active navigation method based on reinforcement learning in unknown environment | |
Wang et al. | An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle | |
CN118153431A (en) | Underwater multi-agent cooperative trapping method and device based on deep reinforcement learning | |
CN117908565A (en) | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |