WO2020056875A1 - A parking strategy based on deep reinforcement learning - Google Patents

A parking strategy based on deep reinforcement learning

Info

Publication number
WO2020056875A1
WO2020056875A1 (PCT/CN2018/113660)
Authority
WO
WIPO (PCT)
Prior art keywords
vehicle
target
parking
reward
reinforcement learning
Prior art date
Application number
PCT/CN2018/113660
Other languages
English (en)
French (fr)
Inventor
王宇舟
Original Assignee
初速度(苏州)科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 初速度(苏州)科技有限公司
Publication of WO2020056875A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G08SIGNALLING
    • G08GTRAFFIC CONTROL SYSTEMS
    • G08G1/00Traffic control systems for road vehicles
    • G08G1/16Anti-collision systems
    • G08G1/168Driving aids for parking, e.g. acoustic or visual feedback on parking space

Definitions

  • The invention relates to the technical field of vehicles, and in particular to a parking strategy based on deep reinforcement learning.
  • The main technical routes for automatic parking are currently based on traditional path-planning algorithms such as RRT, PRM, and A*.
  • The basic idea is to randomly generate paths within a pre-built scene map and then run collision detection on the randomly generated paths, that is, to check whether a path passes through obstacles or stays within the vehicle's drivable area.
  • Among all feasible paths, Dijkstra's algorithm or similar methods are then used to select the optimal parking route.
  • The prior art above has the following disadvantages: because it must first generate random paths, it is difficult to produce a feasible path when the scene is complex (many obstacles, narrow parking spaces), so the quality of the final planned path is poor.
  • The existing technology must recompute its optimal path for every scenario (a different garage, or even a different slot in the same garage), so its generalization ability is poor; traditional algorithms also place high demands on map accuracy, so planning degrades when noisy sensor inputs (such as cameras or lidar) are applied.
  • The invention provides a parking method based on deep reinforcement learning, characterized in that the method obtains a planned parking route by a deep reinforcement learning algorithm.
  • A tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, and the tuple is updated at fixed time intervals.
  • Based on the current vehicle observation state, a predicted action and reward function are output for route planning; after each tuple update, a predicted action and reward function are output again from the updated observation state, until the vehicle reaches the target parking space. The parking planning route with the highest reward-function value is thereby obtained. The superiority of a planned parking route can be evaluated by the following formula:
  • Y = a * distance(car position, target position) + b * abs(car yaw - target yaw) + c * target reached;
  • Y represents the superiority of the parking path;
  • a and b represent the degree of completion of the control task;
  • c represents an additional reward for completing the task;
  • the abs() function takes the absolute value of its argument, and distance() returns the distance from the vehicle steering centre to the target slot point.
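The evaluation above can be sketched in a few lines. The coefficient values a = 1/L, b = 1/(2π), c = 1 (for an L m × L m planning space) are taken from the detailed description; the function is an illustrative reading of the formula, not the patented implementation:

```python
import math

def path_superiority(car_pos, target_pos, car_yaw, target_yaw, reached, L=10.0):
    """Y = a*distance + b*abs(yaw error) + c*target_reached, as in the formula
    above; a = 1/L, b = 1/(2*pi), c = 1 per the description's example values."""
    a, b, c = 1.0 / L, 1.0 / (2.0 * math.pi), 1.0
    dist = math.hypot(car_pos[0] - target_pos[0], car_pos[1] - target_pos[1])
    return a * dist + b * abs(car_yaw - target_yaw) + c * (1.0 if reached else 0.0)
```

With the default L = 10 m, a vehicle parked exactly on the target pose scores c = 1, the task-completion bonus alone.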
  • The vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
  • The sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
  • The vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
  • The reward function represents the distance between the vehicle's terminal state and the target parking space: the closer the terminal state is to the target slot, the higher the reward value r obtained.
  • A first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, which is used to quantify the quality of the current vehicle state.
  • The second neural network takes the vehicle observation state as input and outputs the predicted vehicle action.
  • An embodiment of the present invention further provides a parking route acquisition system based on deep reinforcement learning, characterized in that the system obtains a planned parking route by a deep reinforcement learning algorithm system.
  • A tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, and the tuple is updated at fixed time intervals.
  • Based on the current vehicle observation state, predicted actions and reward functions are output for route planning.
  • After each tuple update, predicted actions and reward functions are output again from the updated observation states, until the vehicle reaches the target parking space. The parking planning route with the highest reward-function value is thereby obtained; the superiority of the planned route is related to the degree of completion of the control task and the additional reward for task completion.
  • The vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
  • The sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
  • The vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
  • The reward function represents the distance between the vehicle's terminal state and the target parking space: the closer the terminal state is to the target slot, the higher the reward value r obtained.
  • A first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, which is used to quantify the quality of the current vehicle state.
  • The second neural network takes the vehicle observation state as input and outputs the predicted vehicle action.
  • The superiority of the planned parking route can be evaluated by the following formula:
  • Y = a * distance(car position, target position) + b * abs(car yaw - target yaw) + c * target reached.
  • Y represents the superiority of the parking path;
  • a and b represent the degree of completion of the control task;
  • c represents an additional reward for completing the task;
  • assuming the planning task space (i.e., the feasible area above) is L metres by L metres, then a = 1/L, b = 1/(2π), and c = 1.
  • The parking planning method based on this tuple extracts features according to product characteristics, so relatively few parameters are needed; with the objective function (distance + steering + collision), the coefficients do not need tuning.
  • This is one of the inventive points of the present invention. For example, only the observation state o of the vehicle needs to be extracted, and a predicted action a can be output from o. After the vehicle performs that action, the next predicted action is output from the updated observation state, the vehicle executes it, and so on.
  • In this way the parking strategy from the vehicle's initial position to the target parking space is obtained. Since only the observation state o needs to be extracted throughout, the parking strategy requires relatively few parameters. Obtaining the strategy requires extracting o in real time, which mainly means acquiring coordinates, distances to obstacles, and similar parameters, so the demands on map accuracy are relatively low. Moreover, because each predicted action a is output from the observation state o extracted in real time, planning simply continues from the current observation state even if the target slot changes, which improves generalization.
  • This application uses deep reinforcement learning to extract features. Compared with traditional feature-extraction methods, deep reinforcement learning offers faster overall planning and faster reaction to the environment. This is one of the inventive points of the present invention.
  • FIG. 1 is a schematic diagram of an environment design according to an embodiment of the present invention
  • FIG. 2 is a flowchart provided by an embodiment of the present invention.
  • The parking strategy is used to obtain a planned route by which the vehicle can safely enter the parking slot.
  • The parking strategy takes as input the localization pose of the current vehicle and the pose of the target parking space, and outputs the control that drives the vehicle to the target slot,
  • namely the vehicle linear speed and steering angle; the linear speed and steering angle output by the parking strategy constrain the vehicle to drive only within the feasible area and finally enter the target slot.
  • The simulation software program first obtains a map of the vehicle's current environment and, from the map, the target slot information entered by the user, and the vehicle's current coordinates on the map, derives
  • the area the vehicle may drive through from its current position to the target slot, i.e., the feasible area, and then obtains the side distance.
  • The side distance is the distance from the side of the vehicle body nearer the target slot to the slot line when the vehicle enters the feasible-area environment; the process of training the parking strategy then begins.
  • The simulation environment can be as shown in Figure 1.
  • Rectangular area A is the feasible area; its length can be 8-10 m and its width 5-6 m.
  • Rectangular area B is the target parking space, whose width can be 2.6-3 m. The arrow in the target slot indicates the heading of the vehicle front when parked, i.e., the vehicle must park in that orientation for the task to count as successful. The side distance can take values between 0.5 and 2 m, and different side distances correspond to the optimal parking strategies of different parking tasks. Specifically, a side distance that is too small or too large increases the difficulty of finding the optimal strategy; for example, parking is difficult with a side distance of 0.5 m and relatively easy with 2 m.
  • The tuple (that is, the observation state o, the predicted action a, and the reward value r) is updated every 0.1 s. In other words, route planning runs in real time: given the current observation state o, a predicted action a and reward value r are output. For example, the predicted action a0 is output from the initial observation state o0, and after the vehicle performs a0 it obtains the updated observation state o1.
  • The reward function then outputs the reward value r0 from the updated observation state o1 and the target parking space, completing the original tuple (observation state o0, predicted action a0, reward value r0); o1 is next treated as the current observation state, the following predicted action is output from it, and so on.
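The update cycle just described amounts to a standard rollout loop. The sketch below assumes hypothetical `env` (simulator) and `actor` (policy) objects with the interfaces shown; it illustrates the tuple refresh, not the patent's code:

```python
def rollout(env, actor, max_steps=500):
    """Collect the sequence of (o, a, r) tuples, refreshed each time step
    (every 0.1 s in the description), until a terminal state is reached."""
    tuples = []
    o = env.reset()
    for _ in range(max_steps):
        a = actor(o)                    # predicted action from current observation
        o_next, r, done = env.step(a)   # reward computed from o_next and target slot
        tuples.append((o, a, r))
        o = o_next
        if done:                        # target slot reached, collision, or step limit
            break
    return tuples
```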
  • The observation of the vehicle includes the current vehicle coordinates and sensor information.
  • The current vehicle coordinates in the feasible area are (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
  • The sensor information (s1, s2, s3, s4) is the distance from each of the vehicle's four corner points (for example, the two corner points at the front end of the vehicle and the two at the rear end, labelled 1, 2, 3, and 4 in Fig. 1) to the nearest obstacle, as measured by the sensors installed there; the observation state is thus the seven-dimensional vector o = (x, y, yaw, s1, s2, s3, s4).
  • The action space of the vehicle is the output capable of controlling the vehicle's motion, that is, the above-mentioned predicted action a.
  • The reward function (reward) is used to return the reward value r.
  • The reward value r is zero except in a terminal state, where the terminal states are: the step count exceeding the maximum step size (the step size being the number of tuple updates from the start state to the end state), the vehicle hitting an obstacle, and the vehicle reaching the target parking space.
  • The target parking space is (target_x, target_y, target_yaw), where target_x is the x-coordinate, target_y the y-coordinate, and target_yaw the offset angle of the slot attitude (the angle between the heading of the vehicle front when parked in the target slot and the x-axis).
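Combining the terminal states above with the terminal reward expression given in the detailed description (r = -sqrt((x - target_x)² + (y - target_y)²)/10 - abs(yaw - target_yaw)/π, plus one when the slot is reached), the reward function can be sketched as follows; this is an illustrative reading, not the patent's code:

```python
import math

def reward(pose, target, terminal, reached):
    """Sparse reward: zero for non-terminal states; at termination, poses closer
    to the target slot score higher, with +1 added when the slot is reached."""
    if not terminal:
        return 0.0
    x, y, yaw = pose
    tx, ty, tyaw = target
    r = -math.sqrt((x - tx) ** 2 + (y - ty) ** 2) / 10.0 - abs(yaw - tyaw) / math.pi
    return r + 1.0 if reached else r
```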
  • The deep reinforcement learning algorithm can explore the plan with the highest reward, and neural networks are used to fit the state evaluation and the parking-strategy output in deep reinforcement learning.
  • The neural network critic takes the above vehicle observation state o as input and outputs a reward value r (a value function) used to quantify the quality of the current state (whether it is easy to drive from that state to the target parking space); the critic fits the relationship between the observation state o and the reward value r, the expression of that relationship being the reward function above.
  • The neural network actor likewise takes the vehicle observation state o as input and outputs the predicted action a; that is, in that observation state the actor predicts that the vehicle should take action a to drive into the target slot, and the actor fits the distribution over the predicted actions a chosen given the observation state o.
  • The actor and critic networks are updated so that the predicted action a output by the actor in observation state o obtains a higher reward value r, subject to the Kullback-Leibler divergence (KL divergence, which measures the distance between two probability distributions) between the updated action distribution and the original action distribution being below a certain threshold.
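The KL constraint can be checked in closed form when the action distribution is modelled as a Gaussian. This is an assumption here: the patent names the KL constraint but not the distribution family or algorithm, and a TRPO/PPO-style trust region is one common realization:

```python
import math

def gaussian_kl(mu0, sigma0, mu1, sigma1):
    """KL(p0 || p1) between two 1-D Gaussians: the distance between the old
    and updated action distributions."""
    return math.log(sigma1 / sigma0) + (sigma0**2 + (mu0 - mu1)**2) / (2.0 * sigma1**2) - 0.5

def update_allowed(old, new, max_kl=0.01):
    """Accept a policy update only if it stays within the KL trust region."""
    return gaussian_kl(*old, *new) <= max_kl
```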
  • The hidden layers of the critic and actor networks adopt the same structure, namely three fully connected hidden layers of 64 nodes each, both using the ReLU activation function; however, the critic adds, after the last hidden layer,
  • a fully connected linear layer to output the function value r, while the actor adds a fully connected layer with Tanh as the activation function to output the predicted vehicle linear speed and steering angle.
  • Using neural networks for state evaluation and action prediction fits well the function values corresponding to different states in the above complex environment and the best strategy for driving into the target slot.
  • The main reasons are the non-linear activation functions and the multiple hidden layers, which let the network extract the obstacle information hidden in the environment; the actor-critic dual-network structure keeps the training process more stable and smooth while still letting the agent explore the environment, which also improves sample efficiency.
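The architecture just described (a shared stack of three 64-node ReLU layers, a linear scalar head for the critic, a Tanh head for the actor's two outputs) can be sketched with NumPy. The random weights below stand in for trained parameters; this is a shape-level illustration, not the trained networks:

```python
import numpy as np

def mlp(o, Ws, bs, head_W, head_b, head_act):
    """Three fully connected 64-node ReLU hidden layers plus an output head."""
    h = o
    for W, b in zip(Ws, bs):
        h = np.maximum(0.0, h @ W + b)      # ReLU activation
    return head_act(h @ head_W + head_b)

rng = np.random.default_rng(0)
obs_dim = 7                                  # o = (x, y, yaw, s1, s2, s3, s4)
Ws = [rng.normal(0, 0.1, (obs_dim, 64))] + [rng.normal(0, 0.1, (64, 64)) for _ in range(2)]
bs = [np.zeros(64) for _ in range(3)]
o = rng.normal(size=obs_dim)
value = mlp(o, Ws, bs, rng.normal(0, 0.1, (64, 1)), np.zeros(1), lambda z: z)   # critic: scalar value
action = mlp(o, Ws, bs, rng.normal(0, 0.1, (64, 2)), np.zeros(2), np.tanh)      # actor: (linear_v, angular_z)
```

The Tanh head naturally bounds both action components to [-1, 1], which matches using it to emit normalized speed and steering commands.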
  • This application can also evaluate the superiority of each possible parking path with the following reinforcement learning reward formula (the larger the result, the better the parking path):
  • Y = a * distance(car position, target position) + b * abs(car yaw - target yaw) + c * target reached.
  • Y represents the superiority of the parking path;
  • a and b represent the degree of completion of the control task;
  • c represents an additional reward for completing the task;
  • assuming the planning task space (i.e., the feasible area above) is L metres by L metres, then a = 1/L, b = 1/(2π), and c = 1.
  • The process of training the parking strategy is completed in the simulator.
  • The simulation software program starts training the parking strategy from the vehicle's current position and enters the automatic parking scene.
  • The algorithm module (Explorer) in the simulation software feeds the vehicle's current observation state o0 into the neural networks actor and critic.
  • The actor outputs the vehicle's predicted action a0 (also called the control quantity, velocity and yaw rate) from the observation state o0; the vehicle is then controlled to perform a0, yielding the next observation state o1.
  • The critic in the simulation software uses the reward function, based on the next observation state o1 and the target parking space, to obtain the function value r0 (state reward) corresponding to action a0; the actor then proceeds to the prediction for o1, outputting the corresponding action a1, the vehicle is controlled to perform a1, and the critic uses the reward function, based on the observation state o2 after a1 and the target slot, to obtain the function value r1 corresponding to a1, and so on until the vehicle reaches a terminal state (reaching the target slot or hitting an obstacle).
  • The modules or steps of the embodiments of the present invention described above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over multiple computing devices.
  • They may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device, and in some cases
  • the steps shown or described may be performed in an order different from the one here; alternatively, they may each be made into individual integrated-circuit modules, or multiple modules or steps among them may be made into a single integrated-circuit module.
  • The embodiments of the present invention are thus not limited to any specific combination of hardware and software.
  • The above descriptions are merely preferred embodiments of the present invention and are not intended to limit it.
  • For those skilled in the art, the embodiments of the present invention may have various modifications and changes; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Abstract

A parking method and system based on deep reinforcement learning, relating to the field of intelligent driving and in particular to a parking strategy based on deep reinforcement learning. In the prior art, conventional automatic parking systems rely on traditional path-planning algorithms and perform poorly. In the present technical solution, a planned parking route is obtained by a deep reinforcement learning algorithm; a tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function. The parking planning method based on this tuple extracts features according to product characteristics, so few parameters are needed. Moreover, with the objective function (distance + steering + collision), the coefficients require no tuning. The solution uses deep reinforcement learning to extract features, yielding beneficial technical effects such as faster overall planning and faster reaction to the environment.

Description

A parking strategy based on deep reinforcement learning
Technical field
The present invention relates to the technical field of vehicles, and in particular to a parking strategy based on deep reinforcement learning.
Background art
At present, the main technical route for automatic parking is based on traditional path-planning algorithms such as RRT, PRM, and A*. The basic idea is to randomly generate paths within a pre-built scene map and then run collision detection on the randomly generated paths, i.e., to check whether a path passes through obstacles or stays within the vehicle's drivable area. Among all feasible paths, methods such as Dijkstra's algorithm are then used to select the optimal parking path.
However, the prior art above has the following defects: because it must first generate random paths, it struggles to produce a feasible path when the scene is complex (many obstacles, narrow slots), so the quality of the final planned path is poor; the optimal path must be recomputed for each different scenario (a different garage, or even a different slot in the same garage), so generalization is poor; traditional algorithms place high demands on map accuracy, so planning degrades when sensor inputs with large noise (e.g., cameras, lidar) are applied; and few candidate planned paths are available, leaving too few alternatives from which to select the optimal solution.
Summary of the invention
To solve the technical problems in the prior art, the present invention provides a parking method based on deep reinforcement learning, characterized in that: the method obtains a planned parking route by a deep reinforcement learning algorithm;
during training of the deep reinforcement learning algorithm, a tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, the tuple being updated at fixed time intervals;
based on the current vehicle observation state, a predicted action and reward function are output for route planning; after each tuple update, a predicted action and reward function are output again from the updated vehicle observation state for another round of route planning, until the vehicle reaches the target parking space; the planned parking route with the highest reward-function value is thereby obtained, and its superiority can be evaluated by the following formula:
Y = a*distance(car position, target position) + b*abs(car yaw - target yaw) + c*target reached;
where Y denotes the superiority of the parking path; a and b denote the degree of completion of the control task; c denotes the extra reward for completing the task; assuming the planning task space is L metres by L metres, a = 1/L, b = 1/(2π), and c = 1; the distance() function returns the distance from the vehicle steering centre to the target slot point, the abs() function takes the absolute value of its argument, and target reached indicates whether the vehicle has reached the target slot: target reached = 1 if the vehicle reaches the target slot, otherwise target reached = 0.
Preferably, the vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
Preferably, the sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
Preferably, the vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
Preferably, the reward function represents the distance between the vehicle's terminal state and the target parking space; the closer the terminal state is to the target slot, the higher the reward value r obtained.
Preferably, in training the parking strategy with the deep reinforcement learning algorithm, a first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, used to quantify the quality of the current state; the second neural network takes the vehicle observation state as input and outputs the vehicle predicted action.
An embodiment of the present invention further provides a parking route acquisition system based on deep reinforcement learning, characterized in that: the system obtains a planned parking route by a deep reinforcement learning algorithm system;
during training of the deep reinforcement learning algorithm system, a tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, the tuple being updated at fixed time intervals;
based on the current vehicle observation state, a predicted action and reward function are output for route planning; after each tuple update, a predicted action and reward function are output again from the updated vehicle observation state for another round of route planning, until the vehicle reaches the target parking space; the planned parking route with the highest reward-function value is thereby obtained, and the superiority of the planned parking route is related to the degree of completion of the control task and the extra reward for task completion.
Preferably, the vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
Preferably, the sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
Preferably, the vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
Preferably, the reward function represents the distance between the vehicle's terminal state and the target parking space; the closer the terminal state is to the target slot, the higher the reward value r obtained.
Preferably, in training the parking strategy with the deep reinforcement learning algorithm, a first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, used to quantify the quality of the current state; the second neural network takes the vehicle observation state as input and outputs the vehicle predicted action.
Preferably, the superiority of the planned parking route can be evaluated by the following formula:
Y = a*distance(car position, target position) + b*abs(car yaw - target yaw) + c*target reached.
Here Y denotes the superiority of the parking path; a and b denote the degree of completion of the control task; c denotes the extra reward for completing the task; assuming the planning task space (i.e., the feasible area above) is L metres by L metres, a = 1/L, b = 1/(2π), and c = 1; the distance() function returns the distance from the vehicle steering centre to the target slot point, the abs() function takes the absolute value of its argument, and target reached indicates whether the vehicle has reached the target slot: target reached = 1 if the vehicle reaches the target slot, otherwise target reached = 0.
The inventive points of the present invention include, but are not limited to, the following aspects:
(1) A tuple formed from the vehicle observation state, the vehicle predicted action, and a reward function is proposed. The parking planning method based on this tuple extracts features according to product characteristics, so few parameters are needed; with the objective function (distance + steering + collision), the coefficients require no tuning. This is one of the inventive points of the present invention. For example, only the observation state o of the vehicle needs to be extracted; a predicted action a is output from o; after the vehicle performs a, the next predicted action is output from the updated observation state, the vehicle performs it, and so on in a loop. After multiple outputs of predicted actions, the parking strategy from the vehicle's initial position to the target slot is obtained. Since only the observation state o needs to be extracted throughout, relatively few parameters are required to obtain the parking strategy. Obtaining the parking strategy requires extracting o in real time, which mainly means obtaining coordinates, distances to obstacles, and similar parameters, so the demands on map accuracy are relatively low. Moreover, because each predicted action a is output from the observation state o extracted in real time, through the repeated process of predicting a from o, no replanning is needed even if the target slot changes: planning simply continues from the current observation state, which helps improve generalization. This is one of the inventive points of the present invention.
(2) This application extracts features by deep reinforcement learning, which, compared with traditional feature-extraction methods, offers markedly better technical effects such as faster overall planning and faster reaction to the environment. This is one of the inventive points of the present invention.
(3) A suitable formula Y is established to measure the superiority of a planned parking route, making the final route more scientifically chosen. The formula creatively uses two parameters representing the degree of completion of the control task, plus a parameter representing the extra reward for task completion, making the measure of superiority more comprehensive. This is one of the inventive points of the present invention.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form a part of this application; they do not limit the invention. In the drawings:
Fig. 1 is a schematic diagram of an environment design provided by an embodiment of the present invention;
Fig. 2 is a flowchart provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the embodiments and drawings. The illustrative embodiments and their descriptions explain the invention but do not limit it.
During real-vehicle operation, the parking strategy obtained in the simulation environment is used to produce a planned route that can safely enter the slot. The parking strategy takes as input the localization pose of the current vehicle and the pose of the target parking space, and outputs the vehicle linear speed and steering angle that drive the vehicle to the target slot; the linear speed and steering angle output by the parking strategy constrain the vehicle to drive only within the feasible area and finally enter the target slot.
Specifically, when the user launches the simulation software program that acquires the parking strategy, the program first obtains a map of the vehicle's current environment. From the map, the target slot information entered by the user, and the vehicle's current coordinates on the map, it derives the area the vehicle may drive through from its current position to the target slot, i.e., the feasible area, and obtains the side distance, which is the distance from the side of the vehicle body nearer the target slot to the slot line when the vehicle enters the feasible-area environment. The process of training the parking strategy then begins.
For example, the simulation environment may be as shown in Fig. 1. Rectangular area A is the feasible area; its length may be 8-10 m and its width 5-6 m. Rectangular area B is the target parking space, whose width may be 2.6-3 m. The arrow in the target slot indicates the heading of the vehicle front when parked: the vehicle must park in that orientation for the task to count as successful. The side distance may take values between 0.5 and 2 m, and different side distances correspond to the optimal parking strategies of different parking tasks. Specifically, a side distance that is too small or too large increases the difficulty of finding the optimal strategy; for instance, parking is hard with a side distance of 0.5 m and relatively easy with 2 m.
This application trains the parking strategy by deep reinforcement learning; planning stops only when the target slot has been explored or a collision occurs, and the corresponding reward is obtained from the reward function. Specifically, during deep reinforcement learning this application learns the parking strategy from exploration sequences: each tuple (oi, ai, ri) in the sequence [o0, a0, r0, o1, a1, r1, o2, a2, r2, ...] consists of three elements, namely the vehicle observation state o, the predicted action a the vehicle executes in that state, and the task-feedback reward value r. The exploration objective is argmax over (a0, a1, a2, ...) of (r0 + r1 + r2 + ...), where i = 0, 1, 2, ... indexes the tuple updates.
In the deep reinforcement learning process, the tuple (i.e., observation state o, predicted action a, and reward value r) is updated every 0.1 s. That is, route planning runs in real time from the current observation state o, outputting a predicted action a and reward value r. For example, the predicted action a0 is output from the initial observation state o0; after the vehicle performs a0, the updated observation state o1 is obtained, and the reward function outputs the reward value r0 from o1 and the target slot, yielding the original tuple (observation state o0, predicted action a0, reward value r0). The updated state o1 is then treated as the current observation state, the action a1 is output from it, the vehicle performs a1 to obtain the updated state o2, and the reward function obtains r1 from o2 and the target slot, yielding the once-updated tuple (observation state o1, predicted action a1, reward value r1); and so on until the vehicle reaches the target slot, with the predicted actions output at each step composing a complete route from the initial position to the target parking space.
In the simulation environment, the vehicle observation state (observation) includes the current vehicle coordinates and sensor information. From the map information of the feasible area, the current vehicle coordinates in the feasible area are (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area and yaw is the angle between the vehicle's current attitude and the x-axis. The sensor information (s1, s2, s3, s4) is the distance from each of the vehicle's four corner points (for example, the two corners at the very front of the vehicle and the two at the very rear, i.e., corner points 1, 2, 3, and 4 in Fig. 1) to the nearest obstacle, measured by the sonar sensors installed there. The vehicle observation state is therefore the seven-dimensional vector o = (x, y, yaw, s1, s2, s3, s4).
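As a small illustrative sketch (not the patent's code), the seven-dimensional observation vector can be assembled like this:

```python
import numpy as np

def observation(x, y, yaw, sonar):
    """o = (x, y, yaw, s1, s2, s3, s4): steering-centre pose in the feasible-area
    frame plus the four corner sonar ranges to the nearest obstacle."""
    s1, s2, s3, s4 = sonar                  # exactly four corner measurements
    return np.array([x, y, yaw, s1, s2, s3, s4], dtype=float)
```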
The action space (action) of the vehicle is the output capable of controlling the vehicle's motion, i.e., the predicted action a above. In this simulation environment the predicted action a includes the vehicle linear speed linear_v and the vehicle steering angle angular_z, i.e., a = (linear_v, angular_z).
The reward function (reward) is used to return the reward value r. The reward value r is zero except in a terminal state, where the terminal states are: the step count exceeding the maximum step size (the step size being the number of tuple updates from the start state to the end state), the vehicle hitting an obstacle, and the vehicle reaching the target parking space. The target parking space is (target_x, target_y, target_yaw), where target_x is the x-coordinate, target_y the y-coordinate, and target_yaw the offset angle of the slot attitude (the angle between the heading of the vehicle front when parked in the target slot and the x-axis). When the vehicle reaches a terminal state without reaching the target slot, the environment's reward function returns r = -sqrt((x - target_x)² + (y - target_y)²)/10 - abs(yaw - target_yaw)/π; the closer the terminal state is to the target slot, the higher this reward value r. When the vehicle's terminal state reaches the target slot, the reward function adds one to the above r, i.e., r = r + 1.
With a reasonable and simple reward-function design, the deep reinforcement learning algorithm can explore the planned route with the highest reward, and neural networks are used to fit the state evaluation and the parking-strategy output in deep reinforcement learning.
In training the parking strategy with the deep reinforcement learning algorithm, specifically, two neural networks, actor and critic, are established. The critic takes the vehicle observation state o above as input and outputs a reward value r (a value function) used to quantify the quality of the current state (whether it is easy to drive from that state to the target slot); the critic fits the relationship between the observation state o and the reward value r, the expression of that relationship being the reward function above. The actor likewise takes the observation state o as input and outputs the predicted action a; that is, in that observation state the actor predicts that the vehicle should take action a to drive into the target slot, and the actor fits the distribution over the actions a chosen given the observation state o. Concretely, the actor and critic networks are updated so that the action a output by the actor in state o obtains a higher reward value r, subject to the Kullback-Leibler divergence (KL divergence, which measures the distance between two probability distributions) between the updated action distribution and the original one being below a certain threshold. The hidden layers of the critic and actor share the same structure, namely three fully connected hidden layers of 64 nodes each, both using the ReLU activation function; however, the critic adds a fully connected linear layer after the last hidden layer to output the function value r, while the actor adds a fully connected layer with Tanh as the activation function to output the predicted vehicle linear speed and steering angle.
Using neural networks for state evaluation and action prediction fits well the function values corresponding to the different states of the above complex environment and the best strategy for driving into the target slot. The main reasons are the non-linear activation functions and the multiple hidden layers, which let the network extract the obstacle information hidden in the environment; the actor-critic dual-network structure keeps training more stable and smooth while preserving the agent's exploration of the environment, which also improves sample efficiency.
After training yields multiple parking paths, this application can also evaluate the superiority of each candidate path with the following reinforcement learning reward formula (the larger the result, the better the path):
Y = a*distance(car position, target position) + b*abs(car yaw - target yaw) + c*target reached.
Here Y denotes the superiority of the parking path; a and b denote the degree of completion of the control task; c denotes the extra reward for completing the task. Assuming the planning task space (i.e., the feasible area above) is L metres by L metres, a = 1/L, b = 1/(2π), and c = 1; the distance() function returns the distance from the vehicle steering centre to the target slot point, the abs() function takes the absolute value of its argument, and target reached indicates whether the vehicle has reached the target slot: target reached = 1 if so, otherwise target reached = 0.
The process of training the parking strategy, completed in the simulator, is now described with reference to the flowchart in Fig. 2. When the user launches the simulation software program that acquires the parking strategy, the program starts training from the vehicle's current position and enters the automatic parking scene. First, the algorithm module (Explorer) in the simulation software feeds the vehicle's current observation state o0 into the neural networks actor and critic; the actor outputs the predicted action a0 for the vehicle from o0 (also called the control quantity, velocity and yaw rate). The vehicle is then controlled to perform a0, yielding the next observation state o1, and the critic in the simulation software uses the reward function (Reward function), based on o1 and the target slot, to obtain the function value r0 (state reward) corresponding to a0. The actor proceeds to the prediction for o1, outputting the corresponding action a1; the vehicle is controlled to perform a1, and the critic uses the reward function, based on the observation state o2 after a1 and the target slot, to obtain the function value r1 corresponding to a1; and so on until the vehicle reaches a terminal state (reaching the target slot or hitting an obstacle). Through many such rounds of outputting an action a from the observation state o and controlling the vehicle to execute it, a complete vehicle trajectory from the initial position to the target slot is obtained; as training proceeds, the vehicle keeps exploring different paths and finally arrives at a sufficiently good parking strategy.
Obviously, those skilled in the art should understand that the modules or steps of the embodiments of the present invention above may be implemented by a general-purpose computing device; they may be centralized on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from the one here, or they may each be made into individual integrated-circuit modules, or multiple modules or steps among them may be made into a single integrated-circuit module. Thus, the embodiments of the present invention are not limited to any specific combination of hardware and software. The above is merely the preferred embodiments of the present invention and is not intended to limit it; for those skilled in the art, the embodiments may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (13)

  1. A parking method based on deep reinforcement learning, characterized in that: the method obtains a planned parking route by a deep reinforcement learning algorithm;
    during training of the deep reinforcement learning algorithm, a tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, the tuple being updated at fixed time intervals;
    based on the current vehicle observation state, a predicted action and reward function are output for route planning; after each tuple update, a predicted action and reward function are output again from the updated vehicle observation state for another round of route planning, until the vehicle reaches the target parking space; the planned parking route with the highest reward-function value is thereby obtained, and the superiority of the planned parking route can be evaluated by the following formula:
    Y = a*distance(car position, target position) + b*abs(car yaw - target yaw) + c*target reached;
    where Y denotes the superiority of the parking path; a and b denote the degree of completion of the control task; c denotes the extra reward for completing the task; assuming the planning task space is L metres by L metres, a = 1/L, b = 1/(2π), and c = 1; the distance() function returns the distance from the vehicle steering centre to the target slot point, the abs() function takes the absolute value of its argument, and target reached indicates whether the vehicle has reached the target slot: target reached = 1 if the vehicle reaches the target slot, otherwise target reached = 0.
  2. The method according to claim 1, characterized in that: the vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
  3. The method according to claim 2, characterized in that: the sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
  4. The method according to any one of claims 1-3, characterized in that: the vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
  5. The method according to any one of claims 1-4, characterized in that: the reward function represents the distance between the vehicle's terminal state and the target parking space; the closer the terminal state is to the target slot, the higher the reward value r obtained.
  6. The method according to any one of claims 1-5, characterized in that: in training the parking strategy with the deep reinforcement learning algorithm, a first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, used to quantify the quality of the current state; the second neural network takes the vehicle observation state as input and outputs the vehicle predicted action.
  7. A parking route acquisition system based on deep reinforcement learning, characterized in that: the system obtains a planned parking route by a deep reinforcement learning algorithm system;
    during training of the deep reinforcement learning algorithm system, a tuple is formed from the vehicle observation state, the vehicle predicted action, and a reward function, the tuple being updated at fixed time intervals;
    based on the current vehicle observation state, a predicted action and reward function are output for route planning; after each tuple update, a predicted action and reward function are output again from the updated vehicle observation state for another round of route planning, until the vehicle reaches the target parking space; the planned parking route with the highest reward-function value is thereby obtained, and the superiority of the planned parking route is related to the degree of completion of the control task and the extra reward for task completion.
  8. The system according to claim 7, characterized in that: the vehicle observation state includes the vehicle coordinates (x, y, yaw), where x and y denote the x- and y-coordinates of the vehicle steering centre in the coordinate system of the feasible area, and yaw is the angle between the vehicle's current attitude and the x-axis.
  9. The system according to any one of claims 7-8, characterized in that: the sensor information is the distance from each of the vehicle's four corner points to the nearest obstacle, measured by sensors installed at those corner points.
  10. The system according to any one of claims 7-9, characterized in that: the vehicle predicted action includes the vehicle linear speed and the vehicle steering angle.
  11. The system according to any one of claims 7-10, characterized in that: the reward function represents the distance between the vehicle's terminal state and the target parking space; the closer the terminal state is to the target slot, the higher the reward value r obtained.
  12. The system according to any one of claims 7-11, characterized in that: in training the parking strategy with the deep reinforcement learning algorithm, a first neural network and a second neural network are established, wherein the first neural network takes the vehicle observation state as input and outputs the reward-function value, used to quantify the quality of the current state; the second neural network takes the vehicle observation state as input and outputs the vehicle predicted action.
  13. The system according to any one of claims 7-12, characterized in that: the superiority of the planned parking route can be evaluated by the following formula:
    Y = a*distance(car position, target position) + b*abs(car yaw - target yaw) + c*target reached;
    where Y denotes the superiority of the parking path; a and b denote the degree of completion of the control task; c denotes the extra reward for completing the task; assuming the planning task space (i.e., the feasible area above) is L metres by L metres, a = 1/L, b = 1/(2π), and c = 1; the distance() function returns the distance from the vehicle steering centre to the target slot point, the abs() function takes the absolute value of its argument, and target reached indicates whether the vehicle has reached the target slot: target reached = 1 if the vehicle reaches the target slot, otherwise target reached = 0.
PCT/CN2018/113660 2018-09-20 2018-11-02 A parking strategy based on deep reinforcement learning WO2020056875A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811097576.9A CN110136481B (zh) 2018-09-20 2018-09-20 A parking strategy based on deep reinforcement learning
CN201811097576.9 2018-09-20

Publications (1)

Publication Number Publication Date
WO2020056875A1 true WO2020056875A1 (zh) 2020-03-26

Family

ID=67568416

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/113660 WO2020056875A1 (zh) 2018-09-20 2018-11-02 A parking strategy based on deep reinforcement learning

Country Status (2)

Country Link
CN (1) CN110136481B (zh)
WO (1) WO2020056875A1 (zh)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111645673A * 2020-06-17 2020-09-11 西南科技大学 An automatic parking method based on deep reinforcement learning
CN112862885A * 2021-01-22 2021-05-28 江苏丰华联合科技有限公司 A flexible-object unfolding method based on deep reinforcement learning
CN113553934A * 2021-07-19 2021-10-26 吉林大学 Intelligent decision-making method and system for unmanned ground vehicles based on deep reinforcement learning
CN113777918A * 2021-07-28 2021-12-10 张金宁 A control method for an intelligent drive-by-wire automobile chassis with a digital-twin architecture
CN113868113A * 2021-06-22 2021-12-31 中国矿业大学 A class integration test order generation method based on the Actor-Critic algorithm
CN113867334A * 2021-09-07 2021-12-31 华侨大学 A path planning method and system for unmanned mobile machinery
CN113867332A * 2021-08-18 2021-12-31 中国科学院自动化研究所 Self-learning control method, apparatus, device, and readable storage medium for an unmanned vehicle
CN113985870A * 2021-10-19 2022-01-28 复旦大学 A path planning method based on meta reinforcement learning
CN114003059A * 2021-11-01 2022-02-01 河海大学常州校区 UAV path planning method based on deep reinforcement learning under kinematic constraints
CN114020013A * 2021-10-26 2022-02-08 北航(四川)西部国际创新港科技有限公司 A UAV formation collision-avoidance method based on deep reinforcement learning
WO2022090040A1 * 2020-10-29 2022-05-05 Zf Friedrichshafen Ag Method and device for controlling a vehicle along a travel trajectory
CN114489059A * 2022-01-13 2022-05-13 沈阳建筑大学 Mobile robot path planning method based on D3QN-PER
CN114783178A * 2022-03-30 2022-07-22 同济大学 Adaptive control method, apparatus, and storage medium for a parking-lot exit barrier
CN114815813A * 2022-03-29 2022-07-29 山东交通学院 Efficient path planning method, apparatus, and medium based on an improved DDPG algorithm
CN115083199A * 2021-03-12 2022-09-20 上海汽车集团股份有限公司 A parking-slot information determination method and related device
CN115542733A * 2022-09-23 2022-12-30 福州大学 Adaptive dynamic window approach based on deep reinforcement learning
CN115862367A * 2022-11-28 2023-03-28 合肥工业大学 A method for controlling the travel path of a valet parking robot platform
CN116540731A * 2023-06-02 2023-08-04 东莞理工学院 Path planning method and system fusing stacked LSTM with the SAC algorithm
CN116533992A * 2023-07-05 2023-08-04 南昌工程学院 Automatic parking path planning method and system based on a deep reinforcement learning algorithm

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619442A (zh) * 2019-09-26 2019-12-27 浙江科技学院 Vehicle berth prediction method based on reinforcement learning
CN110716550B (zh) * 2019-11-06 2022-07-22 南京理工大学 Dynamic gear-shift strategy optimization method based on deep reinforcement learning
CN110843746B (zh) * 2019-11-28 2022-06-14 的卢技术有限公司 Anti-lock braking control method and system based on reinforcement learning
CN111098852B (zh) * 2019-12-02 2021-03-12 北京交通大学 Parking path planning method based on reinforcement learning
CN111026272B (zh) * 2019-12-09 2023-10-31 网易(杭州)网络有限公司 Training method and apparatus for virtual object behavior policies, electronic device, and storage medium
CN111026157B (zh) * 2019-12-18 2020-07-28 四川大学 Intelligent aircraft guidance method based on reward-reshaping reinforcement learning
CN112061116B (zh) * 2020-08-21 2021-10-29 浙江大学 Parking strategy using a reinforcement learning method based on potential field function approximation
CN112967516B (zh) * 2021-02-03 2022-07-26 芜湖泊啦图信息科技有限公司 Global dynamic path planning method for fast matching of parking-lot-side key parameters with the complete vehicle
CN113119957B (zh) * 2021-05-26 2022-10-25 苏州挚途科技有限公司 Parking trajectory planning method, apparatus and electronic device
CN113554300A (zh) * 2021-07-19 2021-10-26 河海大学 Real-time shared parking space allocation method based on deep reinforcement learning
CN114373324B (zh) * 2021-12-01 2023-05-09 江铃汽车股份有限公司 Parking space information sharing method and system
CN115223387B (zh) * 2022-06-08 2024-01-30 东风柳州汽车有限公司 Parking control system and method
CN115472038B (zh) * 2022-11-01 2023-02-03 南京杰智易科技有限公司 Automatic parking method and system based on deep reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103077615A (zh) * 2012-12-20 2013-05-01 长沙理工大学 Online learning method for optimizing queue length at signalized intersections
CN103774883A (zh) * 2012-10-19 2014-05-07 罗春松 Automatic stacking storage system for parking or storage
CN107792062A (zh) * 2017-10-16 2018-03-13 北方工业大学 Automatic parking control system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR0205395A (pt) * 2001-05-21 2003-06-17 Luk Lamellen & Kupplungsbau Control process for motor vehicles with an automated clutch device
US20120233102A1 (en) * 2011-03-11 2012-09-13 Toyota Motor Engin. & Manufact. N.A.(TEMA) Apparatus and algorithmic process for an adaptive navigation policy in partially observable environments
CN105128856B (zh) * 2015-08-24 2018-06-26 奇瑞汽车股份有限公司 Method and device for parking into a garage space
CN106970615B (zh) * 2017-03-21 2019-10-22 西北工业大学 Real-time online path planning method using deep reinforcement learning
CN108407805B (zh) * 2018-03-30 2019-07-30 中南大学 DQN-based automatic vehicle parking method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103774883A (zh) * 2012-10-19 2014-05-07 罗春松 Automatic stacking storage system for parking or storage
CN103077615A (zh) * 2012-12-20 2013-05-01 长沙理工大学 Online learning method for optimizing queue length at signalized intersections
CN107792062A (zh) * 2017-10-16 2018-03-13 北方工业大学 Automatic parking control system

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111645673B (zh) * 2020-06-17 2021-05-11 西南科技大学 Automatic parking method based on deep reinforcement learning
CN111645673A (zh) * 2020-06-17 2020-09-11 西南科技大学 Automatic parking method based on deep reinforcement learning
WO2022090040A1 (de) * 2020-10-29 2022-05-05 Zf Friedrichshafen Ag Method and device for controlling a vehicle along a travel trajectory
CN112862885A (zh) * 2021-01-22 2021-05-28 江苏丰华联合科技有限公司 Flexible object unfolding method based on deep reinforcement learning
CN112862885B (zh) * 2021-01-22 2023-07-21 江苏丰华联合科技有限公司 Flexible object unfolding method based on deep reinforcement learning
CN115083199B (zh) * 2021-03-12 2024-02-27 上海汽车集团股份有限公司 Parking space information determination method and related device
CN115083199A (zh) * 2021-03-12 2022-09-20 上海汽车集团股份有限公司 Parking space information determination method and related device
CN113868113A (zh) * 2021-06-22 2021-12-31 中国矿业大学 Class integration test order generation method based on the Actor-Critic algorithm
CN113553934B (zh) * 2021-07-19 2024-02-20 吉林大学 Intelligent decision-making method and system for unmanned ground vehicles based on deep reinforcement learning
CN113553934A (zh) * 2021-07-19 2021-10-26 吉林大学 Intelligent decision-making method and system for unmanned ground vehicles based on deep reinforcement learning
CN113777918A (zh) * 2021-07-28 2021-12-10 张金宁 Intelligent drive-by-wire chassis control method for automobiles based on a digital twin architecture
CN113867332A (zh) * 2021-08-18 2021-12-31 中国科学院自动化研究所 Self-learning control method, apparatus, device and readable storage medium for unmanned vehicles
CN113867334A (zh) * 2021-09-07 2021-12-31 华侨大学 Path planning method and system for unmanned mobile machinery
CN113867334B (zh) * 2021-09-07 2023-05-05 华侨大学 Path planning method and system for unmanned mobile machinery
CN113985870B (zh) * 2021-10-19 2023-10-03 复旦大学 Path planning method based on meta reinforcement learning
CN113985870A (zh) * 2021-10-19 2022-01-28 复旦大学 Path planning method based on meta reinforcement learning
CN114020013A (zh) * 2021-10-26 2022-02-08 北航(四川)西部国际创新港科技有限公司 UAV formation collision avoidance method based on deep reinforcement learning
CN114020013B (zh) * 2021-10-26 2024-03-15 北航(四川)西部国际创新港科技有限公司 UAV formation collision avoidance method based on deep reinforcement learning
CN114003059B (zh) * 2021-11-01 2024-04-16 河海大学常州校区 UAV path planning method based on deep reinforcement learning under kinematic constraints
CN114003059A (zh) * 2021-11-01 2022-02-01 河海大学常州校区 UAV path planning method based on deep reinforcement learning under kinematic constraints
CN114489059A (zh) * 2022-01-13 2022-05-13 沈阳建筑大学 Mobile robot path planning method based on D3QN-PER
CN114489059B (zh) * 2022-01-13 2024-02-02 沈阳建筑大学 Mobile robot path planning method based on D3QN-PER
CN114815813A (zh) * 2022-03-29 2022-07-29 山东交通学院 Efficient path planning method, device and medium based on an improved DDPG algorithm
CN114783178B (zh) * 2022-03-30 2023-08-08 同济大学 Adaptive parking lot exit barrier control method, device and storage medium
CN114783178A (zh) * 2022-03-30 2022-07-22 同济大学 Adaptive parking lot exit barrier control method, device and storage medium
CN115542733A (zh) * 2022-09-23 2022-12-30 福州大学 Adaptive dynamic window approach based on deep reinforcement learning
CN115862367B (zh) * 2022-11-28 2023-11-24 合肥工业大学 Control method for the travel path of a valet parking robot platform
CN115862367A (zh) * 2022-11-28 2023-03-28 合肥工业大学 Control method for the travel path of a valet parking robot platform
CN116540731A (zh) * 2023-06-02 2023-08-04 东莞理工学院 Path planning method and system fusing stacked LSTM with the SAC algorithm
CN116540731B (zh) * 2023-06-02 2024-03-26 东莞理工学院 Path planning method and system fusing stacked LSTM with the SAC algorithm
CN116533992A (zh) * 2023-07-05 2023-08-04 南昌工程学院 Automatic parking path planning method and system based on a deep reinforcement learning algorithm
CN116533992B (zh) * 2023-07-05 2023-09-22 南昌工程学院 Automatic parking path planning method and system based on a deep reinforcement learning algorithm

Also Published As

Publication number Publication date
CN110136481B (zh) 2021-02-02
CN110136481A (zh) 2019-08-16

Similar Documents

Publication Publication Date Title
WO2020056875A1 (zh) Parking strategy based on deep reinforcement learning
Loquercio et al. Deep drone racing: From simulation to reality with domain randomization
Tai et al. Towards cognitive exploration through deep reinforcement learning for mobile robots
CN109976340B Human-machine cooperative dynamic obstacle avoidance method and system based on deep reinforcement learning
CN112356830B Intelligent parking method based on model-based reinforcement learning
CN111694364A Hybrid algorithm combining an improved ant colony algorithm with the dynamic window approach for intelligent vehicle path planning
CN107063280A Intelligent vehicle path planning system and method based on control sampling
Zhang et al. Reinforcement learning-based motion planning for automatic parking system
CN110989576A Target following and dynamic obstacle avoidance control method for differential skid-steer vehicles
CN110745136A Adaptive driving control method
CN113219997B Mobile robot path planning method based on TPR-DDPG
Zhu et al. A hierarchical deep reinforcement learning framework with high efficiency and generalization for fast and safe navigation
CN116679719A Adaptive path planning method for unmanned vehicles based on the dynamic window approach and proximal policy optimization
CN113311828A Local path planning method, apparatus, device and storage medium for unmanned vehicles
US11911902B2 (en) Method for obstacle avoidance in degraded environments of robots based on intrinsic plasticity of SNN
Ma et al. Learning to navigate in indoor environments: From memorizing to reasoning
CN116804879A Robot path planning framework method fusing an improved dung beetle algorithm with the DWA algorithm
CN112665592B Multi-agent-based spatio-temporal path planning method
CN117109574A Coverage path planning method for agricultural transport machinery
CN115167393A Path planning method based on improved ant colony and dynamic window approaches in unknown environments
CN116127853A DDPG-based autonomous overtaking decision method incorporating temporal information
CN113959446B Neural-network-based autonomous logistics transport navigation method for robots
Guo et al. Optimal navigation for AGVs: A soft actor–critic-based reinforcement learning approach with composite auxiliary rewards
He et al. Intelligent navigation of indoor robot based on improved DDPG algorithm
Li et al. Research on the agricultural machinery path tracking method based on deep reinforcement learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18934169

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18934169

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 17/11/2022)
