WO2023065494A1 - 一种意图驱动的强化学习路径规划方法 - Google Patents
- Publication number: WO2023065494A1 (application PCT/CN2021/137549)
- Authority: WIPO (PCT)
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/0088—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots characterized by the autonomous decision making process, e.g. artificial intelligence, predefined behaviours
Definitions
- The invention belongs to the technical field of wireless communication, and in particular relates to an intent-driven reinforcement learning path planning method.
- Wireless sensor networks are widely used as a monitoring technology for observing the surrounding environment, for example in air-pollution monitoring, marine resource detection, and disaster warning.
- IoT sensors are usually energy-constrained devices with a limited transmission range, so data collectors are required to gather the sensor data for further forwarding or processing.
- Smart devices such as drones, unmanned ships, and unmanned submarines have been deployed in military and civilian applications to perform difficult or tedious tasks in hazardous and inaccessible environments.
- Although drones, unmanned ships, and unmanned submarines can conveniently complete the data collection of a monitoring network when acting as data collectors, they face the key challenge of limited energy. After leaving the base, the data collector must travel to the sensor nodes while avoiding collisions with environmental obstacles and sensor nodes, and must return to the base within a specified time to prevent energy exhaustion. It is therefore necessary to design the motion path of the data collector according to the intents of the data collector and the sensor nodes, so as to improve the data collection efficiency of the monitoring network.
- The present invention provides an intent-driven reinforcement learning path planning method which, according to the monitoring network environment as it changes in real time, expresses the intents of the data collector and the sensor nodes as rewards and penalties, and uses the Q-learning reinforcement learning method to plan the path of the data collector, improving the efficiency and reliability of data collection.
- An intent-driven reinforcement learning path planning method comprises the following steps:
- Step A: the data collector acquires the state of the monitoring network;
- Step B: determine the steering angle of the data collector according to the positions of the data collector, the sensor nodes, and the environmental obstacles;
- Step C: select the action of the data collector according to the ε-greedy strategy, including the speed of the data collector, the target node, and the next target node;
- Step D: the data collector adjusts its traveling direction according to the steering angle and executes the action to reach its position in the next time slot;
- Step E: calculate rewards and penalties according to the intents of the data collector and the sensor nodes, and update the Q value;
- Step F: repeat steps A to E until the monitoring network reaches the termination state or the Q-learning satisfies the convergence condition;
- Step G: the data collector selects the action with the largest Q value in each time slot as the planning result and generates an optimal data collection path.
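For illustration only, steps A through G map onto a standard tabular Q-learning loop. The sketch below is a generic reconstruction: `step_env` and `reset_env` stand in for the monitoring-network dynamics, and the state, action, and reward encodings are placeholders rather than the patented formulas.

```python
import random

def train_q_learning(step_env, reset_env, actions, episodes=200,
                     alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning skeleton for steps A-G: observe the state,
    pick an epsilon-greedy action, execute it, compute the reward,
    update Q, and finally read off the greedy (largest-Q) policy."""
    rng = random.Random(seed)
    Q = {}  # (state, action) -> value

    def q(s, a):
        return Q.get((s, a), 0.0)

    for _ in range(episodes):
        s = reset_env()                        # Step A: initial network state
        done = False
        while not done:
            if rng.random() < epsilon:         # Step C: epsilon-greedy choice
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda a_: q(s, a_))
            s_next, r, done = step_env(s, a)   # Step D: execute the action
            best_next = max(q(s_next, a_) for a_ in actions)
            # Step E: temporal-difference update of Q(s, a)
            Q[(s, a)] = q(s, a) + alpha * (r + gamma * best_next - q(s, a))
            s = s_next

    def greedy(s):                             # Step G: largest-Q action
        return max(actions, key=lambda a_: q(s, a_))

    return Q, greedy
```

A toy environment (for example, a short chain whose terminal state plays the role of "all data collected") is enough to exercise the loop; the returned greedy policy corresponds to step G's largest-Q-value selection.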
- The state s of the monitoring network in step A includes: the traveling direction of the data collector in time slot n, the coordinates q_u[n] of the data collector, the available storage space of the sensor nodes, the data-collection completion status of the sensor nodes, the distances between the data collector and the sensor nodes, and the distances between the data collector and the environmental obstacles, over the set of sensor nodes and the set of environmental obstacles; w_m[n] ∈ {0,1} is the indicator of data collection at sensor node m, where w_m[n] = 1 indicates that the data collector has completed data collection from sensor node m in time slot n, and w_m[n] = 0 indicates that it has not.
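As a sketch, the state s enumerated above can be grouped into one immutable record so that it can serve directly as a Q-table key; the field names below are illustrative, since the claim does not fix an encoding.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class NetworkState:
    """One observation of the monitoring network in time slot n.
    Field names are illustrative; the patent leaves the exact encoding open."""
    heading: float                      # traveling direction of the collector
    position: Tuple[float, float]       # coordinates q_u[n]
    free_storage: Tuple[float, ...]     # available storage per sensor node
    collected: Tuple[int, ...]          # w_m[n] in {0,1}: node m done or not
    node_dist: Tuple[float, ...]        # distance to each sensor node
    obstacle_dist: Tuple[float, ...]    # distance to each environmental obstacle
```

Because the dataclass is frozen and every field is hashable, an instance can key a dictionary-based Q table without further conversion.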
- The step of determining the target travel position in step B includes:
- Step B1: determine whether the data collector senses an obstacle. If it does, compare the magnitudes of the relative angles of the target sensor node with respect to the two points that the data collector detects, at its maximum perception angle, on the boundary of the environmental obstacle; the comparison selects which of the two boundary points is taken as the target travel position of the data collector.
- Step B2: if the data collector does not perceive an environmental obstacle, determine whether the path from the data collector to the next target node m2 passes through the communication area of the target node m1; if it does not pass through, the target travel position is the point of the communication area at the shortest distance from the path.
- Step B3: if it does pass through, determine whether the path passes through the safe area of the target node m1; if it does not pass through, the target travel position is retained; otherwise, the target travel position is the point of the safe area at the shortest distance from the path.
- The method of selecting an action by the ε-greedy strategy in step C is expressed as follows:
- ε is the exploration probability;
- β ∈ [0,1] is a randomly generated value;
- Q(s,a) is the Q value of executing action a in state s.
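A direct reading of this rule, with β drawn uniformly from [0,1] and compared against the exploration probability ε, can be sketched as follows; the Q-table layout (a dict keyed by (state, action)) is an assumption of the sketch.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon, rng=random):
    """Pick the data collector's action (speed, target node, next target):
    with probability epsilon explore a random action, otherwise exploit
    the action with the largest Q(state, a)."""
    beta = rng.random()                 # beta drawn uniformly from [0, 1]
    if beta < epsilon:
        return rng.choice(actions)      # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

With ε = 0 the rule is purely greedy; with ε = 1 it is purely exploratory.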
- The calculation formula for the position of the data collector in the next time slot in step D is expressed in terms of:
- x_u[n-1] and y_u[n-1], the x and y coordinates of the data collector in the previous time slot;
- v[n], the traveling speed of the data collector;
- τ, the duration of each time slot.
- The steps for calculating the rewards and penalties corresponding to the intents of the data collector and the sensor nodes in step E include:
- Step D1: the intent of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and to return to the base within the specified time T; the intent of the sensor nodes is to minimize the overflowed data. The reward R_a(s,s') of the Q-learning is then the weighted sum of the energy consumption of the data collector and the data overflow of the sensor nodes, where s' is the next state of the monitoring network after executing action a in state s, weighted by a weight factor.
- Step D2: according to the intents of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s') = θ_safe + θ_bou + θ_time + θ_tra + θ_ter, where θ_safe is the safety penalty, indicating that the distances between the data collector and the environmental obstacles and between the data collector and the sensor nodes must satisfy the anti-collision distance; θ_bou is the boundary penalty, indicating that the data collector must not leave its feasible region; θ_time is the time penalty, indicating that the data collector must complete data collection within time T; θ_tra is the traversal penalty, indicating that the data of all sensor nodes must be collected; and θ_ter is the terminal penalty, indicating that the data collector must return to the base within time T.
- ⁇ is the learning rate and ⁇ is the reward discount factor.
- The termination state of the monitoring network in step F is that the data collector has completed the data collection of the sensor nodes, or that the data collector has not completed the data collection by time T;
- the convergence condition of the Q-learning is expressed as |Q_j(s,a) - Q_{j-1}(s,a)| ≤ ξ, where
- ξ is the allowable learning error;
- j is the number of learning iterations.
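The convergence test |Q_j(s,a) − Q_{j−1}(s,a)| ≤ ξ, applied over every state-action pair, can be sketched as:

```python
def has_converged(Q_prev, Q_curr, xi):
    """Step-F convergence test: the change in every Q value between
    iteration j-1 and iteration j must stay within the tolerance xi."""
    keys = set(Q_prev) | set(Q_curr)
    return all(abs(Q_curr.get(k, 0.0) - Q_prev.get(k, 0.0)) <= xi
               for k in keys)
```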
- The intent-driven reinforcement learning path planning method is applicable to the terrestrial Internet of Things assisted by drones, ocean monitoring networks assisted by unmanned ships, and seabed sensor networks assisted by unmanned submarines.
- The Q-learning model optimizes the real-time coordinates of the data collector based on the current state information of the monitoring network, minimizing the intent mismatch and improving the efficiency and reliability of data collection.
- Fig. 1 is an example scenario diagram of the present invention;
- Fig. 2 is a schematic diagram of the implementation process of the present invention.
- As shown in Fig. 1,
- the unmanned ship starts from the base, avoids collisions with obstacles and sensor nodes, completes the data collection of each sensor node within the specified time T, and returns to the base.
- the weighted energy consumption of the unmanned ship and the data overflow of the sensor nodes are expressed as rewards of the reinforcement learning, the intents of safety, traversal collection, and on-time return to the base are expressed as penalties, and the Q-learning method is used to optimize the path of the unmanned ship.
- Fig. 2 is a schematic diagram of the implementation process of the present invention; the specific implementation steps are:
- Step 1: the data collector obtains the state information of the monitoring network, including: the traveling direction of the data collector in time slot n, the coordinates of the data collector, the available storage space of the sensor nodes, the data-collection completion status of the sensor nodes, the distances between the data collector and the sensor nodes, and the distances between the data collector and the environmental obstacles, over the set of sensor nodes and the set of environmental obstacles; w_m[n] ∈ {0,1} is the indicator of data collection at sensor node m.
- Step 2: according to the positions of the data collector, the sensor nodes, and the environmental obstacles, the steering angle of the data collector is determined by the following steps:
- Step 3 Select the action of the data collector according to the ⁇ -greedy strategy, including the speed of the data collector, the target node and the next target node.
- the method of ⁇ -greedy policy selection action is expressed as:
- ⁇ is the exploration probability
- ⁇ [0,1] is a randomly generated value
- Q(s,a) is the Q value of executing action a in state s.
- Step 4 The data collector adjusts the direction of travel according to the steering angle, and executes the action to the position of the next time slot.
- the coordinates of the data collector are expressed as:
- x_u[n-1] and y_u[n-1] are the x and y coordinates of the data collector in the previous time slot;
- v[n] is the traveling speed of the data collector
- ⁇ is the duration of each time slot.
- Step 5 Calculate rewards and penalties according to the intentions of data collectors and sensor nodes, and use the following formula to update the Q value:
- ⁇ is the learning rate and ⁇ is the reward discount factor.
- the calculation steps of rewards and penalties include:
- The intent of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and to return to the base within the specified time T; the intent of the sensor nodes is to minimize the overflowed data. The reward R_a(s,s') of the Q-learning is then the weighted sum of the energy consumption of the data collector and the data overflow of the sensor nodes, where s' is the next state of the monitoring network after executing action a in state s, weighted by a weight factor.
- Step 6 Repeat steps 1 to 5 until the monitoring network reaches the termination state or the Q-learning meets the convergence condition.
- the termination state is that the data collector has completed the data collection of the sensor nodes, or that the data collector has not completed the data collection by time T;
- the convergence condition of the Q-learning is expressed as |Q_j(s,a) - Q_{j-1}(s,a)| ≤ ξ, where
- ⁇ is the allowable error of learning
- j is the number of iterations of learning.
- Step 7 The data collector selects the action with the largest Q value in each time slot as the planning result, and generates an optimal data collection path.
- The intent-driven reinforcement learning path planning method of the present invention is applicable to the terrestrial Internet of Things assisted by drones, ocean monitoring networks assisted by unmanned ships, and seabed sensor networks assisted by unmanned submarines.
Abstract
An intent-driven reinforcement learning path planning method comprises: step 1, the data collector acquires the state of the monitoring network; step 2, the steering angle of the data collector is selected according to the positions of the environmental obstacles, the sensor nodes, and the data collector; step 3, the speed of the data collector, the target node, and the next target node are selected as the action according to the ε-greedy strategy; step 4, the data collector determines its position in the next time slot according to the selected steering angle and speed; step 5, rewards and penalties are obtained according to the intents of the data collector and the sensor nodes, and the Q value is updated; step 6, steps 1 to 5 are repeated until the termination state or the convergence condition is reached; step 7, the data collector selects the action with the largest Q value in each time slot as the planning result and generates an optimal path. The reinforcement learning path planning method completes data collection path planning with a high success probability and performance closer to the stated intents.
Description
The invention belongs to the technical field of wireless communication, and in particular relates to an intent-driven reinforcement learning path planning method.

With the development of the Internet of Things, wireless sensor networks are widely used as a monitoring technology for observing the surrounding environment, for example in air-pollution monitoring, marine resource detection, and disaster warning. These IoT sensors are usually energy-constrained devices with a limited transmission range, so data collectors are required to gather the sensor data for further forwarding or processing. In recent years, as automatic control systems have become more intelligent and reliable, smart devices such as drones, unmanned ships, and unmanned submarines have been deployed in military and civilian applications to perform difficult or tedious tasks in hazardous and inaccessible environments.

Although drones, unmanned ships, and unmanned submarines can conveniently complete the data collection of a monitoring network when acting as data collectors, they face the key challenge of limited energy. After leaving the base, the data collector must travel to the sensor nodes while avoiding collisions with environmental obstacles and sensor nodes, and must return to the base within a specified time to prevent energy exhaustion. It is therefore necessary to design the motion path of the data collector according to the intents of the data collector and the sensor nodes, so as to improve the data collection efficiency of the monitoring network.

Most existing data collection path planning schemes consider the intents of the data collector and the sensor nodes separately, and cannot adjust the data collection path for their differing intents. Moreover, existing path planning methods do not consider dynamic obstacles that appear and move randomly in the monitored environment. Existing path planning methods therefore suffer from low collection efficiency and low reliability.
Summary of the Invention
To solve the above technical problems, the present invention provides an intent-driven reinforcement learning path planning method which, according to the monitoring network environment as it changes in real time, expresses the intents of the data collector and the sensor nodes as rewards and penalties, and uses the Q-learning reinforcement learning method to plan the path of the data collector, improving the efficiency and reliability of data collection.

An intent-driven reinforcement learning path planning method comprises the following steps:

Step A: the data collector acquires the state of the monitoring network;

Step B: determine the steering angle of the data collector according to the positions of the data collector, the sensor nodes, and the environmental obstacles;

Step C: select the action of the data collector according to the ε-greedy strategy, including the speed of the data collector, the target node, and the next target node;

Step D: the data collector adjusts its traveling direction according to the steering angle and executes the action to reach its position in the next time slot;

Step E: calculate rewards and penalties according to the intents of the data collector and the sensor nodes, and update the Q value;

Step F: repeat steps A to E until the monitoring network reaches the termination state or the Q-learning satisfies the convergence condition;

Step G: the data collector selects the action with the largest Q value in each time slot as the planning result and generates an optimal data collection path.
Further, the state s of the monitoring network in step A includes: the traveling direction of the data collector in time slot n, the coordinates q_u[n] of the data collector, the available storage space of the sensor nodes, the data-collection completion status of the sensor nodes, the distances between the data collector and the sensor nodes, and the distances between the data collector and the environmental obstacles, over the set of sensor nodes and the set of environmental obstacles; w_m[n] ∈ {0,1} is the indicator of data collection at sensor node m, where w_m[n] = 1 indicates that the data collector has completed data collection from sensor node m in time slot n, and w_m[n] = 0 indicates that it has not.
Further, the calculation formula for the steering angle of the data collector in step B is expressed as:

Further, the step of determining the target travel position in step B includes:

Step B1: determine whether the data collector senses an obstacle. If it does, compare the magnitudes of the relative angles of the target sensor node with respect to the two points that the data collector detects, at its maximum perception angle, on the boundary of the environmental obstacle; the comparison selects which of the two boundary points is taken as the target travel position of the data collector.
Further, the method of selecting an action by the ε-greedy strategy in step C is expressed as follows, where ε is the exploration probability, β ∈ [0,1] is a randomly generated value, and Q(s,a) is the Q value of executing action a in state s.
Further, the calculation formula for the position of the data collector in the next time slot in step D is expressed in terms of x_u[n-1] and y_u[n-1], the x and y coordinates of the data collector in the previous time slot, v[n], the traveling speed of the data collector, and τ, the duration of each time slot.
Further, the steps for calculating the rewards and penalties corresponding to the intents of the data collector and the sensor nodes in step E include:

Step D1: the intent of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and to return to the base within the specified time T; the intent of the sensor nodes is to minimize the overflowed data. The reward R_a(s,s') of the Q-learning is then the weighted sum of the energy consumption of the data collector and the data overflow of the sensor nodes, where s' is the next state of the monitoring network after executing action a in state s, weighted by a weight factor.

Step D2: according to the intents of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s') = θ_safe + θ_bou + θ_time + θ_tra + θ_ter, where θ_safe is the safety penalty, indicating that the distances between the data collector and the environmental obstacles and between the data collector and the sensor nodes must satisfy the anti-collision distance; θ_bou is the boundary penalty, indicating that the data collector must not leave its feasible region; θ_time is the time penalty, indicating that the data collector must complete data collection within time T; θ_tra is the traversal penalty, indicating that the data of all sensor nodes must be collected; and θ_ter is the terminal penalty, indicating that the data collector must return to the base within time T.
Further, in the Q-value update formula of step E, α is the learning rate and γ is the reward discount factor.
Further, the termination state of the monitoring network in step F is that the data collector has completed the data collection of the sensor nodes, or that the data collector has not completed the data collection by time T; the convergence condition of the Q-learning is expressed as:

|Q_j(s,a) - Q_{j-1}(s,a)| ≤ ξ  (5)

where ξ is the allowable learning error and j is the number of learning iterations.
Further, the intent-driven reinforcement learning path planning method is applicable to the terrestrial Internet of Things assisted by drones, ocean monitoring networks assisted by unmanned ships, and seabed sensor networks assisted by unmanned submarines.

The intent-driven reinforcement learning path planning method of the present invention has the following advantages:

Based on the random dynamic obstacles in the monitored environment and the real-time sensing data, and jointly considering the intents of the data collector and the sensor nodes, a data collection path planning method with full node coverage is designed. The Q-learning model optimizes the real-time coordinates of the data collector according to the current state information of the monitoring network, minimizing the intent mismatch while improving the efficiency and reliability of data collection.
Fig. 1 is an example scenario diagram of the present invention;

Fig. 2 is a schematic diagram of the implementation process of the present invention.

The intent-driven reinforcement learning path planning method of the present invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is an example scenario diagram of the present invention. As shown in Fig. 1, the ocean monitoring network contains one unmanned ship, M sensor nodes, and K obstacles such as islands, waves, and reefs. The unmanned ship starts from the base, avoids collisions with obstacles and sensor nodes, completes the data collection of every sensor node within the specified time T, and returns to the base. To satisfy the intents of the unmanned ship and the sensor nodes, the weighted energy consumption of the unmanned ship and the data overflow of the sensor nodes are expressed as rewards of the reinforcement learning, the intents of safety, traversal collection, and on-time return to the base are expressed as penalties, and the Q-learning method is used to optimize the path of the unmanned ship.
Fig. 2 is a schematic diagram of the implementation process of the present invention; the specific implementation steps are:

Step 1: the data collector acquires the state information of the monitoring network, including: the traveling direction of the data collector in time slot n, the coordinates q_u[n] of the data collector, the available storage space of the sensor nodes, the data-collection completion status of the sensor nodes, the distances between the data collector and the sensor nodes, and the distances between the data collector and the environmental obstacles, over the set of sensor nodes and the set of environmental obstacles; w_m[n] ∈ {0,1} is the indicator of data collection at sensor node m, where w_m[n] = 1 indicates that the data collector has completed data collection from sensor node m in time slot n, and w_m[n] = 0 indicates that it has not.
Step 2: according to the positions of the data collector, the sensor nodes, and the environmental obstacles, the steering angle of the data collector is determined by the following steps:

(1) Determine whether the data collector senses an obstacle. If it does, compare the magnitudes of the relative angles of the target sensor node with respect to the two points that the data collector detects, at its maximum perception angle, on the boundary of the environmental obstacle; the comparison selects which of the two boundary points is taken as the target travel position of the data collector.
(4) Calculate the steering angle of the data collector using the following formula:
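The formula referenced here is not reproduced in this text, so the following is an assumed reconstruction: steer by the wrapped difference between the bearing to the chosen target travel position and the current heading. Names and the wrapping convention are illustrative.

```python
import math

def steering_angle(collector_xy, target_xy, current_heading):
    """Assumed steering computation: turn by the angle between the bearing
    to the target travel position and the current heading, wrapped to
    (-pi, pi]. A plausible sketch, not the verbatim patented formula."""
    dx = target_xy[0] - collector_xy[0]
    dy = target_xy[1] - collector_xy[1]
    bearing = math.atan2(dy, dx)            # direction to the target point
    turn = bearing - current_heading
    return math.atan2(math.sin(turn), math.cos(turn))  # wrap the angle
```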
Step 3: select the action of the data collector according to the ε-greedy strategy, including the speed of the data collector, the target node, and the next target node. The method of selecting an action by the ε-greedy strategy is expressed in terms of ε, the exploration probability, β ∈ [0,1], a randomly generated value, and Q(s,a), the Q value of executing action a in state s.
Step 4: the data collector adjusts its traveling direction according to the steering angle and executes the action to reach its position in the next time slot; the coordinates of the data collector are expressed in terms of x_u[n-1] and y_u[n-1], the x and y coordinates of the data collector in the previous time slot, v[n], the traveling speed of the data collector, and τ, the duration of each time slot.
Step 5: calculate rewards and penalties according to the intents of the data collector and the sensor nodes, and update the Q value using the following formula, where α is the learning rate and γ is the reward discount factor.
The calculation of the rewards and penalties comprises:

(1) The intent of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and to return to the base within the specified time T; the intent of the sensor nodes is to minimize the overflowed data. The reward R_a(s,s') of the Q-learning is then the weighted sum of the energy consumption of the data collector and the data overflow of the sensor nodes, where s' is the next state of the monitoring network after executing action a in state s, weighted by a weight factor.

(2) According to the intents of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s') = θ_safe + θ_bou + θ_time + θ_tra + θ_ter, where θ_safe is the safety penalty, indicating that the distances between the data collector and the environmental obstacles and between the data collector and the sensor nodes must satisfy the anti-collision distance; θ_bou is the boundary penalty, indicating that the data collector must not leave its feasible region; θ_time is the time penalty, indicating that the data collector must complete data collection within time T; θ_tra is the traversal penalty, indicating that the data of all sensor nodes must be collected; and θ_ter is the terminal penalty, indicating that the data collector must return to the base within time T.
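As a sketch of the signals in this step: the reward is the weighted sum of energy consumption and data overflow (negated here, a sign convention the source leaves open), and the penalty C_a(s,s') accumulates the five θ terms as violation indicators with an illustrative common magnitude. All names and magnitudes below are assumptions.

```python
def reward_and_penalty(energy, overflow, w_energy, w_overflow,
                       safe_ok, in_bounds, on_time, all_collected, returned):
    """Illustrative step-5 signals: a weighted energy/overflow reward and a
    penalty summing theta_safe, theta_bou, theta_time, theta_tra, theta_ter
    as indicator terms. Sign convention and magnitudes are assumptions."""
    reward = -(w_energy * energy + w_overflow * overflow)
    theta = 10.0  # illustrative magnitude applied to each violated intent
    violations = [not safe_ok,        # theta_safe: anti-collision distance
                  not in_bounds,      # theta_bou: feasible region
                  not on_time,        # theta_time: finish within T
                  not all_collected,  # theta_tra: traverse every node
                  not returned]       # theta_ter: return to base by T
    penalty = sum(theta for violated in violations if violated)
    return reward, penalty
```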
Step 6: repeat steps 1 to 5 until the monitoring network reaches the termination state or the Q-learning satisfies the convergence condition. The termination state is that the data collector has completed the data collection of the sensor nodes, or that the data collector has not completed the data collection by time T; the convergence condition of the Q-learning is expressed as:

|Q_j(s,a) - Q_{j-1}(s,a)| ≤ ξ  (5)

where ξ is the allowable learning error and j is the number of learning iterations.
Step 7: the data collector selects the action with the largest Q value in each time slot as the planning result and generates an optimal data collection path.
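This greedy read-out of the learned Q table can be sketched as follows; `step_env` is a stand-in for the monitoring-network dynamics, not part of the claimed method.

```python
def extract_path(Q, start_state, actions, step_env, max_slots=100):
    """Roll the environment forward, picking the largest-Q action in each
    time slot, and return the visited states as the planned path.
    step_env(s, a) -> (next_state, done) is an assumed interface."""
    path = [start_state]
    s = start_state
    for _ in range(max_slots):
        a = max(actions, key=lambda a_: Q.get((s, a_), 0.0))
        s, done = step_env(s, a)
        path.append(s)
        if done:
            break
    return path
```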
The intent-driven reinforcement learning path planning method of the present invention is applicable to the terrestrial Internet of Things assisted by drones, ocean monitoring networks assisted by unmanned ships, and seabed sensor networks assisted by unmanned submarines.

It will be understood that the present invention has been described through certain embodiments, and that those skilled in the art may make various changes or equivalent substitutions to these features and embodiments without departing from the spirit and scope of the invention. In addition, under the teaching of the present invention, these features and embodiments may be modified to suit particular situations and materials without departing from the spirit and scope of the invention. The invention is therefore not limited by the specific embodiments disclosed herein, and all embodiments falling within the scope of the claims of the present application fall within the scope of protection of the invention.
Claims (9)
- An intent-driven reinforcement learning path planning method, characterized in that it comprises the following steps: Step A: the data collector acquires the state of the monitoring network; Step B: determine the steering angle of the data collector according to the positions of the data collector, the sensor nodes, and the environmental obstacles; Step C: select the action of the data collector according to the ε-greedy strategy, including the speed of the data collector, the target node, and the next target node; Step D: the data collector adjusts its traveling direction according to the steering angle and executes the action to reach its position in the next time slot; Step E: calculate rewards and penalties according to the intents of the data collector and the sensor nodes, and update the Q value; Step F: repeat steps A to E until the monitoring network reaches the termination state or the Q-learning satisfies the convergence condition; Step G: the data collector selects the action with the largest Q value in each time slot as the planning result and generates an optimal data collection path.
- The intent-driven reinforcement learning path planning method according to claim 3, characterized in that the step of determining the target travel position in step B comprises: Step B1: determine whether the data collector senses an obstacle; if it does, compare the magnitudes of the relative angles of the target sensor node with respect to the two points that the data collector detects, at its maximum perception angle, on the boundary of the environmental obstacle, and take the corresponding boundary point as the target travel position of the data collector;
- The intent-driven reinforcement learning path planning method according to claim 1, characterized in that the steps for calculating the rewards and penalties corresponding to the intents of the data collector and the sensor nodes in step E comprise: Step D1: the intent of the data collector is to safely complete the data collection of all sensor nodes with the minimum energy consumption E_tot and to return to the base within the specified time T; the intent of the sensor nodes is to minimize the overflowed data; the reward R_a(s,s') of the Q-learning is then the weighted sum of the energy consumption of the data collector and the data overflow of the sensor nodes, where s' is the next state of the monitoring network after executing action a in state s, weighted by a weight factor; Step D2: according to the intents of the data collector and the sensor nodes, the penalty of the Q-learning is C_a(s,s') = θ_safe + θ_bou + θ_time + θ_tra + θ_ter, where θ_safe is the safety penalty, indicating that the distances between the data collector and the environmental obstacles and between the data collector and the sensor nodes must satisfy the anti-collision distance; θ_bou is the boundary penalty, indicating that the data collector must not leave its feasible region; θ_time is the time penalty, indicating that the data collector must complete data collection within time T; θ_tra is the traversal penalty, indicating that the data of all sensor nodes must be collected; and θ_ter is the terminal penalty, indicating that the data collector must return to the base within time T.
- The intent-driven reinforcement learning path planning method according to claim 1, characterized in that the termination state of the monitoring network in step F is that the data collector has completed the data collection of the sensor nodes, or that the data collector has not completed the data collection by time T; the convergence condition of the Q-learning is expressed as: |Q_j(s,a) - Q_{j-1}(s,a)| ≤ ξ (5), where ξ is the allowable learning error and j is the number of learning iterations.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202111208888.4 | 2021-10-18 | |
CN202111208888.4A (CN113848868B) | 2021-10-18 | 2021-10-18 | An intent-driven reinforcement learning path planning method
Publications (1)
Publication Number | Publication Date
---|---
WO2023065494A1 | 2023-04-27
Family
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
PCT/CN2021/137549 (WO2023065494A1) | An intent-driven reinforcement learning path planning method | 2021-10-18 | 2021-12-13
Country Status (2)
Country | Link
---|---
CN | CN113848868B
WO | WO2023065494A1
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070269077A1 (en) * | 2006-05-17 | 2007-11-22 | The Boeing Company | Sensor scan planner |
CN111515932A (zh) * | 2020-04-23 | 2020-08-11 | 东华大学 | 一种基于人工势场与强化学习的人机共融流水线实现方法 |
CN112672307A (zh) * | 2021-03-18 | 2021-04-16 | 浙江工商大学 | 一种基于q学习的无人机辅助数据收集系统及方法 |
CN112866911A (zh) * | 2021-01-11 | 2021-05-28 | 燕山大学 | 基于q学习的自主水下航行器协助下水下数据收集方法 |
CN113190039A (zh) * | 2021-04-27 | 2021-07-30 | 大连理工大学 | 一种基于分层深度强化学习的无人机采集路径规划方法 |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110730486B (zh) * | 2019-09-09 | 2022-10-14 | 南京理工大学 | 基于Q-Learning算法获取无线体域网最优路径的方法 |
CN110856134B (zh) * | 2019-10-16 | 2022-02-11 | 东南大学 | 一种基于无人机的大规模无线传感器网络数据收集方法 |
CN113342029B (zh) * | 2021-04-16 | 2022-06-21 | 山东师范大学 | 基于无人机群的最大传感器数据采集路径规划方法及系统 |
CN113283169B (zh) * | 2021-05-24 | 2022-04-26 | 北京理工大学 | 一种基于多头注意力异步强化学习的三维群体探索方法 |
CN113406965A (zh) * | 2021-05-31 | 2021-09-17 | 南京邮电大学 | 一种基于强化学习的无人机能耗优化方法 |
- 2021-10-18: CN application CN202111208888.4A filed; granted as patent CN113848868B (active)
- 2021-12-13: PCT application PCT/CN2021/137549 filed as WO2023065494A1
Non-Patent Citations (1)
Title |
---|
WANG GICHEOL; LEE BYOUNG-SUN; AHN JAE YOUNG: "UAV-Assisted Cluster Head Election for a UAV-Based Wireless Sensor Network", 2018 IEEE 6TH INTERNATIONAL CONFERENCE ON FUTURE INTERNET OF THINGS AND CLOUD (FICLOUD), IEEE, 6 August 2018 (2018-08-06), pages 267 - 274, XP033399738, DOI: 10.1109/FiCloud.2018.00046 * |
Also Published As
Publication number | Publication date |
---|---|
CN113848868A (zh) | 2021-12-28 |
CN113848868B (zh) | 2023-09-22 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | WWE | WIPO information: entry into national phase | Ref document number: 17923114; Country of ref document: US
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21961232; Country of ref document: EP; Kind code of ref document: A1