WO2022241808A1 - Multi-robot trajectory planning method - Google Patents

Multi-robot trajectory planning method

Info

Publication number
WO2022241808A1
Authority
WO
WIPO (PCT)
Prior art keywords
robot
learning
state
reward
value
Prior art date
Application number
PCT/CN2021/095970
Other languages
English (en)
French (fr)
Inventor
张弓
侯至丞
杨文林
吕浩亮
吴月玉
徐征
梁济民
张治彪
Original Assignee
广州中国科学院先进技术研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州中国科学院先进技术研究所 filed Critical 广州中国科学院先进技术研究所
Publication of WO2022241808A1 publication Critical patent/WO2022241808A1/zh

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Definitions

  • The invention relates to the technical field of multi-robot collaborative control, and in particular to a multi-robot trajectory planning method.
  • With the ever-changing processing volume and working environment of complex steel/aluminium components, multi-machine collaborative operation has replaced single-machine construction and become a research hotspot for building intelligent production lines.
  • Compared with a single-robot system, a multi-robot system offers strong adaptability to the environment, high self-regulation ability, wide spatial distribution, better data redundancy, and greater robustness.
  • Through collaboration among multiple robots, high-precision operations and efficient processing that a single robot cannot accomplish can be completed reliably.
  • Welding is a potentially hazardous, labor-intensive job that demands a high degree of skill.
  • The traditional robot welding process mostly relies on manual teaching to generate welding trajectories, which is not only time-consuming and labor-intensive but also imprecise, and, constrained by the robot's limited workspace, it can hardly achieve three-dimensional welding of arbitrary complex curves on complex components. Collaborative operation among multiple robots is therefore urgently needed: within the same station area, several robots respectively clamp, handle, flip and weld the workpiece (the object to be welded) in a coordinated way, cooperating with one another to meet the requirements of punctuality, synchronization, coordination and efficiency.
  • For complex welding tasks, the spatial three-dimensional complex trajectory planning of robot welding must ensure not only that the multi-robot system never conflicts with obstacles in the environment, but also that the robots maintain given positions relative to one another, especially when their workspaces overlap heavily.
  • When the workspaces of multiple robots overlap heavily, the robots must cooperate within the same station area to clamp, handle, flip and weld the workpiece.
  • Every robot in the multi-robot system must act independently while collaborating with the other robots.
  • A multi-robot collaboration scheme depends heavily on conditions such as the position and speed of each robot. How to execute complex tasks collaboratively and efficiently while satisfying spatial three-dimensional complex trajectory planning is the key problem to be solved at present.
  • In view of this, the present invention proposes a multi-robot trajectory planning method that fuses deep Q-learning with the convolutional neural network (CNN) algorithm, so that multiple robots can cooperate with one another without interference, thereby realizing spatial three-dimensional complex trajectory planning for multiple robots.
  • The present invention solves the above problems by the following technical means:
  • A multi-robot trajectory planning method, comprising the following steps:
  • Deep Q-learning analyzes the current trajectory vector from the state of the environment around the robots, and a reward network for deep Q-learning is designed; the current trajectory vector and the expected trajectory vector are both used as inputs of the reward network, and its output is reward information used to train the parameters of the convolutional neural network CNN;
  • The convolutional neural network CNN algorithm likewise analyzes the current trajectory vector from the state of the environment around the robots; the current trajectory vector is used as the input of the CNN, and the CNN trained on the reward information outputs the corresponding action information to the environment information;
  • A resource-based multi-robot task allocation algorithm is then used to reasonably allocate all actions on the workpiece to multiple robots, so that the robots can cooperate with one another without interference, thereby realizing spatial three-dimensional complex trajectory planning for multiple robots.
  • The basic structure of the convolutional neural network CNN is: input layer → convolution layer → pooling layer → (repeated convolution and pooling layers) → fully connected layer → output result.
  • The multi-robot trajectory planning method fuses deep Q-learning with the convolutional neural network (CNN) algorithm and adopts an experience-replay technique: the learning experience generated at each time step is stored, together with many other events, in a data set, which is called memory regeneration; at each update, learning samples are drawn from this reconstructed memory with a certain probability, reusing experience data and reducing the correlation between samples.
  • The multi-robot trajectory planning method fuses deep Q-learning with the convolutional neural network (CNN) algorithm and uses experience data according to the different roles assigned to individual robots; before learning starts, a different expected value is set for each robot's role, and learning is performed so that the compensation value keeps increasing; if the algorithm's search time becomes too long, the compensation value is reduced and learning is performed so that the search time does not grow; a preprocessing stage uses the convolutional neural network CNN to find outliers, and a postprocessing stage uses these singular points to learn the data; in the preprocessing stage, the input image is searched for image features, which are collected and learned.
  • In deep Q-learning, when the robot works in a discrete, constrained environment, it selects one of a set of deterministic behaviors in each time interval and is assumed to be in a Markov state, whose state changes with different probabilities:
  • P_r[s_{t+1} = s′ | s_t, a_t] = P_r[a_t]  (1)
  • where P_r is the state-transition probability matrix under random reward r, t is the time interval, s is the state, s′ is any possible next state, s_t is the state at time t, s_{t+1} is the state at time t+1, a_t is the action at time t, and r is the random reward;
  • In each time interval t, the robot acquires the state s from the environment and then performs the action a_t; it receives a random reward r, which depends on the state and behavior of the expected reward R_{s_t}, in order to find the optimal policy the system wants to achieve, as shown in formula (2), where a_i is the action at time i ∈ [1, n], i ∈ [1, n] and j ∈ [1, m] are time points, r_{t+j} is the random reward at time t+j, and γ is the decay coefficient; the discount factor means that a reward received within time interval t has less impact than the reward received now;
  • The action-value function V_a is computed from the policy function π and the policy-value function V_p, as shown in formula (3); starting from state s and following the policy, the state-value function of the expected reward is:
  • V_a(s_t) ≡ R_s(π(s_t)) + γ∑P_xy[π(s_t)]V_p(s_t)  (3)
  • where R_s is the expected reward in state s, P_xy is the conditional probability distribution matrix, and γ is the decay coefficient;
  • Q(s_t, a_t) is the Q value of taking action a_t in state s_t, corresponding to the newly computed Q(s_{t-1}, a_{t-1}), while Q(s_{t-1}, a_{t-1}) corresponds to the current Q(s_{t-1}, a_{t-1}) value and the next state of the current Q(s_{t-1}, a_{t-1}).
  • The Q value is shared during learning and used by the learning machine; to optimize the Q-value update, it is necessary to define an objective function, which is defined as the error between the target value and the predicted value of the Q value;
  • The objective function is shown in equation (5):
  • where a is the action and a′ is any possible next action; the basic information for obtaining the loss function is the transition <s, a, r, s′>; therefore, first, a forward pass of the Q network is performed with the state as input to obtain the action values of all actions; after the environment return <r, s′> of action a is obtained, the state is used again to obtain the action values of all actions; then the loss function is obtained from all the information gathered, and this function updates the weight parameters so that the Q-value update of the selected action converges, that is, the predicted value gets as close as possible to the target value; for the compensation function, if the distance to the current target point has decreased compared with before, the compensation increases greatly, and as the distance keeps getting closer the compensation is reduced.
  • Two networks are used, a target Q network and a Q network; the two networks have the same structure and differ only in their weight parameters; to smooth convergence in deep Q-learning, the target network is not updated continuously but periodically;
  • The root-mean-square propagation (RMSProp) algorithm is used as the optimizer, and the learning rate is adjusted according to the parameter gradients; when the training set keeps changing, unlike the case of a fixed training set, the parameters must be changed constantly.
  • The robot continuously consumes its resources while executing a task, and these resources must be replenished during the run; according to its resource level, the robot computes task performance by considering every possible combination of visits to resource stations, which enables it to reduce unnecessary waste of time and resources during tasks.
  • The beneficial effects of the present invention at least include:
  • The present invention fuses deep Q-learning with the convolutional neural network (CNN) algorithm: the CNN algorithm analyzes the accurate position from information about the surrounding environment, each robot moves according to the position obtained by the deep Q-learning analysis, and a resource-based robot task allocation method then reasonably allocates all the solder joints of the workpiece to multiple welding robots, so that the robots cooperate with one another without interference.
  • Spatial three-dimensional complex trajectory planning for multiple robots is thus realized, and the optimal collaborative path is finally planned for them, enabling the robots to cooperate without interference and to execute complex tasks collaboratively and efficiently.
  • Fig. 1 is a schematic diagram of the deep Q-learning of the present invention;
  • Fig. 2 is a schematic structural diagram of the convolutional neural network CNN of the present invention;
  • Fig. 3 is a flow chart of trajectory planning fusing deep Q-learning and the convolutional neural network (CNN) algorithm of the present invention.
  • Depending on the situation, each robot can be regarded either as a dynamic obstacle or as a collaborating robot; that is, each robot in the system can perform independent actions according to a given task while cooperating with the others; after an action is selected, its relation to the goal is evaluated and each robot is rewarded or punished in order to learn.
  • The reinforcement learning used here is deep Q-learning (DQN); by sharing the Q parameters of each robot, it consumes less trajectory-search time and can be applied to both static and dynamic multi-robot environments.
  • The principle of the multi-robot trajectory planning of the present invention based on deep Q-learning is shown in FIG. 1.
  • A robot that selects an action as its output recognizes the environment and receives the state of the environment; when the state changes, the state transition is delivered to the individual as a reinforcement signal; the behavior of each robot is chosen so that the sum of the reinforcement-signal values increases over a longer period of time.
  • The role of the action is to provide the control strategy for the control system.
  • The ultimate goal of the multi-robot collaborative clamping/handling/flipping/welding system is to maximize the infinitely accumulated reward value during the state (multi-robot collaborative operation) process, so as to achieve optimal trajectory planning for the environment (the robots and the workpiece).
  • When a robot works in a discrete, constrained environment, it selects one of a set of deterministic behaviors in each time interval and is assumed to be in a Markov state whose state changes with different probabilities.
  • P_r[s_{t+1} = s′ | s_t, a_t] = P_r[a_t]  (1)
  • where P_r is the state-transition probability matrix under random reward r, t is the time interval, s is the state, s′ is any possible next state, s_t is the state at time t, s_{t+1} is the state at time t+1, a_t is the action at time t, and r is the random reward.
  • In each time interval t, the robot acquires the state s from the environment and then performs the action a_t; it receives a random reward r, which depends on the state and behavior of the expected reward R_{s_t}, in order to find the optimal policy the system wants to achieve, as shown in formula (2).
  • where a_i is the action at time i ∈ [1, n], i ∈ [1, n] and j ∈ [1, m] are time points, r_{t+j} is the random reward at time t+j, and γ is the decay coefficient; the discount factor means that a reward received within time interval t has less impact than the reward received now.
  • The action-value function V_a is computed from the policy function π and the policy-value function V_p, as shown in formula (3):
  • V_a(s_t) ≡ R_s(π(s_t)) + γ∑P_xy[π(s_t)]V_p(s_t)  (3)
  • where R_s is the expected reward in state s, P_xy is the conditional probability distribution matrix, and γ ∈ [0, 1] is the decay coefficient; it follows that at least one optimal policy exists, and the goal of Q-learning is to establish an optimal policy without initial conditions; for a policy, the Q value can be defined as follows:
  • Q_p(s_t, a_t) = R_s(a_t) + γ∑P_xy[π(s_t)]V_p(s_t)  (4)
  • where Q(s_t, a_t) is the Q value of taking action a_t in state s_t, corresponding to the newly computed Q(s_{t-1}, a_{t-1}), while Q(s_{t-1}, a_{t-1}) corresponds to the current Q(s_{t-1}, a_{t-1}) value and the next state of the current Q(s_{t-1}, a_{t-1}).
  • The convolutional neural network (CNN), proposed by Yann LeCun of New York University in 1998, can be regarded as a generalized form of the neocognitron and a variant of the multilayer perceptron (MLP), which is also called an artificial neural network (ANN).
  • The basic structure of the convolutional neural network CNN adopted in the present invention is: input layer → convolution layer → pooling layer → (repeated convolution and pooling layers) → fully connected layer → output layer, as shown in Figure 2.
  • The environment-information image is 2560 × 2000, the input layer is an integer multiple of 2, there are 16 convolution layers using 3 × 3 filters, the pooling layer reduces the dimensionality of the convolution result, and there are 3 fully connected layers.
  • The trajectory-planning flow of the present invention, which fuses deep Q-learning and the convolutional neural network (CNN) algorithm, is shown in FIG. 3.
  • First, the reward network is designed: the two pieces of state information (the current trajectory vector and the expected trajectory vector) are both used as its inputs, and its output is reward information used to train the parameters of the convolutional neural network CNN.
  • The current trajectory vector is driven to match the expected trajectory vector by means of advanced seam-tracking technology.
  • The current trajectory vector is also used as the input of the convolutional neural network CNN.
  • The trained convolutional neural network CNN outputs the corresponding action information to the environment information (the robots and the workpiece), so that the robots can collaboratively clamp, handle, flip and weld spatial three-dimensional complex weld seams.
  • The present invention fuses deep Q-learning and the convolutional neural network (CNN) algorithm and adopts an experience-replay technique: the learning experience generated at each time step is stored, together with many other events, in a data set, which is also called memory regeneration.
  • At each update, learning samples are drawn from this reconstructed memory with a certain probability; reusing experience data and reducing the correlation between samples improves data efficiency.
  • The method uses experience data according to the different roles assigned to individual robots; before learning starts, a different expected value is set for each robot's role, and learning is performed so that the compensation value keeps increasing. If the algorithm's search time becomes too long, the compensation value is reduced and learning is performed so that the search time does not grow.
  • The preprocessing stage uses the convolutional neural network (CNN) to find outliers, and the postprocessing stage uses these singular points to learn the data.
  • In the preprocessing stage, the input image is searched for image features, which are collected and learned. In this case, Q values are learned for each robot assigned a different role, but the CNN values have the same input and different expected values.
  • The Q values are shared during learning and used by the learning machine.
  • To optimize the Q-value update, an objective function is defined as the error between the target value and the predicted value of the Q value.
  • The objective function is shown in equation (5).
  • where a is the action and a′ is any possible next action; the basic information for obtaining the loss function is the transition <s, a, r, s′>. Therefore, first, a forward pass of the Q network is performed with the state as input to obtain the action values of all actions. After the environment return <r, s′> of action a is obtained, the state is used again to obtain the action values of all actions. Then the loss function is obtained from all the information gathered; it updates the weight parameters so that the Q-value update of the selected action converges, that is, the predicted value gets as close as possible to the target value. For the compensation function, if the distance to the current target point has decreased compared with before, the compensation increases greatly, and as the distance keeps getting closer the compensation is reduced.
  • For the task allocation of the collaborative welding of two robots, the project proposes a resource-based (RB) robot task allocation algorithm.
  • The robot continuously consumes its resources while performing tasks, and these resources must be replenished during the run.
  • According to its resource level, the robot computes mission performance by considering every possible combination of visits to resource stations, which allows it to reduce unnecessary waste of time and resources during missions.
  • In summary, the present invention proposes a high-quality multi-robot trajectory planning method that fuses deep Q-learning and the convolutional neural network (CNN) algorithm.
  • The CNN algorithm analyzes the accurate position from information about the surrounding environment, and each robot moves according to the position obtained by the deep Q-learning analysis.
  • A resource-based multi-robot task allocation algorithm then reasonably allocates all the solder joints of the workpiece to the two welding robots, so that the optimal collaborative path is finally planned for the robots and they can cooperate with one another without interference.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a multi-robot trajectory planning method, comprising the following steps: deep Q-learning analyzes the current trajectory vector from the state of the environment around the robots, and a reward network for deep Q-learning is designed; the current trajectory vector and the expected trajectory vector are both used as inputs of the reward network, and its output is reward information used to train the parameters of a convolutional neural network (CNN); the current trajectory vector is used as the input of the CNN, and the CNN trained on the reward information outputs the corresponding action information to the environment information; a resource-based multi-robot task allocation algorithm is then used to reasonably allocate all actions on the workpiece to multiple robots, so that the robots can cooperate with one another without interference, realizing spatial three-dimensional complex trajectory planning for multiple robots and thereby efficient collaborative execution of complex tasks.

Description

Multi-robot trajectory planning method
Technical field
The invention relates to the technical field of multi-robot collaborative control, and in particular to a multi-robot trajectory planning method.
Background art
With the ever-changing processing volume and working environment in industries handling complex steel/aluminium components, some jobs can no longer be undertaken by a single robot and can only be completed through coordination among multiple robots; multi-machine collaborative operation has replaced single-machine construction and become a research hotspot for building intelligent production lines. Compared with a single-robot system, a multi-robot system offers strong adaptability to the environment, high self-regulation ability, wide spatial distribution, better data redundancy, and greater robustness. Through collaboration among multiple robots, high-precision operations and efficient processing that a single robot cannot accomplish can be completed reliably.
Welding is a potentially hazardous, labor-intensive job that demands a high degree of skill. The traditional robot welding process mostly relies on manual teaching to generate welding trajectories, which is not only time-consuming and labor-intensive but also imprecise, and, constrained by the robot's limited workspace, it can hardly achieve three-dimensional welding of arbitrary complex curves on complex components. Collaborative operation among multiple robots is therefore urgently needed: within the same station area, several robots respectively clamp, handle, flip and weld the workpiece (the object to be welded) in a coordinated way, cooperating with one another to meet the requirements of punctuality, synchronization, coordination and efficiency.
When the workspaces of multiple robots overlap over a large area, planning a collaborative trajectory for each robot is far from easy, and traditional spatial trajectory optimization methods can hardly find the optimal solution. For complex welding tasks, the spatial three-dimensional complex trajectory planning of robot welding must ensure not only that the multi-robot system never conflicts with obstacles in the environment, but also that the robots maintain given positions relative to one another, especially when their workspaces overlap heavily.
Existing multi-robot collaboration schemes depend heavily on conditions such as the position and speed of each robot, and traditional trajectory planning methods can hardly adapt to complex, dynamic systems and environments, because each robot struggles to recognize the robots around it as obstacles or as collaborating robots. Although machine learning has been applied to robot control, path planning and other areas, most studies are limited to simulation, genetic algorithms have limitations of their own, and research on multi-robot reinforcement-learning trajectory planning for one or more tasks remains relatively inactive.
When the workspaces of multiple robots overlap heavily, the robots must cooperate within the same station area to clamp, handle, flip and weld the workpiece, and every robot in the multi-robot system must act independently while collaborating with the others. A multi-robot collaboration scheme depends heavily on conditions such as the position and speed of each robot; how to execute complex tasks collaboratively and efficiently while satisfying spatial three-dimensional complex trajectory planning is the key problem to be solved at present.
Summary of the invention
In view of this, to solve the above problems in the prior art, the present invention proposes a multi-robot trajectory planning method that fuses deep Q-learning with the convolutional neural network CNN algorithm, so that multiple robots can cooperate with one another without interference, thereby realizing spatial three-dimensional complex trajectory planning for multiple robots.
The present invention solves the above problems by the following technical means:
A multi-robot trajectory planning method, comprising the following steps:
deep Q-learning analyzes the current trajectory vector from the state of the environment around the robots, and a reward network for deep Q-learning is designed; the current trajectory vector and the expected trajectory vector are both used as inputs of the reward network, and its output is reward information used to train the parameters of the convolutional neural network CNN (a minimal sketch of such a reward network is given after these steps);
the convolutional neural network CNN algorithm analyzes the current trajectory vector from the state of the environment around the robots; the current trajectory vector is used as the input of the CNN, and the CNN trained on the reward information outputs the corresponding action information to the environment information;
a resource-based multi-robot task allocation algorithm is then used to reasonably allocate all actions on the workpiece to multiple robots, so that the robots can cooperate with one another without interference, thereby realizing spatial three-dimensional complex trajectory planning for multiple robots.
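The first step above specifies only the reward network's inputs (the current and expected trajectory vectors) and its scalar reward output. A minimal PyTorch sketch under that assumption is shown below; the layer sizes, the trajectory-vector dimension and the hidden architecture are illustrative assumptions, not details taken from the patent.

```python
import torch
import torch.nn as nn

class RewardNetwork(nn.Module):
    """Maps (current trajectory vector, expected trajectory vector) to a scalar reward.

    A sketch only: the patent states the inputs and the reward output,
    but not the internal architecture, so the sizes below are assumptions.
    """

    def __init__(self, traj_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * traj_dim, hidden),  # concatenated trajectory vectors
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),             # scalar reward information
        )

    def forward(self, current_traj: torch.Tensor, expected_traj: torch.Tensor) -> torch.Tensor:
        x = torch.cat([current_traj, expected_traj], dim=-1)
        return self.net(x)

# Usage: the reward output is then used as the training signal for the CNN policy.
reward_net = RewardNetwork()
cur = torch.randn(8, 64)   # batch of current trajectory vectors (assumed dimension)
exp = torch.randn(8, 64)   # batch of expected trajectory vectors
reward = reward_net(cur, exp)   # shape (8, 1)
```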
Further, the basic structure of the convolutional neural network CNN is: input layer → convolution layer → pooling layer → (repeated convolution and pooling layers) → fully connected layer → output result.
Further, the current trajectory vector is driven to match the expected trajectory vector.
Further, the multi-robot trajectory planning method fuses deep Q-learning and the convolutional neural network CNN algorithm and adopts an experience-replay technique: the learning experience generated at each time step is stored, together with many other events, in a data set, which is called memory regeneration; at each update, learning samples are drawn from this reconstructed memory with a certain probability, reusing experience data and reducing the correlation between samples.
Further, the multi-robot trajectory planning method fuses deep Q-learning and the convolutional neural network CNN algorithm and uses experience data according to the different roles assigned to individual robots; before learning starts, a different expected value is set for each robot's role, and learning is performed so that the compensation value keeps increasing; if the algorithm's search time becomes too long, the compensation value is reduced and learning is performed so that the search time does not grow; the preprocessing stage uses the convolutional neural network CNN to find outliers, and the postprocessing stage uses these singular points to learn the data; in the preprocessing stage, the input image is searched for image features, which are collected and learned.
Further, in deep Q-learning, when the robot works in a discrete, constrained environment, it selects one of a set of deterministic behaviors in each time interval and is assumed to be in a Markov state, whose state changes with different probabilities;
P_r[s_{t+1} = s′ | s_t, a_t] = P_r[a_t]          (1)
where P_r is the state-transition probability matrix under random reward r, t is the time interval, s is the state, s′ is any possible next state, s_t is the state at time t, s_{t+1} is the state at time t+1, a_t is the action at time t, and r is the random reward;
in each time interval t, the robot acquires the state s from the environment and then performs the action a_t; it receives a random reward r, which depends on the state and behavior of the expected reward R_{s_t}, in order to find the optimal policy the system wants to achieve;
[Formula (2): equation image PCTCN2021095970-appb-000001, not reproduced here]
where R_{s_t} (formula image PCTCN2021095970-appb-000002) is the expected reward of state s at time t, a_i is the action at time i ∈ [1, n], i ∈ [1, n] and j ∈ [1, m] are time points, r_{t+j} is the random reward at time t+j, and γ is the decay coefficient; the discount factor means that a reward received within time interval t has less impact than the reward received now; the action-value function V_a is computed from the policy function π and the policy-value function V_p, as shown in formula (3); starting from state s and following the policy, the state-value function of the expected reward is expressed as:
V_a(s_t) ≡ R_s(π(s_t)) + γ∑P_xy[π(s_t)]V_p(s_t)          (3)
where R_s is the expected reward in state s, P_xy is the conditional probability distribution matrix, and γ is the decay coefficient; it follows that at least one optimal policy exists, and the goal of Q-learning is to establish an optimal policy without initial conditions; for a policy, the Q value can be defined as follows:
Q_p(s_t, a_t) = R_s(a_t) + γ∑P_xy[π(s_t)]V_p(s_t)          (4)
where Q(s_t, a_t) is the Q value of taking action a_t in state s_t, corresponding to the newly computed Q(s_{t-1}, a_{t-1}), while Q(s_{t-1}, a_{t-1}) corresponds to the current Q(s_{t-1}, a_{t-1}) value and the next state of the current Q(s_{t-1}, a_{t-1}).
Further, in deep Q-learning, the Q value is shared during learning and used by the learning machine; to optimize the Q-value update, it is necessary to define an objective function, defined as the error between the target value and the predicted value of the Q value; the objective function is shown in equation (5):
[Formula (5): equation image PCTCN2021095970-appb-000003, not reproduced here]
where a is the action and a′ is any possible next action; the basic information for obtaining the loss function is the transition <s, a, r, s′>; therefore, first, a forward pass of the Q network is performed with the state as input to obtain the action values of all actions; after the environment return <r, s′> of action a is obtained, the state is used again to obtain the action values of all actions; then the loss function is obtained from all the information gathered, and this function updates the weight parameters so that the Q-value update of the selected action converges, that is, the predicted value gets as close as possible to the target value; for the compensation function, if the distance to the current target point has decreased compared with before, the compensation increases greatly, and as the distance keeps getting closer the compensation is reduced.
Further, in deep Q-learning, two networks are used, a target Q network and a Q network; the two networks have the same structure and differ only in their weight parameters; to smooth convergence in deep Q-learning, the target network is not updated continuously but periodically; the root-mean-square propagation algorithm is used as the optimizer, and the learning rate is adjusted according to the parameter gradients; when the training set keeps changing, unlike the case of a fixed training set, the parameters must be changed constantly.
Further, in the multi-robot task allocation algorithm, the robot continuously consumes its resources while executing a task, and these resources must be replenished during the run; according to its resource level, the robot computes task performance by considering every possible combination of visits to resource stations, which enables it to reduce unnecessary waste of time and resources during tasks.
Compared with the prior art, the beneficial effects of the present invention at least include:
The present invention fuses deep Q-learning with the convolutional neural network CNN algorithm: the CNN algorithm analyzes the accurate position from information about the surrounding environment, each robot moves according to the position obtained by the deep Q-learning analysis, and a resource-based robot task allocation method then reasonably allocates all the solder joints of the workpiece to multiple welding robots, so that the robots cooperate with one another without interference; spatial three-dimensional complex trajectory planning for multiple robots is thus realized, the optimal collaborative path is finally planned for them, and the robots can execute complex tasks collaboratively and efficiently.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work.
Fig. 1 is a schematic diagram of the deep Q-learning of the present invention;
Fig. 2 is a schematic structural diagram of the convolutional neural network CNN of the present invention;
Fig. 3 is a flow chart of trajectory planning fusing deep Q-learning and the convolutional neural network CNN algorithm of the present invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention easier to understand, the technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments. It should be noted that the described embodiments are only part of the embodiments of the present invention, not all of them; based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the scope of protection of the present invention.
When the workspaces of multiple robots overlap over a large area, planning a collaborative trajectory for each robot is difficult. In traditional trajectory planning methods, a robot has to search a relatively wide action region and move along a pre-designed route in a given environment. Every robot in a multi-robot system must act independently and cooperate with the other robots to obtain excellent performance. Moreover, a multi-robot collaboration scheme depends heavily on conditions such as the position and speed of each robot. Traditional trajectory planning methods, however, struggle to handle the various situations effectively, because each robot can hardly recognize the robots around it as obstacles or as collaborating robots.
To make up for these shortcomings, and aiming at the multi-robot trajectory planning problem for complex tasks, the present invention studies the information and policy problems of reinforcement learning in multi-robot trajectory planning and proposes a method that lets robots reach the target point quickly through reinforcement learning. Depending on the situation, each robot can be regarded either as a dynamic obstacle or as a collaborating robot; that is, each robot in the system can perform independent actions according to a given task while cooperating with the others. After an action is selected, its relation to the goal is evaluated and each robot is rewarded or punished in order to learn. The reinforcement learning used here is deep Q-learning (DQN); by sharing the Q parameters of each robot, it consumes less trajectory-search time and can be applied to both static and dynamic multi-robot environments.
The principle of the multi-robot trajectory planning of the present invention based on deep Q-learning is shown in Fig. 1. A robot that selects an action as its output recognizes the environment and receives the state of the environment; when the state changes, the state transition is delivered to the individual as a reinforcement signal. The behavior of each robot is chosen so that the sum of the reinforcement-signal values increases over a longer period of time. The role of the action is to provide the control strategy for the control system; the ultimate goal of the multi-robot collaborative clamping/handling/flipping/welding system is to maximize the infinitely accumulated reward value during the state (multi-robot collaborative operation) process, so as to achieve optimal trajectory planning for the environment (the robots and the workpiece).
When a robot works in a discrete, constrained environment, it selects one of a set of deterministic behaviors in each time interval and is assumed to be in a Markov state whose state changes with different probabilities.
P_r[s_{t+1} = s′ | s_t, a_t] = P_r[a_t]          (1)
where P_r is the state-transition probability matrix under random reward r, t is the time interval, s is the state, s′ is any possible next state, s_t is the state at time t, s_{t+1} is the state at time t+1, a_t is the action at time t, and r is the random reward;
in each time interval t, the robot acquires the state s from the environment and then performs the action a_t. It receives a random reward r, which depends on the state and behavior of the expected reward R_{s_t}, in order to find the optimal policy the system wants to achieve.
[Formula (2): equation image PCTCN2021095970-appb-000004, not reproduced here]
where R_{s_t} (formula image PCTCN2021095970-appb-000005) is the expected reward of state s at time t, a_i is the action at time i ∈ [1, n], i ∈ [1, n] and j ∈ [1, m] are time points, r_{t+j} is the random reward at time t+j, and γ is the decay coefficient; the discount factor means that a reward received within time interval t has less impact than the reward received now. The action-value function V_a is computed from the policy function π and the policy-value function V_p, as shown in formula (3). Starting from state s and following the policy, the state-value function of the expected reward is expressed as follows.
V_a(s_t) ≡ R_s(π(s_t)) + γ∑P_xy[π(s_t)]V_p(s_t)          (3)
where R_s is the expected reward in state s, P_xy is the conditional probability distribution matrix, and γ ∈ [0, 1] is the decay coefficient; it follows that at least one optimal policy exists, and the goal of Q-learning is to establish an optimal policy without initial conditions; for a policy, the Q value can be defined as follows:
Q_p(s_t, a_t) = R_s(a_t) + γ∑P_xy[π(s_t)]V_p(s_t)          (4)
where Q(s_t, a_t) is the Q value of taking action a_t in state s_t, corresponding to the newly computed Q(s_{t-1}, a_{t-1}), while Q(s_{t-1}, a_{t-1}) corresponds to the current Q(s_{t-1}, a_{t-1}) value and the next state of the current Q(s_{t-1}, a_{t-1}).
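The image for formula (2) is not reproduced in this text, but the surrounding variable definitions (expected reward of state s at time t, actions a_i, rewards r_{t+j}, decay coefficient γ) point to the standard discounted expected-return form. A hedged LaTeX reconstruction under that assumption, not taken from the patent image itself, is:

```latex
% Assumed standard form of formula (2): the expected discounted reward of state s_t,
% reconstructed from the surrounding variable definitions, not from the patent image.
R_{s_t} = \mathbb{E}\!\left[\sum_{j=1}^{m} \gamma^{\,j}\, r_{t+j}\right],
\qquad \gamma \in [0,1]
```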
In multi-robot trajectory planning, existing methods struggle to adapt to complex and dynamic systems and environments, but deep Q-learning can be fused with convolutional neural networks (CNN) to apply multi-robot deep reinforcement learning. The convolutional neural network CNN, proposed by Yann LeCun of New York University in 1998, can be regarded as a generalized form of the neocognitron and a variant of the multilayer perceptron (MLP); the multilayer perceptron is also called an artificial neural network (ANN) and may have several hidden layers between the input layer and the output layer.
The basic structure of the convolutional neural network CNN adopted in the present invention is: input layer → convolution layer → pooling layer → (repeated convolution and pooling layers) → fully connected layer → output layer, as shown in Fig. 2. The environment-information image is 2560 × 2000, the input layer is an integer multiple of 2, there are 16 convolution layers using 3 × 3 filters, the pooling layer reduces the dimensionality of the convolution result, and there are 3 fully connected layers.
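As a rough illustration of the layer ordering described above (repeated convolution and pooling blocks followed by three fully connected layers), a minimal PyTorch sketch is given below. The patent states only the 2560 × 2000 environment image, the 3 × 3 filters, the 16 convolution layers and the 3 fully connected layers; the channel widths, the number of repeated blocks, the downscaled test input and the output size used here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TrajectoryCNN(nn.Module):
    """Input -> [Conv -> Pool] x N -> 3 fully connected layers -> action output.

    A sketch only: the patent fixes the layer ordering, 3x3 filters and three
    fully connected layers; channel widths, block count and output size are assumed.
    """

    def __init__(self, in_channels: int = 1, num_actions: int = 8, blocks: int = 4):
        super().__init__()
        layers, channels = [], in_channels
        for i in range(blocks):                      # repeated convolution + pooling blocks
            out_channels = 16 * (2 ** i)
            layers += [
                nn.Conv2d(channels, out_channels, kernel_size=3, padding=1),  # 3x3 filter
                nn.ReLU(),
                nn.MaxPool2d(2),                     # pooling halves the spatial dimensions
            ]
            channels = out_channels
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))     # makes the head independent of input size
        self.head = nn.Sequential(                   # three fully connected layers
            nn.Linear(channels * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pool(self.features(x))
        return self.head(x.flatten(1))

# Usage with a downscaled environment image (the full 2560x2000 image would also work,
# only more slowly); batch of 2 single-channel images.
net = TrajectoryCNN()
out = net(torch.randn(2, 1, 256, 200))   # -> shape (2, 8)
```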
The trajectory-planning flow of the present invention, which fuses deep Q-learning and the convolutional neural network CNN algorithm, is shown in Fig. 3. First, the reward network is designed: the two pieces of state information (the current trajectory vector and the expected trajectory vector) are both used as its inputs, and its output is reward information used to train the parameters of the convolutional neural network CNN. The current trajectory vector is driven to match the expected trajectory vector by means of advanced seam-tracking technology. The current trajectory vector is also used as the input of the convolutional neural network CNN; the CNN trained on the aforementioned reward output then outputs the corresponding action information to the environment information (the robots and the workpiece), so that the robots can collaboratively clamp, handle, flip and weld spatial three-dimensional complex weld seams. The present invention fuses deep Q-learning and the convolutional neural network CNN algorithm and adopts an experience-replay technique: the learning experience generated at each time step is stored, together with many other events, in a data set, which is also called memory regeneration. At each update, learning samples are drawn from this reconstructed memory with a certain probability; reusing experience data and reducing the correlation between samples improves data efficiency.
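The memory-regeneration scheme described in the previous paragraph is, in essence, an experience-replay buffer: transitions are stored and later sampled with a certain probability to break the correlation between consecutive samples. A minimal Python sketch under that reading is shown below; the buffer capacity, batch size and the uniform sampling rule are illustrative assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Stores <state, action, reward, next_state> transitions and samples them at random,
    so that consecutive (highly correlated) experiences are not learned from back-to-back."""

    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)   # old events are discarded once the buffer is full

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int = 32):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# Usage: each robot stores the transition produced at every time step,
# then a random mini-batch is drawn for each parameter update.
memory = ReplayMemory()
memory.push(state=[0.0, 1.0], action=3, reward=0.5, next_state=[0.1, 1.0])
batch = memory.sample(batch_size=1)
```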
The present invention fuses deep Q-learning and the convolutional neural network CNN algorithm and uses experience data according to the different roles assigned to individual robots; before learning starts, a different expected value is set for each robot's role, and learning is performed so that the compensation value keeps increasing. If the algorithm's search time becomes too long, the compensation value is reduced and learning is performed so that the search time does not grow. The preprocessing stage uses the convolutional neural network CNN to find outliers, and the postprocessing stage uses these singular points to learn the data. In the preprocessing stage, the input image is searched for image features, which are collected and learned. In this case, Q values are learned for each robot assigned a different role, but the CNN values have the same input and different expected values. The Q values are therefore shared during learning and used by the learning machine. To optimize the Q-value update, it is necessary to define an objective function, defined as the error between the target value and the predicted value of the Q value. The objective function is shown in equation (5).
[Formula (5): equation image PCTCN2021095970-appb-000006, not reproduced here]
where a is the action and a′ is any possible next action; the basic information for obtaining the loss function is the transition <s, a, r, s′>. Therefore, first, a forward pass of the Q network is performed with the state as input to obtain the action values of all actions. After the environment return <r, s′> of action a is obtained, the state is used again to obtain the action values of all actions. Then the loss function is obtained from all the information gathered; it updates the weight parameters so that the Q-value update of the selected action converges, that is, the predicted value gets as close as possible to the target value. For the compensation function, if the distance to the current target point has decreased compared with before, the compensation increases greatly, and as the distance keeps getting closer the compensation is reduced.
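The image for equation (5) is likewise not reproduced, but the text defines it as the error between the target value and the predicted Q value over a transition <s, a, r, s′> with all possible next actions a′, which matches the standard deep Q-learning objective. A hedged LaTeX reconstruction under that assumption is:

```latex
% Assumed standard form of equation (5): squared error between the bootstrapped target
% and the predicted Q value, reconstructed from the surrounding description, not from the image.
L(\theta) = \mathbb{E}_{\langle s,a,r,s'\rangle}\!\left[
  \Big( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \Big)^{2}
\right]
```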
In deep Q-learning, two networks are used, a target Q network and a Q network; the two networks have the same structure and differ only in their weight parameters. To smooth convergence in deep Q-learning, the target network is not updated continuously but periodically. The root-mean-square propagation algorithm (RMSProp) is used as the optimizer, and the learning rate is adjusted according to the parameter gradients. This means that when the training set keeps changing, unlike the case of a fixed training set, the parameters must be changed constantly.
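Putting the previous two paragraphs together, a minimal PyTorch sketch of one update step is given below: the online Q network is trained with RMSprop against a periodically copied target network. The network sizes, discount factor, learning rate and copy interval are illustrative assumptions, and any Q-network module (such as the CNN sketched earlier) could be substituted for the small network used here.

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 8))   # online Q network
target_net = copy.deepcopy(q_net)                                      # same structure, separate weights
optimizer = torch.optim.RMSprop(q_net.parameters(), lr=1e-3)           # RMSProp optimizer
gamma, copy_every = 0.99, 100                                          # assumed hyper-parameters

def dqn_update(step, states, actions, rewards, next_states):
    """One deep Q-learning update on a mini-batch of transitions <s, a, r, s'>."""
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)   # predicted Q(s, a)
    with torch.no_grad():
        q_target = rewards + gamma * target_net(next_states).max(dim=1).values
    loss = nn.functional.mse_loss(q_pred, q_target)    # error between target and prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % copy_every == 0:                         # periodic, not continuous, target update
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()

# Usage with a dummy batch of 32 transitions (4-dimensional states, 8 actions).
loss = dqn_update(step=100,
                  states=torch.randn(32, 4),
                  actions=torch.randint(0, 8, (32,)),
                  rewards=torch.randn(32),
                  next_states=torch.randn(32, 4))
```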
Subsequently, for the task allocation of the collaborative welding of two robots, the project proposes a resource-based (RB) robot task allocation algorithm. In this algorithm, the robot continuously consumes its resources while executing a task, and these resources must be replenished during the run. According to its resource level, the robot computes task performance by considering every possible combination of visits to resource stations, which enables it to reduce unnecessary waste of time and resources during tasks.
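The patent does not specify the scoring rule for the resource-based allocation beyond "consider every combination of resource-station visits according to the resource level". The sketch below enumerates those combinations for one robot and keeps the feasible plan with the lowest detour cost; the cost model, the fixed refill amount and the station data are illustrative assumptions, not the patent's algorithm.

```python
from itertools import combinations

def best_station_plan(resource_level: float, task_cost: float,
                      stations: dict[str, float], refill: float = 1.0):
    """Enumerate every combination of resource-station visits and return the cheapest
    combination whose refills keep the resource level sufficient for the task."""
    best = None
    for k in range(len(stations) + 1):
        for combo in combinations(stations, k):
            detour = sum(stations[name] for name in combo)   # travel cost of the detour visits
            budget = resource_level + refill * len(combo)    # resources available after refills
            if budget >= task_cost:                          # feasible plan for this task
                if best is None or detour < best[1]:
                    best = (combo, detour)
    return best

# Usage: a robot with 0.4 units left, a welding task needing 1.2 units,
# and three candidate resource stations with different detour costs.
plan = best_station_plan(resource_level=0.4, task_cost=1.2,
                         stations={"A": 2.0, "B": 3.5, "C": 1.5})
print(plan)   # (('C',), 1.5): one refill at the cheapest station suffices
```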
In summary, the present invention proposes a high-quality multi-robot trajectory planning method that fuses deep Q-learning and the convolutional neural network CNN algorithm: the CNN algorithm analyzes the accurate position from information about the surrounding environment, each robot moves according to the position obtained by the deep Q-learning analysis, and a resource-based multi-robot task allocation algorithm then reasonably allocates all the solder joints of the workpiece to the two welding robots, so that the optimal collaborative path is finally planned for the robots and they can cooperate with one another without interference.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent. It should be noted that those of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention. Therefore, the scope of protection of the patent of the present invention shall be subject to the appended claims.

Claims (9)

  1. A multi-robot trajectory planning method, characterized by comprising the following steps:
    deep Q-learning analyzes the current trajectory vector from the state of the environment around the robots, and a reward network for deep Q-learning is designed; the current trajectory vector and the expected trajectory vector are both used as inputs of the reward network, and its output is reward information used to train the parameters of the convolutional neural network CNN;
    the convolutional neural network CNN algorithm analyzes the current trajectory vector from the state of the environment around the robots; the current trajectory vector is used as the input of the CNN, and the CNN trained on the reward information outputs the corresponding action information to the environment information;
    a resource-based multi-robot task allocation algorithm is then used to reasonably allocate all actions on the workpiece to multiple robots, so that the robots can cooperate with one another without interference, thereby realizing spatial three-dimensional complex trajectory planning for multiple robots.
  2. The multi-robot trajectory planning method according to claim 1, characterized in that the basic structure of the convolutional neural network CNN is: input layer → convolution layer → pooling layer → (repeated convolution and pooling layers) → fully connected layer → output result.
  3. The multi-robot trajectory planning method according to claim 1, characterized in that the current trajectory vector is driven to match the expected trajectory vector.
  4. The multi-robot trajectory planning method according to claim 1, characterized in that the method fuses deep Q-learning and the convolutional neural network CNN algorithm and adopts an experience-replay technique: the learning experience generated at each time step is stored, together with many other events, in a data set, which is called memory regeneration; at each update, learning samples are drawn from this reconstructed memory with a certain probability, reusing experience data and reducing the correlation between samples.
  5. The multi-robot trajectory planning method according to claim 1, characterized in that the method fuses deep Q-learning and the convolutional neural network CNN algorithm and uses experience data according to the different roles assigned to individual robots; before learning starts, a different expected value is set for each robot's role, and learning is performed so that the compensation value keeps increasing; if the algorithm's search time becomes too long, the compensation value is reduced and learning is performed so that the search time does not grow; the preprocessing stage uses the convolutional neural network CNN to find outliers, and the postprocessing stage uses these singular points to learn the data; in the preprocessing stage, the input image is searched for image features, which are collected and learned.
  6. The multi-robot trajectory planning method according to claim 1, characterized in that, in deep Q-learning, when the robot works in a discrete, constrained environment, it selects one of a set of deterministic behaviors in each time interval and is assumed to be in a Markov state, whose state changes with different probabilities;
    P_r[s_{t+1} = s′ | s_t, a_t] = P_r[a_t]  (1)
    where P_r is the state-transition probability matrix under random reward r, t is the time interval, s is the state, s′ is any possible next state, s_t is the state at time t, s_{t+1} is the state at time t+1, a_t is the action at time t, and r is the random reward;
    in each time interval t, the robot acquires the state s from the environment and then performs the action a_t; it receives a random reward r, which depends on the state and behavior of the expected reward R_{s_t}, in order to find the optimal policy the system wants to achieve;
    [Formula (2): equation image PCTCN2021095970-appb-100001, not reproduced here]
    where R_{s_t} (formula image PCTCN2021095970-appb-100002) is the expected reward of state s at time t, a_i is the action at time i ∈ [1, n], i ∈ [1, n] and j ∈ [1, m] are time points, r_{t+j} is the random reward at time t+j, and γ is the decay coefficient; the discount factor means that a reward received within time interval t has less impact than the reward received now; the action-value function V_a is computed from the policy function π and the policy-value function V_p, as shown in formula (3); starting from state s and following the policy, the state-value function of the expected reward is expressed as:
    V_a(s_t) ≡ R_s(π(s_t)) + γΣP_xy[π(s_t)]V_p(s_t)  (3)
    where R_s is the expected reward in state s, P_xy is the conditional probability distribution matrix, and γ is the decay coefficient; it follows that at least one optimal policy exists, and the goal of Q-learning is to establish an optimal policy without initial conditions; for a policy, the Q value can be defined as follows:
    Q_p(s_t, a_t) = R_s(a_t) + γΣP_xy[π(s_t)]V_p(s_t)  (4)
    where Q(s_t, a_t) is the Q value of taking action a_t in state s_t, corresponding to the newly computed Q(s_{t-1}, a_{t-1}), while Q(s_{t-1}, a_{t-1}) corresponds to the current Q(s_{t-1}, a_{t-1}) value and the next state of the current Q(s_{t-1}, a_{t-1}).
  7. The multi-robot trajectory planning method according to claim 6, characterized in that, in deep Q-learning, the Q value is shared during learning and used by the learning machine; to optimize the Q-value update, it is necessary to define an objective function, defined as the error between the target value and the predicted value of the Q value; the objective function is shown in equation (5):
    [Formula (5): equation image PCTCN2021095970-appb-100003, not reproduced here]
    where a is the action and a′ is any possible next action; the basic information for obtaining the loss function is the transition <s, a, r, s′>; therefore, first, a forward pass of the Q network is performed with the state as input to obtain the action values of all actions; after the environment return <r, s′> of action a is obtained, the state is used again to obtain the action values of all actions; then the loss function is obtained from all the information gathered, and this function updates the weight parameters so that the Q-value update of the selected action converges, that is, the predicted value gets as close as possible to the target value; for the compensation function, if the distance to the current target point has decreased compared with before, the compensation increases greatly, and as the distance keeps getting closer the compensation is reduced.
  8. The multi-robot trajectory planning method according to claim 1, characterized in that, in deep Q-learning, two networks are used, a target Q network and a Q network; the two networks have the same structure and differ only in their weight parameters; to smooth convergence in deep Q-learning, the target network is not updated continuously but periodically; the root-mean-square propagation algorithm is used as the optimizer, and the learning rate is adjusted according to the parameter gradients; when the training set keeps changing, unlike the case of a fixed training set, the parameters must be changed constantly.
  9. The multi-robot trajectory planning method according to claim 1, characterized in that, in the multi-robot task allocation algorithm, the robot continuously consumes its resources while executing a task, and these resources must be replenished during the run; according to its resource level, the robot computes task performance by considering every possible combination of visits to resource stations, which enables it to reduce unnecessary waste of time and resources during tasks.
PCT/CN2021/095970 2021-05-19 2021-05-26 一种多机器人轨迹规划方法 WO2022241808A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110547794.3A CN113326872A (zh) 2021-05-19 2021-05-19 一种多机器人轨迹规划方法
CN202110547794.3 2021-05-19

Publications (1)

Publication Number Publication Date
WO2022241808A1 true WO2022241808A1 (zh) 2022-11-24

Family

ID=77416039

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095970 WO2022241808A1 (zh) 2021-05-19 2021-05-26 一种多机器人轨迹规划方法

Country Status (2)

Country Link
CN (1) CN113326872A (zh)
WO (1) WO2022241808A1 (zh)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730475A (zh) * 2023-01-09 2023-03-03 广东省科学院智能制造研究所 一种云边端协同的柔性产线机器人学习系统及方法
CN115840794A (zh) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 一种基于gis和rl模型的光伏系统规划方法
CN116302569A (zh) * 2023-05-17 2023-06-23 安世亚太科技股份有限公司 一种基于用户请求信息的资源分区智能化调度方法
CN116307251A (zh) * 2023-04-12 2023-06-23 哈尔滨理工大学 一种基于强化学习的工作排程优化方法
CN116300977A (zh) * 2023-05-22 2023-06-23 北京科技大学 一种依托强化学习的铰接车轨迹跟踪控制方法及装置
CN116562740A (zh) * 2023-07-10 2023-08-08 长沙宜选供应链有限公司 一种基于改进型深度学习算法模型的外贸物流平台
CN116690589A (zh) * 2023-08-07 2023-09-05 武汉理工大学 基于深度强化学习的机器人u型拆解线动态平衡方法
CN116747026A (zh) * 2023-06-05 2023-09-15 北京长木谷医疗科技股份有限公司 基于深度强化学习的机器人智能截骨方法、装置及设备
CN116776154A (zh) * 2023-07-06 2023-09-19 华中师范大学 一种ai人机协同数据标注方法和系统
CN116803635A (zh) * 2023-08-21 2023-09-26 南京邮电大学 基于高斯核损失函数的近端策略优化训练加速方法
CN116834018A (zh) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 一种多机械臂多目标寻找的训练方法及训练装置
CN116900538A (zh) * 2023-09-14 2023-10-20 天津大学 基于深度强化学习和区域平衡的多机器人任务规划方法
CN117078236A (zh) * 2023-10-18 2023-11-17 广东工业大学 复杂装备智能维护方法、装置、电子设备及存储介质
CN117273225A (zh) * 2023-09-26 2023-12-22 西安理工大学 一种基于时空特征的行人路径预测方法
CN117437188A (zh) * 2023-10-17 2024-01-23 广东电力交易中心有限责任公司 一种用于智慧电网的绝缘子缺陷检测系统
CN117590751A (zh) * 2023-12-28 2024-02-23 深圳市德威胜潜水工程有限公司 基于水下机器人的水下环境监测方法及系统
CN117789095A (zh) * 2024-01-02 2024-03-29 广州汇思信息科技股份有限公司 一种切花开放周期优化方法、系统、设备及存储介质
CN117631547B (zh) * 2024-01-26 2024-04-26 哈尔滨工业大学 一种小天体不规则弱引力场下的四足机器人着陆控制方法
CN117973820A (zh) * 2024-04-01 2024-05-03 浙江数达智远科技有限公司 基于人工智能的任务动态分配系统及方法

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114089627B (zh) * 2021-10-08 2023-09-15 北京师范大学 基于双深度q网络学习的非完全信息博弈策略优化方法
CN114397817A (zh) * 2021-12-31 2022-04-26 上海商汤科技开发有限公司 网络训练、机器人控制方法及装置、设备及存储介质
CN115855226B (zh) * 2023-02-24 2023-05-30 青岛科技大学 基于dqn和矩阵补全的多auv协同水下数据采集方法
CN116382304B (zh) * 2023-05-26 2023-09-15 国网江苏省电力有限公司南京供电分公司 基于dqn模型的多巡检机器人协同路径规划方法及系统

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540150A (zh) * 2018-12-26 2019-03-29 北京化工大学 一种应用于危化品环境下多机器人路径规划方法
CN109839933A (zh) * 2019-02-20 2019-06-04 哈尔滨工程大学 一种基于vdsom算法的多机器人任务分配方法
CN109906132A (zh) * 2016-09-15 2019-06-18 谷歌有限责任公司 机器人操纵的深度强化学习
CN110083166A (zh) * 2019-05-30 2019-08-02 浙江远传信息技术股份有限公司 针对多机器人的协同调度方法、装置、设备及介质
JP2020082314A (ja) * 2018-11-29 2020-06-04 京セラドキュメントソリューションズ株式会社 学習装置、ロボット制御装置、及びロボット制御システム
US10733535B1 (en) * 2012-05-22 2020-08-04 Google Llc Training a model using parameter server shards
CN112596515A (zh) * 2020-11-25 2021-04-02 北京物资学院 一种多物流机器人移动控制方法及装置

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733535B1 (en) * 2012-05-22 2020-08-04 Google Llc Training a model using parameter server shards
CN109906132A (zh) * 2016-09-15 2019-06-18 谷歌有限责任公司 机器人操纵的深度强化学习
JP2020082314A (ja) * 2018-11-29 2020-06-04 京セラドキュメントソリューションズ株式会社 学習装置、ロボット制御装置、及びロボット制御システム
CN109540150A (zh) * 2018-12-26 2019-03-29 北京化工大学 一种应用于危化品环境下多机器人路径规划方法
CN109839933A (zh) * 2019-02-20 2019-06-04 哈尔滨工程大学 一种基于vdsom算法的多机器人任务分配方法
CN110083166A (zh) * 2019-05-30 2019-08-02 浙江远传信息技术股份有限公司 针对多机器人的协同调度方法、装置、设备及介质
CN112596515A (zh) * 2020-11-25 2021-04-02 北京物资学院 一种多物流机器人移动控制方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUI BOWEN, HUANG ZHIJIAN, JIANG BAOXIANG, ZHENG HUAN, WEN JIAYI: "Path planning algorithm for unmanned surface vessels based on deep Q network", JOURNAL OF SHANGHAI MARITIME UNIVERSITY, SHANGHAI, vol. 41, no. 3, 30 September 2020 (2020-09-30), Shanghai, XP093005735, ISSN: 1672-9498, DOI: 10.13340/j.jsmu.2020.03.001 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730475B (zh) * 2023-01-09 2023-05-19 广东省科学院智能制造研究所 一种云边端协同的柔性产线机器人学习系统及方法
CN115730475A (zh) * 2023-01-09 2023-03-03 广东省科学院智能制造研究所 一种云边端协同的柔性产线机器人学习系统及方法
CN115840794A (zh) * 2023-02-14 2023-03-24 国网山东省电力公司东营供电公司 一种基于gis和rl模型的光伏系统规划方法
CN116307251B (zh) * 2023-04-12 2023-09-19 哈尔滨理工大学 一种基于强化学习的工作排程优化方法
CN116307251A (zh) * 2023-04-12 2023-06-23 哈尔滨理工大学 一种基于强化学习的工作排程优化方法
CN116302569B (zh) * 2023-05-17 2023-08-15 安世亚太科技股份有限公司 一种基于用户请求信息的资源分区智能化调度方法
CN116302569A (zh) * 2023-05-17 2023-06-23 安世亚太科技股份有限公司 一种基于用户请求信息的资源分区智能化调度方法
CN116300977B (zh) * 2023-05-22 2023-07-21 北京科技大学 一种依托强化学习的铰接车轨迹跟踪控制方法及装置
CN116300977A (zh) * 2023-05-22 2023-06-23 北京科技大学 一种依托强化学习的铰接车轨迹跟踪控制方法及装置
CN116747026A (zh) * 2023-06-05 2023-09-15 北京长木谷医疗科技股份有限公司 基于深度强化学习的机器人智能截骨方法、装置及设备
CN116776154B (zh) * 2023-07-06 2024-04-09 华中师范大学 一种ai人机协同数据标注方法和系统
CN116776154A (zh) * 2023-07-06 2023-09-19 华中师范大学 一种ai人机协同数据标注方法和系统
CN116562740A (zh) * 2023-07-10 2023-08-08 长沙宜选供应链有限公司 一种基于改进型深度学习算法模型的外贸物流平台
CN116562740B (zh) * 2023-07-10 2023-09-22 长沙宜选供应链有限公司 一种基于改进型深度学习算法模型的外贸物流平台
CN116690589A (zh) * 2023-08-07 2023-09-05 武汉理工大学 基于深度强化学习的机器人u型拆解线动态平衡方法
CN116834018A (zh) * 2023-08-07 2023-10-03 南京云创大数据科技股份有限公司 一种多机械臂多目标寻找的训练方法及训练装置
CN116690589B (zh) * 2023-08-07 2023-12-12 武汉理工大学 基于深度强化学习的机器人u型拆解线动态平衡方法
CN116803635A (zh) * 2023-08-21 2023-09-26 南京邮电大学 基于高斯核损失函数的近端策略优化训练加速方法
CN116803635B (zh) * 2023-08-21 2023-12-22 南京邮电大学 基于高斯核损失函数的近端策略优化训练加速方法
CN116900538A (zh) * 2023-09-14 2023-10-20 天津大学 基于深度强化学习和区域平衡的多机器人任务规划方法
CN116900538B (zh) * 2023-09-14 2024-01-09 天津大学 基于深度强化学习和区域平衡的多机器人任务规划方法
CN117273225B (zh) * 2023-09-26 2024-05-03 西安理工大学 一种基于时空特征的行人路径预测方法
CN117273225A (zh) * 2023-09-26 2023-12-22 西安理工大学 一种基于时空特征的行人路径预测方法
CN117437188A (zh) * 2023-10-17 2024-01-23 广东电力交易中心有限责任公司 一种用于智慧电网的绝缘子缺陷检测系统
CN117437188B (zh) * 2023-10-17 2024-05-28 广东电力交易中心有限责任公司 一种用于智慧电网的绝缘子缺陷检测系统
CN117078236B (zh) * 2023-10-18 2024-02-02 广东工业大学 复杂装备智能维护方法、装置、电子设备及存储介质
CN117078236A (zh) * 2023-10-18 2023-11-17 广东工业大学 复杂装备智能维护方法、装置、电子设备及存储介质
CN117590751A (zh) * 2023-12-28 2024-02-23 深圳市德威胜潜水工程有限公司 基于水下机器人的水下环境监测方法及系统
CN117590751B (zh) * 2023-12-28 2024-03-22 深圳市德威胜潜水工程有限公司 基于水下机器人的水下环境监测方法及系统
CN117789095A (zh) * 2024-01-02 2024-03-29 广州汇思信息科技股份有限公司 一种切花开放周期优化方法、系统、设备及存储介质
CN117789095B (zh) * 2024-01-02 2024-05-14 广州汇思信息科技股份有限公司 一种切花开放周期优化方法、系统、设备及存储介质
CN117631547B (zh) * 2024-01-26 2024-04-26 哈尔滨工业大学 一种小天体不规则弱引力场下的四足机器人着陆控制方法
CN117973820A (zh) * 2024-04-01 2024-05-03 浙江数达智远科技有限公司 基于人工智能的任务动态分配系统及方法

Also Published As

Publication number Publication date
CN113326872A (zh) 2021-08-31

Similar Documents

Publication Publication Date Title
WO2022241808A1 (zh) 一种多机器人轨迹规划方法
Chen et al. Distributed model predictive control for vessel train formations of cooperative multi-vessel systems
Chen et al. Cooperative multi-vessel systems in urban waterway networks
CN110398967B (zh) 一种采用离散化方法的多机器人协同轨迹信息处理方法
CN113156954B (zh) 一种基于增强学习的多智能体集群避障方法
Yang et al. LF-ACO: an effective formation path planning for multi-mobile robot
Xu et al. Two-layer distributed hybrid affine formation control of networked Euler–Lagrange systems
CN112427843B (zh) 基于qmix强化学习算法的船舶多机械臂焊点协同焊接方法
Cai et al. A combined hierarchical reinforcement learning based approach for multi-robot cooperative target searching in complex unknown environments
Xin et al. Overview of research on transformation of multi-AUV formations
CN112083727B (zh) 基于速度障碍物的多自主体系统分布式避碰编队控制方法
Demesure et al. Navigation scheme with priority-based scheduling of mobile agents: Application to AGV-based flexible manufacturing system
WO2024016457A1 (zh) 基于自主绕障的异构型多智能体网联协同调度规划方法
Wang et al. Pattern-rl: Multi-robot cooperative pattern formation via deep reinforcement learning
Chen et al. Real-time path planning for a robot to track a fast moving target based on improved Glasius bio-inspired neural networks
Wang Robot algorithm based on neural network and intelligent predictive control
Jin et al. Physical-Informed Neural Network for MPC-based Trajectory Tracking of Vehicles with Noise Considered
Zhang et al. Reinforcement learning and digital twin-based real-time scheduling method in intelligent manufacturing systems
Chen et al. Maddpg algorithm for coordinated welding of multiple robots
Wang et al. Study on scheduling and path planning problems of multi-AGVs based on a heuristic algorithm in intelligent manufacturing workshop
Huang et al. Multi-agent vehicle formation control based on mpc and particle swarm optimization algorithm
Kabtoul et al. Proactive and smooth maneuvering for navigation around pedestrians
Jin et al. Event-Triggered bundled target traversing path planning using a dynamic elliptical guidance region for unmanned surface vehicles
Xiong et al. Research on intelligent path planning technology of logistics robots based on Giraph architecture
Jungbluth et al. Reinforcement Learning-based Scheduling of a Job-Shop Process with Distributedly Controlled Robotic Manipulators for Transport Operations

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21940261

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21940261

Country of ref document: EP

Kind code of ref document: A1