WO2022095278A1 - Collaborative welding method for ship multi-manipulator welding spots based on QMIX reinforcement learning algorithm - Google Patents

Collaborative welding method for ship multi-manipulator welding spots based on QMIX reinforcement learning algorithm Download PDF

Info

Publication number
WO2022095278A1
WO2022095278A1 · PCT/CN2021/070578 · CN2021070578W
Authority
WO
WIPO (PCT)
Prior art keywords
welding
action
action value
manipulator
value
Prior art date
Application number
PCT/CN2021/070578
Other languages
English (en)
French (fr)
Inventor
廖良闯
张本顺
孙宏伟
李萌萌
花磊
陈卫彬
陈杨杨
马韬
余睿
王传生
Original Assignee
中国船舶重工集团公司第七一六研究所
江苏杰瑞科技集团有限责任公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国船舶重工集团公司第七一六研究所, 江苏杰瑞科技集团有限责任公司 filed Critical 中国船舶重工集团公司第七一六研究所
Priority to EP21887984.9A priority Critical patent/EP4241915A1/en
Publication of WO2022095278A1 publication Critical patent/WO2022095278A1/zh

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • B25J9/1684Tracking a line or surface by means of sensors
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B23MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
    • B23KSOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
    • B23K37/00Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B23MACHINE TOOLS; METAL-WORKING NOT OTHERWISE PROVIDED FOR
    • B23KSOLDERING OR UNSOLDERING; WELDING; CLADDING OR PLATING BY SOLDERING OR WELDING; CUTTING BY APPLYING HEAT LOCALLY, e.g. FLAME CUTTING; WORKING BY LASER BEAM
    • B23K37/00Auxiliary devices or processes, not specially adapted to a procedure covered by only one of the preceding main groups
    • B23K37/02Carriages for supporting the welding or cutting element
    • B23K37/0252Steering means
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/005Manipulators for mechanical processing tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B25J9/1666Avoiding collision or forbidden zones
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • B25J9/1682Dual arm manipulator; Coordination of several manipulators
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39102Manipulator cooperating with conveyor
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/39Robotics, robotics to robotics hand
    • G05B2219/39132Robot welds, operates on moving workpiece, moved by other robot
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/40Robotics, robotics mapping to robotics vision
    • G05B2219/40499Reinforcement learning algorithm
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05BCONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B2219/00Program-control systems
    • G05B2219/30Nc systems
    • G05B2219/45Nc applications
    • G05B2219/45104Lasrobot, welding robot

Definitions

  • the invention belongs to the field of welding control, and in particular relates to a collaborative welding method for ship multi-manipulator welding points based on a QMIX reinforcement learning algorithm.
  • Welding robots account for more than half of all industrial robots in all walks of life around the world.
  • Welding robots have evolved from the early assembly-line arm robots, to sensor-based offline-programmed welding robots, to the multi-sensor, highly adaptive intelligent welding robots that are widely used today; their welding capability and level of automation have increased rapidly.
  • Using welding robots to weld multiple complex scene welds in a large space such as a ship can greatly improve the welding efficiency and quality. This technology has also become a hot research topic today.
  • Introducing robots into the shipbuilding industry to partially and eventually fully replace manual welding, and to achieve mechanization, automation, partial or even full intelligence and truly "unmanned" production, is the inevitable trend of future ship construction (Yu Bujiang, "Research on Robot Welding Technology for Cabin Grid Compartments" [D], Jiangsu: Jiangsu University of Science and Technology, 2013).
  • The purpose of the present invention is to provide a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm, which enables reasonable planning of the manipulators' working paths and collision avoidance under the task requirements of collaborative spot welding and in a welding environment with limited information sensing, while reducing the labor and time costs of production.
  • the technical solution for realizing the purpose of the present invention is: a method for collaborative welding of multi-manipulator welding points of ships based on QMIX reinforcement learning algorithm, which is suitable for path planning and collision avoidance in the process of collaborative welding of multi-manipulator welding points of ships. It includes the following steps:
  • step (a) specifically comprises the following steps:
  • Step (a1) describe the cabin space based on the simulation welding environment in python software from the actual ship multi-manipulator welding scene;
  • Step (a2) According to the cabin space and the actual welding scene, describe the position of the robot arm and the area to be welded, and divide it in the space.
  • step (b) specifically comprises the following steps:
  • Step (b1) the state value of the robotic arm as the reinforcement learning agent in the learning and training process is specified by the simulation environment in a);
  • Step (b2) According to the actual operation mode of the manipulator and the simulation environment, describe the action value of the manipulator as a reinforcement learning agent in the learning and training process.
  • step (c) specifically comprises the following steps:
  • Step (c1) analyze the task requirements of the actual welding scene, and give the collaborative welding goals that need to be achieved;
  • Step (c2) Characterize the goal described in step (c1) as the reward value form of the reinforcement learning process.
  • step (d) specifically includes the following steps:
  • Step (d1) Construct a recurrent neural network with state value and action value as input and action value function as output;
  • Step (d2) The state value and action value obtained in step (b1) and step (b2) are used as input, and the local action value function of each robotic arm is obtained through the neural network of step (d1);
  • Step (d3) From the local action value function of each manipulator obtained in step (d1), select the action of each manipulator at the next moment according to the epsilon-greedy strategy.
  • step (e) specifically comprises the steps:
  • Step (e1) Construct a hyper-network with non-negative weights that takes the local action value function as the input and the overall value function as the output;
  • Step (e2) take the local action value function obtained in step (d2) as an input, and substitute it into the neural network constructed in step (e1) to obtain the overall action value function.
  • step (f) specifically includes the following steps:
  • Step (f1) Construct a loss function from the reward value of step (c2) and the overall action value function of step (e2);
  • Step (f2) Calculate and update the neural network weights by the back-propagation algorithm and repeat the training process.
  • the present invention has the following significant advantages:
  • the method is simple and effective, and can realize the reasonable planning of the working path of the manipulator in the welding environment with limited information sensing and the task requirements of collaborative welding spot welding, avoid collision, and reduce the labor and time cost of production.
  • Figure 1 shows a welding scene of ship assembly components.
  • Fig. 2 is the simulation welding grid world based on python software of the present invention.
  • Figure 3 is a schematic diagram of the simulation welding environment of two robotic arms based on python software.
  • Figure 4 is a partial observation of the robotic arm.
  • Figure 5 shows the potential collision scene of two robotic arms in the simulation environment.
  • Figure 6 shows a basic recurrent neural network.
  • Figure 7 is an expanded view of the recurrent neural network at three moments.
  • Figure 8 shows the super network for the overall action value function calculation.
  • Fig. 9 is a flow chart of the collaborative welding method of ship multi-manipulator welding points based on the QMIX reinforcement learning algorithm.
  • the invention is a collaborative welding method for ship multi-manipulator welding points based on the QMIX reinforcement learning algorithm, and is especially suitable for welding environments including collaborative welding requirements, limited sensing information, collision avoidance and the like.
  • During learning and training, owing to the limitations of sensors and other information-exchange equipment, each of the robotic arms J1, J2 can only obtain the information of the 3x3 grid cells centered on itself and cannot obtain the global state information; for example, the specific positions of welding areas outside this range cannot be obtained.
  • the setting of the reward value r of the robotic arm in the learning environment can guide its learning process correctly.
  • For the collaborative welding requirement, the collaborative welding area requires both manipulators to reach their initial welding positions, and after the welding of the collaborative welding area is completed the two manipulators receive a common reward of +500.
  • For collision avoidance, it is stipulated that when the two manipulators are within one grid cell of each other, if the actions they take cause their coordinate positions to overlap, i.e., a collision occurs, the two manipulators jointly receive a penalty of -500, which prevents the manipulators from colliding again in the subsequent learning and training process. When the two manipulators have jointly completed the tasks of all welding areas, they receive a reward of +2000 to encourage them to complete the welding task cooperatively.
  • the QMIX algorithm can be used to train the learning process of the robotic arm in the environment.
  • The local action value function q_i of each robotic arm is obtained from its own action and observation through a recurrent neural network.
  • Each robotic arm's choice of its own action is also based on the local action value function output by such a recurrent neural network.
  • Under the epsilon-greedy strategy, given the current observation, the manipulator selects with probability ε the action corresponding to the largest value q_i_max among all feasible actions, and with probability 1-ε selects any other, non-maximal action as the action at the current moment.
  • each robotic arm can choose a more reasonable action after learning a better value function.
  • Each manipulator chooses an action according to its own action value function, but at the same time an overall action value function q_total covering all manipulators is also learned.
  • By passing the local action value function of each agent through a hypernetwork with non-negative weights, the overall action value function q_total is obtained.
  • From q_total and the reward value r, a loss function L(θ) for updating the action value functions can be constructed, where θ is a parameter vector containing the neural network weights U, V, W.
  • According to the back-propagation algorithm, this loss function can be minimized and the updated weights of the neural networks can be calculated. The above steps are repeated and the neural networks are trained with the updated weights, finally yielding a result that approximately converges to the optimal value.
  • the training and learning process of the collaborative welding of ship multi-manipulator welding points based on the QMIX reinforcement learning algorithm is completed.
  • Fig. 9 is the design flow chart of the present invention, is made up of steps P1-P6, and each step is described as follows:
  • The present invention is a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm. Therefore, according to the actual spot-welding requirements of ship multi-manipulator collaborative welding, it is necessary to build a reinforcement learning environment for collaborative welding of ship multi-manipulator welding spots and to set the areas to be welded and the working areas of the robotic arms.
  • the specific implementation steps are as follows:
  • Step 1 According to the actual ship welding scene shown in Figure 1, the ship welding task is mainly to weld the strip-shaped area to be welded in the figure by manipulating the robotic arm in a welding cabin. Therefore, the cv library in python software can be used to describe the cabin space of the simulated welding environment as shown in Figure 2.
  • the entire welding cabin is built on a plane rectangular coordinate system, the upper left corner of the entire welding cabin is the origin, the horizontal rightward direction is the x-axis, and the vertical downward direction is the y-axis.
  • the 20x20 equal white grids corresponding to the entire plane represent the cabin in the actual welding environment;
  • Step 2: The area to be welded in Figure 1 usually consists of the three parts shown in the figure, and the welding of the area is considered complete only when all three parts have been welded. The manipulators and welding areas are therefore set as follows: the manipulators are represented by the circles labeled I and II in Figure 3, their initial positions are at the two ends of the plane, at the coordinates shown in the figure, and the welding areas are represented as follows: the two parts of size 2x4 and 2x10, labeled "1" and "2" respectively (hereinafter referred to as welding area 1 and welding area 2), represent two areas to be welded of different sizes.
  • Each of the two areas to be welded has two edges to be welded (that is, each 1x4 or 1x10 strip separated by a dotted line is one edge, so the 2x4 and 2x10 areas represent the two edges of one welding area), and the welding of both edges of either area can be completed by a single robotic arm. When a single manipulator welds such an area, it must start welding from the starting position (black triangle) and end the welding at the end position (white triangle).
  • The 2x8 area labeled "3" represents the area to be welded that requires the two robotic arms to weld collaboratively; that is, only when both robotic arms are at the starting welding positions of the 1x8 edges (at the black triangles) can welding of this area begin, and it ends at the end positions (white triangles).
  • Step P2 further specifies the state value and action value of the robotic arm in the reinforcement learning process from the simulation environment and welding area set in step P1.
  • the specific implementation steps are as follows:
  • The first step: setting of the state values.
  • For the fully observed (non-partially observed) case, the state value of the i-th manipulator is simply its coordinates.
  • However, the robotic arm is usually limited by the performance of communication and sensor equipment during welding and can only obtain information within a certain range around itself rather than accurate global state information, so the setting of the state values here has to take partial observation into account.
  • It is stipulated that each robotic arm can only obtain the information of the 3x3 grid cells centered on itself (as shown in Figure 4) and can no longer obtain its exact state value (coordinates).
  • The observation of the i-th manipulator is then expressed as a set of relative coordinates: taking the position of the robotic arm (the gray circle in the figure) as the center, the relative coordinate information with respect to the other cells is obtained.
  • In addition to the coordinate information of the nearby cells, the robotic arm can also obtain the corresponding environment information on those cells.
  • For example, if a cell is the left-side starting welding position of welding area 1, then during learning the robotic arm can know the relative position between its current position and that starting welding position, and can thereby judge where it is currently located.
  • The second step: setting of the action values. The motion of the robotic arm can be broadly divided into five actions -- up, down, left, right and stay -- corresponding to (u1, u2, u3, u4, u5). Taking the actual situation into account, the actions are further restricted in the learning environment: (1) when the robotic arm moves next to a wall (the boundary cells of the grid), the action is restricted so that if the selected action would collide with the wall, the actual action is to stay still.
  • For example, if the robotic arm at the left boundary selects the leftward action u3, the actual result of this action is that the robotic arm remains stationary; (2) after successfully finding and entering the initial welding position of an area to be welded, the manipulator's movement is restricted to the direction that continues and completes the welding. For example, if a manipulator is at the starting welding position of the left edge of welding area 1, then until it finishes welding that 1x4 strip its actions can only move downward along the welding direction, i.e., whichever action is selected, the final actual action is the downward one. This reflects the actual situation: once the welding robotic arm has found the starting point of the weld, it can directly complete the welding of that part without additionally learning this process.
  • Step P3 sets the reward value for guiding the learning of the robotic arm in the reinforcement learning process according to the regulations on the state value and the action value in the step P2 and the collaborative welding task requirements in actual production.
  • the design is implemented in the following steps:
  • The first step: the task requirements of the actual welding scene comprise three points. First, the task requires the robotic arms to complete the welding of all areas to be welded in as few steps as possible, that is, in the shortest possible working time; second, the robotic arms should avoid, as far as possible, collisions with the boundary and with each other during the welding operation; third, the welding task must be completed in a way that reflects the cooperation of the robotic arms.
  • Step 2: From the above analysis, the reward value r to be set needs to include the following aspects. (1) Considering that the overall goal of the scheme is to complete all welding tasks in the shortest possible time -- which in the discrete environment corresponds to the smallest total number of movement steps -- the reward for each step is set to -1, meaning that the more steps the robotic arms need to finish welding all areas, the smaller the total reward; this encourages the robotic arms to complete all welding tasks in as few steps as possible. (2) When a robotic arm is welding, i.e., when every step is fixed to advance along the welding direction, the reward per step is 0, meaning that these extra steps are neither penalized nor rewarded, so that the robotic arm can successfully complete the welding goal.
  • (3) For a non-collaborative welding area, a robotic arm receives a reward of +200 after completing the welding task of that area; for the collaborative welding area, since welding can only begin when both robotic arms are at the initial welding positions of that area, the two robotic arms receive a reward of +500 after both have entered the initial welding positions and the welding of the area has been completed.
  • During learning, the robotic arms make decisions under the guidance of the reward values and keep updating their decisions in the direction of the maximum cumulative reward, so when designing this cooperative task there is no need to additionally specify reward values for the welding sequence of the arms, because the robotic arms already keep updating and optimizing their welding-sequence decisions during learning.
  • step P4 the state value and action value set in step P2 are calculated through the recurrent neural network to obtain the local action value function of each mechanical arm, and the action selection process is performed.
  • the design is implemented in the following steps:
  • Step 1 Build a recurrent neural network as shown in Figure 6.
  • x denotes the input layer, s the hidden layer and o the output layer.
  • U, W and V denote the weights from the input layer to the hidden layer, between the hidden layers, and from the hidden layer to the output layer, respectively.
  • Since the recurrent neural network is designed for sequential decision-making, unrolling the single-time-step network of Figure 6 (subscript t) into the form of Figure 7 gives the structure of the network at three time steps: the previous step (subscript t-1), the current step, and the next step (subscript t+1).
  • the hidden layer information at time t is not only determined by the input x at the current time, but also determined by the hidden layer information s t-1 at the previous time in the update process.
  • the manipulator since the manipulator cannot obtain accurate position information during the learning process, it is very important to obtain the observation information of the previous moment together as the observation value of the current moment, which can greatly improve the mechanical The accuracy of the arm's judgment on the current real position.
  • Step 2: When the robotic arm obtains some initial observation and takes an action, the observation and action are fed into the neural network constructed in the first step, and the corresponding value of the robotic arm's local action value function is obtained.
  • Step 3: After a robotic arm obtains an observation, action selection is performed according to the action value function q_i of all feasible actions under the current observation and the epsilon-greedy strategy. That is, under the current observation the robotic arm selects with probability ε the action corresponding to the largest value q_i_max among all feasible actions, and with probability 1-ε selects any other, non-maximal action as the action at the current moment.
  • step P5 the local action value function of each manipulator obtained in step P4 is obtained by setting a fully connected neural network with non-negative weights to obtain the overall action value function of all manipulators.
  • the specific implementation steps are as follows:
  • The first step: build a hypernetwork that satisfies the condition that all of its weights are non-negative.
  • According to the QMIX algorithm, the overall action value function q_total and the local action value functions q_i must satisfy the monotonicity constraint ∂q_total/∂q_i ≥ 0 for every manipulator i.
  • The constructed hypernetwork structure is shown in Figure 8.
  • |W_2| denotes the non-negative weights of the neural network from the inputs q_i to the output q_total.
  • s_total is the global state. It is worth noting that during each robotic arm's own action selection and state transition the global state cannot be used directly; instead, each robotic arm uses the observations from its own observation range. The global state information is used only when training and updating the overall action value function. Therefore, from the robotic arm's point of view, the learning environment is still a partially observed environment rather than one in which the global state can be exploited.
  • The second step: the values of the local action value functions of the robotic arms obtained in step P4 are fed into the hypernetwork constructed in the previous step, and the overall action value function q_total is obtained.
  • Step P6 constructs a loss function based on the reward value r of Step P3 and the overall action value function q total obtained in Step P5, and calculates and updates the weight of the neural network according to the back-propagation algorithm.
  • the specific implementation steps are as follows:
  • Neural networks are usually trained by minimizing a loss function. Combining the QMIX algorithm and the construction of this method, the main objective is to minimize the loss function L(θ) = (r + γ·max_{u'} q_total(o', u', s'; θ⁻) − q_total(o, u, s; θ))².
  • Here γ is the update rate, and o', u' and s' denote, respectively, the observation values, action values and state values at the next moment that are used for computing the target value.
  • θ and θ⁻ are the weight parameters of the estimation neural network that computes the action value function at the current moment and of the target neural network that computes the action value function at the next moment, respectively.
  • Step 2: According to the back-propagation algorithm, the weights U, V, W of the neural networks can be calculated, yielding the updates of the parameters θ and θ⁻.
  • The parameters θ and θ⁻ are the parameter vectors containing the neural network weights U, V, W. The above steps are repeated and the neural networks are trained with the updated weights, finally yielding a result that approximately converges to the optimal value.
  • the training and learning process of the collaborative welding of ship multi-manipulator welding points based on the QMIX reinforcement learning algorithm is completed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Physics & Mathematics (AREA)
  • Optics & Photonics (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

A collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm, comprising the following steps: a) building a reinforcement learning environment and setting the welding areas and working areas in the environment; b) determining the state values and action values of the manipulators; c) setting reward values according to the state values, the action values, and the tasks of collaborative welding and collision avoidance; d) computing the local action value function of each manipulator from the state values and action values by means of a recurrent neural network, and carrying out the action selection process; e) obtaining the overall action value function of all manipulators from the local action value functions by means of a hypernetwork with non-negative weights; f) constructing a loss function from the reward values of step c) and the overall action value function network of step e), calculating and updating the weights of the neural networks according to the back-propagation algorithm, and repeating the training process. The method does not depend on a system model, is simple and effective, and can accomplish the task of collaborative spot welding in an environment with obstacles.

Description

Collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm

Technical Field

The present invention belongs to the field of welding control, and in particular relates to a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm.

Background Art

Welding robots account for more than half of all industrial robots across industries worldwide. From the early assembly-line arm robots, to sensor-based offline-programmed welding robots, to today's widely used multi-sensor intelligent welding robots with a high degree of adaptability, both the welding capability and the level of automation of welding robots have increased rapidly. Using welding robots to weld multiple welds in complex scenes in a large space such as a ship can greatly improve welding efficiency and quality, and this technology has become a popular research topic. Introducing robots into shipbuilding to partially and eventually fully replace manual welding, and to achieve mechanization, automation, partial or even full intelligence and truly "unmanned" production, is the inevitable trend of future ship construction (Yu Bujiang, "Research on Robot Welding Technology for Cabin Grid Compartments" [D], Jiangsu: Jiangsu University of Science and Technology, 2013).

Under complex task requirements, collaborative welding by multiple welding robots is of great significance for improving welding efficiency and reducing welding time. The main problems to be handled and studied in multi-robot welding can be divided into the following aspects: first, collaborative task allocation and path planning for multiple welding robots; second, mutual interference and collision between robots during the welding operation; third, collaborative welding when the communication of robots that exchange information through sensing devices is limited during operation. Scholars at home and abroad have studied a variety of methods to address these three problems.

Jing Fengshui et al. (Jing Fengshui, Tan Min, Hou Zengguang, Liang Zize, Wang Yunkuan, "Research on Kinematics and Docking Accuracy of a Hull Block Docking System Based on Multi-Robot Coordination", Robot, 2002(04): 324-328) addressed the kinematics and docking accuracy of a hull block docking system based on multi-robot coordination, proposed a robot trajectory planning algorithm and a docking control scheme, and discussed the influence of several error factors on the docking accuracy of the system; the method can guarantee the docking accuracy of hull blocks. Gan et al. (Gan Y, Duan J, Chen M, Dai X, "Multi-Robot Trajectory Planning and Position/Force Coordination Control in Complex Welding Tasks", Applied Sciences, 2019, 9(5): 924) addressed trajectory planning and position coordination control of a multi-robot system during welding, adopted an object-oriented hierarchical planning and control strategy, and proposed symmetric internal/external adaptive variable impedance control for position tracking of multi-robot collaborative manipulators; the method can complete welding tasks smoothly and achieve good position-tracking performance. The above literature solves the multi-robot collaborative welding path planning problem well, but does not address interference between robots or sensor limitations.

Chen Hui ("Research on Path Planning for Multi-Robot Collaborative Welding of the Body-in-White" [D], Shanxi: North University of China, 2019) analyzed and compared the advantages and disadvantages of commonly used path planning algorithms, proposed a composite improved ant colony algorithm to remedy the defects of the traditional ant colony algorithm, applied it to single-robot and multi-robot welding path planning, and established a welding-spot allocation model and an anti-interference model between robots for the multi-robot welding path planning problem; however, that work also does not take sensor limitations into account. Considering that ship welding takes place in a relatively large space, the limited performance of the sensing equipment during robot operation often makes it impossible to obtain global welding information, which affects the accuracy of path planning and the effectiveness of interference prevention.
Summary of the Invention

The purpose of the present invention is to provide a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm; the method is simple and effective, enables reasonable planning of the manipulators' working paths and collision avoidance under the task requirements of collaborative spot welding and in a welding environment with limited information sensing, and at the same time reduces the labor and time costs of production.

The technical solution for achieving the purpose of the present invention is: a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm, suitable for path planning and collision avoidance during collaborative welding of ship multi-manipulator welding spots, specifically comprising the following steps:

Step (a): from the actual ship multi-manipulator spot-welding scene, build a reinforcement learning environment for collaborative welding of ship multi-manipulator welding spots, and set and distinguish the welding areas and the manipulator working areas in the environment;

Step (b): from the environment and areas set in step (a), construct the state values and action values of the manipulators in the reinforcement learning process;

Step (c): from the definitions of the state values and action values in step (b) and the collaborative welding task requirements in actual production, set the reward values that guide the manipulators' learning in the reinforcement learning process;

Step (d): pass the state values and action values set in step (b) through a recurrent neural network to compute the local action value function of each manipulator, and carry out the action selection process;

Step (e): from the local action value function of each manipulator obtained in step (d), obtain the overall action value function of all manipulators through a hypernetwork with non-negative weights;

Step (f): construct a loss function from the reward values of step (c) and the overall action value function network of step (e), calculate and update the weights of the neural networks by the back-propagation algorithm, and repeat the training process.

Further, step (a) specifically comprises the following steps:

Step (a1): from the actual ship multi-manipulator welding scene, model the cabin space of the simulated welding environment in Python;

Step (a2): from the cabin space and the actual welding scene, describe the positions of the manipulators and the areas to be welded, and partition them in the space.

Further, step (b) specifically comprises the following steps:

Step (b1): specify, from the simulation environment in step (a), the state values of the manipulators acting as reinforcement learning agents during learning and training;

Step (b2): according to the actual operating mode of the manipulators and the simulation environment, describe the action values of the manipulators, acting as reinforcement learning agents, during learning and training.

Further, step (c) specifically comprises the following steps:

Step (c1): analyze the task requirements of the actual welding scene and give the collaborative welding goals to be achieved;

Step (c2): express the goals described in step (c1) in the form of reward values for the reinforcement learning process.

Further, step (d) specifically comprises the following steps:

Step (d1): construct a recurrent neural network that takes state values and action values as input and an action value function as output;

Step (d2): take the state values and action values obtained in steps (b1) and (b2) as input and obtain the local action value function of each manipulator through the neural network of step (d1);

Step (d3): from the local action value function of each manipulator obtained in step (d1), select the action of each manipulator at the next moment according to the epsilon-greedy strategy.

Further, step (e) specifically comprises the following steps:

Step (e1): construct a hypernetwork with non-negative weights that takes the local action value functions as input and the overall value function as output;

Step (e2): take the local action value functions obtained in step (d2) as input, substitute them into the neural network constructed in step (e1), and obtain the overall action value function.

Further, step (f) specifically comprises the following steps:

Step (f1): construct a loss function from the reward values of step (c2) and the overall action value function of step (e2);

Step (f2): calculate and update the neural network weights by the back-propagation algorithm and repeat the training process.

Compared with the prior art, the present invention has the following significant advantages:

The method is simple and effective, enables reasonable planning of the manipulators' working paths and collision avoidance under the task requirements of collaborative spot welding and in a welding environment with limited information sensing, and reduces the labor and time costs of production.
Brief Description of the Drawings

Figure 1 shows a welding scene of ship assembly components.

Figure 2 is the simulated welding grid world of the present invention, built in Python.

Figure 3 is a schematic diagram of the simulated welding environment with two manipulators, built in Python.

Figure 4 shows the partial observation of a manipulator.

Figure 5 shows a potential collision scene of the two manipulators in the simulation environment.

Figure 6 shows a basic recurrent neural network.

Figure 7 is the recurrent neural network unrolled over three time steps.

Figure 8 shows the hypernetwork used to compute the overall action value function.

Figure 9 is a flow chart of the collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm.

In the above figures: the observation space of a manipulator is centered on the manipulator's own position, and its j-th entry is the observation information of the j-th grid cell; x: input layer of the recurrent neural network; s: hidden layer of the recurrent neural network; o: output layer of the recurrent neural network; U: weights from the input layer to the hidden layer; W: weights between hidden layers; V: weights from the hidden layer to the output layer; x_{t-1}, s_{t-1}, o_{t-1}: input, hidden and output layers at time t-1; x_{t+1}, s_{t+1}, o_{t+1}: input, hidden and output layers at time t+1; q_1: local action value function of the first manipulator; q_2: local action value function of the second manipulator; W_2: weights of the hypernetwork; s_total: global state information; q_total: overall action value function.
Detailed Description of the Embodiments

The present invention is described in further detail below with reference to the accompanying drawings.

The present invention is a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm, particularly suitable for welding environments that involve collaborative welding requirements, limited sensing information, collision avoidance and similar conditions. First, consider two manipulators J1 and J2 that learn collaborative welding in a 20x20 grid world, which is modeled in Python. According to the actual welding scene, independent welding areas of size 2x4 and 2x10 and a collaborative welding area of size 2x8 are created; their starting coordinates in the grid world are ((4,9),(5,9)), ((9,6),(10,6)) and ((16,8),(17,8)) respectively.

During learning and training, owing to the limitations of sensors and other information-exchange equipment, each of the two manipulators J1, J2 can only obtain the information of the 3x3 grid cells centered on itself and cannot obtain the global state information; for example, the specific positions of welding areas outside this range cannot be obtained.

As for the actions of the two manipulators, in addition to the four basic directions (up, down, left, right) and staying still, some restrictions are placed on the manipulators' actions. In real welding, once a manipulator has found the exact welding spot, it can smoothly complete the welding of the part containing that spot; correspondingly, in the simulation environment, once a manipulator reaches the initial point of a welding area, its actions are restricted to moving only in the direction that completes the welding, and it cannot leave that welding area.

Setting the reward value r of the manipulators in the learning environment can correctly guide their learning process. First, for the collaborative welding requirement, the collaborative welding area requires both manipulators to reach their initial welding positions, and after the welding of the collaborative welding area is completed, the two manipulators receive a common reward of +500. In addition, for collision avoidance, it is stipulated that when the two manipulators are within one grid cell of each other, if the actions they take cause their coordinate positions to overlap, i.e., a collision occurs, the two manipulators jointly receive a penalty of -500, which is intended to prevent the manipulators from colliding again in the subsequent learning and training process. When the two manipulators have jointly completed the tasks of all welding areas, they receive a reward of +2000 to encourage them to complete the welding tasks cooperatively.

After the above specifications of the state values, action values and reward values of the manipulators' learning process, the QMIX algorithm can be used to train the learning process of the manipulators in the environment.

The local action value function q_i of each manipulator is obtained from its own action and observation through a recurrent neural network. Each manipulator's choice of its own action is also based on the local action value function output by such a recurrent neural network. Under the epsilon-greedy strategy, given the current observation, a manipulator selects with probability ε the action corresponding to the largest value q_i_max among all feasible actions, and with probability 1-ε selects any other, non-maximal action as the action at the current moment. With this strategy, each manipulator can choose more reasonable actions once it has learned a better value function.

The QMIX algorithm stipulates that each manipulator selects actions according to its own action value function, but at the same time an overall action value function q_total covering all manipulators must also be learned. By passing the local action value function of each agent through a hypernetwork with non-negative weights, the overall action value function q_total is obtained. From q_total and the reward value r, a loss function L(θ) for updating the action value functions can be constructed, where θ is the parameter vector containing the neural network weights U, V, W. According to the back-propagation algorithm, this loss function can be minimized and the updated weights of the neural networks can be calculated. By repeatedly carrying out the above steps and training the neural networks with the updated weights, a result that approximately converges to the optimal value is finally obtained. This completes the training and learning process for collaborative welding of ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm.
Figure 9 is the design flow chart of the present invention; it consists of steps P1-P6, which are described as follows:

(1) Step P1

The present invention is a collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm; therefore, according to the actual spot-welding requirements of ship multi-manipulator collaborative welding, a reinforcement learning environment for collaborative welding of ship multi-manipulator welding spots must be built, and the areas to be welded and the manipulator working areas must be set. The specific implementation steps are as follows:

Step 1: From the actual ship welding scene shown in Figure 1, the ship welding task mainly consists of welding the strip-shaped areas to be welded shown in the figure by manipulating robotic arms inside a welding cabin. Therefore, the cv library in Python can be used to model the cabin space of the simulated welding environment shown in Figure 2. The entire welding cabin is built on a planar rectangular coordinate system: the upper left corner of the cabin is the origin, the horizontal rightward direction is the x-axis, and the vertical downward direction is the y-axis. The 20x20 equal white grid cells covering the whole plane represent the cabin in the actual welding environment.

Step 2: The area to be welded in Figure 1 usually consists of the three parts shown in the figure, and the welding of the area is considered complete only when all three parts have been welded. The manipulators and welding areas are therefore set as follows. The manipulators are represented by the circles labeled I and II in Figure 3; their initial positions are at the two ends of the plane, at the coordinates shown in the figure. The welding areas are represented as follows: the two parts of size 2x4 and 2x10, labeled "1" and "2" respectively (hereinafter referred to as welding area 1 and welding area 2), represent two areas to be welded of different sizes. Each of these two areas has two edges to be welded (that is, each 1x4 or 1x10 strip separated by the dotted line is one edge, and the 2x4 and 2x10 areas represent the two edges of one welding area), and the welding of both edges of either area can be completed by a single manipulator. When a single manipulator welds such an area, it must start welding from the starting position (black triangle) and finish welding at the end position (white triangle). The 2x8 area labeled "3" (hereinafter referred to as welding area 3) represents the area to be welded that requires the two manipulators to weld collaboratively; that is, only when both manipulators are at the starting welding positions of the 1x8 edges (at the black triangles) can welding of this area begin, and it ends at the end positions (white triangles).
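For illustration only, the following Python sketch shows one possible way to encode the 20x20 grid world and the three welding areas described in step P1. The NumPy array layout, the integer cell labels, the reading of the start coordinates as (x, y) pairs and the assumption that each strip extends downward from its listed start cells are choices made for this sketch; the patent itself only states that the environment is modeled in Python with the cv library.

```python
# Illustrative sketch of the 20x20 grid world of step P1 (not the patent's actual code).
import numpy as np

GRID = 20
EMPTY, AREA1, AREA2, AREA3 = 0, 1, 2, 3   # arbitrary labels chosen for this sketch

def build_cabin():
    """Return a 20x20 array marking the three welding areas; indexing is world[y, x]."""
    world = np.full((GRID, GRID), EMPTY, dtype=int)
    # Welding area 1: 2x4 strip whose start cells are (4,9) and (5,9), assumed to extend downward.
    world[9:13, 4:6] = AREA1
    # Welding area 2: 2x10 strip starting at (9,6) and (10,6).
    world[6:16, 9:11] = AREA2
    # Welding area 3: 2x8 cooperative strip starting at (16,8) and (17,8).
    world[8:16, 16:18] = AREA3
    return world

if __name__ == "__main__":
    cabin = build_cabin()
    print((cabin == AREA3).sum())   # 16 cells -> the 2x8 cooperative area
```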
(2) Step P2

Step P2 further specifies, from the simulation environment and welding areas set in step P1, the state values and action values of the manipulators in the reinforcement learning process. The specific implementation steps are as follows:

Step 1: setting of the state values. For the fully observed (non-partially observed) case, the state value of the i-th manipulator is simply its coordinates. However, considering that actual welding environments are relatively large and that a manipulator is usually limited by the performance of communication and sensor equipment during welding, it can only obtain information within a certain range around itself and cannot accurately obtain the global state information; the setting of the state values here therefore has to take partial observation into account. It is stipulated that in the grid world each manipulator can only obtain the information of the 3x3 cells centered on itself (as shown in Figure 4) and can no longer obtain its exact state value (coordinates). The observation of the i-th manipulator is then expressed as a set of relative coordinates: taking the manipulator's own position (the gray circle in the figure) as the center, the relative coordinate information of the surrounding cells is obtained; for example, the coordinate information of a neighboring cell can be expressed as its offset from the manipulator's position. In addition to the coordinate information of the nearby cells, the manipulator can also obtain the corresponding environment information on those cells. For example, if a cell is the left-side starting welding position of welding area 1, then during learning the manipulator can know the relative position between its current position and that starting welding position, and can thereby judge where it is currently located.

Step 2: setting of the action values. The motion of a manipulator can be broadly divided into five actions -- up, down, left, right and stay -- corresponding to (u1, u2, u3, u4, u5). Taking the actual situation into account, the actions in the learning environment are further restricted as follows: (1) when a manipulator moves next to a wall (the boundary cells of the grid), the action is restricted so that if the selected action would collide with the wall, the actual action is to stay still; for example, if a manipulator at the left boundary selects the leftward action u3, the actual result of this action is that the manipulator remains stationary; (2) after successfully finding and entering the initial welding position of an area to be welded, the manipulator's actions are restricted to moving only in the direction that continues and completes the welding; for example, if a manipulator is at the starting welding position of the left edge of welding area 1, then until it finishes welding that 1x4 strip its actions can only move downward along the welding direction, i.e., whichever action is selected, the final actual action is the downward one. This reflects the actual situation: once a welding manipulator has found the starting point of a weld, it can directly complete the welding of that part without having to learn this process additionally.
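As a non-limiting illustration of step P2, the sketch below extracts the 3x3 partial observation (relative coordinates plus cell contents) and applies the two action restrictions described above (wall bumps become "stay"; once welding has started, the action is forced along the welding direction). The encoding of wall cells as -1 and the use of the downward welding direction from the welding-area-1 example are assumptions made for this sketch.

```python
# Illustrative observation extraction and action masking for step P2.
import numpy as np

ACTIONS = {0: (0, -1), 1: (0, 1), 2: (-1, 0), 3: (1, 0), 4: (0, 0)}  # up, down, left, right, stay

def local_observation(world, pos):
    """Relative coordinates and cell contents of the 3x3 window centred on `pos` = (x, y)."""
    x, y = pos
    obs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            cx, cy = x + dx, y + dy
            inside = 0 <= cx < world.shape[1] and 0 <= cy < world.shape[0]
            cell = world[cy, cx] if inside else -1   # -1 marks a wall cell (assumption)
            obs.extend([dx, dy, cell])               # relative coordinates + environment info
    return np.array(obs, dtype=np.float32)

def apply_action(pos, action, world, welding=False):
    """Clip wall collisions to 'stay'; while welding, force the welding direction (downward here)."""
    if welding:
        action = 1                                   # continue along the weld, as in the area-1 example
    dx, dy = ACTIONS[action]
    nx, ny = pos[0] + dx, pos[1] + dy
    if not (0 <= nx < world.shape[1] and 0 <= ny < world.shape[0]):
        return pos                                   # bumping the boundary leaves the arm still
    return (nx, ny)
```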
(3) Step P3

Step P3 sets, from the specifications of the state values and action values in step P2 and the collaborative welding task requirements in actual production, the reward values that guide the manipulators' learning in the reinforcement learning process. The design is implemented in the following steps:

Step 1: The task requirements of the actual welding scene comprise three points. First, the task requires the manipulators to complete the welding of all areas to be welded within as few steps as possible, i.e., within the shortest possible working time; second, the manipulators should avoid, as far as possible, collisions with the boundary and with each other during the welding operation; third, the welding task must be completed in a way that reflects the cooperation of the manipulators.

Step 2: From the above analysis, the reward value r to be set needs to include the following aspects. (1) Considering that the overall goal of the scheme is to complete all welding tasks in the shortest possible time -- which in the discrete environment corresponds to the smallest total number of movement steps -- the reward for each step is set to -1, meaning that the more steps the manipulators need to finish welding all areas, the smaller the total reward; this encourages the manipulators to complete all welding tasks in as few steps as possible. (2) When a manipulator is welding, i.e., when every step is fixed to advance along the welding direction, the reward per step is 0, meaning that these extra steps are neither penalized nor rewarded, so that the manipulator can successfully complete the welding goal. (3) For a non-collaborative welding area, a manipulator receives a reward of +200 after completing the welding task of that area; for the collaborative welding area, since welding can only begin when both manipulators are at the initial welding positions of that area, the two manipulators receive a reward of +500 after both have entered the initial welding positions and the welding of the area has been completed. It should be noted that during learning the manipulators make decisions under the guidance of the reward values and keep updating their decisions in the direction of the maximum cumulative reward, so when designing this cooperative task there is no need to additionally specify reward values for the welding order of the manipulators, because the manipulators already keep updating and optimizing their welding-order decisions during learning. (4) Collision avoidance: based on the relative distance between the two manipulators (if they can observe each other within their observation range), when the centers of the two manipulators are within one grid cell of each other (as shown by the two black circles in Figure 5), if the two manipulators choose actions that reduce the relative distance, they receive a collision penalty of -500, the current learning episode ends immediately, and they return to the initial positions and restart learning. Considering that in actual production a collision between two manipulators would cause great losses in production cost and efficiency, collisions must be avoided as much as possible during training. (5) Task completion: when the manipulators have finished welding all areas to be welded, the two manipulators receive a reward of +2000 and the learning episode ends.
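For illustration, the reward rules (1)-(5) of step P3 can be collected into a single function as sketched below; the numerical values follow the description, while the Boolean event flags are an assumed interface.

```python
# Illustrative reward sketch following rules (1)-(5) of step P3.
def step_reward(welding, finished_solo_area, finished_coop_area, collided, all_done):
    r = 0.0 if welding else -1.0   # (2) welding steps are free; (1) every other step costs -1
    if finished_solo_area:
        r += 200.0                 # (3) one arm finishes a non-cooperative area
    if finished_coop_area:
        r += 500.0                 # (3) both arms finish the cooperative area
    if collided:
        r += -500.0                # (4) the two arms overlap -> shared penalty, episode ends
    if all_done:
        r += 2000.0                # (5) all areas welded -> shared bonus, episode ends
    return r
```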
(4) Step P4

Step P4 passes the state values and action values set in step P2 through a recurrent neural network to compute the local action value function of each manipulator, and carries out the action selection process. The design is implemented in the following steps:

Step 1: Construct a recurrent neural network as shown in Figure 6, where x denotes the input layer, s the hidden layer and o the output layer, and U, W, V denote the weights from the input layer to the hidden layer, between hidden layers, and from the hidden layer to the output layer, respectively. Since the recurrent neural network is designed for sequential decision-making, unrolling the single-time-step network of Figure 6 (subscript t) into the form of Figure 7 gives the structure of the network at three time steps: the previous step (subscript t-1), the current step and the next step (subscript t+1). During the update, the hidden-layer information at time t is determined not only by the input x at the current time but also by the hidden-layer information s_{t-1} of the previous time step. This corresponds to the partial observation problem: since the manipulator cannot obtain accurate position information during learning, it is very important to use the observation information of previous moments together with the current observation, which greatly improves the accuracy of the manipulator's judgment of its current true position.

Step 2: When a manipulator obtains some initial observation and takes an action, the observation and action are fed into the neural network constructed in Step 1, and the corresponding value of the manipulator's local action value function is obtained.

Step 3: After a manipulator obtains an observation, action selection is performed according to the action value function q_i of all feasible actions under the current observation and the epsilon-greedy strategy; that is, under the current observation the manipulator selects with probability ε the action corresponding to the largest value q_i_max among all feasible actions, and with probability 1-ε selects any other, non-maximal action as the action at the current moment.
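The following sketch illustrates one possible realization of the recurrent agent network and the action-selection rule of step P4. PyTorch and the GRU-based recurrence are assumptions made for this sketch (the patent describes a generic recurrent network with weights U, W, V); the selection rule reproduces the description literally, i.e., the greedy action is taken with probability ε and a random non-maximal action with probability 1-ε.

```python
# Illustrative recurrent agent network and action selection for step P4 (assumed PyTorch).
import torch
import torch.nn as nn

class AgentRNN(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.fc_in = nn.Linear(obs_dim + n_actions, hidden)  # observation + last action (one-hot)
        self.rnn = nn.GRUCell(hidden, hidden)                 # hidden state carries past observations
        self.fc_out = nn.Linear(hidden, n_actions)            # one q value per feasible action

    def forward(self, obs_last_act, h):
        x = torch.relu(self.fc_in(obs_last_act))
        h = self.rnn(x, h)
        return self.fc_out(h), h

def select_action(q_values, eps):
    """With probability eps pick the greedy action, otherwise a random non-maximal one."""
    greedy = int(torch.argmax(q_values))
    if torch.rand(()) < eps:
        return greedy
    others = [a for a in range(q_values.numel()) if a != greedy]
    return others[int(torch.randint(len(others), ()))]
```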
(5) Step P5

Step P5 obtains, from the local action value function of each manipulator obtained in step P4, the overall action value function of all manipulators by means of a fully connected neural network with non-negative weights. The specific implementation steps are as follows:

Step 1: Construct a hypernetwork that satisfies the condition that all of its weights are non-negative. According to the requirements of the QMIX algorithm, the overall action value function q_total and the local action value functions q_i must satisfy the following monotonicity constraint:

∂q_total / ∂q_i ≥ 0, for every manipulator i.

To meet this requirement, the hypernetwork structure shown in Figure 8 is constructed, where |W_2| denotes the non-negative weights of the neural network from the inputs q_i to the output q_total, and s_total is the global state. It is worth noting that during each manipulator's own action selection and state transition the global state cannot be used directly; instead, each manipulator uses the observations from its own observation range. The global state information is used only when training and updating the overall action value function; therefore, from the manipulator's point of view, the learning environment is still a partially observed environment rather than one in which the global state can be exploited.

Step 2: The values of the local action value functions of the manipulators obtained in step P4 are fed into the hypernetwork constructed in the previous step, and the overall action value function q_total is obtained.
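As an illustrative sketch of the hypernetwork of step P5, the mixing network below produces non-negative mixing weights from the global state s_total by taking the absolute value of hypernetwork outputs, which enforces the monotonicity constraint ∂q_total/∂q_i ≥ 0. The layer sizes and the use of PyTorch are assumptions made for this sketch.

```python
# Illustrative QMIX-style mixing network for step P5 (assumed PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class QMixer(nn.Module):
    """Mixes local q_i into q_total; |.| on the hypernetwork outputs keeps dq_total/dq_i >= 0."""
    def __init__(self, n_agents, state_dim, embed=32):
        super().__init__()
        self.n_agents, self.embed = n_agents, embed
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed)
        self.hyper_b1 = nn.Linear(state_dim, embed)
        self.hyper_w2 = nn.Linear(state_dim, embed)
        self.hyper_b2 = nn.Linear(state_dim, 1)

    def forward(self, q_locals, s_total):
        # q_locals: (batch, n_agents); s_total: (batch, state_dim), used only during training
        w1 = torch.abs(self.hyper_w1(s_total)).view(-1, self.n_agents, self.embed)
        b1 = self.hyper_b1(s_total).view(-1, 1, self.embed)
        hidden = F.elu(torch.bmm(q_locals.unsqueeze(1), w1) + b1)
        w2 = torch.abs(self.hyper_w2(s_total)).view(-1, self.embed, 1)
        b2 = self.hyper_b2(s_total).view(-1, 1, 1)
        q_total = torch.bmm(hidden, w2) + b2
        return q_total.view(-1)                     # q_total: (batch,)
```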
(6) Step P6

Step P6 constructs a loss function from the reward value r of step P3 and the overall action value function q_total obtained in step P5, and calculates and updates the weights of the neural networks according to the back-propagation algorithm. The specific implementation steps are as follows:

Step 1: Neural networks are usually trained by minimizing a loss function. Combining the QMIX algorithm and the construction of this method, the main objective is to minimize the following loss function:

L(θ) = ( r + γ · max_{u'} q_total(o', u', s'; θ⁻) − q_total(o, u, s; θ) )²,

where γ is the update rate; o', u' and s' denote, respectively, the observation values, action values and state values at the next moment that are used for computing the target value; and θ and θ⁻ are the weight parameters of the estimation neural network that computes the action value function at the current moment and of the target neural network that computes the action value function at the next moment, respectively.

Step 2: According to the back-propagation algorithm, the weights U, V, W of the neural networks can be calculated, and thus the updates of the parameters θ and θ⁻ are obtained, where θ and θ⁻ are the parameter vectors containing the neural network weights U, V, W. By repeatedly carrying out the above steps and training the neural networks with the updated weights, a result that approximately converges to the optimal value is finally obtained. This completes the training and learning process for collaborative welding of ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm.
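For illustration, the loss of step P6 can be computed as sketched below. The names agent_net, mixer, target_agent_net and target_mixer are placeholders for an estimation network and a target network of the kind sketched above (the agent network is assumed here to return action values of shape (batch, n_agents, n_actions)); the batch layout is likewise an assumption.

```python
# Illustrative TD-target / loss computation for step P6, matching
# L(theta) = (r + gamma * max_u' q_total(o', u', s'; theta-) - q_total(o, u, s; theta))^2.
import torch

def qmix_loss(batch, agent_net, mixer, target_agent_net, target_mixer, gamma=0.99):
    obs, acts, rew, state, next_obs, next_state, done = batch   # assumed batch layout

    # q_i of the actions actually taken, mixed into q_total under the current parameters theta
    q_all, _ = agent_net(obs, None)                              # (batch, n_agents, n_actions)
    q_taken = q_all.gather(-1, acts.unsqueeze(-1)).squeeze(-1)   # (batch, n_agents)
    q_total = mixer(q_taken, state)

    # greedy target values under the target parameters theta-
    with torch.no_grad():
        next_q_all, _ = target_agent_net(next_obs, None)
        next_q_max = next_q_all.max(dim=-1).values               # max over feasible actions
        target_total = target_mixer(next_q_max, next_state)
        y_tot = rew + gamma * (1.0 - done) * target_total

    return torch.mean((y_tot - q_total) ** 2)   # minimised by back-propagation (e.g. SGD/Adam)
```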

Claims (7)

  1. A collaborative welding method for ship multi-manipulator welding spots based on the QMIX reinforcement learning algorithm, characterized in that the method is suitable for path planning and collision avoidance during collaborative welding of ship multi-manipulator welding spots, and specifically comprises the following steps:
    Step (a): from the actual ship multi-manipulator spot-welding scene, build a reinforcement learning environment for collaborative welding of ship multi-manipulator welding spots, and set and distinguish the welding areas and the manipulator working areas in the environment;
    Step (b): from the environment and areas set in step (a), construct the state values and action values of the manipulators in the reinforcement learning process;
    Step (c): from the definitions of the state values and action values in step (b) and the collaborative welding task requirements in actual production, set the reward values that guide the manipulators' learning in the reinforcement learning process;
    Step (d): pass the state values and action values set in step (b) through a recurrent neural network to compute the local action value function of each manipulator, and carry out the action selection process;
    Step (e): from the local action value function of each manipulator obtained in step (d), obtain the overall action value function of all manipulators through a hypernetwork with non-negative weights;
    Step (f): construct a loss function from the reward values of step (c) and the overall action value function network of step (e), calculate and update the weights of the neural networks by the back-propagation algorithm, and repeat the training process.
  2. The method according to claim 1, characterized in that step (a) specifically comprises the following steps:
    Step (a1): from the actual ship multi-manipulator welding scene, model the cabin space of the simulated welding environment in Python;
    Step (a2): from the cabin space and the actual welding scene, describe the positions of the manipulators and the areas to be welded, and partition them in the space.
  3. The method according to claim 2, characterized in that step (b) specifically comprises the following steps:
    Step (b1): specify, from the simulation environment in step (a), the state values of the manipulators acting as reinforcement learning agents during learning and training;
    Step (b2): according to the actual operating mode of the manipulators and the simulation environment, describe the action values of the manipulators, acting as reinforcement learning agents, during learning and training.
  4. The method according to claim 3, characterized in that step (c) specifically comprises the following steps:
    Step (c1): analyze the task requirements of the actual welding scene and give the collaborative welding goals to be achieved;
    Step (c2): express the goals described in step (c1) in the form of reward values for the reinforcement learning process.
  5. The method according to claim 4, characterized in that step (d) specifically comprises the following steps:
    Step (d1): construct a recurrent neural network that takes state values and action values as input and an action value function as output;
    Step (d2): take the state values and action values obtained in steps (b1) and (b2) as input and obtain the local action value function of each manipulator through the neural network of step (d1);
    Step (d3): from the local action value function of each manipulator obtained in step (d1), select the action of each manipulator at the next moment according to the epsilon-greedy strategy.
  6. The method according to claim 5, characterized in that step (e) specifically comprises the following steps:
    Step (e1): construct a hypernetwork with non-negative weights that takes the local action value functions as input and the overall value function as output;
    Step (e2): take the local action value functions obtained in step (d2) as input, substitute them into the neural network constructed in step (e1), and obtain the overall action value function.
  7. The method according to claim 6, characterized in that step (f) specifically comprises the following steps:
    Step (f1): construct a loss function from the reward values of step (c2) and the overall action value function of step (e2);
    Step (f2): calculate and update the neural network weights by the back-propagation algorithm and repeat the training process.
PCT/CN2021/070578 2020-11-09 2021-01-07 基于qmix强化学习算法的船舶多机械臂焊点协同焊接方法 WO2022095278A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP21887984.9A EP4241915A1 (en) 2020-11-09 2021-01-07 Qmix reinforcement learning algorithm-based ship welding spots collaborative welding method using multiple manipulators

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011239331.2A CN112427843B (zh) 2020-11-09 2020-11-09 基于qmix强化学习算法的船舶多机械臂焊点协同焊接方法
CN202011239331.2 2020-11-09

Publications (1)

Publication Number Publication Date
WO2022095278A1 true WO2022095278A1 (zh) 2022-05-12

Family

ID=74700799

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/070578 WO2022095278A1 (zh) 2020-11-09 2021-01-07 基于qmix强化学习算法的船舶多机械臂焊点协同焊接方法

Country Status (3)

Country Link
EP (1) EP4241915A1 (zh)
CN (1) CN112427843B (zh)
WO (1) WO2022095278A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114861331A (zh) * 2022-07-05 2022-08-05 领伟创新智能系统(浙江)有限公司 一种基于船舶装焊微特征的自匹配工艺设计方法
CN115065728A (zh) * 2022-06-13 2022-09-16 福州大学 一种基于多策略强化学习的多目标内容存储方法
CN115415694A (zh) * 2022-08-29 2022-12-02 无锡达诺精密钣金有限公司 一种钣金工艺用焊接方法、系统及装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114888495B (zh) * 2022-06-30 2023-03-24 中船黄埔文冲船舶有限公司 一种基于中组立模型的焊接控制方法及系统
CN116900539B (zh) * 2023-09-14 2023-12-19 天津大学 一种基于图神经网络和强化学习的多机器人任务规划方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0477430A1 (en) * 1989-03-29 1992-04-01 KABUSHIKI KAISHA KOBE SEIKO SHO also known as Kobe Steel Ltd. Off-line teaching method for industrial robot
JPH08190432A (ja) * 1995-01-11 1996-07-23 Atr Ningen Joho Tsushin Kenkyusho:Kk 運動制御装置および方法
CN106681149A (zh) * 2017-01-11 2017-05-17 浙江大学 一种基于虚拟现实和强化学习的熊蜂机器人摆腹控制方法
CN107102619A (zh) * 2016-02-19 2017-08-29 发那科株式会社 机器学习装置、工业机械单元、制造系统及机器学习方法
CN110390845A (zh) * 2018-04-18 2019-10-29 北京京东尚科信息技术有限公司 虚拟环境下机器人训练方法及装置、存储介质及计算机系统
CN110977967A (zh) * 2019-11-29 2020-04-10 天津博诺智创机器人技术有限公司 一种基于深度强化学习的机器人路径规划方法

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0477430A1 (en) * 1989-03-29 1992-04-01 KABUSHIKI KAISHA KOBE SEIKO SHO also known as Kobe Steel Ltd. Off-line teaching method for industrial robot
JPH08190432A (ja) * 1995-01-11 1996-07-23 Atr Ningen Joho Tsushin Kenkyusho:Kk 運動制御装置および方法
CN107102619A (zh) * 2016-02-19 2017-08-29 发那科株式会社 机器学习装置、工业机械单元、制造系统及机器学习方法
CN106681149A (zh) * 2017-01-11 2017-05-17 浙江大学 一种基于虚拟现实和强化学习的熊蜂机器人摆腹控制方法
CN110390845A (zh) * 2018-04-18 2019-10-29 北京京东尚科信息技术有限公司 虚拟环境下机器人训练方法及装置、存储介质及计算机系统
CN110977967A (zh) * 2019-11-29 2020-04-10 天津博诺智创机器人技术有限公司 一种基于深度强化学习的机器人路径规划方法

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"Jiangsu: Jiangsu University of Science and Technology", CABINS, 2013
CHEN HUI: "Research on Path Planning of Collaborative Welding of Multi-robots with White Body'' [D", SHANXI: NORTH UNIVERSITY OF CHINA, 2019
GAN YDUAN JCHEN MDAI X: "Multi-Robot Trajectory Planning and Position/Force Coordination Control in Complex Welding Tasks", APPLIED SCIENCES, vol. 9, no. 5, 2019, pages 924, XP055861101, DOI: 10.3390/app9050924
JING FENGSHUITAN MINHOU ZENGGUANGLIANG ZIZEWANG YUNKUAN: "Research on Kinematics and Docking Accuracy of Hull Segmented Docking System Based on Multi-robot Coordination", ROBOT, no. 04, 2002, pages 324 - 328

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115065728A (zh) * 2022-06-13 2022-09-16 福州大学 一种基于多策略强化学习的多目标内容存储方法
CN115065728B (zh) * 2022-06-13 2023-12-08 福州大学 一种基于多策略强化学习的多目标内容存储方法
CN114861331A (zh) * 2022-07-05 2022-08-05 领伟创新智能系统(浙江)有限公司 一种基于船舶装焊微特征的自匹配工艺设计方法
CN114861331B (zh) * 2022-07-05 2022-09-23 领伟创新智能系统(浙江)有限公司 一种基于船舶装焊微特征的自匹配工艺设计方法
CN115415694A (zh) * 2022-08-29 2022-12-02 无锡达诺精密钣金有限公司 一种钣金工艺用焊接方法、系统及装置
CN115415694B (zh) * 2022-08-29 2024-01-12 无锡达诺精密钣金有限公司 一种钣金工艺用焊接方法、系统及装置

Also Published As

Publication number Publication date
CN112427843A (zh) 2021-03-02
CN112427843B (zh) 2021-09-03
EP4241915A1 (en) 2023-09-13

Similar Documents

Publication Publication Date Title
WO2022095278A1 (zh) 基于qmix强化学习算法的船舶多机械臂焊点协同焊接方法
CN109822554B (zh) 面向水下的双臂协同抓取、抱取及避碰一体化方法及系统
CN110378439B (zh) 基于Q-Learning算法的单机器人路径规划方法
Zhao et al. The experience-memory Q-learning algorithm for robot path planning in unknown environment
CN110682286B (zh) 一种协作机器人实时避障方法
CN111645079B (zh) 一种带电作业机器人机械臂路径规划控制装置及其方法
Zhu et al. Rule-based reinforcement learning for efficient robot navigation with space reduction
CN109784201A (zh) 基于四维风险评估的auv动态避障方法
CN113442140B (zh) 一种基于Bezier寻优的笛卡尔空间避障规划方法
CN112428274A (zh) 一种多自由度机器人的空间运动规划方法
CN112857370A (zh) 一种基于时序信息建模的机器人无地图导航方法
Wang et al. Adaptive path planning for the gantry welding robot system
CN115890670A (zh) 基于强化深度学习训练七自由度冗余机械臂运动轨迹的方法
CN116460843A (zh) 一种基于元启发式算法的多机器人协作抓取方法及系统
CN117606490B (zh) 一种水下自主航行器协同搜索路径规划方法
CN112434464B (zh) 基于maddpg算法的船舶多机械臂弧焊协同焊接方法
Hu A novel deep learning driven robot path planning strategy: Q-learning approach
Chen et al. Optimizing the obstacle avoidance trajectory and positioning error of robotic manipulators using multigroup ant colony and quantum behaved particle swarm optimization algorithms
Tam et al. An improved genetic algorithm based robot path planning method without collision in confined workspace
CN117798934A (zh) 一种协作机器人多步骤自主装配作业决策方法
Zhang et al. Robot navigation with reinforcement learned path generation and fine-tuned motion control
CN116339334A (zh) 一种机器人最优路径规划调度系统及方法
Zhou et al. Deep reinforcement learning with long-time memory capability for robot mapless navigation
Araújo Filho et al. Multi-robot autonomous exploration and map merging in unknown environments
CN115533920A (zh) 一种求解绳驱机械臂逆运动学的协同规划方法及系统、计算机存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21887984

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2021887984

Country of ref document: EP

Effective date: 20230605