CN113062601B

CN113062601B - A trajectory planning method for concrete placing robot based on Q-learning

Info

Publication number: CN113062601B
Application number: CN202110284547.9A
Authority: CN
Inventors: 范思文; 纪金帅; 王昊天; 李万莉
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2021-03-17
Filing date: 2021-03-17
Publication date: 2022-05-13
Anticipated expiration: 2041-03-17
Also published as: CN113062601A

Abstract

The invention relates to a novel trajectory planning scheme for an intelligent concrete placing robot. It is suitable for autonomous pouring control of the placing robot, avoids complex inverse kinematics and interpolation calculations, and belongs to the field of intelligent manufacturing. The invention designs a general trajectory planning framework with two parts: a rapid movement process, in which the placing robot moves from an initial state to the path start point and resets from the path end point to the initial state; and a continuous pouring process, in which the placing robot pours concrete continuously from the start point of the pouring path to its end point. In the rapid movement process, a simple interior point method performs inverse-solution optimization with minimum time as the objective, and a cubic polynomial is used to fit the trajectory. In the continuous pouring process, an error band of a certain area is formed around the path to be poured by the end of the placing robot. Using the Q-learning algorithm, the error band is divided into regions, reward values are assigned to the regions according to the pouring objectives and constraints, and Q values are trained on the resulting grid. The result is an action sequence for each joint of the robot, so the robot's actions are obtained directly and the complex trajectory planning process based on optimized inverse kinematics is avoided.

Description

A Q-learning-based trajectory planning method for concrete placing robots

Technical field

The invention relates to a novel trajectory planning method for an intelligent concrete placing robot. It is suitable for autonomous pouring control of the placing robot, avoids complex interpolation calculations, and belongs to the field of intelligent construction.

Background

A concrete placing robot is a construction robot that delivers concrete to the point of placement, and it plays an important role in modern urban construction. As projects demand ever higher efficiency from placing robots, research on their intelligent control has gradually developed. Intelligent control is inseparable from path planning and trajectory planning of the manipulator. For industrial robots, path planning refers mainly to the path traced by the end of the manipulator, whereas trajectory planning describes the displacement, velocity and acceleration curves of the boom sections as they move together. The specific task of a placing robot is to move the pumping outlet at the end of the boom, so it can be analyzed as an industrial robot mechanism. Pouring surfaces are generally divided into non-slewing horizontal planes, vertical cylindrical surfaces, and non-slewing spatial planes or curved surfaces. The target pouring route differs with the pouring surface; to simplify route planning, the route is generally composed of straight lines.

In recent years, intelligent real-time dynamic autonomous planning has become extremely important for industrial robots and similar equipment. For concrete placing, combining autonomous placing with intelligent real-time dynamic path planning of the placing robot helps to further improve pouring efficiency and quality. Large placing boom systems currently work under harsh site conditions. Traditional trajectory planning methods are computationally expensive and make it difficult to determine an optimal performance index, so they cannot respond to real-time changes on site. For construction surfaces of different shapes, the pouring path of a working section is often planned from the limited viewpoint of on-site staff, pouring quality depends heavily on the operator's experience and skill, and the degree of automation is low, so the trajectory requirements of the pouring end of a large placing boom system cannot be met. Reinforcement learning, a branch of machine learning, imitates the learning process of living organisms: an agent gains experience through continual interaction with its environment and gradually acquires the ability to plan autonomously. Technically, reinforcement learning resembles the way an operator learns the task, and it is an effective way to meet the need for autonomous path planning at the working end of the placing robot.

Summary of the invention

To meet the autonomous planning and control accuracy requirements of an intelligent redundant placing robot, the invention designs a trajectory planner based on Q-learning. The path for which the placing robot must plan a trajectory is given. First, the joint trajectory from the initial position to the path start point is planned with a time-optimal interior point method. Then, when planning the trajectory for the known straight path, an error band centered on the pouring path and with a width of twice the required pouring precision is established. Using the Q-learning algorithm from reinforcement learning, the error band is divided into a grid, and each grid cell is assigned a corresponding scalar reward according to the working conditions of the placing robot and general performance indices. Training with maximum reward as the objective yields the optimal observation-action-reward sequence in the error band environment, and the joint trajectory of the placing robot is formed from the action sequence.

The intelligent concrete placing robot is redundant, and during placing it pours continuously along a straight or curved path. The trajectory planning scheme must therefore be designed in Cartesian space and must handle both continuous-path pouring and redundant degrees of freedom. Under existing planning methods, a forward kinematics model of the robot is first built with the DH coordinate transformation method. Given the pouring path in Cartesian space, the path is discretized into a number of segments, and inverse kinematics is solved at the start and end point of each segment, that is, the joint angles of the manipulator are derived from the pouring position of the end. Because of the redundant degrees of freedom, the robot has more movable joints than the motion space has degrees of freedom, so the inverse kinematics has multiple solutions. An optimization algorithm is generally used to select the best inverse solution, usually with minimum time or minimum energy as the objective. The inverse kinematics produces a series of joint angle values, which are then fitted, typically with cubic or quintic polynomials or B-spline curves, to obtain the motion trajectory of each joint and complete the trajectory planning task.

Traditional methods have many problems for a redundant manipulator performing continuous straight-line motion. First, when forward kinematics is applied to the planned trajectory, that is, when the Cartesian path is derived from the joint-space trajectories, the discrete path points are reached but the path between them deviates considerably from the intended path. For very long, large manipulators such as placing robots, the trajectory error of traditional methods can reach about 1 m. Traditional concrete placing robots therefore mostly rely on manual operation of the pouring end to keep the actual path within range, which leaves the robot with poor autonomy and makes the goal of intelligent manufacturing hard to reach. To address this, existing research concentrates on fitting higher-order polynomials or on piecewise fitting of path points, such as 3-5-3 polynomial interpolation, which demands considerable mathematical skill, is computationally difficult, and is hard to realize in real-time planning for industrial robots. Because of redundancy, the inverse kinematics is mostly solved by optimization algorithms that take minimum time or minimum energy as the objective; weights and constraints can also be coupled into the objective function to meet multi-objective requirements. Such algorithms are complex and computationally expensive, and they depend heavily on performance indices being specified correctly. The working conditions of a placing robot are complex, and performance indices and constraints change constantly with the construction environment, so algorithm design is difficult, the internal structure of a complex algorithm is hard to modify, and no fixed algorithm framework that generalizes can be found.

Summary of the invention

The purpose of the invention is to overcome the deficiencies of the prior art by disclosing a Q-learning-based trajectory planning method for a concrete placing robot.

The technical solution adopted by the invention to solve the technical problem is as follows:

For the characteristics of a redundant concrete placing robot, a general trajectory planning framework is designed that divides trajectory planning into two parts. One part is the rapid movement process, in which the placing robot moves from the initial state to the path start point and resets from the path end point to the initial state. The other part is the process in which the placing robot pours concrete continuously from the start point of the pouring path to its end point. In the rapid movement process, the intermediate path does not need to be considered, so a simple interior point method performs inverse-solution optimization with minimum time as the objective, and a cubic polynomial is used to fit the trajectory. In the continuous pouring process, the path to be poured by the end of the placing robot is expanded into an error band of a certain area, whose width is set according to the given pouring precision. Using the Q-learning algorithm, the error band is divided into regions, reward values are assigned to the regions according to the pouring objectives and constraints, and the resulting grid is trained. The result is an action sequence for each joint of the robot, so the robot's actions are obtained directly and the complex trajectory planning process is avoided.

The beneficial effects of the invention are as follows. A trajectory planner is designed for the working characteristics of the placing robot; it is more general, and its internal parameters are mainly the scalar reward values, which are easy to change and test. Because trajectory planning is carried out within an error band designed around the pouring path, the error size can be set according to the required working precision, which keeps the actual path deviation within that requirement and avoids the excessive error between path points found in traditional interpolation-based planning. Training with the Q-learning algorithm yields the action values of each joint directly, avoiding complex multi-objective inverse-solution optimization and data fitting. Planning by online learning improves the autonomy of the concrete placing machine and makes the goal of unmanned construction machinery in intelligent construction easier to reach.

Description of drawings

Fig. 1 is the overall structure diagram of the intelligent redundant concrete placing robot (from the earlier patent application 2020111625562, "A three-joint rotary concrete placing robot");

Fig. 2 is the structure diagram of the Q-learning-based trajectory planner for the concrete placing robot;

Fig. 3 is the flow chart of trajectory planning for the continuous straight-line pouring process;

Fig. 4 is an example of straight-line trajectory planning using the trajectory planner.

Detailed description

Fig. 1 shows the overall structure of the intelligent redundant concrete placing robot, which consists of five main modules: the column assembly (1), the pipe assembly (2), the pipe clamp (3), the cantilever support (4) and the turntable assembly (5). The design uses three rotary joints, that is, three turntable assemblies (5). For planar pouring, which requires two degrees of freedom, the design has one redundant degree of freedom.
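The effect of the redundant degree of freedom can be illustrated with a small forward kinematics sketch. It assumes the three rotary joints act as a planar serial chain; the link lengths `LINKS` and the function name `forward_kinematics` are hypothetical, since the source gives no dimensions or kinematic model.

```python
import math

# Hypothetical link lengths in metres; the source gives no dimensions.
LINKS = (5.0, 5.0, 5.0)

def forward_kinematics(angles, links=LINKS):
    """Tip position (x, y) of a planar chain of rotary joints.

    angles are relative joint angles in radians, one per link.
    """
    x = y = heading = 0.0
    for theta, length in zip(angles, links):
        heading += theta                  # each joint rotates every link after it
        x += length * math.cos(heading)
        y += length * math.sin(heading)
    return x, y

# Two different joint configurations that place the tip at the same point.
config_a = (math.pi / 3, -2 * math.pi / 3, math.pi / 3)
config_b = (0.0, math.pi / 3, -2 * math.pi / 3)
tip_a = forward_kinematics(config_a)
tip_b = forward_kinematics(config_b)
```

Both configurations reach the same tip position, which is why an inverse solution for this arm is not unique and a selection rule is needed.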

Fig. 2 shows the overall structure of the trajectory planner and summarizes the planning methods for the rapid movement part and the continuous concrete pouring part.

Fig. 3 describes in detail the idea layer and technical layer of trajectory planning for the continuous pouring process. This part uses the Q-learning method, and Fig. 3 sets out how it is applied.

Fig. 2 shows the overall design framework of Q-learning-based trajectory planning for the concrete placing robot. The framework is organized in layers: a structural layer, an idea layer, a technical layer and an output layer. In the structural layer, the trajectory planner is divided into two parts. One part is the rapid movement process, in which the placing robot moves from the initial state to the path start point and resets from the path end point to the initial state. The other part is the process in which the placing robot pours concrete continuously from the start point of the pouring path to its end point.

Based on this division, the following ideas are adopted. In the rapid movement process, the intermediate path generally does not need to be considered, so the traditional inverse kinematics approach is used to optimize the inverse solution with minimum time as the objective. The technique is a simple interior point method, followed by a cubic polynomial fit, and the output is the trajectory curve of each joint angle. In the continuous pouring process, planning is based on expanding the path to be poured by the end of the placing robot into an error band of a certain area. The band width is set according to the given pouring precision and is usually twice the precision. The technique is the Q-learning algorithm: the error band is divided into regions, reward values are assigned to the regions according to the pouring objectives and constraints, and the resulting grid is trained. The output is the action sequence of each joint of the robot, so the robot's actions are obtained directly and the complex trajectory planning process is avoided.

The overall operation comprises two processes: the rapid movement process and the continuous concrete pouring process. The rapid movement process is needed because, when the placing robot starts work, its initial position is not necessarily at the start point of the planned pouring trajectory; the rapid movement process brings the robot from its initial position to that start point.
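For the rapid movement phase, the cubic polynomial fit can be sketched as below. This is a minimal illustration that assumes the joint starts and ends at rest (zero velocity at both ends), a boundary condition the source does not state. The function names are hypothetical, and the time-optimal interior point step that would choose the end angles and the duration is not shown.

```python
def cubic_joint_angle(q_start, q_end, duration, t):
    """Joint angle at time t on a rest-to-rest cubic from q_start to q_end."""
    tau = min(max(t / duration, 0.0), 1.0)    # normalised time, clamped to [0, 1]
    blend = 3.0 * tau**2 - 2.0 * tau**3       # rises from 0 to 1 with zero end slopes
    return q_start + (q_end - q_start) * blend

def cubic_joint_velocity(q_start, q_end, duration, t):
    """Joint velocity at time t on the same cubic (derivative of the angle)."""
    tau = min(max(t / duration, 0.0), 1.0)
    return (q_end - q_start) * (6.0 * tau - 6.0 * tau**2) / duration

# Example: move one joint from 0.2 rad to 1.0 rad in 4 s, sampled every second.
samples = [cubic_joint_angle(0.2, 1.0, 4.0, t) for t in range(5)]
```

Evaluating one such cubic per joint gives the three joint-angle curves that the rapid movement phase outputs.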

Fig. 3 shows the trajectory planning flow of the placing robot during continuous straight-line concrete pouring.

In the first step, the path is planned according to the construction environment; the pouring trajectory is generally set as a straight line. The required construction precision is then determined, and a trajectory planning error band is established with the path as its center line and twice the precision as its width. Path optimization here does not need to find the shortest path; it needs a suboptimal solution that balances efficiency against route quality. The error band gives up some precision, but for a given band width it balances efficiency and route quality while keeping the error controllable, so the requirements of trajectory planning can be met.
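The error band test in the first step can be sketched as follows: a point is inside the band when its perpendicular distance to the straight pouring path is at most the precision, because the band is twice the precision wide and centered on the path. The function names and the 2-D point representation are assumptions for illustration.

```python
import math

def lateral_deviation(point, start, end):
    """Perpendicular distance from point to the straight line through start and end."""
    (px, py), (sx, sy), (ex, ey) = point, start, end
    dx, dy = ex - sx, ey - sy
    cross = (px - sx) * dy - (py - sy) * dx   # twice the signed triangle area
    return abs(cross) / math.hypot(dx, dy)

def in_error_band(point, start, end, precision):
    """True if point lies within the band of total width 2 * precision."""
    return lateral_deviation(point, start, end) <= precision

# Example: a 10 m path along the x axis with a hypothetical precision of 0.05 m.
START, END = (0.0, 0.0), (10.0, 0.0)
inside = in_error_band((3.0, 0.04), START, END, 0.05)
outside = in_error_band((3.0, 0.06), START, END, 0.05)
```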

In the second step, the error band is divided into a grid, and a dynamic reward model R is built over the grid cells. R is a matrix that stores the reward of each cell. In this application, movement of the pouring end of the placing robot towards the target end point of the path is given a positive reward and movement away from it a negative reward. The reward for regions outside the error band is set to negative infinity, so that the placing robot always moves within the error band during planning.
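A minimal sketch of the grid in the second step is shown below. It assumes a rectangular grid whose rows run across the band and whose columns run along the path, with extra margin rows standing in for the region outside the band. The function name, the cell counts and the neutral starting value of 0.0 for in-band cells are illustrative assumptions; the source does not specify a grid resolution.

```python
NEG_INF = float("-inf")

def build_reward_grid(cols, band_rows, margin_rows=1):
    """Reward matrix R: 0.0 inside the error band, negative infinity outside it.

    cols        number of cells along the path
    band_rows   number of cell rows that make up the error band
    margin_rows rows on each side representing the region outside the band
    """
    rows = band_rows + 2 * margin_rows
    grid = []
    for r in range(rows):
        inside = margin_rows <= r < margin_rows + band_rows
        grid.append([0.0 if inside else NEG_INF] * cols)
    return grid

# Example: 20 cells along the path, a band 4 cells wide, 1 margin row per side.
R = build_reward_grid(cols=20, band_rows=4)
```

The in-band values are placeholders to be overwritten by the quantified rewards; only the negative-infinity border is taken directly from the text.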

In the third step, the reward values in the R matrix are quantified so that the planner can learn from them.

3.1 Each time the robot takes an action, one unit is subtracted from the reward. This enforces the robot's energy-optimality requirement, that is, meeting the pouring trajectory requirement with the fewest actions.

3.2 A scalar reward for deviation from the path center line and a scalar reward for distance completed are added to the R matrix. The trajectory error of the placing robot in each state is stored in the R matrix and used directly as a negative scalar reward of equal value, and the distance from the path start point to the pouring position in each state is used as a positive scalar reward of equal value. This ensures that the placing robot moves towards the target point within the error range.

3.3 The cell containing the path end point is assigned the maximum reward, so that the planner always advances towards the target end point while tracking the path.
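Rules 3.1 to 3.3 can be combined into a single reward function, sketched below. The way the three terms are summed, the function name `step_reward`, and the goal reward of 100.0 are assumptions; the source states each rule separately and only says that the end cell receives the maximum reward.

```python
NEG_INF = float("-inf")

def step_reward(deviation, distance_from_start, in_band=True, at_goal=False,
                goal_reward=100.0):
    """Reward for arriving in a state, following rules 3.1 to 3.3.

    deviation            distance of the pouring end from the path center line
    distance_from_start  distance of the pouring position from the path start
    goal_reward          hypothetical value for the end cell (not from the source)
    """
    if not in_band:
        return NEG_INF                      # outside the error band
    if at_goal:
        return goal_reward                  # 3.3: end cell gets the maximum reward
    reward = -1.0                           # 3.1: one unit per action
    reward -= abs(deviation)                # 3.2: error as an equal negative reward
    reward += distance_from_start           # 3.2: progress as an equal positive reward
    return reward

# Example: 0.02 off the center line, 3.0 along the path.
example = step_reward(0.02, 3.0)
```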

In the fourth step, the initial R matrix is built according to the above requirements, and the action matrix a of the placing robot is established. Each joint has three types of action: forward motion, stationary, and backward motion. The action matrix therefore has dimensions 3×3×3 and represents the 27 states under all feasible actions of the placing robot. To prevent a dead zone in which the robot remains stationary indefinitely, the (0,0,0) state, in which all joints are stationary, is removed, so the action matrix a contains the 26 feasible action states of the placing robot.
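The 26-action set can be enumerated directly. In this sketch, -1, 0 and 1 stand for backward, stationary and forward motion of one joint; the per-action angle increment `step` is a hypothetical parameter, as the source does not give a step size.

```python
from itertools import product

# All 3 x 3 x 3 combinations of per-joint moves, minus the all-stationary one.
ACTIONS = [a for a in product((-1, 0, 1), repeat=3) if a != (0, 0, 0)]

def apply_action(angles, action, step):
    """Return the joint angles after moving each joint by action * step."""
    return tuple(q + step * d for q, d in zip(angles, action))

# Example: joint 1 forward, joint 2 stationary, joint 3 backward.
new_angles = apply_action((0.0, 0.0, 0.0), (1, 0, -1), 0.5)
```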

In the fifth step, a dynamic Q matrix is built from the reward matrix R and the action matrix a. The Q matrix is equivalent to an action-value function: given the current state of the robot and the next action to be taken, it returns the total reward for reaching the target point. All cells are initialized with a Q value of 0, and the cell containing the end point is then assigned a Q value of 100 units.

Following the above steps, the dynamic Q matrix is trained iteratively with the Q-learning update formula Q*(s,a) = E[R + γ max Q*(s′,a′) | s,a], where:

R is the reward matrix, to which a scalar reward for deviation from the path center line and a scalar reward for distance completed are added. The trajectory error of the placing robot in each state is stored in the R matrix and used directly as a negative scalar reward of equal value, and the distance from the path start point to the pouring position in each state is used as a positive scalar reward of equal value, ensuring that the placing robot moves towards the target point within the error range;

the action state s comprises the three rotary joint angles, that is, the angles of the three turntable assemblies (5);

the action matrix a covers three types of action: forward motion, stationary, and backward motion. It has dimensions 3×3×3 and represents the 27 states under all feasible actions of the placing robot. To prevent a dead zone in which the robot remains stationary indefinitely, the (0,0,0) state, in which all joints are stationary, is removed, so the action matrix a contains the 26 feasible action states s of the placing robot;

E is the mathematical expectation;

Q*(s′,a′) is the maximum Q value obtained in one training step, taken over the next actions a′ in the next state s′. γ is called the learning rate here and is set to 0.9; in this formula it weights the value of the next state, the role usually called the discount factor. Updating stops when the set number of iterations reaches its upper limit or when the Q matrix converges, and the resulting Q matrix is taken as optimal. From it, a series of optimal (action value, reward value) mappings is obtained, which gives the sequence of action values for the three joints of the placing robot planned within the set error band. This completes trajectory planning for straight-line pouring, avoids the complex optimization of inverse kinematics, and yields the optimal path points of the target pouring trajectory together with the state-action sequence of the placing robot.
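The update formula can be sketched on a toy problem. This is an illustration under several assumptions, not the patented implementation: the state is a grid cell (column, row) rather than three joint angles, there are 8 cell moves instead of the 26 joint actions, transitions are deterministic so the expectation E reduces to a single term, and every state-action pair is swept in turn instead of being sampled by exploration. The grid size and reward terms are hypothetical; γ = 0.9 and the goal Q value of 100 are taken from the text.

```python
from itertools import product

GAMMA = 0.9                 # value given in the text
GOAL_Q = 100.0              # Q value assigned to the end cell in the text
COLS, HALF_ROWS = 6, 1      # hypothetical band: 6 cells long, rows -1..1
GOAL = (COLS - 1, 0)
STATES = [(c, r) for c in range(COLS) for r in range(-HALF_ROWS, HALF_ROWS + 1)]
ACTIONS = [a for a in product((-1, 0, 1), repeat=2) if a != (0, 0)]

def step(state, action):
    """Return (next_state, reward); next_state is None if the move leaves the band."""
    col, row = state[0] + action[0], state[1] + action[1]
    if not (0 <= col < COLS and abs(row) <= HALF_ROWS):
        return None, float("-inf")
    # One unit per action, minus the deviation, plus the distance from the start.
    return (col, row), -1.0 - abs(row) + col

def train(max_sweeps=200, tol=1e-9):
    """Iterate Q(s,a) = R + gamma * max Q(s',a') until convergence or the sweep cap."""
    q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    q[GOAL] = {a: GOAL_Q for a in ACTIONS}          # end cell marked with 100
    for _ in range(max_sweeps):
        delta = 0.0
        for s in STATES:
            if s == GOAL:
                continue                            # the end cell is terminal
            for a in ACTIONS:
                nxt, reward = step(s, a)
                new = reward if nxt is None else reward + GAMMA * max(q[nxt].values())
                if new != q[s][a]:                  # also avoids inf - inf
                    delta = max(delta, abs(new - q[s][a]))
                    q[s][a] = new
        if delta < tol:
            break
    return q

def greedy_path(q, start, max_steps=50):
    """Follow the highest-valued action from start until the end cell is reached."""
    path, state = [start], start
    while state != GOAL and len(path) <= max_steps:
        action = max(ACTIONS, key=lambda a: q[state][a])
        state, _ = step(state, action)
        path.append(state)
    return path

q_table = train()
path = greedy_path(q_table, (0, 0))
```

On this toy band the greedy path runs straight along the center row to the end cell. In the setting described in the text, the same loop would run over joint-angle states and the 26 joint actions, and the greedy actions would form the joint action sequence.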

As shown in Fig. 4, the Q-learning trajectory planner of this application yields the optimal planning points for the continuous path, and the portions between path points can be filled in by conventional curve fitting to obtain the optimal continuous trajectory. The continuous pouring trajectory obtained by the proposed method consists of a series of discrete points, each representing one motion state of the placing robot.

In Fig. 4, the straight-line path produced by the Q-learning-based trajectory planner is shown. The path points are all of the states reached by the placing robot, and all of them lie within the set error band, which meets the requirements of autonomous planning and controllable error for intelligent construction machinery.

Claims

1. A Q-learning-based trajectory planning method for a concrete placing robot, directed at the characteristics of a redundant concrete placing robot, characterized in that a general trajectory planning framework is designed in which the trajectory planning of the placing robot is divided into two parts: one part is the rapid movement process in which the placing robot moves from an initial state to the path start point and resets from the path end point to the initial state; the other part is the process in which the placing robot performs continuous concrete pouring from the start point of the pouring path to its end point;

the design adopts three rotary joints, namely three turntable assemblies (5);

the trajectory planning process of the placing robot during continuous straight-line concrete pouring comprises the following steps:

first, planning a path according to the construction environment, and establishing a trajectory planning error band that takes the path as its center line and twice the precision as its width;

second, dividing the error band into a grid once it is determined, and establishing a dynamic reward model R over the grid cells, R being a matrix that stores the reward value of each cell; movement of the pouring end of the placing robot towards the target end point of the path is set as a positive reward and the opposite as a negative reward, and the reward of regions outside the error band is set to negative infinity;

third, quantifying the reward values in the R matrix;

3.1 setting that one unit is subtracted from the reward value each time the robot takes an action, which ensures the energy-optimality requirement of the robot, namely that the pouring trajectory requirement is met with the fewest actions;

3.2 adding to the R matrix a scalar reward for deviation from the path center line and a scalar reward for distance completed, namely storing in the R matrix the trajectory error of the placing robot in each state, using the error value directly as a negative scalar reward of equal value, and using the distance from the path start point to the pouring position of the robot in each state as a positive scalar reward of equal value, to ensure that the placing robot moves towards the target point within the error range;

3.3 setting the cell containing the path end point to the maximum reward value;

fourth, establishing an initial R matrix according to the above requirements, and establishing an action matrix a of the placing robot, wherein the actions comprise three types, namely forward motion, stationary and backward motion, so that the action matrix has dimensions 3×3×3 and represents the 27 states under all feasible actions of the placing robot; to prevent a long-term stationary dead zone, the (0,0,0) state, namely the state in which all joints are stationary, is removed, so that the action matrix a comprises the 26 feasible action states s of the placing robot;

fifth, establishing a dynamic Q matrix according to the specified reward matrix R and action matrix a, wherein the Q matrix is equivalent to an action-value function, and the current state of the robot and the next action to be taken are input to obtain the total reward value for reaching the target point;

according to the above steps, the dynamic Q matrix is trained iteratively using the Q-learning update formula Q*(s,a) = E[R + γ max Q*(s′,a′) | s,a], wherein:

a scalar reward for deviation from the path center line and a scalar reward for distance completed are added to the R matrix, namely the trajectory error of the placing robot in each state is stored in the R matrix, the error value is used directly as a negative scalar reward of equal value, and the distance from the path start point to the pouring position of the robot in each state is used as a positive scalar reward of equal value, to ensure that the placing robot moves towards the target point within the error range;

the action state s comprises the three rotary joint angles, namely the angles of the three turntable assemblies (5);

the action matrix a comprises three types of action, namely forward motion, stationary and backward motion; the action matrix has dimensions 3×3×3 and represents the 27 states under all feasible actions of the placing robot; to prevent a long-term stationary dead zone, the (0,0,0) state, namely the state in which all joints are stationary, is removed, so that the action matrix a comprises the 26 feasible action states s of the placing robot;

E is the mathematical expectation;

Q*(s′,a′) is the maximum Q value obtained in one training step, and γ is the learning rate, with a value of 0.9; updating stops when the set number of iterations reaches its upper limit or when the Q matrix converges, and the Q matrix obtained at that point is taken by default as the optimal matrix, from which a series of optimal (action value, reward value) mappings is obtained, thereby giving the sequence of action values of the three joints of the placing robot planned within the set error band, completing trajectory planning for straight-line pouring, avoiding the complex optimization of inverse kinematics, and obtaining the optimal path points of the target pouring trajectory and the state-action sequence of the placing robot.