CN108927806A - Industrial robot learning method applied to high-volume repetitive machining - Google Patents

Industrial robot learning method applied to high-volume repetitive machining (Download PDF)

Info

Publication number
CN108927806A
CN108927806A
Authority
CN
China
Prior art keywords
learning
unit
parameters
information
robot
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810921161.2A
Other languages
Chinese (zh)
Inventor
李建刚
钟刚刚
吴雨璁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen filed Critical Harbin Institute of Technology Shenzhen
Priority to CN201810921161.2A priority Critical patent/CN108927806A/en
Publication of CN108927806A publication Critical patent/CN108927806A/en
Pending legal-status Critical Current

Links

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J 9/00 Programme-controlled manipulators
    • B25J 9/16 Programme controls
    • B25J 9/1628 Programme controls characterised by the control loop
    • B25J 9/163 Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Feedback Control In General (AREA)
  • Manipulator (AREA)

Abstract

The present invention provides an industrial robot learning method applied to high-volume repetitive machining, characterized in that the learning method learns on the basis of a learning model and comprises the following steps: S001, sensors acquire state information; S002, learning is performed according to the acquired information; S003, it is judged whether the machining quality and the machining cycle meet the requirements, and if they do, learning ends, otherwise state information is re-acquired and learning is repeated. The method of the invention learns and improves the control strategy from sensor data, achieves good control at high speed, simplifies robot debugging, can be applied to high-volume, large-scale repetitive machining, eliminates the vibration during high-speed operation caused by the lack of an accurate dynamic model in conventional learning approaches, and improves the working efficiency of industrial robots.

Description

A Learning Method for Industrial Robots Applied to High-Volume Repetitive Machining

Technical Field

The present invention relates to the technical field of industrial robots, and in particular to an industrial robot learning method applied to high-volume repetitive machining.

Background Art

An industrial robot is a highly nonlinear system whose dynamic behavior is difficult to model accurately. Robots have traditionally been controlled with a kinematic model only, without a dynamic model. On the one hand, the maximum velocity and acceleration at each point are then usually set below what can actually be sustained, so that the actuators' maximum torque is not exceeded once dynamic effects are taken into account; as a result, the actuators' capability is never fully exploited. On the other hand, ignoring the dynamic characteristics not only reduces the working efficiency of industrial robots: during high-speed motion and under heavy loads, inertial, centrifugal, friction, gravity and joint-torque forces often produce strong vibration, which degrades both the machining quality and the service life of the robot. In addition, accurate dynamic modeling of an industrial robot suffers from the difficulty of identifying the robot's parameters: if the robots are not consistent and the friction coefficient of each component differs, the dynamic parameters come out wrong, and incorrect dynamic parameters make robot debugging far more tedious and make high-volume, large-scale application hard to achieve.

Summary of the Invention

In view of the defects or deficiencies of the prior art, the present invention provides an industrial robot learning method applied to high-volume repetitive machining. The method learns and improves the control strategy from sensor data to achieve good control at high speed, simplifies robot debugging, can be applied to high-volume, large-scale repetitive machining, eliminates the vibration during high-speed operation caused by the lack of an accurate dynamic model in conventional learning approaches, and improves the working efficiency of industrial robots.

To achieve the above object, the technical solution adopted by the present invention is an industrial robot learning method applied to high-volume repetitive machining. The learning method learns on the basis of a learning model and comprises the following steps:

S001. Sensors acquire state information;

S002. Learning is performed according to the acquired information;

S003. Judge whether the machining quality and the machining cycle meet the requirements; if they do, learning ends, otherwise state information is re-acquired and learning is repeated. (A minimal sketch of this outer loop is given below.)
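The three steps above form a simple outer loop: sense, learn, then check whether quality and cycle time are acceptable. A minimal Python sketch of that loop follows; the helper functions collect_state(), learn_step() and check_quality() are illustrative placeholders and are not named anywhere in the patent.

```python
import random

def collect_state():
    # Placeholder for S001: joint positions, velocities, accelerations,
    # currents, voltages, torques plus workpiece vision data in the patent.
    return [random.random() for _ in range(6)]

def learn_step(state):
    # Placeholder for S002: one update of the reinforcement-learning networks.
    pass

def check_quality():
    # Placeholder for S003: returns (quality_ok, cycle_time_ok).
    return random.random() > 0.99, True

def run_learning(max_episodes=1000):
    for episode in range(max_episodes):
        state = collect_state()          # S001: sensors acquire state information
        learn_step(state)                # S002: learn from the acquired information
        quality_ok, cycle_ok = check_quality()
        if quality_ok and cycle_ok:      # S003: stop once both requirements are met
            return episode               # otherwise re-acquire state and keep learning
    return max_episodes

if __name__ == "__main__":
    print("stopped after", run_learning(), "learning iterations")
```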

As a further improvement of the present invention, the learning model consists of an environment unit, a robot learning unit and a machining execution unit;

wherein the environment unit consists of a workpiece-state measurement sensor and a robot end-state measurement observer; the workpiece-state measurement sensor acquires visual information on the machined workpiece, the visual information including at least the geometry and surface smoothness of the workpiece; and the robot end-state measurement observer acquires the robot's position, velocity, acceleration and joint torques;

the state observation unit obtains the information acquired by the environment unit over a communication link and converts it into a data format;

the data processing unit receives and processes the information converted into a data format by the state observation unit; the data processing unit comprises a reward calculation unit and a function update unit, wherein the reward calculation unit sets an immediate reward r through a reward-function setting unit and performs calculations on the information from the state observation unit; once the calculation is complete, the resulting parameters are passed to the function update unit, which updates them by neural-network training until the final learned parameters are obtained; the final learned parameters are stored, behavior decisions are made through the neural network, and reinforcement learning then yields a deterministic policy that drives the robot.

As a further improvement of the present invention, the reinforcement learning defines the mapping of the robot from state information to behavior as a policy π, and defines the cumulative return obtained from time t as

R_t = Σ_{i=t}^{T} γ^{i-t} r(s_i, a_i)

From the cumulative return, the expected return is obtained through

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]

where Q^π(s_t, a_t) denotes the expected return of taking behavior a_t in state s_t under policy π.

Combining the cumulative-return and expected-return formulas gives the recursive form of the expected return:

Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]]

Based on this recursive formula, the most recently updated policy is used for decision-making throughout learning.

In the present invention, reinforcement learning is adopted. Reinforcement-learning policies are divided into deterministic and non-deterministic policies; the present invention uses reinforcement learning with a deterministic policy, i.e. in a given state a behavior is output directly rather than a probability distribution over behaviors. The expected return Q can then be calculated by formula (4):

Q^μ(s_t, a_t) = E[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))]    (4)

where μ denotes the deterministic behavior.

As a further improvement of the present invention, the reinforcement learning uses a deterministic-policy reinforcement-learning scheme, the specific procedure of which comprises the following steps:

S201. Initialize the behavior network μ(s|θ^μ) with parameters θ^μ and the evaluation network Q(s, a|θ^Q) with parameters θ^Q, and initialize the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q, θ^μ′ ← θ^μ;

S202. Initialize the buffer container R;

S203. Receive the state information s_t from the state observation unit;

S204. Select the behavior a_t to execute according to the current policy, with a certain amount of noise added;

S205. Observe the reward r_t obtained and observe the next state information s_{t+1};

S206. Store the quadruple <s_t, a_t, r_t, s_{t+1}> in the buffer container R;

S207. Randomly select a batch of quadruple samples from the buffer container for training;

S208. Update the evaluation-network parameters;

S209. Update the behavior-network parameters;

S210. Judge whether the number of learning iterations exceeds a preset value or whether the machining quality is good enough;

S211. Transmit the parameters of the evaluation network and the behavior network to the host for storage, and end learning.

As a further improvement of the present invention, when the evaluation-network parameters are updated in step S208, the target function y_t is first set to y_t = r(s_t, a_t) + γQ(s_{t+1}, μ(s_{t+1})|θ^Q), and the parameters used to update the evaluation network are then obtained from min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) - y_t)^2], where a_t denotes the behavior at time t, Q denotes the expected cumulative reward, θ^Q denotes the parameters of the evaluation network, E denotes the expected value of the sum of squared errors between the actual rewards and the targets over multiple groups of data, L(θ^Q) denotes the error under the parameters θ^Q, and μ(s_{t+1}) denotes the deterministic policy in state s_{t+1}.

As a further improvement of the present invention, when the behavior-network parameters are updated in step S209, the behavior network is updated with the gradient method ∇_{θ^μ} J ≈ E[∇_a Q(s, a|θ^Q)|_{a=μ(s)} · ∇_{θ^μ} μ(s|θ^μ)], while the target networks are updated with the following group of formulas:

θ′ ← τθ + (1-τ)θ′, i.e.:

θ^Q′ ← τθ^Q + (1-τ)θ^Q′

θ^μ′ ← τθ^μ + (1-τ)θ^μ′

with τ ≪ 0.05,

where ∇_{θ^μ} denotes differentiation with respect to θ^μ, ∇_a denotes differentiation with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, with θ^μ as the variable.

The beneficial effects of the present invention are:

1. By acquiring machining information and learning by means of reinforcement learning, the method of the present invention reduces robot debugging work and optimizes the control strategy of the industrial robot, including the trajectory-planning function for a given path and the motor control strategy for a given trajectory; it eliminates the vibration during high-speed operation caused by the lack of an accurate dynamic model in conventional learning approaches and improves the working efficiency of industrial robots.

2. This learning method is aimed precisely at learning the control strategy for high-speed operation: it learns and improves the control strategy from sensor data and thus achieves good control at high speed.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the learning-model structure of the present invention;

Fig. 2 is a flow chart of the learning method of the present invention;

Fig. 3 is a flow chart of the reinforcement learning of the present invention.

Detailed Description of the Embodiments

The present invention is further described below with reference to the accompanying drawings and specific embodiments.

The learning method of the present invention is obtained on the basis of a learning-model structure, which is also the industrial robot system. Fig. 1 is a schematic diagram of the learning-model structure of the present invention. The model consists of an environment unit, a robot learning unit and a machining execution unit, wherein the environment unit comprises at least a machining-quality measurement unit, the robot learning unit comprises a state observation unit, a data processing unit and a decision-making unit, and the machining execution unit comprises at least a robot and a positioner.

The working process of each unit of the learning model of the present invention is as follows:

The environment unit, which in this embodiment is the machining-quality measurement unit, consists of a workpiece-state measurement sensor and a robot end-state measurement observer. The workpiece-state measurement sensor mainly acquires visual information on the machined workpiece, including its geometry and surface smoothness. The robot end-state measurement observer may also be a robot end-state measurement sensor, and is used to acquire the robot's position, velocity, acceleration, joint torques and other information.

The state observation unit obtains the information acquired by the machining-quality measurement unit over a communication link and converts it into a data format.

The data processing unit receives and processes the information converted into a data format by the state observation unit. The data processing unit comprises a reward calculation unit and a function update unit. The reward calculation unit sets an immediate reward r through a reward-function setting unit, specifically on the basis of the function r = α·velocity + β·position_error + γ·acceleration + u·R·u^T + …, where the parameters α, β and γ are adjusted to change the weight of each measured index: α is the weight of velocity in the reward function, β the weight of position error, γ the weight of acceleration, R is a positive-definite matrix, and u represents the control inputs such as voltage and current. After the reward calculation unit finishes its calculation, the result is passed to the function update unit for updating. In this embodiment the function update unit preferably updates the parameters by neural-network training; after the final learned parameters are obtained, the learning result is stored, and the neural network is used to make behavior decisions that drive the robot to work while new information is continuously collected. In this way, for a specific machining task, reinforcement learning adjusts the voltage, position, velocity, acceleration and so on in the various states according to the required performance indices; this process covers both the trajectory-planning function and the motor control strategy. (A sketch of such a reward function is given below.)
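The reward function above is only specified up to its weighted terms. The following Python sketch shows one way such an immediate reward could be computed; the default weights, any sign convention that turns the error and effort terms into penalties, and the default choice of R are illustrative assumptions, not values given by the patent.

```python
import numpy as np

def instant_reward(velocity, position_error, acceleration, u,
                   alpha=1.0, beta=1.0, gamma=1.0, R=None):
    """Immediate reward r = alpha*velocity + beta*position_error
    + gamma*acceleration + u*R*u^T + ...

    alpha, beta, gamma weight the individual indices (their actual values are
    tuning choices); u is the control input (voltage, current, ...) and R is
    a positive-definite matrix.
    """
    u = np.atleast_1d(np.asarray(u, dtype=float))
    if R is None:
        R = np.eye(u.size)                         # assumed default
    return float(alpha * velocity
                 + beta * position_error
                 + gamma * acceleration
                 + u @ R @ u)                      # quadratic control-effort term
```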

In this embodiment, a learning method applicable to high-volume repetitive machining robots is designed on the basis of the model shown in Fig. 1. The flow chart of the method is shown in Fig. 2, and the specific steps are as follows:

S001. Sensors acquire state information;

S002. Learning is performed according to the acquired information;

S003. Judge whether the machining quality and the machining cycle meet the requirements; if they do, learning ends, otherwise state information is re-acquired and learning is repeated.

In this embodiment, in step S001 the sensors acquire state information: the robot end-state measurement observer of the machining-information measurement unit acquires the position, velocity, acceleration, current, voltage, vibration rate and torque of the robot joints and of the end of the robot arm, while the workpiece-state measurement sensor mainly acquires visual information containing the geometry and surface smoothness of the machined workpiece. In this process the visual information is converted to gray scale to avoid the influence of illumination, and the position, velocity, acceleration and other signals are normalized and their lengths unified so that they can conveniently be fed into the neural network. (A sketch of this preprocessing is given below.)
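A minimal NumPy sketch of this preprocessing (gray-scale conversion plus normalization into a fixed-length state vector); the luminance weights, the target length and the padding scheme are assumptions, since the patent only states that gray-scaling and length unification are performed.

```python
import numpy as np

def preprocess_image(rgb):
    """Gray-scale conversion to reduce the influence of illumination (S001)."""
    rgb = np.asarray(rgb, dtype=np.float32)
    gray = rgb @ np.array([0.299, 0.587, 0.114], dtype=np.float32)  # luminance weights
    return gray / 255.0

def normalize(x, lo, hi):
    """Scale a signal (position, velocity, acceleration, ...) into [0, 1]."""
    return (np.asarray(x, dtype=np.float32) - lo) / (hi - lo)

def build_state(gray, joint_signals, length=64):
    """Unify the signal lengths and concatenate them into one state vector."""
    parts = [np.resize(gray.ravel(), length)]
    parts += [np.resize(np.asarray(s, dtype=np.float32), length) for s in joint_signals]
    return np.concatenate(parts).astype(np.float32)
```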

In this embodiment, in step S002 learning is performed according to the acquired information: the machine learning unit performs deep reinforcement learning on the current, voltage, torque, vibration, stereo-vision and other information obtained by the state observation unit. In this embodiment, model-free reinforcement learning is preferably used for learning from the acquired information.

The reinforcement-learning process is based on a discrete, equally spaced time series through which the robot interacts with its environment. Specifically, it is assumed that in each time interval t the state observation unit feeds the observed state to the machine learning unit; the machine learning unit takes the current state as input to the neural network, whose output is the deterministic behavior of the motors. The mapping of the robot from state information to behavior is defined as a policy π, and the cumulative return obtained from time t is:

R_t = Σ_{i=t}^{T} γ^{i-t} r(s_i, a_i)    (1)

The policies obtained by reinforcement learning can be divided into deterministic and non-deterministic policies: a non-deterministic policy outputs the probability of each behavior, whereas a deterministic policy outputs a specific behavior directly.

In the present invention, the purpose of using reinforcement learning is to learn a deterministic policy π, i.e. to learn directly the mapping from state input to output action; for the behavior network, the input is the state information and the output is the action. The expected return Q obtained from the initial state is to be maximized, which can be expressed by formula (2):

Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]    (2)

where Q^π(s_t, a_t) denotes the expected return of taking behavior a_t in state s_t under policy π. Combining formulas (1) and (2), the recursive form of the expected return can be obtained, as in formula (3):

Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]]    (3)

This means that during learning we can keep using the most recently updated policy to make decisions.

In the present invention, reinforcement learning is adopted. Reinforcement-learning policies are divided into deterministic and non-deterministic policies; the present invention uses reinforcement learning with a deterministic policy, i.e. in a given state a behavior is output directly rather than a probability distribution over behaviors. The expected return Q can then be calculated by formula (4):

Q^μ(s_t, a_t) = E[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))]    (4)

where μ denotes the deterministic behavior. Formula (4) is the recursive form of formula (2): formula (2) expresses the overall idea of the expected return in the present invention, while formula (4) is the practical implementation, which allows the cumulative reward at the current step to be computed from the previous cumulative reward and the immediate reward at this moment, so that it can easily be programmed on a computer.

In this embodiment, the specific procedure of reinforcement learning with a deterministic policy comprises the steps shown in Fig. 3:

S201. Initialize the behavior network μ(s|θ^μ) with parameters θ^μ and the evaluation network Q(s, a|θ^Q) with parameters θ^Q, and initialize the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q, θ^μ′ ← θ^μ.

Specifically, the evaluation network Q(s, a|θ^Q) and the behavior network μ(s|θ^μ) are initialized; their neural-network parameters, denoted θ^Q and θ^μ respectively, represent the weight of each neuron in the network. Here θ^Q denotes the parameters of the evaluation network and θ^μ those of the behavior network. The input of μ(s|θ^μ) is the state information from the state observation unit, and its output is a behavior a_t. The input of the evaluation network Q(s, a|θ^Q) is the state information from the state observation unit together with the output of the behavior network μ(s|θ^μ), and its output is the value of the value function for taking that behavior in that state, reflecting how good the current policy is. The target networks are then initialized; they have the same structures as the behavior network and the evaluation network respectively, and their parameters are slowly varying copies of the behavior-network and evaluation-network parameters. The target networks are updated more slowly than the behavior and evaluation networks in order to keep the neural-network learning process stable. (A sketch of this initialization is given below.)
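A minimal sketch of this initialization using PyTorch (the patent does not name a framework); the state/action dimensions, hidden-layer sizes and the multilayer-perceptron structure are illustrative assumptions.

```python
import copy
import torch.nn as nn

STATE_DIM, ACTION_DIM = 32, 6          # assumed dimensions, not fixed by the patent

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

# S201: behavior network mu(s | theta_mu) and evaluation network Q(s, a | theta_Q)
actor  = mlp(STATE_DIM, ACTION_DIM)               # maps a state to a behavior a_t
critic = mlp(STATE_DIM + ACTION_DIM, 1)           # scores a (state, behavior) pair

# Target networks start as exact copies: theta_Q' <- theta_Q, theta_mu' <- theta_mu
actor_target  = copy.deepcopy(actor)
critic_target = copy.deepcopy(critic)
for p in list(actor_target.parameters()) + list(critic_target.parameters()):
    p.requires_grad_(False)            # targets change only through the slow soft update
```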

S202. Initialize the buffer container;

S203. Receive the state information from the state observation unit;

S204. Select the behavior to execute according to the current policy, with a certain amount of noise added;

Specifically, at each discrete time point of a machining cycle, the behavior a_t is selected according to the current policy with a certain amount of random noise added, i.e. a_t = μ(s_t|θ^μ) + N(t), where μ(s) = argmax_a Q(s, a). (A sketch of this action selection is given below.)
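A sketch of this action-selection step, assuming the PyTorch actor defined above and Gaussian exploration noise (the patent only requires "a certain amount of random noise", without fixing its distribution):

```python
import torch

def select_action(actor, state, noise_std=0.1):
    """S204: a_t = mu(s_t | theta_mu) + N(t)."""
    with torch.no_grad():
        a = actor(torch.as_tensor(state, dtype=torch.float32))
    return a + noise_std * torch.randn_like(a)     # exploration noise N(t)
```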

S205. Observe the reward obtained and observe the next state information;

Specifically, the current state information s_t is taken as input and fed into the behavior network, which outputs a specific behavior value a_t; the state information and the behavior value are then fed into the evaluation network, which outputs the reward r_t; the state observation unit acquires the state information s_{t+1} at the next moment, giving the quadruple (s_t, a_t, r_t, s_{t+1}).

S206. Store (s_t, a_t, r_t, s_{t+1}) in the buffer container; (s_t, a_t, r_t, s_{t+1}) records the size of the reward r_t obtained after taking behavior a_t in state s_t, and the state at the next moment;

S207. Randomly select a batch of samples from the buffer container for training (a sketch of the buffer container and of this sampling is given below);
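A sketch of the buffer container R and the random batch sampling of steps S202, S206 and S207; the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Buffer container R holding the quadruples <s_t, a_t, r_t, s_{t+1}>."""
    def __init__(self, capacity=100_000):          # S202: initialize the buffer
        self.buf = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):              # S206: store one quadruple
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size=64):               # S207: random batch for training
        batch = random.sample(list(self.buf), min(batch_size, len(self.buf)))
        s, a, r, s_next = zip(*batch)
        return s, a, r, s_next
```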

S208. Update the evaluation-network parameters;

Specifically, the target function is set to y_t = r(s_t, a_t) + γQ(s_{t+1}, μ(s_{t+1})|θ^Q), and the evaluation network is updated through min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) - y_t)^2], where Q denotes the expected cumulative reward, θ^Q denotes the parameters of the evaluation network, E denotes the expected value of the sum of squared errors between the actual rewards and the targets over multiple groups of data, L(θ^Q) denotes the error under the parameters θ^Q, and μ(s_{t+1}) denotes the deterministic policy in state s_{t+1}. (A sketch of this update is given below.)
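A sketch of this evaluation-network (critic) update in PyTorch. As is usual for this kind of algorithm, and consistent with the target networks initialized in S201, the bootstrap term here uses the target networks Q′ and μ′; the batch tensors s, a, r, s_next are assumed to be float tensors with r shaped [batch, 1].

```python
import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, critic_opt,
                  s, a, r, s_next, gamma=0.99):
    """S208: y_t = r + gamma * Q'(s_{t+1}, mu'(s_{t+1})), then minimize
    L(theta_Q) = E[(Q(s_t, a_t | theta_Q) - y_t)^2]."""
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=-1))
    q = critic(torch.cat([s, a], dim=-1))
    loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()
```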

S209. Update the behavior-network parameters;

Specifically, the behavior network is updated with the gradient method

∇_{θ^μ} J ≈ E[∇_a Q(s, a|θ^Q)|_{a=μ(s)} · ∇_{θ^μ} μ(s|θ^μ)]

while the target networks are updated with the following group of formulas:

θ′ ← τθ + (1-τ)θ′, i.e.:

θ^Q′ ← τθ^Q + (1-τ)θ^Q′

θ^μ′ ← τθ^μ + (1-τ)θ^μ′

with τ ≪ 0.05,

where ∇_{θ^μ} denotes differentiation with respect to θ^μ, ∇_a denotes differentiation with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, with θ^μ as the variable. (A sketch of the behavior-network update and of the soft target update is given below.)
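A sketch of the behavior-network (actor) update and of the soft target-network update in PyTorch; the default τ = 0.001 is an assumption consistent with the requirement above that τ be small (≪ 0.05).

```python
import torch

def actor_update(actor, critic, actor_opt, s):
    """S209: deterministic policy gradient, i.e. ascend
    grad_a Q(s, a | theta_Q)|_{a=mu(s)} * grad_theta_mu mu(s | theta_mu)."""
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

@torch.no_grad()
def soft_update(net, target, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta' for both target networks."""
    for p, p_t in zip(net.parameters(), target.parameters()):
        p_t.mul_(1.0 - tau).add_(tau * p)
```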

S210. Judge whether the number of learning iterations exceeds a preset value or whether the machining quality is good enough; specifically, learning is exited when the number of learning iterations reaches a predetermined number (e.g. one million) or the learned policy already meets the application requirements (the machining quality is good).

S211. Transmit the parameters of the evaluation network and the behavior network to the host for storage; end.

After learning ends, the parameters of the evaluation network and the behavior network are transmitted to the host for storage, and the (s_t, a_t, r_t, s_{t+1}) information obtained during machining is also transmitted to the host for storage. The host transmits the obtained information to other robots to be trained. After a robot learning unit receives the trained neural-network parameters, it adopts the same neural-network structure, fixes those parameters so that they cannot change, and leaves only the parameters of the last two layers free to change; the parameters of the last two layers are then adjusted according to the robot's actual machining situation. (A sketch of this transfer step is given below.)
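A sketch of the transfer step described above: all parameterized layers of a trained network are frozen except the last two, which remain trainable for fine-tuning on the receiving robot's own data. The use of PyTorch and of nn.Linear layers follows the earlier sketches and is an assumption.

```python
import torch.nn as nn

def prepare_for_transfer(network):
    """Freeze a trained network except its last two parameterized layers,
    which are then re-tuned on the receiving robot's own machining data."""
    param_layers = [m for m in network.modules() if isinstance(m, nn.Linear)]
    for layer in param_layers[:-2]:
        for p in layer.parameters():
            p.requires_grad_(False)
    # Only the parameters of the last two layers remain trainable; pass these
    # to a fresh optimizer for fine-tuning.
    return [p for layer in param_layers[-2:] for p in layer.parameters()]
```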

In this embodiment, r denotes the immediate reward and R denotes the cumulative return.

Both the learning method and the reinforcement-learning process of the present invention acquire real-time state information of the industrial robot during its working process (including high-speed motion and heavy-load operation), taking into account the influence of inertial, centrifugal, friction, gravity and joint-torque forces on the robot's work. The vibration produced during operation is likewise captured through the robot end-state measurement sensors, which acquire the robot's position, velocity, acceleration, joint torques and other information, so that the optimal policy is updated and generated in real time; this effectively ensures the consistency of the robots and avoids wrong decisions.

In summary, by acquiring machining information and learning by means of reinforcement learning, the method of the present invention reduces robot debugging work and optimizes the control strategy of the industrial robot, including the trajectory-planning function for a given path and the motor control strategy for a given trajectory; it eliminates the vibration during high-speed operation caused by the lack of an accurate dynamic model in conventional learning approaches and improves the working efficiency of industrial robots. This learning method is aimed precisely at learning the control strategy for high-speed operation: it learns and improves the control strategy from sensor data and thus achieves good control at high speed.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and it should not be assumed that the specific implementation of the present invention is limited to these descriptions. For a person of ordinary skill in the technical field of the present invention, several simple deductions or substitutions may also be made without departing from the concept of the present invention, and they should all be regarded as falling within the protection scope of the present invention.

Claims (6)

1. An industrial robot learning method applied to high-volume repetitive machining, characterized in that the learning method learns on the basis of a learning model and comprises the following steps:
S001. Sensors acquire state information;
S002. Learning is performed according to the acquired information;
S003. Judge whether the machining quality and the machining cycle meet the requirements; if they do, learning ends, otherwise state information is re-acquired and learning is repeated.

2. The industrial robot learning method applied to high-volume repetitive machining according to claim 1, characterized in that the learning model consists of an environment unit, a robot learning unit and a machining execution unit, wherein the environment unit comprises at least a machining-quality measurement unit, the robot learning unit comprises a state observation unit, a data processing unit and a decision-making unit, and the machining execution unit comprises at least a robot and a positioner;
the environment unit consists of a workpiece-state measurement sensor and a robot end-state measurement observer, the workpiece-state measurement sensor acquires visual information on the machined workpiece, the visual information including at least the geometry and surface smoothness of the workpiece, and the robot end-state measurement observer acquires the robot's position, velocity, acceleration and joint torques;
the state observation unit obtains the information acquired by the environment unit over a communication link and converts it into a data format;
the data processing unit receives and processes the information converted into a data format by the state observation unit; the data processing unit comprises a reward calculation unit and a function update unit, wherein the reward calculation unit sets an immediate reward r through a reward-function setting unit and performs calculations on the information from the state observation unit; once the calculation is complete, the resulting parameters are passed to the function update unit, which updates them by neural-network training until the final learned parameters are obtained; the final learned parameters are stored, behavior decisions are made through the neural network, and reinforcement learning then yields a deterministic policy that drives the robot.

3. The industrial robot learning method applied to high-volume repetitive machining according to claim 2, characterized in that the reinforcement learning defines the mapping of the robot from state information to behavior as a policy π and defines the cumulative return obtained from time t as R_t = Σ_{i=t}^{T} γ^{i-t} r(s_i, a_i); from the cumulative return, the expected return is obtained through Q^π(s_t, a_t) = E_π[R_t | s_t, a_t], where Q^π(s_t, a_t) denotes the expected return of taking behavior a_t in state s_t under policy π; combining the cumulative-return and expected-return formulas gives the recursive form of the expected return, Q^π(s_t, a_t) = E[r(s_t, a_t) + γ E_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]]; based on this recursive formula, the most recently updated policy is used for decision-making throughout learning.

4. The industrial robot learning method applied to high-volume repetitive machining according to claim 2, characterized in that the reinforcement learning uses a deterministic-policy reinforcement-learning scheme, the specific procedure of which comprises the following steps:
S201. Initialize the behavior network μ(s|θ^μ) with parameters θ^μ and the evaluation network Q(s, a|θ^Q) with parameters θ^Q, and initialize the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q, θ^μ′ ← θ^μ;
S202. Initialize the buffer container R;
S203. Receive the state information s_t from the state observation unit;
S204. Select the behavior a_t to execute according to the current policy, with a certain amount of noise added;
S205. Observe the reward r_t obtained and observe the next state information s_{t+1};
S206. Store the quadruple <s_t, a_t, r_t, s_{t+1}> in the buffer container;
S207. Randomly select a batch of quadruple samples from the buffer container for training;
S208. Update the evaluation-network parameters;
S209. Update the behavior-network parameters;
S210. Judge whether the number of learning iterations exceeds a preset value or whether the machining quality is good enough;
S211. Transmit the parameters of the evaluation network and the behavior network to the host for storage, and end learning.

5. The industrial robot learning method applied to high-volume repetitive machining according to claim 3, characterized in that when the evaluation-network parameters are updated in step S208, the target function y_t is first set to y_t = r(s_t, a_t) + γQ(s_{t+1}, μ(s_{t+1})|θ^Q), and the parameters used to update the evaluation network are then obtained from min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) - y_t)^2], where a_t denotes the behavior at time t, Q denotes the expected cumulative reward, θ^Q denotes the parameters of the evaluation network, E denotes the expected value of the sum of squared errors between the actual rewards and the targets over multiple groups of data, L(θ^Q) denotes the error under the parameters θ^Q, and μ(s_{t+1}) denotes the deterministic policy in state s_{t+1}.

6. The industrial robot learning method applied to high-volume repetitive machining according to claim 3, characterized in that when the behavior-network parameters are updated in step S209, the behavior network is updated with the gradient method ∇_{θ^μ} J ≈ E[∇_a Q(s, a|θ^Q)|_{a=μ(s)} · ∇_{θ^μ} μ(s|θ^μ)], while the target networks are updated with the following group of formulas:
θ′ ← τθ + (1-τ)θ′, i.e.:
θ^Q′ ← τθ^Q + (1-τ)θ^Q′
θ^μ′ ← τθ^μ + (1-τ)θ^μ′
with τ ≪ 0.05,
where ∇_{θ^μ} denotes differentiation with respect to θ^μ, ∇_a denotes differentiation with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, with θ^μ as the variable.
CN201810921161.2A 2018-08-13 2018-08-13 A kind of industrial robot learning method applied to high-volume repeatability processing Pending CN108927806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810921161.2A CN108927806A (en) 2018-08-13 2018-08-13 A kind of industrial robot learning method applied to high-volume repeatability processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810921161.2A CN108927806A (en) 2018-08-13 2018-08-13 A kind of industrial robot learning method applied to high-volume repeatability processing

Publications (1)

Publication Number Publication Date
CN108927806A true CN108927806A (en) 2018-12-04

Family

ID=64445042

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810921161.2A Pending CN108927806A (en) 2018-08-13 2018-08-13 A kind of industrial robot learning method applied to high-volume repeatability processing

Country Status (1)

Country Link
CN (1) CN108927806A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106392266A (en) * 2015-07-31 2017-02-15 发那科株式会社 Machine learning device, arc welding control device, and arc welding robot system
JP2017102613A (en) * 2015-11-30 2017-06-08 ファナック株式会社 Machine learning device and method for optimizing smoothness of feeding of feed shaft of machine and motor control device having machine learning device
CN107199397A (en) * 2016-03-17 2017-09-26 发那科株式会社 Machine learning device, laser-processing system and machine learning method
US20180079076A1 (en) * 2016-09-16 2018-03-22 Fanuc Corporation Machine learning device, robot system, and machine learning method for learning operation program of robot
CN108202327A (en) * 2016-12-16 2018-06-26 发那科株式会社 Machine learning device, robot system and machine learning method
CN108052004A (en) * 2017-12-06 2018-05-18 湖北工业大学 Industrial machinery arm autocontrol method based on depth enhancing study

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘全, 翟建伟, 章宗长, 钟珊, 周倩, 章鹏, 徐进: "深度强化学习综述" (A Survey of Deep Reinforcement Learning), 《计算机学报》 (Chinese Journal of Computers) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110293560A (en) * 2019-01-12 2019-10-01 鲁班嫡系机器人(深圳)有限公司 Robot behavior training, planing method, device, system, storage medium and equipment
CN114630734A (en) * 2019-09-30 2022-06-14 西门子股份公司 Visual servoing with dedicated hardware acceleration to support machine learning
CN114925988A (en) * 2022-04-29 2022-08-19 南京航空航天大学 Machining task driven multi-robot collaborative planning method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181204