CN118114746B

CN118114746B - Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error

Info

Publication number: CN118114746B
Application number: CN202410508730.6A
Authority: CN
Inventors: 陈兴国; 江宛真; 巩宇
Original assignee: Nanjing University of Posts and Telecommunications
Current assignee: Nanjing University of Posts and Telecommunications
Priority date: 2024-04-26
Filing date: 2024-04-26
Publication date: 2024-07-23
Anticipated expiration: 2044-04-26
Also published as: CN118114746A

Abstract

The present invention provides a Bellman error-based variance minimization reinforcement learning robot arm training acceleration method for robot arm control, comprising the following steps: establishing an engineering problem into a reinforcement learning environment model, and obtaining and measuring the position and posture data of the robot arm during movement, such as the joint angle, angular velocity, end effector position, end effector speed, and obstacle position, by using a position sensor and a rotary encoder. The data is transformed by a neural network to form the characteristics of the robot arm state. The variance minimization algorithm based on the projected Bellman error is used for training to improve the control strategy of the robot arm. Through repeated iterative training, the optimal control strategy of the robot arm is finally obtained, and the performance of the robot arm in specific tasks and application scenarios is improved. By reducing the variance of the gradient estimate, the method can accelerate the speed of convergence to the optimal strategy, improve the accuracy and efficiency of the robot arm training, and improve the performance of the automatic control system.

Description

A method for accelerating the training of robotic arms using reinforcement learning based on variance minimization of Bellman error

技术领域Technical Field

本发明涉及计算机文本情感分析领域，尤其涉及一种基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法。The present invention relates to the field of computer text sentiment analysis, and in particular to a Bellman error-based variance minimization reinforcement learning robot arm training acceleration method.

背景技术Background technique

随着科技的不断演进，机械臂在多个领域中逐渐成为不可或缺的技术。在复杂的工业环境中，机械臂的控制和训练面临着一些挑战，例如复杂的工作任务、不确定的环境条件以及需要快速适应新任务的能力。针对这些挑战，研究人员着重探索机械臂控制的优化方法，特别是基于强化学习的训练加速算法。很多强化学习方法在训练过程中可能会面临梯度估计方差较大的问题，导致训练效率降低，需要更长的训练时间。With the continuous evolution of science and technology, robotic arms have gradually become an indispensable technology in many fields. In complex industrial environments, the control and training of robotic arms face some challenges, such as complex work tasks, uncertain environmental conditions, and the need to quickly adapt to new tasks. In response to these challenges, researchers focus on exploring optimization methods for robotic arm control, especially training acceleration algorithms based on reinforcement learning. Many reinforcement learning methods may face the problem of large variance in gradient estimation during training, resulting in reduced training efficiency and longer training time.

有鉴于此，有必要设计一种基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法，以解决上述问题。In view of this, it is necessary to design a variance minimization reinforcement learning robotic arm training acceleration method based on Bellman error to solve the above problems.

发明内容Summary of the invention

本发明的目的在于提供一种高效的基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法。The purpose of the present invention is to provide an efficient Bellman error-based variance minimization reinforcement learning manipulator training acceleration method.

为实现上述发明目的，本发明提供了一种基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法，包括如下步骤：To achieve the above-mentioned invention object, the present invention provides a Bellman error-based variance minimization reinforcement learning robot arm training acceleration method, comprising the following steps:

步骤S1、针对机械臂的作业要求建立强化学习环境模型，实例化已训练好的神经网络模型；Step S1: Establish a reinforcement learning environment model according to the operation requirements of the robot arm and instantiate the trained neural network model;

步骤S2、使用位置传感器和旋转编码器，获取并测量机械臂的状态信息，所述状态信息至少包括机械臂的关节角度、关节角速度、末端执行器位置、末端执行器速度和障碍物位置；Step S2: Use the position sensor and rotary encoder to obtain and measure the state information of the robot arm , the status information At least the joint angles of the robot arm , joint angular velocity , end effector position , end effector speed and obstacle location ;

步骤S3、将机械臂的状态信息和可选动作输入神经网络模型中，得到对应的特征向量，利用线性方法，并借助ε-greedy策略，选择动作，并保存动作对应的特征向量；Step S3: The status information of the robot arm and optional actions Input into the neural network model to get the corresponding feature vector , using a linear method and with the help of the ε-greedy strategy, select actions , and save the action The corresponding eigenvector ;

步骤S4、智能体执行动作，获得奖励，进入下一个状态，用于表示下一个状态，利用所述步骤S3，获得状态下的动作和特征向量；Step S4: Agent performs actions ,Earn rewards , enter the next state , Used to indicate the next state, using step S3, to obtain the state Next action and the eigenvector ;

步骤S5、机械臂利用投影贝尔曼误差的方差最小化方法的方法对机械臂控制策略的参数进行更新；Step S5, the robot arm updates the parameters of the robot arm control strategy by using the method of minimizing the variance of the projected Bellman error;

步骤S6、重复所述步骤S2至步骤S5，直到机械臂到达目标位置或迭代达到最大次数。Step S6, repeating steps S2 to S5 until the robot arm reaches the target position or the iteration reaches the maximum number of times.

作为本发明的进一步改进，所述步骤S1具体为：根据机械臂的任务要求，建立一个强化学习环境模型；用带有标记状态特征的数据集充分训练神经网络模型；将机械臂所有的状态信息和可选动作集合依次输入到经过充分训练的神经网络模型中，以获取对应的特征向量。As a further improvement of the present invention, the step S1 is specifically as follows: according to the task requirements of the robot arm, a reinforcement learning environment model is established; a data set with marked state features is used to fully train the neural network model; all state information of the robot arm is recorded. and a collection of optional actions Input into the fully trained neural network model in sequence to obtain the corresponding feature vector .

作为本发明的进一步改进，所述步骤S2具体为：旋转编码器获得机械臂的状态信息s，机械臂的状态信息s至少包括机械臂相对于竖直方向的角度、机械臂的旋转方向上的角度、机械臂上端的角速度、机械臂关节连接处的角速度、末端执行器位置、末端执行器速度和障碍物位置。As a further improvement of the present invention, step S2 is specifically: the rotary encoder obtains the state information s of the robot arm, and the state information s of the robot arm at least includes the angle of the robot arm relative to the vertical direction. , Angle of the robot arm in the direction of rotation , angular velocity of the upper end of the robotic arm , angular velocity of the joint of the robot arm , end effector position , end effector speed and obstacle location .

作为本发明的进一步改进，所述步骤S3具体为：获取机械臂当前所有可选动作，将状态信息及所有可选动作依次输入到已训练好的神经网络模型中，以获得所有可选的特征向量，其中，为可选动作集合，为在状态s下的特征向量；利用线性方法计算所有可选的动作状态函数，即，其中，为特征权重参数向量，为向量转置符号；利用方法选择动作，并保存相应的特征向量。As a further improvement of the present invention, step S3 is specifically: obtaining all the current optional actions of the robot arm , the status information and all optional actions Input into the trained neural network model in sequence to obtain all optional feature vectors ,in, is a set of optional actions. is the feature vector in state s; all optional action-state functions are calculated using linear methods ,Right now ,in, is the feature weight parameter vector, Transpose the sign of a vector; use Method selection action , and save the corresponding feature vector .

作为本发明的进一步改进，所述步骤S4具体为：机械臂执行动作a后，获取即时奖励，并进入下一个状态，同时获取机械臂当前所有可选动作，将所有状态信息及可选动作输入已经训练好的神经网络模型中，获得在状态下所有可选的特征向量，其中，用于表示在状态下所有可选的动作集合，采用线性方法计算所有可选动作在状态下的，即，其中，为特征权重参数向量，为向量转置符号；利用方法选择动作，获得其相应的特征向量。As a further improvement of the present invention, step S4 is specifically as follows: after the robot arm performs action a, it obtains an immediate reward and enters the next state , and get all the current optional actions of the robot arm , all status information and optional actions Input the trained neural network model and obtain the state All optional eigenvectors ,in, Used to indicate the status Under all the optional action sets, a linear method is used to calculate all the optional actions in the state Next ,Right now ,in, is the feature weight parameter vector, Transpose the sign of a vector; use Method selection action , and obtain its corresponding eigenvector .

作为本发明的进一步改进，所述步骤S4中，机械臂获得即时奖励的具体奖励形式如下：As a further improvement of the present invention, in step S4, the specific reward form of the robot arm obtaining the instant reward is as follows:

目标达成奖励：如果末端执行器距离目标位置越近，奖励越高，具体如下：Goal Achievement Rewards : The closer the end effector is to the target position, the higher the reward, as follows:

避障奖励：当机械臂末端离障碍物越远，给予正奖励，具体如下：Obstacle Avoidance Reward : The farther the end of the robotic arm is from the obstacle, the positive reward is given, as follows:

平滑度奖励：对于关节角度和末端执行器位置的平滑变化给予正奖励，具体如下：Smoothness Bonus : Give positive rewards for smooth changes in joint angles and end-effector positions as follows:

能量消耗奖励：对于能耗较小的动作给予正奖励，具体如下：Energy consumption bonus : Give positive rewards for actions with low energy consumption, as follows:

碰撞惩罚：如果机械臂与障碍物发生碰撞，给予一个较大的负奖励，具体如下：Collision Penalty : If the robot collides with an obstacle, a large negative reward is given, as follows:

动作惩罚：对于关节角度和末端执行器位置变化过大或速度过快的动作给予负奖励，具体如下：Action Penalty : Negative rewards are given for actions where the joint angle and end effector position change too much or the speed is too fast, as follows:

所述奖励r为上述奖励至之和。The reward r is the above reward to Sum.

作为本发明的进一步改进，基于投影贝尔曼误差的方差最小化方法的优化过程的最小化目标为：As a further improvement of the present invention, the minimization target of the optimization process of the variance minimization method based on the projected Bellman error is:

公式1 Formula 1

其中，表示误差，误差，表示误差的期望，为奖励，、分别用于表示期望符号和特征向量，定义来估算，所述公式1转化为：in, Indicates error, error , represents the expected error, For reward, , They are used to represent the expected symbol and eigenvector respectively, and are defined To estimate , the formula 1 is transformed into:

公式2 Formula 2

利用随机梯度下降方法分别对进行更新，更新公式如下：Using the stochastic gradient descent method, Update, the update formula is as follows:

公式3 Formula 3

公式4 Formula 4

公式5 Formula 5

其中，，表示特征权重参数向量，为t+1时刻的最优动作集合，、表示t+1时刻下的状态和可调参数，为动作，、、、分别用于表示t时刻下的误差、状态、可调参数和贝尔曼误差期望估值，误差，表示贝尔曼误差期望的估值，、和分别是、和的学习率。in, , represents the feature weight parameter vector, is the optimal action set at time t+1, , represents the state and adjustable parameters at time t+1, For action, , , , They are used to represent the error, state, adjustable parameters and Bellman error expected estimate at time t, respectively. , represents the Bellman error expectation The valuation of , and They are , and The learning rate.

本发明的有益效果是：本发明通过将工程问题建立成强化学习环境模型，并获取和测量机械臂在运动过程中的关节角度、角速度、末端执行器位置、末端执行器速度和障碍物位置等位置姿态数据。数据经过神经网络转化，形成了机械臂状态的特征，再利用基于投影贝尔曼误差的方差最小化算法进行训练，提升机器臂的控制策略。通过反复迭代训练，获得机械臂的最优控制策略，提升机械臂在特定任务和应用场景中的表现。同时通过减小梯度估计方差，能够加快收敛到最优策略的速度，提升机械臂获取最优策略的速度，且有较好的可扩展性和适应性。该方法使机械臂能更迅速、更有效地学习到优化的控制策略，从而提高整个系统的性能表现。通过优化控制策略，提高机械臂训练的准确性和效率，为机械臂在自主决策、快速应变等方面提供了更为灵活、高效的解决方案，改善自动化控制系统性能。The beneficial effects of the present invention are as follows: the present invention establishes the engineering problem into a reinforcement learning environment model, and obtains and measures the position and posture data of the joint angle, angular velocity, end effector position, end effector speed and obstacle position of the robot arm during the movement. The data is transformed by the neural network to form the characteristics of the robot arm state, and then the variance minimization algorithm based on the projected Bellman error is used for training to improve the control strategy of the robot arm. Through repeated iterative training, the optimal control strategy of the robot arm is obtained, and the performance of the robot arm in specific tasks and application scenarios is improved. At the same time, by reducing the variance of the gradient estimation, the speed of convergence to the optimal strategy can be accelerated, and the speed of the robot arm obtaining the optimal strategy can be improved, and it has good scalability and adaptability. The method enables the robot arm to learn the optimized control strategy more quickly and effectively, thereby improving the performance of the entire system. By optimizing the control strategy, the accuracy and efficiency of the robot arm training are improved, and a more flexible and efficient solution is provided for the robot arm in terms of autonomous decision-making and rapid response, thereby improving the performance of the automatic control system.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1为本发明的基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法的流程示意图。FIG1 is a flow chart of a Bellman error-based variance minimization reinforcement learning robotic arm training acceleration method according to the present invention.

图2为本发明的基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法的的简易环境示意图。FIG2 is a schematic diagram of a simplified environment of the Bellman error-based variance minimization reinforcement learning robotic arm training acceleration method of the present invention.

图3为基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法与传统经典训练方法的对比图。Figure 3 is a comparison between the Bellman error-based variance minimization reinforcement learning robotic arm training acceleration method and the traditional classical training method.

具体实施方式Detailed ways

为了使本发明的目的、技术方案和优点更加清楚，下面结合附图和具体实施例对本发明进行详细描述。In order to make the objectives, technical solutions and advantages of the present invention more clear, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

在此，还需要说明的是，为了避免因不必要的细节而模糊了本发明，在附图中仅仅示出了与本发明的方案密切相关的结构和/或处理步骤，而省略了与本发明关系不大的其他细节。It should also be noted that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the scheme of the present invention are shown in the drawings, while other details that are not closely related to the present invention are omitted.

请参阅图1至图3所示，本发明提供了一种基于贝尔曼误差的方差最小化强化学习机械臂训练加速方法，用于机械臂控制训练，具体包括如下步骤：Referring to FIG. 1 to FIG. 3 , the present invention provides a Bellman error-based variance minimization reinforcement learning robot arm training acceleration method for robot arm control training, which specifically includes the following steps:

步骤S1、根据实际作业要求搭建强化学习模型环境，考虑当前作业需求如下：驱动机械臂的自由端达到目标高度。该机械臂有两个臂肘和一个驱动头，属于欠驱动系统，机械臂的动作控制相对不稳定。强化学习环境搭建如图2所示，机械臂是需要训练的智能体，在驱动头处可以选择施加顺时针方向的扭矩、不施加扭矩、施加逆时针方向的扭矩三个动作。状态信息包括第一机械臂1相对于竖直方向的角度和第一机械臂1相对于第二机械臂2的角度、第一机械臂1的角速度和第二机械臂2的角速度、末端执行器位置、末端执行器速度和障碍物位置；实例化训练好的神经网络，将所有选择输入到神经网络，可以得到对应的特征，也可以采用tile coding编码器进行编码，从而得到特征。其中，神经网络可尝试多种模型，如全连接神经网络和卷积神经网络等，下面给出基本模型参考：Step S1. Build a reinforcement learning model environment according to the actual operation requirements, considering the current operation requirements as follows: drive the free end of the robotic arm to reach the target height. The robotic arm has two elbows and a drive head. It is an under-actuated system, and the motion control of the robotic arm is relatively unstable. The reinforcement learning environment is shown in Figure 2. The robotic arm is an intelligent agent that needs to be trained. At the drive head, you can choose to apply clockwise torque, no torque, and counterclockwise torque. The state information includes the angle of the first robotic arm 1 relative to the vertical direction. and the angle of the first robot arm 1 relative to the second robot arm 2 , the angular velocity of the first robot arm 1 and the angular velocity of the second robot arm 2 , end effector position , end effector speed and obstacle location ; Instantiate the trained neural network, input all the choices into the neural network, and you can get the corresponding features. You can also use tile coding encoder to encode and get the features. Among them, the neural network can try a variety of models, such as fully connected neural network and convolutional neural network, etc. The basic model reference is given below:

输入层:包含13个神经元，每个神经元对应于一个状态信息，包括机械臂1相对于竖直方向的角度、机械臂1相对于机械臂2的角度、两个机械臂的旋转方向、角速度和、末端执行器位置、末端执行器速度和障碍物位置。Input layer: contains 13 neurons, each neuron corresponds to a state information, including the angle of robot arm 1 relative to the vertical direction , the angle of robot arm 1 relative to robot arm 2 , the rotation direction and angular velocity of the two robotic arms and , end effector position , end effector speed and obstacle location .

隐藏层:共三层，维度为(128,256,128)，各层使用ReLU激活函数。Hidden layer: There are three layers in total, with a dimension of (128, 256, 128), and each layer uses the ReLU activation function.

输出层:输出层包含64个神经元，给出特征。Output layer: The output layer contains 64 neurons, giving the features .

步骤S2、机械臂使用位置传感器和旋转编码器，获取并测量第一机械臂1（即机械臂上端）相对于竖直方向的角度和第一机械臂1相对于第二机械臂2的角度（即机械臂旋转方向上的角度）、第一机械臂1的角速度（等同于机械臂上端的角速度）、第二机械臂2的角速度（等同于机械臂关节连接处的角速度）、末端执行器位置、末端执行器速度和障碍物位置等状态信息。机械臂包括第一机械臂1、第二机械臂2和连接两者的机械臂关节3，第一机械臂1、第二机械臂2和连接两者的机械臂关节3，如图2所示。Step S2: The robot arm uses a position sensor and a rotary encoder to obtain and measure the angle of the first robot arm 1 (i.e., the upper end of the robot arm) relative to the vertical direction. and the angle of the first robot arm 1 relative to the second robot arm 2 (i.e. the angle in the direction of rotation of the robot arm), the angular velocity of the first robot arm 1 (equivalent to the angular velocity of the upper end of the robot arm), the angular velocity of the second robot arm 2 (equivalent to the angular velocity of the robot arm joint connection) , end effector position , end effector speed and obstacle location Status information The robot arm includes a first robot arm 1, a second robot arm 2 and a robot arm joint 3 connecting the two, as shown in FIG2 .

步骤S3、获取机械臂当前所有可选择的动作，将状态信息及机械臂所有可选的动作依次输入到已训练好的神经网络模型中，以获得所有可选的特征向量，其中为机械臂当前的状态信息，为机械臂可选动作集合，为在状态s下采取动作的特征向量。利用线性方法计算所有可选择的动作状态函数，即，其中，为特征权重参数向量，为向量转置符号。然后利用方法选择合适的动作，即机械臂有概率随机选取可选动作，有的概率选取使值最大的动作，并保存相应的特征向量，在我们的实验中，设置大小为0.01。Step S3: Obtain all currently available actions of the robot arm, and input the state information and all available actions of the robot arm into the trained neural network model in sequence to obtain all available feature vectors. ,in is the current status information of the robot arm, is a set of optional actions for the robot arm. To take action in state s The characteristic vector of . Use linear methods to calculate all optional action state functions ,Right now ,in, is the feature weight parameter vector, Transpose the sign of the vector. Then use Method to choose the appropriate action , that is, the robot arm has The probability of randomly selecting optional actions is The probability of choosing The action with the largest value and save the corresponding feature vector In our experiments, we set The size is 0.01.

步骤S4、机械臂执行动作后，获取即时奖励，并进入下一个状态。同时，位置传感器和旋转编码器获取机械臂当前所有可选择的动作，将所有状态信息及可选动作输入训练好的神经网络模型中，获得在状态下所有可选的特征向量，代表在状态下的可选动作集合，所有动作应满足可能性求和为1。接下来采用线性方法计算所有可选动作在状态下的，即，此公式中，表示特征权重参数向量，为向量转置操作。最后利用方法选择动作，获得其相应的特征向量。机械臂获得即时奖励的具体奖励形式如下：Step S4: The robot performs an action Get instant rewards , and enter the next state At the same time, the position sensor and rotary encoder obtain all the current optional actions of the robot arm, input all the state information and optional actions into the trained neural network model, and obtain the state All optional eigenvectors , Represents in state The set of optional actions under the state, all actions should satisfy the possibility sum to 1. Next, a linear method is used to calculate all optional actions in the state Next ,Right now , in this formula, represents the feature weight parameter vector, is the vector transpose operation. Finally, use Method selection action , and obtain its corresponding eigenvector The specific form of the instant reward for the robotic arm is as follows:

目标达成奖励：如果末端执行器距离目标位置越近，则奖励越高，具体如下：Goal Achievement Rewards : The closer the end effector is to the target position, the higher the reward, as follows:

最终奖励r为上述奖励至之和。The final reward r is the above reward to Sum.

S5、机械臂利用基于投影贝尔曼误差的方差最小化算法对参数进行更新，该方法的优化过程的最小化目标为：S5. The robot arm updates the parameters using the variance minimization algorithm based on the projected Bellman error. The minimization objective of the optimization process of this method is:

公式1 Formula 1

其中表示误差，误差，表示特征权重参数向量，初始化为维向量，也对应特征向量的维度。表示误差的期望，即贝尔曼误差，为奖励，、分别用于表示期望符号和特征向量。因为期望项不易计算，因此定义来估计近似，初始化为0。则公式1可以表示为：in Indicates error, error , Represents the feature weight parameter vector, initialized as dimension vector, It also corresponds to the dimension of the feature vector. represents the expectation of the error, i.e., the Bellman error, For reward, , are used to represent the expected symbol and eigenvector respectively. It is not easy to calculate, so we define To estimate the approximation , initialized to 0. Then formula 1 can be expressed as:

公式2 Formula 2

公式3 Formula 3

公式4 Formula 4

公式5 Formula 5

其中，，表示特征权重参数向量，误差，表示贝尔曼误差期望的估值，、和分别是、和的学习率，可通过试验选择效果最佳的学习率。in, , represents the feature weight parameter vector, error , represents the Bellman error expectation The valuation of , and They are , and The learning rate with the best effect can be selected through experiments.

S6、机械臂将判断动作是否到达目标或者训练是否达到最大迭代次数，若是则结束该轮训练；若不是，则继续重复步骤S2-S5。S6. The robot arm will determine whether the action reaches the target or whether the training reaches the maximum number of iterations. If so, the training will end; if not, steps S2-S5 will continue to be repeated.

将发明与传统训练方法进行对比，结果统计如图3所示。在图3中，本发明所提出的投影贝尔曼误差的方差最小化强化学习训练加速方法(图3中用英文improved GQ(0)表示)收敛速度显著大于传统的时序差分控制算法Q-learning以及经典的GQ(0)算法，有效提高了训练效率，缩短训练时间。The invention is compared with the traditional training method, and the statistical results are shown in Figure 3. In Figure 3, the variance minimization reinforcement learning training acceleration method for projected Bellman error proposed by the invention (expressed as improved GQ(0) in Figure 3) has a significantly faster convergence speed than the traditional temporal difference control algorithm Q-learning and the classic GQ(0) algorithm, which effectively improves the training efficiency and shortens the training time.

综上所述，本发明通过将工程问题建立成强化学习环境模型，使用位置传感器和旋转编码器获取并测量了机械臂在运动过程中的关节角度、角速度、末端执行器位置、末端执行器速度和障碍物位置等位置姿态数据。数据经过神经网络转化，形成了机械臂状态的特征，再利用基于投影贝尔曼误差的方差最小化算法进行训练，提升机器臂的控制策略。通过反复迭代训练，获得机械臂的最优控制策略，提升机械臂在特定任务和应用场景中的表现。同时通过减小梯度估计方差，能够加快收敛到最优策略的速度，提升机械臂获取最优策略的速度，且有较好的可扩展性和适应性。该方法使机械臂能更迅速、更有效地学习到优化的控制策略，从而提高整个系统的性能表现。优化控制策略，提高机械臂训练的准确性和效率，为机械臂在自主决策、快速应变等方面提供了更为灵活、高效的解决方案，改善自动化控制系统性能。In summary, the present invention establishes the engineering problem into a reinforcement learning environment model, and uses position sensors and rotary encoders to obtain and measure the position and posture data of the robot arm during the movement, such as the joint angle, angular velocity, end effector position, end effector speed and obstacle position. The data is transformed by a neural network to form the characteristics of the robot arm state, and then the variance minimization algorithm based on the projected Bellman error is used for training to improve the control strategy of the robot arm. Through repeated iterative training, the optimal control strategy of the robot arm is obtained, and the performance of the robot arm in specific tasks and application scenarios is improved. At the same time, by reducing the variance of the gradient estimation, the speed of convergence to the optimal strategy can be accelerated, and the speed of the robot arm obtaining the optimal strategy can be improved, and it has good scalability and adaptability. This method enables the robot arm to learn the optimized control strategy more quickly and effectively, thereby improving the performance of the entire system. Optimizing the control strategy improves the accuracy and efficiency of the robot arm training, provides a more flexible and efficient solution for the robot arm in terms of autonomous decision-making and rapid response, and improves the performance of the automation control system.

以上实施例仅用以说明本发明的技术方案而非限制，尽管参照较佳实施例对本发明进行了详细说明，本领域的普通技术人员应当理解，可以对本发明的技术方案进行修改或者等同替换，而不脱离本发明技术方案的精神和范围。The above embodiments are only used to illustrate the technical solutions of the present invention rather than to limit the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solutions of the present invention may be modified or replaced by equivalents without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A Bellman error-based variance minimization reinforcement learning robot arm training acceleration method, characterized in that it includes the following steps:

Step S1: Establish a reinforcement learning environment model according to the operation requirements of the robot arm and instantiate the trained neural network model;

Step S2: Use the position sensor and rotary encoder to obtain and measure the state information of the robot arm , the status information At least the joint angles of the robot arm , joint angular velocity , end effector position , end effector speed and obstacle location ;

Step S3: The status information of the robot arm and optional actions Input into the neural network model to get the corresponding feature vector , using a linear method and with the help of the ε-greedy strategy, select actions , and save the action The corresponding eigenvector ;

Step S4: Agent performs actions ,Earn rewards , enter the next state , Used to indicate the next state, using step S3, to obtain the state Next action and the eigenvector ;

Step S5, the robot arm updates the parameters of the robot arm control strategy using a method based on the variance minimization method of the projected Bellman error;

Step S6, repeating steps S2 to S5 until the robot reaches the target position or the iteration reaches the maximum number of times;

The step S5 is specifically as follows: the minimization target of the optimization process based on the variance minimization method of the projected Bellman error is:

Formula 1

in, Indicates error, error , represents the expected error, For reward, , They are used to represent the expected symbol and eigenvector respectively, and are defined To estimate the Bellman error expectation , the formula 1 is transformed into:

Formula 2

Using the stochastic gradient descent method, Update, the update formula is as follows:

Formula 3

Formula 4

Formula 5

in, , represents the feature weight parameter vector, is the optimal action set at time t+1, , represents the state and adjustable parameters at time t+1, For action, , , , They are used to represent the error, state, adjustable parameters and Bellman error expected estimate at time t, respectively. , represents the Bellman error expectation The valuation of , and They are , and The learning rate.

2. The Bellman error-based variance minimization reinforcement learning robot training acceleration method according to claim 1 is characterized in that: the step S1 specifically comprises: establishing a reinforcement learning environment model according to the task requirements of the robot; fully training the neural network model with a data set with marked state features; and storing all the state information of the robot arm. and a collection of optional actions Input into a fully trained neural network model to obtain the corresponding feature vector .

3. The Bellman error-based variance minimization reinforcement learning robot arm training acceleration method according to claim 1 is characterized in that: the step S2 is specifically: the rotary encoder obtains the state information s of the robot arm, and the state information s of the robot arm at least includes the angle of the robot arm relative to the vertical direction , Angle of the robot arm in the direction of rotation , angular velocity of the upper end of the robotic arm , angular velocity of the joint of the robot arm , end effector position , end effector speed and obstacle location .

4. The Bellman error-based variance minimization reinforcement learning robot arm training acceleration method according to claim 1 is characterized in that: the step S3 specifically comprises: obtaining all current optional actions of the robot arm , the status information and all optional actions Input into the trained neural network model in sequence to obtain all optional feature vectors ,in, is a set of optional actions. To take action in state s The feature vector of ; use linear methods to calculate all optional action state functions ,Right now ,in, is the feature weight parameter vector, Transpose the sign of a vector; use Method selection action , and save the corresponding feature vector .

5. The Bellman error-based variance minimization reinforcement learning robot arm training acceleration method according to claim 1 is characterized in that: the step S4 is specifically: after the robot arm performs action a, it obtains an immediate reward and enters the next state , and get all the current optional actions of the robot arm , all status information and optional actions Input the trained neural network model and obtain the state All optional eigenvectors ,in, Used to indicate the status Under all the optional action sets, a linear method is used to calculate all the optional actions in the state Next ,Right now ,in, is the feature weight parameter vector, Transpose the sign of a vector; use Method selection action , and obtain its corresponding eigenvector .

6. The Bellman error-based variance minimization reinforcement learning robot arm training acceleration method according to claim 1, characterized in that: in step S4, the specific reward form of the robot arm obtaining the instant reward is as follows:

Goal Achievement Rewards : The closer the end effector is to the target position, the higher the reward, as follows:

Obstacle Avoidance Reward : The farther the end of the robotic arm is from the obstacle, the positive reward is given, as follows:

Smoothness Bonus : Give positive rewards for smooth changes in joint angles and end-effector positions as follows:

Energy consumption bonus : Give positive rewards for actions with low energy consumption, as follows:

Collision Penalty : If the robot collides with an obstacle, a large negative reward is given, as follows:

Action Penalty : Negative rewards are given for actions where the joint angle and end effector position change too much or the speed is too fast, as follows:

The reward r is the above reward to Sum.