CN118114746B - Variance minimization reinforcement learning mechanical arm training acceleration method based on Bellman error - Google Patents
Variance minimization reinforcement learning mechanical arm training acceleration method based on Bellman error
Info
- Publication number
- CN118114746B (application CN202410508730.6A)
- Authority
- CN
- China
- Prior art keywords
- mechanical arm
- actions
- reinforcement learning
- error
- state
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1602—Programme controls characterised by the control system, structure, architecture
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1602—Programme controls characterised by the control system, structure, architecture
- B25J9/1605—Simulation of manipulator lay-out, design, modelling of manipulator
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/061—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
Abstract
The invention provides a variance-minimization reinforcement learning training acceleration method for a mechanical arm based on the Bellman error, used for mechanical arm control, comprising the following steps: the engineering problem is modeled as a reinforcement learning environment, and pose data of the mechanical arm during motion, including joint angles, angular velocities, end-effector position, end-effector velocity, and obstacle position, are acquired and measured by a position sensor and a rotary encoder. These data are transformed by a neural network into state features of the mechanical arm. Training is then performed with a variance-minimization algorithm based on the projected Bellman error to improve the control strategy of the mechanical arm. Through repeated iterative training, an optimal control strategy is finally obtained, improving the performance of the mechanical arm in specific tasks and application scenarios. By reducing the variance of the gradient estimate, the method accelerates convergence to the optimal strategy, improves the accuracy and efficiency of mechanical arm training, and improves the performance of the automatic control system.
Description
Technical Field
The invention relates to the field of mechanical arm control and reinforcement learning, and in particular to a variance-minimization reinforcement learning training acceleration method for a mechanical arm based on the Bellman error.
Background
With the continuous evolution of technology, mechanical arms have become indispensable in many fields. In complex industrial environments, the control and training of mechanical arms face challenges such as complex work tasks, uncertain environmental conditions, and the need to adapt quickly to new tasks. To address these challenges, researchers have focused on optimization methods for mechanical arm control, especially reinforcement-learning-based training acceleration algorithms. However, many reinforcement learning methods suffer from a large variance of the gradient estimate during training, which reduces training efficiency and lengthens training time.
In view of the foregoing, it is necessary to design a variance-minimization reinforcement learning training acceleration method for a mechanical arm based on the Bellman error to solve the above problems.
Disclosure of Invention
The invention aims to provide an efficient variance-minimization reinforcement learning training acceleration method for a mechanical arm based on the Bellman error.
To achieve the above object, the present invention provides a training acceleration method for a reinforcement learning mechanical arm based on variance minimization of the Bellman error, comprising the following steps:
Step S1: establish a reinforcement learning environment model for the operating requirements of the mechanical arm, and instantiate a trained neural network model;
Step S2: acquire and measure the state information s of the mechanical arm with a position sensor and a rotary encoder, where the state information s includes at least the joint angles of the mechanical arm, the joint angular velocities, the end-effector position, the end-effector velocity, and the obstacle position;
Step S3: input the state information s of the mechanical arm and each selectable action into the neural network model to obtain the corresponding feature vectors, compute action values with a linear method, select an action a with an ε-greedy strategy, and save the feature vector corresponding to the selected action a;
Step S4: the agent executes the action a, obtains the reward r, and enters the next state; the action and feature vector for the next state are then obtained by the procedure of step S3;
Step S5: the mechanical arm updates the parameters of its control strategy using the variance-minimization method based on the projected Bellman error;
Step S6: repeat steps S2 to S5 until the mechanical arm reaches the target position or the maximum number of iterations is reached.
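A minimal sketch of the training loop implied by steps S1–S6 is given below. The environment interface, sensor reads, feature network, and the update routine are hypothetical placeholders introduced for illustration, not the patent's implementation.

```python
import numpy as np

def train(env, feature_net, update_params, num_episodes=500, max_steps=200, epsilon=0.01):
    """Sketch of the S1-S6 loop: act epsilon-greedily on linear action values computed
    from neural-network features, then update the parameters in step S5."""
    dim = feature_net.output_dim                      # feature dimension (placeholder attribute)
    theta = np.zeros(dim)                             # feature weight parameter vector
    aux = {"u": np.zeros(dim), "omega": 0.0}          # auxiliary quantities used by the update

    for _ in range(num_episodes):                     # S6: repeat until target / iteration limit
        s = env.reset()                               # S1/S2: environment and initial state
        for _ in range(max_steps):
            actions = env.actions(s)                                  # S3: selectable actions
            feats = [feature_net(s, a) for a in actions]              # feature vector per action
            q = [theta @ f for f in feats]                            # linear action values
            i = np.random.randint(len(feats)) if np.random.rand() < epsilon else int(np.argmax(q))
            a, phi = actions[i], feats[i]
            s_next, r, done = env.step(a)                             # S4: execute, observe reward
            feats_next = [feature_net(s_next, b) for b in env.actions(s_next)]
            # greedy here for brevity; the text selects the next action epsilon-greedily as well
            phi_next = feats_next[int(np.argmax([theta @ f for f in feats_next]))]
            theta = update_params(theta, aux, phi, phi_next, r)       # S5: variance-minimization update
            s = s_next
            if done:
                break
    return theta
```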
As a further improvement of the present invention, the step S1 specifically includes: establishing a reinforcement learning environment model according to the task requirements of the mechanical arm; fully training the neural network model with a data set of labeled state features; and sequentially inputting all state information of the mechanical arm and the set of selectable actions into the fully trained neural network model to obtain the corresponding feature vectors.
As a further improvement of the present invention, the step S2 specifically includes: the rotary encoder obtains the state information s of the mechanical arm, which includes at least the angle of the mechanical arm relative to the vertical direction, the angle in the rotation direction of the mechanical arm, the angular velocity of the upper end of the mechanical arm, the angular velocity at the joint of the mechanical arm, the end-effector position, the end-effector velocity, and the obstacle position.
As a further improvement of the present invention, the step S3 specifically includes: acquiring all currently selectable actions of the mechanical arm; sequentially inputting the state information s and all selectable actions into the trained neural network model to obtain all selectable feature vectors, where each feature vector corresponds to taking one action from the selectable action set in state s; computing the action value of every selectable action with a linear method, namely as the inner product of the model's feature weight parameter vector and the feature vector; and selecting an action a with the ε-greedy method and storing its corresponding feature vector.
As a further improvement of the present invention, the step S4 specifically includes: after the mechanical arm executes the action a, it obtains the immediate reward and enters the next state, and all currently selectable actions of the mechanical arm are acquired; the state information of the next state and each selectable action are input into the trained neural network model to obtain all selectable feature vectors in the next state; the action value of every selectable action in the next state is computed with the linear method as the inner product of the feature weight parameter vector and the feature vector; and an action is selected with the ε-greedy method and its corresponding feature vector is obtained.
As a further improvement of the present invention, in the step S4, the immediate reward obtained by the mechanical arm is composed of the following terms:
Target-achievement reward: the closer the end effector is to the target position, the higher the reward.
Obstacle-avoidance reward: a positive reward is given when the end of the mechanical arm stays away from the obstacle.
Smoothness reward: a positive reward is given for smooth changes in joint angle and end-effector position.
Energy-consumption reward: a positive reward is given to actions with lower energy consumption.
Collision penalty: a large negative reward is given if the mechanical arm collides with an obstacle.
Motion penalty: a negative reward is given to actions in which the joint angle or end-effector position changes too much or too fast.
The reward r is the sum of the above terms.
As a further improvement of the present invention, the minimization objective of the optimization process of the variance-minimization method based on the projected Bellman error is given by Equation 1, whose terms are the Bellman error, the expectation of that error, the reward, the expectation operator, and the feature vector. An auxiliary quantity is defined to estimate the expectation of the Bellman error, whereby Equation 1 is transformed into Equation 2.
The feature weight parameter vector, the auxiliary adjustable parameter vector, and the estimate of the Bellman-error expectation are then updated by stochastic gradient descent according to Equations 3, 4, and 5, respectively. The updates use the error, the state, the adjustable parameter, and the estimated Bellman-error expectation at time t, together with the state and adjustable parameter at time t+1, the optimal action set at time t+1, and the executed action, and each of the three updated quantities has its own learning rate.
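Read together with the definitions above, Equation 1 minimizes the variance of the Bellman error and Equation 2 is the same objective with the auxiliary estimate substituted for the intractable expectation. The LaTeX below is a reconstruction under that reading; the symbols δ, ω, θ, γ, and φ are chosen here for illustration and are not taken from the patent's figures.

```latex
% Assumed notation: delta_t is the Bellman (TD) error under linear action values theta^T phi,
% omega is a learned scalar estimate of E[delta], gamma is the discount factor.
\begin{align}
\delta_t &= r_{t+1} + \gamma\,\theta_t^{\top}\phi(s_{t+1},a_{t+1}) - \theta_t^{\top}\phi(s_t,a_t) \\
\min_{\theta}\; J(\theta) &= \mathbb{E}\big[(\delta - \mathbb{E}[\delta])^{2}\big]
  \qquad\text{(Equation 1)} \\
\min_{\theta,\omega}\; J(\theta,\omega) &= \mathbb{E}\big[(\delta - \omega)^{2}\big],
  \quad \omega \approx \mathbb{E}[\delta]
  \qquad\text{(Equation 2)}
\end{align}
```

If the projection onto the feature space is made explicit, the auxiliary parameter vector updated in Equation 4 would play the usual gradient-TD role of estimating that projection; this is an inference from the three-parameter update structure described above, not a statement of the patent's exact derivation.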
The beneficial effects of the invention are as follows: the engineering problem is modeled as a reinforcement learning environment, and pose data of the mechanical arm during motion, including joint angles, angular velocities, end-effector position, end-effector velocity, and obstacle position, are acquired and measured. These data are transformed by a neural network into state features of the mechanical arm, and training is then performed with the variance-minimization algorithm based on the projected Bellman error to improve the control strategy of the mechanical arm. Through repeated iterative training, an optimal control strategy is obtained, improving the performance of the mechanical arm in specific tasks and application scenarios. By reducing the variance of the gradient estimate, the method accelerates convergence to the optimal strategy and shortens the time needed for the mechanical arm to acquire it, while offering good scalability and adaptability. The mechanical arm thus learns an optimized control strategy more quickly and effectively, improving the performance of the whole system. Optimizing the control strategy improves the accuracy and efficiency of mechanical arm training, provides a more flexible and efficient solution for autonomous decision-making and rapid response, and improves the performance of the automatic control system.
Drawings
Fig. 1 is a flow chart of the Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of the present invention.
Fig. 2 is a simplified schematic diagram of the environment of the Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of the present invention.
Fig. 3 is a graph comparing the Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method with conventional classical training methods.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It should be noted that, to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the solution of the present invention are shown in the drawings, and other details not closely related to the present invention are omitted.
Referring to Figs. 1 to 3, the present invention provides a variance-minimization reinforcement learning mechanical arm training acceleration method based on the Bellman error, used for mechanical arm control training, which specifically comprises the following steps:
Step S1: a reinforcement learning environment model is built according to the actual operating requirement, which here is to drive the free end of the mechanical arm to reach a target height. The mechanical arm has two arm segments and a driving head; it is an underactuated system and is relatively unstable to control. The reinforcement learning environment is built as shown in Fig. 2: the mechanical arm is the agent to be trained, and at the driving head it can choose among three actions, applying clockwise torque, applying no torque, and applying counterclockwise torque. The state information includes the angle of the first mechanical arm 1 relative to the vertical direction, the angle of the first mechanical arm 1 relative to the second mechanical arm 2, the angular velocity of the first mechanical arm 1, the angular velocity of the second mechanical arm 2, the end-effector position, the end-effector velocity, and the obstacle position. The trained neural network is instantiated, and every state-action choice is input into the network to obtain the corresponding features; alternatively, a tile-coding encoder can be used to obtain the features. Various network models may be tried, such as a fully connected neural network or a convolutional neural network; a basic reference model is given below:
Input layer: 13 neurons, each corresponding to one item of state information, namely the angle of the first mechanical arm 1 relative to the vertical direction, the angle of the first mechanical arm 1 relative to the second mechanical arm 2, the angular velocities of the two mechanical arms, the end-effector position, the end-effector velocity, and the obstacle position.
Hidden layers: three layers with dimensions 128, 256, and 128, each using a ReLU activation function.
Output layer: 64 neurons, which output the feature vector.
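A minimal sketch of this reference network follows, assuming PyTorch; the class name and the way the 13-dimensional state vector is assembled are illustrative choices, not specified by the patent.

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """13-dim state input -> hidden layers 128/256/128 with ReLU -> 64-dim feature vector,
    matching the reference architecture described above."""
    def __init__(self, state_dim: int = 13, feature_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Example: a 13-dim state (assumed split: 2 angles, 2 angular velocities, 3-dim end-effector
# position, 3-dim end-effector velocity, 3-dim obstacle position) mapped to 64 features.
state = torch.zeros(13)
features = FeatureNet()(state)
print(features.shape)  # torch.Size([64])
```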
Step S2: the mechanical arm uses a position sensor and a rotary encoder to acquire and measure state information such as the angle of the first mechanical arm 1 (i.e., the upper end of the mechanical arm) relative to the vertical direction, the angle of the first mechanical arm 1 relative to the second mechanical arm 2 (i.e., the angle in the rotation direction of the arm), the angular velocity of the first mechanical arm 1 (i.e., the angular velocity of the upper end of the arm), the angular velocity of the second mechanical arm 2 (i.e., the angular velocity at the joint of the arm), the end-effector position, the end-effector velocity, and the obstacle position. As shown in Fig. 2, the mechanical arm comprises a first mechanical arm 1, a second mechanical arm 2, and a mechanical arm joint 3 connecting the two.
Step S3: all currently selectable actions of the mechanical arm are acquired, and the state information of the mechanical arm together with each selectable action is input in turn into the trained neural network model to obtain all selectable feature vectors, where the inputs are the current state information s of the mechanical arm and the set of selectable actions, and each output is the feature vector of taking the corresponding action in state s. The action value of every selectable action is computed with a linear method, namely as the inner product of the model's feature weight parameter vector and the feature vector. A suitable action a is then selected with the ε-greedy method: with probability ε the mechanical arm selects a random selectable action, and otherwise it selects the action that maximizes the action value; the corresponding feature vector is stored. In our experiments, ε was set to 0.01.
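A sketch of this selection step under the linear value model, assuming NumPy; theta and the feature vectors are those produced by the feature network above, and ε = 0.01 as stated in the text.

```python
import numpy as np

def select_action(theta, feature_vectors, epsilon=0.01, rng=np.random.default_rng()):
    """Epsilon-greedy selection over linear action values q(s, a) = theta^T phi(s, a).

    theta           : feature weight parameter vector, shape (d,)
    feature_vectors : list of phi(s, a) for every selectable action, each of shape (d,)
    Returns the chosen action index and its feature vector.
    """
    q_values = np.array([theta @ phi for phi in feature_vectors])  # linear action values
    if rng.random() < epsilon:                 # explore: random selectable action
        idx = int(rng.integers(len(feature_vectors)))
    else:                                      # exploit: maximize theta^T phi
        idx = int(np.argmax(q_values))
    return idx, feature_vectors[idx]
```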
Step S4: after the mechanical arm executes the action a, it obtains the immediate reward and enters the next state. At the same time, the position sensor and the rotary encoder acquire all currently selectable actions of the mechanical arm, and the state information of the next state together with each selectable action is input into the trained neural network model to obtain all selectable feature vectors in the next state, where the selection probabilities over the selectable action set in the next state sum to 1. Next, the action value of every selectable action in the next state is computed with the linear method, i.e., as the inner product of the feature weight parameter vector and the feature vector. Finally, an action is selected with the ε-greedy method and its corresponding feature vector is obtained. The immediate reward obtained by the mechanical arm is composed of the following terms:
Target-achievement reward: the closer the end effector is to the target position, the higher the reward.
Obstacle-avoidance reward: a positive reward is given when the end of the mechanical arm stays away from the obstacle.
Smoothness reward: a positive reward is given for smooth changes in joint angle and end-effector position.
Energy-consumption reward: a positive reward is given to actions with lower energy consumption.
Collision penalty: a large negative reward is given if the mechanical arm collides with an obstacle.
Motion penalty: a negative reward is given to actions in which the joint angle or end-effector position changes too much or too fast.
The final reward r is the sum of the above terms.
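A sketch of the combined reward is given below. The functional forms are illustrative assumptions (distance-based shaping, quadratic energy and motion costs, a fixed collision penalty) rather than the patent's own definitions of these terms.

```python
import numpy as np

def immediate_reward(ee_pos, ee_prev_pos, target_pos, obstacle_pos, joint_delta,
                     torque, collided, collision_penalty=-100.0):
    """Sum of the six reward terms described above, with illustrative (assumed) forms."""
    r_target   = -np.linalg.norm(ee_pos - target_pos)              # closer to target -> higher
    r_obstacle = min(np.linalg.norm(ee_pos - obstacle_pos), 1.0)   # farther from obstacle -> positive
    r_smooth   = -np.linalg.norm(ee_pos - ee_prev_pos)             # smoother end-effector motion -> higher
    r_energy   = -0.01 * float(torque) ** 2                        # smaller energy use -> higher
    r_collide  = collision_penalty if collided else 0.0            # large negative reward on collision
    r_motion   = -0.1 * float(np.sum(np.square(joint_delta)))      # penalize large/fast joint changes
    return r_target + r_obstacle + r_smooth + r_energy + r_collide + r_motion
```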
Step S5: the mechanical arm updates its parameters with the variance-minimization algorithm based on the projected Bellman error. The minimization objective of the optimization process of this method is given by Equation 1, whose terms are the Bellman error, the expectation of that error, the reward, the expectation operator, the feature vector, and the feature weight parameter vector; the feature weight parameter vector is initialized as a vector whose dimension equals that of the feature vector. Because the expectation of the Bellman error is not easy to compute directly, a quantity initialized to 0 is defined to estimate it, and Equation 1 can then be expressed as Equation 2.
The feature weight parameter vector, the auxiliary adjustable parameter vector, and the estimate of the Bellman-error expectation are updated by stochastic gradient descent according to Equations 3, 4, and 5, respectively, each with its own learning rate; the learning rates giving the best results can be selected through experiments.
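The exact forms of Equations 3–5 are given in the patent's figures. The sketch below is one plausible instantiation, assuming a GQ(0)-style gradient-TD update in which the TD error is re-centered by the Bellman-error-expectation estimate and three learning rates (for the weight vector, the auxiliary vector, and the scalar estimate) are used, consistent with the "improved GQ(0)" labeling in Fig. 3; the discount factor and learning-rate names are also assumptions. Treat it as a sketch, not the patent's exact update.

```python
import numpy as np

def vm_gq0_update(theta, u, omega, phi, phi_next, reward,
                  gamma=0.99, alpha=0.01, zeta=0.01, beta=0.01):
    """One plausible variance-minimization update (assumed VMGQ(0)-style form).

    theta : feature weight parameter vector          (learning rate alpha)
    u     : auxiliary adjustable parameter vector    (learning rate zeta)
    omega : scalar estimate of E[Bellman error]      (learning rate beta)
    phi, phi_next : feature vectors of the chosen actions at times t and t+1
    """
    delta = reward + gamma * (theta @ phi_next) - (theta @ phi)   # Bellman (TD) error
    centered = delta - omega                                      # error re-centered by its estimate
    theta = theta + alpha * (centered * phi - gamma * (u @ phi) * phi_next)
    u = u + zeta * (centered - u @ phi) * phi                     # auxiliary vector update
    omega = omega + beta * centered                               # running estimate of E[delta]
    return theta, u, omega
```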
Step S6: the mechanical arm judges whether the target has been reached or whether training has reached the maximum number of iterations; if so, this round of training ends; if not, steps S2 to S5 are repeated.
The invention was compared with traditional training methods, and the statistics of the results are shown in Fig. 3. In Fig. 3, the convergence rate of the proposed variance-minimization reinforcement learning training acceleration method based on the projected Bellman error (labeled "improved GQ(0)" in Fig. 3) is clearly higher than that of the classical temporal-difference control algorithm Q-learning and the classical GQ(0) algorithm, which effectively improves training efficiency and shortens training time.
In summary, the present invention models the engineering problem as a reinforcement learning environment and uses a position sensor and a rotary encoder to acquire and measure pose data of the mechanical arm during motion, including joint angles, angular velocities, end-effector position, end-effector velocity, and obstacle position. These data are transformed by a neural network into state features of the mechanical arm, and training is then performed with the variance-minimization algorithm based on the projected Bellman error to improve the control strategy of the mechanical arm. Through repeated iterative training, an optimal control strategy is obtained, improving the performance of the mechanical arm in specific tasks and application scenarios. By reducing the variance of the gradient estimate, the method accelerates convergence to the optimal strategy and shortens the time needed for the mechanical arm to acquire it, while offering good scalability and adaptability. The mechanical arm thus learns an optimized control strategy more quickly and effectively, improving the performance of the whole system. Optimizing the control strategy improves the accuracy and efficiency of mechanical arm training, provides a more flexible and efficient solution for autonomous decision-making and rapid response, and improves the performance of the automatic control system.
The above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalents may be made without departing from the spirit and scope of the technical solution of the present invention.
Claims (6)
1. A variance-minimization reinforcement learning mechanical arm training acceleration method based on the Bellman error, characterized in that the method comprises the following steps:
step S1: establishing a reinforcement learning environment model for the operating requirements of the mechanical arm, and instantiating a trained neural network model;
step S2: acquiring and measuring the state information s of the mechanical arm with a position sensor and a rotary encoder, the state information s including at least the joint angles of the mechanical arm, the joint angular velocities, the end-effector position, the end-effector velocity, and the obstacle position;
step S3: inputting the state information s of the mechanical arm and each selectable action into the neural network model to obtain the corresponding feature vectors, computing action values with a linear method, selecting an action a with an ε-greedy strategy, and saving the feature vector corresponding to the selected action a;
step S4: the agent executing the action a, obtaining the reward r, entering the next state, and obtaining the action and feature vector for the next state by the procedure of step S3;
step S5: the mechanical arm updating the parameters of its control strategy using the variance-minimization method based on the projected Bellman error;
step S6: repeating steps S2 to S5 until the mechanical arm reaches the target position or the maximum number of iterations is reached;
wherein the step S5 specifically comprises: the minimization objective of the optimization process of the variance-minimization method based on the projected Bellman error is given by Equation 1, whose terms are the Bellman error, the expectation of that error, the reward, the expectation operator, and the feature vector; an auxiliary quantity is defined to estimate the expectation of the Bellman error, whereby Equation 1 is transformed into Equation 2; and the feature weight parameter vector, the auxiliary adjustable parameter vector, and the estimate of the Bellman-error expectation are updated by stochastic gradient descent according to Equations 3, 4, and 5, respectively, the updates using the error, the state, the adjustable parameter, and the estimated Bellman-error expectation at time t, the state and adjustable parameter at time t+1, the optimal action set at time t+1, and the executed action, with a separate learning rate for each of the three updated quantities.
2. The Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized in that the step S1 specifically comprises: establishing a reinforcement learning environment model according to the task requirements of the mechanical arm; fully training the neural network model with a data set of labeled state features; and inputting all state information of the mechanical arm and the set of selectable actions into the fully trained neural network model to obtain the corresponding feature vectors.
3. The Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized in that the step S2 specifically comprises: the rotary encoder obtains the state information s of the mechanical arm, the state information s including at least the angle of the mechanical arm relative to the vertical direction, the angle in the rotation direction of the mechanical arm, the angular velocity of the upper end of the mechanical arm, the angular velocity at the joint of the mechanical arm, the end-effector position, the end-effector velocity, and the obstacle position.
4. The Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized in that the step S3 specifically comprises: acquiring all currently selectable actions of the mechanical arm; sequentially inputting the state information s and all selectable actions into the trained neural network model to obtain all selectable feature vectors, wherein each feature vector corresponds to taking one action from the selectable action set in state s; computing the action value of every selectable action with a linear method, namely as the inner product of the model's feature weight parameter vector and the feature vector; and selecting an action a with the ε-greedy method and storing its corresponding feature vector.
5. The Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized in that the step S4 specifically comprises: after the mechanical arm executes the action a, it obtains the immediate reward and enters the next state, and all currently selectable actions of the mechanical arm are acquired; the state information of the next state and each selectable action are input into the trained neural network model to obtain all selectable feature vectors in the next state; the action value of every selectable action in the next state is computed with the linear method as the inner product of the feature weight parameter vector and the feature vector; and an action is selected with the ε-greedy method and its corresponding feature vector is obtained.
6. The Bellman-error-based variance-minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized in that in the step S4, the immediate reward obtained by the mechanical arm is composed of the following terms:
target-achievement reward: the closer the end effector is to the target position, the higher the reward;
obstacle-avoidance reward: a positive reward is given when the end of the mechanical arm stays away from the obstacle;
smoothness reward: a positive reward is given for smooth changes in joint angle and end-effector position;
energy-consumption reward: a positive reward is given to actions with lower energy consumption;
collision penalty: a large negative reward is given if the mechanical arm collides with an obstacle;
motion penalty: a negative reward is given to actions in which the joint angle or end-effector position changes too much or too fast;
the reward r being the sum of the above terms.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410508730.6A CN118114746B (en) | 2024-04-26 | 2024-04-26 | Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410508730.6A CN118114746B (en) | 2024-04-26 | 2024-04-26 | Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118114746A CN118114746A (en) | 2024-05-31 |
CN118114746B (en) | 2024-07-23
Family
ID=91208969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410508730.6A Active CN118114746B (en) | 2024-04-26 | 2024-04-26 | Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118114746B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109906132A (en) * | 2016-09-15 | 2019-06-18 | 谷歌有限责任公司 | Deep reinforcement learning for robotic manipulation
KR20200010982A (en) * | 2018-06-25 | 2020-01-31 | 군산대학교산학협력단 | Method and apparatus of generating control parameter based on reinforcement learning |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102007042440B3 (en) * | 2007-09-06 | 2009-01-29 | Siemens Ag | Method for computer-aided control and / or regulation of a technical system |
WO2019241680A1 (en) * | 2018-06-15 | 2019-12-19 | Google Llc | Deep reinforcement learning for robotic manipulation |
WO2020154542A1 (en) * | 2019-01-23 | 2020-07-30 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
CN111331607B (en) * | 2020-04-03 | 2021-04-23 | 山东大学 | Automatic grabbing and stacking method and system based on mechanical arm |
US20220410380A1 (en) * | 2021-06-17 | 2022-12-29 | X Development Llc | Learning robotic skills with imitation and reinforcement at scale |
CN114781789A (en) * | 2022-03-10 | 2022-07-22 | 北京控制工程研究所 | Hierarchical task planning method and system for spatial fine operation |
CN116175577A (en) * | 2023-03-06 | 2023-05-30 | 南京理工大学 | Strategy learning method based on optimizable image conversion in mechanical arm grabbing |
CN116533249A (en) * | 2023-06-05 | 2023-08-04 | 贵州大学 | Mechanical arm control method based on deep reinforcement learning |
CN116859755B (en) * | 2023-08-29 | 2023-12-08 | 南京邮电大学 | Minimized covariance reinforcement learning training acceleration method for unmanned vehicle driving control |
CN117086882A (en) * | 2023-10-07 | 2023-11-21 | 四川大学 | Reinforcement learning method based on degrees of freedom of mechanical arm attitude movement
- 2024-04-26: CN application CN202410508730.6A granted as patent CN118114746B (en), status Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109906132A (en) * | 2016-09-15 | 2019-06-18 | 谷歌有限责任公司 | Deep reinforcement learning for robotic manipulation
KR20200010982A (en) * | 2018-06-25 | 2020-01-31 | 군산대학교산학협력단 | Method and apparatus of generating control parameter based on reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN118114746A (en) | 2024-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108656117B (en) | Mechanical arm space trajectory optimization method for optimal time under multi-constraint condition | |
CN112102405B (en) | Robot stirring-grabbing combined method based on deep reinforcement learning | |
WO2020207219A1 (en) | Non-model robot control method for multi-shaft-hole assembly optimized by environmental prediction | |
CN109240091B (en) | Underwater robot control method based on reinforcement learning and tracking control method thereof | |
CN116460860B (en) | Model-based robot offline reinforcement learning control method | |
CN117103282B (en) | Double-arm robot cooperative motion control method based on MATD3 algorithm | |
CN113070878B (en) | Robot control method based on impulse neural network, robot and storage medium | |
CN113442140B (en) | Cartesian space obstacle avoidance planning method based on Bezier optimization | |
CN116533249A (en) | Mechanical arm control method based on deep reinforcement learning | |
CN115464659A (en) | Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information | |
CN118288294B (en) | Robot vision servo and man-machine cooperative control method based on image variable admittance | |
CN113977583A (en) | Robot rapid assembly method and system based on near-end strategy optimization algorithm | |
CN112965487A (en) | Mobile robot trajectory tracking control method based on strategy iteration | |
CN115416024A (en) | Moment-controlled mechanical arm autonomous trajectory planning method and system | |
Luo et al. | Balance between efficient and effective learning: Dense2sparse reward shaping for robot manipulation with environment uncertainty | |
CN118114746B (en) | Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error | |
CN117245666A (en) | Dynamic target quick grabbing planning method and system based on deep reinforcement learning | |
Zhu et al. | Allowing safe contact in robotic goal-reaching: Planning and tracking in operational and null spaces | |
CN116834014A (en) | Intelligent cooperative control method and system for capturing non-cooperative targets by space dobby robot | |
CN116834015A (en) | Deep reinforcement learning training optimization method for automatic control of intelligent robot arm | |
CN117021066A (en) | Robot vision servo motion control method based on deep reinforcement learning | |
CN113352320B (en) | Q learning-based Baxter mechanical arm intelligent optimization control method | |
CN113370205B (en) | Baxter mechanical arm track tracking control method based on machine learning | |
CN112380655A (en) | Robot inverse kinematics solving method based on RS-CMSA algorithm | |
CN113290554A (en) | Intelligent optimization control method for Baxter mechanical arm based on value iteration |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |