CN118114746B - Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error - Google Patents

Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error Download PDF

Info

Publication number
CN118114746B
CN118114746B CN202410508730.6A CN202410508730A CN118114746B CN 118114746 B CN118114746 B CN 118114746B CN 202410508730 A CN202410508730 A CN 202410508730A CN 118114746 B CN118114746 B CN 118114746B
Authority
CN
China
Prior art keywords
mechanical arm
actions
reinforcement learning
error
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410508730.6A
Other languages
Chinese (zh)
Other versions
CN118114746A (en
Inventor
陈兴国
江宛真
巩宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202410508730.6A priority Critical patent/CN118114746B/en
Publication of CN118114746A publication Critical patent/CN118114746A/en
Application granted granted Critical
Publication of CN118114746B publication Critical patent/CN118114746B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1602Programme controls characterised by the control system, structure, architecture
    • B25J9/1605Simulation of manipulator lay-out, design, modelling of manipulator
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1628Programme controls characterised by the control loop
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/061Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using biological neurons, e.g. biological neurons connected to an integrated circuit
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Neurology (AREA)
  • Automation & Control Theory (AREA)
  • Microelectronics & Electronic Packaging (AREA)
  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention provides a variance minimization reinforcement learning mechanical arm training acceleration method based on Belman errors, which is used for mechanical arm control and comprises the following steps: the engineering problem is built into a reinforcement learning environment model, and joint angle, angular velocity, end effector position, end effector speed, obstacle position and other gesture data of the mechanical arm in the motion process are acquired and measured through the position sensor and the rotary encoder. The data are transformed by the neural network to form the state characteristics of the mechanical arm. Training is performed by using a variance minimization algorithm based on the projected bellman error, and a control strategy of the robot arm is improved. And through repeated iterative training, finally, the optimal control strategy of the mechanical arm is obtained, and the performance of the mechanical arm in specific tasks and application scenes is improved. The method can accelerate convergence to the optimal strategy by reducing the gradient estimation variance, improves the accuracy and efficiency of mechanical arm training, and improves the performance of an automatic control system.

Description

Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error
Technical Field
The invention relates to the field of computer text emotion analysis, in particular to a variance minimization reinforcement learning mechanical arm training acceleration method based on Belman errors.
Background
With the continuous evolution of technology, mechanical arms are becoming an indispensable technology in various fields. In complex industrial environments, control and training of robotic arms presents challenges such as complex work tasks, uncertain environmental conditions, and the ability to quickly adapt to new tasks. In order to address these challenges, researchers have focused on exploring the optimization method of the mechanical arm control, especially the training acceleration algorithm based on reinforcement learning. Many reinforcement learning methods may suffer from a large variance of gradient estimates during training, resulting in reduced training efficiency and longer training time.
In view of the foregoing, it is necessary to design a method for accelerating training of a reinforcement learning mechanical arm based on the variance minimization of the bellman error to solve the above-mentioned problems.
Disclosure of Invention
The invention aims to provide an efficient variance minimization reinforcement learning mechanical arm training acceleration method based on Belman errors.
In order to achieve the above object, the present invention provides a training acceleration method for a reinforcement learning mechanical arm based on variance minimization of bellman error, comprising the steps of:
step S1, establishing a reinforcement learning environment model aiming at the operation requirement of a mechanical arm, and instantiating a trained neural network model;
S2, acquiring and measuring state information of the mechanical arm by using the position sensor and the rotary encoder The status informationJoint angle including at least a mechanical armAngular velocity of jointEnd effector positionEnd effector speedAnd obstacle location
S3, state information of the mechanical arm is obtainedAnd optional actionsInputting into a neural network model to obtain corresponding feature vectorsSelecting actions by linear method and by epsilon-greedy strategyAnd save the actionsCorresponding feature vector
Step S4, the agent executes the actionObtain rewardsEnter the next stateFor representing the next state, a state is obtained by means of said step S3Action of the lower partAnd feature vector
S5, the mechanical arm updates parameters of a mechanical arm control strategy by using a method of minimizing variance of a projection Belman error;
And S6, repeating the steps S2 to S5 until the mechanical arm reaches the target position or the iteration reaches the maximum times.
As a further improvement of the present invention, the step S1 specifically includes: establishing a reinforcement learning environment model according to the task requirements of the mechanical arm; training the neural network model fully with the data set with the signature status feature; all state information of the mechanical armAnd a set of selectable actionsSequentially inputting into a fully trained neural network model to obtain corresponding feature vectors
As a further improvement of the present invention, the step S2 specifically includes: the rotary encoder obtains state information s of the mechanical arm, wherein the state information s of the mechanical arm at least comprises an angle of the mechanical arm relative to the vertical directionAngle in the rotation direction of the robot armAngular velocity of the upper end of the mechanical armAngular velocity at the joint of the mechanical armEnd effector positionEnd effector speedAnd obstacle location
As a further improvement of the present invention, the step S3 specifically includes: acquiring all current optional actions of the mechanical armStatus information is toAll optional actionsSequentially inputting into a trained neural network model to obtain all optional feature vectorsWherein, the method comprises the steps of, wherein,As a set of the optional actions,Is a feature vector in state s; calculating all selectable motion state functions using a linear methodI.e.Wherein, the method comprises the steps of, wherein,As a characteristic weight parameter vector of the model,Transpose the symbols for the vectors; by means ofMethod selection actionsAnd store the corresponding feature vector
As a further improvement of the present invention, the step S4 specifically includes: after the mechanical arm executes the action a, the instant rewards are obtained, and the next state is enteredSimultaneously acquiring all current optional actions of the mechanical armAll state information is processedOptional actionInputting the training neural network model to obtain the in-stateAll optional feature vectors belowWherein, the method comprises the steps of, wherein,For indicating in stateAll optional actions are collected, and all optional actions are calculated to be in a state by adopting a linear methodLower part (C)I.e.Wherein, the method comprises the steps of, wherein,As a characteristic weight parameter vector of the model,Transpose the symbols for the vectors; by means ofMethod selection actionsObtain its corresponding feature vector
As a further improvement of the present invention, in the step S4, a specific rewarding form of the mechanical arm for obtaining the instant rewards is as follows:
target achievement rewards : The more the end effector is closer to the target position, the higher the prize is, as follows:
Obstacle avoidance rewards : When the tail end of the mechanical arm is far away from the obstacle, a positive reward is given, which is specifically as follows:
Smoothness rewards : Positive rewards are given for smooth changes in joint angle and end effector position, as follows:
Energy expenditure rewards : A positive prize is given to the action with smaller energy consumption, which is as follows:
Collision penalty : If the mechanical arm collides with an obstacle, a larger negative reward is given, specifically as follows:
Motion penalty : Negative rewards are given for actions that vary too much or too fast in joint angle and end effector position, as follows:
The reward r is the above-mentioned reward To the point ofAnd (3) summing.
As a further improvement of the present invention, the minimization objective of the optimization procedure based on the variance minimization method of the projected bellman error is:
Equation 1
Wherein,Representing errors, errorsIndicating the desire for an error,In order to be rewarded,For representing desired symbols and feature vectors, respectively, definingTo estimateThe equation 1 translates into:
Equation 2
Respectively pairs by using random gradient descent methodUpdating, wherein the updating formula is as follows:
equation 3
Equation 4
Equation 5
Wherein,The characteristic weight parameter vector is represented as such,For the optimal set of actions at time t +1,Representing the state and adjustable parameters at time t +1,In order to act on the device,Respectively used for representing the error, the state, the adjustable parameter and the Belman error expected estimation value at the t moment, and the errorRepresenting bellman error expectationsIs used for the evaluation of the (c),AndRespectively areAndIs a learning rate of (a).
The beneficial effects of the invention are as follows: according to the invention, the engineering problem is built into a reinforcement learning environment model, and the joint angle, the angular speed, the end effector position, the end effector speed, the obstacle position and other gesture data of the mechanical arm in the motion process are acquired and measured. The data are converted through a neural network to form the state characteristics of the mechanical arm, and then the variance minimization algorithm based on the projection Belman error is used for training, so that the control strategy of the mechanical arm is improved. And through repeated iterative training, an optimal control strategy of the mechanical arm is obtained, and the performance of the mechanical arm in specific tasks and application scenes is improved. Meanwhile, by reducing the gradient estimation variance, the speed of converging to the optimal strategy can be increased, the speed of acquiring the optimal strategy by the mechanical arm is improved, and the method has good expandability and adaptability. The method enables the mechanical arm to learn the optimized control strategy more quickly and effectively, thereby improving the performance of the whole system. By optimizing the control strategy, the accuracy and efficiency of the mechanical arm training are improved, a more flexible and efficient solution is provided for the mechanical arm in the aspects of autonomous decision making, rapid strain and the like, and the performance of an automatic control system is improved.
Drawings
Fig. 1 is a flow chart of a bellman error-based variance minimization reinforcement learning mechanical arm training acceleration method of the present invention.
Fig. 2 is a simplified environmental schematic diagram of a bellman error-based variance minimization reinforcement learning mechanical arm training acceleration method of the present invention.
Fig. 3 is a graph comparing a variance minimization reinforcement learning mechanical arm training acceleration method based on bellman error with a conventional classical training method.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
It should be noted that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to aspects of the present invention are shown in the drawings, and other details not greatly related to the present invention are omitted.
Referring to fig. 1 to 3, the present invention provides a variance minimization reinforcement learning mechanical arm training acceleration method based on bellman error, which is used for mechanical arm control training, and specifically comprises the following steps:
Step S1, building an reinforcement learning model environment according to actual operation requirements, and considering the current operation requirements as follows: the free end of the mechanical arm is driven to reach the target height. The mechanical arm is provided with two arm elbows and a driving head, belongs to an underactuated system and is relatively unstable in motion control. The reinforcement learning environment is built as shown in fig. 2, the mechanical arm is an intelligent body to be trained, and three actions of applying clockwise torque, not applying torque and applying counterclockwise torque can be selected at the driving head. The status information includes the angle of the first mechanical arm 1 with respect to the vertical direction And the angle of the first mechanical arm 1 relative to the second mechanical arm 2Angular velocity of the first mechanical arm 1And angular velocity of the second mechanical arm 2End effector positionEnd effector speedAnd obstacle location; The trained neural network is instantiated, all choices are input into the neural network, corresponding characteristics can be obtained, and a tile coding encoder can be used for encoding, so that the characteristics can be obtained. Among them, the neural network may try various models, such as a fully connected neural network, a convolutional neural network, etc., and basic model references are given below:
an input layer containing 13 neurons, each corresponding to a state information including the angle of the mechanical arm 1 relative to the vertical direction Angle of the robot arm 1 relative to the robot arm 2Rotational direction, angular velocity of two mechanical armsAndEnd effector positionEnd effector speedAnd obstacle location
The hidden layers are three layers, the dimension is 128,256,128, and each layer uses a ReLU activation function.
Output layer the output layer contains 64 neurons giving the characteristics
Step S2, the mechanical arm uses a position sensor and a rotary encoder to acquire and measure the angle of the first mechanical arm 1 (i.e. the upper end of the mechanical arm) relative to the vertical directionAnd the angle of the first mechanical arm 1 relative to the second mechanical arm 2(I.e., angle in the rotation direction of the arm), angular velocity of the first arm 1(Equivalent to the angular velocity of the upper end of the arm), the angular velocity of the second arm 2 (equivalent to the angular velocity of the joint of the arm)End effector positionEnd effector speedAnd obstacle locationStatus information of etc. The mechanical arm comprises a first mechanical arm 1, a second mechanical arm 2 and a mechanical arm joint 3 connecting the two, and the first mechanical arm 1, the second mechanical arm 2 and the mechanical arm joint 3 connecting the two are shown in fig. 2.
S3, acquiring all current optional actions of the mechanical arm, and sequentially inputting the state information and all the optional actions of the mechanical arm into the trained neural network model to obtain all the optional feature vectorsWhereinAs the current state information of the mechanical arm,For the set of selectable actions of the robotic arm,To take action in state sIs described. Calculating all selectable motion state functions using a linear methodI.e.Wherein, the method comprises the steps of, wherein,As a characteristic weight parameter vector of the model,Transpose the symbols for the vectors. Then utilizeThe method selects the appropriate actionI.e. mechanical arm havingThe probability randomly selects optional actions includingProbability selection of (a) is such thatMaximum value and storing corresponding characteristic vectorIn our experiments, the settings wereThe size is 0.01.
Step S4, the mechanical arm executes actionsAfter that, get the instant rewardsAnd proceeds to the next state. Simultaneously, the position sensor and the rotary encoder acquire all current selectable actions of the mechanical arm, and input all state information and the selectable actions into a trained neural network model to acquire the stateAll optional feature vectors belowRepresenting in stateThe next set of selectable actions, all actions should satisfy a sum of likelihood of 1. Next, all selectable actions are calculated in a state by adopting a linear methodLower part (C)I.e.In this formula, the number of the active ingredients in the active ingredients,The characteristic weight parameter vector is represented as such,The vector transpose operation. Finally utilizeMethod selection actionsObtain its corresponding feature vector. The specific rewarding form of the mechanical arm for obtaining the instant rewarding is as follows:
target achievement rewards : The more the end effector is closer to the target position, the higher the prize is, as follows:
Obstacle avoidance rewards : When the tail end of the mechanical arm is far away from the obstacle, a positive reward is given, which is specifically as follows:
Smoothness rewards : Positive rewards are given for smooth changes in joint angle and end effector position, as follows:
Energy expenditure rewards : A positive prize is given to the action with smaller energy consumption, which is as follows:
Collision penalty : If the mechanical arm collides with an obstacle, a larger negative reward is given, specifically as follows:
Motion penalty : Negative rewards are given for actions that vary too much or too fast in joint angle and end effector position, as follows:
The final prize r is the above-mentioned prize To the point ofAnd (3) summing.
S5, the mechanical arm updates parameters by using a variance minimization algorithm based on a projection Belman error, and the minimization target of the optimization process of the method is as follows:
Equation 1
Wherein the method comprises the steps ofRepresenting errors, errorsRepresenting feature weight parameter vectors, initialized toDimension(s)The vector quantity is used to determine the vector quantity,And also corresponds to the dimension of the feature vector.Indicating the desire for errors, i.e. bellman errors,In order to be rewarded,Respectively for representing the desired symbol and the feature vector. Because of the desired itemIs not easy to calculate, thus definingTo estimate the approximationInitialized to 0. Then equation 1 can be expressed as:
Equation 2
Respectively pairs by using random gradient descent methodUpdating, wherein the updating formula is as follows:
equation 3
Equation 4
Equation 5
Wherein,Representing characteristic weight parameter vectors, errorsRepresenting bellman error expectationsIs used for the evaluation of the (c),AndRespectively areAndThe learning rate with the best effect can be selected through experiments.
S6, the mechanical arm judges whether the action reaches a target or whether the training reaches the maximum iteration number, if so, the wheel training is finished; if not, the steps S2-S5 are repeated continuously.
Comparing the invention with the traditional training method, the statistics of the results are shown in figure 3. In fig. 3, the convergence rate of the variance minimization reinforcement learning training acceleration method (represented by English improved GQ (0) in fig. 3) of the projection bellman error provided by the invention is obviously higher than that of the traditional time sequence differential control algorithm Q-learning and the classical GQ (0) algorithm, so that the training efficiency is effectively improved, and the training time is shortened.
In summary, the present invention establishes the engineering problem as a reinforcement learning environment model, and uses the position sensor and the rotary encoder to acquire and measure the joint angle, the angular velocity, the end effector position, the end effector speed, the obstacle position, and the like pose data of the mechanical arm in the motion process. The data are converted through a neural network to form the state characteristics of the mechanical arm, and then the variance minimization algorithm based on the projection Belman error is used for training, so that the control strategy of the mechanical arm is improved. And through repeated iterative training, an optimal control strategy of the mechanical arm is obtained, and the performance of the mechanical arm in specific tasks and application scenes is improved. Meanwhile, by reducing the gradient estimation variance, the speed of converging to the optimal strategy can be increased, the speed of acquiring the optimal strategy by the mechanical arm is improved, and the method has good expandability and adaptability. The method enables the mechanical arm to learn the optimized control strategy more quickly and effectively, thereby improving the performance of the whole system. The control strategy is optimized, the accuracy and the efficiency of the mechanical arm training are improved, a more flexible and efficient solution is provided for the mechanical arm in the aspects of autonomous decision making, rapid strain and the like, and the performance of an automatic control system is improved.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims (6)

1. A variance minimization reinforcement learning mechanical arm training acceleration method based on Belman errors is characterized in that: the method comprises the following steps:
step S1, establishing a reinforcement learning environment model aiming at the operation requirement of a mechanical arm, and instantiating a trained neural network model;
S2, acquiring and measuring state information of the mechanical arm by using the position sensor and the rotary encoder The status informationJoint angle including at least a mechanical armAngular velocity of jointEnd effector positionEnd effector speedAnd obstacle location
S3, state information of the mechanical arm is obtainedAnd optional actionsInputting into a neural network model to obtain corresponding feature vectorsSelecting actions by linear method and by epsilon-greedy strategyAnd save the actionsCorresponding feature vector
Step S4, the agent executes the actionObtain rewardsEnter the next stateFor representing the next state, a state is obtained by means of said step S3Action of the lower partAnd feature vector
S5, the mechanical arm updates parameters of a mechanical arm control strategy by using a method based on a variance minimization method of a projection Belman error;
step S6, repeating the steps S2 to S5 until the mechanical arm reaches the target position or the iteration reaches the maximum times;
the step S5 specifically comprises the following steps: the minimization goal of the optimization process based on the variance minimization method of the projected bellman error is:
Equation 1
Wherein,Representing errors, errorsIndicating the desire for an error,In order to be rewarded,For representing desired symbols and feature vectors, respectively, definingEstimating bellman error expectationsThe equation 1 translates into:
Equation 2
Respectively pairs by using random gradient descent methodUpdating, wherein the updating formula is as follows:
equation 3
Equation 4
Equation 5
Wherein,The characteristic weight parameter vector is represented as such,For the optimal set of actions at time t +1,Representing the state and adjustable parameters at time t +1,In order to act on the device,Respectively used for representing the error, the state, the adjustable parameter and the Belman error expected estimation value at the t moment, and the errorRepresenting bellman error expectationsIs used for the evaluation of the (c),AndRespectively areAndIs a learning rate of (a).
2. The bellman error based variance minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized by: the step S1 specifically comprises the following steps: establishing a reinforcement learning environment model according to the task requirements of the mechanical arm; training the neural network model fully with the data set with the signature status feature; all state information of the mechanical armAnd a set of selectable actionsInput into a fully trained neural network model to obtain corresponding feature vectors
3. The bellman error based variance minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized by: the step S2 specifically comprises the following steps: the rotary encoder obtains state information s of the mechanical arm, wherein the state information s of the mechanical arm at least comprises an angle of the mechanical arm relative to the vertical directionAngle in the rotation direction of the robot armAngular velocity of the upper end of the mechanical armAngular velocity at the joint of the mechanical armEnd effector positionEnd effector speedAnd obstacle location
4. The bellman error based variance minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized by: the step S3 specifically comprises the following steps: acquiring all current optional actions of the mechanical armStatus information is toAll optional actionsSequentially inputting into a trained neural network model to obtain all optional feature vectorsWherein, the method comprises the steps of, wherein,As a set of the optional actions,To take action in state sIs a feature vector of (1); calculating all selectable motion state functions using a linear methodI.e.Wherein, the method comprises the steps of, wherein,As a characteristic weight parameter vector of the model,Transpose the symbols for the vectors; by means ofMethod selection actionsAnd store the corresponding feature vector
5. The bellman error based variance minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized by: the step S4 specifically includes: after the mechanical arm executes the action a, the instant rewards are obtained, and the next state is enteredSimultaneously acquiring all current optional actions of the mechanical armAll state information is processedOptional actionInputting the training neural network model to obtain the in-stateAll optional feature vectors belowWherein, the method comprises the steps of, wherein,For indicating in stateAll optional actions are collected, and all optional actions are calculated to be in a state by adopting a linear methodLower part (C)I.e.Wherein, the method comprises the steps of, wherein,As a characteristic weight parameter vector of the model,Transpose the symbols for the vectors; by means ofMethod selection actionsObtain its corresponding feature vector
6. The bellman error based variance minimization reinforcement learning mechanical arm training acceleration method of claim 1, characterized by: in the step S4, the specific rewarding form of the mechanical arm for obtaining the instant rewarding is as follows:
target achievement rewards : The more the end effector is closer to the target position, the higher the prize is, as follows:
Obstacle avoidance rewards : When the tail end of the mechanical arm is far away from the obstacle, a positive reward is given, which is specifically as follows:
Smoothness rewards : Positive rewards are given for smooth changes in joint angle and end effector position, as follows:
Energy expenditure rewards : A positive prize is given to the action with smaller energy consumption, which is as follows:
Collision penalty : If the mechanical arm collides with an obstacle, a larger negative reward is given, specifically as follows:
Motion penalty : Negative rewards are given for actions that vary too much or too fast in joint angle and end effector position, as follows:
The reward r is the above-mentioned reward To the point ofAnd (3) summing.
CN202410508730.6A 2024-04-26 2024-04-26 Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error Active CN118114746B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410508730.6A CN118114746B (en) 2024-04-26 2024-04-26 Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410508730.6A CN118114746B (en) 2024-04-26 2024-04-26 Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error

Publications (2)

Publication Number Publication Date
CN118114746A CN118114746A (en) 2024-05-31
CN118114746B true CN118114746B (en) 2024-07-23

Family

ID=91208969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410508730.6A Active CN118114746B (en) 2024-04-26 2024-04-26 Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error

Country Status (1)

Country Link
CN (1) CN118114746B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109906132A (en) * 2016-09-15 2019-06-18 谷歌有限责任公司 The deeply of Robotic Manipulator learns
KR20200010982A (en) * 2018-06-25 2020-01-31 군산대학교산학협력단 Method and apparatus of generating control parameter based on reinforcement learning

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102007042440B3 (en) * 2007-09-06 2009-01-29 Siemens Ag Method for computer-aided control and / or regulation of a technical system
WO2019241680A1 (en) * 2018-06-15 2019-12-19 Google Llc Deep reinforcement learning for robotic manipulation
WO2020154542A1 (en) * 2019-01-23 2020-07-30 Google Llc Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning
CN111331607B (en) * 2020-04-03 2021-04-23 山东大学 Automatic grabbing and stacking method and system based on mechanical arm
US20220410380A1 (en) * 2021-06-17 2022-12-29 X Development Llc Learning robotic skills with imitation and reinforcement at scale
CN114781789A (en) * 2022-03-10 2022-07-22 北京控制工程研究所 Hierarchical task planning method and system for spatial fine operation
CN116175577A (en) * 2023-03-06 2023-05-30 南京理工大学 Strategy learning method based on optimizable image conversion in mechanical arm grabbing
CN116533249A (en) * 2023-06-05 2023-08-04 贵州大学 Mechanical arm control method based on deep reinforcement learning
CN116859755B (en) * 2023-08-29 2023-12-08 南京邮电大学 Minimized covariance reinforcement learning training acceleration method for unmanned vehicle driving control
CN117086882A (en) * 2023-10-07 2023-11-21 四川大学 Strengthening learning method based on mechanical arm attitude movement degree of freedom

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109906132A (en) * 2016-09-15 2019-06-18 谷歌有限责任公司 The deeply of Robotic Manipulator learns
KR20200010982A (en) * 2018-06-25 2020-01-31 군산대학교산학협력단 Method and apparatus of generating control parameter based on reinforcement learning

Also Published As

Publication number Publication date
CN118114746A (en) 2024-05-31

Similar Documents

Publication Publication Date Title
CN108656117B (en) Mechanical arm space trajectory optimization method for optimal time under multi-constraint condition
CN112102405B (en) Robot stirring-grabbing combined method based on deep reinforcement learning
WO2020207219A1 (en) Non-model robot control method for multi-shaft-hole assembly optimized by environmental prediction
CN109240091B (en) Underwater robot control method based on reinforcement learning and tracking control method thereof
CN116460860B (en) Model-based robot offline reinforcement learning control method
CN117103282B (en) Double-arm robot cooperative motion control method based on MATD3 algorithm
CN113070878B (en) Robot control method based on impulse neural network, robot and storage medium
CN113442140B (en) Cartesian space obstacle avoidance planning method based on Bezier optimization
CN116533249A (en) Mechanical arm control method based on deep reinforcement learning
CN115464659A (en) Mechanical arm grabbing control method based on deep reinforcement learning DDPG algorithm of visual information
CN118288294B (en) Robot vision servo and man-machine cooperative control method based on image variable admittance
CN113977583A (en) Robot rapid assembly method and system based on near-end strategy optimization algorithm
CN112965487A (en) Mobile robot trajectory tracking control method based on strategy iteration
CN115416024A (en) Moment-controlled mechanical arm autonomous trajectory planning method and system
Luo et al. Balance between efficient and effective learning: Dense2sparse reward shaping for robot manipulation with environment uncertainty
CN118114746B (en) Variance minimization reinforcement learning mechanical arm training acceleration method based on Belman error
CN117245666A (en) Dynamic target quick grabbing planning method and system based on deep reinforcement learning
Zhu et al. Allowing safe contact in robotic goal-reaching: Planning and tracking in operational and null spaces
CN116834014A (en) Intelligent cooperative control method and system for capturing non-cooperative targets by space dobby robot
CN116834015A (en) Deep reinforcement learning training optimization method for automatic control of intelligent robot arm
CN117021066A (en) Robot vision servo motion control method based on deep reinforcement learning
CN113352320B (en) Q learning-based Baxter mechanical arm intelligent optimization control method
CN113370205B (en) Baxter mechanical arm track tracking control method based on machine learning
CN112380655A (en) Robot inverse kinematics solving method based on RS-CMSA algorithm
CN113290554A (en) Intelligent optimization control method for Baxter mechanical arm based on value iteration

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant