CN113910221B - Mechanical arm autonomous motion planning method, device, equipment and storage medium - Google Patents

Mechanical arm autonomous motion planning method, device, equipment and storage medium Download PDF

Info

Publication number
CN113910221B
CN113910221B (application CN202111143685.1A)
Authority
CN
China
Prior art keywords
mechanical arm
action
data
value
yaw
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111143685.1A
Other languages
Chinese (zh)
Other versions
CN113910221A (en)
Inventor
林凡
李沐
卢泉州
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GCI Science and Technology Co Ltd
Original Assignee
GCI Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GCI Science and Technology Co Ltd filed Critical GCI Science and Technology Co Ltd
Priority to CN202111143685.1A
Publication of CN113910221A
Application granted
Publication of CN113910221B
Legal status: Active
Anticipated expiration

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1664Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
    • B25J9/1666Avoiding collision or forbidden zones

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention discloses a method, a device, equipment and a storage medium for planning the autonomous motion of a mechanical arm, wherein the method comprises the following steps: acquiring mechanical arm data comprising position coordinate values, a current moving speed value and a current yaw speed value of the mechanical arm, and obstacle data comprising position coordinate values and size data of obstacles; obtaining an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data; inputting the mechanical arm data, the obstacle data and the action evaluation index value as state values into a preset decision model, and selecting the moving action of the mechanical arm, the continuous speed value of the moving action, the yaw action and the continuous angular speed value of the yaw action through the decision model; the decision model adopts an action strategy function based on normal distribution. Based on the normal-distribution action strategy function, the invention can output actions with continuous moving speed values and continuous yaw angular speed values, thereby improving the operation accuracy of the mechanical arm in continuous motion.

Description

Mechanical arm autonomous motion planning method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of robot control, in particular to a method, a device, equipment and a storage medium for planning autonomous motion of a mechanical arm.
Background
The precision operation manipulator is a novel intelligent product combining high operation precision, high execution efficiency and high automation: after a worker imports pre-compiled motion code into it, it can precisely process products on a factory assembly line. In a traditional workshop, staff often have to watch the assembly line at all times so as to process the semi-finished products passing along it in time. With the continuous advance of factory intelligence, however, the precision operation manipulator, by virtue of its high operation precision, small error, short operation time, high efficiency, and its ability to reduce the loss of factory human resources and lower enterprise costs, is gradually being widely applied on factory assembly lines and will be an indispensable link in the construction of future intelligent factories.
However, most of the algorithms currently adopted by manipulators are algorithms for realizing an autonomous navigation function, such as the A*, D* and RRT algorithms. The idea of these algorithms is to locate the position of the manipulator and search out an optimal path according to the position of the target point and the positions of the obstacles, and they achieve good results when applied to simple actual scenes.
Disclosure of Invention
The invention provides a mechanical arm autonomous motion planning method, device, equipment and storage medium to solve the problem that existing mechanical arm motion planning methods cannot improve the operation accuracy of a mechanical arm in continuous motion. The motion of the mechanical arm is decided based on an action strategy function following a normal distribution, so that the decided actions sample the moving speed and the yaw angular velocity from the normal distribution and output actions with continuous moving speed values and continuous yaw angular velocity values, thereby improving the operation accuracy of the mechanical arm in continuous motion and making it suitable for complex actual environments.
In order to solve the above technical problem, a first aspect of an embodiment of the present invention provides a method for planning autonomous motion of a robot arm, including:
acquiring mechanical arm data and obstacle data, wherein the mechanical arm data comprises position coordinate values of a mechanical arm, a current movement speed value and a current yaw speed value, and the obstacle data comprises position coordinate values and size data of an obstacle;
obtaining an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the barrier data;
inputting the mechanical arm data, the obstacle data and the action evaluation index value into a preset decision model as state values, and selecting the moving action of the mechanical arm, the continuous speed value of the moving action, the yawing action and the continuous angular speed value of the yawing action through the decision model; the decision model adopts an action strategy function based on normal distribution.
As an improvement, the action policy function is specifically:
π(x | s, θ) = (1 / (√(2π)·σ(s, θ))) · exp(−(x − μ(s, θ))² / (2σ(s, θ)²))
wherein s represents a state value, θ represents a parameter vector value, α represents a learning rate, σ (s, θ) represents a strategy distribution variance with the parameter vector value θ, μ (s, θ) represents a strategy distribution expectation with the parameter vector value θ, and π (x | s, θ) represents a probability of selecting an action x when the state value of the decision model is s and the parameter vector value is θ.
As an improvement, the selecting, by the decision model, the moving motion of the mechanical arm, the continuous velocity value of the moving motion, the yaw motion, and the continuous angular velocity value of the yaw motion specifically includes:
respectively limiting the normal distribution expectation of the moving speed of the mechanical arm and the normal distribution expectation of the yaw angular velocity of the mechanical arm by adopting a tanh activation function, and respectively obtaining a speed strategy distribution expectation and an angular velocity strategy distribution expectation;
obtaining a normal distribution and a moving action strategy function of the moving speed according to the strategy distribution variance of the moving speed and the speed strategy distribution expectation;
selecting the moving action of the mechanical arm according to the moving action strategy function, sampling the moving speed according to the normal distribution of the moving speed by the moving action, and selecting the continuous speed value;
obtaining a normal distribution and a yaw action strategy function of the yaw angular velocity according to the strategy distribution variance of the yaw angular velocity and the strategy distribution expectation of the angular velocity;
and selecting the yaw action of the mechanical arm according to the yaw action strategy function, and sampling the yaw angular velocity according to the normal distribution of the yaw angular velocity by the yaw action to obtain the continuous angular velocity value.
As an improvement, the method obtains the decision model in advance by the following steps:
acquiring mechanical arm data and obstacle data at multiple moments;
obtaining action evaluation index values at a plurality of moments by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data at the plurality of moments;
inputting the mechanical arm data, the obstacle data and the action evaluation index value at a plurality of moments as state values into a PPO model;
and training the PPO model by adopting a PPO algorithm based on the action strategy function of normal distribution, and obtaining the decision model.
As one improvement, the decision model comprises an input layer, a full connection layer, a selection network and an evaluation network;
the input layer is used for inputting a state space sequence (S_1, S_2, …, S_t) to the full connection layer, where S_t represents the state value at time t;
the fully-connected layer comprises a first active layer and a second active layer;
the first active layer has 256 nodes, one node corresponds to one moving action or yaw action of the mechanical arm, and according to the state space sequence, the first active layer selects an evaluation function corresponding to the moving action or yaw action of the mechanical arm, wherein the evaluation function comprises a state cost function and an action cost function;
the second activation layer is used for limiting normal distribution expectation of the moving speed of the mechanical arm and normal distribution expectation of the yaw velocity of the mechanical arm by adopting a tanh activation function according to the state space sequence, and obtaining speed strategy distribution expectation and angular velocity strategy distribution expectation respectively;
the evaluation network is used for estimating the evaluation function according to the state space sequence, obtaining an advantage function according to the estimated evaluation function, obtaining a return value according to the advantage function, and updating the parameter vector value according to the return value;
the selection network constructs the moving action strategy function, the normal distribution of the moving speed, the yaw action strategy function and the normal distribution of the angular velocity according to the speed strategy distribution expectation and the angular velocity strategy distribution expectation obtained by the second activation layer, selects the moving action of the mechanical arm according to the moving action strategy function, samples the moving speed according to the normal distribution of the moving speed by the moving action, selects the continuous velocity value, selects the yaw action of the mechanical arm according to the yaw action strategy function, samples the yaw velocity according to the normal distribution of the yaw velocity, and obtains the continuous angular velocity value.
As an improvement, the obtaining of the action evaluation index value by using an artificial potential field algorithm according to the mechanical arm data and the obstacle information data specifically includes:
acquiring a danger factor according to the mechanical arm data and the obstacle data;
acquiring repulsive potential energy of the barrier according to the risk factor and a preset risk threshold;
obtaining the action evaluation index value according to the repulsive force potential energy and the attractive force potential energy of the destination area;
wherein the risk factor is defined as:
t = d_0 − max(L, W)
where t represents the value of the risk factor, d_0 represents the distance between the mechanical arm and the obstacle, and L and W respectively represent the length and the width of the obstacle;
the repulsive potential energy is defined as:
U′_r = (1/2)·η·(1/t* − 1/t_0)² when t* ≤ t_0, and U′_r = 0 when t* > t_0,
where U′_r denotes the repulsive potential energy, η denotes the repulsive factor, t* represents the minimum value of the risk factor, and t_0 represents the preset danger threshold;
the gravitational potential energy is defined as:
U′_a = (1/2)·k_p·d_g² when d_g ≥ d_g^0, and U′_a = 0 when d_g < d_g^0,
where U′_a denotes the gravitational potential energy, k_p denotes the gravitational factor, d_g represents the distance between the mechanical arm and the destination, and d_g^0 represents the distance threshold between the moving platform of the mechanical arm and the destination;
the action evaluation index value is obtained by the formula U = U′_r + U′_a.
As an improvement, the acquiring of the mechanical arm data and the obstacle data specifically includes:
acquiring a position coordinate value of the mechanical arm based on a base position induction sensor arranged on the mechanical arm;
the method comprises the steps that a current moving speed value of the mechanical arm is obtained based on at least one speed induction sensor installed on the mechanical arm, and a current yaw speed value of the mechanical arm is obtained based on at least one yaw type angular speed sensor installed on the mechanical arm;
and obtaining laser radar point cloud data of the obstacle according to a laser radar, and clustering the laser radar point cloud data by adopting an Euclidean distance clustering method of a K-D tree to obtain the obstacle data.
A second aspect of an embodiment of the present invention provides a device for planning autonomous motion of a robot arm, including:
the data acquisition module is used for acquiring mechanical arm data and obstacle data, wherein the mechanical arm data comprises position coordinate values of a mechanical arm, a current moving speed value and a current yaw speed value, and the obstacle data comprises position coordinate values and size data of an obstacle;
the action evaluation index value acquisition module is used for acquiring an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the barrier data;
the action decision module is used for inputting the mechanical arm data, the obstacle data and the action evaluation index value into a preset decision model as state values, and selecting the movement action of the mechanical arm, the continuous speed value of the movement action, the yaw action and the continuous angular speed value of the yaw action through the decision model; and the decision model adopts an action strategy function based on normal distribution.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for planning autonomous movement of a robot arm according to any one of the first aspect.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to perform the method for planning autonomous movement of a robot arm according to any one of the first aspects.
Compared with the prior art, the method, the device, the equipment and the storage medium for planning the autonomous motion of the mechanical arm adopt an action strategy function based on normal distribution as the action strategy function of the decision model and decide the actions of the mechanical arm based on it; the decided actions can sample the moving speed and the yaw angular velocity from the normal distribution and output actions with continuous moving speed values and continuous yaw angular velocity values, thereby improving the operation accuracy of the mechanical arm in continuous motion and making it suitable for complex actual environments.
Drawings
Fig. 1 is a schematic flow chart of a robot autonomous movement planning method according to a preferred embodiment of the present invention;
FIG. 2 is a network architecture diagram of a preferred embodiment of a decision model provided by the present invention;
fig. 3 is a schematic structural diagram of a preferred embodiment of the autonomous movement planning apparatus for a robot arm according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a method for planning autonomous movement of a robot arm according to a preferred embodiment of the present invention.
The first aspect of the embodiments of the present invention provides a method for planning autonomous motion of a robot arm, including steps S1 to S3, which are specifically as follows:
step S1: the method comprises the steps of acquiring mechanical arm data and obstacle data, wherein the mechanical arm data comprise position coordinate values of a mechanical arm, a current movement speed value and a current yaw speed value, and the obstacle data comprise position coordinate values and size data of an obstacle.
As an improvement, the acquiring of the mechanical arm data and the obstacle data specifically includes:
acquiring a position coordinate value of the mechanical arm based on a base position induction sensor arranged on the mechanical arm;
obtaining a current movement speed value of the mechanical arm based on at least one speed sensing sensor mounted on the mechanical arm, and obtaining a current yaw speed value of the mechanical arm based on at least one yaw type angular velocity sensor mounted on the mechanical arm;
and obtaining laser radar point cloud data of the obstacle according to a laser radar, and clustering the laser radar point cloud data by adopting an Euclidean distance clustering method of a K-D tree to obtain the obstacle data.
In particular, to define the state S_t in the reinforcement learning algorithm, it is necessary to acquire a plurality of data such as the position and speed of the mechanical arm itself and the obstacle data. Magnetic north in the factory is taken as the positive direction of the y axis and the direction 90 degrees east of magnetic north as the positive direction of the x axis, and routers are placed at intervals, with the spacing chosen according to the scale of the factory; the information sent by the sensors is collected through the routers, and the position, speed, obstacle data and other data are provided to the mechanical arm by means of the wireless sensing network.
In the embodiment of the invention, a position induction sensor is arranged on the base of the mechanical arm, and its coordinate value (N, E, U) in the factory is sensed through the wireless sensing network formed by the routers; a plurality of speed induction sensors and yaw-type angular velocity sensors are arranged on the mechanical arm, and the moving speed v of the mechanical arm and its yaw angular velocity ω are obtained through these sensors.
Further, a laser radar is used as the sensing sensor for obstacles, and the Euclidean distance clustering method based on the K-D tree is used to cluster the laser radar point cloud to obtain the position coordinate values (X_i, Y_i, Z_i) of each obstacle relative to the mechanical arm and its size data (L_i, W_i, H_i), including length, width and height information.
Reinforcement learning selects actions based on the state S_t at a given time; in the autonomous navigation scenario of the embodiment of the invention, the state S_t comprises the position coordinate values (N, E, U), the mechanical arm motion information (v, ω) and the obstacle information (X_i, Y_i, Z_i, L_i, W_i, H_i).
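As an illustration of this clustering step, the following Python sketch shows one way the K-D-tree-based Euclidean distance clustering of the lidar point cloud could be implemented with SciPy. The function name cluster_obstacles, the distance threshold and the minimum cluster size are assumptions made for illustration and are not specified by the patent.

```python
import numpy as np
from scipy.spatial import cKDTree

def cluster_obstacles(points, dist_thresh=0.3, min_points=5):
    """Euclidean clustering of lidar points using a K-D tree.

    points: (N, 3) array of (X, Y, Z) lidar returns.
    Returns a list of obstacles, each with a centre (X_i, Y_i, Z_i)
    and an axis-aligned bounding-box size (L_i, W_i, H_i).
    """
    tree = cKDTree(points)
    visited = np.zeros(len(points), dtype=bool)
    obstacles = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        # Region growing: keep adding neighbours that lie within
        # dist_thresh (Euclidean distance) of any point already in the cluster.
        cluster, frontier = [], [seed]
        visited[seed] = True
        while frontier:
            idx = frontier.pop()
            cluster.append(idx)
            for nb in tree.query_ball_point(points[idx], dist_thresh):
                if not visited[nb]:
                    visited[nb] = True
                    frontier.append(nb)
        if len(cluster) < min_points:
            continue  # discard sparse clusters as noise
        pts = points[cluster]
        lo, hi = pts.min(axis=0), pts.max(axis=0)
        obstacles.append({
            "position": (lo + hi) / 2.0,   # obstacle centre (X_i, Y_i, Z_i)
            "size": hi - lo,               # bounding box (L_i, W_i, H_i)
        })
    return obstacles
```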
Step S2: obtaining an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the barrier data;
as an improvement, the obtaining of the action evaluation index value by using an artificial potential field algorithm according to the mechanical arm data and the obstacle information data specifically includes:
acquiring a danger factor according to the mechanical arm data and the obstacle data;
acquiring repulsive potential energy of the barrier according to the danger factor and a preset danger threshold;
obtaining the action evaluation index value according to the repulsive force potential energy and the attractive force potential energy of the destination area;
wherein the risk factor is defined as:
t = d_0 − max(L, W)
where t represents the value of the risk factor, d_0 represents the distance between the mechanical arm and the obstacle, and L and W respectively represent the length and the width of the obstacle;
the repulsive potential energy is defined as:
U′_r = (1/2)·η·(1/t* − 1/t_0)² when t* ≤ t_0, and U′_r = 0 when t* > t_0,
where U′_r denotes the repulsive potential energy, η denotes the repulsive factor, t* represents the minimum value of the risk factor, and t_0 represents the preset danger threshold;
the gravitational potential energy is defined as:
U′_a = (1/2)·k_p·d_g² when d_g ≥ d_g^0, and U′_a = 0 when d_g < d_g^0,
where U′_a denotes the gravitational potential energy, k_p denotes the gravitational factor, d_g represents the distance between the mechanical arm and the destination, and d_g^0 represents the distance threshold between the moving platform of the mechanical arm and the destination;
the action evaluation index value is obtained by the formula U = U′_r + U′_a.
Specifically, the embodiment of the invention provides an improved artificial potential field algorithm to evaluate how well the mechanical arm completes its task, and this evaluation is used as one of the state values of the reinforcement-learning decision model.
The artificial potential field algorithm is one of the motion algorithms of a mechanical arm. Its basic idea is that an obstacle exerts a repulsive force on the robot while the destination exerts an attractive force, and the sum of the attractive and repulsive potential energies is used to construct an artificial potential field in the environment. The traditional gravitational potential energy U_a and repulsive potential energy U_r are given by:
U_a = (1/2)·k_p·d_g²   (1)
U_r = (1/2)·η·(1/ρ − 1/ρ_0)² when ρ ≤ ρ_0, and U_r = 0 when ρ > ρ_0   (2)
In formula (1), U_a is the gravitational potential energy, k_p is the gravitational factor, and d_g is the distance between the mechanical arm and the destination; in formula (2), U_r is the repulsive potential energy, η is the repulsive factor, ρ is the distance between the mechanical arm and the nearest obstacle, and ρ_0 is the distance threshold between the mechanical arm and the obstacle.
However, the repulsive potential energy of the conventional artificial potential field depends only on the distance ρ to the nearest obstacle, whereas in many cases an obstacle that is not the nearest one may still pose a collision risk because of its large volume. A risk factor t is therefore designed herein to determine the risk level of each obstacle, defined as follows:
t = d_0 − max(L, W)   (3)
In formula (3), d_0 indicates the distance of the mechanical arm itself from the obstacle, and L and W indicate the length and the width of the obstacle, respectively.
The risk factor of every obstacle is calculated and the minimum value t* is taken, and the repulsive potential energy is replaced as follows:
U′_r = (1/2)·η·(1/t* − 1/t_0)² when t* ≤ t_0, and U′_r = 0 when t* > t_0   (4)
In formula (4), t_0 is the danger threshold.
In addition, when the mechanical arm operates in a complicated environment, the destination of its movement is often not a single point but a region. In order to define the potential energy of the destination region more accurately, the gravitational potential energy is defined as follows:
U′_a = (1/2)·k_p·d_g² when d_g ≥ d_g^0, and U′_a = 0 when d_g < d_g^0   (5)
In formula (5), d_g^0 is the distance threshold between the moving platform of the mechanical arm and the destination; if the distance between the moving platform and the destination is smaller than this threshold, the gravitational potential energy is 0.
Finally, the sum of the attractive and repulsive potential energies is taken as the evaluation index of the motion action of the mechanical arm:
U = U′_r + U′_a   (6)
the larger the U value, the worse the completion of the operation, and the smaller the U value, the better the completion of the operation. The evaluation index is used as a state value of the decision model, so that the model can conveniently carry out action decision design.
And step S3: inputting the mechanical arm data, the obstacle data and the action evaluation index value into a preset decision model as state values, and selecting the moving action of the mechanical arm, the continuous speed value of the moving action, the yawing action and the continuous angular speed value of the yawing action through the decision model; and the decision model adopts an action strategy function based on normal distribution.
As an improvement, the action policy function is specifically:
π(x | s, θ) = (1 / (√(2π)·σ(s, θ))) · exp(−(x − μ(s, θ))² / (2σ(s, θ)²))
wherein s represents a state value, θ represents a parameter vector value, α represents a learning rate, σ (s, θ) represents a strategy distribution variance with the parameter vector value θ, μ (s, θ) represents a strategy distribution expectation with the parameter vector value θ, and π (x | s, θ) represents a probability of selecting an action x when the state value of the decision model is s and the parameter vector value is θ.
As an improvement, the selecting, by the decision model, the moving motion of the mechanical arm, the continuous velocity value of the moving motion, the yaw motion, and the continuous angular velocity value of the yaw motion specifically includes:
respectively limiting the normal distribution expectation of the moving speed of the mechanical arm and the normal distribution expectation of the yaw angular velocity of the mechanical arm by adopting a tanh activation function, and respectively obtaining a speed strategy distribution expectation and an angular velocity strategy distribution expectation;
obtaining a normal distribution and a moving action strategy function of the moving speed according to the strategy distribution variance of the moving speed and the speed strategy distribution expectation;
selecting the moving action of the mechanical arm according to the moving action strategy function, sampling the moving speed according to the normal distribution of the moving speed by the moving action, and selecting the continuous speed value;
obtaining normal distribution and a yaw action strategy function of the yaw angular velocity according to the strategy distribution variance of the yaw angular velocity and the strategy distribution expectation of the angular velocity;
and selecting the yaw action of the mechanical arm according to the yaw action strategy function, and sampling the yaw angular velocity according to the normal distribution of the yaw angular velocity by the yaw action to obtain the continuous angular velocity value.
As an improvement, the method obtains the decision model in advance by the following steps:
acquiring mechanical arm data and obstacle data at multiple moments;
obtaining action evaluation index values at multiple moments by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data at the multiple moments;
inputting the mechanical arm data, the obstacle data and the action evaluation index value at a plurality of moments as state values into a PPO model;
and training the PPO model by adopting a PPO algorithm based on the action strategy function of normal distribution, and obtaining the decision model.
It should be noted that the basic idea of a reinforcement learning algorithm is to acquire rewards by interacting with the environment and thereby learn by itself. A reinforcement learning algorithm includes several main components: agent, environment, state, action and reward. If the agent is in state S_t at time t, it can select an action A_t according to the current policy function π; this action affects the environment and yields a reward R_{t+1} at the next moment. The trajectory of interaction with the environment may be represented as: S_0, A_0, R_1, S_1, A_1, R_2, …, S_t, A_t, R_{t+1}.
The goal of agent training is to maximize the reward earned in a round, i.e., the expected return G_t, which represents the accumulated discounted reward until the end of the round. G_t is defined as:
G_t = Σ_{k=0}^{∞} γ^k · R_{t+k+1}   (7)
in equation (7), γ is a discount factor having a value less than 1.
For evaluating the quality of a certain state or action, a state cost function V_π(s) and an action cost function Q_π(s, a) are introduced, with the following equations:
V_π(s) = E_π[G_t | S_t = s]   (8)
Q_π(s, a) = E_π[G_t | S_t = s, A_t = a]   (9)
The state cost function and the action cost function reflect the average expected return of the round obtainable from the current state or action, and can be used as decision indexes for reinforcement learning. E_π[G_t | S_t = s] denotes the average expected value of the return G_t when the state is S_t = s; E_π[G_t | S_t = s, A_t = a] denotes the average expected value of the return G_t when the state is S_t = s and the model selects action A_t = a.
Further, many current reinforcement learning algorithms are based on the Q value: the model parameters are used to update the action cost function Q_π(s, a), and the strategy is to select the action with the largest Q_π(s, a). Unlike Q-value-based reinforcement learning algorithms, the PPO algorithm employed in the embodiment of the present invention defines the model parameters in a policy function π(a | s, θ):
π(a | s, θ_t) = Pr{A_t = a | S_t = s, θ_t = θ}   (10)
In formula (10), Pr{A_t = a | S_t = s, θ_t = θ} represents the probability that the model selects action A_t = a when the state at time t is S_t = s and the parameter vector is θ_t = θ.
By updating the parameter vector θ_t, a better strategy function can be obtained. The parameter vector update equation is:
θ_{t+1} = θ_t + α·δ_t·π(a | s, θ_t)   (11)
In formula (11), α is the learning rate and δ_t is a reference value solved during the training process.
The purpose of the updating of the parameter vector is to maximize the reward function. The reward function is defined as follows:
η(π) = E_{s_0, a_0, …}[ Σ_{t=0}^{∞} γ^t · r(s_t) ]   (12)
In formula (12), r(s_t) denotes the reward value in state s_t. The reward function of formula (13) that yields this value is a piecewise function of the distance d_a from the mechanical arm to the center of the destination, the corresponding distance threshold, and a collision detection factor c, where c is 1 when a collision occurs and 0 when no collision occurs. A negative reward is obtained when the mechanical arm hits an obstacle and a positive reward is obtained when the destination is reached. In addition, a drivable area is defined, and a negative reward is also obtained when the arm moves beyond it.
Further, in the actual training process the decision model may perform worse and worse if the chosen learning rate α is inappropriate. To solve this problem, the PPO algorithm defines an advantage function A_π(s_t, a_t):
A_π(s_t, a_t) = Q_π(s, a) − V_π(s)   (14)
Equation (14) represents the difference between the round reward obtained by taking action a and the average round reward obtainable in that state. If A_π(s_t, a_t) > 0, action a performs better than average.
From the advantage function, an approximate return function L_π(π′) is constructed:
L_π(π′) = η(π) + Σ_s ρ_{π_{θ_0}}(s) Σ_a π′(a | s) · A_π(s, a)   (15)
In formula (15), L_π(π′) represents the approximate return function corresponding to the updated policy π′, η(π) represents the return function corresponding to the policy π before the update, and ρ_{π_{θ_0}}(s) represents the state distribution corresponding to the policy parameters θ_0 before the update.
If the advantage-function term in formula (15), Σ_s ρ_{π_{θ_0}}(s) Σ_a π′(a | s) · A_π(s, a), is non-negative, the return function after the policy update is monotonically non-decreasing, i.e., the model policy improves or remains unchanged. The optimal model policy can thus be screened out through formula (15).
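The sketch below gives one sample-based reading of equations (14) and (15): advantages are taken as the difference between observed returns and the critic's value estimates, and the advantage term of the approximate return function is estimated with an importance-sampling ratio between the updated policy and the policy that collected the data. The function and variable names are assumptions and this is not the patent's exact training procedure.

```python
import numpy as np

def advantages(returns, values):
    """A_pi(s_t, a_t) = Q_pi(s, a) - V_pi(s), with Q approximated by the
    observed return G_t and V by the critic's estimate (equation (14))."""
    return np.asarray(returns) - np.asarray(values)

def surrogate_gain(new_logp, old_logp, adv):
    """Sample estimate of the advantage term in equation (15).

    The expectation over the old state distribution and the updated policy
    is approximated with importance weights pi_new / pi_old applied to
    data collected under the old policy.
    """
    ratio = np.exp(np.asarray(new_logp) - np.asarray(old_logp))
    return np.mean(ratio * np.asarray(adv))
```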
Further, in order to solve the problem of continuity of output actions of the mechanical arm in the movement process, on the basis of a PPO algorithm, according to a PPO action strategy function of a formula (10), the embodiment of the invention redefines the action strategy function by using a normal distribution function as follows:
π(x | s, θ) = (1 / (√(2π)·σ(s, θ))) · exp(−(x − μ(s, θ))² / (2σ(s, θ)²))   (16)
based on the definition of the action strategy function of equation (16), the mechanical arm movement velocity and yaw rate will be sampled from the normal distribution. Compared with discrete sampling, sampling in normal distribution can make the output action numerical value continuous, thereby solving the problem of output action continuity. And the parameters are updated through the formula (11), the expectation and the variance of normal distribution can be changed, so that the sampling probability of the action is changed, the good action sampling probability is higher, the bad action sampling probability is lower, and a better decision model is obtained.
However, the moving speed and angular speed of the mechanical arm have upper limit values. If the normal distribution is not limited, many sampled actions would exceed these upper limits and therefore not affect the updating of the model, slowing its convergence. The expectation of the normal distribution is therefore limited using the tanh activation function. Since the tanh activation function takes values in (−1, 1), multiplying it by an expectation factor δ_μ limits the expectation of the policy distribution, defined as follows:
μ(s, θ) = δ_μ · tanh(x)   (17)
In formula (17), the expectation factor δ_μ takes as its value the upper limit of the moving speed or of the yaw angular velocity of the mechanical arm, respectively.
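To make equations (16) and (17) concrete, the following sketch samples a continuous moving speed and yaw angular velocity from normal distributions whose expectations are limited with tanh and scaled by the respective upper-limit values. The upper limits and the function name used here are placeholders.

```python
import numpy as np

def sample_action(mu_raw_v, sigma_v, mu_raw_w, sigma_w,
                  v_max=0.5, w_max=1.0, rng=None):
    """Sample continuous action values from tanh-limited normal distributions.

    mu_raw_*: unbounded network outputs; equation (17) maps them through
    tanh and scales by the expectation factor (the speed / yaw-rate limits).
    """
    if rng is None:
        rng = np.random.default_rng()
    mu_v = v_max * np.tanh(mu_raw_v)      # velocity policy distribution expectation
    mu_w = w_max * np.tanh(mu_raw_w)      # yaw-rate policy distribution expectation
    v = rng.normal(mu_v, sigma_v)         # continuous moving-speed value
    omega = rng.normal(mu_w, sigma_w)     # continuous yaw angular-velocity value
    return v, omega
```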
Fig. 2 is a schematic diagram of a network architecture of a preferred embodiment of the decision model provided in the present invention.
As one improvement, the decision model comprises an input layer 201, a full connection layer 202, a selection network 203 and an evaluation network 204;
the input layer 201 is used to input a state space sequence (S_1, S_2, …, S_t) to the fully-connected layer 202, where S_t represents the state value at time t;
the fully-connected layer 202 comprises a first active layer 301 and a second active layer 302;
the number of the nodes of the first active layer 301 is 256, one node corresponds to one moving action or yaw action of the mechanical arm, and according to the state space sequence, the first active layer 301 selects an evaluation function corresponding to the moving action or yaw action of the mechanical arm, wherein the evaluation function comprises a state cost function and an action cost function;
the second active layer 302 is configured to separately limit the expectation of normal distribution of the moving speed of the mechanical arm and the expectation of normal distribution of the yaw rate of the mechanical arm by using a tanh active function according to the state space sequence, and separately obtain the expectation of speed strategy distribution and the expectation of angular velocity strategy distribution;
the evaluation network 204 is configured to estimate the evaluation function according to the state space sequence, obtain an advantage function according to the estimated evaluation function, obtain a report value according to the advantage function, and update the parameter vector value according to the report value;
the selection network 203 constructs the movement motion strategy function, the normal distribution of the movement velocity, the yaw motion strategy function, and the normal distribution of the angular velocity according to the velocity strategy distribution expectation and the angular velocity strategy distribution expectation obtained by the second active layer 302, selects the movement motion of the robot arm according to the movement motion strategy function, samples the movement velocity according to the normal distribution of the movement velocity, selects the continuous velocity value, selects the yaw motion of the robot arm according to the yaw motion strategy function, samples the yaw velocity according to the normal distribution of the yaw velocity, and obtains the continuous angular velocity value.
Specifically, according to the definition of the action policy function of equation (16), the network framework of the decision model proposed in the embodiment of the present invention is shown in fig. 2, and includes an input layer 201, a fully-connected layer 202, a selection network 203, and an evaluation network 204.
The input layer feeds the state space sequence (S_1, S_2, …, S_t) into the layers to be activated of the selection network 203 and the evaluation network 204, respectively. These layers to be activated are called the fully connected layer 202: the first layer has 256 nodes, and a suitable evaluation function is selected as its activation function according to the operation purpose of the mechanical arm; when this function is applied, i.e. activated, it enters the evaluation network 204. The second layer has 128 nodes with tanh as its activation function; similarly, when this function is activated, it enters the selection network 203.
The selection network 203 is used to select the expectation μ(s, θ) and the variance σ(s, θ) of the strategy distribution π(x | s, θ) in order to construct a normal distribution of the moving velocity v of the mechanical arm and a normal distribution of the turning yaw angular velocity ω, from which the decision actions are sampled.
The evaluation network 204 is used to estimate the evaluation function; from it the advantage function A_π(s_t, a_t) can be calculated according to equations (14) and (15), and the approximate return function L_π(π′) is further obtained and used to update the parameters of the strategy distribution of the selection network, screening the actions step by step. In the motion scenario of the mechanical arm, the mechanical arm can obtain a higher return function L_π(π′) only by selecting a suitable moving velocity v and yaw angular velocity ω.
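A minimal PyTorch sketch of the selection (actor) and evaluation (critic) networks following the described architecture is given below. Only the 256-node and 128-node layer sizes, the tanh activation of the second layer and the tanh-limited expectations come from the text; the ReLU activation of the first layer, the state-independent standard deviation, and all names and dimensions are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """Selection network (actor) and evaluation network (critic) sharing
    a fully connected trunk, loosely following the architecture of Fig. 2."""

    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 256)   # first layer: 256 nodes
        self.fc2 = nn.Linear(256, 128)         # second layer: 128 nodes, tanh
        # Selection-network head: expectations for (v, omega); a learned
        # state-independent log standard deviation is used for simplicity.
        self.mu_head = nn.Linear(128, 2)
        self.log_sigma = nn.Parameter(torch.zeros(2))
        # Evaluation-network head: state value estimate from the first layer.
        self.value_head = nn.Linear(256, 1)

    def forward(self, state, v_max=0.5, w_max=1.0):
        h1 = torch.relu(self.fc1(state))
        h2 = torch.tanh(self.fc2(h1))
        # Equation (17): tanh-limited expectations scaled by the upper limits.
        limits = torch.tensor([v_max, w_max])
        mu = limits * torch.tanh(self.mu_head(h2))
        sigma = self.log_sigma.exp()
        dist = torch.distributions.Normal(mu, sigma)   # equation (16)
        value = self.value_head(h1)
        return dist, value

# Usage: sample a continuous (v, omega) action for one state vector.
model = DecisionModel(state_dim=10)
dist, value = model(torch.randn(10))
v, omega = dist.sample()
```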
By adopting the mechanical arm autonomous motion planning method provided by the embodiment of the invention, an action strategy function based on normal distribution is used as the action strategy function of the decision model, and the actions of the mechanical arm are decided based on it; the decided actions can sample the moving speed and the yaw angular velocity from the normal distribution so as to output actions with continuous moving speed values and continuous yaw angular velocity values, thereby improving the operation accuracy of the mechanical arm in continuous motion and making it suitable for complex actual environments.
A second aspect of the embodiments of the present invention provides a robot arm autonomous movement planning apparatus, including:
a data obtaining module 401, configured to obtain mechanical arm data and obstacle data, where the mechanical arm data includes a position coordinate value of a mechanical arm, a current moving velocity value, and a current yaw velocity value, and the obstacle data includes a position coordinate value and size data of an obstacle;
an action evaluation index value acquisition module 402, configured to obtain an action evaluation index value by using an artificial potential field algorithm according to the mechanical arm data and the obstacle data;
a motion decision module 403, configured to input the mechanical arm data, the obstacle data, and the motion evaluation index value as state values into a preset decision model, and select a movement motion of the mechanical arm, a continuous velocity value of the movement motion, a yaw motion, and a continuous angular velocity value of the yaw motion through the decision model; the decision model adopts an action strategy function based on normal distribution.
As an improvement, the action policy function is specifically:
π(x | s, θ) = (1 / (√(2π)·σ(s, θ))) · exp(−(x − μ(s, θ))² / (2σ(s, θ)²))
wherein s represents a state value, theta represents a parameter vector value, alpha represents a learning rate, sigma (s, theta) represents a strategy distribution variance with the parameter vector value theta, mu (s, theta) represents a strategy distribution expectation with the parameter vector value theta, and pi (x | s, theta) represents the probability of selecting the action x when the state value of the decision model is s and the parameter vector value is theta.
As an improvement, the action decision module 403 is further configured to:
respectively limiting the normal distribution expectation of the moving speed of the mechanical arm and the normal distribution expectation of the yaw angular velocity of the mechanical arm by adopting a tanh activation function, and respectively obtaining a speed strategy distribution expectation and an angular velocity strategy distribution expectation;
obtaining a normal distribution and a moving action strategy function of the moving speed according to the strategy distribution variance of the moving speed and the speed strategy distribution expectation;
selecting the moving action of the mechanical arm according to the moving action strategy function, sampling the moving speed according to the normal distribution of the moving speed by the moving action, and selecting the continuous speed value;
obtaining a normal distribution and a yaw action strategy function of the yaw angular velocity according to the strategy distribution variance of the yaw angular velocity and the strategy distribution expectation of the angular velocity;
and selecting the yaw action of the mechanical arm according to the yaw action strategy function, wherein the yaw action samples the yaw velocity according to the normal distribution of the yaw velocity to obtain the continuous angular velocity value.
As an improvement, the apparatus for planning autonomous movement of a mechanical arm further includes a decision model obtaining module 404, configured to:
acquiring mechanical arm data and obstacle data at multiple moments;
obtaining action evaluation index values at a plurality of moments by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data at the plurality of moments;
inputting the mechanical arm data, the obstacle data and the action evaluation index value at a plurality of moments as state values into a PPO model;
and training the PPO model by adopting a PPO algorithm based on the action strategy function of normal distribution, and obtaining the decision model.
As one improvement, the decision model comprises an input layer 201, a full connection layer 202, a selection network 203 and an evaluation network 204;
the input layer 201 is used for inputting a state space sequence (S_1, S_2, …, S_t) to the fully-connected layer 202, where S_t represents the state value at time t;
the fully-connected layer 202 comprises a first active layer 301 and a second active layer 302;
the number of the nodes of the first active layer 301 is 256, one node corresponds to one moving action or yaw action of the mechanical arm, and according to the state space sequence, the first active layer 301 selects an evaluation function corresponding to the moving action or yaw action of the mechanical arm, wherein the evaluation function comprises a state cost function and an action cost function;
the second active layer 302 is configured to respectively limit normal distribution expectation of the moving speed of the mechanical arm and normal distribution expectation of the yaw rate of the mechanical arm by using a tanh active function according to the state space sequence, and respectively obtain speed strategy distribution expectation and angular velocity strategy distribution expectation;
the evaluation network 204 is configured to estimate the evaluation function according to the state space sequence, obtain an advantage function according to the estimated evaluation function, obtain a report value according to the advantage function, and update the parameter vector value according to the report value;
the selection network 203 constructs the moving action strategy function, the normal distribution of the moving velocity, the yaw action strategy function, and the normal distribution of the angular velocity according to the velocity strategy distribution expectation and the angular velocity strategy distribution expectation obtained by the second active layer 302, selects the moving action of the robot arm according to the moving action strategy function, samples the moving velocity according to the normal distribution of the moving velocity by the moving action, selects the continuous velocity value, selects the yaw action of the robot arm according to the yaw action strategy function, samples the yaw velocity according to the normal distribution of the yaw velocity, and obtains the continuous angular velocity value.
As an improvement, the action evaluation index value acquisition module 402 is further configured to:
acquiring a risk factor according to the mechanical arm data and the obstacle data;
acquiring repulsive potential energy of the barrier according to the danger factor and a preset danger threshold;
obtaining the action evaluation index value according to the repulsive force potential energy and the attractive force potential energy of the destination area;
wherein the risk factor is defined as:
t = d_0 − max(L, W)
where t represents the value of the risk factor, d_0 represents the distance between the mechanical arm and the obstacle, and L and W respectively represent the length and the width of the obstacle;
the repulsive potential energy is defined as:
U′_r = (1/2)·η·(1/t* − 1/t_0)² when t* ≤ t_0, and U′_r = 0 when t* > t_0,
where U′_r denotes the repulsive potential energy, η denotes the repulsive factor, t* represents the minimum value of the risk factor, and t_0 represents the preset danger threshold;
the gravitational potential energy is defined as:
U′_a = (1/2)·k_p·d_g² when d_g ≥ d_g^0, and U′_a = 0 when d_g < d_g^0,
where U′_a denotes the gravitational potential energy, k_p denotes the gravitational factor, d_g represents the distance between the mechanical arm and the destination, and d_g^0 represents the distance threshold between the moving platform of the mechanical arm and the destination;
the action evaluation index value is obtained by the formula U = U′_r + U′_a.
As an improvement, the data obtaining module 401 is further configured to:
acquiring a position coordinate value of the mechanical arm based on a base position induction sensor arranged on the mechanical arm;
the method comprises the steps that a current moving speed value of the mechanical arm is obtained based on at least one speed induction sensor installed on the mechanical arm, and a current yaw speed value of the mechanical arm is obtained based on at least one yaw type angular speed sensor installed on the mechanical arm;
and obtaining laser radar point cloud data of the obstacle according to a laser radar, and clustering the laser radar point cloud data by adopting an Euclidean distance clustering method of a K-D tree to obtain the obstacle data.
It should be noted that, the apparatus for planning autonomous motion of a mechanical arm according to the embodiment of the present invention can implement all processes of the method for planning autonomous motion of a mechanical arm according to any embodiment of the present invention, and the functions and technical effects of the modules in the apparatus are respectively the same as those of the method for planning autonomous motion of a mechanical arm according to the embodiment of the present invention, and are not described herein again.
A third aspect of the embodiments of the present invention provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the method for planning autonomous movement of a robot arm according to any one of the embodiments of the first aspect.
The terminal device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The terminal device may include, but is not limited to, a processor, a memory. The terminal device may also include input and output devices, network access devices, buses, etc.
The Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is the control center of the terminal device and connects the various parts of the whole terminal device using various interfaces and lines.
The memory may be used for storing the computer programs and/or modules, and the processor may implement various functions of the terminal device by executing or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where the computer program, when running, controls an apparatus where the computer-readable storage medium is located to perform the method for planning autonomous movement of a robot arm according to any one of the embodiments of the first aspect.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by software plus a necessary hardware platform, and may also be implemented by hardware entirely. With this understanding in mind, all or part of the technical solutions of the present invention that contribute to the background can be embodied in the form of a software product, which can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes instructions for causing a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments or some parts of the embodiments of the present invention.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for planning autonomous motion of a mechanical arm is characterized by comprising the following steps:
acquiring mechanical arm data and obstacle data, wherein the mechanical arm data comprises position coordinate values of a mechanical arm, a current movement speed value and a current yaw speed value, and the obstacle data comprises position coordinate values and size data of an obstacle;
obtaining an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the barrier data;
inputting the mechanical arm data, the obstacle data and the action evaluation index value as state values into a preset decision model, and selecting the movement action of the mechanical arm, the continuous speed value of the movement action, the yaw action and the continuous angular speed value of the yaw action through the decision model; the decision model adopts an action strategy function based on normal distribution.
2. The method for planning the autonomous motion of a mechanical arm according to claim 1, wherein the action strategy function is specifically:
π(x | s, θ) = (1 / (√(2π)·σ(s, θ))) · exp(−(x − μ(s, θ))² / (2σ(s, θ)²))
wherein s represents a state value, theta represents a parameter vector value, alpha represents a learning rate, sigma (s, theta) represents a strategy distribution variance with the parameter vector value theta, mu (s, theta) represents a strategy distribution expectation with the parameter vector value theta, and pi (x | s, theta) represents the probability of selecting the action x when the state value of the decision model is s and the parameter vector value is theta.
3. The method for planning the autonomous motion of a mechanical arm according to claim 2, wherein the selecting of the moving motion, the continuous velocity value of the moving motion, the yaw motion, and the continuous angular velocity value of the yaw motion of the mechanical arm by the decision model specifically comprises:
limiting the normal distribution expectation of the moving speed of the mechanical arm and the normal distribution expectation of the yaw rate of the mechanical arm by adopting a tanh activation function, and respectively obtaining a speed strategy distribution expectation and an angular speed strategy distribution expectation;
obtaining a normal distribution and a moving action strategy function of the moving speed according to the strategy distribution variance of the moving speed and the speed strategy distribution expectation;
selecting the moving action of the mechanical arm according to the moving action strategy function, wherein the moving action samples the moving speed according to the normal distribution of the moving speed to obtain the continuous speed value;
obtaining a normal distribution and a yaw action strategy function of the yaw angular velocity according to the strategy distribution variance of the yaw angular velocity and the strategy distribution expectation of the angular velocity;
and selecting the yaw action of the mechanical arm according to the yaw action strategy function, wherein the yaw action samples the yaw velocity according to the normal distribution of the yaw velocity to obtain the continuous angular velocity value.
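A minimal sketch of the selection step in claim 3, assuming the tanh activation simply bounds the raw network outputs to physical speed limits; the limits v_max and w_max and all function names are illustrative assumptions, not taken from the patent.

```python
import math
import random

def bounded_expectation(raw_output, limit):
    """Limit a raw normal-distribution expectation with a tanh activation,
    so the expected value stays within [-limit, +limit]."""
    return limit * math.tanh(raw_output)

def select_continuous_actions(raw_mu_v, sigma_v, raw_mu_w, sigma_w,
                              v_max=0.5, w_max=1.0):
    """Sample a continuous moving-speed value and a continuous yaw angular-velocity
    value from their normal distributions, as in claim 3."""
    mu_v = bounded_expectation(raw_mu_v, v_max)   # speed strategy distribution expectation
    mu_w = bounded_expectation(raw_mu_w, w_max)   # angular-speed strategy distribution expectation
    v = random.gauss(mu_v, sigma_v)               # continuous speed value of the moving action
    w = random.gauss(mu_w, sigma_w)               # continuous angular speed value of the yaw action
    return v, w

print(select_continuous_actions(0.8, 0.05, -0.2, 0.1))
```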
4. The method for planning the autonomous motion of a mechanical arm according to claim 2, wherein the decision model is obtained in advance by:
acquiring mechanical arm data and obstacle data at multiple moments;
obtaining action evaluation index values at multiple moments by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data at the multiple moments;
inputting the mechanical arm data, the obstacle data and the action evaluation index value at a plurality of moments as state values into a PPO model;
and training the PPO model by adopting a PPO algorithm based on the normal-distribution action strategy function to obtain the decision model.
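Claim 4 trains the decision model with the PPO algorithm. The clipped surrogate objective below is the standard PPO formulation and is offered only as a sketch of that training step; the clipping value ε = 0.2 and the function name are assumptions rather than values taken from the patent.

```python
def ppo_clipped_objective(ratios, advantages, epsilon=0.2):
    """Standard PPO clipped surrogate objective.

    ratios     : pi_new(a|s) / pi_old(a|s) for each sampled (state, action) pair
    advantages : advantage estimates produced by the evaluation network
    """
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = max(min(r, 1.0 + epsilon), 1.0 - epsilon)  # clip the probability ratio
        total += min(r * adv, clipped * adv)                 # pessimistic (clipped) surrogate
    return total / len(ratios)

print(ppo_clipped_objective([1.1, 0.7, 1.4], [0.5, -0.2, 0.3]))
```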
5. The method for planning the autonomous motion of a mechanical arm according to claim 3, wherein the decision model comprises an input layer, a fully connected layer, a selection network and an evaluation network;
the input layer is used for inputting a state space sequence (S_1, S_2, …, S_t) to the fully connected layer, wherein S_t represents the state value at time t;
the fully connected layer comprises a first activation layer and a second activation layer;
the first activation layer has 256 nodes, wherein one node corresponds to one moving action or yaw action of the mechanical arm; according to the state space sequence, the first activation layer selects an evaluation function corresponding to the moving action or yaw action of the mechanical arm, wherein the evaluation function comprises a state value function and an action value function;
the second activation layer is used for limiting normal distribution expectation of the moving speed of the mechanical arm and normal distribution expectation of the yaw velocity of the mechanical arm by adopting a tanh activation function according to the state space sequence, and obtaining speed strategy distribution expectation and angular velocity strategy distribution expectation respectively;
the evaluation network is used for estimating the evaluation function according to the state space sequence, obtaining an advantage function according to the estimated evaluation function, obtaining a return value according to the advantage function, and updating the parameter vector value according to the return value;
the selection network constructs the moving action strategy function, the normal distribution of the moving speed, the yaw action strategy function and the normal distribution of the angular velocity according to the speed strategy distribution expectation and the angular velocity strategy distribution expectation obtained by the second activation layer, selects the moving action of the mechanical arm according to the moving action strategy function, samples the moving speed according to the normal distribution of the moving speed by the moving action, selects the continuous velocity value, selects the yaw action of the mechanical arm according to the yaw action strategy function, samples the yaw velocity according to the normal distribution of the yaw velocity, and obtains the continuous angular velocity value.
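The layer layout of claim 5 could look roughly like the following PyTorch sketch. The 256-node fully connected layer, the tanh-limited expectations and an evaluation (value) head come from the claim; the hidden sizes, parameter names, speed limits and class name are assumptions.

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """Rough sketch of the decision model of claim 5: a shared fully connected
    layer, a selection (policy) head with tanh-bounded expectations for the moving
    speed and yaw angular velocity, and an evaluation (value) head."""

    def __init__(self, state_dim, v_max=0.5, w_max=1.0):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, 2)                # raw expectations for [speed, yaw rate]
        self.log_sigma = nn.Parameter(torch.zeros(2))   # strategy distribution deviations
        self.value_head = nn.Linear(256, 1)             # evaluation network output
        self.limits = torch.tensor([v_max, w_max])

    def forward(self, state):
        h = self.shared(state)
        mu = self.limits * torch.tanh(self.mu_head(h))  # tanh-limited expectations
        sigma = self.log_sigma.exp()
        value = self.value_head(h)
        return mu, sigma, value

model = DecisionModel(state_dim=10)
mu, sigma, value = model(torch.zeros(10))
action = torch.distributions.Normal(mu, sigma).sample()  # continuous [speed, yaw rate]
```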
6. The method for planning the autonomous motion of the mechanical arm according to claim 1, wherein the obtaining of the action evaluation index value by using an artificial potential field algorithm according to the mechanical arm data and the obstacle data specifically comprises:
acquiring a risk factor according to the mechanical arm data and the obstacle data;
acquiring repulsive potential energy of the obstacle according to the risk factor and a preset risk threshold;
obtaining the action evaluation index value according to the repulsive force potential energy and the attractive force potential energy of the destination area;
wherein the risk factor is defined as:
t = d_0 - max(L, W)
t represents the value of the risk factor, d_0 represents the distance between the mechanical arm and the obstacle, and L and W respectively represent the length and the width of the obstacle;
the repulsive potential is defined as:
U′_r = (1/2)·η·(1/t* - 1/t_0)²,  if t* ≤ t_0
U′_r = 0,                        if t* > t_0
U′_r represents the repulsive potential energy, η represents the repulsive factor, t* represents the minimum value of the risk factor, and t_0 represents the preset risk threshold;
the attractive potential energy is defined as:
U′_a = (1/2)·k_p·d_g²,                    if d_g ≤ d_g*
U′_a = k_p·d_g*·d_g - (1/2)·k_p·(d_g*)²,  if d_g > d_g*
U′_a represents the attractive potential energy, k_p represents the attraction factor, d_g represents the distance between the mechanical arm and the destination, and d_g* represents a distance threshold between the moving platform of the mechanical arm and the destination;
the motion evaluation index value is represented by a formula U = U' r +U′ a And (4) obtaining.
7. The method for planning the autonomous motion of a mechanical arm according to claim 1, wherein the acquiring of mechanical arm data and obstacle data specifically comprises:
acquiring a position coordinate value of the mechanical arm based on a base position sensor mounted on the mechanical arm;
acquiring a current moving speed value of the mechanical arm based on at least one speed sensor mounted on the mechanical arm, and acquiring a current yaw speed value of the mechanical arm based on at least one yaw angular velocity sensor mounted on the mechanical arm;
and acquiring laser radar point cloud data of the obstacle by means of a laser radar, and clustering the laser radar point cloud data by adopting a K-D tree based Euclidean distance clustering method to obtain the obstacle data.
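Claim 7 clusters the laser radar point cloud with a K-D-tree-based Euclidean distance clustering method. The sketch below implements the usual region-growing variant of that technique with SciPy's cKDTree; the distance tolerance and minimum cluster size are illustrative values, not taken from the patent.

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_clustering(points, tolerance=0.2, min_size=5):
    """Group lidar points into obstacle clusters: grow each cluster by repeatedly
    pulling in all neighbours within `tolerance` of a seed point (K-D tree search)."""
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            idx = frontier.pop()
            for nb in tree.query_ball_point(points[idx], r=tolerance):
                if nb in unvisited:
                    unvisited.remove(nb)
                    cluster.add(nb)
                    frontier.append(nb)
        if len(cluster) >= min_size:
            clusters.append(points[sorted(cluster)])
    return clusters

# Example: two well-separated blobs of points become two obstacle clusters
cloud = np.vstack([np.random.rand(20, 3) * 0.1, np.random.rand(20, 3) * 0.1 + 2.0])
print(len(euclidean_clustering(cloud)))
```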
8. A mechanical arm autonomous motion planning apparatus, comprising:
the data acquisition module is used for acquiring mechanical arm data and obstacle data, wherein the mechanical arm data comprises position coordinate values of a mechanical arm, a current moving speed value and a current yaw speed value, and the obstacle data comprises position coordinate values and size data of an obstacle;
the action evaluation index value acquisition module is used for acquiring an action evaluation index value by adopting an artificial potential field algorithm according to the mechanical arm data and the obstacle data;
the action decision module is used for inputting the mechanical arm data, the obstacle data and the action evaluation index value into a preset decision model as state values, and selecting the movement action of the mechanical arm, the continuous speed value of the movement action, the yaw action and the continuous angular speed value of the yaw action through the decision model; and the decision model adopts an action strategy function based on normal distribution.
9. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method for planning the autonomous motion of a mechanical arm according to any one of claims 1 to 7.
10. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method for planning the autonomous motion of a mechanical arm according to any one of claims 1 to 7.
CN202111143685.1A 2021-09-28 2021-09-28 Mechanical arm autonomous motion planning method, device, equipment and storage medium Active CN113910221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111143685.1A CN113910221B (en) 2021-09-28 2021-09-28 Mechanical arm autonomous motion planning method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113910221A CN113910221A (en) 2022-01-11
CN113910221B true CN113910221B (en) 2023-01-17

Family

ID=79236646

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111143685.1A Active CN113910221B (en) 2021-09-28 2021-09-28 Mechanical arm autonomous motion planning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113910221B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114667852B (en) * 2022-03-14 2023-04-14 广西大学 Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008242859A (en) * 2007-03-27 2008-10-09 Sony Corp Motion control device for object, motion control method, and computer program
CN109960880A (en) * 2019-03-26 2019-07-02 上海交通大学 A kind of industrial robot obstacle-avoiding route planning method based on machine learning
CN110632931A (en) * 2019-10-09 2019-12-31 哈尔滨工程大学 Mobile robot collision avoidance planning method based on deep reinforcement learning in dynamic environment
CN110794842A (en) * 2019-11-15 2020-02-14 北京邮电大学 Reinforced learning path planning algorithm based on potential field
CN111645065A (en) * 2020-03-25 2020-09-11 南京大学 Mechanical arm motion planning method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113910221A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
Haarnoja et al. Reinforcement learning with deep energy-based policies
Siekmann et al. Learning memory-based control for human-scale bipedal locomotion
CN112937564B (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
US20210158162A1 (en) Training reinforcement learning agents to learn farsighted behaviors by predicting in latent space
CN112119404A (en) Sample efficient reinforcement learning
CN116776964A (en) Method, program product and storage medium for distributed reinforcement learning
CN110132282B (en) Unmanned aerial vehicle path planning method and device
CN102708377B (en) Method for planning combined tasks for virtual human
Datta et al. Integrating egocentric localization for more realistic point-goal navigation agents
KR102303126B1 (en) Method and system for optimizing reinforcement learning based navigation to human preference
Nicola et al. A LSTM neural network applied to mobile robots path planning
JP7448683B2 (en) Learning options for action selection using meta-gradient in multi-task reinforcement learning
CN110442129A (en) A kind of control method and system that multiple agent is formed into columns
CN111830822A (en) System for configuring interaction with environment
Bakker et al. Quasi-online reinforcement learning for robots
CN107967513A (en) Multirobot intensified learning collaboratively searching method and system
Pan et al. Additional planning with multiple objectives for reinforcement learning
CN113910221B (en) Mechanical arm autonomous motion planning method, device, equipment and storage medium
CN115860107A (en) Multi-machine search method and system based on multi-agent deep reinforcement learning
CN117885115B (en) Track optimization method and device for multiple optimization targets of welding robot
CN118201742A (en) Multi-robot coordination using a graph neural network
CN115265547A (en) Robot active navigation method based on reinforcement learning in unknown environment
Lee et al. Sampling of pareto-optimal trajectories using progressive objective evaluation in multi-objective motion planning
CN113110101A (en) Production line mobile robot gathering type recovery and warehousing simulation method and system
CN115542901B (en) Deformable robot obstacle avoidance method based on near-end strategy training

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant