CN114609918A - Quadruped robot motion control method, system, storage medium and equipment - Google Patents

Quadruped robot motion control method, system, storage medium and equipment

Info

Publication number
CN114609918A
CN114609918A
Authority
CN
China
Prior art keywords
quadruped robot
action
robot
network
quadruped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210512279.6A
Other languages
Chinese (zh)
Other versions
CN114609918B (en)
Inventor
李彬
刘伟龙
侯兰东
杨姝慧
徐一明
张友梅
张瑜
张明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jiqing Technology Service Co.,Ltd.
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202210512279.6A
Publication of CN114609918A
Application granted
Publication of CN114609918B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention relates to the technical field of adaptive control, and provides a quadruped robot motion control method, system, storage medium and equipment, wherein the method comprises the following steps: acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network; acquiring the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action; and combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending the action instruction to the quadruped robot to realize its motion, thereby achieving more stable and robust motion planning and control of the quadruped robot.

Description

Quadruped robot motion control method, system, storage medium and equipment
Technical Field
The invention belongs to the technical field of adaptive control, and particularly relates to a quadruped robot motion control method, system, storage medium and equipment.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional quadruped robot control methods usually require accurate dynamic and kinematic modeling and analysis of the robot in advance, with the angle and torque to be executed by each joint driver solved inversely from the desired trajectory and the foot end feedback force. This process demands a large amount of professional knowledge and a long manual design effort, and it is difficult to design a robust controller for agile quadruped robot motion in this way. In addition, delay and noise interference exist in reality, so the model analysis of the quadruped robot is often not accurate enough, which further increases the difficulty of model analysis and system control. How to give the robot the ability to learn motion autonomously and thereby achieve adaptive control of quadruped robot motion is one of the difficulties that urgently needs to be solved at present.
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, and in recent years has achieved breakthrough performance in many fields. The motion planning and control task of a robot can be described as a perception and decision problem, so deep reinforcement learning is a very promising technology in the field of robot motion control: it requires little human intervention from researchers, and by training the quadruped robot to learn a control strategy autonomously it can generate robust motion with low energy consumption and high dynamics. However, the design process of traditional deep-reinforcement-learning-based quadruped robot controllers is complicated, and the stability, efficiency and universality of the resulting quadruped robot motion control are poor.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a quadruped robot motion control method, system, storage medium and equipment, in which a gait reference frame is added to guide the quadruped robot to generate the desired gait motion, so that the trained robot can realize more stable and robust motion planning and control.
In order to achieve the purpose, the invention adopts the following technical scheme:
A first aspect of the present invention provides a quadruped robot motion control method, including:
acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network;
acquiring the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
Figure 8910DEST_PATH_IMAGE001
wherein, the first and the second end of the pipe are connected with each other,
Figure 548476DEST_PATH_IMAGE002
and
Figure 336303DEST_PATH_IMAGE003
representing the reference action and the action output by the policy network respectively,
Figure 644925DEST_PATH_IMAGE004
and
Figure 860006DEST_PATH_IMAGE005
respectively representing the weight coefficients of the reference actions and the weight coefficients of the actions output by the policy network.
Further, the state includes the pitch angle, roll angle, pitch angular velocity, roll angular velocity, and positions of the respective joints of the quadruped robot.
Further, the reference action is calculated by:
determining gait parameters of the quadruped robot;
calculating the desired foot end trajectory based on the foot end position of the quadruped robot when walking in the environment, in combination with the gait parameters; and
performing inverse kinematics calculation on the desired foot end trajectory to obtain the reference action.
Further, the desired foot end trajectory is of the composite-cycloid form:

x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2,  0 ≤ t ≤ T

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t, x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot, s denotes the step length of the leg swing, h denotes the step height of the leg swing, and T represents the period of a single step.
Further, after the quadruped robot executes the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
Further, the reward is calculated by adopting a reward function;
the reward function comprises a forward speed reward item, a deflection speed penalty item, an energy consumption penalty item, a centroid trajectory floating penalty item and a posture angle change penalty item.
Further, the policy network is trained as follows:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network;
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
A second aspect of the present invention provides a quadruped robot motion control system comprising:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a quadruped robot motion control method as described above.
A fourth aspect of the present invention provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a method for controlling the motion of a quadruped robot as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a motion control method of a quadruped robot, which is characterized in that a gait reference frame is added, the gait guide frame outputs a reference action according to an expected gait, the reference action instruction is combined with a learned action instruction and then transmitted to a joint driver of the quadruped robot to be executed, the quadruped robot is guided to generate expected gait motion, the trained robot can realize more stable and robust motion planning and control, and the autonomy and intelligence level of the quadruped robot motion planning and control are improved.
The invention provides a quadruped robot motion control method whose state covers the information directly and closely related to the generated motion of the quadruped robot, which avoids the curse of dimensionality, reduces the computational pressure and improves the learning efficiency of the quadruped robot control strategy.
The invention provides a quadruped robot motion control method whose reward function comprises a forward speed reward term, a yaw speed penalty term, an energy consumption penalty term, a centroid trajectory floating penalty term and an attitude angle change penalty term, encouraging the quadruped robot to learn fast and stable forward motion. The method also has strong universality: different task objectives can be encouraged by adjusting the weight of each reward term, thereby achieving the desired quadruped robot control effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
Fig. 1 is a flowchart of a method for controlling the motion of a quadruped robot according to a first embodiment of the present invention;
FIG. 2 is a leg phase diagram for a diagonal sprint gait according to a first embodiment of the invention;
FIG. 3 is a leg phase diagram for jumping gait according to a first embodiment of the invention;
fig. 4 is a diagram of reward values obtained by the quadruped robot in accordance with the first embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example one
The embodiment provides a method for controlling the motion of a quadruped robot, as shown in fig. 1, which specifically includes the following steps:
step 1, obtaining the state of the quadruped robot when walking in the environmentsSelecting actions based on state through policy networkaActions to be selectedaActions as policy network exports
Figure 574310DEST_PATH_IMAGE012
Step 2,The method comprises the steps of obtaining the position of the foot end of the quadruped robot when the quadruped robot walks in the environment, and calculating to obtain a reference action
Figure 797481DEST_PATH_IMAGE013
And 3, combining the reference action with the action output by the strategy network to obtain the action executed by the quadruped robot, sending an action command to each joint of the quadruped robot to realize the motion of the quadruped robot, and after the quadruped robot executes the action, transferring the state, namely the statesTransfer to the next state
Figure 534493DEST_PATH_IMAGE014
To obtain a rewardrAnd used to train the policy network.
In step 1, a Markov Decision Process (MDP) is used to model the quadruped robot motion control problem. Reinforcement learning methods handle discrete or continuous decision problems, which are typically modeled as Markov decision processes. Since robot motion control is a continuous decision problem, the invention models the quadruped robot motion control problem as a Markov decision process represented by the quadruple (S, A, P, R), wherein S denotes the state space, also called the observation space, consisting of basic information of the quadruped robot; A denotes the action space, consisting of the commands executed by the quadruped robot's joints; P denotes the state transition probability, determined by the interaction process between the quadruped robot and the environment; and R denotes the reward function, which serves as the evaluation basis of the learning effect and is designed by the user according to the task objective the quadruped robot is to learn. The quadruped robot, as the entity learning the control strategy π, is regarded as the agent in the reinforcement learning problem. The control strategy π can be viewed as a mapping from a state s to an action a and is represented by a neural network, with the state s as the network input and the action a as the network output, where the state s is an element of the state space S and the action a is an element of the action space A. The goal of reinforcement learning is to find the optimal control strategy that maximizes the expected return value:
J(π) = E[ Σ_{t=0}^{∞} γ^t r_t ]    (1)

wherein γ denotes the discount factor, and r_t denotes the reward value fed back by the reward mechanism at time t, computed through the reward function R. In a given state s, the quadruped robot finds the optimal action a according to the strategy function; after the quadruped robot executes the optimal action, a state transition occurs, yielding the state s′ at the next moment together with the reward return value. Since the reward function is set according to the desired task objective, the higher the obtained reward value, the closer the quadruped robot is considered to be to the desired control effect, exhibiting better performance.
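As a concrete illustration of equation (1), the following minimal Python sketch accumulates the discounted return of a finite episode; the reward sequence shown is a hypothetical placeholder, not data from the invention.

def discounted_return(rewards, gamma=0.99):
    # Fold the reward sequence backwards: G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.5, 0.8, 1.0]))  # hypothetical per-step rewards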
In step 1, in order to avoid the computational pressure caused by an excessively high-dimensional state space, the state comprises the pitch angle, roll angle, pitch angular velocity, roll angular velocity and the positions of the joints of the quadruped robot, i.e., the state comprises q_i, θ, φ, θ̇ and φ̇, wherein q_i denotes the position of the i-th joint of the quadruped robot, i = 1, 2, 3, …, n, n denotes the number of quadruped robot joints and can be 12, θ and φ respectively denote the pitch angle and roll angle of the quadruped robot, and θ̇ and φ̇ respectively denote the pitch angular velocity and roll angular velocity of the quadruped robot. The values of these states are obtained by reading the information of the robot model in the simulation environment. Although the simulation platform and the sensors of the quadruped robot can acquire a variety of information such as position, torque, attitude, velocity and angular velocity as states, and can even acquire images, videos and the like containing a large amount of environmental information, observation information of too high a dimension increases the computational pressure, making the algorithm converge slowly and lowering the efficiency with which the quadruped robot learns motion planning and control strategies. The state of the invention therefore comprises q_i, θ, φ, θ̇ and φ̇, covering the state information directly and closely related to the generated motion of the quadruped robot, which avoids the curse of dimensionality, reduces the computational pressure and improves the learning efficiency of the quadruped robot control strategy.
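For illustration, a minimal Python sketch of assembling this low-dimensional state vector; the robot accessor names (get_joint_positions and so on) are hypothetical stand-ins for whatever interface the simulation platform provides.

import numpy as np

def build_state(robot):
    joint_pos = robot.get_joint_positions()        # q_1 ... q_n, e.g. n = 12
    pitch, roll = robot.get_attitude()             # attitude angles (assumed accessor)
    pitch_rate, roll_rate = robot.get_attitude_rates()
    return np.concatenate([[pitch, roll, pitch_rate, roll_rate], joint_pos])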
In step 1, the control strategy of the quadruped robot is trained with a deep reinforcement learning algorithm; that is, the control strategy is required to output correct actions so that the quadruped robot executes the action commands and produces the desired motion effect. For the quadruped robot control task, the most intuitive action space choices are the torques or positions of the joint actuators. However, joint motors are easily disturbed by uncertain external forces while executing torque commands, which performs poorly in learning-based quadruped robot motion control tasks. Therefore, the invention selects the joint action commands of the quadruped robot as the actions executed by the agent. Thus, the action comprises the rotation angle of each joint of the quadruped robot, i.e., the action comprises a_1, a_2, …, a_n, wherein a_i denotes the action executed by the i-th joint; since the joints of the quadruped robot are all rotationally driven, the action command of each joint actuator is the rotation angle of that actuator. In order to ensure that the action values of the policy network are reasonable, the actions output by the neural network are truncated to a reasonable range, which ensures the overall stability of the quadruped robot in the early stage of training, reduces the falling frequency of the quadruped robot during training, and improves the efficiency of strategy learning to a certain extent.
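A minimal sketch of this truncation step; the per-joint bound of 0.5 rad is an assumed example value, not a limit prescribed by the invention.

import numpy as np

JOINT_LIMIT = 0.5  # rad; assumed allowable excursion around the stance pose

def clip_action(raw_action):
    # Truncate the policy network output to a reasonable joint-angle range.
    return np.clip(raw_action, -JOINT_LIMIT, JOINT_LIMIT)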
In step 1, the state, comprising the pitch angle, roll angle, pitch angular velocity, roll angular velocity and joint positions of the quadruped robot, serves as the input of the policy network. In order to avoid the computational pressure caused by an excessively high-dimensional state space, part of the collected information is not added to the state space; instead, this information is used as the evaluation basis in the reward function, so that a more complete reward mechanism is designed. Aiming at the motion control effect to be realized, a set of universal reward mechanisms for the quadruped robot is designed. The main components of the reward mechanism comprise the forward speed, yaw speed, energy consumption, centroid floating height and attitude angle change value of the quadruped robot, and the reward mechanism encourages the quadruped robot to learn fast and stable forward motion. The reward function comprises a forward speed reward term, a yaw speed penalty term, an energy consumption penalty term, a centroid trajectory floating penalty term and an attitude angle change penalty term, and takes the form:

r = w₁·(x′ - x)/Δt - w₂·|y′ - y|/Δt - w₃·Σ_{j=1}^{n} |τ_j·ω_j|·Δt - w₄·|h′ - h| - (k₁·|θ′ - θ| + k₂·|φ′ - φ| + k₃·|ψ′ - ψ|) - r_fall    (2)

wherein the five terms are respectively the forward speed reward term, the yaw speed penalty term, the energy consumption penalty term, the centroid trajectory floating penalty term and the attitude angle change penalty term; w₁, w₂, w₃ and w₄ are their weighting coefficients, adjusted according to the task objective to be achieved; x and y denote the positions of the quadruped robot centroid on the x axis and y axis of the world coordinate system at the current moment (before the state transition), and x′ and y′ denote the corresponding positions at the next moment (after the state transition); Δt denotes the time step; τ_j denotes the torque of the j-th joint motor; ω_j denotes the angular velocity of the j-th joint motor; n denotes the total number of joint motors; h and h′ respectively denote the centroid height of the quadruped robot at the current moment and the next moment; θ and θ′ respectively denote the pitch angle at the current moment and the next moment; φ and φ′ respectively denote the roll angle at the current moment and the next moment; ψ and ψ′ respectively denote the yaw angle at the current moment and the next moment; k₁, k₂ and k₃ respectively denote the reward weight coefficients of the pitch angle, roll angle and yaw angle; and the fall penalty value r_fall is a constant. A reward mechanism designed in this way has strong universality: different task objectives can be encouraged by adjusting the weight of each reward term, thereby achieving the desired quadruped robot control effect.
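For illustration, a minimal Python sketch of a reward of the form of equation (2); the weight values and the fields of the state snapshots prev and curr are assumed examples, not values prescribed by the invention.

import numpy as np

def reward(prev, curr, dt, fell,
           w=(15.0, 5.0, 0.01, 10.0), k=(5.0, 5.0, 5.0), r_fall=100.0):
    r_forward = w[0] * (curr.x - prev.x) / dt                      # forward speed reward
    p_yaw     = w[1] * abs(curr.y - prev.y) / dt                   # deflection penalty
    p_energy  = w[2] * np.sum(np.abs(curr.tau * curr.omega)) * dt  # energy penalty
    p_height  = w[3] * abs(curr.h - prev.h)                        # centroid floating penalty
    p_att = (k[0] * abs(curr.pitch - prev.pitch)
             + k[1] * abs(curr.roll - prev.roll)
             + k[2] * abs(curr.yaw - prev.yaw))                    # attitude change penalty
    return r_forward - p_yaw - p_energy - p_height - p_att - (r_fall if fell else 0.0)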
The policy network in step 1 is trained with the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3), an improved version of the Deep Deterministic Policy Gradient algorithm (DDPG). The algorithm introduces the Actor-Critic (AC) framework of traditional reinforcement learning into the deep policy gradient method: deep neural networks represent the action-value function and the deterministic policy, a double-network architecture is used for both the policy function and the value function, and an experience replay mechanism is introduced to reduce the errors caused by sample correlation, so that the training process converges more easily, efficiency is improved on large-scale continuous-action-space tasks, and the obtained policy is more stable and efficient. In addition, the TD3 algorithm introduces clipped double Q-learning into the AC framework to alleviate the overestimation problem, and employs delayed policy network updates and added noise to alleviate error accumulation.

The specific steps of training the policy network with the TD3 algorithm are as follows:
(1) Initialization, performed only at the start of training, covering the value networks, the policy network, the target networks and the experience replay pool: initialize the value networks Q_θ1 and Q_θ2 and the policy network π_φ, with the value network parameters θ₁, θ₂ and the policy network parameters φ all set randomly; initialize the target networks (comprising the target value networks and the target policy network) by synchronizing the value network and policy network parameters to the target network parameters, i.e., θ′₁ ← θ₁, θ′₂ ← θ₂ and φ′ ← φ, wherein θ′₁ and θ′₂ are the target value network parameters and φ′ is the target policy network parameter; and initialize the experience replay pool. The value networks are used to fit the value function, evaluate the policy network, and provide gradient information for the policy network update. The target networks are used to compute the update target and update the value networks.
(2) The policy network selects the action a according to the state s and the noise:

a = π_φ(s) + ε,  ε ∼ OU(0, σ²)    (3)

wherein ε denotes the noise, OU denotes the Ornstein-Uhlenbeck process, and σ² denotes the variance of the noise.
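A minimal sketch of Ornstein-Uhlenbeck exploration noise as used in equation (3); theta, sigma and dt are assumed example parameters.

import numpy as np

class OUNoise:
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, I): mean-reverting noise
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x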
In step 3, after the quadruped robot executes the action, a state transition occurs, i.e., the state transfers to the next state s′ (the state at the next moment), and the reward r is obtained, wherein the reward r is computed through the reward function in equation (2); the action, the reward and the states before and after the transition are combined into a transition tuple (s, a, r, s′) and stored in the experience replay pool.
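The experience replay pool can be sketched as a bounded buffer of (s, a, r, s′) transition tuples; the capacity shown is an assumed example value.

import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)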
During policy network training, a small batch of N transition tuples is randomly sampled from the experience replay pool, and the target policy network computes the perturbed action:

ã = π_φ′(s′) + ε,  ε ∼ clip(𝒩(0, σ̃²), -c, c)    (4)

wherein clip denotes the clipping function that limits the noise to the interval [-c, c], and c is a constant. The target value networks then compute the update target:

y = r + γ·min_{i=1,2} Q_θ′ᵢ(s′, ã)    (5)

Then the parameters of the i-th (i = 1, 2) value network are updated:

θᵢ ← argmin_θᵢ (1/N)·Σ (y - Q_θᵢ(s, a))²    (6)
(3) Judge whether the policy network update interval is reached; if not, return to step (2) and continue updating the value network parameters. Otherwise, once the policy network update interval is reached, update the policy network parameters φ with the deterministic policy gradient method based on the updated value networks:

∇_φ J(φ) = (1/N)·Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} · ∇_φ π_φ(s)    (7)

and update the target network parameters with the soft update method:

θ′ᵢ ← τ·θᵢ + (1 - τ)·θ′ᵢ,  φ′ ← τ·φ + (1 - τ)·φ′    (8)

wherein τ is a hyperparameter much less than 1. With the soft update method, the target network parameters gradually approach the value networks and the policy network, so that the value network and policy network parameters are updated in time, the stability of the value network and policy network gradients is ensured, and the algorithm converges more easily.
During training with the TD3 algorithm, the policy network is continuously updated in a favorable direction according to the value function, and its output actions become increasingly reasonable.
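For illustration, a minimal PyTorch-style sketch of one TD3 update corresponding to equations (4)-(8); the networks, optimizers and all hyperparameter values are assumed examples rather than the embodiment's exact implementation.

import torch

def td3_update(step, batch, actor, actor_t, critics, critics_t, opt_c, opt_a,
               gamma=0.99, tau=0.005, sigma_t=0.2, c=0.5, policy_delay=2):
    s, a, r, s2 = batch
    with torch.no_grad():
        noise = (torch.randn_like(a) * sigma_t).clamp(-c, c)     # eq. (4): clipped noise
        a2 = actor_t(s2) + noise
        q_min = torch.min(critics_t[0](s2, a2), critics_t[1](s2, a2))
        y = r + gamma * q_min                                    # eq. (5): update target
    loss_c = sum(((q(s, a) - y) ** 2).mean() for q in critics)   # eq. (6): both critics
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    if step % policy_delay == 0:                                 # delayed policy update
        loss_a = -critics[0](s, actor(s)).mean()                 # eq. (7): policy gradient
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        for net, net_t in ((actor, actor_t), (critics[0], critics_t[0]),
                           (critics[1], critics_t[1])):          # eq. (8): soft update
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)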
In the reinforcement learning problem, the agent learns its strategy through trial and error. The quadruped robot inevitably executes wrong actions during learning, producing unstable and even dangerous motions; on a physical robot this often causes great losses, and the robot may remain under a wrong strategy for a long time, executing unreasonable actions, so that learning efficiency in the simulation environment is low. To address this, a gait reference frame is designed in step 2. The gait reference frame is inspired by quadruped robot controllers based on central pattern generators: by setting the phases of the four legs and a phase-based trajectory, the robot legs support and swing regularly, and the overall motion of the quadruped robot is finally generated through the coordinated rotation of all leg joints. In the gait reference frame, the swing, support and foot end trajectories of the four legs of the quadruped robot are planned, and parameters such as leg-lifting height and step length are set according to the gait characteristics. The foot end trajectory of the quadruped robot is designed to avoid slipping and dragging when the foot end contacts the ground as much as possible, and is designed with an improved composite cycloid method.
In step 2, the reference action a_ref is calculated as follows:
(1) Gait planning: determine the gait parameters of the quadruped robot, including step length, step height and phase (swing phase and support phase). The gait parameters are determined according to the gait characteristics to be realized; this embodiment uses the gait reference frame to realize two dynamic gaits with a duty ratio of 0.5, namely the diagonal trot (Trot) gait and the bounding (Bound) gait. When the quadruped robot moves with the Trot gait, the diagonal legs swing and support synchronously: the right front leg and left rear leg move synchronously, and the left front leg and right rear leg move synchronously; the leg phases of the Trot gait are shown in Fig. 2. When the quadruped robot moves with the Bound gait, the two front legs move synchronously and the two rear legs move synchronously; the leg phases of the Bound gait are shown in Fig. 3.
(2) Based on the foot end position of the quadruped robot when walking in the environment, calculate the desired foot end trajectory in combination with the gait parameters; the desired foot end trajectory is of the composite-cycloid form:

x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2    (9)

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t; x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot; s denotes the step length of the leg swing; h denotes the step height of the leg swing; T denotes the duration of a leg swing or support, i.e., the period of a single step; and t denotes the current time, with 0 ≤ t ≤ T. When the robot moves forward, the trajectory of the supporting leg's foot end moves backward relative to the centroid position.
(3) The desired foot end trajectory is converted through inverse kinematics to obtain the basic position of each robot joint as the reference action a_ref.
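For illustration, a minimal Python sketch of the gait reference frame: the composite-cycloid swing trajectory of equation (9) followed by a planar two-link inverse kinematics step that turns the desired foot position into hip and knee reference angles. Link lengths, step parameters and phase offsets are assumed example values; the abduction joints and the backward support-phase motion are omitted for brevity.

import numpy as np

PHASE = {"trot": (0.0, 0.5, 0.5, 0.0), "bound": (0.0, 0.0, 0.5, 0.5)}  # assumed leg offsets

def foot_trajectory(t, T=0.3, s=0.08, h=0.04, x0=0.0, z0=-0.25):
    u = (t % T) / T
    x = x0 + s * (u - np.sin(2 * np.pi * u) / (2 * np.pi))  # eq. (9), x(t)
    z = z0 + h * (1 - np.cos(2 * np.pi * u)) / 2            # eq. (9), z(t)
    return x, z

def leg_ik(x, z, l1=0.2, l2=0.2):
    # Planar two-link inverse kinematics in the leg's sagittal plane.
    d2 = x * x + z * z
    knee = np.arccos(np.clip((d2 - l1 ** 2 - l2 ** 2) / (2 * l1 * l2), -1.0, 1.0))
    hip = np.arctan2(x, -z) - np.arctan2(l2 * np.sin(knee), l1 + l2 * np.cos(knee))
    return hip, knee

def reference_action(t, gait="trot", T=0.3):
    angles = []
    for offset in PHASE[gait]:
        x, z = foot_trajectory(t + offset * T, T)  # phase-shift each leg
        angles.extend(leg_ik(x, z))
    return np.array(angles)  # a_ref: hip/knee reference angles for the four legs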
In step 3, the action command output by the policy network is combined with the reference action command and then transmitted to the quadruped robot, and the 12 joint drivers of the robot execute the action command to generate the desired gait and stable motion.
Specifically, on the basis of the reference action a_ref, the action a_rl output by the policy network (the learned action) is added, and the action finally executed by the quadruped robot is the action command of each joint, represented as:

a = w₁·a_ref + w₂·a_rl    (10)

wherein a denotes the action the quadruped robot actually executes; a_ref and a_rl respectively denote the reference action obtained from the gait reference frame and the action output by the policy network; a_ref is the joint rotation angle calculated by inverse kinematics from the desired foot end trajectory, while a_rl is the joint rotation angle output by the policy network trained with the deep reinforcement learning algorithm; and w₁ and w₂ respectively denote the weight coefficient of the reference action and the weight coefficient of the learned action, whose adjustment controls the importance of the reference action's guidance during quadruped robot training.
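A minimal sketch of the blend of equation (10); the weights are assumed example values, with w₁ > w₂ emphasizing the guidance of the reference action early in training.

def blended_action(a_ref, a_rl, w1=0.7, w2=0.3):
    # eq. (10): weighted combination sent to the joint drivers
    return w1 * a_ref + w2 * a_rl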
The invention models the quadruped robot motion planning and control problem as a Markov decision process and designs a universal quadruped robot motion control reward mechanism; by adding a gait reference frame, the phase relations between the swing legs and support legs of the quadruped robot are guided to generate multiple gaits; training uses the twin delayed deep deterministic policy gradient deep reinforcement learning algorithm, and a desired gait motion control strategy is generated through training in the simulation environment, enabling the robot to walk stably with the desired gait. Fig. 4 compares the reward values obtained by the quadruped robot using the motion control method of this embodiment with those obtained with the Proximal Policy Optimization (PPO) deep reinforcement learning algorithm.
Example two
The embodiment provides a quadruped robot motion control system, which specifically comprises the following modules:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize its motion; after the quadruped robot executes the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
A policy network training module configured to:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network;
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
It should be noted that the modules in this embodiment correspond one-to-one to the steps in the first embodiment, and the specific implementation process is the same, so it is not described again here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a method for controlling the motion of a quadruped robot as described in the first embodiment.
Example four
The present embodiment provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of a method for controlling the motion of a quadruped robot as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for controlling the motion of a quadruped robot, comprising:
acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network;
acquiring the foot end position of the quadruped robot as it walks in the environment, so as to calculate a reference action; and
combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
2. The method of claim 1, wherein the state comprises the pitch angle, roll angle, pitch angular velocity, roll angular velocity, and position of each joint of the quadruped robot.
3. The method for controlling the motion of a quadruped robot according to claim 1, wherein the reference action is calculated by:
determining gait parameters of the quadruped robot;
calculating the desired foot end trajectory based on the foot end position of the quadruped robot when walking in the environment, in combination with the gait parameters; and
performing inverse kinematics calculation on the desired foot end trajectory to obtain the reference action.
4. A quadruped robot motion control method according to claim 3, wherein the desired foot end trajectory is:
x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2,  0 ≤ t ≤ T

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t, x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot, s denotes the step length of the leg swing, h denotes the step height of the leg swing, and T represents the period of a single step.
5. The method of claim 1, wherein after the quadruped robot performs the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
6. The method of claim 5, wherein the reward is calculated using a reward function;
the reward function comprises a forward speed reward item, a deflection speed penalty item, an energy consumption penalty item, a centroid trajectory floating penalty item and a posture angle change penalty item.
7. The method for controlling the motion of a quadruped robot according to claim 5, wherein the step of training the policy network comprises:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network; and
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
8. A quadruped robotic motion control system, comprising:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the steps in a quadruped robot motion control method according to any one of claims 1-7.
10. A computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps in a quadruped robotic motion control method according to any one of claims 1-7.
CN202210512279.6A 2022-05-12 2022-05-12 Quadruped robot motion control method, system, storage medium and equipment Active CN114609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210512279.6A CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210512279.6A CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN114609918A 2022-06-10
CN114609918B CN114609918B (en) 2022-08-02

Family

ID=81870549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210512279.6A Active CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN114609918B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN107562052A (en) * 2017-08-30 2018-01-09 唐开强 A kind of Hexapod Robot gait planning method based on deeply study
CN109093626A (en) * 2018-09-28 2018-12-28 中科新松有限公司 The fuselage attitude control method and device of quadruped robot
CN109291052A (en) * 2018-10-26 2019-02-01 山东师范大学 A kind of massaging manipulator training method based on deeply study
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN109605377A (en) * 2019-01-21 2019-04-12 厦门大学 A kind of joint of robot motion control method and system based on intensified learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111625002A (en) * 2019-12-24 2020-09-04 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
WO2021152047A1 (en) * 2020-01-28 2021-08-05 Five AI Limited Planning in mobile robots
CN111552301A (en) * 2020-06-21 2020-08-18 南开大学 Hierarchical control method for salamander robot path tracking based on reinforcement learning
CN112476424A (en) * 2020-11-13 2021-03-12 腾讯科技(深圳)有限公司 Robot control method, device, equipment and computer storage medium
CN112596534A (en) * 2020-12-04 2021-04-02 杭州未名信科科技有限公司 Gait training method and device for quadruped robot based on deep reinforcement learning, electronic equipment and medium
CN113400307A (en) * 2021-06-16 2021-09-17 清华大学 Control method of space robot mechanical arm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yibin et al.: "Structural design and gait planning of a hydraulically driven quadruped bionic robot", Journal of Shandong University (Engineering Science) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114859737A (en) * 2022-07-08 2022-08-05 中国科学院自动化研究所 Method, device, equipment and medium for transferring gait of quadruped robot
CN115128960A (en) * 2022-08-30 2022-09-30 齐鲁工业大学 Method and system for controlling motion of biped robot based on deep reinforcement learning

Also Published As

Publication number Publication date
CN114609918B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment
US9950426B2 (en) Predictive robotic controller apparatus and methods
US7685081B2 (en) Bipedal walking simulation
Da Silva et al. Linear Bellman combination for control of character animation
Abreu et al. Learning low level skills from scratch for humanoid robot soccer using deep reinforcement learning
CN113478486B (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
Daniel et al. Learning concurrent motor skills in versatile solution spaces
Gehring et al. Towards automatic discovery of agile gaits for quadrupedal robots
Peters et al. Robot learning
Kashyap et al. Optimization of stability of humanoid robot NAO using ant colony optimization tuned MPC controller for uneven path
CN111546349A (en) New deep reinforcement learning method for humanoid robot gait planning
Xin et al. Online dynamic motion planning and control for wheeled biped robots
CN111730595A (en) Gait stability control method of biped robot under slope condition
Wu et al. Learning robust and agile legged locomotion using adversarial motion priors
Yu et al. Dynamic bipedal maneuvers through sim-to-real reinforcement learning
CN117215204B (en) Robot gait training method and system based on reinforcement learning
Yu et al. Dynamic bipedal turning through sim-to-real reinforcement learning
Van de Panne Control techniques for physically-based animation.
MacAlpine et al. Using dynamic rewards to learn a fully holonomic bipedal walk
CN113568422A (en) Quadruped robot control method based on model prediction control optimization reinforcement learning
Liu et al. Learning control of quadruped robot galloping
Dau et al. Optimal trajectory generation for bipedal robots
Gao et al. A survey of research on several problems in the RoboCup3D simulation environment
Liu et al. A motion planning and control method of quadruped robot based on deep reinforcement learning
CN117555339B (en) Strategy network training method and human-shaped biped robot gait control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221212

Address after: Room 3115, No. 135, Ward Avenue, Ping'an Street, Changqing District, Jinan, Shandong 250300

Patentee after: Shandong Jiqing Technology Service Co.,Ltd.

Address before: 250353 University Road, Changqing District, Ji'nan, Shandong Province, No. 3501

Patentee before: Qilu University of Technology