CN114609918A - Quadruped robot motion control method, system, storage medium and equipment - Google Patents

Quadruped robot motion control method, system, storage medium and equipment

Info

Publication number
CN114609918A
CN114609918A
Authority
CN
China
Prior art keywords
quadruped robot
action
robot
network
quadruped
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210512279.6A
Other languages
Chinese (zh)
Other versions
CN114609918B (en)
Inventor
李彬
刘伟龙
侯兰东
杨姝慧
徐一明
张友梅
张瑜
张明亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong Jiqing Technology Service Co.,Ltd.
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology
Priority to CN202210512279.6A
Publication of CN114609918A
Application granted
Publication of CN114609918B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric
    • G05B13/04: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion, electric, involving the use of models or simulators, in which a parameter or coefficient is automatically adjusted to optimise the performance

Abstract

The invention relates to the technical field of adaptive control, and provides a quadruped robot motion control method, system, storage medium and equipment, wherein the method comprises the following steps: acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network; acquiring the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action; and combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending the action instruction to the quadruped robot to realize its motion, thereby achieving more stable and robust motion planning and control of the quadruped robot.

Description

Quadruped robot motion control method, system, storage medium and equipment
Technical Field
The invention belongs to the technical field of adaptive control, and particularly relates to a quadruped robot motion control method, system, storage medium and equipment.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Traditional quadruped robot control methods usually require accurate dynamic and kinematic modeling and analysis of the robot in advance, with the angle and torque to be executed by each joint driver solved inversely from the desired trajectory and the foot end feedback force. This process demands a large amount of professional knowledge and a long manual design effort, and it is difficult to design a robust controller for agile quadruped robot motion in this way. In addition, delay and noise interference exist in reality, so the model analysis of the quadruped robot is often not accurate enough, which further increases the difficulty of model analysis and system control. How to give the robot the ability to learn motion autonomously and thereby achieve adaptive control of quadruped robot motion is one of the difficulties that urgently needs to be solved at present.
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, and in recent years has achieved breakthrough performance in many fields. The motion planning and control task of a robot can be described as a perception and decision problem, so deep reinforcement learning is a very promising technology in the field of robot motion control: it requires little human intervention from researchers, and by training the quadruped robot to learn a control strategy autonomously it can generate robust motion with low energy consumption and high dynamics. However, the design process of traditional deep-reinforcement-learning-based quadruped robot controllers is complicated, and the stability, efficiency and universality of the resulting quadruped robot motion control are poor.
Disclosure of Invention
In order to solve the technical problems in the background art, the invention provides a quadruped robot motion control method, system, storage medium and equipment, in which a gait reference frame is added to guide the quadruped robot to generate the desired gait motion, so that the trained robot can realize more stable and robust motion planning and control.
In order to achieve the purpose, the invention adopts the following technical scheme:
A first aspect of the present invention provides a quadruped robot motion control method, including:
acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network;
acquiring the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
Figure 8910DEST_PATH_IMAGE001
wherein, the first and the second end of the pipe are connected with each other,
Figure 548476DEST_PATH_IMAGE002
and
Figure 336303DEST_PATH_IMAGE003
representing the reference action and the action output by the policy network respectively,
Figure 644925DEST_PATH_IMAGE004
and
Figure 860006DEST_PATH_IMAGE005
respectively representing the weight coefficients of the reference actions and the weight coefficients of the actions output by the policy network.
Further, the state includes the pitch angle, roll angle, pitch angular velocity, roll angular velocity, and positions of the respective joints of the quadruped robot.
Further, the reference action is calculated by:
determining gait parameters of the quadruped robot;
calculating the desired foot end trajectory based on the foot end position of the quadruped robot when walking in the environment, in combination with the gait parameters; and
performing inverse kinematics calculation on the desired foot end trajectory to obtain the reference action.
Further, the desired foot end trajectory is of the composite-cycloid form:

x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2,  0 ≤ t ≤ T

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t, x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot, s denotes the step length of the leg swing, h denotes the step height of the leg swing, and T represents the period of a single step.
Further, after the quadruped robot executes the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
Further, the reward is calculated by adopting a reward function;
the reward function comprises a forward speed reward item, a deflection speed penalty item, an energy consumption penalty item, a centroid trajectory floating penalty item and a posture angle change penalty item.
Further, the policy network is trained as follows:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network;
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
A second aspect of the present invention provides a quadruped robot motion control system comprising:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
A third aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a quadruped robot motion control method as described above.
A fourth aspect of the present invention provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in a method for controlling the motion of a quadruped robot as described above when executing the program.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a motion control method of a quadruped robot, which is characterized in that a gait reference frame is added, the gait guide frame outputs a reference action according to an expected gait, the reference action instruction is combined with a learned action instruction and then transmitted to a joint driver of the quadruped robot to be executed, the quadruped robot is guided to generate expected gait motion, the trained robot can realize more stable and robust motion planning and control, and the autonomy and intelligence level of the quadruped robot motion planning and control are improved.
The invention provides a quadruped robot motion control method whose state covers the information directly and closely related to the generated motion of the quadruped robot, which avoids the curse of dimensionality, reduces the computational pressure and improves the learning efficiency of the quadruped robot control strategy.
The invention provides a quadruped robot motion control method whose reward function comprises a forward speed reward term, a yaw speed penalty term, an energy consumption penalty term, a centroid trajectory floating penalty term and an attitude angle change penalty term, encouraging the quadruped robot to learn fast and stable forward motion. The method also has strong universality: different task objectives can be encouraged by adjusting the weight of each reward term, thereby achieving the desired quadruped robot control effect.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention without limiting it.
Fig. 1 is a flowchart of a method for controlling the motion of a quadruped robot according to a first embodiment of the present invention;
FIG. 2 is a leg phase diagram for a diagonal sprint gait according to a first embodiment of the invention;
FIG. 3 is a leg phase diagram for jumping gait according to a first embodiment of the invention;
fig. 4 is a diagram of reward values obtained by the quadruped robot in accordance with the first embodiment of the present invention.
Detailed Description
The invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Example one
The embodiment provides a method for controlling the motion of a quadruped robot, as shown in fig. 1, which specifically includes the following steps:
step 1, obtaining the state of the quadruped robot when walking in the environmentsSelecting actions based on state through policy networkaActions to be selectedaActions as policy network exports
Figure 574310DEST_PATH_IMAGE012
Step 2,The method comprises the steps of obtaining the position of the foot end of the quadruped robot when the quadruped robot walks in the environment, and calculating to obtain a reference action
Figure 797481DEST_PATH_IMAGE013
And 3, combining the reference action with the action output by the strategy network to obtain the action executed by the quadruped robot, sending an action command to each joint of the quadruped robot to realize the motion of the quadruped robot, and after the quadruped robot executes the action, transferring the state, namely the statesTransfer to the next state
Figure 534493DEST_PATH_IMAGE014
To obtain a rewardrAnd used to train the policy network.
In step 1, a Markov Decision Process (MDP) is used to model the quadruped robot motion control problem. Reinforcement learning methods handle discrete or continuous decision problems, which are typically modeled as Markov decision processes. Since robot motion control is a continuous decision problem, the invention models the quadruped robot motion control problem as a Markov decision process represented by the quadruple (S, A, P, R), wherein S denotes the state space, also called the observation space, consisting of basic information of the quadruped robot; A denotes the action space, consisting of the commands executed by the quadruped robot's joints; P denotes the state transition probability, determined by the interaction process between the quadruped robot and the environment; and R denotes the reward function, which serves as the evaluation basis of the learning effect and is designed by the user according to the task objective the quadruped robot is to learn. The quadruped robot, as the entity learning the control strategy π, is regarded as the agent in the reinforcement learning problem. The control strategy π can be viewed as a mapping from a state s to an action a and is represented by a neural network, with the state s as the network input and the action a as the network output, where the state s is an element of the state space S and the action a is an element of the action space A. The goal of reinforcement learning is to find the optimal control strategy that maximizes the expected return value:
J(π) = E[ Σ_{t=0}^{∞} γ^t r_t ]    (1)

wherein γ denotes the discount factor, and r_t denotes the reward value fed back by the reward mechanism at time t, computed through the reward function R. In a given state s, the quadruped robot finds the optimal action a according to the strategy function; after the quadruped robot executes the optimal action, a state transition occurs, yielding the state s′ at the next moment together with the reward return value. Since the reward function is set according to the desired task objective, the higher the obtained reward value, the closer the quadruped robot is considered to be to the desired control effect, exhibiting better performance.
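As a concrete illustration of equation (1), the following minimal Python sketch accumulates the discounted return of a finite episode; the reward sequence shown is a hypothetical placeholder, not data from the invention.

def discounted_return(rewards, gamma=0.99):
    # Fold the reward sequence backwards: G_t = r_t + gamma * G_{t+1}.
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([0.5, 0.8, 1.0]))  # hypothetical per-step rewards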
In step 1, in order to avoid the computational pressure caused by an excessively high-dimensional state space, the state comprises the pitch angle, roll angle, pitch angular velocity, roll angular velocity and the positions of the joints of the quadruped robot, i.e., the state comprises q_i, θ, φ, θ̇ and φ̇, wherein q_i denotes the position of the i-th joint of the quadruped robot, i = 1, 2, 3, …, n, n denotes the number of quadruped robot joints and can be 12, θ and φ respectively denote the pitch angle and roll angle of the quadruped robot, and θ̇ and φ̇ respectively denote the pitch angular velocity and roll angular velocity of the quadruped robot. The values of these states are obtained by reading the information of the robot model in the simulation environment. Although the simulation platform and the sensors of the quadruped robot can acquire a variety of information such as position, torque, attitude, velocity and angular velocity as states, and can even acquire images, videos and the like containing a large amount of environmental information, observation information of too high a dimension increases the computational pressure, making the algorithm converge slowly and lowering the efficiency with which the quadruped robot learns motion planning and control strategies. The state of the invention therefore comprises q_i, θ, φ, θ̇ and φ̇, covering the state information directly and closely related to the generated motion of the quadruped robot, which avoids the curse of dimensionality, reduces the computational pressure and improves the learning efficiency of the quadruped robot control strategy.
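For illustration, a minimal Python sketch of assembling this low-dimensional state vector; the robot accessor names (get_joint_positions and so on) are hypothetical stand-ins for whatever interface the simulation platform provides.

import numpy as np

def build_state(robot):
    joint_pos = robot.get_joint_positions()        # q_1 ... q_n, e.g. n = 12
    pitch, roll = robot.get_attitude()             # attitude angles (assumed accessor)
    pitch_rate, roll_rate = robot.get_attitude_rates()
    return np.concatenate([[pitch, roll, pitch_rate, roll_rate], joint_pos])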
In step 1, the control strategy of the quadruped robot is trained with a deep reinforcement learning algorithm; that is, the control strategy is required to output correct actions so that the quadruped robot executes the action commands and produces the desired motion effect. For the quadruped robot control task, the most intuitive action space choices are the torques or positions of the joint actuators. However, joint motors are easily disturbed by uncertain external forces while executing torque commands, which performs poorly in learning-based quadruped robot motion control tasks. Therefore, the invention selects the joint action commands of the quadruped robot as the actions executed by the agent. Thus, the action comprises the rotation angle of each joint of the quadruped robot, i.e., the action comprises a_1, a_2, …, a_n, wherein a_i denotes the action executed by the i-th joint; since the joints of the quadruped robot are all rotationally driven, the action command of each joint actuator is the rotation angle of that actuator. In order to ensure that the action values of the policy network are reasonable, the actions output by the neural network are truncated to a reasonable range, which ensures the overall stability of the quadruped robot in the early stage of training, reduces the falling frequency of the quadruped robot during training, and improves the efficiency of strategy learning to a certain extent.
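A minimal sketch of this truncation step; the per-joint bound of 0.5 rad is an assumed example value, not a limit prescribed by the invention.

import numpy as np

JOINT_LIMIT = 0.5  # rad; assumed allowable excursion around the stance pose

def clip_action(raw_action):
    # Truncate the policy network output to a reasonable joint-angle range.
    return np.clip(raw_action, -JOINT_LIMIT, JOINT_LIMIT)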
In step 1, the state, comprising the pitch angle, roll angle, pitch angular velocity, roll angular velocity and joint positions of the quadruped robot, serves as the input of the policy network. In order to avoid the computational pressure caused by an excessively high-dimensional state space, part of the collected information is not added to the state space; instead, this information is used as the evaluation basis in the reward function, so that a more complete reward mechanism is designed. Aiming at the motion control effect to be realized, a set of universal reward mechanisms for the quadruped robot is designed. The main components of the reward mechanism comprise the forward speed, yaw speed, energy consumption, centroid floating height and attitude angle change value of the quadruped robot, and the reward mechanism encourages the quadruped robot to learn fast and stable forward motion. The reward function comprises a forward speed reward term, a yaw speed penalty term, an energy consumption penalty term, a centroid trajectory floating penalty term and an attitude angle change penalty term, and takes the form:

r = w₁·(x′ - x)/Δt - w₂·|y′ - y|/Δt - w₃·Σ_{j=1}^{n} |τ_j·ω_j|·Δt - w₄·|h′ - h| - (k₁·|θ′ - θ| + k₂·|φ′ - φ| + k₃·|ψ′ - ψ|) - r_fall    (2)

wherein the five terms are respectively the forward speed reward term, the yaw speed penalty term, the energy consumption penalty term, the centroid trajectory floating penalty term and the attitude angle change penalty term; w₁, w₂, w₃ and w₄ are their weighting coefficients, adjusted according to the task objective to be achieved; x and y denote the positions of the quadruped robot centroid on the x axis and y axis of the world coordinate system at the current moment (before the state transition), and x′ and y′ denote the corresponding positions at the next moment (after the state transition); Δt denotes the time step; τ_j denotes the torque of the j-th joint motor; ω_j denotes the angular velocity of the j-th joint motor; n denotes the total number of joint motors; h and h′ respectively denote the centroid height of the quadruped robot at the current moment and the next moment; θ and θ′ respectively denote the pitch angle at the current moment and the next moment; φ and φ′ respectively denote the roll angle at the current moment and the next moment; ψ and ψ′ respectively denote the yaw angle at the current moment and the next moment; k₁, k₂ and k₃ respectively denote the reward weight coefficients of the pitch angle, roll angle and yaw angle; and the fall penalty value r_fall is a constant. A reward mechanism designed in this way has strong universality: different task objectives can be encouraged by adjusting the weight of each reward term, thereby achieving the desired quadruped robot control effect.
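For illustration, a minimal Python sketch of a reward of the form of equation (2); the weight values and the fields of the state snapshots prev and curr are assumed examples, not values prescribed by the invention.

import numpy as np

def reward(prev, curr, dt, fell,
           w=(15.0, 5.0, 0.01, 10.0), k=(5.0, 5.0, 5.0), r_fall=100.0):
    r_forward = w[0] * (curr.x - prev.x) / dt                      # forward speed reward
    p_yaw     = w[1] * abs(curr.y - prev.y) / dt                   # deflection penalty
    p_energy  = w[2] * np.sum(np.abs(curr.tau * curr.omega)) * dt  # energy penalty
    p_height  = w[3] * abs(curr.h - prev.h)                        # centroid floating penalty
    p_att = (k[0] * abs(curr.pitch - prev.pitch)
             + k[1] * abs(curr.roll - prev.roll)
             + k[2] * abs(curr.yaw - prev.yaw))                    # attitude change penalty
    return r_forward - p_yaw - p_energy - p_height - p_att - (r_fall if fell else 0.0)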
The policy network in step 1 is trained with the Twin Delayed Deep Deterministic Policy Gradient algorithm (TD3), an improved version of the Deep Deterministic Policy Gradient algorithm (DDPG). The algorithm introduces the Actor-Critic (AC) framework of traditional reinforcement learning into the deep policy gradient method: deep neural networks represent the action-value function and the deterministic policy, a double-network architecture is used for both the policy function and the value function, and an experience replay mechanism is introduced to reduce the errors caused by sample correlation, so that the training process converges more easily, efficiency is improved on large-scale continuous-action-space tasks, and the obtained policy is more stable and efficient. In addition, the TD3 algorithm introduces clipped double Q-learning into the AC framework to alleviate the overestimation problem, and employs delayed policy network updates and added noise to alleviate error accumulation.

The specific steps of training the policy network with the TD3 algorithm are as follows:
(1) Initialization, performed only at the start of training, covering the value networks, the policy network, the target networks and the experience replay pool: initialize the value networks Q_θ1 and Q_θ2 and the policy network π_φ, with the value network parameters θ₁, θ₂ and the policy network parameters φ all set randomly; initialize the target networks (comprising the target value networks and the target policy network) by synchronizing the value network and policy network parameters to the target network parameters, i.e., θ′₁ ← θ₁, θ′₂ ← θ₂ and φ′ ← φ, wherein θ′₁ and θ′₂ are the target value network parameters and φ′ is the target policy network parameter; and initialize the experience replay pool. The value networks are used to fit the value function, evaluate the policy network, and provide gradient information for the policy network update. The target networks are used to compute the update target and update the value networks.
(2) The policy network selects the action a according to the state s and the noise:

a = π_φ(s) + ε,  ε ∼ OU(0, σ²)    (3)

wherein ε denotes the noise, OU denotes the Ornstein-Uhlenbeck process, and σ² denotes the variance of the noise.
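A minimal sketch of Ornstein-Uhlenbeck exploration noise as used in equation (3); theta, sigma and dt are assumed example parameters.

import numpy as np

class OUNoise:
    def __init__(self, dim, theta=0.15, sigma=0.2, dt=1e-2):
        self.theta, self.sigma, self.dt = theta, sigma, dt
        self.x = np.zeros(dim)

    def sample(self):
        # dx = -theta * x * dt + sigma * sqrt(dt) * N(0, I): mean-reverting noise
        self.x += (-self.theta * self.x * self.dt
                   + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape))
        return self.x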
In step 3, after the quadruped robot executes the action, a state transition occurs, i.e., the state transfers to the next state s′ (the state at the next moment), and the reward r is obtained, wherein the reward r is computed through the reward function in equation (2); the action, the reward and the states before and after the transition are combined into a transition tuple (s, a, r, s′) and stored in the experience replay pool.
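The experience replay pool can be sketched as a bounded buffer of (s, a, r, s′) transition tuples; the capacity shown is an assumed example value.

import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # oldest tuples are evicted first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)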
During policy network training, a small batch of N transition tuples is randomly sampled from the experience replay pool, and the target policy network computes the perturbed action:

ã = π_φ′(s′) + ε,  ε ∼ clip(𝒩(0, σ̃²), -c, c)    (4)

wherein clip denotes the clipping function that limits the noise to the interval [-c, c], and c is a constant. The target value networks then compute the update target:

y = r + γ·min_{i=1,2} Q_θ′ᵢ(s′, ã)    (5)

Then the parameters of the i-th (i = 1, 2) value network are updated:

θᵢ ← argmin_θᵢ (1/N)·Σ (y - Q_θᵢ(s, a))²    (6)
(3) Judge whether the policy network update interval is reached; if not, return to step (2) and continue updating the value network parameters. Otherwise, once the policy network update interval is reached, update the policy network parameters φ with the deterministic policy gradient method based on the updated value networks:

∇_φ J(φ) = (1/N)·Σ ∇_a Q_θ1(s, a)|_{a=π_φ(s)} · ∇_φ π_φ(s)    (7)

and update the target network parameters with the soft update method:

θ′ᵢ ← τ·θᵢ + (1 - τ)·θ′ᵢ,  φ′ ← τ·φ + (1 - τ)·φ′    (8)

wherein τ is a hyperparameter much less than 1. With the soft update method, the target network parameters gradually approach the value networks and the policy network, so that the value network and policy network parameters are updated in time, the stability of the value network and policy network gradients is ensured, and the algorithm converges more easily.
During training with the TD3 algorithm, the policy network is continuously updated in a favorable direction according to the value function, and its output actions become increasingly reasonable.
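For illustration, a minimal PyTorch-style sketch of one TD3 update corresponding to equations (4)-(8); the networks, optimizers and all hyperparameter values are assumed examples rather than the embodiment's exact implementation.

import torch

def td3_update(step, batch, actor, actor_t, critics, critics_t, opt_c, opt_a,
               gamma=0.99, tau=0.005, sigma_t=0.2, c=0.5, policy_delay=2):
    s, a, r, s2 = batch
    with torch.no_grad():
        noise = (torch.randn_like(a) * sigma_t).clamp(-c, c)     # eq. (4): clipped noise
        a2 = actor_t(s2) + noise
        q_min = torch.min(critics_t[0](s2, a2), critics_t[1](s2, a2))
        y = r + gamma * q_min                                    # eq. (5): update target
    loss_c = sum(((q(s, a) - y) ** 2).mean() for q in critics)   # eq. (6): both critics
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    if step % policy_delay == 0:                                 # delayed policy update
        loss_a = -critics[0](s, actor(s)).mean()                 # eq. (7): policy gradient
        opt_a.zero_grad(); loss_a.backward(); opt_a.step()
        for net, net_t in ((actor, actor_t), (critics[0], critics_t[0]),
                           (critics[1], critics_t[1])):          # eq. (8): soft update
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)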
In the reinforcement learning problem, the agent learns its strategy through trial and error. The quadruped robot inevitably executes wrong actions during learning, producing unstable and even dangerous motions; on a physical robot this often causes great losses, and the robot may remain under a wrong strategy for a long time, executing unreasonable actions, so that learning efficiency in the simulation environment is low. To address this, a gait reference frame is designed in step 2. The gait reference frame is inspired by quadruped robot controllers based on central pattern generators: by setting the phases of the four legs and a phase-based trajectory, the robot legs support and swing regularly, and the overall motion of the quadruped robot is finally generated through the coordinated rotation of all leg joints. In the gait reference frame, the swing, support and foot end trajectories of the four legs of the quadruped robot are planned, and parameters such as leg-lifting height and step length are set according to the gait characteristics. The foot end trajectory of the quadruped robot is designed to avoid slipping and dragging when the foot end contacts the ground as much as possible, and is designed with an improved composite cycloid method.
In step 2, the reference action a_ref is calculated as follows:
(1) Gait planning: determine the gait parameters of the quadruped robot, including step length, step height and phase (swing phase and support phase). The gait parameters are determined according to the gait characteristics to be realized; this embodiment uses the gait reference frame to realize two dynamic gaits with a duty ratio of 0.5, namely the diagonal trot (Trot) gait and the bounding (Bound) gait. When the quadruped robot moves with the Trot gait, the diagonal legs swing and support synchronously: the right front leg and left rear leg move synchronously, and the left front leg and right rear leg move synchronously; the leg phases of the Trot gait are shown in Fig. 2. When the quadruped robot moves with the Bound gait, the two front legs move synchronously and the two rear legs move synchronously; the leg phases of the Bound gait are shown in Fig. 3.
(2) Based on the foot end position of the quadruped robot when walking in the environment, calculate the desired foot end trajectory in combination with the gait parameters; the desired foot end trajectory is of the composite-cycloid form:

x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2    (9)

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t; x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot; s denotes the step length of the leg swing; h denotes the step height of the leg swing; T denotes the duration of a leg swing or support, i.e., the period of a single step; and t denotes the current time, with 0 ≤ t ≤ T. When the robot moves forward, the trajectory of the supporting leg's foot end moves backward relative to the centroid position.
(3) The desired foot end trajectory is converted through inverse kinematics to obtain the basic position of each robot joint as the reference action a_ref.
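For illustration, a minimal Python sketch of the gait reference frame: the composite-cycloid swing trajectory of equation (9) followed by a planar two-link inverse kinematics step that turns the desired foot position into hip and knee reference angles. Link lengths, step parameters and phase offsets are assumed example values; the abduction joints and the backward support-phase motion are omitted for brevity.

import numpy as np

PHASE = {"trot": (0.0, 0.5, 0.5, 0.0), "bound": (0.0, 0.0, 0.5, 0.5)}  # assumed leg offsets

def foot_trajectory(t, T=0.3, s=0.08, h=0.04, x0=0.0, z0=-0.25):
    u = (t % T) / T
    x = x0 + s * (u - np.sin(2 * np.pi * u) / (2 * np.pi))  # eq. (9), x(t)
    z = z0 + h * (1 - np.cos(2 * np.pi * u)) / 2            # eq. (9), z(t)
    return x, z

def leg_ik(x, z, l1=0.2, l2=0.2):
    # Planar two-link inverse kinematics in the leg's sagittal plane.
    d2 = x * x + z * z
    knee = np.arccos(np.clip((d2 - l1 ** 2 - l2 ** 2) / (2 * l1 * l2), -1.0, 1.0))
    hip = np.arctan2(x, -z) - np.arctan2(l2 * np.sin(knee), l1 + l2 * np.cos(knee))
    return hip, knee

def reference_action(t, gait="trot", T=0.3):
    angles = []
    for offset in PHASE[gait]:
        x, z = foot_trajectory(t + offset * T, T)  # phase-shift each leg
        angles.extend(leg_ik(x, z))
    return np.array(angles)  # a_ref: hip/knee reference angles for the four legs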
In step 3, the action command output by the policy network is combined with the reference action command and then transmitted to the quadruped robot, and the 12 joint drivers of the robot execute the action command to generate the desired gait and stable motion.
Specifically, on the basis of the reference action a_ref, the action a_rl output by the policy network (the learned action) is added, and the action finally executed by the quadruped robot is the action command of each joint, represented as:

a = w₁·a_ref + w₂·a_rl    (10)

wherein a denotes the action the quadruped robot actually executes; a_ref and a_rl respectively denote the reference action obtained from the gait reference frame and the action output by the policy network; a_ref is the joint rotation angle calculated by inverse kinematics from the desired foot end trajectory, while a_rl is the joint rotation angle output by the policy network trained with the deep reinforcement learning algorithm; and w₁ and w₂ respectively denote the weight coefficient of the reference action and the weight coefficient of the learned action, whose adjustment controls the importance of the reference action's guidance during quadruped robot training.
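A minimal sketch of the blend of equation (10); the weights are assumed example values, with w₁ > w₂ emphasizing the guidance of the reference action early in training.

def blended_action(a_ref, a_rl, w1=0.7, w2=0.3):
    # eq. (10): weighted combination sent to the joint drivers
    return w1 * a_ref + w2 * a_rl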
The invention models the quadruped robot motion planning and control problem as a Markov decision process and designs a universal quadruped robot motion control reward mechanism; by adding a gait reference frame, the phase relations between the swing legs and support legs of the quadruped robot are guided to generate multiple gaits; training uses the twin delayed deep deterministic policy gradient deep reinforcement learning algorithm, and a desired gait motion control strategy is generated through training in the simulation environment, enabling the robot to walk stably with the desired gait. Fig. 4 compares the reward values obtained by the quadruped robot using the motion control method of this embodiment with those obtained with the Proximal Policy Optimization (PPO) deep reinforcement learning algorithm.
Example two
The embodiment provides a quadruped robot motion control system, which specifically comprises the following modules:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize its motion; after the quadruped robot executes the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
A policy network training module configured to:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network;
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
It should be noted that the modules in this embodiment correspond one-to-one to the steps in the first embodiment, and the specific implementation process is the same, so it is not described again here.
EXAMPLE III
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps in a method for controlling the motion of a quadruped robot as described in the first embodiment.
Example four
The present embodiment provides a computer device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of a method for controlling the motion of a quadruped robot as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program, which may be stored in a computer readable storage medium and executed by a computer to implement the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for controlling the motion of a quadruped robot, comprising:
acquiring the state of the quadruped robot as it walks in the environment, and selecting an action according to the state through a policy network;
acquiring the foot end position of the quadruped robot as it walks in the environment, so as to calculate a reference action; and
combining the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and sending an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
2. The method of claim 1, wherein the state comprises the pitch angle, roll angle, pitch angular velocity, roll angular velocity, and position of each joint of the quadruped robot.
3. The method for controlling the motion of a quadruped robot according to claim 1, wherein the reference action is calculated by:
determining gait parameters of the quadruped robot;
calculating the desired foot end trajectory based on the foot end position of the quadruped robot when walking in the environment, in combination with the gait parameters; and
performing inverse kinematics calculation on the desired foot end trajectory to obtain the reference action.
4. A quadruped robot motion control method according to claim 3, wherein the desired foot end trajectory is:
x(t) = x₀ + s·(t/T - sin(2πt/T)/(2π))
y(t) = 0
z(t) = z₀ + h·(1 - cos(2πt/T))/2,  0 ≤ t ≤ T

wherein x(t), y(t) and z(t) are the desired foot end position of the quadruped robot in the body coordinate system at time t, x₀ and z₀ represent the foot end position in the robot body coordinate system in the initial state of the quadruped robot, s denotes the step length of the leg swing, h denotes the step height of the leg swing, and T represents the period of a single step.
5. The method of claim 1, wherein after the quadruped robot performs the action, a state transition occurs and a reward is obtained, and the action, the reward and the states before and after the transition are combined into a transition tuple and stored in the experience replay pool.
6. The method of claim 5, wherein the reward is calculated using a reward function;
the reward function comprises a forward speed reward item, a deflection speed penalty item, an energy consumption penalty item, a centroid trajectory floating penalty item and a posture angle change penalty item.
7. The method for controlling the motion of a quadruped robot according to claim 5, wherein the step of training the policy network comprises:
randomly sampling a plurality of transition tuples from the experience replay pool, calculating the perturbed action, and updating the parameters of the value network; and
judging whether the policy network update interval is reached; if not, continuing to update the value network parameters; otherwise, based on the updated value network, updating the policy network parameters with the deterministic policy gradient method.
8. A quadruped robotic motion control system, comprising:
an action selection module configured to acquire the state of the quadruped robot as it walks in the environment, and to select an action according to the state through a policy network;
a reference action calculation module configured to acquire the foot end positions of the quadruped robot as it walks in the environment, so as to calculate a reference action;
a control module configured to combine the reference action with the action output by the policy network to obtain the action executed by the quadruped robot, and to send an action instruction to the quadruped robot to realize the motion of the quadruped robot;
the action performed by the quadruped robot is represented as:
a = w₁·a_ref + w₂·a_rl

wherein a_ref and a_rl respectively represent the reference action and the action output by the policy network, and w₁ and w₂ respectively represent the weight coefficient of the reference action and the weight coefficient of the action output by the policy network.
9. A computer-readable storage medium having stored thereon a computer program, characterized in that the program, when executed by a processor, implements the steps in a quadruped robot motion control method according to any one of claims 1-7.
10. A computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps in a quadruped robotic motion control method according to any one of claims 1-7.
CN202210512279.6A 2022-05-12 2022-05-12 Quadruped robot motion control method, system, storage medium and equipment Active CN114609918B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210512279.6A CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210512279.6A CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Publications (2)

Publication Number Publication Date
CN114609918A 2022-06-10
CN114609918B CN114609918B (en) 2022-08-02

Family

ID=81870549

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210512279.6A Active CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN114609918B (en)

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106094813A (en) * 2016-05-26 2016-11-09 华南理工大学 It is correlated with based on model humanoid robot gait's control method of intensified learning
CN107562052A (en) * 2017-08-30 2018-01-09 唐开强 A kind of Hexapod Robot gait planning method based on deeply study
CN109093626A (en) * 2018-09-28 2018-12-28 中科新松有限公司 The fuselage attitude control method and device of quadruped robot
CN109291052A (en) * 2018-10-26 2019-02-01 山东师范大学 A kind of massaging manipulator training method based on deeply study
US20200143206A1 (en) * 2018-11-05 2020-05-07 Royal Bank Of Canada System and method for deep reinforcement learning
CN109605377A (en) * 2019-01-21 2019-04-12 厦门大学 A kind of joint of robot motion control method and system based on intensified learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN111625002A (en) * 2019-12-24 2020-09-04 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
WO2021152047A1 (en) * 2020-01-28 2021-08-05 Five AI Limited Planning in mobile robots
CN111552301A (en) * 2020-06-21 2020-08-18 南开大学 Hierarchical control method for salamander robot path tracking based on reinforcement learning
CN112476424A (en) * 2020-11-13 2021-03-12 腾讯科技(深圳)有限公司 Robot control method, device, equipment and computer storage medium
CN112596534A (en) * 2020-12-04 2021-04-02 杭州未名信科科技有限公司 Gait training method and device for quadruped robot based on deep reinforcement learning, electronic equipment and medium
CN113400307A (en) * 2021-06-16 2021-09-17 清华大学 Control method of space robot mechanical arm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Yibin et al.: "Structural design and gait planning of a hydraulically driven quadruped bionic robot", Journal of Shandong University (Engineering Science) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114859737A (en) * 2022-07-08 2022-08-05 中国科学院自动化研究所 Method, device, equipment and medium for transferring gait of quadruped robot
CN115128960A (en) * 2022-08-30 2022-09-30 齐鲁工业大学 Method and system for controlling motion of biped robot based on deep reinforcement learning

Also Published As

Publication number Publication date
CN114609918B (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN114609918B (en) Quadruped robot motion control method, system, storage medium and equipment
US9950426B2 (en) Predictive robotic controller apparatus and methods
US7685081B2 (en) Bipedal walking simulation
Da Silva et al. Linear Bellman combination for control of character animation
Abreu et al. Learning low level skills from scratch for humanoid robot soccer using deep reinforcement learning
CN113478486B (en) Robot motion parameter self-adaptive control method and system based on deep reinforcement learning
Daniel et al. Learning concurrent motor skills in versatile solution spaces
Gehring et al. Towards automatic discovery of agile gaits for quadrupedal robots
Peters et al. Robot learning
Kashyap et al. Optimization of stability of humanoid robot NAO using ant colony optimization tuned MPC controller for uneven path
CN111546349A (en) New deep reinforcement learning method for humanoid robot gait planning
Xin et al. Online dynamic motion planning and control for wheeled biped robots
CN111730595A (en) Gait stability control method of biped robot under slope condition
Wu et al. Learning robust and agile legged locomotion using adversarial motion priors
Yu et al. Dynamic bipedal maneuvers through sim-to-real reinforcement learning
CN117215204B (en) Robot gait training method and system based on reinforcement learning
Yu et al. Dynamic bipedal turning through sim-to-real reinforcement learning
Van de Panne Control techniques for physically-based animation.
MacAlpine et al. Using dynamic rewards to learn a fully holonomic bipedal walk
CN113568422A (en) Quadruped robot control method based on model prediction control optimization reinforcement learning
Liu et al. Learning control of quadruped robot galloping
Dau et al. Optimal trajectory generation for bipedal robots
Gao et al. A survey of research on several problems in the RoboCup3D simulation environment
Liu et al. A motion planning and control method of quadruped robot based on deep reinforcement learning
CN117555339B (en) Strategy network training method and human-shaped biped robot gait control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221212

Address after: Room 3115, No. 135, Ward Avenue, Ping'an Street, Changqing District, Jinan, Shandong 250300

Patentee after: Shandong Jiqing Technology Service Co.,Ltd.

Address before: 250353 University Road, Changqing District, Ji'nan, Shandong Province, No. 3501

Patentee before: Qilu University of Technology