CN108983804A - Biped robot gait planning method based on deep reinforcement learning - Google Patents

Biped robot gait planning method based on deep reinforcement learning

Info

Publication number
CN108983804A
Authority
CN
China
Prior art keywords
robot
gait
data
theta
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810979187.2A
Other languages
Chinese (zh)
Other versions
CN108983804B (en)
Inventor
吴晓光
刘绍维
杨磊
张天赐
李艳会
王挺进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN201810979187.2A priority Critical patent/CN108983804B/en
Publication of CN108983804A publication Critical patent/CN108983804A/en
Application granted granted Critical
Publication of CN108983804B publication Critical patent/CN108983804B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0891Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a biped robot gait planning method based on deep reinforcement learning, which exploits the stability and compliance of human gait and combines it with deep reinforcement learning to control the gait of a biped robot effectively. The method comprises the following steps: 1) establishing a passive biped robot model; 2) acquiring and processing human gait data and target gait data; 3) extracting the implicit features in the biped robot gait data and the human gait data respectively by using a denoising autoencoder; 4) learning the human gait features by deep reinforcement learning, and thereby planning the gait of the biped robot. The invention combines deep reinforcement learning with human gait data to control the biped robot to walk as stably and compliantly as a human.

Description

Biped robot gait planning method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of biped robots, in particular to a biped robot gait planning method based on deep reinforcement learning.
Background
The locomotion modes of current mobile robots include tracked, wheeled, biped and the like. Compared with tracked and wheeled robots, a biped robot has stronger adaptability: it can move on flat ground and also in irregular environments (walking on uneven ground, going up and down steps, etc.). However, a biped robot is itself a highly nonlinear hybrid dynamic system, and its gait planning has always been a difficult problem.
In addition to maintaining walking stability, gait planning for a biped robot must also take into account the energy efficiency, compliance and environmental adaptability of the walking motion. Gait planning methods based on simplified models are commonly used. Such methods consider only the main characteristics of the biped robot and, from the kinematics and dynamics of biped walking, simplify the robot into a basic model such as an inverted pendulum, a two-link or a four-link model, and then plan the gait on the basis of the simplified model. Because the simplified model ignores part of the physical characteristics of the biped robot, these methods suffer from weak anti-interference capability, sensitivity to the environment, a single gait pattern and the like. Gait planning methods based on intelligent algorithms have become a hotspot of current research because of their learning ability, adaptivity, high fault tolerance and other characteristics; such methods include neural networks, support vector machines, fuzzy control, reinforcement learning and the like. In general, however, intelligent algorithms can only ensure that the biped robot walks stably; they cannot ensure that the robot also walks efficiently and smoothly while walking stably, and sometimes they even cause the biped robot to adopt a stiff and irregular gait.
Disclosure of Invention
The invention aims to solve the above problems and provides a biped robot gait planning method based on deep reinforcement learning. The invention exploits the fact that a kneed biped robot model is similar to the human body in both structure and walking process and, driven by large amounts of data, combines this with a deep reinforcement learning method. This solves problems such as the weak anti-interference capability of model-based gait planning methods and the stiff gait produced by conventional intelligent gait planning methods, and improves the stability and compliance of the robot while walking.
In order to realize the purpose, the invention is realized according to the following technical scheme:
a biped robot gait planning method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: establishing a biped robot model and describing the walking process of the robot;
step S2: acquiring and processing human body gait data and target gait data;
step S3: respectively extracting implicit characteristics in gait data of the biped robot and human body gait data by using a noise reduction automatic encoder;
step S4: and learning the human gait characteristics by using a deep reinforcement learning method so as to plan the gait of the biped robot.
In the above technical solution, step S1 specifically includes the following steps:
step S101: establishing a 4-link robot model with knees and arc feet; the robot model comprises 2 thighs, 2 shanks and 2 arc feet, the leg links are connected by frictionless hinges through rigid rods, the arc feet are fixedly connected to the shanks respectively, the supporting leg and the swing leg have identical mass and geometric parameters, the mass of each leg is uniformly distributed, a limiting mechanism is arranged at the knee joints of the robot model to simulate the function of the human knee joint, and two motors are arranged at the hip joint to apply control torques to the supporting leg and the swing leg respectively;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr; the robot walking state is described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
In the above technical solution, step S2 specifically includes the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking (the sagittal plane), and defining the human walking state as Θm; every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS; any row vector of the target gait data ΘS is an extracted θm;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR; any row vector of the robot gait data ΘR is a sampled θr.
In the above technical solution, step S3 specifically includes: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS. The row vectors of ΘR and ΘS are fed one by one into the corresponding denoising autoencoder, and the obtained features are arranged in the original order to form the robot gait feature data HR and the target gait feature data HS; HR and HS are normalized together and provided to the deep reinforcement learning, wherein the workflow of each denoising autoencoder is as follows:
S301: a row vector θ of ΘR or ΘS is fed into the denoising autoencoder; the denoising autoencoder randomly erases elements of the original gait data θ according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, and its overall error is measured by the overall loss function JDAE, where the decoding function of the denoising autoencoder is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the denoising autoencoder on a given training set is:
JDAE(θDAE) = Σθ L(θ, y)
where θDAE denotes the parameters of the denoising autoencoder, including w, p and q; L is defined as the reconstruction error and describes how close y is to θ:
L(θ, y) = Σi=1..n (θi − yi)²
where n is the dimension of the input and output layers;
S303: the training process of the denoising autoencoder uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
In the above technical solution, in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot; the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient algorithm, the target gait feature data HS is used as the basis of rt, and the deep deterministic policy gradient algorithm outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the deep deterministic policy gradient algorithm, so that the algorithm finally has the ability to control the robot to reach the target gait.
In the above technical solution, the policy network of the deep deterministic policy gradient algorithm adopts a 5-layer convolutional neural network comprising an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives st and the output layer outputs the torques at to be executed by the motors.
Compared with the prior art, the invention has the following beneficial effects:
the invention combines deep reinforcement learning with human gait data, and solves the problems of weak anti-interference capability of a gait planning method based on a model, hard gait of a conventional intelligent gait planning method and the like. The introduction of the noise reduction automatic encoder not only extracts the characteristics in the gait data, but also eliminates the influence of geometric difference and noise. Compared with the conventional reinforcement learning, the DDPG can spend less time to solve more complex problems and achieve higher control requirements. Target gait feature data HSAs rtThe DDPG can effectively utilize human gait data so that r istThe gait stability and the flexibility of the robot are evaluated. Through training, the DDPG can finally control the robot to walk stably and smoothly like a human.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of the planning method of the present invention;
FIG. 2 is a schematic diagram of the 4-link robot model with knees and arc feet;
FIG. 3 is a schematic view of biped robot walking;
FIG. 4 is a schematic diagram of a 2D process of human walking through a human motion database;
fig. 5 is a schematic flow chart of the operation of the noise reduction auto-encoder DAE;
FIG. 6 is a block diagram of a depth deterministic policy gradient algorithm DDPG;
fig. 7 is a schematic diagram of a training flow of a deep deterministic strategy gradient algorithm DDPG.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. It is obvious that the described embodiments are some, but not all, of the embodiments of the present invention.
FIG. 1 is a schematic diagram of the planning method of the present invention; as shown in fig. 1, a gait planning method for a biped robot based on deep reinforcement learning of the present invention includes:
step S1: establishing a biped robot model and describing the walking process of the robot; wherein the step S1 specifically includes the following steps:
step S101: establishing a 4-link robot model with knees and arc feet;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr.
Specifically, the biped robot model established in this embodiment is a 4-link kneed robot model with arc feet, as shown in fig. 2. The robot consists of 2 thighs, 2 shanks and 2 arc feet. The leg links are connected by frictionless hinges through rigid rods, and the arc feet are fixedly connected to the shanks. The supporting leg and the swing leg in the model have identical mass and geometric parameters, and the mass of each leg is uniformly distributed. A limiting mechanism is arranged at the knee joints of the robot to simulate the function of the human knee joint. Two motors are arranged at the hip joint and can apply control torques to the supporting leg and the swing leg respectively.
This embodiment performs 2D modeling only on the right side view of the advancing direction during robot walking. The process of the robot taking one step is shown in fig. 3 and can be described as follows:
Stage I: the knee joint of the robot's supporting leg is locked and the supporting leg performs an inverted-pendulum motion, without sliding or displacement relative to the ground; the knee joint of the swing leg is relaxed and the swing leg swings forward, while the hip joint moves forward.
Stage II: the swing leg swings to the front of the supporting leg; when it reaches the maximum flexion-extension state, the thigh and the shank collide because of the limiting mechanism; the collision finishes instantaneously, and the limiting mechanism locks after the collision and remains locked.
Stage III: the swing leg swings backward relative to the supporting leg, and the hip joint still moves forward.
Stage IV: the swing leg collides with the ground; the collision finishes instantaneously and no bounce occurs; the roles of the supporting leg and the swing leg are exchanged.
During the whole walking process, the walking state of the robot can be described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
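As a small illustration of the state representation above, the following Python sketch assembles one Θr sample; the ordering of the six components and the numerical values are assumptions for illustration only.

```python
import numpy as np

def robot_walking_state(theta_support, dtheta_support,
                        theta_swing_thigh, dtheta_swing_thigh,
                        theta_swing_shank, dtheta_swing_shank):
    """Assemble the 6-dimensional walking state Theta_r described above.

    Angles are measured from the vertical, counterclockwise positive; each
    angle is paired with its angular velocity. The ordering is an assumption.
    """
    return np.array([theta_support, dtheta_support,
                     theta_swing_thigh, dtheta_swing_thigh,
                     theta_swing_shank, dtheta_swing_shank])

# Example sample for one time step (placeholder values).
theta_r = robot_walking_state(0.12, 0.8, -0.25, 1.5, -0.40, 2.1)
```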
Step S2: acquiring and processing human body gait data and target gait data.
Step S2 specifically includes the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking, and defining the human walking state as Θm. Every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR.
Specifically, in the present embodiment, in order for the biped robot to learn human gait, it is necessary to provide target gait data for the robot using a human motion capture technique. The quality of the gait data directly affects the final learning effect of the robot, so the reliability is particularly important in the embodiment. Reliable gait data can be acquired through relatively well-known human motion capture databases at home and abroad, and open-source human motion capture data provided by the databases are used by a plurality of researchers, so that the gait data has high accuracy and reliability.
In this embodiment, the open-source human motion capture database of the CMU Graphics Lab at Carnegie Mellon University is used. The database recorded human motion at 120 Hz using 12 infrared cameras in a 3 m × 8 m rectangular room and stored the data in standard files. For the data selected from the database, the human body in the gait data is divided into 16 segments according to the inertial parameter indexes of the adult human body; low-frequency clutter is then filtered out by a filtering method, and data such as the density, inertia tensor, moment of inertia and center-of-mass position of each limb segment are derived in combination with a multiple regression equation of the physiological structure of the human body.
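For the filtering step mentioned above, a minimal sketch is given below using a zero-phase Butterworth filter on one joint-angle channel of the 120 Hz capture data; the pass type, order and cutoff are placeholder assumptions, since the patent only states that clutter is removed by a filtering method.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 120.0  # CMU capture rate in Hz (stated above)

def filter_channel(angle_series, cutoff_hz=6.0, order=4, btype="low"):
    """Zero-phase Butterworth filtering of one joint-angle trajectory.

    The pass type, order and cutoff used here are placeholder assumptions,
    not values taken from the patent.
    """
    b, a = butter(order, cutoff_hz, btype=btype, fs=FS)
    return filtfilt(b, a, angle_series)

# Example with synthetic data: a 1 Hz walking oscillation plus measurement noise.
t = np.arange(0.0, 2.0, 1.0 / FS)
noisy = np.sin(2 * np.pi * 1.0 * t) + 0.05 * np.random.randn(t.size)
clean = filter_channel(noisy)
```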
Human walking involves a large number of joint degrees of freedom, and even after the human body has been divided into segments this number is still too large for the robot. In order for the robot and the human body to be consistent in their gait data, the joint degrees of freedom involved in human walking need to be simplified.
Taking the used kneed arc-foot robot model as a reference, the 2D plane of the longitudinal direction of human walking is taken, and the human walking state is defined as:
Θm = [θm1, θ̇m1, θm2, θ̇m2, θm3, θ̇m3]
where counterclockwise rotation is taken as positive; θm1 and θ̇m1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θm2 and θ̇m2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θm3 and θ̇m3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
A data set of normal human walking is selected from the CMU human motion capture database, and the human body in the data set is segmented and the relevant quantities are calculated to obtain the 2D process of human walking, as shown in fig. 4. In this embodiment, the walking-process data obtained from the data set are simplified according to the definition of the human walking state, and the final human gait data are defined as ΘM, any row vector of which is an extracted θm.
In this embodiment, the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground is called one gait cycle, for both the human body and the robot. One gait cycle is selected from the human gait data ΘM as the learning object of the robot; taking into account the time required for the motor torque to change, the odd frames of the learning-object data are extracted to form a new data set, which is defined as the target gait data ΘS. The gait data of the robot within one gait cycle are sampled at the sampling frequency of ΘS to form the robot gait data ΘR, any row vector of which is a sampled θr. When the dimensions of ΘR and ΘS differ, a resize method is used to make them the same.
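A minimal sketch of the odd-frame extraction and the resize step described above; the array shapes and the use of per-column linear interpolation are assumptions for illustration.

```python
import numpy as np

def extract_odd_frames(cycle):
    """Theta_S: keep frames 1, 3, 5, ... of one human gait cycle (rows are theta_m)."""
    return cycle[::2]

def resize_rows(data, n_rows):
    """Resample a (frames x 6) gait array to n_rows by per-column linear interpolation."""
    old = np.linspace(0.0, 1.0, data.shape[0])
    new = np.linspace(0.0, 1.0, n_rows)
    return np.column_stack([np.interp(new, old, data[:, j]) for j in range(data.shape[1])])

# Example: a 120-frame human cycle and 55 sampled robot states (placeholder shapes).
theta_M_cycle = np.random.rand(120, 6)
theta_S = extract_odd_frames(theta_M_cycle)                      # target gait data
theta_R = resize_rows(np.random.rand(55, 6), theta_S.shape[0])   # robot gait data, matched
```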
Step S3: extracting the implicit features in the biped robot gait data and the human gait data respectively by using a denoising autoencoder. Step S3 specifically includes: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS. The row vectors of ΘR and ΘS are fed one by one into the corresponding denoising autoencoder, and the obtained features are arranged in the original order to form the robot gait feature data HR and the target gait feature data HS; HR and HS are normalized together and provided to the deep reinforcement learning.
Because of the differences in geometric parameters between the human body and the robot, and in view of the generality of the invention and the noise present in the gait data, this embodiment uses a denoising autoencoder DAE to further process the gait data of the robot and the human body, so as to extract and encode more robust features from the existing gait data while eliminating the influence of model parameters and noise, allowing the robot to learn the human gait better.
The DAE is an improved algorithm based on the autoencoder. It has a simple structure and fast operation, is commonly used for data pre-processing in deep learning networks, can extract and encode more robust features from known data, and eliminates possible noise influence.
The DAE used in this embodiment is a neural network with a single hidden layer. It consists of three layers: the first layer is the input layer, which receives the original gait data and adds noise to obtain the noise-added data; the second layer is the hidden layer, in which the DAE encodes the noise-added data, and the encoding result can be regarded as the implicit feature of the original gait; the third layer is the output layer, which decodes and reconstructs the implicit feature of the hidden layer, and after the DAE has been trained the reconstructed output should be the same as the original gait data. The DAE updates its network parameters by gradient descent.
The DAE adjusts its network parameters through training; when the loss function formed by the original input x and the reconstructed output y is small, the hidden-layer output can be regarded as a representation of the original input x. Such a representation is called the feature of the input x and can serve as a good expression of the original input signal. By adding noise to the training data, the DAE forces the hidden layer to learn to remove the noise and still express the original gait information completely, and thus to learn a more robust representation of the input signal. The DAE workflow is shown in fig. 5; taking the robot gait data ΘR as an example, it can be described as follows:
S301: a row vector θr of ΘR is fed into the DAE. The DAE randomly erases elements of the original gait data θr according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, which ensures that the hidden-layer feature h represents the original gait data, and the overall error is measured by the overall loss function JDAE, where the decoding function of the DAE is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the DAE on a given training set is:
JDAE(θDAE) = Σθr L(θr, y)
where θDAE denotes the parameters of the DAE, including w, p and q; L is defined as the reconstruction error and describes how close y is to θr:
L(θr, y) = Σi=1..n (θr,i − yi)²
where n is the dimension of the input and output layers;
S303: the DAE training process uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
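The following PyTorch sketch mirrors S301 to S303 under the reconstructed formulas above: binomial erasure of the input, a tied Sigmoid encoder and decoder (w′ = wᵀ), a squared reconstruction error and plain gradient descent. The hidden size, erase probability, learning rate and epoch count are assumptions, and the gait rows are assumed to be scaled into [0, 1] so that a Sigmoid output can reconstruct them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAE(nn.Module):
    """Single-hidden-layer denoising autoencoder with tied weights (w' = w^T)."""
    def __init__(self, n_in=6, n_hidden=16):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_hidden, n_in) * 0.1)  # input -> hidden weights
        self.p = nn.Parameter(torch.zeros(n_hidden))               # hidden-layer bias
        self.q = nn.Parameter(torch.zeros(n_in))                   # output-layer bias

    def forward(self, theta, erase_prob=0.2):
        # S301: randomly erase elements (set them to 0) with a binomial mask.
        mask = torch.bernoulli(torch.full_like(theta, 1.0 - erase_prob))
        x_tilde = theta * mask
        h = torch.sigmoid(F.linear(x_tilde, self.w, self.p))       # h = s_f(w x~ + p)
        y = torch.sigmoid(F.linear(h, self.w.t(), self.q))         # y = s_g(w' h + q)
        return h, y

def train_dae(dae, data, alpha=0.05, epochs=200):
    """S303: gradient descent on the summed squared reconstruction error J_DAE."""
    opt = torch.optim.SGD(dae.parameters(), lr=alpha)
    for _ in range(epochs):
        _, y = dae(data)                 # data: rows of Theta_R or Theta_S, scaled to [0, 1]
        loss = ((data - y) ** 2).sum()   # reconstruction error L summed over the training set
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dae
```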
This embodiment constructs two DAE networks with the same structure, DAER and DAEM, and trains them separately with the robot gait data ΘR and the human gait data ΘM. After being trained with a large amount of data, DAER and DAEM can extract the implicit features of the robot gait data and the human gait data in this embodiment; the extracted robot and human gait features are denoted hr and hm. DAEM is used to extract the feature of each row vector of ΘS, and the extracted features are arranged in the original order to obtain the target gait feature data HS. ΘR is processed in the same way to obtain the robot gait feature data HR. HS and HR are normalized together and provided to the deep reinforcement learning. HS and HR effectively represent the features of the robot and human gait data and can effectively reduce the influence of noise and geometric parameter differences on the deep reinforcement learning.
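Continuing the sketch above, feature extraction and joint normalization might look as follows; the min-max normalization is an assumption, since the text only states that HR and HS are normalized uniformly.

```python
import torch

def encode_rows(dae, data):
    """Run each row through a trained DAE encoder, keeping the original order."""
    with torch.no_grad():
        h, _ = dae(data, erase_prob=0.0)   # no erasure when extracting features
    return h

def joint_minmax(h_r, h_s):
    """Normalize H_R and H_S together so both lie in [0, 1] on a common scale."""
    both = torch.cat([h_r, h_s], dim=0)
    lo, hi = both.min(dim=0).values, both.max(dim=0).values
    scale = torch.clamp(hi - lo, min=1e-8)
    return (h_r - lo) / scale, (h_s - lo) / scale
```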
Step S4: learning the human gait features by a deep reinforcement learning method, and thereby planning the gait of the biped robot. Reinforcement learning is a major branch of machine learning; during the interaction between an agent and its environment it gradually improves the agent's action selection and finally achieves the control objective. Reinforcement learning does not require an accurate model of the agent, so it is well suited to controlling a biped robot. However, conventional reinforcement learning converges slowly; reinforcement learning improved by combining neural network techniques learns faster, but the samples collected during interaction are highly correlated in time and therefore not independent, which does not meet the training requirements of neural networks and makes the network very prone to overfitting. With the rapid development of deep learning, deep reinforcement learning has come into the view of researchers. Deep reinforcement learning is the combination of conventional reinforcement learning and deep learning; it uses the theory of deep learning to make up for the deficiencies of reinforcement learning and improves it greatly in all aspects.
Because the walking motion of the biped robot is continuous and the action space of the hip-joint drive motors is continuous, the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the robot. DDPG is an Actor-Critic algorithm improved from the deterministic policy gradient DPG; it uses neural networks to replace the policy function and the value function of conventional reinforcement learning, and the replacing networks are called the policy network μ and the Q network Q respectively. The policy network receives the robot state and returns the motor torques, and the Q network evaluates the choice of the policy network in combination with the robot state and the motor torques. The DDPG framework is shown in fig. 6.
In step S4, the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient DDPG, the target gait feature data HS is used as the basis of rt, and the DDPG outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the DDPG, so that the DDPG finally has the ability to control the robot to reach the target gait.
In order to solve the network oscillation and overfitting caused by the high temporal correlation of the samples collected during interaction, a memory pool is provided for the DDPG. The memory pool stores, for each gait cycle of the robot, the robot state st, the selected and executed motor torques at, the obtained reward rt and the robot state st+1 after the motors have acted, as one experience tuple (st, at, rt, st+1). When the neural networks need to be trained, n groups of experiences are randomly drawn from the memory pool as training data, where n is generally set to the minibatch size. The random-drawing mechanism breaks the temporal correlation between samples, preventing network oscillation and overfitting, and also allows the robot to learn from past experience as well as current experience.
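A minimal sketch of the memory pool described above: experience tuples are stored up to a capacity E and minibatches are drawn uniformly at random; the capacity and minibatch size are placeholders.

```python
import random
from collections import deque

class MemoryPool:
    """Experience replay: store (s_t, a_t, r_t, s_t1) tuples and sample them randomly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def store(self, s_t, a_t, r_t, s_t1):
        self.buffer.append((s_t, a_t, r_t, s_t1))

    def sample(self, minibatch=64):
        # Random draws break the temporal correlation between consecutive steps.
        return random.sample(self.buffer, minibatch)

    def __len__(self):
        return len(self.buffer)
```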
In the DDPG, if only a single Q network is used for the evaluation training of the policy network, the learning process may be unstable, because the network parameters of the single Q network are updated frequently while also being used to compute the gradients of the Q network and the policy network. Therefore, in this embodiment the policy network μ and the Q network Q of the DDPG are copied after their network parameters have been initialized; the new networks obtained by copying are called the offline (target) policy network μ′ and the offline Q network Q′, and the original networks are called the online policy network μ and the online Q network Q. The online networks output the robot's actions and make the robot execute them while walking. The main function of the offline networks is to provide data support for the training of the online networks, so that the whole network converges more stably and quickly.
The online and offline networks have exactly the same structure; the difference lies in how their parameters are updated. The parameters of the online networks are updated using experiences randomly drawn from the memory pool, data provided by the offline networks, and stochastic gradient descent. The parameters of the offline networks are updated by soft update, which updates the offline network from the parameters of the online network. Taking the online and offline policy networks as an example, the soft update can be expressed as:
θμ′=τθμ+(1-τ)θμ′
where θμ and θμ′ are the network parameters of the online policy network and the offline policy network respectively, and τ generally takes the value 0.001. The soft update between the online Q network and the offline Q network has the same form.
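The soft update above can be written as a small helper that blends online parameters into the offline network at rate τ; a PyTorch sketch, assuming the two networks share the same architecture.

```python
import torch

def soft_update(online_net, offline_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter."""
    with torch.no_grad():
        for p_online, p_offline in zip(online_net.parameters(), offline_net.parameters()):
            p_offline.mul_(1.0 - tau).add_(tau * p_online)
```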
The training process flow diagram in this example is shown in fig. 7, and the training process can be described as follows:
the sampling frequency of the robot, the selection of the swing leg at the sampling starting moment and the target gait data are kept consistent. the state of the robot at the time t is thetaRDAE processed gait feature data H of the robotRAs stTarget gait feature data HSAs rtAccording to the method.
S401: in this embodiment, the policy network in the DDPG is a 5-layer CNN network: the first layer as an input layer for receiving st(ii) a The second layer and the third layer are convolution layers; the fourth layer is a full link layer; the fifth layer is an output layer, which sets the maximum action boundary and outputs the torque required to be executed by the motor. The Q network structure is approximately the same as the strategy network structure, and only the number of input layer units is increased to accommodate the motor torque atAnd the output layer unit is set to 1 and only the evaluation is returned.
The online network parameters are randomly initialized, and the initialized parameters are copied to the corresponding offline networks. The maximum number of experiences E that the memory pool can store, the size minibatch of the neural-network training data set and the number of training iterations T per training session are set; the online policy network learning rate lpolicy and the online Q network learning rate lQ are initialized; and the soft update rate τ and the maximum number of steps W of one interaction are set. When the robot falls or completes the maximum number of steps W, this is regarded as one complete interaction, counted as epi, and the maximum number of interactions is EPI. Finally, the robot state is randomly initialized.
S402: the state of the robot at the moment of swinging the swing leg is stThe online strategy network outputs a group of motor torque a according to the current networktIt can be expressed as:
at=μ(stμ)
wherein, atThe row vector of the motor is respectively the executing moment, the line number and the s of the motor of the supporting leg and the motor of the swinging leg at the hip jointtAnd (5) the consistency is achieved.
S403: in the swinging process of the swinging leg, two motors at the hip joint of the robot respectively execute corresponding motor torque atThe execution time of any row vector is the same as the sampling interval time. The motor firstly executes atFirst row of (2), in a Pair robot state θrAnd after sampling is finished, the execution torque is switched to the next row, and the execution is carried out in the sequence. The control torque of the embodiment uses square wave torque, so that the occurrence of shaking in the control process can be effectively avoided. When the swing leg of the robot collides with the ground, updating the step count w, and sampling all the sampled thetarFeeding into DAERTo obtain a new state s of the robott+1
S404: the design of the reward function is an important step in deep reinforcement learning work, and the good reward design can obviously improve the learning effect. The present embodiment uses a program reward design to guide training faster, rtAs follows:
When the robot does not fall, the smaller the difference between st+1 and HS, the larger rt, and rt remains greater than 0; when the robot falls, rt takes a penalty value. This guides the robot to approach the target gait without falling.
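One possible concrete form of the reward described above, given as a sketch only: the reward grows as st+1 approaches the corresponding target gait feature, and a fall yields a fixed penalty. The exponential shape and the penalty value are assumptions, since the excerpt does not give the exact expression.

```python
import numpy as np

def reward(s_t1, h_s_row, fallen, fall_penalty=-10.0):
    """Positive and increasing as s_t1 approaches the target gait feature; penalty on a fall."""
    if fallen:
        return fall_penalty
    return float(np.exp(-np.linalg.norm(np.asarray(s_t1) - np.asarray(h_s_row))))
```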
S405: will(s)t,at,rt,st+1) The experience number is stored as a group in the memory pool, and the memory pool experience number count exp is updated. According to different states, the counters operate differently as follows: 1) if the robot falls down, resetting the state of the robot, returning to execute S402 and resetting w; 2) if not fallen but W < W, then st+1As new stExecuting S402; 3) if W is not less than W and exp>E then sequentially executes S406 and resets w; 4) otherwise the reset robot state returns to execute S402 and reset w. The epi is updated when S401, S403, and S404 are executed.
S406: and randomly extracting the minimatch group experience from the memory pool as a training data set of the online network.
S407: extracting s in a training datasett,atSending the data to an online Q network to be evaluated: q(s)t,atQ). Data set st+1Sending the torque to an offline strategy network to obtain motor torque a't+1From the offline Q network pair st+1、a′t+1Evaluation was carried out: q'(s)t+1,μ′(st+1μ′)|θQ′). The loss function of the online Q network can then be expressed as:
LQ = (1/N)·Σi (yi − Q(si, ai|θQ))²
where yi = rt + γQ′(st+1, μ′(st+1|θμ′)|θQ′), N is the minibatch size and γ is the discount factor. The online Q network is updated according to LQ using stochastic gradient descent.
S408: calculating the strategy gradient in the strategy network, and defining the loss function of the online strategy network:
Lμ=Q(st,μ(st,θμ)|θQ)
the gradient of the online policy network can be calculated using the loss function of the policy network:
online policy network parameters are also updated using random gradient descent.
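A compact sketch of the S407 to S409 updates under the reconstructed formulas above: the online Q network is regressed toward yi, the online policy network follows the deterministic policy gradient by maximizing Q(st, μ(st)), and both offline networks are then soft-updated. The discount factor, learning rates and tensor shapes are placeholders, and the networks and optimizers are assumed to be defined as in the earlier sketches.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, mu, mu_off, q, q_off, opt_mu, opt_q, gamma=0.99, tau=0.001):
    """One training step on a minibatch of (s_t, a_t, r_t, s_t1) tensors."""
    s_t, a_t, r_t, s_t1 = batch
    r_t = r_t.view(-1, 1)                    # column shape, matching the critic output

    # S407: online Q network regression target y_i computed from the offline networks.
    with torch.no_grad():
        y = r_t + gamma * q_off(s_t1, mu_off(s_t1))
    loss_q = F.mse_loss(q(s_t, a_t), y)
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # S408: deterministic policy gradient, implemented by maximizing Q(s_t, mu(s_t)).
    loss_mu = -q(s_t, mu(s_t)).mean()
    opt_mu.zero_grad(); loss_mu.backward(); opt_mu.step()

    # S409: soft update of the offline (target) networks.
    with torch.no_grad():
        for net, net_off in ((mu, mu_off), (q, q_off)):
            for p, p_off in zip(net.parameters(), net_off.parameters()):
                p_off.mul_(1.0 - tau).add_(tau * p)
    return loss_q.item(), loss_mu.item()
```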
S409: after the network parameters of the online policy network and the online Q network are updated, the offline policy network and the offline Q network are updated through soft update:
S410: the network-training count time is updated; when time exceeds the number of training iterations T per session, the current network training ends and S411 is executed; otherwise execution returns to S406 to continue network training.
S411: when EPI > EPI, DDPG calculation is finished, and the online strategy network is saved as a controller. If EPI < EPI, the robot status is reset and execution returns to S402.
During the walking of the robot, the DDPG is used continuously for learning and training until the online policy network μ and the online Q network Q converge or the maximum number of interactions EPI is reached. When the networks in the DDPG have converged, the online policy network can control the robot from a random initial gait until the target gait is reached. Similarly, if an external disturbance is applied during walking, the first step after the disturbance can be regarded as a new initial gait and the DDPG can control it effectively, as long as the robot is not in a fallen state. Using the target gait feature data HS as the basis of rt allows rt to describe both the stability and the compliance of the robot gait. This embodiment combines deep reinforcement learning with human gait data, so that the robot finally obtains a stable and smooth gait like that of a human.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. A biped robot gait planning method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: establishing a biped robot model and describing the walking process of the robot;
step S2: acquiring and processing human body gait data and target gait data;
step S3: respectively extracting implicit characteristics in gait data of the biped robot and human body gait data by using a noise reduction automatic encoder;
step S4: learning the human gait features by deep reinforcement learning, so as to plan the gait of the biped robot.
2. The biped robot gait planning method based on deep reinforcement learning of claim 1, wherein the step S1 specifically comprises the following steps:
step S101: establishing a 4-link robot model with knees and arc feet; the robot model comprises 2 thighs, 2 shanks and 2 arc feet, the leg links are connected by frictionless hinges through rigid rods, the arc feet are fixedly connected to the shanks respectively, the supporting leg and the swing leg have identical mass and geometric parameters, the mass of each leg is uniformly distributed, a limiting mechanism is arranged at the knee joints of the robot model to simulate the function of the human knee joint, and two motors are arranged at the hip joint to apply control torques to the supporting leg and the swing leg respectively;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr; the robot walking state is described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
3. The biped robot gait planning method based on deep reinforcement learning of claim 2, wherein the step S2 specifically comprises the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking, and defining the human walking state as Θm; every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS, wherein any row vector of the target gait data ΘS is an extracted θm;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR, wherein any row vector of the robot gait data ΘR is a sampled θr.
4. The biped robot gait planning method based on deep reinforcement learning of claim 3, wherein step S3 specifically comprises: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS; feeding the row vectors of ΘR and ΘS one by one into the corresponding denoising autoencoder, and arranging the obtained features in the original order to form the robot gait feature data HR and the target gait feature data HS; and normalizing HR and HS together for the deep reinforcement learning, wherein the workflow of each denoising autoencoder is as follows:
S301: a row vector θ of ΘR or ΘS is fed into the denoising autoencoder; the denoising autoencoder randomly erases elements of the original gait data θ according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, and its overall error is measured by the overall loss function JDAE, where the decoding function of the denoising autoencoder is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the denoising autoencoder on a given training set is:
JDAE(θDAE) = Σθ L(θ, y)
where θDAE denotes the parameters of the denoising autoencoder, including w, p and q; L is defined as the reconstruction error and describes how close y is to θ:
L(θ, y) = Σi=1..n (θi − yi)²
where n is the dimension of the input and output layers;
S303: the training process of the denoising autoencoder uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
5. The biped robot gait planning method based on deep reinforcement learning of claim 3, characterized in that in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot; the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient algorithm, the target gait feature data HS is used as the basis of rt, and the deep deterministic policy gradient algorithm outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the deep deterministic policy gradient algorithm, so that the algorithm finally has the ability to control the robot to reach the target gait.
6. The biped robot gait planning method based on deep reinforcement learning of claim 5, wherein the policy network of the deep deterministic policy gradient algorithm adopts a 5-layer convolutional neural network comprising an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives st and the output layer outputs the torques at to be executed by the motors.
CN201810979187.2A 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning Active CN108983804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108983804A true CN108983804A (en) 2018-12-11
CN108983804B CN108983804B (en) 2020-05-22

Family

ID=64547820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810979187.2A Active CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108983804B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751037A (en) * 2008-12-03 2010-06-23 上海电气集团股份有限公司 Dynamic walking control method for biped walking robot
CN104751172A (en) * 2015-03-12 2015-07-01 西安电子科技大学 Method for classifying polarized SAR (Synthetic Aperture Radar) images based on de-noising automatic coding
CN106127804A (en) * 2016-06-17 2016-11-16 淮阴工学院 The method for tracking target of RGB D data cross-module formula feature learning based on sparse depth denoising own coding device
CN106406162A (en) * 2016-08-12 2017-02-15 广东技术师范学院 Alternating current servo control system based on transfer neural network
US20180268262A1 (en) * 2017-03-15 2018-09-20 Fuji Xerox Co., Ltd. Information processing device and non-transitory computer readable medium
CN107506333A (en) * 2017-08-11 2017-12-22 深圳市唯特视科技有限公司 A kind of visual token algorithm based on ego-motion estimation
CN108241375A (en) * 2018-02-05 2018-07-03 景德镇陶瓷大学 A kind of application process of self-adaptive genetic operator in mobile robot path planning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KAI HENNING KOCH 等: "《Optimization-based walking generation for humanoid robot》", 《10TH IFAC SYMPOSIUM ON ROBOT CONTROL》 *
吴晓光 等: "《一种基于神经网络的双足机器人》", 《中国机械工程》 *
胡运富 等: "《简单双足被动行走模型仿真和分析》", 《哈尔滨工业大学学报》 *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures tumbling target motion planning method
CN110046457A (en) * 2019-04-26 2019-07-23 百度在线网络技术(北京)有限公司 Control method, device, electronic equipment and the storage medium of manikin
CN110046457B (en) * 2019-04-26 2021-02-05 百度在线网络技术(北京)有限公司 Human body model control method and device, electronic equipment and storage medium
CN112149835A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN112149835B (en) * 2019-06-28 2024-03-05 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN110496377B (en) * 2019-08-19 2020-07-28 华南理工大学 Virtual table tennis player ball hitting training method based on reinforcement learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN110764415A (en) * 2019-10-31 2020-02-07 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN110764415B (en) * 2019-10-31 2022-04-15 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN110711055A (en) * 2019-11-07 2020-01-21 江苏科技大学 Image sensor intelligence artificial limb leg system based on degree of depth learning
CN110861084A (en) * 2019-11-18 2020-03-06 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN110861084B (en) * 2019-11-18 2022-04-05 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN111625002A (en) * 2019-12-24 2020-09-04 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111625002B (en) * 2019-12-24 2022-12-13 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111142378A (en) * 2020-01-07 2020-05-12 四川省桑瑞光辉标识系统股份有限公司 Neural network optimization method of biped robot neural network controller
CN111241700B (en) * 2020-01-19 2022-12-30 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111241700A (en) * 2020-01-19 2020-06-05 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111558937A (en) * 2020-04-07 2020-08-21 向仲宇 Robot motion control method based on deep learning
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN111814618A (en) * 2020-06-28 2020-10-23 浙江大华技术股份有限公司 Pedestrian re-identification method, gait identification network training method and related device
CN111814618B (en) * 2020-06-28 2023-09-01 浙江大华技术股份有限公司 Pedestrian re-recognition method, gait recognition network training method and related devices
CN112060075A (en) * 2020-07-21 2020-12-11 深圳先进技术研究院 Training method, training device and storage medium for gait generation network
CN112171660A (en) * 2020-08-18 2021-01-05 南京航空航天大学 Space double-arm system constrained motion planning method based on deep reinforcement learning
CN112256028A (en) * 2020-10-15 2021-01-22 华中科技大学 Method, system, equipment and medium for controlling compliant gait of biped robot
CN112232350A (en) * 2020-10-27 2021-01-15 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112232350B (en) * 2020-10-27 2022-04-19 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN112666939B (en) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN114684293A (en) * 2020-12-28 2022-07-01 成都启源西普科技有限公司 Robot walking simulation algorithm
CN114047697B (en) * 2021-11-05 2023-08-25 河南科技大学 Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
CN114047697A (en) * 2021-11-05 2022-02-15 河南科技大学 Four-footed robot balance inverted pendulum control method based on deep reinforcement learning
CN115366099A (en) * 2022-08-18 2022-11-22 江苏科技大学 Mechanical arm depth certainty strategy gradient training method based on forward kinematics
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN117572877A (en) * 2024-01-16 2024-02-20 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment
CN117572877B (en) * 2024-01-16 2024-05-31 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment

Also Published As

Publication number Publication date
CN108983804B (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN108983804B (en) Biped robot gait planning method based on deep reinforcement learning
Jiang et al. Ditto: Building digital twins of articulated objects from interaction
Amarjyoti Deep reinforcement learning for robotic manipulation-the state of the art
Hu et al. Chainqueen: A real-time differentiable physical simulator for soft robotics
US20200293881A1 (en) Reinforcement learning to train a character using disparate target animation data
Zhu et al. Off-road autonomous vehicles traversability analysis and trajectory planning based on deep inverse reinforcement learning
Piergiovanni et al. Learning real-world robot policies by dreaming
Melo et al. Learning humanoid robot running skills through proximal policy optimization
CN111160294B (en) Gait recognition method based on graph convolution network
Chaffre et al. Sim-to-real transfer with incremental environment complexity for reinforcement learning of depth-based robot navigation
CN111546349A (en) New deep reinforcement learning method for humanoid robot gait planning
Taniguchi et al. Hippocampal formation-inspired probabilistic generative model
CN103839280B (en) A kind of human body attitude tracking of view-based access control model information
CN116959094A (en) Human body behavior recognition method based on space-time diagram convolutional network
CN113359744B (en) Robot obstacle avoidance system based on safety reinforcement learning and visual sensor
Prasad et al. Mild: multimodal interactive latent dynamics for learning human-robot interaction
Giammarino et al. Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task
CN113569466A (en) Parameterized deep reinforcement learning algorithm based on value function
CN108805965B (en) Human physical motion generation method based on multi-objective evolution
CN115879377B (en) Training method of decision network for intelligent flying car mode switching
Wang et al. RL-NBV: A deep reinforcement learning based next-best-view method for unknown object reconstruction
Wan et al. DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
CN116901071A (en) Simulation learning mechanical arm grabbing method and device based on multi-scale sequence model
Allday et al. Auto-perceptive reinforcement learning (APRIL)
Zhang Continuous control for robot based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant