CN108983804A - Biped robot gait planning method based on deep reinforcement learning - Google Patents

Biped robot gait planning method based on deep reinforcement learning

Info

Publication number
CN108983804A
Authority
CN
China
Prior art keywords
robot
gait
data
theta
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810979187.2A
Other languages
Chinese (zh)
Other versions
CN108983804B (en)
Inventor
吴晓光
刘绍维
杨磊
张天赐
李艳会
王挺进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yanshan University
Original Assignee
Yanshan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yanshan University filed Critical Yanshan University
Priority to CN201810979187.2A priority Critical patent/CN108983804B/en
Publication of CN108983804A publication Critical patent/CN108983804A/en
Application granted granted Critical
Publication of CN108983804B publication Critical patent/CN108983804B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0891Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a biped robot gait planning method based on deep reinforcement learning, which exploits the stability and compliance of human gait and combines it with deep reinforcement learning to control the gait of a biped robot effectively. The method comprises the following steps: 1) establishing a passive biped robot model; 2) acquiring and processing human gait data and target gait data; 3) extracting the implicit features in the biped robot gait data and the human gait data respectively by using a denoising autoencoder; 4) learning the human gait features by deep reinforcement learning, and thereby planning the gait of the biped robot. The invention combines deep reinforcement learning with human gait data to control the biped robot to walk as stably and compliantly as a human.

Description

Biped robot gait planning method based on deep reinforcement learning
Technical Field
The invention relates to the technical field of biped robots, in particular to a biped robot gait planning method based on deep reinforcement learning.
Background
The locomotion modes of current mobile robots include tracked, wheeled, biped and the like. Compared with tracked and wheeled robots, a biped robot has stronger adaptability: it can move on flat ground and also in irregular environments (walking on uneven ground, going up and down steps, etc.). However, a biped robot is itself a highly nonlinear hybrid dynamic system, and its gait planning has always been a difficult problem.
In addition to maintaining walking stability, gait planning for a biped robot must also take into account the energy efficiency, compliance and environmental adaptability of the walking motion. Gait planning methods based on simplified models are commonly used. Such methods consider only the main characteristics of the biped robot and, from the kinematics and dynamics of biped walking, simplify the robot into a basic model such as an inverted pendulum, a two-link or a four-link model, and then plan the gait on the basis of the simplified model. Because the simplified model ignores part of the physical characteristics of the biped robot, these methods suffer from weak anti-interference capability, sensitivity to the environment, a single gait pattern and the like. Gait planning methods based on intelligent algorithms have become a hotspot of current research because of their learning ability, adaptivity, high fault tolerance and other characteristics; such methods include neural networks, support vector machines, fuzzy control, reinforcement learning and the like. In general, however, intelligent algorithms can only ensure that the biped robot walks stably; they cannot ensure that the robot also walks efficiently and smoothly while walking stably, and sometimes they even cause the biped robot to adopt a stiff and irregular gait.
Disclosure of Invention
The invention aims to solve the above problems and provides a biped robot gait planning method based on deep reinforcement learning. The invention exploits the fact that a kneed biped robot model is similar to the human body in both structure and walking process and, driven by large amounts of data, combines this with a deep reinforcement learning method. This solves problems such as the weak anti-interference capability of model-based gait planning methods and the stiff gait produced by conventional intelligent gait planning methods, and improves the stability and compliance of the robot while walking.
In order to realize the purpose, the invention is realized according to the following technical scheme:
a biped robot gait planning method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: establishing a biped robot model and describing the walking process of the robot;
step S2: acquiring and processing human body gait data and target gait data;
step S3: respectively extracting implicit characteristics in gait data of the biped robot and human body gait data by using a noise reduction automatic encoder;
step S4: and learning the human gait characteristics by using a deep reinforcement learning method so as to plan the gait of the biped robot.
In the above technical solution, step S1 specifically includes the following steps:
step S101: establishing a 4-link robot model with knees and arc feet; the robot model comprises 2 thighs, 2 shanks and 2 arc feet, the leg links are connected by frictionless hinges through rigid rods, the arc feet are fixedly connected to the shanks respectively, the supporting leg and the swing leg have identical mass and geometric parameters, the mass of each leg is uniformly distributed, a limiting mechanism is arranged at the knee joints of the robot model to simulate the function of the human knee joint, and two motors are arranged at the hip joint to apply control torques to the supporting leg and the swing leg respectively;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr; the robot walking state is described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
In the above technical solution, step S2 specifically includes the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking (the sagittal plane), and defining the human walking state as Θm; every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS; any row vector of the target gait data ΘS is an extracted θm;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR; any row vector of the robot gait data ΘR is a sampled θr.
In the above technical solution, step S3 specifically includes: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS. The row vectors of ΘR and ΘS are fed one by one into the corresponding denoising autoencoder, and the obtained features are arranged in the original order to form the robot gait feature data HR and the target gait feature data HS; HR and HS are normalized together and provided to the deep reinforcement learning, wherein the workflow of each denoising autoencoder is as follows:
S301: a row vector θ of ΘR or ΘS is fed into the denoising autoencoder; the denoising autoencoder randomly erases elements of the original gait data θ according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, and its overall error is measured by the overall loss function JDAE, where the decoding function of the denoising autoencoder is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the denoising autoencoder on a given training set is:
JDAE(θDAE) = Σθ L(θ, y)
where θDAE denotes the parameters of the denoising autoencoder, including w, p and q; L is defined as the reconstruction error and describes how close y is to θ:
L(θ, y) = Σi=1..n (θi − yi)²
where n is the dimension of the input and output layers;
S303: the training process of the denoising autoencoder uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
In the above technical solution, in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot; the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient algorithm, the target gait feature data HS is used as the basis of rt, and the deep deterministic policy gradient algorithm outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the deep deterministic policy gradient algorithm, so that the algorithm finally has the ability to control the robot to reach the target gait.
In the above technical solution, the policy network of the deep deterministic policy gradient algorithm adopts a 5-layer convolutional neural network comprising an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives st and the output layer outputs the torques at to be executed by the motors.
Compared with the prior art, the invention has the following beneficial effects:
the invention combines deep reinforcement learning with human gait data, and solves the problems of weak anti-interference capability of a gait planning method based on a model, hard gait of a conventional intelligent gait planning method and the like. The introduction of the noise reduction automatic encoder not only extracts the characteristics in the gait data, but also eliminates the influence of geometric difference and noise. Compared with the conventional reinforcement learning, the DDPG can spend less time to solve more complex problems and achieve higher control requirements. Target gait feature data HSAs rtThe DDPG can effectively utilize human gait data so that r istThe gait stability and the flexibility of the robot are evaluated. Through training, the DDPG can finally control the robot to walk stably and smoothly like a human.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative effort.
FIG. 1 is a schematic diagram of the planning method of the present invention;
FIG. 2 is a schematic diagram of the 4-link robot model with knees and arc feet;
FIG. 3 is a schematic view of biped robot walking;
FIG. 4 is a schematic diagram of a 2D process of human walking through a human motion database;
fig. 5 is a schematic flow chart of the operation of the noise reduction auto-encoder DAE;
FIG. 6 is a block diagram of a depth deterministic policy gradient algorithm DDPG;
fig. 7 is a schematic diagram of a training flow of a deep deterministic strategy gradient algorithm DDPG.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. It is obvious that the described embodiments are some, but not all, of the embodiments of the present invention.
FIG. 1 is a schematic diagram of the planning method of the present invention; as shown in fig. 1, a gait planning method for a biped robot based on deep reinforcement learning of the present invention includes:
step S1: establishing a biped robot model and describing the walking process of the robot; wherein the step S1 specifically includes the following steps:
step S101: establishing a 4-link robot model with knees and arc feet;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr.
Specifically, the biped robot model established in this embodiment is a 4-link kneed robot model with arc feet, as shown in fig. 2. The robot consists of 2 thighs, 2 shanks and 2 arc feet. The leg links are connected by frictionless hinges through rigid rods, and the arc feet are fixedly connected to the shanks. The supporting leg and the swing leg in the model have identical mass and geometric parameters, and the mass of each leg is uniformly distributed. A limiting mechanism is arranged at the knee joints of the robot to simulate the function of the human knee joint. Two motors are arranged at the hip joint and can apply control torques to the supporting leg and the swing leg respectively.
This embodiment performs 2D modeling only on the right side view of the advancing direction during robot walking. The process of the robot taking one step is shown in fig. 3 and can be described as follows:
Stage I: the knee joint of the robot's supporting leg is locked and the supporting leg performs an inverted-pendulum motion, without sliding or displacement relative to the ground; the knee joint of the swing leg is relaxed and the swing leg swings forward, while the hip joint moves forward.
Stage II: the swing leg swings to the front of the supporting leg; when it reaches the maximum flexion-extension state, the thigh and the shank collide because of the limiting mechanism; the collision finishes instantaneously, and the limiting mechanism locks after the collision and remains locked.
Stage III: the swing leg swings backward relative to the supporting leg, and the hip joint still moves forward.
Stage IV: the swing leg collides with the ground; the collision finishes instantaneously and no bounce occurs; the roles of the supporting leg and the swing leg are exchanged.
During the whole walking process, the walking state of the robot can be described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
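As a small illustration of the state representation above, the following Python sketch assembles one Θr sample; the ordering of the six components and the numerical values are assumptions for illustration only.

```python
import numpy as np

def robot_walking_state(theta_support, dtheta_support,
                        theta_swing_thigh, dtheta_swing_thigh,
                        theta_swing_shank, dtheta_swing_shank):
    """Assemble the 6-dimensional walking state Theta_r described above.

    Angles are measured from the vertical, counterclockwise positive; each
    angle is paired with its angular velocity. The ordering is an assumption.
    """
    return np.array([theta_support, dtheta_support,
                     theta_swing_thigh, dtheta_swing_thigh,
                     theta_swing_shank, dtheta_swing_shank])

# Example sample for one time step (placeholder values).
theta_r = robot_walking_state(0.12, 0.8, -0.25, 1.5, -0.40, 2.1)
```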
Step S2: acquiring and processing human body gait data and target gait data.
Step S2 specifically includes the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking, and defining the human walking state as Θm. Every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR.
Specifically, in the present embodiment, in order for the biped robot to learn human gait, it is necessary to provide target gait data for the robot using a human motion capture technique. The quality of the gait data directly affects the final learning effect of the robot, so the reliability is particularly important in the embodiment. Reliable gait data can be acquired through relatively well-known human motion capture databases at home and abroad, and open-source human motion capture data provided by the databases are used by a plurality of researchers, so that the gait data has high accuracy and reliability.
In this embodiment, the open-source human motion capture database of the CMU Graphics Lab at Carnegie Mellon University is used. The database recorded human motion at 120 Hz using 12 infrared cameras in a 3 m × 8 m rectangular room and stored the data in standard files. For the data selected from the database, the human body in the gait data is divided into 16 segments according to the inertial parameter indexes of the adult human body; low-frequency clutter is then filtered out by a filtering method, and data such as the density, inertia tensor, moment of inertia and center-of-mass position of each limb segment are derived in combination with a multiple regression equation of the physiological structure of the human body.
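For the filtering step mentioned above, a minimal sketch is given below using a zero-phase Butterworth filter on one joint-angle channel of the 120 Hz capture data; the pass type, order and cutoff are placeholder assumptions, since the patent only states that clutter is removed by a filtering method.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 120.0  # CMU capture rate in Hz (stated above)

def filter_channel(angle_series, cutoff_hz=6.0, order=4, btype="low"):
    """Zero-phase Butterworth filtering of one joint-angle trajectory.

    The pass type, order and cutoff used here are placeholder assumptions,
    not values taken from the patent.
    """
    b, a = butter(order, cutoff_hz, btype=btype, fs=FS)
    return filtfilt(b, a, angle_series)

# Example with synthetic data: a 1 Hz walking oscillation plus measurement noise.
t = np.arange(0.0, 2.0, 1.0 / FS)
noisy = np.sin(2 * np.pi * 1.0 * t) + 0.05 * np.random.randn(t.size)
clean = filter_channel(noisy)
```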
Human walking involves a large number of joint degrees of freedom, and even after the human body has been divided into segments this number is still too large for the robot. In order for the robot and the human body to be consistent in their gait data, the joint degrees of freedom involved in human walking need to be simplified.
Taking the used kneed arc-foot robot model as a reference, the 2D plane of the longitudinal direction of human walking is taken, and the human walking state is defined as:
Θm = [θm1, θ̇m1, θm2, θ̇m2, θm3, θ̇m3]
where counterclockwise rotation is taken as positive; θm1 and θ̇m1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θm2 and θ̇m2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θm3 and θ̇m3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
A data set of normal human walking is selected from the CMU human motion capture database, and the human body in the data set is segmented and the relevant quantities are calculated to obtain the 2D process of human walking, as shown in fig. 4. In this embodiment, the walking-process data obtained from the data set are simplified according to the definition of the human walking state, and the final human gait data are defined as ΘM, any row vector of which is an extracted θm.
In this embodiment, the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground is called one gait cycle, for both the human body and the robot. One gait cycle is selected from the human gait data ΘM as the learning object of the robot; taking into account the time required for the motor torque to change, the odd frames of the learning-object data are extracted to form a new data set, which is defined as the target gait data ΘS. The gait data of the robot within one gait cycle are sampled at the sampling frequency of ΘS to form the robot gait data ΘR, any row vector of which is a sampled θr. When the dimensions of ΘR and ΘS differ, a resize method is used to make them the same.
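A minimal sketch of the odd-frame extraction and the resize step described above; the array shapes and the use of per-column linear interpolation are assumptions for illustration.

```python
import numpy as np

def extract_odd_frames(cycle):
    """Theta_S: keep frames 1, 3, 5, ... of one human gait cycle (rows are theta_m)."""
    return cycle[::2]

def resize_rows(data, n_rows):
    """Resample a (frames x 6) gait array to n_rows by per-column linear interpolation."""
    old = np.linspace(0.0, 1.0, data.shape[0])
    new = np.linspace(0.0, 1.0, n_rows)
    return np.column_stack([np.interp(new, old, data[:, j]) for j in range(data.shape[1])])

# Example: a 120-frame human cycle and 55 sampled robot states (placeholder shapes).
theta_M_cycle = np.random.rand(120, 6)
theta_S = extract_odd_frames(theta_M_cycle)                      # target gait data
theta_R = resize_rows(np.random.rand(55, 6), theta_S.shape[0])   # robot gait data, matched
```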
Step S3: extracting the implicit features in the biped robot gait data and the human gait data respectively by using a denoising autoencoder. Step S3 specifically includes: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS. The row vectors of ΘR and ΘS are fed one by one into the corresponding denoising autoencoder, and the obtained features are arranged in the original order to form the robot gait feature data HR and the target gait feature data HS; HR and HS are normalized together and provided to the deep reinforcement learning.
Because of the differences in geometric parameters between the human body and the robot, and in view of the generality of the invention and the noise present in the gait data, this embodiment uses a denoising autoencoder DAE to further process the gait data of the robot and the human body, so as to extract and encode more robust features from the existing gait data while eliminating the influence of model parameters and noise, allowing the robot to learn the human gait better.
The DAE is an improved algorithm based on the autoencoder. It has a simple structure and fast operation, is commonly used for data pre-processing in deep learning networks, can extract and encode more robust features from known data, and eliminates possible noise influence.
The DAE used in this embodiment is a neural network with a single hidden layer. It consists of three layers: the first layer is the input layer, which receives the original gait data and adds noise to obtain the noise-added data; the second layer is the hidden layer, in which the DAE encodes the noise-added data, and the encoding result can be regarded as the implicit feature of the original gait; the third layer is the output layer, which decodes and reconstructs the implicit feature of the hidden layer, and after the DAE has been trained the reconstructed output should be the same as the original gait data. The DAE updates its network parameters by gradient descent.
The DAE adjusts its network parameters through training; when the loss function formed by the original input x and the reconstructed output y is small, the hidden-layer output can be regarded as a representation of the original input x. Such a representation is called the feature of the input x and can serve as a good expression of the original input signal. By adding noise to the training data, the DAE forces the hidden layer to learn to remove the noise and still express the original gait information completely, and thus to learn a more robust representation of the input signal. The DAE workflow is shown in fig. 5; taking the robot gait data ΘR as an example, it can be described as follows:
S301: a row vector θr of ΘR is fed into the DAE. The DAE randomly erases elements of the original gait data θr according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, which ensures that the hidden-layer feature h represents the original gait data, and the overall error is measured by the overall loss function JDAE, where the decoding function of the DAE is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the DAE on a given training set is:
JDAE(θDAE) = Σθr L(θr, y)
where θDAE denotes the parameters of the DAE, including w, p and q; L is defined as the reconstruction error and describes how close y is to θr:
L(θr, y) = Σi=1..n (θr,i − yi)²
where n is the dimension of the input and output layers;
S303: the DAE training process uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
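The following PyTorch sketch mirrors S301 to S303 under the reconstructed formulas above: binomial erasure of the input, a tied Sigmoid encoder and decoder (w′ = wᵀ), a squared reconstruction error and plain gradient descent. The hidden size, erase probability, learning rate and epoch count are assumptions, and the gait rows are assumed to be scaled into [0, 1] so that a Sigmoid output can reconstruct them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAE(nn.Module):
    """Single-hidden-layer denoising autoencoder with tied weights (w' = w^T)."""
    def __init__(self, n_in=6, n_hidden=16):
        super().__init__()
        self.w = nn.Parameter(torch.randn(n_hidden, n_in) * 0.1)  # input -> hidden weights
        self.p = nn.Parameter(torch.zeros(n_hidden))               # hidden-layer bias
        self.q = nn.Parameter(torch.zeros(n_in))                   # output-layer bias

    def forward(self, theta, erase_prob=0.2):
        # S301: randomly erase elements (set them to 0) with a binomial mask.
        mask = torch.bernoulli(torch.full_like(theta, 1.0 - erase_prob))
        x_tilde = theta * mask
        h = torch.sigmoid(F.linear(x_tilde, self.w, self.p))       # h = s_f(w x~ + p)
        y = torch.sigmoid(F.linear(h, self.w.t(), self.q))         # y = s_g(w' h + q)
        return h, y

def train_dae(dae, data, alpha=0.05, epochs=200):
    """S303: gradient descent on the summed squared reconstruction error J_DAE."""
    opt = torch.optim.SGD(dae.parameters(), lr=alpha)
    for _ in range(epochs):
        _, y = dae(data)                 # data: rows of Theta_R or Theta_S, scaled to [0, 1]
        loss = ((data - y) ** 2).sum()   # reconstruction error L summed over the training set
        opt.zero_grad()
        loss.backward()
        opt.step()
    return dae
```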
This embodiment constructs two DAE networks with the same structure, DAER and DAEM, and trains them separately with the robot gait data ΘR and the human gait data ΘM. After being trained with a large amount of data, DAER and DAEM can extract the implicit features of the robot gait data and the human gait data in this embodiment; the extracted robot and human gait features are denoted hr and hm. DAEM is used to extract the feature of each row vector of ΘS, and the extracted features are arranged in the original order to obtain the target gait feature data HS. ΘR is processed in the same way to obtain the robot gait feature data HR. HS and HR are normalized together and provided to the deep reinforcement learning. HS and HR effectively represent the features of the robot and human gait data and can effectively reduce the influence of noise and geometric parameter differences on the deep reinforcement learning.
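Continuing the sketch above, feature extraction and joint normalization might look as follows; the min-max normalization is an assumption, since the text only states that HR and HS are normalized uniformly.

```python
import torch

def encode_rows(dae, data):
    """Run each row through a trained DAE encoder, keeping the original order."""
    with torch.no_grad():
        h, _ = dae(data, erase_prob=0.0)   # no erasure when extracting features
    return h

def joint_minmax(h_r, h_s):
    """Normalize H_R and H_S together so both lie in [0, 1] on a common scale."""
    both = torch.cat([h_r, h_s], dim=0)
    lo, hi = both.min(dim=0).values, both.max(dim=0).values
    scale = torch.clamp(hi - lo, min=1e-8)
    return (h_r - lo) / scale, (h_s - lo) / scale
```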
Step S4: learning the human gait features by a deep reinforcement learning method, and thereby planning the gait of the biped robot. Reinforcement learning is a major branch of machine learning; during the interaction between an agent and its environment it gradually improves the agent's action selection and finally achieves the control objective. Reinforcement learning does not require an accurate model of the agent, so it is well suited to controlling a biped robot. However, conventional reinforcement learning converges slowly; reinforcement learning improved by combining neural network techniques learns faster, but the samples collected during interaction are highly correlated in time and therefore not independent, which does not meet the training requirements of neural networks and makes the network very prone to overfitting. With the rapid development of deep learning, deep reinforcement learning has come into the view of researchers. Deep reinforcement learning is the combination of conventional reinforcement learning and deep learning; it uses the theory of deep learning to make up for the deficiencies of reinforcement learning and improves it greatly in all aspects.
Because the walking motion of the biped robot is continuous and the action space of the hip-joint drive motors is continuous, the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the robot. DDPG is an Actor-Critic algorithm improved from the deterministic policy gradient DPG; it uses neural networks to replace the policy function and the value function of conventional reinforcement learning, and the replacing networks are called the policy network μ and the Q network Q respectively. The policy network receives the robot state and returns the motor torques, and the Q network evaluates the choice of the policy network in combination with the robot state and the motor torques. The DDPG framework is shown in fig. 6.
In step S4, the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient DDPG, the target gait feature data HS is used as the basis of rt, and the DDPG outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the DDPG, so that the DDPG finally has the ability to control the robot to reach the target gait.
In order to solve the network oscillation and overfitting caused by the high temporal correlation of the samples collected during interaction, a memory pool is provided for the DDPG. The memory pool stores, for each gait cycle of the robot, the robot state st, the selected and executed motor torques at, the obtained reward rt and the robot state st+1 after the motors have acted, as one experience tuple (st, at, rt, st+1). When the neural networks need to be trained, n groups of experiences are randomly drawn from the memory pool as training data, where n is generally set to the minibatch size. The random-drawing mechanism breaks the temporal correlation between samples, preventing network oscillation and overfitting, and also allows the robot to learn from past experience as well as current experience.
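A minimal sketch of the memory pool described above: experience tuples are stored up to a capacity E and minibatches are drawn uniformly at random; the capacity and minibatch size are placeholders.

```python
import random
from collections import deque

class MemoryPool:
    """Experience replay: store (s_t, a_t, r_t, s_t1) tuples and sample them randomly."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def store(self, s_t, a_t, r_t, s_t1):
        self.buffer.append((s_t, a_t, r_t, s_t1))

    def sample(self, minibatch=64):
        # Random draws break the temporal correlation between consecutive steps.
        return random.sample(self.buffer, minibatch)

    def __len__(self):
        return len(self.buffer)
```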
In the DDPG, if only a single Q network is used for the evaluation training of the policy network, the learning process may be unstable, because the network parameters of the single Q network are updated frequently while also being used to compute the gradients of the Q network and the policy network. Therefore, in this embodiment the policy network μ and the Q network Q of the DDPG are copied after their network parameters have been initialized; the new networks obtained by copying are called the offline (target) policy network μ′ and the offline Q network Q′, and the original networks are called the online policy network μ and the online Q network Q. The online networks output the robot's actions and make the robot execute them while walking. The main function of the offline networks is to provide data support for the training of the online networks, so that the whole network converges more stably and quickly.
The online and offline networks have exactly the same structure; the difference lies in how their parameters are updated. The parameters of the online networks are updated using experiences randomly drawn from the memory pool, data provided by the offline networks, and stochastic gradient descent. The parameters of the offline networks are updated by soft update, which updates the offline network from the parameters of the online network. Taking the online and offline policy networks as an example, the soft update can be expressed as:
θμ′=τθμ+(1-τ)θμ′
where θμ and θμ′ are the network parameters of the online policy network and the offline policy network respectively, and τ generally takes the value 0.001. The soft update between the online Q network and the offline Q network has the same form.
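The soft update above can be written as a small helper that blends online parameters into the offline network at rate τ; a PyTorch sketch, assuming the two networks share the same architecture.

```python
import torch

def soft_update(online_net, offline_net, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter."""
    with torch.no_grad():
        for p_online, p_offline in zip(online_net.parameters(), offline_net.parameters()):
            p_offline.mul_(1.0 - tau).add_(tau * p_online)
```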
The training process flow diagram in this example is shown in fig. 7, and the training process can be described as follows:
the sampling frequency of the robot, the selection of the swing leg at the sampling starting moment and the target gait data are kept consistent. the state of the robot at the time t is thetaRDAE processed gait feature data H of the robotRAs stTarget gait feature data HSAs rtAccording to the method.
S401: in this embodiment, the policy network in the DDPG is a 5-layer CNN network: the first layer as an input layer for receiving st(ii) a The second layer and the third layer are convolution layers; the fourth layer is a full link layer; the fifth layer is an output layer, which sets the maximum action boundary and outputs the torque required to be executed by the motor. The Q network structure is approximately the same as the strategy network structure, and only the number of input layer units is increased to accommodate the motor torque atAnd the output layer unit is set to 1 and only the evaluation is returned.
The online network parameters are randomly initialized, and the initialized parameters are copied to the corresponding offline networks. The maximum number of experiences E that the memory pool can store, the size minibatch of the neural-network training data set and the number of training iterations T per training session are set; the online policy network learning rate lpolicy and the online Q network learning rate lQ are initialized; and the soft update rate τ and the maximum number of steps W of one interaction are set. When the robot falls or completes the maximum number of steps W, this is regarded as one complete interaction, counted as epi, and the maximum number of interactions is EPI. Finally, the robot state is randomly initialized.
S402: the state of the robot at the moment of swinging the swing leg is stThe online strategy network outputs a group of motor torque a according to the current networktIt can be expressed as:
at=μ(stμ)
wherein, atThe row vector of the motor is respectively the executing moment, the line number and the s of the motor of the supporting leg and the motor of the swinging leg at the hip jointtAnd (5) the consistency is achieved.
S403: in the swinging process of the swinging leg, two motors at the hip joint of the robot respectively execute corresponding motor torque atThe execution time of any row vector is the same as the sampling interval time. The motor firstly executes atFirst row of (2), in a Pair robot state θrAnd after sampling is finished, the execution torque is switched to the next row, and the execution is carried out in the sequence. The control torque of the embodiment uses square wave torque, so that the occurrence of shaking in the control process can be effectively avoided. When the swing leg of the robot collides with the ground, updating the step count w, and sampling all the sampled thetarFeeding into DAERTo obtain a new state s of the robott+1
S404: the design of the reward function is an important step in deep reinforcement learning work, and the good reward design can obviously improve the learning effect. The present embodiment uses a program reward design to guide training faster, rtAs follows:
When the robot does not fall, the smaller the difference between st+1 and HS, the larger rt, and rt remains greater than 0; when the robot falls, rt takes a penalty value. This guides the robot to approach the target gait without falling.
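One possible concrete form of the reward described above, given as a sketch only: the reward grows as st+1 approaches the corresponding target gait feature, and a fall yields a fixed penalty. The exponential shape and the penalty value are assumptions, since the excerpt does not give the exact expression.

```python
import numpy as np

def reward(s_t1, h_s_row, fallen, fall_penalty=-10.0):
    """Positive and increasing as s_t1 approaches the target gait feature; penalty on a fall."""
    if fallen:
        return fall_penalty
    return float(np.exp(-np.linalg.norm(np.asarray(s_t1) - np.asarray(h_s_row))))
```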
S405: will(s)t,at,rt,st+1) The experience number is stored as a group in the memory pool, and the memory pool experience number count exp is updated. According to different states, the counters operate differently as follows: 1) if the robot falls down, resetting the state of the robot, returning to execute S402 and resetting w; 2) if not fallen but W < W, then st+1As new stExecuting S402; 3) if W is not less than W and exp>E then sequentially executes S406 and resets w; 4) otherwise the reset robot state returns to execute S402 and reset w. The epi is updated when S401, S403, and S404 are executed.
S406: and randomly extracting the minimatch group experience from the memory pool as a training data set of the online network.
S407: extracting s in a training datasett,atSending the data to an online Q network to be evaluated: q(s)t,atQ). Data set st+1Sending the torque to an offline strategy network to obtain motor torque a't+1From the offline Q network pair st+1、a′t+1Evaluation was carried out: q'(s)t+1,μ′(st+1μ′)|θQ′). The loss function of the online Q network can then be expressed as:
LQ = (1/N)·Σi (yi − Q(si, ai|θQ))²
where yi = rt + γQ′(st+1, μ′(st+1|θμ′)|θQ′), N is the minibatch size and γ is the discount factor. The online Q network is updated according to LQ using stochastic gradient descent.
S408: calculating the strategy gradient in the strategy network, and defining the loss function of the online strategy network:
Lμ=Q(st,μ(st,θμ)|θQ)
the gradient of the online policy network can be calculated using the loss function of the policy network:
online policy network parameters are also updated using random gradient descent.
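A compact sketch of the S407 to S409 updates under the reconstructed formulas above: the online Q network is regressed toward yi, the online policy network follows the deterministic policy gradient by maximizing Q(st, μ(st)), and both offline networks are then soft-updated. The discount factor, learning rates and tensor shapes are placeholders, and the networks and optimizers are assumed to be defined as in the earlier sketches.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, mu, mu_off, q, q_off, opt_mu, opt_q, gamma=0.99, tau=0.001):
    """One training step on a minibatch of (s_t, a_t, r_t, s_t1) tensors."""
    s_t, a_t, r_t, s_t1 = batch
    r_t = r_t.view(-1, 1)                    # column shape, matching the critic output

    # S407: online Q network regression target y_i computed from the offline networks.
    with torch.no_grad():
        y = r_t + gamma * q_off(s_t1, mu_off(s_t1))
    loss_q = F.mse_loss(q(s_t, a_t), y)
    opt_q.zero_grad(); loss_q.backward(); opt_q.step()

    # S408: deterministic policy gradient, implemented by maximizing Q(s_t, mu(s_t)).
    loss_mu = -q(s_t, mu(s_t)).mean()
    opt_mu.zero_grad(); loss_mu.backward(); opt_mu.step()

    # S409: soft update of the offline (target) networks.
    with torch.no_grad():
        for net, net_off in ((mu, mu_off), (q, q_off)):
            for p, p_off in zip(net.parameters(), net_off.parameters()):
                p_off.mul_(1.0 - tau).add_(tau * p)
    return loss_q.item(), loss_mu.item()
```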
S409: after the network parameters of the online policy network and the online Q network are updated, the offline policy network and the offline Q network are updated through soft update:
S410: the network-training count time is updated; when time exceeds the number of training iterations T per session, the current network training ends and S411 is executed; otherwise execution returns to S406 to continue network training.
S411: when EPI > EPI, DDPG calculation is finished, and the online strategy network is saved as a controller. If EPI < EPI, the robot status is reset and execution returns to S402.
During the walking of the robot, the DDPG is used continuously for learning and training until the online policy network μ and the online Q network Q converge or the maximum number of interactions EPI is reached. When the networks in the DDPG have converged, the online policy network can control the robot from a random initial gait until the target gait is reached. Similarly, if an external disturbance is applied during walking, the first step after the disturbance can be regarded as a new initial gait and the DDPG can control it effectively, as long as the robot is not in a fallen state. Using the target gait feature data HS as the basis of rt allows rt to describe both the stability and the compliance of the robot gait. This embodiment combines deep reinforcement learning with human gait data, so that the robot finally obtains a stable and smooth gait like that of a human.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (6)

1. A biped robot gait planning method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: establishing a biped robot model and describing the walking process of the robot;
step S2: acquiring and processing human body gait data and target gait data;
step S3: respectively extracting implicit characteristics in gait data of the biped robot and human body gait data by using a noise reduction automatic encoder;
step S4: learning the human gait features by deep reinforcement learning, so as to plan the gait of the biped robot.
2. The biped robot gait planning method based on deep reinforcement learning of claim 1, wherein the step S1 specifically comprises the following steps:
step S101: establishing a 4-link robot model with knees and arc feet; the robot model comprises 2 thighs, 2 shanks and 2 arc feet, the leg links are connected by frictionless hinges through rigid rods, the arc feet are fixedly connected to the shanks respectively, the supporting leg and the swing leg have identical mass and geometric parameters, the mass of each leg is uniformly distributed, a limiting mechanism is arranged at the knee joints of the robot model to simulate the function of the human knee joint, and two motors are arranged at the hip joint to apply control torques to the supporting leg and the swing leg respectively;
step S102: analyzing the walking process of the model from the viewpoint of the right side of the robot's advancing direction, selecting dimensionless physical quantities that represent the robot state in real time, and defining the selected quantities as the robot walking state Θr; the robot walking state is described as:
Θr = [θr1, θ̇r1, θr2, θ̇r2, θr3, θ̇r3]
where counterclockwise rotation is taken as positive; θr1 and θ̇r1 are the angle and angular velocity of the supporting leg with respect to the vertical direction; θr2 and θ̇r2 are the angle and angular velocity of the swing-leg thigh with respect to the vertical direction; θr3 and θ̇r3 are the angle and angular velocity of the swing-leg shank with respect to the vertical direction.
3. The biped robot gait planning method based on deep reinforcement learning of claim 2, wherein the step S2 specifically comprises the following steps:
step S201: defining one gait cycle as the process from the moment the swing leg starts to swing to the moment the swing leg collides with the ground, for both the human body and the robot;
step S202: selecting a data set of the normal walking process of the human body from a CMU human body motion capture database, and carrying out human body division and calculation on the data set to obtain the description of the walking process of the human body;
step S203: taking the robot model as a reference, taking the 2D plane of the longitudinal direction of human walking, and defining the human walking state as Θm; every frame in the description of the human walking process is expressed as a Θm and taken as a row vector, and the row vectors are combined to obtain the human gait data ΘM;
step S204: selecting one gait cycle from the human gait data ΘM as the learning object of the robot, extracting the odd frames of the learning-object data to form a new data set, and defining the new data set as the target gait data ΘS, wherein any row vector of the target gait data ΘS is an extracted θm;
step S205: sampling the robot walking state θr within one gait cycle at the sampling frequency of ΘS to form the robot gait data ΘR, wherein any row vector of the robot gait data ΘR is a sampled θr.
4. The biped robot gait planning method based on deep reinforcement learning of claim 3, wherein step S3 specifically comprises: constructing two denoising autoencoders with identical structure according to the data structure of θr and θm, and using them to extract features from the robot gait data ΘR and the target gait data ΘS; feeding the row vectors of ΘR and ΘS one by one into the corresponding denoising autoencoder, and arranging the obtained features in the original order to form the robot gait feature data HR and the target gait feature data HS; and normalizing HR and HS together for the deep reinforcement learning, wherein the workflow of each denoising autoencoder is as follows:
S301: a row vector θ of ΘR or ΘS is fed into the denoising autoencoder; the denoising autoencoder randomly erases elements of the original gait data θ according to a binomial distribution, the erased elements being set to 0, to obtain the noise-containing gait data x̃; x̃ is mapped to the hidden layer by the encoding function f to obtain the hidden-layer feature h, where the encoding function of the denoising autoencoder is:
h = f(x̃) = sf(w·x̃ + p)
where w is the weight matrix between the input layer and the hidden layer, p is the bias of the hidden layer, and sf is the activation function of the encoding function f, taken as the Sigmoid function;
S302: the hidden-layer feature h is mapped to the output layer by the decoding function g to obtain the reconstructed output y; the reconstructed output y retains the information of the original gait data, and its overall error is measured by the overall loss function JDAE, where the decoding function of the denoising autoencoder is:
y = g(h) = sg(w′·h + q)
where w′ is the weight matrix between the hidden layer and the output layer, with w′ = wᵀ, q is the bias of the output layer, and sg is the activation function of the decoding function, also a Sigmoid function; the overall loss function of the denoising autoencoder on a given training set is:
JDAE(θDAE) = Σθ L(θ, y)
where θDAE denotes the parameters of the denoising autoencoder, including w, p and q; L is defined as the reconstruction error and describes how close y is to θ:
L(θ, y) = Σi=1..n (θi − yi)²
where n is the dimension of the input and output layers;
S303: the training process of the denoising autoencoder uses gradient descent to iteratively minimize JDAE(θDAE); the gradient-descent update of θDAE is:
θDAE ← θDAE − α·∂JDAE(θDAE)/∂θDAE
where α is the learning rate and takes a value in [0,1].
5. The biped robot gait planning method based on deep reinforcement learning of claim 3, characterized in that in step S4 the deep deterministic policy gradient algorithm DDPG is selected as the learning algorithm of the biped robot; the robot gait feature data HR processed by the denoising autoencoder is used as the input data st of the deep deterministic policy gradient algorithm, the target gait feature data HS is used as the basis of rt, and the deep deterministic policy gradient algorithm outputs the motor execution torques at; the robot collects state data at each step during continuous walking and provides them for training the deep deterministic policy gradient algorithm, so that the algorithm finally has the ability to control the robot to reach the target gait.
6. The biped robot gait planning method based on deep reinforcement learning of claim 5, wherein the policy network of the deep deterministic policy gradient algorithm adopts a 5-layer convolutional neural network comprising an input layer, two convolutional layers, a fully connected layer and an output layer, wherein the input layer receives st and the output layer outputs the torques at to be executed by the motors.
CN201810979187.2A 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning Active CN108983804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810979187.2A CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN108983804A true CN108983804A (en) 2018-12-11
CN108983804B CN108983804B (en) 2020-05-22

Family

ID=64547820

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810979187.2A Active CN108983804B (en) 2018-08-27 2018-08-27 Biped robot gait planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN108983804B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101751037A (en) * 2008-12-03 2010-06-23 上海电气集团股份有限公司 Dynamic walking control method for biped walking robot
CN104751172A (en) * 2015-03-12 2015-07-01 西安电子科技大学 Method for classifying polarized SAR (Synthetic Aperture Radar) images based on de-noising automatic coding
CN106127804A (en) * 2016-06-17 2016-11-16 淮阴工学院 The method for tracking target of RGB D data cross-module formula feature learning based on sparse depth denoising own coding device
CN106406162A (en) * 2016-08-12 2017-02-15 广东技术师范学院 Alternating current servo control system based on transfer neural network
US20180268262A1 (en) * 2017-03-15 2018-09-20 Fuji Xerox Co., Ltd. Information processing device and non-transitory computer readable medium
CN107506333A (en) * 2017-08-11 2017-12-22 深圳市唯特视科技有限公司 A kind of visual token algorithm based on ego-motion estimation
CN108241375A (en) * 2018-02-05 2018-07-03 景德镇陶瓷大学 A kind of application process of self-adaptive genetic operator in mobile robot path planning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
KAI HENNING KOCH 等: "《Optimization-based walking generation for humanoid robot》", 《10TH IFAC SYMPOSIUM ON ROBOT CONTROL》 *
吴晓光 等: "《一种基于神经网络的双足机器人》", 《中国机械工程》 *
胡运富 等: "《简单双足被动行走模型仿真和分析》", 《哈尔滨工业大学学报》 *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures tumbling target motion planning method
CN110046457A (en) * 2019-04-26 2019-07-23 百度在线网络技术(北京)有限公司 Control method, device, electronic equipment and the storage medium of manikin
CN110046457B (en) * 2019-04-26 2021-02-05 百度在线网络技术(北京)有限公司 Human body model control method and device, electronic equipment and storage medium
CN112149835A (en) * 2019-06-28 2020-12-29 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN112149835B (en) * 2019-06-28 2024-03-05 杭州海康威视数字技术股份有限公司 Network reconstruction method and device
CN110496377B (en) * 2019-08-19 2020-07-28 华南理工大学 Virtual table tennis player ball hitting training method based on reinforcement learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN110764415A (en) * 2019-10-31 2020-02-07 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN110764415B (en) * 2019-10-31 2022-04-15 清华大学深圳国际研究生院 Gait planning method for leg movement of quadruped robot
CN112782973A (en) * 2019-11-07 2021-05-11 四川省桑瑞光辉标识系统股份有限公司 Biped robot walking control method and system based on double-agent cooperative game
CN110711055A (en) * 2019-11-07 2020-01-21 江苏科技大学 Image sensor intelligence artificial limb leg system based on degree of depth learning
CN110861084A (en) * 2019-11-18 2020-03-06 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN110861084B (en) * 2019-11-18 2022-04-05 东南大学 Four-legged robot falling self-resetting control method based on deep reinforcement learning
CN111625002A (en) * 2019-12-24 2020-09-04 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111625002B (en) * 2019-12-24 2022-12-13 杭州电子科技大学 Stair-climbing gait planning and control method of humanoid robot
CN111142378A (en) * 2020-01-07 2020-05-12 四川省桑瑞光辉标识系统股份有限公司 Neural network optimization method of biped robot neural network controller
CN111241700B (en) * 2020-01-19 2022-12-30 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111241700A (en) * 2020-01-19 2020-06-05 中国科学院光电技术研究所 Intelligent design method of microwave broadband super-surface absorber
CN111558937A (en) * 2020-04-07 2020-08-21 向仲宇 Robot motion control method based on deep learning
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
CN111814618A (en) * 2020-06-28 2020-10-23 浙江大华技术股份有限公司 Pedestrian re-identification method, gait identification network training method and related device
CN111814618B (en) * 2020-06-28 2023-09-01 浙江大华技术股份有限公司 Pedestrian re-recognition method, gait recognition network training method and related devices
CN112060075A (en) * 2020-07-21 2020-12-11 深圳先进技术研究院 Training method, training device and storage medium for gait generation network
CN112171660A (en) * 2020-08-18 2021-01-05 南京航空航天大学 Space double-arm system constrained motion planning method based on deep reinforcement learning
CN112256028A (en) * 2020-10-15 2021-01-22 华中科技大学 Method, system, equipment and medium for controlling compliant gait of biped robot
CN112232350A (en) * 2020-10-27 2021-01-15 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112232350B (en) * 2020-10-27 2022-04-19 广东技术师范大学 Paddy field robot mechanical leg length adjusting method and system based on reinforcement learning
CN112666939A (en) * 2020-12-09 2021-04-16 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN112666939B (en) * 2020-12-09 2021-09-10 深圳先进技术研究院 Robot path planning algorithm based on deep reinforcement learning
CN114684293A (en) * 2020-12-28 2022-07-01 成都启源西普科技有限公司 Robot walking simulation algorithm
CN114047697B (en) * 2021-11-05 2023-08-25 河南科技大学 Four-foot robot balance inverted pendulum control method based on deep reinforcement learning
CN114047697A (en) * 2021-11-05 2022-02-15 河南科技大学 Four-footed robot balance inverted pendulum control method based on deep reinforcement learning
CN115366099A (en) * 2022-08-18 2022-11-22 江苏科技大学 Mechanical arm depth certainty strategy gradient training method based on forward kinematics
CN115366099B (en) * 2022-08-18 2024-05-28 江苏科技大学 Mechanical arm depth deterministic strategy gradient training method based on forward kinematics
CN117572877A (en) * 2024-01-16 2024-02-20 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment
CN117572877B (en) * 2024-01-16 2024-05-31 科大讯飞股份有限公司 Biped robot gait control method, biped robot gait control device, storage medium and equipment

Also Published As

Publication number Publication date
CN108983804B (en) 2020-05-22

Similar Documents

Publication Publication Date Title
CN108983804B (en) Biped robot gait planning method based on deep reinforcement learning
Jiang et al. Ditto: Building digital twins of articulated objects from interaction
Amarjyoti Deep reinforcement learning for robotic manipulation-the state of the art
Hu et al. Chainqueen: A real-time differentiable physical simulator for soft robotics
US20200293881A1 (en) Reinforcement learning to train a character using disparate target animation data
Zhu et al. Off-road autonomous vehicles traversability analysis and trajectory planning based on deep inverse reinforcement learning
Piergiovanni et al. Learning real-world robot policies by dreaming
Melo et al. Learning humanoid robot running skills through proximal policy optimization
CN111160294B (en) Gait recognition method based on graph convolution network
Chaffre et al. Sim-to-real transfer with incremental environment complexity for reinforcement learning of depth-based robot navigation
CN111546349A (en) New deep reinforcement learning method for humanoid robot gait planning
Taniguchi et al. Hippocampal formation-inspired probabilistic generative model
CN103839280B (en) A kind of human body attitude tracking of view-based access control model information
CN116959094A (en) Human body behavior recognition method based on space-time diagram convolutional network
CN113359744B (en) Robot obstacle avoidance system based on safety reinforcement learning and visual sensor
Prasad et al. Mild: multimodal interactive latent dynamics for learning human-robot interaction
Giammarino et al. Combining imitation and deep reinforcement learning to accomplish human-level performance on a virtual foraging task
CN113569466A (en) Parameterized deep reinforcement learning algorithm based on value function
CN108805965B (en) Human physical motion generation method based on multi-objective evolution
CN115879377B (en) Training method of decision network for intelligent flying car mode switching
Wang et al. RL-NBV: A deep reinforcement learning based next-best-view method for unknown object reconstruction
Wan et al. DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
CN116901071A (en) Simulation learning mechanical arm grabbing method and device based on multi-scale sequence model
Allday et al. Auto-perceptive reinforcement learning (APRIL)
Zhang Continuous control for robot based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant