CN104932264B - Humanoid robot stability control method based on an RBF-network Q-learning framework - Google Patents

Humanoid robot stability control method based on an RBF-network Q-learning framework Download PDF

Info

Publication number
CN104932264B
CN104932264B CN201510299823.3A
Authority
CN
China
Prior art keywords
learning
rbf
function
pitch
ankle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510299823.3A
Other languages
Chinese (zh)
Other versions
CN104932264A (en)
Inventor
毕盛
黄铨雍
韦如明
闵华清
董敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510299823.3A priority Critical patent/CN104932264B/en
Publication of CN104932264A publication Critical patent/CN104932264A/en
Application granted granted Critical
Publication of CN104932264B publication Critical patent/CN104932264B/en


Landscapes

  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a humanoid robot stability control method based on an RBF-network Q-learning framework. The method includes: proposing a Q-learning framework based on an RBF network (RBF-Q Learning), which solves the problems of state-space continuity and action-space continuity in the Q-learning process; proposing an RBF-network-based Q-learning online action-adjustment stability control algorithm, which generates the hip, knee and ankle joint trajectories of the support leg and controls stable walking of the humanoid robot by computing the remaining joint angles; and finally verifying the feasibility and effectiveness of the RBF-Q Learning framework method on the Vitruvian Man humanoid robot platform designed in this laboratory. The present invention can generate a gait for stable humanoid robot walking through online learning.

Description

Humanoid robot stability control method based on an RBF-network Q-learning framework
Technical Field
The invention relates to the field of humanoid robot walking stability control, and in particular to a humanoid robot stability control method based on a Q Learning framework built on an RBF network (RBF-Q Learning).
Background
The essence of biped walking control research on a humanoid robot platform is solving a complex control problem. Such problems are usually addressed by modeling the whole system and solving the system equations. In practice, however, we often encounter systems that are difficult to describe accurately with a model, or whose parameters are too complex for the system equations to be solved. In such cases the problem can be solved by learning rather than by elaborate modeling.
The biped walking control problem of a humanoid robot is highly unstable and nonlinear, and a satisfactory solution is difficult to obtain through accurate modeling alone. Reinforcement learning and neural network methods have proven effective for complex control problems. These methods do not require the system designer to have a deep and precise understanding of the system dynamics; through learning, they may yield solutions beyond the designer's own knowledge. At the same time, such methods can keep learning and improving, just as animals in nature acquire most of their abilities through learning and adaptation.
Disclosure of Invention
To address the difficulty that standard Q Learning cannot easily handle continuous state and action spaces, the invention provides a Q Learning framework based on an RBF network (RBF-Q Learning). Using this framework, a walking stability control method for a humanoid robot is designed and implemented, and its effectiveness is finally verified in simulation and on a physical robot.
The invention provides an RBF-network-based Q-learning humanoid robot stability control method which enables the robot to generate a stable gait plan through online learning and thus walk stably. The method comprises the following steps:
(1) Designing a Q Learning framework (RBF-Q Learning) based on the RBF network.
The invention designs a Q-learning framework based on an RBF network for continuous spaces. The framework uses an RBF network, with its strong global approximation capability, to fit the Q function, and uses a gradient descent method to find the maximum Q value and the optimal behavior in each iteration. The algorithm can adjust and learn the RBF network structure and parameters online in real time according to the complexity of the problem, and has good generalization capability.
Combining the RBF network with Q learning, the invention designs an RBF-Q Learning algorithm framework in which the RBF network approximately fits the Q function. Suppose the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t).
1) RBF neural network design
Input layer: s(t) is the state input to the Q function at time t in Q learning; a(t) is the action input to the Q function at time t in Q learning.
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron. For the RBF activation function of the i-th neuron, its output is computed as

y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,

where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
An output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiWeights in the Q function are output for the ith neuron.
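By way of illustration, a minimal Python sketch of the forward pass just described is given below, assuming the state and action vectors are simply concatenated into the network input and that the output layer is linear; the array names (centers, widths, weights) are illustrative and not taken from the patent.

import numpy as np

def rbf_q_forward(s, a, centers, widths, weights):
    """Q(s, a) from an RBF network: Gaussian hidden units over [s, a], linear output layer."""
    x = np.concatenate([s, a])                 # input layer: state s(t) and action a(t)
    d2 = np.sum((centers - x) ** 2, axis=1)    # squared distance to each of the k centers mu_i
    y = np.exp(-d2 / (2.0 * widths ** 2))      # hidden layer: y_i = exp(-||x - mu_i||^2 / (2 sigma_i^2))
    return float(weights @ y), y               # output layer: Q(t) = sum_i w_i * y_i(t)

Here centers has shape (k, n+m), while widths and weights have shape (k,).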
2) RBF network update
The Q-learning error δ_Q is defined as:
δ_Q = (1 - λ)(r + γQ_max - Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the attenuation factor (0 < γ < 1); Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* is the selected optimal action; s is the input state. The error δ_Q indicates how far the Q function has converged during learning. The learning performance index E of the system is defined as

E(t) = (1/2) δ_Q(t)^2.
The RBF network is updated with the BP algorithm and gradient descent. The output weight w_i of each neuron is updated as

w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),

where α_w is the learning rate. For E(t) and w_i(t) we have, by the chain rule, ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
The center μ_i and standard deviation σ_i of the RBF function of each neuron are updated analogously by gradient descent:

μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3

where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively.
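Continuing the sketch above, one update of the network driven by δ_Q could be written as follows. The learning-rate values are arbitrary placeholders, and the center and width updates implement the standard gradient of the Gaussian kernel, which is an assumption about the exact form intended here.

import numpy as np

def rbf_q_update(x, y, q, q_max, r, centers, widths, weights,
                 lam=0.5, gamma=0.9, a_w=0.05, a_mu=0.01, a_sigma=0.01):
    """One RBF-network update driven by the Q-learning error (float arrays updated in place)."""
    delta_q = (1.0 - lam) * (r + gamma * q_max - q)   # delta_Q = (1 - lambda)(r + gamma*Q_max - Q(s, a*, t))
    diff = x - centers                                # (k, dim) offsets from each kernel center
    # assumed gradient-descent updates for the kernel centers and widths
    centers += a_mu * delta_q * (weights * y)[:, None] * diff / (widths ** 2)[:, None]
    widths += a_sigma * delta_q * (weights * y) * np.sum(diff ** 2, axis=1) / widths ** 3
    weights += a_w * delta_q * y                      # w_i(t+1) = w_i(t) + alpha_w * delta_Q(t) * y_i(t)
    return delta_q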
3) gradient descent method for solving Q learning next-step behavior
For discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next-step optimal behavior. For a Q function over continuous behaviors, the next-step behavior is instead found with a gradient descent method.
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}. Assuming the current state is s(t), the function -Q(s(t), b, t) has gradient direction

d = -∂Q(s(t), b, t)/∂b.

In each solution iteration, a is updated in the direction opposite to this gradient:

a(i+1) = a(i) - λ_a d(i),

where λ_a is the step size and d(i) is the gradient evaluated at b = a(i). Solving max{Q(s(t), b, t) | b ∈ A} by gradient descent then proceeds through the following overall steps (an illustrative sketch follows these steps):
① Initialize the parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assign an initial value a(0) and set i = 0;
② For a(i), compute the current gradient direction d(i) = -∂Q(s(t), b, t)/∂b at b = a(i);
③ Update a(i+1) = a(i) - λ_a d(i);
④ Compute the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step ②.
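A minimal sketch of steps ① to ④ follows, assuming the gradient of Q with respect to the action is estimated numerically by central differences (a convenience of the sketch; the text does not specify how the derivative is obtained).

import numpy as np

def best_action_by_gradient(q_of, s, a0, step=0.05, tol=1e-4, max_iter=100, eps=1e-3):
    """Approximate argmax_b Q(s, b) by descending -Q(s, b) along its gradient in b."""
    a = np.asarray(a0, dtype=float).copy()
    for i in range(max_iter):
        grad = np.zeros_like(a)                    # gradient of -Q(s, b) at b = a(i)
        for j in range(a.size):
            e = np.zeros_like(a)
            e[j] = eps
            grad[j] = -(q_of(s, a + e) - q_of(s, a - e)) / (2.0 * eps)
        a_new = a - step * grad                    # a(i+1) = a(i) - lambda_a * d(i)
        if np.linalg.norm(a_new - a) <= tol:       # Delta E = ||a(i+1) - a(i)|| <= Delta E_min
            return a_new
        a = a_new
    return a

Here q_of could be, for example, lambda st, ac: rbf_q_forward(st, ac, centers, widths, weights)[0] from the earlier sketch.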
(2) Designing an online action-adjustment stability controller based on the RBF-Q Learning framework
Two stability controllers are designed, one for the fore-aft (front-back) direction of the robot and one for the left-right direction:
1) Stability control in the fore-aft direction
Taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t.
Since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the robot's behavior, an immediate return function is computed from the body deflection angle obtained from the attitude sensor information.
The immediate return function of the fore-aft reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible.
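For illustration only, the following sketch assembles the fore-aft state vector and an immediate-return function of the kind described above. Since the exact return formula is not reproduced in the text, the piecewise form used here (no penalty while |θ_xz| stays inside the error band ε, otherwise a penalty on the excess, plus a term on |Δθ_xz|) and the weight values are assumptions consistent with the stated intent.

import numpy as np

def pitch_state(theta_hip, theta_knee, theta_ankle, theta_xz):
    """s_pitch(t) = [theta_hip_pitch, theta_knee_pitch, theta_ankle_pitch, theta_xz]."""
    return np.array([theta_hip, theta_knee, theta_ankle, theta_xz])

def pitch_reward(theta_xz, d_theta_xz, a1=1.0, a2=0.5, eps=0.05):
    """Assumed immediate return: keep theta_xz inside the error band and its rate of change small."""
    band_penalty = 0.0 if abs(theta_xz) <= eps else abs(theta_xz) - eps
    return -(a1 * band_penalty + a2 * abs(d_theta_xz))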
2) Stability control in the left-right direction
For stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t.
Since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
Compared with the prior art, the invention has the following advantages:
(1) The Q Learning framework (RBF-Q Learning) method based on the RBF network optimizes the robot's walking stability and provides online learning capability. After a certain amount of learning, the humanoid robot can walk stably across complex ground environments.
(2) The biped walking control problem of a humanoid robot is highly unstable and nonlinear and is difficult to solve by accurate modeling. The RBF-network-based Q Learning framework (RBF-Q Learning) method does not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, the method can provide solutions beyond the designer's own knowledge. At the same time, the method can keep learning and improving, just as animals in nature acquire most of their abilities through learning and adaptation.
Drawings
FIG. 1 is a block diagram of an RBF-Q Learning network architecture.
FIG. 2 is a block diagram of the RBF-Q Learning algorithm.
Fig. 3 shows the angular velocity curves of the robot walking on uphill terrain under online action-adjustment stability control (after 1000 walks); the upper curve is the angular velocity of the robot about the y-axis (forward-backward swing) and the lower curve is the angular velocity of the humanoid robot about the x-axis (left-right swing).
Fig. 4 shows the angular velocity curves of the robot walking on rough terrain under online action-adjustment stability control (after 1000 walks); the upper curve is the angular velocity of the robot about the y-axis (forward-backward swing) and the lower curve is the angular velocity of the humanoid robot about the x-axis (left-right swing).
Detailed Description
The embodiments of the present invention are described in detail below in conjunction with the accompanying drawings, but the invention is not limited thereto. Any symbols or processes not described in particular detail below can be implemented by those skilled in the art by reference to the prior art.
(1) ZMP analysis is carried out on a simplified humanoid robot model using a three-dimensional inverted pendulum model, and the center-of-mass and foothold trajectories during walking are computed. From these trajectories, inverse kinematics analysis yields the motion trajectories of all joints during walking, which are stored as the robot's offline basic gait information.
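As a sketch of how the three-dimensional linear inverted pendulum propagates the center of mass during a support phase, the closed-form solution can be evaluated per horizontal axis as below; the CoM height z_c and the per-axis treatment are illustrative assumptions, not parameters taken from the patent.

import numpy as np

def lipm_com(x0, v0, t, z_c=0.26, g=9.81):
    """CoM position and velocity of a 3-D linear inverted pendulum along one horizontal axis,
    measured relative to the current support point, with constant CoM height z_c."""
    Tc = np.sqrt(z_c / g)                                    # pendulum time constant
    pos = x0 * np.cosh(t / Tc) + Tc * v0 * np.sinh(t / Tc)
    vel = (x0 / Tc) * np.sinh(t / Tc) + v0 * np.cosh(t / Tc)
    return pos, vel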
(2) Designing a Q Learning framework (RBF-Q Learning) based on the RBF network.
1) Fitting the Q function with the RBF network
The RBF network is used to approximately fit the Q function in Q learning. Suppose the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t); the RBF neural network is constructed as follows (see Fig. 1).
Input layer: s(t) is the state input to the Q function at time t in Q learning and has n dimensions; a(t) is the action input to the Q function at time t and has m dimensions.
Hidden layer: y(t) denotes the hidden-layer RBF activation functions, of which there are k. A Gaussian kernel is used as the RBF activation function of each neuron; for the RBF activation function of the i-th neuron, its output is computed as

y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,

where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
An output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiWeights in the Q function are output for the ith neuron.
To update the RBF network, the Q-learning error δ_Q is defined as:
δ_Q = (1 - λ)(r + γQ_max - Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the attenuation factor (0 < γ < 1); Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* is the selected optimal action; s is the input state. The error δ_Q indicates how far the Q function has converged during learning. The learning performance index E of the system is defined as

E(t) = (1/2) δ_Q(t)^2.
The RBF network is updated with the BP algorithm and gradient descent. The output weight w_i of each neuron is updated as

w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),

where α_w is the learning rate. For E(t) and w_i(t) we have, by the chain rule, ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
The center μ_i and standard deviation σ_i of the RBF function of each neuron are updated analogously by gradient descent:

μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3

where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively.
2) Gradient descent method for solving Q learning next-step behavior
For discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next-step optimal behavior. For a Q function over continuous behaviors, the next-step behavior is instead found with a gradient descent method.
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}. Assuming the current state is s(t), the function -Q(s(t), b, t) has gradient direction

d = -∂Q(s(t), b, t)/∂b.

In each solution iteration, a is updated in the direction opposite to this gradient:

a(i+1) = a(i) - λ_a d(i),

where λ_a is the step size and d(i) is the gradient evaluated at b = a(i). Solving max{Q(s(t), b, t) | b ∈ A} by gradient descent then proceeds through the following overall steps:
① Initialize the parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assign an initial value a(0) and set i = 0;
② For a(i), compute the current gradient direction d(i) = -∂Q(s(t), b, t)/∂b at b = a(i);
③ Update a(i+1) = a(i) - λ_a d(i);
④ Compute the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step ②.
Combining the RBF neural network with the gradient descent method gives the complete RBF-Q Learning algorithm framework; the algorithm flow chart is shown in Fig. 2.
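Building on the earlier sketches, one complete RBF-Q Learning iteration of the flow in Fig. 2 could be organized as follows. The net dictionary and the env_step interface are hypothetical, and Q_max is taken as the maximum over actions at the successor state, which is one interpretation of the "current Q maximum value" in the text.

import numpy as np

def rbf_q_learning_step(s, a_prev, env_step, net):
    """One iteration: choose an action by gradient search, execute it, then update the network."""
    q_of = lambda st, ac: rbf_q_forward(st, ac, net["centers"], net["widths"], net["weights"])[0]
    a_star = best_action_by_gradient(q_of, s, a_prev)        # next behavior b maximizing Q(s, b)
    s_next, r = env_step(a_star)                             # execute it and observe the immediate return
    q, y = rbf_q_forward(s, a_star, net["centers"], net["widths"], net["weights"])
    b_next = best_action_by_gradient(q_of, s_next, a_star)   # best behavior at the successor state
    q_max = q_of(s_next, b_next)                             # assumed reading of Q_max
    rbf_q_update(np.concatenate([s, a_star]), y, q, q_max, r,
                 net["centers"], net["widths"], net["weights"])
    return s_next, a_star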
(3) Designing an online action-adjustment stability controller based on the RBF-Q Learning framework
The state input and behavior output of the RBF-Q Learning framework for humanoid robot walking are now designed. The biped walking of the humanoid robot alternates between two support phases (taking the case where the first step is taken with the right foot as an example): the left-foot support phase converts to the right-foot support phase, usually with a short double-support phase inserted between them, and this cycle repeats. In the left-foot support phase, the three-dimensional inverted pendulum formed over the supporting left foot is stabilized mainly by the left-foot servos: stability in the fore-aft direction is determined by the left-leg hip pitch servo, knee servo and ankle servo, while stability in the left-right direction is determined by the left-leg hip roll servo and ankle roll servo. Likewise, in the right-foot support phase, fore-aft stability is determined by the right-leg hip pitch servo, knee servo and ankle servo, and left-right stability by the right-leg hip roll servo and ankle roll servo. According to these structural characteristics, two stability controllers are designed, one for the fore-aft direction and one for the left-right direction.
1) Stability control in the fore-aft direction
Taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t.
Since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the robot's behavior, an immediate return function is computed from the body deflection angle obtained from the attitude sensor information; the immediate return function of the fore-aft reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible.
2) Stability control in the left-right direction
For stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t.
Since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
3) Online action adjustment stability control flow based on RBF-Q Learning framework
During walking, for each action to be executed the stability controller acquires sensor information from a Kalman filtering algorithm and computes the current state from the current offline basic gait. It then updates the RBF-Q Learning framework according to the flow shown in Fig. 2, obtains the next action, and corrects the action to be executed in real time.
In summary, each RBF-Q Learning online action-adjustment stability controller follows these algorithm steps (an illustrative sketch follows the list):
① Initialize the RBF-Q Learning framework.
② For each walking action to be executed, acquire the torso-plumb-line angle and its angular velocity from the Kalman filtering fusion algorithm and compute the current state according to the state definition.
③ Compute the optimal behavior from the current state using the RBF-Q Learning framework.
④ Modify the next walking action using the optimal behavior obtained in step ③.
⑤ Execute the next action, obtain the current immediate return value, update the RBF-Q Learning framework, and jump to step ②.
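A sketch of control steps ① to ⑤ is given below, reusing the illustrative routines from the earlier sketches; base_gait, read_torso_angle and apply_adjusted_action are hypothetical interfaces to the offline basic gait, the Kalman-filter fusion algorithm and the servo layer.

import numpy as np

def online_stability_control(net, base_gait, read_torso_angle, apply_adjusted_action, n_actions=1000):
    """For every planned walking action: sense, adjust online, execute, and update the learner."""
    a = np.zeros(3)                                          # last adjustment (hip, knee, ankle pitch)
    for k in range(n_actions):
        theta_xz, _ = read_torso_angle()                     # torso-plumb-line angle from sensor fusion
        s = np.concatenate([base_gait(k), [theta_xz]])       # state from offline gait plus torso angle

        def env_step(adjustment):
            apply_adjusted_action(k, adjustment)             # correct and execute the next walking action
            th, dth = read_torso_angle()
            return np.concatenate([base_gait(k + 1), [th]]), pitch_reward(th, dth)

        _, a = rbf_q_learning_step(s, a, env_step, net)      # select, act and update the RBF network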
(4) Experimental testing and results analysis
1) Simulation experiment test and result analysis
Online stability control of humanoid robot walking is performed using the online action-adjustment stability controller based on the RBF-Q Learning framework. The humanoid robot learns in a simulation environment, continuously adapting its basic gait to the environment until the continuous-walking objective is achieved.
In this set of experiments, the algorithm converged after about 1000 walks and then completed 10 consecutive walking steps on uphill and rugged terrain, respectively. The results show that, after a period of learning, the humanoid robot under online action-adjustment stability control based on the RBF-Q Learning framework is able to walk through complex terrain such as slopes and rough ground.
Fig. 3 shows the real-time variation of the walking angular velocity on uphill terrain under online action-adjustment stability control based on the RBF-Q Learning framework. The test records the 1000th walking learning trial of the humanoid robot, in which the robot successfully walks 10 steps on the uphill terrain.
Fig. 4 shows the real-time variation of the walking angular velocity on rough terrain under online action-adjustment stability control based on the RBF-Q Learning framework. The test records the 1000th walking learning trial of the humanoid robot, in which the robot successfully walks 10 steps on the rough terrain.
2) Physical robot experiment testing
In the physical experiment, the online action-adjustment stability control based on the RBF-Q Learning framework was successfully applied to the humanoid robot platform and walking was completed successfully, verifying the effectiveness of the RBF-Q-Learning-based humanoid robot stability control method provided by the invention.

Claims (1)

1. A humanoid robot stability control method based on an RBF-network Q-learning framework, characterized by comprising the following steps:
(1) designing a Q Learning framework (RBF-Q Learning) based on the RBF network, assuming that the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t), specifically comprising:
1) RBF neural network design
an input layer: s(t) represents the state input to the Q function at time t in Q learning; a(t) represents the action input to the Q function at time t in Q learning;
a hidden layer: y_i(t) is the hidden-layer RBF activation function, a Gaussian kernel being used as the RBF activation function of each neuron; for the RBF activation function of the i-th neuron, its output is computed as
y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,
where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions;
an output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiOutputting the weight in the Q function for the ith neuron;
2) RBF network update
defining the Q-learning error δ_Q as:
δ_Q = (1 - λ)(r + γQ_max - Q(s(t), a*, t))
where λ is the learning factor, 0 ≤ λ ≤ 1; γ is the attenuation factor, 0 < γ < 1; Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* represents the selected optimal action; s(t) is the input state; the error δ_Q indicates how far the Q function has converged during learning; the learning performance index E of the RBF network is defined as
E(t) = (1/2) δ_Q(t)^2;
updating the RBF network with the BP algorithm and gradient descent, the output weight w_i of each neuron being updated as
w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),
where α_w is the learning rate; for E(t) and w_i(t), ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so that by the chain rule the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t);
the center μ_i and standard deviation σ_i of the RBF function of each neuron being updated by gradient descent as
μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3
where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively;
3) gradient descent method for solving Q learning next-step behavior
for discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b represents the next optimal behavior and A is the set of all actions available in discrete Q learning; for a Q function over continuous behaviors, the next behavior is solved with a gradient descent method;
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}; assuming the current state is s(t) and the action vector a has m dimensions, i.e. a = [a_1, a_2, ..., a_m], the function -Q(s(t), b, t) has gradient direction
d = -∂Q(s(t), b, t)/∂b;
in each solution iteration, a is updated in the direction opposite to the gradient:
a(i+1) = a(i) - λ_a d(i),
where λ_a is the step size and d(i) is the gradient evaluated at b = a(i); max{Q(s(t), b, t) | b ∈ A} is then solved by gradient descent with the following overall steps:
① initializing parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assigning an initial value a(0) and setting i = 0;
② for a(i), using d(i) = -∂Q(s(t), b, t)/∂b evaluated at b = a(i) to find the current gradient direction;
③ using the formula a(i+1) = a(i) - λ_a d(i) to obtain a(i+1);
④ calculating the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stopping; otherwise setting i = i + 1 and jumping to step ②;
(2) designing an online action-adjustment stability controller based on the RBF-Q Learning framework;
two stability controllers being designed, one for the fore-aft direction of the robot and one for the left-right direction:
1) stability control in the fore-aft direction
taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t;
since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively;
to evaluate the behavior of the robot, computing an immediate return function from the body deflection angle obtained from the attitude sensor information;
the immediate return function of the fore-aft reinforcement-learning stability controller being defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively; the immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible;
2) stability control in the left-right direction
for stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t;
since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively;
considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively; the immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
CN201510299823.3A 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework Expired - Fee Related CN104932264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Publications (2)

Publication Number Publication Date
CN104932264A CN104932264A (en) 2015-09-23
CN104932264B true CN104932264B (en) 2018-07-20

Family

ID=54119479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510299823.3A Expired - Fee Related CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Country Status (1)

Country Link
CN (1) CN104932264B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019518273A (en) * 2016-04-27 2019-06-27 ニューララ インコーポレイテッド Method and apparatus for pruning deep neural network based Q-learning empirical memory
CN106094813B (en) * 2016-05-26 2019-01-18 华南理工大学 Humanoid robot gait's control method based on model correlation intensified learning
CN106094817B (en) * 2016-06-14 2018-12-11 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392B (en) * 2017-05-11 2019-11-22 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN107403049B (en) * 2017-07-31 2019-03-19 山东师范大学 A kind of Q-Learning pedestrian's evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379B (en) * 2018-04-04 2021-11-16 北京科东电力控制系统有限责任公司 Self-adaptive variable weight combined load prediction method and device
CN108631817B (en) * 2018-05-10 2020-05-19 东北大学 Method for predicting frequency hopping signal frequency band based on time-frequency analysis and radial neural network
CN108873687B (en) * 2018-07-11 2020-06-26 哈尔滨工程大学 Intelligent underwater robot behavior system planning method based on deep Q learning
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method and control method of self-adaptive energy-saving control model of household appliance and household appliance
CN111765604B (en) * 2019-04-01 2021-10-08 珠海格力电器股份有限公司 Control method and device of air conditioner
CN110712201B (en) * 2019-09-20 2022-09-16 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN113062601B (en) * 2021-03-17 2022-05-13 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235B (en) * 2021-06-10 2022-09-02 清华大学 Biped robot gait control method and control device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440366B (en) * 2013-08-05 2016-06-08 广东电网公司电力科学研究院 Supercritical turbine steam discharge mass dryness fraction computational methods based on BP neutral net
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
基于Q学习的欠驱动双足机器人行走控制研究;刘道远;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20131015(第10期);I140-164 *
基于RBFNN的强化学习在机器人导航中的应用;吴洪岩,等;《吉林大学学报(信息科学版)》;20090331;第27卷(第2期);第185-190页 *
基于RBF-Q学习的四足机器人运动协调控制;尹俊明,等;《计算机应用研究》;20130831;第30卷(第8期);第2349-2352页 *
基于强化学习的自主移动机器人导航研究;吴洪岩;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20091115(第11期);I140-280 *
基于激励学习算法的移动机器人避障规划研究盛维涛;盛维涛;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20130315(第03期);I140-316 *
模糊强化学习在机器人导航中的应用;葛媛,等;《信息技术》;20091031(第10期);第127-130页 *

Also Published As

Publication number Publication date
CN104932264A (en) 2015-09-23

Similar Documents

Publication Publication Date Title
CN104932264B (en) Humanoid robot stability control method based on an RBF-network Q-learning framework
US8306657B2 (en) Control device for legged mobile robot
US8311677B2 (en) Control device for legged mobile robot
US8417382B2 (en) Control device for legged mobile body
CN108858208B (en) Self-adaptive balance control method, device and system for humanoid robot in complex terrain
US8204626B2 (en) Control device for mobile body
EP2017042B1 (en) Motion controller and motion control method for legged walking robot, and robot apparatus
Koubaa et al. Adaptive sliding-mode dynamic control for path tracking of nonholonomic wheeled mobile robot
CN111625002B (en) Stair-climbing gait planning and control method of humanoid robot
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
CN108931988B (en) Gait planning method of quadruped robot based on central pattern generator, central pattern generator and robot
JP6781101B2 (en) Non-linear system control method, biped robot control device, biped robot control method and its program
CN114995479A (en) Parameter control method of quadruped robot virtual model controller based on reinforcement learning
CN116551669A (en) Dynamic jump and balance control method for humanoid robot, electronic equipment and medium
CN114397810A (en) Four-legged robot motion control method based on adaptive virtual model control
Halaly et al. Autonomous driving controllers with neuromorphic spiking neural networks
Li et al. Learning agile bipedal motions on a quadrupedal robot
Cisneros et al. Partial yaw moment compensation using an optimization-based multi-objective motion solver
Li et al. Dynamic locomotion of a quadruped robot with active spine via model predictive control
Arena et al. Attitude control in the Mini Cheetah robot via MPC and reward-based feed-forward controller
JP5404543B2 (en) Control device for legged mobile robot
JP5232120B2 (en) Control device for moving body
Dong et al. Reactive bipedal balance: Coordinating compliance and stepping through virtual model imitation for enhanced stability
Znegui et al. Analysis and control of the dynamic walking of the compass biped walker using poincaré maps: Comparison between two design approaches
Zhang et al. Biped walking on rough terfrain using reinforcement learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180720
