CN104932264B - Humanoid robot stability control method based on an RBF-network Q-learning framework - Google Patents

Humanoid robot stability control method based on an RBF-network Q-learning framework Download PDF

Info

Publication number
CN104932264B
CN104932264B CN201510299823.3A
Authority
CN
China
Prior art keywords
learning
rbf
function
pitch
ankle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201510299823.3A
Other languages
Chinese (zh)
Other versions
CN104932264A (en)
Inventor
毕盛
黄铨雍
韦如明
闵华清
董敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510299823.3A priority Critical patent/CN104932264B/en
Publication of CN104932264A publication Critical patent/CN104932264A/en
Application granted granted Critical
Publication of CN104932264B publication Critical patent/CN104932264B/en


Landscapes

  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a humanoid robot stability control method based on an RBF-network Q-learning framework. The method includes: proposing a Q-learning framework based on an RBF network (RBF-Q Learning), which solves the problems of state-space continuity and action-space continuity in the Q-learning process; proposing an RBF-network-based Q-learning online action-adjustment stability control algorithm, which generates the hip, knee and ankle joint trajectories of the support leg and controls stable walking of the humanoid robot by computing the remaining joint angles; and finally verifying the feasibility and effectiveness of the RBF-Q Learning framework method on the Vitruvian Man humanoid robot platform designed in this laboratory. The present invention can generate a gait for stable humanoid robot walking through online learning.

Description

Humanoid robot stability control method based on an RBF-network Q-learning framework
Technical Field
The invention relates to the field of humanoid robot walking stability control, and in particular to a humanoid robot stability control method based on a Q Learning framework built on an RBF network (RBF-Q Learning).
Background
The essence of biped walking control research on a humanoid robot platform is solving a complex control problem. Such problems are usually addressed by modeling the whole system and solving the system equations. In practice, however, we often encounter systems that are difficult to describe accurately with a model, or whose parameters are too complex for the system equations to be solved. In such cases the problem can be solved by learning rather than by elaborate modeling.
The biped walking control problem of a humanoid robot is highly unstable and nonlinear, and a satisfactory solution is difficult to obtain through accurate modeling alone. Reinforcement learning and neural network methods have proven effective for complex control problems. These methods do not require the system designer to have a deep and precise understanding of the system dynamics; through learning, they may yield solutions beyond the designer's own knowledge. At the same time, such methods can keep learning and improving, just as animals in nature acquire most of their abilities through learning and adaptation.
Disclosure of Invention
To address the difficulty that standard Q Learning cannot easily handle continuous state and action spaces, the invention provides a Q Learning framework based on an RBF network (RBF-Q Learning). Using this framework, a walking stability control method for a humanoid robot is designed and implemented, and its effectiveness is finally verified in simulation and on a physical robot.
The invention provides an RBF-network-based Q-learning humanoid robot stability control method which enables the robot to generate a stable gait plan through online learning and thus walk stably. The method comprises the following steps:
(1) Designing a Q Learning framework (RBF-Q Learning) based on the RBF network.
The invention designs a Q-learning framework based on an RBF network for continuous spaces. The framework uses an RBF network, with its strong global approximation capability, to fit the Q function, and uses a gradient descent method to find the maximum Q value and the optimal behavior in each iteration. The algorithm can adjust and learn the RBF network structure and parameters online in real time according to the complexity of the problem, and has good generalization capability.
Combining the RBF network with Q learning, the invention designs an RBF-Q Learning algorithm framework in which the RBF network approximately fits the Q function. Suppose the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t).
1) RBF neural network design
Input layer: s(t) is the state input to the Q function at time t in Q learning; a(t) is the action input to the Q function at time t in Q learning.
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as the RBF activation function of each neuron. For the RBF activation function of the i-th neuron, its output is computed as

y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,

where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
An output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiWeights in the Q function are output for the ith neuron.
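By way of illustration, a minimal Python sketch of the forward pass just described is given below, assuming the state and action vectors are simply concatenated into the network input and that the output layer is linear; the array names (centers, widths, weights) are illustrative and not taken from the patent.

import numpy as np

def rbf_q_forward(s, a, centers, widths, weights):
    """Q(s, a) from an RBF network: Gaussian hidden units over [s, a], linear output layer."""
    x = np.concatenate([s, a])                 # input layer: state s(t) and action a(t)
    d2 = np.sum((centers - x) ** 2, axis=1)    # squared distance to each of the k centers mu_i
    y = np.exp(-d2 / (2.0 * widths ** 2))      # hidden layer: y_i = exp(-||x - mu_i||^2 / (2 sigma_i^2))
    return float(weights @ y), y               # output layer: Q(t) = sum_i w_i * y_i(t)

Here centers has shape (k, n+m), while widths and weights have shape (k,).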
2) RBF network update
The Q-learning error δ_Q is defined as:
δ_Q = (1 - λ)(r + γQ_max - Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the attenuation factor (0 < γ < 1); Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* is the selected optimal action; s is the input state. The error δ_Q indicates how far the Q function has converged during learning. The learning performance index E of the system is defined as

E(t) = (1/2) δ_Q(t)^2.
The RBF network is updated with the BP algorithm and gradient descent. The output weight w_i of each neuron is updated as

w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),

where α_w is the learning rate. For E(t) and w_i(t) we have, by the chain rule, ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
The center μ_i and standard deviation σ_i of the RBF function of each neuron are updated analogously by gradient descent:

μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3

where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively.
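Continuing the sketch above, one update of the network driven by δ_Q could be written as follows. The learning-rate values are arbitrary placeholders, and the center and width updates implement the standard gradient of the Gaussian kernel, which is an assumption about the exact form intended here.

import numpy as np

def rbf_q_update(x, y, q, q_max, r, centers, widths, weights,
                 lam=0.5, gamma=0.9, a_w=0.05, a_mu=0.01, a_sigma=0.01):
    """One RBF-network update driven by the Q-learning error (float arrays updated in place)."""
    delta_q = (1.0 - lam) * (r + gamma * q_max - q)   # delta_Q = (1 - lambda)(r + gamma*Q_max - Q(s, a*, t))
    diff = x - centers                                # (k, dim) offsets from each kernel center
    # assumed gradient-descent updates for the kernel centers and widths
    centers += a_mu * delta_q * (weights * y)[:, None] * diff / (widths ** 2)[:, None]
    widths += a_sigma * delta_q * (weights * y) * np.sum(diff ** 2, axis=1) / widths ** 3
    weights += a_w * delta_q * y                      # w_i(t+1) = w_i(t) + alpha_w * delta_Q(t) * y_i(t)
    return delta_q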
3) gradient descent method for solving Q learning next-step behavior
For discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next-step optimal behavior. For a Q function over continuous behaviors, the next-step behavior is instead found with a gradient descent method.
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}. Assuming the current state is s(t), the function -Q(s(t), b, t) has gradient direction

d = -∂Q(s(t), b, t)/∂b.

In each solution iteration, a is updated in the direction opposite to this gradient:

a(i+1) = a(i) - λ_a d(i),

where λ_a is the step size and d(i) is the gradient evaluated at b = a(i). Solving max{Q(s(t), b, t) | b ∈ A} by gradient descent then proceeds through the following overall steps (an illustrative sketch follows these steps):
① Initialize the parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assign an initial value a(0) and set i = 0;
② For a(i), compute the current gradient direction d(i) = -∂Q(s(t), b, t)/∂b at b = a(i);
③ Update a(i+1) = a(i) - λ_a d(i);
④ Compute the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step ②.
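A minimal sketch of steps ① to ④ follows, assuming the gradient of Q with respect to the action is estimated numerically by central differences (a convenience of the sketch; the text does not specify how the derivative is obtained).

import numpy as np

def best_action_by_gradient(q_of, s, a0, step=0.05, tol=1e-4, max_iter=100, eps=1e-3):
    """Approximate argmax_b Q(s, b) by descending -Q(s, b) along its gradient in b."""
    a = np.asarray(a0, dtype=float).copy()
    for i in range(max_iter):
        grad = np.zeros_like(a)                    # gradient of -Q(s, b) at b = a(i)
        for j in range(a.size):
            e = np.zeros_like(a)
            e[j] = eps
            grad[j] = -(q_of(s, a + e) - q_of(s, a - e)) / (2.0 * eps)
        a_new = a - step * grad                    # a(i+1) = a(i) - lambda_a * d(i)
        if np.linalg.norm(a_new - a) <= tol:       # Delta E = ||a(i+1) - a(i)|| <= Delta E_min
            return a_new
        a = a_new
    return a

Here q_of could be, for example, lambda st, ac: rbf_q_forward(st, ac, centers, widths, weights)[0] from the earlier sketch.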
(2) Designing an online action-adjustment stability controller based on the RBF-Q Learning framework
Two stability controllers are designed, one for the fore-aft (front-back) direction of the robot and one for the left-right direction:
1) Stability control in the fore-aft direction
Taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t.
Since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the robot's behavior, an immediate return function is computed from the body deflection angle obtained from the attitude sensor information.
The immediate return function of the fore-aft reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible.
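For illustration only, the following sketch assembles the fore-aft state vector and an immediate-return function of the kind described above. Since the exact return formula is not reproduced in the text, the piecewise form used here (no penalty while |θ_xz| stays inside the error band ε, otherwise a penalty on the excess, plus a term on |Δθ_xz|) and the weight values are assumptions consistent with the stated intent.

import numpy as np

def pitch_state(theta_hip, theta_knee, theta_ankle, theta_xz):
    """s_pitch(t) = [theta_hip_pitch, theta_knee_pitch, theta_ankle_pitch, theta_xz]."""
    return np.array([theta_hip, theta_knee, theta_ankle, theta_xz])

def pitch_reward(theta_xz, d_theta_xz, a1=1.0, a2=0.5, eps=0.05):
    """Assumed immediate return: keep theta_xz inside the error band and its rate of change small."""
    band_penalty = 0.0 if abs(theta_xz) <= eps else abs(theta_xz) - eps
    return -(a1 * band_penalty + a2 * abs(d_theta_xz))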
2) Stability control in the left-right direction
For stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t.
Since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
Compared with the prior art, the invention has the following advantages:
(1) The Q Learning framework (RBF-Q Learning) method based on the RBF network optimizes the robot's walking stability and provides online learning capability. After a certain amount of learning, the humanoid robot can walk stably across complex ground environments.
(2) The biped walking control problem of a humanoid robot is highly unstable and nonlinear and is difficult to solve by accurate modeling. The RBF-network-based Q Learning framework (RBF-Q Learning) method does not require the system designer to have a deep and precise understanding of the system dynamics. Through learning, the method can provide solutions beyond the designer's own knowledge. At the same time, the method can keep learning and improving, just as animals in nature acquire most of their abilities through learning and adaptation.
Drawings
FIG. 1 is a block diagram of an RBF-Q Learning network architecture.
FIG. 2 is a block diagram of the RBF-Q Learning algorithm.
Fig. 3 shows the angular velocity curves of the robot walking on uphill terrain under online action-adjustment stability control (after 1000 walks); the upper curve is the angular velocity of the robot about the y-axis (forward-backward swing) and the lower curve is the angular velocity of the humanoid robot about the x-axis (left-right swing).
Fig. 4 shows the angular velocity curves of the robot walking on rough terrain under online action-adjustment stability control (after 1000 walks); the upper curve is the angular velocity of the robot about the y-axis (forward-backward swing) and the lower curve is the angular velocity of the humanoid robot about the x-axis (left-right swing).
Detailed Description
The embodiments of the present invention are described in detail below in conjunction with the accompanying drawings, but the invention is not limited thereto. Any symbols or processes not described in particular detail below can be implemented by those skilled in the art by reference to the prior art.
(1) ZMP analysis is carried out on a simplified humanoid robot model using a three-dimensional inverted pendulum model, and the center-of-mass and foothold trajectories during walking are computed. From these trajectories, inverse kinematics analysis yields the motion trajectories of all joints during walking, which are stored as the robot's offline basic gait information.
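As a sketch of how the three-dimensional linear inverted pendulum propagates the center of mass during a support phase, the closed-form solution can be evaluated per horizontal axis as below; the CoM height z_c and the per-axis treatment are illustrative assumptions, not parameters taken from the patent.

import numpy as np

def lipm_com(x0, v0, t, z_c=0.26, g=9.81):
    """CoM position and velocity of a 3-D linear inverted pendulum along one horizontal axis,
    measured relative to the current support point, with constant CoM height z_c."""
    Tc = np.sqrt(z_c / g)                                    # pendulum time constant
    pos = x0 * np.cosh(t / Tc) + Tc * v0 * np.sinh(t / Tc)
    vel = (x0 / Tc) * np.sinh(t / Tc) + v0 * np.cosh(t / Tc)
    return pos, vel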
(2) Designing a Q Learning framework (RBF-Q Learning) based on the RBF network.
1) Fitting the Q function with the RBF network
The RBF network is used to approximately fit the Q function in Q learning. Suppose the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t); the RBF neural network is constructed as follows (see Fig. 1).
Input layer: s(t) is the state input to the Q function at time t in Q learning and has n dimensions; a(t) is the action input to the Q function at time t and has m dimensions.
Hidden layer: y(t) denotes the hidden-layer RBF activation functions, of which there are k. A Gaussian kernel is used as the RBF activation function of each neuron; for the RBF activation function of the i-th neuron, its output is computed as

y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,

where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
An output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiWeights in the Q function are output for the ith neuron.
To update the RBF network, the Q-learning error δ_Q is defined as:
δ_Q = (1 - λ)(r + γQ_max - Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the attenuation factor (0 < γ < 1); Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* is the selected optimal action; s is the input state. The error δ_Q indicates how far the Q function has converged during learning. The learning performance index E of the system is defined as

E(t) = (1/2) δ_Q(t)^2.
The RBF network is updated with the BP algorithm and gradient descent. The output weight w_i of each neuron is updated as

w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),

where α_w is the learning rate. For E(t) and w_i(t) we have, by the chain rule, ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
The center μ_i and standard deviation σ_i of the RBF function of each neuron are updated analogously by gradient descent:

μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3

where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively.
2) Gradient descent method for solving Q learning next-step behavior
For discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next-step optimal behavior. For a Q function over continuous behaviors, the next-step behavior is instead found with a gradient descent method.
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}. Assuming the current state is s(t), the function -Q(s(t), b, t) has gradient direction

d = -∂Q(s(t), b, t)/∂b.

In each solution iteration, a is updated in the direction opposite to this gradient:

a(i+1) = a(i) - λ_a d(i),

where λ_a is the step size and d(i) is the gradient evaluated at b = a(i). Solving max{Q(s(t), b, t) | b ∈ A} by gradient descent then proceeds through the following overall steps:
① Initialize the parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assign an initial value a(0) and set i = 0;
② For a(i), compute the current gradient direction d(i) = -∂Q(s(t), b, t)/∂b at b = a(i);
③ Update a(i+1) = a(i) - λ_a d(i);
④ Compute the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step ②.
Combining the RBF neural network with the gradient descent method gives the complete RBF-Q Learning algorithm framework; the algorithm flow chart is shown in Fig. 2.
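Building on the earlier sketches, one complete RBF-Q Learning iteration of the flow in Fig. 2 could be organized as follows. The net dictionary and the env_step interface are hypothetical, and Q_max is taken as the maximum over actions at the successor state, which is one interpretation of the "current Q maximum value" in the text.

import numpy as np

def rbf_q_learning_step(s, a_prev, env_step, net):
    """One iteration: choose an action by gradient search, execute it, then update the network."""
    q_of = lambda st, ac: rbf_q_forward(st, ac, net["centers"], net["widths"], net["weights"])[0]
    a_star = best_action_by_gradient(q_of, s, a_prev)        # next behavior b maximizing Q(s, b)
    s_next, r = env_step(a_star)                             # execute it and observe the immediate return
    q, y = rbf_q_forward(s, a_star, net["centers"], net["widths"], net["weights"])
    b_next = best_action_by_gradient(q_of, s_next, a_star)   # best behavior at the successor state
    q_max = q_of(s_next, b_next)                             # assumed reading of Q_max
    rbf_q_update(np.concatenate([s, a_star]), y, q, q_max, r,
                 net["centers"], net["widths"], net["weights"])
    return s_next, a_star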
(3) Designing an online action-adjustment stability controller based on the RBF-Q Learning framework
The state input and behavior output of the RBF-Q Learning framework for humanoid robot walking are now designed. The biped walking of the humanoid robot alternates between two support phases (taking the case where the first step is taken with the right foot as an example): the left-foot support phase converts to the right-foot support phase, usually with a short double-support phase inserted between them, and this cycle repeats. In the left-foot support phase, the three-dimensional inverted pendulum formed over the supporting left foot is stabilized mainly by the left-foot servos: stability in the fore-aft direction is determined by the left-leg hip pitch servo, knee servo and ankle servo, while stability in the left-right direction is determined by the left-leg hip roll servo and ankle roll servo. Likewise, in the right-foot support phase, fore-aft stability is determined by the right-leg hip pitch servo, knee servo and ankle servo, and left-right stability by the right-leg hip roll servo and ankle roll servo. According to these structural characteristics, two stability controllers are designed, one for the fore-aft direction and one for the left-right direction.
1) Stability control in the fore-aft direction
Taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t.
Since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively.
To evaluate the robot's behavior, an immediate return function is computed from the body deflection angle obtained from the attitude sensor information; the immediate return function of the fore-aft reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible.
2) Stability control in the left-right direction
For stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t.
Since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively.
Considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively. The immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
3) Online action adjustment stability control flow based on RBF-Q Learning framework
During walking, for each action to be executed the stability controller acquires sensor information from a Kalman filtering algorithm and computes the current state from the current offline basic gait. It then updates the RBF-Q Learning framework according to the flow shown in Fig. 2, obtains the next action, and corrects the action to be executed in real time.
In summary, each RBF-Q Learning online action-adjustment stability controller follows these algorithm steps (an illustrative sketch follows the list):
① Initialize the RBF-Q Learning framework.
② For each walking action to be executed, acquire the torso-plumb-line angle and its angular velocity from the Kalman filtering fusion algorithm and compute the current state according to the state definition.
③ Compute the optimal behavior from the current state using the RBF-Q Learning framework.
④ Modify the next walking action using the optimal behavior obtained in step ③.
⑤ Execute the next action, obtain the current immediate return value, update the RBF-Q Learning framework, and jump to step ②.
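A sketch of control steps ① to ⑤ is given below, reusing the illustrative routines from the earlier sketches; base_gait, read_torso_angle and apply_adjusted_action are hypothetical interfaces to the offline basic gait, the Kalman-filter fusion algorithm and the servo layer.

import numpy as np

def online_stability_control(net, base_gait, read_torso_angle, apply_adjusted_action, n_actions=1000):
    """For every planned walking action: sense, adjust online, execute, and update the learner."""
    a = np.zeros(3)                                          # last adjustment (hip, knee, ankle pitch)
    for k in range(n_actions):
        theta_xz, _ = read_torso_angle()                     # torso-plumb-line angle from sensor fusion
        s = np.concatenate([base_gait(k), [theta_xz]])       # state from offline gait plus torso angle

        def env_step(adjustment):
            apply_adjusted_action(k, adjustment)             # correct and execute the next walking action
            th, dth = read_torso_angle()
            return np.concatenate([base_gait(k + 1), [th]]), pitch_reward(th, dth)

        _, a = rbf_q_learning_step(s, a, env_step, net)      # select, act and update the RBF network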
(4) Experimental testing and results analysis
1) Simulation experiment test and result analysis
Online stability control of humanoid robot walking is performed using the online action-adjustment stability controller based on the RBF-Q Learning framework. The humanoid robot learns in a simulation environment, continuously adapting its basic gait to the environment until the continuous-walking objective is achieved.
In this set of experiments, the algorithm converged after about 1000 walks and then completed 10 consecutive walking steps on uphill and rugged terrain, respectively. The results show that, after a period of learning, the humanoid robot under online action-adjustment stability control based on the RBF-Q Learning framework is able to walk through complex terrain such as slopes and rough ground.
Fig. 3 shows the real-time variation of the walking angular velocity on uphill terrain under online action-adjustment stability control based on the RBF-Q Learning framework. The test records the 1000th walking learning trial of the humanoid robot, in which the robot successfully walks 10 steps on the uphill terrain.
Fig. 4 shows the real-time variation of the walking angular velocity on rough terrain under online action-adjustment stability control based on the RBF-Q Learning framework. The test records the 1000th walking learning trial of the humanoid robot, in which the robot successfully walks 10 steps on the rough terrain.
2) Physical robot experiment testing
In the physical experiment, the online action-adjustment stability control based on the RBF-Q Learning framework was successfully applied to the humanoid robot platform and walking was completed successfully, verifying the effectiveness of the RBF-Q-Learning-based humanoid robot stability control method provided by the invention.

Claims (1)

1. A humanoid robot stability control method based on an RBF-network Q-learning framework, characterized by comprising the following steps:
(1) designing a Q Learning framework (RBF-Q Learning) based on the RBF network, assuming that the Q function receives a state vector s(t) and an action vector a(t) as inputs and outputs a scalar Q(t), specifically comprising:
1) RBF neural network design
an input layer: s(t) represents the state input to the Q function at time t in Q learning; a(t) represents the action input to the Q function at time t in Q learning;
a hidden layer: y_i(t) is the hidden-layer RBF activation function, a Gaussian kernel being used as the RBF activation function of each neuron; for the RBF activation function of the i-th neuron, its output is computed as
y_i(t) = exp(-||x - μ_i||^2 / (2σ_i^2)),  i = 1, 2, …, k,
where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions;
an output layer: q (t) represents the Q function output, updated using the following equation,
wherein, wiOutputting the weight in the Q function for the ith neuron;
2) RBF network update
defining the Q-learning error δ_Q as:
δ_Q = (1 - λ)(r + γQ_max - Q(s(t), a*, t))
where λ is the learning factor, 0 ≤ λ ≤ 1; γ is the attenuation factor, 0 < γ < 1; Q_max is the current maximum Q value in the iteration process; r is the immediate return value; a* represents the selected optimal action; s(t) is the input state; the error δ_Q indicates how far the Q function has converged during learning; the learning performance index E of the RBF network is defined as
E(t) = (1/2) δ_Q(t)^2;
updating the RBF network with the BP algorithm and gradient descent, the output weight w_i of each neuron being updated as
w_i(t+1) = w_i(t) - α_w ∂E(t)/∂w_i(t),
where α_w is the learning rate; for E(t) and w_i(t), ∂E(t)/∂w_i(t) = -δ_Q(t) y_i(t), so that by the chain rule the update formula for the output weight w_i of each neuron becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t);
the center μ_i and standard deviation σ_i of the RBF function of each neuron being updated by gradient descent as
μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x - μ_i(t)) / σ_i(t)^2
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ||x - μ_i(t)||^2 / σ_i(t)^3
where α_μ and α_σ are the learning rates of the RBF function center and standard deviation, respectively;
3) gradient descent method for solving Q learning next-step behavior
for discrete Q learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b represents the next optimal behavior and A is the set of all actions available in discrete Q learning; for a Q function over continuous behaviors, the next behavior is solved with a gradient descent method;
max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization problem min{-Q(s(t), b, t) | b ∈ A}; assuming the current state is s(t) and the action vector a has m dimensions, i.e. a = [a_1, a_2, ..., a_m], the function -Q(s(t), b, t) has gradient direction
d = -∂Q(s(t), b, t)/∂b;
in each solution iteration, a is updated in the direction opposite to the gradient:
a(i+1) = a(i) - λ_a d(i),
where λ_a is the step size and d(i) is the gradient evaluated at b = a(i); max{Q(s(t), b, t) | b ∈ A} is then solved by gradient descent with the following overall steps:
① initializing parameters, including the allowable error ΔE_min, the maximum number of iterations k and the step length λ_a; randomly assigning an initial value a(0) and setting i = 0;
② for a(i), using d(i) = -∂Q(s(t), b, t)/∂b evaluated at b = a(i) to find the current gradient direction;
③ using the formula a(i+1) = a(i) - λ_a d(i) to obtain a(i+1);
④ calculating the error ΔE = ||a(i+1) - a(i)||; if ΔE ≤ ΔE_min or i > k, stopping; otherwise setting i = i + 1 and jumping to step ②;
(2) designing an online action-adjustment stability controller based on the RBF-Q Learning framework;
two stability controllers being designed, one for the fore-aft direction of the robot and one for the left-right direction:
1) stability control in the fore-aft direction
taking the left-foot support phase as an example (the right foot is handled in the same way), for fore-aft stability control of the humanoid robot the state input of RBF-Q Learning is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t) and θ_ankle_pitch(t) are the left-leg hip pitch servo angle, knee servo angle and ankle servo angle in the offline basic gait of the humanoid robot at time t, and θ_xz(t) is the torso-plumb-line angle on the xz plane at time t;
since fore-aft stability control depends mainly on the left-leg hip pitch servo, knee servo and ankle servo, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t) and Δθ_ankle_pitch(t) are the adjustment angles of the current hip pitch servo, knee servo and ankle servo, respectively;
to evaluate the behavior of the robot, computing an immediate return function from the body deflection angle obtained from the attitude sensor information;
the immediate return function of the fore-aft reinforcement-learning stability controller being defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_xz(t) and Δθ_xz(t) are the torso-plumb-line angle on the xz plane at time t and its angular velocity, respectively; the immediate return function is intended to keep θ_xz(t) within the allowable error band while keeping its rate of change Δθ_xz(t) as small as possible;
2) stability control in the left-right direction
for stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are the left-leg hip roll servo angle and ankle roll servo angle in the offline basic gait of the humanoid robot at time t, and θ_yz(t) is the torso-plumb-line angle on the yz plane at time t;
since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as their online adjustment values:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are the adjustment angles of the hip roll servo and ankle roll servo, respectively;
considering that left-right stability is evaluated using the torso-plumb-line angle on the yz plane and its angular velocity, the immediate return function of the left-right reinforcement-learning stability controller is defined as follows:
where a_1 and a_2 are the immediate return function weights,
and ε is the allowable error band; θ_yz(t) and Δθ_yz(t) are the torso-plumb-line angle on the yz plane at time t and its angular velocity, respectively; the immediate return function is intended to keep θ_yz(t) within the allowable error band while keeping its rate of change Δθ_yz(t) as small as possible.
CN201510299823.3A 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework Expired - Fee Related CN104932264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510299823.3A CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Publications (2)

Publication Number Publication Date
CN104932264A CN104932264A (en) 2015-09-23
CN104932264B true CN104932264B (en) 2018-07-20

Family

ID=54119479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510299823.3A Expired - Fee Related CN104932264B (en) 2015-06-03 2015-06-03 Humanoid robot stability control method based on an RBF-network Q-learning framework

Country Status (1)

Country Link
CN (1) CN104932264B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2019518273A (en) * 2016-04-27 2019-06-27 ニューララ インコーポレイテッド Method and apparatus for pruning deep neural network based Q-learning empirical memory
CN106094813B (en) * 2016-05-26 2019-01-18 华南理工大学 Humanoid robot gait's control method based on model correlation intensified learning
CN106094817B (en) * 2016-06-14 2018-12-11 华南理工大学 Intensified learning humanoid robot gait's planing method based on big data mode
CN107292392B (en) * 2017-05-11 2019-11-22 苏州大学 Large-range monitoring method and supervisory-controlled robot based on the double Q study of depth cum rights
CN107292344B (en) * 2017-06-26 2020-09-18 苏州大学 Robot real-time control method based on environment interaction
CN107403049B (en) * 2017-07-31 2019-03-19 山东师范大学 A kind of Q-Learning pedestrian's evacuation emulation method and system based on artificial neural network
CN108051787A (en) * 2017-12-05 2018-05-18 上海无线电设备研究所 A kind of missile-borne radar flying test method
CN108537379B (en) * 2018-04-04 2021-11-16 北京科东电力控制系统有限责任公司 Self-adaptive variable weight combined load prediction method and device
CN108631817B (en) * 2018-05-10 2020-05-19 东北大学 Method for predicting frequency hopping signal frequency band based on time-frequency analysis and radial neural network
CN108873687B (en) * 2018-07-11 2020-06-26 哈尔滨工程大学 Intelligent underwater robot behavior system planning method based on deep Q learning
CN109827292A (en) * 2019-01-16 2019-05-31 珠海格力电器股份有限公司 Construction method and control method of self-adaptive energy-saving control model of household appliance and household appliance
CN111765604B (en) * 2019-04-01 2021-10-08 珠海格力电器股份有限公司 Control method and device of air conditioner
CN110712201B (en) * 2019-09-20 2022-09-16 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN113062601B (en) * 2021-03-17 2022-05-13 同济大学 Q learning-based concrete distributing robot trajectory planning method
CN113467235B (en) * 2021-06-10 2022-09-02 清华大学 Biped robot gait control method and control device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440366B (en) * 2013-08-05 2016-06-08 广东电网公司电力科学研究院 Supercritical turbine steam discharge mass dryness fraction computational methods based on BP neutral net
CN103605285A (en) * 2013-11-21 2014-02-26 南京理工大学 Fuzzy nerve network control method for automobile driving robot system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2011065553A (en) * 2009-09-18 2011-03-31 Honda Motor Co Ltd Learning control system and learning control method
JP2011204036A (en) * 2010-03-25 2011-10-13 Institute Of National Colleges Of Technology Japan Experience reinforcement type reinforcement learning system, experience reinforcement type reinforcement learning method and experience reinforcement type reinforcement learning program
CN102402712A (en) * 2011-08-31 2012-04-04 山东大学 Robot reinforced learning initialization method based on neural network
WO2014047142A1 (en) * 2012-09-20 2014-03-27 Brain Corporation Spiking neuron network adaptive control apparatus and methods
CN103204193A (en) * 2013-04-08 2013-07-17 浙江大学 Under-actuated biped robot walking control method

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
基于Q学习的欠驱动双足机器人行走控制研究;刘道远;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20131015(第10期);I140-164 *
基于RBFNN的强化学习在机器人导航中的应用;吴洪岩,等;《吉林大学学报(信息科学版)》;20090331;第27卷(第2期);第185-190页 *
基于RBF-Q学习的四足机器人运动协调控制;尹俊明,等;《计算机应用研究》;20130831;第30卷(第8期);第2349-2352页 *
基于强化学习的自主移动机器人导航研究;吴洪岩;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20091115(第11期);I140-280 *
基于激励学习算法的移动机器人避障规划研究盛维涛;盛维涛;《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》;20130315(第03期);I140-316 *
模糊强化学习在机器人导航中的应用;葛媛,等;《信息技术》;20091031(第10期);第127-130页 *

Also Published As

Publication number Publication date
CN104932264A (en) 2015-09-23

Similar Documents

Publication Publication Date Title
CN104932264B (en) Humanoid robot stability control method based on an RBF-network Q-learning framework
US8306657B2 (en) Control device for legged mobile robot
US8311677B2 (en) Control device for legged mobile robot
US8417382B2 (en) Control device for legged mobile body
CN108858208B (en) Self-adaptive balance control method, device and system for humanoid robot in complex terrain
US8204626B2 (en) Control device for mobile body
EP2017042B1 (en) Motion controller and motion control method for legged walking robot, and robot apparatus
Koubaa et al. Adaptive sliding-mode dynamic control for path tracking of nonholonomic wheeled mobile robot
CN111625002B (en) Stair-climbing gait planning and control method of humanoid robot
Pandala et al. Robust predictive control for quadrupedal locomotion: Learning to close the gap between reduced-and full-order models
CN108931988B (en) Gait planning method of quadruped robot based on central pattern generator, central pattern generator and robot
JP6781101B2 (en) Non-linear system control method, biped robot control device, biped robot control method and its program
CN114995479A (en) Parameter control method of quadruped robot virtual model controller based on reinforcement learning
CN116551669A (en) Dynamic jump and balance control method for humanoid robot, electronic equipment and medium
CN114397810A (en) Four-legged robot motion control method based on adaptive virtual model control
Halaly et al. Autonomous driving controllers with neuromorphic spiking neural networks
Li et al. Learning agile bipedal motions on a quadrupedal robot
Cisneros et al. Partial yaw moment compensation using an optimization-based multi-objective motion solver
Li et al. Dynamic locomotion of a quadruped robot with active spine via model predictive control
Arena et al. Attitude control in the Mini Cheetah robot via MPC and reward-based feed-forward controller
JP5404543B2 (en) Control device for legged mobile robot
JP5232120B2 (en) Control device for moving body
Dong et al. Reactive bipedal balance: Coordinating compliance and stepping through virtual model imitation for enhanced stability
Znegui et al. Analysis and control of the dynamic walking of the compass biped walker using poincaré maps: Comparison between two design approaches
Zhang et al. Biped walking on rough terfrain using reinforcement learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180720
