CN104932264A - Humanoid robot stable control method of RBF-Q learning frame - Google Patents
Abstract
The invention discloses a humanoid robot stable control method based on an RBF-Q learning framework. The method comprises the following steps: an RBF-Q learning framework is proposed that solves the problems of continuous state spaces and continuous action spaces in Q-learning; an online motion-adjustment stable control algorithm based on RBF-Q learning is proposed, which generates trajectories for the hip, knee, and ankle joints of the support leg and controls the humanoid robot to walk stably by calculating the angles of the other joints; finally, the feasibility and validity of the RBF-Q learning framework method are verified on the Vitruvian Man humanoid robot platform designed by the laboratory. The method can generate a stable walking gait for the humanoid robot through an online learning process.
Description
Technical field
The present invention relates to the field of humanoid robot walking stability control, and specifically to a humanoid robot stable control method based on a Q-learning framework built on an RBF network (RBF-Q Learning).
Background technology
Research on biped walking control for a humanoid robot platform is, in essence, the solution of a complex control problem. Complex problems are generally solved by modeling the whole system and solving the system equations. In practice, however, we often encounter problems that are difficult to describe with an accurate model, or whose parameters are so numerous and complex that solving the system equations directly is infeasible. In such cases the problem can be tackled by learning rather than by elaborate modeling.
The control problem of humanoid biped walking is highly unstable and strongly nonlinear, so it is difficult to obtain a complete solution through accurate modeling. Reinforcement learning and neural network methods have been proved effective on complex control problems. These methods do not require the system designer to have a deep and accurate understanding of the system dynamics; by learning, they may provide solutions beyond the designer's own knowledge. At the same time, such methods can continuously learn and improve, much as animals in nature acquire most of their abilities through learning and adaptation.
Summary of the invention
The present invention takes walking stability of a humanoid robot in complex ground environments as its research goal. Addressing the difficulty that Q-learning in reinforcement learning has in handling continuous state spaces and continuous action spaces, it proposes a Q-learning framework based on an RBF network (RBF-Q Learning), uses this framework to design and implement a humanoid robot stable walking control method, and finally verifies the validity of the method through simulation and on a physical robot.
The invention provides a humanoid robot stable control method based on the RBF-network Q-learning framework, which enables the humanoid robot to learn online to produce a stable gait plan and thereby walk stably. It comprises the following steps:
(1) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
The invention designs a Q-learning framework for continuous spaces based on an RBF network. The framework uses an RBF network, which has strong global approximation capability, to fit the Q function, and uses gradient descent to solve for the maximum value and the optimal action in each iteration. The algorithm can adjust and learn the RBF network structure and parameters online in real time according to the complexity of the problem, and possesses good generalization ability.
Combining the RBF network and Q-learning, the invention designs the RBF-Q Learning algorithm framework, in which the RBF network approximates and fits the Q function in Q-learning. Suppose the Q function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t).
1) RBF neural network design
Input layer: s(t) denotes the state input to the Q function at time t in Q-learning; a(t) denotes the action input to the Q function at time t.
Hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as each neuron's RBF activation function. The output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp(−‖x − μ_i‖² / (2σ_i²)), i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the formula
Q(t) = Σ_{i=1}^{k} w_i y_i(t)
where w_i is the output weight of the i-th neuron in the Q function.
2) RBF network update
Define the Q-learning error δ_Q as follows:
δ_Q = (1 − λ)(r + γQ_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the decay factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; and s is the input state. The error δ_Q indicates the degree of convergence of the Q function during learning. The learning performance index E of the system is defined as:
E(t) = δ_Q²(t) / 2
Using the BP algorithm and gradient descent, the RBF network is updated. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) − α_w ∂E(t)/∂w_i(t)
where α_w is the learning rate. For E(t) and w_i(t), we have:
∂E(t)/∂w_i(t) = −(1 − λ) δ_Q(t) y_i(t)
Absorbing the constant factor (1 − λ) into the learning rate, by the chain rule each neuron's output-weight update formula becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
For the center and standard deviation μ_i and σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ‖x − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centers and standard deviations, respectively.
3) Solving the next action in Q-learning by gradient descent
For discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next optimal action. For a Q function over continuous actions, gradient descent is used to solve for the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Suppose the current state is s(t); for the function −Q(s(t), b, t) the gradient direction is −∇Q[b], where ∇Q[b] = ∂Q(s(t), b, t)/∂b.
In each iteration step, a is updated opposite to this gradient direction:
a(i+1) = a(i) + λ_a ∇Q[a(i)]
where λ_a is the step size. The overall algorithm for solving max{Q(s(t), b, t) | b ∈ A} by gradient descent is as follows:
1. Initialize parameters, including the allowable error ΔE_min, the maximum iteration count k, the step size λ_a, and a randomly assigned initial value a(0); set i = 0;
2. For a(i), compute the current gradient direction ∇Q[a(i)];
3. Use the update formula to obtain a(i+1);
4. Compute the error ΔE = ‖a(i+1) − a(i)‖; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step 2.
(2) Design of the online-action-adjustment stability controller based on the RBF-Q Learning framework
For the front-back and left-right directions of the robot, two stability controllers are designed respectively:
1) Stability control in the front-back direction
For the left-foot support phase (the right-foot phase is analogous), the state input of RBF-Q Learning for front-back stability control is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t), and θ_ankle_pitch(t) are, respectively, the hip-pitch, knee, and ankle servo angles of the left leg in the humanoid robot's offline basic gait at time t, and θ_xz(t) is the trunk-to-vertical angle in the xz plane at time t.
Since front-back stability depends mainly on the left-leg hip-pitch, knee, and ankle servos, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t), and Δθ_ankle_pitch(t) are, respectively, the adjustment angles of the current hip-pitch, knee, and ankle servos.
To evaluate the action taken by the robot, the body deflection angle obtained from attitude-sensor information is used to compute the immediate reward.
The immediate reward of the front-back reinforcement-learning stability controller is defined as a weighted combination, with immediate-reward weights a_1 and a_2, of penalty terms on θ_xz(t) and its rate of change, where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are, respectively, the trunk-to-vertical angle in the xz plane at time t and its angular velocity. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
2) Stability control in the left-right direction
For the stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is likewise defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are, respectively, the angles of the left-leg hip-roll and ankle-roll servos in the humanoid robot's offline basic gait at time t, and θ_yz(t) is the trunk-to-vertical angle in the yz plane at time t.
Since left-right stability depends mainly on the left-leg hip-roll servo and the ankle-roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are, respectively, the adjustment angles of the hip-roll and ankle-roll servos.
Using the trunk-to-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward of the left-right reinforcement-learning stability controller is defined as a weighted combination, with immediate-reward weights a_1 and a_2, of penalty terms on θ_yz(t) and its rate of change, where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are, respectively, the trunk-to-vertical angle in the yz plane at time t and its angular velocity. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
Compared with the prior art, the present invention has the following advantages:
(1) The RBF-network Q-learning framework (RBF-Q Learning) method optimizes the stability of the robot's walking and possesses online learning ability. After a period of learning, the humanoid robot can walk stably across complex ground environments.
(2) The control problem of humanoid biped walking is highly unstable and strongly nonlinear and is difficult to model accurately. The RBF-network Q-learning framework (RBF-Q Learning) method does not require the system designer to have a deep and accurate understanding of the system dynamics; by learning, the method can provide solutions beyond the designer's own knowledge. At the same time, the method can continuously learn and improve, much as animals in nature acquire most of their abilities through learning and adaptation.
Description of the drawings
Fig. 1 shows the RBF-Q Learning network structure.
Fig. 2 is a flow diagram of the RBF-Q Learning algorithm framework.
Fig. 3 shows the angular-velocity curves of the robot walking on uphill terrain using online-action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis, i.e. pitch, and the lower curve is the humanoid robot's angular velocity about the x axis, i.e. roll).
Fig. 4 shows the angular-velocity curves of the robot walking on rugged terrain using online-action-adjustment stability control (after 1000 walking steps; the upper curve is the robot's angular velocity about the y axis, i.e. pitch, and the lower curve is the humanoid robot's angular velocity about the x axis, i.e. roll).
Embodiment
The specific embodiments of the present invention are described in detail below in conjunction with the accompanying drawings, but implementation and protection of the present invention are not limited thereto. Where a symbol or process is not described in detail below, those skilled in the art may refer to the prior art to realize it.
(1) Using a three-dimensional inverted pendulum model, ZMP analysis is performed on a simplified humanoid robot model, and the robot's center-of-mass and foothold trajectories during the gait process are computed. From the center-of-mass and foothold trajectories, inverse kinematics analysis yields each joint's motion trajectory in the humanoid robot's gait process, which is saved as the robot's offline basic gait information.
(2) Design of the Q-learning framework based on an RBF network (RBF-Q Learning).
1) Fitting the Q function with an RBF network
The RBF network approximates and fits the Q function in Q-learning. Suppose the Q function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t); the RBF neural network is as follows (see Fig. 1).
Input layer: s(t) denotes the state input to the Q function at time t in Q-learning, of dimension n; a(t) denotes the action input to the Q function at time t, of dimension m.
Hidden layer: y(t) is the hidden-layer RBF activation vector, with k neurons in total. A Gaussian kernel is used as each neuron's RBF activation function; the output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp(−‖x − μ_i‖² / (2σ_i²)), i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions.
Output layer: Q(t) denotes the Q-function output, updated by the formula
Q(t) = Σ_{i=1}^{k} w_i y_i(t)
where w_i is the output weight of the i-th neuron in the Q function.
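The network just described (Gaussian hidden layer over the concatenated state-action input, linear output layer) can be sketched as follows. The class name, the dimensions, and the random initialization of the centers are illustrative assumptions, not part of the patent:

```python
import numpy as np

# Minimal sketch of the RBF network that fits Q(s, a), assuming the Gaussian
# kernel and linear output layer defined above. Center initialization and all
# sizes are illustrative.
class RBFQNetwork:
    def __init__(self, n_state, m_action, k=10, seed=0):
        rng = np.random.default_rng(seed)
        d = n_state + m_action                              # input x = [s(t), a(t)]
        self.centers = rng.uniform(-1.0, 1.0, size=(k, d))  # mu_i
        self.sigmas = np.ones(k)                            # sigma_i
        self.weights = np.zeros(k)                          # w_i

    def hidden(self, s, a):
        # y_i(t) = exp(-||x - mu_i||^2 / (2 sigma_i^2))
        x = np.concatenate([s, a])
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.sigmas ** 2))

    def q(self, s, a):
        # Q(t) = sum_i w_i y_i(t)
        return float(self.weights @ self.hidden(s, a))
```

With zero output weights the network outputs Q = 0 everywhere; learning then shapes the weights, centers, and widths as described in the update rules below.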
For the update of this RBF network, define the Q-learning error δ_Q as follows:
δ_Q = (1 − λ)(r + γQ_max − Q(s, a*, t))
where λ is the learning factor (0 ≤ λ ≤ 1); γ is the decay factor (0 < γ < 1); Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; and s is the input state. The error δ_Q indicates the degree of convergence of the Q function during learning. The learning performance index E of the system is defined as:
E(t) = δ_Q²(t) / 2
Using the BP algorithm and gradient descent, the RBF network is updated. For each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) − α_w ∂E(t)/∂w_i(t)
where α_w is the learning rate. For E(t) and w_i(t), we have:
∂E(t)/∂w_i(t) = −(1 − λ) δ_Q(t) y_i(t)
Absorbing the constant factor (1 − λ) into the learning rate, by the chain rule each neuron's output-weight update formula becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
For the center and standard deviation μ_i and σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ‖x − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centers and standard deviations, respectively.
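One full update step, combining the Q-learning error with the gradient rules for the weights, centers, and widths, can be sketched as below. The hyperparameter values are illustrative, and applying the weight update before the center/width updates is a design choice of this sketch, not something the text specifies:

```python
import numpy as np

# One RBF-network update step. delta_Q = (1 - lam) * (r + gamma * Q_max - Q(s, a*, t)).
# lam, gamma, and the learning rates a_w, a_mu, a_sigma are illustrative values.
def rbf_q_update(centers, sigmas, weights, x, r, q_max, q_sa,
                 lam=0.1, gamma=0.9, a_w=0.1, a_mu=0.01, a_sigma=0.01):
    delta_q = (1.0 - lam) * (r + gamma * q_max - q_sa)
    d2 = np.sum((centers - x) ** 2, axis=1)
    y = np.exp(-d2 / (2.0 * sigmas ** 2))
    # w_i(t+1) = w_i(t) + a_w * delta_Q * y_i(t)
    weights = weights + a_w * delta_q * y
    # mu_i update: dy_i/dmu_i = y_i (x - mu_i) / sigma_i^2
    centers = centers + a_mu * delta_q * (weights * y / sigmas ** 2)[:, None] * (x - centers)
    # sigma_i update: dy_i/dsigma_i = y_i ||x - mu_i||^2 / sigma_i^3
    sigmas = sigmas + a_sigma * delta_q * weights * y * d2 / sigmas ** 3
    return centers, sigmas, weights, delta_q
```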
2) Solving the next action in Q-learning by gradient descent
For discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next optimal action. For a Q function over continuous actions, gradient descent is used to solve for the next action.
The maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}. Suppose the current state is s(t); for the function −Q(s(t), b, t) the gradient direction is −∇Q[b], where ∇Q[b] = ∂Q(s(t), b, t)/∂b.
In each iteration step, a is updated opposite to this gradient direction:
a(i+1) = a(i) + λ_a ∇Q[a(i)]
where λ_a is the step size. The overall algorithm for solving max{Q(s(t), b, t) | b ∈ A} by gradient descent is as follows:
1. Initialize parameters, including the allowable error ΔE_min, the maximum iteration count k, the step size λ_a, and a randomly assigned initial value a(0); set i = 0;
2. For a(i), compute the current gradient direction ∇Q[a(i)];
3. Use the update formula to obtain a(i+1);
4. Compute the error ΔE = ‖a(i+1) − a(i)‖; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step 2.
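Steps 1-4 above can be sketched as the following action search. For simplicity this sketch takes the gradient of Q with respect to the action numerically; an analytic gradient through the RBF network could be used instead. The function names, step size, and tolerances are illustrative assumptions:

```python
import numpy as np

# Gradient ascent on Q(s, b) over continuous actions (equivalently, descent
# on -Q), following steps 1-4 above. Stops when the action change falls below
# tol or after max_iter iterations.
def best_action(q_func, s, a0, step=0.1, tol=1e-4, max_iter=100):
    a = np.asarray(a0, dtype=float)
    for _ in range(max_iter):
        # numerical gradient dQ/da via central differences
        grad = np.zeros_like(a)
        h = 1e-5
        for j in range(a.size):
            e = np.zeros_like(a)
            e[j] = h
            grad[j] = (q_func(s, a + e) - q_func(s, a - e)) / (2 * h)
        a_next = a + step * grad          # ascend Q (descend -Q)
        if np.linalg.norm(a_next - a) <= tol:
            return a_next
        a = a_next
    return a
```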
Combining the RBF neural network and the gradient-descent action solver, the RBF-Q Learning algorithm framework is described as a whole; the algorithm flow is shown in Fig. 2.
(3) Design of the online-action-adjustment stability controller based on the RBF-Q Learning framework
The state input and action output of the RBF-Q Learning framework for humanoid robot walking are designed as follows. Humanoid biped walking is a process in which two single-support phases alternate: taking a first step with the right foot as an example, the left-foot support phase transforms into the right-foot support phase, and the cycle repeats, generally with a brief double-support phase interspersed between the two. In the left-foot support phase, the three-dimensional inverted pendulum formed with the left foot as support is stabilized mainly by the left-foot servos: the robot's front-back stability is determined by the left-leg hip-pitch, knee, and ankle servos, and its left-right stability by the left-leg hip-roll and ankle-roll servos. Likewise, in the right-foot support phase, front-back stability is determined by the right-leg hip-pitch, knee, and ankle servos, and left-right stability by the right-leg hip-roll and ankle-roll servos. According to this structural feature, two stability controllers are designed, one for the front-back direction and one for the left-right direction of the robot.
1) Stability control in the front-back direction
For the left-foot support phase (the right-foot phase is analogous), the state input of RBF-Q Learning for front-back stability control is defined as:
s_pitch(t) = [θ_hip_pitch(t), θ_knee_pitch(t), θ_ankle_pitch(t), θ_xz(t)]
where θ_hip_pitch(t), θ_knee_pitch(t), and θ_ankle_pitch(t) are, respectively, the hip-pitch, knee, and ankle servo angles of the left leg in the humanoid robot's offline basic gait at time t, and θ_xz(t) is the trunk-to-vertical angle in the xz plane at time t.
Since front-back stability depends mainly on the left-leg hip-pitch, knee, and ankle servos, the output action is defined as their online adjustment values:
a_pitch(t) = [Δθ_hip_pitch(t), Δθ_knee_pitch(t), Δθ_ankle_pitch(t)]
where Δθ_hip_pitch(t), Δθ_knee_pitch(t), and Δθ_ankle_pitch(t) are, respectively, the adjustment angles of the current hip-pitch, knee, and ankle servos.
To evaluate the action taken by the robot, the body deflection angle obtained from attitude-sensor information is used to compute the immediate reward. The immediate reward of the front-back reinforcement-learning stability controller is defined as a weighted combination, with immediate-reward weights a_1 and a_2, of penalty terms on θ_xz(t) and its rate of change, where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are, respectively, the trunk-to-vertical angle in the xz plane at time t and its angular velocity. The immediate reward is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible.
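The exact immediate-reward formula is not reproduced in this text, so the sketch below is only one plausible realization consistent with the stated intent (keep θ_xz within the error band ε, penalize its rate of change), NOT the patent's own formula; the weights a1, a2 and the band eps are illustrative values:

```python
# Hypothetical immediate reward for the front-back controller: a weighted sum
# of a penalty for leaving the error band eps and a penalty on the angular
# velocity. This form is an assumption, not the patent's exact formula.
def reward_pitch(theta_xz, dtheta_xz, a1=1.0, a2=0.5, eps=0.05):
    inside = abs(theta_xz) <= eps
    r1 = 0.0 if inside else -abs(theta_xz)   # penalty outside the error band
    r2 = -abs(dtheta_xz)                     # penalty on the rate of change
    return a1 * r1 + a2 * r2
```

Under this form the reward is maximal (zero) when the trunk stays vertical and still, matching the stated control objective.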
2) Stability control in the left-right direction
For the stability control of the humanoid robot in the left-right direction, the state input of RBF-Q Learning is likewise defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
where θ_hip_roll(t) and θ_ankle_roll(t) are, respectively, the angles of the left-leg hip-roll and ankle-roll servos in the humanoid robot's offline basic gait at time t, and θ_yz(t) is the trunk-to-vertical angle in the yz plane at time t.
Since left-right stability depends mainly on the left-leg hip-roll servo and the ankle-roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
where Δθ_hip_roll(t) and Δθ_ankle_roll(t) are, respectively, the adjustment angles of the hip-roll and ankle-roll servos.
Using the trunk-to-vertical angle in the yz plane and its angular velocity to evaluate stability in the left-right direction, the immediate reward of the left-right reinforcement-learning stability controller is defined as a weighted combination, with immediate-reward weights a_1 and a_2, of penalty terms on θ_yz(t) and its rate of change, where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are, respectively, the trunk-to-vertical angle in the yz plane at time t and its angular velocity. The immediate reward is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
3) Online-action-adjustment stability control flow based on the RBF-Q Learning framework
During the humanoid robot's gait, for each action about to be executed, the stability controller obtains sensor information from the Kalman filter algorithm and, according to the current offline basic gait, computes the current state. Following the flow of Fig. 2, it updates the RBF-Q Learning framework, obtains the next action, and corrects the action about to be executed in real time.
In summary, each RBF-Q Learning online-action-adjustment stability controller follows these algorithm steps:
1. Initialize the RBF-Q Learning framework.
2. For each walking action about to be executed, obtain the trunk-to-vertical angle and its angular velocity from the Kalman filter fusion algorithm, and compute the current state according to the formulas above.
3. Using the current state, compute the optimal action according to the RBF-Q Learning framework.
4. Use the optimal action obtained in step 3 to correct the next walking action.
5. Execute the next action, obtain the system's current immediate reward, and update the RBF-Q Learning framework. Jump to step 2.
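Steps 2-5 above form one pass of a per-step control loop, which can be sketched as follows. The `framework`, `kalman_filter`, and `robot` interfaces are stand-ins for the components named in the text (RBF-Q Learning framework, Kalman filter fusion, servo commands); their method names are assumptions of this sketch:

```python
# One pass of the online-adjustment loop (steps 2-5). offline_gait[t] is the
# offline basic-gait joint command for step t; the returned command is the
# corrected action actually executed.
def control_step(framework, kalman_filter, offline_gait, robot, t):
    # step 2: read trunk-to-vertical angle and angular velocity, build state
    theta, dtheta = kalman_filter.estimate()
    state = framework.make_state(offline_gait[t], theta, dtheta)
    # step 3: optimal action (joint-angle adjustments) from the framework
    action = framework.best_action(state)
    # step 4: correct the upcoming walking action with the adjustments
    command = offline_gait[t] + action
    # step 5: execute, observe the immediate reward, update the framework
    reward = robot.execute(command)
    framework.update(state, action, reward)
    return command
```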
(4) Experimental tests and analysis of results
1) Simulation tests and analysis of results
The online-action-adjustment stability controller based on the RBF-Q Learning framework is used for online stability control of humanoid robot walking. The humanoid robot learns in a simulated environment, continuously adapting and modifying the basic gait until the continuous-walking target is completed.
In this group of experiments, the algorithm tends to converge after 1000 walking steps and completes 10 consecutive steps on uphill terrain and on rugged terrain, respectively. From the experimental results we can see that, after a period of learning, a humanoid robot using online-action-adjustment stability control based on the RBF-Q Learning framework possesses the ability to walk across complex terrain environments such as slopes and rugged ground.
Fig. 3 shows the real-time change of angular velocity as the humanoid robot, using online-action-adjustment stability control based on the RBF-Q Learning framework, walks on uphill terrain. The test data are from the robot's 1000th walking learning process, in which the robot successfully walked 10 consecutive steps on uphill terrain.
Fig. 4 shows the real-time change of angular velocity as the humanoid robot walks on rugged terrain using the same controller. The test data are from the robot's 1000th walking learning process, in which the robot successfully walked 10 consecutive steps on rugged terrain.
2) Physical robot tests
In real experiments, the online-action-adjustment stability control based on the RBF-Q Learning framework was successfully applied on the humanoid robot platform, and the robot successfully completed walking, thereby verifying the validity of the humanoid robot stable control method based on the RBF-Q Learning framework proposed by the present invention.
Claims (1)
1. A humanoid robot stable control method based on a Q-learning framework built on an RBF network, characterized by comprising the following steps:
(1) designing the Q-learning framework based on an RBF network (RBF-Q Learning): suppose the Q function receives a state vector s(t) and an action vector a(t) as input and outputs a scalar Q(t); specifically comprising:
1) RBF neural network design:
input layer: s(t) denotes the state input to the Q function at time t in Q-learning; a(t) denotes the action input to the Q function at time t;
hidden layer: y_i(t) is the hidden-layer RBF activation function; a Gaussian kernel is used as each neuron's RBF activation function; the output of the i-th neuron's RBF activation function is computed as:
y_i(t) = exp(−‖x − μ_i‖² / (2σ_i²)), i = 1, 2, …, k
where x is the input variable, μ_i and σ_i are the center and standard deviation of the i-th neuron, respectively, and k is the number of RBF activation functions;
output layer: Q(t) denotes the Q-function output, updated by the formula
Q(t) = Σ_{i=1}^{k} w_i y_i(t)
where w_i is the output weight of the i-th neuron in the Q function;
2) RBF network update:
define the Q-learning error δ_Q as follows:
δ_Q = (1 − λ)(r + γQ_max − Q(s, a*, t))
where λ is the learning factor, 0 ≤ λ ≤ 1; γ is the decay factor, 0 < γ < 1; Q_max is the current maximum Q value in the iterative process; r is the immediate reward; a* denotes the selected optimal action; and s is the input state; the error δ_Q indicates the degree of convergence of the Q function during learning; the learning performance index E of the system is defined as:
E(t) = δ_Q²(t) / 2
using the BP algorithm and gradient descent, the RBF network is updated; for each neuron's output weight w_i, the update formula is:
w_i(t+1) = w_i(t) − α_w ∂E(t)/∂w_i(t)
where α_w is the learning rate; for E(t) and w_i(t), we have:
∂E(t)/∂w_i(t) = −(1 − λ) δ_Q(t) y_i(t)
absorbing the constant factor (1 − λ) into the learning rate, by the chain rule each neuron's output-weight update formula becomes:
w_i(t+1) = w_i(t) + α_w δ_Q(t) y_i(t)
for the center and standard deviation μ_i and σ_i of each neuron's RBF function, the update formulas are:
μ_i(t+1) = μ_i(t) + α_μ δ_Q(t) w_i(t) y_i(t) (x − μ_i(t)) / σ_i²(t)
σ_i(t+1) = σ_i(t) + α_σ δ_Q(t) w_i(t) y_i(t) ‖x − μ_i(t)‖² / σ_i³(t)
where α_μ and α_σ are the learning rates of the RBF centers and standard deviations, respectively;
3) solving the next action in Q-learning by gradient descent:
for discrete Q-learning, max{Q(s(t), b, t) | b ∈ A} is solved by traversing the Q table, where b denotes the next optimal action; for a Q function over continuous actions, gradient descent is used to solve for the next action;
the maximization max{Q(s(t), b, t) | b ∈ A} can be converted into the minimization min{−Q(s(t), b, t) | b ∈ A}; suppose the current state is s(t); for the function −Q(s(t), b, t) the gradient direction is −∇Q[b], where ∇Q[b] = ∂Q(s(t), b, t)/∂b;
in each iteration step, a is updated opposite to this gradient direction:
a(i+1) = a(i) + λ_a ∇Q[a(i)]
where λ_a is the step size; the overall algorithm for solving max{Q(s(t), b, t) | b ∈ A} by gradient descent is as follows:
1. initialize parameters, including the allowable error ΔE_min, the maximum iteration count k, the step size λ_a, and a randomly assigned initial value a(0); set i = 0;
2. for a(i), compute the current gradient direction ∇Q[a(i)];
3. use the update formula to obtain a(i+1);
4. compute the error ΔE = ‖a(i+1) − a(i)‖; if ΔE ≤ ΔE_min or i > k, stop; otherwise set i = i + 1 and jump to step 2;
(2) design is based on the online actions adjustment stability controller of RBF-Q Learning framework; For front and back and the left and right both direction of robot, design two stability controllers respectively:
1) stability contorting of fore-and-aft direction
For left foot driving phase, in like manner, for the stability contorting of anthropomorphic robot fore-and-aft direction, the state of definition RBF-QLearning study is input as following right crus of diaphragm:
s
pitch(t)=[θ
hip_pitch(t),θ
knee_pitch(t),θ
ankle_pitch(t),θ
xz(t)]
Wherein, θ
hip_pitch(t), θ
knee_pitch(t), θ
ankle_pitcht () is respectively left foot hip joint pitch-control motor angle, knee joint steering wheel angle and ankle-joint steering wheel angle in the gait of t anthropomorphic robot off-line basis, θ
xzt () is the trunk in t xz plane-plumb line angle;
Left leg hip joint pitch-control motor, knee joint steering wheel and ankle-joint steering wheel are depended primarily on to fore-and-aft direction stability contorting, therefore output behavior is defined as its on-line tuning value:
a
pitch(t)=[Δθ
hip_pitch(t),Δθ
knee_pitch(t),Δθ
ankle_pitch(t)]
Wherein Δθ_hip_pitch(t), Δθ_knee_pitch(t), and Δθ_ankle_pitch(t) are, respectively, the adjustment angles of the current hip pitch servo, knee servo, and ankle servo;
To judge the action taken by the robot, the body deflection angle obtained from the attitude-sensor information is used to compute the immediate reward;
The immediate reward function of the front-back reinforcement-learning stability controller is defined with immediate-reward weights a_1 and a_2, where ε is the allowable error band, and θ_xz(t) and Δθ_xz(t) are, respectively, the trunk-vertical angle in the xz plane at time t and its angular velocity; the immediate reward function is intended to keep θ_xz(t) within the allowable error band while making its rate of change Δθ_xz(t) as small as possible;
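The reward formula itself appears in the source only as an image, so the sketch below assumes one plausible form that matches the stated intent: a weighted penalty on the trunk angle outside the error band ε plus a penalty on its angular velocity. The function name, default weights, and quadratic penalty shape are all hypothetical:

```python
def immediate_reward_pitch(theta_xz, dtheta_xz, a1=1.0, a2=0.5, eps=0.02):
    """Hypothetical immediate reward for the front-back controller.

    Penalizes the trunk-vertical angle theta_xz (rad) only outside the
    allowable error band eps, plus its angular velocity dtheta_xz, so
    theta_xz is driven into the band while its rate of change stays
    small. a1, a2 are the immediate-reward weights named in the text;
    the penalty shape and all default values are assumptions.
    """
    band_violation = max(abs(theta_xz) - eps, 0.0)  # zero inside the band
    return -(a1 * band_violation ** 2 + a2 * dtheta_xz ** 2)
```

Under this form, the reward is maximal (zero) exactly when the trunk angle stays inside the band with no angular velocity, and grows more negative as either quantity worsens.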
2) Stability control in the left-right direction
Similarly, for stability control of the humanoid robot in the left-right direction, the state input of RBF-Q learning is defined as:
s_roll(t) = [θ_hip_roll(t), θ_ankle_roll(t), θ_yz(t)]
Wherein θ_hip_roll(t) and θ_ankle_roll(t) are, respectively, the left-leg hip roll servo angle and ankle roll servo angle in the humanoid robot's offline basic gait at time t, and θ_yz(t) is the trunk-vertical angle in the yz plane at time t;
Since left-right stability control depends mainly on the left-leg hip roll servo and ankle roll servo, the output action is defined as:
a_roll(t) = [Δθ_hip_roll(t), Δθ_ankle_roll(t)]
Wherein Δθ_hip_roll(t) and Δθ_ankle_roll(t) are, respectively, the adjustment angles of the hip roll servo and ankle roll servo;
Considering that the trunk-vertical angle in the yz plane and its angular velocity evaluate stability in the left-right direction, the immediate reward function of the left-right reinforcement-learning stability controller is defined with immediate-reward weights a_1 and a_2, where ε is the allowable error band, and θ_yz(t) and Δθ_yz(t) are, respectively, the trunk-vertical angle in the yz plane at time t and its angular velocity; the immediate reward function is intended to keep θ_yz(t) within the allowable error band while making its rate of change Δθ_yz(t) as small as possible.
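One control cycle of the two controllers might be wired together as below; the `control_step` function, the dictionary-based gait and sensor interfaces, and the `best_action` solver hook are all hypothetical stand-ins for illustration, not the patent's API:

```python
import numpy as np

def control_step(sensors, gait, q_pitch, q_roll, best_action):
    """One hypothetical control cycle combining both stability controllers."""
    # Front-back (pitch) state: offline-gait servo angles + trunk angle in xz
    s_pitch = np.array([gait["hip_pitch"], gait["knee_pitch"],
                        gait["ankle_pitch"], sensors["theta_xz"]])
    # Action: online corrections [d_hip_pitch, d_knee_pitch, d_ankle_pitch]
    a_pitch = best_action(q_pitch, s_pitch, a0=np.zeros(3))

    # Left-right (roll) state: offline-gait servo angles + trunk angle in yz
    s_roll = np.array([gait["hip_roll"], gait["ankle_roll"],
                       sensors["theta_yz"]])
    # Action: online corrections [d_hip_roll, d_ankle_roll]
    a_roll = best_action(q_roll, s_roll, a0=np.zeros(2))

    # Superimpose the learned corrections on the offline basic gait
    return {
        "hip_pitch":   gait["hip_pitch"]   + a_pitch[0],
        "knee_pitch":  gait["knee_pitch"]  + a_pitch[1],
        "ankle_pitch": gait["ankle_pitch"] + a_pitch[2],
        "hip_roll":    gait["hip_roll"]    + a_roll[0],
        "ankle_roll":  gait["ankle_roll"]  + a_roll[1],
    }
```

With a solver that returns zero corrections, the output simply reproduces the offline basic gait, which makes the superposition structure easy to check.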
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201510299823.3A (CN104932264B) | 2015-06-03 | 2015-06-03 | Humanoid robot stable control method of Q-learning framework based on RBF networks
Publications (2)

Publication Number | Publication Date
---|---
CN104932264A | 2015-09-23
CN104932264B | 2018-07-20
Family
ID=54119479
Legal Events

Date | Code | Title | Description
---|---|---|---
| C06 / PB01 | Publication |
| C10 / SE01 | Entry into substantive examination |
| GR01 | Patent grant |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20180720