CN112060082A - Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model
- Publication number: CN112060082A (application CN202010836621.9A)
- Authority: CN (China)
- Prior art keywords: state, cerebellum, robot, joint, model
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1602—Programme controls characterised by the control system, structure, architecture
- B25J9/161—Hardware, e.g. neural networks, fuzzy logic, interfaces, processor
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1656—Programme controls characterised by programming, planning systems for manipulators
- B25J9/1664—Programme controls characterised by programming, planning systems for manipulators characterised by motion, path, trajectory planning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/008—Artificial life, i.e. computing arrangements simulating life based on physical entities controlled by simulated intelligence so as to replicate intelligent life forms, e.g. based on robots replicating pets or humans in their appearance or behaviour
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
An online stability-controlled humanoid robot based on a bionic reinforcement-learning cerebellum model belongs to the field of humanoid robots and aims to improve the stability and balance of a humanoid robot while walking. A cerebellum-model controller acts on top of an offline gait and comprises a state coding module, a cerebellum model, an inferior-olive feedback module and a motion mapping module. The state coding module adjusts the activation state of the parallel fibers (PF) according to state information collected by the robot's sensors; the inferior-olive feedback module modifies the behavior selection probability and the weights stored in the cerebellar neurons based on evaluative feedback from the environment; and the motion mapping module adjusts the robot's actions according to the output of the cerebellum functional module. The effect is improved stability and balance of the humanoid robot during walking.
Description
Technical Field
The invention belongs to the field of humanoid robots, and relates to an online stably-controlled humanoid robot based on a bionic reinforcement learning type cerebellum model.
Background
A humanoid robot aims to reproduce the appearance and behavior of human beings and to replicate their general-purpose abilities, so that it can take over repetitive, high-risk and labor-intensive work; it therefore has broad application prospects. A humanoid robot is a typical multi-joint, nonlinear, under-actuated system, so its gait control is a very challenging problem and is also key to its wide application. Researchers have proposed various gait control methods; at present the most common approach is to decompose the motion task into sub-modules that are planned and controlled separately to form an offline gait pattern, for example using Zero Moment Point (ZMP) theory, inverted pendulum models or genetic algorithms. These conventional offline control and planning methods are designed mainly for specific movement tasks, such as stable walking on flat ground, but on non-flat ground the gait planning performance of the humanoid robot still needs to be improved. In recent years, therefore, offline gait planning has widely been combined with online adjustment strategies. However, current online intelligent adjustment strategies are still mostly built on accurate motion-control models, and the resulting controllers lack generality and have weak disturbance rejection. Existing offline gait planning methods do not take changes in the road environment into account and can only mechanically track a pre-planned movement pattern, so they are difficult to apply to environments other than flat roads. The real-world environment is complex and changeable, and an ideal, perfectly flat road surface is uncommon. For the humanoid robot to be widely applied, the ability to walk stably in complex environments is essential. Intelligent control methods are derived from research on biological intelligence. Walking is essentially a forward-propulsion process in which the human body repeatedly loses and restores its balance; gait reflects the body's ability to control its center of gravity and is an important part of the human balance system. The cerebellum, as an important organ through which the human body controls movement and balance, plays a crucial role in gait control. For many years, scholars in biology, neurophysiology, control engineering and related fields have imitated the structural or functional characteristics of the cerebellar nervous system, establishing models such as the Cerebellar Model Articulation Controller (CMAC) and applying them to humanoid robot control. These cerebellar models usually aim only at the control effect: although they borrow the functional patterns of the cerebellum, they ignore its structural characteristics.
Disclosure of Invention
To address the problem of improving the stability and balance of a humanoid robot while walking, the invention provides the following technical scheme: an online stably controlled humanoid robot based on a bionic reinforcement-learning cerebellum model, comprising a device that performs offline gait planning for the humanoid robot, whose output lets the robot track joint motion trajectories generated offline and thereby gives it basic walking ability; and a cerebellum-model controller that acts on the offline gait. The cerebellum-model controller comprises a state coding module, a cerebellum model, an inferior-olive feedback module and a motion mapping module. The state coding module adjusts the activation state of the parallel fibers (PF) according to state information collected by the robot's sensors; the inferior-olive feedback module modifies the behavior selection probability and the weights stored in the cerebellar neurons based on evaluative feedback from the environment; and the motion mapping module adjusts the robot's actions according to the output of the cerebellum functional module.
Beneficial effects: starting from the goal of constructing a humanoid robot control method based on cerebellar mechanisms, the invention studies the anatomical and physiological structure of the cerebellum in depth, introduces a reinforcement learning mechanism and establishes a bionic control model based on the cerebellum, thereby improving the stability and balance of the humanoid robot during walking.
Drawings
Detailed Description
This embodiment discloses an online walking stability control method for a humanoid robot based on the Actor-Critic (AC) reinforcement learning algorithm. It follows the research idea of a walking control strategy that combines offline gait planning with online stability adjustment; the frame structure of the designed walking control strategy is shown in figure 1. The method consists of two main steps: 1. An offline gait, i.e. the offline gait plan of the humanoid robot, is generated from ZMP theory and cubic spline interpolation, so that the robot obtains basic walking ability by tracking the joint motion trajectories generated offline. 2. An online stability controller is designed based on the AC reinforcement learning algorithm; during walking it acquires the state information of the robot in real time and adjusts the walking posture so that the robot can walk stably on an uneven road surface. The two steps are described in detail below:
1. Generating the offline gait: the method comprises the steps of S1, establishing a kinematic model of the humanoid robot; S2, dividing the gait cycle and determining the ZMP trajectory; S3, planning the ankle and hip joint trajectories; and S4, generating the offline gait of the humanoid robot. Each step is explained in detail below.
S1, establishing a kinematic model of the humanoid robot: the ROBOTIS-OP2 humanoid robot (hereinafter abbreviated OP2) is used as the research platform. To generate its offline gait plan, a kinematic model must first be built with reference to the body dimensions and joint layout of OP2. The kinematic model of OP2 is established using the DH method. The joint and link layout of OP2 is shown in fig. 2 (the dots represent joints and the lines represent links). OP2 is simplified to a series of rod-joint structures; the detailed dimensions of each rod and the mass of each body part are given in tables 3.1 and 3.2.
TABLE 3.1 Link dimensions of the robot
TABLE 3.2 Mass of each part of the robot
Since a kinematic model with more links requires more computation, the link structure of the OP2 robot needs to be simplified. The OP2 robot has two important structural features: (1) the chest integrates the major heavy components such as the microcomputer and the battery; (2) the ends of the arms carry no actuators, are light, and swing only slightly during walking. In addition, a humanoid robot relies mainly on leg motion when walking, so the influence of arm swing can be neglected when building the link model: the upper body is treated as a whole, and its center of gravity is assumed to lie directly above the hip joint. The simplified link model is shown in fig. 3, with the world coordinate system placed at the pitch-axis ankle joint of the left sole as origin. The multi-link model has 12 degrees of freedom and 9 links; in fig. 3, L_i and M_i denote the length and mass of each link, and the parameters of the simplified model are listed in table 3.3. The simplified upper body is regarded as a rigid body fixed to the hip link L_4, with link length L_9 and mass M_9.
TABLE 3.3 simplified model parameters
For convenience in describing and distinguishing the joints of the link model, each joint is classified as a pitch-axis, roll-axis or course-axis joint according to its rotation direction; the three axes are defined as shown in fig. 4. Although the link model has twelve degrees of freedom, only forward walking is considered, so the two course-axis hip-joint degrees of freedom do not change during walking; the simplified link model therefore effectively considers only the remaining ten degrees of freedom. The joint rotation angles of these ten degrees of freedom are denoted α1 to α10, and the joints they correspond to are listed in table 3.4.
S2, gait cycle division and ZMP trajectory: the invention adopts a symmetric division of the gait cycle. Its advantage is that only the joint trajectories of one sub-period need to be planned; the trajectories of the other sub-period are related to them by left-right foot symmetry, which simplifies the offline gait planning of the humanoid robot. The following offline gait planning is described for the left-foot-support sub-period, whose duration is uniformly specified as 1 unit of time. The position of the ZMP is the key to whether the humanoid robot can keep its stability while moving, so the trajectory of the ZMP point during walking is planned first. Based on the ZMP trajectory shown in fig. 40, the ZMP motion trajectory within one gait sub-period can be expressed by the following formula:
where X_zmp(t), Y_zmp(t), Z_zmp(t) are the coordinates of the ZMP in the world coordinate system; s is half the length of the plantar stability zone in the X direction and adjusts the range of motion of the ZMP during walking; σ is the proportion of one walking sub-period occupied by the double-support phase (DSP); and q is a constant determined by the stride length of the swing leg.
S3, ankle joint and hip joint trajectory planning:
(1) Ankle joint trajectory planning: during the single-foot-support phase of the walking cycle, the left foot is in contact with the ground and is regarded as fixed to it. In this phase the right foot is always in the air, and the trajectory of its ankle joint is very important. On the one hand, the ankle trajectory directly determines the trajectory of the sole and allows possible obstacles to be crossed. On the other hand, once the right sole leaves the ground the robot enters the single-foot-support phase; this process must remain as stable and smooth as possible, and the acceleration and velocity relative to the ground must not be too high. Furthermore, the ankle joint has two degrees of freedom, but there is no link structure between them, so the two degrees of freedom can be regarded as coincident.
Based on the above consideration, the constraint conditions are set for the ankle joint as follows:
where L_step denotes the forward stepping distance of the right foot, H_step is the maximum height the ankle joint can reach, and T_a and T_b adjust how long the ankle joint stays at this maximum height. From the constraints of equations (3.15) and (3.16), the ankle trajectory expressed by a cubic interpolation function can be derived:
X(t) = -0.5 L_step + 3 L_step t² - 2 L_step t³, 0 ≤ t ≤ 1 (3.17)
Y(t) = 0, 0 ≤ t ≤ 1 (3.19)
When the target environment is an uneven road surface on which obstacles may be present, L_step and H_step can be set according to the obstacle height. In the present invention, obstacles on the ground are not yet considered.
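The swing-ankle trajectory of equations (3.17) and (3.19) can be evaluated directly; the following is a minimal sketch, not taken from the patent text, in which only the X and Y components are reproduced because the vertical component (equation 3.18) is not shown above. The function name and the check at the end are illustrative assumptions.

```python
import numpy as np

def ankle_xy(t, L_step):
    """Forward (X) and lateral (Y) ankle position for normalised time t in [0, 1]."""
    x = -0.5 * L_step + 3.0 * L_step * t**2 - 2.0 * L_step * t**3   # eq. (3.17)
    y = np.zeros_like(t)                                            # eq. (3.19)
    return x, y

t = np.linspace(0.0, 1.0, 101)
x, y = ankle_xy(t, L_step=5.0)     # L_step = 5 as in the example settings given later
assert np.isclose(x[0], -2.5) and np.isclose(x[-1], 2.5)   # the foot advances by L_step
```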
(2) Hip joint trajectory planning: the trajectory of the hip joint is planned next. The hip joint has three degrees of freedom; since only forward walking of the humanoid robot is considered, the course-axis hip joint is not used. Fig. 5 shows front and side views of the robot walking forward during one sub-period. Within a sub-period, the roll-axis hip joint moves relatively simply within the lateral plane, and its trajectory can be obtained by directly specifying a few key positions. First, during the double-support phase, the ZMP point is shifted onto the left foot while the hip joint moves forward and sideways at the same time; during the subsequent single-foot-support phase the roll-axis hip joint is kept unchanged while the ZMP point moves from the rear to the front of the left sole; finally, the body returns to the upright double-support phase and the roll-axis hip joint returns to its initial position. Fitting this motion with a cubic spline interpolation function gives the trajectory of the roll-axis hip joint:
where t_f and t_b are the two time points separating the double-support and single-foot-support phases, and d_si is the maximum rotation angle of α5(t).
S4, generating off-line gait of humanoid robot
(1) Upper limb joint trajectories: to simplify the robot link model when building the kinematic model and to ease subsequent calculations, the upper body of the robot is regarded as a constant rigid body. A direct planning method can be used to determine the trajectories of the upper-body joints during walking; these joints comprise the two neck joints, the four joints of the left and right shoulders, and the two joints of the left and right elbows. The neck and head joints are fixed at a constant 0°; the two roll-axis shoulder joints are set to 15° and -15°; likewise, the elbow joints are set to 20° and -20°. The trajectory planning of the pitch-axis shoulder joints is special: observation of human behavior shows that arm swinging is an indispensable auxiliary action when walking or running. Extending this to humanoid robot gait planning, the alternating front-back swing-arm action during walking helps maintain the robot's stability and also looks more natural. As shown in formula (3.21), a sine function is introduced to describe the trajectory of the pitch-axis shoulder joints during the arm swing; the rotation angles of the left and right joints as functions of time are denoted S_shl(t) and S_shr(t).
(2) Lower limb joint trajectory planning: for the twelve joints of the lower limbs, since no steering needs to be considered, the two course-axis hip joint angles are first fixed at a constant 0°. In addition, constraint terms are added to the remaining ten joints so that the robot's posture during walking is natural and stable. From the kinematic relations between the joint angles, the roll-axis and pitch-axis joints of the robot's lower limbs satisfy the following relations:
after the ZMP trajectory planning and the ankle and hip joint trajectory planning are determined, a complete gait planning is further obtained, the change curve of the rotation angle of each joint of the lower limb can be obtained through the mathematical relationship among all joint angles and by combining additional constraint conditions such as a mass center trajectory obtained by using an inverted pendulum model, and the complete walking gait of the robot is obtained. However, such methods are more limited in generating the joint movement trajectory, so the generated gait is not necessarily the optimal gait. Various optimization algorithms such as a genetic algorithm, a particle swarm algorithm, an evolutionary algorithm, an ant colony algorithm and the like can be optimized under certain limiting conditions to obtain optimal gait planning. The continuous domain ant colony algorithm is derived from a classical ant colony algorithm and is an optimization algorithm capable of optimizing in a continuous space. The inspiration of the ant colony algorithm results from the observation in nature of the foraging behavior of ants, who, by spreading pheromones in the path they travel throughHelping the whole ant colony to find the best food source. The ant colony algorithm is widely applied in various fields depending on the characteristics of distribution calculation, heuristic search, global optimization and the like. The method uses a continuous domain ant colony algorithm to optimize the lower limb joint track. The above pre-planning of the ZMP trajectory and the ankle and hip joints provides a reference for the off-line gait of the robot, in combination with the constraints on the relationship between the joint rotation angles of the joints in equation (3.22), for the remaining lower extremity joint trajectory α2,α3,α7,α8Describing by using a cubic spline method, and constructing an objective function by taking a higher ZMP stability margin and a smaller ankle joint trajectory tracking error as an optimization target:
J = J_z + J_an + J_bound (3.23)
where J_z is the ZMP trajectory error during walking, J_an is the cumulative error of the ankle joint relative to the given trajectory over the sampling instants, and J_bound restricts the joint rotation angles so that they do not exceed the motor limits. Setting s = 1.25, σ = 0.6, q = 0.05, L_step = 5, H_step = 12, t_f = 0.3, t_b = 0.7 and a chosen d_si, the lower-limb joint trajectories were optimized with the ant colony algorithm to obtain the complete gait shown in fig. 6. The joint motion curves in fig. 6 show that the generated joint rotation trajectories change smoothly, without step changes or abrupt jumps over short times; this ensures the stability of the humanoid robot during walking and means the motors have no difficulty tracking the trajectories.
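The objective J = J_z + J_an + J_bound of equation (3.23) can be evaluated for each candidate gait produced by the ant colony optimizer; a minimal sketch follows. The exact error metrics, weightings and the penalty value are not given in the text, so absolute errors and a simple limit-violation penalty are used here purely as illustrative assumptions.

```python
import numpy as np

def gait_objective(zmp_err, ankle_err, joint_angles, joint_limits, penalty=1e3):
    """Objective J = J_z + J_an + J_bound of eq. (3.23) for one candidate gait.

    zmp_err      : ZMP tracking error at each sampling instant
    ankle_err    : ankle-trajectory tracking error at each sampling instant
    joint_angles : candidate joint angles over the sub-period, shape (T, n_joints)
    joint_limits : (lower, upper) motor limits per joint
    """
    J_z = np.sum(np.abs(zmp_err))            # ZMP trajectory error during walking
    J_an = np.sum(np.abs(ankle_err))         # cumulative ankle tracking error
    lower, upper = joint_limits
    violation = np.maximum(joint_angles - upper, 0) + np.maximum(lower - joint_angles, 0)
    J_bound = penalty * np.sum(violation)    # keeps joint angles within the motor limits
    return J_z + J_an + J_bound
```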
2. Designing an online walking stability controller based on a reinforcement learning algorithm:
in the above specific embodiment mode, the off-line gait of the humanoid robot OP2 is generated by combining the ZMP theory and the cubic spline interpolation method, and the generated off-line gait enables the humanoid robot to walk smoothly on a horizontal road surface by controlling each joint steering engine to track a determined track. However, the ability to stably travel on a horizontal road surface alone is not sufficient, and in practical applications, the environment to which the robot is exposed is more complicated. In order to keep the humanoid robot in a stable motion state in a complex road environment, the output of a robot actuator can be adjusted by self state information and environment feedback only depending on the self real-time adjustment capability of the robot under the condition of no human intervention so as to deal with the disturbance in the environment. Therefore, the invention provides an online stability control method based on an ActorCritic algorithm, which takes a common slope environment in the real world as a target environment and aims to solve the problem that a humanoid robot is difficult to stably walk in the slope environment with continuously changing slope.
For ease of understanding, the basic algorithm used in this step is illustrated. AC algorithm based on a BP neural network: the AC algorithm uses two neural networks to fit the policy and the value function, which allows it to accept continuous state inputs directly. Before neural networks were introduced, reinforcement learning algorithms usually stored the value function of discrete actions and states in a table, which then served as the basis for the Agent's actions. However, when the number of states and actions becomes too large, convergence becomes difficult. The introduction of neural networks solved this problem to some extent: exploiting their ability to accept multi-dimensional continuous inputs further improved the performance and range of application of reinforcement learning. There are many kinds of neural networks; among them the Back Propagation (BP) neural network is widely used because of its mature performance, flexible structure and strong nonlinear mapping capability. The structure of the BP neural network is shown in fig. 7. A BP neural network has an input layer, hidden layers and an output layer; the number and structure of the hidden layers are not fixed and can be set flexibly. As a feedforward network, the learning process of a BP neural network adjusts the weights according to feedback so that the fitted input-output relationship is approximated as closely as possible, i.e. so that the error between the output and the expected value is as small as possible.
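As a rough illustration of the kind of BP network described here, the sketch below implements a single-hidden-layer feedforward network with a backpropagation step. The layer sizes, tanh activation, learning rate and class name are assumptions made for the example only; the patent does not specify the network configuration.

```python
import numpy as np

class BPNetwork:
    """Tiny feedforward network with one hidden layer, trained by backpropagation."""
    def __init__(self, n_in, n_hidden, n_out, lr=0.01, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (n_in, n_hidden));  self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, (n_hidden, n_out)); self.b2 = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):
        self.x = x
        self.h = np.tanh(x @ self.W1 + self.b1)          # hidden layer
        return self.h @ self.W2 + self.b2                # linear output layer

    def backward(self, grad_out):
        """Propagate d(loss)/d(output) back and take one gradient-descent step."""
        dW2 = np.outer(self.h, grad_out); db2 = grad_out
        dh = (self.W2 @ grad_out) * (1.0 - self.h**2)    # tanh derivative
        dW1 = np.outer(self.x, dh);       db1 = dh
        self.W1 -= self.lr * dW1; self.b1 -= self.lr * db1
        self.W2 -= self.lr * dW2; self.b2 -= self.lr * db2
```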
In the invention, when the reinforcement-learning online stability controller is designed, both the Actor network and the Critic network are BP neural networks. At time t the controller obtains a state s containing real-time information about the robot; the Actor network outputs an action a based on s; the robot executes a to adjust its posture and reaches state s′ at time t+1, obtaining an immediate return value r. For the Critic network there is then an error:
δ = r + V(s′, θ_c) - V(s, θ_c) (3.24)
where θ_c denotes the parameters of the Critic network. Based on this error, the Critic network is updated by gradient descent as follows:
θ_c = θ_c + α_c δ ∇V(s, θ_c) (3.25)
where α_c is the update step size of the Critic network. The update direction is the direction that minimizes the loss function L(θ_c), expressed as a mean squared error:
With θ_a denoting the Actor network parameters, the Actor network is updated as follows:
θ_a = θ_a + α_a δ ∇ln π(a|s, θ_a) (3.27)
where α_a is the update step size of the Actor network and π(a|s, θ_a) is the policy represented by the Actor network.
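A single update step following equations (3.24)-(3.27) is sketched below. To keep the example short, linear function approximators stand in for the BP networks of the patent and the policy is assumed to be Gaussian with a fixed standard deviation; the function name and the step sizes are likewise illustrative assumptions.

```python
import numpy as np

def ac_update(s, a, r, s_next, theta_c, theta_a, alpha_c=0.01, alpha_a=0.001, sigma=0.1):
    """One Actor-Critic update following eqs. (3.24)-(3.27)."""
    V = lambda state, th: th @ state                # state-value estimate V(s, theta_c)
    delta = r + V(s_next, theta_c) - V(s, theta_c)  # TD error, eq. (3.24)

    theta_c = theta_c + alpha_c * delta * s         # critic step, eq. (3.25); grad of V w.r.t. theta_c is s

    mu = theta_a @ s                                # assumed Gaussian policy mean
    grad_log_pi = np.outer((a - mu) / sigma**2, s)  # grad of ln pi(a|s, theta_a) for that policy
    theta_a = theta_a + alpha_a * delta * grad_log_pi   # actor step, eq. (3.27)
    return theta_c, theta_a, delta
```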
As the robot walks, its environment changes, and this process is repeated as the environment changes. The learning process of the AC algorithm thus follows the general process of reinforcement learning: the Actor and the Critic are updated once after each action is executed. Once learning is finished and the best strategy has been obtained by reinforcement learning, the Actor alone can serve as the controller and execute the control task independently. Setting the key elements of reinforcement learning: on the basis of the offline gait plan, an online stability controller based on the AC algorithm is designed. For slope environments whose gradient changes continuously, the controller can learn the optimal action strategy for keeping the humanoid robot walking stably. Specifically, while tracking the offline gait trajectory, the humanoid robot senses changes in the external environment through its on-board sensors, adjusts its walking posture according to the sensor information, and finally learns the action strategy corresponding to the slope environment, achieving stable walking in that environment. The key elements of reinforcement learning are set as follows.
(1) State space: the state s is the input of the reinforcement learning algorithm and the only way for the Agent to know its situation in the environment. Just as humans sense the outside world through their eyes, ears, mouth, nose and limbs, a robot explores the outside world through the various sensors it carries. Ideally the sensors would be numerous, because that would mean the robot knows the environment and its own condition well, but state information of too high a dimension causes the search space to grow drastically and makes convergence difficult. Because the robot has a complex structure and many degrees of freedom, its motion is characterized by high-dimensional state inputs and high-dimensional action outputs. When designing a reinforcement-learning control algorithm for a robot, the available state information is therefore usually pruned to improve training efficiency, keeping only the main information. The robot used in the invention carries two sensors, a gyroscope and an accelerometer. When the robot walks on a sloped road surface its state does not change violently, and the gyroscope contributes little to describing its state. Therefore the state information of the online stability controller is taken from the accelerometer, and the input of the reinforcement-learning online stability controller at time t is:
s(t) = [f(t), θ_acc_x(t), θ_acc_y(t), θ_acc_z(t)] (3.28)
where θ_acc_x(t), θ_acc_y(t) and θ_acc_z(t) are the signals acquired by the acceleration sensor in the three directions, and f(t) is a variable indicating the support foot (left or right), which can be read directly from the offline gait plan.
(2) Action space: when the humanoid robot encounters a change of road gradient while walking, its whole body tilts and its center of mass and ZMP deviate from the stable support region. This is especially critical in the single-foot-support phase: compared with the phase in which both feet are on the ground, the support region shrinks to a single sole and the stable range is further reduced; adding the errors with which each servo tracks its joint trajectory during actual walking, the robot's stability is weakest at this moment. To simplify the computation, when dividing the walking phases a complete walking cycle is split into two symmetric sub-periods, each containing a single-foot-support phase and a double-support phase. One sub-period is therefore treated as a whole: when the road inclination makes the robot tilt forward or backward, shifting the center of mass and ZMP toward the front or back of the body, they must be adjusted back into the stable region. The action space is designed so that the pitch-axis joints of the legs are adjusted as a whole, without involving the upper-body joints. As shown in fig. 8, the adjustment of each joint within one action is illustrated with the left leg as the front leg and the right leg as the back leg: fig. 8(a) shows the rotation direction of each leg joint when the center of mass and the ZMP are adjusted forward, the dotted arrow being the rotation direction of the left pitch-axis hip joint; similarly, fig. 8(b) shows the reversed joint rotations when the center of mass and the ZMP are adjusted backward. The action output of the Actor is designed as the adjustment of the offline angles of these six joints, and at time t the reinforcement learning obtains the corresponding action from the state information provided by the robot:
wherein,the adjustment amounts of the joint rotation angles of the pitch axis ankle joint, the pitch axis knee joint and the pitch axis hip joint of the left foot and the right foot on the basis of the off-line angle are respectively. Meanwhile, in order to avoid overlarge instantaneous acceleration possibly caused by the fact that the adjustment quantity directly acts on each joint and to enable the walking process of the humanoid robot to be more natural, the damping coefficient is basedProcessing the actions a (t) using cubic spline:
where α_real_time(t) is the real-time output angle of the joint, α_offline(t) is the joint angle of the offline trajectory, and S(t) is a cubic spline interpolation function under natural boundary conditions, determined by the following constraints:
where Δθ is the adjustment amount of the corresponding joint.
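The smoothing of equation (3.30) can be sketched as below. The exact boundary constraints of equation (3.31) are not reproduced in the text, so the spline is assumed here to ramp smoothly from 0 to Δθ over the sub-period; the knot placement, function name and period length T are illustrative assumptions.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def smoothed_joint_angle(t, alpha_offline, delta_theta, T=1.0):
    """Blend the offline joint angle with a spline-smoothed adjustment, eq. (3.30).

    alpha_offline : callable giving the offline joint angle at time t
    delta_theta   : adjustment output by the Actor for this joint
    """
    knots = np.array([0.0, T / 2.0, T])
    values = np.array([0.0, delta_theta / 2.0, delta_theta])
    S = CubicSpline(knots, values, bc_type='natural')   # natural boundary conditions
    return alpha_offline(t) + S(t)
```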
(3) Immediate return function: for a given task, the choice of the immediate return value directly determines the quality of learning, because the Agent always tends to select actions in the direction of the largest accumulated return. The immediate return function should therefore be related to the task objective as directly as possible. When designing it, the requirement is that the robot remain stable while walking on the slope, keeping the forward-backward tilting during walking as small as possible. The threshold of the trunk inclination angle is therefore set to 5°; if the inclination exceeds this threshold, the robot is judged to be in an unstable state with a risk of falling. Based on these requirements the invention designs two immediate return functions, given as formula (3.32) and formula (3.33), which are compared in the subsequent experiments. The two functions share the property that entering an unstable state incurs a penalty of -10; they differ in the reward obtained while the tilt stays within the stable range.
where θ_pitch is the inclination angle of the robot about the pitch axis and η is a return-value amplification coefficient.
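A hedged sketch of such a return function follows. The exact forms of equations (3.32) and (3.33) are not reproduced in the text; the sketch only preserves the -10 penalty beyond the 5° threshold and uses a simple η-scaled term as a stand-in for the in-range reward.

```python
def immediate_reward(theta_pitch, eta=1.0, threshold=5.0):
    """Illustrative immediate return: -10 when the trunk tilt exceeds the threshold,
    otherwise a reward that grows as the trunk stays more upright (assumed form)."""
    if abs(theta_pitch) > threshold:
        return -10.0                               # unstable state: risk of falling
    return eta * (threshold - abs(theta_pitch))    # larger reward for a smaller tilt
```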
The learning flow of the reinforcement-learning online stability controller of this embodiment is shown in fig. 9: (1) initialize the hyper-parameters such as the discount factor and learning rates, and initialize the Actor and Critic neural networks, with randomly initialized weights and constant-initialized bias terms; (2) input the current state s(t) into the Actor and Critic networks; the Actor network provides the action output a(t); (3) on top of the offline gait, the humanoid robot adjusts the six lower-limb joints according to a(t), reaches the new state s(t+1), and receives the reward r(t) for action a(t) according to formula (3.32) or (3.33); the robot thus obtains the experience {s(t), a(t), s(t+1), r(t)} for this action; (4) compute the TD error from the new experience using equation (3.24); (5) update the parameters of the Actor network and the Critic network respectively; (6) check whether the current learning episode has finished; if not, jump back to (2).
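The six-step flow can be tied together as in the following sketch. The environment interface, the actor/critic method names and the episode length are all illustrative assumptions; the patent does not prescribe an API, and the actual robot or simulator stands behind env_step and get_state.

```python
def run_episode(env_step, get_state, actor, critic, reward_fn, T_steps=200):
    """One learning episode following the six-step flow above.

    env_step(a)   : applies the joint adjustments a to the (simulated) robot
    get_state()   : returns the state vector s(t) of eq. (3.28)
    actor, critic : objects with act(s)/update(...) and value(s)/update(...) methods
    reward_fn(s)  : immediate return, e.g. the sketch above
    """
    s = get_state()
    for t in range(T_steps):
        a = actor.act(s)                     # (2) Actor proposes joint adjustments
        env_step(a)                          # (3) robot adjusts the six lower-limb joints
        s_next = get_state()
        r = reward_fn(s_next)
        delta = r + critic.value(s_next) - critic.value(s)   # (4) TD error, eq. (3.24)
        critic.update(s, delta)              # (5) update the Critic ...
        actor.update(s, a, delta)            #     ... and the Actor parameters
        s = s_next                           # (6) continue until the episode ends
```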
On the basis of homogeneous coordinate transformations and the DH method, the invention introduces a simplified link model of the humanoid robot, explains how the gait cycle is divided and, after dividing the cycle, determines the ZMP trajectory within a sub-period using ZMP stability theory. The motion trajectories of the simplified upper-body joints are planned directly, while the lower-body joint trajectories are obtained with the continuous-domain ant colony algorithm after the ankle and hip trajectories have been determined from the link model and the ZMP trajectory. Together, the upper- and lower-body joint trajectories form a complete offline gait plan. A walking control strategy is then designed around a reinforcement-learning online stability controller. The strategy combines offline gait planning with online posture adjustment and, to address the difficulty of keeping the robot walking stably on roads with continuously changing gradient, builds the online stability controller with the reinforcement-learning AC algorithm. The control effect of the controller is verified by the simulation experiments described below.
In one embodiment, a method for online walking stability control of a humanoid robot based on a reinforcement-learning cerebellum model is provided. It uses the same offline gait planning method; the difference from the previous embodiment is that, instead of designing the online stability controller with the AC reinforcement learning algorithm, the online walking stability controller is based on a reinforcement-learning cerebellum model. Online stable control under continuous state input is of course still realized with the Actor-Critic algorithm; a bionic robot control model based on cerebellar mechanisms and reinforcement learning is then further established with this control method, and combining the two control methods can further improve stability and environmental adaptability.
To aid understanding, the cerebellum model built on reinforcement learning theory that underlies the basic algorithm is explained. Anatomical and physiological overview of the cerebellum: the human brain consists of the cerebrum, cerebellum, diencephalon and brainstem, and is the most mysterious, complex and indispensable organ of the human body. The cerebrum occupies more than four fifths of the total brain volume, yet the cerebral cortex contains less than one fifth of the brain's neurons. Although the cerebellum is small, only about one eighth of the brain by volume, it contains roughly four times as many neurons as the cerebrum. As research on the cerebellum has progressed, it has been found that the cerebellum not only governs the body's movement and coordination but also plays an important role in emotion and cognition, language processing and working memory. Studying the nerve and electrical signal conduction pathways inside the cerebellum is of great significance for deeply understanding how it operates and for building cerebellum models. Figure 10 shows the main neurons inside the cerebellum and their connectivity. Information enters the cerebellum mainly along two routes, two kinds of afferent axons: climbing fibers (CF) and mossy fibers (MF). MF transmit proprioceptive information through terminal synapses to the dendrites of granule cells (GC), stimulating the GC to excite other cells inside the cerebellar cortex. The GC pass the encoded information on to other parts of the cerebellum via parallel fibers (PF), which activate cells including Golgi cells (GoC), Purkinje cells (PC), basket cells (BC) and stellate cells (SC). A CF can connect to several PCs simultaneously and transmits intense stimulation through the excitatory synapses between the two. Studies show that in the mature cerebellum each PC is dominated by a single CF, and the more active the CF, the more viable the corresponding Purkinje cell. Notably, the excitability of the cerebellum is related to its learning ability. The synapses formed between PF and PC exhibit long-term depression (LTD) plasticity, and this synaptic plasticity is believed to play an important role in the acquisition of motor ability. Recent studies have found that reward and punishment signals together with the expected signal carried by the CF produce effects in the relevant regions of the cerebellum: the presence of a reward signal stimulates some regions and suppresses others, while the absence of the reward signal stimulates the suppressed regions. Among the various neuron types of the cerebellum, the PC is the only output neuron. The axons of the PCs leave the cerebellar cortex, whereas the other neuron types connect only with one another within the cerebellar cortex. After the PC has processed all the information it receives, its output is projected to the deep cerebellar nuclei and then relayed to other areas of the brain to control the body's movements.
Cerebellum model structure: the cerebellar motion control model used by the invention focuses on the cerebellum's capabilities in motion control. On the one hand it realizes the functions of the cerebellum in a modular way, combined with relevant control theory; on the other hand it emphasizes expressing the physiology and anatomical structure of the cerebellum. Fig. 11 is the overall structure diagram of the cerebellum model. While the humanoid robot walks by tracking the offline joint trajectories, the cerebellum model computes a cerebellar output command from the weights stored in its neurons, modifies the joint motion parameters in combination with the command mapping vector, and drives the robot to adjust its walking posture to maintain stability. At the same time, the cerebellum model modifies the weights according to the evaluative feedback it receives, searching for the optimal motion pattern. Because the cell structure of the cerebellum is uniform, with essentially similar neuron types and connection patterns in every region, the main body of the cerebellum model is designed as an array of n basic units; its input-output structure is shown in fig. 12, and every basic unit has the same internal structure and information transmission pattern.
Cerebellum model functions: the internal structure of a basic unit can be broadly divided into four modules. The state encoding module solves the problem of mapping the robot's state information onto the relevant regions; the cerebellum functional module is designed mainly by reference to the neuron connection pattern of the real cerebellum to simulate the conduction of nerve electrical signals inside it, and the synaptic memory information in this module is continuously corrected while the robot walks so that the control function is realized better; the inferior-olive feedback module introduces a TD algorithm into the inferior-olive feedback process, based on the conjecture that the cerebellar learning process employs a reinforcement learning mechanism; and the motion mapping module maps the output of the basic units onto action commands, improving the adaptability of the cerebellum model to different controlled objects and tasks. The modules are briefly described below.
(1) State encoding module: the state encoding module receives the state signal of the humanoid robot and projects it onto PF states in the cerebellum functional module. This part mirrors the CMAC cerebellar model. The CMAC algorithm is likewise inspired by the internal structure and function of the cerebellum: in cerebellar terms, the MF receive proprioceptive information from outside the cerebellum and pass it to the GC for encoding, a process embodied in the CMAC as the projection of input vectors onto the relevant regions. The generalization parameter C reflects the connection relationship between GC and MF, namely that one MF can form synaptic connections with several GC. The CMAC is a neural network based on weight storage; unlike the network structures in deep learning, in which weighted neurons are interconnected through complex connections, the weights of the CMAC are obtained by table lookup. The input-output process of the CMAC is shown in fig. 13, where AC is the conceptual memory and AP is the actual memory. The process has two stages: in the first stage, the input finds the unique set of logical partitions corresponding to it in every dimension, and each logical partition locates its corresponding weight in the actual memory; in the second stage, the weights found in the first stage are accumulated by weighted summation to give the output of the CMAC network. A concrete example: as shown in fig. 14, the input is a two-dimensional space, with the horizontal and vertical axes each representing one dimension, and x1 and y1 the input signals in those dimensions. The input of each dimension is organized into layers and blocks, and within a layer the block, i.e. the logical partition, corresponding to a specific input value can be found. For example, every layer has a logical partition corresponding to x1, and each partition has a corresponding discrete code, here b, e and g. In the same way, the indices corresponding to y1 are B, D and G, and by the rule of same-layer correspondence the corresponding weight-region indices are Bb, De and Gg. Then, according to the mapping rule from logical partitions to relevant regions, the corresponding relevant regions in the cerebellum functional module can be found. These regions are activated and set to "1", and the remaining regions are correspondingly set to "0".
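The layered partition lookup described above is essentially tile coding; the sketch below shows the idea for a continuous input vector. The number of layers, the number of bins, the offsets and the input range are illustrative assumptions and do not come from the patent.

```python
import numpy as np

def cmac_active_cells(x, n_layers=4, n_bins=8, x_min=0.0, x_max=1.0):
    """Map a continuous input vector to the set of activated logical partitions.

    Each of the n_layers tilings is shifted by a fraction of one bin width; the
    input activates exactly one block per layer, mirroring the generalisation
    parameter C of the CMAC (here C = n_layers).
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    width = (x_max - x_min) / n_bins
    active = []
    for layer in range(n_layers):
        offset = layer * width / n_layers                  # staggered layers
        idx = np.floor((x - x_min + offset) / width).astype(int)
        idx = np.clip(idx, 0, n_bins - 1)
        active.append((layer, tuple(idx)))                 # one logical partition per layer
    return active                                          # indices into the weight memory

# The CMAC output is then the (weighted) sum of the weights stored at these addresses.
```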
(2) Cerebellar function module: the cerebellar function module follows the nerve connection structure of the cerebellum and the process of electrical signal conduction, as shown in fig. 15 (a code sketch of this module is given after the weight-update equations below). The PF transmits the encoded signal of the humanoid robot's continuous state information; the input signal is projected onto the relevant regions by the state encoding module, BC and SC are excited in the process, and the parameters s_j and b_j representing the BC and SC state information are set to "1". The synapses formed by the connections between BC, SC and the PC carry their own weights; because BC and SC act to inhibit the PC, these weights are given negative values. The membrane potential P_j(t) of the PC is calculated as follows:
where P_j(t) represents the membrane potential of the PC at time t and is binary, with "0" and "1" denoting the "inhibited" and "activated" states respectively. ω_ij(t) is the weight stored in a PF-PC plastic synapse and represents the influence of the synapse state on the PC membrane potential, where i (i = 1, 2, ..., N) indexes the i-th PF and j (j = 1, 2, ..., M) indexes the j-th PC. PF_i(t) represents the activation state of the i-th PF at time t and is determined by the output of the state encoding module. The state S_j(t) of the j-th PC is determined by equation (4.2), where φ is the threshold potential of the PC.
O(t) = [o_1(t), o_2(t), ..., o_l(t), ..., o_L(t)] is a state vector representing the activation states of the basic units. L is the total number of basic units in the module and is determined by the action space of the cerebellum model, the action space being the set of basic actions the cerebellum model can output. o_l(t) is the state of the l-th basic unit, determined by the following equation:
where μ is a correction factor. The effect of PF-PC synapse excitability on the PC membrane potential varies with time: ω_ij(t) is adjusted continuously at each time t, and as the excitability decays, its change Δω_ij(t) is:
Δω_ij(t) = K[1 − P_j(t)]·PF_i(t)    (4.4)
where K is a weight adjustment coefficient. The weight at time t+1 then changes to:
ω_ij(t+1) = ω_ij(t) + Δω_ij(t)    (4.5)
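As a compact illustration of equations (4.2) to (4.5), the sketch below derives the PC states from the PF activation pattern and applies the excitability-decay weight update; because equation (4.1) is not reproduced above, the exact form of the membrane-potential sum and the values of φ and K are assumptions for illustration only.

```python
import numpy as np

def cerebellar_step(omega, pf, w_s, s, w_b, b, phi=0.5, K=0.01):
    """One update of the cerebellar function module (illustrative form).

    omega    : (N, M) PF-PC plastic synapse weights
    pf       : (N,)   PF activation states (0/1) from the state encoder
    w_s, w_b : (M,)   negative weights of the SC-PC and BC-PC synapses
    s, b     : (M,)   SC and BC states (0/1)
    """
    # assumed membrane potential: weighted PF input plus inhibitory SC/BC input
    P_raw = omega.T @ pf + w_s * s + w_b * b
    # eq. (4.2): PC is "activated" (1) when the potential exceeds the threshold phi
    P = (P_raw > phi).astype(int)
    # eq. (4.4)-(4.5): excitability decay; the binary PC state stands in for P_j(t)
    delta = K * (1 - P)[None, :] * pf[:, None]
    omega_next = omega + delta
    return P, omega_next
```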
(3) Lower olive feedback module: a reasonable interpretation of the real cerebellum should be based on the verified cerebellar circuits, so as to properly explain the function of each cerebellar structure and the information transfer process. During animal locomotion, the PC receives contextual movement information afferent from the MF. At the same time the CF also transmits a flow of movement information, and because the synapses formed between the cerebellar parallel fibers (PF) and the Purkinje cells (PC) are plastic, the information transmitted by the CF is regarded as the basis for changes in PF-PC synaptic function. The CF originates in the inferior olivary nucleus, and its information pathway with the cerebral cortex is regarded as a feedback loop of the cerebellum and as a root of the cerebellum's learning ability. In accordance with the conjecture that cerebellar learning adopts a reinforcement learning mode, a model-free reinforcement learning algorithm, the TD algorithm, is introduced into the lower olive feedback module. The cerebellum model is regarded as the Agent, the state space is constituted by the PC states, and the action space is determined by the outputs of the basic units. The PC state-basic unit output of the cerebellum model is mapped to a reinforcement learning state-action pair; the environment model is unknown to the cerebellum model, and learning through trial and error is the only way to achieve good performance. During learning, the dominant action of reinforcement learning is selected, the action selection probability is adjusted according to the evaluation information fed back from the environment, and the weights in each basic unit are adjusted accordingly. At time t the robot state s_t is acquired through the sensors; the TD algorithm selects and executes a_t based on the behavior selection probability, obtains the next state s_{t+1} and an immediate return value r_{t+1}, and updates the state value function using equation (2.11). Here r_{t+1} compares the current stable state of the robot with that at the previous moment: it is "-1" when the robot moves away from the stable state and "0" when the robot approaches it. The modeling of the cerebellum is the modeling of the relevant synapses; the CF information transmission process adopts a reinforcement learning mechanism, the behavior selection probability π_t(a) is continually adjusted during the trial-and-error process, and the relevant synapses are altered along the way. After obtaining a learning experience, the lower olive module calculates the TD error δ_t using formula (4.6) and modifies π_t(a) based on formulas (4.7) and (4.8) (a code sketch of this feedback step is given after equation (4.11) below).
δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t)    (4.6)
p_{t+1}(s_t, a_t) = p_t(s_t, a_t) + ψ·δ_t    (4.7)
where p_t(s_t, a_t) is the tendency (preference) to select behavior a_t in state s_t at time t; initially the probability of selecting each action is the same, and ψ is a step-size parameter. According to δ_t, the evaluation information fed back by the CF is obtained:
c_l(t) = g(δ_t)    (4.9)
wherein,
Based on c_l(t), the corresponding PF weights are adjusted:
ω_ij(t) = ω_ij(t) − σ·c_l(t)·PF_i(t)    (4.11)
where σ is a positive constant.
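The feedback step of the lower olive module can be summarized as follows; the softmax action selection and the mapping g(·) from the TD error to the CF evaluation signal are assumptions made for illustration, since equations (4.8) and (4.10) are not reproduced above.

```python
import numpy as np

def lower_olive_update(V, pref, s, a, s_next, r_next,
                       gamma=0.9, alpha=0.1, psi=0.05, sigma=0.01,
                       omega=None, pf=None):
    """One TD-based feedback step (illustrative).

    V    : (n_states,) state values, pref : (n_states, n_actions) preferences
    """
    # eq. (4.6): TD error
    delta = r_next + gamma * V[s_next] - V[s]
    # eq. (2.11): TD(0) update of the state value function
    V[s] += alpha * delta
    # eq. (4.7): adjust the preference of the executed action
    pref[s, a] += psi * delta
    # assumed eq. (4.9)-(4.10): CF evaluation is "1" when the outcome is worse
    # than expected, "0" otherwise
    cl = 1.0 if delta < 0.0 else 0.0
    # eq. (4.11): weaken the PF-PC weights of the currently active PFs
    if omega is not None and pf is not None:
        omega -= sigma * cl * pf[:, None]
    return delta, cl

def action_probabilities(pref, s):
    """Assumed form of eq. (4.8): softmax over the action preferences in state s."""
    e = np.exp(pref[s] - pref[s].max())
    return e / e.sum()
```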
(4) Motion mapping (APG) module: this module maps the output of the basic units in the cerebellum function module onto the controlled object to execute the relevant control commands. Houk's studies have shown that the motion signals recorded in red nucleus cells are generated by a central motion pattern generator rather than by continuous peripheral feedback, and the cerebellum is therefore regarded as an adjustable pattern generator. For this purpose the APG method is introduced into the cerebellum model; each APG can generate a motion instruction, in one-to-one correspondence with the basic unit outputs. During the dynamic operation of the cerebellum model, the corresponding action at each time t is computed as:
A(t)=D·O(t) (4.12)
where D is the set vector of motion instructions, also called the command mapping vector, and A(t) is the final output instruction of the cerebellum model at time t.
The four parts form a complete cerebellum model, which has the following advantages: (1) while the control effect is taken into account, the expression of the physiological and structural characteristics of the cerebellum is also emphasized; establishing a sensory-motor control system based on the characteristics of the cerebellar cortex has research value and significance for the development of robotics and control science. (2) By exploiting the advantages of the CMAC, the input information is generalized locally to a certain extent, so similar inputs produce similar outputs; meanwhile, the intermediate "table lookup" processing improves the speed of the cerebellum model's response. (3) The introduction of the APG module, which resembles the freely designed action space in reinforcement learning, allows the cerebellum model to be applied to different controlled objects and different tasks, giving the method good universality.
The invention's specific design of the online walking stability controller based on the reinforcement-learning cerebellum model is as follows. The humanoid robot walking control system based on the cerebellum model is shown in fig. 16. Offline gait planning generates the basic gait of the robot; the state encoding module adjusts the activation state of the PF according to the state information collected by the robot's sensors; the lower olive feedback module modifies the behavior selection probability and the cerebellar neuron storage weights based on the evaluation information fed back by the environment; and the motion mapping module adjusts the robot's action according to the output of the cerebellum model. When the real cerebellum works, it receives external information input from the cerebrum and the spinal cord and regulates the movement of the human body in real time according to this information. The cerebellum's ability to regulate movement is acquired through the interaction between the human body and the environment. Based on offline gait planning, the robot can follow the pre-planned joint trajectory on a flat surface, but it cannot walk stably on complex surfaces. An inclined road surface is a common scene in real environments; to respond to changes in road gradient, the robot can maintain its stability by finely regulating its leg joints. An accelerometer and a gyroscope are integrated on the sub-control circuit in the chest of the OP2 robot, and their signals provide information about the stability of the current robot posture. The robot is stabilized by using the trunk attitude angle as the cerebellum model input: the attitude angle is determined by the state encoder from the sensor information and mapped to the corresponding PF state, and the robot adjusts its lower-limb joints according to the output of the cerebellum model so that it remains balanced and stable while walking.
The output of the cerebellum model is determined by the activation states of the basic units; the activation states of the basic units at time t form a basic unit state vector, and the dot product of this state vector and the command mapping vector is the joint adjustment amount at time t. The hip joint is taken as the control object and each output acts directly on the hip joint. Although the motor is the actuating element, it can be regarded as the joint-bone-muscle system of the human body, with the forward and reverse rotation of the joint motor understood as the contraction and relaxation of skeletal muscle, so the adjustment amount output by the cerebellum model is analogous to a nerve electrical signal with a certain amplitude and duration. When the hip joint is the object of feedback control by the cerebellum model, the command mapping vector is as follows:
where σ is the reference value of the hip joint adjustment amount, ξ is a unit increment, and setting p_i adjusts the amplitude of the output command. t_s controls the duration of the adjustment process.
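A minimal sketch of how the hip adjustment could be produced from the basic-unit state vector is given below; the specific form D_i = σ + p_i·ξ is an assumption for illustration, since the command mapping vector itself is not reproduced above.

```python
import numpy as np

def hip_command_vector(p, sigma_ref, xi):
    """Assumed form of the command mapping vector: D_i = sigma_ref + p_i * xi."""
    return sigma_ref + np.asarray(p, dtype=float) * xi

def joint_adjustment(D, O):
    """Eq. (4.12): the joint adjustment at time t is the dot product A(t) = D . O(t)."""
    return float(np.dot(D, O))

# usage sketch: 8 basic units, only unit 6 active, adjustment held for t_s control cycles
D = hip_command_vector(p=[-3, -2, -1, 0, 0, 1, 2, 3], sigma_ref=0.0, xi=0.01)
O = np.array([0, 0, 0, 0, 0, 1, 0, 0])
a_t = joint_adjustment(D, O)      # e.g. +0.01 rad on the pitch-axis hip joint
t_s = 5                           # assumed duration of the adjustment, in control cycles
```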
Fig. 17 shows the flowchart of the learning process of the cerebellum model stability controller (a minimal code sketch of this loop follows the steps below): (1) initialize ω_ij, the SC-PC and BC-PC synapse weights, and π_0(a); (2) the state encoding module determines the cerebellum model state s_t from the robot sensor information; (3) select and execute an action a_t based on the current policy, obtain the state s_{t+1} and the immediate return value r_{t+1}, and update the state value function; (4) use the TD error δ_t of the previous action to update the behavior selection probability π_t(a), and modify the relevant neuron synapse weights according to the CF feedback information; (5) judge whether this round of learning has finished; if not, return to step (2) and continue the learning process; otherwise reset the system and start the next round of learning, until the whole learning process is finished.
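Putting the pieces together, a minimal sketch of the learning loop of fig. 17 might look as follows; the robot-interface functions reset(), read_sensors(), execute(), reward() and is_done(), as well as the model methods, are hypothetical placeholders rather than the actual simulator or robot API.

```python
def train_cerebellum_controller(env, encoder, model, n_rounds=100):
    """Illustrative learning loop of the cerebellum-model stability controller.

    env     : hypothetical robot/simulator interface with reset(), read_sensors(),
              execute(action), reward() and is_done()
    encoder : state encoding module (e.g. the CMAC-style encoder sketched earlier)
    model   : cerebellum model holding the synapse weights and the preferences pi_0(a)
    """
    for episode in range(n_rounds):
        env.reset()                               # step (1)/(5): (re)initialize
        s = encoder.encode(*env.read_sensors())   # step (2): PF state from sensors
        while not env.is_done():
            a = model.select_action(s)            # step (3): act on the current policy
            env.execute(a)
            s_next = encoder.encode(*env.read_sensors())
            r_next = env.reward()                 # -1 if moving away from stability
            delta = model.td_update(s, a, s_next, r_next)    # value function update
            model.update_policy_and_weights(delta, s, a)     # step (4): pi_t(a), omega
            s = s_next
```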
The present embodiment first briefly introduces the role of the cerebellum in human motion control and its structure and function, focusing on the mechanisms and functions of climbing fibers, mossy fibers, Purkinje cells and the inferior olivary nucleus. Then the modeling method of the reinforcement-learning cerebellum model and the application of the reinforcement learning mechanism in the cerebellum model are emphasized. Finally, the cerebellum model is applied to the walking stability control of the humanoid robot, and the design of an online stability controller based on the cerebellum model is introduced; the controller takes the humanoid robot's state signal as input and enhances the robot's stability during walking through real-time adjustment of the hip joint.
For the above embodiments, briefly, this embodiment provides an online stably-controlled humanoid robot based on a bionic reinforcement-learning cerebellum model. It includes a device for offline gait planning of the humanoid robot, which outputs a joint motion trajectory that the robot tracks offline to gain walking ability, and a cerebellum model controller responding to the offline gait. The cerebellum model controller comprises a state encoding module, a cerebellum function module, a lower olive feedback module and a motion mapping module: the state encoding module adjusts the activation state of the PF according to the state information collected by the humanoid robot's sensors, the lower olive feedback module modifies the behavior selection probability and the cerebellar neuron storage weights based on the evaluation information fed back by the environment, and the motion mapping module adjusts the robot's action according to the output of the cerebellum function module.
Preferably, the offline gait planning method in the above embodiment is used for offline gait planning of the humanoid robot.
Further, the motion mapping module adjusts the robot's action according to the cerebellum model output in the following way: the output of the cerebellum model is determined by the activation states of the basic units; the activation states of the basic units at time t form a basic unit state vector, and the dot product of this state vector and the command mapping vector is the joint adjustment amount at time t; the joints are the control objects and each output acts directly on a joint; although the motor is the actuating element, it is regarded as the joint-bone-muscle system of the human body, with the forward and reverse rotation of the joint motor corresponding to the contraction and relaxation of skeletal muscle, so the adjustment amount output by the cerebellum model is treated as a nerve electrical signal with a certain amplitude and duration.
Further, the joint adjustment amount is obtained in the following way: when a certain joint is taken as the object of feedback control by the cerebellum model, the command mapping vector is as follows:
where n is the number of basic units; the duration and amplitude of the joint adjustment amount are adjustable:
where σ is the reference value of the joint adjustment amount, ξ is a unit increment, p_i is the amplitude parameter that scales the output command, and t_s indicates the duration of the adjustment process.
Further, the cerebellum model learning method comprises the following steps: (1) initialize the PF-PC plastic synapse weights ω_ij, the weights of the synapses formed by the connections between SC, BC and PC, and the behavior selection probability π_0(a); (2) the state encoding module determines the cerebellum model state s_t from the robot sensor information; (3) select and execute an action a_t based on the current policy, obtain the state s_{t+1} and the immediate return value r_{t+1}, and update the state value function; (4) use the TD error δ_t of the previous action to update the behavior selection probability π_t(a), and modify the relevant synapse weights according to the CF feedback information; (5) judge whether this round of learning has finished; if not, return to step (2) and continue the learning process; otherwise reset the system and start the next round of learning, until the whole learning process is finished.
Further, the state encoding module: receives the state signal of the humanoid robot and projects it onto the PF state in the cerebellum model; the MF receives proprioceptive information input from outside the cerebellum and transmits it to the GC for encoding, one MF forming synaptic connections with several GCs; the CMAC is a neural network based on weight storage whose weights are obtained by table lookup, and its input-output process is divided into two stages: in the first stage, the input information locates a unique group of logical partitions over all dimensions, and each logical partition locates its corresponding weight in the actual memory; in the second stage, the weights found in the first stage are accumulated by weighted summation to obtain the output of the CMAC network;
the cerebellum function module is as follows: it follows the nerve connection structure of the cerebellum and the conduction process of the electrical signals; the PF transmits the encoded signal of the humanoid robot's continuous state information, the input signal is projected onto the relevant regions by the state encoding module, BC and SC are excited in the process, and the parameters s_j and b_j representing the BC and SC state information are set to "1"; the synapses formed by the connections between BC, SC and the PC carry their own weights, and because BC and SC act to inhibit the PC, these weights are given negative values; the membrane potential P_j(t) of the PC is calculated as follows:
where P_j(t) represents the membrane potential of the PC at time t and is binary, with "0" and "1" denoting the "inhibited" and "activated" states respectively; ω_ij(t) is the weight stored in a PF-PC plastic synapse and represents the influence of the synapse state on the PC membrane potential, where i (i = 1, 2, ..., N) indexes the i-th PF and j (j = 1, 2, ..., M) indexes the j-th PC; PF_i(t) represents the activation state of the i-th PF at time t and is determined by the output of the state encoding module;
the state S_j(t) of the j-th PC is determined by equation (4.2), where φ is the threshold potential of the PC;
O(t) = [o_1(t), o_2(t), ..., o_l(t), ..., o_L(t)] is a state vector representing the activation states of the basic units; L is the total number of basic units in the module and is determined by the action space of the cerebellum model, the action space being the set of basic actions the cerebellum model can output; o_l(t) is the state of the l-th basic unit, determined by the following equation:
where μ is a correction factor and M is the number of PCs in the basic unit;
the effect of PF-PC synapse excitability on the PC membrane potential varies with time: ω_ij(t) is adjusted continuously at each time t, and as the excitability decays, its change Δω_ij(t) is:
Δω_ij(t) = K[1 − P_j(t)]·PF_i(t)    (4.4)
wherein K is a weight adjustment coefficient;
the weight at time t+1 then changes to:
ω_ij(t+1) = ω_ij(t) + Δω_ij(t)    (4.5)
the lower olive feedback module works as follows: at time t, the robot state s(t) is obtained through the sensors; the TD algorithm selects and executes a(t) based on the behavior selection probability, obtains the next state s(t+1) and the immediate return value r_{t+1}, and updates the state value function using equation (2.11), where r_{t+1} compares the current stable state of the robot with that at the previous moment: it is "-1" when the robot moves away from the stable state and "0" when it approaches the stable state;
the modeling of the cerebellum is the modeling of the relevant synapses; the CF information transmission process adopts a reinforcement learning mechanism, the behavior selection probability π_t(a) is continually adjusted during the trial-and-error process, and the relevant synapses are altered along the way; after obtaining a learning experience, the lower olive module calculates the TD error δ_t using formula (4.6) and modifies π_t(a) based on formulas (4.7) and (4.8);
δ_t = r_{t+1} + γV(s(t+1)) − V(s(t))    (4.6)
p_{t+1}(s(t), a(t)) = p_t(s(t), a(t)) + ψ·δ_t    (4.7)
where p_t(s(t), a(t)) is the tendency to select behavior a(t) in state s(t) at time t; initially the probability of selecting each behavior is the same, and ψ is a step-size parameter.
Pr{a(t) = a} is the probability that action a is selected at time t, abbreviated π_t(a), and n represents the total number of selectable actions;
according to δ_t, the evaluation information fed back by the CF is obtained:
c_l(t) = g(δ_t)    (4.9)
wherein,
based on c_l(t), the corresponding PF weights are adjusted:
ω_ij(t) = ω_ij(t) − σ·c_l(t)·PF_i(t)    (4.11)
wherein σ is a positive constant;
the motion mapping module: mapping the output of the basic unit in the cerebellum model to a controlled object to execute a related control command; an APG method is introduced into the cerebellum model, each APG can generate a motion instruction, and the motion instructions are in one-to-one correspondence with the basic unit output; in the dynamic operation process of the cerebellum model, the calculation mode of generating corresponding action at each time t is as follows:
A(t)=D·O(t) (4.12)
where D is a set vector of motion instructions, also called the command mapping vector, and A(t) is the final output instruction of the cerebellum model at time t;
the output of the cerebellum model is determined by the activation states of the basic units; the activation states of the basic units at time t form a basic unit state vector, and the dot product of this state vector and the command mapping vector is the joint adjustment amount at time t; the joints are the control objects and each output acts directly on a joint; although the motor is the actuating element, it is regarded as the joint-bone-muscle system of the human body, with the forward and reverse rotation of the joint motor corresponding to the contraction and relaxation of skeletal muscle, so the adjustment amount output by the cerebellum model is treated as a nerve electrical signal with a certain amplitude and duration;
when a certain joint is taken as an object for feedback control of a cerebellum model, the command mapping vector is as follows:
where σ is the reference value of the joint adjustment amount, ξ is a unit increment, setting p_i adjusts the amplitude of the output command, and t_s controls the duration of the adjustment process.
Experiment and result analysis
1. Robot platform and simulation experiment environment
ROBOTIS-OP2 humanoid robot: the ROBOTIS-OP2 is a small humanoid robot platform developed by the Korean company ROBOTIS. It has 20 degrees of freedom and can complete tasks such as kicking, walking, image recognition and executing actions according to programs written by the developer. The humanoid robot system mainly consists of a main controller, a sub-controller, a sensor group, 20 servo actuators, a camera, a power supply module and so on. The main controller is a microcomputer whose hardware includes an Intel CPU, 4 GB of memory and a 32 GB hard disk, running a Linux operating system. The sub-controller connects the main controller with the lower-layer hardware: the main controller communicates only with the sub-controller, and the sub-controller performs read or write operations on the lower-layer hardware (servo actuators, accelerometer, gyroscope, LEDs, camera, microphone, etc.) according to the main controller's instructions.
Robot simulation software Webots: in the invention the control algorithm must be learned through interaction with the environment, and using a real robot could damage it. Webots is powerful open-source robot simulation software with which researchers can design robots, build scenes and develop robot control programs, as shown in fig. 18. The training and learning process of the humanoid robot is therefore carried out in the Webots simulation environment. Webots provides rich programming interfaces and nodes, so the required robot and environment can be built conveniently; common nodes include various actuators such as linear and rotary motors, and various sensors such as accelerometers, GPS, gyroscopes and cameras. All nodes provide a uniform Application Programming Interface (API), through which the user can perform control and read or write data. The open-source physics engine ODE (Open Dynamics Engine) is integrated into Webots; it can simulate the physical characteristics of a specified object in the virtual environment and reproduce in real time common physical phenomena such as gravity, collision and friction.
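For reference, a minimal Webots controller sketch in Python is shown below; the device names ("Accelerometer", "Gyro", "HipPitchR") are illustrative assumptions that depend on the concrete robot model, and the controller in the invention is in fact developed mainly in C++.

```python
from controller import Robot   # Webots controller API

robot = Robot()
timestep = int(robot.getBasicTimeStep())

accel = robot.getDevice("Accelerometer")   # device names are illustrative
gyro = robot.getDevice("Gyro")
hip_pitch = robot.getDevice("HipPitchR")
accel.enable(timestep)
gyro.enable(timestep)

while robot.step(timestep) != -1:
    ax, ay, az = accel.getValues()
    wx, wy, wz = gyro.getValues()
    # ... estimate the torso attitude angle and compute a joint adjustment here ...
    hip_pitch.setPosition(0.0)             # write the adjusted joint target
```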
Construction of the robot joint simulation system: the ROBOTIS-OP2 humanoid robot platform used in the invention provides a complete control framework written in C++, which exposes a series of programming interfaces from the bottom layer upward, as shown in fig. 19. With this set of interfaces, researchers can quickly and conveniently develop new functions according to their own requirements. To avoid duplicated work and to ease the later porting of programs between Webots and the real robot, C++ remains the main programming language for developing the control program; offline walking of the OP2 on an ideal horizontal surface is implemented, and a control interface is reserved in preparation for the later connection of the online stability controller. Besides the main part written in C++, the complete control program also contains the online stability control part, whose development relies on the Python and MATLAB languages; these two provide, respectively, the packages essential for reinforcement learning and powerful matrix operations. In addition, the excellent data processing and plotting capabilities of Python and MATLAB can be used for the subsequent processing and visualization of the simulation data. For these reasons a cross-platform, cross-language joint simulation system is built: researchers can program it jointly in C++, Python and MATLAB, and the different languages together form a complete robot controller through real-time communication, data transmission and mutual calling. Finally a "Controller" executable file is generated and imported into Webots to control the robot's actions. The development of the joint simulation system has two key points: efficient data transmission and real-time communication between the different programming languages, and ensuring that the complete controller controls the simulated robot in Webots with every functional module running without abnormality. The overall framework of the robot joint simulation system centered on Webots is shown in fig. 20. The simulation result processing subsystem accesses the control model generated by training, processes the data generated in the simulation experiments and performs visualization and other operations; the robot control subsystem is the main part of the joint simulation system, in which the cerebellum-model online control module written in MATLAB and the offline walking module written in C++ form the cerebellum-model-based Controller, and similarly the reinforcement-learning online control module written in Python and the offline walking module jointly generate the reinforcement-learning-based Controller through the C++/Python communication module; the OP2 in the Webots simulation software loads a Controller to execute the corresponding experimental task.
The development process of the robot control program based on the joint simulation system is described in three parts below.
(1) Robot Controller ("Controller"): each simulated robot in Webots has its own controller, called a "Controller"; here the "Controller" is the executable file generated after the original control program is compiled, and it is executed by importing it into Webots and loading it onto the corresponding robot. The Webots simulation platform provides a simple text editor, but it is inconvenient for writing and debugging and can hardly cope with subsequent complex development requirements. Therefore Visual Studio (an integrated development environment, IDE) is used to build the main part of the robot controller; the main part is developed chiefly against Webots' native C++ API, and communication and data-transmission interfaces are designed to fuse the Python and MATLAB languages into one "Controller".
(2) C++ and Python based controller development: since the development of the "Controller" involves three languages, their mutual calling and communication are the key to building the joint simulation system. Python is one of the most popular programming languages in the world; with the development of artificial intelligence it is favored by developers for its good readability and its more convenient, simplified syntax compared with other languages. With the rapid expansion and optimization of third-party open-source libraries and the emergence of libraries such as TensorFlow, PyTorch and NumPy, Python has become one of the most important programming languages for development and research in artificial intelligence. The deep reinforcement learning algorithm used in the invention is implemented with the help of third-party libraries in Python. Applying reinforcement learning to the gait control of the humanoid robot requires the robot's controller in Webots to be developed jointly in C++ and Python, so the corresponding functional modules are developed separately in each language. For the controller to run in Webots without errors, the compatibility requirements on the whole development environment are high; after debugging and experiments, the details of the Python environment were finally determined as shown in table 5.1.
TABLE 5.1 Details of the C++ and Python based joint development environment
The C++ and Python based controller has high compatibility requirements for running successfully in Webots, so functions such as bidirectional information transfer between the two modules, data processing and visualization must be tested before the robot simulation experiments. After the tests pass, the "Controller" executable file is generated, then associated and invoked in Webots.
(3) Controller development based on C++ and MATLAB: MATLAB is an abbreviation of MATrix LABoratory, a commercial mathematical software product of the American company The MathWorks with an interactively designed graphical user interface. The basic data elements of MATLAB do not need to declare their dimensions and data types explicitly, which helps developers solve many complex computational problems, especially operations on matrices and vectors, in far less time than with scalar, non-interactive languages such as C or Fortran. MATLAB is also a high-performance language for engineering and mathematical computing that integrates computation, visualization and programming. It is very powerful and covers many areas, including mathematical calculation and algorithm development, modeling, simulation and prototyping, scientific and engineering graphics, and application development (including graphical user interface construction). However, programs developed on the MATLAB platform cannot run outside the MATLAB environment, so their flexibility is limited when dealing with some practical application problems. The C/C++ language is widely used in practical engineering, but it is inconvenient for data visualization, makes testing and result analysis difficult, and lacks strong vector and matrix computing capabilities. Therefore, combining the two so that each contributes its strengths provides stronger technical support for scientific research and engineering development.
There are many ways to develop jointly with C++ and MATLAB, mainly the following three. ① MATLAB Coder: MATLAB Coder is the official code generation tool, which can generate C and C++ code from MATLAB code. It supports most of the MATLAB language and various toolboxes; the generated code can be used as source code or integrated into a project as a static or dynamic library, and can be ported conveniently to different hardware platforms. The disadvantage of MATLAB Coder is that it can only generate code for functions, not for standalone scripts. ② MATLAB Compiler SDK: the MATLAB Compiler SDK supports packaging MATLAB programs into class libraries for C/C++, .NET, Java or Python, so that MATLAB programs can be invoked from other programming languages; in addition, the MATLAB Compiler encapsulates the MATLAB code and thus protects its security. The MATLAB Compiler SDK has high requirements on the match between the programming language and the development platform version, and compilation or invocation failures occur easily. ③ MATLAB Engine: the MATLAB Engine API is a set of interfaces used by C++ programs, which can interact with MATLAB synchronously or asynchronously. The supported operations include: starting MATLAB; connecting to a MATLAB shared session on the local computer; directly executing MATLAB statements and scripts; and passing variables from C++ to MATLAB and from MATLAB to C++. Through the C++/MATLAB interactive window, the MATLAB Engine allows the running state of the program to be observed in real time, debugging is convenient, and the user can intervene at any time during program execution. Based on these advantages, the MATLAB Engine approach is chosen to realize the joint development of C++ and MATLAB. In the actual development process, C++, Python and MATLAB may all be used at the same time for developing the Webots robot controller, which is a combination of the contents described in (2) and (3) and is therefore not described again.
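The invention calls the MATLAB Engine from C++; purely as a compact analogue of the same call pattern (start a session, execute statements, pass variables both ways), the sketch below uses MathWorks' MATLAB Engine API for Python, assuming the matlab.engine package matching the installed MATLAB version is available.

```python
import matlab.engine

# start a MATLAB session (or connect to a shared one with matlab.engine.connect_matlab())
eng = matlab.engine.start_matlab()

# pass a variable to MATLAB, run a statement there, and read the result back
eng.workspace["theta"] = 5.0
eng.eval("adjust = sind(theta) * 0.1;", nargout=0)
adjust = eng.workspace["adjust"]
print(adjust)

eng.quit()
```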
2. Experimental Environment
Stability training platform: in order to train the cerebellum model, a stability training platform is built in the Webots simulation software, as shown in fig. 21. The top surface of the platform is a square flat plate whose shape can be adjusted to any rectangle as needed; its four vertices rest on four telescopic columns driven by linear motors (the columns are not visible). To allow the plane to change smoothly within the required range, an undamped spherical joint is mounted at each of the four corners of the square plate. The linear motors can each provide an output of adjustable amplitude and frequency to drive the platform to produce the required fluctuations, imitating disturbances in the environment. Compared with training in an ordinary environment, training on the platform enriches the disturbances acting on the robot; it can serve as pre-training before training in the real environment, or even replace training in the actual environment entirely. In other words, training on the platform explores the robot's state space more effectively and comprehensively, making its balance control ability more complete. After the platform is built in Webots it is set up as a robot node, so a "Controller" program can be written to control it. Treating the four columns as robot joints, controlling their motion produces disturbances that imitate those of the environment. Fig. 22 shows the node tree of the stability training platform; the node consists of five sub-nodes representing the platform top and the four driving columns. Since the four driving columns have the same structure, only the right front column is expanded in fig. 22. Taking the right front driving column as an example, the nodes it contains are listed in table 5.2. The mechanical structures built in Webots are essentially similar to the real world and follow the connection pattern of rigid body (link) - joint (effector) - rigid body (link). A "Transform"-type node can be regarded as a point fixed on a rigid body; "FR_Transform" is such a node, pointing to the right front vertex of the square platform, where the spherical joint "FR_BallJoint" is attached. The other end of the spherical joint connects to the slider joint "FR_SliderJoint", whose endpoint is the rigid body "FR_SliderJoint_Endpoint_Solid", the base of the driving column; the mass of this rigid body is set to the maximum value, so it can be regarded as fixed to the ground. The effector installed in the slider joint is a linear motor "FR_Linear_Motor" that provides the driving force required for the platform motion.
TABLE 5.2 node construction of the drive column (Right front side)
The platform can provide an output simulating external disturbances; by controlling the linear motors in the driving columns, the influence of road inclination, unevenness and the like on the robot can be simulated. The invention mainly addresses stable walking of the humanoid robot on roads whose gradient changes continuously, so the fluctuation of the road is simulated by tilting the training platform back and forth, as shown in fig. 23. The clockwise direction is defined as positive, and the angle between the table top and the horizontal plane is θ. The inclination angle θ(t) changes periodically with time t, where l_p is the width of the platform and l_1(t) and l_2(t) are the heights of the front and rear edges of the table top above the horizontal plane. When the platform is used to simulate a slope environment, two sinusoidal signals are given as inputs to the two front linear motors and the two rear linear motors respectively. The two sinusoidal signals have a phase difference, so the table top is tilted relative to the horizontal plane and performs the forward-and-backward tilting motion. Equations (5.1) to (5.3) show how the sinusoidal inputs change with time; the constants in them adjust the amplitude of the platform inclination and the speed of the change (a small code sketch of this drive is given after equation (5.3)). The robot standing on the platform then also tilts forward and backward, its body deviates from the stable position, and since it is not fixed to the table top it is in danger of falling. In this way, by simulating a slope environment with changing gradient, the robot's ability to maintain balance in an environment with continuously changing slopes is trained.
θ(t) = arctan((l_1(t) − l_2(t)) / l_p)    (5.3)
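The sinusoidal drive of the platform can be sketched as follows; since equations (5.1) and (5.2) are not reproduced above, the base height, amplitude, frequency and phase difference used here are illustrative assumptions, and only equation (5.3) follows the text.

```python
import math

def platform_tilt(t, l0=0.20, A=0.03, omega=0.5, phase=math.pi, lp=0.40):
    """Front/rear column heights and the resulting table inclination (illustrative)."""
    l1 = l0 + A * math.sin(omega * t)           # front columns, assumed form of eq. (5.1)
    l2 = l0 + A * math.sin(omega * t + phase)   # rear columns, assumed form of eq. (5.2)
    theta = math.atan((l1 - l2) / lp)           # eq. (5.3)
    return l1, l2, theta

# usage: sample the tilt angle over roughly one period
for t in range(0, 13):
    _, _, th = platform_tilt(t)
    print(f"t={t:2d}s  theta={math.degrees(th):6.2f} deg")
```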
Slope environment: to verify the robot's ability to walk stably in a slope environment, the slope environment required for the experiment was drawn using the Wings3D software. Wings3D is free, open-source three-dimensional modeling software, as shown in fig. 24; it can create low-precision polygon models and perform texture rendering. There are mainly four kinds of elements that can be selected in Wings3D; the model can be modified by changing these elements, and selecting different elements and changing their parameters have different effects, which makes the tool very flexible to use. Meanwhile, Wings3D is compatible with many materials and textures and supports UV mapping from a two-dimensional image onto the surface of a three-dimensional model. After the slope environment is drawn in Wings3D, it is exported as a VRML (Virtual Reality Modeling Language) file, which is then imported into the Webots simulation software. As shown in fig. 25, the designed slope environment is a strip-shaped road surface with continuously changing gradient: the inclination angle at the leftmost end is 0°, the angle increases toward the right, reaches its maximum at the middle position, and then decreases back to 0°. This road is used to test the robot's ability to walk stably when the road surface is inclined and the gradient changes continuously.
3. Experiment and results
Online walking stability control experiment based on reinforcement learning: in the invention, the offline gait planning and the reinforcement-learning-based online stability controller are compiled into the humanoid robot control program to test the effectiveness of the algorithm; the compilation of the control program and all experiments are completed in the independently built joint simulation system. When walking in the slope environment, the humanoid robot gradually deviates from the stable state as the gradient changes; once the center of mass and the ZMP leave the foot support region the robot can no longer move forward and falls. A slope environment with continuously changing gradient is built in the Webots simulation environment; a schematic diagram is shown in fig. 26. The robot starts walking at the leftmost end of the slope, where the gradient is 0°; the gradient then increases gradually as the road extends, the inclination is largest at the middle position, and it then decreases gradually toward the right until the road returns to a horizontal surface with a gradient of 0°. The slope environment is used to simulate non-level road surfaces with continuously changing gradients in the real world. The humanoid robot is placed in the slope environment for learning and training; the parameter settings of the reinforcement learning algorithm are given in table 5.3, and equation (3.32) is used as the immediate return function. Fig. 27 visualizes the training effect: the abscissa is the number of training rounds, each round ending when the humanoid robot walks the whole course stably or when it falls, and the ordinate is the total number of steps the humanoid robot walks in a round. As can be seen from fig. 27, the number of steps the humanoid robot walks stably increases with the number of training rounds. Observing how the step count changes over the learning process, the humanoid robot walks only a few steps in rounds 1-100; after 100 rounds it completes the whole course more and more often, with occasional low-step rounds in between because the Actor has not yet fully converged to the optimal policy and the robot still has some probability of falling; by about round 350 the reinforcement learning algorithm has fully converged, and the humanoid robot walks the whole course smoothly in every round.
TABLE 5.3 reinforcement learning Algorithm parameter set
Fig. 28 shows the trunk inclination angle in the pitch-axis direction while the humanoid robot walks in the slope environment. To observe the change of the trunk inclination angle more clearly, the collected data were smoothed with an unweighted moving average with a window size of 15 (a small code sketch of this smoothing follows this paragraph). The lower curve in fig. 28 shows the change of the trunk inclination angle without online adjustment by the reinforcement learning stability controller; an angle of 0 indicates an upright trunk, and a negative angle indicates that the humanoid robot leans backward. It is evident that the humanoid robot that only follows the offline joint trajectory gradually deviates from the stable state as the gradient increases, and finally falls near the 3550th sampling point. The upper curve shows the case with online adjustment by the reinforcement learning stability controller: the humanoid robot tracks the offline pre-planned gait while adjusting the rotation angles of the six pitch-axis joints of the lower limbs online to cope with the change in road gradient. The trunk inclination angle in the pitch-axis direction fluctuates around the 0 axis throughout the walk in the slope environment, which shows that the humanoid robot maintains a stable body posture and keeps its trunk upright throughout walking, demonstrating the effectiveness of the reinforcement learning method.
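The smoothing mentioned above, an unweighted moving average with window size 15, can be reproduced in a few lines; this is a generic sketch rather than the processing script actually used.

```python
import numpy as np

def moving_average(x, window=15):
    """Unweighted moving average used to smooth the sampled trunk inclination angle."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# usage sketch on dummy data standing in for the sampled torso angle
angles = np.random.randn(3600) * 2.0
smoothed = moving_average(angles, window=15)
```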
Figs. 29 and 30 show the real-time changes of the robot's lower-limb joint angles when the gradient increases and decreases, respectively, during slope walking. It can be seen that the trajectory of each joint changes continuously with the gradient; this is the reinforcement-learning online stability controller adjusting the lower-limb joints to cope with increases or decreases of the slope inclination. Meanwhile, the joint angle curves are smooth, with no abrupt changes or steps, indicating that the transitions between adjustment actions are smooth, which further ensures the robot's walking stability. The simulation results show that, with the humanoid robot as the Agent and online reinforcement-learning stability control combined with offline gait planning, the humanoid robot gains a certain adaptability to the environment through autonomous learning.
Fig. 31 shows key frames of the humanoid robot walking in the slope environment, from which the effect of reinforcement-learning online stability control can be seen more intuitively. The humanoid robot in fig. 31 (a) falls backward halfway, whereas the humanoid robot in fig. 31 (b) steadily completes the whole course. It can be seen that the robot's trunk stays upright no matter how the ground inclination changes, indicating good walking stability.
On the basis of the reinforcement learning algorithm, a continuous immediate return function was redesigned as in formula (3.33), changing the reward-and-penalty scheme of the control algorithm. Previously, a fixed reward was obtained as long as the trunk inclination angle stayed within the stable interval (forward and backward inclination smaller than 5 degrees). With the new scheme, the closer the trunk is to the ideal stable position (trunk perpendicular to the horizontal plane, inclination angle of 0 degrees), the higher the reward, so the reward is negatively correlated with the inclination angle, and a fixed penalty is given once the stability threshold (plus or minus 5 degrees) is exceeded. The learning process using this immediate return function is shown in fig. 32; compared with fig. 27 the algorithm converges faster, and after 150 rounds the robot walks the whole course smoothly every time without falling. Apart from speeding up the Agent's training so that it converges faster, the controllers finally trained with the two immediate return functions show no difference in performance in the slope environment.
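The two reward schemes can be summarized as below; formulas (3.32) and (3.33) are not reproduced in this section, so the scale factor is an assumption, and only the structure (a fixed reward inside the ±5 degree interval versus a reward that decreases with the inclination and a fixed penalty beyond ±5 degrees) follows the text.

```python
def discrete_reward(tilt_deg):
    """Assumed form of eq. (3.32): fixed reward inside the stable interval."""
    return 1.0 if abs(tilt_deg) < 5.0 else -1.0

def continuous_reward(tilt_deg, k=0.2):
    """Assumed form of eq. (3.33): reward negatively correlated with the tilt,
    fixed penalty once the stability threshold (+/- 5 degrees) is exceeded."""
    if abs(tilt_deg) >= 5.0:
        return -1.0
    return 1.0 - k * abs(tilt_deg)
```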
Second, the online walking stability control experiment with the reinforcement-learning cerebellum model: the cerebellum model incorporating reinforcement learning theory and the online walking stability controller based on it have been presented in detail above; here the invention verifies the effectiveness of the algorithm by simulation. The weight of the humanoid robot is concentrated mainly in the upper body, since the main controller, sub-controller, sensor elements, battery and so on are all placed in the chest, so the position of the trunk during walking is very important. Excessive trunk displacement causes the robot to fall because the center of mass and the ZMP point leave the support plane. The hip joint is closest to the upper body of the humanoid robot, and changes in its rotation angle directly affect the position of the trunk; therefore the pitch-axis hip joint is selected as the adjustment object of the cerebellum model in this experiment, and compensating the pitch-axis hip joint in real time during walking copes with the slope environment of changing gradient. In order to train the cerebellum model sufficiently, the humanoid robot is first placed on the stability training platform, which is used to simulate the gradient changes of the slope environment, and then placed in the slope environment for walking verification.
(1) Platform training: based on the description of the stability training platform, sinusoidal signals with a phase difference are input to the front and rear pairs of driving columns of the training platform, and the forward-backward tilting of the table top is used to simulate a slope environment with fluctuating gradient. The table angle θ changes with time as shown in fig. 33, and when the table tilts back and forth the state of the robot standing on it changes. To maintain stability, the cerebellum model takes the trunk attitude angle of the robot, solved from the accelerometer and gyroscope data, as the state input s_t and adjusts the robot's hip joint in real time, so that the state of the humanoid robot is always kept within the stable interval. When the cerebellum model is initialized, to ensure that all behaviors are tried sufficiently during training, the optimistic initial value method is adopted and the state values are set to 1, so that at the start the selection probabilities of all behaviors are the same. During training, the cerebellum model tentatively activates each basic unit based on the state input and then adjusts the weights of the relevant regions and the corresponding behavior selection probabilities according to the evaluation information fed back by the CF, so as to find the optimal strategy. In practice the training process is regarded as the last step of modeling the cerebellum model; modeling is complete once the cerebellum model has acquired the ability to keep the robot stable. In addition, the training process has no preset termination condition: the whole training environment is reset only when the robot falls, the results obtained before the reset are retained, and new training continues on that basis; training is ended by external manual interruption, after which the model is post-processed. Fig. 35 shows the result of one training run; after training, the cerebellum model can keep the trunk attitude angle of the robot within the stable range at all times. Figs. 35 (a) and 35 (b) show how the robot moves with the platform before and after training respectively: before training the humanoid robot swings back and forth with the motion of the platform, while after training it keeps its body upright by adjusting the hip joint. When the humanoid robot tilts back and forth under the influence of the platform motion, the accelerometer values change, and the value on the Y axis (front-back direction) changes over the largest range; fig. 34 shows the change of the robot's front-back acceleration during platform motion before and after training. Before training, the humanoid robot swings back and forth with the platform, as shown by the solid line, and the front-back acceleration fluctuates periodically; at this point the robot is entirely at the mercy of the platform motion. After training, the amplitude of the change in the acceleration data is reduced, as shown by the broken line in fig. 34.
Fig. 36 illustrates the output of each basic unit over a period of time while the trained cerebellum model is executing. It can be seen from fig. 36 that the activations concentrate on the 5th and 6th basic units, whereas the output of the cerebellum model in the initial state is random; this indicates that after training the cerebellum model has learned a behavior pattern that keeps the robot in an upright, stable state. Fig. 37 shows how the CF state corresponding to each basic unit of the cerebellum model changes with the number of learning steps during training. It can be seen that the CFs of the 5th and 6th basic units stay activated after a certain amount of learning while most of the remaining basic units change state frequently, which is consistent with fig. 36 and shows that the cerebellum model begins to acquire a definite behavior pattern after a certain amount of learning.
(2) Walking on a slope: the trained cerebellum model is used to execute the task of the humanoid robot walking in the slope environment. The slope environment is as shown in figure 37: low in front and high at the back, with inclination angles ranging from 0° to 7° and a maximum of 7° at the middle position. Fig. 38 shows key frames of the humanoid robot walking in the slope environment. Fig. 38 (a) shows that, without online stability control by the cerebellum model, the humanoid robot loses stability by tilting backward and falls midway, so the slope walk fails. The slope-walking experiment with online adjustment by the cerebellum model is shown in fig. 38 (b): when the road surface tilts, the cerebellum model stabilizes the humanoid robot online, and combined with the offline pre-planned gait the robot keeps walking in the slope environment of changing gradient. Fig. 39 plots the pitch-axis trunk inclination angle of the humanoid robot against time during the slope walk, with positive angles denoting forward lean and negative angles backward lean. In the upper curve, with the cerebellum model, the trunk inclination stays within a limited range and does not change violently with the slope; the lower curve shows the robot gradually tilting until it falls.
In the present invention, the hardware configuration, software control system and capabilities of the ROBOTIS-OP2 humanoid robot platform are briefly introduced. The joint simulation system is cross-platform and cross-language: Python, C++ and MATLAB can be used simultaneously to develop the robot controller, each making up for the others' shortcomings; development is carried out in each language's own IDE, and the parts are finally combined into a complete robot controller, which makes program editing, debugging and data acquisition convenient. Finally, for the problem that the robot has difficulty walking stably on road surfaces with continuously changing gradient, simulation experiments are carried out with the reinforcement learning stability controller and the cerebellum model stability controller respectively; the experimental results demonstrate the effectiveness of both methods, enabling the robot to walk stably in the slope environment.
The invention takes a humanoid robot as the research object. Aiming at the difficulty the humanoid robot has in walking stably on uneven road surfaces, it combines offline pre-planned gait with online stability adjustment and designs two online stability controllers: an AC (Actor-Critic) reinforcement learning stability controller and a TD reinforcement learning cerebellar model stability controller. Both methods acquire the ability to keep the robot balanced through learning and can further be applied to slope environments. The main research results of the invention are as follows:
(1) A joint simulation system based on Webots simulation software is built. The system is cross-platform and cross-language, realizes joint development in the three languages Python, C++ and MATLAB, and jointly generates the robot control program, solving the inconvenience of single-language development caused by increasingly complex and diversified control algorithms.
(2) On the basis of offline gait planning, a walking stability control method based on the AC reinforcement learning algorithm is constructed according to the walking characteristics of the humanoid robot. The method takes a slope environment with continuously changing gradient as the target environment, and each element of reinforcement learning is designed accordingly. During walking, the lower-limb joints are compensated and controlled on the basis of sensor information, so that the humanoid robot can adjust its posture in real time. In addition, the impact of continuous and discrete immediate reward functions on the convergence speed during training is discussed. Finally, slope-walking experiments are performed in the simulation environment, and the experimental results verify the effectiveness of the method.
(3) Based on the motion balance function of the cerebellum, a cerebellum-like motion control model is built from a bionics perspective by combining reinforcement learning theory. An adaptive online stability controller is designed on this novel cerebellar model; the controller does not require accurate modeling of the robot system and compensates and controls the lower-limb joints by collecting the robot's real-time posture information. In order to train the controller sufficiently, a stability training platform is built in the simulation environment to emulate disturbances in the environment. During training, the cerebellar model activates the corresponding basic unit to output actions based on the encoded state information, and updates the PF-PC weights through the evaluation information delivered by the lower olive feedback module until the controller converges. Simulation experiments show that, after reinforcement learning training, the cerebellar model is able to maintain the robot's balance. Compared with a robot without online adjustment by the cerebellar model, the robot with the cerebellar model performs considerably better in the slope environment.
The ROBOTIS-OP2 small humanoid robot is used as the research platform to study methods for stable walking in complex environments. First, an offline walking pattern of the humanoid robot is determined based on ZMP theory and cubic spline interpolation; a neural network is used to fit the policy, and online stability control under continuous state input is realized based on the Actor-Critic algorithm. A robot bionic control model based on the cerebellar mechanism and reinforcement learning is then further established using this control method. In order to improve development efficiency and provide an experimental verification platform for the algorithms, a cross-platform, cross-language joint simulation system is developed.
The invention designs an offline gait planning method on the basis of analyzing the main parameters and structural performance of the ROBOTIS-OP2. Aiming at the problem that the offline gait cannot adapt to a complex environment, an online stability control method based on the Actor-Critic reinforcement learning algorithm is proposed. By directly receiving continuous state information, the method realizes end-to-end control from input signals to output joint-angle compensation values. On this basis, a cubic spline interpolation method is used to construct an action attenuation strategy, realizing smooth transitions between actions. In order to improve the convergence speed, a continuous immediate return value function is designed, which improves learning efficiency.
On the basis of reinforcement learning, the invention further studies the motion control mechanism of the cerebellum and designs an online stability controller based on a bionic reinforcement learning type cerebellar model. The cerebellar model follows the physiological anatomy of the cerebellum, and a reinforcement learning mechanism is introduced into the lower olive feedback link; through learning, the state information collected while the humanoid robot walks can be mapped to corresponding command outputs, and the walking posture of the robot is adjusted in real time. The overall framework and details of the cerebellar model and of the humanoid robot walking stability controller designed on it are described in detail. Considering that training the humanoid robot requires a large amount of data obtained by interacting with the environment, the invention adopts a training mode in the simulation environment. A cross-platform, cross-language development environment is established based on Webots simulation software, laying the foundation for development and experimental verification of the algorithms. By separating program editing, compiling and running, the overall framework of the environment and each of its sub-modules are designed in depth, including the building process, the tools used, the important application program interfaces and the key details requiring attention. Finally, humanoid robot walking simulation experiments are carried out in this environment, the effectiveness of the two online stability control methods is verified, and the experimental results are analyzed and discussed.
As a brief description of the method of the above embodiments for controlling the online walking stability of the humanoid robot based on the Actor-Critic reinforcement learning algorithm: a humanoid robot walking control method performs offline gait planning for the humanoid robot, so that by tracking the joint motion trajectories generated offline the robot has walking ability; the state information of the humanoid robot during walking is collected in real time, and a stability controller responding to this state information adjusts the walking posture of the humanoid robot online in real time, so that the humanoid robot can walk on an uneven road surface.
Further, the offline gait planning for the humanoid robot proceeds in the following steps: acquire a multi-link model of the humanoid robot; determine the ZMP movement trajectory within a gait sub-period through ZMP stability theory; plan the ankle-joint and hip-joint trajectories of the robot from the multi-link model and the ZMP trajectory; obtain the lower-limb joint trajectories from the ZMP trajectory and the ankle and hip trajectories through the kinematic relationships of the joints; describe the joint rotation angles with a cubic spline method to improve the ZMP stability margin and reduce the ankle trajectory tracking error; and optimize the lower-limb joint trajectories with an ant colony algorithm to obtain the complete offline gait.
Further, the state information of the humanoid robot while walking on the slope road surface is used as the input of the stability controller; the input of the stability controller at time t is:
s(t) = [f(t), θ_acc_x(t), θ_acc_y(t), θ_acc_z(t)]   (3.28)
where θ_acc_x(t), θ_acc_y(t) and θ_acc_z(t) are the numerical signals acquired by the acceleration sensor in the three directions, and f(t) is a variable indicating the supporting foot of the humanoid robot, 1 for the left foot and −1 for the right foot; at time t, the corresponding action output is obtained from the state information provided by the humanoid robot:
where the components of a(t) are, respectively, the rotation-angle adjustment amounts, relative to the offline gait, of the pitch-axis ankle joint, pitch-axis knee joint and pitch-axis hip joint of the right foot, and the rotation-angle adjustment amounts, relative to the offline gait, of the pitch-axis ankle joint, pitch-axis knee joint and pitch-axis hip joint of the left foot; the action a(t) is processed as follows:
where α_real_time(t) is the real-time output angle of the joint, α_offline(t) is the joint angle of the offline trajectory, the attenuation coefficient is used to adjust the duration of the joint adjustment, and S(t) is a cubic spline interpolation function under natural boundary conditions, determined by the following constraints:
where Δ θ is the adjustment amount of the corresponding joint.
Further, the learning method of the stability controller is as follows: (1) input the state s(t) of the humanoid robot at the current moment into the Actor neural network and the Critic neural network, and the Actor network provides the action output a(t); (2) while executing the offline gait, the humanoid robot adjusts the rotation angles of the pitch-axis ankle, knee and hip joints of the left and right feet according to the action output a(t) and reaches the new state s(t+1); the reward/punishment r(t) of the action output a(t) is calculated, yielding the experience data of this state change: {s(t), a(t), s(t+1), r(t)}; (3) the TD error is calculated from the experience data; (4) the parameters of the Actor network and the Critic network are updated respectively according to the TD error.
Further, the reward/punishment r(t) of the action output a(t) is calculated using either a continuous or a discrete immediate return function, where θ_pitch is the inclination angle of the robot in the pitch-axis direction and η is a return value amplification coefficient.
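For orientation only, the short sketch below shows one continuous and one discrete immediate-return form built from θ_pitch and η. The exact expressions belong to the formulas referenced above, so both bodies here (a proportional penalty and a threshold penalty) are assumptions for illustration.

```python
import math

def reward_continuous(theta_pitch, eta=10.0):
    # assumed continuous form: larger trunk pitch deviation -> more negative return
    return -eta * abs(theta_pitch)

def reward_discrete(theta_pitch, fall_threshold=math.radians(30)):
    # assumed discrete form: flat penalty once the trunk pitch exceeds a threshold
    return -1.0 if abs(theta_pitch) > fall_threshold else 0.0
```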
Further, the TD error is calculated:
δ = r + V(s(t+1), θ_c) − V(s(t), θ_c)   (3.24)
where θ_c denotes the parameters of the Critic network and V(·) denotes the value function.
Further, updating the Critic network:
θ_c = θ_c + α_c·δ·∇V(s(t), θ_c)   (3.25)
where θ_c denotes the parameters of the Critic network, δ is the TD error, and α_c is the update step size of the Critic neural network; the update moves the parameters in the direction that minimizes the loss function;
loss function L (θ)c) Comprises the following steps:
L(θc)=(r+V(s(t+1),θc)-V(s(t),θc))2 (3.26)
updating the Actor network:
θ_a = θ_a + α_a·δ·∇ln π(a|s(t), θ_a)   (3.27)
where θ_a denotes the parameters of the Actor network, δ is the TD error, α_a is the update step size of the Actor network, and π(a|s(t), θ_a) is the policy represented by the Actor network, a being the action output of the Actor network.
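The update equations (3.24)-(3.27) can be exercised with the minimal Actor-Critic sketch below, written in PyTorch. The network sizes, the Gaussian policy head, the learning rates and the discount handling are illustrative assumptions, not the claimed implementation.

```python
import torch

state_dim, action_dim = 4, 6          # s(t) of eq. (3.28); six joint-angle adjustments
critic = torch.nn.Sequential(torch.nn.Linear(state_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 1))
actor = torch.nn.Sequential(torch.nn.Linear(state_dim, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, action_dim))
opt_c = torch.optim.SGD(critic.parameters(), lr=1e-3)   # alpha_c
opt_a = torch.optim.SGD(actor.parameters(), lr=1e-4)    # alpha_a

def ac_update(s, a, r, s_next, sigma=0.05, gamma=1.0):
    s = torch.as_tensor(s, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    # TD error, eq. (3.24): delta = r + V(s(t+1)) - V(s(t))
    delta = (r + gamma * critic(s_next) - critic(s)).detach().squeeze()
    # Critic, eqs. (3.25)-(3.26): gradient step on the squared TD error
    loss_c = (r + gamma * critic(s_next).detach() - critic(s)).pow(2).mean()
    opt_c.zero_grad(); loss_c.backward(); opt_c.step()
    # Actor, eq. (3.27): policy-gradient step scaled by the TD error
    dist = torch.distributions.Normal(actor(s), sigma)
    loss_a = -(delta * dist.log_prob(a).sum())
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    return float(delta)

# one update from an experience tuple {s(t), a(t), s(t+1), r(t)}
ac_update(s=[1.0, 0.02, -0.01, 0.0], a=[0.0] * 6, r=-0.1,
          s_next=[-1.0, 0.03, -0.01, 0.0])
```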
Further, the ZMP movement trajectory within a gait sub-period is determined through ZMP stability theory; the ZMP trajectory in one gait sub-period is expressed by the following formula:
where X_zmp(t), Y_zmp(t) and Z_zmp(t) are the coordinates of the ZMP in the world coordinate system; s is half the length of the plantar stabilization zone in the X direction, used to adjust the range of motion of the ZMP during walking; σ is the proportion of the overall duration occupied by the double support phase (DSP) in one walking sub-period; and q is a constant;
the ankle joint and hip joint trajectory planning method comprises the following steps:
(1) ankle joint trajectory planning
Let X(t), Y(t) and Z(t) be the motion trajectories of the ankle joint in the X, Y and Z directions respectively, and let X′(t), Y′(t) and Z′(t) be the corresponding joint velocities; the ankle joint is subject to the following constraints:
where L_step denotes the forward stepping distance of the right foot, H_step is the maximum height the ankle joint can reach, and T_a and T_b are the start and end times of the ankle joint's maximum lift within one walking cycle, used to adjust how long the ankle joint stays at its maximum height;
deriving the trajectory of the ankle joint expressed by a cubic interpolation function according to the constraints of equations (3.15) and (3.16):
X(t) = −0.5·L_step + 3·L_step·t² − 2·L_step·t³,  0 ≤ t ≤ 1   (3.17)
Y(t) = 0,  0 ≤ t ≤ 1   (3.19)
When the target environment is an uneven road surface on which obstacles may be present, L_step and H_step are set according to the expected obstacle height;
(2) hip joint trajectory planning
Trajectory of the transverse rolling axis hip joint:
where t_f and t_b are the two time points separating the double-foot support phase from the single-foot support phase, and d_si is the maximum rotation angle of α_5(t);
The lower-limb joint trajectories are acquired through the kinematic relationships of the joints as follows:
The two yaw-axis hip joint angles are set to a constant 0 degrees, and, according to the kinematic relation of each joint angle, the roll-axis and pitch-axis joints of the robot's lower limbs satisfy the following relations:
The joint rotation angles of the ten degrees of freedom of the link model are defined as α_1 to α_10.
The joint rotation angles α_2, α_3, α_7 and α_8 are described using the cubic spline method:
constructing an objective function:
J = J_z + J_an + J_bound   (3.23)
where J_z is the ZMP trajectory error during walking, J_an is the cumulative error of the ankle joint in following the given trajectory over the sampling instants, and J_bound constrains the joint rotation angles not to exceed the motor limits.
Further, the upper-limb joint trajectories are obtained as follows: a direct planning method is adopted to determine the trajectories of the robot's upper-body joints during walking. The joints include the two neck joints, the four joints of the left and right shoulders, and the two joints of the left and right elbows. The two neck joints are set to a constant 0 degrees; the two roll-axis shoulder joints of the left and right shoulders are set to 15 degrees and −15 degrees respectively; and the two elbow joints are set to 20 degrees and −20 degrees respectively. The trajectories of the pitch-axis shoulder joints during arm swing are described by the following formula, the rotation angles of the left and right pitch-axis shoulder joints over time being S_shl(t) and S_shr(t):
a humanoid robot walking control method, carry on the off-line gait planning to the humanoid robot, make the humanoid robot trace the joint movement track that the off-line produces have walking ability; in response to the off-line gait, the state coding module adjusts the activation state of the PF according to the state information collected by the humanoid robot sensor, the lower olive feedback module modifies the behavior selection probability and the cerebellar neuron storage weight based on the evaluation information fed back by the environment, and the motion mapping module outputs and adjusts the robot action according to the cerebellar model.
The invention realizes walking control of the humanoid robot by combining offline gait planning with online stability control. Based on the walking pattern generated by offline gait planning, the humanoid robot walks stably on level ground; within this fixed walking pattern, online stability control senses the environment through the sensors and adjusts in real time, giving the humanoid robot the ability to adapt to complex environments. In the online stability control part, an online stability controller is first constructed with a reinforcement learning algorithm, so that, through autonomous learning, the humanoid robot adaptively adjusts its lower-limb joints to maintain balance during walking. Reinforcement learning is an intelligent control method derived from the observation and study of human learning behavior, while the cerebellum is the center that maintains body balance and coordinates movement and has strong learning ability and plasticity; developing a cerebellum-like bionic control model is significant both for the development of control science and for the simulation of the pathological cerebellum. With the goal of constructing a humanoid robot control method based on the cerebellar mechanism, the anatomical and physiological structure of the cerebellum is studied further, and a reinforcement learning mechanism is introduced to establish a bionic control model based on the cerebellum, improving the stability and balance of the humanoid robot during walking.
Claims (5)
1. An online stable control humanoid robot based on a bionic reinforcement learning type cerebellum model, characterized in that it comprises:
a device for offline gait planning of the humanoid robot, whose output enables the humanoid robot to track the joint motion trajectories generated offline and thereby have walking capability; and
a cerebellum model controller responding to the offline gait, the cerebellum model controller comprising a state coding module, a cerebellum model, a lower olive feedback module and a motion mapping module,
wherein the state coding module adjusts the activation state of the PF according to the state information collected by the humanoid robot's sensors, the lower olive feedback module modifies the behavior selection probability and the cerebellar neuron storage weights based on the evaluation information fed back by the environment, and the motion mapping module outputs adjustments to the robot's actions according to the cerebellum model output.
2. The online stability control humanoid robot based on the bionic reinforcement learning type cerebellum model of claim 1, characterized in that the motion mapping module adjusts the robot's motion according to the cerebellum model output in the following manner: the output of the cerebellum model is determined by the activation states of the basic units; the activation states of the basic units at time t form a basic unit state vector, and the dot product of this state vector with the command mapping vector is the joint adjustment amount at time t; the joints are the controlled objects, each output acts directly on a joint, and the motors are the actuating elements; by analogy with the joint-skeleton-muscle system of the human body, forward and reverse rotation of the joint motors correspond to contraction and relaxation of skeletal muscle, and the adjustment amount output by the cerebellum model is regarded as a nerve electrical signal with a certain amplitude and duration.
3. The online stability control humanoid robot based on the bionic reinforcement learning type cerebellum model of claim 2, characterized in that: the joint adjustment amount is acquired based on the following modes:
when a certain joint is taken as an object for feedback control of a cerebellum model, the command mapping vector is as follows:
where n is the number of basic units; the duration and amplitude of the joint adjustment amount are adjustable:
4. The online stability control humanoid robot based on the bionic reinforcement learning type cerebellum model of claim 2, characterized in that: the cerebellum model learning method comprises the following steps:
(1) Initialize the PF-PC plastic synapse weights ω_ij, the weights of the synapses formed by the connections between SC, BC and PC, and the behavior selection probability π_0(a);
(2) The state coding module determines the cerebellum model state s_t according to the information from the robot's sensors;
(3) Select and execute an action a_t based on the current policy, obtain the state s_{t+1} and the immediate return value r_{t+1}, and update the state value function;
(4) Compute the TD error δ_t of the last-step behavior, update the behavior selection probability π_t(a), and modify the related synaptic weights according to the CF feedback information;
(5) Judge whether the current round of learning is finished; if not, return to step (2) and continue the learning process; otherwise, reset the system and start the next round of learning, until the whole learning process is finished (a minimal illustrative skeleton of this loop is sketched below).
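Steps (1)-(5) can be strung together as in the skeleton below. Every helper is a stand-in stub; the state discretisation, the softmax action selection, the simulated disturbance and all constants are assumptions, and the CF-driven synapse modification of step (4) is only indicated by a comment.

```python
import numpy as np

N_ACT, N_STATES = 6, 32
p = np.zeros((N_STATES, N_ACT))                 # (1) behaviour-selection tendencies
V = np.zeros(N_STATES)                          #     state values

def encode_state(sensors):                      # (2) stand-in for the state coding module
    return int(np.clip(sensors[0] * 16 + 16, 0, N_STATES - 1))

def select_action(p_row):                       # (3) softmax over tendencies (assumed)
    prob = np.exp(p_row) / np.exp(p_row).sum()
    return int(np.random.choice(N_ACT, p=prob))

for episode in range(3):                        # (5) rounds of learning
    sensors, t = np.zeros(4), 0
    s = encode_state(sensors)
    while t < 50 and abs(sensors[0]) < 1.0:
        a = select_action(p[s])
        sensors = sensors + np.random.normal(0, 0.05, 4)   # stand-in for the simulator
        s_next, r = encode_state(sensors), -abs(sensors[0])
        delta = r + 0.9 * V[s_next] - V[s]      # (4) TD error of the last behaviour
        V[s] += 0.1 * delta                     #     state-value update
        p[s, a] += 0.1 * delta                  #     behaviour-selection tendency update
        # (4, continued) the CF feedback would also modify the PF-PC synapse weights here
        s, t = s_next, t + 1
```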
5. The online stability control humanoid robot based on the bionic reinforcement learning type cerebellum model of claim 1, characterized in that:
the state coding module receives the state signals of the humanoid robot and projects them into PF states in the cerebellum model; the MF receives proprioceptive information input from outside the cerebellum and transmits it to the GCs for information coding, with synaptic connections formed between the MF and the GCs; the CMAC is a neural network based on weight storage whose weights are obtained by table lookup, and its input-output process is divided into two stages: in the first stage, the input information locates, across all dimensions, the unique group of logical partitions corresponding to it, and each logical partition locates its corresponding weight in actual memory; in the second stage, the weights found in the first stage are accumulated by weighted summation to obtain the output of the CMAC network;
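The two-stage table lookup described for the CMAC can be pictured with the toy Python class below; the tiling counts, value ranges and hashing scheme are assumptions made purely to illustrate the "find the partitions, then sum the stored weights" flow.

```python
import numpy as np

class TinyCMAC:
    def __init__(self, n_tilings=4, tiles_per_dim=8, dim=4, low=-1.0, high=1.0):
        self.n_tilings, self.tiles, self.dim = n_tilings, tiles_per_dim, dim
        self.low, self.width = low, (high - low) / tiles_per_dim
        self.w = np.zeros((n_tilings, tiles_per_dim ** dim))   # table of stored weights

    def _index(self, x, k):
        # stage 1: find the logical partition of input x in tiling k
        offset = k * self.width / self.n_tilings                # shifted tilings
        cells = np.clip(((x - self.low + offset) / self.width).astype(int),
                        0, self.tiles - 1)
        return int(np.ravel_multi_index(cells, (self.tiles,) * self.dim))

    def output(self, x):
        x = np.asarray(x, dtype=float)
        # stage 2: accumulate the weights found in stage 1
        return sum(self.w[k, self._index(x, k)] for k in range(self.n_tilings))

cmac = TinyCMAC()
print(cmac.output([0.1, -0.2, 0.05, 0.3]))   # 0.0 until the table has been trained
```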
the cerebellum functional module (cerebellum model) is as follows: following the neural connection structure of the cerebellum and the conduction process of electrical signals, the PFs transmit the coded signals of the humanoid robot's continuous state information, and the input signals are projected to the relevant area through the state coding module; BC and SC are excited in this process, and the parameters s_j and b_j representing the state information of SC and BC are set to '1'; the weights of the synapses formed by the connections between BC, SC and PC are assigned corresponding negative values, because BC and SC inhibit PC; the membrane potential value P_j(t) of a PC is calculated as follows:
where P_j(t) represents the membrane potential value of the jth PC at time t and has a binary character, the two states 'inhibited' and 'activated' being represented by '0' and '1' respectively; ω_ij(t) is the weight stored in a PF-PC plastic synapse, representing the effect of the synapse state on the PC membrane potential, where i = 1, 2, ..., N indexes the ith PF and j = 1, 2, ..., M indexes the jth PC; PF_i(t) represents the activation state of the ith PF at time t and is determined by the output of the state coding module;
The state value S_j(t) of the jth PC is determined by equation (4.2), where φ is the threshold potential of the PC;
O(t) = [o_1(t), o_2(t), ..., o_l(t), ..., o_L(t)] is the state vector representing the activation states of the basic units; L is the total number of basic units in the module and is determined by the action space of the cerebellum model, the action space being the set of basic actions the cerebellum model can output; o_l(t) is the state of the lth basic unit, determined by the following equation:
where μ is a correction factor and M is the number of PCs in a basic unit;
The effect of PF-PC synaptic excitability on the PC membrane potential varies with time: ω_ij(t) is continuously adjusted at each time t, and its change Δω_ij(t) is:
Δω_ij(t) = K[1 − P_j(t)]·PF_i(t)   (4.4)
where K is a weight adjustment coefficient;
The weight at time t+1 is then:
ω_ij(t+1) = ω_ij(t) + Δω_ij(t)   (4.5)
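Equations (4.4)-(4.5) amount to the small update below; the numbers of PFs and PCs, the initial weights and the adjustment coefficient K are illustrative assumptions.

```python
import numpy as np

N_PF, N_PC = 16, 8
omega = np.random.uniform(0.0, 1.0, size=(N_PF, N_PC))   # PF-PC plastic synapse weights

def update_pf_pc(omega, pf_state, pc_state, K=0.1):
    """pf_state: length-N_PF 0/1 PF activations; pc_state: length-N_PC 0/1 PC states."""
    delta = K * np.outer(pf_state, 1.0 - pc_state)        # eq. (4.4): K*(1 - P_j)*PF_i
    return omega + delta                                   # eq. (4.5)

pf = np.random.randint(0, 2, N_PF)
pc = np.random.randint(0, 2, N_PC)
omega = update_pf_pc(omega, pf, pc)
```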
the lower olive feedback module operates as follows: at time t, the state s(t) of the robot is obtained through the sensors; the TD algorithm selects and executes a(t) based on the behavior selection probability, obtains the next state s(t+1) and the immediate return value r_{t+1}, and updates the state value function using equation (2.11), where r_{t+1} is obtained by comparing the robot's stability state with that at the previous moment, taking '−1' or '0' according to whether the robot keeps or approaches the stable state;
The modeling process of the cerebellum is the modeling of the related synapses; the CF information transmission process adopts a reinforcement learning mechanism, the behavior selection probability π_t(a) is continuously adjusted through trial and error, and the related synapses change in the process; after obtaining a learning experience, the lower olive module calculates the TD error δ_t using formula (4.6) and modifies π_t(a) based on formulas (4.7) and (4.8);
δ_t = r_{t+1} + γV(s(t+1)) − V(s(t))   (4.6)
p_{t+1}(s(t), a(t)) = p_t(s(t), a(t)) + ψ·δ_t   (4.7)
where p_t(s(t), a(t)) is the tendency to select action a(t) in state s(t) at time t (the probability of selecting each action is initially the same), ψ is the step parameter, Pr{a(t) = a} is the probability of selecting action a at time t, abbreviated as π_t(a), and n is the total number of selectable actions;
According to δ_t, the evaluation information of the CF feedback is obtained:
c_l(t) = g(t)   (4.9)
Based on c_l(t), the corresponding PF weight is adjusted:
ω_ij(t) = ω_ij(t) − σ·c_l(t)·PF_i(t)   (4.11)
wherein σ is a positive constant;
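One pass of the lower olive feedback described by equations (4.6), (4.7) and (4.11) can be sketched as follows; the tabular state values, the sign-based CF signal standing in for the unreproduced g(t) of equations (4.9)-(4.10), the choice of which PC column is corrected, and all constants are assumptions.

```python
import numpy as np

gamma, psi, sigma_c, alpha_v = 0.9, 0.1, 0.05, 0.1   # assumed constants

def lower_olive_step(V, p, s, a, r_next, s_next, omega, pf_state, pc_col):
    delta = r_next + gamma * V[s_next] - V[s]         # TD error, eq. (4.6)
    V[s] += alpha_v * delta                           # state-value update (TD(0), assumed)
    p[s, a] += psi * delta                            # selection tendency, eq. (4.7)
    cl = 1.0 if delta < 0 else 0.0                    # assumed CF evaluation signal c_l(t)
    omega[:, pc_col] -= sigma_c * cl * pf_state       # PF-PC correction, eq. (4.11)
    return delta

V, p = np.zeros(32), np.zeros((32, 6))
omega = np.random.uniform(0, 1, (16, 8))
pf = np.random.randint(0, 2, 16)
lower_olive_step(V, p, s=3, a=2, r_next=-1.0, s_next=4,
                 omega=omega, pf_state=pf, pc_col=5)
```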
the motion mapping module maps the outputs of the basic units in the cerebellum model onto the controlled object to execute the related control commands; an APG method is introduced into the cerebellum model, each APG can generate a motion instruction, and the motion instructions correspond one-to-one with the basic unit outputs; during dynamic operation of the cerebellum model, the corresponding action at each time t is calculated as follows:
A(t)=D·O(t) (4.12)
where D is the vector of motion instructions, also called the command mapping vector, and A(t) is the final output instruction of the cerebellum model at time t;
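Equation (4.12) is simply a dot product; the values of D and the single active basic unit in the sketch below are assumptions chosen to make the mapping concrete.

```python
import numpy as np

D = np.array([-0.03, -0.02, -0.01, 0.0, 0.01, 0.03])  # assumed command mapping vector (rad)
O_t = np.array([0, 0, 0, 0, 1, 0])                    # activation state of the basic units
A_t = float(D @ O_t)                                   # A(t) = D . O(t), eq. (4.12)
print(A_t)                                             # joint adjustment sent to the motor
```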
the output of the cerebellum model is determined by the activation states of the basic units; the activation states of the basic units at time t form a basic unit state vector, and the dot product of this state vector with the command mapping vector is the joint adjustment amount at time t; the joints are the controlled objects, each output acts directly on a joint, and the motors are the actuating elements; by analogy with the joint-skeleton-muscle system of the human body, forward and reverse rotation of the joint motors correspond to contraction and relaxation of skeletal muscle, and the adjustment amount output by the cerebellum model is regarded as a nerve electrical signal with a certain amplitude and duration;
when a certain joint is taken as an object for feedback control of a cerebellum model, the command mapping vector is as follows:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836621.9A CN112060082B (en) | 2020-08-19 | 2020-08-19 | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010836621.9A CN112060082B (en) | 2020-08-19 | 2020-08-19 | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112060082A true CN112060082A (en) | 2020-12-11 |
CN112060082B CN112060082B (en) | 2021-10-15 |
Family
ID=73662170
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010836621.9A Active CN112060082B (en) | 2020-08-19 | 2020-08-19 | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112060082B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001109732A (en) * | 1999-10-08 | 2001-04-20 | Rinnai Corp | Device for outputting approximate value and its learning device |
CN1590039A (en) * | 2003-08-25 | 2005-03-09 | 索尼株式会社 | Robot and attitude control method of robot |
CN101390100A (en) * | 2005-12-28 | 2009-03-18 | 神经科学研究基金会 | Brain-based device having a cerebellar model for predictive motor control |
CN103645632A (en) * | 2013-11-22 | 2014-03-19 | 大连海联自动控制有限公司 | A model reference self-adaptive control method based on a generalized fuzzy CAMA |
TW201537516A (en) * | 2014-03-26 | 2015-10-01 | Univ Nat Taipei Technology | Method and apparatus for moving object detection based on cerebellar model articulation controller network |
CN105824250A (en) * | 2016-05-14 | 2016-08-03 | 大连理工大学 | Bionic arm control system based on cerebellum model and method for cerebellum model modeling |
CN107341543A (en) * | 2017-06-18 | 2017-11-10 | 大连理工大学 | A kind of cerebellar model modeling method based on intensified learning |
CN109834713A (en) * | 2019-03-13 | 2019-06-04 | 上海飒智智能科技有限公司 | Apery brain robot control axis and apery brain walking robot control axis |
Non-Patent Citations (2)
Title |
---|
YU WEIWEI et al.: "Design of an Online Control System for a Biped Robot Crossing Dynamic Obstacles", Computer Measurement & Control *
SI YUANYUAN et al.: "Walking Control of a Biped Robot Based on CPG and Cerebellum Model", System Simulation Technology *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112596379A (en) * | 2020-12-16 | 2021-04-02 | 复旦大学 | Robot controller design method |
CN112596379B (en) * | 2020-12-16 | 2022-02-22 | 复旦大学 | Robot controller design method |
CN112757295A (en) * | 2020-12-28 | 2021-05-07 | 武汉远图信息科技有限公司 | Biped robot gravity center control method and simulation system based on reinforcement learning |
CN112644599B (en) * | 2020-12-30 | 2022-03-04 | 乐聚(深圳)机器人技术有限公司 | Posture adjustment method, device, equipment and storage medium of biped robot |
CN112644599A (en) * | 2020-12-30 | 2021-04-13 | 乐聚(深圳)机器人技术有限公司 | Posture adjustment method, device, equipment and storage medium of biped robot |
CN112700011A (en) * | 2020-12-31 | 2021-04-23 | 第四范式(北京)技术有限公司 | Intelligent agent decision information display method and device, electronic equipment and storage medium |
CN112700011B (en) * | 2020-12-31 | 2024-05-31 | 第四范式(北京)技术有限公司 | Agent decision information display method and device, electronic equipment and storage medium |
CN112859901A (en) * | 2021-01-21 | 2021-05-28 | 北京理工大学 | Continuous dynamic stable jumping control method of humanoid robot |
CN112859901B (en) * | 2021-01-21 | 2021-12-24 | 北京理工大学 | Continuous dynamic stable jumping control method of humanoid robot |
CN113001542A (en) * | 2021-02-21 | 2021-06-22 | 深圳市优必选科技股份有限公司 | Robot trajectory planning method and device, readable storage medium and robot |
CN113510704A (en) * | 2021-06-25 | 2021-10-19 | 青岛博晟优控智能科技有限公司 | Industrial mechanical arm motion planning method based on reinforcement learning algorithm |
CN113433865A (en) * | 2021-07-09 | 2021-09-24 | 西安交通大学 | Design method and system of IPMC-driven bionic wave fin motion control system |
CN113858204A (en) * | 2021-10-19 | 2021-12-31 | 中山大学 | Redundant mechanical arm tracking control method, device and medium |
CN113858204B (en) * | 2021-10-19 | 2022-05-03 | 中山大学 | Redundant mechanical arm tracking control method, device and medium |
CN113848946B (en) * | 2021-10-20 | 2023-11-03 | 郑州大学 | Robot behavior decision method and equipment based on nerve regulation mechanism |
CN113848946A (en) * | 2021-10-20 | 2021-12-28 | 郑州大学 | Robot behavior decision method and device based on neural regulation mechanism |
WO2023159978A1 (en) * | 2022-02-28 | 2023-08-31 | 山东大学 | Quadruped robot motion control method based on reinforcement learning and position increment |
CN116197918A (en) * | 2023-05-05 | 2023-06-02 | 北京华晟经世信息技术股份有限公司 | Manipulator control system based on action record analysis |
CN117172280A (en) * | 2023-11-01 | 2023-12-05 | 四川酷盼科技有限公司 | Multisource data processing method applied to bionic animal |
CN117172280B (en) * | 2023-11-01 | 2024-02-02 | 四川酷盼科技有限公司 | Multisource data processing method applied to bionic animal |
CN118003341A (en) * | 2024-04-09 | 2024-05-10 | 首都体育学院 | Lower limb joint moment calculation method based on reinforcement learning intelligent body |
Also Published As
Publication number | Publication date |
---|---|
CN112060082B (en) | 2021-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112060082B (en) | Online stable control humanoid robot based on bionic reinforcement learning type cerebellum model | |
CN112051735B (en) | Humanoid robot walking control method | |
Kashyap et al. | Particle Swarm Optimization aided PID gait controller design for a humanoid robot | |
Abreu et al. | Learning low level skills from scratch for humanoid robot soccer using deep reinforcement learning | |
Thor et al. | Generic neural locomotion control framework for legged robots | |
Paul et al. | The road less travelled: Morphology in the optimization of biped robot locomotion | |
Fukuda et al. | Multi-locomotion robotic systems: New concepts of bio-inspired robotics | |
Chew et al. | Dynamic bipedal walking assisted by learning | |
Zhu et al. | A survey of sim-to-real transfer techniques applied to reinforcement learning for bioinspired robots | |
Jakimovski | Biologically inspired approaches for locomotion, anomaly detection and reconfiguration for walking robots | |
Oliveira et al. | Multi-objective parameter CPG optimization for gait generation of a biped robot | |
CN114986526B (en) | Robot motion control method and device, robot and storage medium | |
CN111625002A (en) | Stair-climbing gait planning and control method of humanoid robot | |
Hoffmann et al. | The merits of passive compliant joints in legged locomotion: Fast learning, superior energy efficiency and versatile sensing in a quadruped robot | |
Li et al. | Fuzzy double deep Q-network-based gait pattern controller for humanoid robots | |
Shi et al. | Bio-inspired equilibrium point control scheme for quadrupedal locomotion | |
Saputra et al. | Evolving a sensory–motor interconnection structure for adaptive biped robot locomotion | |
Ren et al. | Computational models to synthesize human walking | |
Huang et al. | Smooth stride length change of rat robot with a compliant actuated spine based on cpg controller | |
Culha | An actuated flexible spinal mechanism for a bounding quadrupedal robot | |
Huang et al. | Optimizing Dynamic Balance in a Rat Robot via the Lateral Flexion of a Soft Actuated Spine | |
Villarreal | Bridging Vision and Dynamic Legged Locomotion | |
Liu et al. | A reinforcement learning method for humanoid robot walking | |
Barter et al. | Achieving natural behavior in a robot using neurally inspired hierarchical control | |
Shafii | Development of an optimized omnidirectional walk engine for humanoid robots |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |