JP4644833B2

JP4644833B2 - Biped walking movement device

Info

Publication number: JP4644833B2
Application number: JP2004105291A
Authority: JP
Inventors: 淳森本; 崇充松原; 雅昭佐藤; 淳中西; 玄遠藤
Original assignee: ATR Advanced Telecommunications Research Institute International
Current assignee: ATR Advanced Telecommunications Research Institute International
Priority date: 2004-03-31
Filing date: 2004-03-31
Publication date: 2011-03-09
Anticipated expiration: 2024-03-31
Also published as: JP2005288594A

Description

The present invention relates to a biped walking movement device that uses a dynamic control device effective for realizing the periodic motion of biped walking, which is a nonlinear controlled object.

When controlling a robot or a similar system, the state variables needed for control may not be directly measurable, either because of sensor noise or because a sensor cannot be installed in the first place.

In such cases, it is common to use a state observer (see, for example, Non-Patent Document 1) or a Kalman filter (see, for example, Non-Patent Documents 2 and 3). However, when the dynamics of the target are nonlinear, estimating hidden states with these methods can be difficult. Observers applicable to nonlinear systems have been proposed in recent years, but each can be applied only when specific conditions are satisfied (see, for example, Non-Patent Documents 4 and 5).

Methods such as the extended Kalman filter, which updates the state distribution using a local linear model (see, for example, Non-Patent Document 3), and the Monte Carlo filter, which approximates the state distribution with a large number of particles generated by the Monte Carlo method (see, for example, Non-Patent Document 6), are also known as state estimation methods for nonlinear systems, but all of them require the state distribution to be obtained explicitly.

That is, a conventional controller basically needs to observe the current state and produce its control output by a direct mapping from that state.

D.G. Luenberger, "An introduction to observers", IEEE Trans. AC, Vol. 16, pp. 596-602, 1971.
R.E. Kalman and R.S. Bucy, "New results in linear filtering and prediction theory", Trans. ASME, Series D, J. of Basic Engineering, Vol. 83, No. 1, pp. 95-108, 1961.
F.L. Lewis, "Optimal Estimation: with an Introduction to Stochastic Control Theory", John Wiley & Sons, 1977.
Kiyotaka Shimizu, Shunsuke Suzuki, and Tetsufumi Tanaka, "Nonlinear Observer by Gradient Descent Method (State Observer for Nonlinear Systems)", IEICE Transactions A, Vol. J83-A, No. 8, pp. 956-964, 2000.
H. Nijimeijer and T.L. Fossen, "Directions in Nonlinear Observer Design", Springer-Verlag, London, 1999.
G. Kitagawa, "Monte Carlo Filter and Smoother for Non-Gaussian Nonlinear State Models", Journal of Computational and Graphical Statistics, Vol. 5, pp. 1-25, 1996.

An object of the present invention is to provide a biped walking movement device using a dynamic control device that, by giving the controller an internal state and dynamics, makes it easier to learn a state observer for periodic motion.

In the present invention, the purpose of learning is not to reduce the estimation error at a particular instant but to reduce the estimation error over the whole period in which the task is performed. A nonlinear state observer is therefore constructed within the framework of the policy gradient method (reinforcement learning).

In one aspect of the present invention, the input to the dynamics of the state observer is regarded as the action of the learner. From the currently observable state (observation output), the state of the state observer, and the output of the controller, a behavior rule that appropriately guides the state of the state observer toward the state of the controlled object is acquired by the policy gradient method.

Therefore, according to one aspect of the present invention, a biped walking movement device comprises: a right leg including a right thigh and a right lower leg connected to the right thigh via a right knee joint; a left leg including a left thigh and a left lower leg connected to the left thigh via a left knee joint; a first sensor group for detecting ground contact of the right leg or the left leg and for detecting the right knee angle formed by the right lower leg with respect to the right thigh and the left knee angle formed by the left lower leg with respect to the left thigh; a waist connected to the right leg and the left leg for driving the two legs to perform biped walking; a pitch angle sensor for detecting the pitch angle of the waist; and a second sensor group for detecting the right joint angle formed by the waist and the right thigh and the left joint angle formed by the waist and the left thigh. The waist includes a lower-leg control unit that generates a lower-leg control signal for driving the right and left lower legs based on a state machine for the right and left lower legs, and a dynamic control device that receives the detection results of the second sensor group and the pitch angle sensor and generates a thigh control signal for driving the right and left thighs. The dynamic control device includes control means that, for a central pattern generator having internal states that evolve periodically in time and correspond respectively to the right thigh and the left thigh, estimates the states of the right and left thighs based on reinforcement learning and generates the thigh control signal for the right and left thighs as the output of a PD servo system with respect to target values corresponding to the internal states. The device further comprises driving means for driving the right leg and the left leg based on the thigh control signal and the lower-leg control signal. The state machine of the lower-leg control unit is, for each of the right leg and the left leg, a state machine that transitions in sequence through a first knee flexion state, a first knee extension state, a second knee flexion state, and a second knee extension state. The transition from the first knee flexion state to the first knee extension state and the transition from the second knee flexion state to the second knee extension state occur on condition that, based on the detection result of the second sensor group, the angle formed by the left thigh and the right thigh falls below a predetermined value. The transition from the first knee extension state to the second knee flexion state and the transition from the second knee extension state to the first knee flexion state occur in response to ground contact of the left leg or the right leg, based on the detection result of the first sensor group. The driving means includes knee driving means that outputs the torque applied to the right knee and the left knee as the output of a PD servo system based on target angles for the right and left knee angles, predetermined according to the state of the state machine, and on the right and left knee angles detected by the first sensor group.
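The four-state knee cycle described above can be sketched as a small state machine. This is only an illustration under stated assumptions: the names `KneeState`, `next_knee_state`, and `angle_threshold`, and the threshold value itself, are hypothetical and do not come from the patent text.

```python
from enum import Enum


class KneeState(Enum):
    FLEX_1 = 0     # first knee flexion state
    EXTEND_1 = 1   # first knee extension state
    FLEX_2 = 2     # second knee flexion state
    EXTEND_2 = 3   # second knee extension state


def next_knee_state(state, thigh_angle, ground_contact, angle_threshold=0.1):
    """Return the next state of one leg's knee state machine.

    thigh_angle:     angle formed by the left and right thighs [rad]
    ground_contact:  True when a foot contact is detected
    angle_threshold: hypothetical stand-in for the "predetermined value" [rad]
    """
    if state in (KneeState.FLEX_1, KneeState.FLEX_2):
        # Flexion -> extension once the inter-thigh angle drops below the threshold.
        if thigh_angle < angle_threshold:
            return KneeState((state.value + 1) % 4)
    else:
        # Extension -> next flexion state on ground contact.
        if ground_contact:
            return KneeState((state.value + 1) % 4)
    return state
```

A knee target angle would then be looked up from the current state and fed to a PD servo; that lookup table is not given in this section, so it is left out here.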

Preferably, the control means includes a feedback controller that, based on state information of the right and left thighs, updates feedback parameters using a value function and a reward sequence obtained during reinforcement learning, and the control means generates the thigh control signal based on the internal state of the central pattern generator, which changes according to the output from the feedback controller.

Preferably, the reinforcement learning is performed by a policy gradient method.

In the present invention, the behavior rule itself has dynamics expressed by internal variables and their differential equations, so the entrainment between the behavior rule and the physical system can be exploited. This makes it easier to learn the state observer when state estimation is performed for periodic motion. Furthermore, a controller that is robust to noise and time delay in the sensor inputs can be realized.

Furthermore, according to the present invention, it is possible to provide a biped walking movement device using a dynamic control device that makes it easier to learn the state observer for periodic motion even when the number of degrees of freedom of the biped walking movement device increases and the state space becomes high-dimensional.

Embodiments of the present invention will be described below with reference to the drawings.

In the following description, the case where the present invention is applied to the control of biped walking is described as an example. However, the present invention is not necessarily limited to this case; more generally, it provides a control system that is effective for systems that perform periodic motion. In particular, the present invention provides a control system suited to underactuated (mechanical) systems that perform periodic motion.

[Embodiment 1]
(System configuration of the present invention)
FIG. 1 is a conceptual diagram showing an example of a biped walking movement system 1000 using the dynamic control device of the present invention.

Referring to FIG. 1, the system 1000 includes a dynamic control device 100, a torso 40 provided on top of the dynamic control device 100, and legs that are driven and controlled by the dynamic control device 100. The legs consist of a right leg 10r and a left leg 10l, which are provided with sensors 20r and 20l, respectively, near the ground contact surface. The torso 40 is provided with a sensor 30.

The sensor 20r detects the angle θ_r and the angular velocity dθ_r/dt of the right leg with respect to the center line 4 of the torso 40, and the sensor 20l detects the angle θ_l and the angular velocity dθ_l/dt of the left leg with respect to the center line 4; each sensor notifies the dynamic control device 100 of this information. In addition, the sensor 30 detects the pitch angle θ_p and the angular velocity dθ_p/dt of the torso 40 with respect to the vertical direction 2 and notifies the dynamic control device 100 of them.

The dynamic control device 100 controls the motion of the right leg 10r and the left leg 10l based on the information from the sensors 20r, 20l, and 30.

FIG. 2 is a block diagram showing the configuration of the dynamic control device 100 shown in FIG. 1.

Referring to FIG. 2, the dynamic control device 100 includes: a communication interface 106 that receives signals from the sensors 20r, 20l, and 30; a storage device 104 for storing control parameters, described later, and information from the sensors; an arithmetic processing unit 102 that generates control signals based on a dynamic behavior rule acquired by learning from the information from the sensors 20r, 20l, and 30; and a drive unit 108 that drives and controls the right leg 10r and the left leg 10l based on the control signals from the arithmetic processing unit 102.

The preparatory processing for the control operation of the dynamic control device 100, and the control operation itself, are described below.

(1-1. Dynamic behavior rule)
First, the "dynamic behavior rule" is described as a premise for explaining the control operation of the present invention.

FIG. 3 is a conceptual diagram for explaining such a dynamic behavior rule.

A "dynamic behavior rule" is a framework, as shown in FIG. 3(1), in which the behavior rule itself has dynamics expressed by internal variables and their differential equations.

More concretely, the observer may hold the internal variables and their differential equations, as shown in FIG. 3(2), or the controller may hold them, as shown in FIG. 3(3).

A control device that controls a controlled object based on such a dynamic behavior rule is referred to as a "dynamic control device".

In the following, we first consider the case where biped walking motion is realized by the dynamic control device 100 using the framework of FIG. 3(3).

Because the behavior rule has an internal state, the entrainment between the behavior rule and the physical system can be exploited, which is considered effective for periodic motion and state estimation. The behavior rule can also be expected to be robust, to some extent, against noise and time delay in the sensor inputs.

When such a behavior rule is acquired by learning, the fact that its internal state becomes a hidden variable is a problem. However, once the internal state of the behavior rule is entrained to the state of the physical system, the two states correspond uniquely. In the transient phase, though, hidden states still have to be handled.

The present invention therefore uses, as the method for acquiring the dynamic behavior rule, a policy gradient method that is applicable even in environments with hidden variables, as described later.

The following describes the learning system of the present invention, the central pattern generator (CPG) that constitutes the dynamic behavior rule, and the feedback controller for the CPG.

(1-2. Learning system)
In the following description, the dynamic behavior rule is applied to biped walking motion using a three-link biped walking robot model as an example of periodic motion.

FIG. 4 is a functional block diagram showing the processing performed by the arithmetic processing unit 102 described with reference to FIG. 2. As described below, the arithmetic processing unit 102 functions as a learning system.

As shown in FIG. 4, the learning system basically forms the dynamic behavior rule from the CPG processing unit 1026 and the feedback controller 1022. The state x used for learning is expressed by the following equation.

Here, as described above, θ_r and θ_l are the angles of the right and left legs 10r and 10l of the robot, respectively, from the vertical direction, and θ_p is the pitch angle of the torso 40.

In other words, the learning system constructs its state space only from signals obtained directly from the robot and does not use the internal state of the CPG.

Full state observation is assumed here, so that y = x.

(1-3. Central pattern generator (CPG))
In the learning system realized by the arithmetic processing unit 102, the neural oscillator model expressed by the following equations is used for the CPG processing unit 1026 that constitutes the dynamic behavior rule. This neural oscillator model is disclosed in, for example, Kiyoshi Matsuoka, "Sustained oscillations generated by mutually inhibiting neurons with adaptation", Biological Cybernetics, Vol. 52, pp. 367-376, 1985.

Here, the variables z and p are the internal states of a neuron, q is the output of the neuron, z₀ is the tonic (sustained) input, the constant β is the fatigue coefficient of the neuron, τ and τ' are the time constants of z and p, and ω is the coupling coefficient between antagonistic neurons. The term a is the output from the feedback controller described later.

FIG. 5 is a conceptual diagram showing the CPG formed by the neural oscillator model of equations (1) to (12). In FIG. 5, among the connections between the neuron states z and p, positive connections are indicated by white circles and negative connections by black circles.

FIG. 6 shows the waveforms of the variables z₁ and z₂ that make up the output q of such a CPG. As an example, this calculation used τ = 0.05, τ' = 0.6, β = 2.5, ω = 2.0, z₀ = 0.1, and a = 0.

It can be seen that the internal state variables z₁ and z₂ of the CPG change periodically according to the model of equations (1) to (12).
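The periodic behavior of z₁ and z₂ can be reproduced with a short simulation. The sketch below is not the patent's equations (1) to (12), which are not reproduced in this text; it assumes the commonly used two-neuron form of the Matsuoka oscillator (a single antagonistic pair rather than the four neurons used for the two legs), integrated with a simple Euler step. The function names, the initial values, and the integration scheme are illustrative assumptions; the parameter defaults are the example values quoted above.

```python
def simulate_matsuoka(t_end=5.0, dt=0.001, tau=0.05, tau_p=0.6,
                      beta=2.5, omega=2.0, z0=0.1, a=(0.0, 0.0)):
    """Euler-integrate one antagonistic pair of mutually inhibiting neurons.

    Assumed model, for i in {0, 1} and j the other neuron:
        tau   * dz_i/dt = -z_i - omega * q_j - beta * p_i + z0 + a_i
        tau_p * dp_i/dt = -p_i + q_i
        q_i = max(0, z_i)
    Returns a list of (z_1, z_2) samples, one per time step.
    """
    z = [0.1, 0.0]  # small asymmetry (assumed) so the pair leaves the symmetric state
    p = [0.0, 0.0]
    history = []
    for _ in range(int(round(t_end / dt))):
        q = [max(0.0, z[0]), max(0.0, z[1])]
        dz = [(-z[i] - omega * q[1 - i] - beta * p[i] + z0 + a[i]) / tau
              for i in range(2)]
        dp = [(-p[i] + q[i]) / tau_p for i in range(2)]
        z = [z[i] + dt * dz[i] for i in range(2)]
        p = [p[i] + dt * dp[i] for i in range(2)]
        history.append((z[0], z[1]))
    return history


def count_sign_changes(values):
    """Count how often consecutive samples have opposite signs."""
    return sum(1 for u, v in zip(values, values[1:]) if u * v < 0)
```

Plotting the two columns of `simulate_matsuoka()` against time should show the two neurons taking turns being active, and `count_sign_changes` applied to z₁ - z₂ gives a rough count of half cycles.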

Furthermore, in the PD servo processing unit 1028 of the learning system in FIG. 4, the difference between the outputs of the neurons is used as the target joint angle θ^d of the servo system for both legs, as shown below.

Here, θ_l^d is the target joint angle of the left leg and θ_r^d is the target joint angle of the right leg.

The torque input u to the robot output by the PD servo processing unit 1028 is the output of the following PD servo system.

Here, u_l is the torque input to the left leg and u_r is the torque input to the right leg. K_p is the position gain and K_d is the velocity gain.
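A PD servo of the kind described here can be sketched as follows. The exact equations are not reproduced in this text, so the form u = K_p(θ^d - θ) - K_d dθ/dt, the sign convention of the neuron difference, the function names, and the gain values are all assumptions made for illustration.

```python
def target_angle(q_a, q_b):
    """Target joint angle as the difference of two neuron outputs.

    Which neuron of the pair is subtracted from which is an assumption.
    """
    return q_a - q_b


def pd_torque(theta_d, theta, theta_dot, k_p=5.0, k_d=0.1):
    """PD servo output: position error times K_p minus velocity times K_d.

    k_p and k_d are hypothetical gains, not values from the patent.
    """
    return k_p * (theta_d - theta) - k_d * theta_dot
```

For one leg, `pd_torque(target_angle(q1, q2), theta_l, theta_l_dot)` would give the torque input u_l under these assumptions.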

(1-4. Feedback controller 1022)
The feedback controller 1022 for the CPG described above is represented by the following probability distribution (17).

Here, x is the state vector of the controlled object and w is the parameter vector.
The realized value v_j of the j-th output is therefore given by the following equation (18).

Here, n_j(t) ~ N(0, 1), where N(0, 1) denotes the normal distribution with mean 0 and variance 1.

To saturate the output, the output saturation processing unit 1024 determines the final controller output a_j(t) using a function d(), as in the following equation (19).

In the following description, the following function is used as d() as an example.

The outputs a_j (j = 1 to 4) correspond to the extensor and flexor neurons of the neural oscillators for the left and right legs in equation (1).
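The stochastic feedback controller and its output saturation can be illustrated with a few lines of code. Equations (17) to (19) and the definition of d() are not reproduced in this text, so this sketch assumes a Gaussian policy v = μ + σn with n ~ N(0, 1) and uses an arctangent squashing function as a stand-in for d(); both `saturate` and `sample_feedback` are hypothetical names.

```python
import math
import random


def saturate(v):
    """Assumed saturation function d(): smooth, odd, bounded in (-1, 1)."""
    return (2.0 / math.pi) * math.atan((math.pi / 2.0) * v)


def sample_feedback(mu, sigma, rng=random):
    """Draw one controller output.

    mu, sigma: mean and standard deviation of the j-th output for the
               current state (in the patent these depend on x and w).
    Returns (v, a): the raw Gaussian sample and the saturated output.
    """
    n = rng.gauss(0.0, 1.0)   # n ~ N(0, 1)
    v = mu + sigma * n        # realized value of the Gaussian policy
    return v, saturate(v)
```

The raw sample v is what enters the eligibility computation during learning, while the saturated value a is what would be fed to the oscillator.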

(2. Policy gradient method)
The policy gradient method is a type of reinforcement learning method in which actions are selected according to a parameterized stochastic policy and the policy parameters are updated little by little in the direction that improves the policy. A method of learning the behavior rule using the policy gradient method is described below.

(2-1. Temporal difference error in continuous time and state systems)
The dynamics of a continuous time and state system are expressed by the following equation (20).

Here, x ∈ X ⊂ R^n is the state and u ∈ U ⊂ R^m is the control input.

The reward is given as a function of the state and the control input by the following equation (21).

Under a control law π(u(t) | x(t)), the value function of the state x(t) is defined by the following equation (22).

Here, τ is the time constant of the value function. Differentiating both sides of equation (22) with respect to time gives the constraint of the following equation (23).

Let V(x(t)) = V(x(t); w) be the predicted value of the value function, where w is the parameter of the predicted value.

If the prediction is correct, it satisfies equation (23). If it is not, learning is performed so as to reduce the prediction error given by the following equation (24).

The above equation is the TD error in a continuous-time system.
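Equations (20) to (24) are not reproduced in this text. As a reading aid, the block below writes out the standard continuous-time formulation that matches the description above (exponentially discounted value with time constant τ, its time derivative, and the resulting TD error). It is an assumed reconstruction, not the patent's own typesetting, and the equation tags are only a guess at the correspondence.

```latex
\begin{align}
  \dot{x}(t) &= f\bigl(x(t), u(t)\bigr) \tag{20} \\
  r(t) &= r\bigl(x(t), u(t)\bigr) \tag{21} \\
  V^{\pi}\bigl(x(t)\bigr) &= E\!\left[\int_{t}^{\infty}
      e^{-\frac{s-t}{\tau}}\, r\bigl(x(s), u(s)\bigr)\, ds \,\middle|\, \pi\right] \tag{22} \\
  \frac{d}{dt} V^{\pi}\bigl(x(t)\bigr) &= \frac{1}{\tau} V^{\pi}\bigl(x(t)\bigr) - r(t) \tag{23} \\
  \delta(t) &= r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \tag{24}
\end{align}
```

Under this form, δ(t) is zero exactly when the predicted value V satisfies the constraint (23), which is why reducing δ improves the prediction.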

(2-2. General theory of the policy gradient method)
When learning is based on evaluating a value function, as in dynamic programming or a greedy policy, the environment must be a Markov decision process. In real problems, however, noise and limited sensor capability make it difficult to guarantee this. The policy gradient method, by considering the cumulative reward sequence obtained during a trial together with the value function, can be applied even when the environment is a non-Markov decision process (POMDP).

When a policy π_w with parameter w is used, the following equation (25) holds.

Here, the following equation (26) holds.

Here, κ is the time constant of the eligibility trace. It has been shown that an unbiased estimate of the gradient of the value function with respect to the policy parameter w can be obtained from the temporal difference error δ and the eligibility trace e(t) of the policy. This policy gradient method is described in, for example, Hajime Kimura and Shigenobu Kobayashi, "An actor-critic algorithm using eligibility history for the actor: reinforcement learning under an imperfect value function", Journal of the Japanese Society for Artificial Intelligence, Vol. 15, No. 2, pp. 267-275, 2000.

The parameter update rule is therefore given by the following equation (27).

Here, η is the learning rate.

(2-3. Learning the dynamic behavior rule)
In the following, the dynamic behavior rule is acquired using the policy gradient method described above.

Here, the desired dynamic behavior rule is acquired by learning the feedback controller of equation (17).

(2-3-1. Update of the value function)
First, a normalized Gaussian network (NGnet) given by the following equation (28) is used to represent the value function in continuous state space, which is computed by the value function processing unit 1032. The normalized Gaussian network is described later.

Here, b_i^c() is the basis function applied to x in the normalization processing unit 1030, and w_i^c is a parameter of the value function.
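A normalized Gaussian network can be sketched in a few lines. Equation (28) is not reproduced in this text, so the code assumes the usual form V(x) = Σ_i w_i b_i(x), where each b_i is a Gaussian activation divided by the sum of all activations. The axis-aligned widths and the function names are illustrative assumptions.

```python
import math


def ngnet_basis(x, centers, widths):
    """Normalized Gaussian basis functions b_i(x).

    x:       state, a sequence of floats
    centers: list of basis centers, each the same length as x
    widths:  per-dimension width (assumed shared by all basis functions)
    Note: if x is extremely far from every center, all activations can
    underflow to zero; this sketch does not guard against that.
    """
    activations = []
    for c in centers:
        d2 = sum(((xi - ci) / wi) ** 2 for xi, ci, wi in zip(x, c, widths))
        activations.append(math.exp(-0.5 * d2))
    total = sum(activations)
    return [act / total for act in activations]


def ngnet_value(x, centers, widths, weights):
    """Value estimate V(x) = sum_i w_i * b_i(x)."""
    basis = ngnet_basis(x, centers, widths)
    return sum(w * b for w, b in zip(weights, basis))
```

Because the basis functions sum to one, the output is a weighted average of the parameters w_i, which keeps the estimate bounded by the smallest and largest weight.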

The update equations for the eligibility trace e_i^c of the parameter w_i^c and for the parameter w_i^c using the TD error are given by the following equations (29) and (30), respectively.

Here, α is the learning rate of the value function and κ^c is the time constant of the eligibility trace.

(2-3-2. Update of the feedback controller)
When the stochastic feedback controller 1022 of equation (17) is used, the eligibilities with respect to the mean μ_j and the standard deviation σ_j of its j-th output are given, in the same way as the second term on the right side of equation (26), as follows.

In addition, the mean μ is represented by a normalized Gaussian network and the standard deviation σ by a sigmoid function, as in the following equations (33) and (34).

The notation is as follows.

The eligibilities corresponding to these parameters are obtained as in the following equations (35) and (36).

Taking equations (35) and (36) together with equations (26) and (27) gives the feedback parameter update rules of the following equations (37) and (38).

In equations (35) and (36), the parameter σ appears in the denominator, so the eligibility diverges as σ approaches 0. The following equation is therefore used instead of equation (26) to update the eligibility trace.
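The divergence issue can be made concrete with a small sketch. The patent's equations (26), (27), and (31) onward are not reproduced in this text, so the code below assumes a Gaussian policy, the standard log-likelihood derivatives with respect to μ and σ, an Euler-discretized trace de/dt = -e/κ + ∂ln π/∂w, and an update dw/dt = ηδe. Multiplying the eligibility by σ² is one known way to remove σ from the denominator; whether it is the modification used in the patent is an assumption. All function names are hypothetical.

```python
def gaussian_eligibility(v, mu, sigma, scale_by_variance=True):
    """Derivatives of ln N(v; mu, sigma^2) with respect to mu and sigma.

    With scale_by_variance=True both terms are multiplied by sigma^2,
    which removes the 1/sigma^2 blow-up of the mu term as sigma -> 0
    (the sigma term keeps a milder 1/sigma factor).
    """
    d_mu = (v - mu) / sigma ** 2
    d_sigma = ((v - mu) ** 2 - sigma ** 2) / sigma ** 3
    if scale_by_variance:
        d_mu *= sigma ** 2
        d_sigma *= sigma ** 2
    return d_mu, d_sigma


def update_trace(e, grad, kappa, dt):
    """One Euler step of de/dt = -e / kappa + grad."""
    return e + dt * (-e / kappa + grad)


def update_parameter(w, delta, e, eta, dt):
    """One Euler step of dw/dt = eta * delta * e (delta is the TD error)."""
    return w + dt * eta * delta * e
```

In the patent the mean and standard deviation are themselves functions of further parameters, so these derivatives would be multiplied by ∂μ/∂w and ∂σ/∂w via the chain rule before entering the trace.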

(2-4. Specific example)
The results of a numerical simulation of the learning system shown in FIG. 4 are described below.

In this simulation, the biped walking movement system 1000 (biped walking robot) shown in FIG. 1 has a leg length of 0.2 m, a mass of 0.5 kg for each leg, and a torso mass of 0.1 kg. Because the model has no knee joints, the toe of the swing leg is allowed to pass through the ground when the leg swings forward.

The learning parameters are as follows.

The basis functions of the NGnet are placed evenly on a grid over the region of state space that the robot is expected to need when it actually walks, as follows.

As a result, a total of 5184 (= 12 × 6 × 12 × 6) basis functions were placed evenly over the following ranges.
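The grid placement can be checked with a short sketch. The actual ranges are not reproduced in this text, so the ranges below are placeholders, and the reading of the four dimensions as angle and angular velocity of each leg is an assumption; only the counts 12, 6, 12, 6 come from the text.

```python
from itertools import product


def linspace(lo, hi, n):
    """n evenly spaced points from lo to hi, inclusive (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def grid_centers(ranges, counts):
    """Cartesian product of evenly spaced points along each state dimension."""
    axes = [linspace(lo, hi, n) for (lo, hi), n in zip(ranges, counts)]
    return list(product(*axes))


# Placeholder ranges (hypothetical): angle [rad] and angular velocity [rad/s]
# for each of the two legs.
example_ranges = [(-1.0, 1.0), (-5.0, 5.0), (-1.0, 1.0), (-5.0, 5.0)]
example_counts = [12, 6, 12, 6]
centers = grid_centers(example_ranges, example_counts)
```

With these counts, `len(centers)` equals 12 × 6 × 12 × 6 = 5184, matching the total stated above.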

The reward function is expressed by the following equation.

ただし、それぞれがロボットの腰の高さに関する項ｒ_H（ｔ）、歩行速度に関する項ｒ_S（ｔ）は、以下の式で表される。 However, the term r _H (t) relating to the waist height of the robot and the term r _S (t) relating to the walking speed are expressed by the following equations.

ここで、ｈ₁はロボットの腰の高さ、ｈ´は腰の高さのオフセット、ｆ_l,ｆ_rは左右の脚の高さである。したがって、式（４１）の右辺第１項は、ロボットの位置エネルギーに関連する量であり、右辺第２項はロボットの運動エネルギーに関連する量である。 Here, h ₁ is the height of the waist of the robot, h'the waist height of the offset, f _l, is f _r is the height of the left and right legs. Therefore, the first term on the right side of Equation (41) is an amount related to the robot's potential energy, and the second term on the right side is an amount related to the kinetic energy of the robot.

以下に説明するシミュレーションでは各パラメータは、ｋ_s＝０．０６、ｋ_H＝０．５、ｈ´＝０．１５とした。また、ＣＰＧのパラメータは（１−２．学習システム）で述べたものを用いている。 In the simulation described below, the parameters are set to k _s = 0.06, k _H = 0.5, and h ′ = 0.15. The CPG parameters used are those described in (1-2. Learning system).

On the computer, the dynamics of the robot and the CPG were integrated with a time step of 1 msec, and the learning system ran with a time step of 10 msec.
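A minimal sketch of such a two-rate loop is given below, assuming integer millisecond time steps. The callbacks `step_dynamics` and `step_learner` are hypothetical stand-ins for the robot/CPG integrator and the learning update; nothing here is taken from the patent beyond the two step sizes.

```python
def run_multirate(duration_ms, step_dynamics, step_learner,
                  dt_dyn_ms=1, dt_learn_ms=10):
    """Advance the fast dynamics every dt_dyn_ms and the learner every dt_learn_ms."""
    t = 0
    while t < duration_ms:
        step_dynamics(t)          # robot and CPG dynamics (1 msec step)
        t += dt_dyn_ms
        if t % dt_learn_ms == 0:  # learner sees every tenth dynamics step
            step_learner(t)
    return t
```

Over 100 msec this calls the dynamics step 100 times and the learner step 10 times.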

A learning trial in the simulation ended when either of the following conditions was met.

i) 17700 msec have elapsed (after about 100 steps of walking)
ii) The robot falls (a reward of r = -1 is given at the same time)
(2-5. Acquisition of flat-ground walking and robustness to environmental changes)
FIG. 7 shows a learning curve in which the total reward obtained in one trial is plotted against the trial number. The curve in FIG. 7 is for a ground slope of 0°.

FIG. 7 shows that learning converged after about 350 trials and that a steady walking motion was acquired.

FIG. 8 shows the walking trajectories corresponding to the learning curve of FIG. 7.

In FIG. 8, (1) shows the walking trajectory before learning and (2) the trajectory after 600 learning trials. After 600 trials the stride is longer and the walking speed higher, and a good walking trajectory is obtained.

Moreover, with the parameters learned over 600 trials, the walking motion can be maintained to some extent even when the environment is changed by tilting the ground by a few degrees, and the controller can adapt to the new environment after a few additional learning trials. This is thought to be because the internal state of the policy (here, the internal state of the CPG) and the state of the robot entrain each other and thereby form a robust limit cycle.

FIG. 9 shows, for the walking acquired in FIG. 7, the limit cycle between the internal state of the CPG and the state of the robot as time courses of the CPG internal state z₁ and the leg angle.

FIG. 10 shows, for the walking acquired in FIG. 7, the same limit cycle as the relationship among the leg angle, the leg angular velocity and the CPG internal state z₁.

These results show that the control system of the present invention can sustain the periodic motion even under external disturbances.

(2-6. Relationship between reward and acquired motion)
Table 1 shows how the robot's walking speed depends on the speed-term coefficient k_s in the reward function of equation (41).

Here the coefficient of the hip-height term was set to k_H = 0.5, as in the previous section.

Table 1 shows that the robot's walking speed increases as the speed term is increased. This confirms that the robot can be controlled by changing the learning reward, without explicitly using a controller constructed from the robot's dynamics.

(2-7. Robustness to sensor noise and time delay)
A policy with no internal state was obtained as follows: parameters learned with the walking acquired in FIG. 7 as the teacher signal were used as initial values, the CPG was removed from the learning system of FIG. 4, and 150 learning trials were run, which yielded a biped walking motion. This controller was then compared with the walking motion acquired by the learning of FIG. 7 in terms of robustness to sensor noise and time delay.

The simulation used sensor noise of N(0, 0.01) on x₁ and x₃ and N(0, 0.09) on x₂ and x₄, and a time delay of 20 msec.
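A sketch of how such perturbations might be injected is shown below. It reads the second argument of N(·, ·) as a variance, following the text's statement that N(0, 1) has variance 1, so the standard deviations are 0.1 and 0.3. It also assumes the delay is applied at the 10 msec learner step, which makes 20 msec equal to two steps. `add_sensor_noise` and `DelayLine` are illustrative names.

```python
import random
from collections import deque

NOISE_VAR = (0.01, 0.09, 0.01, 0.09)  # variances for x1, x2, x3, x4
DELAY_STEPS = 2                       # 20 msec at an assumed 10 msec controller step

def add_sensor_noise(x, rng):
    """Add zero-mean Gaussian noise with the per-channel variances above."""
    return [xi + rng.gauss(0.0, var ** 0.5) for xi, var in zip(x, NOISE_VAR)]

class DelayLine:
    """Returns the observation pushed `steps` calls earlier."""
    def __init__(self, steps, initial):
        self.buf = deque([list(initial)] * (steps + 1), maxlen=steps + 1)

    def push(self, x):
        self.buf.append(list(x))
        return self.buf[0]
```

Until the buffer has filled, the delay line returns the initial observation it was constructed with.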

FIG. 11 shows the simulation results for sensor noise and time delay.
In FIG. 11, (1) shows walking produced by the controller with the CPG and (2) walking produced by the controller without the CPG. Panel (a) is walking under normal conditions, (b) walking with sensor noise, and (c) walking with a time delay; the arrow "→" indicates the robot's direction of travel.

Walking produced by the policy without a CPG could not be maintained under either noise or time delay, whereas the learning system shown in FIG. 4 was able to walk in both cases.

This shows that building a policy with an internal state yields a controller that is robust to sensor noise and time delay.

(3. Function approximation by a normalized Gaussian network)
The normalized Gaussian network (NGnet) used to represent the value function and the feedback controller described in 2-3-1 is explained below.

The NGnet is a three-layer network whose hidden units are normalized Gaussian functions.
For an input vector x = (x₁, …, x_n)^T, the activation function of the k-th unit is given by the following equation.

Here c_k is the center of the activation function and M_k is a matrix that determines its shape. The basis function b_k(x) is the activation function φ_k(x) normalized so that the activations sum to 1 at every point, as in equation (43) below.

Here K is the number of basis functions.

With this normalization, b_k(x) is a local basis function where the centers c_k are densely placed, and becomes a global, sigmoid-like basis function at the edges of the distribution of c_k.

The network output is the inner product of the basis functions and the weights, as in equation (44) below.

This output is the output of the normalization processing unit 1030 in FIG. 4.
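The NGnet described above can be sketched as follows. The equations themselves are not reproduced in this text, so the sketch assumes the usual Gaussian activation φ_k(x) = exp(−½‖M_k(x − c_k)‖²); the function name and argument layout are illustrative. Subtracting the largest exponent before exponentiating leaves the normalized basis unchanged and avoids a zero denominator far from all centers.

```python
import math

def ngnet_output(x, centers, shapes, weights):
    """Return (output, basis) of a normalized Gaussian network.

    centers[k] is c_k, shapes[k] is the matrix M_k (list of rows),
    weights[k] is the output weight of unit k.
    """
    exponents = []
    for c, M in zip(centers, shapes):
        d = [xi - ci for xi, ci in zip(x, c)]
        Md = [sum(m * dj for m, dj in zip(row, d)) for row in M]
        exponents.append(-0.5 * sum(v * v for v in Md))
    top = max(exponents)                      # shift for numerical stability
    acts = [math.exp(e - top) for e in exponents]
    total = sum(acts)
    basis = [a / total for a in acts]         # b_k(x), sums to 1
    y = sum(w * b for w, b in zip(weights, basis))
    return y, basis
```

With two one-dimensional units centered at 0 and 1, the basis values are equal midway between the centers, and far beyond the last center the nearest unit takes essentially all of the weight, which is the sigmoid-like behavior at the edge of the center distribution noted above.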

[Modification of Embodiment 1]
The description so far has covered control with the configuration of FIG. 3 (3). As a modification of Embodiment 1, control with the configuration of FIG. 3 (2) is described below.

FIG. 12 shows the overall configuration of a system corresponding to FIG. 3 (2), including the controller and the controlled object.

In FIG. 12, the state observer 2002 consists of state-observer dynamics 2004 and a reinforcement learner 2006 based on the policy gradient method (reinforcement learning).

As described below, the reinforcement learner 2006 in the state observer 2002 outputs a learner output U based on the observed output y of the controlled object, an output based on the state-observer dynamics 2004, and the control output u of the controller 2010. The output function processing unit 2030 supplies the output of the state observer 2002 to the reward calculation unit 2020 based on the estimated state from the state observer 2002. The reward calculation unit 2020 calculates a reward from the output of the state observer 2002, the observed output y from the observed object, and the learner output U, and supplies it to the reinforcement learner 2006.

(Learning the state observer with the policy gradient method)
The structure of the state observer 2002 in the modification of Embodiment 1 is described below.

Writing the estimate of the state as x with "^" placed above it (hereinafter "x hat" in the text), consider the following state observer.

Here, as with an ordinary observer or Kalman filter, the dynamics f(x, u) of the target are first assumed to be known or obtainable by learning. Given the observed output y of the target system, the policy gradient method is used to learn, from the current estimated state x hat and the control output u, how the estimated state should be moved toward the true state.

The objective of the learner is to minimize the error between the state-observer output y hat (y with "^" placed above it) and the output y of the target system.

The reward function computed by the reward calculation unit 2020 is therefore as follows.

Here Q and R are parameters that determine the shape of the reward function. As a result, the learner acquires a feedback input U to the state-observer dynamics 2004, with the following notation.

The feedback input U is represented by the following probability distribution.

The realized value U_j of the j-th output is therefore given by the following equation.

Here n_j(t) ~ N(0, 1), where N(0, 1) denotes, as above, the normal distribution with mean 0 and variance 1. The probability distribution π that generates the feedback input U is updated in the same way as in the learning of biped walking.
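Drawing one realization of the feedback input can be sketched as below. The equation is not reproduced in this text; the sketch assumes the usual form U_j = μ_j + σ_j n_j(t) for a Gaussian policy with mean μ_j and standard deviation σ_j, and `sample_feedback` is an illustrative name.

```python
import random

def sample_feedback(mu, sigma, rng=random):
    """Draw U_j = mu_j + sigma_j * n_j with n_j ~ N(0, 1), independently per output."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

Setting every σ_j to 0 returns the mean itself, which corresponds to running the learned observer without exploration noise.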

This configuration also provides a dynamic control device that makes it easier to learn a state observer for periodic motion, and a biped walking movement device using such a dynamic control device.

[Embodiment 2 (application to a multi-degree-of-freedom system)]
(5.1) Five-link biped walking robot model
Embodiment 2 describes a configuration in which the learning system of FIG. 4 is applied to a multi-degree-of-freedom system.

The learning results for the three-link biped walking robot model of Embodiment 1 show that, with the policy gradient method as the learning method, a desirable policy can be acquired by considering only the state of the physical system.

FIG. 13 illustrates the five-link biped walking robot treated in Embodiment 2 and its simulation model.

Building on the results of Embodiment 1, the method proposed there is applied below to a simulation model (FIG. 13 (b)) of the real five-link robot with knee joints shown in FIG. 13 (a), in order to acquire a walking motion.

As shown in FIG. 13 (b), the simulation model of Embodiment 2 is a five-link system consisting of the right lower leg 10rd (link 1), the right upper leg 10ru (link 2), the waist 40 (link 3), the left upper leg 10lu (link 4) and the left lower leg 10ld (link 5).

The biped walking robot shown in FIG. 13 (b) likewise comprises the dynamic control device 100, a torso 40 provided above the dynamic control device 100, and legs driven under the control of the dynamic control device 100. The legs comprise the right upper leg 10ru, the right lower leg 10rd, the left upper leg 10lu and the left lower leg 10ld.

The biped walking robot shown in FIG. 13 (b) has, on its legs, sensors 20r and 20l provided near the ground-contact surfaces of the right and left feet, a sensor 20rk at the right knee and a sensor 20lk at the left knee. The torso 40 is provided with a sensor (not shown). As in Embodiment 1, the dynamic control device 100 drives the right upper leg 10ru and the left upper leg 10lu through the drive unit 108. Drive units (not shown) are also provided at the left and right knees and drive the right lower leg 10rd and the left lower leg 10ld based on torque signals from the dynamic control device 100.

Sensor 20rk detects the angle θ^r_hip of the right upper leg 10ru relative to the center line 4 of the torso 40 and its angular velocity dθ^r_hip/dt, and sensor 20lk detects the angle θ^l_hip of the left upper leg 10lu relative to the center line 4 and its angular velocity dθ^l_hip/dt; each reports these to the dynamic control device 100. In addition, sensor 20r detects the angle θ^r_knee of the right leg relative to the extension line 6r of the right upper leg 10ru, its angular velocity dθ^r_knee/dt, and the ground-contact state of the right foot, and sensor 20l detects the angle θ^l_knee of the left leg relative to the extension line 6l of the left upper leg 10lu, its angular velocity dθ^l_knee/dt, and the ground-contact state of the left foot; each reports these to the dynamic control device 100.

The sensor provided on the torso 40 detects the pitch angle θ_p of the torso 40 relative to the vertical direction 2 and its angular velocity dθ_p/dt, and reports them to the dynamic control device 100.

Table 1 shows the physical parameters of the five-link biped walking robot model.

(5.2) Learning system for the multi-degree-of-freedom system
The learning system for the five-link biped walking robot model shown in FIG. 13 (b) is described in more detail below.

In the dynamic control device 100, the state of the knee joints is not used for learning. As in FIG. 4 of Embodiment 1, learning uses only the state variables for the hip joints (θ^r_hip, θ^l_hip, dθ^r_hip/dt, dθ^l_hip/dt) and for the pitch angle (θ_p, dθ_p/dt).

For the knee joints, a state machine 1040 that switches the target knee joint angle based on the ground-contact information and the state of the hip joints is used as the controller. The state machine 1040 is described later.

FIG. 14 shows the configuration of the learning system of Embodiment 2 in the dynamic control device 100. Parts identical to those of the dynamic control device 100 of Embodiment 1 shown in FIG. 4 carry the same reference numerals. In FIG. 14, however, the feedback controller 1022′ combines the functions of the feedback controller 1022 and the output saturation processing unit 1024, and the value function processing unit 1032′ combines the functions of the value function processing unit 1032 and the normalization processing unit 1030.

The torques for the left upper leg 10lu and the right upper leg 10ru are supplied by the PD servo processing unit 1028.

Accordingly, the value function processing unit 1032′, the feedback controller 1022′, the CPG processing unit 1026 and the PD servo processing unit 1028 operate basically as in Embodiment 1; only the objects they drive differ.

Torque input to the knee joints, on the other hand, is supplied by the PD servo processing unit 1042 using the target joint angles determined by the state machine 1040.

(Knee joint controller based on the state machine 1040)
FIG. 15 is a conceptual diagram illustrating the operation of the state machine 1040; FIG. 15 (a) shows the states of the right knee and FIG. 15 (b) those of the left knee.

The state cycles through knee flexion → knee extension → knee flexion → knee extension.

For the right knee, for example, the first knee flexion → knee extension pair is the "swing state" of the leg, with target angles θ₁ and θ₂ respectively. The second pair is the "stance state" of the leg, with target angles θ₃ and θ₄ respectively.

For the left knee, by contrast, the first knee flexion → knee extension pair is the "stance state" of the leg, with target angles θ₃ and θ₄ respectively. The second pair is the "swing state" of the leg, with target angles θ₁ and θ₂ respectively.

Using the hip joint angles θ^r_hip and θ^l_hip (collectively "θ_hip") and the ground-contact information, the state machine shown in FIG. 15 determines the target knee angles (θ^r_knee hat and θ^l_knee hat, collectively "θ_knee hat"), and the PD servo below outputs the torque u_knee to the knee joint.

Here K_p is the position gain and K_d is the velocity gain. In the simulation, K_p = 12.0 and K_d = 0.15.
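The PD servo equation is not reproduced in this text. A minimal sketch, assuming the common form u_knee = K_p(θ_knee hat − θ_knee) − K_d dθ_knee/dt with the gains stated above, is given below; the sign convention and the function name `knee_torque` are assumptions.

```python
K_P = 12.0   # position gain from the text
K_D = 0.15   # velocity gain from the text

def knee_torque(theta_target, theta, theta_dot, kp=K_P, kd=K_D):
    """PD servo: pull the knee angle toward the target and damp its velocity."""
    return kp * (theta_target - theta) - kd * theta_dot
```

The torque is zero when the knee rests at the target angle, and the velocity term opposes motion in either direction.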

As shown in FIG. 15, the state machine switches among four target angles (θ₁, θ₂, θ₃, θ₄) according to preset conditions. Here θ₁ = 1.11, θ₂ = 0.56, θ₃ = 0.52 and θ₄ = 0.26. The transition from the bent-knee state to the extended-knee state is made under the following condition on the hip joint.

The transition from the extended-knee state to the bent-knee state is triggered by the ground-contact information from the sole. In the simulation, b = 0.15, for example.
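The four-state cycle can be sketched as follows for the right knee, whose cycle starts with the swing state. The hip-joint inequality involving b is not reproduced in this text, so the sketch takes its result as a boolean argument; the state names, `step`, and the choice of which foot's contact triggers each transition are illustrative assumptions.

```python
from enum import Enum

class KneeState(Enum):
    SWING_FLEX = 1
    SWING_EXTEND = 2
    STANCE_FLEX = 3
    STANCE_EXTEND = 4

# Target angles theta_1 .. theta_4 from the text
TARGET_ANGLE = {
    KneeState.SWING_FLEX: 1.11,
    KneeState.SWING_EXTEND: 0.56,
    KneeState.STANCE_FLEX: 0.52,
    KneeState.STANCE_EXTEND: 0.26,
}

_NEXT = {
    KneeState.SWING_FLEX: KneeState.SWING_EXTEND,
    KneeState.SWING_EXTEND: KneeState.STANCE_FLEX,
    KneeState.STANCE_FLEX: KneeState.STANCE_EXTEND,
    KneeState.STANCE_EXTEND: KneeState.SWING_FLEX,
}

def step(state, hip_condition_met, contact_event):
    """Flexed states advance on the hip condition, extended states on ground contact."""
    flexed = state in (KneeState.SWING_FLEX, KneeState.STANCE_FLEX)
    trigger = hip_condition_met if flexed else contact_event
    return _NEXT[state] if trigger else state
```

The left knee would run the same cycle started from `KneeState.STANCE_FLEX`, and the target angle for the current state would be fed to the knee PD servo.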

(Floor reaction force model)
The floor reaction force model used in the simulation is described next.

With x and y denoting the position of the leg tip and x_g and y_g the ground-contact point, the floor reaction force during contact is modeled by the following equations.

Here F_x and F_y are the horizontal and vertical floor reaction forces.

In the simulations described below, the coefficients in the above equations are k_x = 3000, b_x = 10, k_y = 30000 and b_y = 100. The sole is defined to slip on the floor when the floor reaction force satisfies F_x > μF_y, where μ is the coefficient of static friction, set to μ = 1.0.
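The model equations are not reproduced in this text. The sketch below assumes a linear spring-damper in each direction, F_x = −k_x(x − x_g) − b_x dx/dt and F_y = −k_y(y − y_g) − b_y dy/dt, which is consistent with the coefficient names but remains an assumption. The slip test follows the inequality exactly as stated; a model that treats both sliding directions alike would compare |F_x| with μF_y instead.

```python
def floor_reaction(x, y, vx, vy, xg, yg,
                   kx=3000.0, bx=10.0, ky=30000.0, by=100.0):
    """Spring-damper contact force on the leg tip (assumed form)."""
    fx = -kx * (x - xg) - bx * vx
    fy = -ky * (y - yg) - by * vy
    return fx, fy

def slips(fx, fy, mu=1.0):
    """Slip condition as stated in the text: F_x > mu * F_y."""
    return fx > mu * fy
```

For a leg tip resting 1 mm below the contact point with no velocity, the vertical force is about 30 in the model's units and the horizontal force is zero, so no slip is reported.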

(5.3) Simulation
The simulation results for Embodiment 2 are described below.

Because the leg length of the robot model has changed, the CPG in this simulation needs a lower frequency than that used for the three-link biped walking robot model of FIG. 4, so its parameters are set as follows.

τ = 0.4, τ′ = 0.8, β = 1.3, ω = 2.0, z₀ = 0.1
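As a rough illustration of what these parameters do, the sketch below integrates a standard two-neuron Matsuoka-type neural oscillator with them. The CPG equations of section 1-2 are not reproduced in this text, so the form of the dynamics, the reading of ω as the mutual-inhibition weight, the initial state, and the omission of any feedback input from the robot are all assumptions.

```python
def simulate_cpg(t_end=60.0, dt=0.001, tau=0.4, tau_p=0.8,
                 beta=1.3, w=2.0, z0=0.1):
    """Euler-integrate two mutually inhibiting neurons with adaptation.

    Returns the sequence q1 - q2, where q_i = max(0, z_i).
    """
    z = [0.1, 0.0]    # neuron states; asymmetric start (assumption)
    zp = [0.0, 0.0]   # adaptation states
    diff = []
    for _ in range(int(round(t_end / dt))):
        q = [max(0.0, z[0]), max(0.0, z[1])]
        dz = [(-z[i] - w * q[1 - i] - beta * zp[i] + z0) / tau for i in range(2)]
        dzp = [(-zp[i] + q[i]) / tau_p for i in range(2)]
        for i in range(2):
            z[i] += dt * dz[i]
            zp[i] += dt * dzp[i]
        diff.append(q[0] - q[1])
    return diff

def count_sign_changes(values):
    """Number of switches between strictly positive and strictly negative values."""
    signs = [1 if v > 0 else -1 for v in values if v != 0.0]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)
```

Under these assumptions the two neurons take turns being active, so the difference of their outputs alternates in sign; increasing the time constants τ and τ′ slows this alternation.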
For the basis functions of the normalized Gaussian network NGnet, the centers are placed on a grid over the state variables x = (x₁, x₂, x₃, x₄)^T as follows.

A total of 25600 basis functions were placed uniformly over the following ranges.

The reward function of equation (41) is used, with k_s = 0.06, k_H = 0.5 and h′ = 0.3 to reflect the changed physical parameters.

All other learning parameters are the same as in Embodiment 1, except that β^σ in equation (38) is set to 0.02.

The termination conditions for a learning trial are also the same as in Embodiment 1, namely the following.

・17700 msec have elapsed
・The robot falls (a reward of r = -1 is given at the same time)
(5.4) Learning results
In the simulation of Embodiment 2, as in learning with the three-link biped walking robot, a biped walking motion is defined as acquired when the robot keeps walking without falling for 10 consecutive trials. The simulation was run 10 times, and a biped walking motion was acquired in all of them. The average number of trials needed to acquire the motion was 899.

FIG. 16 shows, for one example of the learning process, the relationship between the number of trials and the total reward obtained, and FIG. 17 shows the walking trajectories before and after learning.

In FIG. 17, (1) is the walking trajectory before learning and (2) is the trajectory after 1500 learning trials.

This result shows that even when the biped robot has more degrees of freedom and the state space becomes high-dimensional, the target periodic motion can be acquired in a small number of trials by observing only a few state variables.

Judging from the results in FIG. 16, walking motion is acquired in considerably fewer trials than in a conventional example that applies reinforcement learning to a five-link biped robot (for example: Y. Nakayama, M. Sato, and S. Ishii. Reinforcement Learning for biped robot. In Proceedings of the 2nd International Symposium on Adaptive Motion of Animals and Machines, pp. Thp-II-5, 2003.).

With the above configuration, even when the biped robot has more degrees of freedom and the state space becomes high-dimensional, it is possible to provide a dynamic control device that makes it easier to learn a state observer for periodic motion, and a biped walking movement device using such a dynamic control device.

The embodiments disclosed here should be regarded as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims rather than by the above description, and is intended to include all modifications within the meaning and scope equivalent to the claims.

FIG. 1 is a conceptual diagram showing an example of the biped walking movement system 1000 using the dynamic control device of the present invention.
FIG. 2 is a block diagram showing the configuration of the dynamic control device 100 shown in FIG. 1.
FIG. 3 is a conceptual diagram for explaining dynamic policies.
FIG. 4 is a functional block diagram showing the processing performed by the arithmetic processing unit 102.
FIG. 5 is a conceptual diagram showing a CPG based on a neural oscillator model.
FIG. 6 shows the waveforms of the variables z₁ and z₂ that make up the CPG output q.
FIG. 7 shows a learning curve in which the total reward obtained in one trial is plotted against the trial number.
FIG. 8 shows the walking trajectories corresponding to the learning curve of FIG. 7.
FIG. 9 shows the limit cycle between the internal state of the CPG and the state of the robot as time courses of the CPG internal state z₁ and the leg angle.
FIG. 10 shows the limit cycle between the internal state of the CPG and the state of the robot as the relationship among the leg angle, the leg angular velocity and the CPG internal state z₁.
FIG. 11 shows the simulation results for sensor noise and time delay.
FIG. 12 shows the overall configuration of the system including the controller and the controlled object.
FIG. 13 illustrates the five-link biped walking robot treated in Embodiment 2 and its simulation model.
FIG. 14 shows the configuration of the learning system of Embodiment 2 in the dynamic control device 100.
FIG. 15 is a conceptual diagram illustrating the operation of the state machine 1040.
FIG. 16 shows, for one example of the learning process, the relationship between the number of trials and the total reward obtained.
FIG. 17 shows the walking trajectories before and after learning.

Explanation of symbols

10r, 10l: legs; 20r, 20l, 30: sensors; 40: torso; 100: dynamic control device; 102: arithmetic processing unit; 104: storage device; 106: communication interface; 108: drive unit; 1000: biped walking movement system.

Claims

A biped walking movement device comprising:
a right leg including a right upper thigh and a right lower leg connected to the right upper thigh via a right knee joint;
a left leg including a left upper thigh and a left lower leg connected to the left upper thigh via a left knee joint;
a first sensor group for detecting ground contact of the right leg or the left leg, and for detecting a right knee angle formed by the right lower leg with respect to the right upper thigh and a left knee angle formed by the left lower leg with respect to the left upper thigh;
a waist connected to the right leg and the left leg, for driving the right leg and the left leg to perform biped walking;
a pitch angle sensor for detecting the pitch angle of the waist; and
a second sensor group for detecting a right joint angle formed by the waist and the right upper thigh and a left joint angle formed by the waist and the left upper thigh,
wherein the waist includes:
a lower leg control unit that generates, based on a state machine for the right lower leg and the left lower leg, a lower leg control signal for driving the right lower leg and the left lower leg;
a dynamic control device that receives the detection results of the second sensor group and the pitch angle sensor and generates an upper leg control signal for driving the right upper thigh and the left upper thigh,
the dynamic control device including control means that estimates the states of the right upper thigh and the left upper thigh based on reinforcement learning applied to a central pattern generator having periodically time-evolving internal states corresponding respectively to the right upper thigh and the left upper thigh, and that generates the upper leg control signal for the right upper thigh and the left upper thigh as the output of a PD servo system with respect to target values corresponding to the internal states; and
drive means for driving the right leg and the left leg based on the upper leg control signal and the lower leg control signal,
wherein the state machine of the lower leg control unit
sequentially transitions, for each of the right leg and the left leg, among a first knee flexion state, a first knee extension state, a second knee flexion state, and a second knee extension state,
the transition from the first knee flexion state to the first knee extension state and the transition from the second knee flexion state to the second knee extension state occur, based on the detection results of the second sensor group, on the condition that the angle formed by the left upper thigh and the right upper thigh falls below a predetermined value,
the transition from the first knee extension state to the second knee flexion state and the transition from the second knee extension state to the first knee flexion state occur, based on the detection results of the first sensor group, in response to ground contact of the left leg or the right leg, and
wherein the drive means includes knee driving means for outputting the torques applied to the right knee and the left knee as the output of a PD servo system, based on predetermined target angles for the right knee angle and the left knee angle determined according to the state of the state machine and on the right knee angle and the left knee angle detected by the first sensor group.
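
The four-state knee cycle and the PD knee servo recited above can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: the class and function names, the threshold, the target angles, and the gains are hypothetical, since the claim only calls them "predetermined". The velocity damping term is also an assumption, with the knee angular velocity taken to be derived from the detected knee angle.

```python
from dataclasses import dataclass

# Hypothetical target knee angles (rad) for each state; the claim only says
# that the targets are "predetermined" according to the state.
TARGET_KNEE_ANGLE = {"flex1": 0.6, "extend1": 0.05, "flex2": 0.3, "extend2": 0.05}


@dataclass
class KneeStateMachine:
    """Cycles flex1 -> extend1 -> flex2 -> extend2 -> flex1 for one leg."""
    thigh_angle_threshold: float = 0.1  # assumed "predetermined value" (rad)
    state: str = "flex1"

    def step(self, inter_thigh_angle: float, ground_contact: bool) -> str:
        if self.state in ("flex1", "flex2"):
            # Flexion -> extension: the angle between the two upper thighs
            # (second sensor group) has fallen below the threshold.
            if inter_thigh_angle < self.thigh_angle_threshold:
                self.state = "extend1" if self.state == "flex1" else "extend2"
        elif ground_contact:
            # Extension -> next flexion: ground contact (first sensor group).
            self.state = "flex2" if self.state == "extend1" else "flex1"
        return self.state


def knee_pd_torque(state, knee_angle, knee_velocity, kp=50.0, kd=2.0):
    """PD servo torque toward the target knee angle of the current state.

    kp and kd are illustrative gains, not values from the source.
    """
    target = TARGET_KNEE_ANGLE[state]
    return kp * (target - knee_angle) - kd * knee_velocity
```

In use, one `KneeStateMachine` instance would be stepped per leg on every control tick, and the returned state would select the target angle passed to `knee_pd_torque`.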

The biped walking movement device according to claim 1, wherein the control means
includes a feedback controller that updates feedback parameters based on state information of the right upper thigh and the left upper thigh and on a reward sequence obtained with the value function in the reinforcement learning, and
generates the upper leg control signal based on the internal state of the central pattern generator, which changes in accordance with the output from the feedback controller.
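
The coupling recited here (feedback controller output, then CPG internal state, then PD servo target) can be sketched as follows. The sketch assumes a two-neuron, mutually inhibiting neural oscillator of the Matsuoka type with internal states z₁, z₂ and output q. The equations, parameter values, gains, and function names are illustrative assumptions and are not taken from the claims.

```python
def cpg_step(z, z_adapt, feedback, dt=0.001, tau=0.05, tau_adapt=0.6,
             w=2.0, beta=2.5, c=1.0):
    """One Euler step of an assumed two-neuron mutually inhibiting oscillator.

    z        : internal states [z1, z2]
    z_adapt  : adaptation (fatigue) states of the two neurons
    feedback : feedback controller output added to each neuron's input
    All parameter values are illustrative assumptions.
    """
    q = [max(0.0, zi) for zi in z]  # firing rates of the two neurons
    z_new, z_adapt_new = [], []
    for i in range(2):
        j = 1 - i  # index of the opposing neuron
        dz = (-z[i] - w * q[j] - beta * z_adapt[i] + c + feedback[i]) / tau
        dz_adapt = (-z_adapt[i] + q[i]) / tau_adapt
        z_new.append(z[i] + dt * dz)
        z_adapt_new.append(z_adapt[i] + dt * dz_adapt)
    return z_new, z_adapt_new


def cpg_output(z):
    """CPG output q: difference of the two rectified internal states."""
    return max(0.0, z[0]) - max(0.0, z[1])


def hip_pd_torque(z, hip_angle, hip_velocity, kp=30.0, kd=1.0, amplitude=0.3):
    """PD servo torque toward a target hip angle set by the CPG output.

    The scaling `amplitude` and the gains are hypothetical.
    """
    target = amplitude * cpg_output(z)
    return kp * (target - hip_angle) - kd * hip_velocity
```

With zero feedback and a slightly asymmetric initial state, this oscillator settles into an alternating rhythm; a nonzero `feedback` term shifts the internal state and therefore the PD target, which is the path through which a learned feedback controller would shape the gait.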

The biped walking movement device according to claim 2, wherein the reinforcement learning is performed by a policy gradient method.
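
As a rough illustration of what a policy gradient method does, the following Python sketch applies a generic REINFORCE-style update with a running-average baseline to a one-parameter Gaussian policy on a toy reward. It is not the learning rule of this patent: the toy reward, the step sizes, and all function names are assumptions chosen only to show the form of the update.

```python
import random


def train_gaussian_policy(reward_fn, theta=0.0, sigma=0.5, alpha=0.005,
                          episodes=5000, seed=0):
    """Adjust the mean `theta` of a Gaussian policy by a policy gradient.

    Each episode samples one action, observes its reward, and moves theta
    along (reward - baseline) * d/dtheta log pi(action | theta).
    """
    rng = random.Random(seed)
    baseline = 0.0  # running average of rewards, used to reduce variance
    for _ in range(episodes):
        action = rng.gauss(theta, sigma)
        reward = reward_fn(action)
        grad_log_pi = (action - theta) / sigma ** 2
        theta += alpha * (reward - baseline) * grad_log_pi
        baseline += 0.1 * (reward - baseline)
    return theta


def toy_reward(action):
    """Hypothetical reward that peaks when the action equals 1.0."""
    return -(action - 1.0) ** 2
```

Calling `train_gaussian_policy(toy_reward)` moves `theta` from 0.0 toward the reward peak at 1.0. In the setting of the claims, the learned parameters would be the feedback parameters of the feedback controller rather than a single scalar.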