JP4408252B2

JP4408252B2 - Control device and program

Info

Publication number: JP4408252B2
Application number: JP2004267307A
Authority: JP
Inventors: 信石井; 泰中村; 和昭麻生
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2004-09-14
Filing date: 2004-09-14
Publication date: 2010-02-03
Anticipated expiration: 2024-09-14
Also published as: JP2006085284A

Description

The present invention relates to a control device, a program, and the like for controlling a multi-degree-of-freedom system such as a robot.

A first conventional technique is the following control device. It controls a biped walking robot: by reinforcement learning, an autonomous adaptive control method, it acquires a control law for a biped walking robot simulator and can thereby control the biped walking robot suitably. The control device includes a learning unit that performs reinforcement learning and a control unit that performs control. The neural oscillator network serving as the control unit has high-dimensional state variables, and so does the biped walking robot simulator. Appropriate control is made possible by allowing the control unit to observe not only the state of the biped walking robot simulator but also the internal state of the neural oscillator network (see Non-Patent Document 1).

A second conventional technique is the following control device. Like the device of Non-Patent Document 1, it includes a learning unit that performs learning and a control unit that performs control. It controls a snake robot, but the state space of the controlled object is very large. Furthermore, the control device searches for the set of via points by a kind of random search using a genetic algorithm (see Non-Patent Document 2).

Furthermore, as related prior art, there is a control device having the following learning algorithm (see Non-Patent Document 3). This control device uses a kind of reinforcement learning algorithm called the Actor-Critic algorithm.
Non-Patent Document 1: Nakamura (中村泰) and two others, "Reinforcement learning method for rhythmic movement using neural oscillator network", IEICE Transactions, March 2004, D-II, Vol. J87, No. 3, pp. 893-902
Non-Patent Document 2: Kazuyuki Ito et al., 2003 IEEE International Conference on Robotics & Automation, Taipei, Taiwan, September 14-19, 2003
Non-Patent Document 3: Kimura (木村元) and one other, "Actor-Critic algorithm using a history of eligibility in the Actor: reinforcement learning under an incomplete Value-Function", Journal of the Japanese Society for Artificial Intelligence, Vol. 15, No. 2, pp. 267-275 (2000)

However, in the first conventional technique, both the neural oscillator network serving as the control unit and the biped walking robot simulator have high-dimensional state variables, and both are observed. The dimensionality handled by the learning unit therefore becomes very large, which makes learning difficult. High-speed learning was also difficult, and learning required a large amount of CPU power.

The second conventional technique has the problem that learning by the learning unit is difficult because of the size of the state space. That is, because the search for the set of via points is a kind of random search using a genetic algorithm, processing efficiency is poor.

The control device of the first invention comprises: an observation unit that observes the state of a controlled device, which is the device to be controlled, and acquires two or more first-type parameters; a feature extraction unit that, based on the two or more first-type parameters acquired by the observation unit, acquires second-type parameters fewer in number than the first-type parameters; and a control unit that controls the controlled device based on the one or more second-type parameters acquired by the feature extraction unit.

The control device of the second invention is the control device of the first invention, wherein the feature extraction unit comprises: conversion means for acquiring two or more second-type parameters based on the two or more first-type parameters acquired by the observation unit; and selection means for selecting one or more second-type parameters from the two or more second-type parameters acquired by the conversion means.
With this configuration, the control device compresses the many observed parameters and can control the controlled device at high speed, or learn a control law for doing so.

The control device of the third invention is the control device of the second invention, wherein the second-type parameters have adjustable variables, and the conversion means changes those variables according to the control result of the control unit and/or the internal state of the control unit.
With this configuration, suitable control is possible with higher accuracy than with the control device of the second invention.

The control device of the fourth invention is the control device of any of the first to third inventions, wherein the control unit comprises: evaluation means for evaluating, based on the one or more second-type parameters acquired by the feature extraction unit, how the controlled device is being controlled, and outputting information indicating whether the evaluation is good or bad; and control means for controlling the controlled device based on the information output by the evaluation means.
Because the control means learns in this configuration, still more suitable control is possible.
The control device of the fifth invention is the control device of any of the first to fourth inventions, wherein the observation unit observes the state of the controlled device and the internal state of the control unit, and acquires two or more first-type parameters.
With this configuration, still more suitable control is possible.

The control device of the sixth invention is the control device of any of the first to fifth inventions, further comprising a selection information receiving unit that receives selection information, which is information used by the feature extraction unit to acquire the second-type parameters, wherein the feature extraction unit also uses the selection information received by the selection information receiving unit to acquire second-type parameters fewer in number than the first-type parameters.
With this configuration, information that serves as a guideline for suitable feature extraction can be supplied from outside, and suitable control is possible.

The control device of the seventh invention is the control device of any of the second to fifth inventions, further comprising a long-term observation unit that acquires the first-type parameters acquired by the observation unit and/or the control result of the control unit and accumulates two or more pieces of the acquired information, wherein the conversion means also uses the information accumulated by the long-term observation unit, or information generated from it, to acquire the two or more second-type parameters.

The control device of the eighth invention is the control device of any of the second to fifth inventions, further comprising a long-term observation unit that acquires the first-type parameters acquired by the observation unit and/or the control result of the control unit and accumulates two or more pieces of the acquired information, wherein the selection means also uses the information accumulated by the long-term observation unit, or information generated from it, to select the one or more second-type parameters.
With this configuration, suitable control that takes into account observation results over a predetermined time or longer is possible.
Needless to say, the operation of the above control devices may be realized by software and a computer or the like.

According to the present invention, it is possible to provide a control device that can control a multi-degree-of-freedom system such as a robot with a small amount of computation.

Hereinafter, embodiments of the control device and the like will be described with reference to the drawings. Components given the same reference numerals in the embodiments operate in the same way, so repeated description may be omitted.
(Embodiment 1)
FIG. 1 is a block diagram of a control system having a control device and a controlled device, which is the device that the control device controls. The control system includes a control device 11 and a controlled device 12.
The control device 11 includes an observation unit 1101, a feature extraction unit 1102, and a control unit 1103. The control device 11 may also be located inside the controlled device 12.

The observation unit 1101 observes the state of the controlled device 12, which is the device to be controlled, and acquires two or more parameters. The parameters acquired here are referred to as first-type parameters. There are many first-type parameters; "many" means, for example, 58 or more. The observation unit 1101 can be realized using, for example, motion capture. Specifically, when the controlled device 12 is a snake robot, for example, the observation unit 1101 can be realized by hardware that combines markers mounted on the snake robot with a tracker that detects the markers, together with software or the like that differentiates the output of that hardware (for example, the position of each link and the angle of each joint of the snake robot) to calculate velocity and angular velocity.

The feature extraction unit 1102 acquires, based on the two or more first-type parameters acquired by the observation unit 1101, parameters fewer in number than the first-type parameters. The parameters acquired by the feature extraction unit 1102 are referred to as second-type parameters. The number of second-type parameters is preferably much smaller than the number of first-type parameters. Specifically, whereas the number of first-type parameters is, for example, 58 or more, the number of second-type parameters is, for example, 3.

The control unit 1103 controls the controlled device 12 based on the one or more second-type parameters acquired by the feature extraction unit 1102. Control means, for example, generating a control torque at each joint so that the snake robot moves forward.

The feature extraction unit 1102 and the control unit 1103 can usually be realized by an MPU, memory, and the like. Their processing procedures are usually implemented in software, which is recorded on a recording medium such as a ROM. They may, however, be implemented in hardware (dedicated circuits).

The controlled device 12 may be any device that is to be controlled. The controlled device 12 is, for example, a robot, such as a snake robot or a humanoid robot that requires biped walking control. The controlled device 12 is a multi-degree-of-freedom system.
The first-type parameters and the second-type parameters are different kinds of parameters with different meanings. Normally, each second-type parameter is constructed from two or more first-type parameters.
The operation of this control system is described below. First, the operation of the control device 11 is described using the flowchart of FIG. 2.

(Step S201) The observation unit 1101 observes the state of the controlled device 12 and determines whether two or more first-type parameters have been acquired. If so, the process proceeds to step S202; if not, the process returns to step S201.

(Step S202) The feature extraction unit 1102 acquires one or more second-type parameters based on the two or more first-type parameters acquired in step S201. The number of second-type parameters here is usually far smaller than the number of first-type parameters observed in step S201.
(Step S203) The control unit 1103 constructs a control signal, which is a signal for controlling the controlled device 12, based on the one or more second-type parameters acquired in step S202.
(Step S204) The control unit 1103 gives the control signal constructed in step S203 to the controlled device 12 and thereby controls the controlled device 12. The process returns to step S201.
In the flowchart of FIG. 2, processing ends when the power is turned off or a processing-termination interrupt occurs.
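Steps S201 to S204 can be sketched as a simple loop. The following is a minimal illustration only, not the implementation in the patent: the function name `run_control_loop`, its callback parameters, and the `max_steps` bound (standing in for the power-off or termination interrupt) are all assumptions introduced here.

```python
from typing import Callable, List, Optional, Sequence


def run_control_loop(
    observe: Callable[[], Optional[Sequence[float]]],
    extract_features: Callable[[Sequence[float]], Sequence[float]],
    build_signal: Callable[[Sequence[float]], List[float]],
    apply_signal: Callable[[List[float]], None],
    max_steps: int,
) -> int:
    """Run the S201-S204 cycle at most max_steps times.

    All callbacks are hypothetical placeholders for the observation unit,
    feature extraction unit, and control unit. Returns the number of
    control signals actually applied.
    """
    applied = 0
    for _ in range(max_steps):  # max_steps stands in for the interrupt
        first_type = observe()  # S201: observe the controlled device
        if first_type is None or len(first_type) < 2:
            continue  # S201 not satisfied: try observing again
        second_type = extract_features(first_type)  # S202: compress
        if not 1 <= len(second_type) < len(first_type):
            raise ValueError("second-type parameters must be fewer in number")
        signal = build_signal(second_type)  # S203: construct control signal
        apply_signal(signal)  # S204: drive the controlled device
        applied += 1
    return applied
```

Passing the units in as callbacks keeps the loop independent of any particular robot, which mirrors the statement that the controlled device may be any multi-degree-of-freedom system.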
Hereinafter, a specific operation of the control system in the present embodiment will be described.

Suppose now that the controlled device 12 is, for example, a snake robot, such as the one shown in FIG. 3(a). A snake propels itself by exploiting the fact that friction is small in the direction tangential to its body axis and large in the normal direction. This snake robot is built from links fitted with passive wheels and joined so as to satisfy that condition. More specifically, the snake robot consists of ten connected box-shaped links with passive wheels on both sides; the weight and size of each link and passive wheel are shown in FIG. 3(b). In FIG. 3(b), "link mass" is the weight of a link, "link length" its length, "link width" its width, "link height" its height, "wheel mass" the weight of a passive wheel, and "wheel radius" the radius of a passive wheel. That is, in FIG. 3(b), the link weight is 1.0 kg; the link length, width, and height are 0.15 m, 0.10 m, and 0.10 m, respectively; the passive wheel weight is 0.5 kg; and the passive wheel radius is 0.07 m. The joint between adjacent links rotates about the z axis, and a torque for rotation can be applied to it. The snake robot is controlled by the torques applied to these joints.

The observation unit 1101 observes state variables (parameters) for the position, velocity, angle, and angular velocity of the links (ten here) of the snake robot. The set of these state variables constitutes the two or more first-type parameters. The number of such state variables is, for example, 58 (10 links x 4 values plus 9 joints x 2 values). The position and velocity of the i-th link are given by (Xi, Yi) and (VXi, VYi), respectively (i is an integer from 1 to 10). The joint angle is denoted Kj and the joint angular velocity Lj (j is an integer from 1 to 9). That is, the first-type parameters are (Xi, Yi, VXi, VYi) and (Kj, Lj) [i = 1 to 10, j = 1 to 9]. The link positions, which are among the first-type parameters, may instead be expressed as the coordinates of a small number (one or more) of reference points (such as the center of gravity) together with relative coordinates indicating positions relative to those reference points.
In this case, the feature extraction unit 1102 evaluates Formulas 1 and 2 below based on (Xi, Yi, VXi, VYi) and (Kj, Lj).
Formula 1 gives a feature quantity (second-type parameter) indicating the orientation of the snake robot. Formula 2 gives a feature quantity (second-type parameter) indicating how strongly the snake robot is bent.

In Formula 1, xi and yi are the x and y coordinates of the center of gravity of the i-th link counted from the head of the snake robot. N is the number of links of the snake robot; here N is, for example, 10. arctan denotes the arctangent.

In Formula 2, Kj is the j-th joint angle, and |Kj| denotes the absolute value of Kj.

The feature extraction unit 1102 then obtains, for example, two second-type parameters (s1, s2) from (Xi, Yi, VXi, VYi) and (Kj, Lj) [i = 1 to 10, j = 1 to 9] using Formulas 1 and 2 above, and passes the second-type parameters (s1, s2) to the control unit 1103.
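The reduction from 58 first-type parameters to two second-type parameters can be illustrated as follows. Formulas 1 and 2 themselves are not reproduced in this text, so the sketch substitutes assumed forms that only match their described meaning: an orientation computed with an arctangent of link center-of-gravity coordinates, and a bend measure built from absolute joint angles. The function names and both formulas are assumptions, not the patent's definitions.

```python
import math
from typing import Sequence, Tuple


def count_first_type_parameters(n_links: int) -> int:
    """Per link: (X, Y, VX, VY); per joint: (K, L); joints = links - 1."""
    return 4 * n_links + 2 * (n_links - 1)


def extract_features(
    xs: Sequence[float], ys: Sequence[float], joint_angles: Sequence[float]
) -> Tuple[float, float]:
    """Return (s1, s2) from link positions and joint angles.

    Index 0 is the head link. Both formulas are ASSUMED stand-ins.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need matching x/y coordinates for at least two links")
    if len(joint_angles) != len(xs) - 1:
        raise ValueError("expected one joint angle between each pair of links")
    # ASSUMED stand-in for Formula 1: angle of the tail-to-head line.
    s1 = math.atan2(ys[0] - ys[-1], xs[0] - xs[-1])
    # ASSUMED stand-in for Formula 2: mean absolute joint angle.
    s2 = sum(abs(k) for k in joint_angles) / len(joint_angles)
    return s1, s2
```

Under these assumptions, a robot lying straight along the x axis with its head at the larger x value yields an orientation of zero and a bend of zero, while larger joint angles of either sign increase the bend value.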

The control unit 1103 controls the controlled device 12 based on a control input signal consisting of the second-type parameters (s1, s2). This control method is described in Embodiment 2. Specifically, the torque to be applied to each joint is determined, and the controlled device 12 operates using the determined torques.

In the specific example above, the 58 first-type parameters observed by the observation unit 1101 are compressed by the feature extraction unit 1102 into two second-type parameters, and the controlled device 12 (the snake robot) can be controlled using those two second-type parameters.

As described above, according to the present embodiment, the control device compresses the many observed parameters and can control the controlled device at high speed, or learn a control law for doing so. This parameter compression rests on the observation that the motion of a controlled device such as a snake robot is far simpler than a description using all of its state variables.
Although the controlled device in the present embodiment is a snake robot, it goes without saying that it may be another multi-degree-of-freedom system, such as a humanoid robot that requires biped walking control.

In the present embodiment, the second-type parameters acquired by the feature extraction unit 1102 are a feature quantity indicating the orientation of the snake robot and a feature quantity indicating how strongly the snake robot is bent, but any other parameters that represent characteristics of the controlled device may be used. The second-type parameters are control parameters and do not necessarily describe a specific motion.

The present embodiment may further include a selection information receiving unit that receives selection information, which is information used by the feature extraction unit 1102 to acquire the second-type parameters, and the feature extraction unit 1102 may also use the selection information received by the selection information receiving unit to acquire second-type parameters fewer in number than the first-type parameters. Here, the selection information is, for example, Formula 1 and Formula 2 themselves, or information sufficient to identify them.

Furthermore, the processing of the control device in the present embodiment may be implemented in software. This software may be distributed by software download or the like, or recorded on a recording medium such as a CD-ROM and distributed. The same applies to the other embodiments in this specification. The software that implements the control device of the present embodiment is the following program: a program for causing a computer to execute an observation step of observing the state of a controlled device, which is the device to be controlled, and acquiring two or more first-type parameters; a feature extraction step of acquiring, based on the two or more first-type parameters acquired in the observation step, second-type parameters fewer in number than the first-type parameters; and a control step of controlling the controlled device based on the one or more second-type parameters acquired in the feature extraction step.
(Embodiment 2)
FIG. 4 is a block diagram of a control system having a control device and a controlled device, which is the device that the control device controls. The control system includes a control device 41 and a controlled device 12.
The control device 41 includes an observation unit 4101, a feature extraction unit 4102, a control unit 4103, and a selection information receiving unit 4104.
The feature extraction unit 4102 includes conversion means 41021 and selection means 41022. The control unit 4103 includes evaluation means 41031 and control means 41032. The control device 41 may also be located inside the controlled device 12.

The observation unit 4101 observes the state of the controlled device 12, which is the device to be controlled, and the internal state of the control unit 4103, and acquires two or more parameters. The parameters acquired here are referred to as first-type parameters. There are many first-type parameters; "many" means, for example, 2400 or more. The internal state of the control unit 4103 is, for example, the state variables of a neural oscillator network, which is described later. The observation unit 4101 can be realized using, for example, motion capture. Specifically, when the controlled device 12 is a snake robot, for example, the observation unit 4101 can be realized by hardware that combines markers mounted on the snake robot with a tracker that detects the markers, software that differentiates the output of that hardware (for example, the position of each link and the angle of each joint of the snake robot) to calculate velocity and angular velocity, and hardware such as an oscilloscope that acquires the internal state of the control unit 4103.
The feature extraction unit 4102 acquires, based on the two or more first-type parameters acquired by the observation unit 4101, second-type parameters fewer in number than the first-type parameters.

The conversion means 41021 acquires two or more second-type parameters based on the two or more first-type parameters acquired by the observation unit 4101. A specific method for doing so is described later.

The selection means 41022 selects one or more second-type parameters from the two or more second-type parameters acquired by the conversion means 41021. The selection means 41022 may select all of the second-type parameters acquired by the conversion means 41021, in which case the selection means 41022 performs no processing.
The control unit 4103 controls the controlled device 12 based on the one or more second-type parameters acquired by the feature extraction unit 4102.

The evaluation means 41031 evaluates, based on the one or more second-type parameters acquired by the feature extraction unit 4102, how the controlled device 12 is being controlled, and outputs information indicating whether the evaluation is good or bad (this value is referred to as the "output value" where appropriate). The situation being evaluated may be the result after the controlled device 12 has been controlled, or the control signal or the like given to the controlled device 12. The information indicating whether the evaluation is good or bad is, for example, a flag that takes either the value "1", indicating a good evaluation, or "0", indicating a bad evaluation; or a value that is largest when the evaluation is best, smallest when it is worst, and intermediate in between according to degree. More specifically, when the snake robot is to be propelled in the x-axis direction, the output value (r) below can be used. The formula for calculating the output value (r) is shown as Formula 3.

Here, Tj represents the magnitude of the torque applied to the j-th joint [j = 1, ..., 9]. Tj is calculated by Formula 4.

数式４において、Ｋ_ｍ，Ｋ_ｓ，Ｓ_ｔ，Ｋ_ｄはそれぞれ筋力に対するゲイン、剛性に対するゲイン、ヘビの硬さ、角速度の差に対するゲインであり，あらかじめ決められた定数である。また、Ｐ_ｊ，Ｑ_ｊはそれぞれｊ番目のリンクの角度、リンクの角速度である。リンクの角速度（Ｑ_ｊ）は、リンクの角度（Ｐ_ｊ）を微分したものである。 In Equation 4, K_m, K_s, S_t, and K_d are, respectively, the gain for muscle force, the gain for stiffness, the stiffness of the snake, and the gain for the difference in angular velocity; all are predetermined constants. P_j and Q_j are the angle and the angular velocity of the j-th link, respectively. The angular velocity of the link (Q_j) is the derivative of the link angle (P_j).

制御手段４１０３２は、評価手段４１０３１が出力した評価の良し悪しを示す情報に基づいて、被制御装置１２を制御する。また、制御手段４１０３２は、かかる制御のために、選択手段４１０２２が選択した第二種パラメータと評価手段４１０３１が出力した評価である出力値に基づいて、制御信号を構成するための制御規則を変更（学習）する。そして、制御手段４１０３２は、かかる学習した制御規則に基づき制御信号を構成する。被制御装置１２がヘビ型ロボットの場合、制御手段４１０３２は、例えば、所定の回数内（例えば、１００回内）で上記のｒで与えられるような出力値の和が大きくなった際のヘビ型ロボットの形状を示す特徴量（第二種パラメータ）を蓄積しておき、その形状になるようにヘビ型ロボットを制御する。 The control means 41032 controls the controlled device 12 based on the information, output by the evaluation means 41031, indicating whether the evaluation is good or bad. For this control, the control means 41032 also changes (learns) the control rule for constructing the control signal, based on the second type parameters selected by the selection means 41022 and the output value, that is, the evaluation output by the evaluation means 41031. The control means 41032 then constructs the control signal based on the learned control rule. When the controlled device 12 is a snake robot, the control means 41032, for example, stores the feature quantities (second type parameters) indicating the shape of the snake robot at the time when the sum of the output values given by r above became large within a predetermined number of times (for example, within 100 times), and controls the snake robot so that it takes that shape.

選択情報受付部４１０４は、特徴抽出部４１０２が第二種パラメータを取得するための情報である選択情報を受け付ける。選択情報とは、ヘビ型ロボットの向きを示す特徴量（第二種パラメータ）を算出するための数式（上記数式１）や、ヘビ型ロボットのくねり具合を示す特徴量（第二種パラメータ）を算出するための数式（上記数式２）等、あるいはそれらを特定するのに十分な情報である。選択情報の入力手段は、キーボードやマウスやメニュー画面によるもの等、何でも良い。選択情報受付部４１０４は、キーボード等の入力手段のデバイスドライバーや、メニュー画面の制御ソフトウェア等で実現され得る。 The selection information reception unit 4104 receives selection information, which is the information used by the feature extraction unit 4102 to acquire the second type parameters. The selection information is, for example, a formula (Equation 1 above) for calculating the feature quantity (second type parameter) indicating the direction of the snake robot, a formula (Equation 2 above) for calculating the feature quantity (second type parameter) indicating the degree of bending of the snake robot, or information sufficient to identify such formulas. Any input means may be used for the selection information, such as a keyboard, a mouse, or a menu screen. The selection information reception unit 4104 can be realized by a device driver for an input means such as a keyboard, by control software for a menu screen, or the like.

特徴抽出部４１０２、制御部４１０３、変換手段４１０２１、選択手段４１０２２、評価手段４１０３１、制御手段４１０３２は、通常、ＭＰＵやメモリ等から実現され得る。特徴抽出部４１０２等の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。
以下、本制御システムを構成する制御装置４１の動作について図５のフローチャートを用いて説明する。 The feature extraction unit 4102, the control unit 4103, the conversion unit 41021, the selection unit 41022, the evaluation unit 41031, and the control unit 41032 can be usually realized by an MPU, a memory, or the like. The processing procedure of the feature extraction unit 4102 or the like is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may be realized by hardware (dedicated circuit).
Hereinafter, the operation of the control device 41 constituting the present control system will be described with reference to the flowchart of FIG. 5.

（ステップＳ５０１）観測部１１０１は、被制御装置１２の状態を観測し、２以上の第一種パラメータを取得したか否かを判断する。２以上の第一種パラメータを取得すればステップＳ５０２に行き、２以上の第一種パラメータを取得しなければステップＳ５０１に戻る。 (Step S501) The observation unit 1101 observes the state of the controlled device 12 and determines whether two or more first-type parameters have been acquired. If two or more first type parameters are acquired, the process proceeds to step S502. If two or more first type parameters are not acquired, the process returns to step S501.

（ステップＳ５０２）特徴抽出部４１０２の変換手段４１０２１は、ステップＳ５０１で取得した２以上の第一種パラメータに基づいて２以上の第二種パラメータを取得する。ここで、第二種パラメータの数は、第一種パラメータの数より多くても良いし、同じでも良い。 (Step S502) The conversion unit 41021 of the feature extraction unit 4102 acquires two or more second type parameters based on the two or more first type parameters acquired in step S501. Here, the number of the second type parameters may be larger than or the same as the number of the first type parameters.

（ステップＳ５０３）特徴抽出部４１０２の選択手段４１０２２は、ステップＳ５０２で取得した２以上の第二種パラメータから１以上の第二種パラメータを選択する。ここで選択された第二種パラメータの数は、第一種パラメータの数より、大幅に減少していることが好適である。例えば、第一種パラメータの数が２４００以上であり、選択された第二種パラメータの数が３であることは好適である。
（ステップＳ５０４）制御手段４１０３２は、ステップＳ５０３で選択された第二種パラメータに基づいて、制御信号を構成する。具体的な制御信号の構成方法については後述する。
（ステップＳ５０５）制御手段４１０３２は、ステップＳ５０４で構成した制御信号を被制御装置１２に与え、被制御装置１２を制御する。 (Step S503) The selection unit 41022 of the feature extraction unit 4102 selects one or more second type parameters from the two or more second type parameters acquired in step S502. The number of second type parameters selected here is preferably significantly smaller than the number of first type parameters. For example, it is preferable that the number of first type parameters is 2400 or more and the number of selected second type parameters is three.
(Step S504) The control means 41032 constructs a control signal based on the second type parameters selected in step S503. A specific method for constructing the control signal will be described later.
(Step S505) The control means 41032 gives the control signal constructed in step S504 to the controlled device 12 and controls the controlled device 12.

（ステップＳ５０６）評価手段４１０３１はステップＳ５０３で選択された第二種パラメータ（または、それに加えて制御手段４１０３２の出力する制御信号）に基づいて、制御手段４１０３２による制御規則に対する評価を出力する。
（ステップＳ５０７）制御手段４１０３２は、ステップＳ５０３で選択された第二種パラメータとステップＳ５０６で評価手段４１０３１が出力した評価に基づいて、制御信号を構成するための制御規則を変更（学習）する。ステップＳ５０１に戻る。なお、ステップＳ５０７において、制御手段４１０３２は、第二種パラメータを直接的に使用せず、評価手段４１０３１が出力した評価に基づいて、制御信号を構成するための制御規則を変更（学習）しても良い。 (Step S506) The evaluation unit 41031 outputs an evaluation of the control rule by the control unit 41032 based on the second type parameter selected in step S503 (or in addition to the control signal output from the control unit 41032).
(Step S507) The control unit 41032 changes (learns) the control rule for constructing the control signal based on the second type parameters selected in step S503 and the evaluation output by the evaluation unit 41031 in step S506. The process then returns to step S501. Note that in step S507 the control unit 41032 may instead change (learn) the control rule based on the evaluation output by the evaluation unit 41031 alone, without directly using the second type parameters.
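The flow of steps S501 to S507 can be sketched as a single function. This is only an illustrative outline: the function name `control_cycle` and the seven callables passed to it are hypothetical stand-ins for the observation unit, conversion means, selection means, control means, and evaluation means, and are not taken from the patent.

```python
def control_cycle(observe, convert, select, build_signal, apply_signal,
                  evaluate, learn):
    """Run one pass through steps S501-S507 and return the evaluation.

    All seven arguments are hypothetical callables supplied by the caller.
    Returns None when fewer than two first type parameters were observed,
    in which case the caller simply tries again (the loop back to S501).
    """
    first_type = list(observe())            # S501: observe the controlled device
    if len(first_type) < 2:
        return None
    second_type = convert(first_type)       # S502: convert to second type parameters
    selected = select(second_type)          # S503: keep only a few of them
    signal = build_signal(selected)         # S504: construct the control signal
    apply_signal(signal)                    # S505: drive the controlled device
    score = evaluate(selected, signal)      # S506: good/bad evaluation
    learn(selected, score)                  # S507: change (learn) the control rule
    return score
```

A caller would invoke `control_cycle` repeatedly until a power-off or end-of-processing interrupt occurs.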

なお、図５のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。また、評価手段４１０３１と制御手段４１０３２は、非特許文献３に記述されるアルゴリズムなどによって学習可能である。
以下、本実施の形態における制御システムの具体的な動作について説明する。今、被制御装置１２は、例えば、図３に示すヘビ型ロボットである。 In the flowchart of FIG. 5, the process ends on a power-off or end-of-processing interrupt. The evaluation unit 41031 and the control unit 41032 can be trained by, for example, the algorithm described in Non-Patent Document 3.
Hereinafter, a specific operation of the control system in the present embodiment will be described. Here, the controlled device 12 is, for example, the snake robot shown in FIG. 3.

また、観測部４１０１は、上記ヘビ型ロボットの各リンク（ここでは１０個）の位置、速度、角度、角速度の状態変数、および神経振動子ネットワークの状態変数を観測する。神経振動子ネットワークの状態変数は、制御部４１０３の内部状態であり、観測部４１０１が観測する第一種パラメータであり、その数は２４００以上となる。 The observation unit 4101 observes the state variables of position, velocity, angle, and angular velocity of each link (here, 10 links) of the snake robot, as well as the state variables of the neural oscillator network. The state variables of the neural oscillator network are the internal state of the control unit 4103 and are first type parameters observed by the observation unit 4101; they number 2400 or more.

制御部４１０３は、ここでは、脊髄をモデル化したもので、ニューロン素子がネットワーク状に結合した神経振動子ネットワークによって実装される。神経振動子ネットワークは例えば、図６に示すように、左右のＭＮ，ＥＩＮ，ＬＩＮとＣＣＩＮの４種類の素子で構成された神経節が図７に示すように１００個繋がったものである。
また、各素子は、例えば、以下の数式５により、時間発展を行う。

Here, the control unit 4103 is a model of the spinal cord, and is implemented by a neural oscillator network in which neuron elements are connected in a network form. For example, as shown in FIG. 6, the neural oscillator network is formed by connecting 100 ganglia composed of four types of elements, that is, left and right MN, EIN, LIN, and CCIN, as shown in FIG.
In addition, each element evolves in time according to, for example, Equation 5 below.

ここで，Ｃ１，Ｃ２，Ｃ３，Ｃ４，Ｃ５，ｗ０とｗｉはあらかじめ決められた定数で、Ψ＋とΨ‐はそれぞれ結合している素子のインデックスの集合を示す。ｍａｘ（ｘ１，ｘ２）はｘ１とｘ２の大きい方を出力する関数である。上記の微分方程式をコンピュータ上で時刻ｔについて数値積分を行うことで素子の動作をシミュレートできる。あるいは上記の微分方程式に従うアナログハードウェアを用いても良い。ＤＴは数値積分の刻み幅である。ｖは素子の出力である。（Ｉ，Ｉ＋，Ｉ−）を内部状態変数と呼び，コンピュータ上で実装する場合には，数式６のように更新（数値積分）される。

また、Ｕは各素子に対する外部入力（＝トニック入力）で、数式７によって計算される。

Here, C1, C2, C3, C4, C5, w0, and wi are predetermined constants, and Ψ+ and Ψ− each denote a set of indices of the coupled elements. max(x1, x2) is a function that outputs the larger of x1 and x2. The operation of an element can be simulated by numerically integrating the above differential equation with respect to time t on a computer. Alternatively, analog hardware that obeys the above differential equation may be used. DT is the step size of the numerical integration. v is the output of the element. (I, I+, I−) are called the internal state variables; when implemented on a computer, they are updated (numerically integrated) as shown in Equation 6.
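Numerical integration with step size DT can be illustrated with a forward-Euler update. The sketch below is an assumption-laden illustration: Equations 5 and 6 are not reproduced in this text, so a simple leaky integrator (`leaky_derivative`, a hypothetical stand-in) is used in place of the actual element dynamics, and forward Euler is assumed as the integration scheme.

```python
def euler_step(state, derivative, dt):
    """Advance every internal state variable by one forward-Euler step."""
    rates = derivative(state)
    return [x + dt * dx for x, dx in zip(state, rates)]


def leaky_derivative(state, drive=1.0, tau=1.0):
    """Hypothetical stand-in dynamics: dI/dt = (drive - I) / tau.

    This is NOT Equation 5; it only gives euler_step something to integrate.
    """
    return [(drive - x) / tau for x in state]


if __name__ == "__main__":
    DT = 0.1                      # step size of the numerical integration
    internal = [0.0]
    for _ in range(50):
        internal = euler_step(internal, leaky_derivative, DT)
    print(internal)               # approaches the drive value 1.0
```

A smaller DT follows the continuous dynamics more closely at the cost of more update steps per unit of simulated time.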

U is an external input (= tonic input) for each element, and is calculated by Equation 7.

方策は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）が与えられたときに、制御信号（ｕ）が出力される確率Ｐとして与えられるが，他の例としては，確率Ｐによるｕの期待値などでも良い。具体的には、Ｐの例としては、（ｓ１，ｓ２，ｓ３）から決まる平均値と、パラメータθ０で与えられる分散を持つ正規分布が用いられる。 The policy is given as the probability P that the control signal (u) is output when the three second type parameters (s1, s2, s3) are given; as another example, it may be the expected value of u under the probability P. Specifically, as an example of P, a normal distribution is used whose mean is determined from (s1, s2, s3) and whose variance is given by the parameter θ0.

また、制御部４１０３が出力する制御信号は、神経節の左右の各々に存在する運動ニューロン（ＮＭ_Ｌ，ＭＮ_Ｒ）の出力（図７参照）である。この制御信号がヘビ型ロボット（被制御装置１２）の各関節のトルクとして与えられ、ヘビ型ロボットの体を屈曲させる。そして、ここでは、１０個のリンクが繋がったものとして、被制御装置１２がモデル化されており、ｉ番目の関節に加わるトルクは１０ｉセグメントの運動ニューロンの出力により数式８により算出される。

Further, the control signal output by the control unit 4103 is the output of the motor neurons (MN_L, MN_R) located on the left and right sides of each ganglion (see FIG. 7). This control signal is given as the torque of each joint of the snake robot (controlled device 12) and bends the body of the snake robot. Here, the controlled device 12 is modeled as 10 connected links, and the torque applied to the i-th joint is calculated by Equation 8 from the outputs of the motor neurons of segment 10i.

数式８において、Ｋ_ｍ，Ｋ_ｓ，Ｓ_ｔ，Ｋ_ｄはそれぞれ筋力に対するゲイン、剛性に対するゲイン、ヘビの硬さ、角速度の差に対するゲインであり，あらかじめ決められた定数である。また、Ｐ_ｉ，Ｑ_ｉはそれぞれｉ番目のリンクの角度、リンクの角速度である。リンクの角速度（Ｑ_ｉ）は、リンクの角度（Ｐ_ｉ）を微分したものである。 In Equation 8, K_m, K_s, S_t, and K_d are, respectively, the gain for muscle force, the gain for stiffness, the stiffness of the snake, and the gain for the difference in angular velocity; all are predetermined constants. P_i and Q_i are the angle and the angular velocity of the i-th link, respectively. The angular velocity of the link (Q_i) is the derivative of the link angle (P_i).

また、制御部４１０３は、脳幹からの信号をモデル化した上述のトニック入力により制御されており、左右のニューロンに同じ大きさの入力を与えると、その入力の大きさによって運動ニューロンの出力（制御信号）が変化する。当該変化した制御信号をヘビ型ロボット（被制御装置１２）に、例えば数式８により与えることにより、ヘビ型ロボットの体の形状が変わる。一方、左右のニューロンに異なる大きさの入力を与えると、左右の運動ニューロンに出力の差が生まれる。かかる出力の差をヘビ型ロボット（被制御装置１２）に与えることにより、当該差に従って、ヘビ型ロボットの形状変化を通じて進行方向が変化する。
以下、観測部４１０１が第一種パラメータを取得した後、ヘビ型ロボットに与える制御信号を構成するまでの処理を詳細に説明する。 The control unit 4103 is controlled by the above-described tonic input, which models a signal from the brainstem. When inputs of the same magnitude are given to the left and right neurons, the output of the motor neurons (the control signal) changes according to the magnitude of the input. Giving the changed control signal to the snake robot (controlled device 12), for example via Equation 8, changes the body shape of the snake robot. On the other hand, when inputs of different magnitudes are given to the left and right neurons, a difference arises between the outputs of the left and right motor neurons. When this output difference is given to the snake robot (controlled device 12), the traveling direction changes, through a change in the shape of the snake robot, in accordance with the difference.
The processing from the acquisition of the first type parameters by the observation unit 4101 up to the construction of the control signal given to the snake robot is described in detail below.

まず、変換手段４１０２１は、観測部４１０１が観測した２４００以上の第一種パラメータを以下のように変換し、第二種パラメータを取得する。ここでの第二種パラメータは、例えば、第一種パラメータの数と同数である、とする。ここでの、第二種パラメータは、例えば、少なくともヘビ型ロボットの向きを示す特徴量、ヘビのくねり具合いを示す特徴量、神経振動子ネットワークの状態を示す特徴量を有する。具体的には，数式４から数式６までの特徴量を含むものとする。 First, the conversion means 41021 converts 2400 or more first-type parameters observed by the observation unit 4101 as follows, and acquires second-type parameters. Here, it is assumed that the number of second type parameters is the same as the number of first type parameters, for example. Here, the second-type parameters include, for example, at least a feature amount indicating the direction of the snake robot, a feature amount indicating the snake bend condition, and a feature amount indicating the state of the neural oscillator network. Specifically, it is assumed that the feature amounts from Equation 4 to Equation 6 are included.

次に、選択手段４１０２２は、変換手段４１０２１が取得した第二種パラメータから、以下の３つの第二種パラメータを選択する。３つの第二種パラメータとは、ヘビ型ロボットの向きを示す特徴量、ヘビのくねり具合いを示す特徴量、神経振動子ネットワークの状態を示す特徴量である。これらの第二種パラメータは、例えば設計者の先見的な知識（かかる知識は、通常、選択手段４１０２２が保持している）により選択されるが，それ以外の各種手段によるものでも良い。

Next, the selection means 41022 selects the following three second type parameters from the second type parameters acquired by the conversion means 41021. The three second type parameters are a feature quantity indicating the direction of the snake robot, a feature quantity indicating the degree of bending of the snake, and a feature quantity indicating the state of the neural oscillator network. These second type parameters are selected based on, for example, the designer's a priori knowledge (such knowledge is usually held by the selection means 41022), but they may be selected by various other means.

ここで、ヘビ型ロボットの向きを示す特徴量ｓ１（第二種パラメータ）は数式９で算出される。ヘビ型ロボットのくねり具合を示す特徴量ｓ２（第二種パラメータ）は数式１０で算出される。神経振動子ネットワークの状態を示す特徴量ｓ３（第二種パラメータ）は数式１１で算出される。

Here, the feature quantity s1 (second type parameter) indicating the direction of the snake-like robot is calculated by Equation 9. A feature quantity s2 (second type parameter) indicating the bend state of the snake robot is calculated by Expression 10. A feature quantity s3 (second type parameter) indicating the state of the neural oscillator network is calculated by Expression 11.

数式９において、ｘｉとｙｉは、それぞれヘビ型ロボットの頭側からｉ番目のリンクの重心位置のｘ座標とｙ座標である。Ｎはヘビ型ロボットのリンク数を示す。Ｎは、ここでは、例えば、１０である。また，ａｒｃｔａｎは、逆正接である。

数式１０において、Ｋｊはｊ番目の関節角であり、｜Ｋｊ｜は，Ｋｊの絶対値を示す。

すなわち，数式１１は神経振動子ネットワーク内の全運動ニューロン（ＭＮＬ_ｋ，ＭＮＲ_ｋ）の出力の和である
選択手段４１０２２は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）を選択し、当該３つの第二種パラメータを制御手段４１０３２および評価手段４１０３１に渡す。 In Equation 9, xi and yi are, respectively, the x-coordinate and y-coordinate of the center of gravity of the i-th link counted from the head of the snake robot. N is the number of links of the snake robot; here, for example, N is 10. arctan is the arctangent.

In Equation 10, Kj is the j-th joint angle, and | Kj | represents the absolute value of Kj.

That is, Equation 11 is the sum of the outputs of all motor neurons (MNL_k, MNR_k) in the neural oscillator network. The selection means 41022 selects the three second type parameters (s1, s2, s3) and passes them to the control means 41032 and the evaluation means 41031.
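The reduction from many observed values to three features can be sketched as follows. Only `network_feature` follows the text directly (the sum of all motor neuron outputs). Equations 9 and 10 are not reproduced in this text, so `heading_feature` and `bend_feature` are hypothetical stand-ins built from the quantities the text names (link centroid coordinates with an arctangent, and absolute joint angles); the actual formulas may differ.

```python
import math


def heading_feature(xs, ys):
    """Stand-in for s1: angle of the line from the tail link to the head link.

    xs[0], ys[0] is the head-side link centroid. math.atan2 is used instead
    of a plain arctangent so that a vertical body does not divide by zero.
    """
    return math.atan2(ys[0] - ys[-1], xs[0] - xs[-1])


def bend_feature(joint_angles):
    """Stand-in for s2: mean absolute joint angle |K_j|."""
    return sum(abs(k) for k in joint_angles) / len(joint_angles)


def network_feature(mn_left, mn_right):
    """s3 as described in the text: sum of all motor neuron outputs."""
    return sum(mn_left) + sum(mn_right)


if __name__ == "__main__":
    xs = [9.0 - i for i in range(10)]      # 10 links laid out along the x-axis
    ys = [0.0] * 10
    print(heading_feature(xs, ys))         # 0.0: body points along +x
    print(bend_feature([0.5, -0.5]))       # 0.5
    print(network_feature([1.0, 2.0], [3.0]))  # 6.0
```

Whatever the exact formulas, the point of the step is that the learner sees three numbers rather than the 2400 or more first type parameters.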

次に、制御手段４１０３２は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）を受け取り、方策に従い制御信号を出力する。方策は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）が与えられたときに、以下の数式１２によって定義される。

Next, the control means 41032 receives the three second type parameters (s1, s2, s3) and outputs a control signal according to the policy. Given the three second type parameters (s1, s2, s3), the policy is defined by Equation 12 below.

方策は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）が与えられたときに、制御信号（ｕ）が出力される確率Ｐとして与えられるが，他の例としては，確率Ｐによるｕの期待値などでも良い。具体的には、Ｐの例としては、（ｓ１，ｓ２，ｓ３）から決まる平均値と、パラメータθ０で与えられる分散を持つ正規分布が用いられる。また、以下の数式１３は制御信号ｕが（Ｕｒ，Ｕｌ）の２次元ベクトルで与えられる場合の平均値の計算例である。

ここで、ＵＢｒとＵＢｌはそれぞれＵｒとＵｌの平均値である。 The policy is given as the probability P that the control signal (u) is output when the three second type parameters (s1, s2, s3) are given; as another example, it may be the expected value of u under the probability P. Specifically, as an example of P, a normal distribution is used whose mean is determined from (s1, s2, s3) and whose variance is given by the parameter θ0. Equation 13 below is an example of calculating the mean when the control signal u is given as the two-dimensional vector (Ur, Ul).

Here, UBr and UBl are the mean values of Ur and Ul, respectively.
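A Gaussian policy of the kind described, with a mean determined by (s1, s2, s3) and a variance θ0, might be sampled as in the sketch below. The linear form of the mean in `policy_mean` is an assumption made for illustration (Equation 13 is not reproduced in this text), and the sketch draws a scalar u, whereas the text's example uses the two-dimensional vector (Ur, Ul); one such draw per component would be needed.

```python
import random


def policy_mean(features, theta):
    """Hypothetical mean: a weighted sum of the three features."""
    return sum(w * s for w, s in zip(theta, features))


def sample_control(features, theta, theta0, rng=None):
    """Draw u from a normal distribution with the given mean and variance theta0."""
    rng = rng if rng is not None else random.Random()
    mean = policy_mean(features, theta)
    # random.Random.gauss takes a standard deviation, so take the square root.
    return rng.gauss(mean, theta0 ** 0.5)


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_control((0.1, 0.2, 0.3), (1.0, 1.0, 1.0), 0.01, rng))
```

Sampling around the mean is what lets the learner explore; using the expected value of u instead, as the text allows, corresponds to returning `policy_mean` directly.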

また、評価手段４１０３１は、３つの第二種パラメータ（ｓ１，ｓ２，ｓ３）を受け取り、システムの状態の良さを第二種パラメータ（ｓ１，ｓ２，ｓ３）に基づいて計算する。そして、評価手段４１０３１がこの計算を行うための関数は，状態価値関数という。状態価値関数とは、例えば、公知技術である強化学習法でよく用いられるＴＤ（λ）アルゴリズムを用いて獲得される関数である。評価手段４１０３１は、状態価値関数に第二種パラメータ（ｓ１，ｓ２，ｓ３）を代入して、出力値を得る。
例えば，出力値Ｖは、数式１４のように計算される。

ここで，β，ＳＢｉ，ｊ（ｉは１からＭまで，ｊは１から３までの整数）は定数で、ｄｉは強化学習によって獲得されるパラメータである。Ｍはあらかじめ決められた定数である。 The evaluation means 41031 receives the three second type parameters (s1, s2, s3) and calculates how good the state of the system is based on them. The function with which the evaluation means 41031 performs this calculation is called the state value function. The state value function is, for example, a function acquired using the TD(λ) algorithm, which is widely used in reinforcement learning, a known technique. The evaluation means 41031 substitutes the second type parameters (s1, s2, s3) into the state value function to obtain the output value.
For example, the output value V is calculated as in Expression 14.

Here, β and SBi,j (i is an integer from 1 to M, j is an integer from 1 to 3) are constants, and di is a parameter acquired by reinforcement learning. M is a predetermined constant.
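The symbols listed for Equation 14 (a constant β, M fixed centers SBi,j in the three-dimensional feature space, and M learned weights di) are consistent with a radial-basis-function approximator. Since Equation 14 itself is not reproduced in this text, the sketch below assumes that form; the Gaussian kernel and the function name `state_value` are illustrative assumptions, not the patent's definition.

```python
import math


def state_value(features, centers, weights, beta):
    """Assumed RBF form: V(s) = sum_i d_i * exp(-beta * ||s - SB_i||^2).

    features: (s1, s2, s3)
    centers:  M fixed points SB_i, each with three coordinates (constants)
    weights:  M learned parameters d_i
    """
    total = 0.0
    for d_i, center in zip(weights, centers):
        dist2 = sum((s - c) ** 2 for s, c in zip(features, center))
        total += d_i * math.exp(-beta * dist2)
    return total


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(state_value((0.0, 0.0, 0.0), centers, [2.0, 1.0], beta=1.0))
```

Under this assumed form V is linear in the weights di, so the derivative of V with respect to di is simply the i-th kernel value.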

そして、制御手段４１０３２は、評価手段４１０３１の出力値に基づき、方策パラメータの学習を行う。つまり、制御手段４１０３２は、評価手段４１０３１の出力値に基づき、ヘビ型ロボットがスムーズに進むように方策パラメータを決定し、制御信号を構成する。方策パラメータの学習の処理について、図８、図９に示すフローチャートを用いて説明する。
なお、図８、図９で説明する方策パラメータの学習は、制御と学習の手順である。そして、学習の例として，非特許文献３に記述されるアルゴリズムに基づいた学習を行う。 Then, the control unit 41032 learns the policy parameter based on the output value of the evaluation unit 41031. That is, the control unit 41032 determines a policy parameter based on the output value of the evaluation unit 41031 so that the snake robot proceeds smoothly, and configures a control signal. The policy parameter learning process will be described with reference to the flowcharts shown in FIGS.
The policy parameter learning described with reference to FIGS. 8 and 9 is a control and learning procedure. As an example of learning, learning based on an algorithm described in Non-Patent Document 3 is performed.

（ステップＳ８０１）初期化を行う。初期化とは、以下の処理である。つまり、制御手段４１０３２と評価手段４１０３１のパラメータθｉ（方策パラメータ）とｄｉの初期値を決める。それらのパラメータ学習のための補助パラメータΘｉとＤｉを０に設定する。また，ｔを０に設定する。
（ステップＳ８０２）観測部４１０１が被制御装置１２と制御部４１０３を観測し、第一種パラメータＸ（ｔ）を出力する。 (Step S801) Initialization is performed. Initialization is the following process. That is, the initial values of the parameters θi (policy parameters) and di of the control means 41032 and the evaluation means 41031 are determined. The auxiliary parameters Θi and Di for learning these parameters are set to zero. Also, t is set to 0.
(Step S802) The observation unit 4101 observes the controlled device 12 and the control unit 4103, and outputs the first type parameter X (t).

（ステップＳ８０３）特徴抽出部４１０２が第一種パラメータＸ（ｔ）を取得し，変換手段４１０２１により第二種パラメータに変換し，選択手段４１０２２によりｓ１（ｔ），ｓ２（ｔ），ｓ３（ｔ）を選択し，制御部４１０３に出力する。同時に制御部４１０３が数式３で計算されるｒ（ｔ）を取得する。なお、ｒ（ｔ）とは、時刻ｔにおけるｒである。
（ステップＳ８０４）制御手段４１０３２は、ｓ１（ｔ），ｓ２（ｔ），ｓ３（ｔ）を取得し，制御信号を生成する。
（ステップＳ８０５）制御部４１０３の学習サブルーチン（図９により説明する）を実行する。
（ステップＳ８０６）被制御装置１２がステップＳ８０４で生成された制御信号により制御される。
（ステップＳ８０７）ｔ＝ｔ＋１として，ステップＳ８０２に戻る。
なお、図８のフローチャートにおいて、電源オフや処理終了の割り込みにより処理は終了する。
次に、図９のフローチャートを用いて、制御部４１０３の学習サブルーチンについて説明する。
（ステップＳ９０１）時刻ｔが０であるか否かを判断する。０でなければステップＳ９０２に行き、０であれば上位関数にリターンする。
(Step S803) The feature extraction unit 4102 acquires the first type parameter X(t), converts it into second type parameters by the conversion means 41021, selects s1(t), s2(t), and s3(t) by the selection means 41022, and outputs them to the control unit 4103. At the same time, the control unit 4103 acquires r(t), calculated by Equation 3. Note that r(t) is r at time t.
(Step S804) The control means 41032 acquires s1 (t), s2 (t), and s3 (t), and generates a control signal.
(Step S805) The learning subroutine (described with reference to FIG. 9) of the control unit 4103 is executed.
(Step S806) The controlled device 12 is controlled by the control signal generated in step S804.
(Step S807) t is set to t + 1, and the process returns to step S802.
In the flowchart of FIG. 8, the process ends on a power-off or end-of-processing interrupt.
Next, the learning subroutine of the control unit 4103 will be described using the flowchart of FIG. 9.
(Step S901) It is determined whether or not time t is 0. If it is not 0, the process goes to step S902; if it is 0, the process returns to the calling routine.

（ステップＳ９０２）評価手段４１０３１は時刻ｔ−１の第二種パラメータと時刻ｔの第二種パラメータを用いて，数式（Ｖ（ｓ１，ｓ２，ｓ３））に従い，Ｖ（ｓ１（ｔ−１），ｓ２（ｔ−１），ｓ３（ｔ−１））とＶ（ｓ１（ｔ），ｓ２（ｔ），ｓ３（ｔ））を計算し，さらに時刻ｔ−１における数式３で計算されるｒ（ｔ−１）を用いてＴＤ誤差を、数式１５により計算する。

数式１５において、γはあらかじめ決められた定数（０<γ<１）である。
（ステップＳ９０３）評価手段４１０３１は、以下の数式１６を計算し、パラメータをｄｉ＝ｄｉ＋η_ｄ×δ（ｔ−１）×Ｄｉのように更新する。

ここで，λはあらかじめ決められた定数（０<λ<１）である。また，η_ｄは学習係数と呼ばれる小さな正数である。
（ステップＳ９０４）制御手段４１０３２は、以下の数式１７を計算し、パラメータをθｉ＝θｉ＋η_θ×δ（ｔ−１）×Θｉのように更新する。

また、ここでのη_θは学習係数と呼ばれる小さな正数である。さらに、ここで、数式１８は、パラメータθｉについての変微分を表す記号である。

なお、ヘビ型ロボットの制御課題において制御信号は左右のニューロン群に対するトニック入力ＵｒとＵｌで，それぞれ、数式１９により定義した。

(Step S902) Using the second type parameters at time t−1 and at time t, the evaluation means 41031 calculates V(s1(t−1), s2(t−1), s3(t−1)) and V(s1(t), s2(t), s3(t)) according to the formula for V(s1, s2, s3), and then calculates the TD error by Equation 15 using r(t−1), which is calculated by Equation 3 at time t−1.

In Equation 15, γ is a predetermined constant (0 < γ < 1).
(Step S903) The evaluation unit 41031 calculates Equation 16 below and updates the parameter as di = di + η_d × δ(t−1) × Di.

Here, λ is a predetermined constant (0 < λ < 1), and η_d is a small positive number called the learning coefficient.
(Step S904) The control unit 41032 calculates Equation 17 below and updates the parameter as θi = θi + η_θ × δ(t−1) × Θi.

Further, η_θ here is a small positive number called the learning coefficient. Equation 18 is the symbol denoting the partial derivative with respect to the parameter θi.
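For reference, the update rules of steps S902 to S904 can be collected in one place. The two parameter updates are written as stated in the text. The TD error and the two trace recursions are given in the standard TD(λ) actor-critic form, which is an assumption here because Equations 15 to 17 are not reproduced in this text; s(t) abbreviates (s1(t), s2(t), s3(t)).

```latex
\begin{align}
\delta(t-1) &= r(t-1) + \gamma\,V\bigl(s(t)\bigr) - V\bigl(s(t-1)\bigr)
  && \text{(assumed form of Equation 15)}\\
D_i &\leftarrow \gamma\lambda\,D_i + \frac{\partial V\bigl(s(t-1)\bigr)}{\partial d_i}
  && \text{(assumed form of Equation 16)}\\
d_i &\leftarrow d_i + \eta_d\,\delta(t-1)\,D_i
  && \text{(step S903, as stated)}\\
\Theta_i &\leftarrow \gamma\lambda\,\Theta_i
  + \frac{\partial \ln P\bigl(u(t-1)\mid s(t-1)\bigr)}{\partial \theta_i}
  && \text{(assumed form of Equation 17)}\\
\theta_i &\leftarrow \theta_i + \eta_\theta\,\delta(t-1)\,\Theta_i
  && \text{(step S904, as stated)}
\end{align}
```

A positive TD error means the last transition turned out better than the value function predicted, so both the value weights and the policy parameters are moved in the direction recorded by their traces.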

In the control task of the snake robot, the control signals are tonic inputs Ur and Ul for the left and right neuron groups, which are defined by Equation 19, respectively.

ここで，Ｃ６はあらかじめ決められた定数である。また，ε１（ｔ−１）とε２（ｔ−１）はそれぞれ平均０，分散ｂの正規分布から生成される乱数である（このような乱数を生成するアルゴリズムは公知のものである）このとき，数式２０のように計算できる。

Here, C6 is a predetermined constant. ε1(t−1) and ε2(t−1) are random numbers generated from a normal distribution with mean 0 and variance b (algorithms for generating such random numbers are well known). In this case, the calculation can be performed as in Equation 20.
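One pass of the learning subroutine can be sketched in code under stated assumptions. The parameter updates follow the text (parameter plus learning coefficient times TD error times auxiliary parameter). The TD error, the decaying-trace recursion, and the Gaussian log-probability gradient are the standard textbook forms and are assumed here, since Equations 15 to 20 are not reproduced in this text; the role of the constant C6 is unknown and it is omitted. All function names are hypothetical.

```python
import random


def noisy_control(mean, variance, rng):
    """Return (u, eps) with u = mean + eps and eps drawn from N(0, variance)."""
    eps = rng.gauss(0.0, variance ** 0.5)
    return mean + eps, eps


def log_policy_gradient(eps, variance, mean_gradient):
    """Gradient of ln P for a Gaussian whose mean depends on theta:
    d ln P / d theta_i = (eps / variance) * d mean / d theta_i."""
    return [eps / variance * g for g in mean_gradient]


def td_error(r_prev, v_prev, v_now, gamma):
    """Assumed one-step TD error: r(t-1) + gamma * V(s(t)) - V(s(t-1))."""
    return r_prev + gamma * v_now - v_prev


def trace_update(params, traces, gradient, delta, rate, gamma, lam):
    """Decay the auxiliary parameters, add the new gradient, then apply
    param_i <- param_i + rate * delta * trace_i (the form given in the text)."""
    traces = [gamma * lam * e + g for e, g in zip(traces, gradient)]
    params = [p + rate * delta * e for p, e in zip(params, traces)]
    return params, traces


if __name__ == "__main__":
    rng = random.Random(0)
    u, eps = noisy_control(mean=1.0, variance=0.04, rng=rng)
    grad = log_policy_gradient(eps, 0.04, mean_gradient=[1.0, 0.5])
    delta = td_error(r_prev=1.0, v_prev=0.5, v_now=1.0, gamma=0.9)
    theta, Theta = trace_update([0.0, 0.0], [0.0, 0.0], grad, delta,
                                rate=0.01, gamma=0.9, lam=0.5)
    print(u, delta, theta)
```

The same `trace_update` serves both the evaluation means (with the gradient of V with respect to di) and the control means (with the gradient of ln P with respect to θi), differing only in the gradient and learning coefficient passed in.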

上記の制御装置により、ヘビ型ロボットの制御を行った結果を以下に示す。具体的には、時刻５０００秒において地面との摩擦係数が変化した場合の実験結果を図１０と図１１に示す。図１０と図１１の横軸は時刻を示し、縦軸はそれぞれヘビ型ロボットの向きと、方策パラメータの値（ここでは二つの方策パラメータθ１とθ２）を示している。また、ヘビ型ロボットの向きは目標方向を０としたものである。摩擦係数の変化後、一時的にヘビ型ロボットは大きく方向を変化させるが、その後、方策パラメータの収束と共に進行方向に向いていることが分かる。以上により、本制御装置により、ヘビ型ロボットの目標方向（この場合ｘ軸に平行な方向）への推進運動に対する良好な制御則が獲得できたことが分かる。また，時刻５０００秒において環境（ここでは、例えば、摩擦係数）が変化したにもかかわらず，新しい環境に適合する形で制御ができることが分かる。
なお、選択情報受付部４１０４は、例えば、上述した数式の情報を受け付ける。 The results of controlling the snake robot with the above control device are shown below. Specifically, FIGS. 10 and 11 show the experimental results for the case where the coefficient of friction with the ground changes at time 5000 seconds. The horizontal axes of FIGS. 10 and 11 indicate time, and the vertical axes indicate, respectively, the direction of the snake robot and the values of the policy parameters (here, the two policy parameters θ1 and θ2). The direction of the snake robot is expressed with the target direction taken as 0. After the coefficient of friction changes, the snake robot temporarily changes direction by a large amount, but it can be seen that it subsequently faces the direction of travel as the policy parameters converge. This shows that the present control device acquired a good control law for propulsion of the snake robot in the target direction (in this case, the direction parallel to the x-axis). It also shows that, although the environment (here, for example, the coefficient of friction) changed at time 5000 seconds, control adapted to the new environment is possible.
Note that the selection information receiving unit 4104 receives, for example, information on the above-described mathematical expression.

以上、本実施の形態によれば、制御装置は観測した多数のパラメータを圧縮して、高速に被制御装置に対する制御を行う、あるいはそのための制御則を学習できる。また、被制御装置の状態および制御部の内部状態の両方を観測し、かつ、評価手段は、１以上の第二種パラメータに基づいて、前記被制御装置の被制御についての状況の評価を行い、評価の良し悪しを示す情報（例えば、数式３におけるｒ）に基づく簡単な値を出力するだけで、制御手段４１０３２は学習し好適な制御が可能である。よって、制御装置に複雑な情報を入力することなく、被制御装置を制御できる。つまり、従来技術による制御装置において、高速に被制御装置に対する制御則を学習しようとする場合は、種々の複雑な情報を制御装置は保持していなければならない。先に述べたように環境が変化する場合については，特に多くの情報が必要になる。一方、従来技術による制御装置において、初期設定が簡易であれば、適切に制御に被制御装置を制御しようとすれば、多数の状態を観測し、その結果、制御に時間がかかる。本実施の形態によれば、初期設定を簡易にしつつ制御速度を速める（ＣＰＵ負荷を少なくできる）、という特徴がある。これは、観測した多数の第一種パラメータから第二種パラメータにパラメータ変換を行い、かつ制御部における強化学習法（評価の良し悪しを示す情報を出力する評価手段と評価手段が出力した評価の良し悪しを示す情報に基づいて被制御装置を制御する制御手段により実現されている方法）による制御を行うからである。つまり、本実施の形態によれば、設計者が最適な制御規則を設計せずに制御器が適応的に制御規則を発見する状況においても、多自由度システムに対する制御則を高速に学習できる制御装置を提供できる。 As described above, according to the present embodiment, the control device compresses a large number of observed parameters, and can control the controlled device at high speed, or can learn a control law for that purpose. Further, both the state of the controlled device and the internal state of the control unit are observed, and the evaluation means evaluates the status of the controlled device based on the one or more second type parameters. By simply outputting a simple value based on information indicating whether the evaluation is good or bad (for example, r in Formula 3), the control means 41032 can learn and perform suitable control. Therefore, the controlled device can be controlled without inputting complicated information to the control device. That is, in the control device according to the prior art, when trying to learn the control law for the controlled device at high speed, the control device must hold various complicated information. As mentioned earlier, a lot of information is needed when the environment changes. On the other hand, in the control device according to the prior art, if the initial setting is simple, a large number of states are observed if the controlled device is to be controlled appropriately. 
As a result, control takes time. The present embodiment is characterized in that it increases the control speed (it can reduce the CPU load) while keeping the initial setting simple. This is because parameter conversion is performed from the many observed first type parameters to second type parameters, and because control is performed in the control unit by a reinforcement learning method (a method realized by evaluation means that outputs information indicating whether the evaluation is good or bad, and control means that controls the controlled device based on that information). In other words, the present embodiment can provide a control device that can learn a control law for a multi-degree-of-freedom system at high speed, even in a situation where the designer does not design an optimal control rule and the controller instead discovers the control rule adaptively.

なお、本実施の形態によれば、制御部は、上述したように、神経振動子ネットワーク，すなわち複数の素子をネットワーク状に連結し、かつ自己フィードバックを有するネットワーク構造を内在するものであり、前記制御部の内部状態は前記ネットワーク構造を構成する素子の状態である。神経振動子ネットワークを制御部に用いることで、ヒューマノイドロボットの二足歩行やヘビ型ロボットのほふく運動など，周期的運動を基本とする運動を好適に制御することができる。かかることは、他の実施の形態においても同様である。しかし、本発明の主要な技術である第一種パラメータから第二種パラメータへの変換に基づいた特徴抽出は、神経振動子ネットワークを使用するか否かに依存せずに有効に制御学習を加速する。 According to the present embodiment, as described above, the control unit contains a neural oscillator network, that is, a network structure in which a plurality of elements are connected in a network and which has self-feedback, and the internal state of the control unit is the state of the elements constituting that network structure. Using a neural oscillator network in the control unit makes it possible to suitably control motions based on periodic movement, such as the bipedal walking of a humanoid robot or the crawling motion of a snake robot. The same applies to the other embodiments. However, feature extraction based on the conversion from first type parameters to second type parameters, which is the main technique of the present invention, effectively accelerates control learning regardless of whether a neural oscillator network is used.

さらに、本実施の形態における制御装置を実現するソフトウェアは、以下のようなプログラムである。つまり、このプログラムは、コンピュータに、被制御装置の状態および制御部の内部状態を観測し、２以上の第一種パラメータを取得する観測ステップと、前記観測ステップで取得した２以上の第一種パラメータに基づいて、当該第一種パラメータより少ない数の第二種パラメータを取得する特徴抽出ステップと、前記特徴抽出ステップで取得した１以上の第二種パラメータに基づいて、前記被制御装置を制御する制御ステップを実行させるためのプログラムであり、前記特徴抽出ステップは、前記観測ステップで取得した２以上の第一種パラメータに基づいて２以上の第二種パラメータを取得する変換サブステップと、前記変換サブステップで取得した２以上の第二種パラメータから１以上の第二種パラメータを選択する選択サブステップを具備し、前記制御ステップは、前記特徴抽出ステップで取得した１以上の第二種パラメータに基づいて、前記被制御装置の被制御についての状況の評価を行い、評価の良し悪しを示す情報を出力する評価サブステップと、前記評価サブステップで出力した評価の良し悪しを示す情報に基づいて、前記被制御装置を制御する制御サブステップを具備するプログラム、である。
（実施の形態３）
図１２は、制御装置および当該制御装置の制御対象の装置である被制御装置を有する制御システムのブロック図である。制御システムは、制御装置１２１および被制御装置１２を具備する。
制御装置１２１は、観測部４１０１、特徴抽出部１２１０２、制御部４１０３、選択情報受付部４１０４を具備する。特徴抽出部１２１０２は、変換手段１２１０２１、選択手段４１０２２を具備する。 Furthermore, the software that implements the control device according to the present embodiment is the following program. That is, this program causes a computer to execute: an observation step of observing the state of the controlled device and the internal state of the control unit and acquiring two or more first type parameters; a feature extraction step of acquiring, based on the two or more first type parameters acquired in the observation step, second type parameters that are fewer in number than the first type parameters; and a control step of controlling the controlled device based on the one or more second type parameters acquired in the feature extraction step. The feature extraction step comprises a conversion sub-step of acquiring two or more second type parameters based on the two or more first type parameters acquired in the observation step, and a selection sub-step of selecting one or more second type parameters from the two or more second type parameters acquired in the conversion sub-step. The control step comprises an evaluation sub-step of evaluating the status of the controlled device under control based on the one or more second type parameters acquired in the feature extraction step and outputting information indicating whether the evaluation is good or bad, and a control sub-step of controlling the controlled device based on the information, output in the evaluation sub-step, indicating whether the evaluation is good or bad.
(Embodiment 3)
FIG. 12 is a block diagram of a control system including a control device and a controlled device that is a control target device of the control device. The control system includes a control device 121 and a controlled device 12.
The control device 121 includes an observation unit 4101, a feature extraction unit 12102, a control unit 4103, and a selection information reception unit 4104. The feature extraction unit 12102 includes conversion means 121021 and selection means 41022.

変換手段１２１０２１は、第二種パラメータが有する変数を、制御部４１０３の制御結果または／および制御部４１０３の内部状態に応じて変更する。なお、本実施の形態において、第二種パラメータが可変な変数を有する。変換手段１２１０２１は、通常、ＭＰＵやメモリ等から実現され得る。変換手段１２１０２１の処理手順は、通常、ソフトウェアで実現され、当該ソフトウェアはＲＯＭ等の記録媒体に記録されている。但し、ハードウェア（専用回路）で実現しても良い。 The conversion means 121021 changes the variables of the second type parameters according to the control result of the control unit 4103 and/or the internal state of the control unit 4103. In the present embodiment, the second type parameters have adjustable variables. The conversion means 121021 can usually be realized by an MPU, a memory, or the like. The processing procedure of the conversion means 121021 is usually realized by software, and the software is recorded on a recording medium such as a ROM. However, it may also be realized by hardware (a dedicated circuit).

Hereinafter, the operation of the control device 121 in this control system will be described. The operation of the control device 121 is generally the same as that of the control device 41 in the second embodiment (described with the flowchart of FIG. 5). It differs only in that a process of changing the variable of the second type parameter according to the control result of the control unit 4103 and/or the internal state of the control unit 4103 is added. This process is added, for example, after step S508. Note that the process of changing the variable of the second type parameter may be performed at any timing.
Hereinafter, a specific operation of the control system in the present embodiment will be described. Here, only the process of changing the variable of the second type parameter, which is performed by the conversion means 121021, will be described.
First, the second type parameter having a variable (s2') is expressed by, for example, Equation 21.

In Equation 21, αi (i is an integer from 1 to 9) is a variable.

In addition, when the control result of the control unit 4103 is good, for example, when the sum of r in Equation 3 over a certain period of time is larger than a predetermined threshold, the conversion means 121021 fixes the variable of the second type parameter; when the sum is smaller than the threshold, it changes the variable of the second type parameter at random. For example, the variable (αi) of the second type parameter is changed as shown in Equation 22.

Here, Δi may be determined using a random number.
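The rule just described can be written compactly as follows. This is only a restatement of the prose above, not a reproduction of Equation 22; the window start $t_0$ and the window length $W$ are notation assumed here, and $F$ denotes the predetermined threshold. The text does not say what happens when the sum equals the threshold, so that case is folded into "otherwise" as an assumption.

```latex
\alpha_i \leftarrow
\begin{cases}
\alpha_i, & \text{if } \sum_{t=t_0}^{t_0+W} r(t) > F \\
\alpha_i + \Delta_i, & \text{otherwise}
\end{cases}
\qquad i = 1, \dots, 9
```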
Specifically, the conversion means 121021 changes the variable (αi) of the second type parameter, for example, according to the flowchart shown in FIG. 13.

(Step S1301) The selection information receiving unit 4104 receives the following initial values. Specifically, it receives the initial values of the parameters αi [i = 1, ..., 9] and the value "1" for n. In addition, it receives a threshold F for performance evaluation and a learning end time T.

(Step S1302) The control device 121 performs control in the same procedure as the learning algorithm in FIGS. 8 and 9 of the second embodiment, using the second type parameter s2' having the parameters αi instead of s2. First, the initial values of the parameters θi (policy parameters) and di of the control means 41032 and the evaluation means 41031 are determined. The auxiliary parameters Θi and Di used for learning these parameters are set to 0. Also, t is set to 0. Further, R is set to 0. Note that R is a variable (buffer) for storing the accumulated reward.
(Step S1303) The observation unit 4101 observes the controlled device 12 and the control unit 4103, and outputs the first type parameter X (t).

(Step S1304) The feature extraction unit 12102 acquires the first type parameter X(t), converts it into second type parameters by the conversion means 121021, selects s1(t), s2(t), and s3(t) by the selection means 41022, and outputs them to the control unit 4103. At the same time, the control unit 4103 acquires r(t) calculated by Equation 3. Then, R = R + r(t) is set.
(Step S1305) The control means 41032 acquires s1(t), s2(t), and s3(t), and generates a control signal.
(Step S1306) The learning subroutine (FIG. 9) of the control unit 4103 is executed.
(Step S1307) The controlled device 12 is controlled by the control signal generated in step S1305.
(Step S1308) The conversion means 121021 determines whether or not t = T. If t = T, the process goes to step S1309; otherwise, it goes to step S1303.
(Step S1309) The conversion means 121021 determines whether or not n = 1. If n = 1, the process goes to step S1310; otherwise, it goes to step S1313.
(Step S1310) The conversion means 121021 records αmaxi = αi [i = 1,..., 9].
(Step S1311) The conversion means 121021 generates αi according to “αi = αmaxi + Δi”. Here, Δi is determined using a random number.
(Step S1312) The conversion means 121021 increments n by 1. The process returns to step S1302.
(Step S1313) The conversion means 121021 determines whether or not R > F. If R > F, the process goes to step S1310; otherwise, it goes to step S1311.
The search for αi may be realized by using a search technique such as a genetic algorithm which is a known technique. In addition, the long-term observation unit described in the fourth embodiment may be used in changing the variables of the conversion means.
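The procedure of steps S1301 to S1313 amounts to a random local search over the variables αi. A minimal Python sketch is given below under stated assumptions: `run_episode` is a hypothetical callable standing in for the inner control-and-learning loop (steps S1302 to S1308) that returns the accumulated reward R, and the Gaussian form of Δi, the scale `sigma`, and the trial bound `n_trials` are illustrative choices, since the text only says that Δi is determined using a random number and the flowchart itself has no termination step.

```python
import random


def search_alpha(run_episode, alpha_init, threshold_f, n_trials, sigma=0.1, seed=0):
    """Random search over the variables alpha_i, following steps S1301 to S1313.

    run_episode(alpha) is a hypothetical stand-in for steps S1302 to S1308:
    it runs control and learning for T time steps and returns the accumulated
    reward R. The Gaussian delta_i and the n_trials bound are assumptions.
    """
    rng = random.Random(seed)
    alpha = list(alpha_init)            # S1301: initial values of alpha_i
    alpha_max = list(alpha_init)
    for n in range(1, n_trials + 1):    # S1312: n is incremented per trial
        r_total = run_episode(alpha)    # S1302-S1308: one trial of length T
        # S1309 / S1313: the first trial is always recorded; later trials
        # are recorded only when R exceeds the threshold F.
        if n == 1 or r_total > threshold_f:
            alpha_max = list(alpha)     # S1310: alpha_max_i = alpha_i
        # S1311: alpha_i = alpha_max_i + delta_i, with delta_i random
        alpha = [a + rng.gauss(0.0, sigma) for a in alpha_max]
    return alpha_max
```

Note that, as in the flowchart, a new candidate is always generated around the last recorded αmaxi, so a trial whose reward does not exceed F is simply discarded.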

As described above, according to the present embodiment, the control device compresses a large number of observed parameters and can thereby control the controlled device at high speed, or learn a control law for doing so. Further, both the state of the controlled device and the internal state of the control unit are observed, and the evaluation means evaluates the status of the controlled device being controlled based on one or more second type parameters. Simply by outputting a simple value based on information indicating whether the evaluation is good or bad (for example, r in Equation 3), the control means 41032 can learn and perform suitable control. Therefore, the controlled device can be controlled without inputting complicated information to the control device. By contrast, a control device that is to learn a control law for a controlled device at high speed must normally hold various kinds of complicated information. Further, the second type parameter handled by this control device has a changeable variable, and the conversion means can change that variable according to the control result of the control unit and/or the internal state of the control unit, which enables suitable control with even higher accuracy.
Furthermore, according to the present embodiment, it is possible to provide a control device that can learn a control law for a multi-degree-of-freedom system at high speed even in a situation where the designer does not design an optimal control rule and the controller discovers control rules adaptively.

Note that the software that implements the control device in the present embodiment is the following program. That is, this program causes a computer to execute: an observation step of observing the state of a controlled device, which is the device to be controlled, and acquiring two or more first type parameters; a feature extraction step of acquiring, based on the two or more first type parameters acquired in the observation step, second type parameters fewer in number than the first type parameters; and a control step of controlling the controlled device based on the one or more second type parameters acquired in the feature extraction step. The second type parameter has a changeable variable. The feature extraction step includes a conversion substep of acquiring two or more second type parameters based on the two or more first type parameters acquired in the observation step and of changing the variable of the second type parameter according to the control result in the control step and/or the internal state of the control unit, and a selection substep of selecting one or more second type parameters from the two or more second type parameters acquired in the conversion substep.
(Embodiment 4)
FIG. 14 is a block diagram of a control system including a control device and a controlled device that is a device to be controlled by the control device. The control system includes a control device 141 and a controlled device 12.

The control device 141 includes an observation unit 4101, a feature extraction unit 14102, a control unit 4103, a selection information receiving unit 4104, and a long-term observation unit 14105. The feature extraction unit 14102 includes conversion means 141021 and selection means 141022. The control device 141 may also be located inside the controlled device 12.

The long-term observation unit 14105 acquires the first type parameters acquired by the observation unit 4101 and/or the control result of the control unit 4103, and accumulates two or more pieces of the acquired information. Based on the accumulated information, the long-term observation unit 14105 also generates information to be given to the conversion means 141021 and/or the selection means 141022, and gives the generated information to them. Note that the information generated by the long-term observation unit 14105 may be identical to the accumulated information.
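The role of the long-term observation unit can be illustrated with a small data structure. The following Python sketch is not taken from the source: the class name `LongTermObserver`, the bounded buffer, and the use of a mean reward as the "generated information" are all assumptions made for the example.

```python
from collections import deque


class LongTermObserver:
    """Accumulates (first type parameters, reward) pairs and derives summary info.

    Hypothetical sketch: the buffer size and the mean-reward summary are
    illustrative choices, not details given in the source.
    """

    def __init__(self, maxlen=1000):
        # Oldest entries are dropped automatically once maxlen is reached.
        self.history = deque(maxlen=maxlen)

    def accumulate(self, x, r):
        # Store one observation of first type parameters x and control result r.
        self.history.append((tuple(x), r))

    def generate_info(self):
        # Information handed to the conversion means and/or selection means.
        if not self.history:
            return {"count": 0, "mean_reward": 0.0}
        total = sum(r for _, r in self.history)
        return {"count": len(self.history), "mean_reward": total / len(self.history)}
```

A bounded buffer is only one possible design; the text also allows the generated information to be identical to the accumulated information, in which case `generate_info` could simply return the stored records.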

The conversion means 141021 acquires two or more second type parameters using the two or more first type parameters acquired by the observation unit 4101 and the information accumulated by the long-term observation unit 14105 or information generated based on that information.

The selection means 141022 selects one or more second type parameters from the two or more second type parameters acquired by the conversion means 141021, using the information accumulated by the long-term observation unit 14105 or information generated based on that information.

The feature extraction unit 14102, the long-term observation unit 14105, the conversion means 141021, and the selection means 141022 can usually be realized by an MPU, a memory, or the like. Their processing procedures are usually realized by software, and the software is recorded on a recording medium such as a ROM. However, they may also be realized by hardware (a dedicated circuit).
Hereinafter, the operation of the control device 141 in this control system will be described with reference to the flowchart of FIG. 15.

(Step S1501) The observation unit 4101 observes the state of the controlled device 12 and determines whether two or more first type parameters have been acquired. If so, the process proceeds to step S1502; if not, it returns to step S1501.
(Step S1502) The conversion means 141021 of the feature extraction unit 14102 acquires information used for parameter conversion from the long-term observation unit 14105.

(Step S1503) The conversion means 141021 acquires one or more second type parameters based on the two or more first type parameters acquired in step S1501 and the information acquired in step S1502. Here, the number of second type parameters may be larger than, or the same as, the number of first type parameters.
(Step S1504) The selection means 141022 of the feature extraction unit 14102 acquires information used for parameter selection from the long-term observation unit 14105.

(Step S1505) The selection means 141022 selects one or more second type parameters from the one or more second type parameters acquired in step S1503, based on the information acquired in step S1504. The number of second type parameters selected here is preferably much smaller than the number of first type parameters. For example, it is preferable that the number of first type parameters is 2400 or more and the number of selected second type parameters is 3.
(Step S1506) The control means 41032 constructs a control signal based on the second type parameters selected in step S1505. A specific method for constructing the control signal will be described later.
(Step S1507) The control means 41032 gives the control signal constructed in step S1506 to the controlled device 12 and controls the controlled device 12.
(Step S1508) The evaluation means 41031 outputs an evaluation of the control rule used by the control means, based on the second type parameters selected in step S1505 (or on those parameters together with the control signal output by the control means).
(Step S1509) The control means 41032 changes (learns) the control rule for constructing the control signal, based on the second type parameters selected in step S1505 and the evaluation output by the evaluation means 41031 in step S1508.
(Step S1510) The long-term observation unit 14105 acquires the first type parameters acquired in step S1501 and r in Equation 3.
(Step S1511) The long-term observation unit 14105 accumulates the information acquired in Step S1510.

(Step S1512) The long-term observation unit 14105 generates information used for parameter conversion and/or parameter selection based on the accumulated information (one or more pieces of information). A specific method for generating the information will be described later. The process then returns to step S1501.
Note that, in the flowchart of FIG. 15, the processing ends when the power is turned off or a processing-end interrupt occurs.
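The ordering of steps S1501 to S1512 can be sketched as a single control cycle. The Python below is an illustrative outline only: every callable passed in (`observe`, `convert`, `select`, and so on) and the `History` class are hypothetical placeholders, a single piece of long-term information is shared by conversion and selection for brevity, and the information of step S1512 is generated on demand rather than at the end of the cycle.

```python
class History:
    """Minimal stand-in for the long-term observation unit (hypothetical)."""

    def __init__(self):
        self.records = []

    def accumulate(self, x, r):
        self.records.append((list(x), r))         # S1511: store the observation

    def generate_info(self):
        return {"n_records": len(self.records)}   # S1512: derived information


def control_cycle(observe, convert, select, build_signal, apply_signal,
                  evaluate, learn, history):
    """One pass through steps S1501 to S1512; every callable is a placeholder."""
    x = observe()                       # S1501: first type parameters
    info = history.generate_info()      # S1502 / S1504: long-term information
    candidates = convert(x, info)       # S1503: second type parameters
    s = select(candidates, info)        # S1505: pick a small subset
    u = build_signal(s)                 # S1506: construct the control signal
    apply_signal(u)                     # S1507: drive the controlled device
    r = evaluate(s, u)                  # S1508: evaluation of the control rule
    learn(s, r)                         # S1509: update the control rule
    history.accumulate(x, r)            # S1510-S1511: record x and r
    return s, u, r
```

Calling `control_cycle` repeatedly until an external stop condition corresponds to the loop back to step S1501.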
Hereinafter, a specific operation of the control system in the present embodiment will be described. Here, the controlled device 12 is, for example, the snake-type robot shown in FIG. 3.
Further, for example, the following s4 (Equation 23) and s5 (Equation 24) are prepared as second type parameters other than the above s1, s2, and s3.

The control device 141 then operates according to the flowchart shown in FIG. 16.

(Step S1601) The selection information receiving unit 4104 receives input of various initial values. Specifically, it receives information for selecting three of s1 to s5. The set of the three selected second type parameters is denoted S. The selection information receiving unit 4104 also receives the value "1" for n and an appropriate T.

(Step S1602) Control is then performed in the same procedure as in the second embodiment, using the set S instead of the set (s1, s2, s3). That is, first, the initial values of the parameters θi (policy parameters) and di of the control means 41032 and the evaluation means 41031 are determined. The auxiliary parameters Θi and Di used for learning these parameters are set to 0. Also, t is set to 0.
(Step S1603) The observation unit 4101 observes the controlled device 12 and the control unit 4103, and outputs the first type parameter X (t).

(Step S1604) The feature extraction unit 4102 acquires the first type parameter X(t), converts it into second type parameters by the conversion means 41021, selects the set S by the selection means 41022, and outputs it to the control unit 4103. At the same time, the control unit 4103 acquires r(t) calculated by Equation 3.
(Step S1605) The control means 41032 acquires S and generates a control signal.
(Step S1606) The learning subroutine (FIG. 9) of the control unit 4103 is executed.
(Step S1607) The controlled device 12 is controlled by the control signal generated in step S1605.
(Step S1608) The conversion means 141021 determines whether or not t = T. If t = T, the process goes to step S1609; otherwise, it goes to step S1603.
(Step S1609) The long-term observation unit 14105 calculates and records the sum, shown in Equation 25, of the rewards at each time from 0 to T seconds.

Note that the value of the sum of the rewards is F (n).
(Step S1610) The long-term observation unit 14105 determines whether or not n = 1. If n = 1, the process goes to step S1611; otherwise, it goes to step S1612.
(Step S1611) The long-term observation unit 14105 stores Fmax = F (1). The long-term observation unit 14105 stores Smax = S and goes to step S1612.
(Step S1612) The long-term observation unit 14105 generates a set S in which one element of Smax is replaced.
(Step S1613) n is incremented by one. The process returns to step S1602.
(Step S1614) The long-term observation unit 14105 determines whether or not F(n) > F. If F(n) > F, the process goes to step S1615; otherwise, it goes to step S1612.
(Step S1615) The long-term observation unit 14105 stores Fmax = F (n), and the selection information receiving unit stores Smax = S. Go to step S1612.
Note that the search for S may be realized by using a search technique such as a genetic algorithm which is a known technique.
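The search over the set S in steps S1601 to S1615 can be sketched as a local search that swaps one element at a time. The Python below is a sketch under several assumptions that are not stated in the source: `run_episode` is a hypothetical callable standing in for steps S1602 to S1609 and returning F(n); trials after the first are assumed to reach the comparison of step S1614 (as written, the n ≠ 1 branch of step S1610 leads to step S1612, which would leave step S1614 unreachable); the comparison is assumed to be made against the stored best value Fmax; and `n_trials` is an added stopping bound.

```python
import random


def search_feature_set(run_episode, candidates, initial_set, n_trials, seed=0):
    """Local search over subsets of second type parameters (after FIG. 16).

    run_episode(S) is a hypothetical stand-in for steps S1602 to S1609 and
    returns F(n), the sum of rewards from 0 to T. candidates must contain at
    least one element that is not in the current set.
    """
    rng = random.Random(seed)
    current = list(initial_set)                 # S1601: initial set S
    best_set, best_f = list(initial_set), None
    for n in range(1, n_trials + 1):            # S1613: n is incremented
        f_n = run_episode(current)              # S1602-S1609: F(n)
        # S1610 / S1614 (assumed): keep S when it is the first trial or when
        # F(n) exceeds the best value recorded so far.
        if n == 1 or f_n > best_f:
            best_f, best_set = f_n, list(current)   # S1611 / S1615
        # S1612: generate a new S by replacing one element of Smax.
        unused = [c for c in candidates if c not in best_set]
        current = list(best_set)
        current[rng.randrange(len(current))] = rng.choice(unused)
    return best_set, best_f
```

With five candidates and sets of three there are only ten possible sets, so exhaustive evaluation would also be feasible; the swap-based search mirrors the flowchart and scales to larger candidate pools.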

As described above, according to the present embodiment, the control device compresses a large number of observed parameters and can thereby control the controlled device at high speed, or learn a control law for doing so. In addition, suitable control that takes into account observation results over a predetermined time or longer is possible. Furthermore, according to the present embodiment, it is possible to provide a control device that can learn a control law for a multi-degree-of-freedom system at high speed even in a situation where the designer does not design an optimal control rule and the controller discovers control rules adaptively.

Furthermore, the software that implements the control device in the present embodiment is the following program. That is, this program causes a computer to execute: an observation step of observing the state of the controlled device and the internal state of the control unit and acquiring two or more first type parameters; a feature extraction step of acquiring, based on the two or more first type parameters acquired in the observation step, second type parameters fewer in number than the first type parameters; a control step of controlling the controlled device based on the one or more second type parameters acquired in the feature extraction step; and a long-term observation step of acquiring the first type parameters acquired in the observation step and/or the control result of the control step and accumulating two or more pieces of the acquired information. The feature extraction step includes a conversion substep of acquiring two or more second type parameters based on the two or more first type parameters acquired in the observation step and on the information accumulated in the long-term observation step or information generated based on that information, and a selection substep of selecting one or more second type parameters from the two or more second type parameters acquired in the conversion substep, using the information accumulated in the long-term observation step or information generated based on that information. The control step includes an evaluation substep of evaluating the status of the controlled device being controlled based on the one or more second type parameters acquired in the feature extraction step and outputting information indicating whether the evaluation is good or bad, and a control substep of controlling the controlled device based on the information, output in the evaluation substep, indicating whether the evaluation is good or bad.
Note that the programs described above may each be executed by a single computer or by a plurality of computers. That is, centralized processing or distributed processing may be performed.

In each of the above embodiments, each process (each function) may be realized by centralized processing in a single device (system), or by distributed processing across a plurality of devices.
The present invention is not limited to the above-described embodiments, and various modifications are possible, and it goes without saying that these are also included in the scope of the present invention.

As described above, the control device according to the present invention has the effect of enabling efficient control of a multi-degree-of-freedom system such as a robot based on control parameters that are few in number compared with the degrees of freedom, and is useful as a control device or the like for controlling a multi-degree-of-freedom system.

FIG. 1 is a block diagram of the control system in Embodiment 1.
FIG. 2 is a flowchart explaining the operation of the control device.
FIG. 3 is a diagram showing the snake-type robot.
FIG. 4 is a block diagram of the control system in Embodiment 2.
FIG. 5 is a flowchart explaining the operation of the control device.
FIG. 6 is a diagram showing the ganglion.
FIG. 7 is a diagram showing the neural oscillator network.
FIG. 8 is a flowchart explaining the operation of the control device.
FIG. 9 is a flowchart explaining the operation of the control device.
FIG. 10 is a diagram showing the experimental results.
FIG. 11 is a diagram showing the experimental results.
FIG. 12 is a block diagram of the control system in Embodiment 3.
FIG. 13 is a flowchart explaining the operation of the conversion means.
FIG. 14 is a block diagram of the control system in Embodiment 4.
FIG. 15 is a flowchart explaining the operation of the control device.
FIG. 16 is a flowchart explaining the operation of the control device.

Explanation of symbols

11, 41, 121, 141 Control device
12 Controlled device
1101, 4101 Observation unit
1102, 4102, 12102, 14102 Feature extraction unit
1103, 4103 Control unit
4104 Selection information receiving unit
11021, 41021, 121021, 141021 Conversion means
14105 Long-term observation unit
41022, 141022 Selection means
41031 Evaluation means
41032 Control means

Claims

A control device comprising:
an observation unit that observes the state of a controlled device, which is a device to be controlled and is a multi-degree-of-freedom system having two or more links and joints connecting the two or more links, and acquires, as first type parameters, two or more parameters that are state variables of each of the two or more links;
a feature extraction unit that, by compressing parameters based on the two or more first type parameters acquired by the observation unit, acquires second type parameters, which are fewer in number than the two or more first type parameters, have a meaning different from that of the first type parameters, and represent a feature of the controlled device; and
a control unit that controls the controlled device based on one or more second type parameters acquired by the feature extraction unit,
wherein the feature extraction unit includes:
conversion means for acquiring two or more second type parameters based on the two or more first type parameters acquired by the observation unit; and
selection means for selecting the one or more second type parameters from the two or more second type parameters acquired by the conversion means.

The control device according to claim 1, wherein
the second type parameter has a changeable variable, and
the conversion means changes the variable of the second type parameter according to a control result, which is the status of the controlled device being controlled, and/or the internal state of the control unit, which is a state variable of a control model.

The control device according to claim 1, wherein the control unit includes:
evaluation means for evaluating the status of the controlled device being controlled, based on one or more second type parameters acquired by the feature extraction unit; and
control means for learning a control law for the controlled device based on the evaluation by the evaluation means, and for controlling the controlled device based on the control law and the one or more second type parameters acquired by the feature extraction unit.

The control device according to any one of claims 1 to 3, wherein
the observation unit observes the state of the controlled device and the internal state of the control unit, which is a state variable of a control model, and acquires two or more first type parameters.

The control device according to claim 4, wherein
the control unit includes a network structure having a plurality of elements that are connected in a network and have self-feedback, and the internal state of the control unit is the state of the elements constituting the network structure.

The control device according to any one of claims 1 to 5, further comprising a selection information receiving unit that receives selection information, which is information used for acquiring the second type parameters,
wherein the feature extraction unit
acquires second type parameters fewer in number than the first type parameters by also using the selection information received by the selection information receiving unit.

The control device according to any one of claims 1 to 6, further comprising
a long-term observation unit that acquires the first-type parameters acquired by the observation unit and/or a control result, which is the status of the controlled device under control, and accumulates two or more pieces of the acquired information, wherein
the conversion means acquires the two or more second-type parameters also using the information accumulated by the long-term observation unit or information generated based on that information.

The control device according to any one of claims 1 to 7, further comprising
a long-term observation unit that acquires the first-type parameters acquired by the observation unit and/or a control result, which is the status of the controlled device under control, and accumulates two or more pieces of the acquired information, wherein
the selection means selects the one or more second-type parameters also using the information accumulated by the long-term observation unit or information generated based on that information.

A program for causing a computer to execute:
an observation step of acquiring, as first-type parameters, two or more parameters representing the state of a controlled device, the controlled device being a device to be controlled and a multi-degree-of-freedom system having two or more links and a joint connecting the two or more links, and the parameters relating to the respective state variables of the two or more links;
a feature extraction step of acquiring, by compressing parameters based on the two or more first-type parameters acquired in the observation step, one or more second-type parameters that are fewer in number than the two or more first-type parameters, differ in meaning from the first-type parameters, and represent a feature of the controlled device; and
a control step of controlling the controlled device based on the one or more second-type parameters acquired in the feature extraction step,
wherein the feature extraction step includes:
a conversion sub-step of acquiring two or more second-type parameters based on the two or more first-type parameters acquired in the observation step; and
a selection sub-step of selecting the one or more second-type parameters from the two or more second-type parameters acquired in the conversion sub-step.

The program according to claim 9, wherein
the second-type parameter has an adjustable variable, and
in the conversion sub-step, the variable of the second-type parameter is changed according to a control result, which is the status of the controlled device, and/or an internal state of the control unit, which is a state variable of a control model.

The program according to claim 9, wherein the control step includes:
an evaluation sub-step of evaluating the status of the controlled device based on the one or more second-type parameters acquired in the feature extraction step; and
a control sub-step of learning a control law for the controlled device based on the evaluation in the evaluation sub-step, and of controlling the controlled device based on the control law and the one or more second-type parameters acquired in the feature extraction step.

The program according to any one of claims 9 to 11, wherein,
in the observation step, the two or more first-type parameters are acquired by observing the state of the controlled device and an internal state of a control unit, which is a state variable of a control model.

The program according to claim 12, wherein
the control unit includes a network structure having a plurality of elements that are connected in a network and have self-feedback, and the internal state of the control unit is the state of the elements constituting the network structure.

The program according to any one of claims 9 to 13, further causing the computer to execute
a selection information receiving step of receiving selection information, which is information for acquiring the second-type parameters in the feature extraction step, wherein,
in the feature extraction step, second-type parameters fewer in number than the first-type parameters are acquired also using the selection information received in the selection information receiving step.

The program according to any one of claims 9 to 14, further causing the computer to execute
a long-term observation step of acquiring the first-type parameters acquired in the observation step and/or a control result, which is the status of the controlled device under control, and accumulating two or more pieces of the acquired information, wherein,
in the conversion sub-step, the two or more second-type parameters are acquired also using the information accumulated in the long-term observation step or information generated based on that information.

The program according to any one of claims 9 to 15, further causing the computer to execute
a long-term observation step of acquiring the first-type parameters acquired in the observation step and/or a control result, which is the status of the controlled device under control, and accumulating two or more pieces of the acquired information, wherein,
in the selection sub-step, the one or more second-type parameters are selected also using the information accumulated in the long-term observation step or information generated based on that information.
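
As an informal illustration only, the observation, conversion, selection, and control steps recited in the program claims above might be arranged as in the following Python sketch. Every name in it (observe, convert, select, control), the planar-chain geometry, the candidate features, and the proportional rule are assumptions made for illustration; none of them is taken from the specification, and the sketch does not limit or interpret the claims.

```python
import math
from typing import Dict, List, Sequence


def observe(joint_angles: Sequence[float]) -> List[float]:
    """Observation step (hypothetical): one first-type parameter per link,
    here the relative joint angle of each link in radians."""
    return list(joint_angles)


def convert(first_type: Sequence[float], link_length: float = 1.0) -> Dict[str, float]:
    """Conversion sub-step (hypothetical): derive two or more candidate
    second-type parameters from the first-type parameters.
    The features below are assumed examples for a planar chain."""
    x = y = heading = 0.0
    for angle in first_type:
        heading += angle                      # absolute orientation of this link
        x += link_length * math.cos(heading)  # accumulate the tip position
        y += link_length * math.sin(heading)
    return {
        "tip_x": x,
        "tip_y": y,
        "tip_distance": math.hypot(x, y),
        "mean_angle": sum(first_type) / len(first_type),
    }


def select(candidates: Dict[str, float], names: Sequence[str]) -> List[float]:
    """Selection sub-step (hypothetical): keep only the named candidates,
    so that fewer parameters than were observed reach the control step."""
    return [candidates[name] for name in names]


def control(second_type: Sequence[float], target: float = 2.0, gain: float = 0.5) -> float:
    """Control step (hypothetical): a proportional rule on the first
    selected second-type parameter."""
    return gain * (target - second_type[0])


if __name__ == "__main__":
    first = observe([0.0, 0.0, 0.0])            # three first-type parameters
    candidates = convert(first)                 # four candidate features
    chosen = select(candidates, ["tip_distance"])  # one second-type parameter
    print(control(chosen))
```

In this sketch three observed joint angles are reduced to a single selected feature before the control rule is applied, which mirrors the requirement that the second-type parameters be fewer in number than the first-type parameters. A learned control law or an adjustable conversion, as recited in the dependent claims, is not modeled here.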