JP2018124982A - Control device and control method

Info

Publication number: JP2018124982A
Application number: JP2017207450A
Authority: JP
Inventors: Masashi Okada
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2017-01-31
Filing date: 2017-10-26
Publication date: 2018-08-09

Abstract

PROBLEM TO BE SOLVED: To provide a control device capable of performing optimal control using a neural network. SOLUTION: A control device 1 for performing optimal control by path integral includes a neural network section 3 including a machine-learned dynamics model and cost function, an input section 2 that inputs a current state of a control target 50 and an initial control sequence for the control target 50 into the neural network section 3, and an output section 4 that outputs a control sequence for controlling the control target 50, the control sequence being calculated by the neural network section 3 by path integral from the current state and the initial control sequence using the dynamics model and the cost function. Here, the neural network section 3 includes a second recurrent neural network that incorporates a first recurrent neural network including the dynamics model. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a control device and a control method, and more particularly to a control device and a control method that use a neural network.

Path integral control is known as one form of optimal control (see, for example, Non-Patent Document 1). Optimal control can be regarded as a framework for looking ahead to the future states and rewards of the system to be controlled and obtaining an optimal sequence of manipulated variables. Optimal control can be formulated as a constrained optimization problem.

Meanwhile, deep neural networks such as convolutional neural networks have been applied successfully to control tasks such as automated driving and robot operation.

Non-Patent Document 1: "Model Predictive Path Integral Control: From Theory to Parallel Computation" [retrieved September 29, 2017], Internet <URL: https://arc.aiaa.org/doi/full/10.2514/1.G001921>
Non-Patent Document 2: Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel, "Value Iteration Networks", NIPS 2016.

However, conventional optimal control such as that of Non-Patent Document 1 must specify the dynamics of the system and use a cost function in order to predict the future states and future rewards of the system, and describing the dynamics and the cost function is difficult.

In addition, optimal control cannot be achieved simply by using a deep neural network such as a convolutional neural network. This is because such a deep neural network, however much it is trained, only ever develops reflexive behavior.

The present disclosure has been made in view of the above circumstances, and its object is to provide a control device and a control method capable of performing optimal control using a neural network.

To solve the above problem, a control device according to one aspect of the present disclosure is a control device for performing optimal control by path integral. The control device includes: a neural network having a machine-learned dynamics model and cost function; an input unit that inputs, to the neural network, a current state of a control target and an initial manipulated variable sequence, which is a manipulated variable sequence whose components are a plurality of operation parameters for the control target; and an output unit that outputs a manipulated variable sequence for controlling the control target, which the neural network calculates by path integral from the current state and the initial manipulated variable sequence using the dynamics model and the cost function. The neural network consists of a second recurrent neural network that internally contains a first recurrent neural network having the dynamics model.

These general or specific aspects may be realized as a system, a method, an integrated circuit, a computer program, or a computer-readable recording medium such as a CD-ROM, or as any combination of a system, a method, an integrated circuit, a computer program, and a recording medium.

According to the control device and the like of the present disclosure, optimal control can be performed using a neural network.

A block diagram showing an example of the configuration of the control device in the embodiment.
A block diagram showing an example of the configuration of the neural network unit shown in FIG. 1.
A block diagram showing an example of the configuration of the calculation unit shown in FIG. 2.
A diagram showing an example of the detailed configuration of the calculation unit shown in FIG. 2.
A diagram showing an example of the detailed configuration of the Monte Carlo simulator unit shown in FIG. 3B.
A diagram showing an example of the detailed configuration of the second processing unit shown in FIG. 3B.
A flowchart showing the processing of the control device in the embodiment.
A diagram showing an example of a conceptual diagram of the learning processing in the embodiment.
A flowchart showing an outline of the learning processing in the embodiment.
A diagram showing control simulation results from an experiment.
A diagram showing the true cost function.
A diagram showing the cost function of the learned path integral control neural network.
A diagram showing the cost function of the learned neural network of a comparative example.
A block diagram showing an example of the configuration of the neural network unit in Modification 1.

(Background leading to one aspect of the present disclosure)
Optimal control is known as control that minimizes an evaluation function representing the quality of control. Optimal control can be regarded as a framework for looking ahead to the future states and rewards of the system to be controlled and obtaining an optimal sequence of manipulated variables. Optimal control can be formulated as a constrained optimization problem.

Path integral control is also known as one form of optimal control (see, for example, Non-Patent Document 1). Non-Patent Document 1 describes performing path integral control by mathematically solving the path integral as a stochastic optimal control problem, using a Monte Carlo approximation based on stochastic sampling of trajectories.

However, conventional optimal control such as that of Non-Patent Document 1 must use a model that specifies the dynamics of the system, together with a cost function, in order to predict the future states and future rewards of the system, and describing the dynamics and the cost function is difficult. If the model of the system is completely known, dynamics consisting of complicated equations and a large number of parameters can be described, but such cases are rare. Describing the large number of parameters is particularly difficult. Similarly, the cost function used to evaluate rewards can be described if every change in the environment between the current state and the future states of the system is completely known or can be completely simulated, but such cases are also rare. The cost function expresses, as a function with parameters such as weights, which states are desirable for achieving the intended control. For this reason, describing parameters such as weights optimally is particularly difficult.

Meanwhile, as described above, deep neural networks such as convolutional neural networks have in recent years been applied successfully to control tasks such as automated driving and robot operation. Such a deep neural network is trained to output a desired manipulated variable through imitation learning with teacher data or through reinforcement learning.

It is therefore conceivable to perform optimal control using a deep neural network such as a convolutional neural network. If optimal control could be performed with such a deep neural network, the network could presumably learn the dynamics and cost function needed for optimal control, or at least their parameters, which are particularly difficult to describe.

However, optimal control cannot be achieved simply by using a deep neural network such as a convolutional neural network. This is because such a deep neural network, however much it is trained, only ever develops reflexive behavior. In other words, no amount of training gives such a deep neural network a generalization ability such as look-ahead.

In view of the above circumstances, the inventor has conceived of a control device and a control method capable of performing optimal control using a neural network.

That is, a control device according to one aspect of the present disclosure is a control device for performing optimal control by path integral. The control device includes: a neural network having a machine-learned dynamics model and cost function; an input unit that inputs, to the neural network, a current state of a control target and an initial manipulated variable sequence, which is a manipulated variable sequence whose components are a plurality of operation parameters for the control target; and an output unit that outputs a manipulated variable sequence for controlling the control target, which the neural network calculates by path integral from the current state and the initial manipulated variable sequence using the dynamics model and the cost function. The neural network consists of a second recurrent neural network that internally contains a first recurrent neural network having the dynamics model.

With this configuration, a neural network consisting of doubly nested recurrent neural networks can be made to perform optimal control by path integral, so optimal control can be performed using a neural network.

Here, for example, the second recurrent neural network may include: a first processing unit that has the first recurrent neural network and the cost function, causes the first recurrent neural network to calculate a state at each time from the current state and the initial manipulated variable sequence by the Monte Carlo method, and calculates the costs of the plurality of states using the cost function; and a second processing unit that calculates a manipulated variable sequence for the control target based on the initial manipulated variable sequence and the costs of the plurality of states. The second processing unit outputs the calculated manipulated variable sequence to the output unit and also feeds it back to the second recurrent neural network as the initial manipulated variable sequence. The second recurrent neural network may then cause the first processing unit to calculate the costs of a plurality of states at each subsequent time from the current state and the manipulated variable sequence fed back by the second processing unit.

With this configuration, a neural network consisting of doubly nested recurrent neural networks can be made to perform optimal control by path integral using the Monte Carlo method.

Further, for example, the second recurrent neural network may also include a third processing unit that generates the random numbers used in the Monte Carlo method, and the third processing unit may output the generated random numbers to the first processing unit and the second processing unit.

Further, for example, the cost function may be a cost function model constituted by a neural network.

A control method according to one aspect of the present disclosure is a control method for a control device for performing optimal control by path integral, the control device including a neural network having a machine-learned dynamics model and cost function. The control method includes: an input step of inputting, to the neural network, a current state of a control target and an initial manipulated variable sequence, which is a manipulated variable sequence whose components are a plurality of operation parameters for the control target; and an output step of outputting a manipulated variable sequence for controlling the control target, which the neural network is caused to calculate by path integral from the current state and the initial manipulated variable sequence using the dynamics model and the cost function. The neural network consists of a second recurrent neural network that internally contains a first recurrent neural network having the dynamics model.

Here, for example, the control method may further include, before the input step, a learning step of machine-learning the dynamics model and the cost function. The learning step may include: a step of preparing, as teacher data, learning data that includes a state prepared in advance corresponding to the current state of the control target, an initial manipulated variable sequence prepared in advance corresponding to the initial manipulated variable sequence for the control target, and a manipulated variable sequence for controlling the control target calculated in advance by path integral from the prepared state and the prepared initial manipulated variable sequence; and a step of training the dynamics model and the cost function by using the teacher data to learn the weights of the neural network by error backpropagation.

As a result, a neural network consisting of doubly nested recurrent neural networks can be made to learn the dynamics and cost function needed for optimal control, or their parameters.

Here, for example, the cost function may be a cost function model constituted by a neural network.

Each of the embodiments described below shows one specific example of the present disclosure. The numerical values, shapes, components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims, which represent the broadest concept, are described as optional components. The contents of all the embodiments may also be combined with one another.

(Embodiment)
The control device, the control method, and the like according to the embodiment are described below with reference to the drawings.

[Configuration of Control Device 1]
FIG. 1 is a block diagram showing an example of the configuration of the control device 1 in the present embodiment. FIG. 2 is a block diagram showing an example of the configuration of the neural network unit 3 shown in FIG. 1.

The control device 1 is realized by a computer or the like that uses a neural network, and performs optimal control by path integral on the control target 50. As shown in FIG. 1, for example, the control device 1 includes an input unit 2, a neural network unit 3, and an output unit 4. Here, the control target 50 is the system subject to optimal control, such as a vehicle that drives automatically or a robot that moves autonomously.

<Input unit 2>
The input unit 2 inputs, to the neural network of the present disclosure, the current state of the control target and an initial manipulated variable sequence, which is a manipulated variable sequence whose components are a plurality of operation parameters for the control target.

In the present embodiment, the input unit 2 acquires, from the control target 50, the current state of the control target 50 and an initial manipulated variable sequence whose components are initial operation parameters for the control target 50, and inputs them to the neural network unit 3. The manipulated variable sequence denotes the time series of manipulated variables from time t_0 to t_{N-1}.

<Output unit 4>
The output unit 4 outputs the manipulated variable sequence for controlling the control target, which the neural network unit 3 calculates by path integral from the current state and the initial manipulated variable sequence using the machine-learned dynamics model and cost function. The dynamics model may be, for example, a dynamics model constituted by a neural network, or a function expressed by a mathematical formula. Similarly, the cost function may be, for example, a cost function model constituted by a neural network, or a function expressed by a mathematical formula. In other words, as long as they can be machine-learned in advance, the dynamics and the cost function may each be constituted by a neural network or by a function consisting of a formula and parameters.

In the present embodiment, the output unit 4 outputs, to the control target 50, the manipulated variable sequence obtained by updating the initial manipulated variable sequence that the input unit 2 acquired from the control target 50. In other words, starting from the initial manipulated variable sequence, the control device 1 outputs to the control target 50 a manipulated variable sequence that is the optimal sequence of manipulated variables, calculated by looking ahead to the future states and rewards of the control target 50.

<Neural network unit 3>
The neural network unit 3 is constituted by a neural network having a machine-learned dynamics model and cost function. The neural network unit 3 consists of a second recurrent neural network that internally contains a first recurrent neural network having the machine-learned dynamics model. In the following, the neural network unit 3 of the present disclosure is sometimes referred to as a path integral control neural network.

Using the machine-learned dynamics model and cost function, the neural network unit 3 calculates, by path integral, a manipulated variable sequence for controlling the control target from the current state and the initial manipulated variable sequence.

In the present embodiment, the neural network unit 3 includes a calculation unit 13, as shown in FIG. 2. The calculation unit 13 receives, via the input unit 2, the current state of the control target 50 and the initial manipulated variable sequence for the control target 50. Using the machine-learned dynamics model and cost function, the calculation unit 13 calculates, by path integral, a manipulated variable sequence that updates the initial manipulated variable sequence. The calculation unit 13 then takes the updated manipulated variable sequence as input again, in place of the initial manipulated variable sequence, and calculates a further updated manipulated variable sequence. In this way, the calculation unit 13 updates the manipulated variable sequence recursively, for example U times, to calculate the manipulated variable sequence for controlling the control target 50.
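The recursion described above can be sketched as a plain loop. This is a minimal Python sketch under stated assumptions, not the implementation of the calculation unit 13: `refine_once` is a hypothetical stand-in for one update pass, `toy_refine_once` is an invented rule that moves each manipulated variable halfway toward a fixed target, and `num_updates` plays the role of U. States and manipulated variables are scalars purely for brevity.

```python
from typing import Callable, List

Sequence = List[float]

def refine_sequence(
    state: float,
    initial_sequence: Sequence,
    refine_once: Callable[[float, Sequence], Sequence],
    num_updates: int,
) -> Sequence:
    """Feed each updated sequence back in as the next 'initial' sequence."""
    sequence = list(initial_sequence)
    for _ in range(num_updates):  # U recursive updates
        sequence = refine_once(state, sequence)
    return sequence

def toy_refine_once(state: float, sequence: Sequence) -> Sequence:
    # Hypothetical update rule (assumption): move halfway toward -state.
    target = -state
    return [u + 0.5 * (target - u) for u in sequence]

result = refine_sequence(1.0, [0.0, 0.0, 0.0], toy_refine_once, num_updates=20)
```

With this toy rule the sequence contracts toward the target on every pass, which illustrates why U is chosen large enough for the sequence to settle.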

The part of the calculation unit 13 that recursively updates the manipulated variable sequence corresponds to the recurrent neural network 13a. The recurrent neural network 13a is, for example, the second recurrent neural network.

The number of updates U is set large enough for the updated manipulated variable sequence to converge sufficiently. The dynamics model is assumed to be represented by a function f parameterized through machine learning. The cost function model is assumed to be represented by functions parameterized through machine learning, including a function φ.

FIG. 3A is a block diagram showing an example of the configuration of the calculation unit 13 shown in FIG. 2. FIG. 3B is a diagram showing an example of the detailed configuration of the calculation unit 13 shown in FIG. 2. FIG. 4 is a diagram showing an example of the detailed configuration of the Monte Carlo simulator unit 141 shown in FIG. 3B. FIG. 5 is a diagram showing an example of the detailed configuration of the second processing unit 15 shown in FIG. 3B.

As shown in FIG. 3A, for example, the calculation unit 13 includes a first processing unit 14, a second processing unit 15, and a third processing unit 16. As shown in FIG. 3B, for example, the calculation unit 13 may further include a storage unit 17 that stores the initial manipulated variable sequence input by the input unit and outputs it to the first processing unit 14 and the second processing unit 15.

<<First processing unit 14>>
The first processing unit 14 has the first recurrent neural network and the cost function. It causes the first recurrent neural network to calculate a state at each time from the current state and the initial manipulated variable sequence by the Monte Carlo method, and calculates the costs of the plurality of states using the cost function model. The first processing unit 14 also calculates the costs of a plurality of states at each subsequent time from the current state and the manipulated variable sequence that the second processing unit 15 feeds back to the second recurrent neural network.

In the present embodiment, the first processing unit 14 includes a Monte Carlo simulator unit 141 and a storage unit 142, as shown in FIG. 3B.

The Monte Carlo simulator unit 141 uses the path integral framework, in which Monte Carlo simulation stochastically samples a plurality of different time series of states. A time series of states is called a trajectory. As shown in FIG. 4, for example, the Monte Carlo simulator unit 141 uses the machine-learned dynamics model 1411 and the random numbers input from the third processing unit 16 to calculate, from the current state and the initial manipulated variable sequence, a time series of states whose components are the states at times later than the present. The Monte Carlo simulator unit 141 then takes the calculated time series of states as input again and updates it. In this way, the Monte Carlo simulator unit 141 recursively updates the time series of states, for example N times, to calculate the state at each time after the present. In addition, the Monte Carlo simulator unit 141 calculates, in the terminal cost calculation unit 1412, the cost of the state calculated in the N-th and final iteration, and outputs it to the storage unit 142 as the terminal cost.

More specifically, suppose, for example, that the dynamics model 1411, the cost function model 1413, and the terminal cost model 1412 are each expressed as parameterized functions, where α, β, R, and γ are the parameters of the dynamics model and the cost function model. In this case, the Monte Carlo simulator unit 141 first assigns the current state to the state at time t_i. The states are indexed by k, which identifies one of K states in total, and these K states are processed in parallel. Next, from the state at time t_i and the initial manipulated variable sequence, the Monte Carlo simulator unit 141 uses the dynamics model 1411 and the random numbers to calculate the state at time t_{i+1}, the time following t_i. The Monte Carlo simulator unit 141 then inputs the calculated state again as the state at time t_i, thereby updating the K states. Finally, in the terminal cost calculation unit 1412, the Monte Carlo simulator unit 141 inputs the state calculated in the N-th iteration into the terminal cost model 1412 and outputs the resulting terminal cost to the storage unit 142.
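The rollout described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the state is a scalar, the K trajectories are simulated sequentially rather than in parallel, the random number is added to the manipulated variable before it enters the dynamics (an assumption about how the noise is used), and the running cost depends on the state only. The function name `monte_carlo_rollout` and the toy models passed to it are hypothetical.

```python
import random
from typing import Callable, List, Tuple

def monte_carlo_rollout(
    x0: float,
    controls: List[float],
    dynamics: Callable[[float, float], float],
    running_cost: Callable[[float], float],
    terminal_cost: Callable[[float], float],
    num_samples: int,
    noise_std: float,
    rng: random.Random,
) -> Tuple[List[List[float]], List[List[float]], List[float]]:
    """Simulate K noisy trajectories over N steps.

    Returns the sampled noise, the evaluation costs of the states from
    iterations 1..N-1, and the terminal cost of the N-th state.
    """
    noises: List[List[float]] = []
    evaluation_costs: List[List[float]] = []
    terminal_costs: List[float] = []
    last = len(controls) - 1
    for _ in range(num_samples):              # K trajectories
        x = x0                                # start from the current state
        eps_k: List[float] = []
        cost_k: List[float] = []
        for i, u in enumerate(controls):      # N look-ahead steps
            eps = rng.gauss(0.0, noise_std)
            x = dynamics(x, u + eps)          # state at t_{i+1} from state at t_i
            eps_k.append(eps)
            if i < last:
                cost_k.append(running_cost(x))
        noises.append(eps_k)
        evaluation_costs.append(cost_k)
        terminal_costs.append(terminal_cost(x))
    return noises, evaluation_costs, terminal_costs

# Toy models (assumptions): linear dynamics and quadratic costs.
rng = random.Random(0)
noises, evaluation_costs, terminal_costs = monte_carlo_rollout(
    x0=1.0,
    controls=[0.0] * 5,
    dynamics=lambda x, u: x + 0.1 * u,
    running_cost=lambda x: x * x,
    terminal_cost=lambda x: 10.0 * x * x,
    num_samples=4,
    noise_std=0.5,
    rng=rng,
)
```

Here `len(controls)` corresponds to N and `num_samples` to K. In the device described in the text, the dynamics and cost callables would be the learned models rather than these toy lambdas.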

The Monte Carlo simulator unit 141 also uses the cost function model 1413 and the random numbers input from the third processing unit 16 to calculate, from the initial manipulated variable sequence, the evaluation costs, which are the costs of the plurality of states calculated at each time.

More specifically, the Monte Carlo simulator unit 141 uses the cost function model 1413 and the random numbers input from the third processing unit 16 to calculate, from the initial manipulated variable sequence, the costs of the plurality of states at each time calculated in the 1st to (N-1)-th updates, and outputs them to the storage unit 142 as evaluation costs.
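The per-step evaluation cost can be illustrated with the following minimal sketch. The quadratic `toy_cost` and the way the random number enters (as a perturbation of the manipulated variable) are assumptions for illustration; the cost function model 1413 of the embodiment is not reproduced here.

```python
def running_costs(trajectory, u_seq, noise, cost):
    """Return the cost of each visited state under the perturbed input."""
    return [cost(x, u + du) for x, u, du in zip(trajectory, u_seq, noise)]

# Hypothetical cost: state penalty plus a small input penalty (assumption).
def toy_cost(x, u, r=0.1):
    return x * x + 0.5 * r * u * u
```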

Note that the portion of the Monte Carlo simulator unit 141 that recursively calculates the plurality of states corresponds to the recurrent neural network 141a. The recurrent neural network 141a is, for example, the first recurrent neural network. N is the number of look-ahead time steps.

The storage unit 142 is, for example, a memory. It temporarily stores the evaluation costs, which are the costs of the plurality of states at each time over the N updates, and outputs them to the second processing unit 15.

<<Second processing unit 15>>
The second processing unit 15 calculates, based on the initial manipulated variable sequence and the costs of the plurality of states, a manipulated variable sequence for the control target at each time. The second processing unit 15 outputs the calculated manipulated variable sequence at each time to the output unit 4 and also feeds it back to the second recurrent neural network as the initial manipulated variable sequence.

In the present embodiment, the second processing unit 15 includes a cost integrating unit 151 and an operation system update unit 152, for example, as shown in FIG. 5.

The cost integrating unit 151 calculates an integrated cost by accumulating the costs of the plurality of states at each time over the N updates stored in the storage unit 142. More specifically, the cost integrating unit 151 uses (Equation 1) to calculate this integrated cost from the costs stored in the storage unit 142.
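One common way to accumulate such costs in path integral control is a cost-to-go, in which the cost from step i onward is the terminal cost plus all evaluation costs from step i. The sketch below assumes that form; it is not a reproduction of (Equation 1), and the function name `cost_to_go` is hypothetical.

```python
def cost_to_go(step_costs, terminal_cost):
    """Return S[i] = terminal_cost + sum(step_costs[i:]) for every step i."""
    total = terminal_cost
    out = [0.0] * len(step_costs)
    for i in range(len(step_costs) - 1, -1, -1):  # accumulate from the last step backward
        total += step_costs[i]
        out[i] = total
    return out
```

Accumulating backward keeps the work linear in the number of look-ahead steps for each of the K trajectories.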

The operation system update unit 152 calculates a manipulated variable sequence for the control target 50, obtained by updating the initial manipulated variable sequence, from the initial manipulated variable sequence, the costs of the plurality of states at each time over the N updates accumulated by the cost integrating unit 151, and the random numbers input from the third processing unit 16. More specifically, the operation system update unit 152 uses (Equation 2) to calculate the manipulated variable sequence for the control target 50 from the initial manipulated variable sequence, the integrated cost calculated by the cost integrating unit 151, and the random numbers input from the third processing unit 16.
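For illustration, a widely used path integral update adds to each manipulated variable a cost-weighted average of the sampled noise, with weights proportional to exp(-S/λ). The sketch below assumes that form and a temperature parameter `lam`; neither is taken from (Equation 2), which is not reproduced here.

```python
import math

def update_controls(u_seq, costs_to_go, noise, lam=1.0):
    """Return u_i + sum_k w_k * noise[k][i], with w_k a softmin over costs_to_go[k][i]."""
    K = len(noise)
    new_u = []
    for i, u in enumerate(u_seq):
        s = [costs_to_go[k][i] for k in range(K)]
        s_min = min(s)  # subtracting the minimum avoids underflow in exp
        w = [math.exp(-(sk - s_min) / lam) for sk in s]
        z = sum(w)
        new_u.append(u + sum(wk * noise[k][i] for k, wk in enumerate(w)) / z)
    return new_u
```

Low-cost trajectories dominate the average, so the sequence is pulled toward the perturbations that performed well.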

<<Third processing unit 16>>
The third processing unit 16 generates the random numbers used in the Monte Carlo method and outputs them to the first processing unit 14 and the second processing unit 15.

In the present embodiment, the third processing unit 16 includes a noise generation unit 161 and a storage unit 162, as shown in FIG. 3B.

The noise generation unit 161 generates, for example, Gaussian noise as the random numbers and stores them in the storage unit 162.

The storage unit 162 is, for example, a memory. It temporarily stores the random numbers and outputs them to the first processing unit 14 and the second processing unit 15.
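The reason for storing the random numbers can be seen in a small sketch: the same noise samples that perturb the rollouts must later be available when the manipulated variable sequence is updated. The class name `NoiseSource` and its interface are hypothetical.

```python
import random

class NoiseSource:
    """Generates Gaussian noise and keeps the latest batch for reuse."""

    def __init__(self, sigma, seed=None):
        self._rng = random.Random(seed)
        self.sigma = sigma
        self.buffer = []  # plays the role of the storage unit

    def generate(self, K, N):
        """Draw a K x N batch, store it, and return it."""
        self.buffer = [[self._rng.gauss(0.0, self.sigma) for _ in range(N)]
                       for _ in range(K)]
        return self.buffer
```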

[Operation of control device 1]
An example of the operation of the control device 1 configured as described above is described below.

FIG. 6 is a flowchart showing the processing of the control device 1 in the present embodiment. The control device 1 includes the path integral control neural network, which is the neural network of the present disclosure. The path integral control neural network has a machine-learned dynamics model and cost function. The path integral control neural network is also a double recurrent neural network. That is, as described above, it consists of a second recurrent neural network that internally contains a first recurrent neural network having the dynamics model.

First, the control device 1 inputs the current state of the control target 50 and an initial manipulated variable sequence, which is a manipulated variable sequence whose components are a plurality of operation parameters for the control target, into the path integral control neural network, which is the neural network of the present disclosure (S11).

Next, the control device 1 causes the path integral control neural network to calculate, by path integral and using the machine-learned dynamics model and cost function, a manipulated variable sequence for controlling the control target 50 from the current state and the initial manipulated variable sequence input in S11 (S12).

Then, the control device 1 outputs the manipulated variable sequence for controlling the control target 50 that was calculated by the path integral control neural network in S12 (S13).
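Steps S11 to S13 can be sketched end to end as follows, with an outer loop of U updates containing an inner loop of N look-ahead steps, mirroring the double recurrent structure. This is a minimal sketch under assumptions: the exponential weighting, the temperature `lam`, the noise scale `sigma`, and all function and parameter names are illustrative and are not taken from the embodiment.

```python
import math
import random

def path_integral_control(x0, u_init, dynamics, cost, terminal_cost,
                          K=32, U=5, sigma=0.5, lam=1.0, seed=0):
    """Refine u_init by U sampling-based updates and return the result."""
    rng = random.Random(seed)
    N = len(u_init)
    u_seq = list(u_init)
    for _ in range(U):  # outer recurrence: U updates of the sequence
        noise = [[rng.gauss(0.0, sigma) for _ in range(N)] for _ in range(K)]
        to_go = []
        for k in range(K):
            x, step_costs = x0, []
            for i in range(N):  # inner recurrence: N look-ahead steps
                u = u_seq[i] + noise[k][i]
                step_costs.append(cost(x, u))
                x = dynamics(x, u)
            total, s = terminal_cost(x), [0.0] * N
            for i in range(N - 1, -1, -1):  # cost-to-go from each step
                total += step_costs[i]
                s[i] = total
            to_go.append(s)
        for i in range(N):  # cost-weighted average of the noise
            s_min = min(to_go[k][i] for k in range(K))
            w = [math.exp(-(to_go[k][i] - s_min) / lam) for k in range(K)]
            u_seq[i] += sum(w[k] * noise[k][i] for k in range(K)) / sum(w)
        # u_seq is fed back as the initial sequence of the next update
    return u_seq
```

In a receding-horizon setting, one would typically apply the first element of the returned sequence to the control target and call the function again from the newly observed state.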

[Learning process]
In the present disclosure, attention was paid to the path integral controller, which is one type of optimal controller, in order to have a neural network learn the dynamics and cost function required for optimal control, or their parameters. Because the functions formulated to realize a path integral controller are differentiable, the chain rule, which is the differentiation formula for composite functions, can be applied. A deep neural network, in turn, can be interpreted as a composite function that is a large collection of differentiable functions and that can be trained by the chain rule. It follows that a deep neural network of arbitrary shape can be constructed as long as the principle of differentiability is observed.

From the above, it was conceived that, because the path integral controller is formulated with differentiable functions and the chain rule can be applied, it can be realized as a deep neural network in which all parameters can be learned by backpropagation, that is, the error backpropagation method. More specifically, a recurrent neural network, which is one type of deep neural network, can be interpreted as a neural network in which the same function is executed multiple times in series, that is, in which the functions are arranged in series. From this, it was conceived that the path integral controller can be expressed as a recurrent neural network.
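The point that a function applied repeatedly in series remains trainable by the chain rule can be checked on a toy recurrence. The sketch below assumes a scalar update x_{j+1} = a * x_j + u_j and a loss equal to the square of the final state; both are illustrative choices, not the models of the embodiment.

```python
def rollout_loss(a, x0, us):
    """Apply x <- a * x + u for each u in series and return the squared final state."""
    x = x0
    for u in us:
        x = a * x + u
    return x * x

def grad_bptt(a, x0, us):
    """Gradient of rollout_loss with respect to a, by the chain rule through every step."""
    xs = [x0]
    for u in us:  # forward pass: remember every intermediate state
        xs.append(a * xs[-1] + u)
    g = 2.0 * xs[-1]  # derivative of the loss with respect to the final state
    grad = 0.0
    for j in range(len(us) - 1, -1, -1):  # backward pass through the same function
        grad += g * xs[j]  # direct dependence of step j on the parameter a
        g *= a             # propagate the sensitivity to the previous state
    return grad
```

The gradient obtained this way agrees with a finite-difference estimate, which is the property that lets the whole recurrent structure be trained by error backpropagation.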

This makes it possible to have a neural network learn the dynamics and cost function required for path integral control, or their parameters. Furthermore, as described above, path integral control, that is, optimal control by path integral, can be realized by using a neural network having the learned dynamics, cost function, and so on.

The learning process for the parameters of the dynamics and cost function required for path integral control is described below.

FIG. 7 is a diagram showing an example of a conceptual diagram of the learning process in the present embodiment. The neural network unit 3b has a dynamics model and a cost function model that have not yet been trained. Once trained, these models can be used as the dynamics model and cost function model of the neural network unit 3 that constitutes the control device 1.

FIG. 7 shows an example in which a learning process is performed that trains the dynamics model and the cost function model of the neural network unit 3b by the error backpropagation method using teacher data 5. When no teacher data is available, the learning process may be performed using reinforcement learning.

FIG. 8 is a flowchart showing an outline of the learning process S10 in the present embodiment.

In the learning process S10, learning data is first prepared (S101). More specifically, learning data is prepared that includes a previously prepared state corresponding to the current state of the control target 50, a previously prepared initial manipulated variable sequence corresponding to the initial manipulated variable sequence for the control target 50, and a manipulated variable sequence for controlling the control target that has been calculated in advance by path integral from the previously prepared state and the previously prepared initial manipulated variable sequence. In the present embodiment, the operation history of an expert, which includes sets of states and manipulated variable sequences, is prepared as the learning data.

Next, the computer trains the dynamics model and the cost function model by training the weights of the neural network unit 3b by the error backpropagation method, using the prepared learning data as teacher data (S102). More specifically, using the learning data, the computer causes the neural network unit 3b to calculate a manipulated variable sequence by path integral from the previously prepared state and the previously prepared initial manipulated variable sequence included in the learning data. The computer then evaluates, with a previously prepared evaluation function or the like, the error between the manipulated variable sequence calculated by the neural network unit 3b and the previously prepared manipulated variable sequence included in the learning data, and updates the parameters of the dynamics model and the cost function model so that the error decreases. The computer continues to adjust or update the parameters of the dynamics model and the cost function model until the error evaluated in the learning process reaches a minimum or stops changing.
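The update-until-the-error-stops-changing loop of S102 can be illustrated with a deliberately small stand-in. In place of the path integral control neural network, the sketch assumes a one-parameter controller u = -w * x, and in place of automatic backpropagation it uses a hand-derived gradient of the mean squared error; the learning rate and tolerance are also assumptions.

```python
def train(data, w=0.0, lr=0.05, tol=1e-12, max_iters=10000):
    """Fit u = -w * x to (x, u_teacher) pairs by gradient descent on the MSE."""
    def loss(w):
        return sum((-w * x - u) ** 2 for x, u in data) / len(data)

    cur = loss(w)
    for _ in range(max_iters):
        # Gradient of the MSE with respect to w, derived by hand for this toy model.
        grad = sum(2.0 * (-w * x - u) * (-x) for x, u in data) / len(data)
        w -= lr * grad
        prev, cur = cur, loss(w)
        if abs(prev - cur) < tol:  # the error no longer changes: stop updating
            break
    return w, cur
```

With teacher data generated by u = -2x, the loop recovers a value of w close to 2.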

In this way, the computer trains the dynamics model and the cost function model of the neural network unit 3b by the error backpropagation method, in which the error is evaluated with a previously prepared evaluation function or the like and the parameters of the dynamics model are repeatedly updated so that the error decreases.

In the present embodiment, performing the learning process S10 in this way allows the neural network unit 3 used in the control device 1 to learn the dynamics model and the cost function model.

Note that when the teacher data includes data in the form of (state, operation, next state) tuples, the dynamics model can be trained independently by supervised learning using that data. It is also possible to incorporate the independently trained dynamics model into the neural network unit 3, fix the parameters of the dynamics model, and then train only the cost function model using the learning process S10. Because methods for supervised learning of a dynamics model are well known, their description is omitted.
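As one simple instance of such independent supervised learning, a linear dynamics model can be fitted to (state, operation, next state) tuples by least squares. The linear form x_next = a * x + b * u and the function name are assumptions for this sketch; the embodiment uses a neural network model.

```python
def fit_linear_dynamics(transitions):
    """Least-squares fit of x_next = a * x + b * u from (x, u, x_next) tuples."""
    sxx = sum(x * x for x, u, y in transitions)
    sxu = sum(x * u for x, u, y in transitions)
    suu = sum(u * u for x, u, y in transitions)
    sxy = sum(x * y for x, u, y in transitions)
    suy = sum(u * y for x, u, y in transitions)
    det = sxx * suu - sxu * sxu  # determinant of the 2 x 2 normal equations
    if det == 0.0:
        raise ValueError("transitions do not determine a and b uniquely")
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return a, b
```

The fitted parameters would then be held fixed while only the cost function model is trained.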

Hereinafter, the neural network unit 3 is referred to as the path integral control neural network, which is the neural network of the present disclosure.

[Experimental verification]
The effectiveness of the path integral control neural network having the learned dynamics and cost function model was verified by experiment. The results are described below.

One optimal control problem is swing-up control of a simple pendulum, in which a simple pendulum hanging downward is swung until it reaches the inverted position. In this experiment, the dynamics and cost function used for pendulum swing-up control were learned by imitation learning from an expert's teacher data, and the effectiveness was verified by performing pendulum swing-up control in simulation.

<Teacher data>
In this experiment, the expert is assumed to be an optimal controller having the true dynamics and cost function. The true dynamics are given by (Equation 3), and the cost function is given by (Equation 4).

Here, θ is the angle of the pendulum, k is a model parameter, and u is the torque, that is, the control input.
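To make the setup concrete, a pendulum of this kind can be simulated as below. The specific forms used, namely an angular acceleration of -sin(θ) + k * u with θ = 0 hanging downward, and a cost that vanishes only in the upright position at rest, are assumptions chosen for illustration and should not be read as (Equation 3) or (Equation 4); the step size `dt` and the weight `r` are likewise assumed.

```python
import math

def pendulum_step(theta, omega, u, k=1.0, dt=0.01):
    """One explicit Euler step of theta'' = -sin(theta) + k * u (assumed form)."""
    alpha = -math.sin(theta) + k * u
    return theta + dt * omega, omega + dt * alpha

def swing_up_cost(theta, omega, u, r=1.0):
    """Assumed cost: zero only when upright (theta = pi), at rest, with no torque."""
    return (1.0 + math.cos(theta)) ** 2 + omega ** 2 + 0.5 * r * u * u
```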

<Experimental results>
FIG. 9 is a diagram showing the results of the control simulation in this experiment.

In this experiment, the dynamics and the cost function were each represented by a neural network having one hidden layer. The dynamics were first trained independently on the teacher data by the method described above, and the cost function was then trained by the error backpropagation method so as to produce the desired output. The path integral control neural network trained in this way is labeled "Trained" under Controllers in FIG. 9. A path integral control neural network in which the dynamics were trained independently on the same teacher data and which was given the true cost function of (Equation 4) without any cost function learning is labeled "Freezed" under Controllers in FIG. 9. The VIN (Value Iteration Network) described in Non-Patent Document 2 is shown as the comparative example under Controllers in FIG. 9. As described in Non-Patent Document 2, a VIN is a neural network whose state transition model and reward model are learned by the error backpropagation method. In this experiment, the VIN was trained on the same teacher data, with the state transition model serving as the dynamics and the reward model as the cost function.

The item "MSE For D_train" in FIG. 9 indicates the error on the teacher data, and the item "MSE For D_test" indicates the error on the evaluation data, that is, the generalization error. The item "Success Rate" indicates the success rate of the swing-up, where 100% means that the swing-up succeeded when control was actually performed. The item "traj.Cost S(τ)" indicates the accumulated cost, that is, the cost of the trajectory until the downward-hanging simple pendulum reaches the swung-up, inverted state. The item "trainable params" indicates the number of parameters.
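For reference, the mean squared error reported in the MSE items is the average squared difference between the manipulated variables output by a controller and those in the data. A minimal sketch, with a hypothetical function name, is as follows.

```python
def mse(predicted, target):
    """Mean squared error between two equally long sequences of numbers."""
    if len(predicted) != len(target):
        raise ValueError("sequences must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

Evaluating the same function on held-out evaluation data rather than on the teacher data gives the generalization error.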

As FIG. 9 shows, "Trained" has the highest generalization performance. The lower generalization performance of "Freezed" compared with "Trained" is thought to be because the dynamics learned in the first learning process were not optimized in the second learning process. In other words, in "Freezed", the error in the dynamics learned in the first learning process is thought to degrade the generalization performance.

In the comparative example, on the other hand, the success rate of swing-up control is 0%, and the swing-up does not succeed. This is thought to be because the number of parameters to be learned becomes too large in the comparative example, causing a state explosion. This shows that it is difficult to learn the dynamics and cost function with the neural network of the comparative example.

Next, the learning results in this experiment are described with reference to FIGS. 10A to 10C.

FIG. 10A is a diagram showing the true cost function, visualizing the cost function given by (Equation 4). FIG. 10B is a diagram showing the cost function of the trained path integral control neural network, visualizing the cost function learned by "Trained" in this experiment. FIG. 10C is a diagram showing the cost function of the trained neural network of the comparative example, visualizing the cost function learned by the comparative example in this experiment.

As a comparison of FIG. 10A and FIG. 10B shows, the cost function of "Trained", that is, the cost function of the path integral control neural network, has been learned with a shape close to that of the true cost function.

In contrast, as FIG. 10C shows, the cost function of the comparative example has no discernible shape. This indicates that the neural network of the comparative example cannot learn the cost function.

These experimental results show that the path integral control neural network, which is the neural network of the present disclosure, can learn a cost function whose shape is close to that of the true cost function. They also show that the path integral control neural network using the learned cost function has high generalization performance.

From the above, the path integral control neural network, which is the neural network of the present disclosure, can not only learn the dynamics and cost function required for optimal control but can also acquire generalization performance and look ahead.

[Effects]
As described above, by using the path integral control neural network, which is the neural network of the present disclosure and consists of a double recurrent neural network, the dynamics and cost function required for optimal control by path integral, or their parameters, can be learned. In addition, because the path integral control neural network can acquire high generalization performance through imitation learning, using it makes it possible to realize a control device or the like that can also look ahead. In other words, according to the control device and control method of the present embodiment, a neural network consisting of a double recurrent neural network can be made to perform optimal control by path integral, so that optimal control by path integral can be performed using a neural network.

Furthermore, as described above, existing methods for training neural networks, such as the error backpropagation method, can be used to learn the dynamics and cost function of the path integral control neural network. In other words, according to the control device and control method of the present embodiment, parameters that are difficult to specify by hand, such as the dynamics and cost function required for optimal control, can be learned easily using existing learning methods.

In addition, because the control device and control method of the present embodiment use a path integral control neural network that can be expressed as a differentiable composite function, continuous control, in which the state and operation of the control target take continuous values, can be realized. For the same reason, the cost function can be expressed flexibly. That is, the cost function can be expressed as a neural network model, and even when it is expressed as a mathematical formula, it can still be learned using the neural network.

(Modification 1)
In the above embodiment, the neural network unit 3 was described as having only the calculation unit 13 and outputting the manipulated variable sequence calculated by the calculation unit 13, but the present disclosure is not limited to this. The manipulated variable sequences calculated by the calculation unit 13 may be averaged and then output. This case is described below as Modification 1, focusing on the differences from the embodiment.

[Neural network unit 30]
FIG. 11 is a block diagram showing an example of the configuration of the neural network unit 30 in Modification 1. Elements that are the same as in FIG. 2 are given the same reference numerals, and their detailed description is omitted.

The neural network unit 30 shown in FIG. 11 differs from the neural network unit 3 shown in FIG. 2 in that it further includes a multiplier 31, an adder 32, and a delay unit 33.

<Multiplier 31>
The multiplier 31 multiplies the manipulated variable sequence calculated by the calculation unit 13 by a weight and outputs the result to the adder 32. More specifically, each time the calculation unit 13 updates the manipulated variable sequence, the multiplier 31 multiplies it by a weight w_i and outputs the result to the adder 32. As described above, the calculation unit 13 calculates the manipulated variable sequence for controlling the control target by recursively updating the manipulated variable sequence U times. Because a manipulated variable sequence updated later by the calculation unit 13 has less variation, the weights w_i are determined so that they satisfy (Equation 5) and increase with the number of updates performed by the calculation unit 13.

<Adder 32>
The adder 32 adds the weighted manipulated variable sequence output by the multiplier 31 to the sum of the weighted manipulated variable sequences previously output by the multiplier 31, and outputs the result. More specifically, by adding up all the weighted manipulated variable sequences output by the multiplier 31, the adder 32 outputs, as the output of the neural network unit 30, an average manipulated variable sequence obtained by weighting and averaging all the manipulated variable sequences output by the calculation unit 13.

<Delay unit 33>
The delay unit 33 delays the addition result of the adder 32 by a fixed time and supplies it to the adder 32 at the timing of the next update. In this way, the delay unit 33 allows the adder 32 to accumulate all the weighted manipulated variable sequences output by the multiplier 31, so that the adder 32 weights and averages all the manipulated variable sequences output by the calculation unit 13.
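The cooperation of the multiplier, adder, and delay unit can be sketched as a running weighted sum. The weights below are proportional to the update index and normalized to sum to 1; that normalization is an assumed reading of the condition imposed by (Equation 5), which is not reproduced here, and the function names are hypothetical.

```python
def increasing_weights(U):
    """Weights proportional to 1..U, normalized to sum to 1 (assumed condition)."""
    total = U * (U + 1) / 2
    return [(i + 1) / total for i in range(U)]

def weighted_average(sequences, weights):
    """Accumulate weight * sequence over all updates, one update at a time."""
    acc = [0.0] * len(sequences[0])  # value held by the delay unit between updates
    for seq, w in zip(sequences, weights):
        scaled = [w * u for u in seq]               # multiplier
        acc = [a + s for a, s in zip(acc, scaled)]  # adder, fed by the delayed sum
    return acc
```

Because later updates receive larger weights, the averaged output stays close to the final, least variable sequence while still depending on the earlier ones.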

The other configurations and operations of the control device of this modification are as described for the control device 1 of the above embodiment.

[Effects]
According to the control device of this modification, the manipulated variable sequence updated by the calculation unit 13 is not output as it is. Instead, the manipulated variable sequences are multiplied by weights that are larger for later updates, accumulated, and output. This takes advantage of the fact that the more updates have been performed, the less variation the manipulated variable sequence has. In other words, even when the gradient is attenuated as a result of training the recurrent neural network by the error backpropagation method, the problem can be addressed by averaging with larger weights on the manipulated variable sequences updated later.

(Possibility of other embodiments)
The control device and control method of the present disclosure have been described above by way of the embodiment, but the present disclosure is not limited to the above embodiment. For example, other embodiments realized by arbitrarily combining the constituent elements described in this specification, or by excluding some of them, may also be embodiments of the present disclosure. The present disclosure also includes modifications obtained by applying to the above embodiment various changes conceivable by those skilled in the art without departing from the gist of the present disclosure, that is, the meaning of the wording of the claims.

The present disclosure further includes the following cases.

（１）上記の装置は、具体的には、マイクロプロセッサ、ＲＯＭ、ＲＡＭ、ハードディスクユニット、ディスプレイユニット、キーボード、マウスなどから構成されるコンピュータシステムである。前記ＲＡＭまたはハードディスクユニットには、コンピュータプログラムが記憶されている。前記マイクロプロセッサが、前記コンピュータプログラムにしたがって動作することにより、各装置は、その機能を達成する。ここでコンピュータプログラムは、所定の機能を達成するために、コンピュータに対する指令を示す命令コードが複数個組み合わされて構成されたものである。 (1) Specifically, the above apparatus is a computer system including a microprocessor, ROM, RAM, a hard disk unit, a display unit, a keyboard, a mouse, and the like. A computer program is stored in the RAM or hard disk unit. Each device achieves its functions by the microprocessor operating according to the computer program. Here, the computer program is configured by combining a plurality of instruction codes indicating instructions for the computer in order to achieve a predetermined function.

（２）上記の装置を構成する構成要素の一部または全部は、１個のシステムＬＳＩ（ＬａｒｇｅＳｃａｌｅＩｎｔｅｇｒａｔｉｏｎ：大規模集積回路）から構成されているとしてもよい。システムＬＳＩは、複数の構成部を１個のチップ上に集積して製造された超多機能ＬＳＩであり、具体的には、マイクロプロセッサ、ＲＯＭ、ＲＡＭなどを含んで構成されるコンピュータシステムである。前記ＲＡＭには、コンピュータプログラムが記憶されている。前記マイクロプロセッサが、前記コンピュータプログラムにしたがって動作することにより、システムＬＳＩは、その機能を達成する。 (2) A part or all of the constituent elements constituting the above-described apparatus may be configured by one system LSI (Large Scale Integration). The system LSI is an ultra-multifunctional LSI manufactured by integrating a plurality of components on a single chip, and specifically, a computer system including a microprocessor, ROM, RAM, and the like. . A computer program is stored in the RAM. The system LSI achieves its functions by the microprocessor operating according to the computer program.

(3) Some or all of the components constituting the above device may be configured as an IC card that can be attached to and detached from each device, or as a stand-alone module. The IC card or the module is a computer system composed of a microprocessor, a ROM, a RAM, and the like. The IC card or the module may include the super-multifunctional LSI described above. The IC card or the module achieves its functions by the microprocessor operating according to the computer program. The IC card or the module may be tamper resistant.

(4) The present disclosure may also be the methods described above. It may also be a computer program that implements these methods on a computer, or a digital signal composed of the computer program.

(5) The present disclosure may also be the computer program or the digital signal recorded on a computer-readable recording medium, for example, a flexible disk, a hard disk, a CD-ROM, an MO, a DVD, a DVD-ROM, a DVD-RAM, a BD (Blu-ray (registered trademark) Disc), or a semiconductor memory. It may also be the digital signal recorded on these recording media.

The present disclosure may also transmit the computer program or the digital signal via a telecommunication line, a wireless or wired communication line, a network typified by the Internet, data broadcasting, or the like.

The present disclosure may also be a computer system including a microprocessor and a memory, in which the memory stores the computer program and the microprocessor operates according to the computer program.

The present disclosure may also be implemented by another independent computer system, either by recording the program or the digital signal on the recording medium and transferring it, or by transferring the program or the digital signal via the network or the like.

The present disclosure can be used for a control device and a control method that perform optimal control. In particular, it can be used for a control device and a control method in which a deep neural network learns parameters that are difficult to describe, such as the dynamics and the cost function, and the deep neural network then performs optimal control using the learned dynamics and cost function.
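As a rough illustration of optimal control by path integral with a dynamics model and a cost function, the following Python sketch iterates one commonly used form of the path integral update: noise is generated, a Monte Carlo simulation rolls out perturbed operation amount sequences, the costs are accumulated, and the sequence is updated and fed back. It is only a sketch under stated assumptions. The `dynamics` and `cost` functions are hand-written stand-ins for the machine-learned models, the exponential weighting with temperature `lam` is an assumed update rule rather than one quoted from this section, and all function and parameter names are hypothetical.

```python
import numpy as np

def dynamics(x, u):
    # Hypothetical stand-in for the learned dynamics model: a 1-D point mass
    # with state (position, velocity), scalar force u, and time step 0.1.
    dt = 0.1
    pos, vel = x[..., 0], x[..., 1]
    return np.stack([pos + dt * vel, vel + dt * u[..., 0]], axis=-1)

def cost(x):
    # Hypothetical stand-in for the learned cost function: squared distance
    # of the position from 1.0 plus a small velocity penalty.
    return (x[..., 0] - 1.0) ** 2 + 0.1 * x[..., 1] ** 2

def rollout_cost(x0, u):
    # Accumulated cost of applying the sequence u (horizon x 1) from x0.
    x, total = x0, 0.0
    for t in range(u.shape[0]):
        x = dynamics(x, u[t])
        total += cost(x)
    return float(total)

def path_integral_step(x0, u, rng, n_samples=256, sigma=1.0, lam=1.0):
    horizon, n_u = u.shape
    # Noise generation: one perturbation per sample, time step and component.
    eps = rng.normal(0.0, sigma, size=(n_samples, horizon, n_u))
    # Monte Carlo simulation: roll out all perturbed sequences in parallel.
    x = np.tile(x0, (n_samples, 1))
    total = np.zeros(n_samples)
    for t in range(horizon):
        x = dynamics(x, u[t] + eps[:, t])
        total += cost(x)
    # Assumed update rule: weight each sample by exp(-cost / lam), then add
    # the weighted average perturbation to the current sequence.
    w = np.exp(-(total - total.min()) / lam)
    w /= w.sum()
    return u + np.einsum("k,ktm->tm", w, eps)

def path_integral_control(x0, u_init, n_iters=20, seed=0):
    # Feed the updated sequence back as the next initial sequence.
    rng = np.random.default_rng(seed)
    u = u_init.copy()
    for _ in range(n_iters):
        u = path_integral_step(x0, u, rng)
    return u
```

Calling `path_integral_control(np.array([0.0, 0.0]), np.zeros((20, 1)))` returns an updated operation amount sequence of the same shape as the initial one. Subtracting the minimum cost before exponentiating is a numerical-stability choice that leaves the normalized weights unchanged.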

DESCRIPTION OF SYMBOLS
1 Control device
2 Input unit
3, 3b, 30 Neural network unit
4 Output unit
5 Teacher data
13 Calculation unit
13a, 141a Recurrent neural network
14 First processing unit
15 Second processing unit
16 Third processing unit
17, 142, 162 Storage unit
31 Multiplier
32 Adder
33 Delay unit
50 Control target
141 Monte Carlo simulator unit
151 Cost integration unit
152 Operation system update unit
161 Noise generation unit
1411 Dynamics model
1413 Cost function model

Claims

A control device for performing optimal control by path integral, the control device comprising:
a neural network having a machine-learned dynamics model and a machine-learned cost function;
an input unit that inputs, to the neural network, a current state of a control target and an initial operation amount sequence, which is an operation amount sequence having a plurality of operation parameters for the control target as components; and
an output unit that outputs an operation amount sequence for controlling the control target, the operation amount sequence being calculated by the neural network by path integral from the current state and the initial operation amount sequence using the dynamics model and the cost function,
wherein the neural network comprises a second recurrent neural network that has therein a first recurrent neural network having the dynamics model.

The control device according to claim 1, wherein the second recurrent neural network comprises:
a first processing unit that has the first recurrent neural network and the cost function, causes the first recurrent neural network to calculate a plurality of states at each time from the current state and the initial operation amount sequence by a Monte Carlo method, and calculates costs of the plurality of states using the cost function; and
a second processing unit that calculates an operation amount sequence for the control target based on the initial operation amount sequence and the costs of the plurality of states,
wherein the second processing unit outputs the calculated operation amount sequence to the output unit and feeds it back to the second recurrent neural network as the initial operation amount sequence, and
the second recurrent neural network causes the first processing unit to calculate, from the operation amount sequence fed back by the second processing unit and the current state, the costs of a plurality of states at each time following said each time.

The control device according to claim 2, wherein the second recurrent neural network further comprises
a third processing unit that generates random numbers used in the Monte Carlo method, and
the third processing unit outputs the generated random numbers to the first processing unit and the second processing unit.

The control device according to claim 1, wherein the cost function is a cost function model composed of a neural network.

A control method of a control device for performing optimal control by path integral, the control device comprising a neural network having a machine-learned dynamics model and a machine-learned cost function, the control method comprising:
an input step of inputting, to the neural network, a current state of a control target and an initial operation amount sequence, which is an operation amount sequence having a plurality of operation parameters for the control target as components; and
an output step of outputting an operation amount sequence for controlling the control target, the operation amount sequence being calculated by the neural network by path integral from the current state and the initial operation amount sequence using the dynamics model and the cost function,
wherein the neural network comprises a second recurrent neural network that has therein a first recurrent neural network having the dynamics model.

The control method according to claim 5, further comprising, before the input step, a learning step of machine-learning the dynamics model and the cost function,
wherein the learning step includes:
preparing learning data that includes a state prepared in advance corresponding to the current state of the control target, an initial operation amount sequence prepared in advance corresponding to the initial operation amount sequence for the control target, and, as teacher data, an operation amount sequence for controlling the control target that is calculated in advance by path integral from the state prepared in advance and the initial operation amount sequence prepared in advance; and
learning the dynamics model and the cost function by learning the weights of the neural network by error backpropagation using the teacher data.

The control method according to claim 5 or 6, wherein the cost function is a cost function model composed of a neural network.