JP4974330B2 - Control device

Info

Publication number: JP4974330B2
Application number: JP2006053671A
Authority: JP
Inventors: 孝朗関合; 悟清水; 栄一神永
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2006-02-28
Filing date: 2006-02-28
Publication date: 2012-07-11
Anticipated expiration: 2026-02-28
Also published as: JP2007233634A; CN101030074A; CN101477332B; CN101477332A; CN100483275C

Description

本発明は、強化学習制御技術に係り、特に、学習初期段階でも安全に制御対象を運転操作することのできる強化学習制御技術に関する。 The present invention relates to a reinforcement learning control technique, and more particularly to a reinforcement learning control technique that can safely operate a control target even in an initial learning stage.

近年、教師なし学習の分野で、強化学習と呼ばれる手法が盛んに研究されている。強化学習とは、制御対象などの環境との試行錯誤的な相互作用を通じて、環境から得られる計測信号が望ましいものとなるように、環境への操作信号の生成方法を学習する学習制御の枠組みとして知られている。 In recent years, a technique called reinforcement learning has been actively studied in the field of unsupervised learning. Reinforcement learning is known as a learning control framework that, through trial-and-error interaction with an environment such as a control target, learns how to generate operation signals to the environment so that the measurement signals obtained from the environment become desirable.

強化学習では、環境から得られる計測信号に基づいて計算されるスカラー量の評価値（強化学習では、報酬と呼ばれている）を手がかりに、現状態から将来までに得られる評価値の期待値が最大または最小となるような環境への操作信号の生成方法を学習する。このような学習機能を実装する方法として、例えば、非特許文献１に述べられているActor-Critic、Q学習、実時間Dynamic Programmingなどのアルゴリズムが知られている。 In reinforcement learning, a scalar evaluation value calculated from the measurement signals obtained from the environment (called a reward in reinforcement learning) is used as a cue to learn how to generate operation signals to the environment such that the expected value of the evaluation values obtained from the current state into the future is maximized or minimized. Known algorithms for implementing such a learning function include, for example, Actor-Critic, Q-learning, and real-time Dynamic Programming, described in Non-Patent Document 1.

また、上述の手法を発展させた強化学習の枠組みとして、Dyna-アーキテクチャと呼ばれる枠組みが上記文献に紹介されている。これは、制御対象を模擬するモデルを対象にどのような操作信号を生成するのが良いかを予め学習し、この学習結果を用いて制御対象に印加する操作信号を決定する方法である。また、制御対象とモデルの誤差を小さくするように、制御対象への操作信号と計測信号を用いて、モデルを調整する機能を持っている。 In addition, a framework called Dyna-architecture is introduced in the above-mentioned document as a framework for reinforcement learning, which is an extension of the above-described method. This is a method of learning in advance what kind of operation signal should be generated for a model simulating a control target, and determining an operation signal to be applied to the control target using the learning result. In addition, it has a function of adjusting the model using the operation signal and the measurement signal to the control target so as to reduce the error between the control target and the model.

また、強化学習を適用した技術として、特許文献１に述べられている技術が挙げられる。これは、モデルと学習機能を有するシステムの組である強化学習モジュールを複数備えておき、各強化学習モジュールにおけるモデルと制御対象との予測誤差が少ないものほど大きな値を取る責任信号を求め、この責任信号に比例して各強化学習モジュールから生成される制御対象への操作信号を重み付けし、制御対象に印加する操作信号を決定する技術である。 Another technique applying reinforcement learning is described in Patent Document 1. It provides a plurality of reinforcement learning modules, each a pairing of a model and a system having a learning function, and obtains for each module a responsibility signal that takes a larger value the smaller the prediction error between that module's model and the control target. The operation signals generated by the modules are weighted in proportion to their responsibility signals to determine the operation signal applied to the control target.

非特許文献１: 強化学習(Reinforcement Learning)、三上貞芳・皆川雅章共訳、森北出版株式会社、2000年12月20日出版 Non-Patent Document 1: Reinforcement Learning, translated by Sadayoshi Mikami and Masaaki Minagawa, Morikita Publishing Co., Ltd., published December 20, 2000
特許文献１: 特開２０００−３５９５６号公報 Patent Document 1: JP 2000-35956 A

上述のDyna-アーキテクチャ、あるいは特許文献１に述べられている技術を用いて、制御対象との試行錯誤による相互作用を通した学習を実施すると、学習を進めるに従って制御対象に対して良好な操作信号の生成方法を学習できる。しかし、学習の初期段階では、いずれの手法も試行錯誤的な操作信号を制御対象に与える必要があり、その間は、制御対象を安全に運転できない可能性がある。 When learning is performed through trial-and-error interaction with the control target using the above-described Dyna-architecture or the technique described in Patent Document 1, a method of generating good operation signals for the control target can be learned as the learning proceeds. However, in the initial stage of learning, either method needs to apply trial-and-error operation signals to the control target, and during that time the control target may not be operated safely.

また、制御対象とモデルの特性が大きく異なる場合、モデルに対して有効な操作信号が、制御対象に対しても有効となるとは限らない。そのため、制御対象を良好に制御できない可能性がある。 Further, when the characteristics of the controlled object and the model are greatly different, an operation signal effective for the model is not always effective for the controlled object. Therefore, there is a possibility that the controlled object cannot be controlled well.

そこで、本発明では、学習初期段階でも制御対象を安全に運転可能な操作信号の生成方法を学習することのできる制御技術を提供する。また、制御対象とモデルの特性が異なる領域で操作信号を生成せずに、特性が近い領域においてのみ操作信号を生成することのできる制御技術を提供する。 Therefore, the present invention provides a control technique capable of learning a method for generating an operation signal that can safely drive a control target even in an initial learning stage. Further, the present invention provides a control technique that can generate an operation signal only in a region where the characteristics are close without generating an operation signal in a region where the characteristics of the controlled object and the model are different.

本発明は上記課題を解決するため、次のような手段を採用した。 In order to solve the above problems, the present invention employs the following means.

制御対象および制御対象の特性を模擬するモデルのそれぞれに印加する操作信号を生成し、前記制御対象および前記モデルのそれぞれへ前記操作信号を印加した結果得られる計測信号に基づいて算出される評価値信号を受信し、現状態から将来状態までに得られる前記制御対象に基づく前記評価値信号の総和の期待値が最大となるように前記操作信号の生成方法を学習する機能を備える制御装置において、前記モデルからの計測信号が所望の値に近いほど大きくなる第１の評価値を求める第１の評価値計算部と、前記モデルと制御対象の特性の相違に基づいて求める値であって、同一操作入力に対する制御対象出力とモデル出力との、モデル構築時に判明している誤差特性が保存されたモデル誤差特性データベースを参照して算出する値と、操作信号と該操作信号を制御対象に印加した結果得られる計測信号に基づいて算出された評価値信号との関係が保存された評価値データベースを参照して算出する値と、操作信号に対する計測信号の関係が保存されたプロセス値データベースを参照して算出する値とを含み、モデル化誤差が大きいほど小さくなる第２の評価値を計算する第２の評価値計算部とを備え、前記第１の評価値と前記第２の評価値とを加算して前記評価値信号を算出し、学習の初期段階における操作信号の安全性を向上させた。 The invention is a control device having a function of generating operation signals to be applied to a control target and to a model simulating the characteristics of the control target, receiving an evaluation value signal calculated from the measurement signals obtained as a result of applying the operation signals to the control target and to the model, and learning a method of generating the operation signals so that the expected value of the sum of the evaluation value signals based on the control target, obtained from the current state to a future state, is maximized. The control device comprises a first evaluation value calculation unit that obtains a first evaluation value that increases as the measurement signal from the model approaches a desired value, and a second evaluation value calculation unit that calculates a second evaluation value that decreases as the modeling error increases. The second evaluation value is obtained based on the difference in characteristics between the model and the control target and includes: a value calculated by referring to a model error characteristic database storing the error characteristics, known at the time of model construction, between the control target output and the model output for the same operation input; a value calculated by referring to an evaluation value database storing the relationship between an operation signal and the evaluation value signal calculated from the measurement signal obtained by applying that operation signal to the control target; and a value calculated by referring to a process value database storing the relationship of the measurement signal to the operation signal. The first evaluation value and the second evaluation value are added to calculate the evaluation value signal, which improves the safety of the operation signal in the initial stage of learning.

本発明は、以上の構成を備えるため、モデル誤差が小さい領域での操作信号の生成方法を学習することができる。このため学習初期段階においても制御対象を安全に運転することができる。 Since the present invention has the above configuration, it is possible to learn a method for generating an operation signal in a region where the model error is small. For this reason, the controlled object can be safely operated even in the initial learning stage.

以下、最良の実施形態を添付図面を参照しながら説明する。図１は、本実施形態に係る制御装置２００を制御対象１００に適用した例について説明する図である。 Hereinafter, the best embodiment will be described with reference to the accompanying drawings. FIG. 1 is a diagram illustrating an example in which a control device 200 according to the present embodiment is applied to a control target 100.

制御装置２００は、学習部３００を備える。学習部３００は、制御対象１００に印加する操作信号２０１を生成する。また、制御対象１００からの計測信号２０２および計測信号２０２を入力とした実評価値計算部５００の出力信号である実評価値信号２０３を受信する。なお、学習部３００は、現状態から将来までの実評価値信号２０３の期待値の総和が最大（または最小）となるような操作信号２０１の生成方法を学習する機能を備えている。 The control device 200 includes a learning unit 300. The learning unit 300 generates an operation signal 201 to be applied to the control target 100. In addition, the measurement signal 202 from the control object 100 and the actual evaluation value signal 203 that is the output signal of the actual evaluation value calculation unit 500 that receives the measurement signal 202 are received. Note that the learning unit 300 has a function of learning a generation method of the operation signal 201 such that the sum of expected values of the actual evaluation value signal 203 from the current state to the future becomes maximum (or minimum).

実評価値計算部５００は、例えば、計測信号２０２が所望の値に近い程、大きな値となる実評価値信号２０３を出力する機能を有している。例えば、計測信号２０２が所望の値と一致する場合には、実評価値信号２０３を”１”を出力し、一致しない場合には”０”を出力する。なお、計測信号２０２と所望の値との偏差に反比例するような実評価値信号２０３を出力してもよい。 The actual evaluation value calculation unit 500 has a function of outputting an actual evaluation value signal 203 that becomes a larger value as the measurement signal 202 is closer to a desired value, for example. For example, when the measurement signal 202 matches a desired value, “1” is output as the actual evaluation value signal 203, and when the measurement signal 202 does not match, “0” is output. An actual evaluation value signal 203 that is inversely proportional to the deviation between the measurement signal 202 and a desired value may be output.
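The two forms of evaluation value described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the matching tolerance `tol`, and the floor `eps` that avoids division by zero are all assumptions introduced here.

```python
def binary_evaluation(measurement: float, target: float, tol: float = 1e-6) -> float:
    """Return 1.0 when the measurement matches the target (within tol), else 0.0."""
    # tol is an assumed tolerance; the text only says "matches a desired value".
    return 1.0 if abs(measurement - target) <= tol else 0.0


def inverse_deviation_evaluation(measurement: float, target: float, eps: float = 1e-6) -> float:
    """Return a value inversely proportional to the deviation from the target."""
    # eps is an assumed floor so that an exact match does not divide by zero.
    return 1.0 / max(abs(measurement - target), eps)
```

Either function yields a larger value as the measurement approaches the desired value, which is the only property the description requires of the actual evaluation value signal 203.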

学習部３００が実装する機能として、強化学習を挙げることができる。強化学習では、学習の初期段階においては試行錯誤的に操作信号２０１を生成する。このため実評価値信号２０３は小さい値となる可能性が高い。その後、試行錯誤の経験を積み、学習を進めるに従って、実評価値信号２０３が大きくなるような操作信号２０１の生成方法を学習する。このような学習アルゴリズムとして、例えば、前記非特許文献１に述べられているActor-Critic、Q学習、実時間Dynamic Programmingなどのアルゴリズムを用いることができる。この文献に紹介されているDyna-アーキテクチャと呼ばれる枠組みでは、制御対象を模擬するモデル４００を対象に操作信号の生成方法を学習し、この学習結果を用いて操作信号２０１を生成する。 Reinforcement learning can be given as a function implemented by the learning unit 300. In reinforcement learning, the operation signal 201 is generated by trial and error in the initial stage of learning. For this reason, the actual evaluation value signal 203 is likely to be a small value. Thereafter, a method of generating the operation signal 201 is learned so that the actual evaluation value signal 203 becomes larger as the trial and error are accumulated and the learning is advanced. As such learning algorithms, for example, algorithms such as Actor-Critic, Q-learning, and real-time dynamic programming described in Non-Patent Document 1 can be used. In a framework called Dyna-architecture introduced in this document, an operation signal generation method is learned for a model 400 that simulates a control target, and an operation signal 201 is generated using the learning result.

学習部３００は、モデル４００に対する操作信号２０４を生成し、モデル４００からの計測信号２０５と評価値信号２０８を受信する機能を備える。評価値信号２０８は、モデル４００からの計測信号２０５に基づいて第１の評価値計算部６００で計算される第１の評価値信号２０６と、第２の評価値計算部７００で計算される第２の評価値信号２０７を加算して計算する。 The learning unit 300 has a function of generating an operation signal 204 for the model 400 and receiving the measurement signal 205 from the model 400 and the evaluation value signal 208. The evaluation value signal 208 is calculated by adding the first evaluation value signal 206, which the first evaluation value calculation unit 600 calculates based on the measurement signal 205 from the model 400, and the second evaluation value signal 207, which the second evaluation value calculation unit 700 calculates.

第１の評価値計算部６００は、例えば、モデルからの計測信号２０５が所望の値に近い程、大きな値の第１の評価値信号２０６を出力する機能を有しており、これは実評価値計算部５００と同様である。 The first evaluation value calculation unit 600 has a function of outputting, for example, a first evaluation value signal 206 that takes a larger value as the measurement signal 205 from the model is closer to a desired value. This is the same as the actual evaluation value calculation unit 500.

第２の評価値計算部７００は、モデル誤差特性データベース８００、評価値データベース９００、プロセス値データベース１０００を参照しながら第２の評価値信号２０７を計算する。第２の評価値計算部７００は、制御対象１００とモデル４００の特性が近いほど大きな値となる第２の評価値信号２０７を出力する。 The second evaluation value calculation unit 700 calculates the second evaluation value signal 207 with reference to the model error characteristic database 800, the evaluation value database 900, and the process value database 1000. The second evaluation value calculation unit 700 outputs a second evaluation value signal 207 that becomes a larger value as the characteristics of the controlled object 100 and the model 400 are closer.

なお、図１に示す例では、学習部３００、モデル４００、実評価値計算部５００、第１の評価値計算部６００、第２の評価値計算部７００、モデル誤差特性データベース８００、評価値データベース９００、プロセス値データベース１０００を制御装置２００の内部に配置しているが、これらの機能の一部を制御装置の外部に配置することもできる。 In the example shown in FIG. 1, the learning unit 300, the model 400, the actual evaluation value calculation unit 500, the first evaluation value calculation unit 600, the second evaluation value calculation unit 700, the model error characteristic database 800, the evaluation value database 900, and the process value database 1000 are arranged inside the control device 200, but some of these functions can also be arranged outside the control device.

図２は、第２の評価値信号の生成方法を説明する図である。第２の評価値信号２０７（Ｒ）は、前記モデルの誤差、すなわち事前評価モデル誤差のバイアスＥ１、事前評価モデル誤差の分散σ１、評価値予測誤差Ｅ２、モデル誤差のバイアスＥ３で構成される４次元の誤差評価ベクトルＸ、および４次元の重みベクトルＷを用い、式１ないし式３を用いて計算する。ここで、前記重みベクトルＷ（ｗ１，ｗ２，ｗ３，ｗ４）は、設計者が予め設定する。

FIG. 2 is a diagram illustrating a method for generating the second evaluation value signal. The second evaluation value signal 207 (R) is calculated by Equations 1 to 3 using a four-dimensional error evaluation vector X, composed of the model errors (the pre-evaluation model error bias E1, the pre-evaluation model error variance σ1, the evaluation value prediction error E2, and the model error bias E3), and a four-dimensional weight vector W. The weight vector W (w1, w2, w3, w4) is set in advance by the designer.

なお、前記事前評価モデル誤差のバイアスＥ１、事前評価モデル誤差の分散σ１は、モデル誤差特性データベース８００を参照して求める。また、評価値予測誤差は、評価値データベース９００、計測値誤差のバイアスはプロセス値データベース１０００を参照して求める。 The prior evaluation model error bias E1 and the prior evaluation model error variance σ1 are obtained with reference to the model error characteristic database 800. The evaluation value prediction error is obtained with reference to the evaluation value database 900, and the bias of the measurement value error is obtained with reference to the process value database 1000.
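Equations 1 to 3 are not reproduced in this text. One reading consistent with the definitions of the error evaluation vector X and the weight vector W, offered here as an assumption rather than as the patent's exact notation, is a weighted sum of the four error terms:

```latex
\[
X = (E_1,\ \sigma_1,\ E_2,\ E_3), \qquad
W = (w_1,\ w_2,\ w_3,\ w_4), \qquad
R = \sum_{i=1}^{4} w_i\, x_i = w_1 E_1 + w_2 \sigma_1 + w_3 E_2 + w_4 E_3
\]
```

Under this reading, the sign of each weight determines whether the corresponding error term raises or lowers the second evaluation value signal 207.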

モデル誤差特性データベース８００には、モデル構築時に判明している、同一操作入力に対する制御対象１００出力とモデル４００出力の誤差特性が保存されている。すなわちある範囲の操作入力に対して精度のよいモデルを構築し、前記操作範囲を逸脱する操作入力に対するモデル誤差に関する知見、例えば、事前のモデル検証で判明した操作入力に対するモデル誤差のバイアスや分散が保存されている。 The model error characteristic database 800 stores the error characteristics, known at the time of model construction, between the output of the control target 100 and the output of the model 400 for the same operation input. That is, the model is built to be accurate for a certain range of operation inputs, and the database stores knowledge about the model error for operation inputs outside that range, for example the bias and variance of the model error with respect to the operation input found in prior model verification.

また、経時変化により、制御対象１００とモデル４００の特性が相違してくる場合がある。このような経時変化に伴うモデル誤差に関する事前の知見も、モデル誤差特性データベース８００に保存しておくことができる。 In addition, the characteristics of the controlled object 100 and the model 400 may differ due to changes over time. Such prior knowledge regarding model errors accompanying temporal changes can also be stored in the model error characteristic database 800.

第２の評価値計算部７００は、モデル誤差が大きいほど、小さくなるような第２の評価値信号２０７を出力する。すなわち、重み係数を負の値に設定することにより、このような出力を生成することができる。 The second evaluation value calculation unit 700 outputs a second evaluation value signal 207 that decreases as the model error increases. In other words, such an output can be generated by setting the weighting factor to a negative value.

評価値データベース９００には、操作信号２０１に対する実評価値信号２０３、および操作信号２０４に対する第１の評価値信号２０６の関係が保存されている。制御対象１００とモデル４００の特性に誤差がある場合、同一の操作信号を与えても計測信号の値が異なる。このため前記評価値信号２０３と第１の評価値信号２０６とには誤差が生ずる。このため、第２の評価値計算部７００では、評価値データベース９００を参照して、モデル誤差に起因する評価値の予測誤差を計算する。 The evaluation value database 900 stores the relationship between the actual evaluation value signal 203 with respect to the operation signal 201 and the first evaluation value signal 206 with respect to the operation signal 204. When there is an error in the characteristics of the control object 100 and the model 400, the value of the measurement signal is different even if the same operation signal is given. For this reason, an error occurs between the evaluation value signal 203 and the first evaluation value signal 206. For this reason, the second evaluation value calculation unit 700 refers to the evaluation value database 900 to calculate the prediction error of the evaluation value caused by the model error.

この予測誤差は、操作信号２０１と操作信号２０４が同一である場合において、実評価値信号２０３の予測値から、第１の評価値信号２０６を減算した値であり、実評価値信号２０３の予測値の方が第１の評価値信号２０６よりも大きい場合には正の値、逆の場合には負の値となる。重み係数は正の値に設定する。 This prediction error is a value obtained by subtracting the first evaluation value signal 206 from the prediction value of the actual evaluation value signal 203 when the operation signal 201 and the operation signal 204 are the same. When the value is larger than the first evaluation value signal 206, the value is a positive value, and when the value is opposite, the value is a negative value. The weighting factor is set to a positive value.

第１の評価値計算部６００で計算された第１の評価値信号２０６より、実評価値計算部５００で計算された評価値信号２０３の方が大きいということは、モデル４００に対して有効であると学習した操作信号を制御対象１００に印加した場合、予想していたよりも優れた結果が得られたことを意味している。このような現象は、制御対象１００とモデル４００誤差の特性に違いがあることによるが、このような操作方法を学習することは有益である。 If the actual evaluation value signal 203 calculated by the actual evaluation value calculation unit 500 is larger than the first evaluation value signal 206 calculated by the first evaluation value calculation unit 600, an operation signal learned to be effective for the model 400 produced a better result than expected when applied to the control target 100. This phenomenon arises from a difference in characteristics between the control target 100 and the model 400, but learning such an operation method is beneficial.

このように、評価値データベース９００を参照して得た評価信号を第２の評価信号２０７の要素として加えることにより、以上のような操作方法を学習部３００で学習させることができる。 In this way, by adding an evaluation signal obtained by referring to the evaluation value database 900 as an element of the second evaluation signal 207, the learning unit 300 can learn the operation method as described above.

プロセス値データベース１０００には、操作信号２０１に対する計測信号２０２の関係、および操作信号２０４に対する計測信号２０５の関係が保存されている。重み係数を負の値に設定することにより、事前評価モデル誤差と同様に、モデル誤差が大きいほど第２の評価値信号２０７は小さな値となる。 The process value database 1000 stores the relationship of the measurement signal 202 to the operation signal 201 and the relationship of the measurement signal 205 to the operation signal 204. By setting the weighting factor to a negative value, the second evaluation value signal 207 becomes smaller as the model error is larger, as in the case of the pre-evaluation model error.

図３は、第２の評価値計算部７００の処理を説明する図である。第２の評価計算部７００は、モデル誤差バイアス計算処理７１０、モデル誤差分散計算処理７２０、評価値予測誤差計算処理７３０、計測値誤差計算処理７４０、第２の評価値計算処理の各ステップを備える。なお、モデル誤差バイアス計算処理７１０、モデル誤差分散計算処理７２０、評価値予測誤差計算処理７３０、計測値誤差計算処理７４０の各処理の処理順序は、任意に変更することができる。 FIG. 3 is a diagram for explaining the processing of the second evaluation value calculation unit 700. The second evaluation calculation unit 700 includes steps of a model error bias calculation process 710, a model error variance calculation process 720, an evaluation value prediction error calculation process 730, a measurement value error calculation process 740, and a second evaluation value calculation process. . Note that the processing order of each of the model error bias calculation processing 710, model error variance calculation processing 720, evaluation value prediction error calculation processing 730, and measurement value error calculation processing 740 can be arbitrarily changed.
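The processing steps 710 to 740 and the final calculation can be sketched as below. This is a hedged illustration: the function name, the argument layout standing in for the three databases, the use of the absolute mean as the "bias", and the default weights are all assumptions. Only the sign convention (negative weights for the model error terms, a positive weight for the evaluation value prediction error) follows the description.

```python
from statistics import mean, pvariance


def second_evaluation_value(prior_errors, predicted_actual_eval, model_eval,
                            plant_outputs, model_outputs,
                            weights=(-1.0, -1.0, 1.0, -1.0)):
    """Weighted sum of four error terms (hypothetical stand-in for unit 700)."""
    # Step 710: bias of the model error known at model construction time
    # (stand-in for a lookup in the model error characteristic database 800).
    e1 = abs(mean(prior_errors))
    # Step 720: variance of that same prior model error.
    sigma1 = pvariance(prior_errors)
    # Step 730: predicted actual evaluation value minus the model's evaluation
    # value for the same operation signal (evaluation value database 900).
    e2 = predicted_actual_eval - model_eval
    # Step 740: bias between plant and model measurements recorded for the
    # same operation signals (process value database 1000).
    diffs = [p - m for p, m in zip(plant_outputs, model_outputs)]
    e3 = abs(mean(diffs))
    # Final step: combine the error evaluation vector with the weight vector.
    x = (e1, sigma1, e2, e3)
    return sum(w * xi for w, xi in zip(weights, x))
```

With these default signs, a region where the plant and model disagree strongly gets a lower value than a region where they agree, while a region where the plant is predicted to do better than the model gets a bonus. The four error terms are independent of one another, so they can be computed in any order, as the description notes.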

なお、本実施の形態では、第２の評価値計算部７００において第２の評価値信号２０７を計算する際に、事前評価モデル誤差のバイアスおよび分散、評価値予測誤差、モデル誤差のバイアスの４項目を評価の対象としているが、これらすべてを対象とする必要は必ずしもない。また、上述した例の外に、モデル誤差特性データベース８００、評価値データベース９００、プロセス値データベース１０００を参照して得られる様々統計量(例えば実評価値予測値の分散)などを評価の対象に追加することも可能である。また、図１には図示していないが、画像表示手段を制御装置２００内、あるいは外部に設置し、操作員が画像表示手段を介して制御装置２００の動作を確認できるようにしてもよい。 In the present embodiment, when the second evaluation value calculation unit 700 calculates the second evaluation value signal 207, four items are evaluated: the bias and the variance of the pre-evaluation model error, the evaluation value prediction error, and the bias of the model error. It is not always necessary to use all of them. In addition to the above examples, various statistics (for example, the variance of the predicted actual evaluation value) obtained by referring to the model error characteristic database 800, the evaluation value database 900, and the process value database 1000 can be added as evaluation targets. Although not shown in FIG. 1, an image display unit may be installed inside or outside the control device 200 so that an operator can check the operation of the control device 200 via the image display unit.

図７は、学習部３００が、モデル４００を対象に制御対象１００の操作方法を学習する方法について説明する図である。図７では学習方法としてQ-Learningを使用した場合を例に説明する。 FIG. 7 is a diagram illustrating a method in which the learning unit 300 learns how to operate the control target 100 using the model 400 as a target. FIG. 7 illustrates an example in which Q-Learning is used as a learning method.

Q-Learningでは、状態ｓにおいて行動ａを実行することの価値を表現する関数を使用する。この価値関数をＱ（ｓ，ａ）と表記する。状態ｓは、操作信号２０４と出力２０５によって定義される。 Q-Learning uses a function that expresses the value of executing action a in state s. This value function is written Q(s, a). The state s is defined by the operation signal 204 and the output 205.

まず、ステップ３１０において、価値関数Ｑ（ｓ，ａ）を任意に初期化する。次に、ステップ３２０において、モデル４００の操作信号２０４の初期値を決定し、そのときのモデル４００の出力２０５を計算する。 First, in step 310, the value function Q(s, a) is arbitrarily initialized. Next, in step 320, the initial value of the operation signal 204 of the model 400 is determined, and the output 205 of the model 400 at that time is calculated.

ステップ３３０では、価値関数Ｑ（ｓ，ａ）を用いて状態ｓにおける行動ａを決定する。ここでは、非特許文献１に記載されているε−Greedy方策などを用いて、行動を決定する。この行動によって、操作信号２０４が更新される。次に、ステップ３４０において、更新された操作信号２０４に対するモデル出力２０５を計算する。これにより、状態がｓからｓ’に遷移する。 In step 330, the action a in the state s is determined using the value function Q (s, a). Here, the action is determined using an ε-Greedy policy described in Non-Patent Document 1. The operation signal 204 is updated by this action. Next, in step 340, the model output 205 for the updated operation signal 204 is calculated. As a result, the state transitions from s to s ′.

次に、ステップ３５０では、第１の評価値計算部６００と、第２の評価値計算部７００にて評価値を計算し、これらを加算して評価値信号２０８を算出する。 Next, in step 350, the first evaluation value calculation unit 600 and the second evaluation value calculation unit 700 calculate evaluation values and add them to calculate the evaluation value signal 208.

ステップ３６０では、式６を用いて価値関数Ｑ（ｓ，ａ）を更新する。

In step 360, the value function Q (s, a) is updated using Equation 6.
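Equation 6 is not reproduced in this text. Assuming it is the standard Q-learning update, which uses the same symbols r, α, and γ, it reads as follows, where s' is the state after the transition and a' ranges over the actions available in s':

```latex
\[
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
\]
```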

ここで、ｒは評価値信号２０８の値、α及びγは設計パラメータであり、制御対象１００の運転員が設定する値である。 Here, r is the value of the evaluation value signal 208, α and γ are design parameters, and are values set by the operator of the controlled object 100.

終了判定３７０では、モデル出力２０５が予め定められた条件を満足した場合にはYESとなり、ステップ３２０に戻る。それ以外の場合はステップ３３０に戻る。 In the end determination 370, if the model output 205 satisfies a predetermined condition, the determination is YES, and the process returns to step 320. Otherwise, return to Step 330.
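Steps 310 to 370 can be sketched as a tabular Q-learning loop. Everything concrete below is a hypothetical toy and not taken from the patent: the discrete operation signal range, the linear stand-in for the model 400, the target output, the penalty used as the second evaluation value, the step limit, and the default values of α, γ, and ε. Because the toy model output is a deterministic function of the operation signal, the state is indexed by the operation signal alone.

```python
import random

ACTIONS = (-1, 0, 1)      # hypothetical: lower, hold, or raise operation signal 204
U_MIN, U_MAX = 0, 10      # hypothetical discrete range of the operation signal
TARGET_OUTPUT = 8.0       # hypothetical desired model output


def model_output(u):
    """Toy stand-in for model 400: operation signal 204 -> measurement signal 205."""
    return 2.0 * u


def first_evaluation(y):
    """1 when the model output matches the target, else 0."""
    return 1.0 if y == TARGET_OUTPUT else 0.0


def second_evaluation(u):
    """Assumed penalty that grows where the toy model is taken to be inaccurate (u > 6)."""
    return -0.1 * max(0, u - 6)


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, max_steps=50, seed=0):
    rng = random.Random(seed)
    # Step 310: initialise the value function Q(s, a).
    q = {(u, a): 0.0 for u in range(U_MIN, U_MAX + 1) for a in ACTIONS}
    starts = [u for u in range(U_MIN, U_MAX + 1) if model_output(u) != TARGET_OUTPUT]
    for _ in range(episodes):
        # Step 320: choose an initial operation signal; its model output follows.
        u = rng.choice(starts)
        for _ in range(max_steps):
            # Step 330: epsilon-greedy action selection from Q(s, a).
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q[(u, act)])
            # Step 340: update the operation signal and compute the model output.
            u_next = min(U_MAX, max(U_MIN, u + a))
            y_next = model_output(u_next)
            # Step 350: evaluation value signal 208 = first + second evaluation value.
            r = first_evaluation(y_next) + second_evaluation(u_next)
            # Step 360: standard Q-learning update.
            best_next = max(q[(u_next, act)] for act in ACTIONS)
            q[(u, a)] += alpha * (r + gamma * best_next - q[(u, a)])
            u = u_next
            # Step 370: end determination; on YES, restart from step 320.
            if y_next == TARGET_OUTPUT:
                break
    return q
```

In this toy setting, calling `q_learning()` returns a table in which moving toward the target from a neighbouring operation signal is valued above moving away from it, and operation signals in the penalised region accumulate the negative second evaluation value.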

なお、図１には図示していないが、画像表示手段を制御装置２００の内部あるいは制御装置２００の外部に設置することにより、操作員は、この画像表示手段を介して制御装置２００の動作を確認することができる。 Although not shown in FIG. 1, by installing the image display means inside the control device 200 or outside the control device 200, the operator can operate the control device 200 via the image display means. Can be confirmed.

図４は、前記画像表示手段に表示する画面を説明する図である。表示する画像２５０は、図２に示すように、モデル誤差特性データベース８００、評価値データベース９００、プロセス値データベース１０００を参照して得られる様々なグラフとすることができる。 FIG. 4 is a diagram for explaining a screen displayed on the image display means. As shown in FIG. 2, the image 250 to be displayed can be various graphs obtained by referring to the model error characteristic database 800, the evaluation value database 900, and the process value database 1000.

画像２６０は、モデル誤差特性データベース８００、評価値データベース９００、プロセス値データベース１０００を参照して得られる誤差評価ベクトルの値、操作員が設定する重みベクトルの値、および第２の評価値とすることができる。操作員は、画像２５０、および画像２６０を確認しながら、重みベクトルの値を設定、調整することができる。 The image 260 can show the values of the error evaluation vector obtained by referring to the model error characteristic database 800, the evaluation value database 900, and the process value database 1000, the values of the weight vector set by the operator, and the second evaluation value. The operator can set and adjust the values of the weight vector while checking the image 250 and the image 260.

次に、本実施形態による効果について説明する。本実施形態では、第２の評価値計算部７００で計算された第２の評価値信号２０７を第１の評価信号２０６に加算して学習部３００に供給している。このとき、第２の評価値信号２０７は、モデル誤差が小さい程、大きな値となる。このため、学習部３００は、モデル４００を対象にモデル誤差が小さい領域で操作信号を生成するように学習する。 Next, the effects of the present embodiment will be described. In the present embodiment, the second evaluation value signal 207 calculated by the second evaluation value calculation unit 700 is added to the first evaluation value signal 206 and supplied to the learning unit 300. The second evaluation value signal 207 takes a larger value as the model error becomes smaller. The learning unit 300 therefore learns, using the model 400, to generate operation signals in regions where the model error is small.

従来手法では、モデル誤差が大きい領域であっても、モデル４００に対して有効となる操作信号２０４の生成方法を学習する。この場合、この生成方法で生成した操作信号を制御対象１００に印加しても所望の性能が得られない可能性がある。また、本実施形態では、モデル誤差が小さい領域、あるいはモデルからの評価値信号２０６よりも実評価値信号２０３の予測値が大きくなる領域での操作信号の生成方法を学習するので、従来手法と比べて良好な性能が得られることが期待できる。また、従来手法と比べて制御対象１００の安全性が向上する効果もある。 The conventional method learns how to generate an operation signal 204 that is effective for the model 400 even in regions where the model error is large. In that case, applying an operation signal generated this way to the control target 100 may not yield the desired performance. In contrast, the present embodiment learns how to generate operation signals in regions where the model error is small, or where the predicted value of the actual evaluation value signal 203 is larger than the evaluation value signal 206 from the model, so better performance than the conventional method can be expected. The safety of the control target 100 is also improved compared with the conventional method.

図５は、前記制御対象としての火力発電プラントを説明する図である。まず、火力発電プラントにおける発電の仕組みについて説明する。 FIG. 5 is a diagram illustrating a thermal power plant as the control target. First, the mechanism of power generation in a thermal power plant will be described.

ボイラ１０１に備え付けられているバーナー１０２に、燃料となる石炭と石炭搬送用の１次空気、および燃焼調整用の２次空気を供給し、石炭を燃焼させる。石炭と１次空気は配管１３４から、２次空気は配管１４１から導かれる。また、２段燃焼用のアフタエアは、アフタエアポート１０３を介してボイラ１０１に投入される。このアフタエアは、配管１４２から導かれる。 Coal as a fuel, primary air for transporting coal, and secondary air for combustion adjustment are supplied to a burner 102 provided in the boiler 101 to burn the coal. Coal and primary air are led from the pipe 134 and secondary air is led from the pipe 141. Further, after-air for two-stage combustion is input to the boiler 101 via the after-air port 103. The after air is guided from the pipe 142.

前記石炭の燃焼により発生した高温のガスは、ボイラ１０１の排気経路に沿って流れ、エアーヒーター１０４を通過し、排ガス処理した後、煙突を介して大気に放出される。 The high-temperature gas generated by the combustion of the coal flows along the exhaust path of the boiler 101, passes through the air heater 104, is treated with exhaust gas, and is then released to the atmosphere through the chimney.

ボイラ１０１を循環する給水は、給水ポンプ１０５を介してボイラ１０１に導かれ、熱交換器１０６においてガスにより過熱され、高温高圧の蒸気となる。本実施形態では熱交換器を１つとしているが、複数の熱交換器を配置することも可能である。 The feed water circulating through the boiler 101 is guided to the boiler 101 via the feed water pump 105 and is heated by the gas in the heat exchanger 106 to become high-temperature and high-pressure steam. In this embodiment, one heat exchanger is used, but a plurality of heat exchangers may be arranged.

熱交換器１０６を通過した高温高圧の蒸気は、タービンガバナ１０７を介して蒸気タービン１０８に導かれる。蒸気の持つエネルギーによって蒸気タービン１０８を駆動し、発電機１０９により発電する。 The high-temperature and high-pressure steam that has passed through the heat exchanger 106 is guided to the steam turbine 108 via the turbine governor 107. The steam turbine 108 is driven by the energy of the steam and the generator 109 generates power.

次に、バーナー１０２から投入される１次空気および２次空気、アフタエアポート１０３から投入されるアフタエアの経路について説明する。 Next, the paths of primary air and secondary air introduced from the burner 102 and after-air introduced from the after-air port 103 will be described.

１次空気は、ファン１２０を介して配管１３０に導かれ、途中でエアーヒーターを通過する配管１３２と通過しない配管１３１に分岐し、再び配管１３３にて合流し、ミル１１０に導かれる。エアーヒーターを通過する空気は、ガスにより加熱される。この１次空気を用いて、ミル１１０で生成される石炭（微粉炭）をバーナー１０２に搬送する。 The primary air is guided to the pipe 130 via the fan 120, branches partway into the pipe 132, which passes through the air heater, and the pipe 131, which bypasses it, merges again at the pipe 133, and is guided to the mill 110. The air passing through the air heater is heated by the gas. This primary air is used to convey the coal (pulverized coal) produced in the mill 110 to the burner 102.

２次空気およびアフタエアは、ファン１２１を介して配管１４０に導かれ、エアーヒーター１０４で加熱された後、２次空気用の配管１４１とアフタエア用の配管１４２とに分岐し、それぞれバーナー１０２とアフタエアポート１０３に導かれる。 The secondary air and the after air are led to the pipe 140 through the fan 121 and heated by the air heater 104, and then branched into a secondary air pipe 141 and an after air pipe 142, respectively, and the burner 102 and the after air, respectively. Guided to the airport 103.

図６は、１次空気、２次空気、およびアフタエアが通過する配管部、並びにエアーヒーター１０４の拡大図である。 FIG. 6 is an enlarged view of the air heater 104 and the piping section through which the primary air, the secondary air, and the after air pass.

図６に示すように、配管にはエアダンパ１５０、１５１、１５２、１５３が配置されている。エアダンパを操作することにより、配管における空気が通過する面積を変更することでき、これにより配管を通過する空気流量を調整することが可能となる。ここでは、エアダンパ１５０、１５１、１５２、１５３の制御により、ガスに含まれるＮＯｘを目標値以下に抑制することを目的に制御装置２００を導入する場合について説明する。 As shown in FIG. 6, air dampers 150, 151, 152, and 153 are arranged in the pipe. By operating the air damper, it is possible to change the area through which air passes through the pipe, and thereby adjust the flow rate of air passing through the pipe. Here, a case will be described in which the control device 200 is introduced for the purpose of suppressing NOx contained in the gas below the target value by controlling the air dampers 150, 151, 152, and 153.

２段燃焼方式は、サーマルＮＯｘおよびフューエルＮＯｘの低減に効果がある方式として知られており、バーナーからは理論空気量より少ない空気量を投入し、アフタエアポートから不足分の空気を投入して完全燃焼させる。これにより、急激な燃焼を抑制し、火炎温度の上昇を抑えると共に、酸素濃度を低下させることによりＮＯｘ生成を抑制することができる。 The two-stage combustion method is known as a method that is effective in reducing thermal NOx and fuel NOx, and the air amount less than the theoretical air amount is supplied from the burner and the insufficient air is supplied from the after-air port. Burn. As a result, it is possible to suppress rapid combustion, suppress an increase in flame temperature, and suppress NOx generation by lowering the oxygen concentration.

すなわち、制御装置２００は、ＮＯｘ低減のため、バーナーから投入する空気量とアフタエアポートから投入する空気量の比率が最適となるように、エアダンパ１５０、１５１、１５２、１５３を操作する操作信号を生成する。 That is, the control device 200 generates an operation signal for operating the air dampers 150, 151, 152, and 153 so that the ratio of the amount of air input from the burner and the amount of air input from the after-air port is optimized to reduce NOx. To do.

このような動作を実行させるため、図１における実評価値計算部５００および第１の評価値計算部６００は、式４あるいは式５を用いて実評価値信号２０３および第１の評価値信号２０６を計算する。ここで、Ｒは評価値信号、Ｙ_ＮＯｘはＮＯｘの計測信号、Ｄ_ＮＯｘはＮＯｘの目標値である。 To perform this operation, the actual evaluation value calculation unit 500 and the first evaluation value calculation unit 600 in FIG. 1 calculate the actual evaluation value signal 203 and the first evaluation value signal 206 using Equation 4 or Equation 5. Here, R is the evaluation value signal, Y_NOx is the NOx measurement signal, and D_NOx is the NOx target value.
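Equations 4 and 5 are not reproduced in this text. Purely as an illustration of how an evaluation value R might be derived from Y_NOx and D_NOx, the sketch below shows two hypothetical forms, a threshold form and a margin form. Both function names and both forms are assumptions and should not be read as the patent's equations.

```python
def evaluation_threshold(y_nox: float, d_nox: float) -> float:
    """Hypothetical form: 1 when measured NOx is at or below the target, else 0."""
    return 1.0 if y_nox <= d_nox else 0.0


def evaluation_margin(y_nox: float, d_nox: float) -> float:
    """Hypothetical form: grows as measured NOx falls further below the target."""
    return d_nox - y_nox


# Example values (ppm) chosen only for illustration.
r_step = evaluation_threshold(y_nox=80.0, d_nox=100.0)
r_margin = evaluation_margin(y_nox=80.0, d_nox=100.0)
```

A threshold form only tells the learner whether the target was met, while a margin form also rewards how far below the target the measurement lies.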

なお、本実施形態では、ＮＯｘ成分に着目して評価値信号を計算する構成としたが、その他のガス成分であるＣＯなどを加えて、複数の計測信号に基づいて評価値を計算することもできる。 Although this embodiment calculates the evaluation value signal from the NOx component alone, the evaluation value can also be calculated from a plurality of measurement signals by adding other gas components such as CO.

モデル４００は、ボイラ１０１の特性を模擬したものであり、バーナーおよびエアポートから投入する石炭、空気の諸条件を設定し計算を実行することで、ＮＯｘ濃度を求めることができる。また、対象とするボイラ１０１以外のボイラの運転実績を用いて、事前にモデル４００の精度を検証した知見が、モデル誤差特性データベース８００に保存されている。 The model 400 simulates the characteristics of the boiler 101, and the NOx concentration can be obtained by setting various conditions of coal and air input from the burner and the air port and executing the calculation. In addition, knowledge obtained by verifying the accuracy of the model 400 in advance using the operation results of the boilers other than the target boiler 101 is stored in the model error characteristic database 800.

すなわち、ボイラは、石炭の燃焼により発生した灰が熱交換器やボイラの壁に付着することにより燃焼特性が経時変化し、これがＮＯｘの生成量にも影響を与える。このため、この灰を除去するためにスートブロワが実施される。例えば、前記モデル４００として、スートブロワ実施後１時間の特性を模擬するように構築すると、それ以外の経過時間では灰付着による影響により、モデルによるＮＯｘの計算値とボイラから計測されるＮＯｘの値が異なることが予想される。 That is, in a boiler, ash generated by coal combustion adheres to the heat exchangers and the boiler walls, so the combustion characteristics change over time, which in turn affects the amount of NOx produced. Soot blowing is therefore performed to remove this ash. For example, if the model 400 is built to simulate the characteristics one hour after soot blowing, then at other elapsed times the NOx value calculated by the model is expected to differ from the NOx value measured at the boiler because of ash adhesion.

しかし、このようなモデル誤差特性は、ボイラの運転実績（経時変化の特性）からその一部分については事前に分かっていることが多く、このような運転時間とモデル誤差特性に関する情報をモデル誤差特性データベース８００に保存しておく。また、計測器のノイズ特性（例えば、ノイズによる計測値の分散）が事前に分かっている場合には、この特性も前評価モデル誤差特性データベース８００に蓄積しておく。このように設定しておくことにより、制御対象１００が火力発電プラントである場合においても、制御装置２００によりのプラントの排ガスに含まれるＮＯｘを目標値以下に抑制することができる。 However, part of this model error characteristic is often known in advance from the boiler's operating record (its change over time), and this information relating operating time to model error is stored in the model error characteristic database 800. When the noise characteristics of the measuring instruments (for example, the variance of measured values due to noise) are known in advance, they are also stored in the model error characteristic database 800. With these settings, even when the controlled object 100 is a thermal power plant, the control device 200 can keep the NOx contained in the plant's exhaust gas at or below the target value.
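One way to picture the model error characteristic database is as a table that maps elapsed time since soot blowing to the expected model error. The sketch below assumes such a table with linear interpolation between entries. The table values, the names `MODEL_ERROR_TABLE` and `expected_model_error`, and the interpolation scheme are all hypothetical and are not taken from the patent.

```python
from bisect import bisect_left

# Hypothetical entries: (hours since soot blowing, expected |model - plant| NOx error in ppm).
# The error is zero at 1 h because the model is assumed to be built for that condition.
MODEL_ERROR_TABLE = [(0.0, 8.0), (1.0, 0.0), (4.0, 12.0), (8.0, 30.0)]


def expected_model_error(hours: float) -> float:
    """Linearly interpolate the stored error; clamp outside the table range."""
    xs = [h for h, _ in MODEL_ERROR_TABLE]
    ys = [e for _, e in MODEL_ERROR_TABLE]
    if hours <= xs[0]:
        return ys[0]
    if hours >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, hours)  # first index with xs[i] >= hours
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (hours - x0) / (x1 - x0)


error_at_build_point = expected_model_error(1.0)
error_later = expected_model_error(2.5)
```

Known measurement noise variance could be stored next to each entry in the same way. It is left out here to keep the sketch short.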

以上説明したように、本実施形態によれば、モデル誤差が小さい領域での操作信号の生成方法を学習するので、従来手法と比べて良好な制御を実施することができる。また、従来手法と比べて制御対象の安全性が向上する。すなわち、前述のDyna-アーキテクチャあるいは特許文献１に述べられている従来手法によれば、モデル誤差が大きい領域において、モデルに対して有効となる操作信号の生成方法を学習する。このため、この学習結果を制御対象に印加しても有効とならない可能性がある。これに対して、本実施形態によれば、前記第１の評価値信号に第２の評価値信号を加算するので、制御対象とモデルの特性が異なる領域で操作信号を生成せずに、特性が近い領域においてのみ操作信号の生成方法を学習する。このため、運転開始直後における制御対象の安全性が向上する。 As described above, this embodiment learns how to generate operation signals in the region where the model error is small, so it can achieve better control than the conventional methods. It also improves the safety of the controlled object compared with the conventional methods. With the Dyna architecture mentioned above or the conventional method described in Patent Document 1, the learner may, in a region where the model error is large, learn a way of generating operation signals that is effective only for the model. Applying that learning result to the controlled object may therefore not be effective. In this embodiment, by contrast, the second evaluation value signal is added to the first evaluation value signal, so no operation signal is generated in regions where the characteristics of the controlled object and the model differ, and the method of generating operation signals is learned only in regions where their characteristics are close. This improves the safety of the controlled object immediately after the start of operation.
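The effect of adding the second evaluation value to the first can be sketched numerically. The forms below are hypothetical: the first value is taken as the margin between the target and the model-predicted NOx, the second as a non-positive penalty proportional to the model error, and `weight` is an assumed tuning constant. None of these names or forms comes from the patent.

```python
def first_evaluation(model_nox: float, target_nox: float) -> float:
    """Hypothetical first evaluation value: grows as model-predicted NOx drops below target."""
    return target_nox - model_nox


def second_evaluation(model_error: float, weight: float = 2.0) -> float:
    """Hypothetical second evaluation value: non-positive, decreasing as model error grows."""
    return -weight * abs(model_error)


def evaluation_signal(model_nox: float, target_nox: float, model_error: float) -> float:
    """Sum of the first and second evaluation values, as used for learning on the model."""
    return first_evaluation(model_nox, target_nox) + second_evaluation(model_error)


# Operation A: the model predicts very low NOx, but in a poorly modelled region.
score_a = evaluation_signal(model_nox=70.0, target_nox=100.0, model_error=25.0)
# Operation B: a more modest prediction in a region where the model is accurate.
score_b = evaluation_signal(model_nox=85.0, target_nox=100.0, model_error=2.0)
```

Under these assumed numbers, operation B scores higher than operation A, so a learner that maximizes the summed signal is steered toward operations in the region where the model can be trusted.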

本発明の実施形態に係る制御装置を制御対象に適用した例について説明する図である。 A diagram illustrating an example in which the control device according to an embodiment of the present invention is applied to a controlled object.

第２の評価値信号の生成方法を説明する図である。 A diagram illustrating the method of generating the second evaluation value signal.

第２の評価値計算部の処理を説明する図である。 A diagram illustrating the processing of the second evaluation value calculation unit.

画像表示手段に表示する画面を説明する図である。 A diagram illustrating the screen displayed on the image display means.

制御対象としての火力発電プラントを説明する図である。 A diagram illustrating a thermal power plant as the controlled object.

１次空気等が通過する配管部、およびエアーヒーター１０４の拡大図である。 FIG. 6 is an enlarged view of the piping section through which the primary air and the like pass, and of the air heater 104.

学習部３００が、モデル４００を対象に制御対象１００の操作方法を学習する方法について説明する図である。 A diagram illustrating how the learning unit 300 learns, using the model 400, how to operate the controlled object 100.

Explanation of symbols

１００ 制御対象 / 100 Controlled object
２００ 制御装置 / 200 Control device
３００ 学習部 / 300 Learning unit
４００ モデル / 400 Model
５００ 実評価値計算部 / 500 Actual evaluation value calculation unit
６００ 第１の評価値計算部 / 600 First evaluation value calculation unit
７００ 第２の評価値計算部 / 700 Second evaluation value calculation unit
８００ モデル誤差特性データベース / 800 Model error characteristic database
９００ 評価値データベース / 900 Evaluation value database
１０００ プロセス値データベース / 1000 Process value database

Claims

A control device that generates an operation signal to be applied to each of a controlled object and a model simulating the characteristics of the controlled object, receives an evaluation value signal calculated based on a measurement signal obtained as a result of applying the operation signal to each of the controlled object and the model,
and has a function of learning a method for generating the operation signal so that the expected value of the sum of the evaluation value signals based on the controlled object, obtained from the current state into the future, is maximized, the control device comprising:
a first evaluation value calculation unit that obtains a first evaluation value that increases as the measurement signal from the model approaches a desired value; and
a second evaluation value calculation unit that calculates a second evaluation value that decreases as the modeling error of the measurement signal for the operation signal increases, using
a value obtained based on the difference between the characteristics of the model and those of the controlled object,
a value calculated by referring to a model error characteristic database that stores error characteristics, known at the time of model construction, between the controlled-object output and the model output for the same operation input,
a value calculated by referring to an evaluation value database that stores the relationship between an operation signal and an evaluation value signal calculated based on a measurement signal obtained as a result of applying that operation signal to the controlled object,
wherein the evaluation value signal is calculated by adding the first evaluation value and the second evaluation value, thereby improving the safety of the operation signal in the initial stage of learning.