JP2020061909A - Reinforcement learning program, reinforcement learning method, and reinforcement learning device

Info

Publication number: JP2020061909A
Application number: JP2018193537A
Authority: JP
Inventors: Toshio Ito; Akira Ochitani; Hitoshi Yanami
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2018-10-12
Filing date: 2018-10-12
Publication date: 2020-04-16
Anticipated expiration: 2038-10-12
Also published as: JP7187961B2

Abstract

To provide a reinforcement learning device capable of reducing the amount of processing in reinforcement learning. SOLUTION: A reinforcement learning device 100 performs learning using validity information that indicates the validity of each command value for a generator in each of a plurality of regions that a state value of the generator can take. The reinforcement learning device 100 generates validity information that indicates the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions. The reinforcement learning device 100 then performs learning using the validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more combined regions. SELECTED DRAWING: Figure 1

Description

The present invention relates to a reinforcement learning program, a reinforcement learning method, and a reinforcement learning device.

Conventionally, a power generation system including one or more generators that use natural energy is sometimes controlled by reinforcement learning. Reinforcement learning uses, for example, a table that associates each region that a state value of a generator can take with a validity value indicating the validity of a command value for the generator. The table is, for example, a Q table. In reinforcement learning, for example, learning that estimates the validity value of a command value is performed repeatedly, and the table is updated.

Japanese National Publication of International Patent Application No. 2016-517104

However, the conventional technique may increase the amount of processing in reinforcement learning. For example, as the number of regions that a state value of the generator can take increases, the number of times that learning for estimating the validity value of a command value is performed also increases, which increases the amount of processing in reinforcement learning.

In one aspect, an object of the present invention is to reduce the amount of processing in reinforcement learning.

According to one embodiment, a reinforcement learning program, a reinforcement learning method, and a reinforcement learning device are proposed that: perform learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value of the generator can take; refer to an observed state value of the generator and, based on a characteristic function of the state value of the generator, generate validity information indicating the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and perform learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

According to one aspect, the amount of processing in reinforcement learning can be reduced.

FIG. 1 is an explanatory diagram showing an example of the reinforcement learning method according to the embodiment.
FIG. 2 is an explanatory diagram showing an example of the power generation system 200.
FIG. 3 is a block diagram showing a hardware configuration example of the reinforcement learning device 100.
FIG. 4 is a block diagram showing a functional configuration example of the reinforcement learning device 100.
FIG. 5 is an explanatory diagram showing a specific configuration example of the power generation system 200 including wind power generators.
FIG. 6 is an explanatory diagram showing the normal intervals and coarsely divided intervals that the state value of a stall-controlled wind power generator can take.
FIG. 7 is an explanatory diagram showing the normal intervals and coarsely divided intervals that the state value of a pitch-controlled wind power generator can take.
FIG. 8 is an explanatory diagram showing an example of implementing the normal table 801 and the coarse division table 802.
FIG. 9 is an explanatory diagram showing an example of the stored contents of the normal table 801.
FIG. 10 is an explanatory diagram showing an example of creating the characteristic function.
FIG. 11 is an explanatory diagram showing an example of switching the table in use to the coarse division table 802.
FIG. 12 is an explanatory diagram showing an example of switching the table in use to the normal table 801.
FIG. 13 is an explanatory diagram showing an example of updating validity values.
FIG. 14 is an explanatory diagram showing a specific configuration example of the power generation system 200 including thermal power generators.
FIG. 15 is an explanatory diagram showing an example of the stored contents of the normal table 801 for thermal power generators.
FIG. 16 is a flowchart showing an example of the overall processing procedure.
FIG. 17 is a flowchart showing an example of the switching determination processing procedure.
FIG. 18 is a flowchart showing an example of the value setting processing procedure.
FIG. 19 is a flowchart showing an example of the characteristic function creation processing procedure for a stall-controlled wind power generator.
FIG. 20 is a flowchart showing an example of the characteristic function creation processing procedure for a pitch-controlled wind power generator.

Hereinafter, embodiments of a reinforcement learning program, a reinforcement learning method, and a reinforcement learning device according to the present invention will be described in detail with reference to the drawings.

(Example of the Reinforcement Learning Method According to the Embodiment)
FIG. 1 is an explanatory diagram showing an example of the reinforcement learning method according to the embodiment. The reinforcement learning device 100 is a computer that applies reinforcement learning to a power generation system including one or more generators and thereby controls that power generation system. A generator is, for example, a wind power generator or a thermal power generator.

Reinforcement learning uses, for example, a table that associates each of a plurality of regions that a combination of state values of the one or more generators can take with a validity value indicating the validity of each combination of command values for the one or more generators. The table is, for example, a Q table. In reinforcement learning, for example, the following learning is performed repeatedly and the table is updated: a state value of the generator is observed, a command value for the generator is determined, the determined command value is input to the generator, and the validity value indicating the validity of the input command value in the region containing the observed state value is estimated. Reinforcement learning is implemented by, for example, Q-learning or SARSA.
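As a non-limiting illustration, the lookup of the region containing an observed state value and the update of a validity value can be sketched in Python as follows. The region boundaries, the command names, the learning rate `alpha`, and the discount rate `gamma` are assumptions made for this sketch and are not taken from the embodiment. The two update functions are the standard textbook forms of Q-learning and SARSA.

```python
import bisect

# Assumed boundaries for one state value (for example, wind speed in m/s).
# They define three regions: x < 5, 5 <= x < 10, and x >= 10.
BOUNDS = [5.0, 10.0]
COMMANDS = ["OFF", "ON"]  # assumed command values


def region_of(state_value, bounds=BOUNDS):
    """Return the index of the region that contains state_value."""
    return bisect.bisect_right(bounds, state_value)


def q_learning_update(q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Q-learning: bootstrap from the best command in the next region."""
    target = r + gamma * max(q[s_next].values())
    q[s][a] += alpha * (target - q[s][a])


def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA: bootstrap from the command actually chosen in the next region."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


# One record (dict of command -> validity value) per region, initialised to 0.
q_table = [{c: 0.0 for c in COMMANDS} for _ in range(len(BOUNDS) + 1)]

s = region_of(7.2)        # observed state value before the command
s_next = region_of(11.0)  # observed state value after the command
q_learning_update(q_table, s, "ON", r=1.0, s_next=s_next)
```

In this sketch, only the record for the region containing the observed state value is touched per step, so the number of records that must be learned grows with the number of regions.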

Here, the amount of processing in reinforcement learning may increase. For example, as the number of regions that a combination of state values of the one or more generators can take increases, the number of times learning is performed increases, which increases the amount of processing in reinforcement learning. Specifically, dividing the state values finely when setting the regions increases the number of regions. Increasing the number of generators also increases the number of regions. It is therefore desirable to reduce the amount of processing in reinforcement learning.

One conceivable approach is to reduce the number of regions, for example by dividing the state values coarsely when setting the regions, and thereby reduce the amount of processing in reinforcement learning. However, if coarsely divided regions are used at all times, it is not possible to examine in detail which command value is preferable for which state value, and the power generation system may not be controlled appropriately. Appropriate control is, for example, control that maximizes the amount of power generated by the power generation system within a range that does not exceed a predetermined threshold.

Another conceivable approach is therefore to change the number of regions dynamically in order to reduce the amount of processing in reinforcement learning. For example, two or more regions may be combined at some timing to reduce the number of regions. However, unless it is known how to set the validity value indicating the validity of each combination of command values for the one or more generators in the region obtained by combining the two or more regions, it becomes difficult to control the power generation system appropriately.

The present embodiment therefore describes a reinforcement learning method in which the number of regions that a combination of state values of the one or more generators can take can be changed dynamically, and in which validity values judged to be appropriate can be set again in response to the change in the number of regions. This reinforcement learning method makes it possible to reduce the amount of processing in reinforcement learning.

In the example of FIG. 1, the power generation system includes one generator. The state values of the generator are, for example, the output power of the generator and an environmental value related to the amount of natural energy supplied to the generator. The environmental value is, for example, the wind speed or the amount of fuel used. The command value for the generator is, for example, a command value that switches the power supply of the generator between ON and OFF, or a command value that changes the efficiency with which the generator uses natural energy.

In FIG. 1, the reinforcement learning device 100 performs learning using validity information indicating the validity of each command value for the generator in each of a plurality of regions that a state value of the generator can take. A region is, for example, an interval or a combination of intervals. The validity information is, for example, a record of a Q table. The validity value is, for example, a Q value. The reinforcement learning device 100 performs learning using, for example, a Q table that includes validity information associating the interval A1 with the validity value G1, validity information associating the interval A2 with the validity value G2, and validity information associating the interval A3 with the validity value G3. The reinforcement learning device 100 updates the Q table as a result of the learning.

The reinforcement learning device 100 generates validity information indicating the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions. For example, the reinforcement learning device 100 refers to the observed state value of the generator and generates the validity information for the combined region based on a characteristic function of the state value of the generator. Specifically, the reinforcement learning device 100 calculates a validity value Ga indicating the validity of the command value in the interval Aa obtained by combining the interval A2 and the interval A3, and generates validity information associating the interval Aa with the validity value Ga. The reinforcement learning device 100 then updates the Q table by replacing the validity information associating the interval A2 with the validity value G2 and the validity information associating the interval A3 with the validity value G3 with the validity information associating the interval Aa with the validity value Ga.
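The replacement of the records for the intervals A2 and A3 with a single record for the interval Aa can be sketched as follows. This is a minimal sketch under stated assumptions: the table layout, the interval bounds, the stored values, and in particular the function `estimate_from_curve` are hypothetical. The passage above does not specify how the validity value Ga is computed from the characteristic function, so the sketch merely passes in a stand-in callable where that computation would go.

```python
def merge_regions(table, first, last, estimate):
    """Return a new table in which records first..last are replaced by one record.

    table:    list of (lower, upper, {command: validity value}), sorted by lower.
    first,
    last:     indices of the first and last of the consecutive records to merge.
    estimate: callable (lower, upper, command) -> validity value for the merged
              interval; a stand-in for a characteristic-function-based rule.
    """
    if not 0 <= first < last < len(table):
        raise ValueError("need at least two consecutive records inside the table")
    lower, upper = table[first][0], table[last][1]
    merged_values = {c: estimate(lower, upper, c) for c in table[first][2]}
    return table[:first] + [(lower, upper, merged_values)] + table[last + 1:]


# Hypothetical Q table for one state value with intervals A1, A2, A3.
q_table = [
    (0.0, 5.0, {"OFF": 0.0, "ON": 0.2}),    # A1 with G1
    (5.0, 10.0, {"OFF": 0.0, "ON": 0.6}),   # A2 with G2
    (10.0, 15.0, {"OFF": 0.0, "ON": 0.9}),  # A3 with G3
]


def estimate_from_curve(lower, upper, command):
    """Assumed stand-in: a made-up linear curve evaluated at the interval midpoint."""
    if command == "OFF":
        return 0.0
    return 0.1 * 0.5 * (lower + upper)


# Merge A2 and A3 into Aa = [5, 15); A1 is kept unchanged.
merged_table = merge_regions(q_table, 1, 2, estimate_from_curve)
```

The function returns a new list rather than modifying `q_table` in place, so the original records remain available if the regions are later separated again.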

The reinforcement learning device 100 performs learning using the validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions. The reinforcement learning device 100 performs learning using, for example, a Q table that includes the validity information associating the interval A1 with the validity value G1 and the validity information associating the interval Aa, obtained by combining the interval A2 and the interval A3, with the validity value Ga. The reinforcement learning device 100 updates the Q table as a result of the learning.

In this way, the reinforcement learning device 100 can combine two or more regions and dynamically reduce the number of regions with which validity information is associated. The reinforcement learning device 100 can therefore reduce the number of pieces of validity information that are learned and updated, and reduce the amount of processing required for reinforcement learning. In addition, when generating validity information, the reinforcement learning device 100 can set the validity value based on the characteristic function instead of initializing it to 0 or setting it randomly. The generated validity information can therefore indicate the validity of each command value for the generator accurately, which makes it easier to control the power generation system appropriately.

The case described here is one in which, at some timing, the validity information for each of two or more regions is replaced with validity information for the region obtained by combining them, so that the number of regions with which validity information is associated is dynamically reduced. However, the embodiment is not limited to this. For example, at some timing, the validity information for a region obtained by combining two or more regions may be replaced with validity information for each of those regions, so that the number of regions is dynamically increased. This allows the reinforcement learning device 100 to determine at a finer granularity which command value is preferable for which state value. The case in which the validity information for a combined region is replaced with validity information for each of the two or more regions is described later with reference to FIG. 12.

The case described here is one in which the power generation system includes one generator, but the embodiment is not limited to this. For example, the power generation system may include a plurality of generators. In this case, the reinforcement learning device 100 generates, for example, validity information for a region obtained by combining two or more consecutive regions among the plurality of regions that a combination of the state values of the generators can take. The reinforcement learning device 100 then replaces the validity information for each of the two or more regions with the validity information for the combined region, and dynamically reduces the number of regions with which validity information is associated. This allows the reinforcement learning device 100 to reduce the amount of processing required for reinforcement learning. The case in which the power generation system includes a plurality of generators is described later with reference to FIGS. 11 and 12.

The reinforcement learning device 100 has been described here without specifying the type of generator included in the power generation system. The generator included in the power generation system may be, for example, a wind power generator; this case is described later with reference to FIGS. 11 and 12. The generator may also be, for example, a thermal power generator; this case is described later with reference to FIG. 15. The power generation system may also include both wind power generators and thermal power generators.

The reinforcement learning device 100 has been described here without limiting the timing at which validity information for a region obtained by combining two or more regions is generated. For example, the reinforcement learning device 100 may generate the validity information for the combined region when the observed power demand is less than or equal to a threshold. Alternatively, the reinforcement learning device 100 may generate the validity information for the combined region when the observed power demand exceeds a threshold.
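The two timing variants described above can be sketched as a single predicate. The function name and the flag `merge_when_low` are hypothetical names chosen for this sketch; the threshold value itself is left to the caller.

```python
def should_use_merged_regions(demand, threshold, merge_when_low=True):
    """Decide whether validity information for the combined region is generated.

    merge_when_low=True:  combine regions while demand <= threshold.
    merge_when_low=False: combine regions while demand >  threshold.
    """
    low = demand <= threshold
    return low if merge_when_low else not low


# With a threshold of 100 (arbitrary units), a demand of 80 triggers merging
# in the first variant and a demand of 120 triggers it in the second.
first_variant = should_use_merged_regions(80, 100)
second_variant = should_use_merged_regions(120, 100, merge_when_low=False)
```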

(Example of the Power Generation System 200)
Next, an example of the power generation system 200 to which the reinforcement learning device 100 shown in FIG. 1 is applied will be described with reference to FIG. 2.

FIG. 2 is an explanatory diagram showing an example of the power generation system 200. In FIG. 2, the power generation system 200 includes the reinforcement learning device 100 and one or more generators 201.

In the power generation system 200, the reinforcement learning device 100 and the one or more generators 201 are connected via a wired or wireless network 210. The network 210 is, for example, a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet.

The generator 201 is, for example, a machine that uses wind energy and generates power with a wind turbine. The generator 201 generates power using, for example, the wind turbine torque transmitted from the wind turbine. The generator 201 is provided with a measuring instrument that observes state values of the generator 201. The measuring instrument has, for example, a sensor device. The sensor device may include at least one of an acceleration sensor, a geomagnetic sensor, an optical sensor, a vibration sensor, a power sensor, a voltage sensor, and a current sensor. The generator 201 may also be, for example, a machine that uses thermal energy and generates power with a turbine.

The reinforcement learning device 100 applies reinforcement learning to the power generation system 200 and controls the power generation system 200. The reinforcement learning device 100 controls, for example, the command values for the one or more generators 201 included in the power generation system 200. Specifically, the reinforcement learning device 100 acquires state values of the generator 201 from the measuring instrument provided in the generator 201. Based on the acquired state values and the table including the validity information, the reinforcement learning device 100 determines and outputs a command value for the generator 201. The reinforcement learning device 100 updates the table according to the result of outputting the command value. The reinforcement learning device 100 is, for example, a server, a PC (Personal Computer), a microcomputer, or a PLC (Programmable Logic Controller).
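One cycle of acquiring a state value, determining and outputting a command value, and updating the table can be sketched as follows. The callbacks `read_state`, `send_command`, and `read_reward` are hypothetical stand-ins for communication with the measuring instrument and the generator 201. The epsilon-greedy selection and the Q-learning style update are assumptions of this sketch; the passage above does not fix how the command value is chosen.

```python
import random


def choose_command(record, epsilon, rng):
    """Epsilon-greedy choice over one record {command: validity value}."""
    commands = sorted(record)
    if rng.random() < epsilon:
        return rng.choice(commands)        # explore
    return max(commands, key=record.get)   # exploit the highest validity value


def control_step(table, region_of, read_state, send_command, read_reward,
                 epsilon=0.1, alpha=0.1, gamma=0.9, rng=random):
    """Acquire a state value, output a command value, then update the table."""
    s = region_of(read_state())
    command = choose_command(table[s], epsilon, rng)
    send_command(command)
    reward = read_reward()
    s_next = region_of(read_state())
    target = reward + gamma * max(table[s_next].values())
    table[s][command] += alpha * (target - table[s][command])
    return command


# Usage with stub callbacks in place of a real generator.
sent = []
table = {0: {"OFF": 0.0, "ON": 0.5}, 1: {"OFF": 0.0, "ON": 0.0}}
cmd = control_step(
    table,
    region_of=lambda x: 0 if x < 5.0 else 1,
    read_state=lambda: 3.0,
    send_command=sent.append,
    read_reward=lambda: 2.0,
    epsilon=0.0,  # purely greedy, so the run is deterministic
)
```

Passing the region lookup as a callable keeps the control step independent of how many regions the table currently has, which is convenient when the number of regions changes during operation.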

(Hardware Configuration Example of the Reinforcement Learning Device 100)
Next, a hardware configuration example of the reinforcement learning device 100 will be described with reference to FIG. 3.

FIG. 3 is a block diagram showing a hardware configuration example of the reinforcement learning device 100. In FIG. 3, the reinforcement learning device 100 includes a CPU (Central Processing Unit) 301, a memory 302, a network I/F (Interface) 303, a recording medium I/F 304, and a recording medium 305. These components are connected to one another by a bus 300.

The CPU 301 controls the entire reinforcement learning device 100. The memory 302 includes, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), and a flash ROM. Specifically, for example, the flash ROM and the ROM store various programs, and the RAM is used as a work area of the CPU 301. A program stored in the memory 302 is loaded into the CPU 301 and causes the CPU 301 to execute the coded processing.

The network I/F 303 is connected to the network 210 through a communication line and is connected to other computers via the network 210. Another computer is, for example, the generator 201. The network I/F 303 serves as the interface between the network 210 and the inside of the device, and controls the input and output of data to and from other computers. A modem or a LAN adapter, for example, can be used as the network I/F 303.

The recording medium I/F 304 controls reading and writing of data on the recording medium 305 under the control of the CPU 301. The recording medium I/F 304 is, for example, a disk drive, an SSD (Solid State Drive), or a USB (Universal Serial Bus) port. The recording medium 305 is a non-volatile memory that stores data written under the control of the recording medium I/F 304. The recording medium 305 is, for example, a disk, a semiconductor memory, or a USB memory. The recording medium 305 may be detachable from the reinforcement learning device 100.

In addition to the components described above, the reinforcement learning device 100 may include, for example, a keyboard, a mouse, a display, a printer, a scanner, a microphone, and a speaker. The reinforcement learning device 100 may include a plurality of recording medium I/Fs 304 and recording media 305, or may include no recording medium I/F 304 or recording medium 305.

(Functional Configuration Example of the Reinforcement Learning Device 100)
Next, a functional configuration example of the reinforcement learning device 100 will be described with reference to FIG. 4.

FIG. 4 is a block diagram showing a functional configuration example of the reinforcement learning device 100. The reinforcement learning device 100 includes a storage unit 400, an acquisition unit 401, a switching unit 402, a learning unit 403, and an output unit 404.

The storage unit 400 is implemented by, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3. The following describes the case in which the storage unit 400 is included in the reinforcement learning device 100, but the embodiment is not limited to this. For example, the storage unit 400 may be included in a device other than the reinforcement learning device 100, with the stored contents of the storage unit 400 being accessible from the reinforcement learning device 100.

The acquisition unit 401 through the output unit 404 function as an example of a control unit. Specifically, the acquisition unit 401 through the output unit 404 implement their functions by, for example, causing the CPU 301 to execute a program stored in a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3, or by means of the network I/F 303. The processing result of each functional unit is stored in, for example, a storage area such as the memory 302 or the recording medium 305 shown in FIG. 3.

The storage unit 400 stores various kinds of information that are referred to or updated in the processing of each functional unit. The storage unit 400 stores state values of the power generation system 200. The state values of the power generation system 200 include, for example, the state values of each of the one or more generators 201 included in the power generation system 200 and a state value of the power generation system 200 as a whole. The generator 201 is, for example, a wind power generator or a thermal power generator. The state values of the generator 201 are, for example, the output wattage and wind speed of a stall-controlled wind power generator, and the output wattage and wind speed of a pitch-controlled wind power generator. For a thermal power generator, the state values of the generator 201 are, for example, the output wattage and the amount of fuel used. The state value of the power generation system 200 as a whole is, for example, the power demand of the entire power generation system 200.

The storage unit 400 stores a command value for each of the one or more generators 201 included in the power generation system 200. The command value for the generator 201 is a command value that changes the efficiency with which the generator 201 uses natural energy. The command value is, for example, a command value that switches the power supply of the generator 201 between ON and OFF. The command value is, for example, a command value that changes the wind receiving performance of a wind power generator. The command value is, for example, a command value indicating how much the pitch angle of a pitch-controlled wind power generator is to be changed. Specifically, the command values indicating how much the pitch angle is to be changed are -ΔΘ, ±0, and +ΔΘ. The command value is, for example, a command value indicating how much the size of a fuel supply port provided in the generator of a thermal power generator is to be changed. The command value is, for example, a command value indicating how much the fuel consumption of a thermal power generator is to be changed.

The storage unit 400 stores validity information indicating the effectiveness of each combination of command values for the one or more generators 201 in each of a plurality of regions that a combination of state values for the one or more generators 201 can take. The combination of state values may be a single state value. The storage unit 400 also stores validity information indicating the effectiveness of each combination of command values for the one or more generators 201 in a region obtained by combining two or more of the plurality of regions. The storage unit 400 stores, for example, a table containing, as records, validity information in which each of the plurality of regions is associated with an effective value indicating the effectiveness of each combination of command values. The effective value indicates, for example, the degree of contribution to an increase in the reward for the generator. The reward is, for example, the amount of power generated. The storage unit 400 also stores, for example, a table in which, out of the validity information for each of the plurality of regions, the validity information for each of the two or more regions has been replaced with the validity information for the region obtained by combining the two or more regions.

The storage unit 400 stores a characteristic function for the generator 201. The characteristic function indicates how a state value relating to the generator 201 changes. The characteristic function indicates, for example, the relationship between the wind speed and the output power of a wind power generator. The characteristic function indicates, for example, the relationship between the fuel consumption of a thermal power generator and its output power. The storage unit 400 stores, for example, an approximate curve that approximates the characteristic function. The storage unit 400 stores, for example, a different characteristic function for each level of wind receiving performance of the wind power generator. Specifically, the storage unit 400 stores a different characteristic function for each pitch angle of the wind power generator.

The storage unit 400 stores, for example, the processing procedures of a reinforcement learning algorithm and an action selection algorithm. The reinforcement learning algorithm is, for example, the Q-learning algorithm. The reinforcement learning algorithm may be other than the Q-learning algorithm. The action selection algorithm is, for example, the ε-greedy algorithm.

The acquisition unit 401 acquires various information used in the processing of each functional unit from the storage unit 400 and outputs it to each functional unit. The acquisition unit 401 may acquire various information used in the processing of each functional unit from a device other than the reinforcement learning device 100 and output it to each functional unit. The acquisition unit 401 acquires, for example, state values relating to the power generation system 200. The acquisition unit 401 acquires a state value relating to the generator 201 from a measuring instrument provided on the generator 201. Specifically, the acquisition unit 401 acquires the power demand on the power generation system 200 from a computer of an electric power company.

The acquisition unit 401 may acquire, for example, information representing the characteristic function. The acquisition unit 401 may, for example, acquire thresholds relating to the characteristic function and generate the information representing the characteristic function. The thresholds relating to the characteristic function are at least the rated wind speed and the maximum output. The thresholds relating to the characteristic function may further include the cut-in wind speed and the cut-out wind speed. The acquisition unit 401 may acquire the output power of the generator 201 at various wind speeds and generate the information representing the characteristic function.

The switching unit 402 switches the validity information used for learning in reinforcement learning. The switching unit 402 sets, for example, the validity information for each of the plurality of regions as the validity information used for learning. Alternatively, the switching unit 402, for example, replaces, out of the validity information for each of the plurality of regions, the validity information for each of the two or more regions with the validity information for the region obtained by combining the two or more regions, and sets the result as the validity information used for learning.

Specifically, there are cases in which the validity information for each of the two or more regions is currently set as the validity information used for learning. In such a case, the switching unit 402 refers to the acquired state value relating to the generator 201 and, based on the characteristic function, generates validity information for the region obtained by combining the two or more regions. The switching unit 402 may instead generate the validity information for the combined region based on the validity information for each of the two or more regions. The switching unit 402 then replaces, within the validity information used for learning, the validity information for each of the two or more regions with the generated validity information.

Specifically, there are also cases in which the validity information for the region obtained by combining the two or more regions is currently set as the validity information used for learning. In such a case, the switching unit 402 refers to the acquired state value relating to the generator 201 and, based on the characteristic function, generates validity information for each of the two or more regions. The switching unit 402 may instead generate the validity information for each of the two or more regions based on the validity information for the combined region. The switching unit 402 then replaces, within the validity information used for learning, the validity information for the combined region with the generated validity information.

Specifically, the switching unit 402 identifies the output power corresponding to the acquired wind speed based on the characteristic function, and generates the validity information for the combined region based on the identified output power. In addition, specifically, the switching unit 402 generates the validity information for the combined region based on the characteristic function that corresponds to the acquired wind speed and output power, selected from among a plurality of characteristic functions that differ for each level of wind receiving performance of the generator 201.

Specifically, the switching unit 402 generates the validity information for the combined region when the acquired power demand is equal to or less than a threshold. More specifically, when the acquired power demand is equal to or less than the threshold, the switching unit 402 generates validity information for a region obtained by combining, out of the plurality of regions for output power, two or more regions of relatively large output power.

Specifically, the switching unit 402 may instead generate the validity information for the combined region when the acquired power demand exceeds the threshold. More specifically, when the acquired power demand exceeds the threshold, the switching unit 402 generates validity information for a region obtained by combining, out of the plurality of regions for output power, two or more regions of relatively small output power.

The learning unit 403 performs learning using the validity information set by the switching unit 402, and updates at least one piece of the validity information. The learning unit 403 performs learning using, for example, the validity information for each of the plurality of regions. Alternatively, the learning unit 403 performs learning after, for example, replacing, out of the validity information for each of the plurality of regions, the validity information for each of the two or more regions with the validity information for the region obtained by combining the two or more regions.

The output unit 404 outputs the processing result of each functional unit. The output format is, for example, display on a display, print output to a printer, transmission to an external device via the network I/F 303, or storage in a storage area such as the memory 302 or the recording medium 305. The output unit 404 can thereby notify the user of the processing result of each functional unit and support the management and operation of the reinforcement learning device 100, for example, updating of its setting values, which improves the convenience of the reinforcement learning device 100.

(Operation example of the reinforcement learning device 100 for the power generation system 200 including wind power generators)
Next, an operation example of the reinforcement learning device 100 for the power generation system 200 including wind power generators will be described with reference to FIGS. 5 to 13. First, turning to FIG. 5, a specific configuration example of the power generation system 200 including wind power generators will be described.

FIG. 5 is an explanatory diagram showing a specific configuration example of the power generation system 200 including wind power generators. In the example of FIG. 5, the power generation system 200 includes the reinforcement learning device 100, stall-controlled wind power generators i (i = 1, ..., n), and pitch-controlled wind power generators i (i = 1, ..., m). Each stall-controlled wind power generator i receives a command value ai (i = 1, ..., n) from the reinforcement learning device 100. Each pitch-controlled wind power generator i receives a command value bi (i = 1, ..., m) from the reinforcement learning device 100.

The power generation system 200 includes an anemometer si (i = 1, ..., n) for each stall-controlled wind power generator i and an anemometer pi (i = 1, ..., m) for each pitch-controlled wind power generator i. The anemometer si transmits a wind speed value F_si(t_j) to the reinforcement learning device 100, where t_j is a time. The anemometer pi transmits a wind speed value F_pi(t_j) to the reinforcement learning device 100. The power generation system 200 also includes a power meter for each stall-controlled wind power generator i and a power meter for each pitch-controlled wind power generator i. The power meter for the stall-controlled wind power generator i transmits an output watt value P_si(t_j) to the reinforcement learning device 100. The power meter for the pitch-controlled wind power generator i transmits an output watt value P_pi(t_j) to the reinforcement learning device 100.

The reinforcement learning device 100 includes a table generation unit 501, a section switching unit 502, a value setting unit 503, an action determination unit 504, a state calculation unit 505, a reward calculation unit 506, and a table update unit 507. The reinforcement learning device 100 seeks to increase the sum of the output watt values P_si(t_j) of the stall-controlled wind power generators i and the output watt values P_pi(t_j) of the pitch-controlled wind power generators i, within a range that does not exceed the demand power watt value P'(t_j) for the power generation system 200 as a whole.

The table generation unit 501 creates a normal table that stores validity information for a plurality of normal sections obtained by the normal division method described later with reference to FIGS. 6 and 7. The table generation unit 501 also sets which two or more normal sections are to be combined in order to convert the plurality of normal sections into a plurality of coarse division sections obtained by the coarse division method described later with reference to FIGS. 6 and 7.

The section switching unit 502 receives the wind speed values F_si(t_j), the wind speed values F_pi(t_j), the output watt values P_si(t_j), the output watt values P_pi(t_j), and the demand power watt value P'(t_j). Based on the received information, the section switching unit 502 switches the table to be used between the normal table and a coarse division table that stores validity information for the plurality of coarse division sections. If threshold α > demand power watt value P'(t_j), the section switching unit 502 sets the coarse division table as the table to be used. If threshold α ≦ demand power watt value P'(t_j), the section switching unit 502 sets the normal table as the table to be used.
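
The switching rule of the section switching unit 502 can be illustrated with a minimal sketch. The function name select_table and the string labels "coarse" and "normal" are illustrative assumptions and do not appear in the original; only the comparison against the threshold α follows the text.

```python
def select_table(demand_watts: float, alpha: float) -> str:
    """Return a label for the table to use at the current demand power value.

    Follows the rule in the text: if alpha > P'(t_j) the coarse division table
    is used, and if alpha <= P'(t_j) the normal table is used.
    The labels themselves are hypothetical.
    """
    return "coarse" if alpha > demand_watts else "normal"
```

Under this rule, a demand exactly equal to the threshold selects the normal table.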

The value setting unit 503 sets effective values in the table based on the result of the switching. When two or more regions have been combined, the value setting unit 503 sets an effective value in the record corresponding to the combined region. When two or more regions have been separated, the value setting unit 503 sets an effective value in the record corresponding to each of the separated regions. The value setting unit 503 outputs the table in which the effective values have been set.

The action determination unit 504 uses the table to select the command value ai for each stall-controlled wind power generator i and transmits it to that generator, and to select the command value bi for each pitch-controlled wind power generator i and transmits it to that generator. The action determination unit 504 selects, for example, the combination of command values ai and bi associated with the largest effective value in the table. Specifically, the action determination unit 504 uses the ε-greedy algorithm: with probability ε it selects command values at random, and with probability 1-ε it selects the combination of command values ai and bi associated with the largest effective value for the current state of the power generation system 200.
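
The ε-greedy selection described above can be sketched as follows. This is an illustration under assumptions, not the implementation of the action determination unit 504: the function name epsilon_greedy, the representation of one table record as a dictionary from command-value combinations to effective values, and the explicit random generator argument are all hypothetical.

```python
import random


def epsilon_greedy(record: dict, epsilon: float, rng: random.Random):
    """Select one command-value combination from a single table record.

    `record` maps a combination of command values, e.g. (a_i, b_i), to its
    effective value for the current state (hypothetical layout).
    """
    actions = list(record)
    # With probability epsilon, explore by choosing a combination at random.
    if rng.random() < epsilon:
        return rng.choice(actions)
    # Otherwise exploit: choose the combination with the largest effective value.
    return max(actions, key=record.get)
```

With epsilon set to 0 the function always returns the combination with the largest effective value, which corresponds to the purely greedy selection mentioned first in the paragraph above.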

The state calculation unit 505 identifies the state of the power generation system 200 based on the wind speed values F_si(t_j), the wind speed values F_pi(t_j), the output watt values P_si(t_j), the output watt values P_pi(t_j), and the demand power watt value P'(t_j). The state calculation unit 505 outputs a state result indicating the record of the table that corresponds to the identified state. The reward calculation unit 506 calculates a reward value based on the output watt values P_si(t_j), the output watt values P_pi(t_j), and the demand power watt value P'(t_j). The table update unit 507 updates the effective value in the record indicated by the state result based on the calculated reward value.
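
The reward calculation and table update can be sketched as follows for the case where Q-learning is used. The text does not give the reward formula, so the one below (total output while it stays within demand, and a penalty equal to the excess otherwise) is purely an assumption, as are the function names, the nested-dictionary table layout, and the learning rate and discount factor values. The update itself is the standard Q-learning rule.

```python
def reward(outputs, demand):
    """Hypothetical reward: total output if within demand, else minus the excess."""
    total = sum(outputs)
    # Assumption: exceeding the demand power watt value is penalised.
    return total if total <= demand else demand - total


def q_update(table, state, action, r, next_state, lr=0.1, gamma=0.9):
    """Standard Q-learning update of one effective value in the table.

    `table[state][action]` holds the effective value (hypothetical layout);
    `lr` and `gamma` are illustrative defaults, not values from the text.
    """
    best_next = max(table[next_state].values())
    table[state][action] += lr * (r + gamma * best_next - table[state][action])
    return table[state][action]
```

For example, with outputs of 30 W and 40 W against a demand of 100 W, this hypothetical reward is 70.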

Next, turning to FIGS. 6 to 13, the operation of each of the table generation unit 501, the section switching unit 502, the value setting unit 503, the action determination unit 504, the state calculation unit 505, the reward calculation unit 506, and the table update unit 507 will be described in detail. First, turning to FIGS. 6 and 7, the coarse division sections set by the table generation unit 501 will be described in detail.

FIG. 6 is an explanatory diagram showing the normal sections and coarse division sections that the state values of a stall-controlled wind power generator can take. For example, as shown in graph 601 of FIG. 6, the entire range of each of the output power and the wind speed is divided into a plurality of normal sections by the normal division method, which divides the range uniformly. For example, the entire range of output power is divided into normal sections 1, 2, ..., npsi. The entire range of wind speed is divided into normal sections 1, 2, ..., nfsi. Then, for example, as shown in graph 602 of FIG. 6, the two regions on the side of larger output power are set as the targets to be combined, and the entire range of each of the output power and the wind speed is divided into a plurality of coarse division sections. For example, the entire range of output power is divided into coarse division sections 1, 2, ..., npsi-1. The entire range of wind speed is divided into coarse division sections 1, 2, ..., nfsi-1. Next, the description turns to FIG. 7.

FIG. 7 is an explanatory diagram showing the normal sections and coarse division sections that the state values of a pitch-controlled wind power generator can take. For example, as shown in graph 701 of FIG. 7, the entire range of each of the output power and the wind speed is divided into a plurality of normal sections by the normal division method, which divides the range uniformly. For example, the entire range of output power is divided into normal sections 1, 2, ..., nppi. The entire range of wind speed is divided into normal sections 1, 2, ..., nfpi. Then, for example, as shown in graph 702 of FIG. 7, the two regions on the side of larger output power are set as the targets to be combined, and the entire range of each of the output power and the wind speed is divided into a plurality of coarse division sections. For example, the entire range of output power is divided into coarse division sections 1, 2, ..., nppi-1. The entire range of wind speed is divided into coarse division sections 1, 2, ..., nfpi-1.
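
The relationship between the normal sections and the coarse division sections of FIGS. 6 and 7 can be sketched as a mapping from a measured value to a section index. The function names, the 1-based indexing, and the clamping of out-of-range values are assumptions made for illustration; the text specifies only that the range is divided uniformly and that the two top sections are combined.

```python
def normal_section(value, lo, hi, n_sections):
    """1-based index of the uniform normal section that contains `value`."""
    # Assumption: values outside [lo, hi] are clamped to the end sections.
    if value >= hi:
        return n_sections
    if value <= lo:
        return 1
    width = (hi - lo) / n_sections
    return int((value - lo) / width) + 1


def coarse_section(value, lo, hi, n_sections):
    """1-based index after combining the top two normal sections into one.

    Normal sections n-1 and n both map to coarse division section n-1,
    so the coarse division has n-1 sections in total.
    """
    return min(normal_section(value, lo, hi, n_sections), n_sections - 1)
```

For a range of 0 to 100 split into 10 normal sections, the values 85 and 95 fall in normal sections 9 and 10 respectively, and both fall in coarse division section 9.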

Next, turning to FIG. 8, an example of how to realize the normal table 801 for the normal sections, which is created by the table generation unit 501, and the coarse division table 802 for the coarse division sections, to which the normal table 801 can be switched, will be described in detail.

FIG. 8 is an explanatory diagram showing an example of realizing the normal table 801 and the coarse division table 802. In FIG. 8, the reinforcement learning device 100 reuses the records of the normal table 801 as the records of the coarse division table 802, so that the normal table 801 and the coarse division table 802 can be converted into each other.

The reinforcement learning device 100 creates, for example, the normal table 801. When converting the normal table 801 into the coarse division table 802, the reinforcement learning device 100 reuses the record corresponding to one of the two or more normal sections to be combined as the record corresponding to the combined coarse division section. The reinforcement learning device 100 then deletes the validity information set in the record of the other normal section. For example, when combining normal section npsi-1 and normal section npsi, the reinforcement learning device 100 reuses the record corresponding to normal section npsi-1 as the record corresponding to coarse division section npsi-1. The reinforcement learning device 100 then, for example, deletes the validity information set in the record corresponding to normal section npsi.

When converting the coarse division table 802 into the normal table 801, the reinforcement learning device 100 reuses the record corresponding to the combined section as the record corresponding to one of the two or more sections into which the combined section is divided. The reinforcement learning device 100 then sets validity information again in the record of the other section. For example, when dividing coarse division section npsi-1 into normal section npsi-1 and normal section npsi, the reinforcement learning device 100 reuses the record corresponding to coarse division section npsi-1 as the record corresponding to normal section npsi-1. The reinforcement learning device 100 then, for example, sets validity information again in the record corresponding to normal section npsi.
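
The record reuse between the normal table 801 and the coarse division table 802 can be sketched as follows. This is a simplified illustration that keys records by the output power section index only; the function names, the dictionary layout, the use of None for deleted validity information, and the regenerate callback are all assumptions. The callback stands in for whatever procedure sets the validity information again, for example one based on the characteristic function or on the combined record.

```python
def to_coarse(table, n):
    """Combine sections n-1 and n: record n-1 is reused, record n is cleared."""
    coarse = dict(table)
    # The validity information of the other normal section is deleted.
    coarse[n] = None
    return coarse


def to_normal(table, n, regenerate):
    """Split coarse section n-1 back into normal sections n-1 and n."""
    normal = dict(table)
    # Record n-1 is reused as is; validity information for section n is set
    # again by the caller-supplied (hypothetical) `regenerate` function.
    normal[n] = regenerate(n)
    return normal
```

As a usage example, passing `lambda k: dict(coarse[k - 1])` as the callback copies the combined record into the restored section, which is one way of generating the validity information from that of the combined region.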

Next, an example of the stored contents of the normal table 801 created by the table generation unit 501 will be described with reference to FIG. 9. The normal table 801 is realized, for example, by a storage area such as the memory 302 or the recording medium 305 of the reinforcement learning device 100 illustrated in FIG. 3. The following description of the normal table 801 corresponds to the case where Q-learning is used as the reinforcement learning method; the stored contents may differ when a different reinforcement learning method is used.

FIG. 9 is an explanatory diagram showing an example of the stored contents of the normal table 801. As shown in FIG. 9, the normal table 801 has fields for state values, command values, and effective values. The normal table 801 stores validity information as records by setting information in each field.

In the state value fields, the sections that the state values relating to the power generation system 200 can take are set. The state values relating to the power generation system 200 include state values for the wind power generators and a state value for the power generation system 200 as a whole. In the example of FIG. 9, the state values for the wind power generators are the output watt value and wind speed value of the stall-controlled wind power generator and the output watt value and wind speed value of the pitch-controlled wind power generator. The state value for the power generation system 200 as a whole is the demand power watt value for the entire power generation system 200.

In the command value fields, the command values for the wind power generators are set. In the example of FIG. 9, the command value for a stall-controlled wind power generator is a command value that switches its power supply between ON and OFF. The command value for a pitch-controlled wind power generator is a command value indicating how much its pitch angle is to be changed. Specifically, the command values indicating how much the pitch angle is to be changed are -ΔΘ, ±0, and +ΔΘ. In the effective value field, an effective value is set that indicates the effectiveness of the combination of command values for the respective wind power generators when each state value falls within the corresponding section.

Next, turning to FIG. 10, an example of creating the characteristic function for a wind power generator will be described. The section switching unit 502 uses this characteristic function when converting the normal table 801 into the coarse division table 802 or when converting the coarse division table 802 into the normal table 801. The characteristic function for the wind power generator may instead be input to the reinforcement learning device 100 in advance.

FIG. 10 is an explanatory diagram showing an example of creating the characteristic function. In FIG. 10, the reinforcement learning device 100 creates the characteristic function for a wind power generator. The reinforcement learning device 100 acquires, for example, the output watt values of a stall-controlled wind power generator at various wind speeds. The reinforcement learning device 100 also acquires the rated wind speed, the maximum output, the cut-in wind speed, and the cut-out wind speed.

Next, based on the rated wind speed, the maximum output, the cut-in wind speed, the cut-out wind speed, and the output watt values at various wind speeds, the reinforcement learning device 100 obtains an approximate curve f_i(t) that approximates the characteristic curve given by the characteristic function of the stall-controlled wind power generator. For example, from wind speed 0 to the cut-in wind speed, the reinforcement learning device 100 obtains that part of the approximate curve f_i(t) in the form y = 0.

From the cut-in wind speed to the rated wind speed, the reinforcement learning device 100 obtains, for example, that part of the approximate curve f_i(t) in the form y = a*x^3, based on the output watt values at various wind speeds. Above the rated wind speed, it obtains, for example, that part of the approximate curve f_i(t) in the form y = b*x^2, again based on the output watt values at various wind speeds. The reinforcement learning device 100 thereby obtains an approximate curve f_i(t) such as that shown in graph 1000 of FIG. 10.
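The piecewise approximation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficients a and b are fitted by ordinary least squares through the origin, which the source does not specify; the output is taken as 0 above the cut-out wind speed, which the source also does not state; and the maximum output, although acquired, is not used here. The names `fit_coefficient` and `make_characteristic_curve` are hypothetical.

```python
def fit_coefficient(samples, exponent):
    """Least-squares fit of y = c * x**exponent (assumed fitting method)."""
    num = sum((x ** exponent) * y for x, y in samples)
    den = sum((x ** exponent) ** 2 for x, _ in samples)
    return num / den


def make_characteristic_curve(samples, cut_in, rated, cut_out):
    """Build an approximate curve f(wind_speed) from (wind_speed, watt) samples.

    Needs at least one sample in each of the two fitted ranges.
    """
    below_rated = [(x, y) for x, y in samples if cut_in <= x <= rated]
    above_rated = [(x, y) for x, y in samples if rated < x <= cut_out]
    a = fit_coefficient(below_rated, 3)  # y = a * x^3 up to the rated speed
    b = fit_coefficient(above_rated, 2)  # y = b * x^2 above the rated speed

    def f(wind_speed):
        # y = 0 below cut-in; 0 above cut-out is an assumption of this sketch.
        if wind_speed < cut_in or wind_speed > cut_out:
            return 0.0
        if wind_speed <= rated:
            return a * wind_speed ** 3
        return b * wind_speed ** 2

    return f


samples = [(4, 128.0), (6, 432.0), (10, 2000.0), (12, 2160.0), (14, 2940.0)]
f_i = make_characteristic_curve(samples, cut_in=3, rated=10, cut_out=20)
print(f_i(5), f_i(13))
```

For a pitch-controlled wind power generator, the same fit would be repeated once per pitch angle, giving one curve per angle.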

The reinforcement learning device 100 also acquires, for example, the output watt values of a pitch-controlled wind power generator at various wind speeds for each of the pitch angles Θ = 0, ΔΘ, 2ΔΘ, ..., kΔΘ. It also acquires the rated wind speed, the maximum output, the cut-in wind speed, and the cut-out wind speed.

Next, based on the rated wind speed, the maximum output, the cut-in wind speed, the cut-out wind speed, and the output watt values at various wind speeds, the reinforcement learning device 100 obtains an approximate curve f_i(t) that approximates the characteristic curve given by the characteristic function of the pitch-controlled wind power generator. For example, from wind speed 0 to the cut-in wind speed, the reinforcement learning device 100 obtains that part of the approximate curve f_i(t) in the form y = 0.

From the cut-in wind speed to the rated wind speed, the reinforcement learning device 100 obtains, for example, that part of the approximate curve f_i(t) in the form y = a*x^3, based on the output watt values at various wind speeds. Above the rated wind speed, it obtains, for example, that part of the approximate curve f_i(t) in the form y = b*x^2, again based on the output watt values at various wind speeds. The reinforcement learning device 100 thereby obtains, for each of the pitch angles Θ = 0, ΔΘ, 2ΔΘ, ..., kΔΘ, an approximate curve f_i(t) such as that shown in graph 1000 of FIG. 10.

Next, turning to FIG. 11, an example is described in which the section switching unit 502 converts the normal table 801 into the coarse division table 802 based on the approximate curve f_i(t) of the characteristic function created in advance, and switches the table in use to the coarse division table 802.

FIG. 11 is an explanatory diagram showing an example of switching the table in use to the coarse division table 802. In FIG. 11, because threshold α > demand power watt value P'(t_j), the reinforcement learning device 100 sets the coarse division table 802 as the table to use. The example of FIG. 11 describes setting an effective value in the field marked ●1101, which corresponds to the coarse division section npsi-1 obtained by combining the normal section nps1-1 and the normal section nps1.

For example, the reinforcement learning device 100 identifies the record corresponding to the coarse division section nps1-1. It acquires the wind speed values F_s1, ..., F_sn for the stall-controlled wind power generators i set in the identified record. It also acquires the output watt values P_p1, ..., P_pm of the pitch-controlled wind power generators i and the wind speed values F_p1, ..., F_pm for the pitch-controlled wind power generators i set in that record. In addition, it acquires the demand power watt value P' for the power generation system 200 as a whole set in that record.

For a stall-controlled wind power generator i, the reinforcement learning device 100 calculates the predicted output power with the power supply ON, P'_si = f_i(F_si), from the observed wind speed F_si and the approximate curve f_i(t). With the power supply OFF, it sets the predicted output power to P'_si = 0.

For a pitch-controlled wind power generator i, the reinforcement learning device 100 determines the pitch angle Θ for which f_{i,Θ}(F_pi) ≈ P_pi, based on the observed wind speed F_pi, the output watt value P_pi, and the approximate curves f_{i,Θ}(t). It then calculates the predicted output power P'_pi for changes of −ΔΘ, ±0, and +ΔΘ from the determined pitch angle Θ, namely f_{i,Θ−ΔΘ}(F_pi), f_{i,Θ}(F_pi), and f_{i,Θ+ΔΘ}(F_pi).
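The two prediction steps above can be sketched as follows. This is an illustration under stated assumptions: the approximate curves for the pitch angles 0, ΔΘ, ..., kΔΘ are held in a list indexed by the multiple of ΔΘ, the pitch angle is estimated as the one whose curve is closest to the observed output, and a change that would leave the range of available angles is clamped to the nearest available angle. None of these details is given in the source, and all function names are hypothetical.

```python
def predict_stall_outputs(curve, wind_speed):
    """Predicted output of a stall-controlled generator for ON and OFF."""
    return {"ON": curve(wind_speed), "OFF": 0.0}


def estimate_pitch_index(curves, wind_speed, observed_power):
    """Index k of the pitch angle k*dTheta whose curve best matches the observation."""
    return min(
        range(len(curves)),
        key=lambda k: abs(curves[k](wind_speed) - observed_power),
    )


def predict_pitch_outputs(curves, wind_speed, observed_power):
    """Predicted output for pitch changes of -dTheta, 0, and +dTheta."""
    k = estimate_pitch_index(curves, wind_speed, observed_power)
    lower = max(k - 1, 0)                 # clamping is an assumption
    upper = min(k + 1, len(curves) - 1)   # clamping is an assumption
    return {
        "-dTheta": curves[lower](wind_speed),
        "0": curves[k](wind_speed),
        "+dTheta": curves[upper](wind_speed),
    }


# Illustrative curves only: output falls as the pitch angle index grows.
curves = [lambda v, c=c: c * v ** 3 for c in (1.0, 0.8, 0.6)]
print(predict_pitch_outputs(curves, wind_speed=2.0, observed_power=6.5))
```

The resulting dictionaries play the role of the entries of the predicted output power table referred to in the text.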

The reinforcement learning device 100 creates a predicted output power table 1100 based on the predicted output power P'_si = f_i(F_si) and the predicted output power P'_pi = f_{i,Θ−ΔΘ}(F_pi), f_{i,Θ}(F_pi), f_{i,Θ+ΔΘ}(F_pi). It acquires the command values a1, ..., an for the stall-controlled wind power generators and the command values b1, ..., bm for the pitch-controlled wind power generators that correspond to the field marked ●1101. From the created predicted output power table 1100 it then acquires the predicted output powers P'_si and P'_pi corresponding to the acquired command values, and calculates the predicted output power P~ of the power generation system 200 as a whole using formula (1) below.

The reinforcement learning device 100 then calculates the effective value Q = r(P'') based on the difference value P'' between the acquired demand power watt value P' and the calculated predicted output power P~, and sets it in the field marked ●1101. r(P'') is defined by formulas (2) to (5) below. Specifically, P'' is defined by formula (2), and with δ > 0 set, r(P'') is defined by formula (3) when P'' > δ, by formula (4) when −δ ≤ P'' ≤ δ, and by formula (5) when P'' < −δ.

The reinforcement learning device 100 sets effective values in the other fields in the same way. It can thereby calculate effective values that grow larger as the total of the output watt values approaches the demand power watt value, which makes it easier to control the power generation system 200 appropriately by referring to the effective values. The reinforcement learning device 100 can also reduce the number of pieces of validity information that are learned and updated, and so reduce the amount of processing required for reinforcement learning.
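The way an initial effective value is derived from the predicted outputs can be sketched as follows. Formulas (1) to (5) are not reproduced in this text, so the sketch rests on assumptions: the system-wide predicted output is taken to be the sum of the per-generator predictions, the difference is taken as demand minus predicted output, and the three-branch function `placeholder_reward` is only a stand-in with the stated property that the value grows as the total output approaches the demand. It is not the function defined by formulas (3) to (5).

```python
def placeholder_reward(diff, delta):
    """Stand-in for r(P''): NOT formulas (3)-(5), only a shape with three
    branches that is largest inside the band [-delta, +delta]."""
    if diff > delta:
        return -(diff - delta)
    if diff < -delta:
        return diff + delta
    return 0.0


def initial_effective_value(demand, predicted_outputs, delta):
    """Effective value for one field, from the predicted outputs of the
    command values that the field corresponds to."""
    total = sum(predicted_outputs)  # assumed reading of formula (1)
    diff = demand - total           # assumed reading of formula (2)
    return placeholder_reward(diff, delta)


# One field: demand 100 W, two generators predicted at 40 W and 55 W.
print(initial_effective_value(100.0, [40.0, 55.0], delta=10.0))
```

Filling every field this way gives the table a starting point that already reflects the generator characteristics, rather than starting from zeros or random values.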

Next, turning to FIG. 12, an example is described in which the section switching unit 502 converts the coarse division table 802 into the normal table 801 based on the approximate curve f_i(t) of the characteristic function created in advance, and switches the table in use to the normal table 801.

FIG. 12 is an explanatory diagram showing an example of switching the table in use to the normal table 801. In FIG. 12, because threshold α ≤ demand power watt value P'(t_j), the reinforcement learning device 100 sets the normal table 801 as the table to use. The example of FIG. 12 describes setting an effective value in the field marked ●1201, which corresponds to the normal section nps1-1 obtained by dividing the coarse division section npsi-1.

For example, the reinforcement learning device 100 identifies the record corresponding to the coarse division section npsi-1. It then acquires the wind speed values F_s1, ..., F_sn for the stall-controlled wind power generators i set in the identified record. It also acquires the output watt values P_p1, ..., P_pm of the pitch-controlled wind power generators i and the wind speed values F_p1, ..., F_pm for the pitch-controlled wind power generators i set in that record. In addition, it acquires the demand power watt value P' for the power generation system 200 as a whole set in that record.

The reinforcement learning device 100 creates a predicted output power table 1200 based on the predicted output power P'_si = f_i(F_si) and the predicted output power P'_pi = f_{i,Θ−ΔΘ}(F_pi), f_{i,Θ}(F_pi), f_{i,Θ+ΔΘ}(F_pi). It acquires the command values a1, ..., an for the stall-controlled wind power generators and the command values b1, ..., bm for the pitch-controlled wind power generators that correspond to the field marked ●1201. From the created predicted output power table 1200 it then acquires the predicted output powers P'_si and P'_pi corresponding to the acquired command values, and calculates the predicted output power P~ of the power generation system 200 as a whole using formula (1) above.

The reinforcement learning device 100 then calculates the effective value Q = r(P'') based on the difference value P'' between the acquired demand power watt value P' and the calculated predicted output power P~, and sets it in the field marked ●1201. Specifically, r(P'') is defined by formulas (2) to (5) above.

The reinforcement learning device 100 sets effective values in the same way in the other fields corresponding to the normal section nps1-1 and in the fields corresponding to the normal section npsi. It can thereby calculate effective values that grow larger as the total of the output watt values approaches the demand power watt value, which makes it easier to control the power generation system 200 appropriately by referring to the effective values. The reinforcement learning device 100 can also increase the number of pieces of validity information that are learned and updated, and so decide in finer detail which command value is preferable to output for which state value.

In the above, the reinforcement learning device 100 calculated the effective values based on the approximate curves f_i(t) and f_{i,Θ}(t). Let the current wind speeds be F_si and F_pi, and let the wind speeds observed at the next time be F_si + ΔF_si and F_pi + ΔF_pi. The output powers observed at the next time are then f_i(F_si + ΔF_si), f_{i,Θ−ΔΘ}(F_pi + ΔF_pi), f_{i,Θ}(F_pi + ΔF_pi), and f_{i,Θ+ΔΘ}(F_pi + ΔF_pi). Because the approximate curves are continuous functions, these output powers can be written as f_i(F_si) + ΔP_si, f_{i,Θ−ΔΘ}(F_pi) + ΔP_pi, f_{i,Θ}(F_pi) + ΔP_pi, and f_{i,Θ+ΔΘ}(F_pi) + ΔP_pi, where ΔP_si and ΔP_pi become small as ΔF_si and ΔF_pi become small.

If the demand power at the next time is P' + ΔP', formula (6) below holds for the difference P''_n between the demand power and the output power at the next time and the currently obtained difference P'' between the demand power and the output power. Because the reward function is a continuous function, formula (7) below also holds.

Here, if ΔF_si → 0, ΔF_pi → 0, and ΔP → 0, then P''_n → P'' and r(P''_n) → r(P''). Therefore, when the wind speed is stable and the change in demand power is small, P''_n ≈ P'' and r(P''_n) ≈ r(P''). Consequently, if the reinforcement learning device 100 uses the ε-greedy algorithm to select the combination of command values with the largest effective value r(P''), it is judged from formula (8) below that the total output power can be brought close to the demand power.

In this way, by calculating the effective values based on the characteristic functions, the reinforcement learning device 100 can make the output power of the controlled wind power generators follow the actual demand power. It can therefore select appropriate command values more easily than when the effective values are initialized to 0 or set randomly.

Thereafter, at fixed time intervals, the reinforcement learning device 100 uses the ε-greedy algorithm to select and output a combination of command values ai and bi as an action. For example, at time t_j it acquires the wind speed values F_si(t_j) and F_pi(t_j), the output watt values P_si(t_j) and P_pi(t_j), and the demand power watt value P'(t_j) as state values.

The reinforcement learning device 100 identifies the section F~_si(t_j) to which the wind speed value F_si(t_j) belongs and the section F~_pi(t_j) to which the wind speed value F_pi(t_j) belongs. It identifies the section P~_si(t_j) to which the output watt value P_si(t_j) belongs and the section P~_pi(t_j) to which the output watt value P_pi(t_j) belongs. It also identifies the section P'~(t_j) to which the demand power watt value P'(t_j) belongs.
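Identifying the section to which each state value belongs can be sketched as a lookup in a sorted list of section boundaries. This is a minimal illustration assuming half-open sections [b_k, b_{k+1}) and clamping of out-of-range values to the first or last section; the source does not specify either. The names `section_index` and `state_key` are hypothetical.

```python
from bisect import bisect_right


def section_index(boundaries, value):
    """Index of the section [boundaries[k], boundaries[k + 1]) containing value.

    Values outside the covered range are clamped to the nearest section
    (an assumption of this sketch).
    """
    idx = bisect_right(boundaries, value) - 1
    return min(max(idx, 0), len(boundaries) - 2)


def state_key(boundaries_per_value, values):
    """Tuple of section indices, usable as a key for a table record."""
    return tuple(
        section_index(b, v) for b, v in zip(boundaries_per_value, values)
    )


wind_sections = [0, 5, 10, 15]    # three wind speed sections
demand_sections = [0, 100, 200]   # two demand power sections
print(state_key([wind_sections, demand_sections], [7, 150]))
```

A coarse division section can be represented in the same way by dropping the boundary between the two normal sections that are combined.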

The reinforcement learning device 100 then, with probability ε, randomly selects and outputs a combination of command values ai and bi. With probability 1−ε, it identifies, in whichever of the normal table 801 and the coarse division table 802 is set as the table in use, the record corresponding to the combination of sections to which the acquired state values belong. It then selects and outputs the combination of command values ai and bi associated with the largest effective value in the identified record.
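The ε-greedy selection described above can be sketched as follows. The record is represented here as a dictionary from a combination of command values to its effective value; this representation and the name `select_action` are assumptions made for illustration.

```python
import random


def select_action(record, epsilon, rng=random):
    """Epsilon-greedy choice over one table record.

    record maps each combination of command values to its effective value.
    """
    actions = list(record)
    if rng.random() < epsilon:
        # Explore: pick any combination uniformly at random.
        return rng.choice(actions)
    # Exploit: pick the combination with the largest effective value.
    return max(actions, key=record.get)


record = {("ON", "-dTheta"): 0.2, ("ON", "+dTheta"): 0.9, ("OFF", "0"): -0.4}
print(select_action(record, epsilon=0.1))
```

With `epsilon=0.0` the function always returns the combination with the largest effective value; larger values of `epsilon` trade this for more exploration.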

The reinforcement learning device 100 also calculates, at time t_j, a reward value for the combination of command values ai and bi output at time t_{j−1}. For example, it calculates the reward value r(t_j) using formulas (9) to (12) below. Specifically, P''(t_j) is defined by formula (9). Then r(t_j) is defined by formula (10) when P''(t_j) > δ, by formula (11) when −δ ≤ P''(t_j) ≤ δ, and by formula (12) when P''(t_j) < −δ.

Next, with reference to FIG. 13, a case is described in which the information processing device updates an effective value based on the reward value calculated in response to outputting a combination of command values ai and bi. In the example of FIG. 13, the information processing device updates the effective value at time t_{j+1}, for example, based on the reward value calculated for the combination of command values ai and bi output at time t_j.

FIG. 13 is an explanatory diagram showing an example of updating an effective value. In FIG. 13, the reinforcement learning device 100 updates the effective value Q set in the field marked ●1301, which is associated with the record corresponding to the combination of sections to which the state values acquired at time t_j belong. For example, using formulas (13) and (14) below, it calculates an effective value Q' based on the reward value calculated at time t_{j+1}, and updates the effective value Q set in the field marked ●1301.

If the record corresponding to the combination of sections to which the state values acquired at time t_{j+1} belong is a record for a coarse division section, or a record for a normal section that can be combined, the reinforcement learning device 100 calculates the effective value Q' using formula (13) below. If that record is neither, the reinforcement learning device 100 calculates the effective value Q' using formula (14) below.

In the example of FIG. 13, the reinforcement learning device 100 determines that the record corresponding to the combination of sections to which the state values acquired at time t_{j+1} belong is neither a record for a coarse division section nor a record for a normal section that can be combined. It calculates the effective value Q' using formula (14) above, based on the state value of field 1311, the value obtained by applying the max function to the effective values of field 1312, and the effective value Q.

Alternatively, the reinforcement learning device 100 may determine that the record corresponding to the combination of sections to which the state values acquired at time t_{j+1} belong is a record for a coarse division section or a record for a normal section that can be combined. In this case, it calculates the effective value Q' using formula (13) above, based on the state value of field 1310, the state value of field 1320, and the effective value Q. The reinforcement learning device 100 can thereby update the validity information and make it easier to control the power generation system 200 appropriately.
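Formulas (13) and (14) are not reproduced in this text. For orientation only, the following sketch shows the standard tabular Q-learning update, which shares with the description of formula (14) the use of the reward and of a maximum over the effective values of the next record. Whether formula (14) has exactly this form is an assumption, as are the parameter names `learning_rate` and `discount` and their default values. No sketch is attempted for formula (13), whose inputs are described differently.

```python
def q_learning_update(q, reward, next_record, learning_rate=0.1, discount=0.9):
    """Standard Q-learning update (assumed stand-in for formula (14)).

    q           : current effective value Q of the field being updated
    reward      : reward value calculated at the next time step
    next_record : effective values in the record for the next state
    """
    max_next = max(next_record)  # corresponds to the max function in the text
    return q + learning_rate * (reward + discount * max_next - q)


q_new = q_learning_update(
    q=1.0, reward=2.0, next_record=[0.5, 3.0, -1.0],
    learning_rate=0.5, discount=0.9,
)
print(q_new)
```

The value returned would replace the effective value Q in the field for the state and command values used at the previous time step.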

(Operation example of the reinforcement learning device 100 for the power generation system 200 including a thermal power generator)
Next, an operation example of the reinforcement learning device 100 for the power generation system 200 including a thermal power generator is described with reference to FIGS. 14 and 15. First, turning to FIG. 14, a specific configuration example of the power generation system 200 including a thermal power generator is described.

FIG. 14 is an explanatory diagram showing a specific configuration example of the power generation system 200 including thermal power generators. In the example of FIG. 14, the power generation system 200 includes the reinforcement learning device 100 and fuel-controlled thermal power generators i (i = 1, ..., m). Each fuel-controlled thermal power generator i receives a command value bi (i = 1, ..., m) from the reinforcement learning device 100.

The power generation system 200 includes a fuel gauge pi (i = 1, ..., m) for each fuel-controlled thermal power generator i. The fuel gauge pi transmits the fuel usage amount F_pi(t_j) to the reinforcement learning device 100. The power generation system 200 also includes a power meter for each fuel-controlled thermal power generator i, which transmits the output watt value P_pi to the reinforcement learning device 100.

The reinforcement learning device 100 includes a table generation unit 501, a section switching unit 502, a value setting unit 503, an action determination unit 504, a state calculation unit 505, a reward calculation unit 506, and a table update unit 507. Because the operation of each unit of the reinforcement learning device 100 for the power generation system 200 including thermal power generators is the same as that for the power generation system 200 including wind power generators, its description is omitted. Turning now to FIG. 15, an example of the stored contents of the normal table 801 in the power generation system 200 including thermal power generators is described.

FIG. 15 is an explanatory diagram showing an example of the stored contents of the normal table 801 for thermal power generators. As shown in FIG. 15, the normal table 801 has fields for state values, command values, and effective values. The normal table 801 stores validity information as records by setting information in each field.

The state value field holds the sections that the state values of the power generation system 200 can take. The state values of the power generation system 200 include state values for the thermal power generators and a state value for the power generation system 200 as a whole. In the example of FIG. 15, the state values for a thermal power generator are the output watt value and the fuel usage amount of the fuel-controlled thermal power generator. The state value for the power generation system 200 as a whole is the demand power of the entire power generation system 200.

A command value for the thermal power generator is set in the command value field. In the example of FIG. 15, the command value for a fuel-controlled thermal power generator indicates how much its fuel usage amount is changed. Specifically, the command values are −X, ±0, and +X. The effective value field holds an effective value indicating the effectiveness of a combination of command values for the thermal power generators when each state value falls within one of the sections.

(Overall processing procedure)
Next, an example of the overall processing procedure executed by the reinforcement learning device 100 is described with reference to FIG. 16. The overall processing is realized by, for example, the CPU 301 shown in FIG. 3, storage areas such as the memory 302 and the recording medium 305, and the network I/F 303.

FIG. 16 is a flowchart showing an example of the overall processing procedure. In FIG. 16, the reinforcement learning device 100 creates a table for a plurality of normal sections and sets two or more normal sections that can be combined into a coarse division section (step S1601).

Next, the reinforcement learning device 100 executes the switching determination processing described later with reference to FIG. 17 and sets the table to use (step S1602). It then executes the value setting processing described later with reference to FIG. 18 and sets effective values in the table that was set (step S1603).

Next, the reinforcement learning device 100 identifies the state of the wind power generators based on the output watt values of the wind power generators, the wind speed values, and the demand power watt value (step S1604). It then calculates the reward from the wind power generators based on the output watt values of the wind power generators and the demand power watt value (step S1605).

次に、強化学習装置１００は、設定したテーブルを利用して、風力発電機に対する指令値を決定して出力する（ステップＳ１６０６）。そして、強化学習装置１００は、算出した報酬に基づいて、設定したテーブルに記憶された有効値を更新する（ステップＳ１６０７）。その後、強化学習装置１００は、ステップＳ１６０２の処理に戻る。 Next, the reinforcement learning apparatus 100 determines and outputs the command value for the wind power generator using the set table (step S1606). Then, the reinforcement learning device 100 updates the effective value stored in the set table based on the calculated reward (step S1607). After that, the reinforcement learning device 100 returns to the process of step S1602.
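One pass through steps S1604 to S1607 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the flowchart: the reward shape (negative absolute gap between output and demand), the epsilon-greedy selection, and the tabular Q-learning update with `learning_rate` and `discount` are all assumed, because the procedure only states that a reward is calculated and the effective values are updated. All function names and command labels are hypothetical.

```python
import random


def reward_from_gap(output_watts, demand_watts):
    """S1605 (assumed form): the closer the output is to demand, the higher the reward."""
    return -abs(output_watts - demand_watts)


def choose_command(table, state, commands, epsilon, rng):
    """S1606 (assumed policy): epsilon-greedy over the effective values of `state`."""
    if rng.random() < epsilon:
        return rng.choice(commands)
    return max(commands, key=lambda c: table.get((state, c), 0.0))


def update_effective_value(table, state, command, reward, next_state, commands,
                           learning_rate=0.1, discount=0.9):
    """S1607 (assumed rule): one tabular Q-learning update of an effective value."""
    best_next = max(table.get((next_state, c), 0.0) for c in commands)
    old = table.get((state, command), 0.0)
    table[(state, command)] = old + learning_rate * (reward + discount * best_next - old)
    return table[(state, command)]


commands = ["-dTheta", "0", "+dTheta"]   # hypothetical command labels
table = {}                               # (state, command) -> effective value
rng = random.Random(0)

command = choose_command(table, "s0", commands, epsilon=0.0, rng=rng)
reward = reward_from_gap(output_watts=90.0, demand_watts=100.0)
update_effective_value(table, "s0", command, reward, "s1", commands)
```

The table is keyed by (state, command), so switching between a normal table and a coarse division table only changes which state labels are used, not the update itself.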

(Switching determination processing procedure)
Next, an example of the switching determination processing procedure executed by the reinforcement learning device 100 will be described with reference to FIG. 17. The switching determination process is realized by, for example, the CPU 301 shown in FIG. 3, a storage area such as the memory 302 or the recording medium 305, and the network I/F 303.

FIG. 17 is a flowchart showing an example of the switching determination processing procedure. In FIG. 17, the reinforcement learning device 100 sets a threshold value α (step S1701). Next, the reinforcement learning device 100 determines whether or not α > P′(t_j) holds for the demand power watt value P′(t_j) (step S1702).

If α > P′(t_j) (step S1702: Yes), the reinforcement learning device 100 proceeds to the process of step S1707. If α > P′(t_j) does not hold (step S1702: No), the reinforcement learning device 100 proceeds to the process of step S1703.

In step S1703, the reinforcement learning device 100 decides to use the normal table 801 for the normal sections (step S1703). Next, the reinforcement learning device 100 determines whether or not the normal table 801 for the normal sections had been in use until immediately before (step S1704).

If the normal table 801 had been in use (step S1704: Yes), the reinforcement learning device 100 proceeds to the process of step S1706. If the normal table 801 had not been in use (step S1704: No), the reinforcement learning device 100 proceeds to the process of step S1705.

In step S1705, the reinforcement learning device 100 creates the normal table 801 for the normal sections and sets it as the table to be used (step S1705). The reinforcement learning device 100 then ends the switching determination process.

In step S1706, the reinforcement learning device 100 sets the table that had been in use until immediately before, unchanged, as the table to be used (step S1706). The reinforcement learning device 100 then ends the switching determination process.

In step S1707, the reinforcement learning device 100 decides to use the coarse division table 802 for the coarse division section (step S1707). Next, the reinforcement learning device 100 determines whether or not the coarse division table 802 for the coarse division section had been in use until immediately before (step S1708).

If the coarse division table 802 had been in use (step S1708: Yes), the reinforcement learning device 100 proceeds to the process of step S1706. If the coarse division table 802 had not been in use (step S1708: No), the reinforcement learning device 100 proceeds to the process of step S1709.

In step S1709, the reinforcement learning device 100 creates the coarse division table 802 for the coarse division section and sets it as the table to be used (step S1709). The reinforcement learning device 100 then ends the switching determination process.
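The branching of steps S1702 to S1709 reduces to two decisions: which kind of table the current demand calls for, and whether that table must be newly created because a different kind was in use until immediately before. A minimal sketch, assuming hypothetical names (`select_table`, the string labels "normal" and "coarse") that do not appear in the original:

```python
def select_table(demand_watts, threshold, previous_kind):
    """Return (kind, must_create) for the table to use next.

    kind        -- "coarse" when threshold > demand (S1702: Yes), else "normal"
    must_create -- True when the other kind was in use until now (S1705 / S1709),
                   False when the previous table is reused as is (S1706)
    """
    kind = "coarse" if threshold > demand_watts else "normal"
    must_create = previous_kind != kind
    return kind, must_create


low_demand = select_table(demand_watts=50.0, threshold=100.0, previous_kind="normal")
high_demand = select_table(demand_watts=150.0, threshold=100.0, previous_kind="normal")
at_threshold = select_table(demand_watts=100.0, threshold=100.0, previous_kind="coarse")
```

Because the test in step S1702 is the strict inequality α > P′(t_j), a demand exactly equal to the threshold selects the normal table.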

(Value setting processing procedure)
Next, an example of the value setting processing procedure executed by the reinforcement learning device 100 will be described with reference to FIG. 18. The value setting process is realized by, for example, the CPU 301 shown in FIG. 3, a storage area such as the memory 302 or the recording medium 305, and the network I/F 303.

FIG. 18 is a flowchart showing an example of the value setting processing procedure. In FIG. 18, the reinforcement learning device 100 identifies, from the table to be used, a record corresponding to a normal section obtained by dividing a coarse division section, or to a coarse division section obtained by combining normal sections (step S1801).

Next, the reinforcement learning device 100 acquires the wind speed values F_s1, ..., F_sn for the stall-controlled wind power generators i (i = 1, ..., n) set in the identified record (step S1802). The reinforcement learning device 100 then acquires the output watt values P_p1, ..., P_pm from the pitch-controlled wind power generators i (i = 1, ..., m) set in the identified record (step S1803).

Next, the reinforcement learning device 100 acquires the wind speed values F_p1, ..., F_pm for the pitch-controlled wind power generators i (i = 1, ..., m) set in the identified record (step S1804). The reinforcement learning device 100 then acquires the demand power watt value P′ for the entire power generation system 200 set in the identified record (step S1805).

Next, for each stall-controlled wind power generator i (i = 1, ..., n), the reinforcement learning device 100 calculates the predicted output power f_i(F_si) obtained when the power is turned ON, based on the observed wind speed F_si and the approximate curve f_i(t) (step S1806). Then, for each pitch-controlled wind power generator i (i = 1, ..., m), the reinforcement learning device 100 determines the pitch angle Θ that satisfies f_{i,Θ}(F_pi) ≈ P_pi, based on the observed wind speed F_pi, the output watt value P_pi, and the approximate curves f_{i,Θ}(t) (step S1807).

Next, the reinforcement learning device 100 calculates the predicted output powers f_{i,Θ−ΔΘ}(F_pi), f_{i,Θ}(F_pi), and f_{i,Θ+ΔΘ}(F_pi) obtained when the determined pitch angle Θ is changed by −ΔΘ, ±0, and +ΔΘ, respectively (step S1808). The reinforcement learning device 100 then creates a predicted output power table based on the predicted output power f_i(F_si) and the predicted output powers f_{i,Θ−ΔΘ}(F_pi), f_{i,Θ}(F_pi), and f_{i,Θ+ΔΘ}(F_pi) (step S1809).

Next, for each field in the identified record, the reinforcement learning device 100 acquires the command values a1, ..., an for the stall-controlled wind power generators and the command values b1, ..., bm for the pitch-controlled wind power generators (step S1810). Then, for each field in the identified record, the reinforcement learning device 100 calculates the predicted output power P̃ of the entire power generation system 200 based on the acquired command values and the created predicted output power table (step S1811).

Next, for each field in the identified record, the reinforcement learning device 100 calculates and sets an effective value based on the difference between the demand power watt value P′ and the calculated predicted output power P̃ (step S1812). The reinforcement learning device 100 then ends the value setting process.
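Steps S1810 to S1812 can be illustrated with a small sketch that enumerates every combination of command values, sums the predicted outputs taken from the predicted output power table, and scores each combination by its gap from demand. The scoring formula −|P′ − P̃| is an assumption, since the procedure only says the effective value is based on the difference. Treating a stall-controlled generator that is switched OFF as contributing 0 W is also an assumption, and the function and parameter names are hypothetical.

```python
from itertools import product


def initial_effective_values(stall_on_watts, pitch_watts, demand_watts):
    """Score every command combination by how closely it meets demand.

    stall_on_watts -- predicted output of each stall-controlled generator when ON
    pitch_watts    -- per pitch-controlled generator, a dict mapping the pitch step
                      (-1, 0, +1, in units of the pitch increment) to predicted output
    """
    values = {}
    for a in product((0, 1), repeat=len(stall_on_watts)):        # OFF / ON
        for b in product((-1, 0, 1), repeat=len(pitch_watts)):   # pitch step
            total = sum(w for on, w in zip(a, stall_on_watts) if on)
            total += sum(row[step] for step, row in zip(b, pitch_watts))
            values[(a, b)] = -abs(demand_watts - total)          # assumed scoring
    return values


values = initial_effective_values(
    stall_on_watts=[100.0],
    pitch_watts=[{-1: 40.0, 0: 50.0, 1: 60.0}],
    demand_watts=150.0,
)
best = max(values, key=values.get)
```

With one stall-controlled and one pitch-controlled generator there are 2 × 3 = 6 combinations; the number grows multiplicatively with each added generator, which is why reducing the number of state regions matters.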

(Characteristic function creation processing procedure)
Next, an example of the characteristic function creation processing procedure executed by the reinforcement learning device 100 will be described with reference to FIGS. 19 and 20. The characteristic function creation process is realized by, for example, the CPU 301 shown in FIG. 3, a storage area such as the memory 302 or the recording medium 305, and the network I/F 303.

FIG. 19 is a flowchart showing an example of the characteristic function creation processing procedure for a stall-controlled wind power generator. In FIG. 19, the reinforcement learning device 100 observes the output watt values from the stall-controlled wind power generator at various wind speeds (step S1901).

Next, based on the output watt values at the various wind speeds, the reinforcement learning device 100 obtains an approximate curve f_i(t) that approximates the characteristic curve represented by the characteristic function of the stall-controlled wind power generator (step S1902). The reinforcement learning device 100 then ends the characteristic function creation process for the stall-controlled wind power generator.
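The procedure does not say how the approximate curve is obtained from the observations. As one possible illustration, the sketch below builds the curve by piecewise-linear interpolation through the observed (wind speed, output watt) pairs and holds the end values outside the observed range. Both the interpolation method and the name `fit_curve` are assumptions made for this example.

```python
from bisect import bisect_right


def fit_curve(samples):
    """Return f(t) interpolating linearly through (wind_speed, watts) samples."""
    points = sorted(samples)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def f(t):
        if t <= xs[0]:
            return ys[0]            # below the observed range: hold the first value
        if t >= xs[-1]:
            return ys[-1]           # above the observed range: hold the last value
        k = bisect_right(xs, t)     # xs[k - 1] <= t < xs[k]
        x0, x1 = xs[k - 1], xs[k]
        y0, y1 = ys[k - 1], ys[k]
        return y0 + (y1 - y0) * (t - x0) / (x1 - x0)

    return f


f = fit_curve([(0.0, 0.0), (10.0, 100.0), (5.0, 20.0)])
predicted = f(7.5)
```

A least-squares polynomial or a manufacturer-supplied power curve could be substituted without changing how the curve is used in step S1806.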

FIG. 20 is a flowchart showing an example of the characteristic function creation processing procedure for a pitch-controlled wind power generator. In FIG. 20, the reinforcement learning device 100 acquires the maximum pitch angle MΘ of the pitch-controlled wind power generator (step S2001).

Next, the reinforcement learning device 100 sets the pitch angle Θ = 0 for the pitch-controlled wind power generator (step S2002). The reinforcement learning device 100 then determines whether or not Θ < MΘ (step S2003).

If Θ < MΘ (step S2003: Yes), the reinforcement learning device 100 proceeds to the process of step S2004. If Θ < MΘ does not hold (step S2003: No), the reinforcement learning device 100 ends the characteristic function creation process for the pitch-controlled wind power generator.

In step S2004, the reinforcement learning device 100 observes the output watt values from the pitch-controlled wind power generator at various wind speeds (step S2004). Next, based on the output watt values at the various wind speeds, the reinforcement learning device 100 obtains an approximate curve f_{i,Θ}(t) that approximates the characteristic curve represented by the characteristic function of the pitch-controlled wind power generator (step S2005).

The reinforcement learning device 100 then sets the pitch angle to Θ = Θ + ΔΘ (step S2006). After that, the reinforcement learning device 100 returns to the process of step S2003.
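The loop of steps S2002 to S2006 yields one approximate curve per pitch angle, and step S1807 of the value setting process later picks the angle whose curve best reproduces an observation. A minimal sketch of both, assuming integer pitch angles and a caller-supplied `fit_curve_at` callback that stands in for the observation and fitting of steps S2004 and S2005. The callback, the toy curve family, and all names are hypothetical.

```python
def build_pitch_curves(max_pitch, pitch_step, fit_curve_at):
    """Collect one curve per pitch angle 0, step, 2*step, ... below max_pitch."""
    curves = {}
    theta = 0                                   # S2002
    while theta < max_pitch:                    # S2003
        curves[theta] = fit_curve_at(theta)     # S2004 and S2005
        theta += pitch_step                     # S2006
    return curves


def estimate_pitch(curves, wind_speed, output_watts):
    """Pick the pitch angle whose curve is closest to the observed output."""
    return min(curves, key=lambda theta: abs(curves[theta](wind_speed) - output_watts))


def toy_fit(theta):
    """Hypothetical stand-in for measuring and fitting a curve at pitch angle theta."""
    return lambda wind_speed: (100 - theta) * wind_speed


curves = build_pitch_curves(max_pitch=30, pitch_step=10, fit_curve_at=toy_fit)
theta = estimate_pitch(curves, wind_speed=2.0, output_watts=185.0)
```

Because the loop condition is Θ < MΘ, the maximum pitch angle itself gets no curve in this sketch, which mirrors the flowchart as written.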

The reinforcement learning device 100 may execute some of the steps in the flowcharts described above in a different order. The reinforcement learning device 100 may also omit some of the steps in the flowcharts described above.

As described above, the reinforcement learning device 100 can perform learning using validity information indicating the effectiveness of each command value for the generator 201 in each of a plurality of regions that a state value related to the generator 201 can take. The reinforcement learning device 100 can refer to an observed state value related to the generator 201 and, based on the characteristic function, generate validity information indicating the effectiveness of each command value for the generator 201 in a region obtained by combining two or more consecutive regions among the plurality of regions. The reinforcement learning device 100 can perform learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions. The reinforcement learning device 100 can thereby reduce the number of pieces of validity information to be learned and updated, and reduce the processing amount required for reinforcement learning.

The reinforcement learning device 100 can refer to an observed state value related to the generator 201 and, based on the characteristic function, generate validity information indicating the effectiveness of each command value for the generator 201 in each of the two or more regions. The reinforcement learning device 100 can perform learning using the generated validity information about each of the two or more regions and the validity information about each region, among the plurality of regions, other than the two or more regions. The reinforcement learning device 100 can thereby determine at a finer granularity which command value is preferable to output for which state value.

The reinforcement learning device 100 can perform learning using validity information indicating the effectiveness of each combination of command values for the generators 201 in each of a plurality of regions that a combination of state values of the generators 201 can take. The reinforcement learning device 100 can refer to the observed state values of the generators 201 and, based on the characteristic function, generate validity information indicating the effectiveness of each combination of command values for the generators 201 in a region obtained by combining two or more consecutive regions among the plurality of regions. The reinforcement learning device 100 can perform learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions. The reinforcement learning device 100 can thereby be applied to cases where there are a plurality of generators 201.

The reinforcement learning device 100 can generate validity information for a wind power generator. The reinforcement learning device 100 can thereby be applied to a power generation system 200 that includes a wind power generator.

The reinforcement learning device 100 can identify, based on the characteristic function, the output power corresponding to the observed wind speed, and generate validity information about the combined region based on the identified output power. The reinforcement learning device 100 can thereby generate validity information for a stall-controlled wind power generator.

The reinforcement learning device 100 can generate validity information about the combined region based on, among a plurality of characteristic functions that differ for each wind-receiving performance of the generator 201, the characteristic function corresponding to the observed wind speed and output power. The reinforcement learning device 100 can thereby generate validity information for a pitch-controlled wind power generator.

The reinforcement learning device 100 can generate validity information for a thermal power generator. The reinforcement learning device 100 can thereby be applied to a power generation system 200 that includes a thermal power generator.

The reinforcement learning device 100 can generate validity information about the combined region when the observed demand power is less than or equal to a threshold value. The reinforcement learning device 100 can thereby reduce the number of pieces of validity information to be learned and updated in cases where regions of relatively large output power need not be examined in detail. As long as the regions that are combined are those of relatively large output power, the reinforcement learning device 100 can suppress any adverse effect on the control of the power generation system.

The reinforcement learning device 100 can generate validity information about the combined region when the observed demand power exceeds a threshold value. The reinforcement learning device 100 can thereby reduce the number of pieces of validity information to be learned and updated in cases where regions of relatively small output power need not be examined in detail. As long as the regions that are combined are those of relatively small output power, the reinforcement learning device 100 can suppress any adverse effect on the control of the power generation system.

The reinforcement learning device 100 can generate validity information about a region obtained by combining two or more regions, based on the validity information about each of the two or more regions. Even when the characteristic function is unknown, the reinforcement learning device 100 can thereby reduce the number of pieces of validity information to be learned and updated, and reduce the processing amount required for reinforcement learning.
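As a hedged illustration of generating a combined region from existing entries without a characteristic function, the sketch below takes the mean effective value of each command across the regions being merged. Averaging is an assumption made for this example; this passage does not state which aggregation is used. The names `merge_regions`, `r1`, and `r2` are hypothetical.

```python
from statistics import fmean


def merge_regions(table, regions, commands):
    """Return {command: effective value} for the region that combines `regions`.

    table -- dict mapping (region, command) to an effective value
    """
    return {c: fmean(table[(r, c)] for r in regions) for c in commands}


table = {
    ("r1", "on"): 1.0, ("r1", "off"): 0.0,
    ("r2", "on"): 3.0, ("r2", "off"): -2.0,
}
merged = merge_regions(table, regions=["r1", "r2"], commands=["on", "off"])
```

After the merge, two rows of the table are replaced by one, so each later update touches fewer entries.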

The reinforcement learning method described in the present embodiment can be realized by executing a program prepared in advance on a computer such as a personal computer or a workstation. The reinforcement learning program described in the present embodiment is recorded on a computer-readable recording medium such as a hard disk, a flexible disk, a CD-ROM, an MO, or a DVD, and is executed by being read from the recording medium by the computer. The reinforcement learning program described in the present embodiment may also be distributed via a network such as the Internet.

The following supplementary notes are further disclosed with respect to the embodiment described above.

(Supplementary note 1) A reinforcement learning program that causes a computer to execute a process comprising:
performing learning using validity information indicating the effectiveness of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the effectiveness of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary note 2) The reinforcement learning program according to supplementary note 1, further causing the computer to execute a process comprising:
generating, with reference to the observed state value related to the generator and based on the characteristic function for the state value related to the generator, validity information indicating the effectiveness of each command value for the generator in each of the two or more regions; and
performing learning using the generated validity information about each of the two or more regions and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary note 3) The reinforcement learning program according to supplementary note 1 or 2, further causing the computer to execute a process comprising, when there are a plurality of the generators:
performing learning using validity information indicating the effectiveness of each combination of command values for the generators in each of a plurality of regions that a combination of state values of the generators can take;
generating, with reference to the observed state values of the generators and based on the characteristic function, validity information indicating the effectiveness of each combination of command values for the generators in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary note 4) The reinforcement learning program according to any one of supplementary notes 1 to 3, wherein the generator is a wind power generator, and
the state values related to the generator are wind speed and output power.

(Supplementary note 5) The reinforcement learning program according to supplementary note 4, wherein the characteristic function indicates a relationship between wind speed and output power from the generator, and
the process of generating the validity information about the combined region identifies, based on the characteristic function, the output power corresponding to the observed wind speed, and generates the validity information about the combined region based on the identified output power.

(Supplementary note 6) The reinforcement learning program according to supplementary note 4, wherein the generator is capable of changing its wind-receiving performance,
the command value is a command value for controlling the wind-receiving performance,
the characteristic function indicates a relationship between wind speed and output power from the generator, and
the process of generating the validity information about the combined region generates the validity information about the combined region based on, among a plurality of the characteristic functions that differ for each wind-receiving performance of the generator, the characteristic function corresponding to the observed wind speed and output power.

(Supplementary note 7) The reinforcement learning program according to any one of supplementary notes 1 to 3, wherein the generator is a thermal power generator, and
the state values related to the generator are fuel usage and output power.

(Supplementary note 8) The reinforcement learning program according to any one of supplementary notes 1 to 7, wherein the process of generating the validity information about the combined region generates the validity information about the combined region when the observed demand power is less than or equal to a threshold value.

(Supplementary note 9) The reinforcement learning program according to any one of supplementary notes 1 to 7, wherein the process of generating the validity information about the combined region generates the validity information about the combined region when the observed demand power exceeds a threshold value.

(Supplementary note 10) A reinforcement learning method in which a computer executes a process comprising:
performing learning using validity information indicating the effectiveness of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the effectiveness of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary note 11) A reinforcement learning device comprising a control unit configured to:
perform learning using validity information indicating the effectiveness of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generate, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the effectiveness of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
perform learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary note 12) A reinforcement learning program that causes a computer to execute a process comprising:
performing learning using validity information indicating the effectiveness of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, based on validity information indicating the effectiveness of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the effectiveness of each command value for the generator in a region obtained by combining the two or more regions; and
performing learning using the generated validity information about the combined region and the validity information about each region, among the plurality of regions, other than the two or more regions.

(Supplementary Note 13) A reinforcement learning method in which a computer executes a process comprising:
performing learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, based on the validity information indicating the validity of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the validity of each command value for the generator in a region obtained by combining the two or more regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

(Supplementary Note 14) A reinforcement learning device comprising a control unit configured to:
perform learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generate, based on the validity information indicating the validity of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the validity of each command value for the generator in a region obtained by combining the two or more regions; and
perform learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.
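
The variant in which the validity information for the combined region is derived from the validity information of its constituent regions can be illustrated with a small sketch. The text does not say how the per-region values are aggregated, so the per-command average used here is an assumption, and the function name `merge_q_rows` is hypothetical. The table is modeled as a list of rows (one per region), each row holding one value per command value.

```python
def merge_q_rows(q, first, last):
    """Build a coarse table in which rows first..last (inclusive) become one row."""
    block = q[first:last + 1]
    n_actions = len(q[0])
    # Assumption: average each command value's entry over the merged regions.
    merged = [sum(row[a] for row in block) / len(block) for a in range(n_actions)]
    # Rows outside the merged span are copied unchanged.
    return [row[:] for row in q[:first]] + [merged] + [row[:] for row in q[last + 1:]]


normal_table = [[1.0, 0.0], [2.0, 4.0], [4.0, 8.0], [0.5, 0.5]]
coarse_table = merge_q_rows(normal_table, 1, 2)
```

Learning would then continue on `coarse_table`, which has one row for the combined region and the original rows for every other region.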

100 Reinforcement learning device
200 Power generation system
201 Generator
210 Network
300 Bus
301 CPU
302 Memory
303 Network I/F
304 Recording medium I/F
305 Recording medium
400 Storage unit
401 Acquisition unit
402 Switching unit
403 Learning unit
404 Output unit
501 Table generation unit
502 Section switching unit
503 Value setting unit
504 Action determination unit
505 State calculation unit
506 Reward calculation unit
507 Table update unit
601, 602, 701, 702, 1000 Graph
801 Normal table
802 Coarse division table
1100, 1200 Predicted output power table
1310 to 1312, 1320 Field

Claims

A reinforcement learning program that causes a computer to execute a process comprising:
performing learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.
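
For illustration only, the idea of learning over a partition in which consecutive regions have been combined can be sketched as follows. This is a minimal sketch under stated assumptions: the state value is one-dimensional and discretized by bin edges, the validity information is a Q-table, and the learning step is the standard tabular Q-learning update. The function names `merge_edges`, `region_index`, and `q_update`, and the values of `alpha` and `gamma`, are hypothetical and do not come from the claim.

```python
from bisect import bisect_right


def merge_edges(edges, first, last):
    """Return bin edges with regions first..last (inclusive, 0-based) merged into one."""
    # Region i spans [edges[i], edges[i + 1]); merging drops the interior edges.
    return edges[:first + 1] + edges[last + 1:]


def region_index(edges, value):
    """Index of the region containing value, clamped to the valid range."""
    i = bisect_right(edges, value) - 1
    return max(0, min(i, len(edges) - 2))


def q_update(q, s, a, reward, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step on table q[state][action]."""
    target = reward + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


edges = [0, 5, 10, 15, 20, 25]          # five regions of a state value
coarse = merge_edges(edges, 1, 3)       # regions 1..3 become one region
q = [[0.0, 0.0] for _ in range(len(coarse) - 1)]
q_update(q, region_index(coarse, 12), 0, 1.0, region_index(coarse, 22))
```

With three regions instead of five, each update touches a smaller table, which is the kind of reduction in processing amount the abstract describes.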

The reinforcement learning program according to claim 1, wherein the process further comprises:
generating, with reference to the observed state value related to the generator and based on the characteristic function for the state value related to the generator, validity information indicating the validity of each command value for the generator in each of the two or more regions; and
performing learning using the generated validity information for each of the two or more regions and the validity information for each region, among the plurality of regions, other than the two or more regions.

The reinforcement learning program according to claim 1 or 2, wherein the process comprises:
when there are a plurality of generators, performing learning using validity information indicating the validity of each combination of command values for the generators in each of a plurality of regions that a combination of state values related to the generators can take;
generating, with reference to the observed state values related to the generators and based on the characteristic function, validity information indicating the validity of each combination of command values for the generators in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.
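
As a rough illustration of why combining regions matters when there are several generators, the sketch below builds a table keyed by combinations of per-generator regions and combinations of per-generator command values. The function name `joint_table`, the dictionary layout, and the example command values are assumptions made for this sketch and are not taken from the claim.

```python
from itertools import product


def joint_table(n_regions_per_generator, commands_per_generator, initial=0.0):
    """Table keyed by (region combination) -> (command combination) -> value."""
    states = list(product(*(range(n) for n in n_regions_per_generator)))
    actions = list(product(*commands_per_generator))
    return {s: {a: initial for a in actions} for s in states}


commands = [[-1, 0, 1], [-1, 0, 1]]      # hypothetical command values per generator
fine = joint_table([3, 2], commands)     # 3 x 2 region combinations
coarse = joint_table([2, 2], commands)   # after merging two regions of generator 1
```

The number of entries grows as the product of the per-generator region counts, so merging consecutive regions for even one generator shrinks the whole joint table.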

The reinforcement learning program according to any one of claims 1 to 3, wherein
the generator is a wind power generator, and
the state value related to the generator comprises wind speed and output power.

The reinforcement learning program according to claim 4, wherein
the characteristic function indicates the relationship between the wind speed and the output power of the generator, and
the process of generating the validity information for the combined region specifies, based on the characteristic function, the output power corresponding to the observed wind speed, and generates the validity information for the combined region based on the specified output power.
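
A minimal sketch of this step, under loudly stated assumptions: the characteristic function is modeled as a piecewise wind-turbine power curve with hypothetical cut-in, rated, and cut-out speeds; each command value is assumed to scale the output power; and a command is treated as more valid the closer the scaled output is to a demand value. None of these specifics appear in the claim, which only states that the output power is specified from the observed wind speed and that the validity information is generated from it.

```python
def power_curve(wind_speed, cut_in=3.0, rated_speed=12.0, cut_out=25.0,
                rated_power=2000.0):
    """Hypothetical characteristic function: wind speed (m/s) -> output power (kW)."""
    if wind_speed < cut_in or wind_speed >= cut_out:
        return 0.0
    if wind_speed >= rated_speed:
        return rated_power
    # Cubic rise between cut-in and rated speed (an illustrative approximation).
    frac = (wind_speed ** 3 - cut_in ** 3) / (rated_speed ** 3 - cut_in ** 3)
    return rated_power * frac


def seed_combined_row(wind_speed, commands, demand):
    """Initial validity values for the combined region, one per command value.

    Assumption: command c scales the output by c, and a command is more
    valid the closer the scaled output is to the demand.
    """
    power = power_curve(wind_speed)
    return [-abs(c * power - demand) for c in commands]


row = seed_combined_row(12.0, [0.5, 1.0], demand=1000.0)
best_command_index = row.index(max(row))
```

Seeding the combined region from the characteristic function gives the learner a starting point without first learning each constituent region separately.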

The reinforcement learning program according to claim 4, wherein
the generator is capable of changing its wind-receiving performance,
the command value is a command value for controlling the wind-receiving performance,
the characteristic function indicates the relationship between the wind speed and the output power of the generator, and
the process of generating the validity information for the combined region generates the validity information for the combined region based on the characteristic function that corresponds to the observed wind speed and output power, among a plurality of characteristic functions that differ for each wind-receiving performance of the generator.
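
One possible reading of choosing the characteristic function that corresponds to the observed wind speed and output power is a nearest-match selection: pick the curve whose predicted output at the observed wind speed is closest to the observed output. This selection rule, the function name `select_characteristic`, and the two linear example curves are assumptions for illustration; the claim does not specify how the correspondence is determined.

```python
def select_characteristic(curves, wind_speed, observed_power):
    """Return the key of the curve whose prediction best matches the observation."""
    return min(curves, key=lambda k: abs(curves[k](wind_speed) - observed_power))


# Hypothetical characteristic functions, one per wind-receiving performance setting.
curves = {
    "low": lambda v: 10.0 * v,
    "high": lambda v: 20.0 * v,
}
setting = select_characteristic(curves, wind_speed=5.0, observed_power=95.0)
```

The selected curve would then be the one used to generate the validity information for the combined region.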

The reinforcement learning program according to any one of claims 1 to 3, wherein
the generator is a thermal power generator, and
the state value related to the generator comprises fuel usage amount and output power.

The process of generating the validity information for the combined region generates the validity information for the combined region when the observed demand power is less than or equal to a threshold value. Reinforcement learning program described in any one of.

A reinforcement learning method in which a computer executes a process comprising:
performing learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

A reinforcement learning device comprising a control unit configured to:
perform learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generate, with reference to an observed state value related to the generator and based on a characteristic function for the state value related to the generator, validity information indicating the validity of each command value for the generator in a region obtained by combining two or more consecutive regions among the plurality of regions; and
perform learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

A reinforcement learning program that causes a computer to execute a process comprising:
performing learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, based on the validity information indicating the validity of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the validity of each command value for the generator in a region obtained by combining the two or more regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

A reinforcement learning method in which a computer executes a process comprising:
performing learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generating, based on the validity information indicating the validity of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the validity of each command value for the generator in a region obtained by combining the two or more regions; and
performing learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.

A reinforcement learning device comprising a control unit configured to:
perform learning using validity information indicating the validity of each command value for a generator in each of a plurality of regions that a state value related to the generator can take;
generate, based on the validity information indicating the validity of each command value for the generator in each of two or more consecutive regions among the plurality of regions, validity information indicating the validity of each command value for the generator in a region obtained by combining the two or more regions; and
perform learning using the generated validity information for the combined region and the validity information for each region, among the plurality of regions, other than the two or more regions.