JP2019159888A5 - Machine learning system - Google Patents
Info
- Publication number
- JP2019159888A5 JP2019159888A5 JP2018046510A JP2018046510A JP2019159888A5 JP 2019159888 A5 JP2019159888 A5 JP 2019159888A5 JP 2018046510 A JP2018046510 A JP 2018046510A JP 2018046510 A JP2018046510 A JP 2018046510A JP 2019159888 A5 JP2019159888 A5 JP 2019159888A5
- Authority
- JP
- Japan
- Prior art keywords
- evaluation
- reward
- value
- machine learning
- scale
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Claims (10)
1. A machine learning system comprising:
an agent unit that determines an action based on a current state of an environment;
an evaluation unit including a plurality of evaluation functions that each generate, based on the current state and the action, an evaluation value for a different objective of the action; and
a learning unit that trains the agent unit,
wherein the evaluation unit updates each of the plurality of evaluation functions based on a difference between the evaluation value generated by that evaluation function and a target value for that evaluation value, so that more accurate evaluation values can be generated, and
wherein the learning unit sequentially selects a gradient obtained from the update of each of the plurality of evaluation functions and sequentially updates the agent unit based on the selected gradient, so that the agent unit can determine a more appropriate action.
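Read as reinforcement learning, claim 1 describes a multi-objective actor-critic arrangement: one actor (the agent unit), one critic per objective (the evaluation functions), critics trained on the gap between their estimate and a target value, and the actor updated with each critic's gradient in turn rather than with a single blended gradient. The following is a minimal sketch of that pattern, not the patented implementation; the linear function approximators, learning rates, and class and method names are all assumptions of this example.

```python
# Minimal sketch (not the patented implementation): one linear actor ("agent unit")
# and several linear critics ("evaluation functions"), each scoring a different objective.
import numpy as np

class MultiCriticAgent:
    def __init__(self, state_dim, action_dim, n_objectives,
                 lr_actor=1e-3, lr_critic=1e-2, gamma=0.99):
        self.W_actor = np.zeros((action_dim, state_dim))                    # agent unit: a = W_actor @ s
        self.W_critics = np.zeros((n_objectives, state_dim + action_dim))   # one critic per objective
        self.lr_a, self.lr_c, self.gamma = lr_actor, lr_critic, gamma

    def act(self, state):
        return self.W_actor @ state                                         # continuous-valued action (claim 2)

    def update(self, s, a, rewards, s_next):
        """rewards: one reward per objective for the transition (s, a) -> s_next."""
        a_next = self.act(s_next)
        x, x_next = np.concatenate([s, a]), np.concatenate([s_next, a_next])
        for i, r in enumerate(rewards):
            # 1) Critic update: move this evaluation value toward its own target value.
            q = self.W_critics[i] @ x
            target = r + self.gamma * (self.W_critics[i] @ x_next)
            self.W_critics[i] += self.lr_c * (target - q) * x
            # 2) Sequentially take this critic's gradient w.r.t. the action and
            #    immediately update the actor with it, one objective at a time.
            dq_da = self.W_critics[i][len(s):]                              # dQ_i/da for a linear critic
            self.W_actor += self.lr_a * np.outer(dq_da, s)                  # chain rule: da/dW_actor = s
```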
2. The machine learning system according to claim 1, wherein the agent unit determines an action represented by a continuous value.
3. The machine learning system according to claim 1 or 2, wherein each evaluation value is a value based on a reward from the environment, the system further comprising a reward adjustment unit that adjusts a scale of the reward of each of the plurality of evaluation functions according to a preset criterion.
4. The machine learning system according to claim 3, wherein the reward adjustment unit scales the reward of each of the plurality of evaluation functions so that the reward scale of a higher-priority evaluation function is smaller than the reward scale of a lower-priority evaluation function.
5. The machine learning system according to claim 3, wherein the reward adjustment unit converts the reward scale of each of the plurality of evaluation functions into a common scale.
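Claims 3 through 5 add a reward adjustment unit that rescales each objective's reward before it reaches its evaluation function: either so that higher-priority objectives receive smaller reward scales (claim 4) or so that all objectives are mapped onto a common scale (claim 5). Below is a small sketch of both readings; the priority encoding, the running normalizer, and the function names are assumptions of this example rather than details from the patent.

```python
# Illustrative reward adjustment step (claims 3-5); the preset criterion,
# priority ordering, and normalization scheme are assumed, not the patent's.
import numpy as np

def scale_by_priority(rewards, priorities):
    """Claim 4 style: higher priority -> smaller reward scale.
    priorities[i] = 1 is the highest priority; the scale shrinks as priority rises."""
    priorities = np.asarray(priorities, dtype=float)
    scales = priorities / priorities.max()        # priority 1 gets the smallest scale
    return np.asarray(rewards) * scales

def to_common_scale(rewards, running_abs_max, eps=1e-8):
    """Claim 5 style: map every objective's reward onto a common scale by dividing
    by a running estimate of that objective's reward magnitude."""
    running_abs_max = np.maximum(running_abs_max, np.abs(rewards))
    return np.asarray(rewards) / (running_abs_max + eps), running_abs_max

# Example: three objectives, objective 0 having the highest priority.
# scale_by_priority([1.0, 1.0, 1.0], priorities=[1, 2, 3]) -> [0.333, 0.667, 1.0]
```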
6. A method for training a machine learning system in a computer system including a memory and a processor that operates according to a program stored in the memory, wherein the machine learning system includes:
an agent program that determines an action based on a current state of an environment; and
an evaluation program including a plurality of evaluation functions that each generate, based on the current state and the action, an evaluation value for a different objective of the action,
the method comprising, by the processor:
updating each of the plurality of evaluation functions based on a difference between the evaluation value generated by that evaluation function and a target value for that evaluation value, so that the evaluation program can generate more accurate evaluation values; and
sequentially selecting a gradient obtained from the update of each of the plurality of evaluation functions and sequentially updating the agent program based on the selected gradient, so that the agent program can determine a more appropriate action.
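Claim 6 restates the system of claim 1 as processor-executed steps. Reusing the MultiCriticAgent and scale_by_priority sketches above, and assuming a hypothetical Gym-style environment whose step() returns one reward per objective, the method reduces to a loop like this:

```python
# Hypothetical training loop for the method of claim 6; `env` is assumed to follow
# a Gym-like reset()/step() interface and to return a vector of per-objective rewards.
def train(env, agent, episodes=100, priorities=(1, 2, 3)):
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = agent.act(s)                                   # agent program picks an action
            s_next, rewards, done, _ = env.step(a)             # one reward per objective
            rewards = scale_by_priority(rewards, priorities)   # optional adjustment (claims 8-9)
            agent.update(s, a, rewards, s_next)                # critic updates, then sequential actor updates
            s = s_next
```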
7. The method according to claim 6, wherein the agent program determines an action represented by a continuous value.
8. The method according to claim 6 or 7, wherein each evaluation value is a value based on a reward from the environment, the method further comprising the processor adjusting a scale of the reward of each of the plurality of evaluation functions according to a preset criterion.
9. The method according to claim 6 or 7, wherein each evaluation value is a value based on a reward from the environment, the method further comprising the processor scaling the reward of each of the plurality of evaluation functions so that the reward scale of a higher-priority evaluation function is smaller than the reward scale of a lower-priority evaluation function.
10. The method according to claim 6 or 7, wherein each evaluation value is a value based on a reward from the environment, the method further comprising the processor converting the reward scale of each of the plurality of evaluation functions into a common scale.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2018046510A JP6902487B2 (en) | 2018-03-14 | 2018-03-14 | Machine learning system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2018046510A JP6902487B2 (en) | 2018-03-14 | 2018-03-14 | Machine learning system |
Publications (3)
Publication Number | Publication Date |
---|---|
JP2019159888A (en) | 2019-09-19 |
JP2019159888A5 (en) | 2020-04-09 |
JP6902487B2 (en) | 2021-07-14 |
Family
ID=67996270
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2018046510A (granted as JP6902487B2, active) | 2018-03-14 | 2018-03-14 | Machine learning system |
Country Status (1)
Country | Link |
---|---|
JP (1) | JP6902487B2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110328668B (en) * | 2019-07-27 | 2022-03-22 | 南京理工大学 | Mechanical arm path planning method based on speed smooth deterministic strategy gradient |
US20230082326A1 (en) * | 2020-02-07 | 2023-03-16 | Deepmind Technologies Limited | Training multi-objective neural network reinforcement learning systems |
CN112853560B (en) * | 2020-12-31 | 2021-11-23 | 盐城师范学院 | Global process sharing control system and method based on ring spinning yarn quality |
CN112953844B (en) * | 2021-03-02 | 2023-04-28 | 中国农业银行股份有限公司 | Network traffic optimization method and device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3086206B2 (en) * | 1998-07-17 | 2000-09-11 | 科学技術振興事業団 | Agent learning device |
JP5330138B2 (en) * | 2008-11-04 | 2013-10-30 | 本田技研工業株式会社 | Reinforcement learning system |
2018
- 2018-03-14: Application JP2018046510A filed in Japan; granted as JP6902487B2 (active)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP2019159888A5 (en) | | |
JP2019503258A5 (en) | | |
WO2017091629A1 (en) | | Reinforcement learning using confidence scores |
JP2021503662A5 (en) | | |
JP2021501421A5 (en) | | |
JPWO2016152053A1 (en) | | Accuracy estimation model generation system and accuracy estimation system |
JP2016523402A5 (en) | | |
JP2018526733A5 (en) | | |
US11762679B2 (en) | | Information processing device, information processing method, and non-transitory computer-readable storage medium |
JP2019512126A5 (en) | | |
JP2011100382A5 (en) | | |
EP3462315A3 (en) | | Systems and methods for service mapping |
GB2603064A (en) | | Improved machine learning for technical systems |
JP2019530849A5 (en) | | |
JP2017037392A (en) | | Neural network learning device |
GB2579789A (en) | | Runtime parameter selection in simulations |
JP2010287131A (en) | | System and method for controlling learning |
JP2015109891A5 (en) | | |
JP2019139295A5 (en) | | Information processing method, information processing device, and program |
KR20210028107A (en) | | Systems and methods for training a neural network to control an aircraft |
Pant et al. | | Application of a multi-objective particle swarm optimization technique to solve reliability optimization problem |
JP2020513613A5 (en) | | |
KR20170023098A (en) | | Controlling a target system |
CN110631221A (en) | | Control method and device of air conditioner, terminal and storage medium |
JP2020112967A5 (en) | | |