JPWO2021047842A5 - - Google Patents
- Publication number
- JPWO2021047842A5 (application JP2022515598A)
- Authority
- JP
- Japan
- Prior art keywords
- features
- state
- subsets
- obtaining
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Machine-extracted concepts
- 238000000034 method Methods 0.000 claims 18
- 230000006399 behavior Effects 0.000 claims 9
- 230000002787 reinforcement Effects 0.000 claims 7
- 238000011156 evaluation Methods 0.000 claims 4
- 238000004590 computer program Methods 0.000 claims 2
- 230000006870 function Effects 0.000 claims 2
Claims (16)
A method, executed by at least one processor, for generating a training data set, the method comprising:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the respective feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value for each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
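The per-subset policies and their aggregation into suggested actions can be sketched as follows. This is a hypothetical toy illustration of the claimed method, not the patented implementation: the feature names, the hand-written `make_policy` rule, and the example state are all assumptions, and the claims determine each policy via an MDP rather than a fixed rule.

```python
from itertools import combinations

# Toy stand-in for a learned policy: it acts only on the projection of
# the state onto one feature subset and returns a suggested action.
def make_policy(subset):
    def policy(state):
        # illustrative rule: suggest the best-valued feature in the subset
        return max(subset, key=lambda f: state[f])
    return policy

features = ["f0", "f1", "f2", "f3"]
subsets = list(combinations(features, 2))     # plurality of feature subsets
policies = [make_policy(s) for s in subsets]  # one policy per subset

# a state holds an evaluation value for each feature of the full set
state = {"f0": 0.1, "f1": 0.9, "f2": 0.4, "f3": 0.3}

# applying every policy yields a plurality of suggested actions,
# each based on a different projection of the state
suggested = [p(state) for p in policies]
print(suggested)  # ['f1', 'f2', 'f3', 'f1', 'f1', 'f2']
```

The (state, scored-actions) pairs produced this way would then form the training data for the reinforcement learning model.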
4. The method of any one of claims 1 to 3, wherein said determining the one or more actions and their corresponding scores for the state comprises determining a frequency of an action based on the plurality of suggested actions, wherein the corresponding score of the action is determined based on the frequency.
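The frequency-based scoring of claim 4 amounts to counting how often each action is suggested across the policies. A minimal sketch, assuming normalized frequencies as scores (the example actions are invented for illustration):

```python
from collections import Counter

# suggested actions collected from the plurality of policies (illustrative)
suggested_actions = ["left", "left", "right", "left", "stay"]

# frequency of each action among the suggestions
freq = Counter(suggested_actions)

# score each action by its relative frequency in [0, 1]
scores = {action: n / len(suggested_actions) for action, n in freq.items()}
print(scores)  # {'left': 0.6, 'right': 0.2, 'stay': 0.2}
```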
7. The method of any one of claims 1 to 6, further comprising: obtaining a new state; and applying the reinforcement learning model to determine an action for the new state.
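At inference time (claim 7), the trained model is simply applied to an unseen state. A hypothetical sketch using a toy Q-table as a stand-in for the trained reinforcement learning model; the states, actions, and values below are assumptions, not the patent's model:

```python
# Toy learned action values: (state, action) -> value
q_table = {
    ("low", "left"): 0.2,
    ("low", "right"): 0.7,
    ("high", "left"): 0.9,
    ("high", "right"): 0.1,
}

def act(state):
    # greedy action selection over the actions known for this state
    actions = {a for s, a in q_table if s == state}
    return max(actions, key=lambda a: q_table[(state, a)])

new_state = "low"
print(act(new_state))  # right
```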
A computing device having a processor and associated memory, the processor adapted to perform:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the respective feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value for each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/568,284 US11574244B2 (en) | 2019-09-12 | 2019-09-12 | States simulator for reinforcement learning models |
US16/568,284 | 2019-09-12 | ||
PCT/EP2020/072487 WO2021047842A1 (en) | 2019-09-12 | 2020-08-11 | States simulator for reinforcement learning models |
Publications (3)
Publication Number | Publication Date |
---|---|
JP2022547529A JP2022547529A (en) | 2022-11-14 |
JPWO2021047842A5 true JPWO2021047842A5 (en) | 2022-12-13 |
JP7438336B2 JP7438336B2 (en) | 2024-02-26 |
Family
ID=72050874
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
JP2022515598A Active JP7438336B2 (en) | 2019-09-12 | 2020-08-11 | State simulator for reinforcement learning models |
Country Status (5)
Country | Link |
---|---|
US (1) | US11574244B2 (en) |
EP (1) | EP4028959A1 (en) |
JP (1) | JP7438336B2 (en) |
CN (1) | CN114365157A (en) |
WO (1) | WO2021047842A1 (en) |
Family Cites Families (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8918866B2 (en) * | 2009-06-29 | 2014-12-23 | International Business Machines Corporation | Adaptive rule loading and session control for securing network delivered services |
JP2013242761A (en) | 2012-05-22 | 2013-12-05 | Internatl Business Mach Corp <Ibm> | Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment |
US9128739B1 (en) * | 2012-12-31 | 2015-09-08 | Emc Corporation | Determining instances to maintain on at least one cloud responsive to an evaluation of performance characteristics |
US10540598B2 (en) | 2015-09-09 | 2020-01-21 | International Business Machines Corporation | Interpolation of transition probability values in Markov decision processes |
EP3360086A1 (en) | 2015-11-12 | 2018-08-15 | Deepmind Technologies Limited | Training neural networks using a prioritized experience memory |
US10839302B2 (en) | 2015-11-24 | 2020-11-17 | The Research Foundation For The State University Of New York | Approximate value iteration with complex returns by bounding |
US20180342004A1 (en) * | 2017-05-25 | 2018-11-29 | Microsoft Technology Licensing, Llc | Cumulative success-based recommendations for repeat users |
WO2020005240A1 (en) * | 2018-06-27 | 2020-01-02 | Google Llc | Adapting a sequence model for use in predicting future device interactions with a computing system |
US10963313B2 (en) * | 2018-08-27 | 2021-03-30 | Vmware, Inc. | Automated reinforcement-learning-based application manager that learns and improves a reward function |
US11468322B2 (en) * | 2018-12-04 | 2022-10-11 | Rutgers, The State University Of New Jersey | Method for selecting and presenting examples to explain decisions of algorithms |
US11527082B2 (en) * | 2019-06-17 | 2022-12-13 | Google Llc | Vehicle occupant engagement using three-dimensional eye gaze vectors |
- 2019-09-12: US application US16/568,284 granted as US11574244B2, status Active
- 2020-08-11: JP application JP2022515598A granted as JP7438336B2, status Active
- 2020-08-11: EP application EP20754737.3A published as EP4028959A1, status Withdrawn
- 2020-08-11: CN application CN202080063367.1A published as CN114365157A, status Pending
- 2020-08-11: WO application PCT/EP2020/072487 published as WO2021047842A1, status unknown
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Madumal et al. | Explainable reinforcement learning through a causal lens | |
JP6382354B2 (en) | Neural network and neural network training method | |
Sequeira et al. | Discovering social interaction strategies for robots from restricted-perception Wizard-of-Oz studies | |
RU2015155633A (en) | SYSTEMS AND METHODS FOR CREATING AND IMPLEMENTING AN AGENT OR SYSTEM WITH ARTIFICIAL INTELLIGENCE | |
JP2023040035A5 (en) | ||
GB2607738A (en) | Data augmented training of reinforcement learning software agent | |
CN109062944A (en) | New word consolidation method based on voice search and electronic equipment | |
Salem et al. | Driving in TORCS using modular fuzzy controllers | |
CN116051320A (en) | Multitasking attention knowledge tracking method and system for online learning platform | |
Cichosz et al. | Imitation learning of car driving skills with decision trees and random forests | |
Jantke et al. | Next Generation Learner Modeling by Theory of Mind Model Induction. | |
JP2019164753A5 (en) | ||
Maniktala et al. | Extending the hint factory: Towards modelling productivity for open-ended problem-solving | |
JPWO2021047842A5 (en) | ||
JP2023164741A5 (en) | ||
US20220410015A1 (en) | Game analysis platform with ai-based detection of game bots and cheating software | |
Mitchell et al. | Evaluating state representations for reinforcement learning of turn-taking policies in tutorial dialogue | |
CN110807179B (en) | User identification method, device, server and storage medium | |
Hu | Planning with a model: Alphazero | |
Dobrev | The IQ of artificial intelligence | |
Gomes et al. | Gimme: Group interactions manager for multiplayer serious games | |
Carlsson et al. | Alphazero to alpha hero: A pre-study on additional tree sampling within self-play reinforcement learning | |
Molineaux et al. | Defeating novel opponents in a real-time strategy game | |
JP7476984B2 (en) | BEHAVIOR PREDICTION METHOD, BEHAVIOR PREDICTION DEVICE, AND PROGRAM | |
Chen et al. | A hybrid ensemble method based on double disturbance for classifying microarray data |