JPWO2021047842A5 - Google Patents


Info

Publication number
JPWO2021047842A5
JPWO2021047842A5 (application JP2022515598A)
Authority
JP
Japan
Prior art keywords
features
state
subsets
obtaining
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2022515598A
Other languages
Japanese (ja)
Other versions
JP2022547529A (en)
JP7438336B2 (en)
Publication date
Priority claimed from US16/568,284 (US11574244B2)
Application filed
Publication of JP2022547529A
Publication of JPWO2021047842A5
Application granted
Publication of JP7438336B2
Legal status: Active (current)
Anticipated expiration


Claims (16)

1. A method, executed by at least one processor, for generating a training data set, the method comprising:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value of each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
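
A minimal Python sketch of how the method of claim 1 might look in practice. The names State, Policy, project, and suggest_actions are illustrative assumptions, not identifiers from the application; each policy is assumed to have been obtained beforehand, for example by solving an MDP restricted to its feature subset.

    from typing import Callable, Dict, List, Sequence

    State = Dict[str, float]            # evaluation value of every feature in the full set
    Action = str
    Policy = Callable[[State], Action]  # maps a projected state to a suggested action

    def project(state: State, subset: Sequence[str]) -> State:
        # Projection of the full state onto one feature subset (claim 1).
        return {name: state[name] for name in subset}

    def suggest_actions(state: State,
                        subsets: List[Sequence[str]],
                        policies: List[Policy]) -> List[Action]:
        # Apply each per-subset policy to its own projection of the state,
        # yielding one suggested action per policy.
        return [policy(project(state, subset))
                for subset, policy in zip(subsets, policies)]

A training example for the reinforcement learning model would then pair the state with the actions and scores derived from these suggestions.
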
2. The method of claim 1, wherein said obtaining the state comprises generating the state by generating evaluation values of at least a portion of the set of features.
3. The method of claim 1 or 2, wherein said obtaining the state and said determining for the state are performed a plurality of times for different states in a training data set, and wherein said training is performed using the training data set.
4. The method of any one of claims 1 to 3, wherein said determining, for the state, the one or more actions and their corresponding scores comprises determining a frequency of an action based on the plurality of suggested actions, wherein the corresponding score of the action is determined based on the frequency.
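
One possible reading of the frequency-based scoring of claim 4, sketched in Python; the normalization to [0, 1] is an assumption, since the claim only requires the score to be based on the frequency.

    from collections import Counter
    from typing import Dict, List

    def score_actions(suggested: List[str]) -> Dict[str, float]:
        # Count how often each action was suggested by the per-subset policies
        # and turn the counts into normalized frequency scores.
        counts = Counter(suggested)
        total = sum(counts.values())
        return {action: count / total for action, count in counts.items()}

    # Example: score_actions(["up", "up", "left"]) == {"up": 2/3, "left": 1/3}
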
5. The method of claim 4, wherein said training is performed using a portion of the one or more actions, each of the portion of the one or more actions having a corresponding frequency greater than a threshold.
6. The method of any one of claims 1 to 5, wherein the reinforcement learning model is a deep reinforcement learning model.
7. The method of any one of claims 1 to 6, further comprising: obtaining a new state; and applying the reinforcement learning model to determine an action for the new state.
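
Claims 5 and 7 could be illustrated as follows; the threshold value and the rl_model.predict interface are assumptions made only for this sketch.

    from typing import Dict

    def filter_by_threshold(scores: Dict[str, float],
                            threshold: float = 0.25) -> Dict[str, float]:
        # Keep only the actions whose frequency-based score exceeds the
        # threshold, so training uses a portion of the actions (claim 5).
        return {action: score for action, score in scores.items() if score > threshold}

    # After training (claim 7), only the reinforcement learning model is consulted
    # for a new state, e.g.:
    #     action = rl_model.predict(new_state)
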
8. The method of claim 7, wherein said applying the reinforcement learning model is performed without reference to the plurality of policies.
9. The method of any one of claims 1 to 8, wherein said determining the plurality of subsets of features comprises randomly determining the subsets of features.
10. The method of any one of claims 1 to 9, wherein a union of the plurality of subsets of features comprises all the features of the set of features.
11. The method of any one of claims 1 to 10, wherein a cardinality of the set of features is greater than 200, and a cardinality of each subset of the plurality of subsets of features is greater than 10 and less than 100.
12. The method of any one of claims 1 to 11, wherein the reinforcement learning model is configured to provide recommended actions for states representing information about a user.
13. The method of any one of claims 1 to 12, wherein the MDP is a constrained MDP (CMDP).
14. A computer readable medium storing a computer program for causing at least one processor to perform the steps of the method of any one of claims 1 to 13.
15. A computer program for causing at least one processor to perform the steps of the method of any one of claims 1 to 13.
16. A computing device having a processor and a memory coupled thereto, wherein the processor is adapted to perform:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value of each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
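
A sketch of how the feature subsets of claims 9 to 11 could be drawn; the bounds 11 and 99 instantiate the "greater than 10 and less than 100" cardinality requirement, and the retry loop is a naive way, assumed here, to satisfy the union requirement of claim 10.

    import random
    from typing import List

    def draw_feature_subsets(features: List[str],
                             num_subsets: int,
                             min_size: int = 11,
                             max_size: int = 99) -> List[List[str]]:
        # Randomly draw feature subsets (claim 9) with cardinalities inside the
        # bounds of claim 11, retrying until their union covers the whole
        # feature set (claim 10). Naive: may retry many times if num_subsets or
        # the subset sizes are small relative to len(features).
        while True:
            subsets = [random.sample(features, random.randint(min_size, max_size))
                       for _ in range(num_subsets)]
            if set().union(*subsets) == set(features):
                return subsets
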
JP2022515598A 2019-09-12 2020-08-11 State simulator for reinforcement learning models Active JP7438336B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16/568,284 US11574244B2 (en) 2019-09-12 2019-09-12 States simulator for reinforcement learning models
US16/568,284 2019-09-12
PCT/EP2020/072487 WO2021047842A1 (en) 2019-09-12 2020-08-11 States simulator for reinforcement learning models

Publications (3)

Publication Number Publication Date
JP2022547529A (en) 2022-11-14
JPWO2021047842A5 (en) 2022-12-13
JP7438336B2 (en) 2024-02-26

Family

ID=72050874

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2022515598A Active JP7438336B2 (en) 2019-09-12 2020-08-11 State simulator for reinforcement learning models

Country Status (5)

Country Link
US (1) US11574244B2 (en)
EP (1) EP4028959A1 (en)
JP (1) JP7438336B2 (en)
CN (1) CN114365157A (en)
WO (1) WO2021047842A1 (en)

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8918866B2 (en) * 2009-06-29 2014-12-23 International Business Machines Corporation Adaptive rule loading and session control for securing network delivered services
JP2013242761A (en) 2012-05-22 2013-12-05 Internatl Business Mach Corp <Ibm> Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
US9128739B1 (en) * 2012-12-31 2015-09-08 Emc Corporation Determining instances to maintain on at least one cloud responsive to an evaluation of performance characteristics
US10540598B2 (en) 2015-09-09 2020-01-21 International Business Machines Corporation Interpolation of transition probability values in Markov decision processes
EP3360086A1 (en) 2015-11-12 2018-08-15 Deepmind Technologies Limited Training neural networks using a prioritized experience memory
US10839302B2 (en) 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
US20180342004A1 (en) * 2017-05-25 2018-11-29 Microsoft Technology Licensing, Llc Cumulative success-based recommendations for repeat users
WO2020005240A1 (en) * 2018-06-27 2020-01-02 Google Llc Adapting a sequence model for use in predicting future device interactions with a computing system
US10963313B2 (en) * 2018-08-27 2021-03-30 Vmware, Inc. Automated reinforcement-learning-based application manager that learns and improves a reward function
US11468322B2 (en) * 2018-12-04 2022-10-11 Rutgers, The State University Of New Jersey Method for selecting and presenting examples to explain decisions of algorithms
US11527082B2 (en) * 2019-06-17 2022-12-13 Google Llc Vehicle occupant engagement using three-dimensional eye gaze vectors

Similar Documents

Publication Publication Date Title
Madumal et al. Explainable reinforcement learning through a causal lens
JP6382354B2 (en) Neural network and neural network training method
Sequeira et al. Discovering social interaction strategies for robots from restricted-perception Wizard-of-Oz studies
RU2015155633A (en) SYSTEMS AND METHODS FOR CREATING AND IMPLEMENTING AN AGENT OR SYSTEM WITH ARTIFICIAL INTELLIGENCE
JP2023040035A5 (en)
GB2607738A (en) Data augmented training of reinforcement learning software agent
CN109062944A (en) New word consolidation method based on voice search and electronic equipment
Salem et al. Driving in TORCS using modular fuzzy controllers
CN116051320A (en) Multitasking attention knowledge tracking method and system for online learning platform
Cichosz et al. Imitation learning of car driving skills with decision trees and random forests
Jantke et al. Next Generation Learner Modeling by Theory of Mind Model Induction.
JP2019164753A5 (en)
Maniktala et al. Extending the hint factory: Towards modelling productivity for open-ended problem-solving
JPWO2021047842A5 (en)
JP2023164741A5 (en)
US20220410015A1 (en) Game analysis platform with ai-based detection of game bots and cheating software
Mitchell et al. Evaluating state representations for reinforcement learning of turn-taking policies in tutorial dialogue
CN110807179B (en) User identification method, device, server and storage medium
Hu Planning with a model: Alphazero
Dobrev The IQ of artificial intelligence
Gomes et al. Gimme: Group interactions manager for multiplayer serious games
Carlsson et al. Alphazero to alpha hero: A pre-study on additional tree sampling within self-play reinforcement learning
Molineaux et al. Defeating novel opponents in a real-time strategy game
JP7476984B2 (en) BEHAVIOR PREDICTION METHOD, BEHAVIOR PREDICTION DEVICE, AND PROGRAM
Chen et al. A hybrid ensemble method based on double disturbance for classifying microarray data