JPWO2021047842A5 - Google Patents


Info

Publication number
JPWO2021047842A5
JPWO2021047842A5 (application JP2022515598A)
Authority
JP
Japan
Prior art keywords
features
state
subsets
obtaining
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2022515598A
Other languages
Japanese (ja)
Other versions
JP2022547529A (en)
JP7438336B2 (en)
Publication date
Priority claimed from US16/568,284 (US11574244B2)
Application filed
Publication of JP2022547529A
Publication of JPWO2021047842A5
Application granted
Publication of JP7438336B2
Legal status: Active (current)
Anticipated expiration


Claims (16)

1. A method, executed by at least one processor, for generating a training data set, the method comprising:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value of each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
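
A minimal Python sketch of how the method of claim 1 might look in practice. The names State, Policy, project, and suggest_actions are illustrative assumptions, not identifiers from the application; each policy is assumed to have been obtained beforehand, for example by solving an MDP restricted to its feature subset.

    from typing import Callable, Dict, List, Sequence

    State = Dict[str, float]            # evaluation value of every feature in the full set
    Action = str
    Policy = Callable[[State], Action]  # maps a projected state to a suggested action

    def project(state: State, subset: Sequence[str]) -> State:
        # Projection of the full state onto one feature subset (claim 1).
        return {name: state[name] for name in subset}

    def suggest_actions(state: State,
                        subsets: List[Sequence[str]],
                        policies: List[Policy]) -> List[Action]:
        # Apply each per-subset policy to its own projection of the state,
        # yielding one suggested action per policy.
        return [policy(project(state, subset))
                for subset, policy in zip(subsets, policies)]

A training example for the reinforcement learning model would then pair the state with the actions and scores derived from these suggestions.
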
2. The method of claim 1, wherein said obtaining the state comprises generating the state by generating evaluation values of at least a portion of the set of features.
3. The method of claim 1 or 2, wherein said obtaining the state and said determining for the state are performed a plurality of times for different states in a training data set, and wherein said training is performed using the training data set.
4. The method of any one of claims 1 to 3, wherein said determining, for the state, the one or more actions and their corresponding scores comprises determining a frequency of an action based on the plurality of suggested actions, wherein the corresponding score of the action is determined based on the frequency.
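
One possible reading of the frequency-based scoring of claim 4, sketched in Python; the normalization to [0, 1] is an assumption, since the claim only requires the score to be based on the frequency.

    from collections import Counter
    from typing import Dict, List

    def score_actions(suggested: List[str]) -> Dict[str, float]:
        # Count how often each action was suggested by the per-subset policies
        # and turn the counts into normalized frequency scores.
        counts = Counter(suggested)
        total = sum(counts.values())
        return {action: count / total for action, count in counts.items()}

    # Example: score_actions(["up", "up", "left"]) == {"up": 2/3, "left": 1/3}
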
5. The method of claim 4, wherein said training is performed using a portion of the one or more actions, each of the portion of the one or more actions having a corresponding frequency greater than a threshold.
6. The method of any one of claims 1 to 5, wherein the reinforcement learning model is a deep reinforcement learning model.
7. The method of any one of claims 1 to 6, further comprising: obtaining a new state; and applying the reinforcement learning model to determine an action for the new state.
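
Claims 5 and 7 could be illustrated as follows; the threshold value and the rl_model.predict interface are assumptions made only for this sketch.

    from typing import Dict

    def filter_by_threshold(scores: Dict[str, float],
                            threshold: float = 0.25) -> Dict[str, float]:
        # Keep only the actions whose frequency-based score exceeds the
        # threshold, so training uses a portion of the actions (claim 5).
        return {action: score for action, score in scores.items() if score > threshold}

    # After training (claim 7), only the reinforcement learning model is consulted
    # for a new state, e.g.:
    #     action = rl_model.predict(new_state)
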
8. The method of claim 7, wherein said applying the reinforcement learning model is performed without reference to the plurality of policies.
9. The method of any one of claims 1 to 8, wherein said determining the plurality of subsets of features comprises randomly determining the subsets of features.
10. The method of any one of claims 1 to 9, wherein a union of the plurality of subsets of features comprises all the features of the set of features.
11. The method of any one of claims 1 to 10, wherein a cardinality of the set of features is greater than 200, and a cardinality of each subset of the plurality of subsets of features is greater than 10 and less than 100.
12. The method of any one of claims 1 to 11, wherein the reinforcement learning model is configured to provide recommended actions for states representing information about a user.
13. The method of any one of claims 1 to 12, wherein the MDP is a constrained MDP (CMDP).
14. A computer readable medium storing a computer program for causing at least one processor to perform the steps of the method of any one of claims 1 to 13.
15. A computer program for causing at least one processor to perform the steps of the method of any one of claims 1 to 13.
16. A computing device having a processor and a memory coupled thereto, wherein the processor is adapted to perform:
determining a plurality of subsets of features, each being a subset of a set of features, thereby obtaining a plurality of different subsets of the set of features;
determining a policy for each feature subset of the plurality of feature subsets, thereby obtaining a plurality of policies, wherein each policy is a function that defines an action based on evaluation values of the feature subset, and wherein the policy is determined using a Markov decision process (MDP);
obtaining a state, the state comprising an evaluation value of each feature of the set of features;
applying the plurality of policies to the state, thereby obtaining a plurality of suggested actions for the state based on different projections of the state onto the different subsets of features;
determining, for the state, one or more actions and their corresponding scores based on the plurality of suggested actions; and
training a reinforcement learning model using the state and the one or more actions and their corresponding scores.
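
A sketch of how the feature subsets of claims 9 to 11 could be drawn; the bounds 11 and 99 instantiate the "greater than 10 and less than 100" cardinality requirement, and the retry loop is a naive way, assumed here, to satisfy the union requirement of claim 10.

    import random
    from typing import List

    def draw_feature_subsets(features: List[str],
                             num_subsets: int,
                             min_size: int = 11,
                             max_size: int = 99) -> List[List[str]]:
        # Randomly draw feature subsets (claim 9) with cardinalities inside the
        # bounds of claim 11, retrying until their union covers the whole
        # feature set (claim 10). Naive: may retry many times if num_subsets or
        # the subset sizes are small relative to len(features).
        while True:
            subsets = [random.sample(features, random.randint(min_size, max_size))
                       for _ in range(num_subsets)]
            if set().union(*subsets) == set(features):
                return subsets
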
JP2022515598A 2019-09-12 2020-08-11 State simulator for reinforcement learning models Active JP7438336B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US16/568,284 US11574244B2 (en) 2019-09-12 2019-09-12 States simulator for reinforcement learning models
US16/568,284 2019-09-12
PCT/EP2020/072487 WO2021047842A1 (en) 2019-09-12 2020-08-11 States simulator for reinforcement learning models

Publications (3)

Publication Number Publication Date
JP2022547529A (en) 2022-11-14
JPWO2021047842A5 (en) 2022-12-13
JP7438336B2 (en) 2024-02-26

Family

ID=72050874

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2022515598A Active JP7438336B2 (en) 2019-09-12 2020-08-11 State simulator for reinforcement learning models

Country Status (5)

Country Link
US (1) US11574244B2 (en)
EP (1) EP4028959A1 (en)
JP (1) JP7438336B2 (en)
CN (1) CN114365157A (en)
WO (1) WO2021047842A1 (en)

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8918866B2 (en) * 2009-06-29 2014-12-23 International Business Machines Corporation Adaptive rule loading and session control for securing network delivered services
JP2013242761A (en) 2012-05-22 2013-12-05 Internatl Business Mach Corp <Ibm> Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
US9128739B1 (en) * 2012-12-31 2015-09-08 Emc Corporation Determining instances to maintain on at least one cloud responsive to an evaluation of performance characteristics
US10540598B2 (en) 2015-09-09 2020-01-21 International Business Machines Corporation Interpolation of transition probability values in Markov decision processes
EP3360086A1 (en) 2015-11-12 2018-08-15 Deepmind Technologies Limited Training neural networks using a prioritized experience memory
US10839302B2 (en) 2015-11-24 2020-11-17 The Research Foundation For The State University Of New York Approximate value iteration with complex returns by bounding
US20180342004A1 (en) * 2017-05-25 2018-11-29 Microsoft Technology Licensing, Llc Cumulative success-based recommendations for repeat users
WO2020005240A1 (en) * 2018-06-27 2020-01-02 Google Llc Adapting a sequence model for use in predicting future device interactions with a computing system
US10963313B2 (en) * 2018-08-27 2021-03-30 Vmware, Inc. Automated reinforcement-learning-based application manager that learns and improves a reward function
US11468322B2 (en) * 2018-12-04 2022-10-11 Rutgers, The State University Of New Jersey Method for selecting and presenting examples to explain decisions of algorithms
US11527082B2 (en) * 2019-06-17 2022-12-13 Google Llc Vehicle occupant engagement using three-dimensional eye gaze vectors

Similar Documents

Publication Publication Date Title
Madumal et al. Explainable reinforcement learning through a causal lens
JP6382354B2 (en) Neural network and neural network training method
Sequeira et al. Discovering social interaction strategies for robots from restricted-perception Wizard-of-Oz studies
RU2015155633A (en) SYSTEMS AND METHODS FOR CREATING AND IMPLEMENTING AN AGENT OR SYSTEM WITH ARTIFICIAL INTELLIGENCE
JP2023040035A5 (en)
GB2607738A (en) Data augmented training of reinforcement learning software agent
CN109062944A (en) New word consolidation method based on voice search and electronic equipment
Salem et al. Driving in TORCS using modular fuzzy controllers
CN116051320A (en) Multitasking attention knowledge tracking method and system for online learning platform
Cichosz et al. Imitation learning of car driving skills with decision trees and random forests
Jantke et al. Next Generation Learner Modeling by Theory of Mind Model Induction.
JP2019164753A5 (en)
Maniktala et al. Extending the hint factory: Towards modelling productivity for open-ended problem-solving
JPWO2021047842A5 (en)
JP2023164741A5 (en)
US20220410015A1 (en) Game analysis platform with ai-based detection of game bots and cheating software
Mitchell et al. Evaluating state representations for reinforcement learning of turn-taking policies in tutorial dialogue
CN110807179B (en) User identification method, device, server and storage medium
Hu Planning with a model: Alphazero
Dobrev The IQ of artificial intelligence
Gomes et al. Gimme: Group interactions manager for multiplayer serious games
Carlsson et al. Alphazero to alpha hero: A pre-study on additional tree sampling within self-play reinforcement learning
Molineaux et al. Defeating novel opponents in a real-time strategy game
JP7476984B2 (en) BEHAVIOR PREDICTION METHOD, BEHAVIOR PREDICTION DEVICE, AND PROGRAM
Chen et al. A hybrid ensemble method based on double disturbance for classifying microarray data