WO2021229626A1 - 学習装置、学習方法および学習プログラム - Google Patents
学習装置、学習方法および学習プログラム Download PDFInfo
- Publication number
- WO2021229626A1 WO2021229626A1 PCT/JP2020/018768 JP2020018768W WO2021229626A1 WO 2021229626 A1 WO2021229626 A1 WO 2021229626A1 JP 2020018768 W JP2020018768 W JP 2020018768W WO 2021229626 A1 WO2021229626 A1 WO 2021229626A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- target
- learning
- decision
- objective function
- output
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
Definitions
- the present invention relates to a learning device, a learning method, and a learning program that perform learning that reflects the intention of the user.
- Reverse reinforcement learning is known as one of the methods to simplify the formulation.
- Inverse reinforcement learning is a learning method that estimates an objective function (reward function) that evaluates behavior for each state based on the history of decision-making made by experts.
- the reward function of an expert is estimated by updating the reward function so that the decision-making history is closer to that of the expert.
- Non-Patent Document 1 describes maximum entropy reverse reinforcement learning, which is one of reverse reinforcement learning.
- Karatada one reward function R (s, a, s') ⁇ ⁇ f (s , A, s') are estimated. By using this estimated ⁇ , the decision-making of a skilled person can be reproduced.
- Non-Patent Document 2 and Non-Patent Document 3 describe a learning method using ranked data.
- an object of the present invention is to provide a learning device, a learning method, and a learning program capable of learning an objective function that reflects a user's intention.
- the learning device is an optimization result for a first object using one or a plurality of objective functions generated in advance by inverse reinforcement learning based on decision-making history data indicating a change record of the object.
- Target output means for outputting multiple targets
- selection reception means for receiving selection instructions from the user for multiple output second targets, and intention to change from the first target to the received second target. It is characterized by having a data output means for outputting as decision history data and a learning means for learning an objective function using decision history data.
- the learning method according to the present invention is an optimization result for a first object using one or a plurality of objective functions generated in advance by inverse reinforcement learning based on decision-making history data showing a change record of the object. Multiple targets are output, selection instructions from the user for the output multiple second targets are received, and the change record from the first target to the received second target is output as decision-making history data, and the intention is made. It is characterized by learning an objective function using decision history data.
- the learning program according to the present invention is an optimization result for a first object using one or a plurality of objective functions generated in advance by reverse reinforcement learning based on decision-making history data indicating a change record of the object on a computer.
- Target output processing that outputs multiple second targets
- selection acceptance processing that accepts selection instructions from the user for multiple output second targets
- the change record from the first target to the accepted second target It is characterized in that a data output process for outputting as decision-making history data and a learning process for learning an objective function using the decision-making history data are executed.
- FIG. 1 is a block diagram showing a configuration example of the first embodiment of the learning device according to the present invention.
- the learning device of the present embodiment is a learning device that performs reverse reinforcement learning based on decision-making history data indicating a change record of a target to be changed (hereinafter, may be simply referred to as a target).
- the following explanation targets diagrams such as trains and aircraft (hereinafter referred to as operation schedules), and exemplifies the change results for operation schedules as decision-making history data.
- the target assumed in the present embodiment is not limited to the operation schedule, and may be, for example, order information of a store, control information of various devices provided in a vehicle, or the like.
- the learning device 100 of the present embodiment includes a storage unit 10, an input unit 20, a first output unit 30, a change instruction receiving unit 40, a second output unit 50, a data output unit 60, and a learning unit 70. It is equipped with.
- the storage unit 10 stores parameters, various information, and the like used for processing by the learning device 100 of the present embodiment. Further, the storage unit 10 of the present embodiment stores the objective function generated in advance by the inverse reinforcement learning based on the decision-making history data indicating the change record of the target. Further, the storage unit 10 may store the decision-making history data itself.
- the input unit 20 accepts the input of the target to be changed (that is, the target). For example, when the operation timetable is targeted, the input unit 20 accepts the input of the operation timetable to be changed.
- the input unit 20 may acquire an object stored in the storage unit 10, for example, in response to an instruction from a user or the like.
- the first output unit 30 outputs an optimization result (hereinafter referred to as a second target) using the above objective function for the change target (hereinafter referred to as the first target) received by the input unit 20. do.
- the first output unit 30 may also output the objective function used for the optimization process.
- FIG. 2 is an explanatory diagram showing an example of a process in which the first output unit 30 changes the target.
- the object exemplified in FIG. 2 is an operation timetable, and it is shown that the operation timetable D1 to be changed has been changed to the operation timetable D2 as a result of the optimization process by the first output unit 30.
- the changed part is shown by a dotted line.
- the change instruction receiving unit 40 outputs the second target.
- the change instruction receiving unit 40 may display, for example, a second object on a display device (not shown). Then, the change instruction receiving unit 40 receives the change instruction regarding the output second target from the user.
- the user who gives the change instruction is, for example, an expert in the target field.
- the content of the change instruction is arbitrary as long as it is the information necessary to change the second target.
- the change instruction will be described.
- three types of change instructions will be described.
- the first aspect is a direct change instruction to the output second object.
- the change instruction according to the first aspect may be, for example, a change in an operation time or a change in an operation flight.
- the second aspect is a change instruction for the objective function used when changing the first object.
- the change instruction according to the second aspect is an instruction to change the weight of the explanatory variable included in the objective function.
- the weight of each explanatory variable indicates the degree to which the explanatory variable is emphasized. Therefore, it can be said that the instruction for changing the weight of the explanatory variable included in the objective variable is an instruction for modifying the viewpoint of changing the target.
- the change instruction receiving unit 40 may accept the designation of the value of the explanatory variable to be changed, or may accept the designation of the degree of change (for example, magnification) with respect to the current explanatory variable.
- the third aspect is also a change instruction for the objective function used when changing the first object.
- the change instruction according to the third aspect is an instruction to add an explanatory variable to the objective function. It can be said that the addition of the explanatory variable is an instruction to add the feature amount that was not initially assumed as an element to be considered.
- the selection and creation of feature quantities (explanatory variables) are performed by the user (operator) in advance.
- the feature amount vector before the change is ⁇ 0 (x).
- x represents the state of the target when the optimization is performed, and each feature amount can be regarded as an optimum index that changes depending on the state x.
- the newly added feature vector is ⁇ 1 (x).
- ⁇ (x) ⁇ ( ⁇ 0 (x), ⁇ 1 (x)) and ⁇ ⁇ ( ⁇ 0 , ⁇ 1 ) are defined.
- the second output unit 50 outputs the target as a result of further changing the second target (hereinafter referred to as the third target) based on the change instruction regarding the second target received from the user. That is, the second output unit 50 outputs the result according to the received change instruction.
- the second output unit 50 outputs the target itself of the result based on the received change instruction as the third target.
- the second output unit 50 outputs the third object as a result of changing the second object by the optimization using the changed objective function.
- the second output unit 50 outputs the third object as a result of changing the second object by the optimization using the changed objective function.
- the data output unit 60 outputs the change record from the second target to the third target as decision history data. Specifically, the data output unit 60 may output the decision-making history data in a manner that can be used for learning the objective function. Further, the data output unit 60 may store the decision-making history data in the storage unit 10, for example. In the following description, the data output by the data output unit 60 may be referred to as re-learning data.
- the learning unit 70 learns the objective function using the output decision-making history data. Specifically, the learning unit 70 relearns the objective function used when changing the first object by using the output decision-making history data.
- the learning unit 70 Since there is no change in the type of the explanatory variable (feature amount) included in the objective variable in the change instruction according to the first aspect and the change instruction according to the second aspect, the learning unit 70 performs the change instruction with respect to the existing objective function. You can relearn in the same way as you did.
- the learning unit 70 relearns the objective function including the added explanatory variable.
- the objective function before the change that is, the objective function before adding a new feature quantity
- the objective function before adding a new feature quantity is assumed to be close to the true objective function because it was once operated using the objective function. ..
- the method of initial estimation is not limited to the above method.
- the input unit 20, the first output unit 30, the change instruction receiving unit 40, the second output unit 50, the data output unit 60, and the learning unit 70 are computer processors (learning programs) that operate according to a program (learning program). For example, it is realized by a CPU (Central Processing Unit) and a GPU (Graphics Processing Unit).
- CPU Central Processing Unit
- GPU Graphics Processing Unit
- the program is stored in the storage unit 10, the processor reads the program, and according to the program, the input unit 20, the first output unit 30, the change instruction receiving unit 40, the second output unit 50, the data output unit 60, and the like. It may operate as a learning unit 70. Further, each function of the input unit 20, the first output unit 30, the change instruction receiving unit 40, the second output unit 50, the data output unit 60, and the learning unit 70 may be provided in the SaaS (Software as a Service) format. ..
- SaaS Software as a Service
- the input unit 20, the first output unit 30, the change instruction receiving unit 40, the second output unit 50, the data output unit 60, and the learning unit 70 are each realized by dedicated hardware. You may. Further, a part or all of each component of each device may be realized by a general-purpose or dedicated circuit (circuitry), a processor, or a combination thereof. These may be composed of a single chip or may be composed of a plurality of chips connected via a bus. A part or all of each component of each device may be realized by the combination of the circuit or the like and the program described above.
- the components of the input unit 20, the first output unit 30, the change instruction receiving unit 40, the second output unit 50, the data output unit 60, and the learning unit 70 are a plurality of information processing devices, circuits, and the like.
- a plurality of information processing devices, circuits, and the like may be centrally arranged or distributedly arranged.
- the information processing device, the circuit, and the like may be realized as a form in which each is connected via a communication network, such as a client-server system and a cloud computing system.
- the first output unit 30 outputs the target to be changed
- the change instruction receiving unit 40 receives the change instruction for the output target
- the second output unit 50 outputs the changed target based on the change instruction, and data.
- the output unit 60 outputs the change result as the decision-making history data
- new decision-making history data re-learning data
- the device 110 including the first output unit 30, the change instruction receiving unit 40, the second output unit 50, and the data output unit 60 can be called a data generation device.
- the first output unit 30, the change instruction receiving unit 40, the second output unit 50, and the data output unit 60 may be realized by a computer processor that operates according to a program (data generation program).
- FIG. 3 is a flowchart showing an operation example of the learning device 100 of the present embodiment.
- the input unit 20 receives the input to be changed (step S11).
- the first output unit 30 outputs a second object, which is an optimization result for the first object using the objective function (step S12).
- the change instruction receiving unit 40 receives a change instruction regarding the second target (step S13).
- the second output unit 50 outputs the third target based on the change instruction regarding the second target received from the user (step S14).
- the data output unit 60 outputs the change record from the second target to the third target as decision history data (step S15).
- the learning unit 70 learns the objective function using the output decision-making history data (step S16).
- the first output unit 30 outputs the second object, which is the result of optimization for the first object using the objective function
- the second output unit 50 receives from the user.
- the data output unit 60 outputs the change record from the second target to the third target as decision-making history data
- the learning unit 70 learns the objective function using the output decision-making history data. .. Therefore, it is possible to learn an objective function that reflects the intention of the user.
- Embodiment 2 Next, a second embodiment of the learning device of the present invention will be described.
- the learning device of the second embodiment is also a learning device that performs reverse reinforcement learning based on the decision-making history data indicating the change record of the object to be changed.
- FIG. 4 is a block diagram showing a configuration example of a second embodiment of the learning device according to the present invention.
- the learning device 200 of the present embodiment includes a storage unit 11, an input unit 21, a target output unit 31, a selection reception unit 41, a data output unit 61, and a learning unit 71.
- the storage unit 11 stores parameters, various information, and the like used for processing by the learning device 200 of the present embodiment. Further, the storage unit 11 of the present embodiment stores a plurality of objective functions generated in advance by reverse reinforcement learning based on the decision-making history data indicating the change record of the target. Further, the storage unit 11 may store the decision-making history data itself.
- the input unit 21 accepts the input of the object to be changed (that is, the first object). Similar to the first embodiment, for example, when the operation timetable is targeted, the input unit 21 accepts the input of the operation timetable to be changed.
- the input unit 21 may acquire an object stored in the storage unit 11, for example, in response to an instruction from a user or the like.
- the input unit 21 may acquire the decision-making history data from the storage unit 11 and input the decision-making history data to the target output unit 31.
- the input unit 21 may acquire the decision-making history data from the external device via the communication line.
- the target output unit 31 outputs a plurality of optimization results (second target) for the first target using one or a plurality of objective functions stored in the storage unit 11. That is, the target output unit 31 outputs a plurality of second targets indicating the target as a result of changing the first target by optimization using one or a plurality of objective functions.
- the method of selecting the objective function used by the target output unit 31 for optimization is arbitrary. However, it is preferable that the target output unit 31 preferentially selects an objective function that more reflects the user's intention indicated by the decision history data.
- ⁇ (x) is a feature quantity (that is, an optimization index) constituting the objective function
- x is a state or one candidate solution.
- the target output unit 31 may calculate the likelihood L (D
- FIG. 5 is an explanatory diagram showing an example of decision-making history data.
- the decision-making history data exemplified in FIG. 5 is the history data of the train operation schedule, and is an example of the data in which the plan and the actual result at each station of each train are associated with each other.
- the target output unit 31 may calculate the likelihood L (D
- is the number of decision history data, X y, under the scheduled timetable y, which is a space that can be taken of possible modifications diamond x.
- the mode of the objective function used in this embodiment is arbitrary.
- ⁇ corresponds to the hyperparameters of the neural network. In either case, ⁇ can be said to be a value that reflects the user's intention indicated by the decision-making history data.
- the target output unit 31 selects a predetermined number (for example, two) of objective functions having a larger likelihood L (D
- the second target which is a modification of the first target, may be output respectively.
- the number of objective functions to be selected is not limited to two, and may be three or more.
- the target output unit 31 randomly selects an objective function and outputs the second target so that the second target to be output does not have similar contents (that is, so that the contents are rich in variety). You may. Further, since ⁇ estimated by reverse reinforcement learning is a value that maximizes the likelihood L (D
- ⁇ ) / ⁇ 0 (maximum condition: Of the ⁇ having a ⁇ derivative of 0), the upper N ⁇ (that is, the objective function) having a high likelihood D may be selected.
- the object output portion 31, the first learning decision history data D prev were used when, or the likelihood calculated using the decision history data D a plus relearning data to D prev You may.
- the re-learning data added here includes the data output by the data output unit 61 described later, as well as the decision-making history data output by the data output unit 60 in the first embodiment. May be.
- the target output unit 31 may exclude the objective function whose calculated likelihood value is equal to or less than a certain threshold value from the selection target. By doing so, it is possible to reduce the cost of searching for a misplaced ⁇ due to the small amount of data for re-learning, so that re-learning can be performed efficiently.
- the selection reception unit 41 receives selection instructions from the user for the plurality of output second targets.
- the user who gives the selection instruction is, for example, a skilled person in the target field.
- the selection reception unit 41 receives a selection instruction by the user from the plurality of changed operation timetables.
- FIG. 6 is an explanatory diagram showing an example of a process of receiving a selection instruction from a user for a second target.
- the selection reception unit 41 receives the selection instruction of the B plan from the user. Show that.
- the data output unit 61 outputs the change record from the first target before the change to the second target accepted by the selection reception unit 41 as decision history data.
- the data output unit 61 may output the decision-making history data in a manner that can be used for learning the objective function, as in the first embodiment.
- the data output unit 61 may store the decision-making history data in the storage unit 11, for example. Further, as in the first embodiment, the data output by the data output unit 61 may be referred to as re-learning data.
- the learning unit 71 learns (re-learns) one or a plurality of candidate objective functions using the output decision-making history data.
- the learning unit 71 selects a solution having a higher likelihood than a predetermined threshold value from among the optimum solutions (optimization results) under each candidate objective function, and the decision-making history including the selected solution. Data may be added and re-learning may be performed. Further, the learning unit 71 may relearn some objective functions or may relearn all objective functions. For example, when re-learning a part of the objective functions, the learning unit 71 may relearn only the objective functions that satisfy a predetermined criterion (for example, ⁇ whose likelihood exceeds the threshold value). Further, the learning unit 71 may learn the objective function in the same manner as in the normal inverse reinforcement learning after the re-learning data is sufficiently accumulated.
- a predetermined criterion for example, ⁇ whose likelihood exceeds the threshold value
- the data output by the target output unit 31 (that is, the data presented to the user) is all the data output by using the objective function deviating from the true objective function. Be done. However, more preferable data (best data) is selected by the user, and data for re-learning is added. Therefore, the estimation accuracy will be gradually improved, and the data generated by the objective function closer to the true will be selected for the next timing. By repeating this, the ratio of the data generated by the objective function close to the true objective function increases, and finally, the generated re-learning data enables highly accurate intention learning. ..
- the learning unit 71 may learn the objective function by using the data ranked in the order of proximity to the data generated from the true objective function.
- the learning unit 71 may use, for example, the method described in Non-Patent Document 2 or the method described in Non-Patent Document 3 as a learning method using the ranked data.
- the input unit 21, the target output unit 31, the selection reception unit 41, the data output unit 61, and the learning unit 71 are realized by a computer processor that operates according to a program (learning program). Similar to the first embodiment, for example, the program is stored in the storage unit 11, the processor reads the program, and according to the program, the input unit 21, the target output unit 31, the selection reception unit 41, the data output unit 61, and so on. It may operate as a learning unit 71.
- the target output unit 31 outputs the target to be changed
- the selection reception unit 41 receives the selection instruction for the target
- the data output unit 61 outputs the change result as the decision history data, thereby making a new decision.
- Historical data data for re-learning
- the device 210 including the target output unit 31, the selection reception unit 41, and the data output unit 61 can be called a data generation device.
- FIG. 7 is a flowchart showing an operation example of the learning device 200 of the present embodiment.
- the target output unit 31 outputs a plurality of second targets, which are the optimization results of the first target using one or a plurality of objective functions (step S21).
- the selection receiving unit 41 receives a selection instruction from the user for the plurality of output second targets (step S22).
- the data output unit 61 outputs the change record from the first target to the received second target as decision-making history data (step S23).
- the learning unit 71 learns the objective function using the output decision-making history data (step S24).
- the target output unit 31 outputs a plurality of second targets which are the optimization results of the first target using one or a plurality of objective functions
- the selection reception unit 41 outputs a plurality of second targets. , Accepts selection instructions from the user for the output multiple second targets.
- the data output unit 61 outputs the change record from the first target to the received second target as decision-making history data
- the learning unit 71 uses the output decision-making history data to perform the objective function. To learn. Even with such a configuration, it is possible to learn an objective function that reflects the intention of the user.
- FIG. 8 is a block diagram showing a modified example of the learning device of the second embodiment.
- the learning device 300 of this modification includes a storage unit 11, an input unit 21, a target output unit 31, a selection reception unit 41, a change instruction reception unit 40, a second output unit 50, and a data output unit 60.
- the learning unit 71 is provided. That is, the learning device 200 of this modification is compared with the learning device 300 of the second embodiment, and instead of the data output unit 61, the change instruction receiving unit 40, the second output unit 50, and the second output unit 50 of the first embodiment are used. It differs in that it includes a data output unit 60. Other configurations are the same as in the second embodiment.
- the change instruction receiving unit 40 receives a change instruction regarding the selected second target from the user.
- the content of the change instruction is the same as that of the first embodiment.
- the second output unit 50 outputs the third target based on the change instruction regarding the second target received from the user, as in the first embodiment, and the data output unit 60 outputs the third target from the second target.
- the change record to the third target is output as decision history data.
- the second output unit 50 is the third based on the change instruction regarding the second object received by the change instruction receiving unit 40 from the user. Output the target. Then, the data output unit 60 outputs the change record from the second target to the third target as decision-making history data. Even with such a configuration, it is possible to learn an objective function that reflects the intention of the user.
- FIG. 9 is a block diagram showing an outline of the learning device according to the present invention.
- the learning device 90 (for example, the learning device 200) according to the present invention is one or one pre-generated by reverse reinforcement learning based on the decision-making history data indicating the change record of the target (that is, the target of change, for example, the operation timetable).
- a target output means 91 (for example, a target output unit 31) that outputs a plurality of second targets that are optimization results for the first target using a plurality of objective functions, and a user for the plurality of output second targets.
- a selection receiving means 92 (for example, a selection receiving unit 41) that receives a selection instruction from the above, and a data output means 93 (for example, a data output means 93) that outputs the change record from the first target to the received second target as decision-making history data.
- a data output unit 61 and a learning means 94 (for example, a learning unit 71) for learning an objective function using decision-making history data.
- the target output means 91 is derived from a plurality of objective functions based on the likelihood (for example, the likelihood L (D
- One or more objective functions may be selected and a second object may be output by optimization using the selected objective function.
- the target output means 91 may exclude an objective function having a lower likelihood than a predetermined threshold value from the target to be optimized. With such a configuration, it becomes possible for the user to make an efficient selection.
- the target output means 91 may select a predetermined higher-order objective function having a high likelihood among the objective functions in which the derivative of the parameter becomes 0. With such a configuration, it becomes possible to prevent the data presented to the user from being biased.
- the target output means 91 may further calculate the likelihood by using the decision-making history data output by the data output means 93, and select the objective function based on the calculated likelihood. Since the decision-making history data selected by the user in this way is data that more reflects the user's intention, it becomes possible to learn the objective function that more reflects the user's intention.
- the learning means 94 selects a solution having a higher likelihood than a predetermined threshold value from the output optimization results, adds decision-making history data including the selected solution, and performs re-learning. May be good.
- the learning device 90 (for example, the learning device 300) is the result of further changing the second target based on the change instruction regarding the second target received from the user (for example, the change instruction receiving unit 40).
- a change target output means (for example, a second output unit 50) that outputs a third target indicating the target may be provided.
- the data output means (for example, the data output unit 60) may output the change record from the second target to the third target as decision-making history data.
- a learning device including a data output means for outputting as decision history data and a learning means for learning the objective function using the decision history data.
- the target output means selects one or more objective functions from a plurality of objective functions based on the likelihood indicating the likelihood of the objective function estimated from the data used for learning the objective function.
- the learning device according to Appendix 1 that outputs a second object by optimization using the selected objective function.
- the target output means is the learning device according to the appendix 2 that excludes an objective function having a likelihood lower than a predetermined threshold value from the target to be optimized.
- Appendix 4 The learning device according to Appendix 2 or Appendix 3, wherein the target output means selects a predetermined higher-order objective function having a high likelihood among the objective functions whose parameter differentiation becomes 0.
- the target output means further uses the decision-making history data output by the data output means to calculate the likelihood, and selects the objective function based on the calculated likelihood.
- the learning device according to any one.
- the learning means selects a solution having a higher likelihood than a predetermined threshold value from the output optimization results, adds decision-making history data including the selected solution, and performs re-learning.
- the learning device according to any one of Supplementary Note 1 to Supplementary Note 5.
- a data output means provided with a change target output means for outputting a third target indicating a target as a result of further changing the second target based on a change instruction regarding the second target received from the user.
- Appendix 9 One or more objective functions are selected from a plurality of objective functions based on the likelihood indicating the plausibility of the objective function estimated from the data used for learning the objective function, and the selected objective function is selected.
- the learning method according to Appendix 8 which outputs a second target by optimization using.
- the second which is the optimization result for the first object using one or more objective functions generated in advance by the inverse reinforcement learning based on the decision-making history data showing the change record of the object on the computer.
- Target output processing that outputs a plurality of targets
- selection acceptance processing that accepts selection instructions from the user for the plurality of output second targets
- the change record from the first target to the received second target is stored.
- a program storage medium for storing a data output process for outputting as decision-making history data and a learning program for executing a learning process for learning the objective function using the decision-making history data.
- Appendix 11 One or more objective functions from a plurality of objective functions based on the likelihood indicating the plausibility of the objective function estimated from the data used for learning the objective function in the target output processing to the computer. 10.
- the program storage medium according to Appendix 10 for storing a learning program for outputting a second object by optimizing using the selected objective function.
- the second which is the optimization result for the first object using one or more objective functions generated in advance by the inverse reinforcement learning based on the decision-making history data showing the change record of the object on the computer.
- Target output processing that outputs a plurality of targets
- selection acceptance processing that accepts selection instructions from the user for the plurality of output second targets
- the change record from the first target to the received second target is the change record from the first target to the received second target.
- a learning program for executing a data output process for outputting as decision-making history data and a learning process for learning the objective function using the decision-making history data.
- Appendix 13 One or more objective functions from a plurality of objective functions based on the likelihood indicating the plausibility of the objective function estimated from the data used for learning the objective function in the target output processing to the computer.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Machine Translation (AREA)
Priority Applications (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US17/922,485 US20230186099A1 (en) | 2020-05-11 | 2020-05-11 | Learning device, learning method, and learning program |
| PCT/JP2020/018768 WO2021229626A1 (ja) | 2020-05-11 | 2020-05-11 | 学習装置、学習方法および学習プログラム |
| JP2022522087A JP7464115B2 (ja) | 2020-05-11 | 2020-05-11 | 学習装置、学習方法および学習プログラム |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2020/018768 WO2021229626A1 (ja) | 2020-05-11 | 2020-05-11 | 学習装置、学習方法および学習プログラム |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021229626A1 true WO2021229626A1 (ja) | 2021-11-18 |
Family
ID=78525423
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2020/018768 Ceased WO2021229626A1 (ja) | 2020-05-11 | 2020-05-11 | 学習装置、学習方法および学習プログラム |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US20230186099A1 (https=) |
| JP (1) | JP7464115B2 (https=) |
| WO (1) | WO2021229626A1 (https=) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023175910A1 (ja) * | 2022-03-18 | 2023-09-21 | 日本電気株式会社 | 意思決定支援システム、意思決定支援方法および記録媒体 |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP7420236B2 (ja) * | 2020-05-11 | 2024-01-23 | 日本電気株式会社 | 学習装置、学習方法および学習プログラム |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2019508817A (ja) * | 2016-03-15 | 2019-03-28 | 学校法人沖縄科学技術大学院大学学園 | 密度比推定による直接逆強化学習 |
| CN109978012A (zh) * | 2019-03-05 | 2019-07-05 | 北京工业大学 | 一种基于结合反馈的改进贝叶斯逆强化学习方法 |
| JP2019185201A (ja) * | 2018-04-04 | 2019-10-24 | ギリア株式会社 | 強化学習システム |
Family Cites Families (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10394408B1 (en) * | 2013-06-27 | 2019-08-27 | Google Llc | Recommending media based on received signals indicating user interest in a plurality of recommended media items |
| GB2557674B (en) * | 2016-12-15 | 2021-04-21 | Samsung Electronics Co Ltd | Automated Computer Power Management System, Apparatus and Methods |
| CN109902706B (zh) * | 2018-11-09 | 2023-08-22 | 华为技术有限公司 | 推荐方法及装置 |
| US11068285B2 (en) * | 2019-09-19 | 2021-07-20 | Adobe Inc. | Machine-learning models applied to interaction data for determining interaction goals and facilitating experience-based modifications to interface elements in online environments |
| US12397438B2 (en) * | 2020-06-24 | 2025-08-26 | Honda Motor Co., Ltd. | Behavior control device, behavior control method, and program |
-
2020
- 2020-05-11 JP JP2022522087A patent/JP7464115B2/ja active Active
- 2020-05-11 WO PCT/JP2020/018768 patent/WO2021229626A1/ja not_active Ceased
- 2020-05-11 US US17/922,485 patent/US20230186099A1/en active Pending
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2019508817A (ja) * | 2016-03-15 | 2019-03-28 | 学校法人沖縄科学技術大学院大学学園 | 密度比推定による直接逆強化学習 |
| JP2019185201A (ja) * | 2018-04-04 | 2019-10-24 | ギリア株式会社 | 強化学習システム |
| CN109978012A (zh) * | 2019-03-05 | 2019-07-05 | 北京工业大学 | 一种基于结合反馈的改进贝叶斯逆强化学习方法 |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2023175910A1 (ja) * | 2022-03-18 | 2023-09-21 | 日本電気株式会社 | 意思決定支援システム、意思決定支援方法および記録媒体 |
| JPWO2023175910A1 (https=) * | 2022-03-18 | 2023-09-21 | ||
| JP7816490B2 (ja) | 2022-03-18 | 2026-02-18 | 日本電気株式会社 | 意思決定支援システム、意思決定支援方法および意思決定支援プログラム |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2021229626A1 (https=) | 2021-11-18 |
| JP7464115B2 (ja) | 2024-04-09 |
| US20230186099A1 (en) | 2023-06-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Zhang et al. | A deep reinforcement learning based hyper-heuristic for combinatorial optimisation with uncertainties | |
| AU2013364041B2 (en) | Instance weighted learning machine learning model | |
| CN113287124A (zh) | 用于搭乘订单派遣的系统和方法 | |
| CN109753751A (zh) | 一种基于机器学习的mec随机任务迁移方法 | |
| CN119067273A (zh) | 用于物流路线规划的方法及系统 | |
| Si et al. | An efficient and adaptive design of reinforcement learning environment to solve job shop scheduling problem with soft actor-critic algorithm | |
| CN119558385B (zh) | 一种具有高泛化能力的分层联邦学习方法 | |
| CN118760100A (zh) | 一种大规模模糊柔性作业车间调度方法及相关设备 | |
| Xu et al. | Empty container repositioning problem using a reinforcement learning framework with multi-weight adaptive reward function | |
| Dalal et al. | Optimizing cloud service provider selection with firefly-guided fuzzy decision support system for smart cities | |
| WO2021229626A1 (ja) | 学習装置、学習方法および学習プログラム | |
| Workneh et al. | Learning to schedule (L2S): Adaptive job shop scheduling using double deep Q network | |
| Li et al. | A reinforcement learning framework for efficient task allocation among AGVs in smart warehouse | |
| Stanković et al. | Hybrid machine learning approach for parallel machine scheduling under uncertainty | |
| WO2024068571A1 (en) | Supply chain optimization with reinforcement learning | |
| JP7420236B2 (ja) | 学習装置、学習方法および学習プログラム | |
| CN114298870A (zh) | 一种路径规划方法、装置、电子设备及计算机可读介质 | |
| Zhang et al. | Home health care routing problem via off-line learning and neural network | |
| WO2021016989A1 (en) | Hierarchical coarse-coded spatiotemporal embedding for value function evaluation in online multidriver order dispatching | |
| Beier et al. | Towards supervised learning of optimal replenishment policies | |
| WO2020248213A1 (en) | Regularized spatiotemporal dispatching value estimation | |
| CN118606713A (zh) | 风控模型训练方法、装置、电子设备及存储介质 | |
| CN119856187A (zh) | 具有本地预测器的联合学习的方法和系统 | |
| CN113869944A (zh) | 基于机器学习的营收预测方法、设备及可读存储介质 | |
| US20250371487A1 (en) | Method and System for Supply Chain Data Optimization and Visualization |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20935148 Country of ref document: EP Kind code of ref document: A1 |
|
| ENP | Entry into the national phase |
Ref document number: 2022522087 Country of ref document: JP Kind code of ref document: A |
|
| NENP | Non-entry into the national phase |
Ref country code: DE |
|
| 122 | Ep: pct application non-entry in european phase |
Ref document number: 20935148 Country of ref document: EP Kind code of ref document: A1 |