JP2019159888A5

Info

Publication number: JP2019159888A5
Application number: JP2018046510A
Authority: JP
Filing date: 2018-03-14
Publication date: 2020-04-09
Anticipated expiration: 2038-03-14

Claims

A machine learning system comprising:
an agent unit that determines an action based on the current state of an environment;
an evaluation unit including a plurality of evaluation functions that respectively generate evaluation values for different purposes of the action based on the current state and the action; and
a training unit for training the agent unit, wherein
the evaluation unit updates each of the plurality of evaluation functions based on the difference between the evaluation value generated by that evaluation function and a target value of that evaluation value, so that a more accurate evaluation value can be generated, and
the training unit sequentially selects the gradients obtained from the update of each of the plurality of evaluation functions and sequentially updates the agent unit based on the selected gradients, so that the agent unit can determine a more appropriate action.
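
The update scheme recited above can be illustrated with a minimal sketch. It is not the implementation disclosed in this publication. The linear models, the scalar state and action, the learning rates, and every name (`LinearCritic`, `LinearAgent`, `train_step`) are assumptions made only for illustration, and the target values are simply passed in by the caller rather than derived from rewards.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LinearCritic:
    """Hypothetical evaluation function: Q(s, a) = w_s * s + w_a * a."""
    w_s: float = 0.0
    w_a: float = 0.0

    def value(self, s: float, a: float) -> float:
        return self.w_s * s + self.w_a * a

    def update(self, s: float, a: float, target: float, lr: float) -> float:
        # Move the estimate toward the target in proportion to the
        # difference between the target value and the current value.
        diff = target - self.value(s, a)
        self.w_s += lr * diff * s
        self.w_a += lr * diff * a
        return diff

    def grad_wrt_action(self) -> float:
        # dQ/da for the linear model above.
        return self.w_a


@dataclass
class LinearAgent:
    """Hypothetical agent with a continuous action: a = theta * s."""
    theta: float = 0.0

    def act(self, s: float) -> float:
        return self.theta * s

    def update(self, s: float, dq_da: float, lr: float) -> None:
        # Chain rule: dQ/dtheta = dQ/da * da/dtheta = dQ/da * s.
        self.theta += lr * dq_da * s


def train_step(agent: LinearAgent, critics: List[LinearCritic], s: float,
               targets: List[float], critic_lr: float = 0.1,
               agent_lr: float = 0.01) -> None:
    a = agent.act(s)
    # 1) Update every evaluation function from its own value/target difference.
    for critic, target in zip(critics, targets):
        critic.update(s, a, target, critic_lr)
    # 2) Take the gradient from each evaluation function in turn and apply
    #    it to the agent one after another, rather than summing them first.
    for critic in critics:
        agent.update(s, critic.grad_wrt_action(), agent_lr)
```

A call such as `train_step(agent, [LinearCritic(), LinearCritic()], s=1.0, targets=[1.0, -1.0])` performs one round of evaluation-function updates followed by two sequential agent updates, one per evaluation function.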

The machine learning system according to claim 1, wherein
the agent unit determines an action represented by a continuous value.

The machine learning system according to claim 1 or 2, wherein
the evaluation value is a value based on a reward from the environment, and
the machine learning system further comprises a reward adjustment unit that adjusts the reward scale of each of the plurality of evaluation functions according to a preset criterion.

The machine learning system according to claim 3, wherein
the reward adjustment unit scales the reward of each of the plurality of evaluation functions so that the reward scale of a higher-priority evaluation function is smaller than the reward scale of a lower-priority evaluation function.
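
One way to read the priority-based scaling above is sketched below. The claim does not say how the scale factors are chosen, so the geometric rule (`base ** rank`), the convention that a lower number means a higher priority, and the function names are all assumptions.

```python
from typing import List


def priority_scales(priorities: List[int], base: float = 2.0) -> List[float]:
    """Return one reward scale per evaluation function.

    Assumption: a lower number in `priorities` means a higher priority.
    The highest-priority function gets the smallest scale (base ** 0),
    and each lower priority level gets a scale `base` times larger.
    """
    order = sorted(range(len(priorities)), key=lambda i: priorities[i])
    scales = [0.0] * len(priorities)
    for rank, idx in enumerate(order):
        scales[idx] = base ** rank
    return scales


def scale_rewards(rewards: List[float], scales: List[float]) -> List[float]:
    """Multiply each evaluation function's reward by its scale."""
    return [r * s for r, s in zip(rewards, scales)]
```

For example, `priority_scales([2, 1, 3])` assigns the smallest scale to the second function, which has the highest priority under the stated convention.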

The machine learning system according to claim 3, wherein
the reward adjustment unit converts the reward scale of each of the plurality of evaluation functions into a common scale.
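
Conversion to a common scale could, for instance, be done by min-max normalization. The sketch below assumes that the reward range (`low`, `high`) of each evaluation function is known in advance. Neither that assumption nor the function name `to_common_scale` comes from the claims.

```python
def to_common_scale(reward: float, low: float, high: float) -> float:
    """Map a reward from its own range [low, high] onto the range [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (reward - low) / (high - low)
```

With this mapping, a reward of 5 on a 0 to 10 range and a reward of 0 on a -2 to 2 range both become 0.5, so rewards for different purposes become directly comparable.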

A method for training a machine learning system in a computer system including a memory and a processor that operates according to a program stored in the memory, wherein
the machine learning system includes:
an agent program that determines an action based on the current state of an environment; and
an evaluation program including a plurality of evaluation functions that respectively generate evaluation values for different purposes of the action based on the current state and the action,
the method comprising, by the processor:
updating each of the plurality of evaluation functions based on the difference between the evaluation value generated by that evaluation function and a target value of that evaluation value, so that the evaluation program can generate a more accurate evaluation value; and
sequentially selecting the gradients obtained from the update of each of the plurality of evaluation functions and sequentially updating the agent program based on the selected gradients, so that the agent program can determine a more appropriate action.

The method according to claim 6, wherein
the agent program determines an action represented by a continuous value.

The method according to claim 6 or 7, wherein
the evaluation value is a value based on a reward from the environment, and
the method further comprises the processor adjusting the reward scale of each of the plurality of evaluation functions according to a preset criterion.

The method according to claim 6 or 7, wherein
the evaluation value is a value based on a reward from the environment, and
the method further comprises the processor scaling the reward of each of the plurality of evaluation functions such that the reward scale of a higher-priority evaluation function is smaller than the reward scale of a lower-priority evaluation function.

The method according to claim 6 or 7, wherein
the evaluation value is a value based on a reward from the environment, and
the method further comprises the processor converting the reward scale of each of the plurality of evaluation functions into a common scale.