CN110633802A - Policy search device, method, and recording medium - Google Patents

Policy search device, method, and recording medium

Info

Publication number
CN110633802A
Authority
CN
China
Prior art keywords
index
cost function
policy
action
state
Prior art date
Legal status
Pending
Application number
CN201910388236.XA
Other languages
Chinese (zh)
Inventor
寺本矢绘美
梁宇新
间濑正启
鲸井俊宏
Current Assignee
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date
Filing date
Publication date
Application filed by Hitachi Ltd
Priority to CN202110147006.1A (CN112966806A)
Publication of CN110633802A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N20/00 Machine learning
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a policy search device, a policy search method, and a recording medium that search for a preferable policy according to the situation in an environment where various indexes exist. A scenario is executed in which the following series of processing is repeated a plurality of times: an action is selected based on a cost function representing the value of actions with respect to the state of the target environment, the selected action is applied to simulate a state transition of the target environment, the state of the target environment after the transition and a reward, represented by the 1st index, corresponding to the applied action are acquired, and the cost function is updated based on the state and the reward. Scenarios in which the 2nd index satisfies a specified condition are stored; the cost function is improved based on the stored scenarios; the series of processes from execution of a scenario to improvement of the cost function is repeated until a predetermined termination condition is satisfied; and a policy determined based on the obtained cost function is presented.

Description

Policy search device, method, and recording medium
Technical Field
The present invention relates to a technique for searching for an effective policy according to a situation.
Background
In various fields, techniques that use machine learning to search for and present an effective policy according to the situation are attracting attention. Patent documents 1 to 4 disclose techniques for learning, by reinforcement learning, an effective strategy for improving an index to be improved (hereinafter also referred to as a "KPI"). KPI stands for Key Performance Indicator.
The technique disclosed in patent document 1 relates to a method that uses a set of pairs of already experienced events and actions as the environment model during reinforcement learning, thereby reducing the computational cost of reinforcement learning.
The technique disclosed in patent document 2 relates to a method for learning the weights of a neural network accurately and at low cost, even when there are many input variables, in the neural-network approximation of the cost function used in reinforcement learning.
The technique disclosed in patent document 3 relates to a method in which, in a system that presents information to assist a driver while driving an automobile, an action guide for good driving is created by reinforcement learning, and the actions that can be taken are restricted according to changes in the surrounding environment so that reinforcement learning proceeds efficiently.
The technique disclosed in patent document 4 relates to a method for efficiently learning a robot control method in reinforcement learning by narrowing down the candidates for the next action using correlation analysis.
Patent document 1: japanese patent laid-open publication No. 2010-73200
Patent document 2: japanese patent laid-open publication No. 2009-64216
Patent document 3: japanese patent laid-open publication No. 2004-348394
Patent document 4: japanese patent laid-open publication No. 2018-24036
Mechanisms have been proposed that support a person's decision-making by presenting an effective action matching the situation, using techniques such as optimal-solution search and prediction. In an optimal-solution search, the value representing optimality must basically be limited to a single value. In practice, however, a plurality of KPIs must be considered, and which KPI is regarded as important often varies from person to person. There has been no method for searching for an effective action that corresponds to a plurality of KPIs whose relative importance differs for each user.
Disclosure of Invention
The purpose of the present invention is to provide a technique for searching for an appropriate policy in accordance with the situation in an environment where various indexes exist.
A policy search device according to one aspect of the present invention is a policy search device that searches for a policy in a predetermined target environment, and includes: an input/output unit that receives inputs of a 1st index to be improved and a 2nd index different from the 1st index; a simulation processing unit that applies an action to the target environment to simulate a state transition of the target environment and calculates, as the simulation result, the state of the target environment after the transition and a reward, represented by the 1st index, corresponding to the applied action; and a policy search processing unit that executes a scenario in which the following series of processing is repeated a plurality of times: selecting an action based on a cost function indicating the value of actions with respect to the state of the target environment, applying the selected action to cause the simulation processing unit to simulate a state transition of the target environment, acquiring the state of the target environment after the transition and the reward, represented by the 1st index, corresponding to the applied action, and updating the cost function based on the state and the reward. The policy search processing unit stores scenarios in which the 2nd index satisfies a predetermined condition, improves the cost function based on the stored scenarios, repeats the series of processes from execution of a scenario to improvement of the cost function until a predetermined termination condition is satisfied, and presents a policy determined based on the obtained cost function.
Effects of the invention
According to one aspect of the present invention, the 1st index to be improved and the 2nd index different from the 1st index are specified, and the policy search is performed while giving importance to the 2nd index in the learning of the cost function, so that a preferable policy can be selected depending on the situation in an environment where various indexes exist.
Drawings
Fig. 1 is a block diagram of an effective policy presentation apparatus.
Fig. 2 is a processing configuration diagram of the effective policy presentation apparatus.
Fig. 3 is a flowchart of the reinforcement learning process.
Fig. 4 is a flowchart of a KPI management process.
Fig. 5 is a flowchart of the scenario end processing.
Fig. 6 is a flowchart of the KPI compatibility determination process.
Fig. 7 is a flowchart of the simulation process.
Fig. 8 is a diagram showing a user input screen.
Fig. 9 is a diagram showing an effective policy presentation screen.
Fig. 10 is a diagram showing an example of the cost function data stored in the cost function database.
Fig. 11 is a diagram showing an example of simulation results stored in the simulation result database.
Fig. 12 is a flowchart of the learning result utilization process.
Description of the reference numerals
10 … effective policy presentation device; 20 … terminal device; 80 … user input screen; 90 … effective policy presentation screen; 101 … CPU; 102 … memory; 103 … communication device; 104 … program storage device; 105 … data storage device; 106 … policy search module; 107 … simulation module; 108 … data input/output module; 110 … reinforcement learning program; 111 … KPI management program; 112 … simulation result selection program; 113 … reward calculation function group; 114 … KPI compatibility determination program; 115 … simulation program; 116 … simulation result database; 117 … cost function database; 201 … policy search process; 202 … simulation process; 203 … data input/output process; 801 … KPI column; 802 … top-priority selection field; 803 … important selection field; 804 … planning button; 901 … policy; 902 … policy; 1001 … simulation data.
Detailed Description
An embodiment of the effective policy presentation device will be described with reference to the drawings. The effective policy presentation device searches for an effective policy in an environment where various indexes exist, searches for a preferable policy according to the situation, such as the user's preferences, and presents the policy to the user. A policy is an action that a user should take in order to improve the target environment. An action is a behavior associated with the target environment that can cause a state transition of the target environment. When the state of the target environment changes, the value of some index representing the target environment changes.
Fig. 1 is a block diagram of an effective policy presentation apparatus. Fig. 2 is a processing configuration diagram of the effective policy presentation apparatus.
Referring to fig. 1, the effective policy presentation device includes a CPU (Central Processing Unit) 101, a memory 102, a communication device 103, a program storage device 104, and a data storage device 105.
The program storage device 104 is a device that stores data so that it can be written and read, and stores a policy search module 106, a simulation module 107, and a data input/output module 108. The policy search module 106, the simulation module 107, and the data input/output module 108 are each software modules. Each software module consists of one or more software programs and is a software component that realizes a coherent set of functions.
The configuration of the software modules, and of the software programs within them, shown in the present embodiment is an example. The software modules and software programs may be designed to share functions in any way within the device, as long as the device as a whole provides the desired functions.
The policy search module 106 is a software module that executes the policy search process 201 shown in fig. 2, and includes a reinforcement learning program 110, a KPI management program 111, a simulation result selection program 112, and a reward calculation function group 113. The reinforcement learning program 110, the KPI management program 111, the simulation result selection program 112, and the reward calculation function group 113 are each software programs. The processing of these software modules and software programs will be described later.
The simulation module 107 is a software module that executes the simulation process 202 shown in fig. 2, including the simulation program 115 as a software program. The processing of the simulation module 107 and the simulation program 115 will be described later.
The data storage device 105 is a device that stores data so that it can be written and read, and stores a simulation result database 116 and a cost function database 117.
Note that, although an example in which the program storage device 104 and the data storage device 105 are independent devices is described here, the present invention is not limited to this configuration. The program storage device 104 and the data storage device 105 may be realized as a single device.
The CPU 101 is a processor that executes each piece of software stored in the program storage device 104, using the memory 102 as main memory and work area, while reading data stored in the data storage device 105 and writing intermediate data and operation results to the data storage device 105.
The communication device 103 transmits information processed by the CPU 101 via a communication network that may be wired, wireless, or both, and transfers information received via the communication network to the CPU 101. This enables the effective policy presentation device 10 to be used from an external terminal, for example.
As shown in fig. 2, when the user specifies one top-priority index (hereinafter also referred to as the "top-priority KPI") and one or more important indexes other than the top-priority KPI (hereinafter also referred to as "important KPIs"), the effective policy presentation device 10 coordinates the policy search process 201 with the simulation process 202 to search for and present a policy that improves the top-priority KPI while taking the important KPIs into account. In this way, a preferable policy corresponding to the situation can be found even where various indexes exist.
The simulation process 202 is a process performed by the simulation module 107. In the simulation process 202, the CPU 101 applies an action to the target environment to simulate a state transition of the target environment, and calculates, as the simulation result, the state of the target environment after the transition and a reward, represented by the top-priority KPI (1st index), corresponding to the applied action.
The policy search process 201 is a process executed by the policy search module 106 and uses a general reinforcement learning method. In this specification, processing that uses the reinforcement learning method called DQN (Deep Q-Network) is explained. In DQN, the cost function is constructed by a neural network that takes a numerical vector representing the state of the target environment as input and outputs the value of each action in that state (also referred to as the "Q value"). In the following description, this neural network of the cost function is referred to as the DQN. In the policy search process 201, the CPU 101 executes a scenario (also called an episode) in which the following series of processing is repeated a plurality of times: an action is selected based on the DQN, which indicates the value of actions with respect to the state of the target environment; the selected action is applied and the state transition of the target environment is simulated by the simulation process 202; the state of the target environment after the transition and the reward corresponding to the applied action are acquired; and the DQN is updated based on the state and the reward. Further, the CPU 101 stores the data of scenarios in which the important KPI (2nd index) satisfies a predetermined condition in the simulation result database 116, and improves the DQN based on the scenarios stored up to that point. The DQN of the learning result is saved in the cost function database 117. The CPU 101 repeats the series of processes from execution of a scenario to improvement of the DQN until a predetermined termination condition is satisfied, and presents a policy determined based on the obtained DQN.
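As a minimal illustration of such a cost function, the following sketch defines a DQN-style network that takes a numeric state vector and outputs one Q value per action. This is not the patent's implementation: it assumes PyTorch is available, and the class name `QNetwork` and the layer sizes are arbitrary choices for the example.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q value per candidate action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),   # one output per action
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

q = QNetwork(state_dim=4, n_actions=3)
state = torch.tensor([[0.1, 0.0, 0.5, 0.2]])
print(q(state))             # Q values of the 3 actions in this state
print(q(state).argmax(1))   # index of the highest-value action (greedy choice)
```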
In fig. 2, the data input/output process 203 is a process executed by the data input/output module 108 and exchanges data between the terminal device 20 operated by the user and the effective policy presentation device 10. For example, in the data input/output process 203, the CPU 101 receives input of the data used to simulate the target environment and passes it to the simulation module 107, which performs the simulation process 202. The CPU 101 also receives the designation of the top-priority KPI and the important KPI and passes them to the policy search module 106.
As described above, according to the present embodiment, the 2nd index, different from the 1st index to be improved, is specified, and the policy search is performed with importance placed on the 2nd index during learning of the cost function. Therefore, by specifying the 1st index and the 2nd index, a preferable policy depending on the situation can be found in an environment where various indexes exist.
The processing of each software module and software program will be described below.
Fig. 3 is a flowchart of the reinforcement learning process. The reinforcement learning process is a process performed by the reinforcement learning program 110.
Referring to fig. 3, the CPU101 initializes DQN (cost function) (step S301). The cost function represented by DQN is characterized by the parameter Θ. The DQN initialization is a process of setting the parameter Θ to a predetermined default value.
Next, the CPU 101 sets the data used to simulate the target environment to its initial state (step S302).
Next, the CPU 101 selects an action (action a) to be applied to the simulation, as one time step (hereinafter also simply referred to as a "step") of the scenario (step S303). For example, the CPU 101 selects, based on the DQN, the action with the highest Q value, or occasionally another action whose Q value is at or above a certain value, so as to allow trial and error.
Next, the CPU 101 applies the selected action to transition the state of the target environment, and calculates the next state s and the reward r corresponding to the action (step S304). This advances the simulation by one step.
Next, the CPU 101 updates the DQN based on the state s and the reward r (step S305). The DQN is updated by adjusting the parameter Θ so that the Q value of an action a that obtains a higher reward r is increased.
Next, the CPU 101 determines whether the end of the scenario has been reached (step S306). For example, the end of the scenario may be determined when the value of the top-priority KPI reaches the target value or when a predetermined number of steps have been performed. If the end has not been reached, the CPU 101 returns to step S303 to select the next action to apply.
If the end of the scenario has been reached, the CPU 101 then executes the scenario end processing (step S307). The scenario end processing stores the series of simulation results of scenarios that satisfy a predetermined condition in a database. Details of the scenario end processing will be described later.
Next, the CPU 101 determines whether or not the end condition of the reinforcement learning process is satisfied (step S308). For example, the end may be determined when an upper limit on the number of scenario executions or step executions is reached. If the end condition is not satisfied, the CPU 101 returns to step S302, returns the state of the target environment to the initial state, and starts the next scenario. If the end condition is satisfied, the CPU 101 ends the reinforcement learning process.
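The following is a minimal Python sketch of the loop of steps S301 to S306 and S308, written as an assumption-laden illustration rather than the patent's implementation: the toy environment in `simulate_step`, the exploration rate, and a simple Q-table standing in for the neural-network DQN are all invented for the example, and the scenario end processing of step S307 is omitted here (it is sketched separately later).

```python
import random
import numpy as np

N_STATES, N_ACTIONS = 5, 3

def simulate_step(state, action):
    """Toy stand-in for the simulation module: returns (next_state, reward)."""
    next_state = (state + action) % N_STATES
    reward = 1.0 if next_state == N_STATES - 1 else 0.0  # reward from the top-priority KPI
    return next_state, reward

theta = np.zeros((N_STATES, N_ACTIONS))       # parameter Θ of the cost function (S301)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1         # learning rate, discount, exploration rate

def select_action(state):
    """Step S303: mostly greedy on the Q value, with occasional trial and error."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(theta[state]))

def update_q(s, a, r, s_next):
    """Step S305: raise the Q value of actions that obtained a higher reward."""
    target = r + GAMMA * np.max(theta[s_next])
    theta[s, a] += ALPHA * (target - theta[s, a])

def run_scenario(max_steps=20):
    """One scenario: steps S302 to S306."""
    state = 0                                 # S302: initial state
    transitions = []
    for t in range(max_steps):
        a = select_action(state)              # S303
        s_next, r = simulate_step(state, a)   # S304
        update_q(state, a, r, s_next)         # S305
        transitions.append((t, state, a, r, s_next))
        state = s_next
        if r > 0:                             # S306: end of scenario reached
            break
    return transitions

for _ in range(200):                          # S308: repeat until the end condition
    run_scenario()

print(np.argmax(theta, axis=1))               # greedy policy derived from Θ
```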
Fig. 4 is a flowchart of the KPI management process. The KPI management process is executed by the KPI management program 111; it carries out a policy search by DQN-based reinforcement learning according to the top-priority KPI and the important KPI input by the user, and records the learning result.
Referring to fig. 4, the CPU 101 first obtains the top-priority KPI and the important KPI input by the user from the data input/output module 108 (step S401).
Next, the CPU 101 acquires the reward calculation function corresponding to the top-priority KPI (step S402). It suffices to set in advance, for each top-priority KPI (1st index), a reward calculation function that calculates the reward from that KPI, store the data of these functions as the reward calculation function group 113, and select the reward calculation function corresponding to the top-priority KPI whose input was accepted by the data input/output module 108. Because a reward calculation function for calculating the reward according to the 1st index is set in advance, the reward calculation method can be determined easily once the 1st index is determined.
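A hypothetical sketch of such a reward calculation function group is shown below: one reward function is registered per top-priority KPI and looked up by its KPI ID. The KPI names, IDs, and function names are assumptions that loosely mirror the example KPIs of fig. 8, not definitions from the patent.

```python
from typing import Callable, Dict

RewardFn = Callable[[dict], float]   # maps simulated KPI values to a reward

# Reward calculation function group (113): one entry per top-priority KPI ID.
reward_functions: Dict[int, RewardFn] = {
    1: lambda kpis: -kpis["asset_outages"],       # KPI ID 1: fewer outages, higher reward
    2: lambda kpis: -kpis["maintenance_count"],   # KPI ID 2: fewer maintenance actions
}

def get_reward_function(top_priority_kpi_id: int) -> RewardFn:
    """Step S402: select the reward function that matches the specified KPI."""
    return reward_functions[top_priority_kpi_id]

print(get_reward_function(1)({"asset_outages": 3, "maintenance_count": 5}))  # -3
```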
Next, the CPU 101 specifies the selected reward calculation function, the top-priority KPI, and the important KPI, and causes the reinforcement learning program 110 to execute the reinforcement learning process (step S403). The DQN is obtained from the reinforcement learning program 110 as the learning result.
Next, the CPU 101 associates the parameters of the DQN of the learning result with the top-priority KPI and the important KPI, and stores them as cost function data in the cost function database 117 (step S404). Fig. 10 is a diagram showing an example of the cost function data stored in the cost function database. Referring to fig. 10, the cost function data records, in association with each other, the top-priority KPI ID (identification information of the top-priority KPI), the important KPI ID (identification information of the important KPI), and the cost function parameter characterizing the cost function. For example, the parameter characterizing the DQN obtained by reinforcement learning with the top-priority KPI of top-priority KPI ID 1 and the important KPI of important KPI ID 3 is Θ1, and the parameter characterizing the DQN obtained by reinforcement learning with the top-priority KPI of top-priority KPI ID 1 and the important KPI of important KPI ID 5 is Θ2.
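A sketch of one such record is shown below. The field names are illustrative and the parameter values are placeholders; only the association of the two KPI IDs with a parameter set follows fig. 10.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CostFunctionRecord:
    top_priority_kpi_id: int     # e.g. 1
    important_kpi_id: int        # e.g. 3
    parameters: List[float]      # the parameter Θ characterizing the learned DQN

record = CostFunctionRecord(top_priority_kpi_id=1, important_kpi_id=3,
                            parameters=[0.12, -0.53, 0.88])
```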
Fig. 5 is a flowchart of the scenario end processing. The scenario end processing is executed by the simulation result selection program 112 and corresponds to step S307 in fig. 3.
Referring to fig. 5, the CPU 101 first acquires the data of the simulation result of the scenario that has reached its end (step S501). The final values of each index of the scenario can be obtained from this data. Next, the CPU 101 acquires the value of the important KPI from the simulation result data and evaluates whether or not the value satisfies a predetermined condition (step S502).
Next, if the important KPI satisfies the predetermined condition, the CPU 101 saves the simulation result of the scenario in the simulation result database 116 (step S503). For example, the condition may be regarded as satisfied if the important KPI exceeds a threshold. Because a scenario reaches its end when the top-priority KPI has reached the target value or its improvement has converged, the important KPI is used here for the evaluation. A stricter condition on the top-priority KPI may also be used when selecting scenarios.
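The following is a minimal sketch of steps S501 to S503 under the assumption that the "predetermined condition" is a simple threshold on the important KPI; a plain list stands in for the simulation result database 116, and the dictionary keys are invented for the example.

```python
# Stand-in for the simulation result database 116.
simulation_result_db = []

def scenario_end_processing(scenario_result, important_kpi_threshold=0.8):
    """Store the scenario's transitions only if its important KPI is good enough."""
    important_kpi_value = scenario_result["important_kpi"]            # S502
    if important_kpi_value >= important_kpi_threshold:                # predetermined condition
        simulation_result_db.append(scenario_result["transitions"])   # S503

scenario_end_processing({"important_kpi": 0.9,
                         "transitions": [(1, "s1", "a3", 1.0, "s2")]})
print(len(simulation_result_db))  # -> 1
```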
Fig. 11 is a diagram showing an example of the simulation results stored in the simulation result database. Referring to fig. 11, in the simulation data 1001, a scenario ID (identification information of the scenario), a time step (identification information of each step), a prior state s indicating the state before the action of the step, an action a indicating the action applied in the step, a reward r indicating the reward corresponding to the action, and a posterior state s' indicating the state after the action of the step are associated with each other and recorded as one entry (one row in fig. 11). One scenario contains multiple entries, one per step.
For example, the first entry indicates that, in the step at time step 1 of the scenario with scenario ID 1, the target environment transitions from the prior state s1, by applying action a3, to the posterior state s2, and that the reward for the action is r1. The next entry indicates that, in the step at time step 2 of the scenario with scenario ID 1, the target environment transitions from the prior state s2, by applying action a12, to the posterior state s3, and that the reward for the action is r2.
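Expressed as a data structure, one entry of fig. 11 could look like the sketch below. The field names are illustrative, not taken from the patent; only the combination of scenario ID, time step, prior state, action, reward, and posterior state follows the figure.

```python
from dataclasses import dataclass

@dataclass
class TransitionEntry:
    scenario_id: int
    time_step: int
    prior_state: str      # state s before the action
    action: str           # action a applied in this step
    reward: float         # reward r for the action
    posterior_state: str  # state s' after the action

entry = TransitionEntry(scenario_id=1, time_step=1,
                        prior_state="s1", action="a3",
                        reward=1.0, posterior_state="s2")
```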
In the present embodiment, the KPI compatibility determination program 114 executes the KPI compatibility determination process in parallel with the reinforcement learning process executed by the reinforcement learning program 110. Fig. 6 is a flowchart of the KPI compatibility determination process.
Referring to fig. 6, the CPU 101 first acquires the simulation results of the scenarios executed so far (step S601). Because the reinforcement learning process runs in parallel, the number of simulation results obtained in step S601 increases as the reinforcement learning progresses. The KPI compatibility determination program 114 may use not only the simulation results of scenarios that satisfied the predetermined condition in the scenario end processing but also the simulation results of scenarios that did not.
Next, the CPU 101 calculates the top-priority KPI and the important KPI of the acquired simulation results, and stores the data of each combination of the top-priority KPI and the important KPI (step S602). If the top-priority KPI and the important KPI have already been calculated, their values are simply taken as they are.
Next, the CPU 101 calculates the correlation coefficient between the top-priority KPI and the important KPI using the stored data (step S603). As described above, the number of simulation results increases as the reinforcement learning progresses, so the correlation between the top-priority KPI and the important KPI becomes increasingly meaningful.
Next, the CPU 101 determines whether the calculated correlation coefficient is negative (step S604). If the correlation coefficient is negative, the CPU 101 outputs an alarm warning that the top-priority KPI and the important KPI are indexes with mutually opposing characteristics (step S605). Mutually opposing characteristics means a relationship in which improving one index worsens the other. Because the chosen top-priority KPI and important KPI may not be appropriate in this case, presenting this message to the user gives an opportunity to reconsider the KPIs.
In this way, the KPI compatibility determination program 114 of the policy search module 106 calculates the correlation coefficient between the top-priority KPI and the important KPI over the plurality of repeatedly executed scenarios, and presents an alarm if the correlation coefficient is negative. If, as learning progresses, it becomes apparent that increasing the top-priority KPI degrades the important KPI, a warning is presented, which can prompt reconsideration of a specification that combines mutually incompatible indexes.
The KPI compatibility determination program 114 may calculate the correlation coefficient in parallel with the reinforcement learning performed through the execution of scenarios by the reinforcement learning program 110, and may end the reinforcement learning when the correlation coefficient is determined to be negative. This reduces wasted learning when a mutually incompatible combination of the top-priority KPI and the important KPI has been specified.
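A minimal sketch of this compatibility check, assuming the per-scenario KPI values are collected into two lists and Pearson correlation is used, is given below. The function name and the early-stop return value are assumptions for illustration.

```python
import numpy as np

def check_kpi_compatibility(top_priority_values, important_values):
    """Return True if learning may continue, False if the KPIs appear incompatible."""
    if len(top_priority_values) < 2:
        return True                                    # not enough scenarios yet
    corr = np.corrcoef(top_priority_values, important_values)[0, 1]   # S603
    if corr < 0:                                       # S604
        print("Warning: the top-priority KPI and the important KPI "
              "appear to have mutually opposing characteristics.")    # S605
        return False                                   # caller may stop reinforcement learning
    return True

# Example: the important KPI drops as the top-priority KPI improves, so a warning is issued.
print(check_kpi_compatibility([1, 2, 3, 4], [9, 7, 6, 2]))  # -> False
```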
Fig. 7 is a flowchart of the simulation process. The simulation process is a process performed by the simulation program 115 of the simulation module 107. The simulation module 107 executes simulation processing in accordance with an instruction from the reinforcement learning program 110.
Referring to fig. 7, the CPU 101 first receives as input the action (action a) selected by the reinforcement learning program 110 (step S701). Next, the CPU 101 simulates one step of the state transition of the target environment by applying the input action (step S702). Next, the CPU 101 outputs the state s of the target environment after executing the one-step simulation and the reward r corresponding to the applied action (step S703). The state s and the reward r output here are supplied to the reinforcement learning program 110.
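The interface of steps S701 to S703 can be sketched as a small class: an action goes in, and the next state and the reward derived from the top-priority KPI come out. The dynamics, action names, and KPI counters below are invented purely for illustration and only echo the example KPIs of fig. 8.

```python
class ToySimulator:
    """Toy stand-in for the simulation program 115."""
    def __init__(self):
        self.state = {"asset_outages": 0, "maintenance_count": 0}

    def step(self, action: str):
        if action == "do_maintenance":            # hypothetical action
            self.state["maintenance_count"] += 1
        else:                                     # "do_nothing": an outage may occur
            self.state["asset_outages"] += 1
        reward = -self.state["asset_outages"]     # top-priority KPI: number of asset outages
        return dict(self.state), reward           # S703: output state s and reward r

sim = ToySimulator()
print(sim.step("do_maintenance"))  # ({'asset_outages': 0, 'maintenance_count': 1}, 0)
```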
Fig. 8 is a diagram showing the user input screen. The user input screen 80 is a screen on which the user specifies the top-priority KPI and the important KPI and starts a policy search.
On the user input screen 80, a list of the indexes of the target environment is displayed in the KPI column 801. The user input screen 80 also displays a top-priority selection field 802 for specifying the top-priority KPI, an important selection field 803 for specifying important KPIs, and a planning button 804 for starting a policy search.
The top-priority selection field 802 contains a selection field for each KPI that can be designated as the top-priority KPI. The important selection field 803 contains selection fields for the KPIs that can be designated as important KPIs. In the example of fig. 8, the index "number of asset outages" is selected as the top-priority KPI, and the indexes "maintenance count" and "replacement part cost" are selected as important KPIs. When the planning button 804 is operated in this selected state, the effective policy presentation device 10 executes a policy search in which the top-priority KPI is the "number of asset outages" and the important KPIs are the "maintenance count" and the "replacement part cost".
Fig. 9 is a diagram showing the effective policy presentation screen. The effective policy presentation screen 90 is a screen for presenting the result of the policy search to the user. Policies 901 and 902 are presented on the effective policy presentation screen 90 as the result of the policy search. Fig. 9 shows an example of the effective policy presentation screen 90 displayed when the planning button 804 is operated from the selected state of fig. 8.
In this embodiment, a policy may also be searched for by assigning weights to a plurality of important KPIs. When reinforcement learning is advanced so as to keep scenarios in which all of the plurality of important KPIs satisfy the predetermined condition, it suffices to preferentially select scenarios in which an important KPI with a larger weight is maintained at a better value than an important KPI with a smaller weight.
In the example of fig. 8, the two important KPIs "maintenance count" and "replacement part cost" are specified. In the example of fig. 9, a policy 901 obtained by giving a large weight to the "maintenance count" and a policy 902 obtained by giving a large weight to the "replacement part cost" are displayed.
The policy 901 is an example policy for the case where the maintenance count is regarded as important. In the radar chart of policy 901, the replacement part cost is relatively high. This corresponds to a policy that suppresses the number of asset outages without increasing the maintenance count, by using expensive long-life replacement parts. A user who wants to keep the number of asset outages low without greatly increasing the maintenance count should adopt policy 901.
The policy 902 is an example policy for the case where the replacement part cost is regarded as important. In the radar chart of policy 902, the maintenance count is relatively high. This corresponds to a policy that suppresses the number of asset outages by increasing the maintenance count instead of using expensive replacement parts. A user who wants to keep the number of asset outages low without greatly increasing the replacement part cost should adopt policy 902.
In the present embodiment, the effective policy presentation device 10 is assumed to present, for a selection of a plurality of important KPIs, a policy that is preferable for each of them, but other configurations are possible. For example, the user may specify weights for the plurality of important KPIs, and a preferred policy may be presented based on that specification. In this case, the data input/output module 108 also accepts input of the weights of the plurality of important KPIs. The policy search module 106 may then advance learning by preferentially selecting scenarios in which the important KPIs with larger weights are maintained at good values over those with smaller weights, as in the sketch below. The user can thus weight a plurality of important KPIs and search for a policy that better matches their preferences.
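A minimal sketch of such weighted selection follows. It assumes the important KPI values have been normalized so that larger is better, and the KPI names, weights, and example values are hypothetical.

```python
def weighted_kpi_score(kpi_values: dict, weights: dict) -> float:
    """Weighted sum of normalized important-KPI values (larger is better)."""
    return sum(weights[name] * value for name, value in kpi_values.items())

candidates = [
    {"maintenance_count": 0.9, "replacement_part_cost": 0.4},   # scenario A
    {"maintenance_count": 0.5, "replacement_part_cost": 0.8},   # scenario B
]
weights = {"maintenance_count": 0.7, "replacement_part_cost": 0.3}

best = max(candidates, key=lambda kpis: weighted_kpi_score(kpis, weights))
print(best)   # scenario A wins because "maintenance_count" carries the larger weight
```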
In the present embodiment, it is assumed that when the user specifies the top-priority KPI and the important KPI and executes a policy search, the parameter Θ of the cost function is initialized to a predetermined default value before processing starts, but other configurations are possible. If the top-priority KPI whose input was received by the data input/output module 108 was an important KPI in a past policy search, the policy search module 106 may use the cost function obtained in that past policy search as the initial value of the cost function in the current policy search. By using the learning result of a past policy search as the initial value, a reduction in the time required to learn the cost function can be expected.
Fig. 12 is a flowchart of the learning result utilization process. The learning result utilization process is executed, as a modified example, by the reinforcement learning program 110 in place of step S301 of the reinforcement learning process.
Referring to fig. 12, the CPU 101 first determines whether the top-priority KPI is a KPI that was set as an important KPI in the reinforcement learning of a policy search performed in the past (step S121). If the top-priority KPI is such a past important KPI, the CPU 101 sets the parameter Θ of the cost function obtained by the reinforcement learning that used it as an important KPI as the initial value of the current cost function (step S122). If the top-priority KPI is not a past important KPI, the CPU 101 sets a predetermined default value as the initial value of the current cost function (step S123).
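A sketch of this warm-start selection is shown below. The record schema reuses the illustrative field names assumed earlier for fig. 10, and the default parameter vector is a placeholder.

```python
DEFAULT_THETA = [0.0, 0.0, 0.0]   # predetermined default value of Θ

def initial_theta(top_priority_kpi_id, cost_function_db):
    """cost_function_db: list of fig. 10-style records (illustrative schema)."""
    for record in cost_function_db:
        if record["important_kpi_id"] == top_priority_kpi_id:   # S121
            return record["parameters"]                         # S122: reuse past learning
    return DEFAULT_THETA                                         # S123: default Θ

db = [{"top_priority_kpi_id": 1, "important_kpi_id": 3, "parameters": [0.2, -0.1, 0.5]}]
print(initial_theta(3, db))   # -> [0.2, -0.1, 0.5]
print(initial_theta(7, db))   # -> [0.0, 0.0, 0.0]
```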
The embodiments of the present invention described above are illustrative for the description of the present invention, and the scope of the present invention is not limited to these embodiments. Those skilled in the art can implement the present invention in other various forms without departing from the scope of the present invention.

Claims (8)

1. A policy search device for searching for a policy in a predetermined target environment, comprising:
an input/output unit that receives inputs of a 1st index to be improved and a 2nd index different from the 1st index;
a simulation processing unit that applies an action to the target environment to simulate a state transition of the target environment, and calculates, as a simulation result, the state of the target environment after the transition and a reward, represented by the 1st index, corresponding to the applied action; and
a policy search processing unit that executes a scenario in which the following series of processing is repeated a plurality of times: selecting an action based on a cost function indicating the value of the action with respect to the state of the target environment, applying the selected action to cause the simulation processing unit to simulate a state transition of the target environment, acquiring the state of the target environment after the transition and the reward, represented by the 1st index, corresponding to the applied action, and updating the cost function based on the state and the reward; wherein the policy search processing unit stores scenarios in which the 2nd index satisfies a predetermined condition, improves the cost function based on the stored scenarios, repeats the series of processes from execution of a scenario to improvement of the cost function until a predetermined termination condition is satisfied, and presents a policy determined based on the obtained cost function.
2. The policy search device of claim 1,
wherein the policy search processing unit sets in advance a reward calculation function for calculating the reward according to the 1st index, and selects the reward calculation function based on the 1st index for which the input/output unit has accepted input.
3. The policy search device of claim 1,
wherein the input/output unit further receives weights for a plurality of the 2nd indexes; and
the policy search processing unit selects and stores scenarios such that a 2nd index having a larger weight is preferentially maintained at a favorable value over a 2nd index having a smaller weight.
4. The policy search device of claim 1,
wherein the policy search processing unit calculates a correlation coefficient between the 1st index and the 2nd index over a plurality of repeatedly executed scenarios, and presents a warning when the correlation coefficient is negative.
5. The policy search device of claim 4,
wherein the policy search processing unit calculates the correlation coefficient in parallel with the reinforcement learning performed through the execution of scenarios, and ends the reinforcement learning when the correlation coefficient is determined to be negative.
6. The policy search device of claim 1,
wherein the policy search processing unit uses, as the initial value of the cost function in the current policy search, a cost function obtained in a past policy search when the 1st index whose input was received by the input/output unit is an index that was the 2nd index in that past policy search.
7. A policy search method for searching for a policy in a predetermined target environment, wherein
a computer executes the following processing:
receiving inputs of a 1st index to be improved and a 2nd index different from the 1st index;
executing a scenario in which the following series of processing is repeated a plurality of times: selecting an action based on a cost function indicating the value of the action with respect to the state of the target environment, simulating a state transition of the target environment by applying the selected action, acquiring, as a simulation result, the state of the target environment after the transition and a reward, represented by the 1st index, corresponding to the applied action, and updating the cost function based on the state and the reward;
storing scenarios in which the 2nd index satisfies a predetermined condition;
improving the cost function based on the stored scenarios;
repeating the series of processes from the execution of a scenario to the improvement of the cost function until a predetermined termination condition is satisfied; and
presenting a policy determined based on the obtained cost function.
8. A recording medium having recorded thereon a policy search program for searching for a policy in a predetermined target environment, wherein
the policy search program causes a computer to execute:
receiving inputs of a 1st index to be improved and a 2nd index different from the 1st index;
executing a scenario in which the following series of processing is repeated a plurality of times: selecting an action based on a cost function indicating the value of the action with respect to the state of the target environment, simulating a state transition of the target environment by applying the selected action, acquiring, as a simulation result, the state of the target environment after the transition and a reward, represented by the 1st index, corresponding to the applied action, and updating the cost function based on the state and the reward;
storing scenarios in which the 2nd index satisfies a predetermined condition;
improving the cost function based on the stored scenarios;
repeating the series of processes from the execution of a scenario to the improvement of the cost function until a predetermined termination condition is satisfied; and
presenting a policy determined based on the obtained cost function.
CN201910388236.XA 2018-06-21 2019-05-10 Policy search device, method, and recording medium Pending CN110633802A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110147006.1A CN112966806A (en) 2018-06-21 2019-05-10 Processing device, processing method, and recording medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2018-117639 2018-06-21
JP2018117639A JP7160574B2 (en) 2018-06-21 2018-06-21 Processor, method and program

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202110147006.1A Division CN112966806A (en) 2018-06-21 2019-05-10 Processing device, processing method, and recording medium

Publications (1)

Publication Number Publication Date
CN110633802A true CN110633802A (en) 2019-12-31

Family

ID=68968563

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201910388236.XA Pending CN110633802A (en) 2018-06-21 2019-05-10 Policy search device, method, and recording medium
CN202110147006.1A Pending CN112966806A (en) 2018-06-21 2019-05-10 Processing device, processing method, and recording medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202110147006.1A Pending CN112966806A (en) 2018-06-21 2019-05-10 Processing device, processing method, and recording medium

Country Status (2)

Country Link
JP (1) JP7160574B2 (en)
CN (2) CN110633802A (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7466479B2 (en) 2021-02-22 2024-04-12 株式会社日立製作所 Business improvement support device, program, and storage medium storing the program
KR102346900B1 (en) * 2021-08-05 2022-01-04 주식회사 애자일소다 Deep reinforcement learning apparatus and method for pick and place system
JP2023068265A (en) * 2021-11-02 2023-05-17 株式会社日立製作所 Work design support system and work design support method
JP7449982B2 (en) 2022-07-05 2024-03-14 株式会社日立製作所 Policy formulation support system, policy formulation support method, and policy formulation support program
JP2024061314A (en) * 2022-10-21 2024-05-07 株式会社日立製作所 Business policy evaluation device and business policy evaluation method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004178492A (en) * 2002-11-29 2004-06-24 Mitsubishi Heavy Ind Ltd Plant simulation method using enhanced learning method
US8626565B2 (en) * 2008-06-30 2014-01-07 Autonomous Solutions, Inc. Vehicle dispatching method and system
JP4975158B2 (en) * 2010-11-08 2012-07-11 本田技研工業株式会社 Plant control equipment
JP6453805B2 (en) * 2016-04-25 2019-01-16 ファナック株式会社 Production system for setting judgment values for variables related to product abnormalities

Also Published As

Publication number Publication date
JP7160574B2 (en) 2022-10-25
CN112966806A (en) 2021-06-15
JP2019219981A (en) 2019-12-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination